# Multi-model fallback and reliability routing

{% hint style="danger" %}
Experimental guide: validate every workflow in staging before production.
{% endhint %}

## The Problem

LLM applications fail in production for reasons that are often outside your code:

* transient provider outages,
* strict per-model rate limits,
* model-specific latency spikes,
* regional instability.

If your app is hard-wired to a single model endpoint, every one of these failures reaches your users directly, and uptime degrades with no recourse.

## The Flashback Pattern

Use one Flashback repository as your stable OpenAI-compatible integration point, then configure multiple AI LLM resources behind it.

Your application keeps one API contract, while your routing layer applies fallback order by model/provider when calls fail or exceed SLOs.

## Prerequisites

* Flashback repository configured for **OpenAI** endpoint type.
* At least two configured AI LLM resources (for example, OpenAI plus an Anthropic-compatible endpoint).
* Repository API key (AI usage).
* Basic request telemetry (latency, failures, model used).

Reference pages:

* [Configure an AI LLM](https://docs.flashback.tech/guides/setup-the-cloud-and-ai-gateway/start-with-cloud-storage/create-a-bucket-1)
* [Build a Repository](https://docs.flashback.tech/guides/setup-the-cloud-and-ai-gateway/start-with-cloud-storage-1)
* [AI LLM APIs](https://docs.flashback.tech/support-reference/platform-api-reference/ai-apis/ai-llms)

## Implementation blueprint

{% stepper %}
{% step %}

#### Define fallback tiers

Example policy:

1. **Tier 1**: high-quality model for normal traffic.
2. **Tier 2**: similar quality but lower latency / alternate provider.
3. **Tier 3**: low-cost baseline for graceful degradation.

Use deterministic rules so behavior is easy to debug.
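A deterministic policy is easiest to audit when it lives in a single data structure. A minimal sketch (the tier table, model names, and timeouts below are illustrative examples, not Flashback defaults):

```python
# tier_policy.py -- illustrative tier table; models and timeouts are examples
TIERS = [
    {"tier": 1, "model": "gpt-4.1",      "timeout_s": 20},  # normal traffic
    {"tier": 2, "model": "gpt-4.1-mini", "timeout_s": 12},  # alternate provider
    {"tier": 3, "model": "gpt-4o-mini",  "timeout_s": 8},   # graceful degradation
]

def next_tier(failed_models):
    """Return the first tier whose model has not failed yet, else None."""
    for t in TIERS:
        if t["model"] not in failed_models:
            return t
    return None
```

Because the table is ordered data rather than branching logic, a debug log only needs to record which models had already failed to explain any routing decision.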
{% endstep %}

{% step %}

#### Configure one client against Flashback

```python
# fb_openai_client.py
from openai import OpenAI
import os

client = OpenAI(
    base_url=os.environ["FB_OPENAI_BASE_URL"],  # e.g. https://openai-us-east-1-aws.flashback.tech/v1
    api_key=os.environ["FB_API_KEY_SECRET"]
)
```

Use environment variables:

```bash
export FB_OPENAI_BASE_URL="https://openai-us-east-1-aws.flashback.tech/v1"
export FB_API_KEY_SECRET="<repo_api_key_secret>"
```

{% endstep %}

{% step %}

#### Add fallback execution in application code

```python
# fallback_completion.py
from openai import AuthenticationError

from fb_openai_client import client

MODEL_PRIORITY = [
    "gpt-4.1",        # Tier 1
    "gpt-4.1-mini",   # Tier 2
    "gpt-4o-mini",    # Tier 3
]

def complete_with_fallback(messages):
    """Try each tier in priority order; return the first successful completion."""
    last_error = None
    for model in MODEL_PRIORITY:
        try:
            res = client.chat.completions.create(
                model=model,
                temperature=0.2,
                messages=messages,
                timeout=20,  # per-request timeout in seconds
            )
            return {
                "model": model,
                "content": res.choices[0].message.content,
                "usage": res.usage,
            }
        except AuthenticationError:
            # A bad key fails on every tier -- surface it immediately.
            raise
        except Exception as e:
            # Transient or model-specific failure: record it and try the next tier.
            last_error = e
    raise RuntimeError(f"All model tiers failed: {last_error}")
```

{% endstep %}

{% step %}

#### Add reliability controls

* Per-tier timeout (e.g., 20s → 12s → 8s).
* Retry with exponential backoff before tier switch.
* Circuit breaker: temporarily remove a failing tier after N consecutive failures.
* Emit structured logs (`request_id`, `tier`, `model`, `latency_ms`, `status`).
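The circuit-breaker rule above can be sketched as a small in-process class; the threshold and cooldown values here are illustrative assumptions, not recommendations:

```python
# circuit_breaker.py -- illustrative; threshold and cooldown values are examples
import time

class TierBreaker:
    """Remove a tier after N consecutive failures; restore it after a cooldown."""

    def __init__(self, threshold=3, cooldown_s=60.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = {}   # model -> consecutive failure count
        self.opened_at = {}  # model -> monotonic time the breaker opened

    def record_success(self, model):
        self.failures[model] = 0
        self.opened_at.pop(model, None)

    def record_failure(self, model):
        self.failures[model] = self.failures.get(model, 0) + 1
        if self.failures[model] >= self.threshold:
            self.opened_at[model] = time.monotonic()

    def is_available(self, model):
        opened = self.opened_at.get(model)
        if opened is None:
            return True
        if time.monotonic() - opened >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and allow a probe request.
            del self.opened_at[model]
            self.failures[model] = 0
            return True
        return False
```

In the fallback loop, call `is_available(model)` before each attempt and `record_success` / `record_failure` after it, so a flapping tier is skipped instead of adding its timeout to every request.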
{% endstep %}

{% step %}

#### Validate in staging

Run synthetic checks every minute:

```bash
curl -sS "$FB_OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $FB_API_KEY_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "model":"gpt-4.1-mini",
    "messages":[{"role":"user","content":"healthcheck"}],
    "max_tokens":8
  }'
```

Track:

* success rate by model,
* p95 latency by model,
* fallback activation rate.
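All three metrics can be derived from the structured logs suggested earlier. A minimal offline sketch (the field names match the log fields proposed above; they are not a Flashback log schema):

```python
# metrics.py -- computes the three staging metrics from structured log records
import math

def p95(values):
    """Nearest-rank 95th percentile."""
    s = sorted(values)
    rank = math.ceil(0.95 * len(s))
    return s[rank - 1]

def summarize(records):
    """records: dicts with 'model', 'tier', 'latency_ms', and 'status' keys."""
    by_model = {}
    for r in records:
        by_model.setdefault(r["model"], []).append(r)
    summary = {}
    for model, rows in by_model.items():
        ok = [r for r in rows if r["status"] == "ok"]
        summary[model] = {
            "success_rate": len(ok) / len(rows),
            "p95_latency_ms": p95([r["latency_ms"] for r in rows]),
        }
    # Any request served by a tier other than 1 means fallback activated.
    fallback_rate = sum(r["tier"] > 1 for r in records) / len(records)
    return summary, fallback_rate
```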
{% endstep %}
{% endstepper %}

## Production checklist

* Keep at least two providers/models available at all times.
* Cap fallback depth to avoid runaway latency.
* Alert when Tier 1 success rate drops below threshold.
* Review routing weekly using usage statistics and error trends.
