
# Multi-model fallback and reliability routing

{% hint style="danger" %}
Experimental guide: validate every workflow in staging before production.
{% endhint %}

## The Problem

LLM applications fail in production for reasons that are often outside your code:

* transient provider outages,
* strict per-model rate limits,
* model-specific latency spikes,
* regional instability.

If your app is hard-wired to a single model endpoint, each of these failures reaches users directly as errors or slow responses.

## The Flashgate Pattern

Use one Flashgate repository as your stable OpenAI-compatible integration point, then configure multiple AI LLM resources behind it.

Your application keeps one API contract, while your routing layer applies fallback order by model/provider when calls fail or exceed SLOs.

## Prerequisites

* Flashgate repository configured for **OpenAI** endpoint type.
* At least two configured AI LLM resources (for example, an OpenAI resource and an Anthropic-compatible endpoint).
* Repository API key (AI usage).
* Basic request telemetry (latency, failures, model used).

Reference pages:

* [Configure an AI LLM](/guides/setup-the-cloud-and-ai-gateway/start-with-cloud-storage/create-a-bucket-1.md)
* [Build a Repository](/guides/setup-the-cloud-and-ai-gateway/start-with-cloud-storage-1.md)
* [AI LLM APIs](/support-reference/platform-api-reference/ai-apis/ai-llms.md)

## Implementation blueprint

{% stepper %}
{% step %}

### Define fallback tiers

Example policy:

1. **Tier 1**: high-quality model for normal traffic.
2. **Tier 2**: similar quality but lower latency / alternate provider.
3. **Tier 3**: low-cost baseline for graceful degradation.

Use deterministic rules so behavior is easy to debug.
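
One way to keep the policy deterministic is to declare the tiers as ordered data instead of scattering model names through the code. The sketch below is illustrative only: the `Tier` class, its fields, and `fallback_order` are hypothetical names, the timeouts are example values, and the model names are placeholders that should match the AI LLM resources configured behind your repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One rung of the fallback ladder (hypothetical structure)."""
    rank: int         # 1 = preferred; higher ranks are tried later
    model: str        # placeholder; use a model configured in your repository
    timeout_s: float  # example per-request timeout for this tier
    purpose: str


TIERS = (
    Tier(1, "gpt-4.1", 20.0, "high quality, normal traffic"),
    Tier(2, "gpt-4.1-mini", 12.0, "similar quality, lower latency"),
    Tier(3, "gpt-4o-mini", 8.0, "low-cost graceful degradation"),
)


def fallback_order(tiers=TIERS):
    """Return model names in strict rank order so routing is reproducible."""
    ranks = [t.rank for t in tiers]
    if len(set(ranks)) != len(ranks):
        raise ValueError("tier ranks must be unique")
    return [t.model for t in sorted(tiers, key=lambda t: t.rank)]
```

Rejecting duplicate ranks means two tiers can never compete for the same position, so a given configuration always produces the same fallback sequence.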
{% endstep %}

{% step %}

### Configure one client against Flashgate

```python
# fb_openai_client.py
from openai import OpenAI
import os

client = OpenAI(
    base_url=os.environ["FB_OPENAI_BASE_URL"],  # e.g. https://openai-us-east-1-aws.flashback.tech/v1
    api_key=os.environ["FB_API_KEY_SECRET"]
)
```

Set the environment variables that the client reads:

```bash
export FB_OPENAI_BASE_URL="https://openai-us-east-1-aws.flashback.tech/v1"
export FB_API_KEY_SECRET="<repo_api_key_secret>"
```

{% endstep %}

{% step %}

### Add fallback execution in application code

```python
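# fallback_completion.py
import openai

from fb_openai_client import client

# Ordered by preference. Names are examples; use the models configured
# behind your repository.
MODEL_PRIORITY = [
    "gpt-4.1",        # Tier 1
    "gpt-4.1-mini",   # Tier 2
    "gpt-4o-mini",    # Tier 3
]

def complete_with_fallback(messages):
    """Try each tier in order and return the first successful completion."""
    last_error = None
    for model in MODEL_PRIORITY:
        try:
            res = client.chat.completions.create(
                model=model,
                temperature=0.2,
                messages=messages,
                timeout=20,  # seconds, applied to this request only
            )
            return {
                "model": model,
                "content": res.choices[0].message.content,
                "usage": res.usage,
            }
        except openai.APIError as e:
            # Timeouts, connection errors, and HTTP error responses derive
            # from APIError; programming errors still propagate.
            last_error = e
    raise RuntimeError(f"All model tiers failed: {last_error}") from last_error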
```

{% endstep %}

{% step %}

### Add reliability controls

* Per-tier timeout (e.g., 20s → 12s → 8s).
* Retry with exponential backoff before tier switch.
* Circuit breaker: temporarily remove a failing tier after N consecutive failures.
* Emit structured logs (`request_id`, `tier`, `model`, `latency_ms`, `status`).
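
The retry, circuit breaker, and per-tier timeout controls above could be combined roughly as follows. This is a minimal sketch, not a reference implementation: `CircuitBreaker`, `call_with_retry`, and `run_tiers` are hypothetical names, the thresholds (3 failures, 30 s cooldown, 0.5 s base delay) are example values, and `call(model, timeout_s)` stands for whatever function in your application wraps `client.chat.completions.create`.

```python
import time


class CircuitBreaker:
    """Skips a tier for `cooldown_s` after `max_failures` consecutive failures."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: reset the counter and allow traffic again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_retry(fn, retries=2, base_delay_s=0.5, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay_s * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay_s * 2 ** attempt)


def run_tiers(tiers, call, breakers, sleep=time.sleep):
    """tiers: list of (model, timeout_s) pairs, in fallback order."""
    last_error = None
    for model, timeout_s in tiers:
        breaker = breakers[model]
        if not breaker.allow():
            continue  # tier is temporarily removed from rotation
        try:
            result = call_with_retry(lambda: call(model, timeout_s), sleep=sleep)
            breaker.record(True)
            return model, result
        except Exception as exc:
            # One failure is recorded per exhausted retry sequence.
            breaker.record(False)
            last_error = exc
    raise RuntimeError(f"All model tiers failed or were skipped: {last_error}")
```

Injecting `clock` and `sleep` keeps the logic testable without real delays. In production you would likely narrow the caught exception types to the client library's error classes.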

{% endstep %}

{% step %}

### Validate in staging

Run synthetic checks every minute:

```bash
curl -sS "$FB_OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $FB_API_KEY_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "model":"gpt-4.1-mini",
    "messages":[{"role":"user","content":"healthcheck"}],
    "max_tokens":8
  }'
```
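
Because `curl -sS` exits with status 0 even when the server returns an HTTP error, the check needs to inspect the response itself. The helper below is a hedged sketch of that evaluation: `evaluate_healthcheck` and its reason strings are hypothetical, the 5000 ms latency limit is an example value, and it assumes the status code and latency are collected separately (for instance with curl's `-w '%{http_code} %{time_total}'` option).

```python
import json


def evaluate_healthcheck(status_code, body_text, latency_ms, max_latency_ms=5000):
    """Return (ok, reason) for one synthetic chat-completions check."""
    if status_code != 200:
        return False, f"http_{status_code}"
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return False, "invalid_json"
    # A successful chat completion carries a non-empty "choices" array.
    if not isinstance(body, dict) or not body.get("choices"):
        return False, "no_choices"
    if latency_ms > max_latency_ms:
        return False, "slow"
    return True, "ok"
```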

Track:

* success rate by model,
* p95 latency by model,
* fallback activation rate.
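
The three metrics above can be derived from the structured log fields (`request_id`, `tier`, `model`, `latency_ms`, `status`). The following is a sketch under stated assumptions: `summarize` and `p95` are hypothetical helpers, `status` is assumed to be `"ok"` or `"error"`, one record is assumed per attempt, and a request counts as a fallback activation when any tier above 1 was attempted.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def summarize(records):
    """records: one dict per attempt with the structured log fields."""
    by_model = defaultdict(list)
    max_tier = {}  # highest tier attempted per request_id
    for r in records:
        by_model[r["model"]].append(r)
        rid = r["request_id"]
        max_tier[rid] = max(max_tier.get(rid, 0), r["tier"])

    per_model = {
        model: {
            "success_rate": sum(r["status"] == "ok" for r in rows) / len(rows),
            "p95_latency_ms": p95([r["latency_ms"] for r in rows]),
        }
        for model, rows in by_model.items()
    }
    fallback_rate = (
        sum(t > 1 for t in max_tier.values()) / len(max_tier) if max_tier else 0.0
    )
    return {"per_model": per_model, "fallback_activation_rate": fallback_rate}
```

In practice these aggregates usually live in your metrics backend; the sketch only pins down what each number means.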

{% endstep %}
{% endstepper %}

## Production checklist

* Keep at least 2 providers/models available.
* Cap fallback depth to avoid runaway latency.
* Alert when the Tier 1 success rate drops below a defined threshold.
* Review routing weekly using usage statistics and error trends.
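
Capping fallback depth can be paired with an overall time budget so that the worst-case latency stays bounded. The sketch below is an assumption-laden illustration: `run_within_budget`, the 30 s budget, and the depth of 3 are hypothetical, and `call(model, timeout_s)` again stands for your own wrapper around the completion request.

```python
import time


def run_within_budget(tiers, call, budget_s=30.0, max_depth=3, clock=time.monotonic):
    """tiers: list of (model, timeout_s) pairs, in fallback order.

    Stops when either max_depth tiers were tried or the budget is spent.
    """
    deadline = clock() + budget_s
    last_error = None
    for model, timeout_s in tiers[:max_depth]:
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            # Never give a tier more time than the overall budget has left.
            return model, call(model, min(timeout_s, remaining))
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"Fallback budget exhausted: {last_error}")
```

With per-tier timeouts of 20 s, 12 s, and 8 s and a 25 s budget, the second tier would get at most 5 s and the third would not run at all.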

