Cross-provider failover at the edge — what we built (and what we did not)
Every LLM-backed product ships with a single point of failure that almost no one writes about: it talks to exactly one provider. When that provider returns 5xx, your app stops talking. The user sees a spinner. The eval set looks fine. The status page is green. The dashboard is green. And your support inbox fills with screenshots of an unresponsive chat panel.
This is not a cost problem. Reducing your monthly OpenAI bill does nothing for the 47-minute outage window. This is a request-infrastructure problem, and it has a different shape, a different solution, and a different category of tools coming for it.
We built cross-provider failover into the Tessera proxy partly because customers asked for it, and partly because once we sat in the request path, the failover work was already 80% in our hands. This post is the architecture: the latency budget math, the taxonomy of passive vs active failover, the failure modes we hit, and why we think solving it at the proxy layer rather than in the client SDK is the right call.
MVP scope (shipped 2026-05-20, worker 0.43.0-m11-failover). Passive failover only — 5xx / connection error / timeout from primary triggers a retry on OpenRouter as the universal fallback. Opt-in per workload, default off. OpenAI-shape primaries only. Non-streaming requests only. Active failover via population signal, mid-stream 5xx surfacing, and direct Anthropic / Bedrock / Vertex secondary keys are deliberately deferred and called out where they appear below — the architecture is real but not every branch is live.
The shape of the failover problem
LLM providers fail in two distinct ways and the difference matters for how you respond.
Hard failures. 5xx HTTP responses. Connection resets. TCP timeouts. These are the loud ones. OpenAI publishes incidents; Anthropic publishes incidents; Mistral publishes incidents. Every provider gets multi-hour outages a few times a year. The pattern is well-documented: a recent OpenAI ChatGPT outage in June 2024 lasted ~90 minutes during US business hours, and the API surface degraded with it.
Soft failures. p99 latency drift. Rate-limit rejections under load. Quality regression on a model revision. These never show up on status pages. Your p50 looks fine; your long-tail user-facing latency creeps up; eventually a complaint arrives that “the bot has been slow for two days.”
The single-provider workload has no recourse for either. The retry loop inside the OpenAI SDK doesn't know about Anthropic. The Anthropic SDK doesn't know about Mistral. Cross-provider knowledge has to live somewhere above the SDK — either in your application code (the slow path, with all the engineering overhead) or in the request layer between you and the providers (the fast path).
Why native SDK retries are not enough
The first answer engineers reach for is the SDK retry config. Both the OpenAI Python client and the Anthropic Python client expose max_retries with exponential backoff. Set it to 3, set it to 5, the SDK will dutifully retry on 5xx and on network errors.
The trouble is that retrying against the same upstream during an outage helps no one. If OpenAI is degraded, your retry hits the same degraded service. You amplify the load on a system already struggling (this is the textbook retry-storm pattern), and you burn your latency budget waiting on responses that will not come.
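To put rough numbers on that, here is a minimal sketch of what same-provider retries cost when the upstream is fully down. The backoff parameters (0.5 s initial delay, doubling, 8 s cap, no jitter) and the 10-second per-attempt timeout are illustrative assumptions, not the documented defaults of any particular SDK.

```python
def retry_cost(max_retries: int, attempt_timeout_s: float,
               initial_backoff_s: float = 0.5, max_backoff_s: float = 8.0):
    """Return (attempts sent, seconds spent) when every attempt times out."""
    attempts = max_retries + 1  # the original call plus each retry
    backoff_total = 0.0
    delay = initial_backoff_s
    for _ in range(max_retries):
        backoff_total += min(delay, max_backoff_s)  # capped exponential backoff
        delay *= 2
    return attempts, attempts * attempt_timeout_s + backoff_total


attempts, wasted_s = retry_cost(max_retries=3, attempt_timeout_s=10.0)
print(attempts, wasted_s)  # 4 43.5
```

Under these assumptions, max_retries=3 sends four times the load at the degraded upstream and spends 43.5 seconds before the caller finally sees the error.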
The harder problem: the SDK retry is in-process and blunt. It retries blindly because it has no signal about the broader fleet. A proxy layer sitting in front of every customer's requests has a different vantage. It sees error rates per-provider in real time across the entire population. When OpenAI's 5xx rate jumps from 0.1% to 8% in a 60-second window, the proxy knows. The SDK does not.
The deeper limitation: SDK retries can only retry against the same provider. If you want to try Anthropic when OpenAI fails, you have to write that logic yourself. That logic is non-trivial: messages-format vs chat-completions-format are different shapes, tool-calling schemas differ subtly, parameter limits differ. By the time you've written the cross-provider mapping plus the conditional fallback plus the error-class taxonomy, you have built a small piece of request infrastructure. We argue you should not be building this. We should.
What the proxy-layer failover actually looks like
A Tessera request begins with the customer's SDK calling api.tesseraai.io/v1/openai/chat/completions (or the Anthropic shape, or the Mistral shape — we accept all the major wire formats). The proxy lives on Cloudflare Workers, deployed to every Cloudflare PoP. The first decision the proxy makes is which upstream to hit.
That decision is informed by three signals: the requested model (canonical name plus catalog id), the worker-local circuit-breaker state per provider (in-isolate rolling 60-second 5xx rate), and a per-isolate sample of recent latencies. (Cross-isolate Durable Object consensus on error-rate is on the roadmap — today the breaker is per-worker-isolate, so two isolates may learn the same outage independently rather than instantly.)
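The in-isolate rolling 5xx rate can be pictured with a small sketch like the one below. It is an illustration rather than the worker's code: the class name, the 1% trip threshold (borrowed from the "normal state" figure), and the minimum sample count are all assumptions, and the state is local to one process in the same way the breaker above is local to one isolate.

```python
from collections import deque


class RollingBreaker:
    """Per-provider rolling 5xx-rate tracker, local to one process."""

    def __init__(self, window_s: float = 60.0, threshold: float = 0.01,
                 min_samples: int = 20):
        self.window_s = window_s
        self.threshold = threshold      # assumed trip point
        self.min_samples = min_samples  # assumed guard against tiny samples
        self.events = deque()           # (timestamp, was_5xx) pairs

    def _evict(self, now: float) -> None:
        # Drop samples that have aged out of the rolling window.
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()

    def record(self, now: float, status: int) -> None:
        self.events.append((now, status >= 500))
        self._evict(now)

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self.events:
            return 0.0
        return sum(1 for _, bad in self.events if bad) / len(self.events)

    def is_open(self, now: float) -> bool:
        rate = self.error_rate(now)
        return len(self.events) >= self.min_samples and rate > self.threshold


breaker = RollingBreaker()
for i in range(100):
    breaker.record(now=i * 0.5, status=503 if i < 8 else 200)
print(breaker.error_rate(50.0), breaker.is_open(50.0))  # 0.08 True
```

Because each isolate holds its own deque, two isolates fed different traffic reach the open state at different times, which is the trade-off the parenthetical above describes.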
Normal state: error rate < 1%, p95 < SLA target. Worker forwards to the requested provider, attaches the customer's API key for that provider, streams the response back. The proxy overhead in this path is small: typically 15–25 ms p50 from the edge hop.
Degraded state: primary upstream returned 5xx / connection error / timeout on this specific request. Worker consults the failover table. The MVP table maps OpenAI-shape primaries to the OpenRouter namespace — for gpt-5 the fallback is openai/gpt-5 via OpenRouter (same model, different upstream path); for deepseek-chat it is deepseek/deepseek-chat via the OR namespace. Cross-vendor alternatives (gpt-5 → claude-sonnet etc.) require body translation between OpenAI chat-completions shape and Anthropic Messages shape and are deferred — the OR universal-fallback path keeps body translation out of MVP entirely.
For requests with tools in the body, the fallback target has to support function calling — a static capability gate. If no eligible alternative exists, the primary 5xx surfaces unchanged. We refuse to silently drop a tool call to dodge an outage.
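The degraded-state decision described above reduces to a handful of checks. The sketch below is a hedged illustration, not the worker's implementation: the function name and the capability set are hypothetical, and the "example-model" entry is invented purely to exercise the tool-calling gate. The two real table entries and the three trigger classes come from this post.

```python
from typing import Optional

# Primary model -> OpenRouter namespace target (the two MVP examples from the text).
FAILOVER_TABLE = {
    "gpt-5": "openai/gpt-5",
    "deepseek-chat": "deepseek/deepseek-chat",
    "example-model": "example/example-model",  # hypothetical, for the gate demo
}

# Hypothetical static capability data: which targets accept tool calls.
TOOL_CAPABLE_TARGETS = {"openai/gpt-5", "deepseek/deepseek-chat"}

PASSIVE_TRIGGERS = {"upstream_5xx", "connection_error", "timeout"}


def pick_fallback(body: dict, trigger: str, failover_enabled: bool) -> Optional[str]:
    """Return the fallback target, or None to surface the primary error unchanged."""
    if not failover_enabled or trigger not in PASSIVE_TRIGGERS:
        return None  # opt-in per workload, hard failures only
    if body.get("stream"):
        return None  # MVP skips streaming requests
    target = FAILOVER_TABLE.get(body.get("model"))
    if target is None:
        return None  # no mapping for this model
    if body.get("tools") and target not in TOOL_CAPABLE_TARGETS:
        return None  # never silently drop a tool call
    return target


print(pick_fallback({"model": "gpt-5"}, "upstream_5xx", True))  # openai/gpt-5
```

Returning None in every ineligible case keeps the "surface the primary 5xx unchanged" behaviour in one place.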
Recovery in MVP is implicit. We do not run a synthetic probe — on the next request, if the primary upstream returns success, traffic flows through it untouched. A timed recovery probe (10-second tick, 90-second healthy-streak gate before flip-back) is on the roadmap; MVP keeps the recovery path stateless because the customer-visible behaviour is identical in the common case where the outage resolves itself within a few seconds.
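For the roadmap probe, the healthy-streak gate could look something like the sketch below. This is speculative: only the 10-second tick and the 90-second streak come from the text, while the class and method names are hypothetical and the probe itself is left out.

```python
class RecoveryGate:
    """Allow flip-back only after an unbroken run of healthy probe ticks."""

    def __init__(self, tick_s: int = 10, required_streak_s: int = 90):
        self.tick_s = tick_s
        self.required_streak_s = required_streak_s
        self.streak_s = 0

    def on_probe(self, healthy: bool) -> bool:
        """Feed one probe result; return True when traffic may flip back."""
        # Any unhealthy tick resets the streak to zero.
        self.streak_s = self.streak_s + self.tick_s if healthy else 0
        return self.streak_s >= self.required_streak_s


gate = RecoveryGate()
results = [gate.on_probe(True) for _ in range(9)]
print(results[-2], results[-1])  # False True
```

With these numbers, nine consecutive healthy ticks are needed, and a single failed probe restarts the count.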
The latency budget math
The honest tension: every layer between user and provider adds latency. Proxy failover only pays if its decision overhead is smaller than the win it delivers during incidents. The budget math has to work.
Steady state (no incident):
- User TLS to Cloudflare PoP: 8–20 ms depending on geography
- Worker boot + Durable Object lookup: 2–4 ms p50, 8 ms p95
- Worker → upstream provider TLS: 10–25 ms
- Provider inference: dominates (50–4000+ ms depending on model and output length)
Net p50 overhead from inserting Tessera in the path: ~15–25 ms. The inference call dwarfs this; on a 1-second completion the proxy is 1.5–2.5% of the budget.
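The percentage is simple arithmetic, and it is worth running across the inference range quoted in the list above, since the share grows quickly for short completions. A minimal sketch, taking the 15–25 ms overhead and the 50–4000 ms inference range as given and treating inference time as the whole budget:

```python
def overhead_share(overhead_ms: float, inference_ms: float) -> float:
    """Proxy overhead as a percentage of provider inference time."""
    return 100.0 * overhead_ms / inference_ms


for inference_ms in (50, 1000, 4000):
    low = overhead_share(15, inference_ms)
    high = overhead_share(25, inference_ms)
    print(f"{inference_ms} ms inference: {low}% to {high}%")
```

At 1000 ms this reproduces the 1.5% to 2.5% figure; at the 50 ms end of the range the same overhead is 30% to 50% of inference time, so the trade is most comfortable for longer completions.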
Incident state (failover active):
- Worker decision to fail over: same path as steady state, + ~1 ms for the alternative-table lookup
- New TLS handshake to fallback provider (cold): 60–120 ms first request, <5 ms subsequent (Workers reuses connections)
- Fallback provider inference: comparable to original
For the first request that triggers a failover, you pay one extra TLS handshake (~80 ms median) in exchange for getting a response at all rather than no response. For every subsequent request in the window, you pay the steady-state overhead with the alternative provider. What the customer observes: a single response that is slower by the time the primary took to fail plus ~80 ms, then normal latency from a different upstream. They never see the 5xx.
Passive vs active failover — MVP ships passive
The taxonomy that ended up mattering for us:
Passive failover. Triggered by a hard failure on the wire: a 5xx, a connection reset, a timeout. Reactive. Cheap to reason about. The downside is that by the time you observe the failure, the user has already burned latency budget on the failed call. If the original timeout is 30 seconds, the user waits 30 seconds, then waits another ~80 ms for the fallback connection to warm up, then waits for the alternative provider to respond. Worst case visible to user: ~30.1 seconds + alternative response.
Active failover. Triggered by the rolling population signal: when a Durable Object sees error rate or p95 exceed a threshold across all customers, the provider is marked degraded for everyone. New requests go straight to the fallback. The user pays the proxy overhead plus alternative-provider response time, never the failed-call wait.
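The difference between the two modes is easiest to see as arithmetic. A rough sketch, using the 30-second timeout and ~80 ms handshake from the text; the 1-second fallback response and the 25 ms proxy overhead are assumptions chosen for illustration:

```python
def passive_wait_ms(primary_timeout_ms: int, handshake_ms: int, fallback_ms: int) -> int:
    """User-visible wait when the failure is only discovered on the wire."""
    return primary_timeout_ms + handshake_ms + fallback_ms


def active_wait_ms(proxy_overhead_ms: int, fallback_ms: int) -> int:
    """User-visible wait when the request is routed to the fallback up front."""
    return proxy_overhead_ms + fallback_ms


passive = passive_wait_ms(primary_timeout_ms=30_000, handshake_ms=80, fallback_ms=1_000)
active = active_wait_ms(proxy_overhead_ms=25, fallback_ms=1_000)
print(passive, active)  # 31080 1025
```

In this worst case the failed-call wait dominates the passive path, which is the cost the active mode is designed to remove.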
MVP ships passive failover. The customer-visible cost is the primary's response budget (the 5xx wait) plus one extra TLS handshake to OpenRouter on the first failover in a window. We chose to ship passive first because the design surface is smaller (no Durable Object population signal, no per-customer active-mode toggle) and because passive covers the failure shape that hurts most for latency-tolerant workloads — a clean recovery on the request that would otherwise have errored.
Active failover is on the roadmap. The design is sketched out above; the Durable Object work + per-workload active-mode toggle land in a follow-on after we have 30+ days of passive telemetry showing the per-pair quality preservation holds at ≥ 0.95 across the failover pairs we ship today.
Failure modes we did not anticipate
Shipping this exposed three sub-problems that the design did not obviously surface up front.
Provider key rotation during failover. If a customer rotates their OpenAI key mid-incident and we fail back while still holding the old token, we 401 against the recovered provider and the customer thinks the recovery itself failed. Fix: on every fail-back, re-fetch the customer's active provider credentials from KV before resuming. Cheap, but easy to forget.
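The fix is small enough to show in a few lines. In this sketch a plain dict stands in for the KV store, and the class, method, and key names are all hypothetical; the only point is that fail-back reads the credential again instead of reusing the one captured before the incident.

```python
class ProviderSession:
    """Holds the primary-provider key used for one customer's requests."""

    def __init__(self, kv: dict, customer_id: str):
        self.kv = kv
        self.customer_id = customer_id
        self.primary_key = kv[customer_id]  # captured before any incident
        self.on_fallback = False

    def fail_over(self) -> None:
        self.on_fallback = True

    def fail_back(self) -> None:
        # Re-read the key: it may have been rotated while we were on the fallback.
        self.primary_key = self.kv[self.customer_id]
        self.on_fallback = False


kv = {"cust_1": "sk-old"}
session = ProviderSession(kv, "cust_1")
session.fail_over()
kv["cust_1"] = "sk-new"  # customer rotates the key mid-incident
session.fail_back()
print(session.primary_key)  # sk-new
```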
Streaming response interruption. Mid-stream provider 5xx is genuinely awkward. You cannot “retry” a partial stream: the customer has already received tokens. MVP punts: the failover decision skips streaming requests entirely (worker checks stream: true in the body and surfaces the primary 5xx unchanged on SSE requests). Surfacing partial-completion + a terminal [ERROR] marker requires SSE tee + chunk-class detection that lives in a separate sprint. We'd rather not ship that work half-done.
Quality canary across failover. The daily promptfoo canary evaluates the customer's configured mechanic stack. Per-pair canary validation across the failover path (OpenRouter-served gpt-5 vs OpenAI-direct gpt-5) is on the roadmap. Today, each failover event lands on /portal/audit with the original provider, OpenRouter as the alternate, the trigger class, and the cost delta. Two pricing snapshots are persisted per failover row (primary attempted + OR billed) so the audit-immutability invariant the rest of the proxy holds to covers failover the same way. We will flip the per-workload failover default from OFF to ON once 30+ days of stack-aware canary data confirm per-pair quality preservation ≥ 0.95.
Why this is not a cost problem
A common reaction we got while building this: “Cool, but doesn't this make my requests more expensive on incident days? Anthropic is pricier than OpenAI for some shapes.” Yes, sometimes. The fallback model is the closest acceptable quality match, not the cheapest available option. During the failover window the cost ledger absolutely shows higher actual_cost_usd than baseline. We surface this honestly in the audit log.
But the framing of “cheap requests vs no requests” gets it backwards. Cross-provider failover is a reliability mechanism. Cost optimization is what we deliver in steady state. Failover is what we deliver in the incident state. Selling these as the same product muddies both.
Once we sit in the request path, the same code can do work along multiple axes — cost, reliability, observability, governance, compliance. They are different concerns served by the same architectural pattern: a programmable substrate between application and provider, controlled per-request by data the provider SDK does not have. The shape that emerges is not “cost optimization product”; the shape that emerges is request infrastructure.
What this points at
The framework we keep coming back to: when a category of infrastructure starts existing — between application code and an external service — the early years are about a single use case. Stripe was “accept payments”; Twilio was “send an SMS.” The substrate that supported those use cases turned out to support payment risk, fraud, multi-rail routing (for Stripe) and carrier interconnect, programmable voice, number management (for Twilio).
For LLMs, the wedge use case is currently “cut my AI bill”, because that is the sharpest single pain. Cross-provider failover is the second use case to surface. Compliance gates (regulated traffic never routes through a specific provider), governance (rate-limit per agent within a workspace), quality SLA enforcement, observability primitives: they are all downstream of the same substrate.
Our internal phrasing: cost optimization is what we sell; request infrastructure is what we build. Customers find us through the first. They stay because of the second.
Try it on your workload
Cross-provider failover is opt-in per workload, default off. Quality lock 2026-05-20: OpenRouter routes through their own infra, not bit-identical to a direct provider call, so we ship the conservative default and ask the sponsor to flip it deliberately. We'll move to default-on once 30+ days of telemetry validate per-pair quality preservation ≥ 0.95 across the failover pairs we ship.
To enable: in /portal/settings on a workload, paste your OpenRouter API key (sk-or-v1-…) into the “OpenRouter failover key” field and save; the toggle flips on in lockstep. The key is stored encrypted at rest via Supabase Vault; the worker JIT-decrypts it via an HMAC-signed internal endpoint only on an actual failover trigger (cold path) and never caches the plaintext across requests.
Every failover event lands on /portal/audit with the original provider, OpenRouter as the alternate, the trigger class (upstream_5xx / connection_error / timeout), and the cost delta. The chip strip shows an m11 tag in the canonical mechanics stack. CSV export of the ledger is in place.
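As a rough picture of what one such audit row carries, here is a sketch of an immutable record with the two pricing snapshots and a derived cost delta. The class and field names are hypothetical and the dollar amounts are made up; only the three trigger classes and the list of recorded facts come from the description above.

```python
from dataclasses import dataclass, FrozenInstanceError
from decimal import Decimal

TRIGGER_CLASSES = {"upstream_5xx", "connection_error", "timeout"}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class FailoverAuditRow:
    original_provider: str
    alternate_provider: str
    trigger_class: str
    primary_attempted_usd: Decimal  # pricing snapshot for the primary attempt
    or_billed_usd: Decimal          # pricing snapshot for what OpenRouter billed

    def __post_init__(self):
        if self.trigger_class not in TRIGGER_CLASSES:
            raise ValueError(f"unknown trigger class: {self.trigger_class}")

    @property
    def cost_delta_usd(self) -> Decimal:
        return self.or_billed_usd - self.primary_attempted_usd


row = FailoverAuditRow("openai", "openrouter", "upstream_5xx",
                       Decimal("0.0120"), Decimal("0.0130"))
print(row.cost_delta_usd)  # 0.0010
```

Decimal is used instead of float so that small per-request cost deltas do not pick up binary rounding noise in the ledger.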
References
- Architecture: how the proxy + canary + ledger work
- Security & data handling: what we do and do not store
- Worked example: cutting a $24k/month OpenAI bill 38% (cost optimization companion piece)
- SDK on GitHub: github.com/tessera-llm/tessera-sdk
Try Tessera on your workload
60M tokens free, no card, 30-second signup
Cross-provider failover (opt-in), auto-route, prompt-cache headers, compress, batch arbitrage: all available per-workload. Kill-switch any time. Pay 20% only on measured savings.