What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that sits in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera, and Tessera applies four optimizations before forwarding each request to the provider: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace and structural pass (LLMLingua-2 template substitution is on the roadmap), and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
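
As a rough illustration of the arithmetic described above, the sketch below sums (counterfactual minus actual) over in-scope request logs. The RequestLog record, its field names, and the dollar figures are assumptions made for this example, not Tessera's actual log schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLog:
    """One proxy log row (hypothetical schema for illustration)."""
    workload: str
    counterfactual_usd: float  # what the request would have cost unoptimized
    actual_usd: float          # what the request actually cost
    in_scope: bool = True      # out-of-scope workloads do not count


def ongoing_savings(logs):
    """Sum of (counterfactual - actual) across in-scope requests."""
    return sum(r.counterfactual_usd - r.actual_usd for r in logs if r.in_scope)


logs = [
    RequestLog("chat", 0.0120, 0.0045),             # routed to a cheaper model
    RequestLog("chat", 0.0080, 0.0000),             # served from cache
    RequestLog("batch-summaries", 0.0300, 0.0150),  # sent through a batch API
    RequestLog("legacy", 0.0500, 0.0500, in_scope=False),
]

print(f"{ongoing_savings(logs):.4f}")
```

A request that was forwarded untouched has equal counterfactual and actual cost, so it contributes zero to the total.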

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer: we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not per-seat SaaS, and savings are measured directly from our own proxy logs. Your existing observability tool can stay; it still receives downstream telemetry. Tessera is a layer above whatever vendors you already use, not a replacement for them.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Who is Tessera for?

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base-URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. There is also a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

2026-05-20·10 min read

Cross-provider failover at the edge — what we built (and what we did not)

Every LLM-backed product ships with a single point of failure that almost no one writes about: it talks to exactly one provider. When that provider returns 5xx, your app stops talking. The user sees a spinner. The eval set looks fine. The status page is green. The dashboard is green. And your support inbox fills with screenshots of an unresponsive chat panel.

This is not a cost problem. Reducing your monthly OpenAI bill does nothing for the 47-minute outage window. This is a request-infrastructure problem, and it has a different shape, a different solution, and a different category of tools coming for it.

We built cross-provider failover into the Tessera proxy partly because customers asked for it, and partly because once we sat in the request path, the failover work was already 80% in our hands. This post is the architecture: the latency budget math, the taxonomy of passive vs active failover, the failure modes we hit, and why we think solving it at the proxy layer rather than in the client SDK is the right call.

MVP scope (shipped 2026-05-20, worker 0.43.0-m11-failover). Passive failover only — 5xx / connection error / timeout from primary triggers a retry on OpenRouter as the universal fallback. Opt-in per workload, default off. OpenAI-shape primaries only. Non-streaming requests only. Active failover via population signal, mid-stream 5xx surfacing, and direct Anthropic / Bedrock / Vertex secondary keys are deliberately deferred and called out where they appear below — the architecture is real but not every branch is live.

The shape of the failover problem

LLM providers fail in two distinct ways and the difference matters for how you respond.

Hard failures. 5xx HTTP responses. Connection resets. TCP timeouts. These are the loud ones. OpenAI publishes incidents; Anthropic publishes incidents; Mistral publishes incidents. Every provider gets multi-hour outages a few times a year. The pattern is well-documented: a recent OpenAI ChatGPT outage in June 2024 lasted ~90 minutes during US business hours, and the API surface degraded with it.

Soft failures. p99 latency drift. Rate-limit rejections under load. Quality regression on a model revision. These never show up on status pages. Your p50 looks fine; your long-tail user-facing latency creeps up; eventually a complaint arrives that “the bot has been slow for two days.”

The single-provider workload has no recourse for either. The retry loop inside the OpenAI SDK doesn't know about Anthropic. The Anthropic SDK doesn't know about Mistral. Cross-provider knowledge has to live somewhere above the SDK — either in your application code (the slow path, with all the engineering overhead) or in the request layer between you and the providers (the fast path).

Why native SDK retries are not enough

The first answer engineers reach for is the SDK retry config. Both the OpenAI Python client and the Anthropic Python client expose max_retries with exponential backoff. Set it to 3, set it to 5, the SDK will dutifully retry on 5xx and on network errors.

The trouble is that retrying against the same upstream during an outage helps no one. If OpenAI is degraded, your retry hits the same degraded service. You amplify the load on a system already struggling (this is the textbook retry-storm pattern), and you burn your latency budget waiting on responses that will not come.

The harder problem: the SDK retry is in-process and blunt. It retries blindly because it has no signal about the broader fleet. A proxy layer sitting in front of every customer's requests has a different vantage. It sees error rates per-provider in real time across the entire population. When OpenAI's 5xx rate jumps from 0.1% to 8% in a 60-second window, the proxy knows. The SDK does not.

The deeper limitation: SDK retries can only retry against the same provider. If you want to try Anthropic when OpenAI fails, you have to write that logic yourself. That logic is non-trivial: messages-format vs chat-completions-format are different shapes, tool-calling schemas differ subtly, parameter limits differ. By the time you've written the cross-provider mapping plus the conditional fallback plus the error-class taxonomy, you have built a small piece of request infrastructure. We argue you should not be building this. We should.

What the proxy-layer failover actually looks like

A Tessera request begins with the customer's SDK calling api.tesseraai.io/v1/openai/chat/completions (or the Anthropic shape, or the Mistral shape — we accept all the major wire formats). The proxy lives on Cloudflare Workers, deployed to every Cloudflare PoP. The first decision the proxy makes is which upstream to hit.

That decision is informed by three signals: the requested model (canonical name plus catalog id), the worker-local circuit-breaker state per provider (in-isolate rolling 60-second 5xx rate), and a per-isolate sample of recent latencies. (Cross-isolate Durable Object consensus on error-rate is on the roadmap — today the breaker is per-worker-isolate, so two isolates may learn the same outage independently rather than instantly.)
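
A minimal sketch of the per-isolate breaker signal described above: a rolling 60-second window of outcomes from which a 5xx rate is computed. The class name, the min_samples guard, and the use of the 1% normal-state bound as the trip threshold are assumptions for illustration; the production worker runs on Cloudflare Workers, not Python.

```python
from collections import deque


class RollingBreaker:
    """Rolling-window 5xx-rate tracker for one upstream provider."""

    def __init__(self, window_s=60.0, threshold=0.01, min_samples=20):
        self.window_s = window_s        # 60-second window, as in the text
        self.threshold = threshold      # assumed trip point: 1% error rate
        self.min_samples = min_samples  # assumption: ignore tiny samples
        self._events = deque()          # (timestamp_s, is_5xx), oldest first

    def _evict(self, now):
        # Drop observations that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def record(self, now, is_5xx):
        self._events.append((now, is_5xx))
        self._evict(now)

    def error_rate(self, now):
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_5xx in self._events if is_5xx)
        return errors / len(self._events)

    def is_degraded(self, now):
        self._evict(now)
        return (len(self._events) >= self.min_samples
                and self.error_rate(now) >= self.threshold)


breaker = RollingBreaker()
for i in range(100):
    # 100 requests over about 50 seconds; the first 8 return 5xx.
    breaker.record(now=i * 0.5, is_5xx=(i < 8))

print(breaker.error_rate(now=50.0), breaker.is_degraded(now=50.0))
print(breaker.error_rate(now=120.0), breaker.is_degraded(now=120.0))
```

Because each isolate would hold its own instance of this state, two isolates can reach the degraded verdict at different times, which is the limitation the parenthetical above describes.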

Normal state: error rate < 1%, p95 < SLA target. Worker forwards to the requested provider, attaches the customer's API key for that provider, streams the response back. The proxy overhead in this path is small — typically 15-25 ms p50 from the edge hop.

Degraded state: primary upstream returned 5xx / connection error / timeout on this specific request. Worker consults the failover table. The MVP table maps OpenAI-shape primaries to the OpenRouter namespace: for gpt-5 the fallback is openai/gpt-5 via OpenRouter (same model, different upstream path); for deepseek-chat it is deepseek/deepseek-chat, also via OpenRouter. Cross-vendor alternatives (gpt-5 → claude-sonnet etc.) require body translation between the OpenAI chat-completions shape and the Anthropic Messages shape and are deferred. The OpenRouter universal-fallback path keeps body translation out of MVP entirely.

For requests with tools in the body, the fallback target has to support function calling — a static capability gate. If no eligible alternative exists, the primary 5xx surfaces unchanged. We refuse to silently drop a tool call to dodge an outage.
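
The selection rules above can be sketched as a small lookup. The two table entries come from this post; the TOOL_CAPABLE set, the example-no-tools entry, and the function name are hypothetical placeholders and say nothing about real model capabilities. The opt-in flag and the streaming skip reflect the MVP scope described elsewhere in this post.

```python
# Primary model -> OpenRouter namespace target (first two entries from the post).
FAILOVER_TABLE = {
    "gpt-5": "openai/gpt-5",
    "deepseek-chat": "deepseek/deepseek-chat",
    "example-no-tools": "example/no-tools",  # hypothetical entry for the demo
}

# Hypothetical static capability data: targets assumed to support tool calls.
TOOL_CAPABLE = {"openai/gpt-5", "deepseek/deepseek-chat"}


def pick_fallback(body, failover_enabled):
    """Return the fallback target, or None to surface the primary error."""
    if not failover_enabled:       # opt-in per workload, default off
        return None
    if body.get("stream"):         # MVP skips streaming requests
        return None
    target = FAILOVER_TABLE.get(body.get("model"))
    if target is None:             # no mapping for this model
        return None
    if body.get("tools") and target not in TOOL_CAPABLE:
        return None                # never silently drop a tool call
    return target


print(pick_fallback({"model": "gpt-5"}, failover_enabled=True))
print(pick_fallback({"model": "gpt-5", "stream": True}, failover_enabled=True))
```

Returning None in every ineligible case keeps the behaviour conservative: the caller sees the primary's original error instead of a degraded substitute.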

Recovery in MVP is implicit. We do not run a synthetic probe — on the next request, if the primary upstream returns success, traffic flows through it untouched. A timed recovery probe (10-second tick, 90-second healthy-streak gate before flip-back) is on the roadmap; MVP keeps the recovery path stateless because the customer-visible behaviour is identical in the common case where the outage resolves itself within a few seconds.
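
For the roadmap variant, a healthy-streak gate might look like the sketch below. It is not shipped behaviour. The class and method names are assumptions; the 10-second tick and 90-second streak are the figures quoted above.

```python
class RecoveryGate:
    """Allow flip-back only after an unbroken run of healthy probes."""

    def __init__(self, streak_s=90.0):
        self.streak_s = streak_s      # required healthy streak, in seconds
        self._healthy_since = None    # timestamp of first probe in the streak

    def observe(self, now, healthy):
        """Feed one probe result; return True when flip-back is allowed."""
        if not healthy:
            self._healthy_since = None    # any failure resets the streak
            return False
        if self._healthy_since is None:
            self._healthy_since = now
        return now - self._healthy_since >= self.streak_s


gate = RecoveryGate()
# One healthy probe every 10 seconds, from t=0 to t=90.
results = [gate.observe(now=float(t), healthy=True) for t in range(0, 100, 10)]
print(results)
```

With a 10-second tick the gate opens on the tenth consecutive healthy probe, and a single failed probe restarts the count.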

The latency budget math

The honest tension: every layer between user and provider adds latency. Proxy failover only pays if its decision overhead is smaller than the win it delivers during incidents. The budget math has to work.

Steady state (no incident):

User TLS to Cloudflare PoP: 8–20 ms depending on geography
Worker boot + Durable Object lookup: 2–4 ms p50, 8 ms p95
Worker → upstream provider TLS: 10–25 ms
Provider inference: dominates (50–4000+ ms depending on model and output length)

Net p50 overhead from inserting Tessera in the path: ~15–25 ms. The inference call dwarfs this; on a 1-second completion the proxy is 1.5-2.5% of the budget.

Incident state (failover active):

Worker decision to fail over: same path as steady state, + ~1 ms for the alternative-table lookup
New TLS handshake to fallback provider (cold): 60–120 ms first request, <5 ms subsequent (Workers reuses connections)
Fallback provider inference: comparable to original

For the first request that triggers a fail-over, you pay one extra TLS handshake (~80 ms median) for the privilege of getting a response at all rather than no response. For every subsequent request in the window, you pay the steady-state overhead with the alternative provider. The customer's observable: a single response that's 80 ms slower, then normal latency from a different upstream. They never see the 5xx.
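
The budget arithmetic above can be checked with a few lines. The constants are the ranges quoted in this section; the function names are illustrative, and the failover figure deliberately excludes the time already spent waiting on the failed primary call.

```python
STEADY_OVERHEAD_MS = (15.0, 25.0)   # net p50 proxy overhead, low and high
TABLE_LOOKUP_MS = 1.0               # alternative-table lookup
COLD_HANDSHAKE_MS = (60.0, 120.0)   # first TLS handshake to the fallback


def overhead_share(overhead_ms, completion_ms):
    """Proxy overhead as a fraction of the completion time."""
    return overhead_ms / completion_ms


def first_failover_overhead_ms():
    """Overhead on the first request that fails over (low, high)."""
    low = STEADY_OVERHEAD_MS[0] + TABLE_LOOKUP_MS + COLD_HANDSHAKE_MS[0]
    high = STEADY_OVERHEAD_MS[1] + TABLE_LOOKUP_MS + COLD_HANDSHAKE_MS[1]
    return low, high


for overhead in STEADY_OVERHEAD_MS:
    # Share of a 1-second completion taken by the proxy in steady state.
    print(f"{overhead_share(overhead, 1000.0):.1%}")

print(first_failover_overhead_ms())
```

Subsequent requests in the same window reuse the connection, so they fall back to roughly the steady-state range.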

Passive vs active failover — MVP ships passive

The taxonomy that ended up mattering for us:

Passive failover. Triggered by a hard failure on the wire: a 5xx, a connection reset, a timeout. Reactive. Cheap to reason about. The downside is that by the time you observe the failure, the user has already burned latency budget on the failed call. If the original timeout is 30 seconds, the user waits 30 seconds, then waits another ~80 ms for the fallback connection to warm up, then waits for the alternative provider to respond. Worst case visible to user: ~30.1 seconds + alternative response.

Active failover. Triggered by the rolling population signal: when our Durable Object sees error rate or p95 exceed threshold across all customers, we mark the provider degraded for everyone. New requests go straight to the fallback. The user pays the proxy overhead plus alternative-provider response time, never the failed-call wait.

MVP ships passive failover. The customer-visible cost is the primary's response budget (the 5xx wait) plus one extra TLS handshake to OpenRouter on the first failover in a window. We chose to ship passive first because the design surface is smaller (no Durable Object population signal, no per-customer active-mode toggle) and because passive covers the failure shape that hurts most for latency-tolerant workloads — a clean recovery on the request that would otherwise have errored.
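
A minimal sketch of the passive pattern, assuming the upstream calls are plain callables: try the primary, and retry once on the fallback only when the failure is a 5xx, a connection error, or a timeout. UpstreamHTTPError and the function names are hypothetical. The three trigger labels are the ones this post lists for the audit log.

```python
class UpstreamHTTPError(Exception):
    """Hypothetical error type carrying the upstream HTTP status."""

    def __init__(self, status):
        super().__init__(f"upstream returned HTTP {status}")
        self.status = status


def classify(exc):
    """Map an exception to a failover trigger class, or None if not eligible."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "connection_error"
    if isinstance(exc, UpstreamHTTPError) and 500 <= exc.status <= 599:
        return "upstream_5xx"
    return None  # e.g. 4xx responses are surfaced unchanged


def call_with_passive_failover(call_primary, call_fallback=None):
    """Return (response, trigger); trigger is None when the primary succeeded."""
    try:
        return call_primary(), None
    except Exception as exc:
        trigger = classify(exc)
        if trigger is None or call_fallback is None:
            raise  # not a failover case, or no eligible fallback
        return call_fallback(), trigger


def failing_primary():
    raise UpstreamHTTPError(503)


def openrouter_fallback():
    return {"served_by": "openrouter"}


response, trigger = call_with_passive_failover(failing_primary, openrouter_fallback)
print(response, trigger)
```

The wrapper makes the cost of the passive approach visible: the fallback is only attempted after the primary call has already failed, so the caller always pays the primary's wait first.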

Active failover is on the roadmap. The design is sketched out above; the Durable Object work + per-workload active-mode toggle land in a follow-on after we have 30+ days of passive telemetry showing the per-pair quality preservation holds at ≥ 0.95 across the failover pairs we ship today.

Failure modes we did not anticipate

Shipping this exposed three sub-problems that the design did not obviously surface up front.

Provider key rotation during failover. If a customer rotates their OpenAI key mid-incident and we fail back while still holding the old token, we get a 401 from the recovered provider and the customer thinks the recovery itself failed. Fix: on every fail-back, re-fetch the customer's active provider credentials from KV before resuming. Cheap, but easy to forget.

Streaming response interruption. Mid-stream provider 5xx is genuinely awkward. You cannot “retry” a partial stream because the customer has already received tokens. MVP punts: the failover decision skips streaming requests entirely (worker checks stream: true in the body and surfaces the primary 5xx unchanged on SSE requests). Surfacing partial-completion + a terminal [ERROR] marker requires SSE tee + chunk-class detection that lives in a separate sprint. We'd rather not ship that work half-done.

Quality canary across fail-over. The daily promptfoo canary evaluates the customer's configured mechanic stack. Per-pair canary validation across the failover path (OpenRouter-served gpt-5 vs OpenAI-direct gpt-5) is on the roadmap. Today, each failover event lands on /portal/audit with the original provider, OpenRouter as the alternate, the trigger class, and the cost delta. Two pricing snapshots are persisted per failover row (primary attempted + OpenRouter billed), so the audit-immutability invariant that the rest of the proxy holds to covers failover the same way. We will flip the per-workload failover default from OFF to ON once 30+ days of stack-aware canary data confirm per-pair quality preservation ≥ 0.95.
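
One way to picture a failover audit row with its two pricing snapshots is sketched below. The field names, the per-million-token prices, and the token counts are invented for the example and are not Tessera's schema or real provider pricing. Frozen dataclasses stand in for the immutability invariant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingSnapshot:
    """Prices in effect when the request ran (illustrative fields)."""
    provider: str
    model: str
    usd_per_1m_input: float
    usd_per_1m_output: float

    def cost_usd(self, input_tokens, output_tokens):
        return (input_tokens * self.usd_per_1m_input
                + output_tokens * self.usd_per_1m_output) / 1_000_000


@dataclass(frozen=True)
class FailoverAuditRow:
    original_provider: str
    alternate_provider: str
    trigger: str                        # upstream_5xx / connection_error / timeout
    primary_attempted: PricingSnapshot  # what the primary would have charged
    alternate_billed: PricingSnapshot   # what the fallback actually charged
    input_tokens: int
    output_tokens: int

    @property
    def cost_delta_usd(self):
        """Billed cost on the alternate minus the primary's would-be cost."""
        args = (self.input_tokens, self.output_tokens)
        return (self.alternate_billed.cost_usd(*args)
                - self.primary_attempted.cost_usd(*args))


row = FailoverAuditRow(
    original_provider="openai",
    alternate_provider="openrouter",
    trigger="upstream_5xx",
    primary_attempted=PricingSnapshot("openai", "gpt-5", 1.00, 10.00),
    alternate_billed=PricingSnapshot("openrouter", "openai/gpt-5", 1.05, 10.50),
    input_tokens=2000,
    output_tokens=500,
)
print(f"{row.cost_delta_usd:.5f}")
```

Storing both snapshots on the row means the delta can be recomputed later even if either provider's prices change.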

Why this is not a cost problem

A common reaction we got while building this: “Cool, but doesn't this make my requests more expensive on incident days? Anthropic is pricier than OpenAI for some shapes.” Yes, sometimes. The fallback model is the closest acceptable quality match, not the cheapest available option. During the fail-over window the cost ledger absolutely shows higher actual_cost_usd than baseline. We surface this honestly in the audit log.

But the framing of “cheap requests vs no requests” gets it backwards. Cross-provider failover is a reliability mechanism. Cost optimization is what we deliver in steady state. Failover is what we deliver in the incident state. Selling these as the same product muddies both.

Once we sit in the request path, the same code can do work along multiple axes — cost, reliability, observability, governance, compliance. They are different concerns served by the same architectural pattern: a programmable substrate between application and provider, controlled per-request by data the provider SDK does not have. The shape that emerges is not “cost optimization product”; the shape that emerges is request infrastructure.

What this points at

The framework we keep coming back to: when a category of infrastructure starts existing — between application code and an external service — the early years are about a single use case. Stripe was “accept payments”; Twilio was “send an SMS.” The substrate that supported those use cases turned out to support payment risk, fraud, multi-rail routing (for Stripe) and carrier interconnect, programmable voice, number management (for Twilio).

For LLMs, the wedge use case is currently “cut my AI bill” because that is the sharpest single pain. Cross-provider failover is the second use case to surface. Compliance gates (regulated traffic never routes through a specific provider), governance (rate-limit per agent within a workspace), quality SLA enforcement, observability primitives: all are downstream of the same substrate.

Our internal phrasing: cost optimization is what we sell; request infrastructure is what we build. Customers find us through the first. They stay because of the second.

Try it on your workload

Cross-provider failover is opt-in per workload, default off. Quality lock 2026-05-20: OpenRouter routes through their own infra, not bit-identical to a direct provider call, so we ship the conservative default and ask the sponsor to flip it deliberately. We'll move to default-on once 30+ days of telemetry validate per-pair quality preservation ≥ 0.95 across the failover pairs we ship.

To enable: in /portal/settings on a workload, paste your OpenRouter API key (sk-or-v1-…) into the “OpenRouter failover key” field and save; the toggle flips on in lockstep. The key is encrypted at rest via Supabase Vault; the worker JIT-decrypts it via an HMAC-signed internal endpoint only on an actual failover trigger (cold path) and never caches the plaintext across requests.

Every failover event lands on /portal/audit with the original provider, OpenRouter as the alternate, the trigger class (upstream_5xx / connection_error / timeout), and the cost delta. The chip strip shows an m11 tag in the canonical mechanics stack. CSV export of the ledger is in place.

Try Tessera on your workload

60M tokens free, no card, 30-second signup

Cross-provider failover (opt-in), auto-route, prompt-cache headers, compress, batch arbitrage: all available per-workload. Kill-switch any time. Flat monthly pricing by token volume; keep 100% of savings.
