What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via LLMLingua-2 where safe, auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera enter the Performance Fee.
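The aggregation described above is a per-request sum, sketched minimally here. The record layout, field names, and sample figures are hypothetical illustrations, not Tessera's actual log schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestLog:
    """One proxy log row (hypothetical shape, for illustration only)."""
    workload: str
    counterfactual_cost_usd: float  # what the request would have cost unoptimized
    actual_cost_usd: float          # what was actually incurred
    in_scope: bool = True           # out-of-scope workloads do not count


def aggregate_ongoing_savings(logs: Iterable[RequestLog]) -> float:
    """Sum (counterfactual minus actual) across in-scope requests."""
    return sum(
        r.counterfactual_cost_usd - r.actual_cost_usd
        for r in logs
        if r.in_scope
    )


# Made-up sample rows for one period.
sample_logs = [
    RequestLog("chat", 0.0120, 0.0030),          # routed to a cheaper model
    RequestLog("chat", 0.0080, 0.0),             # served from cache
    RequestLog("batch-job", 0.0500, 0.0250),     # sent through a batch API
    RequestLog("internal", 0.0200, 0.0100, in_scope=False),
]
period_savings = aggregate_ongoing_savings(sample_logs)
```

Because each row carries both figures, the total can be recomputed from the logs alone without consulting a provider invoice.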

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. We bill on measured savings from our own proxy logs, not on a SaaS seat. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Two tiers. Free Sandbox: 60 million tokens per month, $0 fee, observe-only optimizations (savings are measured but no fee accrues). Production: 20% of measured savings, billed against a prepaid balance (Stripe Checkout, $100 minimum top-up, or invoice on request). Zero savings in a billing window means zero fee for that window. No floor, no retainer, no contract review for activation. If the balance reaches zero, the proxy automatically pauses optimizations and forwards requests as passthrough; no fees accrue while paused. Top up to resume. Large-volume Production Clients (above $500,000 per month in measured savings) may negotiate a custom MSA addendum. Quality preservation is guaranteed at 0.95, verified by canary; a three-day breach triggers auto-disable plus a 10% fee credit applied to your balance.
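As a rough illustration of the Production fee arithmetic, the sketch below applies the 20% rate to one window of measured savings and debits a prepaid balance. Treating negative savings as zero, and clamping the balance at zero when the fee exceeds it, are assumptions made for the example; the function names are hypothetical.

```python
from typing import Tuple

PERFORMANCE_FEE_RATE = 0.20  # Production tier: 20% of measured savings


def performance_fee(measured_savings_usd: float) -> float:
    """Fee for one window. Assumption: negative savings are treated as zero."""
    return PERFORMANCE_FEE_RATE * max(measured_savings_usd, 0.0)


def settle_window(balance_usd: float, measured_savings_usd: float) -> Tuple[float, bool]:
    """Debit the fee and report whether optimizations stay active.

    Returns (new_balance, optimizations_active). Assumption: the balance is
    clamped at zero, at which point the proxy would fall back to passthrough.
    """
    fee = performance_fee(measured_savings_usd)
    new_balance = max(balance_usd - fee, 0.0)
    return new_balance, new_balance > 0.0
```

With a $100 balance, a window with no savings leaves the balance untouched, while a window with $500 of savings consumes the whole balance and pauses optimizations until the next top-up.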

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch, both account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Performance Fee does not accrue on paused traffic. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service: Tessera never runs in your stack outside your control.

Two primary shapes carry most Production signups: (1) the CTO of a Series A-B AI-native SaaS company spending $20k-$200k per month on LLM APIs, with gross margin under pressure, who can change a base URL in 30 minutes and signs up within 72 hours; (2) a Series B-D scale-up adding AI features and spending $50k-$500k per month, where an AI Platform Lead owns the budget and a light security review takes 2-4 weeks. A developer-facing Free Sandbox tier (60M tokens/mo, $0 fee) serves solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above $500k/mo in measured savings, Production Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.


Mechanic · M11 · Reliability primitive

Cross-provider failover

Shipped 20 May 2026 · Worker 0.43.0-m11-failover · Migration 0081

When the primary LLM provider returns a 5xx / connection error / timeout, the Tessera proxy retries the customer's request on OpenRouter with the namespaced equivalent model. The customer sees a successful response instead of a transport failure. This page documents what we built, the audit invariants that hold across the failover path, and the deferrals we deliberately did not paper over.

M11 is a reliability primitive, not a cost mechanic. It does not save money — when failover fires, the customer is briefly paying OpenRouter's 5.5% surcharge on top of the model rate. The framing matters: this is request infrastructure, sitting in the same proxy path as the nine cost mechanics, addressing a different concern.

What it does

On every request, after the primary forwarder returns (or throws), the worker classifies the outcome via parseFailoverTrigger:

upstream_5xx: primary returned a 5xx status. 4xx is excluded (caller fault: the request would also fail on the alternate).
connection_error: fetch rejected before a response arrived (TCP reset, DNS failure, unreachable host).
timeout: request budget exceeded by the worker's outer guard.

Any classification other than these three returns null and the failover path is skipped. The customer sees the original response unchanged.
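The three-way classification can be sketched as a small pure function. This is an illustration in Python, not the worker's actual parseFailoverTrigger; the signature and the mapping onto Python's built-in exception types are assumptions.

```python
from typing import Optional


def parse_failover_trigger(
    status: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> Optional[str]:
    """Classify a primary outcome into a failover trigger, or None.

    Assumed mapping: TimeoutError stands in for the outer-guard timeout,
    and any other OSError (reset, DNS failure, unreachable host) stands in
    for a rejected fetch.
    """
    if error is not None:
        # TimeoutError is itself an OSError, so it must be checked first.
        if isinstance(error, TimeoutError):
            return "timeout"
        if isinstance(error, OSError):
            return "connection_error"
        return None
    if status is not None and 500 <= status <= 599:
        return "upstream_5xx"
    # 2xx, 3xx, and 4xx (caller fault) never trigger failover.
    return None
```

A 429 or 400 returns None here, so the caller receives the primary's response unchanged.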

When the trigger fires and all of the following hold (the workload has opted in, an OpenRouter API key is provisioned, the request is not streaming, and the requested model exists in our static translation table), the worker JIT-decrypts the OR key, swaps the model field to the OR namespace (gpt-5 → openai/gpt-5, deepseek-chat → deepseek/deepseek-chat), and retries against openrouter.ai/api/v1. The customer sees the OR response with extra response headers identifying the failover.
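The eligibility conditions and the model-name swap could look roughly like the sketch below. The function name and parameters are hypothetical, and the table holds only the two mappings quoted in the text, not the full static translation table.

```python
from typing import Optional

# Illustrative subset: just the two examples given above.
OPENROUTER_MODEL_MAP = {
    "gpt-5": "openai/gpt-5",
    "deepseek-chat": "deepseek/deepseek-chat",
}


def build_failover_body(
    body: dict,
    trigger: Optional[str],
    opted_in: bool,
    has_openrouter_key: bool,
) -> Optional[dict]:
    """Return the request body to retry on OpenRouter, or None to skip failover."""
    if trigger is None or not opted_in or not has_openrouter_key:
        return None
    if body.get("stream") is True:
        return None  # streaming requests skip failover
    or_model = OPENROUTER_MODEL_MAP.get(body.get("model"))
    if or_model is None:
        return None  # model missing from the translation table
    # Copy the body so the original request is left untouched.
    return {**body, "model": or_model}
```

Returning None for every unmet condition keeps the default path simple: the caller forwards whatever the primary produced.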

What we built (MVP scope locks)

Migration 0081 added two workload columns — cross_provider_failover_enabled (boolean, default FALSE) and openrouter_failover_key_vault_id (UUID into vault.secrets) — with a CHECK constraint that prevents orphan keys: vault uuid non-null requires the toggle to be ON. Three Vault RPC wrappers (m11_store_openrouter_failover_key, m11_read_openrouter_failover_key, m11_destroy_openrouter_failover_key) handle storage, JIT-decrypt, and cleanup, all SECURITY DEFINER with locked search paths.

The worker stores only the vault UUID in Cloudflare KV. The decrypted OpenRouter key is fetched per-failover via an HMAC-signed call to /api/internal/failover-key on the dashboard — cold path only, never cached across requests or across isolates. The plaintext key exists in worker memory for the duration of one failover and is then discarded. Audit: audit_events receives an m11_failover_key_decrypted row on every successful decrypt.
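For readers unfamiliar with HMAC-signed internal calls, here is a generic sketch using Python's standard hmac module. The message layout (timestamp, request id, vault UUID), the SHA-256 digest, and the 30-second skew window are all assumptions for illustration; the actual signing scheme of /api/internal/failover-key is not documented here.

```python
import hashlib
import hmac


def sign_request(secret: bytes, request_id: str, vault_id: str, timestamp: int) -> str:
    """Caller side: sign the fields that identify one key fetch."""
    message = f"{timestamp}.{request_id}.{vault_id}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(
    secret: bytes,
    request_id: str,
    vault_id: str,
    timestamp: int,
    signature: str,
    now: int,
    max_skew_s: int = 30,
) -> bool:
    """Receiver side: reject stale or tampered requests before decrypting."""
    if abs(now - timestamp) > max_skew_s:
        return False  # outside the replay window
    expected = sign_request(secret, request_id, vault_id, timestamp)
    # Constant-time comparison avoids leaking the signature via timing.
    return hmac.compare_digest(expected, signature)
```

Binding the vault UUID into the signed message means a captured signature cannot be replayed to fetch a different workload's key.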

Four columns on optimize_savings record the failover provenance: failover_attempted, failover_from_provider, failover_trigger, primary_attempted_pricing_snapshot_id. Two pricing snapshots are persisted per failover row — the primary's catalog snapshot we would have used, and OpenRouter's snapshot we actually billed against. The audit-immutability invariant the rest of the proxy holds (every $ figure traceable to an immutable catalog row) covers the failover path the same way.

The capability gate in decideFailover refuses to fail over when the request body carries a tools array and the OR alternate lacks function-calling support. We do not silently drop a tool call to dodge an outage. If no eligible alternative exists, the primary 5xx surfaces unchanged.
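The capability gate reduces to a single predicate, sketched below. This is not the worker's decideFailover; the signature is hypothetical, and treating an empty tools list as "no tools requested" is an assumption.

```python
def decide_failover(body: dict, alternate_supports_tools: bool) -> bool:
    """Return True when failing over would not drop a tool call."""
    # Assumption: an empty or missing tools list needs no function calling.
    wants_tools = bool(body.get("tools"))
    if wants_tools and not alternate_supports_tools:
        return False  # refuse; the primary 5xx surfaces unchanged
    return True
```

The gate fails closed: when the alternate cannot honor the tools array, the caller sees the original error rather than a response that silently ignored the tools.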

What we don’t do yet

Five Phase 2 deferrals are explicit, not hidden:

Active failover via population signal. The worker today reads the per-isolate circuit-breaker state. Cross-isolate consensus via a Cloudflare Durable Object holding the canonical breaker state is on the roadmap — when the broader fleet sees 5xx rates spike before any one isolate accumulates enough samples to flip, active failover redirects new requests without first paying the per-request 5xx wait.
Anthropic /v1/messages primary. Anthropic's body shape diverges from OpenAI's chat-completions shape (separate system field, typed content blocks, distinct tool schema). MVP routes only OpenAI-shape primaries; cross-shape body translation lands in a separate sprint.
Mid-stream 5xx surfacing. Streaming requests (stream: true) skip failover. Surfacing partial completion plus a terminal [ERROR] marker requires SSE tee + chunk-class detection that we'd rather not ship half-done.
Recovery probe + fail-back. No synthetic 10-second probe to test the primary upstream while failover is active. Recovery is implicit: the next successful primary request returns the customer to the direct path. The synthetic probe is queued — the cost is monitoring complexity and the benefit only shows up during multi-minute outages where the implicit path delays recovery on the second-affected request.
Per-pair canary validation. The daily promptfoo canary scores the customer's configured mechanic stack. Per-pair validation across the failover path (OpenRouter-served gpt-5 vs OpenAI-direct gpt-5) is queued; we surface every failover event on the audit ledger (/portal/audit for sponsors with provisioned access) today instead of asserting per-pair quality automatically.

How to enable

On a workload page in /portal/settings, paste an OpenRouter API key (sk-or-v1-…) into the OpenRouter failover key form and save. The store action wraps the key in Vault and writes the UUID; the toggle flips ON in lockstep (the DB CHECK constraint guarantees the (toggle, vault uuid) pair is consistent on every row). Existing keys can be rotated by saving a new value: the prior Vault row is destroyed and a fresh one is created. The Clear key button destroys the Vault row, NULLs the column, and force-disables the toggle.
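The store, rotate, and clear lifecycle can be modeled in memory to show how the (toggle, vault uuid) invariant holds after each action. The classes below are a hypothetical model: a plain dict stands in for Vault, and the check method mirrors the CHECK constraint in Python rather than SQL.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Workload:
    cross_provider_failover_enabled: bool = False
    openrouter_failover_key_vault_id: Optional[str] = None

    def check(self) -> None:
        """Mirror of the CHECK constraint: a non-null vault id requires toggle ON."""
        if self.openrouter_failover_key_vault_id is not None and not self.cross_provider_failover_enabled:
            raise ValueError("orphan key: vault id set while toggle is OFF")


class FailoverKeySettings:
    def __init__(self) -> None:
        self.vault: Dict[str, str] = {}  # stand-in for Vault (which stores ciphertext)
        self.workload = Workload()

    def save_key(self, key: str) -> None:
        """Store or rotate: destroy any prior row, create a fresh one, toggle ON."""
        old_id = self.workload.openrouter_failover_key_vault_id
        if old_id is not None:
            self.vault.pop(old_id, None)
        new_id = str(uuid.uuid4())
        self.vault[new_id] = key
        self.workload.openrouter_failover_key_vault_id = new_id
        self.workload.cross_provider_failover_enabled = True
        self.workload.check()

    def clear_key(self) -> None:
        """Destroy the row, null the column, and force the toggle OFF."""
        old_id = self.workload.openrouter_failover_key_vault_id
        if old_id is not None:
            self.vault.pop(old_id, None)
        self.workload.openrouter_failover_key_vault_id = None
        self.workload.cross_provider_failover_enabled = False
        self.workload.check()
```

In this model, rotation never leaves two live rows for one workload, and clearing never leaves a vault id behind with the toggle off.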

KV refresh is immediate (refreshKvForWorkloadKeys fires after every settings mutation). The next request through the worker reads the new value without waiting for the 5-minute periodic sync.

Verification surfaces

M11 is observable in three places:

Response headers on the failover-served request: x-tessera-failover: applied, x-tessera-failover-from: openai, x-tessera-failover-trigger: upstream_5xx, x-tessera-failover-to: openrouter:openai/gpt-5.
/portal/audit chip strip. Failover rows carry an m11 chip in their canonical mechanics_stack alongside any other applied mechanics. A per-row caption surfaces the trigger class and from-provider so the failover event is visible without click-through.
audit_events table. Every JIT key decrypt writes an m11_failover_key_decrypted row with the request id + vault uuid for compliance traceability. The append-only invariant on audit_events covers M11 the same way it covers every other state-change event.
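On the client side, the response headers listed above can be checked with a small helper. The function name and the returned dictionary shape are hypothetical; the header names and example values are the ones quoted in this section.

```python
from typing import Mapping, Optional


def read_failover_headers(headers: Mapping[str, str]) -> Optional[dict]:
    """Return failover details if the response was failover-served, else None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    if h.get("x-tessera-failover") != "applied":
        return None
    # x-tessera-failover-to looks like "openrouter:openai/gpt-5".
    to_provider, _, to_model = h.get("x-tessera-failover-to", "").partition(":")
    return {
        "from_provider": h.get("x-tessera-failover-from"),
        "trigger": h.get("x-tessera-failover-trigger"),
        "to_provider": to_provider,
        "to_model": to_model,
    }
```

A caller could log the returned dictionary alongside its own request id to reconcile against the rows shown in /portal/audit.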

Composition with other mechanics

M11 sits outside the M3 / M7 / M8 content-mutation mutex. It composes freely with M1 auto-route, M2 exact cache, M5 semantic cache, M6 prompt cache, M9 output-length predictor, and M10 batch arbitrage. Practically: when the primary fails after auto-routing to a cheaper model (M1 fired gpt-5 → gpt-5-mini), the worker fails over to openai/gpt-5-mini on OpenRouter — the M1 swap is preserved, not undone.

Pricing follows the actual upstream: the OR pricing-catalog row for openai/gpt-5-mini backs actual_cost_usd (with OR's surcharge baked in); the primary's catalog row backs the snapshot we would have used. The performance-fee math runs against the customer's baseline anchor as always — failover does not unmoor the savings calculation.
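A worked example of the pricing relationship, under simplifying assumptions: a single blended per-million-token rate instead of separate input and output rates, made-up rate values, and the 5.5% OpenRouter surcharge mentioned earlier folded into the OR catalog rate. Function and field names are hypothetical.

```python
OPENROUTER_SURCHARGE = 0.055  # 5.5% on top of the model rate


def openrouter_rate(model_rate_per_mtok: float) -> float:
    """OR catalog rate with the surcharge baked in."""
    return model_rate_per_mtok * (1 + OPENROUTER_SURCHARGE)


def cost_usd(tokens: int, rate_per_mtok: float) -> float:
    return tokens / 1_000_000 * rate_per_mtok


def price_failover_row(tokens: int, baseline_rate: float, primary_rate: float) -> dict:
    """Price one failover-served request against both snapshots."""
    actual = cost_usd(tokens, openrouter_rate(primary_rate))  # what was billed
    would_have = cost_usd(tokens, primary_rate)               # primary snapshot
    baseline = cost_usd(tokens, baseline_rate)                # baseline anchor
    return {
        "actual_cost_usd": actual,
        "primary_would_have_cost_usd": would_have,
        "savings_usd": baseline - actual,
    }


# Hypothetical rates: baseline model at $10/Mtok, routed model at $2/Mtok.
row = price_failover_row(1_000_000, baseline_rate=10.0, primary_rate=2.0)
```

With these made-up rates the failover-served request costs slightly more than the direct call would have, yet the savings figure is still computed against the same baseline anchor.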

What we promise

M11 is opt-in and defaults to OFF. Quality lock 2026-05-20: OpenRouter routes through its own infrastructure, so a failover response is not bit-identical to a direct provider call. We refuse to flip the default to ON until 30+ days of telemetry validate per-pair quality preservation at ≥ 0.95 across the failover pairs we ship. The same per-stack auto-rollback that protects the cost mechanics will cover M11 once per-pair canary validation lands.

Every commitment on this page is enforceable in source. The worker version is pinned at the top of this document. The migration that added the columns is listed alongside it. The response headers are stable contracts — if we ever change their semantics, that change ships with a CHANGELOG entry and a deprecation window.

Architecture write-up: Cross-provider failover at the edge — what we built (and what we did not). Parent index: How it works.