What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via LLMLingua-2 where safe, auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera enter the Performance Fee.
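As a rough illustration of the sum described above, the sketch below adds up (counterfactual minus actual) over in-scope log rows. The row shape and field names are assumptions made for this example, not the actual Tessera log schema.

```typescript
// One proxy-log row for a request that was routed, cached, compressed, or batched.
// Field names are illustrative only; the real log schema is not documented here.
interface OptimizedRequestLog {
  workloadId: string;
  counterfactualCostUsd: number; // what the request would have cost without the optimization
  actualCostUsd: number;         // what was actually incurred
  inScope: boolean;              // false for workloads outside the measured scope
}

// Aggregate Ongoing Savings: sum of (counterfactual - actual) over in-scope rows.
function aggregateOngoingSavings(rows: OptimizedRequestLog[]): number {
  let total = 0;
  for (const row of rows) {
    if (!row.inScope) continue;
    total += row.counterfactualCostUsd - row.actualCostUsd;
  }
  return total;
}

const sampleRows: OptimizedRequestLog[] = [
  { workloadId: "chat", counterfactualCostUsd: 4, actualCostUsd: 1, inScope: true },
  { workloadId: "chat", counterfactualCostUsd: 2, actualCostUsd: 2, inScope: true }, // no saving on this request
  { workloadId: "legacy", counterfactualCostUsd: 9, actualCostUsd: 3, inScope: false },
];

// Only the two in-scope rows count: (4 - 1) + (2 - 2) = 3
const periodSavingsUsd = aggregateOngoingSavings(sampleRows);
```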

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer: we sit in the request path and actively route, cache, compress, and batch on every call. We bill on measured savings from our own proxy logs, not per SaaS seat. Your existing observability tool can stay, and it still receives downstream telemetry. Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Two tiers. Free Sandbox: 60 million tokens per month, $0 fee, observe-only optimizations (savings are measured but no fee accrues). Production: 20% of measured savings, billed against a prepaid balance (Stripe Checkout, $100 minimum top-up, or invoice on request). Zero savings means zero fee in that window. No floor, no retainer, no contract review for activation. If the balance reaches zero, the proxy automatically pauses optimizations and forwards requests as passthrough; no fees accrue while paused. Top up to resume. Large-volume Production Clients (above $500,000 per month in measured savings) may negotiate a custom MSA addendum. Quality preservation is guaranteed at 0.95, verified by the canary eval; a three-day breach triggers auto-disable plus a 10% fee credit applied to your balance.
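The Production fee arithmetic can be sketched as follows. This is a minimal illustration, assuming amounts are tracked in whole cents and that a fee larger than the remaining balance is capped at that balance; neither detail is stated above, and the function and field names are hypothetical.

```typescript
// Production tier: the fee is 20% of measured savings for the window.
const PERFORMANCE_FEE_RATE = 0.2;

interface PrepaidBalanceState {
  balanceCents: number;         // remaining prepaid balance
  feeChargedCents: number;      // fee actually deducted for this window
  optimizationsPaused: boolean; // true once the balance is exhausted
}

// Settle one billing window against the prepaid balance.
// Capping the fee at the remaining balance is an assumption of this sketch.
function settleWindow(balanceCents: number, measuredSavingsCents: number): PrepaidBalanceState {
  const feeDue = Math.round(Math.max(0, measuredSavingsCents) * PERFORMANCE_FEE_RATE);
  const feeCharged = Math.min(feeDue, balanceCents);
  const remaining = balanceCents - feeCharged;
  return {
    balanceCents: remaining,
    feeChargedCents: feeCharged,
    optimizationsPaused: remaining === 0,
  };
}

// $250 of measured savings against a $100 balance: $50 fee, $50 left, still running.
const healthyWindow = settleWindow(10000, 25000);
// Zero savings means zero fee.
const idleWindow = settleWindow(10000, 0);
// The same savings against a $10 balance exhausts it, so optimizations pause.
const exhaustedWindow = settleWindow(1000, 25000);
```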

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Performance Fee does not accrue on paused traffic. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Two primary shapes carry most Production signups: (1) a Series A-B AI-native SaaS CTO spending $20k-$200k per month on LLM APIs, with gross margin under pressure, who can change a base URL in 30 minutes and signs up within 72 hours; (2) a Series B-D scale-up adding AI features, spending $50k-$500k per month, where the AI Platform Lead owns the budget and a light security review takes 2-4 weeks. There is also a developer-facing Free Sandbox tier (60M tokens/mo, $0 fee) for solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route; the compliance gate blocks routing at the code level. Above $500k/mo in measured savings, Production Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality), and a breach permits Clients to terminate without penalty and withdraw their balance.

Mechanic · M1 · Non-mutating cost lever

Auto-route

Chained walks since 26 May 2026

When the customer's requested model has a cheaper sibling in the same provider's family AND the workload is opted in AND the daily canary has not flagged the swap as quality-degrading, the worker forwards to the cheaper model, and the customer sees an output scored by their own promptfoo eval at ≥ 0.95 against the baseline. No cross-provider lines: that is what M11 is for. No cross-capability lines (reasoning ↔ non-reasoning, multimodal ↔ text). Conservative pairs only, narrowed against published model-card benchmarks before they land in the table.

M1 is the most visible cost mechanic — sponsors see it on the audit chip strip every time a routed request lands. It is also the mechanic with the highest quality-risk surface, because a wrong swap propagates silently into every downstream eval the customer runs against the output. The page below documents the gating logic in source-traceable detail.

The ROUTE_TABLE — what swaps are eligible

Single static map keyed by Provider. Every entry is a tuple of { from, to, quality_preservation_estimate }. The estimate is conservative — derived from published MMLU / HumanEval / MT-Bench benchmarks and rounded down. We never publish an estimate above 0.97 because no provider gives us the golden set the customer would actually evaluate against.

Representative pairs (subset, not exhaustive):

gpt-5 → gpt-5-mini (0.94 estimate); gpt-5-mini → gpt-5-nano (0.88).
claude-opus-4-7 → claude-sonnet-4-6 (0.91); claude-sonnet-4-6 → claude-haiku-4-5 (0.87).
gemini-1.5-pro → gemini-1.5-flash (0.85; Gemini Flash benchmarks as a capable mainline model, but Pro's advantage on longer reasoning shows up on specific subsets, so we round down).
OpenAI-compatible providers (DeepSeek, Mistral, Groq, Together, Fireworks, OpenRouter, Perplexity, Cerebras, Cohere) each carry their own intra-family pairs verified against the upstream provider's own pricing + capability documentation. Providers with no intra-tier price spread carry empty arrays (e.g. deepseek-chat and deepseek-reasoner share the same per-token rate, so no swap produces savings).
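For orientation, the table shape described above might look like the sketch below. It only contains the representative pairs quoted on this page; the provider key strings and the lookup helper are assumptions for illustration and are not copied from route.ts.

```typescript
// One eligible swap, using the tuple fields named in the text.
interface RoutePair {
  from: string;
  to: string;
  quality_preservation_estimate: number;
}

// Static map keyed by provider. The key strings are hypothetical.
const ROUTE_TABLE_SKETCH: Record<string, RoutePair[]> = {
  openai: [
    { from: "gpt-5", to: "gpt-5-mini", quality_preservation_estimate: 0.94 },
    { from: "gpt-5-mini", to: "gpt-5-nano", quality_preservation_estimate: 0.88 },
  ],
  anthropic: [
    { from: "claude-opus-4-7", to: "claude-sonnet-4-6", quality_preservation_estimate: 0.91 },
    { from: "claude-sonnet-4-6", to: "claude-haiku-4-5", quality_preservation_estimate: 0.87 },
  ],
  google: [
    { from: "gemini-1.5-pro", to: "gemini-1.5-flash", quality_preservation_estimate: 0.85 },
  ],
  deepseek: [], // no intra-tier price spread, so no swap produces savings
};

// Return the first row whose `from` matches the requested model, if any.
function findFirstHop(provider: string, requestedModel: string): RoutePair | undefined {
  const rows = ROUTE_TABLE_SKETCH[provider] || [];
  for (const row of rows) {
    if (row.from === requestedModel) return row;
  }
  return undefined;
}
```

Under these assumptions, findFirstHop("openai", "gpt-5") yields the gpt-5-mini row, while any lookup against the empty deepseek array yields undefined and the request is left unrouted.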

OpenRouter is a special case: routing happens within the same underlying-family namespace (openai/gpt-4o → openai/gpt-4o-mini) because, even with OR's 5.5% multiplicative surcharge, swapping a $2.50/M model for a $0.15/M one still produces real customer savings. Cross-vendor pairs through OR are excluded because they cross capability, tokenizer, and behavioural lines that the canary alone cannot guarantee.
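The surcharge arithmetic is easy to check by hand. The sketch below assumes the 5.5% surcharge multiplies both list prices equally and uses the two per-million-token prices quoted above; the function names are illustrative.

```typescript
// 5.5% multiplicative surcharge, as quoted in the text.
const OPENROUTER_SURCHARGE = 0.055;

// Effective price per million tokens once the surcharge is applied.
function effectivePricePerM(listPricePerM: number): number {
  return listPricePerM * (1 + OPENROUTER_SURCHARGE);
}

// Savings per million tokens when a request moves to the cheaper model.
function savingsPerM(fromListPricePerM: number, toListPricePerM: number): number {
  return effectivePricePerM(fromListPricePerM) - effectivePricePerM(toListPricePerM);
}

// (2.50 - 0.15) * 1.055 is roughly 2.48 USD saved per million tokens.
const openRouterSavingsPerM = savingsPerM(2.5, 0.15);
```

Because the surcharge scales both sides of the swap, it enlarges the absolute saving slightly rather than erasing it.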

Single-hop vs chained walks

Default: single-hop. The worker looks up the first row matching the customer's requested model and forwards to that row's to model. Done.

Chained walks (shipped 26 May 2026, opt-in via worker env WORKER_M1_CHAINED_ENABLED): the worker follows the routing chain past the first hop — gpt-5 → gpt-5-mini → gpt-5-nano, opus → sonnet → haiku — and computes the cumulative quality_preservation as the product across hops. The walk stops at the first hop whose cumulative product would drop below the floor (per-workload threshold); the previous hop's destination is returned. Single-hop is the backward-compatible default, gated separately so existing customers see no behaviour change until they opt in.
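A minimal sketch of the walk, assuming the chain rows come from the representative OpenAI pairs on this page and that the floor is passed in per workload. The function and type names are illustrative and do not reflect the worker's actual code.

```typescript
interface ChainHop {
  from: string;
  to: string;
  quality_preservation_estimate: number;
}

const OPENAI_CHAIN: ChainHop[] = [
  { from: "gpt-5", to: "gpt-5-mini", quality_preservation_estimate: 0.94 },
  { from: "gpt-5-mini", to: "gpt-5-nano", quality_preservation_estimate: 0.88 },
];

interface WalkResult {
  model: string;      // model the request would be forwarded to
  cumulative: number; // product of the estimates for the hops taken
  hops: number;       // number of hops taken (0 means no route)
}

function firstRowFrom(rows: ChainHop[], model: string): ChainHop | undefined {
  for (const row of rows) {
    if (row.from === model) return row;
  }
  return undefined;
}

// Follow the chain while the cumulative product stays at or above the floor.
// With chained = false this degenerates to the single-hop default.
function walkChain(rows: ChainHop[], requested: string, floor: number, chained: boolean): WalkResult {
  let model = requested;
  let cumulative = 1;
  let hops = 0;
  const maxHops = chained ? rows.length : 1; // also guards against cycles
  while (hops < maxHops) {
    const next = firstRowFrom(rows, model);
    if (!next) break;
    const candidate = cumulative * next.quality_preservation_estimate;
    if (candidate < floor) break; // stop; keep the previous hop's destination
    model = next.to;
    cumulative = candidate;
    hops += 1;
  }
  return { model, cumulative, hops };
}

// Floor 0.85: the second hop would give 0.94 * 0.88 = 0.8272, so the walk stops at gpt-5-mini.
const strictWalk = walkChain(OPENAI_CHAIN, "gpt-5", 0.85, true);
// Floor 0.80: both hops clear the floor, so the walk reaches gpt-5-nano.
const looseWalk = walkChain(OPENAI_CHAIN, "gpt-5", 0.8, true);
// Chaining disabled: at most one hop regardless of the floor.
const singleHop = walkChain(OPENAI_CHAIN, "gpt-5", 0.8, false);
```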

Quality floors

The per-hop floor is the safety baseline: WORKER_M1_QUALITY_FLOOR (recommended 0.85) skips swaps whose estimate is below that number on tier_internal workloads (the default tier). Below-floor pairs in the table stay there for documentation transparency but never fire under the floor gate. tier_consumer bypasses the floor (caller opted into looser semantics); tier_regulated short-circuits routing entirely (see below).
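The tier behaviour described above can be condensed into a small predicate. This is a sketch under the stated semantics, with the floor passed in as a parameter; the function name is hypothetical.

```typescript
type ComplianceTier = "tier_internal" | "tier_consumer" | "tier_regulated";

// Decide whether a single swap may fire for a given tier and floor.
function passesFloorGate(tier: ComplianceTier, estimate: number, floor: number): boolean {
  if (tier === "tier_regulated") return false; // routing is short-circuited entirely
  if (tier === "tier_consumer") return true;   // caller opted into looser semantics
  return estimate >= floor;                    // tier_internal: skip swaps below the floor
}

const RECOMMENDED_FLOOR = 0.85;

// An estimate of exactly 0.85 is not below the floor, so it still fires.
const atFloor = passesFloorGate("tier_internal", 0.85, RECOMMENDED_FLOOR);
// A hypothetical 0.80 pair stays in the table but never fires on tier_internal.
const belowFloor = passesFloorGate("tier_internal", 0.8, RECOMMENDED_FLOOR);
```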

When chained walks are enabled, the floor gates the cumulative product, not just the first hop. The walk stops as soon as the next candidate hop would push the cumulative product below the floor. A single-hop swap below the floor never sneaks through under chained mode either, because the first-hop check still fires.

Hard bypasses

M1 short-circuits to no-route on three independent gates, checked in order:

compliance_tier === 'tier_regulated'. Regulated traffic never routes: cache + measurement still apply, but no upstream model swap. Enforced at the route-detector level so no dashboard config can accidentally re-enable it.
client_auto_route_enabled === false. The sponsor explicitly turned routing off for this workload.
Provider circuit breaker is non-healthy. When the per-provider rolling 5xx rate exceeds threshold, the breaker flips to OPEN (or HALF_OPEN during cooldown probe). M1 refuses to compound risk by adding a model swap on top of an already-flapping upstream.

The composition cap also takes M1 off the table whenever a content-mutating mechanic (M3 compress, M7 context prune, M8 structured output) is about to fire on the same request. Cost reduction stacked on model reduction multiplies quality risk — we elect the mutation site and leave M1 disabled for that request. The single-pick budget is published on the parent /how-it-works page.
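The gate order, followed by the composition cap, can be sketched as a single function that returns the first reason M1 must not fire. The context shape, the reason labels, and the use of CLOSED as the healthy breaker state are assumptions of this example; only OPEN and HALF_OPEN are named above.

```typescript
// CLOSED as the healthy state follows common circuit-breaker naming (assumption).
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface RouteGateContext {
  compliance_tier: string;
  client_auto_route_enabled: boolean;
  breakerState: BreakerState;
  pendingMechanics: string[]; // mechanics about to fire on this request, e.g. ["m3"]
}

type RouteSkipReason =
  | "regulated"
  | "client_disabled"
  | "breaker_unhealthy"
  | "composition_cap"
  | null;

// Content-mutating mechanics named in the text: compress, context prune, structured output.
const CONTENT_MUTATING_MECHANICS = ["m3", "m7", "m8"];

// Return the first reason M1 is skipped, or null when routing may proceed.
function routeSkipReason(ctx: RouteGateContext): RouteSkipReason {
  if (ctx.compliance_tier === "tier_regulated") return "regulated";
  if (ctx.client_auto_route_enabled === false) return "client_disabled";
  if (ctx.breakerState !== "CLOSED") return "breaker_unhealthy";
  for (const mechanic of ctx.pendingMechanics) {
    if (CONTENT_MUTATING_MECHANICS.indexOf(mechanic) !== -1) return "composition_cap";
  }
  return null;
}

const eligibleRequest: RouteGateContext = {
  compliance_tier: "tier_internal",
  client_auto_route_enabled: true,
  breakerState: "CLOSED",
  pendingMechanics: [],
};
```

Checking the gates in a fixed order means a regulated workload reports the compliance reason even when its breaker also happens to be open.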

The quality protection layer

The static estimates in the table are the prior. The canary is the evidence. Once a workload is opted into M1, the daily eval cron samples 10% of routed traffic and re-runs the customer's promptfoo eval on the routed output against the baseline. The per-mechanic-stack mean score is tracked over a rolling 3-day window.

Two auto-rollback layers protect the workload:

Workload-wide auto-disable. If the mean score drops below 0.90 (hard floor) for three consecutive days, the cron sets workloads.auto_route_disabled = true and KV is refreshed. M1 stops firing until an operator re-enables it after root-cause review. Sponsor receives an anomaly row on /portal/anomalies with the breach detail.
Per-stack auto-disable. Composition-specific regressions (e.g. m1+m7 tanks while m1-alone and m7-alone stay healthy) flip a workloads.disabled_stacks entry rather than the whole workload. The worker checks subset-match before applying mechanics: if a disabled entry is a subset of the prospective stack on this request, the request falls back to pure passthrough. This preserves the other mechanic compositions that are still profitable.
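Both rollback checks reduce to short predicates. The sketch below assumes daily mean scores arrive as an array ordered oldest to newest and that stacks are represented as arrays of mechanic ids; those representations and the function names are illustrative.

```typescript
const HARD_FLOOR = 0.9;

// True when the three most recent daily means are all below the hard floor.
function breachedThreeConsecutiveDays(dailyMeans: number[]): boolean {
  if (dailyMeans.length < 3) return false;
  const lastThree = dailyMeans.slice(-3);
  return lastThree.every(function (score) { return score < HARD_FLOOR; });
}

// True when some disabled stack is a subset of the prospective stack,
// in which case the request falls back to pure passthrough.
function hitsDisabledStack(prospective: string[], disabledStacks: string[][]): boolean {
  return disabledStacks.some(function (disabled) {
    return disabled.length > 0 &&
      disabled.every(function (mechanic) { return prospective.indexOf(mechanic) !== -1; });
  });
}

const disabledStacks = [["m1", "m7"]];

// m1 alone is unaffected by the disabled m1+m7 entry.
const m1Alone = hitsDisabledStack(["m1"], disabledStacks);
// A larger stack that contains both m1 and m7 is blocked.
const m1m3m7 = hitsDisabledStack(["m1", "m3", "m7"], disabledStacks);
```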

Per-stack auto-disable comes with prorated SLA credit — a 10% credit on the fees attributable to the breached stack for the three breach-detection days. The math sums savings across rows tagged with the disabled stack and applies the credit to the customer's next monthly reading.
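One way to read the credit math is sketched below. It assumes the attributable fee is the 20% Production rate applied to the savings on rows tagged with the disabled stack, that amounts are tracked in whole cents, and that the caller has already filtered rows to the three breach-detection days. The row shape and function name are hypothetical.

```typescript
interface LedgerSavingsRow {
  mechanics_stack: string; // canonical stack id, e.g. "m1+m7"
  savingsCents: number;    // measured savings for this row
}

const PRODUCTION_FEE_RATE = 0.2; // 20% of measured savings
const SLA_CREDIT_RATE = 0.1;     // 10% credit on the attributable fees

// Credit owed for one disabled stack over the breach-detection window.
function stackBreachCreditCents(breachWindowRows: LedgerSavingsRow[], disabledStack: string): number {
  let stackSavings = 0;
  for (const row of breachWindowRows) {
    if (row.mechanics_stack === disabledStack) stackSavings += row.savingsCents;
  }
  return Math.round(stackSavings * PRODUCTION_FEE_RATE * SLA_CREDIT_RATE);
}

const breachWindowRows: LedgerSavingsRow[] = [
  { mechanics_stack: "m1+m7", savingsCents: 50000 },
  { mechanics_stack: "m1+m7", savingsCents: 25000 },
  { mechanics_stack: "m1", savingsCents: 10000 }, // healthy stack, not credited
];

// $750 of savings on the breached stack: fee $150, credit $15 (1500 cents).
const creditCents = stackBreachCreditCents(breachWindowRows, "m1+m7");
```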

Verification surfaces

M1 is observable in four places:

Response headers on routed requests: x-tessera-auto-routed: gpt-5->gpt-5-mini.
/portal/audit chip strip. Routed rows carry m1 in the canonical mechanics_stack. The row also shows requested_model → actual_model so the swap is visible without click-through.
/portal/anomalies. Auto-rollback events land here with the breach detail, the disabled stack id (or workload-wide flag), the rolling mean score, and the credit issued.
Stack-aware canary data on /optimize/quality. Operator-facing breakdown of mean score per workload per mechanic-stack — the same data the auto-rollback cron reads.
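A client that wants to log swaps on its own side could read the response header shown above. The parser below is a small sketch that assumes the value always has the form requested->actual, as in the example; the function name is hypothetical.

```typescript
interface AutoRouteSwap {
  requestedModel: string;
  actualModel: string;
}

// Parse a value such as "gpt-5->gpt-5-mini"; returns null when the header is absent or malformed.
function parseAutoRoutedHeader(value: string | null | undefined): AutoRouteSwap | null {
  if (!value) return null;
  const arrowIndex = value.indexOf("->");
  if (arrowIndex <= 0) return null;
  const requestedModel = value.slice(0, arrowIndex).trim();
  const actualModel = value.slice(arrowIndex + 2).trim();
  if (!requestedModel || !actualModel) return null;
  return { requestedModel: requestedModel, actualModel: actualModel };
}

const parsedSwap = parseAutoRoutedHeader("gpt-5->gpt-5-mini");
```

With the Fetch API the raw value would typically come from response.headers.get("x-tessera-auto-routed"), which returns null on requests that were not routed.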

What we don’t do

Five deliberate exclusions, called out so they cannot be surprises:

No cross-provider swaps. Those live in M11 cross-provider failover and only fire on upstream failure, not on cost-cut grounds.
No cross-capability-tier swaps (no gpt-5 → o3-mini — reasoning ↔ non-reasoning is a different request class).
No multimodal-to-text swaps (no pixtral-large → mistral-small — Mistral Small is text-only).
No experimental / unstable model targets (Gemini 2.0 Flash Exp is published but excluded — the rate-limit + SLA ceiling makes it unsuitable for prod routing).
No streaming-only routing decisions — M1 fires on streaming and non-streaming requests alike; the streaming decision is orthogonal.

What we promise

M1 routes only when the static estimate clears the floor AND the canary has not flagged the swap. The static table is auditable (it lives in workers/proxy/src/detectors/route.ts — every entry has a comment with provenance). The canary cron and the per-stack auto-rollback math are auditable (they live in dashboard/lib/canary-eval.ts and app/api/cron/run-canary-eval/route.ts). The audit ledger row for every routed request carries the pricing snapshot we billed against — the savings math is re-derivable from the customer's end.

Parent index: How it works. Companion mechanic: M11 cross-provider failover.