How it works
Updated 21 May 2026
Tessera is a substrate proxy. Your application points at our endpoint, we forward to OpenAI / Anthropic / Google / xAI on your behalf, and on the way we apply a small set of optimization mechanics measured against a jointly-ratified baseline. We bill 20% of the measured savings, prepaid. The mechanics fire conservatively, gated by your eval, with a public composition cap and a hard quality floor that auto-disables anything drifting beneath it.
Every claim on this page is enforced in code, observable on /portal/audit, and gated by a daily quality canary — not by our promise.
The eight mechanics
Each mechanic is opt-in per workload, observable per request, and bypassed when its eligibility heuristic says the savings won't cover the risk. False negatives (we miss a borderline win) are cheap. False positives (we mutate a request and the response degrades) are expensive because they damage trust. We bias toward pass-through.
Auto-route
Non-mutating. When eval confirms a cheaper model produces equivalent output, we forward to the cheaper model.
Per-workload model pairing. The proxy queries the canary first; the routed call only goes live after the eval suite passes a 0.95 quality threshold against the original. Routing is paused for the workload if the daily canary average drops below SLA.
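The gate described above can be sketched as follows. This is a minimal illustration, not the production code: the function name `should_route`, its parameters, and the assumption that the SLA floor equals the 0.95 eval threshold are hypothetical.

```python
QUALITY_THRESHOLD = 0.95  # eval threshold quoted on this page


def should_route(eval_score: float, canary_daily_avg: float,
                 sla_floor: float = QUALITY_THRESHOLD) -> bool:
    """Return True only if the cheaper model may serve this workload.

    eval_score: eval-suite score of the cheaper model against the original.
    canary_daily_avg: today's average canary score for the workload.
    """
    if canary_daily_avg < sla_floor:
        return False  # routing is paused for the workload
    return eval_score >= QUALITY_THRESHOLD
```

Checking the canary first means a degraded workload is never routed, even when an individual eval run happens to pass.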
Exact-match cache
Non-mutating. Identical request → cached response served from KV. We never call the provider.
Cache key is a hash of the canonical request body (model, messages, temperature, tools, response_format). Cache hits return upstream of any mutation site, so they never compose with a content-changing mechanic.
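One way to build such a key is sketched below. The field list comes from the paragraph above; the choice of SHA-256, the JSON canonicalization (sorted keys, compact separators), and the name `cache_key` are assumptions made for illustration.

```python
import hashlib
import json

# Fields the page lists as part of the canonical request body.
KEY_FIELDS = ("model", "messages", "temperature", "tools", "response_format")


def cache_key(request_body: dict) -> str:
    """Hash only the canonical fields, serialized deterministically."""
    canonical = {field: request_body.get(field) for field in KEY_FIELDS}
    encoded = json.dumps(
        canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Sorting keys makes the hash independent of the order in which a client serializes its JSON, so two byte-different but semantically identical bodies map to the same entry.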
Compress
Mutating. Heuristic whitespace + structural normalization on long prompts. Preserves code, JSON, and tool messages verbatim.
Today: deterministic whitespace + structural pass. Roadmap: server-side LLMLingua-2 template substitution per workload (ADR-0003 documents the path). v0.1 yields 2–10% savings on poorly formatted prompts and never touches semantics.
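A deterministic whitespace pass of this kind might look like the sketch below. It is a simplification under stated assumptions: code is detected only by Markdown-style fences, JSON only when the whole message parses as JSON, and the helper names are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker


def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def compress_text(text: str) -> str:
    """Collapse redundant whitespace outside fenced code blocks."""
    if _is_json(text):
        return text  # JSON payloads pass through verbatim
    out, in_code, blank_run = [], False, 0
    for line in text.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_code = not in_code
            out.append(line)
            blank_run = 0
            continue
        if in_code:
            out.append(line)  # code is preserved byte for byte
            continue
        line = re.sub(r"[ \t]+", " ", line).strip()
        blank_run = blank_run + 1 if not line else 0
        if blank_run <= 1:  # keep at most one consecutive blank line
            out.append(line)
    return "\n".join(out)


def compress_message(message: dict) -> dict:
    """Tool messages and non-string content are returned unchanged."""
    content = message.get("content")
    if message.get("role") == "tool" or not isinstance(content, str):
        return message
    return {**message, "content": compress_text(content)}
```

Note that this sketch also strips leading indentation outside fences, which a real pass would need to treat more carefully for indentation-sensitive content.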
Batch routing
Non-mutating. Workloads tagged async-tolerant route to the provider's batch endpoint. Provider gives a flat 50% discount with a 24h SLA.
Opt-in per workload. We never auto-promote a real-time workload into batch — latency contract is sponsor-set. Status visible on /portal/workloads.
Semantic cache
Non-mutating. Cosine-similar requests (≥ 0.95) served from cache. Catches paraphrased queries that exact-match misses.
Cloudflare Vectorize backs the embedding store. The cache canary compares the served-from-cache response against the live provider on 5% of hits, and the workload falls back to the provider when quality dips on near-miss matches.
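The similarity test itself can be sketched in a few lines. This brute-force version is for illustration only; a vector index would do the nearest-neighbour search in practice, and the names `cosine_similarity` and `lookup` are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.95  # cosine threshold quoted on this page


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_a * norm_b)


def lookup(query_embedding, cache_entries):
    """cache_entries: list of (embedding, response) pairs.

    Returns the response of the most similar entry, or None when no entry
    reaches the threshold (a cache miss).
    """
    best_score, best_response = 0.0, None
    for embedding, response in cache_entries:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None
```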
Provider prompt cache
Non-mutating. Anthropic, OpenAI, and Google ship native prompt caching. We auto-inject the cache markers on stable system prompts.
Anthropic gives 90% off cached prefix tokens, Google 75%, OpenAI 50%. The only change to the request is the provider-specific cache marker; the model output contract is identical to the un-marked request.
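To see what those discounts mean for a single request, the arithmetic can be sketched as below. The discount table uses the figures quoted above; the function name, the per-million-token price parameter, and the decision to ignore any cache-write surcharge a provider may apply are assumptions.

```python
# Discounts on cached prefix tokens, as quoted on this page.
CACHED_DISCOUNT = {"anthropic": 0.90, "google": 0.75, "openai": 0.50}


def input_cost(provider, prefix_tokens, fresh_tokens, price_per_mtok):
    """Input cost in dollars when the prefix is served from the provider cache.

    price_per_mtok: list price per million input tokens (hypothetical value
    supplied by the caller).
    """
    cached_rate = price_per_mtok * (1.0 - CACHED_DISCOUNT[provider])
    return (prefix_tokens * cached_rate + fresh_tokens * price_per_mtok) / 1_000_000
```

For example, with a hypothetical price of $3 per million tokens, an 8,000-token cached prefix plus 2,000 fresh tokens on Anthropic costs about $0.0084 instead of $0.03 uncached.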
Context pruning
Mutating. Long conversations: trim to last N turns + system + optional summary. RAG: rerank chunks, keep top-K by relevance.
Triggered when messages > 12 OR body > 32KB. RAG re-rank ships server-side via FlashRank (ADR-0002). Per-workload sliding aggressiveness, eval-gated.
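The trigger and the conversation trim can be sketched as follows. The thresholds come from the paragraph above; measuring body size as UTF-8 JSON bytes, treating 1KB as 1024 bytes, counting messages rather than turns, and the default `keep_last` value are assumptions. The optional summary and the RAG rerank are omitted.

```python
import json

MAX_MESSAGES = 12           # trigger: more than 12 messages
MAX_BODY_BYTES = 32 * 1024  # trigger: body larger than 32KB


def needs_pruning(body: dict) -> bool:
    size = len(json.dumps(body).encode("utf-8"))
    return len(body.get("messages", [])) > MAX_MESSAGES or size > MAX_BODY_BYTES


def prune(body: dict, keep_last: int = 6) -> dict:
    """Keep system messages plus the last keep_last non-system messages."""
    if not needs_pruning(body):
        return body  # pass-through: nothing is copied or changed
    messages = body.get("messages", [])
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    tail = rest[-keep_last:] if keep_last > 0 else []
    return {**body, "messages": system + tail}
```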
Structured output
Mutating. When a workload registers an expected JSON schema, we force the provider to obey it: schema mode on OpenAI, tool-use on Anthropic, responseSchema on Gemini.
Non-conformant responses log to anomaly inbox at Tier 1. Workloads without a registered schema fall back to auto-mode JSON when the body opts in.
The composition cap
At most one content-mutating mechanic per request. Mutating mechanics (M3 compress, M7 context prune, M8 structured output) can compound quality risk well beyond the sum of their individual losses when stacked. If M7 trims 2% of quality and M8 costs another 3%, the request that survives both isn't losing 5%; it can lose 15–25%, because the prompt has been weakened twice.
The proxy enforces this as a hard mutex with priority M7 > M3 > M8. When two mutating candidates qualify on the same request, the higher-priority one fires and the lower one short-circuits. Auto-route (M1) is also locked off whenever any mutating candidate is present — we never compound model reduction with content reduction.
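The mutex described above can be sketched as a small resolver. The priority order and the M-labels come from this page; the function name and the set-based representation of candidates are hypothetical.

```python
MUTATING_PRIORITY = ["M7", "M3", "M8"]  # highest priority first, per this page
NON_MUTATING = {"M1", "M2", "M4", "M5", "M6"}


def resolve_stack(candidates):
    """Return the set of mechanics allowed to fire on one request."""
    candidates = set(candidates)
    mutating = [m for m in MUTATING_PRIORITY if m in candidates]
    allowed = candidates & NON_MUTATING
    if mutating:
        allowed.add(mutating[0])  # only the highest-priority mutator fires
        allowed.discard("M1")     # auto-route is locked off alongside any mutator
    return allowed
```

For instance, a request that qualifies for M1, M3, M6, and M7 resolves to M6 and M7 only: M3 loses the mutex to M7, and M1 is locked off.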
Non-mutating mechanics (M1 auto-route, M2 + M5 cache hit, M4 batch routing, M6 provider prompt cache) carry zero quality risk by contract — they either skip the provider entirely or hand the decision to the provider's own cache layer. They stack freely.
The quality SLA
We commit to a measured quality floor of 0.95 on the daily promptfoo canary. That's a per-workload, per-mechanic-stack average score against a jointly-ratified golden set. If any (workload, stack) combination drifts below 0.95 for three consecutive days, the offending mechanic auto-disables for that workload and we issue a service credit on the next invoice. No paperwork.
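The three-day rule can be expressed as a short check. This is a sketch under the assumption that daily averages arrive as a list ordered oldest first; the function name is hypothetical.

```python
QUALITY_FLOOR = 0.95
CONSECUTIVE_DAYS = 3


def should_auto_disable(daily_scores):
    """daily_scores: canary averages for one (workload, stack), oldest first."""
    recent = daily_scores[-CONSECUTIVE_DAYS:]
    return len(recent) == CONSECUTIVE_DAYS and all(
        score < QUALITY_FLOOR for score in recent
    )
```

A single day back above the floor resets the count, so `[0.94, 0.96, 0.94]` does not trigger a disable.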
The canary is stack-aware: the same request body fires through the full mechanic stack AND a pristine pass-through, and the eval compares both against the golden answer. That's how we detect "the route was fine but the compress was lossy" cases a single-mechanic canary would miss.
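The value of running both paths is that a low score can be attributed. A minimal sketch, assuming both responses have already been scored against the golden answer by some eval function; the verdict labels and the function name are hypothetical.

```python
def canary_verdict(stack_score, passthrough_score, floor=0.95):
    """Both scores are eval results against the same golden answer."""
    if stack_score >= floor:
        return "ok"
    if passthrough_score >= floor:
        return "mechanic_regression"  # pass-through is fine, so the stack lowered quality
    return "baseline_regression"      # pass-through is also below the floor
```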
Free Sandbox clients get observability-only Tier 0 detection: we surface the anomaly, you decide. Production clients can opt in to Tier 1 (throttle on detection) or Tier 2 (auto-rollback the offending mechanic) in /portal/settings.
Auto-rollback
Tier 2 anomaly response is our deepest trust commitment. When the canary detects a quality regression on a workload that has opted in, the proxy disables the offending mechanic at the next request without operator intervention. The sponsor sees a row in /portal/anomalies explaining what dropped, by how much, and which mechanic got rolled back. Re-enablement is sponsor-driven, not time-based: we don't auto-flip the flag back hoping it'll behave.
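The three response tiers can be sketched as a single dispatch. The tier meanings follow the text above; the enum, the action labels, and the function name are hypothetical, and the sketch deliberately has no timer that re-enables a mechanic.

```python
from enum import IntEnum


class Tier(IntEnum):
    OBSERVE = 0   # surface the anomaly only
    THROTTLE = 1  # throttle on detection
    ROLLBACK = 2  # auto-disable the offending mechanic


def respond_to_anomaly(tier, enabled, mechanic):
    """Return (actions, enabled_mechanics) after an anomaly on `mechanic`."""
    actions = ["log_anomaly"]  # every tier surfaces the anomaly
    enabled = set(enabled)
    if tier == Tier.THROTTLE:
        actions.append("throttle")
    elif tier == Tier.ROLLBACK:
        enabled.discard(mechanic)
        actions.append("rollback")
    return actions, enabled
```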
The audit surface
Every request your application sends through Tessera produces a row on /portal/audit with the canonical mechanics_stack and the savings delta against the baseline. Mutating mechanics render in warning-amber chips so the composition cap is visible at a glance. If the page ever shows two mutating mechanics on the same row, it's a bug: report it and we issue a service credit. We surface composition-cap violations to you on the page itself; the honesty has to be load-bearing or it isn't honesty.
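If you export audit rows, the same check is easy to run yourself. A minimal sketch, assuming each row is a dict whose `mechanics_stack` field is a list of M-labels; the row shape and the function name are assumptions, not a documented export format.

```python
MUTATING = {"M3", "M7", "M8"}


def cap_violations(audit_rows):
    """Return rows whose mechanics_stack holds more than one mutating mechanic."""
    return [
        row for row in audit_rows
        if len(MUTATING & set(row["mechanics_stack"])) > 1
    ]
```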
What we don't do
We don't replace your prompts. We don't fine-tune your models. We don't add inference-time tooling beyond the eight mechanics on this page. We don't aggregate your traffic with other customers' (committed-use pooling is on the roadmap and will get its own separate disclosure when it lands).
We don't ship destructive mechanics. There is no "creative compression" mode that's aggressive by default. There is no token-level random substitution. There is no "model swap A/B test" without an eval gate. If a mechanic can't pass the 0.95 canary on your golden set, it doesn't fire on your traffic.
We earn 20% of measured savings. Bigger savings = bigger fee, which is a built-in pull toward aggressiveness. This page exists as the brake against that pull. Every commitment here is enforceable in code and visible to you in the dashboard.
Questions
Engagement structure and pricing are on the pricing page. Data flow, subprocessors, and retention are covered in the security posture and DPA. For the specific eval methodology and the golden-set construction protocol, talk to us; that's the first thing we set up in onboarding.