What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace and structural pass (LLMLingua-2 template substitution is on the roadmap), and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
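The sum described above can be sketched in a few lines. This is a minimal illustration, not Tessera's implementation: the row shape and every field name here are assumptions made for the example.

```typescript
// Hypothetical per-request log row; field names are illustrative only.
interface SavingsRow {
  workloadId: string;
  inScope: boolean;          // false for workloads excluded from the report
  counterfactualUsd: number; // what the request would have cost unoptimized
  actualUsd: number;         // what the request actually cost
}

// Aggregate savings = sum of (counterfactual - actual) over in-scope rows.
function aggregateSavings(rows: SavingsRow[]): number {
  return rows
    .filter((row) => row.inScope)
    .reduce((sum, row) => sum + (row.counterfactualUsd - row.actualUsd), 0);
}
```

Because each row carries its own counterfactual, a provider price change or a shrinking workload changes both terms together and does not show up as savings.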

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not a per-seat SaaS — and savings are measured directly from our own proxy logs. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Who is Tessera for?

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. There is also a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.


2026-05-21 · Quality engineering

Per-stack quality canary

Cross-mechanic interactions hide quality regressions that per-mechanic monitoring misses. Tessera scores each (workload × mechanic-stack) tuple independently — and when one breaches, only that exact stack gets disabled. The other 30 mechanic combinations on the same workload keep saving.

The problem with per-mechanic monitoring

The intuitive way to monitor an LLM optimization stack is per-mechanic: watch m1 (auto-route) quality score, watch m3 (compress) quality score, watch m7 (context prune) quality score. If any drops below SLA, disable that mechanic.

That misses the entire interaction class. m1 alone might score 0.96 on your eval set. m7 alone might score 0.96. But m1+m7 together (routing to a smaller model AND pruning the conversation) can produce 0.88 because the smaller model has less headroom to recover from the shortened context. Neither per-mechanic counter ever flags it. Both look healthy. The customer's output silently degrades on the 18% of traffic where both fire.

The fix is to monitor the COMPOSITION, not the components. Score each unique mechanic combination separately, breach on combinations, disable at the combination granularity.

What gets stored per request

Every request through the worker emits an optimize_savings row with a canonical sorted mechanics_stack column — a TEXT[] containing the mechanic ids that actually fired. Examples:

mechanics_stack = []                  -- pristine passthrough, no mechanic
mechanics_stack = ['m1']              -- auto-routed only
mechanics_stack = ['m1', 'm6']        -- auto-route + Anthropic prompt cache
mechanics_stack = ['m1', 'm3', 'm7']  -- route + compress + prune
mechanics_stack = ['m2']              -- exact cache hit, no upstream call

The stack is built ONCE per request, after every mechanic has had its chance to fire. Sorted alphabetically so m1+m7 is the canonical id whether m1 fires first or m7 fires first. The sort lets us key on the string 'm1+m7' downstream — comparable across requests, dedupable in SQL.
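A minimal sketch of the canonical key, assuming the fired mechanic ids arrive as plain strings. The function name is hypothetical, the de-duplication step is our assumption, and the `_none` key for the empty stack follows the label used in the aggregation output later in this post.

```typescript
// Build the canonical stack key from the mechanic ids that fired on a request.
function canonicalStackKey(fired: Iterable<string>): string {
  // De-duplicate (assumption), then sort. The default sort compares strings
  // lexicographically, so "m10" would order before "m2".
  const unique = Array.from(new Set(fired)).sort();
  return unique.length === 0 ? "_none" : unique.join("+");
}

const keyA = canonicalStackKey(["m7", "m1"]); // "m1+m7"
const keyB = canonicalStackKey(["m1", "m7"]); // "m1+m7", same key in either firing order
```

Keying on one string means grouping and de-duplicating in SQL needs no array comparison.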

The canary loop

On every auto-routed request, the worker rolls a per-request die against client.quality_canary_sample_rate (default 5%). On a hit, the original (un-routed, un-compressed, un-pruned) request fires alongside the optimized one. Both responses come back. Both get stored in canary_runs alongside the request id and the mechanics_stack the optimized path used.
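The sampling decision itself is a single comparison. The sketch below is illustrative only; the function name and the injectable random source are assumptions that make the roll testable, not a description of the worker's code.

```typescript
// A random source returning a float in [0, 1), like Math.random.
type Rng = () => number;

// Decide whether this request also fires the un-optimized original.
function shouldRunCanary(sampleRate: number, rng: Rng = Math.random): boolean {
  if (!(sampleRate >= 0 && sampleRate <= 1)) {
    throw new RangeError("sampleRate must be between 0 and 1");
  }
  // Strict less-than: a rate of 0 never samples, a rate of 1 always samples.
  return rng() < sampleRate;
}

// With the default 5% rate, roughly 1 request in 20 fires both paths.
const sampled = shouldRunCanary(0.05);
```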

A daily cron in the dashboard pulls the last 24h of canary_runs and scores each pair via promptfoo runAssertions() against the workload's golden set (LLM-as-judge graders with a deterministic floor, semantic similarity, your custom assertions). Each row gets a 0–1 score.

The aggregation is the load-bearing step. Instead of one mean per workload-day, we compute one mean per (workload, stack) tuple:

workload-A · 2026-05-21
  stack=_none      samples=21   mean_score=0.99
  stack=m1         samples=58   mean_score=0.96
  stack=m1+m6      samples=44   mean_score=0.96
  stack=m1+m7      samples=12   mean_score=0.91   ⚠ below floor
  stack=m1+m3+m7   samples=8    mean_score=0.84   ⚠ below floor

The two breaching stacks each have a different problem profile: one is the route+prune interaction, the other adds compression on top. A monitor reading only the blended workload mean would have seen about 0.95 (the sample-weighted average across all stacks) and shrugged. The per-stack monitor flags both compositions immediately.
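To make the aggregation concrete, here is a minimal sketch that groups scored canary pairs by (workload, stack) and then reproduces the blended mean for the table above. All type and function names are hypothetical; only the numbers come from the example.

```typescript
interface CanaryScore { workloadId: string; stackKey: string; score: number; }
interface StackMean { workloadId: string; stackKey: string; samples: number; meanScore: number; }

// One mean per (workload, stack) tuple instead of one mean per workload-day.
function meanPerStack(scores: CanaryScore[]): StackMean[] {
  const acc = new Map<string, StackMean>();
  for (const s of scores) {
    const key = JSON.stringify([s.workloadId, s.stackKey]);
    let row = acc.get(key);
    if (row === undefined) {
      row = { workloadId: s.workloadId, stackKey: s.stackKey, samples: 0, meanScore: 0 };
      acc.set(key, row);
    }
    row.samples += 1;
    row.meanScore += (s.score - row.meanScore) / row.samples; // running mean
  }
  return Array.from(acc.values());
}

// Sample-weighted mean across stacks: what a workload-level monitor would see.
function pooledMean(rows: Array<{ samples: number; meanScore: number }>): number {
  const total = rows.reduce((n, r) => n + r.samples, 0);
  return rows.reduce((sum, r) => sum + r.samples * r.meanScore, 0) / total;
}

// The workload-A rows from the table above.
const workloadA = [
  { stackKey: "_none", samples: 21, meanScore: 0.99 },
  { stackKey: "m1", samples: 58, meanScore: 0.96 },
  { stackKey: "m1+m6", samples: 44, meanScore: 0.96 },
  { stackKey: "m1+m7", samples: 12, meanScore: 0.91 },
  { stackKey: "m1+m3+m7", samples: 8, meanScore: 0.84 },
];
const blended = pooledMean(workloadA); // about 0.9535, which is not below 0.95
```

The blended figure sits just above the floor even though two of the five stacks are well below it, which is the masking effect the per-stack view is meant to remove.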

Breach criteria — 3 days, 0.95 floor, 30 samples

The breach trigger has three guards, all of which must fire for the same stack on the same workload before auto-disable runs:

mean_score < 0.95 (the QUALITY_SLA_FLOOR constant — locked 2026-05-20 in lib/tier-features.ts). Strict less-than: exactly 0.95 is NOT a breach. The floor is the contractual SLA you see on tesseraai.io/security.
3 consecutive days of breach. Single-day dips from a bad eval prompt or a holiday-skewed traffic distribution do not auto-disable. A real degradation surfaces by day 3 with high confidence.
≥ 30 samples per day on that stack. Below 30, the day's mean_score is too noisy to trust. The day is recorded but skipped from the breach evaluation (detectBreachingStacks in dashboard/lib/canary-eval.ts).

The three guards together give a 99%-confident detection at the cost of a 72-hour worst-case detection window. We chose this over a tighter window because the auto-rollback action is non-reversible without operator review — a false positive disables a working stack for a customer until they (or we) manually re-enable it. The conservative window minimizes that failure mode.
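The three guards can be sketched as one pass over a stack's daily stats. This is not the detectBreachingStacks source. It is a minimal illustration with hypothetical names, and it makes one assumption the post does not spell out: a skipped low-sample day neither extends nor resets the streak.

```typescript
const QUALITY_SLA_FLOOR = 0.95;  // value quoted in the post
const MIN_SAMPLES_PER_DAY = 30;  // illustrative constant name
const BREACH_DAYS = 3;           // illustrative constant name

interface StackDayStat {
  date: string;      // ISO date, e.g. "2026-05-21"
  samples: number;
  meanScore: number;
}

// `days` holds one (workload, stack) tuple's stats, ordered oldest to newest.
function trailingBreach(days: StackDayStat[]): boolean {
  let streak = 0;
  for (const day of days) {
    // Guard 3: too few samples, so the day is skipped (assumed streak-neutral).
    if (day.samples < MIN_SAMPLES_PER_DAY) continue;
    // Guard 1: strict less-than, so exactly 0.95 resets the streak.
    streak = day.meanScore < QUALITY_SLA_FLOOR ? streak + 1 : 0;
  }
  // Guard 2: the most recent evaluated days must all be breaches.
  return streak >= BREACH_DAYS;
}
```

If skipped days should instead reset the streak, replace the `continue` with `streak = 0`; that variant is stricter about what counts as consecutive.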

What happens on breach

The cron appends the breaching stack id to the workload's workloads.disabled_stacks TEXT[] column. The dashboard's next KV sync (within 60s, or immediately on Stripe upgrade) propagates the updated list to the worker's per-key ApiKeyRecord cache. An audit_events row records the trigger with full context: workload id, stack id, three-day rolling mean_score, sample counts, the date the cron fired.

At the next request the worker matches the prospective stack against the disabled list using subset semantics. A disabled entry m1+m7 matches:

prospective m1+m7 (exact)
prospective m1+m6+m7 (m1+m7 ⊂ m1+m6+m7)
prospective m1+m7+m8 (m1+m7 ⊂ m1+m7+m8)

but NOT m1 alone, m7 alone, or m1+m8 — none of these contain the full breaching set. The worker then falls back to pristine passthrough for that request (every mechanic skipped, request forwarded as-is to the original provider) and emits a measurement event with a disabled_stack_matched diagnostic.

Subset rather than exact-match is the conservative choice. If m1+m7 proves quality-degrading, then any composition containing it is presumed equally degrading until proven otherwise. We accept some over-restriction (m1+m7+m8 might actually be fine with M8 schema enforcement repairing the regression) in exchange for never serving a known-bad composition. _none in the disabled list is special-cased to exact-match the empty stack only, since otherwise it would match every request via the empty-set-is-subset-of-everything rule.
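The matching rule, including the `_none` special case, fits in one small function. This is a sketch with a hypothetical function name; it assumes disabled entries are stored as the same `+`-joined keys used elsewhere in this post.

```typescript
// True when any disabled entry is a subset of the prospective stack.
function matchesDisabledStack(prospective: string[], disabledKeys: string[]): boolean {
  const fired = new Set(prospective);
  return disabledKeys.some((key) => {
    // "_none" matches only the empty stack; treated as a plain subset
    // it would match every request.
    if (key === "_none") return fired.size === 0;
    // Every mechanic in the disabled entry must be in the prospective stack.
    return key.split("+").every((mechanic) => fired.has(mechanic));
  });
}

const disabled = ["m1+m7"];
const blocked = matchesDisabledStack(["m1", "m6", "m7"], disabled); // true
const allowed = matchesDisabledStack(["m1", "m8"], disabled);       // false
```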

Why 5% sample rate

A 5% rate is the lower bound where you reliably hit 30 samples per stack per day at typical workload volume (say, 50k requests/day × 5% = 2500 canary samples spread across 5–10 active stacks = 250–500 samples per stack — comfortably above the 30-sample minimum).
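The sample arithmetic can be checked directly. The helper below is purely illustrative: it assumes every request is eligible for sampling and that samples spread evenly across active stacks, which real traffic will not do exactly.

```typescript
// Expected canary samples per stack per day, under an even-spread assumption.
function expectedSamplesPerStack(
  requestsPerDay: number,
  sampleRate: number,
  activeStacks: number,
): number {
  return (requestsPerDay * sampleRate) / activeStacks;
}

const typicalLow = expectedSamplesPerStack(50_000, 0.05, 10); // about 250
const typicalHigh = expectedSamplesPerStack(50_000, 0.05, 5); // about 500
const smallWorkload = expectedSamplesPerStack(500, 0.05, 1);  // about 25, under the 30 minimum
```

In practice the least-used stack is the binding constraint, so the even-spread figure is an upper bound on what the rarest composition will collect.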

You can raise it per workload. A high-stakes legal-review workload might want 20% sample rate (4x the cost on the canary path, in exchange for 4x faster breach detection). A high-volume content generation workload might lower it to 2% once it stabilizes. The workloads.quality_canary_sample_rate column controls this per-workload; the SDK exposes it as a dashboard knob.

The cost is real. 5% of your traffic costs ~2× per request (we fire the original AND the optimized model, then store both). That 2× is amortized across the 95% of un-sampled traffic, so the end-to-end cost increase is ~5%. The optimization savings dominate. On a 48%-savings workload, the canary overhead nets out to ~43% measured savings.
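The net figure follows from simple per-request arithmetic. In the sketch below the un-optimized cost is normalized to 1, and we assume each sampled request pays one extra un-optimized call on top of the optimized one; the function name is hypothetical.

```typescript
// Net savings fraction after canary overhead, with baseline cost = 1 per request.
function netSavingsFraction(grossSavings: number, sampleRate: number): number {
  const optimizedCost = 1 - grossSavings; // paid on every request
  const canaryCost = sampleRate * 1;      // the original also fires on the sampled share
  return 1 - (optimizedCost + canaryCost);
}

const net = netSavingsFraction(0.48, 0.05); // about 0.43
```

Under this model the overhead is a flat subtraction of the sample rate, so raising the rate to 20% on the same workload would leave roughly 28% net savings.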

What sponsors see in the dashboard

The portal exposes the entire flow:

/portal/audit: per-request chip strip showing the exact mechanics_stack that fired. Sponsor can verify at a glance which mechanics ran on any historical request, and a composition cap violation surfaces in red.
/portal/optimize/quality: per-(workload, stack) mean_score over the last N days. Breaching stacks are visibly tinted as warnings. Click through to see sample counts per day, the rolling mean trend, and which stacks have been auto-disabled.
/portal/settings → Anomaly response policy: set tier 0 (observe only), tier 1 (recommendation), or tier 2 (auto-rollback) per workload. The per-stack breach detection runs at all tiers; tiers 1 and 2 dictate what HAPPENS when a breach fires. Tier 2 on a regulated workload requires operator sign-off; see /how-it-works for the full ladder.

Honest tradeoffs

3-day window means up to 72-hour worst-case detection. A regression starting Monday gets auto-disabled Wednesday night. For workloads where 72-hour exposure is unacceptable, raise the sample rate (faster confidence) and set anomaly tier 1 (recommendation, not auto-rollback) so your team gets pinged at day-1 breach.
Promptfoo graders aren't perfect. LLM-as-judge can mis-score 5–10% of comparisons. The 30-sample minimum washes out most grader noise; a custom assertion in your golden set (exact-match, regex, deterministic function) reduces it further.
Subset matching over-restricts in rare cases. A stack disabled at m1+m7 can't run with m1+m7+m8 even if M8 would actually fix the regression. Operator override is one form edit + audit row; we trade some optimization yield for a stronger correctness guarantee.
5% rate is a default, not a guarantee. If your workload runs at 500 requests/day instead of 50,000, 5% gives you 25 canary samples — below the 30-sample minimum, so no day gets evaluated. Either raise the sample rate or accept a longer confidence window. The dashboard flags low-sample days explicitly.

Why this matters for substrate trust

The proxy thesis is that we sit in your LLM request path and apply a mechanic stack that's individually safe but collectively unknown. Per-mechanic monitoring would have us shipping the stack on faith. Per-stack monitoring lets us ship it on evidence — every composition gets independently scored, every breach gets independently rolled back, every disabled stack gets recorded in the same audit ledger as the cost claims.

The same mechanics_stack column that drives the canary aggregation drives the per-request chip strip on /portal/audit. Same source of truth, same canonical sort, same string-key shape. The quality monitoring system and the dashboard you screenshot for your compliance team read from the same column. Nothing computed differently for different audiences.

Try it on your workload

Free Sandbox tier ships the canary at the default 5% rate. 60M tokens per month, no card. Sign up at tesseraai.io/dev — 30-second flow returns an instant API key plus magic-link dashboard access. Stack-breakdown surfaces show up after the first 24 hours of canary runs.

See also: how it works (10-mechanic deep dive), /security (SLA + retention + Vault encryption details), worked-numbers blog post (the customer-support example the 0.96 mean_score came from).
