
2026-05-21 · Quality engineering

Per-stack quality canary

Cross-mechanic interactions hide quality regressions that per-mechanic monitoring misses. Tessera scores each (workload × mechanic-stack) tuple independently — and when one breaches, only that exact stack gets disabled. The other 30 mechanic combinations on the same workload keep saving.

The problem with per-mechanic monitoring

The intuitive way to monitor an LLM optimization stack is per-mechanic: watch m1 (auto-route) quality score, watch m3 (compress) quality score, watch m7 (context prune) quality score. If any drops below SLA, disable that mechanic.

That misses the entire interaction class. m1 alone might score 0.96 on your eval set. m7 alone might score 0.96. But m1+m7 together (routing to a smaller model AND pruning the conversation) can produce 0.88, because the smaller model has less headroom to recover from the shortened context. Neither per-mechanic counter ever flags it. Both look healthy. The customer's output silently degrades on the 18% of traffic where both fire.

The fix is to monitor the COMPOSITION, not the components. Score each unique mechanic combination separately, breach on combinations, disable at the combination granularity.

What gets stored per request

Every request through the worker emits an optimize_savings row with a canonical sorted mechanics_stack column — a TEXT[] containing the mechanic ids that actually fired. Examples:

mechanics_stack = []                  -- pristine passthrough, no mechanic
mechanics_stack = ['m1']              -- auto-routed only
mechanics_stack = ['m1', 'm6']        -- auto-route + Anthropic prompt cache
mechanics_stack = ['m1', 'm3', 'm7']  -- route + compress + prune
mechanics_stack = ['m2']              -- exact cache hit, no upstream call

The stack is built ONCE per request, after every mechanic has had its chance to fire. Sorted alphabetically so m1+m7 is the canonical id whether m1 fires first or m7 fires first. The sort lets us key on the string 'm1+m7' downstream — comparable across requests, dedupable in SQL.
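As a rough illustration of the canonical key, here is a minimal Python sketch. The function name stack_key and the de-duplication step are assumptions made for this example, not Tessera's actual code; the _none key for the empty stack mirrors the label used in the aggregation output later in this post.

```python
def stack_key(fired):
    """Return a canonical string key for the mechanics that fired on one request.

    Hypothetical helper: sorts the ids so firing order does not matter,
    then joins them with '+'. The empty stack maps to '_none'.
    """
    unique_sorted = sorted(set(fired))  # plain string sort
    return "+".join(unique_sorted) if unique_sorted else "_none"


print(stack_key(["m7", "m1"]))        # m1+m7
print(stack_key(["m1", "m7"]))        # m1+m7
print(stack_key(["m3", "m7", "m1"]))  # m1+m3+m7
print(stack_key([]))                  # _none
```

One caveat of a plain string sort: 'm10' sorts before 'm2'. The key is still canonical as long as every writer and reader uses the same sort.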

The canary loop

On every auto-routed request, the worker rolls a per-request dice against client.quality_canary_sample_rate (default 5%). On hit, the original (un-routed, un-compressed, un-pruned) request fires alongside the optimized one. Both responses come back. Both get stored in canary_runs alongside the request id + the mechanics_stack the optimized path used.
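The per-request roll can be sketched in a few lines of Python. This is an illustration under assumptions: should_run_canary is a hypothetical name, and the real worker may draw its randomness differently.

```python
import random


def should_run_canary(sample_rate, rng=random):
    """Roll once per request; True means the original request is fired as well.

    rng.random() returns a float in [0, 1), so a rate of 0.05 hits
    on roughly 5% of calls.
    """
    return rng.random() < sample_rate


# Seeded generator so the demo is repeatable.
rng = random.Random(0)
trials = 100_000
hits = sum(should_run_canary(0.05, rng) for _ in range(trials))
print(hits / trials)  # close to 0.05
```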

A daily cron in the dashboard pulls the last 24h of canary_runs and scores each pair via promptfoo runAssertions() against the workload's golden set (LLM-as-judge graders with a deterministic floor, semantic similarity, your custom assertions). Each row gets a 0–1 score.

The aggregation is the load-bearing step. Instead of one mean per workload-day, we compute one mean per (workload, stack) tuple:

workload-A · 2026-05-21
  stack=_none      samples=21   mean_score=0.99
  stack=m1         samples=58   mean_score=0.96
  stack=m1+m6      samples=44   mean_score=0.96
  stack=m1+m7      samples=12   mean_score=0.91   ⚠ below floor
  stack=m1+m3+m7   samples=8    mean_score=0.84   ⚠ below floor

The two breaching stacks each have a different problem profile: one is the route+prune interaction, the other adds compression on top. A per-mechanic monitor would have seen the workload at a mean of about 0.95 (the sample-weighted average across all stacks) and shrugged. The per-stack breakdown surfaces both compositions on the first day they fall below the floor.
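To make the aggregation concrete, here is a small Python sketch. The mean_by_stack function and the row shape (workload, stack key, score) are assumptions for illustration. The second half recomputes the blended workload mean from the sample counts and means in the table above.

```python
from collections import defaultdict


def mean_by_stack(rows):
    """Group scored canary rows by (workload, stack) and average each group.

    rows: iterable of (workload, stack_key, score) tuples (hypothetical shape).
    Returns {(workload, stack_key): (samples, mean_score)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for workload, stack, score in rows:
        totals[(workload, stack)][0] += 1
        totals[(workload, stack)][1] += score
    return {key: (n, s / n) for key, (n, s) in totals.items()}


# The per-stack table from this post, as (samples, mean_score).
table = {
    "_none": (21, 0.99),
    "m1": (58, 0.96),
    "m1+m6": (44, 0.96),
    "m1+m7": (12, 0.91),
    "m1+m3+m7": (8, 0.84),
}
total = sum(n for n, _ in table.values())
blended = sum(n * m for n, m in table.values()) / total
print(total, round(blended, 3))  # 143 0.953
```

The blended figure sits just above the 0.95 floor even though two of the five stacks are well below it, which is the masking effect described above.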

Breach criteria — 3 days, 0.95 floor, 30 samples

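The breach trigger has three guards, all of which must fire for the same stack on the same workload before auto-disable runs. As the heading says, they are a three-day window, a 0.95 mean-score floor, and a 30-sample minimum.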

The three guards together give a 99%-confident detection at the cost of a 72-hour worst-case detection window. We chose this over a tighter window because the auto-rollback action is non-reversible without operator review — a false positive disables a working stack for a customer until they (or we) manually re-enable it. The conservative window minimizes that failure mode.
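One way to read the three guards is sketched below in Python. The exact semantics are an assumption: this version requires each of the last three daily aggregates for a (workload, stack) tuple to have at least 30 samples and a mean below 0.95. The production cron may combine the guards differently (for example, with a rolling mean over the window), and is_breach is a hypothetical name.

```python
FLOOR = 0.95       # mean_score floor
MIN_SAMPLES = 30   # minimum samples per day (assumed to apply per day)
WINDOW_DAYS = 3    # number of most recent days that must all breach


def is_breach(daily):
    """daily: list of (samples, mean_score) for one (workload, stack), oldest first.

    Returns True only if there are at least WINDOW_DAYS days of data and
    every one of the most recent WINDOW_DAYS days has enough samples AND
    a mean below the floor.
    """
    recent = daily[-WINDOW_DAYS:]
    if len(recent) < WINDOW_DAYS:
        return False
    return all(n >= MIN_SAMPLES and mean < FLOOR for n, mean in recent)


print(is_breach([(40, 0.91), (35, 0.92), (31, 0.90)]))  # True
print(is_breach([(12, 0.91), (35, 0.92), (31, 0.90)]))  # False: day 1 has too few samples
print(is_breach([(40, 0.91), (35, 0.96), (31, 0.90)]))  # False: day 2 is above the floor
```

Under this reading, a stack with only 12 samples in a day cannot trigger auto-disable that day, however low its mean.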

What happens on breach

The cron appends the breaching stack id to the workload's workloads.disabled_stacks TEXT[] column. The dashboard's next KV sync (within 60s, or immediately on Stripe upgrade) propagates the updated list to the worker's per-key ApiKeyRecord cache. An audit_events row records the trigger with full context: workload id, stack id, three-day rolling mean_score, sample counts, the date the cron fired.

At the next request the worker matches the prospective stack against the disabled list using subset semantics. A disabled entry m1+m7 matches any prospective stack that contains both m1 and m7 (m1+m7 itself, or a superset such as m1+m3+m7 or m1+m7+m8), but NOT m1 alone, m7 alone, or m1+m8, since none of these contain the full breaching set. On a match, the worker falls back to pristine passthrough for that request (every mechanic skipped, request forwarded as-is to the original provider) and emits a measurement event with a disabled_stack_matched diagnostic.

Subset rather than exact-match is the conservative choice. If m1+m7 proves quality-degrading, then any composition containing it is presumed equally degrading until proven otherwise. We accept some over-restriction (m1+m7+m8 might actually be fine, with m8 schema enforcement repairing the regression) in exchange for never serving a known-bad composition. _none in the disabled list is special-cased to exact-match the empty stack only, since otherwise it would match every request via the empty-set-is-subset-of-everything rule.
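The matching rule, including the _none special case, can be sketched in Python as follows. The function name is_disabled and the '+'-joined entry format are assumptions based on the stack ids shown in this post.

```python
def is_disabled(prospective, disabled_stacks):
    """Return True if any disabled entry is contained in the prospective stack.

    prospective: iterable of mechanic ids that would fire, e.g. ['m1', 'm7'].
    disabled_stacks: list of canonical keys, e.g. ['m1+m7', '_none'].
    """
    fired = set(prospective)
    for entry in disabled_stacks:
        if entry == "_none":
            # Special case: exact-match the empty stack only.
            if not fired:
                return True
            continue
        if set(entry.split("+")) <= fired:  # subset test
            return True
    return False


disabled = ["m1+m7"]
print(is_disabled(["m1", "m7"], disabled))        # True
print(is_disabled(["m1", "m3", "m7"], disabled))  # True
print(is_disabled(["m1"], disabled))              # False
print(is_disabled(["m1", "m8"], disabled))        # False
print(is_disabled(["m1"], ["_none"]))             # False
print(is_disabled([], ["_none"]))                 # True
```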

Why 5% sample rate

A 5% rate is the lower bound where you reliably hit 30 samples per stack per day at typical workload volume (say, 50k requests/day × 5% = 2500 canary samples spread across 5–10 active stacks = 250–500 samples per stack — comfortably above the 30-sample minimum).
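The arithmetic above is easy to rerun for your own volume. The sketch below assumes canary samples spread evenly across active stacks, which real traffic rarely does, so treat the result as an upper-level estimate; samples_per_stack is a hypothetical helper.

```python
def samples_per_stack(requests_per_day, sample_rate, active_stacks):
    """Expected daily canary samples per stack, assuming an even spread."""
    return requests_per_day * sample_rate / active_stacks


# The example from the text: 50k requests/day at 5%, over 5 to 10 stacks.
print(round(samples_per_stack(50_000, 0.05, 5)))   # 500
print(round(samples_per_stack(50_000, 0.05, 10)))  # 250
```

A rarely firing stack gets far fewer samples than the even-spread figure, so low-traffic compositions may need a higher rate to reach 30 samples per day.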

You can raise it per workload. A high-stakes legal-review workload might want 20% sample rate (4x the cost on the canary path, in exchange for 4x faster breach detection). A high-volume content generation workload might lower it to 2% once it stabilizes. The workloads.quality_canary_sample_rate column controls this per-workload; the SDK exposes it as a dashboard knob.

The cost is real. The sampled 5% of your traffic costs ~2× per request (we fire the original AND the optimized model, then store both). Averaged over all traffic, that extra original call adds about 5% of the unoptimized baseline cost. The optimization savings dominate: on a 38%-savings workload, the canary overhead nets out to ~33% measured savings.
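The net-savings figure can be checked with a short Python calculation. This sketch assumes each canary replay costs the same as one unoptimized request and ignores storage and scoring costs; net_savings is a hypothetical helper.

```python
def net_savings(gross_savings, sample_rate):
    """Fraction of baseline cost saved after paying for canary replays.

    Baseline cost per request is normalized to 1.0.
    """
    optimized_cost = 1.0 - gross_savings  # every request runs the optimized path
    canary_cost = sample_rate * 1.0       # sampled requests also run the original
    return 1.0 - (optimized_cost + canary_cost)


print(round(net_savings(0.38, 0.05), 2))  # 0.33
print(round(net_savings(0.38, 0.20), 2))  # 0.18
```

The second line shows why a 20% sample rate is a deliberate trade: on the same workload it leaves about 18% net savings.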

What sponsors see in the dashboard

The portal exposes the entire flow:

Honest tradeoffs

Why this matters for substrate trust

The proxy thesis is that we sit in your LLM request path and apply a mechanic stack that's individually safe but collectively unknown. Per-mechanic monitoring would have us shipping the stack on faith. Per-stack monitoring lets us ship it on evidence — every composition gets independently scored, every breach gets independently rolled back, every disabled stack gets recorded in the same audit ledger as the cost claims.

The same mechanics_stack column that drives the canary aggregation drives the per-request chip strip on /portal/audit. Same source of truth, same canonical sort, same string-key shape. The quality monitoring system and the dashboard you screenshot for your compliance team read from the same column. Nothing computed differently for different audiences.

Try it on your workload

Free Dev tier ships the canary at the default 5% rate. 60M tokens per month, no card. Sign up at tesseraai.io/dev — 30-second flow returns an instant API key plus magic-link dashboard access. Stack-breakdown surfaces show up after the first 24 hours of canary runs.

See also: how it works (10-mechanic deep dive), /security (SLA + retention + Vault encryption details), worked-numbers blog post (the customer-support example the 0.96 mean_score came from).

tessera — substrate proxy for AI cost optimization