2026-05-21 · Quality engineering
Per-stack quality canary
Cross-mechanic interactions hide quality regressions that per-mechanic monitoring misses. Tessera scores each (workload × mechanic-stack) tuple independently — and when one breaches, only that exact stack gets disabled. The other 30 mechanic combinations on the same workload keep saving.
The problem with per-mechanic monitoring
The intuitive way to monitor an LLM optimization stack is per-mechanic: watch m1 (auto-route) quality score, watch m3 (compress) quality score, watch m7 (context prune) quality score. If any drops below SLA, disable that mechanic.
That misses the entire interaction class. m1 alone might score 0.96 on your eval set. m7 alone might score 0.96. But m1+m7 together (routing to a smaller model AND pruning the conversation) can produce 0.88, because the smaller model has less headroom to recover from the shortened context. Neither per-mechanic counter ever flags it. Both look healthy. The customer's output silently degrades on the 18% of traffic where both fire.
The fix is to monitor the COMPOSITION, not the components. Score each unique mechanic combination separately, breach on combinations, disable at the combination granularity.
What gets stored per request
Every request through the worker emits an optimize_savings row with a canonical sorted mechanics_stack column — a TEXT[] containing the mechanic ids that actually fired. Examples:
    mechanics_stack = []                    -- pristine passthrough, no mechanic
    mechanics_stack = ['m1']                -- auto-routed only
    mechanics_stack = ['m1', 'm6']          -- auto-route + Anthropic prompt cache
    mechanics_stack = ['m1', 'm3', 'm7']    -- route + compress + prune
    mechanics_stack = ['m2']                -- exact cache hit, no upstream call
The stack is built ONCE per request, after every mechanic has had its chance to fire. Sorted alphabetically so m1+m7 is the canonical id whether m1 fires first or m7 fires first. The sort lets us key on the string 'm1+m7' downstream — comparable across requests, dedupable in SQL.
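A minimal sketch of that canonicalization, under stated assumptions: the helper name `canonicalStackKey` is hypothetical, and the `_none` label for the empty stack is borrowed from the aggregation output shown later in this post. It is not Tessera's actual worker code.

```typescript
// Hypothetical helper, for illustration only.
function canonicalStackKey(fired: readonly string[]): string {
  // Drop duplicates so a mechanic that reports twice does not change the key.
  const unique = fired.filter((id, i) => fired.indexOf(id) === i);
  // Default sort() compares strings lexicographically ("sorted alphabetically"),
  // so firing order never affects the result.
  unique.sort();
  // Assumption: the empty stack is keyed as "_none".
  return unique.length === 0 ? "_none" : unique.join("+");
}
```

Any total order would work here, as long as every writer and every reader of the column uses the same one.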
The canary loop
On every auto-routed request, the worker rolls a per-request die against client.quality_canary_sample_rate (default 5%). On a hit, the original (un-routed, un-compressed, un-pruned) request fires alongside the optimized one. Both responses come back. Both get stored in canary_runs alongside the request id and the mechanics_stack the optimized path used.
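The sampling decision itself is small. Here is a hedged sketch, assuming a uniform random draw per request; `shouldRunCanary` and the injectable `rng` parameter are hypothetical names, added so the decision can be tested deterministically.

```typescript
// Hypothetical sketch of the per-request sampling roll.
type Rng = () => number; // must return a value in [0, 1)

function shouldRunCanary(sampleRate: number, rng: Rng = Math.random): boolean {
  // Clamp so a misconfigured rate cannot go below 0% or above 100%.
  const rate = Math.min(1, Math.max(0, sampleRate));
  // A draw below the rate is a "hit": fire the original request too.
  return rng() < rate;
}
```

With a rate of 0.05, roughly one request in twenty would trigger the paired original call.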
A daily cron in the dashboard pulls the last 24h of canary_runs and scores each pair via promptfoo runAssertions() against the workload's golden set (LLM-as-judge graders with a deterministic floor, semantic similarity, your custom assertions). Each row gets a 0–1 score.
The aggregation is the load-bearing step. Instead of one mean per workload-day, we compute one mean per (workload, stack) tuple:
    workload-A · 2026-05-21
      stack=_none      samples=21   mean_score=0.99
      stack=m1         samples=58   mean_score=0.96
      stack=m1+m6      samples=44   mean_score=0.96
      stack=m1+m7      samples=12   mean_score=0.91   ⚠ below floor
      stack=m1+m3+m7   samples=8    mean_score=0.84   ⚠ below floor
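The two breaching stacks each have a different problem profile: one is the route+prune interaction, one adds compression on top. A per-mechanic monitor would have seen the workload at roughly 0.95 mean (the sample-weighted average across all stacks) and shrugged. The per-stack view surfaces both compositions immediately. Note that at 12 and 8 samples, neither day in this illustration would count toward auto-disable under the 30-sample guard in the next section; the breakdown makes the dip visible, and the breach criteria decide when to act.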
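The grouping step can be sketched as follows. The row shape, field names, and function names are assumptions for illustration, not the actual canary_runs schema or cron code.

```typescript
// Hypothetical shapes; field names are illustrative.
interface ScoredCanary {
  workloadId: string;
  stackKey: string; // e.g. "m1+m7", or "_none" for the empty stack
  score: number;    // 0..1 from the grader
}

interface StackStat {
  workloadId: string;
  stackKey: string;
  samples: number;
  meanScore: number;
}

// One mean per (workload, stack) tuple instead of one mean per workload.
function aggregateByStack(rows: readonly ScoredCanary[]): StackStat[] {
  const acc: Record<string, { row: ScoredCanary; n: number; total: number } | undefined> = {};
  for (const row of rows) {
    const key = row.workloadId + "|" + row.stackKey;
    const entry = acc[key] ?? (acc[key] = { row, n: 0, total: 0 });
    entry.n += 1;
    entry.total += row.score;
  }
  return Object.keys(acc).map((key) => {
    const e = acc[key]!;
    return {
      workloadId: e.row.workloadId,
      stackKey: e.row.stackKey,
      samples: e.n,
      meanScore: e.total / e.n,
    };
  });
}

// Sample-weighted mean across stacks: the single number a coarser monitor would see.
function blendedMean(stats: readonly { samples: number; meanScore: number }[]): number {
  let n = 0;
  let total = 0;
  for (const s of stats) {
    n += s.samples;
    total += s.samples * s.meanScore;
  }
  return n === 0 ? NaN : total / n;
}
```

Feeding the five rows of the example table into `blendedMean` gives 136.35 / 143, about 0.953, which is how two stacks at 0.91 and 0.84 can hide behind a healthy-looking aggregate.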
Breach criteria — 3 days, 0.95 floor, 30 samples
The breach trigger has three guards, all of which must fire for the same stack on the same workload before auto-disable runs:
- mean_score < 0.95 (the QUALITY_SLA_FLOOR constant, locked 2026-05-20 in lib/tier-features.ts). Strict less-than: exactly 0.95 is NOT a breach. The floor is the contractual SLA you see on tesseraai.io/security.
- 3 consecutive days of breach. Single-day dips from a bad eval prompt or a holiday-skewed traffic distribution do not auto-disable. A real degradation surfaces by day 3 with high confidence.
- ≥ 30 samples per day on that stack. Below 30, the day's mean_score is too noisy to trust. The day is recorded but skipped from the breach evaluation (detectBreachingStacks in dashboard/lib/canary-eval.ts).
The three guards together give a 99%-confident detection at the cost of a 72-hour worst-case detection window. We chose this over a tighter window because the auto-rollback action is non-reversible without operator review — a false positive disables a working stack for a customer until they (or we) manually re-enable it. The conservative window minimizes that failure mode.
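The three guards combine roughly like the sketch below. This is not the real detectBreachingStacks; the name `isStackBreaching`, the constants, and the input shape are hypothetical. It also makes one assumption the text does not spell out: low-sample days are dropped from the series rather than resetting the streak.

```typescript
// Values taken from the three guards above; names are illustrative.
const FLOOR = 0.95;         // strict less-than is a breach
const MIN_SAMPLES = 30;     // per stack per day
const CONSECUTIVE_DAYS = 3;

interface DailyStackStat {
  date: string;      // ISO date, e.g. "2026-05-21"
  samples: number;
  meanScore: number;
}

// True when the most recent evaluable days all sit below the floor.
function isStackBreaching(days: readonly DailyStackStat[]): boolean {
  const evaluable = days
    // Assumption: days under the sample minimum are skipped, not counted as healthy.
    .filter((d) => d.samples >= MIN_SAMPLES)
    // ISO dates sort correctly as plain strings.
    .sort((a, b) => (a.date < b.date ? -1 : a.date > b.date ? 1 : 0));
  if (evaluable.length < CONSECUTIVE_DAYS) return false;
  return evaluable.slice(-CONSECUTIVE_DAYS).every((d) => d.meanScore < FLOOR);
}
```

Under this reading, a day at exactly 0.95 or a single recovered day breaks the streak, and a run of low-sample days simply delays the decision.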
What happens on breach
The cron appends the breaching stack id to the workload's workloads.disabled_stacks TEXT[] column. The dashboard's next KV sync (within 60s, or immediately on Stripe upgrade) propagates the updated list to the worker's per-key ApiKeyRecord cache. An audit_events row records the trigger with full context: workload id, stack id, three-day rolling mean_score, sample counts, the date the cron fired.
At the next request the worker matches the prospective stack against the disabled list using subset semantics. A disabled entry m1+m7 matches:
- prospective m1+m7 (exact)
- prospective m1+m6+m7 (m1+m7 ⊂ m1+m6+m7)
- prospective m1+m7+m8 (m1+m7 ⊂ m1+m7+m8)
but NOT m1 alone, m7 alone, or m1+m8 — none of these contain the full breaching set. The worker then falls back to pristine passthrough for that request (every mechanic skipped, request forwarded as-is to the original provider) and emits a measurement event with a disabled_stack_matched diagnostic.
Subset rather than exact-match is the conservative choice. If m1+m7 proves quality-degrading, then any composition containing it is presumed equally degrading until proven otherwise. We accept some over-restriction (m1+m7+m8 might actually be fine with M8 schema enforcement repairing the regression) in exchange for never serving a known-bad composition. _none in the disabled list is special-cased to exact-match the empty stack only, since otherwise it would match every request via the empty-set-is-subset-of-everything rule.
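The matching rule, including the `_none` special case, could look like the following sketch. The function name and signature are hypothetical; the key format follows the examples above.

```typescript
// Hypothetical matcher for the disabled-stack check.
function isStackDisabled(
  prospective: readonly string[],  // mechanic ids about to fire, e.g. ["m1", "m7"]
  disabledKeys: readonly string[], // entries like "m1+m7" or "_none"
): boolean {
  return disabledKeys.some((key) => {
    if (key === "_none") {
      // Exact-match only: otherwise the empty set would be a subset of every stack.
      return prospective.length === 0;
    }
    const banned = key.split("+");
    // Subset test: every mechanic in the disabled entry must be in the prospective stack.
    return banned.every((id) => prospective.indexOf(id) !== -1);
  });
}
```

For example, with `["m1+m7"]` disabled, `["m1", "m6", "m7"]` matches and falls back to passthrough, while `["m1", "m8"]` does not match and runs normally.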
Why 5% sample rate
A 5% rate is the lower bound where you reliably hit 30 samples per stack per day at typical workload volume (say, 50k requests/day × 5% = 2500 canary samples spread across 5–10 active stacks = 250–500 samples per stack — comfortably above the 30-sample minimum).
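The sizing arithmetic above can be written out as a small sketch. Both function names are hypothetical, and the model assumes canary samples spread evenly across active stacks, which real traffic rarely does; treat the result as a rough planning number.

```typescript
// Expected canary samples per stack per day, assuming an even split across stacks.
function expectedSamplesPerStack(
  requestsPerDay: number,
  sampleRate: number,
  activeStacks: number,
): number {
  return (requestsPerDay * sampleRate) / activeStacks;
}

// Smallest sample rate that reaches minSamples per stack under the same assumption.
// Capped at 1 because you cannot sample more than every request.
function minSampleRate(requestsPerDay: number, activeStacks: number, minSamples = 30): number {
  return Math.min(1, (minSamples * activeStacks) / requestsPerDay);
}
```

At 50,000 requests/day and 5%, ten active stacks see about 250 samples each and five see about 500, matching the range quoted above. A rarely-firing stack will get far fewer than the even-split estimate.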
You can raise it per workload. A high-stakes legal-review workload might want a 20% sample rate (4x the cost on the canary path, in exchange for 4x faster breach detection). A high-volume content generation workload might lower it to 2% once it stabilizes. The workloads.quality_canary_sample_rate column controls this per-workload; the SDK exposes it as a dashboard knob.
The cost is real. 5% of your traffic costs ~2× per request (we fire the original AND the optimized model, then store both). That 2× is amortized across the 95% of un-sampled traffic, so the end-to-end cost increase is ~5%. The optimization savings dominate. On a 38%-savings workload, the canary overhead nets out to ~33% measured savings.
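To make the 38% to 33% arithmetic explicit, here is a simplified cost model. It assumes every request costs 1 unit un-optimized and (1 - savingsRate) optimized, that a sampled request pays for both calls, that every request is eligible for sampling, and that storage cost is negligible. The function name is hypothetical.

```typescript
// Net savings with the canary running, as a fraction of un-optimized baseline cost.
function netSavingsWithCanary(savingsRate: number, sampleRate: number): number {
  const optimizedCost = 1 - savingsRate;
  // Un-sampled traffic pays only for the optimized call.
  const unsampled = (1 - sampleRate) * optimizedCost;
  // Sampled traffic pays for the optimized call plus the original call.
  const sampled = sampleRate * (optimizedCost + 1);
  // Algebraically this simplifies to savingsRate - sampleRate.
  return 1 - (unsampled + sampled);
}
```

With a 38% savings rate and a 5% sample rate this gives about 0.33. At a 20% sample rate the same model gives about 0.18, which is the price of the faster detection on a high-stakes workload.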
What sponsors see in the dashboard
The portal exposes the entire flow:
- /portal/audit: per-request chip strip showing the exact mechanics_stack that fired. Sponsors can verify at a glance which mechanics ran on any historical request, and a composition cap violation surfaces in red.
- /portal/optimize/quality: per-(workload, stack) mean_score over the last N days. Breaching stacks are visibly tinted as warnings. Click through to see sample counts per day, the rolling mean trend, and which stacks have been auto-disabled.
- /portal/settings → Anomaly response policy: set tier 0 (observe only), tier 1 (recommendation), or tier 2 (auto-rollback) per workload. The per-stack breach detection runs at all tiers; tiers 1 and 2 dictate what happens when a breach fires. Tier 2 on a regulated workload requires operator sign-off; see /how-it-works for the full ladder.
Honest tradeoffs
- 3-day window means up to 72-hour worst-case detection. A regression starting Monday gets auto-disabled Wednesday night. For workloads where 72-hour exposure is unacceptable, raise the sample rate (faster confidence) and set anomaly tier 1 (recommendation, not auto-rollback) so your team gets pinged at day-1 breach.
- Promptfoo graders aren't perfect. LLM-as-judge can mis-score 5–10% of comparisons. The 30-sample minimum washes out most grader noise; a custom assertion in your golden set (exact-match, regex, deterministic function) reduces it further.
- Subset matching over-restricts in rare cases. A stack disabled at m1+m7 can't run as m1+m7+m8 even if M8 would actually fix the regression. Operator override is one form edit + audit row; we trade some optimization yield for a stronger correctness guarantee.
- 5% rate is a default, not a guarantee. If your workload runs at 500 requests/day instead of 50,000, 5% gives you 25 canary samples, which is below the 30-sample minimum, so no day gets evaluated. Either raise the sample rate or accept a longer confidence window. The dashboard flags low-sample days explicitly.
Why this matters for substrate trust
The proxy thesis is that we sit in your LLM request path and apply a mechanic stack that's individually safe but collectively unknown. Per-mechanic monitoring would have us shipping the stack on faith. Per-stack monitoring lets us ship it on evidence — every composition gets independently scored, every breach gets independently rolled back, every disabled stack gets recorded in the same audit ledger as the cost claims.
The same mechanics_stack column that drives the canary aggregation drives the per-request chip strip on /portal/audit. Same source of truth, same canonical sort, same string-key shape. The quality monitoring system and the dashboard you screenshot for your compliance team read from the same column. Nothing computed differently for different audiences.
Try it on your workload
Free Dev tier ships the canary at the default 5% rate. 60M tokens per month, no card. Sign up at tesseraai.io/dev — 30-second flow returns an instant API key plus magic-link dashboard access. Stack-breakdown surfaces show up after the first 24 hours of canary runs.
See also: how it works (10-mechanic deep dive), /security (SLA + retention + Vault encryption details), worked-numbers blog post (the customer-support example the 0.96 mean_score came from).