Mechanic · M1 · Non-mutating cost lever

Auto-route

Chained walks since 26 May 2026

When the customer's requested model has a cheaper sibling in the same provider's family AND the workload is opted in AND the daily canary has not flagged the swap as quality-degrading, the worker forwards the request to the cheaper model. The customer sees an output that their own promptfoo eval scores at ≥ 0.95 against the baseline. No cross-provider swaps: that is what M11 is for. No cross-capability swaps (reasoning ↔ non-reasoning, multimodal ↔ text). Conservative pairs only, narrowed against published model-card benchmarks before they land in the table.

M1 is the most visible cost mechanic — sponsors see it on the audit chip strip every time a routed request lands. It is also the mechanic with the highest quality-risk surface, because a wrong swap propagates silently into every downstream eval the customer runs against the output. The page below documents the gating logic in source-traceable detail.

The ROUTE_TABLE — what swaps are eligible

Single static map keyed by Provider. Every entry is a tuple of { from, to, quality_preservation_estimate }. The estimate is conservative — derived from published MMLU / HumanEval / MT-Bench benchmarks and rounded down. We never publish an estimate above 0.97 because no provider gives us the golden set the customer would actually evaluate against.

Representative pairs (subset, not exhaustive):

  • gpt-5 → gpt-5-mini (0.94 estimate); gpt-5-mini → gpt-5-nano (0.88).
  • claude-opus-4-7 → claude-sonnet-4-6 (0.91); claude-sonnet-4-6 → claude-haiku-4-5 (0.87).
  • gemini-1.5-pro → gemini-1.5-flash (0.85; Gemini Flash benchmarks strongly as a capable mainline model, but Pro's advantage on longer reasoning shows up on specific subsets, so we round down).
  • OpenAI-compatible providers (DeepSeek, Mistral, Groq, Together, Fireworks, OpenRouter, Perplexity, Cerebras, Cohere) each carry their own intra-family pairs, verified against the upstream provider's own pricing and capability documentation. Providers with no intra-tier price spread carry an empty array (e.g. deepseek-chat and deepseek-reasoner share the same per-token rate, so no swap produces savings).
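A minimal sketch of how such a table could be shaped, using only the pairs and estimates listed above. The provider keys ("openai", "anthropic", "google", "deepseek") and the lookupPair helper are illustrative assumptions, not the contents of the real table.

```typescript
// Hypothetical shape for illustration; provider keys and helper name are assumptions.
type Provider = "openai" | "anthropic" | "google" | "deepseek";

interface RoutePair {
  from: string;
  to: string;
  // Conservative estimate, rounded down; never published above 0.97.
  quality_preservation_estimate: number;
}

const ROUTE_TABLE: Record<Provider, RoutePair[]> = {
  openai: [
    { from: "gpt-5", to: "gpt-5-mini", quality_preservation_estimate: 0.94 },
    { from: "gpt-5-mini", to: "gpt-5-nano", quality_preservation_estimate: 0.88 },
  ],
  anthropic: [
    { from: "claude-opus-4-7", to: "claude-sonnet-4-6", quality_preservation_estimate: 0.91 },
    { from: "claude-sonnet-4-6", to: "claude-haiku-4-5", quality_preservation_estimate: 0.87 },
  ],
  google: [
    { from: "gemini-1.5-pro", to: "gemini-1.5-flash", quality_preservation_estimate: 0.85 },
  ],
  // No intra-tier price spread, so the array stays empty.
  deepseek: [],
};

// Returns the first row whose `from` matches the requested model, if any.
function lookupPair(provider: Provider, requestedModel: string): RoutePair | undefined {
  return ROUTE_TABLE[provider].find((pair) => pair.from === requestedModel);
}
```

Under these assumptions, lookupPair("openai", "gpt-5") yields the gpt-5-mini row, while any lookup against the empty deepseek array yields undefined and the request passes through unchanged.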

OpenRouter is a special case: routing happens within the same underlying-family namespace (openai/gpt-4o → openai/gpt-4o-mini). OR's 5.5% surcharge is multiplicative, so it scales both prices by the same factor and swapping a $2.50/M model for a $0.15/M one still produces real customer savings. Cross-vendor pairs through OR are excluded: they cross capability, tokenizer, and behavioural lines, which the canary alone cannot guarantee.
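The surcharge arithmetic can be checked with a short sketch. It assumes the 5.5% is applied as a flat multiplier on the list price and that the swap moves from the $2.50/M model to the $0.15/M one; the function names are illustrative.

```typescript
// Assumption: the surcharge multiplies the per-million-token list price.
const OPENROUTER_SURCHARGE = 0.055;

function billedPricePerMillion(listPricePerMillion: number): number {
  return listPricePerMillion * (1 + OPENROUTER_SURCHARGE);
}

// Fraction of the original billed price saved by the swap.
function savingsFraction(fromListPrice: number, toListPrice: number): number {
  const fromBilled = billedPricePerMillion(fromListPrice);
  const toBilled = billedPricePerMillion(toListPrice);
  return (fromBilled - toBilled) / fromBilled;
}

const fraction = savingsFraction(2.5, 0.15);
console.log(fraction.toFixed(2)); // "0.94"
```

Because the multiplier appears in both the numerator and the denominator, it cancels: the relative saving is the same with or without the surcharge.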

Single-hop vs chained walks

Default: single-hop. The worker looks up the first row matching the customer's requested model and forwards to that row's to model. Done.

Chained walks (shipped 26 May 2026, opt-in via worker env WORKER_M1_CHAINED_ENABLED): the worker follows the routing chain past the first hop — gpt-5 → gpt-5-mini → gpt-5-nano, opus → sonnet → haiku — and computes the cumulative quality_preservation as the product across hops. The walk stops at the first hop whose cumulative product would drop below the floor (per-workload threshold); the previous hop's destination is returned. Single-hop is the backward-compatible default, gated separately so existing customers see no behaviour change until they opt in.
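The walk described above can be sketched as follows. This is a simplified illustration, not the worker's code: the Hop and WalkResult types, the walkChain name, and the cycle guard are assumptions, and tier handling is left out.

```typescript
interface Hop {
  from: string;
  to: string;
  estimate: number; // per-hop quality_preservation estimate
}

interface WalkResult {
  model: string;      // model the request would be forwarded to
  cumulative: number; // product of estimates across accepted hops
  hops: number;
}

function walkChain(pairs: Hop[], requested: string, floor: number, chained: boolean): WalkResult {
  let model = requested;
  let cumulative = 1;
  let hops = 0;
  const visited = new Set<string>([requested]); // guard against accidental cycles

  while (true) {
    const next = pairs.find((pair) => pair.from === model);
    if (!next || visited.has(next.to)) break;

    // Stop before any hop that would push the cumulative product below the floor.
    const candidate = cumulative * next.estimate;
    if (candidate < floor) break;

    model = next.to;
    cumulative = candidate;
    hops += 1;
    visited.add(next.to);

    if (!chained) break; // single-hop mode accepts at most one hop
  }
  return { model, cumulative, hops };
}
```

With the openai pairs listed earlier and a floor of 0.85, the chained walk from gpt-5 stops at gpt-5-mini, because 0.94 × 0.88 = 0.8272 falls below the floor. A lower floor such as 0.80 would let the same walk reach gpt-5-nano.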

Quality floors

The per-hop floor is the safety baseline: WORKER_M1_QUALITY_FLOOR (recommended 0.85) skips swaps whose estimate is below that number on tier_internal workloads (the default tier). Below-floor pairs in the table stay there for documentation transparency but never fire under the floor gate. tier_consumer bypasses the floor (caller opted into looser semantics); tier_regulated short-circuits routing entirely (see below).
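The three tier behaviours can be condensed into one predicate. The sketch below assumes the tier names from this page and treats an estimate equal to the floor as passing, since only swaps below the floor are skipped; the function name is illustrative.

```typescript
type ComplianceTier = "tier_internal" | "tier_consumer" | "tier_regulated";

// Returns true when a swap with the given estimate may fire for this tier.
function passesFloorGate(tier: ComplianceTier, estimate: number, floor: number): boolean {
  if (tier === "tier_regulated") return false; // routing is short-circuited entirely
  if (tier === "tier_consumer") return true;   // caller opted into looser semantics
  return estimate >= floor;                    // tier_internal: skip swaps below the floor
}
```

With the recommended floor of 0.85, every representative pair listed earlier passes on tier_internal, with gemini-1.5-pro → gemini-1.5-flash sitting exactly on the boundary.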

When chained walks are enabled, the floor gates the cumulative product, not just the first hop. The walk stops as soon as the next candidate hop would push the cumulative product below the floor. A first hop whose own estimate is below the floor cannot sneak through under chained mode either, because the first-hop check still fires.

Hard bypasses

M1 short-circuits to no-route on three independent gates, checked in order:

  • compliance_tier === 'tier_regulated'. Regulated traffic never routes: cache + measurement still apply, but no upstream model swap. Enforced at the route-detector level so no dashboard config can accidentally re-enable it.
  • client_auto_route_enabled === false. The sponsor explicitly turned routing off for this workload.
  • Provider circuit breaker is non-healthy. When the per-provider rolling 5xx rate exceeds threshold, the breaker flips to OPEN (or HALF_OPEN during cooldown probe). M1 refuses to compound risk by adding a model swap on top of an already-flapping upstream.

The composition cap also takes M1 off the table whenever a content-mutating mechanic (M3 compress, M7 context prune, M8 structured output) is about to fire on the same request. Content mutation stacked on a model swap multiplies quality risk, so we elect the mutation site and leave M1 disabled for that request. The single-pick budget is published on the parent /how-it-works page.
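The gate order, followed by the composition cap, can be sketched as a single function. The RouteContext shape, the reason strings, and the "CLOSED" name for the healthy breaker state are assumptions for illustration; the field names compliance_tier and client_auto_route_enabled come from the list above.

```typescript
// "CLOSED" as the healthy state is an assumption (conventional breaker naming).
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface RouteContext {
  compliance_tier: string;
  client_auto_route_enabled: boolean;
  breakerState: BreakerState;
  pendingMechanics: string[]; // other mechanics about to fire, e.g. ["m7"]
}

type BypassReason =
  | "tier_regulated"
  | "client_disabled"
  | "breaker_unhealthy"
  | "composition_cap"
  | null;

const CONTENT_MUTATING = new Set<string>(["m3", "m7", "m8"]);

// Returns the first reason M1 must not fire, or null when routing may proceed.
function bypassReason(ctx: RouteContext): BypassReason {
  if (ctx.compliance_tier === "tier_regulated") return "tier_regulated";
  if (ctx.client_auto_route_enabled === false) return "client_disabled";
  if (ctx.breakerState !== "CLOSED") return "breaker_unhealthy";
  if (ctx.pendingMechanics.some((m) => CONTENT_MUTATING.has(m))) return "composition_cap";
  return null;
}
```

Returning a reason rather than a bare boolean is a design choice of this sketch: it keeps the first matching gate visible to whatever records the decision.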

The quality protection layer

The static estimates in the table are the prior. The canary is the evidence. Once a workload is opted into M1, the daily eval cron samples 10% of routed traffic and re-runs the customer's promptfoo eval on the routed output against the baseline. The per-mechanic-stack mean score is tracked over a rolling 3-day window.
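A rough sketch of the per-stack rolling mean, assuming each canary sample records its day, a canonical stack key such as "m1" or "m1+m7", and the eval score. The CanarySample shape and the function name are hypothetical, and the caller supplies the three days that make up the window.

```typescript
interface CanarySample {
  day: string;   // e.g. "2026-05-26"
  stack: string; // canonical mechanic-stack key, e.g. "m1+m7"
  score: number; // promptfoo eval score against the baseline
}

// Mean score per stack over the days in the window.
function rollingMeanByStack(samples: CanarySample[], windowDays: string[]): Record<string, number> {
  const days = new Set(windowDays);
  const totals: Record<string, { sum: number; count: number }> = {};

  samples.forEach((sample) => {
    if (!days.has(sample.day)) return; // outside the rolling window
    const entry = totals[sample.stack] ?? { sum: 0, count: 0 };
    entry.sum += sample.score;
    entry.count += 1;
    totals[sample.stack] = entry;
  });

  const means: Record<string, number> = {};
  Object.keys(totals).forEach((stack) => {
    means[stack] = totals[stack].sum / totals[stack].count;
  });
  return means;
}
```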

Two auto-rollback layers protect the workload:

  • Workload-wide auto-disable. If the mean score drops below 0.90 (hard floor) for three consecutive days, the cron sets workloads.auto_route_disabled = true and KV is refreshed. M1 stops firing until an operator re-enables it after root-cause review. Sponsor receives an anomaly row on /portal/anomalies with the breach detail.
  • Per-stack auto-disable. Composition-specific regressions (e.g. m1+m7 tanks while m1-alone and m7-alone stay healthy) flip a workloads.disabled_stacks entry rather than the whole workload. The worker checks subset-match before applying mechanics — if a disabled entry is a subset of the prospective stack on this request, the request falls back to pure passthrough. Saves the other mechanic compositions that are still profitable.
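Both rollback checks reduce to small predicates. The sketch below is one reading of the rules above: it takes "three consecutive days" to mean the three most recent daily means, and it represents each disabled stack as an array of mechanic ids. The function names and data shapes are assumptions.

```typescript
const HARD_FLOOR = 0.9;
const BREACH_DAYS = 3;

// True when the three most recent daily mean scores are all below the hard floor.
function shouldAutoDisable(dailyMeans: number[]): boolean {
  if (dailyMeans.length < BREACH_DAYS) return false;
  return dailyMeans.slice(-BREACH_DAYS).every((mean) => mean < HARD_FLOOR);
}

// True when any disabled entry is a subset of the stack about to be applied,
// in which case the request would fall back to pure passthrough.
function matchesDisabledStack(prospective: string[], disabledStacks: string[][]): boolean {
  const active = new Set(prospective);
  return disabledStacks.some(
    (entry) => entry.length > 0 && entry.every((mechanic) => active.has(mechanic)),
  );
}
```

With ["m1", "m7"] disabled, a request about to run m1 alone is unaffected, while one about to run m1, m3, and m7 matches and falls back.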

Per-stack auto-disable comes with prorated SLA credit — a 10% credit on the fees attributable to the breached stack for the three breach-detection days. The math sums savings across rows tagged with the disabled stack and applies the credit to the customer's next monthly reading.
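The credit arithmetic might look like the sketch below. It rests on assumptions this page does not state: that fees are a fixed share (feeRate) of recorded savings, that each ledger row carries a savings_usd value, and that mechanics_stack is stored as a key such as "m1+m7". Treat it as a way to reason about the numbers, not as the billing formula.

```typescript
interface LedgerRow {
  day: string;             // e.g. "2026-05-26"
  mechanics_stack: string; // canonical stack key (representation assumed)
  savings_usd: number;     // hypothetical field name
}

const SLA_CREDIT_RATE = 0.1; // the 10% credit described above

// Credit = 10% of the fees attributable to the disabled stack on the breach days.
function stackCredit(
  rows: LedgerRow[],
  disabledStack: string,
  breachDays: string[],
  feeRate: number, // assumed share of savings charged as fees
): number {
  const days = new Set(breachDays);
  const savings = rows
    .filter((row) => row.mechanics_stack === disabledStack && days.has(row.day))
    .reduce((sum, row) => sum + row.savings_usd, 0);
  return savings * feeRate * SLA_CREDIT_RATE;
}
```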

Verification surfaces

M1 is observable in four places:

  • Response headers on routed requests: x-tessera-auto-routed: gpt-5->gpt-5-mini.
  • /portal/audit chip strip. Routed rows carry m1 in the canonical mechanics_stack. The row also shows requested_model → actual_model so the swap is visible without click-through.
  • /portal/anomalies. Auto-rollback events land here with the breach detail, the disabled stack id (or workload-wide flag), the rolling mean score, and the credit issued.
  • Stack-aware canary data on /optimize/quality. Operator-facing breakdown of mean score per workload per mechanic-stack — the same data the auto-rollback cron reads.
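On the client side, the response header can be checked with a small parser. This sketch assumes the value always has the single-arrow form shown above (requested->actual); the format for chained walks is not documented here, so anything else returns null. The type and function names are illustrative.

```typescript
interface AutoRouted {
  requested: string;
  actual: string;
}

// Parses a value such as "gpt-5->gpt-5-mini"; returns null for absent or unexpected input.
function parseAutoRoutedHeader(value: string | null): AutoRouted | null {
  if (!value) return null;
  const parts = value.split("->").map((part) => part.trim());
  if (parts.length !== 2 || parts[0] === "" || parts[1] === "") return null;
  return { requested: parts[0], actual: parts[1] };
}
```

With the Fetch API, the raw value would come from response.headers.get("x-tessera-auto-routed"), which returns null when the request was not routed.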

What we don’t do

Five deliberate exclusions, called out so they cannot be surprises:

  • No cross-provider swaps. Those live in M11 cross-provider failover (deep dive) and only fire on upstream failure, not on cost-cut grounds.
  • No cross-capability-tier swaps (no gpt-5 → o3-mini — reasoning ↔ non-reasoning is a different request class).
  • No multimodal-to-text swaps (no pixtral-large → mistral-small — Mistral Small is text-only).
  • No experimental / unstable model targets (Gemini 2.0 Flash Exp is published but excluded — the rate-limit + SLA ceiling makes it unsuitable for prod routing).
  • No streaming-only routing decisions — M1 fires on streaming and non-streaming requests alike; the streaming decision is orthogonal.

What we promise

M1 routes only when the static estimate clears the floor AND the canary has not flagged the swap. The static table is auditable (it lives in workers/proxy/src/detectors/route.ts — every entry has a comment with provenance). The canary cron and the per-stack auto-rollback math are auditable (they live in dashboard/lib/canary-eval.ts and app/api/cron/run-canary-eval/route.ts). The audit ledger row for every routed request carries the pricing snapshot we billed against — the savings math is re-derivable from the customer's end.

Parent index: How it works. Companion mechanic: M11 cross-provider failover.