Output-length predictor
Headroom multiplier default 1.5 · per-workload override
Most LLM callers either omit max_tokens entirely (defaults to provider ceiling — often huge, billed at full output rate even when the actual response is short) or set it conservatively high. M9 measures the workload's observed output-length distribution (past 14 days, sampled from optimize_savings rows that carry the provider's finish_reason) and injects a per-request max_tokens ceiling at the workload p90 × headroom multiplier (default 1.5). Tight ceiling = lower output token bill on the requests that would have padded; no impact on the requests that actually need the headroom.
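The ceiling arithmetic described above can be sketched as follows. This is a minimal illustration, not M9's actual code: the function name predicted_ceiling is hypothetical, and rounding up to a whole token is an assumption the text does not specify.

```python
import math

# Default headroom multiplier named in the text.
DEFAULT_HEADROOM_MULTIPLIER = 1.5


def predicted_ceiling(p90_tokens: int,
                      headroom: float = DEFAULT_HEADROOM_MULTIPLIER) -> int:
    """Return workload p90 times headroom, rounded up (rounding is an assumption)."""
    return math.ceil(p90_tokens * headroom)


print(predicted_ceiling(400))       # 600
print(predicted_ceiling(400, 2.0))  # 800
```

With a workload p90 of 400 tokens and the default multiplier, the injected ceiling would be 600 tokens.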
M9 is parameter-mutating, not content-mutating. It changes the max_tokens parameter on the outgoing request, never touches the messages. Sits outside the M3 / M7 / M8 single-pick mutex — composes freely with everything.
The quality gate
M9 refuses to inject a tighter ceiling unless three things hold:
- Workload opt-in is true (output_length_predictor_optin).
- The 14-day-fresh p90 estimate exists. If the cron hasn't populated predicted_output_tokens_p90 yet, or the value is older than 14 days, M9 short-circuits.
- The measured 7-day truncation rate (predicted_output_truncation_rate) is below 0.02, i.e. fewer than 2% of past requests hit the predicted ceiling. If we're truncating the model frequently at the predicted p90, we're mispredicting; M9 stays out of the way until the cron repopulates with fresh data.
Any failed gate → no-op + log reason. We never tighten beyond the caller's explicit max_tokens either — if the customer set 500, M9's 800 ceiling is irrelevant; if the customer set 5000, M9 may tighten to 800.
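The three gates and the never-loosen rule can be sketched together as one decision function. This is a hedged sketch under stated assumptions: WorkloadStats, decide_ceiling, and the reason strings are hypothetical names, a missing truncation rate is treated as a failed gate, and the ceiling is rounded up. None of these details are confirmed by the text.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

MAX_P90_AGE = timedelta(days=14)   # freshness window from the text
MAX_TRUNCATION_RATE = 0.02         # gate threshold from the text


@dataclass
class WorkloadStats:               # hypothetical container for the workload fields
    optin: bool
    p90: Optional[int]
    computed_at: Optional[datetime]
    truncation_rate: Optional[float]
    headroom: float = 1.5


def decide_ceiling(stats: WorkloadStats, caller_max_tokens: Optional[int],
                   now: datetime) -> Tuple[Optional[int], str]:
    """Return (ceiling, reason); ceiling is None when M9 should no-op."""
    if not stats.optin:
        return None, "not_opted_in"
    if stats.p90 is None or stats.computed_at is None:
        return None, "p90_missing"
    if now - stats.computed_at > MAX_P90_AGE:
        return None, "p90_stale"
    # Assumption: a missing truncation rate fails the gate.
    if stats.truncation_rate is None or stats.truncation_rate >= MAX_TRUNCATION_RATE:
        return None, "truncation_rate_too_high"
    ceiling = math.ceil(stats.p90 * stats.headroom)
    # Only ever tighten: a caller value at or below the ceiling wins.
    if caller_max_tokens is not None and caller_max_tokens <= ceiling:
        return None, "caller_ceiling_tighter"
    return ceiling, "applied"


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
stats = WorkloadStats(True, 533, now - timedelta(days=1), 0.01)
print(decide_ceiling(stats, 5000, now))  # (800, 'applied')
print(decide_ceiling(stats, 500, now))   # (None, 'caller_ceiling_tighter')
```

The two calls at the end mirror the 500 and 5000 cases above: the caller's 500 is left alone, while 5000 is tightened to 800.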
Headroom multiplier (per-workload, Session 6)
Default headroom: 1.5× p90 (OUTPUT_LENGTH_PREDICTOR_HEADROOM_MULTIPLIER). Per-workload override: workloads.output_length_headroom_multiplier (migration 0074, clamped [1.0, 3.0]). Tighter (closer to 1.0) extracts more savings but risks truncation; looser (closer to 3.0) safer but less efficient.
Why per-workload: a chat-summarisation workload with stable short outputs benefits from 1.2; a code-generation workload with high variance benefits from 2.0. Sponsor sets the trade-off; the truncation-rate gate enforces the safety floor.
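A minimal sketch of resolving the effective multiplier for a workload. The text says the override is clamped to [1.0, 3.0] by migration 0074; re-applying the clamp at read time, as done here, is an assumption, and effective_headroom is a hypothetical helper name.

```python
from typing import Optional

DEFAULT_HEADROOM = 1.5               # OUTPUT_LENGTH_PREDICTOR_HEADROOM_MULTIPLIER default
MIN_HEADROOM, MAX_HEADROOM = 1.0, 3.0  # clamp range from the text


def effective_headroom(override: Optional[float]) -> float:
    """Use the per-workload override when set, clamped to [1.0, 3.0]."""
    if override is None:
        return DEFAULT_HEADROOM
    return min(MAX_HEADROOM, max(MIN_HEADROOM, override))


print(effective_headroom(None))  # 1.5
print(effective_headroom(2.0))   # 2.0
print(effective_headroom(5.0))   # 3.0
```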
Per-provider field mapping
OpenAI + xAI + every OpenAI-compatible provider use the top-level max_tokens field — single slot, simple write. Anthropic also uses max_tokens at top level (required field; M9 reads the caller's value first).
Google Gemini uses generationConfig.maxOutputTokens (nested) or max_output_tokens (flat, depending on SDK version). The applier writes to the provider-correct slot while preserving the rest of generationConfig.
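The field mapping above can be sketched as a small applier. This is an illustration only: the provider string "gemini", the function name apply_ceiling, and the rule for choosing the flat Gemini slot (use it when the key is already present in the body) are assumptions, not the actual applier's logic.

```python
import copy


def apply_ceiling(provider: str, body: dict, ceiling: int) -> dict:
    """Return a copy of the request body with the ceiling in the provider's slot."""
    out = copy.deepcopy(body)  # leave the caller's dict untouched
    if provider == "gemini":
        if "max_output_tokens" in out:
            # Flat variant. Assumption: detected by the key already being present.
            out["max_output_tokens"] = ceiling
        else:
            # Nested variant: keep every other generationConfig field as-is.
            config = dict(out.get("generationConfig") or {})
            config["maxOutputTokens"] = ceiling
            out["generationConfig"] = config
    else:
        # OpenAI, xAI, OpenAI-compatible providers, and Anthropic: top-level field.
        out["max_tokens"] = ceiling
    return out


gemini_body = {"contents": [], "generationConfig": {"temperature": 0.2}}
print(apply_ceiling("gemini", gemini_body, 800)["generationConfig"])
# {'temperature': 0.2, 'maxOutputTokens': 800}
```

Only the token-limit slot changes; messages and other parameters pass through unmodified, in line with M9 being parameter-mutating rather than content-mutating.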
The data flywheel
Every measured request writes finish_reason + tokens_out + max_tokens_requested to optimize_savings (migration 0070, columns added 2026-05-22). The M9 cron runs daily and:
- Computes p90 of tokens_out per workload over the past 14 days.
- Computes truncation rate per workload over the past 7 days as COUNT(finish_reason='length') / COUNT(*).
- Writes both back to workloads (predicted_output_tokens_p90 + predicted_output_truncation_rate + predicted_output_tokens_computed_at).
- Triggers KV refresh so the worker reads the fresh values on the next request.
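The two cron measurements can be sketched in memory as below. This is a hedged illustration, not the cron's implementation: the created_at field, the row shape, and the nearest-rank percentile method are assumptions (the text does not say how p90 is computed), and the KV refresh step is omitted.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def nearest_rank_p90(values: List[int]) -> Optional[int]:
    """Nearest-rank 90th percentile; None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


def workload_stats(rows: List[dict], now: datetime) -> dict:
    """Compute the three values written back to the workloads row."""
    last_14d = [r for r in rows if now - r["created_at"] <= timedelta(days=14)]
    last_7d = [r for r in rows if now - r["created_at"] <= timedelta(days=7)]
    truncated = sum(1 for r in last_7d if r["finish_reason"] == "length")
    return {
        "predicted_output_tokens_p90": nearest_rank_p90(
            [r["tokens_out"] for r in last_14d]),
        "predicted_output_truncation_rate": (
            truncated / len(last_7d) if last_7d else None),
        "predicted_output_tokens_computed_at": now,
    }


now = datetime(2026, 6, 1)
rows = [{"tokens_out": 100 * i, "finish_reason": "stop",
         "created_at": now - timedelta(days=1)} for i in range(1, 11)]
print(workload_stats(rows, now)["predicted_output_tokens_p90"])  # 900
```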
Workloads with low traffic may have p90 estimates based on small sample sizes — that's why we require 14 days of freshness AND the 7-day truncation gate. Two checks limit the mis-prediction surface.
What we don’t do (yet)
- No per-request prediction. v0.1 uses workload-wide p90, with no per-prompt classifier. A future v0.2 could route long-form prompts to a higher ceiling and short-form prompts to a tighter one based on input embedding + historical pairs. Out of v0.1 scope.
- No dynamic adjustment within a session. M9 reads from KV at request time; KV updates only on cron tick or explicit refresh. A session with a sudden output-length shift sees the old prediction until the next cron.
- No auto-loosening on truncation alarms. The cron writes the new measurement and the gate decides; we do not auto-increase headroom multiplier in response to a truncation spike. Sponsor receives an anomaly and decides.
- No streaming-only adjustment. M9 fires the same on streaming and non-streaming. Streaming finish_reason arrives via the SSE parser (wired Session 5, 2026-05-24) so the cron has consistent input regardless of mode.
Verification surfaces
- Response header x-tessera-output-length-ceiling: N on every M9-applied request. Sponsors see exactly what ceiling was applied + can correlate with their own observability.
- Request log line output_length_predictor_applied with ceiling + caller_requested_max_tokens + reason (decision provenance).
- /portal/audit chip strip: m9 chip (non-mutating colour) + max_tokens_requested column showing the injected value.
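As a rough sketch of the first two surfaces, the header and log line for an applied ceiling might be assembled like this. The JSON log format, the "event" key, and the function name verification_surfaces are assumptions made for illustration; only the header name and the three log fields come from the text.

```python
import json
from typing import Dict, Optional, Tuple


def verification_surfaces(ceiling: int,
                          caller_requested_max_tokens: Optional[int],
                          reason: str) -> Tuple[Dict[str, str], str]:
    """Build the response header and a log line for an M9-applied request."""
    header = {"x-tessera-output-length-ceiling": str(ceiling)}
    log_line = json.dumps({
        "event": "output_length_predictor_applied",  # hypothetical key name
        "ceiling": ceiling,
        "caller_requested_max_tokens": caller_requested_max_tokens,
        "reason": reason,
    })
    return header, log_line


header, log_line = verification_surfaces(800, 5000, "applied")
print(header)  # {'x-tessera-output-length-ceiling': '800'}
```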
What we promise
M9 fires only when the data supports it (14-day-fresh p90 AND 7-day truncation rate < 2%), never overrides a tighter caller-set ceiling, never increases a caller's ceiling (only ever tightens), surfaces the applied ceiling on every mutated request, and respects per-workload headroom-multiplier overrides. The data flywheel is closed-loop: every measured request feeds the next prediction.