Mechanic · M9 · Parameter-mutating · Non-content

Output-length predictor

Headroom multiplier default 1.5 · per-workload override

Most LLM callers either omit max_tokens entirely, which defaults to the provider ceiling (often huge, billed at full output rate even when the actual response is short), or set it conservatively high. M9 measures the workload's observed output-length distribution over the past 14 days, sampled from optimize_savings rows that carry the provider's finish_reason, and injects a per-request max_tokens ceiling of workload p90 × headroom multiplier (default 1.5). A tight ceiling lowers the output-token bill on requests that would otherwise have padded and has no impact on requests that actually need the headroom.
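As a worked example of the ceiling arithmetic, here is a minimal sketch. The function name and the choice to round up are assumptions for illustration; the text above only states p90 × headroom multiplier.

```python
import math

DEFAULT_HEADROOM_MULTIPLIER = 1.5  # default stated in the text


def predicted_ceiling(p90_tokens: int, headroom: float = DEFAULT_HEADROOM_MULTIPLIER) -> int:
    """Return p90 x headroom as an integer token ceiling.

    Rounding up is an assumption; the source does not specify a rounding mode.
    """
    return math.ceil(p90_tokens * headroom)


print(predicted_ceiling(533))       # 800
print(predicted_ceiling(400, 2.0))  # 800
```

A workload whose p90 is 533 output tokens would get a ceiling of 800 at the default multiplier.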

M9 is parameter-mutating, not content-mutating. It changes the max_tokens parameter on the outgoing request and never touches the messages. It sits outside the M3 / M7 / M8 single-pick mutex, so it composes freely with every other mechanic.

The quality gate

M9 refuses to inject a tighter ceiling unless three things hold:

  • Workload opt-in is true (output_length_predictor_optin).
  • The 14-day-fresh p90 estimate exists. If the cron hasn't populated predicted_output_tokens_p90 yet, or the value is older than 14 days, M9 short-circuits.
  • The measured 7-day truncation rate (predicted_output_truncation_rate) is below 0.02, i.e. fewer than 2% of past requests hit the predicted ceiling. If we are frequently truncating the model at the predicted ceiling, we are mispredicting; M9 stays out of the way until the cron repopulates with fresh data.

Any failed gate results in a no-op and a logged reason. M9 also never overrides a caller's explicit max_tokens when that value is already tighter: if the customer set 500, M9's 800 ceiling is irrelevant; if the customer set 5000, M9 may tighten to 800.
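The gate logic above can be sketched as a single decision function. This is an illustrative sketch, not the production code: the WorkloadStats type, the reason strings, rounding up, and treating a missing truncation rate as a failed gate are all assumptions.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

MAX_ESTIMATE_AGE = timedelta(days=14)  # freshness window from the text
MAX_TRUNCATION_RATE = 0.02             # gate threshold from the text


@dataclass
class WorkloadStats:  # hypothetical container for the workload columns
    optin: bool
    p90: Optional[int]
    computed_at: Optional[datetime]
    truncation_rate: Optional[float]
    headroom: float = 1.5


def decide_ceiling(stats: WorkloadStats, caller_max_tokens: Optional[int],
                   now: datetime) -> Tuple[Optional[int], str]:
    """Return (ceiling or None, reason). None means no-op."""
    if not stats.optin:
        return None, "not_opted_in"
    if stats.p90 is None or stats.computed_at is None:
        return None, "no_estimate"
    if now - stats.computed_at > MAX_ESTIMATE_AGE:
        return None, "stale_estimate"
    # Assumption: a missing truncation rate fails the gate.
    if stats.truncation_rate is None or stats.truncation_rate >= MAX_TRUNCATION_RATE:
        return None, "truncation_rate_too_high"
    ceiling = math.ceil(stats.p90 * stats.headroom)
    # Only ever tighten: leave a caller value that is already at or below the ceiling.
    if caller_max_tokens is not None and caller_max_tokens <= ceiling:
        return None, "caller_ceiling_tighter"
    return ceiling, "applied"


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
stats = WorkloadStats(True, 533, now - timedelta(days=1), 0.01)
print(decide_ceiling(stats, 5000, now))  # (800, 'applied')
print(decide_ceiling(stats, 500, now))   # (None, 'caller_ceiling_tighter')
```

Returning the reason alongside the decision keeps the no-op path loggable, which matches the "no-op + log reason" behaviour described above.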

Headroom multiplier (per-workload, Session 6)

Default headroom: 1.5× p90 (OUTPUT_LENGTH_PREDICTOR_HEADROOM_MULTIPLIER). Per-workload override: workloads.output_length_headroom_multiplier (migration 0074, clamped [1.0, 3.0]). A tighter multiplier (closer to 1.0) extracts more savings but risks truncation; a looser one (closer to 3.0) is safer but less efficient.

Why per-workload: a chat-summarisation workload with stable short outputs benefits from 1.2; a code-generation workload with high variance benefits from 2.0. Sponsor sets the trade-off; the truncation-rate gate enforces the safety floor.
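To make the override and clamp concrete, here is a small sketch. The function name is hypothetical, and clamping at read time is an assumption; the source only says the column is clamped to [1.0, 3.0].

```python
import math
from typing import Optional

DEFAULT_HEADROOM = 1.5                   # OUTPUT_LENGTH_PREDICTOR_HEADROOM_MULTIPLIER
MIN_HEADROOM, MAX_HEADROOM = 1.0, 3.0    # clamp range from the text


def resolve_headroom(override: Optional[float]) -> float:
    """Use the per-workload override when set, clamped to [1.0, 3.0]."""
    if override is None:
        return DEFAULT_HEADROOM
    return min(MAX_HEADROOM, max(MIN_HEADROOM, override))


# Same p90, different trade-offs (rounding up is an assumption).
p90 = 400
for override in (None, 1.2, 2.0, 5.0):
    headroom = resolve_headroom(override)
    print(override, headroom, math.ceil(p90 * headroom))
```

With a p90 of 400, a multiplier of 1.2 yields a ceiling of 480 while 2.0 yields 800, which is the savings-versus-truncation trade-off the sponsor is choosing.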

Per-provider field mapping

OpenAI, xAI, and every OpenAI-compatible provider use the top-level max_tokens field: a single slot and a simple write. Anthropic also uses max_tokens at top level (required field; M9 reads the caller's value first).

Google Gemini uses generationConfig.maxOutputTokens (nested) or max_output_tokens (flat, depending on SDK version). The applier writes to the provider-correct slot while preserving the rest of generationConfig.
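A minimal sketch of such an applier follows. The provider labels, the function name, and the rule for choosing between the flat and nested Gemini slots (use the flat field only if the caller already sent it) are assumptions for illustration.

```python
import copy
from typing import Any, Dict


def apply_ceiling(provider: str, body: Dict[str, Any], ceiling: int) -> Dict[str, Any]:
    """Return a copy of the request body with the ceiling written to the
    provider-specific slot. Never raises an existing, tighter caller value."""
    out = copy.deepcopy(body)  # leave the caller's body untouched

    def tighten(current):
        return ceiling if current is None else min(current, ceiling)

    if provider == "gemini":
        if "max_output_tokens" in out:
            # Flat variant: assumed to apply only when the caller already used it.
            out["max_output_tokens"] = tighten(out["max_output_tokens"])
        else:
            # Nested variant: other generationConfig keys are preserved.
            cfg = out.setdefault("generationConfig", {})
            cfg["maxOutputTokens"] = tighten(cfg.get("maxOutputTokens"))
    else:
        # OpenAI, xAI, OpenAI-compatible providers, and Anthropic: top-level max_tokens.
        out["max_tokens"] = tighten(out.get("max_tokens"))
    return out


gemini_body = {"contents": [], "generationConfig": {"temperature": 0.2}}
print(apply_ceiling("gemini", gemini_body, 800))
print(apply_ceiling("openai", {"messages": [], "max_tokens": 5000}, 800))
```

Working on a deep copy keeps the messages and every unrelated parameter byte-for-byte identical, in line with M9 being parameter-mutating only.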

The data flywheel

Every measured request writes finish_reason + tokens_out + max_tokens_requested to optimize_savings (migration 0070, columns added 2026-05-22). The M9 cron runs daily and:

  • Computes p90 of tokens_out per workload over the past 14 days.
  • Computes truncation rate per workload over the past 7 days as the number of rows with finish_reason='length' divided by the total number of rows.
  • Writes both back to workloads (predicted_output_tokens_p90 + predicted_output_truncation_rate + predicted_output_tokens_computed_at).
  • Triggers KV refresh so the worker reads the fresh values on the next request.
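The two measurements in the steps above can be sketched in memory as follows. The created_at field, the row shape, and the nearest-rank percentile method are assumptions; the source does not say how p90 is computed or how rows are timestamped.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def p90_nearest_rank(values: List[int]) -> int:
    """Nearest-rank 90th percentile (method is an assumption)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


def compute_workload_stats(rows: List[Dict[str, Any]], now: datetime) -> Optional[Dict[str, Any]]:
    """rows: one workload's optimize_savings rows with created_at (hypothetical
    column), tokens_out and finish_reason."""
    last_14d = [r for r in rows if now - r["created_at"] <= timedelta(days=14)]
    last_7d = [r for r in rows if now - r["created_at"] <= timedelta(days=7)]
    if not last_14d or not last_7d:
        return None  # nothing to write; the freshness gate will eventually trip
    truncated = sum(1 for r in last_7d if r["finish_reason"] == "length")
    return {
        "predicted_output_tokens_p90": p90_nearest_rank([r["tokens_out"] for r in last_14d]),
        "predicted_output_truncation_rate": truncated / len(last_7d),
        "predicted_output_tokens_computed_at": now,
    }


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
rows = [
    {"created_at": now - timedelta(days=1), "tokens_out": t,
     "finish_reason": "length" if t == 10 else "stop"}
    for t in range(1, 11)
]
print(compute_workload_stats(rows, now))
```

Counting matching rows explicitly, rather than counting a boolean expression, avoids the common SQL pitfall where COUNT(expr) counts every non-null row.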

Workloads with low traffic may have p90 estimates based on small sample sizes — that's why we require 14 days of freshness AND the 7-day truncation gate. Two checks limit the mis-prediction surface.

What we don’t do (yet)

  • No per-request prediction. v0.1 uses workload-wide p90, with no per-prompt classifier. A future v0.2 could route long-form prompts to a higher ceiling and short-form prompts to a tighter one based on input embedding + historical pairs. Out of v0.1 scope.
  • No dynamic adjustment within a session. M9 reads from KV at request time; KV updates only on cron tick or explicit refresh. A session with a sudden output-length shift sees the old prediction until the next cron.
  • No auto-loosening on truncation alarms. The cron writes the new measurement and the gate decides; we do not auto-increase headroom multiplier in response to a truncation spike. Sponsor receives an anomaly and decides.
  • No streaming-only adjustment. M9 fires the same on streaming and non-streaming. Streaming finish_reason arrives via the SSE parser (wired Session 5, 2026-05-24) so the cron has consistent input regardless of mode.

Verification surfaces

  • Response header x-tessera-output-length-ceiling: N on every M9-applied request. Sponsors see exactly what ceiling was applied + can correlate with their own observability.
  • Request log line output_length_predictor_applied with ceiling + caller_requested_max_tokens + reason (decision provenance).
  • /portal/audit chip strip — m9 chip (non-mutating colour) + max_tokens_requested column showing the injected value.
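For illustration, the header and log line could be assembled as below. The function name, the JSON log format, and the "event" key are assumptions; only the header name, the log line name, and its three fields come from the list above.

```python
import json
from typing import Dict, Optional, Tuple

CEILING_HEADER = "x-tessera-output-length-ceiling"


def verification_artifacts(ceiling: int, caller_max_tokens: Optional[int],
                           reason: str) -> Tuple[Dict[str, str], str]:
    """Build the response header and a structured log line for an applied ceiling."""
    headers = {CEILING_HEADER: str(ceiling)}
    log_line = json.dumps(
        {
            "event": "output_length_predictor_applied",  # key name is an assumption
            "ceiling": ceiling,
            "caller_requested_max_tokens": caller_max_tokens,
            "reason": reason,
        },
        sort_keys=True,
    )
    return headers, log_line


headers, log_line = verification_artifacts(800, 5000, "applied")
print(headers)
print(log_line)
```

A sponsor can compare the header value against the usage reported by the provider to confirm which ceiling was in force for a given request.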

What we promise

M9 fires only when the workload has opted in and the data supports it (14-day-fresh p90 AND 7-day truncation rate < 2%), never overrides a tighter caller-set ceiling, never increases a caller's ceiling (only ever tightens), surfaces the applied ceiling on every mutated request, and respects per-workload headroom-multiplier overrides. The data flywheel is closed-loop: every measured request feeds the next prediction.
