What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; before forwarding each request to the provider, Tessera applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace + structural pass (LLMLingua-2 template substitution is on the roadmap), and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
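The aggregation described above can be sketched in a few lines. This is a minimal illustration, not Tessera's implementation: the RequestCost shape, its field names, and the ongoing_savings helper are hypothetical stand-ins for whatever the proxy logs actually contain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCost:
    """One proxy log row (hypothetical shape, for illustration only)."""
    workload: str
    counterfactual_usd: float  # what the request would have cost unoptimized
    actual_usd: float          # what the request actually cost


def ongoing_savings(rows, in_scope):
    """Sum (counterfactual - actual) over rows whose workload is in scope."""
    return sum(
        r.counterfactual_usd - r.actual_usd
        for r in rows
        if r.workload in in_scope
    )


rows = [
    RequestCost("chat", 0.50, 0.25),   # e.g. routed to a cheaper model
    RequestCost("chat", 0.25, 0.0),    # e.g. served from cache
    RequestCost("legacy", 1.0, 1.0),   # out-of-scope workload, ignored
]
total = ongoing_savings(rows, in_scope={"chat"})
print(f"Ongoing savings: ${total:.2f}")
```

Because the sum runs over per-request differences, a provider price change moves both the counterfactual and the actual cost and so does not register as a saving.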

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not a per-seat SaaS — and savings are measured directly from our own proxy logs. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base-URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. Plus a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers + side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route — the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.


Mechanic · M9 · Parameter-mutating · Non-content

Output-length predictor

Headroom multiplier default 1.5 · per-workload override

Most LLM callers either omit max_tokens entirely (it then defaults to the provider ceiling, which is often huge, so a response that pads is billed at the full output rate for every token it generates) or set it conservatively high. M9 measures the workload's observed output-length distribution (past 14 days, sampled from optimize_savings rows that carry the provider's finish_reason) and injects a per-request max_tokens ceiling at the workload p90 × headroom multiplier (default 1.5). A tight ceiling lowers the output-token bill on the requests that would have padded, with no impact on the requests that actually need the headroom.
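As a worked example of the ceiling arithmetic, the sketch below multiplies a workload p90 by the headroom multiplier. The function name is hypothetical, and rounding up to a whole token is an assumption; the text does not say how fractional results are handled.

```python
import math

DEFAULT_HEADROOM = 1.5  # default headroom multiplier named in the text


def predicted_ceiling(p90_tokens: int, headroom: float = DEFAULT_HEADROOM) -> int:
    """Return p90 x headroom as a whole-token ceiling.

    Rounding up is an assumption made for this sketch.
    """
    return math.ceil(p90_tokens * headroom)


print(predicted_ceiling(400))        # 400 x 1.5 = 600
print(predicted_ceiling(400, 2.0))   # 400 x 2.0 = 800
```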

M9 is parameter-mutating, not content-mutating: it changes the max_tokens parameter on the outgoing request and never touches the messages. It sits outside the M3 / M7 / M8 single-pick mutex and composes freely with every other mechanic.

The quality gate

M9 refuses to inject a tighter ceiling unless three things hold:

Workload opt-in is true (output_length_predictor_optin).
The 14-day-fresh p90 estimate exists. If the cron hasn't populated predicted_output_tokens_p90 yet, or the value is older than 14 days, M9 short-circuits.
The measured 7-day truncation rate (predicted_output_truncation_rate) is below 0.02, that is, fewer than 2% of past requests hit the predicted ceiling. If the model is frequently truncated at the predicted ceiling, the prediction is off; M9 stays out of the way until the cron repopulates with fresh data.

Any failed gate → no-op + log reason. M9 also never overrides a caller's explicit max_tokens that is already tighter than the prediction: if the customer set 500, M9's 800 ceiling is irrelevant; if the customer set 5000, M9 may tighten to 800.
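The gate and the caller-ceiling rule can be read as one decision function. The sketch below is illustrative only: WorkloadState, decide_ceiling, and the reason strings are hypothetical names, treating a missing truncation rate as a failed gate is an assumption, and so is rounding the ceiling up.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

MAX_P90_AGE = timedelta(days=14)   # freshness window from the text
MAX_TRUNCATION_RATE = 0.02         # gate threshold from the text


@dataclass
class WorkloadState:
    """Hypothetical view of the per-workload values the gate reads."""
    optin: bool
    p90: Optional[int]
    computed_at: Optional[datetime]
    truncation_rate: Optional[float]


def decide_ceiling(
    state: WorkloadState,
    caller_max_tokens: Optional[int],
    now: datetime,
    headroom: float = 1.5,
) -> Tuple[Optional[int], str]:
    """Return (ceiling, reason); ceiling is None when M9 should no-op."""
    if not state.optin:
        return None, "not_opted_in"
    if state.p90 is None or state.computed_at is None:
        return None, "p90_missing"
    if now - state.computed_at > MAX_P90_AGE:
        return None, "p90_stale"
    # Assumption: an unknown truncation rate fails the gate.
    if state.truncation_rate is None or state.truncation_rate >= MAX_TRUNCATION_RATE:
        return None, "truncation_rate_too_high"
    ceiling = math.ceil(state.p90 * headroom)
    # Never override a caller value that is already as tight or tighter.
    if caller_max_tokens is not None and caller_max_tokens <= ceiling:
        return None, "caller_ceiling_tighter"
    return ceiling, "applied"


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
state = WorkloadState(True, 400, now - timedelta(days=1), 0.01)
print(decide_ceiling(state, 5000, now, headroom=2.0))  # tightened to 800
print(decide_ceiling(state, 500, now, headroom=2.0))   # caller value kept
```

Returning a reason alongside the decision mirrors the "no-op + log reason" behaviour: every path out of the function carries the explanation that would be logged.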

Headroom multiplier (per-workload, Session 6)

Default headroom: 1.5× p90 (OUTPUT_LENGTH_PREDICTOR_HEADROOM_MULTIPLIER). Per-workload override: workloads.output_length_headroom_multiplier (migration 0074, clamped [1.0, 3.0]). Tighter (closer to 1.0) extracts more savings but risks truncation; looser (closer to 3.0) is safer but saves less.

Why per-workload: a chat-summarisation workload with stable short outputs benefits from 1.2; a code-generation workload with high variance benefits from 2.0. Sponsor sets the trade-off; the truncation-rate gate enforces the safety floor.
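Resolving the multiplier for a workload might look like the sketch below. The effective_headroom helper is hypothetical, and clamping at read time is an assumption; the text only says the stored value is clamped to [1.0, 3.0], not where that clamp is enforced.

```python
from typing import Optional

DEFAULT_HEADROOM = 1.5                   # global default from the text
MIN_HEADROOM, MAX_HEADROOM = 1.0, 3.0    # clamp range from the text


def effective_headroom(override: Optional[float]) -> float:
    """Use the per-workload override when set, clamped to the allowed range."""
    if override is None:
        return DEFAULT_HEADROOM
    return min(MAX_HEADROOM, max(MIN_HEADROOM, override))


print(effective_headroom(None))  # no override: falls back to 1.5
print(effective_headroom(1.2))   # stable short outputs: 1.2
print(effective_headroom(4.0))   # out of range: clamped to 3.0
```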

Per-provider field mapping

OpenAI, xAI, and every OpenAI-compatible provider use the top-level max_tokens field: a single slot and a simple write. Anthropic also uses max_tokens at top level (required field; M9 reads the caller's value first).

Google Gemini uses generationConfig.maxOutputTokens (nested) or max_output_tokens (flat, depending on SDK version). The applier writes to the provider-correct slot while preserving the rest of generationConfig.
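A rough sketch of the field mapping follows. The apply_ceiling function and the provider label "google" are hypothetical, and the rule for choosing between the flat and nested Gemini fields (write to the flat field only if the caller already used it) is an assumption made for illustration.

```python
import copy


def apply_ceiling(provider: str, body: dict, ceiling: int) -> dict:
    """Return a copy of the request body with the ceiling in the right slot."""
    out = copy.deepcopy(body)  # leave the caller's body untouched
    if provider == "google":
        if "max_output_tokens" in out:
            # Flat form: assumed to be in use when the key is already present.
            out["max_output_tokens"] = ceiling
        else:
            # Nested form: keep every other generationConfig key as-is.
            config = out.setdefault("generationConfig", {})
            config["maxOutputTokens"] = ceiling
    else:
        # OpenAI, xAI, OpenAI-compatible providers, and Anthropic.
        out["max_tokens"] = ceiling
    return out


gemini = {"generationConfig": {"temperature": 0.2}}
print(apply_ceiling("google", gemini, 800))
print(apply_ceiling("openai", {"model": "example-model", "max_tokens": 5000}, 800))
```

Copying the body before writing keeps the original request available for logging the caller's requested value next to the injected one.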

The data flywheel

Every measured request writes finish_reason + tokens_out + max_tokens_requested to optimize_savings (migration 0070, columns added 2026-05-22). The M9 cron runs daily and:

Computes p90 of tokens_out per workload over the past 14 days.
Computes truncation rate per workload over the past 7 days as COUNT(finish_reason='length') / COUNT(*).
Writes both back to workloads (predicted_output_tokens_p90 + predicted_output_truncation_rate + predicted_output_tokens_computed_at).
Triggers KV refresh so the worker reads the fresh values on the next request.

Workloads with low traffic may have p90 estimates based on small sample sizes. That's why we require 14 days of freshness AND the 7-day truncation gate. Two checks limit the mis-prediction surface.
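The two daily measurements can be sketched as follows. This is an illustration under stated assumptions: SavingsRow and its created_at field are hypothetical, and the nearest-rank percentile method is a choice made here, since the text does not say how p90 is computed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SavingsRow:
    """Hypothetical slice of an optimize_savings row for one workload."""
    created_at: datetime
    tokens_out: int
    finish_reason: str


def nearest_rank_p90(values):
    """90th percentile by the nearest-rank method (an assumed choice)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


def workload_stats(rows, now):
    """Return (p90 over 14 days, truncation rate over 7 days)."""
    last_14 = [r for r in rows if now - r.created_at <= timedelta(days=14)]
    last_7 = [r for r in rows if now - r.created_at <= timedelta(days=7)]
    p90 = nearest_rank_p90([r.tokens_out for r in last_14]) if last_14 else None
    if last_7:
        truncated = sum(r.finish_reason == "length" for r in last_7)
        rate = truncated / len(last_7)
    else:
        rate = None
    return p90, rate


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
rows = [SavingsRow(now - timedelta(days=1), t, "stop") for t in range(100, 1000, 100)]
rows.append(SavingsRow(now - timedelta(days=1), 1000, "length"))
rows.append(SavingsRow(now - timedelta(days=10), 50, "length"))  # p90 window only
print(workload_stats(rows, now))
```

With only a handful of rows, a single long response moves the p90 noticeably, which is the small-sample risk the two checks are meant to contain.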

What we don’t do (yet)

No per-request prediction. v0.1 uses workload-wide p90. No per-prompt classifier. A future v0.2 could route long-form prompts to a higher ceiling and short-form prompts to a tighter one based on input embedding + historical pairs. Out of v0.1 scope.
No dynamic adjustment within a session. M9 reads from KV at request time; KV updates only on cron tick or explicit refresh. A session with a sudden output-length shift sees the old prediction until the next cron.
No auto-loosening on truncation alarms. The cron writes the new measurement and the gate decides; we do not auto-increase headroom multiplier in response to a truncation spike. Sponsor receives an anomaly and decides.
No streaming-only adjustment. M9 fires the same on streaming and non-streaming. Streaming finish_reason arrives via the SSE parser (wired Session 5, 2026-05-24) so the cron has consistent input regardless of mode.

Verification surfaces

Response header x-tessera-output-length-ceiling: N on every M9-applied request. Sponsors see exactly what ceiling was applied + can correlate with their own observability.
Request log line output_length_predictor_applied with ceiling + caller_requested_max_tokens + reason (decision provenance).
/portal/audit chip strip - m9 chip (non-mutating colour) + max_tokens_requested column showing the injected value.
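On the caller side, correlating the applied ceiling with your own telemetry could start with a helper like the one below. It is a sketch: applied_ceiling is a hypothetical function, and it assumes the response headers are available as a plain mapping of names to string values.

```python
from typing import Mapping, Optional

CEILING_HEADER = "x-tessera-output-length-ceiling"  # header named in the text


def applied_ceiling(headers: Mapping[str, str]) -> Optional[int]:
    """Return the ceiling M9 applied, or None if the header is absent."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(CEILING_HEADER)
    return int(raw) if raw is not None else None


print(applied_ceiling({"X-Tessera-Output-Length-Ceiling": "800"}))
print(applied_ceiling({"content-type": "application/json"}))
```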

What we promise

M9 fires only when the data supports it (14-day-fresh p90 AND 7-day truncation rate < 2%), never overrides a tighter caller-set ceiling, never increases a caller's ceiling (only ever tightens), surfaces the applied ceiling on every mutated request, and respects per-workload headroom-multiplier overrides. The data flywheel is closed-loop: every measured request feeds the next prediction.
