What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; before forwarding each request to the provider, Tessera applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace + structural pass (LLMLingua-2 template substitution is on the roadmap), and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
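
As a rough illustration of the arithmetic described above, the aggregate is a filtered sum over per-request log rows. The row shape and field names here are hypothetical and are not Tessera's actual log schema.

```typescript
// Hypothetical shape of one proxy log row (illustrative field names).
interface SavingsRow {
  workloadId: string;
  counterfactualCostUsd: number; // what the request would have cost unoptimized
  actualCostUsd: number;         // what was actually incurred
  inScope: boolean;              // false for workloads excluded from the report
}

// Sum of (counterfactual minus actual) across in-scope rows for a period.
function aggregateSavings(rows: SavingsRow[]): number {
  return rows
    .filter((r) => r.inScope)
    .reduce((sum, r) => sum + (r.counterfactualCostUsd - r.actualCostUsd), 0);
}
```

A row where the two costs are equal contributes zero, so requests that no mechanic touched do not inflate the total.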

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer: we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not per-seat SaaS, and savings are measured directly from our own proxy logs. Your existing observability tool can stay; it still receives downstream telemetry. Tessera is the layer above whatever vendors you already use, not a replacement for them.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Who is Tessera built for?

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base-URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. There is also a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.


Mechanic · M3 · Content-mutating

Compress

Per-role split since 26 May 2026 · Chars-capture since 26 May 2026

M3 strips redundant whitespace, collapses repeated newlines, and normalises structural noise from the outgoing request body before the upstream call. Code fences, JSON payloads, and tool-call / tool-result blocks are preserved verbatim. The compression is deterministic. Given the same input it produces byte-identical output every time. Sponsors see the before/after char counts on /portal/audit per request.

M3 is a content-mutating mechanic, the most invasive category Tessera ships. It also yields the most variable savings: 5–15% on workloads with bloated system prompts, near-zero on already-tight ones. Conservative by default: when the body eligibility heuristic fails, M3 stays out of the way. This page documents the heuristic, the per-role split shipped 2026-05-26, and the explicit guardrails that prevent compress from compounding with the other content-mutating mechanics.

Per-role opt-in (split 26 May 2026)

Pre-split: a single auto_compress_enabled toggle covered everything: system prompt, user turns, and assistant turns. A workload whose stable, expensive system prompt benefited from compression, but whose user content was sensitive and variable, had no way to enable one without the other.

Post-split (migration 0076): two independent toggles. compress_system_enabled covers messages with role: 'system' AND the top-level Anthropic-shape system field; compress_user_turns_enabled covers messages with role IN (user, assistant). Either-true → stage active. Both-true → both compressed (matches legacy behaviour). Both-false → stage skipped entirely. The legacy master toggle is read-side compatible for KV rollback safety, but worker 0.41.0+ no longer consults it.
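
The toggle semantics above can be sketched as two small predicates. The toggle names come from this page; the function names and the treatment of roles other than system, user, and assistant are assumptions made for illustration.

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface CompressToggles {
  compress_system_enabled: boolean;     // also covers the top-level Anthropic-shape system field
  compress_user_turns_enabled: boolean; // covers role user and role assistant
}

// The M3 stage runs if either toggle is on; both off skips the stage entirely.
function stageActive(t: CompressToggles): boolean {
  return t.compress_system_enabled || t.compress_user_turns_enabled;
}

// Whether a message with the given role may be compressed under these toggles.
function roleEligible(role: Role, t: CompressToggles): boolean {
  if (role === "system") return t.compress_system_enabled;
  if (role === "user" || role === "assistant") return t.compress_user_turns_enabled;
  return false; // assumption: any other role is left untouched
}
```

With only compress_system_enabled set, a stable system prompt is compressed while user and assistant turns pass through unchanged, which is the case the pre-split toggle could not express.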

Eligibility heuristic

Pre-detect runs in the worker's Stage-3 candidate-detection block before any route decision, so the composition cap can correctly take M1 off the table when a mutation is pending. M3 eligibility is two filters:

body_chars >= COMPRESS_MIN_BODY_CHARS (default ~512). Shorter bodies don't accumulate enough redundancy to justify a content mutation.
redundant_whitespace_ratio >= COMPRESS_MIN_REDUNDANT_PCT. We measure runs of consecutive spaces / repeated newlines / tab blocks as a fraction of total body chars. Tight prompts with no redundancy bypass compression.

Both filters must pass. A body that's long but tight (a single large code block, an embedded JSON payload) skips M3. A body that's redundant but short (boilerplate user-message under 512 chars) also skips. We'd rather miss a tiny win than mutate a request that wasn't going to save much.
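
A minimal sketch of the two-filter check follows. The constant names match this page, but the redundancy threshold value, its interpretation as a fraction rather than a percentage, and the exact definition of a "redundant run" are assumptions; the real worker may measure these differently.

```typescript
const COMPRESS_MIN_BODY_CHARS = 512;     // this page gives the default as roughly 512
const COMPRESS_MIN_REDUNDANT_PCT = 0.05; // placeholder value, treated as a fraction here

// Fraction of body chars that sit inside runs of 2+ spaces, 2+ newlines, or tabs.
function redundantWhitespaceRatio(body: string): number {
  if (body.length === 0) return 0;
  const runs = body.match(/ {2,}|\n{2,}|\t+/g) ?? [];
  const redundantChars = runs.reduce((n, run) => n + run.length, 0);
  return redundantChars / body.length;
}

// Both filters must pass for M3 to be considered.
function isCompressEligible(body: string): boolean {
  return (
    body.length >= COMPRESS_MIN_BODY_CHARS &&
    redundantWhitespaceRatio(body) >= COMPRESS_MIN_REDUNDANT_PCT
  );
}
```

A long but tight body fails the second filter, and a short but redundant body fails the first, matching the two skip cases described above.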

What compression preserves

The applier (lib/optimization/compress-applier.ts) is conservative by structural inspection, not by heuristic:

Triple-backtick code fences. Content inside ```...``` passes through verbatim. Indentation, blank lines, and duplicate spaces inside code are semantically load-bearing for many languages (Python, YAML).
JSON payloads. When a message content is structured JSON (detected by leading { or [ with balanced braces), the applier skips it. Compressing JSON whitespace can subtly break downstream consumers that pattern-match on key spacing.
Tool-call and tool-result blocks. Anthropic tool_use / tool_result content arrays are passed through structurally; only free-text fields inside them are eligible.

What does change: runs of 2+ spaces collapse to 1; runs of 3+ newlines collapse to 2; trailing whitespace on lines is stripped. Leading whitespace on lines is left intact (it's load-bearing for markdown lists and outline structure).
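
The three whitespace rules and the code-fence exemption can be sketched as below. This is not the code in compress-applier.ts: it is an illustrative approximation that only handles closed triple-backtick fences, does not implement the JSON or tool-block checks, and leaves whitespace directly before a fence untouched. All function names are hypothetical.

```typescript
// Matches a closed triple-backtick fence; the capture group keeps fences in split() output.
const FENCE = new RegExp("(`{3}[\\s\\S]*?`{3})", "g");

function compressPlain(segment: string, isLastSegment: boolean): string {
  // Strip trailing whitespace on lines (whitespace that precedes a newline).
  let out = segment.replace(/[ \t]+(?=\n)/g, "");
  // Also strip it at the very end of the body.
  if (isLastSegment) out = out.replace(/[ \t]+$/, "");
  return out
    // Collapse runs of 2+ spaces that follow a non-space char; leading indent is kept.
    .replace(/(\S) {2,}/g, "$1 ")
    // Collapse runs of 3+ newlines to exactly 2.
    .replace(/\n{3,}/g, "\n\n");
}

function compressText(text: string): string {
  // Odd indices are fenced blocks and pass through verbatim.
  const parts = text.split(FENCE);
  return parts
    .map((part, i) => (i % 2 === 1 ? part : compressPlain(part, i === parts.length - 1)))
    .join("");
}
```

Because every step is a pure string rewrite, the same input always yields the same output, and running the function on its own output changes nothing further.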

Composition cap interaction

M3 is one of three content-mutating mechanics: M3 compress, M7 context prune, and M8 structured output. The proxy enforces at most one mutation per request, with priority M7 > M3 > M8. If M7 already fired, contentMutationApplied = true short-circuits M3. If M3 fires, contentMutationApplied = true short-circuits M8 below.

The composition cap is documented on the parent /how-it-works page. M1 auto-route is also disabled when ANY mutating mechanic is about to fire. Cost-reduction stacked on prompt-mutation multiplies quality risk in a way the canary cannot easily decompose.
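
A minimal sketch of the cap, assuming each mechanic has already reported whether it is eligible for the request. The contentMutationApplied flag and the priority order come from this page; the function and type names are hypothetical.

```typescript
type Mutator = "M7" | "M3" | "M8";

// Priority order stated on this page: M7 > M3 > M8.
const PRIORITY: Mutator[] = ["M7", "M3", "M8"];

interface MutationPlan {
  fired: Mutator | null;     // the single mutation allowed on this request, if any
  autoRouteAllowed: boolean; // M1 is off whenever a mutation is about to fire
}

function planMutations(eligible: Mutator[]): MutationPlan {
  let contentMutationApplied = false;
  let fired: Mutator | null = null;
  for (const m of PRIORITY) {
    if (contentMutationApplied) break; // short-circuit lower-priority mechanics
    if (eligible.includes(m)) {
      fired = m;
      contentMutationApplied = true;
    }
  }
  return { fired, autoRouteAllowed: !contentMutationApplied };
}
```

For a request where both M3 and M8 are eligible, only M3 fires and M1 auto-route is disabled; when nothing is eligible, auto-route remains available.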

Char counts on every M3 request

Migration 0077 (2026-05-26) added two columns to optimize_savings: compress_chars_before and compress_chars_after. Both are JSON-stringify lengths captured immediately before and after the applier ran. NULL on non-M3 requests; integer pair on M3 requests.

Two surfaces consume these values: the per-row chip strip on /portal/audit (shows the delta), and the weekly compress-ratio widget on the operator dashboard (/optimize/quality/[workload_id] drill-down). Workloads with a 7-day rolling SUM(chars_before)/SUM(chars_after) near 1.0 are candidates for M3 disablement: the mechanic is firing but not producing real reduction. We surface that signal to the operator rather than auto-disabling, because consistent small reductions across many requests can still aggregate to material savings.
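
The rolling ratio can be computed as a ratio of sums over the window, skipping non-M3 rows where both columns are NULL. In this sketch the column names follow migration 0077, while the timestamp field, the function names, and the 1.02 "near 1.0" threshold are assumptions.

```typescript
interface CharRow {
  createdAtMs: number;                  // hypothetical timestamp field, epoch milliseconds
  compress_chars_before: number | null; // NULL on non-M3 requests
  compress_chars_after: number | null;
}

const WINDOW_MS = 7 * 24 * 60 * 60 * 1000;

// SUM(chars_before) / SUM(chars_after) over the trailing 7 days; null if no M3 rows.
function rollingCompressRatio(rows: CharRow[], nowMs: number): number | null {
  let before = 0;
  let after = 0;
  for (const r of rows) {
    if (r.createdAtMs <= nowMs - WINDOW_MS || r.createdAtMs > nowMs) continue;
    if (r.compress_chars_before === null || r.compress_chars_after === null) continue;
    before += r.compress_chars_before;
    after += r.compress_chars_after;
  }
  return after === 0 ? null : before / after;
}

// Flags a workload for operator review; it does not disable anything.
function isDisableCandidate(ratio: number | null, threshold = 1.02): boolean {
  return ratio !== null && ratio < threshold;
}
```

Using a ratio of sums rather than an average of per-request ratios weights large bodies more heavily, which tracks actual char reduction more closely.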

What we don’t do (yet)

Five explicit non-features:

No semantic compression. v0.1 is structural normalisation, not paraphrase-and-shorten. We do not invoke a secondary LLM to rewrite the prompt in fewer tokens.
No LLMLingua-2. ADR-0003 documents the server-side LLMLingua-2 path: selective token removal guided by a small classifier model. Live customer evaluation gates shipping; until then we use only the deterministic heuristic.
No per-message redaction. M3 does not remove content because it's "unimportant". Every redundancy reduction is whitespace / structural; sentences survive verbatim.
No streaming-side compression. The applier runs once on the outgoing body. Streaming responses come back from the provider untouched.
No back-pressure. If the applier throws or takes longer than expected, the worker discards the result and forwards the original body. We never let M3 break a request.
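
The fail-open behaviour in the last item can be sketched as a wrapper around the applier. The wrapper name, the time-budget parameter, and the injectable clock are assumptions for illustration; note that a synchronous check like this can only discard a late result, not interrupt a running applier.

```typescript
interface ApplyResult {
  body: string;     // what gets forwarded upstream
  applied: boolean; // false means the original body was forwarded
}

function applyFailOpen(
  body: string,
  applier: (b: string) => string,
  budgetMs: number,
  now: () => number = Date.now, // injectable clock so the budget path is testable
): ApplyResult {
  const start = now();
  try {
    const result = applier(body);
    // Over budget: discard the result and forward the original body.
    if (now() - start > budgetMs) return { body, applied: false };
    return { body: result, applied: true };
  } catch {
    // Applier threw: forward the original body unchanged.
    return { body, applied: false };
  }
}
```

Either failure path returns the untouched body, so the request proceeds exactly as it would have with M3 switched off.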

Verification surfaces

Request log line compress_applied on every M3 fire. Carries chars_before, chars_after, chars_saved, pct_saved, compress_system, compress_user_turns flags.
/portal/audit - the m3 chip in the canonical mechanics_stack. Mutating-mechanic chip colour (warning amber) distinguishes M3 from non-mutating siblings.
Operator dashboard /optimize/quality/[workload_id] - 7-day rolling compress ratio per workload, plus a per-stack breach timeline.
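
For orientation, a compress_applied record could be assembled as below. The field names are the ones listed above and the JSON-stringify length measure matches the char-count section of this page, but the record shape, the builder function, and the pct_saved formula (saved chars as a percentage of chars_before) are assumptions.

```typescript
interface CompressAppliedLog {
  event: "compress_applied";
  chars_before: number;
  chars_after: number;
  chars_saved: number;
  pct_saved: number;
  compress_system: boolean;
  compress_user_turns: boolean;
}

function buildCompressLog(
  bodyBefore: object,
  bodyAfter: object,
  flags: { compress_system: boolean; compress_user_turns: boolean },
): CompressAppliedLog {
  // Char counts are JSON-stringify lengths taken before and after the applier.
  const chars_before = JSON.stringify(bodyBefore).length;
  const chars_after = JSON.stringify(bodyAfter).length;
  const chars_saved = chars_before - chars_after;
  // Assumed formula: saved chars as a percentage of the original length.
  const pct_saved = (100 * chars_saved) / chars_before;
  return { event: "compress_applied", chars_before, chars_after, chars_saved, pct_saved, ...flags };
}
```

Deriving chars_saved and pct_saved from the two stored counts means any audit row can be checked by recomputing them from compress_chars_before and compress_chars_after.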

What we promise

M3 is opt-in per role, deterministic, structurally conservative, composition-capped to never stack with another mutation, and surfaced on every audit row. We never claim a saving we can't point at. Every M3 row carries the byte-exact before/after counts as a database column.
