What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace + structural pass (LLMLingua-2 template substitution on roadmap), auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not a per-seat SaaS — and savings are measured directly from our own proxy logs. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base-URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. Plus a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers + side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route — the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

← Tessera Blog

2026-05-19·8 min read

The prompt bloat you don’t see — and what it’s costing you

Question for the CTO reading this: tell me, in tokens, how big your production system prompt was last Tuesday. Tell me how big it was three months ago. Tell me what changed. If you can't answer in under a minute, you are running blind on roughly 60-80% of your input-token cost.

The cost story most engineers tell themselves about LLM inference is about completions — the user types a question, the model writes an answer, the answer costs money. The completion cost is usually 20-30% of the bill. The other 70-80% is the system prompt and the retrieved-context blob you attach to every single call. Neither of those is “the model’s fault.” They are yours.

We added prompt-size capture to every Tessera proxy request a couple of weeks ago. The column is named compress_chars_before and compress_chars_after on the savings table — exact character counts of the in-flight prompt at the proxy boundary, before and after compression. The purpose was to validate that our compression mechanic was actually doing what we said. What we saw in the data ended up being more interesting.

Why prompts grow

Three patterns show up consistently in the workloads we proxy. Once you start watching for them, you see them everywhere.

Feature patching.Every time the LLM misbehaved in production — wrong tone, wrong format, hallucinated a feature, ignored a constraint — someone added a paragraph to the system prompt. “You must respond in JSON.” “Do not mention pricing.” “Never apologize.” Each paragraph individually is fine. Over six months, those paragraphs accumulate. By month nine the prompt has fourteen “rules,” six of which contradict each other, and nobody is sure which ones are still load-bearing. Token count: up 3-4x from the original.

Multi-author drift.Three engineers, three quarters, three writing styles. The original prompt was tight and Socratic. Engineer two prefers numbered lists. Engineer three prefers narrative paragraphs. The current production prompt is a Frankenstein of all three, with vestigial sections nobody owns. None of the three engineers feels authorized to delete the others' work without a meeting.

RAG context inflation.The retrieval system returns top-K chunks; over time, “K=5” becomes “K=10” in response to a recall complaint, then “K=15” in response to an edge-case complaint. Each chunk is 800-1500 tokens. Suddenly every call carries 12-20 KB of retrieved context. The recall metric inched up; the cost line ballooned. No one was watching both at once.

What the chars-capture data shows

With compress_chars_before / compress_chars_after on every request, you can ask three questions of any workload quickly: (1) how big are the prompts in steady state, (2) what compression ratio is achievable without semantic loss, (3) how much is the absolute character count drifting week-over-week.

Across the workloads in beta we see three rough shapes.

Stable bloat.Prompt is huge but not growing — flat line at 10-18 KB. Compression yields 8-15% (the easy wins: extra whitespace, redundant punctuation, double-newlines that should be single). Quality canary holds. This is the “cost-only” pattern — turn on M3 with compress_system_enabled = true, take the 10% savings, move on.

Drift bloat. Prompt growing 5-12% per month. Visible as a sloped line in the weekly-compress-ratio chart. This is the dangerous pattern. The growth is gradual enough that nobody notices in a single sprint review, but the per-1k-conversation cost is silently up 40-60% year-over-year. M3 here is treatment, not cure — the underlying drift needs an owner.

Silent regression.Prompt size jumps overnight, usually from a deploy. New feature shipped; a developer pasted in a 4 KB block of examples; nobody pushed back in review because the quality canary on staging looked fine. The before/after ratio chart spikes. M3's alert ping fires. The owning team gets to choose: revert, accept, or refactor.

Per-role compression — system vs user

When we first shipped M3 (heuristic compression), it had a single on/off toggle. That was wrong. The reason: system prompts and user turns have different shapes and different tolerance for normalization.

System prompts are usually long, sometimes templated, often generated from a CMS, frequently full of incidental whitespace. Compression on them is high-yield and low-risk; the model ingested them as instructions, not as content to be quoted back.

User turns are different. Users include code blocks, formatted tables, paths, IDs, and snippets where every whitespace character is significant. Compressing user turns is feasible but fragile — a too-aggressive ratio can break a copy-pasted regex. Defaulting to off is the safer call.

So we split the toggle. Workloads now configure compress_system_enabled and compress_user_turns_enabled independently. Default config for new workloads: system on, user off. Customers can flip either independently from /portal/optimize/quality.

The kill-switch is per-role too. If a stack's quality canary drops below 0.95 specifically on user-compression requests, only the user-compression flag flips off; system compression keeps running. Surgical, not nuclear. We learned this the hard way — early kill-switches were repo-wide, which meant a regression in one cohort took out savings for every other cohort that was perfectly happy.

The quality canary keeps us honest

The compression ratio is one half of the chars-capture story. The other half is the daily quality canary that runs against a customer-supplied promptfoo eval set, with 10% sample from production traffic, scored against the baseline (non-compressed) output.

The canary measures by stack — every combination of mechanics active for a workload (e.g. m1+m3-system, m1+m3-user, m1+m3-system+m3-user). The SLA floor per stack is 0.95. Two consecutive days under floor auto-disables the specific stack. The customer gets a 10% prorated credit on that month's subscription and an email pointing at the audit ledger row that triggered the rollback.

The reason for the per-stack granularity is exactly the failure mode we mentioned above: nuclear kill-switches penalize cohorts that did nothing wrong. If m3-user regresses on workload X, only m3-user flips off for workload X. Auto-route stays on. System-compress stays on. The customer keeps 80% of their savings while we investigate.

What this is worth in dollars

Concrete numbers from the customer-support workload we walked through in the $24k → $9.4k post:

System prefix was ~6 KB stable. Compression ratio: 0.93 (7% reduction). Dollar value of that 7%: $350/mo on a $24k bill.
User turns were already canonical (frontend formatter normalized them). Compression of user turns: not applied.
But: prompt-cache headers (M6) operate on the system prefix. 71% of queries hit a cached prefix — input billed at 10% of normal rate. That layer is worth ~$2,400/mo on the same bill.

On any single workload, M3 compression by itself looks like a small win. The leverage is that M3 stacks with M6 and withM1 — every 1% reduction in system prefix size is a 1% reduction in cache-rewrite cost, route-budget threshold, and downstream token math. The numbers compound. Without chars-capture, you would never know whether the compression ratio actually held month-over-month or silently drifted to 0.97, eroding the stack's value.

What to do next

You do not have to use Tessera to take the observation seriously. The cheap first step is to instrument your own production traffic to log len(system_prompt) and len(messages_serialized) per request. Plot the week-over-week trend. If you see a slope, you have a drift bloat problem. If you see a flat 10-18 KB line, you have a stable bloat problem. Either way you know something you did not know yesterday.

If you want the chars-capture, per-role compression, and per- stack quality canary out of the box, the free Dev tier on Tessera handles all of it without writing any of that code yourself. 60M tokens per month, no card, 30-second signup. The dashboard /portal/optimize/quality shows the chars-before and chars-after lines for every workload you wire up; the audit ledger explains every cost-delta number end-to-end.

Paid tiers are a flat monthly subscription by token volume: Starter $199 (≤1B) / Growth $999 (≤5B) / Scale $3,999 (≤20B). The Free Sandbox covers up to 60M tokens/mo at $0. If the math doesn't work for you after a 7-day run, kill-switch in settings reverts to passthrough.

References

Stop running blind on system-prompt cost

60M tokens free, chars-before/after capture out of the box

See your week-over-week prompt size, compression ratio, and per-stack quality canary. Free Sandbox tier, no card, 30-second signup.

Get free API key→