What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace + structural pass (LLMLingua-2 template substitution on roadmap), auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not a per-seat SaaS — and savings are measured directly from our own proxy logs. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base-URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. Plus a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers + side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route — the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

FinOps · 21 May 2026 · ~12 min read

AI cost management in 2026: how proxies, compressors, and routers compose into a substrate

AI inference is the second-fastest-growing line on most B2B SaaS P&Ls in 2026. The discipline that handles it has a name now — AI FinOps — and a stack. This post documents the stack: five families of approach, what each one actually does, when each one fires, the realistic savings range per family, and how a substrate proxy composes them into one cascade under a single eval contract.

The thesis up front: the LLM cost stack is composable. Picking one vendor and ignoring the others leaves money on the table. Picking the whole stack without a contract layer turns the savings into an audit problem nobody can defend at quarterly review. The substrate-proxy pattern threads that needle by acting as the single eval-gated cascade that every other layer slots into.

The five families

Most of what gets sold under the "LLM cost optimization" banner falls into one of five families. They differ on where they intercept the request, what they change, and how they bill.

LLM gateways. Per-token-fee proxies that normalize API shapes across providers, do load-balancing and routing, and add observability. Examples: LiteLLM, Kong AI Gateway, Portkey. Billing: per-token surcharge or per-month seat.
Substrate proxies. Eval-gated request-path optimizers that run a cascade of conservative mechanics — routing, caching, compression, batching, output-length control — and measure the savings against the model the customer originally asked for. Example: Tessera. Billing: flat monthly subscription by token volume; the customer keeps 100% of the savings.
Specialist compressors. Single-mechanic vendors that implement prompt compression as their only product. Examples: TwoTrim, leanctx, The Token Company (TTC), LLMLingua-2 (Microsoft Research / Azure). Billing: per-token rate.
Specialist routers and fine-tuners. Single-mechanic vendors that handle model routing, distillation, or cheap-inference hosting. Examples: Unify, OpenPipe, NotDiamond, OpenRouter, Together AI, Fireworks, Chamber, RouteLLM. Billing: per-token rate or per-saved-dollar.
Observability and eval platforms. Read-side platforms that ingest request data and tell you what you spent, how prompts performed, and whether quality drifted. Examples: Helicone, Langfuse, Arize, PostHog AI, Datadog LLM Observability, promptfoo, Braintrust, LangSmith. Billing: per-seat or per-ingested-event.

A real production stack typically uses one from each family. The mistake is assuming they substitute for each other.

Where in the request path each family operates

Family	Intercept point	What it changes	Typical savings range
LLM gateways	Wire format / provider URL	Provider abstraction, load-balancing, retry	0% (often negative after fee — bought for governance, not cost)
Substrate proxies	Wire format / request body	Cascade of mechanics under eval contract	25-60% on typical SaaS LLM workloads
Specialist compressors	Prompt content	Tokens-per-prompt count via ML compression	5-25% on prompt tokens; quality-sensitive
Specialist routers + fine-tuners	Model selection	Choose a cheaper model with comparable output	10-40% when the model surface allows the swap
Observability + evals	Read side (logs / traces)	Nothing in the request; surfaces data	0% direct; necessary input to other layers' decisions

Read the table column by column and the composition logic falls out. A gateway gives you provider abstraction and governance; it does not save tokens. A substrate runs the cascade that does. Specialist vendors are candidate implementations for specific mechanic slots in the cascade. Observability tells everyone what happened. Each family does work the others structurally cannot.

Worked example — a customer-support agent on gpt-4o

Concrete numbers help. Take a B2B SaaS customer-support agent running on gpt-4o at 5 billion tokens per month, 70% input / 30% output mix. At list prices that bill is $24,000 / month at the start of the period. The stack looks like this:

Mechanic	What it does	Savings on $24k baseline
M2 — Exact-match cache	35% hit rate on FAQ-style repeat queries	−$8,400
M6 — Provider prompt cache	90% off cached system + tool prefix on Anthropic; 50% on OpenAI	−$3,200
M1 — Auto-route	gpt-4o → gpt-4o-mini on classification + drafting sub-tasks (canary 0.96)	−$1,800
M3 — Compress	System prompt + tool descriptions, per-role opt-in	−$700
M9 — Output-length ceiling	max_tokens = p90 × 1.3 per workload	−$400
M10 — Batch arbitrage	50% off on the analytics-ingest sub-workload (async-tolerant)	−$100
Total measured savings		−$14,600
Tessera subscription (Growth tier, flat — this workload runs 1.2B tokens/mo)		$999 / month
Savings the customer keeps	100% of the measured savings	$14,600 / month

Quality canary across the full mechanic stack held mean-score 0.96 for all 30 days; the per-stack 0.95 floor never tripped. Full walkthrough on the worked-numbers post.

Notice three things. First, no single mechanic accounts for more than 60% of the savings — the cascade matters. Second, the routing swap is the smallest, not the largest, line; cache hits dominate on workloads with prompt repetition. Third, the total is reported against the model the customer originally asked for — not against some generic reference rate — which is what the substrate's audit-immutable contract is for.

The audit-immutable contract

A savings claim that the customer cannot independently re-derive is a marketing claim. A substrate proxy fixes that with two pricing snapshots per request: original_cost_usd priced at the requested-model catalog rate, and actual_cost_usd priced at the actual-model rate after routing + cache hits + provider discounts. Both rates are pinned to a pricing_catalog snapshot version captured at request time. Mid-contract provider price changes do not retroactively alter past savings.

The customer can export the audit ledger and re-run the math at any time. Because Tessera bills a flat monthly subscription rather than a cut of the savings, the audit isn't protecting an invoice — it proves the reported savings are real, which is what separates a substrate proxy from a gateway that sells observability of cost without measuring it.

Composition rules that protect quality

A naive stacking of all five mechanics on every request produces worst-case behaviour: compression on a structured-output workload destroys the schema, context pruning on a multi-turn dialog drops the user's last instruction, batching on a real-time chat breaks UX. The substrate contract has three guardrails baked in.

Composition cap. No more than two content-mutating mechanics may fire on the same request. The cap is enforced at the mechanic-selection layer before any provider call. A workload eligible for M3 + M7 + M8 will see at most two of those three actually run, and which two is chosen by per-stack canary evidence.

Per-stack quality floor. Every mechanic combination that fires on a workload is tracked as a distinct stack (m1+m6+m9, m2+m3+m6, etc.). The canary cron runs the customer's eval set against each live stack daily; the rolling 3-day mean score must hold ≥ 0.95. Below that, the stack auto-rolls back to baseline and that month's subscription earns a 10% prorated credit.

Hard bypasses. Compliance-tier-regulated workloads short-circuit routing entirely. Streaming-required workloads bypass batching. Tool-using workloads bypass non-tool-capable route targets. The rules are documented per mechanic on how-it-works and enforced in worker code, not config.

When each family fires in a real deployment

Gateway fires when the application needs unified provider abstraction or governance — e.g., enforcing data-residency-aware routing, swapping providers without code change, applying per-team rate limits.
Substrate fires on every request — that is the whole value proposition. It is the layer that runs the cascade and emits the audit row.
Specialist compressor fireswhen the substrate's built-in M3 implementation underperforms on a specific workload and a third-party compressor wins the canary eval for that workload. Plugs in via the substrate's backend-marketplace pattern.
Specialist router / fine-tuner fireswhen the substrate's built-in M1 ROUTE_TABLE leaves money on the table for a workload that a learned router or distilled model can serve at lower cost. Same plug-in pattern.
Observability fires on every request, on the read side — never in the path that matters for latency.

What this implies for buyers

A buyer evaluating "which LLM cost vendor do I pick" should pick two: a substrate (to own the eval contract and the cascade) and an observability platform (to own the read side). Specialists become candidates inside the substrate when the per-workload canary points at them. The gateway question is orthogonal — buy one if you have a governance need; skip it if the substrate already gives you the provider abstraction you need (it does in most cases).

A buyer evaluating "is it worth the integration effort" should look at the typical-savings band per family and back out the dollar number. A B2B SaaS spending $50k/month on LLMs sees realistic substrate savings of $12.5k-$30k/month; a $200k/month workload sees $50k-$120k. The free tier covers low-volume workloads entirely, so the integration test costs nothing.

What this implies for vendors

A vendor shipping a single mechanic well — compression, routing, fine- tuning, cache, length prediction — should treat the substrate as a distribution channel, not a competitor. The interface to plug in is small and documented. The substrate brings the audit contract, the eval gate, and the billing pipeline. The specialist brings the implementation.