What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera, and Tessera applies four moves before forwarding each request to the provider: auto-route to a cheaper model when quality holds on your golden-set eval; auto-cache identical requests at the edge; auto-compress prompts via a deterministic whitespace and structural pass (LLMLingua-2 template substitution is on the roadmap); and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
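The aggregation described above can be sketched in a few lines. This is a minimal illustration, not Tessera's ledger schema: the `RequestLog` class, its field names, and the sample dollar amounts are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RequestLog:
    """One proxy log row. Field names are illustrative assumptions."""
    workload: str
    counterfactual_usd: float  # what the request would have cost unoptimized
    actual_usd: float          # what was actually incurred upstream
    in_scope: bool = True      # out-of-scope workloads are excluded


def ongoing_savings(rows: List[RequestLog]) -> float:
    """Sum (counterfactual - actual) over in-scope rows only."""
    return sum(r.counterfactual_usd - r.actual_usd for r in rows if r.in_scope)


rows = [
    RequestLog("chat", 0.040, 0.010),                    # routed to a cheaper model
    RequestLog("chat", 0.020, 0.000),                    # served from cache
    RequestLog("legacy", 0.050, 0.030, in_scope=False),  # not counted
]
print(round(ongoing_savings(rows), 4))
```

Because each row carries its own counterfactual, a provider price change alters both numbers on later rows and does not show up as a saving.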

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not a per-seat SaaS — and savings are measured directly from our own proxy logs. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.
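As a rough illustration of the tier table above, the lookup below maps a monthly gross token count to a tier name and price. The function name is hypothetical, and treating each cap as inclusive (so exactly 20B still lands on Scale) is an assumption; the text does not say which side of a boundary a tier owns.

```python
from typing import Optional, Tuple

# (tier name, monthly token cap, USD per month), taken from the pricing text.
TIERS = [
    ("Free Sandbox", 60_000_000, 0),
    ("Starter", 1_000_000_000, 199),
    ("Growth", 5_000_000_000, 999),
    ("Scale", 20_000_000_000, 3999),
]


def tier_for(gross_tokens_per_month: int) -> Tuple[str, Optional[int]]:
    """Return the smallest tier whose cap covers the volume.

    Caps are treated as inclusive (an assumption). Volumes above the
    largest cap fall to Enterprise, whose price is custom (None here).
    """
    for name, cap, usd_per_month in TIERS:
        if gross_tokens_per_month <= cap:
            return name, usd_per_month
    return "Enterprise", None


print(tier_for(750_000_000))
```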

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Who is Tessera for?

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. There is also a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

Architecture · 21 May 2026 · ~9 min read

Tessera ecosystem map: how compression vendors fit the substrate cascade

Two kinds of map are useful in any infrastructure category. One is the competitive map — which vendor wins on which axis, where the moats are, who buys whom. The other is the compositional map — which components plug into which interfaces, what assembles into what, how the layer cake stacks. The LLM cost-optimization category gets the competitive map written about it weekly. This post is the compositional one.

Read the popular trade press on LLM cost in 2026 and you will see twelve companies arranged on a 2×2 of price-versus-quality. Read the actual deployments at customers running real workloads and you will see something different: three to seven different vendors stacked into one request path, each handling a slice the others cannot. Compression in front, routing in the middle, batch arbitrage at the back. Observability across the whole thing. None of the customers we have talked to picks one optimizer. They pick a stack.

The substrate cascade

The frame that explains this is the "substrate cascade." A substrate proxy is a thin layer that sits in the LLM request path and runs a sequence of conservative optimization mechanics, each gated by an eval contract. The substrate is opinionated about the contract (quality floor, composition cap, audit-immutable measurement, flat-subscription billing) but deliberately not opinionated about which optimization implementations sit in the cascade.

Compression can be implemented as a deterministic whitespace pass, an LLMLingua-2 template substitution, or a fine-tuned compression LLM. Routing can be a static map, a learned router, or an external routing service. Caching can be exact-match, semantic, or template-aware. The substrate chooses which mechanic to fire on which request, in which order; it does not insist that any single technique is "the" technique.

That posture has a concrete consequence: when a new vendor ships a strong implementation of one mechanic, the substrate does not have to rewrite itself. It exposes a plug-point — the backend-marketplace pattern — and the new vendor becomes M-N in the cascade. Tessera as a substrate is structurally permissive about which implementations win each mechanic slot. The fight, if there is one, is about contract enforcement (does this implementation hold the quality floor? does its savings claim pass the audit-immutable test?), not about implementation choice.

The mechanic slots, and who fills them today

The Tessera cascade today documents ten mechanic slots, numbered M1 through M11, with M4 and the numbers past M11 reserved for future variants. The rest of this post walks the slots, names who fills each one (including Tessera's own implementation today), and shows where the substrate marketplace pattern lets a third-party implementation slot in cleanly.

M1 — Auto-route

Today: Tessera ships a static ROUTE_TABLE per provider with conservative intra-family pairs (gpt-5 → gpt-5-mini, opus → sonnet → haiku), gated by a daily promptfoo canary on the customer's eval set. The per-hop quality-preservation estimate is the prior; the canary is the evidence. Chained walks past the first hop compose with cumulative product gating.
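One way to picture the cumulative product gating on chained walks is the sketch below. The table shape, the per-hop estimates (0.98 and 0.96), and the function name are illustrative assumptions, and the daily canary that supplies the real evidence is not modelled; only the prior-based gate is shown.

```python
from typing import Dict, Tuple

QUALITY_FLOOR = 0.95

# model -> (cheaper model, per-hop quality-preservation estimate).
# The estimates are made-up numbers for illustration. The table is
# assumed acyclic, so the walk below always terminates.
ROUTE_TABLE: Dict[str, Tuple[str, float]] = {
    "opus": ("sonnet", 0.98),
    "sonnet": ("haiku", 0.96),
}


def walk_route(model: str, floor: float = QUALITY_FLOOR) -> Tuple[str, float]:
    """Follow hops while the product of per-hop estimates stays at or above the floor."""
    cumulative = 1.0
    while model in ROUTE_TABLE:
        next_model, hop_estimate = ROUTE_TABLE[model]
        if cumulative * hop_estimate < floor:
            break  # taking this hop would breach the floor
        model, cumulative = next_model, cumulative * hop_estimate
    return model, cumulative


print(walk_route("opus"))
```

With these numbers a request for opus stops at sonnet: the second hop would bring the product to 0.98 × 0.96 ≈ 0.94, below the 0.95 floor, even though each hop passes on its own.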

The marketplace plug-point: Unify, OpenRouter smart-routing, NotDiamond, RouteLLM (academic implementation), and similar learned-routing services can replace the static table with a per-request scoring head. The substrate interface — "given this request, return a candidate model id and a quality_preservation_estimate" — is small enough to swap. The eval contract stays the same. A learned router that holds the customer's eval at 0.95 wins the M1 slot for that workload; one that doesn't, stays in the candidate pool.

M2 / M5 / M6 — Caching

Today: exact-match cache (sha256 of canonical request body, 7-day TTL, Cloudflare edge KV), semantic cache (≥ 0.95 cosine similarity on Cloudflare Vectorize embeddings, with a 5% live-eval canary), and auto-injected provider prompt-cache markers (Anthropic 90% off cached prefix, Google 75%, OpenAI 50%).
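The exact-match variant is simple enough to sketch. The version below is a toy in-memory stand-in for an edge KV store, written under one stated assumption: "canonical request body" is taken to mean JSON with sorted keys and compact separators. The class and function names are hypothetical.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 7 * 24 * 60 * 60  # the 7-day TTL mentioned above


def cache_key(request_body: dict) -> str:
    """sha256 over a canonical JSON encoding (sorted keys; an assumption)."""
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory stand-in for an edge KV store."""

    def __init__(self, ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def put(self, request_body: dict, response: str) -> None:
        expires_at = self._clock() + self._ttl
        self._store[cache_key(request_body)] = (expires_at, response)

    def get(self, request_body: dict) -> Optional[str]:
        key = cache_key(request_body)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return response
```

Sorting keys before hashing means two requests that differ only in field order share one cache entry; any difference in content, including whitespace inside a message, produces a different key.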

The marketplace plug-point: Vellum, Helicone, Portkey, and any other vendor that exposes a cache primitive can be a candidate implementation. The interface is "given this canonical request, return a cached response if the cache invariants hold; otherwise null." Cache implementations differ on tokenizer agreement, key-normalization, and freshness policy — all settable per workload. The substrate contract is the same.

M3 / M7 — Prompt mutation

Today: deterministic whitespace + structural compression (preserves code fences and JSON shapes), per-role opt-in (system / user turns independent), capped under the composition rule that no more than two content-mutators may fire on the same request.
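A deterministic whitespace pass of this kind might look like the sketch below. It is not Tessera's implementation: it only handles fenced code blocks (copied through verbatim) and plain text, it does not parse JSON, and collapsing spaces on a line of inline JSON could alter string values, which a production pass would need to guard against.

```python
import re

FENCE = "`" * 3  # a code-fence marker, built this way to keep the example self-contained


def compress_whitespace(prompt: str) -> str:
    """Collapse redundant whitespace outside code fences.

    Outside fences: runs of spaces/tabs become one space, lines are
    trimmed, and runs of blank lines become one blank line.
    Inside fences: lines are copied through unchanged.
    """
    out = []
    in_fence = False
    previous_blank = False
    for line in prompt.split("\n"):
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
            out.append(line)
            previous_blank = False
            continue
        if in_fence:
            out.append(line)
            continue
        squeezed = re.sub(r"[ \t]+", " ", line).strip()
        if not squeezed:
            if previous_blank:
                continue  # drop repeated blank lines
            previous_blank = True
        else:
            previous_blank = False
        out.append(squeezed)
    return "\n".join(out).strip("\n")
```

The pass is idempotent: running it on its own output changes nothing, which makes the mutation easy to audit.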

The marketplace plug-point: this is the slot the dedicated compression vendors target. LLMLingua-2 (Microsoft Research) does template substitution with a fine-tuned compression LLM. TwoTrim ships proprietary ML compression with reported quality preservation on specific benchmarks. leanctx targets RAG-shaped contexts specifically. The Token Company (TTC) shipped a $70M-funded proprietary ML compression line in W26. Microsoft itself bakes LLMLingua-2 into Azure AI Studio for first-party Azure customers.

The substrate-include posture is that all of these are candidate M3 implementations. The interface is "given this prompt, return a compressed prompt and a quality_preservation_estimate." A vendor that holds 0.95 on the customer's eval wins the M3 slot; one that doesn't (for that workload) sits in the candidate pool and gets re-evaluated when the workload changes. Multiple vendors can be installed and the per-workload canary picks the best one: a bandit-style choice on live evidence rather than a marketing comparison.
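The selection step can be sketched as a greedy pick over canary results. This is a simplification of a bandit (no exploration), and it assumes "best" means the largest share of prompt tokens removed among candidates that hold the floor; the text does not define the ranking. Vendor names and scores are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CanaryResult:
    vendor: str
    quality_preservation: float  # measured on the customer's eval set, 0..1
    tokens_saved_ratio: float    # fraction of prompt tokens removed, 0..1


def pick_compressor(results: List[CanaryResult], floor: float = 0.95) -> Optional[str]:
    """Return the vendor with the largest saving among those holding the floor.

    Returns None when no candidate holds the floor, in which case the
    slot would stay with whatever is currently installed.
    """
    eligible = [r for r in results if r.quality_preservation >= floor]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.tokens_saved_ratio).vendor


results = [
    CanaryResult("vendor_a", 0.97, 0.20),
    CanaryResult("vendor_b", 0.93, 0.45),  # saves more but misses the floor
    CanaryResult("vendor_c", 0.96, 0.30),
]
print(pick_compressor(results))
```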

M8 — Structured output

Today: constrain completions to a JSON schema with explicit max-tokens budgeting. Saves on over-runs and parse-retry traffic. Native to OpenAI, Anthropic, Google in 2026.

The marketplace plug-point: external schema-constrainers like Outlines, Instructor, BAML, and structured-output gateways can replace the provider-native implementation when the workload uses a non-native model or needs more aggressive constraint validation than the provider ships.

M9 — Output-length ceiling

Today: daily cron fits p90 of completion length per workload, injects max_tokens = p90 × 1.3 per request. Cuts completion-token waste without truncating real responses. This is a Tessera-specific mechanic — we have not seen another vendor ship it as a discrete primitive — but the slot is open for a learned length-predictor if one emerges.
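The arithmetic is small enough to show directly. The sketch below assumes a nearest-rank percentile and rounds the result to an integer; the text does not specify either choice, and the function names are hypothetical.

```python
from typing import List


def p90(values: List[int]) -> int:
    """Nearest-rank 90th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("need at least one completion length")
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) without float error
    return ordered[rank - 1]


def max_tokens_ceiling(completion_lengths: List[int], headroom: float = 1.3) -> int:
    """max_tokens = p90 x headroom, rounded to the nearest integer."""
    return round(p90(completion_lengths) * headroom)


# Nine typical completions and one outlier: the ceiling tracks the typical case.
lengths = [100] * 9 + [400]
print(max_tokens_ceiling(lengths))
```

Fitting to p90 rather than the maximum keeps a single long outlier from inflating the ceiling, while the 1.3 multiplier leaves headroom for responses that run somewhat longer than usual.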

M10 — Batch arbitrage

Today: route async-tolerant calls to OpenAI Batch + Anthropic Message Batches (both 50% off). Opt-in per workload.

The marketplace plug-point: OpenPipe ships proprietary fine-tuned models served on cheaper inference; their billing model is a per-token rate well below the base provider's. A workload that has been distilled onto an OpenPipe model is structurally a candidate M10 alternative: "route this async-tolerant call to the cheaper inference target." The same goes for Chamber, Together AI, Fireworks, DeepInfra, Lambda, and other inference-as-a-service providers. The substrate-include version is "OpenPipe is an M10 candidate when the customer has distilled, and M1-baseline otherwise."

M11 — Cross-provider failover

Today: opt-in retry on OpenRouter when the primary upstream returns 5xx / connection error / timeout. Two pricing snapshots per request (the model the customer asked for, the model that actually ran via OpenRouter) for audit immutability. JIT-decrypted credential via HMAC-signed internal endpoint.
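The two-snapshot idea can be sketched as follows. Everything here is illustrative: the class names, the `UpstreamError` exception standing in for a 5xx response, and the callables standing in for real provider clients are assumptions, and credential handling is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PricingSnapshot:
    model: str
    usd_per_1k_tokens: float


@dataclass(frozen=True)
class FailoverResult:
    text: str
    requested: PricingSnapshot  # the model the customer asked for
    executed: PricingSnapshot   # the model that actually ran


class UpstreamError(Exception):
    """Stand-in for a 5xx response from the primary upstream."""


def call_with_failover(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    requested: PricingSnapshot,
    fallback_pricing: PricingSnapshot,
) -> FailoverResult:
    """Try the primary; on 5xx, connection error, or timeout, retry on the fallback.

    Both snapshots are recorded either way, so the audit row shows what
    was asked for and what ran even when they differ.
    """
    try:
        return FailoverResult(primary(prompt), requested, requested)
    except (UpstreamError, ConnectionError, TimeoutError):
        return FailoverResult(fallback(prompt), requested, fallback_pricing)
```

Only the listed failure types trigger the retry; any other exception propagates, so a malformed request is not silently re-sent to a second provider.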

The marketplace plug-point: any meta-router (OpenRouter, AnyScale, Replicate, custom internal routers) is a candidate M11 backend. The interface is "given a primary-upstream failure, return a result from a different provider." The audit-immutable contract requires the M11 candidate to expose enough cost metadata for a snapshot row.

Where Tessera is not a substrate

Three layers of the stack are not substrate-shaped, and we have no plan to compete in them.

LLM gateways (LiteLLM, Kong AI Gateway, Portkey) are the per-token-fee category that owns API normalization and load balancing across providers. They are a substrate-include candidate when a customer needs broader provider coverage than Tessera offers; the contract is "Tessera fronts the gateway, gateway fronts the providers." The gateway gets paid per token; Tessera gets paid a flat monthly subscription. Different billing models, complementary placement.

Observability platforms (Helicone, Langfuse, Arize, PostHog AI, Datadog LLM Observability) are the read-side category that tells customers what they spent and how their LLMs are behaving. Tessera writes one row per request to its own audit ledger; an observability platform can ingest that ledger via export or via the proxy's response headers. Different read patterns, complementary placement.

Eval and prompt-management platforms (promptfoo, Braintrust, HumanLoop, LangSmith) own the eval-set contract that gates Tessera's mechanics. The customer writes their eval in promptfoo (or Braintrust, or a custom suite); Tessera runs it daily as the canary. Different write patterns, complementary placement.

What this means in practice

For a customer evaluating "which LLM cost vendor do I buy," the honest answer is: probably several, with the substrate making the composition predictable. Tessera's wedge is the substrate contract — the quality floor, the composition cap, the audit-immutable measurement, the flat-subscription billing. The mechanics inside that contract can come from anywhere that holds the contract.

For a vendor that ships a single mechanic well — compression, learned routing, fine-tuned cheap inference — the substrate is a distribution channel, not a competitor. The interface to plug in is documented (see how-it-works for the per-mechanic contracts), the eval gate is uniform, and the customer settles through one flat monthly subscription.

For an investor or analyst trying to size the category, the question "who wins LLM cost optimization" is the wrong question. The category will compose. The right question is which substrate wins the contract layer — because the contract layer is where the asymmetric customer trust accrues, and the asymmetric trust is what gets paid for.