Architecture · 21 May 2026 · ~9 min read

Tessera ecosystem map: how compression vendors fit the substrate cascade

Two kinds of map are useful in any infrastructure category. One is the competitive map — which vendor wins on which axis, where the moats are, who buys whom. The other is the compositional map — which components plug into which interfaces, what assembles into what, how the layer cake stacks. The LLM cost-optimization category gets the competitive map written about it weekly. This post is the compositional one.

Read the popular trade press on LLM cost in 2026 and you will see twelve companies arranged on a 2×2 of price-versus-quality. Read the actual deployments at customers running real workloads and you will see something different: three to seven different vendors stacked into one request path, each handling a slice the others cannot. Compression in front, routing in the middle, batch arbitrage at the back. Observability across the whole thing. None of the customers we have talked to picks one optimizer. They pick a stack.

The substrate cascade

The frame that makes this make sense is "substrate cascade." A substrate proxy is a thin layer that sits in the LLM request-path and runs a sequence of conservative optimization mechanics, each gated by an eval contract. The substrate is opinionated about the contract — quality floor, composition cap, audit-immutable measurement, success-fee billing — but deliberately not opinionated about which optimization implementations sit in the cascade.

Compression can be implemented as a deterministic whitespace pass, an LLMLingua-2 token-classification pass, or a fine-tuned compression LLM. Routing can be a static map, a learned router, or an external routing service. Caching can be exact-match, semantic, or template-aware. The substrate chooses which mechanic to fire on which request, in which order; it does not insist that any single technique is "the" technique.

That posture has a concrete consequence: when a new vendor ships a strong implementation of one mechanic, the substrate does not have to rewrite itself. It exposes a plug-point — the backend-marketplace pattern — and the new vendor becomes M-N in the cascade. Tessera as a substrate is structurally permissive about which implementations win each mechanic slot. The fight, if there is one, is about contract enforcement (does this implementation hold the quality floor? does its savings claim pass the audit-immutable test?), not about implementation choice.

The mechanic slots, and who fills them today

The Tessera cascade today has ten documented mechanic slots, numbered M1 through M11, with M4 (and the numbers past M11) reserved for future variants. The rest of this post walks the slots, names who fills each one (including Tessera's own implementation today), and shows where the substrate marketplace pattern lets a third-party implementation slot in cleanly.

M1 — Auto-route

Today: Tessera ships a static ROUTE_TABLE per provider with conservative intra-family pairs (gpt-5 → gpt-5-mini, opus → sonnet → haiku), gated by a daily promptfoo canary on the customer's eval set. The per-hop quality-preservation estimate is the prior; the canary is the evidence. Chained walks past the first hop are gated on the cumulative product of the per-hop estimates, so a second or third hop fires only if the compounded estimate still clears the quality floor.
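One plausible reading of cumulative product gating, as a minimal sketch. The model pairs are the ones named above; the numeric priors, the 0.95 floor as a default, and the function name `walk_route` are assumptions made for illustration, not Tessera's actual table or code.

```python
from typing import Dict, List, Tuple

# Hypothetical table: model -> (cheaper sibling, per-hop quality-preservation prior).
# The pairs come from the post; the numeric priors are invented for this example.
ROUTE_TABLE: Dict[str, Tuple[str, float]] = {
    "gpt-5": ("gpt-5-mini", 0.97),
    "opus": ("sonnet", 0.98),
    "sonnet": ("haiku", 0.96),
}

QUALITY_FLOOR = 0.95  # assumed default, matching the 0.95 eval threshold used in the post


def walk_route(model: str, floor: float = QUALITY_FLOOR) -> Tuple[str, float, List[str]]:
    """Walk down the table while the product of per-hop priors stays at or above the floor."""
    cumulative = 1.0
    current = model
    path = [model]
    while current in ROUTE_TABLE:
        candidate, prior = ROUTE_TABLE[current]
        if cumulative * prior < floor:
            break  # this hop would push the compounded estimate below the floor
        cumulative *= prior
        current = candidate
        path.append(current)
    return current, cumulative, path
```

With these made-up priors, a request for opus stops at sonnet: the first hop keeps the estimate at 0.98, but a second hop would compound to roughly 0.94, which is under the floor. The prior only decides what is allowed to be tried; the daily canary remains the evidence.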

The marketplace plug-point: Unify, OpenRouter smart-routing, NotDiamond, RouteLLM (academic implementation), and similar learned-routing services can replace the static table with a per-request scoring head. The substrate interface ("given this request, return a candidate model id and a quality_preservation_estimate") is small enough to swap. The eval contract stays the same. A learned router that holds the customer's eval at 0.95 wins the M1 slot for that workload; one that doesn't stays in the candidate pool.

M2 / M5 / M6 — Caching

Today: exact-match cache (sha256 of canonical request body, 7-day TTL, Cloudflare edge KV), semantic cache (≥ 0.95 cosine similarity on Cloudflare Vectorize embeddings, with a 5% live-eval canary), and auto-injected provider prompt-cache markers (Anthropic 90% off cached prefix, Google 75%, OpenAI 50%).
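A rough sketch of the two lookup rules, under stated assumptions. The sorted-key JSON canonicalization, the in-memory `ExactCache` class standing in for an edge KV store, and the helper names are all hypothetical; only the sha256 key, the 7-day TTL, and the 0.95 cosine threshold come from the description above.

```python
import hashlib
import json
import math
import time
from typing import Any, Dict, Optional, Sequence, Tuple

TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day TTL from the post
SEMANTIC_THRESHOLD = 0.95       # cosine floor from the post


def cache_key(request_body: Dict[str, Any]) -> str:
    """sha256 over an assumed canonical form: JSON with sorted keys and no extra whitespace."""
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_hit(query: Sequence[float], stored: Sequence[float]) -> bool:
    """A stored entry qualifies only at or above the similarity threshold."""
    return cosine(query, stored) >= SEMANTIC_THRESHOLD


class ExactCache:
    """In-memory stand-in for an edge KV store; entries expire after TTL_SECONDS."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, str]] = {}

    def put(self, body: Dict[str, Any], response: str, now: Optional[float] = None) -> None:
        stamp = time.time() if now is None else now
        self._store[cache_key(body)] = (stamp, response)

    def get(self, body: Dict[str, Any], now: Optional[float] = None) -> Optional[str]:
        current = time.time() if now is None else now
        entry = self._store.get(cache_key(body))
        if entry is None:
            return None
        stored_at, response = entry
        if current - stored_at > TTL_SECONDS:
            return None  # expired: treat as a miss
        return response
```

Sorting keys before hashing means two requests that differ only in field order share a cache entry. A production canonicalizer would likely also have to decide which fields (request ids, user metadata) belong in the key at all; that choice is not specified here.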

The marketplace plug-point: Vellum, Helicone, Portkey, and any other vendor that exposes a cache primitive can be a candidate implementation. The interface is "given this canonical request, return a cached response if the cache invariants hold; otherwise null." Cache implementations differ on tokenizer agreement, key-normalization, and freshness policy — all settable per workload. The substrate contract is the same.

M3 / M7 — Prompt mutation

Today: deterministic whitespace + structural compression (preserves code fences and JSON shapes), per-role opt-in (system / user turns independent), capped under the composition rule that no more than two content-mutators may fire on the same request.
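To make the shape of a deterministic pass concrete, here is a minimal sketch that collapses whitespace outside fenced code blocks and enforces a two-mutator cap. It is an illustration under assumptions, not the shipped mechanic: the function names are hypothetical, fences are detected only by a leading triple backtick, and JSON shapes outside fences are not specially protected.

```python
import re
from typing import List

FENCE = "`" * 3            # Markdown code-fence delimiter
MAX_CONTENT_MUTATORS = 2   # composition cap from the post


def compress_whitespace(prompt: str) -> str:
    """Collapse space/tab runs and repeated blank lines, leaving fenced code untouched."""
    out: List[str] = []
    in_fence = False
    blank_run = 0
    for line in prompt.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # opening or closing fence line
            out.append(line)
            blank_run = 0
            continue
        if in_fence:
            out.append(line)  # code lines pass through unchanged
            continue
        squeezed = re.sub(r"[ \t]+", " ", line).strip()
        if squeezed == "":
            blank_run += 1
            if blank_run > 1:
                continue  # keep at most one consecutive blank line
        else:
            blank_run = 0
        out.append(squeezed)
    return "\n".join(out)


def select_mutators(requested: List[str], cap: int = MAX_CONTENT_MUTATORS) -> List[str]:
    """Keep the first `cap` content-mutators in priority order and drop the rest."""
    return requested[:cap]
```

Because the pass is a pure function of the prompt, the same input always yields the same output, which keeps it compatible with an exact-match cache sitting in the same request path.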

The marketplace plug-point: this is the slot the dedicated compression vendors target. LLMLingua-2 (Microsoft Research) compresses prompts by token classification, using a small fine-tuned encoder to decide which tokens to keep. TwoTrim ships proprietary ML compression with reported quality preservation on specific benchmarks. leanctx targets RAG-shaped contexts specifically. The Token Company (TTC) shipped a $70M-funded proprietary ML compression line in W26. Microsoft itself bakes LLMLingua-2 into Azure AI Studio for first-party Azure customers.

The substrate-include posture is that all of these are candidate M3 implementations. The interface is "given this prompt, return a compressed prompt and a quality_preservation_estimate." A vendor that holds 0.95 on the customer's eval wins the M3 slot; one that doesn't (for that workload) sits in the candidate pool and gets re-evaluated when the workload changes. Multiple vendors can be installed and the per-workload canary picks the best one: a bandit-style choice on live evidence rather than a marketing comparison.
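The bandit-style choice could look something like the epsilon-greedy sketch below. Everything in it is an assumption made for illustration: the class and method names, the placeholder vendor names, the use of mean eval score against a 0.95 floor for eligibility, and the tie-break on mean savings. The post does not say which bandit algorithm, if any particular one, is used.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Arm:
    """One candidate compression backend and its canary history."""
    name: str
    quality_scores: List[float] = field(default_factory=list)
    savings: List[float] = field(default_factory=list)

    def mean_quality(self) -> float:
        return sum(self.quality_scores) / len(self.quality_scores) if self.quality_scores else 0.0

    def mean_savings(self) -> float:
        return sum(self.savings) / len(self.savings) if self.savings else 0.0


class CompressionBandit:
    def __init__(self, names: List[str], floor: float = 0.95,
                 epsilon: float = 0.1, seed: Optional[int] = None) -> None:
        self.arms: Dict[str, Arm] = {n: Arm(n) for n in names}
        self.floor = floor
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def record(self, name: str, quality: float, saving: float) -> None:
        """Store one canary observation: eval score and fraction of tokens saved."""
        self.arms[name].quality_scores.append(quality)
        self.arms[name].savings.append(saving)

    def choose(self) -> Optional[str]:
        """Pick an arm, or None when no arm has evidence of holding the floor."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.arms))  # explore: keep gathering evidence
        eligible = [a for a in self.arms.values()
                    if a.quality_scores and a.mean_quality() >= self.floor]
        if not eligible:
            return None  # exploit path with nothing eligible: send uncompressed
        return max(eligible, key=lambda a: a.mean_savings()).name
```

The useful property is that a vendor with a bigger headline savings number still loses the slot if its eval score sits under the floor for this workload, and gets another chance whenever exploration samples it again.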

M8 — Structured output

Today: constrain completions to a JSON schema with explicit max-tokens budgeting. Saves on over-runs and parse-retry traffic. Native to OpenAI, Anthropic, Google in 2026.

The marketplace plug-point: external schema-constrainers like Outlines, Instructor, BAML, and structured-output gateways can replace the provider-native implementation when the workload uses a non-native model or needs more aggressive constraint validation than the provider ships.

M9 — Output-length ceiling

Today: daily cron fits p90 of completion length per workload, injects max_tokens = p90 × 1.3 per request. Cuts completion-token waste without truncating real responses. This is a Tessera-specific mechanic — we have not seen another vendor ship it as a discrete primitive — but the slot is open for a learned length-predictor if one emerges.
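The arithmetic is small enough to show directly. This sketch assumes a nearest-rank percentile and rounds the ceiling up to a whole token; the function names are hypothetical, and the percentile method a real fitting job uses may differ.

```python
import math
from typing import Sequence

HEADROOM = 1.3  # multiplier from the post


def p90(lengths: Sequence[int]) -> int:
    """Nearest-rank 90th percentile of observed completion lengths."""
    if not lengths:
        raise ValueError("need at least one observed completion length")
    ordered = sorted(lengths)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


def max_tokens_ceiling(lengths: Sequence[int], headroom: float = HEADROOM) -> int:
    """Per-workload max_tokens: p90 of history times the headroom, rounded up."""
    return math.ceil(p90(lengths) * headroom)
```

For a workload whose last ten completions ran 100, 200, and so on up to 1000 tokens, the nearest-rank p90 is 900 and the injected ceiling is 1170. The headroom is what keeps the long tail above p90 from being truncated in most cases; a completion beyond 1.3 times p90 would still hit the limit.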

M10 — Batch arbitrage

Today: route async-tolerant calls to OpenAI Batch + Anthropic Message Batches (both 50% off). Opt-in per workload.

The marketplace plug-point: OpenPipe ships proprietary fine-tuned models served on cheaper inference; their billing model is a per-token rate well below the base provider's. A workload that has been distilled onto an OpenPipe model is structurally a candidate M10 alternative: "route this async-tolerant call to the cheaper inference target." The same goes for Chamber, Together AI, Fireworks, DeepInfra, Lambda, and other inference-as-a-service providers. The substrate-include version is "OpenPipe is an M10 candidate when the customer has distilled, and M1-baseline otherwise."

M11 — Cross-provider failover

Today: opt-in retry on OpenRouter when the primary upstream returns 5xx / connection error / timeout. Two pricing snapshots per request (the model the customer asked for, the model that actually ran via OpenRouter) for audit immutability. JIT-decrypted credential via HMAC-signed internal endpoint.

The marketplace plug-point: any meta-router (OpenRouter, Anyscale, Replicate, custom internal routers) is a candidate M11 backend. The interface is "given a primary-upstream failure, return a result from a different provider." The audit-immutable contract requires the M11 candidate to expose enough cost metadata for a snapshot row.
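A minimal sketch of the failover contract with its two pricing snapshots. The types, field names, and the `UpstreamError` exception are hypothetical; the point is only that the audit row records both the requested model's price and the price of the model that actually served the request, and that both records are immutable once written.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class PricingSnapshot:
    """Price of one model at request time (hypothetical fields)."""
    model: str
    usd_per_1k_input: float
    usd_per_1k_output: float


@dataclass(frozen=True)
class AuditRow:
    requested: PricingSnapshot  # the model the customer asked for
    served: PricingSnapshot     # the model that actually ran
    failed_over: bool


class UpstreamError(Exception):
    """Stands in for a 5xx response, connection error, or timeout."""


def call_with_failover(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    requested_price: PricingSnapshot,
    fallback_price: PricingSnapshot,
) -> Tuple[str, AuditRow]:
    """Try the primary upstream; on failure, retry once on the fallback backend."""
    try:
        text = primary()
        return text, AuditRow(requested_price, requested_price, False)
    except UpstreamError:
        text = fallback()  # a second failure propagates to the caller
        return text, AuditRow(requested_price, fallback_price, True)
```

Freezing the dataclasses is a small nod to audit immutability: once a row is built, neither snapshot can be edited in place. A backend that cannot report the served model's price cannot fill in `fallback_price`, which is why the contract asks for that metadata.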

Where Tessera is not a substrate

Three layers of the stack are not substrate-shaped, and we have no plan to compete in them.

LLM gateways (LiteLLM, Kong AI Gateway, Portkey) are the per-token-fee category that owns API normalization and load balancing across providers. They are a substrate-include candidate when a customer needs broader provider coverage than Tessera offers; the contract is "Tessera fronts the gateway, gateway fronts the providers." The gateway gets paid per-token; Tessera gets paid per-saved-dollar. Different billing models, complementary placement.

Observability platforms (Helicone, Langfuse, Arize, PostHog AI, Datadog LLM Observability) are the read-side category that tells customers what they spent and how their LLMs are behaving. Tessera writes one row per request to its own audit ledger; an observability platform can ingest that ledger via export or via the proxy's response headers. Different read patterns, complementary placement.

Eval and prompt-management platforms (promptfoo, Braintrust, Humanloop, LangSmith) own the eval-set contract that gates Tessera's mechanics. The customer writes their eval in promptfoo (or Braintrust, or a custom suite); Tessera runs it daily as the canary. Different write patterns, complementary placement.

What this means in practice

For a customer evaluating "which LLM cost vendor do I buy," the honest answer is: probably several, with the substrate making the composition predictable. Tessera's wedge is the substrate contract — the quality floor, the composition cap, the audit-immutable measurement, the success-fee billing. The mechanics inside that contract can come from anywhere that holds the contract.

For a vendor that ships a single mechanic well — compression, learned routing, fine-tuned cheap inference — the substrate is a distribution channel, not a competitor. The interface to plug in is documented (see how-it-works for the per-mechanic contracts), the eval gate is uniform, and the billing flows through one prepaid balance.

For an investor or analyst trying to size the category, the question "who wins LLM cost optimization" is the wrong question. The category will compose. The right question is which substrate wins the contract layer — because the contract layer is where the asymmetric customer trust accrues, and the asymmetric trust is what gets paid for.
