AI cost management in 2026: how proxies, compressors, and routers compose into a substrate
AI inference is the second-fastest-growing line on most B2B SaaS P&Ls in 2026. The discipline that handles it has a name now — AI FinOps — and a stack. This post documents the stack: five families of approach, what each one actually does, when each one fires, the realistic savings range per family, and how a substrate proxy composes them into one cascade under a single eval contract.
The thesis up front: the LLM cost stack is composable. Picking one vendor and ignoring the others leaves money on the table. Picking the whole stack without a contract layer turns the savings into an audit problem nobody can defend at quarterly review. The substrate-proxy pattern threads that needle by acting as the single eval-gated cascade that every other layer slots into.
The five families
Most of what gets sold under the "LLM cost optimization" banner falls into one of five families. They differ on where they intercept the request, what they change, and how they bill.
- LLM gateways. Per-token-fee proxies that normalize API shapes across providers, do load-balancing and routing, and add observability. Examples: LiteLLM, Kong AI Gateway, Portkey. Billing: per-token surcharge or per-month seat.
- Substrate proxies. Eval-gated request-path optimizers that run a cascade of conservative mechanics — routing, caching, compression, batching, output-length control — and measure the savings against the model the customer originally asked for. Example: Tessera. Billing: success-fee on measured savings.
- Specialist compressors. Single-mechanic vendors that implement prompt compression as their only product. Examples: TwoTrim, leanctx, The Token Company (TTC), LLMLingua-2 (Microsoft Research / Azure). Billing: per-token rate.
- Specialist routers and fine-tuners. Single-mechanic vendors that handle model routing, distillation, or cheap-inference hosting. Examples: Unify, OpenPipe, NotDiamond, OpenRouter, Together AI, Fireworks, Chamber, RouteLLM. Billing: per-token rate or per-saved-dollar.
- Observability and eval platforms. Read-side platforms that ingest request data and tell you what you spent, how prompts performed, and whether quality drifted. Examples: Helicone, Langfuse, Arize, PostHog AI, Datadog LLM Observability, promptfoo, Braintrust, LangSmith. Billing: per-seat or per-ingested-event.
A real production stack typically uses one from each family. The mistake is assuming they substitute for each other.
Where in the request path each family operates
| Family | Intercept point | What it changes | Typical savings range |
|---|---|---|---|
| LLM gateways | Wire format / provider URL | Provider abstraction, load-balancing, retry | 0% (often negative after fee — bought for governance, not cost) |
| Substrate proxies | Wire format / request body | Cascade of mechanics under eval contract | 25-60% on typical SaaS LLM workloads |
| Specialist compressors | Prompt content | Tokens-per-prompt count via ML compression | 5-25% on prompt tokens; quality-sensitive |
| Specialist routers + fine-tuners | Model selection | Choose a cheaper model with comparable output | 10-40% when the model surface allows the swap |
| Observability + evals | Read side (logs / traces) | Nothing in the request; surfaces data | 0% direct; necessary input to other layers' decisions |
Read the table column by column and the composition logic falls out. A gateway gives you provider abstraction and governance; it does not save tokens. A substrate runs the cascade that does. Specialist vendors are candidate implementations for specific mechanic slots in the cascade. Observability tells everyone what happened. Each family does work the others structurally cannot.
Worked example — a customer-support agent on gpt-4o
Concrete numbers help. Take a B2B SaaS customer-support agent running on gpt-4o at 5 billion tokens per month, 70% input / 30% output mix. At list prices that bill is $24,000 / month at the start of the period. The stack looks like this:
| Mechanic | What it does | Savings on $24k baseline |
|---|---|---|
| M2 — Exact-match cache | 35% hit rate on FAQ-style repeat queries | −$8,400 |
| M6 — Provider prompt cache | 90% off cached system + tool prefix on Anthropic; 50% on OpenAI | −$3,200 |
| M1 — Auto-route | gpt-4o → gpt-4o-mini on classification + drafting sub-tasks (canary 0.96) | −$1,800 |
| M3 — Compress | System prompt + tool descriptions, per-role opt-in | −$700 |
| M9 — Output-length ceiling | max_tokens = p90 × 1.3 per workload | −$400 |
| M10 — Batch arbitrage | 50% off on the analytics-ingest sub-workload (async-tolerant) | −$100 |
| Total measured savings | | −$14,600 |
| Substrate success fee (20% × measured savings) | | +$2,920 |
| Net customer save | | $11,680 / month |
Quality canary across the full mechanic stack held mean-score 0.96 for all 30 days; the per-stack 0.95 floor never tripped. Full walkthrough on the worked-numbers post.
Notice three things. First, no single mechanic accounts for more than 60% of the savings; the cascade matters. Second, the routing swap is not the largest line: at $1,800 it ranks third, behind both caches, because cache hits dominate on workloads with prompt repetition. Third, the total is reported against the model the customer originally asked for, not against some generic reference rate, which is what the substrate's audit-immutable contract is for.
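The totals in the table can be re-derived in a few lines. The sketch below copies the per-mechanic savings from the table and recomputes the fee, the net figure, and the largest single share. The baseline check at the end rests on an assumption the post does not state: gpt-4o list prices of $2.50 per million input tokens and $10.00 per million output tokens.

```python
# Monthly savings per mechanic in USD, copied from the table above.
savings_usd = {
    "M2 exact-match cache": 8_400,
    "M6 provider prompt cache": 3_200,
    "M1 auto-route": 1_800,
    "M3 compress": 700,
    "M9 output-length ceiling": 400,
    "M10 batch arbitrage": 100,
}
SUCCESS_FEE_PCT = 20  # success fee as a percentage of measured savings

total_savings = sum(savings_usd.values())
success_fee = total_savings * SUCCESS_FEE_PCT / 100
net_save = total_savings - success_fee
largest_share = max(savings_usd.values()) / total_savings

# ASSUMPTION: gpt-4o list prices in USD per million tokens (not given in the post).
INPUT_USD_PER_M, OUTPUT_USD_PER_M = 2.50, 10.00
# 70% / 30% split of 5 billion tokens, expressed in millions of tokens.
input_tokens_m, output_tokens_m = 3_500, 1_500
baseline = input_tokens_m * INPUT_USD_PER_M + output_tokens_m * OUTPUT_USD_PER_M

print(f"total savings: ${total_savings:,}")
print(f"success fee:   ${success_fee:,.0f}")
print(f"net save:      ${net_save:,.0f}")
print(f"largest share: {largest_share:.1%}")
print(f"baseline:      ${baseline:,.0f}")
```

Under the assumed prices the baseline comes to $23,750, close to the $24,000 quoted, and the exact-match cache contributes about 57.5% of total savings.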
The audit-immutable contract
A savings claim that the customer cannot independently re-derive is a marketing claim. A substrate proxy fixes that with two pricing snapshots per request: original_cost_usd priced at the requested-model catalog rate, and actual_cost_usd priced at the actual-model rate after routing + cache hits + provider discounts. Both rates are pinned to a pricing_catalog snapshot version captured at request time. Mid-contract provider price changes do not retroactively alter past savings.
The customer can export the audit ledger and re-run the math at any time. That is what makes the success-fee billing model defensible — and what separates a substrate proxy from a gateway that sells observability of cost without measuring it.
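A minimal sketch of what one audit row and its re-derivation might look like. Only `original_cost_usd`, `actual_cost_usd`, and the idea of a pinned `pricing_catalog` snapshot come from the text; the catalog contents, the version string, the prices, and the helper names are hypothetical. For brevity the sketch models routing only, not cache hits or provider discounts.

```python
from dataclasses import dataclass
from decimal import Decimal

# HYPOTHETICAL catalog: snapshot version -> model -> (input, output) USD per million tokens.
# Prices are illustrative, not taken from the post.
PRICING_CATALOG = {
    "2026-01-01": {
        "gpt-4o": (Decimal("2.50"), Decimal("10.00")),
        "gpt-4o-mini": (Decimal("0.15"), Decimal("0.60")),
    },
}

def price(version: str, model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Price a request at the rates pinned in one catalog snapshot."""
    in_rate, out_rate = PRICING_CATALOG[version][model]
    return (in_rate * input_tokens + out_rate * output_tokens) / Decimal(1_000_000)

@dataclass(frozen=True)  # frozen: a row cannot be mutated after it is written
class AuditRow:
    requested_model: str
    actual_model: str
    input_tokens: int
    output_tokens: int
    pricing_catalog_version: str
    original_cost_usd: Decimal
    actual_cost_usd: Decimal

    @property
    def savings_usd(self) -> Decimal:
        return self.original_cost_usd - self.actual_cost_usd

def make_audit_row(requested: str, actual: str, input_tokens: int,
                   output_tokens: int, version: str) -> AuditRow:
    """Capture both pricing snapshots at request time."""
    return AuditRow(
        requested, actual, input_tokens, output_tokens, version,
        original_cost_usd=price(version, requested, input_tokens, output_tokens),
        actual_cost_usd=price(version, actual, input_tokens, output_tokens),
    )

def rederive(row: AuditRow) -> bool:
    """Customer-side check: recompute both costs from the pinned snapshot."""
    original = price(row.pricing_catalog_version, row.requested_model,
                     row.input_tokens, row.output_tokens)
    actual = price(row.pricing_catalog_version, row.actual_model,
                   row.input_tokens, row.output_tokens)
    return original == row.original_cost_usd and actual == row.actual_cost_usd

row = make_audit_row("gpt-4o", "gpt-4o-mini", 1_000_000, 200_000, "2026-01-01")
print(row.savings_usd, rederive(row))
```

Because each row carries its own snapshot version, adding a newer catalog entry later leaves the stored costs and the re-derivation result unchanged.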
Composition rules that protect quality
Naively stacking every available mechanic on every request produces worst-case behaviour: compression on a structured-output workload destroys the schema, context pruning on a multi-turn dialog drops the user's last instruction, batching on a real-time chat breaks UX. The substrate contract has three guardrails baked in.
Composition cap. No more than two content-mutating mechanics may fire on the same request. The cap is enforced at the mechanic-selection layer before any provider call. A workload eligible for M3 + M7 + M8 will see at most two of those three actually run, and which two is chosen by per-stack canary evidence.
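One way the cap could be enforced at selection time is sketched below. The set of content-mutating mechanics, the shape of the canary evidence (a score per combination), and the function name are assumptions made for illustration; the post only states the cap of two and that canary evidence picks the winners.

```python
from itertools import combinations

# ASSUMPTION: which mechanics count as content-mutating.
CONTENT_MUTATING = {"M3", "M7", "M8"}
MAX_CONTENT_MUTATING = 2  # the composition cap from the text

def apply_composition_cap(eligible, canary_scores):
    """Return the mechanics allowed to fire on one request.

    eligible: list of mechanic ids the workload qualifies for.
    canary_scores: dict mapping frozenset of content-mutating ids to a
        canary score (hypothetical evidence format); missing combos score 0.0.
    """
    mutating = sorted(m for m in eligible if m in CONTENT_MUTATING)
    passthrough = [m for m in eligible if m not in CONTENT_MUTATING]
    if len(mutating) <= MAX_CONTENT_MUTATING:
        return passthrough + mutating
    # Over the cap: keep the combination with the best canary evidence.
    best = max(
        combinations(mutating, MAX_CONTENT_MUTATING),
        key=lambda combo: canary_scores.get(frozenset(combo), 0.0),
    )
    return passthrough + list(best)

scores = {
    frozenset({"M3", "M7"}): 0.97,
    frozenset({"M3", "M8"}): 0.95,
    frozenset({"M7", "M8"}): 0.96,
}
print(apply_composition_cap(["M1", "M3", "M7", "M8"], scores))
```

Mechanics that do not mutate content (M1 here) pass through untouched, so the cap only ever trims the content-mutating subset.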
Per-stack quality floor. Every mechanic combination that fires on a workload is tracked as a distinct stack (m1+m6+m9, m2+m3+m6, etc.). The canary cron runs the customer's eval set against each live stack daily; the rolling 3-day mean score must hold ≥ 0.95. Below that, the stack auto-rolls back to baseline and the affected fee window earns a 10% credit.
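The floor check itself is a small piece of arithmetic. The sketch below assumes daily canary scores arrive as a list, oldest first, and reads the 10% credit as 10% of the fee for the affected window; that reading, the function name, and the return shape are assumptions, not the documented behaviour.

```python
from statistics import fmean

QUALITY_FLOOR = 0.95   # per-stack floor from the text
WINDOW_DAYS = 3        # rolling window from the text
CREDIT_PCT = 10        # ASSUMPTION: credit is 10% of the affected window's fee

def evaluate_stack(daily_scores, window_fee_usd):
    """Decide whether a live stack keeps running or rolls back to baseline."""
    if not daily_scores:
        raise ValueError("need at least one daily canary score")
    window = daily_scores[-WINDOW_DAYS:]
    rolling_mean = fmean(window)
    if rolling_mean >= QUALITY_FLOOR:
        return {"action": "keep", "rolling_mean": rolling_mean, "credit_usd": 0.0}
    return {
        "action": "rollback_to_baseline",
        "rolling_mean": rolling_mean,
        "credit_usd": window_fee_usd * CREDIT_PCT / 100,
    }

healthy = evaluate_stack([0.96, 0.97, 0.96], window_fee_usd=100.0)
degraded = evaluate_stack([0.97, 0.93, 0.92], window_fee_usd=100.0)
print(healthy["action"], degraded["action"], degraded["credit_usd"])
```

Because the mean is taken over three days, a single noisy canary run does not necessarily trigger a rollback; a sustained drop does.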
Hard bypasses. Compliance-tier-regulated workloads short-circuit routing entirely. Streaming-required workloads bypass batching. Tool-using workloads bypass non-tool-capable route targets. The rules are documented per mechanic on how-it-works and enforced in worker code, not config.
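The three bypass rules can be expressed as plain filters that run before any mechanic is selected. In this sketch the `Workload` fields, the function names, and the set of tool-capable targets are hypothetical; the mapping of M1 to routing and M10 to batching follows the worked example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    # HYPOTHETICAL flags describing one workload.
    compliance_regulated: bool = False
    streaming_required: bool = False
    uses_tools: bool = False

# ASSUMPTION: which route targets support tool calls.
TOOL_CAPABLE_TARGETS = {"gpt-4o-mini"}

def allowed_mechanics(workload, candidates):
    """Drop mechanics that a hard bypass forbids for this workload."""
    blocked = set()
    if workload.compliance_regulated:
        blocked.add("M1")   # regulated workloads short-circuit routing
    if workload.streaming_required:
        blocked.add("M10")  # streaming workloads bypass batching
    return [m for m in candidates if m not in blocked]

def allowed_route_targets(workload, targets):
    """Filter route targets; tool-using workloads need tool-capable models."""
    if workload.compliance_regulated:
        return []
    if workload.uses_tools:
        return [t for t in targets if t in TOOL_CAPABLE_TARGETS]
    return list(targets)

regulated = Workload(compliance_regulated=True)
print(allowed_mechanics(regulated, ["M1", "M2", "M10"]))
```

Keeping these as code-level filters, not per-customer settings, matches the point that the rules are enforced in worker code and cannot be switched off by configuration.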
When each family fires in a real deployment
- Gateway fires when the application needs unified provider abstraction or governance — e.g., enforcing data-residency-aware routing, swapping providers without code change, applying per-team rate limits.
- Substrate fires on every request — that is the whole value proposition. It is the layer that runs the cascade and emits the audit row.
- Specialist compressor fires when the substrate's built-in M3 implementation underperforms on a specific workload and a third-party compressor wins the canary eval for that workload. Plugs in via the substrate's backend-marketplace pattern.
- Specialist router / fine-tuner fires when the substrate's built-in M1 ROUTE_TABLE leaves money on the table for a workload that a learned router or distilled model can serve at lower cost. Same plug-in pattern.
- Observability fires on every request, on the read side — never in the path that matters for latency.
What this implies for buyers
A buyer evaluating "which LLM cost vendor do I pick" should pick two: a substrate (to own the eval contract and the cascade) and an observability platform (to own the read side). Specialists become candidates inside the substrate when the per-workload canary points at them. The gateway question is orthogonal — buy one if you have a governance need; skip it if the substrate already gives you the provider abstraction you need (it does in most cases).
A buyer evaluating "is it worth the integration effort" should look at the typical-savings band per family and back out the dollar number. A B2B SaaS spending $50k/month on LLMs sees realistic substrate savings of $12.5k-$30k/month; a $200k/month workload sees $50k-$120k. The free tier covers low-volume workloads entirely, so the integration test costs nothing.
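Backing out the dollar number is a one-line calculation. The sketch below applies the 25-60% substrate band from the table to a monthly spend; the function name is hypothetical, and the result is gross savings before any success fee.

```python
SUBSTRATE_BAND_PCT = (25, 60)  # typical savings range for substrate proxies

def savings_range_usd(monthly_spend_usd):
    """Return (low, high) gross monthly savings for a given LLM spend."""
    low_pct, high_pct = SUBSTRATE_BAND_PCT
    return (monthly_spend_usd * low_pct / 100, monthly_spend_usd * high_pct / 100)

for spend in (50_000, 200_000):
    low, high = savings_range_usd(spend)
    print(f"${spend:,}/month -> ${low:,.0f} to ${high:,.0f}")
```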
What this implies for vendors
A vendor shipping a single mechanic well (compression, routing, fine-tuning, cache, length prediction) should treat the substrate as a distribution channel, not a competitor. The interface to plug in is small and documented. The substrate brings the audit contract, the eval gate, and the billing pipeline. The specialist brings the implementation.