FinOps · 21 May 2026 · ~12 min read

AI cost management in 2026: how proxies, compressors, and routers compose into a substrate

AI inference is the second-fastest-growing line on most B2B SaaS P&Ls in 2026. The discipline that handles it has a name now — AI FinOps — and a stack. This post documents the stack: five families of approach, what each one actually does, when each one fires, the realistic savings range per family, and how a substrate proxy composes them into one cascade under a single eval contract.

The thesis up front: the LLM cost stack is composable. Picking one vendor and ignoring the others leaves money on the table. Picking the whole stack without a contract layer turns the savings into an audit problem nobody can defend at quarterly review. The substrate-proxy pattern threads that needle by acting as the single eval-gated cascade that every other layer slots into.

The five families

Most of what gets sold under the "LLM cost optimization" banner falls into one of five families. They differ on where they intercept the request, what they change, and how they bill.

  1. LLM gateways. Per-token-fee proxies that normalize API shapes across providers, do load-balancing and routing, and add observability. Examples: LiteLLM, Kong AI Gateway, Portkey. Billing: per-token surcharge or per-month seat.
  2. Substrate proxies. Eval-gated request-path optimizers that run a cascade of conservative mechanics — routing, caching, compression, batching, output-length control — and measure the savings against the model the customer originally asked for. Example: Tessera. Billing: success-fee on measured savings.
  3. Specialist compressors. Single-mechanic vendors that implement prompt compression as their only product. Examples: TwoTrim, leanctx, The Token Company (TTC), LLMLingua-2 (Microsoft Research / Azure). Billing: per-token rate.
  4. Specialist routers and fine-tuners. Single-mechanic vendors that handle model routing, distillation, or cheap-inference hosting. Examples: Unify, OpenPipe, NotDiamond, OpenRouter, Together AI, Fireworks, Chamber, RouteLLM. Billing: per-token rate or per-saved-dollar.
  5. Observability and eval platforms. Read-side platforms that ingest request data and tell you what you spent, how prompts performed, and whether quality drifted. Examples: Helicone, Langfuse, Arize, PostHog AI, Datadog LLM Observability, promptfoo, Braintrust, LangSmith. Billing: per-seat or per-ingested-event.

A real production stack typically uses one from each family. The mistake is assuming they substitute for each other.

Where in the request path each family operates

| Family | Intercept point | What it changes | Typical savings range |
| --- | --- | --- | --- |
| LLM gateways | Wire format / provider URL | Provider abstraction, load-balancing, retry | 0% (often negative after fee; bought for governance, not cost) |
| Substrate proxies | Wire format / request body | Cascade of mechanics under eval contract | 25-60% on typical SaaS LLM workloads |
| Specialist compressors | Prompt content | Tokens-per-prompt count via ML compression | 5-25% on prompt tokens; quality-sensitive |
| Specialist routers + fine-tuners | Model selection | Choose a cheaper model with comparable output | 10-40% when the model surface allows the swap |
| Observability + evals | Read side (logs / traces) | Nothing in the request; surfaces data | 0% direct; necessary input to other layers' decisions |

Read the table column by column and the composition logic falls out. A gateway gives you provider abstraction and governance; it does not save tokens. A substrate runs the cascade that does. Specialist vendors are candidate implementations for specific mechanic slots in the cascade. Observability tells everyone what happened. Each family does work the others structurally cannot.

Worked example — a customer-support agent on gpt-4o

Concrete numbers help. Take a B2B SaaS customer-support agent running on gpt-4o at 5 billion tokens per month, 70% input / 30% output mix. At list prices that bill is $24,000 / month at the start of the period. The stack looks like this:

| Mechanic | What it does | Savings on $24k baseline |
| --- | --- | --- |
| M2: Exact-match cache | 35% hit rate on FAQ-style repeat queries | −$8,400 |
| M6: Provider prompt cache | 90% off cached system + tool prefix on Anthropic; 50% on OpenAI | −$3,200 |
| M1: Auto-route | gpt-4o → gpt-4o-mini on classification + drafting sub-tasks (canary 0.96) | −$1,800 |
| M3: Compress | System prompt + tool descriptions, per-role opt-in | −$700 |
| M9: Output-length ceiling | max_tokens = p90 × 1.3 per workload | −$400 |
| M10: Batch arbitrage | 50% off on the analytics-ingest sub-workload (async-tolerant) | −$100 |
| Total measured savings | | −$14,600 |
| Substrate success fee (20% × measured savings) | | +$2,920 |
| Net customer save | | $11,680 / month |

Quality canary across the full mechanic stack held mean-score 0.96 for all 30 days; the per-stack 0.95 floor never tripped. Full walkthrough on the worked-numbers post.

Notice three things. First, no single mechanic accounts for more than 60% of the savings (the exact-match cache is the largest at roughly 58%), so the cascade matters. Second, the routing swap is only the third-largest line, not the largest; the two cache mechanics together account for about 79% of the savings, which is typical of workloads with prompt repetition. Third, the total is reported against the model the customer originally asked for, not against some generic reference rate, which is what the substrate's audit-immutable contract is for.
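The figures in the worked example can be re-derived with a few lines of arithmetic. The sketch below copies the per-mechanic savings and the 20% fee rate from the table; the variable and function names are illustrative only and are not part of any vendor API.

```python
# Monthly savings per mechanic in USD, copied from the worked-example table.
SAVINGS_USD = {
    "M2 exact-match cache": 8_400,
    "M6 provider prompt cache": 3_200,
    "M1 auto-route": 1_800,
    "M3 compress": 700,
    "M9 output-length ceiling": 400,
    "M10 batch arbitrage": 100,
}
SUCCESS_FEE_PCT = 20  # success fee as a percentage of measured savings


def summarize(savings, fee_pct):
    """Return total savings, success fee, net save, and each mechanic's share."""
    total = sum(savings.values())
    fee = total * fee_pct / 100
    net = total - fee
    shares = {name: amount / total for name, amount in savings.items()}
    return total, fee, net, shares


total, fee, net, shares = summarize(SAVINGS_USD, SUCCESS_FEE_PCT)

# Mechanics ordered from largest to smallest saving.
ranked = sorted(SAVINGS_USD, key=SAVINGS_USD.get, reverse=True)

print(f"total measured savings: ${total:,}")
print(f"success fee:            ${fee:,.0f}")
print(f"net customer save:      ${net:,.0f}")
for name in ranked:
    print(f"{name:28s} {shares[name]:6.1%}")
```

Running it reproduces the $14,600 total, the $2,920 fee, and the $11,680 net, and lists each mechanic's share of the savings in descending order.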

The audit-immutable contract

A savings claim that the customer cannot independently re-derive is a marketing claim. A substrate proxy fixes that with two pricing snapshots per request: original_cost_usd priced at the requested-model catalog rate, and actual_cost_usd priced at the actual-model rate after routing + cache hits + provider discounts. Both rates are pinned to a pricing_catalog snapshot version captured at request time. Mid-contract provider price changes do not retroactively alter past savings.

The customer can export the audit ledger and re-run the math at any time. That is what makes the success-fee billing model defensible — and what separates a substrate proxy from a gateway that sells observability of cost without measuring it.
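One way to picture the re-derivation is a ledger row that stores both cost snapshots alongside the catalog version they were priced against. The sketch below is an assumption about how such a ledger could look, not the actual schema: only `original_cost_usd`, `actual_cost_usd`, and the idea of a pinned `pricing_catalog` version come from the text. The model names and rates are placeholders, and provider discounts are simplified to a reduced billed-token count.

```python
from dataclasses import dataclass, replace
from decimal import Decimal

# Hypothetical catalog: snapshot version -> model -> (input, output) USD per 1M tokens.
# Rates are placeholders for illustration, not quoted prices.
PRICING_CATALOG = {
    "2026-05-01": {
        "model-large": (Decimal("2.50"), Decimal("10.00")),
        "model-small": (Decimal("0.15"), Decimal("0.60")),
    },
}


def price(version, model, input_tokens, output_tokens):
    """Price a request at the rates pinned by one catalog snapshot."""
    in_rate, out_rate = PRICING_CATALOG[version][model]
    return (in_rate * input_tokens + out_rate * output_tokens) / Decimal(1_000_000)


@dataclass(frozen=True)
class LedgerEntry:
    pricing_catalog_version: str
    requested_model: str
    actual_model: str
    requested_tokens: tuple  # (input, output) as the customer sent the request
    billed_tokens: tuple     # (input, output) actually billed after the cascade
    original_cost_usd: Decimal
    actual_cost_usd: Decimal


def verify(entry):
    """Re-derive both snapshots from the pinned version and compare to the ledger."""
    v = entry.pricing_catalog_version
    original = price(v, entry.requested_model, *entry.requested_tokens)
    actual = price(v, entry.actual_model, *entry.billed_tokens)
    return original == entry.original_cost_usd and actual == entry.actual_cost_usd


entry = LedgerEntry(
    pricing_catalog_version="2026-05-01",
    requested_model="model-large",
    actual_model="model-small",
    requested_tokens=(1000, 500),
    billed_tokens=(1000, 400),
    original_cost_usd=price("2026-05-01", "model-large", 1000, 500),
    actual_cost_usd=price("2026-05-01", "model-small", 1000, 400),
)
measured_saving = entry.original_cost_usd - entry.actual_cost_usd
tampered = replace(entry, actual_cost_usd=Decimal("0.0001"))
print(verify(entry), verify(tampered), measured_saving)
```

Because every row carries its own catalog version, a later price change adds a new snapshot instead of rewriting old rows, so past savings stay reproducible.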

Composition rules that protect quality

A naive stacking of every available mechanic on every request produces worst-case behaviour: compression on a structured-output workload destroys the schema, context pruning on a multi-turn dialog drops the user's last instruction, batching on a real-time chat breaks UX. The substrate contract has three guardrails baked in.

Composition cap. No more than two content-mutating mechanics may fire on the same request. The cap is enforced at the mechanic-selection layer before any provider call. A workload eligible for M3 + M7 + M8 will see at most two of those three actually run, and which two is chosen by per-stack canary evidence.
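A minimal sketch of how such a cap could be enforced at selection time, under stated assumptions: the set of content-mutating mechanics (M3, M7, M8) is taken from the example above, and "canary evidence" is modeled as a score per candidate pair. The function name and the score table are hypothetical.

```python
from itertools import combinations

MAX_CONTENT_MUTATING = 2
# Assumed set of content-mutating mechanics, based on the M3 + M7 + M8 example.
CONTENT_MUTATING = {"M3", "M7", "M8"}


def select_mechanics(eligible, stack_scores):
    """Keep all non-mutating mechanics and at most two content-mutating ones.

    eligible:     mechanic ids the workload qualifies for, in cascade order
    stack_scores: canary score per frozenset of content-mutating mechanics
    """
    mutating = [m for m in eligible if m in CONTENT_MUTATING]
    if len(mutating) <= MAX_CONTENT_MUTATING:
        return list(eligible)
    # Pick the pair with the best canary score; unknown pairs score 0.0.
    best = max(
        combinations(mutating, MAX_CONTENT_MUTATING),
        key=lambda pair: stack_scores.get(frozenset(pair), 0.0),
    )
    return [m for m in eligible if m not in CONTENT_MUTATING or m in best]


scores = {
    frozenset({"M3", "M7"}): 0.97,
    frozenset({"M3", "M8"}): 0.96,
    frozenset({"M7", "M8"}): 0.95,
}
chosen = select_mechanics(["M1", "M3", "M6", "M7", "M8"], scores)
print(chosen)  # ['M1', 'M3', 'M6', 'M7']
```

The selection runs before any provider call, so a dropped mechanic never touches the request body.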

Per-stack quality floor. Every mechanic combination that fires on a workload is tracked as a distinct stack (m1+m6+m9, m2+m3+m6, etc.). The canary cron runs the customer's eval set against each live stack daily; the rolling 3-day mean score must hold ≥ 0.95. Below that, the stack auto-rolls back to baseline and the affected fee window earns a 10% credit.
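The floor check reduces to a rolling mean over the last three daily canary scores. The sketch below assumes one score per stack per day and treats the 10% credit as a rate returned alongside the rollback decision; how the credit is applied to an invoice is not specified in this post, and all names are illustrative.

```python
from statistics import mean
from typing import NamedTuple, Optional, Sequence

QUALITY_FLOOR = 0.95     # rolling mean must hold at or above this
WINDOW_DAYS = 3          # rolling window length
FEE_CREDIT_RATE = 0.10   # credit on the affected fee window after a rollback


class Verdict(NamedTuple):
    action: str                    # "keep", "rollback", or "insufficient_data"
    rolling_mean: Optional[float]
    fee_credit_rate: float


def evaluate_stack(daily_scores: Sequence[float]) -> Verdict:
    """Evaluate one live stack from its daily canary scores, oldest first."""
    if len(daily_scores) < WINDOW_DAYS:
        return Verdict("insufficient_data", None, 0.0)
    rolling = mean(daily_scores[-WINDOW_DAYS:])
    if rolling < QUALITY_FLOOR:
        return Verdict("rollback", rolling, FEE_CREDIT_RATE)
    return Verdict("keep", rolling, 0.0)


# One entry per tracked stack, keyed by the mechanics that fire together.
history = {
    "m1+m6+m9": [0.96, 0.96, 0.96, 0.96],
    "m2+m3+m6": [0.97, 0.97, 0.95, 0.92],
}
verdicts = {stack: evaluate_stack(scores) for stack, scores in history.items()}
for stack, verdict in verdicts.items():
    print(stack, verdict.action)
```

Tracking each combination separately means a regression caused by one pairing rolls back only that stack, not every workload using one of its mechanics.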

Hard bypasses. Compliance-tier-regulated workloads short-circuit routing entirely. Streaming-required workloads bypass batching. Tool-using workloads bypass non-tool-capable route targets. The rules are documented per mechanic on the how-it-works page and enforced in worker code, not config.
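The three bypass rules can be expressed as plain predicates in code. This is a sketch only: the `Workload` fields and function names are hypothetical, and the mechanic ids M1 (routing) and M10 (batching) are borrowed from the worked example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    compliance_regulated: bool = False
    streaming_required: bool = False
    uses_tools: bool = False


def allowed_mechanics(workload, candidates):
    """Drop mechanics that a hard bypass forbids for this workload."""
    allowed = list(candidates)
    if workload.compliance_regulated and "M1" in allowed:
        allowed.remove("M1")   # regulated workloads never get re-routed
    if workload.streaming_required and "M10" in allowed:
        allowed.remove("M10")  # streaming workloads cannot wait for a batch
    return allowed


def allowed_route_targets(workload, targets):
    """targets maps model name -> whether that model supports tool calls."""
    if workload.compliance_regulated:
        return []              # routing is short-circuited entirely
    return [
        name for name, tool_capable in targets.items()
        if tool_capable or not workload.uses_tools
    ]


chat = Workload(streaming_required=True, uses_tools=True)
mechanics = allowed_mechanics(chat, ["M1", "M2", "M6", "M10"])
routes = allowed_route_targets(chat, {"small-with-tools": True, "small-no-tools": False})
print(mechanics, routes)
```

Writing the rules as code paths instead of configuration flags means a misconfigured tenant cannot switch a bypass off by accident.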

When each family fires in a real deployment

What this implies for buyers

A buyer evaluating "which LLM cost vendor do I pick" should pick two: a substrate (to own the eval contract and the cascade) and an observability platform (to own the read side). Specialists become candidates inside the substrate when the per-workload canary points at them. The gateway question is orthogonal — buy one if you have a governance need; skip it if the substrate already gives you the provider abstraction you need (it does in most cases).

A buyer evaluating "is it worth the integration effort" should look at the typical-savings band per family and back out the dollar number. A B2B SaaS spending $50k/month on LLMs sees realistic substrate savings of $12.5k-$30k/month; a $200k/month workload sees $50k-$120k. The free tier covers low-volume workloads entirely, so the integration test costs nothing.
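Backing out the dollar number is a one-line calculation per family. The sketch below applies the 25-60% substrate band from the table earlier in this post; the net-of-fee figure assumes the 20% success fee from the worked example carries over, which may not match a given contract.

```python
SUBSTRATE_BAND_PCT = (25, 60)  # typical savings band for substrate proxies
SUCCESS_FEE_PCT = 20           # assumed fee rate, taken from the worked example


def savings_band(monthly_spend_usd, band_pct=SUBSTRATE_BAND_PCT, fee_pct=SUCCESS_FEE_PCT):
    """Return (gross_low, gross_high, net_low, net_high) in USD per month."""
    low_pct, high_pct = band_pct
    gross_low = monthly_spend_usd * low_pct / 100
    gross_high = monthly_spend_usd * high_pct / 100
    keep_pct = 100 - fee_pct
    return gross_low, gross_high, gross_low * keep_pct / 100, gross_high * keep_pct / 100


for spend in (50_000, 200_000):
    gross_low, gross_high, net_low, net_high = savings_band(spend)
    print(f"${spend:,}/month: gross ${gross_low:,.0f}-${gross_high:,.0f}, "
          f"net of fee ${net_low:,.0f}-${net_high:,.0f}")
```

For the $50k/month case this gives the $12.5k-$30k gross range quoted above, or $10k-$24k after the assumed fee.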

What this implies for vendors

A vendor shipping a single mechanic well (compression, routing, fine-tuning, cache, length prediction) should treat the substrate as a distribution channel, not a competitor. The interface to plug in is small and documented. The substrate brings the audit contract, the eval gate, and the billing pipeline. The specialist brings the implementation.

Further reading