Tessera engineering blog
Notes on LLM cost optimization.
From the request path.
Worked examples, architecture deep-dives, and engineering decisions from the team building Tessera — the substrate proxy that cuts LLM cost without dropping quality. No marketing fluff, no inflated numbers, full audit trails behind every claim.
- ·9 min read
Tessera ecosystem map: how compression vendors fit the substrate cascade
Compression vendors, fine-tuners, and routers are not competitors to a substrate proxy — they are candidate mechanics inside it. A compositional view of the LLM cost-optimization ecosystem: TwoTrim, leanctx, OpenPipe, Unify, Chamber, LiteLLM, and how each plugs into the Tessera backend-marketplace pattern.
architecture, ecosystem, substrate, request-infrastructure
- ·12 min read
AI cost management in 2026: how proxies, compressors, and routers compose into a substrate
A practitioner's guide to the LLM cost stack in 2026. The five approach families (gateways, proxies, compressors, routers, observability) are composable layers, not mutually-exclusive choices. Worked numbers per family, when each one fires, and how a substrate proxy assembles them under a single eval contract.
cost-optimization, finops, architecture, industry-overview
- ·9 min read
Per-stack quality canary — auto-rolling back any mechanic combination that degrades output
Cross-mechanic interactions hide regressions that per-mechanic monitoring misses. Tessera scores each (workload × mechanic-stack) tuple independently, flags a breach after 3 days below 0.95 mean_score, and auto-disables the breaching stack via subset-match enforcement. The architecture, the 5%/30-sample/0.95 trade-offs, and the honest limitations.
quality, eval-gated, architecture, request-infrastructure
- ·10 min read
Cross-provider failover at the edge — what we built (and what we did not)
When the primary LLM provider 5xx’s, your app stops talking. M11 retries the request on OpenRouter so the customer sees a successful response instead. The architecture, the MVP scope locks (passive only, OpenAI-shape primaries, opt-in default-off), and the deliberate Phase 2 deferrals.
reliability, request-infrastructure, failover, architecture
- ·8 min read
The prompt bloat you don’t see — and what it’s costing you
Can you tell me, in tokens, how big your production system prompt was last month? Most teams can’t. We measured before/after compression across our beta workloads — here are the three patterns we keep finding and what to do about them.
cost-optimization, prompt-engineering, observability, compression
- ·7 min read
Using Tessera with LangChain in 30 seconds — drop-in cost optimization
Run pip install tessera-langchain, then add one line of config to your existing ChatOpenAI / ChatAnthropic / ChatMistralAI constructor. Tools, streaming, eval, and structured output all pass through unchanged. Worked example: a LangChain customer-support agent drops from $24k/mo to $9.4k/mo.
langchain, integration, cost-optimization, tutorial
- ·9 min read
Audit immutability for AI cost claims — what snapshot-pinning actually buys you
Every Tessera cost number references an immutable pricing_snapshot_id. The catalog is multi-source verified (LiteLLM + tokencost + OpenRouter, consensus at ≥ 0.95 confidence). Two engineers can re-derive any month from raw inputs in three hours. This post covers the architecture.
audit, transparency, architecture, request-infrastructure
- ·9 min read
How an AI customer-support workload cuts its GPT-4 bill 38% without quality regression
A worked example: $24k/month OpenAI bill cut to $9.4k via auto-route, exact + semantic cache, prompt cache headers, context pruning, output-length predictor, and batch arbitrage. Quality canary held 0.96.
cost-optimization, openai, eval-gated, case-study