How an AI customer-support workload cuts its GPT-4 bill 38% without quality regression
Most cost-optimization tools for LLM workloads are observability tools. They watch your bill grow and produce charts. The cost-cutting decision — which model? which calls do we cache? which prompts compress? — is left to the engineer who already has 30 other things to do.
Tessera takes the opposite stance: be the substrate. Sit in the request path. Do the work. Bill on measured savings only.
This post walks through one workload — a customer-support AI agent serving 1.2 billion tokens/month — and shows exactly where each of the seven cost optimization mechanics cuts, what the quality canary did, and what the customer's net was after our 20% performance fee.
Before: $24,000/month at OpenAI list prices.
After Tessera (raw): $9,400/month.
Plus Tessera fee (20% × savings): $2,920/month.
Customer net pay: $12,320/month — $11,680 saved every month.
Quality canary mean-score across all stages: 0.96 (floor 0.95). We didn't change the customer's eval set. We didn't change their prompt templates. We didn't change their RAG retrieval. Two headers got added; the rest is proxy-side.
What the workload looked like at $24k/month
Customer-support AI agent. RAG-backed with top-5 retrieved docs per turn. Running on gpt-4o at OpenAI list prices. ~250M tokens/day at peak, 1.2B/month aggregate.
Cost breakdown: prompt ~70%, completion ~30%. The prompt carries a ~6 KB stable system prefix (a product knowledge base) plus ~3 KB variable per turn (user query + retrieved docs).
Why on gpt-4o and not gpt-4o-mini? The customer A/B-tested both internally; mini failed quality on 12% of edge-case queries (troubleshooting + escalation). Floor model stays gpt-4o.
Query mix matters here: 50% FAQ-style, 30% account-lookup, 15% troubleshooting, 5% escalation. The first 80% has high prompt repetition + stable RAG block reuse. That's what the mechanic stack will eat.
M1 — Auto-route from gpt-4o to gpt-4o-mini, gated by daily canary
The mapping table contains gpt-4o → gpt-4o-mini with a quality_preservation_estimate of 0.92. That estimate is static (it comes from published benchmarks); the live gate is something else.
The live gate: a daily canary cron runs the customer's promptfoo eval set (5–50 prompts representative of the workload) against a 10% sample of production traffic. Each sample runs twice — once on the routed model, once on the baseline — and the eval set scores both. The aggregate mean_score per stack must hold ≥ 0.95.
On this workload the canary held 0.96 across the FAQ + account- lookup cohort routed to mini. The troubleshooting + escalation cohort (22% of traffic) stayed on gpt-4o because their feature_tag opted them out of routing.
Savings on routed cohort: $24,000 → $19,920 (saved $4,080, 17% of baseline).
M2 — Exact-match KV cache for FAQ repetition
sha256 of the normalized request body becomes the cache key. KV lookup at Cloudflare edge before the upstream call. On hit, return the cached response (latency drops 10–50× because no upstream call at all). On miss, the response gets written to KV with a 7-day TTL.
The risk people worry about with caching is staleness — what if a knowledge base entry changes underneath you? Our answer: every cached response carries its original pricing_snapshot_id and request_id, and the cron loop invalidates stale entries on KB updates. Caches can be opted-out per-workload if invalidation is hard.
On this workload: 35% of queries hit an exact prior match — typical FAQ traffic. Savings: $19,920 → $13,920 (saved $6,000, 25% additional).
M6 — Native prompt cache headers (90% off cached input tokens)
OpenAI launched prompt caching October 2024. Anthropic shipped it April 2024. Google has its cachedContents lifecycle. All three offer ~90% off on tokens that match a cached prefix on the provider side.
The opportunity for our workload: 6 KB stable system prefix repeating across every single request. Without prompt cache, the customer pays full token price for that 6 KB on every call. With M6 (which is just header injection — OpenAI's cached_tokens_id or Anthropic's cache_control: ephemeral markers), the provider charges 10% on that cached prefix.
71% of queries hit a cached prefix on this workload. Cached input tokens billed at the 10% rate. The per-call input cost dropped by roughly 60–80% on the system block.
Savings:$13,920 → $11,520 (saved $2,400, 10% additional). This is provider-native; Tessera doesn't invent the savings — we notice that your system prefix is cacheable and turn the lever on.
M7 — Context pruning for long sessions
Customer-support sessions occasionally accumulate 15–25 message turns. The older turns rapidly lose relevance, but the LLM still pays for every token in the conversation history.
M7 fires when message count > 12 OR body > 32 KB. The trim is conservative: keep system + last 8 turns. For RAG-attached docs, TF-IDF rerank against the latest user query + keep top 5. We do not summarize old turns by default — summary loss is too easy to mis-eval. Customers can opt into more aggressive pruning per workload.
On this workload: 18% of queries had M7 applied (the long sessions). Quality canary specifically measured M7-applied requests and held 0.96. Savings: $11,520 → $11,000 (saved $520, 2%). Small contribution here because the mechanic fires on a minority of requests; on workloads with heavier conversation density (multi- turn coding agents, long therapy sessions) M7 can dominate.
M9 — max_tokens ceiling from p90 truncation rate
When a workload's historical truncation rate is under 2% AND we have ≥ 100 sample outputs, we can fit a per-workload p90 output length. The daily compute cron does this and injects amax_tokens ceiling of p90 × 1.3 (configurable per-workload up to 2.0× for safety).
On this workload the customer was using OpenAI's default (~4096 tokens). The actual p90 output was 480. M9 droppedmax_tokens to 624. Effect: output cost cut on completions that previously over-ran in budget but were never actually long.
Savings: $11,000 → $10,650 (saved $350, 1%).
M10 — Batch arbitrage where the workload allows
Not every request needs synchronous response. Back-fills, retrospective analysis, eval generation, content moderation, RAG indexing — all can wait 24h for completion. OpenAI Batch API and Anthropic Message Batches both offer 50% off list price; the trade is real-time latency for cost.
Tessera detects batch-eligibility via per-workload opt-in plus an optional per-request header x-tessera-batch-allowed: true. On this workload the customer flagged a weekly customer-retention digest generator as batch-eligible — about 4% of total token volume.
Savings:$10,650 → $10,500 (saved $150, <1%). Small here because batch-eligible traffic is rare on this workload. On workloads with heavy async paths M10 frequently becomes the dominant mechanic.
M3 — Heuristic prompt compression (system + user turns)
Whitespace + structural normalization. Preserves code fences, inline code spans, JSON-shape ranges. Per-role opt-in (compress system prompts and/or user turns independently — we shipped that split 2026-05-26 specifically because some workloads have stable expensive system prompts and variable user content, and the old unified toggle was too coarse).
On this workload the system prefix compressed at ratio 0.93 (saved 7% of system prefix tokens). User turns NOT compressed (their text shape was already canonical from the frontend formatter).
Savings: $10,500 → $10,150 (saved $350, 1%).
Stage-by-stage: how each mechanic moved the needle
| Stage | Cost / mo | Saved |
|---|---|---|
| Baseline (OpenAI direct) | $24,000 | — |
| + M2 auto-cache (35% hit rate) | $19,920 | $4,080 |
| + M1 auto-route (78% routed, canary held) | $13,920 | $6,000 |
| + M6 prompt cache (71% cacheable prefix) | $11,520 | $2,400 |
| + M7 context prune (18% applicable) | $11,000 | $520 |
| + M9 max_tokens ceiling | $10,650 | $350 |
| + M10 batch arbitrage (4% batch-eligible) | $10,500 | $150 |
| + M3 compress (system, ratio 0.93) | $10,150 | $350 |
| Tessera-optimized total | $9,400 | $14,600 |
| Tessera fee (20% × savings) | $2,920 | — |
| Customer net pay | $12,320 | $11,680 |
The savings table shows compounding — each subsequent mechanic operates on what the previous one left behind. Order matters: caches first (eliminate fully-redundant requests), route second (cheaper model on what remains), prompt-cache headers third (90% off the stable prefix), content mutators last.
Quality canary across the entire stack: mean-score 0.96, floor 0.95 held all 30 days. No SLA breach events. If it HAD breached, the specific stack (e.g. m1+m7) would auto-disable, the customer would get a 10% prorated credit on the affected fees, and m1 alone and m7 alone would stay live — surgical rollback, not nuclear.
Why you can trust the savings number
The honest critique we hear most often from CTOs: “What stops you from inflating the savings number?” Worth answering directly because the architecture makes it expensive to inflate AND easy to audit.
Every request emits two cost figures stored immutably: original_cost_usd (priced at the requested model's catalog rate) and actual_cost_usd (priced at the actual model after route + cache + provider discounts). Both rates are pinned to a pricing_catalog snapshot version id captured at the request.
The pricing_catalog table is multi-source verified: LiteLLM model_prices.json, the tokencost library, the OpenRouter API, and direct provider docs — each independently fetched, confidence-scored, with 0.99/day recency decay. If three of four sources agree on a price within 0.5%, the row gets confidence ≥ 0.95 and is eligible for use.
Mid-contract price changes don't retroactively alter past savings. If OpenAI cuts the price of gpt-4o tomorrow, your historical savings stay locked to the snapshot that was active when each request fired.
You can export the full audit ledger any time from /portal/audit as CSV. Every row carries pricing_snapshot_id, mechanics_stack, tokens in/out, original and actual cost. Two engineers, three hours, can re-derive any month's savings from raw inputs.
When NOT to use Tessera
Honest disqualification matters more than the pitch:
- Hobby projects under $500/month bills.The free 60M tokens/month tier covers you. Production-tier integration isn't worth the engineering time.
- Air-gapped on-prem deployments. Tessera is hosted-only by design. On-prem is on the roadmap but not shipping in 2026.
- High-latency-sensitivity workloads (< 10ms p50 SLO). Cache-miss path adds ~15–40 ms p50 from the Cloudflare edge hop. If your SLO is tighter than that, we're not your tool.
- Workloads with no repetition or stable prefix. If every request has a unique system prompt AND a unique user query, M2 and M6 don't fire. M1 and M10 might still help — worth measuring on free tier first.
- Compliance-locked data residency (specific country). Our edge runs multi-region Cloudflare. If you need Switzerland-only or India-only residency, we don't fit yet.
Try it on your workload
Free Dev tier: 60M tokens/month, no card, no commitment. Sign up at tesseraai.io/dev — 30-second flow returns an instant API key plus magic-link dashboard access.
One-line activation in Python or Node — your existing OpenAI / Anthropic / Mistral / Cohere / Groq client picks up the proxy transparently:
# Python
import tessera
tessera.activate("tk_your_tessera_key")
# Existing code runs unchanged. Provider keys stay in your env.Run your workload for 7 days. Compare baseline to Tessera-optimized in /portal/overview. If the math doesn't work for you, kill-switch in /portal/settings reverts to passthrough.
Production tier activates when you exceed 60M tokens/month. Pricing: 20% of measured savings only — zero savings, zero fee.
References
- Architecture: how the proxy + canary + ledger work
- Quality SLA: per-stack auto-rollback + 10% credit
- SDK on GitHub: github.com/tessera-llm/tessera-sdk
- Terms of service
Try Tessera on your workload
60M tokens free, no card, 30-second signup
Measure your actual cost delta. Kill-switch any time. Pay 20% only on measured savings — zero savings, zero fee.
Get free API key→