What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via LLMLingua-2 where safe, auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera enter the Performance Fee.

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. We bill on measured savings from our own proxy logs, not on a SaaS seat. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Two tiers. Free Sandbox: 60 million tokens per month, $0 fee, observe-only optimisations (savings measured but no fee accrued). Production: 20% of measured savings, billed against a prepaid balance (Stripe Checkout, $100 minimum top-up, or invoice on request). Zero savings = zero fee in that window. No floor, no retainer, no contract review for activation. If balance reaches zero, the proxy automatically pauses optimisations and forwards requests as passthrough — no fees accrue while paused. Top up to resume. Large-volume Production Clients (above $500,000 per month in measured savings) may negotiate a custom MSA addendum. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a 10% fee credit applied to your balance.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Performance Fee does not accrue on paused traffic. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Two primary shapes carry most Production signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base-URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. Plus a developer-facing Free Sandbox tier (60M tokens/mo, $0 fee) for solo developers + side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route — the compliance gate blocks routing at the code level. Above $500k/mo in measured savings, Production Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

← Tessera Blog

2026-05-26·9 min read

How an AI customer-support workload cuts its GPT-4 bill 38% without quality regression

Most cost-optimization tools for LLM workloads are observability tools. They watch your bill grow and produce charts. The cost-cutting decision — which model? which calls do we cache? which prompts compress? — is left to the engineer who already has 30 other things to do.

Tessera takes the opposite stance: be the substrate. Sit in the request path. Do the work. Bill on measured savings only.

This post walks through one workload — a customer-support AI agent serving 1.2 billion tokens/month — and shows exactly where each of the seven cost optimization mechanics cuts, what the quality canary did, and what the customer's net was after our 20% performance fee.

Before: $24,000/month at OpenAI list prices.

After Tessera (raw): $9,400/month.

Plus Tessera fee (20% × savings): $2,920/month.

Customer net pay: $12,320/month — $11,680 saved every month.

Quality canary mean-score across all stages: 0.96 (floor 0.95). We didn't change the customer's eval set. We didn't change their prompt templates. We didn't change their RAG retrieval. Two headers got added; the rest is proxy-side.

What the workload looked like at $24k/month

Customer-support AI agent. RAG-backed with top-5 retrieved docs per turn. Running on gpt-4o at OpenAI list prices. ~250M tokens/day at peak, 1.2B/month aggregate.

Cost breakdown: prompt ~70%, completion ~30%. The prompt carries a ~6 KB stable system prefix (a product knowledge base) plus ~3 KB variable per turn (user query + retrieved docs).

Why on gpt-4o and not gpt-4o-mini? The customer A/B-tested both internally; mini failed quality on 12% of edge-case queries (troubleshooting + escalation). Floor model stays gpt-4o.

Query mix matters here: 50% FAQ-style, 30% account-lookup, 15% troubleshooting, 5% escalation. The first 80% has high prompt repetition + stable RAG block reuse. That's what the mechanic stack will eat.

M1 — Auto-route from gpt-4o to gpt-4o-mini, gated by daily canary

The mapping table contains gpt-4o → gpt-4o-mini with a quality_preservation_estimate of 0.92. That estimate is static (it comes from published benchmarks); the live gate is something else.

The live gate: a daily canary cron runs the customer's promptfoo eval set (5–50 prompts representative of the workload) against a 10% sample of production traffic. Each sample runs twice — once on the routed model, once on the baseline — and the eval set scores both. The aggregate mean_score per stack must hold ≥ 0.95.

On this workload the canary held 0.96 across the FAQ + account- lookup cohort routed to mini. The troubleshooting + escalation cohort (22% of traffic) stayed on gpt-4o because their feature_tag opted them out of routing.

Savings on routed cohort: $24,000 → $19,920 (saved $4,080, 17% of baseline).

M2 — Exact-match KV cache for FAQ repetition

sha256 of the normalized request body becomes the cache key. KV lookup at Cloudflare edge before the upstream call. On hit, return the cached response (latency drops 10–50× because no upstream call at all). On miss, the response gets written to KV with a 7-day TTL.

The risk people worry about with caching is staleness — what if a knowledge base entry changes underneath you? Our answer: every cached response carries its original pricing_snapshot_id and request_id, and the cron loop invalidates stale entries on KB updates. Caches can be opted-out per-workload if invalidation is hard.

On this workload: 35% of queries hit an exact prior match — typical FAQ traffic. Savings: $19,920 → $13,920 (saved $6,000, 25% additional).

M6 — Native prompt cache headers (90% off cached input tokens)

OpenAI launched prompt caching October 2024. Anthropic shipped it April 2024. Google has its cachedContents lifecycle. All three offer ~90% off on tokens that match a cached prefix on the provider side.

The opportunity for our workload: 6 KB stable system prefix repeating across every single request. Without prompt cache, the customer pays full token price for that 6 KB on every call. With M6 (which is just header injection — OpenAI's cached_tokens_id or Anthropic's cache_control: ephemeral markers), the provider charges 10% on that cached prefix.

71% of queries hit a cached prefix on this workload. Cached input tokens billed at the 10% rate. The per-call input cost dropped by roughly 60–80% on the system block.

Savings:$13,920 → $11,520 (saved $2,400, 10% additional). This is provider-native; Tessera doesn't invent the savings — we notice that your system prefix is cacheable and turn the lever on.

M7 — Context pruning for long sessions

Customer-support sessions occasionally accumulate 15–25 message turns. The older turns rapidly lose relevance, but the LLM still pays for every token in the conversation history.

M7 fires when message count > 12 OR body > 32 KB. The trim is conservative: keep system + last 8 turns. For RAG-attached docs, TF-IDF rerank against the latest user query + keep top 5. We do not summarize old turns by default — summary loss is too easy to mis-eval. Customers can opt into more aggressive pruning per workload.

On this workload: 18% of queries had M7 applied (the long sessions). Quality canary specifically measured M7-applied requests and held 0.96. Savings: $11,520 → $11,000 (saved $520, 2%). Small contribution here because the mechanic fires on a minority of requests; on workloads with heavier conversation density (multi- turn coding agents, long therapy sessions) M7 can dominate.

M9 — max_tokens ceiling from p90 truncation rate

When a workload's historical truncation rate is under 2% AND we have ≥ 100 sample outputs, we can fit a per-workload p90 output length. The daily compute cron does this and injects amax_tokens ceiling of p90 × 1.3 (configurable per-workload up to 2.0× for safety).

On this workload the customer was using OpenAI's default (~4096 tokens). The actual p90 output was 480. M9 droppedmax_tokens to 624. Effect: output cost cut on completions that previously over-ran in budget but were never actually long.

Savings: $11,000 → $10,650 (saved $350, 1%).

M10 — Batch arbitrage where the workload allows

Not every request needs synchronous response. Back-fills, retrospective analysis, eval generation, content moderation, RAG indexing — all can wait 24h for completion. OpenAI Batch API and Anthropic Message Batches both offer 50% off list price; the trade is real-time latency for cost.

Tessera detects batch-eligibility via per-workload opt-in plus an optional per-request header x-tessera-batch-allowed: true. On this workload the customer flagged a weekly customer-retention digest generator as batch-eligible — about 4% of total token volume.

Savings:$10,650 → $10,500 (saved $150, <1%). Small here because batch-eligible traffic is rare on this workload. On workloads with heavy async paths M10 frequently becomes the dominant mechanic.

M3 — Heuristic prompt compression (system + user turns)

Whitespace + structural normalization. Preserves code fences, inline code spans, JSON-shape ranges. Per-role opt-in (compress system prompts and/or user turns independently — we shipped that split 2026-05-26 specifically because some workloads have stable expensive system prompts and variable user content, and the old unified toggle was too coarse).

On this workload the system prefix compressed at ratio 0.93 (saved 7% of system prefix tokens). User turns NOT compressed (their text shape was already canonical from the frontend formatter).

Savings: $10,500 → $10,150 (saved $350, 1%).

Stage-by-stage: how each mechanic moved the needle

Stage	Cost / mo	Saved
Baseline (OpenAI direct)	$24,000	—
+ M2 auto-cache (35% hit rate)	$19,920	$4,080
+ M1 auto-route (78% routed, canary held)	$13,920	$6,000
+ M6 prompt cache (71% cacheable prefix)	$11,520	$2,400
+ M7 context prune (18% applicable)	$11,000	$520
+ M9 max_tokens ceiling	$10,650	$350
+ M10 batch arbitrage (4% batch-eligible)	$10,500	$150
+ M3 compress (system, ratio 0.93)	$10,150	$350
Tessera-optimized total	$9,400	$14,600
Tessera fee (20% × savings)	$2,920	—
Customer net pay	$12,320	$11,680

The savings table shows compounding — each subsequent mechanic operates on what the previous one left behind. Order matters: caches first (eliminate fully-redundant requests), route second (cheaper model on what remains), prompt-cache headers third (90% off the stable prefix), content mutators last.

Quality canary across the entire stack: mean-score 0.96, floor 0.95 held all 30 days. No SLA breach events. If it HAD breached, the specific stack (e.g. m1+m7) would auto-disable, the customer would get a 10% prorated credit on the affected fees, and m1 alone and m7 alone would stay live — surgical rollback, not nuclear.

Why you can trust the savings number

The honest critique we hear most often from CTOs: “What stops you from inflating the savings number?” Worth answering directly because the architecture makes it expensive to inflate AND easy to audit.

Every request emits two cost figures stored immutably: original_cost_usd (priced at the requested model's catalog rate) and actual_cost_usd (priced at the actual model after route + cache + provider discounts). Both rates are pinned to a pricing_catalog snapshot version id captured at the request.

The pricing_catalog table is multi-source verified: LiteLLM model_prices.json, the tokencost library, the OpenRouter API, and direct provider docs — each independently fetched, confidence-scored, with 0.99/day recency decay. If three of four sources agree on a price within 0.5%, the row gets confidence ≥ 0.95 and is eligible for use.

Mid-contract price changes don't retroactively alter past savings. If OpenAI cuts the price of gpt-4o tomorrow, your historical savings stay locked to the snapshot that was active when each request fired.

You can export the full audit ledger any time from /portal/audit as CSV. Every row carries pricing_snapshot_id, mechanics_stack, tokens in/out, original and actual cost. Two engineers, three hours, can re-derive any month's savings from raw inputs.

When NOT to use Tessera

Honest disqualification matters more than the pitch:

Hobby projects under $500/month bills.The free 60M tokens/month tier covers you. Production-tier integration isn't worth the engineering time.
Air-gapped on-prem deployments. Tessera is hosted-only by design. On-prem is on the roadmap but not shipping in 2026.
High-latency-sensitivity workloads (< 10ms p50 SLO). Cache-miss path adds ~15–40 ms p50 from the Cloudflare edge hop. If your SLO is tighter than that, we're not your tool.
Workloads with no repetition or stable prefix. If every request has a unique system prompt AND a unique user query, M2 and M6 don't fire. M1 and M10 might still help — worth measuring on free tier first.
Compliance-locked data residency (specific country). Our edge runs multi-region Cloudflare. If you need Switzerland-only or India-only residency, we don't fit yet.

Try it on your workload

Free Dev tier: 60M tokens/month, no card, no commitment. Sign up at tesseraai.io/dev — 30-second flow returns an instant API key plus magic-link dashboard access.

One-line activation in Python or Node — your existing OpenAI / Anthropic / Mistral / Cohere / Groq client picks up the proxy transparently:

# Python
import tessera
tessera.activate("tk_your_tessera_key")

# Existing code runs unchanged. Provider keys stay in your env.

Run your workload for 7 days. Compare baseline to Tessera-optimized in /portal/overview. If the math doesn't work for you, kill-switch in /portal/settings reverts to passthrough.

Production tier activates when you exceed 60M tokens/month. Pricing: 20% of measured savings only — zero savings, zero fee.

References

Try Tessera on your workload

60M tokens free, no card, 30-second signup

Measure your actual cost delta. Kill-switch any time. Pay 20% only on measured savings — zero savings, zero fee.

Get free API key→