Auto-route to a cheaper-equivalent model
Static ROUTE_TABLE per provider (gpt-5 → gpt-5-mini, opus → sonnet → haiku) gated by daily promptfoo canary on your eval set. Chained walks compose with cumulative quality-preservation product.
The substrate layer for LLM cost · v3.6 · Free Sandbox open
A thin proxy in your request path. Ten mechanics + cross-provider failover. Eval-gated. Audit-immutable measurement. Flat monthly pricing by token volume — keep every dollar we save you.
Hard per-workload spend caps included. The request-time block your provider dashboard quietly dropped in 2025. See how →
Worked example · 5B-token gpt-4o customer-support agent
$13,601/mo saved · quality canary held 0.96 across 30 daysfull breakdown ↓
$ pip install tessera-llm-proxy
import tessera
tessera.activate("tk_your_key")
# Existing OpenAI / Anthropic / Mistral / Groq / Cohere SDK calls
# routed through Tessera. Same response shape; savings appear in /portal/audit.
II · Proof
Customer-support agent on gpt-4o, 5 billion tokens per month. Below is the mechanic-by-mechanic breakdown. The same shape every audit ledger row at ledger.tesseraai.io is built from.
One TLS hop. ~15-40 ms p50 cache-miss latency overhead. Cache hits drop sub-10ms. No prompt-content storage at rest. Every saved dollar pinned to a pricing_catalog snapshot id captured at request time.
| Mechanic | Applied (in order) | Cost / mo | Saved |
|---|---|---|---|
| - | Baseline · OpenAI direct | $24,000 | - |
| M2 | Auto-cache · 35% hit rate | $19,920 | −$4,080 |
| M1 | Auto-route · 78% routed, canary held | $13,920 | −$6,000 |
| M6 | Provider prompt cache · 71% cacheable prefix | $11,520 | −$2,400 |
| M7 | Context prune · 18% applicable | $11,000 | −$520 |
| M9 | Output-length ceiling | $10,650 | −$350 |
| M10 | Batch arbitrage · 4% batch-eligible | $10,500 | −$150 |
| M3 | Compress · system, ratio 0.93 | $10,150 | −$350 |
| Tessera-optimized total · measured savings | −$14,600 | ||
| Tessera subscription · Growth tier (flat) | +$999 | ||
| Customer net pay · keeps 100% of savings | $10,399 | $13,601 | |
III · Mechanics
Each mechanic is opt-in per workload, eval-gated, observable per request, and bypasses when its canary drops below the per-stack 0.95 quality floor. Below: four highest-impact mechanics. Each shown alongside a sample audit ledger row (the same artifact format you see on /portal/audit per real request). Then a condensed strip for the remaining five plus M11 failover.
Static ROUTE_TABLE per provider (gpt-5 → gpt-5-mini, opus → sonnet → haiku) gated by daily promptfoo canary on your eval set. Chained walks compose with cumulative quality-preservation product.
sha256 cache on the canonical request body (model, messages, temperature, tools, response_format). Default TTL 7 days. Cache hits return upstream of any content-mutating mechanic. No provider call.
Per-role heuristic whitespace + structural compression. Preserves code fences and JSON shapes. System and user turns toggle independently. Roadmap: server-side LLMLingua-2 template substitution per workload.
OpenAI Batch + Anthropic Message Batches both ship 50% off. Per-workload opt-in. The proxy queues async-tolerant requests, dispatches on the batch window, returns when ready. Per-stack canary still fires.
M5 · Semantic cache
Cosine-similar requests served from Cloudflare Vectorize. Catches paraphrases the exact cache misses.
Deep dive →
M6 · Provider prompt cache
Anthropic 90% off cached prefix. OpenAI 50%. Google 75%. We auto-inject the markers on stable system prompts.
Deep dive →
M7 · Context prune
Long conversations trimmed to system + last 8 turns. RAG attachments reranked via FlashRank.
Deep dive →
M8 · Structured output
Constrain completions to a JSON schema with explicit max-tokens budgeting. Cuts over-runs and parse-retry traffic.
Deep dive →
M9 · Output-length ceiling
Daily cron fits p90 of completion length per workload. max_tokens = p90 × 1.3. Stops runaway completions.
Deep dive →
M11 · Cross-provider failover
Primary upstream 5xx → retry on OpenRouter. Opt-in, default OFF, two pricing snapshots for audit immutability.
Deep dive →
IV · Integrations
Seven first-class adapters for the agent / RAG / chat frameworks our pilots actually run on. One line of config; same proxy and mechanic stack across all of them. No re-architecture required.
Each card opens a documentation surface · clinical register
LangChain
Python · JS
Vercel AI SDK
TypeScript
LlamaIndex
Python · JS
Mastra
TypeScript
Pydantic AI
Python
CrewAI
Python
AutoGen 0.4+
Python
All 7 →
Open source · Apache-2.0
tessera-llm/tessera-sdk · tessera-langchain · tessera-vercel-ai · tessera-llamaindex
V · Economics
Flat monthly pricing by token volume — no per-token markup, no cut of your savings. The Free Sandbox covers most personal and prototype workloads entirely. Paid tiers start at $199/month. We measure and prove every dollar of savings (audit-immutable); you keep 100%.
Free Sandbox
60M tokens per month. Full mechanic stack active — same feature set as every paid tier; only the token cap differs. No credit card. No expiration. The same proxy infrastructure as the paid tiers.
Paid tiers
One predictable monthly price by gross tokens submitted — no per-token markup, no cut of your savings. You keep 100% of what we save you.
Every paid tier: full mechanic stack (route + cache + compress + batch), per-stack quality SLA (0.95 floor + auto-rollback), audit-immutable measurement.
Audit immutability · sample row from /portal/audit
request_id: req_01HZWAYK7XYZ
requested_model: gpt-4o
actual_model: gpt-4o-mini (M1 routed, canary 0.96)
original_cost_usd: 0.01250 (snapshot_id sn_2026_05_21_a7)
actual_cost_usd: 0.00075 (snapshot_id sn_2026_05_21_a7)
savings_usd: 0.01175
you_keep_usd: 0.01175 (100% — flat plan, no savings cut)
mechanics_stack: [m1, m6]
Both original_cost_usd and actual_cost_usd are priced against the same pricing_catalogsnapshot captured at request time. Mid-contract provider price changes don't retroactively alter past savings. Your CFO can re-derive the bill from raw inputs.
VI · Trust
EntityFintechagency OÜ d.b.a. Tessera. Estonia, registry code 16638667, Tallinn. Authoritative registry: ariregister.rik.ee.
Data handlingDefault: token counts, cost deltas, mechanics_stack, and provider response status are logged. Prompt and completion content are not persisted by default. Opt-in M5 semantic cache stores completion text in tenant-isolated edge KV. Full posture →
EncryptionCustomer Authorization keys at-rest-encrypted via Supabase Vault (XChaCha20-Poly1305-IETF). RLS isolation per tenant.
SOC 2 Type 1Targeted Q3 2026. Audit-trail with cryptographic provenance via pricing_catalog snapshot ids.
Vendor neutralityTessera accepts no referral fees, kickbacks, or sponsorship from any AI provider, gateway vendor, or observability platform (Terms §10).
Client pause controlAlways-available kill-switch in /portal/billing. Reversible at any time, no notice required. Paused traffic forwards as pure passthrough — no optimization runs.