
The substrate layer for LLM cost · v3.6 · Free Sandbox open

The substrate layer
for LLM cost.

A thin proxy in your request path. Ten mechanics + cross-provider failover. Eval-gated. Audit-immutable measurement. Flat monthly pricing by token volume — keep every dollar we save you.

Hard per-workload spend caps included. The request-time block your provider dashboard quietly dropped in 2025. See how →

Get a free 60M-token key → · View on GitHub ↗

No card · Apache-2.0 · 30s install

Worked example · 5B-token gpt-4o customer-support agent

$13,601/mo saved · quality canary held 0.96 across 30 days · full breakdown ↓

$ pip install tessera-llm-proxy

import tessera

tessera.activate("tk_your_key")

# Existing OpenAI / Anthropic / Mistral / Groq / Cohere SDK calls are now
# routed through Tessera. Same response shape; savings appear in /portal/audit.

II · Proof

$24,000 bill → $9,400.
Quality canary held 0.96.

Customer-support agent on gpt-4o, 5 billion tokens per month. Below is the mechanic-by-mechanic breakdown, the same shape from which every audit ledger row at ledger.tesseraai.io is built.

Fig. I · Request path · Tessera Optimize Layer schematic

Your application sends requests to the Tessera proxy, which applies four optimizations before forwarding to your LLM provider: route to a cheaper model when quality holds, cache identical requests, compress prompt tokens, and batch where eligible. Five percent of routed requests are canary-sampled against the original model, and the quality gate fails closed on uncertainty.

Path shown: Your app (OpenAI · Anthropic · Google · etc.) → Tessera Optimize Layer → Provider (OpenAI · Anthropic · Google · xAI · etc.). Requests are re-routed on the way out; responses are measured on the way back.

Stages shown: Route · cheaper model · 142/h. Cache · identical requests · 98/h. Compress · prompt tokens · 67/h. Batch · batch-eligible · 14/h.

One TLS hop. Roughly 15-40 ms of added p50 latency on a cache miss. Cache hits return in under 10 ms. No prompt-content storage at rest. Every saved dollar is pinned to a pricing_catalog snapshot id captured at request time.

Mechanic breakdown · 5B tokens/mo
Sequential mechanic-by-mechanic savings breakdown for a 5B-token gpt-4o customer-support workload. Each row applies on top of the previous, so the saved column compounds.
| Mechanic | Applied (in order) | Cost / mo | Saved |
|---|---|---|---|
| - | Baseline · OpenAI direct | $24,000 | - |
| M2 | Auto-cache · 35% hit rate | $19,920 | −$4,080 |
| M1 | Auto-route · 78% routed, canary held | $13,920 | −$6,000 |
| M6 | Provider prompt cache · 71% cacheable prefix | $11,520 | −$2,400 |
| M7 | Context prune · 18% applicable | $11,000 | −$520 |
| M9 | Output-length ceiling | $10,650 | −$350 |
| M10 | Batch arbitrage · 4% batch-eligible | $10,500 | −$150 |
| M3 | Compress · system, ratio 0.93 | $10,150 | −$350 |
| | Tessera-optimized total · measured savings | | −$14,600 |
| | Tessera subscription · Growth tier (flat) | | +$999 |
| | Customer net pay · keeps 100% of savings | $10,399 | $13,601 |
$13,601/mo saved · quality canary mean-score 0.96 · floor 0.95 held all 30 days
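
The arithmetic behind the breakdown can be re-derived in a few lines. The sketch below is illustrative only: the figures are copied from the table, while the names (`ROWS`, `running_costs`, and so on) are assumptions and not part of any Tessera API. The seven itemized rows sum to $13,850, whereas the total row states $14,600; the page does not itemize the difference, so the net figures here start from the stated total.

```python
# Illustrative re-derivation of the breakdown; names are hypothetical,
# figures are copied from the table.
BASELINE = 24_000
ROWS = [  # (mechanic, USD saved per month), in the order applied
    ("M2", 4_080), ("M1", 6_000), ("M6", 2_400), ("M7", 520),
    ("M9", 350), ("M10", 150), ("M3", 350),
]

def running_costs(baseline, rows):
    """Return the monthly cost remaining after each mechanic, applied in order."""
    costs, cost = [], baseline
    for _, saved in rows:
        cost -= saved
        costs.append(cost)
    return costs

itemized_costs = running_costs(BASELINE, ROWS)    # reproduces the Cost / mo column
itemized_saved = sum(saved for _, saved in ROWS)  # sum of the itemized Saved column

STATED_SAVINGS = 14_600  # total row, taken as given
SUBSCRIPTION = 999       # Growth tier, flat

optimized_bill = BASELINE - STATED_SAVINGS   # bill after optimization
net_pay = optimized_bill + SUBSCRIPTION      # what the customer pays in total
net_saved = BASELINE - net_pay               # headline monthly saving

print(itemized_costs, itemized_saved, optimized_bill, net_pay, net_saved)
```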

III · Mechanics

Ten moves on every request, plus failover.

Each mechanic is opt-in per workload, eval-gated, observable per request, and bypassed when its canary drops below the per-stack 0.95 quality floor. Below are the four highest-impact mechanics, each shown alongside a sample audit ledger row (the same artifact format you see on /portal/audit for each real request), followed by a condensed strip for the remaining mechanics plus M11 failover.
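
A quality gate that "fails closed" can be pictured as follows. This is a minimal sketch under stated assumptions, not Tessera's implementation: the function names are hypothetical, and only the 0.95 floor and the 5% canary sampling rate come from this page.

```python
import random

QUALITY_FLOOR = 0.95  # per-stack floor quoted on this page
CANARY_RATE = 0.05    # 5% of routed requests, per the request-path figure

def should_sample(rng=random.random):
    """Select roughly CANARY_RATE of routed requests for canary comparison."""
    return rng() < CANARY_RATE

def mechanic_enabled(canary_score):
    """Fail closed: a missing or below-floor score bypasses the mechanic."""
    if canary_score is None:  # no measurement counts as uncertainty
        return False
    return canary_score >= QUALITY_FLOOR
```

The point of the design is the default: with no score at all, the request goes through unoptimized instead of optimistically routed.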

M1

Auto-route to a cheaper-equivalent model

A static ROUTE_TABLE per provider (gpt-5 → gpt-5-mini, opus → sonnet → haiku), gated by a daily promptfoo canary on your eval set. Chained walks compose: the quality-preservation scores of the individual hops multiply into a cumulative product.
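
One way to read the chained-walk rule is sketched below. Everything here is assumed for illustration: the per-hop scores are invented, the table contents mirror only the examples named above, and stopping the walk when the cumulative product would fall below the 0.95 floor is a guess at how the gate applies, not confirmed behavior.

```python
# Hypothetical route table and per-hop canary scores; the real ROUTE_TABLE
# and its scores are not shown on this page.
ROUTE_TABLE = {"opus": "sonnet", "sonnet": "haiku", "gpt-5": "gpt-5-mini"}
HOP_QUALITY = {
    ("opus", "sonnet"): 0.98,
    ("sonnet", "haiku"): 0.96,
    ("gpt-5", "gpt-5-mini"): 0.97,
}
QUALITY_FLOOR = 0.95

def walk_route(model):
    """Follow the table while the cumulative quality product stays at or above the floor."""
    product = 1.0
    while model in ROUTE_TABLE:
        nxt = ROUTE_TABLE[model]
        candidate = product * HOP_QUALITY[(model, nxt)]
        if candidate < QUALITY_FLOOR:  # the next hop would breach the floor
            break
        model, product = nxt, candidate
    return model, product
```

With these made-up scores, `walk_route("opus")` stops at sonnet: the second hop would bring the product to 0.98 × 0.96 ≈ 0.94, which is under the floor.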

M2

Exact-match cache at the edge

A sha256-keyed cache on the canonical request body (model, messages, temperature, tools, response_format). Default TTL 7 days. Cache hits return before any content-mutating mechanic runs, with no provider call.
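
An exact-match key of this kind can be sketched with the standard library. The canonicalization below (sorted keys, compact separators, fields outside the listed five ignored) is an assumption; the page does not specify how the request body is canonicalized, and `cache_key` is a hypothetical name.

```python
import hashlib
import json

# The five fields named above; any other request fields are ignored in this sketch.
KEY_FIELDS = ("model", "messages", "temperature", "tools", "response_format")

def cache_key(request: dict) -> str:
    """sha256 over a canonical JSON encoding of the fields that define the request."""
    canonical = {name: request.get(name) for name in KEY_FIELDS}
    encoded = json.dumps(
        canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

Sorting keys is what makes two semantically identical requests hash to the same value regardless of how the client ordered its JSON.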

M3

Compress where heuristics confirm safe

Per-role heuristic whitespace + structural compression. Preserves code fences and JSON shapes. System and user turns toggle independently. Roadmap: server-side LLMLingua-2 template substitution per workload.
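
A whitespace-only pass that leaves fenced code untouched might look like the sketch below. It is a simplified stand-in, not the shipped heuristic: JSON-shape preservation and per-role toggles are out of scope here, and `compress_text` is a hypothetical name.

```python
import re

TICKS = "`" * 3  # fence marker, built this way to keep the example readable
FENCE = re.compile("(" + TICKS + ".*?" + TICKS + ")", re.DOTALL)

def compress_text(text: str) -> str:
    """Collapse redundant whitespace outside fenced code blocks."""
    out = []
    for i, part in enumerate(FENCE.split(text)):
        if i % 2 == 1:  # odd indices are the captured fences: keep verbatim
            out.append(part)
            continue
        part = re.sub(r"[ \t]+", " ", part)     # runs of spaces/tabs become one space
        part = re.sub(r" ?\n ?", "\n", part)    # drop spaces hugging a newline
        part = re.sub(r"\n{3,}", "\n\n", part)  # allow at most one blank line
        out.append(part)
    return "".join(out)
```

Because the capture group makes `split` return the fences at odd indices, the code inside them, including its indentation and blank lines, passes through byte for byte.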

M10

Batch arbitrage on async-tolerant calls

OpenAI Batch + Anthropic Message Batches both ship 50% off. Per-workload opt-in. The proxy queues async-tolerant requests, dispatches on the batch window, returns when ready. Per-stack canary still fires.
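
The queue-then-dispatch flow can be sketched as a small in-memory window. This is an assumption-laden illustration: `BatchQueue`, its fields, and the fixed time window are hypothetical, and the actual hand-off to a provider batch endpoint is left out. Only the 50% discount comes from the text above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

BATCH_DISCOUNT = 0.5  # 50% off, as quoted above

@dataclass
class BatchQueue:
    """Hold async-tolerant requests until the batch window has elapsed."""
    window_seconds: float
    pending: list = field(default_factory=list)
    opened_at: Optional[float] = None

    def submit(self, request: dict, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        if not self.pending:  # first request opens the window
            self.opened_at = now
        self.pending.append(request)

    def flush_if_due(self, now: Optional[float] = None) -> list:
        """Return the queued requests once the window has elapsed, else an empty list."""
        now = time.monotonic() if now is None else now
        if self.pending and now - self.opened_at >= self.window_seconds:
            batch, self.pending, self.opened_at = self.pending, [], None
            return batch
        return []

def batched_cost(list_price_usd: float) -> float:
    """Price of a call at the batch discount."""
    return list_price_usd * (1 - BATCH_DISCOUNT)
```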

IV · Integrations

Drop into whatever you already use.

Seven first-class adapters for the agent / RAG / chat frameworks our pilots actually run on. One line of config; same proxy and mechanic stack across all of them. No re-architecture required.


Open source · Apache-2.0

tessera-llm/tessera-sdk · tessera-langchain · tessera-vercel-ai · tessera-llamaindex

GitHub org ↗

V · Economics

One flat price. Keep every dollar we save you.

Flat monthly pricing by token volume — no per-token markup, no cut of your savings. The Free Sandbox covers most personal and prototype workloads entirely. Paid tiers start at $199/month. We measure and prove every dollar of savings (audit-immutable); you keep 100%.

Free Sandbox

$0 / forever

60M tokens per month. Full mechanic stack active — same feature set as every paid tier; only the token cap differs. No credit card. No expiration. The same proxy infrastructure as the paid tiers.

  • 30 requests / minute
  • All providers (OpenAI, Anthropic, Mistral, Groq, Cohere, Google)
  • /portal/audit ledger access
  • Hard quota cap. No surprise overage bill
Get a free key →

Paid tiers

Flat / by token volume

One predictable monthly price by gross tokens submitted — no per-token markup, no cut of your savings. You keep 100% of what we save you.

Every paid tier: full mechanic stack (route + cache + compress + batch), per-stack quality SLA (0.95 floor + auto-rollback), audit-immutable measurement.

Audit immutability · sample row from /portal/audit

request_id: req_01HZWAYK7XYZ

requested_model: gpt-4o

actual_model: gpt-4o-mini (M1 routed, canary 0.96)

original_cost_usd: 0.01250 (snapshot_id sn_2026_05_21_a7)

actual_cost_usd: 0.00075 (snapshot_id sn_2026_05_21_a7)

savings_usd: 0.01175

you_keep_usd: 0.01175 (100% — flat plan, no savings cut)

mechanics_stack: [m1, m6]

Both original_cost_usd and actual_cost_usd are priced against the same pricing_catalog snapshot, captured at request time. Mid-contract provider price changes don't retroactively alter past savings. Your CFO can re-derive the bill from raw inputs.
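
Re-deriving a row is a matter of checking three things, sketched here with the values from the sample row. The dictionary layout and the two `*_snapshot_id` field names are assumptions made for illustration; the sample row shows the snapshot id inline next to each cost, and `verify_row` is a hypothetical helper.

```python
from decimal import Decimal

# Values copied from the sample row; field names for the snapshot ids are hypothetical.
row = {
    "original_cost_usd": Decimal("0.01250"),
    "actual_cost_usd": Decimal("0.00075"),
    "savings_usd": Decimal("0.01175"),
    "you_keep_usd": Decimal("0.01175"),
    "original_snapshot_id": "sn_2026_05_21_a7",
    "actual_snapshot_id": "sn_2026_05_21_a7",
}

def verify_row(row: dict) -> bool:
    """Check that a ledger row is internally consistent."""
    same_snapshot = row["original_snapshot_id"] == row["actual_snapshot_id"]
    savings_ok = row["original_cost_usd"] - row["actual_cost_usd"] == row["savings_usd"]
    keep_ok = row["you_keep_usd"] == row["savings_usd"]  # flat plan: no savings cut
    return same_snapshot and savings_ok and keep_ok
```

Decimal is used instead of float so that the subtraction compares exactly against the logged figure.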

VI · Trust

Operating posture.

Entity · Fintechagency OÜ d.b.a. Tessera. Estonia, registry code 16638667, Tallinn. Authoritative registry: ariregister.rik.ee.

Data handling · Default: token counts, cost deltas, mechanics_stack, and provider response status are logged. Prompt and completion content are not persisted by default. Opt-in M5 semantic cache stores completion text in tenant-isolated edge KV. Full posture →

Encryption · Customer Authorization keys are encrypted at rest via Supabase Vault (XChaCha20-Poly1305-IETF). RLS isolation per tenant.

SOC 2 Type 1 · Targeted Q3 2026. Audit trail with cryptographic provenance via pricing_catalog snapshot ids.

Vendor neutrality · Tessera accepts no referral fees, kickbacks, or sponsorship from any AI provider, gateway vendor, or observability platform (Terms §10).

Client pause control · Always-available kill-switch in /portal/billing. Reversible at any time, no notice required. Paused traffic forwards as pure passthrough; no optimization runs.

60 million tokens per month, on the house.

Get a free key →