What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace + structural pass (LLMLingua-2 template substitution on roadmap), auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not a per-seat SaaS — and savings are measured directly from our own proxy logs. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base-URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. Plus a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers + side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route — the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

← How it works

Mechanic · M2 · Non-mutating · Short-circuit

Exact-match cache

Default 7-day TTL · per-workload override

Identical-prompt requests within the cache window return cached responses. The key is SHA-256 of the canonical-normalised request body. Same model + messages + temperature + tools + response_format produces the same key, byte-for-byte. We never call the upstream provider on a cache hit. The measured savings is the full request cost: the upstream cost is zero, the baseline cost is what the request would have cost without the cache.

M2 is the simplest cost mechanic to reason about and the cheapest per-request decision. The cache lookup adds a single Cloudflare KV read (~5 ms p50 from worker) before the upstream forward. On a miss, the latency hit is negligible; on a hit, we save the full inference round-trip. M2 is non-mutating by construction. When it fires, no content reaches the provider.

Canonical normalisation

Bodies that differ only in field order (e.g. {model, messages} vs {messages, model}) MUST produce the same cache key. Otherwise the cache misses on requests the model would treat as identical. The normaliser (normaliseOpenAIBody) sorts top-level keys, drops Tessera-internal tags (tessera_feature_tag / tessera_customer_tag), and keys against the POST-route-decision model name so an M1-routed request caches under the alt model. Second-fire of the same source request hits cache against the same target.

Streaming requests bypass M2. The cache returns a complete JSON response, and we don't replay a synthetic stream from cached tokens. Future v0.2 could rewrap a cached response as a single- chunk SSE message for streaming callers; today the streaming path always forwards upstream.

Short-circuit ordering

M2 fires upstream of every content-mutating mechanic (M3 / M7 / M8) and upstream of M1 swap measurement. The order inside handleProviderRequest:

Auth + rate-limit + tier gating.
M1 auto-route decision (so we cache against the actual upstream model, not the requested one).
M2 cache lookup. On HIT, return immediately with the cached body + emit measureCacheHit event. No mutation, no upstream call.
M5 semantic cache (separate short-circuit, cosine-based) - runs only on M2 miss.
M6 / M3 / M7 / M8 / M9. Content + parameter mutations. Always after M2.

Consequence: M2 hits never appear in a mechanics_stack with M3 / M7 / M8. The hit short-circuits before any mutation site fires. The chip strip on /portal/audit for an M2 row carries m2 alone (potentially with m1 when the cache key was computed against an M1-swapped model).

TTL semantics

Default TTL: 7 days (cache_ttl_seconds = 604800 on the client record). Per-workload override: workloads.cache_ttl_seconds (migration 0072) takes precedence when set. Clamp[60, 30 × 86400] enforced server-side at write time AND at worker read time (defence in depth).

Why per-workload: a workload that asks "show me revenue this month" needs short TTL (data changes daily), while a workload that asks "summarise this PDF" can use a long TTL (immutable input). Sponsor sets the boundary; we honour it.

What we don’t do

No fuzzy match.M2 is byte-exact only. Near-identical prompts that differ in one word miss the cache. That's M5 (semantic cache, deep-dive queued). Separate mechanic, separate KV namespace, separate decision tree.
No streaming cache. Streaming callers always forward upstream. No synthetic SSE replay.
No cross-workload cache pollution.Cache key includes the client_id so two clients hitting the same model with the same prompt never share cache entries. Each tenant's cache is private.
No cache-side mutation. M2 returns the body exactly as the provider returned it on cache populate. We do not strip / normalise / annotate the cached response.
No replay attack on the cache key. A request from a different client with the same canonical body misses (client-scoped key). Same client, same body, same routed model = hit. Anything else = miss.

Verification surfaces

Response header x-tessera-cache: hit on M2 cache hits; x-tessera-cache: miss on misses. The miss header is set on every uncached request so sponsors can correlate cache behaviour with workload patterns.
/portal/audit chip strip. The m2 chip + the cache_hit=true column on the optimize_savings row. actual_cost_usd = 0 on every M2 hit; the savings is original_cost_usd in full.

What we promise

M2 is opt-in per workload (auto_cache_enabled), client- scoped (no cross-tenant pollution), byte-exact (no fuzzy match), short-circuit (never composes with content mutations), and observable per request (cache hit = visible header + audit row).

Parent index: How it works. Cache family deep-dives: M6 prompt cache. M5 semantic cache deep-dive queued.