What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace and structural pass (LLMLingua-2 template substitution is on the roadmap), and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
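
The aggregation described above can be sketched in a few lines. This is a minimal illustration, not Tessera's implementation: the record shape and field names (workloadId, counterfactualCostUsd, actualCostUsd) are assumptions made for the example.

```typescript
// Hypothetical log-record shape; field names are illustrative, not Tessera's schema.
interface ProxyLogRecord {
  workloadId: string;
  counterfactualCostUsd: number; // what the request would have cost unoptimized
  actualCostUsd: number; // what was actually incurred
}

// Sum (counterfactual - actual) over records that belong to in-scope workloads.
function aggregateSavingsUsd(
  records: ProxyLogRecord[],
  inScopeWorkloads: Set<string>,
): number {
  let total = 0;
  for (const record of records) {
    if (!inScopeWorkloads.has(record.workloadId)) continue; // out of scope: ignored
    total += record.counterfactualCostUsd - record.actualCostUsd;
  }
  return total;
}

const sampleRecords: ProxyLogRecord[] = [
  { workloadId: "support-bot", counterfactualCostUsd: 0.5, actualCostUsd: 0.25 },
  { workloadId: "support-bot", counterfactualCostUsd: 0.25, actualCostUsd: 0 },
  { workloadId: "out-of-scope", counterfactualCostUsd: 1, actualCostUsd: 0.5 },
];
const sampleSavings = aggregateSavingsUsd(sampleRecords, new Set(["support-bot"]));
```

Because each record carries its own counterfactual, a provider price change affects both sides of the subtraction equally and does not show up as a saving.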

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer: we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not per-seat SaaS, and savings are measured directly from our own proxy logs. Your existing observability tool can stay; it still receives downstream telemetry. Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.
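
To see which tier a given monthly volume lands in, the table above can be written out as data. The tier names, prices, and caps come from the paragraph above; the data structure and the function name smallestTierFor are illustrative assumptions, and the caps are treated as inclusive ("up to").

```typescript
// Tier table transcribed from the pricing paragraph; the structure is illustrative.
interface Tier {
  name: string;
  monthlyUsd: number | null; // null means custom pricing
  tokenCap: number; // inclusive monthly cap on gross tokens submitted
}

const TIERS: Tier[] = [
  { name: "Free Sandbox", monthlyUsd: 0, tokenCap: 60e6 },
  { name: "Starter", monthlyUsd: 199, tokenCap: 1e9 },
  { name: "Growth", monthlyUsd: 999, tokenCap: 5e9 },
  { name: "Scale", monthlyUsd: 3999, tokenCap: 20e9 },
];

const ENTERPRISE: Tier = { name: "Enterprise", monthlyUsd: null, tokenCap: Infinity };

// Return the smallest tier whose cap covers the given monthly volume.
function smallestTierFor(grossTokensPerMonth: number): Tier {
  return TIERS.find((tier) => grossTokensPerMonth <= tier.tokenCap) ?? ENTERPRISE;
}

const tierForTwoBillion = smallestTierFor(2e9);
```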

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Who is Tessera for?

Two primary shapes carry most paid-tier signups: (1) a Series A-B AI-native SaaS CTO spending $20k-$200k per month on LLM APIs, with gross margin under pressure, who can change a base URL in 30 minutes and signs up within 72 hours; (2) a Series B-D scale-up adding AI features, spending $50k-$500k per month, where an AI Platform Lead owns the budget and a light security review takes 2-4 weeks. There is also a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

Mechanic · M5 · Non-mutating · Short-circuit

Semantic cache

Cosine threshold 0.95 default · per-workload override

Cosine-similar requests are served from cache. This catches paraphrased queries that exact-match (M2) misses: "summarise this" and "give me a summary of this" embed to nearly the same vector, so they return the same cached answer when the workload accepts the approximation. Default threshold 0.95. Per-workload override on workloads.semantic_cache_threshold (clamped [0.85, 0.99]).
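
The hit decision can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and treating a similarity exactly equal to the threshold as a hit (>=) is a guess, since the text does not say whether the comparison is inclusive.

```typescript
// Clamp a value into [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Standard cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / Math.sqrt(normA * normB);
}

// Decide whether a cached embedding is close enough to the query embedding.
// The override stands in for workloads.semantic_cache_threshold; default 0.95,
// clamped to [0.85, 0.99] as described above. Inclusive comparison is an assumption.
function isSemanticHit(
  query: number[],
  cached: number[],
  workloadThreshold?: number,
): boolean {
  const threshold = clamp(workloadThreshold ?? 0.95, 0.85, 0.99);
  return cosineSimilarity(query, cached) >= threshold;
}
```

The clamp means an override of 0.5 behaves like 0.85, so a workload cannot loosen matching below the floor by configuration alone.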

M5 is the cache mechanic with the highest variance in customer utility. Some workloads (FAQ-style support bots, news summarisers) hit M5 80% of the time. Others (open-ended creative generation, unique user-message workloads) hit ~0%. The eligibility heuristic gates M5 firing per workload; the canary tracks whether the cached responses actually pass the customer's eval.

Embedding model + hot-path budget

Embeddings via OpenAI text-embedding-3-small on a dedicated Tessera-owned key (OPENAI_API_KEY_EMBEDDINGS worker secret), NOT the customer's upstream key. The customer may not send an OpenAI key at all (Anthropic-primary workloads); we provide the embedding capability.

Hot-path latency: ~80 ms p50 for embed + lookup. M5 only fires after M2 miss, so worst case the worker pays exact-match KV + embed + semantic KV before upstream. The trade-off: when M5 hits, the win is the full inference round-trip (saves seconds); when it misses, we've added 80 ms. Opt-in per workload.
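
The trade-off above reduces to simple arithmetic. The sketch below is a back-of-envelope model, not a measurement: it treats the ~80 ms p50 as a fixed per-request overhead and assumes a hit saves the entire upstream round-trip. The function names are hypothetical.

```typescript
const M5_OVERHEAD_MS = 80; // p50 embed + lookup figure quoted above

// Expected change in latency per request with M5 enabled.
// Every request pays the overhead; a hit (probability hitRate) skips upstream.
// Negative result = M5 is a net latency win.
function expectedLatencyDeltaMs(
  hitRate: number,
  upstreamMs: number,
  overheadMs: number = M5_OVERHEAD_MS,
): number {
  return overheadMs - hitRate * upstreamMs;
}

// Hit rate at which the overhead and the saved round-trips cancel out.
function breakEvenHitRate(
  upstreamMs: number,
  overheadMs: number = M5_OVERHEAD_MS,
): number {
  return overheadMs / upstreamMs;
}

// Example: with a 2-second upstream call, M5 pays for itself in latency terms
// once more than 4% of requests hit.
const breakEvenAtTwoSeconds = breakEvenHitRate(2000);
```

Under this model a workload with near-zero hit rate only ever pays the overhead, which is consistent with making M5 opt-in per workload.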

Vector store + entry shape

Per-workload partition in the SEMANTIC_CACHE KV namespace, keyed by semantic:{client_id}:{workload_id}:{n}. Each entry carries:

embedding - the float32 vector (text-embedding-3-small produces 1536-dim, we store verbatim).
response_body - the cached provider response, byte-for-byte.
model - the actual model served. M1 may have routed this earlier; we remember which alt.
created_at + ttl_hours - default 24-hour TTL.

Per-workload semantic_cache_max_entries (clamped [10, 200], default 50) caps the partition size. Beyond the cap, LRU eviction removes the oldest entry on populate. The worker reads + clamps every value at the call site before handing to the lookup library.
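
The entry shape, key format, and partition cap can be sketched as below. The field names and clamp bounds come from the text above; everything else is an assumption for illustration: created_at as epoch milliseconds, response_body as a string, the helper names, and modelling "oldest entry" by created_at.

```typescript
// Entry fields as listed above; the concrete types are assumptions.
interface SemanticCacheEntry {
  embedding: number[]; // 1536 values for text-embedding-3-small
  response_body: string; // cached provider response
  model: string; // the model that actually served the response
  created_at: number; // epoch milliseconds (assumed representation)
  ttl_hours: number; // default 24
}

// Key format: semantic:{client_id}:{workload_id}:{n}
function entryKey(clientId: string, workloadId: string, n: number): string {
  return `semantic:${clientId}:${workloadId}:${n}`;
}

// semantic_cache_max_entries: default 50, clamped to [10, 200].
function clampMaxEntries(requested?: number): number {
  return Math.min(200, Math.max(10, Math.floor(requested ?? 50)));
}

function isExpired(entry: SemanticCacheEntry, nowMs: number): boolean {
  return nowMs >= entry.created_at + entry.ttl_hours * 60 * 60 * 1000;
}

// Insert an entry, evicting the oldest (by created_at) when the partition is full.
// The cap is re-clamped here, mirroring "clamp at the call site".
function insertWithCap(
  partition: SemanticCacheEntry[],
  entry: SemanticCacheEntry,
  maxEntries?: number,
): SemanticCacheEntry[] {
  const cap = clampMaxEntries(maxEntries);
  const kept = [...partition].sort((a, b) => a.created_at - b.created_at);
  while (kept.length >= cap) kept.shift(); // drop oldest first
  kept.push(entry);
  return kept;
}
```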

Eligibility + tier gating

M5 fires only when:

Workload opt-in is true (tier feature tier.promptCacheAvailable is on for every active tier including Free Sandbox after the 2026-05-31 free=paid pivot — only the token cap differs).
Provider is OpenAI-shape OR Anthropic-shape (the embed-text extractor handles both; Gemini-shape lands in a follow-up).
Request is non-streaming.
SEMANTIC_CACHE and OPENAI_API_KEY_EMBEDDINGS env are present (graceful degrade to disabled when missing - worker logs a warning, doesn't break the request).

Any failed gate → M5 skipped, M2 lookup still ran in front, M6 / M3 / M7 / M8 still run downstream. The request continues with one less optimisation in the stack.
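
The gate sequence above can be expressed as a single check that returns the first reason for skipping M5. This is a sketch only: the input shape, field names, and reason strings are hypothetical, and the order of checks is an assumption.

```typescript
type ProviderShape = "openai" | "anthropic" | "gemini";

// Hypothetical snapshot of everything the gates look at for one request.
interface M5GateInput {
  workloadOptIn: boolean;
  tierPromptCacheAvailable: boolean; // mirrors tier.promptCacheAvailable
  providerShape: ProviderShape;
  streaming: boolean;
  hasSemanticCacheBinding: boolean; // SEMANTIC_CACHE present
  hasEmbeddingsKey: boolean; // OPENAI_API_KEY_EMBEDDINGS present
}

// Returns null when M5 may fire, otherwise the first failed gate.
// A non-null result means "skip M5 and carry on", never "fail the request".
function m5SkipReason(input: M5GateInput): string | null {
  if (!input.workloadOptIn) return "workload not opted in";
  if (!input.tierPromptCacheAvailable) return "tier feature off";
  if (input.providerShape === "gemini") return "provider shape unsupported";
  if (input.streaming) return "streaming request";
  if (!input.hasSemanticCacheBinding || !input.hasEmbeddingsKey) {
    return "infrastructure missing"; // degrade gracefully; the worker only logs a warning
  }
  return null;
}

const eligibleRequest: M5GateInput = {
  workloadOptIn: true,
  tierPromptCacheAvailable: true,
  providerShape: "anthropic",
  streaming: false,
  hasSemanticCacheBinding: true,
  hasEmbeddingsKey: true,
};
```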

Populate path

On M5 miss + successful upstream response (200) + body ≤ 256 KB: the worker computes a fresh embedding (or reuses the lookup-side vector if available; the same text produces the same embedding, so we deduplicate the OpenAI embed call when possible) and stores an entry in the workload partition via populateEntry. The populate runs in ctx.waitUntil, so the customer-facing response has already been returned by the time we write to KV.

Why the 256 KB cap: a single KV value over that size invokes chunking + indexing overhead that defeats the purpose. M5 is best for small-to-medium responses (FAQ answers, short summaries). Long-form generation (multi-page essays) bypasses populate to avoid bloating the partition.
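
A sketch of the populate decision and the deferred write follows. It assumes 256 KB means 256 × 1024 bytes, uses a minimal stand-in for the worker's execution context (only waitUntil), and takes the populate step as a callback; none of these names are Tessera's actual code apart from the idea of calling ctx.waitUntil.

```typescript
const MAX_POPULATE_BYTES = 256 * 1024; // assumption: KB = 1024 bytes

// Populate only on an M5 miss, a 200 from upstream, and a body within the cap.
function shouldPopulate(m5Hit: boolean, upstreamStatus: number, bodyBytes: number): boolean {
  return !m5Hit && upstreamStatus === 200 && bodyBytes <= MAX_POPULATE_BYTES;
}

// Minimal stand-in for the worker execution context.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

interface UpstreamResult {
  status: number;
  body: string;
  bodyBytes: number;
}

// Return the upstream body immediately; schedule the cache write in the background.
function respondAndMaybePopulate(
  ctx: ExecutionContextLike,
  upstream: UpstreamResult,
  m5Hit: boolean,
  populate: () => Promise<void>, // hypothetical wrapper around the populate step
): string {
  if (shouldPopulate(m5Hit, upstream.status, upstream.bodyBytes)) {
    ctx.waitUntil(populate()); // the write continues after the response is returned
  }
  return upstream.body;
}
```

Keeping the write inside waitUntil means a slow or failed KV write cannot delay or break the response the customer sees.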

What we don’t do (yet)

No Vectorize backing (yet). v0.1 uses KV with in-worker cosine compute (at most 200 entries × 1536 dimensions, small enough to brute-force per request). Cloudflare Vectorize as a proper ANN store lands when partition sizes pass that bound.
No re-rank by recency. v0.1 returns the single highest-similarity match above threshold. We do not blend recency × similarity. Recency-aware re-ranking is on the roadmap.
No Gemini-shape support. contents[].parts[] needs its own embed-text extractor; we will build it once Gemini's share of traffic justifies the work.
No customer-side embedding override. M5 uses text-embedding-3-small. Customers with their own embedding preferences (Cohere, Voyage) cannot swap models today.
No automatic threshold tuning. 0.95 is the published default; customers override per workload. We do not auto-tune based on canary signal.
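
The first two limits in this list, brute-force scoring and single-best selection, can be illustrated together. The sketch assumes similarity scores have already been computed for every entry in the partition; the names bruteForceTerms and pickBest are hypothetical, and the inclusive threshold comparison and first-wins tie-break are assumptions.

```typescript
// Number of multiply-add terms needed to dot one query against a full partition.
function bruteForceTerms(entries: number, dims: number): number {
  return entries * dims;
}

// Worst case at the current cap: 200 entries of 1536 dimensions.
const worstCaseTerms = bruteForceTerms(200, 1536);

interface ScoredEntry {
  id: string;
  similarity: number;
}

// Return the single highest-similarity entry at or above the threshold, or null.
// No recency blending: only the similarity score is considered.
function pickBest(scored: ScoredEntry[], threshold: number): ScoredEntry | null {
  let best: ScoredEntry | null = null;
  for (const candidate of scored) {
    if (candidate.similarity < threshold) continue;
    if (best === null || candidate.similarity > best.similarity) best = candidate;
  }
  return best;
}
```

A few hundred thousand multiply-adds per lookup is small next to a network round-trip, which is why an ANN index is deferred until partitions outgrow the 200-entry cap.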

Verification surfaces

Response header x-tessera-cache: semantic-hit on M5 hits, with x-tessera-cache-similarity: 0.9743 (4-decimal similarity score). Misses surface semantic-miss.
/portal/audit chip strip - m5 chip, semantic_cache_hit = true on the row.
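
On the client side, the two response headers above can be read with a small helper. This is a sketch: the HeadersLike interface is a minimal stand-in that the standard fetch Headers object also satisfies, and the helper name and return shape are assumptions.

```typescript
// Anything with a get(name) method works, including the fetch Headers object.
interface HeadersLike {
  get(name: string): string | null;
}

interface M5Verification {
  cache: string | null; // "semantic-hit", "semantic-miss", or another value
  similarity: number | null; // only meaningful on a semantic hit
}

function readM5Headers(headers: HeadersLike): M5Verification {
  const cache = headers.get("x-tessera-cache");
  const rawSimilarity = headers.get("x-tessera-cache-similarity");
  const parsed = rawSimilarity === null ? NaN : Number.parseFloat(rawSimilarity);
  return {
    cache,
    similarity: Number.isNaN(parsed) ? null : parsed,
  };
}

// Example with an in-memory stand-in for a response's headers.
const exampleHeaderMap = new Map<string, string>([
  ["x-tessera-cache", "semantic-hit"],
  ["x-tessera-cache-similarity", "0.9743"],
]);
const exampleHeaders: HeadersLike = {
  get: (name) => exampleHeaderMap.get(name) ?? null,
};
const exampleVerification = readM5Headers(exampleHeaders);
```

Logging the similarity alongside your own eval result for the same request is one way to judge whether the workload's threshold is set sensibly.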

What we promise

M5 is opt-in per workload, threshold-tunable per workload, tier-gated (the feature is currently on for every active tier, including Free Sandbox), byte-exact on serve (cached response returned verbatim), and graceful-degrade on missing infrastructure. We never silently swap a customer's requested model under M5. The cache always returns the model that was served originally.

Parent index: How it works. Cache family: M2 exact cache · M6 prompt cache.