What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; before forwarding each request to the provider, Tessera applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval; auto-cache identical requests at the edge; auto-compress prompts via a deterministic whitespace and structural pass (LLMLingua-2 template substitution is on the roadmap); and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
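
As a rough illustration of the aggregation described above, the following TypeScript sketch sums counterfactual minus actual cost over in-scope request logs. The record shape and field names (RequestLog, counterfactualUsd, actualUsd, inScope) are assumptions made for this example, not Tessera's actual log schema.

```typescript
// Hypothetical log-record shape; field names are assumptions for illustration.
interface RequestLog {
  workloadId: string;
  inScope: boolean;          // whether the workload counts toward reported savings
  counterfactualUsd: number; // what the request would have cost without optimization
  actualUsd: number;         // what was actually incurred
}

// Sum (counterfactual - actual) over in-scope requests for a period.
function aggregateSavings(logs: RequestLog[]): number {
  let total = 0;
  for (const log of logs) {
    if (log.inScope) {
      total += log.counterfactualUsd - log.actualUsd;
    }
  }
  return total;
}
```

Requests from out-of-scope workloads contribute nothing, which mirrors the statement that only optimizations attributable to Tessera count toward reported savings.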

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer: we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not per-seat SaaS, and savings are measured directly from our own proxy logs. Your existing observability tool can stay; it still receives downstream telemetry. Tessera is a layer above whatever vendors you already use, not a replacement for them.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, with the same feature set on every tier; only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings: savings are measured and you keep 100% of them. No floor, no retainer, no contract review for activation. Quality preservation is guaranteed at a 0.95 threshold, monitored by canary; a breach lasting three days triggers auto-disable plus a service credit.
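
To make the tier boundaries concrete, here is a small TypeScript sketch that maps a monthly gross token volume to the tier quoted above. The function name tierForVolume and the choice to treat each cap as inclusive are assumptions for illustration; Enterprise pricing is custom, so it is represented as null.

```typescript
// Tier caps and prices as quoted in the pricing answer above.
interface Tier {
  name: string;
  monthlyUsd: number | null; // null means custom pricing
  tokenCap: number;          // gross tokens per month
}

const TIERS: Tier[] = [
  { name: "Free Sandbox", monthlyUsd: 0, tokenCap: 60_000_000 },
  { name: "Starter", monthlyUsd: 199, tokenCap: 1_000_000_000 },
  { name: "Growth", monthlyUsd: 999, tokenCap: 5_000_000_000 },
  { name: "Scale", monthlyUsd: 3999, tokenCap: 20_000_000_000 },
];

// Returns the smallest tier whose cap covers the given volume.
// Treating each cap as inclusive is an assumption of this sketch.
function tierForVolume(grossTokens: number): Tier {
  for (const tier of TIERS) {
    if (grossTokens <= tier.tokenCap) {
      return tier;
    }
  }
  return { name: "Enterprise", monthlyUsd: null, tokenCap: Number.POSITIVE_INFINITY };
}
```

For example, a workload submitting 3B tokens per month would land on Growth at $999/month under these assumptions.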

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).
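
The base-URL swap can be pictured with the following TypeScript sketch. It mirrors the baseURL and defaultHeaders options that the OpenAI and Anthropic TypeScript SDK clients accept, but the proxy URL and the two header names are deliberately left as inputs: this document does not state them, so any concrete values you see in a usage example are placeholders, not Tessera's real ones.

```typescript
// Minimal stand-in for the options a provider SDK client accepts.
interface ClientOptions {
  baseURL: string;
  defaultHeaders: Record<string, string>;
}

// Hypothetical container for the proxy settings. The real base URL and
// header names are NOT known to this sketch; they are supplied by the caller.
interface ProxySettings {
  baseURL: string;
  headers: Record<string, string>;
}

// Returns new options pointed at the proxy. The input object is left
// untouched, so reverting means simply using the original options again.
function withProxy(direct: ClientOptions, proxy: ProxySettings): ClientOptions {
  return {
    baseURL: proxy.baseURL,
    defaultHeaders: { ...direct.defaultHeaders, ...proxy.headers },
  };
}
```

Because the change lives entirely in client configuration, no application logic has to be edited to enable or disable the proxy.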

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.
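
The pause behaviour described above amounts to a simple rule, sketched here in TypeScript. The PauseState shape and the activeMechanics function are hypothetical names for illustration; the actual control lives in the operator dashboard.

```typescript
type Mechanic = "route" | "cache" | "compress" | "batch";

// Hypothetical pause state: one account-wide switch plus per-workload switches.
interface PauseState {
  accountPaused: boolean;
  pausedWorkloads: string[];
}

// When either switch is engaged, no optimizations run and the request
// is forwarded as pure passthrough (an empty mechanic list).
function activeMechanics(state: PauseState, workloadId: string, enabled: Mechanic[]): Mechanic[] {
  const paused = state.accountPaused || state.pausedWorkloads.indexOf(workloadId) !== -1;
  return paused ? [] : enabled.slice();
}
```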

Who is Tessera for?

Two primary customer profiles account for most paid-tier signups: (1) the CTO of a Series A-B AI-native SaaS company spending $20k-$200k per month on LLM APIs, with gross margin under pressure, who can change a base URL in 30 minutes and signs up within 72 hours; (2) a Series B-D scale-up adding AI features and spending $50k-$500k per month, where an AI Platform Lead owns the budget and a light security review takes 2-4 weeks. There is also a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers and side projects. Workloads tagged as regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

Mechanic · M6 · Non-mutating (semantically)

Prompt cache

Anthropic · OpenAI · Google · per-provider integration

Every major provider ships a prompt-caching path that lets the upstream skip re-tokenising and re-computing attention over a long shared prefix on every subsequent request. The hard part is wiring into each provider's specific surface: Anthropic wants a cache_control marker injected into the body; OpenAI applies caching automatically for prefixes ≥ 1024 tokens, but Tessera needs to emit a signal to attribute cached-input billing; and Google requires creating a cachedContents/{name} resource ahead of time and swapping the inline prefix for a reference on the hot path. M6 unifies these three behind a single sponsor opt-in.

M6 is classified as non-mutating in the composition cap because it does not change the prompt the model sees. At most it adds a cache marker (Anthropic) or swaps an inline prefix for a server-side reference to the identical content (Google). The model's next-token decision is bit-identical to the uncached path. M6 composes freely with M1, M3, M7, M8, and M9.

Anthropic. cache_control marker injection

On Anthropic /v1/messages requests, the applier (applyAnthropicPromptCache) walks the request body to the LAST system block and injects cache_control: { type: 'ephemeral' } on it. Anthropic's server then caches the system content for ~5 minutes. Subsequent requests with the identical system prefix hit the cache; Anthropic bills the cached portion at 10% of the base input rate and the non-cached portion at the full rate. The request that first writes the cache is billed at a 25% premium over the base input rate for the cached portion.

The applier handles both shapes the SDKs use: a single string system field, or an array of typed content blocks. When the body has no system field at all, M6 is a no-op for that request, because there is nothing stable to cache against.
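
The marker injection can be sketched in TypeScript as follows. This is an illustrative reconstruction, not Tessera's actual applyAnthropicPromptCache: the return shape ({ body, applied }) and the decision to convert a string system field into a single text block are assumptions. The cache_control: { type: "ephemeral" } marker and the array-of-text-blocks system shape are part of Anthropic's Messages API.

```typescript
type CacheControl = { type: "ephemeral" };

interface TextBlock {
  type: "text";
  text: string;
  cache_control?: CacheControl;
}

interface AnthropicBody {
  system?: string | TextBlock[];
  [key: string]: unknown; // model, messages, max_tokens, ...
}

// Returns a copy of the body with the cache marker on the LAST system block.
// The input body is never mutated.
function applyAnthropicPromptCache(body: AnthropicBody): { body: AnthropicBody; applied: boolean } {
  const system = body.system;
  const marker: CacheControl = { type: "ephemeral" };

  // No system field (or an empty one): nothing stable to cache against.
  if (system === undefined || system.length === 0) {
    return { body, applied: false };
  }

  // String shape: wrap it in a single text block that carries the marker.
  if (typeof system === "string") {
    const wrapped: TextBlock[] = [{ type: "text", text: system, cache_control: marker }];
    return { body: { ...body, system: wrapped }, applied: true };
  }

  // Array shape: copy the blocks and mark only the last one.
  const last = system.length - 1;
  const blocks: TextBlock[] = system.map((block, i) =>
    i === last ? { ...block, cache_control: marker } : block
  );
  return { body: { ...body, system: blocks }, applied: true };
}
```

Returning a copy rather than mutating in place keeps the original request available for passthrough if the mechanic is paused or skipped.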

Response header on Anthropic M6 hits: x-tessera-prompt-cache: applied-anthropic.

OpenAI. Auto-cache attribution signal

Since October 2024 OpenAI applies prompt caching automatically for any prefix ≥ 1024 tokens; no body change is required. So why does Tessera surface M6 on OpenAI workloads at all?

Because the canary's stack-aware grouping needs M6 in the mechanics_stack for OpenAI requests that hit cache. Without M6 in the stack, OpenAI workloads with cache-hit traffic get scored against the _none (pristine passthrough) bucket or a partial stack, which hides the actual prompt-cache contribution from breach detection. The worker emits the m6 tag for opted-in OpenAI workloads as the canary signal source, even though there's no body mutation.

Response header on OpenAI M6 attribution: x-tessera-prompt-cache: auto-openai. The auto- prefix distinguishes this from applied- markers where Tessera mutated the body.
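
A client or ops script could interpret the x-tessera-prompt-cache header along these lines. The four accepted values are the ones this page lists under Verification surfaces; the parsePromptCacheHeader function and its return shape are hypothetical, written only to show how the applied- / auto- / deferred- prefixes map to whether the body was changed.

```typescript
type Provider = "anthropic" | "openai" | "google";

interface PromptCacheHeader {
  mode: "applied" | "auto" | "deferred";
  provider: Provider;
  bodyMutated: boolean; // true only when Tessera changed the request body
}

// Maps a header value to its meaning; unknown values yield null.
function parsePromptCacheHeader(value: string): PromptCacheHeader | null {
  switch (value) {
    case "applied-anthropic":
      return { mode: "applied", provider: "anthropic", bodyMutated: true };
    case "auto-openai":
      return { mode: "auto", provider: "openai", bodyMutated: false };
    case "applied-google":
      return { mode: "applied", provider: "google", bodyMutated: true };
    case "deferred-google":
      return { mode: "deferred", provider: "google", bodyMutated: false };
    default:
      return null;
  }
}
```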

Google. cachedContents lifecycle reference

Google Gemini's caching path is explicit: the customer (or Tessera) first creates a cachedContents/{name} resource via the Gemini API with the content to cache and a TTL. Subsequent requests pass cached_content: 'cachedContents/{name}' instead of including the content inline. Google bills the cached portion at 25% of the input rate while the cache is live.

Tessera splits this into two phases:

Lifecycle (off the hot path). A dashboard cron creates or refreshes the cachedContents resource for opted-in workloads and writes the reference into the GOOGLE_CACHE_REGISTRY KV namespace, keyed by workload id.
Worker hot path. On a Google opted-in request, the worker looks up the registry, swaps the inline prefix for the cached_content reference, and emits applied-google. If there is no registry entry, or the entry has expired, the worker emits deferred-google instead, so ops can see that the opt-in is honoured at the intent level even when the cron hasn't populated the registry yet.

The Gemini API key for the lifecycle calls is stored on workloads.google_api_key per workload. The worker doesn't read this key directly today: the cron writes the registry entry, and the worker only references existing entries. Worker-side access to the key is reserved for a future inline-create fallback if the cron lags behind demand.
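
The worker hot path for Google can be sketched as below. Everything here is an illustrative assumption rather than Tessera's code: the RegistryEntry shape, the expiry check, the function name applyGooglePromptCache, and in particular the choice of systemInstruction as "the inline prefix" that gets swapped out. The cached_content field name follows the wording used on this page.

```typescript
// Hypothetical registry entry, as the cron might write it per workload.
interface RegistryEntry {
  name: string;        // a cachedContents/{name} reference
  expiresAtMs: number; // epoch millis after which the entry is treated as expired
}

interface GeminiBody {
  systemInstruction?: unknown; // assumed to be the inline prefix in this sketch
  cached_content?: string;
  contents: unknown[];
  [key: string]: unknown;
}

type GoogleCacheStatus = "applied-google" | "deferred-google";

// Swaps the inline prefix for the cached reference when a live entry exists;
// otherwise forwards the body unchanged and reports the deferral.
function applyGooglePromptCache(
  body: GeminiBody,
  entry: RegistryEntry | undefined,
  nowMs: number
): { body: GeminiBody; status: GoogleCacheStatus } {
  if (entry === undefined || entry.expiresAtMs <= nowMs) {
    return { body, status: "deferred-google" };
  }
  const swapped: GeminiBody = { ...body, cached_content: entry.name };
  delete swapped.systemInstruction; // the content now lives in the cached resource
  return { body: swapped, status: "applied-google" };
}
```

Note that the deferred branch makes no network call, which is consistent with keeping resource creation off the hot path.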

Tier gating

M6 opt-ins are governed by getTierFeatures(customerTier).promptCacheAvailable. After the 2026-05-31 free=paid pivot this flag is on for every active tier, including Free Sandbox, and the per-workload toggle is honoured the same way everywhere. Only the quantitative token cap differentiates Free from paid. M5 / M6 / M7 / M8 share the same gate.

The dashboard kv-sync effective-enabled flag is workload_flag AND tier_allows. This gives a single source of truth, with no drift between dashboard claim and worker behaviour.
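
The gate reduces to a single boolean conjunction, shown here as a TypeScript sketch. The getTierFeatures body below is a stand-in written for this example (the real function is not shown on this page), the tier names are hypothetical, and null is used here to represent an account with no active tier.

```typescript
// Hypothetical tier identifiers; null stands for "no active tier".
type CustomerTier = "free" | "starter" | "growth" | "scale" | "enterprise";

interface TierFeatures {
  promptCacheAvailable: boolean;
}

// Stand-in for getTierFeatures: the flag is on for every active tier.
function getTierFeatures(tier: CustomerTier | null): TierFeatures {
  return { promptCacheAvailable: tier !== null };
}

// effective-enabled = workload_flag AND tier_allows
function effectiveEnabled(workloadFlag: boolean, tier: CustomerTier | null): boolean {
  return workloadFlag && getTierFeatures(tier).promptCacheAvailable;
}
```

Computing the flag once at sync time means the dashboard and the worker read the same value instead of each re-deriving it.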

What we don’t do (yet)

No xAI prompt-cache integration. xAI hasn't published a cache surface for grok-* as of ship date. M6 falls through to no-op on xAI requests.
No multi-marker Anthropic strategies. v0.1 injects on the LAST system block only. Anthropic's API supports up to 4 cache breakpoints with different TTLs. The multi-breakpoint optimisation lands in a follow-up once we have customer traffic shapes that benefit from it (mid-conversation cache anchors).
No Google inline-create fallback. When the registry entry is missing or expired, the worker emits deferred-google and forwards the inline body unchanged. We do not create the cachedContents resource on the hot path. That would add a second round-trip per cache-miss request, and the worker latency budget doesn't tolerate it. Cron repopulates within ~5 minutes.
No Tessera-side proxy cache for prompt-cache hits. That's M2 (exact cache) and M5 (semantic cache). M6 is specifically about getting the customer's prefix into the provider's OWN cache. We don't double-cache.

Verification surfaces

Response header x-tessera-prompt-cache with one of: applied-anthropic / auto-openai / applied-google / deferred-google.
/portal/audit chip strip. The m6 chip appears in the canonical mechanics_stack, in the non-mutating chip colour (subtle, not the warning-amber of M3 / M7 / M8).
Provider-side cache metrics. Anthropic responses carry usage.cache_read_input_tokens and usage.cache_creation_input_tokens; OpenAI Chat Completions responses carry usage.prompt_tokens_details.cached_tokens. These flow into optimize_savings.tokens_in (effective input) so the savings math accounts for the cache discount automatically.
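
One plausible way to turn provider cache metrics into an "effective input" figure is sketched below in TypeScript. How Tessera actually computes optimize_savings.tokens_in is not specified on this page, so the formula and the names CACHED_READ_PERCENT and effectiveInputTokens are assumptions. The 10% (Anthropic) and 25% (Google) cached-read rates are the ones quoted earlier on this page; cache-write premiums and storage fees are left out of the sketch.

```typescript
// Cached-read price as a percentage of the base input rate, as quoted on this page.
const CACHED_READ_PERCENT = { anthropic: 10, google: 25 };

// Effective input tokens: uncached tokens count in full, cached reads count
// at the discounted percentage. Cache-write tokens are ignored in this sketch.
function effectiveInputTokens(uncached: number, cachedRead: number, readPercent: number): number {
  return uncached + (cachedRead * readPercent) / 100;
}
```

For instance, an Anthropic response reporting 1,000 uncached input tokens and 9,000 cache-read tokens would count as 1,900 effective input tokens under this formula, versus 10,000 if nothing had been cached.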

What we promise

M6 marker injection (Anthropic, Google) is structural. It does not change the semantic content the model sees on cache MISS, and on cache HIT the model sees the exact content it would have processed inline. OpenAI's auto-cache path doesn't mutate the body at all; M6 is just the attribution signal so the canary breakdown stays honest.

When provider documentation contradicts our integration (e.g. Anthropic changes the cache_control syntax, OpenAI tightens the 1024-token threshold, Google deprecates cachedContents), the worker version that ships the fix carries a CHANGELOG entry + deprecation window for any sponsor still on a pre-update KV record.

Parent index: How it works. Cache mechanic family: M2 exact cache · M5 semantic cache (deep-dives queued). Adjacent: M1 auto-route · M3 compress.