Mechanic · M6 · Non-mutating (semantically)

Prompt cache

Anthropic · OpenAI · Google · per-provider integration

Every major provider ships a prompt-caching path that lets the upstream skip re-tokenising and re-computing attention over a long shared prefix on every subsequent request. The hard part is wiring into each provider's specific surface — Anthropic wants a cache_control marker injected into the body, OpenAI applies caching automatically for prefixes ≥ 1024 tokens but Tessera needs to emit a signal to attribute cached-input billing, and Google requires creating a cachedContents/{name} resource ahead of time and swapping the inline prefix for a reference on the hot path. M6 unifies these three behind a single sponsor opt-in.

M6 is classified as non-mutating in the composition cap because it does not change the prompt the model sees — at most it adds a cache marker (Anthropic) or swaps an inline prefix for a server-side reference to the identical content (Google). The model's next token decision is bit-identical to the uncached path. M6 composes freely with M1, M3, M7, M8, M9.

Anthropic — cache_control marker injection

On Anthropic /v1/messages requests, the applier (applyAnthropicPromptCache) walks the request body to the LAST system block and injects cache_control: { type: 'ephemeral' } on it. Anthropic's server then caches the prefix up to and including that block for ~5 minutes. Subsequent requests with the identical prefix hit the cache; Anthropic bills the cached portion at 10% of the input rate and the non-cached portion at the full rate.

The applier handles both shapes the SDKs use: a single string system field, or an array of typed content blocks. When the body has no system field at all, M6 is a no-op for that request — there's nothing stable to cache against.
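
A minimal sketch of the two-shape handling described above, not the shipped applier. The type names and the returned { body, applied } shape are assumptions made for illustration; only the function name and the cache_control marker come from the text. A string system field cannot carry a marker, so the sketch promotes it to a one-element block array with the same text.

```typescript
// Hypothetical types: only the fields this sketch touches.
type CacheControl = { type: "ephemeral" };
type SystemBlock = { type: "text"; text: string; cache_control?: CacheControl };
type AnthropicBody = {
  system?: string | SystemBlock[];
  [key: string]: unknown;
};

// Returns a new body with cache_control on the last system block,
// or the original body untouched when there is nothing to mark.
function applyAnthropicPromptCache(
  body: AnthropicBody,
): { body: AnthropicBody; applied: boolean } {
  const marker: CacheControl = { type: "ephemeral" };
  const system = body.system;

  // String shape: promote to a single text block so the marker has somewhere to live.
  if (typeof system === "string") {
    if (system.length === 0) return { body, applied: false };
    return {
      body: { ...body, system: [{ type: "text", text: system, cache_control: marker }] },
      applied: true,
    };
  }

  // Array shape: mark only the last block; earlier blocks are passed through as-is.
  if (Array.isArray(system) && system.length > 0) {
    const last = system.length - 1;
    const marked = system.map((block, i) =>
      i === last ? { ...block, cache_control: marker } : block,
    );
    return { body: { ...body, system: marked }, applied: true };
  }

  // No system field (or an empty one): no-op for this request.
  return { body, applied: false };
}
```

The sketch copies rather than mutates, so the caller can still forward the original body if a later step decides M6 should not apply.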

Response header on Anthropic M6 hits: x-tessera-prompt-cache: applied-anthropic.

OpenAI — auto-cache attribution signal

Since October 2024 OpenAI applies prompt caching automatically for any prefix ≥ 1024 tokens; no body change is required. So why does Tessera surface M6 on OpenAI workloads at all?

Because the canary's stack-aware grouping needs M6 in the mechanics_stack for OpenAI requests that hit cache. Without M6 in the stack, OpenAI workloads with cache-hit traffic get scored against the _none (pristine passthrough) bucket OR a partial stack — hiding the actual prompt-cache contribution from breach detection. The worker emits the m6 tag for opted-in OpenAI workloads as the canary signal source, even though there's no body mutation.

Response header on OpenAI M6 attribution: x-tessera-prompt-cache: auto-openai. The auto- prefix distinguishes this from applied- markers where Tessera mutated the body.
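
To illustrate attribution without mutation, here is a small sketch under stated assumptions: the context type, its field names, and the function name are hypothetical, and the opt-in passed in is assumed to already include the tier gate. Only the m6 tag and the auto-openai header value come from the text.

```typescript
// Hypothetical request context; field names are illustrative only.
type OpenAIContext = {
  m6Enabled: boolean;       // effective opt-in for this workload
  mechanicsStack: string[]; // mechanics already recorded for this request
};

type Attribution = {
  mechanicsStack: string[];
  promptCacheHeader: string | null; // value for x-tessera-prompt-cache, if any
};

// The request body is deliberately not a parameter: nothing here can change it.
function attributeOpenAIPromptCache(ctx: OpenAIContext): Attribution {
  if (!ctx.m6Enabled) {
    return { mechanicsStack: ctx.mechanicsStack, promptCacheHeader: null };
  }
  const stack = ctx.mechanicsStack.includes("m6")
    ? ctx.mechanicsStack
    : [...ctx.mechanicsStack, "m6"];
  return { mechanicsStack: stack, promptCacheHeader: "auto-openai" };
}
```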

Google — cachedContents lifecycle reference

Google Gemini's caching path is explicit: the customer (or Tessera) first creates a cachedContents/{name} resource via the Gemini API, supplying the content to cache and a TTL. Subsequent requests pass cached_content: 'cachedContents/{name}' instead of including the content inline. Google bills the cached portion at 25% of input rate while the cache is live.

Tessera splits this into two phases:

  • Lifecycle (off the hot path). A dashboard cron creates / refreshes the cachedContents resource for opted-in workloads and writes the reference into the GOOGLE_CACHE_REGISTRY KV namespace keyed by workload id.
  • Worker hot path. On a Google opted-in request, the worker looks up the registry, swaps the inline prefix for the cached_content reference, and emits applied-google. If the lookup misses (no registry entry, or the entry has expired), the worker emits deferred-google instead, so ops can see the opt-in is honoured at the intent level even when the cron hasn't populated the registry yet.
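
The hot-path decision can be sketched as follows. This is an illustration under assumptions, not the worker's code: the stored record shape ({ name, expiresAt }), the CacheRegistry interface, and the choice of systemInstruction as "the inline prefix" are all hypothetical. The cached_content field and the two status values come from the text.

```typescript
// Hypothetical shapes; the real registry record and request types may differ.
type RegistryEntry = { name: string; expiresAt: number }; // expiresAt in epoch ms
type GeminiBody = {
  systemInstruction?: unknown;
  cached_content?: string;
  [key: string]: unknown;
};
type GoogleCacheResult = {
  body: GeminiBody;
  status: "applied-google" | "deferred-google";
};

// Pure decision step: given the raw registry value, swap in the reference or defer.
function resolveGoogleCache(body: GeminiBody, raw: string | null, now: number): GoogleCacheResult {
  let entry: RegistryEntry | null = null;
  try {
    entry = raw ? (JSON.parse(raw) as RegistryEntry) : null;
  } catch {
    entry = null; // an unreadable record is treated like a miss
  }
  if (!entry || typeof entry.name !== "string" || !(entry.expiresAt > now)) {
    // Miss, malformed, or expired: forward the inline body unchanged.
    return { body, status: "deferred-google" };
  }
  // Assumption: the cached prefix corresponds to systemInstruction.
  const next: GeminiBody = { ...body, cached_content: entry.name };
  delete next.systemInstruction;
  return { body: next, status: "applied-google" };
}

// Minimal KV-like interface so the sketch does not depend on a runtime's types.
interface CacheRegistry {
  get(key: string): Promise<string | null>;
}

// One registry read per request; no resource creation on the hot path.
async function applyGooglePromptCache(
  body: GeminiBody,
  workloadId: string,
  registry: CacheRegistry,
): Promise<GoogleCacheResult> {
  return resolveGoogleCache(body, await registry.get(workloadId), Date.now());
}
```

Keeping the decision in a synchronous function makes the miss and expiry branches easy to unit-test without a live registry.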

The Gemini API key for the lifecycle calls is stored per workload on workloads.google_api_key. The worker doesn't read it today: the cron writes the registry entry, and the worker only references existing entries. Worker-side access to the key is reserved for a future inline-create fallback in case the cron lags behind demand.

Tier gating

M6 opt-ins are tier-gated via getTierFeatures(customerTier).promptCacheAvailable. The Free Sandbox tier observes the workload toggle, but the worker does NOT apply M6; this keeps Free Sandbox semantically observe-only across all four request-time mechanic opt-ins (M5 / M6 / M7 / M8 share the same gate). Production tier honours the per-workload toggle.

The dashboard kv-sync effective-enabled flag is workload_flag AND tier_allows — single source of truth, no drift between dashboard claim and worker behaviour.
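
The AND above is small enough to show in full. In this sketch the tier identifiers and the local getTierFeatures stand-in are hypothetical; the text only establishes that Free Sandbox does not apply M6 and Production honours the toggle.

```typescript
// Hypothetical tier ids; the real tier names and feature table may differ.
type Tier = "free_sandbox" | "production";
type TierFeatures = { promptCacheAvailable: boolean };

// Local stand-in for getTierFeatures, for illustration only.
function getTierFeatures(tier: Tier): TierFeatures {
  return { promptCacheAvailable: tier === "production" };
}

// effective-enabled = workload_flag AND tier_allows
function effectiveM6Enabled(workloadFlag: boolean, tier: Tier): boolean {
  return workloadFlag && getTierFeatures(tier).promptCacheAvailable;
}
```

Computing the flag once and syncing only the result means the worker never has to re-derive tier rules on the hot path.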

What we don’t do (yet)

  • No xAI prompt-cache integration. xAI hasn't published a cache surface for grok-* as of ship date. M6 falls through to no-op on xAI requests.
  • No multi-marker Anthropic strategies. v0.1 injects on the LAST system block only. Anthropic's API supports up to 4 cache breakpoints with different TTLs; the multi-breakpoint optimisation lands in a follow-up once we have customer traffic shapes that benefit from it (mid-conversation cache anchors).
  • No Google inline-create fallback. When the registry entry is missing or expired, the worker emits deferred-google and forwards the inline body unchanged. We do not create the cachedContents resource on the hot path — that would add a second round-trip per cache-miss request, and the worker latency budget doesn't tolerate it. Cron repopulates within ~5 minutes.
  • No Tessera-side proxy cache for prompt-cache hits. That's M2 (exact cache) and M5 (semantic cache). M6 is specifically about getting the customer's prefix into the provider's OWN cache — we don't double-cache.

Verification surfaces

  • Response header x-tessera-prompt-cache with one of: applied-anthropic / auto-openai / applied-google / deferred-google.
  • /portal/audit chip strip — the m6 chip in canonical mechanics_stack. Non-mutating chip colour (subtle, not the warning-amber of M3 / M7 / M8).
  • Provider-side cache metrics. Anthropic responses carry usage.cache_read_input_tokens and usage.cache_creation_input_tokens; OpenAI Chat Completions responses carry usage.prompt_tokens_details.cached_tokens. These flow into optimize_savings.tokens_in (effective input) so the savings math accounts for the cache discount automatically.
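
As a rough illustration of how provider usage fields could be folded into one effective-input figure, consider the sketch below. It is an assumption about the shape of the maths, not the optimize_savings implementation: the NormalisedUsage type, the function names, and passing the rates as parameters are all hypothetical. The provider field names are the real ones (on OpenAI Chat Completions the cached count sits under usage.prompt_tokens_details.cached_tokens, and Anthropic's input_tokens already excludes cached tokens).

```typescript
// Hypothetical common shape for per-request input usage.
type NormalisedUsage = { uncached: number; cacheRead: number; cacheWrite: number };

// Anthropic reports uncached, cache-read, and cache-write tokens separately.
function fromAnthropic(u: {
  input_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}): NormalisedUsage {
  return {
    uncached: u.input_tokens,
    cacheRead: u.cache_read_input_tokens ?? 0,
    cacheWrite: u.cache_creation_input_tokens ?? 0,
  };
}

// OpenAI's prompt_tokens is the total; the cached subset is reported inside it.
function fromOpenAI(u: {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}): NormalisedUsage {
  const cached = u.prompt_tokens_details?.cached_tokens ?? 0;
  return { uncached: u.prompt_tokens - cached, cacheRead: cached, cacheWrite: 0 };
}

// Rates are fractions of the normal input rate, supplied by the caller
// (for example 0.1 for an Anthropic cache read, per the billing note earlier).
function effectiveInputTokens(u: NormalisedUsage, readRate: number, writeRate: number): number {
  return u.uncached + u.cacheRead * readRate + u.cacheWrite * writeRate;
}
```

For example, an Anthropic response with 200 uncached input tokens and 10,000 cache-read tokens comes out at roughly 1,200 effective input tokens at a 0.1 read rate, against 10,200 without the cache.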

What we promise

M6's body changes (marker injection on Anthropic, reference swap on Google) are structural: they do not change the semantic content the model sees on cache MISS, and on cache HIT the model sees the exact content it would have processed inline. OpenAI's auto-cache path doesn't mutate the body at all; M6 is just the attribution signal so the canary breakdown stays honest.

When provider documentation contradicts our integration (e.g. Anthropic changes the cache_control syntax, OpenAI tightens the 1024-token threshold, Google deprecates cachedContents), the worker version that ships the fix carries a CHANGELOG entry + deprecation window for any sponsor still on a pre-update KV record.

Parent index: How it works. Cache mechanic family: M2 exact cache · M5 semantic cache (deep-dives queued). Adjacent: M1 auto-route · M3 compress.