Mechanic · M2 · Non-mutating · Short-circuit

Exact-match cache

Default 7-day TTL · per-workload override

Identical requests within the cache window return the cached response. The key is the SHA-256 hash of the canonically normalised request body: the same model, messages, temperature, tools, and response_format produce the same key, byte for byte. We never call the upstream provider on a cache hit. Tessera bills 20% of the savings: on a hit the upstream cost is zero, so the savings equal the baseline cost, which is what the request would have cost without the cache.
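The billing arithmetic for a hit can be sketched as follows. This is an illustration only: savingsFeeUsd and the CostRow shape are hypothetical names, the field names are borrowed from the audit row described under Verification surfaces, and the 0.05 USD baseline is an invented figure.

```typescript
// Rate taken from the text: Tessera bills 20% of the savings.
const SAVINGS_FEE_RATE = 0.2;

// Hypothetical row shape; field names mirror the audit columns.
interface CostRow {
  original_cost_usd: number; // baseline: cost without the cache
  actual_cost_usd: number;   // what the upstream call actually cost
}

// Fee = rate x (baseline - actual), never negative.
function savingsFeeUsd(row: CostRow): number {
  const savings = Math.max(0, row.original_cost_usd - row.actual_cost_usd);
  return savings * SAVINGS_FEE_RATE;
}

// On an M2 hit there is no upstream call, so actual_cost_usd is 0
// and the savings equal the full baseline.
const hitRow: CostRow = { original_cost_usd: 0.05, actual_cost_usd: 0 };
const fee = savingsFeeUsd(hitRow);                // about 0.01 USD
const netSaving = hitRow.original_cost_usd - fee; // about 0.04 USD kept by the sponsor
```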

M2 is the simplest cost mechanic to reason about and the cheapest per-request decision. The cache lookup adds a single Cloudflare KV read (~5 ms p50 from worker) before the upstream forward — on a miss, the latency hit is negligible; on a hit, we save the full inference round-trip. M2 is non-mutating by construction — when it fires, no content reaches the provider.

Canonical normalisation

Bodies that differ only in field order (e.g. {model, messages} vs {messages, model}) MUST produce the same cache key; otherwise the cache misses on requests the model would treat as identical. The normaliser (normaliseOpenAIBody) sorts top-level keys, drops Tessera-internal tags (tessera_feature_tag / tessera_customer_tag), and keys on the model name chosen after the route decision. An M1-routed request is therefore cached under the alternate model, and a repeat of the same source request hits the cache against the same target.
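A minimal sketch of the key derivation described above, not the actual normaliseOpenAIBody. The function names canonicalise and cacheKey are hypothetical, the way client_id is mixed into the hash is an assumption, and Node's crypto module stands in for whatever hashing the Worker uses.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type Body = { [key: string]: Json };

// Tessera-internal tags named in the text; they must not affect the key.
const INTERNAL_TAGS = new Set(["tessera_feature_tag", "tessera_customer_tag"]);

// Drop internal tags, pin the model to the routed one, sort top-level keys.
function canonicalise(body: Body, routedModel: string): string {
  const withModel: Body = { ...body, model: routedModel };
  const canonical: Body = {};
  const keys = Object.keys(withModel).filter((k) => !INTERNAL_TAGS.has(k)).sort();
  for (const k of keys) {
    const value = withModel[k];
    if (value !== undefined) canonical[k] = value;
  }
  // JSON.stringify emits string keys in insertion order, so the output is stable.
  return JSON.stringify(canonical);
}

// Assumption: the client id is hashed in front of the canonical body,
// which keeps keys client-scoped.
function cacheKey(clientId: string, body: Body, routedModel: string): string {
  return createHash("sha256")
    .update(clientId)
    .update("\n")
    .update(canonicalise(body, routedModel))
    .digest("hex");
}
```

With this shape, two bodies that differ only in top-level field order or in internal tags hash to the same 64-character hex key, while the same body from a different client does not.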

Streaming requests bypass M2: the cache returns a complete JSON response, and we don't replay a synthetic stream from cached tokens. A future v0.2 could rewrap a cached response as a single-chunk SSE message for streaming callers; today the streaming path always forwards upstream.

Short-circuit ordering

M2 fires upstream of every content-mutating mechanic (M3 / M7 / M8) and upstream of M1 swap measurement. The order inside handleProviderRequest:

  • Auth + rate-limit + tier gating.
  • M1 auto-route decision (so we cache against the actual upstream model, not the requested one).
  • M2 cache lookup — on HIT, return immediately with the cached body + emit measureCacheHit event. No mutation, no upstream call.
  • M5 semantic cache (separate short-circuit, cosine-based) — runs only on M2 miss.
  • M6 / M3 / M7 / M8 / M9 — content + parameter mutations. Always after M2.

Consequence: M2 hits never appear in a mechanics_stack with M3 / M7 / M8 — the hit short-circuits before any mutation site fires. The chip strip on /portal/audit for an M2 row carries m2 alone (potentially with m1 when the cache key was computed against an M1-swapped model).
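The ordering above can be sketched as a single function. This is a simplified, synchronous illustration, not handleProviderRequest itself: the in-memory Map stands in for the KV namespace, callUpstream is a stub, auth and the M1 decision are assumed to have run already, and populating the cache on a non-streaming miss is an assumption about where the write happens.

```typescript
type Outcome = { source: "cache" | "upstream"; body: string; mechanics: string[] };

interface Req {
  cacheKey: string;   // assumed already computed against the routed model
  stream: boolean;    // streaming callers bypass M2
  m1Swapped: boolean; // true when M1 routed to an alternate model
}

// Hypothetical stand-ins for the KV namespace and the provider call.
const kv = new Map<string, string>();
let upstreamCalls = 0;
function callUpstream(req: Req): string {
  upstreamCalls += 1;
  return `response-for-${req.cacheKey}`;
}

function handle(req: Req): Outcome {
  const mechanics: string[] = req.m1Swapped ? ["m1"] : [];

  // M2 lookup: on a hit, return before any mutation site can fire.
  if (!req.stream) {
    const cached = kv.get(req.cacheKey);
    if (cached !== undefined) {
      return { source: "cache", body: cached, mechanics: [...mechanics, "m2"] };
    }
  }

  // M5 and the mutating mechanics (M6 / M3 / M7 / M8 / M9) would run here; omitted.
  const body = callUpstream(req);
  if (!req.stream) kv.set(req.cacheKey, body); // assumption: populate on miss
  return { source: "upstream", body, mechanics };
}
```

Because the hit path returns before the mutation stage, the mechanics list on a hit can only ever contain m2, optionally preceded by m1.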

TTL semantics

Default TTL: 7 days (cache_ttl_seconds = 604800 on the client record). Per-workload override: workloads.cache_ttl_seconds (migration 0072) takes precedence when set. The value is clamped to [60, 30 × 86400] seconds (1 minute to 30 days), enforced server-side at write time AND at worker read time (defence in depth).

Why per-workload: a workload that asks "show me revenue this month" needs a short TTL (data changes daily), while a workload that asks "summarise this PDF" can use a long TTL (immutable input). The sponsor sets the boundary; we honour it.
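The precedence and clamp can be sketched as one small function. The name effectiveTtlSeconds and its signature are hypothetical; the constants come from the text above. The same function could run at write time and again at worker read time to get the defence-in-depth behaviour.

```typescript
const DEFAULT_TTL_SECONDS = 604_800;  // 7 days
const MIN_TTL_SECONDS = 60;           // lower clamp bound
const MAX_TTL_SECONDS = 30 * 86_400;  // upper clamp bound: 30 days = 2,592,000 s

// Workload override wins when set, then the client record, then the default.
// The chosen value is always clamped into [60, 30 x 86400].
function effectiveTtlSeconds(
  clientTtl: number | null,
  workloadTtl: number | null,
): number {
  const chosen = workloadTtl ?? clientTtl ?? DEFAULT_TTL_SECONDS;
  return Math.min(MAX_TTL_SECONDS, Math.max(MIN_TTL_SECONDS, chosen));
}

effectiveTtlSeconds(null, null);          // 604800 (default)
effectiveTtlSeconds(604_800, 3_600);      // 3600 (workload override)
effectiveTtlSeconds(604_800, 10);         // 60 (clamped up)
effectiveTtlSeconds(604_800, 90 * 86_400); // 2592000 (clamped down)
```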

What we don’t do

  • No fuzzy match. M2 is byte-exact only. Near-identical prompts that differ in one word miss the cache. That's M5 (semantic cache, deep-dive queued): a separate mechanic with a separate KV namespace and a separate decision tree.
  • No streaming cache. Streaming callers always forward upstream — no synthetic SSE replay.
  • No cross-tenant cache pollution. The cache key includes the client_id, so two clients hitting the same model with the same prompt never share cache entries. Each tenant's cache is private.
  • No cache-side mutation. M2 returns the body exactly as the provider returned it when the cache entry was populated. We do not strip, normalise, or annotate the cached response.
  • No replay attack on the cache key. A request from a different client with the same canonical body misses (client-scoped key). Same client, same body, same routed model = hit. Anything else = miss.

Verification surfaces

  • Response header x-tessera-cache: hit on M2 cache hits; x-tessera-cache: miss on misses. The miss header is set on every uncached request so sponsors can correlate cache behaviour with workload patterns.
  • /portal/audit chip strip — the m2 chip + the cache_hit=true column on the optimize_savings row. actual_cost_usd = 0 on every M2 hit; the savings is original_cost_usd in full.
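Both surfaces can be checked mechanically. The sketch below is illustrative: cacheStatus, isConsistentHitRow, and the SavingsRow shape are hypothetical names, and the "unknown" result for a missing or unexpected header value is an assumption for responses that did not pass through Tessera. HeaderReader is deliberately minimal so that the headers object of a standard fetch Response satisfies it.

```typescript
// Minimal view of a headers object; fetch's Headers matches this shape.
interface HeaderReader {
  get(name: string): string | null;
}

type CacheStatus = "hit" | "miss" | "unknown";

// Read the x-tessera-cache response header described above.
function cacheStatus(headers: HeaderReader): CacheStatus {
  const value = headers.get("x-tessera-cache");
  if (value === "hit") return "hit";
  if (value === "miss") return "miss";
  return "unknown"; // assumption: header absent or unexpected
}

// Hypothetical shape of an optimize_savings audit row.
interface SavingsRow {
  cache_hit: boolean;
  actual_cost_usd: number;
  original_cost_usd: number;
}

// Invariant from the text: every M2 hit has actual_cost_usd = 0.
function isConsistentHitRow(row: SavingsRow): boolean {
  return !row.cache_hit || row.actual_cost_usd === 0;
}
```

A caller would pass response.headers from a fetch call to cacheStatus, and could run isConsistentHitRow over exported audit rows to flag any hit that still shows an upstream cost.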

What we promise

M2 is opt-in per workload (auto_cache_enabled), client-scoped (no cross-tenant pollution), byte-exact (no fuzzy match), short-circuit (never composes with content mutations), and observable per request (cache hit = visible header + audit row).

Parent index: How it works. Cache family deep-dives: M6 prompt cache. M5 semantic cache deep-dive queued.