Mechanic · M5 · Non-mutating · Short-circuit

Semantic cache

Cosine threshold 0.95 default · per-workload override

Requests whose embeddings are cosine-similar to a cached request are served from cache. This catches paraphrased queries that exact-match (M2) misses: "summarise this" and "give me a summary of this" embed to nearly the same vector and return the same cached answer when the workload accepts the approximation. The default threshold is 0.95, with a per-workload override on workloads.semantic_cache_threshold (clamped to [0.85, 0.99]).
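A minimal sketch of how the threshold could be resolved and applied. The function names and the shape of the override argument are assumptions for illustration; only the 0.95 default and the [0.85, 0.99] clamp come from the text.

```typescript
// Hypothetical helpers; names are illustrative, not the worker's actual API.
const DEFAULT_THRESHOLD = 0.95;
const MIN_THRESHOLD = 0.85;
const MAX_THRESHOLD = 0.99;

// Use the per-workload override when present, clamped to the allowed range;
// otherwise fall back to the default.
function resolveThreshold(override?: number | null): number {
  if (override == null || Number.isNaN(override)) return DEFAULT_THRESHOLD;
  return Math.min(MAX_THRESHOLD, Math.max(MIN_THRESHOLD, override));
}

// A cached entry qualifies only when its similarity meets the threshold.
function isSemanticHit(similarity: number, override?: number | null): boolean {
  return similarity >= resolveThreshold(override);
}
```

Under these assumptions, an override of 0.5 is treated as 0.85, so a misconfigured workload cannot loosen matching beyond the floor.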

M5 is the cache mechanic with the highest variance in customer utility. Some workloads (FAQ-style support bots, news summarisers) hit M5 80% of the time. Others (open-ended creative generation, unique user-message workloads) hit ~0%. The eligibility heuristic gates M5 firing per workload; the canary tracks whether the cached responses actually pass the customer's eval.

Embedding model + hot-path budget

Embeddings via OpenAI text-embedding-3-small on a dedicated Tessera-owned key (OPENAI_API_KEY_EMBEDDINGS worker secret), NOT the customer's upstream key. The customer may not send an OpenAI key at all (Anthropic-primary workloads); we provide the embedding capability.
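For illustration, the embedding call could look roughly like the sketch below. The endpoint, model name, and response layout follow OpenAI's public embeddings API; the helper names and the way the key is passed in are assumptions, not the worker's actual code.

```typescript
// Hypothetical request builder; the caller would pass the value of the
// OPENAI_API_KEY_EMBEDDINGS secret, never the customer's upstream key.
interface EmbeddingRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildEmbeddingRequest(text: string, embeddingsKey: string): EmbeddingRequest {
  return {
    url: "https://api.openai.com/v1/embeddings",
    method: "POST",
    headers: {
      Authorization: `Bearer ${embeddingsKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  };
}

// Pull the first vector out of the parsed response, or null if the
// payload does not have the expected { data: [{ embedding: [...] }] } layout.
function parseEmbeddingResponse(json: unknown): number[] | null {
  const data = (json as { data?: Array<{ embedding?: unknown }> } | null)?.data;
  const vec = data?.[0]?.embedding;
  if (!Array.isArray(vec)) return null;
  return vec.every((x) => typeof x === "number") ? (vec as number[]) : null;
}
```

Returning null on a malformed payload lets the caller treat an embedding failure as an M5 skip rather than a request failure.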

Hot-path latency: ~80 ms p50 for embed + lookup. M5 only fires after M2 miss, so worst case the worker pays exact-match KV + embed + semantic KV before upstream. The trade-off: when M5 hits, the win is the full inference round-trip (saves seconds); when it misses, we've added 80 ms. Opt-in per workload.
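The trade-off can be put in numbers. The sketch below assumes a simple model: every M5-enabled request pays the ~80 ms overhead, and a hit skips the upstream call entirely. The 2000 ms upstream figure in the comments is an assumed example, not a measured value.

```typescript
// Expected change in latency (ms) from enabling M5, compared with not
// running it at all. Negative means M5 saves time on average.
function expectedLatencyDeltaMs(
  hitRate: number,
  upstreamMs: number,
  overheadMs = 80,
): number {
  // hit:  +overhead, -upstream    miss: +overhead
  return overheadMs - hitRate * upstreamMs;
}

// Hit rate at which the overhead is exactly paid back.
function breakEvenHitRate(upstreamMs: number, overheadMs = 80): number {
  return overheadMs / upstreamMs;
}

// Example with an assumed 2000 ms upstream round-trip:
//   break-even hit rate = 80 / 2000 = 0.04 (4%)
```

Under this model a workload with a near-zero hit rate only ever pays the overhead, which is one reason to keep M5 opt-in.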

Vector store + entry shape

Per-workload partition in the SEMANTIC_CACHE KV namespace, keyed by semantic:{client_id}:{workload_id}:{n}. Each entry carries:

  • embedding — the float32 vector (text-embedding-3-small produces 1536-dim, we store verbatim).
  • response_body — the cached provider response, byte-for-byte.
  • model — the actual model served. M1 may have routed this earlier; we remember which alt.
  • created_at + ttl_hours — default 24-hour TTL.

Per-workload semantic_cache_max_entries (clamped [10, 200], default 50) caps the partition size. Beyond the cap, LRU eviction removes the oldest entry on populate. The worker reads + clamps every value at the call site before handing to the lookup library.
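The entry shape and key layout above might be modelled as follows. Field names follow the list; the value types (for example an ISO-8601 string for created_at) and the helper names are assumptions made for the sketch.

```typescript
// Illustrative types only; the stored encoding may differ.
interface SemanticCacheEntry {
  embedding: number[];   // 1536 floats from text-embedding-3-small
  response_body: string; // cached provider response, returned verbatim
  model: string;         // model that actually served the response
  created_at: string;    // assumed ISO-8601 timestamp
  ttl_hours: number;     // default 24
}

// Key layout from the text: semantic:{client_id}:{workload_id}:{n}
function entryKey(clientId: string, workloadId: string, n: number): string {
  return `semantic:${clientId}:${workloadId}:${n}`;
}

// Clamp semantic_cache_max_entries to [10, 200], defaulting to 50.
function clampMaxEntries(value?: number | null): number {
  if (value == null || Number.isNaN(value)) return 50;
  return Math.min(200, Math.max(10, Math.trunc(value)));
}

// True once the entry is at least ttl_hours old.
function isExpired(entry: SemanticCacheEntry, nowMs: number): boolean {
  const ageMs = nowMs - Date.parse(entry.created_at);
  return ageMs >= entry.ttl_hours * 60 * 60 * 1000;
}
```

Clamping at the call site, as the text describes, means the lookup code never has to defend against an out-of-range partition size.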

Eligibility + tier gating

M5 fires only when:

  • Workload opt-in is true AND tier permits (tier.promptCacheAvailable — Free Sandbox gets observe-only, Production applies the mechanic).
  • Provider is OpenAI-shape OR Anthropic-shape (the embed-text extractor handles both; Gemini-shape lands in a follow-up).
  • Request is non-streaming.
  • SEMANTIC_CACHE and OPENAI_API_KEY_EMBEDDINGS env are present (graceful degrade to disabled when missing — worker logs a warning, doesn't break the request).

Any failed gate → M5 skipped, M2 lookup still ran in front, M6 / M3 / M7 / M8 still run downstream. The request continues with one less optimisation in the stack.
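The gates above can be sketched as a single check that reports the first failure. The input shape, field names, and gate labels are hypothetical; only the conditions themselves come from the list.

```typescript
type ProviderShape = "openai" | "anthropic" | "gemini";

// Hypothetical snapshot of everything the gates look at.
interface M5GateInput {
  workloadOptIn: boolean;
  tierPromptCacheAvailable: boolean; // mirrors tier.promptCacheAvailable
  providerShape: ProviderShape;
  streaming: boolean;
  hasSemanticCacheBinding: boolean;  // SEMANTIC_CACHE is bound
  hasEmbeddingsKey: boolean;         // OPENAI_API_KEY_EMBEDDINGS is set
}

// Returns a label for the first failed gate, or null when M5 may fire.
function firstFailedGate(input: M5GateInput): string | null {
  if (!input.workloadOptIn) return "workload_opt_in";
  if (!input.tierPromptCacheAvailable) return "tier";
  if (input.providerShape !== "openai" && input.providerShape !== "anthropic") {
    return "provider_shape";
  }
  if (input.streaming) return "streaming";
  if (!input.hasSemanticCacheBinding || !input.hasEmbeddingsKey) {
    return "missing_env"; // the text says the worker logs a warning here
  }
  return null;
}
```

Returning a label rather than a bare boolean makes it easy to log why M5 was skipped without affecting the rest of the stack.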

Populate path

On M5 miss + successful upstream response (200) + body ≤ 256 KB: the worker computes a fresh embedding (or reuses the lookup-side vector if available — same text produces the same embedding, so we deduplicate the OpenAI embed call when possible) and stores an entry in the workload partition via populateEntry. The populate runs in ctx.waitUntil — the customer-facing response has already been returned by the time we write to KV.

Why the 256 KB cap: a single KV value over that size invokes chunking + indexing overhead that defeats the purpose. M5 is best for small-to-medium responses (FAQ answers, short summaries). Long-form generation (multi-page essays) bypasses populate to avoid bloating the partition.
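A sketch of the populate decision and its deferral, under stated assumptions: 256 KB is read as 256 × 1024 bytes, the embed and populate callbacks stand in for the real embedding call and populateEntry (whose actual signatures are not given here), and the context type is reduced to the one method the sketch needs.

```typescript
const MAX_POPULATE_BYTES = 256 * 1024; // assumed interpretation of "256 KB"

// Minimal stand-in for the worker's execution context.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

interface UpstreamOutcome {
  m5Hit: boolean;
  upstreamStatus: number;
  bodyBytes: number;
}

// Populate only on M5 miss + 200 from upstream + body within the cap.
function shouldPopulate(o: UpstreamOutcome): boolean {
  return !o.m5Hit && o.upstreamStatus === 200 && o.bodyBytes <= MAX_POPULATE_BYTES;
}

// Schedules the write after the response; returns whether it was scheduled.
function schedulePopulate(
  ctx: WaitUntilContext,
  outcome: UpstreamOutcome,
  lookupVector: number[] | null,              // vector from the lookup, if any
  embed: () => Promise<number[]>,             // hypothetical embedding call
  populate: (vector: number[]) => Promise<void>, // stand-in for populateEntry
): boolean {
  if (!shouldPopulate(outcome)) return false;
  ctx.waitUntil(
    (async () => {
      // Reuse the lookup-side vector when available to avoid a second embed call.
      const vector = lookupVector ?? (await embed());
      await populate(vector);
    })(),
  );
  return true;
}
```

Because the write is handed to waitUntil, a slow or failed KV write in this sketch cannot delay the response the customer has already received.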

What we don’t do (yet)

  • No Vectorize backing (yet). v0.1 uses KV with in-worker cosine compute (at most 200 entries × 1536 dimensions, roughly 307,200 multiply-adds for the dot products, which is small enough to brute-force per request). Cloudflare Vectorize as a proper ANN store lands when partition sizes pass that bound.
  • No re-rank by recency. v0.1 returns the single highest-similarity match above threshold. We do not blend recency × similarity. Recency-aware re-ranking is on the roadmap.
  • No Gemini-shape support. contents[].parts[] needs its own embed-text extractor; the work is gated on Gemini's share of traffic justifying it.
  • No customer-side embedding override. M5 uses text-embedding-3-small. Sponsors with their own embedding preferences (Cohere, Voyage) cannot swap models today.
  • No automatic threshold tuning. 0.95 is the published default; sponsors override per workload. We do not auto-tune based on canary signal.
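The brute-force lookup described in the first two items of this list can be sketched as a linear scan that keeps the single best match above the threshold and ignores recency. The types and function names are illustrative assumptions, not the lookup library's real interface.

```typescript
interface Candidate {
  key: string;
  embedding: ArrayLike<number>;
}

interface Match {
  key: string;
  similarity: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0 for mismatched lengths or zero-length vectors.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Scan every entry in the partition; keep the highest similarity that
// meets the threshold. On ties the earlier entry wins.
function bestMatch(
  query: ArrayLike<number>,
  entries: Candidate[],
  threshold: number,
): Match | null {
  let best: Match | null = null;
  for (const entry of entries) {
    const similarity = cosineSimilarity(query, entry.embedding);
    if (similarity >= threshold && (best === null || similarity > best.similarity)) {
      best = { key: entry.key, similarity };
    }
  }
  return best;
}
```

With at most 200 entries the scan is bounded, which is why an ANN index is not needed at this partition size.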

Verification surfaces

  • Response header x-tessera-cache: semantic-hit on M5 hits, with x-tessera-cache-similarity: 0.9743 (4-decimal similarity score). Misses surface semantic-miss.
  • /portal/audit chip strip — m5 chip, semantic_cache_hit = true on the row.
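On the client side, the two response headers could be read with a small helper like the one below. The header names and values come from the list above; the helper, its return shape, and the minimal HeaderReader interface are assumptions (the standard fetch Headers object satisfies that interface).

```typescript
// Anything with a get(name) method, e.g. response.headers from fetch.
interface HeaderReader {
  get(name: string): string | null;
}

interface M5CacheStatus {
  hit: boolean;
  similarity: number | null; // only set on semantic-hit
}

// Returns null when the header is absent or carries a value other than
// semantic-hit / semantic-miss.
function readM5Status(headers: HeaderReader): M5CacheStatus | null {
  const cache = headers.get("x-tessera-cache");
  if (cache === "semantic-hit") {
    const raw = headers.get("x-tessera-cache-similarity");
    const parsed = raw === null ? NaN : Number.parseFloat(raw);
    return { hit: true, similarity: Number.isNaN(parsed) ? null : parsed };
  }
  if (cache === "semantic-miss") {
    return { hit: false, similarity: null };
  }
  return null;
}
```

A check like this is enough to chart hit rate and similarity distribution from client logs without access to the portal.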

What we promise

M5 is opt-in per workload, threshold-tunable per workload, tier-gated to Production, byte-exact on serve (cached response returned verbatim), and graceful-degrade on missing infrastructure. We never silently swap a customer's requested model under M5: the cache always returns the model that was served originally.

Parent index: How it works. Cache family: M2 exact cache · M6 prompt cache.