Semantic cache
Cosine threshold 0.95 default · per-workload override
Requests whose embeddings are cosine-similar to a cached request are served from cache. This catches paraphrased queries that exact-match (M2) misses: "summarise this" and "give me a summary of this" embed to nearly the same vector and return the same cached answer when the workload accepts the approximation. The default threshold is 0.95, with a per-workload override on workloads.semantic_cache_threshold (clamped to [0.85, 0.99]).
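The similarity check can be sketched as follows. This is a minimal illustration, not the worker's actual code: the helper names (clampThreshold, cosineSimilarity, isSemanticHit) are hypothetical, and treating a score exactly at the threshold as a hit is an assumption. Only the 0.95 default and the [0.85, 0.99] clamp come from the text above.

```typescript
// Hypothetical helpers; names are illustrative, not Tessera's API.
const DEFAULT_THRESHOLD = 0.95;
const MIN_THRESHOLD = 0.85;
const MAX_THRESHOLD = 0.99;

// Clamp a per-workload override into [0.85, 0.99]; fall back to the
// default when no usable override is set.
function clampThreshold(override?: number | null): number {
  if (override == null || !Number.isFinite(override)) return DEFAULT_THRESHOLD;
  return Math.min(MAX_THRESHOLD, Math.max(MIN_THRESHOLD, override));
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no match
  return dot / Math.sqrt(normA * normB);
}

// Assumption: a score exactly equal to the threshold counts as a hit.
function isSemanticHit(similarity: number, override?: number | null): boolean {
  return similarity >= clampThreshold(override);
}
```

With these helpers, an override of 0.5 is clamped up to 0.85 and an override of 1 is clamped down to 0.99, so a workload cannot loosen or tighten matching beyond the published bounds.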
M5 is the cache mechanic with the highest variance in customer utility. Some workloads (FAQ-style support bots, news summarisers) hit M5 80% of the time. Others (open-ended creative generation, unique user-message workloads) hit ~0%. The eligibility heuristic gates M5 firing per workload; the canary tracks whether the cached responses actually pass the customer's eval.
Embedding model + hot-path budget
Embeddings via OpenAI text-embedding-3-small on a dedicated Tessera-owned key (OPENAI_API_KEY_EMBEDDINGS worker secret), NOT the customer's upstream key. The customer may not send an OpenAI key at all (Anthropic-primary workloads); we provide the embedding capability.
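A sketch of how the embed call could be assembled against OpenAI's public embeddings endpoint (POST /v1/embeddings, response vector at data[0].embedding). The helper names buildEmbeddingRequest and parseEmbedding are hypothetical; the worker's real call site is not shown in this document.

```typescript
interface EmbeddingRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the request using the dedicated embeddings key, never the
// customer's upstream key.
function buildEmbeddingRequest(text: string, embeddingsKey: string): EmbeddingRequest {
  return {
    url: "https://api.openai.com/v1/embeddings",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${embeddingsKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
    },
  };
}

// Pull the vector out of a response shaped like
// { data: [{ embedding: number[] }] }.
function parseEmbedding(json: unknown): number[] {
  const data = (json as { data?: Array<{ embedding?: unknown }> } | null)?.data;
  const vector = data?.[0]?.embedding;
  if (!Array.isArray(vector)) throw new Error("unexpected embeddings response");
  return vector as number[];
}
```

In a worker, the pair would be used roughly as fetch(url, init) with the key read from env.OPENAI_API_KEY_EMBEDDINGS, followed by parseEmbedding on the parsed JSON body.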
Hot-path latency: ~80 ms p50 for embed + lookup. M5 only fires after M2 miss, so worst case the worker pays exact-match KV + embed + semantic KV before upstream. The trade-off: when M5 hits, the win is the full inference round-trip (saves seconds); when it misses, we've added 80 ms. Opt-in per workload.
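The trade-off can be put in numbers. Without M5, an M2 miss costs one upstream round-trip. With M5, every such request pays the ~80 ms overhead and only misses go upstream, so the expected saving is hitRate × upstream − overhead. The sketch below encodes that; the function names are hypothetical, and the 2,000 ms upstream figure in the usage note is an assumed example, not a measured value.

```typescript
// Extra latency M5 adds per request, from the ~80 ms p50 figure above.
const M5_OVERHEAD_MS = 80;

// Expected latency saved per M2-miss request when M5 is enabled.
// A hit skips upstream entirely; every request pays embed + lookup.
function expectedSavingMs(
  hitRate: number,
  upstreamMs: number,
  overheadMs: number = M5_OVERHEAD_MS,
): number {
  return hitRate * upstreamMs - overheadMs;
}

// Hit rate at which M5 neither helps nor hurts latency.
function breakEvenHitRate(upstreamMs: number, overheadMs: number = M5_OVERHEAD_MS): number {
  return overheadMs / upstreamMs;
}
```

For an assumed 2,000 ms upstream round-trip, the break-even hit rate is 80 / 2000 = 0.04, and a workload hitting 80% of the time saves about 1,520 ms per request on average. This counts latency only, not the cost of the embed call itself.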
Vector store + entry shape
Per-workload partition in the SEMANTIC_CACHE KV namespace, keyed by semantic:{client_id}:{workload_id}:{n}. Each entry carries:
- embedding: the float32 vector (text-embedding-3-small produces 1536-dim, we store verbatim).
- response_body: the cached provider response, byte-for-byte.
- model: the actual model served. M1 may have routed this earlier; we remember which alt.
- created_at + ttl_hours: default 24-hour TTL.
Per-workload semantic_cache_max_entries (clamped [10, 200], default 50) caps the partition size. Beyond the cap, LRU eviction removes the oldest entry on populate. The worker reads + clamps every value at the call site before handing to the lookup library.
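The entry shape, key format, and cap can be sketched like this. The field names and the key pattern come from the text above; the TypeScript types (for example created_at as epoch milliseconds) and the helper names are assumptions. The text mentions both LRU and "oldest entry"; this sketch evicts by created_at, whereas a strict LRU would track last-hit time instead.

```typescript
// Assumed shape: field names follow the list above, types are guesses.
interface SemanticCacheEntry {
  embedding: number[];   // 1536 floats for text-embedding-3-small
  response_body: string; // cached provider response, served verbatim
  model: string;         // model that actually produced the response
  created_at: number;    // assumption: epoch milliseconds
  ttl_hours: number;     // default 24
}

// Key format: semantic:{client_id}:{workload_id}:{n}
function entryKey(clientId: string, workloadId: string, n: number): string {
  return `semantic:${clientId}:${workloadId}:${n}`;
}

// Clamp semantic_cache_max_entries into [10, 200], default 50.
function clampMaxEntries(raw?: number | null): number {
  if (raw == null || !Number.isFinite(raw)) return 50;
  return Math.min(200, Math.max(10, Math.trunc(raw)));
}

// Add an entry, then drop the oldest entries until the partition fits the cap.
function insertWithEviction(
  entries: SemanticCacheEntry[],
  incoming: SemanticCacheEntry,
  cap: number,
): SemanticCacheEntry[] {
  const next = [...entries, incoming].sort((a, b) => a.created_at - b.created_at);
  return next.slice(Math.max(0, next.length - cap)); // keep the newest `cap`
}
```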
Eligibility + tier gating
M5 fires only when:
- Workload opt-in is true AND tier permits (tier.promptCacheAvailable: Free Sandbox gets observe-only, Production applies the mechanic).
- Provider is OpenAI-shape OR Anthropic-shape (the embed-text extractor handles both; Gemini-shape lands in a follow-up).
- Request is non-streaming.
- SEMANTIC_CACHE and OPENAI_API_KEY_EMBEDDINGS env are present (graceful degrade to disabled when missing: worker logs a warning, doesn't break the request).
Any failed gate → M5 skipped, M2 lookup still ran in front, M6 / M3 / M7 / M8 still run downstream. The request continues with one less optimisation in the stack.
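The gates above could be expressed as a single check that returns the first failing reason. This is a sketch under assumptions: the input shape, the function name checkM5Eligibility, and the reason strings are all hypothetical; only the gate conditions themselves come from the list above.

```typescript
type ProviderShape = "openai" | "anthropic" | "gemini";

// Hypothetical input; the real worker reads these from workload config,
// tier config, the request, and env.
interface M5GateInput {
  workloadOptIn: boolean;
  tierPromptCacheAvailable: boolean; // mirrors tier.promptCacheAvailable
  providerShape: ProviderShape;
  streaming: boolean;
  hasSemanticCacheBinding: boolean;  // SEMANTIC_CACHE present
  hasEmbeddingsKey: boolean;         // OPENAI_API_KEY_EMBEDDINGS present
}

interface M5GateResult {
  eligible: boolean;
  reason?: string; // set only when eligible is false; strings are illustrative
}

function checkM5Eligibility(input: M5GateInput): M5GateResult {
  if (!input.workloadOptIn) return { eligible: false, reason: "workload_not_opted_in" };
  if (!input.tierPromptCacheAvailable) return { eligible: false, reason: "tier_not_permitted" };
  if (input.providerShape !== "openai" && input.providerShape !== "anthropic") {
    return { eligible: false, reason: "unsupported_provider_shape" };
  }
  if (input.streaming) return { eligible: false, reason: "streaming_request" };
  if (!input.hasSemanticCacheBinding || !input.hasEmbeddingsKey) {
    // Graceful degrade: warn and skip M5, never fail the request.
    console.warn("M5 disabled: SEMANTIC_CACHE or OPENAI_API_KEY_EMBEDDINGS missing");
    return { eligible: false, reason: "missing_infrastructure" };
  }
  return { eligible: true };
}
```

A caller would skip only the M5 step when eligible is false and let the rest of the stack run unchanged.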
Populate path
On M5 miss + successful upstream response (200) + body ≤ 256 KB: the worker computes a fresh embedding (or reuses the lookup-side vector if available — same text produces the same embedding, so we deduplicate the OpenAI embed call when possible) and stores an entry in the workload partition via populateEntry. The populate runs in ctx.waitUntil — the customer-facing response has already been returned by the time we write to KV.
Why the 256 KB cap: a single KV value over that size invokes chunking + indexing overhead that defeats the purpose. M5 is best for small-to-medium responses (FAQ answers, short summaries). Long-form generation (multi-page essays) bypasses populate to avoid bloating the partition.
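A sketch of the populate decision and the deferred write. ctx.waitUntil is the Cloudflare Workers call named above; everything else here is an assumption for illustration: the WaitUntilContext interface stands in for the worker's execution context, schedulePopulate and shouldPopulate are hypothetical names, populateEntry is passed in as a callback because its real signature is not documented here, and 256 KB is read as 256 × 1024 bytes.

```typescript
const MAX_POPULATE_BYTES = 256 * 1024; // assumption: 256 KB means 256 KiB

// Minimal stand-in for the worker's execution context.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

// Populate only on a 200 response whose body fits under the cap.
function shouldPopulate(status: number, body: string): boolean {
  const bytes = new TextEncoder().encode(body).byteLength;
  return status === 200 && bytes <= MAX_POPULATE_BYTES;
}

// Schedule the KV write without delaying the customer-facing response.
// Returns true when a write was scheduled.
function schedulePopulate(
  ctx: WaitUntilContext,
  status: number,
  body: string,
  lookupVector: number[] | null,
  embed: () => Promise<number[]>,
  populateEntry: (vector: number[], body: string) => Promise<void>,
): boolean {
  if (!shouldPopulate(status, body)) return false;
  const work = (async () => {
    // Reuse the lookup-side vector when available to skip a second embed call.
    const vector = lookupVector ?? (await embed());
    await populateEntry(vector, body);
  })().catch((err) => console.warn("semantic cache populate failed", err));
  ctx.waitUntil(work);
  return true;
}
```

Catching errors inside the scheduled promise keeps a failed write from surfacing as an unhandled rejection after the response has already gone out.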
What we don’t do (yet)
- No Vectorize backing (yet). v0.1 uses KV with in-worker cosine compute (at most 200 entries × 1536 dimensions, small enough to brute-force per request). Cloudflare Vectorize as a proper ANN store lands when partition sizes pass that bound.
- No re-rank by recency. v0.1 returns the single highest-similarity match above threshold. We do not blend recency × similarity. Recency-aware re-ranking is on the roadmap.
- No Gemini-shape support. contents[].parts[] needs its own embed-text extractor; Gemini share gates the lift.
- No customer-side embedding override. M5 uses text-embedding-3-small. Sponsors with their own embedding preferences (Cohere, Voyage) cannot swap models today.
- No automatic threshold tuning. 0.95 is the published default; sponsors override per workload. We do not auto-tune based on canary signal.
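The v0.1 behaviour described in this list (brute-force scan, single highest-similarity match above threshold, no recency blending) can be sketched as below. The names StoredEntry, Match, and findBestMatch are hypothetical, and treating a score equal to the threshold as a match is an assumption. For scale: a full partition is 200 × 1536 = 307,200 vector components per lookup, or about 1.2 MB if held as float32.

```typescript
interface StoredEntry {
  embedding: number[];
  response_body: string;
  model: string;
}

interface Match {
  entry: StoredEntry;
  similarity: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Scan every entry; return the single highest-similarity entry at or
// above the threshold, or null when nothing qualifies.
function findBestMatch(query: number[], entries: StoredEntry[], threshold: number): Match | null {
  let best: Match | null = null;
  for (const entry of entries) {
    const similarity = cosine(query, entry.embedding);
    if (similarity >= threshold && (best === null || similarity > best.similarity)) {
      best = { entry, similarity };
    }
  }
  return best;
}
```

Because only similarity decides the winner, two entries with near-identical scores are not distinguished by age, which is the gap recency-aware re-ranking would close.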
Verification surfaces
- Response header x-tessera-cache: semantic-hit on M5 hits, with x-tessera-cache-similarity: 0.9743 (4-decimal similarity score). Misses surface semantic-miss.
- /portal/audit chip strip: m5 chip, semantic_cache_hit = true on the row.
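On the client side, the response headers can be checked with a small reader like the one below. The header names and values are the ones listed above; the function name readSemanticCacheStatus, the result type, and the handling of any other x-tessera-cache value are assumptions for illustration.

```typescript
// Anything with a Headers-style get() works, including fetch's res.headers.
interface HeaderReader {
  get(name: string): string | null;
}

type SemanticCacheStatus =
  | { kind: "hit"; similarity: number | null }
  | { kind: "miss" }
  | { kind: "other"; raw: string | null };

function readSemanticCacheStatus(headers: HeaderReader): SemanticCacheStatus {
  const cache = headers.get("x-tessera-cache");
  if (cache === "semantic-hit") {
    const raw = headers.get("x-tessera-cache-similarity");
    const parsed = raw === null ? NaN : Number.parseFloat(raw);
    return { kind: "hit", similarity: Number.isFinite(parsed) ? parsed : null };
  }
  if (cache === "semantic-miss") return { kind: "miss" };
  // Header absent, or a value from another cache mechanic.
  return { kind: "other", raw: cache };
}
```

A test harness could call this on every response and log the similarity alongside its own eval result, to see how close to the threshold the served hits actually sit.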
What we promise
M5 is opt-in per workload, threshold-tunable per workload, tier-gated to Production, byte-exact on serve (cached response returned verbatim), and graceful-degrade on missing infrastructure. We never silently swap a customer's requested model under M5: the cache always returns the model that was served originally.