Mechanic · M7 · Content-mutating · Priority 1

Context pruning

OpenAI v0.1 · Anthropic wire-in 24 May 2026

M7 trims conversation turns and reranks RAG-style context blocks before forwarding the request upstream. The signal: message count > 12 OR body length > 32,768 chars. The action: drop the middle of long conversation histories (preserving the system prompt + the most-recent N turns), and re-rank embedded RAG chunks by similarity to the latest user message before truncating the tail. Paired tool_use ↔ tool_result blocks are kept together so the upstream never sees a dangling tool reference.

M7 is the highest-priority content mutation under the composition cap. When M7 is eligible AND opted-in, M3 compress and M8 structured-output enforcement are both short-circuited for the same request — the multiplicative quality risk of stacking two prompt mutations on top of a context trim is what the cap exists to prevent.

Trigger thresholds

shouldPruneContext() decides eligibility:

  • messages.length > 12: long conversations are the primary M7 use case.
  • body_chars > 32_768: large embedded RAG blocks.

Either condition on its own makes the request eligible for M7 evaluation.

Below both thresholds, M7 stays out of the request entirely and short conversations pass through untouched. The thresholds are conservative on purpose: even a conversation just over the 12-message mark might be the wrong place to trim if the early turns carry essential context. The cron + canary loop catches regressions, and the per-stack auto-rollback retires the m7 entry from that workload if the mean score drops below the floor.
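The eligibility gate can be sketched in a few lines of TypeScript. Only the function name and the two thresholds come from the text above; the signature, the constant names, and the ChatMessage type are assumptions made for illustration.

```typescript
// Thresholds as stated in the text; the constant names are hypothetical.
const MAX_MESSAGES = 12;
const MAX_BODY_CHARS = 32_768;

// Simplified message shape, assumed for this sketch only.
interface ChatMessage {
  role: string;
  content: string;
}

// Either signal on its own makes the request eligible for M7 evaluation.
function shouldPruneContext(messages: ChatMessage[], bodyChars: number): boolean {
  return messages.length > MAX_MESSAGES || bodyChars > MAX_BODY_CHARS;
}
```

Both comparisons are strict, so a request with exactly 12 messages and exactly 32,768 body chars stays below the gate.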

Conversation trim — pruneConversation

Preserves: system message (always), the last 8 conversation turns (sponsor-tunable), and any tool_use / tool_result pair whose tool_use anchor is within the kept window. Drops: middle conversation turns that don't carry a tool reference.

The tool-pair invariant matters because most providers reject a message array carrying a tool_result block referencing a tool_use_id that doesn't appear earlier in the array. If M7 dropped a tool_use turn while keeping its tool_result follow-up, the upstream would 400 the request. The applier id-matches every tool_result block in the kept window against the kept tool_use blocks and either keeps both or drops both — never one without the other. This is documented as an invariant; the test suite covers the dangling-reference case.
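A rough sketch of the trim and the pair check, under stated assumptions: the Turn shape with toolUseIds / toolResultIds is a simplification invented for this example (it is not a provider wire format), keepLast stands in for the tunable window of 8, and system messages are assumed to sit at the start of the array. The sketch guards only the direction the paragraph above describes, a tool_result whose tool_use was dropped.

```typescript
// Hypothetical, simplified turn shape for illustration only.
interface Turn {
  role: "system" | "user" | "assistant" | "tool";
  text: string;
  toolUseIds?: string[];    // ids of tool_use blocks this turn emits
  toolResultIds?: string[]; // ids of the tool_use blocks this turn answers
}

function pruneConversation(turns: Turn[], keepLast = 8): Turn[] {
  // System messages are always preserved.
  const system = turns.filter((t) => t.role === "system");
  const rest = turns.filter((t) => t.role !== "system");

  // Keep the most recent turns; everything before the window is dropped.
  const windowed = rest.slice(Math.max(0, rest.length - keepLast));

  // Walk the window in order. A turn carrying a tool_result is kept only if
  // every id it references was emitted by a tool_use turn kept before it.
  const keptToolUses = new Set<string>();
  const kept: Turn[] = [];
  for (const turn of windowed) {
    const dangling = (turn.toolResultIds ?? []).some((id) => !keptToolUses.has(id));
    if (dangling) continue; // its tool_use fell outside the window: drop it too
    for (const id of turn.toolUseIds ?? []) keptToolUses.add(id);
    kept.push(turn);
  }
  return [...system, ...kept];
}
```

With a window that starts just after a tool_use turn, the matching tool_result turn is removed as well, so the forwarded array never references an id that is missing from it.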

RAG block rerank — pruneRagBlocks

Detects RAG-style context blocks (typically the last user message before the question, or a tagged <context> XML-fence block) and re-ranks the embedded chunks by cosine similarity to the actual user question. Drops chunks below the relevance threshold. The reranker is local to the worker; we do not call out to a secondary embedding service on this hot path.

The pruner is conservative: when the chunk structure is ambiguous (no clear delimiter, no embedded count signal), M7 skips the rerank rather than guess.
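To make the rerank step concrete, here is a minimal sketch that scores already-split chunks against the question and drops the low scorers. The text does not say how the in-worker similarity is computed, so the bag-of-words term counts used here are a stand-in chosen for illustration, and the 0.1 threshold, the helper names, and the pruneRagBlocks signature are all assumptions. Chunk detection and the skip-when-ambiguous path are left out.

```typescript
// Stand-in vectoriser: lowercase word counts. An assumption for this sketch,
// not a description of the worker's actual similarity computation.
function termCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse count vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, term) => {
    dot += x * (b.get(term) ?? 0);
    normA += x * x;
  });
  b.forEach((y) => {
    normB += y * y;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Score each chunk against the question, drop chunks under the threshold,
// and return the survivors ordered from most to least similar.
function pruneRagBlocks(chunks: string[], question: string, minScore = 0.1): string[] {
  const q = termCounts(question);
  return chunks
    .map((chunk) => ({ chunk, score: cosine(termCounts(chunk), q) }))
    .filter((c) => c.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .map((c) => c.chunk);
}
```

Everything here runs in-process, which matches the stated constraint of no secondary embedding call on the hot path.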

Anthropic body shape (wire-in 24 May 2026)

Anthropic /v1/messages uses messages[].content as an array of typed blocks (text / tool_use / tool_result / image). The pruner extracts text chars per block, preserves tool_use ↔ tool_result pairs via id-match (the same invariant as for the OpenAI shape), and skips image blocks, leaving them in place: they have no text chars to count, and dropping them would change the request's semantics.
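A sketch of per-block character counting over that shape. The four block types come from the text and the field names follow the public Anthropic Messages API, but how the pruner actually counts tool_use and tool_result blocks is not stated, so the rules below (serialised input for tool_use, nested text for tool_result) are assumptions, as are the function names.

```typescript
// Content block shapes, trimmed to the fields this sketch needs.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content?: string | ContentBlock[] }
  | { type: "image"; source: unknown };

interface AnthropicMessage {
  role: "user" | "assistant";
  content: string | ContentBlock[];
}

// Text chars contributed by one block. Image blocks count as zero and are
// never candidates for removal in this sketch.
function blockTextChars(block: ContentBlock): number {
  switch (block.type) {
    case "text":
      return block.text.length;
    case "tool_use":
      // Assumption: count the serialised tool input.
      return JSON.stringify(block.input ?? {}).length;
    case "tool_result":
      if (typeof block.content === "string") return block.content.length;
      return (block.content ?? []).reduce((n, b) => n + blockTextChars(b), 0);
    case "image":
      return 0;
  }
}

// Text chars for a whole message; content may be a plain string or blocks.
function messageTextChars(msg: AnthropicMessage): number {
  if (typeof msg.content === "string") return msg.content.length;
  return msg.content.reduce((n, b) => n + blockTextChars(b), 0);
}
```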

Composition cap priority

When M7 is eligible AND opted-in, two things follow:

  • M1 auto-route is forced OFF for the request (mutatingCandidatePresent locks the route detector). Cost reduction stacked on context mutation multiplies risk.
  • M3 compress and M8 structured output are short-circuited via contentMutationApplied = true. Single-pick budget: M7 wins under the priority order M7 > M3 > M8.

The cap is enforced inline in the worker, not via a separate policy engine. The worker unit tests in compose-budget.test.ts exercise the priority matrix and cover the precedence.
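The single-pick rule can be illustrated with a small planner. The priority order M7 > M3 > M8 comes from the text; the Candidate and CompositionPlan types, the function name, and the choice to lock auto-route for any picked mutation (the text states it only for M7) are assumptions of this sketch.

```typescript
type Mechanic = "M7" | "M3" | "M8";

// Priority order stated in the text: M7 > M3 > M8.
const PRIORITY: Mechanic[] = ["M7", "M3", "M8"];

// Hypothetical per-mechanic state for one request.
interface Candidate {
  eligible: boolean;
  optedIn: boolean;
}

interface CompositionPlan {
  pick: Mechanic | null;     // the single content mutation allowed to run
  autoRouteAllowed: boolean; // M1 is locked off once a mutation is picked
}

function planComposition(candidates: Partial<Record<Mechanic, Candidate>>): CompositionPlan {
  let pick: Mechanic | null = null;
  for (const mechanic of PRIORITY) {
    const c = candidates[mechanic];
    if (c && c.eligible && c.optedIn) {
      pick = mechanic; // first match in priority order wins
      break;           // lower-priority mutations are short-circuited
    }
  }
  return { pick, autoRouteAllowed: pick === null };
}
```

Under this sketch, a request where M7 and M3 are both eligible and opted-in runs M7 only, while a request where M7 is eligible but not opted-in falls through to M3.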

What we don’t do (yet)

  • No Google Gemini support. Gemini's body uses contents[].parts[] (a different shape again). Provider share is low enough that the M7 Gemini lift is gated on the first Gemini-primary paying customer with measurable context-prune ROI.
  • No model-based reranker on the hot path. v0.1 uses an in-worker similarity computation, not a separate embedding call. The latency budget for the hot path doesn't tolerate the +50–80 ms an embedding call would add.
  • No conversation summarisation. M7 drops middle turns; it does not replace them with a generated summary. Summarisation is a different mechanic in its own right (queued; not in v3.6 scope).
  • No RAG store side-effects. M7 reads the request and reranks in place; it does not call back to the customer's RAG store for re-retrieval. The chunks it sees are the chunks the customer's app already fetched.

Verification surfaces

  • Response header x-tessera-context-pruned: turns_removed=N,rag_kept=M surfaces the trim summary on every M7-applied request.
  • Request log line context_pruner_applied with turns_removed / rag_kept / rag_dropped / messages_before / messages_after counts.
  • /portal/audit chip strip — the m7 chip in canonical mechanics_stack (warning-amber colour for mutating-mechanic visibility).
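On the client side, the response header can be read back with a small parser. The header name and the turns_removed / rag_kept keys come from the list above; the function name, the PruneSummary type, and the choice to return null on anything unexpected are assumptions of this sketch.

```typescript
interface PruneSummary {
  turnsRemoved: number;
  ragKept: number;
}

// Parses a value such as "turns_removed=5,rag_kept=3".
// Returns null when the header is absent or either field is missing.
function parseContextPruned(header: string | null): PruneSummary | null {
  if (!header) return null;
  const fields: Record<string, number> = {};
  for (const part of header.split(",")) {
    const match = /^\s*([a-z_]+)=(\d+)\s*$/.exec(part);
    if (match) fields[match[1]] = Number(match[2]);
  }
  if (!("turns_removed" in fields) || !("rag_kept" in fields)) return null;
  return { turnsRemoved: fields.turns_removed, ragKept: fields.rag_kept };
}
```

With the fetch API this would typically be called as parseContextPruned(response.headers.get("x-tessera-context-pruned")); a null result then means M7 was not applied to that request or the header was not in the expected form.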

What we promise

M7 is opt-in per workload, eligibility-gated by message count and body length, composition-cap-protected from stacking with M3 or M8, surfaced on every audit row, and protected by per-stack canary breach detection. Tool-pair preservation is a hard invariant — we will not ship a request that references a dangling tool_use. When in doubt about chunk structure or pair integrity, the pruner skips rather than mutates.
