What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that sits in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera. Before forwarding each request to the provider, Tessera applies four moves: it auto-routes to a cheaper model when quality holds on your golden-set eval, auto-caches identical requests at the edge, auto-compresses prompts via a deterministic whitespace and structural pass (LLMLingua-2 template substitution is on the roadmap), and auto-batches where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
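
The aggregation described above can be sketched in a few lines. This is a minimal illustration, not Tessera's actual schema: the SavingsLogRow shape, its field names, and the use of integer micro-dollars are all assumptions made for the example.

```typescript
// Hypothetical log-row shape; field names are assumptions, not Tessera's schema.
interface SavingsLogRow {
  workload: string;
  inScope: boolean;               // false for workloads excluded from reporting
  counterfactualMicroUsd: number; // cost the request would have had unoptimized
  actualMicroUsd: number;         // cost actually incurred
}

// Sum (counterfactual - actual) over in-scope rows only.
// Integer micro-dollars avoid floating-point drift across many small amounts.
function aggregateSavingsMicroUsd(rows: SavingsLogRow[]): number {
  let total = 0;
  for (const row of rows) {
    if (row.inScope) {
      total += row.counterfactualMicroUsd - row.actualMicroUsd;
    }
  }
  return total;
}
```

A cached request would typically log an actual cost of zero, so its full counterfactual cost counts as savings; a request that passed through unoptimized contributes nothing.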

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer: we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not a per-seat SaaS, and savings are measured directly from our own proxy logs. Your existing observability tool can stay; it still receives downstream telemetry. Tessera is a layer above whatever vendors you already use, not a replacement for them.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.
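
As a rough illustration of how the tier caps above map a monthly token volume to a plan, here is a small lookup sketch. The PricingTier type and tierFor function are hypothetical names, and treating each cap as inclusive (exactly 20B lands in Scale) is an assumption; the prices and caps are the ones listed above.

```typescript
// Hypothetical lookup table built from the published tiers.
interface PricingTier {
  name: string;
  monthlyUsd: number | null; // null means custom pricing
  tokenCap: number;          // gross tokens per month, treated as inclusive (assumption)
}

const TIERS: PricingTier[] = [
  { name: "Free Sandbox", monthlyUsd: 0, tokenCap: 60_000_000 },
  { name: "Starter", monthlyUsd: 199, tokenCap: 1_000_000_000 },
  { name: "Growth", monthlyUsd: 999, tokenCap: 5_000_000_000 },
  { name: "Scale", monthlyUsd: 3999, tokenCap: 20_000_000_000 },
];

const ENTERPRISE: PricingTier = {
  name: "Enterprise",
  monthlyUsd: null,
  tokenCap: Number.POSITIVE_INFINITY,
};

// Return the smallest tier whose cap covers the monthly volume.
function tierFor(grossTokensPerMonth: number): PricingTier {
  for (const tier of TIERS) {
    if (grossTokensPerMonth <= tier.tokenCap) return tier;
  }
  return ENTERPRISE;
}
```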

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Who is Tessera for?

Two buyer profiles account for most paid-tier signups: (1) the CTO of a Series A-B AI-native SaaS company spending $20k-$200k per month on LLM APIs, with gross margin under pressure, who can change a base URL in 30 minutes and signs up within 72 hours; (2) the AI Platform Lead at a Series B-D scale-up adding AI features, spending $50k-$500k per month, who owns the budget and typically clears a light security review in 2-4 weeks. A developer-facing Free Sandbox tier (60M tokens/mo, $0) serves solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route; the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.


Mechanic · M7 · Content-mutating · Priority 1

Context pruning

OpenAI v0.1 · Anthropic wire-in 24 May 2026

M7 trims conversation turns and reranks RAG-style context blocks before forwarding the request upstream. The signal: messages count > 12 OR body length > 32,768 chars. The action: drop the middle of long conversation histories (preserving the system prompt + the most-recent N turns), and re-rank embedded RAG chunks by similarity to the latest user message before truncating the tail. Tool_use ↔ tool_result block pairs are kept together so the upstream never sees a dangling tool reference.

M7 is the highest-priority content mutation under the composition cap. When M7 is eligible AND opted-in, M3 compress and M8 structured-output enforcement are both short-circuited for the same request. The multiplicative quality risk of stacking two prompt mutations on top of a context trim is what the cap exists to prevent.

Trigger thresholds

shouldPruneContext() decides eligibility:

messages.length > 12. Long conversations are the primary M7 use case.
body_chars > 32_768. Large embedded RAG blocks. Exceeding either threshold enables M7 evaluation.

Below both thresholds, M7 stays out of the request entirely. Short conversations pass through untouched. The thresholds are conservative on purpose. Even a 12-turn conversation might be the wrong place to trim if the early turns carry essential context. The cron + canary loop catches regressions and the per-stack auto-rollback retires the m7 entry from that workload if mean score drops below floor.
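
A minimal sketch of the eligibility check, assuming the two thresholds are combined with a plain OR as described above. The function name comes from this page, but the signature, the PruneCandidate type, and measuring body_chars as the length of the raw request body string are assumptions for illustration.

```typescript
// Thresholds as stated on this page.
const MAX_MESSAGES = 12;
const MAX_BODY_CHARS = 32_768;

// Hypothetical input shape: the parsed message array plus the raw body text.
interface PruneCandidate {
  messages: unknown[];
  rawBody: string;
}

// Eligible when EITHER threshold is exceeded. Opt-in is checked separately.
function shouldPruneContext(candidate: PruneCandidate): boolean {
  const tooManyMessages = candidate.messages.length > MAX_MESSAGES;
  const bodyTooLarge = candidate.rawBody.length > MAX_BODY_CHARS;
  return tooManyMessages || bodyTooLarge;
}
```

Both comparisons are strict, so a request with exactly 12 messages and exactly 32,768 body chars passes through untouched.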

Conversation trim. PruneConversation

Preserves: system message (always), the last 8 conversation turns (sponsor-tunable), and any tool_use / tool_result pair whose tool_use anchor is within the kept window. Drops: middle conversation turns that don't carry a tool reference.

The tool-pair invariant matters because most providers reject a message array carrying a tool_result block referencing a tool_use_id that doesn't appear earlier in the array. If M7 dropped a tool_use turn while keeping its tool_result follow-up, the upstream would 400 the request. The applier id-matches every tool_result block in the kept window against the kept tool_use blocks and either keeps both or drops both. Never one without the other. This is documented as an invariant; the test suite covers the dangling-reference case.
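
The trim and the tool-pair invariant can be sketched as follows. This is a simplified illustration, not the shipped applier: the Msg shape (with toolUseIds and toolResultFor fields) is a hypothetical flattening of provider-specific tool blocks, and counting one message as one turn is an assumption.

```typescript
// Simplified, hypothetical message shape for illustration only.
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  toolUseIds?: string[];  // ids of tool calls issued by this message
  toolResultFor?: string; // id of the tool call this message answers
}

function pruneConversation(messages: Msg[], keepLast = 8): Msg[] {
  // The kept window is the last `keepLast` non-system messages.
  const nonSystem = messages.filter((m) => m.role !== "system");
  const tail = new Set<Msg>(nonSystem.slice(Math.max(0, nonSystem.length - keepLast)));

  // Collect tool_use ids that survive inside the kept window.
  const keptToolUseIds = new Set<string>();
  for (const m of messages) {
    if (!tail.has(m)) continue;
    for (const id of m.toolUseIds ?? []) keptToolUseIds.add(id);
  }

  // Keep system messages always; inside the window, drop any tool result
  // whose originating tool_use fell outside it (never one without the other).
  return messages.filter((m) => {
    if (m.role === "system") return true;
    if (!tail.has(m)) return false;
    return m.toolResultFor === undefined || keptToolUseIds.has(m.toolResultFor);
  });
}
```

Because a tool result always follows its tool_use, the only dangling case in a tail window is a result at the window's leading edge whose tool_use was cut. The sketch drops that result, so the pair is removed together.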

RAG block rerank. PruneRagBlocks

Detects RAG-style context blocks (typically the last user message before the question, or a tagged <context> XML-fence block) and re-ranks the embedded chunks by cosine similarity to the actual user question. Drops chunks below the relevance threshold. The reranker is local to the worker; we do not call out to a secondary embedding service on this hot path.

The pruner is conservative: when the chunk structure is ambiguous (no clear delimiter, no embedded count signal), M7 skips the rerank rather than guess.
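
To make the rerank step concrete, here is a minimal sketch that scores already-split chunks against the question and drops the low scorers. The similarity used here (cosine over bag-of-words term counts) and the 0.1 threshold are assumptions chosen for illustration; this page states only that the computation is cosine similarity performed in-worker. Chunk detection and the ambiguity skip are out of scope for the sketch.

```typescript
// Lowercased alphanumeric term counts for one piece of text.
function termCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(tok, (counts.get(tok) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse term-count vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((v, k) => {
    dot += v * (b.get(k) ?? 0);
    normA += v * v;
  });
  b.forEach((v) => {
    normB += v * v;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rank chunks by similarity to the question; drop those below minScore.
// minScore = 0.1 is a hypothetical default, not a documented value.
function pruneRagBlocks(chunks: string[], question: string, minScore = 0.1): string[] {
  const q = termCounts(question);
  return chunks
    .map((text) => ({ text, score: cosine(termCounts(text), q) }))
    .filter((c) => c.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .map((c) => c.text);
}
```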

Anthropic body shape (wire-in 24 May 2026)

Anthropic /v1/messages uses messages[].content as an array of typed blocks (text / tool_use / tool_result / image). The pruner extracts text chars per block, preserves tool_use ↔ tool_result pairs via id-match (same invariant as the OpenAI shape), and leaves image blocks in place: they have no text chars to count, and dropping them would change the request's semantics.
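
A sketch of per-block char counting for the Anthropic shape follows. The ContentBlock union is a simplified subset of the real block types (tool_result content is reduced to a plain string), and counting a tool_use block by the length of its serialized input is an assumption; the page states only that text chars are extracted per block and that image blocks are not counted.

```typescript
// Simplified subset of Anthropic content blocks, for illustration only.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> }
  | { type: "tool_result"; tool_use_id: string; content: string }
  | { type: "image"; source: Record<string, unknown> };

// Count text chars in one message's content.
// Anthropic accepts either a plain string or an array of typed blocks.
function textChars(content: string | ContentBlock[]): number {
  if (typeof content === "string") return content.length;
  let total = 0;
  for (const block of content) {
    switch (block.type) {
      case "text":
        total += block.text.length;
        break;
      case "tool_use":
        // Assumption: measure the serialized tool input.
        total += JSON.stringify(block.input).length;
        break;
      case "tool_result":
        total += block.content.length;
        break;
      case "image":
        // No text chars to count; the block itself is left in place.
        break;
    }
  }
  return total;
}
```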

Composition cap priority

When M7 is eligible AND opted-in, two things follow:

M1 auto-route is forced OFF for the request (mutatingCandidatePresent locks the route detector). Cost reduction stacked on context mutation multiplies risk.
M3 compress and M8 structured output are short-circuited via contentMutationApplied = true. Single-pick budget: M7 wins under the priority order M7 > M3 > M8.

The cap is enforced inline in the worker, not via a separate policy engine. Tests cover the precedence (worker unit tests named compose-budget.test.ts exercise the priority matrix).
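
The single-pick budget can be sketched as a small pure function. The composePlan name and the CompositionPlan shape are hypothetical. The priority order M7 > M3 > M8 is the one stated above. Locking auto-route whenever any mutating candidate is present is an assumption inferred from the mutatingCandidatePresent flag name; this page confirms the lock only for M7.

```typescript
type MutatingMechanic = "M7" | "M3" | "M8";

// Priority order as stated on this page: M7 > M3 > M8.
const PRIORITY: MutatingMechanic[] = ["M7", "M3", "M8"];

// Hypothetical result shape for illustration.
interface CompositionPlan {
  contentMutation: MutatingMechanic | null; // at most one mutation per request
  autoRouteAllowed: boolean;                // M1 is locked when a mutation is picked
}

// `candidates` are the mechanics that are both eligible and opted-in.
function composePlan(candidates: MutatingMechanic[]): CompositionPlan {
  const winner = PRIORITY.find((m) => candidates.indexOf(m) !== -1) ?? null;
  return {
    contentMutation: winner,
    // Assumption: any mutating candidate locks the route detector.
    autoRouteAllowed: winner === null,
  };
}
```

For example, a request eligible for all three mechanics gets M7 only, with M3, M8, and M1 all skipped for that request.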

What we don’t do (yet)

Google Gemini support. Gemini's body uses contents[].parts[] (different shape again). Provider share is low enough that the M7 Gemini lift is gated on the first Gemini-primary paying customer with measurable context-prune ROI.
No model-based reranker on the hot path. v0.1 uses an in-worker similarity computation, not a separate embedding call. The latency budget for the hot path doesn't tolerate the +50–80 ms an embedding call would add.
No conversation summarisation. M7 drops middle turns; it does not replace them with a generated summary. Summarisation is a different mechanic in its own right (queued; not in v3.6 scope).
No RAG store side-effects. M7 reads the request and reranks in place. It does not call back to the customer's RAG store for re-retrieval. The chunks it sees are the chunks the customer's app already fetched.

Verification surfaces

Response header x-tessera-context-pruned: turns_removed=N,rag_kept=M surfaces the trim summary on every M7-applied request.
Request log line context_pruner_applied with turns_removed / rag_kept / rag_dropped / messages_before / messages_after counts.
/portal/audit chip strip. The m7 chip in canonical mechanics_stack (warning-amber colour for mutating-mechanic visibility).
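
On the client side, the trim summary in the response header can be read with a small parser. This sketch assumes the value is always a comma-separated list of key=integer pairs, as in the example format above; the parseContextPruned name is hypothetical, and malformed pairs are skipped rather than treated as errors.

```typescript
// Parse a value like "turns_removed=5,rag_kept=3" into numeric fields.
function parseContextPruned(headerValue: string): Record<string, number> {
  const out: Record<string, number> = {};
  for (const pair of headerValue.split(",")) {
    const [key, raw] = pair.split("=");
    // Skip pairs with a missing key or an empty value.
    if (key === undefined || raw === undefined || raw.trim() === "") continue;
    const n = Number(raw);
    if (Number.isInteger(n)) out[key.trim()] = n;
  }
  return out;
}
```

With the standard fetch API, the value would typically come from response.headers.get("x-tessera-context-pruned"), which returns null when M7 was not applied to the request.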

What we promise

M7 is opt-in per workload, eligibility-gated by message count and body length, composition-cap-protected from stacking with M3 or M8, surfaced on every audit row, and protected by per-stack canary breach detection. Tool-pair preservation is a hard invariant. We will not ship a request that references a dangling tool_use. When in doubt about chunk structure or pair integrity, the pruner skips rather than mutates.
