What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera. Before forwarding each request to the provider, Tessera applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace and structural pass (LLMLingua-2 template substitution is on the roadmap), and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.
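
As a rough illustration of the aggregation described above, here is a minimal sketch. The row shape and field names (RequestLogRow, inScope, and so on) are assumptions made for the example and are not Tessera's actual log schema.

```typescript
// Hypothetical shape of one proxy log row; field names are illustrative only.
interface RequestLogRow {
  workloadId: string;
  inScope: boolean;              // false for workloads excluded from reporting
  counterfactualCostUsd: number; // what the request would have cost unoptimized
  actualCostUsd: number;         // what was actually incurred
}

// Sum (counterfactual minus actual) over all in-scope rows in the period.
function aggregateSavingsUsd(rows: RequestLogRow[]): number {
  return rows
    .filter((row) => row.inScope)
    .reduce((sum, row) => sum + (row.counterfactualCostUsd - row.actualCostUsd), 0);
}

// Sample values chosen for the example.
const sampleRows: RequestLogRow[] = [
  { workloadId: "chat", inScope: true, counterfactualCostUsd: 4, actualCostUsd: 1 },
  { workloadId: "chat", inScope: true, counterfactualCostUsd: 2, actualCostUsd: 0 },
  { workloadId: "legacy", inScope: false, counterfactualCostUsd: 10, actualCostUsd: 5 },
];

console.log(aggregateSavingsUsd(sampleRows)); // prints 5
```

The out-of-scope row contributes nothing, which mirrors the rule that only optimizations attributable to Tessera count toward reported savings.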

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer: we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not per-seat SaaS, and savings are measured directly from our own proxy logs. Your existing observability tool can stay; it still receives downstream telemetry. Tessera is the layer above whatever vendors you already use, not an alternative to them.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.
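
The tier table above can be read as a simple lookup by monthly gross token volume. The sketch below assumes each cap is inclusive ("up to 1B" includes exactly 1B); the names Tier, TIERS, and tierFor are illustrative and not part of any Tessera API.

```typescript
interface Tier {
  name: string;
  monthlyUsd: number | null; // null means custom pricing
  tokenCap: number;          // gross tokens per month, assumed inclusive
}

// Prices and caps as listed above; Enterprise has no cap.
const ENTERPRISE: Tier = { name: "Enterprise", monthlyUsd: null, tokenCap: Infinity };
const TIERS: Tier[] = [
  { name: "Free Sandbox", monthlyUsd: 0, tokenCap: 60_000_000 },
  { name: "Starter", monthlyUsd: 199, tokenCap: 1_000_000_000 },
  { name: "Growth", monthlyUsd: 999, tokenCap: 5_000_000_000 },
  { name: "Scale", monthlyUsd: 3999, tokenCap: 20_000_000_000 },
  ENTERPRISE,
];

// Return the first (cheapest) tier whose cap covers the given volume.
function tierFor(grossTokensPerMonth: number): Tier {
  const match = TIERS.find((tier) => grossTokensPerMonth <= tier.tokenCap);
  return match ?? ENTERPRISE;
}

console.log(tierFor(2_500_000_000).name); // prints "Growth"
```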

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Two customer profiles account for most paid-tier signups: (1) the CTO of a Series A-B AI-native SaaS company spending $20k-$200k per month on LLM APIs, with gross margin under pressure, who can change a base URL in 30 minutes and signs up within 72 hours; (2) a Series B-D scale-up adding AI features and spending $50k-$500k per month, where the AI Platform Lead owns the budget and a light security review takes 2-4 weeks. A developer-facing Free Sandbox tier (60M tokens/mo, $0) serves solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.


Mechanic · M10 · Routing fork · 50% list discount

Batch arbitrage

OpenAI · Anthropic · Vertex queued

Async-tolerant workloads that don't need a real-time response route to the provider's Batch API instead of the real-time endpoint. OpenAI Batch and Anthropic Message Batches both ship 50% off list rate for the same model with a 24-hour completion window. The customer's SDK gets a 202 Accepted with a polling URL; the customer's application polls the upstream provider directly (with their own API key) until the batch completes.

M10 is unique among the cost mechanics. It's a routing fork, not a request mutation. The dispatch path diverges before the regular real-time forwarder; M10 sends to a completely different upstream endpoint with a different request shape. Settlement happens hours later via cron.

Eligibility + signal precedence

M10 fires only when ALL of these hold:

Provider is OpenAI or Anthropic (v0.1 scope; Google Vertex Batch GA late 2026 queued as F3).
Path is a batch-supported endpoint - /chat/completions or /embeddings on OpenAI; /messages on Anthropic.
Workload is not paused, request is non-streaming, body is parseable JSON.
Async signal present. Precedence, highest first: per-request body field tessera_async_tolerant: true, then per-request header x-tessera-async-tolerant: true, then the workload default batch_arbitrage_default.
Workload's batch_arbitrage_sla_hours accommodates the provider's 24-hour batch window. SLA tighter than the window → fall back to real-time.

We never auto-promote a real-time workload into batch. Latency contract is sponsor-set. Per-request override exists for one-off escapes (a real-time workload that has one async-tolerant job can opt in for that single request).
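
The eligibility conditions and signal precedence above can be sketched as a single predicate. This is an illustration under stated assumptions, not the worker's actual code: the type and function names are hypothetical, paths are matched by suffix, and an explicit false at a higher-precedence level is assumed to override a lower-level true (the text above only describes true values).

```typescript
type Provider = "openai" | "anthropic" | "google";

interface IncomingRequest {
  provider: Provider;
  path: string;
  streaming: boolean;
  rawBody: string;
  bodyAsyncTolerant?: boolean;   // body field tessera_async_tolerant
  headerAsyncTolerant?: boolean; // header x-tessera-async-tolerant
}

interface WorkloadConfig {
  paused: boolean;
  batchArbitrageDefault: boolean; // batch_arbitrage_default
  batchArbitrageSlaHours: number; // batch_arbitrage_sla_hours
}

const BATCH_WINDOW_HOURS = 24;
const BATCH_PATHS: Record<Provider, string[]> = {
  openai: ["/chat/completions", "/embeddings"],
  anthropic: ["/messages"],
  google: [], // not in v0.1 scope
};

function isParseableJson(raw: string): boolean {
  try {
    JSON.parse(raw);
    return true;
  } catch {
    return false;
  }
}

// Body field wins over header, which wins over the workload default.
function resolveAsyncSignal(req: IncomingRequest, workload: WorkloadConfig): boolean {
  if (req.bodyAsyncTolerant !== undefined) return req.bodyAsyncTolerant;
  if (req.headerAsyncTolerant !== undefined) return req.headerAsyncTolerant;
  return workload.batchArbitrageDefault;
}

// True only when every condition holds; otherwise the request stays real-time.
function shouldRouteToBatch(req: IncomingRequest, workload: WorkloadConfig): boolean {
  const supportedPath = BATCH_PATHS[req.provider].some((p) => req.path.endsWith(p));
  return (
    supportedPath &&
    !workload.paused &&
    !req.streaming &&
    isParseableJson(req.rawBody) &&
    resolveAsyncSignal(req, workload) &&
    workload.batchArbitrageSlaHours >= BATCH_WINDOW_HOURS
  );
}
```

A real-time workload (workload default false) that sets the body field on one request would pass this check for that request only, which matches the per-request override described above.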

Dispatch path. What the worker does

On a successful M10 routing decision, the worker:

Strips Tessera-internal tags (tessera_async_tolerant etc) from the body.
Calls dispatchOpenAiBatch or dispatchAnthropicBatch with the customer's upstream Authorization header. Provider returns a batch_id + file_id (OpenAI only) + estimated_complete_at.
Estimates expected savings via estimateBatchCost against the pricing_catalog: real-time cost × (1 − 0.5) = batch cost; difference = expected savings.
Records the dispatch in m10_batch_dispatches via the dashboard route /api/internal/m10/dispatch-record with the customer's settlement token (their upstream key) wrapped in Supabase Vault for at-rest encryption (migration 0079).
Returns 202 Accepted to the customer with tessera_batch_id, status: queued, polling_url pointing at the provider's native batch endpoint.
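
The tag-stripping, estimate, and 202 response steps above can be sketched as follows. Everything here is a hypothetical illustration: the function names and signatures are not the worker's real ones (the actual estimateBatchCost reads the pricing_catalog), the "tessera_" prefix rule for internal tags is an assumption, and the polling URL is a placeholder.

```typescript
const BATCH_DISCOUNT = 0.5; // 50% off list rate for batch

type JsonObject = Record<string, unknown>;

// Drop Tessera-internal fields before the body goes upstream.
// Assumption: internal tags all share the "tessera_" prefix.
function stripInternalTags(body: JsonObject): JsonObject {
  const cleaned: JsonObject = {};
  for (const key of Object.keys(body)) {
    if (!key.startsWith("tessera_")) {
      cleaned[key] = body[key];
    }
  }
  return cleaned;
}

interface BatchEstimate {
  realTimeCostUsd: number;
  batchCostUsd: number;
  expectedSavingsUsd: number;
}

// batch cost = real-time cost x (1 - 0.5); savings = the difference.
function sketchBatchEstimate(realTimeCostUsd: number): BatchEstimate {
  const batchCostUsd = realTimeCostUsd * (1 - BATCH_DISCOUNT);
  return {
    realTimeCostUsd,
    batchCostUsd,
    expectedSavingsUsd: realTimeCostUsd - batchCostUsd,
  };
}

interface AcceptedBody {
  tessera_batch_id: string;
  status: "queued";
  polling_url: string;
}

// Body returned alongside the 202 Accepted status.
function buildAcceptedBody(batchId: string, pollingUrl: string): AcceptedBody {
  return { tessera_batch_id: batchId, status: "queued", polling_url: pollingUrl };
}

const upstreamBody = stripInternalTags({ model: "example-model", tessera_async_tolerant: true });
const estimate = sketchBatchEstimate(10);
const accepted = buildAcceptedBody("batch_xyz", "https://provider.example/batches/batch_xyz");
console.log(upstreamBody, estimate, accepted);
```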

Settle path. Cost reconciliation

The hourly settle cron (/api/cron/m10-settle) walks m10_batch_dispatches rows in non-terminal status, polls the provider with the Vault-decrypted settlement token, and:

On completed. Downloads the output_file from the provider, parses the actual token counts per request, computes actual cost against the same pricing_catalog snapshot we billed the baseline against (audit-immutability invariant), and writes the final row to optimize_savings with auto_batched=true, mechanics_stack=['m10']. Dispatch-time estimates are overwritten with actuals.
On failed / expired. Marks the dispatch terminal, destroys the settlement token from Vault (m10_destroy_settlement_token), emits the failure to /portal/anomalies so the sponsor can re-dispatch via real-time if needed.
Confidence floor 0.50: if the pricing_catalog snapshot confidence at dispatch time was below 0.50, the cron leaves the row as estimate-not-actual (the savings claim isn't re-derivable to invariant-#8 standard).

The 0.50 floor prevents the pricing-catalog-confidence drift documented in lessons 2026-05-25. We'd rather show "estimate" on the audit row than book a savings number we can't defend.
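
A minimal sketch of the settle decision, assuming a simplified row shape: the field names and the settleDispatch function are hypothetical and leave out the provider polling, file download, and Vault handling described above.

```typescript
type DispatchStatus = "dispatched" | "completed" | "failed" | "expired";

interface DispatchRow {
  status: DispatchStatus;
  baselineCostUsd: number;     // real-time cost at the dispatch-time catalog snapshot
  estimatedSavingsUsd: number; // dispatch-time estimate
  catalogConfidence: number;   // pricing_catalog snapshot confidence at dispatch
}

interface SettledSavings {
  savingsUsd: number;
  basis: "actual" | "estimate";
}

const CONFIDENCE_FLOOR = 0.5;

// Returns null when nothing can be booked (still pending, failed, or expired).
function settleDispatch(row: DispatchRow, actualBatchCostUsd: number): SettledSavings | null {
  if (row.status !== "completed") {
    return null;
  }
  if (row.catalogConfidence < CONFIDENCE_FLOOR) {
    // Below the floor the row keeps its estimate rather than claiming an actual.
    return { savingsUsd: row.estimatedSavingsUsd, basis: "estimate" };
  }
  // Actuals overwrite the dispatch-time estimate.
  return { savingsUsd: row.baselineCostUsd - actualBatchCostUsd, basis: "actual" };
}

const settled = settleDispatch(
  { status: "completed", baselineCostUsd: 10, estimatedSavingsUsd: 5, catalogConfidence: 0.9 },
  4,
);
console.log(settled); // savingsUsd is 6, basis is "actual"
```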

What we don’t do (yet)

Google Vertex Batch. Vertex Batch GA late 2026 (F3 in the worker placeholder). Worker defaultGoogleFetcher returns google_vertex_batches_not_yet_implemented_F3 until Google announces GA.
No multi-request batches (yet). The worker dispatches single-request batches in v0.1. Multi-request batches would need per-row dispatch matching in the settle.ts output parser (which already handles the multi-line output_file format). Gated on customer traffic.
No auto-promotion. Sponsors explicitly opt in workloads to batch. We do not look at latency tolerances and auto-flag candidates.
No mid-batch cancellation. Once dispatched, the customer waits for the batch window. Cancellation paths land later if customer traffic asks for them.

Verification surfaces

Response headers on M10-routed requests: x-tessera-batch-routed: true, x-tessera-batch-id: batch_xyz, HTTP 202 status code (not 200). Semantic signal that this response is async.
m10_batch_dispatches table on the dashboard. Full lifecycle row per dispatch: dispatched / completed / failed / expired with the settlement deadline, the Vault uuid, the estimated savings, and (post-settle) the actual savings.
/portal/audit chip strip - m10 chip on the settled row. Chip appears only after the cron completes. There's no per-request mechanics_stack chip at dispatch time (the cost isn't known yet).
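
On the client side, the response-level signals above could be checked along these lines. The ProxyResponse type is a simplified stand-in for whatever HTTP client you use, and header names are assumed to arrive lower-cased; neither is part of a Tessera SDK.

```typescript
// Simplified stand-in for an HTTP response; not a real client type.
interface ProxyResponse {
  status: number;
  headers: Record<string, string>; // assumed lower-cased header names
}

// Returns the batch id when the response was M10-routed, otherwise null.
function readBatchRouting(res: ProxyResponse): string | null {
  if (res.status !== 202) {
    return null; // a 200 is a normal real-time response
  }
  if (res.headers["x-tessera-batch-routed"] !== "true") {
    return null;
  }
  const batchId = res.headers["x-tessera-batch-id"];
  return batchId ? batchId : null;
}

const routed = readBatchRouting({
  status: 202,
  headers: { "x-tessera-batch-routed": "true", "x-tessera-batch-id": "batch_xyz" },
});
console.log(routed); // prints "batch_xyz"
```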

What we promise

M10 dispatches only on an explicit opt-in signal (body, header, or workload default, with SLA hours that accommodate the 24-hour window). The settlement token (the customer's upstream key) is encrypted at rest via Supabase Vault throughout the dispatch window. Cost reconciliation uses the same pricing_catalog snapshot the baseline was billed against, the same audit-immutability invariant the rest of the proxy holds. If the batch fails, we'd rather emit an anomaly than book savings we can't deliver.

Parent index: How it works. Reliability companion: M11 cross-provider failover - shares the Vault wrapper pattern for at-rest token encryption.