What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider, but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via LLMLingua-2 where safe, and auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera enter the Performance Fee.
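
The aggregation described above can be sketched as follows. This is a minimal illustration, not our production schema: the row shape and every field name are assumptions, and costs are held as integer micro-dollars so the sum stays exact.

```typescript
// Hypothetical log-row shape; field names are illustrative assumptions.
interface ProxyLogRow {
  workloadId: string;
  inScope: boolean;               // workload is covered by the agreement
  optimized: boolean;             // request was routed, cached, compressed, or batched
  counterfactualMicroUsd: number; // what the request would have cost without the optimization
  actualMicroUsd: number;         // what the request actually cost
}

// Sum (counterfactual - actual) over optimized rows of in-scope workloads.
function aggregateSavingsMicroUsd(rows: ProxyLogRow[]): number {
  return rows
    .filter((r) => r.inScope && r.optimized)
    .reduce((sum, r) => sum + (r.counterfactualMicroUsd - r.actualMicroUsd), 0);
}

const sampleRows: ProxyLogRow[] = [
  { workloadId: "chat", inScope: true, optimized: true, counterfactualMicroUsd: 30_000, actualMicroUsd: 10_000 },
  { workloadId: "chat", inScope: true, optimized: false, counterfactualMicroUsd: 5_000, actualMicroUsd: 5_000 },
  { workloadId: "legacy", inScope: false, optimized: true, counterfactualMicroUsd: 9_000, actualMicroUsd: 1_000 },
];

const periodSavings = aggregateSavingsMicroUsd(sampleRows); // 20000 (only the first row counts)
```

Rows that were not optimized, or that belong to out-of-scope workloads, contribute nothing, which mirrors the rule that only optimizations attributable to Tessera enter the Performance Fee.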

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer: we sit in the request path and actively route, cache, compress, and batch on every call. We bill on measured savings from our own proxy logs, not on a SaaS seat. Your existing observability tool can stay; it still receives downstream telemetry. Tessera is the layer above whatever vendors you already use, not a replacement for them.

What does Tessera cost?

Two tiers. Free Sandbox: 60 million tokens per month, $0 fee, observe-only optimisations (savings measured but no fee accrued). Production: 20% of measured savings, billed against a prepaid balance (Stripe Checkout, $100 minimum top-up, or invoice on request). Zero savings = zero fee in that window. No floor, no retainer, no contract review for activation. If balance reaches zero, the proxy automatically pauses optimisations and forwards requests as passthrough — no fees accrue while paused. Top up to resume. Large-volume Production Clients (above $500,000 per month in measured savings) may negotiate a custom MSA addendum. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a 10% fee credit applied to your balance.
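
As a rough sketch of the Production arithmetic, assuming amounts are held in integer micro-dollars: the 20% rate and the pause-at-zero rule come from the text above, while the rounding direction, the treatment of negative savings as zero, and the clamp of the balance at zero are assumptions made for illustration only.

```typescript
const FEE_RATE_PERCENT = 20; // Production tier: 20% of measured savings

// Zero savings means zero fee. Treating negative savings as zero and
// rounding down are assumptions, not documented behaviour.
function performanceFeeMicroUsd(savingsMicroUsd: number): number {
  if (savingsMicroUsd <= 0) return 0;
  return Math.floor((savingsMicroUsd * FEE_RATE_PERCENT) / 100);
}

interface BalanceState {
  balanceMicroUsd: number;
  optimizationsPaused: boolean; // true once the prepaid balance reaches zero
}

// Deduct the fee from the prepaid balance; clamping at zero is an assumption.
function chargeFee(balanceMicroUsd: number, savingsMicroUsd: number): BalanceState {
  const fee = performanceFeeMicroUsd(savingsMicroUsd);
  const remaining = Math.max(0, balanceMicroUsd - fee);
  return { balanceMicroUsd: remaining, optimizationsPaused: remaining === 0 };
}

// $50 of measured savings against a $100 balance: $10 fee, $90 left, still active.
const afterCharge = chargeFee(100_000_000, 50_000_000);
```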

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).
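
To illustrate how small the change is, here is a sketch of a client configuration before and after. The proxy URL and both header names are placeholders invented for this example (take the real values from your dashboard), and the option names `baseURL` and `defaultHeaders` are assumed to match what your provider SDK accepts.

```typescript
interface ProviderClientConfig {
  baseURL: string;
  defaultHeaders: Record<string, string>;
}

// Placeholder values, not real Tessera endpoints or header names.
const TESSERA_PROXY_BASE_URL = "https://tessera-proxy.example.invalid/v1"; // hypothetical
const HEADER_API_KEY = "x-tessera-api-key";   // hypothetical header name
const HEADER_WORKLOAD = "x-tessera-workload"; // hypothetical header name

// Returns a copy of the direct-to-provider config that goes through the proxy.
// The original config is left untouched, so reverting is just using it again.
function withTessera(
  direct: ProviderClientConfig,
  tesseraApiKey: string,
  workload: string,
): ProviderClientConfig {
  return {
    baseURL: TESSERA_PROXY_BASE_URL,
    defaultHeaders: {
      ...direct.defaultHeaders,
      [HEADER_API_KEY]: tesseraApiKey,
      [HEADER_WORKLOAD]: workload,
    },
  };
}

const direct: ProviderClientConfig = { baseURL: "https://api.openai.com/v1", defaultHeaders: {} };
const proxied = withTessera(direct, "tsr_example_key", "support-chat");
```

Nothing else in the application changes: request bodies, model names, and response handling stay as they are.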

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch, both account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. The Performance Fee does not accrue on paused traffic. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service, so Tessera never operates in your stack outside your control.

Who is Tessera for?

Two primary shapes carry most Production signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. There is also a developer-facing Free Sandbox tier (60M tokens per month, $0 fee) for solo developers and side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route: the compliance gate blocks routing at the code level. Above $500k per month in measured savings, Production Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.


Mechanic · M8 · Content-mutating · Priority 3

Structured output

OpenAI · Anthropic · Gemini · v0.1

Workloads that ask the LLM for JSON output often get JSON wrapped in prose ("Here is the JSON: { ... } Hope this helps!"), burning output tokens on apologies and natural-language framing that the application then strips and discards. M8 injects the provider-native structured-output constraint (response_format on OpenAI-shape, prefill prefix on Anthropic-shape, responseSchema in generationConfig on Gemini) so the model emits only the schema-conformant payload. Typical output-token cut on JSON workloads: 30–70%.

M8 is content-mutating under the composition cap. Priority M7 > M3 > M8: if M7 context-prune or M3 compress already fired on this request, M8 short-circuits. Single-pick budget — one mutation per request, never compounded.

Schema-mode vs auto-mode

When workloads.expected_schema is set (a JSON-Schema document validated against the lightweight 2020-12 sanity check at write time), M8 injects strict-mode constraints:

OpenAI: response_format: { type: 'json_schema', json_schema: { name, schema, strict: true } }. OpenAI rejects model outputs that violate the schema with a 422; Tessera surfaces a Tier 1 anomaly when this fires.
Anthropic: prefill prefix injection — we add a pre-completion assistant message starting with { so the model continues from the open brace. Combined with the tool-use schema definition when the workload has one.
Gemini: generationConfig.responseMimeType: 'application/json' + responseSchema (Gemini accepts a subset of JSON-Schema; the applier translates).

When no schema is registered (workloads.expected_schema IS NULL but the toggle is ON), M8 runs in auto-mode:

OpenAI: response_format: { type: 'json_object' } — model emits JSON shape but no schema constraint.
Anthropic: prefill-only (no schema).
Gemini: responseMimeType: 'application/json' alone.
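
The two modes can be sketched per provider as below. The function names are hypothetical and are not the real appliers; the Gemini branch passes the schema through unchanged, whereas the actual applier translates it to the subset Gemini accepts; and the Anthropic sketch covers only the prefill, not the tool-use combination.

```typescript
type JsonSchema = Record<string, unknown>;

// OpenAI-shape: strict json_schema when a schema is registered, json_object otherwise.
function openAIFields(name: string, schema: JsonSchema | null): Record<string, unknown> {
  return schema
    ? { response_format: { type: "json_schema", json_schema: { name, schema, strict: true } } }
    : { response_format: { type: "json_object" } };
}

// Gemini: the MIME type is always set; responseSchema only in schema-mode.
function geminiFields(schema: JsonSchema | null): { generationConfig: Record<string, unknown> } {
  const generationConfig: Record<string, unknown> = { responseMimeType: "application/json" };
  if (schema) generationConfig["responseSchema"] = schema; // no subset translation in this sketch
  return { generationConfig };
}

// Anthropic-shape: append an assistant turn that opens the JSON object.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}
function anthropicPrefill(messages: ChatMessage[]): ChatMessage[] {
  return [...messages, { role: "assistant", content: "{" }];
}

const orderSchema: JsonSchema = { type: "object", properties: { id: { type: "string" } } };
const strictOpenAI = openAIFields("order", orderSchema); // schema-mode
const autoOpenAI = openAIFields("order", null);          // auto-mode
```

With a prefill, the model's completion continues after the opening brace, so whoever parses the response has to put the `{` back in front of the returned text.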

The caller-already-set bypass

If the caller already set response_format (OpenAI) or an equivalent provider field, M8 does not overwrite. Sponsors may have intentional non-JSON modes for some endpoints; we never override an explicit caller choice. The applier checks the field presence and exits without mutation when set.
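
A minimal sketch of that presence check, assuming the request body is a plain object. The helper name is hypothetical, and the real appliers also cover the equivalent fields on other providers.

```typescript
type RequestBody = Record<string, unknown>;

// Inject `field` only when the caller has not set it. Presence alone decides:
// the caller's value is never inspected or overwritten.
function injectUnlessCallerSet(
  body: RequestBody,
  field: string,
  value: unknown,
): { body: RequestBody; mutated: boolean } {
  if (Object.prototype.hasOwnProperty.call(body, field)) {
    return { body, mutated: false }; // explicit caller choice wins
  }
  return { body: { ...body, [field]: value }, mutated: true };
}

const callerChoseText: RequestBody = { model: "example-model", response_format: { type: "text" } };
const untouched = injectUnlessCallerSet(callerChoseText, "response_format", { type: "json_object" });
// untouched.mutated is false and untouched.body is the caller's original object
```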

Schema editor (sponsor self-serve)

The /portal/settings schema editor accepts a JSON document, validates against the lightweight 2020-12 sanity check (validateSchemaDocument), and writes to workloads.expected_schema. Empty {} is rejected to disambiguate "clear" (NULL via the form's dedicated clear button) from "save nothing" (the editor body left blank).

Server-side validation prevents the sponsor from bricking every JSON-mode call by pasting a malformed schema. KV refresh fires after every write so the next request sees the new schema without waiting for the 5-minute periodic sync.
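
The kind of checks involved can be sketched as below. This is not the real validateSchemaDocument: the function name, the result shape, and the reason strings are assumptions, and the sketch only tests JSON well-formedness, object shape, and the empty-document rule.

```typescript
type SchemaCheck =
  | { ok: true; schema: Record<string, unknown> }
  | { ok: false; reason: string };

// Hypothetical stand-in for the server-side sanity check.
function checkSchemaDocument(text: string): SchemaCheck {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return { ok: false, reason: "malformed JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "schema must be a JSON object" };
  }
  const schema = parsed as Record<string, unknown>;
  if (Object.keys(schema).length === 0) {
    // An empty {} is ambiguous, so it is rejected; clearing uses the dedicated button.
    return { ok: false, reason: "empty schema rejected" };
  }
  return { ok: true, schema };
}

const accepted = checkSchemaDocument('{"type":"object","properties":{"id":{"type":"string"}}}');
const rejectedEmpty = checkSchemaDocument("{}");
const rejectedBroken = checkSchemaDocument('{"type": "object"');
```

Only a document that passes would be written to workloads.expected_schema; a rejected paste leaves the stored schema as it was.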

Composition cap priority

M8 is third-priority among content mutators:

M7 context-prune wins. If M7 trimmed messages on this request, contentMutationApplied = true and M8 skips.
M3 compress wins. If M3 ran on this request, M8 skips.
M1 auto-route is also disabled when M8 is eligible to fire (stacking a cost optimization on a content mutation multiplies risk).

The cap is enforced inline in the worker, not via a separate policy engine. Unit tests in compose-budget.test.ts lock the priority order.
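
The single-pick rule among content mutators can be sketched as a priority walk that stops at the first mutator that fires. The types and function below are illustrative assumptions and are not the worker code; only the M7 > M3 > M8 order and the contentMutationApplied flag come from the text, and the M1 interaction is left out.

```typescript
type Mutator = "M7" | "M3" | "M8";

// Priority order among content mutators: M7 context-prune, M3 compress, M8 structured output.
const PRIORITY: readonly Mutator[] = ["M7", "M3", "M8"];

interface MutationState {
  contentMutationApplied: boolean;
  applied: Mutator | null;
}

// Walk the mutators in priority order; once one fires, the rest short-circuit.
function applyCompositionCap(eligible: ReadonlySet<Mutator>): MutationState {
  const state: MutationState = { contentMutationApplied: false, applied: null };
  for (const mutator of PRIORITY) {
    if (state.contentMutationApplied) break; // single-pick budget already spent
    if (eligible.has(mutator)) {
      state.applied = mutator;
      state.contentMutationApplied = true;
    }
  }
  return state;
}

const compressWins = applyCompositionCap(new Set<Mutator>(["M8", "M3"])); // applied: "M3"
const m8Alone = applyCompositionCap(new Set<Mutator>(["M8"]));            // applied: "M8"
```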

What we don’t do (yet)

No streaming structured-output. v0.1 injects on non-streaming JSON requests only. Streaming JSON-mode requests pass through (the provider handles them natively if the SDK sent the field; M8 doesn't inject).
No nested schema synthesis. The schema editor accepts whatever the sponsor pastes. We do not derive a schema from sample outputs; that's a separate workflow we haven't scoped.
No on-the-fly schema relaxation. If the model 422s repeatedly against the strict schema, we surface the anomaly to the sponsor — we do not auto-relax to auto-mode. The sponsor decides whether to fix the schema or drop strict.
No XML / YAML / other structured modes. v0.1 is JSON-only. Other structured outputs are queued behind customer-traffic signal.

Verification surfaces

Response header x-tessera-structured-output: json_schema (strict-mode applied) or json_object (auto-mode).
/portal/audit chip strip — m8 chip (warning-amber colour for mutating mechanic). structured_output_applied = true on the row.
Anomaly row m8_strict_schema_rejection when the provider 422s against a strict schema — surfaces the non-conformant fields on /portal/anomalies.
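
On the caller side, the response header can be read with a check like the one below. The helper name is hypothetical, the headers are modelled as a plain record, and treating a missing or unrecognised header value as "not applied" is an assumption of this sketch.

```typescript
type StructuredOutputMode = "json_schema" | "json_object" | "not_applied";

// Look up x-tessera-structured-output case-insensitively and map it to a mode.
function structuredOutputMode(headers: Record<string, string>): StructuredOutputMode {
  const key = Object.keys(headers).find(
    (k) => k.toLowerCase() === "x-tessera-structured-output",
  );
  const value = key === undefined ? undefined : headers[key];
  if (value === "json_schema") return "json_schema"; // strict-mode applied
  if (value === "json_object") return "json_object"; // auto-mode applied
  return "not_applied"; // assumption: header absent or unrecognised
}

const strictMode = structuredOutputMode({ "X-Tessera-Structured-Output": "json_schema" });
const noHeader = structuredOutputMode({ "content-type": "application/json" });
```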

What we promise

M8 is opt-in per workload, schema-validated server-side before persistence, never overrides an explicit caller response_format, composition-cap-mutex with M3 and M7, and surfaced on every applied row. Provider-specific shape translation lives in dedicated appliers (applyOpenAIStructuredOutput, applyAnthropicStructuredOutput, applyGoogleStructuredOutput), with one source of truth per provider.
