What is the Tessera Optimize Layer?

The Tessera Optimize Layer is a thin proxy that lives in your LLM request path. Your application points its OpenAI, Anthropic, or Google client at Tessera; Tessera forwards each request to the provider but first applies four moves: auto-route to a cheaper model when quality holds on your golden-set eval, auto-cache identical requests at the edge, auto-compress prompts via a deterministic whitespace + structural pass (LLMLingua-2 template substitution on roadmap), auto-batch where batch APIs apply. Every saved dollar is measured directly from proxy logs, not inferred from a billing CSV.

How does Tessera measure savings?

Savings are measured from Tessera proxy logs at request granularity. For each request that gets routed, cached, compressed, or batched, Tessera records the counterfactual provider cost (what the request would have cost without the optimization) and the actual incurred cost. Aggregate Ongoing Savings equal the sum of (counterfactual minus actual) across all in-scope workloads in the period. Provider price moves and unrelated workload shrinkage are excluded; only optimizations attributable to Tessera count toward your reported savings.

How is Tessera different from an observability platform?

Observability platforms trace requests and show dashboards. Tessera is an optimize layer — we sit in the request path and actively route, cache, compress, and batch on every call. Pricing is a flat monthly subscription by token volume, not a per-seat SaaS — and savings are measured directly from our own proxy logs. Your existing observability tool can stay — it still receives downstream telemetry. Substrate-include posture: Tessera is the layer above whatever vendors you already use, not their alternative.

What does Tessera cost?

Flat monthly pricing by gross tokens submitted, same feature set on every tier — only the token cap differs. Free Sandbox: 60 million tokens per month, $0, full mechanic stack active. Paid tiers: Starter $199/month (up to 1B tokens), Growth $999/month (up to 5B), Scale $3,999/month (up to 20B), Enterprise custom (20B+). No per-token markup and no cut of your savings — savings are measured and you keep 100%. No floor, no retainer, no contract review for activation. Quality preservation guaranteed at 0.95 by canary; three-day breach triggers auto-disable plus a service credit.

Does Tessera modify our production code?

No. Tessera is added to your request path via two HTTP headers and one config-line change — you point your OpenAI/Anthropic/Google client at the Tessera proxy base URL and add an API key. Tessera never modifies your application source, never deploys into your codebase, and never makes provider-side changes on your account. All optimization logic runs inside the Tessera proxy infrastructure. You can disable Tessera at any time by reverting the two headers, or by using the in-dashboard pause control (see next question).

Can I pause Tessera at any time?

Yes. Every operator dashboard ships with an always-available kill-switch — account-wide and per-workload. When engaged, the Tessera proxy bypasses all four optimizations (route, cache, compress, batch) and forwards your requests to the upstream provider as pure passthrough. Pause is reversible at any time without notice. The pause right is contractually preserved in §8 of the Tessera Terms of Service — Tessera does not work uncontrolled in your stack.

Two primary shapes carry most paid-tier signups: (1) Series A-B AI-native SaaS CTO, $20k-$200k per month on LLM APIs, gross margin under pressure, can change a base-URL in 30 minutes and signs up in 72 hours; (2) Series B-D scale-up adding AI features, $50k-$500k per month, AI Platform Lead owns the budget, light security review in 2-4 weeks. Plus a developer-facing Free Sandbox tier (60M tokens/mo, $0) for solo developers + side projects. Workloads tagged regulated (HIPAA, PCI-DSS, SOC 2 in-scope) never auto-route — the compliance gate blocks routing at the code level. Above 20B gross tokens/mo, Enterprise Clients may negotiate a custom MSA addendum (invoice billing, custom SLO).

How does Tessera handle data privacy?

Confidential Information furnished to Tessera is stored primarily inside the European Economic Area (EEA) under Estonian operating jurisdiction. We do not require, request, or process Client end-user personal data, model prompts, or model completions in their full content form — only token-count and structural metadata. US-based AI provider sub-processors (Anthropic, OpenAI, Google) are engaged solely for Tessera-internal analysis under strict anonymisation conditions per the Data Processing Agreement.

Does Tessera accept referral fees from AI providers?

No. Tessera receives no affiliate revenue, referral fees, kickbacks, sponsorships, advisory-board honoraria, or any other compensation from any AI provider, gateway vendor, or observability platform we recommend in the course of operating the proxy. Client fees are our only income. This is contractually binding under §10 of the Tessera Terms of Service (Vendor Neutrality) and breach permits Clients to terminate without penalty and withdraw their balance.

tesseraIntegrations

Integrations · LlamaIndex ↗

Tessera × LlamaIndex

tessera-llamaindex is a thin adapter that routes every LlamaIndex LLM call through the Tessera substrate proxy. Spread tessera_openai_config() (or the per-provider sibling) into the llama_index.llms.openai.OpenAI constructor and every .complete() / .chat() / .stream_complete() call lands on the proxy.

RAG pipelines benefit most. The provider prompt cache (M6) and exact cache (M2) fire on stable retrieval prefixes — typical RAG workloads see 40-70% savings on top-of-funnel input tokens just from those two mechanics. Multi-step query engines (sub-question, router, ReAct agents) compound caching benefits across the reasoning chain.

Install

pip install tessera-llamaindex
# Plus whichever LlamaIndex provider package you use:
pip install llama-index-llms-openai          # or llama-index-llms-anthropic / -mistralai / -groq / -cohere

Get a free Tessera API key at tesseraai.io/dev — 60M tokens/month, no card up front.

Quickstart

from llama_index.llms.openai import OpenAI
from tessera_llamaindex import tessera_openai_config

llm = OpenAI(
    model="gpt-4o",
    api_key="sk-...",                              # your OpenAI key, unchanged
    **tessera_openai_config(api_key="tk_..."),    # one line, routes through Tessera
)

# Existing LlamaIndex code (queries, RAG pipelines, agents, sub-question
# engines, multi-step reasoning) runs unchanged.
response = llm.complete("Summarize this document in 3 bullets.")

Scenario 1 — RAG over a document corpus with provider prompt cache

A LlamaIndex VectorStoreIndex on claude-sonnet-4-6 with a stable retrieved-context prefix. Tessera auto-injects Anthropic cache_control: ephemeral markers — 90% off cache reads on the long retrieval prefix.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from tessera_llamaindex import tessera_anthropic_config

Settings.llm = Anthropic(
    model="claude-sonnet-4-6",
    api_key="sk-ant-...",
    **tessera_anthropic_config(api_key="tk_..."),
)

docs = SimpleDirectoryReader("./policies").load_data()
index = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine(similarity_top_k=8)
answer = engine.query("Summarize our refund policy for gift cards in two sentences.")

Scenario 2 — Sub-question query engine across multiple corpora

When the user asks a question that spans multiple data sources, LlamaIndex SubQuestionQueryEngine decomposes it into per-corpus sub-queries. Tessera's exact + semantic cache catches repeat sub-queries across reasoning chains; auto-route evaluates whether the planner LLM can run on a cheaper model than the synthesizer LLM.

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

tools = [
    QueryEngineTool(query_engine=q10_engine, metadata=ToolMetadata(name="q10_filings",
        description="Quarterly 10-Q filings, last 8 quarters")),
    QueryEngineTool(query_engine=transcripts_engine, metadata=ToolMetadata(name="earnings_calls",
        description="Transcripts of last 8 earnings calls")),
]

sub_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=tools,
    llm=Settings.llm,  # Tessera-wrapped from above
)

result = sub_engine.query("Has gross margin changed materially in the last year and why?")

Scenario 3 — ReAct agent with tool use

A LlamaIndex ReAct agent that calls a customer-data tool and a ticket-system tool. Auto-route gates on tool-calling capability — the agent never gets routed to a non-tool-capable model. Tool-call dispatch loops often hit identical sub-prompts; exact-match cache returns are free.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

def lookup_account(account_id: str) -> dict: ...
def open_ticket(account_id: str, summary: str) -> str: ...

tools = [FunctionTool.from_defaults(fn=lookup_account),
         FunctionTool.from_defaults(fn=open_ticket)]

agent = ReActAgent.from_tools(
    tools=tools,
    llm=Settings.llm,  # Tessera-wrapped — auto-route gates on tool-calling capability
    verbose=False,
)

agent.chat("Customer 7733 says their March invoice is wrong — investigate and open a ticket if needed.")

Worked savings example

RAG-heavy customer-support pipeline on claude-sonnet-4-6, 5B tokens/ month with a stable 8k-token retrieval-context prefix per query.

Stage	Cost / mo	Saved
Baseline — Anthropic direct via LlamaIndex	$21,000	—
+ Tessera (provider prompt cache + exact cache + context prune + M9)	$7,800	$13,200
Tessera subscription (Growth tier — ≤5B tokens/mo)	$999	—
You net pay	$8,799	$12,201 / mo saved

The lopsided savings on RAG workloads come from the stable retrieval prefix — cache_control markers on a long, repeated prefix recover 90% on each cache hit. Customers with high prompt-repetition see even higher percentages.

Architecture and quality contract

The adapter respects each LlamaIndex provider class's specific constructor signature — OpenAI uses api_base + default_headers, Anthropic uses base_url + default_headers, Mistral uses endpoint + additional_kwargs.http_headers, and so on. Constructor shapes are runtime-verified against installed LlamaIndex 0.6+ provider packages and locked into CI — any future LlamaIndex release that changes those kwargs trips the regression before we ship.

Composition cap (max 2 content-mutators), per-stack 0.95 quality floor with auto-rollback + credit on breach, audit-immutable measurement — all enforced on the proxy. The verified-savings ledger at ledger.tesseraai.io is your source of truth.

FAQ

Does this break my RAG pipeline / query engine / agent?: No. The LlamaIndex LLM object behaves identically. complete, chat, stream_complete, stream_chat all work unchanged. Index queries, retrievers, sub-question engines, OpenAI Agents, multi-step reasoning chains all use the LLM's standard interface and route through Tessera transparently.
Can I use this with LlamaIndex Settings.llm = ...?: Yes. Construct the LLM the same way and assign to Settings.llm — Tessera is wired in via constructor kwargs, not via a separate wrapper layer.
What about Mistral's odd headers shape?: LlamaIndex's MistralAI wrapper doesn't expose a top-level default_headers argument — it forwards additional_kwargs.http_headers to the underlying mistralai SDK on each request. The Tessera config function returns the correct shape automatically. Same handling for Cohere (additional_kwargs.headers).

Where to go next

Get a free 60M-tokens/month API key — sign-up takes ~30 seconds.
How the mechanic stack works — composition cap, quality SLA, per-stack canary, auto-rollback.
tessera-llm/tessera-llamaindex on GitHub — Apache-2.0, browse the source, file issues, fork.
tessera-llamaindex on PyPI
@tessera-llm/llamaindex on npm
Worked-numbers blog post — $24k → $9.4k, quality canary 0.96.