Tessera × LlamaIndex
tessera-llamaindex is a thin adapter that routes every LlamaIndex LLM call through the Tessera substrate proxy. Spread tessera_openai_config() (or the per-provider sibling) into the llama_index.llms.openai.OpenAI constructor and every .complete() / .chat() / .stream_complete() call lands on the proxy.
RAG pipelines benefit most. The provider prompt cache (M6) and exact cache (M2) fire on stable retrieval prefixes — typical RAG workloads see 40-70% savings on top-of-funnel input tokens just from those two mechanics. Multi-step query engines (sub-question, router, ReAct agents) compound caching benefits across the reasoning chain.
Install
pip install tessera-llamaindex
# Plus whichever LlamaIndex provider package you use:
pip install llama-index-llms-openai # or llama-index-llms-anthropic / -mistralai / -groq / -cohereGet a free Tessera API key at tesseraai.io/dev — 60M tokens/month, no card up front.
Quickstart
from llama_index.llms.openai import OpenAI
from tessera_llamaindex import tessera_openai_config
llm = OpenAI(
model="gpt-4o",
api_key="sk-...", # your OpenAI key, unchanged
**tessera_openai_config(api_key="tk_..."), # one line, routes through Tessera
)
# Existing LlamaIndex code (queries, RAG pipelines, agents, sub-question
# engines, multi-step reasoning) runs unchanged.
response = llm.complete("Summarize this document in 3 bullets.")Scenario 1 — RAG over a document corpus with provider prompt cache
A LlamaIndex VectorStoreIndex on claude-sonnet-4-6 with a stable retrieved-context prefix. Tessera auto-injects Anthropic cache_control: ephemeral markers — 90% off cache reads on the long retrieval prefix.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from tessera_llamaindex import tessera_anthropic_config
Settings.llm = Anthropic(
model="claude-sonnet-4-6",
api_key="sk-ant-...",
**tessera_anthropic_config(api_key="tk_..."),
)
docs = SimpleDirectoryReader("./policies").load_data()
index = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine(similarity_top_k=8)
answer = engine.query("Summarize our refund policy for gift cards in two sentences.")Scenario 2 — Sub-question query engine across multiple corpora
When the user asks a question that spans multiple data sources, LlamaIndex SubQuestionQueryEngine decomposes it into per-corpus sub-queries. Tessera's exact + semantic cache catches repeat sub-queries across reasoning chains; auto-route evaluates whether the planner LLM can run on a cheaper model than the synthesizer LLM.
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
tools = [
QueryEngineTool(query_engine=q10_engine, metadata=ToolMetadata(name="q10_filings",
description="Quarterly 10-Q filings, last 8 quarters")),
QueryEngineTool(query_engine=transcripts_engine, metadata=ToolMetadata(name="earnings_calls",
description="Transcripts of last 8 earnings calls")),
]
sub_engine = SubQuestionQueryEngine.from_defaults(
query_engine_tools=tools,
llm=Settings.llm, # Tessera-wrapped from above
)
result = sub_engine.query("Has gross margin changed materially in the last year and why?")Scenario 3 — ReAct agent with tool use
A LlamaIndex ReAct agent that calls a customer-data tool and a ticket-system tool. Auto-route gates on tool-calling capability — the agent never gets routed to a non-tool-capable model. Tool-call dispatch loops often hit identical sub-prompts; exact-match cache returns are free.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
def lookup_account(account_id: str) -> dict: ...
def open_ticket(account_id: str, summary: str) -> str: ...
tools = [FunctionTool.from_defaults(fn=lookup_account),
FunctionTool.from_defaults(fn=open_ticket)]
agent = ReActAgent.from_tools(
tools=tools,
llm=Settings.llm, # Tessera-wrapped — auto-route gates on tool-calling capability
verbose=False,
)
agent.chat("Customer 7733 says their March invoice is wrong — investigate and open a ticket if needed.")Worked savings example
RAG-heavy customer-support pipeline on claude-sonnet-4-6, 5B tokens/ month with a stable 8k-token retrieval-context prefix per query.
| Stage | Cost / mo | Saved |
|---|---|---|
| Baseline — Anthropic direct via LlamaIndex | $21,000 | — |
| + Tessera (provider prompt cache + exact cache + context prune + M9) | $7,800 | $13,200 |
| Tessera fee (20% × measured savings) | $2,640 | — |
| You net pay | $10,440 | $10,560 / mo saved |
The lopsided savings on RAG workloads come from the stable retrieval prefix — cache_control markers on a long, repeated prefix recover 90% on each cache hit. Customers with high prompt-repetition see even higher percentages.
Architecture and quality contract
The adapter respects each LlamaIndex provider class's specific constructor signature — OpenAI uses api_base + default_headers, Anthropic uses base_url + default_headers, Mistral uses endpoint + additional_kwargs.http_headers, and so on. Constructor shapes are runtime-verified against installed LlamaIndex 0.6+ provider packages and locked into CI — any future LlamaIndex release that changes those kwargs trips the regression before we ship.
Composition cap (max 2 content-mutators), per-stack 0.95 quality floor with auto-rollback + credit on breach, audit-immutable measurement — all enforced on the proxy. The verified-savings ledger at ledger.tesseraai.io is your source of truth.
FAQ
- Does this break my RAG pipeline / query engine / agent?
- No. The LlamaIndex LLM object behaves identically. complete, chat, stream_complete, stream_chat all work unchanged. Index queries, retrievers, sub-question engines, OpenAI Agents, multi-step reasoning chains all use the LLM's standard interface and route through Tessera transparently.
- Can I use this with LlamaIndex Settings.llm = ...?
- Yes. Construct the LLM the same way and assign to Settings.llm — Tessera is wired in via constructor kwargs, not via a separate wrapper layer.
- What about Mistral's odd headers shape?
- LlamaIndex's MistralAI wrapper doesn't expose a top-level default_headers argument — it forwards additional_kwargs.http_headers to the underlying mistralai SDK on each request. The Tessera config function returns the correct shape automatically. Same handling for Cohere (additional_kwargs.headers).
Where to go next
- Get a free 60M-tokens/month API key — sign-up takes ~30 seconds.
- How the mechanic stack works — composition cap, quality SLA, per-stack canary, auto-rollback.
- tessera-llm/tessera-llamaindex on GitHub — Apache-2.0, browse the source, file issues, fork.
- tessera-llamaindex on PyPI
- @tessera-llm/llamaindex on npm
- Worked-numbers blog post — $24k → $9.4k, quality canary 0.96.