Skip to main content

Tessera × LlamaIndex

tessera-llamaindex is a thin adapter that routes every LlamaIndex LLM call through the Tessera substrate proxy. Spread tessera_openai_config() (or the per-provider sibling) into the llama_index.llms.openai.OpenAI constructor and every .complete() / .chat() / .stream_complete() call lands on the proxy.

RAG pipelines benefit most. The provider prompt cache (M6) and exact cache (M2) fire on stable retrieval prefixes — typical RAG workloads see 40-70% savings on top-of-funnel input tokens just from those two mechanics. Multi-step query engines (sub-question, router, ReAct agents) compound caching benefits across the reasoning chain.

Install

pip install tessera-llamaindex
# Plus whichever LlamaIndex provider package you use:
pip install llama-index-llms-openai          # or llama-index-llms-anthropic / -mistralai / -groq / -cohere

Get a free Tessera API key at tesseraai.io/dev — 60M tokens/month, no card up front.

Quickstart

from llama_index.llms.openai import OpenAI
from tessera_llamaindex import tessera_openai_config

llm = OpenAI(
    model="gpt-4o",
    api_key="sk-...",                              # your OpenAI key, unchanged
    **tessera_openai_config(api_key="tk_..."),    # one line, routes through Tessera
)

# Existing LlamaIndex code (queries, RAG pipelines, agents, sub-question
# engines, multi-step reasoning) runs unchanged.
response = llm.complete("Summarize this document in 3 bullets.")

Scenario 1 — RAG over a document corpus with provider prompt cache

A LlamaIndex VectorStoreIndex on claude-sonnet-4-6 with a stable retrieved-context prefix. Tessera auto-injects Anthropic cache_control: ephemeral markers — 90% off cache reads on the long retrieval prefix.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from tessera_llamaindex import tessera_anthropic_config

Settings.llm = Anthropic(
    model="claude-sonnet-4-6",
    api_key="sk-ant-...",
    **tessera_anthropic_config(api_key="tk_..."),
)

docs = SimpleDirectoryReader("./policies").load_data()
index = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine(similarity_top_k=8)
answer = engine.query("Summarize our refund policy for gift cards in two sentences.")

Scenario 2 — Sub-question query engine across multiple corpora

When the user asks a question that spans multiple data sources, LlamaIndex SubQuestionQueryEngine decomposes it into per-corpus sub-queries. Tessera's exact + semantic cache catches repeat sub-queries across reasoning chains; auto-route evaluates whether the planner LLM can run on a cheaper model than the synthesizer LLM.

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

tools = [
    QueryEngineTool(query_engine=q10_engine, metadata=ToolMetadata(name="q10_filings",
        description="Quarterly 10-Q filings, last 8 quarters")),
    QueryEngineTool(query_engine=transcripts_engine, metadata=ToolMetadata(name="earnings_calls",
        description="Transcripts of last 8 earnings calls")),
]

sub_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=tools,
    llm=Settings.llm,  # Tessera-wrapped from above
)

result = sub_engine.query("Has gross margin changed materially in the last year and why?")

Scenario 3 — ReAct agent with tool use

A LlamaIndex ReAct agent that calls a customer-data tool and a ticket-system tool. Auto-route gates on tool-calling capability — the agent never gets routed to a non-tool-capable model. Tool-call dispatch loops often hit identical sub-prompts; exact-match cache returns are free.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

def lookup_account(account_id: str) -> dict: ...
def open_ticket(account_id: str, summary: str) -> str: ...

tools = [FunctionTool.from_defaults(fn=lookup_account),
         FunctionTool.from_defaults(fn=open_ticket)]

agent = ReActAgent.from_tools(
    tools=tools,
    llm=Settings.llm,  # Tessera-wrapped — auto-route gates on tool-calling capability
    verbose=False,
)

agent.chat("Customer 7733 says their March invoice is wrong — investigate and open a ticket if needed.")

Worked savings example

RAG-heavy customer-support pipeline on claude-sonnet-4-6, 5B tokens/ month with a stable 8k-token retrieval-context prefix per query.

StageCost / moSaved
Baseline — Anthropic direct via LlamaIndex$21,000
+ Tessera (provider prompt cache + exact cache + context prune + M9)$7,800$13,200
Tessera fee (20% × measured savings)$2,640
You net pay$10,440$10,560 / mo saved

The lopsided savings on RAG workloads come from the stable retrieval prefix — cache_control markers on a long, repeated prefix recover 90% on each cache hit. Customers with high prompt-repetition see even higher percentages.

Architecture and quality contract

The adapter respects each LlamaIndex provider class's specific constructor signature — OpenAI uses api_base + default_headers, Anthropic uses base_url + default_headers, Mistral uses endpoint + additional_kwargs.http_headers, and so on. Constructor shapes are runtime-verified against installed LlamaIndex 0.6+ provider packages and locked into CI — any future LlamaIndex release that changes those kwargs trips the regression before we ship.

Composition cap (max 2 content-mutators), per-stack 0.95 quality floor with auto-rollback + credit on breach, audit-immutable measurement — all enforced on the proxy. The verified-savings ledger at ledger.tesseraai.io is your source of truth.

FAQ

Does this break my RAG pipeline / query engine / agent?
No. The LlamaIndex LLM object behaves identically. complete, chat, stream_complete, stream_chat all work unchanged. Index queries, retrievers, sub-question engines, OpenAI Agents, multi-step reasoning chains all use the LLM's standard interface and route through Tessera transparently.
Can I use this with LlamaIndex Settings.llm = ...?
Yes. Construct the LLM the same way and assign to Settings.llm — Tessera is wired in via constructor kwargs, not via a separate wrapper layer.
What about Mistral's odd headers shape?
LlamaIndex's MistralAI wrapper doesn't expose a top-level default_headers argument — it forwards additional_kwargs.http_headers to the underlying mistralai SDK on each request. The Tessera config function returns the correct shape automatically. Same handling for Cohere (additional_kwargs.headers).

Where to go next