
The prompt bloat you don’t see — and what it’s costing you

Question for the CTO reading this: tell me, in tokens, how big your production system prompt was last Tuesday. Tell me how big it was three months ago. Tell me what changed. If you can't answer in under a minute, you are running blind on roughly 60-80% of your input-token cost.

The cost story most engineers tell themselves about LLM inference is about completions — the user types a question, the model writes an answer, the answer costs money. The completion cost is usually 20-30% of the bill. The other 70-80% is the system prompt and the retrieved-context blob you attach to every single call. Neither of those is “the model’s fault.” They are yours.

We added prompt-size capture to every Tessera proxy request a couple of weeks ago. The column is named compress_chars_before and compress_chars_after on the savings table — exact character counts of the in-flight prompt at the proxy boundary, before and after compression. The purpose was to validate that our compression mechanic was actually doing what we said. What we saw in the data ended up being more interesting.

Why prompts grow

Three patterns show up consistently in the workloads we proxy. Once you start watching for them, you see them everywhere.

Feature patching. Every time the LLM misbehaved in production — wrong tone, wrong format, hallucinated a feature, ignored a constraint — someone added a paragraph to the system prompt. “You must respond in JSON.” “Do not mention pricing.” “Never apologize.” Each paragraph individually is fine. Over six months, those paragraphs accumulate. By month nine the prompt has fourteen “rules,” six of which contradict each other, and nobody is sure which ones are still load-bearing. Token count: up 3-4x from the original.

Multi-author drift. Three engineers, three quarters, three writing styles. The original prompt was tight and Socratic. Engineer two prefers numbered lists. Engineer three prefers narrative paragraphs. The current production prompt is a Frankenstein of all three, with vestigial sections nobody owns. None of the three engineers feels authorized to delete the others' work without a meeting.

RAG context inflation. The retrieval system returns top-K chunks; over time, “K=5” becomes “K=10” in response to a recall complaint, then “K=15” in response to an edge-case complaint. Each chunk is 800-1500 tokens, so at K=15 every call carries roughly 12,000-22,500 tokens of retrieved context. The recall metric inched up; the cost line ballooned. No one was watching both at once.
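A back-of-envelope sketch of how the K bumps add up, using the chunk sizes quoted above. The function name and defaults are ours, for illustration only.

```python
def retrieved_context_tokens(k: int, min_chunk: int = 800, max_chunk: int = 1500) -> tuple[int, int]:
    """Return the (low, high) token range carried by k retrieved chunks."""
    return k * min_chunk, k * max_chunk


for k in (5, 10, 15):
    low, high = retrieved_context_tokens(k)
    print(f"K={k:>2}: {low:,}-{high:,} tokens of retrieved context per call")
```

Each step from K=5 to K=10 to K=15 adds another 4,000-7,500 tokens to every call, whether or not the extra chunks were needed for that particular question.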

What the chars-capture data shows

With compress_chars_before / compress_chars_after on every request, you can ask three questions of any workload quickly: (1) how big are the prompts in steady state, (2) what compression ratio is achievable without semantic loss, (3) how much is the absolute character count drifting week-over-week.
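A minimal sketch of how questions (1) and (3) fall out of the two columns, plus the observed savings for (2). It assumes you have already grouped per-request character counts by week; the `WeekSample` type and `weekly_report` function are hypothetical names, not part of Tessera. Whether the savings come without semantic loss is a separate check that character counts alone cannot answer.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WeekSample:
    week: str
    chars_before: list[int]  # one entry per request
    chars_after: list[int]   # same requests, after compression


def weekly_report(weeks: list[WeekSample]) -> list[dict]:
    """Summarize steady-state size, compression savings, and week-over-week drift."""
    rows = []
    prev_size = None
    for w in weeks:
        size = mean(w.chars_before)
        # Fraction of characters removed across the whole week.
        savings = 1 - sum(w.chars_after) / sum(w.chars_before)
        # Relative change in mean prompt size versus the previous week.
        drift = None if prev_size is None else size / prev_size - 1
        rows.append({"week": w.week, "mean_chars_before": size,
                     "savings": savings, "wow_drift": drift})
        prev_size = size
    return rows
```

Feeding it two weeks where the mean prompt grows from 10,000 to 11,000 characters yields a `wow_drift` of about 0.10 for the second week.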

Across the workloads in beta we see three rough shapes.

Stable bloat. Prompt is huge but not growing — flat line at 10-18 KB. Compression yields 8-15% (the easy wins: extra whitespace, redundant punctuation, double-newlines that should be single). Quality canary holds. This is the “cost-only” pattern — turn on M3 with compress_system_enabled = true, take the 10% savings, move on.

Drift bloat. Prompt growing 5-12% per month. Visible as a sloped line in the weekly-compress-ratio chart. This is the dangerous pattern. The growth is gradual enough that nobody notices in a single sprint review, but the per-1k-conversation cost is silently up 40-60% year-over-year. M3 here is treatment, not cure — the underlying drift needs an owner.

Silent regression. Prompt size jumps overnight, usually from a deploy. New feature shipped; a developer pasted in a 4 KB block of examples; nobody pushed back in review because the quality canary on staging looked fine. The before/after ratio chart spikes. M3's alert ping fires. The owning team gets to choose: revert, accept, or refactor.
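The three shapes can be told apart from a weekly series of mean prompt sizes. The sketch below is one rough way to do it. The thresholds are our assumptions (a 25% single-week jump, or about 1% average weekly growth, which sits just under 5% per month), and the function only classifies the trend, not whether the prompt is large to begin with.

```python
def classify_prompt_shape(weekly_chars: list[float],
                          jump_threshold: float = 0.25,
                          drift_threshold: float = 0.01) -> str:
    """Classify a series of weekly mean prompt sizes, oldest first.

    Thresholds are illustrative assumptions, not Tessera defaults.
    """
    if len(weekly_chars) < 2:
        raise ValueError("need at least two weeks of data")
    # Relative change between each pair of consecutive weeks.
    changes = [b / a - 1 for a, b in zip(weekly_chars, weekly_chars[1:])]
    if max(changes) >= jump_threshold:
        return "silent regression"  # one week jumped sharply
    if sum(changes) / len(changes) >= drift_threshold:
        return "drift bloat"        # steady upward slope
    return "stable bloat"           # flat line


print(classify_prompt_shape([14000, 14010, 13990, 14000]))
print(classify_prompt_shape([10000, 10200, 10404, 10612]))
print(classify_prompt_shape([10000, 10050, 14000, 14050]))
```

Checking for a jump first matters: a single deploy-sized spike would otherwise inflate the average and be misread as drift.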

Per-role compression — system vs user

When we first shipped M3 (heuristic compression), it had a single on/off toggle. That was wrong. The reason: system prompts and user turns have different shapes and different tolerance for normalization.

System prompts are usually long, sometimes templated, often generated from a CMS, frequently full of incidental whitespace. Compression on them is high-yield and low-risk; the model ingests them as instructions, not as content to be quoted back.

User turns are different. Users include code blocks, formatted tables, paths, IDs, and snippets where every whitespace character is significant. Compressing user turns is feasible but fragile — a too-aggressive ratio can break a copy-pasted regex. Defaulting to off is the safer call.

So we split the toggle. Workloads now configure compress_system_enabled and compress_user_turns_enabled independently. Default config for new workloads: system on, user off. Customers can flip either independently from /portal/optimize/quality.

The kill-switch is per-role too. If a stack's quality canary drops below 0.95 specifically on user-compression requests, only the user-compression flag flips off; system compression keeps running. Surgical, not nuclear. We learned this the hard way — early kill-switches were repo-wide, which meant a regression in one cohort took out savings for every other cohort that was perfectly happy.

The quality canary keeps us honest

The compression ratio is one half of the chars-capture story. The other half is the daily quality canary, which runs against a customer-supplied promptfoo eval set plus a 10% sample of production traffic and scores each output against the baseline (non-compressed) output.

The canary measures by stack — every combination of mechanics active for a workload (e.g. m1+m3-system, m1+m3-user, m1+m3-system+m3-user). The SLA floor per stack is 0.95. Two consecutive days under the floor auto-disable that specific stack. The customer gets a 10% prorated credit on the affected fees and an email pointing at the audit ledger row that triggered the rollback.

The reason for the per-stack granularity is exactly the failure mode we mentioned above: nuclear kill-switches penalize cohorts that did nothing wrong. If m3-user regresses on workload X, only m3-user flips off for workload X. Auto-route stays on. System-compress stays on. The customer keeps 80% of their savings while we investigate.
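The rollback rule is simple enough to sketch. The version below assumes one canary score per stack per day and takes the floor and the two-day window from the description above; the function and constant names are hypothetical, and the real system presumably also handles credits and ledger entries.

```python
SLA_FLOOR = 0.95
CONSECUTIVE_DAYS = 2


def stacks_to_disable(daily_scores: dict[str, list[float]]) -> list[str]:
    """Return stacks whose last CONSECUTIVE_DAYS canary scores are all under the floor.

    daily_scores maps a stack name to its daily scores, oldest first.
    """
    disable = []
    for stack, scores in daily_scores.items():
        recent = scores[-CONSECUTIVE_DAYS:]
        if len(recent) == CONSECUTIVE_DAYS and all(s < SLA_FLOOR for s in recent):
            disable.append(stack)
    return disable


scores = {
    "m1+m3-system": [0.98, 0.97, 0.96],
    "m1+m3-user": [0.97, 0.94, 0.93],
    "m1+m3-system+m3-user": [0.96, 0.94, 0.96],
}
print(stacks_to_disable(scores))
```

Only the stack with two consecutive sub-floor days is returned; a single bad day followed by a recovery leaves the stack running.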

What this is worth in dollars

Concrete numbers from the customer-support workload we walked through in the $24k → $9.4k post:

On any single workload, M3 compression by itself looks like a small win. The leverage is that M3 stacks with M6 and with M1 — every 1% reduction in system prefix size is a 1% reduction in cache-rewrite cost, route-budget threshold, and downstream token math. The numbers compound. Without chars-capture, you would never know whether the compression ratio actually held month-over-month or silently drifted to 0.97, eroding the stack's value.

What to do next

You do not have to use Tessera to take the observation seriously. The cheap first step is to instrument your own production traffic to log len(system_prompt) and len(messages_serialized) per request. Plot the week-over-week trend. If you see a slope, you have a drift bloat problem. If you see a flat 10-18 KB line, you have a stable bloat problem. Either way you know something you did not know yesterday.
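A minimal sketch of that instrumentation using only the standard library. It assumes a chat-style `messages` list that can be serialized as JSON; the logger name, field names, and `workload` label are ours.

```python
import json
import logging
import time

logger = logging.getLogger("prompt_size")


def log_prompt_size(system_prompt: str, messages: list[dict], workload: str) -> dict:
    """Log character counts for one request as a JSON line and return the record."""
    messages_serialized = json.dumps(messages, ensure_ascii=False)
    record = {
        "ts": time.time(),
        "workload": workload,
        "system_prompt_chars": len(system_prompt),
        "messages_chars": len(messages_serialized),
    }
    logger.info(json.dumps(record))
    return record
```

Call it just before the request leaves your service, ship the log lines to whatever you already use for metrics, and chart the weekly mean of each field per workload.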

If you want the chars-capture, per-role compression, and per-stack quality canary out of the box, the free Dev tier on Tessera handles all of it without writing any of that code yourself. 60M tokens per month, no card, 30-second signup. The dashboard at /portal/optimize/quality shows the chars-before and chars-after lines for every workload you wire up; the audit ledger explains every cost-delta number end-to-end.

Production tier (over 60M tokens/mo) is 20% of measured savings, paid from a prepaid balance with a $100 minimum top-up. Zero savings, zero fee. If the math doesn't work for you after a 7-day run, the kill-switch in settings reverts to passthrough.
