Batch arbitrage
OpenAI · Anthropic · Vertex queued
Async-tolerant workloads, those that don't need a real-time response, are routed to the provider's Batch API instead of the real-time endpoint. OpenAI Batch and Anthropic Message Batches both offer 50% off the list rate for the same model, with a 24-hour completion window. The customer's SDK gets a 202 Accepted with a polling URL; the customer's application then polls the upstream provider directly (with its own API key) until the batch completes.
M10 is unique among the cost mechanics — it's a routing fork, not a request mutation. The dispatch path diverges before the regular real-time forwarder; M10 sends to a completely different upstream endpoint with a different request shape. Settlement happens hours later via cron.
Eligibility + signal precedence
M10 fires only when ALL of these hold:
- Provider is OpenAI or Anthropic (v0.1 scope; Google Vertex Batch GA late 2026 queued as F3).
- Path is a batch-supported endpoint: /chat/completions or /embeddings on OpenAI; /messages on Anthropic.
- Workload is not paused, request is non-streaming, body is parseable JSON.
- Async signal present, in priority order: per-request body field tessera_async_tolerant: true wins over per-request header x-tessera-async-tolerant: true wins over workload-default batch_arbitrage_default.
- Workload's batch_arbitrage_sla_hours accommodates the provider's 24-hour batch window. SLA tighter than the window → fall back to real-time.
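The eligibility checks and signal precedence above can be sketched as a single pure function. This is a minimal illustration, not the worker's actual code: the type names, the reason strings, and the decideM10 and resolveAsyncSignal helpers are all hypothetical. It also assumes that header names arrive lower-cased, that paths are matched by suffix, and that an explicit value at a higher-priority level (including false) overrides the levels below it.

```typescript
// Hypothetical shapes; only the quoted field and header names come from the text.
type Provider = "openai" | "anthropic" | "google";

interface WorkloadConfig {
  paused: boolean;
  batch_arbitrage_default: boolean;
  batch_arbitrage_sla_hours: number;
}

interface IncomingRequest {
  provider: Provider;
  path: string;
  headers: Record<string, string>; // assumption: names already lower-cased
  rawBody: string;
}

type Decision = { route: "batch" } | { route: "realtime"; reason: string };

const PROVIDER_BATCH_WINDOW_HOURS = 24;

const BATCH_PATHS: Record<Provider, string[]> = {
  openai: ["/chat/completions", "/embeddings"],
  anthropic: ["/messages"],
  google: [], // Vertex Batch is queued as F3, so nothing is eligible yet
};

const realtime = (reason: string): Decision => ({ route: "realtime", reason });

// Precedence: body field, then header, then workload default.
// Assumption: an explicit false at a higher level also wins.
function resolveAsyncSignal(
  body: Record<string, unknown>,
  headers: Record<string, string>,
  workload: WorkloadConfig,
): boolean {
  const bodyFlag = body["tessera_async_tolerant"];
  if (typeof bodyFlag === "boolean") return bodyFlag;
  const headerFlag = headers["x-tessera-async-tolerant"];
  if (headerFlag !== undefined) return headerFlag.toLowerCase() === "true";
  return workload.batch_arbitrage_default;
}

function decideM10(req: IncomingRequest, workload: WorkloadConfig): Decision {
  const supported = BATCH_PATHS[req.provider].some((p) => req.path.endsWith(p));
  if (!supported) return realtime("unsupported_provider_or_path");
  if (workload.paused) return realtime("workload_paused");

  let body: Record<string, unknown>;
  try {
    const parsed: unknown = JSON.parse(req.rawBody);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return realtime("body_not_json_object");
    }
    body = parsed as Record<string, unknown>;
  } catch {
    return realtime("body_not_parseable");
  }

  if (body["stream"] === true) return realtime("streaming_request");
  if (!resolveAsyncSignal(body, req.headers, workload)) return realtime("no_async_signal");
  if (workload.batch_arbitrage_sla_hours < PROVIDER_BATCH_WINDOW_HOURS) {
    return realtime("sla_tighter_than_window");
  }
  return { route: "batch" };
}
```

Every failed check resolves to the real-time path with a reason, which matches the stated rule that a workload is never promoted into batch without an explicit signal.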
We never auto-promote a real-time workload into batch — latency contract is sponsor-set. Per-request override exists for one-off escapes (a real-time workload that has one async-tolerant job can opt in for that single request).
Dispatch path — what the worker does
On a successful M10 routing decision, the worker:
- Strips Tessera-internal tags (tessera_async_tolerant etc.) from the body.
- Calls dispatchOpenAiBatch or dispatchAnthropicBatch with the customer's upstream Authorization header. Provider returns a batch_id + file_id (OpenAI only) + estimated_complete_at.
- Estimates expected savings via estimateBatchCost against the pricing_catalog: real-time cost × (1 − 0.5) = batch cost; difference = expected savings.
- Records the dispatch in m10_batch_dispatches via the dashboard route /api/internal/m10/dispatch-record with the customer's settlement token (their upstream key) wrapped in Supabase Vault for at-rest encryption (migration 0079).
- Returns 202 Accepted to the customer with tessera_batch_id, status: queued, polling_url pointing at the provider's native batch endpoint.
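The savings arithmetic in the estimate step is simple enough to show directly. The sketch below is illustrative only: estimateBatchSavings, CatalogRates, and the per-million-token rate fields are hypothetical and do not reflect the real estimateBatchCost signature or the pricing_catalog schema. The rates in the usage line are made-up numbers.

```typescript
// Hypothetical catalog entry: USD per one million tokens.
// Assumption: this is not the real pricing_catalog schema.
interface CatalogRates {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface SavingsEstimate {
  realtimeCostUsd: number;
  batchCostUsd: number;
  expectedSavingsUsd: number;
}

// 50% off list rate for batch, as stated in the text.
const BATCH_DISCOUNT = 0.5;

function estimateBatchSavings(
  rates: CatalogRates,
  inputTokens: number,
  expectedOutputTokens: number,
): SavingsEstimate {
  // Real-time cost at list rate.
  const realtimeCostUsd =
    (inputTokens * rates.inputPerMTok + expectedOutputTokens * rates.outputPerMTok) / 1e6;
  // Batch cost = real-time cost × (1 − 0.5).
  const batchCostUsd = realtimeCostUsd * (1 - BATCH_DISCOUNT);
  // Expected savings is the difference between the two.
  return {
    realtimeCostUsd,
    batchCostUsd,
    expectedSavingsUsd: realtimeCostUsd - batchCostUsd,
  };
}

// Illustrative rates only: $2.50 per million input tokens, $10 per million output tokens.
const estimate = estimateBatchSavings({ inputPerMTok: 2.5, outputPerMTok: 10 }, 10_000, 2_000);
console.log(estimate);
```

With those illustrative rates, 10,000 input tokens and 2,000 expected output tokens cost $0.045 at list rate and $0.0225 in batch, so the expected savings are $0.0225. Output token counts are only a guess at dispatch time, which is why the settle path later replaces this estimate with actuals.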
Settle path — cost reconciliation
The hourly settle cron (/api/cron/m10-settle) walks m10_batch_dispatches rows in non-terminal status, polls the provider with the Vault-decrypted settlement token, and:
- On completed: downloads the output_file from the provider, parses the actual token counts per request, computes actual cost against the same pricing_catalog snapshot we billed the baseline against (audit-immutability invariant), and writes the final row to optimize_savings with auto_batched=true, mechanics_stack=['m10']. Dispatch-time estimates are overwritten with actuals.
- On failed / expired: marks the dispatch terminal, destroys the settlement token from Vault (m10_destroy_settlement_token), emits the failure to /portal/anomalies so the sponsor can re-dispatch via real-time if needed.
- Confidence floor 0.50: if the pricing_catalog snapshot confidence at dispatch time was below 0.50, the cron leaves the row as estimate-not-actual (the savings claim isn't re-derivable to invariant-#8 standard).
The 0.50 floor prevents the pricing-catalog-confidence drift documented in lessons 2026-05-25. We'd rather show "estimate" on the audit row than book a savings number we can't defend.
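The branching in the settle path can be sketched as a small decision function plus a token-count parser. Treat this as a rough sketch under assumptions: settleAction, sumUsage, the SettleAction shape, and the in_progress status name are hypothetical, and the real settle.ts may be organised differently. The parser assumes the OpenAI batch output layout, one JSON object per line with token counts under response.body.usage; an Anthropic results file would need its own parser.

```typescript
// Hypothetical status and action names for illustration.
type ProviderStatus = "in_progress" | "completed" | "failed" | "expired";

type SettleAction =
  | { kind: "keep_polling" }
  | { kind: "book_actuals"; promptTokens: number; completionTokens: number }
  | { kind: "keep_estimate"; reason: string }
  | { kind: "mark_terminal"; destroyToken: true; emitAnomaly: true };

// Below this snapshot confidence the row stays estimate-not-actual.
const CONFIDENCE_FLOOR = 0.5;

// Assumed row layout (OpenAI batch output file): usage lives under response.body.usage.
interface OutputRow {
  response?: {
    body?: { usage?: { prompt_tokens?: number; completion_tokens?: number } };
  } | null;
}

// Sums token counts across every non-empty line of a JSONL output file.
function sumUsage(outputJsonl: string): { promptTokens: number; completionTokens: number } {
  let promptTokens = 0;
  let completionTokens = 0;
  for (const line of outputJsonl.split("\n")) {
    if (line.trim() === "") continue;
    const row = JSON.parse(line) as OutputRow;
    const usage = row.response?.body?.usage;
    promptTokens += usage?.prompt_tokens ?? 0;
    completionTokens += usage?.completion_tokens ?? 0; // absent for embeddings
  }
  return { promptTokens, completionTokens };
}

function settleAction(
  status: ProviderStatus,
  snapshotConfidence: number,
  outputJsonl = "",
): SettleAction {
  switch (status) {
    case "in_progress":
      return { kind: "keep_polling" };
    case "failed":
    case "expired":
      // Terminal: destroy the Vault-held token and surface an anomaly.
      return { kind: "mark_terminal", destroyToken: true, emitAnomaly: true };
    case "completed":
      if (snapshotConfidence < CONFIDENCE_FLOOR) {
        return { kind: "keep_estimate", reason: "snapshot_confidence_below_floor" };
      }
      return { kind: "book_actuals", ...sumUsage(outputJsonl) };
  }
}
```

Keeping the decision pure (status and confidence in, action out) makes the 0.50 floor easy to unit-test separately from provider polling and Vault access.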
What we don’t do (yet)
- Google Vertex Batch. Vertex Batch GA late 2026 (F3 in the worker placeholder). Worker defaultGoogleFetcher returns google_vertex_batches_not_yet_implemented_F3 until Google announces GA.
- No multi-request batches (yet). Worker dispatches single-request batches v0.1. Multi-request batches would need per-row dispatch matching in the settle.ts output-parser (which already handles multi-line output_file format). Customer-traffic gate.
- No auto-promotion. Sponsors explicitly opt in workloads to batch. We do not look at latency tolerances and auto-flag candidates.
- No mid-batch cancellation. Once dispatched, the customer waits for the batch window. Cancellation paths land later if customer traffic asks for them.
Verification surfaces
- Response headers on M10-routed requests: x-tessera-batch-routed: true, x-tessera-batch-id: batch_xyz, HTTP 202 status code (not 200), a semantic signal that this response is async.
- m10_batch_dispatches table on the dashboard: full lifecycle row per dispatch (dispatched / completed / failed / expired) with the settlement deadline, the Vault uuid, the estimated savings, and (post-settle) the actual savings.
- /portal/audit chip strip: m10 chip on the settled row. Chip appears only after the cron completes; there's no per-request mechanics_stack chip at dispatch time (the cost isn't known yet).
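On the client side, the response-header surface can be checked with a few lines. This is a hedged sketch: batchIdIfRouted and the ProxyResponse shape are hypothetical helpers, not part of any Tessera SDK, and it assumes header names have been lower-cased by the HTTP client.

```typescript
// Hypothetical minimal view of an HTTP response.
interface ProxyResponse {
  status: number;
  headers: Record<string, string>; // assumption: names already lower-cased
}

// Returns the batch id when the response was M10-routed, otherwise null.
function batchIdIfRouted(res: ProxyResponse): string | null {
  const routed = res.headers["x-tessera-batch-routed"] === "true";
  // Both signals must agree: the 202 status and the routing header.
  if (res.status !== 202 || !routed) return null;
  return res.headers["x-tessera-batch-id"] ?? null;
}

const batchId = batchIdIfRouted({
  status: 202,
  headers: { "x-tessera-batch-routed": "true", "x-tessera-batch-id": "batch_xyz" },
});
console.log(batchId);
```

A caller that gets null back can treat the response as an ordinary real-time result; a non-null id means the body carries a polling_url rather than a completion.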
What we promise
M10 dispatches only on explicit opt-in signal (body / header / workload-default with SLA hours that accommodate the 24-hour window). The settlement token (customer's upstream key) is at-rest-encrypted via Supabase Vault throughout the dispatch window. Cost reconciliation uses the same pricing_catalog snapshot the baseline was billed against, the same audit-immutability invariant the rest of the proxy holds. If the batch fails, we'd rather emit an anomaly than book savings we can't deliver.
Parent index: How it works. Reliability companion: M11 cross-provider failover — shares the Vault wrapper pattern for at-rest token encryption.