Mechanic · M10 · Routing fork · 50% list discount

Batch arbitrage

OpenAI · Anthropic · Vertex queued

Workloads that tolerate asynchronous responses route to the provider's Batch API instead of the real-time endpoint. OpenAI Batch and Anthropic Message Batches both offer 50% off the list rate for the same model, with a 24-hour completion window. The customer's SDK gets a 202 Accepted with a polling URL; the customer's application then polls the upstream provider directly (with its own API key) until the batch completes.

M10 is unique among the cost mechanics — it's a routing fork, not a request mutation. The dispatch path diverges before the regular real-time forwarder; M10 sends to a completely different upstream endpoint with a different request shape. Settlement happens hours later via cron.

Eligibility + signal precedence

M10 fires only when ALL of these hold:

  • Provider is OpenAI or Anthropic (v0.1 scope; Google Vertex Batch is queued as F3, pending GA in late 2026).
  • Path is a batch-supported endpoint — /chat/completions or /embeddings on OpenAI; /messages on Anthropic.
  • Workload is not paused, request is non-streaming, body is parseable JSON.
  • An async signal is present. Precedence, highest first: the per-request body field tessera_async_tolerant: true, then the per-request header x-tessera-async-tolerant: true, then the workload default batch_arbitrage_default.
  • Workload's batch_arbitrage_sla_hours accommodates the provider's 24-hour batch window. SLA tighter than the window → fall back to real-time.

We never auto-promote a real-time workload into batch: the latency contract is sponsor-set. A per-request override exists for one-off escapes (a real-time workload that has one async-tolerant job can opt in for that single request).
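The eligibility rules and signal precedence above can be sketched as a pure function. This is a minimal illustration, not the worker's actual code: the type shapes, the names resolveAsyncSignal and isM10Eligible, the exact-match path check, and reading the streaming flag from the body's stream field are all assumptions. It also assumes that the first signal present decides, so an explicit false in the body overrides a header set to true; the text above only specifies the ordering for true values.

```typescript
// Hypothetical shapes: field names follow the prose above, the types are assumptions.
type Provider = "openai" | "anthropic" | "google";

interface WorkloadConfig {
  paused: boolean;
  batch_arbitrage_default: boolean;
  batch_arbitrage_sla_hours: number;
}

interface ProxyRequest {
  provider: Provider;
  path: string;                    // e.g. "/chat/completions"
  headers: Record<string, string>; // assumed to carry lower-cased header names
  rawBody: string;
}

const BATCH_WINDOW_HOURS = 24;

const BATCH_PATHS: Record<Provider, string[]> = {
  openai: ["/chat/completions", "/embeddings"],
  anthropic: ["/messages"],
  google: [], // Vertex Batch is queued as F3, so nothing is eligible yet
};

// Precedence: body field, then header, then workload default.
// Assumption: the first signal that is present decides, even when it is false.
function resolveAsyncSignal(
  body: Record<string, unknown>,
  headers: Record<string, string>,
  workload: WorkloadConfig
): boolean {
  if (typeof body["tessera_async_tolerant"] === "boolean") {
    return body["tessera_async_tolerant"] as boolean;
  }
  const header = headers["x-tessera-async-tolerant"];
  if (header !== undefined) return header.toLowerCase() === "true";
  return workload.batch_arbitrage_default;
}

function isM10Eligible(req: ProxyRequest, workload: WorkloadConfig): boolean {
  if (BATCH_PATHS[req.provider].indexOf(req.path) === -1) return false;
  if (workload.paused) return false;

  let body: Record<string, unknown>;
  try {
    const parsed: unknown = JSON.parse(req.rawBody);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) return false;
    body = parsed as Record<string, unknown>;
  } catch {
    return false; // unparseable body: stay on the real-time path
  }

  if (body["stream"] === true) return false; // streaming requests are never batched
  if (!resolveAsyncSignal(body, req.headers, workload)) return false;

  // An SLA tighter than the provider window falls back to real-time.
  return workload.batch_arbitrage_sla_hours >= BATCH_WINDOW_HOURS;
}
```

Keeping the decision free of I/O makes every fallback to real-time a plain false return, which is easy to unit-test against each bullet above.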

Dispatch path — what the worker does

On a successful M10 routing decision, the worker:

  • Strips Tessera-internal tags (tessera_async_tolerant, etc.) from the body.
  • Calls dispatchOpenAiBatch or dispatchAnthropicBatch with the customer's upstream Authorization header. Provider returns a batch_id + file_id (OpenAI only) + estimated_complete_at.
  • Estimates expected savings via estimateBatchCost against the pricing_catalog: real-time cost × (1 − 0.5) = batch cost; difference = expected savings.
  • Records the dispatch in m10_batch_dispatches via the dashboard route /api/internal/m10/dispatch-record with the customer's settlement token (their upstream key) wrapped in Supabase Vault for at-rest encryption (migration 0079).
  • Returns 202 Accepted to the customer with tessera_batch_id, status: queued, polling_url pointing at the provider's native batch endpoint.
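Two of the dispatch steps above are plain data transformations and can be sketched without any network calls. This is an illustration only: stripTesseraTags and estimateBatchSavings are hypothetical names (the real estimator is estimateBatchCost, whose signature is not shown here), the CatalogPrice shape is an assumption about what a pricing_catalog entry carries, and treating every key prefixed with tessera_ as an internal tag is also an assumption.

```typescript
const BATCH_DISCOUNT = 0.5; // 50% off list rate, per the provider batch pricing above

// Assumed shape of a pricing_catalog entry: USD per million tokens.
interface CatalogPrice {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

interface BatchEstimate {
  realTimeCostUsd: number;
  batchCostUsd: number;
  expectedSavingsUsd: number;
}

// Drop Tessera-internal fields before forwarding the body upstream.
// Assumption: internal tags are exactly the keys starting with "tessera_".
function stripTesseraTags(body: Record<string, unknown>): Record<string, unknown> {
  const cleaned: Record<string, unknown> = {};
  for (const key of Object.keys(body)) {
    if (key.indexOf("tessera_") !== 0) cleaned[key] = body[key];
  }
  return cleaned;
}

// real-time cost x (1 - 0.5) = batch cost; the difference is the expected savings.
function estimateBatchSavings(
  price: CatalogPrice,
  inputTokens: number,
  expectedOutputTokens: number
): BatchEstimate {
  const realTimeCostUsd =
    (inputTokens * price.inputUsdPerMTok + expectedOutputTokens * price.outputUsdPerMTok) / 1e6;
  const batchCostUsd = realTimeCostUsd * (1 - BATCH_DISCOUNT);
  return {
    realTimeCostUsd,
    batchCostUsd,
    expectedSavingsUsd: realTimeCostUsd - batchCostUsd,
  };
}
```

For example, at an illustrative price of $2 per million input tokens and $8 per million output tokens, a request with 1,000,000 input tokens and 500,000 expected output tokens costs $6 real-time and $3 in batch, so the expected savings are $3. The figure stays an estimate until the settle cron replaces it with actuals.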

Settle path — cost reconciliation

The hourly settle cron (/api/cron/m10-settle) walks m10_batch_dispatches rows in non-terminal status, polls the provider with the Vault-decrypted settlement token, and:

  • On completed — downloads the output_file from the provider, parses the actual token counts per request, computes actual cost against the same pricing_catalog snapshot we billed the baseline against (audit-immutability invariant), and writes the final row to optimize_savings with auto_batched=true, mechanics_stack=['m10']. Dispatch-time estimates are overwritten with actuals.
  • On failed / expired — marks the dispatch terminal, destroys the settlement token from Vault (m10_destroy_settlement_token), emits the failure to /portal/anomalies so the sponsor can re-dispatch via real-time if needed.
  • Confidence floor 0.50: if the pricing_catalog snapshot confidence at dispatch time was below 0.50, the cron leaves the row as estimate-not-actual (the savings claim isn't re-derivable to invariant-#8 standard).

The 0.50 floor prevents the pricing-catalog-confidence drift documented in lessons 2026-05-25. We'd rather show "estimate" on the audit row than book a savings number we can't defend.
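The per-row decision the settle cron makes can be summarized as a small state function. The sketch below is an assumption-laden illustration: NormalizedBatchStatus is a hypothetical normalized status (the providers' native status vocabularies differ), and decideSettleAction and the SettleAction variants are names invented here, not the cron's actual code.

```typescript
// Hypothetical normalized provider status; real provider status values differ.
type NormalizedBatchStatus = "in_progress" | "completed" | "failed" | "expired";

type SettleAction =
  | { kind: "keep_polling" }                                          // non-terminal: retry next hour
  | { kind: "book_actuals" }                                          // overwrite estimate with actuals
  | { kind: "keep_estimate" }                                         // row stays estimate-not-actual
  | { kind: "terminal_failure"; destroyToken: true; emitAnomaly: true };

const CONFIDENCE_FLOOR = 0.5;

function decideSettleAction(
  status: NormalizedBatchStatus,
  snapshotConfidence: number // pricing_catalog snapshot confidence at dispatch time
): SettleAction {
  switch (status) {
    case "completed":
      // Below the floor the savings claim is not re-derivable, so no actuals are booked.
      return snapshotConfidence < CONFIDENCE_FLOOR
        ? { kind: "keep_estimate" }
        : { kind: "book_actuals" };
    case "failed":
    case "expired":
      // Mark terminal, destroy the Vault-held settlement token, surface an anomaly.
      return { kind: "terminal_failure", destroyToken: true, emitAnomaly: true };
    default:
      return { kind: "keep_polling" };
  }
}
```

Note that a confidence of exactly 0.50 books actuals in this sketch, since the text describes the floor as applying to values below 0.50.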

What we don’t do (yet)

  • Google Vertex Batch. Vertex Batch GA is slated for late 2026 (F3 in the worker placeholder). The worker's defaultGoogleFetcher returns google_vertex_batches_not_yet_implemented_F3 until Google announces GA.
  • No multi-request batches (yet). In v0.1 the worker dispatches single-request batches only. Multi-request batches would need per-row dispatch matching in the settle.ts output parser (which already handles the multi-line output_file format). Gated on customer traffic.
  • No auto-promotion. Sponsors explicitly opt in workloads to batch. We do not look at latency tolerances and auto-flag candidates.
  • No mid-batch cancellation. Once dispatched, the customer waits for the batch window. Cancellation paths land later if customer traffic asks for them.

Verification surfaces

  • Response headers on M10-routed requests: x-tessera-batch-routed: true, x-tessera-batch-id: batch_xyz, HTTP 202 status code (not 200) — semantic signal that this response is async.
  • m10_batch_dispatches table on the dashboard — full lifecycle row per dispatch: dispatched / completed / failed / expired with the settlement deadline, the Vault uuid, the estimated savings, and (post-settle) the actual savings.
  • /portal/audit chip strip — m10 chip on the settled row. Chip appears only after the cron completes — there's no per-request mechanics_stack chip at dispatch time (the cost isn't known yet).
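On the client side, the first verification surface can be checked with a few lines. This is a hedged sketch: ProxyResponse is a hypothetical minimal response shape with lower-cased header names, and batchIdIfRouted is a name made up for illustration; only the status code and the two header names come from the list above.

```typescript
// Hypothetical minimal view of the proxy's HTTP response.
interface ProxyResponse {
  status: number;
  headers: Record<string, string>; // assumed lower-cased header names
}

// Returns the batch id when the response was M10-routed, otherwise null.
function batchIdIfRouted(res: ProxyResponse): string | null {
  if (res.status !== 202) return null; // real-time responses come back as 200
  if (res.headers["x-tessera-batch-routed"] !== "true") return null;
  const id = res.headers["x-tessera-batch-id"];
  return id === undefined || id === "" ? null : id;
}
```

A caller that gets a non-null id knows the response body is a queued-batch acknowledgment and not a completion, and should switch to polling the provider.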

What we promise

M10 dispatches only on explicit opt-in signal (body / header / workload-default with SLA hours that accommodate the 24-hour window). The settlement token (customer's upstream key) is at-rest-encrypted via Supabase Vault throughout the dispatch window. Cost reconciliation uses the same pricing_catalog snapshot the baseline was billed against, the same audit-immutability invariant the rest of the proxy holds. If the batch fails, we'd rather emit an anomaly than book savings we can't deliver.

Parent index: How it works. Reliability companion: M11 cross-provider failover — shares the Vault wrapper pattern for at-rest token encryption.