Mechanic · M12 · Safety primitive

Guardrails

Shipped 21 May 2026 · v0.1 heuristic · Migration 0090

M12 inspects each request before the upstream provider sees it. PII strings (email, phone, SSN, credit card, IBAN, …) are redacted via a Python sidecar running microsoft/presidio. Jailbreak and toxicity hits are caught by pure-TypeScript heuristics inside the Worker isolate. A per-workload mode selects the consequence: observe (log_only), redact, or block (strict).

Honest framing first: this is a v0.1 heuristic. The PII detector is presidio's regex + dictionary + spaCy NER ensemble. The jailbreak detector is a regex set covering the published DAN / Sydney / role-flip corpus. The toxicity filter is a banned-token list with a per-message ratio threshold. None of the three uses a purpose-trained ML classifier; the only statistical model in the path is spaCy's NER inside presidio. Suitable for low-risk redact mode and for blunt blocking in strict mode. Not a substitute for a dedicated moderation provider on workloads with stringent compliance contracts.

What it does

For every request through the proxy with guardrails_mode != 'off':

  • PII validator. HMAC-signs the request text, POSTs to the presidio-bridge sidecar at /v1/analyze-redact. The sidecar returns { redacted_text, entities_found[] } and the worker substitutes the redacted version back into each chat message slot when redact or strict mode is active.
  • Jailbreak validator. Pure-TypeScript regex scan over the concatenated message text. Pattern set covers "ignore previous instructions", DAN role-flip, persona override, mode escalation, system-prompt extraction, safety bypass, encoded-payload smuggling. ~80 µs per check.
  • Toxicity validator. Banned-token list with a ratio gate. Single-token hit flags short messages (< 50 words); long messages need ratio ≥ 0.02. Narrow list, conservative. False positives on enterprise traffic are more damaging at v0.1 than false negatives.
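The two in-process validators can be sketched roughly as follows. This is a minimal illustration, not the shipped code: the pattern list, the placeholder banned token, and the function names jailbreakHit and toxicityHit are all assumptions, and the ratio is assumed to mean banned tokens divided by total words. Only the "ignore previous instructions" phrase, the 50-word cutoff, and the 0.02 ratio come from the description above.

```typescript
// Illustrative patterns only; the real pattern set is larger and not reproduced here.
const JAILBREAK_PATTERNS: RegExp[] = [
  /ignore\s+(all\s+)?previous\s+instructions/i,
  /\byou\s+are\s+now\s+DAN\b/i, // hypothetical DAN role-flip pattern
];

// Placeholder token; the real banned-token list is an assumption here.
const BANNED_TOKENS = new Set<string>(["badword"]);

const SHORT_MESSAGE_WORDS = 50; // "short" means fewer than 50 words
const LONG_MESSAGE_RATIO = 0.02; // long messages need ratio >= 0.02

// Scans the concatenated message text, as the jailbreak validator is described to do.
function jailbreakHit(messages: string[]): boolean {
  const text = messages.join("\n");
  return JAILBREAK_PATTERNS.some((pattern) => pattern.test(text));
}

// Per-message ratio gate. Whitespace tokenisation is a simplification:
// tokens with attached punctuation will not match in this sketch.
function toxicityHit(message: string): boolean {
  const words = message
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0);
  if (words.length === 0) return false;
  const hits = words.filter((w) => BANNED_TOKENS.has(w)).length;
  if (words.length < SHORT_MESSAGE_WORDS) return hits >= 1;
  return hits / words.length >= LONG_MESSAGE_RATIO;
}
```

The ratio gate is what keeps a single quoted token in a long document from flagging the whole message, which matches the stated preference for false negatives over false positives at v0.1.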

The three validators compose into one outcome. In strict mode a jailbreak or toxicity hit rejects the request with HTTP 403 and a tessera-guardrails-blocked header carrying the block reason (jailbreak_blocked or toxicity_blocked). PII never blocks; it always redacts. Blocking on PII would turn ordinary traffic into 403s for any message that happens to contain an email address, which is incompatible with the proxy contract.

Mode taxonomy

The per-workload guardrails_mode enum has four values, evaluated server-side on every request:

  • off. Default. Validators are skipped entirely. No audit row.
  • log_only. Validators run, audit row written, request unchanged. Use this to validate v0.1 behavior on real traffic before committing to a mutation policy.
  • redact. PII redacted in-place. Jailbreak and toxicity hits are recorded but the request still proceeds. Best for workloads that handle customer-supplied content where blocking is too disruptive but PII leakage to the upstream is unacceptable.
  • strict. PII redacted; jailbreak or toxicity hit returns 403 to the caller. Use only on workloads where you have a fall-back path for blocked requests. The block header taxonomy is stable.
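One way to picture how the four modes turn validator findings into an outcome is the sketch below. The type names, the decide function, and the ValidatorFindings shape are hypothetical. Checking jailbreak before toxicity when both fire is also an assumption; the text above does not state a precedence.

```typescript
type GuardrailsMode = "off" | "log_only" | "redact" | "strict";
type BlockReason = "jailbreak_blocked" | "toxicity_blocked";

// Hypothetical shape for the combined validator output.
interface ValidatorFindings {
  redactedText: string | null; // null when the PII validator produced no redaction
  jailbreak: boolean;
  toxicity: boolean;
}

type Outcome =
  | { action: "pass"; body: string }
  | { action: "block"; status: 403; reason: BlockReason };

function decide(
  mode: GuardrailsMode,
  original: string,
  findings: ValidatorFindings,
): Outcome {
  // off skips validators; log_only runs them but never changes the request.
  if (mode === "off" || mode === "log_only") {
    return { action: "pass", body: original };
  }
  // Only strict mode blocks, and only on jailbreak or toxicity hits.
  if (mode === "strict") {
    if (findings.jailbreak) {
      return { action: "block", status: 403, reason: "jailbreak_blocked" };
    }
    if (findings.toxicity) {
      return { action: "block", status: 403, reason: "toxicity_blocked" };
    }
  }
  // redact and non-blocked strict requests: PII is redacted, never blocked.
  return { action: "pass", body: findings.redactedText ?? original };
}
```

Note that no branch returns a block for PII, which mirrors the rule that PII always redacts.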

Composition cap

M12 redact mutates the request body, so it counts as a content-mutator alongside M3 compress and M7 context-prune. The proxy enforces a max of two content-mutators per request (LOCKED 2026-05-20). Stacking three mutations against the same model call multiplies the quality risk past the canary's ability to catch it.

When M3 + M7 already fired on this request AND guardrails_mode is redact, the runner downgrades to log_only for that single request and emits an audit row with composition_cap_exceeded=true. Validators still run; entities still get recorded; the request body is just not mutated. Jailbreak and toxicity validators are non-mutating, so they always run when mode is anything but off regardless of cap.
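The downgrade rule can be sketched as a small pure function. The names applyCompositionCap, Mutator, and CapDecision are hypothetical. The text only describes the downgrade for redact mode, so this sketch leaves strict untouched rather than guess at its behavior.

```typescript
const MAX_CONTENT_MUTATORS = 2; // cap stated in the text

// Hypothetical labels for the mutators that may already have fired.
type Mutator = "M3_compress" | "M7_context_prune";

interface CapDecision {
  effectiveMode: "off" | "log_only" | "redact" | "strict";
  compositionCapExceeded: boolean; // would be recorded on the audit row
}

function applyCompositionCap(
  requestedMode: CapDecision["effectiveMode"],
  alreadyFired: Mutator[],
): CapDecision {
  // Only redact is covered here; how strict interacts with the cap is not specified.
  const wouldMutate = requestedMode === "redact";
  if (wouldMutate && alreadyFired.length >= MAX_CONTENT_MUTATORS) {
    // Downgrade for this single request: validators still run, body is not mutated.
    return { effectiveMode: "log_only", compositionCapExceeded: true };
  }
  return { effectiveMode: requestedMode, compositionCapExceeded: false };
}
```

Because the downgrade is computed per request, a later request where only one mutator fired keeps the configured redact mode.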

Sidecar architecture

Presidio (the PII engine) is Python-only. It ships with spaCy and transformers dependencies that cannot run inside the Cloudflare Worker V8 isolate. We run presidio as an HTTP sidecar (presidio-bridge/ in the monorepo, deployable via Docker Compose) and call out server-to-server over HMAC-signed requests.

The HMAC scheme is identical to the worker → dashboard ingest path documented in .claude/rules/hmac.md: header X-Tessera-Signature: t=<ts>,v1=<base64url-hmac-sha256> with the same asymmetric drift window (reject timestamps more than 30 s in the future or more than 300 s in the past). One secret rotation updates both sides in lockstep. The bridge fails closed with HTTP 500 if the signing-secret env var is missing, so it never runs unauthenticated.
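A minimal sketch of the header parsing and the asymmetric drift check follows. It assumes the t value is a Unix timestamp in whole seconds, which the text does not state, and the function names are hypothetical. Computing and comparing the HMAC itself is deliberately left out.

```typescript
const MAX_FUTURE_SKEW_S = 30; // reject timestamps more than 30 s ahead
const MAX_PAST_AGE_S = 300; // reject timestamps more than 300 s old

interface ParsedSignature {
  timestamp: number; // assumed to be Unix seconds
  signature: string; // base64url HMAC-SHA256, not verified in this sketch
}

// Parses "t=<ts>,v1=<sig>"; returns null on any malformed input.
function parseSignatureHeader(header: string): ParsedSignature | null {
  let t: string | undefined;
  let v1: string | undefined;
  for (const piece of header.split(",")) {
    const idx = piece.indexOf("=");
    if (idx <= 0) return null;
    const key = piece.slice(0, idx).trim();
    const value = piece.slice(idx + 1).trim();
    if (key === "t") t = value;
    else if (key === "v1") v1 = value;
  }
  if (t === undefined || v1 === undefined || v1.length === 0) return null;
  if (!/^\d+$/.test(t)) return null;
  return { timestamp: Number(t), signature: v1 };
}

// Asymmetric window: tight on future timestamps, looser on past ones.
function withinDriftWindow(timestamp: number, nowSeconds: number): boolean {
  const delta = timestamp - nowSeconds; // positive means signed in the future
  if (delta > MAX_FUTURE_SKEW_S) return false;
  if (-delta > MAX_PAST_AGE_S) return false;
  return true;
}
```

The window is looser on the past side because a signed request legitimately ages in transit, while a timestamp well ahead of the receiver's clock has no such benign explanation.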

Failures degrade gracefully. If the bridge is unreachable or returns a non-2xx, the PII validator logs the failure and passes the request through unredacted. The customer's call still succeeds. The other two validators are in-process and have no network dependency.
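The passthrough behavior might look like the sketch below. The bridge call is injected as a function so the example stays self-contained; BridgeCall, BridgeResponse, and redactOrPassthrough are hypothetical names, and the response shape is a simplified stand-in for a fetch Response. The two log event names are the ones this page lists for PII bridge failures.

```typescript
interface BridgeResult {
  redacted_text: string;
  entities_found: unknown[];
}

// Simplified stand-in for a fetch Response from the bridge.
interface BridgeResponse {
  ok: boolean;
  status: number;
  json(): Promise<BridgeResult>;
}

type BridgeCall = (text: string) => Promise<BridgeResponse>;

// Returns redacted text on success, or the original text if the bridge fails.
async function redactOrPassthrough(
  text: string,
  callBridge: BridgeCall,
  log: (event: string) => void,
): Promise<string> {
  try {
    const res = await callBridge(text);
    if (!res.ok) {
      log("guardrails_pii_bridge_non_2xx");
      return text; // passthrough: the customer's call still succeeds
    }
    const body = await res.json();
    return body.redacted_text;
  } catch {
    // Network error, timeout, or an unparseable body all land here in this sketch.
    log("guardrails_pii_bridge_unreachable");
    return text;
  }
}
```

The trade-off is explicit: a bridge outage means PII can reach the upstream unredacted, in exchange for never failing the caller's request.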

What we don’t do yet

Five deferrals are explicit, not hidden:

  • ML-classifier-backed jailbreak detection. v0.1 is regex-only. False-positive risk is real on long adversarial prompts; sponsors running strict mode should also wire in per-prompt monitoring. v0.2 adds a small transformer classifier (likely the Garak corpus fine-tuned model) as a second-pass when the regex fires.
  • ML-classifier-backed toxicity scoring. v0.1 is banned-token + ratio. False-negative rate on creatively-spelled slurs is high. The shipping API surface keeps room for a Detoxify-class drop-in once we ship v0.2.
  • Streaming response validators. v0.1 inspects the request only. Mid-stream output redaction requires SSE tee + per-chunk classification that we'd rather not ship half-done.
  • Multilingual jailbreak corpus. Regex set is English-only. Non-English jailbreak attempts will slip through v0.1 by construction.
  • Sponsor-authored custom validators. The RequestValidator interface is stable enough to support config-driven JSON rules; we did not ship the config-loader path so we could harden the three baseline validators first.

How to enable

On a workload page in /portal/settings, select a mode from the Guardrails (M12) dropdown and save. KV refresh fires immediately; the next request through the worker honours the new mode without waiting for the 5-minute periodic sync.

Default mode is off. Sponsors must explicitly opt in per workload. Free Sandbox and paid-tier workloads both see the full mode flow; guardrails are a reliability/compliance primitive, not a paid-tier-only feature.

Verification surfaces

M12 is observable in three places:

  • Response headers on a strict-mode blocked request: tessera-guardrails-blocked with value jailbreak_blocked or toxicity_blocked plus the standard x-tessera-request-id for correlation.
  • Worker logs. guardrails_strict_block, guardrails_redact_applied, guardrails_log_only_triggered, guardrails_composition_cap_exceeded cover the four interesting paths. PII bridge failures surface as guardrails_pii_bridge_unreachable or guardrails_pii_bridge_non_2xx.
  • Audit row for every mode flip on audit_events: event_kind='guardrails_mode_change' with payload.from and payload.to. The per-request guardrails_outcome audit row ships in v0.2 when the audit-log ingest path is wired in for the runner output.
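On the caller side, a strict-mode block can be detected from the response headers. The sketch below is an assumption about how a client might do that; readGuardrailsBlock and the ResponseLike shape are hypothetical, and only the header names, the two reason values, and the 403 status come from this page.

```typescript
// Minimal structural type so the sketch works with fetch Response or a test double.
interface ResponseLike {
  status: number;
  headers: { get(name: string): string | null };
}

interface BlockInfo {
  reason: "jailbreak_blocked" | "toxicity_blocked";
  requestId: string | null; // from x-tessera-request-id, for log correlation
}

// Returns block details for a guardrails 403, or null for any other response.
function readGuardrailsBlock(res: ResponseLike): BlockInfo | null {
  if (res.status !== 403) return null;
  const reason = res.headers.get("tessera-guardrails-blocked");
  if (reason === "jailbreak_blocked" || reason === "toxicity_blocked") {
    return { reason, requestId: res.headers.get("x-tessera-request-id") };
  }
  return null; // a 403 without the header came from somewhere else
}
```

A caller on a strict-mode workload could branch on the non-null result to trigger its fall-back path and log the request id for correlation with worker logs.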

What we promise

M12 v0.1 is opt-in and off by default. We do not flip the default to anything else until v0.2 ships with a real classifier and the public framing changes in lockstep. The mode enum and the block header taxonomy are a stable contract. Changes ship with a CHANGELOG entry and a deprecation window.

Worker source: workers/proxy/src/guardrails/ (types · runner · three validators). Bridge source: presidio-bridge/ (Dockerfile · FastAPI app · HMAC verify · pytest suite). Migration: supabase/migrations/0090_guardrails_mode.sql. Parent index: How it works.