Aureliolo · May 19, 2026 · May 18, 2026
@@ -73,6 +73,11 @@
             "type": "command",
             "command": "bash scripts/check_enforce_parallel_tests.sh",
             "timeout": 5000
+          },
+          {
+            "type": "command",
+            "command": "bash scripts/check_no_unapproved_e2e_tests.sh",
+            "timeout": 5000
           }
         ]
       },

@@ -171,12 +171,23 @@ Run these sequentially, fixing as we go:
    uv run mypy --num-workers=4 src/ tests/
    ```
 
-5. **Test:**
+5. **Test (unit suite ONLY):**
 
    ```bash
-   uv run python -m pytest tests/ -n 8
+   uv run python -m pytest tests/ -m unit
    ```
 
+   Run the **unit** suite only. NEVER run the whole tree (`pytest tests/`
+   with no marker), `-m e2e`, or `-m integration` here: that collects and
+   runs integration + e2e + benchmarks (Docker / real services, far
+   slower than the unit baseline) and is never part of the automated
+   pre-PR gate. If a change genuinely needs e2e/integration validation,
+   ask the user first, then run the approved command prefixed with
+   `ALLOW_E2E_TESTS=1` for that single invocation. Do not pass an
+   explicit `-n` (pyproject `addopts` already pins `-n=8 --dist=loadfile`).
+   Both rules are hook-enforced (`scripts/check_no_unapproved_e2e_tests.sh`,
+   `scripts/check_enforce_parallel_tests.sh`).
+
 **Web dashboard checks (steps 6-9):** Run only if `web_src` or `web_test` files changed.
 
 6. **Install dependencies:**

@@ -98,6 +98,18 @@ Every successful **scoped** `provider.complete()` call attributes a `CostRecord`
 
 This pattern mirrors `synthorg.observability.correlation.correlation_scope`, which is the established codebase precedent for cross-cutting per-call context bindings (`request_id` / `task_id` / `agent_id`).
 
+## Cassette Record / Replay
+
+Recorded-LLM **cassettes** make a company run deterministic and free to re-execute: record the exact provider responses of a run keyed by request, then replay them for byte-identical re-execution with zero real LLM calls. Like cost recording, this is a provider-layer concern, not per-driver.
+
+- **Seam**: `CassetteCompletionProvider` (`src/synthorg/providers/cassette/`) wraps an inner driver and overrides the **public** `complete()` / `stream()` / `get_model_capabilities()` / `batch_get_capabilities()`. It deliberately overrides the public methods, not the `_do_*` hooks: `BaseCompletionProvider.complete` merges fresh `_synthorg_latency_ms` / `_synthorg_retry_count` into `provider_metadata` after `_do_complete`, so replaying through `_do_complete` would clobber the recorded metadata and break byte-identical replay. The three `_do_*` hooks are unreachable guards raising `CassetteInternalError`.
+- **Decoration chokepoint**: `ProviderRegistry.from_config(..., cassette=...)` wraps every driver in one shared `CassetteSession` before the registry is frozen, so no consumer (engine, coordinator, judge, runtime builder) can bypass record/replay. In **replay** the inner driver is **not built at all** (no factory call), so a pure replay run constructs no real provider.
+- **Keying**: SHA-256 over the canonical request `(method, provider, model, messages, tools, config)` via `synthorg.versioning.hashing.compute_content_hash`. Repeated identical requests within a run are disambiguated by a **per-task FIFO lane**: each distinct asyncio task is assigned a stable monotonic lane on its first provider call. Replay matching is `(request_hash, lane, seq)`. This is stable across record and replay iff the first-call order of distinct tasks is identical, which the deterministic simulation harness provides; a cassette miss / sequence exhaustion fails loudly (`CassetteReplayMissError` / `CassetteReplayExhaustedError`) and never falls through to a real provider.
+- **Storage**: a single canonical JSON document (filesystem, no DB / no yoyo revision: this is test infrastructure). The session auto-persists after every recorded interaction (crash-safe), written atomically (temp file + rename). `cassette_format_version` gates incompatible formats with `CassetteFormatError`.
+- **Redaction boundary (SEC-1)**: the replay key is hashed on the **raw** request, and the **response / stream / capabilities outcome is stored verbatim** because it *is* the byte-identical replay artefact. Redaction (pluggable `CassetteRedactor`; default `PatternRedactor` scrubs bearer tokens, `sk-` keys, AWS keys, PEM blocks, labelled secrets) applies **only to the human-readable `request_repr`**, which is never consulted for replay. Provider credentials never reach `complete()` (they live in driver config); the residual exposure is a model echoing a prompt secret into its own output, which is accepted and documented (cassettes are dev/test artefacts; default cassette runs use scripted/seeded providers).
+- **Configuration**: `providers.cassette_mode` (`off` / `record` / `replay`) + `providers.cassette_path`, resolved once at the boot site via the Cat-2 bootstrap resolver (env > code default, `read_only_post_init`, `restart_required`); `off` is a structural no-op.
+- **Scope**: the record/replay seam is complete and independently validated under the live engine harness (a recorded multi-turn agent run replays byte-identically with zero real provider calls). Wiring the cassette into the golden-company benchmark suite is owned by the benchmark child issue, not this seam.
+
 ## LiteLLM Integration
 
 The framework uses **LiteLLM** as the provider abstraction layer:

@@ -0,0 +1,199 @@
+#!/usr/bin/env bash
+# PreToolUse(Bash) hook: never run the e2e / integration / whole-tree
+# pytest suite without explicit user approval.
+#
+# Why this exists:
+#   The sanctioned local gate is the UNIT suite only -- the project's
+#   CLAUDE.md Quick Commands separate ``pytest tests/ -m unit`` from
+#   ``-m integration``, ``-m e2e``, and the CodSpeed benchmarks. A bare
+#   ``pytest tests/`` (no marker) collects and RUNS the entire tree:
+#   unit + integration + e2e + benchmarks, under walltime, with the
+#   pinned 8 workers. That is far slower than the ~3:30 unit baseline,
+#   spins up e2e tests that may need Docker / real services, and was
+#   never approved. The /pre-pr-review skill historically instructed
+#   exactly this. The model must NOT run it on its own initiative.
+#
+# Sanctioned (allowed) forms:
+#   * ``-m unit`` marker (the gate), even with a ``tests/`` path
+#   * a path scoped under ``tests/unit`` (with or without a marker)
+#   * a single ``path::test_name`` node id (deliberate, bounded)
+#   * benchmarks / ``--codspeed`` (single-process by design; a
+#     separate gate already governs these)
+#   * any command prefixed with ``ALLOW_E2E_TESTS=1`` -- the explicit,
+#     per-invocation user-approval escape hatch (mirrors the
+#     ``ALLOW_BASELINE_GROWTH=1`` convention)
+#
+# Blocked forms (require ALLOW_E2E_TESTS=1 after user approval):
+#   * ``-m e2e`` / ``-m integration`` (or any marker selecting them)
+#   * a ``tests/e2e`` / ``tests/integration`` path
+#   * a bare whole-tree run: ``pytest tests/`` or ``pytest`` with no
+#     unit scoping at all (collects + runs e2e/integration)
+#
+# Modes: JSON stdin -> inspect command; no stdin -> pass.
+set -euo pipefail
+
+RAW="$(cat)"
+# No stdin (pre-commit, not a tool call): not applicable, pass.
+if [[ -z "${RAW//[[:space:]]/}" ]]; then
+    exit 0
+fi
+# stdin present: it MUST parse. A malformed envelope is an unknown
+# state, not "no opinion" -- fail closed so a corrupted/truncated
+# payload cannot silently bypass the gate.
+if ! COMMAND=$(printf '%s' "$RAW" | jq -r '.tool_input.command // empty' 2>/dev/null); then
+    echo "BLOCKED: malformed PreToolUse JSON envelope; gate fails closed." >&2
+    exit 2
+fi
+
+if [[ -z "$COMMAND" ]]; then
+    exit 0
+fi
+
+# Collapse newlines so a multi-line command is matched as one string.
+FLAT=$(printf '%s' "$COMMAND" | tr '\n\r' '  ')
+
+# Find the actual pytest INVOCATION, not the substring "pytest". A
+# `git commit -m "...pytest tests/e2e..."` merely mentions the word and
+# must not be flagged. Split the command into segments on `;`, `&&`,
+# `||`, `|`; for each, strip leading `VAR=val` env assignments, then
+# require that the segment's program is pytest itself, or `python -m
+# pytest`, optionally behind a `uv run` / `uvx` / `poetry run` / `time`
+# wrapper. The first matching segment is the run we analyse.
+SEGMENTS=$(printf '%s' "$FLAT" | sed -E 's/&&|\|\||;|\|/\n/g')
+PSEG=""
+PSEG_ALLOW=0
+while IFS= read -r seg; do
+    seg="${seg#"${seg%%[![:space:]]*}"}"
+    # The approval token must be a leading env assignment on THIS
+    # segment, not anywhere in the whole command: a token in an
+    # unrelated segment must never unblock a blocked pytest run.
+    env_prefix=$(printf '%s' "$seg" \
+        | sed -E 's/^(([A-Za-z_][A-Za-z0-9_]*=[^[:space:]]*[[:space:]]+)*)?.*/\1/')
+    seg_allow=0
+    if printf '%s' "$env_prefix" \
+        | grep -qE '(^|[[:space:]])ALLOW_E2E_TESTS=1([[:space:]]|$)'; then
+        seg_allow=1
+    fi
+    norm=$(printf '%s' "$seg" \
+        | sed -E 's/^([A-Za-z_][A-Za-z0-9_]*=[^[:space:]]*[[:space:]]+)+//')
+    if printf '%s' "$norm" | grep -qE \
+        '^(time[[:space:]]+)?((uv[[:space:]]+run|uvx|poetry[[:space:]]+run)[[:space:]]+)?(python[0-9.]*[[:space:]]+-m[[:space:]]+pytest|pytest)([[:space:]]|$)'
+    then
+        PSEG="$norm"
+        PSEG_ALLOW=$seg_allow
+        break
+    fi
+done <<EOF
+$SEGMENTS
+EOF
+
+# Not a pytest invocation -- no opinion.
+if [[ -z "$PSEG" ]]; then
+    exit 0
+fi
+
+# Explicit, per-invocation user approval. This is the ONLY sanctioned
+# way to run the e2e/integration/whole-tree suite. The model must not
+# add this token on its own initiative -- it represents the user
+# having said "yes, run it". Bound to the detected pytest segment so a
+# token in an unrelated segment cannot unblock this run.
+if [[ "$PSEG_ALLOW" -eq 1 ]]; then
+    exit 0
+fi
+
+# Benchmarks / CodSpeed are single-process by design and not e2e; a
+# separate gate governs them.
+if echo "$PSEG" | grep -qE '(--codspeed|tests/benchmarks)'; then
+    exit 0
+fi
+
+deny() {
+    local reason=$1
+    local escaped
+    escaped=$(printf '%s' "$reason" | jq -Rsa .)
+    cat <<ENDJSON
+{
+  "hookSpecificOutput": {
+    "hookEventName": "PreToolUse",
+    "permissionDecision": "deny",
+    "permissionDecisionReason": $escaped
+  }
+}
+ENDJSON
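+    # Exit code 2 blocks the tool call, and stderr is what gets fed
+    # back for a blocking exit, so repeat the reason there as well.
+    printf '%s\n' "$reason" >&2
+    exit 2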
+}
+
+DENY_MSG=$(cat <<'MSG'
+BLOCKED: this pytest run would execute the e2e / integration / whole
+test tree, which must NEVER run without explicit user approval. A bare
+`pytest tests/` (no marker) collects AND runs unit + integration + e2e
++ benchmarks -- far slower than the unit baseline and not approved.
+
+Use one of the sanctioned forms instead:
+  * `uv run python -m pytest tests/ -m unit`   (the unit gate)
+  * a path under `tests/unit/...`
+  * a single `path::test_name` node id (one targeted test)
+
+To run e2e / integration / the full suite, ASK THE USER FIRST, then
+prefix the approved command with `ALLOW_E2E_TESTS=1` for that single
+invocation. Do not add that token on your own initiative.
+MSG
+)
+
+# Neutralise the `python -m pytest` module flag so it is not mistaken
+# for the pytest `-m` marker selector.
+SCAN=$(printf '%s' "$PSEG" \
+    | sed -E 's/python[0-9.]*[[:space:]]+-m[[:space:]]+pytest/pytest/g')
+
+# Extract the -m marker expression (best effort: up to the next flag
+# or quote). Handles `-m unit`, `-m=unit`, `-m "not slow"`. `grep -m1`
+# stops at the first match (no `head`, so no SIGPIPE/pipefail trap);
+# `|| true` keeps a no-marker command from aborting under `set -e`.
+MARKER=""
+MRAW=$(printf '%s' "$SCAN" \
+    | grep -m1 -oE "(^|[[:space:]])-m[[:space:]=]+['\"]?[a-zA-Z0-9_ ()]+" \
+    || true)
+if [[ -n "$MRAW" ]]; then
+    MARKER=$(printf '%s' "$MRAW" | sed -E "s/.*-m[[:space:]=]+['\"]?//")
+fi
+
+# A marker that selects e2e/integration is always blocked.
+if echo "$MARKER" | grep -qiwE 'e2e|integration'; then
+    deny "$DENY_MSG"
+fi
+
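+# A `-m unit` marker that does not also select e2e/integration is the
+# sanctioned gate -- allow even with a broad `tests/` path, because
+# the marker guarantees only unit tests execute. A negated expression
+# such as `-m "not unit"` selects everything EXCEPT the unit suite, so
+# it is blocked rather than mistaken for the gate.
+if echo "$MARKER" | grep -qiE '(^|[^a-z0-9_])not[[:space:](]+unit([^a-z0-9_]|$)'; then
+    deny "$DENY_MSG"
+fi
+if echo "$MARKER" | grep -qiwE 'unit'; then
+    exit 0
+fi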
+
+# An explicit e2e / integration path is blocked regardless of marker.
+if echo "$PSEG" | grep -qE '(^|[[:space:]=])tests/(e2e|integration)(/|[[:space:]]|$)'; then
+    deny "$DENY_MSG"
+fi
+
+# A single targeted test (node id) is deliberate and bounded -- allow.
+if echo "$PSEG" | grep -qE '::'; then
+    exit 0
+fi
+
+# A path scoped under tests/unit is the unit suite -- allow.
+if echo "$PSEG" | grep -qE '(^|[[:space:]=])tests/unit(/|[[:space:]]|$)'; then
+    exit 0
+fi
+
+# Whole-tree / unscoped run: a bare `tests` / `tests/` path, or no
+# `tests` path token at all (bare `pytest` collects from rootdir =
+# the whole tree, including e2e/integration). Block it.
+if echo "$PSEG" | grep -qE '(^|[[:space:]=])tests/?([[:space:]]|$)'; then
+    deny "$DENY_MSG"
+fi
+if ! echo "$PSEG" | grep -qE '(^|[[:space:]=])tests/[^[:space:]]'; then
+    deny "$DENY_MSG"
+fi
+
+# Otherwise the run is scoped to a specific non-e2e/integration path
+# (e.g. tests/conformance/...): deliberate and bounded -- allow.
+exit 0
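The precedence of the checks in the hook script above is easier to see as a single function. The following Python sketch is an illustration, not part of the change: `classify` is a hypothetical name, it assumes the pytest segment has already been isolated and stripped of leading env assignments, and it covers only the main precedence order, omitting edge cases such as negated marker expressions.

```python
import re


def classify(segment: str, approved: bool = False) -> str:
    """Return "allow" or "deny" for one already-isolated pytest segment.

    `approved` stands in for a leading ALLOW_E2E_TESTS=1 on that segment.
    """
    if approved:
        return "allow"
    # Benchmarks / CodSpeed are governed by a separate gate.
    if re.search(r"--codspeed|tests/benchmarks", segment):
        return "allow"

    # Neutralise `python -m pytest` so its -m is not read as a marker.
    scan = re.sub(r"python[0-9.]*\s+-m\s+pytest", "pytest", segment)
    match = re.search(r"(?:^|\s)-m[\s=]+['\"]?([A-Za-z0-9_ ()]+)", scan)
    marker = match.group(1) if match else ""

    # Marker checks come first: e2e/integration deny, unit allows.
    if re.search(r"\b(?:e2e|integration)\b", marker, re.IGNORECASE):
        return "deny"
    if re.search(r"\bunit\b", marker, re.IGNORECASE):
        return "allow"

    # Path checks, in the same order as the script.
    if re.search(r"(?:^|[\s=])tests/(?:e2e|integration)(?:/|\s|$)", segment):
        return "deny"
    if "::" in segment:
        return "allow"  # single node id: deliberate and bounded
    if re.search(r"(?:^|[\s=])tests/unit(?:/|\s|$)", segment):
        return "allow"
    if re.search(r"(?:^|[\s=])tests/?(?:\s|$)", segment):
        return "deny"   # bare whole-tree path
    if not re.search(r"(?:^|[\s=])tests/\S", segment):
        return "deny"   # no tests path at all: collects from rootdir
    return "allow"      # some other scoped path, e.g. tests/conformance/...
```

For example, `classify("uv run python -m pytest tests/ -m unit")` is allowed by the marker rule, while `classify("pytest tests/")` falls through to the bare whole-tree rule and is denied.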
@@ -53,6 +53,7 @@
     from synthorg.hr.registry import AgentRegistryService
     from synthorg.ontology.service import OntologyService
     from synthorg.persistence.protocol import PersistenceBackend
+    from synthorg.providers.cassette import CassetteConfig
     from synthorg.security.timeout.scheduler import ApprovalTimeoutScheduler
     from synthorg.settings.dispatcher import SettingsChangeDispatcher
     from synthorg.settings.service import SettingsService
@@ -198,12 +199,46 @@ def _wire_cost_tracker(effective_config: RootConfig) -> CostTracker:
     return tracker
 
 
+def _resolve_cassette_config() -> CassetteConfig | None:
+    """Resolve the boot-time cassette config (Cat-2: env > default).
+
+    Returns ``None`` when the seam is inert so the registry holds the
+    concrete drivers unchanged. Uses the sanctioned pre-init bootstrap
+    resolver -- no ``os.environ`` read in provider code.
+    """
+    from pathlib import Path  # noqa: PLC0415
+
+    from synthorg.providers.cassette import (  # noqa: PLC0415
+        CassetteConfig,
+        CassetteMode,
+    )
+    from synthorg.settings.bootstrap_resolver import (  # noqa: PLC0415
+        resolve_init_value,
+    )
+    from synthorg.settings.enums import SettingNamespace  # noqa: PLC0415
+
+    mode_raw = str(
+        resolve_init_value(SettingNamespace.PROVIDERS, "cassette_mode").value
+    ).strip()
+    mode = CassetteMode(mode_raw)
+    if mode is CassetteMode.OFF:
+        return None
+    path_resolved = resolve_init_value(
+        SettingNamespace.PROVIDERS, "cassette_path"
+    ).value
+    path = Path(str(path_resolved)) if path_resolved else None
+    return CassetteConfig(mode=mode, path=path)
+
+
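The resolution order used by `_resolve_cassette_config` (environment value if present, otherwise the code default, with `off` mapping to `None`) can be sketched in isolation. This is a hedged illustration only: the `EXAMPLE_PROVIDERS_*` variable names, `DEFAULTS`, `Mode`, `Config`, `resolve`, and `resolve_cassette` are all hypothetical, and the real lookup goes through `resolve_init_value`, whose internals are not shown in this diff.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Mapping, Optional


class Mode(Enum):
    OFF = "off"
    RECORD = "record"
    REPLAY = "replay"


@dataclass(frozen=True)
class Config:
    mode: Mode
    path: Optional[Path]


# Hypothetical code defaults; "off" keeps the seam inert.
DEFAULTS = {"cassette_mode": "off", "cassette_path": ""}


def resolve(env: Mapping[str, str], key: str) -> str:
    """Environment wins over the code default (env-var naming is an assumption)."""
    return env.get(f"EXAMPLE_PROVIDERS_{key.upper()}", DEFAULTS[key])


def resolve_cassette(env: Mapping[str, str]) -> Optional[Config]:
    # An unknown mode string raises ValueError here, so a typo fails at boot.
    mode = Mode(resolve(env, "cassette_mode").strip())
    if mode is Mode.OFF:
        return None  # inert: the registry keeps the concrete drivers
    raw_path = resolve(env, "cassette_path")
    return Config(mode=mode, path=Path(raw_path) if raw_path else None)
```

Passing the environment in as a mapping keeps the sketch testable without touching `os.environ`, in the same spirit as the docstring's rule that provider code does not read the environment directly.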
 def _wire_provider_registry(
     effective_config: RootConfig,
 ) -> ProviderRegistry:
     """Create a ProviderRegistry from config."""
     try:
-        registry = ProviderRegistry.from_config(effective_config.providers)
+        registry = ProviderRegistry.from_config(
+            effective_config.providers,
+            cassette=_resolve_cassette_config(),
+        )
     except Exception as exc:
         logger.error(
             API_APP_STARTUP,

@@ -133,6 +133,15 @@
 PROVIDER_COST_SKIPPED: Final[str] = "provider.cost.skipped"
 PROVIDER_COST_FAILED: Final[str] = "provider.cost.failed"
 
+# ── Provider cassette record / replay ────────────────────────
+PROVIDER_CASSETTE_DRIVER_WRAPPED: Final[str] = "provider.cassette.driver_wrapped"
+PROVIDER_CASSETTE_RECORDED: Final[str] = "provider.cassette.recorded"
+PROVIDER_CASSETTE_REPLAYED: Final[str] = "provider.cassette.replayed"
+PROVIDER_CASSETTE_MISS: Final[str] = "provider.cassette.miss"
+PROVIDER_CASSETTE_EXHAUSTED: Final[str] = "provider.cassette.exhausted"
+PROVIDER_CASSETTE_FORMAT_ERROR: Final[str] = "provider.cassette.format_error"
+PROVIDER_CASSETTE_SESSION_FLUSHED: Final[str] = "provider.cassette.session_flushed"
+
 # ── Local model management ──────────────────────────────────
 PROVIDER_MODEL_PULL_STARTED: Final[str] = "provider.model.pull_started"
 PROVIDER_MODEL_PULL_COMPLETED: Final[str] = "provider.model.pull_completed"

@@ -6,6 +6,20 @@
 
 from .base import BaseCompletionProvider
 from .capabilities import ModelCapabilities
+from .cassette import (
+    CASSETTE_FORMAT_VERSION,
+    CassetteCompletionProvider,
+    CassetteConfig,
+    CassetteError,
+    CassetteFormatError,
+    CassetteMode,
+    CassetteRedactor,
+    CassetteReplayExhaustedError,
+    CassetteReplayMissError,
+    CassetteSession,
+    NullRedactor,
+    PatternRedactor,
+)
 from .cost_recording import (
     CostRecordingContext,
     cost_recording_scope,
@@ -73,6 +87,7 @@
 )
 
 __all__ = [
+    "CASSETTE_FORMAT_VERSION",
     "STRATEGY_MAP",
     "STRATEGY_NAME_CHEAPEST",
     "STRATEGY_NAME_COST_AWARE",
@@ -83,6 +98,15 @@
     "ZERO_TOKEN_USAGE",
     "AuthenticationError",
     "BaseCompletionProvider",
+    "CassetteCompletionProvider",
+    "CassetteConfig",
+    "CassetteError",
+    "CassetteFormatError",
+    "CassetteMode",
+    "CassetteRedactor",
+    "CassetteReplayExhaustedError",
+    "CassetteReplayMissError",
+    "CassetteSession",
     "ChatMessage",
     "CompletionConfig",
     "CompletionProvider",
@@ -105,6 +129,8 @@
     "ModelResolver",
     "ModelRouter",
     "NoAvailableModelError",
+    "NullRedactor",
+    "PatternRedactor",
     "ProviderConnectionError",
     "ProviderError",
     "ProviderInternalError",