Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .claude/settings.json
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,11 @@
"type": "command",
"command": "bash scripts/check_enforce_parallel_tests.sh",
"timeout": 5000
},
{
"type": "command",
"command": "bash scripts/check_no_unapproved_e2e_tests.sh",
"timeout": 5000
}
]
},
Expand Down
15 changes: 13 additions & 2 deletions .claude/skills/pre-pr-review/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -171,12 +171,23 @@ Run these sequentially, fixing as we go:
uv run mypy --num-workers=4 src/ tests/
```

5. **Test:**
5. **Test (unit suite ONLY):**

```bash
uv run python -m pytest tests/ -n 8
uv run python -m pytest tests/ -m unit
```

Run the **unit** suite only. NEVER run the whole tree (`pytest tests/`
with no marker), `-m e2e`, or `-m integration` here: that collects and
runs integration + e2e + benchmarks (Docker / real services, far
slower than the unit baseline) and is never part of the automated
pre-PR gate. If a change genuinely needs e2e/integration validation,
ask the user first, then run the approved command prefixed with
`ALLOW_E2E_TESTS=1` for that single invocation. Do not pass an
explicit `-n` (pyproject `addopts` already pins `-n=8 --dist=loadfile`).
Both rules are hook-enforced (`scripts/check_no_unapproved_e2e_tests.sh`,
`scripts/check_enforce_parallel_tests.sh`).

**Web dashboard checks (steps 6-9):** Run only if `web_src` or `web_test` files changed.

6. **Install dependencies:**
Expand Down
12 changes: 12 additions & 0 deletions docs/design/providers.md
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,18 @@ Every successful **scoped** `provider.complete()` call attributes a `CostRecord`

This pattern mirrors `synthorg.observability.correlation.correlation_scope`, which is the established codebase precedent for cross-cutting per-call context bindings (`request_id` / `task_id` / `agent_id`).

## Cassette Record / Replay

Recorded-LLM **cassettes** make a company run deterministic and free to re-execute: record the exact provider responses of a run keyed by request, then replay them for byte-identical re-execution with zero real LLM calls. Like cost recording, this is a provider-layer concern, not per-driver.

- **Seam**: `CassetteCompletionProvider` (`src/synthorg/providers/cassette/`) wraps an inner driver and overrides the **public** `complete()` / `stream()` / `get_model_capabilities()` / `batch_get_capabilities()`. It deliberately overrides the public methods, not the `_do_*` hooks: `BaseCompletionProvider.complete` merges fresh `_synthorg_latency_ms` / `_synthorg_retry_count` into `provider_metadata` after `_do_complete`, so replaying through `_do_complete` would clobber the recorded metadata and break byte-identical replay. The three `_do_*` hooks are unreachable guards raising `CassetteInternalError`.
- **Decoration chokepoint**: `ProviderRegistry.from_config(..., cassette=...)` wraps every driver in one shared `CassetteSession` before the registry is frozen, so no consumer (engine, coordinator, judge, runtime builder) can bypass record/replay. In **replay** the inner driver is **not built at all** (no factory call), so a pure replay run constructs no real provider.
- **Keying**: SHA-256 over the canonical request `(method, provider, model, messages, tools, config)` via `synthorg.versioning.hashing.compute_content_hash`. Repeated identical requests within a run are disambiguated by a **per-task FIFO lane**: each distinct asyncio task is assigned a stable monotonic lane on its first provider call. Replay matching is `(request_hash, lane, seq)`. This is stable across record and replay iff the first-call order of distinct tasks is identical, which the deterministic simulation harness provides; a cassette miss / sequence exhaustion fails loudly (`CassetteReplayMissError` / `CassetteReplayExhaustedError`) and never falls through to a real provider.
- **Storage**: a single canonical JSON document (filesystem, no DB / no yoyo revision: this is test infrastructure). The session auto-persists after every recorded interaction (crash-safe), written atomically (temp file + rename). `cassette_format_version` gates incompatible formats with `CassetteFormatError`.
- **Redaction boundary (SEC-1)**: the replay key is hashed on the **raw** request, and the **response / stream / capabilities outcome is stored verbatim** because it *is* the byte-identical replay artefact. Redaction (pluggable `CassetteRedactor`; default `PatternRedactor` scrubs bearer tokens, `sk-` keys, AWS keys, PEM blocks, labelled secrets) applies **only to the human-readable `request_repr`**, which is never consulted for replay. Provider credentials never reach `complete()` (they live in driver config); the residual exposure is a model echoing a prompt secret into its own output, which is accepted and documented (cassettes are dev/test artefacts; default cassette runs use scripted/seeded providers).
- **Configuration**: `providers.cassette_mode` (`off` / `record` / `replay`) + `providers.cassette_path`, resolved once at the boot site via the Cat-2 bootstrap resolver (env > code default, `read_only_post_init`, `restart_required`); `off` is a structural no-op.
- **Scope**: the record/replay seam is complete and independently validated under the live engine harness (a recorded multi-turn agent run replays byte-identically with zero real provider calls). Wiring the cassette into the golden-company benchmark suite is owned by the benchmark child issue, not this seam.

## LiteLLM Integration

The framework uses **LiteLLM** as the provider abstraction layer:
Expand Down
199 changes: 199 additions & 0 deletions scripts/check_no_unapproved_e2e_tests.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,199 @@
#!/usr/bin/env bash
# PreToolUse(Bash) hook: never run the e2e / integration / whole-tree
# pytest suite without explicit user approval.
#
# Why this exists:
# The sanctioned local gate is the UNIT suite only -- the project's
# CLAUDE.md Quick Commands separate ``pytest tests/ -m unit`` from
# ``-m integration``, ``-m e2e``, and the CodSpeed benchmarks. A bare
# ``pytest tests/`` (no marker) collects and RUNS the entire tree:
# unit + integration + e2e + benchmarks, under walltime, with the
# pinned 8 workers. That is far slower than the ~3:30 unit baseline,
# spins up e2e tests that may need Docker / real services, and was
# never approved. The /pre-pr-review skill historically instructed
# exactly this. The model must NOT run it on its own initiative.
#
# Sanctioned (allowed) forms:
# * ``-m unit`` marker (the gate), even with a ``tests/`` path
# * a path scoped under ``tests/unit`` (with or without a marker)
# * a single ``path::test_name`` node id (deliberate, bounded)
# * benchmarks / ``--codspeed`` (single-process by design; a
# separate gate already governs these)
# * any command prefixed with ``ALLOW_E2E_TESTS=1`` -- the explicit,
# per-invocation user-approval escape hatch (mirrors the
# ``ALLOW_BASELINE_GROWTH=1`` convention)
#
# Blocked forms (require ALLOW_E2E_TESTS=1 after user approval):
# * ``-m e2e`` / ``-m integration`` (or any marker selecting them)
# * a ``tests/e2e`` / ``tests/integration`` path
# * a bare whole-tree run: ``pytest tests/`` or ``pytest`` with no
# unit scoping at all (collects + runs e2e/integration)
#
# Modes: JSON stdin -> inspect command; no stdin -> pass.
set -euo pipefail

RAW="$(cat)"
# No stdin (pre-commit, not a tool call): not applicable, pass.
if [[ -z "${RAW//[[:space:]]/}" ]]; then
exit 0
fi
# stdin present: it MUST parse. A malformed envelope is an unknown
# state, not "no opinion" -- fail closed so a corrupted/truncated
# payload cannot silently bypass the gate.
if ! COMMAND=$(printf '%s' "$RAW" | jq -r '.tool_input.command // empty' 2>/dev/null); then
echo "BLOCKED: malformed PreToolUse JSON envelope; gate fails closed." >&2
exit 2
fi

if [[ -z "$COMMAND" ]]; then
exit 0
fi

# Collapse newlines so a multi-line command is matched as one string.
FLAT=$(printf '%s' "$COMMAND" | tr '\n\r' ' ')

# Find the actual pytest INVOCATION, not the substring "pytest". A
# `git commit -m "...pytest tests/e2e..."` merely mentions the word and
# must not be flagged. Split the command into segments on `;`, `&&`,
# `||`, `|`; for each, strip leading `VAR=val` env assignments, then
# require that the segment's program is pytest itself, or `python -m
# pytest`, optionally behind a `uv run` / `uvx` / `poetry run` / `time`
# wrapper. The first matching segment is the run we analyse.
SEGMENTS=$(printf '%s' "$FLAT" | sed -E 's/&&|\|\||;|\|/\n/g')
PSEG=""
PSEG_ALLOW=0
while IFS= read -r seg; do
seg="${seg#"${seg%%[![:space:]]*}"}"
# The approval token must be a leading env assignment on THIS
# segment, not anywhere in the whole command: a token in an
# unrelated segment must never unblock a blocked pytest run.
env_prefix=$(printf '%s' "$seg" \
| sed -E 's/^(([A-Za-z_][A-Za-z0-9_]*=[^[:space:]]*[[:space:]]+)*)?.*/\1/')
seg_allow=0
if printf '%s' "$env_prefix" \
| grep -qE '(^|[[:space:]])ALLOW_E2E_TESTS=1([[:space:]]|$)'; then
seg_allow=1
fi
norm=$(printf '%s' "$seg" \
| sed -E 's/^([A-Za-z_][A-Za-z0-9_]*=[^[:space:]]*[[:space:]]+)+//')
if printf '%s' "$norm" | grep -qE \
'^(time[[:space:]]+)?((uv[[:space:]]+run|uvx|poetry[[:space:]]+run)[[:space:]]+)?(python[0-9.]*[[:space:]]+-m[[:space:]]+pytest|pytest)([[:space:]]|$)'
then
PSEG="$norm"
PSEG_ALLOW=$seg_allow
break
fi
done <<EOF
$SEGMENTS
EOF

# Not a pytest invocation -- no opinion.
if [[ -z "$PSEG" ]]; then
exit 0
fi

# Explicit, per-invocation user approval. This is the ONLY sanctioned
# way to run the e2e/integration/whole-tree suite. The model must not
# add this token on its own initiative -- it represents the user
# having said "yes, run it". Bound to the detected pytest segment so a
# token in an unrelated segment cannot unblock this run.
if [[ "$PSEG_ALLOW" -eq 1 ]]; then
exit 0
fi

# Benchmarks / CodSpeed are single-process by design and not e2e; a
# separate gate governs them.
if echo "$PSEG" | grep -qE '(--codspeed|tests/benchmarks)'; then
exit 0
fi

deny() {
local reason=$1
local escaped
escaped=$(printf '%s' "$reason" | jq -Rsa .)
cat <<ENDJSON
{
"hookSpecificOutput": {
"hookEventName": "PreToolUse",
"permissionDecision": "deny",
"permissionDecisionReason": $escaped
}
}
ENDJSON
exit 2
}

DENY_MSG=$(cat <<'MSG'
BLOCKED: this pytest run would execute the e2e / integration / whole
test tree, which must NEVER run without explicit user approval. A bare
`pytest tests/` (no marker) collects AND runs unit + integration + e2e
+ benchmarks -- far slower than the unit baseline and not approved.

Use one of the sanctioned forms instead:
* `uv run python -m pytest tests/ -m unit` (the unit gate)
* a path under `tests/unit/...`
* a single `path::test_name` node id (one targeted test)

To run e2e / integration / the full suite, ASK THE USER FIRST, then
prefix the approved command with `ALLOW_E2E_TESTS=1` for that single
invocation. Do not add that token on your own initiative.
MSG
)

# Neutralise the `python -m pytest` module flag so it is not mistaken
# for the pytest `-m` marker selector.
SCAN=$(printf '%s' "$PSEG" \
| sed -E 's/python[0-9.]*[[:space:]]+-m[[:space:]]+pytest/pytest/g')

# Extract the -m marker expression (best effort: up to the next flag
# or quote). Handles `-m unit`, `-m=unit`, `-m "not slow"`. `grep -m1`
# stops at the first match (no `head`, so no SIGPIPE/pipefail trap);
# `|| true` keeps a no-marker command from aborting under `set -e`.
MARKER=""
MRAW=$(printf '%s' "$SCAN" \
| grep -m1 -oE "(^|[[:space:]])-m[[:space:]=]+['\"]?[a-zA-Z0-9_ ()]+" \
|| true)
if [[ -n "$MRAW" ]]; then
MARKER=$(printf '%s' "$MRAW" | sed -E "s/.*-m[[:space:]=]+['\"]?//")
fi

# A marker that selects e2e/integration is always blocked.
if echo "$MARKER" | grep -qiwE 'e2e|integration'; then
deny "$DENY_MSG"
fi

# `-m unit` (and it does not also pull e2e/integration) is the
# sanctioned gate -- allow even with a broad `tests/` path, because
# the marker guarantees only unit tests execute.
if echo "$MARKER" | grep -qiwE 'unit'; then
exit 0
fi

# An explicit e2e / integration path is blocked regardless of marker.
if echo "$PSEG" | grep -qE '(^|[[:space:]=])tests/(e2e|integration)(/|[[:space:]]|$)'; then
deny "$DENY_MSG"
fi

# A single targeted test (node id) is deliberate and bounded -- allow.
if echo "$PSEG" | grep -qE '::'; then
exit 0
fi

# A path scoped under tests/unit is the unit suite -- allow.
if echo "$PSEG" | grep -qE '(^|[[:space:]=])tests/unit(/|[[:space:]]|$)'; then
exit 0
fi

# Whole-tree / unscoped run: a bare `tests` / `tests/` path, or no
# `tests` path token at all (bare `pytest` collects from rootdir =
# the whole tree, including e2e/integration). Block it.
if echo "$PSEG" | grep -qE '(^|[[:space:]=])tests/?([[:space:]]|$)'; then
deny "$DENY_MSG"
fi
if ! echo "$PSEG" | grep -qE '(^|[[:space:]=])tests/[^[:space:]]'; then
deny "$DENY_MSG"
fi

# Otherwise the run is scoped to a specific non-e2e/integration path
# (e.g. tests/conformance/...): deliberate and bounded -- allow.
exit 0
37 changes: 36 additions & 1 deletion src/synthorg/api/auto_wire.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@
from synthorg.hr.registry import AgentRegistryService
from synthorg.ontology.service import OntologyService
from synthorg.persistence.protocol import PersistenceBackend
from synthorg.providers.cassette import CassetteConfig
from synthorg.security.timeout.scheduler import ApprovalTimeoutScheduler
from synthorg.settings.dispatcher import SettingsChangeDispatcher
from synthorg.settings.service import SettingsService
Expand Down Expand Up @@ -198,12 +199,46 @@ def _wire_cost_tracker(effective_config: RootConfig) -> CostTracker:
return tracker


def _resolve_cassette_config() -> CassetteConfig | None:
"""Resolve the boot-time cassette config (Cat-2: env > default).

Returns ``None`` when the seam is inert so the registry holds the
concrete drivers unchanged. Uses the sanctioned pre-init bootstrap
resolver -- no ``os.environ`` read in provider code.
"""
from pathlib import Path # noqa: PLC0415

from synthorg.providers.cassette import ( # noqa: PLC0415
CassetteConfig,
CassetteMode,
)
from synthorg.settings.bootstrap_resolver import ( # noqa: PLC0415
resolve_init_value,
)
from synthorg.settings.enums import SettingNamespace # noqa: PLC0415

mode_raw = str(
resolve_init_value(SettingNamespace.PROVIDERS, "cassette_mode").value
).strip()
mode = CassetteMode(mode_raw)
if mode is CassetteMode.OFF:
return None
path_resolved = resolve_init_value(
SettingNamespace.PROVIDERS, "cassette_path"
).value
path = Path(str(path_resolved)) if path_resolved else None
return CassetteConfig(mode=mode, path=path)


def _wire_provider_registry(
effective_config: RootConfig,
) -> ProviderRegistry:
"""Create a ProviderRegistry from config."""
try:
registry = ProviderRegistry.from_config(effective_config.providers)
registry = ProviderRegistry.from_config(
effective_config.providers,
cassette=_resolve_cassette_config(),
)
except Exception as exc:
logger.error(
API_APP_STARTUP,
Expand Down
9 changes: 9 additions & 0 deletions src/synthorg/observability/events/provider.py
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,15 @@
PROVIDER_COST_SKIPPED: Final[str] = "provider.cost.skipped"
PROVIDER_COST_FAILED: Final[str] = "provider.cost.failed"

# ── Provider cassette record / replay ────────────────────────
PROVIDER_CASSETTE_DRIVER_WRAPPED: Final[str] = "provider.cassette.driver_wrapped"
PROVIDER_CASSETTE_RECORDED: Final[str] = "provider.cassette.recorded"
PROVIDER_CASSETTE_REPLAYED: Final[str] = "provider.cassette.replayed"
PROVIDER_CASSETTE_MISS: Final[str] = "provider.cassette.miss"
PROVIDER_CASSETTE_EXHAUSTED: Final[str] = "provider.cassette.exhausted"
PROVIDER_CASSETTE_FORMAT_ERROR: Final[str] = "provider.cassette.format_error"
PROVIDER_CASSETTE_SESSION_FLUSHED: Final[str] = "provider.cassette.session_flushed"

# ── Local model management ──────────────────────────────────
PROVIDER_MODEL_PULL_STARTED: Final[str] = "provider.model.pull_started"
PROVIDER_MODEL_PULL_COMPLETED: Final[str] = "provider.model.pull_completed"
Expand Down
26 changes: 26 additions & 0 deletions src/synthorg/providers/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,20 @@

from .base import BaseCompletionProvider
from .capabilities import ModelCapabilities
from .cassette import (
CASSETTE_FORMAT_VERSION,
CassetteCompletionProvider,
CassetteConfig,
CassetteError,
CassetteFormatError,
CassetteMode,
CassetteRedactor,
CassetteReplayExhaustedError,
CassetteReplayMissError,
CassetteSession,
NullRedactor,
PatternRedactor,
)
from .cost_recording import (
CostRecordingContext,
cost_recording_scope,
Expand Down Expand Up @@ -73,6 +87,7 @@
)

__all__ = [
"CASSETTE_FORMAT_VERSION",
"STRATEGY_MAP",
"STRATEGY_NAME_CHEAPEST",
"STRATEGY_NAME_COST_AWARE",
Expand All @@ -83,6 +98,15 @@
"ZERO_TOKEN_USAGE",
"AuthenticationError",
"BaseCompletionProvider",
"CassetteCompletionProvider",
"CassetteConfig",
"CassetteError",
"CassetteFormatError",
"CassetteMode",
"CassetteRedactor",
"CassetteReplayExhaustedError",
"CassetteReplayMissError",
"CassetteSession",
"ChatMessage",
"CompletionConfig",
"CompletionProvider",
Expand All @@ -105,6 +129,8 @@
"ModelResolver",
"ModelRouter",
"NoAvailableModelError",
"NullRedactor",
"PatternRedactor",
"ProviderConnectionError",
"ProviderError",
"ProviderInternalError",
Expand Down
Loading
Loading