feat(evals): #1980 spine -- scoring + data contract for golden-company benchmark by Aureliolo · Pull Request #2025 · Aureliolo/synthorg

Aureliolo · 2026-05-20T16:01:11Z

Part of #1979. Lands the scoring + data layer of the golden-company benchmark (eval spine), as documented in the Status (2026-05-20) section of the linked issue.

This PR intentionally does not close #1980. The runner (cassette guard + telemetry collector + orchestrator + CLI + baselines + recorded cassette + make benchmark target) is held until a runtime prerequisite gap closes. See "Prerequisites for the runner PR" in #1980 for the full follow-up plan.

What landed

The complete scoring + data contract for the eval spine, 68 unit tests passing:

evals/models/{brief,scorecard}.py -- frozen Pydantic v2 models, schema-versioned (SCORECARD_SCHEMA_VERSION = 1).
evals/loader/{briefs,anchors}.py -- YAML loaders via synthorg.api.boundary.parse_typed; path-traversal containment on anchor-set loading.
evals/scoring/executable.py -- subprocess grader (hidden / build / lint, weighted 60/20/20); rejects shell metacharacters at load time as defence-in-depth.
evals/scoring/judged.py + spearman.py -- ordinal calibration with a Spearman rho gate (>= 0.7); refuses to score when the judge's ordering disagrees with the hand-scored anchor set.
evals/scoring/penalties.py + aggregate.py -- process-fact penalty table with cap + floor.
evals/emit/{json_writer,markdown_writer}.py -- deterministic JSON + Markdown emitters.
evals/errors.py -- DomainError-rooted hierarchy.
pyproject.toml -- evals/ wired for ruff (src), pyright (extraPaths), pytest (pythonpath), isort (known-first-party).
CLAUDE.md -- layout line now mentions the new evals/ top-level directory.

Pydantic models carry cross-field invariant validators so the scorecard is auditable from any single field: BriefResult enforces score == max(grade - deduction, GRADE_FLOOR); JudgeCalibrationReport enforces passed == (rho >= gate); PenaltyEntry enforces raw == points_per_event * count and applied <= raw; AggregatedProcessFacts enforces total_events == sum(events_by_class); ProcessFactReport enforces entries match the tracked events_by_class slice; CheckOutcome ties timed_out=True to the POSIX timeout sentinel; ExecutableGrade enforces pass-flags match per-label outcomes; ScriptedJudge requires non-empty responses or default scores; JudgedGrade refuses construction with a failing calibration; AnchorItem rejects hand_scores outside [0.0, 1.0]; HiddenCheckSpec rejects shell metacharacters in cmd[0].

What is NOT yet built (follow-up PR scope)

evals/runner/cassette_guard.py -- cassette SHA256 + REPLAY enforcement.
evals/runner/telemetry_collector.py -- structlog event bucketing.
evals/runner/orchestrator.py -- per-brief boot/submit/wait/collect loop.
evals/run.py + evals/__main__.py -- CLI entry point.
evals/baselines/{reference,broken}.yaml, evals/briefs/*.yaml, evals/cassettes/reference_run.cassette.json.
Makefile target + .gitignore.
The acceptance test test_runner_broken_scores_worse.py going green (currently uses pytest.importorskip("evals.run")).

Why the runner did NOT land in this PR

A runtime spike (create_app() + submit one TaskRequirement via the IntakeEngine) confirmed the studio's agent-execution path is not yet wired end-to-end. Intake transitions fire (intake.request.received -> triaging -> scoped), then TaskEngineNotRunningError raises. No BUDGET_* / EXECUTION_* / STAGNATION_* / APPROVAL_GATE_* events appear. Without those events, broken.yaml and reference.yaml would produce identical empty process-fact traces and the scorer could not discriminate. Per project rule "admit you cannot in plain words, BEFORE": surfacing this gap rather than shipping a green acceptance test that asserts nothing.

Prerequisites tracked in #1980's Status section.

Test plan

uv run python -m pytest tests/evals_spine/ -m unit

Result locally: 68 passed in 9.13s.

uv run ruff check src/ tests/ evals/
uv run ruff format --check src/ tests/ evals/
uv run mypy --num-workers=4 tests/evals_spine/ evals/

All green. Pre-push hooks (affected mypy + pytest, ruff, em-dash gate, review-origin gate, migration-framing gate, secret detection, etc.) ran on the push and passed.

Review coverage

/pre-pr-review ran 9 agents in parallel against the spine. 22 findings consolidated in _audit/pre-pr-review/triage.md:

2 CRITICAL: path-traversal containment, duplicate SCORECARD_SCHEMA_VERSION -- both fixed.
5 MAJOR valid (wrong exception type, three issue back-refs in docstrings, missing evals/ in CLAUDE.md layout) -- all fixed.
1 MAJOR invalid (PEP 758 syntax: agent misread CLAUDE.md's "no parens unless binding"; we ARE binding as exc, parens required) -- rejected with rationale.
11 MEDIUM valid (function-length refactor, shell-metachar guard, test-quality cleanups, 4 cross-field validators) -- all fixed.
3 LOW deferred -- 2 implemented (extra validators on trusted-construction types, JudgeProtocol docstring polish); 1 deferred (Brief discriminated union -- the review agent itself flagged it as a pragmatic trade-off and the current post-validator catches the same XOR errors; restructuring would force a substantial test refactor for marginal benefit).

Net: 19 of 22 review findings applied in this PR, with rationale recorded in the commit body of 815ac88f and in the triage file.

Part of EPIC: Best-in-class operate tier #1979 (operate-tier EPIC).
Depends on Client-simulation runtime live (= e2e test harness) #1961 (client-simulation harness, already in place) + EPIC EPIC: Autonomous product studio substrate #1973 (runnable studio, partial -- runtime gap blocks the runner PR) + Agent runtime online + minimal safety spine (runtime root) #1956.
Blocks Learning loop + curve #1983 (learning curve), Continual-improvement loop (finetuning done right) #1990, Self-extending toolkit #1995, Stakes-aware model routing #1998 -- all consume the scorecard schema landing here.

Closes #1962. Child of EPIC #1955. ## Summary Wires `DirectIntake`/`AgentIntake` as a live, non-simulated work-entry path. A real `ClientRequest` submitted via `POST /requests` and approved via `POST /requests/{id}/approve` now flows through the work pipeline spine so an agent actually executes it (proven under the simulation harness with a scripted provider). The pipeline spine (#1960) already maps `WorkItem -> ClientRequest -> intake_engine.process()` in its intake phase; the gap was that `approve_request` called `intake_engine.process()` directly and produced a `CREATED` task no agent ever ran. This PR introduces a thin work-entry adapter seam, hands approve at `POST /requests/{id}/approve` to a background pipeline run, and reconciles the stored request to its terminal status. ## What changed - **New `WorkEntryAdapter` protocol + `IntakeEntryAdapter` + `build_work_entry_adapter` factory** under `src/synthorg/engine/pipeline/entry/`. Source-keyed dispatch (`WorkSource.INTAKE` ships now; siblings #1963 / #1964 add their arms here). - **`POST /requests/{id}/approve` rewritten** to return **HTTP 202 Accepted** with the request walked to `APPROVED`; a background coroutine (`process_intake_pipeline`) drives the adapter -> pipeline and reconciles the stored request to `TASK_CREATED` (or `CANCELLED` on intake rejection / pipeline failure). Terminal failures log ERROR + cancel the request as a fallback so a request can never silently hang in `APPROVED`. `AgentRuntimeNotConfiguredError` (409, taxonomy 4014) when no adapter / pipeline is wired (empty company). - **New `simulations.intake_default_project` setting** (`read_only_post_init`, default `"client-intake"`); the project is created at boot if absent. One resolved value drives both the intake strategy's `project=` and the adapter's `WorkItem.project`, so the pipeline's project-existence check and the created task agree. - **New `WsEventType.REQUEST_TASK_CREATED`** + matching `WsRequestTaskCreatedPayload`; the terminal hop now has a dedicated event (the intermediate `request.approved` still fires for the sync `APPROVED` walk). - **`AppState.intake_entry_adapter` seam** (once-only set + has_ + swap, mirrors `work_pipeline`); boot wiring in `engine.pipeline.entry.boot.wire_real_intake_entry` is called from both `_install_runtime_services` and the post-setup `_rebuild_runtime_services`. - **Manifest discipline (hard EPIC criterion)**: `scripts/_ghost_wiring_manifest.txt` now has `ENFORCED` lines for `DirectIntake`, `AgentIntake`, `IntakeEntryAdapter`, `build_work_entry_adapter`. The `no-ghost-wiring` gate passes. - **Docs**: `docs/design/client-simulation.md` documents the real work-entry path + the new setting + the new factory row; `docs/design/verification-quality.md` banner updated (intake engine online); `docs/design/page-structure.md` URL routing map gains the `/clients` / `/clients/requests` / `/clients/simulations` / `/clients/reviews/:taskId` / `/clients/:clientId` routes. - **Frontend**: regenerated `web/src/api/types/openapi.gen.ts` for the 202 status, added `'request.task_created'` to `websocket.ts`, updated the MSW approve mock to return 202. ## Tests - New unit tests: `tests/unit/engine/pipeline/entry/{test_intake_adapter,test_factory,test_boot}.py`, `tests/unit/api/controllers/test_requests_intake_pipeline.py` (success / WorkIntakeRejected / WorkPipelineError / concurrent-terminal / MemoryError re-raise), `tests/unit/api/test_state.py` adapter seam, `tests/unit/client/test_runtime_builder.py` default-project threading + `intake_default_project` env override, `tests/unit/settings/test_intake_settings.py`. - New integration tests in `tests/integration/api/controllers/test_client_simulation.py`: approve returns 202 + spawns the pipeline task; approve without an adapter returns 409. - New e2e `tests/e2e/test_real_intake_entry_e2e.py`: builds the production runtime via `build_runtime_services` + `ScriptedDriver` + `DirectIntake`, drives `process_intake_pipeline`, asserts request reaches `TASK_CREATED` and the task is past `CREATED` (proof an agent ran). A second case covers a scoped request with reviewer notes. Validation run on this branch: 29468 unit tests pass (full suite), 3183 web unit tests pass; e2e + the new integration suite pass under the simulation-harness substrate (zero real LLM spend). ## Pre-PR review Reviewed by 10 core agents (python-reviewer, security-reviewer, async-concurrency-reviewer, conventions-enforcer, test-quality-reviewer, api-contract-drift, silent-failure-hunter, issue-resolution-verifier, comment-quality-rot, docs-consistency). All valid findings implemented: - Wrapped `process_intake_pipeline`'s terminal reconciliation in `_safe_finalize` so a reconcile failure can never leave a request stuck in `APPROVED`: ERROR with `request_id` + fallback `CANCELLED`, with `release_request_lock_if_idle` in `finally`. - Hardened `_ensure_project` to tolerate a `DuplicateRecordError` race (post-setup reinit) and log ERROR with project id on any other create failure. - Explicit `except asyncio.CancelledError: raise` ahead of the broad handler. - `_current_if_approved` vanished-request now logs ERROR (invariant break, not a routine warning). - `_ensure_project(project_id: NotBlankStr)` and a comment on `_spawn_intake_pipeline` documenting why a detached task (not `TaskGroup`) is correct for the 202-return contract. - Added the dedicated `request.task_created` WS event (+ `WsRequestTaskCreatedPayload`) so terminal-hop semantics are no longer multiplexed onto `request.approved`. - Doc fixes: `verification-quality.md` banner, `client-simulation.md` settings enumeration + factory table, `page-structure.md` routing map. Agent claims that the project-wide `except A, B:` (PEP 758 / Python 3.14) syntax was a "Python 3 syntax error" were rejected with primary-source evidence (ruff, mypy, the full 29468-test unit suite, and all 73 pre-push gates pass; the pattern is pre-existing in `pipeline/service.py` and `simulations.py`). --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>

## Summary EPIC #1967 keystone: a human talks to ONE fixed top agent (Chief of Staff) in natural language. The agent asks clarifying questions when the request is underspecified, then parks one or more concrete `WorkItem`s in the human approval queue. On approval each item runs through the work pipeline; **no autonomous acting** (direct MCP acting is the separate child #1972). The pipeline spine (`build_work_pipeline`, #1960) and approval producer (#1956) already landed and are ENFORCED in the ghost-wiring manifest. This PR drives them from a conversation. ## What changed ### Service + API - `src/synthorg/meta/chief_of_staff/propose.py` :: `ChiefOfStaffProposer` (single-responsibility; keeps `chat.py` explain-only). Structured clarify-or-propose LLM turn via `core/json_parsing.py`. SEC-1: human content wrapped via `wrap_untrusted(TAG_TASK_DATA, ...)`. - `POST /meta/chat/propose` on `MetaController` with `ConversationalProposeRequest` args model, `meta.chat.propose` (5/60s) rate-limit, 503 dependency-matrix (proposer + work pipeline must be wired). - `src/synthorg/api/controllers/_approval_review_gate.py` :: `try_conversational_intake_resume` Flow 0 before parked-context / review-gate. On approve: reconstructs `WorkItem` from the proposal and drives `app_state.work_pipeline.run`. On reject: marks proposal REJECTED, pipeline untouched. - `build_chief_of_staff_proposer` factory + `_wire_chief_of_staff_proposer` on_startup hook + AppState set-once accessors (`chief_of_staff_proposer`, `conversational_proposal_repo`). - `ChiefOfStaffConfig` propose_* fields (opt-in, default off). ### Data model - New `ApprovalSource.CONVERSATIONAL_INTAKE` discriminator + new `ConversationRole` / `ConversationStatus` / `ConversationalProposalStatus` enums. - Frozen Pydantic models: `Conversation`, `ConversationTurn`, `ProposedWork`, `ProposeDecision` (XOR validator), `ConversationalProposal`, `ProposeArgs`, `ProposedApprovalSummary`, `ProposeResult`. ### Persistence (dual-backend) - New protocols: `ConversationRepository` (StatefulRepository), `ConversationTurnRepository` (AppendOnlyRepository), `ConversationalProposalRepository` (Stateful + FilteredQuery). Generic categories per ADR-0001; no bespoke methods. - Concrete sqlite + postgres impls, `conversational_factory.build_conversational_repositories` (backend-aware; degrades gracefully to None for absent / unknown backends so app boot stays clean). - One yoyo revision per backend (`20260519000001_conversational_intake.sql`). Postgres widens `approvals.source` CHECK via `DROP/ADD CONSTRAINT`. SQLite intentionally keeps the narrow CHECK (rebuild via writable_schema was perf-too-costly under the unit suite's `--count=2` isolation gate; the v1 ApprovalStore is in-memory by default so `CONVERSATIONAL_INTAKE` rows never reach SQLite. Documented inline.). - `schema_drift_baseline.txt` grown by 4 canonical TEXT/TIMESTAMPTZ entries for the new timestamp columns (user-approved, identical to every other dual-backend table). Committed with `ALLOW_BASELINE_GROWTH=1`. ### Errors / events - New `<Domain><Condition>Error` family in `meta/errors.py`: `ConversationNotFoundError` (404), `ConversationClosedError` (409), `ConversationalProposeUnavailableError` (503), `ConversationalProposeResponseInvalidError` (502). New `ErrorCode` entries 3017 / 4015 / 7010. - New event constants in `observability/events/{chief_of_staff,approval_gate,persistence}.py` (clarify/propose/cap/transition + dispatcher branch + persistence repo error/read events). ### Manifest discipline (hard acceptance criterion) - `scripts/_ghost_wiring_manifest.txt`: new `ENFORCED build_chief_of_staff_proposer #1968` line. Verified by `scripts/check_no_ghost_wiring.py`. ### Web / generated types - Regenerated `web/src/api/types/{openapi,dtos,enum-values,error-codes}.gen.ts` for the new endpoint and enums. - Hand-written `postChatPropose()` client + `ConversationalProposeResponse` interface in `web/src/api/endpoints/meta.ts`; MSW handler in `web/src/mocks/handlers/meta.ts` (1:1 parity rule). ### Docs - `docs/design/self-improvement.md`: `/meta/chat/propose` endpoint, `propose.py` + model trio in package inventory, three-flow approval-decision routing (Flow 0 conversational-intake), YAML defaults for `propose_*`. - `README.md`: v1 scope clarified (clarify + propose; no autonomous acting). - `CLAUDE.md`: one-liner on proposer wiring + 503 conditions. ## Test plan - **Unit (29 503 passed locally on `-m unit`)**: enums, models, errors, settings, proposer (clarify / propose / untrusted-wrap / parse-failure / clarification cap / project-resolution), wiring (`build_chief_of_staff_proposer` 503 matrix; backend-aware factory degradation), dispatch branch (approve-runs-pipeline, reject-skips-pipeline, missing-proposal-owned-but-noop, pipeline-unavailable-503, pipeline-failure-leaves-PENDING), API model validation + route + rate-limit-policy registration. - **Dual-backend conformance** (`tests/conformance/persistence/`, `pytest.mark.integration`): `Conversation`, `ConversationTurn`, `ConversationalProposal` against sqlite + postgres. Schema-drift gates green on both backends. - **Simulation-harness e2e** (`tests/e2e/test_conversational_propose_e2e.py`, `pytest.mark.e2e`): builds the REAL pipeline via `build_runtime_services`, REAL `ChiefOfStaffProposer` with a scripted clarify-then-propose provider, REAL `ApprovalStore`, REAL approval-decision seam (`signal_resume_intent`). Asserts: vague request → clarify → answer → PENDING approval in store → approve → `Task` created and advanced past CREATED by a real agent. Reject path: pipeline never runs. - All convention gates green (ghost-wiring, schema-drift, dual-backend parity, persistence-boundary, dto-types-ts-in-sync, error-codes-ts-in-sync, no-magic-numbers, no-review-origin, frozen-extra-forbid, ...). ## Review coverage 11 specialist review agents ran pre-push (code, python, security, persistence, conventions, logging, async-concurrency, api-contract-drift, issue-resolution-verifier, docs-consistency, comment-quality-rot). 18 valid findings (7 High, 10 Medium, 1 doc polish) addressed in commit 80b34f9. 10 false positives dismissed with reasons in `_audit/pre-pr-review/triage.md`. `issue-resolution-verifier` returned PASS on every acceptance criterion and the parent epic's manifest-discipline rule. Closes #1968 --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>

Closes #1964 Part of EPIC #1955. ## Summary Thin high-altitude work-entry adapter: a human-stated objective enters the pipeline spine as a `WorkItem(source=OBJECTIVE)` and is decomposed into projects/tasks by the existing `MultiAgentCoordinator`, then executed solo or as a team (automatic; never a user choice). No new decomposition logic — reuses the spine + coordinator end to end. ### Wire-up - `src/synthorg/engine/pipeline/entry/objective_adapter.py` — new `ObjectiveSubmission` (frozen Pydantic, `extra="forbid"`) and `ObjectiveEntryAdapter` (thin: `__slots__ = ("_default_project", "_work_pipeline")`). - `src/synthorg/engine/pipeline/entry/factory.py` — adds the `WorkSource.OBJECTIVE` arm. Coordinates with #1963 on this shared seam; the task-board adapter rebases when it lands. - `src/synthorg/engine/pipeline/entry/protocol.py` — generalised to PEP 695 generic `class WorkEntryAdapter[T_Request](Protocol)` so the intake and objective adapters can each be typed against their own domain input. - `src/synthorg/engine/pipeline/entry/boot.py` — new `wire_real_objective_entry()` (cold boot + hot-reload via `hot_swap=True`); `_ensure_project` lifted as a shared helper. - `src/synthorg/api/state.py` — `objective_entry_adapter` slot + property + once-only `set_` + atomic `set_if_absent_` + `swap_` seams + `objective_background_tasks: set[Task]` for spawned pipeline runs. - `src/synthorg/api/controllers/objectives.py` — new `POST /objectives` returning 202 + `submission_id` (server-minted, propagated as the `WorkItem.correlation_id`); pipeline runs as a background task so the HTTP boundary never blocks on team coordination. Guarded by `require_write_access` + per-op rate-limit `objectives.submit`. - `src/synthorg/api/app.py` + `controllers/setup/agent_helpers.py` — the new wire call is appended right after the intake wire, both at cold boot and on provider-reinit. - New OBJECTIVES settings namespace + `default_project` definition (`read_only_post_init=True`, `restart_required=True`). ### Manifest discipline (parent epic hard criterion) `scripts/_ghost_wiring_manifest.txt` ships an `ENFORCED ObjectiveEntryAdapter #1964 ...` line in this same PR; `scripts/check_no_ghost_wiring.py` passes. ## Test plan - **Unit** (all green; `tests/unit/engine/pipeline/entry/` + `tests/unit/api/controllers/test_objectives.py`): - `ObjectiveSubmission`: frozen, rejects unknown fields, blank title/description/requested_by, blank acceptance-criteria elements; submission_id defaults to uuid4. - `ObjectiveEntryAdapter`: maps submission to `WorkItem(source=OBJECTIVE)`; correlation_id == submission_id; default project enforced; optional fields fall through to `WorkItem` defaults; pipeline errors propagate unchanged. - Factory dispatch: OBJECTIVE arm returns `ObjectiveEntryAdapter`; unwired sources still raise `UnknownStrategyError`. - Boot hook: no-op without `work_pipeline`; creates the objectives project when absent; skips create when it exists; hot_swap uses the swap seam (and still ensures the project exists). - Controller: 202 + minted submission_id; background task tracked on `AppState`; controller never blocks on the pipeline run; done-callback removes completed tasks from the set; payload rejects unknown fields, blank fields, and unknown enum strings. - **E2E** (`tests/e2e/test_real_objective_entry_e2e.py`, run under `-m e2e`): builds the real runtime via the production `build_runtime_services` with a deterministic `ScriptedDriver`. Two cases: 1. Default `leaf-threshold` policy — proves the path is wired and the spawned task advances past `CREATED`. 2. Forced `always-team` policy — proves the objective decomposes: asserts `verdict=SPLITTABLE`, `execution_path=TEAM`, and the root task advances past `CREATED`. ## Review coverage Reviewed via `/pre-pr-review` with the core agent roster only (thin-adapter scope): python-reviewer, security-reviewer, test-quality-reviewer, conventions-enforcer, issue-resolution-verifier. | Agent | Result | | --- | --- | | python-reviewer | APPROVE, zero findings | | security-reviewer | APPROVE, zero findings (SEC-1 wrapping, auth, rate-limit, background-task lifecycle, settings injection verified) | | conventions-enforcer | Zero violations | | issue-resolution-verifier | PASS, all 7 acceptance criteria at confidence 100 | | test-quality-reviewer | 0 critical; 5 minor refinements applied (defensive double-yield on done-callback test, invalid-enum-rejection coverage, blank-acceptance-criteria coverage, `_result` helper docstring, hot_swap project-create assertion) | All ruff / mypy strict / pre-commit / pre-push gates green; 9485 affected-area unit tests pass locally. ## Out of scope - First-class `Objective` persistence (no model, repository, or migration). Submission is ephemeral; state lives on the spawned root task. - `/objectives/{id}` GET / list / cancel surfaces. - Frontend integration. - Coordinator / decomposition-strategy / routing-policy changes (reused as-is). - Add an `idempotency_key=` query filter to `/tasks` so HTTP-only clients can resolve `submission_id` → spawned task. Tracked as a follow-up; the simulation harness drives the adapter directly so this is not blocking acceptance for #1964. --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>

Closes #1963. Child of EPIC #1955. ## Summary Wires `TaskBoardEntryAdapter` as the live work-entry path for the task board UI. A human-filed task via `POST /tasks` now flows through the work pipeline spine so an agent actually executes it, proven under the simulation harness with a `ScriptedDriver`. The pipeline spine (#1960) already maps `WorkItem -> ClientRequest -> intake_engine.process()` in its intake phase; this PR adds the TASK_BOARD arm of `build_work_entry_adapter`, rewrites `POST /tasks` to hand the filing to a detached background pipeline run, and ships the matching ghost-wiring manifest line. ## Scope constraint (locked in plan) The spine's `_intake` always creates a task internally via `intake_engine.process()`; there is no spine entry that adopts a pre-existing task. Therefore the board adapter cannot pre-create a task and hand it off (would yield duplicates), and "move triggers the spine" requires new pipeline logic (forbidden by the issue's "thin adapter, no new pipeline logic" scope). The board adapter fires on `file` only; column moves remain pure `TaskEngine` status walks of the spine-created task. Documented in `docs/design/page-structure.md` and the task-board adapter docstring. ## What changed - **New `TaskBoardEntryAdapter` + `TaskBoardFiling`** under `src/synthorg/engine/pipeline/entry/task_board_adapter.py`. Frozen Pydantic input with `extra="forbid"` and an auto-generated UUID4 correlation id; the adapter maps a filing onto a `WorkItem(source=TASK_BOARD)` and drives `WorkPipeline.run`. - **`build_work_entry_adapter` extended** with the TASK_BOARD arm + `@overload` so boot helpers get the concrete return type per source literal. `default_project` stays required on the signature (intake consumes it; TASK_BOARD ignores it). - **`POST /tasks` rewritten** to return HTTP 202 + `TaskBoardSubmissionResponse` (`correlation_id`, `title`, `project`, `status: "submitted"`). A detached background coroutine (`process_task_board_pipeline`) drives the adapter; pipeline failures are logged at WARNING (intake rejection) or ERROR (any other), `MemoryError` / `RecursionError` re-raise. `AgentRuntimeNotConfiguredError` (409, code 4014) when no adapter is wired (empty company / no provider). - **`AppState.task_board_entry_adapter` seam** mirrors `intake_entry_adapter`: ctor kwarg, slot, property, `has_*`, `set_*` / `set_*_if_absent` / `swap_*`. Boot wiring in `engine.pipeline.entry.boot.wire_real_task_board_entry` is called from both `_install_runtime_services` and the post-setup `_rebuild_runtime_services` (skip project bootstrap; board filings carry their own project). - **Three new event constants**: `API_TASK_BOARD_SUBMITTED`, `API_TASK_BOARD_REJECTED_NO_ADAPTER`, `API_TASK_BOARD_PIPELINE_FAILED`. - **Manifest discipline** (hard EPIC criterion): `scripts/_ghost_wiring_manifest.txt` has an `ENFORCED` line for `TaskBoardEntryAdapter`. The `no-ghost-wiring` gate passes. - **Docs**: `docs/design/client-simulation.md` documents the new TASK_BOARD work-entry path next to the existing INTAKE one; `docs/design/page-structure.md` notes the new 202 contract on `POST /tasks` and explicitly documents the move-is-display-only semantics. - **Frontend**: regenerated `web/src/api/types/openapi.gen.ts` for the 202 + new schema, updated `tasks.ts` endpoint to return `TaskBoardSubmissionResponse`, Zustand store's `createTask` returns the submission envelope (no optimistic insert; the existing `task.created` WS handler inserts the real task once intake creates it), MSW handler returns 202, dialog + stories typed to the new shape. ## Tests - New unit tests: `tests/unit/engine/pipeline/entry/test_task_board_adapter.py` (filing validation; mapping; propagation of `WorkIntakeRejectedError`, `MemoryError`, `RecursionError`, generic `Exception`). - Extended `tests/unit/engine/pipeline/entry/{test_factory,test_boot}.py`: TASK_BOARD arm via parametrize; boot helper no-ops when pipeline / simulation runtime absent; hot-swap; no project bootstrap. - Rewritten `tests/unit/api/controllers/test_tasks.py`: 202 + submission envelope, UUID4 correlation id, 409 when no adapter (`test_create_task_raises_agent_runtime_not_configured_without_adapter`). - New `tests/unit/api/test_state.py` cases covering the new seam. - Reworked `tests/unit/api/controllers/test_coordination.py` to pre-create tasks via direct persistence insertion instead of `POST /tasks` (cross-loop crash with the now-mocked pipeline). - `tests/integration/api/test_task_create_empty_company.py` updated: no provider -> 409 (unchanged); provider present -> 202 + envelope. - New e2e `tests/e2e/test_task_board_entry_e2e.py`: production runtime via `build_runtime_services` + `ScriptedDriver`. Positive path proves the task reaches a post-execution status; an unknown-project path proves the background coroutine swallows the pipeline failure; a MemoryError stand-in proves the re-raise contract holds end-to-end. - New web tests: `TaskCreateDialog.test.tsx` updated for the submission envelope; store test asserts no optimistic insert (WS handler owns the inserted task) and "submitted" toast wording. Validation on this branch: 29 510 unit tests pass; 3 new e2e tests pass; 2 integration tests pass; targeted web suite (`tasks.test.ts`, `TaskCreateDialog.test.tsx`, `TaskBoardPage.test.tsx`, `useTaskBoardData.test.ts`) green; ruff, ruff-format, mypy strict, all pre-push convention gates green. ## Pre-PR review Reviewed by 5 core agents (python-reviewer, security-reviewer, test-quality-reviewer, conventions-enforcer, issue-resolution-verifier). 6 findings, all implemented in the test-tightening commit: - Tightened `TaskBoardSubmissionResponse.status` to `Literal["submitted"]`. - Added adapter-level error-propagation tests (`MemoryError`, `RecursionError`, generic `Exception`). - Tightened the 202 happy-path test to validate UUID4 correlation id. - Added two negative-path e2e tests (unknown project; MemoryError re-raise). - Parametrized the factory's positive test for the two wired sources. - Renamed the 409 test for semantic clarity. Verdicts after fixes: zero CRITICAL / HIGH from any agent; issue-resolution-verifier PASS with both acceptance criteria RESOLVED; conventions-enforcer zero violations; security-reviewer confirmed SEC-1 prompt safety is handled downstream by `policy/llm_judged.py` and the adapter does not need to wrap. Co-authored-by: SynthOrg <synthorg-bot@synthorg.local> --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>

… + per-project push queue (#2021) Closes #1974. Foundational substrate for EPIC #1973 (autonomous product studio). Each project gets a 1:1 persistent git-backed working tree that survives across agents, tasks, and sessions; two agents merging concurrently are serialised through a per-project push queue; the git backend is config-only swappable across `embedded` / `local_path` / `external_remote` (with GitHub / GitLab / Gitea / Forgejo connection identities). ## Locked acceptance criteria 1. **1:1 project to persistent git-backed workspace surviving across sessions.** `ProjectWorkspaceService.get_or_provision(project_id)` materialises `<base>/projects/<project_id>/` once; the `ProjectWorkspace` row plus the on-disk working tree are the durable contract. A backend-kind change in `GitBackendConfig.kind` re-provisions under the new backend (keeping the workspace path) with a `WORKSPACE_BACKEND_KIND_CHANGED` warning event. Conformance-tested under SQLite + Postgres. 2. **Two agents merge concurrently without branch collision via a per-project serial push queue.** `WorkspaceIsolationService.merge_workspace_with_push` routes through a lazily-built `PushQueueCoordinator` per project. FIFO; sentinel-based stop; `WorkspacePushError` distinguishes a forge-rejection from a local `WorkspaceMergeError` (conflicted merge does not push). The queue sits in front of the existing `WorkspaceIsolationStrategy` so a future virtual-branch strategy slots in without architectural change. 3. **Backend is config-only swappable.** `GitBackendType` is `embedded` / `external_remote` / `local_path`; `build_git_backend(config, deps)` resolves a `GitBackend` via `StrategyRegistry`. `external_remote` resolves a saved connection from the catalogue (GitHub, GitLab, Gitea, Forgejo) and injects the token as a URL-userinfo. Deep OAuth refresh + rate-limit hardening is tracked in the follow-up issue. 4. **Validated under the simulation harness.** Acceptance tests in `tests/integration/engine/workspace/test_project_workspace_acceptance.py` cover persistence across "session restarts", concurrent merge-without-collision, config-only backend swap, and cross-project sandbox isolation. ## What ships - `ProjectWorkspace` model + `ProjectWorkspaceRepository` (sqlite + postgres) + paired `20260519000001_project_workspaces` migrations + `schema.sql` parity. FK to `projects(id) ON DELETE CASCADE`. - `engine/workspace/git_backend/`: `GitBackend` protocol, `GitBackendConfig` / `GitBackendDeps`, `EmbeddedGitBackend` (default) / `LocalPathGitBackend` / `ExternalRemoteGitBackend`, `StrategyRegistry`-driven factory with fail-fast missing-deps validation. - `engine/workspace/project_workspace_service.py` with per-project async lock, config-authoritative backend-kind precedence, path-traversal defence. - `engine/workspace/push_queue.py` (`PushQueueCoordinator`) integrated into `WorkspaceIsolationService` (per-project dict, guard lock, double-checked locking). - Connection-catalogue expansion: `GITLAB`, `GITEA`, `FORGEJO` connection types alongside `GITHUB`. Gitea + Forgejo share a `_GiteaFamilyAuthenticator` base today but stay distinct connection identities so a future API divergence does not invalidate saved connections. - Boot wiring in `_install_runtime_services` (gated by `app_state.has_persistence`); `AgentEngineExecutionService.execute_once` now `get_or_provision`s the workspace per project. New `AppState.project_workspace_service` once-only seam. - Ghost-wiring manifest: 6 `ENFORCED` entries for the new runtime symbols (`ProjectWorkspaceService`, `build_git_backend`, `EmbeddedGitBackend`, `LocalPathGitBackend`, `ExternalRemoteGitBackend`, `PushQueueCoordinator`). - Error taxonomy: `ProjectWorkspaceNotProvisionedError` (4015, CONFLICT), `GitBackend*Error` family, `WorkspacePushError`. - Docs: `docs/design/coordination.md` (persistent workspace + push-queue section), `docs/design/integrations.md` (connection types), `docs/reference/errors.md` (4014 / 4015), `docs/reference/pluggable-subsystems.md` (git backend strategy). ## Pre-PR review (this push) Ran the core agent set on the substrate; applied every valid finding before opening the PR. Highlights: - **SEC-1 redaction**: `_redact_args` strips embedded URL credentials at every `run_git_subprocess` log site (no token-in-URL ever lands in structured logs). - **`GIT_*` env sanitisation in `run_git_subprocess`**: when this code runs from inside a git pre-push hook (or any caller whose own cwd is a git working tree), the hook process exports `GIT_DIR` / `GIT_WORK_TREE` and any child git subprocess inherits them, making `is_git_repo()` report True for a freshly-mkdir'd tmp dir because git resolves against the parent repo. The fix strips `GIT_*` env vars in `run_git_subprocess` so backend strategies resolve purely via the per-call `cwd`. (Without this, the embedded backend's `git commit --allow-empty` walked up to the host's `.git` during the pre-push run; the rebased branch you're looking at has those stray commits removed and the gate now reproduces clean.) - **Defensive nested-worktree guard** on `EmbeddedGitBackend` and `LocalPathGitBackend` (compares resolved `--show-toplevel` against the requested path; only refuses when toplevel is a strict parent). - Path-traversal defence in `ProjectWorkspaceService._workspace_path` rejects `/`, `\\`, `..`. - Double-checked locking TOCTOU closed in `ProjectWorkspaceService._lock_for` and `WorkspaceIsolationService._get_or_create_queue` (dict reads moved inside the guard lock). - Push-queue logging surfaces `WORKSPACE_PUSH_QUEUE_ENQUEUED` / `_MERGED` / `_FAILED` / `_WORKER_FAILED`. ## Out of scope / follow-up Filed as #2020 (referenced, not silently deferred): - External-remote OAuth refresh + rate-limit hardening (protocol + thin glue ship now; backend-switch-is-config-only already holds). - Per-execution Docker per-project mount migration into the sandbox lifecycle owner. - `dispatch` callers of `merge_workspace` migrated onto `merge_workspace_with_push` (currently only the direct execution path uses the queue). - `PlannerWorktreeStrategy` per-project `repo_root` resolution at execute-time. - `tests/unit/api/conftest.py::test_client` baseline reset for the known `project_shared_app_once_only_install_leak`. ## Verification - `uv run python -m pytest tests/ -m unit` clean. - Targeted dual-backend conformance: `tests/conformance/persistence/test_project_workspace_repository.py` (18 tests, SQLite + Postgres). - Acceptance tests: `tests/integration/engine/workspace/test_project_workspace_acceptance.py`. - Pre-push gates: ruff, mypy strict, schema-drift (both backends), no-ghost-wiring, no-magic-numbers, isolation `--count=2` gate. --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>

github-actions · 2026-05-20T16:01:26Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

gemini-code-assist · 2026-05-20T16:02:11Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the core scoring and data infrastructure for the 'golden-company' benchmark (eval spine). It provides the foundational models, loaders, and grading logic required to evaluate synthetic organization performance, while intentionally deferring the runner implementation until runtime prerequisites are met. The changes include strict schema validation, robust error handling, and comprehensive unit testing to ensure the benchmark's reliability and auditability.

Highlights

Data Model and Contract: Introduced frozen Pydantic v2 models for the eval spine, including schema-versioned Scorecard and Brief definitions to ensure deterministic data contracts.
Scoring and Calibration: Implemented a subprocess-based grader for executable briefs and a calibrated LLM judge for judged briefs, featuring a Spearman rho gate (>= 0.7) to ensure judge reliability.
Penalty System: Added a process-fact penalty table with caps and floors to account for budget, execution, and governance events during benchmark runs.
Emitters and Loaders: Added deterministic JSON and Markdown emitters for scorecards, along with robust YAML loaders that include path-traversal containment for anchor sets.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-05-20T16:02:13Z

Warning

Rate limit exceeded

@Aureliolo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 59 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: b64c3097-b7fb-4ae2-8214-384842b0aeb4

📥 Commits

Reviewing files that changed from the base of the PR and between e57d7a0 and 2c25a0a.

📒 Files selected for processing (2)

evals/models/scorecard.py
tests/evals_spine/test_scorecard_emit.py

Walkthrough

Adds an eval spine implementing frozen Pydantic briefs and scorecard models, YAML loaders (briefs and anchors) with path-safety and validation, executable and judged graders (Spearman-based calibration), penalty table and aggregation logic, JSON and Markdown scorecard emitters, an EvalError hierarchy, observability event constants, pyproject tooling/test discovery updates, and a comprehensive suite of unit and integration tests.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: landing the scoring and data contract (spine) for the golden-company benchmark eval system.
Description check	✅ Passed	The description comprehensively explains what landed (scoring + data layer with 68 passing tests), what was intentionally deferred (runner components), and why (runtime prerequisites gap).
Linked Issues check	✅ Passed	The PR successfully implements the primary coding objectives from `#1980`: frozen Pydantic models, YAML loaders, executable grading, judged calibration with Spearman gate, penalties + aggregation, JSON/Markdown emitters, and error hierarchy.
Out of Scope Changes check	✅ Passed	All changes are scoped to the linked issue `#1980`'s first-PR objectives. The runner components (cassette_guard, orchestrator, CLI, baselines, cassettes) are intentionally excluded per the issue status.
Docstring Coverage	✅ Passed	Docstring coverage is 52.78% which is sufficient. The required threshold is 40.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codspeed-hq · 2026-05-20T16:04:10Z

Merging this PR will not alter performance

✅ 33 untouched benchmarks
⏩ 21 skipped benchmarks¹

_{Comparing feat/1980-golden-benchmark (2c25a0a) with main (f60e9ed)}

21 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

coderabbitai

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/loader/briefs.py`:
- Around line 43-47: The globbed entries in files are currently trusted by
suffix only, which allows symlinks or non-file entries to escape briefs_dir or
cause I/O errors; update the logic around files / the for path in files loop to
skip any path that is not a regular file and to ensure path.resolve() is
contained within briefs_dir.resolve() (e.g., check path.is_file() and that
path.resolve() is under briefs_dir.resolve() via relative_to in a try/except)
before calling path.read_text() and yaml.safe_load; apply these checks where
files is built or at the start of the loop that constructs briefs (referencing
variables briefs_dir, files, path, yaml.safe_load, and Brief).

In `@evals/models/brief.py`:
- Line 111: Replace the inline numeric literal in the model schema by
introducing a named constant in the settings.definitions module (e.g.,
HIDDEN_CHECK_DEFAULT_TIMEOUT: int = 30) and use that constant in the Field
default for timeout_seconds in the Brief model (timeout_seconds: int =
Field(default=HIDDEN_CHECK_DEFAULT_TIMEOUT, gt=0)); add an explicit module-level
type annotation for the new constant in settings.definitions and update any
imports so evals.models.brief references the constant instead of the literal.

In `@evals/models/scorecard.py`:
- Around line 123-153: BriefResult currently allows inconsistent states between
kind and judge_calibration; add an after model_validator (e.g. name it
_validate_kind_and_calibration) on the BriefResult model that checks
BriefResult.kind (use the BriefKind enum) and enforces that when kind ==
BriefKind.JUDGED then judge_calibration is not None, and when kind !=
BriefKind.JUDGED then judge_calibration is None; on violation raise a ValueError
with a clear message referencing brief_id, kind and judge_calibration to aid
debugging.
- Around line 164-166: AggregatedProcessFacts currently allows negative counts
in events_by_class; add a Pydantic validator on the AggregatedProcessFacts model
(e.g., `@validator`("events_by_class") or a `@root_validator`) that iterates the
dict values and raises a ValueError if any count is negative, and ensure
total_events remains Field(..., ge=0) as a non-negative guard; reference the
AggregatedProcessFacts.events_by_class field and the model's validation methods
when implementing.

In `@evals/scoring/aggregate.py`:
- Line 32: The field event_constant in the Pydantic model should be typed as
NotBlankStr rather than plain str; update the annotation for event_constant to
NotBlankStr, add the appropriate import for NotBlankStr from the project
validation utilities (or pydantic helpers) at the top of
evals/scoring/aggregate.py, and run typechecks to ensure the model now enforces
non-blank identifier constraints.
- Around line 122-128: In the loop in evals/scoring/aggregate.py that iterates
over events_by_class.items(), validate the `count` before using it: if `count <
0` raise a ValueError with a clear message identifying the offending
`event_constant` and `count`; then proceed to call `penalty_table.is_tracked`,
`penalty_table.points_for(event_constant)`, compute `raw = per_event * count`,
clamp with `penalty_table.cap_per_class`, and add to `total_deduction` as
before. Ensure the check occurs before any arithmetic so negative counts cannot
produce negative deductions.
- Around line 72-75: Replace the `@property` on the is_clean method with
Pydantic's `@computed_field`: import computed_field from pydantic and annotate
is_clean with `@computed_field`, keeping the implementation returning
self.deduction == 0 so the derived value complies with the frozen Pydantic model
contract; ensure the decorator is applied to the is_clean method in the same
class where deduction is defined.

In `@evals/scoring/executable.py`:
- Around line 151-157: Replace the inline event name string literals used in the
logger.warning calls with centralized constants: define EVAL_EXECUTABLE_TIMEOUT
and EVAL_EXECUTABLE_TOOL_MISSING in the observability.events.evals module (or
the existing events module) with values "evals.executable.timeout" and
"evals.executable.tool_missing" respectively, then import those constants at the
top of this module and use them in the logger.warning calls (the ones that
currently pass "evals.executable.timeout" and "evals.executable.tool_missing");
update both occurrences where logger.warning is invoked (the calls that pass
cmd=spec.cmd, timeout=spec.timeout_seconds, error_type=type(exc).__name__,
error=safe_error_description(exc) and the similar tool-missing call) to
reference the imported constants instead of string literals.
- Around line 102-105: Replace the `@property` on the is_clean method with
Pydantic's `@computed_field` decorator so the derived value is serialized; import
computed_field from pydantic if missing and keep the same signature (def
is_clean(self) -> bool) and return expression (self.hidden_pass and
self.build_pass and self.lint_pass) so the field is treated as a computed model
field rather than a plain Python property.
- Line 67: Change the Pydantic field annotation for the identifier "label" from
str to NotBlankStr (so it enforces non-empty identifiers like LABEL_HIDDEN,
LABEL_BUILD, LABEL_LINT) and add an import for NotBlankStr from the project's
shared types module (where other models use NotBlankStr); update the model
definition that contains the "label" field in this file to use NotBlankStr
instead of str.
- Around line 131-143: The completed outcome handler currently hardcodes
duration_seconds=0.0 in _outcome_from_completed while timeouts set
duration_seconds=float(spec.timeout_seconds); fix by capturing real elapsed time
around the subprocess invocation and threading that value into the CheckOutcome:
modify the code that calls subprocess.run to record start/end (e.g.,
time.perf_counter()) and compute elapsed_seconds, change _outcome_from_completed
to accept a duration_seconds: float parameter (or overload it) and use that
value when constructing CheckOutcome, and ensure the timeout branch still sets
duration_seconds=float(spec.timeout_seconds) so all outcomes have meaningful
durations.

In `@tests/evals_spine/test_judged_calibration.py`:
- Line 37: Change the predicate parameter's type from object to a typed Callable
to reflect its true signature: add "from typing import Callable" at the top of
the test file, update the function signature(s) that accept the predicate (the
pytest.parametrize entry ("xs", "ys", "predicate") and the test function(s)
using that parameter) to use predicate: Callable[[float], bool], and remove the
now-unnecessary "# type: ignore[operator]" comments; apply the same change to
the other occurrence around lines 80-83 where predicate is used.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: eed15755-dcc6-4950-9350-66b34d428804

📥 Commits

Reviewing files that changed from the base of the PR and between 2049885 and 815ac88.

📒 Files selected for processing (27)

CLAUDE.md
evals/__init__.py
evals/emit/__init__.py
evals/emit/json_writer.py
evals/emit/markdown_writer.py
evals/errors.py
evals/loader/__init__.py
evals/loader/anchors.py
evals/loader/briefs.py
evals/models/__init__.py
evals/models/brief.py
evals/models/scorecard.py
evals/scoring/__init__.py
evals/scoring/aggregate.py
evals/scoring/executable.py
evals/scoring/judged.py
evals/scoring/penalties.py
evals/scoring/spearman.py
pyproject.toml
tests/evals_spine/__init__.py
tests/evals_spine/conftest.py
tests/evals_spine/test_brief_loader.py
tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_scorecard_emit.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: Build Backend
GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
GitHub Check: Dashboard Test
GitHub Check: Test Integration
GitHub Check: Test E2E
GitHub Check: Test Unit
GitHub Check: Analyze (python)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Read docs/design/ page before implementing; deviations need approval. See DESIGN_SPEC.md.

Present every plan for accept/deny before coding.

Ship enforcement gate with every convention PR.

Configuration Precedence: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env at the boot site. No os.environ.get outside startup.

Numerics live in settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal). Enforced by scripts/check_no_magic_numbers.py.

No from __future__ import annotations (3.14 has PEP 649). PEP 758 except: except A, B: must use parens unless binding.

Type hints on public functions; mypy strict. Google-style docstrings. Line length 88; functions <50 lines; files <800 lines.

Errors: <Domain><Condition>Error from DomainError; never inherit Exception/RuntimeError/etc directly. Enforced by check_domain_error_hierarchy.py.

Pydantic v2 frozen + extra="forbid" on every frozen model (gate check_frozen_model_extra_forbid.py; @computed_field auto-exempt, per-line # lint-allow: frozen-extra-forbid -- <reason> for extra="allow"/"ignore"); use @computed_field for derived fields.

Args models at every system boundary; use parse_typed() for every external dict ingestion. Enforced by check_boundary_typed.py.

Immutability: use model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries.

Async: use asyncio.TaskGroup for fan-out/fan-in; helpers catch Exception (re-raise MemoryError/RecursionError).

Clock seam: clock: Clock | None = None; tests inject FakeClock. Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable.

Untrusted content (SEC-1): use wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML.

Logging: import `from sy...

Files:

evals/__init__.py
evals/scoring/__init__.py
evals/emit/json_writer.py
evals/loader/__init__.py
evals/errors.py
tests/evals_spine/conftest.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_brief_loader.py
evals/loader/briefs.py
evals/models/scorecard.py
evals/scoring/spearman.py
tests/evals_spine/test_executable_grader.py
evals/emit/markdown_writer.py
evals/scoring/aggregate.py
evals/models/__init__.py
evals/emit/__init__.py
evals/loader/anchors.py
evals/scoring/judged.py
evals/scoring/penalties.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_judged_calibration.py
evals/models/brief.py
tests/evals_spine/test_scorecard_emit.py
evals/scoring/executable.py

**/*.{md,rst}

📄 CodeRabbit inference engine (CLAUDE.md)

Numerics in README and public docs sourced from data/runtime_stats.yaml via  markers.

Files:

CLAUDE.md

tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use markers: @pytest.mark.{unit,integration,e2e,slow}. Async tests use auto. Global timeout 30s. Coverage minimum 80%.

xdist: use -n 8 --dist=loadfile (prevents 3.14+ Windows ProactorEventLoop leak). Windows: use WindowsSelectorEventLoopPolicy for unit tests; subprocess tests override back.

Test doubles: FakeClock for Clock seam, mock_of[T](**overrides) for typed-boundary substitutions, SimpleNamespace for attribute-bags. Bare MagicMock at typed boundary is blocked by check_mock_spec.py.

FakeClock and mock_of import from tests._shared; inject via clock= and the helper's spec subscript.

Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...)). Never skip/xfail flaky tests; fix fundamentally.

Files:

tests/evals_spine/conftest.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_brief_loader.py
tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_scorecard_emit.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

tests/evals_spine/conftest.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_brief_loader.py
tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_scorecard_emit.py

🧠 Learnings (4)

📓 Common learnings

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: No region/currency/locale privileged; use metric units; British English.

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Timeout/slow test failures are source-code regressions; never edit `tests/baselines/unit_timing.json` or scripts baseline files. Bypass with `ALLOW_BASELINE_GROWTH=1 git commit ...` (requires user approval).

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: After issue: branch + commit + push (no auto-PR); use `/pre-pr-review`. After PR: `/aurelio-review-pr` for external feedback. Fix EVERYTHING valid; no deferring.

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced.

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`).

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Branches: `<type>/<slug>` from main.

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Squash merge. PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body.

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: After every squash merge → run `/post-merge-cleanup`.

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API.

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: API startup lifecycle: two phases - construction (synchronous service wiring) and on_startup (connected persistence wiring). Construction ordering: `agent_registry` BEFORE `auto_wire_meetings`; `tunnel_provider` unconditional. On-startup ordering: `SettingsService` BEFORE `WorkflowExecutionObserver`; `OntologyService` AFTER `persistence.connect()`.

📚 Learning: 2026-05-05T09:04:46.195Z

Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

evals/__init__.py
evals/scoring/__init__.py
evals/emit/json_writer.py
evals/loader/__init__.py
evals/errors.py
tests/evals_spine/conftest.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_brief_loader.py
evals/loader/briefs.py
evals/models/scorecard.py
evals/scoring/spearman.py
tests/evals_spine/test_executable_grader.py
evals/emit/markdown_writer.py
evals/scoring/aggregate.py
evals/models/__init__.py
evals/emit/__init__.py
evals/loader/anchors.py
evals/scoring/judged.py
evals/scoring/penalties.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_judged_calibration.py
evals/models/brief.py
tests/evals_spine/test_scorecard_emit.py
evals/scoring/executable.py

📚 Learning: 2026-05-16T18:36:31.446Z

Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.

Applied to files:

CLAUDE.md

📚 Learning: 2026-05-16T18:36:35.250Z

Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).

Applied to files:

CLAUDE.md

🔇 Additional comments (21)

tests/evals_spine/test_scorecard_emit.py (1)

1-255: LGTM!

tests/evals_spine/test_penalties.py (1)

1-182: LGTM!

evals/scoring/spearman.py (1)

1-97: LGTM!

evals/scoring/judged.py (1)

1-298: LGTM!

tests/evals_spine/conftest.py (1)

1-59: LGTM!

tests/evals_spine/test_brief_loader.py (1)

1-144: LGTM!

tests/evals_spine/test_runner_broken_scores_worse.py (1)

1-87: LGTM!

evals/scoring/__init__.py (1)

1-13: LGTM!

evals/emit/__init__.py (1)

1-1: LGTM!

evals/emit/json_writer.py (1)

1-34: LGTM!

evals/emit/markdown_writer.py (1)

1-141: LGTM!

evals/__init__.py (1)

1-14: LGTM!

CLAUDE.md (1)

3-3: LGTM!

pyproject.toml (1)

190-190: LGTM!

Also applies to: 283-283, 363-363, 378-378

evals/loader/__init__.py (1)

1-7: LGTM!

evals/loader/anchors.py (1)

39-70: LGTM!

Also applies to: 72-123, 125-132

evals/loader/briefs.py (1)

56-70: LGTM!

evals/errors.py (1)

22-116: LGTM!

evals/models/__init__.py (1)

1-6: LGTM!

tests/evals_spine/test_executable_grader.py (1)

1-231: LGTM!

evals/scoring/executable.py (1)

172-179: Shell metacharacter validation in HiddenCheckSpec is correctly implemented.

The validation exists as a model validator at line 114 of evals/models/brief.py (_leading_token_has_no_shell_metachars), checking the leading token against the _SHELL_METACHARS constant defined at line 27. The docstring (lines 97–105) explicitly documents the defense-in-depth approach: validation at load time catches typos in brief YAML before runtime, even though shell=False mitigates the risk. The noqa comment on line 172 is justified.

coderabbitai · 2026-05-20T16:09:56Z

+    @property
+    def is_clean(self) -> bool:
+        """Whether every declared check class passed."""
+        return self.hidden_pass and self.build_pass and self.lint_pass


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use @computed_field instead of @property for derived fields.

The is_clean property is a derived field and should use Pydantic's @computed_field decorator rather than @property. This ensures the field is serialized with the model and aligns with the project's convention for computed fields. As per coding guidelines, derived fields should use @computed_field.

♻️ Proposed fix

-from pydantic import BaseModel, ConfigDict, Field, model_validator +from pydantic import BaseModel, ConfigDict, Field, computed_field, model_validator class ExecutableGrade(BaseModel): """Aggregate grade across hidden + build + lint check classes. Invariant: each ``*_pass`` boolean reflects whether every outcome in that label-bucket exited cleanly (``exit_code == 0`` and not ``timed_out``). The validator below enforces it. """ model_config = ConfigDict(frozen=True, extra="forbid", allow_inf_nan=False) score: int = Field(ge=0, le=EXEC_TOTAL) hidden_pass: bool build_pass: bool lint_pass: bool outcomes: tuple[CheckOutcome, ...] - `@property` + `@computed_field` # type: ignore[misc] + `@property` def is_clean(self) -> bool: """Whether every declared check class passed.""" return self.hidden_pass and self.build_pass and self.lint_pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/scoring/executable.py` around lines 102 - 105, Replace the `@property` on the is_clean method with Pydantic's `@computed_field` decorator so the derived value is serialized; import computed_field from pydantic if missing and keep the same signature (def is_clean(self) -> bool) and return expression (self.hidden_pass and self.build_pass and self.lint_pass) so the field is treated as a computed model field rather than a plain Python property.

gemini-code-assist

Code Review

This pull request introduces the golden-company benchmark framework, a comprehensive evaluation suite for autonomous agents featuring executable checks and calibrated LLM judging. The implementation includes Pydantic models for briefs and scorecards, YAML loaders, and scoring logic that incorporates process-fact penalties. Review feedback highlighted the need to measure actual execution duration in the executable grader and to handle PermissionError during subprocess calls. Security improvements were also suggested regarding credential redaction in logs and command outputs, along with a correction to ensure the atomicity of JSON file writes as described in the documentation.

gemini-code-assist · 2026-05-20T16:10:16Z

+def _outcome_from_completed(
+    spec: HiddenCheckSpec,
+    label: str,
+    completed: subprocess.CompletedProcess[bytes],
+) -> CheckOutcome:
+    return CheckOutcome(
+        label=label,
+        cmd=spec.cmd,
+        exit_code=completed.returncode,
+        duration_seconds=0.0,
+        stdout_tail=_tail(completed.stdout),
+        stderr_tail=_tail(completed.stderr),
+    )


Update _outcome_from_completed to accept a duration parameter. The current implementation hardcodes duration_seconds=0.0. Additionally, ensure that the cmd field in CheckOutcome is redacted if it contains URLs with credentials to prevent sensitive data leakage in tool execution results, as required by repository security rules.

Suggested change

def _outcome_from_completed(

spec: HiddenCheckSpec,

label: str,

completed: subprocess.CompletedProcess[bytes],

) -> CheckOutcome:

return CheckOutcome(

label=label,

cmd=spec.cmd,

exit_code=completed.returncode,

duration_seconds=0.0,

stdout_tail=_tail(completed.stdout),

stderr_tail=_tail(completed.stderr),

)

def _outcome_from_completed(

spec: HiddenCheckSpec,

label: str,

completed: subprocess.CompletedProcess[bytes],

duration: float,

) -> CheckOutcome:

return CheckOutcome(

label=label,

cmd=spec.cmd,

exit_code=completed.returncode,

duration_seconds=duration,

stdout_tail=_tail(completed.stdout),

stderr_tail=_tail(completed.stderr),

)

References

Ensure that URLs containing potential credentials are redacted in all output channels, including tool execution results (e.g., ToolExecutionResult content).

gemini-code-assist · 2026-05-20T16:10:16Z

+def _run_check(spec: HiddenCheckSpec, label: str, work_dir: Path) -> CheckOutcome:
+    """Run one check; never raises on non-zero exit; raises on tool-missing."""
+    try:
+        completed = subprocess.run(  # noqa: S603 -- args validated, no shell
+            list(spec.cmd),
+            cwd=work_dir,
+            timeout=spec.timeout_seconds,
+            capture_output=True,
+            check=False,
+            shell=False,
+        )
+    except subprocess.TimeoutExpired as exc:
+        return _outcome_from_timeout(spec, label, exc)
+    except (FileNotFoundError, NotADirectoryError) as exc:
+        logger.error(
+            "evals.executable.tool_missing",
+            cmd=spec.cmd,
+            error_type=type(exc).__name__,
+            error=safe_error_description(exc),
+        )
+        msg = f"Required eval tool not found: {spec.cmd[0]!r}"
+        raise EvalToolMissingError(msg) from exc
+    return _outcome_from_completed(spec, label, completed)


Measure the actual execution time using time.perf_counter() and handle PermissionError. subprocess.run raises PermissionError if the file exists but is not executable, which should be caught and reported as an EvalToolMissingError. When logging the error, ensure that spec.cmd is redacted if it contains URLs with credentials to prevent sensitive data leakage to logs.

def _run_check(spec: HiddenCheckSpec, label: str, work_dir: Path) -> CheckOutcome: """Run one check; never raises on non-zero exit; raises on tool-missing.""" start = time.perf_counter() try: completed = subprocess.run( # noqa: S603 -- args validated, no shell list(spec.cmd), cwd=work_dir, timeout=spec.timeout_seconds, capture_output=True, check=False, shell=False, ) duration = time.perf_counter() - start return _outcome_from_completed(spec, label, completed, duration) except subprocess.TimeoutExpired as exc: return _outcome_from_timeout(spec, label, exc) except (FileNotFoundError, NotADirectoryError, PermissionError) as exc: logger.error( "evals.executable.tool_missing", cmd=spec.cmd, error_type=type(exc).__name__, error=safe_error_description(exc), ) msg = f"Required eval tool not found or not executable: {spec.cmd[0]!r}" raise EvalToolMissingError(msg) from exc

References

Ensure that URLs containing potential credentials are redacted in all output channels, including log messages.

gemini-code-assist · 2026-05-20T16:10:16Z

+import subprocess
+from typing import TYPE_CHECKING, Final, Self


Add import time to support measuring the actual duration of subprocess execution, which is currently hardcoded to 0.0 for successful outcomes.

Suggested change

import subprocess

from typing import TYPE_CHECKING, Final, Self

import subprocess

import time

from typing import TYPE_CHECKING, Final, Self

gemini-code-assist · 2026-05-20T16:10:16Z

+def write_scorecard_json(scorecard: Scorecard, out_dir: Path) -> Path:
+    """Write *scorecard* as JSON into *out_dir*; return the file path."""
+    out_dir.mkdir(parents=True, exist_ok=True)
+    target = out_dir / SCORECARD_JSON_FILENAME
+    payload = scorecard.model_dump_json(indent=JSON_INDENT)
+    target.write_text(payload + "\n", encoding="utf-8")
+    return target


The docstring claims that write_scorecard_json overwrites the file atomically via write_text, but pathlib.Path.write_text is not atomic (it truncates the file upon opening). To ensure atomicity and prevent corruption if the process crashes during writing, use a temporary file and os.replace.

def write_scorecard_json(scorecard: Scorecard, out_dir: Path) -> Path: """Write *scorecard* as JSON into *out_dir*; return the file path.""" import os import tempfile out_dir.mkdir(parents=True, exist_ok=True) target = out_dir / SCORECARD_JSON_FILENAME payload = scorecard.model_dump_json(indent=JSON_INDENT) + "\n" with tempfile.NamedTemporaryFile( mode="w", dir=out_dir, encoding="utf-8", delete=False ) as tf: tf.write(payload) temp_name = tf.name try: os.replace(temp_name, target) except Exception: try: os.unlink(temp_name) except OSError: pass raise return target

codecov · 2026-05-20T16:12:31Z

Codecov Report

❌ Patch coverage is 80.12270% with 486 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.00%. Comparing base (f60e9ed) to head (2c25a0a).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...c/synthorg/persistence/sqlite/conversation_repo.py	62.89%	56 Missing and 3 partials ⚠️
...synthorg/persistence/postgres/conversation_repo.py	65.27%	47 Missing and 3 partials ⚠️
...persistence/sqlite/conversational_proposal_repo.py	65.57%	41 Missing and 1 partial ⚠️
...rsistence/postgres/conversational_proposal_repo.py	70.00%	32 Missing and 1 partial ⚠️
...rg/engine/workspace/git_backend/external_remote.py	58.10%	29 Missing and 2 partials ⚠️
src/synthorg/engine/workspace/push_queue.py	73.39%	21 Missing and 8 partials ⚠️
src/synthorg/meta/chief_of_staff/propose.py	86.74%	20 Missing and 2 partials ⚠️
...thorg/persistence/sqlite/project_workspace_repo.py	68.57%	22 Missing ⚠️
src/synthorg/persistence/conversational_factory.py	54.28%	15 Missing and 1 partial ⚠️
...org/persistence/postgres/project_workspace_repo.py	75.00%	16 Missing ⚠️
... and 21 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2025      +/-   ##
==========================================
- Coverage   85.09%   85.00%   -0.10%     
==========================================
  Files        1913     1948      +35     
  Lines      113388   115799    +2411     
  Branches     9673     9839     +166     
==========================================
+ Hits        96485    98430    +1945     
- Misses      14533    14943     +410     
- Partials     2370     2426      +56

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Reviewer feedback on PR #2025 from the initial CodeRabbit + Gemini pass. All findings verified against the current code; everything valid is fixed in this round (no deferrals). evals/loader/briefs.py refuses non-file glob entries and resolved paths outside briefs_dir (new BriefSuitePathTraversalError); blocks symlink escape and directory-with-yaml-suffix. evals/models/brief.py introduces DEFAULT_HIDDEN_CHECK_TIMEOUT_SECONDS as a module-level Final constant in place of the inline default. evals/models/scorecard.py adds a BriefResult validator enforcing kind=JUDGED iff judge_calibration is set, and AggregatedProcessFacts now rejects negative per-class counts. evals/scoring/aggregate.py types PenaltyEntry.event_constant as NotBlankStr and aggregate_brief_score refuses negative per-class counts before applying penalties. evals/scoring/executable.py types CheckOutcome.label as NotBlankStr, _outcome_from_completed records measured time.perf_counter() duration, _run_check catches PermissionError as tool-missing, and event names come from observability.events.evals constants instead of string literals. evals/scoring/penalties.py rejects negative points_per_event values and floor above GRADE_CEILING. evals/emit/json_writer.py uses tempfile + Path.replace + fsync for atomic write; the previous Path.write_text was advertised atomic but is not. tests/evals_spine/test_judged_calibration.py types predicate as Callable[[float], bool], type-ignore removed. is_clean stays @Property (NOT @computed_field) on ProcessFactReport, AggregatedProcessFacts, AggregationResult, ExecutableGrade: @computed_field serialises the value into JSON, but extra=forbid then rejects it on model_validate_json round-trip; matches existing Scorecard.total / max_total / is_passing pattern. Added positive tests for every new validator (path traversal, negative counts, kind/calibration consistency, PenaltyTable bounds).

Round 1 push #2025 added src/synthorg/observability/events/evals.py but did not update tests/unit/observability/test_events.py::TestEventConstants::test_all_domain_modules_discovered. That test enumerates every legitimate domain module under observability/events/ and refuses unknowns. Adding the new evals namespace to the expected set unblocks the pre-push pytest gate.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/emit/json_writer.py`:
- Around line 37-45: The temporary JSON file is opened in text mode without
specifying newline, causing platform-dependent CRLF/LF translations; update the
NamedTemporaryFile call in evals/emit/json_writer.py (the with
tempfile.NamedTemporaryFile(...) as handle block) to include newline="\n" so the
written payload uses LF-only line endings across platforms, keeping mode="w",
encoding="utf-8", delete=False, prefix/suffix as-is.

In `@evals/scoring/aggregate.py`:
- Around line 86-87: The validator
AggregationResult._score_matches_grade_minus_deduction currently enforces the
floor using the global GRADE_FLOOR while aggregate_brief_score uses
penalty_table.floor; change the validator to use the same floor source as
aggregation (e.g., compute expected = max(self.grade - self.deduction,
self.penalty_table.floor) or otherwise obtain penalty_table.floor from the
AggregationResult instance/context) so both validation and scoring use
penalty_table.floor consistently; ensure any null/absent penalty_table is
handled (fall back to GRADE_FLOOR only if penalty_table is unavailable).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 570e0380-6ccd-4433-aa7a-6eb407842c8e

📥 Commits

Reviewing files that changed from the base of the PR and between 815ac88 and 48c3121.

📒 Files selected for processing (14)

evals/emit/json_writer.py
evals/errors.py
evals/loader/briefs.py
evals/models/brief.py
evals/models/scorecard.py
evals/scoring/aggregate.py
evals/scoring/executable.py
evals/scoring/penalties.py
src/synthorg/observability/events/evals.py
tests/evals_spine/test_brief_loader.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_scorecard_emit.py
tests/unit/observability/test_events.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
GitHub Check: Dashboard Test
GitHub Check: Test Unit
GitHub Check: Test Integration
GitHub Check: Analyze (python)

🧰 Additional context used

📓 Path-based instructions (4)

src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Follow configuration precedence: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env at boot site; no os.environ.get outside startup

No hardcoded numeric values; numerics live in settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal)

Use comments to explain WHY only; no reviewer citations, issue back-refs, or migration framing; enforced by check_no_review_origin_in_code.py and check_no_migration_framing.py

No from __future__ import annotations (Python 3.14 has PEP 649); PEP 758 except: except A, B: must have parens unless binding

Type hints on public functions; mypy strict mode; Google-style docstrings; line length 88; functions <50 lines; files <800 lines

Define errors as Error inheriting from DomainError, never directly from Exception/RuntimeError/etc; enforced by check_domain_error_hierarchy.py

Use Pydantic v2 frozen + extra='forbid' on every frozen model project-wide (gate check_frozen_model_extra_forbid.py); @computed_field for derived; NotBlankStr for identifiers

Args models at every system boundary; parse_typed() for every external dict ingestion; enforced by check_boundary_typed.py

Use immutability: model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries

Use asyncio.TaskGroup for fan-out/fan-in async operations; helpers must catch Exception and re-raise MemoryError/RecursionError

Implement Clock seam: clock: Clock | None = None; tests inject FakeClock

Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable

Wrap untrusted content with wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML (SEC-1)

Import logger as: from synthorg.observability import get_logger; variable always named logger; never use import logging or print() in app code

Use e...

Files:

src/synthorg/observability/events/evals.py

src/synthorg/observability/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Telemetry: opt-in, off by default; every event property must be in _ALLOWED_PROPERTIES

Files:

src/synthorg/observability/events/evals.py

src/**/*.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

src/synthorg/observability/events/evals.py

tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use test markers: @pytest.mark.{unit,integration,e2e,slow}; async auto; timeout 30s global; coverage 80% min

xdist -n 8 --dist=loadfile auto-applied via pyproject addopts (loadfile prevents 3.14+Windows ProactorEventLoop leak)

Windows: unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race); subprocess tests override back

Test doubles: FakeClock for Clock seam; mock_ofT for typed-boundary substitutions; SimpleNamespace for attribute-bags; bare MagicMock at typed boundary blocked by scripts/check_mock_spec.py (zero-tolerance)

FakeClock and mock_of import from tests._shared; inject via clock= and helper's spec subscript

Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))

Flaky tests: NEVER skip/xfail; fix fundamentally; use asyncio.Event().wait() not sleep(large)

Files:

tests/unit/observability/test_events.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_scorecard_emit.py
tests/evals_spine/test_brief_loader.py
tests/evals_spine/test_judged_calibration.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

tests/unit/observability/test_events.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_scorecard_emit.py
tests/evals_spine/test_brief_loader.py
tests/evals_spine/test_judged_calibration.py

🧠 Learnings (2)

📓 Common learnings

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Read design spec page before implementing; deviations need approval from DESIGN_SPEC.md

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Present every plan for accept/deny before coding

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Apply no region/currency/locale privilege; use metric units and British English

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Every convention PR must ship its enforcement gate

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Never edit tests/baselines/unit_timing.json or any scripts/*_baseline.{txt,json} / scripts/_*_baseline.py; timeout/slow failures indicate source-code regression

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: After issue: branch + commit + push (no auto-PR); use /pre-pr-review before PR; after PR use /aurelio-review-pr for external feedback; fix EVERYTHING valid before merging

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Commits: <type>: <description> (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via synthorg-repo-bot)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Branches: <type>/<slug> from main

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Squash merge; PR body becomes squash commit; trailers (Release-As, Closes `#N`) must be in PR body

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: GitHub queries: use gh issue list via Bash, NOT MCP list_issues

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: After every squash merge → /post-merge-cleanup

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API

📚 Learning: 2026-05-05T09:04:46.195Z

Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

src/synthorg/observability/events/evals.py
tests/unit/observability/test_events.py
evals/errors.py
evals/scoring/aggregate.py
tests/evals_spine/test_penalties.py
evals/models/scorecard.py
evals/scoring/executable.py
evals/models/brief.py
evals/scoring/penalties.py
evals/emit/json_writer.py
tests/evals_spine/test_scorecard_emit.py
evals/loader/briefs.py
tests/evals_spine/test_brief_loader.py
tests/evals_spine/test_judged_calibration.py

🔇 Additional comments (13)

src/synthorg/observability/events/evals.py (1)

1-13: LGTM!

tests/unit/observability/test_events.py (1)

336-339: LGTM!

evals/errors.py (1)

43-54: LGTM!

Also applies to: 120-120

evals/scoring/aggregate.py (1)

12-12: LGTM!

Also applies to: 34-34, 131-133

tests/evals_spine/test_penalties.py (1)

21-21: LGTM!

Also applies to: 185-210

evals/models/scorecard.py (1)

21-21: LGTM!

Also applies to: 62-62, 82-85, 147-162, 166-166, 190-191, 197-205, 209-209, 259-259, 271-271

evals/scoring/executable.py (1)

18-18: LGTM!

Also applies to: 25-30, 73-73, 83-83, 109-113, 121-121, 148-168, 186-215, 231-231

evals/models/brief.py (1)

53-57: LGTM!

Also applies to: 117-117, 121-121, 253-253

evals/scoring/penalties.py (1)

17-21: LGTM!

Also applies to: 91-91, 101-110

tests/evals_spine/test_scorecard_emit.py (1)

31-41: LGTM!

Also applies to: 53-53, 236-283

evals/loader/briefs.py (1)

12-16: LGTM!

Also applies to: 44-45, 49-49, 53-66

tests/evals_spine/test_brief_loader.py (1)

9-13: LGTM!

Also applies to: 131-161

tests/evals_spine/test_judged_calibration.py (1)

3-3: LGTM!

Also applies to: 81-81, 84-84

52 unit tests green. Lands the eval-spine scoring + data layer: brief and scorecard frozen Pydantic v2 models, YAML loaders via parse_typed, executable + judged graders with Spearman ordinal calibration gate (rho >= 0.7), process-fact penalty table with cap + floor, deterministic JSON and Markdown emitters, DomainError-based eval error hierarchy. pyproject.toml wires evals/ for ruff/pyright/pytest. Phase 3 runner + Phase 8 baselines + Phase 9 wiring follow.

…dule Pre-push mypy gate flagged: (a) write_brief_yaml fixture's return type missing -- introduce BriefYamlWriter Callable type alias in conftest; (b) test_runner_broken_scores_worse.py importing evals.run at module scope failed because that module lands in a follow-up commit -- switch to pytest.importorskip inside helpers so the test collects cleanly until the orchestrator lands; (c) test_load_anchor_set_below_minimum_raises missing Path annotation.

…nventions Apply 19 of 22 review findings from /pre-pr-review (1 invalid finding rejected; 1 deferred with rationale). Notable changes: (1) Critical: path-traversal containment in evals/loader/anchors.py via Path.resolve().is_relative_to(); (2) Critical: consolidate SCORECARD_SCHEMA_VERSION to a single source-of-truth in evals/models/scorecard.py; (3) Wrong exc type for non-dict brief YAML switched from BriefSuiteEmptyError to TypeError; (4) HiddenCheckSpec rejects shell metacharacters in cmd[0] at load time (defence-in-depth atop shell=False); (5) Cross-field validators on BriefResult, JudgeCalibrationReport, ProcessFactReport, AggregatedProcessFacts, PenaltyEntry, AggregationResult, CheckOutcome, ExecutableGrade, ScriptedJudge, JudgedGrade, AnchorItem; (6) Refactor _run_check / grade_executable below the 50-line ceiling via _outcome_from_completed / _outcome_from_timeout / _run_class helpers; (7) Parametrize-collapse Spearman and executable-grade tests; (8) Issue back-refs and migration-framing scrubbed from docstrings; (9) CLAUDE.md layout line now mentions top-level evals/; (10) 16 new validator + parametrize tests, 68 unit tests pass. Rejected: PEP 758 finding (CLAUDE.md says no parens unless binding; we ARE binding). Deferred: Brief discriminated union (agent flagged optional, current post-validator catches the same XOR errors).

Reviewer feedback on PR #2025 from the initial CodeRabbit + Gemini pass. All findings verified against the current code; everything valid is fixed in this round (no deferrals). evals/loader/briefs.py refuses non-file glob entries and resolved paths outside briefs_dir (new BriefSuitePathTraversalError); blocks symlink escape and directory-with-yaml-suffix. evals/models/brief.py introduces DEFAULT_HIDDEN_CHECK_TIMEOUT_SECONDS as a module-level Final constant in place of the inline default. evals/models/scorecard.py adds a BriefResult validator enforcing kind=JUDGED iff judge_calibration is set, and AggregatedProcessFacts now rejects negative per-class counts. evals/scoring/aggregate.py types PenaltyEntry.event_constant as NotBlankStr and aggregate_brief_score refuses negative per-class counts before applying penalties. evals/scoring/executable.py types CheckOutcome.label as NotBlankStr, _outcome_from_completed records measured time.perf_counter() duration, _run_check catches PermissionError as tool-missing, and event names come from observability.events.evals constants instead of string literals. evals/scoring/penalties.py rejects negative points_per_event values and floor above GRADE_CEILING. evals/emit/json_writer.py uses tempfile + Path.replace + fsync for atomic write; the previous Path.write_text was advertised atomic but is not. tests/evals_spine/test_judged_calibration.py types predicate as Callable[[float], bool], type-ignore removed. is_clean stays @Property (NOT @computed_field) on ProcessFactReport, AggregatedProcessFacts, AggregationResult, ExecutableGrade: @computed_field serialises the value into JSON, but extra=forbid then rejects it on model_validate_json round-trip; matches existing Scorecard.total / max_total / is_passing pattern. Added positive tests for every new validator (path traversal, negative counts, kind/calibration consistency, PenaltyTable bounds).

Round 1 push #2025 added src/synthorg/observability/events/evals.py but did not update tests/unit/observability/test_events.py::TestEventConstants::test_all_domain_modules_discovered. That test enumerates every legitimate domain module under observability/events/ and refuses unknowns. Adding the new evals namespace to the expected set unblocks the pre-push pytest gate.

CodeRabbit re-review on head 48c3121 posted two actionable comments. Both verified valid against the current code and fixed in this commit. evals/emit/json_writer.py: NamedTemporaryFile now opens with newline LF only, so the serialised JSON is deterministic across platforms. Without this, on Windows the text-mode write would translate LF to CRLF and the scorecard JSON would differ from the same payload produced on Linux. evals/scoring/aggregate.py: AggregationResult now carries the floor used by the aggregator (defaults to GRADE_FLOOR for backward compat) and the score validator clamps against that floor, not the global GRADE_FLOOR. Previously a PenaltyTable with a non-zero floor would produce a final_score the validator rejected. Added a regression test in tests/evals_spine/test_penalties.py that aggregates with a custom floor of 50 and asserts the score is clamped consistently.

…r_consistently docstring Pre-push hooks no-review-origin-in-code and no-migration-framing both flagged the docstring on test_aggregate_uses_penalty_table_floor_consistently in tests/evals_spine/test_penalties.py. The phrase explaining what the validator did 'before the round-4 fix' is exactly the kind of historical narrative the gates forbid: it rots the moment the old behaviour stops being a thing anyone remembers. The replacement describes current invariants only (the floor flows from the penalty table into AggregationResult; the score validator and aggregator share the floor).

coderabbitai

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/emit/json_writer.py`:
- Line 25: Remove the unused logger initialization: delete the statement that
creates the module-level logger (logger = get_logger(__name__)) and also remove
the corresponding get_logger import if it isn't used elsewhere in this module;
ensure no references to the symbol logger remain (search for logger and
get_logger in this file) before committing the change.

In `@evals/emit/markdown_writer.py`:
- Around line 128-133: The write_scorecard_md function currently calls
target.write_text(render_scorecard_md(scorecard), encoding="utf-8") which allows
platform-dependent newline translation; change this call to pass newline="\n" so
the file is written with deterministic LF line endings (consistent with
json_writer.py). Update the write_text invocation in write_scorecard_md (and
ensure it still uses SCORECARD_MD_FILENAME and render_scorecard_md) to include
newline="\n" while preserving encoding="utf-8".
- Line 23: The module defines an unused logger via the variable logger =
get_logger(__name__) in markdown_writer.py; remove that unused initialization
and any related unused imports (get_logger or logger) from the file, or if
logging is intended, replace it with a used logging call (e.g., use the logger
inside functions like write_markdown or export_markdown) — locate the
logger/get_logger symbols in the module and either delete them or integrate them
into existing functions so the symbol is actually used.

In `@evals/models/scorecard.py`:
- Around line 252-255: The is_passing method currently truncates the threshold
by using int(self.max_total * PASS_FRACTION); replace that with a non-truncating
ceiling comparison so the required pass count isn't lowered for non-integer
thresholds—use math.ceil(self.max_total * PASS_FRACTION) (import math if needed)
and update is_passing to compare self.total against that ceiling value.

In `@evals/scoring/__init__.py`:
- Around line 3-12: The module docstring incorrectly states "Three concerns"
while listing four modules; update the top-level docstring in evals.scoring (the
module docstring in __init__.py) to read "Four concerns" (or a neutral phrase
like "Several concerns") so the count matches the listed modules
(:mod:`evals.scoring.penalties`, :mod:`evals.scoring.executable`,
:mod:`evals.scoring.judged`, :mod:`evals.scoring.aggregate`).

In `@evals/scoring/aggregate.py`:
- Around line 112-115: Update the docstring that describes the parameter "grade"
so it matches the implementation: remove the statement that out-of-range values
are "clamped" and instead say that grades must be in [0, 100] and values outside
that range raise a ValueError (no clamping). Ensure the docstring for the
function/method that accepts the "grade" parameter mirrors the actual validation
behavior where a ValueError is raised when the grade is out of bounds.

In `@evals/scoring/judged.py`:
- Around line 210-216: Replace the hardcoded event string
"evals.judge.calibration_failed" used in the logger.warning call inside
judged.py with the corresponding event constant exported from
synthorg.observability.events.evals (e.g., EVALS_JUDGE_CALIBRATION_FAILED); add
the import for that constant at the top of the file and use the constant in the
logger.warning call so the code matches the existing event-constant convention.

In `@tests/evals_spine/test_executable_grader.py`:
- Around line 78-96: The test matrix in pytest.mark.parametrize for
("hidden_pass", "build_pass", "lint_pass", "expected_score") is missing the two
boolean combinations (hidden=True, build=False, lint=False) and (hidden=False,
build=False, lint=True); add entries for (True, False, False, EXEC_TOTAL -
EXEC_WEIGHT_BUILD - EXEC_WEIGHT_LINT) and (False, False, True, EXEC_TOTAL -
EXEC_WEIGHT_HIDDEN - EXEC_WEIGHT_BUILD) and append corresponding ids (e.g.,
"build_and_lint_fail" and "hidden_and_build_fail") so the full 8-combination
coverage is present in tests/evals_spine/test_executable_grader.py using the
existing pytest.mark.parametrize block and the constants EXEC_TOTAL,
EXEC_WEIGHT_HIDDEN, EXEC_WEIGHT_BUILD, EXEC_WEIGHT_LINT.

In `@tests/evals_spine/test_judged_calibration.py`:
- Line 306: Add a single module-level import for load_anchor_set from
evals.loader.anchors (alongside the other anchors imports) and remove the two
local imports that appear inside the tests (the local load_anchor_set imports
within the two test functions). This consolidates imports: declare "from
evals.loader.anchors import load_anchor_set" at top-level and delete the
duplicate local import statements in the test bodies so the tests use the
module-level symbol.
- Line 287: Consolidate the local pydantic ValidationError imports by adding a
single module-level import "from pydantic import ValidationError as
PydValidationError" at the top of the test module and removing the three
redundant local imports of PydValidationError inside the individual tests;
update any local references to use the module-level PydValidationError symbol so
tests continue to raise/handle Pydantic ValidationError consistently.
- Line 324: Remove the redundant local import of AnchorItem by deleting the
duplicate "from evals.loader.anchors import AnchorItem" line in the
test_judged_calibration.py module so only the module-level import of AnchorItem
remains; search for any extra occurrences of the symbol AnchorItem within that
file to ensure you remove only the duplicate import and not any necessary
usages.

In `@tests/evals_spine/test_runner_broken_scores_worse.py`:
- Line 53: The assertion in the test uses a strict greater-than comparison
(reference.total > broken.total + SCORE_MARGIN) but the contract requires an
inclusive threshold ("at least" SCORE_MARGIN); change the comparison to a
greater-than-or-equal comparison so the check becomes reference.total >=
broken.total + SCORE_MARGIN, ensuring a gap exactly equal to SCORE_MARGIN
passes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: a85a438d-9fc5-40c1-a517-15ae67cf84d8

📥 Commits

Reviewing files that changed from the base of the PR and between 48c3121 and 6dfda3a.

📒 Files selected for processing (29)

CLAUDE.md
evals/__init__.py
evals/emit/__init__.py
evals/emit/json_writer.py
evals/emit/markdown_writer.py
evals/errors.py
evals/loader/__init__.py
evals/loader/anchors.py
evals/loader/briefs.py
evals/models/__init__.py
evals/models/brief.py
evals/models/scorecard.py
evals/scoring/__init__.py
evals/scoring/aggregate.py
evals/scoring/executable.py
evals/scoring/judged.py
evals/scoring/penalties.py
evals/scoring/spearman.py
pyproject.toml
src/synthorg/observability/events/evals.py
tests/evals_spine/__init__.py
tests/evals_spine/conftest.py
tests/evals_spine/test_brief_loader.py
tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_scorecard_emit.py
tests/unit/observability/test_events.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
GitHub Check: Test Integration
GitHub Check: Test E2E
GitHub Check: Dashboard Test
GitHub Check: Test Unit
GitHub Check: Analyze (python)

🧰 Additional context used

📓 Path-based instructions (5)

src/synthorg/!(persistence)/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only src/synthorg/persistence/ may import sqlite/psycopg or emit raw SQL

Files:

src/synthorg/observability/events/evals.py

src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Configuration precedence: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env at boot site; no os.environ.get outside startup

Files:

src/synthorg/observability/events/evals.py

src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

No hardcoded numeric values; numerics live in settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants of the form NAME: int|float|Final|Final[int]|Final[float] = literal

Comments should explain WHY only; no reviewer citations, issue back-refs, or migration framing

No from __future__ import annotations (Python 3.14 has PEP 649); use PEP 758 except: except A, B: requires parens if binding

Type hints required on public functions; mypy strict mode; Google-style docstrings

Line length 88; functions <50 lines; files <800 lines

Errors must use <Domain><Condition>Error naming and inherit from DomainError; never inherit directly from Exception/RuntimeError/etc

Pydantic v2 frozen models must use extra="forbid" on every frozen model; @computed_field is auto-exempt; per-line # lint-allow: frozen-extra-forbid -- <reason> for exceptions

@computed_field should be used for derived properties in Pydantic models

Use NotBlankStr for identifiers in Pydantic models

Args models required at every system boundary; use parse_typed() for every external dict ingestion

Immutability: use model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries

Async code must use asyncio.TaskGroup for fan-out/fan-in; helpers catch Exception and re-raise MemoryError/RecursionError

Use Clock seam: clock: Clock | None = None; tests inject FakeClock

Lifecycle services must own _lifecycle_lock; timed-out stops mark unrestartable

Untrusted content (SEC-1) must be wrapped via wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML

Import logger via from synthorg.observability import get_logger; variable always named logger; never use import logging or print() in app code

Event names must come from observability.events.<domain> constants; use structured kwargs in log calls (logger.info(EVENT, key=value))

Error paths must log WARNING/ERROR w...

Files:

src/synthorg/observability/events/evals.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

src/synthorg/observability/events/evals.py

tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Test markers: @pytest.mark.{unit,integration,e2e,slow}

Async tests use auto mode; global timeout 30s; coverage 80% minimum

Test doubles ladder: FakeClock for Clock seam, mock_of[T](**overrides) for typed-boundary substitutions, SimpleNamespace for attribute-bags; bare MagicMock at typed boundary is blocked by scripts/check_mock_spec.py

FakeClock and mock_of import from tests._shared; inject via clock= and helper's spec subscript

Flaky tests: NEVER skip/xfail; fix fundamentally; use asyncio.Event().wait() not sleep(large)

Files:

tests/unit/observability/test_events.py
tests/evals_spine/conftest.py
tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_scorecard_emit.py
tests/evals_spine/test_brief_loader.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

tests/unit/observability/test_events.py
tests/evals_spine/conftest.py
tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_scorecard_emit.py
tests/evals_spine/test_brief_loader.py

tests/**/*test*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Windows unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race); subprocess tests override back

Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))

Files:

tests/unit/observability/test_events.py
tests/evals_spine/conftest.py
tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_scorecard_emit.py
tests/evals_spine/test_brief_loader.py

🧠 Learnings (4)

📓 Common learnings

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Read `docs/design/` page before implementing; deviations need approval

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Present every plan for accept/deny before coding

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: No region/currency/locale privileged; use metric units; British English

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Every convention PR must ship its enforcement gate

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Never edit `tests/baselines/unit_timing.json` or any `scripts/*_baseline.{txt,json}` / `scripts/_*_baseline.py` files; timeout/slow failures indicate source-code regression

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: After issue: branch + commit + push (no auto-PR); use `/pre-pr-review` before creating PR

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: After PR creation: use `/aurelio-review-pr` for external feedback and fix all valid issues before merging

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Use `go -C cli` command, never `cd cli`

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: xdist `-n 8 --dist=loadfile` auto-applied via pyproject `addopts` (`loadfile` prevents 3.14+Windows ProactorEventLoop leak)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Branches: `<type>/<slug>` from main

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Pre-commit/pre-push hooks via `.pre-commit-config.yaml`; tool-call gates via `.claude/settings.json` PreToolUse (`scripts/check_*.sh`/`.py`)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Squash merge; PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: GitHub queries: use `gh issue list` via Bash, NOT MCP `list_issues`

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: After every squash merge → `/post-merge-cleanup`

📚 Learning: 2026-05-05T09:04:46.195Z

Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

evals/emit/__init__.py
evals/loader/__init__.py
src/synthorg/observability/events/evals.py
evals/__init__.py
evals/models/__init__.py
evals/scoring/__init__.py
evals/scoring/spearman.py
evals/errors.py
tests/unit/observability/test_events.py
tests/evals_spine/conftest.py
evals/emit/markdown_writer.py
evals/emit/json_writer.py
tests/evals_spine/test_executable_grader.py
evals/loader/anchors.py
evals/models/brief.py
evals/models/scorecard.py
evals/scoring/aggregate.py
tests/evals_spine/test_penalties.py
tests/evals_spine/test_judged_calibration.py
evals/scoring/executable.py
tests/evals_spine/test_runner_broken_scores_worse.py
tests/evals_spine/test_scorecard_emit.py
evals/loader/briefs.py
evals/scoring/judged.py
evals/scoring/penalties.py
tests/evals_spine/test_brief_loader.py

📚 Learning: 2026-05-16T18:36:31.446Z

Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.

Applied to files:

CLAUDE.md

📚 Learning: 2026-05-16T18:36:35.250Z

Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).

Applied to files:

CLAUDE.md

🔇 Additional comments (45)

tests/evals_spine/conftest.py (1)

1-60: LGTM!

tests/evals_spine/test_brief_loader.py (1)

19-180: LGTM!

evals/loader/briefs.py (1)

1-94: LGTM!

evals/loader/__init__.py (1)

1-8: LGTM!

evals/loader/anchors.py (1)

1-133: LGTM!

tests/evals_spine/test_runner_broken_scores_worse.py (1)

1-52: LGTM!

Also applies to: 54-88

evals/scoring/spearman.py (1)

1-97: LGTM!

evals/scoring/penalties.py (1)

1-153: LGTM!

tests/evals_spine/test_penalties.py (1)

1-233: LGTM!

tests/evals_spine/test_scorecard_emit.py (1)

1-316: LGTM!

tests/unit/observability/test_events.py (1)

336-339: LGTM!

tests/evals_spine/test_executable_grader.py (2)

26-75: LGTM!

97-230: LGTM!

src/synthorg/observability/events/evals.py (2)

1-10: LGTM!

12-13: The constants at lines 12-13 are properly imported and used in the executable scoring module (evals/scoring/executable.py). Both EVALS_EXECUTABLE_TIMEOUT and EVALS_EXECUTABLE_TOOL_MISSING are used correctly in structured logging calls with appropriate context kwargs and secure error handling (safe_error_description(exc) instead of raw exception strings).

CLAUDE.md (1)

3-3: LGTM!

evals/__init__.py (1)

1-15: LGTM!

pyproject.toml (1)

190-190: LGTM!

Also applies to: 283-283, 363-363, 378-378

evals/errors.py (1)

1-130: LGTM!

evals/models/__init__.py (1)

1-7: LGTM!

evals/models/brief.py (1)

1-269: LGTM!

evals/models/scorecard.py (1)

1-251: LGTM!

Also applies to: 256-313

evals/scoring/executable.py (4)

1-91: LGTM!

93-134: LGTM!

136-216: LGTM!

218-300: LGTM!

evals/scoring/judged.py (5)

1-52: LGTM!

54-119: LGTM!

122-171: LGTM!

174-222: LGTM!

225-298: LGTM!

evals/emit/json_writer.py (2)

37-45: Past review finding has been addressed.

The past review comment flagged the missing newline="\n" parameter. The current code includes it at line 41, ensuring platform-independent LF line endings for byte-level determinism.

31-57: LGTM!

evals/emit/__init__.py (1)

1-1: LGTM!

evals/emit/markdown_writer.py (6)

34-37: LGTM!

40-48: LGTM!

51-69: LGTM!

72-79: LGTM!

82-97: LGTM!

100-125: LGTM!

tests/evals_spine/test_judged_calibration.py (5)

1-32: LGTM!

36-102: LGTM!

108-159: LGTM!

162-197: LGTM!

200-258: LGTM!

All 12 actionable comments from the CodeRabbit re-review (review id 4331444484) verified valid and fixed in this commit. Source files: json_writer.py drops the unused module-level logger so the symbol does not lie about the import surface; markdown_writer.py does the same and write_scorecard_md now pins newline=LF for byte-identical Markdown across platforms; scorecard.is_passing uses math.ceil instead of int() so fractional pass thresholds round up to the next integer (a max_total of 295 at 65% requires 192, not 191); the evals.scoring package docstring corrects 'Three concerns' to 'Four concerns' to match the four listed modules; aggregate_brief_score docstring drops the obsolete 'clamped' claim and accurately states that out-of-range grades raise ValueError; judged.py logs the calibration-failed event via EVALS_JUDGE_CALIBRATION_FAILED imported from observability.events.evals (added there) instead of an inline literal. Test files: test_executable_grader.py parametrize matrix expanded from six to the full eight 2x2x2 boolean combinations across hidden/build/lint pass flags; test_judged_calibration.py consolidates yaml + PydValidationError + load_anchor_set + MIN_ANCHOR_SET_SIZE + JudgeCalibrationReport + JudgedGrade + AnchorItem imports to the module top, dropping the seven duplicate local imports; test_runner_broken_scores_worse.py SCORE_MARGIN comparison switches from > to >= so a reference run that beats the broken run by exactly SCORE_MARGIN points still passes (the contract is 'at least SCORE_MARGIN', inclusive).

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

evals/models/scorecard.py (1)
129-177: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Persist the per-brief score floor in BriefResult.

aggregate_brief_score() now allows score = max(grade - deduction, penalty_table.floor), but BriefResult still revalidates against max(grade - deduction, GRADE_FLOOR) and has no field to carry the configured floor. Any non-zero PenaltyTable.floor will make a valid aggregation impossible to serialize into the scorecard without either failing validation or lying about how the score was derived. Either thread the floor through BriefResult/Scorecard and version the schema, or reject non-zero floors before scorecard construction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/models/scorecard.py` around lines 129 - 177, The BriefResult validation
assumes a global GRADE_FLOOR but aggregate_brief_score can use a
PenaltyTable.floor; add a score_floor int field to BriefResult (defaulting to
GRADE_FLOOR for compatibility) and update the
_score_matches_grade_minus_deduction validator to compute expected =
max(self.grade - self.deduction, self.score_floor) so the serialized row records
the actual floor used. Also ensure aggregate_brief_score (and Scorecard
construction) populates BriefResult.score_floor from PenaltyTable.floor when
creating BriefResult instances (or else reject non-zero PenaltyTable.floor
before building scorecards if you choose to forbid non-default floors).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@evals/models/scorecard.py`:
- Around line 129-177: The BriefResult validation assumes a global GRADE_FLOOR
but aggregate_brief_score can use a PenaltyTable.floor; add a score_floor int
field to BriefResult (defaulting to GRADE_FLOOR for compatibility) and update
the _score_matches_grade_minus_deduction validator to compute expected =
max(self.grade - self.deduction, self.score_floor) so the serialized row records
the actual floor used. Also ensure aggregate_brief_score (and Scorecard
construction) populates BriefResult.score_floor from PenaltyTable.floor when
creating BriefResult instances (or else reject non-zero PenaltyTable.floor
before building scorecards if you choose to forbid non-default floors).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: f2fa10d1-679d-449d-bd31-72249dc71096

📥 Commits

Reviewing files that changed from the base of the PR and between 6dfda3a and e57d7a0.

📒 Files selected for processing (10)

evals/emit/json_writer.py
evals/emit/markdown_writer.py
evals/models/scorecard.py
evals/scoring/__init__.py
evals/scoring/aggregate.py
evals/scoring/judged.py
src/synthorg/observability/events/evals.py
tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_runner_broken_scores_worse.py

💤 Files with no reviewable changes (1)

evals/emit/json_writer.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
GitHub Check: Dashboard Test
GitHub Check: Test Unit
GitHub Check: Test Integration
GitHub Check: Analyze (python)

🧰 Additional context used

📓 Path-based instructions (3)

tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Async test fixtures: use auto mode; global timeout 30s; coverage minimum 80%; xdist -n 8 --dist=loadfile auto-applied via pyproject addopts (loadfile prevents 3.14+ Windows ProactorEventLoop leak)

Windows unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race); subprocess tests override back

Test markers: @pytest.mark.{unit,integration,e2e,slow}; test doubles ladder in conventions.md section 12.1; use FakeClock and mock_of[T](**overrides) from tests._shared; SimpleNamespace for attribute-bags; bare MagicMock at typed boundary blocked by scripts/check_mock_spec.py

Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))

Flaky tests: NEVER skip/xfail; fix fundamentally; use asyncio.Event().wait() instead of sleep(large)

Files:

tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_runner_broken_scores_worse.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

tests/evals_spine/test_executable_grader.py
tests/evals_spine/test_judged_calibration.py
tests/evals_spine/test_runner_broken_scores_worse.py

src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Configuration precedence order: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env; YAML is ingestion format only, not a precedence tier; no os.environ.get outside startup; pre-init Cat-2 reads use settings.bootstrap_resolver.resolve_init_value

Lifecycle services must own _lifecycle_lock; timed-out stops mark services unrestartable

Files:

src/synthorg/observability/events/evals.py

src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

No hardcoded numeric values; numerics live in settings/definitions/; allowlist 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants of form NAME: int|float|Final|Final[int]|Final[float] = literal per scripts/check_no_magic_numbers.py

Use asyncio.TaskGroup for fan-out/fan-in in async code; helpers must catch Exception and re-raise MemoryError/RecursionError

Use Clock seam pattern: clock: Clock | None = None; tests inject FakeClock from tests._shared

Wrap untrusted content via wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML per SEC-1

Use from synthorg.observability import get_logger with variable always named logger; never use import logging or print() in app code

Event names must come from observability.events.<domain> constants; use structured kwargs (logger.info(EVENT, key=value))

Error paths must log WARNING/ERROR with context before raising; state transitions log INFO via *_STATUS_TRANSITIONED AFTER persistence write

Secret-log redaction (SEC-1): never use error=str(exc) or interpolate {exc}; use error_type=type(exc).__name__ + error=safe_error_description(exc); never exc_info=True; OTel forbidden span.record_exception(exc), use span.set_attribute("exception.message", safe_error_description(exc)) + record_exception=False, set_status_on_exception=False per check_logger_exception_str_exc.py

Comments must explain WHY only; no reviewer citations, issue back-refs, or migration framing per check_no_review_origin_in_code.py and check_no_migration_framing.py

No from __future__ import annotations (Python 3.14+ has PEP 649); PEP 758 except: except A, B: requires parens when binding

Type hints required on public functions; mypy strict mode; Google-style docstrings; line length 88; functions <50 lines; files <800 lines

Error classes must follow <Domain><Condition>Error naming and inherit from DomainError; never inherit directly from `Exce...

Files:

src/synthorg/observability/events/evals.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

src/synthorg/observability/events/evals.py

🧠 Learnings (2)

📓 Common learnings

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Read design specification in `docs/design/` before implementing; deviations require approval per [DESIGN_SPEC.md](docs/DESIGN_SPEC.md)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Present every plan for accept/deny before coding

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: No region/currency/locale should be privileged; use metric units and British English per [docs/reference/regional-defaults.md](docs/reference/regional-defaults.md)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Every convention PR must ship its enforcement gate per [docs/reference/convention-gates.md](docs/reference/convention-gates.md)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Timeout/slow test failures indicate source-code regression; never edit `tests/baselines/unit_timing.json` or baseline files; bypass only via `ALLOW_BASELINE_GROWTH=1 git commit ...` with explicit user approval

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: After issue implementation: branch + commit + push without auto-PR; use `/pre-pr-review` (gh pr create is blocked); after PR use `/aurelio-review-pr` for external feedback; fix all valid issues without deferring

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Git commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Branch naming: `<type>/<slug>` from main

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Pre-commit/pre-push hooks via `.pre-commit-config.yaml`; tool-call gates via `.claude/settings.json` PreToolUse (`scripts/check_*.sh`/`.py`)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Squash merge; PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: After every squash merge → run `/post-merge-cleanup`

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Use `uv sync` for all dependencies, `uv sync --group docs` for docs toolchain; run `bash scripts/install_cli_tools.sh` once per machine for golangci-lint; run `bash scripts/install_git_hooks.sh` once per clone to wire `core.hooksPath -> scripts/git-hooks`

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Linting: `uv run ruff check src/ tests/ --fix` + `uv run ruff format src/ tests/`; type checking: `uv run mypy --num-workers=4 src/ tests/` (strict)

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Testing: `uv run python -m pytest tests/ -m {unit|integration|e2e}` with `-n 8 --dist=loadfile` via pyproject `addopts`; coverage minimum 80%; use `HYPOTHESIS_PROFILE={dev|fuzz}` for property tests

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Pre-commit checks: `uv run pre-commit run --all-files`; schema drift: `uv run python scripts/check_schema_drift_revisions.py --backend {sqlite|postgres}`

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Documentation build: `PYTHONPATH=. uv run zensical build` for docs; reference [docs/reference/claude-reference.md](docs/reference/claude-reference.md) for doc layout, Docker, releasing, CI, dependencies, Hypothesis

Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Web: see `web/CLAUDE.md` for guidelines; CLI: see `cli/CLAUDE.md` and use `go -C cli` (never `cd cli`); Shell: see `~/.claude/rules/common/bash.md` for canonical `cd`/`git -C`/Bash file-write rules

📚 Learning: 2026-05-05T09:04:46.195Z

Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

evals/scoring/__init__.py
tests/evals_spine/test_executable_grader.py
src/synthorg/observability/events/evals.py
evals/scoring/aggregate.py
evals/emit/markdown_writer.py
tests/evals_spine/test_judged_calibration.py
evals/models/scorecard.py
evals/scoring/judged.py
tests/evals_spine/test_runner_broken_scores_worse.py

🔇 Additional comments (9)

src/synthorg/observability/events/evals.py (1)

14-14: LGTM!

tests/evals_spine/test_executable_grader.py (1)

85-87: LGTM!

Also applies to: 95-97

tests/evals_spine/test_judged_calibration.py (1)

7-8: LGTM!

Also applies to: 11-16, 26-26, 30-30

evals/scoring/__init__.py (1)

3-3: LGTM!

evals/scoring/aggregate.py (1)

112-116: LGTM!

evals/scoring/judged.py (1)

24-24: LGTM!

Also applies to: 211-217

evals/models/scorecard.py (1)

10-10: LGTM!

Also applies to: 252-261

evals/emit/markdown_writer.py (1)

124-133: LGTM!

tests/evals_spine/test_runner_broken_scores_worse.py (1)

53-53: LGTM!

The new CodeRabbit review on head e57d7a0 (review id 4331572429) carried a single outside-diff-range finding pointing out that round 4 fixed the AggregationResult floor inconsistency but stopped at the scorer boundary: BriefResult still validated against the global GRADE_FLOOR while aggregate_brief_score produced scores using penalty_table.floor. A non-zero PenaltyTable.floor would make a valid aggregation impossible to serialise into the scorecard without either failing validation or lying about how the score was derived. BriefResult now carries a score_floor field (default GRADE_FLOOR for backward compat) and the validator clamps against self.score_floor instead of the global. aggregate_brief_score already returns AggregationResult.floor; downstream code that builds BriefResult from an AggregationResult should forward floor -> score_floor so the serialised row records the exact bound the aggregator applied. Added two regression tests in tests/evals_spine/test_scorecard_emit.py: one that accepts a non-default custom floor of 50, one that rejects a score below the custom floor.

Aureliolo · 2026-05-20T19:45:10Z

@coderabbitai review

coderabbitai · 2026-05-20T19:45:18Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

…y benchmark (#2025) Part of #1979. Lands the **scoring + data layer** of the golden-company benchmark (eval spine), as documented in the Status (2026-05-20) section of the linked issue. This PR intentionally does **not** close #1980. The runner (cassette guard + telemetry collector + orchestrator + CLI + baselines + recorded cassette + `make benchmark` target) is held until a runtime prerequisite gap closes. See "Prerequisites for the runner PR" in #1980 for the full follow-up plan. ## What landed The complete scoring + data contract for the eval spine, 68 unit tests passing: - `evals/models/{brief,scorecard}.py` -- frozen Pydantic v2 models, schema-versioned (`SCORECARD_SCHEMA_VERSION = 1`). - `evals/loader/{briefs,anchors}.py` -- YAML loaders via `synthorg.api.boundary.parse_typed`; path-traversal containment on anchor-set loading. - `evals/scoring/executable.py` -- subprocess grader (hidden / build / lint, weighted 60/20/20); rejects shell metacharacters at load time as defence-in-depth. - `evals/scoring/judged.py` + `spearman.py` -- ordinal calibration with a Spearman rho gate (`>= 0.7`); refuses to score when the judge's ordering disagrees with the hand-scored anchor set. - `evals/scoring/penalties.py` + `aggregate.py` -- process-fact penalty table with cap + floor. - `evals/emit/{json_writer,markdown_writer}.py` -- deterministic JSON + Markdown emitters. - `evals/errors.py` -- `DomainError`-rooted hierarchy. - `pyproject.toml` -- `evals/` wired for ruff (`src`), pyright (`extraPaths`), pytest (`pythonpath`), isort (`known-first-party`). - `CLAUDE.md` -- layout line now mentions the new `evals/` top-level directory. Pydantic models carry cross-field invariant validators so the scorecard is auditable from any single field: `BriefResult` enforces `score == max(grade - deduction, GRADE_FLOOR)`; `JudgeCalibrationReport` enforces `passed == (rho >= gate)`; `PenaltyEntry` enforces `raw == points_per_event * count` and `applied <= raw`; `AggregatedProcessFacts` enforces `total_events == sum(events_by_class)`; `ProcessFactReport` enforces entries match the tracked events_by_class slice; `CheckOutcome` ties `timed_out=True` to the POSIX timeout sentinel; `ExecutableGrade` enforces pass-flags match per-label outcomes; `ScriptedJudge` requires non-empty responses or default scores; `JudgedGrade` refuses construction with a failing calibration; `AnchorItem` rejects hand_scores outside `[0.0, 1.0]`; `HiddenCheckSpec` rejects shell metacharacters in `cmd[0]`. ## What is NOT yet built (follow-up PR scope) - `evals/runner/cassette_guard.py` -- cassette SHA256 + REPLAY enforcement. - `evals/runner/telemetry_collector.py` -- structlog event bucketing. - `evals/runner/orchestrator.py` -- per-brief boot/submit/wait/collect loop. - `evals/run.py` + `evals/__main__.py` -- CLI entry point. - `evals/baselines/{reference,broken}.yaml`, `evals/briefs/*.yaml`, `evals/cassettes/reference_run.cassette.json`. - `Makefile` target + `.gitignore`. - The acceptance test `test_runner_broken_scores_worse.py` going green (currently uses `pytest.importorskip("evals.run")`). ## Why the runner did NOT land in this PR A runtime spike (`create_app()` + submit one `TaskRequirement` via the `IntakeEngine`) confirmed the studio's agent-execution path is not yet wired end-to-end. Intake transitions fire (`intake.request.received` -> `triaging` -> `scoped`), then `TaskEngineNotRunningError` raises. No `BUDGET_*` / `EXECUTION_*` / `STAGNATION_*` / `APPROVAL_GATE_*` events appear. Without those events, broken.yaml and reference.yaml would produce identical empty process-fact traces and the scorer could not discriminate. Per project rule "admit you cannot in plain words, BEFORE": surfacing this gap rather than shipping a green acceptance test that asserts nothing. Prerequisites tracked in #1980's Status section. ## Test plan ```bash uv run python -m pytest tests/evals_spine/ -m unit ``` Result locally: 68 passed in 9.13s. ```bash uv run ruff check src/ tests/ evals/ uv run ruff format --check src/ tests/ evals/ uv run mypy --num-workers=4 tests/evals_spine/ evals/ ``` All green. Pre-push hooks (affected mypy + pytest, ruff, em-dash gate, review-origin gate, migration-framing gate, secret detection, etc.) ran on the push and passed. ## Review coverage `/pre-pr-review` ran 9 agents in parallel against the spine. 22 findings consolidated in `_audit/pre-pr-review/triage.md`: - 2 CRITICAL: path-traversal containment, duplicate `SCORECARD_SCHEMA_VERSION` -- both fixed. - 5 MAJOR valid (wrong exception type, three issue back-refs in docstrings, missing `evals/` in CLAUDE.md layout) -- all fixed. - 1 MAJOR invalid (PEP 758 syntax: agent misread CLAUDE.md's "no parens unless binding"; we ARE binding `as exc`, parens required) -- rejected with rationale. - 11 MEDIUM valid (function-length refactor, shell-metachar guard, test-quality cleanups, 4 cross-field validators) -- all fixed. - 3 LOW deferred -- 2 implemented (extra validators on trusted-construction types, JudgeProtocol docstring polish); 1 deferred (Brief discriminated union -- the review agent itself flagged it as a pragmatic trade-off and the current post-validator catches the same XOR errors; restructuring would force a substantial test refactor for marginal benefit). Net: **19 of 22 review findings applied in this PR**, with rationale recorded in the commit body of `815ac88f` and in the triage file. ## Related - Part of #1979 (operate-tier EPIC). - Depends on #1961 (client-simulation harness, already in place) + EPIC #1973 (runnable studio, partial -- runtime gap blocks the runner PR) + #1956. - Blocks #1983 (learning curve), #1990, #1995, #1998 -- all consume the scorecard schema landing here.

## Highlights > _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub Models). Commit-based changelog below._ ### What you'll notice - Introduced conversational interface for direct clarify and propose interactions. - Cost management now includes forecast gates, hard ceilings, and Pareto considerations. - Added living documentation engine combining wiki and retrieval-augmented generation features. - Real intake engine is now operational for live data processing. - Virtual desktop tool with vision verification gate available for enhanced workspace control. ### What's new - Per-project reproducible environments for consistent setups. - Headless browser testing tool integrated for automated UI validation. - Governed external API and data access tool introduced. - Hardened external-remote git backend with sandbox mounts and push-queue dispatching. - Adversarial red-team gate subsystem for enhanced security testing. - Self-extending toolkit to dynamically expand capabilities. - Stakes-aware model routing enables prioritized processing. - Task-board entry adapter connects live runtime with project management. - Persistent project workspace with pluggable git backend and per-project push queues implemented. - Knowledge and provenance substrate added to track data lineage. - Scoring and data contract framework for golden-company benchmark evaluations. ### Under the hood - Desktop Dockerfile pinned by digest to improve build stability and documented publishing gap fixed.  :robot: I have created a release *beep* *boop* --- ## [0.8.7](v0.8.6...v0.8.7) (2026-05-22) ### Features * conversational interface v1 - 1:1 clarify + propose ([#2019](#2019)) ([216ef94](216ef94)), closes [#1968](#1968) * cost as a first-class dial (forecast gate, hard ceiling, Pareto) ([#2029](#2029)) ([700a59e](700a59e)), closes [#1982](#1982) * **env:** reproducible per-project environments ([#2039](#2039)) ([d2c0ef9](d2c0ef9)), closes [#1994](#1994) * **evals:** [#1980](#1980) spine -- scoring + data contract for golden-company benchmark ([#2025](#2025)) ([53108e8](53108e8)) * goal/objective entry adapter ([#1964](#1964)) ([#2022](#2022)) ([cb15c3c](cb15c3c)) * governed external API/data access tool ([#1991](#1991)) ([#2032](#2032)) ([e08b451](e08b451)) * harden external-remote git backend + per-project sandbox mount + push-queue dispatch ([#2020](#2020)) ([#2030](#2030)) ([2fa2e1e](2fa2e1e)) * headless browser testing tool ([#1992](#1992)) ([#2024](#2024)) ([277b52a](277b52a)) * knowledge + provenance substrate ([#2036](#2036)) ([48c897b](48c897b)) * living documentation engine (dual-purpose wiki + RAG namespace) ([#2028](#2028)) ([3d10da9](3d10da9)), closes [#1976](#1976) * real intake engine online ([#2017](#2017)) ([9d8eb34](9d8eb34)) * **redteam:** adversarial red-team gate subsystem ([#1986](#1986)) ([#2026](#2026)) ([d2207e9](d2207e9)) * self-extending toolkit ([#1995](#1995)) ([#2035](#2035)) ([5ffc545](5ffc545)) * stakes-aware model routing ([#1998](#1998)) ([#2038](#2038)) ([9b98312](9b98312)) * task-board entry adapter to live runtime ([#1963](#1963)) ([#2023](#2023)) ([a8f1eea](a8f1eea)) * virtual desktop tool and vision verifier gate ([#2031](#2031)) ([dfe8b42](dfe8b42)), closes [#1993](#1993) * **workspace:** persistent project workspace + pluggable git backend + per-project push queue ([#2021](#2021)) ([ee58ee7](ee58ee7)) ### Bug Fixes * pin desktop Dockerfile by digest (Scorecard [#309](#309)) + document publish gap ([#2034](#2034)) ([8fda188](8fda188)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Aureliolo and others added 5 commits May 20, 2026 00:38

Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 16:02 — with GitHub Actions Inactive

coderabbitai Bot requested changes May 20, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 20, 2026

View reviewed changes

Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 18:29 — with GitHub Actions Inactive

coderabbitai Bot requested changes May 20, 2026

View reviewed changes

Comment thread evals/emit/json_writer.py

Comment thread evals/scoring/aggregate.py Outdated

Aureliolo added 7 commits May 20, 2026 20:52

Aureliolo force-pushed the feat/1980-golden-benchmark branch from 48c3121 to 6dfda3a Compare May 20, 2026 18:58

Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 18:59 — with GitHub Actions Inactive

coderabbitai Bot requested changes May 20, 2026

View reviewed changes

Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 19:19 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Aureliolo force-pushed the main branch from 2049885 to 194c48a Compare May 20, 2026 19:33

Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 19:33 — with GitHub Actions Inactive

Aureliolo force-pushed the main branch from 194c48a to ee58ee7 Compare May 20, 2026 19:42

Aureliolo merged commit 7b43d6f into main May 20, 2026
83 checks passed

Aureliolo deleted the feat/1980-golden-benchmark branch May 20, 2026 20:03

Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 20:03 — with GitHub Actions Inactive

synthorg-repo-bot Bot mentioned this pull request May 20, 2026

chore(main): release 0.8.7 #2018

Merged

		import subprocess
		from typing import TYPE_CHECKING, Final, Self

Conversation

Aureliolo commented May 20, 2026

What landed

What is NOT yet built (follow-up PR scope)

Why the runner did NOT land in this PR

Test plan

Review coverage

Related

Uh oh!

github-actions Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependency Review

Scanned Files

Uh oh!

gemini-code-assist Bot commented May 20, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Uh oh!

codspeed-hq Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 20, 2026 •

edited

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

codspeed-hq Bot commented May 20, 2026 •

edited

Loading

codecov Bot commented May 20, 2026 •

edited

Loading