feat(evals): #1980 spine -- scoring + data contract for golden-company benchmark#2025
Conversation
Closes #1962. Child of EPIC #1955. ## Summary Wires `DirectIntake`/`AgentIntake` as a live, non-simulated work-entry path. A real `ClientRequest` submitted via `POST /requests` and approved via `POST /requests/{id}/approve` now flows through the work pipeline spine so an agent actually executes it (proven under the simulation harness with a scripted provider). The pipeline spine (#1960) already maps `WorkItem -> ClientRequest -> intake_engine.process()` in its intake phase; the gap was that `approve_request` called `intake_engine.process()` directly and produced a `CREATED` task no agent ever ran. This PR introduces a thin work-entry adapter seam, hands approve at `POST /requests/{id}/approve` to a background pipeline run, and reconciles the stored request to its terminal status. ## What changed - **New `WorkEntryAdapter` protocol + `IntakeEntryAdapter` + `build_work_entry_adapter` factory** under `src/synthorg/engine/pipeline/entry/`. Source-keyed dispatch (`WorkSource.INTAKE` ships now; siblings #1963 / #1964 add their arms here). - **`POST /requests/{id}/approve` rewritten** to return **HTTP 202 Accepted** with the request walked to `APPROVED`; a background coroutine (`process_intake_pipeline`) drives the adapter -> pipeline and reconciles the stored request to `TASK_CREATED` (or `CANCELLED` on intake rejection / pipeline failure). Terminal failures log ERROR + cancel the request as a fallback so a request can never silently hang in `APPROVED`. `AgentRuntimeNotConfiguredError` (409, taxonomy 4014) when no adapter / pipeline is wired (empty company). - **New `simulations.intake_default_project` setting** (`read_only_post_init`, default `"client-intake"`); the project is created at boot if absent. One resolved value drives both the intake strategy's `project=` and the adapter's `WorkItem.project`, so the pipeline's project-existence check and the created task agree. - **New `WsEventType.REQUEST_TASK_CREATED`** + matching `WsRequestTaskCreatedPayload`; the terminal hop now has a dedicated event (the intermediate `request.approved` still fires for the sync `APPROVED` walk). - **`AppState.intake_entry_adapter` seam** (once-only set + has_ + swap, mirrors `work_pipeline`); boot wiring in `engine.pipeline.entry.boot.wire_real_intake_entry` is called from both `_install_runtime_services` and the post-setup `_rebuild_runtime_services`. - **Manifest discipline (hard EPIC criterion)**: `scripts/_ghost_wiring_manifest.txt` now has `ENFORCED` lines for `DirectIntake`, `AgentIntake`, `IntakeEntryAdapter`, `build_work_entry_adapter`. The `no-ghost-wiring` gate passes. - **Docs**: `docs/design/client-simulation.md` documents the real work-entry path + the new setting + the new factory row; `docs/design/verification-quality.md` banner updated (intake engine online); `docs/design/page-structure.md` URL routing map gains the `/clients` / `/clients/requests` / `/clients/simulations` / `/clients/reviews/:taskId` / `/clients/:clientId` routes. - **Frontend**: regenerated `web/src/api/types/openapi.gen.ts` for the 202 status, added `'request.task_created'` to `websocket.ts`, updated the MSW approve mock to return 202. ## Tests - New unit tests: `tests/unit/engine/pipeline/entry/{test_intake_adapter,test_factory,test_boot}.py`, `tests/unit/api/controllers/test_requests_intake_pipeline.py` (success / WorkIntakeRejected / WorkPipelineError / concurrent-terminal / MemoryError re-raise), `tests/unit/api/test_state.py` adapter seam, `tests/unit/client/test_runtime_builder.py` default-project threading + `intake_default_project` env override, `tests/unit/settings/test_intake_settings.py`. - New integration tests in `tests/integration/api/controllers/test_client_simulation.py`: approve returns 202 + spawns the pipeline task; approve without an adapter returns 409. - New e2e `tests/e2e/test_real_intake_entry_e2e.py`: builds the production runtime via `build_runtime_services` + `ScriptedDriver` + `DirectIntake`, drives `process_intake_pipeline`, asserts request reaches `TASK_CREATED` and the task is past `CREATED` (proof an agent ran). A second case covers a scoped request with reviewer notes. Validation run on this branch: 29468 unit tests pass (full suite), 3183 web unit tests pass; e2e + the new integration suite pass under the simulation-harness substrate (zero real LLM spend). ## Pre-PR review Reviewed by 10 core agents (python-reviewer, security-reviewer, async-concurrency-reviewer, conventions-enforcer, test-quality-reviewer, api-contract-drift, silent-failure-hunter, issue-resolution-verifier, comment-quality-rot, docs-consistency). All valid findings implemented: - Wrapped `process_intake_pipeline`'s terminal reconciliation in `_safe_finalize` so a reconcile failure can never leave a request stuck in `APPROVED`: ERROR with `request_id` + fallback `CANCELLED`, with `release_request_lock_if_idle` in `finally`. - Hardened `_ensure_project` to tolerate a `DuplicateRecordError` race (post-setup reinit) and log ERROR with project id on any other create failure. - Explicit `except asyncio.CancelledError: raise` ahead of the broad handler. - `_current_if_approved` vanished-request now logs ERROR (invariant break, not a routine warning). - `_ensure_project(project_id: NotBlankStr)` and a comment on `_spawn_intake_pipeline` documenting why a detached task (not `TaskGroup`) is correct for the 202-return contract. - Added the dedicated `request.task_created` WS event (+ `WsRequestTaskCreatedPayload`) so terminal-hop semantics are no longer multiplexed onto `request.approved`. - Doc fixes: `verification-quality.md` banner, `client-simulation.md` settings enumeration + factory table, `page-structure.md` routing map. Agent claims that the project-wide `except A, B:` (PEP 758 / Python 3.14) syntax was a "Python 3 syntax error" were rejected with primary-source evidence (ruff, mypy, the full 29468-test unit suite, and all 73 pre-push gates pass; the pattern is pre-existing in `pipeline/service.py` and `simulations.py`). --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
## Summary EPIC #1967 keystone: a human talks to ONE fixed top agent (Chief of Staff) in natural language. The agent asks clarifying questions when the request is underspecified, then parks one or more concrete `WorkItem`s in the human approval queue. On approval each item runs through the work pipeline; **no autonomous acting** (direct MCP acting is the separate child #1972). The pipeline spine (`build_work_pipeline`, #1960) and approval producer (#1956) already landed and are ENFORCED in the ghost-wiring manifest. This PR drives them from a conversation. ## What changed ### Service + API - `src/synthorg/meta/chief_of_staff/propose.py` :: `ChiefOfStaffProposer` (single-responsibility; keeps `chat.py` explain-only). Structured clarify-or-propose LLM turn via `core/json_parsing.py`. SEC-1: human content wrapped via `wrap_untrusted(TAG_TASK_DATA, ...)`. - `POST /meta/chat/propose` on `MetaController` with `ConversationalProposeRequest` args model, `meta.chat.propose` (5/60s) rate-limit, 503 dependency-matrix (proposer + work pipeline must be wired). - `src/synthorg/api/controllers/_approval_review_gate.py` :: `try_conversational_intake_resume` Flow 0 before parked-context / review-gate. On approve: reconstructs `WorkItem` from the proposal and drives `app_state.work_pipeline.run`. On reject: marks proposal REJECTED, pipeline untouched. - `build_chief_of_staff_proposer` factory + `_wire_chief_of_staff_proposer` on_startup hook + AppState set-once accessors (`chief_of_staff_proposer`, `conversational_proposal_repo`). - `ChiefOfStaffConfig` propose_* fields (opt-in, default off). ### Data model - New `ApprovalSource.CONVERSATIONAL_INTAKE` discriminator + new `ConversationRole` / `ConversationStatus` / `ConversationalProposalStatus` enums. - Frozen Pydantic models: `Conversation`, `ConversationTurn`, `ProposedWork`, `ProposeDecision` (XOR validator), `ConversationalProposal`, `ProposeArgs`, `ProposedApprovalSummary`, `ProposeResult`. ### Persistence (dual-backend) - New protocols: `ConversationRepository` (StatefulRepository), `ConversationTurnRepository` (AppendOnlyRepository), `ConversationalProposalRepository` (Stateful + FilteredQuery). Generic categories per ADR-0001; no bespoke methods. - Concrete sqlite + postgres impls, `conversational_factory.build_conversational_repositories` (backend-aware; degrades gracefully to None for absent / unknown backends so app boot stays clean). - One yoyo revision per backend (`20260519000001_conversational_intake.sql`). Postgres widens `approvals.source` CHECK via `DROP/ADD CONSTRAINT`. SQLite intentionally keeps the narrow CHECK (rebuild via writable_schema was perf-too-costly under the unit suite's `--count=2` isolation gate; the v1 ApprovalStore is in-memory by default so `CONVERSATIONAL_INTAKE` rows never reach SQLite. Documented inline.). - `schema_drift_baseline.txt` grown by 4 canonical TEXT/TIMESTAMPTZ entries for the new timestamp columns (user-approved, identical to every other dual-backend table). Committed with `ALLOW_BASELINE_GROWTH=1`. ### Errors / events - New `<Domain><Condition>Error` family in `meta/errors.py`: `ConversationNotFoundError` (404), `ConversationClosedError` (409), `ConversationalProposeUnavailableError` (503), `ConversationalProposeResponseInvalidError` (502). New `ErrorCode` entries 3017 / 4015 / 7010. - New event constants in `observability/events/{chief_of_staff,approval_gate,persistence}.py` (clarify/propose/cap/transition + dispatcher branch + persistence repo error/read events). ### Manifest discipline (hard acceptance criterion) - `scripts/_ghost_wiring_manifest.txt`: new `ENFORCED build_chief_of_staff_proposer #1968` line. Verified by `scripts/check_no_ghost_wiring.py`. ### Web / generated types - Regenerated `web/src/api/types/{openapi,dtos,enum-values,error-codes}.gen.ts` for the new endpoint and enums. - Hand-written `postChatPropose()` client + `ConversationalProposeResponse` interface in `web/src/api/endpoints/meta.ts`; MSW handler in `web/src/mocks/handlers/meta.ts` (1:1 parity rule). ### Docs - `docs/design/self-improvement.md`: `/meta/chat/propose` endpoint, `propose.py` + model trio in package inventory, three-flow approval-decision routing (Flow 0 conversational-intake), YAML defaults for `propose_*`. - `README.md`: v1 scope clarified (clarify + propose; no autonomous acting). - `CLAUDE.md`: one-liner on proposer wiring + 503 conditions. ## Test plan - **Unit (29 503 passed locally on `-m unit`)**: enums, models, errors, settings, proposer (clarify / propose / untrusted-wrap / parse-failure / clarification cap / project-resolution), wiring (`build_chief_of_staff_proposer` 503 matrix; backend-aware factory degradation), dispatch branch (approve-runs-pipeline, reject-skips-pipeline, missing-proposal-owned-but-noop, pipeline-unavailable-503, pipeline-failure-leaves-PENDING), API model validation + route + rate-limit-policy registration. - **Dual-backend conformance** (`tests/conformance/persistence/`, `pytest.mark.integration`): `Conversation`, `ConversationTurn`, `ConversationalProposal` against sqlite + postgres. Schema-drift gates green on both backends. - **Simulation-harness e2e** (`tests/e2e/test_conversational_propose_e2e.py`, `pytest.mark.e2e`): builds the REAL pipeline via `build_runtime_services`, REAL `ChiefOfStaffProposer` with a scripted clarify-then-propose provider, REAL `ApprovalStore`, REAL approval-decision seam (`signal_resume_intent`). Asserts: vague request → clarify → answer → PENDING approval in store → approve → `Task` created and advanced past CREATED by a real agent. Reject path: pipeline never runs. - All convention gates green (ghost-wiring, schema-drift, dual-backend parity, persistence-boundary, dto-types-ts-in-sync, error-codes-ts-in-sync, no-magic-numbers, no-review-origin, frozen-extra-forbid, ...). ## Review coverage 11 specialist review agents ran pre-push (code, python, security, persistence, conventions, logging, async-concurrency, api-contract-drift, issue-resolution-verifier, docs-consistency, comment-quality-rot). 18 valid findings (7 High, 10 Medium, 1 doc polish) addressed in commit 80b34f9. 10 false positives dismissed with reasons in `_audit/pre-pr-review/triage.md`. `issue-resolution-verifier` returned PASS on every acceptance criterion and the parent epic's manifest-discipline rule. Closes #1968 --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
Closes #1964 Part of EPIC #1955. ## Summary Thin high-altitude work-entry adapter: a human-stated objective enters the pipeline spine as a `WorkItem(source=OBJECTIVE)` and is decomposed into projects/tasks by the existing `MultiAgentCoordinator`, then executed solo or as a team (automatic; never a user choice). No new decomposition logic — reuses the spine + coordinator end to end. ### Wire-up - `src/synthorg/engine/pipeline/entry/objective_adapter.py` — new `ObjectiveSubmission` (frozen Pydantic, `extra="forbid"`) and `ObjectiveEntryAdapter` (thin: `__slots__ = ("_default_project", "_work_pipeline")`). - `src/synthorg/engine/pipeline/entry/factory.py` — adds the `WorkSource.OBJECTIVE` arm. Coordinates with #1963 on this shared seam; the task-board adapter rebases when it lands. - `src/synthorg/engine/pipeline/entry/protocol.py` — generalised to PEP 695 generic `class WorkEntryAdapter[T_Request](Protocol)` so the intake and objective adapters can each be typed against their own domain input. - `src/synthorg/engine/pipeline/entry/boot.py` — new `wire_real_objective_entry()` (cold boot + hot-reload via `hot_swap=True`); `_ensure_project` lifted as a shared helper. - `src/synthorg/api/state.py` — `objective_entry_adapter` slot + property + once-only `set_` + atomic `set_if_absent_` + `swap_` seams + `objective_background_tasks: set[Task]` for spawned pipeline runs. - `src/synthorg/api/controllers/objectives.py` — new `POST /objectives` returning 202 + `submission_id` (server-minted, propagated as the `WorkItem.correlation_id`); pipeline runs as a background task so the HTTP boundary never blocks on team coordination. Guarded by `require_write_access` + per-op rate-limit `objectives.submit`. - `src/synthorg/api/app.py` + `controllers/setup/agent_helpers.py` — the new wire call is appended right after the intake wire, both at cold boot and on provider-reinit. - New OBJECTIVES settings namespace + `default_project` definition (`read_only_post_init=True`, `restart_required=True`). ### Manifest discipline (parent epic hard criterion) `scripts/_ghost_wiring_manifest.txt` ships an `ENFORCED ObjectiveEntryAdapter #1964 ...` line in this same PR; `scripts/check_no_ghost_wiring.py` passes. ## Test plan - **Unit** (all green; `tests/unit/engine/pipeline/entry/` + `tests/unit/api/controllers/test_objectives.py`): - `ObjectiveSubmission`: frozen, rejects unknown fields, blank title/description/requested_by, blank acceptance-criteria elements; submission_id defaults to uuid4. - `ObjectiveEntryAdapter`: maps submission to `WorkItem(source=OBJECTIVE)`; correlation_id == submission_id; default project enforced; optional fields fall through to `WorkItem` defaults; pipeline errors propagate unchanged. - Factory dispatch: OBJECTIVE arm returns `ObjectiveEntryAdapter`; unwired sources still raise `UnknownStrategyError`. - Boot hook: no-op without `work_pipeline`; creates the objectives project when absent; skips create when it exists; hot_swap uses the swap seam (and still ensures the project exists). - Controller: 202 + minted submission_id; background task tracked on `AppState`; controller never blocks on the pipeline run; done-callback removes completed tasks from the set; payload rejects unknown fields, blank fields, and unknown enum strings. - **E2E** (`tests/e2e/test_real_objective_entry_e2e.py`, run under `-m e2e`): builds the real runtime via the production `build_runtime_services` with a deterministic `ScriptedDriver`. Two cases: 1. Default `leaf-threshold` policy — proves the path is wired and the spawned task advances past `CREATED`. 2. Forced `always-team` policy — proves the objective decomposes: asserts `verdict=SPLITTABLE`, `execution_path=TEAM`, and the root task advances past `CREATED`. ## Review coverage Reviewed via `/pre-pr-review` with the core agent roster only (thin-adapter scope): python-reviewer, security-reviewer, test-quality-reviewer, conventions-enforcer, issue-resolution-verifier. | Agent | Result | | --- | --- | | python-reviewer | APPROVE, zero findings | | security-reviewer | APPROVE, zero findings (SEC-1 wrapping, auth, rate-limit, background-task lifecycle, settings injection verified) | | conventions-enforcer | Zero violations | | issue-resolution-verifier | PASS, all 7 acceptance criteria at confidence 100 | | test-quality-reviewer | 0 critical; 5 minor refinements applied (defensive double-yield on done-callback test, invalid-enum-rejection coverage, blank-acceptance-criteria coverage, `_result` helper docstring, hot_swap project-create assertion) | All ruff / mypy strict / pre-commit / pre-push gates green; 9485 affected-area unit tests pass locally. ## Out of scope - First-class `Objective` persistence (no model, repository, or migration). Submission is ephemeral; state lives on the spawned root task. - `/objectives/{id}` GET / list / cancel surfaces. - Frontend integration. - Coordinator / decomposition-strategy / routing-policy changes (reused as-is). - Add an `idempotency_key=` query filter to `/tasks` so HTTP-only clients can resolve `submission_id` → spawned task. Tracked as a follow-up; the simulation harness drives the adapter directly so this is not blocking acceptance for #1964. --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
Closes #1963. Child of EPIC #1955. ## Summary Wires `TaskBoardEntryAdapter` as the live work-entry path for the task board UI. A human-filed task via `POST /tasks` now flows through the work pipeline spine so an agent actually executes it, proven under the simulation harness with a `ScriptedDriver`. The pipeline spine (#1960) already maps `WorkItem -> ClientRequest -> intake_engine.process()` in its intake phase; this PR adds the TASK_BOARD arm of `build_work_entry_adapter`, rewrites `POST /tasks` to hand the filing to a detached background pipeline run, and ships the matching ghost-wiring manifest line. ## Scope constraint (locked in plan) The spine's `_intake` always creates a task internally via `intake_engine.process()`; there is no spine entry that adopts a pre-existing task. Therefore the board adapter cannot pre-create a task and hand it off (would yield duplicates), and "move triggers the spine" requires new pipeline logic (forbidden by the issue's "thin adapter, no new pipeline logic" scope). The board adapter fires on `file` only; column moves remain pure `TaskEngine` status walks of the spine-created task. Documented in `docs/design/page-structure.md` and the task-board adapter docstring. ## What changed - **New `TaskBoardEntryAdapter` + `TaskBoardFiling`** under `src/synthorg/engine/pipeline/entry/task_board_adapter.py`. Frozen Pydantic input with `extra="forbid"` and an auto-generated UUID4 correlation id; the adapter maps a filing onto a `WorkItem(source=TASK_BOARD)` and drives `WorkPipeline.run`. - **`build_work_entry_adapter` extended** with the TASK_BOARD arm + `@overload` so boot helpers get the concrete return type per source literal. `default_project` stays required on the signature (intake consumes it; TASK_BOARD ignores it). - **`POST /tasks` rewritten** to return HTTP 202 + `TaskBoardSubmissionResponse` (`correlation_id`, `title`, `project`, `status: "submitted"`). A detached background coroutine (`process_task_board_pipeline`) drives the adapter; pipeline failures are logged at WARNING (intake rejection) or ERROR (any other), `MemoryError` / `RecursionError` re-raise. `AgentRuntimeNotConfiguredError` (409, code 4014) when no adapter is wired (empty company / no provider). - **`AppState.task_board_entry_adapter` seam** mirrors `intake_entry_adapter`: ctor kwarg, slot, property, `has_*`, `set_*` / `set_*_if_absent` / `swap_*`. Boot wiring in `engine.pipeline.entry.boot.wire_real_task_board_entry` is called from both `_install_runtime_services` and the post-setup `_rebuild_runtime_services` (skip project bootstrap; board filings carry their own project). - **Three new event constants**: `API_TASK_BOARD_SUBMITTED`, `API_TASK_BOARD_REJECTED_NO_ADAPTER`, `API_TASK_BOARD_PIPELINE_FAILED`. - **Manifest discipline** (hard EPIC criterion): `scripts/_ghost_wiring_manifest.txt` has an `ENFORCED` line for `TaskBoardEntryAdapter`. The `no-ghost-wiring` gate passes. - **Docs**: `docs/design/client-simulation.md` documents the new TASK_BOARD work-entry path next to the existing INTAKE one; `docs/design/page-structure.md` notes the new 202 contract on `POST /tasks` and explicitly documents the move-is-display-only semantics. - **Frontend**: regenerated `web/src/api/types/openapi.gen.ts` for the 202 + new schema, updated `tasks.ts` endpoint to return `TaskBoardSubmissionResponse`, Zustand store's `createTask` returns the submission envelope (no optimistic insert; the existing `task.created` WS handler inserts the real task once intake creates it), MSW handler returns 202, dialog + stories typed to the new shape. ## Tests - New unit tests: `tests/unit/engine/pipeline/entry/test_task_board_adapter.py` (filing validation; mapping; propagation of `WorkIntakeRejectedError`, `MemoryError`, `RecursionError`, generic `Exception`). - Extended `tests/unit/engine/pipeline/entry/{test_factory,test_boot}.py`: TASK_BOARD arm via parametrize; boot helper no-ops when pipeline / simulation runtime absent; hot-swap; no project bootstrap. - Rewritten `tests/unit/api/controllers/test_tasks.py`: 202 + submission envelope, UUID4 correlation id, 409 when no adapter (`test_create_task_raises_agent_runtime_not_configured_without_adapter`). - New `tests/unit/api/test_state.py` cases covering the new seam. - Reworked `tests/unit/api/controllers/test_coordination.py` to pre-create tasks via direct persistence insertion instead of `POST /tasks` (cross-loop crash with the now-mocked pipeline). - `tests/integration/api/test_task_create_empty_company.py` updated: no provider -> 409 (unchanged); provider present -> 202 + envelope. - New e2e `tests/e2e/test_task_board_entry_e2e.py`: production runtime via `build_runtime_services` + `ScriptedDriver`. Positive path proves the task reaches a post-execution status; an unknown-project path proves the background coroutine swallows the pipeline failure; a MemoryError stand-in proves the re-raise contract holds end-to-end. - New web tests: `TaskCreateDialog.test.tsx` updated for the submission envelope; store test asserts no optimistic insert (WS handler owns the inserted task) and "submitted" toast wording. Validation on this branch: 29 510 unit tests pass; 3 new e2e tests pass; 2 integration tests pass; targeted web suite (`tasks.test.ts`, `TaskCreateDialog.test.tsx`, `TaskBoardPage.test.tsx`, `useTaskBoardData.test.ts`) green; ruff, ruff-format, mypy strict, all pre-push convention gates green. ## Pre-PR review Reviewed by 5 core agents (python-reviewer, security-reviewer, test-quality-reviewer, conventions-enforcer, issue-resolution-verifier). 6 findings, all implemented in the test-tightening commit: - Tightened `TaskBoardSubmissionResponse.status` to `Literal["submitted"]`. - Added adapter-level error-propagation tests (`MemoryError`, `RecursionError`, generic `Exception`). - Tightened the 202 happy-path test to validate UUID4 correlation id. - Added two negative-path e2e tests (unknown project; MemoryError re-raise). - Parametrized the factory's positive test for the two wired sources. - Renamed the 409 test for semantic clarity. Verdicts after fixes: zero CRITICAL / HIGH from any agent; issue-resolution-verifier PASS with both acceptance criteria RESOLVED; conventions-enforcer zero violations; security-reviewer confirmed SEC-1 prompt safety is handled downstream by `policy/llm_judged.py` and the adapter does not need to wrap. Co-authored-by: SynthOrg <synthorg-bot@synthorg.local> --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
… + per-project push queue (#2021) Closes #1974. Foundational substrate for EPIC #1973 (autonomous product studio). Each project gets a 1:1 persistent git-backed working tree that survives across agents, tasks, and sessions; two agents merging concurrently are serialised through a per-project push queue; the git backend is config-only swappable across `embedded` / `local_path` / `external_remote` (with GitHub / GitLab / Gitea / Forgejo connection identities). ## Locked acceptance criteria 1. **1:1 project to persistent git-backed workspace surviving across sessions.** `ProjectWorkspaceService.get_or_provision(project_id)` materialises `<base>/projects/<project_id>/` once; the `ProjectWorkspace` row plus the on-disk working tree are the durable contract. A backend-kind change in `GitBackendConfig.kind` re-provisions under the new backend (keeping the workspace path) with a `WORKSPACE_BACKEND_KIND_CHANGED` warning event. Conformance-tested under SQLite + Postgres. 2. **Two agents merge concurrently without branch collision via a per-project serial push queue.** `WorkspaceIsolationService.merge_workspace_with_push` routes through a lazily-built `PushQueueCoordinator` per project. FIFO; sentinel-based stop; `WorkspacePushError` distinguishes a forge-rejection from a local `WorkspaceMergeError` (conflicted merge does not push). The queue sits in front of the existing `WorkspaceIsolationStrategy` so a future virtual-branch strategy slots in without architectural change. 3. **Backend is config-only swappable.** `GitBackendType` is `embedded` / `external_remote` / `local_path`; `build_git_backend(config, deps)` resolves a `GitBackend` via `StrategyRegistry`. `external_remote` resolves a saved connection from the catalogue (GitHub, GitLab, Gitea, Forgejo) and injects the token as a URL-userinfo. Deep OAuth refresh + rate-limit hardening is tracked in the follow-up issue. 4. **Validated under the simulation harness.** Acceptance tests in `tests/integration/engine/workspace/test_project_workspace_acceptance.py` cover persistence across "session restarts", concurrent merge-without-collision, config-only backend swap, and cross-project sandbox isolation. ## What ships - `ProjectWorkspace` model + `ProjectWorkspaceRepository` (sqlite + postgres) + paired `20260519000001_project_workspaces` migrations + `schema.sql` parity. FK to `projects(id) ON DELETE CASCADE`. - `engine/workspace/git_backend/`: `GitBackend` protocol, `GitBackendConfig` / `GitBackendDeps`, `EmbeddedGitBackend` (default) / `LocalPathGitBackend` / `ExternalRemoteGitBackend`, `StrategyRegistry`-driven factory with fail-fast missing-deps validation. - `engine/workspace/project_workspace_service.py` with per-project async lock, config-authoritative backend-kind precedence, path-traversal defence. - `engine/workspace/push_queue.py` (`PushQueueCoordinator`) integrated into `WorkspaceIsolationService` (per-project dict, guard lock, double-checked locking). - Connection-catalogue expansion: `GITLAB`, `GITEA`, `FORGEJO` connection types alongside `GITHUB`. Gitea + Forgejo share a `_GiteaFamilyAuthenticator` base today but stay distinct connection identities so a future API divergence does not invalidate saved connections. - Boot wiring in `_install_runtime_services` (gated by `app_state.has_persistence`); `AgentEngineExecutionService.execute_once` now `get_or_provision`s the workspace per project. New `AppState.project_workspace_service` once-only seam. - Ghost-wiring manifest: 6 `ENFORCED` entries for the new runtime symbols (`ProjectWorkspaceService`, `build_git_backend`, `EmbeddedGitBackend`, `LocalPathGitBackend`, `ExternalRemoteGitBackend`, `PushQueueCoordinator`). - Error taxonomy: `ProjectWorkspaceNotProvisionedError` (4015, CONFLICT), `GitBackend*Error` family, `WorkspacePushError`. - Docs: `docs/design/coordination.md` (persistent workspace + push-queue section), `docs/design/integrations.md` (connection types), `docs/reference/errors.md` (4014 / 4015), `docs/reference/pluggable-subsystems.md` (git backend strategy). ## Pre-PR review (this push) Ran the core agent set on the substrate; applied every valid finding before opening the PR. Highlights: - **SEC-1 redaction**: `_redact_args` strips embedded URL credentials at every `run_git_subprocess` log site (no token-in-URL ever lands in structured logs). - **`GIT_*` env sanitisation in `run_git_subprocess`**: when this code runs from inside a git pre-push hook (or any caller whose own cwd is a git working tree), the hook process exports `GIT_DIR` / `GIT_WORK_TREE` and any child git subprocess inherits them, making `is_git_repo()` report True for a freshly-mkdir'd tmp dir because git resolves against the parent repo. The fix strips `GIT_*` env vars in `run_git_subprocess` so backend strategies resolve purely via the per-call `cwd`. (Without this, the embedded backend's `git commit --allow-empty` walked up to the host's `.git` during the pre-push run; the rebased branch you're looking at has those stray commits removed and the gate now reproduces clean.) - **Defensive nested-worktree guard** on `EmbeddedGitBackend` and `LocalPathGitBackend` (compares resolved `--show-toplevel` against the requested path; only refuses when toplevel is a strict parent). - Path-traversal defence in `ProjectWorkspaceService._workspace_path` rejects `/`, `\\`, `..`. - Double-checked locking TOCTOU closed in `ProjectWorkspaceService._lock_for` and `WorkspaceIsolationService._get_or_create_queue` (dict reads moved inside the guard lock). - Push-queue logging surfaces `WORKSPACE_PUSH_QUEUE_ENQUEUED` / `_MERGED` / `_FAILED` / `_WORKER_FAILED`. ## Out of scope / follow-up Filed as #2020 (referenced, not silently deferred): - External-remote OAuth refresh + rate-limit hardening (protocol + thin glue ship now; backend-switch-is-config-only already holds). - Per-execution Docker per-project mount migration into the sandbox lifecycle owner. - `dispatch` callers of `merge_workspace` migrated onto `merge_workspace_with_push` (currently only the direct execution path uses the queue). - `PlannerWorktreeStrategy` per-project `repo_root` resolution at execute-time. - `tests/unit/api/conftest.py::test_client` baseline reset for the known `project_shared_app_once_only_install_leak`. ## Verification - `uv run python -m pytest tests/ -m unit` clean. - Targeted dual-backend conformance: `tests/conformance/persistence/test_project_workspace_repository.py` (18 tests, SQLite + Postgres). - Acceptance tests: `tests/integration/engine/workspace/test_project_workspace_acceptance.py`. - Pre-push gates: ruff, mypy strict, schema-drift (both backends), no-ghost-wiring, no-magic-numbers, isolation `--count=2` gate. --------- Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes the core scoring and data infrastructure for the 'golden-company' benchmark (eval spine). It provides the foundational models, loaders, and grading logic required to evaluate synthetic organization performance, while intentionally deferring the runner implementation until runtime prerequisites are met. The changes include strict schema validation, robust error handling, and comprehensive unit testing to ensure the benchmark's reliability and auditability. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository UI (base), Organization UI (inherited) Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughAdds an eval spine implementing frozen Pydantic briefs and scorecard models, YAML loaders (briefs and anchors) with path-safety and validation, executable and judged graders (Spearman-based calibration), penalty table and aggregation logic, JSON and Markdown scorecard emitters, an EvalError hierarchy, observability event constants, pyproject tooling/test discovery updates, and a comprehensive suite of unit and integration tests. 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Merging this PR will not alter performance
Comparing Footnotes
|
There was a problem hiding this comment.
Actionable comments posted: 13
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/loader/briefs.py`:
- Around line 43-47: The globbed entries in files are currently trusted by
suffix only, which allows symlinks or non-file entries to escape briefs_dir or
cause I/O errors; update the logic around files / the for path in files loop to
skip any path that is not a regular file and to ensure path.resolve() is
contained within briefs_dir.resolve() (e.g., check path.is_file() and that
path.resolve() is under briefs_dir.resolve() via relative_to in a try/except)
before calling path.read_text() and yaml.safe_load; apply these checks where
files is built or at the start of the loop that constructs briefs (referencing
variables briefs_dir, files, path, yaml.safe_load, and Brief).
In `@evals/models/brief.py`:
- Line 111: Replace the inline numeric literal in the model schema by
introducing a named constant in the settings.definitions module (e.g.,
HIDDEN_CHECK_DEFAULT_TIMEOUT: int = 30) and use that constant in the Field
default for timeout_seconds in the Brief model (timeout_seconds: int =
Field(default=HIDDEN_CHECK_DEFAULT_TIMEOUT, gt=0)); add an explicit module-level
type annotation for the new constant in settings.definitions and update any
imports so evals.models.brief references the constant instead of the literal.
In `@evals/models/scorecard.py`:
- Around line 123-153: BriefResult currently allows inconsistent states between
kind and judge_calibration; add an after model_validator (e.g. name it
_validate_kind_and_calibration) on the BriefResult model that checks
BriefResult.kind (use the BriefKind enum) and enforces that when kind ==
BriefKind.JUDGED then judge_calibration is not None, and when kind !=
BriefKind.JUDGED then judge_calibration is None; on violation raise a ValueError
with a clear message referencing brief_id, kind and judge_calibration to aid
debugging.
- Around line 164-166: AggregatedProcessFacts currently allows negative counts
in events_by_class; add a Pydantic validator on the AggregatedProcessFacts model
(e.g., `@validator`("events_by_class") or a `@root_validator`) that iterates the
dict values and raises a ValueError if any count is negative, and ensure
total_events remains Field(..., ge=0) as a non-negative guard; reference the
AggregatedProcessFacts.events_by_class field and the model's validation methods
when implementing.
In `@evals/scoring/aggregate.py`:
- Line 32: The field event_constant in the Pydantic model should be typed as
NotBlankStr rather than plain str; update the annotation for event_constant to
NotBlankStr, add the appropriate import for NotBlankStr from the project
validation utilities (or pydantic helpers) at the top of
evals/scoring/aggregate.py, and run typechecks to ensure the model now enforces
non-blank identifier constraints.
- Around line 122-128: In the loop in evals/scoring/aggregate.py that iterates
over events_by_class.items(), validate the `count` before using it: if `count <
0` raise a ValueError with a clear message identifying the offending
`event_constant` and `count`; then proceed to call `penalty_table.is_tracked`,
`penalty_table.points_for(event_constant)`, compute `raw = per_event * count`,
clamp with `penalty_table.cap_per_class`, and add to `total_deduction` as
before. Ensure the check occurs before any arithmetic so negative counts cannot
produce negative deductions.
- Around line 72-75: Replace the `@property` on the is_clean method with
Pydantic's `@computed_field`: import computed_field from pydantic and annotate
is_clean with `@computed_field`, keeping the implementation returning
self.deduction == 0 so the derived value complies with the frozen Pydantic model
contract; ensure the decorator is applied to the is_clean method in the same
class where deduction is defined.
In `@evals/scoring/executable.py`:
- Around line 151-157: Replace the inline event name string literals used in the
logger.warning calls with centralized constants: define EVAL_EXECUTABLE_TIMEOUT
and EVAL_EXECUTABLE_TOOL_MISSING in the observability.events.evals module (or
the existing events module) with values "evals.executable.timeout" and
"evals.executable.tool_missing" respectively, then import those constants at the
top of this module and use them in the logger.warning calls (the ones that
currently pass "evals.executable.timeout" and "evals.executable.tool_missing");
update both occurrences where logger.warning is invoked (the calls that pass
cmd=spec.cmd, timeout=spec.timeout_seconds, error_type=type(exc).__name__,
error=safe_error_description(exc) and the similar tool-missing call) to
reference the imported constants instead of string literals.
- Around line 102-105: Replace the `@property` on the is_clean method with
Pydantic's `@computed_field` decorator so the derived value is serialized; import
computed_field from pydantic if missing and keep the same signature (def
is_clean(self) -> bool) and return expression (self.hidden_pass and
self.build_pass and self.lint_pass) so the field is treated as a computed model
field rather than a plain Python property.
- Line 67: Change the Pydantic field annotation for the identifier "label" from
str to NotBlankStr (so it enforces non-empty identifiers like LABEL_HIDDEN,
LABEL_BUILD, LABEL_LINT) and add an import for NotBlankStr from the project's
shared types module (where other models use NotBlankStr); update the model
definition that contains the "label" field in this file to use NotBlankStr
instead of str.
- Around line 131-143: The completed outcome handler currently hardcodes
duration_seconds=0.0 in _outcome_from_completed while timeouts set
duration_seconds=float(spec.timeout_seconds); fix by capturing real elapsed time
around the subprocess invocation and threading that value into the CheckOutcome:
modify the code that calls subprocess.run to record start/end (e.g.,
time.perf_counter()) and compute elapsed_seconds, change _outcome_from_completed
to accept a duration_seconds: float parameter (or overload it) and use that
value when constructing CheckOutcome, and ensure the timeout branch still sets
duration_seconds=float(spec.timeout_seconds) so all outcomes have meaningful
durations.
In `@tests/evals_spine/test_judged_calibration.py`:
- Line 37: Change the predicate parameter's type from object to a typed Callable
to reflect its true signature: add "from typing import Callable" at the top of
the test file, update the function signature(s) that accept the predicate (the
pytest.parametrize entry ("xs", "ys", "predicate") and the test function(s)
using that parameter) to use predicate: Callable[[float], bool], and remove the
now-unnecessary "# type: ignore[operator]" comments; apply the same change to
the other occurrence around lines 80-83 where predicate is used.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: eed15755-dcc6-4950-9350-66b34d428804
📒 Files selected for processing (27)
CLAUDE.mdevals/__init__.pyevals/emit/__init__.pyevals/emit/json_writer.pyevals/emit/markdown_writer.pyevals/errors.pyevals/loader/__init__.pyevals/loader/anchors.pyevals/loader/briefs.pyevals/models/__init__.pyevals/models/brief.pyevals/models/scorecard.pyevals/scoring/__init__.pyevals/scoring/aggregate.pyevals/scoring/executable.pyevals/scoring/judged.pyevals/scoring/penalties.pyevals/scoring/spearman.pypyproject.tomltests/evals_spine/__init__.pytests/evals_spine/conftest.pytests/evals_spine/test_brief_loader.pytests/evals_spine/test_executable_grader.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_penalties.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_scorecard_emit.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build Backend
- GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
- GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
- GitHub Check: Dashboard Test
- GitHub Check: Test Integration
- GitHub Check: Test E2E
- GitHub Check: Test Unit
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Read
docs/design/page before implementing; deviations need approval. See DESIGN_SPEC.md.Present every plan for accept/deny before coding.
Ship enforcement gate with every convention PR.
Configuration Precedence: DB > env > code default via
SettingsService/ConfigResolver(Cat-1) or env > code default (Cat-2,read_only_post_init); Cat-3 bootstrap secrets are pure env at the boot site. Noos.environ.getoutside startup.Numerics live in
settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal). Enforced byscripts/check_no_magic_numbers.py.No
from __future__ import annotations(3.14 has PEP 649). PEP 758 except:except A, B:must use parens unless binding.Type hints on public functions; mypy strict. Google-style docstrings. Line length 88; functions <50 lines; files <800 lines.
Errors:
<Domain><Condition>ErrorfromDomainError; never inheritException/RuntimeError/etc directly. Enforced bycheck_domain_error_hierarchy.py.Pydantic v2 frozen +
extra="forbid"on every frozen model (gatecheck_frozen_model_extra_forbid.py;@computed_fieldauto-exempt, per-line# lint-allow: frozen-extra-forbid -- <reason>forextra="allow"/"ignore"); use@computed_fieldfor derived fields.Args models at every system boundary; use
parse_typed()for every external dict ingestion. Enforced bycheck_boundary_typed.py.Immutability: use
model_copy(update=...)orcopy.deepcopy(); deepcopy at system boundaries.Async: use
asyncio.TaskGroupfor fan-out/fan-in; helpers catchException(re-raiseMemoryError/RecursionError).Clock seam:
clock: Clock | None = None; tests injectFakeClock. Lifecycle: services own_lifecycle_lock; timed-out stops mark unrestartable.Untrusted content (SEC-1): use
wrap_untrusted()fromengine.prompt_safety; useHTMLParseGuardfor HTML.Logging: import `from sy...
Files:
evals/__init__.pyevals/scoring/__init__.pyevals/emit/json_writer.pyevals/loader/__init__.pyevals/errors.pytests/evals_spine/conftest.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_brief_loader.pyevals/loader/briefs.pyevals/models/scorecard.pyevals/scoring/spearman.pytests/evals_spine/test_executable_grader.pyevals/emit/markdown_writer.pyevals/scoring/aggregate.pyevals/models/__init__.pyevals/emit/__init__.pyevals/loader/anchors.pyevals/scoring/judged.pyevals/scoring/penalties.pytests/evals_spine/test_penalties.pytests/evals_spine/test_judged_calibration.pyevals/models/brief.pytests/evals_spine/test_scorecard_emit.pyevals/scoring/executable.py
**/*.{md,rst}
📄 CodeRabbit inference engine (CLAUDE.md)
Numerics in README and public docs sourced from
data/runtime_stats.yamlvia<!--RS:NAME-->markers.
Files:
CLAUDE.md
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use markers:
@pytest.mark.{unit,integration,e2e,slow}. Async tests useauto. Global timeout 30s. Coverage minimum 80%.xdist: use
-n 8 --dist=loadfile(prevents 3.14+ Windows ProactorEventLoop leak). Windows: useWindowsSelectorEventLoopPolicyfor unit tests; subprocess tests override back.Test doubles:
FakeClockfor Clock seam,mock_of[T](**overrides)for typed-boundary substitutions,SimpleNamespacefor attribute-bags. BareMagicMockat typed boundary is blocked bycheck_mock_spec.py.FakeClock and
mock_ofimport fromtests._shared; inject viaclock=and the helper's spec subscript.Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add
@example(...)). Never skip/xfail flaky tests; fix fundamentally.
Files:
tests/evals_spine/conftest.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_brief_loader.pytests/evals_spine/test_executable_grader.pytests/evals_spine/test_penalties.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_scorecard_emit.py
⚙️ CodeRabbit configuration file
Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare
@settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which@given() honors automatically.
Files:
tests/evals_spine/conftest.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_brief_loader.pytests/evals_spine/test_executable_grader.pytests/evals_spine/test_penalties.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_scorecard_emit.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: No region/currency/locale privileged; use metric units; British English.
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: Timeout/slow test failures are source-code regressions; never edit `tests/baselines/unit_timing.json` or scripts baseline files. Bypass with `ALLOW_BASELINE_GROWTH=1 git commit ...` (requires user approval).
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: After issue: branch + commit + push (no auto-PR); use `/pre-pr-review`. After PR: `/aurelio-review-pr` for external feedback. Fix EVERYTHING valid; no deferring.
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: Commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced.
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`).
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: Branches: `<type>/<slug>` from main.
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: Squash merge. PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body.
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: After every squash merge → run `/post-merge-cleanup`.
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API.
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T16:02:13.510Z
Learning: API startup lifecycle: two phases - construction (synchronous service wiring) and on_startup (connected persistence wiring). Construction ordering: `agent_registry` BEFORE `auto_wire_meetings`; `tunnel_provider` unconditional. On-startup ordering: `SettingsService` BEFORE `WorkflowExecutionObserver`; `OntologyService` AFTER `persistence.connect()`.
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.
Applied to files:
evals/__init__.pyevals/scoring/__init__.pyevals/emit/json_writer.pyevals/loader/__init__.pyevals/errors.pytests/evals_spine/conftest.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_brief_loader.pyevals/loader/briefs.pyevals/models/scorecard.pyevals/scoring/spearman.pytests/evals_spine/test_executable_grader.pyevals/emit/markdown_writer.pyevals/scoring/aggregate.pyevals/models/__init__.pyevals/emit/__init__.pyevals/loader/anchors.pyevals/scoring/judged.pyevals/scoring/penalties.pytests/evals_spine/test_penalties.pytests/evals_spine/test_judged_calibration.pyevals/models/brief.pytests/evals_spine/test_scorecard_emit.pyevals/scoring/executable.py
📚 Learning: 2026-05-16T18:36:31.446Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.
Applied to files:
CLAUDE.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).
Applied to files:
CLAUDE.md
🔇 Additional comments (21)
tests/evals_spine/test_scorecard_emit.py (1)
1-255: LGTM!tests/evals_spine/test_penalties.py (1)
1-182: LGTM!evals/scoring/spearman.py (1)
1-97: LGTM!evals/scoring/judged.py (1)
1-298: LGTM!tests/evals_spine/conftest.py (1)
1-59: LGTM!tests/evals_spine/test_brief_loader.py (1)
1-144: LGTM!tests/evals_spine/test_runner_broken_scores_worse.py (1)
1-87: LGTM!evals/scoring/__init__.py (1)
1-13: LGTM!evals/emit/__init__.py (1)
1-1: LGTM!evals/emit/json_writer.py (1)
1-34: LGTM!evals/emit/markdown_writer.py (1)
1-141: LGTM!evals/__init__.py (1)
1-14: LGTM!CLAUDE.md (1)
3-3: LGTM!pyproject.toml (1)
190-190: LGTM!Also applies to: 283-283, 363-363, 378-378
evals/loader/__init__.py (1)
1-7: LGTM!evals/loader/anchors.py (1)
39-70: LGTM!Also applies to: 72-123, 125-132
evals/loader/briefs.py (1)
56-70: LGTM!evals/errors.py (1)
22-116: LGTM!evals/models/__init__.py (1)
1-6: LGTM!tests/evals_spine/test_executable_grader.py (1)
1-231: LGTM!evals/scoring/executable.py (1)
172-179: Shell metacharacter validation inHiddenCheckSpecis correctly implemented.The validation exists as a model validator at line 114 of
evals/models/brief.py(_leading_token_has_no_shell_metachars), checking the leading token against the_SHELL_METACHARSconstant defined at line 27. The docstring (lines 97–105) explicitly documents the defense-in-depth approach: validation at load time catches typos in brief YAML before runtime, even thoughshell=Falsemitigates the risk. The noqa comment on line 172 is justified.
| @property | ||
| def is_clean(self) -> bool: | ||
| """Whether every declared check class passed.""" | ||
| return self.hidden_pass and self.build_pass and self.lint_pass |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Use @computed_field instead of @property for derived fields.
The is_clean property is a derived field and should use Pydantic's @computed_field decorator rather than @property. This ensures the field is serialized with the model and aligns with the project's convention for computed fields. As per coding guidelines, derived fields should use @computed_field.
♻️ Proposed fix
-from pydantic import BaseModel, ConfigDict, Field, model_validator
+from pydantic import BaseModel, ConfigDict, Field, computed_field, model_validator
class ExecutableGrade(BaseModel):
"""Aggregate grade across hidden + build + lint check classes.
Invariant: each ``*_pass`` boolean reflects whether every outcome
in that label-bucket exited cleanly (``exit_code == 0`` and not
``timed_out``). The validator below enforces it.
"""
model_config = ConfigDict(frozen=True, extra="forbid", allow_inf_nan=False)
score: int = Field(ge=0, le=EXEC_TOTAL)
hidden_pass: bool
build_pass: bool
lint_pass: bool
outcomes: tuple[CheckOutcome, ...]
- `@property`
+ `@computed_field` # type: ignore[misc]
+ `@property`
def is_clean(self) -> bool:
"""Whether every declared check class passed."""
return self.hidden_pass and self.build_pass and self.lint_pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@evals/scoring/executable.py` around lines 102 - 105, Replace the `@property` on
the is_clean method with Pydantic's `@computed_field` decorator so the derived
value is serialized; import computed_field from pydantic if missing and keep the
same signature (def is_clean(self) -> bool) and return expression
(self.hidden_pass and self.build_pass and self.lint_pass) so the field is
treated as a computed model field rather than a plain Python property.
There was a problem hiding this comment.
Code Review
This pull request introduces the golden-company benchmark framework, a comprehensive evaluation suite for autonomous agents featuring executable checks and calibrated LLM judging. The implementation includes Pydantic models for briefs and scorecards, YAML loaders, and scoring logic that incorporates process-fact penalties. Review feedback highlighted the need to measure actual execution duration in the executable grader and to handle PermissionError during subprocess calls. Security improvements were also suggested regarding credential redaction in logs and command outputs, along with a correction to ensure the atomicity of JSON file writes as described in the documentation.
| def _outcome_from_completed( | ||
| spec: HiddenCheckSpec, | ||
| label: str, | ||
| completed: subprocess.CompletedProcess[bytes], | ||
| ) -> CheckOutcome: | ||
| return CheckOutcome( | ||
| label=label, | ||
| cmd=spec.cmd, | ||
| exit_code=completed.returncode, | ||
| duration_seconds=0.0, | ||
| stdout_tail=_tail(completed.stdout), | ||
| stderr_tail=_tail(completed.stderr), | ||
| ) |
There was a problem hiding this comment.
Update _outcome_from_completed to accept a duration parameter. The current implementation hardcodes duration_seconds=0.0. Additionally, ensure that the cmd field in CheckOutcome is redacted if it contains URLs with credentials to prevent sensitive data leakage in tool execution results, as required by repository security rules.
| def _outcome_from_completed( | |
| spec: HiddenCheckSpec, | |
| label: str, | |
| completed: subprocess.CompletedProcess[bytes], | |
| ) -> CheckOutcome: | |
| return CheckOutcome( | |
| label=label, | |
| cmd=spec.cmd, | |
| exit_code=completed.returncode, | |
| duration_seconds=0.0, | |
| stdout_tail=_tail(completed.stdout), | |
| stderr_tail=_tail(completed.stderr), | |
| ) | |
| def _outcome_from_completed( | |
| spec: HiddenCheckSpec, | |
| label: str, | |
| completed: subprocess.CompletedProcess[bytes], | |
| duration: float, | |
| ) -> CheckOutcome: | |
| return CheckOutcome( | |
| label=label, | |
| cmd=spec.cmd, | |
| exit_code=completed.returncode, | |
| duration_seconds=duration, | |
| stdout_tail=_tail(completed.stdout), | |
| stderr_tail=_tail(completed.stderr), | |
| ) |
References
- Ensure that URLs containing potential credentials are redacted in all output channels, including tool execution results (e.g., ToolExecutionResult content).
| def _run_check(spec: HiddenCheckSpec, label: str, work_dir: Path) -> CheckOutcome: | ||
| """Run one check; never raises on non-zero exit; raises on tool-missing.""" | ||
| try: | ||
| completed = subprocess.run( # noqa: S603 -- args validated, no shell | ||
| list(spec.cmd), | ||
| cwd=work_dir, | ||
| timeout=spec.timeout_seconds, | ||
| capture_output=True, | ||
| check=False, | ||
| shell=False, | ||
| ) | ||
| except subprocess.TimeoutExpired as exc: | ||
| return _outcome_from_timeout(spec, label, exc) | ||
| except (FileNotFoundError, NotADirectoryError) as exc: | ||
| logger.error( | ||
| "evals.executable.tool_missing", | ||
| cmd=spec.cmd, | ||
| error_type=type(exc).__name__, | ||
| error=safe_error_description(exc), | ||
| ) | ||
| msg = f"Required eval tool not found: {spec.cmd[0]!r}" | ||
| raise EvalToolMissingError(msg) from exc | ||
| return _outcome_from_completed(spec, label, completed) |
There was a problem hiding this comment.
Measure the actual execution time using time.perf_counter() and handle PermissionError. subprocess.run raises PermissionError if the file exists but is not executable, which should be caught and reported as an EvalToolMissingError. When logging the error, ensure that spec.cmd is redacted if it contains URLs with credentials to prevent sensitive data leakage to logs.
def _run_check(spec: HiddenCheckSpec, label: str, work_dir: Path) -> CheckOutcome:
"""Run one check; never raises on non-zero exit; raises on tool-missing."""
start = time.perf_counter()
try:
completed = subprocess.run( # noqa: S603 -- args validated, no shell
list(spec.cmd),
cwd=work_dir,
timeout=spec.timeout_seconds,
capture_output=True,
check=False,
shell=False,
)
duration = time.perf_counter() - start
return _outcome_from_completed(spec, label, completed, duration)
except subprocess.TimeoutExpired as exc:
return _outcome_from_timeout(spec, label, exc)
except (FileNotFoundError, NotADirectoryError, PermissionError) as exc:
logger.error(
"evals.executable.tool_missing",
cmd=spec.cmd,
error_type=type(exc).__name__,
error=safe_error_description(exc),
)
msg = f"Required eval tool not found or not executable: {spec.cmd[0]!r}"
raise EvalToolMissingError(msg) from excReferences
- Ensure that URLs containing potential credentials are redacted in all output channels, including log messages.
| import subprocess | ||
| from typing import TYPE_CHECKING, Final, Self |
There was a problem hiding this comment.
| def write_scorecard_json(scorecard: Scorecard, out_dir: Path) -> Path: | ||
| """Write *scorecard* as JSON into *out_dir*; return the file path.""" | ||
| out_dir.mkdir(parents=True, exist_ok=True) | ||
| target = out_dir / SCORECARD_JSON_FILENAME | ||
| payload = scorecard.model_dump_json(indent=JSON_INDENT) | ||
| target.write_text(payload + "\n", encoding="utf-8") | ||
| return target |
There was a problem hiding this comment.
The docstring claims that write_scorecard_json overwrites the file atomically via write_text, but pathlib.Path.write_text is not atomic (it truncates the file upon opening). To ensure atomicity and prevent corruption if the process crashes during writing, use a temporary file and os.replace.
def write_scorecard_json(scorecard: Scorecard, out_dir: Path) -> Path:
"""Write *scorecard* as JSON into *out_dir*; return the file path."""
import os
import tempfile
out_dir.mkdir(parents=True, exist_ok=True)
target = out_dir / SCORECARD_JSON_FILENAME
payload = scorecard.model_dump_json(indent=JSON_INDENT) + "\n"
with tempfile.NamedTemporaryFile(
mode="w", dir=out_dir, encoding="utf-8", delete=False
) as tf:
tf.write(payload)
temp_name = tf.name
try:
os.replace(temp_name, target)
except Exception:
try:
os.unlink(temp_name)
except OSError:
pass
raise
return target
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2025 +/- ##
==========================================
- Coverage 85.09% 85.00% -0.10%
==========================================
Files 1913 1948 +35
Lines 113388 115799 +2411
Branches 9673 9839 +166
==========================================
+ Hits 96485 98430 +1945
- Misses 14533 14943 +410
- Partials 2370 2426 +56 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Reviewer feedback on PR #2025 from the initial CodeRabbit + Gemini pass. All findings verified against the current code; everything valid is fixed in this round (no deferrals). evals/loader/briefs.py refuses non-file glob entries and resolved paths outside briefs_dir (new BriefSuitePathTraversalError); blocks symlink escape and directory-with-yaml-suffix. evals/models/brief.py introduces DEFAULT_HIDDEN_CHECK_TIMEOUT_SECONDS as a module-level Final constant in place of the inline default. evals/models/scorecard.py adds a BriefResult validator enforcing kind=JUDGED iff judge_calibration is set, and AggregatedProcessFacts now rejects negative per-class counts. evals/scoring/aggregate.py types PenaltyEntry.event_constant as NotBlankStr and aggregate_brief_score refuses negative per-class counts before applying penalties. evals/scoring/executable.py types CheckOutcome.label as NotBlankStr, _outcome_from_completed records measured time.perf_counter() duration, _run_check catches PermissionError as tool-missing, and event names come from observability.events.evals constants instead of string literals. evals/scoring/penalties.py rejects negative points_per_event values and floor above GRADE_CEILING. evals/emit/json_writer.py uses tempfile + Path.replace + fsync for atomic write; the previous Path.write_text was advertised atomic but is not. tests/evals_spine/test_judged_calibration.py types predicate as Callable[[float], bool], type-ignore removed. is_clean stays @Property (NOT @computed_field) on ProcessFactReport, AggregatedProcessFacts, AggregationResult, ExecutableGrade: @computed_field serialises the value into JSON, but extra=forbid then rejects it on model_validate_json round-trip; matches existing Scorecard.total / max_total / is_passing pattern. Added positive tests for every new validator (path traversal, negative counts, kind/calibration consistency, PenaltyTable bounds).
Round 1 push #2025 added src/synthorg/observability/events/evals.py but did not update tests/unit/observability/test_events.py::TestEventConstants::test_all_domain_modules_discovered. That test enumerates every legitimate domain module under observability/events/ and refuses unknowns. Adding the new evals namespace to the expected set unblocks the pre-push pytest gate.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/emit/json_writer.py`:
- Around line 37-45: The temporary JSON file is opened in text mode without
specifying newline, causing platform-dependent CRLF/LF translations; update the
NamedTemporaryFile call in evals/emit/json_writer.py (the with
tempfile.NamedTemporaryFile(...) as handle block) to include newline="\n" so the
written payload uses LF-only line endings across platforms, keeping mode="w",
encoding="utf-8", delete=False, prefix/suffix as-is.
In `@evals/scoring/aggregate.py`:
- Around line 86-87: The validator
AggregationResult._score_matches_grade_minus_deduction currently enforces the
floor using the global GRADE_FLOOR while aggregate_brief_score uses
penalty_table.floor; change the validator to use the same floor source as
aggregation (e.g., compute expected = max(self.grade - self.deduction,
self.penalty_table.floor) or otherwise obtain penalty_table.floor from the
AggregationResult instance/context) so both validation and scoring use
penalty_table.floor consistently; ensure any null/absent penalty_table is
handled (fall back to GRADE_FLOOR only if penalty_table is unavailable).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: 570e0380-6ccd-4433-aa7a-6eb407842c8e
📒 Files selected for processing (14)
evals/emit/json_writer.pyevals/errors.pyevals/loader/briefs.pyevals/models/brief.pyevals/models/scorecard.pyevals/scoring/aggregate.pyevals/scoring/executable.pyevals/scoring/penalties.pysrc/synthorg/observability/events/evals.pytests/evals_spine/test_brief_loader.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_penalties.pytests/evals_spine/test_scorecard_emit.pytests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
- GitHub Check: Dashboard Test
- GitHub Check: Test Unit
- GitHub Check: Test Integration
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Follow configuration precedence: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env at boot site; no os.environ.get outside startup
No hardcoded numeric values; numerics live in settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal)
Use comments to explain WHY only; no reviewer citations, issue back-refs, or migration framing; enforced by check_no_review_origin_in_code.py and check_no_migration_framing.py
No
from __future__ import annotations(Python 3.14 has PEP 649); PEP 758 except:except A, B:must have parens unless bindingType hints on public functions; mypy strict mode; Google-style docstrings; line length 88; functions <50 lines; files <800 lines
Define errors as Error inheriting from DomainError, never directly from Exception/RuntimeError/etc; enforced by check_domain_error_hierarchy.py
Use Pydantic v2 frozen + extra='forbid' on every frozen model project-wide (gate check_frozen_model_extra_forbid.py);
@computed_fieldfor derived; NotBlankStr for identifiersArgs models at every system boundary; parse_typed() for every external dict ingestion; enforced by check_boundary_typed.py
Use immutability: model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries
Use asyncio.TaskGroup for fan-out/fan-in async operations; helpers must catch Exception and re-raise MemoryError/RecursionError
Implement Clock seam: clock: Clock | None = None; tests inject FakeClock
Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable
Wrap untrusted content with wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML (SEC-1)
Import logger as: from synthorg.observability import get_logger; variable always named logger; never use import logging or print() in app code
Use e...
Files:
src/synthorg/observability/events/evals.py
src/synthorg/observability/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Telemetry: opt-in, off by default; every event property must be in _ALLOWED_PROPERTIES
Files:
src/synthorg/observability/events/evals.py
src/**/*.py
⚙️ CodeRabbit configuration file
This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.
Files:
src/synthorg/observability/events/evals.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use test markers:
@pytest.mark.{unit,integration,e2e,slow}; async auto; timeout 30s global; coverage 80% minxdist -n 8 --dist=loadfile auto-applied via pyproject addopts (loadfile prevents 3.14+Windows ProactorEventLoop leak)
Windows: unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race); subprocess tests override back
Test doubles: FakeClock for Clock seam; mock_ofT for typed-boundary substitutions; SimpleNamespace for attribute-bags; bare MagicMock at typed boundary blocked by scripts/check_mock_spec.py (zero-tolerance)
FakeClock and mock_of import from tests._shared; inject via clock= and helper's spec subscript
Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add
@example(...))Flaky tests: NEVER skip/xfail; fix fundamentally; use asyncio.Event().wait() not sleep(large)
Files:
tests/unit/observability/test_events.pytests/evals_spine/test_penalties.pytests/evals_spine/test_scorecard_emit.pytests/evals_spine/test_brief_loader.pytests/evals_spine/test_judged_calibration.py
⚙️ CodeRabbit configuration file
Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare
@settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which@given() honors automatically.
Files:
tests/unit/observability/test_events.pytests/evals_spine/test_penalties.pytests/evals_spine/test_scorecard_emit.pytests/evals_spine/test_brief_loader.pytests/evals_spine/test_judged_calibration.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Read design spec page before implementing; deviations need approval from DESIGN_SPEC.md
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Present every plan for accept/deny before coding
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Apply no region/currency/locale privilege; use metric units and British English
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Every convention PR must ship its enforcement gate
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Never edit tests/baselines/unit_timing.json or any scripts/*_baseline.{txt,json} / scripts/_*_baseline.py; timeout/slow failures indicate source-code regression
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: After issue: branch + commit + push (no auto-PR); use /pre-pr-review before PR; after PR use /aurelio-review-pr for external feedback; fix EVERYTHING valid before merging
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Commits: <type>: <description> (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via synthorg-repo-bot)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Branches: <type>/<slug> from main
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: Squash merge; PR body becomes squash commit; trailers (Release-As, Closes `#N`) must be in PR body
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: GitHub queries: use gh issue list via Bash, NOT MCP list_issues
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: After every squash merge → /post-merge-cleanup
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:31:34.221Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.
Applied to files:
src/synthorg/observability/events/evals.pytests/unit/observability/test_events.pyevals/errors.pyevals/scoring/aggregate.pytests/evals_spine/test_penalties.pyevals/models/scorecard.pyevals/scoring/executable.pyevals/models/brief.pyevals/scoring/penalties.pyevals/emit/json_writer.pytests/evals_spine/test_scorecard_emit.pyevals/loader/briefs.pytests/evals_spine/test_brief_loader.pytests/evals_spine/test_judged_calibration.py
🔇 Additional comments (13)
src/synthorg/observability/events/evals.py (1)
1-13: LGTM!tests/unit/observability/test_events.py (1)
336-339: LGTM!evals/errors.py (1)
43-54: LGTM!Also applies to: 120-120
evals/scoring/aggregate.py (1)
12-12: LGTM!Also applies to: 34-34, 131-133
tests/evals_spine/test_penalties.py (1)
21-21: LGTM!Also applies to: 185-210
evals/models/scorecard.py (1)
21-21: LGTM!Also applies to: 62-62, 82-85, 147-162, 166-166, 190-191, 197-205, 209-209, 259-259, 271-271
evals/scoring/executable.py (1)
18-18: LGTM!Also applies to: 25-30, 73-73, 83-83, 109-113, 121-121, 148-168, 186-215, 231-231
evals/models/brief.py (1)
53-57: LGTM!Also applies to: 117-117, 121-121, 253-253
evals/scoring/penalties.py (1)
17-21: LGTM!Also applies to: 91-91, 101-110
tests/evals_spine/test_scorecard_emit.py (1)
31-41: LGTM!Also applies to: 53-53, 236-283
evals/loader/briefs.py (1)
12-16: LGTM!Also applies to: 44-45, 49-49, 53-66
tests/evals_spine/test_brief_loader.py (1)
9-13: LGTM!Also applies to: 131-161
tests/evals_spine/test_judged_calibration.py (1)
3-3: LGTM!Also applies to: 81-81, 84-84
52 unit tests green. Lands the eval-spine scoring + data layer: brief and scorecard frozen Pydantic v2 models, YAML loaders via parse_typed, executable + judged graders with Spearman ordinal calibration gate (rho >= 0.7), process-fact penalty table with cap + floor, deterministic JSON and Markdown emitters, DomainError-based eval error hierarchy. pyproject.toml wires evals/ for ruff/pyright/pytest. Phase 3 runner + Phase 8 baselines + Phase 9 wiring follow.
…dule Pre-push mypy gate flagged: (a) write_brief_yaml fixture's return type missing -- introduce BriefYamlWriter Callable type alias in conftest; (b) test_runner_broken_scores_worse.py importing evals.run at module scope failed because that module lands in a follow-up commit -- switch to pytest.importorskip inside helpers so the test collects cleanly until the orchestrator lands; (c) test_load_anchor_set_below_minimum_raises missing Path annotation.
…nventions Apply 19 of 22 review findings from /pre-pr-review (1 invalid finding rejected; 1 deferred with rationale). Notable changes: (1) Critical: path-traversal containment in evals/loader/anchors.py via Path.resolve().is_relative_to(); (2) Critical: consolidate SCORECARD_SCHEMA_VERSION to a single source-of-truth in evals/models/scorecard.py; (3) Wrong exc type for non-dict brief YAML switched from BriefSuiteEmptyError to TypeError; (4) HiddenCheckSpec rejects shell metacharacters in cmd[0] at load time (defence-in-depth atop shell=False); (5) Cross-field validators on BriefResult, JudgeCalibrationReport, ProcessFactReport, AggregatedProcessFacts, PenaltyEntry, AggregationResult, CheckOutcome, ExecutableGrade, ScriptedJudge, JudgedGrade, AnchorItem; (6) Refactor _run_check / grade_executable below the 50-line ceiling via _outcome_from_completed / _outcome_from_timeout / _run_class helpers; (7) Parametrize-collapse Spearman and executable-grade tests; (8) Issue back-refs and migration-framing scrubbed from docstrings; (9) CLAUDE.md layout line now mentions top-level evals/; (10) 16 new validator + parametrize tests, 68 unit tests pass. Rejected: PEP 758 finding (CLAUDE.md says no parens unless binding; we ARE binding). Deferred: Brief discriminated union (agent flagged optional, current post-validator catches the same XOR errors).
Reviewer feedback on PR #2025 from the initial CodeRabbit + Gemini pass. All findings verified against the current code; everything valid is fixed in this round (no deferrals). evals/loader/briefs.py refuses non-file glob entries and resolved paths outside briefs_dir (new BriefSuitePathTraversalError); blocks symlink escape and directory-with-yaml-suffix. evals/models/brief.py introduces DEFAULT_HIDDEN_CHECK_TIMEOUT_SECONDS as a module-level Final constant in place of the inline default. evals/models/scorecard.py adds a BriefResult validator enforcing kind=JUDGED iff judge_calibration is set, and AggregatedProcessFacts now rejects negative per-class counts. evals/scoring/aggregate.py types PenaltyEntry.event_constant as NotBlankStr and aggregate_brief_score refuses negative per-class counts before applying penalties. evals/scoring/executable.py types CheckOutcome.label as NotBlankStr, _outcome_from_completed records measured time.perf_counter() duration, _run_check catches PermissionError as tool-missing, and event names come from observability.events.evals constants instead of string literals. evals/scoring/penalties.py rejects negative points_per_event values and floor above GRADE_CEILING. evals/emit/json_writer.py uses tempfile + Path.replace + fsync for atomic write; the previous Path.write_text was advertised atomic but is not. tests/evals_spine/test_judged_calibration.py types predicate as Callable[[float], bool], type-ignore removed. is_clean stays @Property (NOT @computed_field) on ProcessFactReport, AggregatedProcessFacts, AggregationResult, ExecutableGrade: @computed_field serialises the value into JSON, but extra=forbid then rejects it on model_validate_json round-trip; matches existing Scorecard.total / max_total / is_passing pattern. Added positive tests for every new validator (path traversal, negative counts, kind/calibration consistency, PenaltyTable bounds).
Round 1 push #2025 added src/synthorg/observability/events/evals.py but did not update tests/unit/observability/test_events.py::TestEventConstants::test_all_domain_modules_discovered. That test enumerates every legitimate domain module under observability/events/ and refuses unknowns. Adding the new evals namespace to the expected set unblocks the pre-push pytest gate.
CodeRabbit re-review on head 48c3121 posted two actionable comments. Both verified valid against the current code and fixed in this commit. evals/emit/json_writer.py: NamedTemporaryFile now opens with newline LF only, so the serialised JSON is deterministic across platforms. Without this, on Windows the text-mode write would translate LF to CRLF and the scorecard JSON would differ from the same payload produced on Linux. evals/scoring/aggregate.py: AggregationResult now carries the floor used by the aggregator (defaults to GRADE_FLOOR for backward compat) and the score validator clamps against that floor, not the global GRADE_FLOOR. Previously a PenaltyTable with a non-zero floor would produce a final_score the validator rejected. Added a regression test in tests/evals_spine/test_penalties.py that aggregates with a custom floor of 50 and asserts the score is clamped consistently.
…r_consistently docstring Pre-push hooks no-review-origin-in-code and no-migration-framing both flagged the docstring on test_aggregate_uses_penalty_table_floor_consistently in tests/evals_spine/test_penalties.py. The phrase explaining what the validator did 'before the round-4 fix' is exactly the kind of historical narrative the gates forbid: it rots the moment the old behaviour stops being a thing anyone remembers. The replacement describes current invariants only (the floor flows from the penalty table into AggregationResult; the score validator and aggregator share the floor).
48c3121 to
6dfda3a
Compare
There was a problem hiding this comment.
Actionable comments posted: 12
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/emit/json_writer.py`:
- Line 25: Remove the unused logger initialization: delete the statement that
creates the module-level logger (logger = get_logger(__name__)) and also remove
the corresponding get_logger import if it isn't used elsewhere in this module;
ensure no references to the symbol logger remain (search for logger and
get_logger in this file) before committing the change.
In `@evals/emit/markdown_writer.py`:
- Around line 128-133: The write_scorecard_md function currently calls
target.write_text(render_scorecard_md(scorecard), encoding="utf-8") which allows
platform-dependent newline translation; change this call to pass newline="\n" so
the file is written with deterministic LF line endings (consistent with
json_writer.py). Update the write_text invocation in write_scorecard_md (and
ensure it still uses SCORECARD_MD_FILENAME and render_scorecard_md) to include
newline="\n" while preserving encoding="utf-8".
- Line 23: The module defines an unused logger via the variable logger =
get_logger(__name__) in markdown_writer.py; remove that unused initialization
and any related unused imports (get_logger or logger) from the file, or if
logging is intended, replace it with a used logging call (e.g., use the logger
inside functions like write_markdown or export_markdown) — locate the
logger/get_logger symbols in the module and either delete them or integrate them
into existing functions so the symbol is actually used.
In `@evals/models/scorecard.py`:
- Around line 252-255: The is_passing method currently truncates the threshold
by using int(self.max_total * PASS_FRACTION); replace that with a non-truncating
ceiling comparison so the required pass count isn't lowered for non-integer
thresholds—use math.ceil(self.max_total * PASS_FRACTION) (import math if needed)
and update is_passing to compare self.total against that ceiling value.
In `@evals/scoring/__init__.py`:
- Around line 3-12: The module docstring incorrectly states "Three concerns"
while listing four modules; update the top-level docstring in evals.scoring (the
module docstring in __init__.py) to read "Four concerns" (or a neutral phrase
like "Several concerns") so the count matches the listed modules
(:mod:`evals.scoring.penalties`, :mod:`evals.scoring.executable`,
:mod:`evals.scoring.judged`, :mod:`evals.scoring.aggregate`).
In `@evals/scoring/aggregate.py`:
- Around line 112-115: Update the docstring that describes the parameter "grade"
so it matches the implementation: remove the statement that out-of-range values
are "clamped" and instead say that grades must be in [0, 100] and values outside
that range raise a ValueError (no clamping). Ensure the docstring for the
function/method that accepts the "grade" parameter mirrors the actual validation
behavior where a ValueError is raised when the grade is out of bounds.
In `@evals/scoring/judged.py`:
- Around line 210-216: Replace the hardcoded event string
"evals.judge.calibration_failed" used in the logger.warning call inside
judged.py with the corresponding event constant exported from
synthorg.observability.events.evals (e.g., EVALS_JUDGE_CALIBRATION_FAILED); add
the import for that constant at the top of the file and use the constant in the
logger.warning call so the code matches the existing event-constant convention.
In `@tests/evals_spine/test_executable_grader.py`:
- Around line 78-96: The test matrix in pytest.mark.parametrize for
("hidden_pass", "build_pass", "lint_pass", "expected_score") is missing the two
boolean combinations (hidden=True, build=False, lint=False) and (hidden=False,
build=False, lint=True); add entries for (True, False, False, EXEC_TOTAL -
EXEC_WEIGHT_BUILD - EXEC_WEIGHT_LINT) and (False, False, True, EXEC_TOTAL -
EXEC_WEIGHT_HIDDEN - EXEC_WEIGHT_BUILD) and append corresponding ids (e.g.,
"build_and_lint_fail" and "hidden_and_build_fail") so the full 8-combination
coverage is present in tests/evals_spine/test_executable_grader.py using the
existing pytest.mark.parametrize block and the constants EXEC_TOTAL,
EXEC_WEIGHT_HIDDEN, EXEC_WEIGHT_BUILD, EXEC_WEIGHT_LINT.
In `@tests/evals_spine/test_judged_calibration.py`:
- Line 306: Add a single module-level import for load_anchor_set from
evals.loader.anchors (alongside the other anchors imports) and remove the two
local imports that appear inside the tests (the local load_anchor_set imports
within the two test functions). This consolidates imports: declare "from
evals.loader.anchors import load_anchor_set" at top-level and delete the
duplicate local import statements in the test bodies so the tests use the
module-level symbol.
- Line 287: Consolidate the local pydantic ValidationError imports by adding a
single module-level import "from pydantic import ValidationError as
PydValidationError" at the top of the test module and removing the three
redundant local imports of PydValidationError inside the individual tests;
update any local references to use the module-level PydValidationError symbol so
tests continue to raise/handle Pydantic ValidationError consistently.
- Line 324: Remove the redundant local import of AnchorItem by deleting the
duplicate "from evals.loader.anchors import AnchorItem" line in the
test_judged_calibration.py module so only the module-level import of AnchorItem
remains; search for any extra occurrences of the symbol AnchorItem within that
file to ensure you remove only the duplicate import and not any necessary
usages.
In `@tests/evals_spine/test_runner_broken_scores_worse.py`:
- Line 53: The assertion in the test uses a strict greater-than comparison
(reference.total > broken.total + SCORE_MARGIN) but the contract requires an
inclusive threshold ("at least" SCORE_MARGIN); change the comparison to a
greater-than-or-equal comparison so the check becomes reference.total >=
broken.total + SCORE_MARGIN, ensuring a gap exactly equal to SCORE_MARGIN
passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: a85a438d-9fc5-40c1-a517-15ae67cf84d8
📒 Files selected for processing (29)
CLAUDE.mdevals/__init__.pyevals/emit/__init__.pyevals/emit/json_writer.pyevals/emit/markdown_writer.pyevals/errors.pyevals/loader/__init__.pyevals/loader/anchors.pyevals/loader/briefs.pyevals/models/__init__.pyevals/models/brief.pyevals/models/scorecard.pyevals/scoring/__init__.pyevals/scoring/aggregate.pyevals/scoring/executable.pyevals/scoring/judged.pyevals/scoring/penalties.pyevals/scoring/spearman.pypyproject.tomlsrc/synthorg/observability/events/evals.pytests/evals_spine/__init__.pytests/evals_spine/conftest.pytests/evals_spine/test_brief_loader.pytests/evals_spine/test_executable_grader.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_penalties.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_scorecard_emit.pytests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
- GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
- GitHub Check: Test Integration
- GitHub Check: Test E2E
- GitHub Check: Dashboard Test
- GitHub Check: Test Unit
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (5)
src/synthorg/!(persistence)/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Only
src/synthorg/persistence/may import sqlite/psycopg or emit raw SQL
Files:
src/synthorg/observability/events/evals.py
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Configuration precedence: DB > env > code default via
SettingsService/ConfigResolver(Cat-1) or env > code default (Cat-2,read_only_post_init); Cat-3 bootstrap secrets are pure env at boot site; noos.environ.getoutside startup
Files:
src/synthorg/observability/events/evals.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
No hardcoded numeric values; numerics live in
settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants of the formNAME: int|float|Final|Final[int]|Final[float] = literalComments should explain WHY only; no reviewer citations, issue back-refs, or migration framing
No
from __future__ import annotations(Python 3.14 has PEP 649); use PEP 758 except:except A, B:requires parens if bindingType hints required on public functions; mypy strict mode; Google-style docstrings
Line length 88; functions <50 lines; files <800 lines
Errors must use
<Domain><Condition>Errornaming and inherit fromDomainError; never inherit directly fromException/RuntimeError/etcPydantic v2 frozen models must use
extra="forbid"on every frozen model;@computed_fieldis auto-exempt; per-line# lint-allow: frozen-extra-forbid -- <reason>for exceptions
@computed_fieldshould be used for derived properties in Pydantic modelsUse
NotBlankStrfor identifiers in Pydantic modelsArgs models required at every system boundary; use
parse_typed()for every external dict ingestionImmutability: use
model_copy(update=...)orcopy.deepcopy(); deepcopy at system boundariesAsync code must use
asyncio.TaskGroupfor fan-out/fan-in; helpers catchExceptionand re-raiseMemoryError/RecursionErrorUse Clock seam:
clock: Clock | None = None; tests injectFakeClockLifecycle services must own
_lifecycle_lock; timed-out stops mark unrestartableUntrusted content (SEC-1) must be wrapped via
wrap_untrusted()fromengine.prompt_safety; useHTMLParseGuardfor HTMLImport logger via
from synthorg.observability import get_logger; variable always namedlogger; never useimport loggingorprint()in app codeEvent names must come from
observability.events.<domain>constants; use structured kwargs in log calls (logger.info(EVENT, key=value))Error paths must log WARNING/ERROR w...
Files:
src/synthorg/observability/events/evals.py
⚙️ CodeRabbit configuration file
This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.
Files:
src/synthorg/observability/events/evals.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Test markers:
@pytest.mark.{unit,integration,e2e,slow}Async tests use
automode; global timeout 30s; coverage 80% minimumTest doubles ladder:
FakeClockfor Clock seam,mock_of[T](**overrides)for typed-boundary substitutions,SimpleNamespacefor attribute-bags; bareMagicMockat typed boundary is blocked byscripts/check_mock_spec.pyFakeClock and
mock_ofimport fromtests._shared; inject viaclock=and helper's spec subscriptFlaky tests: NEVER skip/xfail; fix fundamentally; use
asyncio.Event().wait()notsleep(large)
Files:
tests/unit/observability/test_events.pytests/evals_spine/conftest.pytests/evals_spine/test_executable_grader.pytests/evals_spine/test_penalties.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_scorecard_emit.pytests/evals_spine/test_brief_loader.py
⚙️ CodeRabbit configuration file
Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare
@settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which@given() honors automatically.
Files:
tests/unit/observability/test_events.pytests/evals_spine/conftest.pytests/evals_spine/test_executable_grader.pytests/evals_spine/test_penalties.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_scorecard_emit.pytests/evals_spine/test_brief_loader.py
tests/**/*test*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Windows unit tests use
WindowsSelectorEventLoopPolicy(3.14 IOCP teardown race); subprocess tests override backHypothesis: 10 deterministic CI examples; failures are real bugs (fix + add
@example(...))
Files:
tests/unit/observability/test_events.pytests/evals_spine/conftest.pytests/evals_spine/test_executable_grader.pytests/evals_spine/test_penalties.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_scorecard_emit.pytests/evals_spine/test_brief_loader.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Read `docs/design/` page before implementing; deviations need approval
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Present every plan for accept/deny before coding
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: No region/currency/locale privileged; use metric units; British English
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Every convention PR must ship its enforcement gate
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Never edit `tests/baselines/unit_timing.json` or any `scripts/*_baseline.{txt,json}` / `scripts/_*_baseline.py` files; timeout/slow failures indicate source-code regression
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: After issue: branch + commit + push (no auto-PR); use `/pre-pr-review` before creating PR
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: After PR creation: use `/aurelio-review-pr` for external feedback and fix all valid issues before merging
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Use `go -C cli` command, never `cd cli`
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: xdist `-n 8 --dist=loadfile` auto-applied via pyproject `addopts` (`loadfile` prevents 3.14+Windows ProactorEventLoop leak)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Branches: `<type>/<slug>` from main
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Pre-commit/pre-push hooks via `.pre-commit-config.yaml`; tool-call gates via `.claude/settings.json` PreToolUse (`scripts/check_*.sh`/`.py`)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: Squash merge; PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: GitHub queries: use `gh issue list` via Bash, NOT MCP `list_issues`
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T18:59:08.797Z
Learning: After every squash merge → `/post-merge-cleanup`
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.
Applied to files:
evals/emit/__init__.pyevals/loader/__init__.pysrc/synthorg/observability/events/evals.pyevals/__init__.pyevals/models/__init__.pyevals/scoring/__init__.pyevals/scoring/spearman.pyevals/errors.pytests/unit/observability/test_events.pytests/evals_spine/conftest.pyevals/emit/markdown_writer.pyevals/emit/json_writer.pytests/evals_spine/test_executable_grader.pyevals/loader/anchors.pyevals/models/brief.pyevals/models/scorecard.pyevals/scoring/aggregate.pytests/evals_spine/test_penalties.pytests/evals_spine/test_judged_calibration.pyevals/scoring/executable.pytests/evals_spine/test_runner_broken_scores_worse.pytests/evals_spine/test_scorecard_emit.pyevals/loader/briefs.pyevals/scoring/judged.pyevals/scoring/penalties.pytests/evals_spine/test_brief_loader.py
📚 Learning: 2026-05-16T18:36:31.446Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.
Applied to files:
CLAUDE.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).
Applied to files:
CLAUDE.md
🔇 Additional comments (45)
tests/evals_spine/conftest.py (1)
1-60: LGTM!tests/evals_spine/test_brief_loader.py (1)
19-180: LGTM!evals/loader/briefs.py (1)
1-94: LGTM!evals/loader/__init__.py (1)
1-8: LGTM!evals/loader/anchors.py (1)
1-133: LGTM!tests/evals_spine/test_runner_broken_scores_worse.py (1)
1-52: LGTM!Also applies to: 54-88
evals/scoring/spearman.py (1)
1-97: LGTM!evals/scoring/penalties.py (1)
1-153: LGTM!tests/evals_spine/test_penalties.py (1)
1-233: LGTM!tests/evals_spine/test_scorecard_emit.py (1)
1-316: LGTM!tests/unit/observability/test_events.py (1)
336-339: LGTM!tests/evals_spine/test_executable_grader.py (2)
26-75: LGTM!
97-230: LGTM!src/synthorg/observability/events/evals.py (2)
1-10: LGTM!
12-13: The constants at lines 12-13 are properly imported and used in the executable scoring module (evals/scoring/executable.py). BothEVALS_EXECUTABLE_TIMEOUTandEVALS_EXECUTABLE_TOOL_MISSINGare used correctly in structured logging calls with appropriate context kwargs and secure error handling (safe_error_description(exc)instead of raw exception strings).CLAUDE.md (1)
3-3: LGTM!evals/__init__.py (1)
1-15: LGTM!pyproject.toml (1)
190-190: LGTM!Also applies to: 283-283, 363-363, 378-378
evals/errors.py (1)
1-130: LGTM!evals/models/__init__.py (1)
1-7: LGTM!evals/models/brief.py (1)
1-269: LGTM!evals/models/scorecard.py (1)
1-251: LGTM!Also applies to: 256-313
evals/scoring/executable.py (4)
1-91: LGTM!
93-134: LGTM!
136-216: LGTM!
218-300: LGTM!evals/scoring/judged.py (5)
1-52: LGTM!
54-119: LGTM!
122-171: LGTM!
174-222: LGTM!
225-298: LGTM!evals/emit/json_writer.py (2)
37-45: Past review finding has been addressed.The past review comment flagged the missing
newline="\n"parameter. The current code includes it at line 41, ensuring platform-independent LF line endings for byte-level determinism.
31-57: LGTM!evals/emit/__init__.py (1)
1-1: LGTM!evals/emit/markdown_writer.py (6)
34-37: LGTM!
40-48: LGTM!
51-69: LGTM!
72-79: LGTM!
82-97: LGTM!
100-125: LGTM!tests/evals_spine/test_judged_calibration.py (5)
1-32: LGTM!
36-102: LGTM!
108-159: LGTM!
162-197: LGTM!
200-258: LGTM!
All 12 actionable comments from the CodeRabbit re-review (review id 4331444484) verified valid and fixed in this commit. Source files: json_writer.py drops the unused module-level logger so the symbol does not lie about the import surface; markdown_writer.py does the same and write_scorecard_md now pins newline=LF for byte-identical Markdown across platforms; scorecard.is_passing uses math.ceil instead of int() so fractional pass thresholds round up to the next integer (a max_total of 295 at 65% requires 192, not 191); the evals.scoring package docstring corrects 'Three concerns' to 'Four concerns' to match the four listed modules; aggregate_brief_score docstring drops the obsolete 'clamped' claim and accurately states that out-of-range grades raise ValueError; judged.py logs the calibration-failed event via EVALS_JUDGE_CALIBRATION_FAILED imported from observability.events.evals (added there) instead of an inline literal. Test files: test_executable_grader.py parametrize matrix expanded from six to the full eight 2x2x2 boolean combinations across hidden/build/lint pass flags; test_judged_calibration.py consolidates yaml + PydValidationError + load_anchor_set + MIN_ANCHOR_SET_SIZE + JudgeCalibrationReport + JudgedGrade + AnchorItem imports to the module top, dropping the seven duplicate local imports; test_runner_broken_scores_worse.py SCORE_MARGIN comparison switches from > to >= so a reference run that beats the broken run by exactly SCORE_MARGIN points still passes (the contract is 'at least SCORE_MARGIN', inclusive).
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
evals/models/scorecard.py (1)
129-177:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftPersist the per-brief score floor in
BriefResult.
aggregate_brief_score()now allowsscore = max(grade - deduction, penalty_table.floor), butBriefResultstill revalidates againstmax(grade - deduction, GRADE_FLOOR)and has no field to carry the configured floor. Any non-zeroPenaltyTable.floorwill make a valid aggregation impossible to serialize into the scorecard without either failing validation or lying about how the score was derived. Either thread the floor throughBriefResult/Scorecardand version the schema, or reject non-zero floors before scorecard construction.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/models/scorecard.py` around lines 129 - 177, The BriefResult validation assumes a global GRADE_FLOOR but aggregate_brief_score can use a PenaltyTable.floor; add a score_floor int field to BriefResult (defaulting to GRADE_FLOOR for compatibility) and update the _score_matches_grade_minus_deduction validator to compute expected = max(self.grade - self.deduction, self.score_floor) so the serialized row records the actual floor used. Also ensure aggregate_brief_score (and Scorecard construction) populates BriefResult.score_floor from PenaltyTable.floor when creating BriefResult instances (or else reject non-zero PenaltyTable.floor before building scorecards if you choose to forbid non-default floors).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@evals/models/scorecard.py`:
- Around line 129-177: The BriefResult validation assumes a global GRADE_FLOOR
but aggregate_brief_score can use a PenaltyTable.floor; add a score_floor int
field to BriefResult (defaulting to GRADE_FLOOR for compatibility) and update
the _score_matches_grade_minus_deduction validator to compute expected =
max(self.grade - self.deduction, self.score_floor) so the serialized row records
the actual floor used. Also ensure aggregate_brief_score (and Scorecard
construction) populates BriefResult.score_floor from PenaltyTable.floor when
creating BriefResult instances (or else reject non-zero PenaltyTable.floor
before building scorecards if you choose to forbid non-default floors).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: f2fa10d1-679d-449d-bd31-72249dc71096
📒 Files selected for processing (10)
evals/emit/json_writer.pyevals/emit/markdown_writer.pyevals/models/scorecard.pyevals/scoring/__init__.pyevals/scoring/aggregate.pyevals/scoring/judged.pysrc/synthorg/observability/events/evals.pytests/evals_spine/test_executable_grader.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_runner_broken_scores_worse.py
💤 Files with no reviewable changes (1)
- evals/emit/json_writer.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
- GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
- GitHub Check: Dashboard Test
- GitHub Check: Test Unit
- GitHub Check: Test Integration
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Async test fixtures: use
automode; global timeout 30s; coverage minimum 80%; xdist-n 8 --dist=loadfileauto-applied via pyprojectaddopts(loadfile prevents 3.14+ Windows ProactorEventLoop leak)Windows unit tests use
WindowsSelectorEventLoopPolicy(3.14 IOCP teardown race); subprocess tests override backTest markers:
@pytest.mark.{unit,integration,e2e,slow}; test doubles ladder in conventions.md section 12.1; useFakeClockandmock_of[T](**overrides)fromtests._shared;SimpleNamespacefor attribute-bags; bareMagicMockat typed boundary blocked byscripts/check_mock_spec.pyHypothesis: 10 deterministic CI examples; failures are real bugs (fix + add
@example(...))Flaky tests: NEVER skip/xfail; fix fundamentally; use
asyncio.Event().wait()instead ofsleep(large)
Files:
tests/evals_spine/test_executable_grader.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_runner_broken_scores_worse.py
⚙️ CodeRabbit configuration file
Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare
@settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which@given() honors automatically.
Files:
tests/evals_spine/test_executable_grader.pytests/evals_spine/test_judged_calibration.pytests/evals_spine/test_runner_broken_scores_worse.py
src/synthorg/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Configuration precedence order: DB > env > code default via
SettingsService/ConfigResolver(Cat-1) or env > code default (Cat-2,read_only_post_init); Cat-3 bootstrap secrets are pure env; YAML is ingestion format only, not a precedence tier; noos.environ.getoutside startup; pre-init Cat-2 reads usesettings.bootstrap_resolver.resolve_init_valueLifecycle services must own
_lifecycle_lock; timed-out stops mark services unrestartable
Files:
src/synthorg/observability/events/evals.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
No hardcoded numeric values; numerics live in
settings/definitions/; allowlist 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants of formNAME: int|float|Final|Final[int]|Final[float] = literalperscripts/check_no_magic_numbers.pyUse
asyncio.TaskGroupfor fan-out/fan-in in async code; helpers must catchExceptionand re-raiseMemoryError/RecursionErrorUse Clock seam pattern:
clock: Clock | None = None; tests injectFakeClockfromtests._sharedWrap untrusted content via
wrap_untrusted()fromengine.prompt_safety; useHTMLParseGuardfor HTML per SEC-1Use
from synthorg.observability import get_loggerwith variable always namedlogger; never useimport loggingorprint()in app codeEvent names must come from
observability.events.<domain>constants; use structured kwargs (logger.info(EVENT, key=value))Error paths must log WARNING/ERROR with context before raising; state transitions log INFO via
*_STATUS_TRANSITIONEDAFTER persistence writeSecret-log redaction (SEC-1): never use
error=str(exc)or interpolate{exc}; useerror_type=type(exc).__name__+error=safe_error_description(exc); neverexc_info=True; OTel forbiddenspan.record_exception(exc), usespan.set_attribute("exception.message", safe_error_description(exc))+record_exception=False, set_status_on_exception=Falsepercheck_logger_exception_str_exc.pyComments must explain WHY only; no reviewer citations, issue back-refs, or migration framing per
check_no_review_origin_in_code.pyandcheck_no_migration_framing.pyNo
from __future__ import annotations(Python 3.14+ has PEP 649); PEP 758 except:except A, B:requires parens when bindingType hints required on public functions; mypy strict mode; Google-style docstrings; line length 88; functions <50 lines; files <800 lines
Error classes must follow
<Domain><Condition>Errornaming and inherit fromDomainError; never inherit directly from `Exce...
Files:
src/synthorg/observability/events/evals.py
⚙️ CodeRabbit configuration file
This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.
Files:
src/synthorg/observability/events/evals.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Read design specification in `docs/design/` before implementing; deviations require approval per [DESIGN_SPEC.md](docs/DESIGN_SPEC.md)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Present every plan for accept/deny before coding
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: No region/currency/locale should be privileged; use metric units and British English per [docs/reference/regional-defaults.md](docs/reference/regional-defaults.md)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Every convention PR must ship its enforcement gate per [docs/reference/convention-gates.md](docs/reference/convention-gates.md)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Timeout/slow test failures indicate source-code regression; never edit `tests/baselines/unit_timing.json` or baseline files; bypass only via `ALLOW_BASELINE_GROWTH=1 git commit ...` with explicit user approval
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: After issue implementation: branch + commit + push without auto-PR; use `/pre-pr-review` (gh pr create is blocked); after PR use `/aurelio-review-pr` for external feedback; fix all valid issues without deferring
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Git commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Branch naming: `<type>/<slug>` from main
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Pre-commit/pre-push hooks via `.pre-commit-config.yaml`; tool-call gates via `.claude/settings.json` PreToolUse (`scripts/check_*.sh`/`.py`)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Squash merge; PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: After every squash merge → run `/post-merge-cleanup`
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Use `uv sync` for all dependencies, `uv sync --group docs` for docs toolchain; run `bash scripts/install_cli_tools.sh` once per machine for golangci-lint; run `bash scripts/install_git_hooks.sh` once per clone to wire `core.hooksPath -> scripts/git-hooks`
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Linting: `uv run ruff check src/ tests/ --fix` + `uv run ruff format src/ tests/`; type checking: `uv run mypy --num-workers=4 src/ tests/` (strict)
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Testing: `uv run python -m pytest tests/ -m {unit|integration|e2e}` with `-n 8 --dist=loadfile` via pyproject `addopts`; coverage minimum 80%; use `HYPOTHESIS_PROFILE={dev|fuzz}` for property tests
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Pre-commit checks: `uv run pre-commit run --all-files`; schema drift: `uv run python scripts/check_schema_drift_revisions.py --backend {sqlite|postgres}`
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Documentation build: `PYTHONPATH=. uv run zensical build` for docs; reference [docs/reference/claude-reference.md](docs/reference/claude-reference.md) for doc layout, Docker, releasing, CI, dependencies, Hypothesis
Learnt from: CR
Repo: Aureliolo/synthorg
Timestamp: 2026-05-20T19:19:19.402Z
Learning: Web: see `web/CLAUDE.md` for guidelines; CLI: see `cli/CLAUDE.md` and use `go -C cli` (never `cd cli`); Shell: see `~/.claude/rules/common/bash.md` for canonical `cd`/`git -C`/Bash file-write rules
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.
Applied to files:
evals/scoring/__init__.pytests/evals_spine/test_executable_grader.pysrc/synthorg/observability/events/evals.pyevals/scoring/aggregate.pyevals/emit/markdown_writer.pytests/evals_spine/test_judged_calibration.pyevals/models/scorecard.pyevals/scoring/judged.pytests/evals_spine/test_runner_broken_scores_worse.py
🔇 Additional comments (9)
src/synthorg/observability/events/evals.py (1)
14-14: LGTM!tests/evals_spine/test_executable_grader.py (1)
85-87: LGTM!Also applies to: 95-97
tests/evals_spine/test_judged_calibration.py (1)
7-8: LGTM!Also applies to: 11-16, 26-26, 30-30
evals/scoring/__init__.py (1)
3-3: LGTM!evals/scoring/aggregate.py (1)
112-116: LGTM!evals/scoring/judged.py (1)
24-24: LGTM!Also applies to: 211-217
evals/models/scorecard.py (1)
10-10: LGTM!Also applies to: 252-261
evals/emit/markdown_writer.py (1)
124-133: LGTM!tests/evals_spine/test_runner_broken_scores_worse.py (1)
53-53: LGTM!
The new CodeRabbit review on head e57d7a0 (review id 4331572429) carried a single outside-diff-range finding pointing out that round 4 fixed the AggregationResult floor inconsistency but stopped at the scorer boundary: BriefResult still validated against the global GRADE_FLOOR while aggregate_brief_score produced scores using penalty_table.floor. A non-zero PenaltyTable.floor would make a valid aggregation impossible to serialise into the scorecard without either failing validation or lying about how the score was derived. BriefResult now carries a score_floor field (default GRADE_FLOOR for backward compat) and the validator clamps against self.score_floor instead of the global. aggregate_brief_score already returns AggregationResult.floor; downstream code that builds BriefResult from an AggregationResult should forward floor -> score_floor so the serialised row records the exact bound the aggregator applied. Added two regression tests in tests/evals_spine/test_scorecard_emit.py: one that accepts a non-default custom floor of 50, one that rejects a score below the custom floor.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
…y benchmark (#2025) Part of #1979. Lands the **scoring + data layer** of the golden-company benchmark (eval spine), as documented in the Status (2026-05-20) section of the linked issue. This PR intentionally does **not** close #1980. The runner (cassette guard + telemetry collector + orchestrator + CLI + baselines + recorded cassette + `make benchmark` target) is held until a runtime prerequisite gap closes. See "Prerequisites for the runner PR" in #1980 for the full follow-up plan. ## What landed The complete scoring + data contract for the eval spine, 68 unit tests passing: - `evals/models/{brief,scorecard}.py` -- frozen Pydantic v2 models, schema-versioned (`SCORECARD_SCHEMA_VERSION = 1`). - `evals/loader/{briefs,anchors}.py` -- YAML loaders via `synthorg.api.boundary.parse_typed`; path-traversal containment on anchor-set loading. - `evals/scoring/executable.py` -- subprocess grader (hidden / build / lint, weighted 60/20/20); rejects shell metacharacters at load time as defence-in-depth. - `evals/scoring/judged.py` + `spearman.py` -- ordinal calibration with a Spearman rho gate (`>= 0.7`); refuses to score when the judge's ordering disagrees with the hand-scored anchor set. - `evals/scoring/penalties.py` + `aggregate.py` -- process-fact penalty table with cap + floor. - `evals/emit/{json_writer,markdown_writer}.py` -- deterministic JSON + Markdown emitters. - `evals/errors.py` -- `DomainError`-rooted hierarchy. - `pyproject.toml` -- `evals/` wired for ruff (`src`), pyright (`extraPaths`), pytest (`pythonpath`), isort (`known-first-party`). - `CLAUDE.md` -- layout line now mentions the new `evals/` top-level directory. Pydantic models carry cross-field invariant validators so the scorecard is auditable from any single field: `BriefResult` enforces `score == max(grade - deduction, GRADE_FLOOR)`; `JudgeCalibrationReport` enforces `passed == (rho >= gate)`; `PenaltyEntry` enforces `raw == points_per_event * count` and `applied <= raw`; `AggregatedProcessFacts` enforces `total_events == sum(events_by_class)`; `ProcessFactReport` enforces entries match the tracked events_by_class slice; `CheckOutcome` ties `timed_out=True` to the POSIX timeout sentinel; `ExecutableGrade` enforces pass-flags match per-label outcomes; `ScriptedJudge` requires non-empty responses or default scores; `JudgedGrade` refuses construction with a failing calibration; `AnchorItem` rejects hand_scores outside `[0.0, 1.0]`; `HiddenCheckSpec` rejects shell metacharacters in `cmd[0]`. ## What is NOT yet built (follow-up PR scope) - `evals/runner/cassette_guard.py` -- cassette SHA256 + REPLAY enforcement. - `evals/runner/telemetry_collector.py` -- structlog event bucketing. - `evals/runner/orchestrator.py` -- per-brief boot/submit/wait/collect loop. - `evals/run.py` + `evals/__main__.py` -- CLI entry point. - `evals/baselines/{reference,broken}.yaml`, `evals/briefs/*.yaml`, `evals/cassettes/reference_run.cassette.json`. - `Makefile` target + `.gitignore`. - The acceptance test `test_runner_broken_scores_worse.py` going green (currently uses `pytest.importorskip("evals.run")`). ## Why the runner did NOT land in this PR A runtime spike (`create_app()` + submit one `TaskRequirement` via the `IntakeEngine`) confirmed the studio's agent-execution path is not yet wired end-to-end. Intake transitions fire (`intake.request.received` -> `triaging` -> `scoped`), then `TaskEngineNotRunningError` raises. No `BUDGET_*` / `EXECUTION_*` / `STAGNATION_*` / `APPROVAL_GATE_*` events appear. Without those events, broken.yaml and reference.yaml would produce identical empty process-fact traces and the scorer could not discriminate. Per project rule "admit you cannot in plain words, BEFORE": surfacing this gap rather than shipping a green acceptance test that asserts nothing. Prerequisites tracked in #1980's Status section. ## Test plan ```bash uv run python -m pytest tests/evals_spine/ -m unit ``` Result locally: 68 passed in 9.13s. ```bash uv run ruff check src/ tests/ evals/ uv run ruff format --check src/ tests/ evals/ uv run mypy --num-workers=4 tests/evals_spine/ evals/ ``` All green. Pre-push hooks (affected mypy + pytest, ruff, em-dash gate, review-origin gate, migration-framing gate, secret detection, etc.) ran on the push and passed. ## Review coverage `/pre-pr-review` ran 9 agents in parallel against the spine. 22 findings consolidated in `_audit/pre-pr-review/triage.md`: - 2 CRITICAL: path-traversal containment, duplicate `SCORECARD_SCHEMA_VERSION` -- both fixed. - 5 MAJOR valid (wrong exception type, three issue back-refs in docstrings, missing `evals/` in CLAUDE.md layout) -- all fixed. - 1 MAJOR invalid (PEP 758 syntax: agent misread CLAUDE.md's "no parens unless binding"; we ARE binding `as exc`, parens required) -- rejected with rationale. - 11 MEDIUM valid (function-length refactor, shell-metachar guard, test-quality cleanups, 4 cross-field validators) -- all fixed. - 3 LOW deferred -- 2 implemented (extra validators on trusted-construction types, JudgeProtocol docstring polish); 1 deferred (Brief discriminated union -- the review agent itself flagged it as a pragmatic trade-off and the current post-validator catches the same XOR errors; restructuring would force a substantial test refactor for marginal benefit). Net: **19 of 22 review findings applied in this PR**, with rationale recorded in the commit body of `815ac88f` and in the triage file. ## Related - Part of #1979 (operate-tier EPIC). - Depends on #1961 (client-simulation harness, already in place) + EPIC #1973 (runnable studio, partial -- runtime gap blocks the runner PR) + #1956. - Blocks #1983 (learning curve), #1990, #1995, #1998 -- all consume the scorecard schema landing here.
<!-- HIGHLIGHTS_START --> ## Highlights > _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub Models). Commit-based changelog below._ ### What you'll notice - Introduced conversational interface for direct clarify and propose interactions. - Cost management now includes forecast gates, hard ceilings, and Pareto considerations. - Added living documentation engine combining wiki and retrieval-augmented generation features. - Real intake engine is now operational for live data processing. - Virtual desktop tool with vision verification gate available for enhanced workspace control. ### What's new - Per-project reproducible environments for consistent setups. - Headless browser testing tool integrated for automated UI validation. - Governed external API and data access tool introduced. - Hardened external-remote git backend with sandbox mounts and push-queue dispatching. - Adversarial red-team gate subsystem for enhanced security testing. - Self-extending toolkit to dynamically expand capabilities. - Stakes-aware model routing enables prioritized processing. - Task-board entry adapter connects live runtime with project management. - Persistent project workspace with pluggable git backend and per-project push queues implemented. - Knowledge and provenance substrate added to track data lineage. - Scoring and data contract framework for golden-company benchmark evaluations. ### Under the hood - Desktop Dockerfile pinned by digest to improve build stability and documented publishing gap fixed. <!-- HIGHLIGHTS_END --> :robot: I have created a release *beep* *boop* --- ## [0.8.7](v0.8.6...v0.8.7) (2026-05-22) ### Features * conversational interface v1 - 1:1 clarify + propose ([#2019](#2019)) ([216ef94](216ef94)), closes [#1968](#1968) * cost as a first-class dial (forecast gate, hard ceiling, Pareto) ([#2029](#2029)) ([700a59e](700a59e)), closes [#1982](#1982) * **env:** reproducible per-project environments ([#2039](#2039)) ([d2c0ef9](d2c0ef9)), closes [#1994](#1994) * **evals:** [#1980](#1980) spine -- scoring + data contract for golden-company benchmark ([#2025](#2025)) ([53108e8](53108e8)) * goal/objective entry adapter ([#1964](#1964)) ([#2022](#2022)) ([cb15c3c](cb15c3c)) * governed external API/data access tool ([#1991](#1991)) ([#2032](#2032)) ([e08b451](e08b451)) * harden external-remote git backend + per-project sandbox mount + push-queue dispatch ([#2020](#2020)) ([#2030](#2030)) ([2fa2e1e](2fa2e1e)) * headless browser testing tool ([#1992](#1992)) ([#2024](#2024)) ([277b52a](277b52a)) * knowledge + provenance substrate ([#2036](#2036)) ([48c897b](48c897b)) * living documentation engine (dual-purpose wiki + RAG namespace) ([#2028](#2028)) ([3d10da9](3d10da9)), closes [#1976](#1976) * real intake engine online ([#2017](#2017)) ([9d8eb34](9d8eb34)) * **redteam:** adversarial red-team gate subsystem ([#1986](#1986)) ([#2026](#2026)) ([d2207e9](d2207e9)) * self-extending toolkit ([#1995](#1995)) ([#2035](#2035)) ([5ffc545](5ffc545)) * stakes-aware model routing ([#1998](#1998)) ([#2038](#2038)) ([9b98312](9b98312)) * task-board entry adapter to live runtime ([#1963](#1963)) ([#2023](#2023)) ([a8f1eea](a8f1eea)) * virtual desktop tool and vision verifier gate ([#2031](#2031)) ([dfe8b42](dfe8b42)), closes [#1993](#1993) * **workspace:** persistent project workspace + pluggable git backend + per-project push queue ([#2021](#2021)) ([ee58ee7](ee58ee7)) ### Bug Fixes * pin desktop Dockerfile by digest (Scorecard [#309](#309)) + document publish gap ([#2034](#2034)) ([8fda188](8fda188)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Part of #1979. Lands the scoring + data layer of the golden-company benchmark (eval spine), as documented in the Status (2026-05-20) section of the linked issue.
This PR intentionally does not close #1980. The runner (cassette guard + telemetry collector + orchestrator + CLI + baselines + recorded cassette +
make benchmarktarget) is held until a runtime prerequisite gap closes. See "Prerequisites for the runner PR" in #1980 for the full follow-up plan.What landed
The complete scoring + data contract for the eval spine, 68 unit tests passing:
evals/models/{brief,scorecard}.py-- frozen Pydantic v2 models, schema-versioned (SCORECARD_SCHEMA_VERSION = 1).evals/loader/{briefs,anchors}.py-- YAML loaders viasynthorg.api.boundary.parse_typed; path-traversal containment on anchor-set loading.evals/scoring/executable.py-- subprocess grader (hidden / build / lint, weighted 60/20/20); rejects shell metacharacters at load time as defence-in-depth.evals/scoring/judged.py+spearman.py-- ordinal calibration with a Spearman rho gate (>= 0.7); refuses to score when the judge's ordering disagrees with the hand-scored anchor set.evals/scoring/penalties.py+aggregate.py-- process-fact penalty table with cap + floor.evals/emit/{json_writer,markdown_writer}.py-- deterministic JSON + Markdown emitters.evals/errors.py--DomainError-rooted hierarchy.pyproject.toml--evals/wired for ruff (src), pyright (extraPaths), pytest (pythonpath), isort (known-first-party).CLAUDE.md-- layout line now mentions the newevals/top-level directory.Pydantic models carry cross-field invariant validators so the scorecard is auditable from any single field:
BriefResultenforcesscore == max(grade - deduction, GRADE_FLOOR);JudgeCalibrationReportenforcespassed == (rho >= gate);PenaltyEntryenforcesraw == points_per_event * countandapplied <= raw;AggregatedProcessFactsenforcestotal_events == sum(events_by_class);ProcessFactReportenforces entries match the tracked events_by_class slice;CheckOutcometiestimed_out=Trueto the POSIX timeout sentinel;ExecutableGradeenforces pass-flags match per-label outcomes;ScriptedJudgerequires non-empty responses or default scores;JudgedGraderefuses construction with a failing calibration;AnchorItemrejects hand_scores outside[0.0, 1.0];HiddenCheckSpecrejects shell metacharacters incmd[0].What is NOT yet built (follow-up PR scope)
evals/runner/cassette_guard.py-- cassette SHA256 + REPLAY enforcement.evals/runner/telemetry_collector.py-- structlog event bucketing.evals/runner/orchestrator.py-- per-brief boot/submit/wait/collect loop.evals/run.py+evals/__main__.py-- CLI entry point.evals/baselines/{reference,broken}.yaml,evals/briefs/*.yaml,evals/cassettes/reference_run.cassette.json.Makefiletarget +.gitignore.test_runner_broken_scores_worse.pygoing green (currently usespytest.importorskip("evals.run")).Why the runner did NOT land in this PR
A runtime spike (
create_app()+ submit oneTaskRequirementvia theIntakeEngine) confirmed the studio's agent-execution path is not yet wired end-to-end. Intake transitions fire (intake.request.received->triaging->scoped), thenTaskEngineNotRunningErrorraises. NoBUDGET_*/EXECUTION_*/STAGNATION_*/APPROVAL_GATE_*events appear. Without those events, broken.yaml and reference.yaml would produce identical empty process-fact traces and the scorer could not discriminate. Per project rule "admit you cannot in plain words, BEFORE": surfacing this gap rather than shipping a green acceptance test that asserts nothing.Prerequisites tracked in #1980's Status section.
Test plan
Result locally: 68 passed in 9.13s.
All green. Pre-push hooks (affected mypy + pytest, ruff, em-dash gate, review-origin gate, migration-framing gate, secret detection, etc.) ran on the push and passed.
Review coverage
/pre-pr-reviewran 9 agents in parallel against the spine. 22 findings consolidated in_audit/pre-pr-review/triage.md:SCORECARD_SCHEMA_VERSION-- both fixed.evals/in CLAUDE.md layout) -- all fixed.as exc, parens required) -- rejected with rationale.Net: 19 of 22 review findings applied in this PR, with rationale recorded in the commit body of
815ac88fand in the triage file.Related