Skip to content

feat(evals): #1980 spine -- scoring + data contract for golden-company benchmark#2025

Merged
Aureliolo merged 14 commits into
mainfrom
feat/1980-golden-benchmark
May 20, 2026
Merged

feat(evals): #1980 spine -- scoring + data contract for golden-company benchmark#2025
Aureliolo merged 14 commits into
mainfrom
feat/1980-golden-benchmark

Conversation

@Aureliolo
Copy link
Copy Markdown
Owner

Part of #1979. Lands the scoring + data layer of the golden-company benchmark (eval spine), as documented in the Status (2026-05-20) section of the linked issue.

This PR intentionally does not close #1980. The runner (cassette guard + telemetry collector + orchestrator + CLI + baselines + recorded cassette + make benchmark target) is held until a runtime prerequisite gap closes. See "Prerequisites for the runner PR" in #1980 for the full follow-up plan.

What landed

The complete scoring + data contract for the eval spine, 68 unit tests passing:

  • evals/models/{brief,scorecard}.py -- frozen Pydantic v2 models, schema-versioned (SCORECARD_SCHEMA_VERSION = 1).
  • evals/loader/{briefs,anchors}.py -- YAML loaders via synthorg.api.boundary.parse_typed; path-traversal containment on anchor-set loading.
  • evals/scoring/executable.py -- subprocess grader (hidden / build / lint, weighted 60/20/20); rejects shell metacharacters at load time as defence-in-depth.
  • evals/scoring/judged.py + spearman.py -- ordinal calibration with a Spearman rho gate (>= 0.7); refuses to score when the judge's ordering disagrees with the hand-scored anchor set.
  • evals/scoring/penalties.py + aggregate.py -- process-fact penalty table with cap + floor.
  • evals/emit/{json_writer,markdown_writer}.py -- deterministic JSON + Markdown emitters.
  • evals/errors.py -- DomainError-rooted hierarchy.
  • pyproject.toml -- evals/ wired for ruff (src), pyright (extraPaths), pytest (pythonpath), isort (known-first-party).
  • CLAUDE.md -- layout line now mentions the new evals/ top-level directory.

Pydantic models carry cross-field invariant validators so the scorecard is auditable from any single field: BriefResult enforces score == max(grade - deduction, GRADE_FLOOR); JudgeCalibrationReport enforces passed == (rho >= gate); PenaltyEntry enforces raw == points_per_event * count and applied <= raw; AggregatedProcessFacts enforces total_events == sum(events_by_class); ProcessFactReport enforces entries match the tracked events_by_class slice; CheckOutcome ties timed_out=True to the POSIX timeout sentinel; ExecutableGrade enforces pass-flags match per-label outcomes; ScriptedJudge requires non-empty responses or default scores; JudgedGrade refuses construction with a failing calibration; AnchorItem rejects hand_scores outside [0.0, 1.0]; HiddenCheckSpec rejects shell metacharacters in cmd[0].

What is NOT yet built (follow-up PR scope)

  • evals/runner/cassette_guard.py -- cassette SHA256 + REPLAY enforcement.
  • evals/runner/telemetry_collector.py -- structlog event bucketing.
  • evals/runner/orchestrator.py -- per-brief boot/submit/wait/collect loop.
  • evals/run.py + evals/__main__.py -- CLI entry point.
  • evals/baselines/{reference,broken}.yaml, evals/briefs/*.yaml, evals/cassettes/reference_run.cassette.json.
  • Makefile target + .gitignore.
  • The acceptance test test_runner_broken_scores_worse.py going green (currently uses pytest.importorskip("evals.run")).

Why the runner did NOT land in this PR

A runtime spike (create_app() + submit one TaskRequirement via the IntakeEngine) confirmed the studio's agent-execution path is not yet wired end-to-end. Intake transitions fire (intake.request.received -> triaging -> scoped), then TaskEngineNotRunningError raises. No BUDGET_* / EXECUTION_* / STAGNATION_* / APPROVAL_GATE_* events appear. Without those events, broken.yaml and reference.yaml would produce identical empty process-fact traces and the scorer could not discriminate. Per project rule "admit you cannot in plain words, BEFORE": surfacing this gap rather than shipping a green acceptance test that asserts nothing.

Prerequisites tracked in #1980's Status section.

Test plan

uv run python -m pytest tests/evals_spine/ -m unit

Result locally: 68 passed in 9.13s.

uv run ruff check src/ tests/ evals/
uv run ruff format --check src/ tests/ evals/
uv run mypy --num-workers=4 tests/evals_spine/ evals/

All green. Pre-push hooks (affected mypy + pytest, ruff, em-dash gate, review-origin gate, migration-framing gate, secret detection, etc.) ran on the push and passed.

Review coverage

/pre-pr-review ran 9 agents in parallel against the spine. 22 findings consolidated in _audit/pre-pr-review/triage.md:

  • 2 CRITICAL: path-traversal containment, duplicate SCORECARD_SCHEMA_VERSION -- both fixed.
  • 5 MAJOR valid (wrong exception type, three issue back-refs in docstrings, missing evals/ in CLAUDE.md layout) -- all fixed.
  • 1 MAJOR invalid (PEP 758 syntax: agent misread CLAUDE.md's "no parens unless binding"; we ARE binding as exc, parens required) -- rejected with rationale.
  • 11 MEDIUM valid (function-length refactor, shell-metachar guard, test-quality cleanups, 4 cross-field validators) -- all fixed.
  • 3 LOW deferred -- 2 implemented (extra validators on trusted-construction types, JudgeProtocol docstring polish); 1 deferred (Brief discriminated union -- the review agent itself flagged it as a pragmatic trade-off and the current post-validator catches the same XOR errors; restructuring would force a substantial test refactor for marginal benefit).

Net: 19 of 22 review findings applied in this PR, with rationale recorded in the commit body of 815ac88f and in the triage file.

Related

Aureliolo and others added 5 commits May 20, 2026 00:38
Closes #1962. Child of EPIC #1955.

## Summary

Wires `DirectIntake`/`AgentIntake` as a live, non-simulated work-entry
path. A real `ClientRequest` submitted via `POST /requests` and approved
via `POST /requests/{id}/approve` now flows through the work pipeline
spine so an agent actually executes it (proven under the simulation
harness with a scripted provider).

The pipeline spine (#1960) already maps `WorkItem -> ClientRequest ->
intake_engine.process()` in its intake phase; the gap was that
`approve_request` called `intake_engine.process()` directly and produced
a `CREATED` task no agent ever ran. This PR introduces a thin work-entry
adapter seam, hands approve at `POST /requests/{id}/approve` to a
background pipeline run, and reconciles the stored request to its
terminal status.

## What changed

- **New `WorkEntryAdapter` protocol + `IntakeEntryAdapter` +
`build_work_entry_adapter` factory** under
`src/synthorg/engine/pipeline/entry/`. Source-keyed dispatch
(`WorkSource.INTAKE` ships now; siblings #1963 / #1964 add their arms
here).
- **`POST /requests/{id}/approve` rewritten** to return **HTTP 202
Accepted** with the request walked to `APPROVED`; a background coroutine
(`process_intake_pipeline`) drives the adapter -> pipeline and
reconciles the stored request to `TASK_CREATED` (or `CANCELLED` on
intake rejection / pipeline failure). Terminal failures log ERROR +
cancel the request as a fallback so a request can never silently hang in
`APPROVED`. `AgentRuntimeNotConfiguredError` (409, taxonomy 4014) when
no adapter / pipeline is wired (empty company).
- **New `simulations.intake_default_project` setting**
(`read_only_post_init`, default `"client-intake"`); the project is
created at boot if absent. One resolved value drives both the intake
strategy's `project=` and the adapter's `WorkItem.project`, so the
pipeline's project-existence check and the created task agree.
- **New `WsEventType.REQUEST_TASK_CREATED`** + matching
`WsRequestTaskCreatedPayload`; the terminal hop now has a dedicated
event (the intermediate `request.approved` still fires for the sync
`APPROVED` walk).
- **`AppState.intake_entry_adapter` seam** (once-only set + has_ + swap,
mirrors `work_pipeline`); boot wiring in
`engine.pipeline.entry.boot.wire_real_intake_entry` is called from both
`_install_runtime_services` and the post-setup
`_rebuild_runtime_services`.
- **Manifest discipline (hard EPIC criterion)**:
`scripts/_ghost_wiring_manifest.txt` now has `ENFORCED` lines for
`DirectIntake`, `AgentIntake`, `IntakeEntryAdapter`,
`build_work_entry_adapter`. The `no-ghost-wiring` gate passes.
- **Docs**: `docs/design/client-simulation.md` documents the real
work-entry path + the new setting + the new factory row;
`docs/design/verification-quality.md` banner updated (intake engine
online); `docs/design/page-structure.md` URL routing map gains the
`/clients` / `/clients/requests` / `/clients/simulations` /
`/clients/reviews/:taskId` / `/clients/:clientId` routes.
- **Frontend**: regenerated `web/src/api/types/openapi.gen.ts` for the
202 status, added `'request.task_created'` to `websocket.ts`, updated
the MSW approve mock to return 202.

## Tests

- New unit tests:
`tests/unit/engine/pipeline/entry/{test_intake_adapter,test_factory,test_boot}.py`,
`tests/unit/api/controllers/test_requests_intake_pipeline.py` (success /
WorkIntakeRejected / WorkPipelineError / concurrent-terminal /
MemoryError re-raise), `tests/unit/api/test_state.py` adapter seam,
`tests/unit/client/test_runtime_builder.py` default-project threading +
`intake_default_project` env override,
`tests/unit/settings/test_intake_settings.py`.
- New integration tests in
`tests/integration/api/controllers/test_client_simulation.py`: approve
returns 202 + spawns the pipeline task; approve without an adapter
returns 409.
- New e2e `tests/e2e/test_real_intake_entry_e2e.py`: builds the
production runtime via `build_runtime_services` + `ScriptedDriver` +
`DirectIntake`, drives `process_intake_pipeline`, asserts request
reaches `TASK_CREATED` and the task is past `CREATED` (proof an agent
ran). A second case covers a scoped request with reviewer notes.

Validation run on this branch: 29468 unit tests pass (full suite), 3183
web unit tests pass; e2e + the new integration suite pass under the
simulation-harness substrate (zero real LLM spend).

## Pre-PR review

Reviewed by 10 core agents (python-reviewer, security-reviewer,
async-concurrency-reviewer, conventions-enforcer, test-quality-reviewer,
api-contract-drift, silent-failure-hunter, issue-resolution-verifier,
comment-quality-rot, docs-consistency). All valid findings implemented:

- Wrapped `process_intake_pipeline`'s terminal reconciliation in
`_safe_finalize` so a reconcile failure can never leave a request stuck
in `APPROVED`: ERROR with `request_id` + fallback `CANCELLED`, with
`release_request_lock_if_idle` in `finally`.
- Hardened `_ensure_project` to tolerate a `DuplicateRecordError` race
(post-setup reinit) and log ERROR with project id on any other create
failure.
- Explicit `except asyncio.CancelledError: raise` ahead of the broad
handler.
- `_current_if_approved` vanished-request now logs ERROR (invariant
break, not a routine warning).
- `_ensure_project(project_id: NotBlankStr)` and a comment on
`_spawn_intake_pipeline` documenting why a detached task (not
`TaskGroup`) is correct for the 202-return contract.
- Added the dedicated `request.task_created` WS event (+
`WsRequestTaskCreatedPayload`) so terminal-hop semantics are no longer
multiplexed onto `request.approved`.
- Doc fixes: `verification-quality.md` banner, `client-simulation.md`
settings enumeration + factory table, `page-structure.md` routing map.

Agent claims that the project-wide `except A, B:` (PEP 758 / Python
3.14) syntax was a "Python 3 syntax error" were rejected with
primary-source evidence (ruff, mypy, the full 29468-test unit suite, and
all 73 pre-push gates pass; the pattern is pre-existing in
`pipeline/service.py` and `simulations.py`).

---------

Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
## Summary

EPIC #1967 keystone: a human talks to ONE fixed top agent (Chief of
Staff) in natural language. The agent asks clarifying questions when the
request is underspecified, then parks one or more concrete `WorkItem`s
in the human approval queue. On approval each item runs through the work
pipeline; **no autonomous acting** (direct MCP acting is the separate
child #1972).

The pipeline spine (`build_work_pipeline`, #1960) and approval producer
(#1956) already landed and are ENFORCED in the ghost-wiring manifest.
This PR drives them from a conversation.

## What changed

### Service + API
- `src/synthorg/meta/chief_of_staff/propose.py` ::
`ChiefOfStaffProposer` (single-responsibility; keeps `chat.py`
explain-only). Structured clarify-or-propose LLM turn via
`core/json_parsing.py`. SEC-1: human content wrapped via
`wrap_untrusted(TAG_TASK_DATA, ...)`.
- `POST /meta/chat/propose` on `MetaController` with
`ConversationalProposeRequest` args model, `meta.chat.propose` (5/60s)
rate-limit, 503 dependency-matrix (proposer + work pipeline must be
wired).
- `src/synthorg/api/controllers/_approval_review_gate.py` ::
`try_conversational_intake_resume` Flow 0 before parked-context /
review-gate. On approve: reconstructs `WorkItem` from the proposal and
drives `app_state.work_pipeline.run`. On reject: marks proposal
REJECTED, pipeline untouched.
- `build_chief_of_staff_proposer` factory +
`_wire_chief_of_staff_proposer` on_startup hook + AppState set-once
accessors (`chief_of_staff_proposer`, `conversational_proposal_repo`).
- `ChiefOfStaffConfig` propose_* fields (opt-in, default off).

### Data model
- New `ApprovalSource.CONVERSATIONAL_INTAKE` discriminator + new
`ConversationRole` / `ConversationStatus` /
`ConversationalProposalStatus` enums.
- Frozen Pydantic models: `Conversation`, `ConversationTurn`,
`ProposedWork`, `ProposeDecision` (XOR validator),
`ConversationalProposal`, `ProposeArgs`, `ProposedApprovalSummary`,
`ProposeResult`.

### Persistence (dual-backend)
- New protocols: `ConversationRepository` (StatefulRepository),
`ConversationTurnRepository` (AppendOnlyRepository),
`ConversationalProposalRepository` (Stateful + FilteredQuery). Generic
categories per ADR-0001; no bespoke methods.
- Concrete sqlite + postgres impls,
`conversational_factory.build_conversational_repositories`
(backend-aware; degrades gracefully to None for absent / unknown
backends so app boot stays clean).
- One yoyo revision per backend
(`20260519000001_conversational_intake.sql`). Postgres widens
`approvals.source` CHECK via `DROP/ADD CONSTRAINT`. SQLite intentionally
keeps the narrow CHECK (rebuild via writable_schema was perf-too-costly
under the unit suite's `--count=2` isolation gate; the v1 ApprovalStore
is in-memory by default so `CONVERSATIONAL_INTAKE` rows never reach
SQLite. Documented inline.).
- `schema_drift_baseline.txt` grown by 4 canonical TEXT/TIMESTAMPTZ
entries for the new timestamp columns (user-approved, identical to every
other dual-backend table). Committed with `ALLOW_BASELINE_GROWTH=1`.

### Errors / events
- New `<Domain><Condition>Error` family in `meta/errors.py`:
`ConversationNotFoundError` (404), `ConversationClosedError` (409),
`ConversationalProposeUnavailableError` (503),
`ConversationalProposeResponseInvalidError` (502). New `ErrorCode`
entries 3017 / 4015 / 7010.
- New event constants in
`observability/events/{chief_of_staff,approval_gate,persistence}.py`
(clarify/propose/cap/transition + dispatcher branch + persistence repo
error/read events).

### Manifest discipline (hard acceptance criterion)
- `scripts/_ghost_wiring_manifest.txt`: new `ENFORCED
build_chief_of_staff_proposer #1968` line. Verified by
`scripts/check_no_ghost_wiring.py`.

### Web / generated types
- Regenerated
`web/src/api/types/{openapi,dtos,enum-values,error-codes}.gen.ts` for
the new endpoint and enums.
- Hand-written `postChatPropose()` client +
`ConversationalProposeResponse` interface in
`web/src/api/endpoints/meta.ts`; MSW handler in
`web/src/mocks/handlers/meta.ts` (1:1 parity rule).

### Docs
- `docs/design/self-improvement.md`: `/meta/chat/propose` endpoint,
`propose.py` + model trio in package inventory, three-flow
approval-decision routing (Flow 0 conversational-intake), YAML defaults
for `propose_*`.
- `README.md`: v1 scope clarified (clarify + propose; no autonomous
acting).
- `CLAUDE.md`: one-liner on proposer wiring + 503 conditions.

## Test plan

- **Unit (29 503 passed locally on `-m unit`)**: enums, models, errors,
settings, proposer (clarify / propose / untrusted-wrap / parse-failure /
clarification cap / project-resolution), wiring
(`build_chief_of_staff_proposer` 503 matrix; backend-aware factory
degradation), dispatch branch (approve-runs-pipeline,
reject-skips-pipeline, missing-proposal-owned-but-noop,
pipeline-unavailable-503, pipeline-failure-leaves-PENDING), API model
validation + route + rate-limit-policy registration.
- **Dual-backend conformance** (`tests/conformance/persistence/`,
`pytest.mark.integration`): `Conversation`, `ConversationTurn`,
`ConversationalProposal` against sqlite + postgres. Schema-drift gates
green on both backends.
- **Simulation-harness e2e**
(`tests/e2e/test_conversational_propose_e2e.py`, `pytest.mark.e2e`):
builds the REAL pipeline via `build_runtime_services`, REAL
`ChiefOfStaffProposer` with a scripted clarify-then-propose provider,
REAL `ApprovalStore`, REAL approval-decision seam
(`signal_resume_intent`). Asserts: vague request → clarify → answer →
PENDING approval in store → approve → `Task` created and advanced past
CREATED by a real agent. Reject path: pipeline never runs.
- All convention gates green (ghost-wiring, schema-drift, dual-backend
parity, persistence-boundary, dto-types-ts-in-sync,
error-codes-ts-in-sync, no-magic-numbers, no-review-origin,
frozen-extra-forbid, ...).

## Review coverage

11 specialist review agents ran pre-push (code, python, security,
persistence, conventions, logging, async-concurrency,
api-contract-drift, issue-resolution-verifier, docs-consistency,
comment-quality-rot). 18 valid findings (7 High, 10 Medium, 1 doc
polish) addressed in commit 80b34f9. 10 false positives dismissed with
reasons in `_audit/pre-pr-review/triage.md`. `issue-resolution-verifier`
returned PASS on every acceptance criterion and the parent epic's
manifest-discipline rule.

Closes #1968

---------

Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
Closes #1964

Part of EPIC #1955.

## Summary

Thin high-altitude work-entry adapter: a human-stated objective enters
the pipeline spine as a `WorkItem(source=OBJECTIVE)` and is decomposed
into projects/tasks by the existing `MultiAgentCoordinator`, then
executed solo or as a team (automatic; never a user choice). No new
decomposition logic — reuses the spine + coordinator end to end.

### Wire-up

- `src/synthorg/engine/pipeline/entry/objective_adapter.py` — new
  `ObjectiveSubmission` (frozen Pydantic, `extra="forbid"`) and
  `ObjectiveEntryAdapter` (thin: `__slots__ = ("_default_project",
  "_work_pipeline")`).
- `src/synthorg/engine/pipeline/entry/factory.py` — adds the
  `WorkSource.OBJECTIVE` arm. Coordinates with #1963 on this shared
  seam; the task-board adapter rebases when it lands.
- `src/synthorg/engine/pipeline/entry/protocol.py` — generalised to
  PEP 695 generic `class WorkEntryAdapter[T_Request](Protocol)` so the
  intake and objective adapters can each be typed against their own
  domain input.
- `src/synthorg/engine/pipeline/entry/boot.py` — new
  `wire_real_objective_entry()` (cold boot + hot-reload via
  `hot_swap=True`); `_ensure_project` lifted as a shared helper.
- `src/synthorg/api/state.py` — `objective_entry_adapter` slot +
  property + once-only `set_` + atomic `set_if_absent_` + `swap_`
  seams + `objective_background_tasks: set[Task]` for spawned
  pipeline runs.
- `src/synthorg/api/controllers/objectives.py` — new `POST /objectives`
  returning 202 + `submission_id` (server-minted, propagated as the
  `WorkItem.correlation_id`); pipeline runs as a background task so
  the HTTP boundary never blocks on team coordination. Guarded by
  `require_write_access` + per-op rate-limit `objectives.submit`.
- `src/synthorg/api/app.py` + `controllers/setup/agent_helpers.py` —
  the new wire call is appended right after the intake wire, both at
  cold boot and on provider-reinit.
- New OBJECTIVES settings namespace + `default_project` definition
  (`read_only_post_init=True`, `restart_required=True`).

### Manifest discipline (parent epic hard criterion)

`scripts/_ghost_wiring_manifest.txt` ships an `ENFORCED
ObjectiveEntryAdapter #1964 ...` line in this same PR;
`scripts/check_no_ghost_wiring.py` passes.

## Test plan

- **Unit** (all green; `tests/unit/engine/pipeline/entry/` +
  `tests/unit/api/controllers/test_objectives.py`):
  - `ObjectiveSubmission`: frozen, rejects unknown fields, blank
    title/description/requested_by, blank acceptance-criteria
    elements; submission_id defaults to uuid4.
  - `ObjectiveEntryAdapter`: maps submission to
    `WorkItem(source=OBJECTIVE)`; correlation_id == submission_id;
    default project enforced; optional fields fall through to
    `WorkItem` defaults; pipeline errors propagate unchanged.
  - Factory dispatch: OBJECTIVE arm returns `ObjectiveEntryAdapter`;
    unwired sources still raise `UnknownStrategyError`.
  - Boot hook: no-op without `work_pipeline`; creates the objectives
    project when absent; skips create when it exists; hot_swap uses
    the swap seam (and still ensures the project exists).
  - Controller: 202 + minted submission_id; background task tracked
    on `AppState`; controller never blocks on the pipeline run;
    done-callback removes completed tasks from the set; payload
    rejects unknown fields, blank fields, and unknown enum strings.
- **E2E** (`tests/e2e/test_real_objective_entry_e2e.py`, run under
  `-m e2e`): builds the real runtime via the production
  `build_runtime_services` with a deterministic `ScriptedDriver`.
  Two cases:
  1. Default `leaf-threshold` policy — proves the path is wired and
     the spawned task advances past `CREATED`.
  2. Forced `always-team` policy — proves the objective decomposes:
     asserts `verdict=SPLITTABLE`, `execution_path=TEAM`, and the
     root task advances past `CREATED`.

## Review coverage

Reviewed via `/pre-pr-review` with the core agent roster only
(thin-adapter scope): python-reviewer, security-reviewer,
test-quality-reviewer, conventions-enforcer, issue-resolution-verifier.

| Agent | Result |
| --- | --- |
| python-reviewer | APPROVE, zero findings |
| security-reviewer | APPROVE, zero findings (SEC-1 wrapping, auth,
rate-limit, background-task lifecycle, settings injection verified) |
| conventions-enforcer | Zero violations |
| issue-resolution-verifier | PASS, all 7 acceptance criteria at
confidence 100 |
| test-quality-reviewer | 0 critical; 5 minor refinements applied
(defensive double-yield on done-callback test, invalid-enum-rejection
coverage, blank-acceptance-criteria coverage, `_result` helper
docstring, hot_swap project-create assertion) |

All ruff / mypy strict / pre-commit / pre-push gates green; 9485
affected-area unit tests pass locally.

## Out of scope

- First-class `Objective` persistence (no model, repository, or
  migration). Submission is ephemeral; state lives on the spawned
  root task.
- `/objectives/{id}` GET / list / cancel surfaces.
- Frontend integration.
- Coordinator / decomposition-strategy / routing-policy changes
  (reused as-is).
- Add an `idempotency_key=` query filter to `/tasks` so HTTP-only
  clients can resolve `submission_id` → spawned task. Tracked as a
  follow-up; the simulation harness drives the adapter directly so
  this is not blocking acceptance for #1964.

---------

Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
Closes #1963. Child of EPIC #1955.

## Summary

Wires `TaskBoardEntryAdapter` as the live work-entry path for the task
board UI. A human-filed task via `POST /tasks` now flows through the
work pipeline spine so an agent actually executes it, proven under the
simulation harness with a `ScriptedDriver`.

The pipeline spine (#1960) already maps `WorkItem -> ClientRequest ->
intake_engine.process()` in its intake phase; this PR adds the
TASK_BOARD arm of `build_work_entry_adapter`, rewrites `POST /tasks` to
hand the filing to a detached background pipeline run, and ships the
matching ghost-wiring manifest line.

## Scope constraint (locked in plan)

The spine's `_intake` always creates a task internally via
`intake_engine.process()`; there is no spine entry that adopts a
pre-existing task. Therefore the board adapter cannot pre-create a
task and hand it off (would yield duplicates), and "move triggers the
spine" requires new pipeline logic (forbidden by the issue's
"thin adapter, no new pipeline logic" scope). The board adapter fires
on `file` only; column moves remain pure `TaskEngine` status walks of
the spine-created task. Documented in
`docs/design/page-structure.md` and the task-board adapter docstring.

## What changed

- **New `TaskBoardEntryAdapter` + `TaskBoardFiling`** under
  `src/synthorg/engine/pipeline/entry/task_board_adapter.py`. Frozen
  Pydantic input with `extra="forbid"` and an auto-generated UUID4
  correlation id; the adapter maps a filing onto a
  `WorkItem(source=TASK_BOARD)` and drives `WorkPipeline.run`.
- **`build_work_entry_adapter` extended** with the TASK_BOARD arm +
  `@overload` so boot helpers get the concrete return type per source
  literal. `default_project` stays required on the signature (intake
  consumes it; TASK_BOARD ignores it).
- **`POST /tasks` rewritten** to return HTTP 202 +
  `TaskBoardSubmissionResponse` (`correlation_id`, `title`, `project`,
  `status: "submitted"`). A detached background coroutine
  (`process_task_board_pipeline`) drives the adapter; pipeline
  failures are logged at WARNING (intake rejection) or ERROR (any
  other), `MemoryError` / `RecursionError` re-raise.
  `AgentRuntimeNotConfiguredError` (409, code 4014) when no adapter is
  wired (empty company / no provider).
- **`AppState.task_board_entry_adapter` seam** mirrors
  `intake_entry_adapter`: ctor kwarg, slot, property, `has_*`,
  `set_*` / `set_*_if_absent` / `swap_*`. Boot wiring in
  `engine.pipeline.entry.boot.wire_real_task_board_entry` is called
  from both `_install_runtime_services` and the post-setup
  `_rebuild_runtime_services` (skip project bootstrap; board filings
  carry their own project).
- **Three new event constants**: `API_TASK_BOARD_SUBMITTED`,
`API_TASK_BOARD_REJECTED_NO_ADAPTER`, `API_TASK_BOARD_PIPELINE_FAILED`.
- **Manifest discipline** (hard EPIC criterion):
  `scripts/_ghost_wiring_manifest.txt` has an `ENFORCED` line for
  `TaskBoardEntryAdapter`. The `no-ghost-wiring` gate passes.
- **Docs**: `docs/design/client-simulation.md` documents the new
  TASK_BOARD work-entry path next to the existing INTAKE one;
  `docs/design/page-structure.md` notes the new 202 contract on
  `POST /tasks` and explicitly documents the move-is-display-only
  semantics.
- **Frontend**: regenerated `web/src/api/types/openapi.gen.ts` for the
  202 + new schema, updated `tasks.ts` endpoint to return
  `TaskBoardSubmissionResponse`, Zustand store's `createTask` returns
  the submission envelope (no optimistic insert; the existing
  `task.created` WS handler inserts the real task once intake creates
  it), MSW handler returns 202, dialog + stories typed to the new
  shape.

## Tests

- New unit tests:
`tests/unit/engine/pipeline/entry/test_task_board_adapter.py`
  (filing validation; mapping; propagation of `WorkIntakeRejectedError`,
  `MemoryError`, `RecursionError`, generic `Exception`).
- Extended
`tests/unit/engine/pipeline/entry/{test_factory,test_boot}.py`:
  TASK_BOARD arm via parametrize; boot helper no-ops when pipeline /
  simulation runtime absent; hot-swap; no project bootstrap.
- Rewritten `tests/unit/api/controllers/test_tasks.py`: 202 +
  submission envelope, UUID4 correlation id, 409 when no adapter

(`test_create_task_raises_agent_runtime_not_configured_without_adapter`).
- New `tests/unit/api/test_state.py` cases covering the new seam.
- Reworked `tests/unit/api/controllers/test_coordination.py` to
  pre-create tasks via direct persistence insertion instead of
  `POST /tasks` (cross-loop crash with the now-mocked pipeline).
- `tests/integration/api/test_task_create_empty_company.py` updated:
  no provider -> 409 (unchanged); provider present -> 202 + envelope.
- New e2e `tests/e2e/test_task_board_entry_e2e.py`: production runtime
  via `build_runtime_services` + `ScriptedDriver`. Positive path proves
  the task reaches a post-execution status; an unknown-project path
  proves the background coroutine swallows the pipeline failure; a
  MemoryError stand-in proves the re-raise contract holds end-to-end.
- New web tests: `TaskCreateDialog.test.tsx` updated for the
  submission envelope; store test asserts no optimistic insert (WS
  handler owns the inserted task) and "submitted" toast wording.

Validation on this branch: 29 510 unit tests pass; 3 new e2e tests
pass; 2 integration tests pass; targeted web suite (`tasks.test.ts`,
`TaskCreateDialog.test.tsx`, `TaskBoardPage.test.tsx`,
`useTaskBoardData.test.ts`) green; ruff, ruff-format, mypy strict, all
pre-push convention gates green.

## Pre-PR review

Reviewed by 5 core agents (python-reviewer, security-reviewer,
test-quality-reviewer, conventions-enforcer, issue-resolution-verifier).
6 findings, all implemented in the test-tightening commit:

- Tightened `TaskBoardSubmissionResponse.status` to
`Literal["submitted"]`.
- Added adapter-level error-propagation tests (`MemoryError`,
  `RecursionError`, generic `Exception`).
- Tightened the 202 happy-path test to validate UUID4 correlation id.
- Added two negative-path e2e tests (unknown project; MemoryError
  re-raise).
- Parametrized the factory's positive test for the two wired sources.
- Renamed the 409 test for semantic clarity.

Verdicts after fixes: zero CRITICAL / HIGH from any agent;
issue-resolution-verifier PASS with both acceptance criteria RESOLVED;
conventions-enforcer zero violations; security-reviewer confirmed
SEC-1 prompt safety is handled downstream by `policy/llm_judged.py`
and the adapter does not need to wrap.

Co-authored-by: SynthOrg &lt;synthorg-bot@synthorg.local&gt;

---------

Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
… + per-project push queue (#2021)

Closes #1974.

Foundational substrate for EPIC #1973 (autonomous product studio). Each
project gets a 1:1 persistent git-backed working tree that survives
across agents, tasks, and sessions; two agents merging concurrently are
serialised through a per-project push queue; the git backend is
config-only swappable across `embedded` / `local_path` /
`external_remote` (with GitHub / GitLab / Gitea / Forgejo connection
identities).

## Locked acceptance criteria

1. **1:1 project to persistent git-backed workspace surviving across
sessions.** `ProjectWorkspaceService.get_or_provision(project_id)`
materialises `<base>/projects/<project_id>/` once; the
`ProjectWorkspace` row plus the on-disk working tree are the durable
contract. A backend-kind change in `GitBackendConfig.kind` re-provisions
under the new backend (keeping the workspace path) with a
`WORKSPACE_BACKEND_KIND_CHANGED` warning event. Conformance-tested under
SQLite + Postgres.
2. **Two agents merge concurrently without branch collision via a
per-project serial push queue.**
`WorkspaceIsolationService.merge_workspace_with_push` routes through a
lazily-built `PushQueueCoordinator` per project. FIFO; sentinel-based
stop; `WorkspacePushError` distinguishes a forge-rejection from a local
`WorkspaceMergeError` (conflicted merge does not push). The queue sits
in front of the existing `WorkspaceIsolationStrategy` so a future
virtual-branch strategy slots in without architectural change.
3. **Backend is config-only swappable.** `GitBackendType` is `embedded`
/ `external_remote` / `local_path`; `build_git_backend(config, deps)`
resolves a `GitBackend` via `StrategyRegistry`. `external_remote`
resolves a saved connection from the catalogue (GitHub, GitLab, Gitea,
Forgejo) and injects the token as a URL-userinfo. Deep OAuth refresh +
rate-limit hardening is tracked in the follow-up issue.
4. **Validated under the simulation harness.** Acceptance tests in
`tests/integration/engine/workspace/test_project_workspace_acceptance.py`
cover persistence across "session restarts", concurrent
merge-without-collision, config-only backend swap, and cross-project
sandbox isolation.

## What ships

- `ProjectWorkspace` model + `ProjectWorkspaceRepository` (sqlite +
postgres) + paired `20260519000001_project_workspaces` migrations +
`schema.sql` parity. FK to `projects(id) ON DELETE CASCADE`.
- `engine/workspace/git_backend/`: `GitBackend` protocol,
`GitBackendConfig` / `GitBackendDeps`, `EmbeddedGitBackend` (default) /
`LocalPathGitBackend` / `ExternalRemoteGitBackend`,
`StrategyRegistry`-driven factory with fail-fast missing-deps
validation.
- `engine/workspace/project_workspace_service.py` with per-project async
lock, config-authoritative backend-kind precedence, path-traversal
defence.
- `engine/workspace/push_queue.py` (`PushQueueCoordinator`) integrated
into `WorkspaceIsolationService` (per-project dict, guard lock,
double-checked locking).
- Connection-catalogue expansion: `GITLAB`, `GITEA`, `FORGEJO`
connection types alongside `GITHUB`. Gitea + Forgejo share a
`_GiteaFamilyAuthenticator` base today but stay distinct connection
identities so a future API divergence does not invalidate saved
connections.
- Boot wiring in `_install_runtime_services` (gated by
`app_state.has_persistence`); `AgentEngineExecutionService.execute_once`
now `get_or_provision`s the workspace per project. New
`AppState.project_workspace_service` once-only seam.
- Ghost-wiring manifest: 6 `ENFORCED` entries for the new runtime
symbols (`ProjectWorkspaceService`, `build_git_backend`,
`EmbeddedGitBackend`, `LocalPathGitBackend`, `ExternalRemoteGitBackend`,
`PushQueueCoordinator`).
- Error taxonomy: `ProjectWorkspaceNotProvisionedError` (4015,
CONFLICT), `GitBackend*Error` family, `WorkspacePushError`.
- Docs: `docs/design/coordination.md` (persistent workspace + push-queue
section), `docs/design/integrations.md` (connection types),
`docs/reference/errors.md` (4014 / 4015),
`docs/reference/pluggable-subsystems.md` (git backend strategy).

## Pre-PR review (this push)

Ran the core agent set on the substrate; applied every valid finding
before opening the PR. Highlights:

- **SEC-1 redaction**: `_redact_args` strips embedded URL credentials at
every `run_git_subprocess` log site (no token-in-URL ever lands in
structured logs).
- **`GIT_*` env sanitisation in `run_git_subprocess`**: when this code
runs from inside a git pre-push hook (or any caller whose own cwd is a
git working tree), the hook process exports `GIT_DIR` / `GIT_WORK_TREE`
and any child git subprocess inherits them, making `is_git_repo()`
report True for a freshly-mkdir'd tmp dir because git resolves against
the parent repo. The fix strips `GIT_*` env vars in `run_git_subprocess`
so backend strategies resolve purely via the per-call `cwd`. (Without
this, the embedded backend's `git commit --allow-empty` walked up to the
host's `.git` during the pre-push run; the rebased branch you're looking
at has those stray commits removed and the gate now reproduces clean.)
- **Defensive nested-worktree guard** on `EmbeddedGitBackend` and
`LocalPathGitBackend` (compares resolved `--show-toplevel` against the
requested path; only refuses when toplevel is a strict parent).
- Path-traversal defence in `ProjectWorkspaceService._workspace_path`
rejects `/`, `\\`, `..`.
- Double-checked locking TOCTOU closed in
`ProjectWorkspaceService._lock_for` and
`WorkspaceIsolationService._get_or_create_queue` (dict reads moved
inside the guard lock).
- Push-queue logging surfaces `WORKSPACE_PUSH_QUEUE_ENQUEUED` /
`_MERGED` / `_FAILED` / `_WORKER_FAILED`.

## Out of scope / follow-up

Filed as #2020 (referenced, not silently deferred):

- External-remote OAuth refresh + rate-limit hardening (protocol + thin
glue ship now; backend-switch-is-config-only already holds).
- Per-execution Docker per-project mount migration into the sandbox
lifecycle owner.
- `dispatch` callers of `merge_workspace` migrated onto
`merge_workspace_with_push` (currently only the direct execution path
uses the queue).
- `PlannerWorktreeStrategy` per-project `repo_root` resolution at
execute-time.
- `tests/unit/api/conftest.py::test_client` baseline reset for the known
`project_shared_app_once_only_install_leak`.

## Verification

- `uv run python -m pytest tests/ -m unit` clean.
- Targeted dual-backend conformance:
`tests/conformance/persistence/test_project_workspace_repository.py` (18
tests, SQLite + Postgres).
- Acceptance tests:
`tests/integration/engine/workspace/test_project_workspace_acceptance.py`.
- Pre-push gates: ruff, mypy strict, schema-drift (both backends),
no-ghost-wiring, no-magic-numbers, isolation `--count=2` gate.

---------

Co-authored-by: SynthOrg <synthorg-bot@synthorg.local>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the core scoring and data infrastructure for the 'golden-company' benchmark (eval spine). It provides the foundational models, loaders, and grading logic required to evaluate synthetic organization performance, while intentionally deferring the runner implementation until runtime prerequisites are met. The changes include strict schema validation, robust error handling, and comprehensive unit testing to ensure the benchmark's reliability and auditability.

Highlights

  • Data Model and Contract: Introduced frozen Pydantic v2 models for the eval spine, including schema-versioned Scorecard and Brief definitions to ensure deterministic data contracts.
  • Scoring and Calibration: Implemented a subprocess-based grader for executable briefs and a calibrated LLM judge for judged briefs, featuring a Spearman rho gate (>= 0.7) to ensure judge reliability.
  • Penalty System: Added a process-fact penalty table with caps and floors to account for budget, execution, and governance events during benchmark runs.
  • Emitters and Loaders: Added deterministic JSON and Markdown emitters for scorecards, along with robust YAML loaders that include path-traversal containment for anchor sets.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

Review Change Stack

Warning

Rate limit exceeded

@Aureliolo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 59 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: b64c3097-b7fb-4ae2-8214-384842b0aeb4

📥 Commits

Reviewing files that changed from the base of the PR and between e57d7a0 and 2c25a0a.

📒 Files selected for processing (2)
  • evals/models/scorecard.py
  • tests/evals_spine/test_scorecard_emit.py

Walkthrough

Adds an eval spine implementing frozen Pydantic briefs and scorecard models, YAML loaders (briefs and anchors) with path-safety and validation, executable and judged graders (Spearman-based calibration), penalty table and aggregation logic, JSON and Markdown scorecard emitters, an EvalError hierarchy, observability event constants, pyproject tooling/test discovery updates, and a comprehensive suite of unit and integration tests.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: landing the scoring and data contract (spine) for the golden-company benchmark eval system.
Description check ✅ Passed The description comprehensively explains what landed (scoring + data layer with 68 passing tests), what was intentionally deferred (runner components), and why (runtime prerequisites gap).
Linked Issues check ✅ Passed The PR successfully implements the primary coding objectives from #1980: frozen Pydantic models, YAML loaders, executable grading, judged calibration with Spearman gate, penalties + aggregation, JSON/Markdown emitters, and error hierarchy.
Out of Scope Changes check ✅ Passed All changes are scoped to the linked issue #1980's first-PR objectives. The runner components (cassette_guard, orchestrator, CLI, baselines, cassettes) are intentionally excluded per the issue status.
Docstring Coverage ✅ Passed Docstring coverage is 52.78% which is sufficient. The required threshold is 40.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 16:02 — with GitHub Actions Inactive
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 20, 2026

Merging this PR will not alter performance

✅ 33 untouched benchmarks
⏩ 21 skipped benchmarks1


Comparing feat/1980-golden-benchmark (2c25a0a) with main (f60e9ed)

Open in CodSpeed

Footnotes

  1. 21 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/loader/briefs.py`:
- Around line 43-47: The globbed entries in files are currently trusted by
suffix only, which allows symlinks or non-file entries to escape briefs_dir or
cause I/O errors; update the logic around files / the for path in files loop to
skip any path that is not a regular file and to ensure path.resolve() is
contained within briefs_dir.resolve() (e.g., check path.is_file() and that
path.resolve() is under briefs_dir.resolve() via relative_to in a try/except)
before calling path.read_text() and yaml.safe_load; apply these checks where
files is built or at the start of the loop that constructs briefs (referencing
variables briefs_dir, files, path, yaml.safe_load, and Brief).

In `@evals/models/brief.py`:
- Line 111: Replace the inline numeric literal in the model schema by
introducing a named constant in the settings.definitions module (e.g.,
HIDDEN_CHECK_DEFAULT_TIMEOUT: int = 30) and use that constant in the Field
default for timeout_seconds in the Brief model (timeout_seconds: int =
Field(default=HIDDEN_CHECK_DEFAULT_TIMEOUT, gt=0)); add an explicit module-level
type annotation for the new constant in settings.definitions and update any
imports so evals.models.brief references the constant instead of the literal.

In `@evals/models/scorecard.py`:
- Around line 123-153: BriefResult currently allows inconsistent states between
kind and judge_calibration; add an after model_validator (e.g. name it
_validate_kind_and_calibration) on the BriefResult model that checks
BriefResult.kind (use the BriefKind enum) and enforces that when kind ==
BriefKind.JUDGED then judge_calibration is not None, and when kind !=
BriefKind.JUDGED then judge_calibration is None; on violation raise a ValueError
with a clear message referencing brief_id, kind and judge_calibration to aid
debugging.
- Around line 164-166: AggregatedProcessFacts currently allows negative counts
in events_by_class; add a Pydantic validator on the AggregatedProcessFacts model
(e.g., `@validator`("events_by_class") or a `@root_validator`) that iterates the
dict values and raises a ValueError if any count is negative, and ensure
total_events remains Field(..., ge=0) as a non-negative guard; reference the
AggregatedProcessFacts.events_by_class field and the model's validation methods
when implementing.

In `@evals/scoring/aggregate.py`:
- Line 32: The field event_constant in the Pydantic model should be typed as
NotBlankStr rather than plain str; update the annotation for event_constant to
NotBlankStr, add the appropriate import for NotBlankStr from the project
validation utilities (or pydantic helpers) at the top of
evals/scoring/aggregate.py, and run typechecks to ensure the model now enforces
non-blank identifier constraints.
- Around line 122-128: In the loop in evals/scoring/aggregate.py that iterates
over events_by_class.items(), validate the `count` before using it: if `count <
0` raise a ValueError with a clear message identifying the offending
`event_constant` and `count`; then proceed to call `penalty_table.is_tracked`,
`penalty_table.points_for(event_constant)`, compute `raw = per_event * count`,
clamp with `penalty_table.cap_per_class`, and add to `total_deduction` as
before. Ensure the check occurs before any arithmetic so negative counts cannot
produce negative deductions.
- Around line 72-75: Replace the `@property` on the is_clean method with
Pydantic's `@computed_field`: import computed_field from pydantic and annotate
is_clean with `@computed_field`, keeping the implementation returning
self.deduction == 0 so the derived value complies with the frozen Pydantic model
contract; ensure the decorator is applied to the is_clean method in the same
class where deduction is defined.

In `@evals/scoring/executable.py`:
- Around line 151-157: Replace the inline event name string literals used in the
logger.warning calls with centralized constants: define EVAL_EXECUTABLE_TIMEOUT
and EVAL_EXECUTABLE_TOOL_MISSING in the observability.events.evals module (or
the existing events module) with values "evals.executable.timeout" and
"evals.executable.tool_missing" respectively, then import those constants at the
top of this module and use them in the logger.warning calls (the ones that
currently pass "evals.executable.timeout" and "evals.executable.tool_missing");
update both occurrences where logger.warning is invoked (the calls that pass
cmd=spec.cmd, timeout=spec.timeout_seconds, error_type=type(exc).__name__,
error=safe_error_description(exc) and the similar tool-missing call) to
reference the imported constants instead of string literals.
- Around line 102-105: Replace the `@property` on the is_clean method with
Pydantic's `@computed_field` decorator so the derived value is serialized; import
computed_field from pydantic if missing and keep the same signature (def
is_clean(self) -> bool) and return expression (self.hidden_pass and
self.build_pass and self.lint_pass) so the field is treated as a computed model
field rather than a plain Python property.
- Line 67: Change the Pydantic field annotation for the identifier "label" from
str to NotBlankStr (so it enforces non-empty identifiers like LABEL_HIDDEN,
LABEL_BUILD, LABEL_LINT) and add an import for NotBlankStr from the project's
shared types module (where other models use NotBlankStr); update the model
definition that contains the "label" field in this file to use NotBlankStr
instead of str.
- Around line 131-143: The completed outcome handler currently hardcodes
duration_seconds=0.0 in _outcome_from_completed while timeouts set
duration_seconds=float(spec.timeout_seconds); fix by capturing real elapsed time
around the subprocess invocation and threading that value into the CheckOutcome:
modify the code that calls subprocess.run to record start/end (e.g.,
time.perf_counter()) and compute elapsed_seconds, change _outcome_from_completed
to accept a duration_seconds: float parameter (or overload it) and use that
value when constructing CheckOutcome, and ensure the timeout branch still sets
duration_seconds=float(spec.timeout_seconds) so all outcomes have meaningful
durations.

In `@tests/evals_spine/test_judged_calibration.py`:
- Line 37: Change the predicate parameter's type from object to a typed Callable
to reflect its true signature: add "from typing import Callable" at the top of
the test file, update the function signature(s) that accept the predicate (the
pytest.parametrize entry ("xs", "ys", "predicate") and the test function(s)
using that parameter) to use predicate: Callable[[float], bool], and remove the
now-unnecessary "# type: ignore[operator]" comments; apply the same change to
the other occurrence around lines 80-83 where predicate is used.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: eed15755-dcc6-4950-9350-66b34d428804

📥 Commits

Reviewing files that changed from the base of the PR and between 2049885 and 815ac88.

📒 Files selected for processing (27)
  • CLAUDE.md
  • evals/__init__.py
  • evals/emit/__init__.py
  • evals/emit/json_writer.py
  • evals/emit/markdown_writer.py
  • evals/errors.py
  • evals/loader/__init__.py
  • evals/loader/anchors.py
  • evals/loader/briefs.py
  • evals/models/__init__.py
  • evals/models/brief.py
  • evals/models/scorecard.py
  • evals/scoring/__init__.py
  • evals/scoring/aggregate.py
  • evals/scoring/executable.py
  • evals/scoring/judged.py
  • evals/scoring/penalties.py
  • evals/scoring/spearman.py
  • pyproject.toml
  • tests/evals_spine/__init__.py
  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_brief_loader.py
  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_scorecard_emit.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Build Backend
  • GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
  • GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
  • GitHub Check: Dashboard Test
  • GitHub Check: Test Integration
  • GitHub Check: Test E2E
  • GitHub Check: Test Unit
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Read docs/design/ page before implementing; deviations need approval. See DESIGN_SPEC.md.

Present every plan for accept/deny before coding.

Ship enforcement gate with every convention PR.

Configuration Precedence: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env at the boot site. No os.environ.get outside startup.

Numerics live in settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal). Enforced by scripts/check_no_magic_numbers.py.

No from __future__ import annotations (3.14 has PEP 649). PEP 758 except: except A, B: must use parens unless binding.

Type hints on public functions; mypy strict. Google-style docstrings. Line length 88; functions <50 lines; files <800 lines.

Errors: <Domain><Condition>Error from DomainError; never inherit Exception/RuntimeError/etc directly. Enforced by check_domain_error_hierarchy.py.

Pydantic v2 frozen + extra="forbid" on every frozen model (gate check_frozen_model_extra_forbid.py; @computed_field auto-exempt, per-line # lint-allow: frozen-extra-forbid -- <reason> for extra="allow"/"ignore"); use @computed_field for derived fields.

Args models at every system boundary; use parse_typed() for every external dict ingestion. Enforced by check_boundary_typed.py.

Immutability: use model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries.

Async: use asyncio.TaskGroup for fan-out/fan-in; helpers catch Exception (re-raise MemoryError/RecursionError).

Clock seam: clock: Clock | None = None; tests inject FakeClock. Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable.

Untrusted content (SEC-1): use wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML.

Logging: import `from sy...

Files:

  • evals/__init__.py
  • evals/scoring/__init__.py
  • evals/emit/json_writer.py
  • evals/loader/__init__.py
  • evals/errors.py
  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_brief_loader.py
  • evals/loader/briefs.py
  • evals/models/scorecard.py
  • evals/scoring/spearman.py
  • tests/evals_spine/test_executable_grader.py
  • evals/emit/markdown_writer.py
  • evals/scoring/aggregate.py
  • evals/models/__init__.py
  • evals/emit/__init__.py
  • evals/loader/anchors.py
  • evals/scoring/judged.py
  • evals/scoring/penalties.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_judged_calibration.py
  • evals/models/brief.py
  • tests/evals_spine/test_scorecard_emit.py
  • evals/scoring/executable.py
**/*.{md,rst}

📄 CodeRabbit inference engine (CLAUDE.md)

Numerics in README and public docs sourced from data/runtime_stats.yaml via <!--RS:NAME--> markers.

Files:

  • CLAUDE.md
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use markers: @pytest.mark.{unit,integration,e2e,slow}. Async tests use auto. Global timeout 30s. Coverage minimum 80%.

xdist: use -n 8 --dist=loadfile (prevents 3.14+ Windows ProactorEventLoop leak). Windows: use WindowsSelectorEventLoopPolicy for unit tests; subprocess tests override back.

Test doubles: FakeClock for Clock seam, mock_of[T](**overrides) for typed-boundary substitutions, SimpleNamespace for attribute-bags. Bare MagicMock at typed boundary is blocked by check_mock_spec.py.

FakeClock and mock_of import from tests._shared; inject via clock= and the helper's spec subscript.

Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...)). Never skip/xfail flaky tests; fix fundamentally.

Files:

  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_brief_loader.py
  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_scorecard_emit.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_brief_loader.py
  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_scorecard_emit.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: No region/currency/locale privileged; use metric units; British English.
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Timeout/slow test failures are source-code regressions; never edit `tests/baselines/unit_timing.json` or scripts baseline files. Bypass with `ALLOW_BASELINE_GROWTH=1 git commit ...` (requires user approval).
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: After issue: branch + commit + push (no auto-PR); use `/pre-pr-review`. After PR: `/aurelio-review-pr` for external feedback. Fix EVERYTHING valid; no deferring.
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced.
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`).
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Branches: `<type>/<slug>` from main.
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: Squash merge. PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body.
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: After every squash merge → run `/post-merge-cleanup`.
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API.
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T16:02:13.510Z
Learning: API startup lifecycle: two phases - construction (synchronous service wiring) and on_startup (connected persistence wiring). Construction ordering: `agent_registry` BEFORE `auto_wire_meetings`; `tunnel_provider` unconditional. On-startup ordering: `SettingsService` BEFORE `WorkflowExecutionObserver`; `OntologyService` AFTER `persistence.connect()`.
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

  • evals/__init__.py
  • evals/scoring/__init__.py
  • evals/emit/json_writer.py
  • evals/loader/__init__.py
  • evals/errors.py
  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_brief_loader.py
  • evals/loader/briefs.py
  • evals/models/scorecard.py
  • evals/scoring/spearman.py
  • tests/evals_spine/test_executable_grader.py
  • evals/emit/markdown_writer.py
  • evals/scoring/aggregate.py
  • evals/models/__init__.py
  • evals/emit/__init__.py
  • evals/loader/anchors.py
  • evals/scoring/judged.py
  • evals/scoring/penalties.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_judged_calibration.py
  • evals/models/brief.py
  • tests/evals_spine/test_scorecard_emit.py
  • evals/scoring/executable.py
📚 Learning: 2026-05-16T18:36:31.446Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).

Applied to files:

  • CLAUDE.md
🔇 Additional comments (21)
tests/evals_spine/test_scorecard_emit.py (1)

1-255: LGTM!

tests/evals_spine/test_penalties.py (1)

1-182: LGTM!

evals/scoring/spearman.py (1)

1-97: LGTM!

evals/scoring/judged.py (1)

1-298: LGTM!

tests/evals_spine/conftest.py (1)

1-59: LGTM!

tests/evals_spine/test_brief_loader.py (1)

1-144: LGTM!

tests/evals_spine/test_runner_broken_scores_worse.py (1)

1-87: LGTM!

evals/scoring/__init__.py (1)

1-13: LGTM!

evals/emit/__init__.py (1)

1-1: LGTM!

evals/emit/json_writer.py (1)

1-34: LGTM!

evals/emit/markdown_writer.py (1)

1-141: LGTM!

evals/__init__.py (1)

1-14: LGTM!

CLAUDE.md (1)

3-3: LGTM!

pyproject.toml (1)

190-190: LGTM!

Also applies to: 283-283, 363-363, 378-378

evals/loader/__init__.py (1)

1-7: LGTM!

evals/loader/anchors.py (1)

39-70: LGTM!

Also applies to: 72-123, 125-132

evals/loader/briefs.py (1)

56-70: LGTM!

evals/errors.py (1)

22-116: LGTM!

evals/models/__init__.py (1)

1-6: LGTM!

tests/evals_spine/test_executable_grader.py (1)

1-231: LGTM!

evals/scoring/executable.py (1)

172-179: Shell metacharacter validation in HiddenCheckSpec is correctly implemented.

The validation exists as a model validator at line 114 of evals/models/brief.py (_leading_token_has_no_shell_metachars), checking the leading token against the _SHELL_METACHARS constant defined at line 27. The docstring (lines 97–105) explicitly documents the defense-in-depth approach: validation at load time catches typos in brief YAML before runtime, even though shell=False mitigates the risk. The noqa comment on line 172 is justified.

Comment thread evals/loader/briefs.py
Comment thread evals/models/brief.py Outdated
Comment thread evals/models/scorecard.py
Comment thread evals/models/scorecard.py
Comment thread evals/scoring/aggregate.py Outdated
Comment on lines +102 to +105
@property
def is_clean(self) -> bool:
"""Whether every declared check class passed."""
return self.hidden_pass and self.build_pass and self.lint_pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use @computed_field instead of @property for derived fields.

The is_clean property is a derived field and should use Pydantic's @computed_field decorator rather than @property. This ensures the field is serialized with the model and aligns with the project's convention for computed fields. As per coding guidelines, derived fields should use @computed_field.

♻️ Proposed fix
-from pydantic import BaseModel, ConfigDict, Field, model_validator
+from pydantic import BaseModel, ConfigDict, Field, computed_field, model_validator

 class ExecutableGrade(BaseModel):
     """Aggregate grade across hidden + build + lint check classes.
 
     Invariant: each ``*_pass`` boolean reflects whether every outcome
     in that label-bucket exited cleanly (``exit_code == 0`` and not
     ``timed_out``). The validator below enforces it.
     """
 
     model_config = ConfigDict(frozen=True, extra="forbid", allow_inf_nan=False)
 
     score: int = Field(ge=0, le=EXEC_TOTAL)
     hidden_pass: bool
     build_pass: bool
     lint_pass: bool
     outcomes: tuple[CheckOutcome, ...]
 
-    `@property`
+    `@computed_field`  # type: ignore[misc]
+    `@property`
     def is_clean(self) -> bool:
         """Whether every declared check class passed."""
         return self.hidden_pass and self.build_pass and self.lint_pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/scoring/executable.py` around lines 102 - 105, Replace the `@property` on
the is_clean method with Pydantic's `@computed_field` decorator so the derived
value is serialized; import computed_field from pydantic if missing and keep the
same signature (def is_clean(self) -> bool) and return expression
(self.hidden_pass and self.build_pass and self.lint_pass) so the field is
treated as a computed model field rather than a plain Python property.

Comment thread evals/scoring/executable.py
Comment thread evals/scoring/executable.py
Comment thread evals/scoring/penalties.py Outdated
Comment thread tests/evals_spine/test_judged_calibration.py
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the golden-company benchmark framework, a comprehensive evaluation suite for autonomous agents featuring executable checks and calibrated LLM judging. The implementation includes Pydantic models for briefs and scorecards, YAML loaders, and scoring logic that incorporates process-fact penalties. Review feedback highlighted the need to measure actual execution duration in the executable grader and to handle PermissionError during subprocess calls. Security improvements were also suggested regarding credential redaction in logs and command outputs, along with a correction to ensure the atomicity of JSON file writes as described in the documentation.

Comment on lines +131 to +143
def _outcome_from_completed(
spec: HiddenCheckSpec,
label: str,
completed: subprocess.CompletedProcess[bytes],
) -> CheckOutcome:
return CheckOutcome(
label=label,
cmd=spec.cmd,
exit_code=completed.returncode,
duration_seconds=0.0,
stdout_tail=_tail(completed.stdout),
stderr_tail=_tail(completed.stderr),
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

Update _outcome_from_completed to accept a duration parameter. The current implementation hardcodes duration_seconds=0.0. Additionally, ensure that the cmd field in CheckOutcome is redacted if it contains URLs with credentials to prevent sensitive data leakage in tool execution results, as required by repository security rules.

Suggested change
def _outcome_from_completed(
spec: HiddenCheckSpec,
label: str,
completed: subprocess.CompletedProcess[bytes],
) -> CheckOutcome:
return CheckOutcome(
label=label,
cmd=spec.cmd,
exit_code=completed.returncode,
duration_seconds=0.0,
stdout_tail=_tail(completed.stdout),
stderr_tail=_tail(completed.stderr),
)
def _outcome_from_completed(
spec: HiddenCheckSpec,
label: str,
completed: subprocess.CompletedProcess[bytes],
duration: float,
) -> CheckOutcome:
return CheckOutcome(
label=label,
cmd=spec.cmd,
exit_code=completed.returncode,
duration_seconds=duration,
stdout_tail=_tail(completed.stdout),
stderr_tail=_tail(completed.stderr),
)
References
  1. Ensure that URLs containing potential credentials are redacted in all output channels, including tool execution results (e.g., ToolExecutionResult content).

Comment thread evals/scoring/executable.py Outdated
Comment on lines +169 to +191
def _run_check(spec: HiddenCheckSpec, label: str, work_dir: Path) -> CheckOutcome:
"""Run one check; never raises on non-zero exit; raises on tool-missing."""
try:
completed = subprocess.run( # noqa: S603 -- args validated, no shell
list(spec.cmd),
cwd=work_dir,
timeout=spec.timeout_seconds,
capture_output=True,
check=False,
shell=False,
)
except subprocess.TimeoutExpired as exc:
return _outcome_from_timeout(spec, label, exc)
except (FileNotFoundError, NotADirectoryError) as exc:
logger.error(
"evals.executable.tool_missing",
cmd=spec.cmd,
error_type=type(exc).__name__,
error=safe_error_description(exc),
)
msg = f"Required eval tool not found: {spec.cmd[0]!r}"
raise EvalToolMissingError(msg) from exc
return _outcome_from_completed(spec, label, completed)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

Measure the actual execution time using time.perf_counter() and handle PermissionError. subprocess.run raises PermissionError if the file exists but is not executable, which should be caught and reported as an EvalToolMissingError. When logging the error, ensure that spec.cmd is redacted if it contains URLs with credentials to prevent sensitive data leakage to logs.

def _run_check(spec: HiddenCheckSpec, label: str, work_dir: Path) -> CheckOutcome:
    """Run one check; never raises on non-zero exit; raises on tool-missing."""
    start = time.perf_counter()
    try:
        completed = subprocess.run(  # noqa: S603 -- args validated, no shell
            list(spec.cmd),
            cwd=work_dir,
            timeout=spec.timeout_seconds,
            capture_output=True,
            check=False,
            shell=False,
        )
        duration = time.perf_counter() - start
        return _outcome_from_completed(spec, label, completed, duration)
    except subprocess.TimeoutExpired as exc:
        return _outcome_from_timeout(spec, label, exc)
    except (FileNotFoundError, NotADirectoryError, PermissionError) as exc:
        logger.error(
            "evals.executable.tool_missing",
            cmd=spec.cmd,
            error_type=type(exc).__name__,
            error=safe_error_description(exc),
        )
        msg = f"Required eval tool not found or not executable: {spec.cmd[0]!r}"
        raise EvalToolMissingError(msg) from exc
References
  1. Ensure that URLs containing potential credentials are redacted in all output channels, including log messages.

Comment on lines +17 to +18
import subprocess
from typing import TYPE_CHECKING, Final, Self
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add import time to support measuring the actual duration of subprocess execution, which is currently hardcoded to 0.0 for successful outcomes.

Suggested change
import subprocess
from typing import TYPE_CHECKING, Final, Self
import subprocess
import time
from typing import TYPE_CHECKING, Final, Self

Comment thread evals/emit/json_writer.py
Comment on lines +24 to +30
def write_scorecard_json(scorecard: Scorecard, out_dir: Path) -> Path:
"""Write *scorecard* as JSON into *out_dir*; return the file path."""
out_dir.mkdir(parents=True, exist_ok=True)
target = out_dir / SCORECARD_JSON_FILENAME
payload = scorecard.model_dump_json(indent=JSON_INDENT)
target.write_text(payload + "\n", encoding="utf-8")
return target
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docstring claims that write_scorecard_json overwrites the file atomically via write_text, but pathlib.Path.write_text is not atomic (it truncates the file upon opening). To ensure atomicity and prevent corruption if the process crashes during writing, use a temporary file and os.replace.

def write_scorecard_json(scorecard: Scorecard, out_dir: Path) -> Path:
    """Write *scorecard* as JSON into *out_dir*; return the file path."""
    import os
    import tempfile

    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / SCORECARD_JSON_FILENAME
    payload = scorecard.model_dump_json(indent=JSON_INDENT) + "\n"

    with tempfile.NamedTemporaryFile(
        mode="w", dir=out_dir, encoding="utf-8", delete=False
    ) as tf:
        tf.write(payload)
        temp_name = tf.name

    try:
        os.replace(temp_name, target)
    except Exception:
        try:
            os.unlink(temp_name)
        except OSError:
            pass
        raise
    return target

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 80.12270% with 486 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.00%. Comparing base (f60e9ed) to head (2c25a0a).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...c/synthorg/persistence/sqlite/conversation_repo.py 62.89% 56 Missing and 3 partials ⚠️
...synthorg/persistence/postgres/conversation_repo.py 65.27% 47 Missing and 3 partials ⚠️
...persistence/sqlite/conversational_proposal_repo.py 65.57% 41 Missing and 1 partial ⚠️
...rsistence/postgres/conversational_proposal_repo.py 70.00% 32 Missing and 1 partial ⚠️
...rg/engine/workspace/git_backend/external_remote.py 58.10% 29 Missing and 2 partials ⚠️
src/synthorg/engine/workspace/push_queue.py 73.39% 21 Missing and 8 partials ⚠️
src/synthorg/meta/chief_of_staff/propose.py 86.74% 20 Missing and 2 partials ⚠️
...thorg/persistence/sqlite/project_workspace_repo.py 68.57% 22 Missing ⚠️
src/synthorg/persistence/conversational_factory.py 54.28% 15 Missing and 1 partial ⚠️
...org/persistence/postgres/project_workspace_repo.py 75.00% 16 Missing ⚠️
... and 21 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2025      +/-   ##
==========================================
- Coverage   85.09%   85.00%   -0.10%     
==========================================
  Files        1913     1948      +35     
  Lines      113388   115799    +2411     
  Branches     9673     9839     +166     
==========================================
+ Hits        96485    98430    +1945     
- Misses      14533    14943     +410     
- Partials     2370     2426      +56     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Aureliolo pushed a commit that referenced this pull request May 20, 2026
Reviewer feedback on PR #2025 from the initial CodeRabbit + Gemini pass. All findings verified against the current code; everything valid is fixed in this round (no deferrals). evals/loader/briefs.py refuses non-file glob entries and resolved paths outside briefs_dir (new BriefSuitePathTraversalError); blocks symlink escape and directory-with-yaml-suffix. evals/models/brief.py introduces DEFAULT_HIDDEN_CHECK_TIMEOUT_SECONDS as a module-level Final constant in place of the inline default. evals/models/scorecard.py adds a BriefResult validator enforcing kind=JUDGED iff judge_calibration is set, and AggregatedProcessFacts now rejects negative per-class counts. evals/scoring/aggregate.py types PenaltyEntry.event_constant as NotBlankStr and aggregate_brief_score refuses negative per-class counts before applying penalties. evals/scoring/executable.py types CheckOutcome.label as NotBlankStr, _outcome_from_completed records measured time.perf_counter() duration, _run_check catches PermissionError as tool-missing, and event names come from observability.events.evals constants instead of string literals. evals/scoring/penalties.py rejects negative points_per_event values and floor above GRADE_CEILING. evals/emit/json_writer.py uses tempfile + Path.replace + fsync for atomic write; the previous Path.write_text was advertised atomic but is not. tests/evals_spine/test_judged_calibration.py types predicate as Callable[[float], bool], type-ignore removed. is_clean stays @Property (NOT @computed_field) on ProcessFactReport, AggregatedProcessFacts, AggregationResult, ExecutableGrade: @computed_field serialises the value into JSON, but extra=forbid then rejects it on model_validate_json round-trip; matches existing Scorecard.total / max_total / is_passing pattern. Added positive tests for every new validator (path traversal, negative counts, kind/calibration consistency, PenaltyTable bounds).
Aureliolo pushed a commit that referenced this pull request May 20, 2026
Round 1 push #2025 added src/synthorg/observability/events/evals.py but did not update tests/unit/observability/test_events.py::TestEventConstants::test_all_domain_modules_discovered. That test enumerates every legitimate domain module under observability/events/ and refuses unknowns. Adding the new evals namespace to the expected set unblocks the pre-push pytest gate.
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 18:29 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/emit/json_writer.py`:
- Around line 37-45: The temporary JSON file is opened in text mode without
specifying newline, causing platform-dependent CRLF/LF translations; update the
NamedTemporaryFile call in evals/emit/json_writer.py (the with
tempfile.NamedTemporaryFile(...) as handle block) to include newline="\n" so the
written payload uses LF-only line endings across platforms, keeping mode="w",
encoding="utf-8", delete=False, prefix/suffix as-is.

In `@evals/scoring/aggregate.py`:
- Around line 86-87: The validator
AggregationResult._score_matches_grade_minus_deduction currently enforces the
floor using the global GRADE_FLOOR while aggregate_brief_score uses
penalty_table.floor; change the validator to use the same floor source as
aggregation (e.g., compute expected = max(self.grade - self.deduction,
self.penalty_table.floor) or otherwise obtain penalty_table.floor from the
AggregationResult instance/context) so both validation and scoring use
penalty_table.floor consistently; ensure any null/absent penalty_table is
handled (fall back to GRADE_FLOOR only if penalty_table is unavailable).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 570e0380-6ccd-4433-aa7a-6eb407842c8e

📥 Commits

Reviewing files that changed from the base of the PR and between 815ac88 and 48c3121.

📒 Files selected for processing (14)
  • evals/emit/json_writer.py
  • evals/errors.py
  • evals/loader/briefs.py
  • evals/models/brief.py
  • evals/models/scorecard.py
  • evals/scoring/aggregate.py
  • evals/scoring/executable.py
  • evals/scoring/penalties.py
  • src/synthorg/observability/events/evals.py
  • tests/evals_spine/test_brief_loader.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_scorecard_emit.py
  • tests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
  • GitHub Check: Dashboard Test
  • GitHub Check: Test Unit
  • GitHub Check: Test Integration
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Follow configuration precedence: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env at boot site; no os.environ.get outside startup

No hardcoded numeric values; numerics live in settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants (NAME: int|float|Final|Final[int]|Final[float] = literal)

Use comments to explain WHY only; no reviewer citations, issue back-refs, or migration framing; enforced by check_no_review_origin_in_code.py and check_no_migration_framing.py

No from __future__ import annotations (Python 3.14 has PEP 649); PEP 758 except: except A, B: must have parens unless binding

Type hints on public functions; mypy strict mode; Google-style docstrings; line length 88; functions <50 lines; files <800 lines

Define errors as Error inheriting from DomainError, never directly from Exception/RuntimeError/etc; enforced by check_domain_error_hierarchy.py

Use Pydantic v2 frozen + extra='forbid' on every frozen model project-wide (gate check_frozen_model_extra_forbid.py); @computed_field for derived; NotBlankStr for identifiers

Args models at every system boundary; parse_typed() for every external dict ingestion; enforced by check_boundary_typed.py

Use immutability: model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries

Use asyncio.TaskGroup for fan-out/fan-in async operations; helpers must catch Exception and re-raise MemoryError/RecursionError

Implement Clock seam: clock: Clock | None = None; tests inject FakeClock

Lifecycle: services own _lifecycle_lock; timed-out stops mark unrestartable

Wrap untrusted content with wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML (SEC-1)

Import logger as: from synthorg.observability import get_logger; variable always named logger; never use import logging or print() in app code

Use e...

Files:

  • src/synthorg/observability/events/evals.py
src/synthorg/observability/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Telemetry: opt-in, off by default; every event property must be in _ALLOWED_PROPERTIES

Files:

  • src/synthorg/observability/events/evals.py
src/**/*.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/observability/events/evals.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use test markers: @pytest.mark.{unit,integration,e2e,slow}; async auto; timeout 30s global; coverage 80% min

xdist -n 8 --dist=loadfile auto-applied via pyproject addopts (loadfile prevents 3.14+Windows ProactorEventLoop leak)

Windows: unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race); subprocess tests override back

Test doubles: FakeClock for Clock seam; mock_ofT for typed-boundary substitutions; SimpleNamespace for attribute-bags; bare MagicMock at typed boundary blocked by scripts/check_mock_spec.py (zero-tolerance)

FakeClock and mock_of import from tests._shared; inject via clock= and helper's spec subscript

Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))

Flaky tests: NEVER skip/xfail; fix fundamentally; use asyncio.Event().wait() not sleep(large)

Files:

  • tests/unit/observability/test_events.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_scorecard_emit.py
  • tests/evals_spine/test_brief_loader.py
  • tests/evals_spine/test_judged_calibration.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/observability/test_events.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_scorecard_emit.py
  • tests/evals_spine/test_brief_loader.py
  • tests/evals_spine/test_judged_calibration.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Read design spec page before implementing; deviations need approval from DESIGN_SPEC.md
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Present every plan for accept/deny before coding
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Apply no region/currency/locale privilege; use metric units and British English
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Every convention PR must ship its enforcement gate
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Never edit tests/baselines/unit_timing.json or any scripts/*_baseline.{txt,json} / scripts/_*_baseline.py; timeout/slow failures indicate source-code regression
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: After issue: branch + commit + push (no auto-PR); use /pre-pr-review before PR; after PR use /aurelio-review-pr for external feedback; fix EVERYTHING valid before merging
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Commits: <type>: <description> (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via synthorg-repo-bot)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Branches: <type>/<slug> from main
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: Squash merge; PR body becomes squash commit; trailers (Release-As, Closes `#N`) must be in PR body
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: GitHub queries: use gh issue list via Bash, NOT MCP list_issues
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: After every squash merge → /post-merge-cleanup
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:31:34.221Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

  • src/synthorg/observability/events/evals.py
  • tests/unit/observability/test_events.py
  • evals/errors.py
  • evals/scoring/aggregate.py
  • tests/evals_spine/test_penalties.py
  • evals/models/scorecard.py
  • evals/scoring/executable.py
  • evals/models/brief.py
  • evals/scoring/penalties.py
  • evals/emit/json_writer.py
  • tests/evals_spine/test_scorecard_emit.py
  • evals/loader/briefs.py
  • tests/evals_spine/test_brief_loader.py
  • tests/evals_spine/test_judged_calibration.py
🔇 Additional comments (13)
src/synthorg/observability/events/evals.py (1)

1-13: LGTM!

tests/unit/observability/test_events.py (1)

336-339: LGTM!

evals/errors.py (1)

43-54: LGTM!

Also applies to: 120-120

evals/scoring/aggregate.py (1)

12-12: LGTM!

Also applies to: 34-34, 131-133

tests/evals_spine/test_penalties.py (1)

21-21: LGTM!

Also applies to: 185-210

evals/models/scorecard.py (1)

21-21: LGTM!

Also applies to: 62-62, 82-85, 147-162, 166-166, 190-191, 197-205, 209-209, 259-259, 271-271

evals/scoring/executable.py (1)

18-18: LGTM!

Also applies to: 25-30, 73-73, 83-83, 109-113, 121-121, 148-168, 186-215, 231-231

evals/models/brief.py (1)

53-57: LGTM!

Also applies to: 117-117, 121-121, 253-253

evals/scoring/penalties.py (1)

17-21: LGTM!

Also applies to: 91-91, 101-110

tests/evals_spine/test_scorecard_emit.py (1)

31-41: LGTM!

Also applies to: 53-53, 236-283

evals/loader/briefs.py (1)

12-16: LGTM!

Also applies to: 44-45, 49-49, 53-66

tests/evals_spine/test_brief_loader.py (1)

9-13: LGTM!

Also applies to: 131-161

tests/evals_spine/test_judged_calibration.py (1)

3-3: LGTM!

Also applies to: 81-81, 84-84

Comment thread evals/emit/json_writer.py
Comment thread evals/scoring/aggregate.py Outdated
Aureliolo added 7 commits May 20, 2026 20:52
52 unit tests green. Lands the eval-spine scoring + data layer: brief and scorecard frozen Pydantic v2 models, YAML loaders via parse_typed, executable + judged graders with Spearman ordinal calibration gate (rho >= 0.7), process-fact penalty table with cap + floor, deterministic JSON and Markdown emitters, DomainError-based eval error hierarchy. pyproject.toml wires evals/ for ruff/pyright/pytest. Phase 3 runner + Phase 8 baselines + Phase 9 wiring follow.
…dule

Pre-push mypy gate flagged: (a) write_brief_yaml fixture's return type missing -- introduce BriefYamlWriter Callable type alias in conftest; (b) test_runner_broken_scores_worse.py importing evals.run at module scope failed because that module lands in a follow-up commit -- switch to pytest.importorskip inside helpers so the test collects cleanly until the orchestrator lands; (c) test_load_anchor_set_below_minimum_raises missing Path annotation.
…nventions

Apply 19 of 22 review findings from /pre-pr-review (1 invalid finding rejected; 1 deferred with rationale). Notable changes: (1) Critical: path-traversal containment in evals/loader/anchors.py via Path.resolve().is_relative_to(); (2) Critical: consolidate SCORECARD_SCHEMA_VERSION to a single source-of-truth in evals/models/scorecard.py; (3) Wrong exc type for non-dict brief YAML switched from BriefSuiteEmptyError to TypeError; (4) HiddenCheckSpec rejects shell metacharacters in cmd[0] at load time (defence-in-depth atop shell=False); (5) Cross-field validators on BriefResult, JudgeCalibrationReport, ProcessFactReport, AggregatedProcessFacts, PenaltyEntry, AggregationResult, CheckOutcome, ExecutableGrade, ScriptedJudge, JudgedGrade, AnchorItem; (6) Refactor _run_check / grade_executable below the 50-line ceiling via _outcome_from_completed / _outcome_from_timeout / _run_class helpers; (7) Parametrize-collapse Spearman and executable-grade tests; (8) Issue back-refs and migration-framing scrubbed from docstrings; (9) CLAUDE.md layout line now mentions top-level evals/; (10) 16 new validator + parametrize tests, 68 unit tests pass. Rejected: PEP 758 finding (CLAUDE.md says no parens unless binding; we ARE binding). Deferred: Brief discriminated union (agent flagged optional, current post-validator catches the same XOR errors).
Reviewer feedback on PR #2025 from the initial CodeRabbit + Gemini pass. All findings verified against the current code; everything valid is fixed in this round (no deferrals). evals/loader/briefs.py refuses non-file glob entries and resolved paths outside briefs_dir (new BriefSuitePathTraversalError); blocks symlink escape and directory-with-yaml-suffix. evals/models/brief.py introduces DEFAULT_HIDDEN_CHECK_TIMEOUT_SECONDS as a module-level Final constant in place of the inline default. evals/models/scorecard.py adds a BriefResult validator enforcing kind=JUDGED iff judge_calibration is set, and AggregatedProcessFacts now rejects negative per-class counts. evals/scoring/aggregate.py types PenaltyEntry.event_constant as NotBlankStr and aggregate_brief_score refuses negative per-class counts before applying penalties. evals/scoring/executable.py types CheckOutcome.label as NotBlankStr, _outcome_from_completed records measured time.perf_counter() duration, _run_check catches PermissionError as tool-missing, and event names come from observability.events.evals constants instead of string literals. evals/scoring/penalties.py rejects negative points_per_event values and floor above GRADE_CEILING. evals/emit/json_writer.py uses tempfile + Path.replace + fsync for atomic write; the previous Path.write_text was advertised atomic but is not. tests/evals_spine/test_judged_calibration.py types predicate as Callable[[float], bool], type-ignore removed. is_clean stays @Property (NOT @computed_field) on ProcessFactReport, AggregatedProcessFacts, AggregationResult, ExecutableGrade: @computed_field serialises the value into JSON, but extra=forbid then rejects it on model_validate_json round-trip; matches existing Scorecard.total / max_total / is_passing pattern. Added positive tests for every new validator (path traversal, negative counts, kind/calibration consistency, PenaltyTable bounds).
Round 1 push #2025 added src/synthorg/observability/events/evals.py but did not update tests/unit/observability/test_events.py::TestEventConstants::test_all_domain_modules_discovered. That test enumerates every legitimate domain module under observability/events/ and refuses unknowns. Adding the new evals namespace to the expected set unblocks the pre-push pytest gate.
CodeRabbit re-review on head 48c3121 posted two actionable comments. Both verified valid against the current code and fixed in this commit. evals/emit/json_writer.py: NamedTemporaryFile now opens with newline LF only, so the serialised JSON is deterministic across platforms. Without this, on Windows the text-mode write would translate LF to CRLF and the scorecard JSON would differ from the same payload produced on Linux. evals/scoring/aggregate.py: AggregationResult now carries the floor used by the aggregator (defaults to GRADE_FLOOR for backward compat) and the score validator clamps against that floor, not the global GRADE_FLOOR. Previously a PenaltyTable with a non-zero floor would produce a final_score the validator rejected. Added a regression test in tests/evals_spine/test_penalties.py that aggregates with a custom floor of 50 and asserts the score is clamped consistently.
…r_consistently docstring

Pre-push hooks no-review-origin-in-code and no-migration-framing both flagged the docstring on test_aggregate_uses_penalty_table_floor_consistently in tests/evals_spine/test_penalties.py. The phrase explaining what the validator did 'before the round-4 fix' is exactly the kind of historical narrative the gates forbid: it rots the moment the old behaviour stops being a thing anyone remembers. The replacement describes current invariants only (the floor flows from the penalty table into AggregationResult; the score validator and aggregator share the floor).
@Aureliolo Aureliolo force-pushed the feat/1980-golden-benchmark branch from 48c3121 to 6dfda3a Compare May 20, 2026 18:58
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 18:59 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/emit/json_writer.py`:
- Line 25: Remove the unused logger initialization: delete the statement that
creates the module-level logger (logger = get_logger(__name__)) and also remove
the corresponding get_logger import if it isn't used elsewhere in this module;
ensure no references to the symbol logger remain (search for logger and
get_logger in this file) before committing the change.

In `@evals/emit/markdown_writer.py`:
- Around line 128-133: The write_scorecard_md function currently calls
target.write_text(render_scorecard_md(scorecard), encoding="utf-8") which allows
platform-dependent newline translation; change this call to pass newline="\n" so
the file is written with deterministic LF line endings (consistent with
json_writer.py). Update the write_text invocation in write_scorecard_md (and
ensure it still uses SCORECARD_MD_FILENAME and render_scorecard_md) to include
newline="\n" while preserving encoding="utf-8".
- Line 23: The module defines an unused logger via the variable logger =
get_logger(__name__) in markdown_writer.py; remove that unused initialization
and any related unused imports (get_logger or logger) from the file, or if
logging is intended, replace it with a used logging call (e.g., use the logger
inside functions like write_markdown or export_markdown) — locate the
logger/get_logger symbols in the module and either delete them or integrate them
into existing functions so the symbol is actually used.

In `@evals/models/scorecard.py`:
- Around line 252-255: The is_passing method currently truncates the threshold
by using int(self.max_total * PASS_FRACTION); replace that with a non-truncating
ceiling comparison so the required pass count isn't lowered for non-integer
thresholds—use math.ceil(self.max_total * PASS_FRACTION) (import math if needed)
and update is_passing to compare self.total against that ceiling value.

In `@evals/scoring/__init__.py`:
- Around line 3-12: The module docstring incorrectly states "Three concerns"
while listing four modules; update the top-level docstring in evals.scoring (the
module docstring in __init__.py) to read "Four concerns" (or a neutral phrase
like "Several concerns") so the count matches the listed modules
(:mod:`evals.scoring.penalties`, :mod:`evals.scoring.executable`,
:mod:`evals.scoring.judged`, :mod:`evals.scoring.aggregate`).

In `@evals/scoring/aggregate.py`:
- Around line 112-115: Update the docstring that describes the parameter "grade"
so it matches the implementation: remove the statement that out-of-range values
are "clamped" and instead say that grades must be in [0, 100] and values outside
that range raise a ValueError (no clamping). Ensure the docstring for the
function/method that accepts the "grade" parameter mirrors the actual validation
behavior where a ValueError is raised when the grade is out of bounds.

In `@evals/scoring/judged.py`:
- Around line 210-216: Replace the hardcoded event string
"evals.judge.calibration_failed" used in the logger.warning call inside
judged.py with the corresponding event constant exported from
synthorg.observability.events.evals (e.g., EVALS_JUDGE_CALIBRATION_FAILED); add
the import for that constant at the top of the file and use the constant in the
logger.warning call so the code matches the existing event-constant convention.

In `@tests/evals_spine/test_executable_grader.py`:
- Around line 78-96: The test matrix in pytest.mark.parametrize for
("hidden_pass", "build_pass", "lint_pass", "expected_score") is missing the two
boolean combinations (hidden=True, build=False, lint=False) and (hidden=False,
build=False, lint=True); add entries for (True, False, False, EXEC_TOTAL -
EXEC_WEIGHT_BUILD - EXEC_WEIGHT_LINT) and (False, False, True, EXEC_TOTAL -
EXEC_WEIGHT_HIDDEN - EXEC_WEIGHT_BUILD) and append corresponding ids (e.g.,
"build_and_lint_fail" and "hidden_and_build_fail") so the full 8-combination
coverage is present in tests/evals_spine/test_executable_grader.py using the
existing pytest.mark.parametrize block and the constants EXEC_TOTAL,
EXEC_WEIGHT_HIDDEN, EXEC_WEIGHT_BUILD, EXEC_WEIGHT_LINT.

In `@tests/evals_spine/test_judged_calibration.py`:
- Line 306: Add a single module-level import for load_anchor_set from
evals.loader.anchors (alongside the other anchors imports) and remove the two
local imports that appear inside the tests (the local load_anchor_set imports
within the two test functions). This consolidates imports: declare "from
evals.loader.anchors import load_anchor_set" at top-level and delete the
duplicate local import statements in the test bodies so the tests use the
module-level symbol.
- Line 287: Consolidate the local pydantic ValidationError imports by adding a
single module-level import "from pydantic import ValidationError as
PydValidationError" at the top of the test module and removing the three
redundant local imports of PydValidationError inside the individual tests;
update any local references to use the module-level PydValidationError symbol so
tests continue to raise/handle Pydantic ValidationError consistently.
- Line 324: Remove the redundant local import of AnchorItem by deleting the
duplicate "from evals.loader.anchors import AnchorItem" line in the
test_judged_calibration.py module so only the module-level import of AnchorItem
remains; search for any extra occurrences of the symbol AnchorItem within that
file to ensure you remove only the duplicate import and not any necessary
usages.

In `@tests/evals_spine/test_runner_broken_scores_worse.py`:
- Line 53: The assertion in the test uses a strict greater-than comparison
(reference.total > broken.total + SCORE_MARGIN) but the contract requires an
inclusive threshold ("at least" SCORE_MARGIN); change the comparison to a
greater-than-or-equal comparison so the check becomes reference.total >=
broken.total + SCORE_MARGIN, ensuring a gap exactly equal to SCORE_MARGIN
passes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: a85a438d-9fc5-40c1-a517-15ae67cf84d8

📥 Commits

Reviewing files that changed from the base of the PR and between 48c3121 and 6dfda3a.

📒 Files selected for processing (29)
  • CLAUDE.md
  • evals/__init__.py
  • evals/emit/__init__.py
  • evals/emit/json_writer.py
  • evals/emit/markdown_writer.py
  • evals/errors.py
  • evals/loader/__init__.py
  • evals/loader/anchors.py
  • evals/loader/briefs.py
  • evals/models/__init__.py
  • evals/models/brief.py
  • evals/models/scorecard.py
  • evals/scoring/__init__.py
  • evals/scoring/aggregate.py
  • evals/scoring/executable.py
  • evals/scoring/judged.py
  • evals/scoring/penalties.py
  • evals/scoring/spearman.py
  • pyproject.toml
  • src/synthorg/observability/events/evals.py
  • tests/evals_spine/__init__.py
  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_brief_loader.py
  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_scorecard_emit.py
  • tests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
  • GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
  • GitHub Check: Test Integration
  • GitHub Check: Test E2E
  • GitHub Check: Dashboard Test
  • GitHub Check: Test Unit
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (5)
src/synthorg/!(persistence)/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Only src/synthorg/persistence/ may import sqlite/psycopg or emit raw SQL

Files:

  • src/synthorg/observability/events/evals.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Configuration precedence: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env at boot site; no os.environ.get outside startup

Files:

  • src/synthorg/observability/events/evals.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

No hardcoded numeric values; numerics live in settings/definitions/; allowlist only 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants of the form NAME: int|float|Final|Final[int]|Final[float] = literal

Comments should explain WHY only; no reviewer citations, issue back-refs, or migration framing

No from __future__ import annotations (Python 3.14 has PEP 649); use PEP 758 except: except A, B: requires parens if binding

Type hints required on public functions; mypy strict mode; Google-style docstrings

Line length 88; functions <50 lines; files <800 lines

Errors must use <Domain><Condition>Error naming and inherit from DomainError; never inherit directly from Exception/RuntimeError/etc

Pydantic v2 frozen models must use extra="forbid" on every frozen model; @computed_field is auto-exempt; per-line # lint-allow: frozen-extra-forbid -- <reason> for exceptions

@computed_field should be used for derived properties in Pydantic models

Use NotBlankStr for identifiers in Pydantic models

Args models required at every system boundary; use parse_typed() for every external dict ingestion

Immutability: use model_copy(update=...) or copy.deepcopy(); deepcopy at system boundaries

Async code must use asyncio.TaskGroup for fan-out/fan-in; helpers catch Exception and re-raise MemoryError/RecursionError

Use Clock seam: clock: Clock | None = None; tests inject FakeClock

Lifecycle services must own _lifecycle_lock; timed-out stops mark unrestartable

Untrusted content (SEC-1) must be wrapped via wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML

Import logger via from synthorg.observability import get_logger; variable always named logger; never use import logging or print() in app code

Event names must come from observability.events.<domain> constants; use structured kwargs in log calls (logger.info(EVENT, key=value))

Error paths must log WARNING/ERROR w...

Files:

  • src/synthorg/observability/events/evals.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/observability/events/evals.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Test markers: @pytest.mark.{unit,integration,e2e,slow}

Async tests use auto mode; global timeout 30s; coverage 80% minimum

Test doubles ladder: FakeClock for Clock seam, mock_of[T](**overrides) for typed-boundary substitutions, SimpleNamespace for attribute-bags; bare MagicMock at typed boundary is blocked by scripts/check_mock_spec.py

FakeClock and mock_of import from tests._shared; inject via clock= and helper's spec subscript

Flaky tests: NEVER skip/xfail; fix fundamentally; use asyncio.Event().wait() not sleep(large)

Files:

  • tests/unit/observability/test_events.py
  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_scorecard_emit.py
  • tests/evals_spine/test_brief_loader.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/unit/observability/test_events.py
  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_scorecard_emit.py
  • tests/evals_spine/test_brief_loader.py
tests/**/*test*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Windows unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race); subprocess tests override back

Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))

Files:

  • tests/unit/observability/test_events.py
  • tests/evals_spine/conftest.py
  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_scorecard_emit.py
  • tests/evals_spine/test_brief_loader.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Read `docs/design/` page before implementing; deviations need approval
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Present every plan for accept/deny before coding
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: No region/currency/locale privileged; use metric units; British English
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Every convention PR must ship its enforcement gate
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Never edit `tests/baselines/unit_timing.json` or any `scripts/*_baseline.{txt,json}` / `scripts/_*_baseline.py` files; timeout/slow failures indicate source-code regression
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: After issue: branch + commit + push (no auto-PR); use `/pre-pr-review` before creating PR
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: After PR creation: use `/aurelio-review-pr` for external feedback and fix all valid issues before merging
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Use `go -C cli` command, never `cd cli`
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: xdist `-n 8 --dist=loadfile` auto-applied via pyproject `addopts` (`loadfile` prevents 3.14+Windows ProactorEventLoop leak)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Branches: `<type>/<slug>` from main
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Pre-commit/pre-push hooks via `.pre-commit-config.yaml`; tool-call gates via `.claude/settings.json` PreToolUse (`scripts/check_*.sh`/`.py`)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: Squash merge; PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: GitHub queries: use `gh issue list` via Bash, NOT MCP `list_issues`
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T18:59:08.797Z
Learning: After every squash merge → `/post-merge-cleanup`
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

  • evals/emit/__init__.py
  • evals/loader/__init__.py
  • src/synthorg/observability/events/evals.py
  • evals/__init__.py
  • evals/models/__init__.py
  • evals/scoring/__init__.py
  • evals/scoring/spearman.py
  • evals/errors.py
  • tests/unit/observability/test_events.py
  • tests/evals_spine/conftest.py
  • evals/emit/markdown_writer.py
  • evals/emit/json_writer.py
  • tests/evals_spine/test_executable_grader.py
  • evals/loader/anchors.py
  • evals/models/brief.py
  • evals/models/scorecard.py
  • evals/scoring/aggregate.py
  • tests/evals_spine/test_penalties.py
  • tests/evals_spine/test_judged_calibration.py
  • evals/scoring/executable.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
  • tests/evals_spine/test_scorecard_emit.py
  • evals/loader/briefs.py
  • evals/scoring/judged.py
  • evals/scoring/penalties.py
  • tests/evals_spine/test_brief_loader.py
📚 Learning: 2026-05-16T18:36:31.446Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/reference/conventions.md:787-789
Timestamp: 2026-05-16T18:36:31.446Z
Learning: In Aureliolo/synthorg, do not require adding `<!--RS:...-->` “Doc Numeric Claims (MANDATORY)” numeric macros for Python version numbers mentioned in documentation prose (e.g., “Python 3.14”, “Python 3.15”). The `scripts/check_doc_numeric_macros.py` gate only applies to `README.md`, `docs/index.md`, `docs/roadmap/index.md`, `docs/architecture/decisions.md`, and `docs/reference/convention-gates.md`, and it only flags digits adjacent to specific stat nouns (tests/providers/agents/stars/releases), not language version mentions like “Python 3.14”.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-05-16T18:36:35.250Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1944
File: docs/getting_started.md:109-109
Timestamp: 2026-05-16T18:36:35.250Z
Learning: When reviewing Markdown in the synthorg repo, account for the CI gate `check_doc_numeric_macros.py`: it skips fenced code blocks entirely, and it only flags digits that are adjacent to these stat nouns: `tests`, `providers`, `agents`, `stars`, `releases`. Therefore, numeric examples such as CLI flag values (e.g., `--num-workers=4` in fenced bash blocks) and prose version numbers (e.g., `3.14`/`3.15`) are not expected to trigger this check; prioritize changes only when digits appear next to one of the listed nouns (e.g., “5 tests”, “10 stars”, etc.).

Applied to files:

  • CLAUDE.md
🔇 Additional comments (45)
tests/evals_spine/conftest.py (1)

1-60: LGTM!

tests/evals_spine/test_brief_loader.py (1)

19-180: LGTM!

evals/loader/briefs.py (1)

1-94: LGTM!

evals/loader/__init__.py (1)

1-8: LGTM!

evals/loader/anchors.py (1)

1-133: LGTM!

tests/evals_spine/test_runner_broken_scores_worse.py (1)

1-52: LGTM!

Also applies to: 54-88

evals/scoring/spearman.py (1)

1-97: LGTM!

evals/scoring/penalties.py (1)

1-153: LGTM!

tests/evals_spine/test_penalties.py (1)

1-233: LGTM!

tests/evals_spine/test_scorecard_emit.py (1)

1-316: LGTM!

tests/unit/observability/test_events.py (1)

336-339: LGTM!

tests/evals_spine/test_executable_grader.py (2)

26-75: LGTM!


97-230: LGTM!

src/synthorg/observability/events/evals.py (2)

1-10: LGTM!


12-13: The constants at lines 12-13 are properly imported and used in the executable scoring module (evals/scoring/executable.py). Both EVALS_EXECUTABLE_TIMEOUT and EVALS_EXECUTABLE_TOOL_MISSING are used correctly in structured logging calls with appropriate context kwargs and secure error handling (safe_error_description(exc) instead of raw exception strings).

CLAUDE.md (1)

3-3: LGTM!

evals/__init__.py (1)

1-15: LGTM!

pyproject.toml (1)

190-190: LGTM!

Also applies to: 283-283, 363-363, 378-378

evals/errors.py (1)

1-130: LGTM!

evals/models/__init__.py (1)

1-7: LGTM!

evals/models/brief.py (1)

1-269: LGTM!

evals/models/scorecard.py (1)

1-251: LGTM!

Also applies to: 256-313

evals/scoring/executable.py (4)

1-91: LGTM!


93-134: LGTM!


136-216: LGTM!


218-300: LGTM!

evals/scoring/judged.py (5)

1-52: LGTM!


54-119: LGTM!


122-171: LGTM!


174-222: LGTM!


225-298: LGTM!

evals/emit/json_writer.py (2)

37-45: Past review finding has been addressed.

The past review comment flagged the missing newline="\n" parameter. The current code includes it at line 41, ensuring platform-independent LF line endings for byte-level determinism.


31-57: LGTM!

evals/emit/__init__.py (1)

1-1: LGTM!

evals/emit/markdown_writer.py (6)

34-37: LGTM!


40-48: LGTM!


51-69: LGTM!


72-79: LGTM!


82-97: LGTM!


100-125: LGTM!

tests/evals_spine/test_judged_calibration.py (5)

1-32: LGTM!


36-102: LGTM!


108-159: LGTM!


162-197: LGTM!


200-258: LGTM!

Comment thread evals/emit/json_writer.py Outdated
Comment thread evals/emit/markdown_writer.py Outdated
Comment thread evals/emit/markdown_writer.py
Comment thread evals/models/scorecard.py
Comment thread evals/scoring/__init__.py Outdated
Comment thread tests/evals_spine/test_executable_grader.py
Comment thread tests/evals_spine/test_judged_calibration.py Outdated
Comment thread tests/evals_spine/test_judged_calibration.py Outdated
Comment thread tests/evals_spine/test_judged_calibration.py Outdated
Comment thread tests/evals_spine/test_runner_broken_scores_worse.py Outdated
All 12 actionable comments from the CodeRabbit re-review (review id 4331444484) verified valid and fixed in this commit. Source files: json_writer.py drops the unused module-level logger so the symbol does not lie about the import surface; markdown_writer.py does the same and write_scorecard_md now pins newline=LF for byte-identical Markdown across platforms; scorecard.is_passing uses math.ceil instead of int() so fractional pass thresholds round up to the next integer (a max_total of 295 at 65% requires 192, not 191); the evals.scoring package docstring corrects 'Three concerns' to 'Four concerns' to match the four listed modules; aggregate_brief_score docstring drops the obsolete 'clamped' claim and accurately states that out-of-range grades raise ValueError; judged.py logs the calibration-failed event via EVALS_JUDGE_CALIBRATION_FAILED imported from observability.events.evals (added there) instead of an inline literal. Test files: test_executable_grader.py parametrize matrix expanded from six to the full eight 2x2x2 boolean combinations across hidden/build/lint pass flags; test_judged_calibration.py consolidates yaml + PydValidationError + load_anchor_set + MIN_ANCHOR_SET_SIZE + JudgeCalibrationReport + JudgedGrade + AnchorItem imports to the module top, dropping the seven duplicate local imports; test_runner_broken_scores_worse.py SCORE_MARGIN comparison switches from > to >= so a reference run that beats the broken run by exactly SCORE_MARGIN points still passes (the contract is 'at least SCORE_MARGIN', inclusive).
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 19:19 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
evals/models/scorecard.py (1)

129-177: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Persist the per-brief score floor in BriefResult.

aggregate_brief_score() now allows score = max(grade - deduction, penalty_table.floor), but BriefResult still revalidates against max(grade - deduction, GRADE_FLOOR) and has no field to carry the configured floor. Any non-zero PenaltyTable.floor will make a valid aggregation impossible to serialize into the scorecard without either failing validation or lying about how the score was derived. Either thread the floor through BriefResult/Scorecard and version the schema, or reject non-zero floors before scorecard construction.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/models/scorecard.py` around lines 129 - 177, The BriefResult validation
assumes a global GRADE_FLOOR but aggregate_brief_score can use a
PenaltyTable.floor; add a score_floor int field to BriefResult (defaulting to
GRADE_FLOOR for compatibility) and update the
_score_matches_grade_minus_deduction validator to compute expected =
max(self.grade - self.deduction, self.score_floor) so the serialized row records
the actual floor used. Also ensure aggregate_brief_score (and Scorecard
construction) populates BriefResult.score_floor from PenaltyTable.floor when
creating BriefResult instances (or else reject non-zero PenaltyTable.floor
before building scorecards if you choose to forbid non-default floors).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@evals/models/scorecard.py`:
- Around line 129-177: The BriefResult validation assumes a global GRADE_FLOOR
but aggregate_brief_score can use a PenaltyTable.floor; add a score_floor int
field to BriefResult (defaulting to GRADE_FLOOR for compatibility) and update
the _score_matches_grade_minus_deduction validator to compute expected =
max(self.grade - self.deduction, self.score_floor) so the serialized row records
the actual floor used. Also ensure aggregate_brief_score (and Scorecard
construction) populates BriefResult.score_floor from PenaltyTable.floor when
creating BriefResult instances (or else reject non-zero PenaltyTable.floor
before building scorecards if you choose to forbid non-default floors).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: f2fa10d1-679d-449d-bd31-72249dc71096

📥 Commits

Reviewing files that changed from the base of the PR and between 6dfda3a and e57d7a0.

📒 Files selected for processing (10)
  • evals/emit/json_writer.py
  • evals/emit/markdown_writer.py
  • evals/models/scorecard.py
  • evals/scoring/__init__.py
  • evals/scoring/aggregate.py
  • evals/scoring/judged.py
  • src/synthorg/observability/events/evals.py
  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
💤 Files with no reviewable changes (1)
  • evals/emit/json_writer.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Build Fine-Tune (gpu, fine-tune-gpu)
  • GitHub Check: Build Fine-Tune (cpu, fine-tune-cpu)
  • GitHub Check: Dashboard Test
  • GitHub Check: Test Unit
  • GitHub Check: Test Integration
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (3)
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Async test fixtures: use auto mode; global timeout 30s; coverage minimum 80%; xdist -n 8 --dist=loadfile auto-applied via pyproject addopts (loadfile prevents 3.14+ Windows ProactorEventLoop leak)

Windows unit tests use WindowsSelectorEventLoopPolicy (3.14 IOCP teardown race); subprocess tests override back

Test markers: @pytest.mark.{unit,integration,e2e,slow}; test doubles ladder in conventions.md section 12.1; use FakeClock and mock_of[T](**overrides) from tests._shared; SimpleNamespace for attribute-bags; bare MagicMock at typed boundary blocked by scripts/check_mock_spec.py

Hypothesis: 10 deterministic CI examples; failures are real bugs (fix + add @example(...))

Flaky tests: NEVER skip/xfail; fix fundamentally; use asyncio.Event().wait() instead of sleep(large)

Files:

  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_runner_broken_scores_worse.py

⚙️ CodeRabbit configuration file

Test files do not require Google-style docstrings on classes or functions -- ruff D rules are only enforced on src/. A bare @settings() decorator with no arguments on Hypothesis property tests is a no-op and should not be suggested -- the HYPOTHESIS_PROFILE env var controls example counts via registered profiles, which @given() honors automatically.

Files:

  • tests/evals_spine/test_executable_grader.py
  • tests/evals_spine/test_judged_calibration.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
src/synthorg/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Configuration precedence order: DB > env > code default via SettingsService/ConfigResolver (Cat-1) or env > code default (Cat-2, read_only_post_init); Cat-3 bootstrap secrets are pure env; YAML is ingestion format only, not a precedence tier; no os.environ.get outside startup; pre-init Cat-2 reads use settings.bootstrap_resolver.resolve_init_value

Lifecycle services must own _lifecycle_lock; timed-out stops mark services unrestartable

Files:

  • src/synthorg/observability/events/evals.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

No hardcoded numeric values; numerics live in settings/definitions/; allowlist 0/1/-1, HTTP codes, hex masks, powers-of-2, and module-level annotated named constants of form NAME: int|float|Final|Final[int]|Final[float] = literal per scripts/check_no_magic_numbers.py

Use asyncio.TaskGroup for fan-out/fan-in in async code; helpers must catch Exception and re-raise MemoryError/RecursionError

Use Clock seam pattern: clock: Clock | None = None; tests inject FakeClock from tests._shared

Wrap untrusted content via wrap_untrusted() from engine.prompt_safety; use HTMLParseGuard for HTML per SEC-1

Use from synthorg.observability import get_logger with variable always named logger; never use import logging or print() in app code

Event names must come from observability.events.<domain> constants; use structured kwargs (logger.info(EVENT, key=value))

Error paths must log WARNING/ERROR with context before raising; state transitions log INFO via *_STATUS_TRANSITIONED AFTER persistence write

Secret-log redaction (SEC-1): never use error=str(exc) or interpolate {exc}; use error_type=type(exc).__name__ + error=safe_error_description(exc); never exc_info=True; OTel forbidden span.record_exception(exc), use span.set_attribute("exception.message", safe_error_description(exc)) + record_exception=False, set_status_on_exception=False per check_logger_exception_str_exc.py

Comments must explain WHY only; no reviewer citations, issue back-refs, or migration framing per check_no_review_origin_in_code.py and check_no_migration_framing.py

No from __future__ import annotations (Python 3.14+ has PEP 649); PEP 758 except: except A, B: requires parens when binding

Type hints required on public functions; mypy strict mode; Google-style docstrings; line length 88; functions <50 lines; files <800 lines

Error classes must follow <Domain><Condition>Error naming and inherit from DomainError; never inherit directly from `Exce...

Files:

  • src/synthorg/observability/events/evals.py

⚙️ CodeRabbit configuration file

This project uses Python 3.14+ with PEP 758 except syntax: "except A, B:" (comma-separated, no parentheses) is correct and mandatory -- do NOT flag it as a typo or suggest parenthesized form. The "except builtins.MemoryError, RecursionError: raise" pattern is intentional project convention for system-error propagation. When evaluating the 50-line function limit, count only the function body excluding the signature lines, decorators, and docstring. Functions 1-5 lines over due to docstrings or multi-line signatures should not be flagged. Do not suggest extracting single-use helper functions called exactly once -- this reduces readability without improving maintainability.

Files:

  • src/synthorg/observability/events/evals.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Read design specification in `docs/design/` before implementing; deviations require approval per [DESIGN_SPEC.md](docs/DESIGN_SPEC.md)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Present every plan for accept/deny before coding
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: No region/currency/locale should be privileged; use metric units and British English per [docs/reference/regional-defaults.md](docs/reference/regional-defaults.md)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Every convention PR must ship its enforcement gate per [docs/reference/convention-gates.md](docs/reference/convention-gates.md)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Timeout/slow test failures indicate source-code regression; never edit `tests/baselines/unit_timing.json` or baseline files; bypass only via `ALLOW_BASELINE_GROWTH=1 git commit ...` with explicit user approval
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: After issue implementation: branch + commit + push without auto-PR; use `/pre-pr-review` (gh pr create is blocked); after PR use `/aurelio-review-pr` for external feedback; fix all valid issues without deferring
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Git commits: `<type>: <description>` (feat/fix/refactor/docs/test/chore/perf/ci); commitizen-enforced
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Signed commits required on protected refs (GPG/SSH or GitHub App via `synthorg-repo-bot`)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Branch naming: `<type>/<slug>` from main
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Pre-commit/pre-push hooks via `.pre-commit-config.yaml`; tool-call gates via `.claude/settings.json` PreToolUse (`scripts/check_*.sh`/`.py`)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Squash merge; PR body becomes squash commit; trailers (`Release-As`, `Closes `#N``) must be in PR body
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: After every squash merge → run `/post-merge-cleanup`
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: CLI is Docker-only (init/start/stop/status); features go in dashboard + REST API
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Use `uv sync` for all dependencies, `uv sync --group docs` for docs toolchain; run `bash scripts/install_cli_tools.sh` once per machine for golangci-lint; run `bash scripts/install_git_hooks.sh` once per clone to wire `core.hooksPath -> scripts/git-hooks`
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Linting: `uv run ruff check src/ tests/ --fix` + `uv run ruff format src/ tests/`; type checking: `uv run mypy --num-workers=4 src/ tests/` (strict)
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Testing: `uv run python -m pytest tests/ -m {unit|integration|e2e}` with `-n 8 --dist=loadfile` via pyproject `addopts`; coverage minimum 80%; use `HYPOTHESIS_PROFILE={dev|fuzz}` for property tests
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Pre-commit checks: `uv run pre-commit run --all-files`; schema drift: `uv run python scripts/check_schema_drift_revisions.py --backend {sqlite|postgres}`
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Documentation build: `PYTHONPATH=. uv run zensical build` for docs; reference [docs/reference/claude-reference.md](docs/reference/claude-reference.md) for doc layout, Docker, releasing, CI, dependencies, Hypothesis
Learnt from: CR
Repo: Aureliolo/synthorg

Timestamp: 2026-05-20T19:19:19.402Z
Learning: Web: see `web/CLAUDE.md` for guidelines; CLI: see `cli/CLAUDE.md` and use `go -C cli` (never `cd cli`); Shell: see `~/.claude/rules/common/bash.md` for canonical `cd`/`git -C`/Bash file-write rules
📚 Learning: 2026-05-05T09:04:46.195Z
Learnt from: Aureliolo
Repo: Aureliolo/synthorg PR: 1760
File: scripts/_dual_backend_parity_lib.py:215-216
Timestamp: 2026-05-05T09:04:46.195Z
Learning: This repository targets Python 3.14+ and follows PEP 758. Therefore, reviewer tooling should NOT treat unparenthesized multi-exception `except` clauses written without an `as` clause (e.g., `except MemoryError, RecursionError:`) as syntax errors. Only flag `except`-clause problems when they are genuinely invalid for Python 3.14+.

Applied to files:

  • evals/scoring/__init__.py
  • tests/evals_spine/test_executable_grader.py
  • src/synthorg/observability/events/evals.py
  • evals/scoring/aggregate.py
  • evals/emit/markdown_writer.py
  • tests/evals_spine/test_judged_calibration.py
  • evals/models/scorecard.py
  • evals/scoring/judged.py
  • tests/evals_spine/test_runner_broken_scores_worse.py
🔇 Additional comments (9)
src/synthorg/observability/events/evals.py (1)

14-14: LGTM!

tests/evals_spine/test_executable_grader.py (1)

85-87: LGTM!

Also applies to: 95-97

tests/evals_spine/test_judged_calibration.py (1)

7-8: LGTM!

Also applies to: 11-16, 26-26, 30-30

evals/scoring/__init__.py (1)

3-3: LGTM!

evals/scoring/aggregate.py (1)

112-116: LGTM!

evals/scoring/judged.py (1)

24-24: LGTM!

Also applies to: 211-217

evals/models/scorecard.py (1)

10-10: LGTM!

Also applies to: 252-261

evals/emit/markdown_writer.py (1)

124-133: LGTM!

tests/evals_spine/test_runner_broken_scores_worse.py (1)

53-53: LGTM!

The new CodeRabbit review on head e57d7a0 (review id 4331572429) carried a single outside-diff-range finding pointing out that round 4 fixed the AggregationResult floor inconsistency but stopped at the scorer boundary: BriefResult still validated against the global GRADE_FLOOR while aggregate_brief_score produced scores using penalty_table.floor. A non-zero PenaltyTable.floor would make a valid aggregation impossible to serialise into the scorecard without either failing validation or lying about how the score was derived. BriefResult now carries a score_floor field (default GRADE_FLOOR for backward compat) and the validator clamps against self.score_floor instead of the global. aggregate_brief_score already returns AggregationResult.floor; downstream code that builds BriefResult from an AggregationResult should forward floor -> score_floor so the serialised row records the exact bound the aggregator applied. Added two regression tests in tests/evals_spine/test_scorecard_emit.py: one that accepts a non-default custom floor of 50, one that rejects a score below the custom floor.
@Aureliolo
Copy link
Copy Markdown
Owner Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@Aureliolo Aureliolo merged commit 7b43d6f into main May 20, 2026
83 checks passed
@Aureliolo Aureliolo deleted the feat/1980-golden-benchmark branch May 20, 2026 20:03
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview May 20, 2026 20:03 — with GitHub Actions Inactive
Aureliolo added a commit that referenced this pull request May 20, 2026
…y benchmark (#2025)

Part of #1979. Lands the **scoring + data layer** of the golden-company
benchmark (eval spine), as documented in the Status (2026-05-20) section
of the linked issue.

This PR intentionally does **not** close #1980. The runner (cassette
guard + telemetry collector + orchestrator + CLI + baselines + recorded
cassette + `make benchmark` target) is held until a runtime prerequisite
gap closes. See "Prerequisites for the runner PR" in #1980 for the full
follow-up plan.

## What landed

The complete scoring + data contract for the eval spine, 68 unit tests
passing:

- `evals/models/{brief,scorecard}.py` -- frozen Pydantic v2 models,
schema-versioned (`SCORECARD_SCHEMA_VERSION = 1`).
- `evals/loader/{briefs,anchors}.py` -- YAML loaders via
`synthorg.api.boundary.parse_typed`; path-traversal containment on
anchor-set loading.
- `evals/scoring/executable.py` -- subprocess grader (hidden / build /
lint, weighted 60/20/20); rejects shell metacharacters at load time as
defence-in-depth.
- `evals/scoring/judged.py` + `spearman.py` -- ordinal calibration with
a Spearman rho gate (`>= 0.7`); refuses to score when the judge's
ordering disagrees with the hand-scored anchor set.
- `evals/scoring/penalties.py` + `aggregate.py` -- process-fact penalty
table with cap + floor.
- `evals/emit/{json_writer,markdown_writer}.py` -- deterministic JSON +
Markdown emitters.
- `evals/errors.py` -- `DomainError`-rooted hierarchy.
- `pyproject.toml` -- `evals/` wired for ruff (`src`), pyright
(`extraPaths`), pytest (`pythonpath`), isort (`known-first-party`).
- `CLAUDE.md` -- layout line now mentions the new `evals/` top-level
directory.

Pydantic models carry cross-field invariant validators so the scorecard
is auditable from any single field: `BriefResult` enforces `score ==
max(grade - deduction, GRADE_FLOOR)`; `JudgeCalibrationReport` enforces
`passed == (rho >= gate)`; `PenaltyEntry` enforces `raw ==
points_per_event * count` and `applied <= raw`; `AggregatedProcessFacts`
enforces `total_events == sum(events_by_class)`; `ProcessFactReport`
enforces entries match the tracked events_by_class slice; `CheckOutcome`
ties `timed_out=True` to the POSIX timeout sentinel; `ExecutableGrade`
enforces pass-flags match per-label outcomes; `ScriptedJudge` requires
non-empty responses or default scores; `JudgedGrade` refuses
construction with a failing calibration; `AnchorItem` rejects
hand_scores outside `[0.0, 1.0]`; `HiddenCheckSpec` rejects shell
metacharacters in `cmd[0]`.

## What is NOT yet built (follow-up PR scope)

- `evals/runner/cassette_guard.py` -- cassette SHA256 + REPLAY
enforcement.
- `evals/runner/telemetry_collector.py` -- structlog event bucketing.
- `evals/runner/orchestrator.py` -- per-brief boot/submit/wait/collect
loop.
- `evals/run.py` + `evals/__main__.py` -- CLI entry point.
- `evals/baselines/{reference,broken}.yaml`, `evals/briefs/*.yaml`,
`evals/cassettes/reference_run.cassette.json`.
- `Makefile` target + `.gitignore`.
- The acceptance test `test_runner_broken_scores_worse.py` going green
(currently uses `pytest.importorskip("evals.run")`).

## Why the runner did NOT land in this PR

A runtime spike (`create_app()` + submit one `TaskRequirement` via the
`IntakeEngine`) confirmed the studio's agent-execution path is not yet
wired end-to-end. Intake transitions fire (`intake.request.received` ->
`triaging` -> `scoped`), then `TaskEngineNotRunningError` raises. No
`BUDGET_*` / `EXECUTION_*` / `STAGNATION_*` / `APPROVAL_GATE_*` events
appear. Without those events, broken.yaml and reference.yaml would
produce identical empty process-fact traces and the scorer could not
discriminate. Per project rule "admit you cannot in plain words,
BEFORE": surfacing this gap rather than shipping a green acceptance test
that asserts nothing.

Prerequisites tracked in #1980's Status section.

## Test plan

```bash
uv run python -m pytest tests/evals_spine/ -m unit
```

Result locally: 68 passed in 9.13s.

```bash
uv run ruff check src/ tests/ evals/
uv run ruff format --check src/ tests/ evals/
uv run mypy --num-workers=4 tests/evals_spine/ evals/
```

All green. Pre-push hooks (affected mypy + pytest, ruff, em-dash gate,
review-origin gate, migration-framing gate, secret detection, etc.) ran
on the push and passed.

## Review coverage

`/pre-pr-review` ran 9 agents in parallel against the spine. 22 findings
consolidated in `_audit/pre-pr-review/triage.md`:

- 2 CRITICAL: path-traversal containment, duplicate
`SCORECARD_SCHEMA_VERSION` -- both fixed.
- 5 MAJOR valid (wrong exception type, three issue back-refs in
docstrings, missing `evals/` in CLAUDE.md layout) -- all fixed.
- 1 MAJOR invalid (PEP 758 syntax: agent misread CLAUDE.md's "no parens
unless binding"; we ARE binding `as exc`, parens required) -- rejected
with rationale.
- 11 MEDIUM valid (function-length refactor, shell-metachar guard,
test-quality cleanups, 4 cross-field validators) -- all fixed.
- 3 LOW deferred -- 2 implemented (extra validators on
trusted-construction types, JudgeProtocol docstring polish); 1 deferred
(Brief discriminated union -- the review agent itself flagged it as a
pragmatic trade-off and the current post-validator catches the same XOR
errors; restructuring would force a substantial test refactor for
marginal benefit).

Net: **19 of 22 review findings applied in this PR**, with rationale
recorded in the commit body of `815ac88f` and in the triage file.

## Related

- Part of #1979 (operate-tier EPIC).
- Depends on #1961 (client-simulation harness, already in place) + EPIC
#1973 (runnable studio, partial -- runtime gap blocks the runner PR) +
#1956.
- Blocks #1983 (learning curve), #1990, #1995, #1998 -- all consume the
scorecard schema landing here.
Aureliolo pushed a commit that referenced this pull request May 22, 2026
<!-- HIGHLIGHTS_START -->
## Highlights

> _AI-generated summary (model: `openai/gpt-4.1-mini` via GitHub
Models). Commit-based changelog below._

### What you'll notice
- Introduced conversational interface for direct clarify and propose
interactions.
- Cost management now includes forecast gates, hard ceilings, and Pareto
considerations.
- Added living documentation engine combining wiki and
retrieval-augmented generation features.
- Real intake engine is now operational for live data processing.
- Virtual desktop tool with vision verification gate available for
enhanced workspace control.

### What's new
- Per-project reproducible environments for consistent setups.
- Headless browser testing tool integrated for automated UI validation.
- Governed external API and data access tool introduced.
- Hardened external-remote git backend with sandbox mounts and
push-queue dispatching.
- Adversarial red-team gate subsystem for enhanced security testing.
- Self-extending toolkit to dynamically expand capabilities.
- Stakes-aware model routing enables prioritized processing.
- Task-board entry adapter connects live runtime with project
management.
- Persistent project workspace with pluggable git backend and
per-project push queues implemented.
- Knowledge and provenance substrate added to track data lineage.
- Scoring and data contract framework for golden-company benchmark
evaluations.

### Under the hood
- Desktop Dockerfile pinned by digest to improve build stability and
documented publishing gap fixed.

<!-- HIGHLIGHTS_END -->

:robot: I have created a release *beep* *boop*
---


##
[0.8.7](v0.8.6...v0.8.7)
(2026-05-22)


### Features

* conversational interface v1 - 1:1 clarify + propose
([#2019](#2019))
([216ef94](216ef94)),
closes [#1968](#1968)
* cost as a first-class dial (forecast gate, hard ceiling, Pareto)
([#2029](#2029))
([700a59e](700a59e)),
closes [#1982](#1982)
* **env:** reproducible per-project environments
([#2039](#2039))
([d2c0ef9](d2c0ef9)),
closes [#1994](#1994)
* **evals:** [#1980](#1980)
spine -- scoring + data contract for golden-company benchmark
([#2025](#2025))
([53108e8](53108e8))
* goal/objective entry adapter
([#1964](#1964))
([#2022](#2022))
([cb15c3c](cb15c3c))
* governed external API/data access tool
([#1991](#1991))
([#2032](#2032))
([e08b451](e08b451))
* harden external-remote git backend + per-project sandbox mount +
push-queue dispatch
([#2020](#2020))
([#2030](#2030))
([2fa2e1e](2fa2e1e))
* headless browser testing tool
([#1992](#1992))
([#2024](#2024))
([277b52a](277b52a))
* knowledge + provenance substrate
([#2036](#2036))
([48c897b](48c897b))
* living documentation engine (dual-purpose wiki + RAG namespace)
([#2028](#2028))
([3d10da9](3d10da9)),
closes [#1976](#1976)
* real intake engine online
([#2017](#2017))
([9d8eb34](9d8eb34))
* **redteam:** adversarial red-team gate subsystem
([#1986](#1986))
([#2026](#2026))
([d2207e9](d2207e9))
* self-extending toolkit
([#1995](#1995))
([#2035](#2035))
([5ffc545](5ffc545))
* stakes-aware model routing
([#1998](#1998))
([#2038](#2038))
([9b98312](9b98312))
* task-board entry adapter to live runtime
([#1963](#1963))
([#2023](#2023))
([a8f1eea](a8f1eea))
* virtual desktop tool and vision verifier gate
([#2031](#2031))
([dfe8b42](dfe8b42)),
closes [#1993](#1993)
* **workspace:** persistent project workspace + pluggable git backend +
per-project push queue
([#2021](#2021))
([ee58ee7](ee58ee7))


### Bug Fixes

* pin desktop Dockerfile by digest (Scorecard
[#309](#309)) + document
publish gap ([#2034](#2034))
([8fda188](8fda188))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: synthorg-repo-bot[bot] <279117679+synthorg-repo-bot[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Golden-company benchmark (eval spine)

1 participant