feat: structured debate with pro/con, confidence, early-stop (#144) by let-sunny · Pull Request #172 · let-sunny/canicode

let-sunny · 2026-03-29T09:14:26Z

Problem

캘리브레이션 파이프라인의 Critic이 proposals만 보고 판단함. Converter가 실제로 뭘 어려워했는지, 어떤 pixel gap이 있었는지, 과거 run에서 같은 룰이 어떻게 평가됐는지 모름.

또한 debate 결과가 logs/ (gitignored, ephemeral)에만 남아서 cross-run 학습이 안 됨.

Design decisions

1. Critic에게 구조화된 evidence 전달

Before: proposals only → Critic이 숫자만 보고 판단
After: proposals + converter's ruleImpactAssessment + gap analysis + prior cross-run evidence

왜: Critic이 "이 룰이 실제로 구현 난이도에 영향을 줬는지"를 판단하려면 Converter의 실제 경험이 필요함.

2. Pro/Con 구조화 출력

Before: { decision, reason } — 이유 하나로 찬반 구분 불가
After: { decision, confidence, pro: [...], con: [...], reason } — 명시적 찬반 분리

왜: 단일 reason만 있으면 한쪽 관점만 반영됨. 찬반을 명시적으로 분리하면 Arbitrator가 더 나은 판단 가능. 이 데이터가 evidence에 영속되면 다음 run에서도 활용.

3. Evidence enrichment — `data/`에 영속

Before: Critic 결과 → logs/debate.json (휘발) → 다음 run에서 못 봄
After: Critic 결과 → logs/debate.json + data/calibration-evidence.json (영속)

새 CLI command calibrate-enrich-evidence가 debate.json에서 Critic의 pro/con/confidence를 읽어서 data/calibration-evidence.json의 기존 엔트리를 업데이트.

4. Early-stop (Arbitrator 스킵)

모든 proposals가 high confidence로 REJECT되면 Arbitrator 실행 불필요. debate.json에 stoppingReason: "all-high-confidence-reject" 기록하고 종료.

5. Self-consistency (hold 결정)

Before: low confidence라도 Arbitrator가 적용하거나 기각
After: confidence: low → decision: "hold" — 적용하지 않고 evidence에만 누적

다음 run에서 같은 룰이 다시 제안되면 case count 올라가면서 confidence 자연 수렴. 기존 cross-run evidence 메커니즘 활용.

6. Convergence에 hold 반영

hold = "아직 결정 못 함" → not converged. fixture가 done/으로 이동하지 않고 다음 calibration에서 재평가.

Data flow

data/calibration-evidence.json (git-tracked, persistent)
  ↕ enrichCalibrationEvidence() writes back
  ↕ loadCalibrationEvidence() reads

logs/calibration/<run>/debate.json (gitignored, ephemeral)
  → per-run detail for debugging

Changed files

File	What
`evidence.ts` (contracts)	Schema: confidence, pro, con, decision fields
`evidence-collector.ts`	`enrichCalibrationEvidence()` + grouping with allPro/allCon
`critic.md`	Expanded input + structured pro/con output
`arbitrator.md`	hold logic + stoppingReason + revised 복원
`calibrate-loop.md`	Evidence gathering → Critic, early-stop, enrich step
`run-directory.ts`	hold in convergence, early-stop = converged
`fixture-management.ts`	`calibrate-enrich-evidence` CLI command
`cli/index.ts`	Register new command
`CLAUDE.md`	Documentation

Test plan

All 606 tests pass
Type check passes (exactOptionalPropertyTypes compliant)
Evidence schema backward compatible (new fields optional)
Self-review: evidence write gap, revised decision type, convergence hold, early-stop convergence
Run /calibrate-loop on a fixture to validate end-to-end (post-merge)

Closes #144

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Decisions now include per-item confidence and explicit decision types; low-confidence items are held (not applied) and reflected in convergence counts and summaries.
- Early-stop when all proposals are high-confidence rejections; finalized/early-stop actions written to debate state.
- CLI commands to gather, finalize, enrich, and persist proposed-rule lists and structured evidence added.
Chores
- Agent prompts, convergence logic, evidence schemas, CLI orchestration, and docs updated to support structured pro/con reviews and confidence.
Tests
- Expanded test suites covering evidence enrichment, convergence outcomes, CLI gather/finalize behavior, and run-directory parsing.

Phase 1-3 of calibration pipeline improvements: - Evidence schema: add confidence, pro, con, decision fields - Critic: receives converter assessment + gaps + prior evidence, outputs structured pro/con arguments per proposal - Arbitrator: holds low-confidence decisions instead of applying - Early-stop: skip Arbitrator when all proposals rejected with high confidence - Cross-run evidence enriched with pro/con for better future reviews - CLAUDE.md calibration section updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-03-29T09:14:35Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57bdfe3b-dcb3-4805-844b-d5bcc3e32020

📥 Commits

Reviewing files that changed from the base of the PR and between 3cdf582 and 16112aa.

📒 Files selected for processing (1)

src/cli/commands/internal/fixture-management.ts

📝 Walkthrough

Walkthrough

Critic now consumes structured evidence (converter impact, actionable gaps, prior cross-run evidence) and emits per-proposal pro/con + confidence. Added deterministic gather/finalize CLI steps (early-stop when all reviews are high-confidence REJECT). Arbitrator must HOLD low-confidence proposals; evidence enrichment/prune and schemas updated.

Changes

Cohort / File(s)	Summary
Agent Prompts & Docs `\.claude/agents/calibration/critic.md`, `\.claude/agents/calibration/arbitrator.md`, `CLAUDE.md`	Switched models to `claude-opus-4-6`; Critic prompt expanded to require converter impact, pixel-gap analysis, and prior evidence; Critic must output `pro`/`con` arrays and `confidence`; Arbitrator updated to annotate decision types and hold low-confidence proposals.
Calibration Loop Orchestration & Docs `\.claude/commands/calibrate-loop.md`	Added deterministic evidence-gather and finalize steps, introduced early-stop and hold semantics, and updated debate finalization and evidence enrichment/prune flow description.
CLI Commands & Helpers `src/cli/commands/internal/calibrate-debate.ts`, `src/cli/commands/internal/calibrate-debate.test.ts`, `src/cli/commands/internal/fixture-management.ts`, `src/cli/commands/internal/calibrate-evaluate.ts`, `src/cli/index.ts`	Added `gatherEvidence`, `loadProposedRuleIds`, `registerGatherEvidence`, `registerFinalizeDebate`, and `registerEvidenceEnrich`; tests for evidence parsing, rule-id extraction, early-stop and hold logic; `calibrate-evaluate` writes `proposed-rules.json`; CLI registers new internal commands.
Evidence Contracts `src/agents/contracts/evidence.ts`	Extended Zod schemas and exported types: optional `confidence: "high"
Evidence Collector `src/agents/evidence-collector.ts`, `src/agents/evidence-collector.test.ts`	`loadCalibrationEvidence` now aggregates deduplicated `allPro`/`allCon` and tracks `lastConfidence`/`lastDecision`; added `enrichCalibrationEvidence(reviews, fixture, evidencePath)` that updates matching entries and writes only when matches found; tests for enrichment, dedupe, and file-write behavior.
Run Directory / Convergence `src/agents/run-directory.ts`, `src/agents/run-directory.test.ts`	Added `CriticReviewSchema`, optional `stoppingReason` in `DebateResultSchema`; `ConvergenceSummary` includes `hold`; `checkConvergence` counts `hold`, supports early-stop when `stoppingReason` exists and `arbitrator` is absent; `isConverged` delegates to `checkConvergence`; tests for early-stop and hold cases.
Tests (integration/unit) `src/cli/commands/internal/calibrate-debate.test.ts`, `src/agents/evidence-collector.test.ts`, `src/agents/run-directory.test.ts`	New and updated tests cover evidence gathering, parsing, enrichment, deduplication, early-stop and hold convergence behaviors.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as CLI (calibrate-gather-evidence)
    participant Run as Run Directory
    participant Evidence as calibration-evidence.json
    participant Critic as Critic Agent
    participant Finalize as CLI (calibrate-finalize-debate)
    participant Arbitrator as Arbitrator Agent

    Run->>CLI: invoke calibrate-gather-evidence
    CLI->>Run: read conversion.json, gaps.json, summary.md
    CLI->>Evidence: load cross-run calibration-evidence.json
    CLI->>Run: write critic-evidence.json
    Run->>Critic: run Critic with proposals + critic-evidence.json
    Critic->>Run: write debate.json (reviews with pro/con/confidence)
    Run->>Finalize: invoke calibrate-finalize-debate
    Finalize->>Run: parse debate.json
    alt all reviews == REJECT && confidence == "high"
        Finalize->>Run: write stoppingReason "all-high-confidence-reject"
        Finalize-->>Run: output { action: "early-stop" }
    else
        Run->>Arbitrator: run Arbitrator with critic reviews
        Arbitrator->>Run: write arbitrator decisions (applied/revised/rejected/hold)
        alt any decision == "hold"
            Run->>Evidence: call enrichCalibrationEvidence (accumulate held reviews)
        else
            Run->>Evidence: record applied/revised decisions into evidence
        end
        Finalize-->>Run: output { action: "finalized", maybe stoppingReason }
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

refactor: split CLI index.ts into per-command modules (#47) #59 — Overlaps adding and registering internal calibration CLI commands and related CLI modules.
fix: calibration pipeline (retina visual-compare, evidence, convergence) #28 — Related changes to calibration evidence handling and Critic/Arbitrator flow/schema updates.
feat: fixture-done --dry-run + latest run auto-resolve (#30) #42 — Overlaps on run-directory convergence logic and earlier adjustments to checkConvergence/isConverged.

Poem

🐇 I hopped through gaps and notes of pro and con,

kept my confidence light when answers were drawn-on;
If doubt is deep I gently press hold,
saving whispers of proof for the next run to behold.
Tiny hops, clearer rules—one carrot at a time.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main changes: introducing structured debate with pro/con arguments, confidence levels, and early-stop mechanisms for the calibration pipeline.
Linked Issues check	✅ Passed	All coding requirements from `#144` are met: evidence schema extended with confidence/pro/con/decision fields, Critic receives structured inputs (conversion, gaps, prior evidence), outputs structured pro/con with confidence, early-stop on all-high-confidence REJECT implemented, low-confidence decisions mapped to 'hold', and stoppingReason recorded in debate.json.
Out of Scope Changes check	✅ Passed	All changes are scoped to requirements in `#144`: evidence contracts, Critic/Arbitrator enhancements, CLI commands (gather-evidence, finalize-debate, enrich-evidence), convergence logic updates, and documentation reflect the debate-based reliability framework without introducing unrelated functionality.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/calibration-debate-144

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Self-review fixes: 1. Evidence enrichment gap: add calibrate-enrich-evidence CLI command and enrichCalibrationEvidence() — writes Critic's pro/con/confidence back to data/calibration-evidence.json (was only in ephemeral logs/) 2. Restore "revised" decision type in Arbitrator (was accidentally dropped) 3. Convergence: "hold" decisions count as not-converged (need more evidence) 4. Early-stop convergence: stoppingReason in debate.json → converged=true Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses code review feedback: 1. isConverged now delegates to checkConvergence (single source of truth for early-stop / hold logic) 2. enrichCalibrationEvidence scoped by (ruleId, fixture) — no longer overwrites entries from other fixtures 3. stoppingReason canonical location: debate.json top level only 4. CLI calibrate-enrich-evidence uses process.exitCode = 1 on errors 5. Tests added: enrichCalibrationEvidence (3), checkConvergence early-stop, hold convergence (strict + lenient), isConverged delegation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Review round 2: A. stoppingReason added to DebateResultSchema — no more type cast B. enrichCalibrationEvidence warns when no entries match fixture C. calibrate-prune-evidence CLI now uses process.exitCode = 1 D. allPro/allCon deduplicated in loadCalibrationEvidence grouping F. stoppingReason canonical location commented in schema E (model selection) is a judgment call, not a code change — to evaluate later. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

These are the most critical judgment steps in the calibration pipeline. Structured pro/con + confidence reasoning benefits from stronger model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. CriticReviewSchema: confidence/pro/con as typed optional fields, removing Record<string, unknown> casts in fixture-management.ts 2. enrichCalibrationEvidence: early-return when no matches (skip I/O) 3. stoppingReason: omit for normal completion, only set for special cases 4. Test: pro/con deduplication in loadCalibrationEvidence Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ands Three deterministic tasks the orchestrator (LLM) was doing are now CLI commands: 1. calibrate-gather-evidence <runDir> Reads conversion.json, gaps.json, summary.md, calibration-evidence.json and writes a single critic-evidence.json for the Critic prompt. Was: orchestrator reads 3-4 files and extracts relevant parts (error-prone) 2. calibrate-finalize-debate <runDir> After Critic: checks early-stop (all REJECT + high confidence) After Arbitrator: determines stoppingReason (low-confidence-hold) Was: orchestrator evaluates conditions inline (can misjudge) calibrate-loop.md updated to use CLI commands instead of inline logic. Orchestrator now only: runs CLI commands + spawns LLM agents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

8 tests covering: - gather-evidence: conversion.json parsing, gap filtering, ruleId extraction - finalize-debate: early-stop detection, mixed reviews, hold detection, normal completion, missing debate.json Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- calibrate-evaluate now writes proposed-rules.json (deterministic ruleId list) - calibrate-gather-evidence reads proposed-rules.json first, falls back to summary.md regex (eliminates false positive risk) - Export gatherEvidence, loadProposedRuleIds, GatheredEvidence for direct unit testing - Tests rewritten to import functions directly (13 tests) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/commands/internal/calibrate-debate.test.ts`:
- Around line 111-128: The test duplicates the early-stop check; refactor so the
test uses the implementation's check instead of re-implementing it: extract the
early-stop predicate from registerFinalizeDebate (or export an existing helper)
as a named function (e.g., isEarlyStopOrAllHighConfidenceReject) and import that
into calibrate-debate.test.ts, then replace the local allHighConfidenceReject
computation with a call to that function using the parsed debate (from
parseDebateResult), ensuring the test asserts the function returns true; update
registerFinalizeDebate to call the same helper so behavior remains identical.

In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 145-153: parseDebateResult currently parses debate.json but the
caller in finalize-debate reads and parses the file again to preserve unknown
fields; change parseDebateResult to return a tuple/object containing both the
validated Debate result and the raw parsed Record (e.g., { validated, raw }) so
callers (like the code using debate and raw) can use the returned raw instead of
re-reading the file: update parseDebateResult signature and implementation to
read once and return both, then replace the second read/JSON.parse in
finalize-debate (the code that sets debatePath and raw) to use the raw value
from parseDebateResult and remove the redundant file read.

In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 167-175: The change incorrectly treats missing run directories and
a null parseDebateResult as errors; revert to the prior subagent-compatible
behavior by using console.log instead of console.error for the "Run directory
not found" and "No debate.json found — nothing to prune." messages and do not
set process.exitCode = 1 (leave exit code 0); locate the block that checks
runDir and the parseDebateResult(resolve(runDir)) call and adjust those two
branches to log to stdout and return without changing the exit code so internal
calibration subagents continue to parse stdout.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1184e0d5-37b0-4d71-b454-8933f9b23e80

📥 Commits

Reviewing files that changed from the base of the PR and between d9667ca and 4d15939.

📒 Files selected for processing (14)

.claude/agents/calibration/arbitrator.md
.claude/agents/calibration/critic.md
.claude/commands/calibrate-loop.md
CLAUDE.md
src/agents/contracts/evidence.ts
src/agents/evidence-collector.test.ts
src/agents/evidence-collector.ts
src/agents/run-directory.test.ts
src/agents/run-directory.ts
src/cli/commands/internal/calibrate-debate.test.ts
src/cli/commands/internal/calibrate-debate.ts
src/cli/commands/internal/calibrate-evaluate.ts
src/cli/commands/internal/fixture-management.ts
src/cli/index.ts

coderabbitai · 2026-03-29T09:56:52Z

src/cli/commands/internal/calibrate-debate.test.ts

+  it("detects early-stop when all critic reviews are high-confidence REJECT", () => {
+    writeFileSync(join(runDir, "debate.json"), JSON.stringify({
+      critic: {
+        summary: "rejected=2",
+        reviews: [
+          { ruleId: "a", decision: "REJECT", confidence: "high", pro: [], con: ["weak"], reason: "x" },
+          { ruleId: "b", decision: "REJECT", confidence: "high", pro: [], con: ["weak"], reason: "y" },
+        ],
+      },
+    }));
+
+    const debate = parseDebateResult(runDir)!;
+    const reviews = debate.critic!.reviews;
+    const allHighConfidenceReject = reviews.length > 0 && reviews.every((r) =>
+      r.decision.trim().toUpperCase() === "REJECT" && r.confidence === "high"
+    );
+    expect(allHighConfidenceReject).toBe(true);
+  });


🧹 Nitpick | 🔵 Trivial

Consider extracting the early-stop check logic to avoid duplication.

The test duplicates the early-stop condition logic (lines 124-126) that exists in registerFinalizeDebate. While acceptable for testing the parsed data structure, consider importing and testing the actual finalize logic to ensure the test stays in sync with implementation changes.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/commands/internal/calibrate-debate.test.ts` around lines 111 - 128, The test duplicates the early-stop check; refactor so the test uses the implementation's check instead of re-implementing it: extract the early-stop predicate from registerFinalizeDebate (or export an existing helper) as a named function (e.g., isEarlyStopOrAllHighConfidenceReject) and import that into calibrate-debate.test.ts, then replace the local allHighConfidenceReject computation with a call to that function using the parsed debate (from parseDebateResult), ensuring the test asserts the function returns true; update registerFinalizeDebate to call the same helper so behavior remains identical.

coderabbitai · 2026-03-29T09:56:52Z

src/cli/commands/internal/calibrate-debate.ts

+      const debate = parseDebateResult(dir);
+      if (!debate) {
+        console.error("No debate.json found");
+        process.exitCode = 1;
+        return;
+      }
+
+      const debatePath = join(dir, "debate.json");
+      const raw = JSON.parse(readFileSync(debatePath, "utf-8")) as Record<string, unknown>;


🧹 Nitpick | 🔵 Trivial

Minor: Double file read in finalize-debate.

parseDebateResult reads and parses debate.json, then line 153 reads and parses it again to get the raw object. While functionally correct (needed to preserve unknown fields via passthrough), this could be optimized by having parseDebateResult return both the validated result and raw object.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/commands/internal/calibrate-debate.ts` around lines 145 - 153, parseDebateResult currently parses debate.json but the caller in finalize-debate reads and parses the file again to preserve unknown fields; change parseDebateResult to return a tuple/object containing both the validated Debate result and the raw parsed Record (e.g., { validated, raw }) so callers (like the code using debate and raw) can use the returned raw instead of re-reading the file: update parseDebateResult signature and implementation to read once and return both, then replace the second read/JSON.parse in finalize-debate (the code that sets debatePath and raw) to use the raw value from parseDebateResult and remove the redundant file read.

src/cli/commands/internal/fixture-management.ts

Internal calibration commands are consumed by subagents that parse stdout. console.error + process.exitCode = 1 breaks this pattern. Reverted all internal commands to console.log + exit 0 for subagent compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 10-15: The GatheredEvidence interface uses broad unknowns; replace
ruleImpactAssessment, uncoveredStruggles, actionableGaps and priorEvidence with
the concrete contract types used elsewhere (e.g., the domain/evidence validation
contracts) instead of unknown to enable type safety; update the import list at
the top of this module to bring in the correct contracts (for example the
RuleImpactContract, UncoveredStruggleContract, ActionableGapContract and
PriorEvidenceContract or their actual names from your validation/contracts
folder) and change the property types to arrays or records of those contract
types accordingly, then run the project/type-checker to fix any call sites that
expect the narrower shapes.
- Around line 149-151: Wrap the second read/parse of debate.json (the
readFileSync/JSON.parse that sets raw after debatePath is created) in a
try/catch so a corrupt/changed file cannot throw and break orchestration; on
error, fall back to a safe default (e.g., an empty object or the result from
parseDebateResult(dir)), log the parse failure to stderr or processLogger, and
continue to emit the expected stdout JSON action and exit with code 0.
Specifically modify the block that reads debatePath / assigns raw so exceptions
are caught, raw is always defined, and control flow continues (do not re-throw).

In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 147-149: The enrichment step drops "HOLD" decisions because the
check only persists APPROVE/REJECT/REVISE; update the logic that reads
r.decision (variable dec) to also accept and assign "HOLD" to entry.decision (or
simply assign dec after validating against the allowed decision union), and
ensure you update the decision unions/schemas in the contracts and the evidence
collector to include "HOLD" so types/validation, serialization, and downstream
consumers accept the new value.
- Around line 123-124: Validate the CLI input before resolving: inside the
.action((runDir: string) => { ... }) callback use a Zod schema (e.g., a
non-empty string/path schema) to parse or safeParse runDir (import z from
'zod'), handle validation failure (log/throw or exit) and only then call
resolve(validatedDir) to produce resolvedDir; update references to runDir ->
validatedDir and ensure any downstream logic uses the validated value.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 942663e1-bc37-423f-8eae-ed5c47132640

📥 Commits

Reviewing files that changed from the base of the PR and between 4d15939 and 0e4c16b.

📒 Files selected for processing (2)

src/cli/commands/internal/calibrate-debate.ts
src/cli/commands/internal/fixture-management.ts

coderabbitai · 2026-03-29T10:04:18Z

src/cli/commands/internal/calibrate-debate.ts

+export interface GatheredEvidence {
+  ruleImpactAssessment: unknown[];
+  uncoveredStruggles: unknown[];
+  actionableGaps: unknown[];
+  priorEvidence: Record<string, unknown>;
+}


🧹 Nitpick | 🔵 Trivial

Prefer evidence contract types over broad unknown containers.

GatheredEvidence.priorEvidence (and related fields) can leverage existing contract types for better maintainability and safer downstream usage.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/commands/internal/calibrate-debate.ts` around lines 10 - 15, The GatheredEvidence interface uses broad unknowns; replace ruleImpactAssessment, uncoveredStruggles, actionableGaps and priorEvidence with the concrete contract types used elsewhere (e.g., the domain/evidence validation contracts) instead of unknown to enable type safety; update the import list at the top of this module to bring in the correct contracts (for example the RuleImpactContract, UncoveredStruggleContract, ActionableGapContract and PriorEvidenceContract or their actual names from your validation/contracts folder) and change the property types to arrays or records of those contract types accordingly, then run the project/type-checker to fix any call sites that expect the narrower shapes.

src/cli/commands/internal/calibrate-debate.ts

coderabbitai · 2026-03-29T10:04:19Z

src/cli/commands/internal/fixture-management.ts

+    .action((runDir: string) => {
+      const resolvedDir = resolve(runDir);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and the specific file mentioned find src/cli -type f -name "*.ts" | head -20

Repository: let-sunny/canicode

Length of output: 714

🏁 Script executed:

# Read the fixture-management.ts file to understand context cat -n src/cli/commands/internal/fixture-management.ts

Repository: let-sunny/canicode

Length of output: 8415

🏁 Script executed:

# Check if there are contracts/schemas defined for CLI inputs find . -path ./node_modules -prune -o -type f -name "*schema*" -o -name "*contract*" | grep -E "\.(ts|tsx)$" | head -20

Repository: let-sunny/canicode

Length of output: 44

🏁 Script executed:

# Search for Zod usage in CLI files to understand validation patterns rg "z\." src/cli --type ts -B 2 -A 2 | head -50

Repository: let-sunny/canicode

Length of output: 2969

Validate CLI input with Zod before resolving paths.

runDir is external CLI input and must be schema-validated per coding guidelines. The code currently calls resolve(runDir) directly without validation.

Proposed fix

+import { z } from "zod"; ... .action((runDir: string) => { - const resolvedDir = resolve(runDir); + const args = z.object({ runDir: z.string().min(1) }).parse({ runDir }); + const resolvedDir = resolve(args.runDir);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

.action((runDir: string) => {

const resolvedDir = resolve(runDir);

.action((runDir: string) => {

const args = z.object({ runDir: z.string().min(1) }).parse({ runDir });

const resolvedDir = resolve(args.runDir);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/commands/internal/fixture-management.ts` around lines 123 - 124, Validate the CLI input before resolving: inside the .action((runDir: string) => { ... }) callback use a Zod schema (e.g., a non-empty string/path schema) to parse or safeParse runDir (import z from 'zod'), handle validation failure (log/throw or exit) and only then call resolve(validatedDir) to produce resolvedDir; update references to runDir -> validatedDir and ensure any downstream logic uses the validated value.

src/cli/commands/internal/fixture-management.ts

- finalize-debate: try/catch on second debate.json read, fallback to {action: "continue"} on parse failure (subagent safety) - HOLD added to decision enum in evidence schema, enrichCalibrationEvidence signature, and CLI enrich command (was silently dropped) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 30-53: Replace the ad-hoc JSON.parse + Array.isArray/type checks
for conversion.json, gaps.json (and proposed-rules.json per the comment) with
Zod validation: import the existing UncoveredStruggle and GapEntry Zod schemas
from src/agents/contracts and create a small top-level ConversionSchema (e.g., {
ruleImpactAssessment: z.array(z.unknown()).optional(), uncoveredStruggles:
z.array(UncoveredStruggle).optional() }) and a GapsSchema ({ gaps:
z.array(GapEntry) }). Read convPath/gapsPath the same way, parse JSON, then run
z.parse/z.safeParse and assign only validated data to
result.ruleImpactAssessment, result.uncoveredStruggles and result.actionableGaps
(filtering gaps by actionable is fine but apply GapEntry schema first). Ensure
malformed files are caught and ignored (or logged) by checking parse results
before mutating result.

In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 147-148: The decision value from critic reviews is compared with
exact-case strings and can be lost; normalize r.decision before checking and
storing by trimming and uppercasing it (e.g., use
r.decision?.trim().toUpperCase()) then compare against
"APPROVE"/"REJECT"/"REVISE"/"HOLD" and assign the normalized value to
entry.decision; update the logic around the dec variable in
registerFinalizeDebate / the block using r.decision so all valid lowercase/mixed
inputs are persisted consistently.
- Around line 135-140: parseRunDirName(basename(resolvedDir)) never yields a
falsy name, so the current check for !fixture is dead; instead validate that the
run directory ends with the expected `--timestamp` suffix by checking the parsed
timestamp (from parseRunDirName) or matching the basename against the
`--YYYY-...` pattern before treating it as a fixture. Update the code around
parseRunDirName / basename(resolvedDir) to reject basenames where the parsed
timestamp is empty or the regex check fails (log an error and return) so only
properly suffixed run directories are accepted for enrichment.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a10c1791-ef9b-4ed2-b11f-3b689cd01e4b

📥 Commits

Reviewing files that changed from the base of the PR and between 0e4c16b and 3cdf582.

📒 Files selected for processing (4)

src/agents/contracts/evidence.ts
src/agents/evidence-collector.ts
src/cli/commands/internal/calibrate-debate.ts
src/cli/commands/internal/fixture-management.ts

coderabbitai · 2026-03-29T10:18:38Z

src/cli/commands/internal/calibrate-debate.ts

+  const convPath = join(runDir, "conversion.json");
+  if (existsSync(convPath)) {
+    try {
+      const conv = JSON.parse(readFileSync(convPath, "utf-8")) as Record<string, unknown>;
+      if (Array.isArray(conv["ruleImpactAssessment"])) {
+        result.ruleImpactAssessment = conv["ruleImpactAssessment"];
+      }
+      if (Array.isArray(conv["uncoveredStruggles"])) {
+        result.uncoveredStruggles = conv["uncoveredStruggles"];
+      }
+    } catch { /* ignore malformed */ }
+  }
+
+  // 2. gaps.json → actionable gaps
+  const gapsPath = join(runDir, "gaps.json");
+  if (existsSync(gapsPath)) {
+    try {
+      const gaps = JSON.parse(readFileSync(gapsPath, "utf-8")) as Record<string, unknown>;
+      const gapList = Array.isArray(gaps["gaps"]) ? gaps["gaps"] : [];
+      result.actionableGaps = gapList.filter(
+        (g): g is Record<string, unknown> =>
+          typeof g === "object" && g !== null && (g as Record<string, unknown>)["actionable"] === true
+      );
+    } catch { /* ignore malformed */ }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and the specific lines mentioned find . -name "calibrate-debate.ts" -type f

Repository: let-sunny/canicode

Length of output: 109

🏁 Script executed:

# Check if the file exists and examine its content head -150 src/cli/commands/internal/calibrate-debate.ts | cat -n

Repository: let-sunny/canicode

Length of output: 6312

🏁 Script executed:

# Search for Zod schema files related to the artifacts mentioned find . -path "*/contracts/*" -name "*.ts" -o -name "*.js" | head -20

Repository: let-sunny/canicode

Length of output: 622

🏁 Script executed:

# Look for any existing schemas for conversion, gaps, or rules rg -l "conversion|gaps|proposed.*rules" --type ts --type js -- contracts/

Repository: let-sunny/canicode

Length of output: 112

🏁 Script executed:

# Check the contracts directory for relevant schemas ls -la src/agents/contracts/

Repository: let-sunny/canicode

Length of output: 691

🏁 Script executed:

# Look for any schema definitions related to conversion, gaps, or rules rg "conversion|gaps" src/agents/contracts/ -A 5

Repository: let-sunny/canicode

Length of output: 1570

🏁 Script executed:

# Check if there are any Zod schemas defined in the contracts rg "z\." src/agents/contracts/ | head -20

Repository: let-sunny/canicode

Length of output: 1568

🏁 Script executed:

# Look at the conversion-agent and gap-analyzer contracts cat src/agents/contracts/conversion-agent.ts

Repository: let-sunny/canicode

Length of output: 1067

🏁 Script executed:

# Check gap-analyzer contract cat src/agents/contracts/gap-analyzer.ts

Repository: let-sunny/canicode

Length of output: 1249

🏁 Script executed:

# Check how these schemas are imported and what's available cat src/agents/contracts/conversion-agent.ts

Repository: let-sunny/canicode

Length of output: 1067

🏁 Script executed:

# Verify that no validation is actually happening in gatherEvidence grep -A 20 "function gatherEvidence" src/cli/commands/internal/calibrate-debate.ts | grep -E "(parse|validate|\.parse)"

Repository: let-sunny/canicode

Length of output: 152

🏁 Script executed:

# Check if conversion.json is actually structured as ConversionRecord[] or has top-level fields rg "conversion\.json" src/ -B 2 -A 2 --type ts

Repository: let-sunny/canicode

Length of output: 3139

🏁 Script executed:

# Verify the actual structure - check if conversion.json contains ruleImpactAssessment array rg "ruleImpactAssessment" src/ --type ts

Repository: let-sunny/canicode

Length of output: 1537

🏁 Script executed:

# Check if there's a schema for the top-level conversion.json structure rg "ConversionOutput|RuleImpactAssessment" src/agents/contracts/ --type ts

Repository: let-sunny/canicode

Length of output: 44

🏁 Script executed:

# Look at the test to understand what the conversion.json structure should be cat -n src/cli/commands/internal/calibrate-debate.test.ts | grep -A 30 "extracts ruleImpactAssessment"

Repository: let-sunny/canicode

Length of output: 1469

🏁 Script executed:

# Check if proposed-rules.json is validated or has a schema rg "ProposedRules|proposed.*rules" src/agents/contracts/ --type ts

Repository: let-sunny/canicode

Length of output: 44

🏁 Script executed:

# See how the gathered evidence is used downstream to assess impact rg "critic-evidence" src/ --type ts -B 2 -A 2

Repository: let-sunny/canicode

Length of output: 697

🏁 Script executed:

# Let me check what happens with the evidence after it's gathered - does orchestrator validate it? rg -A 10 "critic-evidence" src/agents/orchestrator.ts --type ts

Repository: let-sunny/canicode

Length of output: 44

🏁 Script executed:

# Check if the conversion output structure is documented anywhere rg "ruleImpactAssessment|UncoveredStruggle" src/agents/ -B 5 --type ts | head -50

Repository: let-sunny/canicode

Length of output: 2431

🏁 Script executed:

# See if there's any downstream validation in the orchestrator cat src/agents/orchestrator.ts | grep -A 30 "ruleImpactAssessment"

Repository: let-sunny/canicode

Length of output: 1795

🏁 Script executed:

# Double-check: are the schemas actually being used somewhere in gatherEvidence? grep -n "UncoveredStruggleSchema\|GapAnalyzerOutputSchema\|GapEntrySchema" src/cli/commands/internal/calibrate-debate.ts

Repository: let-sunny/canicode

Length of output: 44

🏁 Script executed:

# Verify that the coding guideline applies to this file head -5 src/cli/commands/internal/calibrate-debate.ts

Repository: let-sunny/canicode

Length of output: 274

Validate external JSON artifacts with Zod before feeding them into critic-evidence.json.

Lines 30–54 and 76–81 parse conversion.json, gaps.json, and proposed-rules.json using only JSON.parse and shallow Array.isArray or property-type checks. This allows malformed rows to flow directly into the critic payload. The coding guidelines require "Validate all external inputs with Zod schemas" for src/**/*.ts. Schemas for UncoveredStruggle and GapEntry already exist in src/agents/contracts/; add validation using these schemas and create a wrapper schema for the top-level conversion.json structure.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/commands/internal/calibrate-debate.ts` around lines 30 - 53, Replace the ad-hoc JSON.parse + Array.isArray/type checks for conversion.json, gaps.json (and proposed-rules.json per the comment) with Zod validation: import the existing UncoveredStruggle and GapEntry Zod schemas from src/agents/contracts and create a small top-level ConversionSchema (e.g., { ruleImpactAssessment: z.array(z.unknown()).optional(), uncoveredStruggles: z.array(UncoveredStruggle).optional() }) and a GapsSchema ({ gaps: z.array(GapEntry) }). Read convPath/gapsPath the same way, parse JSON, then run z.parse/z.safeParse and assign only validated data to result.ruleImpactAssessment, result.uncoveredStruggles and result.actionableGaps (filtering gaps by actionable is fine but apply GapEntry schema first). Ensure malformed files are caught and ignored (or logged) by checking parse results before mutating result.

src/cli/commands/internal/fixture-management.ts

coderabbitai · 2026-03-29T10:18:38Z

src/cli/commands/internal/fixture-management.ts

+        const dec = r.decision;
+        if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") entry.decision = dec;


⚠️ Potential issue | 🟠 Major

Normalize Critic decision casing before persisting it.

critic.reviews[].decision is parsed as a free-form string, and registerFinalizeDebate() already normalizes it with trim().toUpperCase(). This exact-case check will silently drop valid "reject", "hold", etc., so cross-run evidence can lose the new decision field even when the same debate would be handled correctly elsewhere.

Suggested fix

- const dec = r.decision; - if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") entry.decision = dec; + const dec = r.decision.trim().toUpperCase(); + if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") { + entry.decision = dec; + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/commands/internal/fixture-management.ts` around lines 147 - 148, The decision value from critic reviews is compared with exact-case strings and can be lost; normalize r.decision before checking and storing by trimming and uppercasing it (e.g., use r.decision?.trim().toUpperCase()) then compare against "APPROVE"/"REJECT"/"REVISE"/"HOLD" and assign the normalized value to entry.decision; update the logic around the dec variable in registerFinalizeDebate / the block using r.decision so all valid lowercase/mixed inputs are persisted consistently.

- enrich-evidence: check timestamp suffix instead of dead !fixture check (parseRunDirName never returns falsy name) - enrich-evidence: normalize decision to uppercase before persisting (handles mixed-case "reject"/"Reject" from Critic) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

let-sunny and others added 8 commits March 29, 2026 18:22

chore: upgrade Critic and Arbitrator to opus-4-6

cc3e4ba

These are the most critical judgment steps in the calibration pipeline. Structured pro/con + confidence reasoning benefits from stronger model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

let-sunny marked this pull request as ready for review March 29, 2026 09:50

coderabbitai bot reviewed Mar 29, 2026

View reviewed changes

let-sunny merged commit 54aa6a0 into main Mar 29, 2026
3 checks passed

This was referenced Mar 29, 2026

feat: rule-discovery prompt fixes + deterministic CLI extraction (#145) #173

Merged

feat: evidence category validation + gap collection + CLI arg validation (#174, #175) #176

Merged

let-sunny deleted the feat/calibration-debate-144 branch March 29, 2026 13:23

		.action((runDir: string) => {
		const resolvedDir = resolve(runDir);

		const dec = r.decision;
		if (dec === "APPROVE" \|\| dec === "REJECT" \|\| dec === "REVISE" \|\| dec === "HOLD") entry.decision = dec;

Conversation

let-sunny commented Mar 29, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Design decisions

1. Critic에게 구조화된 evidence 전달

2. Pro/Con 구조화 출력

3. Evidence enrichment — data/에 영속

4. Early-stop (Arbitrator 스킵)

5. Self-consistency (hold 결정)

6. Convergence에 hold 반영

Data flow

Changed files

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

let-sunny commented Mar 29, 2026 •

edited by coderabbitai bot

Loading

3. Evidence enrichment — `data/`에 영속

coderabbitai bot commented Mar 29, 2026 •

edited

Loading