Skip to content

feat: structured debate with pro/con, confidence, early-stop (#144)#172

Merged
let-sunny merged 12 commits intomainfrom
feat/calibration-debate-144
Mar 29, 2026
Merged

feat: structured debate with pro/con, confidence, early-stop (#144)#172
let-sunny merged 12 commits intomainfrom
feat/calibration-debate-144

Conversation

@let-sunny
Copy link
Copy Markdown
Owner

@let-sunny let-sunny commented Mar 29, 2026

Problem

캘리브레이션 파이프라인의 Critic이 proposals만 보고 판단함. Converter가 실제로 뭘 어려워했는지, 어떤 pixel gap이 있었는지, 과거 run에서 같은 룰이 어떻게 평가됐는지 모름.

또한 debate 결과가 logs/ (gitignored, ephemeral)에만 남아서 cross-run 학습이 안 됨.

Design decisions

1. Critic에게 구조화된 evidence 전달

Before: proposals only → Critic이 숫자만 보고 판단
After: proposals + converter's ruleImpactAssessment + gap analysis + prior cross-run evidence

왜: Critic이 "이 룰이 실제로 구현 난이도에 영향을 줬는지"를 판단하려면 Converter의 실제 경험이 필요함.

2. Pro/Con 구조화 출력

Before: { decision, reason } — 이유 하나로 찬반 구분 불가
After: { decision, confidence, pro: [...], con: [...], reason } — 명시적 찬반 분리

왜: 단일 reason만 있으면 한쪽 관점만 반영됨. 찬반을 명시적으로 분리하면 Arbitrator가 더 나은 판단 가능. 이 데이터가 evidence에 영속되면 다음 run에서도 활용.

3. Evidence enrichment — data/에 영속

Before: Critic 결과 → logs/debate.json (휘발) → 다음 run에서 못 봄
After: Critic 결과 → logs/debate.json + data/calibration-evidence.json (영속)

새 CLI command calibrate-enrich-evidence가 debate.json에서 Critic의 pro/con/confidence를 읽어서 data/calibration-evidence.json의 기존 엔트리를 업데이트.

4. Early-stop (Arbitrator 스킵)

모든 proposals가 high confidence로 REJECT되면 Arbitrator 실행 불필요. debate.json에 stoppingReason: "all-high-confidence-reject" 기록하고 종료.

5. Self-consistency (hold 결정)

Before: low confidence라도 Arbitrator가 적용하거나 기각
After: confidence: lowdecision: "hold" — 적용하지 않고 evidence에만 누적

다음 run에서 같은 룰이 다시 제안되면 case count 올라가면서 confidence 자연 수렴. 기존 cross-run evidence 메커니즘 활용.

6. Convergence에 hold 반영

hold = "아직 결정 못 함" → not converged. fixture가 done/으로 이동하지 않고 다음 calibration에서 재평가.

Data flow

data/calibration-evidence.json (git-tracked, persistent)
  ↕ enrichCalibrationEvidence() writes back
  ↕ loadCalibrationEvidence() reads

logs/calibration/<run>/debate.json (gitignored, ephemeral)
  → per-run detail for debugging

Changed files

File What
evidence.ts (contracts) Schema: confidence, pro, con, decision fields
evidence-collector.ts enrichCalibrationEvidence() + grouping with allPro/allCon
critic.md Expanded input + structured pro/con output
arbitrator.md hold logic + stoppingReason + revised 복원
calibrate-loop.md Evidence gathering → Critic, early-stop, enrich step
run-directory.ts hold in convergence, early-stop = converged
fixture-management.ts calibrate-enrich-evidence CLI command
cli/index.ts Register new command
CLAUDE.md Documentation

Test plan

  • All 606 tests pass
  • Type check passes (exactOptionalPropertyTypes compliant)
  • Evidence schema backward compatible (new fields optional)
  • Self-review: evidence write gap, revised decision type, convergence hold, early-stop convergence
  • Run /calibrate-loop on a fixture to validate end-to-end (post-merge)

Closes #144

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Decisions now include per-item confidence and explicit decision types; low-confidence items are held (not applied) and reflected in convergence counts and summaries.
    • Early-stop when all proposals are high-confidence rejections; finalized/early-stop actions written to debate state.
    • CLI commands to gather, finalize, enrich, and persist proposed-rule lists and structured evidence added.
  • Chores

    • Agent prompts, convergence logic, evidence schemas, CLI orchestration, and docs updated to support structured pro/con reviews and confidence.
  • Tests

    • Expanded test suites covering evidence enrichment, convergence outcomes, CLI gather/finalize behavior, and run-directory parsing.

Phase 1-3 of calibration pipeline improvements:

- Evidence schema: add confidence, pro, con, decision fields
- Critic: receives converter assessment + gaps + prior evidence,
  outputs structured pro/con arguments per proposal
- Arbitrator: holds low-confidence decisions instead of applying
- Early-stop: skip Arbitrator when all proposals rejected with
  high confidence
- Cross-run evidence enriched with pro/con for better future reviews
- CLAUDE.md calibration section updated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57bdfe3b-dcb3-4805-844b-d5bcc3e32020

📥 Commits

Reviewing files that changed from the base of the PR and between 3cdf582 and 16112aa.

📒 Files selected for processing (1)
  • src/cli/commands/internal/fixture-management.ts

📝 Walkthrough

Walkthrough

Critic now consumes structured evidence (converter impact, actionable gaps, prior cross-run evidence) and emits per-proposal pro/con + confidence. Added deterministic gather/finalize CLI steps (early-stop when all reviews are high-confidence REJECT). Arbitrator must HOLD low-confidence proposals; evidence enrichment/prune and schemas updated.

Changes

Cohort / File(s) Summary
Agent Prompts & Docs
\.claude/agents/calibration/critic.md, \.claude/agents/calibration/arbitrator.md, CLAUDE.md
Switched models to claude-opus-4-6; Critic prompt expanded to require converter impact, pixel-gap analysis, and prior evidence; Critic must output pro/con arrays and confidence; Arbitrator updated to annotate decision types and hold low-confidence proposals.
Calibration Loop Orchestration & Docs
\.claude/commands/calibrate-loop.md
Added deterministic evidence-gather and finalize steps, introduced early-stop and hold semantics, and updated debate finalization and evidence enrichment/prune flow description.
CLI Commands & Helpers
src/cli/commands/internal/calibrate-debate.ts, src/cli/commands/internal/calibrate-debate.test.ts, src/cli/commands/internal/fixture-management.ts, src/cli/commands/internal/calibrate-evaluate.ts, src/cli/index.ts
Added gatherEvidence, loadProposedRuleIds, registerGatherEvidence, registerFinalizeDebate, and registerEvidenceEnrich; tests for evidence parsing, rule-id extraction, early-stop and hold logic; calibrate-evaluate writes proposed-rules.json; CLI registers new internal commands.
Evidence Contracts
src/agents/contracts/evidence.ts
Extended Zod schemas and exported types: optional `confidence: "high"
Evidence Collector
src/agents/evidence-collector.ts, src/agents/evidence-collector.test.ts
loadCalibrationEvidence now aggregates deduplicated allPro/allCon and tracks lastConfidence/lastDecision; added enrichCalibrationEvidence(reviews, fixture, evidencePath) that updates matching entries and writes only when matches found; tests for enrichment, dedupe, and file-write behavior.
Run Directory / Convergence
src/agents/run-directory.ts, src/agents/run-directory.test.ts
Added CriticReviewSchema, optional stoppingReason in DebateResultSchema; ConvergenceSummary includes hold; checkConvergence counts hold, supports early-stop when stoppingReason exists and arbitrator is absent; isConverged delegates to checkConvergence; tests for early-stop and hold cases.
Tests (integration/unit)
src/cli/commands/internal/calibrate-debate.test.ts, src/agents/evidence-collector.test.ts, src/agents/run-directory.test.ts
New and updated tests cover evidence gathering, parsing, enrichment, deduplication, early-stop and hold convergence behaviors.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as CLI (calibrate-gather-evidence)
    participant Run as Run Directory
    participant Evidence as calibration-evidence.json
    participant Critic as Critic Agent
    participant Finalize as CLI (calibrate-finalize-debate)
    participant Arbitrator as Arbitrator Agent

    Run->>CLI: invoke calibrate-gather-evidence
    CLI->>Run: read conversion.json, gaps.json, summary.md
    CLI->>Evidence: load cross-run calibration-evidence.json
    CLI->>Run: write critic-evidence.json
    Run->>Critic: run Critic with proposals + critic-evidence.json
    Critic->>Run: write debate.json (reviews with pro/con/confidence)
    Run->>Finalize: invoke calibrate-finalize-debate
    Finalize->>Run: parse debate.json
    alt all reviews == REJECT && confidence == "high"
        Finalize->>Run: write stoppingReason "all-high-confidence-reject"
        Finalize-->>Run: output { action: "early-stop" }
    else
        Run->>Arbitrator: run Arbitrator with critic reviews
        Arbitrator->>Run: write arbitrator decisions (applied/revised/rejected/hold)
        alt any decision == "hold"
            Run->>Evidence: call enrichCalibrationEvidence (accumulate held reviews)
        else
            Run->>Evidence: record applied/revised decisions into evidence
        end
        Finalize-->>Run: output { action: "finalized", maybe stoppingReason }
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐇 I hopped through gaps and notes of pro and con,

kept my confidence light when answers were drawn-on;
If doubt is deep I gently press hold,
saving whispers of proof for the next run to behold.
Tiny hops, clearer rules—one carrot at a time.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main changes: introducing structured debate with pro/con arguments, confidence levels, and early-stop mechanisms for the calibration pipeline.
Linked Issues check ✅ Passed All coding requirements from #144 are met: evidence schema extended with confidence/pro/con/decision fields, Critic receives structured inputs (conversion, gaps, prior evidence), outputs structured pro/con with confidence, early-stop on all-high-confidence REJECT implemented, low-confidence decisions mapped to 'hold', and stoppingReason recorded in debate.json.
Out of Scope Changes check ✅ Passed All changes are scoped to requirements in #144: evidence contracts, Critic/Arbitrator enhancements, CLI commands (gather-evidence, finalize-debate, enrich-evidence), convergence logic updates, and documentation reflect the debate-based reliability framework without introducing unrelated functionality.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/calibration-debate-144

Comment @coderabbitai help to get the list of available commands and usage tips.

let-sunny and others added 8 commits March 29, 2026 18:22
Self-review fixes:

1. Evidence enrichment gap: add calibrate-enrich-evidence CLI command
   and enrichCalibrationEvidence() — writes Critic's pro/con/confidence
   back to data/calibration-evidence.json (was only in ephemeral logs/)

2. Restore "revised" decision type in Arbitrator (was accidentally dropped)

3. Convergence: "hold" decisions count as not-converged (need more evidence)

4. Early-stop convergence: stoppingReason in debate.json → converged=true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code review feedback:

1. isConverged now delegates to checkConvergence (single source of truth
   for early-stop / hold logic)
2. enrichCalibrationEvidence scoped by (ruleId, fixture) — no longer
   overwrites entries from other fixtures
3. stoppingReason canonical location: debate.json top level only
4. CLI calibrate-enrich-evidence uses process.exitCode = 1 on errors
5. Tests added: enrichCalibrationEvidence (3), checkConvergence early-stop,
   hold convergence (strict + lenient), isConverged delegation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review round 2:

A. stoppingReason added to DebateResultSchema — no more type cast
B. enrichCalibrationEvidence warns when no entries match fixture
C. calibrate-prune-evidence CLI now uses process.exitCode = 1
D. allPro/allCon deduplicated in loadCalibrationEvidence grouping
F. stoppingReason canonical location commented in schema

E (model selection) is a judgment call, not a code change — to evaluate later.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are the most critical judgment steps in the calibration pipeline.
Structured pro/con + confidence reasoning benefits from stronger model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. CriticReviewSchema: confidence/pro/con as typed optional fields,
   removing Record<string, unknown> casts in fixture-management.ts
2. enrichCalibrationEvidence: early-return when no matches (skip I/O)
3. stoppingReason: omit for normal completion, only set for special cases
4. Test: pro/con deduplication in loadCalibrationEvidence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ands

Three deterministic tasks the orchestrator (LLM) was doing are now CLI commands:

1. calibrate-gather-evidence <runDir>
   Reads conversion.json, gaps.json, summary.md, calibration-evidence.json
   and writes a single critic-evidence.json for the Critic prompt.
   Was: orchestrator reads 3-4 files and extracts relevant parts (error-prone)

2. calibrate-finalize-debate <runDir>
   After Critic: checks early-stop (all REJECT + high confidence)
   After Arbitrator: determines stoppingReason (low-confidence-hold)
   Was: orchestrator evaluates conditions inline (can misjudge)

calibrate-loop.md updated to use CLI commands instead of inline logic.
Orchestrator now only: runs CLI commands + spawns LLM agents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tests covering:
- gather-evidence: conversion.json parsing, gap filtering, ruleId extraction
- finalize-debate: early-stop detection, mixed reviews, hold detection,
  normal completion, missing debate.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- calibrate-evaluate now writes proposed-rules.json (deterministic ruleId list)
- calibrate-gather-evidence reads proposed-rules.json first, falls back to
  summary.md regex (eliminates false positive risk)
- Export gatherEvidence, loadProposedRuleIds, GatheredEvidence for direct
  unit testing
- Tests rewritten to import functions directly (13 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@let-sunny let-sunny marked this pull request as ready for review March 29, 2026 09:50
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/commands/internal/calibrate-debate.test.ts`:
- Around line 111-128: The test duplicates the early-stop check; refactor so the
test uses the implementation's check instead of re-implementing it: extract the
early-stop predicate from registerFinalizeDebate (or export an existing helper)
as a named function (e.g., isEarlyStopOrAllHighConfidenceReject) and import that
into calibrate-debate.test.ts, then replace the local allHighConfidenceReject
computation with a call to that function using the parsed debate (from
parseDebateResult), ensuring the test asserts the function returns true; update
registerFinalizeDebate to call the same helper so behavior remains identical.

In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 145-153: parseDebateResult currently parses debate.json but the
caller in finalize-debate reads and parses the file again to preserve unknown
fields; change parseDebateResult to return a tuple/object containing both the
validated Debate result and the raw parsed Record (e.g., { validated, raw }) so
callers (like the code using debate and raw) can use the returned raw instead of
re-reading the file: update parseDebateResult signature and implementation to
read once and return both, then replace the second read/JSON.parse in
finalize-debate (the code that sets debatePath and raw) to use the raw value
from parseDebateResult and remove the redundant file read.

In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 167-175: The change incorrectly treats missing run directories and
a null parseDebateResult as errors; revert to the prior subagent-compatible
behavior by using console.log instead of console.error for the "Run directory
not found" and "No debate.json found — nothing to prune." messages and do not
set process.exitCode = 1 (leave exit code 0); locate the block that checks
runDir and the parseDebateResult(resolve(runDir)) call and adjust those two
branches to log to stdout and return without changing the exit code so internal
calibration subagents continue to parse stdout.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1184e0d5-37b0-4d71-b454-8933f9b23e80

📥 Commits

Reviewing files that changed from the base of the PR and between d9667ca and 4d15939.

📒 Files selected for processing (14)
  • .claude/agents/calibration/arbitrator.md
  • .claude/agents/calibration/critic.md
  • .claude/commands/calibrate-loop.md
  • CLAUDE.md
  • src/agents/contracts/evidence.ts
  • src/agents/evidence-collector.test.ts
  • src/agents/evidence-collector.ts
  • src/agents/run-directory.test.ts
  • src/agents/run-directory.ts
  • src/cli/commands/internal/calibrate-debate.test.ts
  • src/cli/commands/internal/calibrate-debate.ts
  • src/cli/commands/internal/calibrate-evaluate.ts
  • src/cli/commands/internal/fixture-management.ts
  • src/cli/index.ts

Comment on lines +111 to +128
it("detects early-stop when all critic reviews are high-confidence REJECT", () => {
writeFileSync(join(runDir, "debate.json"), JSON.stringify({
critic: {
summary: "rejected=2",
reviews: [
{ ruleId: "a", decision: "REJECT", confidence: "high", pro: [], con: ["weak"], reason: "x" },
{ ruleId: "b", decision: "REJECT", confidence: "high", pro: [], con: ["weak"], reason: "y" },
],
},
}));

const debate = parseDebateResult(runDir)!;
const reviews = debate.critic!.reviews;
const allHighConfidenceReject = reviews.length > 0 && reviews.every((r) =>
r.decision.trim().toUpperCase() === "REJECT" && r.confidence === "high"
);
expect(allHighConfidenceReject).toBe(true);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider extracting the early-stop check logic to avoid duplication.

The test duplicates the early-stop condition logic (lines 124-126) that exists in registerFinalizeDebate. While acceptable for testing the parsed data structure, consider importing and testing the actual finalize logic to ensure the test stays in sync with implementation changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/commands/internal/calibrate-debate.test.ts` around lines 111 - 128,
The test duplicates the early-stop check; refactor so the test uses the
implementation's check instead of re-implementing it: extract the early-stop
predicate from registerFinalizeDebate (or export an existing helper) as a named
function (e.g., isEarlyStopOrAllHighConfidenceReject) and import that into
calibrate-debate.test.ts, then replace the local allHighConfidenceReject
computation with a call to that function using the parsed debate (from
parseDebateResult), ensuring the test asserts the function returns true; update
registerFinalizeDebate to call the same helper so behavior remains identical.

Comment on lines +145 to +153
const debate = parseDebateResult(dir);
if (!debate) {
console.error("No debate.json found");
process.exitCode = 1;
return;
}

const debatePath = join(dir, "debate.json");
const raw = JSON.parse(readFileSync(debatePath, "utf-8")) as Record<string, unknown>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Minor: Double file read in finalize-debate.

parseDebateResult reads and parses debate.json, then line 153 reads and parses it again to get the raw object. While functionally correct (needed to preserve unknown fields via passthrough), this could be optimized by having parseDebateResult return both the validated result and raw object.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/commands/internal/calibrate-debate.ts` around lines 145 - 153,
parseDebateResult currently parses debate.json but the caller in finalize-debate
reads and parses the file again to preserve unknown fields; change
parseDebateResult to return a tuple/object containing both the validated Debate
result and the raw parsed Record (e.g., { validated, raw }) so callers (like the
code using debate and raw) can use the returned raw instead of re-reading the
file: update parseDebateResult signature and implementation to read once and
return both, then replace the second read/JSON.parse in finalize-debate (the
code that sets debatePath and raw) to use the raw value from parseDebateResult
and remove the redundant file read.

Internal calibration commands are consumed by subagents that parse stdout.
console.error + process.exitCode = 1 breaks this pattern.
Reverted all internal commands to console.log + exit 0 for subagent compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 10-15: The GatheredEvidence interface uses broad unknowns; replace
ruleImpactAssessment, uncoveredStruggles, actionableGaps and priorEvidence with
the concrete contract types used elsewhere (e.g., the domain/evidence validation
contracts) instead of unknown to enable type safety; update the import list at
the top of this module to bring in the correct contracts (for example the
RuleImpactContract, UncoveredStruggleContract, ActionableGapContract and
PriorEvidenceContract or their actual names from your validation/contracts
folder) and change the property types to arrays or records of those contract
types accordingly, then run the project/type-checker to fix any call sites that
expect the narrower shapes.
- Around line 149-151: Wrap the second read/parse of debate.json (the
readFileSync/JSON.parse that sets raw after debatePath is created) in a
try/catch so a corrupt/changed file cannot throw and break orchestration; on
error, fall back to a safe default (e.g., an empty object or the result from
parseDebateResult(dir)), log the parse failure to stderr or processLogger, and
continue to emit the expected stdout JSON action and exit with code 0.
Specifically modify the block that reads debatePath / assigns raw so exceptions
are caught, raw is always defined, and control flow continues (do not re-throw).

In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 147-149: The enrichment step drops "HOLD" decisions because the
check only persists APPROVE/REJECT/REVISE; update the logic that reads
r.decision (variable dec) to also accept and assign "HOLD" to entry.decision (or
simply assign dec after validating against the allowed decision union), and
ensure you update the decision unions/schemas in the contracts and the evidence
collector to include "HOLD" so types/validation, serialization, and downstream
consumers accept the new value.
- Around line 123-124: Validate the CLI input before resolving: inside the
.action((runDir: string) => { ... }) callback use a Zod schema (e.g., a
non-empty string/path schema) to parse or safeParse runDir (import z from
'zod'), handle validation failure (log/throw or exit) and only then call
resolve(validatedDir) to produce resolvedDir; update references to runDir ->
validatedDir and ensure any downstream logic uses the validated value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 942663e1-bc37-423f-8eae-ed5c47132640

📥 Commits

Reviewing files that changed from the base of the PR and between 4d15939 and 0e4c16b.

📒 Files selected for processing (2)
  • src/cli/commands/internal/calibrate-debate.ts
  • src/cli/commands/internal/fixture-management.ts

Comment on lines +10 to +15
export interface GatheredEvidence {
ruleImpactAssessment: unknown[];
uncoveredStruggles: unknown[];
actionableGaps: unknown[];
priorEvidence: Record<string, unknown>;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Prefer evidence contract types over broad unknown containers.

GatheredEvidence.priorEvidence (and related fields) can leverage existing contract types for better maintainability and safer downstream usage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/commands/internal/calibrate-debate.ts` around lines 10 - 15, The
GatheredEvidence interface uses broad unknowns; replace ruleImpactAssessment,
uncoveredStruggles, actionableGaps and priorEvidence with the concrete contract
types used elsewhere (e.g., the domain/evidence validation contracts) instead of
unknown to enable type safety; update the import list at the top of this module
to bring in the correct contracts (for example the RuleImpactContract,
UncoveredStruggleContract, ActionableGapContract and PriorEvidenceContract or
their actual names from your validation/contracts folder) and change the
property types to arrays or records of those contract types accordingly, then
run the project/type-checker to fix any call sites that expect the narrower
shapes.

Comment on lines +123 to +124
.action((runDir: string) => {
const resolvedDir = resolve(runDir);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and the specific file mentioned
find src/cli -type f -name "*.ts" | head -20

Repository: let-sunny/canicode

Length of output: 714


🏁 Script executed:

# Read the fixture-management.ts file to understand context
cat -n src/cli/commands/internal/fixture-management.ts

Repository: let-sunny/canicode

Length of output: 8415


🏁 Script executed:

# Check if there are contracts/schemas defined for CLI inputs
find . -path ./node_modules -prune -o -type f -name "*schema*" -o -name "*contract*" | grep -E "\.(ts|tsx)$" | head -20

Repository: let-sunny/canicode

Length of output: 44


🏁 Script executed:

# Search for Zod usage in CLI files to understand validation patterns
rg "z\." src/cli --type ts -B 2 -A 2 | head -50

Repository: let-sunny/canicode

Length of output: 2969


Validate CLI input with Zod before resolving paths.

runDir is external CLI input and must be schema-validated per coding guidelines. The code currently calls resolve(runDir) directly without validation.

Proposed fix
+import { z } from "zod";
...
     .action((runDir: string) => {
-      const resolvedDir = resolve(runDir);
+      const args = z.object({ runDir: z.string().min(1) }).parse({ runDir });
+      const resolvedDir = resolve(args.runDir);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.action((runDir: string) => {
const resolvedDir = resolve(runDir);
.action((runDir: string) => {
const args = z.object({ runDir: z.string().min(1) }).parse({ runDir });
const resolvedDir = resolve(args.runDir);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/commands/internal/fixture-management.ts` around lines 123 - 124,
Validate the CLI input before resolving: inside the .action((runDir: string) =>
{ ... }) callback use a Zod schema (e.g., a non-empty string/path schema) to
parse or safeParse runDir (import z from 'zod'), handle validation failure
(log/throw or exit) and only then call resolve(validatedDir) to produce
resolvedDir; update references to runDir -> validatedDir and ensure any
downstream logic uses the validated value.

- finalize-debate: try/catch on second debate.json read, fallback to
  {action: "continue"} on parse failure (subagent safety)
- HOLD added to decision enum in evidence schema, enrichCalibrationEvidence
  signature, and CLI enrich command (was silently dropped)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 30-53: Replace the ad-hoc JSON.parse + Array.isArray/type checks
for conversion.json, gaps.json (and proposed-rules.json per the comment) with
Zod validation: import the existing UncoveredStruggle and GapEntry Zod schemas
from src/agents/contracts and create a small top-level ConversionSchema (e.g., {
ruleImpactAssessment: z.array(z.unknown()).optional(), uncoveredStruggles:
z.array(UncoveredStruggle).optional() }) and a GapsSchema ({ gaps:
z.array(GapEntry) }). Read convPath/gapsPath the same way, parse JSON, then run
z.parse/z.safeParse and assign only validated data to
result.ruleImpactAssessment, result.uncoveredStruggles and result.actionableGaps
(filtering gaps by actionable is fine but apply GapEntry schema first). Ensure
malformed files are caught and ignored (or logged) by checking parse results
before mutating result.

In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 147-148: The decision value from critic reviews is compared with
exact-case strings and can be lost; normalize r.decision before checking and
storing by trimming and uppercasing it (e.g., use
r.decision?.trim().toUpperCase()) then compare against
"APPROVE"/"REJECT"/"REVISE"/"HOLD" and assign the normalized value to
entry.decision; update the logic around the dec variable in
registerFinalizeDebate / the block using r.decision so all valid lowercase/mixed
inputs are persisted consistently.
- Around line 135-140: parseRunDirName(basename(resolvedDir)) never yields a
falsy name, so the current check for !fixture is dead; instead validate that the
run directory ends with the expected `--timestamp` suffix by checking the parsed
timestamp (from parseRunDirName) or matching the basename against the
`--YYYY-...` pattern before treating it as a fixture. Update the code around
parseRunDirName / basename(resolvedDir) to reject basenames where the parsed
timestamp is empty or the regex check fails (log an error and return) so only
properly suffixed run directories are accepted for enrichment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a10c1791-ef9b-4ed2-b11f-3b689cd01e4b

📥 Commits

Reviewing files that changed from the base of the PR and between 0e4c16b and 3cdf582.

📒 Files selected for processing (4)
  • src/agents/contracts/evidence.ts
  • src/agents/evidence-collector.ts
  • src/cli/commands/internal/calibrate-debate.ts
  • src/cli/commands/internal/fixture-management.ts

Comment on lines +30 to +53
const convPath = join(runDir, "conversion.json");
if (existsSync(convPath)) {
try {
const conv = JSON.parse(readFileSync(convPath, "utf-8")) as Record<string, unknown>;
if (Array.isArray(conv["ruleImpactAssessment"])) {
result.ruleImpactAssessment = conv["ruleImpactAssessment"];
}
if (Array.isArray(conv["uncoveredStruggles"])) {
result.uncoveredStruggles = conv["uncoveredStruggles"];
}
} catch { /* ignore malformed */ }
}

// 2. gaps.json → actionable gaps
const gapsPath = join(runDir, "gaps.json");
if (existsSync(gapsPath)) {
try {
const gaps = JSON.parse(readFileSync(gapsPath, "utf-8")) as Record<string, unknown>;
const gapList = Array.isArray(gaps["gaps"]) ? gaps["gaps"] : [];
result.actionableGaps = gapList.filter(
(g): g is Record<string, unknown> =>
typeof g === "object" && g !== null && (g as Record<string, unknown>)["actionable"] === true
);
} catch { /* ignore malformed */ }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and the specific lines mentioned
find . -name "calibrate-debate.ts" -type f

Repository: let-sunny/canicode

Length of output: 109


🏁 Script executed:

# Check if the file exists and examine its content
head -150 src/cli/commands/internal/calibrate-debate.ts | cat -n

Repository: let-sunny/canicode

Length of output: 6312


🏁 Script executed:

# Search for Zod schema files related to the artifacts mentioned
find . -path "*/contracts/*" -name "*.ts" -o -name "*.js" | head -20

Repository: let-sunny/canicode

Length of output: 622


🏁 Script executed:

# Look for any existing schemas for conversion, gaps, or rules
rg -l "conversion|gaps|proposed.*rules" --type ts --type js -- contracts/

Repository: let-sunny/canicode

Length of output: 112


🏁 Script executed:

# Check the contracts directory for relevant schemas
ls -la src/agents/contracts/

Repository: let-sunny/canicode

Length of output: 691


🏁 Script executed:

# Look for any schema definitions related to conversion, gaps, or rules
rg "conversion|gaps" src/agents/contracts/ -A 5

Repository: let-sunny/canicode

Length of output: 1570


🏁 Script executed:

# Check if there are any Zod schemas defined in the contracts
rg "z\." src/agents/contracts/ | head -20

Repository: let-sunny/canicode

Length of output: 1568


🏁 Script executed:

# Look at the conversion-agent and gap-analyzer contracts
cat src/agents/contracts/conversion-agent.ts

Repository: let-sunny/canicode

Length of output: 1067


🏁 Script executed:

# Check gap-analyzer contract
cat src/agents/contracts/gap-analyzer.ts

Repository: let-sunny/canicode

Length of output: 1249


🏁 Script executed:

# Check how these schemas are imported and what's available
cat src/agents/contracts/conversion-agent.ts

Repository: let-sunny/canicode

Length of output: 1067


🏁 Script executed:

# Verify that no validation is actually happening in gatherEvidence
grep -A 20 "function gatherEvidence" src/cli/commands/internal/calibrate-debate.ts | grep -E "(parse|validate|\.parse)"

Repository: let-sunny/canicode

Length of output: 152


🏁 Script executed:

# Check if conversion.json is actually structured as ConversionRecord[] or has top-level fields
rg "conversion\.json" src/ -B 2 -A 2 --type ts

Repository: let-sunny/canicode

Length of output: 3139


🏁 Script executed:

# Verify the actual structure - check if conversion.json contains ruleImpactAssessment array
rg "ruleImpactAssessment" src/ --type ts

Repository: let-sunny/canicode

Length of output: 1537


🏁 Script executed:

# Check if there's a schema for the top-level conversion.json structure
rg "ConversionOutput|RuleImpactAssessment" src/agents/contracts/ --type ts

Repository: let-sunny/canicode

Length of output: 44


🏁 Script executed:

# Look at the test to understand what the conversion.json structure should be
cat -n src/cli/commands/internal/calibrate-debate.test.ts | grep -A 30 "extracts ruleImpactAssessment"

Repository: let-sunny/canicode

Length of output: 1469


🏁 Script executed:

# Check if proposed-rules.json is validated or has a schema
rg "ProposedRules|proposed.*rules" src/agents/contracts/ --type ts

Repository: let-sunny/canicode

Length of output: 44


🏁 Script executed:

# See how the gathered evidence is used downstream to assess impact
rg "critic-evidence" src/ --type ts -B 2 -A 2

Repository: let-sunny/canicode

Length of output: 697


🏁 Script executed:

# Let me check what happens with the evidence after it's gathered - does orchestrator validate it?
rg -A 10 "critic-evidence" src/agents/orchestrator.ts --type ts

Repository: let-sunny/canicode

Length of output: 44


🏁 Script executed:

# Check if the conversion output structure is documented anywhere
rg "ruleImpactAssessment|UncoveredStruggle" src/agents/ -B 5 --type ts | head -50

Repository: let-sunny/canicode

Length of output: 2431


🏁 Script executed:

# See if there's any downstream validation in the orchestrator
cat src/agents/orchestrator.ts | grep -A 30 "ruleImpactAssessment"

Repository: let-sunny/canicode

Length of output: 1795


🏁 Script executed:

# Double-check: are the schemas actually being used somewhere in gatherEvidence?
grep -n "UncoveredStruggleSchema\|GapAnalyzerOutputSchema\|GapEntrySchema" src/cli/commands/internal/calibrate-debate.ts

Repository: let-sunny/canicode

Length of output: 44


🏁 Script executed:

# Verify that the coding guideline applies to this file
head -5 src/cli/commands/internal/calibrate-debate.ts

Repository: let-sunny/canicode

Length of output: 274


Validate external JSON artifacts with Zod before feeding them into critic-evidence.json.

Lines 30–54 and 76–81 parse conversion.json, gaps.json, and proposed-rules.json using only JSON.parse and shallow Array.isArray or property-type checks. This allows malformed rows to flow directly into the critic payload. The coding guidelines require "Validate all external inputs with Zod schemas" for src/**/*.ts. Schemas for UncoveredStruggle and GapEntry already exist in src/agents/contracts/; add validation using these schemas and create a wrapper schema for the top-level conversion.json structure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/commands/internal/calibrate-debate.ts` around lines 30 - 53, Replace
the ad-hoc JSON.parse + Array.isArray/type checks for conversion.json, gaps.json
(and proposed-rules.json per the comment) with Zod validation: import the
existing UncoveredStruggle and GapEntry Zod schemas from src/agents/contracts
and create a small top-level ConversionSchema (e.g., { ruleImpactAssessment:
z.array(z.unknown()).optional(), uncoveredStruggles:
z.array(UncoveredStruggle).optional() }) and a GapsSchema ({ gaps:
z.array(GapEntry) }). Read convPath/gapsPath the same way, parse JSON, then run
z.parse/z.safeParse and assign only validated data to
result.ruleImpactAssessment, result.uncoveredStruggles and result.actionableGaps
(filtering gaps by actionable is fine but apply GapEntry schema first). Ensure
malformed files are caught and ignored (or logged) by checking parse results
before mutating result.

Comment on lines +147 to +148
const dec = r.decision;
if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") entry.decision = dec;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Normalize Critic decision casing before persisting it.

critic.reviews[].decision is parsed as a free-form string, and registerFinalizeDebate() already normalizes it with trim().toUpperCase(). This exact-case check will silently drop valid "reject", "hold", etc., so cross-run evidence can lose the new decision field even when the same debate would be handled correctly elsewhere.

Suggested fix
-        const dec = r.decision;
-        if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") entry.decision = dec;
+        const dec = r.decision.trim().toUpperCase();
+        if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") {
+          entry.decision = dec;
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/commands/internal/fixture-management.ts` around lines 147 - 148, The
decision value from critic reviews is compared with exact-case strings and can
be lost; normalize r.decision before checking and storing by trimming and
uppercasing it (e.g., use r.decision?.trim().toUpperCase()) then compare against
"APPROVE"/"REJECT"/"REVISE"/"HOLD" and assign the normalized value to
entry.decision; update the logic around the dec variable in
registerFinalizeDebate / the block using r.decision so all valid lowercase/mixed
inputs are persisted consistently.

- enrich-evidence: check timestamp suffix instead of dead !fixture check
  (parseRunDirName never returns falsy name)
- enrich-evidence: normalize decision to uppercase before persisting
  (handles mixed-case "reject"/"Reject" from Critic)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@let-sunny let-sunny merged commit 54aa6a0 into main Mar 29, 2026
3 checks passed
@let-sunny let-sunny deleted the feat/calibration-debate-144 branch March 29, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

calibration: adopt debate-based reliability and cost controls

1 participant