feat: structured debate with pro/con, confidence, early-stop (#144) #172
Conversation
Phase 1-3 of calibration pipeline improvements:
- Evidence schema: add confidence, pro, con, decision fields
- Critic: receives converter assessment + gaps + prior evidence, outputs structured pro/con arguments per proposal
- Arbitrator: holds low-confidence decisions instead of applying
- Early-stop: skip Arbitrator when all proposals rejected with high confidence
- Cross-run evidence enriched with pro/con for better future reviews
- CLAUDE.md calibration section updated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough

Critic now consumes structured evidence (converter impact, actionable gaps, prior cross-run evidence) and emits per-proposal pro/con + confidence. Added deterministic gather/finalize CLI steps (early-stop when all reviews are high-confidence REJECT). Arbitrator must HOLD low-confidence proposals; evidence enrichment/prune and schemas updated.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant CLI as CLI (calibrate-gather-evidence)
  participant Run as Run Directory
  participant Evidence as calibration-evidence.json
  participant Critic as Critic Agent
  participant Finalize as CLI (calibrate-finalize-debate)
  participant Arbitrator as Arbitrator Agent
  Run->>CLI: invoke calibrate-gather-evidence
  CLI->>Run: read conversion.json, gaps.json, summary.md
  CLI->>Evidence: load cross-run calibration-evidence.json
  CLI->>Run: write critic-evidence.json
  Run->>Critic: run Critic with proposals + critic-evidence.json
  Critic->>Run: write debate.json (reviews with pro/con/confidence)
  Run->>Finalize: invoke calibrate-finalize-debate
  Finalize->>Run: parse debate.json
  alt all reviews == REJECT && confidence == "high"
    Finalize->>Run: write stoppingReason "all-high-confidence-reject"
    Finalize-->>Run: output { action: "early-stop" }
  else
    Run->>Arbitrator: run Arbitrator with critic reviews
    Arbitrator->>Run: write arbitrator decisions (applied/revised/rejected/hold)
    alt any decision == "hold"
      Run->>Evidence: call enrichCalibrationEvidence (accumulate held reviews)
    else
      Run->>Evidence: record applied/revised decisions into evidence
    end
    Finalize-->>Run: output { action: "finalized", maybe stoppingReason }
  end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed
Self-review fixes:
1. Evidence enrichment gap: add calibrate-enrich-evidence CLI command and enrichCalibrationEvidence() — writes Critic's pro/con/confidence back to data/calibration-evidence.json (was only in ephemeral logs/)
2. Restore "revised" decision type in Arbitrator (was accidentally dropped)
3. Convergence: "hold" decisions count as not-converged (need more evidence)
4. Early-stop convergence: stoppingReason in debate.json → converged=true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code review feedback:
1. isConverged now delegates to checkConvergence (single source of truth for early-stop / hold logic)
2. enrichCalibrationEvidence scoped by (ruleId, fixture) — no longer overwrites entries from other fixtures
3. stoppingReason canonical location: debate.json top level only
4. CLI calibrate-enrich-evidence uses process.exitCode = 1 on errors
5. Tests added: enrichCalibrationEvidence (3), checkConvergence early-stop, hold convergence (strict + lenient), isConverged delegation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review round 2:
A. stoppingReason added to DebateResultSchema — no more type cast
B. enrichCalibrationEvidence warns when no entries match fixture
C. calibrate-prune-evidence CLI now uses process.exitCode = 1
D. allPro/allCon deduplicated in loadCalibrationEvidence grouping
F. stoppingReason canonical location commented in schema

E (model selection) is a judgment call, not a code change — to evaluate later.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are the most critical judgment steps in the calibration pipeline. Structured pro/con + confidence reasoning benefits from stronger model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. CriticReviewSchema: confidence/pro/con as typed optional fields, removing Record<string, unknown> casts in fixture-management.ts
2. enrichCalibrationEvidence: early-return when no matches (skip I/O)
3. stoppingReason: omit for normal completion, only set for special cases
4. Test: pro/con deduplication in loadCalibrationEvidence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ands

Three deterministic tasks the orchestrator (LLM) was doing are now CLI commands:
1. calibrate-gather-evidence \<runDir\>
   Reads conversion.json, gaps.json, summary.md, calibration-evidence.json and writes a single critic-evidence.json for the Critic prompt.
   Was: orchestrator reads 3-4 files and extracts relevant parts (error-prone)
2. calibrate-finalize-debate \<runDir\>
   After Critic: checks early-stop (all REJECT + high confidence)
   After Arbitrator: determines stoppingReason (low-confidence-hold)
   Was: orchestrator evaluates conditions inline (can misjudge)

calibrate-loop.md updated to use CLI commands instead of inline logic. Orchestrator now only runs CLI commands and spawns LLM agents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tests covering:
- gather-evidence: conversion.json parsing, gap filtering, ruleId extraction
- finalize-debate: early-stop detection, mixed reviews, hold detection, normal completion, missing debate.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- calibrate-evaluate now writes proposed-rules.json (deterministic ruleId list) - calibrate-gather-evidence reads proposed-rules.json first, falls back to summary.md regex (eliminates false positive risk) - Export gatherEvidence, loadProposedRuleIds, GatheredEvidence for direct unit testing - Tests rewritten to import functions directly (13 tests) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cli/commands/internal/calibrate-debate.test.ts`:
- Around line 111-128: The test duplicates the early-stop check; refactor so the
test uses the implementation's check instead of re-implementing it: extract the
early-stop predicate from registerFinalizeDebate (or export an existing helper)
as a named function (e.g., isEarlyStopOrAllHighConfidenceReject) and import that
into calibrate-debate.test.ts, then replace the local allHighConfidenceReject
computation with a call to that function using the parsed debate (from
parseDebateResult), ensuring the test asserts the function returns true; update
registerFinalizeDebate to call the same helper so behavior remains identical.
In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 145-153: parseDebateResult currently parses debate.json but the
caller in finalize-debate reads and parses the file again to preserve unknown
fields; change parseDebateResult to return a tuple/object containing both the
validated Debate result and the raw parsed Record (e.g., { validated, raw }) so
callers (like the code using debate and raw) can use the returned raw instead of
re-reading the file: update parseDebateResult signature and implementation to
read once and return both, then replace the second read/JSON.parse in
finalize-debate (the code that sets debatePath and raw) to use the raw value
from parseDebateResult and remove the redundant file read.
In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 167-175: The change incorrectly treats missing run directories and
a null parseDebateResult as errors; revert to the prior subagent-compatible
behavior by using console.log instead of console.error for the "Run directory
not found" and "No debate.json found — nothing to prune." messages and do not
set process.exitCode = 1 (leave exit code 0); locate the block that checks
runDir and the parseDebateResult(resolve(runDir)) call and adjust those two
branches to log to stdout and return without changing the exit code so internal
calibration subagents continue to parse stdout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1184e0d5-37b0-4d71-b454-8933f9b23e80
📒 Files selected for processing (14)
.claude/agents/calibration/arbitrator.md
.claude/agents/calibration/critic.md
.claude/commands/calibrate-loop.md
CLAUDE.md
src/agents/contracts/evidence.ts
src/agents/evidence-collector.test.ts
src/agents/evidence-collector.ts
src/agents/run-directory.test.ts
src/agents/run-directory.ts
src/cli/commands/internal/calibrate-debate.test.ts
src/cli/commands/internal/calibrate-debate.ts
src/cli/commands/internal/calibrate-evaluate.ts
src/cli/commands/internal/fixture-management.ts
src/cli/index.ts
```typescript
it("detects early-stop when all critic reviews are high-confidence REJECT", () => {
  writeFileSync(join(runDir, "debate.json"), JSON.stringify({
    critic: {
      summary: "rejected=2",
      reviews: [
        { ruleId: "a", decision: "REJECT", confidence: "high", pro: [], con: ["weak"], reason: "x" },
        { ruleId: "b", decision: "REJECT", confidence: "high", pro: [], con: ["weak"], reason: "y" },
      ],
    },
  }));

  const debate = parseDebateResult(runDir)!;
  const reviews = debate.critic!.reviews;
  const allHighConfidenceReject = reviews.length > 0 && reviews.every((r) =>
    r.decision.trim().toUpperCase() === "REJECT" && r.confidence === "high"
  );
  expect(allHighConfidenceReject).toBe(true);
});
```
🧹 Nitpick | 🔵 Trivial
Consider extracting the early-stop check logic to avoid duplication.
The test duplicates the early-stop condition logic (lines 124-126) that exists in registerFinalizeDebate. While acceptable for testing the parsed data structure, consider importing and testing the actual finalize logic to ensure the test stays in sync with implementation changes.
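The suggested extraction could look roughly like this (the helper name follows the review's example; the review shape is inferred from the test fixture above):

```typescript
// Sketch of the extracted early-stop predicate, shared by the finalize step
// and the test so the two cannot drift apart.
interface CriticReview {
  ruleId: string;
  decision: string;
  confidence: string;
}

export function isAllHighConfidenceReject(reviews: CriticReview[]): boolean {
  // An empty review list must not trigger early-stop.
  return (
    reviews.length > 0 &&
    reviews.every(
      (r) => r.decision.trim().toUpperCase() === "REJECT" && r.confidence === "high"
    )
  );
}
```

The test would then assert `isAllHighConfidenceReject(debate.critic.reviews)` instead of re-implementing the predicate inline.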
```typescript
const debate = parseDebateResult(dir);
if (!debate) {
  console.error("No debate.json found");
  process.exitCode = 1;
  return;
}

const debatePath = join(dir, "debate.json");
const raw = JSON.parse(readFileSync(debatePath, "utf-8")) as Record<string, unknown>;
```
🧹 Nitpick | 🔵 Trivial
Minor: Double file read in finalize-debate.
parseDebateResult reads and parses debate.json, then line 153 reads and parses it again to get the raw object. While functionally correct (needed to preserve unknown fields via passthrough), this could be optimized by having parseDebateResult return both the validated result and raw object.
Internal calibration commands are consumed by subagents that parse stdout. console.error + process.exitCode = 1 breaks this pattern. Reverted all internal commands to console.log + exit 0 for subagent compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 10-15: The GatheredEvidence interface uses broad unknowns; replace
ruleImpactAssessment, uncoveredStruggles, actionableGaps and priorEvidence with
the concrete contract types used elsewhere (e.g., the domain/evidence validation
contracts) instead of unknown to enable type safety; update the import list at
the top of this module to bring in the correct contracts (for example the
RuleImpactContract, UncoveredStruggleContract, ActionableGapContract and
PriorEvidenceContract or their actual names from your validation/contracts
folder) and change the property types to arrays or records of those contract
types accordingly, then run the project/type-checker to fix any call sites that
expect the narrower shapes.
- Around line 149-151: Wrap the second read/parse of debate.json (the
readFileSync/JSON.parse that sets raw after debatePath is created) in a
try/catch so a corrupt/changed file cannot throw and break orchestration; on
error, fall back to a safe default (e.g., an empty object or the result from
parseDebateResult(dir)), log the parse failure to stderr or processLogger, and
continue to emit the expected stdout JSON action and exit with code 0.
Specifically modify the block that reads debatePath / assigns raw so exceptions
are caught, raw is always defined, and control flow continues (do not re-throw).
In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 147-149: The enrichment step drops "HOLD" decisions because the
check only persists APPROVE/REJECT/REVISE; update the logic that reads
r.decision (variable dec) to also accept and assign "HOLD" to entry.decision (or
simply assign dec after validating against the allowed decision union), and
ensure you update the decision unions/schemas in the contracts and the evidence
collector to include "HOLD" so types/validation, serialization, and downstream
consumers accept the new value.
- Around line 123-124: Validate the CLI input before resolving: inside the
.action((runDir: string) => { ... }) callback use a Zod schema (e.g., a
non-empty string/path schema) to parse or safeParse runDir (import z from
'zod'), handle validation failure (log/throw or exit) and only then call
resolve(validatedDir) to produce resolvedDir; update references to runDir ->
validatedDir and ensure any downstream logic uses the validated value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 942663e1-bc37-423f-8eae-ed5c47132640
📒 Files selected for processing (2)
src/cli/commands/internal/calibrate-debate.ts
src/cli/commands/internal/fixture-management.ts
```typescript
export interface GatheredEvidence {
  ruleImpactAssessment: unknown[];
  uncoveredStruggles: unknown[];
  actionableGaps: unknown[];
  priorEvidence: Record<string, unknown>;
}
```
🧹 Nitpick | 🔵 Trivial
Prefer evidence contract types over broad unknown containers.
GatheredEvidence.priorEvidence (and related fields) can leverage existing contract types for better maintainability and safer downstream usage.
```typescript
.action((runDir: string) => {
  const resolvedDir = resolve(runDir);
```
🧩 Analysis chain
🏁 Scripts executed (repository: let-sunny/canicode; output lengths omitted):

```shell
# First, let's look at the file structure and the specific file mentioned
find src/cli -type f -name "*.ts" | head -20

# Read the fixture-management.ts file to understand context
cat -n src/cli/commands/internal/fixture-management.ts

# Check if there are contracts/schemas defined for CLI inputs
find . -path ./node_modules -prune -o -type f -name "*schema*" -o -name "*contract*" | grep -E "\.(ts|tsx)$" | head -20

# Search for Zod usage in CLI files to understand validation patterns
rg "z\." src/cli --type ts -B 2 -A 2 | head -50
```
Validate CLI input with Zod before resolving paths.
runDir is external CLI input and must be schema-validated per coding guidelines. The code currently calls resolve(runDir) directly without validation.
Proposed fix
```diff
+import { z } from "zod";
 ...
 .action((runDir: string) => {
-  const resolvedDir = resolve(runDir);
+  const args = z.object({ runDir: z.string().min(1) }).parse({ runDir });
+  const resolvedDir = resolve(args.runDir);
```
- finalize-debate: try/catch on second debate.json read, fallback to
{action: "continue"} on parse failure (subagent safety)
- HOLD added to decision enum in evidence schema, enrichCalibrationEvidence
signature, and CLI enrich command (was silently dropped)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
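The try/catch fallback described in this commit can be sketched as follows (the function name is hypothetical; the point is that a corrupt debate.json yields a safe default instead of an exception):

```typescript
// Sketch of the subagent-safe parse: calibration subagents parse this CLI's
// stdout, so a corrupt debate.json must not throw. On any failure we fall back
// to an empty object and the caller still emits { action: "continue" }.
export function safeParseDebate(json: string): Record<string, unknown> {
  try {
    const parsed: unknown = JSON.parse(json);
    // JSON.parse can also yield strings/numbers; only accept objects.
    return typeof parsed === "object" && parsed !== null
      ? (parsed as Record<string, unknown>)
      : {};
  } catch {
    return {};
  }
}
```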
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cli/commands/internal/calibrate-debate.ts`:
- Around line 30-53: Replace the ad-hoc JSON.parse + Array.isArray/type checks
for conversion.json, gaps.json (and proposed-rules.json per the comment) with
Zod validation: import the existing UncoveredStruggle and GapEntry Zod schemas
from src/agents/contracts and create a small top-level ConversionSchema (e.g., {
ruleImpactAssessment: z.array(z.unknown()).optional(), uncoveredStruggles:
z.array(UncoveredStruggle).optional() }) and a GapsSchema ({ gaps:
z.array(GapEntry) }). Read convPath/gapsPath the same way, parse JSON, then run
z.parse/z.safeParse and assign only validated data to
result.ruleImpactAssessment, result.uncoveredStruggles and result.actionableGaps
(filtering gaps by actionable is fine but apply GapEntry schema first). Ensure
malformed files are caught and ignored (or logged) by checking parse results
before mutating result.
In `@src/cli/commands/internal/fixture-management.ts`:
- Around line 147-148: The decision value from critic reviews is compared with
exact-case strings and can be lost; normalize r.decision before checking and
storing by trimming and uppercasing it (e.g., use
r.decision?.trim().toUpperCase()) then compare against
"APPROVE"/"REJECT"/"REVISE"/"HOLD" and assign the normalized value to
entry.decision; update the logic around the dec variable in
registerFinalizeDebate / the block using r.decision so all valid lowercase/mixed
inputs are persisted consistently.
- Around line 135-140: parseRunDirName(basename(resolvedDir)) never yields a
falsy name, so the current check for !fixture is dead; instead validate that the
run directory ends with the expected `--timestamp` suffix by checking the parsed
timestamp (from parseRunDirName) or matching the basename against the
`--YYYY-...` pattern before treating it as a fixture. Update the code around
parseRunDirName / basename(resolvedDir) to reject basenames where the parsed
timestamp is empty or the regex check fails (log an error and return) so only
properly suffixed run directories are accepted for enrichment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a10c1791-ef9b-4ed2-b11f-3b689cd01e4b
📒 Files selected for processing (4)
src/agents/contracts/evidence.ts
src/agents/evidence-collector.ts
src/cli/commands/internal/calibrate-debate.ts
src/cli/commands/internal/fixture-management.ts
```typescript
const convPath = join(runDir, "conversion.json");
if (existsSync(convPath)) {
  try {
    const conv = JSON.parse(readFileSync(convPath, "utf-8")) as Record<string, unknown>;
    if (Array.isArray(conv["ruleImpactAssessment"])) {
      result.ruleImpactAssessment = conv["ruleImpactAssessment"];
    }
    if (Array.isArray(conv["uncoveredStruggles"])) {
      result.uncoveredStruggles = conv["uncoveredStruggles"];
    }
  } catch { /* ignore malformed */ }
}

// 2. gaps.json → actionable gaps
const gapsPath = join(runDir, "gaps.json");
if (existsSync(gapsPath)) {
  try {
    const gaps = JSON.parse(readFileSync(gapsPath, "utf-8")) as Record<string, unknown>;
    const gapList = Array.isArray(gaps["gaps"]) ? gaps["gaps"] : [];
    result.actionableGaps = gapList.filter(
      (g): g is Record<string, unknown> =>
        typeof g === "object" && g !== null && (g as Record<string, unknown>)["actionable"] === true
    );
  } catch { /* ignore malformed */ }
```
🧩 Analysis chain
🏁 Scripts executed (repository: let-sunny/canicode; output lengths omitted):

```shell
# First, let's look at the file structure and the specific lines mentioned
find . -name "calibrate-debate.ts" -type f

# Check if the file exists and examine its content
head -150 src/cli/commands/internal/calibrate-debate.ts | cat -n

# Search for Zod schema files related to the artifacts mentioned
find . -path "*/contracts/*" -name "*.ts" -o -name "*.js" | head -20

# Look for any existing schemas for conversion, gaps, or rules
rg -l "conversion|gaps|proposed.*rules" --type ts --type js -- contracts/

# Check the contracts directory for relevant schemas
ls -la src/agents/contracts/

# Look for any schema definitions related to conversion, gaps, or rules
rg "conversion|gaps" src/agents/contracts/ -A 5

# Check if there are any Zod schemas defined in the contracts
rg "z\." src/agents/contracts/ | head -20

# Look at the conversion-agent and gap-analyzer contracts
cat src/agents/contracts/conversion-agent.ts

# Check gap-analyzer contract
cat src/agents/contracts/gap-analyzer.ts

# Check how these schemas are imported and what's available
cat src/agents/contracts/conversion-agent.ts

# Verify that no validation is actually happening in gatherEvidence
grep -A 20 "function gatherEvidence" src/cli/commands/internal/calibrate-debate.ts | grep -E "(parse|validate|\.parse)"

# Check if conversion.json is actually structured as ConversionRecord[] or has top-level fields
rg "conversion\.json" src/ -B 2 -A 2 --type ts

# Verify the actual structure - check if conversion.json contains ruleImpactAssessment array
rg "ruleImpactAssessment" src/ --type ts

# Check if there's a schema for the top-level conversion.json structure
rg "ConversionOutput|RuleImpactAssessment" src/agents/contracts/ --type ts

# Look at the test to understand what the conversion.json structure should be
cat -n src/cli/commands/internal/calibrate-debate.test.ts | grep -A 30 "extracts ruleImpactAssessment"

# Check if proposed-rules.json is validated or has a schema
rg "ProposedRules|proposed.*rules" src/agents/contracts/ --type ts

# See how the gathered evidence is used downstream to assess impact
rg "critic-evidence" src/ --type ts -B 2 -A 2

# Let me check what happens with the evidence after it's gathered - does orchestrator validate it?
rg -A 10 "critic-evidence" src/agents/orchestrator.ts --type ts

# Check if the conversion output structure is documented anywhere
rg "ruleImpactAssessment|UncoveredStruggle" src/agents/ -B 5 --type ts | head -50

# See if there's any downstream validation in the orchestrator
cat src/agents/orchestrator.ts | grep -A 30 "ruleImpactAssessment"

# Double-check: are the schemas actually being used somewhere in gatherEvidence?
grep -n "UncoveredStruggleSchema\|GapAnalyzerOutputSchema\|GapEntrySchema" src/cli/commands/internal/calibrate-debate.ts

# Verify that the coding guideline applies to this file
head -5 src/cli/commands/internal/calibrate-debate.ts
```
Validate external JSON artifacts with Zod before feeding them into critic-evidence.json.
Lines 30–54 and 76–81 parse conversion.json, gaps.json, and proposed-rules.json using only JSON.parse and shallow Array.isArray or property-type checks. This allows malformed rows to flow directly into the critic payload. The coding guidelines require "Validate all external inputs with Zod schemas" for src/**/*.ts. Schemas for UncoveredStruggle and GapEntry already exist in src/agents/contracts/; add validation using these schemas and create a wrapper schema for the top-level conversion.json structure.
```typescript
const dec = r.decision;
if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") entry.decision = dec;
```
Normalize Critic decision casing before persisting it.
critic.reviews[].decision is parsed as a free-form string, and registerFinalizeDebate() already normalizes it with trim().toUpperCase(). This exact-case check will silently drop valid "reject", "hold", etc., so cross-run evidence can lose the new decision field even when the same debate would be handled correctly elsewhere.
Suggested fix

```diff
-const dec = r.decision;
-if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") entry.decision = dec;
+const dec = r.decision.trim().toUpperCase();
+if (dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD") {
+  entry.decision = dec;
+}
```
- enrich-evidence: check timestamp suffix instead of dead !fixture check (parseRunDirName never returns falsy name)
- enrich-evidence: normalize decision to uppercase before persisting (handles mixed-case "reject"/"Reject" from Critic)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
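The two fixes can be sketched as follows. The suffix regex and helper names are hypothetical (the real check lives around parseRunDirName); the decision union matches the APPROVE/REJECT/REVISE/HOLD values used elsewhere in this PR:

```typescript
// Sketch: run dirs must carry a "--<timestamp>" suffix, and Critic decisions
// are normalized (trim + uppercase) before being persisted as evidence.
export type Decision = "APPROVE" | "REJECT" | "REVISE" | "HOLD";

const RUN_DIR_SUFFIX = /--\d{4}-\d{2}-\d{2}/; // e.g. "my-fixture--2024-05-01..."

export function hasTimestampSuffix(runDirName: string): boolean {
  return RUN_DIR_SUFFIX.test(runDirName);
}

export function normalizeDecision(decision: string): Decision | null {
  const dec = decision.trim().toUpperCase();
  return dec === "APPROVE" || dec === "REJECT" || dec === "REVISE" || dec === "HOLD"
    ? dec
    : null; // unknown values are dropped rather than persisted
}
```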
Problem

The Critic in the calibration pipeline judged based on proposals alone. It could not see what the Converter actually struggled with, which pixel gaps existed, or how the same rule was evaluated in past runs.

Debate results also lived only in logs/ (gitignored, ephemeral), so no cross-run learning happened.

Design decisions

1. Pass structured evidence to the Critic

Before: proposals only — the Critic judged from numbers alone
After: proposals + the Converter's ruleImpactAssessment + gap analysis + prior cross-run evidence

Why: to judge whether a rule actually affected implementation difficulty, the Critic needs the Converter's real experience.

2. Structured pro/con output

Before: { decision, reason } — a single reason cannot separate arguments for and against
After: { decision, confidence, pro: [...], con: [...], reason } — explicit pro/con separation

Why: a single reason reflects only one perspective. Explicitly separating pro and con lets the Arbitrator judge better, and once this data is persisted as evidence it can be reused in the next run.

3. Evidence enrichment — persisted under data/

Before: Critic output → logs/debate.json (ephemeral) → invisible to the next run
After: Critic output → logs/debate.json + data/calibration-evidence.json (persistent)

A new CLI command, calibrate-enrich-evidence, reads the Critic's pro/con/confidence from debate.json and updates the existing entries in data/calibration-evidence.json.

4. Early-stop (skip the Arbitrator)

When all proposals are rejected with high confidence, running the Arbitrator is unnecessary. Record stoppingReason: "all-high-confidence-reject" in debate.json and stop.

5. Self-consistency (hold decisions)

Before: the Arbitrator applied or rejected even at low confidence
After: confidence: low → decision: "hold" — not applied, only accumulated as evidence

When the same rule is proposed again in a later run, the case count rises and confidence converges naturally, reusing the existing cross-run evidence mechanism.

6. Hold affects convergence

hold = "not decided yet" → not converged. The fixture does not move to done/ and is re-evaluated in the next calibration.

Data flow
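The hold and early-stop rules above can be sketched together as one convergence check (names and the decision union are assumptions based on this description, not the actual implementation):

```typescript
// Sketch of convergence with HOLD: early-stop counts as converged, while any
// hold means "need more evidence" and blocks convergence.
export interface DebateOutcome {
  decisions: Array<"applied" | "revised" | "rejected" | "hold">;
  stoppingReason?: string;
}

export function checkConvergence(outcome: DebateOutcome): boolean {
  // All proposals rejected with high confidence: nothing left to decide.
  if (outcome.stoppingReason === "all-high-confidence-reject") return true;
  // A hold is an undecided proposal, so the fixture is not converged.
  if (outcome.decisions.includes("hold")) return false;
  return outcome.decisions.length > 0;
}
```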
Changed files
- evidence.ts (contracts)
- evidence-collector.ts — enrichCalibrationEvidence() + grouping with allPro/allCon
- critic.md
- arbitrator.md
- calibrate-loop.md
- run-directory.ts
- fixture-management.ts — calibrate-enrich-evidence CLI command
- cli/index.ts
- CLAUDE.md

Test plan
- exactOptionalPropertyTypes compliant
- /calibrate-loop on a fixture to validate end-to-end (post-merge)

Closes #144
🤖 Generated with Claude Code