28 changes: 20 additions & 8 deletions .claude/agents/calibration/arbitrator.md
@@ -2,20 +2,24 @@
name: calibration-arbitrator
description: Makes final calibration decisions by weighing Runner and Critic. Applies approved changes to rule-config.ts and commits. Use after calibration-critic completes.
tools: Read, Edit, Bash
model: claude-sonnet-4-6
model: claude-opus-4-6
---

You are the Arbitrator agent in a calibration pipeline.
You receive the Runner's proposals and the Critic's reviews, and make final decisions.

## Decision Rules

- **Both APPROVE** → apply Runner's proposed value
- **Critic REJECT** → keep current score (no change)
- **Critic REVISE** → apply the Critic's revised value
- **Both APPROVE** → apply Runner's proposed value (decision: `"applied"`)
- **Critic REJECT** → keep current score (decision: `"rejected"`)
- **Critic REVISE** → apply the Critic's revised value (decision: `"revised"`)
- **proposedDisable: true** → if both Runner and Critic agree, set `enabled: false` in `rule-config.ts`. Decision type: `"disabled"`. If Critic rejects the disable, treat as a normal score adjustment instead.
- **New rule proposals** → record in `$RUN_DIR/debate.json` only, do NOT add to `rule-config.ts`

### Self-consistency guard

- If the Critic's confidence is `"low"` for a proposal → do NOT apply, regardless of decision. Set decision to `"hold"` with reason explaining insufficient confidence. The evidence will accumulate for future runs.
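The decision rules plus the guard can be sketched as a single function. This is a minimal illustration only — the `Proposal`/`Review` shapes are assumptions, not the pipeline's actual contract, and the real Arbitrator reasons over richer context:

```typescript
// Illustrative sketch of the Arbitrator's decision table. Field names are
// assumptions; the real pipeline passes richer objects between agents.
type Confidence = "high" | "medium" | "low";

interface Proposal {
  ruleId: string;
  current: number;
  proposed: number;
  proposedDisable?: boolean;
}

interface Review {
  decision: "APPROVE" | "REJECT" | "REVISE";
  confidence: Confidence;
  revised?: number;
}

type Outcome =
  | { decision: "applied" | "revised"; after: number }
  | { decision: "rejected" | "hold" | "disabled" };

function arbitrate(p: Proposal, r: Review): Outcome {
  // Self-consistency guard: never act on a low-confidence review.
  if (r.confidence === "low") return { decision: "hold" };
  if (r.decision === "REJECT") return { decision: "rejected" };
  // Disable only when the Runner proposed it AND the Critic approved.
  if (p.proposedDisable && r.decision === "APPROVE")
    return { decision: "disabled" };
  if (r.decision === "REVISE" && r.revised !== undefined)
    return { decision: "revised", after: r.revised };
  return { decision: "applied", after: p.proposed };
}
```

Note the sketch simplifies the rejected-disable case to a plain rejection; per the rules above, a rejected disable falls back to normal score-adjustment handling.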

## After Deciding

1. Apply approved changes to `src/core/rules/rule-config.ts`
@@ -39,20 +43,28 @@ Return this JSON structure:
```json
{
"timestamp": "<ISO8601>",
"summary": "applied=2 rejected=1 revised=1 newProposals=0",
"summary": "applied=1 revised=1 rejected=1 hold=1 newProposals=0",
"decisions": [
{"ruleId": "X", "decision": "applied", "before": -10, "after": -7, "reason": "Critic revised, midpoint applied"},
{"ruleId": "X", "decision": "rejected", "reason": "Critic rejection compelling — insufficient evidence"},
{"ruleId": "X", "decision": "disabled", "reason": "Converged to zero impact across 3+ runs, all easy"}
{"ruleId": "X", "decision": "applied", "before": -10, "after": -7, "confidence": "high", "reason": "Strong evidence, applying Runner's value"},
{"ruleId": "X", "decision": "revised", "before": -10, "after": -8, "confidence": "medium", "reason": "Critic revised, midpoint applied"},
{"ruleId": "X", "decision": "rejected", "confidence": "medium", "reason": "Critic rejection compelling — insufficient evidence"},
{"ruleId": "X", "decision": "hold", "confidence": "low", "reason": "Low confidence — accumulate more evidence before applying"},
{"ruleId": "X", "decision": "disabled", "confidence": "high", "reason": "Converged to zero impact across 3+ runs, all easy"}
],
"newRuleProposals": []
}
```

### Field requirements

- **confidence**: carried from Critic's review for each decision
- **Note**: `stoppingReason` is written by the orchestrator at the debate.json top level, not inside the arbitrator object

## Rules

- **Do NOT write to ANY file except `src/core/rules/rule-config.ts`.** No log files, no `new-rule-proposals.md`, no `debate.json`, no `activity.jsonl`. The orchestrator handles ALL other file I/O.
- **Do NOT create files.** Only Edit existing `rule-config.ts` when applying approved score changes.
- Only modify `rule-config.ts` for approved score/severity changes.
- Never force-push or amend existing commits.
- If tests fail, revert everything and report which change caused the failure.
- **Never apply changes with `confidence: "low"`.** Hold them for future evidence accumulation.
50 changes: 45 additions & 5 deletions .claude/agents/calibration/critic.md
@@ -2,7 +2,7 @@
name: calibration-critic
description: Challenges calibration proposals from Runner. Rejects low-confidence or over-aggressive adjustments. Use after calibration-runner completes.
tools: Read
model: claude-sonnet-4-6
model: claude-opus-4-6
---

## Common Review Framework
@@ -16,7 +16,17 @@ All critics follow this base protocol:
---

You are the Critic agent in a calibration pipeline.
You receive the Runner's proposals and challenge each one independently.
You receive the Runner's proposals along with supporting evidence, and challenge each one independently.

## Input Context

You will receive:
1. **Proposals** — from evaluation summary (overscored/underscored rules with proposed changes)
2. **Converter assessment** — `ruleImpactAssessment` showing actual implementation difficulty per rule
3. **Gap analysis** — actionable pixel gaps between Figma and generated code
4. **Prior evidence** — cross-run calibration evidence for the proposed rules (accumulated from past runs)

Use ALL inputs to form pro/con arguments. Do not rely on proposals alone.

## Rejection Rules

@@ -50,16 +60,46 @@ Return this JSON structure:
"timestamp": "<ISO8601>",
"summary": "approved=1 rejected=1 revised=1",
"reviews": [
{"ruleId": "X", "decision": "APPROVE", "reason": "3 cases, high confidence"},
{"ruleId": "X", "decision": "REJECT", "reason": "Rule 1 — only 1 case with low confidence"},
{"ruleId": "X", "decision": "REVISE", "revised": -7, "reason": "Rule 2 — change too large, midpoint applied"}
{
"ruleId": "X",
"decision": "APPROVE",
"confidence": "high",
"pro": ["3 cases across fixtures show easy implementation", "converter rated actualImpact: easy"],
"con": ["all cases from same design system"],
"reason": "Strong cross-run evidence outweighs single-system concern"
},
{
"ruleId": "X",
"decision": "REJECT",
"confidence": "low",
"pro": ["1 case shows overscored"],
"con": ["only 1 fixture", "no gap analysis data supports this"],
"reason": "Rule 1 — only 1 case with low confidence"
},
{
"ruleId": "X",
"decision": "REVISE",
"revised": -7,
"confidence": "medium",
"pro": ["converter found moderate difficulty, current score implies hard"],
"con": ["gap analysis shows some pixel impact in this area"],
"reason": "Rule 2 — change too large, midpoint applied"
}
]
}
```

### Field requirements

- **confidence**: `"high"` | `"medium"` | `"low"` — your assessment of the proposal's reliability
- **pro**: array of evidence points supporting the proposed change
- **con**: array of evidence points against the proposed change
- **reason**: final verdict synthesizing pro/con

## Rules

- **Do NOT write any files.** The orchestrator handles all file I/O.
- Do NOT modify `src/core/rules/rule-config.ts`.
- Be strict. When in doubt, REJECT or REVISE.
- Return your full critique so the Arbitrator can decide.
- **Every review MUST include pro, con, and confidence fields.** No exceptions.
78 changes: 70 additions & 8 deletions .claude/commands/calibrate-loop.md
@@ -135,8 +135,19 @@ If zero proposals, write `$RUN_DIR/debate.json` with skip reason and jump to Ste

### Step 5 — Critic

Gather supporting evidence (deterministic CLI — no LLM):

```bash
npx canicode calibrate-gather-evidence $RUN_DIR
```

This reads `conversion.json`, `gaps.json`, `summary.md`, and `data/calibration-evidence.json`, and writes a single `$RUN_DIR/critic-evidence.json` with structured data for the Critic.
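The exact shape of `critic-evidence.json` is owned by the CLI and not specified here; a hypothetical sketch, mirroring the four input categories the Critic expects (all field names are illustrative, not authoritative):

```json
{
  "proposals": [
    { "ruleId": "X", "direction": "overscored", "current": -10, "proposed": -7 }
  ],
  "converterAssessment": {
    "ruleImpactAssessment": [{ "ruleId": "X", "actualImpact": "easy" }]
  },
  "gapAnalysis": [
    { "area": "spacing", "pixels": 12, "actionable": true }
  ],
  "priorEvidence": [
    { "ruleId": "X", "overscoredCount": 2, "underscoredCount": 0 }
  ]
}
```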

Read `$RUN_DIR/critic-evidence.json` and include it in the Critic prompt.

Spawn the `calibration-critic` subagent. In the prompt:
- Include only the proposal list (NOT the Converter's reasoning)
- Include the proposal list from summary.md
- Include the gathered evidence from `critic-evidence.json`
- **Tell the agent: "Return your reviews as JSON. Do NOT write any files."**

After the Critic returns, **you** write the JSON to `$RUN_DIR/debate.json`:
@@ -145,7 +156,16 @@ After the Critic returns, **you** write the JSON to `$RUN_DIR/debate.json`:
"critic": {
"timestamp": "<ISO8601>",
"summary": "approved=<N> rejected=<N> revised=<N>",
"reviews": [ ... ]
"reviews": [
{
"ruleId": "X",
"decision": "APPROVE|REJECT|REVISE",
"confidence": "high|medium|low",
"pro": ["evidence supporting change"],
"con": ["evidence against change"],
"reason": "..."
}
]
}
}
```
@@ -155,32 +175,74 @@ Append to `$RUN_DIR/activity.jsonl`:
{"step":"Critic","timestamp":"<ISO8601>","result":"approved=<N> rejected=<N> revised=<N>","durationMs":<ms>}
```

#### Early-stop check (deterministic CLI — no LLM)

```bash
npx canicode calibrate-finalize-debate $RUN_DIR
```

This outputs JSON: `{"action": "early-stop"|"continue", ...}`.

- If `action` is `"early-stop"`: the CLI has already written `stoppingReason` to debate.json. Append to activity.jsonl:
```json
{"step":"Arbitrator","timestamp":"<ISO8601>","result":"SKIPPED — early-stop: all proposals rejected with high confidence","durationMs":0}
```
Jump to Step 6.5.

- If `action` is `"continue"`: proceed to Step 6.

### Step 6 — Arbitrator

Spawn the `calibration-arbitrator` subagent. In the prompt:
- Include proposals and the Critic's reviews from `$RUN_DIR/debate.json`
- **Tell the agent: "Return your decisions as JSON. Only edit rule-config.ts if applying changes. Do NOT write to logs."**

After the Arbitrator returns, **you** update `$RUN_DIR/debate.json` — read the existing content and add the `arbitrator` field:

```json
{
"critic": { ... },
"arbitrator": {
"timestamp": "<ISO8601>",
"summary": "applied=<N> rejected=<N> revised=<N>",
"decisions": [ ... ]
"summary": "applied=<N> revised=<N> rejected=<N> hold=<N>",
"decisions": [
{
"ruleId": "X",
"decision": "applied|revised|rejected|hold|disabled",
"confidence": "high|medium|low",
"before": -10,
"after": -7,
"reason": "..."
}
]
}
}
```

Then finalize the debate (deterministic CLI — no LLM):

```bash
npx canicode calibrate-finalize-debate $RUN_DIR
```

This determines `stoppingReason` (if any) and writes it to debate.json. Outputs JSON with `action: "finalized"`.

Append to `$RUN_DIR/activity.jsonl`:
```json
{"step":"Arbitrator","timestamp":"<ISO8601>","result":"applied=<N> rejected=<N>","durationMs":<ms>}
{"step":"Arbitrator","timestamp":"<ISO8601>","result":"applied=<N> rejected=<N> hold=<N>","durationMs":<ms>}
```

### Step 6.5 — Enrich and prune evidence

After the debate (or early-stop), enrich `data/calibration-evidence.json` with the Critic's structured pro/con/confidence. This ensures cross-run evidence persists beyond the ephemeral `logs/` directory.

```bash
npx canicode calibrate-enrich-evidence $RUN_DIR
```

### Step 6.5 — Prune evidence
This reads `debate.json`, extracts the Critic's reviews (pro, con, confidence, decision), and updates matching entries in `data/calibration-evidence.json`. Runs for both normal and early-stop paths.
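The enrichment merge can be sketched as follows — a minimal illustration assuming field names from the evidence schema; the CLI's actual implementation may differ:

```typescript
// Illustrative sketch of the calibrate-enrich-evidence merge: copy the
// Critic's structured review onto every matching evidence entry, leaving
// entries for unreviewed rules untouched.
type Confidence = "high" | "medium" | "low";

interface CriticReview {
  ruleId: string;
  decision: "APPROVE" | "REJECT" | "REVISE";
  confidence: Confidence;
  pro: string[];
  con: string[];
}

interface EvidenceEntry {
  ruleId: string;
  confidence?: Confidence;
  pro?: string[];
  con?: string[];
  decision?: string;
}

function enrichEvidence(
  entries: EvidenceEntry[],
  reviews: CriticReview[],
): EvidenceEntry[] {
  const byRule = new Map(reviews.map((r) => [r.ruleId, r]));
  return entries.map((e) => {
    const review = byRule.get(e.ruleId);
    if (!review) return e; // no review for this rule — keep entry as-is
    return {
      ...e,
      confidence: review.confidence,
      pro: review.pro,
      con: review.con,
      decision: review.decision,
    };
  });
}
```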

After the Arbitrator applies changes, prune calibration evidence for the applied rules:
Then prune calibration evidence for the applied rules:

```bash
npx canicode calibrate-prune-evidence $RUN_DIR
@@ -209,7 +271,7 @@ Report the final summary: similarity, proposals, decisions, and path to `logs/ca

- Each agent must be a SEPARATE subagent call (isolated context).
- Pass only structured data between agents — never raw reasoning.
- The Critic must NOT see the Runner's or Converter's reasoning, only the proposal list.
- The Critic receives proposals + converter's ruleImpactAssessment + gaps + prior evidence (structured data, not free-form reasoning).
- Only the Arbitrator may edit `rule-config.ts`.
- Steps 1, 4, 7 are CLI commands — run them directly with Bash.
- **CRITICAL: YOU write all files to $RUN_DIR. Subagents (Gap Analyzer, Critic, Arbitrator) MUST return JSON as text — tell them "Do NOT write any files." You are the only one who writes to $RUN_DIR.**
16 changes: 14 additions & 2 deletions CLAUDE.md
@@ -322,10 +322,22 @@ Process:
3. Run `canicode visual-compare` — pixel-level comparison against Figma screenshot
4. Analyze the diff image to categorize pixel gaps (`Gap Analyzer`)
5. Compare conversion difficulty vs rule scores (`canicode calibrate-evaluate`)
6. 6-agent debate loop (`/calibrate-loop`): Analysis → Converter → Gap Analyzer → Evaluation → Critic → Arbitrator
6. Debate loop (`/calibrate-loop`): Analysis → Converter → Gap Analyzer → Evaluation → Critic → Arbitrator

**Critic receives structured evidence** (#144):
- Proposals from evaluation
- Converter's `ruleImpactAssessment` (actual implementation difficulty per rule)
- Gap analysis (actionable pixel gaps)
- Prior cross-run evidence for proposed rules
- Outputs structured pro/con arguments + confidence level per proposal

**Early-stop and self-consistency** (#144):
- All proposals rejected with high confidence → Arbitrator skipped (early-stop)
- Low-confidence decisions → held (not applied), evidence accumulates for future runs (self-consistency)
- `stoppingReason` recorded in debate.json for traceability

**Cross-run evidence** accumulates across sessions in `data/`:
- `calibration-evidence.json` — overscored/underscored rules (fed to Runner for stronger proposals)
- `calibration-evidence.json` — overscored/underscored rules with confidence, pro/con, decision (fed to Critic for informed review)
- `discovery-evidence.json` — uncovered gaps not covered by existing rules (fed to `/add-rule` Researcher)
- Discovery evidence is filtered to exclude environment/tooling noise (font CDN, retina/DPI, network, CI constraints)
- Evidence is pruned after rules are applied (calibration) or new rules are created (discovery)
10 changes: 10 additions & 0 deletions src/agents/contracts/evidence.ts
@@ -8,6 +8,11 @@
actualDifficulty: z.string(),
fixture: z.string(),
timestamp: z.string(),
// Phase 1 fields (#144) — optional for backward compatibility with existing evidence
confidence: z.enum(["high", "medium", "low"]).optional(),
pro: z.array(z.string()).optional(),
con: z.array(z.string()).optional(),
decision: z.enum(["APPROVE", "REJECT", "REVISE", "HOLD"]).optional(),
});

export type CalibrationEvidenceEntry = z.infer<typeof CalibrationEvidenceEntrySchema>;
@@ -17,6 +22,11 @@ export const CrossRunEvidenceGroupSchema = z.object({
underscoredCount: z.number(),
overscoredDifficulties: z.array(z.string()),
underscoredDifficulties: z.array(z.string()),
// Aggregated pro/con from all entries for this rule
allPro: z.array(z.string()).optional(),
allCon: z.array(z.string()).optional(),
lastConfidence: z.enum(["high", "medium", "low"]).optional(),
lastDecision: z.enum(["APPROVE", "REJECT", "REVISE", "HOLD"]).optional(),
});

export type CrossRunEvidenceGroup = z.infer<typeof CrossRunEvidenceGroupSchema>;
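The aggregated group fields can be derived from per-entry data. A minimal sketch under the assumption that entries have already been grouped per rule (the real aggregation lives in the calibration CLI and may differ):

```typescript
// Illustrative sketch: derive the aggregated pro/con/confidence fields of a
// cross-run evidence group from its individual entries. Entries without the
// Phase 1 fields (pre-#144 data) contribute nothing, which keeps old
// evidence files backward compatible.
type Confidence = "high" | "medium" | "low";
type Decision = "APPROVE" | "REJECT" | "REVISE" | "HOLD";

interface EvidenceEntry {
  timestamp: string; // ISO8601
  confidence?: Confidence;
  pro?: string[];
  con?: string[];
  decision?: Decision;
}

interface AggregatedEvidence {
  allPro: string[];
  allCon: string[];
  lastConfidence?: Confidence;
  lastDecision?: Decision;
}

function aggregate(entries: EvidenceEntry[]): AggregatedEvidence {
  // ISO8601 timestamps sort correctly as strings.
  const sorted = [...entries].sort((a, b) =>
    a.timestamp.localeCompare(b.timestamp),
  );
  const latest = sorted[sorted.length - 1];
  return {
    allPro: sorted.flatMap((e) => e.pro ?? []),
    allCon: sorted.flatMap((e) => e.con ?? []),
    lastConfidence: latest?.confidence,
    lastDecision: latest?.decision,
  };
}
```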