28 changes: 20 additions & 8 deletions .claude/agents/calibration/arbitrator.md
@@ -2,20 +2,24 @@
name: calibration-arbitrator
description: Makes final calibration decisions by weighing Runner and Critic. Applies approved changes to rule-config.ts and commits. Use after calibration-critic completes.
tools: Read, Edit, Bash
model: claude-sonnet-4-6
model: claude-opus-4-6
---

You are the Arbitrator agent in a calibration pipeline.
You receive the Runner's proposals and the Critic's reviews, and make final decisions.

## Decision Rules

- **Both APPROVE** → apply Runner's proposed value
- **Critic REJECT** → keep current score (no change)
- **Critic REVISE** → apply the Critic's revised value
- **Both APPROVE** → apply Runner's proposed value (decision: `"applied"`)
- **Critic REJECT** → keep current score (decision: `"rejected"`)
- **Critic REVISE** → apply the Critic's revised value (decision: `"revised"`)
- **proposedDisable: true** → if both Runner and Critic agree, set `enabled: false` in `rule-config.ts`. Decision type: `"disabled"`. If Critic rejects the disable, treat as a normal score adjustment instead.
- **New rule proposals** → record in `$RUN_DIR/debate.json` only, do NOT add to `rule-config.ts`

### Self-consistency guard

- If the Critic's confidence is `"low"` for a proposal → do NOT apply, regardless of decision. Set decision to `"hold"` with reason explaining insufficient confidence. The evidence will accumulate for future runs.
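The decision rules plus the guard can be sketched as a single function. This is a minimal illustration only — the `Proposal`/`Review` shapes are assumptions, not the pipeline's actual contract, and the real Arbitrator reasons over richer context:

```typescript
// Illustrative sketch of the Arbitrator's decision table. Field names are
// assumptions; the real pipeline passes richer objects between agents.
type Confidence = "high" | "medium" | "low";

interface Proposal {
  ruleId: string;
  current: number;
  proposed: number;
  proposedDisable?: boolean;
}

interface Review {
  decision: "APPROVE" | "REJECT" | "REVISE";
  confidence: Confidence;
  revised?: number;
}

type Outcome =
  | { decision: "applied" | "revised"; after: number }
  | { decision: "rejected" | "hold" | "disabled" };

function arbitrate(p: Proposal, r: Review): Outcome {
  // Self-consistency guard: never act on a low-confidence review.
  if (r.confidence === "low") return { decision: "hold" };
  if (r.decision === "REJECT") return { decision: "rejected" };
  // Disable only when the Runner proposed it AND the Critic approved.
  if (p.proposedDisable && r.decision === "APPROVE")
    return { decision: "disabled" };
  if (r.decision === "REVISE" && r.revised !== undefined)
    return { decision: "revised", after: r.revised };
  return { decision: "applied", after: p.proposed };
}
```

Note the sketch simplifies the rejected-disable case to a plain rejection; per the rules above, a rejected disable falls back to normal score-adjustment handling.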

## After Deciding

1. Apply approved changes to `src/core/rules/rule-config.ts`
@@ -39,20 +43,28 @@ Return this JSON structure:
```json
{
"timestamp": "<ISO8601>",
"summary": "applied=2 rejected=1 revised=1 newProposals=0",
"summary": "applied=1 revised=1 rejected=1 hold=1 newProposals=0",
"decisions": [
{"ruleId": "X", "decision": "applied", "before": -10, "after": -7, "reason": "Critic revised, midpoint applied"},
{"ruleId": "X", "decision": "rejected", "reason": "Critic rejection compelling — insufficient evidence"},
{"ruleId": "X", "decision": "disabled", "reason": "Converged to zero impact across 3+ runs, all easy"}
{"ruleId": "X", "decision": "applied", "before": -10, "after": -7, "confidence": "high", "reason": "Strong evidence, applying Runner's value"},
{"ruleId": "X", "decision": "revised", "before": -10, "after": -8, "confidence": "medium", "reason": "Critic revised, midpoint applied"},
{"ruleId": "X", "decision": "rejected", "confidence": "medium", "reason": "Critic rejection compelling — insufficient evidence"},
{"ruleId": "X", "decision": "hold", "confidence": "low", "reason": "Low confidence — accumulate more evidence before applying"},
{"ruleId": "X", "decision": "disabled", "confidence": "high", "reason": "Converged to zero impact across 3+ runs, all easy"}
],
"newRuleProposals": []
}
```

### Field requirements

- **confidence**: carried from Critic's review for each decision
- **Note**: `stoppingReason` is written by the orchestrator at the debate.json top level, not inside the arbitrator object

## Rules

- **Do NOT write to ANY file except `src/core/rules/rule-config.ts`.** No log files, no `new-rule-proposals.md`, no `debate.json`, no `activity.jsonl`. The orchestrator handles ALL other file I/O.
- **Do NOT create files.** Only Edit existing `rule-config.ts` when applying approved score changes.
- Only modify `rule-config.ts` for approved score/severity changes.
- Never force-push or amend existing commits.
- If tests fail, revert everything and report which change caused the failure.
- **Never apply changes with `confidence: "low"`.** Hold them for future evidence accumulation.
50 changes: 45 additions & 5 deletions .claude/agents/calibration/critic.md
@@ -2,7 +2,7 @@
name: calibration-critic
description: Challenges calibration proposals from Runner. Rejects low-confidence or over-aggressive adjustments. Use after calibration-runner completes.
tools: Read
model: claude-sonnet-4-6
model: claude-opus-4-6
---

## Common Review Framework
@@ -16,7 +16,17 @@ All critics follow this base protocol:
---

You are the Critic agent in a calibration pipeline.
You receive the Runner's proposals and challenge each one independently.
You receive the Runner's proposals along with supporting evidence, and challenge each one independently.

## Input Context

You will receive:
1. **Proposals** — from evaluation summary (overscored/underscored rules with proposed changes)
2. **Converter assessment** — `ruleImpactAssessment` showing actual implementation difficulty per rule
3. **Gap analysis** — actionable pixel gaps between Figma and generated code
4. **Prior evidence** — cross-run calibration evidence for the proposed rules (accumulated from past runs)

Use ALL inputs to form pro/con arguments. Do not rely on proposals alone.

## Rejection Rules

@@ -50,16 +60,46 @@ Return this JSON structure:
"timestamp": "<ISO8601>",
"summary": "approved=1 rejected=1 revised=1",
"reviews": [
{"ruleId": "X", "decision": "APPROVE", "reason": "3 cases, high confidence"},
{"ruleId": "X", "decision": "REJECT", "reason": "Rule 1 — only 1 case with low confidence"},
{"ruleId": "X", "decision": "REVISE", "revised": -7, "reason": "Rule 2 — change too large, midpoint applied"}
{
"ruleId": "X",
"decision": "APPROVE",
"confidence": "high",
"pro": ["3 cases across fixtures show easy implementation", "converter rated actualImpact: easy"],
"con": ["all cases from same design system"],
"reason": "Strong cross-run evidence outweighs single-system concern"
},
{
"ruleId": "X",
"decision": "REJECT",
"confidence": "low",
"pro": ["1 case shows overscored"],
"con": ["only 1 fixture", "no gap analysis data supports this"],
"reason": "Rule 1 — only 1 case with low confidence"
},
{
"ruleId": "X",
"decision": "REVISE",
"revised": -7,
"confidence": "medium",
"pro": ["converter found moderate difficulty, current score implies hard"],
"con": ["gap analysis shows some pixel impact in this area"],
"reason": "Rule 2 — change too large, midpoint applied"
}
]
}
```

### Field requirements

- **confidence**: `"high"` | `"medium"` | `"low"` — your assessment of the proposal's reliability
- **pro**: array of evidence points supporting the proposed change
- **con**: array of evidence points against the proposed change
- **reason**: final verdict synthesizing pro/con

## Rules

- **Do NOT write any files.** The orchestrator handles all file I/O.
- Do NOT modify `src/core/rules/rule-config.ts`.
- Be strict. When in doubt, REJECT or REVISE.
- Return your full critique so the Arbitrator can decide.
- **Every review MUST include pro, con, and confidence fields.** No exceptions.
78 changes: 70 additions & 8 deletions .claude/commands/calibrate-loop.md
@@ -135,8 +135,19 @@ If zero proposals, write `$RUN_DIR/debate.json` with skip reason and jump to Ste

### Step 5 — Critic

Gather supporting evidence (deterministic CLI — no LLM):

```bash
npx canicode calibrate-gather-evidence $RUN_DIR
```

This reads `conversion.json`, `gaps.json`, `summary.md`, and `data/calibration-evidence.json`, and writes a single `$RUN_DIR/critic-evidence.json` with structured data for the Critic.
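The exact shape of `critic-evidence.json` is owned by the CLI and not specified here; a hypothetical sketch, mirroring the four input categories the Critic expects (all field names are illustrative, not authoritative):

```json
{
  "proposals": [
    { "ruleId": "X", "direction": "overscored", "current": -10, "proposed": -7 }
  ],
  "converterAssessment": {
    "ruleImpactAssessment": [{ "ruleId": "X", "actualImpact": "easy" }]
  },
  "gapAnalysis": [
    { "area": "spacing", "pixels": 12, "actionable": true }
  ],
  "priorEvidence": [
    { "ruleId": "X", "overscoredCount": 2, "underscoredCount": 0 }
  ]
}
```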

Read `$RUN_DIR/critic-evidence.json` and include it in the Critic prompt.

Spawn the `calibration-critic` subagent. In the prompt:
- Include only the proposal list (NOT the Converter's reasoning)
- Include the proposal list from summary.md
- Include the gathered evidence from `critic-evidence.json`
- **Tell the agent: "Return your reviews as JSON. Do NOT write any files."**

After the Critic returns, **you** write the JSON to `$RUN_DIR/debate.json`:
@@ -145,7 +156,16 @@ After the Critic returns, **you** write the JSON to `$RUN_DIR/debate.json`:
"critic": {
"timestamp": "<ISO8601>",
"summary": "approved=<N> rejected=<N> revised=<N>",
"reviews": [ ... ]
"reviews": [
{
"ruleId": "X",
"decision": "APPROVE|REJECT|REVISE",
"confidence": "high|medium|low",
"pro": ["evidence supporting change"],
"con": ["evidence against change"],
"reason": "..."
}
]
}
}
```
@@ -155,32 +175,74 @@ Append to `$RUN_DIR/activity.jsonl`:
{"step":"Critic","timestamp":"<ISO8601>","result":"approved=<N> rejected=<N> revised=<N>","durationMs":<ms>}
```

#### Early-stop check (deterministic CLI — no LLM)

```bash
npx canicode calibrate-finalize-debate $RUN_DIR
```

This outputs JSON: `{"action": "early-stop"|"continue", ...}`.

- If `action` is `"early-stop"`: the CLI has already written `stoppingReason` to debate.json. Append to activity.jsonl:
```json
{"step":"Arbitrator","timestamp":"<ISO8601>","result":"SKIPPED — early-stop: all proposals rejected with high confidence","durationMs":0}
```
Jump to Step 6.5.

- If `action` is `"continue"`: proceed to Step 6.

### Step 6 — Arbitrator

Spawn the `calibration-arbitrator` subagent. In the prompt:
- Include proposals and the Critic's reviews from `$RUN_DIR/debate.json`
- **Tell the agent: "Return your decisions as JSON. Only edit rule-config.ts if applying changes. Do NOT write to logs."**

After the Arbitrator returns, **you** update `$RUN_DIR/debate.json` — read the existing content and add the `arbitrator` field:

```json
{
"critic": { ... },
"arbitrator": {
"timestamp": "<ISO8601>",
"summary": "applied=<N> rejected=<N> revised=<N>",
"decisions": [ ... ]
"summary": "applied=<N> revised=<N> rejected=<N> hold=<N>",
"decisions": [
{
"ruleId": "X",
"decision": "applied|revised|rejected|hold|disabled",
"confidence": "high|medium|low",
"before": -10,
"after": -7,
"reason": "..."
}
]
}
}
```

Then finalize the debate (deterministic CLI — no LLM):

```bash
npx canicode calibrate-finalize-debate $RUN_DIR
```

This determines `stoppingReason` (if any) and writes it to debate.json. Outputs JSON with `action: "finalized"`.

Append to `$RUN_DIR/activity.jsonl`:
```json
{"step":"Arbitrator","timestamp":"<ISO8601>","result":"applied=<N> rejected=<N>","durationMs":<ms>}
{"step":"Arbitrator","timestamp":"<ISO8601>","result":"applied=<N> rejected=<N> hold=<N>","durationMs":<ms>}
```

### Step 6.5 — Enrich and prune evidence

After the debate (or early-stop), enrich `data/calibration-evidence.json` with the Critic's structured pro/con/confidence. This ensures cross-run evidence persists beyond the ephemeral `logs/` directory.

```bash
npx canicode calibrate-enrich-evidence $RUN_DIR
```

### Step 6.5 — Prune evidence
This reads `debate.json`, extracts the Critic's reviews (pro, con, confidence, decision), and updates matching entries in `data/calibration-evidence.json`. Runs for both normal and early-stop paths.
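The enrichment merge can be sketched as follows — a minimal illustration assuming field names from the evidence schema; the CLI's actual implementation may differ:

```typescript
// Illustrative sketch of the calibrate-enrich-evidence merge: copy the
// Critic's structured review onto every matching evidence entry, leaving
// entries for unreviewed rules untouched.
type Confidence = "high" | "medium" | "low";

interface CriticReview {
  ruleId: string;
  decision: "APPROVE" | "REJECT" | "REVISE";
  confidence: Confidence;
  pro: string[];
  con: string[];
}

interface EvidenceEntry {
  ruleId: string;
  confidence?: Confidence;
  pro?: string[];
  con?: string[];
  decision?: string;
}

function enrichEvidence(
  entries: EvidenceEntry[],
  reviews: CriticReview[],
): EvidenceEntry[] {
  const byRule = new Map(reviews.map((r) => [r.ruleId, r]));
  return entries.map((e) => {
    const review = byRule.get(e.ruleId);
    if (!review) return e; // no review for this rule — keep entry as-is
    return {
      ...e,
      confidence: review.confidence,
      pro: review.pro,
      con: review.con,
      decision: review.decision,
    };
  });
}
```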

After the Arbitrator applies changes, prune calibration evidence for the applied rules:
Then prune calibration evidence for the applied rules:

```bash
npx canicode calibrate-prune-evidence $RUN_DIR
@@ -209,7 +271,7 @@ Report the final summary: similarity, proposals, decisions, and path to `logs/ca

- Each agent must be a SEPARATE subagent call (isolated context).
- Pass only structured data between agents — never raw reasoning.
- The Critic must NOT see the Runner's or Converter's reasoning, only the proposal list.
- The Critic receives proposals + converter's ruleImpactAssessment + gaps + prior evidence (structured data, not free-form reasoning).
- Only the Arbitrator may edit `rule-config.ts`.
- Steps 1, 4, 7 are CLI commands — run them directly with Bash.
- **CRITICAL: YOU write all files to $RUN_DIR. Subagents (Gap Analyzer, Critic, Arbitrator) MUST return JSON as text — tell them "Do NOT write any files." You are the only one who writes to $RUN_DIR.**
16 changes: 14 additions & 2 deletions CLAUDE.md
@@ -322,10 +322,22 @@ Process:
3. Run `canicode visual-compare` — pixel-level comparison against Figma screenshot
4. Analyze the diff image to categorize pixel gaps (`Gap Analyzer`)
5. Compare conversion difficulty vs rule scores (`canicode calibrate-evaluate`)
6. 6-agent debate loop (`/calibrate-loop`): Analysis → Converter → Gap Analyzer → Evaluation → Critic → Arbitrator
6. Debate loop (`/calibrate-loop`): Analysis → Converter → Gap Analyzer → Evaluation → Critic → Arbitrator

**Critic receives structured evidence** (#144):
- Proposals from evaluation
- Converter's `ruleImpactAssessment` (actual implementation difficulty per rule)
- Gap analysis (actionable pixel gaps)
- Prior cross-run evidence for proposed rules
- Outputs structured pro/con arguments + confidence level per proposal

**Early-stop and self-consistency** (#144):
- All proposals rejected with high confidence → Arbitrator skipped (early-stop)
- Low-confidence decisions → held (not applied), evidence accumulates for future runs (self-consistency)
- `stoppingReason` recorded in debate.json for traceability

**Cross-run evidence** accumulates across sessions in `data/`:
- `calibration-evidence.json` — overscored/underscored rules (fed to Runner for stronger proposals)
- `calibration-evidence.json` — overscored/underscored rules with confidence, pro/con, decision (fed to Critic for informed review)
- `discovery-evidence.json` — uncovered gaps not covered by existing rules (fed to `/add-rule` Researcher)
- Discovery evidence is filtered to exclude environment/tooling noise (font CDN, retina/DPI, network, CI constraints)
- Evidence is pruned after rules are applied (calibration) or new rules are created (discovery)
10 changes: 10 additions & 0 deletions src/agents/contracts/evidence.ts
@@ -8,6 +8,11 @@
actualDifficulty: z.string(),
fixture: z.string(),
timestamp: z.string(),
// Phase 1 fields (#144) — optional for backward compatibility with existing evidence
confidence: z.enum(["high", "medium", "low"]).optional(),
pro: z.array(z.string()).optional(),
con: z.array(z.string()).optional(),
decision: z.enum(["APPROVE", "REJECT", "REVISE", "HOLD"]).optional(),
});

export type CalibrationEvidenceEntry = z.infer<typeof CalibrationEvidenceEntrySchema>;
@@ -17,6 +22,11 @@ export const CrossRunEvidenceGroupSchema = z.object({
underscoredCount: z.number(),
overscoredDifficulties: z.array(z.string()),
underscoredDifficulties: z.array(z.string()),
// Aggregated pro/con from all entries for this rule
allPro: z.array(z.string()).optional(),
allCon: z.array(z.string()).optional(),
lastConfidence: z.enum(["high", "medium", "low"]).optional(),
lastDecision: z.enum(["APPROVE", "REJECT", "REVISE", "HOLD"]).optional(),
});

export type CrossRunEvidenceGroup = z.infer<typeof CrossRunEvidenceGroupSchema>;
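The aggregated group fields can be derived from per-entry data. A minimal sketch under the assumption that entries have already been grouped per rule (the real aggregation lives in the calibration CLI and may differ):

```typescript
// Illustrative sketch: derive the aggregated pro/con/confidence fields of a
// cross-run evidence group from its individual entries. Entries without the
// Phase 1 fields (pre-#144 data) contribute nothing, which keeps old
// evidence files backward compatible.
type Confidence = "high" | "medium" | "low";
type Decision = "APPROVE" | "REJECT" | "REVISE" | "HOLD";

interface EvidenceEntry {
  timestamp: string; // ISO8601
  confidence?: Confidence;
  pro?: string[];
  con?: string[];
  decision?: Decision;
}

interface AggregatedEvidence {
  allPro: string[];
  allCon: string[];
  lastConfidence?: Confidence;
  lastDecision?: Decision;
}

function aggregate(entries: EvidenceEntry[]): AggregatedEvidence {
  // ISO8601 timestamps sort correctly as strings.
  const sorted = [...entries].sort((a, b) =>
    a.timestamp.localeCompare(b.timestamp),
  );
  const latest = sorted[sorted.length - 1];
  return {
    allPro: sorted.flatMap((e) => e.pro ?? []),
    allCon: sorted.flatMap((e) => e.con ?? []),
    lastConfidence: latest?.confidence,
    lastDecision: latest?.decision,
  };
}
```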