Merged
29 changes: 14 additions & 15 deletions docs/CALIBRATION-PLAYBOOK.md
@@ -35,10 +35,10 @@ npx canicode save-fixture "https://www.figma.com/design/ABC123/MyDesign?node-id=
| 0 | Orchestrator | Run directory created | `logs/calibration/<name>--<timestamp>/` |
| 1 | CLI | `analysis.json` | Rule analysis — which rules flagged what |
| 2 | Converter | `output.html`, `figma.png`, `code.png`, `diff.png`, `conversion.json` | Implements the entire design as HTML, runs visual-compare |
| 3 | Gap Analyzer | `gaps.json` | Categorizes pixel differences between Figma and code |
| 4 | CLI | `summary.md` | Score vs actual impact comparison |
| 3 | Gap Analyzer | `gaps.json` | Categorizes pixel differences, appends uncovered gaps to `data/discovery-evidence.json` |
| 4 | Evaluator | `summary.md` | Score vs actual impact comparison, appends to `data/calibration-evidence.json` |
| 5 | Critic | `debate.json` | Reviews proposals: APPROVE / REJECT / REVISE |
| 6 | Arbitrator | `debate.json` (appended), `rule-config.ts` | Makes final decisions, applies approved changes, commits |
| 6 | Arbitrator | `debate.json` (appended), `rule-config.ts` | Makes final decisions, applies approved changes, commits, prunes evidence |

### What you see

@@ -56,19 +56,14 @@ None — fully automatic. Review the commit if you want.

## 3. Nightly Calibration (Multiple Fixtures)

### In Claude Code / Cursor

```
/calibrate-night
```

### On a server
### In Claude Code

```bash
./scripts/calibrate-night.sh
./scripts/calibrate-night.sh --deep # uses Figma MCP for richer data
/calibrate-night fixtures/
```

Input: fixture directory path. Auto-discovers active fixtures (`fixtures/*/data.json`).

### What happens

```
@@ -100,7 +95,7 @@ Open `logs/calibration/REPORT.md` the next morning. Key sections:
|---------|-----------------|--------|
| **Similarity per run** | Low similarity = hard design | Consider adding more rules for that pattern |
| **Repeating patterns** | Same gap in 3+ fixtures | Strong candidate for `/add-rule` |
| **Rule score vs impact** | Overscored in most runs | Score will auto-adjust in next calibration |
| **Rule score vs impact** | Overscored (penalty too harsh) or underscored (penalty too mild) | Score will auto-adjust in next calibration |
| **New rule candidates** | `text-alignment-mismatch` in 4/6 | Run `/add-rule` |
| **Never flagged rules** | Rule never triggered | Consider `enabled: false` in `rule-config.ts` |

@@ -125,10 +120,10 @@ When the report identifies a new pattern worth codifying:
| Step | Agent | Output | Description |
|------|-------|--------|-------------|
| 0 | Orchestrator | Run directory | `logs/rule-discovery/<concept>--<date>/` |
| 1 | Researcher | `research.json` | Checks if the concept exists in fixture data, reads accumulated gaps |
| 1 | Researcher | `research.json` | Checks fixture data + `data/discovery-evidence.json` for recurring patterns |
| 2 | Designer | `design.json` | Proposes rule spec: ID, category, severity, score, trigger logic |
| 3 | Implementer | Source code | Writes rule code + tests, builds |
| 4 | Orchestrator | `visual-a.html`, `visual-b.html` | A/B test: converts design with/without the rule's data |
| 4 | Orchestrator | `visual-a.html`, `visual-b.html` | A/B test: implements design with/without the rule's data, compares pixel similarity |
| 5 | Evaluator | `evaluation.json` | Measures false positive rate, visual improvement |
| 6 | Critic | `decision.json` | Final verdict |

@@ -146,6 +141,10 @@ When the report identifies a new pattern worth codifying:
- **Build/test fails** → Implementer attempts fix; if can't, pipeline stops
- **A/B shows no improvement** → Evaluator likely recommends DROP

### Evidence pruning

After KEEP or ADJUST, discovery evidence for the rule's category is pruned from `data/discovery-evidence.json`. This prevents the same pattern from being proposed again.
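A minimal sketch of this pruning step, assuming each evidence entry carries a `category` field (the actual layout of `data/discovery-evidence.json` may differ):

```typescript
interface DiscoveryEvidence {
  category: string;    // e.g. "spacing", "typography" (assumed labels)
  fixture: string;
  description: string;
}

// Drop evidence the new rule now covers, so the Researcher
// does not propose the same pattern again.
export function pruneByCategory(
  evidence: DiscoveryEvidence[],
  coveredCategory: string
): DiscoveryEvidence[] {
  return evidence.filter((e) => e.category !== coveredCategory);
}
```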

### Your decision

None during execution — fully automatic. After completion:
110 changes: 57 additions & 53 deletions docs/CALIBRATION.md
@@ -1,15 +1,16 @@
# Calibration Pipeline

CanICode's rule scores and severity levels are not arbitrary — they are continuously validated against actual code conversion difficulty through an automated 4-agent debate pipeline.
CanICode's rule scores and severity levels are not arbitrary — they are continuously validated against actual code conversion difficulty through an automated 6-step calibration pipeline.

## Why Calibrate?

Initial rule scores were intuition-based estimates. A rule flagged as "blocking" with score -10 might turn out to be trivial to work around in practice (overscored), or a "suggestion" at -2 might actually cause significant conversion difficulty (underscored).

The calibration pipeline validates scores by:
1. Converting flagged Figma nodes to production code
2. Measuring how much each rule actually impacted conversion difficulty
3. Proposing score adjustments when predicted and actual difficulty diverge
1. Implementing the entire scoped design as one HTML page
2. Measuring pixel-level similarity against the Figma screenshot (`visual-compare`)
3. Analyzing diff images to categorize pixel gaps
4. Proposing score adjustments when predicted and actual difficulty diverge

## Pipeline Structure

@@ -20,34 +21,57 @@ Step 1 — Analysis (CLI)
Run canicode calibrate-analyze to identify issues and group by node.

Step 2 — Converter (Subagent)
Convert the top 5 flagged nodes to production CSS/HTML/React code.
Assess actual conversion difficulty: easy | moderate | hard | failed.
For each flagged rule, note whether it actually made conversion harder.
Implement the ENTIRE scoped design as one HTML page.
Run visual-compare for pixel-level similarity against Figma screenshot.

Step 3 — Evaluation (CLI)
Step 3 — Gap Analyzer (Subagent)
Analyze the diff image between Figma screenshot and generated code.
Categorize each pixel difference (spacing, color, typography, layout, etc.).
Append uncovered gaps to data/discovery-evidence.json.

Step 4 — Evaluation (CLI)
Compare predicted difficulty (from rule scores) vs actual difficulty.
Generate score adjustment proposals.
Append overscored/underscored findings to data/calibration-evidence.json.

Step 4 — Critic (Subagent)
Step 5 — Critic (Subagent)
Challenge each proposal against rejection rules:
- Rule 1: Low confidence + fewer than 2 supporting cases → reject
- Rule 2: Change exceeds 50% of current value → cap at midpoint
- Rule 3: Severity change without high confidence → reject

Step 5 — Arbitrator (Subagent)
Step 6 — Arbitrator (Subagent)
Make final decisions:
- Both approve → apply Runner's value
- Critic rejects → keep current score
- Critic revises → apply Critic's conservative value
Commits approved changes to rule-config.ts.

Step 6.5 — Prune Evidence
Remove evidence for rules that were just adjusted from
data/calibration-evidence.json (applied rules) and
data/discovery-evidence.json (covered gaps).
```
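The Critic's rejection rules and the Arbitrator's decision table above can be sketched in TypeScript. This is an illustrative model only — the `Proposal` and `Verdict` shapes are assumptions, not the pipeline's actual types:

```typescript
type Confidence = "low" | "medium" | "high";

interface Proposal {
  ruleId: string;
  currentScore: number;   // e.g. -10
  proposedScore: number;  // e.g. -6
  supportingCases: number;
  confidence: Confidence;
  changesSeverity: boolean;
}

type Verdict =
  | { kind: "APPROVE" }
  | { kind: "REJECT" }
  | { kind: "REVISE"; score: number };

// Sketch of the Critic's three rejection rules.
export function critique(p: Proposal): Verdict {
  // Rule 1: low confidence with fewer than 2 supporting cases → reject.
  if (p.confidence === "low" && p.supportingCases < 2) return { kind: "REJECT" };
  // Rule 3: severity changes require high confidence → reject otherwise.
  if (p.changesSeverity && p.confidence !== "high") return { kind: "REJECT" };
  // Rule 2: changes exceeding 50% of the current value are capped at the midpoint.
  const delta = Math.abs(p.proposedScore - p.currentScore);
  if (delta > Math.abs(p.currentScore) * 0.5) {
    return { kind: "REVISE", score: (p.currentScore + p.proposedScore) / 2 };
  }
  return { kind: "APPROVE" };
}

// Sketch of the Arbitrator's final decision.
export function arbitrate(p: Proposal, v: Verdict): number {
  switch (v.kind) {
    case "APPROVE": return p.proposedScore; // both approve → Runner's value
    case "REJECT":  return p.currentScore;  // keep current score
    case "REVISE":  return v.score;         // Critic's conservative value
  }
}
```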

### Tiered Approach

Not all fixtures go through the full pipeline. The tier is based on the current grade:

| Grade | Pipeline | Rationale |
|-------|----------|-----------|
| A+ and above | Full pipeline (Converter + Gap Analysis) | High-quality designs benefit from gap analysis |
| Below A | Converter + visual-compare only (skip gap analysis) | Low-scoring designs need score validation the most |

**Always run the Converter** regardless of grade. Skipping visual-compare on low-scoring designs means scores can never be validated.

## Agents

| Agent | Role | Can edit rule-config.ts? |
|-------|------|------------------------|
| **Runner** | Runs analysis, extracts proposals | No |
| **Converter** | Converts Figma nodes to code, assesses difficulty | No |
| **Converter** | Implements entire design as HTML, runs visual-compare | No |
| **Gap Analyzer** | Categorizes pixel differences from diff image | No |
| **Evaluator** | Compares predicted vs actual difficulty | No |
| **Critic** | Applies rejection heuristics, caps excessive changes | No |
| **Arbitrator** | Makes final decisions, commits changes | Yes |

@@ -89,9 +113,18 @@ These weights were validated through calibration: rules rated "blocking" consist
| D | 50-64% | Major rework needed |
| F | 0-49% | Fundamental structural problems |

## Cross-Run Evidence

Evidence accumulates across calibration sessions in `data/`:

- **`data/calibration-evidence.json`** — Overscored/underscored rules. Fed back to the Evaluator in subsequent runs for stronger proposals.
- **`data/discovery-evidence.json`** — Uncovered gaps not covered by existing rules. Fed to the `/add-rule` Researcher to find recurring patterns worth turning into new rules.

Discovery evidence is filtered to exclude environment/tooling noise (font CDN differences, retina/DPI scaling, network artifacts, CI constraints). Evidence is pruned after rules are applied (calibration) or new rules are created (discovery).
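The noise filter might look like the following sketch; the category labels here are illustrative, not the pipeline's actual identifiers:

```typescript
// Assumed labels for environment/tooling noise.
const NOISE_CATEGORIES: ReadonlySet<string> = new Set([
  "font-cdn",
  "retina-dpi-scaling",
  "network-artifact",
  "ci-constraint",
]);

export interface GapEvidence {
  category: string;
  description: string;
}

// Keep only actionable gaps before appending to data/discovery-evidence.json.
export function dropEnvironmentNoise(entries: GapEvidence[]): GapEvidence[] {
  return entries.filter((e) => !NOISE_CATEGORIES.has(e.category));
}
```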

## Score Adjustment History

Rule scores in `src/rules/rule-config.ts` have been adjusted through multiple calibration cycles across different fixture files (Material 3 Design Kit, HTTP Design, Simple DS Card Grid, Simple DS Panel Sections, Simple DS Page Sections).
Rule scores in `src/core/rules/rule-config.ts` have been adjusted through multiple calibration cycles across different fixture files (Material 3 Design Kit, HTTP Design, Simple DS Card Grid, Simple DS Panel Sections, Simple DS Page Sections).

Each adjustment requires:
- Minimum 2 supporting cases at medium+ confidence
@@ -199,56 +232,33 @@ The Critic's conservatism prevented score whiplash — without it, `no-auto-layo

---

## Calibration Pipeline Structure
## Gap Analysis

```
/calibrate-loop <fixture>
The Gap Analyzer (Step 3) examines the diff image between Figma screenshot and AI-generated code. Each gap is categorized (spacing, color, typography, layout, etc.) and assessed:
- **Covered by existing rule?** — validates that rule's relevance
- **Actionable but no rule?** — candidate for rule discovery (appended to `data/discovery-evidence.json`)
- **Rendering artifact?** — not actionable (font smoothing, anti-aliasing, retina/DPI)

Step 1 — Analysis (CLI): run canicode calibrate-analyze
Step 2 — Converter: implement ENTIRE design as one HTML page + visual-compare
Step 3 — Gap Analyzer: analyze diff image, categorize pixel differences
Step 4 — Evaluation (CLI): compare predicted vs actual difficulty, propose adjustments
Step 5 — Critic: challenge proposals (50% change cap, evidence thresholds)
Step 6 — Arbitrator: apply approved changes to rule-config.ts
```

### Gap Analysis

The Gap Analyzer examines the diff image between Figma screenshot and AI-generated code. Each gap is categorized (spacing, color, typography, layout, etc.) and assessed:
- **Covered by existing rule?** → validates that rule's relevance
- **Actionable but no rule?** → candidate for rule discovery
- **Rendering artifact?** → not actionable (font smoothing, anti-aliasing)

Gap data accumulates in each run's `gaps.json` file (`logs/calibration/*/gaps.json`). The rule discovery pipeline reads this data to find recurring patterns worth turning into new rules.
Gap data is also saved per run in `logs/calibration/*/gaps.json`.
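The three-way assessment can be sketched as a small classifier; the `Gap` fields are assumptions about what `gaps.json` records, not its actual schema:

```typescript
type GapAssessment = "covered" | "discovery-candidate" | "artifact";

interface Gap {
  category: string;           // spacing | color | typography | layout | ...
  matchedRuleId?: string;     // set when an existing rule already flags this
  renderingArtifact: boolean; // font smoothing, anti-aliasing, retina/DPI
}

export function assessGap(gap: Gap): GapAssessment {
  if (gap.renderingArtifact) return "artifact";        // not actionable
  if (gap.matchedRuleId) return "covered";             // validates that rule
  return "discovery-candidate";                        // feeds rule discovery
}
```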

---

## Rule Discovery Pipeline

New rules are added through a 5-agent debate pipeline (`/add-rule`):
New rules are added through a 6-agent pipeline (`/add-rule`). See [CALIBRATION-PLAYBOOK.md](./CALIBRATION-PLAYBOOK.md) for operational details.

```
/add-rule "concept" fixture.json
```bash
/add-rule "concept" fixtures/path

Step 1 — Researcher: explore fixture data + accumulated gap data
Step 1 — Researcher: explore fixture data + data/discovery-evidence.json
Step 2 — Designer: propose rule spec (ID, category, severity, score)
Step 3 — Implementer: write rule code + tests
Step 4 — A/B Visual Validation: implement entire design with/without the rule's data, compare similarity
Step 4 — A/B Visual Validation: implement design with/without the rule's data, compare similarity
Step 5 — Evaluator: measure impact, false positives, visual improvement
Step 6 — Critic: decide KEEP / ADJUST / DROP
```
Comment on lines +248 to 259
Contributor

@coderabbitai coderabbitai bot Mar 26, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check if there's an A/B Visual Validation agent definition
fd -t f 'visual.*validation|a-?b.*test' .claude/agents/rule-discovery/

# List all agent files in rule-discovery directory to confirm count
echo "=== Rule discovery agents ==="
fd -t f '.md$' .claude/agents/rule-discovery/ --exec basename
```

Repository: let-sunny/canicode
Length of output: 155

🏁 Script executed:

```bash
sed -n '120,129p' docs/CALIBRATION-PLAYBOOK.md
```

Repository: let-sunny/canicode
Length of output: 786


Fix inaccurate "6-agent pipeline" claim—only 5 agents exist.

Line 248 claims a "6-agent pipeline," but the pipeline contains only 5 agents (Researcher, Designer, Implementer, Evaluator, Critic). Step 4, "A/B Visual Validation," is an orchestrated process run by the Orchestrator, not a separate agent. Update the description to reflect the correct agent count.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/CALIBRATION.md` around lines 248 - 259, Update the documentation to
state a "5-agent pipeline" instead of "6-agent pipeline" for the /add-rule flow
and adjust the agent list to: Researcher, Designer, Implementer, Evaluator,
Critic; clarify that Step 4 ("A/B Visual Validation") is an orchestrated process
executed by the Orchestrator rather than a separate agent, keeping the step
order but removing it from the enumerated agent count and wording around agents
in CALIBRATION.md and the /add-rule description.

Owner Author


You're right. There are actually 5 agent files (Researcher, Designer, Implementer, Evaluator, Critic), and Step 4 A/B Visual Validation is a process performed by the Orchestrator. Correcting this to "5-agent pipeline" is accurate.


Generated by Claude Code

Contributor


@let-sunny, thanks for confirming! Here's the correction needed in docs/CALIBRATION.md:

````diff
-New rules are added through a 6-agent pipeline (`/add-rule`). See [CALIBRATION-PLAYBOOK.md](./CALIBRATION-PLAYBOOK.md) for operational details.
+New rules are added through a 5-agent pipeline (`/add-rule`). See [CALIBRATION-PLAYBOOK.md](./CALIBRATION-PLAYBOOK.md) for operational details.

 ```bash
 /add-rule "concept" fixtures/path

 Step 1 — Researcher: explore fixture data + data/discovery-evidence.json
 Step 2 — Designer: propose rule spec (ID, category, severity, score)
 Step 3 — Implementer: write rule code + tests
-Step 4 — A/B Visual Validation: implement design with/without the rule's data, compare similarity
+Step 4 — A/B Visual Validation (Orchestrator): implement design with/without the rule's data, compare similarity
 Step 5 — Evaluator: measure impact, false positives, visual improvement
 Step 6 — Critic: decide KEEP / ADJUST / DROP
````

Also worth checking `docs/CALIBRATION-PLAYBOOK.md` for the same "6-agent" wording — the AI summary mentions the Orchestrator step is described there as well, so it may need the same correction for consistency.

---

<details>
<summary>✏️ Learnings added</summary>

Learnt from: let-sunny
Repo: let-sunny/canicode PR: 88
File: docs/CALIBRATION.md:248-259
Timestamp: 2026-03-26T01:29:06.955Z
Learning: In the let-sunny/canicode repository, the /add-rule rule discovery pipeline has 5 agents (Researcher, Designer, Implementer, Evaluator, Critic). Step 4 "A/B Visual Validation" is an orchestrated process run by the Orchestrator, not a separate agent. The pipeline should always be referred to as a "5-agent pipeline," not "6-agent."


> Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure [review instructions](https://docs.coderabbit.ai/guides/review-instructions). You can manage existing learnings in the [app](https://app.coderabbit.ai/learnings).

</details>



---

<details>
<summary>🧠 Learnings used</summary>

Learnt from: CR
Repo: let-sunny/canicode PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-25T18:32:58.247Z
Learning: The core metric for analysis is visual-compare similarity (0-100%), measuring how accurately AI can reproduce a Figma design as code


---

Learnt from: CR
Repo: let-sunny/canicode PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-25T18:32:58.247Z
Learning: Component-related rule scores should NOT be lowered based on small fixture calibration; calibration fixtures should include large, complex pages


---

Learnt from: let-sunny
Repo: let-sunny/canicode PR: 59
File: src/cli/commands/internal/fixture-management.ts:121-129
Timestamp: 2026-03-25T11:36:48.861Z
Learning: In src/cli/commands/internal/fixture-management.ts (and related internal calibration commands), using console.log + exit code 0 for missing run directories or null results from parseDebateResult is intentional by design. These are internal commands consumed by subagents that check stdout text output rather than exit codes; switching to console.error + process.exit(1) would break the calibration pipeline.


</details>



<!-- This is an auto-generated reply by CodeRabbit -->


### Gap → Rule Discovery Flow

```
Calibration runs accumulate gap data
logs/calibration/*/gaps.json (one per run directory)
Researcher reads accumulated gaps
Recurring actionable patterns → new rule candidates
```
After KEEP or ADJUST, discovery evidence for the rule's category is pruned from `data/discovery-evidence.json`.

### Known Limitations

@@ -257,9 +267,3 @@ Recurring actionable patterns → new rule candidates
2. **Test fixtures with both positive and negative cases needed.** Current fixtures tend to be all-or-nothing (e.g., 0% description coverage). Effective evaluation requires controlled fixtures.

3. **Font rendering differences.** Playwright uses system fonts; Figma renders with embedded fonts. This creates a baseline similarity gap (~3-5%) that is not actionable.

### Next Steps

- Design controlled test fixtures per concept
- Accumulate gap data across 10+ fixture runs to identify patterns
- Build gap-to-rule-candidate pipeline automation
7 changes: 4 additions & 3 deletions docs/SCORING.md
@@ -73,9 +73,10 @@ All categories are weighted equally (1.0). No category is inherently more import

The severity weights, density/diversity ratio, and grade thresholds started as intuition-based values. The [`/calibrate-loop`](CALIBRATION.md) pipeline validates them against pixel-level visual comparison:

1. Convert a Figma design to code
2. Compare the result against the original screenshot (`visual-compare`)
3. Check if designs with low scores are actually harder to implement accurately
1. Implement the entire scoped design as one HTML page
2. Compare the result against the Figma screenshot (`visual-compare`)
3. Analyze diff images to categorize pixel gaps
4. Check if designs with low scores are actually harder to implement accurately

Calibration evidence accumulates across runs in `data/calibration-evidence.json`. As more evidence is collected, these constants will be adjusted to better reflect actual implementation difficulty.
