From 85d55cf6c35a4d48fa476851df365b0dc62d5cce Mon Sep 17 00:00:00 2001 From: let-sunny Date: Thu, 26 Mar 2026 09:12:38 +0900 Subject: [PATCH 1/3] docs: update calibration/discovery docs for new category structure (#87) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Update CALIBRATION.md: 4-agent → 6-agent pipeline, add Gap Analyzer (Step 3), Prune Evidence (Step 6.5), tiered approach, cross-run evidence section, fix rule-config.ts path - Update CALIBRATION-PLAYBOOK.md: nightly command with fixture input, evidence accumulation in step table, discovery evidence pruning, remove server scripts reference - Update SCORING.md: align calibration steps with current pipeline - Remove duplicated pipeline structure section and outdated Next Steps Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/CALIBRATION-PLAYBOOK.md | 27 +++++---- docs/CALIBRATION.md | 107 ++++++++++++++++++----------------- docs/SCORING.md | 7 ++- 3 files changed, 72 insertions(+), 69 deletions(-) diff --git a/docs/CALIBRATION-PLAYBOOK.md b/docs/CALIBRATION-PLAYBOOK.md index f3d2110d..d155a208 100644 --- a/docs/CALIBRATION-PLAYBOOK.md +++ b/docs/CALIBRATION-PLAYBOOK.md @@ -35,10 +35,10 @@ npx canicode save-fixture "https://www.figma.com/design/ABC123/MyDesign?node-id= | 0 | Orchestrator | Run directory created | `logs/calibration/--/` | | 1 | CLI | `analysis.json` | Rule analysis — which rules flagged what | | 2 | Converter | `output.html`, `figma.png`, `code.png`, `diff.png`, `conversion.json` | Implements the entire design as HTML, runs visual-compare | -| 3 | Gap Analyzer | `gaps.json` | Categorizes pixel differences between Figma and code | -| 4 | CLI | `summary.md` | Score vs actual impact comparison | +| 3 | Gap Analyzer | `gaps.json` | Categorizes pixel differences, appends uncovered gaps to `data/discovery-evidence.json` | +| 4 | Evaluator | `summary.md` | Score vs actual impact comparison, appends to `data/calibration-evidence.json` | | 5 | Critic | 
`debate.json` | Reviews proposals: APPROVE / REJECT / REVISE | -| 6 | Arbitrator | `debate.json` (appended), `rule-config.ts` | Makes final decisions, applies approved changes, commits | +| 6 | Arbitrator | `debate.json` (appended), `rule-config.ts` | Makes final decisions, applies approved changes, commits, prunes evidence | ### What you see @@ -56,18 +56,13 @@ None — fully automatic. Review the commit if you want. ## 3. Nightly Calibration (Multiple Fixtures) -### In Claude Code / Cursor +### In Claude Code ``` -/calibrate-night +/calibrate-night fixtures/ ``` -### On a server - -```bash -./scripts/calibrate-night.sh -./scripts/calibrate-night.sh --deep # uses Figma MCP for richer data -``` +Input: fixture directory path. Auto-discovers active fixtures (`fixtures/*/data.json`). ### What happens @@ -100,7 +95,7 @@ Open `logs/calibration/REPORT.md` the next morning. Key sections: |---------|-----------------|--------| | **Similarity per run** | Low similarity = hard design | Consider adding more rules for that pattern | | **Repeating patterns** | Same gap in 3+ fixtures | Strong candidate for `/add-rule` | -| **Rule score vs impact** | Overscored in most runs | Score will auto-adjust in next calibration | +| **Rule score vs impact** | Overscored (penalty too harsh) or underscored (penalty too mild) | Score will auto-adjust in next calibration | | **New rule candidates** | `text-alignment-mismatch` in 4/6 | Run `/add-rule` | | **Never flagged rules** | Rule never triggered | Consider `enabled: false` in `rule-config.ts` | @@ -125,10 +120,10 @@ When the report identifies a new pattern worth codifying: | Step | Agent | Output | Description | |------|-------|--------|-------------| | 0 | Orchestrator | Run directory | `logs/rule-discovery/--/` | -| 1 | Researcher | `research.json` | Checks if the concept exists in fixture data, reads accumulated gaps | +| 1 | Researcher | `research.json` | Checks fixture data + `data/discovery-evidence.json` for recurring patterns | | 
2 | Designer | `design.json` | Proposes rule spec: ID, category, severity, score, trigger logic | | 3 | Implementer | Source code | Writes rule code + tests, builds | -| 4 | Orchestrator | `visual-a.html`, `visual-b.html` | A/B test: converts design with/without the rule's data | +| 4 | Orchestrator | `visual-a.html`, `visual-b.html` | A/B test: implements design with/without the rule's data, compares pixel similarity | | 5 | Evaluator | `evaluation.json` | Measures false positive rate, visual improvement | | 6 | Critic | `decision.json` | Final verdict | @@ -146,6 +141,10 @@ When the report identifies a new pattern worth codifying: - **Build/test fails** → Implementer attempts fix; if can't, pipeline stops - **A/B shows no improvement** → Evaluator likely recommends DROP +### Evidence pruning + +After KEEP or ADJUST, discovery evidence for the rule's category is pruned from `data/discovery-evidence.json`. This prevents the same pattern from being proposed again. + ### Your decision None during execution — fully automatic. After completion: diff --git a/docs/CALIBRATION.md b/docs/CALIBRATION.md index 5e8dc201..fe7da88c 100644 --- a/docs/CALIBRATION.md +++ b/docs/CALIBRATION.md @@ -1,15 +1,16 @@ # Calibration Pipeline -CanICode's rule scores and severity levels are not arbitrary — they are continuously validated against actual code conversion difficulty through an automated 4-agent debate pipeline. +CanICode's rule scores and severity levels are not arbitrary — they are continuously validated against actual code conversion difficulty through an automated 6-agent pipeline. ## Why Calibrate? Initial rule scores were intuition-based estimates. A rule flagged as "blocking" with score -10 might turn out to be trivial to work around in practice (overscored), or a "suggestion" at -2 might actually cause significant conversion difficulty (underscored). The calibration pipeline validates scores by: -1. Converting flagged Figma nodes to production code -2. 
Measuring how much each rule actually impacted conversion difficulty -3. Proposing score adjustments when predicted and actual difficulty diverge +1. Implementing the entire scoped design as one HTML page +2. Measuring pixel-level similarity against the Figma screenshot (`visual-compare`) +3. Analyzing diff images to categorize pixel gaps +4. Proposing score adjustments when predicted and actual difficulty diverge ## Pipeline Structure @@ -20,34 +21,56 @@ Step 1 — Analysis (CLI) Run canicode calibrate-analyze to identify issues and group by node. Step 2 — Converter (Subagent) - Convert the top 5 flagged nodes to production CSS/HTML/React code. - Assess actual conversion difficulty: easy | moderate | hard | failed. - For each flagged rule, note whether it actually made conversion harder. + Implement the ENTIRE scoped design as one HTML page. + Run visual-compare for pixel-level similarity against Figma screenshot. -Step 3 — Evaluation (CLI) +Step 3 — Gap Analyzer (Subagent) + Analyze the diff image between Figma screenshot and generated code. + Categorize each pixel difference (spacing, color, typography, layout, etc.). + Append uncovered gaps to data/discovery-evidence.json. + +Step 4 — Evaluation (CLI) Compare predicted difficulty (from rule scores) vs actual difficulty. Generate score adjustment proposals. + Append overscored/underscored findings to data/calibration-evidence.json. -Step 4 — Critic (Subagent) +Step 5 — Critic (Subagent) Challenge each proposal against rejection rules: - Rule 1: Low confidence + fewer than 2 supporting cases → reject - Rule 2: Change exceeds 50% of current value → cap at midpoint - Rule 3: Severity change without high confidence → reject -Step 5 — Arbitrator (Subagent) +Step 6 — Arbitrator (Subagent) Make final decisions: - Both approve → apply Runner's value - Critic rejects → keep current score - Critic revises → apply Critic's conservative value Commits approved changes to rule-config.ts. 
+ +Step 6.5 — Prune Evidence + Remove evidence for rules that were just adjusted from + data/calibration-evidence.json (applied rules) and + data/discovery-evidence.json (covered gaps). ``` +### Tiered Approach + +Not all fixtures go through the full pipeline. The tier is based on the current grade: + +| Grade | Pipeline | Rationale | +|-------|----------|-----------| +| A+ and above | Full 6-step pipeline | High-quality designs benefit from gap analysis | +| B to B+ | Visual-only (skip gap analysis) | Moderate quality — focus on score calibration | +| Below B | Skip visual entirely | Too many issues; visual comparison is noisy | + ## Agents | Agent | Role | Can edit rule-config.ts? | |-------|------|------------------------| | **Runner** | Runs analysis, extracts proposals | No | -| **Converter** | Converts Figma nodes to code, assesses difficulty | No | +| **Converter** | Implements entire design as HTML, runs visual-compare | No | +| **Gap Analyzer** | Categorizes pixel differences from diff image | No | +| **Evaluator** | Compares predicted vs actual difficulty | No | | **Critic** | Applies rejection heuristics, caps excessive changes | No | | **Arbitrator** | Makes final decisions, commits changes | Yes | @@ -89,9 +112,18 @@ These weights were validated through calibration: rules rated "blocking" consist | D | 50-64% | Major rework needed | | F | 0-49% | Fundamental structural problems | +## Cross-Run Evidence + +Evidence accumulates across calibration sessions in `data/`: + +- **`data/calibration-evidence.json`** — Overscored/underscored rules. Fed back to the Evaluator in subsequent runs for stronger proposals. +- **`data/discovery-evidence.json`** — Uncovered gaps not covered by existing rules. Fed to the `/add-rule` Researcher to find recurring patterns worth turning into new rules. + +Discovery evidence is filtered to exclude environment/tooling noise (font CDN differences, retina/DPI scaling, network artifacts, CI constraints). 
Evidence is pruned after rules are applied (calibration) or new rules are created (discovery). + ## Score Adjustment History -Rule scores in `src/rules/rule-config.ts` have been adjusted through multiple calibration cycles across different fixture files (Material 3 Design Kit, HTTP Design, Simple DS Card Grid, Simple DS Panel Sections, Simple DS Page Sections). +Rule scores in `src/core/rules/rule-config.ts` have been adjusted through multiple calibration cycles across different fixture files (Material 3 Design Kit, HTTP Design, Simple DS Card Grid, Simple DS Panel Sections, Simple DS Page Sections). Each adjustment requires: - Minimum 2 supporting cases at medium+ confidence @@ -199,56 +231,33 @@ The Critic's conservatism prevented score whiplash — without it, `no-auto-layo --- -## Calibration Pipeline Structure - -``` -/calibrate-loop - -Step 1 — Analysis (CLI): run canicode calibrate-analyze -Step 2 — Converter: implement ENTIRE design as one HTML page + visual-compare -Step 3 — Gap Analyzer: analyze diff image, categorize pixel differences -Step 4 — Evaluation (CLI): compare predicted vs actual difficulty, propose adjustments -Step 5 — Critic: challenge proposals (50% change cap, evidence thresholds) -Step 6 — Arbitrator: apply approved changes to rule-config.ts -``` - -### Gap Analysis +## Gap Analysis -The Gap Analyzer examines the diff image between Figma screenshot and AI-generated code. Each gap is categorized (spacing, color, typography, layout, etc.) and assessed: -- **Covered by existing rule?** → validates that rule's relevance -- **Actionable but no rule?** → candidate for rule discovery -- **Rendering artifact?** → not actionable (font smoothing, anti-aliasing) +The Gap Analyzer (Step 3) examines the diff image between Figma screenshot and AI-generated code. Each gap is categorized (spacing, color, typography, layout, etc.) 
and assessed: +- **Covered by existing rule?** — validates that rule's relevance +- **Actionable but no rule?** — candidate for rule discovery (appended to `data/discovery-evidence.json`) +- **Rendering artifact?** — not actionable (font smoothing, anti-aliasing, retina/DPI) -Gap data accumulates in each run's `gaps.json` file (`logs/calibration/*/gaps.json`). The rule discovery pipeline reads this data to find recurring patterns worth turning into new rules. +Gap data is also saved per run in `logs/calibration/*/gaps.json`. --- ## Rule Discovery Pipeline -New rules are added through a 5-agent debate pipeline (`/add-rule`): +New rules are added through a 6-agent pipeline (`/add-rule`). See [CALIBRATION-PLAYBOOK.md](./CALIBRATION-PLAYBOOK.md) for operational details. ``` -/add-rule "concept" fixture.json +/add-rule "concept" fixtures/path -Step 1 — Researcher: explore fixture data + accumulated gap data +Step 1 — Researcher: explore fixture data + data/discovery-evidence.json Step 2 — Designer: propose rule spec (ID, category, severity, score) Step 3 — Implementer: write rule code + tests -Step 4 — A/B Visual Validation: implement entire design with/without the rule's data, compare similarity +Step 4 — A/B Visual Validation: implement design with/without the rule's data, compare similarity Step 5 — Evaluator: measure impact, false positives, visual improvement Step 6 — Critic: decide KEEP / ADJUST / DROP ``` -### Gap → Rule Discovery Flow - -``` -Calibration runs accumulate gap data - ↓ -logs/calibration/*/gaps.json (one per run directory) - ↓ -Researcher reads accumulated gaps - ↓ -Recurring actionable patterns → new rule candidates -``` +After KEEP or ADJUST, discovery evidence for the rule's category is pruned from `data/discovery-evidence.json`. ### Known Limitations @@ -257,9 +266,3 @@ Recurring actionable patterns → new rule candidates 2. 
**Test fixtures with both positive and negative cases needed.** Current fixtures tend to be all-or-nothing (e.g., 0% description coverage). Effective evaluation requires controlled fixtures. 3. **Font rendering differences.** Playwright uses system fonts; Figma renders with embedded fonts. This creates a baseline similarity gap (~3-5%) that is not actionable. - -### Next Steps - -- Design controlled test fixtures per concept -- Accumulate gap data across 10+ fixture runs to identify patterns -- Build gap-to-rule-candidate pipeline automation diff --git a/docs/SCORING.md b/docs/SCORING.md index e2cd1339..041ac68d 100644 --- a/docs/SCORING.md +++ b/docs/SCORING.md @@ -73,9 +73,10 @@ All categories are weighted equally (1.0). No category is inherently more import The severity weights, density/diversity ratio, and grade thresholds started as intuition-based values. The [`/calibrate-loop`](CALIBRATION.md) pipeline validates them against pixel-level visual comparison: -1. Convert a Figma design to code -2. Compare the result against the original screenshot (`visual-compare`) -3. Check if designs with low scores are actually harder to implement accurately +1. Implement the entire scoped design as one HTML page +2. Compare the result against the Figma screenshot (`visual-compare`) +3. Analyze diff images to categorize pixel gaps +4. Check if designs with low scores are actually harder to implement accurately Calibration evidence accumulates across runs in `data/calibration-evidence.json`. As more evidence is collected, these constants will be adjusted to better reflect actual implementation difficulty. 
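The evidence accumulation and pruning flow that PATCH 1/3 documents (Gap Analyzer appends uncovered gaps; Step 6.5 prunes evidence once a category is covered) can be sketched as follows. This is a minimal illustration under stated assumptions: the `DiscoveryEvidence` shape and the `pruneCovered` helper are invented for this sketch and are not code from the repository; the real pipeline reads and writes `data/discovery-evidence.json` directly.

```typescript
// Hypothetical record shape for one entry in data/discovery-evidence.json.
// Field names are assumptions for illustration.
interface DiscoveryEvidence {
  category: string;    // gap category, e.g. "spacing", "typography"
  fixture: string;     // fixture the gap was observed in
  description: string; // what the Gap Analyzer saw in the diff image
}

// After /add-rule ends in KEEP or ADJUST for a category, drop all
// accumulated evidence for that category so the same pattern is not
// proposed again (the pruning behavior described in the patch above).
function pruneCovered(
  evidence: DiscoveryEvidence[],
  coveredCategory: string,
): DiscoveryEvidence[] {
  return evidence.filter((e) => e.category !== coveredCategory);
}
```

Under this sketch, a run that created a spacing rule would prune every `"spacing"` entry while leaving other categories untouched for future discovery runs.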
From 20abd69964a3fd7dd90627bc5644c9e51455a17f Mon Sep 17 00:00:00 2001 From: let-sunny Date: Thu, 26 Mar 2026 09:41:30 +0900 Subject: [PATCH 2/3] =?UTF-8?q?fix:=20address=20CodeRabbit=20review=20?= =?UTF-8?q?=E2=80=94=20code=20fence=20tags,=20tier=20policy=20update?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add bash language tags to fenced code blocks - Update tier policy: always run Converter regardless of grade (was: skip visual for below B) Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/CALIBRATION-PLAYBOOK.md | 2 +- docs/CALIBRATION.md | 9 +++++---- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/CALIBRATION-PLAYBOOK.md b/docs/CALIBRATION-PLAYBOOK.md index d155a208..bfb2ab73 100644 --- a/docs/CALIBRATION-PLAYBOOK.md +++ b/docs/CALIBRATION-PLAYBOOK.md @@ -58,7 +58,7 @@ None — fully automatic. Review the commit if you want. ### In Claude Code -``` +```bash /calibrate-night fixtures/ ``` diff --git a/docs/CALIBRATION.md b/docs/CALIBRATION.md index fe7da88c..177c65a4 100644 --- a/docs/CALIBRATION.md +++ b/docs/CALIBRATION.md @@ -59,9 +59,10 @@ Not all fixtures go through the full pipeline. The tier is based on the current | Grade | Pipeline | Rationale | |-------|----------|-----------| -| A+ and above | Full 6-step pipeline | High-quality designs benefit from gap analysis | -| B to B+ | Visual-only (skip gap analysis) | Moderate quality — focus on score calibration | -| Below B | Skip visual entirely | Too many issues; visual comparison is noisy | +| A+ and above | Full pipeline (Converter + Gap Analysis) | High-quality designs benefit from gap analysis | +| Below A | Converter + visual-compare only (skip gap analysis) | Low-scoring designs need score validation the most | + +**Always run the Converter** regardless of grade. Skipping visual-compare on low-scoring designs means scores can never be validated. 
## Agents @@ -246,7 +247,7 @@ Gap data is also saved per run in `logs/calibration/*/gaps.json`. New rules are added through a 6-agent pipeline (`/add-rule`). See [CALIBRATION-PLAYBOOK.md](./CALIBRATION-PLAYBOOK.md) for operational details. -``` +```bash /add-rule "concept" fixtures/path Step 1 — Researcher: explore fixture data + data/discovery-evidence.json From 4f53ae3a0c3231fcea65940ee19508ee37b624e6 Mon Sep 17 00:00:00 2001 From: let-sunny Date: Thu, 26 Mar 2026 09:53:59 +0900 Subject: [PATCH 3/3] fix: clarify '6-step pipeline' wording in CALIBRATION.md Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/CALIBRATION.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/CALIBRATION.md b/docs/CALIBRATION.md index 177c65a4..87a432ba 100644 --- a/docs/CALIBRATION.md +++ b/docs/CALIBRATION.md @@ -1,6 +1,6 @@ # Calibration Pipeline -CanICode's rule scores and severity levels are not arbitrary — they are continuously validated against actual code conversion difficulty through an automated 6-agent pipeline. +CanICode's rule scores and severity levels are not arbitrary — they are continuously validated against actual code conversion difficulty through an automated 6-step calibration pipeline. ## Why Calibrate?
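The tier policy that PATCH 2/3 settles on (always run the Converter and visual-compare; run gap analysis only for high grades) can be sketched as a small selection function. The `Grade` ordering, the `Tier` names, and `tierFor` are assumptions for illustration, not the project's actual API.

```typescript
// Hypothetical tier selection mirroring the revised policy: the Converter
// always runs, so every grade maps to a tier that includes visual-compare;
// only grades A and above also get gap analysis.
type Tier = "full" | "visual-only";

const GRADE_ORDER = ["F", "D", "C", "B", "B+", "A", "A+"] as const;
type Grade = (typeof GRADE_ORDER)[number];

function tierFor(grade: Grade): Tier {
  // "Below A" skips gap analysis but still runs Converter + visual-compare,
  // so low-scoring designs can still have their scores validated.
  return GRADE_ORDER.indexOf(grade) >= GRADE_ORDER.indexOf("A")
    ? "full"
    : "visual-only";
}
```

Note there is deliberately no tier that skips the Converter: under the updated policy, skipping visual-compare on low-scoring designs would mean their scores could never be validated.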