Merged
29 changes: 14 additions & 15 deletions docs/CALIBRATION-PLAYBOOK.md
@@ -35,10 +35,10 @@ npx canicode save-fixture "https://www.figma.com/design/ABC123/MyDesign?node-id=
| 0 | Orchestrator | Run directory created | `logs/calibration/<name>--<timestamp>/` |
| 1 | CLI | `analysis.json` | Rule analysis — which rules flagged what |
| 2 | Converter | `output.html`, `figma.png`, `code.png`, `diff.png`, `conversion.json` | Implements the entire design as HTML, runs visual-compare |
| 3 | Gap Analyzer | `gaps.json` | Categorizes pixel differences between Figma and code |
| 4 | CLI | `summary.md` | Score vs actual impact comparison |
| 3 | Gap Analyzer | `gaps.json` | Categorizes pixel differences, appends uncovered gaps to `data/discovery-evidence.json` |
| 4 | Evaluator | `summary.md` | Score vs actual impact comparison, appends to `data/calibration-evidence.json` |
| 5 | Critic | `debate.json` | Reviews proposals: APPROVE / REJECT / REVISE |
| 6 | Arbitrator | `debate.json` (appended), `rule-config.ts` | Makes final decisions, applies approved changes, commits |
| 6 | Arbitrator | `debate.json` (appended), `rule-config.ts` | Makes final decisions, applies approved changes, commits, prunes evidence |

### What you see

@@ -56,19 +56,14 @@ None — fully automatic. Review the commit if you want.

## 3. Nightly Calibration (Multiple Fixtures)

### In Claude Code / Cursor

```
/calibrate-night
```

### On a server
### In Claude Code

```bash
./scripts/calibrate-night.sh
./scripts/calibrate-night.sh --deep # uses Figma MCP for richer data
/calibrate-night fixtures/
```

Input: fixture directory path. Auto-discovers active fixtures (`fixtures/*/data.json`).

### What happens

```
@@ -100,7 +95,7 @@ Open `logs/calibration/REPORT.md` the next morning. Key sections:
|---------|-----------------|--------|
| **Similarity per run** | Low similarity = hard design | Consider adding more rules for that pattern |
| **Repeating patterns** | Same gap in 3+ fixtures | Strong candidate for `/add-rule` |
| **Rule score vs impact** | Overscored in most runs | Score will auto-adjust in next calibration |
| **Rule score vs impact** | Overscored (penalty too harsh) or underscored (penalty too mild) | Score will auto-adjust in next calibration |
| **New rule candidates** | `text-alignment-mismatch` in 4/6 | Run `/add-rule` |
| **Never flagged rules** | Rule never triggered | Consider `enabled: false` in `rule-config.ts` |

@@ -125,10 +120,10 @@ When the report identifies a new pattern worth codifying:
| Step | Agent | Output | Description |
|------|-------|--------|-------------|
| 0 | Orchestrator | Run directory | `logs/rule-discovery/<concept>--<date>/` |
| 1 | Researcher | `research.json` | Checks if the concept exists in fixture data, reads accumulated gaps |
| 1 | Researcher | `research.json` | Checks fixture data + `data/discovery-evidence.json` for recurring patterns |
| 2 | Designer | `design.json` | Proposes rule spec: ID, category, severity, score, trigger logic |
| 3 | Implementer | Source code | Writes rule code + tests, builds |
| 4 | Orchestrator | `visual-a.html`, `visual-b.html` | A/B test: converts design with/without the rule's data |
| 4 | Orchestrator | `visual-a.html`, `visual-b.html` | A/B test: implements design with/without the rule's data, compares pixel similarity |
| 5 | Evaluator | `evaluation.json` | Measures false positive rate, visual improvement |
| 6 | Critic | `decision.json` | Final verdict |

@@ -146,6 +141,10 @@ When the report identifies a new pattern worth codifying:
- **Build/test fails** → Implementer attempts fix; if can't, pipeline stops
- **A/B shows no improvement** → Evaluator likely recommends DROP

### Evidence pruning

After KEEP or ADJUST, discovery evidence for the rule's category is pruned from `data/discovery-evidence.json`. This prevents the same pattern from being proposed again.
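A minimal sketch of this pruning step, assuming each evidence entry carries a `category` field (the actual layout of `data/discovery-evidence.json` may differ):

```typescript
interface DiscoveryEvidence {
  category: string;    // e.g. "spacing", "typography" (assumed labels)
  fixture: string;
  description: string;
}

// Drop evidence the new rule now covers, so the Researcher
// does not propose the same pattern again.
export function pruneByCategory(
  evidence: DiscoveryEvidence[],
  coveredCategory: string
): DiscoveryEvidence[] {
  return evidence.filter((e) => e.category !== coveredCategory);
}
```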

### Your decision

None during execution — fully automatic. After completion:
110 changes: 57 additions & 53 deletions docs/CALIBRATION.md
@@ -1,15 +1,16 @@
# Calibration Pipeline

CanICode's rule scores and severity levels are not arbitrary — they are continuously validated against actual code conversion difficulty through an automated 4-agent debate pipeline.
CanICode's rule scores and severity levels are not arbitrary — they are continuously validated against actual code conversion difficulty through an automated 6-step calibration pipeline.

## Why Calibrate?

Initial rule scores were intuition-based estimates. A rule flagged as "blocking" with score -10 might turn out to be trivial to work around in practice (overscored), or a "suggestion" at -2 might actually cause significant conversion difficulty (underscored).

The calibration pipeline validates scores by:
1. Converting flagged Figma nodes to production code
2. Measuring how much each rule actually impacted conversion difficulty
3. Proposing score adjustments when predicted and actual difficulty diverge
1. Implementing the entire scoped design as one HTML page
2. Measuring pixel-level similarity against the Figma screenshot (`visual-compare`)
3. Analyzing diff images to categorize pixel gaps
4. Proposing score adjustments when predicted and actual difficulty diverge

## Pipeline Structure

@@ -20,34 +21,57 @@ Step 1 — Analysis (CLI)
Run canicode calibrate-analyze to identify issues and group by node.

Step 2 — Converter (Subagent)
Convert the top 5 flagged nodes to production CSS/HTML/React code.
Assess actual conversion difficulty: easy | moderate | hard | failed.
For each flagged rule, note whether it actually made conversion harder.
Implement the ENTIRE scoped design as one HTML page.
Run visual-compare for pixel-level similarity against Figma screenshot.

Step 3 — Evaluation (CLI)
Step 3 — Gap Analyzer (Subagent)
Analyze the diff image between Figma screenshot and generated code.
Categorize each pixel difference (spacing, color, typography, layout, etc.).
Append uncovered gaps to data/discovery-evidence.json.

Step 4 — Evaluation (CLI)
Compare predicted difficulty (from rule scores) vs actual difficulty.
Generate score adjustment proposals.
Append overscored/underscored findings to data/calibration-evidence.json.

Step 4 — Critic (Subagent)
Step 5 — Critic (Subagent)
Challenge each proposal against rejection rules:
- Rule 1: Low confidence + fewer than 2 supporting cases → reject
- Rule 2: Change exceeds 50% of current value → cap at midpoint
- Rule 3: Severity change without high confidence → reject

Step 5 — Arbitrator (Subagent)
Step 6 — Arbitrator (Subagent)
Make final decisions:
- Both approve → apply Runner's value
- Critic rejects → keep current score
- Critic revises → apply Critic's conservative value
Commits approved changes to rule-config.ts.

Step 6.5 — Prune Evidence
Remove evidence for rules that were just adjusted from
data/calibration-evidence.json (applied rules) and
data/discovery-evidence.json (covered gaps).
```
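The Critic's rejection rules and the Arbitrator's decision table above can be sketched in TypeScript. This is an illustrative model only — the `Proposal` and `Verdict` shapes are assumptions, not the pipeline's actual types:

```typescript
type Confidence = "low" | "medium" | "high";

interface Proposal {
  ruleId: string;
  currentScore: number;   // e.g. -10
  proposedScore: number;  // e.g. -6
  supportingCases: number;
  confidence: Confidence;
  changesSeverity: boolean;
}

type Verdict =
  | { kind: "APPROVE" }
  | { kind: "REJECT" }
  | { kind: "REVISE"; score: number };

// Sketch of the Critic's three rejection rules.
export function critique(p: Proposal): Verdict {
  // Rule 1: low confidence with fewer than 2 supporting cases → reject.
  if (p.confidence === "low" && p.supportingCases < 2) return { kind: "REJECT" };
  // Rule 3: severity changes require high confidence → reject otherwise.
  if (p.changesSeverity && p.confidence !== "high") return { kind: "REJECT" };
  // Rule 2: changes exceeding 50% of the current value are capped at the midpoint.
  const delta = Math.abs(p.proposedScore - p.currentScore);
  if (delta > Math.abs(p.currentScore) * 0.5) {
    return { kind: "REVISE", score: (p.currentScore + p.proposedScore) / 2 };
  }
  return { kind: "APPROVE" };
}

// Sketch of the Arbitrator's final decision.
export function arbitrate(p: Proposal, v: Verdict): number {
  switch (v.kind) {
    case "APPROVE": return p.proposedScore; // both approve → Runner's value
    case "REJECT":  return p.currentScore;  // keep current score
    case "REVISE":  return v.score;         // Critic's conservative value
  }
}
```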

### Tiered Approach

Not all fixtures go through the full pipeline. The tier is based on the current grade:

| Grade | Pipeline | Rationale |
|-------|----------|-----------|
| A+ and above | Full pipeline (Converter + Gap Analysis) | High-quality designs benefit from gap analysis |
| Below A | Converter + visual-compare only (skip gap analysis) | Low-scoring designs need score validation the most |

**Always run the Converter** regardless of grade. Skipping visual-compare on low-scoring designs means scores can never be validated.

## Agents

| Agent | Role | Can edit rule-config.ts? |
|-------|------|------------------------|
| **Runner** | Runs analysis, extracts proposals | No |
| **Converter** | Converts Figma nodes to code, assesses difficulty | No |
| **Converter** | Implements entire design as HTML, runs visual-compare | No |
| **Gap Analyzer** | Categorizes pixel differences from diff image | No |
| **Evaluator** | Compares predicted vs actual difficulty | No |
| **Critic** | Applies rejection heuristics, caps excessive changes | No |
| **Arbitrator** | Makes final decisions, commits changes | Yes |

@@ -89,9 +113,18 @@ These weights were validated through calibration: rules rated "blocking" consist
| D | 50-64% | Major rework needed |
| F | 0-49% | Fundamental structural problems |

## Cross-Run Evidence

Evidence accumulates across calibration sessions in `data/`:

- **`data/calibration-evidence.json`** — Overscored/underscored rules. Fed back to the Evaluator in subsequent runs for stronger proposals.
- **`data/discovery-evidence.json`** — Uncovered gaps not covered by existing rules. Fed to the `/add-rule` Researcher to find recurring patterns worth turning into new rules.

Discovery evidence is filtered to exclude environment/tooling noise (font CDN differences, retina/DPI scaling, network artifacts, CI constraints). Evidence is pruned after rules are applied (calibration) or new rules are created (discovery).
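The noise filter might look like the following sketch; the category labels here are illustrative, not the pipeline's actual identifiers:

```typescript
// Assumed labels for environment/tooling noise.
const NOISE_CATEGORIES: ReadonlySet<string> = new Set([
  "font-cdn",
  "retina-dpi-scaling",
  "network-artifact",
  "ci-constraint",
]);

export interface GapEvidence {
  category: string;
  description: string;
}

// Keep only actionable gaps before appending to data/discovery-evidence.json.
export function dropEnvironmentNoise(entries: GapEvidence[]): GapEvidence[] {
  return entries.filter((e) => !NOISE_CATEGORIES.has(e.category));
}
```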

## Score Adjustment History

Rule scores in `src/rules/rule-config.ts` have been adjusted through multiple calibration cycles across different fixture files (Material 3 Design Kit, HTTP Design, Simple DS Card Grid, Simple DS Panel Sections, Simple DS Page Sections).
Rule scores in `src/core/rules/rule-config.ts` have been adjusted through multiple calibration cycles across different fixture files (Material 3 Design Kit, HTTP Design, Simple DS Card Grid, Simple DS Panel Sections, Simple DS Page Sections).

Each adjustment requires:
- Minimum 2 supporting cases at medium+ confidence
@@ -199,56 +232,33 @@ The Critic's conservatism prevented score whiplash — without it, `no-auto-layo

---

## Calibration Pipeline Structure
## Gap Analysis

```
/calibrate-loop <fixture>
The Gap Analyzer (Step 3) examines the diff image between Figma screenshot and AI-generated code. Each gap is categorized (spacing, color, typography, layout, etc.) and assessed:
- **Covered by existing rule?** — validates that rule's relevance
- **Actionable but no rule?** — candidate for rule discovery (appended to `data/discovery-evidence.json`)
- **Rendering artifact?** — not actionable (font smoothing, anti-aliasing, retina/DPI)

Step 1 — Analysis (CLI): run canicode calibrate-analyze
Step 2 — Converter: implement ENTIRE design as one HTML page + visual-compare
Step 3 — Gap Analyzer: analyze diff image, categorize pixel differences
Step 4 — Evaluation (CLI): compare predicted vs actual difficulty, propose adjustments
Step 5 — Critic: challenge proposals (50% change cap, evidence thresholds)
Step 6 — Arbitrator: apply approved changes to rule-config.ts
```

### Gap Analysis

The Gap Analyzer examines the diff image between Figma screenshot and AI-generated code. Each gap is categorized (spacing, color, typography, layout, etc.) and assessed:
- **Covered by existing rule?** → validates that rule's relevance
- **Actionable but no rule?** → candidate for rule discovery
- **Rendering artifact?** → not actionable (font smoothing, anti-aliasing)

Gap data accumulates in each run's `gaps.json` file (`logs/calibration/*/gaps.json`). The rule discovery pipeline reads this data to find recurring patterns worth turning into new rules.
Gap data is also saved per run in `logs/calibration/*/gaps.json`.
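The three-way assessment can be sketched as a small classifier; the `Gap` fields are assumptions about what `gaps.json` records, not its actual schema:

```typescript
type GapAssessment = "covered" | "discovery-candidate" | "artifact";

interface Gap {
  category: string;           // spacing | color | typography | layout | ...
  matchedRuleId?: string;     // set when an existing rule already flags this
  renderingArtifact: boolean; // font smoothing, anti-aliasing, retina/DPI
}

export function assessGap(gap: Gap): GapAssessment {
  if (gap.renderingArtifact) return "artifact";        // not actionable
  if (gap.matchedRuleId) return "covered";             // validates that rule
  return "discovery-candidate";                        // feeds rule discovery
}
```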

---

## Rule Discovery Pipeline

New rules are added through a 5-agent debate pipeline (`/add-rule`):
New rules are added through a 6-agent pipeline (`/add-rule`). See [CALIBRATION-PLAYBOOK.md](./CALIBRATION-PLAYBOOK.md) for operational details.

```
/add-rule "concept" fixture.json
```bash
/add-rule "concept" fixtures/path

Step 1 — Researcher: explore fixture data + accumulated gap data
Step 1 — Researcher: explore fixture data + data/discovery-evidence.json
Step 2 — Designer: propose rule spec (ID, category, severity, score)
Step 3 — Implementer: write rule code + tests
Step 4 — A/B Visual Validation: implement entire design with/without the rule's data, compare similarity
Step 4 — A/B Visual Validation: implement design with/without the rule's data, compare similarity
Step 5 — Evaluator: measure impact, false positives, visual improvement
Step 6 — Critic: decide KEEP / ADJUST / DROP
```
Comment on lines +248 to 259
Contributor

@coderabbitai coderabbitai bot Mar 26, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check if there's an A/B Visual Validation agent definition
fd -t f 'visual.*validation|a-?b.*test' .claude/agents/rule-discovery/

# List all agent files in rule-discovery directory to confirm count
echo "=== Rule discovery agents ==="
fd -t f '.md$' .claude/agents/rule-discovery/ --exec basename
```

Repository: let-sunny/canicode
Length of output: 155

🏁 Script executed:

```bash
sed -n '120,129p' docs/CALIBRATION-PLAYBOOK.md
```

Repository: let-sunny/canicode
Length of output: 786


Fix inaccurate "6-agent pipeline" claim—only 5 agents exist.

Line 248 claims a "6-agent pipeline," but the pipeline contains only 5 agents (Researcher, Designer, Implementer, Evaluator, Critic). Step 4, "A/B Visual Validation," is an orchestrated process run by the Orchestrator, not a separate agent. Update the description to reflect the correct agent count.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/CALIBRATION.md` around lines 248 - 259, Update the documentation to
state a "5-agent pipeline" instead of "6-agent pipeline" for the /add-rule flow
and adjust the agent list to: Researcher, Designer, Implementer, Evaluator,
Critic; clarify that Step 4 ("A/B Visual Validation") is an orchestrated process
executed by the Orchestrator rather than a separate agent, keeping the step
order but removing it from the enumerated agent count and wording around agents
in CALIBRATION.md and the /add-rule description.

Owner Author


You're right. There are actually 5 agent files (Researcher, Designer, Implementer, Evaluator, Critic), and Step 4 A/B Visual Validation is a process performed by the Orchestrator. Correcting this to "5-agent pipeline" is accurate.


Generated by Claude Code

Contributor


@let-sunny, thanks for confirming! Here's the correction needed in docs/CALIBRATION.md:

````diff
-New rules are added through a 6-agent pipeline (`/add-rule`). See [CALIBRATION-PLAYBOOK.md](./CALIBRATION-PLAYBOOK.md) for operational details.
+New rules are added through a 5-agent pipeline (`/add-rule`). See [CALIBRATION-PLAYBOOK.md](./CALIBRATION-PLAYBOOK.md) for operational details.

 ```bash
 /add-rule "concept" fixtures/path

 Step 1 — Researcher: explore fixture data + data/discovery-evidence.json
 Step 2 — Designer: propose rule spec (ID, category, severity, score)
 Step 3 — Implementer: write rule code + tests
-Step 4 — A/B Visual Validation: implement design with/without the rule's data, compare similarity
+Step 4 — A/B Visual Validation (Orchestrator): implement design with/without the rule's data, compare similarity
 Step 5 — Evaluator: measure impact, false positives, visual improvement
 Step 6 — Critic: decide KEEP / ADJUST / DROP
````

Also worth checking `docs/CALIBRATION-PLAYBOOK.md` for the same "6-agent" wording — the AI summary mentions the Orchestrator step is described there as well, so it may need the same correction for consistency.

---

<details>
<summary>✏️ Learnings added</summary>

Learnt from: let-sunny
Repo: let-sunny/canicode PR: 88
File: docs/CALIBRATION.md:248-259
Timestamp: 2026-03-26T01:29:06.955Z
Learning: In the let-sunny/canicode repository, the /add-rule rule discovery pipeline has 5 agents (Researcher, Designer, Implementer, Evaluator, Critic). Step 4 "A/B Visual Validation" is an orchestrated process run by the Orchestrator, not a separate agent. The pipeline should always be referred to as a "5-agent pipeline," not "6-agent."


> Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure [review instructions](https://docs.coderabbit.ai/guides/review-instructions). You can manage existing learnings in the [app](https://app.coderabbit.ai/learnings).

</details>



---

<details>
<summary>🧠 Learnings used</summary>

Learnt from: CR
Repo: let-sunny/canicode PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-25T18:32:58.247Z
Learning: The core metric for analysis is visual-compare similarity (0-100%), measuring how accurately AI can reproduce a Figma design as code


---

Learnt from: CR
Repo: let-sunny/canicode PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-25T18:32:58.247Z
Learning: Component-related rule scores should NOT be lowered based on small fixture calibration; calibration fixtures should include large, complex pages


---

Learnt from: let-sunny
Repo: let-sunny/canicode PR: 59
File: src/cli/commands/internal/fixture-management.ts:121-129
Timestamp: 2026-03-25T11:36:48.861Z
Learning: In src/cli/commands/internal/fixture-management.ts (and related internal calibration commands), using console.log + exit code 0 for missing run directories or null results from parseDebateResult is intentional by design. These are internal commands consumed by subagents that check stdout text output rather than exit codes; switching to console.error + process.exit(1) would break the calibration pipeline.


</details>



<!-- This is an auto-generated reply by CodeRabbit -->


### Gap → Rule Discovery Flow

```
Calibration runs accumulate gap data
logs/calibration/*/gaps.json (one per run directory)
Researcher reads accumulated gaps
Recurring actionable patterns → new rule candidates
```
After KEEP or ADJUST, discovery evidence for the rule's category is pruned from `data/discovery-evidence.json`.

### Known Limitations

@@ -257,9 +267,3 @@ Recurring actionable patterns → new rule candidates
2. **Test fixtures with both positive and negative cases needed.** Current fixtures tend to be all-or-nothing (e.g., 0% description coverage). Effective evaluation requires controlled fixtures.

3. **Font rendering differences.** Playwright uses system fonts; Figma renders with embedded fonts. This creates a baseline similarity gap (~3-5%) that is not actionable.

### Next Steps

- Design controlled test fixtures per concept
- Accumulate gap data across 10+ fixture runs to identify patterns
- Build gap-to-rule-candidate pipeline automation
7 changes: 4 additions & 3 deletions docs/SCORING.md
@@ -73,9 +73,10 @@ All categories are weighted equally (1.0). No category is inherently more import

The severity weights, density/diversity ratio, and grade thresholds started as intuition-based values. The [`/calibrate-loop`](CALIBRATION.md) pipeline validates them against pixel-level visual comparison:

1. Convert a Figma design to code
2. Compare the result against the original screenshot (`visual-compare`)
3. Check if designs with low scores are actually harder to implement accurately
1. Implement the entire scoped design as one HTML page
2. Compare the result against the Figma screenshot (`visual-compare`)
3. Analyze diff images to categorize pixel gaps
4. Check if designs with low scores are actually harder to implement accurately

Calibration evidence accumulates across runs in `data/calibration-evidence.json`. As more evidence is collected, these constants will be adjusted to better reflect actual implementation difficulty.
