From 15a4d60da5f8dcf33260c49b59f7883890ebaf6d Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 31 Mar 2026 06:14:04 +0000 Subject: [PATCH 1/3] refactor: separate Converter role + rename run-phase1 (#218) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 2 of pipeline consolidation: Converter role separation: - Converter now only writes HTML (baseline + 6 strips) and converter-assessment.json (ruleImpact + uncoveredStruggles) - Orchestrator handles all measurements: html-postprocess, visual-compare, code-metrics, responsive comparison, strip delta calculations - Orchestrator assembles final conversion.json from measurements + assessment Ablation rename: - run-phase1.ts → run-strip.ts (clearer name for strip experiments) - Updated CLAUDE.md references https://claude.ai/code/session_01N72z2Wbib4cLhYc3FRdSQi --- .claude/agents/calibration/converter.md | 257 ++---------------- .claude/commands/calibrate-loop.md | 143 +++++++++- CLAUDE.md | 4 +- .../ablation/{run-phase1.ts => run-strip.ts} | 4 +- 4 files changed, 163 insertions(+), 245 deletions(-) rename src/experiments/ablation/{run-phase1.ts => run-strip.ts} (99%) diff --git a/.claude/agents/calibration/converter.md b/.claude/agents/calibration/converter.md index 83fdcbb..09ffd30 100644 --- a/.claude/agents/calibration/converter.md +++ b/.claude/agents/calibration/converter.md @@ -1,11 +1,11 @@ --- name: calibration-converter -description: Converts the entire scoped Figma design to a single HTML page and measures pixel-perfect accuracy via visual comparison. +description: Converts the entire scoped Figma design to a single HTML page. Outputs baseline + strip HTML files and a self-assessment of rule impacts. tools: Bash, Read, Write, Glob model: claude-sonnet-4-6 --- -You are the Converter agent in a calibration pipeline. Your job is to implement the entire scoped design as a single HTML page and measure how accurately it matches the original Figma design. 
+You are the Converter agent in a calibration pipeline. Your job is to implement the entire scoped design as a single HTML page, then implement 6 stripped variants. The orchestrator handles all measurements (visual-compare, code-metrics). ## Input @@ -27,10 +27,8 @@ Convert the **entire root node** (the full scoped design) as one standalone HTML Use BOTH sources together for accurate conversion: **Primary source — design tree (structure + CSS-ready values):** -``` -npx canicode design-tree --output $RUN_DIR/design-tree.txt -``` -This produces a 4KB DOM-like tree with inline CSS styles instead of 250KB+ raw JSON. Each node = one HTML element. Every style value is CSS-ready. +Read `$RUN_DIR/design-tree.txt` (pre-generated by the orchestrator). +This is a 4KB DOM-like tree with inline CSS styles instead of 250KB+ raw JSON. Each node = one HTML element. Every style value is CSS-ready. **Secondary source — fixture JSON (exact raw values):** Read the original fixture JSON directly when you need to verify a value from the design tree. Use it to cross-check colors, spacing, font sizes, and any value that seems ambiguous or lossy in the design tree output. @@ -48,256 +46,61 @@ Read and follow `.claude/skills/design-to-code/PROMPT.md` for all code generatio ## Steps 1. Read `.claude/skills/design-to-code/PROMPT.md` for code generation rules -2. Generate design tree (CLI) +2. Read `$RUN_DIR/design-tree.txt` (pre-generated by orchestrator) 3. Convert the design tree to a single standalone HTML+CSS file - Each node in the tree maps 1:1 to an HTML element - Copy style values directly — they are already CSS-ready - Follow all rules from DESIGN-TO-CODE-PROMPT.md 4. Save to `$RUN_DIR/output.html` -5. Post-process HTML (sanitize + inject local fonts): - - ```bash - npx canicode html-postprocess $RUN_DIR/output.html - ``` - -6. 
Run visual comparison: - - ```bash - npx canicode visual-compare $RUN_DIR/output.html \ - --figma-url "https://www.figma.com/design//file?node-id=" \ - --output $RUN_DIR - ``` - - This saves `figma.png`, `code.png`, and `diff.png` into the run directory. - Replace `:` with `-` in the nodeId for the URL. -7. **Responsive comparison** (if expanded screenshot exists): - - List `screenshot-*.png` in the fixture directory. Extract the width number from each filename, sort numerically. If 2+ screenshots exist, the smallest width is the original and the largest is the expanded viewport. - - ```bash - # Example: screenshot-1200.png (original), screenshot-1920.png (expanded) - SCREENSHOTS=($(ls /screenshot-*.png | sort -t- -k2 -n)) - LARGEST="${SCREENSHOTS[-1]}" - LARGEST_WIDTH=$(echo "$LARGEST" | grep -oP 'screenshot-\K\d+') - - npx canicode visual-compare $RUN_DIR/output.html \ - --figma-url "https://www.figma.com/design//file?node-id=" \ - --figma-screenshot "$LARGEST" \ - --width "$LARGEST_WIDTH" \ - --expand-root \ - --output $RUN_DIR/responsive - ``` +5. **Strip Ablation — HTML generation only**: For each of the **6** strip types, the orchestrator has placed stripped design-trees in `$RUN_DIR/stripped/`. Convert each to HTML. - The command outputs JSON to stdout with a `similarity` field. Record it as `responsiveSimilarity` and calculate `responsiveDelta = similarity - responsiveSimilarity`. - If only 1 screenshot exists, skip responsive comparison and set `responsiveSimilarity`, `responsiveDelta`, and `responsiveViewport` to `null`. -8. 
Use similarity to determine overall difficulty (thresholds defined in `src/agents/orchestrator.ts` → `SIMILARITY_DIFFICULTY_THRESHOLDS`): + **Strip types** (process every one): `layout-direction-spacing`, `size-constraints`, `component-references`, `node-names-hierarchy`, `variable-references`, `style-references` - | Similarity | Difficulty | - |-----------|-----------| - | 90%+ | easy | - | 70-89% | moderate | - | 50-69% | hard | - | <50% | failed | + For each ``: + a. Read `$RUN_DIR/stripped/.txt` + b. Convert to HTML with the same rules as baseline (PROMPT.md); save `$RUN_DIR/stripped/.html` -9. **MANDATORY — Rule Impact Assessment**: For EVERY rule ID in `nodeIssueSummaries[].flaggedRuleIds`, assess its actual impact on conversion. Read the analysis JSON, collect all unique `flaggedRuleIds`, and for each one write an entry in `ruleImpactAssessment`. This array MUST NOT be empty if there are flagged rules. +6. **MANDATORY — Rule Impact Assessment**: For EVERY rule ID in `nodeIssueSummaries[].flaggedRuleIds`, assess its actual impact on conversion. Read the analysis JSON, collect all unique `flaggedRuleIds`, and for each one write an entry in `ruleImpactAssessment`. This array MUST NOT be empty if there are flagged rules. - Did this rule's issue actually make the conversion harder? - What was its real impact on the final similarity score? - Rate as: `easy` (no real difficulty), `moderate` (some guessing needed), `hard` (significant pixel loss), `failed` (could not reproduce) -10. **Code metrics** (shared CLI — recorded for analysis/reporting): - - ```bash - npx canicode code-metrics $RUN_DIR/output.html - ``` - - Returns JSON with `htmlBytes`, `htmlLines`, `cssClassCount`, `cssVariableCount`. -11. Note any difficulties NOT covered by existing rules as `uncoveredStruggles` - - **Only include design-related issues** — problems in the Figma file structure, missing tokens, ambiguous layout, etc. 
- - **Exclude environment/tooling issues** — font CDN availability, screenshot DPI/retina scaling, browser rendering quirks, network issues, CI limitations. These are not design problems. -12. **Strip Ablation** (objective difficulty measurement): For each of the **6** strip types, the orchestrator places stripped design-trees in `$RUN_DIR/stripped/`. Convert each to HTML, then collect the **same categories of metrics as the baseline** (pixel similarity, optional responsive similarity, design-tree token estimate, HTML size, CSS counts). Strip rows in `conversion.json` must populate `StripDeltaResultSchema` (`src/agents/contracts/conversion-agent.ts`). - - **Strip types** (process every one): `layout-direction-spacing`, `size-constraints`, `component-references`, `node-names-hierarchy`, `variable-references`, `style-references` - - For each ``: - - a. Read `$RUN_DIR/stripped/.txt` - b. Convert to HTML with the same rules as baseline (PROMPT.md); save `$RUN_DIR/stripped/.html`, then post-process: - ```bash - npx canicode html-postprocess $RUN_DIR/stripped/.html - ``` - c. **Pixel similarity** (design viewport — same framing as baseline): - ```bash - npx canicode visual-compare $RUN_DIR/stripped/.html \ - --figma-screenshot $RUN_DIR/figma.png \ - --output $RUN_DIR/stripped/ - ``` - Record `strippedSimilarity` from the command JSON stdout. Use the baseline run’s `similarity` as `baselineSimilarity` (same value for every strip row). - d. **Input tokens (design-tree text):** Match `generateDesignTreeWithStats` in `src/core/design-tree/design-tree.ts`: - `inputTokens = ceil(utf8Text.length / 4)` where `utf8Text` is the full file contents as a JavaScript string (use the same string length as if the file were read with UTF-8 decoding). 
- - `baselineInputTokens` = from `$RUN_DIR/design-tree.txt` - - `strippedInputTokens` = from `$RUN_DIR/stripped/.txt` - - `tokenDelta` = `baselineInputTokens - strippedInputTokens` - Example (Node): `node -e "const fs=require('fs'); const n=Math.ceil(fs.readFileSync(process.argv[1],'utf8').length/4); console.log(n)" "$RUN_DIR/design-tree.txt"` - e. **Code metrics** (shared CLI — covers HTML size + CSS metrics): - ```bash - npx canicode code-metrics $RUN_DIR/output.html # baseline - npx canicode code-metrics $RUN_DIR/stripped/.html # stripped - ``` - From JSON output: `baselineHtmlBytes` / `strippedHtmlBytes`, `baselineCssClassCount` / `strippedCssClassCount`, `baselineCssVariableCount` / `strippedCssVariableCount`. Compute `htmlBytesDelta` = `baselineHtmlBytes - strippedHtmlBytes`. - f. **Responsive similarity at the expanded viewport** (same screenshot + width as step 7): - - If step 7 **skipped** (only one fixture screenshot): set `baselineResponsiveSimilarity`, `strippedResponsiveSimilarity`, `responsiveDelta`, and `responsiveViewport` to `null` on **every** strip row. - - If step 7 **ran**: reuse the same `LARGEST` screenshot path and `LARGEST_WIDTH` variables from step 7. - - - **`size-constraints` (required):** Run visual-compare on the stripped HTML at the expanded viewport so missing size info shows up where it actually breaks (not only at design width): - - ```bash - npx canicode visual-compare $RUN_DIR/stripped/size-constraints.html \ - --figma-screenshot "$LARGEST" \ - --width "$LARGEST_WIDTH" \ - --expand-root \ - --output $RUN_DIR/stripped/size-constraints-responsive - ``` - - Record JSON stdout `similarity` as **`strippedResponsiveSimilarity`**. Set **`baselineResponsiveSimilarity`** to the root conversion field **`responsiveSimilarity`** from step 7 (baseline `output.html` at the same viewport — already measured). Set **`responsiveViewport`** to `LARGEST_WIDTH` (number). 
Set **`responsiveDelta`** = `baselineResponsiveSimilarity - strippedResponsiveSimilarity` (percentage points). - - - **Other strip types:** Optional — same command pattern with `$RUN_DIR/stripped/.html` and a distinct `--output` directory if you want responsive rows for reporting; otherwise set the four responsive fields to `null`. - - **Derived fields (every strip row):** - - - `delta` = `baselineSimilarity - strippedSimilarity` (percentage points) - - `deltaDifficulty`: use the metric the evaluator uses for that strip family (`src/agents/evaluation-agent.ts` — `getStripDifficultyForRule`): - - `layout-direction-spacing` → map `delta` with `stripDeltaToDifficulty` (`src/core/design-tree/delta.ts` pixel table below) - - `size-constraints` → if `responsiveDelta` is a finite number, map `responsiveDelta` with `stripDeltaToDifficulty`; else map `delta` - - `component-references`, `node-names-hierarchy`, `variable-references`, `style-references` → if both input token counts are present, map with `tokenDeltaToDifficulty(baselineInputTokens, strippedInputTokens)`; else map `delta` with `stripDeltaToDifficulty` - - Pixel / responsive threshold table (`stripDeltaToDifficulty`): - - | Delta (%p) | Difficulty | - |-----------|------------| - | ≤ 5 | easy | - | 6–15 | moderate | - | 16–30 | hard | - | > 30 | failed | - - Token threshold table (`tokenDeltaToDifficulty`): percentage = `(baselineInputTokens - strippedInputTokens) / baselineInputTokens * 100` — ≤5% easy, 6–20% moderate, 21–40% hard, >40% failed (baseline 0 → treat as easy). +7. Note any difficulties NOT covered by existing rules as `uncoveredStruggles` + - **Only include design-related issues** — problems in the Figma file structure, missing tokens, ambiguous layout, etc. + - **Exclude environment/tooling issues** — font CDN availability, screenshot DPI/retina scaling, browser rendering quirks, network issues, CI limitations. These are not design problems. ## Output -Write results to `$RUN_DIR/conversion.json`. 
+Write results to `$RUN_DIR/converter-assessment.json`. **CRITICAL: `ruleImpactAssessment` MUST contain one entry per unique flagged rule ID. An empty array means the calibration pipeline cannot evaluate rule scores.** ```json { "rootNodeId": "562:9069", - "generatedCode": "// The full HTML page", - "similarity": 87, - "responsiveSimilarity": 72, - "responsiveDelta": 15, - "responsiveViewport": 1920, - "htmlBytes": 42000, - "htmlLines": 850, - "cssClassCount": 45, - "cssVariableCount": 12, - "difficulty": "moderate", - "notes": "Summary of the conversion experience", "ruleImpactAssessment": [ { - "ruleId": "raw-value", - "issueCount": 4, - "actualImpact": "easy", - "description": "Colors were directly available in design tree, no difficulty" - }, - { - "ruleId": "detached-instance", - "issueCount": 2, - "actualImpact": "easy", - "description": "Detached instances rendered identically to attached ones" + "ruleId": "no-auto-layout", + "issueCount": 5, + "actualImpact": "high", + "description": "..." 
} ], - "stripDeltas": [ - { - "stripType": "layout-direction-spacing", - "baselineSimilarity": 87, - "strippedSimilarity": 75, - "delta": 12, - "deltaDifficulty": "moderate", - "baselineResponsiveSimilarity": null, - "strippedResponsiveSimilarity": null, - "responsiveDelta": null, - "responsiveViewport": null, - "baselineInputTokens": 2400, - "strippedInputTokens": 2380, - "tokenDelta": 20, - "baselineHtmlBytes": 42000, - "strippedHtmlBytes": 41500, - "htmlBytesDelta": 500, - "baselineCssClassCount": 45, - "strippedCssClassCount": 44, - "baselineCssVariableCount": 12, - "strippedCssVariableCount": 12 - }, - { - "stripType": "size-constraints", - "baselineSimilarity": 87, - "strippedSimilarity": 86, - "delta": 1, - "deltaDifficulty": "moderate", - "baselineResponsiveSimilarity": 72, - "strippedResponsiveSimilarity": 58, - "responsiveDelta": 14, - "responsiveViewport": 1920, - "baselineInputTokens": 2400, - "strippedInputTokens": 2200, - "tokenDelta": 200, - "baselineHtmlBytes": 42000, - "strippedHtmlBytes": 41800, - "htmlBytesDelta": 200, - "baselineCssClassCount": 45, - "strippedCssClassCount": 45, - "baselineCssVariableCount": 12, - "strippedCssVariableCount": 12 - }, - { - "stripType": "component-references", - "baselineSimilarity": 87, - "strippedSimilarity": 84, - "delta": 3, - "deltaDifficulty": "hard", - "baselineResponsiveSimilarity": null, - "strippedResponsiveSimilarity": null, - "responsiveDelta": null, - "responsiveViewport": null, - "baselineInputTokens": 2400, - "strippedInputTokens": 1800, - "tokenDelta": 600, - "baselineHtmlBytes": 42000, - "strippedHtmlBytes": 39000, - "htmlBytesDelta": 3000, - "baselineCssClassCount": 45, - "strippedCssClassCount": 38, - "baselineCssVariableCount": 12, - "strippedCssVariableCount": 10 - } - ], - "interpretations": [ - "Used system font fallback for Inter (not installed in CI)", - "Set body margin to 0 (not specified in design tree)" - ], "uncoveredStruggles": [ { - "description": "A difficulty not covered by any 
flagged rule",
-      "suggestedCategory": "pixel-critical | responsive-critical | code-quality | token-management | interaction | semantic",
-      "estimatedImpact": "easy | moderate | hard | failed"
+      "description": "...",
+      "suggestedCategory": "pixel-critical",
+      "estimatedImpact": "moderate"
     }
-  ]
+  ],
+  "interpretations": ["guessed X as Y", "assumed Z"]
 }
 ```
 
+The orchestrator will run all measurements (html-postprocess, visual-compare, code-metrics) and assemble the final `conversion.json` by merging your assessment with the measurement results.
+
 ## Rules
 
-- Do NOT modify any source files. Only write to the run directory.
-- Implement the FULL design, not individual nodes.
-- If visual-compare fails (rate limit, etc.), set similarity to -1 and explain in notes.
-- Return a brief summary so the orchestrator can proceed.
+- **Do NOT run visual-compare, html-postprocess, or code-metrics.** The orchestrator handles all measurements.
+- **Do NOT write conversion.json.** Write only `converter-assessment.json`. The orchestrator assembles the final conversion.json.
+- Focus on accurate HTML implementation and honest rule impact assessment.
+- Each strip HTML should be a fresh implementation from the stripped design-tree, not a modification of the baseline.

diff --git a/.claude/commands/calibrate-loop.md b/.claude/commands/calibrate-loop.md
index 38c0e24..85521af 100644
--- a/.claude/commands/calibrate-loop.md
+++ b/.claude/commands/calibrate-loop.md
@@ -50,7 +50,7 @@ If tier is `"visual-only"`, append after Converter completes:
 {"step":"Gap Analyzer","timestamp":"","result":"SKIPPED — tier=visual-only, gap analysis skipped","durationMs":0}
 ```
 
-### Step 2 — Converter (Baseline + Strip Ablation)
+### Step 2 — Converter (HTML Generation)
 
 Read the analysis JSON to extract `fileKey`. Also determine the root nodeId — if the input was a Figma URL, parse the node-id from it. If it was a fixture, use the document root id. 
@@ -91,32 +91,147 @@ Fixture directory: fileKey: Root nodeId: Run directory: -figma.png is already in the run directory (copied from fixture screenshot). visual-compare will reuse it. +design-tree.txt is already in the run directory. Stripped design-trees are pre-generated in $RUN_DIR/stripped/. -After completing the baseline conversion (steps 1-10), proceed with step 11 (Strip Ablation) -to convert each stripped design-tree and measure similarity deltas. -``` -The Converter writes `output.html`, `conversion.json`, `design-tree.txt` to $RUN_DIR and runs `visual-compare --output $RUN_DIR` which creates `figma.png` (or reuses cached), `code.png`, `diff.png`. It also writes stripped HTML files and their comparison results. +Your job: implement baseline HTML (output.html) + 6 strip HTMLs (stripped/.html), +then write converter-assessment.json with ruleImpactAssessment + uncoveredStruggles. +Do NOT run visual-compare, html-postprocess, or code-metrics — the orchestrator handles measurements. +``` After the Converter returns, **verify** these files exist in $RUN_DIR: ```bash -ls $RUN_DIR/conversion.json $RUN_DIR/output.html +ls $RUN_DIR/output.html $RUN_DIR/converter-assessment.json +``` + +If `converter-assessment.json` is missing, write it yourself from the Converter's returned summary. + +**Record token usage**: The subagent result includes `total_tokens`, `tool_uses`, `duration_ms` in usage metadata. Store these for later inclusion in conversion.json. + +Append to `$RUN_DIR/activity.jsonl`: +```json +{"step":"Converter","timestamp":"","result":"baseline + 6 strips written, tokens=","durationMs":} +``` + +### Step 2.5 — Measurements (CLI — no LLM) + +Run all measurements on the Converter's HTML outputs. This is deterministic — no subagent needed. 
+ +**Baseline measurements:** + +```bash +# Post-process HTML (sanitize + inject local fonts) +npx canicode html-postprocess $RUN_DIR/output.html + +# Visual comparison (baseline) +npx canicode visual-compare $RUN_DIR/output.html \ + --figma-screenshot $RUN_DIR/figma.png \ + --output $RUN_DIR +``` + +Record the `similarity` from visual-compare JSON stdout. + +**Responsive comparison** (if expanded screenshot exists): + +List `screenshot-*.png` in the fixture directory. Extract the width number from each filename, sort numerically. If 2+ screenshots exist, the smallest width is the original and the largest is the expanded viewport. + +```bash +# Example: screenshot-1200.png (original), screenshot-1920.png (expanded) +SCREENSHOTS=($(ls /screenshot-*.png | sort -t- -k2 -n)) +LARGEST="${SCREENSHOTS[-1]}" +LARGEST_WIDTH=$(echo "$LARGEST" | grep -oP 'screenshot-\K\d+') + +npx canicode visual-compare $RUN_DIR/output.html \ + --figma-screenshot "$LARGEST" \ + --width "$LARGEST_WIDTH" \ + --expand-root \ + --output $RUN_DIR/responsive +``` + +Record `responsiveSimilarity` from JSON stdout. If only 1 screenshot exists, set `responsiveSimilarity`, `responsiveDelta`, `responsiveViewport` to `null`. + +**Code metrics (baseline):** + +```bash +npx canicode code-metrics $RUN_DIR/output.html +``` + +Record `htmlBytes`, `htmlLines`, `cssClassCount`, `cssVariableCount` from JSON stdout. + +**Strip measurements** — for each of the 6 strip types: + +```bash +# Post-process +npx canicode html-postprocess $RUN_DIR/stripped/.html + +# Visual comparison +npx canicode visual-compare $RUN_DIR/stripped/.html \ + --figma-screenshot $RUN_DIR/figma.png \ + --output $RUN_DIR/stripped/ + +# Code metrics +npx canicode code-metrics $RUN_DIR/stripped/.html +``` + +For each strip, record `strippedSimilarity`, `strippedHtmlBytes`, `strippedCssClassCount`, `strippedCssVariableCount` from the CLI outputs. 
+ +**Input tokens** (design-tree text): `inputTokens = ceil(utf8Text.length / 4)` +- `baselineInputTokens` from `$RUN_DIR/design-tree.txt` +- `strippedInputTokens` from `$RUN_DIR/stripped/.txt` +- `tokenDelta` = `baselineInputTokens - strippedInputTokens` + +**Responsive for size-constraints strip** (if responsive comparison ran above): + +```bash +npx canicode visual-compare $RUN_DIR/stripped/size-constraints.html \ + --figma-screenshot "$LARGEST" \ + --width "$LARGEST_WIDTH" \ + --expand-root \ + --output $RUN_DIR/stripped/size-constraints-responsive ``` -If `conversion.json` is missing, write it yourself from the Converter's returned summary. +Other strip types: set responsive fields to `null`. + +**Derived fields (every strip row):** + +- `delta` = `baselineSimilarity - strippedSimilarity` (percentage points) +- `htmlBytesDelta` = `baselineHtmlBytes - strippedHtmlBytes` +- `deltaDifficulty`: use the metric the evaluator uses for that strip family (`src/agents/evaluation-agent.ts` — `getStripDifficultyForRule`): + - `layout-direction-spacing` → map `delta` with `stripDeltaToDifficulty` (≤5 easy, 6–15 moderate, 16–30 hard, >30 failed) + - `size-constraints` → if `responsiveDelta` is finite, map `responsiveDelta` with `stripDeltaToDifficulty`; else map `delta` + - `component-references`, `node-names-hierarchy`, `variable-references`, `style-references` → if both token counts present, map with `tokenDeltaToDifficulty` (≤5% easy, 6–20% moderate, 21–40% hard, >40% failed); else map `delta` with `stripDeltaToDifficulty` -**Verify strip deltas**: Read `conversion.json` and check that `stripDeltas` array is present and non-empty. If missing (Converter didn't complete strip ablation), log a warning but continue — the evaluation will fall back to Converter self-assessment. +**Difficulty from similarity:** Use `SIMILARITY_DIFFICULTY_THRESHOLDS` from `src/agents/orchestrator.ts`: 90%+ easy, 70-89% moderate, 50-69% hard, <50% failed. 
+
+**Responsive fields:** If the responsive comparison ran, compute the baseline `responsiveDelta = similarity - responsiveSimilarity` and set `responsiveViewport` to `LARGEST_WIDTH`. For the `size-constraints` strip, record the responsive visual-compare `similarity` as `strippedResponsiveSimilarity`, set `baselineResponsiveSimilarity` to the baseline `responsiveSimilarity`, and compute the strip row's `responsiveDelta = baselineResponsiveSimilarity - strippedResponsiveSimilarity`.
+ +**Assemble `conversion.json`**: Merge Converter's `converter-assessment.json` (ruleImpactAssessment, uncoveredStruggles) with all measurement results: + +```json +{ + "rootNodeId": "", + "similarity": , + "difficulty": "", + "responsiveSimilarity": , + "responsiveDelta": , + "responsiveViewport": , + "htmlBytes": , + "htmlLines": , + "cssClassCount": , + "cssVariableCount": , + "ruleImpactAssessment": , + "uncoveredStruggles": , + "stripDeltas": [], + "converterTokens": , + "converterToolUses": , + "converterDurationMs": +} +``` -**Record token usage**: The subagent result includes `total_tokens`, `tool_uses`, `duration_ms` in usage metadata. Read `conversion.json`, add these fields, and write back: -- `converterTokens`: total tokens consumed by the Converter subagent -- `converterToolUses`: number of tool calls -- `converterDurationMs`: execution time in milliseconds +Write `$RUN_DIR/conversion.json`. Append to `$RUN_DIR/activity.jsonl`: ```json -{"step":"Converter","timestamp":"","result":"similarity=% difficulty= strips=/5 tokens=","durationMs":} +{"step":"Measurements","timestamp":"","result":"similarity=% difficulty= strips=/6","durationMs":} ``` ### Step 3 — Gap Analysis diff --git a/CLAUDE.md b/CLAUDE.md index 4d1ed9f..6bf2551 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -126,10 +126,10 @@ Calibration commands are NOT exposed as CLI commands. They run exclusively insid Two scripts, shared helpers: -**`run-phase1.ts` — Strip experiments** +**`run-strip.ts` — Strip experiments** ```bash -ANTHROPIC_API_KEY=sk-... npx tsx src/experiments/ablation/run-phase1.ts +ANTHROPIC_API_KEY=sk-... npx tsx src/experiments/ablation/run-strip.ts ABLATION_FIXTURES=desktop-product-detail ABLATION_TYPES=component-references npx tsx ... 
``` diff --git a/src/experiments/ablation/run-phase1.ts b/src/experiments/ablation/run-strip.ts similarity index 99% rename from src/experiments/ablation/run-phase1.ts rename to src/experiments/ablation/run-strip.ts index bb80620..375c933 100644 --- a/src/experiments/ablation/run-phase1.ts +++ b/src/experiments/ablation/run-strip.ts @@ -1,11 +1,11 @@ /** - * Ablation Phase 1: Strip experiments. + * Ablation: Strip experiments. * * For each selected strip type × N fixtures × M runs: * Strip info from design-tree → implement via API → render → compare → record metrics * * Usage: - * ANTHROPIC_API_KEY=sk-... npx tsx src/experiments/ablation/run-phase1.ts + * ANTHROPIC_API_KEY=sk-... npx tsx src/experiments/ablation/run-strip.ts * * Environment variables: * ANTHROPIC_API_KEY, ABLATION_FIXTURES, ABLATION_TYPES, ABLATION_RUNS, ABLATION_BASELINE_ONLY From e8d77e110760e70041e0ccc15024bff251fb4252 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 31 Mar 2026 06:37:04 +0000 Subject: [PATCH 2/3] =?UTF-8?q?fix:=20address=20review=20=E2=80=94=20fix?= =?UTF-8?q?=20actualImpact=20enum=20+=20verify=20stripped=20HTMLs?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix example JSON "high" → "hard" (valid: easy/moderate/hard/failed) - Add verification for 6 stripped HTML files after Converter returns https://claude.ai/code/session_01N72z2Wbib4cLhYc3FRdSQi --- .claude/agents/calibration/converter.md | 2 +- .claude/commands/calibrate-loop.md | 8 +++++++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/.claude/agents/calibration/converter.md b/.claude/agents/calibration/converter.md index 09ffd30..b957108 100644 --- a/.claude/agents/calibration/converter.md +++ b/.claude/agents/calibration/converter.md @@ -81,7 +81,7 @@ Write results to `$RUN_DIR/converter-assessment.json`. { "ruleId": "no-auto-layout", "issueCount": 5, - "actualImpact": "high", + "actualImpact": "hard", "description": "..." 
} ], diff --git a/.claude/commands/calibrate-loop.md b/.claude/commands/calibrate-loop.md index 85521af..277c03f 100644 --- a/.claude/commands/calibrate-loop.md +++ b/.claude/commands/calibrate-loop.md @@ -103,9 +103,15 @@ Do NOT run visual-compare, html-postprocess, or code-metrics — the orchestrato After the Converter returns, **verify** these files exist in $RUN_DIR: ```bash ls $RUN_DIR/output.html $RUN_DIR/converter-assessment.json +ls $RUN_DIR/stripped/layout-direction-spacing.html \ + $RUN_DIR/stripped/size-constraints.html \ + $RUN_DIR/stripped/component-references.html \ + $RUN_DIR/stripped/node-names-hierarchy.html \ + $RUN_DIR/stripped/variable-references.html \ + $RUN_DIR/stripped/style-references.html ``` -If `converter-assessment.json` is missing, write it yourself from the Converter's returned summary. +If any file is missing, log a warning naming the missing files but continue. **Record token usage**: The subagent result includes `total_tokens`, `tool_uses`, `duration_ms` in usage metadata. Store these for later inclusion in conversion.json. From edd6e97f21d892d5e488c2592522b4e7056dd578 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 31 Mar 2026 06:49:46 +0000 Subject: [PATCH 3/3] fix: update Step 3 similarity source to conversion.json https://claude.ai/code/session_01N72z2Wbib4cLhYc3FRdSQi --- .claude/commands/calibrate-loop.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/commands/calibrate-loop.md b/.claude/commands/calibrate-loop.md index 277c03f..ca3d6d8 100644 --- a/.claude/commands/calibrate-loop.md +++ b/.claude/commands/calibrate-loop.md @@ -256,7 +256,7 @@ Proceed to Step 4. **If EXISTS**: spawn the `calibration-gap-analyzer` subagent. 
In the prompt include: - Screenshot paths: `$RUN_DIR/figma.png`, `$RUN_DIR/code.png`, `$RUN_DIR/diff.png` -- Similarity score from the Converter's output +- Similarity score from `$RUN_DIR/conversion.json` - Generated HTML path: `$RUN_DIR/output.html` - Fixture path and analysis JSON path: `$RUN_DIR/analysis.json` - The Converter's interpretations list
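
---

A minimal TypeScript sketch of the two difficulty mappings and the token estimate these docs reference (`stripDeltaToDifficulty`, `tokenDeltaToDifficulty`, and the `ceil(length / 4)` heuristic). This is an illustrative reconstruction from the thresholds stated in the patched markdown, not the actual source in `src/core/design-tree/delta.ts`:

```typescript
type Difficulty = "easy" | "moderate" | "hard" | "failed";

// Pixel / responsive delta thresholds (percentage points):
// <=5 easy, 6-15 moderate, 16-30 hard, >30 failed
function stripDeltaToDifficulty(delta: number): Difficulty {
  if (delta <= 5) return "easy";
  if (delta <= 15) return "moderate";
  if (delta <= 30) return "hard";
  return "failed";
}

// Token thresholds: percentage of input tokens removed by the strip.
// <=5% easy, 6-20% moderate, 21-40% hard, >40% failed (baseline 0 -> easy)
function tokenDeltaToDifficulty(baseline: number, stripped: number): Difficulty {
  if (baseline === 0) return "easy";
  const pct = ((baseline - stripped) / baseline) * 100;
  if (pct <= 5) return "easy";
  if (pct <= 20) return "moderate";
  if (pct <= 40) return "hard";
  return "failed";
}

// Design-tree token estimate used for baselineInputTokens / strippedInputTokens:
// inputTokens = ceil(utf8Text.length / 4)
function estimateInputTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

console.log(stripDeltaToDifficulty(12));          // 6-15 band -> "moderate"
console.log(tokenDeltaToDifficulty(2400, 1800));  // 25% removed -> "hard"
console.log(estimateInputTokens("a".repeat(10))); // ceil(10 / 4) = 3
```

The same three functions cover every `deltaDifficulty` branch in Step 2.5: `layout-direction-spacing` maps `delta`, `size-constraints` prefers `responsiveDelta`, and the reference-stripping families prefer the token percentage.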