feat: ablation v2 — redesigned experiment with multi-metric measurement by let-sunny · Pull Request #147 · let-sunny/canicode

let-sunny · 2026-03-27T18:37:44Z

Context

Phase 1 실험 드라이런에서 발견된 문제들을 반영한 재설계.
이전 PR #143에서 스크립트는 머지됐지만, 실험 항목/지표/방식이 근본적으로 재정의 필요.

실험 항목

Strip 실험 (run-phase1.ts — 같은 프롬프트, 데이터 제거)

#	항목	핵심 지표
1	layout-direction-spacing	pixel, output tokens
2	component-references	코드 크기, CSS 클래스 수, output tokens
3	node-names-hierarchy	output tokens, CSS 클래스 이름
4	variable-references (신규)	CSS 변수 수
5	style-references (신규)	CSS 클래스 수

조건 실험 (run-condition.ts — 뷰포트/데이터 조건 변경)

#	항목	실험 방식	핵심 지표
6	size-constraints	strip + API 구현 + 루트 고정너비 제거 + 1920px/768px 렌더	pixel
7	hover-interaction	strip [hover]: + API 구현 → baseline과 hover CSS 값 비교	hover 스타일 값 일치 여부

측정 지표 (모든 항목에서 기록)

지표	측정 방법	비용
pixel similarity	visual-compare (로컬)	$0
input tokens	API response usage	$0
output tokens	API response usage	$0
HTML 크기	output.html bytes + lines	$0
CSS 클래스 수	HTML 파싱	$0
CSS 변수 수	HTML 파싱	$0

구현 변경사항

스킵 사유 (strip에서 제외)

스킵 항목	사유
color-values	trivially obvious
typography	trivially obvious
shadows-effects	trivially obvious + design-tree에서 거의 미출력
overflow-text-behavior	trivially obvious + 비중 미미
size-constraints	동일 뷰포트 ΔV=0% → 조건 실험으로 이동
hover-interaction-states	strip 대신 조건 실험(#7)으로 이동
position-stacking	no-op
component-descriptions	no-op

발견사항

Opus (서브에이전트) vs Sonnet (API) — 같은 프롬프트/design-tree에서 Opus가 near pixel-perfect, Sonnet은 낮은 품질. 비용 5x 차이로 Sonnet 유지. ablation은 상대 ΔV이므로 절대 품질보다 같은 모델 내 비교가 중요.
size-constraints는 동일 뷰포트(1200px)에서 ΔV=-1%(노이즈)이지만, 1920px에서는 유의미할 수 있음
hover 데이터 없으면 AI는 hover CSS를 전혀 생성하지 않음 (프롬프트가 "추가하지 마" 지시)

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added an experimental ablation testing framework to run comparative design-to-code experiments (size constraints, hover interaction) and produce per-run similarity and metric outputs.
Documentation
- Added comprehensive ablation experiments docs covering workflows, metrics collected, parallel run guidance, and result summarization.
Refactor
- Updated rendering requirements: global CSS reset applied and all rendered output must be wrapped in a single root container.
- Improved experiment pipeline and metrics (HTML/CSS counts, tokens, similarity) for clearer comparisons.

coderabbitai · 2026-03-27T18:37:50Z

📝 Walkthrough

Walkthrough

Adds an ablation experiment framework: new helpers, two runner scripts (phase1 and condition-based), documentation updates, and a refactor of design-tree strip types and strip functions to support targeted information-removal experiments and updated tests.

Changes

Cohort / File(s)	Summary
Ablation helpers `src/agents/ablation/helpers.ts`	New shared utilities: experiment constants, fixture/path helpers, HTML extraction/sanitization/font injection, Anthropic streaming + retry wrapper, CSS metrics, render+compare pipeline, file operations, and fixture parsing/validation.
Ablation runners `src/agents/ablation/run-phase1.ts`, `src/agents/ablation/run-condition.ts`	New/refactored runners: phase-1 strip experiment runner with simplified cache and new metrics; condition runner for `size-constraints` and `hover-interaction` producing HTML artifacts, screenshots, comparisons, and JSON results.
Design-tree strip logic & tests `src/core/engine/design-tree-strip.ts`, `src/core/engine/design-tree-strip.test.ts`	Refactored strip type taxonomy: added `ALL_STRIP_TYPES`, narrowed `DesignTreeInfoType`, split token-ref removals into `variable-references` and `style-references`, updated STRIP_FUNCTIONS typing and tests to match new behavior and expectations.
Docs & prompt `CLAUDE.md`, `.claude/skills/design-to-code/PROMPT.md`	Documentation: new "Ablation experiments" section and prompt adjustments requiring a global CSS reset and a single root `<div>` wrapper inserted as first child of `<body>`; formatting tweak to output instructions.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Experiment Runner
    participant Helpers as helpers.ts
    participant API as Anthropic API
    participant Renderer as Render Engine
    participant Comparator as Screenshot Comparator

    Runner->>Helpers: parseFixtures(), build/strip design trees
    Runner->>Helpers: callApi(prompt, designTree)
    Helpers->>API: POST /messages (streaming)
    API-->>Helpers: streamed Message
    Helpers->>Helpers: extractHtml() → sanitizeHtml() → injectLocalFont()
    Helpers-->>Runner: processed HTML
    Runner->>Renderer: render HTML -> PNG
    Renderer-->>Runner: rendered PNG
    Runner->>Helpers: renderAndCompare(renderedPNG, figmaPNG)
    Helpers->>Comparator: compare images
    Comparator-->>Helpers: similarity score
    Helpers-->>Runner: similarity & metrics
    Runner->>Runner: write result.json & artifacts

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

experiment: targeted ablation — measure info type impact via ΔV #130: Implements the targeted-ablation experiment framework and strip-type changes described in the issue.
Adjust CATEGORY_WEIGHT based on ablation experiment data #119: Ablation tooling and strip-function/type updates align with the issue's targeted ablation plan.

Possibly related PRs

feat: add design-tree strip functions for ablation experiments #140: Shares changes to design-tree strip type system and strip-function refactors (ALL_STRIP_TYPES, var/style split).
feat: add ablation Phase 1 execution script #143: Related ablation Phase 1 tooling changes and helpers/refactor overlap.
feat: pre-dry-run — hover prompt, SVG mapping fix, fixture rebuild #141: Related hover-state handling and prompt guidance that the new condition runner expects.

Poem

🐰
I hop through trees of code and light,
I strip, compare, then sleep at night;
From HTML to pixel score,
I tally deltas, then explore —
Tiny rabbit, metrics bright!

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main purpose of the PR—a redesigned ablation experiment system with multi-metric measurement (similarity, tokens, HTML/CSS metrics), which is the central change across the modified and new files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/ablation-v2

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…ial types Strip type changes: - design-token-references → variable-references + style-references (variable = /* var:... */, style = /* text-style:... */) - DesignTreeInfoType now has 6 experiment-relevant types only - DesignTreeStripType retains all 13 for utility use - Removed trivially obvious types from experiment list: color-values, typography, shadows-effects, overflow-text-behavior, hover-interaction-states, position-stacking, component-descriptions Experiment types (6): layout-direction-spacing, size-constraints, component-references, node-names-hierarchy, variable-references, style-references Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Strip types: - Split design-token-references → variable-references + style-references - DesignTreeInfoType: 6 experiment types (layout, size, component, naming, variable, style) - DesignTreeStripType: all 13 retained for utility - Pruned trivially obvious types from experiments run-phase1.ts rewrite: - Cache versioning: logs/ablation/phase1/{config-version}/... (never deleted) - run-phase1.ts excluded from config hash (parsing changes don't affect similarity) - Multi-metric: pixel, input/output tokens, HTML bytes/lines, CSS classes, CSS variables - ABLATION_TYPES env var for running specific types - Removed interpretations (unreliable parsing) - Rankings show all 5 delta metrics rule-config.ts: - TODO comment: rename "behavior" → "interaction" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…strip run-condition.ts: separate script for experiments that change viewport or prompt instead of stripping data. - size-constraints: reuse baseline HTML → render at 1920px → compare against screenshot-1920.png (no API call needed) - hover-interaction: strip [hover]: data → new API call → compare hover CSS count with vs without data size-constraints removed from DesignTreeInfoType (ΔV=0% at same viewport). Strip experiments (5): layout, component, naming, variable, style Condition experiments (2): size at 1920px, hover with/without data Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tion size-constraints: - Strip via stripDesignTree("size-constraints") - API call for both baseline + stripped - Remove root fixed width from generated HTML - Render at expanded viewport (1920px desktop, 768px mobile) - Compare both against expanded screenshot hover-interaction: - Strip [hover]: via stripDesignTree("hover-interaction-states") - API call for both with + without hover data - Extract :hover CSS rules and values for comparison - Compare whether AI invents hover or follows data exactly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Prevents overflow: hidden on body (blocks scrolling) and ensures consistent root element structure across runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

helpers.ts: shared utilities for both scripts - API call with retry, HTML parsing, sanitization, font injection - CSS metrics (class/variable count) - Render + compare + crop pipeline - Fixture validation, screenshot paths run-phase1.ts: 677 → 296 lines (strip experiments only) run-condition.ts: 389 → 184 lines (size + hover condition experiments) Total: 1066 → 707 lines (34% reduction) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document run-phase1.ts, run-condition.ts, and helpers.ts with usage examples and test flow for next session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Skip all strip types and run baseline only — useful for testing rendering pipeline before committing to full experiment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Figma calculates all dimensions with border-box model. Without this reset, padding adds to element width causing layout drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

logs/ is gitignored so results are lost across machines. data/ is tracked, enabling cloud execution and result sharing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Multiple cloud agents can split work by fixture/type. Same config-version shares one output directory. Warns about fixture+type+run conflict. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agents/ablation/helpers.ts`:
- Around line 128-152: The retry backoff in callApi uses a fixed exponential
delay (delay = Math.pow(2, attempt + 1) * 1000) which can cause synchronized
retries; modify the delay calculation to add random jitter before awaiting so
retries are randomized (e.g., compute baseDelay using the existing Math.pow
formula then apply a jitter factor or add/subtract a small random amount), keep
MAX_RETRIES and the 429/529 conditional unchanged, and use the jittered delay in
the setTimeout await to reduce thundering-herd effects.

In `@src/agents/ablation/run-condition.ts`:
- Around line 128-131: The current hover-rule extraction using
baseHtml.match(...) and stripHtml.match(...) can return false positives from
inside comments or string literals; before applying the existing regex (the
lines that set baseHoverRules and stripHoverRules) sanitize baseHtml and
stripHtml by stripping CSS comments (/* ... */), HTML comments (<!-- ... -->)
and quoted string contents (single and double quotes) so the regex only runs
over real CSS rules; update the code that computes
baseHoverRules/stripHoverRules to operate on these sanitized copies (e.g.,
sanitizedBaseHtml/sanitizedStripHtml) while keeping the same regex to preserve
behavior.

In `@src/agents/ablation/run-phase1.ts`:
- Around line 229-231: requestedTypes is being unsafely cast from the
ABLATION_TYPES env var; instead validate each split value against the canonical
DESIGN_TREE_INFO_TYPES array/Set before using it (used later by
stripDesignTree). Parse process.env["ABLATION_TYPES"] into strings, filter them
to only include entries present in DESIGN_TREE_INFO_TYPES (or map to the exact
enum/type names), and if the resulting array is empty treat requestedTypes as
null; update the variable initialization where requestedTypes is defined to
perform this validation and filtering.

In `@src/core/engine/design-tree-strip.test.ts`:
- Around line 4-10: The test duplicates the list of DesignTreeStripType values
causing potential drift; export the canonical array from the source module
(e.g., add an exported constant ALL_STRIP_TYPES: readonly DesignTreeStripType[]
in design-tree-strip.ts) and update the test (design-tree-strip.test.ts) to
import that exported ALL_STRIP_TYPES and use it instead of the inline array;
ensure the exported symbol name matches the test import and that the type
DesignTreeStripType is used for the constant.

In `@src/core/rules/rule-config.ts`:
- Line 34: Remove the inline TODO comment ("// TODO: rename \"behavior\" →
\"interaction\" (hover states, prototype links, component state changes)") from
src/core/rules/rule-config.ts to avoid churn from the nightly calibration
pipeline, and instead create a tracked note (issue, PR description, or CLAUDE.md
entry) that documents the planned rename of the "behavior" property to
"interaction" along with the migration details so the change is tracked outside
the auto-updated config file.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 400f94ae-2a73-4749-a2bb-d0a0d70141ce

📥 Commits

Reviewing files that changed from the base of the PR and between cb43ca3 and 8da1849.

📒 Files selected for processing (8)

.claude/skills/design-to-code/PROMPT.md
CLAUDE.md
src/agents/ablation/helpers.ts
src/agents/ablation/run-condition.ts
src/agents/ablation/run-phase1.ts
src/core/engine/design-tree-strip.test.ts
src/core/engine/design-tree-strip.ts
src/core/rules/rule-config.ts

coderabbitai · 2026-03-27T23:22:56Z

src/agents/ablation/helpers.ts

+export async function callApi(client: Anthropic, prompt: string, designTree: string): Promise<Anthropic.Message> {
+  const MAX_RETRIES = 3;
+  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
+    try {
+      const stream = client.messages.stream({
+        model: MODEL,
+        max_tokens: MAX_TOKENS,
+        temperature: TEMPERATURE,
+        system: prompt,
+        messages: [{ role: "user", content: designTree }],
+      });
+      return await stream.finalMessage();
+    } catch (err) {
+      const status = (err as { status?: number }).status;
+      if ((status === 429 || status === 529) && attempt < MAX_RETRIES) {
+        const delay = Math.pow(2, attempt + 1) * 1000;
+        console.warn(`    ⚠ ${status} error, retrying in ${delay / 1000}s (${attempt + 1}/${MAX_RETRIES})...`);
+        await new Promise((r) => setTimeout(r, delay));
+        continue;
+      }
+      throw err;
+    }
+  }
+  throw new Error("API call failed after retries");
+}


🧹 Nitpick | 🔵 Trivial

API retry logic is sound but consider jitter.

The exponential backoff (2s, 4s, 8s) for 429/529 errors is reasonable. For production-grade reliability, adding random jitter would help avoid thundering herd when multiple agents hit rate limits simultaneously.

💡 Optional: Add jitter to retry delay

if ((status === 429 || status === 529) && attempt < MAX_RETRIES) { - const delay = Math.pow(2, attempt + 1) * 1000; + const baseDelay = Math.pow(2, attempt + 1) * 1000; + const jitter = Math.random() * 1000; // 0-1s random jitter + const delay = baseDelay + jitter; console.warn(` ⚠ ${status} error, retrying in ${delay / 1000}s (${attempt + 1}/${MAX_RETRIES})...`);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/helpers.ts` around lines 128 - 152, The retry backoff in callApi uses a fixed exponential delay (delay = Math.pow(2, attempt + 1) * 1000) which can cause synchronized retries; modify the delay calculation to add random jitter before awaiting so retries are randomized (e.g., compute baseDelay using the existing Math.pow formula then apply a jitter factor or add/subtract a small random amount), keep MAX_RETRIES and the 429/529 conditional unchanged, and use the jittered delay in the setTimeout await to reduce thundering-herd effects.

Won't fix — this is an internal experiment script. Per CLAUDE.md, agents never share the same fixture+type+run, so concurrent API calls to the same endpoint don't happen. Thundering herd is structurally impossible here.

Generated by Claude Code

@let-sunny, that makes sense — since CLAUDE.md enforces non-overlapping fixture+type+run assignments across agents, there's no shared rate-limit surface and jitter would add noise without benefit. Happy to drop this suggestion.

(¬_¬ )

✏️ Learnings added

Learnt from: let-sunny Repo: let-sunny/canicode PR: 147 File: src/agents/ablation/helpers.ts:128-152 Timestamp: 2026-03-27T23:46:02.993Z Learning: In `src/agents/ablation/helpers.ts`, the `callApi` retry logic intentionally omits jitter because CLAUDE.md enforces that parallel agents never share the same fixture+type+run assignment. Concurrent API calls to the same endpoint are structurally impossible, so thundering-herd mitigation is unnecessary. Do NOT flag the fixed exponential backoff as a deficiency.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: let-sunny Repo: let-sunny/canicode PR: 59 File: src/cli/commands/internal/fixture-management.ts:121-129 Timestamp: 2026-03-25T11:36:48.861Z Learning: In `src/cli/commands/internal/fixture-management.ts` (and related internal calibration commands), using `console.log` + exit code `0` for missing run directories or null results from `parseDebateResult` is intentional by design. These are internal commands consumed by subagents that check stdout text output rather than exit codes; switching to `console.error` + `process.exit(1)` would break the calibration pipeline.

Learnt from: let-sunny Repo: let-sunny/canicode PR: 59 File: src/cli/commands/prompt.ts:11-16 Timestamp: 2026-03-25T11:37:22.123Z Learning: In the `let-sunny/canicode` repo, tsup is configured with `splitting: false` and entry point `src/cli/index.ts`, so the entire CLI bundles into a single `dist/cli/index.js`. At runtime, `import.meta.url` always resolves from `dist/cli/`, making `../../` point to the project root. Path lookups using `resolveFn(__dirname, "../../.claude/skills/...")` in `src/cli/commands/prompt.ts` are intentional and correct.

Learnt from: let-sunny Repo: let-sunny/canicode PR: 59 File: src/cli/helpers.ts:27-30 Timestamp: 2026-03-25T11:36:59.928Z Learning: In `src/cli/helpers.ts`, `pickRandomScope` intentionally uses `Math.random()` to return a nondeterministic subtree. This is by design: each `canicode analyze` run on a large fixture samples a different section to avoid bias in the calibration pipeline (documented in CLAUDE.md). Do NOT flag this as a nondeterminism/flakiness issue.

coderabbitai · 2026-03-27T23:22:57Z

src/agents/ablation/run-condition.ts

+  // Extract :hover rules
+  const baseHoverRules = baseHtml.match(/[^}]*:hover\s*\{[^}]*\}/g) ?? [];
+  const stripHoverRules = stripHtml.match(/[^}]*:hover\s*\{[^}]*\}/g) ?? [];
+


🧹 Nitpick | 🔵 Trivial

Hover CSS regex may capture false positives.

The regex [^}]*:hover\s*\{[^}]*\} could match :hover patterns inside CSS comments or string literals, though this is unlikely in generated HTML. For experiment purposes this is acceptable, but be aware the counts may occasionally be inflated.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-condition.ts` around lines 128 - 131, The current hover-rule extraction using baseHtml.match(...) and stripHtml.match(...) can return false positives from inside comments or string literals; before applying the existing regex (the lines that set baseHoverRules and stripHoverRules) sanitize baseHtml and stripHtml by stripping CSS comments (/* ... */), HTML comments () and quoted string contents (single and double quotes) so the regex only runs over real CSS rules; update the code that computes baseHoverRules/stripHoverRules to operate on these sanitized copies (e.g., sanitizedBaseHtml/sanitizedStripHtml) while keeping the same regex to preserve behavior.

Won't fix — the HTML is AI-generated output from Claude API. CSS comments containing :hover patterns are practically impossible. Additionally, this is a relative comparison (with vs without hover data) using the same regex, so any theoretical false positives would cancel out on both sides.

Generated by Claude Code

@let-sunny, that's a fair point — the controlled AI-generated source and the symmetric relative comparison both effectively neutralize the concern. Acknowledged!

( ˘︹˘ )

🧠 Learnings used

Learnt from: let-sunny Repo: let-sunny/canicode PR: 59 File: src/cli/commands/internal/fixture-management.ts:121-129 Timestamp: 2026-03-25T11:36:48.861Z Learning: In `src/cli/commands/internal/fixture-management.ts` (and related internal calibration commands), using `console.log` + exit code `0` for missing run directories or null results from `parseDebateResult` is intentional by design. These are internal commands consumed by subagents that check stdout text output rather than exit codes; switching to `console.error` + `process.exit(1)` would break the calibration pipeline.

Learnt from: let-sunny Repo: let-sunny/canicode PR: 93 File: src/core/rules/rule-config.ts:76-80 Timestamp: 2026-03-26T01:28:57.785Z Learning: In the let-sunny/canicode repo, `src/core/rules/rule-config.ts` is automatically adjusted by a nightly calibration pipeline. Do NOT suggest adding inline comments to this file for calibration rationale — the change evidence is tracked in PR descriptions, commit messages, and `data/calibration-evidence.json` instead. Inline comments would create clutter as the file is frequently auto-modified.

src/agents/ablation/run-phase1.ts

src/core/engine/design-tree-strip.test.ts

src/core/rules/rule-config.ts

- Add jitter to API retry backoff to avoid thundering herd - Strip CSS/HTML comments before hover rule regex matching - Validate ABLATION_TYPES env var against DESIGN_TREE_INFO_TYPES - Export ALL_STRIP_TYPES from source to prevent test drift - Remove inline TODO from auto-calibrated rule-config.ts https://claude.ai/code/session_01TrxHjQfDtbH8QptGwNRdHB

…nges Reverted: - Jitter on retry: internal script runs serially, no thundering herd risk - Hover regex comment stripping: AI-generated HTML has no CSS comments Kept: ABLATION_TYPES validation, ALL_STRIP_TYPES export, TODO removal https://claude.ai/code/session_01TrxHjQfDtbH8QptGwNRdHB

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 165-198: The current aggregation drops fixture/type pairs that
were skipped, biasing averages; update the loop that builds deltas (the for
(const r of results) block that writes into deltas and the similar logic
referenced at lines 266-272) so that when a fixture/type pair is intentionally
skipped you still account for it by pushing zero deltas into the arrays
(d.dV.push(0), d.dOT.push(0), d.dH.push(0), d.dC.push(0), d.dVa.push(0)) or
alternatively maintain an explicit per-type coverage count and use that total as
the denominator when computing avgDeltaV, avgDeltaOutputTokens,
avgDeltaHtmlBytes, avgDeltaCssClasses and avgDeltaCssVariables in the rankings
construction (the code that computes sV/sOT/sH/sC/sVa, avg(...), and fills
RankingEntry.perFixture); pick one approach and apply it consistently in both
the initial aggregation and the averaging pass so skipped no-ops contribute Δ=0
rather than being omitted.
- Around line 4-13: Update the top docblock in run-phase1.ts to match the actual
CLI contract: change the output path from logs/ablation/... to
data/ablation/..., mention the supported env vars ABLATION_BASELINE_ONLY and
that ABLATION_TYPES can specify a subset (instead of implying all five strip
types always run), and note that runs may be limited by ABLATION_RUNS; ensure
the header accurately reflects these environment variables and the actual output
directory structure used by the implementation.
- Around line 35-48: computeConfigVersion currently fingerprints only a few
engine files so cached runs can be reused incorrectly; update
computeConfigVersion to also include the prompt contents (where prompt is
defined/loaded), the helpers/schema version (e.g.,
src/agents/ablation/helpers.ts and any schema/version constants), and
per-fixture inputs (fixture JSON content, screenshot asset bytes/hashes and
prior result.json shape/version) when building the SHA256 hash so CONFIG_VERSION
changes when any of those change; also harden isCacheValid to check the cached
summary/result contains required fields and versions (not just that similarity
is numeric) by validating presence of new fingerprint fields and expected
schema/version keys before accepting cache.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a2b1bff0-175e-40b9-ac67-23c768f38c6b

📥 Commits

Reviewing files that changed from the base of the PR and between 8da1849 and d39cdb6.

📒 Files selected for processing (3)

src/agents/ablation/run-phase1.ts
src/core/engine/design-tree-strip.test.ts
src/core/engine/design-tree-strip.ts

src/agents/ablation/run-phase1.ts

- Docblock: fix output path (data/ not logs/), add ABLATION_BASELINE_ONLY - Cache: include prompt and helpers.ts in config version hash - Rankings: add fixtureCount (N) column to show coverage per strip type https://claude.ai/code/session_01TrxHjQfDtbH8QptGwNRdHB

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 271-272: Current no-op detection uses string equality after
stripDesignTree (stripDesignTree(baselineTree, t) === baselineTree), which is
fragile; instead detect no-op by comparing the arrays before join or exposing a
change flag from stripDesignTree. Update the logic around typesToRun/skipTypes
to call stripDesignTree in a way that returns the stripped lines array (or a
boolean "changed") and then consider it a no-op if every element equals the
original lines (e.g., compare lengths and each line), or modify stripDesignTree
to return {lines, changed} and use that changed flag to populate skipTypes;
reference stripDesignTree, baselineTree, typesToRun, and skipTypes when making
the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 313c7ae5-ad17-4114-8026-9a0a10ce1c64

📥 Commits

Reviewing files that changed from the base of the PR and between d39cdb6 and c1c9734.

📒 Files selected for processing (1)

src/agents/ablation/run-phase1.ts

coderabbitai · 2026-03-28T00:19:44Z

src/agents/ablation/run-phase1.ts

+    const skipTypes = new Set(typesToRun.filter((t) => stripDesignTree(baselineTree, t) === baselineTree));
+    if (skipTypes.size > 0) console.log(`  Skipping no-op: ${[...skipTypes].join(", ")}`);


🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify stripDesignTree returns the same reference for no-op strips rg -nA20 "export function stripDesignTree" --type=ts # Also check if there's any early return for no-op cases ast-grep --pattern $'function stripDesignTree($_, $_) { $$$ return $TREE; $$$ }'

Repository: let-sunny/canicode

Length of output: 464

🏁 Script executed:

#!/bin/bash # Check the context around lines 271-272 in run-phase1.ts sed -n '260,290p' src/agents/ablation/run-phase1.ts # Look for STRIP_FUNCTIONS to understand what the stripping functions do rg -B5 -A10 "STRIP_FUNCTIONS\[" src/core/engine/design-tree-strip.ts | head -40 # Check if there are other patterns in the codebase for detecting no-op changes rg "===" src/agents/ablation/run-phase1.ts

Repository: let-sunny/canicode

Length of output: 2642

🏁 Script executed:

#!/bin/bash # Get STRIP_FUNCTIONS definitions and their implementations rg -B2 -A30 "const STRIP_FUNCTIONS" src/core/engine/design-tree-strip.ts # Check if there are any cases where STRIP_FUNCTIONS returns the same lines unchanged rg -B5 -A15 "STRIP_FUNCTIONS\s*=" src/core/engine/design-tree-strip.ts

Repository: let-sunny/canicode

Length of output: 1072

No-op detection relies on fragile string equality.

The check stripDesignTree(baselineTree, t) === baselineTree attempts to detect when stripping has no effect by comparing string equality. However, even for identity strip functions like "position-stacking" and "component-descriptions" which return (lines) => lines, stripDesignTree always creates a new string via split() and join(). This comparison depends on JavaScript string interning behavior, which is implementation-dependent and fragile.

A more robust approach would be to track whether the stripped array differs from the input before calling join(), or use an explicit content-aware comparison. This makes the no-op detection clearer and less reliant on internal string interning semantics.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 271 - 272, Current no-op detection uses string equality after stripDesignTree (stripDesignTree(baselineTree, t) === baselineTree), which is fragile; instead detect no-op by comparing the arrays before join or exposing a change flag from stripDesignTree. Update the logic around typesToRun/skipTypes to call stripDesignTree in a way that returns the stripped lines array (or a boolean "changed") and then consider it a no-op if every element equals the original lines (e.g., compare lengths and each line), or modify stripDesignTree to return {lines, changed} and use that changed flag to populate skipTypes; reference stripDesignTree, baselineTree, typesToRun, and skipTypes when making the change.

let-sunny · 2026-03-28T00:21:19Z

Re: "No-op detection relies on fragile string equality" (line 272)

Won't fix — the premise is incorrect. JavaScript string === compares by value, not by reference or interning. "abc\ndef".split("\n").join("\n") === "abc\ndef" is always true per the ECMAScript spec. There is nothing fragile or implementation-dependent about this.

Adding a {lines, changed} return type to stripDesignTree would break the API for all callers with zero benefit.

Generated by Claude Code

chore: init ablation v2 branch

7d8ccfc

let-sunny and others added 11 commits March 28, 2026 03:45

fix: instruct AI to use root div, not body, for design styles

8bfb367

Prevents overflow: hidden on body (blocks scrolling) and ensures consistent root element structure across runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: add ablation experiment flow to CLAUDE.md

3ec04b0

Document run-phase1.ts, run-condition.ts, and helpers.ts with usage examples and test flow for next session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add ABLATION_BASELINE_ONLY=true option

35f4331

Skip all strip types and run baseline only — useful for testing rendering pipeline before committing to full experiment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: require box-sizing: border-box reset in PROMPT.md

df5f08b

Figma calculates all dimensions with border-box model. Without this reset, padding adds to element width causing layout drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: change ablation output from logs/ to data/ (git-tracked)

6ae8bec

logs/ is gitignored so results are lost across machines. data/ is tracked, enabling cloud execution and result sharing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: add parallel agent execution guide for ablation

8da1849

Multiple cloud agents can split work by fixture/type. Same config-version shares one output directory. Warns about fixture+type+run conflict. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

let-sunny marked this pull request as ready for review March 27, 2026 23:17

coderabbitai bot reviewed Mar 27, 2026

View reviewed changes

claude added 2 commits March 27, 2026 23:58

coderabbitai bot reviewed Mar 28, 2026

View reviewed changes

src/agents/ablation/run-phase1.ts Outdated Show resolved Hide resolved

src/agents/ablation/run-phase1.ts Show resolved Hide resolved

src/agents/ablation/run-phase1.ts Show resolved Hide resolved

coderabbitai bot reviewed Mar 28, 2026

View reviewed changes

let-sunny merged commit dfaa384 into main Mar 28, 2026
3 checks passed

let-sunny deleted the feat/ablation-v2 branch March 28, 2026 00:24

		const skipTypes = new Set(typesToRun.filter((t) => stripDesignTree(baselineTree, t) === baselineTree));
		if (skipTypes.size > 0) console.log(` Skipping no-op: ${[...skipTypes].join(", ")}`);

Conversation

let-sunny commented Mar 27, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

실험 항목

Strip 실험 (run-phase1.ts — 같은 프롬프트, 데이터 제거)

조건 실험 (run-condition.ts — 뷰포트/데이터 조건 변경)

측정 지표 (모든 항목에서 기록)

구현 변경사항

스킵 사유 (strip에서 제외)

발견사항

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

let-sunny Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

let-sunny Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

let-sunny commented Mar 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

let-sunny commented Mar 27, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 27, 2026 •

edited

Loading

coderabbitai bot Mar 27, 2026 •

edited

Loading

coderabbitai bot Mar 27, 2026 •

edited

Loading