Skip to content

feat: ablation v2 — redesigned experiment with multi-metric measurement#147

Merged
let-sunny merged 15 commits intomainfrom
feat/ablation-v2
Mar 28, 2026
Merged

feat: ablation v2 — redesigned experiment with multi-metric measurement#147
let-sunny merged 15 commits intomainfrom
feat/ablation-v2

Conversation

@let-sunny
Copy link
Copy Markdown
Owner

@let-sunny let-sunny commented Mar 27, 2026

Context

Phase 1 실험 드라이런에서 발견된 문제들을 반영한 재설계.
이전 PR #143에서 스크립트는 머지됐지만, 실험 항목/지표/방식이 근본적으로 재정의 필요.

실험 항목

Strip 실험 (run-phase1.ts — 같은 프롬프트, 데이터 제거)

# 항목 핵심 지표
1 layout-direction-spacing pixel, output tokens
2 component-references 코드 크기, CSS 클래스 수, output tokens
3 node-names-hierarchy output tokens, CSS 클래스 이름
4 variable-references (신규) CSS 변수 수
5 style-references (신규) CSS 클래스 수

조건 실험 (run-condition.ts — 뷰포트/데이터 조건 변경)

# 항목 실험 방식 핵심 지표
6 size-constraints strip + API 구현 + 루트 고정너비 제거 + 1920px/768px 렌더 pixel
7 hover-interaction strip [hover]: + API 구현 → baseline과 hover CSS 값 비교 hover 스타일 값 일치 여부

측정 지표 (모든 항목에서 기록)

지표 측정 방법 비용
pixel similarity visual-compare (로컬) $0
input tokens API response usage $0
output tokens API response usage $0
HTML 크기 output.html bytes + lines $0
CSS 클래스 수 HTML 파싱 $0
CSS 변수 수 HTML 파싱 $0

구현 변경사항

  • design-token-referencesvariable-references + style-references 분리
  • 조건 실험 별도 스크립트 (run-condition.ts)
  • CSS 클래스/변수 수 파싱 + result.json에 기록
  • interpretations 제거
  • 캐시 버저닝 (config-version별 디렉토리로 보존, 삭제 불필요)
  • run-phase1.ts를 config hash에서 제외
  • size-constraints를 strip에서 제거 → 조건 실험으로 이동
  • ABLATION_TYPES 환경변수 (특정 타입만 실행)
  • behavior → interaction 카테고리 rename TODO 주석
  • 모델: Sonnet (Opus는 pixel-perfect이지만 ~5x 비용, ΔV 측정에는 Sonnet 충분)

스킵 사유 (strip에서 제외)

스킵 항목 사유
color-values trivially obvious
typography trivially obvious
shadows-effects trivially obvious + design-tree에서 거의 미출력
overflow-text-behavior trivially obvious + 비중 미미
size-constraints 동일 뷰포트 ΔV=0% → 조건 실험으로 이동
hover-interaction-states strip 대신 조건 실험(#7)으로 이동
position-stacking no-op
component-descriptions no-op

발견사항

  • Opus (서브에이전트) vs Sonnet (API) — 같은 프롬프트/design-tree에서 Opus가 near pixel-perfect, Sonnet은 낮은 품질. 비용 5x 차이로 Sonnet 유지. ablation은 상대 ΔV이므로 절대 품질보다 같은 모델 내 비교가 중요.
  • size-constraints는 동일 뷰포트(1200px)에서 ΔV=-1%(노이즈)이지만, 1920px에서는 유의미할 수 있음
  • hover 데이터 없으면 AI는 hover CSS를 전혀 생성하지 않음 (프롬프트가 "추가하지 마" 지시)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added an experimental ablation testing framework to run comparative design-to-code experiments (size constraints, hover interaction) and produce per-run similarity and metric outputs.
  • Documentation

    • Added comprehensive ablation experiments docs covering workflows, metrics collected, parallel run guidance, and result summarization.
  • Refactor

    • Updated rendering requirements: global CSS reset applied and all rendered output must be wrapped in a single root container.
    • Improved experiment pipeline and metrics (HTML/CSS counts, tokens, similarity) for clearer comparisons.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

Walkthrough

Adds an ablation experiment framework: new helpers, two runner scripts (phase1 and condition-based), documentation updates, and a refactor of design-tree strip types and strip functions to support targeted information-removal experiments and updated tests.

Changes

Cohort / File(s) Summary
Ablation helpers
src/agents/ablation/helpers.ts
New shared utilities: experiment constants, fixture/path helpers, HTML extraction/sanitization/font injection, Anthropic streaming + retry wrapper, CSS metrics, render+compare pipeline, file operations, and fixture parsing/validation.
Ablation runners
src/agents/ablation/run-phase1.ts, src/agents/ablation/run-condition.ts
New/refactored runners: phase-1 strip experiment runner with simplified cache and new metrics; condition runner for size-constraints and hover-interaction producing HTML artifacts, screenshots, comparisons, and JSON results.
Design-tree strip logic & tests
src/core/engine/design-tree-strip.ts, src/core/engine/design-tree-strip.test.ts
Refactored strip type taxonomy: added ALL_STRIP_TYPES, narrowed DesignTreeInfoType, split token-ref removals into variable-references and style-references, updated STRIP_FUNCTIONS typing and tests to match new behavior and expectations.
Docs & prompt
CLAUDE.md, .claude/skills/design-to-code/PROMPT.md
Documentation: new "Ablation experiments" section and prompt adjustments requiring a global CSS reset and a single root <div> wrapper inserted as first child of <body>; formatting tweak to output instructions.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Experiment Runner
    participant Helpers as helpers.ts
    participant API as Anthropic API
    participant Renderer as Render Engine
    participant Comparator as Screenshot Comparator

    Runner->>Helpers: parseFixtures(), build/strip design trees
    Runner->>Helpers: callApi(prompt, designTree)
    Helpers->>API: POST /messages (streaming)
    API-->>Helpers: streamed Message
    Helpers->>Helpers: extractHtml() → sanitizeHtml() → injectLocalFont()
    Helpers-->>Runner: processed HTML
    Runner->>Renderer: render HTML -> PNG
    Renderer-->>Runner: rendered PNG
    Runner->>Helpers: renderAndCompare(renderedPNG, figmaPNG)
    Helpers->>Comparator: compare images
    Comparator-->>Helpers: similarity score
    Helpers-->>Runner: similarity & metrics
    Runner->>Runner: write result.json & artifacts
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Possibly related PRs

Poem

🐰
I hop through trees of code and light,
I strip, compare, then sleep at night;
From HTML to pixel score,
I tally deltas, then explore —
Tiny rabbit, metrics bright!

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main purpose of the PR—a redesigned ablation experiment system with multi-metric measurement (similarity, tokens, HTML/CSS metrics), which is the central change across the modified and new files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/ablation-v2

Comment @coderabbitai help to get the list of available commands and usage tips.

let-sunny and others added 11 commits March 28, 2026 03:45
…ial types

Strip type changes:
- design-token-references → variable-references + style-references
  (variable = /* var:... */, style = /* text-style:... */)
- DesignTreeInfoType now has 6 experiment-relevant types only
- DesignTreeStripType retains all 13 for utility use
- Removed trivially obvious types from experiment list:
  color-values, typography, shadows-effects, overflow-text-behavior,
  hover-interaction-states, position-stacking, component-descriptions

Experiment types (6):
  layout-direction-spacing, size-constraints, component-references,
  node-names-hierarchy, variable-references, style-references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip types:
- Split design-token-references → variable-references + style-references
- DesignTreeInfoType: 6 experiment types (layout, size, component, naming, variable, style)
- DesignTreeStripType: all 13 retained for utility
- Pruned trivially obvious types from experiments

run-phase1.ts rewrite:
- Cache versioning: logs/ablation/phase1/{config-version}/... (never deleted)
- run-phase1.ts excluded from config hash (parsing changes don't affect similarity)
- Multi-metric: pixel, input/output tokens, HTML bytes/lines, CSS classes, CSS variables
- ABLATION_TYPES env var for running specific types
- Removed interpretations (unreliable parsing)
- Rankings show all 5 delta metrics

rule-config.ts:
- TODO comment: rename "behavior" → "interaction"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…strip

run-condition.ts: separate script for experiments that change
viewport or prompt instead of stripping data.

- size-constraints: reuse baseline HTML → render at 1920px → compare
  against screenshot-1920.png (no API call needed)
- hover-interaction: strip [hover]: data → new API call → compare
  hover CSS count with vs without data

size-constraints removed from DesignTreeInfoType (ΔV=0% at same viewport).

Strip experiments (5): layout, component, naming, variable, style
Condition experiments (2): size at 1920px, hover with/without data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

size-constraints:
- Strip via stripDesignTree("size-constraints")
- API call for both baseline + stripped
- Remove root fixed width from generated HTML
- Render at expanded viewport (1920px desktop, 768px mobile)
- Compare both against expanded screenshot

hover-interaction:
- Strip [hover]: via stripDesignTree("hover-interaction-states")
- API call for both with + without hover data
- Extract :hover CSS rules and values for comparison
- Compare whether AI invents hover or follows data exactly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents overflow: hidden on body (blocks scrolling) and ensures
consistent root element structure across runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
helpers.ts: shared utilities for both scripts
- API call with retry, HTML parsing, sanitization, font injection
- CSS metrics (class/variable count)
- Render + compare + crop pipeline
- Fixture validation, screenshot paths

run-phase1.ts: 677 → 296 lines (strip experiments only)
run-condition.ts: 389 → 184 lines (size + hover condition experiments)
Total: 1066 → 707 lines (34% reduction)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document run-phase1.ts, run-condition.ts, and helpers.ts
with usage examples and test flow for next session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip all strip types and run baseline only — useful for testing
rendering pipeline before committing to full experiment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Figma calculates all dimensions with border-box model.
Without this reset, padding adds to element width causing layout drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
logs/ is gitignored so results are lost across machines.
data/ is tracked, enabling cloud execution and result sharing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple cloud agents can split work by fixture/type.
Same config-version shares one output directory.
Warns about fixture+type+run conflict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@let-sunny let-sunny marked this pull request as ready for review March 27, 2026 23:17
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agents/ablation/helpers.ts`:
- Around line 128-152: The retry backoff in callApi uses a fixed exponential
delay (delay = Math.pow(2, attempt + 1) * 1000) which can cause synchronized
retries; modify the delay calculation to add random jitter before awaiting so
retries are randomized (e.g., compute baseDelay using the existing Math.pow
formula then apply a jitter factor or add/subtract a small random amount), keep
MAX_RETRIES and the 429/529 conditional unchanged, and use the jittered delay in
the setTimeout await to reduce thundering-herd effects.

In `@src/agents/ablation/run-condition.ts`:
- Around line 128-131: The current hover-rule extraction using
baseHtml.match(...) and stripHtml.match(...) can return false positives from
inside comments or string literals; before applying the existing regex (the
lines that set baseHoverRules and stripHoverRules) sanitize baseHtml and
stripHtml by stripping CSS comments (/* ... */), HTML comments (<!-- ... -->)
and quoted string contents (single and double quotes) so the regex only runs
over real CSS rules; update the code that computes
baseHoverRules/stripHoverRules to operate on these sanitized copies (e.g.,
sanitizedBaseHtml/sanitizedStripHtml) while keeping the same regex to preserve
behavior.

In `@src/agents/ablation/run-phase1.ts`:
- Around line 229-231: requestedTypes is being unsafely cast from the
ABLATION_TYPES env var; instead validate each split value against the canonical
DESIGN_TREE_INFO_TYPES array/Set before using it (used later by
stripDesignTree). Parse process.env["ABLATION_TYPES"] into strings, filter them
to only include entries present in DESIGN_TREE_INFO_TYPES (or map to the exact
enum/type names), and if the resulting array is empty treat requestedTypes as
null; update the variable initialization where requestedTypes is defined to
perform this validation and filtering.

In `@src/core/engine/design-tree-strip.test.ts`:
- Around line 4-10: The test duplicates the list of DesignTreeStripType values
causing potential drift; export the canonical array from the source module
(e.g., add an exported constant ALL_STRIP_TYPES: readonly DesignTreeStripType[]
in design-tree-strip.ts) and update the test (design-tree-strip.test.ts) to
import that exported ALL_STRIP_TYPES and use it instead of the inline array;
ensure the exported symbol name matches the test import and that the type
DesignTreeStripType is used for the constant.

In `@src/core/rules/rule-config.ts`:
- Line 34: Remove the inline TODO comment ("// TODO: rename \"behavior\" →
\"interaction\" (hover states, prototype links, component state changes)") from
src/core/rules/rule-config.ts to avoid churn from the nightly calibration
pipeline, and instead create a tracked note (issue, PR description, or CLAUDE.md
entry) that documents the planned rename of the "behavior" property to
"interaction" along with the migration details so the change is tracked outside
the auto-updated config file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 400f94ae-2a73-4749-a2bb-d0a0d70141ce

📥 Commits

Reviewing files that changed from the base of the PR and between cb43ca3 and 8da1849.

📒 Files selected for processing (8)
  • .claude/skills/design-to-code/PROMPT.md
  • CLAUDE.md
  • src/agents/ablation/helpers.ts
  • src/agents/ablation/run-condition.ts
  • src/agents/ablation/run-phase1.ts
  • src/core/engine/design-tree-strip.test.ts
  • src/core/engine/design-tree-strip.ts
  • src/core/rules/rule-config.ts

Comment on lines +128 to +152
export async function callApi(client: Anthropic, prompt: string, designTree: string): Promise<Anthropic.Message> {
const MAX_RETRIES = 3;
for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
try {
const stream = client.messages.stream({
model: MODEL,
max_tokens: MAX_TOKENS,
temperature: TEMPERATURE,
system: prompt,
messages: [{ role: "user", content: designTree }],
});
return await stream.finalMessage();
} catch (err) {
const status = (err as { status?: number }).status;
if ((status === 429 || status === 529) && attempt < MAX_RETRIES) {
const delay = Math.pow(2, attempt + 1) * 1000;
console.warn(` ⚠ ${status} error, retrying in ${delay / 1000}s (${attempt + 1}/${MAX_RETRIES})...`);
await new Promise((r) => setTimeout(r, delay));
continue;
}
throw err;
}
}
throw new Error("API call failed after retries");
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

API retry logic is sound but consider jitter.

The exponential backoff (2s, 4s, 8s) for 429/529 errors is reasonable. For production-grade reliability, adding random jitter would help avoid thundering herd when multiple agents hit rate limits simultaneously.

💡 Optional: Add jitter to retry delay
       if ((status === 429 || status === 529) && attempt < MAX_RETRIES) {
-        const delay = Math.pow(2, attempt + 1) * 1000;
+        const baseDelay = Math.pow(2, attempt + 1) * 1000;
+        const jitter = Math.random() * 1000; // 0-1s random jitter
+        const delay = baseDelay + jitter;
         console.warn(`    ⚠ ${status} error, retrying in ${delay / 1000}s (${attempt + 1}/${MAX_RETRIES})...`);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agents/ablation/helpers.ts` around lines 128 - 152, The retry backoff in
callApi uses a fixed exponential delay (delay = Math.pow(2, attempt + 1) * 1000)
which can cause synchronized retries; modify the delay calculation to add random
jitter before awaiting so retries are randomized (e.g., compute baseDelay using
the existing Math.pow formula then apply a jitter factor or add/subtract a small
random amount), keep MAX_RETRIES and the 429/529 conditional unchanged, and use
the jittered delay in the setTimeout await to reduce thundering-herd effects.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — this is an internal experiment script. Per CLAUDE.md, agents never share the same fixture+type+run, so concurrent API calls to the same endpoint don't happen. Thundering herd is structurally impossible here.


Generated by Claude Code

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@let-sunny, that makes sense — since CLAUDE.md enforces non-overlapping fixture+type+run assignments across agents, there's no shared rate-limit surface and jitter would add noise without benefit. Happy to drop this suggestion.

(¬_¬ )


✏️ Learnings added
Learnt from: let-sunny
Repo: let-sunny/canicode PR: 147
File: src/agents/ablation/helpers.ts:128-152
Timestamp: 2026-03-27T23:46:02.993Z
Learning: In `src/agents/ablation/helpers.ts`, the `callApi` retry logic intentionally omits jitter because CLAUDE.md enforces that parallel agents never share the same fixture+type+run assignment. Concurrent API calls to the same endpoint are structurally impossible, so thundering-herd mitigation is unnecessary. Do NOT flag the fixed exponential backoff as a deficiency.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: let-sunny
Repo: let-sunny/canicode PR: 59
File: src/cli/commands/internal/fixture-management.ts:121-129
Timestamp: 2026-03-25T11:36:48.861Z
Learning: In `src/cli/commands/internal/fixture-management.ts` (and related internal calibration commands), using `console.log` + exit code `0` for missing run directories or null results from `parseDebateResult` is intentional by design. These are internal commands consumed by subagents that check stdout text output rather than exit codes; switching to `console.error` + `process.exit(1)` would break the calibration pipeline.

Learnt from: let-sunny
Repo: let-sunny/canicode PR: 59
File: src/cli/commands/prompt.ts:11-16
Timestamp: 2026-03-25T11:37:22.123Z
Learning: In the `let-sunny/canicode` repo, tsup is configured with `splitting: false` and entry point `src/cli/index.ts`, so the entire CLI bundles into a single `dist/cli/index.js`. At runtime, `import.meta.url` always resolves from `dist/cli/`, making `../../` point to the project root. Path lookups using `resolveFn(__dirname, "../../.claude/skills/...")` in `src/cli/commands/prompt.ts` are intentional and correct.

Learnt from: let-sunny
Repo: let-sunny/canicode PR: 59
File: src/cli/helpers.ts:27-30
Timestamp: 2026-03-25T11:36:59.928Z
Learning: In `src/cli/helpers.ts`, `pickRandomScope` intentionally uses `Math.random()` to return a nondeterministic subtree. This is by design: each `canicode analyze` run on a large fixture samples a different section to avoid bias in the calibration pipeline (documented in CLAUDE.md). Do NOT flag this as a nondeterminism/flakiness issue.

Comment on lines +128 to +131
// Extract :hover rules
const baseHoverRules = baseHtml.match(/[^}]*:hover\s*\{[^}]*\}/g) ?? [];
const stripHoverRules = stripHtml.match(/[^}]*:hover\s*\{[^}]*\}/g) ?? [];

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Hover CSS regex may capture false positives.

The regex [^}]*:hover\s*\{[^}]*\} could match :hover patterns inside CSS comments or string literals, though this is unlikely in generated HTML. For experiment purposes this is acceptable, but be aware the counts may occasionally be inflated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agents/ablation/run-condition.ts` around lines 128 - 131, The current
hover-rule extraction using baseHtml.match(...) and stripHtml.match(...) can
return false positives from inside comments or string literals; before applying
the existing regex (the lines that set baseHoverRules and stripHoverRules)
sanitize baseHtml and stripHtml by stripping CSS comments (/* ... */), HTML
comments (<!-- ... -->) and quoted string contents (single and double quotes) so
the regex only runs over real CSS rules; update the code that computes
baseHoverRules/stripHoverRules to operate on these sanitized copies (e.g.,
sanitizedBaseHtml/sanitizedStripHtml) while keeping the same regex to preserve
behavior.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — the HTML is AI-generated output from Claude API. CSS comments containing :hover patterns are practically impossible. Additionally, this is a relative comparison (with vs without hover data) using the same regex, so any theoretical false positives would cancel out on both sides.


Generated by Claude Code

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@let-sunny, that's a fair point — the controlled AI-generated source and the symmetric relative comparison both effectively neutralize the concern. Acknowledged!

( ˘︹˘ )


🧠 Learnings used
Learnt from: let-sunny
Repo: let-sunny/canicode PR: 59
File: src/cli/commands/internal/fixture-management.ts:121-129
Timestamp: 2026-03-25T11:36:48.861Z
Learning: In `src/cli/commands/internal/fixture-management.ts` (and related internal calibration commands), using `console.log` + exit code `0` for missing run directories or null results from `parseDebateResult` is intentional by design. These are internal commands consumed by subagents that check stdout text output rather than exit codes; switching to `console.error` + `process.exit(1)` would break the calibration pipeline.

Learnt from: let-sunny
Repo: let-sunny/canicode PR: 93
File: src/core/rules/rule-config.ts:76-80
Timestamp: 2026-03-26T01:28:57.785Z
Learning: In the let-sunny/canicode repo, `src/core/rules/rule-config.ts` is automatically adjusted by a nightly calibration pipeline. Do NOT suggest adding inline comments to this file for calibration rationale — the change evidence is tracked in PR descriptions, commit messages, and `data/calibration-evidence.json` instead. Inline comments would create clutter as the file is frequently auto-modified.

claude added 2 commits March 27, 2026 23:58
- Add jitter to API retry backoff to avoid thundering herd
- Strip CSS/HTML comments before hover rule regex matching
- Validate ABLATION_TYPES env var against DESIGN_TREE_INFO_TYPES
- Export ALL_STRIP_TYPES from source to prevent test drift
- Remove inline TODO from auto-calibrated rule-config.ts

https://claude.ai/code/session_01TrxHjQfDtbH8QptGwNRdHB
…nges

Reverted:
- Jitter on retry: internal script runs serially, no thundering herd risk
- Hover regex comment stripping: AI-generated HTML has no CSS comments

Kept: ABLATION_TYPES validation, ALL_STRIP_TYPES export, TODO removal

https://claude.ai/code/session_01TrxHjQfDtbH8QptGwNRdHB
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 165-198: The current aggregation drops fixture/type pairs that
were skipped, biasing averages; update the loop that builds deltas (the for
(const r of results) block that writes into deltas and the similar logic
referenced at lines 266-272) so that when a fixture/type pair is intentionally
skipped you still account for it by pushing zero deltas into the arrays
(d.dV.push(0), d.dOT.push(0), d.dH.push(0), d.dC.push(0), d.dVa.push(0)) or
alternatively maintain an explicit per-type coverage count and use that total as
the denominator when computing avgDeltaV, avgDeltaOutputTokens,
avgDeltaHtmlBytes, avgDeltaCssClasses and avgDeltaCssVariables in the rankings
construction (the code that computes sV/sOT/sH/sC/sVa, avg(...), and fills
RankingEntry.perFixture); pick one approach and apply it consistently in both
the initial aggregation and the averaging pass so skipped no-ops contribute Δ=0
rather than being omitted.
- Around line 4-13: Update the top docblock in run-phase1.ts to match the actual
CLI contract: change the output path from logs/ablation/... to
data/ablation/..., mention the supported env vars ABLATION_BASELINE_ONLY and
that ABLATION_TYPES can specify a subset (instead of implying all five strip
types always run), and note that runs may be limited by ABLATION_RUNS; ensure
the header accurately reflects these environment variables and the actual output
directory structure used by the implementation.
- Around line 35-48: computeConfigVersion currently fingerprints only a few
engine files so cached runs can be reused incorrectly; update
computeConfigVersion to also include the prompt contents (where prompt is
defined/loaded), the helpers/schema version (e.g.,
src/agents/ablation/helpers.ts and any schema/version constants), and
per-fixture inputs (fixture JSON content, screenshot asset bytes/hashes and
prior result.json shape/version) when building the SHA256 hash so CONFIG_VERSION
changes when any of those change; also harden isCacheValid to check the cached
summary/result contains required fields and versions (not just that similarity
is numeric) by validating presence of new fingerprint fields and expected
schema/version keys before accepting cache.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a2b1bff0-175e-40b9-ac67-23c768f38c6b

📥 Commits

Reviewing files that changed from the base of the PR and between 8da1849 and d39cdb6.

📒 Files selected for processing (3)
  • src/agents/ablation/run-phase1.ts
  • src/core/engine/design-tree-strip.test.ts
  • src/core/engine/design-tree-strip.ts

- Docblock: fix output path (data/ not logs/), add ABLATION_BASELINE_ONLY
- Cache: include prompt and helpers.ts in config version hash
- Rankings: add fixtureCount (N) column to show coverage per strip type

https://claude.ai/code/session_01TrxHjQfDtbH8QptGwNRdHB
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 271-272: Current no-op detection uses string equality after
stripDesignTree (stripDesignTree(baselineTree, t) === baselineTree), which is
fragile; instead detect no-op by comparing the arrays before join or exposing a
change flag from stripDesignTree. Update the logic around typesToRun/skipTypes
to call stripDesignTree in a way that returns the stripped lines array (or a
boolean "changed") and then consider it a no-op if every element equals the
original lines (e.g., compare lengths and each line), or modify stripDesignTree
to return {lines, changed} and use that changed flag to populate skipTypes;
reference stripDesignTree, baselineTree, typesToRun, and skipTypes when making
the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 313c7ae5-ad17-4114-8026-9a0a10ce1c64

📥 Commits

Reviewing files that changed from the base of the PR and between d39cdb6 and c1c9734.

📒 Files selected for processing (1)
  • src/agents/ablation/run-phase1.ts

Comment on lines +271 to +272
const skipTypes = new Set(typesToRun.filter((t) => stripDesignTree(baselineTree, t) === baselineTree));
if (skipTypes.size > 0) console.log(` Skipping no-op: ${[...skipTypes].join(", ")}`);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify stripDesignTree returns the same reference for no-op strips
rg -nA20 "export function stripDesignTree" --type=ts
# Also check if there's any early return for no-op cases
ast-grep --pattern $'function stripDesignTree($_, $_) {
  $$$
  return $TREE;
  $$$
}'

Repository: let-sunny/canicode

Length of output: 464


🏁 Script executed:

#!/bin/bash
# Check the context around lines 271-272 in run-phase1.ts
sed -n '260,290p' src/agents/ablation/run-phase1.ts

# Look for STRIP_FUNCTIONS to understand what the stripping functions do
rg -B5 -A10 "STRIP_FUNCTIONS\[" src/core/engine/design-tree-strip.ts | head -40

# Check if there are other patterns in the codebase for detecting no-op changes
rg "===" src/agents/ablation/run-phase1.ts

Repository: let-sunny/canicode

Length of output: 2642


🏁 Script executed:

#!/bin/bash
# Get STRIP_FUNCTIONS definitions and their implementations
rg -B2 -A30 "const STRIP_FUNCTIONS" src/core/engine/design-tree-strip.ts

# Check if there are any cases where STRIP_FUNCTIONS returns the same lines unchanged
rg -B5 -A15 "STRIP_FUNCTIONS\s*=" src/core/engine/design-tree-strip.ts

Repository: let-sunny/canicode

Length of output: 1072


No-op detection relies on fragile string equality.

The check stripDesignTree(baselineTree, t) === baselineTree attempts to detect when stripping has no effect by comparing string equality. However, even for identity strip functions like "position-stacking" and "component-descriptions" which return (lines) => lines, stripDesignTree always creates a new string via split() and join(). This comparison depends on JavaScript string interning behavior, which is implementation-dependent and fragile.

A more robust approach would be to track whether the stripped array differs from the input before calling join(), or use an explicit content-aware comparison. This makes the no-op detection clearer and less reliant on internal string interning semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agents/ablation/run-phase1.ts` around lines 271 - 272, Current no-op
detection uses string equality after stripDesignTree
(stripDesignTree(baselineTree, t) === baselineTree), which is fragile; instead
detect no-op by comparing the arrays before join or exposing a change flag from
stripDesignTree. Update the logic around typesToRun/skipTypes to call
stripDesignTree in a way that returns the stripped lines array (or a boolean
"changed") and then consider it a no-op if every element equals the original
lines (e.g., compare lengths and each line), or modify stripDesignTree to return
{lines, changed} and use that changed flag to populate skipTypes; reference
stripDesignTree, baselineTree, typesToRun, and skipTypes when making the change.

Copy link
Copy Markdown
Owner Author

Re: "No-op detection relies on fragile string equality" (line 272)

Won't fix — the premise is incorrect. JavaScript string === compares by value, not by reference or interning. "abc\ndef".split("\n").join("\n") === "abc\ndef" is always true per the ECMAScript spec. There is nothing fragile or implementation-dependent about this.

Adding a {lines, changed} return type to stripDesignTree would break the API for all callers with zero benefit.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants