feat: add ablation Phase 1 execution script#143
Conversation
Automated script for targeted ablation experiment: - 3 fixtures × (1 baseline + 10 strip types) = 33 API calls - Calls Claude API with PROMPT.md + design-tree (baseline or stripped) - Renders HTML via Playwright, compares vs Figma screenshot - Records ΔV (similarity), ΔI (interpretations), ΔT (tokens) - Resume support: skips completed runs on restart - Outputs rankings sorted by avg ΔV Usage: ANTHROPIC_API_KEY=sk-... npx tsx src/agents/ablation/run-phase1.ts Output: logs/ablation/phase1/ (gitignored) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Important Review skippedReview was skipped due to path filters ⛔ Files ignored due to path filters (272)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdded a new Ablation Phase 1 CLI runner that validates fixtures and PROMPT, generates baseline and stripped design-trees, runs cached or fresh Anthropic calls per condition/run, parses/sanitizes model HTML and interpretations, renders and compares screenshots, computes per-type ΔV/ΔI/ΔT rankings, and writes artifacts and a summary. Changes
Sequence DiagramsequenceDiagram
participant Runner as Phase1 Runner
participant FS as File System
participant Cache as Cache Layer
participant API as Anthropic API
participant Parser as Response Parser
participant Renderer as HTML Renderer
participant Comparator as Image Comparator
participant Aggregator as Result Aggregator
Runner->>FS: Load fixtures & PROMPT.md
Runner->>Runner: Build baseline design-tree
Runner->>Runner: Identify no-op strip types
loop For each fixture/type/run
Runner->>Cache: Check cache (model/temp/tokens/promptHash/config/fixtureHash)
alt Cache Hit
Cache-->>Runner: Return cached result + artifacts
else Cache Miss
Runner->>API: Send prompt + design-tree
API-->>Runner: Response + token usage
Runner->>Parser: Extract HTML & interpretations
Parser-->>Runner: Parsed HTML + interpretations (or parse-fail)
Runner->>Renderer: Render HTML -> code.png
Renderer-->>Runner: code.png
Runner->>FS: Copy GT screenshot -> figma.png
Runner->>Comparator: Compare code.png vs figma.png
Comparator-->>Runner: diff.png + similarity
Runner->>FS: Save result.json and artifacts
end
end
Runner->>Aggregator: Pair baseline vs ablated runs
Aggregator->>Aggregator: Compute ΔV, ΔI, ΔT per type
Aggregator-->>Runner: Rankings + summary
Runner->>FS: Write logs/ablation/phase1/summary.json
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
High: - Cache invalidation: store cacheKey (model + promptHash + configVersion) in result.json, invalidate on mismatch - Multi-run support: ABLATION_RUNS env var (default 1 for Phase 1, set 3 for Phase 2), rankings average across runs Medium: - ΔI parsing: 3 fallback patterns for interpretations extraction, warnings on parse failure instead of silent 0 - Fixture configurability: ABLATION_FIXTURES env var overrides defaults Low: - Auto-detect no-op strips: isStripNoOp() compares before/after, no more hardcoded SKIP_TYPES Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Medium: - cacheKey includes temperature + maxTokens (experiment condition changes now invalidate cache) - ABLATION_RUNS validation: reject NaN, 0, negative values with error Low: - HTML parsing: 4 fallback patterns (```html, <!DOCTYPE, <html, // filename:) - Cost output: separate session vs cached (newResults tracking) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blocking: - runIndex-paired Δ calculation: baseline[fixture:runIndex] vs ablation[fixture:runIndex] → removes covariance, reduces variance - Cache validation: check all required artifacts exist (output.html, code.png, figma.png, diff.png) not just result.json Medium: - interpretationsCount = -1 on parse failure (not silent 0) ΔI calculation skips pairs where either side failed - HTML parsing: pick LARGEST code block (>100 chars, contains <) instead of first match, avoids grabbing small CSS snippets Low: - ABLATION_FIXTURES: filter empty strings from comma-split Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…o config version Medium: - RunResult: add interpretationsParseFailed + parseWarnings fields for post-hoc failure analysis in summary.json - HTML block selection: prioritize <!doctype/<html > <body > largest (prevents grabbing CSS/JS snippets over actual HTML) Low: - CONFIG_VERSION: auto-computed from sha256 of core source files (design-tree-strip.ts, visual-compare-helpers.ts) — no manual bump needed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add design-tree.ts and visual-compare.ts to the hash input alongside design-tree-strip.ts and visual-compare-helpers.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ABLATION_RUNS: regex /^\d+$/ check rejects "3abc", spaces, floats - HTML block threshold: 100 → 50 chars for small fixture robustness - Empty fixture list: explicit error instead of silent empty summary Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mary - Console: interp=PARSE_FAIL instead of interp=0 when parsing failed - summary.json: skippedFixtures array with fixture name + reason - Phase1Summary type: add skippedFixtures field Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per-run context in result.json:
- screenshotPath, logicalViewport, exportScale
- htmlExtractMethod (doctype/body/largest/none)
- designTreeTokens (input size)
Summary-level metadata:
- cacheStats: { hits, misses }
- parseFailureCount
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Config version hash now includes the script itself (HTML parsing, interpretation extraction, ranking logic changes invalidate cache) - cacheStats.misses → cacheStats.newCalls (accurate semantics) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Line 423: The current comparator used in rankings.sort only compares avgDeltaV
and leaves ties to insertion order; update the comparator used in rankings.sort
to implement the documented tie-breakers by comparing avgDeltaV (desc), then
avgDeltaI (desc) if avgDeltaV is equal, and finally avgDeltaT (desc) if
avgDeltaI is also equal, ensuring you handle undefined/null values consistently
so tied entries are deterministically ordered by avgDeltaI then avgDeltaT.
- Line 361: The log and formatter multiply comparison.similarity by 100 even
though compareScreenshots() already returns a 0–100 percentage; remove the extra
"* 100" so values are displayed correctly (e.g., change occurrences in the
console.log at run-phase1.ts that reference comparison.similarity and any other
places in the same file such as the similar expressions around lines where
similarity is used for printing or computing ΔV—refer to the console.log at the
shown diff and the analogous usages later in the file) so that
comparison.similarity is used directly and then formatted with toFixed as
needed.
- Around line 152-153: Add proper Zod validation for cache inputs: create Zod
schemas in src/agents/contracts/ (e.g., RunResultSchema and CacheKeySchema) and
export types via z.infer; then replace direct JSON parsing/casts in isCacheValid
and the cache-loading loop (where JSON.parse(... ) as RunResult is used) to
first safeParse the file contents with RunResultSchema (and validate the
cacheKey with CacheKeySchema), reject or skip files on validation failure, and
use the validated typed result for all numeric operations (similarity,
inputTokens, outputTokens, totalTokens) so downstream math uses guaranteed
numbers rather than potentially undefined/NaN values.
- Around line 532-545: Compute the design tree before checking cache and include
an input hash of that tree plus the fixture reference screenshot in the cache
identity: move the tree derivation (tree = type === "baseline" ? baselineTree :
stripDesignTree(baselineTree, type)) to before calling isCacheValid, implement a
helper (e.g., computeInputHash(tree, fixture) or hashTreeAndScreenshot) that
hashes the serialized tree plus the fixture's ground-truth image/metadata, and
use that hash when calling isCacheValid(...) and when constructing cacheKey
passed into runSingle(...); update getResultPath / cache writing to persist this
input hash so cache validation will detect changes to fixture data, images, or
generated trees.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 167e4aa1-a2dd-448d-805d-495afa6fdbdf
📒 Files selected for processing (1)
src/agents/ablation/run-phase1.ts
src/agents/ablation/run-phase1.ts
Outdated
| try { | ||
| const result = JSON.parse(readFileSync(resultPath, "utf-8")) as RunResult; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the file structure and find the relevant code
git ls-files | grep -E "(run-phase1|contracts)" | head -20Repository: let-sunny/canicode
Length of output: 624
🏁 Script executed:
# Check if the file exists and get line count
wc -l src/agents/ablation/run-phase1.tsRepository: let-sunny/canicode
Length of output: 99
🏁 Script executed:
# Read the specific lines mentioned in the review (152-153 and 536-539)
cat -n src/agents/ablation/run-phase1.ts | sed -n '140,165p'Repository: let-sunny/canicode
Length of output: 1230
🏁 Script executed:
# Also check lines around 536-539
cat -n src/agents/ablation/run-phase1.ts | sed -n '525,550p'Repository: let-sunny/canicode
Length of output: 1312
🏁 Script executed:
# Search for RunResult type definition
rg "RunResult" --type ts --type tsx -B 2 -A 5Repository: let-sunny/canicode
Length of output: 89
🏁 Script executed:
# Check if there's a Zod schema for RunResult in contracts directory
fd . src/contracts --type fRepository: let-sunny/canicode
Length of output: 162
🏁 Script executed:
# Search for RunResult type definition
rg "type RunResult|interface RunResult|RunResult =" --type ts -B 2 -A 5Repository: let-sunny/canicode
Length of output: 1085
🏁 Script executed:
# Check imports at the beginning of run-phase1.ts
head -50 src/agents/ablation/run-phase1.tsRepository: let-sunny/canicode
Length of output: 1902
🏁 Script executed:
# List all files in contracts directories
find src/agents/contracts src/core/contracts -type f -name "*.ts"Repository: let-sunny/canicode
Length of output: 590
🏁 Script executed:
# Search for Zod schema definitions related to results
rg "z\.object\|z\.infer" --type ts | grep -i resultRepository: let-sunny/canicode
Length of output: 44
🏁 Script executed:
# Get the full RunResult interface definition
rg "interface RunResult" -A 20 --type tsRepository: let-sunny/canicode
Length of output: 1322
🏁 Script executed:
# Check if there's a CacheKey interface
rg "interface CacheKey|type CacheKey" -A 10 --type tsRepository: let-sunny/canicode
Length of output: 649
🏁 Script executed:
# See if Zod is used elsewhere in the ablation code
rg "import.*zod|from.*zod" src/agents/ablation --type tsRepository: let-sunny/canicode
Length of output: 44
🏁 Script executed:
# Check the CLAUDE.md to see exact coding guidelines
cat CLAUDE.md | grep -A 10 "Zod"Repository: let-sunny/canicode
Length of output: 962
🏁 Script executed:
# Check if CLAUDE.md applies to agents directory
grep -A 5 "src/agents" CLAUDE.mdRepository: let-sunny/canicode
Length of output: 44
🏁 Script executed:
# See how results are used downstream
rg "cached\.|\.similarity|\.inputTokens|\.outputTokens" src/agents/ablation/run-phase1.ts -B 2 -A 2Repository: let-sunny/canicode
Length of output: 1420
🏁 Script executed:
# Check if any Zod schema pattern exists elsewhere in agents for reference
rg "z\.object\|ZodSchema" src/agents --type ts -lRepository: let-sunny/canicode
Length of output: 44
Add Zod schema validation for RunResult and CacheKey before parsing cached JSON.
File reads from disk are external inputs and must be validated with Zod schemas per project guidelines. Both line 153 (in isCacheValid) and line 536 (cache loading loop) parse JSON directly and cast to RunResult without validation. A cache file with matching cacheKey but missing or malformed numeric fields (similarity, inputTokens, outputTokens, totalTokens) will pass type narrowing and feed undefined/NaN into downstream calculations:
- Line 539:
cached.similarity * 100 - Delta calculations:
baseline.similarity - r.similarity - Cost estimation: token summation
Define Zod schemas for RunResult and CacheKey in src/agents/contracts/ and parse with validation instead of type assertion. Infer TypeScript types from the schemas using z.infer<typeof Schema>.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agents/ablation/run-phase1.ts` around lines 152 - 153, Add proper Zod
validation for cache inputs: create Zod schemas in src/agents/contracts/ (e.g.,
RunResultSchema and CacheKeySchema) and export types via z.infer; then replace
direct JSON parsing/casts in isCacheValid and the cache-loading loop (where
JSON.parse(... ) as RunResult is used) to first safeParse the file contents with
RunResultSchema (and validate the cacheKey with CacheKeySchema), reject or skip
files on validation failure, and use the validated typed result for all numeric
operations (similarity, inputTokens, outputTokens, totalTokens) so downstream
math uses guaranteed numbers rather than potentially undefined/NaN values.
src/agents/ablation/run-phase1.ts
Outdated
| for (const type of conditions) { | ||
| for (let run = 0; run < runsPerCondition; run++) { | ||
| // Check cache with key + artifact validation | ||
| if (isCacheValid(fixture, type, run, cacheKey)) { | ||
| const cached = JSON.parse(readFileSync(getResultPath(fixture, type, run), "utf-8")) as RunResult; | ||
| allResults.push(cached); | ||
| cacheHits++; | ||
| console.log(` [cached] ${type} run ${run + 1} → similarity=${(cached.similarity * 100).toFixed(1)}%`); | ||
| continue; | ||
| } | ||
|
|
||
| console.log(` [${type}] run ${run + 1}/${runsPerCondition}`); | ||
| const tree = type === "baseline" ? baselineTree : stripDesignTree(baselineTree, type); | ||
| const result = await runSingle(client, prompt, fixture, type, tree, run, cacheKey); |
There was a problem hiding this comment.
Fold the current fixture inputs into cache validation.
This loop checks isCacheValid() before it derives the current baseline/stripped tree, and the cache identity never changes when fixtures/<name>/data.json, fixture images, or the reference screenshot change. After any fixture/export update, the run will silently reuse stale similarity/token metrics and skew the ablation ranking. Please compute the current tree first and include an input hash for the generated design-tree plus the ground-truth screenshot in the cache key.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agents/ablation/run-phase1.ts` around lines 532 - 545, Compute the design
tree before checking cache and include an input hash of that tree plus the
fixture reference screenshot in the cache identity: move the tree derivation
(tree = type === "baseline" ? baselineTree : stripDesignTree(baselineTree,
type)) to before calling isCacheValid, implement a helper (e.g.,
computeInputHash(tree, fixture) or hashTreeAndScreenshot) that hashes the
serialized tree plus the fixture's ground-truth image/metadata, and use that
hash when calling isCacheValid(...) and when constructing cacheKey passed into
runSingle(...); update getResultPath / cache writing to persist this input hash
so cache validation will detect changes to fixture data, images, or generated
trees.
…t validation - Remove ×100 on similarity display (compareScreenshots returns 0-100) - Ranking sort: primary=ΔV → tie-break=ΔI → ΔT - Cache key includes fixtureHash (data.json + screenshot sha256) so fixture updates invalidate cached results - isValidRunResult: validate numeric fields before using cached data - computeFixtureHash: per-fixture input fingerprint Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 retries with 2s/4s/8s delays for rate limit (429) and overloaded (529) errors. Other errors still fail immediately. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Download Inter variable font (352KB) to assets/fonts/. Script replaces Google Fonts <link> tags with local @font-face to eliminate CDN dependency and prevent Playwright timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/agents/ablation/run-phase1.ts (2)
154-193:⚠️ Potential issue | 🟠 MajorValidate cached results once with a full schema, then reuse that parsed object.
isValidRunResult()only guards a subset of fields, and the cache-hit branch reparsesresult.jsonand trusts it asRunResult. A corruptedtype,runIndex, orinterpretationsCountcan still skew rankings andsummary.json. Replace the boolean check with a single Zod-backed loader that safeParses the full payload and returns the validated object for reuse. As per coding guidelines, "Use Zod schemas for external input validation (schemas in contracts/ directory)" and "Infer TypeScript types from Zod schemas usingz.infer<typeof Schema>."Also applies to: 572-573
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 154 - 193, Replace the fragile isValidRunResult boolean guard and the reparsing in isCacheValid with a Zod-backed loader: create/use the full RunResult Zod schema from contracts/ (infer type with z.infer<typeof RunResultSchema>), implement a function (e.g., loadValidatedRunResult(resultPath): RunResult | null) that reads result.json, calls RunResultSchema.safeParse, and returns the parsed/validated object or null on failure; then update isCacheValid to call loadValidatedRunResult once, reuse the returned validated RunResult (instead of reparsing and using isValidRunResult), and perform the cacheKey checks against that validated object while still verifying REQUIRED_ARTIFACTS and existence of result.json.
145-151:⚠️ Potential issue | 🟠 MajorBuild the cache identity from the actual run inputs.
computeFixtureHash()only seesdata.jsonand the Figma screenshot, but this run also depends on the generatedtreeplus fixtureimages/andvectors/. Updating those inputs can still hit a stale cache and reuse old similarity/token data. Move cache-key construction inside the condition loop and hash the per-typetreetogether with the relevant asset bytes.Also applies to: 271-278, 326-334, 543-545, 581-582
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 145 - 151, computeFixtureHash currently only hashes data.json and the screenshot, causing stale cache hits; move cache-key construction into the per-type processing loop and include the actual run inputs: the generated per-type "tree" string and the bytes of fixture assets under images/ and vectors/ when present. Update computeFixtureHash (and the equivalent hashing logic at the other call sites referenced) to accept the per-type tree and asset bytes or compute them in-loop, then combine those values with data.json and screenshot before calling createHash("sha256") so the returned digest uniquely reflects the exact inputs used for that run.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 317-347: The HTML from parseResponse is untrusted: before
rendering it, validate/sanitize it in run-phase1.ts (e.g., in the block that
writes htmlPath) by rejecting or stripping any <script>, <iframe>, inline event
handlers (on*), and external resource URLs (http(s):) or at minimum fail the run
if such tags/attributes are present; then harden the renderer in
src/core/engine/visual-compare.ts inside the renderCodeScreenshot function by
disabling JavaScript execution on the page (use the API that sets
JavaScriptEnabled to false), enable request interception and block/abort all
network requests (allow only data: or file: if absolutely required), and ensure
no navigation/waitUntil phases rely on network activity—this will prevent
prompt-injected scripts from running or making outbound requests when rendering
the untrusted model HTML.
---
Duplicate comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 154-193: Replace the fragile isValidRunResult boolean guard and
the reparsing in isCacheValid with a Zod-backed loader: create/use the full
RunResult Zod schema from contracts/ (infer type with z.infer<typeof
RunResultSchema>), implement a function (e.g.,
loadValidatedRunResult(resultPath): RunResult | null) that reads result.json,
calls RunResultSchema.safeParse, and returns the parsed/validated object or null
on failure; then update isCacheValid to call loadValidatedRunResult once, reuse
the returned validated RunResult (instead of reparsing and using
isValidRunResult), and perform the cacheKey checks against that validated object
while still verifying REQUIRED_ARTIFACTS and existence of result.json.
- Around line 145-151: computeFixtureHash currently only hashes data.json and
the screenshot, causing stale cache hits; move cache-key construction into the
per-type processing loop and include the actual run inputs: the generated
per-type "tree" string and the bytes of fixture assets under images/ and
vectors/ when present. Update computeFixtureHash (and the equivalent hashing
logic at the other call sites referenced) to accept the per-type tree and asset
bytes or compute them in-loop, then combine those values with data.json and
screenshot before calling createHash("sha256") so the returned digest uniquely
reflects the exact inputs used for that run.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a0833a63-0372-4a02-8e4b-f9b0e44be230
📒 Files selected for processing (1)
src/agents/ablation/run-phase1.ts
| const { html, interpretations, parseWarnings, interpretationsParseFailed } = parseResponse(responseText); | ||
| if (parseWarnings.length > 0) { | ||
| for (const w of parseWarnings) console.warn(` WARNING: ${w}`); | ||
| } | ||
|
|
||
| const htmlPath = join(runDir, "output.html"); | ||
| writeFileSync(htmlPath, html); | ||
| writeFileSync(join(runDir, "interpretations.json"), JSON.stringify(interpretations, null, 2)); | ||
|
|
||
| // Copy fixture images to run dir so HTML can reference them | ||
| const fixtureImagesDir = resolve(`fixtures/${fixture}/images`); | ||
| if (existsSync(fixtureImagesDir)) { | ||
| const runImagesDir = join(runDir, "images"); | ||
| mkdirSync(runImagesDir, { recursive: true }); | ||
| for (const f of readdirSync(fixtureImagesDir)) { | ||
| copyFileSync(join(fixtureImagesDir, f), join(runImagesDir, f)); | ||
| } | ||
| } | ||
|
|
||
| // Render HTML to screenshot | ||
| console.log(` Rendering screenshot...`); | ||
| const codePngPath = join(runDir, "code.png"); | ||
| const figmaScreenshotPath = getFixtureScreenshotPath(fixture); | ||
| const figmaPng = readFileSync(figmaScreenshotPath); | ||
| const { PNG } = await import("pngjs"); | ||
| const figmaImage = PNG.sync.read(figmaPng); | ||
| const exportScale = 2; | ||
| const logicalW = Math.max(1, Math.round(figmaImage.width / exportScale)); | ||
| const logicalH = Math.max(1, Math.round(figmaImage.height / exportScale)); | ||
|
|
||
| await renderCodeScreenshot(htmlPath, codePngPath, { width: logicalW, height: logicalH }, exportScale); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the parseResponse function to understand where html comes from
fd "run-phase1.ts" --type fRepository: let-sunny/canicode
Length of output: 95
🏁 Script executed:
# Also search for renderCodeScreenshot to understand the rendering context
rg "renderCodeScreenshot" --type ts --type tsx -B 2 -A 5Repository: let-sunny/canicode
Length of output: 89
🏁 Script executed:
# Check for parseResponse implementation
rg "function parseResponse|const parseResponse|export.*parseResponse" --type ts --type tsx -B 2 -A 10Repository: let-sunny/canicode
Length of output: 89
🏁 Script executed:
# Look at the actual browser context configuration in the rendering code
rg "browser\.newContext|javaScriptEnabled|route\(" --type ts --type tsx -B 3 -A 3Repository: let-sunny/canicode
Length of output: 89
🏁 Script executed:
# Search for parseResponse and renderCodeScreenshot without type filters
rg "parseResponse" -A 10 src/Repository: let-sunny/canicode
Length of output: 2009
🏁 Script executed:
# Search for renderCodeScreenshot
rg "renderCodeScreenshot" -B 2 -A 10 src/Repository: let-sunny/canicode
Length of output: 5940
🏁 Script executed:
# Look for browser context configuration
rg "javaScriptEnabled|newContext" -B 2 -A 5 src/Repository: let-sunny/canicode
Length of output: 540
🏁 Script executed:
# Read the run-phase1.ts file to understand the full context
wc -l src/agents/ablation/run-phase1.tsRepository: let-sunny/canicode
Length of output: 99
🏁 Script executed:
# Let me check the full renderCodeScreenshot implementation
sed -n '1,100p' src/core/engine/visual-compare.tsRepository: let-sunny/canicode
Length of output: 3318
🏁 Script executed:
# Check lines around where page is navigated
rg "goto\|page\." src/core/engine/visual-compare.ts -A 2 -B 1Repository: let-sunny/canicode
Length of output: 44
🏁 Script executed:
# Check if there's any HTML sanitization or validation before rendering
rg "sanitize|validate|script|<script" src/agents/ablation/run-phase1.ts -B 2 -A 2Repository: let-sunny/canicode
Length of output: 44
🏁 Script executed:
# See the complete context around lines 317-347 in run-phase1.ts
sed -n '300,360p' src/agents/ablation/run-phase1.tsRepository: let-sunny/canicode
Length of output: 2680
🏁 Script executed:
# Continue reading renderCodeScreenshot implementation
sed -n '95,150p' src/core/engine/visual-compare.tsRepository: let-sunny/canicode
Length of output: 2037
🏁 Script executed:
# Get more of the renderCodeScreenshot function
rg "renderCodeScreenshot" -A 30 src/core/engine/visual-compare.ts | head -50Repository: let-sunny/canicode
Length of output: 1565
🏁 Script executed:
# Check if page.goto or any navigation happens and what options are set
sed -n '95,200p' src/core/engine/visual-compare.tsRepository: let-sunny/canicode
Length of output: 3864
🏁 Script executed:
# Verify if there are any other uses of renderCodeScreenshot or similar patterns
rg "newContext|javaScriptEnabled|route\(" src/ --type ts -cRepository: let-sunny/canicode
Length of output: 97
🏁 Script executed:
# Check if there's any documentation about this code's threat model
cat README.md 2>/dev/null | head -50Repository: let-sunny/canicode
Length of output: 2544
Disable JavaScript and block network requests before rendering untrusted model HTML.
The html output from parseResponse() comes directly from the Claude model response with no sanitization. When passed to renderCodeScreenshot(), it's executed in a Chromium context with JavaScript enabled and network access unrestricted. This allows a prompt-injected response to make arbitrary HTTP requests or probe local files during the waitUntil: "networkidle" phase.
Apply these hardening steps in src/core/engine/visual-compare.ts:
Required security fix
const context = await browser.newContext({
viewport: logicalViewport,
deviceScaleFactor,
+ javaScriptEnabled: false,
});
const page = await context.newPage();
+ await context.route("http://**/*", (route) => route.abort());
+ await context.route("https://**/*", (route) => route.abort());Additionally, validate the HTML before rendering in src/agents/ablation/run-phase1.ts:
+ if (/<script\b/i.test(finalHtml) || /\bon\w+=/i.test(finalHtml)) {
+ throw new Error("Refusing to render HTML with executable content");
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const { html, interpretations, parseWarnings, interpretationsParseFailed } = parseResponse(responseText); | |
| if (parseWarnings.length > 0) { | |
| for (const w of parseWarnings) console.warn(` WARNING: ${w}`); | |
| } | |
| const htmlPath = join(runDir, "output.html"); | |
| writeFileSync(htmlPath, html); | |
| writeFileSync(join(runDir, "interpretations.json"), JSON.stringify(interpretations, null, 2)); | |
| // Copy fixture images to run dir so HTML can reference them | |
| const fixtureImagesDir = resolve(`fixtures/${fixture}/images`); | |
| if (existsSync(fixtureImagesDir)) { | |
| const runImagesDir = join(runDir, "images"); | |
| mkdirSync(runImagesDir, { recursive: true }); | |
| for (const f of readdirSync(fixtureImagesDir)) { | |
| copyFileSync(join(fixtureImagesDir, f), join(runImagesDir, f)); | |
| } | |
| } | |
| // Render HTML to screenshot | |
| console.log(` Rendering screenshot...`); | |
| const codePngPath = join(runDir, "code.png"); | |
| const figmaScreenshotPath = getFixtureScreenshotPath(fixture); | |
| const figmaPng = readFileSync(figmaScreenshotPath); | |
| const { PNG } = await import("pngjs"); | |
| const figmaImage = PNG.sync.read(figmaPng); | |
| const exportScale = 2; | |
| const logicalW = Math.max(1, Math.round(figmaImage.width / exportScale)); | |
| const logicalH = Math.max(1, Math.round(figmaImage.height / exportScale)); | |
| await renderCodeScreenshot(htmlPath, codePngPath, { width: logicalW, height: logicalH }, exportScale); | |
| const { html, interpretations, parseWarnings, interpretationsParseFailed } = parseResponse(responseText); | |
| if (parseWarnings.length > 0) { | |
| for (const w of parseWarnings) console.warn(` WARNING: ${w}`); | |
| } | |
| const htmlPath = join(runDir, "output.html"); | |
| writeFileSync(htmlPath, html); | |
| writeFileSync(join(runDir, "interpretations.json"), JSON.stringify(interpretations, null, 2)); | |
| // Copy fixture images to run dir so HTML can reference them | |
| const fixtureImagesDir = resolve(`fixtures/${fixture}/images`); | |
| if (existsSync(fixtureImagesDir)) { | |
| const runImagesDir = join(runDir, "images"); | |
| mkdirSync(runImagesDir, { recursive: true }); | |
| for (const f of readdirSync(fixtureImagesDir)) { | |
| copyFileSync(join(fixtureImagesDir, f), join(runImagesDir, f)); | |
| } | |
| } | |
| // Render HTML to screenshot | |
| if (/<script\b/i.test(html) || /\bon\w+=/i.test(html)) { | |
| throw new Error("Refusing to render HTML with executable content"); | |
| } | |
| console.log(` Rendering screenshot...`); | |
| const codePngPath = join(runDir, "code.png"); | |
| const figmaScreenshotPath = getFixtureScreenshotPath(fixture); | |
| const figmaPng = readFileSync(figmaScreenshotPath); | |
| const { PNG } = await import("pngjs"); | |
| const figmaImage = PNG.sync.read(figmaPng); | |
| const exportScale = 2; | |
| const logicalW = Math.max(1, Math.round(figmaImage.width / exportScale)); | |
| const logicalH = Math.max(1, Math.round(figmaImage.height / exportScale)); | |
| await renderCodeScreenshot(htmlPath, codePngPath, { width: logicalW, height: logicalH }, exportScale); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agents/ablation/run-phase1.ts` around lines 317 - 347, The HTML from
parseResponse is untrusted: before rendering it, validate/sanitize it in
run-phase1.ts (e.g., in the block that writes htmlPath) by rejecting or
stripping any <script>, <iframe>, inline event handlers (on*), and external
resource URLs (http(s):) or at minimum fail the run if such tags/attributes are
present; then harden the renderer in src/core/engine/visual-compare.ts inside
the renderCodeScreenshot function by disabling JavaScript execution on the page
(use the API that sets JavaScriptEnabled to false), enable request interception
and block/abort all network requests (allow only data: or file: if absolutely
required), and ensure no navigation/waitUntil phases rely on network
activity—this will prevent prompt-injected scripts from running or making
outbound requests when rendering the untrusted model HTML.
LLM output starts with "// filename: index.html" which is not valid HTML. Playwright's body > *:first-child selector fails because the root element is invisible. Strip the prefix before saving. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Untrusted model output rendered in Chromium via Playwright. Remove script tags to prevent arbitrary JS execution during screenshot capture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
src/agents/ablation/run-phase1.ts (3)
154-193:⚠️ Potential issue | 🟠 MajorReplace the partial cache guard with a schema-backed parse.
isValidRunResult()only validates a few numerics and thatcacheKeyis “an object”. A malformedresult.jsoncan still slip badtype,fixture,runIndex,interpretationsCount, or nestedcacheKeyfields into the cache-hit path when it is later reloaded asRunResult. Parse cached results through one Zod schema instead of a partial type guard/cast, and infer the TypeScript type from that schema. As per coding guidelines, "Use Zod schemas for external input validation (schemas in contracts/ directory)" and "Infer TypeScript types from Zod schemas using z.infer".#!/bin/bash rg -n 'isValidRunResult|JSON.parse\(readFileSync\(.*result\.json|as RunResult' src/agents/ablation/run-phase1.ts || true rg -n 'RunResultSchema|CacheKeySchema' src || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 154 - 193, Replace the fragile isValidRunResult + manual JSON cast with a Zod-backed parse: import the RunResult schema (and the nested CacheKey schema) from the contracts directory, infer the TypeScript type with z.infer<typeof RunResultSchema>, and use RunResultSchema.parse(parsed) inside isCacheValid to validate the contents of result.json (instead of isValidRunResult and the cast to RunResult). Ensure you remove the loose "cacheKey is object" check and instead compare fields against the validated result.cacheKey typed from the schema (use the schema's fixtureHash field), and handle parse errors by returning false in the catch block.
145-150:⚠️ Potential issue | 🟠 MajorInclude fixture asset directories in
fixtureHash.
computeFixtureHash()only hashesdata.jsonand the reference screenshot, butgetDesignTreeOptions()feedsvectors/andimages/intogenerateDesignTree(), and the run later copiesimages/into the render sandbox. Editing those assets leaves the cache key unchanged, so staleresult.json/code.png/diff.pngcan be reused as valid cache hits. Hash the optional asset directories recursively as part of the fixture inputs.Also applies to: 271-278, 357-364
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 145 - 150, computeFixtureHash currently only includes data.json and the screenshot, missing mutable asset directories; update computeFixtureHash to also recursively walk and include contents of fixtures/<fixture>/vectors/ and fixtures/<fixture>/images/ (if they exist) into the SHA256 via createHash (same hash instance used for data.json and screenshot), and include file paths / file contents deterministically (e.g., sorted) so changes to any asset file update the hash; apply the same directory-recursive hashing fix to the other duplicated implementations referenced in the comment so that getDesignTreeOptions -> generateDesignTree inputs (vectors/images) and the render-sandbox image copy are part of the cache key.
334-355:⚠️ Potential issue | 🔴 CriticalReject executable or remote-resource HTML before the browser render.
The HTML here comes straight from the model and is only rewritten for fonts before rendering. A prompt-injected response can still carry
<script>,on*=handlers,<iframe>, or model-suppliedhttp(s)/file:URLs and turn the screenshot step into network or local-file access. Fail closed on that content before writingoutput.html.Also applies to: 367-378
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 334 - 355, The HTML from parseResponse (variables html and finalHtml) must be scanned and rejected if it contains executable or remote-resource content before writing output.html (htmlPath) or interpretations.json; update run-phase1.ts to parse finalHtml (e.g., with a DOM parser like JSDOM or a sanitizer library) and fail-closed by throwing/logging and aborting when any disallowed elements/attributes are present—specifically detect <script>, <iframe>, <object>, <embed>, inline event handlers (attributes matching /^on/i), and any src/href attributes containing http:, https:, file:, or data: URLs; apply this check right after building finalHtml and before calling writeFileSync, or alternatively sanitize by removing those nodes/attributes and only proceed if resulting HTML contains no remote/executable references.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 449-454: The deltaT sign is inverted: currently deltas.deltaT is
computed as ablation minus baseline so higher values mean more tokens (worse)
but later logic treats larger as better; update all occurrences where deltaT is
pushed (e.g., the deltas.deltaT.push(...) lines near deltas.deltaV/deltaI and
the other two spots flagged) to compute baseline.totalTokens - r.totalTokens
instead (or alternatively change downstream sorting to sort avgDeltaT
ascending); ensure references to deltas.deltaT and avgDeltaT remain consistent
after the change so ties rank variants that save tokens higher.
- Around line 520-522: The fixtures list assigned in run-phase1.ts must be
validated and sandboxed: when computing fixtures from
process.env["ABLATION_FIXTURES"], validate each entry against a strict safe-name
pattern (e.g. allow only alphanumerics, dots, dashes, underscores) and reject or
ignore any that don't match; additionally, after resolving a fixture path into
the fixture input directory and when building run directories via getRunDir(),
canonicalize (resolve) the path and assert it is still rooted under the expected
base directories (e.g., the fixtures base and logs/ablation/phase1 base) before
using it, failing fast or skipping any fixture that escapes its base. Use the
existing fixtures variable and the code that joins with the fixtures base and
getRunDir() to add these checks so no input can traverse out of the intended
sandbox.
---
Duplicate comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 154-193: Replace the fragile isValidRunResult + manual JSON cast
with a Zod-backed parse: import the RunResult schema (and the nested CacheKey
schema) from the contracts directory, infer the TypeScript type with
z.infer<typeof RunResultSchema>, and use RunResultSchema.parse(parsed) inside
isCacheValid to validate the contents of result.json (instead of
isValidRunResult and the cast to RunResult). Ensure you remove the loose
"cacheKey is object" check and instead compare fields against the validated
result.cacheKey typed from the schema (use the schema's fixtureHash field), and
handle parse errors by returning false in the catch block.
- Around line 145-150: computeFixtureHash currently only includes data.json and
the screenshot, missing mutable asset directories; update computeFixtureHash to
also recursively walk and include contents of fixtures/<fixture>/vectors/ and
fixtures/<fixture>/images/ (if they exist) into the SHA256 via createHash (same
hash instance used for data.json and screenshot), and include file paths / file
contents deterministically (e.g., sorted) so changes to any asset file update
the hash; apply the same directory-recursive hashing fix to the other duplicated
implementations referenced in the comment so that getDesignTreeOptions ->
generateDesignTree inputs (vectors/images) and the render-sandbox image copy are
part of the cache key.
- Around line 334-355: The HTML from parseResponse (variables html and
finalHtml) must be scanned and rejected if it contains executable or
remote-resource content before writing output.html (htmlPath) or
interpretations.json; update run-phase1.ts to parse finalHtml (e.g., with a DOM
parser like JSDOM or a sanitizer library) and fail-closed by throwing/logging
and aborting when any disallowed elements/attributes are present—specifically
detect <script>, <iframe>, <object>, <embed>, inline event handlers (attributes
matching /^on/i), and any src/href attributes containing http:, https:, file:,
or data: URLs; apply this check right after building finalHtml and before
calling writeFileSync, or alternatively sanitize by removing those
nodes/attributes and only proceed if resulting HTML contains no
remote/executable references.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 75e1572c-6fa8-46f0-b35c-8824d57217c7
⛔ Files ignored due to path filters (1)
assets/fonts/Inter.var.woff2is excluded by!**/*.woff2
📒 Files selected for processing (1)
src/agents/ablation/run-phase1.ts
| const fixtures = process.env["ABLATION_FIXTURES"] | ||
| ? process.env["ABLATION_FIXTURES"].split(",").map((s) => s.trim()).filter(Boolean) | ||
| : DEFAULT_FIXTURES; |
There was a problem hiding this comment.
Validate fixture names before feeding them into path resolution.
ABLATION_FIXTURES is split and trimmed, but never sanitized before later being interpolated into both fixture input paths and getRunDir(). A value like ../../tmp/x escapes fixtures/ and logs/ablation/phase1/, letting this script read or write outside its intended sandbox. Restrict fixture names to a safe pattern and reject resolved paths that leave the expected base directory.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agents/ablation/run-phase1.ts` around lines 520 - 522, The fixtures list
assigned in run-phase1.ts must be validated and sandboxed: when computing
fixtures from process.env["ABLATION_FIXTURES"], validate each entry against a
strict safe-name pattern (e.g. allow only alphanumerics, dots, dashes,
underscores) and reject or ignore any that don't match; additionally, after
resolving a fixture path into the fixture input directory and when building run
directories via getRunDir(), canonicalize (resolve) the path and assert it is
still rooted under the expected base directories (e.g., the fixtures base and
logs/ablation/phase1 base) before using it, failing fast or skipping any fixture
that escapes its base. Use the existing fixtures variable and the code that
joins with the fixtures base and getRunDir() to add these checks so no input can
traverse out of the intended sandbox.
- ΔT: baseline - ablation (positive = fewer tokens = better) Was ablation - baseline (more tokens ranked higher in ties) - Fixture name validation: reject path traversal attempts (only lowercase alphanumeric, hyphens, underscores) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (5)
src/agents/ablation/run-phase1.ts (5)
154-165:⚠️ Potential issue | 🟠 MajorSchema-validate cached
result.jsonbefore using it.
isValidRunResult()narrows toRunResultafter checking only a few numeric fields, and the cache-hit path at Line 608 bypasses even that withas RunResult. A malformed cache file can still poison rankings or token totals. Parse once with a Zod schema from the repo’scontracts/directory and infer the TS type from that schema instead of asserting it. As per coding guidelines, "Use Zod schemas for external input validation (schemas in contracts/ directory)" and "Infer TypeScript types from Zod schemas usingz.infer<typeof Schema>."Also applies to: 177-181, 607-608
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 154 - 165, Replace the ad-hoc isValidRunResult check and any unchecked casts to RunResult with a Zod-based validation: import the appropriate RunResult schema from the contracts/ directory (e.g., RunResultSchema), use RunResultSchema.safeParse(...) when reading/parsing cached result.json and in isValidRunResult to both validate and infer the type (z.infer<typeof RunResultSchema>), and handle failures by rejecting the cache-hit (log the error and proceed as a cache miss or throw). Update the cache-hit path that currently uses "as RunResult" to use the schema result (or discard on safeParse.error) so malformed cache files cannot poison rankings or token totals.
197-199:⚠️ Potential issue | 🟠 MajorReject unsafe fixture names before resolving paths.
ABLATION_FIXTURESentries are trimmed but never sanitized. Values like../../tmp/xescape bothfixtures/andlogs/ablation/phase1/throughresolve(...)andgetRunDir(), which turns an env var into arbitrary read/write access outside the experiment sandbox. Validate against a safe fixture-name pattern and assert the resolved paths stay under the expected base directories.Also applies to: 523-526, 564-568
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 197 - 199, The getRunDir function (and other places using ABLATION_FIXTURES/resolve) currently accepts unvalidated fixture names which allows path traversal; validate fixture names against a strict safe pattern (e.g., /^[a-z0-9-_]+$/i) before using them and reject/throw on invalid values, and after calling resolve(OUTPUT_DIR, fixture, ...) assert that the resolved path is a descendant of the intended base (OUTPUT_DIR or fixtures base) by comparing path.normalize/resolve prefixes so a malicious input like "../../tmp" cannot escape the sandbox; apply the same validation/assertion to the other occurrences mentioned (around lines 523-526 and 564-568) so all uses of ABLATION_FIXTURES are sanitized and bounded.
345-349:⚠️ Potential issue | 🔴 Critical
<script>stripping alone does not make this HTML safe to render.Inline
on*handlers,<iframe>/<object>, and externalhttp(s)resources still survive intooutput.html, and the screenshot renderer loads that HTML in a normal browser context. A prompt-injected response can still run or fetch during screenshot capture unless the remaining active content is rejected or stripped too.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 345 - 349, The current sanitization only strips <script> and specific <link> tags but leaves inline event handlers, frames, objects, and external resource references; update the sanitization around finalHtml to fully neutralize active content by parsing finalHtml into a DOM and: remove element types iframe, object, embed, frame; strip any attributes that match /^on/i (e.g., onclick, onload) and remove javascript: URLs in href/src attributes; rewrite or remove src/href that reference external http(s) resources (and data: if you want to avoid remote-like behavior) or replace them with inert placeholders; and ensure the existing replace calls on finalHtml are replaced with this DOM-based sanitizer (target the code around the finalHtml variable and the script/link replace logic).
145-150:⚠️ Potential issue | 🟠 MajorHash
images/andvectors/into the cache key.
computeFixtureHash()only fingerprintsdata.jsonand the reference screenshot. This file also consumesfixtures/<fixture>/vectors/when building the design tree andfixtures/<fixture>/images/when rendering, so asset changes can still hit a stale cache and skew the rankings. Hash those directories too, or hash the generated tree plus copied assets before acceptingresult.json.Also applies to: 271-279, 361-369
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 145 - 150, computeFixtureHash currently only includes data.json and the reference screenshot so changes under fixtures/<fixture>/images/ and fixtures/<fixture>/vectors/ can be missed; update computeFixtureHash to also walk and incorporate the contents (and relative paths) of the images/ and vectors/ directories into the SHA256 hash (e.g., recursively read each file in fixtures/${fixture}/images and fixtures/${fixture}/vectors and feed their bytes and path names into hash.update), and apply the same change to the other identical hashing sites in this file; alternatively, if you already generate a design tree and copy assets before finalizing result.json, hash the generated tree plus copied asset files (their bytes and relative paths) and use that value as the cache key before accepting result.json.
453-458:⚠️ Potential issue | 🟠 MajorFlip
ΔTso the efficiency tie-break prefers cheaper runs.
avgDeltaTis sorted descending, but Line 458 recordsablation - baseline. In ties, a more expensive ablation outranks one that saves tokens.🔧 Proposed fix
- deltas.deltaT.push(r.totalTokens - baseline.totalTokens); + deltas.deltaT.push(baseline.totalTokens - r.totalTokens);Also applies to: 486-490
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 453 - 458, The ΔT push currently records r.totalTokens - baseline.totalTokens which makes positive values mean the ablation is more expensive; change the sign so ΔT = baseline.totalTokens - r.totalTokens by replacing deltas.deltaT.push(r.totalTokens - baseline.totalTokens) with deltas.deltaT.push(baseline.totalTokens - r.totalTokens) so avgDeltaT (sorted descending) treats token savings as better; make the same flip in the other occurrence (the block around deltas.deltaT.push at 486-490) so both places produce positive values for cheaper runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 154-165: Replace the ad-hoc isValidRunResult check and any
unchecked casts to RunResult with a Zod-based validation: import the appropriate
RunResult schema from the contracts/ directory (e.g., RunResultSchema), use
RunResultSchema.safeParse(...) when reading/parsing cached result.json and in
isValidRunResult to both validate and infer the type (z.infer<typeof
RunResultSchema>), and handle failures by rejecting the cache-hit (log the error
and proceed as a cache miss or throw). Update the cache-hit path that currently
uses "as RunResult" to use the schema result (or discard on safeParse.error) so
malformed cache files cannot poison rankings or token totals.
- Around line 197-199: The getRunDir function (and other places using
ABLATION_FIXTURES/resolve) currently accepts unvalidated fixture names which
allows path traversal; validate fixture names against a strict safe pattern
(e.g., /^[a-z0-9-_]+$/i) before using them and reject/throw on invalid values,
and after calling resolve(OUTPUT_DIR, fixture, ...) assert that the resolved
path is a descendant of the intended base (OUTPUT_DIR or fixtures base) by
comparing path.normalize/resolve prefixes so a malicious input like "../../tmp"
cannot escape the sandbox; apply the same validation/assertion to the other
occurrences mentioned (around lines 523-526 and 564-568) so all uses of
ABLATION_FIXTURES are sanitized and bounded.
- Around line 345-349: The current sanitization only strips <script> and
specific <link> tags but leaves inline event handlers, frames, objects, and
external resource references; update the sanitization around finalHtml to fully
neutralize active content by parsing finalHtml into a DOM and: remove element
types iframe, object, embed, frame; strip any attributes that match /^on/i
(e.g., onclick, onload) and remove javascript: URLs in href/src attributes;
rewrite or remove src/href that reference external http(s) resources (and data:
if you want to avoid remote-like behavior) or replace them with inert
placeholders; and ensure the existing replace calls on finalHtml are replaced
with this DOM-based sanitizer (target the code around the finalHtml variable and
the script/link replace logic).
- Around line 145-150: computeFixtureHash currently only includes data.json and
the reference screenshot so changes under fixtures/<fixture>/images/ and
fixtures/<fixture>/vectors/ can be missed; update computeFixtureHash to also
walk and incorporate the contents (and relative paths) of the images/ and
vectors/ directories into the SHA256 hash (e.g., recursively read each file in
fixtures/${fixture}/images and fixtures/${fixture}/vectors and feed their bytes
and path names into hash.update), and apply the same change to the other
identical hashing sites in this file; alternatively, if you already generate a
design tree and copy assets before finalizing result.json, hash the generated
tree plus copied asset files (their bytes and relative paths) and use that value
as the cache key before accepting result.json.
- Around line 453-458: The ΔT push currently records r.totalTokens -
baseline.totalTokens which makes positive values mean the ablation is more
expensive; change the sign so ΔT = baseline.totalTokens - r.totalTokens by
replacing deltas.deltaT.push(r.totalTokens - baseline.totalTokens) with
deltas.deltaT.push(baseline.totalTokens - r.totalTokens) so avgDeltaT (sorted
descending) treats token savings as better; make the same flip in the other
occurrence (the block around deltas.deltaT.push at 486-490) so both places
produce positive values for cheaper runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5821fbe7-5272-4eee-9e6c-0eb1964162d0
📒 Files selected for processing (1)
src/agents/ablation/run-phase1.ts
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 110-123: computeConfigVersion currently builds a hash from
coreFiles but omits the Figma loader; update the coreFiles array inside
computeConfigVersion to include the resolver path for
"src/core/engine/figma-file-loader.ts" so changes to the loader invalidate the
cache—locate the computeConfigVersion function and add the figma-file-loader
entry alongside the other resolve(...) entries.
- Around line 345-349: Sanitize finalHtml more robustly by removing inline event
handler attributes and javascript: URLs in addition to stripping <script> and
Google Fonts links: update the sanitization around the finalHtml.replace calls
(the lines that currently strip <script> and <link> tags) to also remove
attributes matching /^on[a-z]+/i and href/src values starting with "javascript:"
(and consider stripping style attributes that include url("javascript:")
patterns); as defense-in-depth, also harden the Playwright rendering path — add
javaScriptEnabled: false to the browser context options used by
renderCodeScreenshot() (or the context creation in visual-compare.ts) so even if
sanitization misses something the page cannot execute JS.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 23115220-3ad3-4858-a9b3-417b05f88919
📒 Files selected for processing (1)
src/agents/ablation/run-phase1.ts
Critical — SVG icon mapping bug: - save-fixture + implement: download loop skipped nodes without URL but mapping loop didn't, causing filename offset (icon-2.svg mapped to wrong node). Fixed by single-pass mapping+download. - Vector filenames now include parent name (Star-Icon vs Figma-Icon) since all vectors are named "Icon" in Figma. Experiment fixes: - Crop code.png to match figma.png dimensions before comparison (AI renders taller HTML → all pixels below design height were diff) - coreFiles: add figma-file-loader.ts to config version hash - HTML sanitization: strip event handlers + javascript: URLs - ΔT sign: baseline - ablation (positive = fewer tokens = better) - Fixture name validation: reject path traversal Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Figma's layoutPositioning: ABSOLUTE was ignored in design-tree output. Now emits: - position: absolute + top/left (relative to parent bbox) for ABSOLUTE nodes - position: relative on parents that have ABSOLUTE children This fixes icon buttons and dropdown overlays that were rendered in normal flow instead of their intended absolute positions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…break) FILL sizing + flex-grow: 1 together causes flex children to take full width and push siblings to next line in wrap containers. flex-grow: 1 alone correctly fills remaining space without forcing 100% width. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…retch) Figma's counterAxisAlignItems defaults to MIN (flex-start) when unset, but CSS flex align-items defaults to stretch. This mismatch caused children to stretch instead of aligning to start. Also: skip textAlignVertical: TOP (CSS default, no emit needed). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SVGs renamed from generic icon-N.svg to parent-named (star-icon, figma-icon, chevron-down-icon, etc.) with corrected mapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All fixtures rebuilt with: - Parent-named SVG filenames (star-icon, figma-icon, etc.) - Corrected mapping (single-pass, no offset bug) - interactionDestinations included - 4 viewport screenshots (375/768/1200/1920) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/agents/ablation/run-phase1.ts (2)
145-152:⚠️ Potential issue | 🟠 MajorCache keys still miss asset-driven tree changes.
computeFixtureHash()only hashesdata.jsonand the reference screenshot, butgenerateDesignTree(file, getDesignTreeOptions(fixture))also depends on fixturevectors/andimages/content.src/core/engine/design-tree.ts:586-606loadsmapping.jsonfrom those directories, so fixture rebuilds like this PR's vector renames can still reuse stale similarity/token results because the loop checks the cache before hashing the actual per-condition tree.Hash the current
treeplus the reference screenshot, or hash every asset/mapping file that feedsgetDesignTreeOptions(), and build the cache key aftertreeis derived for the current condition.Also applies to: 615-617, 643-654
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 145 - 152, computeFixtureHash currently only hashes data.json and the screenshot, missing assets that affect generateDesignTree (vectors/, images/, mapping.json) so cache keys can be stale; update the cache key generation to include the actual design tree used for the condition (or include all asset/mapping files used by getDesignTreeOptions) by computing the tree via generateDesignTree(file, getDesignTreeOptions(fixture)) first, then hashing the serialized tree plus the reference screenshot (or hashed contents of vectors/, images/, and mapping.json) before storing/looking up cache entries; apply the same change to the other cache-key sites in this file that call computeFixtureHash or build per-condition cache keys so all checks use the tree-aware hash.
346-352:⚠️ Potential issue | 🔴 CriticalUntrusted model HTML is still rendered with active browser capabilities.
This sanitizer only removes
<script>, quotedon*=handlers, andhref="javascript:...". The renderer used at Line 387 still goes throughrenderCodeScreenshot(), and the currentsrc/core/engine/visual-compare.ts:1-35snippet shows a Playwright context withoutjavaScriptEnabled: falseor request blocking. A prompt-injected response can still execute or make outbound requests duringpage.goto(..., { waitUntil: "networkidle" }).Harden both sides: disable JS and block network in the renderer, and reject or strip the remaining executable/resource vectors here (
src=javascript:, unquoted handlers, iframes, external URLs, etc.).Verify that the renderer still leaves JS/network enabled and that the sanitizer only handles a narrow subset of executable attributes. Expected result:
src/core/engine/visual-compare.tsshowsnewContextwithout hardening andsrc/agents/ablation/run-phase1.tsshows no handling forsrc="javascript:", unquoted handlers, or remote resource URLs.#!/bin/bash set -euo pipefail echo "=== src/core/engine/visual-compare.ts ===" sed -n '1,40p' src/core/engine/visual-compare.ts echo echo "=== src/agents/ablation/run-phase1.ts ===" sed -n '340,390p' src/agents/ablation/run-phase1.tsAlso applies to: 387-387
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/run-phase1.ts` around lines 346 - 352, Sanitizer and renderer are insufficient: in src/agents/ablation/run-phase1.ts the finalHtml cleanup only strips <script>, quoted on* handlers and javascript: hrefs, and the renderer (renderCodeScreenshot / src/core/engine/visual-compare.ts newContext) still allows JS/network; fix by (1) hardening the Playwright context used by renderCodeScreenshot: createContext/newContext must set javaScriptEnabled: false and block network requests (or use route to abort non-data: requests) so pages cannot execute JS or fetch external resources, and (2) extend the HTML sanitizer in run-phase1.ts (the finalHtml replacements) to also remove src="javascript:..." and src='javascript:...', unquoted on* handlers (e.g., onload=...), strip or neutralize <iframe> and <object> tags, and remove/relocate external resource URLs (e.g., <link href="https://fonts..." rel=...>, <img src="http(s)://...">) so no remote resources can be loaded; reference functions: renderCodeScreenshot, newContext (visual-compare.ts), and the finalHtml replacement block in run-phase1.ts when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 620-655: Wrap each fixture and each individual run iteration in
their own try/catch blocks so a single failure does not abort the entire matrix:
call loadFigmaFileFromJson and generateDesignTree inside a fixture-level
try/catch and on error create a failure RunResult (matching the shape used by
allResults/summary.json) and push it to allResults, increment failure counters,
write/update summary.json, then continue to the next fixture; likewise wrap the
per-condition/per-run logic (including isCacheValid, readFileSync,
stripDesignTree, and runSingle) in a run-level try/catch and on any error
construct and push a failure RunResult (with type, run index, fixture id, and
error message), persist summary.json, and continue to the next run. Ensure you
reference generateDesignTree, loadFigmaFileFromJson, isCacheValid,
stripDesignTree, runSingle, allResults, and the existing summary write path so
resume support can skip completed entries on subsequent runs.
In `@src/core/engine/design-tree.ts`:
- Around line 351-364: The code currently can push both "position: absolute" and
later "position: relative" for the same node, which overwrites the absolute
positioning; update the logic in the block that adds the parent-relative rule so
it only adds "position: relative" when the node itself is not absolutely
positioned (check node.layoutPositioning !== "ABSOLUTE" or that styles does not
already include a "position: absolute" entry) and leave the existing absolute
branch (where node.layoutPositioning === "ABSOLUTE") unchanged; target the code
around node.layoutPositioning, the styles array, and the children check so
absolute nodes with absolute children keep "position: absolute" and only
non-absolute parents receive "position: relative".
---
Duplicate comments:
In `@src/agents/ablation/run-phase1.ts`:
- Around line 145-152: computeFixtureHash currently only hashes data.json and
the screenshot, missing assets that affect generateDesignTree (vectors/,
images/, mapping.json) so cache keys can be stale; update the cache key
generation to include the actual design tree used for the condition (or include
all asset/mapping files used by getDesignTreeOptions) by computing the tree via
generateDesignTree(file, getDesignTreeOptions(fixture)) first, then hashing the
serialized tree plus the reference screenshot (or hashed contents of vectors/,
images/, and mapping.json) before storing/looking up cache entries; apply the
same change to the other cache-key sites in this file that call
computeFixtureHash or build per-condition cache keys so all checks use the
tree-aware hash.
- Around line 346-352: Sanitizer and renderer are insufficient: in
src/agents/ablation/run-phase1.ts the finalHtml cleanup only strips <script>,
quoted on* handlers and javascript: hrefs, and the renderer
(renderCodeScreenshot / src/core/engine/visual-compare.ts newContext) still
allows JS/network; fix by (1) hardening the Playwright context used by
renderCodeScreenshot: createContext/newContext must set javaScriptEnabled: false
and block network requests (or use route to abort non-data: requests) so pages
cannot execute JS or fetch external resources, and (2) extend the HTML sanitizer
in run-phase1.ts (the finalHtml replacements) to also remove
src="javascript:..." and src='javascript:...', unquoted on* handlers (e.g.,
onload=...), strip or neutralize <iframe> and <object> tags, and remove/relocate
external resource URLs (e.g., <link href="https://fonts..." rel=...>, <img
src="http(s)://...">) so no remote resources can be loaded; reference functions:
renderCodeScreenshot, newContext (visual-compare.ts), and the finalHtml
replacement block in run-phase1.ts when making these changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 785f2a6d-6cd7-432d-8543-484709bb6501
⛔ Files ignored due to path filters (28)
fixtures/desktop-product-detail/data.jsonis excluded by!fixtures/**fixtures/desktop-product-detail/screenshot-1920.pngis excluded by!**/*.png,!fixtures/**,!**/*.pngfixtures/desktop-product-detail/vectors/chevron-down-icon-2.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/chevron-down-icon.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/chevron-up-icon.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/figma-icon-2.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/figma-icon.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/heart-icon.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/linkedin-icon.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/logo-instagram-icon.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/logo-youtube-icon.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/mapping.jsonis excluded by!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-10.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-11.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-12.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-13.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-14.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-15.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-16.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-17.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-18.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-4.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-5.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-6.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-7.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-8.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/star-icon-9.svgis excluded by!**/*.svg,!fixtures/**fixtures/desktop-product-detail/vectors/x-logo-icon.svgis excluded by!**/*.svg,!fixtures/**
📒 Files selected for processing (5)
src/agents/ablation/run-phase1.tssrc/cli/commands/implement.tssrc/cli/commands/save-fixture.tssrc/cli/helpers.tssrc/core/engine/design-tree.ts
Previously used getNodeImages (renders entire node including children) for image downloads. Background images with text/buttons on top would include those children in the PNG, causing duplicate content when HTML also renders the text. Now uses getImageFills API to download the original fill image only. Also: - collectImageNodes returns imageRef for direct fill lookup - getImageFills added to FigmaClient - Figma default align-items: flex-start emitted explicitly - textAlignVertical: TOP skipped (CSS default) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Don't emit position: relative if node is already position: absolute (absolute node with absolute children would get both, relative wins) - Wrap per-condition loop in try/catch so one failed run doesn't abort the entire experiment matrix. Failures logged to skippedFixtures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MAX_TOKENS 16000 → 32000 (responses were truncated, HTML block never closed, parsing failed for all conditions) - Parse regex: match unclosed code blocks (truncated by max_tokens) using (?:```|$) instead of requiring closing ``` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Anthropic SDK requires streaming for operations that may take >10 min. With MAX_TOKENS=32000, non-streaming calls hit this limit. Switch to client.messages.stream() + await stream.finalMessage(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removing all colors or all fonts guarantees pixel diff — these test whether raw data exists, not whether the data is organized/useful. The interesting question (tokenization helps?) is already covered by design-token-references strip type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without [hover]: data, AI cannot produce hover CSS. Same logic as color/typography — tests data existence, not AI's inference ability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ow coverage) Same logic as hover — AI can't infer overflow: hidden if not provided. Plus only a few nodes have it, so ΔV would be noise. Remaining test types (5): layout-direction-spacing, size-constraints, component-references, node-names-hierarchy, design-token-references. These test AI's inference/organization ability, not data existence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Automated script for targeted ablation experiment (#130).
What it does
Usage
Output
logs/ablation/phase1/(gitignored):design-tree.txt,response.txt,output.html,code.png,figma.png,diff.png,result.jsonsummary.jsonwith rankings sorted by avg ΔVFeatures
result.json)Cost estimate
33 API calls (3 baselines + 30 ablations) × ~$0.50 = ~$16.50
Skipped types:
position-stacking,component-descriptions(no-op stubs)Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Improvements