feat: responsive viewport comparison in calibration (#182) #183
Conversation
Add a --figma-screenshot option to the visual-compare CLI for local screenshot comparison without an API fetch. The converter now runs a 2-pass comparison: original viewport plus expanded viewport from fixture screenshots. responsiveDelta feeds into evaluation to validate responsive-critical rule scores with measured pixel data instead of AI opinion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
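The responsiveDelta mentioned here is simply the drop in similarity between the original-viewport pass and the expanded-viewport pass. A minimal sketch (the function name is illustrative, not the repository's actual helper):

```typescript
// Sketch of the 2-pass delta: a positive delta means the converted HTML
// degrades when the viewport expands beyond the original design size.
function computeResponsiveDelta(
  similarity: number,
  responsiveSimilarity: number,
): number {
  return similarity - responsiveSimilarity;
}

// With the example values from this PR's conversion.json (87 vs 72),
// the delta is 15.
console.log(computeResponsiveDelta(87, 72));
```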
Caution: Review failed. Pull request was closed or merged during review.

📝 Walkthrough

Adds an optional responsive-viewport comparison and HTML/CSS code-metric collection to the converter; the orchestrator forwards a numeric responsiveDelta to the evaluation agent.

Changes
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Converter
    participant Orchestrator
    participant EvaluationAgent
    participant RuleClassifier
    Converter->>Orchestrator: emit conversionJson (includes responsiveDelta, code metrics)
    Orchestrator->>EvaluationAgent: runEvaluationAgent(input + responsiveDelta)
    EvaluationAgent->>EvaluationAgent: if responsiveDelta != null\ncompute responsiveDifficulty
    EvaluationAgent->>RuleClassifier: classifyFlaggedRule(score, responsiveDifficulty)
    RuleClassifier-->>EvaluationAgent: updated type & actualDifficulty
    EvaluationAgent->>EvaluationAgent: update reasoning & validatedRuleSet
    EvaluationAgent-->>Orchestrator: return updated mismatches & report
```
```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant VisualCompare
    participant FileSystem
    User->>CLI: visual-compare --figma-screenshot <path>
    CLI->>VisualCompare: visualCompare({figmaScreenshotPath: path, ...})
    VisualCompare->>FileSystem: stat/validate path
    FileSystem-->>VisualCompare: file metadata (width, exists)
    VisualCompare->>VisualCompare: inferExportScale(png.width)
    VisualCompare->>FileSystem: copy PNG -> outputDir/figma.png
    VisualCompare->>VisualCompare: run comparison with inferred scale
    VisualCompare-->>CLI: comparison results (similarity, responsiveSimilarity, metrics)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
- Delete rule from validatedRuleSet when override changes type away from validated - Clamp negative responsiveDelta to 0 (treat as easy) - Add 4 tests: override to underscored, keep validated, skip non-responsive, negative delta Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
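The clamping behavior in this commit can be sketched as below. The 5/15 cutoffs are illustrative assumptions, not the repository's actual thresholds; only the clamp-negative-to-easy behavior comes from the commit message:

```typescript
type Difficulty = "easy" | "moderate" | "hard";

// Negative responsiveDelta (expanded viewport scored *higher* than the
// original) is clamped to 0 and treated as easy, per the commit above.
// The 5/15 cutoffs are hypothetical.
function classifyResponsiveDifficulty(responsiveDelta: number): Difficulty {
  const delta = Math.max(0, responsiveDelta);
  if (delta < 5) return "easy";
  if (delta < 15) return "moderate";
  return "hard";
}
```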
Reuse ablation helpers' KNOWN_1X_WIDTHS convention: 1920/768 = @1x, others = @2x. When --figma-screenshot is used, inferExportScale() reads PNG width and sets the correct scale automatically. Previously defaulted to @2x for all local screenshots, causing dimension mismatch for 1920px/768px fixture screenshots. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
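The width-based scale inference described above reduces to a small lookup. This sketch mirrors the convention stated in the commit (1920/768 = @1x, everything else = @2x); the exact signature in `visual-compare-helpers.ts` may differ:

```typescript
// Fixture convention reused from the ablation helpers: 1920px and 768px
// screenshots are @1x exports; all other widths are assumed @2x.
const KNOWN_1X_WIDTHS = [1920, 768];

function inferExportScale(pngWidth: number): 1 | 2 {
  return KNOWN_1X_WIDTHS.includes(pngWidth) ? 1 : 2;
}
```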
- Add computeCodeMetrics() to visual-compare-helpers.ts (htmlBytes, htmlLines, cssClassCount, cssVariableCount) - Converter records metrics in conversion.json after generating output.html - De-duplicate: ablation helpers.ts re-exports from core instead of own copy - Metrics proxy for token cost — subagents can't measure tokens directly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
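A plausible shape for the `computeCodeMetrics()` helper this commit describes. The regexes for classes and custom properties are assumptions about how counting is done, not the actual implementation:

```typescript
interface CodeMetrics {
  htmlBytes: number;
  htmlLines: number;
  cssClassCount: number;
  cssVariableCount: number;
}

function computeCodeMetrics(html: string): CodeMetrics {
  // Byte size and line count of the generated HTML.
  const htmlBytes = new TextEncoder().encode(html).length;
  const htmlLines = html.split("\n").length;

  // Pull the contents of all <style> blocks.
  const css = Array.from(
    html.matchAll(/<style[^>]*>([\s\S]*?)<\/style>/g),
    (m) => m[1],
  ).join("\n");

  // Count unique class selector names and unique custom-property names.
  const cssClassCount = new Set(
    Array.from(css.matchAll(/\.([a-zA-Z_][\w-]*)/g), (m) => m[1]),
  ).size;
  const cssVariableCount = new Set(
    Array.from(css.matchAll(/--[a-zA-Z0-9_-]+(?=\s*:)/g), (m) => m[0]),
  ).size;

  return { htmlBytes, htmlLines, cssClassCount, cssVariableCount };
}
```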
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/agents/calibration/converter.md:
- Around line 63-77: The ordered list item labelled "5. **Responsive
comparison**" has incorrect/inconsistent numbering; update the list numbering
to continue the sequence used earlier (e.g., ensure it follows the prior item
number) and adjust the markdown so there is a blank line before and after the
fenced code block to satisfy MD031; locate the "Responsive comparison" section
and the fenced code block that starts with "```bash" and correct the list
numeral and surrounding blank lines accordingly.
In `@src/agents/evaluation-agent.ts`:
- Around line 172-193: The code uses RULE_ID_CATEGORY[mismatch.ruleId as RuleId]
without guarding for an unknown key; change the block in evaluate/adjustment so
you first ensure mismatch.ruleId is present and that
RULE_ID_CATEGORY[mismatch.ruleId] is !== undefined before comparing to
"responsive-critical" (remove the unsafe `as RuleId` cast), e.g. compute const
category = RULE_ID_CATEGORY[mismatch.ruleId]; if (!mismatch.ruleId || category
=== undefined) continue; then proceed with classifyFlaggedRule, setting
mismatch.type, actualDifficulty, reasoning and updating validatedRuleSet as
before.
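The guarded lookup this comment asks for could look like the sketch below. The `RULE_ID_CATEGORY` entries shown are assumptions based on rule IDs mentioned elsewhere in this PR; the real map lives in the codebase:

```typescript
// Hypothetical slice of the rule-category map used by the evaluation agent.
const RULE_ID_CATEGORY: Record<string, string | undefined> = {
  "fixed-size-in-auto-layout": "responsive-critical",
  "missing-size-constraint": "responsive-critical",
};

// Guard unknown rule IDs before indexing, instead of an unsafe `as RuleId`
// cast: a missing ruleId or an unknown key is simply not responsive-critical.
function isResponsiveCritical(ruleId: string | undefined): boolean {
  if (!ruleId) return false;
  const category = RULE_ID_CATEGORY[ruleId];
  return category !== undefined && category === "responsive-critical";
}
```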
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5eb89a93-5ef4-4fde-a67b-6fea4343f405
📒 Files selected for processing (8)
- .claude/agents/calibration/converter.md
- src/agents/contracts/evaluation-agent.ts
- src/agents/evaluation-agent.test.ts
- src/agents/evaluation-agent.ts
- src/agents/orchestrator.ts
- src/cli/commands/visual-compare.ts
- src/core/engine/visual-compare-helpers.ts
- src/core/engine/visual-compare.ts
Orchestrator records converterTokens, converterToolUses, converterDurationMs from subagent usage metadata. Direct measurement — no API proxy needed. Code metrics (htmlBytes, cssClassCount etc.) remain for output quality, token usage now measured from subagent execution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix ordered list numbering in converter.md (MD031/MD029) - Guard unknown ruleId before RULE_ID_CATEGORY access (noUncheckedIndexedAccess) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/agents/ablation/helpers.ts (1)
154-156: 🧹 Nitpick | 🔵 Trivial

Duplicate `KNOWN_1X_WIDTHS` constant — import from shared helpers instead.

This constant duplicates `KNOWN_1X_WIDTHS` from `src/core/engine/visual-compare-helpers.ts` (line 19). Since you're already importing from that module, consider importing this constant as well to maintain a single source of truth.

♻️ Proposed fix

```diff
-import { compareScreenshots } from "../../core/engine/visual-compare-helpers.js";
+import { compareScreenshots, KNOWN_1X_WIDTHS } from "../../core/engine/visual-compare-helpers.js";
```

Then remove lines 155-156:

```diff
 // Figma save-fixture exports at `@2x` by default. 1920/768 condition screenshots are `@1x`.
 // Detect: if width matches a known `@1x` size (1920, 768), use scale 1. Otherwise `@2x`.
-const KNOWN_1X_WIDTHS = [1920, 768];
 const exportScale = KNOWN_1X_WIDTHS.includes(figmaWidth) ? 1 : 2;
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/agents/ablation/helpers.ts` around lines 154-156, remove the duplicate KNOWN_1X_WIDTHS constant in src/agents/ablation/helpers.ts and import the shared KNOWN_1X_WIDTHS from src/core/engine/visual-compare-helpers.ts instead; update the code that computes exportScale (the line using KNOWN_1X_WIDTHS.includes(figmaWidth) ? 1 : 2) to reference the imported constant and delete the local declaration so the module uses the single source of truth.

.claude/agents/calibration/converter.md (1)
81-90: ⚠️ Potential issue | 🟡 Minor

Duplicate step numbering: two steps labeled "6".

Lines 81 and 90 are both numbered "6". The second should be "7" to maintain sequential ordering.

📝 Suggested fix

```diff
-6. **MANDATORY — Rule Impact Assessment**: For EVERY rule ID in `nodeIssueSummaries[].flaggedRuleIds`, assess its actual impact on conversion.
+7. **MANDATORY — Rule Impact Assessment**: For EVERY rule ID in `nodeIssueSummaries[].flaggedRuleIds`, assess its actual impact on conversion.
```

Also update the subsequent steps:
- Line 94: "7." → "8."
- Line 99: "8." → "9."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/agents/calibration/converter.md around lines 81 - 90, Fix the duplicate step numbering in the calibration converter docs by renumbering the second "6." (the Rule Impact Assessment step that references nodeIssueSummaries[].flaggedRuleIds and ruleImpactAssessment) to "7.", then increment the following steps accordingly (change the current "7." to "8." and "8." to "9.") so the sequence is consecutive.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In @.claude/agents/calibration/converter.md:
- Around line 81-90: Fix the duplicate step numbering in the calibration
converter docs by renumbering the second "6." (the Rule Impact Assessment step
that references nodeIssueSummaries[].flaggedRuleIds and ruleImpactAssessment) to
"7.", then increment the following steps accordingly (change the current "7." to
"8." and "8." to "9.") so the sequence is consecutive.
In `@src/agents/ablation/helpers.ts`:
- Around line 154-156: Remove the duplicate KNOWN_1X_WIDTHS constant in
src/agents/ablation/helpers.ts and import the shared KNOWN_1X_WIDTHS from
src/core/engine/visual-compare-helpers.ts instead; update the code that computes
exportScale (the line using KNOWN_1X_WIDTHS.includes(figmaWidth) ? 1 : 2) to
reference the imported constant and delete the local declaration so the module
uses the single source of truth.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: fae58546-060d-4004-a18a-5a100762f686
📒 Files selected for processing (5)
- .claude/agents/calibration/converter.md
- .claude/commands/calibrate-loop.md
- src/agents/ablation/helpers.ts
- src/agents/evaluation-agent.ts
- src/core/engine/visual-compare-helpers.ts
- ablation helpers.ts: import inferExportScale() from core instead of local constant - converter.md: fix duplicate step 6 → sequential 7/8/9 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 4
♻️ Duplicate comments (1)
.claude/agents/calibration/converter.md (1)
63-63: ⚠️ Potential issue | 🟡 Minor

List numbering still incorrect after previous fix attempt.
Static analysis continues to flag this line as incorrectly numbered (expected 6, actual 5). The past review comment indicated this was addressed, but the issue persists.
Root cause: Line 56 duplicates step number 3, causing all subsequent steps to be off by one. While line 56 is not part of this PR's changes, the new responsive comparison step should be numbered correctly as step 6 to reflect its actual position in the sequence.
🔢 Numbering fix
```diff
-5. **Responsive comparison** (if expanded screenshot exists):
+6. **Responsive comparison** (if expanded screenshot exists):
```

Also fix subsequent steps: 6→7, 7→8, 8→9, 9→10, or address the root cause by fixing the duplicate step 3 at line 56.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/agents/calibration/converter.md at line 63, The numbered list for "Responsive comparison" is currently labeled 5 but should be 6; update the list item text "5. **Responsive comparison** (if expanded screenshot exists):" to "6. **Responsive comparison** (if expanded screenshot exists):" and then renumber the following steps (change 6→7, 7→8, 8→9, 9→10) OR remove the duplicate "3." entry at the earlier step (the duplicate on line 56) so the entire sequence shifts correctly; ensure the label text for the item containing "Responsive comparison" and all subsequent step labels are consistent with the corrected sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/agents/calibration/converter.md:
- Around line 79-80: The instructions currently say "set both to null" when only
1 screenshot exists but the schema defines three responsive fields; update the
converter text to explicitly set responsiveSimilarity, responsiveDelta, and
responsiveViewport to null when fewer than two screenshots exist (i.e., skip
responsive comparison and assign null to all three fields).
- Line 98: The description for cssVariableCount uses the ambiguous pattern
`--*:`; update it to an explicit pattern or example so readers know what to
count — e.g., change `--*:` to a concrete regex `--[a-zA-Z0-9_-]+:` (or show an
example like `--primary-color:`) and mention that cssVariableCount should count
unique occurrences of custom properties defined in a <style> block using that
pattern.
- Around line 65-77: The doc is missing explicit steps to pick the largest
screenshot, extract its width, and read the similarity result; update the
instructions to: list and sort screenshot-*.png into an array (e.g.,
SCREENSHOTS), set LARGEST to the last element of that sorted list, extract the
numeric width from LARGEST into LARGEST_WIDTH (e.g., by regex matching the
screenshot-<number>.png pattern), pass "$LARGEST" and "$LARGEST_WIDTH" to the
existing npx canicode visual-compare invocation, and after the run read
responsiveSimilarity from the comparison output (e.g., parse
$RUN_DIR/responsive/comparison.json and read the similarity field into
RESPONSIVE_SIMILARITY). Include these exact symbol names (SCREENSHOTS, LARGEST,
LARGEST_WIDTH, RESPONSIVE_SIMILARITY) and reference the screenshot-*.png pattern
and $RUN_DIR/responsive/comparison.json so reviewers can locate the changes.
- Around line 94-99: The documented code-metrics fields (htmlBytes, htmlLines,
cssClassCount, cssVariableCount, responsiveViewport, responsiveSimilarity) in
step 8 and the conversion.json schema are unused downstream (downstream agents
only read ruleImpactAssessment, uncoveredStruggles, responsiveDelta); either
remove these six fields from the converter.md step 8 and the conversion.json
schema so the converter agent stops collecting them, or update the evaluation
pipeline/downstream extraction to consume them (i.e., add mappings from
htmlBytes/htmlLines/cssClassCount/cssVariableCount/responsiveViewport/responsiveSimilarity
into the downstream outputs or ruleImpactAssessment/responsiveDelta), and ensure
conversion.json, step 8 text, and any schema references (conversion.json,
ruleImpactAssessment, uncoveredStruggles, responsiveDelta) are kept consistent.
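The screenshot-selection steps requested above (sort `screenshot-*.png`, take the largest, extract its width) can be sketched in TypeScript. The function name and return shape are illustrative, not the repository's actual helper:

```typescript
// Pick the widest fixture screenshot from a directory listing by parsing
// the numeric width out of the screenshot-<width>.png filename pattern.
function pickLargestScreenshot(
  files: string[],
): { largest: string; width: number } | null {
  const candidates: { file: string; width: number }[] = [];
  for (const file of files) {
    const m = /screenshot-(\d+)\.png$/.exec(file);
    if (m) candidates.push({ file, width: Number(m[1]) });
  }
  // Numeric sort, smallest first; the last entry is the expanded viewport.
  candidates.sort((a, b) => a.width - b.width);
  const last = candidates[candidates.length - 1];
  return last ? { largest: last.file, width: last.width } : null;
}
```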
---
Duplicate comments:
In @.claude/agents/calibration/converter.md:
- Line 63: The numbered list for "Responsive comparison" is currently labeled 5
but should be 6; update the list item text "5. **Responsive comparison** (if
expanded screenshot exists):" to "6. **Responsive comparison** (if expanded
screenshot exists):" and then renumber the following steps (change 6→7, 7→8,
8→9, 9→10) OR remove the duplicate "3." entry at the earlier step (the duplicate
on line 56) so the entire sequence shifts correctly; ensure the label text for
the item containing "Responsive comparison" and all subsequent step labels are
consistent with the corrected sequence.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 186753e8-3a2c-4501-9542-a9c1bbb41336
📒 Files selected for processing (2)
- .claude/agents/calibration/converter.md
- src/agents/ablation/helpers.ts
> 8. **Code metrics**: After writing `output.html`, record these in conversion.json:
>    - `htmlBytes`: file size in bytes
>    - `htmlLines`: line count
>    - `cssClassCount`: unique CSS class selectors in `<style>` block
>    - `cssVariableCount`: unique CSS custom properties (`--*:`) in `<style>` block
> 9. Note any difficulties NOT covered by existing rules as `uncoveredStruggles`
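The ambiguous `--*:` pattern quoted above can be made concrete with a regex like the one below. This is a sketch of the suggested fix, not code from the repository:

```typescript
// Matches a custom-property name (e.g. --primary-color) only when it is
// being declared, i.e. immediately followed by a colon. Usages such as
// var(--primary-color) are not matched.
const CSS_VARIABLE_PATTERN = /--[a-zA-Z0-9_-]+(?=\s*:)/g;

function countCssVariables(css: string): number {
  // Count unique declarations, so re-declarations aren't double-counted.
  return new Set(css.match(CSS_VARIABLE_PATTERN) ?? []).size;
}
```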
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify that code metrics and responsive fields (except responsiveDelta) are not consumed
rg -n --type=ts -C3 'htmlBytes|htmlLines|cssClassCount|cssVariableCount|responsiveSimilarity|responsiveViewport' \
  -g '!*.test.ts' \
  -g '!*.md'
```

Repository: let-sunny/canicode
Length of output: 5457

🏁 Script executed:

```bash
# Check orchestrator.ts at the referenced lines to see what fields are extracted
find . -name "orchestrator.ts" -type f | head -5
```

Repository: let-sunny/canicode
Length of output: 90

🏁 Script executed:

```bash
# Also search for calibration converter implementation and how it's used
rg -l "calibration.*converter" --type=ts -g '!*.test.ts'
```

Repository: let-sunny/canicode
Length of output: 44

🏁 Script executed:

```bash
# Check what the converter agent actually does with these metrics
rg -n "conversion\.json" --type=ts -C2 -g '!*.test.ts' | head -50
```

Repository: let-sunny/canicode
Length of output: 2245

🏁 Script executed:

```bash
# Check orchestrator.ts at the referenced lines
sed -n '300,330p' src/agents/orchestrator.ts
```

Repository: let-sunny/canicode
Length of output: 866

🏁 Script executed:

```bash
# Check calibrate-debate.ts to see what fields it extracts from conversion.json
sed -n '28,80p' src/cli/commands/internal/calibrate-debate.ts
```

Repository: let-sunny/canicode
Length of output: 2013

🏁 Script executed:

```bash
# Also search for where these metrics (htmlBytes, htmlLines, etc.) are used after being read
rg -n "ruleImpactAssessment|uncoveredStruggles|conversion" src/cli/commands/internal/calibrate-debate.ts -A3 -B3
```

Repository: let-sunny/canicode
Length of output: 1700

🏁 Script executed:

```bash
# Read the converter.md file mentioned in the review
cat .claude/agents/calibration/converter.md | head -150
```

Repository: let-sunny/canicode
Length of output: 6743

🏁 Script executed:

```bash
# Search for actual converter agent implementation or execution
rg -l "converter" src/agents --type=ts -g '!*.test.ts' | head -10
```

Repository: let-sunny/canicode
Length of output: 94

🏁 Script executed:

```bash
# Look for where conversion.json is written to understand the schema
rg -n "conversion.json" src --type=ts -B5 -A5 | grep -A20 "writeFileSync.*conversion"
```

Repository: let-sunny/canicode
Length of output: 1761

🏁 Script executed:

```bash
# Verify responsiveSimilarity and responsiveViewport are extracted
rg -n "responsiveSimilarity|responsiveViewport" src/agents/orchestrator.ts -B2 -A2
```

Repository: let-sunny/canicode
Length of output: 44

🏁 Script executed:

```bash
# Double-check evaluation-agent contract for these fields
grep -A30 "responsiveSimilarity\|responsiveViewport" src/agents/contracts/evaluation-agent.ts
```

Repository: let-sunny/canicode
Length of output: 451

🏁 Script executed:

```bash
# Check for all responsive-related field usage in orchestrator
rg -n "responsive" src/agents/orchestrator.ts -i
```

Repository: let-sunny/canicode
Length of output: 281

🏁 Script executed:

```bash
# Check if responsiveViewport appears anywhere in codebase
rg -n "responsiveViewport" src --type=ts
```

Repository: let-sunny/canicode
Length of output: 44

🏁 Script executed:

```bash
# Check the full conversion.json schema comment in converter.md to see all fields
sed -n '114,155p' .claude/agents/calibration/converter.md
```

Repository: let-sunny/canicode
Length of output: 1380
Remove or use code metrics fields documented in step 8.
The instructions direct the converter agent to collect htmlBytes, htmlLines, cssClassCount, cssVariableCount, responsiveViewport, and responsiveSimilarity (lines 94-99 and the conversion.json schema). However, downstream agents extract only ruleImpactAssessment, uncoveredStruggles, and responsiveDelta from conversion.json. The six metrics above are never consumed and create unnecessary effort for the agent.
Either remove these fields from the documentation or integrate them into the evaluation pipeline if they serve a future purpose.
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 94-94: Ordered list item prefix
Expected: 9; Actual: 8; Style: 1/2/3
(MD029, ol-prefix)
[warning] 99-99: Ordered list item prefix
Expected: 10; Actual: 9; Style: 1/2/3
(MD029, ol-prefix)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/agents/calibration/converter.md around lines 94 - 99, The documented
code-metrics fields (htmlBytes, htmlLines, cssClassCount, cssVariableCount,
responsiveViewport, responsiveSimilarity) in step 8 and the conversion.json
schema are unused downstream (downstream agents only read ruleImpactAssessment,
uncoveredStruggles, responsiveDelta); either remove these six fields from the
converter.md step 8 and the conversion.json schema so the converter agent stops
collecting them, or update the evaluation pipeline/downstream extraction to
consume them (i.e., add mappings from
htmlBytes/htmlLines/cssClassCount/cssVariableCount/responsiveViewport/responsiveSimilarity
into the downstream outputs or ruleImpactAssessment/responsiveDelta), and ensure
conversion.json, step 8 text, and any schema references (conversion.json,
ruleImpactAssessment, uncoveredStruggles, responsiveDelta) are kept consistent.
…s, css pattern - Fix step numbering: 1-10 sequential (was duplicate 3, duplicate 6) - Concrete bash: array sort + grep for screenshot width extraction - "set both to null" → explicit 3 fields (responsiveSimilarity, responsiveDelta, responsiveViewport) - Code metrics annotated as "for analysis/reporting — not consumed by evaluation" - CSS variable pattern: --*: → example (--primary-color:, --spacing-md:) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Add responsive viewport comparison to the calibration pipeline. Scores for responsive-critical rules are validated with pixel-level measurements instead of AI opinion.
Problem
The existing pipeline ran a single comparison at the design's original size. Ablation experiments showed that scores for responsive-critical rules such as `fixed-size-in-auto-layout` and `missing-size-constraint` were judged validated/overscored based on AI opinion alone, with no measurement.

Changes
1. `visual-compare` CLI — add `--figma-screenshot` option
   - Combined with `--figma-url`: the URL is used for parsing, the screenshot comes from the local file
   - Using `--figma-url` alone is unchanged (backward compatible)
2. `visual-compare` core — `figmaScreenshotPath` option
   - Add a `figmaScreenshotPath` field to `VisualCompareOptions`
3. Converter prompt — 2-pass comparison
   - Glob `screenshot-*.png` → numeric sort → smallest = original, largest = expanded
   - Record `similarity`, `responsiveSimilarity`, and `responsiveDelta = similarity - responsiveSimilarity`
4. `conversion.json` — add responsive fields
   - `{ "similarity": 87, "responsiveSimilarity": 72, "responsiveDelta": 15, "responsiveViewport": 1920, ... }`
5. Evaluation — difficulty decided from responsiveDelta
   - Difficulty is judged from `responsiveDelta`; `type` and `reasoning` are overwritten with measured values, and the original AI opinion is kept on record
6. Orchestrator — forward responsiveDelta
   - Extract `responsiveDelta` from `conversion.json` → pass it to the evaluation input

Architecture
Test plan

- `pnpm lint` — pass
- `pnpm test:run` — 646 tests pass

Closes #182
🤖 Generated with Claude Code