calibrate: nightly run — 7 score adjustments from 9 fixtures#93

Merged
let-sunny merged 5 commits into main from calibrate/nightly-2026-03-26 on Mar 26, 2026

@let-sunny let-sunny commented Mar 26, 2026

Calibrate-night results (2026-03-26)

All 9 fixtures ran through the full pipeline (Converter + Evaluation + Critic + Arbitrator).

Score Adjustments Applied (7)

| Rule | Before | After | Evidence | Fixture |
|---|---|---|---|---|
| no-auto-layout | -7 | -10 | hard (circular clock layout) | material3-52949 |
| fixed-size-in-auto-layout | -5 | -3 | 2 cross-fixture easy | simple-ds-175-9106 |
| missing-size-constraint | -5 | -3 | 2 cross-fixture easy | simple-ds-175-9106 |
| magic-number-spacing | -4 | -3 | 2 cross-fixture easy | material3-56615-82356 |
| absolute-position-in-auto-layout | -10 | -7 | medium+easy (2 cases) | material3-51954 |
| missing-responsive-behavior | -4 | -3 | 2 fixtures easy | material3-51954 |
| raw-font | -8 | -4 | 3 cross-fixture easy, blocking→risk | material3-51954 |
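
A minimal sketch of how the seven adjustments in the table above could be applied to a config map. Only the `severity`, `score`, and `enabled` fields of the `raw-font` entry are visible in this PR's diff; the `Adjustment` type and the `applyAdjustments` helper are hypothetical names used for illustration, not the calibration pipeline's actual code.

```typescript
// Assumed shape, based on the "raw-font" entry shown in the PR diff.
type Severity = "blocking" | "risk";

interface RuleConfig {
  severity: Severity;
  score: number;
  enabled: boolean;
}

// Hypothetical record of one calibration adjustment.
interface Adjustment {
  rule: string;
  before: number;
  after: number;
  severity?: Severity; // set only when the severity changes as well
}

// The seven adjustments listed in the PR description.
const adjustments: Adjustment[] = [
  { rule: "no-auto-layout", before: -7, after: -10 },
  { rule: "fixed-size-in-auto-layout", before: -5, after: -3 },
  { rule: "missing-size-constraint", before: -5, after: -3 },
  { rule: "magic-number-spacing", before: -4, after: -3 },
  { rule: "absolute-position-in-auto-layout", before: -10, after: -7 },
  { rule: "missing-responsive-behavior", before: -4, after: -3 },
  { rule: "raw-font", before: -8, after: -4, severity: "risk" },
];

// Returns a new config map; refuses to apply an adjustment whose
// "before" score does not match the current config (stale evidence).
function applyAdjustments(
  configs: Record<string, RuleConfig>,
  list: Adjustment[],
): Record<string, RuleConfig> {
  const next: Record<string, RuleConfig> = { ...configs };
  for (const adj of list) {
    const current = next[adj.rule];
    if (current === undefined) {
      throw new Error(`unknown rule: ${adj.rule}`);
    }
    if (current.score !== adj.before) {
      throw new Error(`${adj.rule}: expected ${adj.before}, found ${current.score}`);
    }
    next[adj.rule] = {
      ...current,
      score: adj.after,
      severity: adj.severity ?? current.severity,
    };
  }
  return next;
}

// Usage with the one entry whose full "before" state appears in the diff.
const before: Record<string, RuleConfig> = {
  "raw-font": { severity: "blocking", score: -8, enabled: true },
};
const after = applyAdjustments(
  before,
  adjustments.filter((a) => a.rule === "raw-font"),
);
```

The `before` check is a design choice for the sketch: it makes a repeated or out-of-date adjustment fail loudly instead of silently stacking.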

Similarity Results (all 9 fixtures)

| Fixture | Nodes | Grade | Similarity | Difficulty |
|---|---|---|---|---|
| simple-ds-4333-9262 | 140 | B+ (81%) | 97% | easy |
| material3-52949-27916 | 453 | C (67%) | 72% | hard |
| simple-ds-562-9518 | 245 | C (67%) | 98% | easy |
| simple-ds-175-9106 | 236 | C (65%) | 94% | low |
| material3-56615-82356 | 330 | D (60%) | 92% | moderate |
| material3-51954-18254 | 959 | D (64%) | 98% | easy |
| material3-56615-45927 | 197 | D (59%) | 86% | moderate |
| simple-ds-175-8591 | 301 | D (63%) | 97% | easy |
| simple-ds-175-7790 | 310 | D (64%) | 98% | easy |

Key Insights

  • Below-B fixtures produced the most valuable data — removing the tier skip was the right call
  • Critic quality excellent: CLAUDE.md policies correctly applied, session saturation recognized
  • raw-font -8→-4 was the biggest change: 3 fixtures showed AI handles raw fonts easily (explicit values in design tree)
  • Component rules held firm: CLAUDE.md policy blocked all disable proposals for missing-component, detached-instance, variant-structure-mismatch

Refs #89

Test plan

  • 590 tests pass
  • 9 fixtures analyzed with full pipeline

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Adjusted severity levels and scoring weights for multiple validation rules to improve prioritization and consistency in system feedback.

let-sunny and others added 5 commits March 26, 2026 09:49
Circular clock layout without Auto Layout required trigonometric
coordinate calculations — genuinely hard to implement.
Validated by 6-agent debate: Critic APPROVE, Arbitrator APPLY.

Fixture: material3-52949-27916 (time picker, 453 nodes, 72% similarity)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…5→-3

Both rules overscored across 2 cross-fixture cases (material3 + simple-ds).
Critic revised from proposed -2 to -3 (modest reduction, not aggressive).
Arbitrator approved. 9 other proposals rejected (CLAUDE.md policy, low confidence).

Fixture: simple-ds-175-9106 (236 nodes, 94% similarity)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 cross-fixture overscored observations (material3 + simple-ds).
Critic revised from proposed -2 to -3 (modest reduction).
14 other proposals rejected (session stability, CLAUDE.md policy, low confidence).

Fixture: material3-56615-82356 (330 nodes, 92% similarity)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- absolute-position-in-auto-layout: -10 → -7 (keep blocking, 2 cases medium+easy)
- missing-responsive-behavior: -4 → -3 (keep risk, 2 cases easy)
- raw-font: -8 → -4 (blocking → risk, 3 cross-fixture easy cases)

14 other proposals rejected (session stability, CLAUDE.md policy, low confidence).

Fixture: material3-51954-18254 (959 nodes, 98% similarity, date range picker)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai bot commented Mar 26, 2026

📝 Walkthrough

Updated scoring and severity values for multiple linting rules in RULE_CONFIGS. Adjusted penalty scores across layout and token-related rules, including no-auto-layout, absolute-position-in-auto-layout, fixed-size-in-auto-layout, and others, and downgraded the severity of the raw-font rule.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Rule Configuration Scoring (src/core/rules/rule-config.ts) | Updated penalty scores for no-auto-layout (−7 to −10), absolute-position-in-auto-layout (−10 to −7), fixed-size-in-auto-layout and missing-size-constraint (−5 to −3), missing-responsive-behavior (−4 to −3), and magic-number-spacing (−4 to −3). Changed raw-font rule severity from blocking to risk with score adjustment from −8 to −4. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

🐰 Whiskers twitching with delight,
Scoring tweaks align just right,
Rule severities now dance and sway,
Configuration brightens up the day!

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: calibration adjustments to rule scoring based on a nightly test run with specific fixtures. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/rules/rule-config.ts`:
- Around line 76-80: The PR reduces the "raw-font" rule's impact (severity
changed from "blocking" to "risk" and score from -8 to -4) but lacks inline
rationale; add a concise comment next to the "raw-font" entry in rule-config.ts
that explains the evidence and reasoning (e.g., "downgraded to recoverable risk
based on 3 cross-fixture easy cases") and mention the relevant calibration
factors (SEVERITY_DENSITY_WEIGHT and SEVERITY_WEIGHT) so future maintainers can
understand the change context; ensure the comment references the "raw-font" key
and the adjusted severity/score values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 415d5cf3-77bb-45d8-a783-adaf21ad776e

📥 Commits

Reviewing files that changed from the base of the PR and between 55d92c3 and f97c1c0.

⛔ Files ignored due to path filters (2)
  • data/calibration-evidence.json is excluded by !data/**
  • data/discovery-evidence.json is excluded by !data/**
📒 Files selected for processing (1)
  • src/core/rules/rule-config.ts

Comment on lines 76 to 80
  "raw-font": {
-   severity: "blocking",
-   score: -8,
+   severity: "risk",
+   score: -4,
    enabled: true,
  },

@coderabbitai coderabbitai bot Mar 26, 2026

🧹 Nitpick | 🔵 Trivial

Significant downgrade: raw-font from blocking to risk with score halved.

This is the largest calibration adjustment in this PR. Per the relevant code snippets:

  • SEVERITY_DENSITY_WEIGHT: blocking=3.0 → risk=2.0 (33% reduction in density impact)
  • SEVERITY_WEIGHT: blocking=10 → risk=5 (50% reduction in weight)

Combined with the score change (-8 → -4), raw-font violations will have substantially less impact on overall design scores. The PR states this was based on "3 cross-fixture easy" evidence.

Consider documenting the rationale inline (e.g., a comment explaining why raw fonts are now considered recoverable risk rather than blocking) to help future maintainers understand this calibration decision.
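
The percentages quoted above can be checked with a short calculation. This is only a sketch: the weight values are taken from the review comment, while the `Record` shape of the two constants and the `relativeReduction` helper are assumptions. How the weights and the score combine into a final design score is not shown in this PR, so the three reductions are reported separately rather than multiplied together.

```typescript
// Values as quoted in the review comment; the object shape is assumed.
const SEVERITY_DENSITY_WEIGHT: Record<string, number> = { blocking: 3.0, risk: 2.0 };
const SEVERITY_WEIGHT: Record<string, number> = { blocking: 10, risk: 5 };

// Fractional drop in magnitude when moving from `before` to `after`.
function relativeReduction(before: number, after: number): number {
  return (Math.abs(before) - Math.abs(after)) / Math.abs(before);
}

const densityReduction = relativeReduction(
  SEVERITY_DENSITY_WEIGHT["blocking"] ?? 0,
  SEVERITY_DENSITY_WEIGHT["risk"] ?? 0,
); // (3.0 - 2.0) / 3.0 = 1/3
const weightReduction = relativeReduction(
  SEVERITY_WEIGHT["blocking"] ?? 0,
  SEVERITY_WEIGHT["risk"] ?? 0,
); // (10 - 5) / 10 = 0.5
const scoreReduction = relativeReduction(-8, -4); // (8 - 4) / 8 = 0.5

// Rounded to whole percentages for display.
const asPercent = (x: number): number => Math.round(x * 100);
console.log(asPercent(densityReduction), asPercent(weightReduction), asPercent(scoreReduction));
```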

📝 Suggested documentation
  "raw-font": {
+   // Calibrated 2026-03-26: downgraded from blocking based on 3 cross-fixture evidence
+   // Raw fonts are recoverable via style tokens; not a hard blocker for implementation
    severity: "risk",
    score: -4,
    enabled: true,
  },

Owner Author

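rule-config.ts is adjusted automatically by the calibration pipeline, so adding an inline comment on every change would quickly clutter the file. The rationale for each change is recorded in the PR description and commit messages, and cross-run evidence also accumulates in data/calibration-evidence.json, so separate inline comments are unnecessary.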


Generated by Claude Code

Contributor

@let-sunny, understood — since rule-config.ts is auto-managed by the calibration pipeline, inline comments would indeed be overwritten or accumulate noise over time. The combination of PR descriptions, commit messages, and data/calibration-evidence.json as the source of truth for calibration rationale makes sense. I'll skip inline documentation suggestions for this file in future reviews.


✏️ Learnings added
Learnt from: let-sunny
Repo: let-sunny/canicode PR: 93
File: src/core/rules/rule-config.ts:76-80
Timestamp: 2026-03-26T01:28:57.785Z
Learning: In the let-sunny/canicode repo, `src/core/rules/rule-config.ts` is automatically adjusted by a nightly calibration pipeline. Do NOT suggest adding inline comments to this file for calibration rationale — the change evidence is tracked in PR descriptions, commit messages, and `data/calibration-evidence.json` instead. Inline comments would create clutter as the file is frequently auto-modified.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: let-sunny/canicode PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-25T18:32:58.247Z
Learning: Applies to src/core/rules/rule-config.ts : Rule scores, severity, and thresholds must be managed in `rules/rule-config.ts`

Learnt from: CR
Repo: let-sunny/canicode PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-25T18:32:58.247Z
Learning: Rule logic and score config must be intentionally separated so scores can be tuned without modifying rule logic

Learnt from: CR
Repo: let-sunny/canicode PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-25T18:32:58.247Z
Learning: Component-related rule scores should NOT be lowered based on small fixture calibration; calibration fixtures should include large, complex pages
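
The separation of rule logic from score config described in the learnings above might look roughly like the following sketch. The `RuleConfig` fields mirror the `raw-font` entry from this PR; the `TextNode` shape, the `Rule` interface, the `penalty` helper, and the specific `raw-font` check are all hypothetical and are not taken from the repository.

```typescript
type Severity = "blocking" | "risk";

interface RuleConfig {
  severity: Severity;
  score: number;
  enabled: boolean;
}

// Tunable values live in one map (mirrors "raw-font" after this PR).
const RULE_CONFIGS: Record<string, RuleConfig> = {
  "raw-font": { severity: "risk", score: -4, enabled: true },
};

// Hypothetical node shape, for illustration only.
interface TextNode {
  id: string;
  fontFamily?: string;
  textStyleId?: string;
}

// Rule logic knows nothing about scores or severity.
interface Rule {
  id: string;
  violates(node: TextNode): boolean;
}

// Assumed meaning of "raw-font": a font set directly, with no text style attached.
const rawFontRule: Rule = {
  id: "raw-font",
  violates: (node) => node.fontFamily !== undefined && node.textStyleId === undefined,
};

// Scoring looks up the config at call time, so calibration can change
// a score without touching the rule's check.
function penalty(
  rule: Rule,
  node: TextNode,
  configs: Record<string, RuleConfig> = RULE_CONFIGS,
): number {
  const config = configs[rule.id];
  if (config === undefined || !config.enabled) return 0;
  return rule.violates(node) ? config.score : 0;
}
```

With this split, a calibration run only needs to rewrite the values in `RULE_CONFIGS`, which fits the reason given earlier in the thread for keeping inline rationale out of the auto-managed file.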
