feat: restructure logs (1 run = 1 directory) + calibration playbook #4
Merged
Conversation
Each calibration/rule-discovery run now creates a self-contained directory under `logs/calibration/<name>--<timestamp>/` with all artifacts (analysis, conversion, gaps, screenshots, HTML, debate, activity log).

- Add run-directory.ts utility for creating/listing run directories
- Update ActivityLogger to accept runDir instead of fixture path
- Add --run-dir option to calibrate-analyze and calibrate-evaluate CLI
- Port and adapt gap-rule-report from cursor-work branch
- Add calibrate-gap-report CLI command (reads new directory structure)
- Add /calibrate-night Claude Code command for multi-fixture runs
- Move /tmp outputs (screenshots, HTML, design-tree) into run directory
- Update all agent .md files: LOG_FILE → RUN_DIR pattern
- Simplify nightly script (no more snapshot copying)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
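The run-directory idea above can be sketched as a small helper. This is a hypothetical sketch, not the project's actual `run-directory.ts` API: `createRunDir` and its timestamp format are assumptions chosen to match the `<name>--<timestamp>` pattern described in the commit.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: create logs/calibration/<name>--<timestamp>/
// under a base directory. The real run-directory.ts may differ.
function createRunDir(baseDir: string, name: string, now: Date = new Date()): string {
  // ISO timestamp flattened to filename-safe form, e.g. 2026-03-23-23-52-47-833
  const ts = now.toISOString().replace(/[:.TZ]/g, "-").replace(/-$/, "");
  const dir = path.join(baseDir, "logs", "calibration", `${name}--${ts}`);
  fs.mkdirSync(dir, { recursive: true }); // idempotent; creates parents too
  return dir;
}
```

Keeping every artifact under one such directory is what lets later steps (gap report, debate log, screenshots) find each other by convention instead of by explicit paths.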
- Add sourceUrl to AnalysisFile schema, saved by save-fixture CLI
- calibrate-night scans fixtures/*.json instead of explicit list
- After applied=0 (converged), fixture is moved to fixtures/done/
- Nightly script and Claude Code command both use same logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add CALIBRATION-PLAYBOOK.md: step-by-step user guide for calibration and rule discovery workflows
- Add sourceUrl field to AnalysisFile, saved by save-fixture CLI
- calibrate-night scans fixtures/*.json, moves converged to done/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
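The convergence-and-move behavior described above can be sketched as two small functions. This is an illustrative sketch under assumptions: `isConverged` and `moveFixtureToDone` are hypothetical names, and the debate-decision shape is inferred from the commit messages, not taken from the project's contracts.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed shape: each arbitrator decision row carries a decision string.
interface DebateDecision { decision?: string }

// A run counts as converged when no decision was applied (applied=0).
// Trim + lowercase mirrors the lenient matching mentioned in later commits.
function isConverged(decisions: DebateDecision[]): boolean {
  return !decisions.some(
    (d) => typeof d?.decision === "string" && d.decision.trim().toLowerCase() === "applied",
  );
}

// Move a converged fixture file into fixtures/done/.
function moveFixtureToDone(fixturesDir: string, fixtureFile: string): string {
  const doneDir = path.join(fixturesDir, "done");
  fs.mkdirSync(doneDir, { recursive: true });
  const dest = path.join(doneDir, path.basename(fixtureFile));
  fs.renameSync(path.join(fixturesDir, fixtureFile), dest);
  return dest;
}
```

With this split, the nightly script and the Claude Code command can share the same convergence check, as the commit intends.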
Subagents were writing files to wrong paths (old log structure patterns). Now the orchestrator is the single writer for all $RUN_DIR files.

- Remove Write tool from Gap Analyzer, Critic, Arbitrator, Researcher, Designer, Evaluator, rule-discovery Critic
- Each agent's .md now says "Do NOT write any files. Return JSON."
- Orchestrator commands capture agent output and write to $RUN_DIR
- Add Step 7 to calibrate-loop: auto-generate REPORT.md
- Add file verification after Converter step

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
So the user always sees a debate.json in the run directory — either with critic/arbitrator decisions or a "skipped" reason. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Converter was returning empty ruleImpactAssessment[], causing evaluation to produce zero proposals and skipping the Critic/Arbitrator debate. Now the prompt mandates one entry per flagged rule ID. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
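A cheap guard for the bug above is to check the Converter's output for completeness before evaluation runs. This is a hypothetical sketch: `missingImpactEntries` and the `RuleImpact` shape are illustrative names, not the project's actual contracts.

```typescript
// Assumed minimal shape of one ruleImpactAssessment entry.
interface RuleImpact { ruleId: string }

// Return the flagged rule IDs that have no corresponding assessment entry.
// Normalization (trim + lowercase) avoids false misses from whitespace/case.
function missingImpactEntries(flaggedRuleIds: string[], impacts: RuleImpact[]): string[] {
  const covered = new Set(impacts.map((i) => i.ruleId.trim().toLowerCase()));
  return flaggedRuleIds.filter((id) => !covered.has(id.trim().toLowerCase()));
}
```

If this returns a non-empty list, the orchestrator could fail fast (or re-prompt the Converter) instead of silently skipping the Critic/Arbitrator debate.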
- calibrate-gap-report now backs up existing REPORT.md with timestamp before overwriting (e.g. REPORT--2026-03-23-23-52-47-833.md)
- Arbitrator prompt explicitly forbids writing ANY file except rule-config.ts, preventing new-rule-proposals.md and other stale paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
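The backup-before-overwrite behavior can be sketched as follows. This is an assumed implementation, not the project's code: `backupReport` is a hypothetical name, and only the `REPORT--<timestamp>.md` naming comes from the commit above.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// If REPORT.md exists in the run directory, rename it to a timestamped
// backup (e.g. REPORT--2026-03-23-23-52-47-833.md) and return the new path.
// Returns null when there is nothing to back up.
function backupReport(runDir: string, now: Date = new Date()): string | null {
  const reportPath = path.join(runDir, "REPORT.md");
  if (!fs.existsSync(reportPath)) return null;
  const ts = now.toISOString().replace(/[:.TZ]/g, "-").replace(/-$/, "");
  const backupPath = path.join(runDir, `REPORT--${ts}.md`);
  fs.renameSync(reportPath, backupPath);
  return backupPath;
}
```

Renaming rather than copying keeps the operation atomic on the same filesystem, so a crash mid-run cannot leave two half-written reports.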
let-sunny added a commit that referenced this pull request on Mar 24, 2026
1. evidence-collector: normalize ruleId/fixture before storing (prevents bucket splitting from whitespace differences)
2. cli: validate --width/--height as positive numbers
3. visual-compare: browser.close() in finally block (prevents Chromium leak on goto/screenshot failure)
4. visual-compare: remove stale figma.png comment, keep simple cache-miss-only logic
5. mcp docs: sync visual-compare topic with new scale options

Deferred: Zod schema for debate.json (#2), test helper extraction (#4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
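Fix 1 above (normalize before storing) can be sketched as a dedupe-key builder. `evidenceKey` is a hypothetical name chosen for illustration; the commit only states that ruleId/fixture are normalized before use as bucket keys.

```typescript
// Build the bucket key for one piece of calibration evidence.
// Trimming and lowercasing means " Rule-A " and "rule-a" share a bucket,
// which is exactly the whitespace-splitting bug the fix targets.
function evidenceKey(ruleId: string, fixture: string): string {
  return `${ruleId.trim().toLowerCase()}::${fixture.trim().toLowerCase()}`;
}
```

Doing the normalization once, at key-construction time, keeps every reader and writer of the evidence store consistent by construction.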
let-sunny added a commit that referenced this pull request on Mar 24, 2026
…ce) (#28)

* fix: calibration pipeline retina compare, evidence prune, lenient convergence
  - visual-compare: default Figma export scale 2, logical viewport + deviceScaleFactor so code.png matches @2x figma.png; CLI omits 1440×900 override; --figma-scale
  - Figma image cache keys include scale; MCP visual-compare accepts figmaExportScale
  - extractAppliedRuleIds / pruneCalibrationEvidence: trim + case-insensitive decisions
  - appendCalibrationEvidence: replace same (ruleId, fixture) on re-run
  - fixture-done --lenient-convergence + isConverged({ lenient }) for reject-only stalls
  - Document lenient flag in calibrate-night command
  Closes #14

* fix(agents): harden debate parsing and evidence dedupe keys
  - Guard extractAppliedRuleIds and isConverged when arbitrator.decisions is not an array
  - Use trimmed fixture in calibration evidence dedupe key
  - Add tests for malformed debate.json
  - Add docs/RESEARCH-BRANCH-CALIBRATION-RULE-DISCOVERY.md (calibration & rule discovery)
  - Align visual-compare.test.ts comment with padPng behavior
  Made-with: Cursor

* fix(agents): self-review hardening for evidence and debate parsing
  - Dedupe calibration evidence within a single append (Map, last wins)
  - Treat non-object decision rows as inert in extractAppliedRuleIds and isConverged
  - Doc: fix overscored/underscored typo; strict convergence note; Appendix A self-review
  Made-with: Cursor

* fix: extract tolerance constants + defensive debate parsing
  - SCALE_ROUNDING_TOLERANCE (0.08): broader for @2x/@3x detection
  - UNITY_SCALE_TOLERANCE (0.02): tighter for 1x false-positive prevention
  - parseDebateResult: validates shape, debug logs on malformed input
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove research doc (discussion-only, not for merge)
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: token metrics + target environment + A/B token comparison
  1. Token metrics:
     - generateDesignTreeWithStats() returns estimatedTokens + bytes
     - CLI design-tree command shows token estimate
     - CalibrationReportData includes tokenMetrics (tokens, bytes, per-node)
  2. Target environment in CLAUDE.md:
     - Primary target: teams with designers, 300+ node pages
     - Token consumption is a first-class metric
     - Component scores should not be lowered from small-fixture calibration
  3. add-rule A/B always includes token comparison:
     - Step 4 measures both visual similarity AND token savings
     - Token savings ratio: 1 - tokens_b / tokens_a
     - Even if visual diff is zero, token savings justify the rule
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add generateDesignTreeWithStats unit tests
  - Returns tree, estimatedTokens, bytes
  - Output matches generateDesignTree
  - CLI verified: 31KB fixture → ~7912 tokens
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract decision helpers + add JSDoc coverage
  - Extract normalizeDecision() and countDecisions() from the duplicated String(d.decision).trim().toLowerCase() pattern (was 3x)
  - Add APPLIED_TYPES/REJECTED_TYPES constants
  - Add JSDoc to all exported interfaces/functions in changed files: run-directory.ts, visual-compare.ts, design-tree.ts, report-generator.ts, docs.ts
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR #28 review (5 fixes)
  1. evidence-collector: normalize ruleId/fixture before storing (prevents bucket splitting from whitespace differences)
  2. cli: validate --width/--height as positive numbers
  3. visual-compare: browser.close() in finally block (prevents Chromium leak on goto/screenshot failure)
  4. visual-compare: remove stale figma.png comment, keep simple cache-miss-only logic
  5. mcp docs: sync visual-compare topic with new scale options
  Deferred: Zod schema for debate.json (#2), test helper extraction (#4)
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add GitHub issue templates (bug, feature, research)
  Three templates matching the patterns used in this project:
  - Bug: symptom + cause + fix + affected files
  - Feature/Refactor: background + current + proposal + prerequisites
  - Research: question + method + expected outcome + blockers
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: coerce --width/--height from string before validation
  CAC passes CLI option values as strings. Number.isFinite("1440") returns false, rejecting valid inputs. Mirror the figmaScale pattern: coerce with Number() first, then validate.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
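The coercion fix described above (CAC passes option values as strings, and `Number.isFinite("1440")` is false) can be sketched like this. `coerceDimension` is a hypothetical helper name, not the project's actual function.

```typescript
// Coerce a CLI option to a positive finite number, then validate.
// Accepts either a string (as CAC delivers it) or an already-numeric value.
function coerceDimension(value: string | number, flag: string): number {
  const n = Number(value); // coerce FIRST: Number.isFinite("1440") is false
  if (!Number.isFinite(n) || n <= 0) {
    throw new Error(`${flag} must be a positive number, got: ${value}`);
  }
  return n;
}
```

This mirrors the figmaScale pattern the commit mentions: one coercion point up front, so downstream code only ever sees validated numbers.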
let-sunny pushed a commit that referenced this pull request on Mar 30, 2026
- Add explicit `wholeDesign` flag to EvaluationAgentInput instead of inferring from conversionRecords.length === 1 (review #1)
- Use z.enum().or(z.literal()) for discovery impact schema so new code gets compile-time enforcement of canonical values (review #2)
- Move KNOWN_RULE_IDS to module-level constant (review #3)
- Remove unused fixture param from pruneCalibrationEvidence: score changes are global, so a full prune is correct (review #4, YAGNI)

https://claude.ai/code/session_017kvMmvjC54TuWaEWQgWiGc
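The compile-time enforcement of canonical values mentioned above can also be sketched dependency-free, using a `const` tuple plus a type guard instead of the z.enum().or(z.literal()) Zod schema. The value list here is illustrative, not the project's actual set of discovery impact values.

```typescript
// Canonical impact values as a readonly tuple; the union type is derived
// from it, so adding a value in one place updates both runtime and types.
const DISCOVERY_IMPACTS = ["high", "medium", "low"] as const;
type DiscoveryImpact = (typeof DISCOVERY_IMPACTS)[number];

// Runtime type guard for values arriving from JSON (e.g. debate.json).
function isDiscoveryImpact(value: unknown): value is DiscoveryImpact {
  return typeof value === "string" && (DISCOVERY_IMPACTS as readonly string[]).includes(value);
}
```

Either approach gives the property the review asked for: new code assigning a non-canonical value fails at compile time, while parsed input is checked at runtime.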
Summary
- All artifacts saved under logs/calibration/<name>--<timestamp>/
- /tmp removed: screenshots, HTML, and design-tree all move into the run directory
- Scans fixtures/*.json; on convergence (applied=0), the fixture moves to fixtures/done/
- Runs automatically at the end of /calibrate-loop; the previous report is backed up with a timestamp
- New /calibrate-night command: nightly calibration for Claude Code/Cursor
- ruleImpactAssessment made mandatory
- E2E verification complete
- /calibrate-loop full flow
- /add-rule full flow
Changed files
New (5 + tests)
- src/agents/run-directory.ts + test
- src/agents/gap-rule-report.ts + test
- .claude/commands/calibrate-night.md
- docs/CALIBRATION-PLAYBOOK.md

Modified — TypeScript (5)
- src/agents/activity-logger.ts — constructor takes (runDir)
- src/agents/contracts/calibration.ts — runDir field
- src/agents/orchestrator.ts — uses runDir
- src/cli/index.ts — --run-dir, calibrate-gap-report, sourceUrl, REPORT versioning
- src/core/contracts/figma-node.ts — sourceUrl field

Modified — Agent .md (13)
- ruleImpactAssessment made mandatory
- scripts/calibrate-night.sh — fixture scan + move to done/
- CLAUDE.md, docs/CALIBRATION.md, PRIVACY.md

Planned for the next PR
- Add ?geometry=paths to save-fixture (fetch vector paths)

Test plan
- /calibrate-loop E2E full flow
- /add-rule E2E full flow (DROP → revert)

🤖 Generated with Claude Code