feat: restructure logs (1 run = 1 directory) + calibration playbook #4
Merged
Conversation
Each calibration/rule-discovery run now creates a self-contained directory under `logs/calibration/<name>--<timestamp>/` with all artifacts (analysis, conversion, gaps, screenshots, HTML, debate, activity log).

- Add run-directory.ts utility for creating/listing run directories
- Update ActivityLogger to accept runDir instead of fixture path
- Add --run-dir option to calibrate-analyze and calibrate-evaluate CLI
- Port and adapt gap-rule-report from cursor-work branch
- Add calibrate-gap-report CLI command (reads new directory structure)
- Add /calibrate-night Claude Code command for multi-fixture runs
- Move /tmp outputs (screenshots, HTML, design-tree) into run directory
- Update all agent .md files: LOG_FILE → RUN_DIR pattern
- Simplify nightly script (no more snapshot copying)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
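The run-directory idea above can be sketched as a small helper. This is a hypothetical sketch, not the project's actual `run-directory.ts` API: `createRunDir` and its timestamp format are assumptions chosen to match the `<name>--<timestamp>` pattern described in the commit.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: create logs/calibration/<name>--<timestamp>/
// under a base directory. The real run-directory.ts may differ.
function createRunDir(baseDir: string, name: string, now: Date = new Date()): string {
  // ISO timestamp flattened to filename-safe form, e.g. 2026-03-23-23-52-47-833
  const ts = now.toISOString().replace(/[:.TZ]/g, "-").replace(/-$/, "");
  const dir = path.join(baseDir, "logs", "calibration", `${name}--${ts}`);
  fs.mkdirSync(dir, { recursive: true }); // idempotent; creates parents too
  return dir;
}
```

Keeping every artifact under one such directory is what lets later steps (gap report, debate log, screenshots) find each other by convention instead of by explicit paths.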
- Add sourceUrl to AnalysisFile schema, saved by save-fixture CLI
- calibrate-night scans fixtures/*.json instead of explicit list
- After applied=0 (converged), fixture is moved to fixtures/done/
- Nightly script and Claude Code command both use same logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add CALIBRATION-PLAYBOOK.md: step-by-step user guide for calibration and rule discovery workflows
- Add sourceUrl field to AnalysisFile, saved by save-fixture CLI
- calibrate-night scans fixtures/*.json, moves converged to done/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
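The convergence-and-move behavior described above can be sketched as two small functions. This is an illustrative sketch under assumptions: `isConverged` and `moveFixtureToDone` are hypothetical names, and the debate-decision shape is inferred from the commit messages, not taken from the project's contracts.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed shape: each arbitrator decision row carries a decision string.
interface DebateDecision { decision?: string }

// A run counts as converged when no decision was applied (applied=0).
// Trim + lowercase mirrors the lenient matching mentioned in later commits.
function isConverged(decisions: DebateDecision[]): boolean {
  return !decisions.some(
    (d) => typeof d?.decision === "string" && d.decision.trim().toLowerCase() === "applied",
  );
}

// Move a converged fixture file into fixtures/done/.
function moveFixtureToDone(fixturesDir: string, fixtureFile: string): string {
  const doneDir = path.join(fixturesDir, "done");
  fs.mkdirSync(doneDir, { recursive: true });
  const dest = path.join(doneDir, path.basename(fixtureFile));
  fs.renameSync(path.join(fixturesDir, fixtureFile), dest);
  return dest;
}
```

With this split, the nightly script and the Claude Code command can share the same convergence check, as the commit intends.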
Subagents were writing files to wrong paths (old log structure patterns). Now the orchestrator is the single writer for all $RUN_DIR files.

- Remove Write tool from Gap Analyzer, Critic, Arbitrator, Researcher, Designer, Evaluator, rule-discovery Critic
- Each agent's .md now says "Do NOT write any files. Return JSON."
- Orchestrator commands capture agent output and write to $RUN_DIR
- Add Step 7 to calibrate-loop: auto-generate REPORT.md
- Add file verification after Converter step

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
So the user always sees a debate.json in the run directory — either with critic/arbitrator decisions or a "skipped" reason. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Converter was returning empty ruleImpactAssessment[], causing evaluation to produce zero proposals and skipping the Critic/Arbitrator debate. Now the prompt mandates one entry per flagged rule ID. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
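A cheap guard for the bug above is to check the Converter's output for completeness before evaluation runs. This is a hypothetical sketch: `missingImpactEntries` and the `RuleImpact` shape are illustrative names, not the project's actual contracts.

```typescript
// Assumed minimal shape of one ruleImpactAssessment entry.
interface RuleImpact { ruleId: string }

// Return the flagged rule IDs that have no corresponding assessment entry.
// Normalization (trim + lowercase) avoids false misses from whitespace/case.
function missingImpactEntries(flaggedRuleIds: string[], impacts: RuleImpact[]): string[] {
  const covered = new Set(impacts.map((i) => i.ruleId.trim().toLowerCase()));
  return flaggedRuleIds.filter((id) => !covered.has(id.trim().toLowerCase()));
}
```

If this returns a non-empty list, the orchestrator could fail fast (or re-prompt the Converter) instead of silently skipping the Critic/Arbitrator debate.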
- calibrate-gap-report now backs up existing REPORT.md with timestamp before overwriting (e.g. REPORT--2026-03-23-23-52-47-833.md)
- Arbitrator prompt explicitly forbids writing ANY file except rule-config.ts, preventing new-rule-proposals.md and other stale paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
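The backup-before-overwrite behavior can be sketched as follows. This is an assumed implementation, not the project's code: `backupReport` is a hypothetical name, and only the `REPORT--<timestamp>.md` naming comes from the commit above.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// If REPORT.md exists in the run directory, rename it to a timestamped
// backup (e.g. REPORT--2026-03-23-23-52-47-833.md) and return the new path.
// Returns null when there is nothing to back up.
function backupReport(runDir: string, now: Date = new Date()): string | null {
  const reportPath = path.join(runDir, "REPORT.md");
  if (!fs.existsSync(reportPath)) return null;
  const ts = now.toISOString().replace(/[:.TZ]/g, "-").replace(/-$/, "");
  const backupPath = path.join(runDir, `REPORT--${ts}.md`);
  fs.renameSync(reportPath, backupPath);
  return backupPath;
}
```

Renaming rather than copying keeps the operation atomic on the same filesystem, so a crash mid-run cannot leave two half-written reports.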
let-sunny added a commit that referenced this pull request on Mar 24, 2026
1. evidence-collector: normalize ruleId/fixture before storing (prevents bucket splitting from whitespace differences)
2. cli: validate --width/--height as positive numbers
3. visual-compare: browser.close() in finally block (prevents Chromium leak on goto/screenshot failure)
4. visual-compare: remove stale figma.png comment, keep simple cache-miss-only logic
5. mcp docs: sync visual-compare topic with new scale options

Deferred: Zod schema for debate.json (#2), test helper extraction (#4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
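Fix 1 above (normalize before storing) can be sketched as a dedupe-key builder. `evidenceKey` is a hypothetical name chosen for illustration; the commit only states that ruleId/fixture are normalized before use as bucket keys.

```typescript
// Build the bucket key for one piece of calibration evidence.
// Trimming and lowercasing means " Rule-A " and "rule-a" share a bucket,
// which is exactly the whitespace-splitting bug the fix targets.
function evidenceKey(ruleId: string, fixture: string): string {
  return `${ruleId.trim().toLowerCase()}::${fixture.trim().toLowerCase()}`;
}
```

Doing the normalization once, at key-construction time, keeps every reader and writer of the evidence store consistent by construction.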
let-sunny added a commit that referenced this pull request on Mar 24, 2026
…ce) (#28)

* fix: calibration pipeline retina compare, evidence prune, lenient convergence
  - visual-compare: default Figma export scale 2, logical viewport + deviceScaleFactor so code.png matches @2x figma.png; CLI omits 1440×900 override; --figma-scale
  - Figma image cache keys include scale; MCP visual-compare accepts figmaExportScale
  - extractAppliedRuleIds / pruneCalibrationEvidence: trim + case-insensitive decisions
  - appendCalibrationEvidence: replace same (ruleId, fixture) on re-run
  - fixture-done --lenient-convergence + isConverged({ lenient }) for reject-only stalls
  - Document lenient flag in calibrate-night command
  Closes #14

* fix(agents): harden debate parsing and evidence dedupe keys
  - Guard extractAppliedRuleIds and isConverged when arbitrator.decisions is not an array
  - Use trimmed fixture in calibration evidence dedupe key
  - Add tests for malformed debate.json
  - Add docs/RESEARCH-BRANCH-CALIBRATION-RULE-DISCOVERY.md (calibration & rule discovery)
  - Align visual-compare.test.ts comment with padPng behavior
  Made-with: Cursor

* fix(agents): self-review hardening for evidence and debate parsing
  - Dedupe calibration evidence within a single append (Map, last wins)
  - Treat non-object decision rows as inert in extractAppliedRuleIds and isConverged
  - Doc: fix overscored/underscored typo; strict convergence note; Appendix A self-review
  Made-with: Cursor

* fix: extract tolerance constants + defensive debate parsing
  - SCALE_ROUNDING_TOLERANCE (0.08): broader for @2x/@3x detection
  - UNITY_SCALE_TOLERANCE (0.02): tighter for 1x false-positive prevention
  - parseDebateResult: validates shape, debug logs on malformed input
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove research doc (discussion-only, not for merge)
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: token metrics + target environment + A/B token comparison
  1. Token metrics:
     - generateDesignTreeWithStats() returns estimatedTokens + bytes
     - CLI design-tree command shows token estimate
     - CalibrationReportData includes tokenMetrics (tokens, bytes, per-node)
  2. Target environment in CLAUDE.md:
     - Primary target: teams with designers, 300+ node pages
     - Token consumption is a first-class metric
     - Component scores should not be lowered from small-fixture calibration
  3. add-rule A/B always includes token comparison:
     - Step 4 measures both visual similarity AND token savings
     - Token savings ratio: 1 - tokens_b / tokens_a
     - Even if visual diff is zero, token savings justify the rule
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add generateDesignTreeWithStats unit tests
  - Returns tree, estimatedTokens, bytes
  - Output matches generateDesignTree
  - CLI verified: 31KB fixture → ~7912 tokens
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract decision helpers + add JSDoc coverage
  - Extract normalizeDecision() and countDecisions() from the duplicated String(d.decision).trim().toLowerCase() pattern (was 3x)
  - Add APPLIED_TYPES/REJECTED_TYPES constants
  - Add JSDoc to all exported interfaces/functions in changed files: run-directory.ts, visual-compare.ts, design-tree.ts, report-generator.ts, docs.ts
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR #28 review (5 fixes)
  1. evidence-collector: normalize ruleId/fixture before storing (prevents bucket splitting from whitespace differences)
  2. cli: validate --width/--height as positive numbers
  3. visual-compare: browser.close() in finally block (prevents Chromium leak on goto/screenshot failure)
  4. visual-compare: remove stale figma.png comment, keep simple cache-miss-only logic
  5. mcp docs: sync visual-compare topic with new scale options
  Deferred: Zod schema for debate.json (#2), test helper extraction (#4)
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add GitHub issue templates (bug, feature, research)
  Three templates matching the patterns used in this project:
  - Bug: symptom + cause + fix + affected files
  - Feature/Refactor: background + current + proposal + prerequisites
  - Research: question + method + expected outcome + blockers
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: coerce --width/--height from string before validation
  CAC passes CLI option values as strings. Number.isFinite("1440") returns false, rejecting valid inputs. Mirror the figmaScale pattern: coerce with Number() first, then validate.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
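The coercion fix described above (CAC passes option values as strings, and `Number.isFinite("1440")` is false) can be sketched like this. `coerceDimension` is a hypothetical helper name, not the project's actual function.

```typescript
// Coerce a CLI option to a positive finite number, then validate.
// Accepts either a string (as CAC delivers it) or an already-numeric value.
function coerceDimension(value: string | number, flag: string): number {
  const n = Number(value); // coerce FIRST: Number.isFinite("1440") is false
  if (!Number.isFinite(n) || n <= 0) {
    throw new Error(`${flag} must be a positive number, got: ${value}`);
  }
  return n;
}
```

This mirrors the figmaScale pattern the commit mentions: one coercion point up front, so downstream code only ever sees validated numbers.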
let-sunny pushed a commit that referenced this pull request on Mar 30, 2026
- Add explicit `wholeDesign` flag to EvaluationAgentInput instead of inferring from conversionRecords.length === 1 (review #1)
- Use z.enum().or(z.literal()) for discovery impact schema so new code gets compile-time enforcement of canonical values (review #2)
- Move KNOWN_RULE_IDS to module-level constant (review #3)
- Remove unused fixture param from pruneCalibrationEvidence: score changes are global, so a full prune is correct (review #4, YAGNI)

https://claude.ai/code/session_017kvMmvjC54TuWaEWQgWiGc
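The compile-time enforcement of canonical values mentioned above can also be sketched dependency-free, using a `const` tuple plus a type guard instead of the z.enum().or(z.literal()) Zod schema. The value list here is illustrative, not the project's actual set of discovery impact values.

```typescript
// Canonical impact values as a readonly tuple; the union type is derived
// from it, so adding a value in one place updates both runtime and types.
const DISCOVERY_IMPACTS = ["high", "medium", "low"] as const;
type DiscoveryImpact = (typeof DISCOVERY_IMPACTS)[number];

// Runtime type guard for values arriving from JSON (e.g. debate.json).
function isDiscoveryImpact(value: unknown): value is DiscoveryImpact {
  return typeof value === "string" && (DISCOVERY_IMPACTS as readonly string[]).includes(value);
}
```

Either approach gives the property the review asked for: new code assigning a non-canonical value fails at compile time, while parsed input is checked at runtime.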
Summary
- All artifacts saved under logs/calibration/<name>--<timestamp>/
- /tmp removed: screenshots, HTML, and design-tree all move into the run directory
- Scans fixtures/*.json; on convergence (applied=0), the fixture moves to fixtures/done/
- Runs automatically at the end of /calibrate-loop; the previous report is backed up with a timestamp
- New /calibrate-night command: nightly calibration for Claude Code/Cursor
- ruleImpactAssessment made mandatory
- E2E verification complete
- /calibrate-loop full flow
- /add-rule full flow
Changed files
New (5 + tests)
- src/agents/run-directory.ts + test
- src/agents/gap-rule-report.ts + test
- .claude/commands/calibrate-night.md
- docs/CALIBRATION-PLAYBOOK.md

Modified — TypeScript (5)
- src/agents/activity-logger.ts — constructor takes (runDir)
- src/agents/contracts/calibration.ts — runDir field
- src/agents/orchestrator.ts — uses runDir
- src/cli/index.ts — --run-dir, calibrate-gap-report, sourceUrl, REPORT versioning
- src/core/contracts/figma-node.ts — sourceUrl field

Modified — Agent .md (13)
- ruleImpactAssessment made mandatory
- scripts/calibrate-night.sh — fixture scan + move to done/
- CLAUDE.md, docs/CALIBRATION.md, PRIVACY.md

Planned for the next PR
- Add ?geometry=paths to save-fixture (fetch vector paths)

Test plan
- /calibrate-loop E2E full flow
- /add-rule E2E full flow (DROP → revert)

🤖 Generated with Claude Code