feat: restructure logs (1 run = 1 directory) + calibration playbook#4

Merged
let-sunny merged 7 commits into main from feat/log-structure
Mar 24, 2026

Conversation

Owner

@let-sunny let-sunny commented Mar 23, 2026

Summary

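  • 1 run = 1 directory: every calibration or rule-discovery run saves all of its artifacts under logs/calibration/<name>--<timestamp>/
  • /tmp removed: screenshots, HTML, and design-tree all move into the run directory
  • Orchestrator owns file writes: subagents only return JSON, and their Write permission is removed
  • Automatic fixture scan: scans fixtures/*.json and moves a fixture to fixtures/done/ once it converges (applied=0)
  • REPORT.md auto-generation + versioning: runs automatically at the end of /calibrate-loop, and the previous report is backed up with a timestamp
  • /calibrate-night command added: nightly calibration for Claude Code/Cursor
  • CALIBRATION-PLAYBOOK.md: workflow guide written from the user's perspective
  • Stronger Converter prompt: ruleImpactAssessment is now mandatory
  • debate.json always generated: a skip is recorded even when proposals=0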

E2E verification complete

/calibrate-loop full flow

File / Status
analysis.json
design-tree.txt
output.html
figma.png, code.png, diff.png
conversion.json (9 ruleImpactAssessments)
gaps.json (written directly by the orchestrator)
summary.md
debate.json (critic + arbitrator)
activity.jsonl (7 steps)
REPORT.md (auto-generated)
Files at unexpected paths ✅ none

/add-rule full flow

Step / Status
Researcher → research.json
Designer → design.json
Implementer → code written + tests
A/B Visual Validation
Decision: DROP (root cause: save-fixture does not use geometry=paths) ✅ code reverted

Changed files

New (5 + tests)

  • src/agents/run-directory.ts + test
  • src/agents/gap-rule-report.ts + test
  • .claude/commands/calibrate-night.md
  • docs/CALIBRATION-PLAYBOOK.md

Modified — TypeScript (5)

  • src/agents/activity-logger.ts: (runDir) constructor
  • src/agents/contracts/calibration.ts: runDir field
  • src/agents/orchestrator.ts: uses runDir
  • src/cli/index.ts: --run-dir, calibrate-gap-report, sourceUrl, REPORT versioning
  • src/core/contracts/figma-node.ts: sourceUrl field

Modified — Agent .md (13)

  • Key change: Write permission removed, "Return JSON, do NOT write files" pattern
  • Converter: ruleImpactAssessment is now mandatory
  • Arbitrator: explicitly forbidden from writing any file other than rule-config.ts

Modified — Other (4)

  • scripts/calibrate-night.sh: fixture scan + move to done/
  • CLAUDE.md, docs/CALIBRATION.md, PRIVACY.md

Planned for the next PR

  • save-fixture: add ?geometry=paths (fetch vector paths)
  • figma.png caching (reuse when the fixture is the same)
  • Skip visual-compare when the score is low

Test plan

  • 215 tests pass, lint clean
  • /calibrate-loop E2E full flow
  • /add-rule E2E full flow (DROP → revert)
  • REPORT.md versioning (backup verified)
  • 0 files at unexpected paths

🤖 Generated with Claude Code

let-sunny and others added 7 commits March 24, 2026 07:50
Each calibration/rule-discovery run now creates a self-contained directory
under logs/calibration/<name>--<timestamp>/ with all artifacts (analysis,
conversion, gaps, screenshots, HTML, debate, activity log).

- Add run-directory.ts utility for creating/listing run directories
- Update ActivityLogger to accept runDir instead of fixture path
- Add --run-dir option to calibrate-analyze and calibrate-evaluate CLI
- Port and adapt gap-rule-report from cursor-work branch
- Add calibrate-gap-report CLI command (reads new directory structure)
- Add /calibrate-night Claude Code command for multi-fixture runs
- Move /tmp outputs (screenshots, HTML, design-tree) into run directory
- Update all agent .md files: LOG_FILE → RUN_DIR pattern
- Simplify nightly script (no more snapshot copying)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
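
For readers who want a concrete picture of the logs/calibration/<name>--<timestamp>/ layout, here is a minimal sketch of building and parsing such a directory name. The function names are hypothetical and are not taken from run-directory.ts. The timestamp format is an assumption borrowed from the REPORT backup example later in this PR (2026-03-23-23-52-47-833), and UTC is assumed.

```typescript
// Hypothetical helpers; the real run-directory.ts may look different.

// Assumption: timestamps look like 2026-03-23-23-52-47-833 and are in UTC.
function timestampSlug(now: Date): string {
  // "2026-03-23T23:52:47.833Z" -> "2026-03-23-23-52-47-833"
  return now.toISOString().replace(/[T:.]/g, "-").replace(/Z$/, "");
}

// Build the run directory path for a fixture name.
function buildRunDir(name: string, now: Date): string {
  return `logs/calibration/${name}--${timestampSlug(now)}`;
}

// Split "<name>--<timestamp>" back into its parts.
// The last "--" is used because the timestamp itself contains only single dashes.
function parseRunDirName(
  dirName: string,
): { name: string; timestamp: string } | null {
  const sep = dirName.lastIndexOf("--");
  if (sep <= 0) return null;
  return { name: dirName.slice(0, sep), timestamp: dirName.slice(sep + 2) };
}
```

Splitting on the last "--" keeps fixture names that themselves contain "--" intact.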
- Add sourceUrl to AnalysisFile schema, saved by save-fixture CLI
- calibrate-night scans fixtures/*.json instead of explicit list
- After applied=0 (converged), fixture is moved to fixtures/done/
- Nightly script and Claude Code command both use same logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
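
The convergence rule above (applied=0 moves the fixture to fixtures/done/) can be sketched as a pure planning step. Everything here is illustrative: RunOutcome, hasConverged, and planDoneMoves are hypothetical names, and the actual script may decide and move files differently.

```typescript
// Hypothetical shape: one calibration result per fixture.
interface RunOutcome {
  fixturePath: string; // e.g. "fixtures/home.json"
  applied: number; // number of proposals applied in this run
}

// A fixture counts as converged when the run applied nothing.
function hasConverged(outcome: RunOutcome): boolean {
  return outcome.applied === 0;
}

// Compute the destination under a sibling done/ directory.
function donePathFor(fixturePath: string): string {
  const slash = fixturePath.lastIndexOf("/");
  const dir = slash === -1 ? "" : fixturePath.slice(0, slash + 1);
  const file = fixturePath.slice(slash + 1);
  return `${dir}done/${file}`;
}

// List the moves to perform; the caller does the actual renames.
function planDoneMoves(
  outcomes: readonly RunOutcome[],
): Array<{ from: string; to: string }> {
  return outcomes
    .filter(hasConverged)
    .map((o) => ({ from: o.fixturePath, to: donePathFor(o.fixturePath) }));
}
```

Keeping the decision separate from the file move makes the rule easy to test without touching the file system.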
- Add CALIBRATION-PLAYBOOK.md: step-by-step user guide for calibration
  and rule discovery workflows
- Add sourceUrl field to AnalysisFile, saved by save-fixture CLI
- calibrate-night scans fixtures/*.json, moves converged to done/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Subagents were writing files to wrong paths (old log structure patterns).
Now the orchestrator is the single writer for all $RUN_DIR files.

- Remove Write tool from Gap Analyzer, Critic, Arbitrator, Researcher,
  Designer, Evaluator, rule-discovery Critic
- Each agent's .md now says "Do NOT write any files. Return JSON."
- Orchestrator commands capture agent output and write to $RUN_DIR
- Add Step 7 to calibrate-loop: auto-generate REPORT.md
- Add file verification after Converter step

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
So the user always sees a debate.json in the run directory —
either with critic/arbitrator decisions or a "skipped" reason.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
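
One way to guarantee that debate.json always exists is to build the record before deciding whether the debate runs at all. The sketch below assumes a hypothetical record shape (status, reason, critic, arbitrator); the real debate.json schema is not shown in this PR.

```typescript
// Hypothetical shapes; the actual debate.json fields may differ.
interface DebateOutcome {
  critic: unknown;
  arbitrator: unknown;
}

type DebateRecord =
  | { status: "skipped"; reason: string }
  | { status: "completed"; critic: unknown; arbitrator: unknown };

// Always returns something to write, even when there is nothing to debate.
function buildDebateRecord(
  proposals: readonly unknown[],
  runDebate: (proposals: readonly unknown[]) => DebateOutcome,
): DebateRecord {
  if (proposals.length === 0) {
    // No proposals: record the skip instead of omitting the file.
    return { status: "skipped", reason: "no proposals from evaluation" };
  }
  const outcome = runDebate(proposals);
  return {
    status: "completed",
    critic: outcome.critic,
    arbitrator: outcome.arbitrator,
  };
}
```

The orchestrator would then serialize the returned record into the run directory in both cases.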
Converter was returning empty ruleImpactAssessment[], causing evaluation
to produce zero proposals and skipping the Critic/Arbitrator debate.
Now the prompt mandates one entry per flagged rule ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- calibrate-gap-report now backs up existing REPORT.md with timestamp
  before overwriting (e.g. REPORT--2026-03-23-23-52-47-833.md)
- Arbitrator prompt explicitly forbids writing ANY file except
  rule-config.ts — prevents new-rule-proposals.md and other stale paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
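
The backup naming in the example above (REPORT--2026-03-23-23-52-47-833.md) can be reproduced with a small helper. This is a sketch, not the calibrate-gap-report implementation: backupNameFor is a hypothetical name, and whether the real timestamp is UTC or local time is not stated, so UTC is assumed here.

```typescript
// Hypothetical helper; assumes the timestamp is derived from UTC time.
function backupNameFor(fileName: string, now: Date): string {
  // "2026-03-23T23:52:47.833Z" -> "2026-03-23-23-52-47-833"
  const stamp = now.toISOString().replace(/[T:.]/g, "-").replace(/Z$/, "");
  // Insert "--<stamp>" between the base name and the extension.
  const dot = fileName.lastIndexOf(".");
  const base = dot > 0 ? fileName.slice(0, dot) : fileName;
  const ext = dot > 0 ? fileName.slice(dot) : "";
  return `${base}--${stamp}${ext}`;
}
```

A caller would rename the existing REPORT.md to this name before writing the new report.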
@let-sunny let-sunny merged commit 707fd5b into main Mar 24, 2026
1 check passed
@let-sunny let-sunny deleted the feat/log-structure branch March 24, 2026 01:54
let-sunny added a commit that referenced this pull request Mar 24, 2026
1. evidence-collector: normalize ruleId/fixture before storing
   (prevents bucket splitting from whitespace differences)
2. cli: validate --width/--height as positive numbers
3. visual-compare: browser.close() in finally block
   (prevents Chromium leak on goto/screenshot failure)
4. visual-compare: remove stale figma.png comment, keep simple
   cache-miss-only logic
5. mcp docs: sync visual-compare topic with new scale options

Deferred: Zod schema for debate.json (#2), test helper extraction (#4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
let-sunny added a commit that referenced this pull request Mar 24, 2026
…ce) (#28)

* fix: calibration pipeline retina compare, evidence prune, lenient convergence

- visual-compare: default Figma export scale 2, logical viewport + deviceScaleFactor
  so code.png matches @2x figma.png; CLI omits 1440×900 override; --figma-scale
- Figma image cache keys include scale; MCP visual-compare accepts figmaExportScale
- extractAppliedRuleIds / pruneCalibrationEvidence: trim + case-insensitive decisions
- appendCalibrationEvidence: replace same (ruleId, fixture) on re-run
- fixture-done --lenient-convergence + isConverged({ lenient }) for reject-only stalls
- Document lenient flag in calibrate-night command

Closes #14

* fix(agents): harden debate parsing and evidence dedupe keys

- Guard extractAppliedRuleIds and isConverged when arbitrator.decisions is not an array
- Use trimmed fixture in calibration evidence dedupe key
- Add tests for malformed debate.json
- Add docs/RESEARCH-BRANCH-CALIBRATION-RULE-DISCOVERY.md (calibration & rule discovery)
- Align visual-compare.test.ts comment with padPng behavior

Made-with: Cursor

* fix(agents): self-review hardening for evidence and debate parsing

- Dedupe calibration evidence within a single append (Map, last wins)
- Treat non-object decision rows as inert in extractAppliedRuleIds and isConverged
- Doc: fix overscored/underscored typo; strict convergence note; Appendix A self-review

Made-with: Cursor
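
The evidence dedupe described in these commits (same (ruleId, fixture) replaced on re-run, trimmed keys, Map with last wins) might look roughly like the following. EvidenceEntry and its note field are hypothetical; only the keying and last-wins behavior come from the commit messages.

```typescript
// Hypothetical entry shape; real calibration evidence has its own fields.
interface EvidenceEntry {
  ruleId: string;
  fixture: string;
  note: string;
}

// Merge existing and incoming evidence, keeping one entry per
// (ruleId, fixture) pair. Later entries overwrite earlier ones.
function mergeEvidence(
  existing: readonly EvidenceEntry[],
  incoming: readonly EvidenceEntry[],
): EvidenceEntry[] {
  const byKey = new Map<string, EvidenceEntry>();
  for (const entry of [...existing, ...incoming]) {
    // Trim so that " r1" and "r1" land in the same bucket.
    const ruleId = entry.ruleId.trim();
    const fixture = entry.fixture.trim();
    const key = JSON.stringify([ruleId, fixture]);
    byKey.set(key, { ...entry, ruleId, fixture });
  }
  return Array.from(byKey.values());
}
```

Using JSON.stringify on the pair avoids key collisions that a plain string join could produce when an ID contains the separator.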

* fix: extract tolerance constants + defensive debate parsing

- SCALE_ROUNDING_TOLERANCE (0.08): broader for @2x/@3x detection
- UNITY_SCALE_TOLERANCE (0.02): tighter for 1x false-positive prevention
- parseDebateResult: validates shape, debug logs on malformed input

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove research doc (discussion-only, not for merge)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: token metrics + target environment + A/B token comparison

1. Token metrics:
   - generateDesignTreeWithStats() returns estimatedTokens + bytes
   - CLI design-tree command shows token estimate
   - CalibrationReportData includes tokenMetrics (tokens, bytes, per-node)

2. Target environment in CLAUDE.md:
   - Primary target: teams with designers, 300+ node pages
   - Token consumption is first-class metric
   - Component scores should not be lowered from small fixture calibration

3. add-rule A/B always includes token comparison:
   - Step 4 measures both visual similarity AND token savings
   - Token savings ratio: 1 - tokens_b / tokens_a
   - Even if visual diff is zero, token savings justify the rule

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
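
The token savings ratio from step 3 (1 - tokens_b / tokens_a) is simple enough to show directly. The function name and the guard against a non-positive baseline are additions for illustration, not part of the add-rule command.

```typescript
// Savings of variant B relative to baseline A, as a fraction of A.
// Positive means B uses fewer tokens; negative means B uses more.
function tokenSavingsRatio(tokensA: number, tokensB: number): number {
  if (!(tokensA > 0)) {
    // Assumption: a zero or negative baseline is treated as invalid input.
    throw new RangeError("tokensA must be a positive number");
  }
  return 1 - tokensB / tokensA;
}
```

For example, a baseline of 1000 tokens and a variant of 750 tokens gives a ratio of 0.25, that is, a 25% saving.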

* test: add generateDesignTreeWithStats unit tests

- Returns tree, estimatedTokens, bytes
- Output matches generateDesignTree
- CLI verified: 31KB fixture → ~7912 tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract decision helpers + add JSDoc coverage

- Extract normalizeDecision() and countDecisions() from duplicated
  String(d.decision).trim().toLowerCase() pattern (was 3x)
- Add APPLIED_TYPES/REJECTED_TYPES constants
- Add JSDoc to all exported interfaces/functions in changed files:
  run-directory.ts, visual-compare.ts, design-tree.ts,
  report-generator.ts, docs.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
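
To make the defensive parsing concrete, here is a sketch of how the String(d.decision).trim().toLowerCase() pattern and the guards from the earlier commits could fit together. It reuses the names normalizeDecision, APPLIED_TYPES, and extractAppliedRuleIds from this PR, but the bodies are a guess and not the repository's code. In particular, the contents of APPLIED_TYPES and the ruleId field name are assumptions.

```typescript
// Assumption: "applied" is the only applied decision type.
const APPLIED_TYPES: ReadonlySet<string> = new Set(["applied"]);

// Normalize a decision value so " Applied " and "applied" compare equal.
function normalizeDecision(value: unknown): string {
  return String(value).trim().toLowerCase();
}

// Collect rule IDs with an applied decision, ignoring malformed input.
function extractAppliedRuleIds(debate: unknown): string[] {
  if (typeof debate !== "object" || debate === null) return [];
  const arbitrator = (debate as { arbitrator?: unknown }).arbitrator;
  if (typeof arbitrator !== "object" || arbitrator === null) return [];
  const decisions = (arbitrator as { decisions?: unknown }).decisions;
  // Guard: decisions may be missing or not an array in a malformed file.
  if (!Array.isArray(decisions)) return [];

  const ids: string[] = [];
  for (const row of decisions as unknown[]) {
    // Non-object rows are inert: skip them instead of throwing.
    if (typeof row !== "object" || row === null) continue;
    const { ruleId, decision } = row as { ruleId?: unknown; decision?: unknown };
    if (typeof ruleId !== "string") continue;
    if (APPLIED_TYPES.has(normalizeDecision(decision))) ids.push(ruleId.trim());
  }
  return ids;
}
```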

* fix: address PR #28 review — 5 fixes

1. evidence-collector: normalize ruleId/fixture before storing
   (prevents bucket splitting from whitespace differences)
2. cli: validate --width/--height as positive numbers
3. visual-compare: browser.close() in finally block
   (prevents Chromium leak on goto/screenshot failure)
4. visual-compare: remove stale figma.png comment, keep simple
   cache-miss-only logic
5. mcp docs: sync visual-compare topic with new scale options

Deferred: Zod schema for debate.json (#2), test helper extraction (#4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add GitHub issue templates (bug, feature, research)

Three templates matching the patterns used in this project:
- Bug: symptom + cause + fix + affected files
- Feature/Refactor: background + current + proposal + prerequisites
- Research: question + method + expected outcome + blockers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: coerce --width/--height from string before validation

CAC passes CLI option values as strings. Number.isFinite("1440")
returns false, rejecting valid inputs. Mirror the figmaScale pattern:
coerce with Number() first, then validate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
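
The coerce-then-validate order described in this fix can be illustrated with a small standalone helper. parsePositiveNumber is a hypothetical name, and the error wording is invented for the example.

```typescript
// Number.isFinite does no coercion, so Number.isFinite("1440") is false.
// Coerce with Number() first, then validate the numeric result.
function parsePositiveNumber(raw: unknown, flag: string): number {
  const value = Number(raw);
  if (!Number.isFinite(value) || value <= 0) {
    throw new Error(`${flag} must be a positive number, got ${String(raw)}`);
  }
  return value;
}
```

With this order, the string "1440" passes and becomes the number 1440, while "abc", "0", and "-5" are rejected.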

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
let-sunny pushed a commit that referenced this pull request Mar 30, 2026
- Add explicit `wholeDesign` flag to EvaluationAgentInput instead of
  inferring from conversionRecords.length === 1 (review #1)
- Use z.enum().or(z.literal()) for discovery impact schema so new code
  gets compile-time enforcement of canonical values (review #2)
- Move KNOWN_RULE_IDS to module-level constant (review #3)
- Remove unused fixture param from pruneCalibrationEvidence — score
  changes are global so full prune is correct (review #4, YAGNI)

https://claude.ai/code/session_017kvMmvjC54TuWaEWQgWiGc
@let-sunny let-sunny mentioned this pull request Mar 30, 2026