Merged
Changes from 11 commits
14 changes: 10 additions & 4 deletions .claude/commands/add-rule.md
@@ -68,9 +68,9 @@ Append to `$RUN_DIR/activity.jsonl`:
{"step":"Implementer","timestamp":"<ISO8601>","result":"implemented rule <rule-id>","durationMs":<ms>}
```

### Step 4 — A/B Visual Validation
### Step 4 — A/B Validation (Visual + Token)

Run an A/B comparison on the entire design to measure the rule's actual impact on pixel-perfect accuracy:
Run an A/B comparison measuring both **pixel accuracy** and **token efficiency**:

1. Extract `fileKey` and root `nodeId` from the fixture or Figma URL.

@@ -91,9 +91,15 @@ Run an A/B comparison on the entire design to measure the rule's actual impact o
   - Run: `npx canicode visual-compare $RUN_DIR/visual-b.html --figma-url "<figma-url-with-root-node-id>" --output $RUN_DIR/visual-b`
   - Record similarity_b

5. Compare: if similarity_b > similarity_a → the rule catches something that genuinely improves implementation quality.
5. **Visual comparison**: if similarity_b > similarity_a → the rule improves pixel accuracy.

6. Record both scores for the Evaluator.
6. **Token comparison** (always measure, even if visual diff is zero):
   - Count lines/bytes of `visual-a.html` vs `visual-b.html`
   - If Test B is significantly smaller → the rule reduces token consumption (important for large pages)
   - Record `tokens_a` (estimated: file bytes / 4) and `tokens_b`
   - Token savings ratio: `1 - tokens_b / tokens_a`

Comment on lines +96 to +101

Review comment (Contributor):

⚠️ Potential issue | 🟡 Minor

Guard the token-savings ratio against tokens_a = 0.

If Test A output is empty/failed, 1 - tokens_b / tokens_a becomes invalid and can break downstream evaluation data.

🛠️ Suggested doc tweak
-   - Token savings ratio: `1 - tokens_b / tokens_a`
+   - Token savings ratio:
+     - if `tokens_a > 0`: `1 - tokens_b / tokens_a`
+     - else: `0` (or `n/a`, and mark the run as generation-failed)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
6. **Token comparison** (always measure, even if visual diff is zero):
   - Count lines/bytes of `visual-a.html` vs `visual-b.html`
   - If Test B is significantly smaller → the rule reduces token consumption (important for large pages)
   - Record `tokens_a` (estimated: file bytes / 4) and `tokens_b`
   - Token savings ratio: `1 - tokens_b / tokens_a`
6. **Token comparison** (always measure, even if visual diff is zero):
   - Count lines/bytes of `visual-a.html` vs `visual-b.html`
   - If Test B is significantly smaller → the rule reduces token consumption (important for large pages)
   - Record `tokens_a` (estimated: file bytes / 4) and `tokens_b`
   - Token savings ratio:
     - if `tokens_a > 0`: `1 - tokens_b / tokens_a`
     - else: `0` (or `n/a`, and mark the run as generation-failed)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/add-rule.md around lines 96 - 101, The token-savings
calculation using the expression `1 - tokens_b / tokens_a` is unsafe when
`tokens_a` can be zero; update the doc and any corresponding implementation
notes to first check `tokens_a == 0` (or falsy) and handle that case
explicitly—e.g., set the savings ratio to 0 or null, or mark the comparison as
invalid/skipped—before computing `1 - tokens_b / tokens_a`; reference the
variables `tokens_a` and `tokens_b` and the ratio expression so maintainers can
find and update the logic and documentation accordingly.

7. Record all scores for the Evaluator: `similarity_a`, `similarity_b`, `tokens_a`, `tokens_b`.
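
As a rough illustration of the token comparison in step 6, including the `tokens_a > 0` guard raised in the review comment, the calculation might look like the sketch below. The names `estimateTokens`, `compareTokens`, and `TokenComparison` are hypothetical and not part of the canicode CLI. Rounding up with `Math.ceil` is an assumption; the document only specifies "file bytes / 4".

```typescript
// Hypothetical sketch; none of these names exist in canicode.
interface TokenComparison {
  tokensA: number;
  tokensB: number;
  // null means Test A produced no output, so the ratio is undefined.
  savingsRatio: number | null;
}

// Rough estimate from the doc: file bytes / 4 (rounding up is an assumption).
function estimateTokens(bytes: number): number {
  return Math.ceil(bytes / 4);
}

function compareTokens(bytesA: number, bytesB: number): TokenComparison {
  const tokensA = estimateTokens(bytesA);
  const tokensB = estimateTokens(bytesB);
  // Guard against division by zero when Test A is empty or failed.
  const savingsRatio = tokensA > 0 ? 1 - tokensB / tokensA : null;
  return { tokensA, tokensB, savingsRatio };
}
```

Returning `null` rather than `0` for an empty Test A keeps a failed generation distinguishable from a run with genuinely zero savings; whether the Evaluator prefers `0` or `n/a` is left open by the review suggestion.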

### Step 5 — Evaluator

8 changes: 7 additions & 1 deletion .claude/commands/calibrate-night.md
@@ -37,7 +37,13 @@ After each successful run, use the CLI to check convergence and move:
npx canicode fixture-done <fixture-path> --run-dir $RUN_DIR
```

This checks `debate.json` for convergence (`applied=0 AND rejected=0`) and moves the fixture to `done/`. If the fixture hasn't converged, the command exits with an error — that's expected, just skip and continue.
This checks `debate.json` for convergence (`applied=0 AND rejected=0` by default) and moves the fixture to `done/`. If the fixture hasn't converged, the command exits with an error — that's expected, just skip and continue.

If the same low-confidence proposals keep getting **rejected** and nothing is applied (issue #14), you can move anyway with **`--lenient-convergence`** (converged when there are no applied/revised decisions, ignoring rejects):

```bash
npx canicode fixture-done <fixture-path> --run-dir $RUN_DIR --lenient-convergence
```

Report which fixtures were moved to `done/`.
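
The two convergence modes described above can be sketched as a small predicate. This is only an illustration: the real `debate.json` schema is not shown in this diff, so the `Decision` type and the `isConverged` function are assumptions, not the CLI's implementation.

```typescript
// Assumed decision labels; the actual debate.json schema may differ.
type Decision = "applied" | "revised" | "rejected";

function isConverged(decisions: Decision[], lenient = false): boolean {
  const count = (d: Decision): number =>
    decisions.filter((x) => x === d).length;

  if (lenient) {
    // --lenient-convergence: no applied/revised decisions, rejects ignored.
    return count("applied") === 0 && count("revised") === 0;
  }
  // Default: applied=0 AND rejected=0.
  return count("applied") === 0 && count("rejected") === 0;
}
```

Under this reading, a run whose only decisions are rejections fails the default check but passes the lenient one, which is the situation the flag is meant to unblock.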

44 changes: 44 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.yml
@@ -0,0 +1,44 @@
name: Bug Report
description: Something is broken or producing wrong results
labels: ["bug"]
body:
  - type: textarea
    id: symptom
    attributes:
      label: Symptom
      description: What's happening? Include error messages, wrong output, or unexpected behavior.
      placeholder: "invisible-layer scores -10 (blocking) but hidden layers don't block implementation"
    validations:
      required: true
  - type: textarea
    id: cause
    attributes:
      label: Cause
      description: Why is this happening? Root cause if known.
      placeholder: "design-tree already skips visible:false nodes, so hidden layers have zero impact on code generation"
    validations:
      required: false
  - type: textarea
    id: fix
    attributes:
      label: Proposed Fix
      description: How should it be fixed?
      placeholder: "Change severity from blocking to suggestion, score from -10 to -1"
    validations:
      required: false
  - type: textarea
    id: files
    attributes:
      label: Affected Files
      description: Which source files need changes?
      placeholder: "src/core/rules/rule-config.ts, src/core/rules/ai-readability/index.ts"
    validations:
      required: false
  - type: textarea
    id: related
    attributes:
      label: Related Issues
      description: Link related issues or PRs
      placeholder: "#14, #15"
    validations:
      required: false
Comment on lines +1 to +44

Review comment (Contributor):

🧹 Nitpick | 🔵 Trivial

Well-structured bug report template.

The template follows GitHub's issue form schema correctly and provides a logical field progression (symptom → cause → fix → affected files → related issues). Requiring only the symptom field allows for quick bug reports while providing optional fields for more detailed analysis. The project-specific placeholders effectively demonstrate the expected format.

💡 Optional enhancement: Add duplicate issue check

Consider adding a checkbox at the beginning to encourage users to search for existing issues:

 labels: ["bug"]
 body:
+  - type: checkboxes
+    id: checklist
+    attributes:
+      label: Pre-submission checklist
+      options:
+        - label: I have searched existing issues for duplicates
+          required: true
   - type: textarea
     id: symptom

This helps reduce duplicate bug reports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/ISSUE_TEMPLATE/bug.yml around lines 1 - 44, Insert a new checkbox
field at the start of the form body (before the existing "symptom" field) to
prompt users to confirm they searched for duplicate issues; add a field with
type: checkboxes, id: duplicate_search (or similar), attributes.label: "I
searched for existing issues" and an options array containing a single option
with label "Yes, I searched" and value "yes" so reviewers can quickly spot
potential duplicates when triaging; keep the existing symptom, cause, fix,
files, and related fields unchanged.

52 changes: 52 additions & 0 deletions .github/ISSUE_TEMPLATE/feature.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
name: Feature / Refactor
description: New functionality or improvement to existing code
labels: ["enhancement"]
body:
  - type: textarea
    id: background
    attributes:
      label: Background
      description: Why is this needed? What problem does it solve?
      placeholder: "save-fixture doesn't store component master nodes, so design-tree and analysis rules can't access component structure"
    validations:
      required: true
  - type: textarea
    id: current
    attributes:
      label: Current State
      description: How does it work now? Include code snippets if helpful.
      placeholder: |
        ```typescript
        // Only stores component metadata (name, description)
        components: { "comp:1": { name: "Button", description: "" } }
        // Missing: master node tree
        ```
    validations:
      required: false
  - type: textarea
    id: proposal
    attributes:
      label: Proposal
      description: What should change? Be specific about the approach.
      placeholder: |
        1. Collect componentIds from INSTANCE nodes
        2. Fetch master nodes via getFileNodes
        3. Store in componentDefinitions field
    validations:
      required: true
  - type: textarea
    id: prerequisites
    attributes:
      label: Prerequisites
      description: What needs to be done first?
      placeholder: "#16 must be merged before this can work"
    validations:
      required: false
  - type: textarea
    id: related
    attributes:
      label: Related Issues
      description: Link related issues or PRs
      placeholder: "#14, #17"
    validations:
      required: false
50 changes: 50 additions & 0 deletions .github/ISSUE_TEMPLATE/research.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
name: Research / Investigation
description: Needs exploration before implementation — API feasibility, impact measurement, etc.
labels: ["research"]
body:
  - type: textarea
    id: question
    attributes:
      label: Research Question
      description: What are we trying to find out?
      placeholder: "Does Figma's annotation data improve AI implementation accuracy?"
    validations:
      required: true
  - type: textarea
    id: method
    attributes:
      label: Method
      description: How will we investigate? Include test fixtures, A/B approach, API calls, etc.
      placeholder: |
        A/B test:
        - Test A: implement without annotations
        - Test B: implement with annotations
        - Compare: similarity + token consumption
    validations:
      required: true
  - type: textarea
    id: expected
    attributes:
      label: Expected Outcome
      description: What will the result tell us? How will it inform next steps?
      placeholder: |
        - If annotations improve accuracy → add annotation-related rules
        - If no difference → annotations are human-only, not AI-relevant
    validations:
      required: true
  - type: textarea
    id: blockers
    attributes:
      label: Blockers
      description: Anything that might prevent this research?
      placeholder: "REST API annotations field is private beta — may need MCP-only path"
    validations:
      required: false
  - type: textarea
    id: related
    attributes:
      label: Related Issues
      description: Link related issues or PRs
      placeholder: "#14, #20"
    validations:
      required: false
13 changes: 13 additions & 0 deletions CLAUDE.md
@@ -6,6 +6,19 @@ A CLI tool that analyzes Figma design structures to provide development-friendli

**Can AI implement this Figma design pixel-perfectly?** Everything in this project serves this single question. Every rule, score, category, and pipeline exists to measure and improve how accurately AI can reproduce a Figma design as code. The metric is `visual-compare` similarity (0-100%).

## Target Environment

The primary target is **teams with designers** where developers (+AI) implement large Figma pages:
- **Page scale**: 300+ nodes, full screens, not small component sections
- **Component-heavy**: Design systems with reusable components, variants, tokens
- **AI context budget**: Large pages must fit in AI context windows — componentization reduces token count via deduplication
- **Not the target**: Individual developers generating simple UI with AI — they don't need Figma analysis

This means:
- Component-related rule scores (missing-component, etc.) should NOT be lowered based on small fixture calibration
- Token consumption is a first-class metric — designs that waste tokens on repeated structures are penalized
- Calibration fixtures should include large, complex pages alongside small sections

## Tech Stack

- **Runtime**: Node.js (>=18)
156 changes: 156 additions & 0 deletions docs/fixtures/ANALYSIS-VALIDITY-FEEDBACK.md
@@ -0,0 +1,156 @@
# Fixture analysis: validity feedback (over- vs under-estimation)

Review comment (Contributor):

⚠️ Potential issue | 🟠 Major

Rename this file to kebab-case to satisfy repository naming rules.

docs/fixtures/ANALYSIS-VALIDITY-FEEDBACK.md violates the filename convention and should be renamed (for example: docs/fixtures/analysis-validity-feedback.md).

As per coding guidelines, "**/*: Use kebab-case for filenames (e.g., my-component.ts)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/fixtures/ANALYSIS-VALIDITY-FEEDBACK.md` at line 1, Rename the file
docs/fixtures/ANALYSIS-VALIDITY-FEEDBACK.md to kebab-case (for example
docs/fixtures/analysis-validity-feedback.md) and update all references to it
(README links, tests, docs, or any import/path usages) so they point to the new
filename; ensure the git move preserves history (git mv) or the change is
committed as a rename.


This document records **subjective engineering judgment** on whether `canicode analyze` results **oversell** or **undersell** how hard it is for an AI to implement the UI **pixel-close** from the fixture, and what to do about it.

It is **not** a statistical proof. It assumes: REST fixture JSON + current `rule-config` scoring, and the north-star metric **visual similarity**, not design-system purity alone.

---

## How to read this

| Term | Meaning here |
|------|----------------|
| **Overrated (grade/score)** | The **readiness score looks better** than the likely **implementation + visual-compare** outcome without extra context or cleanup. |
| **Underrated** | The **score looks worse** than how repeatable or implementable the file actually is for an AI (given a strong prompt / repeated patterns). |

**Remedies** split into: **Figma / fixture hygiene**, **prompt & workflow**, **tooling (rules, scoring, scope)**.

---

## Cross-cutting patterns

### A. Invisible layers + raw color clusters (many Simple DS fixtures)

- **Overrated:** Overall **grade can look decent (A / high B)** while the tree still carries many **invisible-layer** and **raw-color** hits. Severity may keep the headline grade up, but the AI still sees **noise** (extra nodes, ambiguous fills) and may **mis-order or mis-style** visible content. **Pixel outcome** can be worse than the letter grade suggests.
- **Underrated:** Rare for the **same** pattern. Occasionally an invisible layer affects only editor hygiene, not export, yet the REST data and rules still flag it.
- **Remedies:**
  - **Figma:** Delete or detach truly unused layers; replace raw values with variables where the team cares; reduce hidden decoration.
  - **Workflow:** Feed **`design-tree`** or a **pruned node list** to the AI so it ignores known-noise IDs.
  - **Tooling:** Consider reporting **“visible-only issues”** aggregate for AI-facing summaries; tune `invisible-layer` / `raw-color` weights after calibration for “impact on visual-compare.”

### B. Unscoped large trees (Material 3 & large Simple DS)

- **Overrated:** **Low grades (C, B)** can look like “this file is terrible to implement” when the real problem is **scope** — mixing many screens/components in one graph. A single scoped frame might be **B+ or A** in isolation.
- **Underrated:** Almost the inverse: **one bad subtree** can dominate issue counts; the rest of the file might be fine — the **average** undervalues “good parts.”
- **Remedies:**
  - **Fixture:** Re-`save-fixture` with a **single `node-id`** per calibration / AI task.
  - **CLI:** Always analyze with the **same scope** you ask the AI to implement.
  - **Tooling:** Surface **root-scoped score** vs **whole-file score** in reports when unscoped.

### C. Design-system spacing / fixed-width spam (`material3-52949-27916`, `material3-51954-18254`)

- **Overrated:** Less common; if spacing issues are **systematic**, fixing one pattern fixes hundreds — headline pain can be **overstated** for a **template-aware** AI.
- **Underrated:** For a **“first-try” AI** without a grid spec, hundreds of **inconsistent-spacing** / **fixed-width** hits mean **lots of opportunity for pixel drift**, which visual-compare will punish. The score may feel “only B”, but the failure modes are many.
- **Remedies:**
  - **Prompt:** State **grid base (e.g. 4px)**, max width, breakpoint behavior explicitly.
  - **Figma:** Normalize spacing to the grid; reduce fixed widths or mark responsive intent.
  - **Tooling:** Calibration tied to **visual-compare** on these fixtures so spacing rules track **pixel deltas**.
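
To make "normalize spacing to the grid" concrete, a check for off-grid values could be sketched as follows. `offGridSpacings` is a hypothetical helper, not an existing canicode rule, and the 4px default simply mirrors the example grid base mentioned above.

```typescript
// Hypothetical helper: returns the spacing values that are not
// whole multiples of the grid base (default 4, as in the example above).
function offGridSpacings(spacings: number[], base = 4): number[] {
  if (base <= 0) {
    throw new RangeError("base must be positive");
  }
  return spacings.filter((s) => s % base !== 0);
}
```

A list such as `[4, 8, 10, 16, 13]` would yield `[10, 13]`, which are the values a designer would normalize or a prompt would need to call out explicitly.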

### D. Numeric / generic naming (Material 3 community)

- **Overrated:** The score can look **OK on layout** while **ai-readability / naming** tanks; naming is sometimes **underweighted** relative to how much it confuses **less capable** models (variable names, component names in code).
- **Underrated:** For **vision-heavy** codegen (screenshot + structure), **bad names** matter less — score may **undersell** “actually buildable if you ignore names.”
- **Remedies:**
  - **Figma:** Rename critical interactive nodes; keep DS naming convention.
  - **Prompt:** “Map Figma node IDs to React components; names are not authoritative.”
  - **Tooling:** Separate **“semantic naming score”** from **implementation difficulty** in user-facing copy.

### E. `figma-ui3-kit` (token category very weak, few nodes)

Review comment (Contributor):

🧹 Nitpick | 🔵 Trivial

Prefer a more precise descriptor than “very weak.”

Consider replacing “very weak” with a concrete phrase (for example, “limited token coverage”) to keep the analysis language crisp and measurable.

🧰 Tools
🪛 LanguageTool

[style] ~58-~58: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ... ### E. figma-ui3-kit (token category very weak, few nodes) - Overrated: **Overall...

(EN_WEAK_ADJECTIVE)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/fixtures/ANALYSIS-VALIDITY-FEEDBACK.md` at line 58, Update the header
text in the line "### E. `figma-ui3-kit` (token category very weak, few nodes)"
to use a more precise, measurable descriptor (e.g., "limited token coverage" or
"low token coverage, few nodes") instead of the vague phrase "very weak" so the
analysis is clearer and actionable.


- **Overrated:** **Overall B** might **overstate** readiness: **token ~21%** means heavy **raw-color/raw-font** — first **visual-compare** often fails on **font and fill** unless explicitly mocked.
- **Underrated:** Small **node count** — fast iteration; score may **feel harsh** vs effort to **manually** match a tiny scope.
- **Remedies:**
  - **Workflow:** Predeclare **font stack** and **color substitution** in the prompt to absorb raw-font/raw-color.
  - **Tooling:** Optional “**compare mode: ignore typography**” (dangerous but documented) for layout-only checks.

---

## Per-fixture notes (snapshot-oriented)

Names refer to directories under `fixtures/` or `fixtures/done/`. Grades refer to the last batch analyze (v0.8.9); re-run after config changes.

### `done/simple-ds-page-sections` — A (86%)

- **Overrated:** **Token/component** weakness (raw color, missing descriptions, invisible layers) is **under-reflected** in a single letter “A” for **naive** AI codegen without DS context.
- **Underrated:** **Small, shallow tree** — easier to reason about than bulk Simple DS fixtures; score is not “too kind” to structure.
- **Remedies:** Add **component description** stub in prompt; clean invisible layers; run **visual-compare** once to validate the A.

### `material3-56615-45927` — C+ (72%), large unscoped

- **Overrated:** Unlikely at file level — **C+** already warns; per-frame might be better.
- **Underrated:** Possible if user only implements **one** repeated component — issue count is **file-wide**.
- **Remedies:** **Mandatory scoping**; treat as multiple fixtures.

### `simple-ds-175-9106` / `175-8591` / `175-7790` / `562-9518` — B ~ B+

- **Overrated:** **raw-color + invisible-layer** volume — **B/B+** may **overrate** first-shot pixel match.
- **Underrated:** **Detach/instance** and **default-name** sometimes fixable with a strong “use Figma structure as source of truth” prompt.
- **Remedies:** Scope; variable cleanup; explicit **instance** handling in prompt.

### `simple-ds-4333-9262` — A (89%)

- **Overrated:** Still **missing-component-description** and spacing — “A” is **fragile** for **fully autonomous** AI without extra text context.
- **Underrated:** Among the **cleaner** Simple DS slices — good **golden** candidate; score may be **fair or slightly harsh** if issues are mostly doc/metadata.
- **Remedies:** Use as **benchmark**; add descriptions only if product needs them for handoff.

### `material3-52949-27916` — B (79%), 400+ issues

- **Overrated:** Whole-file **B** might **overrate** “one-shot implement whole thing.”
- **Underrated:** For **systematic** spacing tokens, a **macro prompt** might achieve **ok** similarity on repetitive regions — composite grade **undersells** repeatability.
- **Remedies:** Split fixture; calibrate spacing rules on **scoped** chunks.

### `done/simple-ds-panel-sections` — A (87%)

- **Overrated:** Same invisible/raw-color cluster as other Simple DS **done** sets — **A** vs **pixel** needs verification.
- **Underrated:** Low depth, moderate issues — reasonable **hand-implement** cost.

### `material3-51954-18254` — B+ (82%), spacing-dominated

- **Overrated:** Low probability — issue profile is **honest** about layout math risk.
- **Underrated:** If the AI is given **explicit spacing table** extracted once from Figma, difficulty drops **faster** than score implies.
- **Remedies:** Export **spacing tokens** or measurement table alongside `data.json`.

### `done/material3-kit-2` — B+ (81%)

- **Overrated:** **deep-nesting + sibling direction** — **B+** might **overrate** “drop in one prompt” success for junior models.
- **Underrated:** **Narrow node count** (96) — less overwhelming than `82356`.
- **Remedies:** Refactor nesting in Figma for handoff; or implement **section-by-section**.

### `material3-56615-82356` — C (66%)

- **Overrated:** Rare — **C** is already blunt.
- **Underrated:** If scoped to **one** leaf screen, local grade might be **much higher** — file-level **undersells** localized quality.
- **Remedies:** **Never** use as single-scope “implement all”; split into **10+** scoped fixtures.

### `done/material3-kit-1` — A (85%)

- **Overrated:** **Still mixed token/naming/invisible** — **A** is strong; verify with **visual-compare** for **stock** AI.
- **Underrated:** Mature kit — **repeatable patterns** help experienced prompts.
- **Remedies:** Keep as **positive** control next to `kit-2` / `82356`.

### `done/simple-ds-card-grid` — B+ (80%)

- **Overrated:** **Component/naming** weak — **B+** may **overrate** grid+card **pixel** parity without layout hints.
- **Underrated:** **Calibration history** — score may be **well tuned** for this file; trust **relative** more than absolute.
- **Remedies:** Explicit **grid columns/gap** in prompt.

### `done/figma-ui3-kit` — B (76%), token floor

- **Overrated:** **Overall B** with **token 21%** — **readiness for unguided AI** is **overrated** unless fonts/colors are specified in prompt.
- **Underrated:** **20 nodes** — absolute work is small; grade **undersells** “quick human polish.”
- **Remedies:** Font and color contract in README for this fixture; optional webfont load in generated HTML.

Review comment (Contributor):

⚠️ Potential issue | 🟡 Minor

Use “web font” for clearer orthography and consistency.

Replace “webfont” with “web font” in this sentence for cleaner documentation style.

🧰 Tools
🪛 LanguageTool

[grammar] ~141-~141: Ensure spelling is correct
Context: ...polish.” - Remedies: Font and color contract in README for this fixture; optional we...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/fixtures/ANALYSIS-VALIDITY-FEEDBACK.md` at line 141, In the sentence
"Font and color contract in README for this fixture; optional webfont load in
generated HTML." replace the token "webfont" with "web font" to match the
project's orthography convention; update that phrase in
ANALYSIS-VALIDITY-FEEDBACK.md so it reads "...optional web font load in
generated HTML." ensuring consistency with other docs.


---

## Raising “validity” of the metric (product direction)

1. **Always pair** headline grade with **visual-compare** (or explicit “not run”) on the **same scoped node**.
2. **Publish** whether the run was **scoped** and **node count** in JSON (provenance) — avoids comparing whole-file vs framed scores.
3. **Calibrate** rule weights using **pixel delta categories**, not issue count alone — reduces **over/under** gap for AI use cases.
4. **Two summaries** in reports: **“handoff / DS hygiene”** vs **“likely first visual-compare (AI)”** — can diverge.
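
One possible shape for the provenance data in points 1 and 2 is sketched below. Every field name here is an assumption for illustration; the PR does not define a report schema.

```typescript
// Hypothetical provenance record attached to a reported grade.
interface ScoreProvenance {
  scoped: boolean; // true if analysis was limited to a single root node
  nodeCount: number; // number of nodes analyzed
  // Similarity in percent (0-100), or an explicit marker when not run.
  visualCompare: number | "not run";
}

// Two grades are only comparable when both are scoped or both are whole-file.
function isComparable(a: ScoreProvenance, b: ScoreProvenance): boolean {
  return a.scoped === b.scoped;
}
```

Keeping `"not run"` as an explicit value, rather than omitting the field, follows point 1: a headline grade should always state whether a visual-compare result backs it.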

---

## Revision

Re-evaluate after major `rule-config` or rule-set changes; per-fixture grades will move.