test(ui-judge): score more playground examples #2689

Merged: PupilTong merged 1 commit into main from hw/codex/ui-judge-playground-preview on May 22, 2026

@PupilTong
Collaborator

@PupilTong PupilTong commented May 22, 2026

Summary

  • add more A2UI playground demos to the ui-judge scoring coverage
  • write all scored demo results into the UI Judge result JSON as separate entries

Test Plan

  • ./node_modules/.bin/biome check packages/genui/ui-judge/tests/judge-page.spec.ts
  • git diff --check
  • CI=1 pnpm --filter @lynx-js/ui-judge exec tsc -p tsconfig.json
  • env -u MIDSCENE_MODEL_NAME -u MIDSCENE_MODEL_API_KEY -u MIDSCENE_OPENAI_INIT_CONFIG_JSON CI=1 pnpm --filter @lynx-js/ui-judge test

Summary by CodeRabbit

  • Tests

    • Expanded test coverage to validate additional playground demo cases with individual scoring.
    • Test results now aggregated across all demo cases.
  • Refactor

    • Updated output mechanism for aggregated UI judge test results across multiple demos.

@changeset-bot

changeset-bot Bot commented May 22, 2026

⚠️ No Changeset found

Latest commit: d2fbed3

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Contributor

coderabbitai Bot commented May 22, 2026

Walkthrough

The UI Judge test suite expands from evaluating a single playground demo to comprehensively scoring all PLAYGROUND_DEMO_CASES. Each demo is annotated with a task description, evaluated via judgePage, and results are aggregated into a single persistent output file.

Changes

UI Judge multi-demo test expansion

| Layer | File(s) | Summary |
| --- | --- | --- |
| Test data expansion with task field | `packages/genui/ui-judge/tests/judge-page.spec.ts` | `PLAYGROUND_DEMO_CASES` grows to include more playground demos (cast-grid, citywalk-list, fridge-search, workout-plan, and others), with each entry now carrying a `task` field that describes the evaluation goal. |
| Multi-demo evaluation and aggregation test | `packages/genui/ui-judge/tests/judge-page.spec.ts` | The single-demo scoring test is replaced with logic that iterates all `PLAYGROUND_DEMO_CASES`, runs each demo in a test step, navigates to its preview, waits for ready/expected signals, calls `judgePage` with the per-demo task, accumulates results, and asserts expectations for every demo. |
| Result aggregation and persistence | `packages/genui/ui-judge/tests/judge-page.spec.ts` | A new `JudgedPlaygroundResult` interface and `writeUiJudgeResults` helper persist multiple scored demo results (each with its task attached) into a single JSON file, replacing the prior single-result writer. |
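
The flow summarized in the rows above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the PR: the `SketchJudgeResult` fields, the `JudgeFn` callback standing in for `judgePage`, and the `toResultsJson` helper are hypothetical names chosen for illustration, and the real test also navigates to each preview and waits for ready signals, which is omitted here.

```typescript
// Hypothetical result shape; the real UiJudgeResult type is not shown in this PR page.
interface SketchJudgeResult {
  dimension: string;
  score: number; // assumed to be on a 1 to 5 scale
}

// Hypothetical demo case: an identifier plus the task text handed to the judge.
interface SketchDemoCase {
  demoId: string;
  task: string;
}

interface SketchJudgedResult {
  result: SketchJudgeResult;
  task: string;
}

// Stand-in for judgePage: receives the per-demo task and resolves to a result.
type JudgeFn = (task: string) => Promise<SketchJudgeResult>;

// Judge every demo in order and keep each result paired with its task.
async function judgeAllDemos(
  demos: readonly SketchDemoCase[],
  judge: JudgeFn,
): Promise<SketchJudgedResult[]> {
  const judged: SketchJudgedResult[] = [];
  for (const demo of demos) {
    const result = await judge(demo.task);
    judged.push({ result, task: demo.task });
  }
  return judged;
}

// Flatten the accumulated results into one JSON document with separate entries.
function toResultsJson(judged: readonly SketchJudgedResult[]): string {
  return JSON.stringify(
    { results: judged.map(({ result, task }) => ({ ...result, task })) },
    null,
    2,
  );
}
```

Running the demos sequentially keeps each judged result attributable to exactly one demo, at the cost of a total runtime that grows with the number of demos.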

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • lynx-family/lynx-stack#2629: Directly extends the initial judgePage test suite by expanding demo coverage from a small subset to comprehensive multi-demo evaluation with aggregated result persistence.

Suggested reviewers

  • Sherry-hue
  • HuJean

Poem

🐰 A rabbit's ode to expanded tests:

More playgrounds spring to life and test,
Each demo task now blessed.
Results converge in files so neat—
One judge to score them all complete! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: expanding ui-judge test coverage to score more playground examples. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/genui/ui-judge/tests/judge-page.spec.ts (1)

150-153: ⚡ Quick win

Persist demoId alongside each result for a stable identifier.

The JSON entry currently depends on mutable task text and environment-specific URL. Including demoId makes downstream diffing and trend tracking deterministic.

💡 Proposed refactor
 interface JudgedPlaygroundResult {
+  demoId: string;
   result: UiJudgeResult;
   task: string;
 }
@@
         judgedResults.push({
+          demoId: demo.demoId,
           result,
           task: demo.task,
         });
@@
-          results: judgedResults.map(({ result, task }) => ({
+          results: judgedResults.map(({ demoId, result, task }) => ({
+            demoId,
             ...result,
             task,
           })),

Also applies to: 205-208, 224-227
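
To illustrate why a stable identifier helps downstream tooling, here is a minimal sketch of comparing two runs by `demoId`. The `PersistedEntry` shape and the `scoreDeltasByDemoId` helper are hypothetical and not part of the PR; the sketch assumes each persisted entry carries a `demoId`, a `task`, and a numeric `score`.

```typescript
// Hypothetical persisted entry; only demoId, task, and score matter for this sketch.
interface PersistedEntry {
  demoId: string;
  task: string;
  score: number;
}

// Compare two runs by demoId and report the score change for demos present in both.
function scoreDeltasByDemoId(
  previous: readonly PersistedEntry[],
  current: readonly PersistedEntry[],
): Map<string, number> {
  const previousScores = new Map<string, number>();
  for (const entry of previous) {
    previousScores.set(entry.demoId, entry.score);
  }

  const deltas = new Map<string, number>();
  for (const entry of current) {
    const before = previousScores.get(entry.demoId);
    if (before !== undefined) {
      deltas.set(entry.demoId, entry.score - before);
    }
  }
  return deltas;
}
```

Because matching is keyed on `demoId`, rewording a demo's task text or running against a different preview URL does not break the comparison between runs.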

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/genui/ui-judge/tests/judge-page.spec.ts` around lines 150 - 153,
Include a stable demo identifier when appending judged results: when pushing
into judgedResults (the push that currently uses { result, task: demo.task }),
add demoId: demo.demoId so each JSON entry contains a deterministic id; make the
same change for the other two places that push results (the similar pushes
around the other occurrences) to ensure all saved entries include demoId
alongside result and task.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/genui/ui-judge/tests/judge-page.spec.ts`:
- Around line 123-124: The global test timeout set by test.setTimeout(1_200_000)
is too small given judgePage is awaited for up to timeoutMs: 180_000 across 8
demos and additional waitForPreviewText delays; increase the timeout to a value
that covers 8 * 180_000 plus the per-demo waitForPreviewText overhead (suggest
using test.setTimeout(1_800_000) or 2_000_000) so the loop that calls
judgePage(...) and waitForPreviewText(...) has enough time to complete.

---

Nitpick comments:
In `@packages/genui/ui-judge/tests/judge-page.spec.ts`:
- Around line 150-153: Include a stable demo identifier when appending judged
results: when pushing into judgedResults (the push that currently uses { result,
task: demo.task }), add demoId: demo.demoId so each JSON entry contains a
deterministic id; make the same change for the other two places that push
results (the similar pushes around the other occurrences) to ensure all saved
entries include demoId alongside result and task.
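
The timeout concern in the inline comment comes down to simple arithmetic, sketched below. The demo count, the per-demo `timeoutMs`, and the current `test.setTimeout` value are taken from the comment; `ASSUMED_WAIT_OVERHEAD_MS` is a hypothetical per-demo allowance for navigation and `waitForPreviewText`, not a value from the PR.

```typescript
// Values quoted in the review comment.
const DEMO_COUNT = 8;
const JUDGE_TIMEOUT_MS = 180_000; // per-demo judgePage timeoutMs
const CURRENT_TEST_TIMEOUT_MS = 1_200_000; // test.setTimeout(1_200_000)

// Hypothetical per-demo allowance for navigation and waitForPreviewText.
const ASSUMED_WAIT_OVERHEAD_MS = 30_000;

// Worst case: every demo uses its full judge timeout plus the wait overhead.
function worstCaseBudgetMs(
  demoCount: number,
  judgeTimeoutMs: number,
  overheadMs: number,
): number {
  return demoCount * (judgeTimeoutMs + overheadMs);
}

// 8 * 180_000 = 1_440_000 ms, already above the current 1_200_000 ms limit.
const judgeOnlyMs = worstCaseBudgetMs(DEMO_COUNT, JUDGE_TIMEOUT_MS, 0);

// 8 * (180_000 + 30_000) = 1_680_000 ms under the assumed overhead.
const withOverheadMs = worstCaseBudgetMs(
  DEMO_COUNT,
  JUDGE_TIMEOUT_MS,
  ASSUMED_WAIT_OVERHEAD_MS,
);

const currentLimitIsEnough = judgeOnlyMs <= CURRENT_TEST_TIMEOUT_MS; // false
```

Under these assumptions the suggested 1_800_000 ms would cover the worst case, while the current limit only holds if most demos finish well before their individual judge timeout.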
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b006a33e-70ed-49d3-8d83-6f60bfbe4492

📥 Commits

Reviewing files that changed from the base of the PR and between 2d64575 and d2fbed3.

📒 Files selected for processing (1)
  • packages/genui/ui-judge/tests/judge-page.spec.ts

Comment thread packages/genui/ui-judge/tests/judge-page.spec.ts
@codecov

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


@github-actions
Contributor

UI Judge

Average score: 3.6 / 5 across 8 results.

| # | Dimension | Score | Page | Status |
| --- | --- | --- | --- | --- |
| 1 | visual-correctness | 3 / 5 | preview | OK |
| 2 | visual-correctness | 5 / 5 | preview | OK |
| 3 | visual-correctness | 3 / 5 | preview | OK |
| 4 | visual-correctness | 3 / 5 | preview | OK |
| 5 | visual-correctness | 3 / 5 | preview | OK |
| 6 | visual-correctness | 5 / 5 | preview | OK |
| 7 | visual-correctness | 5 / 5 | preview | OK |
| 8 | visual-correctness | 2 / 5 | preview | OK |

Details

Result 1

  • Task: The A2UI playground preview should show date-night dining recommendations for Moonlight Terrace, Pinewood Bistro, and Sea Breeze Kitchen.

Result 2

  • Task: The A2UI playground preview should show a cast grid for the short film Night Notes, including Lin Xia and Zhou Ning cast cards.

Result 3

  • Task: The A2UI playground preview should show weekend citywalk coffee picks with Rooftop Brew Room, Corner Canvas Lab, and Late Sun Roastery.

Result 4

  • Task: The A2UI playground preview should show refrigerator search results with Siemens, Hualing, Haier, and Midea product cards.

Result 5

  • Task: The A2UI playground preview should show a Kyoto 48-hour trip planner with Day 1 and Day 2 itinerary sections, including Monkey Park Viewpoint.

Result 6

  • Task: The A2UI playground preview should show the current weather for Austin, TX, including clear skies with light breeze.

Result 7

  • Task: The A2UI playground preview should show a Wireless Headphones Pro product card with a visible Add to Cart action.

Result 8

  • Task: The A2UI playground preview should show a weekly workout plan with five days from Monday Ramp-Up through Friday Conditioning.
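
The reported average can be reproduced from the eight per-result scores in the UI Judge table. The sketch below is only a check of that arithmetic; `averageScore` is a hypothetical helper, and rounding to one decimal place is an assumption about how the report formats the figure.

```typescript
// Scores from the UI Judge table, in result order 1 through 8.
const reportedScores: number[] = [3, 5, 3, 3, 3, 5, 5, 2];

// Arithmetic mean rounded to one decimal place (the rounding rule is an assumption).
function averageScore(scores: readonly number[]): number {
  const total = scores.reduce((sum, score) => sum + score, 0);
  return Math.round((total / scores.length) * 10) / 10;
}

// 29 / 8 = 3.625, which rounds to 3.6.
const average = averageScore(reportedScores);
```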

Workflow run

@codspeed-hq

codspeed-hq Bot commented May 22, 2026

Merging this PR will degrade performance by 7.16%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 1 regressed benchmark
✅ 80 untouched benchmarks
⏩ 26 skipped benchmarks (see footnote 1)

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- |
| transform 1000 view elements | 40 ms | 43.1 ms | -7.16% |
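
As a rough check on the efficiency figure, the sketch below assumes the change is computed as base time divided by head time, minus one. That formula is an assumption inferred from the numbers shown, not something stated in the report, and `efficiencyPercent` is a hypothetical helper. With the rounded timings from the table it gives a value near -7.19%, close to the reported -7.16%; the small gap would be consistent with the report using unrounded timings.

```typescript
// Assumed formula: relative efficiency of HEAD versus BASE, as a percentage.
function efficiencyPercent(baseMs: number, headMs: number): number {
  return (baseMs / headMs - 1) * 100;
}

// Rounded timings from the table: BASE 40 ms, HEAD 43.1 ms.
const efficiencyChange = efficiencyPercent(40, 43.1);
```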

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing hw/codex/ui-judge-playground-preview (d2fbed3) with main (2d64575)

Footnotes

  1. 26 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@relativeci

relativeci Bot commented May 22, 2026

React Example with Element Template

#850 Bundle Size — 202.16KiB (0%).

d2fbed3 (current) vs 2d64575 main #848 (baseline)

Bundle metrics: no changes

| Status | Metric | Current (#850) | Baseline (#848) |
| --- | --- | --- | --- |
| No change | Initial JS | 0B | 0B |
| No change | Initial CSS | 0B | 0B |
| No change | Cache Invalidation | 0% | 0% |
| No change | Chunks | 0 | 0 |
| No change | Assets | 4 | 4 |
| No change | Modules | 100 | 100 |
| No change | Duplicate Modules | 30 | 30 |
| No change | Duplicate Code | 39.22% | 39.22% |
| No change | Packages | 2 | 2 |
| No change | Duplicate Packages | 0 | 0 |

Bundle size by type: no changes

| Status | Type | Current (#850) | Baseline (#848) |
| --- | --- | --- | --- |
| No change | IMG | 145.76KiB | 145.76KiB |
| No change | Other | 56.41KiB | 56.41KiB |


@relativeci

relativeci Bot commented May 22, 2026

Web Explorer

#10157 Bundle Size — 903.53KiB (0%).

d2fbed3 (current) vs 2d64575 main #10155 (baseline)

Bundle metrics: 2 changes

| Status | Metric | Current (#10157) | Baseline (#10155) |
| --- | --- | --- | --- |
| No change | Initial JS | 45.06KiB | 45.06KiB |
| No change | Initial CSS | 2.22KiB | 2.22KiB |
| No change | Cache Invalidation | 0% | 0% |
| No change | Chunks | 9 | 9 |
| No change | Assets | 11 | 11 |
| Change | Modules | 230 (-0.86%) | 232 |
| No change | Duplicate Modules | 11 | 11 |
| Change | Duplicate Code | 27.13% (+0.04%) | 27.12% |
| No change | Packages | 10 | 10 |
| No change | Duplicate Packages | 0 | 0 |

Bundle size by type: no changes

| Status | Type | Current (#10157) | Baseline (#10155) |
| --- | --- | --- | --- |
| No change | JS | 499.15KiB | 499.15KiB |
| No change | Other | 402.16KiB | 402.16KiB |
| No change | CSS | 2.22KiB | 2.22KiB |


@relativeci

relativeci Bot commented May 22, 2026

React MTF Example

#1715 Bundle Size — 208.75KiB (0%).

d2fbed3 (current) vs 2d64575 main #1713 (baseline)

Bundle metrics: no changes

| Status | Metric | Current (#1715) | Baseline (#1713) |
| --- | --- | --- | --- |
| No change | Initial JS | 0B | 0B |
| No change | Initial CSS | 0B | 0B |
| No change | Cache Invalidation | 0% | 0% |
| No change | Chunks | 0 | 0 |
| No change | Assets | 3 | 3 |
| No change | Modules | 195 | 195 |
| No change | Duplicate Modules | 77 | 77 |
| No change | Duplicate Code | 44.17% | 44.17% |
| No change | Packages | 2 | 2 |
| No change | Duplicate Packages | 0 | 0 |

Bundle size by type: no changes

| Status | Type | Current (#1715) | Baseline (#1713) |
| --- | --- | --- | --- |
| No change | IMG | 111.23KiB | 111.23KiB |
| No change | Other | 97.52KiB | 97.52KiB |


@relativeci

relativeci Bot commented May 22, 2026

React External

#1698 Bundle Size — 698.01KiB (0%).

d2fbed3 (current) vs 2d64575 main #1696 (baseline)

Bundle metrics: no changes

| Status | Metric | Current (#1698) | Baseline (#1696) |
| --- | --- | --- | --- |
| No change | Initial JS | 0B | 0B |
| No change | Initial CSS | 0B | 0B |
| No change | Cache Invalidation | 0% | 0% |
| No change | Chunks | 0 | 0 |
| No change | Assets | 3 | 3 |
| No change | Modules | 17 | 17 |
| No change | Duplicate Modules | 5 | 5 |
| No change | Duplicate Code | 8.59% | 8.59% |
| No change | Packages | 0 | 0 |
| No change | Duplicate Packages | 0 | 0 |

Bundle size by type: no changes

| Status | Type | Current (#1698) | Baseline (#1696) |
| --- | --- | --- | --- |
| No change | Other | 698.01KiB | 698.01KiB |


@relativeci

relativeci Bot commented May 22, 2026

React Example

#8582 Bundle Size — 237.81KiB (0%).

d2fbed3 (current) vs 2d64575 main #8580 (baseline)

Bundle metrics: no changes

| Status | Metric | Current (#8582) | Baseline (#8580) |
| --- | --- | --- | --- |
| No change | Initial JS | 0B | 0B |
| No change | Initial CSS | 0B | 0B |
| No change | Cache Invalidation | 0% | 0% |
| No change | Chunks | 0 | 0 |
| No change | Assets | 4 | 4 |
| No change | Modules | 200 | 200 |
| No change | Duplicate Modules | 80 | 80 |
| No change | Duplicate Code | 44.68% | 44.68% |
| No change | Packages | 2 | 2 |
| No change | Duplicate Packages | 0 | 0 |

Bundle size by type: no changes

| Status | Type | Current (#8582) | Baseline (#8580) |
| --- | --- | --- | --- |
| No change | IMG | 145.76KiB | 145.76KiB |
| No change | Other | 92.05KiB | 92.05KiB |


@PupilTong PupilTong merged commit 60bdcd4 into main May 22, 2026
87 of 90 checks passed
@PupilTong PupilTong deleted the hw/codex/ui-judge-playground-preview branch May 22, 2026 07:10

2 participants