Improve code-review eval: add regression detection scenarios #34884

PureWeen wants to merge 4 commits into
Conversation
- Add 3 new scenarios testing Section 21/22 regression prevention rules:
  - Scenario 7: CollectionView high-regression-risk component detection
  - Scenario 8: Side-effect risk when modifying shared code paths
  - Scenario 9: Safe doc-only change should NOT get false regression warnings
- Add `expect_activation: false` to both negative trigger scenarios
- Add version changelog comment header
- Based on real regressed-in-10.0.60 issues (#34666, #34636, #34635)
- 9 scenarios total (was 6), 0 output_contains, 17 output_not_contains

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🚀 Dogfood this PR with:

    curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34884

Or:

    iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34884"
Pull request overview
This PR expands the code-review skill’s evaluation coverage by adding new scenarios specifically aimed at detecting regression-risk reasoning, based on real “regressed-in-10.0.60” issue patterns. It also strengthens two existing negative-trigger scenarios by explicitly asserting the skill should not activate.
Changes:
- Added a v2 changelog header to the eval file.
- Added 3 new regression-focused eval scenarios (raising total scenarios from 6 to 9).
- Added `expect_activation: false` to two negative-trigger scenarios to ensure the workflow doesn't activate on informational prompts.
        value: "no concerns"
      - type: "output_not_contains"
        value: "looks safe"
The output_not_contains: "no concerns" / "looks safe" assertions are both case-sensitive and very generic. This can make the scenario flaky: the model might respond with capitalized variants ("No concerns"/"Looks safe") which would bypass the guard, or use these substrings in a longer sentence while still raising concerns. Consider tightening these to more specific phrases you want to forbid and/or include the common capitalization variants so the assertion matches the intended anti-pattern reliably.
Suggested change:

    -     value: "no concerns"
    -   - type: "output_not_contains"
    -     value: "looks safe"
    +     value: "no concerns about side effects"
    +   - type: "output_not_contains"
    +     value: "No concerns about side effects"
    +   - type: "output_not_contains"
    +     value: "looks safe to me"
    +   - type: "output_not_contains"
    +     value: "Looks safe to me"
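The flakiness described above can be sketched in a few lines of Python (the sample response text is hypothetical, and this assumes the eval harness does a plain case-sensitive substring check):

```python
# A model response that raises no concerns but capitalizes the banned phrase.
response = "No concerns about side effects here; this looks safe to me."

# Case-sensitive substring guard, as the eval currently uses:
# "No concerns" does not match "no concerns", so the guard is bypassed.
exact_hit = "no concerns" in response

# Case-folded comparison (or listing both capitalization variants,
# as the suggestion does) catches the anti-pattern reliably.
folded_hit = "no concerns" in response.lower()

print(exact_hit)   # False — guard bypassed
print(folded_hit)  # True — anti-pattern detected
```

This is why the suggestion lists both capitalization variants explicitly: if the harness only supports literal substring matching, enumerating variants is the available workaround.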
        value: "regression risk"
      - type: "output_not_contains"
The assertion output_not_contains: "regression risk" conflicts with the rubric expectation that the agent should identify this as doc-only and explicitly state there's no regression risk. A correct response like "no regression risk" would still contain the banned substring and fail the scenario. Consider removing this assertion or narrowing it to warning-style phrasing (e.g., "high regression risk" / "regression-prone") so correct low-risk language remains allowed.
Suggested change:

    -     value: "regression risk"
    -   - type: "output_not_contains"
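A small Python sketch of the conflict (the response strings and the narrowed phrases are hypothetical illustrations, not the actual eval wording):

```python
import re

# Broad guard: bans the bare substring, so a correct doc-only verdict
# that says "no regression risk" fails the scenario along with real warnings.
broad = "regression risk"

# Narrowed guard: only warning-style phrasing, as the comment suggests.
narrow = re.compile(r"high regression risk|regression-prone")

correct_verdict = "Doc-only change; no regression risk."
warning_verdict = "CollectionView is a high regression risk component."

# Broad guard blocks both the correct answer and the warning.
print(broad in correct_verdict, broad in warning_verdict)

# Narrowed guard lets the correct answer pass and still catches the warning.
print(bool(narrow.search(correct_verdict)), bool(narrow.search(warning_verdict)))
```

Running this prints `True True` for the broad guard (both responses blocked) and `False True` for the narrowed one (only the warning-style phrasing blocked), which is the behavior the review comment is asking for.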
…gression priority

eval.yaml:
- Scenario 1: Add output_contains 'review'
- Scenario 7: Add output_contains 'regression'
- Scenario 8: Add output_contains 'side effect'
- Scenario 9: Add output_contains 'documentation'; narrow 'adjacent scenario' to 'test adjacent scenario'

SKILL.md:
- Add regression risk check instruction in Step 3 referencing Section 21/22

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Create .github/docs/skill-eval-guide.md documenting both eval tools:
  - Anthropic skill-creator (trigger/routing evaluation)
  - dotnet/skills skill-validator (behavioral evaluation)
- Add mandatory pre-evaluation checklist to prevent missing tools
- Document multi-agent evaluation protocol (3 workers, 4-round max)
- Add eval quality standards (assertion/rubric conflict check, coverage requirements)
- Update README-AI.md to reference the new guide

Root cause: the Anthropic evaluator worker missed the skill-creator tool for 5+ rounds because it never enumerated the anthropics/skills/skills/ contents. This guide ensures future evaluations always check both toolchains first.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Summary
Improves the code-review skill's `eval.yaml` with regression detection scenarios based on real `regressed-in-10.0.60` issues.

Context

- `review-rules.md` (Section 21: 13 rules from 30 reverted PRs; Section 22: 8 most-regressed component rankings)

Changes
3 new regression-focused scenarios (6 → 9 total):
Also improved existing scenarios:
- Added `expect_activation: false` to both negative trigger scenarios (2 and 6)

Eval Stats

- `output_contains` (no vocabulary overfitting)
- `output_not_contains` (anti-pattern guards)
- `output_matches` (verdict pattern matching)
- `expect_activation: false` (native negative triggers)