Improve code-review eval: add regression detection scenarios #34884

PureWeen wants to merge 4 commits into
Conversation
- Add 3 new scenarios testing Section 21/22 regression prevention rules:
  - Scenario 7: CollectionView high-regression-risk component detection
  - Scenario 8: Side-effect risk when modifying shared code paths
  - Scenario 9: Safe doc-only change should NOT get false regression warnings
- Add `expect_activation: false` to both negative trigger scenarios
- Add version changelog comment header
- Based on real regressed-in-10.0.60 issues (#34666, #34636, #34635)
- 9 scenarios total (was 6), 0 output_contains, 17 output_not_contains

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🚀 Dogfood this PR with:

    curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34884

Or:

    iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34884"
Pull request overview
This PR expands the code-review skill’s evaluation coverage by adding new scenarios specifically aimed at detecting regression-risk reasoning, based on real “regressed-in-10.0.60” issue patterns. It also strengthens two existing negative-trigger scenarios by explicitly asserting the skill should not activate.
Changes:
- Added a v2 changelog header to the eval file.
- Added 3 new regression-focused eval scenarios (raising total scenarios from 6 to 9).
- Added `expect_activation: false` to two negative-trigger scenarios to ensure the workflow doesn't activate on informational prompts.
        value: "no concerns"
      - type: "output_not_contains"
        value: "looks safe"
The output_not_contains: "no concerns" / "looks safe" assertions are both case-sensitive and very generic. This can make the scenario flaky: the model might respond with capitalized variants ("No concerns"/"Looks safe") which would bypass the guard, or use these substrings in a longer sentence while still raising concerns. Consider tightening these to more specific phrases you want to forbid and/or include the common capitalization variants so the assertion matches the intended anti-pattern reliably.
Suggested change:

    -     value: "no concerns"
    -   - type: "output_not_contains"
    -     value: "looks safe"
    +     value: "no concerns about side effects"
    +   - type: "output_not_contains"
    +     value: "No concerns about side effects"
    +   - type: "output_not_contains"
    +     value: "looks safe to me"
    +   - type: "output_not_contains"
    +     value: "Looks safe to me"
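The flakiness described above can be sketched in a few lines of Python (the sample response text is hypothetical, and this assumes the eval harness does a plain case-sensitive substring check):

```python
# A model response that raises no concerns but capitalizes the banned phrase.
response = "No concerns about side effects here; this looks safe to me."

# Case-sensitive substring guard, as the eval currently uses:
# "No concerns" does not match "no concerns", so the guard is bypassed.
exact_hit = "no concerns" in response

# Case-folded comparison (or listing both capitalization variants,
# as the suggestion does) catches the anti-pattern reliably.
folded_hit = "no concerns" in response.lower()

print(exact_hit)   # False — guard bypassed
print(folded_hit)  # True — anti-pattern detected
```

This is why the suggestion lists both capitalization variants explicitly: if the harness only supports literal substring matching, enumerating variants is the available workaround.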
        value: "regression risk"
      - type: "output_not_contains"
The assertion output_not_contains: "regression risk" conflicts with the rubric expectation that the agent should identify this as doc-only and explicitly state there's no regression risk. A correct response like "no regression risk" would still contain the banned substring and fail the scenario. Consider removing this assertion or narrowing it to warning-style phrasing (e.g., "high regression risk" / "regression-prone") so correct low-risk language remains allowed.
Suggested change:

    -     value: "regression risk"
    -   - type: "output_not_contains"
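A small Python sketch of the conflict (the response strings and the narrowed phrases are hypothetical illustrations, not the actual eval wording):

```python
import re

# Broad guard: bans the bare substring, so a correct doc-only verdict
# that says "no regression risk" fails the scenario along with real warnings.
broad = "regression risk"

# Narrowed guard: only warning-style phrasing, as the comment suggests.
narrow = re.compile(r"high regression risk|regression-prone")

correct_verdict = "Doc-only change; no regression risk."
warning_verdict = "CollectionView is a high regression risk component."

# Broad guard blocks both the correct answer and the warning.
print(broad in correct_verdict, broad in warning_verdict)

# Narrowed guard lets the correct answer pass and still catches the warning.
print(bool(narrow.search(correct_verdict)), bool(narrow.search(warning_verdict)))
```

Running this prints `True True` for the broad guard (both responses blocked) and `False True` for the narrowed one (only the warning-style phrasing blocked), which is the behavior the review comment is asking for.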
…gression priority

eval.yaml:
- Scenario 1: Add output_contains 'review'
- Scenario 7: Add output_contains 'regression'
- Scenario 8: Add output_contains 'side effect'
- Scenario 9: Add output_contains 'documentation'; narrow 'adjacent scenario' to 'test adjacent scenario'

SKILL.md:
- Add regression risk check instruction in Step 3 referencing Section 21/22

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Create .github/docs/skill-eval-guide.md documenting both eval tools:
  - Anthropic skill-creator (trigger/routing evaluation)
  - dotnet/skills skill-validator (behavioral evaluation)
- Add mandatory pre-evaluation checklist to prevent missing tools
- Document multi-agent evaluation protocol (3 workers, 4-round max)
- Add eval quality standards (assertion/rubric conflict check, coverage requirements)
- Update README-AI.md to reference the new guide

Root cause: the Anthropic evaluator worker missed the skill-creator tool for 5+ rounds because it never enumerated the anthropics/skills/skills/ contents. This guide ensures future evaluations always check both toolchains first.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Summary
Improves the code-review skill's `eval.yaml` with regression detection scenarios based on real `regressed-in-10.0.60` issues.

Context

- `review-rules.md` (Section 21: 13 rules from 30 reverted PRs; Section 22: 8 most-regressed component rankings)

Changes
3 new regression-focused scenarios (6 → 9 total):
Also improved existing scenarios:
- Added `expect_activation: false` to both negative trigger scenarios (2 and 6)

Eval Stats

- `output_contains` (no vocabulary overfitting)
- `output_not_contains` (anti-pattern guards)
- `output_matches` (verdict pattern matching)
- `expect_activation: false` (native negative triggers)