Improve code-review skill to prevent regression approvals #34985
kubaflo wants to merge 2 commits
Add prior review reconciliation, CI hard gate, blast radius assessment, failure-mode probing, and confidence calibration to prevent approving PRs that cause startup crashes. PR #34669 (badge support) was approved with 'LGTM, Confidence: high' despite CI failures, unreconciled critical findings, and unassessed blast radius — causing 100% UITest failure from app startup crash on all platforms. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34985
Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34985"
Pull request overview
Updates the .github/skills/code-review skill to reduce the risk of false “LGTM” outcomes on high-blast-radius PRs by adding explicit prior-review reconciliation, CI verification hard-gates, and structured blast-radius/failure-mode probing expectations.
Changes:
- Added new eval scenarios to validate CI verification, prior-review reconciliation, and blast-radius signaling.
- Extended the review rules reference with regression-prevention rules derived from PR #34669 / #34984.
- Expanded the code-review skill workflow with mandatory prior-review reconciliation, CI hard gates, blast-radius/failure-mode probing, and confidence calibration guidance.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| .github/skills/code-review/tests/eval.yaml | Adds new evaluation scenarios targeting CI verification, prior review reconciliation, and blast-radius expectations. |
| .github/skills/code-review/references/review-rules.md | Adds new regression-prevention rules (section 21) and updates toolbar regression count context. |
| .github/skills/code-review/SKILL.md | Introduces mandatory prior-review reconciliation, CI hard gate policy, blast-radius/failure-mode probing, confidence calibration, and a lessons-learned section. |
| prompt: "code review PR #34669 in dotnet/maui - note there were prior reviews on this PR" | ||
| assertions: | ||
| - type: "output_matches" | ||
| pattern: "(Prior.*Finding|Reconciliation|earlier.*review|prior.*critical)" |
This assertion can be satisfied by generic words like "Reconciliation" or the section header alone, without evidence that the agent actually looked up prior reviews/comments. Consider tightening the pattern to require a concrete indicator of prior-review inspection (e.g., mention of running the gh api .../reviews call, listing at least one reviewer/login, or explicitly stating "No prior critical findings found").
Suggested change:
- pattern: "(Prior.*Finding|Reconciliation|earlier.*review|prior.*critical)"
+ pattern: "(gh api .*?/reviews|reviewer[: ]+@?[A-Za-z0-9_.-]+|No prior critical findings found|prior critical finding[s]?:)"
Applied in 9b3f6f6 — pattern now requires either the actual gh command (gh pr view .*--json reviews or gh api .*/reviews), a Reviewer: line listing a login, or the explicit "No prior critical findings" sentinel.
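The difference between the original and the suggested reconciliation pattern can be checked directly. The sketch below uses Python's `re` module purely for illustration; the regex flavor of the eval harness is not stated in this thread, and the three sample outputs are invented. It uses the pattern from the suggestion above, not the final form committed in 9b3f6f6.

```python
import re

# Patterns copied from the review thread above.
OLD = r"(Prior.*Finding|Reconciliation|earlier.*review|prior.*critical)"
NEW = (r"(gh api .*?/reviews|reviewer[: ]+@?[A-Za-z0-9_.-]+"
       r"|No prior critical findings found|prior critical finding[s]?:)")

# Hypothetical agent outputs, invented for this example.
header_only = "## Prior Review Reconciliation\n(nothing checked)"
with_command = "Ran: gh api repos/dotnet/maui/pulls/34669/reviews"
with_sentinel = "No prior critical findings found."


def matches(pattern, text):
    """True when the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


old_header = matches(OLD, header_only)      # bare header satisfies the old pattern
new_header = matches(NEW, header_only)      # bare header no longer passes
new_command = matches(NEW, with_command)    # concrete command passes
new_sentinel = matches(NEW, with_sentinel)  # explicit sentinel passes
```

The suggested pattern is case-sensitive, so a capitalized `Reviewer:` line would only match if the harness applies a case-insensitive flag.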
| prompt: "code review PR #34669 in dotnet/maui — this PR adds new BindableProperties to ToolbarItem and modifies platform toolbar extensions" | ||
| assertions: | ||
| - type: "output_matches" | ||
| pattern: "(blast radius|all instances|startup|global|static)" |
The output_matches pattern is very easy to hit with generic phrasing ("startup", "static", etc.) and doesn't ensure the agent performed an actual blast-radius assessment. If this scenario is meant to catch missing blast-radius analysis, consider requiring the specific section header ("Blast Radius Assessment") and at least one structured yes/no answer (e.g., "Runs for all instances: yes/no").
Suggested change:
- pattern: "(blast radius|all instances|startup|global|static)"
+ pattern: "(?is)Blast Radius Assessment.*Runs for all instances:\\s*(yes|no)"
Applied in 9b3f6f6 — pattern now uses your suggested (?is)Blast Radius Assessment.*Runs for all instances:\s*(yes|no). Thanks.
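A small sketch of what the `(?is)` prefix buys in the tightened blast-radius pattern: `i` makes the match case-insensitive and `s` lets `.` cross line breaks, so the header and the structured answer can sit on different lines. This assumes Python-compatible inline flags; the sample outputs and the "Touches shared code" field are hypothetical.

```python
import re

OLD = r"(blast radius|all instances|startup|global|static)"
NEW = r"(?is)Blast Radius Assessment.*Runs for all instances:\s*(yes|no)"

# Invented agent outputs for illustration.
generic = "This adds a static helper that is called at startup."
structured = (
    "### blast radius assessment\n"
    "- Touches shared code: yes\n"   # hypothetical field, not from the skill
    "- Runs for all instances: No\n"
)

old_generic = re.search(OLD, generic) is not None  # generic wording is enough
new_generic = re.search(NEW, generic) is not None  # header and answer now required

m = re.search(NEW, structured)
answer = m.group(1).lower() if m else None  # captured yes/no, normalized
```

Capturing the yes/no answer also lets a harness assert on the value itself, not just on the presence of the section.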
| - What happens if referenced resources (images, fonts) don't exist in the project? | ||
| - Can multiple subscriptions accumulate across handler lifecycle (missing unsubscribe)? | ||
| - Does static state survive page disposal and get stale? | ||
| - What happens if the platform API is unavailable (e.g., iOS 26+ API on iOS 18)? |
The example "iOS 26+ API on iOS 18" is likely to confuse readers since iOS 26 is not a real/current version and reads like a typo. Suggest rephrasing to a version-agnostic example ("newer iOS API on older iOS") or using realistic version numbers that match current SDK/platform ranges used in this repo.
Suggested change:
- What happens if the platform API is unavailable (e.g., iOS 26+ API on iOS 18)?
+ What happens if the platform API is unavailable (e.g., a newer iOS API on an older iOS version)?
Generalized in 9b3f6f6 to "newer iOS API used without an OperatingSystem.IsIOSVersionAtLeast(...) guard". Worth flagging that iOS 26 is not actually a typo — it's the current/upcoming iOS version referenced in safe-area-ios.instructions.md — but the version-agnostic phrasing is clearer and won't age out.
| prompt: "code review PR #34669 in dotnet/maui" | ||
| assertions: | ||
| - type: "output_matches" | ||
| pattern: "(gh pr checks|CI Status|CI.*fail)" |
The assertion regex is so broad that an agent can satisfy it by printing the template header (e.g., "CI Status") without demonstrating that it actually checked CI. To make this scenario validate the intended behavior, tighten the pattern to require an explicit gh pr checks mention/output and/or specific check results (e.g., maui-pr state).
Suggested change:
- pattern: "(gh pr checks|CI Status|CI.*fail)"
+ pattern: "(gh pr checks|maui-pr.*(pass|fail|success|failure|pending|cancelled)|CI.*maui-pr.*(pass|fail|success|failure|pending|cancelled))"
Applied in 9b3f6f6 — pattern now requires either literal gh pr checks or a maui-pr state token (pass|fail|success|failure|pending|cancelled|red|green|❌|✅|⏳). Just "CI Status" header alone no longer passes.
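One property of the tightened CI pattern is worth seeing concretely: the bare header stops matching, but the literal `gh pr checks` alternative still matches any mention of the command, including a statement that it was not run. The sketch below uses the suggested pattern from this thread (not the final committed one) with Python's `re`; the sample strings are invented.

```python
import re

OLD = r"(gh pr checks|CI Status|CI.*fail)"
NEW = (r"(gh pr checks"
       r"|maui-pr.*(pass|fail|success|failure|pending|cancelled)"
       r"|CI.*maui-pr.*(pass|fail|success|failure|pending|cancelled))")

# Hypothetical agent outputs.
samples = {
    "header_only": "## CI Status",
    "real_result": "maui-pr  fail  12m3s",
    "mention_only": "I did not run gh pr checks for this review.",
}

old_hits = {k: bool(re.search(OLD, v)) for k, v in samples.items()}
new_hits = {k: bool(re.search(NEW, v)) for k, v in samples.items()}
```

If mentions without results should fail the scenario, one option would be to drop the first alternative and rely only on the `maui-pr` state tokens; whether that is too strict for the real eval suite is a judgment call for the skill authors.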
Multi-Model Code Review: PR #34985
PR: Improve code-review skill to prevent regression approvals
CI Status
Prior Review Comments
Findings
🟡 MODERATE
- Fix invalid 'conclusion' --json field (gh pr checks supports name/state/bucket)
- Add --required flag guidance for hard-gate decisions
- Make prior-review jq filter case-insensitive (ascii_downcase)
- Bump body truncation 300→2000 to capture structured findings
- Switch 'gh api repos/dotnet/maui' → repo-context-aware 'gh pr view --json reviews'
- Add trust scoping note (weight CHANGES_REQUESTED reviews over drive-by comments)
- Document that high-confidence is intentionally unreachable for large infra PRs at review time
- Generalize iOS version-guard example (drop iOS 26+ specific phrasing)
- Tighten eval assertions per copilot-bot suggestions (CI, reconciliation, blast radius)
- Add 'Failure-Mode Probing' to negative-trigger output_not_contains

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
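The prior-review items in this commit (case-insensitive matching, the 2000-character truncation, and weighting CHANGES_REQUESTED reviews) can be sketched outside jq as follows. This is an illustration only: the skill's actual jq filter is not shown in this thread, the keyword list is a hypothetical stand-in, and the input shape assumes the `reviews` array returned by `gh pr view --json reviews` (objects with `author.login`, `state`, and `body`).

```python
import json

# Hypothetical keyword list; the skill's real filter terms are not shown here.
KEYWORDS = ("critical", "crash", "regression")
MAX_BODY = 2000  # truncation limit named in the commit message


def prior_findings(reviews_json):
    """Return reviews worth reconciling, CHANGES_REQUESTED first."""
    reviews = json.loads(reviews_json)["reviews"]
    hits = []
    for r in reviews:
        body = r.get("body") or ""
        # Lowercase once so matching is case-insensitive (like ascii_downcase).
        flagged = any(k in body.lower() for k in KEYWORDS)
        if flagged or r.get("state") == "CHANGES_REQUESTED":
            hits.append({
                "login": (r.get("author") or {}).get("login", "unknown"),
                "state": r.get("state"),
                "body": body[:MAX_BODY],
            })
    # Stable sort: blocking reviews come before drive-by comments.
    hits.sort(key=lambda h: h["state"] != "CHANGES_REQUESTED")
    return hits


# Invented sample data; the logins are placeholders.
sample = json.dumps({"reviews": [
    {"author": {"login": "alice"}, "state": "COMMENTED",
     "body": "CRITICAL: crash at startup " + "x" * 3000},
    {"author": {"login": "bob"}, "state": "APPROVED", "body": "LGTM"},
    {"author": {"login": "carol"}, "state": "CHANGES_REQUESTED",
     "body": "Please fix"},
]})
result = prior_findings(sample)
```

In this sample the approving "LGTM" review is dropped, the blocking review is listed first, and the long comment body is cut to 2000 characters.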
Review feedback addressed (commit 9b3f6f6)
Thanks @PureWeen, really high-signal review. Applied 7 of 9 findings; explicitly responding to each below.
Applied
Also applied (from copilot-pull-request-reviewer inline comments)
Not applied: explanation
CI is still pending on the new push; will hold off on any further changes until it lands.
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Description of Change
Improves the code-review skill after PR #34669 caused a 100% UITest regression that was approved with "LGTM, Confidence: high." The badge feature PR crashed the app on startup across all platforms and required revert PR #34984.
Root Cause: Why the Skill Failed
The code-review skill approved PR #34669 despite:
- CI failures (maui-pr ❌, Build Analysis ❌, Samples ❌)
Changes
SKILL.md: 3 workflow additions
CI hard gate: gh pr checks. No LGTM when CI is red/pending. Notes that UITests don't run on PR builds.
Also adds a Lessons Learned section documenting PR #34669 as a cautionary example.
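The "no LGTM when CI is red/pending" rule can be sketched as a small decision function. This is a minimal illustration, not the skill's implementation: it assumes the input is the JSON printed by `gh pr checks --json name,state,bucket` (optionally with `--required`), where `bucket` is one of pass, fail, pending, skipping, or cancel. The function name `ci_gate` and the sample check name `license/cla` are hypothetical.

```python
import json


def ci_gate(checks_json):
    """Return (allow_lgtm, reason) for a list of check results."""
    checks = json.loads(checks_json)
    if not checks:
        # No evidence of CI at all is treated the same as red.
        return False, "no checks reported"
    failed = [c["name"] for c in checks if c.get("bucket") in ("fail", "cancel")]
    pending = [c["name"] for c in checks if c.get("bucket") == "pending"]
    if failed:
        return False, "failing: " + ", ".join(failed)
    if pending:
        return False, "pending: " + ", ".join(pending)
    return True, "all checks passed or skipped"


# Invented sample resembling the red state described for PR #34669.
red = json.dumps([
    {"name": "maui-pr", "state": "FAILURE", "bucket": "fail"},
    {"name": "Build Analysis", "state": "FAILURE", "bucket": "fail"},
    {"name": "license/cla", "state": "SUCCESS", "bucket": "pass"},
])
green = json.dumps([{"name": "maui-pr", "state": "SUCCESS", "bucket": "pass"}])

red_result = ci_gate(red)
green_result = ci_gate(green)
```

A passing gate is a necessary condition only: since UITests don't run on PR builds, a green result here says nothing about UITest regressions.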
references/review-rules.md: 3 new regression rules (Section 21)
Updated Toolbar regression count in Section 22 (5 → 6).
tests/eval.yaml: 3 new eval scenarios
gh pr checks)
What NOT to Do (for future agents)
gh pr checks: false CI claims directly caused this regression to be merged
Issues Fixed
Addresses skill gap exposed by PR #34669 (reverted in PR #34984)