Improve code-review skill to prevent regression approvals by kubaflo · Pull Request #34985 · dotnet/maui

kubaflo · 2026-04-15T15:42:33Z

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Description of Change

Improves the code-review skill after PR #34669 caused a 100% UITest regression that was approved with "LGTM, Confidence: high." The badge feature PR crashed the app on startup across all platforms and required revert PR #34984.

Root Cause — Why the Skill Failed

The code-review skill approved PR #34669 despite:

Gap	What Happened
No CI verification	Said "No CI failures detected. Clean build" — CI was actually failing (`maui-pr` ❌, Build Analysis ❌, Samples ❌)
No prior review reconciliation	Pass 1 correctly found a critical WeakReference bug. Pass 3 praised the exact same code as "robust" and gave LGTM
No blast radius assessment	Toolbar code changes affected ALL toolbars at startup, not just badge-enabled items
Superficial failure-mode probing	Asked "Should BadgeText be int?" instead of "What if this runs on pages without badges?"
Overconfidence	"Confidence: high" on a 1141-line, 26-file platform infrastructure change

Changes

`SKILL.md` — 3 workflow additions

New Requirement	What It Prevents
Step 4: Prior Review Reconciliation	Must produce a reconciliation table for each prior critical finding. No LGTM with unresolved findings.
Step 5: CI Hard Gate	Must run `gh pr checks`. No LGTM when CI is red/pending. Notes UITests dont run on PR builds.
Step 6: Blast Radius + Failure-Mode Probing + Confidence Calibration	Replaces vague Devil's Advocate. Must assess whether code runs for ALL instances, probe real failure modes (startup, null defaults, missing resources), and cap confidence by blast radius + evidence.

Also adds a Lessons Learned section documenting PR #34669 as a cautionary example.

`references/review-rules.md` — 3 new regression rules (Section 21)

"New feature code must not execute for non-feature users" — the exact rule that would have caught PR Add BadgeText, BadgeColor, and BadgeTextColor support to ToolbarItem #34669
"HostApp test pages must not crash the app" — missing image resources caused HostApp startup crash
"Static dictionaries in extensions survive handler disposal" — scope cleanup to current instance

Updated Toolbar regression count in Section 22 (5 → 6).

`tests/eval.yaml` — 3 new eval scenarios

CI verification (agent must run gh pr checks)
Prior review reconciliation (agent must not contradict earlier findings)
Blast radius assessment (agent must flag infrastructure changes)

What NOT to Do (for future agents)

❌ Dont say "Clean build" without running gh pr checks — false CI claims directly caused this regression to be merged
❌ Dont silently contradict prior critical findings — if Pass 1 found a bug, Pass 3 must reconcile it explicitly
❌ Dont give "Confidence: high" on large infra PRs — cap by blast radius and evidence, not optimism

Issues Fixed

Addresses skill gap exposed by PR #34669 (reverted in PR #34984)

Add prior review reconciliation, CI hard gate, blast radius assessment, failure-mode probing, and confidence calibration to prevent approving PRs that cause startup crashes. PR #34669 (badge support) was approved with 'LGTM, Confidence: high' despite CI failures, unreconciled critical findings, and unassessed blast radius — causing 100% UITest failure from app startup crash on all platforms. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-04-15T15:42:46Z

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34985

Or

Run remotely in PowerShell:

iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34985"

Copilot

Pull request overview

Updates the .github/skills/code-review skill to reduce the risk of false “LGTM” outcomes on high-blast-radius PRs by adding explicit prior-review reconciliation, CI verification hard-gates, and structured blast-radius/failure-mode probing expectations.

Changes:

Added new eval scenarios to validate CI verification, prior-review reconciliation, and blast-radius signaling.
Extended the review rules reference with regression-prevention rules derived from PR #34669 / #34984.
Expanded the code-review skill workflow with mandatory prior-review reconciliation, CI hard gates, blast-radius/failure-mode probing, and confidence calibration guidance.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
.github/skills/code-review/tests/eval.yaml	Adds new evaluation scenarios targeting CI verification, prior review reconciliation, and blast-radius expectations.
.github/skills/code-review/references/review-rules.md	Adds new regression-prevention rules (section 21) and updates toolbar regression count context.
.github/skills/code-review/SKILL.md	Introduces mandatory prior-review reconciliation, CI hard gate policy, blast-radius/failure-mode probing, confidence calibration, and a lessons-learned section.

Copilot · 2026-04-15T15:47:05Z

+    prompt: "code review PR #34669 in dotnet/maui - note there were prior reviews on this PR"
+    assertions:
+      - type: "output_matches"
+        pattern: "(Prior.*Finding|Reconciliation|earlier.*review|prior.*critical)"


This assertion can be satisfied by generic words like "Reconciliation" or the section header alone, without evidence that the agent actually looked up prior reviews/comments. Consider tightening the pattern to require a concrete indicator of prior-review inspection (e.g., mention of running the gh api .../reviews call, listing at least one reviewer/login, or explicitly stating "No prior critical findings found").

Suggested change

pattern: "(Prior.*Finding|Reconciliation|earlier.*review|prior.*critical)"

pattern: "(gh api .*?/reviews|reviewer[: ]+@?[A-Za-z0-9_.-]+|No prior critical findings found|prior critical finding[s]?:)"

Applied in 9b3f6f6 — pattern now requires either the actual gh command (gh pr view .*--json reviews or gh api .*/reviews), a Reviewer: line listing a login, or the explicit "No prior critical findings" sentinel.

Copilot · 2026-04-15T15:47:05Z

+    prompt: "code review PR #34669 in dotnet/maui — this PR adds new BindableProperties to ToolbarItem and modifies platform toolbar extensions"
+    assertions:
+      - type: "output_matches"
+        pattern: "(blast radius|all instances|startup|global|static)"


The output_matches pattern is very easy to hit with generic phrasing ("startup", "static", etc.) and doesn't ensure the agent performed an actual blast-radius assessment. If this scenario is meant to catch missing blast-radius analysis, consider requiring the specific section header ("Blast Radius Assessment") and at least one structured yes/no answer (e.g., "Runs for all instances: yes/no").

Suggested change

pattern: "(blast radius|all instances|startup|global|static)"

pattern: "(?is)Blast Radius Assessment.*Runs for all instances:\\s*(yes|no)"

Applied in 9b3f6f6 — pattern now uses your suggested (?is)Blast Radius Assessment.*Runs for all instances:\s*(yes|no). Thanks.

Copilot · 2026-04-15T15:47:06Z

+- What happens if referenced resources (images, fonts) don't exist in the project?
+- Can multiple subscriptions accumulate across handler lifecycle (missing unsubscribe)?
+- Does static state survive page disposal and get stale?
+- What happens if the platform API is unavailable (e.g., iOS 26+ API on iOS 18)?


The example "iOS 26+ API on iOS 18" is likely to confuse readers since iOS 26 is not a real/current version and reads like a typo. Suggest rephrasing to a version-agnostic example ("newer iOS API on older iOS") or using realistic version numbers that match current SDK/platform ranges used in this repo.

Suggested change

- What happens if the platform API is unavailable (e.g., iOS 26+ API on iOS 18)?

- What happens if the platform API is unavailable (e.g., a newer iOS API on an older iOS version)?

Generalized in 9b3f6f6 to "newer iOS API used without an OperatingSystem.IsIOSVersionAtLeast(...) guard". Worth flagging that iOS 26 is not actually a typo — it's the current/upcoming iOS version referenced in safe-area-ios.instructions.md — but the version-agnostic phrasing is clearer and won't age out.

Copilot · 2026-04-15T15:47:06Z

+    prompt: "code review PR #34669 in dotnet/maui"
+    assertions:
+      - type: "output_matches"
+        pattern: "(gh pr checks|CI Status|CI.*fail)"


PureWeen · 2026-04-15T19:25:01Z

Multi-Model Code Review — PR #34985

PR: Improve code-review skill to prevent regression approvals
Author: @kubaflo | Files: 3 changed (+177/−15) | Target: main

CI Status

maui-pr: ⏳ Pending
license/cla: ✅
Add Dogfooding Comment: ✅
Build Analysis: Skipped (depends on maui-pr)
Bump global.json: Skipped (not applicable)

Note: maui-pr has not completed. CI status cannot be confirmed as green.

Prior Review Comments

copilot-pull-request-reviewer[bot] posted 4 inline comments (eval assertion broadness × 3, iOS version example × 1). No human reviews yet. Those 4 comments are not duplicated below — they remain valid and should be addressed.

Findings

🟡 MODERATE — `gh pr checks --json name,state,conclusion` uses invalid field (SKILL.md, Step 5)

Consensus: 2/3 reviewers (confirmed via gh pr checks --help)

The documented programmatic command uses conclusion which is not a valid --json field for gh pr checks. Valid fields are: bucket, completedAt, description, event, link, name, startedAt, state, workflow. This command will error at runtime, making the programmatic CI analysis path inoperative. Use state or bucket instead.

🟡 MODERATE — CI gate distinguishes "required" vs "optional" checks but commands do not filter (SKILL.md, Step 5)

Consensus: 3/3 reviewers

The policy table distinguishes required vs optional checks ("Only optional/informational checks red → May proceed"), but the prescribed gh pr checks <PR_NUMBER> returns ALL checks. gh pr checks supports a --required flag that is not mentioned. Without it, an agent cannot reliably distinguish required from optional failures, leading to false NEEDS_CHANGES verdicts from optional check failures.

Suggested fix: Prescribe gh pr checks <PR_NUMBER> --required for the gate decision, and the bare gh pr checks for full context.

🟡 MODERATE — Case-sensitive `contains("Critical")` in jq silently misses variants (SKILL.md, Step 4)

Consensus: 2/3 reviewers

The comment search filter uses contains("Critical") (capital C). Human reviewers and bots use inconsistent casing — "critical", "CRITICAL", "[CRITICAL]". Since this is the mechanism meant to prevent the exact regression class described in the PR, a case miss means the agent sees "no prior critical findings" and proceeds to LGTM.

Suggested fix: Use ascii_downcase: .body | ascii_downcase | contains("critical").

🟡 MODERATE — Review body truncated at 300 characters drops critical findings (SKILL.md, Step 4)

Consensus: 3/3 reviewers

The gh api command truncates review bodies with \(.body[0:300]). Structured agent reviews (with headers, tables, code blocks) routinely exceed 2000+ chars. The critical findings from PR #34669 Pass 1 would have been well past the 300-char mark. This makes the reconciliation step a false safety net.

Suggested fix: Increase to at least [0:2000], or search the full body with jq before truncating for display.

🟡 MODERATE — Eval scenarios depend on live GitHub state of PR #34669 (eval.yaml)

Consensus: 2/3 reviewers

All 3 new eval scenarios target the same historical PR #34669. CI state, reviews, and comments on a merged-and-reverted PR are frozen but could be administratively modified. This creates correlated test brittleness — all 3 tests fail simultaneously if the PR metadata changes. Using different PRs or synthetic fixtures would provide independent signal.

🟢 MINOR — High confidence is logically unsatisfiable for large infra PRs (SKILL.md, Step 6)

Consensus: 2/3 reviewers

The rule requires "UITests have been verified" for Confidence: high on large infrastructure PRs, but the same document states UITests don't run on PR builds. This creates a logical impossibility. If intentional (a defensible safety position after PR #34669), state it explicitly: "High confidence is not achievable for this category at PR review time." As written, an agent may hallucinate UITest verification to satisfy the rule.

🟢 MINOR — Stale eval assertion after Devil's Advocate → Failure-Mode Probing rename (eval.yaml)

Consensus: 2/3 reviewers

The PR replaces ### Devil's Advocate with ### Failure-Mode Probing in the output template and checklist, but the existing negative-trigger eval test may still use output_not_contains: "Devil's Advocate" as a discriminating assertion. If so, that assertion is now vacuously true (the agent never emits "Devil's Advocate" regardless), reducing the negative-trigger test's detection power.

Suggested fix: Verify the existing negative-trigger assertions and update to output_not_contains: "Failure-Mode Probing".

🟢 MINOR — Hardcoded `repos/dotnet/maui` in `gh api` command (SKILL.md, Step 4)

Consensus: 2/3 reviewers

The gh api "repos/dotnet/maui/pulls/..." call is hardcoded while gh pr view and gh pr checks are repo-context-aware. One reviewer noted the skill is explicitly scoped to dotnet/maui so this is acceptable in context, but for consistency and resilience, using gh pr view <PR_NUMBER> --json reviews (which is context-aware) would be more robust.

🟢 MINOR — No trust model for reconciliation gate (SKILL.md, Step 4)

Consensus: 2/3 reviewers

The reconciliation rule says "If ANY prior critical finding is unresolved → verdict must be NEEDS_CHANGES" and the search scans ALL PR comments for "Critical" / "must be fixed". Any contributor could post a comment containing these phrases and force the agent to block LGTM. For a high-traffic repo with community commenters, consider scoping to formal reviews (state: CHANGES_REQUESTED) from trusted sources rather than all PR comments.

Discarded Findings (1/3 after adversarial — insufficient consensus)

Blast radius eval prompt pre-supplies answer — 1 reviewer flagged; 2 others disagreed, noting the assertion keywords are not literally in the prompt
No negative counterpart eval scenarios — 1 reviewer flagged; 2 others noted negative scenarios already exist in eval.yaml

Test Coverage Assessment

This PR modifies documentation/skill files (.md, .yaml) — no compiled code. The 3 new eval scenarios in eval.yaml provide positive test coverage for CI verification, reconciliation, and blast radius detection. As noted above, these could benefit from tighter assertion patterns (per the existing copilot-bot inline comments).

Summary and Recommendation

Verdict: ⚠️ Request Changes

The PR is directionally excellent — it addresses real failure modes from the PR #34669 incident with well-structured safeguards. However:

CI is pending — maui-pr has not completed
5 MODERATE findings affect the reliability of the core safety mechanisms (reconciliation search, CI gate commands)
4 existing inline comments from the copilot reviewer (eval assertion broadness, iOS version example) should also be addressed

The most impactful fixes are: (a) replace conclusion with state or bucket in the --json command, (b) add --required flag guidance, (c) fix case-sensitivity in the jq filter, and (d) increase the 300-char body truncation. These are all small changes that strengthen the mechanisms this PR introduces.

- Fix invalid 'conclusion' --json field (gh pr checks supports name/state/bucket) - Add --required flag guidance for hard-gate decisions - Make prior-review jq filter case-insensitive (ascii_downcase) - Bump body truncation 300→2000 to capture structured findings - Switch 'gh api repos/dotnet/maui' → repo-context-aware 'gh pr view --json reviews' - Add trust scoping note (weight CHANGES_REQUESTED reviews over drive-by comments) - Document that high-confidence is intentionally unreachable for large infra PRs at review time - Generalize iOS version-guard example (drop iOS 26+ specific phrasing) - Tighten eval assertions per copilot-bot suggestions (CI, reconciliation, blast radius) - Add 'Failure-Mode Probing' to negative-trigger output_not_contains Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kubaflo · 2026-04-18T19:25:10Z

Review feedback addressed (commit `9b3f6f6`)

Thanks @PureWeen — really high-signal review. Applied 7 of 9 findings; explicitly responding to each below.

Applied

#	Finding	Fix
1	Invalid `--json conclusion` field	Replaced with `--json name,state,bucket,workflow` and inlined a comment listing the valid fields
2	No `--required` filter on CI gate	Now prescribe `gh pr checks <PR> --required` for the hard-gate decision; bare `gh pr checks` for full context
3	Case-sensitive `contains("Critical")`	Now uses `ascii_downcase` before `contains`
4	300-char body truncation drops findings	Bumped to 2000
6	High confidence logically unsatisfiable	Added explicit note: this is intentional after PR #34669; large infra PRs are capped at `medium` until post-merge verification, do not hallucinate UITest verification to satisfy the rule
7	Stale `Devil's Advocate` eval assertion	Added `output_not_contains: "Failure-Mode Probing"` alongside the legacy `Devil's Advocate` line
8	Hardcoded `repos/dotnet/maui`	Switched to repo-context-aware `gh pr view <PR> --json reviews`
9	No trust model for reconciliation gate	Added a Trust scoping paragraph: weight `reviews[].state == CHANGES_REQUESTED` and known reviewer bots over free-form comments — a drive-by comment containing "critical" is not, by itself, a blocker

Also applied (from copilot-pull-request-reviewer inline comments)

CI eval assertion — tightened to require literal gh pr checks or a maui-pr state token
Reconciliation eval assertion — tightened to require either the actual gh command, a Reviewer: line, or the explicit "No prior critical findings" sentinel
Blast radius eval assertion — tightened to require Blast Radius Assessment header and Runs for all instances: yes/no
iOS 26 example — generalized to "newer iOS API used without an OperatingSystem.IsIOSVersionAtLeast(...) guard". (iOS 26 isn't actually a typo — it's the upcoming version referenced in our safe-area instructions — but the version-agnostic phrasing is clearer and won't age out.)

Not applied — explanation

#	Finding	Why not
5	Eval scenarios all target PR #34669	Acknowledged. All three new scenarios target this PR specifically because it's the canonical regression we're protecting against — the eval doubles as a regression test. PR metadata on a merged-and-reverted PR is effectively frozen; if it ever does get administratively edited, the failures would be a useful signal that someone touched the historical record. Open to splitting if you'd prefer synthetic fixtures, but I'd rather not lose the direct PR #34669 anchoring.

CI is still pending on the new push — will hold off any further changes until it lands.

PureWeen · 2026-05-16T17:45:56Z

#35478

Copilot AI review requested due to automatic review settings April 15, 2026 15:42

Copilot started reviewing on behalf of kubaflo April 15, 2026 15:43 View session

Copilot AI reviewed Apr 15, 2026

View reviewed changes

github-actions Bot mentioned this pull request Apr 16, 2026

[repo-status] 🌟 Daily Repo Status - April 16, 2026 #34990

Closed

PureWeen mentioned this pull request Apr 16, 2026

Add daily PR review queue workflow with actionability detection #34818

Merged

github-actions Bot mentioned this pull request Apr 17, 2026

[PR Review Queue] 2026-04-17 #35004

Closed

github-actions Bot mentioned this pull request Apr 20, 2026

[PR Review Queue] 2026-04-20 #35039

Closed

PureWeen mentioned this pull request May 16, 2026

Improve code-review skill with regression prevention safeguards #35478

Open

PureWeen closed this May 16, 2026

	pattern: "(Prior.Finding\|Reconciliation\|earlier.review\|prior.*critical)"
	pattern: "(gh api .*?/reviews\|reviewer[: ]+@?[A-Za-z0-9_.-]+\|No prior critical findings found\|prior critical finding[s]?:)"

	pattern: "(blast radius\|all instances\|startup\|global\|static)"
	pattern: "(?is)Blast Radius Assessment.Runs for all instances:\\s(yes\|no)"

	- What happens if the platform API is unavailable (e.g., iOS 26+ API on iOS 18)?
	- What happens if the platform API is unavailable (e.g., a newer iOS API on an older iOS version)?

	pattern: "(gh pr checks\|CI Status\|CI.*fail)"
	pattern: "(gh pr checks\|maui-pr.(pass\|fail\|success\|failure\|pending\|cancelled)\|CI.maui-pr.*(pass\|fail\|success\|failure\|pending\|cancelled))"

Conversation

kubaflo commented Apr 15, 2026

Description of Change

Root Cause — Why the Skill Failed

Changes

SKILL.md — 3 workflow additions

references/review-rules.md — 3 new regression rules (Section 21)

tests/eval.yaml — 3 new eval scenarios

What NOT to Do (for future agents)

Issues Fixed

Uh oh!

github-actions Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

kubaflo Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

PureWeen commented Apr 15, 2026

Multi-Model Code Review — PR #34985

CI Status

Prior Review Comments

Findings

🟡 MODERATE — gh pr checks --json name,state,conclusion uses invalid field (SKILL.md, Step 5)

🟡 MODERATE — CI gate distinguishes "required" vs "optional" checks but commands do not filter (SKILL.md, Step 5)

🟡 MODERATE — Case-sensitive contains("Critical") in jq silently misses variants (SKILL.md, Step 4)

🟡 MODERATE — Review body truncated at 300 characters drops critical findings (SKILL.md, Step 4)

🟡 MODERATE — Eval scenarios depend on live GitHub state of PR #34669 (eval.yaml)

🟢 MINOR — High confidence is logically unsatisfiable for large infra PRs (SKILL.md, Step 6)

🟢 MINOR — Stale eval assertion after Devil's Advocate → Failure-Mode Probing rename (eval.yaml)

🟢 MINOR — Hardcoded repos/dotnet/maui in gh api command (SKILL.md, Step 4)

🟢 MINOR — No trust model for reconciliation gate (SKILL.md, Step 4)

Discarded Findings (1/3 after adversarial — insufficient consensus)

Test Coverage Assessment

Summary and Recommendation

Uh oh!

kubaflo commented Apr 18, 2026

Review feedback addressed (commit 9b3f6f6)

Applied

Also applied (from copilot-pull-request-reviewer inline comments)

Not applied — explanation

Uh oh!

PureWeen commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

`SKILL.md` — 3 workflow additions

`references/review-rules.md` — 3 new regression rules (Section 21)

`tests/eval.yaml` — 3 new eval scenarios

github-actions Bot commented Apr 15, 2026 •

edited

Loading

🟡 MODERATE — `gh pr checks --json name,state,conclusion` uses invalid field (SKILL.md, Step 5)

🟡 MODERATE — Case-sensitive `contains("Critical")` in jq silently misses variants (SKILL.md, Step 4)

🟢 MINOR — Hardcoded `repos/dotnet/maui` in `gh api` command (SKILL.md, Step 4)

Review feedback addressed (commit `9b3f6f6`)