Skip to content

Improve code-review skill to prevent regression approvals#34985

Closed
kubaflo wants to merge 2 commits into
mainfrom
fix/code-review-skill-regression-gaps
Closed

Improve code-review skill to prevent regression approvals#34985
kubaflo wants to merge 2 commits into
mainfrom
fix/code-review-skill-regression-gaps

Conversation

@kubaflo
Copy link
Copy Markdown
Contributor

@kubaflo kubaflo commented Apr 15, 2026

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Description of Change

Improves the code-review skill after PR #34669 caused a 100% UITest regression that was approved with "LGTM, Confidence: high." The badge feature PR crashed the app on startup across all platforms and required revert PR #34984.

Root Cause — Why the Skill Failed

The code-review skill approved PR #34669 despite:

Gap What Happened
No CI verification Said "No CI failures detected. Clean build" — CI was actually failing (maui-pr ❌, Build Analysis ❌, Samples ❌)
No prior review reconciliation Pass 1 correctly found a critical WeakReference bug. Pass 3 praised the exact same code as "robust" and gave LGTM
No blast radius assessment Toolbar code changes affected ALL toolbars at startup, not just badge-enabled items
Superficial failure-mode probing Asked "Should BadgeText be int?" instead of "What if this runs on pages without badges?"
Overconfidence "Confidence: high" on a 1141-line, 26-file platform infrastructure change

Changes

SKILL.md — 3 workflow additions

New Requirement What It Prevents
Step 4: Prior Review Reconciliation Must produce a reconciliation table for each prior critical finding. No LGTM with unresolved findings.
Step 5: CI Hard Gate Must run gh pr checks. No LGTM when CI is red/pending. Notes UITests dont run on PR builds.
Step 6: Blast Radius + Failure-Mode Probing + Confidence Calibration Replaces vague Devil's Advocate. Must assess whether code runs for ALL instances, probe real failure modes (startup, null defaults, missing resources), and cap confidence by blast radius + evidence.

Also adds a Lessons Learned section documenting PR #34669 as a cautionary example.

references/review-rules.md — 3 new regression rules (Section 21)

  1. "New feature code must not execute for non-feature users" — the exact rule that would have caught PR Add BadgeText, BadgeColor, and BadgeTextColor support to ToolbarItem #34669
  2. "HostApp test pages must not crash the app" — missing image resources caused HostApp startup crash
  3. "Static dictionaries in extensions survive handler disposal" — scope cleanup to current instance

Updated Toolbar regression count in Section 22 (5 → 6).

tests/eval.yaml — 3 new eval scenarios

  • CI verification (agent must run gh pr checks)
  • Prior review reconciliation (agent must not contradict earlier findings)
  • Blast radius assessment (agent must flag infrastructure changes)

What NOT to Do (for future agents)

  • Dont say "Clean build" without running gh pr checks — false CI claims directly caused this regression to be merged
  • Dont silently contradict prior critical findings — if Pass 1 found a bug, Pass 3 must reconcile it explicitly
  • Dont give "Confidence: high" on large infra PRs — cap by blast radius and evidence, not optimism

Issues Fixed

Addresses skill gap exposed by PR #34669 (reverted in PR #34984)

Add prior review reconciliation, CI hard gate, blast radius assessment,
failure-mode probing, and confidence calibration to prevent approving
PRs that cause startup crashes.

PR #34669 (badge support) was approved with 'LGTM, Confidence: high'
despite CI failures, unreconciled critical findings, and unassessed
blast radius — causing 100% UITest failure from app startup crash on
all platforms.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 15, 2026 15:42
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 15, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34985

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34985"

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the .github/skills/code-review skill to reduce the risk of false “LGTM” outcomes on high-blast-radius PRs by adding explicit prior-review reconciliation, CI verification hard-gates, and structured blast-radius/failure-mode probing expectations.

Changes:

  • Added new eval scenarios to validate CI verification, prior-review reconciliation, and blast-radius signaling.
  • Extended the review rules reference with regression-prevention rules derived from PR #34669 / #34984.
  • Expanded the code-review skill workflow with mandatory prior-review reconciliation, CI hard gates, blast-radius/failure-mode probing, and confidence calibration guidance.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
.github/skills/code-review/tests/eval.yaml Adds new evaluation scenarios targeting CI verification, prior review reconciliation, and blast-radius expectations.
.github/skills/code-review/references/review-rules.md Adds new regression-prevention rules (section 21) and updates toolbar regression count context.
.github/skills/code-review/SKILL.md Introduces mandatory prior-review reconciliation, CI hard gate policy, blast-radius/failure-mode probing, confidence calibration, and a lessons-learned section.

prompt: "code review PR #34669 in dotnet/maui - note there were prior reviews on this PR"
assertions:
- type: "output_matches"
pattern: "(Prior.*Finding|Reconciliation|earlier.*review|prior.*critical)"
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion can be satisfied by generic words like "Reconciliation" or the section header alone, without evidence that the agent actually looked up prior reviews/comments. Consider tightening the pattern to require a concrete indicator of prior-review inspection (e.g., mention of running the gh api .../reviews call, listing at least one reviewer/login, or explicitly stating "No prior critical findings found").

Suggested change
pattern: "(Prior.*Finding|Reconciliation|earlier.*review|prior.*critical)"
pattern: "(gh api .*?/reviews|reviewer[: ]+@?[A-Za-z0-9_.-]+|No prior critical findings found|prior critical finding[s]?:)"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 9b3f6f6 — pattern now requires either the actual gh command (gh pr view .*--json reviews or gh api .*/reviews), a Reviewer: line listing a login, or the explicit "No prior critical findings" sentinel.

prompt: "code review PR #34669 in dotnet/maui — this PR adds new BindableProperties to ToolbarItem and modifies platform toolbar extensions"
assertions:
- type: "output_matches"
pattern: "(blast radius|all instances|startup|global|static)"
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output_matches pattern is very easy to hit with generic phrasing ("startup", "static", etc.) and doesn't ensure the agent performed an actual blast-radius assessment. If this scenario is meant to catch missing blast-radius analysis, consider requiring the specific section header ("Blast Radius Assessment") and at least one structured yes/no answer (e.g., "Runs for all instances: yes/no").

Suggested change
pattern: "(blast radius|all instances|startup|global|static)"
pattern: "(?is)Blast Radius Assessment.*Runs for all instances:\\s*(yes|no)"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 9b3f6f6 — pattern now uses your suggested (?is)Blast Radius Assessment.*Runs for all instances:\s*(yes|no). Thanks.

Comment thread .github/skills/code-review/SKILL.md Outdated
- What happens if referenced resources (images, fonts) don't exist in the project?
- Can multiple subscriptions accumulate across handler lifecycle (missing unsubscribe)?
- Does static state survive page disposal and get stale?
- What happens if the platform API is unavailable (e.g., iOS 26+ API on iOS 18)?
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example "iOS 26+ API on iOS 18" is likely to confuse readers since iOS 26 is not a real/current version and reads like a typo. Suggest rephrasing to a version-agnostic example ("newer iOS API on older iOS") or using realistic version numbers that match current SDK/platform ranges used in this repo.

Suggested change
- What happens if the platform API is unavailable (e.g., iOS 26+ API on iOS 18)?
- What happens if the platform API is unavailable (e.g., a newer iOS API on an older iOS version)?

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generalized in 9b3f6f6 to "newer iOS API used without an OperatingSystem.IsIOSVersionAtLeast(...) guard". Worth flagging that iOS 26 is not actually a typo — it's the current/upcoming iOS version referenced in safe-area-ios.instructions.md — but the version-agnostic phrasing is clearer and won't age out.

prompt: "code review PR #34669 in dotnet/maui"
assertions:
- type: "output_matches"
pattern: "(gh pr checks|CI Status|CI.*fail)"
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion regex is so broad that an agent can satisfy it by printing the template header (e.g., "CI Status") without demonstrating that it actually checked CI. To make this scenario validate the intended behavior, tighten the pattern to require an explicit gh pr checks mention/output and/or specific check results (e.g., maui-pr state).

Suggested change
pattern: "(gh pr checks|CI Status|CI.*fail)"
pattern: "(gh pr checks|maui-pr.*(pass|fail|success|failure|pending|cancelled)|CI.*maui-pr.*(pass|fail|success|failure|pending|cancelled))"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 9b3f6f6 — pattern now requires either literal gh pr checks or a maui-pr state token (pass|fail|success|failure|pending|cancelled|red|green|❌|✅|⏳). Just "CI Status" header alone no longer passes.

@PureWeen
Copy link
Copy Markdown
Member

Multi-Model Code Review — PR #34985

PR: Improve code-review skill to prevent regression approvals
Author: @kubaflo | Files: 3 changed (+177/−15) | Target: main


CI Status

  • maui-pr: ⏳ Pending
  • license/cla: ✅
  • Add Dogfooding Comment: ✅
  • Build Analysis: Skipped (depends on maui-pr)
  • Bump global.json: Skipped (not applicable)

Note: maui-pr has not completed. CI status cannot be confirmed as green.

Prior Review Comments

copilot-pull-request-reviewer[bot] posted 4 inline comments (eval assertion broadness × 3, iOS version example × 1). No human reviews yet. Those 4 comments are not duplicated below — they remain valid and should be addressed.


Findings

🟡 MODERATE — gh pr checks --json name,state,conclusion uses invalid field (SKILL.md, Step 5)

Consensus: 2/3 reviewers (confirmed via gh pr checks --help)

The documented programmatic command uses conclusion which is not a valid --json field for gh pr checks. Valid fields are: bucket, completedAt, description, event, link, name, startedAt, state, workflow. This command will error at runtime, making the programmatic CI analysis path inoperative. Use state or bucket instead.


🟡 MODERATE — CI gate distinguishes "required" vs "optional" checks but commands do not filter (SKILL.md, Step 5)

Consensus: 3/3 reviewers

The policy table distinguishes required vs optional checks ("Only optional/informational checks red → May proceed"), but the prescribed gh pr checks <PR_NUMBER> returns ALL checks. gh pr checks supports a --required flag that is not mentioned. Without it, an agent cannot reliably distinguish required from optional failures, leading to false NEEDS_CHANGES verdicts from optional check failures.

Suggested fix: Prescribe gh pr checks <PR_NUMBER> --required for the gate decision, and the bare gh pr checks for full context.


🟡 MODERATE — Case-sensitive contains("Critical") in jq silently misses variants (SKILL.md, Step 4)

Consensus: 2/3 reviewers

The comment search filter uses contains("Critical") (capital C). Human reviewers and bots use inconsistent casing — "critical", "CRITICAL", "[CRITICAL]". Since this is the mechanism meant to prevent the exact regression class described in the PR, a case miss means the agent sees "no prior critical findings" and proceeds to LGTM.

Suggested fix: Use ascii_downcase: .body | ascii_downcase | contains("critical").


🟡 MODERATE — Review body truncated at 300 characters drops critical findings (SKILL.md, Step 4)

Consensus: 3/3 reviewers

The gh api command truncates review bodies with \(.body[0:300]). Structured agent reviews (with headers, tables, code blocks) routinely exceed 2000+ chars. The critical findings from PR #34669 Pass 1 would have been well past the 300-char mark. This makes the reconciliation step a false safety net.

Suggested fix: Increase to at least [0:2000], or search the full body with jq before truncating for display.


🟡 MODERATE — Eval scenarios depend on live GitHub state of PR #34669 (eval.yaml)

Consensus: 2/3 reviewers

All 3 new eval scenarios target the same historical PR #34669. CI state, reviews, and comments on a merged-and-reverted PR are frozen but could be administratively modified. This creates correlated test brittleness — all 3 tests fail simultaneously if the PR metadata changes. Using different PRs or synthetic fixtures would provide independent signal.


🟢 MINOR — High confidence is logically unsatisfiable for large infra PRs (SKILL.md, Step 6)

Consensus: 2/3 reviewers

The rule requires "UITests have been verified" for Confidence: high on large infrastructure PRs, but the same document states UITests don't run on PR builds. This creates a logical impossibility. If intentional (a defensible safety position after PR #34669), state it explicitly: "High confidence is not achievable for this category at PR review time." As written, an agent may hallucinate UITest verification to satisfy the rule.


🟢 MINOR — Stale eval assertion after Devil's Advocate → Failure-Mode Probing rename (eval.yaml)

Consensus: 2/3 reviewers

The PR replaces ### Devil's Advocate with ### Failure-Mode Probing in the output template and checklist, but the existing negative-trigger eval test may still use output_not_contains: "Devil's Advocate" as a discriminating assertion. If so, that assertion is now vacuously true (the agent never emits "Devil's Advocate" regardless), reducing the negative-trigger test's detection power.

Suggested fix: Verify the existing negative-trigger assertions and update to output_not_contains: "Failure-Mode Probing".


🟢 MINOR — Hardcoded repos/dotnet/maui in gh api command (SKILL.md, Step 4)

Consensus: 2/3 reviewers

The gh api "repos/dotnet/maui/pulls/..." call is hardcoded while gh pr view and gh pr checks are repo-context-aware. One reviewer noted the skill is explicitly scoped to dotnet/maui so this is acceptable in context, but for consistency and resilience, using gh pr view <PR_NUMBER> --json reviews (which is context-aware) would be more robust.


🟢 MINOR — No trust model for reconciliation gate (SKILL.md, Step 4)

Consensus: 2/3 reviewers

The reconciliation rule says "If ANY prior critical finding is unresolved → verdict must be NEEDS_CHANGES" and the search scans ALL PR comments for "Critical" / "must be fixed". Any contributor could post a comment containing these phrases and force the agent to block LGTM. For a high-traffic repo with community commenters, consider scoping to formal reviews (state: CHANGES_REQUESTED) from trusted sources rather than all PR comments.


Discarded Findings (1/3 after adversarial — insufficient consensus)

  • Blast radius eval prompt pre-supplies answer — 1 reviewer flagged; 2 others disagreed, noting the assertion keywords are not literally in the prompt
  • No negative counterpart eval scenarios — 1 reviewer flagged; 2 others noted negative scenarios already exist in eval.yaml

Test Coverage Assessment

This PR modifies documentation/skill files (.md, .yaml) — no compiled code. The 3 new eval scenarios in eval.yaml provide positive test coverage for CI verification, reconciliation, and blast radius detection. As noted above, these could benefit from tighter assertion patterns (per the existing copilot-bot inline comments).


Summary and Recommendation

Verdict: ⚠️ Request Changes

The PR is directionally excellent — it addresses real failure modes from the PR #34669 incident with well-structured safeguards. However:

  1. CI is pendingmaui-pr has not completed
  2. 5 MODERATE findings affect the reliability of the core safety mechanisms (reconciliation search, CI gate commands)
  3. 4 existing inline comments from the copilot reviewer (eval assertion broadness, iOS version example) should also be addressed

The most impactful fixes are: (a) replace conclusion with state or bucket in the --json command, (b) add --required flag guidance, (c) fix case-sensitivity in the jq filter, and (d) increase the 300-char body truncation. These are all small changes that strengthen the mechanisms this PR introduces.

- Fix invalid 'conclusion' --json field (gh pr checks supports name/state/bucket)
- Add --required flag guidance for hard-gate decisions
- Make prior-review jq filter case-insensitive (ascii_downcase)
- Bump body truncation 300→2000 to capture structured findings
- Switch 'gh api repos/dotnet/maui' → repo-context-aware 'gh pr view --json reviews'
- Add trust scoping note (weight CHANGES_REQUESTED reviews over drive-by comments)
- Document that high-confidence is intentionally unreachable for large infra PRs at review time
- Generalize iOS version-guard example (drop iOS 26+ specific phrasing)
- Tighten eval assertions per copilot-bot suggestions (CI, reconciliation, blast radius)
- Add 'Failure-Mode Probing' to negative-trigger output_not_contains

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kubaflo
Copy link
Copy Markdown
Contributor Author

kubaflo commented Apr 18, 2026

Review feedback addressed (commit 9b3f6f6)

Thanks @PureWeen — really high-signal review. Applied 7 of 9 findings; explicitly responding to each below.

Applied

# Finding Fix
1 Invalid --json conclusion field Replaced with --json name,state,bucket,workflow and inlined a comment listing the valid fields
2 No --required filter on CI gate Now prescribe gh pr checks <PR> --required for the hard-gate decision; bare gh pr checks for full context
3 Case-sensitive contains("Critical") Now uses ascii_downcase before contains
4 300-char body truncation drops findings Bumped to 2000
6 High confidence logically unsatisfiable Added explicit note: this is intentional after PR #34669; large infra PRs are capped at medium until post-merge verification, do not hallucinate UITest verification to satisfy the rule
7 Stale Devil's Advocate eval assertion Added output_not_contains: "Failure-Mode Probing" alongside the legacy Devil's Advocate line
8 Hardcoded repos/dotnet/maui Switched to repo-context-aware gh pr view <PR> --json reviews
9 No trust model for reconciliation gate Added a Trust scoping paragraph: weight reviews[].state == CHANGES_REQUESTED and known reviewer bots over free-form comments — a drive-by comment containing "critical" is not, by itself, a blocker

Also applied (from copilot-pull-request-reviewer inline comments)

  • CI eval assertion — tightened to require literal gh pr checks or a maui-pr state token
  • Reconciliation eval assertion — tightened to require either the actual gh command, a Reviewer: line, or the explicit "No prior critical findings" sentinel
  • Blast radius eval assertion — tightened to require Blast Radius Assessment header and Runs for all instances: yes/no
  • iOS 26 example — generalized to "newer iOS API used without an OperatingSystem.IsIOSVersionAtLeast(...) guard". (iOS 26 isn't actually a typo — it's the upcoming version referenced in our safe-area instructions — but the version-agnostic phrasing is clearer and won't age out.)

Not applied — explanation

# Finding Why not
5 Eval scenarios all target PR #34669 Acknowledged. All three new scenarios target this PR specifically because it's the canonical regression we're protecting against — the eval doubles as a regression test. PR metadata on a merged-and-reverted PR is effectively frozen; if it ever does get administratively edited, the failures would be a useful signal that someone touched the historical record. Open to splitting if you'd prefer synthetic fixtures, but I'd rather not lose the direct PR #34669 anchoring.

CI is still pending on the new push — will hold off any further changes until it lands.

@PureWeen
Copy link
Copy Markdown
Member

#35478

@PureWeen PureWeen closed this May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants