dotnet · kubaflo · Apr 15, 2026 · Apr 18, 2026
diff --git a/.github/skills/code-review/SKILL.md b/.github/skills/code-review/SKILL.md
@@ -92,24 +92,120 @@ Now read the PR description, linked issue, and comments. Treat these as **claims
 2. If the PR claims a bug fix, verify the root cause analysis matches the code
 3. Check existing review comments to avoid duplicating feedback
 
+#### 🚨 Prior Review Reconciliation (MANDATORY)
+
+Check for prior reviews on the same PR — from the Copilot PR reviewer bot, other agents, or human reviewers:
+
+```bash
+# Get all reviews (use repo-context-aware command; bump truncation to 2000 chars to capture structured findings)
+gh pr view <PR_NUMBER> --json reviews --jq '.reviews[] | "Reviewer: \(.author.login) | State: \(.state)\n\(.body[0:2000])\n---"'
+
+# Search PR comments for prior critical findings (case-insensitive — reviewers use "Critical", "critical", "[CRITICAL]")
+gh pr view <PR_NUMBER> --json comments --jq '.comments[] | select(.body | ascii_downcase | (contains("critical") or contains("must be fixed") or contains("🔴"))) | .body[0:2000]'
+```
+
+**Trust scoping:** The comment scan is intentionally broad to catch agent and bot reviews. To avoid spoofing by arbitrary commenters, weight findings from formal reviews (`reviews[].state == "CHANGES_REQUESTED"`) and known reviewer bots (`copilot-pull-request-reviewer`, `PureWeen`, etc.) more heavily than free-form comments. A drive-by comment containing the word "critical" is not, by itself, a blocker — but a `CHANGES_REQUESTED` review is.
+
+**If prior reviews flagged critical issues, you MUST produce a reconciliation table:**
+
+```markdown
+### Prior Finding Reconciliation
+
+| Prior Critical Finding | Source | Current Status | Evidence |
+|------------------------|--------|----------------|----------|
+| [finding] | [reviewer/pass] | ✅ Fixed / ❌ Unresolved / 🔄 Obsolete | [code ref or explanation] |
+```
+
+**Rules:**
+- If ANY prior critical finding is **unresolved** → verdict must be `NEEDS_CHANGES`
+- If status cannot be determined → mark **unresolved** (default to caution)
+- If finding is obsolete (code was removed/rewritten) → say so explicitly with evidence
+- **NEVER silently drop or contradict a prior critical finding.** This is the #1 cause of regression approvals — PR #34669 was approved with "high confidence" after earlier passes had correctly identified a critical bug that was never fixed.
+
 ### Step 5: Check CI Status
 
 ```bash
+# Full context — all checks (required + optional)
 gh pr checks <PR_NUMBER>
+
+# Hard-gate decision — required checks only (use this for the verdict gate below)
+gh pr checks <PR_NUMBER> --required
+
+# Programmatic analysis — valid --json fields are: bucket, completedAt, description,
+# event, link, name, startedAt, state, workflow. (`conclusion` is NOT a valid field.)
+# `bucket` categorizes `state` into pass/fail/pending/skipping/cancel.
+gh pr checks <PR_NUMBER> --json name,state,bucket,workflow
 ```
 
-Review CI results. **Never post ✅ LGTM if any required CI check is failing.** If CI is failing:
-- If caused by the PR's code changes, flag as ❌ error
-- If a known infrastructure issue or pre-existing flake, note it but still use ⚠️
-- If the PR description acknowledges the failure, note it in the summary
+Review CI results. **🚨 HARD GATE: Never give `LGTM` if any required CI check is failing or pending.** Use `--required` to scope the gate decision so optional/informational check failures don't force a false `NEEDS_CHANGES`.
+
+| CI State | Allowed Verdict |
+|----------|----------------|
+| All required checks ✅ green | May proceed to verdict |
+| Any required check ❌ red | **`NEEDS_CHANGES`** — document which checks failed and whether related to PR |
+| Required checks ⏳ pending | **`NEEDS_DISCUSSION`** — state "CI not yet complete" |
+| Only optional/informational checks red | May proceed, note the failures |
+
+**NEVER say "No CI failures detected" or "Clean build" without actually running `gh pr checks`.** False CI claims directly caused PR #34669 (100% UITest failure) to be approved and merged.
+
+**UITest awareness:** UITests do NOT run on PR builds in this repo — they only run post-merge. If the PR modifies HostApp pages, handler/extension code, or platform infrastructure, explicitly note: _"UITests do not run on PR builds. Startup and runtime behavior cannot be verified from CI alone."_
+
+### Step 6: Blast Radius, Failure-Mode Probing, and Verdict
+
+#### Blast Radius Assessment (MANDATORY for infrastructure changes)
+
+**Trigger:** PR modifies any of: handlers, platform extensions, toolbar/navigation code, page registration, static state, `PropertyChanged` subscriptions, or HostApp startup paths.
 
-### Step 6: Devil's Advocate and Verdict
+Answer these questions explicitly in the review:
 
-Before finalizing your verdict:
+| Question | Why It Matters |
+|----------|---------------|
+| Does this code run for ALL instances, or only when the new feature is used? | Toolbar code that wraps ALL items in a Grid broke startup for every page (PR #34669) |
+| Does this code run at app startup or page initialization? | Static fields initialized on first access can crash the app before any test page loads |
+| Are there new static/shared state fields that affect all pages/windows? | Static `ConcurrentDictionary` survives handler disposal unless explicitly cleaned up |
+| Do new `PropertyChanged` subscriptions fire for ALL property changes? | A handler subscribing to all PropertyChanged events burns CPU and can throw if it handles unexpected properties |
+| Do HostApp test pages reference resources (images, fonts) that exist? | Missing `envelope.png`, `bell.png` etc. can crash the HostApp during page scan/registration |
+| What happens at startup with null/default values for new properties? | New BindableProperty with null default must not cause NullRef in platform code paths |
 
-1. **Challenge your findings** — For each issue you flagged, ask: "Am I sure, or am I guessing?"
-2. **Challenge your approval** — If you're leaning LGTM, ask: "What could go wrong that I'm not seeing?"
-3. **Check platform blind spots** — If the change touches platforms you can't fully reason about, say so explicitly
+#### Failure-Mode Probing (MANDATORY)
+
+**Do NOT ask easy rhetorical questions with obvious answers.** Probe genuinely challenging failure modes:
+
+- What happens if this code runs on pages/items that DON'T use the new feature?
+- What happens during handler disconnect/reconnect (navigation, Shell tab switch)?
+- What happens with null `Parent`, `Handler`, `BindingContext`, or `IconImageSource`?
+- What happens if referenced resources (images, fonts) don't exist in the project?
+- Can multiple subscriptions accumulate across handler lifecycle (missing unsubscribe)?
+- Does static state survive page disposal and get stale?
+- What happens if the platform API is unavailable (e.g., a newer iOS API used without an `OperatingSystem.IsIOSVersionAtLeast(...)` guard)?
+
+**Anti-pattern (from PR #34669 review):** The reviewer asked "Should BadgeText be int?" and "Could static dictionaries cause memory pressure?" — these are softballs with obvious "no" answers. Meanwhile the ACTUAL crash (toolbar code running for all items at startup) was never probed.
+
+#### Confidence Calibration
+
+**Base confidence by blast radius:**
+
+| Blast Radius | Max Confidence |
+|-------------|----------------|
+| Localized change, non-startup, non-infrastructure | May be **high** |
+| Platform-specific handler/UI plumbing | Max **medium** |
+| Shared infrastructure, startup path, global subscriptions/static state | Max **low** |
+
+**Then cap by evidence:**
+
+| Evidence | Confidence Cap |
+|----------|---------------|
+| CI red or pending | **NEEDS_CHANGES / NEEDS_DISCUSSION** (no LGTM) |
+| No relevant tests run (e.g., UITests skip PR builds) | Max **low** |
+| CI green but no UI/integration test coverage | Max **medium** |
+| CI green + targeted runtime coverage verified | May increase one level |
+| Prior critical findings unresolved | **NEEDS_CHANGES** (no LGTM) |
+
+**NEVER give "Confidence: high" on PRs that modify platform infrastructure with >500 lines unless CI is fully green AND UITests have been verified.**
+
+> **Note — high confidence is intentionally unreachable at PR-review time for this category.** UITests do not run on PR builds (see Step 5), so the "UITests verified" precondition cannot be satisfied during review. This is deliberate after PR #34669: large infrastructure PRs are capped at `medium` until post-merge verification. Do **not** hallucinate UITest verification to satisfy this rule — if the rule cannot be satisfied, the cap stands.
+
+#### Deliver Verdict
 
 Then deliver your verdict:
 
@@ -166,6 +262,24 @@ When the environment supports multiple models, run the review in parallel for di
 **Author claims:** [Summary of PR description]
 **Agreement/disagreement:** [Where your assessment matches or differs]
 
+### Prior Finding Reconciliation
+| Prior Critical Finding | Source | Current Status | Evidence |
+|---|---|---|---|
+| [finding] | [reviewer] | ✅ Fixed / ❌ Unresolved / 🔄 Obsolete | [evidence] |
+*(If no prior reviews with critical findings, state "No prior critical findings found.")*
+
+### CI Status
+- `maui-pr`: ✅ / ❌ (reason) / ⏳
+- UITests: ⚠️ Not run on PR builds
+- [Other checks]
+
+### Blast Radius Assessment
+*(Required for infrastructure/handler/platform changes; omit for simple fixes)*
+- Runs for all instances: [yes/no — explanation]
+- Startup impact: [yes/no]
+- Static/shared state: [yes/no]
+- HostApp resources verified: [yes/no/N/A]
+
 ### Findings
 
 #### ❌ Error — [Brief description]
@@ -177,11 +291,12 @@ When the environment supports multiple models, run the review in parallel for di
 #### 💡 Suggestion — [Brief description]
 [Explanation]
 
-### Devil's Advocate
-[Challenges to your own conclusions]
+### Failure-Mode Probing
+- [Probe]: [Answer — what actually happens in this scenario]
+- [Probe]: [Answer]
 
 ### Verdict: LGTM / NEEDS_CHANGES / NEEDS_DISCUSSION
-**Confidence:** high / medium / low
+**Confidence:** high / medium / low *(with justification referencing calibration table)*
 **Summary:** [2-3 sentences explaining the verdict]
 ```
 
@@ -218,8 +333,29 @@ The script wraps the review in a collapsible `<details>` section, adds PR metada
 
 - [ ] Full source files read (not just diffs)
 - [ ] Independent assessment formed before reading PR narrative
+- [ ] Prior reviews checked and critical findings reconciled (Step 4)
+- [ ] CI status verified via `gh pr checks` — not assumed (Step 5)
 - [ ] MAUI-specific checklist walked through for each applicable section
+- [ ] Blast radius assessed for infrastructure/handler/platform changes (Step 6)
+- [ ] Failure-mode probing completed with real scenarios, not softballs (Step 6)
 - [ ] Findings categorized by severity (❌ / ⚠️ / 💡)
-- [ ] Devil's advocate check performed
-- [ ] Verdict is consistent with findings
+- [ ] Confidence calibrated against blast radius and evidence tables (Step 6)
+- [ ] Verdict is consistent with findings AND prior review reconciliation
 - [ ] Output follows the format above
+
+---
+
+## Lessons Learned (Regressions Caused by Skill Gaps)
+
+### PR #34669 — Badge Feature Regression (April 2026)
+
+**What happened:** 1141-line PR adding ToolbarItem badge support was reviewed three times. Pass 1 correctly found a critical WeakReference bug. Pass 3 contradicted Pass 1 — praised the same buggy code as "robust", gave `LGTM` with "Confidence: high", and falsely stated "No CI failures detected. Clean build." The PR was merged and caused **100% UITest failure** (app crash on startup, all platforms). Required revert PR #34984.
+
+**Five skill gaps that caused this:**
+1. **No CI verification** — said "clean build" without running `gh pr checks` (CI was actually failing)
+2. **No prior review reconciliation** — silently contradicted its own earlier critical finding
+3. **No blast radius assessment** — didn't realize toolbar changes affected ALL toolbars at startup
+4. **Superficial failure-mode probing** — asked "Should BadgeText be int?" instead of "What if this code runs on pages without badges?"
+5. **Overconfidence** — "Confidence: high" on a 1141-line, 26-file platform infrastructure change
+
+**These gaps led to:** Phase 4 reconciliation requirement, Step 5 hard gate, blast radius assessment, failure-mode probing, and confidence calibration rules.
diff --git a/.github/skills/code-review/references/review-rules.md b/.github/skills/code-review/references/review-rules.md
@@ -294,6 +294,8 @@ Rules distilled from 30 reverted PRs and 50 candidate-branch failures. When a PR
 
 | Check | What to look for |
 |-------|-----------------|
+| **🚨 New feature code must not execute for non-feature users** | The #1 cause of startup crashes: feature code that runs unconditionally for ALL instances, not just instances using the new feature. PR #34669 (badge support) wrapped ALL toolbar items in a Grid + InfoBadge overlay, causing 100% app startup crash across all platforms. **Always ask:** "If I never set `BadgeText`, does this code still execute? If yes, what happens with null/default values?" Flag any handler/extension code that modifies the rendering path for ALL items without a feature-usage guard. |
+| **HostApp test pages must not crash the app** | Test pages in `TestCases.HostApp/Issues/` are registered at startup. A page that references missing images (e.g., `IconImageSource = "envelope.png"` when `envelope.png` doesn't exist in the project) or uses APIs that throw on initialization will crash the ENTIRE HostApp, causing 100% UITest failure. Verify that referenced resources exist: `find src/Controls/tests/TestCases.HostApp -name "resourcename.png"`. (PR #34669 — 188 test failures from app crash) |
 | **CollectionView changes need broad scenario coverage** | CV is the single highest-regression component (15 candidate failures). Any change to layout, scroll, spacing, cell alignment, Header/Footer, or `KeepLastItemInView` must be tested across all four: empty collection, single item, many items, and with grouping. A fix for one layout scenario routinely breaks another. (CollectionView candidate failures, multiple PRs) |
 | **Style/theme changes have cascading effects** | `ApplyToDerivedTypes`, implicit styles, and `AppThemeBinding` interact in non-obvious ways. A fix to one style propagation path often breaks another — this pattern caused two separate reverts for PR #9648 and broke source gen in PR #32728. When touching style resolution or `AppThemeColor`, test: explicit style, implicit style, derived-type style, and dark/light theme switching. |
 | **Test the fix scenario AND adjacent scenarios** | Most reverts happen because the fix works for the reported issue but breaks a neighboring case. ToolbarItem image fix (PR #28833, reverted twice) fixed one image mode while breaking others. Entry `SelectionLength` (PR #26213) fixed selection but broke focus. Require authors to enumerate what adjacent behaviors they checked. |
@@ -307,6 +309,7 @@ Rules distilled from 30 reverted PRs and 50 candidate-branch failures. When a PR
 | **Arithmetic in index/position calculations needs explicit parentheses** | mattleibow on PR #23369: "This line of code is a bit ambiguous for a quick read." Silent operator-precedence bugs in scroll offset, index math, or spacing calculations are hard to spot and have caused gesture/tap regressions. Flag expressions like `a + b * c` or `offset - padding / 2` without explicit parentheses when they appear in position or size computations. |
 | **Major dependency upgrades need broad platform validation** | WindowsAppSDK upgrade (PR #32174) was reverted because it broke too many things simultaneously. Flag PRs that bump `WindowsAppSDK`, `Microsoft.Maui.*` NuGet versions, or other platform SDK dependencies, and ask whether CI was green on all platforms (Android, iOS, MacCatalyst, Windows) before merge, not just the changed platform. |
 | **ContentPresenter BindingContext propagation breaks explicit TemplateBindings** | Propagating `BindingContext` through `ContentPresenter` overwrites `{TemplateBinding}` values that were set explicitly. This was a reverted PR. Flag any handler or renderer change that sets or propagates `BindingContext` on a `ContentPresenter` or control template root without verifying that `TemplateBinding` expressions still resolve correctly. |
+| **Static dictionaries in extension methods survive handler disposal** | Static `ConcurrentDictionary` fields in extension methods (e.g., `ToolbarExtensions._badgeDrawables`) persist across handler connect/disconnect cycles. If `DisposeMenuItems` clears ALL entries globally instead of scoping to the current toolbar, it removes badges from other active toolbars during navigation transitions. Always scope static cleanup to the specific instance being disposed. (PR #34669) |
 
 ---
 
@@ -322,7 +325,7 @@ Components ranked by regression frequency from 366 analyzed PRs (reverts + candi
 | CarouselView | 7 | ScrollTo, CurrentItem, ItemSpacing, loop mode |
 | Gesture/Tap | 7 | TapGestureRecognizer, SwipeView, outside-tap dismiss |
 | Button/Entry | 7 | Dynamic resize, focus/selection, AppThemeBinding colors |
-| Toolbar | 5 | Icon color, back button, BarTextColor across modes |
+| Toolbar | 6 | Icon color, back button, BarTextColor, **badge feature startup crash (PR #34669)** |
 | Shell/TabBar | 4 | TabBarIsVisible, Shell crashes, section rendering |
 
 Use this table as a triage guide: PRs touching these components warrant a more thorough pass through sections 1–21 above, with particular attention to the adjacent-scenario rule (§21 row 3) and the component-specific rows in this section.

diff --git a/.github/skills/code-review/tests/eval.yaml b/.github/skills/code-review/tests/eval.yaml
@@ -15,6 +15,40 @@ scenarios:
       - "The agent never posts an approval or request-changes action via the GitHub API"
     timeout: 300
 
+  - name: "CI verification - agent must check CI status"
+    prompt: "code review PR #34669 in dotnet/maui"
+    assertions:
+      - type: "output_matches"
+        pattern: "(gh pr checks|maui-pr.*(pass|fail|success|failure|pending|cancelled|red|green|❌|✅|⏳))"
+    rubric:
+      - "The agent runs 'gh pr checks' or equivalent to verify CI status before delivering a verdict"
+      - "The agent does NOT claim 'clean build' or 'no CI failures' without evidence from gh pr checks"
+      - "If CI is failing, the verdict reflects this (not LGTM with 'high confidence')"
+      - "The agent produces a CI Status section in its output"
+    timeout: 300
+
+  - name: "Prior review reconciliation - don't contradict earlier findings"
+    prompt: "code review PR #34669 in dotnet/maui - note there were prior reviews on this PR"
+    assertions:
+      - type: "output_matches"
+        pattern: "(gh pr view .*--json reviews|gh api .*/reviews|Reviewer:\\s*@?[A-Za-z0-9_.-]+|No prior critical findings)"
+    rubric:
+      - "The agent checks for prior reviews on the PR and acknowledges any critical findings"
+      - "The agent does NOT silently contradict prior critical findings without explicit reconciliation"
+      - "If prior reviews flagged critical bugs, the agent verifies whether they were fixed"
+    timeout: 300
+
+  - name: "Blast radius - infrastructure changes flagged"
+    prompt: "code review PR #34669 in dotnet/maui — this PR adds new BindableProperties to ToolbarItem and modifies platform toolbar extensions"
+    assertions:
+      - type: "output_matches"
+        pattern: "(?is)Blast Radius Assessment.*Runs for all instances:\\s*(yes|no)"
+    rubric:
+      - "The agent assesses whether the code changes affect ALL toolbar items or only badge-enabled items"
+      - "The agent asks failure-mode questions about startup behavior and null/default property values"
+      - "The confidence is NOT 'high' for a 1100+ line infrastructure change"
+    timeout: 300
+
   - name: "Negative trigger - informational query about a PR"
     prompt: "What does PR #34727 change in dotnet/maui? Just give me a summary."
     assertions:
@@ -24,6 +58,8 @@ scenarios:
         value: "NEEDS_DISCUSSION"
       - type: "output_not_contains"
         value: "Devil's Advocate"
+      - type: "output_not_contains"
+        value: "Failure-Mode Probing"
     rubric:
       - "The agent provides a plain summary without launching a structured multi-step review workflow"
       - "The agent does NOT load review-rules.md or walk through MAUI-specific review checklists"