diff --git a/.github/skills/code-review/SKILL.md b/.github/skills/code-review/SKILL.md index 1a8d1bb0b37b..e115a02b0474 100644 --- a/.github/skills/code-review/SKILL.md +++ b/.github/skills/code-review/SKILL.md @@ -28,7 +28,7 @@ Standalone skill that evaluates PR code changes for correctness, safety, perform 2. **Full-context** — Read entire source files, not just diffs. Check callers, consumers, and git history 3. **Empirical grounding** — Reference specific code, line numbers, and call sites. No vague concerns 4. **Severity calibration** — Distinguish errors from warnings from suggestions. Not everything is critical -5. **Devil's advocate** — Challenge your own conclusions before finalizing +5. **Failure-mode probing** — Challenge your own conclusions with real failure scenarios, not softballs ## Inputs @@ -108,26 +108,60 @@ Now read the PR description, linked issue, and comments. Treat these as **claims 2. If the PR claims a bug fix, verify the root cause analysis matches the code 3. Check existing review comments to avoid duplicating feedback -### Step 5: Check CI Status +#### 🚨 Prior Review Reconciliation + +Check for prior reviews on the same PR — from the Copilot PR reviewer bot, other agents, or human reviewers: ```bash -gh pr checks +gh pr view --json reviews --jq '.reviews[] | select((.body // "") != "") | "Reviewer: \(.author.login) | State: \(.state)\n\(.body[0:10000])\n---"' ``` -Review CI results. **Never post ✅ LGTM if any required CI check is failing.** If CI is failing: -- If caused by the PR's code changes, flag as ❌ error -- If a known infrastructure issue or pre-existing flake, note it but still use ⚠️ -- If the PR description acknowledges the failure, note it in the summary +**If prior reviews flagged ❌ Error-level issues:** +- Verify whether each ❌ Error finding was addressed in subsequent commits +- If unresolved → verdict must be `NEEDS_CHANGES` +- If status cannot be determined → default to unresolved (caution over optimism) +- **NEVER silently drop or contradict a prior ❌ Error finding** — confirm it no longer applies to current code before dismissing + +### Step 5: Blast Radius, Failure-Mode Probing, and Verdict + +#### Blast Radius Assessment + +**Required when PR modifies:** handlers, platform extensions, toolbar/navigation code, page registration, static state, `PropertyChanged` subscriptions, or startup paths. + +| Question | Why It Matters | +|----------|---------------| +| Does this code run for ALL instances, or only when the new feature is used? | Feature code that runs unconditionally is the #1 cause of startup crashes | +| Does this code run at app startup or page initialization? | Static fields initialized on first access can crash the app before any test page loads | +| Are there new static/shared state fields that affect all pages/windows? | Static state survives handler disposal unless explicitly scoped | +| What happens at startup with null/default values for new properties? | New BindableProperty with null default must not cause NullRef in platform code paths | + +#### Failure-Mode Probing -### Step 6: Devil's Advocate and Verdict +**Do NOT ask easy rhetorical questions.** Probe genuinely challenging failure modes: -Before finalizing your verdict: +- What happens if this code runs on items/pages that DON'T use the new feature? +- What happens during handler disconnect/reconnect (navigation, Shell tab switch)? +- What happens with null `Parent`, `Handler`, `BindingContext`, or `PlatformView`? +- Can multiple subscriptions accumulate across handler lifecycle (missing unsubscribe)? +- Does static state survive page disposal and get stale? -1. **Challenge your findings** — For each issue you flagged, ask: "Am I sure, or am I guessing?" -2. **Challenge your approval** — If you're leaning LGTM, ask: "What could go wrong that I'm not seeing?" -3. **Check platform blind spots** — If the change touches platforms you can't fully reason about, say so explicitly +#### Confidence Calibration -Then deliver your verdict: +| Blast Radius | Max Confidence | +|-------------|----------------| +| Localized change, non-startup, non-infrastructure | May be **high** | +| Platform-specific handler/UI plumbing | Max **medium** | +| Shared infrastructure, startup path, global static state | Max **low** | + +**Then cap by evidence:** + +| Evidence | Confidence Cap | +|----------|---------------| +| CI red or pending | Max **low** — invoke `azdo-build-investigator` skill for CI analysis before finalizing | +| No relevant tests run (UITests skip PR builds) | Max **low** | +| Prior ❌ Error findings unresolved | **NEEDS_CHANGES** (no LGTM) | + +#### Deliver Verdict - **`LGTM`** — Code is correct, safe, and consistent with MAUI patterns. Ready for human approval. - **`NEEDS_CHANGES`** — Concrete issues found that should be addressed before merge. @@ -155,6 +189,18 @@ Then deliver your verdict: **Author claims:** [Summary of PR description] **Agreement/disagreement:** [Where your assessment matches or differs] +### Prior Review Reconciliation +| Prior ❌ Error Finding | Source | Status | Evidence | +|------------------------|--------|--------|----------| +| [finding] | [reviewer] | ✅ Fixed / ❌ Unresolved / 🔄 Obsolete | [evidence] | +*(If no prior reviews with ❌ Error findings, state "No prior ❌ Error findings found.")* + +### Blast Radius Assessment +*(Required for infrastructure/handler/platform changes; omit for simple fixes)* +- Runs for all instances: [yes/no — explanation] +- Startup impact: [yes/no] +- Static/shared state: [yes/no] + ### Findings #### ❌ Error — [Brief description] @@ -166,11 +212,12 @@ Then deliver your verdict: #### 💡 Suggestion — [Brief description] [Explanation] -### Devil's Advocate -[Challenges to your own conclusions] +### Failure-Mode Probing +- [Probe]: [Answer — what actually happens in this scenario] +- [Probe]: [Answer] ### Verdict: LGTM / NEEDS_CHANGES / NEEDS_DISCUSSION -**Confidence:** high / medium / low +**Confidence:** high / medium / low *(justified against calibration table)* **Summary:** [2-3 sentences explaining the verdict] ``` @@ -179,11 +226,13 @@ Then deliver your verdict: ## Verdict Consistency Rules 1. **The verdict must match your most severe finding.** If you have any ❌ Error findings, the verdict must be `NEEDS_CHANGES`. If only ⚠️ Warnings, use judgment but explain. -2. **Devil's advocate before finalizing.** Re-read all findings. For each warning, ask: "Would I be comfortable if this merged as-is?" +2. **Failure-mode probing before finalizing.** Re-read all findings. For each warning, ask: "Would I be comfortable if this merged as-is?" 3. **Never approve what you can't verify.** If the fix touches platform code you can't fully reason about, say so explicitly and use `NEEDS_DISCUSSION`. 4. **LGTM means no ❌ Errors.** You can LGTM with 💡 Suggestions. You can LGTM with ⚠️ Warnings only if you've explained why they're acceptable. -5. **🚨 NEVER use `--approve` or `--request-changes` on GitHub.** Only post comments. Approval is a human decision. -6. **Write findings to disk, do not post directly.** The agent does not have the GitHub comment token. Write findings to `CustomAgentLogsTmp/PRState/{PR}/PRAgent/` — the CI pipeline or posting scripts handle GitHub interaction. +5. **Prior ❌ Error findings override.** If any prior review flagged an ❌ Error-level issue (using this skill's severity taxonomy) that remains unresolved in the current code, verdict must be `NEEDS_CHANGES` regardless of your own assessment. Confirm the finding still applies to the current diff before applying the override. +6. **Never LGTM if CI is red.** If required CI checks are failing, invoke `azdo-build-investigator` to determine whether failures are PR-caused. Do not post `LGTM` until CI passes or failures are confirmed unrelated. +7. **🚨 NEVER use `--approve` or `--request-changes` on GitHub.** Only post comments. Approval is a human decision. +8. **Write findings to disk, do not post directly.** The agent does not have the GitHub comment token. Write findings to `CustomAgentLogsTmp/PRState/{PR}/PRAgent/` — the CI pipeline or posting scripts handle GitHub interaction. --- @@ -214,8 +263,11 @@ In CI (`eng/pipelines/ci-copilot.yml`), `Review-PR.ps1` calls both `post-inline- - [ ] Full source files read (not just diffs) - [ ] Independent assessment formed before reading PR narrative +- [ ] Prior reviews checked and ❌ Error findings reconciled (Step 4) - [ ] MAUI-specific checklist walked through for each applicable section +- [ ] Blast radius assessed for infrastructure/handler/platform changes (Step 5) +- [ ] Failure-mode probing completed with real scenarios, not softballs (Step 5) - [ ] Findings categorized by severity (❌ / ⚠️ / 💡) -- [ ] Devil's advocate check performed -- [ ] Verdict is consistent with findings +- [ ] Confidence calibrated against blast radius and evidence tables (Step 5) +- [ ] Verdict is consistent with findings AND prior review reconciliation - [ ] Output follows the format above diff --git a/.github/skills/code-review/tests/eval.yaml b/.github/skills/code-review/tests/eval.yaml index f02d9d1ad109..06dc871408d0 100644 --- a/.github/skills/code-review/tests/eval.yaml +++ b/.github/skills/code-review/tests/eval.yaml @@ -81,3 +81,17 @@ scenarios: - "The agent provides a descriptive summary without triggering the full review workflow" - "No severity markers (❌/⚠️/💡) or verdicts appear in the output" timeout: 120 + + - name: "Blast radius - infrastructure changes get probed" + prompt: "code review PR #35223 in dotnet/maui — this PR changes back-navigation callback registration on Android. Hypothesis: the registration may run unconditionally for all activities, not just those using predictive back. Please verify or refute." + assertions: + - type: "output_matches" + pattern: "(blast radius|all instances|unconditional|callback registration)" + - type: "output_matches" + pattern: "(medium|low)" + rubric: + - "The agent assesses blast radius for handler/platform changes (does this run for all instances?)" + - "The agent probes real failure modes, not softballs (e.g., handler disconnect, null PlatformView)" + - "The agent provides evidence-backed acceptance or refutation of the hypothesis — not just echoing it" + - "The confidence is calibrated — not 'high' for platform infrastructure changes" + timeout: 300