Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 73 additions & 21 deletions .github/skills/code-review/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ Standalone skill that evaluates PR code changes for correctness, safety, perform
2. **Full-context** — Read entire source files, not just diffs. Check callers, consumers, and git history
3. **Empirical grounding** — Reference specific code, line numbers, and call sites. No vague concerns
4. **Severity calibration** — Distinguish errors from warnings from suggestions. Not everything is critical
5. **Devil's advocate** — Challenge your own conclusions before finalizing
5. **Failure-mode probing** — Challenge your own conclusions with real failure scenarios, not softballs

## Inputs

Expand Down Expand Up @@ -108,26 +108,60 @@ Now read the PR description, linked issue, and comments. Treat these as **claims
2. If the PR claims a bug fix, verify the root cause analysis matches the code
3. Check existing review comments to avoid duplicating feedback

### Step 5: Check CI Status
#### 🚨 Prior Review Reconciliation

Check for prior reviews on the same PR — from the Copilot PR reviewer bot, other agents, or human reviewers:

```bash
gh pr checks <PR_NUMBER>
gh pr view <PR_NUMBER> --json reviews --jq '.reviews[] | select((.body // "") != "") | "Reviewer: \(.author.login) | State: \(.state)\n\(.body[0:10000])\n---"'
```

Review CI results. **Never post ✅ LGTM if any required CI check is failing.** If CI is failing:
- If caused by the PR's code changes, flag as ❌ error
- If a known infrastructure issue or pre-existing flake, note it but still use ⚠️
- If the PR description acknowledges the failure, note it in the summary
**If prior reviews flagged ❌ Error-level issues:**
- Verify whether each ❌ Error finding was addressed in subsequent commits
- If unresolved → verdict must be `NEEDS_CHANGES`
- If status cannot be determined → default to unresolved (caution over optimism)
- **NEVER silently drop or contradict a prior ❌ Error finding** — confirm it no longer applies to current code before dismissing

### Step 5: Blast Radius, Failure-Mode Probing, and Verdict

#### Blast Radius Assessment

**Required when PR modifies:** handlers, platform extensions, toolbar/navigation code, page registration, static state, `PropertyChanged` subscriptions, or startup paths.

| Question | Why It Matters |
|----------|---------------|
| Does this code run for ALL instances, or only when the new feature is used? | Feature code that runs unconditionally is the #1 cause of startup crashes |
| Does this code run at app startup or page initialization? | Static fields initialized on first access can crash the app before any test page loads |
| Are there new static/shared state fields that affect all pages/windows? | Static state survives handler disposal unless explicitly scoped |
| What happens at startup with null/default values for new properties? | New BindableProperty with null default must not cause NullRef in platform code paths |

#### Failure-Mode Probing

### Step 6: Devil's Advocate and Verdict
**Do NOT ask easy rhetorical questions.** Probe genuinely challenging failure modes:

Before finalizing your verdict:
- What happens if this code runs on items/pages that DON'T use the new feature?
- What happens during handler disconnect/reconnect (navigation, Shell tab switch)?
- What happens with null `Parent`, `Handler`, `BindingContext`, or `PlatformView`?
- Can multiple subscriptions accumulate across handler lifecycle (missing unsubscribe)?
- Does static state survive page disposal and get stale?

1. **Challenge your findings** — For each issue you flagged, ask: "Am I sure, or am I guessing?"
2. **Challenge your approval** — If you're leaning LGTM, ask: "What could go wrong that I'm not seeing?"
3. **Check platform blind spots** — If the change touches platforms you can't fully reason about, say so explicitly
#### Confidence Calibration

Then deliver your verdict:
| Blast Radius | Max Confidence |
|-------------|----------------|
| Localized change, non-startup, non-infrastructure | May be **high** |
| Platform-specific handler/UI plumbing | Max **medium** |
| Shared infrastructure, startup path, global static state | Max **low** |

**Then cap by evidence:**

| Evidence | Confidence Cap |
|----------|---------------|
| CI red or pending | Max **low** — invoke `azdo-build-investigator` skill for CI analysis before finalizing |
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] Regression Prevention - This keeps the CI verdict rule, but the workflow no longer has a required step that fetches CI status before finalizing. The previous explicit gh pr checks <PR_NUMBER> step was removed, so an agent following the new workflow can reach verdict delivery without collecting the evidence needed to know whether CI is red or pending. Concrete scenario: required checks are failing, but the reviewer never runs a checks command and still emits LGTM. Add a mandatory CI-status collection step/output field before confidence/verdict calibration.

| No relevant tests run (UITests skip PR builds) | Max **low** |
| Prior ❌ Error findings unresolved | **NEEDS_CHANGES** (no LGTM) |

#### Deliver Verdict

- **`LGTM`** — Code is correct, safe, and consistent with MAUI patterns. Ready for human approval.
- **`NEEDS_CHANGES`** — Concrete issues found that should be addressed before merge.
Expand Down Expand Up @@ -155,6 +189,18 @@ Then deliver your verdict:
**Author claims:** [Summary of PR description]
**Agreement/disagreement:** [Where your assessment matches or differs]

### Prior Review Reconciliation
| Prior ❌ Error Finding | Source | Status | Evidence |
|------------------------|--------|--------|----------|
| [finding] | [reviewer] | ✅ Fixed / ❌ Unresolved / 🔄 Obsolete | [evidence] |
*(If no prior reviews with ❌ Error findings, state "No prior ❌ Error findings found.")*

### Blast Radius Assessment
*(Required for infrastructure/handler/platform changes; omit for simple fixes)*
- Runs for all instances: [yes/no — explanation]
- Startup impact: [yes/no]
- Static/shared state: [yes/no]

### Findings

#### ❌ Error — [Brief description]
Expand All @@ -166,11 +212,12 @@ Then deliver your verdict:
#### 💡 Suggestion — [Brief description]
[Explanation]

### Devil's Advocate
[Challenges to your own conclusions]
### Failure-Mode Probing
- [Probe]: [Answer — what actually happens in this scenario]
- [Probe]: [Answer]

### Verdict: LGTM / NEEDS_CHANGES / NEEDS_DISCUSSION
**Confidence:** high / medium / low
**Confidence:** high / medium / low *(justified against calibration table)*
**Summary:** [2-3 sentences explaining the verdict]
```

Expand All @@ -179,11 +226,13 @@ Then deliver your verdict:
## Verdict Consistency Rules

1. **The verdict must match your most severe finding.** If you have any ❌ Error findings, the verdict must be `NEEDS_CHANGES`. If only ⚠️ Warnings, use judgment but explain.
2. **Devil's advocate before finalizing.** Re-read all findings. For each warning, ask: "Would I be comfortable if this merged as-is?"
2. **Failure-mode probing before finalizing.** Re-read all findings. For each warning, ask: "Would I be comfortable if this merged as-is?"
3. **Never approve what you can't verify.** If the fix touches platform code you can't fully reason about, say so explicitly and use `NEEDS_DISCUSSION`.
4. **LGTM means no ❌ Errors.** You can LGTM with 💡 Suggestions. You can LGTM with ⚠️ Warnings only if you've explained why they're acceptable.
5. **🚨 NEVER use `--approve` or `--request-changes` on GitHub.** Only post comments. Approval is a human decision.
6. **Write findings to disk, do not post directly.** The agent does not have the GitHub comment token. Write findings to `CustomAgentLogsTmp/PRState/{PR}/PRAgent/` — the CI pipeline or posting scripts handle GitHub interaction.
5. **Prior ❌ Error findings override.** If any prior review flagged an ❌ Error-level issue (using this skill's severity taxonomy) that remains unresolved in the current code, verdict must be `NEEDS_CHANGES` regardless of your own assessment. Confirm the finding still applies to the current diff before applying the override.
6. **Never LGTM if CI is red.** If required CI checks are failing, invoke `azdo-build-investigator` to determine whether failures are PR-caused. Do not post `LGTM` until CI passes or failures are confirmed unrelated.
7. **🚨 NEVER use `--approve` or `--request-changes` on GitHub.** Only post comments. Approval is a human decision.
8. **Write findings to disk, do not post directly.** The agent does not have the GitHub comment token. Write findings to `CustomAgentLogsTmp/PRState/{PR}/PRAgent/` — the CI pipeline or posting scripts handle GitHub interaction.

---

Expand Down Expand Up @@ -214,8 +263,11 @@ In CI (`eng/pipelines/ci-copilot.yml`), `Review-PR.ps1` calls both `post-inline-

- [ ] Full source files read (not just diffs)
- [ ] Independent assessment formed before reading PR narrative
- [ ] Prior reviews checked and ❌ Error findings reconciled (Step 4)
- [ ] MAUI-specific checklist walked through for each applicable section
- [ ] Blast radius assessed for infrastructure/handler/platform changes (Step 5)
- [ ] Failure-mode probing completed with real scenarios, not softballs (Step 5)
- [ ] Findings categorized by severity (❌ / ⚠️ / 💡)
- [ ] Devil's advocate check performed
- [ ] Verdict is consistent with findings
- [ ] Confidence calibrated against blast radius and evidence tables (Step 5)
- [ ] Verdict is consistent with findings AND prior review reconciliation
- [ ] Output follows the format above
14 changes: 14 additions & 0 deletions .github/skills/code-review/tests/eval.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -81,3 +81,17 @@ scenarios:
- "The agent provides a descriptive summary without triggering the full review workflow"
- "No severity markers (❌/⚠️/💡) or verdicts appear in the output"
timeout: 120

- name: "Blast radius - infrastructure changes get probed"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Testing — This scenario reuses the same live PR (#32278) as the existing "Verdict consistency" scenario
Flagged by: 2/3 reviewers

PR #32278 is referenced by both this new scenario and the existing "Verdict consistency - errors must map to NEEDS_CHANGES" scenario (line 59). Consequences:

  1. Attribution ambiguity in CI. If PR [Android] Refactor WindowInsetListener to per-view registry with MauiWindowInsetListener #32278's state changes (force-push, branch retargeting, etc.) both scenarios fail at once and the CI signal can't isolate which invariant broke.
  2. Compounded brittleness. Prior multi-model review Finding Aloha System.Maui! #4 was "depends on mutable live PR state with no anchoring." The fix-commit added a specific claim to the prompt, but the underlying instability — live PR + same anchor PR across multiple scenarios — remains.

Suggested fix: Pick a different stable, merged PR for one of the two scenarios so failures localize cleanly.

prompt: "code review PR #35223 in dotnet/maui — this PR changes back-navigation callback registration on Android. Hypothesis: the registration may run unconditionally for all activities, not just those using predictive back. Please verify or refute."
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Testing — Hypothesis is ambiguous about temporal state (pre-PR vs post-PR)

Flagged by: 2/3 reviewers

PR #35223 is the FIX that removes unconditional registration — its root cause description literally says "MauiAppCompatActivity registering a predictive-back callback unconditionally on Android 13+." The hypothesis "the registration may run unconditionally for all activities" is therefore:

  • TRUE for pre-PR code (the bug being fixed)
  • FALSE for post-PR code (this PR makes it conditional)

The prompt doesn't tell the agent which state to evaluate, and the rubric line "The agent provides evidence-backed acceptance or refutation of the hypothesis — not just echoing it" credits either outcome as long as evidence is provided. A correct refutation and an incorrect acceptance both pass — the test is non-discriminating.

Suggested fix: Reframe so refutation is the unique correct answer, e.g.: "Hypothesis: even after this PR, the registration still runs unconditionally for all activities. Please verify or refute." This makes a wrong "accept" fail the rubric on factual grounds.

Non-blocking — eval design refinement.

assertions:
- type: "output_matches"
pattern: "(blast radius|all instances|unconditional|callback registration)"
- type: "output_matches"
pattern: "(medium|low)"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Testing(medium|low) substring match is too permissive (round-3 self-introduced)

Flagged by: 2/3 reviewers

The output_matches assertion uses a bare substring regex, so any output containing "medium" or "low" anywhere passes — including innocuous phrases like "a medium-sized change", "low risk of regression", or "medium-priority bug". The skill's own output format at SKILL.md:219 declares confidence as **Confidence:** high | medium | low, so the assertion can be made much more discriminating without losing valid matches.

Suggested fix:

- type: "output_matches"
  pattern: "[Cc]onfidence[:*]*\\s*(medium|low)"

or pin to the exact field name used in the skill's structured output.

Non-blocking — eval design refinement.

rubric:
- "The agent assesses blast radius for handler/platform changes (does this run for all instances?)"
- "The agent probes real failure modes, not softballs (e.g., handler disconnect, null PlatformView)"
- "The agent provides evidence-backed acceptance or refutation of the hypothesis — not just echoing it"
- "The confidence is calibrated — not 'high' for platform infrastructure changes"
timeout: 300
Loading