Add standalone code-review skill with maintainer-sourced review rules #34265
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34265
Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34265"
Pull request overview
Adds a new code-review skill and wires it into the PR agent’s post-gate workflow to qualitatively assess fix quality (to potentially skip expensive try-fix exploration) and to rank multiple passing fix candidates beyond simple pass/fail.
Changes:
- Introduces `.github/skills/code-review/SKILL.md` with two modes: pre-fix triage and post-fix comparison.
- Updates PR agent workflow docs to add Phase 2.5 (triage) and Phase 3.5 (comparison).
- Updates `.github/copilot-instructions.md` to list the new skill and fixes skill numbering.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/skills/code-review/SKILL.md | New skill definition for qualitative code review (triage + compare). |
| .github/agents/pr/post-gate.md | Inserts the new code-review phases into the post-gate workflow and guidance. |
| .github/agents/pr.md | Updates the workflow diagram to reflect new post-gate steps. |
| .github/copilot-instructions.md | Documents the new skill in the skills list and corrects numbering. |
.github/skills/code-review/SKILL.md
Outdated
| Field | Description |
|-------|-------------|
| `verdict` | `LGTM`, `NEEDS_REVIEW`, `NEEDS_CHANGES`, or `SKIP_FIX_PHASE` |
The Outputs table lists verdict values (LGTM, NEEDS_REVIEW, NEEDS_CHANGES, SKIP_FIX_PHASE) but the rest of the skill (triage workflow + output format) only defines SKIP_FIX_PHASE/NEEDS_REVIEW, and compare mode doesn’t define how LGTM/NEEDS_CHANGES apply. Please align the documented verdict values with the two modes (e.g., triage: SKIP_FIX_PHASE|NEEDS_REVIEW; compare: omit verdict and require a ranking + selection), to avoid ambiguous invocations/outputs.
Suggested change:
- | `verdict` | `LGTM`, `NEEDS_REVIEW`, `NEEDS_CHANGES`, or `SKIP_FIX_PHASE` |
+ | `verdict` | **Triage mode only.** Either `SKIP_FIX_PHASE` or `NEEDS_REVIEW`. Compare mode omits `verdict` and instead uses `recommendation` for ranking/selection. |
4. Present unified review noting which findings had multi-model agreement

**Timeout:** If a sub-agent hasn't completed after 5 minutes, proceed with available results.
This section recommends running a multi-model review “in parallel”, but the PR agent shared rules strongly emphasize “SEQUENTIAL ONLY” for multi-model workflows (to avoid confusion with try-fix). Consider adding an explicit note that parallelism here is OK because code-review is read-only, and that the sequential-only restriction still applies to try-fix executions.
Suggested change:
+ > **Note:** Parallelism is allowed here because code review is a **read-only** workflow and does **not** invoke `try-fix` or modify the PR. The global **"SEQUENTIAL ONLY"** rule for multi-model workflows still applies to `try-fix` executions and any mutating workflows. Never run multiple `try-fix` attempts in parallel.
.github/agents/pr/post-gate.md
Outdated
| 2.5 | **Code Review (Triage)** | Invoke `code-review` skill to assess PR's fix quality; may skip Fix phase if fix is excellent |
| 3 | **Fix** | Invoke `try-fix` skill repeatedly to explore independent alternatives, then compare with PR's fix |
| 3.5 | **Code Review (Comparison)** | Invoke `code-review` skill to rank passing candidates on quality dimensions beyond pass/fail |
| 4 | **Report** | Deliver result (approve PR, request changes, or create new PR) |
This file is titled "Post-Gate Phases (3-4)", but the workflow overview now includes phases 2.5 and 3.5. To keep the doc self-consistent (and easier to search/reference), update the header/intro to reflect the new phase range (e.g., “2.5–4” or “Post-Gate Phases (2.5, 3, 3.5, 4)”).
.github/agents/pr/post-gate.md
Outdated
## Common Mistakes in Post-Gate Phases

- ❌ **Skipping Code Review Triage** - Always run triage before Fix phase (saves compute when PR is already good)
The “Common Mistakes” bullet says to “Always run triage before Fix phase”, but earlier in Phase 2.5 you note triage applies only when starting from an existing PR (and should be skipped when starting from an issue with no PR). Please reword this bullet to be conditional (e.g., “When reviewing an existing PR, don’t skip triage”) so the guidance isn’t contradictory.
Suggested change:
- - ❌ **Skipping Code Review Triage** - Always run triage before Fix phase (saves compute when PR is already good)
+ - ❌ **Skipping Code Review Triage** - When reviewing an existing PR, don't skip triage before the Fix phase (saves compute when the PR is already good)
.github/agents/pr.md
Outdated
│ 1. Pre-Flight → 2. Gate │ ──► │ 2.5 Code Review → 3. Fix → 3.5 Code Review → 4. Report│
│ ⛔ │ │ (Triage) │ (Comparison) │
│ MUST PASS │ │ │ │ │
│ │ │ SKIP_FIX_PHASE ────────────────────────► 4. Report │
│ │ │ (Only read after Gate ✅ PASSED) │
The diagram now shows post-gate includes phases 2.5 and 3.5, but the preceding sentence still says post-gate covers “Phases 3-4”. Please update that reference to match the new workflow (e.g., “Phases 2.5–4”), so readers don’t miss the Code Review phases.
0f364c1 to 57aafaf
Inspired by dotnet/runtime's code-review skill, this adds a qualitative evaluation step to the MAUI PR agent pipeline.

Two insertion points:
- Phase 2.5 (Pre-Fix Triage): After Gate passes, assess whether the PR's fix is high-quality enough to skip the expensive multi-model Fix phase. Uses independence-first assessment (code before narrative) to avoid anchoring bias.
- Phase 3.5 (Post-Fix Comparison): After try-fix exploration, rank passing candidates on root cause, simplicity, robustness, safety, consistency, and maintainability, not just 'tests pass'.

Key patterns borrowed from runtime:
- Independence-first review (assess code before reading PR description)
- Devil's advocate check before finalizing verdicts
- Severity calibration (error/warning/suggestion)
- Multi-model review for diverse perspectives

MAUI-specific additions:
- Handler lifecycle checks (ConnectHandler/DisconnectHandler)
- Platform threading rules (Android UI thread)
- CollectionView handler detection (Items/ vs Items2/)
- Platform file extension awareness (.ios.cs, .android.cs, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove PR agent pipeline coupling (triage/compare modes, state_file, SKIP_FIX_PHASE, candidates table)
- Revert pr.md and post-gate.md workflow changes (skill is standalone now)
- Expand MAUI-specific review checklist: handler lifecycle, CollectionView handler detection, safe area rules, threading, public API, test patterns, template conventions, obsolete APIs
- Simplify to three verdicts: LGTM, NEEDS_CHANGES, NEEDS_DISCUSSION
- Keep core methodology: independence-first, devil's advocate, severity calibration, multi-model review

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
57aafaf to 6ebc892
Mined the top 142 most-discussed merged PRs in dotnet/maui and extracted 487 prescriptive comments from senior maintainers (PureWeen: 290, mattleibow: 278, StephaneDelcroix: 80, and others). Synthesized into 20 categorized sections with 129 PR citations, following the structure used by dotnet/android's android-reviewer skill.

Key additions:
- references/review-rules.md: 305 lines of maintainer-sourced review rules
- SKILL.md: Added Step 2 (load review rules), Step 5 (CI status check), review output constraints (don't pile on, don't flag what CI catches)

Categories: Handler lifecycle, memory/leaks, safe area, layout, platform code, Android, iOS, Windows, navigation, CollectionView, threading, XAML, testing, performance, error handling, API design, images, gestures, build, accessibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
9daa89b to 004b12c
Replace ~90-line inline MAUI-Specific Review Checklist with a compact pointer to references/review-rules.md. SKILL.md is now workflow-only (~180 lines), rules live in review-rules.md (305 lines). This follows the dotnet/android pattern: SKILL.md = workflow, review-rules.md = knowledge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added §21 (13 regression prevention rules) and §22 (component risk table) sourced from 30 reverted PRs, 50 candidate-branch failures, and 64 regression fixes. File now has 346 lines, 22 sections, 138 PR citations.

Key additions: CollectionView broad coverage, style cascading effects, adjacent scenario testing, IVT removal auditing, measurement timing on iOS, template all-variant validation, candidate PR separation of concerns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| **Prefer delegates/Funcs over handler references** | Layout code uses `Func<>` callbacks to communicate without coupling to handler instances. Adopt this pattern: the platform view stores a `Func<>` rather than an `IViewHandler` reference. (PR #7886) |
| **Prefer static callbacks on iOS** | iOS tends to do better cleaning things up when the callback target is a static method. Move gesture recognizer callbacks and event handlers into static methods where feasible, passing state through the sender/tag. (PR #7886) |
| **Dispose `Java.Lang.Object` derivatives** | Android listeners that extend `Java.Lang.Object` must be disposed in `DisconnectHandler`: `_listener?.Dispose(); _listener = null;`. Missing Dispose leaks the Java peer. (PR #31022) |
| **Closures capturing UIKit views are leaks** | Lambdas that close over `UIView`, `UIScrollView`, or any `NSObject` subclass create hidden strong references that the iOS GC cannot break. Extract the view access into a local or use a weak reference capture. (PR #13499) |
You can also define any lambdas as `static` as a way to force no captured variables (briefly mentioned above).
| **Store handler references as `WeakReference`** | Platform views that communicate back to their handler must store `WeakReference<THandler>`, not a strong reference. A strong handler ↔ platform view cycle prevents collection, especially on iOS. (PR #7886, PR #20124) |
| **Prefer delegates/Funcs over handler references** | Layout code uses `Func<>` callbacks to communicate without coupling to handler instances. Adopt this pattern: the platform view stores a `Func<>` rather than an `IViewHandler` reference. (PR #7886) |
| **Prefer static callbacks on iOS** | iOS tends to do better cleaning things up when the callback target is a static method. Move gesture recognizer callbacks and event handlers into static methods where feasible, passing state through the sender/tag. (PR #7886) |
| **Dispose `Java.Lang.Object` derivatives** | Android listeners that extend `Java.Lang.Object` must be disposed in `DisconnectHandler`: `_listener?.Dispose(); _listener = null;`. Missing Dispose leaks the Java peer. (PR #31022) |
This won't necessarily leak if you forget it. If a handler is GC'd, the `_listener` should be as well. It's good guidance, but the wording is too strong.
It might actually be more important to call `view.SetListener(null)`, so that Java doesn't hold a reference to `_listener`.
| **Store `Context` in a local before repeated use** | Accessing Android's `Context` property calls into Java each time (JNI marshaling). Store it in a local: `var context = Context;` then use the local. Same applies to any property that crosses the Java/C# bridge. (PR #26789, jonathanpeppers) |
| **Android resource files need correct build action** | Files used in Android tests must be declared with the correct build action (e.g., `AndroidResource`, `EmbeddedResource`). A file at `/red.png` with no build action causes `FileNotFoundException` on Android only. (PR #14109, jonathanpeppers) |
| **Use `PlatformLogger` for Android native code** | In Java code under `src/Core/AndroidNative/`, use `PlatformLogger` for logging, not `android.util.Log` directly. This ensures consistent log formatting. (PR #29780, jonathanpeppers) |
| **Don't wrap `Process.Kill()` in `Task.Run` with timeout** | `Kill()` is not cancellable. Using `Task.Run` to timeout `Kill()` won't actually interrupt it. Use `Process.WaitForExit(timeout)` after `Kill()` instead. (PR #30941, jsuarezruiz) |
This was talking about something in a *.cake file, so maybe we should remove it.
I don't know when .NET MAUI would need to call Process.Kill() from inside an app.
- Soften Dispose wording: unsubscribe via SetListener(null) is often more important than Dispose since Java holds the reference
- Add static lambda tip to prevent captured variables
- Remove Process termination rule (originated from cake file, not app)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Generated 5 test scenarios covering happy path, negative trigger, independence-first principle, --approve anti-pattern, and verdict consistency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Skill Validator Consensus Report:
| Dimension | Result |
|---|---|
| Eval Generation | 🆕 Generated (5 scenarios, validated) |
| Anthropic Verdict | IMPROVE (7/10) |
| Consensus | IMPROVE |
Strengths (High Confidence)
- Independence-first assessment is the standout feature -- explicitly forbidding the agent from reading PR descriptions until Step 3 prevents anchoring bias
- review-rules.md is exceptionally rich -- 22 categories, 80+ checks, real PR numbers as evidence, plus a "What NOT to Flag" noise-reduction section
- Verdict consistency rules are unambiguous: `NEEDS_CHANGES` if any ❌ Error -- removes judgment calls
- Output format precisely specified with a literal markdown template
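
The verdict-consistency rule noted above can be illustrated with a small Python sketch. Only the rule "any Error means `NEEDS_CHANGES`", the three verdict names, and the error/warning/suggestion severities come from this PR; the `Finding` type, the `derive_verdict` function, and the mapping of warnings to `NEEDS_DISCUSSION` are assumptions made for illustration and may not match what SKILL.md actually specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One review finding (hypothetical type, not taken from SKILL.md)."""
    severity: str  # "error", "warning", or "suggestion"
    message: str


def derive_verdict(findings):
    """Map a list of findings to one of the three verdicts."""
    severities = {f.severity for f in findings}
    # Rule quoted in the report: any Error forces NEEDS_CHANGES.
    if "error" in severities:
        return "NEEDS_CHANGES"
    # Assumption: warnings without errors call for discussion.
    if "warning" in severities:
        return "NEEDS_DISCUSSION"
    # Assumption: suggestions alone (or no findings) do not block.
    return "LGTM"


if __name__ == "__main__":
    findings = [
        Finding("error", "listener is never detached in DisconnectHandler"),
        Finding("suggestion", "consider a static lambda"),
    ]
    print(derive_verdict(findings))
```

Keeping the mapping in one pure function is what makes the rule easy to check: the same findings always give the same verdict.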
Recommended Improvements
| # | Suggestion | Rationale |
|---|---|---|
| 1 | Expand trigger phrases in description to include "analyze code changes", "check PR code quality", "look at what changed in PR" | Trigger recall is low (5/10) -- common phrasings won't fire the skill |
| 2 | Add disambiguation note distinguishing `code-review` from `pr-review` and `pr-finalize` at top of SKILL.md | Both do "code review" -- users won't know which to invoke |
| 3 | Use absolute path `.github/skills/code-review/references/review-rules.md` in Step 2 | Agent won't reliably know skill directory at runtime |
| 4 | Add explicit comment-posting policy -- output to terminal only, never `gh pr review --comment` unless orchestrated | `pr-finalize` has this rule; `code-review` should match for consistency |
| 5 | ✅ eval.yaml committed with this review (5 scenarios) | Enables future automated regression detection |
Verdict
IMPROVE and merge. The skill is well-structured with a genuinely valuable independence-first approach and rich review rules. Address items 1-4 above before merging for best results.
…lute path, comment policy

1. Expand trigger phrases in description for better recall
2. Add disambiguation note distinguishing code-review from pr-review and pr-finalize
3. Use absolute path for review-rules.md in Step 2
4. Add explicit comment-posting policy (terminal only by default)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a PowerShell script (.github/scripts/Post-CodeReview.ps1) to format and post code review comments to GitHub PRs. The script wraps review content in a collapsible <details> block, includes PR metadata (commit SHA, title, author, URL), auto-detects a verdict (LGTM / NEEDS_CHANGES / NEEDS_DISCUSSION) to set a colored status dot, and supports dry-run, create-or-update (via an HTML marker), and safe posting via the gh CLI.

Also update .github/skills/code-review/SKILL.md to recommend using the Post-CodeReview.ps1 script (showing a DryRun example) and to advise reviewers not to post automatically. This change centralizes consistent review formatting and posting guidance.

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
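
The formatting steps this commit describes (verdict detection, a collapsible block, a hidden marker for create-or-update) can be sketched in Python. This is not the PowerShell script itself, whose source does not appear in this conversation: the marker string, the dot colours, the function names, and the exact layout are all hypothetical, and the sketch only builds the comment body without calling `gh`.

```python
import re

# Hypothetical marker; the real script's marker text is not shown in this PR.
MARKER = "<!-- code-review-sketch -->"
# Hypothetical colour mapping for the status dot.
DOTS = {"LGTM": "🟢", "NEEDS_CHANGES": "🔴", "NEEDS_DISCUSSION": "🟡"}


def detect_verdict(review_text):
    """Return the first verdict token found in the review, or None."""
    match = re.search(r"\b(LGTM|NEEDS_CHANGES|NEEDS_DISCUSSION)\b", review_text)
    return match.group(1) if match else None


def format_comment(review_text, sha, title, author, url):
    """Wrap a review in a collapsible block with PR metadata and a status dot."""
    verdict = detect_verdict(review_text)
    dot = DOTS.get(verdict, "⚪")
    summary = f"{dot} Code review: {verdict or 'no verdict found'}"
    return "\n".join([
        MARKER,
        f"**{title}** by {author} at `{sha[:7]}` ({url})",
        "<details>",
        f"<summary>{summary}</summary>",
        "",
        review_text.strip(),
        "",
        "</details>",
    ])


def find_existing(comment_bodies):
    """Index of the first comment carrying the marker, or None if absent."""
    for index, body in enumerate(comment_bodies):
        if MARKER in body:
            return index
    return None


if __name__ == "__main__":
    body = format_comment(
        "Verdict: NEEDS_CHANGES\n- listener never detached",
        "6ebc892abcdef", "Example title", "someone", "https://example.invalid/pr",
    )
    print(body)
```

In a create-or-update flow, a caller would run `find_existing` over the PR's current comments and edit the match instead of posting a second comment; a dry run would simply print the body.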
🔍 Skill Validator Consensus Report: code-review
Evaluated by: skill-validator (dotnet/skills) + Anthropic prompt analysis + live Claude CLI trigger testing
Key Metrics (from skill-validator)
🏆 Strengths
eval.yaml:
- Replace output_contains assertions with tool_call_before (tests behavior not text: gh pr diff before gh pr view)
- Increase timeouts to 300s for review scenarios
- Add 6th scenario: negative trigger for "summarize" queries
- Remove assertions that check for specific section headers

SKILL.md:
- Add explicit "Do NOT use for" exclusions (summarize, describe, info)
- Remove "look at what changed" from triggers (too close to info query)
- Restructure description as multi-line YAML for clarity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Skill Validator Re-Assessment (Post-Fix)
Previous Score: 7/10 (IMPROVE) → New Score: 9.5/10 (KEEP)
✅ All critical issues from our prior review have been addressed:
🔴 One Required Fix Before Merge:
- Replace tool_call_before/tool_call_not_contains with supported assertion types (output_matches, output_not_contains)
- Move independence-first behavioral checks to rubric items where the LLM pairwise judge handles them
- Fix "4-step" to "6-step" in multi-model review section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dotnet Validator — Final Re-Assessment (commit
| Scenario | Baseline | Isolated | Plugin | Δ | Verdict |
|---|---|---|---|---|---|
| Happy path – code review PR request | 1/5 | 2/5 🟢 | 3/5 🟢 | +0.17 | ✅ |
| Negative trigger – informational query | 5/5 | 4/5 🔴 | 4/5 🔴 | −0.30 | ❌ |
| Independence-first – diff before description | 1/5 | 1/5 | 2/5 🟢 | +0.07 | ✅ |
| Anti-pattern – never approve via GitHub API | 1/5 | 3/5 🟢 | 3/5 🟢 | +0.07 | ✅ |
| Verdict consistency – errors → NEEDS_CHANGES | 1/5 | 5/5 🟢🟢 | 5/5 🟢🟢 | +0.75 | ✅ |
| Negative trigger – describe changes query | 1/5 | 2/5 🟢 | 1/5 | −0.01 | ❌ |
Overfitting: 🔴 0.53 (just above the 0.5 threshold)
Version History
| Metric | v1 | v2 | v3 (this run) |
|---|---|---|---|
| Improvement score | +5.5% | +10.8% | +12.6% ✅ |
| Overfitting | 🔴 0.54 | 🟡 0.45 | 🔴 0.53 |
| `tool_call_before` crash | - | ❌ | ✅ fixed |
| Verdict consistency | 1→1/5 ⏰ | 1→5/5 | 1→5/5 |
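
The two thresholds quoted in this report (improvement above 10%, overfitting flagged at about 0.5) can be read as a simple two-part gate. The Python sketch below is illustrative only: `assess_run` and both constants are hypothetical names, not part of skill-validator, and treating a value exactly at 0.5 as flagged is an assumption. The inputs are the v1 to v3 figures from the table above.

```python
# Thresholds as quoted in the report; the names are hypothetical.
IMPROVEMENT_THRESHOLD = 0.10
OVERFITTING_FLAG = 0.50


def assess_run(improvement, overfitting):
    """Classify one eval run against the two quoted thresholds."""
    return {
        "improvement_ok": improvement > IMPROVEMENT_THRESHOLD,
        # Assumption: a value exactly at the threshold counts as flagged.
        "overfitting_flagged": overfitting >= OVERFITTING_FLAG,
    }


if __name__ == "__main__":
    # (improvement score, overfitting) per run, taken from the version history table.
    runs = {"v1": (0.055, 0.54), "v2": (0.108, 0.45), "v3": (0.126, 0.53)}
    for name, (improvement, overfitting) in runs.items():
        print(name, assess_run(improvement, overfitting))
```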
Strengths
- +12.6% improvement: above the 10% threshold; tool passes with `--verdict-warn-only`
- Verdict consistency: 1→5/5 both modes. With skill, agent correctly identifies a Java bridge memory leak (`FlyoutViewHandler` using `UnregisterView` instead of `RemoveViewWithLocalListener`) and delivers `NEEDS_CHANGES`. Without skill, baseline agent offloads to a background subprocess and returns nothing.
- `tool_call_before`/`tool_call_not_contains` removed: eval.yaml now runs cleanly with no crashes
- 44KB `review-rules.md` is paying off: LLM judges explicitly cite review-rules §1/§2 as driving the correct verdict
Weaknesses / Remaining Issues
- Negative trigger regression persists: Informational query drops 5→4/5 with skill loaded. Small but consistent across all three eval runs.
- Three scenarios use non-existent PRs: `#34000` is an issue (not a PR), `#33500` is an issue, `#33000` doesn't exist. This caps skill scores and inflates baselines artificially. Real improvement is likely higher than reported.
- Overfitting 0.53: Just above the flag threshold. Noisy at 1 run but has been flagged in 2 of 3 eval runs.
Suggested Fixes Before Merge
- Replace non-existent PR numbers in eval.yaml with real PRs that contain actual code changes (e.g., #34024 "[iOS] Fix SafeArea infinite layout cycle with parent hierarchy walk and pixel-level comparison", #33934 "[iOS] TranslateToAsync causes spurious SizeChanged events after animation completion, triggering infinite layout loops", #32586 "Layout issue using TranslateToAsync causes infinite property changed cycle on iOS" for happy-path and independence-first)
- Add `in dotnet/maui` to the describe-changes prompt. The current failure is the agent stopping to ask for the repo, not a skill misfire: "Summarize what PR #34100 does in dotnet/maui"
- Tighten negative trigger boundary: add explicit exclusions to the description frontmatter for "what does PR X do?" / "describe the changes" style queries
Summary
The skill works and the improvement is real — the verdict consistency scenario alone (1→5/5) justifies keeping it. The remaining gaps are in the eval.yaml, not the skill itself. Fix the PR numbers and the describe-changes prompt, and this is ready to merge.
- Replace non-existent PR numbers (#34000, #33500, #33000, #34100) with real merged PRs (#34024, #34727, #31202, #28713, #34723)
- Add "in dotnet/maui" to all prompts to prevent agent asking for repo
- All PRs verified as real merged PRs with actual code changes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
✅ Skill Validator Final Review: APPROVED
Score: 9.5/10 (KEEP)
All blockers from three review rounds have been resolved:
Non-blocking suggestions for fast-follow:
Ready to merge. No further changes required. (Skill Validator, final automated review, 3 rounds complete)
Dotnet Validator — v4 Re-Assessment (commit
| Scenario | Baseline | Isolated | Plugin | Notes |
|---|---|---|---|---|
| Happy path – code review #34024 | 1/5 | 1/5 ⏰ | 1/5 ⏰ | Timeout — agent gathered diff/rules but ran out of time before writing review |
| Negative trigger – informational #34727 | 5/5 | 5/5 | 5/5 | ✅ Skill correctly NOT activated |
| Independence-first – #31202 | 1/5 | 1/5 ⏰ | 1/5 ⏰ | Timeout — agent spent 300s on Windows/Unix shell compat, never delivered review |
| Anti-pattern – no approve #28713 | 2/5 | 4/5 🟢 | 5/5 🟢🟢 | ✅ Big win — refused approve, delivered full verdict |
| Verdict consistency – #32278 | 1/5 | 3/5 🟢 ⏰ | 5/5 🟢🟢 | ✅ Core scenario still works; isolated timed out before verdict line |
| Negative trigger – describe #34723 | 5/5 | 5/5 | 5/5 | ✅ Skill correctly NOT activated |
Overfitting: 🔴 0.50 (exactly at threshold — borderline)
Version History
| Metric | v1 | v2 | v3 | v4 (this run) |
|---|---|---|---|---|
| Improvement score | +5.5% | +10.8% | +12.6% | +16.7% ✅ |
| Overfitting | 🔴 0.54 | 🟡 0.45 | 🔴 0.53 | 🔴 0.50 (borderline) |
| `tool_call_before` crash | - | ❌ | ✅ | ✅ |
| PR numbers valid | ❌ | ❌ | ❌ | ✅ |
| Negative triggers | 🔴 regression | 🔴 | 🔴 | ✅ fixed |
What Improved
- Negative triggers fully resolved: Both "summarize" and "informational query" scenarios now hold at 5/5 with and without skill. Skill correctly does NOT activate on these, so the regression is gone.
- Anti-pattern (no approve): 2→5/5 in plugin mode. Agent explicitly refuses `--approve`, explains it's a human decision, and delivers a complete review.
- Improvement score jumped to +16.7%: best result across all runs, comfortably above the 10% threshold.
- Overfitting dropped to 0.50: at the boundary, trending down.
New Issue: Timeouts on Real PRs
With real PRs now in play, 2 of the 4 review scenarios hit the 300s ceiling before delivering their output:
- Happy path (#34024 "[iOS] Fix SafeArea infinite layout cycle with parent hierarchy walk and pixel-level comparison"): Agent did everything right (fetched diff first, loaded `review-rules.md`, examined source files, checked CI) but ran out of time before writing the actual review. Judge: "critically failed to produce any review output before the 300-second timeout."
- Independence-first (#31202 "LineHeight and decorations for HTML Label - fix"): Agent spent the entire budget on PowerShell/Unix shell compatibility errors (`head` command not available, base64 decoding issues on Windows). Never got to writing the review.
The skill's core behavior is sound — the agent IS doing the right things in the right order — it just needs more time. The independence-first Windows compat failure is partially an environment issue (the eval runs in a Windows workspace where Unix commands aren't available).
Verdict consistency (isolated: 3/5, plugin: 5/5)
The gap between isolated (3/5) and plugin (5/5) is interesting: isolated timed out before the agent could write the explicit NEEDS_CHANGES verdict line, even though it found the correct bug. Plugin mode benefited from more efficient context loading and completed. Both correctly identify the FlyoutViewHandler.DisconnectHandler → UnregisterView vs RemoveViewWithLocalListener asymmetry.
Recommended Final Fixes (eval.yaml only)
- Increase timeouts for review scenarios to 450–600s: The real PRs are larger than the dummy ones. The agent is doing the right work, it just needs more runway. Negative trigger timeouts (120s) are fine as-is.
- Add a Windows-compatible diff approach hint to SKILL.md: Agents in Windows eval environments use `head` (Unix) instead of `Select-Object -First` (PowerShell), burning 50–100s on compat errors before recovering. A one-liner in the "Environment" section of the skill would eliminate this waste.
Bottom Line
The skill is working. +16.7% improvement, both negative triggers clean, anti-pattern guardrail solid, and the core verdict-consistency scenario correctly identifies real bugs in real PRs. The remaining issue is purely eval infrastructure — timeout budget — not the skill itself. Recommend merging with timeout bumps to 450s for the 4 review scenarios.
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Summary
Adds a standalone `code-review` skill for reviewing PR code changes with MAUI-specific domain knowledge, modeled after dotnet/android's android-reviewer skill.
What's included
- `.github/skills/code-review/SKILL.md` (196 lines), Review workflow: `LGTM`, `NEEDS_CHANGES`, `NEEDS_DISCUSSION`
- `.github/skills/code-review/references/review-rules.md` (345 lines), Domain knowledge:
Design decisions
Changes to copilot-instructions.md