Improve code-review skill with regression prevention safeguards #35478
base: main
Changes from all commits
b7e4e3d
ef6b6c4
01a91ea
b82bdba
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -81,3 +81,17 @@ scenarios: | |
| - "The agent provides a descriptive summary without triggering the full review workflow" | ||
| - "No severity markers (❌/⚠️/💡) or verdicts appear in the output" | ||
| timeout: 120 | ||
|
|
||
| - name: "Blast radius - infrastructure changes get probed" | ||
|
💡 Testing: This scenario reuses the same live PR (#32278) as the existing "Verdict consistency" scenario
PR #32278 is referenced by both this new scenario and the existing "Verdict consistency - errors must map to NEEDS_CHANGES" scenario (line 59). Consequences:
Suggested fix: Pick a different stable, merged PR for one of the two scenarios so failures localize cleanly.
||
| prompt: "code review PR #35223 in dotnet/maui — this PR changes back-navigation callback registration on Android. Hypothesis: the registration may run unconditionally for all activities, not just those using predictive back. Please verify or refute." | ||
|
💡 Testing: Hypothesis is ambiguous about temporal state (pre-PR vs post-PR)
Flagged by: 2/3 reviewers
PR #35223 is the fix that removes unconditional registration; its root cause description literally says "MauiAppCompatActivity registering a predictive-back callback unconditionally on Android 13+." The hypothesis "the registration may run unconditionally for all activities" is therefore true of the pre-PR code but false of the post-PR code.
The prompt doesn't tell the agent which state to evaluate, and the rubric line "The agent provides evidence-backed acceptance or refutation of the hypothesis — not just echoing it" credits either outcome as long as evidence is provided. A correct refutation and an incorrect acceptance both pass, so the test is non-discriminating.
Suggested fix: Reframe so refutation is the unique correct answer, e.g.: "Hypothesis: even after this PR, the registration still runs unconditionally for all activities. Please verify or refute." This makes a wrong "accept" fail the rubric on factual grounds.
Non-blocking: eval design refinement.
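The non-discrimination point can be made concrete with a toy grader. This is only a sketch: `passes_original` and `passes_reframed` are hypothetical stand-ins for how a rubric judge might score the two prompt wordings, not the eval framework's actual grading logic, and the sketch assumes the post-PR code no longer registers the callback unconditionally.

```python
# Toy model of the rubric (hypothetical helper names, not the eval framework's API).
# Assumption: after the PR, "refute" is the factually correct conclusion.

def passes_original(conclusion: str, has_evidence: bool) -> bool:
    """Original wording: any evidence-backed conclusion is credited."""
    return has_evidence and conclusion in {"accept", "refute"}


def passes_reframed(conclusion: str, has_evidence: bool) -> bool:
    """Reframed wording: only an evidence-backed refutation is credited."""
    return has_evidence and conclusion == "refute"


for conclusion in ("accept", "refute"):
    print(
        conclusion,
        passes_original(conclusion, True),
        passes_reframed(conclusion, True),
    )
```

Under the original wording both conclusions pass, so the scenario cannot separate a correct review from an incorrect one; under the reframed wording only the refutation passes.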
||
| assertions: | ||
| - type: "output_matches" | ||
| pattern: "(blast radius|all instances|unconditional|callback registration)" | ||
| - type: "output_matches" | ||
| pattern: "(medium|low)" | ||
|
💡 Testing
Flagged by: 2/3 reviewers
Suggested fix:
- type: "output_matches"
  pattern: "[Cc]onfidence[:*]*\\s*(medium|low)"
or pin to the exact field name used in the skill's structured output.
Non-blocking: eval design refinement.
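One plausible reading of this suggestion is that the bare `(medium|low)` pattern can match text that has nothing to do with the confidence field. The sketch below compares the two patterns, assuming `output_matches` behaves like an unanchored, case-sensitive regex search (an assumption; the framework's matching semantics are not shown here). The sample outputs are made up for illustration.

```python
import re

# Pattern currently in the scenario vs. the pattern proposed in the review.
LOOSE = re.compile(r"(medium|low)")
ANCHORED = re.compile(r"[Cc]onfidence[:*]*\s*(medium|low)")

# Hypothetical agent outputs, invented for this comparison.
samples = [
    "Confidence: high. The change follows the existing pattern.",
    "**Confidence:** medium",
    "Confidence: low",
]

for text in samples:
    loose_hit = bool(LOOSE.search(text))
    anchored_hit = bool(ANCHORED.search(text))
    print(f"loose={loose_hit} anchored={anchored_hit} :: {text}")
```

The first sample reports high confidence, yet the loose pattern still matches because "follows" contains "low"; the anchored pattern only matches when the value follows the confidence label.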
||
| rubric: | ||
| - "The agent assesses blast radius for handler/platform changes (does this run for all instances?)" | ||
| - "The agent probes real failure modes, not softballs (e.g., handler disconnect, null PlatformView)" | ||
| - "The agent provides evidence-backed acceptance or refutation of the hypothesis — not just echoing it" | ||
| - "The confidence is calibrated — not 'high' for platform infrastructure changes" | ||
| timeout: 300 | ||
[major] Regression Prevention - This keeps the CI verdict rule, but the workflow no longer has a required step that fetches CI status before finalizing. The previous explicit `gh pr checks <PR_NUMBER>` step was removed, so an agent following the new workflow can reach verdict delivery without collecting the evidence needed to know whether CI is red or pending. Concrete scenario: required checks are failing, but the reviewer never runs a checks command and still emits `LGTM`. Add a mandatory CI-status collection step/output field before confidence/verdict calibration.
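A minimal sketch of the kind of gate this comment asks for, written as a pure function so it does not depend on any particular CLI output format. The function name `may_emit_lgtm` and the state labels `"pass"`, `"fail"`, and `"pending"` are hypothetical; how the states are collected (for example from the checks command) and the exact wording of the skill's CI verdict rule are not shown in this diff.

```python
from typing import Optional, Sequence


def may_emit_lgtm(ci_states: Optional[Sequence[str]]) -> bool:
    """Decide whether an LGTM verdict is allowed given collected CI states.

    ci_states is None when the CI-status step was never run; in that case
    the reviewer has no evidence and must not approve.
    """
    if ci_states is None:
        return False  # evidence missing: status was never collected
    # Any failing or pending check blocks approval.
    # Note: an empty list (no checks configured) is treated as passing here.
    return all(state == "pass" for state in ci_states)


print(may_emit_lgtm(None))                 # status never collected
print(may_emit_lgtm(["pass", "fail"]))     # red CI
print(may_emit_lgtm(["pass", "pending"]))  # CI still running
print(may_emit_lgtm(["pass", "pass"]))     # green CI
```

Treating "never collected" as its own blocking case is the point of the sketch: it makes skipping the collection step indistinguishable from red CI as far as approval is concerned.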