fix: reduce review noise — all findings inline, lean summary, consensus leak fix#167
Conversation
…ment Addresses feedback about excessive review comments: - Only 🔴 CRITICAL findings posted as inline review comments (was: all severities) - Lowered inline cap from 30 to 8 - Full findings summary posted via add-comment (supports hide-older-comments) - Previous summaries auto-collapsed as 'outdated' on re-review - Added severity re-evaluation guard to prevent inflation during posting - Added REQUEST_CHANGES prohibition (only COMMENT allowed) - Added XPIA, large-diff, pre-flight, and time-budget guards - Added 2-reviewer fallback for resilience Validated across 4 test runs on PureWeen/PolyPilot: Run 1: 2 inline (too many) → Run 4: 0 inline + clean summary ✅ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Expert Code Review — PR #167Methodology: 3 independent reviewers with adversarial consensus Findings
Discarded Findings (failed consensus — single reviewer only)These findings were flagged by only 1 of 3 reviewers and rejected during follow-up validation:
Additionally discarded without follow-up (below severity cap): misleading CI Status
Test CoverageThis PR modifies workflow configuration files (
|
…ummary walls Reversed CRITICAL-only inline policy per maintainer feedback. Inline diff comments are contextual and preferred over a big summary comment. All severity levels now posted inline (up to cap of 8), with overflow in summary. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ontext The safe-outputs job skipped all create_pull_request_review_comment and submit_pull_request_review outputs with 'Not in pull request context' because these defaulted to target: 'triggering' which requires a PR event. workflow_dispatch has no triggering PR, so we need target: '*' with the agent providing pull_request_number. Validated: run 25048869259 showed 4/6 outputs skipped, only add-comment (which already had target: '*') succeeded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Increase create-pull-request-review-comment max from 8 to 25 - Rewrite Part B summary to NOT repeat findings already posted inline - Summary now shows counts + methodology only; overflow table for any findings that couldn't be posted inline (invalid path/line) - Recompile lock files for frontmatter change Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tested v0.71.1: slash_command reactions still broken (403). Must stay on v0.68.3 until github/gh-aw#28767 is fixed. Added TODO for switching to REQUEST_CHANGES + supersede-older-reviews once the upgrade is safe. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Updates the shared “Expert Code Review” orchestration workflow documentation/config to reduce noisy summary output by pushing findings into inline review comments (with a higher cap) and keeping the standalone summary lean.
Changes:
- Increase inline PR review comment safe-output cap to 50 and add
target: "*"to safe-output configs to support more trigger contexts. - Revise orchestration instructions to post findings inline (with consensus methodology) and keep the summary comment focused on counts/methodology/CI/test notes.
- Regenerate gh-aw lock files to reflect the new frontmatter/safe-output configuration (including
noopnot reporting as an issue).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| .github/workflows/shared/review-shared.md | Adjusts safe-output configuration and rewrites reviewer orchestration/posting instructions to reduce summary noise. |
| .github/workflows/review.agent.lock.yml | Regenerated lock with updated safe-output limits/targets and noop reporting behavior. |
| .github/workflows/review-on-open.agent.lock.yml | Regenerated lock with updated safe-output limits/targets and noop reporting behavior. |
| > ⚠️ **Large diff guard**: After fetching the diff, count the changed files. If the PR has more than 50 changed files, do NOT embed the full diff in sub-agent prompts. Instead, split the changed files into 3 roughly equal batches and assign each reviewer a different batch (with the full PR description). In Step 3, skip cross-reviewer agreement checks for findings on files only one reviewer saw — include them directly but **downgrade severity by one level** (CRITICAL→MODERATE, MODERATE→MINOR, MINOR stays MINOR) and annotate with "low confidence — single reviewer (batch split)". | ||
|
|
||
| **Do NOT read source files yourself.** Pass only the diff and PR description to sub-agents — they will read source files independently in their own context windows. Pre-reading files wastes your token budget. | ||
| > ⚠️ **Pre-flight**: Before dispatching sub-agents, verify `.github/agents/expert-reviewer.agent.md` exists using the `view` tool. If missing, call `add-comment` with: "❌ Expert Code Review: Cannot run — `.github/agents/expert-reviewer.agent.md` not found. For slash_command on fork PRs, rebase on main. For workflow_dispatch, verify the skill file exists in the PR branch." and exit. |
There was a problem hiding this comment.
The instructions reference calling the safe-output tool as add-comment, but the actual safe-output tool name exposed to the agent is add_comment (underscore). Using add-comment will likely trigger missing_tool and prevent the failure message from being posted; please update this to add_comment here.
| > ⚠️ **Pre-flight**: Before dispatching sub-agents, verify `.github/agents/expert-reviewer.agent.md` exists using the `view` tool. If missing, call `add-comment` with: "❌ Expert Code Review: Cannot run — `.github/agents/expert-reviewer.agent.md` not found. For slash_command on fork PRs, rebase on main. For workflow_dispatch, verify the skill file exists in the PR branch." and exit. | |
| > ⚠️ **Pre-flight**: Before dispatching sub-agents, verify `.github/agents/expert-reviewer.agent.md` exists using the `view` tool. If missing, call `add_comment` with: "❌ Expert Code Review: Cannot run — `.github/agents/expert-reviewer.agent.md` not found. For slash_command on fork PRs, rebase on main. For workflow_dispatch, verify the skill file exists in the PR branch." and exit. |
There was a problem hiding this comment.
Fixed in b036a66 — updated to add_comment (underscore). Good catch on the naming inconsistency. In gh-aw the config uses dashes (add-comment) but the MCP tool uses underscores (add_comment). Prose should match the tool name.
| - **Cap at 3 disputed findings** — if more than 3 findings are 1/3, discard the rest without follow-up to preserve token budget for posting. | ||
| - **Cap at 3 disputed findings** — select the **3 most severe** for follow-up. Discard lower-severity 1/3 findings without follow-up to preserve token budget for posting. | ||
|
|
||
| **Zero findings**: If all reviewers return zero findings, skip Step 4. Instead call `add-comment` with: "✅ Expert Code Review: 3 independent reviewers found no issues. Methodology: 3-model adversarial consensus." |
There was a problem hiding this comment.
Same issue as earlier: this step says to call add-comment, but the tool name is add_comment in the safe-output tools list. Please update to add_comment so the zero-findings path works.
| **Zero findings**: If all reviewers return zero findings, skip Step 4. Instead call `add-comment` with: "✅ Expert Code Review: 3 independent reviewers found no issues. Methodology: 3-model adversarial consensus." | |
| **Zero findings**: If all reviewers return zero findings, skip Step 4. Instead call `add_comment` with: "✅ Expert Code Review: 3 independent reviewers found no issues. Methodology: 3-model adversarial consensus." |
There was a problem hiding this comment.
Fixed in same commit (b036a66). Both add-comment references in prose updated to add_comment.
| Post results in **two parts**: inline review comments for critical findings, and a standalone summary comment for the full report. | ||
|
|
||
| #### Part A: Inline Review Comments | ||
|
|
||
| **Note:** Inline review tools require PR context (`pull_request` or `issue_comment` triggers). From `workflow_dispatch`, these tools will fail silently — use `add_comment` instead. | ||
| Post **all findings** as inline PR review comments using `create_pull_request_review_comment`. Inline comments are preferred — they're contextual and less noisy than a big comment. |
There was a problem hiding this comment.
Step 4 is internally inconsistent: it says inline comments are for “critical findings”, but immediately after instructs to “Post all findings” inline. Please reconcile these so the workflow has a single clear rule (e.g., inline all findings up to the cap, or inline only critical/moderate and summarize the rest).
There was a problem hiding this comment.
Fixed in b036a66 — the intro was stale from when we had critical-only inline mode. Updated to: "inline review comments for all findings, and a standalone lean summary comment."
| create-pull-request-review-comment: | ||
| max: 30 | ||
| max: 50 | ||
| target: "*" |
There was a problem hiding this comment.
PR description says the max for create-pull-request-review-comment was bumped from 8 → 50, but this file shows it changing from 30 → 50. Please update the PR description (or this config) so they match, to avoid confusion when auditing workflow behavior.
There was a problem hiding this comment.
The PR description describes the overall journey across this branch (started at 8, went to 25, then 30, then 50). The diff in any individual commit shows only the last hop. Both are correct in context — I'll clarify the description to say "8 → 25 → 30 → 50 across iterations" to avoid confusion.
There was a problem hiding this comment.
Expert Code Review: 3 findings posted inline. See summary comment below for full details.
Generated by Expert Code Review (auto) for issue #167 · ● 13.9M
| hide-older-comments: true | ||
| target: "*" | ||
| noop: | ||
| report-as-issue: false |
There was a problem hiding this comment.
🟡 MODERATE · 2/3 reviewers
noop is now silent — reduced failure observability. With report-as-issue: false, any unhandled edge case (context exhaustion, unexpected tool error) that results in a noop call produces zero PR-visible output. The instructions do cover known paths (zero findings → add-comment, pre-flight failure → add-comment, <2 reviewers → add-comment), but an off-script agent calling noop instead of the prescribed add-comment would be invisible to the PR author.
Recommendation: Either keep report-as-issue: true as a last-resort safety net, or add an explicit instruction: "Never call noop — always use add-comment for any outcome, including 'no issues found'." This closes the silent-failure path without adding noise under normal operation.
There was a problem hiding this comment.
Acknowledged — accepted trade-off. We turned report-as-issue: false off because it was creating noise GitHub Issues for every noop. The instructions already cover all known exit paths (zero findings, pre-flight failure, <2 reviewers) with explicit add_comment calls. An off-script noop would still be visible in the agent job logs. If we see silent failures in practice we can revisit.
|
|
||
| 1. **3/3 agree** on a finding → include immediately | ||
| 2. **2/3 agree** → include with median severity | ||
| 2. **2/3 agree** → include with the **lower** of the two severity levels (e.g., 🔴+🟢 → 🟢, 🔴+🟡 → 🟡, 🟡+🟢 → 🟢) |
There was a problem hiding this comment.
🟡 MODERATE · 3/3 reviewers (consensus after follow-up)
No severity floor in 2/3 consensus rule. The explicit example 🔴+🟢 → 🟢 means a single lenient reviewer can downgrade a CRITICAL finding (e.g., a security vulnerability) all the way to MINOR. The old "median severity" rule was ambiguous for 2 values but directionally moderate; "lower" is unambiguous but can bury urgent findings.
Recommendation: Add a floor — e.g., "take the lower severity, but never below 🟡 MODERATE if either reviewer rated 🔴 CRITICAL." This preserves the noise-reduction intent while protecting security-sensitive findings.
There was a problem hiding this comment.
Great catch — fixed in b036a66. Changed from "take the lower" to "downgrade by at most one step from the higher rating." So now 🔴+🟢 → 🟡 (not 🟢), preventing a single lenient reviewer from burying critical findings. 🔴+🟡 → 🟡 and 🟡+🟢 → 🟢 remain the same.
| create-pull-request-review-comment: | ||
| max: 30 | ||
| max: 50 | ||
| target: "*" |
There was a problem hiding this comment.
🟢 MINOR · 3/3 reviewers
target: "*" added but Step 4 doesn't require passing pull_request_number. The config now allows these tools to target any PR (necessary for workflow_dispatch), but the posting instructions in Step 4 don't explicitly tell the agent to pass pull_request_number for inline comments. The tool schema documents this requirement when target: "*", and the lock files' description suffix for add_comment mentions "Target: *" while create_pull_request_review_comment and submit_pull_request_review do not — an inconsistency.
Recommendation: Low risk since the tool schema itself documents the pull_request_number requirement. Consider adding a note in Step 4: "Always pass pull_request_number explicitly for all inline comments and the review submission."
There was a problem hiding this comment.
Fixed in b036a66 — added step 4 to Part A: "Always pass pull_request_number explicitly (required by target: "*" config)." Low risk since the tool schema documents it, but explicit is better.
Add explicit 'add_comment budget: exactly ONE call' rule to preamble, mark follow-up agent AGREE/DISAGREE responses as internal data only in Step 3, and reinforce single-call policy in Step 4 Part B. These are prompt body changes (not frontmatter), so no lock file recompilation is required. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aming - Fix 2-reviewer fallback: discard 1/2 splits (no valid tiebreaker) - Add severity floor: cap downgrade at one step (🔴+🟢 → 🟡, not 🟢) - Fix add-comment → add_comment in prose (2 places) - Fix Step 4 intro: 'critical findings' → 'all findings' (stale text) - Add pull_request_number reminder to Step 4 Part A Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Responses to auto-review findingsFinding 1 (🟡 2-reviewer fallback tiebreak): Fixed in b036a66. Changed 1/2 split rule from "dispatch the failed model" (impossible) to "discard the finding" (no valid tiebreaker). Clean and deterministic. Finding 2 (🟡 Batch split defeats consensus): Accepted trade-off. With 190 files, sending the full diff to all 3 reviewers would blow the context window. Single-reviewer batches with severity downgrade + "low confidence" annotation is the practical solution. We validated this on PR #161 (190 files, 28/28 messages succeeded, good finding quality). If token budgets grow we can add overlap later. Finding 3 (🟢 Severity floor): Fixed in b036a66. Changed from "take the lower" to "at most one step downgrade" — so 🔴+🟢 → 🟡 now, not 🟢. Finding 4 (🟢 time-budget claim): Stale — I've since rewritten the PR description and removed that claim. No time-budget instruction exists and none is needed (the job has a 90-min infrastructure timeout). |
Applied 7 validated findings from 3-model adversarial review: Frontmatter: - max:5 → max:2 on add-comment (blast-radius cap — only pre-flight XOR summary needed, +1 margin) Consensus rules (Step 3): - Add 2-reviewer mode gate at top (prevents misapplying 3-reviewer rules when a sub-agent fails) - Clarify 3/3 uses highest severity, 2/3 same-severity = no downgrade - Add post-consensus-zero exit path (all findings discarded) - Explicit batch-only exception for large-diff split findings Output (Step 4): - Add submit_pull_request_review failure fallback with full-table recovery in Part B - Add pull_request_number requirement to submit + add_comment calls - Use severity emoji in large-diff guard (consistency) - Clarify discarded-findings section covers cap-discarded findings Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slash_command workflow: group by PR number from all event sources, cancel-in-progress: false (per gh-aw guidance — prevents non-matching events from killing in-flight agent runs). pull_request workflow: group by PR number, same cancel policy. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Changes
Reduces
/reviewworkflow noise by posting all findings inline (not in the summary comment) and preventing internal consensus notes from leaking as PR comments.What changed in
review-shared.mdcreate_pull_request_review_commenton the exact diff line. Cap raised from 8 → 50.add_commentsummary is now just counts + methodology + overflow table (only for findings that couldn't go inline). No more duplicating all findings in both places.add_commentbudget rule (exactly ONE call) and marked follow-up AGREE/DISAGREE responses as internal-only data. Prevents orchestrator from posting raw consensus evaluation text as PR comments.target: "*"on all safe-outputs — Required forworkflow_dispatchtesting (no PR context). Agent must providepull_request_numberin every call.Why not
supersede-older-reviews/REQUEST_CHANGES?We tested
submit_pull_request_reviewwithevent: "REQUEST_CHANGES"+supersede-older-reviews: trueon PolyPilot. Two blockers:REQUEST_CHANGEStoCOMMENTEDwhen the reviewer bot is the same as the PR author (bot reviewing its own PR). This meanssupersede-older-reviewsnever triggers in that scenario.issues: write(missingpull-requests: write), so/reviewreactions fail with 403. We must stay on v0.68.3 which doesn't supportsupersede-older-reviewsreliably.A TODO in
review-shared.mddocuments the 5-step upgrade plan for when #28767 is fixed. For now, we usehide-older-comments: trueonadd_commentto collapse old summaries, and inline review comments naturally show on the latest diff.Validation
Two consecutive clean runs with zero failures:
The PR #161 run also validated the large diff guard (>50 files → batch-split across 3 reviewers with severity downgrade).