Skip to content

fix: reduce review noise — all findings inline, lean summary, consensus leak fix#167

Merged
Redth merged 12 commits into
mainfrom
fix/reduce-review-noise
Apr 29, 2026
Merged

fix: reduce review noise — all findings inline, lean summary, consensus leak fix#167
Redth merged 12 commits into
mainfrom
fix/reduce-review-noise

Conversation

@PureWeen
Copy link
Copy Markdown
Member

@PureWeen PureWeen commented Apr 28, 2026

Changes

Reduces /review workflow noise by posting all findings inline (not in the summary comment) and preventing internal consensus notes from leaking as PR comments.

What changed in review-shared.md

  1. All findings inline — Moved from summary-table-only to posting every finding as an inline create_pull_request_review_comment on the exact diff line. Cap raised from 8 → 50.
  2. Lean summary — The add_comment summary is now just counts + methodology + overflow table (only for findings that couldn't go inline). No more duplicating all findings in both places.
  3. Consensus note leak fix — Added explicit add_comment budget rule (exactly ONE call) and marked follow-up AGREE/DISAGREE responses as internal-only data. Prevents orchestrator from posting raw consensus evaluation text as PR comments.
  4. target: "*" on all safe-outputs — Required for workflow_dispatch testing (no PR context). Agent must provide pull_request_number in every call.
  5. Compiler pin note — Documented that we're pinned to gh-aw v0.68.3 with a TODO for upgrading once gh-aw#28767 is fixed.

Why not supersede-older-reviews / REQUEST_CHANGES?

We tested submit_pull_request_review with event: "REQUEST_CHANGES" + supersede-older-reviews: true on PolyPilot. Two blockers:

  1. GitHub silently downgrades REQUEST_CHANGES to COMMENTED when the reviewer bot is the same as the PR author (bot reviewing its own PR). This means supersede-older-reviews never triggers in that scenario.
  2. gh-aw v0.71.1 breaks slash commands — Activation job only grants issues: write (missing pull-requests: write), so /review reactions fail with 403. We must stay on v0.68.3 which doesn't support supersede-older-reviews reliably.

A TODO in review-shared.md documents the 5-step upgrade plan for when #28767 is fixed. For now, we use hide-older-comments: true on add_comment to collapse old summaries, and inline review comments naturally show on the latest diff.

Validation

Two consecutive clean runs with zero failures:

Run Target PR Files Messages Failures Leaked notes
25073966162 #164 (small) ~20 10/10 ✅ 0 0
25075206386 #161 (macOS AppKit) 190 28/28 ✅ 0 0

The PR #161 run also validated the large diff guard (>50 files → batch-split across 3 reviewers with severity downgrade).

…ment

Addresses feedback about excessive review comments:
- Only 🔴 CRITICAL findings posted as inline review comments (was: all severities)
- Lowered inline cap from 30 to 8
- Full findings summary posted via add-comment (supports hide-older-comments)
- Previous summaries auto-collapsed as 'outdated' on re-review
- Added severity re-evaluation guard to prevent inflation during posting
- Added REQUEST_CHANGES prohibition (only COMMENT allowed)
- Added XPIA, large-diff, pre-flight, and time-budget guards
- Added 2-reviewer fallback for resilience

Validated across 4 test runs on PureWeen/PolyPilot:
Run 1: 2 inline (too many) → Run 4: 0 inline + clean summary ✅

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

Expert Code Review — PR #167

Methodology: 3 independent reviewers with adversarial consensus

Findings

# Severity Consensus File Line(s) Finding
1 🟡 MODERATE 3/3 reviewers review-shared.md ~line 95 2-reviewer fallback has impossible tiebreak — when 2 reviewers split 1/2, the instruction says "dispatch the 1 remaining successful model" but that model is the one that failed. The same rule says "Do NOT retry the failed model." No valid tiebreaker exists. Recommendation: Change the 1/2 split rule to discard the finding (no tiebreaker available), or dispatch a fresh instance of one of the 2 successful models with the other's reasoning as context.
2 🟡 MODERATE 2/3 reviewers review-shared.md ~line 50 Large-diff batch split (>50 files) can defeat adversarial consensus entirely — each reviewer sees ~17 unique files, so every finding is single-reviewer with auto-downgraded severity. A genuine 🔴 CRITICAL in a batch-split PR becomes 🟡 MODERATE and loses inline visibility (Part A is CRITICAL-only). Recommendation: Require at least 2 reviewers to overlap on each batch, or exempt 🔴 CRITICAL findings from auto-downgrade in batch-split mode.
3 🟢 MINOR 2/3 reviewers review-shared.md ~line 102 The 2/3 "lower severity" consensus rule can suppress findings below the inline threshold — if one reviewer rates a finding 🔴 CRITICAL and the other rates it 🟢 MINOR, consensus becomes 🟢 MINOR. The finding loses inline visibility despite 2/3 agreement. Recommendation: Consider capping the downgrade at one step (🔴+🟢 → 🟡, not 🟢), or document this as an accepted trade-off.
4 🟢 MINOR 2/3 reviewers review-shared.md / PR description PR description lists "time-budget (60min)" as a new guard, but no time-budget instruction exists in review-shared.md. The workflow job has a 90-minute timeout-minutes at the infrastructure level, but no prompt-level guidance for the agent to self-limit. Recommendation: Either add a time-budget instruction or remove the claim from the PR description.

Discarded Findings (failed consensus — single reviewer only)

These findings were flagged by only 1 of 3 reviewers and rejected during follow-up validation:

Finding Original Reviewer Follow-up Result
Removed workflow_dispatch inline-comment fallback note (regression) Reviewer 1 Both follow-up reviewers disagreed — Part B summary is the primary output and captures all findings unconditionally, so inline failure doesn't lose data
CI status toolset (checks) not configured Reviewer 2 Both follow-up reviewers disagreed — pull_requests toolset includes get_check_runs and get_status methods, no separate checks toolset needed
Pagination/truncation handling not required for large-diff guard Reviewer 3 Both follow-up reviewers disagreed — diff counting uses the raw diff text (not paginated get_files), and the edge case is theoretical

Additionally discarded without follow-up (below severity cap): misleading submit_pull_request_review body when all inline comments fail validation (1/3, 🟢), hide-older-comments collapsing valid reviews on pre-flight failure (1/3, 🟢), removed explicit MCP tool names from Step 1 (1/3, 🟢), large-diff guard boundary at exactly 50 vs ≥50 (1/3, 🟢).

CI Status

Check Status
activation ✅ success
license/cla ✅ success
pre_activation ✅ success
agent 🔄 in progress (this review run)

Test Coverage

This PR modifies workflow configuration files (.md frontmatter and auto-generated .lock.yml files). No source code changes — tests are not applicable. The PR description documents 4 validation runs on PureWeen/PolyPilot with the final run showing all green.


Generated by Expert Code Review · 3 independent reviewers with adversarial consensus

Generated by Expert Code Review (auto) for issue #167 · ● 10.5M ·

PureWeen and others added 2 commits April 28, 2026 05:49
…ummary walls

Reversed CRITICAL-only inline policy per maintainer feedback. Inline diff
comments are contextual and preferred over a big summary comment. All
severity levels now posted inline (up to cap of 8), with overflow in summary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ontext

The safe-outputs job skipped all create_pull_request_review_comment
and submit_pull_request_review outputs with 'Not in pull request
context' because these defaulted to target: 'triggering' which
requires a PR event. workflow_dispatch has no triggering PR, so
we need target: '*' with the agent providing pull_request_number.

Validated: run 25048869259 showed 4/6 outputs skipped, only
add-comment (which already had target: '*') succeeded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen and others added 4 commits April 28, 2026 06:25
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Increase create-pull-request-review-comment max from 8 to 25
- Rewrite Part B summary to NOT repeat findings already posted inline
- Summary now shows counts + methodology only; overflow table for any
  findings that couldn't be posted inline (invalid path/line)
- Recompile lock files for frontmatter change

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tested v0.71.1: slash_command reactions still broken (403).
Must stay on v0.68.3 until github/gh-aw#28767 is fixed.
Added TODO for switching to REQUEST_CHANGES + supersede-older-reviews
once the upgrade is safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen changed the title fix: reduce review comment noise — CRITICALs-only inline, summary via add-comment fix: reduce review noise — all findings inline, lean summary Apr 28, 2026
@PureWeen PureWeen marked this pull request as ready for review April 28, 2026 18:43
Copilot AI review requested due to automatic review settings April 28, 2026 18:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the shared “Expert Code Review” orchestration workflow documentation/config to reduce noisy summary output by pushing findings into inline review comments (with a higher cap) and keeping the standalone summary lean.

Changes:

  • Increase inline PR review comment safe-output cap to 50 and add target: "*" to safe-output configs to support more trigger contexts.
  • Revise orchestration instructions to post findings inline (with consensus methodology) and keep the summary comment focused on counts/methodology/CI/test notes.
  • Regenerate gh-aw lock files to reflect the new frontmatter/safe-output configuration (including noop not reporting as an issue).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
.github/workflows/shared/review-shared.md Adjusts safe-output configuration and rewrites reviewer orchestration/posting instructions to reduce summary noise.
.github/workflows/review.agent.lock.yml Regenerated lock with updated safe-output limits/targets and noop reporting behavior.
.github/workflows/review-on-open.agent.lock.yml Regenerated lock with updated safe-output limits/targets and noop reporting behavior.

> ⚠️ **Large diff guard**: After fetching the diff, count the changed files. If the PR has more than 50 changed files, do NOT embed the full diff in sub-agent prompts. Instead, split the changed files into 3 roughly equal batches and assign each reviewer a different batch (with the full PR description). In Step 3, skip cross-reviewer agreement checks for findings on files only one reviewer saw — include them directly but **downgrade severity by one level** (CRITICAL→MODERATE, MODERATE→MINOR, MINOR stays MINOR) and annotate with "low confidence — single reviewer (batch split)".

**Do NOT read source files yourself.** Pass only the diff and PR description to sub-agents — they will read source files independently in their own context windows. Pre-reading files wastes your token budget.
> ⚠️ **Pre-flight**: Before dispatching sub-agents, verify `.github/agents/expert-reviewer.agent.md` exists using the `view` tool. If missing, call `add-comment` with: "❌ Expert Code Review: Cannot run — `.github/agents/expert-reviewer.agent.md` not found. For slash_command on fork PRs, rebase on main. For workflow_dispatch, verify the skill file exists in the PR branch." and exit.
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The instructions reference calling the safe-output tool as add-comment, but the actual safe-output tool name exposed to the agent is add_comment (underscore). Using add-comment will likely trigger missing_tool and prevent the failure message from being posted; please update this to add_comment here.

Suggested change
> ⚠️ **Pre-flight**: Before dispatching sub-agents, verify `.github/agents/expert-reviewer.agent.md` exists using the `view` tool. If missing, call `add-comment` with: "❌ Expert Code Review: Cannot run — `.github/agents/expert-reviewer.agent.md` not found. For slash_command on fork PRs, rebase on main. For workflow_dispatch, verify the skill file exists in the PR branch." and exit.
> ⚠️ **Pre-flight**: Before dispatching sub-agents, verify `.github/agents/expert-reviewer.agent.md` exists using the `view` tool. If missing, call `add_comment` with: "❌ Expert Code Review: Cannot run — `.github/agents/expert-reviewer.agent.md` not found. For slash_command on fork PRs, rebase on main. For workflow_dispatch, verify the skill file exists in the PR branch." and exit.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b036a66 — updated to add_comment (underscore). Good catch on the naming inconsistency. In gh-aw the config uses dashes (add-comment) but the MCP tool uses underscores (add_comment). Prose should match the tool name.

- **Cap at 3 disputed findings** — if more than 3 findings are 1/3, discard the rest without follow-up to preserve token budget for posting.
- **Cap at 3 disputed findings** — select the **3 most severe** for follow-up. Discard lower-severity 1/3 findings without follow-up to preserve token budget for posting.

**Zero findings**: If all reviewers return zero findings, skip Step 4. Instead call `add-comment` with: "✅ Expert Code Review: 3 independent reviewers found no issues. Methodology: 3-model adversarial consensus."
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as earlier: this step says to call add-comment, but the tool name is add_comment in the safe-output tools list. Please update to add_comment so the zero-findings path works.

Suggested change
**Zero findings**: If all reviewers return zero findings, skip Step 4. Instead call `add-comment` with: "✅ Expert Code Review: 3 independent reviewers found no issues. Methodology: 3-model adversarial consensus."
**Zero findings**: If all reviewers return zero findings, skip Step 4. Instead call `add_comment` with: "✅ Expert Code Review: 3 independent reviewers found no issues. Methodology: 3-model adversarial consensus."

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in same commit (b036a66). Both add-comment references in prose updated to add_comment.

Comment on lines +125 to +129
Post results in **two parts**: inline review comments for critical findings, and a standalone summary comment for the full report.

#### Part A: Inline Review Comments

**Note:** Inline review tools require PR context (`pull_request` or `issue_comment` triggers). From `workflow_dispatch`, these tools will fail silently — use `add_comment` instead.
Post **all findings** as inline PR review comments using `create_pull_request_review_comment`. Inline comments are preferred — they're contextual and less noisy than a big comment.
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step 4 is internally inconsistent: it says inline comments are for “critical findings”, but immediately after instructs to “Post all findings” inline. Please reconcile these so the workflow has a single clear rule (e.g., inline all findings up to the cap, or inline only critical/moderate and summarize the rest).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b036a66 — the intro was stale from when we had critical-only inline mode. Updated to: "inline review comments for all findings, and a standalone lean summary comment."

Comment on lines 30 to +32
create-pull-request-review-comment:
max: 30
max: 50
target: "*"
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description says the max for create-pull-request-review-comment was bumped from 8 → 50, but this file shows it changing from 30 → 50. Please update the PR description (or this config) so they match, to avoid confusion when auditing workflow behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description describes the overall journey across this branch (started at 8, went to 25, then 30, then 50). The diff in any individual commit shows only the last hop. Both are correct in context — I'll clarify the description to say "8 → 25 → 30 → 50 across iterations" to avoid confusion.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expert Code Review: 3 findings posted inline. See summary comment below for full details.

Generated by Expert Code Review (auto) for issue #167 · ● 13.9M

hide-older-comments: true
target: "*"
noop:
report-as-issue: false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE · 2/3 reviewers

noop is now silent — reduced failure observability. With report-as-issue: false, any unhandled edge case (context exhaustion, unexpected tool error) that results in a noop call produces zero PR-visible output. The instructions do cover known paths (zero findings → add-comment, pre-flight failure → add-comment, <2 reviewers → add-comment), but an off-script agent calling noop instead of the prescribed add-comment would be invisible to the PR author.

Recommendation: Either keep report-as-issue: true as a last-resort safety net, or add an explicit instruction: "Never call noop — always use add-comment for any outcome, including 'no issues found'." This closes the silent-failure path without adding noise under normal operation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — accepted trade-off. We turned report-as-issue: false off because it was creating noise GitHub Issues for every noop. The instructions already cover all known exit paths (zero findings, pre-flight failure, <2 reviewers) with explicit add_comment calls. An off-script noop would still be visible in the agent job logs. If we see silent failures in practice we can revisit.


1. **3/3 agree** on a finding → include immediately
2. **2/3 agree** → include with median severity
2. **2/3 agree** → include with the **lower** of the two severity levels (e.g., 🔴+🟢 → 🟢, 🔴+🟡 → 🟡, 🟡+🟢 → 🟢)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE · 3/3 reviewers (consensus after follow-up)

No severity floor in 2/3 consensus rule. The explicit example 🔴+🟢 → 🟢 means a single lenient reviewer can downgrade a CRITICAL finding (e.g., a security vulnerability) all the way to MINOR. The old "median severity" rule was ambiguous for 2 values but directionally moderate; "lower" is unambiguous but can bury urgent findings.

Recommendation: Add a floor — e.g., "take the lower severity, but never below 🟡 MODERATE if either reviewer rated 🔴 CRITICAL." This preserves the noise-reduction intent while protecting security-sensitive findings.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch — fixed in b036a66. Changed from "take the lower" to "downgrade by at most one step from the higher rating." So now 🔴+🟢 → 🟡 (not 🟢), preventing a single lenient reviewer from burying critical findings. 🔴+🟡 → 🟡 and 🟡+🟢 → 🟢 remain the same.

create-pull-request-review-comment:
max: 30
max: 50
target: "*"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR · 3/3 reviewers

target: "*" added but Step 4 doesn't require passing pull_request_number. The config now allows these tools to target any PR (necessary for workflow_dispatch), but the posting instructions in Step 4 don't explicitly tell the agent to pass pull_request_number for inline comments. The tool schema documents this requirement when target: "*", and the lock files' description suffix for add_comment mentions "Target: *" while create_pull_request_review_comment and submit_pull_request_review do not — an inconsistency.

Recommendation: Low risk since the tool schema itself documents the pull_request_number requirement. Consider adding a note in Step 4: "Always pass pull_request_number explicitly for all inline comments and the review submission."

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b036a66 — added step 4 to Part A: "Always pass pull_request_number explicitly (required by target: "*" config)." Low risk since the tool schema documents it, but explicit is better.

Add explicit 'add_comment budget: exactly ONE call' rule to preamble,
mark follow-up agent AGREE/DISAGREE responses as internal data only
in Step 3, and reinforce single-call policy in Step 4 Part B.

These are prompt body changes (not frontmatter), so no lock file
recompilation is required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen changed the title fix: reduce review noise — all findings inline, lean summary fix: reduce review noise — all findings inline, lean summary, consensus leak fix Apr 28, 2026
…aming

- Fix 2-reviewer fallback: discard 1/2 splits (no valid tiebreaker)
- Add severity floor: cap downgrade at one step (🔴+🟢 → 🟡, not 🟢)
- Fix add-comment → add_comment in prose (2 places)
- Fix Step 4 intro: 'critical findings' → 'all findings' (stale text)
- Add pull_request_number reminder to Step 4 Part A

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Member Author

Responses to auto-review findings

Finding 1 (🟡 2-reviewer fallback tiebreak): Fixed in b036a66. Changed 1/2 split rule from "dispatch the failed model" (impossible) to "discard the finding" (no valid tiebreaker). Clean and deterministic.

Finding 2 (🟡 Batch split defeats consensus): Accepted trade-off. With 190 files, sending the full diff to all 3 reviewers would blow the context window. Single-reviewer batches with severity downgrade + "low confidence" annotation is the practical solution. We validated this on PR #161 (190 files, 28/28 messages succeeded, good finding quality). If token budgets grow we can add overlap later.

Finding 3 (🟢 Severity floor): Fixed in b036a66. Changed from "take the lower" to "at most one step downgrade" — so 🔴+🟢 → 🟡 now, not 🟢.

Finding 4 (🟢 time-budget claim): Stale — I've since rewritten the PR description and removed that claim. No time-budget instruction exists and none is needed (the job has a 90-min infrastructure timeout).

PureWeen and others added 3 commits April 28, 2026 16:10
Applied 7 validated findings from 3-model adversarial review:

Frontmatter:
- max:5 → max:2 on add-comment (blast-radius cap — only pre-flight
  XOR summary needed, +1 margin)

Consensus rules (Step 3):
- Add 2-reviewer mode gate at top (prevents misapplying 3-reviewer
  rules when a sub-agent fails)
- Clarify 3/3 uses highest severity, 2/3 same-severity = no downgrade
- Add post-consensus-zero exit path (all findings discarded)
- Explicit batch-only exception for large-diff split findings

Output (Step 4):
- Add submit_pull_request_review failure fallback with full-table
  recovery in Part B
- Add pull_request_number requirement to submit + add_comment calls
- Use severity emoji in large-diff guard (consistency)
- Clarify discarded-findings section covers cap-discarded findings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slash_command workflow: group by PR number from all event sources,
cancel-in-progress: false (per gh-aw guidance — prevents non-matching
events from killing in-flight agent runs).

pull_request workflow: group by PR number, same cancel policy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Redth Redth merged commit 6d21518 into main Apr 29, 2026
7 checks passed
@Redth Redth deleted the fix/reduce-review-noise branch April 29, 2026 13:59
@jfversluis jfversluis added this to the v0.1.0-preview.7 milestone May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants