Skip to content

Integrate code-review into pr-review pipeline (firewall architecture)#35105

Closed
PureWeen wants to merge 4 commits into
mainfrom
feature/code-review-integration
Closed

Integrate code-review into pr-review pipeline (firewall architecture)#35105
PureWeen wants to merge 4 commits into
mainfrom
feature/code-review-integration

Conversation

@PureWeen
Copy link
Copy Markdown
Member

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Description

Integrates the code-review skill into the pr-review orchestrator with a firewall architecture that separates domain knowledge from PR-specific conclusions.

Architecture: Full analysis gist

Key principle: Try-fix models get domain knowledge (review rules) but NOT code-review's PR-specific conclusions (verdict, error findings). This preserves multi-model independence.

Changes

File What
pr-preflight.md Add Part B: code-review sub-agent with independence-first, SKIPPED fallback, firewall
pr-report.md Priority-ordered decision table (advisory, not hard gate), convergence analysis
pr-review/SKILL.md Wire code-review into flow, add review-rules.md loading for try-fix, firewall enforcement
code-review/SKILL.md Update description

Relation to other PRs

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 23, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35105

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35105"

Wire the code-review skill into the pr-review orchestrator with a clear
separation between domain knowledge and PR-specific conclusions:

Pre-Flight Part B: Code-review runs as independent sub-agent
  - Independence-first: receives only PR number, no Part A context
  - Outputs to pre-flight/code-review.md + content.md summary
  - SKIPPED fallback if sub-agent fails/times out
  - Findings go to Report ONLY (firewalled from Try-Fix)

Try-Fix: Domain knowledge, not conclusions
  - Models load review-rules.md for MAUI patterns (domain knowledge)
  - Models do NOT receive code-review verdict/findings (no anchoring)
  - Each model independently applies domain rules to their fix approach

Report: Advisory convergence analysis
  - Priority-ordered decision table (first match wins)
  - Code-review is advisory signal, NOT a hard gate
  - Convergence analysis: did try-fix models independently flag
    same issues as code-review? Corroborated = high confidence
  - SKIPPED verdict handled gracefully

Architecture rationale: https://gist.github.com/PureWeen/fbd842390b471c2230018f47a57eb97a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the feature/code-review-integration branch from f69ed7d to 1a99858 Compare April 23, 2026 14:20
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 23, 2026

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 15 | Agents checked: 3

Full validator output
Found 2 skill(s)
[pr-review] 📊 pr-review: 3,186 BPE tokens [chars/4: 3,090] (standard ~), 22 sections, 7 code blocks
[pr-review]    ⚠  Skill is 3,186 BPE tokens (chars/4 estimate: 3,090) — approaching "comprehensive" range where gains diminish.
[try-fix] 📊 try-fix: 4,003 BPE tokens [chars/4: 4,195] (standard ~), 37 sections, 12 code blocks
[try-fix]    ⚠  Skill is 4,003 BPE tokens (chars/4 estimate: 4,195) — approaching "comprehensive" range where gains diminish.
✅ All checks passed (2 skill(s))
Found 3 agent(s)
Validated 3 agent(s)

✅ All checks passed (3 agent(s))

❌ LLM Evaluation Failed

0/1 skill(s) passed validation

Skill Scenario Baseline Skilled Verdict

try-fix: Eval scenario 'Regression: agent must not claim success without running the test command' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Regression: agent uses prescribed restore script, not raw git commands' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Edge case: exhausted iterations produces documented Fail, not silence or Pass' prompt mentions target name 'try-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs.

🔍 Full results and investigation steps

try-fix/SKILL.md had no reference to the MAUI domain knowledge in
review-rules.md. Now Step 3 instructs models to load and apply relevant
rules when analyzing code and designing fixes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Member Author

PureWeen commented Apr 23, 2026

🔍 Multi-Model Code Review — PR #35105

PR: Integrate code-review into pr-review pipeline (firewall architecture)
Author: @PureWeen | Base: mainfeature/code-review-integration
Files changed: 5 (markdown instruction/skill files)
Review method: 3 independent reviewers + consensus analysis


CI Status

Check Status
license/cla ✅ Pass
Static validation ✅ Pass
evaluate (try-fix) ✅ Pass
skill-validation (LLM evaluation) ⚠️ Pre-existing (same skipped pattern on main)
maui-pr ⏭️ Skipping

Re-Review (after commit a627409)

Previous Findings Status

# Finding Status Notes
1 Advisory/hard-gate contradiction ✅ FIXED Row 5 adds APPROVE path for minor findings. Advisory language and table are now consistent.
2 Parallel execution race ⚠️ PARTIALLY FIXED SKILL.md has correct sequencing, but pr-preflight.md template still says mode="sync" (see New Finding A)
3 Firewall open at try-fix callee ✅ FIXED Constraint added directly to try-fix hints input documentation
4 Convergence analysis under-specified ✅ FIXED Corroboration defined, cross-row population required
5 Reference inconsistency ✅ FIXED Both files now mention review-rules.md and .instructions.md
6 Template overlap ✅ FIXED Split into Convergence Analysis + Code Review Impact sections
7 No review-rules.md fallback ➖ Not addressed (minor) Add: "If file unavailable, proceed with general MAUI knowledge"
8 Row 2 gap ✅ FIXED "Even when Rows 1-2 fire" instruction added

Score: 6/8 fully fixed, 1 partially fixed, 1 deferred minor


New Findings

🟡 MODERATE

A. mode="sync" template contradicts mode="background" instruction (1/3 reviewers — clear contradiction once identified)

📁 pr-preflight.md:55 vs pr-review/SKILL.md:81

SKILL.md says: "launch Part B as mode: "background""
But the code template in pr-preflight.md shows: mode="sync"

An agent copying the template will block on code-review before starting try-fix — negating the parallelism fix entirely.

Fix: Change mode="sync"mode="background" in the pr-preflight.md code template.


🟢 MINOR

B. "APPROVE with notes" not in output template (1/3 reviewers)

Row 5 specifies ✅ APPROVE with notes but the output template header only lists {APPROVE/REQUEST CHANGES}. Add APPROVE with notes as a valid variant so agents produce consistent output.

C. Rows 4/5 "significant vs minor" doesn't map to code-review severity labels (1/3 reviewers)

Code-review uses ❌ Error / ⚠️ Warning / 💡 Suggestion. Rows 4/5 use "significant/minor" with examples. Consider grounding: "significant = any ❌ Error; minor = ⚠️ Warnings unrelated to safety/correctness + 💡 Suggestions".


Recommendation

⚠️ Request Changes — one fix needed:

  1. Must fix: Change mode="sync"mode="background" in pr-preflight.md Step 7 template (Finding A)
  2. Nice to have: Add "APPROVE with notes" to output template (Finding B)
  3. Nice to have: Map significant/minor to code-review severity labels (Finding C)
  4. Nice to have: Add review-rules.md fallback sentence (Finding 7)

📜 Review History

v1: 4 moderate + 4 minor findings. ⚠️ REQUEST CHANGES.
v2 (after a627409): 6/8 fixed, 1 partial, 1 new moderate (mode sync/background contradiction). ⚠️ REQUEST CHANGES — one fix remaining.

github-actions Bot and others added 2 commits April 23, 2026 10:31
1. Advisory/hard-gate: Add Row 5 allowing APPROVE with notes for minor
   uncorroborated code-review findings (style, naming). Significant
   findings (safety, correctness, leaks) still block. Add corroboration
   definition and cross-row population requirement.

2. Parallel sequencing: Replace vague 'can run in parallel' with explicit
   constraint: Part A → (Part B background ∥ Phase 2 sync) → await
   Part B → Report. Individual try-fix attempts remain sequential.

3. Firewall guard at callee: Add constraint to try-fix hints input
   documentation — hints must not contain code-review verdict/findings
   when invoked from pr-review.

4. Reference consistency: preflight firewall note now mentions both
   review-rules.md and .instructions.md (matching SKILL.md).

CI: skill-validation 'LLM evaluation failed' is pre-existing (same
pattern on main). All individual jobs passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- pr-preflight.md: mode='sync' → mode='background' to match SKILL.md
  parallelism instruction (Part B runs in background while Phase 2 runs)
- pr-report.md: Add 'APPROVE with notes' to output template header
- pr-report.md: Ground Rows 4/5 in code-review severity labels
  (❌ Error = significant, ⚠️ style/naming + 💡 = minor)
- try-fix/SKILL.md: Add fallback for missing review-rules.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant