Add regression cross-reference with automated fix-PR test execution#35290
Add regression cross-reference with automated fix-PR test execution#35290kubaflo wants to merge 71 commits into
Conversation
🔍 Skill Validation Results✅ Static Checks PassedSkills checked: 16 | Agents checked: 3 Full validator output⏭️ LLM Evaluation: SkippedNo changed skills with eval tests found. |
Adds Find-RegressionRisks.ps1 — a purely mechanical (no AI/LLM) script that detects when a PR removes lines previously added by labeled bug-fix PRs. Algorithm: 1. Collects lines removed by the PR under review 2. Finds recent PRs touching the same files via git log 3. Filters to bug-fix PRs (i/regression, t/bug, p/0, p/1 labels) 4. Cross-references removed lines against lines those fix PRs added 5. Whitespace-insensitive comparison classifies: REVERT / OVERLAP / CLEAN Integration: - Runs as STEP 0.6 in Review-PR.ps1 (between UI test detection and Gate) - Content assembled into AI summary comment via post-ai-summary-comment.ps1 - Expert reviewer dimension #6 reads risks.json for REVERT entries - 64 unit tests covering diff parsing, normalization, and detection logic Validated against: - PR #33908: correctly detects REVERT of IMauiRecyclerView check from #32278 - PR #35272: correctly classifies as OVERLAP (no line-level revert) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a REVERT is detected, the script now: 1. Extracts test files from the fix PR's diff 2. Classifies them via Detect-TestsInDiff.ps1 (type, filter, project, runner) 3. Stores regression_tests metadata in risks.json 4. Lists required tests in content.md Review-PR.ps1 adds STEP 1.5 (after Gate builds the code): - Reads regression_tests from risks.json - Runs unit/XAML tests via dotnet test with the detected filters - Skips UI/device tests (need CI infrastructure) with clear reporting - Appends pass/fail results to content.md - Writes test-results.json for downstream consumption Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 1.5 now runs every test type from reverted fix PRs: - UI tests via BuildAndRunHostApp.ps1 (builds app, deploys, runs Appium) - Device tests via Run-DeviceTests.ps1 (xharness on device/simulator) - Unit/XAML tests via dotnet test --filter Each test runs on the same platform as the Gate step. If a runner script is missing, the test is skipped gracefully rather than failing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract and run fix-PR tests for both REVERT and OVERLAP entries. A nearby edit can break a fix through side effects even without removing the exact lines. Running tests for all risk levels gives maximum confidence that prior fixes aren't regressed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When regression tests are detected (from REVERT or OVERLAP fix PRs), inject them as MANDATORY additional tests into the STEP 2 try-fix prompt. Each try-fix candidate must run these tests after its own test passes — a candidate that breaks a prior fix is marked Fail. The Report phase (Phase 3) also ranks candidates that failed regression tests lower than those that passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 0 → 1 (Branch Setup) STEP 0.5 → 2 (UI Test Categories) STEP 0.6 → 3 (Regression Cross-Reference) STEP 1 → 4 (Gate) STEP 1.5 → 5 (Regression Test Verification) STEP 2 → 6 (PR Review) STEP 3 → 7 (Post AI Summary) STEP 4 → 8 (Apply Labels) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 3 (Regression Cross-Reference) now includes both detection AND test execution in a single step. Removes the separate STEP 5. Pipeline is now 7 steps: 1. Branch Setup 2. UI Test Categories 3. Regression Cross-Reference (detect + run fix-PR tests) 4. Gate — Test Verification 5. PR Review (Pre-Flight, Try-Fix, Report) 6. Post AI Summary 7. Apply Labels Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates false positives from fix PRs merged to a different branch (e.g., inflight/current) than the PR under review targets (e.g., main). Uses git merge-base --is-ancestor to verify the fix PR's merge commit is reachable from the PR's base branch before flagging a REVERT or OVERLAP. Fix PRs merged to unreachable branches are skipped with a clear log message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After STEP 2 detects UI test categories (via detect-ui-test-categories.ps1), run BuildAndRunHostApp.ps1 once per detected category and surface results in the existing 'uitests' section of the AI summary comment. Behaviour: - Skipped when detection returns NONE (no UI-relevant changes). - Skipped when detection returns the run-all matrix (would take hours locally; the gate already covers PR-specific tests). - Otherwise: splits the comma-separated category list, invokes BuildAndRunHostApp.ps1 -Platform <platform> -Category <cat> sequentially, records pass/fail/skip per category with duration, and appends a Markdown table to uitests/content.md so it renders in the same collapsible section as the detection output. - Platform follows the regression-test pattern: $Platform if provided, else 'android'. - Failures are informational only — they do not block the gate or affect try-fix candidate scoring (per the section footer). Also writes: - CustomAgentLogsTmp/PRState/<PR>/PRAgent/uitests/test-results.json (machine-readable per-category results) - CustomAgentLogsTmp/PRState/<PR>/PRAgent/uitests/result.txt (one-line PASSED/FAILED/SKIPPED marker) Renumbers later steps: 3 (regression cross-ref) -> 4 4 (gate) -> 5 5 (PR review) -> 6 6 (post AI summary) -> 7 7 (apply labels) -> 8 Updates the script synopsis (which had stale Step 0..4 entries from before the rebase) and fixes the cross-references inside the multi-candidate-review prompt and the AI-tier-refresh comment block. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c0c2bae to
9333658
Compare
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35290Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35290" |
There was a problem hiding this comment.
Pull request overview
Adds a regression cross-reference step to the PR review automation so the pipeline can detect when a PR removes lines previously introduced by labeled bug-fix PRs, and then optionally execute the associated tests from those fix PRs to reduce “silent revert” regressions.
Changes:
- Introduces
Find-RegressionRisks.ps1plus a PowerShell test suite to classify PR risk as CLEAN/OVERLAP/REVERT and emitrisks.json+ markdown output. - Integrates regression cross-reference + optional regression test execution into
Review-PR.ps1, and wires results into try-fix candidate instructions/scoring. - Extends the AI summary comment assembly and expert-reviewer checklist to surface regression findings; adds standalone skill documentation.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/skills/find-regression-risk/SKILL.md | New skill doc describing the regression cross-reference system and usage. |
| .github/scripts/tests/Test-FindRegressionRisks.ps1 | Adds script-level tests validating diff parsing and REVERT/OVERLAP logic. |
| .github/scripts/Review-PR.ps1 | Adds new pipeline steps for UI test execution and regression cross-reference + regression test execution; updates try-fix prompt wiring. |
| .github/scripts/post-ai-summary-comment.ps1 | Adds a new “regression-check” phase to the assembled PR summary comment. |
| .github/scripts/Find-RegressionRisks.ps1 | New core implementation of regression detection + optional test extraction metadata emission. |
| .github/agents/maui-expert-reviewer.md | Updates regression-prevention checklist to consult regression-check/risks.json REVERT entries. |
| The script runs as **STEP 0.6** in `Review-PR.ps1` (between UI test detection and the Gate step). Its `content.md` is assembled into the AI summary comment by `post-ai-summary-comment.ps1`. | ||
|
|
||
| When REVERT risks are detected, **STEP 1.5** (after Gate) runs the regression tests from the reverted fix PRs: |
| # Collect regression tests from ALL risk entries (REVERT + OVERLAP) | ||
| $regressionTests = @() | ||
| foreach ($risk in @($risksData.risks | Where-Object { $_.regression_tests.Count -gt 0 })) { | ||
| foreach ($test in $risk.regression_tests) { | ||
| $regressionTests += [PSCustomObject]@{ | ||
| FixPR = $risk.recent_pr | ||
| Type = $test.type | ||
| TestName = $test.test_name | ||
| Filter = $test.filter | ||
| ProjectPath = $test.project_path | ||
| Project = $test.project | ||
| Runner = $test.runner |
The Tier-3 AI category refresh (STEP 6 → pre STEP 7) was overwriting uitests/content.md with just the refreshed category list, wiping out the 'UI Test Execution Results' table that STEP 3 appends. Now read the existing file, extract everything from the '### 🧪 UI Test Execution Results' marker onward, write the refreshed category line, then re-append the preserved execution block so the collapsible 'UI Tests — Category Detection' section in the AI summary comment shows both the detected categories and the actual run results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Get-DotNetTestResults parser that extracts per-test status, name, duration, error message and stack trace from the captured 'dotnet test --logger console;verbosity=detailed' stdout. STEP 3 now: * Parses each category's output in-memory before the next category run wipes test-output.log. * Persists per-category logs to PRState/<PR>/PRAgent/uitests/<cat>-output.log for post-mortem traceability. * Renders an enriched markdown table (Tests column shows passed/total + failure count) plus a collapsible 'Failed test details' block per failed category showing test name, first 1500 chars of error message, and the first stack frame. * Renders a collapsible 'Show passed test names' block when there are passing tests. Reviewers can now diagnose UI test failures from the AI summary comment without downloading the full build artifact. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ory fails before tests
When a test category fails because the build or deploy crashed before any
test could run (e.g. CS0246 missing namespace, RS0016 PublicAPI errors),
the AI summary table previously showed '0/1 ✓' — the green-checkmark
'all passed' branch — because no per-test failures were parsed. That's
visually misleading: the row is FAILED but the cell looks healthy.
Two fixes:
1. Tests column distinguishes 'category failed AND no per-test failures
parsed' from 'all tests passed':
- 'build/deploy failed' (no tests at all)
- '0/1 — build/deploy failed before per-test results' (some discovered)
2. New optional 'build_tail' field captures the last 30 lines of stdout
when a category fails with zero per-test failures. The Failed test
details collapsible section then renders it in a code block so
reviewers see the actual compiler error / build crash inline,
instead of having to download the full CopilotLogs artifact.
This was discovered while running the regression-check pipeline against
PRs #35110 (142 RS0016 PublicAPI errors), #35281 (CS0246 NSAttributedString
missing for catalyst), and #35358 — all reported as '0/1 ✓' before the fix.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two related fixes that together restore meaningful per-test failure detail in the AI summary comment. ## 1. Parser input normalization (root cause of '1 failed test' bug) PowerShell's '& dotnet test 2>&1' wraps multi-line stderr blocks (and some console-logger output paths) as a single ErrorRecord/string with embedded newlines. The 'Get-DotNetTestResults' regex is anchored '^...$' (start/end of STRING, not line), so without splitting the parser would either skip those blocks entirely or — worse — collapse the entire multi-line block as one bogus 'test result' with all 100+ test names concatenated into a single name field. Observed on PR #35358 where 119 actual test failures were rendered as '1 failed test' with name='LightTheme_VerifyVisualState DarkTheme_…' (space-separated wall of 119 names) and duration='2 m 16 s 2 m 16 s …' (matching wall of 119 durations). Fix: at parser entry, split any captured element that contains '\n' or '\r' into individual lines before passing to the regex. Verified locally against the actual ViewBaseTests-output.log artifact: parser now returns 119 entries instead of 0/1. ## 2. Grouped failure rendering (avoids 65k comment overflow) Even after the parser fix, dumping 119 full error messages + first stack frames into a single comment would blow past GitHub's 65,536- character body limit. Most large failure clusters share a single root cause (e.g. all 119 ViewBaseTests fail with the same 'OneTimeSetUp: Timed out waiting for Go To Test button' because the host app didn't deploy properly). The new rendering: - Groups failed tests by the first 200 chars of error message - Shows full detail (sample names + 1500-char error + first stack frame) for the top 5 unique error groups - Adds a 'X tests failed with the same error' header when a group has multiple members, with the first 3 sample names inline - Always emits a fully-collapsed '<details>All N failed test names' block listing every individual failure so reviewers can re-run exactly the right set Wraps Group-Object output in @() because Group-Object returns a single GroupInfo (not an array) when all rows share the same key, and 'foreach' would then iterate the group's MEMBERS instead of the groups — re-introducing the same iteration bug we're trying to fix. Verified locally with a 119-failure simulation (117 with shared error, 2 unique): produces 3 groups, top group shows count=117 with sample names. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ummary When the HostApp can't be installed/launched on the device (e.g. ADB 'Failure calling service package: Broken pipe', InstallFailedException, or device offline), every test in the fixture's OneTimeSetUp times out waiting for the test driver UI. NUnit then reports all N tests as Failed. The AI summary previously rendered '119 failed tests' for what was actually a single CI infra problem — alarming reviewers about nonexistent regressions. Now: when (a) all per-test failures share a 'OneTimeSetUp:' error AND (b) the build log contains a known infra signal (ADB0010, Broken pipe, device offline, etc.), tag the category as an infrastructure failure: - Tests column shows '🛠️ infra failure (N OneTimeSetUp timeouts)' - A banner above 'Failed test details' explains these are NOT real regressions - The collapsible header uses the wrench icon instead of ❌ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
….ps1) When BuildAndRunHostApp.ps1 fails because of a transient device problem (Android ADB 'Broken pipe' install failure, iOS simulator app-launch crash, XHarness exit 83, missing devices, etc.), STEP 3 was reporting the resulting OneTimeSetUp timeouts as 100+ test regressions — even though the test runner never started. The Gate has solved this for a while via Invoke-TestRunWithRetry: it retries up to 3 times with a 30s sleep between attempts and reboots the device on Android/iOS infra failures. STEP 3 was just calling BuildAndRunHostApp once and reporting whatever came back. Now STEP 3 wraps the per-category run in the same retry loop: - Detects the same env-error signals as the Gate (ADB0010, Broken pipe, XHarness 83, AOT loader crash, etc.) - Retries up to 3 times with 30s pause between attempts - adb reboot on Android, simctl shutdown/boot on iOS - If retries exhaust, the persistent env error is propagated into infra_failure so the AI summary still labels the category as '🛠️ infra failure' instead of pretending all the tests regressed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tahoe agents have macOS 26.4 with Xcode 26.0.1-26.3 installed, but iOS 26.4 simulator is NOT always pre-installed (some VMs only have 26.3.1). Baselines were captured on 26.4 (PR #35061). Switch to the highest available Xcode (26.3) which can download iOS 26.4, then restore the default Xcode (26.0.1) for the build. This ensures Start-Emulator.ps1 descending sort picks 26.4. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bypasses ReviewPR stage entirely — RunDeepUITests starts immediately with hardcoded detectedCategories='ViewBaseTests'. Update stage disabled (no aiSummaryCommentId without ReviewPR). This is TEMPORARY for fast iteration on the iOS 26.4 simulator fix. Will be reverted once iOS 26.4 is confirmed working. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previous attempt failed with 'iOS 26.4 is not available for download' even with Xcode 26.4. Now tries 3 approaches: 1. xcodebuild -downloadPlatform iOS (latest available, no version) 2. -architectureVariant universal -buildVersion 26.4 3. -buildVersion 23E244 (exact Apple build number) Also dumps xcrun simctl list runtimes (shows bundled + disk images) to help diagnose what's available. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Main CI (ci-uitests.yml) uses AcesShared with ImageOverride Tahoe for Android. Without this demand, Android builds can land on Sequoia VMs which don't support HVF (Android emulator hardware virtualization). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Android emulator needs hardware virtualization: - AcesShared macOS VMs: HVF not supported (nested virt) - Azure Pipelines ubuntu-22.04: has /dev/kvm on hosted agents Main CI uses MAUI-DNCENG (ubuntu-22.04) for Android tests, not AcesShared. Switch androidPool to ubuntu-22.04 and add enable-kvm.yml template to configure KVM permissions before emulator boot. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hosted ubuntu-22.04 agents don't have enough free disk for the Android emulator + SDK + workloads (~15 GB needed). Remove pre- installed tools (dotnet, NDK, boost, GHC, CodeQL, etc.) to free space before provision runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On CI, dotnet build -t:Run launches the app during build, but the test project build takes 10+ min after that. By the time tests start, Android may have killed the app or shown ANR dialogs (all 119 tests fail with 'Timed out waiting for Go To Test button'). Force-stop the app package and re-launch it via am start right before tests begin so Appium connects to a fresh instance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
UiAutomator2 driver fails with 'Can't find service: settings' on API 30 emulators when the settings service hasn't fully initialized. Poll 'adb shell settings get global device_name' up to 30 times (~2.5 min) before force-restarting the app and running tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On hosted CI ubuntu agents, the API 30 emulator's settings service doesn't fully support hidden_api_policy commands. UiAutomator2 fails at session creation with 'Can't find service: settings'. Setting appium:ignoreHiddenApiPolicyError=true allows tests to proceed — this is a non-critical Appium configuration that doesn't affect test functionality. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dotnet test spends 10-25 min building the test project on slow ubuntu agents. During that time, the already-launched app goes idle and gets killed by Android (ANR), causing 'Timed out waiting for Go To Test button' on every test. Fix: pre-build the test project FIRST, then restart the app, then run dotnet test --no-build. This ensures the app is fresh when Appium connects immediately after the test runner starts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
--no-restore failed because the test project hadn't been restored yet (NETSDK1004: Assets file not found). Remove the flag so dotnet build can restore + build the test project before app restart. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert temp changes that skipped ReviewPR and disabled Update stage. Full 3-stage pipeline restored: - ReviewPR: detects categories, emits cross-stage vars - RunDeepUITests: runs on platform pool with detected categories - UpdateAISummaryComment: patches AI summary with deep results iOS verified working on Tahoe agents with iOS 26.4 (112/112 pass). Android deferred: hosted ubuntu-22.04 agents have broken emulator (ADB broken pipe on API 30). Needs MAUI-DNCENG pool (dnceng only). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hosted Azure Pipelines ubuntu-22.04 agents have broken Android emulator (ADB broken pipe on package install). The main CI uses MAUI-DNCENG with 1ESPT-Ubuntu22.04 image which has a properly configured emulator. DevDiv has access to MAUI-1ESPT pool which also has Ubuntu 22.04 agents with the same 1ESPT image. Use this for Android Deep tests and Gate (both run on androidPool). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ReviewPR stage has a comprehensive 'Create AVD and Boot Android Emulator' step that creates the AVD with -partition-size 2048m, pre-authorizes ADB keys, and waits for full boot + package manager. This leaves the emulator running for BuildAndRunHostApp.ps1 to find. The Deep stage was missing this step — Start-Emulator.ps1 tried to boot the emulator directly but hit broken pipe errors because the partition was too large and ADB keys weren't pre-authorized. Now both stages use the same emulator setup approach. The hosted ubuntu-22.04 agents work when the emulator is properly configured. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The app was being restarted BEFORE dotnet test's design-time build (10+ min). By the time NUnit actually connected via Appium, the app had already ANR'd/died from idle. Now: 1. Pre-build test project (restore + build) 2. Skip earlier app restart 3. Force-stop + re-launch app RIGHT BEFORE dotnet test @testArgs 4. dotnet test --no-build starts immediately (no build gap) Also added diagnostic log showing actual dotnet test args to confirm --no-build is being passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The working build (BID 14052116) never force-stopped or manually restarted the app. Appium's UiAutomator2 driver handles the app lifecycle via appPackage/appActivity capabilities — it force-stops and relaunches the app when creating each test session. Our manual force-stop + re-launch was interfering with Appium's app management, causing double-stop issues. The app ended up in a bad state where 'Go To Test' button couldn't be found. Now: just dismiss system dialogs and let Appium handle everything, matching the behavior of the successfully working Gate builds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The AVD boot step sets DEVICE_UDID via ##vso[task.setvariable]. BuildAndRunHostApp.ps1 was invoked without -DeviceUdid, so Start-Emulator.ps1 tried to boot a second emulator on the same AVD, causing 'Running multiple emulators with the same AVD'. Now passes $env:DEVICE_UDID when set, so Start-Emulator.ps1 uses the already-running emulator instead of trying to start a new one. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When DeviceUdid is passed as emulator-5554, the script correctly outputs the UDID, but $LASTEXITCODE from earlier adb commands (e.g., settings check) may be non-zero. BuildAndRunHostApp.ps1 checks $LASTEXITCODE after Start-Emulator.ps1 returns and exits with 'Failed to start or detect device' even though the device was found. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Android Deep UI Tests fully working on ubuntu-22.04 hosted agents: - AVD boot step creates emulator with -partition-size 2048 - DEVICE_UDID passed to BuildAndRunHostApp.ps1 - LASTEXITCODE reset in Start-Emulator.ps1 - Appium manages app lifecycle (no manual force-stop) - 118/119 ViewBaseTests passed (BID 14061801) Full 3-stage pipeline restored (ReviewPR → Deep → Update). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Rename 'UI Tests — Category Detection' → 'UI Tests' in
post-ai-summary-comment.ps1
2. Replace truncated first-line error table with full error + stack
trace in code blocks. Each failed test now shows:
- Bold test name
- Full error message + stack trace (up to 1000 chars) in a
fenced code block
- Much more useful for debugging than a single truncated line
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Every <details><summary>...</summary> now followed by <br/> for proper spacing in GitHub markdown rendering 2. UI Tests section title is now dynamic: '🧪 UI Tests — TabbedPage' or '🧪 UI Tests — CollectionView, Entry' parsed from the detected categories in uitests/content.md 3. Replaced --- separator after </summary> with <br/> tag in: - Phase sections (uitests, code-review, etc.) - Gate section - Session summary - Deep results failed-test details Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When many tests fail (e.g., 26 CollectionView failures), the comment
was very long with all stack traces expanded. Now each failed test
name is a collapsible <details><summary> with the error+stack trace
hidden inside, keeping the comment compact by default.
Structure:
<details><summary>❌ category — N failed</summary>
<details><summary><code>TestName</code></summary>
<br/>
error + stack trace in code block
</details>
</details>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All infrastructure validated: - iOS: 112/112 ViewBaseTests pass on iOS 26.4 (Tahoe agents) - Android: 118/119 ViewBaseTests pass on ubuntu-22.04 with KVM - Comment formatting: nested details/summary, dynamic titles - Both platforms verified across 3 PRs (35358, 35359, 35356) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When DEFER_COMMENT_TO_STAGE3=true (default), Stage 1 (ReviewPR) skips STEP 7 (post comment) and STEP 8 (labels), emitting aiSummaryCommentId=DEFERRED instead. Stage 3 (UpdateAISummaryComment) now: 1. Downloads CopilotLogs artifact (content files from Review) 2. Downloads deep-uitests artifact (TRX from Deep stage) 3. Appends deep results to uitests/content.md 4. Calls post-ai-summary-comment.ps1 to post the FULL comment (including deep test results from the start) 5. Applies labels via Update-AgentLabels.ps1 This means the PR author sees one comment with everything, instead of a comment that gets patched 30 min later with deep results. Backward-compatible: set DEFER_COMMENT_TO_STAGE3 to anything other than 'true' to restore the old behavior (post in Stage 1, patch in Stage 3). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Closing in favor of a new PR with updated title and description reflecting the full scope of changes (3-stage deep UI test pipeline). |
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Summary
Adds a regression cross-reference system to the PR review pipeline that detects when a PR removes or modifies lines previously added by labeled bug-fix PRs, and automatically runs those fix PRs' tests to verify no regressions are introduced.
Problem
Three P/0 regressions (#34634, #34635, #34636) shipped in 10.0.60 because PRs silently reverted lines from prior bug-fix PRs. The existing expert reviewer has dimension #6 (Regression Prevention) but it asks for new tests — it does not detect prior fix lines being removed.
Solution
A purely mechanical (no AI/LLM) 5-step algorithm integrated as STEP 3 in
Review-PR.ps1:git log --followi/regression,t/bug,p/0, orp/1When REVERT or OVERLAP is detected:
Detect-TestsInDiff.ps1BuildAndRunHostApp.ps1), device tests (Run-DeviceTests.ps1), unit/XAML tests (dotnet test)Pipeline (7 steps)
Files Changed
Find-RegressionRisks.ps1Review-PR.ps1post-ai-summary-comment.ps1regression-checkphase to content assemblymaui-expert-reviewer.mdrisks.jsonfor REVERT entriesTest-FindRegressionRisks.ps1SKILL.mdProof of Value — 38 Builds Tested
Tested across 38 successful builds covering nearly every open PR in the repo. 82 regression tests executed, 69 passed, 8 caught potential regressions.
🔴 REVERT — Real Regression Risks Caught (6 PRs)
ResetDialog()from fix #33687. 2 tests: 1✅ 1❌🟡 OVERLAP — Regression Tests Executed (7 PRs)
🟢 CLEAN — No Regression Risk (sample)
Key Metrics
Additional Runs — ios/catalyst/windows batch
Notable finding: PR #35072 ran on catalyst this time (vs ios earlier) — 9 of 14 regression tests failed on catalyst while all 14 passed on ios. This proves the system catches platform-specific regressions that would be missed by single-platform testing.
Cumulative Totals (all batches)
False-Positive Fix Verification (commit c0c2bae)
Added
git merge-base --is-ancestorcheck to filter out fix PRs merged to a different branch than the PR targets. Results on 17 successful builds:Additional builds on latest commit (all CLEAN, Gate results vary):
Key takeaway: The ancestry check eliminated all false positives from fix PRs merged to
inflight/currentwhile preserving legitimate detections. 0 script bugs, 0 false positives on the latest commit.