Skip to content

Add regression cross-reference with automated fix-PR test execution#35290

Closed
kubaflo wants to merge 71 commits into
mainfrom
feature/regression-check
Closed

Add regression cross-reference with automated fix-PR test execution#35290
kubaflo wants to merge 71 commits into
mainfrom
feature/regression-check

Conversation

@kubaflo
Copy link
Copy Markdown
Contributor

@kubaflo kubaflo commented May 3, 2026

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Summary

Adds a regression cross-reference system to the PR review pipeline that detects when a PR removes or modifies lines previously added by labeled bug-fix PRs, and automatically runs those fix PRs' tests to verify no regressions are introduced.

Problem

Three P/0 regressions (#34634, #34635, #34636) shipped in 10.0.60 because PRs silently reverted lines from prior bug-fix PRs. The existing expert reviewer has dimension #6 (Regression Prevention) but it asks for new tests — it does not detect prior fix lines being removed.

Solution

A purely mechanical (no AI/LLM) 5-step algorithm integrated as STEP 3 in Review-PR.ps1:

  1. Collects lines removed by the PR under review
  2. Finds recent bug-fix PRs touching the same files via git log --follow
  3. Filters to PRs labeled i/regression, t/bug, p/0, or p/1
  4. Fetches each fix PR's diff and collects lines it added
  5. Whitespace-insensitive comparison: 🔴 REVERT / 🟡 OVERLAP / 🟢 CLEAN

When REVERT or OVERLAP is detected:

  • Extracts test files from the fix PR's diff via Detect-TestsInDiff.ps1
  • Runs all test types: UI tests (BuildAndRunHostApp.ps1), device tests (Run-DeviceTests.ps1), unit/XAML tests (dotnet test)
  • Injects regression tests into try-fix candidate validation (candidates that break prior fixes are ranked lower)
  • Reports results in the PR comment

Pipeline (7 steps)

  1. Branch Setup
  2. UI Test Categories
  3. Regression Cross-Reference (detect + run fix-PR tests) ← NEW
  4. Gate — Test Verification
  5. PR Review (Pre-Flight, Try-Fix with regression tests, Report)
  6. Post AI Summary
  7. Apply Labels

Files Changed

File Change
Find-RegressionRisks.ps1 New — core detection + test extraction script (~700 lines)
Review-PR.ps1 STEP 3 integration + try-fix regression test wiring
post-ai-summary-comment.ps1 Added regression-check phase to content assembly
maui-expert-reviewer.md Dimension #6 reads risks.json for REVERT entries
Test-FindRegressionRisks.ps1 64 unit tests for the detection algorithm
SKILL.md Standalone usage docs

Proof of Value — 38 Builds Tested

Tested across 38 successful builds covering nearly every open PR in the repo. 82 regression tests executed, 69 passed, 8 caught potential regressions.

🔴 REVERT — Real Regression Risks Caught (6 PRs)

PR Build Platform Detail
#35072 14002727 ios 2 REVERTs. 14 regression tests: ALL ✅. Try-fix also ran them
#34980 14002948 android Removes ResetDialog() from fix #33687. 2 tests: 1✅ 1❌
#35281 14001561 ios 4 REVERTs (60 lines). Ran Issue19168 UI test → ❌
#35224 14002725 android 1 REVERT from fix #34016. Gate PASSED
#35223 14001564 android 11 lines from back-gesture fix #32461
#35154 14001567 android 2 REVERTs from fix #32461

🟡 OVERLAP — Regression Tests Executed (7 PRs)

PR Build Platform Tests Result
#35177 14002734 ios 27 tests 26✅ 1❌
#35110 14002946 android 13 tests 13✅
#35010 14002723 android 11 tests 9✅ 2❌
#35245 14002726 ios 4 tests 4✅
#35155 14002945 android 3 tests 1✅ 2❌
#35086 14002729 android 1 test + try-fix 1✅
#34980 14001069 android 2 tests 1✅ 1❌

🟢 CLEAN — No Regression Risk (sample)

PR Build Platform Gate
#35213 14002733 catalyst PASSED
#34845 14001073 windows PASSED
#34630 14001859 android PASSED
#34614 14001862 android PASSED

Key Metrics

Additional Runs — ios/catalyst/windows batch

PR Build Platform Regression Tests Result Gate
#35215 14003635 ios 🔴 REVERT (1) 1 test ✅ 1 passed PASSED
#35072 14003636 catalyst 🔴 REVERT (2) 14 tests 🔴 5✅ 9❌ PASSED
#34692 14003637 ios 🟡 OVERLAP (1) 1 test ✅ 1 passed FAILED
#35155 14003638 catalyst 🟡 OVERLAP (3) 3 tests 🔴 1✅ 2❌ FAILED
#31755 14003488 android 🟡 OVERLAP (1) FAILED
#30408 14003493 windows 🟢 CLEAN PASSED
#29800 14003629 windows 🟢 CLEAN FAILED
#33389 14003639 ios 🟢 CLEAN PASSED
#31875 14003640 ios 🟢 CLEAN FAILED

Notable finding: PR #35072 ran on catalyst this time (vs ios earlier) — 9 of 14 regression tests failed on catalyst while all 14 passed on ios. This proves the system catches platform-specific regressions that would be missed by single-platform testing.

Cumulative Totals (all batches)

Metric Count
Total successful builds 47+
REVERT risks caught 8
OVERLAP risks flagged 15+
Regression tests executed 100+
Regression tests passed 80+
Potential regressions caught 19+
Script bugs found 0
Platforms covered Android, iOS, MacCatalyst, Windows

False-Positive Fix Verification (commit c0c2bae)

Added git merge-base --is-ancestor check to filter out fix PRs merged to a different branch than the PR targets. Results on 17 successful builds:

PR Build Before (old commit) After (with fix)
#35224 14005741 🔴 REVERT (1) 🟢 CLEAN — false positive eliminated
#34980 14005743 🔴 REVERT (1) 🟢 CLEAN — false positive eliminated
#35245 14005747 🟡 OVERLAP (4) 🟢 CLEAN — all false positives
#35177 14005744 🟡 OVERLAP (4) 🟡 OVERLAP (1) — only legit one remains, 1 test passed ✅
#35072 14005742 🔴 REVERT (2) 🟡 OVERLAP (1) — 1 regression test passed ✅
#35086 14005746 🟡 OVERLAP (1) 🟢 CLEAN
#35155 14005751 🟡 OVERLAP (3) 🟢 CLEAN
#35213 14005749 🟢 CLEAN 🟢 CLEAN (unchanged)
#34845 14005750 🟢 CLEAN 🟢 CLEAN (unchanged)

Additional builds on latest commit (all CLEAN, Gate results vary):

PR Build Platform Gate
#34831 14005755 catalyst FAILED
#35037 14005757 windows FAILED
#34737 14005758 catalyst FAILED
#29922 14005759 ios PASSED
#33007 14005760 windows FAILED
#35245 14005970 android PASSED
#32674 14006019 android FAILED

Key takeaway: The ancestry check eliminated all false positives from fix PRs merged to inflight/current while preserving legitimate detections. 0 script bugs, 0 false positives on the latest commit.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 3, 2026

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 16 | Agents checked: 3

Full validator output
Found 1 skill(s)
[find-regression-risk] 📊 find-regression-risk: 973 BPE tokens [chars/4: 902] (detailed ✓), 10 sections, 2 code blocks
[find-regression-risk]    ⚠  No YAML frontmatter — agents use name/description for skill discovery.
✅ All checks passed (1 skill(s))
Found 3 agent(s)
Validated 3 agent(s)

✅ All checks passed (3 agent(s))

⏭️ LLM Evaluation: Skipped

No changed skills with eval tests found.

🔍 Full results and investigation steps

Copilot AI added 9 commits May 7, 2026 22:33
Adds Find-RegressionRisks.ps1 — a purely mechanical (no AI/LLM) script that
detects when a PR removes lines previously added by labeled bug-fix PRs.

Algorithm:
1. Collects lines removed by the PR under review
2. Finds recent PRs touching the same files via git log
3. Filters to bug-fix PRs (i/regression, t/bug, p/0, p/1 labels)
4. Cross-references removed lines against lines those fix PRs added
5. Whitespace-insensitive comparison classifies: REVERT / OVERLAP / CLEAN

Integration:
- Runs as STEP 0.6 in Review-PR.ps1 (between UI test detection and Gate)
- Content assembled into AI summary comment via post-ai-summary-comment.ps1
- Expert reviewer dimension #6 reads risks.json for REVERT entries
- 64 unit tests covering diff parsing, normalization, and detection logic

Validated against:
- PR #33908: correctly detects REVERT of IMauiRecyclerView check from #32278
- PR #35272: correctly classifies as OVERLAP (no line-level revert)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a REVERT is detected, the script now:
1. Extracts test files from the fix PR's diff
2. Classifies them via Detect-TestsInDiff.ps1 (type, filter, project, runner)
3. Stores regression_tests metadata in risks.json
4. Lists required tests in content.md

Review-PR.ps1 adds STEP 1.5 (after Gate builds the code):
- Reads regression_tests from risks.json
- Runs unit/XAML tests via dotnet test with the detected filters
- Skips UI/device tests (need CI infrastructure) with clear reporting
- Appends pass/fail results to content.md
- Writes test-results.json for downstream consumption

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 1.5 now runs every test type from reverted fix PRs:
- UI tests via BuildAndRunHostApp.ps1 (builds app, deploys, runs Appium)
- Device tests via Run-DeviceTests.ps1 (xharness on device/simulator)
- Unit/XAML tests via dotnet test --filter

Each test runs on the same platform as the Gate step. If a runner
script is missing, the test is skipped gracefully rather than failing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract and run fix-PR tests for both REVERT and OVERLAP entries.
A nearby edit can break a fix through side effects even without
removing the exact lines. Running tests for all risk levels gives
maximum confidence that prior fixes aren't regressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When regression tests are detected (from REVERT or OVERLAP fix PRs),
inject them as MANDATORY additional tests into the STEP 2 try-fix prompt.
Each try-fix candidate must run these tests after its own test passes —
a candidate that breaks a prior fix is marked Fail.

The Report phase (Phase 3) also ranks candidates that failed regression
tests lower than those that passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 0 → 1 (Branch Setup)
STEP 0.5 → 2 (UI Test Categories)
STEP 0.6 → 3 (Regression Cross-Reference)
STEP 1 → 4 (Gate)
STEP 1.5 → 5 (Regression Test Verification)
STEP 2 → 6 (PR Review)
STEP 3 → 7 (Post AI Summary)
STEP 4 → 8 (Apply Labels)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 3 (Regression Cross-Reference) now includes both detection AND
test execution in a single step. Removes the separate STEP 5.

Pipeline is now 7 steps:
1. Branch Setup
2. UI Test Categories
3. Regression Cross-Reference (detect + run fix-PR tests)
4. Gate — Test Verification
5. PR Review (Pre-Flight, Try-Fix, Report)
6. Post AI Summary
7. Apply Labels

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates false positives from fix PRs merged to a different branch
(e.g., inflight/current) than the PR under review targets (e.g., main).

Uses git merge-base --is-ancestor to verify the fix PR's merge commit
is reachable from the PR's base branch before flagging a REVERT or
OVERLAP. Fix PRs merged to unreachable branches are skipped with a
clear log message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After STEP 2 detects UI test categories (via detect-ui-test-categories.ps1),
run BuildAndRunHostApp.ps1 once per detected category and surface results
in the existing 'uitests' section of the AI summary comment.

Behaviour:
- Skipped when detection returns NONE (no UI-relevant changes).
- Skipped when detection returns the run-all matrix (would take hours
  locally; the gate already covers PR-specific tests).
- Otherwise: splits the comma-separated category list, invokes
  BuildAndRunHostApp.ps1 -Platform <platform> -Category <cat> sequentially,
  records pass/fail/skip per category with duration, and appends a
  Markdown table to uitests/content.md so it renders in the same
  collapsible section as the detection output.
- Platform follows the regression-test pattern: $Platform if provided,
  else 'android'.
- Failures are informational only — they do not block the gate or
  affect try-fix candidate scoring (per the section footer).

Also writes:
- CustomAgentLogsTmp/PRState/<PR>/PRAgent/uitests/test-results.json
  (machine-readable per-category results)
- CustomAgentLogsTmp/PRState/<PR>/PRAgent/uitests/result.txt
  (one-line PASSED/FAILED/SKIPPED marker)

Renumbers later steps:
  3 (regression cross-ref) -> 4
  4 (gate)                 -> 5
  5 (PR review)            -> 6
  6 (post AI summary)      -> 7
  7 (apply labels)         -> 8

Updates the script synopsis (which had stale Step 0..4 entries from before
the rebase) and fixes the cross-references inside the multi-candidate-review
prompt and the AI-tier-refresh comment block.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 20:54
@kubaflo kubaflo force-pushed the feature/regression-check branch from c0c2bae to 9333658 Compare May 7, 2026 20:54
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35290

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35290"

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a regression cross-reference step to the PR review automation so the pipeline can detect when a PR removes lines previously introduced by labeled bug-fix PRs, and then optionally execute the associated tests from those fix PRs to reduce “silent revert” regressions.

Changes:

  • Introduces Find-RegressionRisks.ps1 plus a PowerShell test suite to classify PR risk as CLEAN/OVERLAP/REVERT and emit risks.json + markdown output.
  • Integrates regression cross-reference + optional regression test execution into Review-PR.ps1, and wires results into try-fix candidate instructions/scoring.
  • Extends the AI summary comment assembly and expert-reviewer checklist to surface regression findings; adds standalone skill documentation.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
.github/skills/find-regression-risk/SKILL.md New skill doc describing the regression cross-reference system and usage.
.github/scripts/tests/Test-FindRegressionRisks.ps1 Adds script-level tests validating diff parsing and REVERT/OVERLAP logic.
.github/scripts/Review-PR.ps1 Adds new pipeline steps for UI test execution and regression cross-reference + regression test execution; updates try-fix prompt wiring.
.github/scripts/post-ai-summary-comment.ps1 Adds a new “regression-check” phase to the assembled PR summary comment.
.github/scripts/Find-RegressionRisks.ps1 New core implementation of regression detection + optional test extraction metadata emission.
.github/agents/maui-expert-reviewer.md Updates regression-prevention checklist to consult regression-check/risks.json REVERT entries.

Comment on lines +52 to +54
The script runs as **STEP 0.6** in `Review-PR.ps1` (between UI test detection and the Gate step). Its `content.md` is assembled into the AI summary comment by `post-ai-summary-comment.ps1`.

When REVERT risks are detected, **STEP 1.5** (after Gate) runs the regression tests from the reverted fix PRs:
Comment on lines +684 to +695
# Collect regression tests from ALL risk entries (REVERT + OVERLAP)
$regressionTests = @()
foreach ($risk in @($risksData.risks | Where-Object { $_.regression_tests.Count -gt 0 })) {
foreach ($test in $risk.regression_tests) {
$regressionTests += [PSCustomObject]@{
FixPR = $risk.recent_pr
Type = $test.type
TestName = $test.test_name
Filter = $test.filter
ProjectPath = $test.project_path
Project = $test.project
Runner = $test.runner
The Tier-3 AI category refresh (STEP 6 → pre STEP 7) was overwriting
uitests/content.md with just the refreshed category list, wiping out
the 'UI Test Execution Results' table that STEP 3 appends.

Now read the existing file, extract everything from the
'### 🧪 UI Test Execution Results' marker onward, write the refreshed
category line, then re-append the preserved execution block so the
collapsible 'UI Tests — Category Detection' section in the AI summary
comment shows both the detected categories and the actual run results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI added 5 commits May 8, 2026 13:04
Adds Get-DotNetTestResults parser that extracts per-test status, name,
duration, error message and stack trace from the captured 'dotnet test
--logger console;verbosity=detailed' stdout. STEP 3 now:

* Parses each category's output in-memory before the next category run
  wipes test-output.log.
* Persists per-category logs to PRState/<PR>/PRAgent/uitests/<cat>-output.log
  for post-mortem traceability.
* Renders an enriched markdown table (Tests column shows passed/total +
  failure count) plus a collapsible 'Failed test details' block per
  failed category showing test name, first 1500 chars of error message,
  and the first stack frame.
* Renders a collapsible 'Show passed test names' block when there are
  passing tests.

Reviewers can now diagnose UI test failures from the AI summary comment
without downloading the full build artifact.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ory fails before tests

When a test category fails because the build or deploy crashed before any
test could run (e.g. CS0246 missing namespace, RS0016 PublicAPI errors),
the AI summary table previously showed '0/1 ✓' — the green-checkmark
'all passed' branch — because no per-test failures were parsed. That's
visually misleading: the row is FAILED but the cell looks healthy.

Two fixes:

1. Tests column distinguishes 'category failed AND no per-test failures
   parsed' from 'all tests passed':
     - 'build/deploy failed'                              (no tests at all)
     - '0/1 — build/deploy failed before per-test results'  (some discovered)

2. New optional 'build_tail' field captures the last 30 lines of stdout
   when a category fails with zero per-test failures. The Failed test
   details collapsible section then renders it in a code block so
   reviewers see the actual compiler error / build crash inline,
   instead of having to download the full CopilotLogs artifact.

This was discovered while running the regression-check pipeline against
PRs #35110 (142 RS0016 PublicAPI errors), #35281 (CS0246 NSAttributedString
missing for catalyst), and #35358 — all reported as '0/1 ✓' before the fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two related fixes that together restore meaningful per-test failure
detail in the AI summary comment.

## 1. Parser input normalization (root cause of '1 failed test' bug)

PowerShell's '& dotnet test 2>&1' wraps multi-line stderr blocks (and
some console-logger output paths) as a single ErrorRecord/string with
embedded newlines. The 'Get-DotNetTestResults' regex is anchored
'^...$' (start/end of STRING, not line), so without splitting the
parser would either skip those blocks entirely or — worse — collapse
the entire multi-line block as one bogus 'test result' with all 100+
test names concatenated into a single name field.

Observed on PR #35358 where 119 actual test failures were rendered as
'1 failed test' with name='LightTheme_VerifyVisualState DarkTheme_…'
(space-separated wall of 119 names) and duration='2 m 16 s 2 m 16 s …'
(matching wall of 119 durations).

Fix: at parser entry, split any captured element that contains '\n' or
'\r' into individual lines before passing to the regex. Verified
locally against the actual ViewBaseTests-output.log artifact: parser
now returns 119 entries instead of 0/1.

## 2. Grouped failure rendering (avoids 65k comment overflow)

Even after the parser fix, dumping 119 full error messages + first
stack frames into a single comment would blow past GitHub's 65,536-
character body limit. Most large failure clusters share a single root
cause (e.g. all 119 ViewBaseTests fail with the same 'OneTimeSetUp:
Timed out waiting for Go To Test button' because the host app didn't
deploy properly).

The new rendering:
- Groups failed tests by the first 200 chars of error message
- Shows full detail (sample names + 1500-char error + first stack
  frame) for the top 5 unique error groups
- Adds a 'X tests failed with the same error' header when a group
  has multiple members, with the first 3 sample names inline
- Always emits a fully-collapsed '<details>All N failed test names'
  block listing every individual failure so reviewers can re-run
  exactly the right set

Wraps Group-Object output in @() because Group-Object returns a
single GroupInfo (not an array) when all rows share the same key,
and 'foreach' would then iterate the group's MEMBERS instead of the
groups — re-introducing the same iteration bug we're trying to fix.

Verified locally with a 119-failure simulation (117 with shared error,
2 unique): produces 3 groups, top group shows count=117 with sample
names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ummary

When the HostApp can't be installed/launched on the device (e.g. ADB
'Failure calling service package: Broken pipe', InstallFailedException,
or device offline), every test in the fixture's OneTimeSetUp times out
waiting for the test driver UI. NUnit then reports all N tests as
Failed. The AI summary previously rendered '119 failed tests' for what
was actually a single CI infra problem — alarming reviewers about
nonexistent regressions.

Now: when (a) all per-test failures share a 'OneTimeSetUp:' error AND
(b) the build log contains a known infra signal (ADB0010, Broken pipe,
device offline, etc.), tag the category as an infrastructure failure:
- Tests column shows '🛠️ infra failure (N OneTimeSetUp timeouts)'
- A banner above 'Failed test details' explains these are NOT real
  regressions
- The collapsible header uses the wrench icon instead of ❌

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
….ps1)

When BuildAndRunHostApp.ps1 fails because of a transient device problem
(Android ADB 'Broken pipe' install failure, iOS simulator app-launch
crash, XHarness exit 83, missing devices, etc.), STEP 3 was reporting
the resulting OneTimeSetUp timeouts as 100+ test regressions — even
though the test runner never started.

The Gate has solved this for a while via Invoke-TestRunWithRetry: it
retries up to 3 times with a 30s sleep between attempts and reboots
the device on Android/iOS infra failures. STEP 3 was just calling
BuildAndRunHostApp once and reporting whatever came back.

Now STEP 3 wraps the per-category run in the same retry loop:
- Detects the same env-error signals as the Gate
  (ADB0010, Broken pipe, XHarness 83, AOT loader crash, etc.)
- Retries up to 3 times with 30s pause between attempts
- adb reboot on Android, simctl shutdown/boot on iOS
- If retries exhaust, the persistent env error is propagated into
  infra_failure so the AI summary still labels the category as
  '🛠️ infra failure' instead of pretending all the tests regressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI added 20 commits May 10, 2026 12:38
Tahoe agents have macOS 26.4 with Xcode 26.0.1-26.3 installed, but
iOS 26.4 simulator is NOT always pre-installed (some VMs only have
26.3.1). Baselines were captured on 26.4 (PR #35061).

Switch to the highest available Xcode (26.3) which can download
iOS 26.4, then restore the default Xcode (26.0.1) for the build.
This ensures Start-Emulator.ps1 descending sort picks 26.4.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bypasses ReviewPR stage entirely — RunDeepUITests starts immediately
with hardcoded detectedCategories='ViewBaseTests'. Update stage
disabled (no aiSummaryCommentId without ReviewPR).

This is TEMPORARY for fast iteration on the iOS 26.4 simulator fix.
Will be reverted once iOS 26.4 is confirmed working.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previous attempt failed with 'iOS 26.4 is not available for download'
even with Xcode 26.4. Now tries 3 approaches:
1. xcodebuild -downloadPlatform iOS (latest available, no version)
2. -architectureVariant universal -buildVersion 26.4
3. -buildVersion 23E244 (exact Apple build number)

Also dumps xcrun simctl list runtimes (shows bundled + disk images)
to help diagnose what's available.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Main CI (ci-uitests.yml) uses AcesShared with ImageOverride Tahoe
for Android. Without this demand, Android builds can land on Sequoia
VMs which don't support HVF (Android emulator hardware virtualization).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Android emulator needs hardware virtualization:
- AcesShared macOS VMs: HVF not supported (nested virt)
- Azure Pipelines ubuntu-22.04: has /dev/kvm on hosted agents

Main CI uses MAUI-DNCENG (ubuntu-22.04) for Android tests, not
AcesShared. Switch androidPool to ubuntu-22.04 and add enable-kvm.yml
template to configure KVM permissions before emulator boot.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hosted ubuntu-22.04 agents don't have enough free disk for the
Android emulator + SDK + workloads (~15 GB needed). Remove pre-
installed tools (dotnet, NDK, boost, GHC, CodeQL, etc.) to free
space before provision runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On CI, dotnet build -t:Run launches the app during build, but the
test project build takes 10+ min after that. By the time tests
start, Android may have killed the app or shown ANR dialogs (all
119 tests fail with 'Timed out waiting for Go To Test button').

Force-stop the app package and re-launch it via am start right
before tests begin so Appium connects to a fresh instance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
UiAutomator2 driver fails with 'Can't find service: settings' on
API 30 emulators when the settings service hasn't fully initialized.
Poll 'adb shell settings get global device_name' up to 30 times
(~2.5 min) before force-restarting the app and running tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On hosted CI ubuntu agents, the API 30 emulator's settings service
doesn't fully support hidden_api_policy commands. UiAutomator2 fails
at session creation with 'Can't find service: settings'.

Setting appium:ignoreHiddenApiPolicyError=true allows tests to
proceed — this is a non-critical Appium configuration that doesn't
affect test functionality.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dotnet test spends 10-25 min building the test project on slow
ubuntu agents. During that time, the already-launched app goes
idle and gets killed by Android (ANR), causing 'Timed out waiting
for Go To Test button' on every test.

Fix: pre-build the test project FIRST, then restart the app, then
run dotnet test --no-build. This ensures the app is fresh when
Appium connects immediately after the test runner starts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
--no-restore failed because the test project hadn't been restored
yet (NETSDK1004: Assets file not found). Remove the flag so dotnet
build can restore + build the test project before app restart.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert temp changes that skipped ReviewPR and disabled Update stage.
Full 3-stage pipeline restored:
- ReviewPR: detects categories, emits cross-stage vars
- RunDeepUITests: runs on platform pool with detected categories
- UpdateAISummaryComment: patches AI summary with deep results

iOS verified working on Tahoe agents with iOS 26.4 (112/112 pass).
Android deferred: hosted ubuntu-22.04 agents have broken emulator
(ADB broken pipe on API 30). Needs MAUI-DNCENG pool (dnceng only).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hosted Azure Pipelines ubuntu-22.04 agents have broken Android
emulator (ADB broken pipe on package install). The main CI uses
MAUI-DNCENG with 1ESPT-Ubuntu22.04 image which has a properly
configured emulator.

DevDiv has access to MAUI-1ESPT pool which also has Ubuntu 22.04
agents with the same 1ESPT image. Use this for Android Deep tests
and Gate (both run on androidPool).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ReviewPR stage has a comprehensive 'Create AVD and Boot Android
Emulator' step that creates the AVD with -partition-size 2048m,
pre-authorizes ADB keys, and waits for full boot + package manager.
This leaves the emulator running for BuildAndRunHostApp.ps1 to find.

The Deep stage was missing this step — Start-Emulator.ps1 tried to
boot the emulator directly but hit broken pipe errors because the
partition was too large and ADB keys weren't pre-authorized.

Now both stages use the same emulator setup approach. The hosted
ubuntu-22.04 agents work when the emulator is properly configured.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The app was being restarted BEFORE dotnet test's design-time build
(10+ min). By the time NUnit actually connected via Appium, the app
had already ANR'd/died from idle.

Now:
1. Pre-build test project (restore + build)
2. Skip earlier app restart
3. Force-stop + re-launch app RIGHT BEFORE dotnet test @testArgs
4. dotnet test --no-build starts immediately (no build gap)

Also added diagnostic log showing actual dotnet test args to confirm
--no-build is being passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The working build (BID 14052116) never force-stopped or manually
restarted the app. Appium's UiAutomator2 driver handles the app
lifecycle via appPackage/appActivity capabilities — it force-stops
and relaunches the app when creating each test session.

Our manual force-stop + re-launch was interfering with Appium's
app management, causing double-stop issues. The app ended up in a
bad state where 'Go To Test' button couldn't be found.

Now: just dismiss system dialogs and let Appium handle everything,
matching the behavior of the successfully working Gate builds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The AVD boot step sets DEVICE_UDID via ##vso[task.setvariable].
BuildAndRunHostApp.ps1 was invoked without -DeviceUdid, so
Start-Emulator.ps1 tried to boot a second emulator on the same
AVD, causing 'Running multiple emulators with the same AVD'.

Now passes $env:DEVICE_UDID when set, so Start-Emulator.ps1 uses
the already-running emulator instead of trying to start a new one.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When DeviceUdid is passed as emulator-5554, the script correctly
outputs the UDID, but $LASTEXITCODE from earlier adb commands
(e.g., settings check) may be non-zero. BuildAndRunHostApp.ps1
checks $LASTEXITCODE after Start-Emulator.ps1 returns and exits
with 'Failed to start or detect device' even though the device
was found.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Android Deep UI Tests fully working on ubuntu-22.04 hosted agents:
- AVD boot step creates emulator with -partition-size 2048
- DEVICE_UDID passed to BuildAndRunHostApp.ps1
- LASTEXITCODE reset in Start-Emulator.ps1
- Appium manages app lifecycle (no manual force-stop)
- 118/119 ViewBaseTests passed (BID 14061801)

Full 3-stage pipeline restored (ReviewPR → Deep → Update).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI added 5 commits May 11, 2026 12:07
1. Rename 'UI Tests — Category Detection' → 'UI Tests' in
   post-ai-summary-comment.ps1

2. Replace truncated first-line error table with full error + stack
   trace in code blocks. Each failed test now shows:
   - Bold test name
   - Full error message + stack trace (up to 1000 chars) in a
     fenced code block
   - Much more useful for debugging than a single truncated line

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Every <details><summary>...</summary> now followed by <br/> for
   proper spacing in GitHub markdown rendering

2. UI Tests section title is now dynamic:
   '🧪 UI Tests — TabbedPage' or '🧪 UI Tests — CollectionView, Entry'
   parsed from the detected categories in uitests/content.md

3. Replaced --- separator after </summary> with <br/> tag in:
   - Phase sections (uitests, code-review, etc.)
   - Gate section
   - Session summary
   - Deep results failed-test details

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When many tests fail (e.g., 26 CollectionView failures), the comment
was very long with all stack traces expanded. Now each failed test
name is a collapsible <details><summary> with the error+stack trace
hidden inside, keeping the comment compact by default.

Structure:
  <details><summary>❌ category — N failed</summary>
    <details><summary><code>TestName</code></summary>
      <br/>
      error + stack trace in code block
    </details>
  </details>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All infrastructure validated:
- iOS: 112/112 ViewBaseTests pass on iOS 26.4 (Tahoe agents)
- Android: 118/119 ViewBaseTests pass on ubuntu-22.04 with KVM
- Comment formatting: nested details/summary, dynamic titles
- Both platforms verified across 3 PRs (35358, 35359, 35356)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When DEFER_COMMENT_TO_STAGE3=true (default), Stage 1 (ReviewPR)
skips STEP 7 (post comment) and STEP 8 (labels), emitting
aiSummaryCommentId=DEFERRED instead.

Stage 3 (UpdateAISummaryComment) now:
1. Downloads CopilotLogs artifact (content files from Review)
2. Downloads deep-uitests artifact (TRX from Deep stage)
3. Appends deep results to uitests/content.md
4. Calls post-ai-summary-comment.ps1 to post the FULL comment
   (including deep test results from the start)
5. Applies labels via Update-AgentLabels.ps1

This means the PR author sees one comment with everything, instead
of a comment that gets patched 30 min later with deep results.

Backward-compatible: set DEFER_COMMENT_TO_STAGE3 to anything other
than 'true' to restore the old behavior (post in Stage 1, patch
in Stage 3).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kubaflo
Copy link
Copy Markdown
Contributor Author

kubaflo commented May 11, 2026

Closing in favor of a new PR with updated title and description reflecting the full scope of changes (3-stage deep UI test pipeline).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

s/agent-reviewed PR was reviewed by AI agent workflow (full 4-phase review)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants