Add deep UI test execution to Copilot PR review pipeline#35376
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35376
Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35376"
🔍 Skill Validation Results
✅ Static Checks Passed
Skills checked: 18 | Agents checked: 4
⏭️ LLM Evaluation: Skipped
No changed skills with eval tests found.
Pull request overview
This PR extends the eng/pipelines/ci-copilot.yml Copilot PR review pipeline into a multi-stage flow that can (a) detect UI test categories, (b) run “deep” UI tests on platform-appropriate agents, and (c) post/update a unified AI summary comment that includes deep UI test results and regression-risk signals. It also adds supporting PowerShell utilities for more reliable UI test execution (retry + TRX parsing) and for identifying potential regression reverts by diff cross-referencing.
Changes:
- Add `RunDeepUITests` and `UpdateAISummaryComment` stages to run per-category UI tests on real platform pools and publish/update results in the PR comment.
- Introduce new scripts/tests for UI test retry + TRX parsing/aggregation and for regression-risk detection (`Find-RegressionRisks.ps1` + tests/docs).
- Update emulator/simulator setup and test invocation to improve stability and to produce authoritative TRX outputs for result rendering.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/TestUtils/src/UITest.Appium/AppiumAndroidApp.cs | Adds Appium capability to ignore hidden API policy errors on problematic emulator images. |
| eng/pipelines/ci-copilot.yml | Implements the 3-stage architecture (review → deep UI tests → post/update summary) and associated platform setup. |
| .github/skills/find-regression-risk/SKILL.md | Documents the regression-risk detection skill and how it integrates into the review flow. |
| .github/scripts/tests/Test-FindRegressionRisks.ps1 | Adds script-based tests for the regression-risk detector. |
| .github/scripts/shared/Start-Emulator.ps1 | Updates iOS simulator selection logic (iOS 26-first) and adds device creation fallback logic. |
| .github/scripts/shared/Invoke-UITestWithRetry.ps1 | New shared “build/deploy/run UI tests” runner with retry + device recovery and TRX discovery. |
| .github/scripts/shared/Aggregate-UITestArtifacts.Tests.ps1 | Adds Pester tests for TRX aggregation from downloaded artifacts. |
| .github/scripts/shared/Aggregate-UITestArtifacts.ps1 | New artifact downloader/parser to aggregate per-category TRX results. |
| .github/scripts/Review-PR.Tests.ps1 | Adds Pester tests for TRX parsing and console-output fallback parsing helpers. |
| .github/scripts/Review-PR.ps1 | Reworks review flow steps (category detect/run, regression cross-ref) and adds TRX-aware UI test reporting. |
| .github/scripts/post-ai-summary-comment.ps1 | Adds regression-check section and dynamic “UI Tests — ” title rendering. |
| .github/scripts/Find-RegressionRisks.ps1 | New mechanical regression-risk detector that cross-references deletions vs recent bug-fix additions. |
| .github/scripts/BuildAndRunHostApp.ps1 | Switches to TestCategory= filtering and adds TRX logging/marker output for authoritative results. |
| .github/agents/maui-expert-reviewer.md | Adds guidance to require acknowledgement when regression cross-reference detects REVERT entries. |
🔍 Multimodal Code Review — 5 agents, 5 dimensions
Reviewed by: Claude Opus 4.7 ×3, Claude Opus 4.6, GPT-5.5
🔴 Critical Findings (5 found, 2 resolved)
✅ C1 — Stage 3 skipped on no-UI PRs — FIXED in 2b5e746
Fix applied: ✅ C3 — Cross-stage
|
| # | Finding | Location |
|---|---|---|
| 1 | Emulator boot drift — Stage 2 missing retry loops, DEVICE_READY checks, ADB key re-copy | Stages 1 vs 2 |
| 2 | Stage 2 missing iOS simulator boot step — has install but no `xcrun simctl boot` | Stage 2 |
| 3 | GitHub comment content injection — TRX error text inserted without HTML escaping | ci-copilot.yml, post-ai-summary-comment.ps1 |
| 4 | Comment update races — marker-based PATCH has no locking for concurrent runs | Stage 3 |
| 5 | Soft failures hide missing results — `continueOnError` + warnings mask incomplete output | Stages 2-3 |
| 6 | `$(REQUIRED_XCODE)` macro undefined in Stage 2 iOS Xcode restore step | ci-copilot.yml:246 |
| 7 | Type filter discards valid results — `Where-Object { $_ -is [hashtable] }` rejects PSCustomObjects | Find-RegressionRisks.ps1:593 |
| 8 | `$isEnvError` uninitialized before loop — throws under strict mode | Review-PR.ps1:1429 |
| 9 | UDID regex case-sensitive — `[0-9A-F-]` won't match lowercase UDIDs | Invoke-UITestWithRetry.ps1:169 |
| 10 | `simctl create` output parsing fragile — `2>&1` + anchored regex vs string arrays | Start-Emulator.ps1:444 |
| 11 | `Get-PRMetadataIfBugFix` completely untested — most complex function, zero coverage | Test-FindRegressionRisks.ps1 |
| 12 | Test file uses hand-rolled framework instead of Pester — inconsistent with other test files | Test-FindRegressionRisks.ps1 |
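Finding 9 in the table above is easy to reproduce outside the pipeline. The following is a minimal Python sketch, not the PowerShell code from `Invoke-UITestWithRetry.ps1`: the full pattern and the sample device strings are assumptions made for illustration, and only the `[0-9A-F-]` character class comes from the review.

```python
import re

# Hypothetical full UDID pattern; the review only quotes the character class.
# re.IGNORECASE is the fix: without it, lowercase hex digits are rejected.
UDID_RE = re.compile(
    r"[0-9A-F]{8}-(?:[0-9A-F]{4}-){3}[0-9A-F]{12}",
    re.IGNORECASE,
)


def find_udid(text):
    """Return the first simulator UDID found in text, or None."""
    match = UDID_RE.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Sample lines are made up for the demonstration.
    print(find_udid("iPhone 11 Pro (a1b2c3d4-e5f6-7890-abcd-ef1234567890) (Booted)"))
    print(find_udid("iPhone 11 Pro (A1B2C3D4-E5F6-7890-ABCD-EF1234567890) (Shutdown)"))
```

In PowerShell itself, `-match` is case-insensitive by default, so the issue would most likely arise where a case-sensitive operator or a .NET `Regex` without `IgnoreCase` is used; that detail is not confirmed by the review text.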
✅ C# Appium Change — Approve
appium:ignoreHiddenApiPolicyError is correct, well-scoped, follows existing conventions, and improves CI reliability on API 30 emulators. Optional nit: tighten comment wording to clarify it suppresses adb shell settings put global hidden_api_policy* errors specifically.
🔵 Top Suggestions
- Extract duplicated steps into YAML templates — Android boot, iOS boot, workload install, Xcode bypass (~250 lines saved, eliminates drift)
- Add `Set-StrictMode -Version 3.0` to all scripts
- HTML-escape dynamic content in comment rendering (TRX error text, test names)
- Validate TestFilter strings before passing to `dotnet test`
- Convert `Test-FindRegressionRisks.ps1` to Pester for consistency
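The HTML-escaping suggestion can be illustrated with a short Python sketch. This is an assumption-laden analogue of what `post-ai-summary-comment.ps1` might do, not its actual code: `render_failure` and the exact markup are hypothetical, and only the idea of escaping TRX error text and test names before embedding them in a collapsible `<details>` block comes from the review.

```python
import html


def render_failure(test_name, error_text):
    """Render one failed test as a collapsible block with escaped content.

    Hypothetical helper: escaping keeps markup in TRX error text from being
    interpreted as HTML inside the GitHub comment.
    """
    safe_name = html.escape(test_name)
    safe_error = html.escape(error_text)
    return (
        f"<details><summary>{safe_name}</summary>\n\n"
        f"<pre>{safe_error}</pre>\n"
        f"</details>"
    )


if __name__ == "__main__":
    print(render_failure("Tap<T>", "Expected <b>1</b> but was </details>"))
```

Escaping at render time, rather than when the TRX is parsed, keeps the raw error text available for other consumers such as logs or JSON output.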
Summary
| Severity | Total | Resolved | Open |
|---|---|---|---|
| 🔴 Critical | 5 | 2 | 3 |
| 🟠 Major | 12 | 0 | 12 |
| 🟡 Minor | 17 | 0 | 17 |
| 🔵 Suggestion | 10 | — | 10 |
Architecture is sound. The 3-stage pipeline design is well-motivated and clearly documented. The remaining criticals are hardening issues — Invoke-Expression replacement (C2), retry idempotency (C4), and auth guard (C5) — suitable for a fast follow-up.
|
Addressed the 3 open critical findings from the multimodal review in commit 7037237: C2 — C4 — DEFERRED mode idempotency ✅ C5 — The 12 major findings are noted for follow-up — they're hardening improvements that don't block the core functionality. |
🔍 Re-review after commit
|
| # | Finding | Status | Commit |
|---|---|---|---|
| 🔴 C1 | Stage 3 skipped on no-UI PRs | ✅ FIXED | 2b5e746 |
| 🔴 C2 | `Invoke-Expression` fragile + injection | `Get-TrxResults` extracted, but fallback remains for 2 functions + duplicate copy in Review-PR.ps1 | 7037237 |
| 🔴 C3 | `isOutput=true` unverified | ✅ VERIFIED | (was already correct) |
| 🔴 C4 | DEFERRED mode not idempotent | ✅ FIXED (minor: add `--paginate`) | 7037237 |
| 🔴 C5 | `gh auth` → silent CLEAN | ✅ FIXED | 7037237 |
Bottom line: 4/5 critical findings fully resolved, C2 substantially improved (highest-risk function extracted) with residual Invoke-Expression for 2 simpler functions. The 12 major findings from the original review remain open but are hardening items for follow-up.
Good progress — the PR is in much better shape for merge now. 👍
|
Addressed all remaining items from the re-review in commit d31011f: C2 — fully resolved ✅
C4 pagination minor ✅
All 5 critical findings now fully resolved. |
🔍 Re-review after commit
|
| Function | File | Consumers |
|---|---|---|
| `Get-TrxResults` | `shared/Get-TrxResults.ps1` | Review-PR.ps1, Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3 |
| `Get-CategoryFromArtifactName` | `shared/Get-CategoryFromArtifactName.ps1` | Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3 |
| `Get-AggregatedTrxFromDirectory` | `shared/Get-AggregatedTrxFromDirectory.ps1` | Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3 |
Verified:
- ✅ Zero `Invoke-Expression` remaining in `ci-copilot.yml` (only the comment `# no Invoke-Expression` referencing the fix)
- ✅ Zero `Invoke-Expression` remaining in `Aggregate-UITestArtifacts.ps1`
- ✅ `Review-PR.ps1` line 375: `. "$PSScriptRoot/shared/Get-TrxResults.ps1"` — single source of truth, no duplicate copy
- ✅ Stage 3 YAML: clean dot-sourcing of all 3 files, no regex fallback

✅ C4 minor — pagination fix applied
`gh api "repos/dotnet/maui/issues/$prNumber/comments?per_page=100" --paginate --jq '...'`
Handles PRs with 30+ comments correctly now.
Final Scorecard — All 5 Critical Findings
| # | Finding | Status |
|---|---|---|
| 🔴 C1 | Stage 3 skipped on no-UI PRs | ✅ Fixed (2b5e746) |
| 🔴 C2 | `Invoke-Expression` fragile + injection | ✅ Fully fixed (d31011f) |
| 🔴 C3 | `isOutput=true` unverified | ✅ Verified (was already correct) |
| 🔴 C4 | DEFERRED mode not idempotent | ✅ Fixed + paginated (7037237 + d31011f) |
| 🔴 C5 | `gh auth` → silent CLEAN | ✅ Fixed (7037237) |
All 5 critical findings resolved. 🎉
The 12 major findings (emulator boot drift, missing iOS sim boot in Stage 2, HTML escaping, comment races, etc.) remain as tracked follow-up items — none are merge blockers.
857ed58 to
1370143
Compare
🔬 Multimodal Code Review — PR #35376
Three independent reviewers analyzed this PR across correctness, security, architecture, reliability, and maintainability. Findings are deduplicated and ranked by severity. 🔴 Critical — Fix Before Merge1. TRX overwrite + double-counting destroys test results —
|
✅ Multimodal Review Complete — All Critical Issues Fixed
Fix Commits
Issues Resolved (22 original + 7 discovered during fixes)🔴 Critical (7 fixed)
🟠 Should-fix (10 fixed) 🟡 Remaining (low-severity, tracked for follow-up)
|
00a514d to
5ad468e
Compare
Adds Find-RegressionRisks.ps1 — a purely mechanical (no AI/LLM) script that detects when a PR removes lines previously added by labeled bug-fix PRs.

Algorithm:
1. Collects lines removed by the PR under review
2. Finds recent PRs touching the same files via git log
3. Filters to bug-fix PRs (i/regression, t/bug, p/0, p/1 labels)
4. Cross-references removed lines against lines those fix PRs added
5. Whitespace-insensitive comparison classifies: REVERT / OVERLAP / CLEAN

Integration:
- Runs as STEP 0.6 in Review-PR.ps1 (between UI test detection and Gate)
- Content assembled into AI summary comment via post-ai-summary-comment.ps1
- Expert reviewer dimension #6 reads risks.json for REVERT entries
- 64 unit tests covering diff parsing, normalization, and detection logic

Validated against:
- PR #33908: correctly detects REVERT of IMauiRecyclerView check from #32278
- PR #35272: correctly classifies as OVERLAP (no line-level revert)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
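The cross-reference step of this commit can be sketched in a few lines. The Python below is an illustrative analogue, not the code in `Find-RegressionRisks.ps1`: the function names are hypothetical, and the exact rule for OVERLAP versus CLEAN (same file touched but no matching line, versus no shared file) is an assumption inferred from the "no line-level revert" wording in the commit message.

```python
import re


def normalize(line):
    """Drop all whitespace so indentation or spacing changes cannot hide a match."""
    return re.sub(r"\s+", "", line)


def classify(removed_lines, fix_added_lines, shares_file):
    """Classify one (PR under review, earlier fix PR) pair.

    removed_lines   -- lines the PR under review deletes
    fix_added_lines -- lines the earlier bug-fix PR added
    shares_file     -- True when both PRs touch at least one common file
    """
    removed = {normalize(line) for line in removed_lines} - {""}
    added = {normalize(line) for line in fix_added_lines} - {""}
    if removed & added:
        return "REVERT"  # the PR deletes a line the fix introduced
    return "OVERLAP" if shares_file else "CLEAN"


if __name__ == "__main__":
    fix = ["    if (view is IMauiRecyclerView)", "        return;"]
    print(classify(["if (view is IMauiRecyclerView)"], fix, True))
    print(classify(["var x = 1;"], fix, True))
    print(classify(["var x = 1;"], fix, False))
```

Comparing normalized sets keeps the check mechanical and cheap. A real implementation would probably also ignore trivial lines such as lone braces, which would otherwise match across unrelated changes.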
When a REVERT is detected, the script now:
1. Extracts test files from the fix PR's diff
2. Classifies them via Detect-TestsInDiff.ps1 (type, filter, project, runner)
3. Stores regression_tests metadata in risks.json
4. Lists required tests in content.md

Review-PR.ps1 adds STEP 1.5 (after Gate builds the code):
- Reads regression_tests from risks.json
- Runs unit/XAML tests via dotnet test with the detected filters
- Skips UI/device tests (need CI infrastructure) with clear reporting
- Appends pass/fail results to content.md
- Writes test-results.json for downstream consumption

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 1.5 now runs every test type from reverted fix PRs:
- UI tests via BuildAndRunHostApp.ps1 (builds app, deploys, runs Appium)
- Device tests via Run-DeviceTests.ps1 (xharness on device/simulator)
- Unit/XAML tests via dotnet test --filter

Each test runs on the same platform as the Gate step. If a runner script is missing, the test is skipped gracefully rather than failing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract and run fix-PR tests for both REVERT and OVERLAP entries. A nearby edit can break a fix through side effects even without removing the exact lines. Running tests for all risk levels gives maximum confidence that prior fixes aren't regressed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When regression tests are detected (from REVERT or OVERLAP fix PRs), inject them as MANDATORY additional tests into the STEP 2 try-fix prompt. Each try-fix candidate must run these tests after its own test passes — a candidate that breaks a prior fix is marked Fail. The Report phase (Phase 3) also ranks candidates that failed regression tests lower than those that passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 0 → 1 (Branch Setup)
STEP 0.5 → 2 (UI Test Categories)
STEP 0.6 → 3 (Regression Cross-Reference)
STEP 1 → 4 (Gate)
STEP 1.5 → 5 (Regression Test Verification)
STEP 2 → 6 (PR Review)
STEP 3 → 7 (Post AI Summary)
STEP 4 → 8 (Apply Labels)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Dedup check: Before any setup, query AzDO for running/queued builds with the same PR+Platform. If a duplicate exists, cancel this run (oldest-wins policy). Prevents 2-5x compute waste.
2. Fail-fast merge conflicts: Test merge feasibility right after checkout, BEFORE SDK/emulator setup (~15-20 min saved per conflict). On failure, posts a PR comment with conflicting file list (uses hidden marker to update existing comment, not spam).
3. Fix partiallySucceeded noise: Add deepTestsRan variable and condition the drop-deep-uitests download on it. When deep tests are skipped (no detected categories), the download is skipped too, avoiding the 'artifact not found' warning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Findings JSON unwrapping: The expert-reviewer agent writes
{"findings": [...]} (object wrapper) instead of bare [...].
ConvertFrom-Json returns 1 object with no .path property,
causing all findings to be dropped as 'suspicious path: empty'.
Now detects and unwraps both bare arrays and object wrappers.
2. Merge conflict comment: Use GH_COMMENT_TOKEN (GitHub PAT) instead
of System.AccessToken (AzDO PAT) for posting PR comments.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the pre-check merge feasibility step and the dedup step from ci-copilot.yml. Keep only the partial-success fix (deepTestsRan condition on drop-deep-uitests download). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
- Fix TRX overwrite bug: merge retry results into original TRX instead of replacing it, preserving all first-run passing tests. Delete retry TRX to prevent double-counting by aggregators.
- Fix TRX contamination across categories: use specific TRX path from Invoke-UITestWithRetry instead of recursive 30-min time-based scan.
- Fix category extraction for Stage 2 naming: add controls- prefixed variants to Get-CategoryFromArtifactName prefix list.
- Stage 2 now uses Invoke-UITestWithRetry wrapper for env-error retry and device recovery. Remove retryCountOnTaskFailure (handled in-script).
- Re-emit detectedCategories output variable after AI-tier refresh so Stage 2 picks up the refreshed list.
- Stage 3 falls back to DEFERRED mode when aiSummaryCommentId is empty but deep test artifacts exist (reviewer crashed).
- Map all VSTest outcomes (Aborted, Timeout, Error, etc.) to Failed in Get-TrxResults so failure disclosures match counter totals.

Should-fix:
- Preserve run-all sentinel as ALL (not NONE) so Stage 2 can distinguish run-everything from run-nothing.
- Centralize env-error patterns in shared/Get-EnvErrorPatterns.ps1; Invoke-UITestWithRetry dot-sources it as single source of truth.
- Strip parameter signatures from test names before building VSTest retry filter to avoid filter grammar operator injection.
- Use case-insensitive deepTestsRan comparison in pipeline condition.
- Track iOS 26.4 availability via pipeline variable for fallback warning.
- Restrict Test-IsBugFixLabel to t/bug and i/regression only; use p/0|p/1 as secondary signal AND-ed with bug labels to reduce false positives.
- Fix linked-issue regex to accept all GitHub closing keyword forms (Fix/Fixed/Close/Closed/Resolve/Resolved).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
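The "strip parameter signatures" item is worth a concrete illustration, since parameterized test names can carry characters such as `|`, `&`, `(` and `)` that are operators in the VSTest filter grammar. The Python below is a sketch under assumptions: the helper names are hypothetical, and the `FullyQualifiedName~` form is inferred from a later commit's mention of a substring-matching retry filter rather than quoted from the script.

```python
def strip_signature(test_name):
    """Keep only the part of a test name before its parameter list."""
    return test_name.split("(", 1)[0].strip()


def build_retry_filter(failed_tests):
    """Build a VSTest filter that re-runs only the failed tests.

    Hypothetical helper. Names are de-duplicated after stripping because
    several parameterized cases collapse to the same method name.
    """
    names = []
    for test in failed_tests:
        name = strip_signature(test)
        if name and name not in names:
            names.append(name)
    return "|".join(f"FullyQualifiedName~{name}" for name in names)


if __name__ == "__main__":
    failed = ['Ns.Tests.Tap("a|b")', 'Ns.Tests.Tap("c")', "Ns.Tests.Scroll"]
    print(build_retry_filter(failed))
```

One consequence of using `~` (contains) is that the retry may pull in extra tests whose names share the substring, which is why a later commit only replaces originally failed entries when merging retry results back into the TRX.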
- Fix ALL sentinel producing empty loop in Stage 2: use single-element list with empty string and skip -Category param in run-all mode
- Review-PR.ps1 now dot-sources shared/Get-EnvErrorPatterns.ps1 for infraSignals instead of hard-coded inline list
- Invoke-UITestWithRetry.ps1 fails loudly if shared patterns file is missing instead of silently using a drifted inline fallback
- CopilotLogs download gets continueOnError so DEFERRED fallback works when ReviewPR crashed before publishing the artifact
- Stage 2 uses splatted hashtable for retry wrapper params, conditionally omitting -Category in run-all mode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add ios_ui_tests-controls, catalyst_ui_tests-controls, windows_ui_tests-controls to Get-CategoryFromArtifactName prefix list so iOS/Catalyst/Windows Stage 2 artifacts extract correctly
- TRX merge counter math now uses same outcome classification as Get-TrxResults: Aborted/Timeout/Error counted as Failed, not omitted
- Only replace originally-failed entries during TRX merge (guard against substring-matching retry filter pulling in unrelated passing tests)
- Sync inline Get-TrxResults in Review-PR.ps1 with shared copy (default outcome → Failed)
- Remove dead IOS26Available emit step (no downstream consumer)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
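The "default outcome → Failed" rule can be sketched as follows. This is a Python analogue written for illustration, not the `Get-TrxResults` implementation: the TRX namespace and the `UnitTestResult`/`outcome` names are the standard TRX schema, while treating `NotExecuted` as skipped is an assumption about the script's behaviour.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TRX_NS = "{http://microsoft.com/schemas/VisualStudio/TeamTest/2010}"


def classify_outcome(outcome):
    """Map a VSTest outcome string onto passed / skipped / failed.

    Anything that is not an explicit pass or skip (Aborted, Timeout, Error,
    an empty value, ...) counts as failed, so counters and failure lists agree.
    """
    if outcome == "Passed":
        return "passed"
    if outcome == "NotExecuted":  # assumption: treated as skipped
        return "skipped"
    return "failed"


def count_trx(trx_xml):
    """Count results in a TRX document given as a string."""
    root = ET.fromstring(trx_xml)
    counts = Counter(passed=0, failed=0, skipped=0)
    for result in root.iter(f"{TRX_NS}UnitTestResult"):
        counts[classify_outcome(result.get("outcome", ""))] += 1
    return dict(counts)
```

Using one classifier for both the per-test failure list and the summary counters is what keeps the two in sync; the commit describes the earlier mismatch as Aborted/Timeout/Error results being omitted from the counter math.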
BuildAndRunHostApp.ps1 required either -TestFilter or -Category as mandatory parameters. In ALL mode (run all tests without filter), Stage 2 invokes the script with neither.

Fix:
- Make -TestFilter optional (Mandatory=false) in TestFilter param set
- Handle null effectiveFilter: omit --filter arg from dotnet test
- Add 'ALL-<platform>' TRX base name fallback when neither param set

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
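The null-filter handling can be pictured with a small argument builder. This Python sketch is an assumption-heavy analogue of the PowerShell logic: `build_dotnet_test_args` and the TRX names used for the filtered cases are hypothetical, and only the `TestCategory=` filter form, the omitted `--filter` argument, and the `ALL-<platform>` fallback come from the PR.

```python
def build_dotnet_test_args(project, platform, test_filter=None, category=None):
    """Assemble a `dotnet test` command line for one UI test run."""
    effective_filter = f"TestCategory={category}" if category else test_filter
    # Hypothetical base names for the filtered cases; the ALL-<platform>
    # fallback is the only naming rule stated in the commit.
    trx_base = category or ("filtered" if test_filter else f"ALL-{platform}")
    args = ["dotnet", "test", project,
            "--logger", f"trx;LogFileName={trx_base}.trx"]
    if effective_filter:
        # In ALL mode the filter is None, so --filter is left out entirely
        # instead of being passed an empty or null value.
        args += ["--filter", effective_filter]
    return args


if __name__ == "__main__":
    print(build_dotnet_test_args("UITests.csproj", "android"))
    print(build_dotnet_test_args("UITests.csproj", "ios", category="Button"))
```

Building the arguments as a list and appending `--filter` conditionally avoids passing an empty filter expression on the command line.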
Windows:
- Set screen resolution to 1920x1080 (AzDO agents default to 1024x768) matching main CI ui-tests-steps.yml behavior

Catalyst (MacCatalyst):
- Disable/re-enable Notification Center (intercepts UI interactions)
- Skip Xcode provisioning (catalyst doesn't need it, matches main CI)
- Pass openSslArgs: '' to avoid legacy openssl for Catalyst builds

Both iOS and Catalyst:
- Disable macOS text autocorrect/autocapitalize/spellcheck (interferes with Appium text entry tests), matching main CI setup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Guard against null $effectiveFilter in the success summary banner and info log. In ALL mode neither -Category nor -TestFilter is passed, so effectiveFilter is null. The .Substring() call on null throws under ErrorActionPreference=Stop, breaking ALL-mode test runs even when all tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Get-EnvErrorPatterns returns regex patterns (with .* and \s*). Wrapping them in [regex]::Escape() converts metacharacters to literals, so patterns like 'error ADB0010.*InstallFailedException' never match. Use -match directly, matching how Invoke-UITestWithRetry uses them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
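The escaping bug described in this commit is language-independent, so it can be shown with Python's `re` module in place of `[regex]::Escape()` and `-match`. The pattern is the one quoted in the commit message; the log line is invented for the demonstration.

```python
import re

# Pattern quoted in the commit message; it relies on ".*" as a wildcard.
pattern = r"error ADB0010.*InstallFailedException"

# Hypothetical log line, made up for this example.
log_line = "error ADB0010: Mono.AndroidTools.InstallFailedException: failure"

# Escaping turns ".*" into the literal characters "." and "*", so the
# pattern can only match text containing that exact sequence.
escaped_hit = re.search(re.escape(pattern), log_line) is not None

# Using the pattern as a regex lets ".*" span the text in between.
direct_hit = re.search(pattern, log_line) is not None

if __name__ == "__main__":
    print("escaped:", escaped_hit)
    print("direct:", direct_hit)
```

Escaping is only appropriate when the input is meant as literal text. Because the shared pattern list is authored as regular expressions, every consumer has to treat it the same way.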
The bash artifact-copy step used `cp -r CustomAgentLogsTmp` which silently fails on Windows agents (Git Bash path mismatch). This caused the CopilotLogs artifact to be missing PRAgent content files, so Stage 3 had nothing to post as the AI Summary comment. Move artifact copying to a separate pwsh step that works on all platforms (Linux, macOS, Windows). The bash step retains only the copilot session-state copy which uses $HOME. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Copilot agent writes inline findings with 'file' as the key, but post-inline-review.ps1 only read 'path'. All findings were rejected as 'suspicious path: empty' and no inline comments were ever posted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stage 2 was skipped whenever the gate hit environment errors (3x retries exhausted → CopilotFailed=true → Check Copilot Result exits 1 → ReviewPR result=Failed). But detectedCategories was already emitted successfully in STEP 2, so the deep tests COULD run. Add 'Failed' to the allowed ReviewPR results in Stage 2's condition. Deep UI tests are valuable independent of the gate outcome — they verify the PR's changes don't break existing test suites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The conditional download using deepTestsRan variable was unreliable — AzDO's $[ in() ] expression for cross-stage result evaluation can return unexpected values depending on stage result propagation timing. This caused Stage 3 to skip the artifact download even when Stage 2 produced valid TRX results, resulting in AI Summary comments showing 'SKIPPED' for UI tests that actually ran and passed. Fix: always attempt the download with continueOnError (handles the case where RunDeepUITests was skipped). Remove the unused deepTestsRan variable. Add diagnostic logging to Stage 3 for debugging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Copilot agent produces findings JSON in multiple formats:
bare array, {findings:[...]}, {schemaVersion:1, findings:[...]},
single object, etc. The previous parser used $parsed.findings
which is falsy in PowerShell when the array is empty or when
property access returns unexpected types. This caused the wrapper
object to be treated as a single finding with no path/file.
Use PSObject.Properties.Match() for explicit property detection
instead of truthy evaluation. Add diagnostic logging so future
format issues are immediately visible in pipeline logs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
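The parsing fix above, together with the earlier `file` versus `path` key fix, amounts to normalizing several input shapes into one list. The following Python sketch is an analogue of that idea rather than the PowerShell code: `normalize_findings` is a hypothetical name, and the `"findings" in parsed` test stands in for the explicit `PSObject.Properties.Match()` check.

```python
import json


def normalize_findings(raw_json):
    """Return a list of findings, each with a non-empty 'path' key."""
    parsed = json.loads(raw_json)
    if isinstance(parsed, list):
        items = parsed                    # bare array
    elif isinstance(parsed, dict) and "findings" in parsed:
        # Explicit key check: an empty findings list must yield [] and must
        # not fall through to the single-object branch below.
        items = parsed["findings"] or []
    elif isinstance(parsed, dict):
        items = [parsed]                  # single finding object
    else:
        items = []

    normalized = []
    for item in items:
        if not isinstance(item, dict):
            continue
        path = item.get("path") or item.get("file")  # accept either key
        if path:
            normalized.append({**item, "path": path})
    return normalized


if __name__ == "__main__":
    print(normalize_findings('{"schemaVersion": 1, "findings": []}'))
    print(normalize_findings('[{"file": "a.cs", "line": 3}]'))
```

Checking for the key's presence instead of its truthiness is the essential change: an empty `findings` array is a valid "no findings" answer, not a signal to reinterpret the wrapper.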
…e ref Review-PR.ps1 STEP 1 checks out origin/main and squash-merges the PR, replacing all files including .github/scripts/. Script fixes on the pipeline ref (feature/regression-check) were overwritten with main's versions, so post-inline-review.ps1 file/path fix and other changes never took effect. Fix: YAML saves .github/scripts to a backup dir before invoking Review-PR.ps1. After STEP 1 completes the branch switch, the script restores the pipeline-ref versions via SCRIPTS_BACKUP env variable. This ensures STEP 7 (inline posting) and all other script steps use the pipeline ref's code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review-PR.ps1 was checking out origin/main in CI mode, overwriting all pipeline-ref scripts with main's versions. This is why inline comment fixes, env-error patterns, and other script changes never took effect — they were replaced by main's code. Fix: in CI mode, stay on the current branch (the pipeline ref, e.g. feature/regression-check). The PR is squash-merged onto it. This preserves all script fixes while still testing the PR's changes. Remove the backup/restore workaround — no longer needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
e13ca0e to
327a75c
Compare
## Description
Extends the `maui-copilot` DevDiv pipeline (pipeline 27723) with a
3-stage architecture that runs real UI tests on platform-pool agents and
reports results directly in the AI summary PR comment.
### Pipeline Workflow
```
┌─────────────────────────────────────────────────────────┐
│ Stage 1: ReviewPR │
│ │
│ STEP 1: Branch Setup (checkout + cherry-pick PR) │
│ STEP 2: Detect UI Test Categories │
│ STEP 3: Run Detected UI Tests (in-process, fast) │
│ STEP 4: Regression Cross-Reference │
│ STEP 5: Gate — verify tests fail/pass before/after fix │
│ STEP 6: Code Review — deep analysis via Copilot agent │
│ │
│ Outputs → CopilotLogs artifact + detectedCategories │
└──────────────────────┬──────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────┐
│ Stage 2: RunDeepUITests (platform-pool agent) │
│ │
│ iOS: AcesShared Tahoe + iOS 26.4 │
│ Android: ubuntu-22.04 + KVM + AVD │
│ │
│ Runs BuildAndRunHostApp.ps1 per detected category │
│ Outputs → drop-deep-uitests artifact (TRX + diffs) │
└──────────────────────┬──────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────┐
│ Stage 3: PostResults │
│ │
│ 1. Download CopilotLogs (review content files) │
│ 2. Download drop-deep-uitests (TRX results) │
│ 3. Merge deep results into uitests/content.md │
│ 4. Post full AI Summary comment on PR │
│ 5. Apply labels (s/agent-reviewed, etc.) │
│ │
│ One comment with everything — no patching needed │
└─────────────────────────────────────────────────────────┘
```
### What's New
**Deep UI Test Execution (Stage 2)**
- Runs detected UI test categories on proper platform-pool agents (not
in-process on Linux)
- **iOS**: AcesShared Tahoe agents with iOS 26.4 simulator, iPhone 11
Pro (matching `ios-26` baselines from PR dotnet#35061)
- **Android**: ubuntu-22.04 with KVM, AVD boot with `-partition-size
2048`, `ignoreHiddenApiPolicyError` capability
- TRX results + snapshot-diff PNGs published as `drop-deep-uitests`
artifact
**Unified Comment Posting (Stage 3)**
- Comment posting and label application deferred to Stage 3 (after deep
tests complete)
- Single AI summary comment includes ALL results: code review + deep
test results
- Nested collapsible `<details>` for failed tests with full error +
stack trace
- Dynamic section title: `🧪 UI Tests — CollectionView, TabbedPage`
- Artifact download link for snapshot-diff PNGs
**Android Emulator Improvements**
- AVD boot step with proper partition size, ADB key pre-authorization,
boot wait
- `DEVICE_UDID` pass-through prevents double emulator boot
- Disk cleanup on hosted ubuntu agents (frees ~22GB)
- KVM enablement + `appium:ignoreHiddenApiPolicyError` for API 30
**iOS Simulator Improvements**
- Tahoe pool demand ensures macOS 26.x agents
- Explicit iOS 26.4 download via latest Xcode
- Auto-creates iPhone 11 Pro for baseline resolution match
### Validation
Tested across 30+ pipeline iterations on 6 PRs:
| PR | iOS | Android |
|---|---|---|
| 35358 (ViewBaseTests) | **112/112 ALL PASS** ✅ | **118/119 PASS** ✅ |
| 35359 (TabbedPage) | 44/50 (1 real failure) | 74/75 (1 real failure) |
| 35356 (CollectionView) | **415/417 PASS** ✅ | 593/619 (26 real failures) |
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>