Add deep UI test execution to Copilot PR review pipeline #35376

Merged
rmarinho merged 119 commits into main from feature/regression-check on May 22, 2026

@kubaflo kubaflo commented May 11, 2026

Description

Extends the maui-copilot DevDiv pipeline (pipeline 27723) with a 3-stage architecture that runs real UI tests on platform-pool agents and reports results directly in the AI summary PR comment.

Pipeline Workflow

┌─────────────────────────────────────────────────────────┐
│  Stage 1: ReviewPR                                      │
│                                                         │
│  STEP 1: Branch Setup (checkout + cherry-pick PR)       │
│  STEP 2: Detect UI Test Categories                      │
│  STEP 3: Run Detected UI Tests (in-process, fast)       │
│  STEP 4: Regression Cross-Reference                     │
│  STEP 5: Gate — verify tests fail/pass before/after fix │
│  STEP 6: Code Review — deep analysis via Copilot agent  │
│                                                         │
│  Outputs → CopilotLogs artifact + detectedCategories    │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│  Stage 2: RunDeepUITests (platform-pool agent)          │
│                                                         │
│  iOS: AcesShared Tahoe + iOS 26.4                       │
│  Android: ubuntu-22.04 + KVM + AVD                      │
│                                                         │
│  Runs BuildAndRunHostApp.ps1 per detected category      │
│  Outputs → drop-deep-uitests artifact (TRX + diffs)     │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│  Stage 3: PostResults                                   │
│                                                         │
│  1. Download CopilotLogs (review content files)         │
│  2. Download drop-deep-uitests (TRX results)            │
│  3. Merge deep results into uitests/content.md          │
│  4. Post full AI Summary comment on PR                  │
│  5. Apply labels (s/agent-reviewed, etc.)               │
│                                                         │
│  One comment with everything — no patching needed       │
└─────────────────────────────────────────────────────────┘

What's New

Deep UI Test Execution (Stage 2)

  • Runs detected UI test categories on proper platform-pool agents (not in-process on Linux)
  • iOS: AcesShared Tahoe agents with iOS 26.4 simulator, iPhone 11 Pro (matching ios-26 baselines from PR [Testing] Resaved the iOS 26.4 images #35061)
  • Android: ubuntu-22.04 with KVM, AVD boot with -partition-size 2048, ignoreHiddenApiPolicyError capability
  • TRX results + snapshot-diff PNGs published as drop-deep-uitests artifact

Unified Comment Posting (Stage 3)

  • Comment posting and label application deferred to Stage 3 (after deep tests complete)
  • Single AI summary comment includes ALL results: code review + deep test results
  • Nested collapsible <details> for failed tests with full error + stack trace
  • Dynamic section title: 🧪 UI Tests — CollectionView, TabbedPage
  • Artifact download link for snapshot-diff PNGs

Android Emulator Improvements

  • AVD boot step with proper partition size, ADB key pre-authorization, boot wait
  • DEVICE_UDID pass-through prevents double emulator boot
  • Disk cleanup on hosted ubuntu agents (frees ~22GB)
  • KVM enablement + appium:ignoreHiddenApiPolicyError for API 30

iOS Simulator Improvements

  • Tahoe pool demand ensures macOS 26.x agents
  • Explicit iOS 26.4 download via latest Xcode
  • Auto-creates iPhone 11 Pro for baseline resolution match

Validation

Tested across 30+ pipeline iterations on 6 PRs:

| PR | iOS | Android |
| --- | --- | --- |
| 35358 (ViewBaseTests) | 112/112 ALL PASS | 118/119 PASS |
| 35359 (TabbedPage) | 44/50 (1 real failure) | 74/75 (1 real failure) |
| 35356 (CollectionView) | 415/417 PASS | 593/619 (26 real failures) |

Copilot AI review requested due to automatic review settings May 11, 2026 10:36

github-actions Bot commented May 11, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35376

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35376"


github-actions Bot commented May 11, 2026

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 18 | Agents checked: 4

Full validator output
Found 1 skill(s)
[find-regression-risk] 📊 find-regression-risk: 967 BPE tokens [chars/4: 905] (detailed ✓), 10 sections, 2 code blocks
[find-regression-risk]    ⚠  No YAML frontmatter — agents use name/description for skill discovery.
✅ All checks passed (1 skill(s))
Found 4 agent(s)
Validated 4 agent(s)

✅ All checks passed (4 agent(s))

⏭️ LLM Evaluation: Skipped

No changed skills with eval tests found.

🔍 Full results and investigation steps

@kubaflo kubaflo added the area-ai-agents Copilot CLI agents, agent skills, AI-assisted development label May 11, 2026

Copilot AI left a comment

Pull request overview

This PR extends the eng/pipelines/ci-copilot.yml Copilot PR review pipeline into a multi-stage flow that can (a) detect UI test categories, (b) run “deep” UI tests on platform-appropriate agents, and (c) post/update a unified AI summary comment that includes deep UI test results and regression-risk signals. It also adds supporting PowerShell utilities for more reliable UI test execution (retry + TRX parsing) and for identifying potential regression reverts by diff cross-referencing.

Changes:

  • Add RunDeepUITests and UpdateAISummaryComment stages to run per-category UI tests on real platform pools and publish/update results in the PR comment.
  • Introduce new scripts/tests for UI test retry + TRX parsing/aggregation and for regression-risk detection (Find-RegressionRisks.ps1 + tests/docs).
  • Update emulator/simulator setup and test invocation to improve stability and to produce authoritative TRX outputs for result rendering.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/TestUtils/src/UITest.Appium/AppiumAndroidApp.cs | Adds Appium capability to ignore hidden API policy errors on problematic emulator images. |
| eng/pipelines/ci-copilot.yml | Implements the 3-stage architecture (review → deep UI tests → post/update summary) and associated platform setup. |
| .github/skills/find-regression-risk/SKILL.md | Documents the regression-risk detection skill and how it integrates into the review flow. |
| .github/scripts/tests/Test-FindRegressionRisks.ps1 | Adds script-based tests for the regression-risk detector. |
| .github/scripts/shared/Start-Emulator.ps1 | Updates iOS simulator selection logic (iOS 26-first) and adds device creation fallback logic. |
| .github/scripts/shared/Invoke-UITestWithRetry.ps1 | New shared “build/deploy/run UI tests” runner with retry + device recovery and TRX discovery. |
| .github/scripts/shared/Aggregate-UITestArtifacts.Tests.ps1 | Adds Pester tests for TRX aggregation from downloaded artifacts. |
| .github/scripts/shared/Aggregate-UITestArtifacts.ps1 | New artifact downloader/parser to aggregate per-category TRX results. |
| .github/scripts/Review-PR.Tests.ps1 | Adds Pester tests for TRX parsing and console-output fallback parsing helpers. |
| .github/scripts/Review-PR.ps1 | Reworks review flow steps (category detect/run, regression cross-ref) and adds TRX-aware UI test reporting. |
| .github/scripts/post-ai-summary-comment.ps1 | Adds regression-check section and dynamic “UI Tests — ” title rendering. |
| .github/scripts/Find-RegressionRisks.ps1 | New mechanical regression-risk detector that cross-references deletions vs recent bug-fix additions. |
| .github/scripts/BuildAndRunHostApp.ps1 | Switches to TestCategory= filtering and adds TRX logging/marker output for authoritative results. |
| .github/agents/maui-expert-reviewer.md | Adds guidance to require acknowledgement when regression cross-reference detects REVERT entries. |

Comment thread .github/scripts/post-ai-summary-comment.ps1 Outdated
Comment thread .github/scripts/shared/Start-Emulator.ps1 Outdated
Comment thread .github/scripts/Review-PR.ps1 Outdated
Comment thread .github/scripts/Review-PR.ps1 Outdated
Comment thread eng/pipelines/ci-copilot.yml Outdated
Comment thread .github/skills/find-regression-risk/SKILL.md Outdated

kubaflo commented May 11, 2026

🔍 Multimodal Code Review — 5 agents, 5 dimensions

Reviewed by: Claude Opus 4.7 ×3, Claude Opus 4.6, GPT-5.5
Dimensions: Pipeline YAML · PowerShell Scripts · Test Quality · Security & Architecture · C# Code
Scope: 14 files, +3,919 / -89 lines | Re-reviewed after commit 2b5e746


🔴 Critical Findings (5 found, 2 resolved)

C1 — Stage 3 skipped on no-UI PRs — FIXED in 2b5e746

DEFER_COMMENT_TO_STAGE3=true unconditionally defers comment posting. When detectedCategories is empty, Stage 2 is Skipped, and Stage 3's condition didn't include 'Skipped' → AI summary comment was never posted for non-UI PRs.

Fix applied: 'Skipped' added to Stage 3 condition at ci-copilot.yml:1081. ✅

C3 — Cross-stage isOutput=true unverified — VERIFIED

All 3 cross-stage variables confirmed with isOutput=true:

  • detectedCategories (line 677)
  • detectedPlatform (line 678)
  • aiSummaryCommentId (lines 1795, 1824)

Comment at line 671 updated to reference RunReview (not Detection). ✅

C2 — Invoke-Expression of regex-extracted functions — OPEN

Location: Aggregate-UITestArtifacts.ps1:71-75, ci-copilot.yml Stage 3

$rx = [regex]::Match(($aggSrc + "`n" + $reviewSrc), "(?ms)^function\s+$fn\s*\{.*?^\}", 'Multiline')
if ($rx.Success) { Invoke-Expression $rx.Value }

The regex .*?^\} matches the first line-anchored } — will silently truncate functions with nested blocks. Invoke-Expression with file-derived content is a code injection vector. Three independent reviewers flagged this (triple convergence).

Recommendation: Extract Get-TrxResults into .github/scripts/shared/Get-TrxResults.ps1 and dot-source it.
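
The truncation is easy to reproduce outside PowerShell. The sketch below uses Python's `re` module with a pattern of the same shape; the function name `Get-Demo`, its body, and the helper `extract_function` are made up for illustration. It shows that a lazy `.*?^\}` stops at the first `}` that starts a line, even when that brace closes a nested block rather than the function itself.

```python
import re
from typing import Optional

# Hypothetical script text: the nested `if` block closes with a `}` at
# column 0, before the function's own closing brace.
SOURCE = """function Get-Demo {
    if ($true) {
        'inner'
}
    'after-if'
}
"""

def extract_function(source: str, name: str) -> Optional[str]:
    # Same shape as the pattern quoted above: a lazy match that ends at the
    # first closing brace found at the start of a line.
    pattern = rf"^function\s+{re.escape(name)}\s*\{{.*?^\}}"
    match = re.search(pattern, source, re.MULTILINE | re.DOTALL)
    return match.group(0) if match else None

extracted = extract_function(SOURCE, "Get-Demo")
print(extracted)
```

The extracted text ends before `'after-if'`, so evaluating it would define a shorter function than the one in the file. Loading the file directly avoids both the truncation and the dynamic evaluation.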

C4 — DEFERRED mode not idempotent on retry — OPEN

Location: Stage 3 DEFERRED mode

In DEFERRED mode (commentId == 'DEFERRED'), Stage 3 calls post-ai-summary-comment.ps1 which creates a new comment. Pipeline retry → duplicate comments. PATCH mode is correctly idempotent via <!-- DEEP_UITESTS_BEGIN/END --> markers, but DEFERRED mode has no equivalent guard.

Recommendation: Query existing PR comments for a marker before posting; PATCH if found.
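
A minimal sketch of why marker-bounded patching is idempotent, assuming the shorthand above expands to the two marker strings below. The helper name `upsert_section` is hypothetical and is not taken from the pipeline.

```python
BEGIN = "<!-- DEEP_UITESTS_BEGIN -->"  # assumed expansion of the marker shorthand
END = "<!-- DEEP_UITESTS_END -->"

def upsert_section(body: str, section: str) -> str:
    """Replace the marked region if present, otherwise append a new one."""
    block = f"{BEGIN}\n{section}\n{END}"
    start = body.find(BEGIN)
    end = body.find(END, start + len(BEGIN)) if start != -1 else -1
    if start != -1 and end != -1:
        # Overwrite the existing region instead of adding a second copy.
        return body[:start] + block + body[end + len(END):]
    separator = "\n\n" if body else ""
    return body + separator + block

once = upsert_section("AI summary", "deep results v1")
twice = upsert_section(once, "deep results v1")    # simulated pipeline retry
updated = upsert_section(twice, "deep results v2")
```

Running the same update twice leaves the comment body unchanged, which is the property DEFERRED mode lacks when it always creates a new comment.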

C5 — gh auth failures → silent CLEAN — OPEN

Location: Find-RegressionRisks.ps1:~459

All gh calls use 2>$null. When auth fails (token expiry during long CI runs), every PR lookup returns empty, and the script writes result=CLEAN — the worst possible failure mode: accepting a risky PR because the analyzer was offline.

Recommendation: Run gh auth status at script start; abort with clear error on auth failure.
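
The recommended guard amounts to failing closed. A hedged sketch of that idea follows; `auth_ok` and `lookup` are hypothetical callables standing in for `gh auth status` and the per-PR `gh` lookups, and the `FLAGGED` verdict is a placeholder rather than the script's real output.

```python
class AnalyzerUnavailable(RuntimeError):
    """Raised when the analyzer cannot reach its data source."""

def run_analysis(pr_numbers, lookup, auth_ok):
    # Fail closed: without working authentication every lookup would come
    # back empty, which is indistinguishable from a genuinely clean result.
    if not auth_ok():
        raise AnalyzerUnavailable("not authenticated; refusing to report CLEAN")
    flagged = [number for number in pr_numbers if lookup(number)]
    return "FLAGGED" if flagged else "CLEAN"
```

With this shape, a CLEAN verdict can only come from lookups that actually ran.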


🟠 Major Findings (12) — all still open

| # | Finding | Location |
| --- | --- | --- |
| 1 | Emulator boot drift: Stage 2 missing retry loops, DEVICE_READY checks, ADB key re-copy | Stages 1 vs 2 |
| 2 | Stage 2 missing iOS simulator boot step: has install but no xcrun simctl boot | Stage 2 |
| 3 | GitHub comment content injection: TRX error text inserted without HTML escaping | ci-copilot.yml, post-ai-summary-comment.ps1 |
| 4 | Comment update races: marker-based PATCH has no locking for concurrent runs | Stage 3 |
| 5 | Soft failures hide missing results: continueOnError + warnings mask incomplete output | Stages 2-3 |
| 6 | $(REQUIRED_XCODE) macro undefined in Stage 2 iOS Xcode restore step | ci-copilot.yml:246 |
| 7 | Type filter discards valid results: Where-Object { $_ -is [hashtable] } rejects PSCustomObjects | Find-RegressionRisks.ps1:593 |
| 8 | $isEnvError uninitialized before loop: throws under strict mode | Review-PR.ps1:1429 |
| 9 | UDID regex case-sensitive: [0-9A-F-] won't match lowercase UDIDs | Invoke-UITestWithRetry.ps1:169 |
| 10 | simctl create output parsing fragile: 2>&1 + anchored regex vs string arrays | Start-Emulator.ps1:444 |
| 11 | Get-PRMetadataIfBugFix completely untested: most complex function, zero coverage | Test-FindRegressionRisks.ps1 |
| 12 | Test file uses hand-rolled framework instead of Pester: inconsistent with other test files | Test-FindRegressionRisks.ps1 |

✅ C# Appium Change — Approve

appium:ignoreHiddenApiPolicyError is correct, well-scoped, follows existing conventions, and improves CI reliability on API 30 emulators. Optional nit: tighten comment wording to clarify it suppresses adb shell settings put global hidden_api_policy* errors specifically.


🔵 Top Suggestions

  1. Extract duplicated steps into YAML templates — Android boot, iOS boot, workload install, Xcode bypass (~250 lines saved, eliminates drift)
  2. Add Set-StrictMode -Version 3.0 to all scripts
  3. HTML-escape dynamic content in comment rendering (TRX error text, test names)
  4. Validate TestFilter strings before passing to dotnet test
  5. Convert Test-FindRegressionRisks.ps1 to Pester for consistency

Summary

| Severity | Total | Resolved | Open |
| --- | --- | --- | --- |
| 🔴 Critical | 5 | 2 | 3 |
| 🟠 Major | 12 | 0 | 12 |
| 🟡 Minor | 17 | 0 | 17 |
| 🔵 Suggestion | 10 | | 10 |

Architecture is sound. The 3-stage pipeline design is well-motivated and clearly documented. The remaining criticals are hardening issues — Invoke-Expression replacement (C2), retry idempotency (C4), and auth guard (C5) — suitable for a fast follow-up.


kubaflo commented May 11, 2026

Addressed the 3 open critical findings from the multimodal review in commit 7037237:

C2 — Invoke-Expression elimination
Extracted Get-TrxResults into .github/scripts/shared/Get-TrxResults.ps1 and dot-source it. Updated both Aggregate-UITestArtifacts.ps1 and Stage 3 YAML. The remaining Get-CategoryFromArtifactName and Get-AggregatedTrxFromDirectory functions are still in the aggregate script (they're small, single-purpose, and only used there) — they use dot-source when available, with regex fallback only for those two inline functions.

C4 — DEFERRED mode idempotency
Stage 3 now queries existing PR comments for <!-- AI Summary --> marker before posting. On pipeline retry, finds the existing comment ID and falls back to PATCH mode, preventing duplicate comments.

C5 — gh auth guard
Added gh auth status check at the start of Find-RegressionRisks.ps1. Script aborts with clear error if not authenticated, preventing silent false-CLEAN results.

The 12 major findings are noted for follow-up — they're hardening improvements that don't block the core functionality.


kubaflo commented May 11, 2026

🔍 Re-review after commit 7037237 — Critical findings C2, C4, C5

✅ C5 — gh auth guard — RESOLVED

Clean implementation. gh auth status at start of Find-RegressionRisks.ps1 main block, exit 2 on failure with clear error output. Prevents false CLEAN results. No issues.

✅ C4 — DEFERRED mode idempotency — RESOLVED (one minor gap)

Before entering DEFERRED mode, queries PR comments for <!-- AI Summary --> marker. If found, switches to PATCH mode — preventing duplicate comments on retry. The marker is confirmed used by post-ai-summary-comment.ps1 (line 49).

🟡 Minor gap: no --paginate on comment query
$existingComment = gh api "repos/dotnet/maui/issues/$prNumber/comments" --jq '...' 2>$null

GitHub API paginates at 30 comments/page by default. If the AI Summary comment is on page 2+ (busy PRs), it won't be found and a duplicate will be created. Add --paginate or ?per_page=100.
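
The gap can be illustrated with a small simulation. `fetch_page` below is a hypothetical stand-in for the paged comments request and the comment data is invented; only the default page size of 30 comes from the discussion above.

```python
def find_comment_id(fetch_page, marker, per_page=100):
    """Walk every page until a comment containing the marker is found."""
    page = 1
    while True:
        comments = fetch_page(page, per_page)
        for comment in comments:
            if marker in comment["body"]:
                return comment["id"]
        if len(comments) < per_page:
            return None  # short page means this was the last one
        page += 1

# Invented data: 45 comments, with the marker on the 40th.
MARKER = "<!-- AI Summary -->"
COMMENTS = [{"id": i, "body": "chatter"} for i in range(1, 46)]
COMMENTS[39]["body"] = MARKER + " results"

def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return COMMENTS[start:start + per_page]

first_page_hits = [c["id"] for c in fake_fetch(1, 30) if MARKER in c["body"]]
found = find_comment_id(fake_fetch, MARKER, per_page=30)
```

Reading only the first page finds nothing, while walking all pages locates the comment, so the retry path patches it instead of posting a duplicate.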

⚠️ C2 — Invoke-Expression elimination — PARTIALLY RESOLVED

What's fixed (good):

  • Get-TrxResults extracted to shared/Get-TrxResults.ps1 — the primary fix
  • Aggregate-UITestArtifacts.ps1 dot-sources the shared file
  • ci-copilot.yml Stage 3 dot-sources Get-TrxResults.ps1
  • ✅ Both copies of Get-TrxResults (shared file + Review-PR.ps1) are identical right now

What's still open:

🟠 Invoke-Expression fallback still exists for 2 other functions

Stage 3 YAML (lines 1136-1141):

foreach ($fn in @('Get-CategoryFromArtifactName','Get-AggregatedTrxFromDirectory')) {
    $fnFile = ".github/scripts/shared/$fn.ps1"
    if (Test-Path $fnFile) { . $fnFile }
    else {
        # Functions defined inline in Aggregate script — extract via regex as fallback
        $rx = [regex]::Match($aggSrc, "(?ms)^function\s+$fn\s*\{.*?^\}", 'Multiline')
        if ($rx.Success) { Invoke-Expression $rx.Value }
    }
}

The fallback path still uses the same fragile regex+Invoke-Expression pattern for Get-CategoryFromArtifactName and Get-AggregatedTrxFromDirectory. Since the .ps1 files don't exist, the fallback always fires. Lower risk than Get-TrxResults (simpler functions), but the pattern remains.

🟡 Review-PR.ps1 still has its own copy of Get-TrxResults

Review-PR.ps1:380 defines Get-TrxResults inline instead of dot-sourcing shared/Get-TrxResults.ps1. The two copies are byte-identical today, but will drift over time. One edit to the shared file without updating Review-PR.ps1 (or vice versa) silently produces different TRX parsing behavior in Stage 1 vs Stage 2/3.

Recommendation: Have Review-PR.ps1 dot-source the shared file:

. "$PSScriptRoot/shared/Get-TrxResults.ps1"

Updated Scorecard — All 5 Critical Findings

| # | Finding | Status | Commit |
| --- | --- | --- | --- |
| 🔴 C1 | Stage 3 skipped on no-UI PRs | FIXED | 2b5e746 |
| 🔴 C2 | Invoke-Expression fragile + injection | ⚠️ PARTIAL: Get-TrxResults extracted, but fallback remains for 2 functions + duplicate copy in Review-PR.ps1 | 7037237 |
| 🔴 C3 | isOutput=true unverified | VERIFIED (was already correct) | |
| 🔴 C4 | DEFERRED mode not idempotent | FIXED (minor: add --paginate) | 7037237 |
| 🔴 C5 | gh auth → silent CLEAN | FIXED | 7037237 |

Bottom line: 4/5 critical findings fully resolved, C2 substantially improved (highest-risk function extracted) with residual Invoke-Expression for 2 simpler functions. The 12 major findings from the original review remain open but are hardening items for follow-up.

Good progress — the PR is in much better shape for merge now. 👍


kubaflo commented May 11, 2026

Addressed all remaining items from the re-review in commit d31011f:

C2 — fully resolved

  • Extracted Get-CategoryFromArtifactName and Get-AggregatedTrxFromDirectory into separate .ps1 files in shared/
  • Stage 3 YAML now dot-sources all 3 files directly — zero Invoke-Expression remaining
  • Review-PR.ps1 dot-sources shared/Get-TrxResults.ps1 instead of duplicate inline copy — single source of truth, no drift risk

C4 pagination minor

  • Added --paginate + ?per_page=100 to existing comment query so it finds AI Summary comments on busy PRs with 30+ comments

All 5 critical findings now fully resolved.


kubaflo commented May 11, 2026

🔍 Re-review after commit d31011f — C2 fully resolved

✅ C2 — Invoke-Expression elimination — FULLY RESOLVED

All three shared functions extracted into standalone dot-sourceable files:

| Function | File | Consumers |
| --- | --- | --- |
| Get-TrxResults | shared/Get-TrxResults.ps1 | Review-PR.ps1, Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3 |
| Get-CategoryFromArtifactName | shared/Get-CategoryFromArtifactName.ps1 | Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3 |
| Get-AggregatedTrxFromDirectory | shared/Get-AggregatedTrxFromDirectory.ps1 | Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3 |

Verified:

  • Zero Invoke-Expression remaining in ci-copilot.yml (only the comment # no Invoke-Expression referencing the fix)
  • Zero Invoke-Expression remaining in Aggregate-UITestArtifacts.ps1
  • Review-PR.ps1 line 375: . "$PSScriptRoot/shared/Get-TrxResults.ps1" — single source of truth, no duplicate copy
  • ✅ Stage 3 YAML: clean dot-sourcing of all 3 files, no regex fallback

✅ C4 minor — pagination fix applied

gh api "repos/dotnet/maui/issues/$prNumber/comments?per_page=100" --paginate --jq '...'

Handles PRs with 30+ comments correctly now.


Final Scorecard — All 5 Critical Findings

| # | Finding | Status |
| --- | --- | --- |
| 🔴 C1 | Stage 3 skipped on no-UI PRs | ✅ Fixed (2b5e746) |
| 🔴 C2 | Invoke-Expression fragile + injection | Fully fixed (d31011f) |
| 🔴 C3 | isOutput=true unverified | ✅ Verified (was already correct) |
| 🔴 C4 | DEFERRED mode not idempotent | ✅ Fixed + paginated (7037237 + d31011f) |
| 🔴 C5 | gh auth → silent CLEAN | ✅ Fixed (7037237) |

All 5 critical findings resolved. 🎉

The 12 major findings (emulator boot drift, missing iOS sim boot in Stage 2, HTML escaping, comment races, etc.) remain as tracked follow-up items — none are merge blockers.

@kubaflo kubaflo changed the title from "[CI] Add deep UI test execution to Copilot PR review pipeline" to "Add deep UI test execution to Copilot PR review pipeline" May 11, 2026
@kubaflo kubaflo force-pushed the feature/regression-check branch from 857ed58 to 1370143 Compare May 19, 2026 11:48

kubaflo commented May 20, 2026

🔬 Multimodal Code Review — PR #35376

Reviewed by: Claude Opus 4.7 (Extra-high reasoning) · Claude Opus 4.7 · GPT-5.5
Scope: 19 files, 4339 additions — 3-stage deep UI test pipeline

Three independent reviewers analyzed this PR across correctness, security, architecture, reliability, and maintainability. Findings are deduplicated and ranked by severity.


🔴 Critical — Fix Before Merge

1. TRX overwrite + double-counting destroys test results: BuildAndRunHostApp.ps1:426 + Get-AggregatedTrxFromDirectory.ps1

Found by: Opus 4.7 XHigh ✦ GPT-5.5 ✦ Opus 4.7

When Android retry runs (BuildAndRunHostApp.ps1:387–430):

  1. Retry produces a TRX with only the retried tests (e.g. 5 out of 100)
  2. Copy-Item $retryTrxPath $trxFilePath -Force (line 426) overwrites the original full TRX
  3. Both retry-*.trx and the overwritten original survive in the artifact directory
  4. Get-AggregatedTrxFromDirectory sums all TRX counters with no deduplication

Result: A 100-test suite with 5 flaky failures that all pass on retry is reported as "10/10 ✓" instead of "100/100 ✓". The 95 passing tests vanish from the record.

Fix: Don't overwrite the original TRX. Either merge retry results per-test, or delete $retryTrxPath after overwrite and ensure only one TRX per category reaches the aggregator.
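
The per-test merge option can be sketched as follows. Results are modelled as plain name-to-outcome dictionaries rather than TRX XML, and `merge_retry` and `summarize` are hypothetical helpers, so this shows the merge rule only, not the script's implementation.

```python
def merge_retry(original: dict, retry: dict) -> dict:
    """Overlay retry outcomes onto the full run, only for tests that failed."""
    merged = dict(original)
    for name, outcome in retry.items():
        if merged.get(name) == "Failed":
            merged[name] = outcome
    return merged

def summarize(results: dict) -> str:
    passed = sum(1 for outcome in results.values() if outcome == "Passed")
    return f"{passed}/{len(results)}"

# The scenario from the finding: 100 tests, 5 flaky failures that pass on retry.
full_run = {f"Test{i}": "Passed" for i in range(95)}
full_run.update({f"Flaky{i}": "Failed" for i in range(5)})
retry_run = {f"Flaky{i}": "Passed" for i in range(5)}

merged = merge_retry(full_run, retry_run)
print(summarize(merged))  # 100/100
```

Because the merge is keyed per test, each test is counted once and the 95 tests that never needed a retry stay in the record.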

2. TRX contamination across categories in Stage 2: ci-copilot.yml:1081-1086

Found by: GPT-5.5 ✦ Opus 4.7

After each category run, the script recursively copies every .trx modified in the last 30 min from the working directory into the current category's drop folder. This picks up TRX files from previous categories. The aggregator then attributes earlier categories' results to later ones.

Compounding issue: The 30-minute freshness filter (LastWriteTime -gt (Get-Date).AddMinutes(-30)) is also too short — CollectionView alone has taken ~40 min in real builds, so its TRX gets filtered out when later categories finish.

Fix: Copy only the TRX path emitted by the just-run BuildAndRunHostApp.ps1, or isolate output directories per category. Remove the time-based filter entirely.

3. Category extraction breaks for Stage 2 artifact naming: Get-CategoryFromArtifactName.ps1:12-19

Found by: Opus 4.7 XHigh ✦ Opus 4.7

Stage 2 names artifacts drop-${platform}_ui_tests-controls-$safeCat. After stripping ^drop- and matching prefix android_ui_tests, the regex captures controls-CollectionView instead of CollectionView.

Every category in the AI summary comment gets a spurious controls- prefix. The Pester tests only cover legacy naming — no test for the new convention.

Fix: Either drop controls- from the dir name at line 1040, or add it to the prefix list in Get-CategoryFromArtifactName.
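
A sketch of the second option, treating `controls-` as an optional part of the prefix. The prefix list here is an assumption, since the real helper's list is not shown in this thread, and `category_from_artifact` is a hypothetical name.

```python
import re

# Assumed prefix list for illustration only.
PLATFORM_PREFIXES = ("android_ui_tests", "ios_ui_tests")

def category_from_artifact(name: str) -> str:
    stem = re.sub(r"^drop-", "", name)
    for prefix in PLATFORM_PREFIXES:
        # Accept both "<prefix>-<Category>" and "<prefix>-controls-<Category>".
        match = re.match(rf"^{re.escape(prefix)}-(?:controls-)?(.+)$", stem)
        if match:
            return match.group(1)
    return stem

print(category_from_artifact("drop-android_ui_tests-controls-CollectionView"))
```

A test case for the Stage 2 naming, next to the legacy one, would have caught the spurious prefix.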

4. Stage 2 doesn't use Invoke-UITestWithRetry, so deep tests get no retry/recovery: ci-copilot.yml:1040-1060

Found by: Opus 4.7

Stage 2's per-category loop calls BuildAndRunHostApp.ps1 directly (line 1040-1060), bypassing the Invoke-UITestWithRetry.ps1 wrapper this PR introduces. The retry helper — with device recovery, env-error detection, and adb reboot — only runs in Stage 1's in-process tests.

Additionally, retryCountOnTaskFailure: 2 on the step re-runs the entire per-category loop, causing already-passed categories to run again and produce duplicate TRX files.

Fix: Wrap each category invocation in Invoke-UITestWithRetry.ps1.

5. AI-refreshed categories not propagated to Stage 2: Review-PR.ps1:1810

Found by: GPT-5.5

After the AI suggests additional test categories (line 1810-1846), the refreshed list is written to uitests/content.md but ##vso[task.setvariable variable=detectedCategories;isOutput=true] is never re-emitted. Stage 2 runs the pre-AI category set while the summary claims AI-derived categories were selected.

Fix: Re-emit the output variable after computing $refreshedCategories.

6. Stage 3 comment silently lost when Stage 1 reviewer crashes: ci-copilot.yml condition

Found by: Opus 4.7

UpdateAISummaryComment requires aiSummaryCommentId != ''. If the Copilot CLI crashes before STEP 6 sets that output, Stage 3 is skipped entirely — and deep test results from Stage 2 are never posted.

Fix: Add a fallback path to post a degraded comment ("review failed, but here are deep test results") when aiSummaryCommentId is empty but deep tests succeeded.

7. VSTest outcome misclassification hides failed tests: Get-TrxResults.ps1:43-50

Found by: Opus 4.7

The outcome switch only handles Passed/Failed/NotExecuted/Inconclusive. Outcomes like Aborted, Timeout, Error, Disconnected fall through to the raw string. The failed-test disclosure checks $r.status -eq 'Failed', so aborted/timed-out tests count in the total but never appear in the detail section.

Fix: Map all non-Passed/non-NotExecuted outcomes to Failed.
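
The proposed mapping, sketched under the assumption that the statuses are the three strings below. The `Skipped` label and both helper names are illustrative; only the rule itself comes from the fix above.

```python
def normalize_outcome(outcome: str) -> str:
    if outcome == "Passed":
        return "Passed"
    if outcome == "NotExecuted":
        return "Skipped"
    # Aborted, Timeout, Error, Disconnected and anything unrecognized.
    return "Failed"

def count_outcomes(outcomes):
    statuses = [normalize_outcome(o) for o in outcomes]
    total = len(statuses)
    passed = statuses.count("Passed")
    skipped = statuses.count("Skipped")
    # Derive failures from the other counters so the numbers always add up.
    return {"total": total, "passed": passed, "skipped": skipped,
            "failed": total - passed - skipped}

print(count_outcomes(["Passed", "Aborted", "Timeout", "NotExecuted"]))
```

Defaulting unknown outcomes to Failed means a new outcome string surfaces in the failure details instead of disappearing from them.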


🟠 Should Fix

8. Run-all detection silently becomes NONE: Review-PR.ps1:664

Found by: GPT-5.5

The category detector uses blank/empty as the "run all categories" sentinel. Review-PR.ps1 maps blank to NONE, causing Stage 2 to skip run-all cases entirely.

9. Three divergent copies of env-error patterns: Invoke-UITestWithRetry.ps1 vs Review-PR.ps1 vs Gate

Found by: Opus 4.7

The pattern lists differ (e.g. XHarness exit code: 83 exists in one but not others; Could not connect to device in another). Two different definitions of "env error" → different retry decisions for identical failures.

Fix: Extract to a shared EnvErrorPatterns.ps1 and dot-source from all sites.

10. VSTest filter operators unescaped in retry filter: BuildAndRunHostApp.ps1:396

Found by: Opus 4.7 XHigh

Parameterized test names (e.g. TestMethod(arg: "value")) contain (, ), | which are VSTest filter grammar operators. Unescaped, this either breaks the filter command or matches unintended tests.

Fix: Strip parameter signatures before building filter, or escape special characters per VSTest rules.
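
A sketch of the strip-the-signature option. It assumes the retry filter is built from `FullyQualifiedName~` clauses joined with `|`; the property the script actually filters on is not shown here, and both helper names are hypothetical.

```python
def base_test_name(display_name: str) -> str:
    """Drop the parameter list from a parameterized test name."""
    return display_name.split("(", 1)[0].strip()

def build_retry_filter(failed_names):
    # De-duplicate while preserving order: several parameterized cases
    # collapse to the same base name.
    bases = list(dict.fromkeys(base_test_name(n) for n in failed_names))
    return "|".join(f"FullyQualifiedName~{base}" for base in bases)

failed = ['TestMethod(arg: "value")', 'TestMethod(arg: "other")', "Plain"]
print(build_retry_filter(failed))
```

The trade-off is that a retry re-runs every case of a parameterized method, not only the case that failed, in exchange for a filter string with no operator characters in it.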

11. Android emulator recovery gap in Stage 2: ci-copilot.yml

Found by: Opus 4.7

retryCountOnTaskFailure: 2 re-runs the pwsh script but not the emulator boot step. If the emulator crashed mid-run, the retry finds a dead device and fails with "device offline" — no recovery path exists.

12. iOS 26.4 fallback silently breaks visual tests: ci-copilot.yml

Found by: Opus 4.7

If all three iOS 26.4 download approaches fail (continueOnError: true), Start-Emulator.ps1 falls back to iOS 18/17 — but visual baselines live under snapshots/ios-26. Every visual test then fails with "size differs" with no indication in the comment about why.

13. Regression risk false positives on p/1 label: Find-RegressionRisks.ps1:438

Found by: Opus 4.7

Test-IsBugFixLabel matches p/1 — a priority label that also applies to enhancements. This produces false "regression risk" flags for non-bugfix PRs.

Fix: Only use t/bug|i/regression; use p/0|p/1 as secondary signal AND-ed with a bug label.
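
The suggested rule in sketch form, using the label names from the fix above. The function names are hypothetical and the real `Test-IsBugFixLabel` may be structured differently.

```python
BUG_LABELS = {"t/bug", "i/regression"}
PRIORITY_LABELS = {"p/0", "p/1"}

def is_bug_fix(labels) -> bool:
    """A PR counts as a bug fix only if it carries a bug-type label."""
    return bool(BUG_LABELS & set(labels))

def is_priority_bug_fix(labels) -> bool:
    """Priority is a secondary signal, AND-ed with a bug-type label."""
    label_set = set(labels)
    return bool(BUG_LABELS & label_set) and bool(PRIORITY_LABELS & label_set)
```

Under this rule a PR labelled only `p/1` is no longer treated as a bug fix, which removes the false positives on prioritized enhancements.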

14. Linked-issue regex misses valid GitHub forms: Find-RegressionRisks.ps1:449

Found by: Opus 4.7

Only matches Fixes/Closes/Resolves but GitHub also accepts Fix, Fixed, Close, Closed, Resolve, Resolved.
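
One way to cover all nine keyword forms with a single pattern is sketched below. It is written for illustration and is not the script's regex; it handles same-repository references only (`#123`), not cross-repository ones.

```python
import re

# close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved
CLOSING_RE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)

def linked_issues(text: str):
    return [int(number) for number in CLOSING_RE.findall(text)]

print(linked_issues("Fixes #12, closed #34 and Resolve #56; see #78"))
```

The word boundary keeps words such as "prefix" from matching, and a plain mention like "see #78" is ignored because no closing keyword precedes it.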


🟡 Nits & Cleanup

15–22: Lower-severity items
  15. Stage 3 inline pwsh is 240+ lines: move to a standalone testable script
  16. deepTestsRan case-sensitive comparison: use lower() for safety
  17. Android boot bash duplicated between Stage 1 and Stage 2 (~60 lines)
  18. No test coverage for Invoke-UITestWithRetry.ps1 (255 lines, 0 tests)
  19. Malformed TRX parse failures invisible: Get-TrxResults returns $null, aggregator skips silently
  20. --all fallback in git log can poison REVERT decisions on shallow clones
  21. Rate limit guard is advisory only: 1000+ gh calls possible with no hard cap
  22. Hardcoded emulator-5554 in three identical fallback blocks: extract to helper

✅ Strengths Noted

  • 3-stage architecture is the right shape for AzDO — clean separation of concerns
  • Regression risk detection algorithm is well-designed (whitespace normalization, trivial-line filtering, move suppression, ancestry verification)
  • Pester test suites for Find-RegressionRisks (418 lines) and Aggregate-UITestArtifacts (169 lines) — good coverage of the new logic
  • Output normalization in Invoke-UITestWithRetry.ps1 (splitting multi-line ErrorRecords) solves a real and subtle PowerShell pitfall
  • Secret handling is clean — ConvertTo-Json for API payloads, AzDO masking for tokens, no set -x on secret-bearing steps


kubaflo commented May 20, 2026

✅ Multimodal Review Complete — All Critical Issues Fixed

Models: Claude Opus 4.7 (Extra-high reasoning) · Claude Opus 4.7 · GPT-5.5
Rounds: 3 review cycles + 4 fix commits

Fix Commits

| Commit | Description |
| --- | --- |
| 526271871a | Round 1: TRX overwrite→merge, retry wrapper in Stage 2, category extraction, VSTest outcome mapping, filter escaping, env-pattern centralization, linked-issue regex, bug-label tightening |
| 7a61d795c5 | Round 2: ALL sentinel empty-loop fix, Review-PR.ps1 dot-sources shared env patterns, CopilotLogs continueOnError, remove drifted inline fallback |
| 1ae95fc8ce | Round 3: iOS/Catalyst/Windows category prefixes, counter math alignment, merge-only-failed guard, inline Get-TrxResults sync, dead code removal |
| bcfb986665 | Round 4: Make TestFilter optional for ALL mode, handle no-filter dotnet test invocation |

Issues Resolved (22 original + 7 discovered during fixes)

🔴 Critical (7 fixed)

  1. TRX overwrite destroys passing test data → XML merge with retry TRX deletion
  2. TRX contamination across categories → Specific TRX path from retry wrapper
  3. Category extraction breaks for Stage 2 → All platform prefixes added
  4. Stage 2 has no retry/recovery → Uses Invoke-UITestWithRetry wrapper
  5. AI-refreshed categories not propagated → Re-emits output variable
  6. Stage 3 lost when reviewer crashes → DEFERRED fallback + continueOnError
  7. VSTest outcomes misclassified → All non-Passed/non-NotExecuted → Failed

🟠 Should-fix (10 fixed)
8. Run-all → NONE conversion → ALL sentinel with proper handling
9. 3 divergent env-error pattern copies → Centralized in Get-EnvErrorPatterns.ps1
10. VSTest filter operator injection → Strip parameter signatures
11. deepTestsRan case-sensitive → lower() in condition
12. p/1 false positives → Secondary signal AND-ed with bug labels
13. Linked-issue regex incomplete → All GitHub closing forms
14. ALL mode empty loop → Single-element list with empty string
15. Counter math excludes Aborted/Timeout → total - passed - skipped
16. Retry merge flips passing tests → Only replace $failedNames entries
17. ALL mode mandatory parameter → TestFilter optional, no-filter path

🟡 Remaining (low-severity, tracked for follow-up)

  • Stage 3 inline pwsh (240+ lines) — should be extracted to standalone script
  • Android boot bash duplicated between Stage 1 and Stage 2
  • DEFERRED fallback unreachable when ReviewPR fails mid-execution (needs stage split)
  • TRX merge dedup key is testName not testId (no current collision)

Copilot AI added 6 commits May 22, 2026 12:51
Adds Find-RegressionRisks.ps1 — a purely mechanical (no AI/LLM) script that
detects when a PR removes lines previously added by labeled bug-fix PRs.

Algorithm:
1. Collects lines removed by the PR under review
2. Finds recent PRs touching the same files via git log
3. Filters to bug-fix PRs (i/regression, t/bug, p/0, p/1 labels)
4. Cross-references removed lines against lines those fix PRs added
5. Whitespace-insensitive comparison classifies: REVERT / OVERLAP / CLEAN
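
A minimal Python sketch of steps 4 and 5 follows (the real script is PowerShell). It assumes REVERT means at least one removed line equals a line the fix PR added once whitespace is ignored, OVERLAP means the two PRs share a file without such a match, and CLEAN means no shared files. The exact rules in Find-RegressionRisks.ps1 may differ, and all names here are hypothetical.

```python
import re


def normalize(line):
    """Remove all whitespace so indentation or spacing changes do not hide a match."""
    return re.sub(r"\s+", "", line)


def classify_risk(removed_by_pr, added_by_fix):
    """Classify one (PR under review, earlier fix PR) pair.

    Both arguments map a file path to a list of lines:
    removed_by_pr holds lines the PR deletes, added_by_fix holds lines the fix added.
    """
    shared_files = set(removed_by_pr) & set(added_by_fix)
    if not shared_files:
        return "CLEAN"
    for path in shared_files:
        fix_lines = {normalize(line) for line in added_by_fix[path]}
        fix_lines.discard("")  # blank lines carry no signal
        if any(normalize(line) in fix_lines for line in removed_by_pr[path]):
            return "REVERT"
    return "OVERLAP"


fix = {"Items.cs": ["if (view is IMauiRecyclerView)\t{"]}
verdict = classify_risk({"Items.cs": ["    if (view is IMauiRecyclerView) {"]}, fix)
```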

Integration:
- Runs as STEP 0.6 in Review-PR.ps1 (between UI test detection and Gate)
- Content assembled into AI summary comment via post-ai-summary-comment.ps1
- Expert reviewer dimension #6 reads risks.json for REVERT entries
- 64 unit tests covering diff parsing, normalization, and detection logic

Validated against:
- PR #33908: correctly detects REVERT of IMauiRecyclerView check from #32278
- PR #35272: correctly classifies as OVERLAP (no line-level revert)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a REVERT is detected, the script now:
1. Extracts test files from the fix PR's diff
2. Classifies them via Detect-TestsInDiff.ps1 (type, filter, project, runner)
3. Stores regression_tests metadata in risks.json
4. Lists required tests in content.md

Review-PR.ps1 adds STEP 1.5 (after Gate builds the code):
- Reads regression_tests from risks.json
- Runs unit/XAML tests via dotnet test with the detected filters
- Skips UI/device tests (need CI infrastructure) with clear reporting
- Appends pass/fail results to content.md
- Writes test-results.json for downstream consumption

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 1.5 now runs every test type from reverted fix PRs:
- UI tests via BuildAndRunHostApp.ps1 (builds app, deploys, runs Appium)
- Device tests via Run-DeviceTests.ps1 (xharness on device/simulator)
- Unit/XAML tests via dotnet test --filter

Each test runs on the same platform as the Gate step. If a runner
script is missing, the test is skipped gracefully rather than failing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract and run fix-PR tests for both REVERT and OVERLAP entries.
A nearby edit can break a fix through side effects even without
removing the exact lines. Running tests for all risk levels gives
maximum confidence that prior fixes aren't regressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When regression tests are detected (from REVERT or OVERLAP fix PRs),
inject them as MANDATORY additional tests into the STEP 2 try-fix prompt.
Each try-fix candidate must run these tests after its own test passes —
a candidate that breaks a prior fix is marked Fail.

The Report phase (Phase 3) also ranks candidates that failed regression
tests lower than those that passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 0 → 1 (Branch Setup)
STEP 0.5 → 2 (UI Test Categories)
STEP 0.6 → 3 (Regression Cross-Reference)
STEP 1 → 4 (Gate)
STEP 1.5 → 5 (Regression Test Verification)
STEP 2 → 6 (PR Review)
STEP 3 → 7 (Post AI Summary)
STEP 4 → 8 (Apply Labels)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI added 21 commits May 22, 2026 12:51
1. Dedup check: Before any setup, query AzDO for running/queued
   builds with the same PR+Platform. If a duplicate exists, cancel
   this run (oldest-wins policy). Prevents 2-5x compute waste.

2. Fail-fast merge conflicts: Test merge feasibility right after
   checkout, BEFORE SDK/emulator setup (~15-20 min saved per
   conflict). On failure, posts a PR comment with conflicting file
   list (uses hidden marker to update existing comment, not spam).

3. Fix partiallySucceeded noise: Add deepTestsRan variable and
   condition the drop-deep-uitests download on it. When deep tests
   are skipped (no detected categories), the download is skipped
   too, avoiding the 'artifact not found' warning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Findings JSON unwrapping: The expert-reviewer agent writes
   {"findings": [...]} (object wrapper) instead of bare [...].
   ConvertFrom-Json returns 1 object with no .path property,
   causing all findings to be dropped as 'suspicious path: empty'.
   Now detects and unwraps both bare arrays and object wrappers.

2. Merge conflict comment: Use GH_COMMENT_TOKEN (GitHub PAT) instead
   of System.AccessToken (AzDO PAT) for posting PR comments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the pre-check merge feasibility step and the dedup step
from ci-copilot.yml. Keep only the partial-success fix (deepTestsRan
condition on drop-deep-uitests download).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
- Fix TRX overwrite bug: merge retry results into original TRX instead of
  replacing it, preserving all first-run passing tests. Delete retry TRX
  to prevent double-counting by aggregators.
- Fix TRX contamination across categories: use specific TRX path from
  Invoke-UITestWithRetry instead of recursive 30-min time-based scan.
- Fix category extraction for Stage 2 naming: add controls- prefixed
  variants to Get-CategoryFromArtifactName prefix list.
- Stage 2 now uses Invoke-UITestWithRetry wrapper for env-error retry
  and device recovery. Remove retryCountOnTaskFailure (handled in-script).
- Re-emit detectedCategories output variable after AI-tier refresh so
  Stage 2 picks up the refreshed list.
- Stage 3 falls back to DEFERRED mode when aiSummaryCommentId is empty
  but deep test artifacts exist (reviewer crashed).
- Map all VSTest outcomes (Aborted, Timeout, Error, etc.) to Failed in
  Get-TrxResults so failure disclosures match counter totals.
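
The outcome mapping in the last item can be sketched as follows, in Python rather than the PowerShell used by Get-TrxResults. The function names are hypothetical. The rule itself comes from the review summary: anything that is neither Passed nor NotExecuted counts as Failed.

```python
from collections import Counter


def classify_outcome(outcome):
    """Collapse a VSTest outcome string into Passed, Skipped, or Failed."""
    if outcome == "Passed":
        return "Passed"
    if outcome == "NotExecuted":
        return "Skipped"
    # Aborted, Timeout, Error, and any unknown value count as a failure.
    return "Failed"


def summarize(outcomes):
    """Return counters in which failed always equals total - passed - skipped."""
    counts = Counter(classify_outcome(o) for o in outcomes)
    total = len(outcomes)
    passed = counts["Passed"]
    skipped = counts["Skipped"]
    return {
        "total": total,
        "passed": passed,
        "skipped": skipped,
        "failed": total - passed - skipped,
    }


summary = summarize(["Passed", "Aborted", "Timeout", "NotExecuted", "Failed", "Error"])
```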

Should-fix:
- Preserve run-all sentinel as ALL (not NONE) so Stage 2 can distinguish
  run-everything from run-nothing.
- Centralize env-error patterns in shared/Get-EnvErrorPatterns.ps1;
  Invoke-UITestWithRetry dot-sources it as single source of truth.
- Strip parameter signatures from test names before building VSTest
  retry filter to avoid filter grammar operator injection.
- Use case-insensitive deepTestsRan comparison in pipeline condition.
- Track iOS 26.4 availability via pipeline variable for fallback warning.
- Restrict Test-IsBugFixLabel to t/bug and i/regression only; use p/0|p/1
  as secondary signal AND-ed with bug labels to reduce false positives.
- Fix linked-issue regex to accept all GitHub closing keyword forms
  (Fix/Fixed/Close/Closed/Resolve/Resolved).
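
As an illustration of the linked-issue fix, here is a Python sketch of a pattern that accepts the close/fix/resolve keyword family in every tense. The pattern is an assumption, not the regex from the PR. The optional colon and the optional `owner/repo` prefix are extras included for illustration.

```python
import re

# Matches close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved,
# optionally followed by a colon, then an issue reference such as #123
# or owner/repo#123.
LINKED_ISSUE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\b:?\s+(?:[\w.-]+/[\w.-]+)?#(\d+)",
    re.IGNORECASE,
)


def linked_issues(text):
    """Return the issue numbers referenced with a closing keyword."""
    return [int(number) for number in LINKED_ISSUE.findall(text)]


issues = linked_issues("Fixes #123 and Resolved dotnet/maui#45")
```

The leading `\b` keeps words such as "Prefixes" from matching, and the trailing `\b` rejects "fixing".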

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix ALL sentinel producing empty loop in Stage 2: use single-element
  list with empty string and skip -Category param in run-all mode
- Review-PR.ps1 now dot-sources shared/Get-EnvErrorPatterns.ps1 for
  infraSignals instead of hard-coded inline list
- Invoke-UITestWithRetry.ps1 fails loudly if shared patterns file is
  missing instead of silently using a drifted inline fallback
- CopilotLogs download gets continueOnError so DEFERRED fallback works
  when ReviewPR crashed before publishing the artifact
- Stage 2 uses splatted hashtable for retry wrapper params, conditionally
  omitting -Category in run-all mode
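
The sentinel handling described above can be sketched like this. It is a Python illustration of the PowerShell logic, and it assumes the detected categories arrive as a comma-separated string, a detail the commit does not state. The function and key names are hypothetical.

```python
def categories_to_run(detected):
    """Translate the detectedCategories value into loop items.

    NONE or empty -> no iterations.
    ALL           -> exactly one iteration with an empty category.
    otherwise     -> one iteration per category.
    """
    value = detected.strip()
    if value == "" or value.upper() == "NONE":
        return []
    if value.upper() == "ALL":
        return [""]  # single element, so the loop body still runs once
    return [c.strip() for c in value.split(",") if c.strip()]


def wrapper_args(category, platform):
    """Build the argument table, omitting Category in run-all mode."""
    args = {"Platform": platform}
    if category:
        args["Category"] = category
    return args


runs = [wrapper_args(c, "android") for c in categories_to_run("ALL")]
```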

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add ios_ui_tests-controls, catalyst_ui_tests-controls,
  windows_ui_tests-controls to Get-CategoryFromArtifactName prefix list
  so iOS/Catalyst/Windows Stage 2 artifacts extract correctly
- TRX merge counter math now uses same outcome classification as
  Get-TrxResults: Aborted/Timeout/Error counted as Failed, not omitted
- Only replace originally-failed entries during TRX merge (guard against
  substring-matching retry filter pulling in unrelated passing tests)
- Sync inline Get-TrxResults in Review-PR.ps1 with shared copy
  (default outcome → Failed)
- Remove dead IOS26Available emit step (no downstream consumer)
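
The merge guard can be illustrated with a small Python sketch that keys results by test name, as the review notes the merge does. The data and function name are hypothetical, and the real merge operates on TRX XML rather than dictionaries.

```python
def merge_retry(original, retry, failed_names):
    """Merge retry outcomes into first-run outcomes.

    Only tests that failed in the first run are replaced, so a retry
    filter that also selected unrelated tests cannot flip them.
    """
    merged = dict(original)
    for name, outcome in retry.items():
        if name in failed_names:
            merged[name] = outcome
    return merged


first_run = {"Issue123": "Failed", "Issue1234": "Passed"}
# A substring filter for "Issue123" also selected "Issue1234" on retry.
retry_run = {"Issue123": "Passed", "Issue1234": "Failed"}
merged = merge_retry(first_run, retry_run, failed_names={"Issue123"})
```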

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BuildAndRunHostApp.ps1 required either -TestFilter or -Category as
mandatory parameters. In ALL mode (run all tests without filter),
Stage 2 invokes the script with neither. Fix:
- Make -TestFilter optional (Mandatory=false) in TestFilter param set
- Handle null effectiveFilter: omit --filter arg from dotnet test
- Add 'ALL-<platform>' TRX base name fallback when neither param set
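
The no-filter path might look like the following Python sketch, which builds the `dotnet test` argument list and leaves out `--filter` when no effective filter exists. The function name and project path are hypothetical. The real logic lives in BuildAndRunHostApp.ps1 and passes further arguments not shown here.

```python
def dotnet_test_args(project, effective_filter=None):
    """Return the command line as a list, adding --filter only when set."""
    args = ["dotnet", "test", project]
    if effective_filter:
        # None or an empty string means ALL mode: run every test.
        args += ["--filter", effective_filter]
    return args


all_mode = dotnet_test_args("UITests.csproj")
filtered = dotnet_test_args("UITests.csproj", "TestCategory=CollectionView")
```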

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Windows:
- Set screen resolution to 1920x1080 (AzDO agents default to 1024x768)
  matching main CI ui-tests-steps.yml behavior

Catalyst (MacCatalyst):
- Disable/re-enable Notification Center (intercepts UI interactions)
- Skip Xcode provisioning (catalyst doesn't need it, matches main CI)
- Pass openSslArgs: '' to avoid legacy openssl for Catalyst builds

Both iOS and Catalyst:
- Disable macOS text autocorrect/autocapitalize/spellcheck (interferes
  with Appium text entry tests), matching main CI setup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Guard against null $effectiveFilter in the success summary banner and
info log. In ALL mode neither -Category nor -TestFilter is passed, so
effectiveFilter is null. The .Substring() call on null throws under
ErrorActionPreference=Stop, breaking ALL-mode test runs even when all
tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Get-EnvErrorPatterns returns regex patterns (with .* and \s*).
Wrapping them in [regex]::Escape() converts metacharacters to
literals, so patterns like 'error ADB0010.*InstallFailedException'
never match. Use -match directly, matching how
Invoke-UITestWithRetry uses them.
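
The same pitfall can be reproduced in Python, where `re.escape` plays the role of `[regex]::Escape()`. The pattern is the one quoted above. The log line is a made-up example.

```python
import re

PATTERN = r"error ADB0010.*InstallFailedException"
# Hypothetical log line for illustration only.
LOG_LINE = "error ADB0010: Mono.AndroidTools.InstallFailedException: Failure"


def matches_escaped(pattern, text):
    """Buggy variant: escaping turns '.*' into the literal characters '.' and '*'."""
    return re.search(re.escape(pattern), text) is not None


def matches_raw(pattern, text):
    """Fixed variant: the pattern is used as a regular expression."""
    return re.search(pattern, text) is not None


escaped_hit = matches_escaped(PATTERN, LOG_LINE)
raw_hit = matches_raw(PATTERN, LOG_LINE)
```

Escaping is only appropriate when the shared list holds literal strings. Since it holds regex patterns, the raw match is the correct one.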

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The bash artifact-copy step used `cp -r CustomAgentLogsTmp` which
silently fails on Windows agents (Git Bash path mismatch). This caused
the CopilotLogs artifact to be missing PRAgent content files, so
Stage 3 had nothing to post as the AI Summary comment.

Move artifact copying to a separate pwsh step that works on all
platforms (Linux, macOS, Windows). The bash step retains only the
copilot session-state copy which uses $HOME.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Copilot agent writes inline findings with 'file' as the key, but
post-inline-review.ps1 only read 'path'. All findings were rejected
as 'suspicious path: empty' and no inline comments were ever posted.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stage 2 was skipped whenever the gate hit environment errors (3x retries
exhausted → CopilotFailed=true → Check Copilot Result exits 1 → ReviewPR
result=Failed). But detectedCategories was already emitted successfully
in STEP 2, so the deep tests COULD run.

Add 'Failed' to the allowed ReviewPR results in Stage 2's condition.
Deep UI tests are valuable independent of the gate outcome — they verify
the PR's changes don't break existing test suites.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The conditional download using deepTestsRan variable was unreliable —
AzDO's $[ in() ] expression for cross-stage result evaluation can
return unexpected values depending on stage result propagation timing.
This caused Stage 3 to skip the artifact download even when Stage 2
produced valid TRX results, resulting in AI Summary comments showing
'SKIPPED' for UI tests that actually ran and passed.

Fix: always attempt the download with continueOnError (handles the
case where RunDeepUITests was skipped). Remove the unused deepTestsRan
variable. Add diagnostic logging to Stage 3 for debugging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Copilot agent produces findings JSON in multiple formats:
bare array, {findings:[...]}, {schemaVersion:1, findings:[...]},
single object, etc. The previous parser used $parsed.findings
which is falsy in PowerShell when the array is empty or when
property access returns unexpected types. This caused the wrapper
object to be treated as a single finding with no path/file.

Use PSObject.Properties.Match() for explicit property detection
instead of truthy evaluation. Add diagnostic logging so future
format issues are immediately visible in pipeline logs.
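
A Python sketch of the format handling described above, using explicit key checks instead of truthiness. It also folds in the `file`/`path` key fallback from an earlier commit in this PR. The function name is hypothetical, and post-inline-review.ps1 may handle more cases than shown.

```python
import json


def extract_findings(raw_json):
    """Return a list of findings, each with a non-empty 'path' key."""
    parsed = json.loads(raw_json)
    if isinstance(parsed, list):
        findings = parsed                      # bare array
    elif isinstance(parsed, dict) and "findings" in parsed:
        findings = parsed["findings"] or []    # wrapper, even if the array is empty
    elif isinstance(parsed, dict):
        findings = [parsed]                    # single finding object
    else:
        findings = []

    normalized = []
    for finding in findings:
        if not isinstance(finding, dict):
            continue
        path = finding.get("path") or finding.get("file")
        if path:
            normalized.append({**finding, "path": path})
    return normalized


wrapped = extract_findings('{"schemaVersion": 1, "findings": [{"file": "b.cs", "line": 3}]}')
```

Checking `"findings" in parsed` rather than the value's truthiness means an empty findings array yields no findings instead of being mistaken for a single finding.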

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e ref

Review-PR.ps1 STEP 1 checks out origin/main and squash-merges the PR,
replacing all files including .github/scripts/. Script fixes on the
pipeline ref (feature/regression-check) were overwritten with main's
versions, so post-inline-review.ps1 file/path fix and other changes
never took effect.

Fix: YAML saves .github/scripts to a backup dir before invoking
Review-PR.ps1. After STEP 1 completes the branch switch, the script
restores the pipeline-ref versions via SCRIPTS_BACKUP env variable.
This ensures STEP 7 (inline posting) and all other script steps use
the pipeline ref's code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review-PR.ps1 was checking out origin/main in CI mode, overwriting
all pipeline-ref scripts with main's versions. This is why inline
comment fixes, env-error patterns, and other script changes never
took effect — they were replaced by main's code.

Fix: in CI mode, stay on the current branch (the pipeline ref, e.g.
feature/regression-check). The PR is squash-merged onto it. This
preserves all script fixes while still testing the PR's changes.

Remove the backup/restore workaround — no longer needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kubaflo force-pushed the feature/regression-check branch from e13ca0e to 327a75c May 22, 2026 10:51
@rmarinho disabled auto-merge May 22, 2026 15:17
@rmarinho merged commit 7c76748 into main May 22, 2026
40 checks passed
@rmarinho deleted the feature/regression-check branch May 22, 2026 15:18
@github-actions bot added this to the .NET 10.0 SR8 milestone May 22, 2026
devanathan-vaithiyanathan pushed a commit to devanathan-vaithiyanathan/maui that referenced this pull request Jun 1, 2026
## Description

Extends the `maui-copilot` DevDiv pipeline (pipeline 27723) with a
3-stage architecture that runs real UI tests on platform-pool agents and
reports results directly in the AI summary PR comment.

### Pipeline Workflow

```
┌─────────────────────────────────────────────────────────┐
│  Stage 1: ReviewPR                                      │
│                                                         │
│  STEP 1: Branch Setup (checkout + cherry-pick PR)       │
│  STEP 2: Detect UI Test Categories                      │
│  STEP 3: Run Detected UI Tests (in-process, fast)       │
│  STEP 4: Regression Cross-Reference                     │
│  STEP 5: Gate — verify tests fail/pass before/after fix │
│  STEP 6: Code Review — deep analysis via Copilot agent  │
│                                                         │
│  Outputs → CopilotLogs artifact + detectedCategories    │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│  Stage 2: RunDeepUITests (platform-pool agent)          │
│                                                         │
│  iOS: AcesShared Tahoe + iOS 26.4                       │
│  Android: ubuntu-22.04 + KVM + AVD                      │
│                                                         │
│  Runs BuildAndRunHostApp.ps1 per detected category      │
│  Outputs → drop-deep-uitests artifact (TRX + diffs)     │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│  Stage 3: PostResults                                   │
│                                                         │
│  1. Download CopilotLogs (review content files)         │
│  2. Download drop-deep-uitests (TRX results)            │
│  3. Merge deep results into uitests/content.md          │
│  4. Post full AI Summary comment on PR                  │
│  5. Apply labels (s/agent-reviewed, etc.)               │
│                                                         │
│  One comment with everything — no patching needed       │
└─────────────────────────────────────────────────────────┘
```

### What's New

**Deep UI Test Execution (Stage 2)**
- Runs detected UI test categories on proper platform-pool agents (not
in-process on Linux)
- **iOS**: AcesShared Tahoe agents with iOS 26.4 simulator, iPhone 11
Pro (matching `ios-26` baselines from PR dotnet#35061)
- **Android**: ubuntu-22.04 with KVM, AVD boot with `-partition-size
2048`, `ignoreHiddenApiPolicyError` capability
- TRX results + snapshot-diff PNGs published as `drop-deep-uitests`
artifact

**Unified Comment Posting (Stage 3)**
- Comment posting and label application deferred to Stage 3 (after deep
tests complete)
- Single AI summary comment includes ALL results: code review + deep
test results
- Nested collapsible `<details>` for failed tests with full error +
stack trace
- Dynamic section title: `🧪 UI Tests — CollectionView, TabbedPage`
- Artifact download link for snapshot-diff PNGs

**Android Emulator Improvements**
- AVD boot step with proper partition size, ADB key pre-authorization,
boot wait
- `DEVICE_UDID` pass-through prevents double emulator boot
- Disk cleanup on hosted ubuntu agents (frees ~22GB)
- KVM enablement + `appium:ignoreHiddenApiPolicyError` for API 30

**iOS Simulator Improvements**
- Tahoe pool demand ensures macOS 26.x agents
- Explicit iOS 26.4 download via latest Xcode
- Auto-creates iPhone 11 Pro for baseline resolution match

### Validation

Tested across 30+ pipeline iterations on 6 PRs:

| PR | iOS | Android |
|---|---|---|
| 35358 (ViewBaseTests) | **112/112 ALL PASS** ✅ | **118/119 PASS** ✅ |
| 35359 (TabbedPage) | 44/50 (1 real failure) | 74/75 (1 real failure) |
| 35356 (CollectionView) | **415/417 PASS** ✅ | 593/619 (26 real failures) |

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Labels

area-ai-agents Copilot CLI agents, agent skills, AI-assisted development