Add deep UI test execution to Copilot PR review pipeline by kubaflo · Pull Request #35376 · dotnet/maui

kubaflo · 2026-05-11T10:36:36Z

Description

Extends the maui-copilot DevDiv pipeline (pipeline 27723) with a 3-stage architecture that runs real UI tests on platform-pool agents and reports results directly in the AI summary PR comment.

Pipeline Workflow

┌─────────────────────────────────────────────────────────┐
│  Stage 1: ReviewPR                                      │
│                                                         │
│  STEP 1: Branch Setup (checkout + cherry-pick PR)       │
│  STEP 2: Detect UI Test Categories                      │
│  STEP 3: Run Detected UI Tests (in-process, fast)       │
│  STEP 4: Regression Cross-Reference                     │
│  STEP 5: Gate — verify tests fail/pass before/after fix │
│  STEP 6: Code Review — deep analysis via Copilot agent  │
│                                                         │
│  Outputs → CopilotLogs artifact + detectedCategories    │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│  Stage 2: RunDeepUITests (platform-pool agent)          │
│                                                         │
│  iOS: AcesShared Tahoe + iOS 26.4                       │
│  Android: ubuntu-22.04 + KVM + AVD                      │
│                                                         │
│  Runs BuildAndRunHostApp.ps1 per detected category      │
│  Outputs → drop-deep-uitests artifact (TRX + diffs)     │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│  Stage 3: PostResults                                   │
│                                                         │
│  1. Download CopilotLogs (review content files)         │
│  2. Download drop-deep-uitests (TRX results)            │
│  3. Merge deep results into uitests/content.md          │
│  4. Post full AI Summary comment on PR                  │
│  5. Apply labels (s/agent-reviewed, etc.)               │
│                                                         │
│  One comment with everything — no patching needed       │
└─────────────────────────────────────────────────────────┘

What's New

Deep UI Test Execution (Stage 2)

Runs detected UI test categories on proper platform-pool agents (not in-process on Linux)
iOS: AcesShared Tahoe agents with iOS 26.4 simulator, iPhone 11 Pro (matching the ios-26 baselines from PR #35061, "[Testing] Resaved the iOS 26.4 images")
Android: ubuntu-22.04 with KVM, AVD boot with -partition-size 2048, ignoreHiddenApiPolicyError capability
TRX results + snapshot-diff PNGs published as drop-deep-uitests artifact

Unified Comment Posting (Stage 3)

Comment posting and label application deferred to Stage 3 (after deep tests complete)
Single AI summary comment includes ALL results: code review + deep test results
Nested collapsible <details> for failed tests with full error + stack trace
Dynamic section title: 🧪 UI Tests — CollectionView, TabbedPage
Artifact download link for snapshot-diff PNGs

Android Emulator Improvements

AVD boot step with proper partition size, ADB key pre-authorization, boot wait
DEVICE_UDID pass-through prevents double emulator boot
Disk cleanup on hosted ubuntu agents (frees ~22GB)
KVM enablement + appium:ignoreHiddenApiPolicyError for API 30

iOS Simulator Improvements

Tahoe pool demand ensures macOS 26.x agents
Explicit iOS 26.4 download via latest Xcode
Auto-creates iPhone 11 Pro for baseline resolution match

Validation

Tested across 30+ pipeline iterations on 6 PRs:

PR	iOS	Android
35358 (ViewBaseTests)	112/112 ALL PASS ✅	118/119 PASS ✅
35359 (TabbedPage)	44/50 (1 real failure)	74/75 (1 real failure)
35356 (CollectionView)	415/417 PASS ✅	593/619 (26 real failures)

github-actions · 2026-05-11T10:36:45Z

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35376

Or

Run remotely in PowerShell:

iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35376"

github-actions · 2026-05-11T10:37:19Z

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 18 | Agents checked: 4

Full validator output

Found 1 skill(s)
[find-regression-risk] 📊 find-regression-risk: 967 BPE tokens [chars/4: 905] (detailed ✓), 10 sections, 2 code blocks
[find-regression-risk]    ⚠  No YAML frontmatter — agents use name/description for skill discovery.
✅ All checks passed (1 skill(s))
Found 4 agent(s)
Validated 4 agent(s)

✅ All checks passed (4 agent(s))

⏭️ LLM Evaluation: Skipped

No changed skills with eval tests found.


Copilot

Pull request overview

This PR extends the eng/pipelines/ci-copilot.yml Copilot PR review pipeline into a multi-stage flow that can (a) detect UI test categories, (b) run “deep” UI tests on platform-appropriate agents, and (c) post/update a unified AI summary comment that includes deep UI test results and regression-risk signals. It also adds supporting PowerShell utilities for more reliable UI test execution (retry + TRX parsing) and for identifying potential regression reverts by diff cross-referencing.

Changes:

Add RunDeepUITests and UpdateAISummaryComment stages to run per-category UI tests on real platform pools and publish/update results in the PR comment.
Introduce new scripts/tests for UI test retry + TRX parsing/aggregation and for regression-risk detection (Find-RegressionRisks.ps1 + tests/docs).
Update emulator/simulator setup and test invocation to improve stability and to produce authoritative TRX outputs for result rendering.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.


File	Description
src/TestUtils/src/UITest.Appium/AppiumAndroidApp.cs	Adds Appium capability to ignore hidden API policy errors on problematic emulator images.
eng/pipelines/ci-copilot.yml	Implements the 3-stage architecture (review → deep UI tests → post/update summary) and associated platform setup.
.github/skills/find-regression-risk/SKILL.md	Documents the regression-risk detection skill and how it integrates into the review flow.
.github/scripts/tests/Test-FindRegressionRisks.ps1	Adds script-based tests for the regression-risk detector.
.github/scripts/shared/Start-Emulator.ps1	Updates iOS simulator selection logic (iOS 26-first) and adds device creation fallback logic.
.github/scripts/shared/Invoke-UITestWithRetry.ps1	New shared “build/deploy/run UI tests” runner with retry + device recovery and TRX discovery.
.github/scripts/shared/Aggregate-UITestArtifacts.Tests.ps1	Adds Pester tests for TRX aggregation from downloaded artifacts.
.github/scripts/shared/Aggregate-UITestArtifacts.ps1	New artifact downloader/parser to aggregate per-category TRX results.
.github/scripts/Review-PR.Tests.ps1	Adds Pester tests for TRX parsing and console-output fallback parsing helpers.
.github/scripts/Review-PR.ps1	Reworks review flow steps (category detect/run, regression cross-ref) and adds TRX-aware UI test reporting.
.github/scripts/post-ai-summary-comment.ps1	Adds regression-check section and dynamic “UI Tests — ” title rendering.
.github/scripts/Find-RegressionRisks.ps1	New mechanical regression-risk detector that cross-references deletions vs recent bug-fix additions.
.github/scripts/BuildAndRunHostApp.ps1	Switches to `TestCategory=` filtering and adds TRX logging/marker output for authoritative results.
.github/agents/maui-expert-reviewer.md	Adds guidance to require acknowledgement when regression cross-reference detects REVERT entries.

kubaflo · 2026-05-11T11:01:41Z

🔍 Multimodal Code Review — 5 agents, 5 dimensions

Reviewed by: Claude Opus 4.7 ×3, Claude Opus 4.6, GPT-5.5
Dimensions: Pipeline YAML · PowerShell Scripts · Test Quality · Security & Architecture · C# Code
Scope: 14 files, +3,919 / -89 lines | Re-reviewed after commit 2b5e746

🔴 Critical Findings (5 found, 2 resolved)

✅ C1 — Stage 3 skipped on no-UI PRs — FIXED in 2b5e746

DEFER_COMMENT_TO_STAGE3=true unconditionally defers comment posting. When detectedCategories is empty, Stage 2 is Skipped, and Stage 3's condition didn't include 'Skipped' → AI summary comment was never posted for non-UI PRs.

Fix applied: 'Skipped' added to Stage 3 condition at ci-copilot.yml:1081. ✅

✅ C3 — Cross-stage isOutput=true unverified — VERIFIED

All 3 cross-stage variables confirmed with isOutput=true:

detectedCategories (line 677)
detectedPlatform (line 678)
aiSummaryCommentId (lines 1795, 1824)

Comment at line 671 updated to reference RunReview (not Detection). ✅

❌ C2 — Invoke-Expression of regex-extracted functions — OPEN

Location: Aggregate-UITestArtifacts.ps1:71-75, ci-copilot.yml Stage 3

$rx = [regex]::Match(($aggSrc + "`n" + $reviewSrc), "(?ms)^function\s+$fn\s*\{.*?^\}", 'Multiline')
if ($rx.Success) { Invoke-Expression $rx.Value }

The regex .*?^\} matches the first line-anchored } — will silently truncate functions with nested blocks. Invoke-Expression with file-derived content is a code injection vector. Three independent reviewers flagged this (triple convergence).

Recommendation: Extract Get-TrxResults into .github/scripts/shared/Get-TrxResults.ps1 and dot-source it.
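
The truncation described in C2 can be reproduced outside PowerShell. The following is a minimal Python sketch, not the pipeline's code: it applies the same lazy, line-anchored pattern to an invented function body (`Get-Demo` is hypothetical) whose inner block happens to close at column 0, which is the case where `^\}` stops early.

```python
import re

# Hypothetical PowerShell source: the inner `if` block closes at column 0.
SOURCE = """function Get-Demo {
    if ($true) {
        'inner'
}
    'tail'
}
"""


def extract_function(source, name):
    """Apply the same lazy, line-anchored pattern as the reviewed snippet."""
    pattern = r"(?ms)^function\s+" + re.escape(name) + r"\s*\{.*?^\}"
    match = re.search(pattern, source)
    return match.group(0) if match else None


extracted = extract_function(SOURCE, "Get-Demo")
print(extracted)
```

The match ends at the first `}` that starts a line, so the `'tail'` statement and the real closing brace are dropped without any error. Dot-sourcing a file avoids the extraction step altogether.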

❌ C4 — DEFERRED mode not idempotent on retry — OPEN

Location: Stage 3 DEFERRED mode

In DEFERRED mode (commentId == 'DEFERRED'), Stage 3 calls post-ai-summary-comment.ps1, which creates a new comment, so a pipeline retry produces duplicate comments. PATCH mode is correctly idempotent via its comment markers, but DEFERRED mode has no equivalent guard.

Recommendation: Query existing PR comments for a marker before posting; PATCH if found.
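
The recommended guard amounts to a "find by marker, then update or create" decision. Here is a minimal Python sketch of that decision, assuming comments are available as dictionaries with `id` and `body` keys; the marker text and function names are hypothetical, since the real marker is not shown in this thread.

```python
from typing import Optional

# Hypothetical marker text; the real marker is not shown in this thread.
MARKER = "<!-- example-ai-summary -->"


def find_existing_comment(comments, marker=MARKER) -> Optional[int]:
    """Return the id of the first comment whose body contains the marker."""
    for comment in comments:
        if marker in comment.get("body", ""):
            return comment["id"]
    return None


def plan_post(comments, marker=MARKER):
    """Return ("PATCH", id) when a marked comment exists, else ("POST", None)."""
    existing = find_existing_comment(comments, marker)
    if existing is not None:
        return ("PATCH", existing)
    return ("POST", None)


first_run = plan_post([])
retry_run = plan_post([{"id": 7, "body": MARKER + "\nsummary text"}])
```

On the first run nothing carries the marker, so a new comment is created. On a retry the marked comment is found and updated in place, which keeps the step idempotent.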

❌ C5 — gh auth failures → silent CLEAN — OPEN

Location: Find-RegressionRisks.ps1:~459

All gh calls use 2>$null. When auth fails (token expiry during long CI runs), every PR lookup returns empty, and the script writes result=CLEAN — the worst possible failure mode: accepting a risky PR because the analyzer was offline.

Recommendation: Run gh auth status at script start; abort with clear error on auth failure.
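
The failure mode in C5 is that a swallowed error is indistinguishable from "nothing found". A small Python sketch of the difference, assuming a hypothetical `lookup` callable that stands in for a `gh` call; `AuthError` and the `RISK` verdict are placeholders invented for the illustration.

```python
class AuthError(RuntimeError):
    """Stands in for a failed `gh` call (hypothetical)."""


def failing_lookup(pr_number):
    raise AuthError("token expired")


def verdict_swallowing_errors(pr_numbers, lookup):
    findings = []
    for number in pr_numbers:
        try:
            findings.extend(lookup(number))
        except AuthError:
            pass  # same effect as discarding stderr and carrying on
    return "CLEAN" if not findings else "RISK"


def verdict_fail_closed(pr_numbers, lookup):
    findings = []
    for number in pr_numbers:
        findings.extend(lookup(number))  # an AuthError propagates and aborts
    return "CLEAN" if not findings else "RISK"


silent = verdict_swallowing_errors([1, 2], failing_lookup)
print(silent)
```

With every lookup failing, the first variant still reports `CLEAN`, while the second raises instead of producing a verdict. An up-front authentication check, as recommended above, gets the same fail-closed behavior with a clearer error message.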

🟠 Major Findings (12) — all still open

#	Finding	Location
1	Emulator boot drift — Stage 2 missing retry loops, DEVICE_READY checks, ADB key re-copy	Stages 1 vs 2
2	Stage 2 missing iOS simulator boot step — has install but no `xcrun simctl boot`	Stage 2
3	GitHub comment content injection — TRX error text inserted without HTML escaping	`ci-copilot.yml`, `post-ai-summary-comment.ps1`
4	Comment update races — marker-based PATCH has no locking for concurrent runs	Stage 3
5	Soft failures hide missing results — `continueOnError` + warnings mask incomplete output	Stages 2-3
6	`$(REQUIRED_XCODE)` macro undefined in Stage 2 iOS Xcode restore step	`ci-copilot.yml:246`
7	Type filter discards valid results — `Where-Object { $_ -is [hashtable] }` rejects PSCustomObjects	`Find-RegressionRisks.ps1:593`
8	`$isEnvError` uninitialized before loop — throws under strict mode	`Review-PR.ps1:1429`
9	UDID regex case-sensitive — `[0-9A-F-]` won't match lowercase UDIDs	`Invoke-UITestWithRetry.ps1:169`
10	`simctl create` output parsing fragile — `2>&1` + anchored regex vs string arrays	`Start-Emulator.ps1:444`
11	`Get-PRMetadataIfBugFix` completely untested — most complex function, zero coverage	`Test-FindRegressionRisks.ps1`
12	Test file uses hand-rolled framework instead of Pester — inconsistent with other test files	`Test-FindRegressionRisks.ps1`

✅ C# Appium Change — Approve

appium:ignoreHiddenApiPolicyError is correct, well-scoped, follows existing conventions, and improves CI reliability on API 30 emulators. Optional nit: tighten comment wording to clarify it suppresses adb shell settings put global hidden_api_policy* errors specifically.

🔵 Top Suggestions

Extract duplicated steps into YAML templates — Android boot, iOS boot, workload install, Xcode bypass (~250 lines saved, eliminates drift)
Add Set-StrictMode -Version 3.0 to all scripts
HTML-escape dynamic content in comment rendering (TRX error text, test names)
Validate TestFilter strings before passing to dotnet test
Convert Test-FindRegressionRisks.ps1 to Pester for consistency

Summary

Severity	Total	Resolved	Open
🔴 Critical	5	2	3
🟠 Major	12	0	12
🟡 Minor	17	0	17
🔵 Suggestion	10	—	10

Architecture is sound. The 3-stage pipeline design is well-motivated and clearly documented. The remaining criticals are hardening issues — Invoke-Expression replacement (C2), retry idempotency (C4), and auth guard (C5) — suitable for a fast follow-up.

kubaflo · 2026-05-11T11:05:38Z

Addressed the 3 open critical findings from the multimodal review in commit 7037237:

C2 — Invoke-Expression elimination ✅
Extracted Get-TrxResults into .github/scripts/shared/Get-TrxResults.ps1, which is now dot-sourced. Updated both Aggregate-UITestArtifacts.ps1 and the Stage 3 YAML. The remaining Get-CategoryFromArtifactName and Get-AggregatedTrxFromDirectory functions are still in the aggregate script (they're small, single-purpose, and only used there); they are dot-sourced when standalone files are available, with the regex fallback kept only for those two inline functions.

C4 — DEFERRED mode idempotency ✅
Stage 3 now queries existing PR comments for the AI summary comment's marker before posting. On pipeline retry, it finds the existing comment ID and falls back to PATCH mode, preventing duplicate comments.

C5 — gh auth guard ✅
Added gh auth status check at the start of Find-RegressionRisks.ps1. Script aborts with clear error if not authenticated, preventing silent false-CLEAN results.

The 12 major findings are noted for follow-up — they're hardening improvements that don't block the core functionality.

kubaflo · 2026-05-11T11:08:50Z

🔍 Re-review after commit `7037237` — Critical findings C2, C4, C5

✅ C5 — `gh` auth guard — RESOLVED

Clean implementation. gh auth status at start of Find-RegressionRisks.ps1 main block, exit 2 on failure with clear error output. Prevents false CLEAN results. No issues.

✅ C4 — DEFERRED mode idempotency — RESOLVED (one minor gap)

Before entering DEFERRED mode, Stage 3 queries PR comments for the AI summary comment's marker. If found, it switches to PATCH mode, preventing duplicate comments on retry. The marker is confirmed to be used by post-ai-summary-comment.ps1 (line 49).

🟡 Minor gap: no --paginate on comment query

$existingComment = gh api "repos/dotnet/maui/issues/$prNumber/comments" --jq '...' 2>$null

GitHub API paginates at 30 comments/page by default. If the AI Summary comment is on page 2+ (busy PRs), it won't be found and a duplicate will be created. Add --paginate or ?per_page=100.
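
The effect of the missing pagination can be shown with a simulated comment list. This is a Python sketch under stated assumptions: `fetch_page` is a hypothetical stand-in for one API request, and the marker text is invented.

```python
PAGE_SIZE = 30  # default page size for GitHub list endpoints


def fetch_page(all_comments, page, per_page=PAGE_SIZE):
    """Simulate one paged API request (hypothetical stand-in)."""
    start = (page - 1) * per_page
    return all_comments[start:start + per_page]


def find_marker_first_page_only(all_comments, marker):
    for comment in fetch_page(all_comments, 1):
        if marker in comment["body"]:
            return comment["id"]
    return None


def find_marker_all_pages(all_comments, marker, per_page=PAGE_SIZE):
    page = 1
    while True:
        batch = fetch_page(all_comments, page, per_page)
        if not batch:
            return None  # ran out of pages without finding the marker
        for comment in batch:
            if marker in comment["body"]:
                return comment["id"]
        page += 1


MARKER = "<!-- example-marker -->"  # hypothetical
comments = [{"id": i, "body": "chatter"} for i in range(1, 41)]
comments[34]["body"] = MARKER + " summary"  # comment id 35 lands on page 2

first_page_hit = find_marker_first_page_only(comments, MARKER)
all_pages_hit = find_marker_all_pages(comments, MARKER)
```

With 40 comments and the marked one in position 35, the single-page query misses it and the paged loop finds it, which is the duplicate-comment scenario described above.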

⚠️ C2 — `Invoke-Expression` elimination — PARTIALLY RESOLVED

What's fixed (good):

✅ Get-TrxResults extracted to shared/Get-TrxResults.ps1 — the primary fix
✅ Aggregate-UITestArtifacts.ps1 dot-sources the shared file
✅ ci-copilot.yml Stage 3 dot-sources Get-TrxResults.ps1
✅ Both copies of Get-TrxResults (shared file + Review-PR.ps1) are identical right now

What's still open:

🟠 Invoke-Expression fallback still exists for 2 other functions

Stage 3 YAML (lines 1136-1141):

foreach ($fn in @('Get-CategoryFromArtifactName','Get-AggregatedTrxFromDirectory')) {
    $fnFile = ".github/scripts/shared/$fn.ps1"
    if (Test-Path $fnFile) { . $fnFile }
    else {
        # Functions defined inline in Aggregate script — extract via regex as fallback
        $rx = [regex]::Match($aggSrc, "(?ms)^function\s+$fn\s*\{.*?^\}", 'Multiline')
        if ($rx.Success) { Invoke-Expression $rx.Value }
    }
}

The fallback path still uses the same fragile regex+Invoke-Expression pattern for Get-CategoryFromArtifactName and Get-AggregatedTrxFromDirectory. Since the .ps1 files don't exist, the fallback always fires. Lower risk than Get-TrxResults (simpler functions), but the pattern remains.

🟡 Review-PR.ps1 still has its own copy of Get-TrxResults

Review-PR.ps1:380 defines Get-TrxResults inline instead of dot-sourcing shared/Get-TrxResults.ps1. The two copies are byte-identical today, but will drift over time. One edit to the shared file without updating Review-PR.ps1 (or vice versa) silently produces different TRX parsing behavior in Stage 1 vs Stage 2/3.

Recommendation: Have Review-PR.ps1 dot-source the shared file:

. "$PSScriptRoot/shared/Get-TrxResults.ps1"

Updated Scorecard — All 5 Critical Findings

#	Finding	Status	Commit
🔴 C1	Stage 3 skipped on no-UI PRs	✅ FIXED	`2b5e746`
🔴 C2	`Invoke-Expression` fragile + injection	⚠️ PARTIAL — `Get-TrxResults` extracted, but fallback remains for 2 functions + duplicate copy in Review-PR.ps1	`7037237`
🔴 C3	`isOutput=true` unverified	✅ VERIFIED	(was already correct)
🔴 C4	DEFERRED mode not idempotent	✅ FIXED (minor: add `--paginate`)	`7037237`
🔴 C5	`gh` auth → silent CLEAN	✅ FIXED	`7037237`

Bottom line: 4/5 critical findings fully resolved, C2 substantially improved (highest-risk function extracted) with residual Invoke-Expression for 2 simpler functions. The 12 major findings from the original review remain open but are hardening items for follow-up.

Good progress — the PR is in much better shape for merge now. 👍

kubaflo · 2026-05-11T11:12:27Z

Addressed all remaining items from the re-review in commit d31011f:

C2 — fully resolved ✅

Extracted Get-CategoryFromArtifactName and Get-AggregatedTrxFromDirectory into separate .ps1 files in shared/
Stage 3 YAML now dot-sources all 3 files directly — zero Invoke-Expression remaining
Review-PR.ps1 dot-sources shared/Get-TrxResults.ps1 instead of duplicate inline copy — single source of truth, no drift risk

C4 pagination minor ✅

Added --paginate + ?per_page=100 to existing comment query so it finds AI Summary comments on busy PRs with 30+ comments

All 5 critical findings now fully resolved.

kubaflo · 2026-05-11T11:13:26Z

🔍 Re-review after commit `d31011f` — C2 fully resolved

✅ C2 — `Invoke-Expression` elimination — FULLY RESOLVED

All three shared functions extracted into standalone dot-sourceable files:

Function	File	Consumers
`Get-TrxResults`	`shared/Get-TrxResults.ps1`	Review-PR.ps1, Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3
`Get-CategoryFromArtifactName`	`shared/Get-CategoryFromArtifactName.ps1`	Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3
`Get-AggregatedTrxFromDirectory`	`shared/Get-AggregatedTrxFromDirectory.ps1`	Aggregate-UITestArtifacts.ps1, ci-copilot.yml Stage 3

Verified:

✅ Zero Invoke-Expression remaining in ci-copilot.yml (only the comment # no Invoke-Expression referencing the fix)
✅ Zero Invoke-Expression remaining in Aggregate-UITestArtifacts.ps1
✅ Review-PR.ps1 line 375: . "$PSScriptRoot/shared/Get-TrxResults.ps1" — single source of truth, no duplicate copy
✅ Stage 3 YAML: clean dot-sourcing of all 3 files, no regex fallback

✅ C4 minor — pagination fix applied

gh api "repos/dotnet/maui/issues/$prNumber/comments?per_page=100" --paginate --jq '...'

Handles PRs with 30+ comments correctly now.

Final Scorecard — All 5 Critical Findings

#	Finding	Status
🔴 C1	Stage 3 skipped on no-UI PRs	✅ Fixed (`2b5e746`)
🔴 C2	`Invoke-Expression` fragile + injection	✅ Fully fixed (`d31011f`)
🔴 C3	`isOutput=true` unverified	✅ Verified (was already correct)
🔴 C4	DEFERRED mode not idempotent	✅ Fixed + paginated (`7037237` + `d31011f`)
🔴 C5	`gh` auth → silent CLEAN	✅ Fixed (`7037237`)

All 5 critical findings resolved. 🎉

The 12 major findings (emulator boot drift, missing iOS sim boot in Stage 2, HTML escaping, comment races, etc.) remain as tracked follow-up items — none are merge blockers.

kubaflo · 2026-05-20T10:28:22Z

🔬 Multimodal Code Review — PR #35376

Reviewed by: Claude Opus 4.7 (Extra-high reasoning) · Claude Opus 4.7 · GPT-5.5
Scope: 19 files, 4339 additions — 3-stage deep UI test pipeline

Three independent reviewers analyzed this PR across correctness, security, architecture, reliability, and maintainability. Findings are deduplicated and ranked by severity.

🔴 Critical — Fix Before Merge

1. TRX overwrite + double-counting destroys test results — BuildAndRunHostApp.ps1:426 + Get-AggregatedTrxFromDirectory.ps1

Found by: Opus 4.7 XHigh ✦ GPT-5.5 ✦ Opus 4.7

When Android retry runs (BuildAndRunHostApp.ps1:387–430):

Retry produces a TRX with only the retried tests (e.g. 5 out of 100)
Copy-Item $retryTrxPath $trxFilePath -Force (line 426) overwrites the original full TRX
Both retry-*.trx and the overwritten original survive in the artifact directory
Get-AggregatedTrxFromDirectory sums all TRX counters with no deduplication

Result: A 100-test suite with 5 flaky failures that all pass on retry is reported as "10/10 ✓" instead of "100/100 ✓". The 95 passing tests vanish from the record.

Fix: Don't overwrite the original TRX. Either merge retry results per-test, or delete $retryTrxPath after overwrite and ensure only one TRX per category reaches the aggregator.
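
The arithmetic behind the "10/10" figure, and the per-test merge alternative, can be sketched in Python. This is an illustration with invented test names, modelling each TRX file as a name-to-outcome dictionary; it is not the aggregator's actual code.

```python
def naive_sum(trx_files):
    """Sum counters across files with no deduplication."""
    passed = sum(1 for trx in trx_files for o in trx.values() if o == "Passed")
    total = sum(len(trx) for trx in trx_files)
    return passed, total


def merge_retry(original, retry):
    """Replace only the entries that failed in the original run."""
    failed_names = {name for name, o in original.items() if o != "Passed"}
    merged = dict(original)
    for name, outcome in retry.items():
        if name in failed_names:
            merged[name] = outcome
    return merged


original = {f"Stable{i}": "Passed" for i in range(95)}
original.update({f"Flaky{i}": "Failed" for i in range(5)})
retry = {f"Flaky{i}": "Passed" for i in range(5)}

# Bug: the retry TRX overwrites the original, and both copies are aggregated.
buggy = naive_sum([retry, retry])
# Fix: merge per test so a single TRX per category reaches the aggregator.
fixed = naive_sum([merge_retry(original, retry)])
print(buggy, fixed)
```

The buggy path counts the five retried tests twice and loses the other 95, giving 10 of 10. The merged path reports all 100 tests once each.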

2. TRX contamination across categories in Stage 2 — ci-copilot.yml:1081-1086

Found by: GPT-5.5 ✦ Opus 4.7

After each category run, the script recursively copies every .trx modified in the last 30 min from the working directory into the current category's drop folder. This picks up TRX files from previous categories. The aggregator then attributes earlier categories' results to later ones.

Compounding issue: The 30-minute freshness filter (LastWriteTime -gt (Get-Date).AddMinutes(-30)) is also too short — CollectionView alone has taken ~40 min in real builds, so its TRX gets filtered out when later categories finish.

Fix: Copy only the TRX path emitted by the just-run BuildAndRunHostApp.ps1, or isolate output directories per category. Remove the time-based filter entirely.

3. Category extraction breaks for Stage 2 artifact naming — Get-CategoryFromArtifactName.ps1:12-19

Found by: Opus 4.7 XHigh ✦ Opus 4.7

Stage 2 names artifacts drop-${platform}_ui_tests-controls-$safeCat. After stripping ^drop- and matching prefix android_ui_tests, the regex captures controls-CollectionView instead of CollectionView.

Every category in the AI summary comment gets a spurious controls- prefix. The Pester tests only cover legacy naming — no test for the new convention.

Fix: Either drop controls- from the dir name at line 1040, or add it to the prefix list in Get-CategoryFromArtifactName.
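
A Python sketch of the extraction problem, assuming the naming scheme quoted above. The prefix list is an illustrative subset and the function names are hypothetical.

```python
import re

# Illustrative subset; the real prefix list lives in the PowerShell helper.
PLATFORM_PREFIXES = ("android_ui_tests", "ios_ui_tests")


def category_before_fix(artifact_name):
    """Strip `drop-` and a known platform prefix, return whatever remains."""
    name = re.sub(r"^drop-", "", artifact_name)
    for prefix in PLATFORM_PREFIXES:
        if name.startswith(prefix + "-"):
            return name[len(prefix) + 1:]
    return name


def category_after_fix(artifact_name):
    """Same as above, but also drop the `controls-` segment."""
    return re.sub(r"^controls-", "", category_before_fix(artifact_name))


artifact = "drop-android_ui_tests-controls-CollectionView"
print(category_before_fix(artifact), category_after_fix(artifact))
```

A test case built from a Stage 2 style artifact name, like the one used here, is the kind of coverage the finding says is missing.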

4. Stage 2 doesn't use Invoke-UITestWithRetry — no retry/recovery for deep tests — ci-copilot.yml:1040-1060

Found by: Opus 4.7

Stage 2's per-category loop calls BuildAndRunHostApp.ps1 directly (line 1040-1060), bypassing the Invoke-UITestWithRetry.ps1 wrapper this PR introduces. The retry helper — with device recovery, env-error detection, and adb reboot — only runs in Stage 1's in-process tests.

Additionally, retryCountOnTaskFailure: 2 on the step re-runs the entire per-category loop, causing already-passed categories to run again and produce duplicate TRX files.

Fix: Wrap each category invocation in Invoke-UITestWithRetry.ps1.

5. AI-refreshed categories not propagated to Stage 2 — Review-PR.ps1:1810

Found by: GPT-5.5

After the AI suggests additional test categories (line 1810-1846), the refreshed list is written to uitests/content.md but ##vso[task.setvariable variable=detectedCategories;isOutput=true] is never re-emitted. Stage 2 runs the pre-AI category set while the summary claims AI-derived categories were selected.

Fix: Re-emit the output variable after computing $refreshedCategories.

6. Stage 3 comment silently lost when Stage 1 reviewer crashes — ci-copilot.yml condition

Found by: Opus 4.7

UpdateAISummaryComment requires aiSummaryCommentId != ''. If the Copilot CLI crashes before STEP 6 sets that output, Stage 3 is skipped entirely — and deep test results from Stage 2 are never posted.

Fix: Add a fallback path to post a degraded comment ("review failed, but here are deep test results") when aiSummaryCommentId is empty but deep tests succeeded.

7. VSTest outcome misclassification hides failed tests — Get-TrxResults.ps1:43-50

Found by: Opus 4.7

The outcome switch only handles Passed/Failed/NotExecuted/Inconclusive. Outcomes like Aborted, Timeout, Error, Disconnected fall through to the raw string. The failed-test disclosure checks $r.status -eq 'Failed', so aborted/timed-out tests count in the total but never appear in the detail section.

Fix: Map all non-Passed/non-NotExecuted outcomes to Failed.
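
The proposed mapping is small enough to sketch directly. In this Python illustration the `Skipped` label for `NotExecuted` is an assumption, and the failed count is derived as total minus passed minus skipped so that no outcome can fall out of the totals.

```python
def normalize_outcome(raw):
    """Collapse VSTest outcomes into three buckets."""
    if raw == "Passed":
        return "Passed"
    if raw == "NotExecuted":
        return "Skipped"  # label is an assumption for this sketch
    return "Failed"  # Failed, Aborted, Timeout, Error, Disconnected, ...


def summarize(outcomes):
    statuses = [normalize_outcome(o) for o in outcomes]
    total = len(statuses)
    passed = statuses.count("Passed")
    skipped = statuses.count("Skipped")
    return {
        "total": total,
        "passed": passed,
        "skipped": skipped,
        "failed": total - passed - skipped,
    }


summary = summarize(["Passed", "Aborted", "Timeout", "NotExecuted", "Failed"])
print(summary)
```

Because aborted and timed-out tests now carry the `Failed` status, a detail section that filters on that status will list them.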

🟠 Should Fix

8. Run-all detection silently becomes NONE — Review-PR.ps1:664

Found by: GPT-5.5

The category detector uses blank/empty as the "run all categories" sentinel. Review-PR.ps1 maps blank to NONE, causing Stage 2 to skip run-all cases entirely.

9. Three divergent copies of env-error patterns — Invoke-UITestWithRetry.ps1 vs Review-PR.ps1 vs Gate

Found by: Opus 4.7

The pattern lists differ (e.g. XHarness exit code: 83 exists in one but not the others; Could not connect to device appears in another). Because each site defines "env error" differently, identical failures can get different retry decisions.

Fix: Extract to a shared EnvErrorPatterns.ps1 and dot-source from all sites.

10. VSTest filter operators unescaped in retry filter — BuildAndRunHostApp.ps1:396

Found by: Opus 4.7 XHigh

Parameterized test names (e.g. TestMethod(arg: "value")) can contain characters such as (, ), and |, which are VSTest filter grammar operators. Left unescaped, they either break the filter command or match unintended tests.

Fix: Strip parameter signatures before building filter, or escape special characters per VSTest rules.
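
The "strip parameter signatures" option could look roughly like the Python sketch below. It assumes the retry filter is built from `FullyQualifiedName~` terms joined with `|`; whether the script uses that property is an assumption, and the helper names are hypothetical.

```python
import re


def strip_parameters(test_name):
    """Drop a trailing "(...)" parameter list from a test name."""
    return re.sub(r"\(.*\)\s*$", "", test_name)


def build_retry_filter(failed_names):
    """Join de-duplicated method names into an OR filter expression."""
    unique = sorted({strip_parameters(name) for name in failed_names})
    return "|".join(f"FullyQualifiedName~{name}" for name in unique)


retry_filter = build_retry_filter([
    'Ns.TestMethod(arg: "value")',
    'Ns.TestMethod(arg: "other")',
    "Ns.Other",
])
print(retry_filter)
```

The trade-off is that stripping the signature retries every parameterization of a method, not only the one that failed, in exchange for a filter that contains no grammar operators from test names.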

11. Android emulator recovery gap in Stage 2 — ci-copilot.yml

Found by: Opus 4.7

retryCountOnTaskFailure: 2 re-runs the pwsh script but not the emulator boot step. If the emulator crashed mid-run, the retry finds a dead device and fails with "device offline" — no recovery path exists.

12. iOS 26.4 fallback silently breaks visual tests — ci-copilot.yml

Found by: Opus 4.7

If all three iOS 26.4 download approaches fail (continueOnError: true), Start-Emulator.ps1 falls back to iOS 18/17 — but visual baselines live under snapshots/ios-26. Every visual test then fails with "size differs" with no indication in the comment about why.

13. Regression risk false positives on p/1 label — Find-RegressionRisks.ps1:438

Found by: Opus 4.7

Test-IsBugFixLabel matches p/1 — a priority label that also applies to enhancements. This produces false "regression risk" flags for non-bugfix PRs.

Fix: Only use t/bug|i/regression; use p/0|p/1 as secondary signal AND-ed with a bug label.
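
One way to read the proposed rule, as a Python sketch: bug labels decide whether a PR counts as a bug fix, and priority labels only matter when a bug label is also present. The function names and the `t/enhancement` label in the example are illustrative assumptions.

```python
BUG_LABELS = {"t/bug", "i/regression"}
PRIORITY_LABELS = {"p/0", "p/1"}


def is_bug_fix(labels):
    """Primary signal: at least one bug label is present."""
    return bool(BUG_LABELS & set(labels))


def is_high_priority_bug_fix(labels):
    """Secondary signal: a priority label AND-ed with a bug label."""
    present = set(labels)
    return bool(BUG_LABELS & present) and bool(PRIORITY_LABELS & present)


enhancement = ["p/1", "t/enhancement"]  # illustrative label set
print(is_bug_fix(enhancement), is_high_priority_bug_fix(["p/1", "t/bug"]))
```

A `p/1` enhancement is no longer treated as a bug fix, which removes the false positive described above.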

14. Linked-issue regex misses valid GitHub forms — Find-RegressionRisks.ps1:449

Found by: Opus 4.7

Only matches Fixes/Closes/Resolves but GitHub also accepts Fix, Fixed, Close, Closed, Resolve, Resolved.
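
A pattern covering all nine closing keywords might look like the Python sketch below. It only handles same-repository `#123` references, which is a simplification made for this illustration.

```python
import re

# close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved
CLOSING_KEYWORD = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def linked_issues(body):
    """Return issue numbers referenced with a closing keyword."""
    return [int(number) for number in CLOSING_KEYWORD.findall(body)]


issues = linked_issues("Fix #12, closed #34 and Resolves #56. See #78.")
print(issues)
```

A plain mention such as "See #78" is ignored, and the word boundary keeps words like "prefix" from matching.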

🟡 Nits & Cleanup

15–22: Lower-severity items

Stage 3 inline pwsh is 240+ lines — move to standalone testable script
deepTestsRan case-sensitive comparison — use lower() for safety
Android boot bash duplicated between Stage 1 and Stage 2 (~60 lines)
No test coverage for Invoke-UITestWithRetry.ps1 (255 lines, 0 tests)
Malformed TRX parse failures invisible — Get-TrxResults returns $null, aggregator skips silently
--all fallback in git log can poison REVERT decisions on shallow clones
Rate limit guard is advisory only — 1000+ gh calls possible with no hard cap
Hardcoded emulator-5554 in three identical fallback blocks — extract to helper

✅ Strengths Noted

3-stage architecture is the right shape for AzDO — clean separation of concerns
Regression risk detection algorithm is well-designed (whitespace normalization, trivial-line filtering, move suppression, ancestry verification)
Pester test suites for Find-RegressionRisks (418 lines) and Aggregate-UITestArtifacts (169 lines) — good coverage of the new logic
Output normalization in Invoke-UITestWithRetry.ps1 (splitting multi-line ErrorRecords) solves a real and subtle PowerShell pitfall
Secret handling is clean — ConvertTo-Json for API payloads, AzDO masking for tokens, no set -x on secret-bearing steps

kubaflo · 2026-05-20T11:05:59Z

✅ Multimodal Review Complete — All Critical Issues Fixed

Models: Claude Opus 4.7 (Extra-high reasoning) · Claude Opus 4.7 · GPT-5.5
Rounds: 3 review cycles + 4 fix commits

Fix Commits

Commit	Description
`526271871a`	Round 1: TRX overwrite→merge, retry wrapper in Stage 2, category extraction, VSTest outcome mapping, filter escaping, env-pattern centralization, linked-issue regex, bug-label tightening
`7a61d795c5`	Round 2: ALL sentinel empty-loop fix, Review-PR.ps1 dot-sources shared env patterns, CopilotLogs continueOnError, remove drifted inline fallback
`1ae95fc8ce`	Round 3: iOS/Catalyst/Windows category prefixes, counter math alignment, merge-only-failed guard, inline Get-TrxResults sync, dead code removal
`bcfb986665`	Round 4: Make TestFilter optional for ALL mode, handle no-filter dotnet test invocation

Issues Resolved (22 original + 7 discovered during fixes)

🔴 Critical (7 fixed)

~~TRX overwrite destroys passing test data~~ → XML merge with retry TRX deletion
~~TRX contamination across categories~~ → Specific TRX path from retry wrapper
~~Category extraction breaks for Stage 2~~ → All platform prefixes added
~~Stage 2 has no retry/recovery~~ → Uses Invoke-UITestWithRetry wrapper
~~AI-refreshed categories not propagated~~ → Re-emits output variable
~~Stage 3 lost when reviewer crashes~~ → DEFERRED fallback + continueOnError
~~VSTest outcomes misclassified~~ → All non-Passed/non-NotExecuted → Failed

🟠 Should-fix (10 fixed)

8. ~~Run-all → NONE conversion~~ → ALL sentinel with proper handling
9. ~~3 divergent env-error pattern copies~~ → Centralized in Get-EnvErrorPatterns.ps1
10. ~~VSTest filter operator injection~~ → Strip parameter signatures
11. ~~deepTestsRan case-sensitive~~ → lower() in condition
12. ~~p/1 false positives~~ → Secondary signal AND-ed with bug labels
13. ~~Linked-issue regex incomplete~~ → All GitHub closing forms
14. ~~ALL mode empty loop~~ → Single-element list with empty string
15. ~~Counter math excludes Aborted/Timeout~~ → total - passed - skipped
16. ~~Retry merge flips passing tests~~ → Only replace $failedNames entries
17. ~~ALL mode mandatory parameter~~ → TestFilter optional, no-filter path

🟡 Remaining (low-severity, tracked for follow-up)

Stage 3 inline pwsh (240+ lines) — should be extracted to standalone script
Android boot bash duplicated between Stage 1 and Stage 2
DEFERRED fallback unreachable when ReviewPR fails mid-execution (needs stage split)
TRX merge dedup key is testName not testId (no current collision)

Adds Find-RegressionRisks.ps1, a purely mechanical (no AI/LLM) script that detects when a PR removes lines previously added by labeled bug-fix PRs.

Algorithm:
1. Collects lines removed by the PR under review
2. Finds recent PRs touching the same files via git log
3. Filters to bug-fix PRs (i/regression, t/bug, p/0, p/1 labels)
4. Cross-references removed lines against lines those fix PRs added
5. Whitespace-insensitive comparison classifies: REVERT / OVERLAP / CLEAN

Integration:
- Runs as STEP 0.6 in Review-PR.ps1 (between UI test detection and Gate)
- Content assembled into AI summary comment via post-ai-summary-comment.ps1
- Expert reviewer dimension #6 reads risks.json for REVERT entries
- 64 unit tests covering diff parsing, normalization, and detection logic

Validated against:
- PR #33908: correctly detects REVERT of IMauiRecyclerView check from #32278
- PR #35272: correctly classifies as OVERLAP (no line-level revert)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
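
The cross-reference at the core of this commit can be sketched in a few lines of Python. This is an interpretation, not the script's logic: it assumes REVERT means a removed line matches a line the fix PR added after whitespace is stripped, OVERLAP means the same files were touched without such a match, and CLEAN covers everything else. The trivial-line filtering and move suppression mentioned in the review are left out.

```python
def normalize(line):
    """Whitespace-insensitive form of a source line."""
    return "".join(line.split())


def classify(removed_lines, fix_added_lines, shares_files):
    removed = {normalize(l) for l in removed_lines if normalize(l)}
    added = {normalize(l) for l in fix_added_lines if normalize(l)}
    if removed & added:
        return "REVERT"
    if shares_files:
        return "OVERLAP"
    return "CLEAN"


revert = classify(["  if (x is IFoo)  {"], ["if (x is IFoo) {"], True)
overlap = classify(["a = 1"], ["b = 2"], True)
clean = classify([], [], False)
print(revert, overlap, clean)
```

Without a filter for trivial lines, a lone brace or blank-equivalent line would match almost any fix PR, which is presumably why the script filters them.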

When a REVERT is detected, the script now:
1. Extracts test files from the fix PR's diff
2. Classifies them via Detect-TestsInDiff.ps1 (type, filter, project, runner)
3. Stores regression_tests metadata in risks.json
4. Lists required tests in content.md

Review-PR.ps1 adds STEP 1.5 (after Gate builds the code):
- Reads regression_tests from risks.json
- Runs unit/XAML tests via dotnet test with the detected filters
- Skips UI/device tests (need CI infrastructure) with clear reporting
- Appends pass/fail results to content.md
- Writes test-results.json for downstream consumption

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

STEP 1.5 now runs every test type from reverted fix PRs:
- UI tests via BuildAndRunHostApp.ps1 (builds app, deploys, runs Appium)
- Device tests via Run-DeviceTests.ps1 (xharness on device/simulator)
- Unit/XAML tests via dotnet test --filter

Each test runs on the same platform as the Gate step. If a runner script is missing, the test is skipped gracefully rather than failing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Extract and run fix-PR tests for both REVERT and OVERLAP entries. A nearby edit can break a fix through side effects even without removing the exact lines. Running tests for all risk levels gives maximum confidence that prior fixes aren't regressed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When regression tests are detected (from REVERT or OVERLAP fix PRs), inject them as MANDATORY additional tests into the STEP 2 try-fix prompt. Each try-fix candidate must run these tests after its own test passes — a candidate that breaks a prior fix is marked Fail. The Report phase (Phase 3) also ranks candidates that failed regression tests lower than those that passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

STEP 0 → 1 (Branch Setup)
STEP 0.5 → 2 (UI Test Categories)
STEP 0.6 → 3 (Regression Cross-Reference)
STEP 1 → 4 (Gate)
STEP 1.5 → 5 (Regression Test Verification)
STEP 2 → 6 (PR Review)
STEP 3 → 7 (Post AI Summary)
STEP 4 → 8 (Apply Labels)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1. Dedup check: Before any setup, query AzDO for running/queued builds with the same PR+Platform. If a duplicate exists, cancel this run (oldest-wins policy). Prevents 2-5x compute waste.
2. Fail-fast merge conflicts: Test merge feasibility right after checkout, BEFORE SDK/emulator setup (~15-20 min saved per conflict). On failure, posts a PR comment with conflicting file list (uses hidden marker to update existing comment, not spam).
3. Fix partiallySucceeded noise: Add deepTestsRan variable and condition the drop-deep-uitests download on it. When deep tests are skipped (no detected categories), the download is skipped too, avoiding the 'artifact not found' warning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1. Findings JSON unwrapping: The expert-reviewer agent writes {"findings": [...]} (object wrapper) instead of bare [...]. ConvertFrom-Json returns 1 object with no .path property, causing all findings to be dropped as 'suspicious path: empty'. Now detects and unwraps both bare arrays and object wrappers.
2. Merge conflict comment: Use GH_COMMENT_TOKEN (GitHub PAT) instead of System.AccessToken (AzDO PAT) for posting PR comments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove the pre-check merge feasibility step and the dedup step from ci-copilot.yml. Keep only the partial-success fix (deepTestsRan condition on drop-deep-uitests download). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Critical fixes:
- Fix TRX overwrite bug: merge retry results into original TRX instead of replacing it, preserving all first-run passing tests. Delete retry TRX to prevent double-counting by aggregators.
- Fix TRX contamination across categories: use specific TRX path from Invoke-UITestWithRetry instead of recursive 30-min time-based scan.
- Fix category extraction for Stage 2 naming: add controls- prefixed variants to Get-CategoryFromArtifactName prefix list.
- Stage 2 now uses Invoke-UITestWithRetry wrapper for env-error retry and device recovery. Remove retryCountOnTaskFailure (handled in-script).
- Re-emit detectedCategories output variable after AI-tier refresh so Stage 2 picks up the refreshed list.
- Stage 3 falls back to DEFERRED mode when aiSummaryCommentId is empty but deep test artifacts exist (reviewer crashed).
- Map all VSTest outcomes (Aborted, Timeout, Error, etc.) to Failed in Get-TrxResults so failure disclosures match counter totals.

Should-fix:
- Preserve run-all sentinel as ALL (not NONE) so Stage 2 can distinguish run-everything from run-nothing.
- Centralize env-error patterns in shared/Get-EnvErrorPatterns.ps1; Invoke-UITestWithRetry dot-sources it as single source of truth.
- Strip parameter signatures from test names before building VSTest retry filter to avoid filter grammar operator injection.
- Use case-insensitive deepTestsRan comparison in pipeline condition.
- Track iOS 26.4 availability via pipeline variable for fallback warning.
- Restrict Test-IsBugFixLabel to t/bug and i/regression only; use p/0|p/1 as secondary signal AND-ed with bug labels to reduce false positives.
- Fix linked-issue regex to accept all GitHub closing keyword forms (Fix/Fixed/Close/Closed/Resolve/Resolved).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
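To illustrate the retry-filter item, here is a minimal Python sketch of stripping parameter signatures before building a filter expression. The helper names are hypothetical, and the use of `FullyQualifiedName~` joined with `|` is an assumption about the filter shape; the commit only says signatures are stripped so that characters such as `(`, `)`, `|` and `&` cannot be read as filter operators.

```python
def strip_signature(test_name: str) -> str:
    """Return the test name without a trailing '(arg1, arg2)' parameter list."""
    return test_name.split("(", 1)[0].strip()

def build_retry_filter(failed_tests):
    """Build an OR-joined filter over unique, signature-free test names."""
    names = []
    for name in failed_tests:
        bare = strip_signature(name)
        if bare and bare not in names:  # keep first-seen order, drop duplicates
            names.append(bare)
    return "|".join(f"FullyQualifiedName~{n}" for n in names)

# Hypothetical failed tests: two parameterized cases of one method plus another test.
retry_filter = build_retry_filter([
    "Ns.Tests.Scroll(x: 1)",
    "Ns.Tests.Scroll(x: 2)",
    "Ns.Tests.Tap",
])
```

Because `~` is a contains match, a stripped name can also select other tests whose names contain it, which is why the later TRX-merge fix only replaces entries that originally failed.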

- Fix ALL sentinel producing empty loop in Stage 2: use single-element list with empty string and skip -Category param in run-all mode
- Review-PR.ps1 now dot-sources shared/Get-EnvErrorPatterns.ps1 for infraSignals instead of hard-coded inline list
- Invoke-UITestWithRetry.ps1 fails loudly if shared patterns file is missing instead of silently using a drifted inline fallback
- CopilotLogs download gets continueOnError so DEFERRED fallback works when ReviewPR crashed before publishing the artifact
- Stage 2 uses splatted hashtable for retry wrapper params, conditionally omitting -Category in run-all mode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add ios_ui_tests-controls, catalyst_ui_tests-controls, windows_ui_tests-controls to Get-CategoryFromArtifactName prefix list so iOS/Catalyst/Windows Stage 2 artifacts extract correctly
- TRX merge counter math now uses same outcome classification as Get-TrxResults: Aborted/Timeout/Error counted as Failed, not omitted
- Only replace originally-failed entries during TRX merge (guard against substring-matching retry filter pulling in unrelated passing tests)
- Sync inline Get-TrxResults in Review-PR.ps1 with shared copy (default outcome → Failed)
- Remove dead IOS26Available emit step (no downstream consumer)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
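The merge rule and counter math described here can be sketched in Python over plain dictionaries (test name to outcome) rather than real TRX XML. The function names and the separate "Skipped" bucket for NotExecuted are assumptions; the source only states that unrecognized outcomes default to Failed and that only originally-failed entries are replaced.

```python
def classify_outcome(outcome: str) -> str:
    """Map a raw outcome to a counter bucket; anything unrecognized counts as Failed."""
    if outcome == "Passed":
        return "Passed"
    if outcome == "NotExecuted":  # assumed skip bucket, not confirmed by the source
        return "Skipped"
    return "Failed"  # Failed, Aborted, Timeout, Error, ...

def merge_retry(original: dict, retry: dict) -> dict:
    """Overlay retry outcomes only onto tests that failed in the original run."""
    merged = dict(original)
    for name, outcome in retry.items():
        if name in original and classify_outcome(original[name]) == "Failed":
            merged[name] = outcome
    return merged

def count(results: dict) -> dict:
    """Recompute counters with the same classification used for reporting."""
    totals = {"Passed": 0, "Failed": 0, "Skipped": 0}
    for outcome in results.values():
        totals[classify_outcome(outcome)] += 1
    return totals

# Hypothetical data: the retry also picked up test A, which had already passed.
first_run = {"A": "Passed", "B": "Failed", "C": "Timeout"}
retry_run = {"A": "Failed", "B": "Passed", "C": "Aborted", "D": "Passed"}
merged = merge_retry(first_run, retry_run)
totals = count(merged)
```

Using one classifier for both the merge guard and the counters is what keeps the failure list and the totals from drifting apart.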

BuildAndRunHostApp.ps1 required either -TestFilter or -Category as mandatory parameters. In ALL mode (run all tests without filter), Stage 2 invokes the script with neither.

Fix:
- Make -TestFilter optional (Mandatory=false) in TestFilter param set
- Handle null effectiveFilter: omit --filter arg from dotnet test
- Add 'ALL-<platform>' TRX base name fallback when neither param set

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
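A rough Python sketch of the no-filter path follows. It is not BuildAndRunHostApp.ps1: the function names, the `TestCategory=` mapping for a category, and using the category or filter text as the TRX base name are all assumptions. Only the omitted `--filter` argument and the `ALL-<platform>` fallback come from the commit message.

```python
def build_test_args(project, category=None, test_filter=None):
    """Assemble a dotnet test command line, omitting --filter in run-all mode."""
    # Assumed mapping: a category becomes a TestCategory filter expression.
    effective_filter = f"TestCategory={category}" if category else test_filter
    args = ["dotnet", "test", project]
    if effective_filter:  # None or empty string means "run everything"
        args += ["--filter", effective_filter]
    return args

def trx_base_name(platform, category=None, test_filter=None):
    """Pick a TRX base name, falling back to ALL-<platform> when nothing is set."""
    return category or test_filter or f"ALL-{platform}"

# Run-all mode: neither a category nor a filter is supplied.
all_mode_args = build_test_args("UITests.csproj")
all_mode_name = trx_base_name("android")
```

Checking the filter once and branching on it avoids passing an empty `--filter` value through to the test runner.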

Windows:
- Set screen resolution to 1920x1080 (AzDO agents default to 1024x768) matching main CI ui-tests-steps.yml behavior

Catalyst (MacCatalyst):
- Disable/re-enable Notification Center (intercepts UI interactions)
- Skip Xcode provisioning (catalyst doesn't need it, matches main CI)
- Pass openSslArgs: '' to avoid legacy openssl for Catalyst builds

Both iOS and Catalyst:
- Disable macOS text autocorrect/autocapitalize/spellcheck (interferes with Appium text entry tests), matching main CI setup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Guard against null $effectiveFilter in the success summary banner and info log. In ALL mode neither -Category nor -TestFilter is passed, so effectiveFilter is null. The .Substring() call on null throws under ErrorActionPreference=Stop, breaking ALL-mode test runs even when all tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Get-EnvErrorPatterns returns regex patterns (with .* and \s*). Wrapping them in [regex]::Escape() converts metacharacters to literals, so patterns like 'error ADB0010.*InstallFailedException' never match. Use -match directly, matching how Invoke-UITestWithRetry uses them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
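The same failure mode is easy to reproduce in Python, where `re.escape` plays the role of `[regex]::Escape()`. The pattern is the one quoted above; the log line is a made-up example.

```python
import re

# Pattern as returned by the shared patterns file (already a regex).
pattern = r"error ADB0010.*InstallFailedException"

# Hypothetical log line for illustration only.
log_line = "error ADB0010: Mono.AndroidTools.InstallFailedException: device offline"

# Escaping turns '.' and '*' into literals, so the pattern can only match
# a line that contains the characters ".*" verbatim.
escaped_hit = re.search(re.escape(pattern), log_line) is not None

# Using the pattern directly lets '.*' span the text between the two tokens.
direct_hit = re.search(pattern, log_line) is not None
```

Here `escaped_hit` is `False` and `direct_hit` is `True`, which mirrors why the escaped patterns never detected environment errors.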

The bash artifact-copy step used `cp -r CustomAgentLogsTmp` which silently fails on Windows agents (Git Bash path mismatch). This caused the CopilotLogs artifact to be missing PRAgent content files, so Stage 3 had nothing to post as the AI Summary comment. Move artifact copying to a separate pwsh step that works on all platforms (Linux, macOS, Windows). The bash step retains only the copilot session-state copy which uses $HOME. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Copilot agent writes inline findings with 'file' as the key, but post-inline-review.ps1 only read 'path'. All findings were rejected as 'suspicious path: empty' and no inline comments were ever posted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Stage 2 was skipped whenever the gate hit environment errors (3x retries exhausted → CopilotFailed=true → Check Copilot Result exits 1 → ReviewPR result=Failed). But detectedCategories was already emitted successfully in STEP 2, so the deep tests COULD run. Add 'Failed' to the allowed ReviewPR results in Stage 2's condition. Deep UI tests are valuable independent of the gate outcome — they verify the PR's changes don't break existing test suites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The conditional download using deepTestsRan variable was unreliable — AzDO's $[ in() ] expression for cross-stage result evaluation can return unexpected values depending on stage result propagation timing. This caused Stage 3 to skip the artifact download even when Stage 2 produced valid TRX results, resulting in AI Summary comments showing 'SKIPPED' for UI tests that actually ran and passed. Fix: always attempt the download with continueOnError (handles the case where RunDeepUITests was skipped). Remove the unused deepTestsRan variable. Add diagnostic logging to Stage 3 for debugging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Copilot agent produces findings JSON in multiple formats: bare array, {findings:[...]}, {schemaVersion:1, findings:[...]}, single object, etc. The previous parser used $parsed.findings which is falsy in PowerShell when the array is empty or when property access returns unexpected types. This caused the wrapper object to be treated as a single finding with no path/file. Use PSObject.Properties.Match() for explicit property detection instead of truthy evaluation. Add diagnostic logging so future format issues are immediately visible in pipeline logs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
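A minimal Python sketch of the same idea: test for the key's presence instead of its truthiness, and accept either `path` or `file`. The function names are hypothetical, and this is an illustration of the approach rather than the code in post-inline-review.ps1.

```python
import json

def unwrap_findings(raw: str) -> list:
    """Return a list of findings from any of the observed JSON shapes."""
    parsed = json.loads(raw)
    if isinstance(parsed, list):  # bare array
        return parsed
    if isinstance(parsed, dict):
        if "findings" in parsed:  # explicit key check, so an empty list still counts
            inner = parsed["findings"]
            if inner is None:
                return []
            return inner if isinstance(inner, list) else [inner]
        return [parsed]  # single finding object
    return []

def finding_path(finding: dict) -> str:
    """Read the file path from either key the agent may use."""
    return finding.get("path") or finding.get("file") or ""

empty_wrapper = unwrap_findings('{"findings": []}')
wrapped = unwrap_findings('{"schemaVersion": 1, "findings": [{"file": "b.cs"}]}')
single = unwrap_findings('{"file": "c.cs", "line": 3}')
```

With a truthiness check, the empty wrapper would fall through and be treated as one finding with no path; the key-presence check returns an empty list instead.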

…e ref Review-PR.ps1 STEP 1 checks out origin/main and squash-merges the PR, replacing all files including .github/scripts/. Script fixes on the pipeline ref (feature/regression-check) were overwritten with main's versions, so post-inline-review.ps1 file/path fix and other changes never took effect. Fix: YAML saves .github/scripts to a backup dir before invoking Review-PR.ps1. After STEP 1 completes the branch switch, the script restores the pipeline-ref versions via SCRIPTS_BACKUP env variable. This ensures STEP 7 (inline posting) and all other script steps use the pipeline ref's code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Review-PR.ps1 was checking out origin/main in CI mode, overwriting all pipeline-ref scripts with main's versions. This is why inline comment fixes, env-error patterns, and other script changes never took effect — they were replaced by main's code. Fix: in CI mode, stay on the current branch (the pipeline ref, e.g. feature/regression-check). The PR is squash-merged onto it. This preserves all script fixes while still testing the PR's changes. Remove the backup/restore workaround — no longer needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 11, 2026 10:36

kubaflo added the area-ai-agents Copilot CLI agents, agent skills, AI-assisted development label May 11, 2026

Copilot started reviewing on behalf of kubaflo May 11, 2026 10:39

Copilot AI reviewed May 11, 2026

kubaflo changed the title ~~[CI] Add deep UI test execution to Copilot PR review pipeline~~ Add deep UI test execution to Copilot PR review pipeline May 11, 2026

This was referenced May 12, 2026

[PR Review Queue] 2026-05-12 #35391

Closed

[PR Review Queue] 2026-05-13 #35414

Closed

kubaflo force-pushed the feature/regression-check branch from 857ed58 to 1370143 May 19, 2026 11:48

This was referenced May 20, 2026

[PR Review Queue] 2026-05-20 #35535

Closed

[PR Review Queue] 2026-05-20 PureWeen/maui#50

Closed

kubaflo mentioned this pull request May 20, 2026

Fix agentic-labeler truncating labels to 1 per call #35537

Closed

kubaflo force-pushed the feature/regression-check branch from 00a514d to 5ad468e May 21, 2026 10:12

This was referenced May 22, 2026

[PR Review Queue] 2026-05-22 #35581

Closed

[PR Review Queue] 2026-05-22 PureWeen/maui#65

Closed

JanKrivanek enabled auto-merge May 22, 2026 10:30

Copilot AI added 6 commits May 22, 2026 12:51

Copilot AI added 21 commits May 22, 2026 12:51

Remove accidentally committed temp files

bd78e2b

Fix: use claude-opus-4.6 as default model (4.7-1m-internal unavailable)

24dde9c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Switch default model to GPT-5.5

3f56e09

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove stray .playwright-mcp and screenshots from .gitignore

327a75c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kubaflo force-pushed the feature/regression-check branch from e13ca0e to 327a75c May 22, 2026 10:51

rmarinho disabled auto-merge May 22, 2026 15:17

rmarinho merged commit 7c76748 into main May 22, 2026
40 checks passed

rmarinho deleted the feature/regression-check branch May 22, 2026 15:18

github-actions Bot added this to the .NET 10.0 SR8 milestone May 22, 2026

This was referenced May 24, 2026

[repo-status] Daily Repo Status — May 24, 2026 🌟 #35598

Closed

[repo-status] Daily Repo Status — May 25, 2026 🌟 #35600

Closed

[repo-status] Daily Repo Status — May 30, 2026 #35673

Closed

Conversation

kubaflo commented May 11, 2026 • edited

Description

Pipeline Workflow

What's New

Validation

github-actions Bot commented May 11, 2026 • edited

🔍 Skill Validation Results

✅ Static Checks Passed

⏭️ LLM Evaluation: Skipped

Copilot AI left a comment

Pull request overview

Reviewed changes

kubaflo commented May 11, 2026

🔍 Multimodal Code Review — 5 agents, 5 dimensions

🔴 Critical Findings (5 found, 2 resolved)

🟠 Major Findings (12) — all still open

✅ C# Appium Change — Approve

🔵 Top Suggestions

Summary

kubaflo commented May 11, 2026

kubaflo commented May 11, 2026

🔍 Re-review after commit 7037237 — Critical findings C2, C4, C5

✅ C5 — gh auth guard — RESOLVED

✅ C4 — DEFERRED mode idempotency — RESOLVED (one minor gap)

⚠️ C2 — Invoke-Expression elimination — PARTIALLY RESOLVED

Updated Scorecard — All 5 Critical Findings

kubaflo commented May 11, 2026

kubaflo commented May 11, 2026

🔍 Re-review after commit d31011f — C2 fully resolved

✅ C2 — Invoke-Expression elimination — FULLY RESOLVED

✅ C4 minor — pagination fix applied

Final Scorecard — All 5 Critical Findings

kubaflo commented May 20, 2026

🔬 Multimodal Code Review — PR #35376

🔴 Critical — Fix Before Merge

🟠 Should Fix

🟡 Nits & Cleanup

✅ Strengths Noted

kubaflo commented May 20, 2026

✅ Multimodal Review Complete — All Critical Issues Fixed

Fix Commits

Issues Resolved (22 original + 7 discovered during fixes)

4 participants