Integrate UI test category detection into PR review and fix gate reliability#35133
Conversation
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35133Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35133" |
There was a problem hiding this comment.
Pull request overview
This PR adds automatic UI test category detection so PR UI test jobs can skip running when the PR doesn’t touch their category group, reducing overall UI test time and CI load.
Changes:
- Introduces a PowerShell-based category detection script (test diff scanning + source-path heuristics + optional AI hints).
- Adds a “Discover” stage and per-job early gating to skip irrelevant UI test jobs in
maui-pr-uitests. - Adds orchestration and reporting scripts to trigger the UI test pipeline and post a consolidated PR comment, plus integrates category detection into the PR review flow.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
eng/scripts/detect-ui-test-categories.ps1 |
Detects relevant UI test categories from PR changes and emits variables for downstream filtering. |
eng/pipelines/common/ui-tests.yml |
Adds the discovery stage and wires detected category outputs into UI test stages/jobs. |
eng/pipelines/common/ui-tests-steps.yml |
Adds an early per-job filter step and conditions many steps on whether the job should run. |
eng/pipelines/ci-uitests.yml |
Adds prNumber / categories parameters to support manual targeted runs. |
eng/pipelines/ci-copilot.yml |
Exposes AZURE_DEVOPS_EXT_PAT for cross-org AzDO queuing from the Copilot pipeline. |
.github/scripts/trigger-uitest-pipeline.ps1 |
Orchestrates detect → queue → monitor → summarize flow for UI tests. |
.github/scripts/post-uitest-categories-comment.ps1 |
Posts/updates a single PR comment summarizing UI test results and failures. |
.github/scripts/post-ai-summary-comment.ps1 |
Adds a UI Tests section to the unified AI summary comment. |
.github/scripts/Review-PR.ps1 |
Adds a preliminary category detection step and improves gate retry/report handling. |
.github/pr-review/pr-preflight.md |
Updates the PR pre-flight checklist to include identifying impacted UI test categories. |
Comments suppressed due to low confidence (1)
eng/pipelines/common/ui-tests-steps.yml:275
- The iOS cleanup step runs whenever
platform == ios, even when the new early category check setSHOULD_RUN_TESTS=Falseand all prior steps were skipped. This can waste time and may fail because test artifacts/simulators were never provisioned. Gate this step onSHOULD_RUN_TESTSas well (similar to the other tasks in this template).
- bash: |
cat ${BASH_SOURCE[0]}
pwsh ./build.ps1 --target=Cleanup -Script eng/devices/${{ parameters.platform }}.cake ---results="$(TestResultsDirectory)" ${{ parameters.cakeArgs }}
displayName: Cleanup and Create Simulator Logs if Test Run Failed To
condition: ${{ eq(parameters.platform, 'ios') }}
continueOnError: true
| > **Validation constraint:** The Step 7 prompt MUST NOT contain issue titles, root-cause descriptions, bug summaries, or any Part A content — only `PR #XXXXX`. If you find yourself adding context "to help" the sub-agent, you are violating independence-first. | ||
| > **Validation constraint:** The Step 8 prompt MUST NOT contain issue titles, root-cause descriptions, bug summaries, or any Part A content — only `PR #XXXXX`. If you find yourself adding context "to help" the sub-agent, you are violating independence-first. | ||
|
|
||
| 7. **Invoke the code-review skill as a sub-agent:** |
There was a problem hiding this comment.
Step numbering is inconsistent after adding the new “Identify Impacted UI Test Categories” step: this section is now labeled “Part B: Code Review (Step 8)”, but the instruction below still says “7. Invoke the code-review skill as a sub-agent”. Update the numbering to avoid confusion when following the checklist.
| 7. **Invoke the code-review skill as a sub-agent:** | |
| 8. **Invoke the code-review skill as a sub-agent:** |
| if ([string]::IsNullOrWhiteSpace($uitestCategories) -or $uitestCategories -eq 'NONE') { | ||
| Write-Host " ℹ️ No UI test categories detected" -ForegroundColor DarkGray | ||
| } else { | ||
| Write-Host " 🎯 Detected categories: $uitestCategories" -ForegroundColor Green |
There was a problem hiding this comment.
detect-ui-test-categories.ps1 uses an empty UITestCategoryList output to mean “run the full matrix” (it returns without setting the variable). Here, an empty result is treated as “no UI test categories detected”, which is misleading in the common fallback-to-all case. Treat empty as “ALL/full matrix” and reserve NONE for “skip all UI tests”.
| if (-not [string]::IsNullOrWhiteSpace($aiContent) -and $aiContent.Trim() -ne 'NONE') { | ||
| # Extract category names (lines like "Button — justification") | ||
| $aiCatLines = @($aiContent -split "`n" | ForEach-Object { | ||
| if ($_ -match '^([A-Za-z]+)') { $Matches[1] } |
There was a problem hiding this comment.
AI category parsing only captures leading letters (^([A-Za-z]+)), which will truncate categories that contain digits (e.g., Material3 becomes Material). Expand the pattern to include digits/underscores so AI-provided categories can round-trip correctly.
| if ($_ -match '^([A-Za-z]+)') { $Matches[1] } | |
| if ($_ -match '^([A-Za-z0-9_]+)') { $Matches[1] } |
| # EARLY CHECK: Determine if this category group should run tests | ||
| # This runs FIRST to avoid wasting time on provisioning if no tests will run | ||
| # Also calculates matching categories to avoid duplicating this logic later | ||
| - pwsh: | |
There was a problem hiding this comment.
This early check claims it runs first to avoid wasting time on provisioning when a job is skipped, but the template still invokes provision.yml unconditionally later in the job. That means even skipped category jobs will still pay most of the provisioning cost, which undermines the intended runtime savings. Consider gating the provisioning step(s) on SHOULD_RUN_TESTS=True as well (or adding a condition hook parameter to the provisioning template).
| Write-Host "Test file changes detected under '$TestRoot'." -ForegroundColor Green | ||
| } | ||
|
|
||
| $categoryPattern = '^\+\s*\[Category\((?<value>[^\)]*)\)\]' |
There was a problem hiding this comment.
The regex used to detect added [Category(...)] lines only matches attributes that end immediately with )]. Many tests use combined attributes like [Category(UITestCategories.X), Order(1)], which won't match and will cause category detection to miss relevant categories (leading to overly broad runs). Update the pattern to also match when Category(...) is followed by , or ] (and allow whitespace).
| $categoryPattern = '^\+\s*\[Category\((?<value>[^\)]*)\)\]' | |
| $categoryPattern = '^\+\s*\[Category\((?<value>[^\)]*)\)\s*(?:,|\])' |
| $content = Get-Content $file -Raw | ||
| $fileMatches = [regex]::Matches($content, '\[Category\(([^\)]*)\)\]') | ||
| foreach ($m in $fileMatches) { | ||
| $rawValue = $m.Groups[1].Value.Trim() | ||
| if ([string]::IsNullOrWhiteSpace($rawValue)) { continue } | ||
| if ($rawValue -match '^UITestCategories\.(?<name>[A-Za-z0-9_]+)$') { | ||
| $cat = $Matches['name'] | ||
| } elseif ($rawValue -match '^["''](?<name>[A-Za-z0-9_ -]+)["'']$') { | ||
| $cat = $Matches['name'] | ||
| } elseif ($rawValue -match 'nameof\(UITestCategories\.(?<name>[A-Za-z0-9_]+)\)') { | ||
| $cat = $Matches['name'] | ||
| } else { continue } |
There was a problem hiding this comment.
When scanning the full contents of modified test files, the \[Category\(([^\)]*)\)\] regex has the same limitation as the diff regex and will not match combined attributes like [Category(UITestCategories.X), Order(1)]. This will prevent Tier 1 fallback detection from finding existing categories in many files. Adjust the regex to capture the Category(...) argument even when additional attributes follow in the same bracket.
| } elseif (-not [string]::IsNullOrWhiteSpace($env:SYSTEM_ACCESSTOKEN)) { | ||
| $h['Authorization'] = "Bearer $env:SYSTEM_ACCESSTOKEN" |
There was a problem hiding this comment.
Get-GitHubHeaders falls back to SYSTEM_ACCESSTOKEN as a GitHub Bearer token. SYSTEM_ACCESSTOKEN is an Azure DevOps token and won't authenticate to the GitHub API, so PR metadata/label calls will fail unless GH_TOKEN is present. Consider removing this fallback and emitting a clear warning when GH_TOKEN is missing (or support a dedicated GitHub token env var).
| } elseif (-not [string]::IsNullOrWhiteSpace($env:SYSTEM_ACCESSTOKEN)) { | |
| $h['Authorization'] = "Bearer $env:SYSTEM_ACCESSTOKEN" | |
| } else { | |
| Write-Host "##[warning]GH_TOKEN is not set. GitHub API requests will be sent without authentication and may fail or be rate-limited." |
8898e29 to
8586797
Compare
231f9a9 to
1b54ca1
Compare
PR review integration (depends on #35136 for pipeline changes): - Review-PR.ps1: Step 0.5 detects categories, gate retry on ENV ERROR, absolute path resolution, bad report format stripping - post-uitest-categories-comment.ps1: Rich results with platform table, failure classification, supports -OutputFile for AI summary embedding - trigger-uitest-pipeline.ps1: Orchestrator for detect → queue → monitor - post-ai-summary-comment.ps1: UI Tests section (before gate) - pr-preflight.md: Step 7 AI category identification, Step 8 code review - ci-copilot.yml: DNCENG_PUBLIC_PAT for cross-org build queuing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1b54ca1 to
8413e08
Compare
Two issues: 1. The copilot agent was overwriting gate/content.md with its own generic version that lacked actual error messages. Added explicit instruction not to touch gate/content.md. 2. The verify script's failure details were in a collapsed <details> block. Changed to an open #### heading so the error reason is immediately visible (e.g. screenshot baseline mismatch). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 Skill Validation Results✅ Static Checks PassedSkills checked: 15 | Agents checked: 3 Full validator output❌ LLM Evaluation Failed0/1 skill(s) passed validation
❌ verify-tests-fail-without-fix: Eval scenario 'Regression: tests passing without fix means verification FAILED' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Edge case: no test files detected in the PR' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. Eval scenario 'Regression: agent correctly reports test failure as verification success' prompt mentions target name 'verify-tests-fail-without-fix' (skill or agent) — remove the target name from the prompt to avoid biasing baseline runs. |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Expert Code Review: 8 findings posted inline (8 moderate). See summary comment for methodology and discarded findings.
Generated by Expert Code Review · ● 29.8M
| git remote add _detect_base $baseRepoCloneUrl | ||
| git fetch _detect_base "$TargetBranch" --no-tags --prune --depth=200 | Out-Null | ||
| git update-ref refs/remotes/origin/$TargetBranch _detect_base/$TargetBranch | Out-Null | ||
|
|
||
| # Fetch head commit (works for forks too) and check it out so the diff reflects the PR changes. | ||
| git remote remove _detect_head 2>$null | Out-Null | ||
| git remote add _detect_head $headRepoCloneUrl | ||
| git fetch _detect_head "$headSha" --no-tags --depth=200 | Out-Null | ||
| git checkout --quiet $headSha | Out-Null |
There was a problem hiding this comment.
🟡 Moderate — 3/3 consensus | Git state not restored on failure
Manual PR mode performs git checkout --quiet $headSha (detached HEAD) and adds temporary remotes (_detect_base, _detect_head) without a try/finally to restore original state. If the script fails mid-execution, the working tree is left detached and stray remotes persist in .git/config.
Recommendation: Wrap the manual-mode git block in try/finally. In finally, restore the original HEAD (git checkout $originalRef) and remove both temporary remotes unconditionally.
| $gateOutput | ForEach-Object { Write-Host " $_" } | ||
|
|
||
| # Check if this was an ENV ERROR (emulator timeout, ADB failure, etc.) | ||
| $gateContentFile = Join-Path $gateOutputDir "verify-tests-fail/verification-report.md" |
There was a problem hiding this comment.
🟡 Moderate — 2/3 consensus | Stale report file across retries
$gateContentFile (verification-report.md) is read each retry iteration to detect ENV ERROR, but is never cleared between attempts. If attempt 1 writes an ENV ERROR report and attempt 2 crashes before overwriting it, the stale file causes misclassification of the second attempt.
Recommendation: Delete or rename $gateContentFile at the start of each loop iteration before invoking the verify script:
if (Test-Path $gateContentFile) { Remove-Item $gateContentFile -Force }| $pr = Invoke-WithRetry -Uri $prUrl -Headers (Get-GitHubHeaders) | ||
| $TargetBranch = $pr.base.ref | ||
| $headRef = $pr.head.ref | ||
| $headSha = $pr.head.sha |
There was a problem hiding this comment.
🟡 Moderate — 2/3 consensus | No null guards on API response fields
If $pr.head.repo is null (common when a fork is deleted), accessing .clone_url silently returns $null in PowerShell. Subsequent git remote add _detect_head "" and git fetch _detect_head "" produce confusing errors. The outer try/catch may or may not catch native command failures depending on $ErrorActionPreference.
Recommendation: Add explicit null checks before git operations:
if ([string]::IsNullOrWhiteSpace($headSha) -or [string]::IsNullOrWhiteSpace($headRepoCloneUrl)) {
Write-Host "##[warning]Incomplete PR API response (fork may be deleted). Falling back."
return
}| Start-Sleep -Seconds 30 | ||
| } | ||
| } | ||
| if ($isEnvError) { |
There was a problem hiding this comment.
🟡 Moderate — 2/3 consensus | $isEnvError post-loop invariant is non-obvious
The check if ($isEnvError) after the loop is technically correct (it can only be $true here if ALL iterations were env errors, since non-env-error iterations break). However, the variable name suggests "any" rather than "all", and if $maxGateAttempts is ever 0, $isEnvError would be undefined ($null → falsy, so safe but fragile).
Recommendation: Either add a clarifying comment explaining the invariant, or use a counter: if ($envErrorCount -eq $maxGateAttempts) { ... }.
| $uitestCategories = "" | ||
|
|
||
| $detectScript = Join-Path $RepoRoot "eng/scripts/detect-ui-test-categories.ps1" | ||
| if (Test-Path $detectScript) { |
There was a problem hiding this comment.
🟡 Moderate — 2/3 consensus | Variable name shadowing
$detectScript is defined here for detect-ui-test-categories.ps1, then reassigned at line ~499 to Detect-TestsInDiff.ps1. Both live in the same script scope. Future edits could accidentally reference the wrong script path.
Recommendation: Use distinct names — e.g., $uitestDetectScript (Step 0.5) and $diffDetectScript (Step 1).
| } else { | ||
| $message = "Unrecognized category expression '$rawValue'. Expected formats: UITestCategories.<Name>, nameof(UITestCategories.<Name>), or a quoted string." | ||
| Write-Host "##[error]$message" | ||
| throw $message |
There was a problem hiding this comment.
🟡 Moderate — 2/3 consensus | throw on unrecognized category halts all detection
When the diff scan encounters a [Category()] expression that doesn't match the three supported patterns (e.g., a constant like [Category(MyConstants.Button)]), the script throws an exception. This kills the child pwsh process, causing the parent to receive zero categories — silently skipping Tier 2/3 fallback detection.
Note: The existing-file scan at line ~314 uses continue for unrecognized formats, creating an inconsistency.
Recommendation: Replace throw $message with Write-Host "##[warning]$message" + continue. Let Tier 2/3 fill gaps gracefully.
| if ($touchesControls) { | ||
| # Changed files under src/Controls/ but couldn't map to specific categories — run all | ||
| Write-Host "Changed files touch Controls/Core/Essentials but no specific categories identified. Running all." -ForegroundColor Yellow | ||
| return |
There was a problem hiding this comment.
🟡 Moderate — 2/3 consensus | "Run all" fallback paths never set output variables
This return (and several others at lines 52, 57, 130, 154, 165, 174, 180) exits without emitting ##vso[task.setvariable variable=UITestCategoryList;isOutput=true] or UITestCategoryMatrix. The log says "Running all" but the pipeline receives no output — which may be interpreted as "run nothing" rather than "run everything" depending on how the consuming YAML handles an unset matrix variable.
Recommendation: For each "run all" fallback, explicitly emit a sentinel (e.g., ALL) or the full category list, ensuring the pipeline YAML has a well-defined behavior for this case.
| # ============================================================================ | ||
|
|
||
| if (-not [string]::IsNullOrWhiteSpace($diff)) { | ||
| foreach ($line in $diff -split "`n") { |
There was a problem hiding this comment.
🟡 Moderate — 2/3 consensus | Line-by-line scan cannot detect multi-line attributes
The Tier 1 diff scan splits on \n and applies the category regex per-line. While current codebase conventions use single-line [Category(...)] attributes (making this safe today), the approach silently misses any future multi-line formatting. Under-detection is safe (falls back to broader detection) but worth noting.
Recommendation: Consider adding a comment documenting this assumption, or collapsing consecutive + lines before regex matching to handle wrapped attributes.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Expert Code Review: 2 findings posted inline (1 warning, 1 suggestion). Both are minor — overall the implementation is solid with proper error handling, retry logic, and fail-safe fallbacks.
Note
🔒 Integrity filter blocked 5 items
The following items were blocked because they don't meet the GitHub integrity level.
- 683daa4
list_commits: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - 8413e08
list_commits: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - 388b82f
list_commits: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - bd3a0e5
list_commits: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - f49d3bd
list_commits: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
tools:
github:
min-integrity: approved # merged | approved | unapproved | noneGenerated by Expert Code Review · ● 40.6M
| } else { | ||
| $message = "Unrecognized category expression '$rawValue'. Expected formats: UITestCategories.<Name>, nameof(UITestCategories.<Name>), or a quoted string." | ||
| Write-Host "##[error]$message" | ||
| throw $message |
There was a problem hiding this comment.
[Category] format
If a test in the diff uses an uncommon [Category] expression format (e.g., a constant from another class, string interpolation, or concatenation), this throw aborts the entire detection script.
In practice the caller in Review-PR.ps1 treats this as non-fatal (category detection silently returns empty), so it's fail-safe. However, in the Azure Pipelines context (direct invocation), this terminates with an error and the ##vso output variables are never set — the pipeline would fall back to "run all categories" which is fine, but the ##[error] log line may cause confusion during triage.
Consider: Write-Host "##[warning]$message" + continue instead of throw, so detection continues for remaining diff lines and only logs a warning for the unrecognized expression.
| $aiCatList = @($AiCategories -split '[,\n]' | ForEach-Object { ($_ -replace '\s*[-—].*$', '').Trim() } | Where-Object { $_ -and $_ -ne 'NONE' }) | ||
| if ($aiCatList.Count -gt 0) { | ||
| Write-Host "Tier 3 (AI reasoning): $([string]::Join(', ', $aiCatList))" -ForegroundColor Green | ||
| foreach ($c in $aiCatList) { $addedCategories.Add($c) | Out-Null } |
There was a problem hiding this comment.
💡 Suggestion — AI-provided categories added without validation
Tier 3 adds AI-suggested categories directly to $addedCategories without checking they exist in UITestCategories.cs. If the AI halluccinates a category name (e.g., "Modal", "Toolbar"), it creates a matrix job that runs zero tests — wasting a CI slot and potentially masking detection failures.
Consider: Validate against the known category list (Tier 2's $pathToCategoryMap already enumerates valid categories), or emit a ##[warning] for unrecognized AI categories so they're visible in logs without silently creating empty jobs.
…ats and AI hallucinations - Replace 'throw' with '##[warning]' + continue when an unrecognized [Category(...)] expression is found in a diff. Previously a single unsupported expression (e.g., a constant from another class) aborted the whole detection script and silently fell back to running ALL categories. Now the script logs a warning and lets Tier 2 / Tier 3 fill in the gaps. - Validate Tier 3 (AI-suggested) categories against UITestCategories.cs before adding them to the matrix. Hallucinated category names would otherwise create matrix jobs that run zero tests, wasting a CI slot and masking detection failures. Invalid names are now skipped with a warning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed both findings from the latest expert review in 3db7ee4:
💡 Suggestion — AI-provided categories added without validation (line 369) Did not adopt the prior review's findings ( |
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
What this does
Two things:
1. UI test category detection in PR review
During the PR review workflow, Step 0.5 detects which UI test categories the PR impacts and writes the result to the AI summary comment. This gives reviewers visibility into which UI tests are relevant.
Detection reuses the 3-tier script from #35136 (test attributes → source paths → AI reasoning).
AI summary shows a new 🧪 UI Tests section with detected categories before the gate section.
2. Gate reliability fixes
Multiple fixes to make the gate (
verify-tests-fail.ps1) more deterministic:Resolve-Path,GetFullPath)Passed: Falsewith empty counts — stripped instead of shown❌ FAILED / Platform: IOSFiles
.github/scripts/Review-PR.ps1.github/scripts/post-ai-summary-comment.ps1uitestsphase to render detected categories.github/pr-review/pr-preflight.mdValidation — PR reviewer builds (Apr 26)
10 builds against real PRs — all succeeded ✅. Category detection shown in AI summary comment.
ViewBaseTests,WebViewShellShellRadioButton,ViewBaseTestsViewBaseTestsViewBaseTestsSwipeView,ViewBaseTestsCollectionViewRefreshView,ViewBaseTests