Fix agentic-labeler truncating labels to 1 per call#35537
Closed
kubaflo wants to merge 104 commits into
Closed
Conversation
Adds Find-RegressionRisks.ps1 — a purely mechanical (no AI/LLM) script that detects when a PR removes lines previously added by labeled bug-fix PRs. Algorithm: 1. Collects lines removed by the PR under review 2. Finds recent PRs touching the same files via git log 3. Filters to bug-fix PRs (i/regression, t/bug, p/0, p/1 labels) 4. Cross-references removed lines against lines those fix PRs added 5. Whitespace-insensitive comparison classifies: REVERT / OVERLAP / CLEAN Integration: - Runs as STEP 0.6 in Review-PR.ps1 (between UI test detection and Gate) - Content assembled into AI summary comment via post-ai-summary-comment.ps1 - Expert reviewer dimension #6 reads risks.json for REVERT entries - 64 unit tests covering diff parsing, normalization, and detection logic Validated against: - PR #33908: correctly detects REVERT of IMauiRecyclerView check from #32278 - PR #35272: correctly classifies as OVERLAP (no line-level revert) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a REVERT is detected, the script now: 1. Extracts test files from the fix PR's diff 2. Classifies them via Detect-TestsInDiff.ps1 (type, filter, project, runner) 3. Stores regression_tests metadata in risks.json 4. Lists required tests in content.md Review-PR.ps1 adds STEP 1.5 (after Gate builds the code): - Reads regression_tests from risks.json - Runs unit/XAML tests via dotnet test with the detected filters - Skips UI/device tests (need CI infrastructure) with clear reporting - Appends pass/fail results to content.md - Writes test-results.json for downstream consumption Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 1.5 now runs every test type from reverted fix PRs: - UI tests via BuildAndRunHostApp.ps1 (builds app, deploys, runs Appium) - Device tests via Run-DeviceTests.ps1 (xharness on device/simulator) - Unit/XAML tests via dotnet test --filter Each test runs on the same platform as the Gate step. If a runner script is missing, the test is skipped gracefully rather than failing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract and run fix-PR tests for both REVERT and OVERLAP entries. A nearby edit can break a fix through side effects even without removing the exact lines. Running tests for all risk levels gives maximum confidence that prior fixes aren't regressed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When regression tests are detected (from REVERT or OVERLAP fix PRs), inject them as MANDATORY additional tests into the STEP 2 try-fix prompt. Each try-fix candidate must run these tests after its own test passes — a candidate that breaks a prior fix is marked Fail. The Report phase (Phase 3) also ranks candidates that failed regression tests lower than those that passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 0 → 1 (Branch Setup) STEP 0.5 → 2 (UI Test Categories) STEP 0.6 → 3 (Regression Cross-Reference) STEP 1 → 4 (Gate) STEP 1.5 → 5 (Regression Test Verification) STEP 2 → 6 (PR Review) STEP 3 → 7 (Post AI Summary) STEP 4 → 8 (Apply Labels) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 3 (Regression Cross-Reference) now includes both detection AND test execution in a single step. Removes the separate STEP 5. Pipeline is now 7 steps: 1. Branch Setup 2. UI Test Categories 3. Regression Cross-Reference (detect + run fix-PR tests) 4. Gate — Test Verification 5. PR Review (Pre-Flight, Try-Fix, Report) 6. Post AI Summary 7. Apply Labels Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates false positives from fix PRs merged to a different branch (e.g., inflight/current) than the PR under review targets (e.g., main). Uses git merge-base --is-ancestor to verify the fix PR's merge commit is reachable from the PR's base branch before flagging a REVERT or OVERLAP. Fix PRs merged to unreachable branches are skipped with a clear log message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After STEP 2 detects UI test categories (via detect-ui-test-categories.ps1), run BuildAndRunHostApp.ps1 once per detected category and surface results in the existing 'uitests' section of the AI summary comment. Behaviour: - Skipped when detection returns NONE (no UI-relevant changes). - Skipped when detection returns the run-all matrix (would take hours locally; the gate already covers PR-specific tests). - Otherwise: splits the comma-separated category list, invokes BuildAndRunHostApp.ps1 -Platform <platform> -Category <cat> sequentially, records pass/fail/skip per category with duration, and appends a Markdown table to uitests/content.md so it renders in the same collapsible section as the detection output. - Platform follows the regression-test pattern: $Platform if provided, else 'android'. - Failures are informational only — they do not block the gate or affect try-fix candidate scoring (per the section footer). Also writes: - CustomAgentLogsTmp/PRState/<PR>/PRAgent/uitests/test-results.json (machine-readable per-category results) - CustomAgentLogsTmp/PRState/<PR>/PRAgent/uitests/result.txt (one-line PASSED/FAILED/SKIPPED marker) Renumbers later steps: 3 (regression cross-ref) -> 4 4 (gate) -> 5 5 (PR review) -> 6 6 (post AI summary) -> 7 7 (apply labels) -> 8 Updates the script synopsis (which had stale Step 0..4 entries from before the rebase) and fixes the cross-references inside the multi-candidate-review prompt and the AI-tier-refresh comment block. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Tier-3 AI category refresh (STEP 6 → pre STEP 7) was overwriting uitests/content.md with just the refreshed category list, wiping out the 'UI Test Execution Results' table that STEP 3 appends. Now read the existing file, extract everything from the '### 🧪 UI Test Execution Results' marker onward, write the refreshed category line, then re-append the preserved execution block so the collapsible 'UI Tests — Category Detection' section in the AI summary comment shows both the detected categories and the actual run results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Get-DotNetTestResults parser that extracts per-test status, name, duration, error message and stack trace from the captured 'dotnet test --logger console;verbosity=detailed' stdout. STEP 3 now: * Parses each category's output in-memory before the next category run wipes test-output.log. * Persists per-category logs to PRState/<PR>/PRAgent/uitests/<cat>-output.log for post-mortem traceability. * Renders an enriched markdown table (Tests column shows passed/total + failure count) plus a collapsible 'Failed test details' block per failed category showing test name, first 1500 chars of error message, and the first stack frame. * Renders a collapsible 'Show passed test names' block when there are passing tests. Reviewers can now diagnose UI test failures from the AI summary comment without downloading the full build artifact. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ory fails before tests
When a test category fails because the build or deploy crashed before any
test could run (e.g. CS0246 missing namespace, RS0016 PublicAPI errors),
the AI summary table previously showed '0/1 ✓' — the green-checkmark
'all passed' branch — because no per-test failures were parsed. That's
visually misleading: the row is FAILED but the cell looks healthy.
Two fixes:
1. Tests column distinguishes 'category failed AND no per-test failures
parsed' from 'all tests passed':
- 'build/deploy failed' (no tests at all)
- '0/1 — build/deploy failed before per-test results' (some discovered)
2. New optional 'build_tail' field captures the last 30 lines of stdout
when a category fails with zero per-test failures. The Failed test
details collapsible section then renders it in a code block so
reviewers see the actual compiler error / build crash inline,
instead of having to download the full CopilotLogs artifact.
This was discovered while running the regression-check pipeline against
PRs #35110 (142 RS0016 PublicAPI errors), #35281 (CS0246 NSAttributedString
missing for catalyst), and #35358 — all reported as '0/1 ✓' before the fix.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two related fixes that together restore meaningful per-test failure detail in the AI summary comment. ## 1. Parser input normalization (root cause of '1 failed test' bug) PowerShell's '& dotnet test 2>&1' wraps multi-line stderr blocks (and some console-logger output paths) as a single ErrorRecord/string with embedded newlines. The 'Get-DotNetTestResults' regex is anchored '^...$' (start/end of STRING, not line), so without splitting the parser would either skip those blocks entirely or — worse — collapse the entire multi-line block as one bogus 'test result' with all 100+ test names concatenated into a single name field. Observed on PR #35358 where 119 actual test failures were rendered as '1 failed test' with name='LightTheme_VerifyVisualState DarkTheme_…' (space-separated wall of 119 names) and duration='2 m 16 s 2 m 16 s …' (matching wall of 119 durations). Fix: at parser entry, split any captured element that contains '\n' or '\r' into individual lines before passing to the regex. Verified locally against the actual ViewBaseTests-output.log artifact: parser now returns 119 entries instead of 0/1. ## 2. Grouped failure rendering (avoids 65k comment overflow) Even after the parser fix, dumping 119 full error messages + first stack frames into a single comment would blow past GitHub's 65,536- character body limit. Most large failure clusters share a single root cause (e.g. all 119 ViewBaseTests fail with the same 'OneTimeSetUp: Timed out waiting for Go To Test button' because the host app didn't deploy properly). The new rendering: - Groups failed tests by the first 200 chars of error message - Shows full detail (sample names + 1500-char error + first stack frame) for the top 5 unique error groups - Adds a 'X tests failed with the same error' header when a group has multiple members, with the first 3 sample names inline - Always emits a fully-collapsed '<details>All N failed test names' block listing every individual failure so reviewers can re-run exactly the right set Wraps Group-Object output in @() because Group-Object returns a single GroupInfo (not an array) when all rows share the same key, and 'foreach' would then iterate the group's MEMBERS instead of the groups — re-introducing the same iteration bug we're trying to fix. Verified locally with a 119-failure simulation (117 with shared error, 2 unique): produces 3 groups, top group shows count=117 with sample names. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ummary When the HostApp can't be installed/launched on the device (e.g. ADB 'Failure calling service package: Broken pipe', InstallFailedException, or device offline), every test in the fixture's OneTimeSetUp times out waiting for the test driver UI. NUnit then reports all N tests as Failed. The AI summary previously rendered '119 failed tests' for what was actually a single CI infra problem — alarming reviewers about nonexistent regressions. Now: when (a) all per-test failures share a 'OneTimeSetUp:' error AND (b) the build log contains a known infra signal (ADB0010, Broken pipe, device offline, etc.), tag the category as an infrastructure failure: - Tests column shows '🛠️ infra failure (N OneTimeSetUp timeouts)' - A banner above 'Failed test details' explains these are NOT real regressions - The collapsible header uses the wrench icon instead of ❌ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
….ps1) When BuildAndRunHostApp.ps1 fails because of a transient device problem (Android ADB 'Broken pipe' install failure, iOS simulator app-launch crash, XHarness exit 83, missing devices, etc.), STEP 3 was reporting the resulting OneTimeSetUp timeouts as 100+ test regressions — even though the test runner never started. The Gate has solved this for a while via Invoke-TestRunWithRetry: it retries up to 3 times with a 30s sleep between attempts and reboots the device on Android/iOS infra failures. STEP 3 was just calling BuildAndRunHostApp once and reporting whatever came back. Now STEP 3 wraps the per-category run in the same retry loop: - Detects the same env-error signals as the Gate (ADB0010, Broken pipe, XHarness 83, AOT loader crash, etc.) - Retries up to 3 times with 30s pause between attempts - adb reboot on Android, simctl shutdown/boot on iOS - If retries exhaust, the persistent env error is propagated into infra_failure so the AI summary still labels the category as '🛠️ infra failure' instead of pretending all the tests regressed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract the Gate's pre-boot + retry-on-env-error pipeline into a single
shared script under .github/scripts/shared/Invoke-UITestWithRetry.ps1
and have STEP 3 call it instead of duplicating the logic.
The shared script:
1. Pre-boots the device once via Start-Emulator.ps1 (avoids
BuildAndRunHostApp.ps1 racing with another booted device per
category).
2. Passes the booted UDID to BuildAndRunHostApp.ps1 -DeviceUdid.
3. Detects env-error patterns (ADB0010, Broken pipe, XHarness 83,
SIGABRT/load_aot_module, AppiumServerHasNotBeenStartedLocally, …)
in captured output — same pattern set as
verify-tests-fail.ps1's Get-TestResultFromOutput.
4. Retries up to 3 times with a 30s sleep, rebooting the device on
Android (adb -s <udid> reboot + wait-for-device) and iOS
(xcrun simctl shutdown/boot) between attempts.
5. Splits multi-line ErrorRecord captures into individual lines so
downstream parsers (Get-DotNetTestResults) don't collapse 100+
test results into one bogus entry.
This replaces the ~80-line duplicated retry block I added in the
previous commit. The Gate (verify-tests-fail.ps1) still has its own
copy in Invoke-TestRun + Invoke-TestRunWithRetry — a follow-up will
migrate it to call the shared script too so both consumers share one
implementation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Some failure modes don't produce OneTimeSetUp-prefixed errors but are
still purely infrastructure (e.g. ADB0010 install failure causes 'Build
FAILED' before tests run, then NUnit reports every test in the
assembly as failed with regular test-name 'Failed X [2m 16s]' lines).
The previous heuristic only matched on '^OneTimeSetUp:' error prefix,
missing this very common ADB-broken-pipe pattern, leading to '119
failed tests' alarm on a single bad APK install.
New heuristic OR's two signals:
(a) every failure starts with 'OneTimeSetUp:' (driver couldn't reach
runner UI), OR
(b) build log contains 'Build FAILED' AND there were zero passes
(build/deploy died before any test could run, NUnit then 'fails'
the entire assembly).
When either is true AND any infra signal is found in the log, the
result is classified 🛠️ infra failure, which the renderer already
handles with a warning banner.
Verified locally on saved 119-failure ViewBaseTests log from #35358.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ategory=, console;normal)
The maui-pr-uitests pipeline (definition 313) runs UI tests per-category
via Cake's RunTestWithLocalDotNet, which invokes:
dotnet test $csproj
--filter "TestCategory=$category"
--logger "trx;LogFileName=$name.trx"
--logger "console;verbosity=normal"
--results-directory $dir
/p:VStestUseMSBuildOutput=false
STEP 3 was not matching that command — it used "Category=$Category"
(NUnit-only prefix) and "console;verbosity=detailed" with no TRX logger.
Counts were scraped from raw stdout, which broke when 619 tests ran and
PowerShell stream merging glued lines together: a CollectionView run with
75 passed / 544 failed was rendered as "1/1 (1 ❌)" because the parser
collapsed every result into a single bogus entry.
Three surgical changes to mirror CI byte-for-byte and make the renderer
authoritative:
1. .github/scripts/BuildAndRunHostApp.ps1
- Filter prefix: Category= → TestCategory= (matches Cake/CI)
- Logger: console;verbosity=detailed → console;verbosity=normal
- Add: --logger trx;LogFileName=... + --results-directory + /p:VStestUseMSBuildOutput=false
- Compute deterministic TRX path ($Category-$Platform.trx in CustomAgentLogsTmp/UITests/TestResults)
- Print ">>> TRX_RESULT_FILE: <path>" marker line so callers can locate it
2. .github/scripts/shared/Invoke-UITestWithRetry.ps1
- Scan captured output for the TRX marker line
- Verify the file exists and return TrxResultFile in the result hashtable
3. .github/scripts/Review-PR.ps1
- New helper Get-TrxResults: parses VSTest TRX into the same shape as
Get-DotNetTestResults (status/name/duration/error/stack) plus aggregate
Total/Passed/Failed/Skipped counts from <Counters>
- STEP 3 prefers TRX when present (authoritative, byte-for-byte matches
what AzDO PublishTestResults@2 would ingest); falls back to console
scraping only when TRX missing (build/deploy crashed before tests ran)
- Tests cell now uses TRX counts: "75/619 (544 ❌)" instead of "1/1 (1 ❌)"
- Failed test details list now drawn from TRX entries (one per real test)
Verified locally with a synthetic TRX (619/75/544): renderer produces
"75/619 (544 ❌)" instead of the previous bug's "1/1 (1 ❌)".
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The fallback in BuildAndRunHostApp.ps1 (when --logger trx;LogFileName=X silently failed to produce X.trx) scanned the results directory for any .trx file. With multiple categories sharing the same TestResults dir (CollectionView.trx, Entry.trx, etc.), a stale TRX from a prior category could be misattributed to the current one. Fix: capture $trxRunStart BEFORE invoking dotnet test, then in the fallback only consider .trx files with LastWriteTime >= $trxRunStart. A stale TRX from a prior category can never match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tTestResults) These two functions sit on STEP 3's critical rendering path. A regression in either silently misrenders the AI summary's UI Test Execution Results table (we just lived through 'Get-DotNetTestResults returns 1 of 619'). 13 tests covering: Get-TrxResults - missing/empty file → null (no exceptions) - aggregate counters from ResultSummary/Counters - Skipped derived from Total-Executed when not separately tracked - per-test list with status/name/duration/error/stack - NotExecuted + Inconclusive normalized to 'Skipped' - empty Results array - malformed XML handled gracefully (logs warning, returns null) - TrxPath round-tripping Get-DotNetTestResults - single Passed entry - multiple consecutive results - error message + stack capture between results - empty input Follows the same pattern as Fix-MilestoneDrift.Tests.ps1: dot-source the helper bodies out of Review-PR.ps1 (the script has top-level imperative logic so we can't dot-source the whole file). Run: Invoke-Pester ./.github/scripts/Review-PR.Tests.ps1 -Output Detailed Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the in-process per-category 'dotnet test' loop on the Copilot
agent VM with a separate AzDO pipeline (eng/pipelines/ci-copilot-uitests.yml)
that imports common/ui-tests.yml and runs the detected categories as a
real matrix on platform-appropriate pools.
Why: STEP 3 used to run categories sequentially on a single agent,
contending for CPU with the Copilot agent itself. CI 313 already does
the right thing (shared build_ui_tests stage, per-category matrix jobs,
PublishBuildArtifacts@1 → drop-* artifacts). Mirroring that here is
faster, isolates pools, and surfaces real per-test results in TRX.
Architecture
------------
* common/ui-tests.yml: new `targetPlatform` parameter (default 'all'
preserves existing behavior). All 14 stages wrapped in
`${{ if or(eq(…,'all'), eq(…,<plat>)) }}` so the child pipeline
can request just one platform's stages.
* eng/pipelines/ci-copilot-uitests.yml NEW — accepts PRNumber, PRBranch,
targetPlatform, categoryGroupsToTest (object). Single template
include of common/ui-tests.yml.
* .github/scripts/shared/Queue-CopilotUITests.ps1 NEW — wraps
`az pipelines run`, validates input, retries 3×, returns build ID.
* .github/scripts/shared/Wait-CopilotUITests.ps1 NEW — polls
`az pipelines runs show` with stage-level summaries every 60 s,
4 h cap (matches matrix-job timeout).
* .github/scripts/shared/Aggregate-UITestArtifacts.ps1 NEW — downloads
drop-* artifacts via `az pipelines runs artifact list/download`,
walks every .trx file, merges by category. Reuses Get-TrxResults
from Review-PR.ps1.
* .github/scripts/Review-PR.ps1 STEP 3 rewritten:
- CI mode ($env:TF_BUILD AND $env:COPILOT_UITESTS_PIPELINE_ID set):
queue child pipeline, write 'in progress' placeholder, save
`uitests/child-build-id.txt`. STEPS 4-6 run in parallel.
- Local fallback (env vars unset): keeps the existing in-process
per-category loop so devs can still iterate without AzDO setup.
* .github/scripts/Review-PR.ps1 STEP 6.5 NEW — when child-build-id.txt
exists, polls Wait-CopilotUITests, calls Aggregate-UITestArtifacts,
rewrites the STEP 3 section in uitests/content.md with real per-test
counts. Cap-hit/cancel renders an explicit 'still running, see build'
note instead of a stale placeholder.
Tests
-----
.github/scripts/shared/Aggregate-UITestArtifacts.Tests.ps1 NEW — 8
Pester tests covering category-name extraction (4 platforms) + TRX
walk/merge (multi-category, multi-TRX-per-category, empty dir,
non-existent dir). Existing Review-PR.Tests.ps1 unaffected (13/13 still
green). Total Pester: 21 tests, 0 failures.
Operational
-----------
The new ci-copilot-uitests.yml needs to be REGISTERED as a pipeline in
DevDiv (one-time AzDO config) and its ID stored in the
COPILOT_UITESTS_PIPELINE_ID env var on pipeline 27723 (maui-copilot).
Until that's done, the CI mode will gracefully fall back to the
in-process path with a 'failed to queue child pipeline' warning.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s CI 313)" This reverts commit aaab66e.
Mirrors CI 313's pattern but inline within ci-copilot.yml (pipeline 27723) so we don't need admin permission to register a new pipeline. After Review-PR.ps1 STEP 2 detects UI test categories, it emits detectedCategories (and aiSummaryCommentId after STEP 7) as AzDO output variables on the existing 'RunReview' bash step. Two new stages: - RunDeepUITests: dependsOn ReviewPR; condition skips when no detectable categories. One job per platform on the appropriate pool (matches the existing CopilotReview job pool selection). Inside the job, a pwsh loop calls BuildAndRunHostApp.ps1 for each detected category, copying TRX files into per-category drop dirs. Publishes drop-deep-uitests artifact. - UpdateAISummaryComment: dependsOn ReviewPR + RunDeepUITests. Downloads the artifact, parses TRX via Aggregate-UITestArtifacts.ps1 (restored from reverted commit aaab66e), renders a new STEP 3 markdown section, and PATCHes the existing AI summary comment to replace the in-process counts with deep-test results from the proper platform pool. This is a two-pass UX: reviewers see the in-process STEP 3 results in the AI summary as soon as ReviewPR finishes, then the comment updates a few minutes later with the platform-pool deep-test results once those finish. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wraps STEP 5 (Gate) and STEP 6 (Try-Fix) in 'if (-not $skipGateAndTryFix)' so we can iterate quickly on the new RunDeepUITests + UpdateAISummaryComment stages without paying the 30-90 minute cost of building the repo and running multi-candidate try-fix exploration on every test build. To re-enable: flip $skipGateAndTryFix to $false (or revert this commit). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
At STAGE level, cross-stage output variables MUST be referenced as `dependencies.<stage>.outputs[<job>.<step>.<var>]`, NOT `stageDependencies.<stage>.<job>.outputs[<step>.<var>]` — the latter form is only valid at JOB level (and within a cross-stage dependsOn relationship). Mismatching this path causes the condition to evaluate against an unset variable and silently skip the stage even when the upstream job correctly emitted the ##vso[task.setvariable] command. Build 14056551 confirmed: STEP 2 emitted detectedCategories=ViewBaseTests and STEP 7 emitted aiSummaryCommentId=4409906345, but downstream stages skipped because the stage-condition lookup path was wrong. Fix: split into two paths — - Stage condition: dependencies.ReviewPR.outputs['CopilotReview.RunReview.<var>'] - Job-level variables block (cross-stage): $[ stageDependencies.ReviewPR.CopilotReview.outputs['RunReview.<var>'] ] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The first run on iOS revealed that 'dotnet build Microsoft.Maui.BuildTasks.slnf' fails with NETSDK1147 (workload not installed) because provision.yml only installs the SDK, not the iOS/Android workloads. Match the CopilotReview job's setup: use build.ps1 --target=dotnet to install the workloads, then build.ps1 --target=dotnet-buildtasks to compile the BuildTasks DLL. Both invocations use the same env vars (DOTNET_TOKEN, PRIVATE_BUILD) and retry counts as the CopilotReview job. Also includes pending Review-PR.ps1 cleanup of the CI branch-detection flow that was already in the worktree (clearer logging, only override UseCurrentBranch in CI when not already explicitly set). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-LogFile/-TestFilter BuildAndRunHostApp.ps1 accepts -Category (not -TestFilter 'TestCategory=X') and does not have -OutputDir or -LogFile parameters. The invalid params caused 'A parameter cannot be found that matches parameter name OutputDir'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The first iOS run of the new RunDeepUITests stage failed with: BuildAndRunHostApp.ps1: A parameter cannot be found that matches parameter name 'OutputDir'. even though the YAML never passes -OutputDir. Most likely the AzDO PowerShell task's temp-script generation mangles backtick line continuations and the tail of the call gets parsed weirdly. Switch to a hashtable splat (no continuation needed) and add diagnostic echo lines so we can see exactly what gets passed in. Also Tee output to the per-category log so artifacts include the full build/test transcript even when the call appears to fail early. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 8 (BIDs 14056860-14056870) on iOS revealed that 6/11 builds got through ReviewPR but failed at RunDeepUITests with: error : This version of .NET for iOS (26.0.11017) requires Xcode 26.0. The current version of Xcode is 26.1.1. The CopilotReview job already patches Directory.Build.Override.props to add <ValidateXcodeVersion>false</ValidateXcodeVersion> (see lines ~571-580 of this file) before invoking Review-PR.ps1. The new RunDeepUITests stage runs in a fresh job, so the override file isn't preserved — we have to apply the same patch right after merging the PR and before kicking off the per-category build loop. Mirror the existing patch's per-OS sed branches (BSD vs GNU) so it works on Linux, macOS, and Windows agents. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 9 confirmed the workload + Xcode fixes work — builds completed and tests RAN. But all 417 detected tests in CollectionView (PR 35356) failed at OneTimeSetUp with: OpenQA.Selenium.Appium.Service.Exceptions.InvalidServerInstanceException There is no installed nodes! Please install node via NPM or download and install Appium app UITest.Appium.AppiumServerContext.CreateAndStartServer needs both Node.js and the Appium global package. The CopilotReview job already provisions both via UseNode@1 + ProvisionAppium MSBuild target (lines 316-329). The new RunDeepUITests job runs in a fresh agent, so we have to repeat the setup. Mirror the same UseNode@1 v24.x + dotnet build Provisioning.csproj ProvisionAppium incantation. Use the existing APPIUM_HOME pipeline variable for cache location. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After the initial test run, if there are failures on Android, parse the TRX for failed test names and re-run only those with a second dotnet test --no-build pass. If all pass on retry, mark the run as successful (flaky tests). If some still fail, report those as real failures. This catches emulator-induced timeouts and transient ADB failures that cause ~5% false-positive rate on hosted ubuntu-22.04 agents. iOS is excluded (0% flake rate on Tahoe agents). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Increase maxAttempts from 2 to 3 for build/deploy retry - Add package manager health check before retrying (waits up to 30s) - Increase ADB restart delays for more reliable recovery - Verified: broken pipe at install time was not recovering because the package manager service wasn't ready after ADB restart Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous code re-queried 'adb devices' after boot, but the device could briefly go offline between boot detection and the second query, leaving DEVICE_ID empty. Now captures the device ID immediately during boot detection and carries it forward. Also defaults to emulator-5554 as fallback since that's always the first emulator's serial. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit e2ae246.
The adb devices grep after boot can return empty if the device briefly goes offline between boot detection and the query. Now captures the ID during adb wait-for-device success, and falls back to emulator-5554. Applied to both ReviewPR and Deep stage emulator boot steps. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ADB reconnect Warm-up failures caused 2 infra failures (#34627, #35154). The 3-minute timeout was too tight when the emulator needs ADB reconnect after idle. - Increase warm-up timeout from 3m to 6m - Add retryCountOnTaskFailure: 2 so warm-up retries on failure - After ADB restart, wait for sys.boot_completed (up to 90s) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The single copilot invocation was asking for 5 expert reviewer evaluations (1 PR + 4 try-fixes) + comparative report, exhausting the agent's turn budget before completing. Split into: - STEP 6a: Pre-Flight + Expert Review + Report (1 copilot call) - STEP 6b: Try-Fix x2 + Update Report (separate copilot call) Also reduced try-fix count from 4 to 2 — each gets a fresh turn budget and the expert review always completes first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Let the agent decide how many try-fix candidates to generate based on the problem complexity. Stop when a better fix is found or approaches are exhausted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New flow: STEP 6a (copilot call 1): Pre-Flight + Try-Fix (as many as needed) STEP 6b (copilot call 2): Expert Review of PR + Compare all candidates This way Try-Fix gets full turn budget to explore alternatives, and Expert Review sees all candidates for a fair comparison. The winner determines comment behavior: - pr/pr-plus-reviewer wins → post inline file comments - try-fix wins → post summary telling author to adopt it Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each try-fix candidate is now generated WITH the expert reviewer: 1. Expert reviewer analyzes problem → generates candidate fix 2. Test the candidate 3. If fails → feed failure back to reviewer → generate next candidate 4. Repeat until a passing candidate is found or approaches exhausted This replaces the previous independent-candidates approach where the reviewer wasn't involved in fix generation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Log what files each copilot call produced so we can see if STEP 6b actually ran and produced output (expert-pr-eval, report, winner.json, inline-findings). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The DEFER_COMMENT_TO_STAGE3 else block was wrapping both the summary comment posting AND the inline findings / labels sections. When deferral was active, inline-findings.json was generated by the agent but post-inline-review.ps1 was never called. Move the DEFER block closing brace to only wrap the summary comment posting. The winner.json parsing, inline findings posting (STEP 7b), and label application (STEP 8) now run unconditionally regardless of whether the summary comment is deferred to Stage 3. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The GitHub Review API rejects comments with null body fields. The expert-reviewer agent sometimes writes findings where the body field is null/missing. Add defensive fallback: try body, then message, then content field, defaulting to '(no description)' if all are null. This fixes the '422 For properties/body, nil is not a string' error seen in pipeline run 14125163 for PR #35421. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PowerShell's ConvertFrom-Json unwraps single-element JSON arrays into a scalar, causing the script to see 1 finding (the array itself treated as an object with empty properties) instead of N individual findings. Wrap parsed result in @() to force array. Also fixes the 'suspicious path: empty' error — the unwrapped array object has no .path property, resulting in an empty string that gets dropped by path validation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Dedup check: Before any setup, query AzDO for running/queued builds with the same PR+Platform. If a duplicate exists, cancel this run (oldest-wins policy). Prevents 2-5x compute waste. 2. Fail-fast merge conflicts: Test merge feasibility right after checkout, BEFORE SDK/emulator setup (~15-20 min saved per conflict). On failure, posts a PR comment with conflicting file list (uses hidden marker to update existing comment, not spam). 3. Fix partiallySucceeded noise: Add deepTestsRan variable and condition the drop-deep-uitests download on it. When deep tests are skipped (no detected categories), the download is skipped too, avoiding the 'artifact not found' warning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Findings JSON unwrapping: The expert-reviewer agent writes
{"findings": [...]} (object wrapper) instead of bare [...].
ConvertFrom-Json returns 1 object with no .path property,
causing all findings to be dropped as 'suspicious path: empty'.
Now detects and unwraps both bare arrays and object wrappers.
2. Merge conflict comment: Use GH_COMMENT_TOKEN (GitHub PAT) instead
of System.AccessToken (AzDO PAT) for posting PR comments.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the pre-check merge feasibility step and the dedup step from ci-copilot.yml. Keep only the partial-success fix (deepTestsRan condition on drop-deep-uitests download). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes: - Fix TRX overwrite bug: merge retry results into original TRX instead of replacing it, preserving all first-run passing tests. Delete retry TRX to prevent double-counting by aggregators. - Fix TRX contamination across categories: use specific TRX path from Invoke-UITestWithRetry instead of recursive 30-min time-based scan. - Fix category extraction for Stage 2 naming: add controls- prefixed variants to Get-CategoryFromArtifactName prefix list. - Stage 2 now uses Invoke-UITestWithRetry wrapper for env-error retry and device recovery. Remove retryCountOnTaskFailure (handled in-script). - Re-emit detectedCategories output variable after AI-tier refresh so Stage 2 picks up the refreshed list. - Stage 3 falls back to DEFERRED mode when aiSummaryCommentId is empty but deep test artifacts exist (reviewer crashed). - Map all VSTest outcomes (Aborted, Timeout, Error, etc.) to Failed in Get-TrxResults so failure disclosures match counter totals. Should-fix: - Preserve run-all sentinel as ALL (not NONE) so Stage 2 can distinguish run-everything from run-nothing. - Centralize env-error patterns in shared/Get-EnvErrorPatterns.ps1; Invoke-UITestWithRetry dot-sources it as single source of truth. - Strip parameter signatures from test names before building VSTest retry filter to avoid filter grammar operator injection. - Use case-insensitive deepTestsRan comparison in pipeline condition. - Track iOS 26.4 availability via pipeline variable for fallback warning. - Restrict Test-IsBugFixLabel to t/bug and i/regression only; use p/0|p/1 as secondary signal AND-ed with bug labels to reduce false positives. - Fix linked-issue regex to accept all GitHub closing keyword forms (Fix/Fixed/Close/Closed/Resolve/Resolved). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix ALL sentinel producing empty loop in Stage 2: use single-element list with empty string and skip -Category param in run-all mode - Review-PR.ps1 now dot-sources shared/Get-EnvErrorPatterns.ps1 for infraSignals instead of hard-coded inline list - Invoke-UITestWithRetry.ps1 fails loudly if shared patterns file is missing instead of silently using a drifted inline fallback - CopilotLogs download gets continueOnError so DEFERRED fallback works when ReviewPR crashed before publishing the artifact - Stage 2 uses splatted hashtable for retry wrapper params, conditionally omitting -Category in run-all mode Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The add-labels safe-output had max: 1, which limited the number of labels per call to 1 instead of limiting the number of calls. This caused platform/* labels to be silently dropped whenever the agent also selected an area-* label (which was every time). Raise max to 10 so area + platform labels survive in a single call. Also remove the obsolete report-incomplete safe-output key that is no longer recognized by the gh-aw compiler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35537Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35537" |
Contributor
🔍 Skill Validation Results✅ Static Checks PassedSkills checked: 18 | Agents checked: 4 Full validator output⏭️ LLM Evaluation: SkippedNo changed skills with eval tests found. |
1a05bff to
e9a1ee3
Compare
Contributor
Author
|
Accidental push to wrong branch — changes belong to PR #35376. Reverted. |
Contributor
Author
|
Recreating with clean branch |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Description
The
add-labelssafe-output in the agentic-labeler workflow hadmax: 1, which the gh-aw runtime interprets as "max 1 label per call" — not "max 1 call". This caused everyplatform/*label to be silently dropped whenever the agent also selected anarea-*label (which was every time).Evidence from CI logs:
Changes
maxfrom 1 to 10 so area + platform labels survive in a singleadd_labelscallreport-incompletesafe-output key no longer recognized by the gh-aw compilerVerification
Audited 12 recent PRs with platform-specific files — 11 of 12 were missing
platform/*labels due to this truncation.