Skip to content

Fix agentic-labeler truncating labels to 1 per call#35537

Closed
kubaflo wants to merge 104 commits into
mainfrom
fix/agentic-labeler-max-labels
Closed

Fix agentic-labeler truncating labels to 1 per call#35537
kubaflo wants to merge 104 commits into
mainfrom
fix/agentic-labeler-max-labels

Conversation

@kubaflo
Copy link
Copy Markdown
Contributor

@kubaflo kubaflo commented May 20, 2026

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Description

The add-labels safe-output in the agentic-labeler workflow had max: 1, which the gh-aw runtime interprets as "max 1 label per call" — not "max 1 call". This caused every platform/* label to be silently dropped whenever the agent also selected an area-* label (which was every time).

Evidence from CI logs:

Requested labels: ["area-controls-collectionview","platform/android"]
Too many labels (2), limiting to 1
Adding 1 labels to pull request #35536: ["area-controls-collectionview"]

Changes

  • Raise max from 1 to 10 so area + platform labels survive in a single add_labels call
  • Remove obsolete report-incomplete safe-output key no longer recognized by the gh-aw compiler
  • Recompile lock file

Verification

Audited 12 recent PRs with platform-specific files — 11 of 12 were missing platform/* labels due to this truncation.

Copilot AI added 30 commits May 19, 2026 13:45
Adds Find-RegressionRisks.ps1 — a purely mechanical (no AI/LLM) script that
detects when a PR removes lines previously added by labeled bug-fix PRs.

Algorithm:
1. Collects lines removed by the PR under review
2. Finds recent PRs touching the same files via git log
3. Filters to bug-fix PRs (i/regression, t/bug, p/0, p/1 labels)
4. Cross-references removed lines against lines those fix PRs added
5. Whitespace-insensitive comparison classifies: REVERT / OVERLAP / CLEAN

Integration:
- Runs as STEP 0.6 in Review-PR.ps1 (between UI test detection and Gate)
- Content assembled into AI summary comment via post-ai-summary-comment.ps1
- Expert reviewer dimension #6 reads risks.json for REVERT entries
- 64 unit tests covering diff parsing, normalization, and detection logic

Validated against:
- PR #33908: correctly detects REVERT of IMauiRecyclerView check from #32278
- PR #35272: correctly classifies as OVERLAP (no line-level revert)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a REVERT is detected, the script now:
1. Extracts test files from the fix PR's diff
2. Classifies them via Detect-TestsInDiff.ps1 (type, filter, project, runner)
3. Stores regression_tests metadata in risks.json
4. Lists required tests in content.md

Review-PR.ps1 adds STEP 1.5 (after Gate builds the code):
- Reads regression_tests from risks.json
- Runs unit/XAML tests via dotnet test with the detected filters
- Skips UI/device tests (need CI infrastructure) with clear reporting
- Appends pass/fail results to content.md
- Writes test-results.json for downstream consumption

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 1.5 now runs every test type from reverted fix PRs:
- UI tests via BuildAndRunHostApp.ps1 (builds app, deploys, runs Appium)
- Device tests via Run-DeviceTests.ps1 (xharness on device/simulator)
- Unit/XAML tests via dotnet test --filter

Each test runs on the same platform as the Gate step. If a runner
script is missing, the test is skipped gracefully rather than failing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract and run fix-PR tests for both REVERT and OVERLAP entries.
A nearby edit can break a fix through side effects even without
removing the exact lines. Running tests for all risk levels gives
maximum confidence that prior fixes aren't regressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When regression tests are detected (from REVERT or OVERLAP fix PRs),
inject them as MANDATORY additional tests into the STEP 2 try-fix prompt.
Each try-fix candidate must run these tests after its own test passes —
a candidate that breaks a prior fix is marked Fail.

The Report phase (Phase 3) also ranks candidates that failed regression
tests lower than those that passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 0 → 1 (Branch Setup)
STEP 0.5 → 2 (UI Test Categories)
STEP 0.6 → 3 (Regression Cross-Reference)
STEP 1 → 4 (Gate)
STEP 1.5 → 5 (Regression Test Verification)
STEP 2 → 6 (PR Review)
STEP 3 → 7 (Post AI Summary)
STEP 4 → 8 (Apply Labels)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
STEP 3 (Regression Cross-Reference) now includes both detection AND
test execution in a single step. Removes the separate STEP 5.

Pipeline is now 7 steps:
1. Branch Setup
2. UI Test Categories
3. Regression Cross-Reference (detect + run fix-PR tests)
4. Gate — Test Verification
5. PR Review (Pre-Flight, Try-Fix, Report)
6. Post AI Summary
7. Apply Labels

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates false positives from fix PRs merged to a different branch
(e.g., inflight/current) than the PR under review targets (e.g., main).

Uses git merge-base --is-ancestor to verify the fix PR's merge commit
is reachable from the PR's base branch before flagging a REVERT or
OVERLAP. Fix PRs merged to unreachable branches are skipped with a
clear log message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After STEP 2 detects UI test categories (via detect-ui-test-categories.ps1),
run BuildAndRunHostApp.ps1 once per detected category and surface results
in the existing 'uitests' section of the AI summary comment.

Behaviour:
- Skipped when detection returns NONE (no UI-relevant changes).
- Skipped when detection returns the run-all matrix (would take hours
  locally; the gate already covers PR-specific tests).
- Otherwise: splits the comma-separated category list, invokes
  BuildAndRunHostApp.ps1 -Platform <platform> -Category <cat> sequentially,
  records pass/fail/skip per category with duration, and appends a
  Markdown table to uitests/content.md so it renders in the same
  collapsible section as the detection output.
- Platform follows the regression-test pattern: $Platform if provided,
  else 'android'.
- Failures are informational only — they do not block the gate or
  affect try-fix candidate scoring (per the section footer).

Also writes:
- CustomAgentLogsTmp/PRState/<PR>/PRAgent/uitests/test-results.json
  (machine-readable per-category results)
- CustomAgentLogsTmp/PRState/<PR>/PRAgent/uitests/result.txt
  (one-line PASSED/FAILED/SKIPPED marker)

Renumbers later steps:
  3 (regression cross-ref) -> 4
  4 (gate)                 -> 5
  5 (PR review)            -> 6
  6 (post AI summary)      -> 7
  7 (apply labels)         -> 8

Updates the script synopsis (which had stale Step 0..4 entries from before
the rebase) and fixes the cross-references inside the multi-candidate-review
prompt and the AI-tier-refresh comment block.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Tier-3 AI category refresh (STEP 6 → pre STEP 7) was overwriting
uitests/content.md with just the refreshed category list, wiping out
the 'UI Test Execution Results' table that STEP 3 appends.

Now read the existing file, extract everything from the
'### 🧪 UI Test Execution Results' marker onward, write the refreshed
category line, then re-append the preserved execution block so the
collapsible 'UI Tests — Category Detection' section in the AI summary
comment shows both the detected categories and the actual run results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Get-DotNetTestResults parser that extracts per-test status, name,
duration, error message and stack trace from the captured 'dotnet test
--logger console;verbosity=detailed' stdout. STEP 3 now:

* Parses each category's output in-memory before the next category run
  wipes test-output.log.
* Persists per-category logs to PRState/<PR>/PRAgent/uitests/<cat>-output.log
  for post-mortem traceability.
* Renders an enriched markdown table (Tests column shows passed/total +
  failure count) plus a collapsible 'Failed test details' block per
  failed category showing test name, first 1500 chars of error message,
  and the first stack frame.
* Renders a collapsible 'Show passed test names' block when there are
  passing tests.

Reviewers can now diagnose UI test failures from the AI summary comment
without downloading the full build artifact.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ory fails before tests

When a test category fails because the build or deploy crashed before any
test could run (e.g. CS0246 missing namespace, RS0016 PublicAPI errors),
the AI summary table previously showed '0/1 ✓' — the green-checkmark
'all passed' branch — because no per-test failures were parsed. That's
visually misleading: the row is FAILED but the cell looks healthy.

Two fixes:

1. Tests column distinguishes 'category failed AND no per-test failures
   parsed' from 'all tests passed':
     - 'build/deploy failed'                              (no tests at all)
     - '0/1 — build/deploy failed before per-test results'  (some discovered)

2. New optional 'build_tail' field captures the last 30 lines of stdout
   when a category fails with zero per-test failures. The Failed test
   details collapsible section then renders it in a code block so
   reviewers see the actual compiler error / build crash inline,
   instead of having to download the full CopilotLogs artifact.

This was discovered while running the regression-check pipeline against
PRs #35110 (142 RS0016 PublicAPI errors), #35281 (CS0246 NSAttributedString
missing for catalyst), and #35358 — all reported as '0/1 ✓' before the fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two related fixes that together restore meaningful per-test failure
detail in the AI summary comment.

## 1. Parser input normalization (root cause of '1 failed test' bug)

PowerShell's '& dotnet test 2>&1' wraps multi-line stderr blocks (and
some console-logger output paths) as a single ErrorRecord/string with
embedded newlines. The 'Get-DotNetTestResults' regex is anchored
'^...$' (start/end of STRING, not line), so without splitting the
parser would either skip those blocks entirely or — worse — collapse
the entire multi-line block as one bogus 'test result' with all 100+
test names concatenated into a single name field.

Observed on PR #35358 where 119 actual test failures were rendered as
'1 failed test' with name='LightTheme_VerifyVisualState DarkTheme_…'
(space-separated wall of 119 names) and duration='2 m 16 s 2 m 16 s …'
(matching wall of 119 durations).

Fix: at parser entry, split any captured element that contains '\n' or
'\r' into individual lines before passing to the regex. Verified
locally against the actual ViewBaseTests-output.log artifact: parser
now returns 119 entries instead of 0/1.

## 2. Grouped failure rendering (avoids 65k comment overflow)

Even after the parser fix, dumping 119 full error messages + first
stack frames into a single comment would blow past GitHub's 65,536-
character body limit. Most large failure clusters share a single root
cause (e.g. all 119 ViewBaseTests fail with the same 'OneTimeSetUp:
Timed out waiting for Go To Test button' because the host app didn't
deploy properly).

The new rendering:
- Groups failed tests by the first 200 chars of error message
- Shows full detail (sample names + 1500-char error + first stack
  frame) for the top 5 unique error groups
- Adds a 'X tests failed with the same error' header when a group
  has multiple members, with the first 3 sample names inline
- Always emits a fully-collapsed '<details>All N failed test names'
  block listing every individual failure so reviewers can re-run
  exactly the right set

Wraps Group-Object output in @() because Group-Object returns a
single GroupInfo (not an array) when all rows share the same key,
and 'foreach' would then iterate the group's MEMBERS instead of the
groups — re-introducing the same iteration bug we're trying to fix.

Verified locally with a 119-failure simulation (117 with shared error,
2 unique): produces 3 groups, top group shows count=117 with sample
names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ummary

When the HostApp can't be installed/launched on the device (e.g. ADB
'Failure calling service package: Broken pipe', InstallFailedException,
or device offline), every test in the fixture's OneTimeSetUp times out
waiting for the test driver UI. NUnit then reports all N tests as
Failed. The AI summary previously rendered '119 failed tests' for what
was actually a single CI infra problem — alarming reviewers about
nonexistent regressions.

Now: when (a) all per-test failures share a 'OneTimeSetUp:' error AND
(b) the build log contains a known infra signal (ADB0010, Broken pipe,
device offline, etc.), tag the category as an infrastructure failure:
- Tests column shows '🛠️ infra failure (N OneTimeSetUp timeouts)'
- A banner above 'Failed test details' explains these are NOT real
  regressions
- The collapsible header uses the wrench icon instead of ❌

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
….ps1)

When BuildAndRunHostApp.ps1 fails because of a transient device problem
(Android ADB 'Broken pipe' install failure, iOS simulator app-launch
crash, XHarness exit 83, missing devices, etc.), STEP 3 was reporting
the resulting OneTimeSetUp timeouts as 100+ test regressions — even
though the test runner never started.

The Gate has solved this for a while via Invoke-TestRunWithRetry: it
retries up to 3 times with a 30s sleep between attempts and reboots
the device on Android/iOS infra failures. STEP 3 was just calling
BuildAndRunHostApp once and reporting whatever came back.

Now STEP 3 wraps the per-category run in the same retry loop:
- Detects the same env-error signals as the Gate
  (ADB0010, Broken pipe, XHarness 83, AOT loader crash, etc.)
- Retries up to 3 times with 30s pause between attempts
- adb reboot on Android, simctl shutdown/boot on iOS
- If retries exhaust, the persistent env error is propagated into
  infra_failure so the AI summary still labels the category as
  '🛠️ infra failure' instead of pretending all the tests regressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extract the Gate's pre-boot + retry-on-env-error pipeline into a single
shared script under .github/scripts/shared/Invoke-UITestWithRetry.ps1
and have STEP 3 call it instead of duplicating the logic.

The shared script:
  1. Pre-boots the device once via Start-Emulator.ps1 (avoids
     BuildAndRunHostApp.ps1 racing with another booted device per
     category).
  2. Passes the booted UDID to BuildAndRunHostApp.ps1 -DeviceUdid.
  3. Detects env-error patterns (ADB0010, Broken pipe, XHarness 83,
     SIGABRT/load_aot_module, AppiumServerHasNotBeenStartedLocally, …)
     in captured output — same pattern set as
     verify-tests-fail.ps1's Get-TestResultFromOutput.
  4. Retries up to 3 times with a 30s sleep, rebooting the device on
     Android (adb -s <udid> reboot + wait-for-device) and iOS
     (xcrun simctl shutdown/boot) between attempts.
  5. Splits multi-line ErrorRecord captures into individual lines so
     downstream parsers (Get-DotNetTestResults) don't collapse 100+
     test results into one bogus entry.

This replaces the ~80-line duplicated retry block I added in the
previous commit. The Gate (verify-tests-fail.ps1) still has its own
copy in Invoke-TestRun + Invoke-TestRunWithRetry — a follow-up will
migrate it to call the shared script too so both consumers share one
implementation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Some failure modes don't produce OneTimeSetUp-prefixed errors but are
still purely infrastructure (e.g. ADB0010 install failure causes 'Build
FAILED' before tests run, then NUnit reports every test in the
assembly as failed with regular test-name 'Failed X [2m 16s]' lines).

The previous heuristic only matched on '^OneTimeSetUp:' error prefix,
missing this very common ADB-broken-pipe pattern, leading to '119
failed tests' alarm on a single bad APK install.

New heuristic OR's two signals:
  (a) every failure starts with 'OneTimeSetUp:' (driver couldn't reach
      runner UI), OR
  (b) build log contains 'Build FAILED' AND there were zero passes
      (build/deploy died before any test could run, NUnit then 'fails'
      the entire assembly).

When either is true AND any infra signal is found in the log, the
result is classified 🛠️ infra failure, which the renderer already
handles with a warning banner.

Verified locally on saved 119-failure ViewBaseTests log from #35358.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ategory=, console;normal)

The maui-pr-uitests pipeline (definition 313) runs UI tests per-category
via Cake's RunTestWithLocalDotNet, which invokes:

  dotnet test $csproj
    --filter "TestCategory=$category"
    --logger "trx;LogFileName=$name.trx"
    --logger "console;verbosity=normal"
    --results-directory $dir
    /p:VStestUseMSBuildOutput=false

STEP 3 was not matching that command — it used "Category=$Category"
(NUnit-only prefix) and "console;verbosity=detailed" with no TRX logger.
Counts were scraped from raw stdout, which broke when 619 tests ran and
PowerShell stream merging glued lines together: a CollectionView run with
75 passed / 544 failed was rendered as "1/1 (1 ❌)" because the parser
collapsed every result into a single bogus entry.

Three surgical changes to mirror CI byte-for-byte and make the renderer
authoritative:

1. .github/scripts/BuildAndRunHostApp.ps1
   - Filter prefix: Category= → TestCategory= (matches Cake/CI)
   - Logger: console;verbosity=detailed → console;verbosity=normal
   - Add: --logger trx;LogFileName=... + --results-directory + /p:VStestUseMSBuildOutput=false
   - Compute deterministic TRX path ($Category-$Platform.trx in CustomAgentLogsTmp/UITests/TestResults)
   - Print ">>> TRX_RESULT_FILE: <path>" marker line so callers can locate it

2. .github/scripts/shared/Invoke-UITestWithRetry.ps1
   - Scan captured output for the TRX marker line
   - Verify the file exists and return TrxResultFile in the result hashtable

3. .github/scripts/Review-PR.ps1
   - New helper Get-TrxResults: parses VSTest TRX into the same shape as
     Get-DotNetTestResults (status/name/duration/error/stack) plus aggregate
     Total/Passed/Failed/Skipped counts from <Counters>
   - STEP 3 prefers TRX when present (authoritative, byte-for-byte matches
     what AzDO PublishTestResults@2 would ingest); falls back to console
     scraping only when TRX missing (build/deploy crashed before tests ran)
   - Tests cell now uses TRX counts: "75/619 (544 ❌)" instead of "1/1 (1 ❌)"
   - Failed test details list now drawn from TRX entries (one per real test)

Verified locally with a synthetic TRX (619/75/544): renderer produces
"75/619 (544 ❌)" instead of the previous bug's "1/1 (1 ❌)".

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The fallback in BuildAndRunHostApp.ps1 (when --logger trx;LogFileName=X
silently failed to produce X.trx) scanned the results directory for any
.trx file. With multiple categories sharing the same TestResults dir
(CollectionView.trx, Entry.trx, etc.), a stale TRX from a prior category
could be misattributed to the current one.

Fix: capture $trxRunStart BEFORE invoking dotnet test, then in the
fallback only consider .trx files with LastWriteTime >= $trxRunStart.
A stale TRX from a prior category can never match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tTestResults)

These two functions sit on STEP 3's critical rendering path. A regression
in either silently misrenders the AI summary's UI Test Execution Results
table (we just lived through 'Get-DotNetTestResults returns 1 of 619').

13 tests covering:

Get-TrxResults
  - missing/empty file → null (no exceptions)
  - aggregate counters from ResultSummary/Counters
  - Skipped derived from Total-Executed when not separately tracked
  - per-test list with status/name/duration/error/stack
  - NotExecuted + Inconclusive normalized to 'Skipped'
  - empty Results array
  - malformed XML handled gracefully (logs warning, returns null)
  - TrxPath round-tripping

Get-DotNetTestResults
  - single Passed entry
  - multiple consecutive results
  - error message + stack capture between results
  - empty input

Follows the same pattern as Fix-MilestoneDrift.Tests.ps1: dot-source the
helper bodies out of Review-PR.ps1 (the script has top-level imperative
logic so we can't dot-source the whole file).

Run: Invoke-Pester ./.github/scripts/Review-PR.Tests.ps1 -Output Detailed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the in-process per-category 'dotnet test' loop on the Copilot
agent VM with a separate AzDO pipeline (eng/pipelines/ci-copilot-uitests.yml)
that imports common/ui-tests.yml and runs the detected categories as a
real matrix on platform-appropriate pools.

Why: STEP 3 used to run categories sequentially on a single agent,
contending for CPU with the Copilot agent itself. CI 313 already does
the right thing (shared build_ui_tests stage, per-category matrix jobs,
PublishBuildArtifacts@1 → drop-* artifacts). Mirroring that here is
faster, isolates pools, and surfaces real per-test results in TRX.

Architecture
------------
* common/ui-tests.yml: new `targetPlatform` parameter (default 'all'
  preserves existing behavior). All 14 stages wrapped in
  `${{ if or(eq(…,'all'), eq(…,<plat>)) }}` so the child pipeline
  can request just one platform's stages.
* eng/pipelines/ci-copilot-uitests.yml NEW — accepts PRNumber, PRBranch,
  targetPlatform, categoryGroupsToTest (object). Single template
  include of common/ui-tests.yml.
* .github/scripts/shared/Queue-CopilotUITests.ps1 NEW — wraps
  `az pipelines run`, validates input, retries 3×, returns build ID.
* .github/scripts/shared/Wait-CopilotUITests.ps1 NEW — polls
  `az pipelines runs show` with stage-level summaries every 60 s,
  4 h cap (matches matrix-job timeout).
* .github/scripts/shared/Aggregate-UITestArtifacts.ps1 NEW — downloads
  drop-* artifacts via `az pipelines runs artifact list/download`,
  walks every .trx file, merges by category. Reuses Get-TrxResults
  from Review-PR.ps1.
* .github/scripts/Review-PR.ps1 STEP 3 rewritten:
    - CI mode ($env:TF_BUILD AND $env:COPILOT_UITESTS_PIPELINE_ID set):
      queue child pipeline, write 'in progress' placeholder, save
      `uitests/child-build-id.txt`. STEPS 4-6 run in parallel.
    - Local fallback (env vars unset): keeps the existing in-process
      per-category loop so devs can still iterate without AzDO setup.
* .github/scripts/Review-PR.ps1 STEP 6.5 NEW — when child-build-id.txt
  exists, polls Wait-CopilotUITests, calls Aggregate-UITestArtifacts,
  rewrites the STEP 3 section in uitests/content.md with real per-test
  counts. Cap-hit/cancel renders an explicit 'still running, see build'
  note instead of a stale placeholder.

Tests
-----
.github/scripts/shared/Aggregate-UITestArtifacts.Tests.ps1 NEW — 8
Pester tests covering category-name extraction (4 platforms) + TRX
walk/merge (multi-category, multi-TRX-per-category, empty dir,
non-existent dir). Existing Review-PR.Tests.ps1 unaffected (13/13 still
green). Total Pester: 21 tests, 0 failures.

Operational
-----------
The new ci-copilot-uitests.yml needs to be REGISTERED as a pipeline in
DevDiv (one-time AzDO config) and its ID stored in the
COPILOT_UITESTS_PIPELINE_ID env var on pipeline 27723 (maui-copilot).
Until that's done, the CI mode will gracefully fall back to the
in-process path with a 'failed to queue child pipeline' warning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors CI 313's pattern but inline within ci-copilot.yml (pipeline
27723) so we don't need admin permission to register a new pipeline.

After Review-PR.ps1 STEP 2 detects UI test categories, it emits
detectedCategories (and aiSummaryCommentId after STEP 7) as AzDO
output variables on the existing 'RunReview' bash step.

Two new stages:

- RunDeepUITests: dependsOn ReviewPR; condition skips when no
  detectable categories. One job per platform on the appropriate
  pool (matches the existing CopilotReview job pool selection).
  Inside the job, a pwsh loop calls BuildAndRunHostApp.ps1 for each
  detected category, copying TRX files into per-category drop dirs.
  Publishes drop-deep-uitests artifact.

- UpdateAISummaryComment: dependsOn ReviewPR + RunDeepUITests.
  Downloads the artifact, parses TRX via Aggregate-UITestArtifacts.ps1
  (restored from reverted commit aaab66e), renders a new STEP 3
  markdown section, and PATCHes the existing AI summary comment to
  replace the in-process counts with deep-test results from the
  proper platform pool.

This is a two-pass UX: reviewers see the in-process STEP 3 results
in the AI summary as soon as ReviewPR finishes, then the comment
updates a few minutes later with the platform-pool deep-test
results once those finish.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wraps STEP 5 (Gate) and STEP 6 (Try-Fix) in 'if (-not $skipGateAndTryFix)'
so we can iterate quickly on the new RunDeepUITests + UpdateAISummaryComment
stages without paying the 30-90 minute cost of building the repo and
running multi-candidate try-fix exploration on every test build.

To re-enable: flip $skipGateAndTryFix to $false (or revert this commit).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
At STAGE level, cross-stage output variables MUST be referenced as
`dependencies.<stage>.outputs[<job>.<step>.<var>]`, NOT
`stageDependencies.<stage>.<job>.outputs[<step>.<var>]` — the
latter form is only valid at JOB level (and within a cross-stage
dependsOn relationship).

Mismatching this path causes the condition to evaluate against an
unset variable and silently skip the stage even when the upstream
job correctly emitted the ##vso[task.setvariable] command.

Build 14056551 confirmed: STEP 2 emitted detectedCategories=ViewBaseTests
and STEP 7 emitted aiSummaryCommentId=4409906345, but downstream
stages skipped because the stage-condition lookup path was wrong.

Fix: split into two paths —
- Stage condition: dependencies.ReviewPR.outputs['CopilotReview.RunReview.<var>']
- Job-level variables block (cross-stage): $[ stageDependencies.ReviewPR.CopilotReview.outputs['RunReview.<var>'] ]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The first run on iOS revealed that 'dotnet build Microsoft.Maui.BuildTasks.slnf'
fails with NETSDK1147 (workload not installed) because provision.yml only
installs the SDK, not the iOS/Android workloads.

Match the CopilotReview job's setup: use build.ps1 --target=dotnet to
install the workloads, then build.ps1 --target=dotnet-buildtasks to
compile the BuildTasks DLL. Both invocations use the same env vars
(DOTNET_TOKEN, PRIVATE_BUILD) and retry counts as the CopilotReview job.

Also includes pending Review-PR.ps1 cleanup of the CI branch-detection
flow that was already in the worktree (clearer logging, only override
UseCurrentBranch in CI when not already explicitly set).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-LogFile/-TestFilter

BuildAndRunHostApp.ps1 accepts -Category (not -TestFilter 'TestCategory=X')
and does not have -OutputDir or -LogFile parameters. The invalid params
caused 'A parameter cannot be found that matches parameter name OutputDir'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The first iOS run of the new RunDeepUITests stage failed with:

  BuildAndRunHostApp.ps1: A parameter cannot be found that matches
  parameter name 'OutputDir'.

even though the YAML never passes -OutputDir. Most likely the AzDO
PowerShell task's temp-script generation mangles backtick line
continuations and the tail of the call gets parsed weirdly.

Switch to a hashtable splat (no continuation needed) and add diagnostic
echo lines so we can see exactly what gets passed in. Also Tee output
to the per-category log so artifacts include the full build/test
transcript even when the call appears to fail early.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 8 (BIDs 14056860-14056870) on iOS revealed that 6/11 builds got
through ReviewPR but failed at RunDeepUITests with:

  error : This version of .NET for iOS (26.0.11017) requires Xcode 26.0.
  The current version of Xcode is 26.1.1.

The CopilotReview job already patches Directory.Build.Override.props to
add <ValidateXcodeVersion>false</ValidateXcodeVersion> (see lines
~571-580 of this file) before invoking Review-PR.ps1. The new
RunDeepUITests stage runs in a fresh job, so the override file isn't
preserved — we have to apply the same patch right after merging the PR
and before kicking off the per-category build loop.

Mirror the existing patch's per-OS sed branches (BSD vs GNU) so it
works on Linux, macOS, and Windows agents.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 9 confirmed the workload + Xcode fixes work — builds completed
and tests RAN. But all 417 detected tests in CollectionView (PR 35356)
failed at OneTimeSetUp with:

  OpenQA.Selenium.Appium.Service.Exceptions.InvalidServerInstanceException
  There is no installed nodes! Please install node via NPM or download
  and install Appium app

UITest.Appium.AppiumServerContext.CreateAndStartServer needs both
Node.js and the Appium global package. The CopilotReview job already
provisions both via UseNode@1 + ProvisionAppium MSBuild target (lines
316-329). The new RunDeepUITests job runs in a fresh agent, so we have
to repeat the setup.

Mirror the same UseNode@1 v24.x + dotnet build Provisioning.csproj
ProvisionAppium incantation. Use the existing APPIUM_HOME pipeline
variable for cache location.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI added 21 commits May 19, 2026 13:47
After the initial test run, if there are failures on Android, parse
the TRX for failed test names and re-run only those with a second
dotnet test --no-build pass. If all pass on retry, mark the run as
successful (flaky tests). If some still fail, report those as real
failures.

This catches emulator-induced timeouts and transient ADB failures
that cause ~5% false-positive rate on hosted ubuntu-22.04 agents.
iOS is excluded (0% flake rate on Tahoe agents).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Increase maxAttempts from 2 to 3 for build/deploy retry
- Add package manager health check before retrying (waits up to 30s)
- Increase ADB restart delays for more reliable recovery
- Verified: broken pipe at install time was not recovering because
  the package manager service wasn't ready after ADB restart

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous code re-queried 'adb devices' after boot, but the device
could briefly go offline between boot detection and the second query,
leaving DEVICE_ID empty. Now captures the device ID immediately during
boot detection and carries it forward.

Also defaults to emulator-5554 as fallback since that's always the
first emulator's serial.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The adb devices grep after boot can return empty if the device briefly
goes offline between boot detection and the query. Now captures the ID
during adb wait-for-device success, and falls back to emulator-5554.

Applied to both ReviewPR and Deep stage emulator boot steps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ADB reconnect

Warm-up failures caused 2 infra failures (#34627, #35154). The 3-minute
timeout was too tight when the emulator needs ADB reconnect after idle.

- Increase warm-up timeout from 3m to 6m
- Add retryCountOnTaskFailure: 2 so warm-up retries on failure
- After ADB restart, wait for sys.boot_completed (up to 90s)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The single copilot invocation was asking for 5 expert reviewer
evaluations (1 PR + 4 try-fixes) + comparative report, exhausting
the agent's turn budget before completing.

Split into:
- STEP 6a: Pre-Flight + Expert Review + Report (1 copilot call)
- STEP 6b: Try-Fix x2 + Update Report (separate copilot call)

Also reduced try-fix count from 4 to 2 — each gets a fresh turn
budget and the expert review always completes first.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Let the agent decide how many try-fix candidates to generate based on
the problem complexity. Stop when a better fix is found or approaches
are exhausted.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New flow:
  STEP 6a (copilot call 1): Pre-Flight + Try-Fix (as many as needed)
  STEP 6b (copilot call 2): Expert Review of PR + Compare all candidates

This way Try-Fix gets full turn budget to explore alternatives, and
Expert Review sees all candidates for a fair comparison. The winner
determines comment behavior:
  - pr/pr-plus-reviewer wins → post inline file comments
  - try-fix wins → post summary telling author to adopt it

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each try-fix candidate is now generated WITH the expert reviewer:
1. Expert reviewer analyzes problem → generates candidate fix
2. Test the candidate
3. If fails → feed failure back to reviewer → generate next candidate
4. Repeat until a passing candidate is found or approaches exhausted

This replaces the previous independent-candidates approach where the
reviewer wasn't involved in fix generation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Log what files each copilot call produced so we can see if STEP 6b
actually ran and produced output (expert-pr-eval, report, winner.json,
inline-findings).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The DEFER_COMMENT_TO_STAGE3 else block was wrapping both the summary
comment posting AND the inline findings / labels sections. When
deferral was active, inline-findings.json was generated by the agent
but post-inline-review.ps1 was never called.

Move the DEFER block closing brace to only wrap the summary comment
posting. The winner.json parsing, inline findings posting (STEP 7b),
and label application (STEP 8) now run unconditionally regardless of
whether the summary comment is deferred to Stage 3.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The GitHub Review API rejects comments with null body fields. The
expert-reviewer agent sometimes writes findings where the body field
is null/missing. Add defensive fallback: try body, then message,
then content field, defaulting to '(no description)' if all are null.

This fixes the '422 For properties/body, nil is not a string' error
seen in pipeline run 14125163 for PR #35421.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PowerShell's ConvertFrom-Json unwraps single-element JSON arrays
into a scalar, causing the script to see 1 finding (the array
itself treated as an object with empty properties) instead of N
individual findings. Wrap parsed result in @() to force array.

Also fixes the 'suspicious path: empty' error — the unwrapped
array object has no .path property, resulting in an empty string
that gets dropped by path validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Dedup check: Before any setup, query AzDO for running/queued
   builds with the same PR+Platform. If a duplicate exists, cancel
   this run (oldest-wins policy). Prevents 2-5x compute waste.

2. Fail-fast merge conflicts: Test merge feasibility right after
   checkout, BEFORE SDK/emulator setup (~15-20 min saved per
   conflict). On failure, posts a PR comment with conflicting file
   list (uses hidden marker to update existing comment, not spam).

3. Fix partiallySucceeded noise: Add deepTestsRan variable and
   condition the drop-deep-uitests download on it. When deep tests
   are skipped (no detected categories), the download is skipped
   too, avoiding the 'artifact not found' warning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Findings JSON unwrapping: The expert-reviewer agent writes
   {"findings": [...]} (object wrapper) instead of bare [...].
   ConvertFrom-Json returns 1 object with no .path property,
   causing all findings to be dropped as 'suspicious path: empty'.
   Now detects and unwraps both bare arrays and object wrappers.

2. Merge conflict comment: Use GH_COMMENT_TOKEN (GitHub PAT) instead
   of System.AccessToken (AzDO PAT) for posting PR comments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the pre-check merge feasibility step and the dedup step
from ci-copilot.yml. Keep only the partial-success fix (deepTestsRan
condition on drop-deep-uitests download).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
- Fix TRX overwrite bug: merge retry results into original TRX instead of
  replacing it, preserving all first-run passing tests. Delete retry TRX
  to prevent double-counting by aggregators.
- Fix TRX contamination across categories: use specific TRX path from
  Invoke-UITestWithRetry instead of recursive 30-min time-based scan.
- Fix category extraction for Stage 2 naming: add controls- prefixed
  variants to Get-CategoryFromArtifactName prefix list.
- Stage 2 now uses Invoke-UITestWithRetry wrapper for env-error retry
  and device recovery. Remove retryCountOnTaskFailure (handled in-script).
- Re-emit detectedCategories output variable after AI-tier refresh so
  Stage 2 picks up the refreshed list.
- Stage 3 falls back to DEFERRED mode when aiSummaryCommentId is empty
  but deep test artifacts exist (reviewer crashed).
- Map all VSTest outcomes (Aborted, Timeout, Error, etc.) to Failed in
  Get-TrxResults so failure disclosures match counter totals.

Should-fix:
- Preserve run-all sentinel as ALL (not NONE) so Stage 2 can distinguish
  run-everything from run-nothing.
- Centralize env-error patterns in shared/Get-EnvErrorPatterns.ps1;
  Invoke-UITestWithRetry dot-sources it as single source of truth.
- Strip parameter signatures from test names before building VSTest
  retry filter to avoid filter grammar operator injection.
- Use case-insensitive deepTestsRan comparison in pipeline condition.
- Track iOS 26.4 availability via pipeline variable for fallback warning.
- Restrict Test-IsBugFixLabel to t/bug and i/regression only; use p/0|p/1
  as secondary signal AND-ed with bug labels to reduce false positives.
- Fix linked-issue regex to accept all GitHub closing keyword forms
  (Fix/Fixed/Close/Closed/Resolve/Resolved).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix ALL sentinel producing empty loop in Stage 2: use single-element
  list with empty string and skip -Category param in run-all mode
- Review-PR.ps1 now dot-sources shared/Get-EnvErrorPatterns.ps1 for
  infraSignals instead of hard-coded inline list
- Invoke-UITestWithRetry.ps1 fails loudly if shared patterns file is
  missing instead of silently using a drifted inline fallback
- CopilotLogs download gets continueOnError so DEFERRED fallback works
  when ReviewPR crashed before publishing the artifact
- Stage 2 uses splatted hashtable for retry wrapper params, conditionally
  omitting -Category in run-all mode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The add-labels safe-output had max: 1, which limited the number of
labels per call to 1 instead of limiting the number of calls. This
caused platform/* labels to be silently dropped whenever the agent
also selected an area-* label (which was every time).

Raise max to 10 so area + platform labels survive in a single call.
Also remove the obsolete report-incomplete safe-output key that is
no longer recognized by the gh-aw compiler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 35537

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 35537"

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 18 | Agents checked: 4

Full validator output
Found 1 skill(s)
[find-regression-risk] 📊 find-regression-risk: 967 BPE tokens [chars/4: 905] (detailed ✓), 10 sections, 2 code blocks
[find-regression-risk]    ⚠  No YAML frontmatter — agents use name/description for skill discovery.
✅ All checks passed (1 skill(s))
Found 4 agent(s)
Validated 4 agent(s)

✅ All checks passed (4 agent(s))

⏭️ LLM Evaluation: Skipped

No changed skills with eval tests found.

🔍 Full results and investigation steps

@github-actions github-actions Bot added the area-infrastructure CI, Maestro / Coherency, upstream dependencies/versions label May 20, 2026
@kubaflo kubaflo force-pushed the fix/agentic-labeler-max-labels branch from 1a05bff to e9a1ee3 Compare May 20, 2026 11:27
@kubaflo
Copy link
Copy Markdown
Contributor Author

kubaflo commented May 20, 2026

Accidental push to wrong branch — changes belong to PR #35376. Reverted.

@kubaflo kubaflo closed this May 20, 2026
@kubaflo kubaflo reopened this May 20, 2026
@kubaflo
Copy link
Copy Markdown
Contributor Author

kubaflo commented May 20, 2026

Recreating with clean branch

@kubaflo kubaflo closed this May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-infrastructure CI, Maestro / Coherency, upstream dependencies/versions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants