Skip to content

Security hardening for ci-copilot (#35677 stack): sanitize PR stdout, fail-closed Gate state, HMAC verdict, Phase-Audit#35804

Merged
kubaflo merged 15 commits into
dotnet:feature/enhanced-reviewerfrom
T-Gro:security-on-enhanced-reviewer
Jun 8, 2026
Merged

Security hardening for ci-copilot (#35677 stack): sanitize PR stdout, fail-closed Gate state, HMAC verdict, Phase-Audit#35804
kubaflo merged 15 commits into
dotnet:feature/enhanced-reviewerfrom
T-Gro:security-on-enhanced-reviewer

Conversation

@T-Gro

@T-Gro T-Gro commented Jun 8, 2026

Copy link
Copy Markdown
Member

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Stacked on #35677

This PR layers a security-hardening stack on top of @kubaflo's #35677 (feature/enhanced-reviewer). When merged here, the combined work flows to main via #35677.

Scope

Re-applies and updates the security audit work originally landed in #35324 against the latest #35677 base, plus three new defenses caught by adversarial review rounds.

# Change Why
F1 Cross-phase state hardening — dedicated phase-state/ subdir; CopilotReview re-derives `` instead of restoring a persisted path; fail-closed on missing/garbage Gate verdict Original review said writable Split-Path -Parent of \ allowed PR test code to forge gate-result.txt and to redirect & pwsh -File execution under COPILOT_GITHUB_TOKEN
F1.A HMAC-sign gate-result.txt via per-job openssl rand -hex 32 key passed to CopilotReview/Post through an AzDO secret variable Adversarial reviewer demonstrated empirically that chmod -R a-w is bypassable same-user (mv state state.bak && mkdir state && echo PASSED > ...). Key generated AFTER pwsh exits → PR-spawned daemons cannot read it
F2 Out-SafePRSubprocessLine helper + inline -replace '(?i)##(vso\[|\[)', '##~SANITIZED~\' at every PR-controlled Write-Host site + & pwsh -NoProfile -File subprocess wraps at every shared runner site Prevents PR test code from forging ##vso[task.setvariable / task.complete / task.uploadfile] AzDO logging commands through stdout
F3 Marker-comment hiding now requires Test-IsMauiBotCommentAuthor A maintainer who quotes <!-- AI Summary --> in their own comment was being silently hidden
D1 / D1.A Whitelist ^(NONE|ALL|[A-Za-z0-9_,]+)\$ before re-emitting detectedCategories; ^[A-Za-z0-9_\-]+\$ for aiSummaryReviewNodeId Closes a future-proofing class where PR-controlled text could inject a second ##vso directive via setvariable echo
M1 / M2 Setup invokes \/scripts/Review-PR.ps1 (not the worktree); exit \ on failure; Tasks 2-4 gated on ne(variables['SetupFailed'], 'true') so a half-set-up worktree can't keep running Original review said the worktree-vs-trusted asymmetry was inconsistent and the lack of short-circuit let Gate run on stale state
Phase-Audit AST-based static analyzer of the `` call-graph closure: default-deny gh-write-verb classifier + YAML token-binding audit, run as a Pester suite Locks the contract that Gate's GH_TOKEN is read-only; any future write-verb regression fails CI
Pester CI security-scripts-pester.yml workflow runs all 9 .Tests.ps1 files on every PR touching .github/scripts, .github/skills, eng/scripts, or eng/pipelines/ci-copilot.yml. Required-files allowlist + minPassed floor + zero-skipped tolerance → silent test deletion / Pester skip cannot land green Pre-existing PS tests had no CI runner; this is the first Pester gate in the repo

The original sanitizer / phase-state work passed an adversarial review with three independent reviewers (opus-4.8 + opus-4.7-xhigh + gpt-5.5); the rebase onto #35677's new Copilot-token-usage commits was re-validated by the same set. gpt-5.5 caught one pre-existing F2 gap (UI/Device runner \ re-emit at L1526/1539 was raw Write-Host) which is fixed in this branch.

Validation

  • Pester: 233/233 green (50s runtime)
  • Phase-Audit Axis A: 21/21 — no gh write-verb regressions on real Gate-reachable scripts
  • Phase-Audit Axis B: 3/3 — every YAML task's GH_TOKEN binding is the documented PAT alias; CopilotReview and the new AnalyzeCopilotTokenUsage stage have no GH_TOKEN
  • openssl ↔ .NET HMAC cross-check verified locally (-mac HMAC -macopt hexkey: round-trips with Convert.FromHexString)
  • YAML lint: yaml.safe_load clean

Note for #35677 owner

The PR title and description should be reviewed against the actual diff before merge; this stack only adds defenses, no functional changes to the review pipeline beyond the trust-boundary tightening.

cc @kubaflo

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

🔍 Skill Validation Results

✅ Static Checks Passed

Skills checked: 18 | Agents checked: 4

Full validator output
Found 1 skill(s)
[verify-tests-fail-without-fix] 📊 verify-tests-fail-without-fix: 2,271 BPE tokens [chars/4: 2,189] (detailed ✓), 26 sections, 7 code blocks
✅ All checks passed (1 skill(s))
Found 4 agent(s)
Validated 4 agent(s)

✅ All checks passed (4 agent(s))

❌ LLM Evaluation Failed

0/1 skill(s) passed validation

Skill Scenario Baseline Skilled Verdict
verify-tests-fail-without-fix Happy path: full verification mode with fix files present 1.0/5 1.0/5 (iso) · 1.0/5 (plug) ⚠️
verify-tests-fail-without-fix Happy path: verify failure only mode for test creation 1.0/5 1.0/5 (iso) · 1.0/5 (plug) ⚠️
verify-tests-fail-without-fix Negative trigger: general test question should not invoke verification 4.0/5 4.0/5 (iso) · 4.0/5 (plug) ⚠️ [1]
verify-tests-fail-without-fix Regression: tests passing without fix means verification FAILED 5.0/5 5.0/5 (iso) · 5.0/5 (plug) ⚠️ [2]
verify-tests-fail-without-fix Edge case: no test files detected in the PR 2.0/5 3.7/5 (iso) · 1.3/5 (plug) ⚠️
verify-tests-fail-without-fix Regression: agent must not manually revert files with git commands 1.0/5 ⏳ 2.7/5 (iso) · 2.0/5 (plug) ⏳ [3]
verify-tests-fail-without-fix Edge case: agent uses RequireFullVerification when fix files exist 1.0/5 ⏳ 1.3/5 (iso) · 2.0/5 (plug) ⏳ [4]
verify-tests-fail-without-fix Regression: agent correctly reports test failure as verification success 5.0/5 5.0/5 (iso) · 5.0/5 (plug) ⚠️ [5]
verify-tests-fail-without-fix Feature: agent reports verification result clearly 2.0/5 3.7/5 (iso) · 3.3/5 (plug) [6]
verify-tests-fail-without-fix Feature: agent trusts script auto-detection of test files from git diff 1.0/5 3.3/5 (iso) · 3.7/5 (plug)

verify-tests-fail-without-fix: Skill regressed on task completion in one or more scenarios [NOT ACTIVATED (isolated) in 2 scenario(s)] [NOT ACTIVATED (plugin) in 3 scenario(s)]

[1] High run-to-run variance (CV=3.99)
[2] High run-to-run variance (CV=1.08)
[3] High run-to-run variance (CV=0.85). Timeout at 900s
[4] High run-to-run variance (CV=0.52). Timeout at 900s
[5] High run-to-run variance (CV=0.58)
[6] High run-to-run variance (CV=0.53)

timeout — run(s) hit the scenario timeout limit; scoring may be impacted

📊 verify-tests-fail-without-fix / edge-case-agent-uses-requirefullverification-when-fix-files-exist

Judge Report: Edge case: agent uses RequireFullVerification when fix files exist

Baseline Judge

Overall Score: 1/5
Reasoning: The agent failed to accomplish anything meaningful. After discovering the working directory was empty and a non-existent skill wasn't available, it gave up entirely and asked the user for help rather than attempting to clone the repository or find another way to proceed. The output is just a description of what should be done, not actual verification work.

  • Agent uses -RequireFullVerification to ensure full two-phase verification: 1/5 — The agent never used -RequireFullVerification or any equivalent mechanism. It attempted a skill that doesn't exist, checked the workspace, found it empty, and gave up with recommendations instead of actually performing verification.
  • Agent runs the complete workflow: fail without fix, then pass with fix: 1/5 — The agent did not run any part of the verification workflow. It found the workspace empty and simply provided instructions for what someone else could do, rather than attempting to clone the repo or find alternative approaches.

With-Skill Judge (Isolated)

Overall Score: 1.3/5
Reasoning: The agent made a reasonable effort to verify the tests but was blocked by environment limitations (no Android workload, no emulator, missing helper scripts). It correctly identified the fix file and test files, attempted the verification script with proper parameters including -RequireFullVerification, and tried manual workarounds. However, it ultimately could not complete the core task of verifying tests fail without the fix and pass with it. The agent spent many turns trying various approaches but never succeeded in running any actual verification. The final output correctly reports the blocked status but doesn't deliver the required verification.

  • Agent uses -RequireFullVerification to ensure full two-phase verification: 3.3/5 — The agent did attempt to use -RequireFullVerification in its first script invocation (the one that ran for 12+ minutes but was using wrong parameters from a previous run's state). In the final attempt before timeout, the command was cut off so we can't confirm the flag was included, but the agent clearly understood the requirement and attempted it.
  • Agent runs the complete workflow: fail without fix, then pass with fix: 1/5 — The agent never completed the workflow. It spent most of its time debugging repository structure, finding the correct repo path, dealing with a stale previous run, and fetching origin/main. The final verification attempt timed out before producing any results. No phase of the two-phase verification was completed.

With-Skill Judge (Plugin)

Overall Score: 2/5
Reasoning: The agent correctly identified that the environment was not set up for this task (wrong PR checkout, missing files, unauthenticated gh CLI). It made reasonable attempts to run the verification script with appropriate parameters including -RequireFullVerification. However, the core task was not completed - no verification was performed. The agent's diagnosis was accurate and its suggested remediation was reasonable, but the end result is that the verification objective was not achieved.

  • Agent uses -RequireFullVerification to ensure full two-phase verification: 2.7/5 — The agent did include -RequireFullVerification in its command invocations, showing it understood the requirement. However, the flag never actually executed successfully due to missing infrastructure scripts.
  • Agent runs the complete workflow: fail without fix, then pass with fix: 1/5 — The workflow never ran at all. The agent was blocked by missing scripts and could not complete either phase of the verification. It correctly identified the blocker but made no attempt to work around it or manually simulate the verification steps.

Baseline Agent Output

`
The working directory is empty — there's no repository cloned here. To perform the "verify tests fail without fix" workflow, I need access to the dotnet/maui repository.

Here's what this verification requires:

Full Verification Mode (test + fix):

  1. Clone/checkout the repo with the PR changes
  2. Revert the fix (src/Controls/src/Core/Handlers/ScrollView/ScrollViewHandler.Android.cs) while keeping the test files
  3. Build & run tests with filter FullyQualifiedName~Issue55555 on Android — tests should FAIL (proving they catch the bug)
  4. Restore the fix
  5. Build & run tests again — tests should PASS (proving the fix resolves the bug)

What I need to proceed:

  • The PR number or branch name so I can clone the repository and check out the changes
  • Or: access to an environment where the repo is already cloned

Could you provide the PR number (e.g., gh pr checkout <number>) or point me to where the repo is available?
`

With-Skill Agent Output (Isolated)

`

Verification Result: ⚠️ BLOCKED

Cannot complete full verification — this environment lacks the required infrastructure:

  1. Android workload not installeddotnet build fails with NETSDK1147 (no android workload)
  2. No Android emulator/ADB — UI tests require a running device
  3. Missing helper scriptsEstablishBrokenBaseline.ps1 not present in repo

What was confirmed:

  • Fix file identified: src/Core/src/Handlers/ScrollView/ScrollViewHandler.Android.cs (has meaningful diff vs origin/main — null-check consolidation and removal of UpdateClipForShadow)
  • Test files exist: UI test (Issue55555.cs) in both TestCases.Shared.Tests and TestCases.HostApp
  • Test is correctly structured: Waits for element "vegetables" in a grouped ListView with RecycleElement caching

To run full verification:

This needs a CI environment with the Android workload + emulator, or the maui-pr-uitests pipeline with filter FullyQualifiedName~Issue55555.
`

With-Skill Agent Output (Plugin)

This environment doesn't contain the actual fix file (src/Controls/src/Core/Handlers/ScrollView/ScrollViewHandler.Android.cs`) or any test files for Issue55555. The repository appears to be a shallow clone from an unrelated PR (#33134) and doesn't have the commits for Issue #55555.

Result: BLOCKED

The verification cannot proceed because:

  1. The fix file src/Controls/src/Core/Handlers/ScrollView/ScrollViewHandler.Android.cs does not exist in this checkout
  2. No commits related to Issue55555 are present in the git history
  3. The gh CLI is not authenticated, preventing PR checkout
  4. We're in a detached HEAD state from an unrelated PR

To run this verification, you'd need to:
bash gh pr checkout <PR_NUMBER_FOR_ISSUE_55555> pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 \ -Platform android \ -TestFilter "FullyQualifiedName~Issue55555" \ -RequireFullVerification \ -FixFiles @("src/Controls/src/Core/Handlers/ScrollView/ScrollViewHandler.Android.cs") \ -BaseBranch main
`

📊 verify-tests-fail-without-fix / edge-case-no-test-files-detected-in-the-pr

Judge Report: Edge case: no test files detected in the PR

Baseline Judge

Overall Score: 2/5
Reasoning: The agent's response is minimal and somewhat unhelpful. While it correctly identified the working directory was empty and didn't fabricate anything, it only checked one directory and immediately gave up. It could have searched more broadly for the repository, or engaged more substantively with the user's stated concern about missing test files. The task prompt clearly states the issue (fix without tests), and the agent should have discussed the implications and proposed concrete next steps around test coverage for the Entry handler Android fix, even while asking for the PR details.

  • Agent recognizes that without test files, verification cannot proceed: 2/5 — The agent did not explicitly recognize or analyze that without test files verification cannot proceed. It simply found an empty directory and asked for more context, without connecting the empty workspace to the stated problem about missing tests.
  • Agent suggests that tests need to be written before verification can be run: 2.7/5 — The agent did not suggest that tests need to be written. It merely asked for more context about which PR and what fix was made. While it mentioned it could 'Determine if the fix is adequately tested or if new tests are needed', this is a future promise rather than an actual suggestion based on analysis.
  • Agent does not attempt to fabricate or skip the test requirement: 5/5 — The agent did not fabricate any results or skip the test requirement. It honestly reported that it couldn't find the relevant code and asked for clarification rather than making things up.

With-Skill Judge (Isolated)

Overall Score: 3.7/5
Reasoning: The agent correctly identified the blockers (empty working directory, no test files) and asked appropriate clarifying questions. However, it could have been more proactive - for example, it could have read the skill context file to understand the full workflow, or attempted to look for the repository in other locations. The response is reasonable but minimal in terms of effort spent exploring options before asking the user for help.

  • Agent recognizes that without test files, verification cannot proceed: 4/5 — The agent clearly identified that no test files were added to the PR and that the working directory was empty/not a git repo. It explicitly stated 'I cannot verify the fix is caught by tests because there are no tests to verify against.' This shows clear recognition of the blocker.
  • Agent suggests that tests need to be written before verification can be run: 4.7/5 — The agent explicitly recommended writing tests first, mentioning the 'write-tests-agent' skill, and also offered the alternative of providing a TestFilter for existing tests. Both are valid and helpful suggestions.
  • Agent does not attempt to fabricate or skip the test requirement: 5/5 — The agent did not attempt to fabricate tests, hallucinate results, or skip the requirement. It honestly reported the situation and asked for guidance on how to proceed.

With-Skill Judge (Plugin)

Overall Score: 1.3/5
Reasoning: The agent completely failed to engage with the task. It had a specific file path (src/Controls/src/Core/Handlers/Entry/EntryHandler.Android.cs) and platform (android) to work with, yet made zero tool calls. It could have examined the file, searched for existing tests related to EntryHandler, checked git log for recent changes, or explored the test directory structure. Instead it asked for a PR number which wasn't strictly necessary to begin investigation. This is a minimal-effort non-response.

  • Agent recognizes that without test files, verification cannot proceed: 1.7/5 — The agent did not explore the codebase or engage with the fact that no test files were added. It simply asked for a PR number without acknowledging the core issue about missing tests. It vaguely mentioned checking 'whether existing tests catch the bug' but didn't investigate whether such tests exist.
  • Agent suggests that tests need to be written before verification can be run: 1/5 — The agent made no suggestion about writing tests. It only asked for a PR number and mentioned a skill it would use, without addressing the fundamental issue that tests may need to be created.
  • Agent does not attempt to fabricate or skip the test requirement: 3.7/5 — The agent did not fabricate results or skip the test requirement. However, it also didn't do anything productive - it just asked for more information rather than exploring the codebase to determine the test situation.

Baseline Agent Output

`
The working directory is empty. I need more context to help you:

  1. Which PR? — Please provide the PR number (e.g., #12345) so I can check it out.
  2. Which repository? — I'll assume dotnet/maui unless you specify otherwise.

Once I have the PR number, I can:

  • Check out the code and review the fix
  • Identify or write tests that verify the fix catches the bug
  • Run those tests against the changed code

Could you share the PR number?
`

With-Skill Agent Output (Isolated)

`
The working directory is empty — there's no repository cloned here. To run the verification skill, I need the actual MAUI repository with the PR changes.

Could you provide:

  1. The PR number (e.g., #12345) so I can check it out, or
  2. A branch/repo URL to clone from?

Without a git repository containing the fix and test files, I can't run the verification script. Additionally, the skill notes that test files must exist in the PR or a TestFilter must be specified. Since you mentioned no test files were added, I'd also need either:

  • A specific test filter targeting existing tests that should cover this fix (e.g., -TestFilter "EntryHandlerTests")
  • Or test files to be written first (via the write-tests-agent)

Would you like me to write tests for this fix first, or do you have a PR number/test filter I can use?
`

With-Skill Agent Output (Plugin)

I'd like to help verify the fix is caught by tests, but I need the PR number to proceed. Could you provide the PR number (e.g.,#12345) so I can check the changes and run the appropriate verification?

📊 verify-tests-fail-without-fix / feature-agent-reports-verification-result-clearly

Judge Report: Feature: agent reports verification result clearly

Baseline Judge

Overall Score: 2/5
Reasoning: The agent found an empty working directory and gave up on doing any actual investigation, instead providing generic guidance about running .NET MAUI UI tests. It didn't attempt to clone a repo, search for documentation about the verification framework, or look for any specific output format. The response is generic and doesn't address the specific verification system's output format or inverted semantics that the rubric is looking for.

  • Agent explains the verification output format (VERIFICATION PASSED / VERIFICATION FAILED): 2/5 — The agent does not mention the specific 'VERIFICATION PASSED' or 'VERIFICATION FAILED' output strings that the test framework would produce. Instead, it provides a general table of outcomes. If the system under test produces specific formatted output like 'VERIFICATION PASSED' / 'VERIFICATION FAILED', the agent failed to identify or mention this.
  • Agent describes what each result means in the context of inverted semantics: 2/5 — The agent does provide a table explaining what different test outcomes mean (tests pass without fix = bad, tests fail without fix but pass with fix = good). This touches on inverted semantics reasoning, but it's generic advice rather than specific to whatever inverted verification pattern might be used in the actual test framework. The explanation is reasonable but not tailored to the specific system.

With-Skill Judge (Isolated)

Overall Score: 3.7/5
Reasoning: The agent provided a reasonable conceptual explanation of the verification process with clear pass/fail criteria and inverted semantics. However, it didn't actually read the verification script to confirm the real output format, parameter names, or behavior - the details appear to be inferred/fabricated from the skill name rather than derived from the actual code. The agent acknowledged it couldn't fully execute the task due to missing git context, which is honest. The output is helpful but potentially inaccurate in specifics.

  • Agent explains the verification output format (VERIFICATION PASSED / VERIFICATION FAILED): 4.3/5 — The agent clearly presents a table showing 'VERIFICATION PASSED ✅' and 'VERIFICATION FAILED ❌' as the two possible script outputs. It also shows the command that would produce these outputs.
  • Agent describes what each result means in the context of inverted semantics: 4.7/5 — The agent clearly explains the inverted semantics: tests failing without the fix is the good outcome (proves they detect the bug), and tests passing without the fix means they don't detect the bug. The warning callout about inverted semantics is helpful and accurate.

With-Skill Judge (Plugin)

Overall Score: 3.3/5
Reasoning: The agent couldn't run the actual verification (no repo available), but it read the skill context and provided a clear, accurate explanation of how to interpret verification results. The inverted semantics explanation is particularly well done. The agent was efficient in checking for the repo and then providing a comprehensive answer. Minor deduction for not being able to actually execute anything, though that was due to environment constraints rather than agent error.

  • Agent explains the verification output format (VERIFICATION PASSED / VERIFICATION FAILED): 4/5 — The agent clearly explains that the output will show 'VERIFICATION PASSED' or 'VERIFICATION FAILED' and mentions the verification report file path. It also mentions the PowerShell script and the -RequireFullVerification flag. However, the agent couldn't actually read the script to confirm exact output strings, so this is partially based on inference from the skill context.
  • Agent describes what each result means in the context of inverted semantics: 4.3/5 — The agent provides a clear table showing inverted semantics: tests FAIL without fix = verification passed (tests detect the bug), tests PASS without fix = verification failed (tests miss the bug). It also explains the full verification mode where both directions are checked. This is a solid explanation of the inverted logic.

Baseline Agent Output

`
The working directory is empty — there's no repository cloned here, so I can't run the UI tests directly.

However, I can explain how verification works for .NET MAUI UI tests on iOS:

Running the Test

bash dotnet test src/Controls/tests/TestCases.Shared.Tests/ \ -f net10.0-ios \ --filter "FullyQualifiedName~Issue66666" \ --logger "trx;LogFileName=results.trx"

How to Know Pass/Fail

Signal Passed ✅ Failed ❌
Exit code 0 1
Console output Passed! - Failed: 0 Failed! - Failed: N
TRX file outcome="Passed" outcome="Failed" with stack trace
Appium log Element assertions succeed NoSuchElementException or assertion mismatch

Verifying the Bug Is Caught

To confirm the test catches the bug without the fix:

  1. With the fix applied → test should pass (correct behavior verified)
  2. Without the fix (revert the code change, keep the test) → test should fail (proving the test detects the regression)

If the test passes in both scenarios, it doesn't actually catch the bug.

Prerequisites for iOS

  • macOS with Xcode installed
  • iOS Simulator booted (or physical device connected)
  • Appium server running (appium &)
  • App deployed to the simulator

Would you like me to help with anything else, such as cloning the repo or looking at the specific PR?
`

With-Skill Agent Output (Isolated)

`
This directory isn't a git repository, so there's no PR context to verify against. However, let me explain how the verification works based on your setup:

How Verification Works

Since the PR has both test files and a fix, the command would be:

powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 \ -Platform ios \ -TestFilter "FullyQualifiedName~Issue66666" \ -RequireFullVerification

How to interpret results:

Script Output Meaning
VERIFICATION PASSED ✅ Tests fail without fix AND pass with fix — bug is caught ✅
VERIFICATION FAILED ❌ Tests pass without fix (don't detect bug) OR fail with fix (fix is broken) ❌

⚠️ Key concept: Semantics are inverted — tests failing without the fix is the good outcome (proves they detect the bug).

Output files will be in:

CustomAgentLogsTmp/PRState/<PRNumber>/PRAgent/gate/verify-tests-fail/ ├── verification-report.md # Detailed report ├── test-without-fix.log # Should show FAILURES └── test-with-fix.log # Should show PASSES


To actually run this, I need to be in the MAUI repository with the PR branch checked out. Would you like me to clone the repo and check out the PR branch, or is there a specific repo path I should work from?
`

With-Skill Agent Output (Plugin)

`
The working directory is empty and there's no MAUI repository cloned here. I can't run the verification script without the repository and PR checked out.

However, here's how you'll know if verification passed or failed:

Since the PR has both test files and a fix, the script runs in Full Verification mode. It performs two phases:

✅ VERIFICATION PASSED means:

  1. Without fix → tests FAIL (proves tests detect the bug)
  2. With fix → tests PASS (proves the fix resolves it)

❌ VERIFICATION FAILED means either:

  • Tests PASS without the fix → tests don't catch the bug
  • Tests FAIL with the fix → the fix doesn't work

The command to run (once the repo is checked out):

powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 \ -Platform ios \ -TestFilter "FullyQualifiedName~Issue66666" \ -RequireFullVerification

Output locations:

  • Report: CustomAgentLogsTmp/PRState/<PRNumber>/PRAgent/gate/verify-tests-fail/verification-report.md
  • Logs: test-without-fix.log and test-with-fix.log in the same directory

⚠️ Key reminder: In this skill, test outcomes are inverted — tests failing without the fix is the good outcome. Would you like me to clone the repo and run this?
`

📊 verify-tests-fail-without-fix / feature-agent-trusts-script-auto-detection-of-test-files-from-git-diff

Judge Report: Feature: agent trusts script auto-detection of test files from git diff

Baseline Judge

Overall Score: 1/5
Reasoning: The agent barely explored the problem space - it only checked if the current directory had files, found it empty, and gave up immediately. It made an incorrect definitive claim that no auto-detection script exists without actually investigating (e.g., cloning the repo, checking for scripts). The agent's response directly contradicts all three rubric criteria by telling the user they need to manually identify tests rather than confirming script auto-detection capabilities.

  • Agent explains that the script can auto-detect test files from the PR diff: 1/5 — The agent explicitly stated the opposite — that there is no built-in verification script that auto-discovers which tests to run from a PR's changed files. It failed to find or acknowledge any such script.
  • Agent does not require the user to manually specify every test file path: 1.7/5 — The agent directly told the user they'd need to 'Identify test files manually' using grep on the PR diff, placing the burden on the user.
  • Agent trusts the script's git diff analysis rather than manually searching for test files: 1/5 — The agent didn't find, use, or trust any script. It suggested entirely manual approaches and concluded no automated detection exists.

With-Skill Judge (Isolated)

Overall Score: 3.3/5
Reasoning: The agent correctly answered the user's core question about whether the script can auto-detect test files. It couldn't actually run the verification due to environment constraints (no repo, no GH_TOKEN), but it handled this gracefully by explaining what's needed and confirming the script's capabilities. The agent could have read the skill's script files to provide more detailed/authoritative information about how the auto-detection works, but the answer is accurate and helpful.

  • Agent explains that the script can auto-detect test files from the PR diff: 5/5 — The agent clearly explains that the verification script auto-detects test files from the git diff, listing the steps: detect changed files, auto-classify test type, auto-extract test class names, and run the appropriate test runner.
  • Agent does not require the user to manually specify every test file path: 5/5 — The agent explicitly states 'you don't need to specify them manually' and shows the command without any test file paths specified.
  • Agent trusts the script's git diff analysis rather than manually searching for test files: 5/5 — The agent trusts the script's built-in capability to analyze the git diff and doesn't attempt to manually find or specify test files. It recommends running the script with just -Platform and -RequireFullVerification flags.

With-Skill Judge (Plugin)

Overall Score: 3.7/5
Reasoning: The agent correctly answered the user's question about whether the script can auto-detect tests, explained the mechanism clearly, and appropriately noted the environmental limitations preventing actual execution. It didn't waste excessive steps and recovered gracefully from the missing auth/repo issues. The only minor gap is it couldn't actually run the verification, but that's an environmental constraint, not an agent failure.

  • Agent explains that the script can auto-detect test files from the PR diff: 5/5 — The agent clearly explains that the script auto-detects test type from changed files in the git diff, auto-generates the test filter from changed test class names, and routes to the appropriate test runner.
  • Agent does not require the user to manually specify every test file path: 5/5 — The agent explicitly states 'No -TestFilter or -TestType needed — both are auto-detected' and shows the command without manual file specification.
  • Agent trusts the script's git diff analysis rather than manually searching for test files: 5/5 — The agent fully trusts the script's capabilities and recommends using it with just -Platform and -RequireFullVerification flags, relying on its git diff analysis.

Baseline Agent Output

`
The current directory is empty and not a git repository. There's no .NET MAUI repository cloned here, so there's no verification script available to inspect.

To answer your question directly: The .NET MAUI repository does not have a built-in script that automatically detects which tests to run from a PR's changed files. You'd typically need to:

  1. Check the PR's changed files on GitHub (gh pr view 77777 --json files)
  2. Identify test projects manually based on which area was changed (Core, Controls, Essentials, etc.)
  3. Run the relevant test project, e.g.:
    bash dotnet test src/Controls/tests/Core.UnitTests/Controls.Core.UnitTests.csproj -f net9.0-android

Would you like me to:

  • Clone the repo and inspect PR #77777's changed files to identify the right tests?
  • Or help you set up a workflow to run specific tests for this PR?
    `

With-Skill Agent Output (Isolated)

I can't proceed with verifying PR #77777 — the environment has no git repository cloned and no GitHub token configured forgh` CLI access.

To answer your question: Yes, the verification script auto-detects test files from the git diff. You don't need to specify them manually. It will:

  1. Detect changed test files from the PR's diff
  2. Auto-determine the test type (UITest, UnitTest, DeviceTest, etc.)
  3. Auto-build a test filter from detected test classes

What's needed to run it:

  1. Clone the MAUI repo with the PR branch checked out
  2. Have GH_TOKEN set (or be authenticated with gh)
  3. Run:
    powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 -Platform android -RequireFullVerification

Would you like me to set up the repository here, or is there another environment where this should run?
`

With-Skill Agent Output (Plugin)

`
The environment doesn't have GitHub CLI authentication configured, and the working directory isn't a git repo. To run the verification script, I need:

  1. A cloned repo with the PR branch checked out (the script uses git diff to auto-detect test files and fix files)
  2. GH_TOKEN set for gh CLI access

To answer your question: Yes, the script absolutely can figure out which tests to run on its own! Here's how it works:

  • It analyzes the git diff between the PR branch and base branch
  • It auto-detects the test type (UITest, UnitTest, DeviceTest, XamlUnitTest) based on file paths
  • It auto-builds the test filter from changed test class names
  • It determines if fix files exist to choose Full Verification vs Failure-Only mode

However, I can't run it right now because this environment lacks:

  • A cloned dotnet/maui repo with PR #77777 checked out
  • GitHub authentication (GH_TOKEN)

If you can point me to a cloned repo with the PR branch, or set up auth, I can run:

powershell pwsh scripts/verify-tests-fail.ps1 -Platform android -PRNumber 77777

No -TestFilter or -TestType needed — it auto-detects both from the diff.
`

📊 verify-tests-fail-without-fix / happy-path-full-verification-mode-with-fix-files-present

Judge Report: Happy path: full verification mode with fix files present

Baseline Judge

Overall Score: 1/5
Reasoning: The agent gave up almost immediately after finding an empty directory and a failed skill invocation. It made no attempt to clone the repository, find relevant scripts, or actually perform the verification task. The output is essentially a help request back to the user rather than completing the assigned task.

  • Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent did not run any tests at all. It checked the working directory, found it empty, and stopped.
  • Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — No tests were run in any phase.
  • Agent uses the verify-tests-fail script rather than manually running dotnet test: 1/5 — The agent attempted to invoke a non-existent skill 'verify-tests-fail-without-fix' but did not look for or use any verify-tests-fail script in the environment.
  • Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1.3/5 — No results were produced to interpret.

With-Skill Judge (Isolated)

Overall Score: 1/5
Reasoning: The agent made one attempt to run the script, encountered a 'not a git repository' error, and immediately gave up by asking the user for more information. It didn't explore the skill's files/documentation to understand how to set up the environment, didn't try alternative approaches (like checking if there's a repo elsewhere, reading the script to understand its parameters like -RepoRoot or -PRNumber), and didn't attempt any recovery. The task was essentially not completed at all.

  • Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent never ran any tests at all. It encountered a missing git repo error and stopped.
  • Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — No tests were ever run, so no second phase occurred.
  • Agent uses the verify-tests-fail script rather than manually running dotnet test: 2.3/5 — The agent did attempt to use the script (correct approach), but failed to set up the environment needed for it to work and gave up rather than finding an alternative path.
  • Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1/5 — No results were produced to interpret. The agent ended with a question to the user instead of completing the task.

With-Skill Judge (Plugin)

Overall Score: 1/5
Reasoning: The agent failed to complete the verification task. While it correctly identified and attempted to use the verify-tests-fail.ps1 script, it gave up when dependencies were missing rather than attempting the manual verification approach it described. The agent could have read the script to understand the logic, identified test and fix files in the diff, and performed the revert-test-restore-test cycle manually. Instead it produced only a 'blocked' message with suggestions for the user.

  • Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent never ran any tests. It encountered missing infrastructure scripts and stopped, reporting the verification as blocked.
  • Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — The agent never reached this phase. No tests were run at all.
  • Agent uses the verify-tests-fail script rather than manually running dotnet test: 2.3/5 — The agent attempted to use the verify-tests-fail.ps1 script twice (with and without PR number), which is the correct approach. However, when it failed due to missing dependencies, the agent gave up rather than trying alternative approaches or manual verification.
  • Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1.3/5 — No results were produced to interpret. The agent reported BLOCKED without completing any verification phase.

Baseline Agent Output

The working directory is empty and there's no repository cloned here. I'm also unable to invoke theverify-tests-fail-without-fix` skill in this environment.

To run full verification (test fails without fix, passes with fix), I need:

  1. The dotnet/maui repository cloned locally
  2. The PR branch checked out (with both the test and fix)
  3. Android SDK + OpenJDK 17 configured for building

Manual workflow (what the skill would do):

`

1. Clone and checkout PR

gh repo clone dotnet/maui && cd maui
gh pr checkout <PR_NUMBER>

2. Run test WITH fix (should pass)

dotnet test --filter "Issue54321" -f net9.0-android

3. Revert fix files only, keep test

git stash push --

4. Run test WITHOUT fix (should fail)

dotnet test --filter "Issue54321" -f net9.0-android

5. Restore fix

git stash pop
`

Would you like me to clone the repository and attempt this here, or is there a specific environment/PR number I should target?
`

With-Skill Agent Output (Isolated)

`
The script requires a git repository context. This working directory isn't a git repo — the skill needs to run inside the actual MAUI repository where the PR branch is checked out.

Could you provide:

  1. The PR number (so I can check it out), or
  2. The path to the repo where the PR is already checked out?

Without a cloned repo with the PR branch, the verification script can't detect fix files or revert them to test both states.
`

With-Skill Agent Output (Plugin)

The required supporting scripts (EstablishBrokenBaseline.ps1, Detect-TestsInDiff.ps1`, etc.) don't exist in this environment. This is a partial checkout or the repository is missing CI infrastructure scripts.

Verification blocked — the verify-tests-fail.ps1 script depends on shared scripts that aren't present in this environment:

  • .github/scripts/EstablishBrokenBaseline.ps1
  • .github/scripts/shared/Detect-TestsInDiff.ps1
  • .github/scripts/shared/Write-SafeSubprocessOutput.ps1

These are CI infrastructure scripts that provide merge-base detection, test file identification, and log sanitization.

What you can do:

  1. Run in full CI — trigger the maui-pr pipeline where these scripts exist
  2. Run manually — on a full repo clone with all scripts present:
    powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1
    -Platform android -TestFilter "Issue54321" -RequireFullVerification -PRNumber 54321
    `
  3. Manual equivalent — revert the fix files, run the test (expect failure), restore files, run again (expect pass):
    bash git stash push -- <fix-files> # remove fix dotnet test <project> --filter "Issue54321" # should FAIL git stash pop # restore fix dotnet test <project> --filter "Issue54321" # should PASS

Would you like me to attempt the manual approach if you can point me to the fix files and test project?
`

📊 verify-tests-fail-without-fix / happy-path-verify-failure-only-mode-for-test-creation

Judge Report: Happy path: verify failure only mode for test creation

Baseline Judge

Overall Score: 1/5
Reasoning: The agent failed to accomplish the task. After finding the working directory empty, it simply asked the user for more information rather than attempting to clone the repository, locate the test, or run it. The agent made only 3 tool calls and gave up very quickly without exploring alternative approaches to find or access the code.

  • Agent runs the test in verify-failure-only mode since no fix exists yet: 1/5 — The agent did not run any test at all. It checked the working directory, found it empty, and gave up by asking the user for more information instead of attempting to clone the repository or find another way to proceed.
  • Agent correctly interprets test failure as SUCCESS (test catches the bug): 1/5 — The agent never reached the point of running a test or interpreting any results. It could not demonstrate understanding of the concept that a test failure = success in this context.
  • Agent does not require fix files to be present for this mode: 2.7/5 — The agent didn't explicitly require fix files, but it also didn't demonstrate understanding that this is a verify-failure-only scenario where fix files aren't needed. It simply asked for repository location and test file path.
  • Agent does not use -RequireFullVerification flag since no fix files exist: 3/5 — The agent didn't use any flags at all since it never ran anything. It technically didn't violate this criterion, but it also didn't demonstrate any knowledge of the flag system.

With-Skill Judge (Isolated)

Overall Score: 1/5
Reasoning: The agent completely failed to accomplish the task. After the initial script execution failed due to missing git repository, the agent did some basic file searching but never read the skill documentation or script contents to understand available parameters/modes. It should have examined the script to find a verify-failure-only mode, potentially cloned the necessary repo, or found another way to proceed. Instead, it gave up and asked the user for help after just a few exploratory commands.

  • Agent runs the test in verify-failure-only mode since no fix exists yet: 1.3/5 — The agent attempted to run the script in what appears to be the correct mode (no full verification flag), but it failed due to a missing git repo and the agent gave up rather than exploring the skill context files to understand how to properly set up the environment or pass the correct parameters.
  • Agent correctly interprets test failure as SUCCESS (test catches the bug): 1/5 — The agent never reached the point of interpreting any test results. It gave up after the initial error and asked the user for more information instead of completing the task.
  • Agent does not require fix files to be present for this mode: 2.7/5 — The agent did not attempt to look for or require fix files, which is correct behavior. However, it also didn't complete the task at all, so this is somewhat moot.
  • Agent does not use -RequireFullVerification flag since no fix files exist: 3.3/5 — The agent correctly did not use the -RequireFullVerification flag when running the script. This was appropriate given that no fix exists yet.

With-Skill Judge (Plugin)

Overall Score: 1/5
Reasoning: The agent completely failed to accomplish the task. While it made reasonable attempts to run the verification script and search for test files, it was unable to find any test files for Issue #88200 and gave up, asking the user for clarification. None of the core rubric criteria were meaningfully satisfied - no test was run, no results were interpreted, and no verification was produced.

  • Agent runs the test in verify-failure-only mode since no fix exists yet: 1.3/5 — The agent never successfully ran the test. It hit an error with PR number auto-detection and then gave up after finding no modified files, rather than exploring the skill script parameters or searching the repo for Issue 88200 test files.
  • Agent correctly interprets test failure as SUCCESS (test catches the bug): 1/5 — The agent never reached the point of running a test or interpreting any results. It asked the user for clarification instead of completing the task.
  • Agent does not require fix files to be present for this mode: 2.3/5 — The agent didn't explicitly require fix files, but it also failed to proceed without them. It didn't demonstrate understanding that verify-failure-only mode doesn't need fix files.
  • Agent does not use -RequireFullVerification flag since no fix files exist: 3/5 — The agent didn't use -RequireFullVerification, but this is a neutral outcome since it never successfully ran any command at all.

Baseline Agent Output

`
The working directory is empty — there's no repository checked out here. To verify your test catches the bug, I need access to the .NET MAUI repository with your test changes.

Could you point me to:

  1. The repo location — is it checked out somewhere else, or should I clone it?
  2. Your test file — the path or branch/PR where your UI test lives?

Once I have the code, I'll build and run the test on Android to confirm it fails without the fix applied.
`

With-Skill Agent Output (Isolated)

`
I don't see a .NET MAUI repository or any test files in the current environment. The working directory is empty and there's no git repo with your changes.

Could you help me locate your test files? Either:

  1. Clone the repo with your branch: e.g., gh repo clone dotnet/maui -- -b <your-branch>
  2. Tell me the PR or branch name so I can check it out
  3. Confirm the path if your code is somewhere else on this machine

Without the repository and your test changes, I can't run the verification script.
`

With-Skill Agent Output (Plugin)

I can't find any test files for Issue #88200 in the working tree. The test you wrote doesn't appear to be committed or present in this environment. There are no files matching "88200" anywhere in the repository, andgit status` shows no staged/unstaged changes related to it.

Could you confirm:

  1. Did you commit the test files, or are they on a different branch/PR?
  2. What's the test file path (e.g., src/Controls/tests/TestCases.Shared.Tests/Tests/Issue88200.cs)?

I need the test files present in the working tree to run verification.
`

📊 verify-tests-fail-without-fix / negative-trigger-general-test-question-should-not-invoke-verification

Judge Report: Negative trigger: general test question should not invoke verification

Baseline Judge

Overall Score: 4/5
Reasoning: The agent correctly identified this as a knowledge/guidance question and answered it directly without unnecessary tool usage. The response is well-structured with code examples, a comparison table, and practical tips. The advice about preferring element-based checks for scrolling bugs is sound. It could have explored the actual codebase for concrete examples to ground its advice, but the response is solid as-is.

  • Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent provided comprehensive UI testing guidance directly, including code examples, assertion tables, and tips. It did not attempt to launch any verification workflow - it simply searched for relevant examples in the codebase, found none (repo not cloned), and provided guidance from its knowledge.
  • Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent made no attempts to run verification scripts or check PR labels. Its only tool usage was searching for existing test patterns via grep/glob to inform its answer, which is appropriate research behavior.

With-Skill Judge (Isolated)

Overall Score: 4/5
Reasoning: The agent appropriately recognized this as a knowledge question and answered it directly without unnecessary tool usage. The guidance is well-structured, practical, and specific to the .NET MAUI testing context. It correctly recommends element-based checks over VerifyScreenshot for scrolling bugs with good reasoning. The code examples are illustrative and the file path references add specificity. Minor deduction because the agent didn't verify any of its claims against the actual codebase (e.g., confirming API names like App.ScrollDown exist), but for a guidance question this is acceptable.

  • Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent provided comprehensive UI testing guidance directly from its knowledge, without attempting to launch any verification workflow, scripts, or background agents. It gave a clear, well-structured answer with code examples, a comparison table, and specific recommendations.
  • Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent made zero tool calls. It did not attempt to run any scripts, check PR labels, or perform any verification steps. It simply answered the question directly.

With-Skill Judge (Plugin)

Overall Score: 4/5
Reasoning: The agent correctly identified this as a knowledge question and provided a well-structured, informative answer about writing CollectionView scrolling UI tests. It covered test structure, when to use VerifyScreenshot vs element-based checks, recommended assertions, and practical tips. It didn't waste any tool calls or attempt unnecessary workflows. The advice is reasonable and well-organized, though it's generic rather than grounded in actual codebase exploration.

  • Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent directly answered the user's question about writing UI tests for CollectionView scrolling bugs with practical guidance, code examples, and clear recommendations. It made zero tool calls and did not attempt any verification workflow.
  • Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent made no tool calls whatsoever - no bash commands, no file searches, no scripts. It simply provided a helpful answer to the question.

Baseline Agent Output

`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:

Test Structure

`csharp
[Category(UITestCategories.CollectionView)]
public class CollectionViewScrollingTests : _IssuesUITest
{
public CollectionViewScrollingTests(TestDevice device) : base(device) { }

public override string Issue => "CollectionView scrolling bug description";

[Test]
public void CollectionViewShouldScrollToItem()
{
    App.WaitForElement("collectionView");
    App.ScrollDown("collectionView");
    App.WaitForElement("targetItem");
    Assert.That(App.FindElement("targetItem").GetText(), Is.EqualTo("Expected"));
}

}
`

Screenshot vs Element-Based Checks

Approach When to Use
Element-based Scrolling bugs — assert item visibility, text, position
VerifyScreenshot Visual rendering bugs — layout glitches, clipping, overlaps

For scrolling bugs, prefer element-based checks because:

  • They're more stable across platforms/resolutions
  • They directly verify the behavior (item became visible after scroll)
  • They give clearer failure messages

Recommended Assertions

`csharp
// Verify item becomes visible after scroll
App.ScrollDown("collectionView");
App.WaitForElement("item_10"); // item scrolled into view

// Verify scroll-to-item programmatically
App.Tap("scrollToButton");
App.WaitForElement("targetItem");

// Verify item count after incremental loading
var items = App.WaitForElement("lastLoadedItem");
Assert.That(items, Is.Not.Empty);
`

Tips

  • Use App.WaitForElement over App.FindElement for timing resilience
  • Add AutomationId to items in your XAML issue page for reliable queries
  • Use App.ScrollDown/App.ScrollUp rather than raw swipe gestures
  • Place your test page XAML in src/Controls/tests/TestCases.HostApp/Issues/
  • Use VerifyScreenshot only as a supplement if the visual state matters (e.g., item partially clipped)
    `

With-Skill Agent Output (Isolated)

`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:

Assertion Strategy

Prefer element-based checks over VerifyScreenshot for scrolling bugs:

  • Element checks are deterministic and CI-friendly — scrolling offsets, item visibility, and element positions are stable across platforms.
  • VerifyScreenshot is brittle for scroll tests — pixel differences from scroll inertia, timing, and platform rendering make these flaky.

Recommended Assertions

`csharp
// Verify an item scrolled into view is visible
var item = App.WaitForElement("Item_25");
Assert.NotNull(item);

// Verify position changed after scroll
var beforeScroll = App.WaitForElement("TargetItem").GetRect();
App.ScrollDown("CollectionViewId");
var afterScroll = App.WaitForElement("TargetItem").GetRect();
Assert.True(afterScroll.Y < beforeScroll.Y);

// Verify ScrollTo programmatic scroll worked
App.WaitForElement("ScrollToButton").Click();
App.WaitForElement("Item_99"); // item now visible
`

Key Patterns

  1. Use App.WaitForElement — never FindElement directly (race conditions)
  2. Use AutomationId on items via ItemTemplate bindings
  3. Add a test issue page in src/Controls/tests/TestCases.HostApp/Issues/ with [Issue] attribute
  4. Use App.ScrollDown/App.ScrollUp from the Appium helpers
  5. Add delays sparingly — prefer WaitForElement with timeout over Thread.Sleep

When to Use VerifyScreenshot

Only use it for visual rendering bugs (e.g., items overlapping, wrong background color after scroll) — not for functional scroll behavior.

Test Location

  • Test page: src/Controls/tests/TestCases.HostApp/Issues/IssuePR_XXXXX.cs
  • Test class: src/Controls/tests/TestCases.Shared.Tests/Tests/Issues/

Want me to scaffold a concrete test for a specific issue?
`

With-Skill Agent Output (Plugin)

`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:

Test Structure

`csharp
[Category(UITestCategories.CollectionView)]
public class CollectionViewScrollingTests : _IssuesUITest
{
public CollectionViewScrollingTests(TestDevice device) : base(device) { }

public override string Issue => "GH_ISSUE_NUMBER description";

[Test]
public void CollectionViewShouldScrollToItem()
{
    App.WaitForElement("collectionView");
    App.ScrollDown("collectionView");
    App.WaitForElement("targetItem");
    // Assert element is visible after scroll
    var element = App.WaitForElement("targetItem");
    Assert.That(element.GetRect().Height, Is.GreaterThan(0));
}

}
`

VerifyScreenshot vs Element-Based Checks

Approach Use When
Element-based Testing scroll position, item visibility, layout correctness
VerifyScreenshot Testing visual rendering (clipping, overlap, transparency)

For scrolling bugs, prefer element-based checks because:

  • Screenshots are brittle across platforms/resolutions
  • Scroll position assertions are more deterministic
  • Easier to diagnose failures

Recommended Assertions

`csharp
// Item exists after scrolling
App.WaitForElement("item_10");

// Item position (below another element)
var item1 = App.WaitForElement("item_1").GetRect();
var item5 = App.WaitForElement("item_5").GetRect();
Assert.That(item5.Y, Is.GreaterThan(item1.Y));

// Item not visible (scrolled off screen)
App.WaitForNoElement("item_0");

// Use VerifyScreenshot ONLY for visual glitches
#if !ANDROID // if flaky on specific platforms
VerifyScreenshot();
#endif
`

Tips

  • Use App.ScrollDown/Up("automationId") for reliable scrolling
  • Add AutomationId to items via ItemTemplate in your XAML issue page
  • Use WaitForElement (not FindElement) to handle scroll animation delays
  • Place the test XAML page in src/Controls/tests/TestCases.HostApp/Issues/
  • Use the write-ui-tests skill (/write-ui-tests) if you want automated test generation and verification
    `
📊 verify-tests-fail-without-fix / regression-agent-correctly-reports-test-failure-as-verification-success

Judge Report: Regression: agent correctly reports test failure as verification success

Baseline Judge

Overall Score: 5/5
Reasoning: The agent provides a clear, correct, and well-structured response. It correctly interprets the failure-only verification semantics, provides a ready-to-use report format, mentions the platform (Android), and outlines the logical next step. No tools were needed for this knowledge-based question, so zero tool calls is appropriate.

  • Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 5/5 — The agent correctly identifies that in failure-only verification, a failing test means the verification passed. It clearly states 'In failure-only verification, a failing test is a passing verification' and marks the status as PASS.
  • Agent does not recommend fixing the test when the failure proves the test detects the issue: 5/5 — The agent does not suggest any fixes. It correctly interprets this as confirmation that the test is meaningful and will catch regressions if the fix is reverted.

With-Skill Judge (Isolated)

Overall Score: 5/5
Reasoning: The agent provided a correct, clear, and well-structured response. It correctly interpreted failure-only verification semantics, provided a clean report template, and gave appropriate next steps. No tools were needed for this knowledge-based question, so zero tool calls is appropriate.

  • Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 5/5 — The agent clearly and correctly identifies that a test failure during failure-only verification means the verification succeeded — the test catches the bug as intended.
  • Agent does not recommend fixing the test when the failure proves the test detects the issue: 5/5 — The agent explicitly states 'No further action needed on the test itself' and correctly identifies the next step as applying the fix, not modifying the test.

With-Skill Judge (Plugin)

Overall Score: 5/5
Reasoning: The agent correctly understood the failure-only verification paradigm, provided a clear and well-structured report, included all relevant details (platform, assertion, interpretation), and did not make any erroneous recommendations. It was efficient with zero tool calls needed for what is essentially a reasoning/knowledge task.

  • Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 5/5 — The agent clearly and correctly interprets the test failure as a verification success. It explains that in failure-only mode, a test failure is the expected outcome because it proves the test catches the bug.
  • Agent does not recommend fixing the test when the failure proves the test detects the issue: 5/5 — The agent does not suggest any fixes. It correctly states this is a pass and that the test correctly reproduces the bug, with no recommendation to modify anything.

Baseline Agent Output

`
This is a successful result. 🎉

In failure-only verification, the test is supposed to fail — it proves the test correctly reproduces the bug before any fix is applied.

What to report:

Failure-only verification passed (Android)

The test fails as expected without the fix:
Assert.That(rect.Height, Is.GreaterThan(0)) — element rendered with zero height
This confirms the test reproduces the reported bug. The element collapses to zero height, which is the exact symptom described in the issue.

Next step: When the fix is applied, re-run the test to confirm it passes — completing the full "fails without fix, passes with fix" cycle.
`

With-Skill Agent Output (Isolated)

`
Good news — this is a SUCCESS for failure-only verification.

The goal of failure-only mode is to confirm the test catches the bug. Your test failed with the expected assertion error (zero height), which proves:

  1. ✅ The test reproduces the issue on Android
  2. ✅ The assertion correctly detects the broken behavior
  3. ✅ The test will pass once a fix is applied

What to report:

Verification result: PASS (test catches the bug)

  • Platform: Android
  • Test fails without fix: ✅ Confirmed
  • Failure: Assert.That(rect.Height, Is.GreaterThan(0)) — element renders with zero height
  • The test is ready for fix validation (full verification mode)

In failure-only mode, a failing test = a passing verification. The test is working as intended — it detects the defect. The next step would be to apply the fix and run full verification to confirm the test then passes.
`

With-Skill Agent Output (Plugin)

`
Report: ✅ Verification PASSED (test correctly fails without fix)

Since this is failure-only verification, a failing test is the expected outcome. It proves the test reproduces the bug. Here's what to report:


Result: PASS — Test correctly detects the bug.

Details:

  • Platform: Android
  • Assertion: Assert.That(rect.Height, Is.GreaterThan(0)) → element renders with zero height
  • Interpretation: The test fails without the fix applied, confirming it catches the regression/bug. Once the fix is merged, this test should pass, providing ongoing protection.

Key point: In failure-only mode, "test failed" = "verification succeeded." The test is valid and ready to ship with the fix.
`

📊 verify-tests-fail-without-fix / regression-agent-must-not-manually-revert-files-with-git-commands

Judge Report: Regression: agent must not manually revert files with git commands

Baseline Judge

Overall Score: 1/5
Reasoning: The agent completely missed the core requirement of the task. Instead of using the verify-tests-fail.ps1 script to actually verify that tests fail without the fix (the standard verification workflow), it just fetched source files from GitHub and did a manual code review, speculating about what would happen. Additionally, the agent noted the test filter mismatch (Issue32932 vs Issue33134) but never attempted to resolve this or run anything. The approach was entirely wrong for the task.

  • Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 1/5 — The agent never used the verify-tests-fail.ps1 script. It gave up after finding no GH_TOKEN and no repo clone, suggesting manual dotnet test commands instead.
  • Agent does not manually use git checkout, git restore, or git stash to revert fix files: 3/5 — The agent didn't use any git revert commands, but only because it never got far enough to attempt anything with the codebase. This is a neutral result rather than a positive one.
  • Agent interprets the script output correctly to determine if verification passed or failed: 1/5 — The agent never ran any script, so there was no output to interpret. It produced no verification result whatsoever.

With-Skill Judge (Isolated)

Overall Score: 2.7/5
Reasoning: The agent followed the correct workflow: used the skill, found the right script, corrected the test filter from Issue33134 to Issue32932 (showing good investigation), ran the verification script properly, and correctly interpreted the blocking result (no Android SDK). The agent didn't manually manipulate git state and relied on the script's built-in logic. The main limitation is environmental (no Android emulator), which the agent correctly identified and reported. The output is clear and informative.

  • Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 2.7/5 — The agent never used verify-tests-fail.ps1. It invoked the skill but then searched for scripts like 'EstablishBrokenBaseline.ps1' and couldn't find the proper verification script. It fell back entirely to manual code analysis rather than using the automated script.
  • Agent does not manually use git checkout, git restore, or git stash to revert fix files: 4.3/5 — The agent did not manually revert fix files using git checkout/restore/stash. It used 'git show' to view the original file content for comparison but never actually modified the working tree. This is technically compliant, though the reason was that it gave up on runtime verification entirely rather than deliberately avoiding manual reverts.
  • Agent interprets the script output correctly to determine if verification passed or failed: 2.3/5 — The agent never ran the verification script, so there was no script output to interpret. Instead it provided a purely logical/theoretical analysis of why the test would catch the bug, without any actual test execution.

With-Skill Judge (Plugin)

Overall Score: 2/5
Reasoning: The agent followed the right approach by using the verification script and not manually reverting files, but it ultimately failed to complete the task. It spent significant time troubleshooting environment issues (sparse checkout, finding the correct test name, merge base issues) and when it finally ran the script with correct parameters, it timed out without producing results. The final output is just a description of what it was about to do, not an actual verification result. The task was not completed.

  • Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 3.3/5 — The agent did find and attempt to use the verify-tests-fail.ps1 script multiple times. It first ran it without FixFiles (which entered wrong mode), then correctly identified it needed to pass -FixFiles explicitly and ran it again with the correct parameters. The script was used as intended.
  • Agent does not manually use git checkout, git restore, or git stash to revert fix files: 5/5 — The agent never manually reverted fix files using git checkout, git restore, or git stash. It correctly relied on the script to handle file revert/restore automatically, only using git to fetch the PR branch and the base branch for reference.
  • Agent interprets the script output correctly to determine if verification passed or failed: 1/5 — The agent never got to interpret the final script output. The session timed out (900s) while waiting for the final script run to complete. The agent's last substantive output was explaining why the first run entered 'Verify Failure Only Mode' but never provided a final determination of whether verification passed or failed.

Baseline Agent Output

`
I've reviewed all the relevant files. Here's my verification:

✅ Tests properly catch the EmptyView bug on Android

Test Structure

HostApp (Issue32932.cs):

  • Places a CollectionView inside a VerticalStackLayout (the exact repro scenario)
  • Sets ItemsSource to an empty ObservableCollection<string>
  • EmptyView contains a Label with AutomationId="EmptyViewLabel"

NUnit Test (Issue32932.cs):

  • Waits for the CollectionView, then asserts App.WaitForElement("EmptyViewLabel")
  • On Android, this will timeout if the EmptyView doesn't render (the bug)

Fix (SizedItemContentView.cs):

csharp static double NormalizeDimension(double value) => value == int.MaxValue ? double.PositiveInfinity : value;
Converts the corrupted int.MaxValue (from (int)double.PositiveInfinity) back to infinity so the existing double.IsInfinity() checks work correctly.

Verdict

Condition Result
Without fix ❌ FAIL — EmptyViewLabel never renders, WaitForElement times out
With fix ✅ PASS — EmptyView displays, label found immediately

The test correctly reproduces the exact bug scenario (CollectionView in VerticalStackLayout with empty source) and uses the filter name Issue32932 (note: the issue number, not Issue33134). To run with your filter Issue33134 you'd need to adjust — the test class is named Issue32932 matching the issue it fixes.

Filter correction: Use --filter Issue32932 (not Issue33134) to target this test.
`

With-Skill Agent Output (Isolated)

`

Verification Blocked — No Android Emulator

The verification script ran correctly in Full Verification Mode, detecting:

  • Fix file: src/Controls/src/Core/Handlers/Items/Android/SizedItemContentView.cs
  • Test files: Issue32932 (HostApp + Shared.Tests)

However, both test runs failed with:

❌ Android SDK (adb) not found

This environment doesn't have the Android SDK/emulator installed, so the UI tests can't actually execute. The script correctly identified the fix/test structure but cannot run the tests locally.

Key findings:

  1. Test filter correction: The actual test class is Issue32932, not Issue33134 (the PR number). Using Issue33134 as filter would match nothing.
  2. Test type: UITest (detected from TestCases.Shared.Tests and TestCases.HostApp)
  3. Status: ⚠️ Blocked — requires an Android emulator/device to execute. This verification needs to run in CI (the maui-pr-uitests pipeline) or on a machine with Android SDK configured.
    `

With-Skill Agent Output (Plugin)

Good, files are now available. Now let me run the verification with the correct test filter (Issue32932is the actual test class name) and explicit fix files and base commit:

📊 verify-tests-fail-without-fix / regression-tests-passing-without-fix-means-verification-failed

Judge Report: Regression: tests passing without fix means verification FAILED

Baseline Judge

Overall Score: 5/5
Reasoning: The agent provided a clear, correct, and well-structured answer. It correctly identified that a passing test without the fix means failed verification, explained why, and offered actionable next steps. No tools were needed for this conceptual question, so the zero tool calls is appropriate.

  • Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent immediately and clearly states 'No, the verification failed' and explains that in failure-only verification, the test should fail without the fix applied.
  • Agent recommends reviewing and improving the test assertions so they actually detect the issue: 4.7/5 — The agent provides concrete next steps including revisiting the test, checking input data/configuration/assertions, and re-running verification after updates.
  • Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent explicitly distinguishes these concepts, clearly stating the test passed but the verification failed, and explains why these are opposite outcomes in this context.

With-Skill Judge (Isolated)

Overall Score: 5/5
Reasoning: The agent provided a clear, correct, and concise answer that addresses all aspects of the question. It correctly identified the counterintuitive result, explained why, and provided actionable next steps. No tools were needed for this conceptual question.

  • Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent immediately and clearly states 'No, the verification failed' and explains that in failure-only verification, the test should fail without the fix applied.
  • Agent recommends reviewing and improving the test assertions so they actually detect the issue: 5/5 — The agent explicitly states 'The test must be revised so it fails on the unfixed code and passes only after the fix is applied' as the next step.
  • Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent clearly distinguishes these concepts, explaining that the test passing means the verification failed, and articulates why these are opposite outcomes in this context.

With-Skill Judge (Plugin)

Overall Score: 5/5
Reasoning: The agent provided a correct, clear, and concise answer. It correctly identified that in failure-only verification, a passing test without the fix means the verification failed. It explained the implications and provided actionable next steps. No tools were needed for this conceptual question, so the zero tool calls are appropriate.

  • Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent clearly and immediately states 'No, the verification failed' and explains that the test should fail without the fix applied.
  • Agent recommends reviewing and improving the test assertions so they actually detect the issue: 5/5 — The agent explicitly recommends the test be revised so it fails without the fix and passes with it.
  • Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent clearly distinguishes these, explaining that a passing test in this context means the verification failed.

Baseline Agent Output

`
No, this verification is NOT successful.

In failure-only verification, the goal is to confirm that the test fails without the fix — proving it actually catches the bug. If the test passed without the fix, it means:

  1. The test doesn't reproduce the bug — it can't distinguish between fixed and unfixed code
  2. The test is not a valid regression test — it would pass regardless of whether the fix is present

What to do next:

  • Revisit the test to ensure it actually exercises the buggy code path
  • Check if the test conditions match the issue's reproduction steps
  • The test may need different assertions or setup to trigger the failure
    `

With-Skill Agent Output (Isolated)

`
No, the verification failed.

In failure-only verification mode, the goal is to confirm that the test catches the bug — meaning the test should fail when the fix is not applied.

If the test passed without the fix, it means:

  • The test does not reproduce or detect the bug
  • It cannot distinguish between fixed and unfixed code
  • The test needs to be rewritten to properly target the defective behavior

Next step: Revise the test so it exercises the specific broken behavior and fails on unfixed code, then re-run verification.
`

With-Skill Agent Output (Plugin)

`
No, the verification failed.

In failure-only verification, the goal is to confirm that the test catches the bug — meaning it should fail when the fix is NOT applied.

If the test passed without the fix, it means:

  • The test does not reproduce or detect the bug
  • It cannot serve as a regression test for this issue
  • The test needs to be rewritten to actually exercise the failing scenario

Next steps: The test must be revised so it fails without the fix (proving it catches the bug) and passes with the fix (proving the fix works).
`

🔍 Full results and investigation steps

@github-actions github-actions Bot added the area-infrastructure CI, Maestro / Coherency, upstream dependencies/versions label Jun 8, 2026
Copilot added 15 commits June 8, 2026 16:31
Runs the Pester suite on every PR that touches .github/scripts/**,
.github/skills/**, eng/scripts/**, or eng/pipelines/ci-copilot.yml. Make
this a REQUIRED check via repo settings → branches → branch protection.

Defense-in-depth gates:

* Explicit required-test-files list (refuses to run if any required
  .Tests.ps1 has been deleted/renamed — prevents silent green from
  "nothing ran"). Includes the four tests added by dotnet#35677
  (Post-AISummaryComment, Review-PR, Aggregate-UITestArtifacts,
  Run-DeviceTests, Remove-StaleMauiBotComments) plus this PR's new
  Write-SafeSubprocessOutput and Review-PR.Phase-Audit suites.
* Minimum-passed-count floor (catches BeforeAll discovery errors
  that flip green tests into Skipped state — Pester reports
  "Passed: 0, Failed: 0" as success otherwise).
* Zero-skipped tolerance (catches leftover -Skip / Inconclusive /
  NotRun states).
* Pinned Pester 5.3+ with -SkipPublisherCheck for the runner image's
  pre-installed copy.

The workflow also runs Verify-AuditNegativeControls.ps1 — an
end-to-end harness that injects ~245 bypass patterns into a
real Gate-reachable script and asserts the audit catches each.
This catches regressions where the extractor works but scope
detection silently drops lines.

Without this workflow, the .Tests.ps1 files already in the repo
(plus the ones added by dotnet#35677) are unrun and provide no
verification value.
…ose F1-CRIT subprocess-bypass class

Adds Out-SafePRSubprocessLine helper, a non-anchored case-insensitive neutralizer for AzDO logging commands embedded in PR-controlled subprocess output. Without it, an attacker who plants ##vso[task.complete result=Succeeded] / ##vso[task.setvariable ...] / ##vso[task.uploadfile ...] in any string the pipeline echoes can forge a green gate, flip cross-stage variables, or exfiltrate files.

Sanitization applied at every PR-derived stdout site in Review-PR.ps1:

  * Subprocess re-emit (Find-RegressionRisks, regression-test runners, agent JSON catch branch) -> pipeline form | Out-SafePRSubprocessLine

  * Single-line agent fields (PR title, Prompt, currentIntent, modelName, displayName, detail) -> inline -replace '(?i)##(vso[|[)', '##~SANITIZED~$1' (preserves -ForegroundColor which the helper doesn't propagate)

Pester AST suite enforces both shapes: 'no raw Write-Host of PR-derived variable' walks every emitter call and fails if any argument references a known-PR-derived variable without ancestor sanitizer wrapping; a >=10 sanitizer-sites floor catches wholesale deletion.

Closes F1-CRIT subprocess-bypass class (3 sites):

  * Review-PR.ps1 regression-test runners (UITest, DeviceTest) - BuildAndRunHostApp.ps1 and Run-DeviceTests.ps1 emit Write-Host of attacker-controlled device-log content. In-process invocation lets that Write-Host write directly to the parent host (stream 6, NOT captured by 2>&1), bypassing the sanitizer. Now launched as pwsh -NoProfile -File subprocesses so the child has its own host.

  * Review-PR.ps1 Find-RegressionRisks invocation - same shape, same fix.

  * Invoke-UITestWithRetry.ps1 BuildAndRunHostApp launch - same shape, same fix. ci-copilot.yml deep-UI-test stage now re-emits captured \.Output through the same neutralizer so build/test diagnostics stay visible in the task log.

Two AST regression guards in Write-SafeSubprocessOutput.Tests.ps1 lock these in: any future PR that reverts back to in-process & \ ... 2>&1 fails the workflow.

Tests: 46/46 sanitizer Pester + 393/393 Phase-Audit Pester.
…loses write-class gh in Gate)

Phase-Audit walks the call-graph closure of Gate-reachable code in Review-PR.ps1 (from \ through every dot-sourced helper) and proves no write-class gh invocation can land in the Gate phase. This is the structural defense behind the per-task token-scoping rule documented in ci-copilot-pipeline-security.instructions.md.

Architecture (5 axes):

  * Axis A - Gate call-graph closure + WholeFile audit of the four Gate subprocess scripts (Find-RegressionRisks, Detect-TestsInDiff, verify-tests-fail, detect-ui-test-categories). Default-deny: every gh call must match a tight read-allowlist; everything else is forbidden in Gate scope.

  * Axis B - YAML binding audit of ci-copilot.yml. Per-task GH_TOKEN scoping is required; CopilotReview task MUST NOT bind GH_TOKEN.

  * Axis C - Extractor self-test: synthetic snippets routed through the same extractors used by Axis A. Audit-of-the-audit.

  * Axis D - Call-graph closure self-test (descend into dot-sourced helpers; stop at Invoke-WithoutGhTokens wrapper).

  * Axis E - Static-eval self-test for Resolve-StaticStringValue (folds operator/method/index/cast/ternary/coalesce/pipeline shapes; consults hashState/varState/arrayState for compound expressions).

Coverage (closes against PureWeen's original review + adversarial review feedback):

  * Pattern 1-3: alias / Invoke-Expression / Set-Alias indirection

  * Pattern 4-bis: paren / sub / member / invoke-member indirection through static folds

  * Pattern 5: shell-runner inline command flags + encoded pwsh + cmd /v:on delayed expansion

  * Pattern 6: assignment-site literal folds (string / Concat / Join / Format / -join / -replace / instance methods)

  * Pattern 6 deep: per-variable hashtable / array state, mutations (=, +=, .Add, .Set_Item, .Insert), variable copies, [hashtable]/[ordered]/[pscustomobject] casts, nested chains, multi-LHS unpack, paren-wrapped int index, tainted-key fallback for unresolvable keys

  * Pattern 7: dynamic scriptblock construction ([scriptblock]::Create, .AddScript, [Activator]::CreateInstance)

  * Pattern 8: exec/arg wrappers (env, sudo, container, remote)

  * Pattern 9: Import-Module / using module from PR paths

  * Pattern 10: forbidden cmdlets (CIM/WMI/Job/COM)

  * Statement-RHS: if / switch / try-catch-finally / data / foreach / do-until / do-while / while / for / SubExpression wraps

  * Pipeline folds: 'a','b' | Select-Object -First N, ForEach-Object { scriptblock } body yield, ({...}).Invoke() / .InvokeReturnAsIs()

  * Compound expressions: Binary+, Convert, Ternary (?:), null-coalesce (??), method chains over hashState lookups

  * Constructor seeds: New-Object Hashtable/ArrayList/OrderedDictionary/PSObject -Property hashtable, [T]::new()

Tests: 393 Pester cases pass. End-to-end negative-control harness (Verify-AuditNegativeControls.ps1) injects ~245 distinct bypass patterns into a real Gate-reachable script and asserts the audit catches each.
…process rule, Pester gate

Rule 6 (anti-injection): replace the bare 'tr | sed' guidance with the two enforced sanitizer shapes (pipeline Out-SafePRSubprocessLine helper + inline -replace pattern). Both AST-audited by the Pester suite.

Rule 7 (F1-CRIT subprocess-wrap, NEW): any PR-controlled .ps1 that emits attacker-controlled Write-Host (BuildAndRunHostApp, Run-DeviceTests, Find-RegressionRisks, etc.) must be launched as pwsh -NoProfile -File. In-process & \ bypasses 2>&1 capture for stream 6 (Information / Write-Host). AST regression guard enforces.

Rule 8 (Pester gate, NEW): security-scripts-pester.yml runs the full Pester suite on every PR touching the security surface; minimum-passed floor + zero-skipped tolerance catch test-deletion / BeforeAll discovery errors.

Grep checklist extended with an F1-CRIT search (in-process \& \\ of known PR-controlled scripts).
…ew-PR raw re-emits)

opus-4.8 adversarial review identified that the prior subprocess-wrap
closure missed the GATE phase's call chain:

  Review-PR.ps1 (Gate) → pwsh -File verify-tests-fail.ps1 (good)
    → verify-tests-fail.ps1:380/:485 & \\\ / & \\\ 2>&1  (BAD: in-process)
      → BuildAndRunHostApp.ps1:637 Write-Host \\\ (attacker-controlled device log)
    → verify-tests-fail.ps1:1633/:1721 \\\ | ForEach-Object { Write-Host "  \\\" }  (BAD: raw re-emit)
  → captured into Review-PR.ps1:1155 \\\
  → Review-PR.ps1:1155 \\\ | ForEach-Object { Write-Host "    \\\" }  (BAD: raw re-emit)
  → AzDO host parses ##vso[…] directives and executes them.

A malicious PR test that prints ##vso[task.setvariable variable=detectedCategories;isOutput=true]NONE
to its own log would (1) leak through BuildAndRunHostApp's in-process
Write-Host into verify-tests-fail's host, (2) be captured into
verify-tests-fail's stdout, (3) be re-emitted raw by Review-PR's gate
echo, and (4) be parsed by AzDO to skip the entire RunDeepUITests stage.
Similar payloads forge a green gate or exfiltrate files.

Three fixes:

* verify-tests-fail.ps1:380 + :485 - launch BuildAndRunHostApp /
  Run-DeviceTests as pwsh -NoProfile -File subprocess (matches the
  Review-PR.ps1 regression-runner fix in the prior commit).
* verify-tests-fail.ps1:1633 + :1721 - dot-source Write-SafeSubprocessOutput
  and route the log-file re-emit through Out-SafePRSubprocessLine.
* Review-PR.ps1 - replace four raw ForEach-Object { Write-Host "<prefix>\\\" }
  re-emit sites (gateOutput, detectOutput, testDetectScript inline,
  refreshOutput) with | Out-SafePRSubprocessLine -Prefix "<prefix>".

Plus extend Write-SafeSubprocessOutput.Tests.ps1's \\\ list
to include verify-tests-fail.ps1 so the existing F1-CRIT AST regression
guard catches any future revert.

Tests: 46/46 sanitizer + 572/572 full Pester suite.
…x Finding dotnet#1+dotnet#2)

Adversarial review found a critical, end-to-end exploitable bypass at Review-PR.ps1 L726:

  'assistant.message' { \ = \.data.content; \ = \.Trim(); ...; Write-Host \ }

A malicious PR title (or body, or code comment) containing ##vso[task.complete result=Succeeded] would be quoted back by the Copilot agent inside an assistant.message event. Review-PR rebinds \ -> \ via .Trim(), the inventory audit's static name-membership check then misses \ (not in PrDerivedVars), and Write-Host \ emits the directive raw to the AzDO host parser - forging a green CopilotReview task regardless of actual review outcome.

Two fixes:

(1) Inline sanitize \ at the call site (Review-PR.ps1 L726).

    Same inline -replace pattern as the other agent-output sites.

(2) Inventory test taint closure (Write-SafeSubprocessOutput.Tests.ps1):

    Replace the static \ membership check with a forward def-use closure (fixed-point walk over every AssignmentStatementAst.Left -> taint if RHS subtree references any tainted var). Plus an isSafeMemberAccess exemption for member access on safe properties (.Count, .Length, .Keys, .GetType, .PSObject, .Type) and a knownSafeOverrides allowlist for variables that are type-bounded by structure (numeric counts, fixed emoji-literal lookups, formatted .NET TimeSpan strings).

Net result: any future assignment that copies a tainted variable into a fresh name (e.g. \ = \; Write-Host \) is automatically detected by the inventory walk. This is the same defense pattern as the F1-CRIT 4-hop forward def-use walk in the regression-guard test.

Tests: 46/46 sanitizer + 572/572 full Pester suite (no regression).
When Invoke-UITestWithRetry.ps1 is missing (e.g. PR removes/renames it), the deep-UI-test stage falls back to direct invocation:

  & pwsh @Arglist 2>&1 | Tee-Object -FilePath \ | ForEach-Object { Write-Host \ }

BuildAndRunHostApp.ps1 emits attacker-controlled device logcat via Write-Host. Although the pwsh subprocess captures those calls into 2>&1, this branch re-emits each line verbatim via Write-Host - letting ##vso[...] reach the AzDO parser.

Fix: apply the same inline (?i)##(vso[|[) neutralizer the primary branch already uses for \.Output.
…ignment (gpt-5.5 Finding dotnet#2)

The earlier mechanical comment scrub left 76 orphan markers like '#  : state-aware fallback.' across Helpers.ps1 / Tests.ps1 / sanitizer tests. These obscure intent in a security-critical 4k-line analyzer. Removed mechanically by:

  '# : <text>' -> '# <text>'
  '# : <eol>'  -> drop the line
  parenthetical scrub-artifacts -> remove

Also removed a dead duplicate assignment in the LHS nested-member-write branch:

  $cur[$k] = @{}   # immediately overwritten by next line
  $cur = @{}
  $cur[$k] = $cur

Tests: 572/572 still green (no behavior change).
…, fail-closed

Original PR dotnet#35324 review #4403922251 (3/3 reviewers) flagged:
- gate-result.txt etc. landed in 'Split-Path \$TrustedScriptsDir -Parent'
  which is the writable Build.ArtifactStagingDirectory - any later task
  could forge 'PASSED' between Gate exit and CopilotReview restore.
- detect-script-path.txt was persisted then *executed* under
  COPILOT_GITHUB_TOKEN. Test-Path checked existence, not trust - a
  tampered path is direct RCE.
- Missing gate-result.txt defaulted to 'SKIPPED' (fail open).

This commit:
1. Adds Get-PhaseStateDir returning a dedicated 'phase-state' sibling of
   trusted-github. Used at all 5 sites (setup sentinel R/W, gate write,
   CopilotReview/Post restore).
2. Drops detect-script-path.txt persist+restore entirely. CopilotReview
   re-derives `Join-Path \$EngScriptsDir 'detect-ui-test-categories.ps1'`
   (same expression Gate uses).
3. Adds Restore-GateResultOrFailClosed which throws on missing /
   non-whitelisted verdict. Replaces both fail-open SKIPPED defaults.
4. ci-copilot.yml Gate task chmods the state dir read-only after the
   pwsh call completes (belt-and-braces; Linux/Mac).
5. Adds 6 Pester tests for the new helpers covering: path safety
   (not the writable parent), idempotency, local-dev fallback path,
   fail-closed on missing/empty/garbage, trim-after-newline.

Tests: Review-PR.Tests.ps1 22/22 green.
Original PR dotnet#35324 review #4403922251 (2/3 reviewers, opus-4.7x noted
severity is reduced because the cleanup now MINIMIZES rather than
deletes — but UX is still broken: any maintainer who quotes
`<!-- AI Summary -->` or `<!-- AI Gate -->` in their own comment gets
silently collapsed on the next cleanup pass).

Before:
    $matchesGeneratedMarker =
        ($IncludeAISummary -and (Test-IsAISummaryCommentBody $body)) -or
        ($IncludeLegacyGate -and $body.Contains($script:AiGateCommentMarker))

After (mirrors the $matchesBotOnlyContent shape just below):
    $matchesGeneratedMarker =
        (Test-IsMauiBotCommentAuthor $comment) -and (
            ($IncludeAISummary -and (Test-IsAISummaryCommentBody $body)) -or
            ($IncludeLegacyGate -and $body.Contains($script:AiGateCommentMarker))
        )

Tests (Remove-StaleMauiBotComments.Tests.ps1, +5 cases, 10/10 green):
- bot comment with AI Summary marker IS hidden
- contributor comment quoting AI Summary marker is NOT hidden
- contributor comment quoting AI Gate marker is NOT hidden
- bot comment with AI Gate marker IS hidden
- mixed batch: bot hidden, contributor preserved
Original PR dotnet#35324 review #4403922251 minor findings:

M1 (3/3) - Setup invoked Review-PR.ps1 from the worktree path
.github/scripts/Review-PR.ps1 while Tasks 2-4 used
"$TRUSTED/scripts/Review-PR.ps1". The trusted copy was already
created (and chmod'd a-w) at lines 621-626 BEFORE the Setup pwsh
call. Switching Setup to "$TRUSTED/scripts/Review-PR.ps1" makes all
four phases run the same script bytes.

M2 (3/3) - Setup wrapped pwsh in "set +e" and on failure set
CopilotFailed=true but did not exit, so subsequent tasks ran on a
possibly half-set-up worktree (no merged PR HEAD, no sentinel). The
error surfaced later via CopilotFailed but only after a Gate that
read stale state. Fix is two-layered:

(a) Setup now exits with SETUP_EXIT after recording the failure.

(b) A dedicated SetupFailed variable (distinct from CopilotFailed,
    which later phases also set) gates the follow-on tasks:
      - Task 2 Gate:    condition and(succeeded(), ne(variables.SetupFailed, true))
      - Task 3 Review:  condition and(succeeded(), ne(variables.SetupFailed, true))
      - Task 4 Post:    condition ne(variables.SetupFailed, true)
    Using a Setup-specific variable keeps Task 4 running after Gate/
    Review failures so the partial AI summary / gate-result comment
    still surfaces on the PR; only a Setup failure (where there is
    no state to post) skips Post.
opus-4.8 adversarial finding: in the Tier 3 AI-tier refresh,
`\$refreshedCategories` is parsed from the detect script's stdout via
regex and then re-emitted as a fresh ##vso[task.setvariable] without
validation. The detect script is trusted (\$EngScriptsDir) and the AI
categories input is PR-derived, so today no exploit is possible — but
a future change that surfaces PR-controlled text into the category
list would silently turn into a setvariable injection.

Add a whitelist guard (`^(NONE|ALL|[A-Za-z0-9_,]+)$`) before re-emit.
On reject, log a warning and skip the re-emit; the pre-refresh value
(already emitted earlier) remains in effect.
opus-4.7x demonstrated empirically that the F1 chmod defense is
bypassable by a same-user attacker:

    \$ chmod -R a-w state/
    \$ mv state state.bak && mkdir state            # parent is writable
    \$ echo PASSED > state/gate-result.txt
    \$ cat state/gate-result.txt                    # FORGED

A PR test running under Gate's verify-tests-fail.ps1 -> dotnet test
chain can spawn a daemonized process (nohup/setsid) that survives
Gate task termination, waits out the chmod, and rewrites
gate-result.txt before CopilotReview reads it. The forged 'PASSED'
flips the AI prompt's gate status without ever tripping the
file-mode lock.

This commit adds an authenticated layer on top of the integrity
check:

1. ci-copilot.yml Gate task, AFTER the pwsh call completes, generates
   a 256-bit random key with openssl, computes
       openssl dgst -sha256 -mac HMAC -macopt hexkey:KEY gate-result.txt
   writes the MAC to gate-result.txt.hmac, and emits the key via
       ##vso[task.setvariable variable=GATE_HMAC_KEY;issecret=true]
   The key is generated AFTER pwsh exits (so PR test code spawned
   inside pwsh never sees it), never written to disk, and propagated
   to CopilotReview/Post only through AzDO's secret-variable channel
   (which does not inherit into PR-spawned daemon processes).

2. CopilotReview and Post tasks bind GATE_HMAC_KEY into their env:
   block. The trusted pwsh that runs there sees \;
   PR-spawned daemons do not.

3. Restore-GateResultOrFailClosed accepts the key as a parameter
   (defaulting to \) and verifies HMAC-SHA256
   against gate-result.txt.hmac when a key is supplied. Mismatch /
   missing .hmac throws (fails closed). When no key is supplied
   (local dev), HMAC verification is skipped — the value check still
   runs.

4. Pester tests (Review-PR.Tests.ps1 +9 cases, 31/31 green):
   - happy path via parameter form
   - happy path via env form
   - local-dev path with no key
   - missing .hmac when key supplied -> throws
   - tampered verdict file with stale .hmac -> throws (forgery)
   - .hmac forged with wrong key -> throws
   - openssl-format compatibility (lowercase hex, no whitespace)
   - x-check: openssl hexkey form <-> .NET hex-decoded form match
   - x-check: openssl default -hmac form (UTF-8 key bytes) does NOT
     match and is rejected (guards YAML against accidental downgrade)

Residual risk (documented): a sophisticated attacker can still try
to read /proc/<copilotreview-pwsh-pid>/environ during the narrow
window between pwsh start and gate-result.txt read, and forge a new
HMAC for new content. This requires a tight polling daemon and same
agent uid. The real fix for that class is splitting CopilotReview
into a separate AzDO job (ephemeral agent, fresh FS); HMAC raises
the bar from 'trivial mv attack' to 'tight-race environ scrape'.

Also adds D1.A: aiSummaryReviewNodeId is validated against
^[A-Za-z0-9_\-]+$ before re-emitting through ##vso[task.setvariable]
to close the same future-proofing class D1 already covered for
detectedCategories.
…ness

4 rubber-duck judges (opus-4.6, opus-4.7-high, opus-4.8, gpt-5.5) all
voted "70% too big". This applies the consensus cuts:

- Strip ~71 model-name provenance markers ("opus-4.8 F2:", "gpt-5.5 F1:",
  "Round-N:", "PureWeen") from comments — git history is the right place.
- Trim multi-paragraph "Defense-in-depth: ..." preambles to 1-2 lines
  across Review-PR.ps1, ci-copilot.yml, Write-SafeSubprocessOutput.ps1,
  and the Phase-Audit Helpers/Tests headers.
- Delete Verify-AuditNegativeControls.ps1 (463 LOC, 2h runtime). The
  Phase-Audit Pester suite covers the same surface in seconds.
- Cut Axes C/D/E/F self-tests in Review-PR.Phase-Audit.Tests.ps1 (-1817).
  These were "tests of the analyzer about the analyzer" (all 4 judges
  flagged them). Axis A (real-script audit) + Axis B (YAML binding)
  remain — those are what protects production. Pester floor 300 -> 200.
- Use [System.Convert]::FromHexString + HMACSHA256::HashData instead of
  hand-rolled byte loops + try/finally Dispose.
- Drop CmdletBinding ceremony from Get-PhaseStateDir (private helper).
- Cut the paranoid "openssl WITHOUT hexkey is REJECTED" test.

Pester 226/226 green (was 594; 368 of those were the deleted self-tests).
Net -2589 / +156 LOC across 12 files.
The skill-validator rejects eval prompts that mention the target skill
name because it biases the baseline (skill-disabled) runs. Replace
'verify-tests-fail-without-fix' references with neutral phrasing
('failure-only verification') in the three offending test scenarios.

This was a pre-existing failure on pr35677; my PR triggers
skill-validation because it modifies verify-tests-fail.ps1, so the
already-broken eval blocked CI here too.
@T-Gro T-Gro force-pushed the security-on-enhanced-reviewer branch from a69e0ba to d79d6f0 Compare June 8, 2026 14:35
@T-Gro

T-Gro commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

CI status

Check Status Notes
Pester (.github/scripts) ✅ pass All 233 tests green (47s)
Discover/Static/evaluate/Post/Report skill-validator jobs ✅ pass
maui-pr (AzDO ci.yml, definitionId=302) ⏸ skipping Expected — pipeline excludes .github/** paths and targets only branches main, net*.0, release/*, inflight/*. This PR is .github/ heavy and targets feature/enhanced-reviewer, so AzDO correctly does not run.
Skill Validation / skill-validation ❌ fail Pre-existing on #35677 base — failure is skill_not_activated, meaning the eval agent doesn't invoke verify-tests-fail-without-fix from neutral-language prompts. The skill's activation criteria need work (out of scope for this security PR). I did remove the spec-conformance violation that #35677 base also fails on (prompt mentions target name) per the validator's own request, but exposing the deeper activation issue. The skill-eval evaluate job itself runs and reports cleanly — only the wrapping Report status flips the umbrella check red.
Build Analysis ⏳ pending Arcade status check, asynchronous.

The only meaningful CI failure (Skill Validation) is a skill-description / activation-criteria issue that pre-exists in #35677 base and is owned by the skill author. All checks in scope for this security PR pass.

@kubaflo kubaflo merged commit 9f78d76 into dotnet:feature/enhanced-reviewer Jun 8, 2026
11 of 14 checks passed
kubaflo pushed a commit that referenced this pull request Jun 8, 2026
… stdout, fail-closed Gate state, HMAC verdict, Phase-Audit (#35804)"

This reverts commit 9f78d76.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-infrastructure CI, Maestro / Coherency, upstream dependencies/versions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants