Security hardening for ci-copilot (#35677 stack): sanitize PR stdout, fail-closed Gate state, HMAC verdict, Phase-Audit#35804
Conversation
🔍 Skill Validation Results✅ Static Checks PassedSkills checked: 18 | Agents checked: 4 Full validator output❌ LLM Evaluation Failed0/1 skill(s) passed validation
❌ verify-tests-fail-without-fix: Skill regressed on task completion in one or more scenarios [NOT ACTIVATED (isolated) in 2 scenario(s)] [NOT ACTIVATED (plugin) in 3 scenario(s)] [1] High run-to-run variance (CV=3.99)
📊 verify-tests-fail-without-fix / edge-case-agent-uses-requirefullverification-when-fix-files-existJudge Report: Edge case: agent uses RequireFullVerification when fix files existBaseline JudgeOverall Score: 1/5
With-Skill Judge (Isolated)Overall Score: 1.3/5
With-Skill Judge (Plugin)Overall Score: 2/5
Baseline Agent Output
Here's what this verification requires: Full Verification Mode (test + fix):
What I need to proceed:
Could you provide the PR number (e.g., With-Skill Agent Output (Isolated)
Verification Result:
|
| Signal | Passed ✅ | Failed ❌ |
|---|---|---|
| Exit code | 0 |
1 |
| Console output | Passed! - Failed: 0 |
Failed! - Failed: N |
| TRX file | outcome="Passed" |
outcome="Failed" with stack trace |
| Appium log | Element assertions succeed | NoSuchElementException or assertion mismatch |
Verifying the Bug Is Caught
To confirm the test catches the bug without the fix:
- With the fix applied → test should pass (correct behavior verified)
- Without the fix (revert the code change, keep the test) → test should fail (proving the test detects the regression)
If the test passes in both scenarios, it doesn't actually catch the bug.
Prerequisites for iOS
- macOS with Xcode installed
- iOS Simulator booted (or physical device connected)
- Appium server running (
appium &) - App deployed to the simulator
Would you like me to help with anything else, such as cloning the repo or looking at the specific PR?
`
With-Skill Agent Output (Isolated)
`
This directory isn't a git repository, so there's no PR context to verify against. However, let me explain how the verification works based on your setup:
How Verification Works
Since the PR has both test files and a fix, the command would be:
powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 \ -Platform ios \ -TestFilter "FullyQualifiedName~Issue66666" \ -RequireFullVerification
How to interpret results:
| Script Output | Meaning |
|---|---|
VERIFICATION PASSED ✅ |
Tests fail without fix AND pass with fix — bug is caught ✅ |
VERIFICATION FAILED ❌ |
Tests pass without fix (don't detect bug) OR fail with fix (fix is broken) ❌ |
Output files will be in:
CustomAgentLogsTmp/PRState/<PRNumber>/PRAgent/gate/verify-tests-fail/ ├── verification-report.md # Detailed report ├── test-without-fix.log # Should show FAILURES └── test-with-fix.log # Should show PASSES
To actually run this, I need to be in the MAUI repository with the PR branch checked out. Would you like me to clone the repo and check out the PR branch, or is there a specific repo path I should work from?
`
With-Skill Agent Output (Plugin)
`
The working directory is empty and there's no MAUI repository cloned here. I can't run the verification script without the repository and PR checked out.
However, here's how you'll know if verification passed or failed:
Since the PR has both test files and a fix, the script runs in Full Verification mode. It performs two phases:
✅ VERIFICATION PASSED means:
- Without fix → tests FAIL (proves tests detect the bug)
- With fix → tests PASS (proves the fix resolves it)
❌ VERIFICATION FAILED means either:
- Tests PASS without the fix → tests don't catch the bug
- Tests FAIL with the fix → the fix doesn't work
The command to run (once the repo is checked out):
powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 \ -Platform ios \ -TestFilter "FullyQualifiedName~Issue66666" \ -RequireFullVerification
Output locations:
- Report:
CustomAgentLogsTmp/PRState/<PRNumber>/PRAgent/gate/verify-tests-fail/verification-report.md - Logs:
test-without-fix.logandtest-with-fix.login the same directory
`
📊 verify-tests-fail-without-fix / feature-agent-trusts-script-auto-detection-of-test-files-from-git-diff
Judge Report: Feature: agent trusts script auto-detection of test files from git diff
Baseline Judge
Overall Score: 1/5
Reasoning: The agent barely explored the problem space - it only checked if the current directory had files, found it empty, and gave up immediately. It made an incorrect definitive claim that no auto-detection script exists without actually investigating (e.g., cloning the repo, checking for scripts). The agent's response directly contradicts all three rubric criteria by telling the user they need to manually identify tests rather than confirming script auto-detection capabilities.
- Agent explains that the script can auto-detect test files from the PR diff: 1/5 — The agent explicitly stated the opposite — that there is no built-in verification script that auto-discovers which tests to run from a PR's changed files. It failed to find or acknowledge any such script.
- Agent does not require the user to manually specify every test file path: 1.7/5 — The agent directly told the user they'd need to 'Identify test files manually' using grep on the PR diff, placing the burden on the user.
- Agent trusts the script's git diff analysis rather than manually searching for test files: 1/5 — The agent didn't find, use, or trust any script. It suggested entirely manual approaches and concluded no automated detection exists.
With-Skill Judge (Isolated)
Overall Score: 3.3/5
Reasoning: The agent correctly answered the user's core question about whether the script can auto-detect test files. It couldn't actually run the verification due to environment constraints (no repo, no GH_TOKEN), but it handled this gracefully by explaining what's needed and confirming the script's capabilities. The agent could have read the skill's script files to provide more detailed/authoritative information about how the auto-detection works, but the answer is accurate and helpful.
- Agent explains that the script can auto-detect test files from the PR diff: 5/5 — The agent clearly explains that the verification script auto-detects test files from the git diff, listing the steps: detect changed files, auto-classify test type, auto-extract test class names, and run the appropriate test runner.
- Agent does not require the user to manually specify every test file path: 5/5 — The agent explicitly states 'you don't need to specify them manually' and shows the command without any test file paths specified.
- Agent trusts the script's git diff analysis rather than manually searching for test files: 5/5 — The agent trusts the script's built-in capability to analyze the git diff and doesn't attempt to manually find or specify test files. It recommends running the script with just -Platform and -RequireFullVerification flags.
With-Skill Judge (Plugin)
Overall Score: 3.7/5
Reasoning: The agent correctly answered the user's question about whether the script can auto-detect tests, explained the mechanism clearly, and appropriately noted the environmental limitations preventing actual execution. It didn't waste excessive steps and recovered gracefully from the missing auth/repo issues. The only minor gap is it couldn't actually run the verification, but that's an environmental constraint, not an agent failure.
- Agent explains that the script can auto-detect test files from the PR diff: 5/5 — The agent clearly explains that the script auto-detects test type from changed files in the git diff, auto-generates the test filter from changed test class names, and routes to the appropriate test runner.
- Agent does not require the user to manually specify every test file path: 5/5 — The agent explicitly states 'No -TestFilter or -TestType needed — both are auto-detected' and shows the command without manual file specification.
- Agent trusts the script's git diff analysis rather than manually searching for test files: 5/5 — The agent fully trusts the script's capabilities and recommends using it with just -Platform and -RequireFullVerification flags, relying on its git diff analysis.
Baseline Agent Output
`
The current directory is empty and not a git repository. There's no .NET MAUI repository cloned here, so there's no verification script available to inspect.
To answer your question directly: The .NET MAUI repository does not have a built-in script that automatically detects which tests to run from a PR's changed files. You'd typically need to:
- Check the PR's changed files on GitHub (
gh pr view 77777 --json files) - Identify test projects manually based on which area was changed (Core, Controls, Essentials, etc.)
- Run the relevant test project, e.g.:
bash dotnet test src/Controls/tests/Core.UnitTests/Controls.Core.UnitTests.csproj -f net9.0-android
Would you like me to:
- Clone the repo and inspect PR #77777's changed files to identify the right tests?
- Or help you set up a workflow to run specific tests for this PR?
`
With-Skill Agent Output (Isolated)
I can't proceed with verifying PR #77777 — the environment has no git repository cloned and no GitHub token configured forgh` CLI access.
To answer your question: Yes, the verification script auto-detects test files from the git diff. You don't need to specify them manually. It will:
- Detect changed test files from the PR's diff
- Auto-determine the test type (UITest, UnitTest, DeviceTest, etc.)
- Auto-build a test filter from detected test classes
What's needed to run it:
- Clone the MAUI repo with the PR branch checked out
- Have
GH_TOKENset (or be authenticated withgh) - Run:
powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 -Platform android -RequireFullVerification
Would you like me to set up the repository here, or is there another environment where this should run?
`
With-Skill Agent Output (Plugin)
`
The environment doesn't have GitHub CLI authentication configured, and the working directory isn't a git repo. To run the verification script, I need:
- A cloned repo with the PR branch checked out (the script uses
git diffto auto-detect test files and fix files) GH_TOKENset forghCLI access
To answer your question: Yes, the script absolutely can figure out which tests to run on its own! Here's how it works:
- It analyzes the
git diffbetween the PR branch and base branch - It auto-detects the test type (UITest, UnitTest, DeviceTest, XamlUnitTest) based on file paths
- It auto-builds the test filter from changed test class names
- It determines if fix files exist to choose Full Verification vs Failure-Only mode
However, I can't run it right now because this environment lacks:
- A cloned
dotnet/mauirepo with PR #77777 checked out - GitHub authentication (
GH_TOKEN)
If you can point me to a cloned repo with the PR branch, or set up auth, I can run:
powershell pwsh scripts/verify-tests-fail.ps1 -Platform android -PRNumber 77777
No -TestFilter or -TestType needed — it auto-detects both from the diff.
`
📊 verify-tests-fail-without-fix / happy-path-full-verification-mode-with-fix-files-present
Judge Report: Happy path: full verification mode with fix files present
Baseline Judge
Overall Score: 1/5
Reasoning: The agent gave up almost immediately after finding an empty directory and a failed skill invocation. It made no attempt to clone the repository, find relevant scripts, or actually perform the verification task. The output is essentially a help request back to the user rather than completing the assigned task.
- Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent did not run any tests at all. It checked the working directory, found it empty, and stopped.
- Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — No tests were run in any phase.
- Agent uses the verify-tests-fail script rather than manually running dotnet test: 1/5 — The agent attempted to invoke a non-existent skill 'verify-tests-fail-without-fix' but did not look for or use any verify-tests-fail script in the environment.
- Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1.3/5 — No results were produced to interpret.
With-Skill Judge (Isolated)
Overall Score: 1/5
Reasoning: The agent made one attempt to run the script, encountered a 'not a git repository' error, and immediately gave up by asking the user for more information. It didn't explore the skill's files/documentation to understand how to set up the environment, didn't try alternative approaches (like checking if there's a repo elsewhere, reading the script to understand its parameters like -RepoRoot or -PRNumber), and didn't attempt any recovery. The task was essentially not completed at all.
- Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent never ran any tests at all. It encountered a missing git repo error and stopped.
- Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — No tests were ever run, so no second phase occurred.
- Agent uses the verify-tests-fail script rather than manually running dotnet test: 2.3/5 — The agent did attempt to use the script (correct approach), but failed to set up the environment needed for it to work and gave up rather than finding an alternative path.
- Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1/5 — No results were produced to interpret. The agent ended with a question to the user instead of completing the task.
With-Skill Judge (Plugin)
Overall Score: 1/5
Reasoning: The agent failed to complete the verification task. While it correctly identified and attempted to use the verify-tests-fail.ps1 script, it gave up when dependencies were missing rather than attempting the manual verification approach it described. The agent could have read the script to understand the logic, identified test and fix files in the diff, and performed the revert-test-restore-test cycle manually. Instead it produced only a 'blocked' message with suggestions for the user.
- Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug): 1/5 — The agent never ran any tests. It encountered missing infrastructure scripts and stopped, reporting the verification as blocked.
- Agent then restores the fix and runs tests again, checking that they PASS: 1/5 — The agent never reached this phase. No tests were run at all.
- Agent uses the verify-tests-fail script rather than manually running dotnet test: 2.3/5 — The agent attempted to use the verify-tests-fail.ps1 script twice (with and without PR number), which is the correct approach. However, when it failed due to missing dependencies, the agent gave up rather than trying alternative approaches or manual verification.
- Agent correctly interprets the two-phase result (fail-then-pass = verification success): 1.3/5 — No results were produced to interpret. The agent reported BLOCKED without completing any verification phase.
Baseline Agent Output
The working directory is empty and there's no repository cloned here. I'm also unable to invoke theverify-tests-fail-without-fix` skill in this environment.
To run full verification (test fails without fix, passes with fix), I need:
- The dotnet/maui repository cloned locally
- The PR branch checked out (with both the test and fix)
- Android SDK + OpenJDK 17 configured for building
Manual workflow (what the skill would do):
`
1. Clone and checkout PR
gh repo clone dotnet/maui && cd maui
gh pr checkout <PR_NUMBER>
2. Run test WITH fix (should pass)
dotnet test --filter "Issue54321" -f net9.0-android
3. Revert fix files only, keep test
git stash push --
4. Run test WITHOUT fix (should fail)
dotnet test --filter "Issue54321" -f net9.0-android
5. Restore fix
git stash pop
`
Would you like me to clone the repository and attempt this here, or is there a specific environment/PR number I should target?
`
With-Skill Agent Output (Isolated)
`
The script requires a git repository context. This working directory isn't a git repo — the skill needs to run inside the actual MAUI repository where the PR branch is checked out.
Could you provide:
- The PR number (so I can check it out), or
- The path to the repo where the PR is already checked out?
Without a cloned repo with the PR branch, the verification script can't detect fix files or revert them to test both states.
`
With-Skill Agent Output (Plugin)
The required supporting scripts (EstablishBrokenBaseline.ps1, Detect-TestsInDiff.ps1`, etc.) don't exist in this environment. This is a partial checkout or the repository is missing CI infrastructure scripts.
Verification blocked — the verify-tests-fail.ps1 script depends on shared scripts that aren't present in this environment:
.github/scripts/EstablishBrokenBaseline.ps1.github/scripts/shared/Detect-TestsInDiff.ps1.github/scripts/shared/Write-SafeSubprocessOutput.ps1
These are CI infrastructure scripts that provide merge-base detection, test file identification, and log sanitization.
What you can do:
- Run in full CI — trigger the
maui-prpipeline where these scripts exist - Run manually — on a full repo clone with all scripts present:
powershell pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1
-Platform android -TestFilter "Issue54321" -RequireFullVerification -PRNumber 54321
` - Manual equivalent — revert the fix files, run the test (expect failure), restore files, run again (expect pass):
bash git stash push -- <fix-files> # remove fix dotnet test <project> --filter "Issue54321" # should FAIL git stash pop # restore fix dotnet test <project> --filter "Issue54321" # should PASS
Would you like me to attempt the manual approach if you can point me to the fix files and test project?
`
📊 verify-tests-fail-without-fix / happy-path-verify-failure-only-mode-for-test-creation
Judge Report: Happy path: verify failure only mode for test creation
Baseline Judge
Overall Score: 1/5
Reasoning: The agent failed to accomplish the task. After finding the working directory empty, it simply asked the user for more information rather than attempting to clone the repository, locate the test, or run it. The agent made only 3 tool calls and gave up very quickly without exploring alternative approaches to find or access the code.
- Agent runs the test in verify-failure-only mode since no fix exists yet: 1/5 — The agent did not run any test at all. It checked the working directory, found it empty, and gave up by asking the user for more information instead of attempting to clone the repository or find another way to proceed.
- Agent correctly interprets test failure as SUCCESS (test catches the bug): 1/5 — The agent never reached the point of running a test or interpreting any results. It could not demonstrate understanding of the concept that a test failure = success in this context.
- Agent does not require fix files to be present for this mode: 2.7/5 — The agent didn't explicitly require fix files, but it also didn't demonstrate understanding that this is a verify-failure-only scenario where fix files aren't needed. It simply asked for repository location and test file path.
- Agent does not use -RequireFullVerification flag since no fix files exist: 3/5 — The agent didn't use any flags at all since it never ran anything. It technically didn't violate this criterion, but it also didn't demonstrate any knowledge of the flag system.
With-Skill Judge (Isolated)
Overall Score: 1/5
Reasoning: The agent completely failed to accomplish the task. After the initial script execution failed due to missing git repository, the agent did some basic file searching but never read the skill documentation or script contents to understand available parameters/modes. It should have examined the script to find a verify-failure-only mode, potentially cloned the necessary repo, or found another way to proceed. Instead, it gave up and asked the user for help after just a few exploratory commands.
- Agent runs the test in verify-failure-only mode since no fix exists yet: 1.3/5 — The agent attempted to run the script in what appears to be the correct mode (no full verification flag), but it failed due to a missing git repo and the agent gave up rather than exploring the skill context files to understand how to properly set up the environment or pass the correct parameters.
- Agent correctly interprets test failure as SUCCESS (test catches the bug): 1/5 — The agent never reached the point of interpreting any test results. It gave up after the initial error and asked the user for more information instead of completing the task.
- Agent does not require fix files to be present for this mode: 2.7/5 — The agent did not attempt to look for or require fix files, which is correct behavior. However, it also didn't complete the task at all, so this is somewhat moot.
- Agent does not use -RequireFullVerification flag since no fix files exist: 3.3/5 — The agent correctly did not use the -RequireFullVerification flag when running the script. This was appropriate given that no fix exists yet.
With-Skill Judge (Plugin)
Overall Score: 1/5
Reasoning: The agent completely failed to accomplish the task. While it made reasonable attempts to run the verification script and search for test files, it was unable to find any test files for Issue #88200 and gave up, asking the user for clarification. None of the core rubric criteria were meaningfully satisfied - no test was run, no results were interpreted, and no verification was produced.
- Agent runs the test in verify-failure-only mode since no fix exists yet: 1.3/5 — The agent never successfully ran the test. It hit an error with PR number auto-detection and then gave up after finding no modified files, rather than exploring the skill script parameters or searching the repo for Issue 88200 test files.
- Agent correctly interprets test failure as SUCCESS (test catches the bug): 1/5 — The agent never reached the point of running a test or interpreting any results. It asked the user for clarification instead of completing the task.
- Agent does not require fix files to be present for this mode: 2.3/5 — The agent didn't explicitly require fix files, but it also failed to proceed without them. It didn't demonstrate understanding that verify-failure-only mode doesn't need fix files.
- Agent does not use -RequireFullVerification flag since no fix files exist: 3/5 — The agent didn't use -RequireFullVerification, but this is a neutral outcome since it never successfully ran any command at all.
Baseline Agent Output
`
The working directory is empty — there's no repository checked out here. To verify your test catches the bug, I need access to the .NET MAUI repository with your test changes.
Could you point me to:
- The repo location — is it checked out somewhere else, or should I clone it?
- Your test file — the path or branch/PR where your UI test lives?
Once I have the code, I'll build and run the test on Android to confirm it fails without the fix applied.
`
With-Skill Agent Output (Isolated)
`
I don't see a .NET MAUI repository or any test files in the current environment. The working directory is empty and there's no git repo with your changes.
Could you help me locate your test files? Either:
- Clone the repo with your branch: e.g.,
gh repo clone dotnet/maui -- -b <your-branch> - Tell me the PR or branch name so I can check it out
- Confirm the path if your code is somewhere else on this machine
Without the repository and your test changes, I can't run the verification script.
`
With-Skill Agent Output (Plugin)
I can't find any test files for Issue #88200 in the working tree. The test you wrote doesn't appear to be committed or present in this environment. There are no files matching "88200" anywhere in the repository, andgit status` shows no staged/unstaged changes related to it.
Could you confirm:
- Did you commit the test files, or are they on a different branch/PR?
- What's the test file path (e.g.,
src/Controls/tests/TestCases.Shared.Tests/Tests/Issue88200.cs)?
I need the test files present in the working tree to run verification.
`
📊 verify-tests-fail-without-fix / negative-trigger-general-test-question-should-not-invoke-verification
Judge Report: Negative trigger: general test question should not invoke verification
Baseline Judge
Overall Score: 4/5
Reasoning: The agent correctly identified this as a knowledge/guidance question and answered it directly without unnecessary tool usage. The response is well-structured with code examples, a comparison table, and practical tips. The advice about preferring element-based checks for scrolling bugs is sound. It could have explored the actual codebase for concrete examples to ground its advice, but the response is solid as-is.
- Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent provided comprehensive UI testing guidance directly, including code examples, assertion tables, and tips. It did not attempt to launch any verification workflow - it simply searched for relevant examples in the codebase, found none (repo not cloned), and provided guidance from its knowledge.
- Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent made no attempts to run verification scripts or check PR labels. Its only tool usage was searching for existing test patterns via grep/glob to inform its answer, which is appropriate research behavior.
With-Skill Judge (Isolated)
Overall Score: 4/5
Reasoning: The agent appropriately recognized this as a knowledge question and answered it directly without unnecessary tool usage. The guidance is well-structured, practical, and specific to the .NET MAUI testing context. It correctly recommends element-based checks over VerifyScreenshot for scrolling bugs with good reasoning. The code examples are illustrative and the file path references add specificity. Minor deduction because the agent didn't verify any of its claims against the actual codebase (e.g., confirming API names like App.ScrollDown exist), but for a guidance question this is acceptable.
- Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent provided comprehensive UI testing guidance directly from its knowledge, without attempting to launch any verification workflow, scripts, or background agents. It gave a clear, well-structured answer with code examples, a comparison table, and specific recommendations.
- Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent made zero tool calls. It did not attempt to run any scripts, check PR labels, or perform any verification steps. It simply answered the question directly.
With-Skill Judge (Plugin)
Overall Score: 4/5
Reasoning: The agent correctly identified this as a knowledge question and provided a well-structured, informative answer about writing CollectionView scrolling UI tests. It covered test structure, when to use VerifyScreenshot vs element-based checks, recommended assertions, and practical tips. It didn't waste any tool calls or attempt unnecessary workflows. The advice is reasonable and well-organized, though it's generic rather than grounded in actual codebase exploration.
- Agent provides UI testing guidance without launching the verification workflow: 5/5 — The agent directly answered the user's question about writing UI tests for CollectionView scrolling bugs with practical guidance, code examples, and clear recommendations. It made zero tool calls and did not attempt any verification workflow.
- Agent does not attempt to run any verification scripts or check PR labels: 5/5 — The agent made no tool calls whatsoever - no bash commands, no file searches, no scripts. It simply provided a helpful answer to the question.
Baseline Agent Output
`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:
Test Structure
`csharp
[Category(UITestCategories.CollectionView)]
public class CollectionViewScrollingTests : _IssuesUITest
{
public CollectionViewScrollingTests(TestDevice device) : base(device) { }
public override string Issue => "CollectionView scrolling bug description";
[Test]
public void CollectionViewShouldScrollToItem()
{
App.WaitForElement("collectionView");
App.ScrollDown("collectionView");
App.WaitForElement("targetItem");
Assert.That(App.FindElement("targetItem").GetText(), Is.EqualTo("Expected"));
}
}
`
Screenshot vs Element-Based Checks
| Approach | When to Use |
|---|---|
| Element-based ✅ | Scrolling bugs — assert item visibility, text, position |
| VerifyScreenshot | Visual rendering bugs — layout glitches, clipping, overlaps |
For scrolling bugs, prefer element-based checks because:
- They're more stable across platforms/resolutions
- They directly verify the behavior (item became visible after scroll)
- They give clearer failure messages
Recommended Assertions
`csharp
// Verify item becomes visible after scroll
App.ScrollDown("collectionView");
App.WaitForElement("item_10"); // item scrolled into view
// Verify scroll-to-item programmatically
App.Tap("scrollToButton");
App.WaitForElement("targetItem");
// Verify item count after incremental loading
var items = App.WaitForElement("lastLoadedItem");
Assert.That(items, Is.Not.Empty);
`
Tips
- Use
App.WaitForElementoverApp.FindElementfor timing resilience - Add
AutomationIdto items in your XAML issue page for reliable queries - Use
App.ScrollDown/App.ScrollUprather than raw swipe gestures - Place your test page XAML in
src/Controls/tests/TestCases.HostApp/Issues/ - Use
VerifyScreenshotonly as a supplement if the visual state matters (e.g., item partially clipped)
`
With-Skill Agent Output (Isolated)
`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:
Assertion Strategy
Prefer element-based checks over VerifyScreenshot for scrolling bugs:
- Element checks are deterministic and CI-friendly — scrolling offsets, item visibility, and element positions are stable across platforms.
VerifyScreenshotis brittle for scroll tests — pixel differences from scroll inertia, timing, and platform rendering make these flaky.
Recommended Assertions
`csharp
// Verify an item scrolled into view is visible
var item = App.WaitForElement("Item_25");
Assert.NotNull(item);
// Verify position changed after scroll
var beforeScroll = App.WaitForElement("TargetItem").GetRect();
App.ScrollDown("CollectionViewId");
var afterScroll = App.WaitForElement("TargetItem").GetRect();
Assert.True(afterScroll.Y < beforeScroll.Y);
// Verify ScrollTo programmatic scroll worked
App.WaitForElement("ScrollToButton").Click();
App.WaitForElement("Item_99"); // item now visible
`
Key Patterns
- Use
App.WaitForElement— neverFindElementdirectly (race conditions) - Use AutomationId on items via
ItemTemplatebindings - Add a test issue page in
src/Controls/tests/TestCases.HostApp/Issues/with[Issue]attribute - Use
App.ScrollDown/App.ScrollUpfrom the Appium helpers - Add delays sparingly — prefer
WaitForElementwith timeout overThread.Sleep
When to Use VerifyScreenshot
Only use it for visual rendering bugs (e.g., items overlapping, wrong background color after scroll) — not for functional scroll behavior.
Test Location
- Test page:
src/Controls/tests/TestCases.HostApp/Issues/IssuePR_XXXXX.cs - Test class:
src/Controls/tests/TestCases.Shared.Tests/Tests/Issues/
Want me to scaffold a concrete test for a specific issue?
`
With-Skill Agent Output (Plugin)
`
Here's guidance for writing a CollectionView scrolling UI test in .NET MAUI:
Test Structure
`csharp
[Category(UITestCategories.CollectionView)]
public class CollectionViewScrollingTests : _IssuesUITest
{
public CollectionViewScrollingTests(TestDevice device) : base(device) { }
public override string Issue => "GH_ISSUE_NUMBER description";
[Test]
public void CollectionViewShouldScrollToItem()
{
App.WaitForElement("collectionView");
App.ScrollDown("collectionView");
App.WaitForElement("targetItem");
// Assert element is visible after scroll
var element = App.WaitForElement("targetItem");
Assert.That(element.GetRect().Height, Is.GreaterThan(0));
}
}
`
VerifyScreenshot vs Element-Based Checks
| Approach | Use When |
|---|---|
| Element-based ✅ | Testing scroll position, item visibility, layout correctness |
| VerifyScreenshot | Testing visual rendering (clipping, overlap, transparency) |
For scrolling bugs, prefer element-based checks because:
- Screenshots are brittle across platforms/resolutions
- Scroll position assertions are more deterministic
- Easier to diagnose failures
Recommended Assertions
`csharp
// Item exists after scrolling
App.WaitForElement("item_10");
// Item position (below another element)
var item1 = App.WaitForElement("item_1").GetRect();
var item5 = App.WaitForElement("item_5").GetRect();
Assert.That(item5.Y, Is.GreaterThan(item1.Y));
// Item not visible (scrolled off screen)
App.WaitForNoElement("item_0");
// Use VerifyScreenshot ONLY for visual glitches
#if !ANDROID // if flaky on specific platforms
VerifyScreenshot();
#endif
`
Tips
- Use
App.ScrollDown/Up("automationId")for reliable scrolling - Add
AutomationIdto items viaItemTemplatein your XAML issue page - Use
WaitForElement(notFindElement) to handle scroll animation delays - Place the test XAML page in
src/Controls/tests/TestCases.HostApp/Issues/ - Use the
write-ui-testsskill (/write-ui-tests) if you want automated test generation and verification
`
📊 verify-tests-fail-without-fix / regression-agent-correctly-reports-test-failure-as-verification-success
Judge Report: Regression: agent correctly reports test failure as verification success
Baseline Judge
Overall Score: 5/5
Reasoning: The agent provides a clear, correct, and well-structured response. It correctly interprets the failure-only verification semantics, provides a ready-to-use report format, mentions the platform (Android), and outlines the logical next step. No tools were needed for this knowledge-based question, so zero tool calls is appropriate.
- Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 5/5 — The agent correctly identifies that in failure-only verification, a failing test means the verification passed. It clearly states 'In failure-only verification, a failing test is a passing verification' and marks the status as PASS.
- Agent does not recommend fixing the test when the failure proves the test detects the issue: 5/5 — The agent does not suggest any fixes. It correctly interprets this as confirmation that the test is meaningful and will catch regressions if the fix is reverted.
With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent provided a correct, clear, and well-structured response. It correctly interpreted failure-only verification semantics, provided a clean report template, and gave appropriate next steps. No tools were needed for this knowledge-based question, so zero tool calls is appropriate.
- Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 5/5 — The agent clearly and correctly identifies that a test failure during failure-only verification means the verification succeeded — the test catches the bug as intended.
- Agent does not recommend fixing the test when the failure proves the test detects the issue: 5/5 — The agent explicitly states 'No further action needed on the test itself' and correctly identifies the next step as applying the fix, not modifying the test.
With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent correctly understood the failure-only verification paradigm, provided a clear and well-structured report, included all relevant details (platform, assertion, interpretation), and did not make any erroneous recommendations. It was efficient with zero tool calls needed for what is essentially a reasoning/knowledge task.
- Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug: 5/5 — The agent clearly and correctly interprets the test failure as a verification success. It explains that in failure-only mode, a test failure is the expected outcome because it proves the test catches the bug.
- Agent does not recommend fixing the test when the failure proves the test detects the issue: 5/5 — The agent does not suggest any fixes. It correctly states this is a pass and that the test correctly reproduces the bug, with no recommendation to modify anything.
Baseline Agent Output
`
This is a successful result. 🎉
In failure-only verification, the test is supposed to fail — it proves the test correctly reproduces the bug before any fix is applied.
What to report:
✅ Failure-only verification passed (Android)
The test fails as expected without the fix:
Assert.That(rect.Height, Is.GreaterThan(0)) — element rendered with zero height
This confirms the test reproduces the reported bug. The element collapses to zero height, which is the exact symptom described in the issue.
Next step: When the fix is applied, re-run the test to confirm it passes — completing the full "fails without fix, passes with fix" cycle.
`
With-Skill Agent Output (Isolated)
`
Good news — this is a SUCCESS for failure-only verification.
The goal of failure-only mode is to confirm the test catches the bug. Your test failed with the expected assertion error (zero height), which proves:
- ✅ The test reproduces the issue on Android
- ✅ The assertion correctly detects the broken behavior
- ✅ The test will pass once a fix is applied
What to report:
Verification result: PASS (test catches the bug)
- Platform: Android
- Test fails without fix: ✅ Confirmed
- Failure:
Assert.That(rect.Height, Is.GreaterThan(0))— element renders with zero height- The test is ready for fix validation (full verification mode)
In failure-only mode, a failing test = a passing verification. The test is working as intended — it detects the defect. The next step would be to apply the fix and run full verification to confirm the test then passes.
`
With-Skill Agent Output (Plugin)
`
Report: ✅ Verification PASSED (test correctly fails without fix)
Since this is failure-only verification, a failing test is the expected outcome. It proves the test reproduces the bug. Here's what to report:
Result: PASS — Test correctly detects the bug.
Details:
- Platform: Android
- Assertion:
Assert.That(rect.Height, Is.GreaterThan(0))→ element renders with zero height - Interpretation: The test fails without the fix applied, confirming it catches the regression/bug. Once the fix is merged, this test should pass, providing ongoing protection.
Key point: In failure-only mode, "test failed" = "verification succeeded." The test is valid and ready to ship with the fix.
`
📊 verify-tests-fail-without-fix / regression-agent-must-not-manually-revert-files-with-git-commands
Judge Report: Regression: agent must not manually revert files with git commands
Baseline Judge
Overall Score: 1/5
Reasoning: The agent completely missed the core requirement of the task. Instead of using the verify-tests-fail.ps1 script to actually verify that tests fail without the fix (the standard verification workflow), it just fetched source files from GitHub and did a manual code review, speculating about what would happen. Additionally, the agent noted the test filter mismatch (Issue32932 vs Issue33134) but never attempted to resolve this or run anything. The approach was entirely wrong for the task.
- Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 1/5 — The agent never used the verify-tests-fail.ps1 script. It gave up after finding no GH_TOKEN and no repo clone, suggesting manual dotnet test commands instead.
- Agent does not manually use git checkout, git restore, or git stash to revert fix files: 3/5 — The agent didn't use any git revert commands, but only because it never got far enough to attempt anything with the codebase. This is a neutral result rather than a positive one.
- Agent interprets the script output correctly to determine if verification passed or failed: 1/5 — The agent never ran any script, so there was no output to interpret. It produced no verification result whatsoever.
With-Skill Judge (Isolated)
Overall Score: 2.7/5
Reasoning: The agent followed the correct workflow: used the skill, found the right script, corrected the test filter from Issue33134 to Issue32932 (showing good investigation), ran the verification script properly, and correctly interpreted the blocking result (no Android SDK). The agent didn't manually manipulate git state and relied on the script's built-in logic. The main limitation is environmental (no Android emulator), which the agent correctly identified and reported. The output is clear and informative.
- Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 2.7/5 — The agent never used verify-tests-fail.ps1. It invoked the skill but then searched for scripts like 'EstablishBrokenBaseline.ps1' and couldn't find the proper verification script. It fell back entirely to manual code analysis rather than using the automated script.
- Agent does not manually use git checkout, git restore, or git stash to revert fix files: 4.3/5 — The agent did not manually revert fix files using git checkout/restore/stash. It used 'git show' to view the original file content for comparison but never actually modified the working tree. This is technically compliant, though the reason was that it gave up on runtime verification entirely rather than deliberately avoiding manual reverts.
- Agent interprets the script output correctly to determine if verification passed or failed: 2.3/5 — The agent never ran the verification script, so there was no script output to interpret. Instead it provided a purely logical/theoretical analysis of why the test would catch the bug, without any actual test execution.
With-Skill Judge (Plugin)
Overall Score: 2/5
Reasoning: The agent followed the right approach by using the verification script and not manually reverting files, but it ultimately failed to complete the task. It spent significant time troubleshooting environment issues (sparse checkout, finding the correct test name, merge base issues) and when it finally ran the script with correct parameters, it timed out without producing results. The final output is just a description of what it was about to do, not an actual verification result. The task was not completed.
- Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically: 3.3/5 — The agent did find and attempt to use the verify-tests-fail.ps1 script multiple times. It first ran it without FixFiles (which entered wrong mode), then correctly identified it needed to pass -FixFiles explicitly and ran it again with the correct parameters. The script was used as intended.
- Agent does not manually use git checkout, git restore, or git stash to revert fix files: 5/5 — The agent never manually reverted fix files using git checkout, git restore, or git stash. It correctly relied on the script to handle file revert/restore automatically, only using git to fetch the PR branch and the base branch for reference.
- Agent interprets the script output correctly to determine if verification passed or failed: 1/5 — The agent never got to interpret the final script output. The session timed out (900s) while waiting for the final script run to complete. The agent's last substantive output was explaining why the first run entered 'Verify Failure Only Mode' but never provided a final determination of whether verification passed or failed.
Baseline Agent Output
`
I've reviewed all the relevant files. Here's my verification:
✅ Tests properly catch the EmptyView bug on Android
Test Structure
HostApp (Issue32932.cs):
- Places a
CollectionViewinside aVerticalStackLayout(the exact repro scenario) - Sets
ItemsSourceto an emptyObservableCollection<string> - EmptyView contains a Label with
AutomationId="EmptyViewLabel"
NUnit Test (Issue32932.cs):
- Waits for the CollectionView, then asserts
App.WaitForElement("EmptyViewLabel") - On Android, this will timeout if the EmptyView doesn't render (the bug)
Fix (SizedItemContentView.cs):
csharp static double NormalizeDimension(double value) => value == int.MaxValue ? double.PositiveInfinity : value;
Converts the corrupted int.MaxValue (from (int)double.PositiveInfinity) back to infinity so the existing double.IsInfinity() checks work correctly.
Verdict
| Condition | Result |
|---|---|
| Without fix | ❌ FAIL — EmptyViewLabel never renders, WaitForElement times out |
| With fix | ✅ PASS — EmptyView displays, label found immediately |
The test correctly reproduces the exact bug scenario (CollectionView in VerticalStackLayout with empty source) and uses the filter name Issue32932 (note: the issue number, not Issue33134). To run with your filter Issue33134 you'd need to adjust — the test class is named Issue32932 matching the issue it fixes.
Filter correction: Use --filter Issue32932 (not Issue33134) to target this test.
`
With-Skill Agent Output (Isolated)
`
Verification Blocked — No Android Emulator
The verification script ran correctly in Full Verification Mode, detecting:
- Fix file:
src/Controls/src/Core/Handlers/Items/Android/SizedItemContentView.cs - Test files:
Issue32932(HostApp + Shared.Tests)
However, both test runs failed with:
❌ Android SDK (adb) not found
This environment doesn't have the Android SDK/emulator installed, so the UI tests can't actually execute. The script correctly identified the fix/test structure but cannot run the tests locally.
Key findings:
- Test filter correction: The actual test class is
Issue32932, notIssue33134(the PR number). UsingIssue33134as filter would match nothing. - Test type: UITest (detected from
TestCases.Shared.TestsandTestCases.HostApp) - Status:
⚠️ Blocked — requires an Android emulator/device to execute. This verification needs to run in CI (themaui-pr-uitestspipeline) or on a machine with Android SDK configured.
`
With-Skill Agent Output (Plugin)
Good, files are now available. Now let me run the verification with the correct test filter (Issue32932is the actual test class name) and explicit fix files and base commit:
📊 verify-tests-fail-without-fix / regression-tests-passing-without-fix-means-verification-failed
Judge Report: Regression: tests passing without fix means verification FAILED
Baseline Judge
Overall Score: 5/5
Reasoning: The agent provided a clear, correct, and well-structured answer. It correctly identified that a passing test without the fix means failed verification, explained why, and offered actionable next steps. No tools were needed for this conceptual question, so the zero tool calls is appropriate.
- Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent immediately and clearly states 'No, the verification failed' and explains that in failure-only verification, the test should fail without the fix applied.
- Agent recommends reviewing and improving the test assertions so they actually detect the issue: 4.7/5 — The agent provides concrete next steps including revisiting the test, checking input data/configuration/assertions, and re-running verification after updates.
- Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent explicitly distinguishes these concepts, clearly stating the test passed but the verification failed, and explains why these are opposite outcomes in this context.
With-Skill Judge (Isolated)
Overall Score: 5/5
Reasoning: The agent provided a clear, correct, and concise answer that addresses all aspects of the question. It correctly identified the counterintuitive result, explained why, and provided actionable next steps. No tools were needed for this conceptual question.
- Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent immediately and clearly states 'No, the verification failed' and explains that in failure-only verification, the test should fail without the fix applied.
- Agent recommends reviewing and improving the test assertions so they actually detect the issue: 5/5 — The agent explicitly states 'The test must be revised so it fails on the unfixed code and passes only after the fix is applied' as the next step.
- Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent clearly distinguishes these concepts, explaining that the test passing means the verification failed, and articulates why these are opposite outcomes in this context.
With-Skill Judge (Plugin)
Overall Score: 5/5
Reasoning: The agent provided a correct, clear, and concise answer. It correctly identified that in failure-only verification, a passing test without the fix means the verification failed. It explained the implications and provided actionable next steps. No tools were needed for this conceptual question, so the zero tool calls are appropriate.
- Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug: 5/5 — The agent clearly and immediately states 'No, the verification failed' and explains that the test should fail without the fix applied.
- Agent recommends reviewing and improving the test assertions so they actually detect the issue: 5/5 — The agent explicitly recommends the test be revised so it fails without the fix and passes with it.
- Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context: 5/5 — The agent clearly distinguishes these, explaining that a passing test in this context means the verification failed.
Baseline Agent Output
`
No, this verification is NOT successful.
In failure-only verification, the goal is to confirm that the test fails without the fix — proving it actually catches the bug. If the test passed without the fix, it means:
- The test doesn't reproduce the bug — it can't distinguish between fixed and unfixed code
- The test is not a valid regression test — it would pass regardless of whether the fix is present
What to do next:
- Revisit the test to ensure it actually exercises the buggy code path
- Check if the test conditions match the issue's reproduction steps
- The test may need different assertions or setup to trigger the failure
`
With-Skill Agent Output (Isolated)
`
No, the verification failed.
In failure-only verification mode, the goal is to confirm that the test catches the bug — meaning the test should fail when the fix is not applied.
If the test passed without the fix, it means:
- The test does not reproduce or detect the bug
- It cannot distinguish between fixed and unfixed code
- The test needs to be rewritten to properly target the defective behavior
Next step: Revise the test so it exercises the specific broken behavior and fails on unfixed code, then re-run verification.
`
With-Skill Agent Output (Plugin)
`
No, the verification failed.
In failure-only verification, the goal is to confirm that the test catches the bug — meaning it should fail when the fix is NOT applied.
If the test passed without the fix, it means:
- The test does not reproduce or detect the bug
- It cannot serve as a regression test for this issue
- The test needs to be rewritten to actually exercise the failing scenario
Next steps: The test must be revised so it fails without the fix (proving it catches the bug) and passes with the fix (proving the fix works).
`
Runs the Pester suite on every PR that touches .github/scripts/**, .github/skills/**, eng/scripts/**, or eng/pipelines/ci-copilot.yml. Make this a REQUIRED check via repo settings → branches → branch protection. Defense-in-depth gates: * Explicit required-test-files list (refuses to run if any required .Tests.ps1 has been deleted/renamed — prevents silent green from "nothing ran"). Includes the four tests added by dotnet#35677 (Post-AISummaryComment, Review-PR, Aggregate-UITestArtifacts, Run-DeviceTests, Remove-StaleMauiBotComments) plus this PR's new Write-SafeSubprocessOutput and Review-PR.Phase-Audit suites. * Minimum-passed-count floor (catches BeforeAll discovery errors that flip green tests into Skipped state — Pester reports "Passed: 0, Failed: 0" as success otherwise). * Zero-skipped tolerance (catches leftover -Skip / Inconclusive / NotRun states). * Pinned Pester 5.3+ with -SkipPublisherCheck for the runner image's pre-installed copy. The workflow also runs Verify-AuditNegativeControls.ps1 — an end-to-end harness that injects ~245 bypass patterns into a real Gate-reachable script and asserts the audit catches each. This catches regressions where the extractor works but scope detection silently drops lines. Without this workflow, the .Tests.ps1 files already in the repo (plus the ones added by dotnet#35677) are unrun and provide no verification value.
…ose F1-CRIT subprocess-bypass class Adds Out-SafePRSubprocessLine helper, a non-anchored case-insensitive neutralizer for AzDO logging commands embedded in PR-controlled subprocess output. Without it, an attacker who plants ##vso[task.complete result=Succeeded] / ##vso[task.setvariable ...] / ##vso[task.uploadfile ...] in any string the pipeline echoes can forge a green gate, flip cross-stage variables, or exfiltrate files. Sanitization applied at every PR-derived stdout site in Review-PR.ps1: * Subprocess re-emit (Find-RegressionRisks, regression-test runners, agent JSON catch branch) -> pipeline form | Out-SafePRSubprocessLine * Single-line agent fields (PR title, Prompt, currentIntent, modelName, displayName, detail) -> inline -replace '(?i)##(vso[|[)', '##~SANITIZED~$1' (preserves -ForegroundColor which the helper doesn't propagate) Pester AST suite enforces both shapes: 'no raw Write-Host of PR-derived variable' walks every emitter call and fails if any argument references a known-PR-derived variable without ancestor sanitizer wrapping; a >=10 sanitizer-sites floor catches wholesale deletion. Closes F1-CRIT subprocess-bypass class (3 sites): * Review-PR.ps1 regression-test runners (UITest, DeviceTest) - BuildAndRunHostApp.ps1 and Run-DeviceTests.ps1 emit Write-Host of attacker-controlled device-log content. In-process invocation lets that Write-Host write directly to the parent host (stream 6, NOT captured by 2>&1), bypassing the sanitizer. Now launched as pwsh -NoProfile -File subprocesses so the child has its own host. * Review-PR.ps1 Find-RegressionRisks invocation - same shape, same fix. * Invoke-UITestWithRetry.ps1 BuildAndRunHostApp launch - same shape, same fix. ci-copilot.yml deep-UI-test stage now re-emits captured \.Output through the same neutralizer so build/test diagnostics stay visible in the task log. Two AST regression guards in Write-SafeSubprocessOutput.Tests.ps1 lock these in: any future PR that reverts back to in-process & \ ... 2>&1 fails the workflow. Tests: 46/46 sanitizer Pester + 393/393 Phase-Audit Pester.
…loses write-class gh in Gate)
Phase-Audit walks the call-graph closure of Gate-reachable code in Review-PR.ps1 (from \ through every dot-sourced helper) and proves no write-class gh invocation can land in the Gate phase. This is the structural defense behind the per-task token-scoping rule documented in ci-copilot-pipeline-security.instructions.md.
Architecture (5 axes):
* Axis A - Gate call-graph closure + WholeFile audit of the four Gate subprocess scripts (Find-RegressionRisks, Detect-TestsInDiff, verify-tests-fail, detect-ui-test-categories). Default-deny: every gh call must match a tight read-allowlist; everything else is forbidden in Gate scope.
* Axis B - YAML binding audit of ci-copilot.yml. Per-task GH_TOKEN scoping is required; CopilotReview task MUST NOT bind GH_TOKEN.
* Axis C - Extractor self-test: synthetic snippets routed through the same extractors used by Axis A. Audit-of-the-audit.
* Axis D - Call-graph closure self-test (descend into dot-sourced helpers; stop at Invoke-WithoutGhTokens wrapper).
* Axis E - Static-eval self-test for Resolve-StaticStringValue (folds operator/method/index/cast/ternary/coalesce/pipeline shapes; consults hashState/varState/arrayState for compound expressions).
Coverage (closes against PureWeen's original review + adversarial review feedback):
* Pattern 1-3: alias / Invoke-Expression / Set-Alias indirection
* Pattern 4-bis: paren / sub / member / invoke-member indirection through static folds
* Pattern 5: shell-runner inline command flags + encoded pwsh + cmd /v:on delayed expansion
* Pattern 6: assignment-site literal folds (string / Concat / Join / Format / -join / -replace / instance methods)
* Pattern 6 deep: per-variable hashtable / array state, mutations (=, +=, .Add, .Set_Item, .Insert), variable copies, [hashtable]/[ordered]/[pscustomobject] casts, nested chains, multi-LHS unpack, paren-wrapped int index, tainted-key fallback for unresolvable keys
* Pattern 7: dynamic scriptblock construction ([scriptblock]::Create, .AddScript, [Activator]::CreateInstance)
* Pattern 8: exec/arg wrappers (env, sudo, container, remote)
* Pattern 9: Import-Module / using module from PR paths
* Pattern 10: forbidden cmdlets (CIM/WMI/Job/COM)
* Statement-RHS: if / switch / try-catch-finally / data / foreach / do-until / do-while / while / for / SubExpression wraps
* Pipeline folds: 'a','b' | Select-Object -First N, ForEach-Object { scriptblock } body yield, ({...}).Invoke() / .InvokeReturnAsIs()
* Compound expressions: Binary+, Convert, Ternary (?:), null-coalesce (??), method chains over hashState lookups
* Constructor seeds: New-Object Hashtable/ArrayList/OrderedDictionary/PSObject -Property hashtable, [T]::new()
Tests: 393 Pester cases pass. End-to-end negative-control harness (Verify-AuditNegativeControls.ps1) injects ~245 distinct bypass patterns into a real Gate-reachable script and asserts the audit catches each.
…process rule, Pester gate Rule 6 (anti-injection): replace the bare 'tr | sed' guidance with the two enforced sanitizer shapes (pipeline Out-SafePRSubprocessLine helper + inline -replace pattern). Both AST-audited by the Pester suite. Rule 7 (F1-CRIT subprocess-wrap, NEW): any PR-controlled .ps1 that emits attacker-controlled Write-Host (BuildAndRunHostApp, Run-DeviceTests, Find-RegressionRisks, etc.) must be launched as pwsh -NoProfile -File. In-process & \ bypasses 2>&1 capture for stream 6 (Information / Write-Host). AST regression guard enforces. Rule 8 (Pester gate, NEW): security-scripts-pester.yml runs the full Pester suite on every PR touching the security surface; minimum-passed floor + zero-skipped tolerance catch test-deletion / BeforeAll discovery errors. Grep checklist extended with an F1-CRIT search (in-process \& \\ of known PR-controlled scripts).
…ew-PR raw re-emits)
opus-4.8 adversarial review identified that the prior subprocess-wrap
closure missed the GATE phase's call chain:
Review-PR.ps1 (Gate) → pwsh -File verify-tests-fail.ps1 (good)
→ verify-tests-fail.ps1:380/:485 & \\\ / & \\\ 2>&1 (BAD: in-process)
→ BuildAndRunHostApp.ps1:637 Write-Host \\\ (attacker-controlled device log)
→ verify-tests-fail.ps1:1633/:1721 \\\ | ForEach-Object { Write-Host " \\\" } (BAD: raw re-emit)
→ captured into Review-PR.ps1:1155 \\\
→ Review-PR.ps1:1155 \\\ | ForEach-Object { Write-Host " \\\" } (BAD: raw re-emit)
→ AzDO host parses ##vso[…] directives and executes them.
A malicious PR test that prints ##vso[task.setvariable variable=detectedCategories;isOutput=true]NONE
to its own log would (1) leak through BuildAndRunHostApp's in-process
Write-Host into verify-tests-fail's host, (2) be captured into
verify-tests-fail's stdout, (3) be re-emitted raw by Review-PR's gate
echo, and (4) be parsed by AzDO to skip the entire RunDeepUITests stage.
Similar payloads forge a green gate or exfiltrate files.
Three fixes:
* verify-tests-fail.ps1:380 + :485 - launch BuildAndRunHostApp /
Run-DeviceTests as pwsh -NoProfile -File subprocess (matches the
Review-PR.ps1 regression-runner fix in the prior commit).
* verify-tests-fail.ps1:1633 + :1721 - dot-source Write-SafeSubprocessOutput
and route the log-file re-emit through Out-SafePRSubprocessLine.
* Review-PR.ps1 - replace four raw ForEach-Object { Write-Host "<prefix>\\\" }
re-emit sites (gateOutput, detectOutput, testDetectScript inline,
refreshOutput) with | Out-SafePRSubprocessLine -Prefix "<prefix>".
Plus extend Write-SafeSubprocessOutput.Tests.ps1's \\\ list
to include verify-tests-fail.ps1 so the existing F1-CRIT AST regression
guard catches any future revert.
Tests: 46/46 sanitizer + 572/572 full Pester suite.
…x Finding dotnet#1+dotnet#2) Adversarial review found a critical, end-to-end exploitable bypass at Review-PR.ps1 L726: 'assistant.message' { \ = \.data.content; \ = \.Trim(); ...; Write-Host \ } A malicious PR title (or body, or code comment) containing ##vso[task.complete result=Succeeded] would be quoted back by the Copilot agent inside an assistant.message event. Review-PR rebinds \ -> \ via .Trim(), the inventory audit's static name-membership check then misses \ (not in PrDerivedVars), and Write-Host \ emits the directive raw to the AzDO host parser - forging a green CopilotReview task regardless of actual review outcome. Two fixes: (1) Inline sanitize \ at the call site (Review-PR.ps1 L726). Same inline -replace pattern as the other agent-output sites. (2) Inventory test taint closure (Write-SafeSubprocessOutput.Tests.ps1): Replace the static \ membership check with a forward def-use closure (fixed-point walk over every AssignmentStatementAst.Left -> taint if RHS subtree references any tainted var). Plus an isSafeMemberAccess exemption for member access on safe properties (.Count, .Length, .Keys, .GetType, .PSObject, .Type) and a knownSafeOverrides allowlist for variables that are type-bounded by structure (numeric counts, fixed emoji-literal lookups, formatted .NET TimeSpan strings). Net result: any future assignment that copies a tainted variable into a fresh name (e.g. \ = \; Write-Host \) is automatically detected by the inventory walk. This is the same defense pattern as the F1-CRIT 4-hop forward def-use walk in the regression-guard test. Tests: 46/46 sanitizer + 572/572 full Pester suite (no regression).
When Invoke-UITestWithRetry.ps1 is missing (e.g. PR removes/renames it), the deep-UI-test stage falls back to direct invocation: & pwsh @Arglist 2>&1 | Tee-Object -FilePath \ | ForEach-Object { Write-Host \ } BuildAndRunHostApp.ps1 emits attacker-controlled device logcat via Write-Host. Although the pwsh subprocess captures those calls into 2>&1, this branch re-emits each line verbatim via Write-Host - letting ##vso[...] reach the AzDO parser. Fix: apply the same inline (?i)##(vso[|[) neutralizer the primary branch already uses for \.Output.
…ignment (gpt-5.5 Finding dotnet#2) The earlier mechanical comment scrub left 76 orphan markers like '# : state-aware fallback.' across Helpers.ps1 / Tests.ps1 / sanitizer tests. These obscure intent in a security-critical 4k-line analyzer. Removed mechanically by: '# : <text>' -> '# <text>' '# : <eol>' -> drop the line parenthetical scrub-artifacts -> remove Also removed a dead duplicate assignment in the LHS nested-member-write branch: $cur[$k] = @{} # immediately overwritten by next line $cur = @{} $cur[$k] = $cur Tests: 572/572 still green (no behavior change).
…, fail-closed Original PR dotnet#35324 review #4403922251 (3/3 reviewers) flagged: - gate-result.txt etc. landed in 'Split-Path \$TrustedScriptsDir -Parent' which is the writable Build.ArtifactStagingDirectory - any later task could forge 'PASSED' between Gate exit and CopilotReview restore. - detect-script-path.txt was persisted then *executed* under COPILOT_GITHUB_TOKEN. Test-Path checked existence, not trust - a tampered path is direct RCE. - Missing gate-result.txt defaulted to 'SKIPPED' (fail open). This commit: 1. Adds Get-PhaseStateDir returning a dedicated 'phase-state' sibling of trusted-github. Used at all 5 sites (setup sentinel R/W, gate write, CopilotReview/Post restore). 2. Drops detect-script-path.txt persist+restore entirely. CopilotReview re-derives `Join-Path \$EngScriptsDir 'detect-ui-test-categories.ps1'` (same expression Gate uses). 3. Adds Restore-GateResultOrFailClosed which throws on missing / non-whitelisted verdict. Replaces both fail-open SKIPPED defaults. 4. ci-copilot.yml Gate task chmods the state dir read-only after the pwsh call completes (belt-and-braces; Linux/Mac). 5. Adds 6 Pester tests for the new helpers covering: path safety (not the writable parent), idempotency, local-dev fallback path, fail-closed on missing/empty/garbage, trim-after-newline. Tests: Review-PR.Tests.ps1 22/22 green.
Original PR dotnet#35324 review #4403922251 (2/3 reviewers, opus-4.7x noted severity is reduced because the cleanup now MINIMIZES rather than deletes — but UX is still broken: any maintainer who quotes `<!-- AI Summary -->` or `<!-- AI Gate -->` in their own comment gets silently collapsed on the next cleanup pass). Before: $matchesGeneratedMarker = ($IncludeAISummary -and (Test-IsAISummaryCommentBody $body)) -or ($IncludeLegacyGate -and $body.Contains($script:AiGateCommentMarker)) After (mirrors the $matchesBotOnlyContent shape just below): $matchesGeneratedMarker = (Test-IsMauiBotCommentAuthor $comment) -and ( ($IncludeAISummary -and (Test-IsAISummaryCommentBody $body)) -or ($IncludeLegacyGate -and $body.Contains($script:AiGateCommentMarker)) ) Tests (Remove-StaleMauiBotComments.Tests.ps1, +5 cases, 10/10 green): - bot comment with AI Summary marker IS hidden - contributor comment quoting AI Summary marker is NOT hidden - contributor comment quoting AI Gate marker is NOT hidden - bot comment with AI Gate marker IS hidden - mixed batch: bot hidden, contributor preserved
Original PR dotnet#35324 review #4403922251 minor findings: M1 (3/3) - Setup invoked Review-PR.ps1 from the worktree path .github/scripts/Review-PR.ps1 while Tasks 2-4 used "$TRUSTED/scripts/Review-PR.ps1". The trusted copy was already created (and chmod'd a-w) at lines 621-626 BEFORE the Setup pwsh call. Switching Setup to "$TRUSTED/scripts/Review-PR.ps1" makes all four phases run the same script bytes. M2 (3/3) - Setup wrapped pwsh in "set +e" and on failure set CopilotFailed=true but did not exit, so subsequent tasks ran on a possibly half-set-up worktree (no merged PR HEAD, no sentinel). The error surfaced later via CopilotFailed but only after a Gate that read stale state. Fix is two-layered: (a) Setup now exits with SETUP_EXIT after recording the failure. (b) A dedicated SetupFailed variable (distinct from CopilotFailed, which later phases also set) gates the follow-on tasks: - Task 2 Gate: condition and(succeeded(), ne(variables.SetupFailed, true)) - Task 3 Review: condition and(succeeded(), ne(variables.SetupFailed, true)) - Task 4 Post: condition ne(variables.SetupFailed, true) Using a Setup-specific variable keeps Task 4 running after Gate/ Review failures so the partial AI summary / gate-result comment still surfaces on the PR; only a Setup failure (where there is no state to post) skips Post.
opus-4.8 adversarial finding: in the Tier 3 AI-tier refresh, `\$refreshedCategories` is parsed from the detect script's stdout via regex and then re-emitted as a fresh ##vso[task.setvariable] without validation. The detect script is trusted (\$EngScriptsDir) and the AI categories input is PR-derived, so today no exploit is possible — but a future change that surfaces PR-controlled text into the category list would silently turn into a setvariable injection. Add a whitelist guard (`^(NONE|ALL|[A-Za-z0-9_,]+)$`) before re-emit. On reject, log a warning and skip the re-emit; the pre-refresh value (already emitted earlier) remains in effect.
opus-4.7x demonstrated empirically that the F1 chmod defense is
bypassable by a same-user attacker:
\$ chmod -R a-w state/
\$ mv state state.bak && mkdir state # parent is writable
\$ echo PASSED > state/gate-result.txt
\$ cat state/gate-result.txt # FORGED
A PR test running under Gate's verify-tests-fail.ps1 -> dotnet test
chain can spawn a daemonized process (nohup/setsid) that survives
Gate task termination, waits out the chmod, and rewrites
gate-result.txt before CopilotReview reads it. The forged 'PASSED'
flips the AI prompt's gate status without ever tripping the
file-mode lock.
This commit adds an authenticated layer on top of the integrity
check:
1. ci-copilot.yml Gate task, AFTER the pwsh call completes, generates
a 256-bit random key with openssl, computes
openssl dgst -sha256 -mac HMAC -macopt hexkey:KEY gate-result.txt
writes the MAC to gate-result.txt.hmac, and emits the key via
##vso[task.setvariable variable=GATE_HMAC_KEY;issecret=true]
The key is generated AFTER pwsh exits (so PR test code spawned
inside pwsh never sees it), never written to disk, and propagated
to CopilotReview/Post only through AzDO's secret-variable channel
(which does not inherit into PR-spawned daemon processes).
2. CopilotReview and Post tasks bind GATE_HMAC_KEY into their env:
block. The trusted pwsh that runs there sees \;
PR-spawned daemons do not.
3. Restore-GateResultOrFailClosed accepts the key as a parameter
(defaulting to \) and verifies HMAC-SHA256
against gate-result.txt.hmac when a key is supplied. Mismatch /
missing .hmac throws (fails closed). When no key is supplied
(local dev), HMAC verification is skipped — the value check still
runs.
4. Pester tests (Review-PR.Tests.ps1 +9 cases, 31/31 green):
- happy path via parameter form
- happy path via env form
- local-dev path with no key
- missing .hmac when key supplied -> throws
- tampered verdict file with stale .hmac -> throws (forgery)
- .hmac forged with wrong key -> throws
- openssl-format compatibility (lowercase hex, no whitespace)
- x-check: openssl hexkey form <-> .NET hex-decoded form match
- x-check: openssl default -hmac form (UTF-8 key bytes) does NOT
match and is rejected (guards YAML against accidental downgrade)
Residual risk (documented): a sophisticated attacker can still try
to read /proc/<copilotreview-pwsh-pid>/environ during the narrow
window between pwsh start and gate-result.txt read, and forge a new
HMAC for new content. This requires a tight polling daemon and same
agent uid. The real fix for that class is splitting CopilotReview
into a separate AzDO job (ephemeral agent, fresh FS); HMAC raises
the bar from 'trivial mv attack' to 'tight-race environ scrape'.
Also adds D1.A: aiSummaryReviewNodeId is validated against
^[A-Za-z0-9_\-]+$ before re-emitting through ##vso[task.setvariable]
to close the same future-proofing class D1 already covered for
detectedCategories.
…ness
4 rubber-duck judges (opus-4.6, opus-4.7-high, opus-4.8, gpt-5.5) all
voted "70% too big". This applies the consensus cuts:
- Strip ~71 model-name provenance markers ("opus-4.8 F2:", "gpt-5.5 F1:",
"Round-N:", "PureWeen") from comments — git history is the right place.
- Trim multi-paragraph "Defense-in-depth: ..." preambles to 1-2 lines
across Review-PR.ps1, ci-copilot.yml, Write-SafeSubprocessOutput.ps1,
and the Phase-Audit Helpers/Tests headers.
- Delete Verify-AuditNegativeControls.ps1 (463 LOC, 2h runtime). The
Phase-Audit Pester suite covers the same surface in seconds.
- Cut Axes C/D/E/F self-tests in Review-PR.Phase-Audit.Tests.ps1 (-1817).
These were "tests of the analyzer about the analyzer" (all 4 judges
flagged them). Axis A (real-script audit) + Axis B (YAML binding)
remain — those are what protects production. Pester floor 300 -> 200.
- Use [System.Convert]::FromHexString + HMACSHA256::HashData instead of
hand-rolled byte loops + try/finally Dispose.
- Drop CmdletBinding ceremony from Get-PhaseStateDir (private helper).
- Cut the paranoid "openssl WITHOUT hexkey is REJECTED" test.
Pester 226/226 green (was 594; 368 of those were the deleted self-tests).
Net -2589 / +156 LOC across 12 files.
The skill-validator rejects eval prompts that mention the target skill
name because it biases the baseline (skill-disabled) runs. Replace
'verify-tests-fail-without-fix' references with neutral phrasing
('failure-only verification') in the three offending test scenarios.
This was a pre-existing failure on pr35677; my PR triggers
skill-validation because it modifies verify-tests-fail.ps1, so the
already-broken eval blocked CI here too.
a69e0ba to
d79d6f0
Compare
CI status
The only meaningful CI failure ( |
9f78d76
into
dotnet:feature/enhanced-reviewer
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Stacked on #35677
This PR layers a security-hardening stack on top of @kubaflo's #35677 (
feature/enhanced-reviewer). When merged here, the combined work flows tomainvia #35677.Scope
Re-applies and updates the security audit work originally landed in #35324 against the latest #35677 base, plus three new defenses caught by adversarial review rounds.
phase-state/subdir; CopilotReview re-derives `` instead of restoring a persisted path; fail-closed on missing/garbage Gate verdictSplit-Path -Parentof\allowed PR test code to forgegate-result.txtand to redirect& pwsh -Fileexecution underCOPILOT_GITHUB_TOKENgate-result.txtvia per-jobopenssl rand -hex 32key passed to CopilotReview/Post through an AzDO secret variablechmod -R a-wis bypassable same-user (mv state state.bak && mkdir state && echo PASSED > ...). Key generated AFTER pwsh exits → PR-spawned daemons cannot read itOut-SafePRSubprocessLinehelper + inline-replace '(?i)##(vso\[|\[)', '##~SANITIZED~\'at every PR-controlledWrite-Hostsite +& pwsh -NoProfile -Filesubprocess wraps at every shared runner site##vso[task.setvariable / task.complete / task.uploadfile]AzDO logging commands through stdoutTest-IsMauiBotCommentAuthor<!-- AI Summary -->in their own comment was being silently hidden^(NONE|ALL|[A-Za-z0-9_,]+)\$before re-emittingdetectedCategories;^[A-Za-z0-9_\-]+\$foraiSummaryReviewNodeId##vsodirective via setvariable echo\/scripts/Review-PR.ps1(not the worktree);exit \on failure; Tasks 2-4 gated onne(variables['SetupFailed'], 'true')so a half-set-up worktree can't keep runninggh-write-verb classifier + YAML token-binding audit, run as a Pester suitesecurity-scripts-pester.ymlworkflow runs all 9.Tests.ps1files on every PR touching.github/scripts,.github/skills,eng/scripts, oreng/pipelines/ci-copilot.yml. Required-files allowlist + minPassed floor + zero-skipped tolerance → silent test deletion / Pester skip cannot land greenThe original sanitizer / phase-state work passed an adversarial review with three independent reviewers (opus-4.8 + opus-4.7-xhigh + gpt-5.5); the rebase onto #35677's new Copilot-token-usage commits was re-validated by the same set.
gpt-5.5caught one pre-existing F2 gap (UI/Device runner\re-emit at L1526/1539 was rawWrite-Host) which is fixed in this branch.Validation
ghwrite-verb regressions on real Gate-reachable scriptsGH_TOKENbinding is the documented PAT alias;CopilotReviewand the newAnalyzeCopilotTokenUsagestage have no GH_TOKEN-mac HMAC -macopt hexkey:round-trips withConvert.FromHexString)yaml.safe_loadcleanNote for #35677 owner
The PR title and description should be reviewed against the actual diff before merge; this stack only adds defenses, no functional changes to the review pipeline beyond the trust-boundary tightening.
cc @kubaflo