Skip to content

.NET: Foundry.Hosting IT - eliminate MSBuild parallel-output races#5725

Open
rogerbarreto wants to merge 2 commits intomainfrom
bugfix/fix-hosted-agents-it-concurrency-error
Open

.NET: Foundry.Hosting IT - eliminate MSBuild parallel-output races#5725
rogerbarreto wants to merge 2 commits intomainfrom
bugfix/fix-hosted-agents-it-concurrency-error

Conversation

@rogerbarreto
Copy link
Copy Markdown
Member

Problem

Across the last 10 merge_group runs of dotnet-build-and-test, 9 failures came from dotnet-foundry-hosted-it. PR #5689 patched one symptom (MSB3026 at publish time) but the underlying races kept surfacing in many forms: MSB3026, MSB3491 file already exists, MSB4018 GenerateDepsFile, MSB3883 on ref/*.dll. All on Linux, all on multi-targeted dependencies (Abstractions, AI, Workflows, Workflows.Declarative, Foundry.Hosting).

Two distinct races, both inside the IT job runner:

  1. Race A: parallel MSBuild inner-builds inside one dotnet build invocation. dotnet build dotnet/filtered-foundry-hosted.slnx -f net10.0 forces a global TFM override on a multi-rooted slnx. Multi-targeted projects get scheduled twice from different parent nodes and write the same bin/Release/net10.0/ files concurrently.
  2. Race B: dotnet publish vs the preceding dotnet build. Publish propagates -r linux-musl-x64 to ProjectReferences and re-touches the same src/<lib>/bin/Release/net10.0/*.dll paths while VBCSCompiler from the prebuild still holds file handles. PR .NET: Foundry.Hosting IT - avoid MSB3026 in publish step #5689's -UsePrebuiltProjectReferences was a partial mitigation gated behind an opt-in switch.

Fix

Two surgical changes inside the existing dotnet-foundry-hosted-it job. No new workflow file. The job stays in dotnet-build-and-test-check.needs and continues to gate the merge queue.

1) Prebuild: build the test csproj directly (closes Race A).

The test project Foundry.Hosting.IntegrationTests.csproj pins <TargetFrameworks>net10.0</TargetFrameworks>. Building the csproj instead of a filtered slnx gives MSBuild a single-rooted graph; each multi-targeted dependency is invoked exactly once for net10.0. The two New-FilteredSolution.ps1 steps are deleted.

dotnet build dotnet/tests/Foundry.Hosting.IntegrationTests/Foundry.Hosting.IntegrationTests.csproj -c Release --warnaserror

2) Publish: always pass --no-dependencies, drop the -UsePrebuiltProjectReferences switch (closes Race B).

Publish builds only the TestContainer and resolves its framework references by reading the prebuilt DLLs. It never re-touches src/<lib>/bin/Release/net10.0/*.dll, so the file-handle race is structurally impossible. Replaces the partial-mitigation switch from #5689 with a standard CLI flag and removes ~80 lines of preflight machinery.

A small new preflight in it-build-image.ps1 errors with a clear run dotnet build first message when the prebuilt outputs are missing (for local dev, since CI always prebuilds).

Verification

Test branch bugfix/fix-hosted-agents-it-concurrency-error dispatched twice:

Run Result
https://github.com/microsoft/agent-framework/actions/runs/25575873120 All IT steps green. Prebuild 45s, publish 17s, tests 43s.

Local validation also confirmed the published Microsoft.Agents.AI.Foundry.dll is byte-identical and same mtime as the prebuild DLL, proving publish reuses the prebuilt output.

Why --no-dependencies, not --no-build

TestContainer is not in the prebuild graph (the prebuild builds IntegrationTests.csproj, which references Foundry, Foundry.Hosting, AgentConformance.IntegrationTests). TestContainer is a separate project that publish must build itself, with -r linux-musl-x64. --no-build would refuse because TestContainer's outputs do not exist; --no-dependencies builds TestContainer but skips its ProjectReferences, which is exactly the right behavior.

Two surgical changes inside the dotnet-foundry-hosted-it job:

1. Replace dotnet build <slnx> -f net10.0 with dotnet build <test.csproj>. The test csproj pins TargetFrameworks=net10.0 and its ProjectReference closure gives MSBuild a single-rooted graph, eliminating the duplicate inner-builds that race on bin/obj. Drops the two New-FilteredSolution.ps1 steps.

2. In it-build-image.ps1, drop the -UsePrebuiltProjectReferences switch and always pass --no-dependencies to dotnet publish. Publish now resolves TestContainer's framework refs by reading prebuilt DLLs and never re-touches them. Replaces the partial-mitigation in PR #5689 with a structural fix.

Local validation confirmed published Foundry.dll has identical mtime and bytes as the prebuild output.
Copilot AI review requested due to automatic review settings May 8, 2026 19:47
@moonbox3 moonbox3 added the .NET label May 8, 2026
@rogerbarreto rogerbarreto enabled auto-merge May 8, 2026 19:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the .NET Foundry.Hosting hosted-agent integration-test (IT) CI job to eliminate recurring Linux MSBuild output races by switching to a single-rooted prebuild graph and ensuring the image publish step never rebuilds ProjectReferences.

Changes:

  • Build the Foundry.Hosting.IntegrationTests.csproj directly in CI (instead of generating filtered .slnx files with a global -f override) to prevent parallel inner-build collisions.
  • Update it-build-image.ps1 to always run dotnet publish --no-dependencies and add a preflight check for required prebuilt DLL outputs; remove the prior -UsePrebuiltProjectReferences switch.
  • Update the workflow to run the hosted-agent tests via the test project directly, matching the new prebuild approach.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
dotnet/tests/Foundry.Hosting.IntegrationTests/scripts/it-build-image.ps1 Always publishes without rebuilding ProjectReferences and adds prebuild-output validation.
.github/workflows/dotnet-build-and-test.yml Simplifies the hosted IT job to build/test the pinned net10.0 csproj directly and removes filtered-solution generation.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 94%

✓ Correctness

The PR correctly eliminates two MSBuild parallel-output races by (1) building the test csproj directly instead of a filtered slnx with a global TFM override, and (2) always passing --no-dependencies to dotnet publish instead of the opt-in -UsePrebuiltProjectReferences switch. All references to the removed targetFramework env var are updated consistently. The test project confirms it pins TargetFrameworks=net10.0, the TestContainer project is correctly excluded from the prebuild graph (justifying --no-dependencies over --no-build), and the new prebuild DLL probes provide a clear error for local dev when the prerequisite build hasn't run. No correctness issues found.

✓ Security Reliability

This PR cleanly addresses two MSBuild race conditions by building the test csproj directly (eliminating parallel inner-build races) and using --no-dependencies for publish (eliminating file-handle races with VBCSCompiler). The preflight probe correctly checks for the two leaf dependency DLLs, and transitive deps would cause publish to fail with a clear error if missing. No injection risks, no secrets exposed, no resource leaks. The change from opt-in UsePrebuiltProjectReferences to always-on --no-dependencies is a valid trade-off: CI always prebuilds so correctness is guaranteed, and local dev gets a clear error message if prebuilt DLLs are absent. The workflow changes (removing filtered solution steps, building csproj directly, using --project for dotnet test) are consistent and correct.

✓ Test Coverage

This PR is a CI/build infrastructure change that eliminates MSBuild parallel-output races by building a csproj directly instead of a filtered slnx and using --no-dependencies for publish. No application code is changed. The existing integration tests (C# test files) are untouched and continue to run via the same dotnet test --no-build step. The PowerShell script's new preflight probe for prebuilt DLLs is a fail-fast guard validated by the CI pipeline itself. There are no existing Pester tests for it-build-image.ps1, so this PR neither introduces nor widens a testing gap. The test invocation correctly drops the -f net10.0 flag since the csproj pins <TargetFrameworks>net10.0</TargetFrameworks> (line 9 of the csproj), making it redundant. No test coverage concerns.

✗ Design Approach

I found one design-level regression in the new it-build-image.ps1 flow: it now always consumes prebuilt project outputs, but the new preflight only proves that two DLLs exist, not that the full hashed ProjectReference closure was rebuilt. That can silently publish stale framework bits under a fresh image tag in local/dev flows.

Flagged Issues

  • The new always---no-dependencies path can silently publish stale Microsoft.Agents.AI.Abstractions / Microsoft.Agents.AI.Workflows bits. The image tag hashes those source trees (it-build-image.ps1:82-88), and Foundry.Hosting directly references them (Microsoft.Agents.AI.Foundry.Hosting.csproj:41-43), but the preflight guard only checks for Foundry.dll and Foundry.Hosting.dll. Either validate the full ProjectReference closure/freshness or keep the old explicit opt-in outside CI.

Automated review by rogerbarreto's agents

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants