Fix thread safety issues in WorkerNodeTelemetryData #13413
AR-May merged 3 commits into dotnet:main
Conversation
Pull request overview
This PR addresses a concurrency bug in MSBuild’s internal worker-node telemetry aggregation when running in in-proc multi-threaded mode (/m /mt), where multiple RequestBuilder instances can concurrently mutate shared telemetry dictionaries and corrupt their state.
Changes:
- Switch telemetry reporting in `RequestBuilder.UpdateStatisticsPostBuild` from per-target/per-task updates to batching into a local `WorkerNodeTelemetryData` and merging once.
- Replace `ITelemetryForwarder.AddTask`/`AddTarget` with a single `MergeWorkerData` API and update the provider implementations accordingly.
- Add explicit locking around aggregation in the internal telemetry-consuming logger.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Framework/Telemetry/WorkerNodeTelemetryData.cs | Adds method documentation and minor refactor while preserving merge/aggregation behavior. |
| src/Build/TelemetryInfra/TelemetryForwarderProvider.cs | Replaces per-item update APIs with MergeWorkerData and exposes key creation helper for batching. |
| src/Build/TelemetryInfra/InternalTelemetryConsumingLogger.cs | Adds a lock around worker telemetry aggregation. |
| src/Build/TelemetryInfra/ITelemetryForwarder.cs | Updates forwarder contract to support batched merging. |
| src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs | Implements local accumulation + single merge under a lock to prevent concurrent dictionary writes. |
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Fixes telemetry thread-safety and counter inflation in in-proc multithreaded (/m /mt) builds by batching telemetry per RequestBuilder, merging into a shared forwarder under a lock, and preventing repeated “send the whole buffer” behavior during engine shutdown.
Changes:
- Accumulate task/target telemetry in a per-`RequestBuilder` `WorkerNodeTelemetryData` and merge once into the shared forwarder.
- Make `TelemetryForwarder` thread-safe and change finalization to "swap-and-send" to avoid Nx duplication across multiple `BuildRequestEngine` finalizers.
- Add unit tests for `WorkerNodeTelemetryData.IsEmpty` and forwarder reset behavior after `FinalizeProcessing`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Framework/Telemetry/WorkerNodeTelemetryData.cs | Adds IsEmpty and improves merge/add method clarity used by the forwarder swap-and-send logic. |
| src/Framework/Telemetry/TaskOrTargetTelemetryKey.cs | Introduces a helper factory (Create) to centralize key construction used by RequestBuilder. |
| src/Build/TelemetryInfra/TelemetryForwarderProvider.cs | Adds locking, batch merge entrypoint, and swap-and-send finalization to prevent races and duplicate sends. |
| src/Build/TelemetryInfra/ITelemetryForwarder.cs | Replaces per-item APIs with a batch merge API (MergeWorkerData). |
| src/Build/BackEnd/Components/RequestBuilder/RequestBuilder.cs | Switches to batch-then-merge telemetry collection to remove dictionary contention. |
| src/Build.UnitTests/Telemetry/Telemetry_Tests.cs | Adds tests for the new reset/empty behavior and forwarder finalization semantics. |
Related to #12867
Context
This PR fixes two bugs in the telemetry infrastructure when using /m /mt (in-proc multithreaded) mode:
1. Thread-safety crash: all in-proc nodes share a single `TelemetryForwarderProvider` singleton. Multiple `RequestBuilder` instances run on dedicated threads and call `AddTask`/`AddTarget` concurrently on the same `WorkerNodeTelemetryData` dictionary fields, causing race conditions and dictionary corruption.
2. Nx telemetry duplication: in /m /mt mode, N `BuildRequestEngine` instances share one `TelemetryForwarderProvider` singleton. Each engine calls `FinalizeProcessing` on shutdown, sending the entire accumulated data each time. The `InternalTelemetryConsumingLogger` merges all N copies, inflating every counter N times.
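The Nx inflation is easy to model in isolation. The sketch below is an illustrative Java analogue, not MSBuild's C# code (`NaiveForwarder`, `sink`, and the key names are hypothetical): when a forwarder re-sends its whole buffer on every finalization and N engines each finalize once, every counter is multiplied by N.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the bug: a shared forwarder whose finalize
// re-sends the entire accumulated buffer, called once per engine.
class NaiveForwarder {
    final Map<String, Integer> buffer = new HashMap<>();
    final Map<String, Integer> sink = new HashMap<>();   // stands in for the consuming logger

    void add(String key, int count) {
        buffer.merge(key, count, Integer::sum);
    }

    // Sends the whole accumulated buffer on every call (the bug).
    void finalizeProcessing() {
        buffer.forEach((k, v) -> sink.merge(k, v, Integer::sum));
    }
}

public class DuplicationDemo {
    public static void main(String[] args) {
        NaiveForwarder f = new NaiveForwarder();
        f.add("Csc", 10);                       // one task executed 10 times
        int engines = 4;                        // N engine instances share the forwarder
        for (int i = 0; i < engines; i++) {
            f.finalizeProcessing();             // each engine finalizes on shutdown
        }
        System.out.println(f.sink.get("Csc"));  // prints 40: inflated N times instead of 10
    }
}
```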
Reproduction: 20+ non-SDK .NET Framework library projects + 1 exe referencing all of them, built with `MSBuild.exe Repro.sln /m /mt`.
Changes Made
Fix
Batch-then-merge in `RequestBuilder`: each `RequestBuilder` now accumulates task/target telemetry into a local `WorkerNodeTelemetryData` instance (zero contention), then merges once into the shared state via `TelemetryForwarder.MergeWorkerData()`.
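The batch-then-merge pattern can be sketched as follows. This is a hedged Java illustration under assumed names (`SharedTelemetry`, `mergeWorkerData`), not the actual C# implementation: each worker fills a private map with no synchronization at all, then takes the shared lock exactly once to merge.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical shared aggregation point, analogous to the forwarder's shared state.
class SharedTelemetry {
    private final Object lock = new Object();
    private final Map<String, Long> totals = new HashMap<>();

    // One synchronized merge per worker, instead of one per task/target.
    void mergeWorkerData(Map<String, Long> local) {
        synchronized (lock) {
            local.forEach((k, v) -> totals.merge(k, v, Long::sum));
        }
    }

    long get(String key) {
        synchronized (lock) { return totals.getOrDefault(key, 0L); }
    }
}

public class BatchMergeDemo {
    public static void main(String[] args) throws InterruptedException {
        SharedTelemetry shared = new SharedTelemetry();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int w = 0; w < 8; w++) {
            pool.submit(() -> {
                Map<String, Long> local = new HashMap<>();  // per-worker batch, zero contention
                for (int i = 0; i < 1000; i++) {
                    local.merge("Target:Build", 1L, Long::sum);
                }
                shared.mergeWorkerData(local);              // single merge at the end
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(shared.get("Target:Build"));     // prints 8000: no lost updates
    }
}
```

The same workload with each increment applied directly to an unsynchronized shared dictionary is exactly the corruption scenario the PR describes.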
Thread-safe `TelemetryForwarder`: added an internal lock protecting both `MergeWorkerData` and `FinalizeProcessing`. The forwarder is a singleton shared across `BuildRequestEngine` instances in /m /mt mode, so concurrent access is expected.
Swap-and-send in `FinalizeProcessing`: instead of sending the same accumulated data on every call, `FinalizeProcessing` atomically swaps the internal data with a fresh empty instance under the lock, then sends only if non-empty. This ensures each engine's shutdown sends only the data accumulated since the last finalization (no Nx duplication, and no data loss).
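The swap-and-send finalization can be sketched like this. Again an illustrative Java model with hypothetical names (`SwapForwarder`, `sink`), not the MSBuild source: finalize swaps the buffer for a fresh instance under the lock, so later finalizations find an empty buffer and send nothing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical swap-and-send forwarder: repeated finalizations never re-send old data.
class SwapForwarder {
    private final Object lock = new Object();
    private Map<String, Integer> buffer = new HashMap<>();
    final Map<String, Integer> sink = new HashMap<>();    // stands in for the consuming logger

    void add(String key, int count) {
        synchronized (lock) { buffer.merge(key, count, Integer::sum); }
    }

    void finalizeProcessing() {
        Map<String, Integer> toSend;
        synchronized (lock) {
            toSend = buffer;
            buffer = new HashMap<>();   // atomically swap in a fresh, empty buffer
        }
        if (!toSend.isEmpty()) {        // send only when there is new data
            toSend.forEach((k, v) -> sink.merge(k, v, Integer::sum));
        }
    }
}

public class SwapDemo {
    public static void main(String[] args) {
        SwapForwarder f = new SwapForwarder();
        f.add("Csc", 10);
        for (int i = 0; i < 4; i++) {
            f.finalizeProcessing();     // N engine shutdowns
        }
        System.out.println(f.sink.get("Csc"));  // prints 10, not 40
    }
}
```

Data added between two finalizations still lands in the buffer and is sent by the next finalization, which is why the swap avoids both duplication and loss.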
Testing
- Verified locally that the issue is gone on a repro project.
- Added unit tests.