Add MSBuild Coordinator for fair-share node allocation #13653
Merged
d146431
Add MSBuild.Coordinator exe project
DustinCampbell efd5fda
Add missing Microsoft.IO.Redist package reference to Framework
DustinCampbell 21ad7f0
Add MSBuild.Coordinator.UnitTests project
DustinCampbell b384e1c
Add coordinator protocol types and tests
DustinCampbell 4b2cdda
Add coordinator server with fair-share node budget management
DustinCampbell cfd43a5
Add CoordinatorClient for BuildManager integration
DustinCampbell 8d0e2d9
Add tracing for coordinator client and server
DustinCampbell eab2916
Move NamedPipeUtil.cs to Framework and use in Protocol
DustinCampbell d1011c0
Add integration tests and pipe name isolation for coordinator
DustinCampbell e79fd55
Use platform-specific pipe names in coordinator unit tests
DustinCampbell d727075
Refactor coordinator message read/write into base class pattern
DustinCampbell f5fd21e
Use FrameworkErrorUtilities for protocol error handling
DustinCampbell c42f00e
Consolidate coordinator configuration into CoordinatorSettings class
DustinCampbell 7925a59
Wait for client handlers to complete on CoordinatorServer shutdown
DustinCampbell ddcd043
Fix race condition in CoordinatorClient heartbeat disposal
DustinCampbell 824c508
Clean up NodeBudgetManager a bit
DustinCampbell d704afe
Fix reconnection race condition and improve locking in CoordinatorServer
DustinCampbell 3be0b52
Move coordinator env var names from Protocol to Traits
DustinCampbell 8debfa2
Add MSBuild-Coordinator.md architecture document
DustinCampbell bf74d87
Add string polyfills and optimize escaping allocation path
DustinCampbell e3be85f
Fix Unix coordinator startup by using a path-safe mutex name
DustinCampbell 204371e
Fix incorrect MSBuild.Coordinator path in MSBuild.SourceBuild.slnf
DustinCampbell d0e6a27
CR Feedback: Allow concurrent access in RunAsync
DustinCampbell eb52c7a
CR Feedback: Wait for in-flight callbacks to complete on Dispose
DustinCampbell 52fa047
CR Feedback: Clamp values and avoid overflow in CoordinatorSettings
DustinCampbell 5d6ebee
CR Feedback: Dispose connection and set to null when grant released
DustinCampbell bec832b
CR Feedback: Validate RequestNodesMessage values
DustinCampbell 46b2060
CR Feedback: TryGrant should guard against invalid requests
DustinCampbell aa1ecb5
CR Feedback: BuildGrant should validate arguments
DustinCampbell 7df6947
CR Feedback: Send heartbeats while waiting for server response
DustinCampbell 2b5d7c8
Tweak formatting in src/MSBuild.Coordinator/Program.cs slightly
DustinCampbell ddb6d46
Update MSBuild-Coordinator.md for clarity and correctness
DustinCampbell 1266a32
Merge branch 'main' into build-coordinator
DustinCampbell eb4d57b
Tweak note in MSBuild-Coordinator.md
DustinCampbell a46b540
Update note on node grant behavior in MSBuild
DustinCampbell fc3a44b
Merge branch 'main' into build-coordinator
DustinCampbell 803cb7d
Merge branch 'main' into build-coordinator
DustinCampbell 68d78f2
Merge branch 'main' into build-coordinator
DustinCampbell 672f241
Merge branch 'main' into build-coordinator
DustinCampbell 3d690f4
Merge branch 'main' into build-coordinator
DustinCampbell bdb044b
Merge branch 'main' into build-coordinator
DustinCampbell 33c8c09
Merge branch 'main' into build-coordinator
DustinCampbell e8da1e7
Log coordinator status messages through MSBuild logging system
DustinCampbell 3da15cd
Clean up: Add XML doc comments and reorder members in coordinator
DustinCampbell 5e0069d
Rename "Server:" log prefix to "CoordinatorServer:" for consistency
DustinCampbell dd62b6c
Serialize coordinator launches with a named mutex
DustinCampbell d6cf2f4
Merge branch 'main' into build-coordinator
DustinCampbell 34418c6
Improve CoordinatorClient diagnostic output
DustinCampbell 960a73e
Move coordinator message types to Messages subfolder
DustinCampbell 79171ef
Track wait duration when coordinator defers node grant
DustinCampbell efa1b62
Report coordinator wait duration in build telemetry
DustinCampbell 744f687
Merge branch 'main' into build-coordinator
DustinCampbell c005655
Merge branch 'main' into build-coordinator
DustinCampbell 07f8425
Refactor CoordinatorIntegration_Tests and add async bootstrap helper
DustinCampbell 880ba91
Add coordinator server telemetry using MSBuild's activity infrastructure
DustinCampbell 8500427
Merge branch 'main' into build-coordinator
DustinCampbell f9fe12f
Update ReadGuid/WriteGuid extension methods for BinaryReader/BinaryWr…
DustinCampbell d024c66
Add ConnectionId to coordinator protocol for unique client identifica…
DustinCampbell a8018a3
Make NodeBudgetManager thread-safe internally
DustinCampbell 2cab812
Remove low-value resx comments from coordinator strings
DustinCampbell 9543259
Move coordinator configuration constants to dedicated Constants class
DustinCampbell f41ab30
Specify explicit file extensions for coordinator bootstrap copies
DustinCampbell 56464da
Replace version byte with capabilities handshake protocol
DustinCampbell 4081564
Handle ErrorMessage in coordinator client and log all failure paths
DustinCampbell 608e747
Consistent ownership transfer in TryNegotiate
DustinCampbell 1b55649
Extract TrySendHandshake from TryNegotiate
DustinCampbell 51f7105
Reduce cold-start latency in coordinator client
DustinCampbell bc9f13a
Fix misleading test names in CoordinatorClient_Tests
DustinCampbell 4d9cb8d
Document how requested node count is determined
DustinCampbell 1246e08
Update coordinator documentation for accuracy
DustinCampbell ab0d220
Rename ICoordinatorOutput to ICoordinatorDebugOutput
DustinCampbell 3642612
Merge branch 'main' into build-coordinator
DustinCampbell 69b2b35
Merge branch 'main' into build-coordinator
DustinCampbell 709c4a3
Merge branch 'main' into build-coordinator
DustinCampbell f3edba5
Fix benign timer leak race in ResetShutdownTimer
DustinCampbell
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,308 @@ | ||
| # MSBuild Build Coordinator: Architecture and Flow | ||
|
|
||
| > **Important Note:** This document describes the architecture and design of the MSBuild Build Coordinator at a high level. | ||
| > For current implementation details, class structures, method signatures, or specific code patterns, always consult the source code directly. | ||
| > This ensures you're working with accurate, up-to-date information. | ||
|
|
||
| ## Overview | ||
|
|
||
| The **MSBuild Build Coordinator** is a resource management system that orchestrates and enforces fair-share allocation of build nodes across multiple simultaneous MSBuild processes. It prevents system resource exhaustion by maintaining a global node budget and dynamically distributing available nodes among competing builds. | ||
|
|
||
| ### Purpose | ||
|
|
||
| When multiple MSBuild processes run concurrently (common in CI/CD environments, distributed builds, or user multi-tasking), each process could independently attempt to spawn the maximum number of nodes, leading to: | ||
| - System resource exhaustion | ||
| - Excessive memory consumption | ||
| - CPU contention and slowdown | ||
| - Reduced overall build throughput | ||
|
|
||
| The coordinator solves this by: | ||
| 1. **Enforcing a global node budget** (defaults to processor count) | ||
| 2. **Implementing fair-share allocation** to distribute available nodes fairly | ||
| 3. **Monitoring build health** via periodic heartbeats | ||
| 4. **Auto-shutting down** after an idle timeout (default: 60 seconds) | ||
|
|
||
| --- | ||
|
|
||
| ## Architecture Overview | ||
|
|
||
| ``` | ||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ System with Multiple Builds │ | ||
| └──────────────────────────────────────────────────────────────────┘ | ||
|
|
||
| Build 1 Build 2 Build 3 | ||
| │ │ │ | ||
| │ RequestNodes(4) │ RequestNodes(4) │ RequestNodes(4) | ||
| │ │ │ | ||
| └─────────────────────┼─────────────────────┘ | ||
| (via Named Pipes - IPC) | ||
| ↓ | ||
| ┌────────────────────────────────────┐ | ||
| │ MSBuild Build Coordinator │ | ||
| │ │ | ||
| │ ┌──────────────────────────────┐ │ | ||
| │ │ Node Budget Manager │ │ | ||
| │ │ • Total Budget: 8 nodes │ │ | ||
| │ │ • Allocated: 8 │ │ | ||
| │ │ • Available: 0 │ │ | ||
| │ └──────────────────────────────┘ │ | ||
| │ │ | ||
| │ ┌──────────────────────────────┐ │ | ||
| │ │ Active Builds │ │ | ||
| │ │ • Build 1: 4 nodes │ │ | ||
| │ │ • Build 2: 4 nodes │ │ | ||
| │ └──────────────────────────────┘ │ | ||
| │ │ | ||
| │ ┌──────────────────────────────┐ │ | ||
| │ │ Waiting Builds Queue │ │ | ||
| │ │ • Build 3: waiting │ │ | ||
| │ └──────────────────────────────┘ │ | ||
| └────────────────────────────────────┘ | ||
| ↓ ↓ | ||
| Grant(nodes=2) Wait(queued) | ||
| ``` | ||
|
|
||
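The bookkeeping in the diagram above (total budget, allocated, available) can be sketched as follows. This is a minimal illustration, not the actual `NodeBudgetManager`: the class name `NodeBudget`, its methods, and the build identifiers are all hypothetical.

```python
class NodeBudget:
    """Tracks a fixed pool of build nodes shared by several builds (illustrative only)."""

    def __init__(self, total: int) -> None:
        if total < 1:
            raise ValueError("total must be at least 1")
        self.total = total
        self.grants = {}  # build id -> number of nodes currently held

    @property
    def allocated(self) -> int:
        return sum(self.grants.values())

    @property
    def available(self) -> int:
        return self.total - self.allocated

    def try_allocate(self, build_id: str, nodes: int) -> bool:
        """Record a grant if it fits in the remaining budget; otherwise change nothing."""
        if nodes < 1 or build_id in self.grants or nodes > self.available:
            return False
        self.grants[build_id] = nodes
        return True

    def release(self, build_id: str) -> int:
        """Return a build's nodes to the pool and report how many were freed."""
        return self.grants.pop(build_id, 0)


# Mirror the diagram: an 8-node budget fully held by two builds.
budget = NodeBudget(total=8)
budget.try_allocate("build-1", 4)
budget.try_allocate("build-2", 4)
build3_fits = budget.try_allocate("build-3", 4)  # False: nothing left, so Build 3 would wait
```

In this sketch a request that does not fit is simply refused; queuing and fair-share sizing are separate concerns.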
| --- | ||
|
|
||
| ## Component Architecture | ||
|
|
||
| ### Key Components | ||
|
|
||
| **Coordinator Server** ([src/MSBuild.Coordinator/](src/MSBuild.Coordinator/)) | ||
| - `CoordinatorServer.cs` - Main server that listens for client connections via named pipe | ||
| - `NodeBudgetManager.cs` - Implements node allocation and fair-share logic | ||
| - `ClientConnection.cs` - Manages individual client connections | ||
| - `BuildGrant.cs` - Represents a node allocation to a build | ||
| - `Program.cs` - Server launcher and singleton instance management | ||
|
|
||
| **Client-Side** ([src/Build/BackEnd/BuildManager/](src/Build/BackEnd/BuildManager/)) | ||
| - `CoordinatorClient.cs` - Client connection handler integrated into BuildManager | ||
| - `BuildManager.cs` - Requests nodes from coordinator and sets build parallelism | ||
|
|
||
| **Protocol** ([src/Framework/Coordinator/](src/Framework/Coordinator/)) | ||
| - Message types: `RequestNodesMessage`, `HeartbeatMessage`, `ReleaseNodesMessage`, `NodeGrantMessage`, `WaitMessage`, `ErrorMessage` | ||
| - `CoordinatorSettings.cs` - Configuration management | ||
| - `Protocol.cs` - Protocol versioning | ||
|
|
||
| ### Directory Structure | ||
|
|
||
| ``` | ||
| src/ | ||
| ├── MSBuild.Coordinator/ # Coordinator server executable | ||
| │ ├── CoordinatorServer.cs | ||
| │ ├── NodeBudgetManager.cs | ||
| │ ├── ClientConnection.cs | ||
| │ ├── BuildGrant.cs | ||
| │ ├── Program.cs | ||
| │ └── ... | ||
| ├── Framework/Coordinator/ # Protocol and interfaces | ||
| │ ├── RequestNodesMessage.cs | ||
| │ ├── HeartbeatMessage.cs | ||
| │ ├── ReleaseNodesMessage.cs | ||
| │ ├── NodeGrantMessage.cs | ||
| │ ├── WaitMessage.cs | ||
| │ ├── ErrorMessage.cs | ||
| │ ├── CoordinatorSettings.cs | ||
| │ ├── Protocol.cs | ||
| │ └── ... | ||
| ├── Build/BackEnd/BuildManager/ | ||
| │ ├── BuildManager.cs | ||
| │ ├── CoordinatorClient.cs | ||
| │ └── ... | ||
| └── MSBuild.Coordinator.UnitTests/ | ||
| ├── CoordinatorServerTests.cs | ||
| ├── NodeBudgetManagerTests.cs | ||
| └── ... | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Communication Protocol | ||
|
|
||
| ### Message Types | ||
|
|
||
| The coordinator uses a binary protocol with six message types: | ||
|
|
||
| **Client → Server:** | ||
| - `RequestNodesMessage` - Sent when a build starts, requests a node grant | ||
| - `HeartbeatMessage` - Periodic keep-alive message (default: every 5 seconds) | ||
| - `ReleaseNodesMessage` - Sent when build completes, releases allocated nodes | ||
|
|
||
| **Server → Client:** | ||
| - `NodeGrantMessage` - Grants nodes to a build | ||
| - `WaitMessage` - Indicates build is queued, no nodes immediately available | ||
| - `ErrorMessage` - Indicates an error condition (e.g., protocol version mismatch) | ||
|
|
||
| Each message includes a protocol version for compatibility verification. | ||
|
|
||
| **Source:** [src/Framework/Coordinator/](src/Framework/Coordinator/) | ||
|
|
||
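To make the idea of a versioned binary message concrete, here is a small sketch of one possible framing. The wire layout (one version byte, one type byte, one 32-bit field), the numeric type codes, and the version value are assumptions made up for illustration; the real layout is defined in `src/Framework/Coordinator/`.

```python
import struct
from enum import IntEnum

PROTOCOL_VERSION = 1  # hypothetical value


class MessageType(IntEnum):
    # Numeric codes are illustrative; the source defines its own.
    REQUEST_NODES = 1
    HEARTBEAT = 2
    RELEASE_NODES = 3
    NODE_GRANT = 4
    WAIT = 5
    ERROR = 6


# Little-endian: version byte, message-type byte, one signed 32-bit value.
_FRAME = struct.Struct("<BBi")


def encode(message_type: MessageType, value: int = 0) -> bytes:
    """Serialize a message into the assumed 6-byte frame."""
    return _FRAME.pack(PROTOCOL_VERSION, int(message_type), value)


def decode(data: bytes):
    """Parse a frame, rejecting versions this sketch does not understand."""
    version, raw_type, value = _FRAME.unpack(data)
    if version != PROTOCOL_VERSION:
        raise ValueError(f"unsupported protocol version {version}")
    return MessageType(raw_type), value


frame = encode(MessageType.REQUEST_NODES, 4)
```

Checking the version before interpreting the rest of the frame is what lets a receiver answer a mismatched peer with an error instead of misreading its payload.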
| ### Message Flow Example | ||
|
|
||
| ``` | ||
| Successful Grant: | ||
| Build → RequestNodesMessage(4) | ||
| Build ← NodeGrantMessage(4) | ||
| Build → Heartbeat (every 5s) | ||
| Build → ReleaseNodesMessage (on completion) | ||
|
|
||
| Build Queued: | ||
| Build → RequestNodesMessage(4) | ||
| Build ← WaitMessage (queue position 1) | ||
| Build → Heartbeat (every 5s while waiting) | ||
| Eventually: Build ← NodeGrantMessage(2) [after fair-share calculation] | ||
| ``` | ||
|
|
||
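The two flows above can be walked through with a scripted exchange. This is a simplified sketch: the function `run_request` and the `(kind, nodes)` reply tuples are hypothetical, and a heartbeat is logged once per `wait` reply rather than on a timer as the real client would do.

```python
from typing import Iterable, List, Optional, Tuple


def run_request(
    requested: int, replies: Iterable[Tuple[str, int]]
) -> Tuple[Optional[int], List[str]]:
    """Play the client side of the exchange against scripted server replies.

    Each reply is (kind, nodes) with kind in {"grant", "wait", "error"}.
    Returns the granted node count (None if no grant arrived) and the
    list of messages the client sent.
    """
    sent = [f"RequestNodesMessage({requested})"]
    for kind, nodes in replies:
        if kind == "grant":
            return nodes, sent
        if kind == "wait":
            sent.append("HeartbeatMessage")  # stay alive in the queue while waiting
            continue
        break  # "error" or anything unexpected: stop and let the caller fall back
    return None, sent


immediate = run_request(4, [("grant", 4)])
queued = run_request(4, [("wait", 0), ("grant", 2)])
```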
| --- | ||
|
|
||
| ## Fair-Share Allocation Algorithm | ||
|
|
||
| ### Core Concept | ||
|
|
||
| When multiple builds compete for limited nodes, the coordinator distributes them fairly using: | ||
|
|
||
| ``` | ||
| fair_share = max(1, available_nodes / (waiting_builds + 1)) | ||
| ``` | ||
|
|
||
| This ensures: | ||
| - Every waiting build gets at least 1 node | ||
| - Available nodes are divided equally among contenders | ||
| - Builds are taken from the wait queue as nodes become available | ||
|
|
||
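The formula above translates directly into a small function. This sketch assumes integer (floor) division, which the text does not state, and takes `waiting_builds` as an input without deciding how the coordinator counts it.

```python
def fair_share(available_nodes: int, waiting_builds: int) -> int:
    """Compute max(1, available_nodes / (waiting_builds + 1)) with floor division."""
    if available_nodes < 0 or waiting_builds < 0:
        raise ValueError("counts must be non-negative")
    return max(1, available_nodes // (waiting_builds + 1))


share_two_way = fair_share(4, 1)   # 4 // 2 = 2
share_alone = fair_share(2, 0)     # 2 // 1 = 2
share_floor = fair_share(0, 3)     # 0 // 4 = 0, raised to the floor of 1
```

Note that the `max(1, ...)` floor means the formula returns 1 even when no nodes are free, so a caller still has to check the remaining budget before granting.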
| ### Example Scenarios | ||
|
|
||
| **Two Competing Builds** (8 total nodes) | ||
| - Build A gets 4 nodes (active) | ||
| - Build B requests 4 nodes → Available: 4, Waiting: 1 | ||
| - Fair share: max(1, 4 / 2) = 2 nodes | ||
| - Build B granted 2 nodes | ||
|
|
||
| **Multiple Builds in Queue** (8 total nodes) | ||
| - Build A uses 4 nodes | ||
| - Build B waiting (wants 6) and Build C waiting (wants 8) | ||
| - When A completes: 4 nodes available, 2 waiting | ||
| - Build B: fair_share = max(1, 4 / 2) = 2 nodes | ||
| - Build C: fair_share = max(1, 2 / 1) = 2 nodes | ||
|
|
||
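The arithmetic in the second scenario can be reproduced by draining the wait queue one build at a time. The sketch below rests on three assumptions that are not stated in the text: the divisor counts only the builds still queued behind the one being served, a grant never exceeds what the build asked for, and granting stops once no nodes remain. The function name `drain_queue` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def drain_queue(available: int, waiting: List[Tuple[str, int]]) -> Dict[str, int]:
    """Grant nodes to queued builds in order; each entry is (build name, nodes wanted)."""
    queue = deque(waiting)
    grants: Dict[str, int] = {}
    while queue and available > 0:
        build, wanted = queue.popleft()
        # len(queue) is the number of builds still waiting behind this one (assumption).
        share = max(1, available // (len(queue) + 1))
        granted = min(wanted, share, available)
        grants[build] = granted
        available -= granted
    return grants


# 4 nodes free, Build B wants 6 and Build C wants 8.
grants = drain_queue(4, [("B", 6), ("C", 8)])
```

Under these assumptions Build B is sized against 4 nodes with one build behind it, and Build C against the 2 nodes that remain.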
| --- | ||
|
|
||
| ## Integration with BuildManager | ||
|
|
||
| ### How Coordination Works | ||
|
|
||
| During build initialization: | ||
|
|
||
| 1. BuildManager checks if `MSBUILDUSECOORDINATOR` environment variable is set | ||
| 2. If enabled, `CoordinatorClient` attempts to connect to the coordinator | ||
| 3. Sends `RequestNodesMessage` with desired node count | ||
| 4. Receives either `NodeGrantMessage` (nodes granted) or `WaitMessage` (queued) | ||
| 5. Updates build's maximum node count based on grant | ||
| 6. During execution, spawns build nodes limited by this capped value | ||
| 7. On completion, sends `ReleaseNodesMessage` to free nodes for other builds | ||
|
|
||
| **Key Principle:** The coordinator is entirely optional. If it's unavailable or disabled, the build uses its requested node count without coordination. | ||
|
|
||
| **Sources:** | ||
| - [src/Build/BackEnd/BuildManager/BuildManager.cs](src/Build/BackEnd/BuildManager/BuildManager.cs) | ||
| - [src/Build/BackEnd/BuildManager/CoordinatorClient.cs](src/Build/BackEnd/BuildManager/CoordinatorClient.cs) | ||
| - [src/Framework/Traits.cs](src/Framework/Traits.cs) - Enablement logic | ||
|
|
||
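The "entirely optional" principle amounts to a fallback rule: every failure path ends with the build using the node count it asked for. A minimal sketch, where `effective_node_count` and the `request_grant` callback are hypothetical stand-ins for the logic in `BuildManager` and `CoordinatorClient`, and capping the grant at the requested count is an assumption:

```python
from typing import Callable, Mapping, Optional


def effective_node_count(
    requested: int,
    env: Mapping[str, str],
    request_grant: Callable[[int], Optional[int]],
) -> int:
    """Return the node count a build should use, falling back to `requested`."""
    if not env.get("MSBUILDUSECOORDINATOR"):
        return requested  # coordination disabled
    try:
        granted = request_grant(requested)
    except OSError:
        return requested  # coordinator unreachable: proceed uncoordinated
    if granted is None or granted < 1:
        return requested  # no usable grant came back
    return min(requested, granted)  # never exceed what the build asked for


def _unreachable(_: int) -> Optional[int]:
    raise ConnectionRefusedError("no coordinator listening")


disabled = effective_node_count(8, {}, lambda n: 2)
capped = effective_node_count(8, {"MSBUILDUSECOORDINATOR": "1"}, lambda n: 2)
fallback = effective_node_count(8, {"MSBUILDUSECOORDINATOR": "1"}, _unreachable)
```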
| --- | ||
|
|
||
| ## Configuration and Environment Variables | ||
|
|
||
| ### Environment Variables | ||
|
|
||
| | Variable | Default | Purpose | | ||
| |----------|---------|---------| | ||
| | `MSBUILDUSECOORDINATOR` | (empty) | Enable coordinator (set to any value to enable) | | ||
| | `MSBUILDCOORDINATORPIPENAME` | `msbuild-coordinator-{UserName}` | Override default pipe name | | ||
| | `MSBUILDCOORDINATORNODEBUDGET` | Processor count | Override total node budget | | ||
| | `MSBUILDCOORDINATORHEARTBEAT` | 5000 | Override heartbeat interval (ms) | | ||
| | `MSBUILDCOORDINATORSHUTDOWNTIMEOUT` | 60000 | Override shutdown timeout (ms) | | ||
|
|
||
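Reading these variables with their defaults might look like the sketch below. The variable names and default values come from the table; the type `CoordinatorSettingsSketch`, the helper `_positive_int`, and the choice to fall back to the default on invalid or non-positive input are assumptions. The pipe name is omitted because its default depends on the current user.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CoordinatorSettingsSketch:
    enabled: bool
    node_budget: int
    heartbeat_ms: int
    shutdown_timeout_ms: int


def _positive_int(raw: Optional[str], default: int) -> int:
    """Parse a positive integer, using the default for missing or invalid input."""
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def load_settings(env: Mapping[str, str]) -> CoordinatorSettingsSketch:
    """Build settings from an environment-like mapping (pass os.environ in practice)."""
    return CoordinatorSettingsSketch(
        enabled=bool(env.get("MSBUILDUSECOORDINATOR")),
        node_budget=_positive_int(env.get("MSBUILDCOORDINATORNODEBUDGET"), os.cpu_count() or 1),
        heartbeat_ms=_positive_int(env.get("MSBUILDCOORDINATORHEARTBEAT"), 5000),
        shutdown_timeout_ms=_positive_int(env.get("MSBUILDCOORDINATORSHUTDOWNTIMEOUT"), 60000),
    )


defaults = load_settings({})
overridden = load_settings({"MSBUILDUSECOORDINATOR": "1", "MSBUILDCOORDINATORNODEBUDGET": "4"})
```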
| --- | ||
|
|
||
| ## Lifecycle and Operation | ||
|
|
||
| ### Coordinator Startup | ||
|
|
||
| 1. When first MSBuild process needs coordination, it attempts to start the coordinator | ||
| 2. Coordinator uses a system-wide mechanism to ensure only one instance runs | ||
| 3. If an instance already exists, the new process connects as a client instead | ||
| 4. Coordinator listens on a named pipe for client connections | ||
|
|
||
| **Source:** [src/MSBuild.Coordinator/Program.cs](src/MSBuild.Coordinator/Program.cs) | ||
|
|
||
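The single-instance rule in the steps above can be illustrated with an exclusive lock file. This is a stand-in technique chosen for portability of the example, not the mechanism the coordinator uses; the function name and the lock-file path are hypothetical.

```python
import os
import tempfile


def try_become_coordinator(lock_path: str) -> bool:
    """Return True if this process created the lock file and should act as the server."""
    try:
        # O_EXCL makes creation fail if the file already exists, so only one caller wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another instance holds the role; connect as a client instead
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as scratch:
    lock = os.path.join(scratch, "coordinator.lock")
    first = try_become_coordinator(lock)   # wins the race and becomes the server
    second = try_become_coordinator(lock)  # sees the existing lock and becomes a client
```

A plain lock file stays behind if its owner crashes, so a real implementation needs a mechanism that the operating system releases automatically when the process exits.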
| ### Heartbeat Monitoring | ||
|
|
||
| The coordinator detects stalled or crashed clients through periodic heartbeats: | ||
|
|
||
| - Clients send heartbeat messages at configured intervals (default: 5 seconds) | ||
| - Coordinator tracks missed heartbeats | ||
| - After threshold is reached (default: 3 misses = 15 seconds), client is considered stalled | ||
| - Coordinator automatically releases nodes allocated to stalled client | ||
| - Waiting builds can then be granted those nodes | ||
|
|
||
| **Source:** [src/MSBuild.Coordinator/CoordinatorServer.cs](src/MSBuild.Coordinator/CoordinatorServer.cs) | ||
|
|
||
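The stall check described above reduces to comparing the time since the last heartbeat with `interval × threshold`. A minimal sketch using the documented defaults (5 seconds, 3 misses); the function names and the dictionaries used for bookkeeping are hypothetical.

```python
from typing import Dict


def is_stalled(
    last_heartbeat_s: float,
    now_s: float,
    interval_s: float = 5.0,
    missed_threshold: int = 3,
) -> bool:
    """A client counts as stalled once `missed_threshold` intervals pass with no heartbeat."""
    return now_s - last_heartbeat_s >= interval_s * missed_threshold


def reap_stalled(last_seen: Dict[str, float], grants: Dict[str, int], now_s: float) -> int:
    """Drop stalled clients and return how many nodes their grants free up."""
    stalled = [client for client, seen in last_seen.items() if is_stalled(seen, now_s)]
    freed = 0
    for client in stalled:
        freed += grants.pop(client, 0)
        del last_seen[client]
    return freed


last_seen = {"build-a": 0.0, "build-b": 10.0}
grants = {"build-a": 4, "build-b": 2}
freed = reap_stalled(last_seen, grants, now_s=16.0)  # build-a has been silent for 16 s
```

The freed count is what a coordinator would then offer to builds in the wait queue.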
| ### Graceful Shutdown | ||
|
|
||
| When a build completes normally: | ||
|
|
||
| 1. Client sends `ReleaseNodesMessage` with its grant ID | ||
| 2. Coordinator frees those nodes | ||
| 3. Processes waiting queue to allocate freed nodes to waiting builds | ||
| 4. If no active or waiting clients remain, coordinator enters timeout mode | ||
| 5. After the shutdown timeout elapses with no activity (default: 60 seconds), coordinator exits | ||
|
|
||
| **Source:** [src/MSBuild.Coordinator/CoordinatorServer.cs](src/MSBuild.Coordinator/CoordinatorServer.cs) | ||
|
|
||
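The idle countdown in the last two steps can be sketched with an injected clock, which keeps the logic testable without real timers. The class `IdleShutdownTracker` and its method names are hypothetical; only the 60-second default comes from the text.

```python
from typing import Optional


class IdleShutdownTracker:
    """Decides when a coordinator with no clients has been idle long enough to exit."""

    def __init__(self, timeout_s: float = 60.0) -> None:
        self.timeout_s = timeout_s
        self.active_clients = 0
        self.idle_since: Optional[float] = None

    def client_connected(self) -> None:
        self.active_clients += 1
        self.idle_since = None  # any new client cancels a pending shutdown

    def client_released(self, now_s: float) -> None:
        self.active_clients = max(0, self.active_clients - 1)
        if self.active_clients == 0:
            self.idle_since = now_s  # last client gone: start the countdown

    def should_exit(self, now_s: float) -> bool:
        return self.idle_since is not None and now_s - self.idle_since >= self.timeout_s


tracker = IdleShutdownTracker()
tracker.client_connected()
tracker.client_released(now_s=100.0)
too_early = tracker.should_exit(now_s=159.0)  # only 59 s idle
due = tracker.should_exit(now_s=160.0)        # 60 s idle
```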
| --- | ||
|
|
||
| ## Error Handling | ||
|
|
||
| ### Resilient Design | ||
|
|
||
| The coordinator system is designed to be fully optional: | ||
|
|
||
| - **Unavailable coordinator** → Build proceeds without coordination using full node count | ||
| - **Connection failure** → Build proceeds independently | ||
| - **Protocol mismatch** → Graceful fallback to the build's requested node count, without coordination | ||
| - **Crashed client** → Detected via heartbeat timeout, resources cleaned up | ||
| - **Coordinator crash** → Next build can launch new instance | ||
|
|
||
| This means coordinator failures never block or degrade build execution—they only disable coordination. | ||
|
|
||
| **Sources:** | ||
| - [src/Build/BackEnd/BuildManager/CoordinatorClient.cs](src/Build/BackEnd/BuildManager/CoordinatorClient.cs) | ||
| - [src/MSBuild.Coordinator/CoordinatorServer.cs](src/MSBuild.Coordinator/CoordinatorServer.cs) | ||
|
|
||
| --- | ||
|
|
||
| ## Testing | ||
|
|
||
| ### Unit Tests | ||
|
|
||
| Unit tests in [src/MSBuild.Coordinator.UnitTests/](src/MSBuild.Coordinator.UnitTests/) cover: | ||
|
|
||
| - Protocol serialization/deserialization | ||
| - Node budget manager allocation logic | ||
| - Fair-share algorithm correctness | ||
| - Heartbeat monitoring | ||
| - Multi-build coordination scenarios | ||
| - Error conditions and edge cases | ||
|
|
||
| --- | ||
|
|
||
| ## Source Code References | ||
|
|
||
| For detailed implementation information, refer to: | ||
|
|
||
| - **Server Implementation:** `src/MSBuild.Coordinator/` | ||
| - **Protocol Definitions:** `src/Framework/Coordinator/` | ||
| - **Client Integration:** `src/Build/BackEnd/BuildManager/` | ||
| - **Configuration:** `src/Framework/Traits.cs`, `src/Framework/Coordinator/CoordinatorSettings.cs` | ||
| - **Tests:** `src/MSBuild.Coordinator.UnitTests/` | ||