Expose operational counters #12695

Closed
baronfel wants to merge 14 commits into main from counters

baronfel (Member) commented Oct 24, 2025

This is a draft PR to explore exporting some System.Diagnostics.Metrics.Meter-based counters for the operational metrics of the build. The aim is to make it easy to collect operational data cross-platform for comparisons when trying things like scheduling algorithm changes, or to quickly see in real-world scenarios how the scheduler is assigning configurations to nodes, etc.

To use:

  • ./build.sh|cmd
  • copy the built Microsoft.Build and Microsoft.Build.Framework dlls to a global SDK 10+ location
    • yes, I know this is horrible - this is actually important because we'll be using dotnet-counters to view the counters and like all framework-dependent tools it gets confused when using bootstrap/local SDK installs
  • install dotnet-counters - dotnet tool install -g dotnet-counters
  • run a build of something using your hacked SDK install, instrumenting it with the counters: dotnet-counters monitor --refresh-interval 1 --counters Microsoft.Build -- dotnet build rest_of_args

You'll get output like this:
[image: screenshot of dotnet-counters output]

This method uses the dotnet diagnostics pipe, so when spawning processes we need to make sure that other dotnet processes don't inherit the diagnostics behaviors. For this reason I injected the "disable diagnostics" env var (DOTNET_EnableDiagnostics=0) into node creation and TaskHost environments. Without this, any process spawning will hang.

Alternative usage

Instead of needing to hack a global SDK install, you can spawn a bootstrapped dotnet build invocation with the DOTNET_DiagnosticPorts=my_diag_port1 and DOTNET_EnableDiagnostics=1 env vars, and then run dotnet-counters with the --diagnostic-port option pointing to that same port name.

Counter documentation

Metrics Overview

All metrics use the System.Diagnostics.Metrics API and are exposed under the Microsoft.Build meter name. They can be collected using standard tools like dotnet-counters, OpenTelemetry, or Prometheus exporters.
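
As a rough illustration of how a collector sees these instruments, the sketch below attaches a MeterListener to a meter named Microsoft.Build and polls its observable instruments. The meter here is a stand-in created by the sketch itself so that it runs on its own, and the gauge value is made up; it is not the code from this PR.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Stand-in meter so the sketch is self-contained. In a real build the
// instruments are created inside MSBuild, not by the collector.
using var meter = new Meter("Microsoft.Build");
int nodeCount = 3; // hypothetical value, for illustration only
meter.CreateObservableGauge<int>("msbuild_scheduler_node_count", () => nodeCount, unit: "nodes");

var observed = new Dictionary<string, int>();

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments that belong to the meter of interest.
    if (instrument.Meter.Name == "Microsoft.Build")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<int>((instrument, value, tags, state) =>
{
    observed[instrument.Name] = value;
});
listener.Start();

// Observable instruments are pull-based: their callbacks run only when polled.
listener.RecordObservableInstruments();

foreach (var (name, value) in observed)
{
    Console.WriteLine($"{name} = {value}");
}
```

Tools such as dotnet-counters do the equivalent polling out of process over the diagnostics pipe.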


Logging Metrics

msbuild_forwarded_log_messages

Type: ObservableCounter (monotonically increasing)
Unit: messages
Description: Total number of log messages forwarded from worker nodes to the central node during distributed builds.

Tags:

  • source_node (int): The node ID that sent the log message

Use Cases:

  • Detect uneven logging distribution across nodes
  • Identify nodes generating excessive log traffic
  • Diagnose communication bottlenecks in distributed builds
  • Monitor logging overhead per node

Scheduler Metrics

msbuild_scheduler_node_count

Type: ObservableGauge
Unit: nodes
Description: Current count of active nodes in the scheduler.

Tags:

  • node.type: Node provider type
    • "outofproc" - Out-of-process worker nodes
    • "inproc" - In-process node

Use Cases:

  • Monitor node availability during builds
  • Verify expected parallelism level
  • Detect node provisioning issues
  • Track node lifecycle events
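
A tagged gauge like this one can be sketched as an observable gauge whose callback returns one measurement per tag value. The meter name, instrument name, and node table below are hypothetical stand-ins, not the scheduler's real data structures.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

using var meter = new Meter("Example.Build"); // hypothetical meter name

// Hypothetical scheduler state: node id -> whether it is the in-process node.
var nodes = new Dictionary<int, bool> { [1] = true, [2] = false, [3] = false };

// Produces one measurement per node.type value each time the gauge is polled.
IEnumerable<Measurement<int>> ObserveNodeCounts()
{
    int inProc = 0, outOfProc = 0;
    foreach (bool isInProc in nodes.Values)
    {
        if (isInProc) inProc++; else outOfProc++;
    }
    yield return new Measurement<int>(inProc, new KeyValuePair<string, object?>("node.type", "inproc"));
    yield return new Measurement<int>(outOfProc, new KeyValuePair<string, object?>("node.type", "outofproc"));
}

meter.CreateObservableGauge<int>("example_node_count", ObserveNodeCounts, unit: "nodes");

// Call the callback directly to show what a collector would receive.
foreach (var m in ObserveNodeCounts())
{
    Console.WriteLine($"node.type={m.Tags[0].Value}: {m.Value}");
}
```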

msbuild_request_blocked_events

Type: ObservableCounter (monotonically increasing)
Unit: events
Description: Count of request blocking events, categorized by the type of blocker that caused the wait.

Tags:

  • blocker_type: The reason a request was blocked
    • "yield" - Request explicitly yielded execution
    • "results_transfer" - Blocked waiting for result transfer
    • "in_progress_target" - Blocked waiting for an in-progress target
    • "new_requests" - Blocked by new child requests

Use Cases:

  • Identify most common blocking patterns
  • Detect excessive yielding behavior
  • Find synchronization bottlenecks
  • Analyze build parallelism effectiveness

msbuild_circular_dependency_errors

Type: ObservableCounter (monotonically increasing)
Unit: errors
Description: Count of circular dependency detections during scheduling.

Tags:

  • error_type: The context where the circular dependency was detected
    • "in_progress_target" - Circular dependency involving in-progress targets
    • "new_requests" - Circular dependency in new request chain

Use Cases:

  • Detect project dependency graph issues
  • Track frequency of circular dependency errors
  • Identify problematic project configurations
  • Monitor build health

msbuild_cores_in_use

Type: ObservableGauge
Unit: cores
Description: Number of CPU cores currently allocated to build requests via IBuildEngine9.RequestCores.

Tags: None

Use Cases:

  • Monitor CPU resource utilization
  • Verify core allocation is working as expected
  • Detect core allocation leaks or inefficiencies
  • Analyze parallelism opportunities

msbuild_pending_core_requests

Type: ObservableGauge
Unit: requests
Description: Number of build requests currently waiting for core allocation.

Tags: None

Use Cases:

  • Identify CPU resource contention
  • Detect over-subscription of cores
  • Find tasks waiting for parallelism resources
  • Tune core allocation strategies

Scheduling Data Metrics

msbuild_scheduler_request_count

Type: ObservableGauge
Unit: requests
Description: Current count of requests in the scheduler by state.

Tags:

  • request.type: The state of the requests being counted
    • "executing" - Currently executing requests
    • "ready" - Requests ready to execute
    • "blocked" - Requests blocked on dependencies
    • "yielding" - Requests that have yielded
    • "unscheduled" - Requests not yet scheduled to nodes

Use Cases:

  • Monitor request queue depths
  • Identify scheduling bottlenecks
  • Detect stuck or stalled requests
  • Analyze scheduler efficiency

msbuild_scheduler_node_configuration_count

Type: ObservableGauge
Unit: configurations
Description: Current count of build configurations assigned to each node.

Tags:

  • node.id (int): The node ID

Use Cases:

  • Monitor configuration distribution across nodes
  • Detect uneven work distribution
  • Identify configuration affinity issues
  • Optimize node assignment strategies

msbuild_scheduler_build_event_count

Type: ObservableGauge
Unit: events
Description: Total count of build events that have occurred during the current build.

Tags: None

Use Cases:

  • Track build activity level
  • Monitor event processing throughput
  • Detect event processing issues
  • General build progress indicator

msbuild_node_idle_time

Type: ObservableCounter (monotonically increasing)
Unit: ms (milliseconds)
Description: Total time each node has spent idle (not executing requests).

Tags:

  • node_id (int): The node ID

Use Cases:

  • Identify underutilized nodes
  • Detect load balancing issues
  • Calculate node efficiency ratios
  • Optimize node count and work distribution
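
One way to turn idle time into an efficiency ratio is to divide the idle-time delta over a sampling window by the window length. The sketch below assumes the collector supplies both numbers; the sample values are invented for illustration.

```csharp
using System;

// Fraction of the sampling window that a node spent doing work.
// idleMs: change in msbuild_node_idle_time over the window.
// elapsedMs: length of the window, measured by the collector.
double NodeEfficiency(double idleMs, double elapsedMs)
{
    if (elapsedMs <= 0) throw new ArgumentOutOfRangeException(nameof(elapsedMs));
    double idleFraction = Math.Clamp(idleMs / elapsedMs, 0.0, 1.0);
    return 1.0 - idleFraction;
}

// Hypothetical samples: idle milliseconds per node during a 10-second window.
var idleByNode = new (int NodeId, double IdleMs)[] { (1, 2500), (2, 9000) };
foreach (var (nodeId, idleMs) in idleByNode)
{
    Console.WriteLine($"node {nodeId}: {NodeEfficiency(idleMs, 10_000):P0} busy");
}
```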

Build Request Engine Metrics

build_request_engine_requests

Type: ObservableGauge
Unit: requests
Description: Number of active build requests in the BuildRequestEngine by state.

Tags:

  • nodeId (int): The node ID where the engine is running
  • state: The state of the request (determined by callback implementation)

Use Cases:

  • Monitor per-node request queue depth
  • Track request lifecycle on each node
  • Identify nodes with processing issues
  • Analyze per-node workload

build_request_engine_work_queue_length

Type: ObservableGauge
Unit: items
Description: Number of work items pending in the BuildRequestEngine's work queue.

Tags:

  • nodeId (int): The node ID where the engine is running

Use Cases:

  • Detect work queue backlog
  • Identify processing bottlenecks
  • Monitor engine responsiveness
  • Track per-node processing capacity

build_request_engine_status

Type: ObservableGauge
Unit: status
Description: Current status of the BuildRequestEngine as an integer enum value.

Tags:

  • nodeId (int): The node ID where the engine is running

Use Cases:

  • Monitor engine lifecycle states
  • Detect stuck or failed engines
  • Track engine state transitions
  • Debug engine behavior

msbuild_configuration_resolution_duration

Type: Histogram
Unit: ms (milliseconds)
Description: Time taken to resolve build configurations (round-trip from configuration request to response).

Tags: None

Use Cases:

  • Identify slow configuration resolution
  • Detect configuration caching issues
  • Optimize configuration creation
  • Find configuration bottlenecks

msbuild_build_request_state_transitions

Type: ObservableCounter (monotonically increasing)
Unit: transitions
Description: Count of build request state transitions.

Tags:

  • transition: The state transition in format "FromState->ToState"
    • Examples: "Active->Waiting", "Waiting->Ready", "Ready->Active", "Active->Complete"

Use Cases:

  • Analyze request lifecycle patterns
  • Detect abnormal state transitions
  • Track how requests flow through states
  • Identify requests getting stuck in specific states
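
The "FromState->ToState" tag format could be produced roughly as follows: a running tally keyed by transition string, reported through an observable counter. All names here are hypothetical, and the PR's actual bookkeeping may differ.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

using var meter = new Meter("Example.Build"); // hypothetical meter name

// Cumulative tally per transition; the values only ever increase.
var transitions = new ConcurrentDictionary<string, long>();

void RecordTransition(string from, string to) =>
    transitions.AddOrUpdate($"{from}->{to}", 1, (_, n) => n + 1);

// Reports one cumulative measurement per transition tag value.
IEnumerable<Measurement<long>> ObserveTransitions()
{
    foreach (var pair in transitions)
    {
        yield return new Measurement<long>(
            pair.Value, new KeyValuePair<string, object?>("transition", pair.Key));
    }
}

meter.CreateObservableCounter<long>("example_state_transitions", ObserveTransitions, unit: "transitions");

RecordTransition("Active", "Waiting");
RecordTransition("Waiting", "Ready");
RecordTransition("Active", "Waiting");

foreach (var pair in transitions)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}
```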

msbuild_request_wait_time

Type: Histogram
Unit: ms (milliseconds)
Description: Time requests spend in the Waiting state, categorized by the reason for blocking.

Tags:

  • reason: The reason the request was waiting
    • "blocking_target" - Waiting for another target to complete
    • "unresolved_configuration" - Waiting for configuration resolution
    • "child_requests" - Waiting for child build requests

Use Cases:

  • Identify most time-consuming blocking scenarios
  • Detect synchronization bottlenecks
  • Find opportunities to improve parallelism
  • Diagnose slow build phases
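
A histogram with a reason tag might be recorded along these lines. The instrument name and the RecordWait helper are hypothetical; the listener is included only so the sketch can show what was recorded.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

using var meter = new Meter("Example.Build"); // hypothetical meter name
Histogram<double> waitTime = meter.CreateHistogram<double>("example_request_wait_time", unit: "ms");

// Capture measurements in-process so the sketch can print them.
var recorded = new List<(double Ms, string Reason)>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (ReferenceEquals(instrument.Meter, meter)) l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    string reason = "";
    foreach (var tag in tags)
    {
        if (tag.Key == "reason") reason = tag.Value?.ToString() ?? "";
    }
    recorded.Add((value, reason));
});
listener.Start();

// Hypothetical helper: record one wait duration together with its cause.
void RecordWait(TimeSpan waited, string reason) =>
    waitTime.Record(waited.TotalMilliseconds, new KeyValuePair<string, object?>("reason", reason));

RecordWait(TimeSpan.FromMilliseconds(120), "child_requests");
RecordWait(TimeSpan.FromMilliseconds(35), "blocking_target");

foreach (var (ms, reason) in recorded)
{
    Console.WriteLine($"{reason}: {ms} ms");
}
```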

Metric Collection Examples

Using dotnet-counters

# Monitor all MSBuild metrics in real-time
dotnet-counters monitor --process-id <pid> --counters Microsoft.Build

# Monitor specific metrics
dotnet-counters monitor --process-id <pid> --counters 'Microsoft.Build[msbuild_node_idle_time,msbuild_request_wait_time]'

Using OpenTelemetry

Configure OpenTelemetry to collect metrics from the Microsoft.Build meter:

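// Requires the OpenTelemetry.Extensions.Hosting and
// OpenTelemetry.Exporter.Prometheus.AspNetCore packages.
using OpenTelemetry.Metrics;

services.AddOpenTelemetry()
    .WithMetrics(builder => builder
        .AddMeter("Microsoft.Build")   // subscribe to the MSBuild meter
        .AddPrometheusExporter());     // expose metrics for Prometheus scraping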

Common Analysis Scenarios

Identifying Parallelism Issues

  1. Check msbuild_node_idle_time - High idle time suggests poor work distribution
  2. Review msbuild_request_blocked_events - Excessive blocking reduces parallelism
  3. Analyze msbuild_request_wait_time histogram - Shows where requests spend time waiting
  4. Monitor msbuild_scheduler_request_count with request.type=blocked - Track blocked request count

Detecting Resource Contention

  1. Watch msbuild_pending_core_requests - Non-zero indicates CPU contention
  2. Check msbuild_cores_in_use vs available cores - Shows resource utilization
  3. Review build_request_engine_work_queue_length - Backlog indicates processing bottleneck

Analyzing Build Performance

  1. Examine msbuild_configuration_resolution_duration percentiles - Identify slow configs
  2. Track msbuild_request_wait_time by reason - Find most impactful wait causes
  3. Monitor msbuild_build_request_state_transitions - Understand request flow patterns
  4. Check msbuild_forwarded_log_messages - Excessive logging can slow builds

Troubleshooting Build Issues

  1. Look for msbuild_circular_dependency_errors - Indicates dependency graph problems
  2. Review msbuild_build_request_state_transitions - Find abnormal state patterns
  3. Check build_request_engine_status per node - Detect failed or stuck engines
  4. Monitor msbuild_scheduler_node_configuration_count distribution - Uneven suggests issues

Implementation Notes

  • All counters are monotonically increasing and track cumulative values
  • Gauges represent current point-in-time snapshots
  • Histograms record individual measurements; percentiles are computed by the collecting tool (for example dotnet-counters or a Prometheus backend)
  • Measurements are only collected when a listener is attached (negligible overhead when unused)
  • Tags/dimensions allow filtering and grouping for detailed analysis
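
The "only collected when a listener is attached" behaviour can be observed through the Enabled property that System.Diagnostics.Metrics exposes on every instrument. This is a minimal sketch with a hypothetical meter and counter rather than the instruments from this PR.

```csharp
using System;
using System.Diagnostics.Metrics;

using var meter = new Meter("Example.Build"); // hypothetical meter name
Counter<long> events = meter.CreateCounter<long>("example_events");

// With no listener attached, the instrument reports itself as disabled.
bool enabledBefore = events.Enabled;

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (ReferenceEquals(instrument.Meter, meter)) l.EnableMeasurementEvents(instrument);
};
listener.Start();

// Once a listener has subscribed, the instrument becomes enabled.
bool enabledAfter = events.Enabled;

Console.WriteLine($"enabled before listener: {enabledBefore}");
Console.WriteLine($"enabled after listener:  {enabledAfter}");
```

Code that has to do expensive work to produce a measurement can check Enabled first and skip that work when nobody is listening.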

@baronfel baronfel requested a review from a team as a code owner October 24, 2025 23:35
Copilot AI review requested due to automatic review settings October 24, 2025 23:35
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds diagnostic metrics support to MSBuild using System.Diagnostics.Metrics.Meter, enabling real-time monitoring of build operations through dotnet-counters. The implementation tracks scheduler states, node counts, and configuration assignments to help analyze build performance and scheduling behavior.

Key Changes:

  • Added System.Diagnostics.Metrics-based gauges throughout the scheduler, node manager, and configuration cache to expose operational metrics
  • Set DOTNET_EnableDiagnostics=0 environment variable for spawned processes to prevent diagnostic port conflicts that cause hangs
  • Updated bootstrap SDK version to RC2

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/Shared/CommunicationsUtilities.cs | Added DOTNET_EnableDiagnostics=0 to environment variables for spawned processes |
| src/Build/BackEnd/Components/Scheduler/SchedulingData.cs | Added observable gauges for tracking request counts by state and node configurations |
| src/Build/BackEnd/Components/Scheduler/Scheduler.cs | Added observable gauges for node counts and total request count |
| src/Build/BackEnd/Components/Communications/NodeManager.cs | Added gauge metric for active node counts with improved disposal pattern |
| src/Build/BackEnd/Components/Communications/NodeLauncher.cs | Set DOTNET_EnableDiagnostics environment variable in process startup |
| src/Build/BackEnd/Components/Caching/ConfigCache.cs | Added observable gauge for configurations per project |
| eng/Versions.props | Updated bootstrap SDK version from RC1 to RC2 |

Comments suppressed due to low confidence (1)

src/Build/BackEnd/Components/Caching/ConfigCache.cs:1

  • The CreateObservableGauge call creates a gauge but doesn't assign it to the _configurationsPerProjectGauge field. This means the gauge instance is created but not referenced, which could lead to it being garbage collected and not functioning as intended. Assign the result to the field: _configurationsPerProjectGauge = _configurationMetrics.CreateObservableGauge(...)

Comment thread src/Build/BackEnd/Components/Caching/ConfigCache.cs Outdated
Comment thread src/Build/BackEnd/Components/Communications/NodeLauncher.cs Outdated
Comment thread src/Build/BackEnd/Components/Scheduler/SchedulingData.cs Outdated
Comment thread src/Build/BackEnd/Components/Caching/ConfigCache.cs
@baronfel baronfel changed the title from "Counters" to "Expose operational counters" Oct 24, 2025
baronfel (Member, Author) commented:

Also, note that all of this data is only for the central node - the process that the user is actually communicating with. The child worker nodes do not send counters back in this way (yet?). That's mostly why the counters I've made so far are focused on scheduling - since that happens on the central node. This makes these counters mostly-useless for multiproc mode, though they could be useful for multiarch mode.

Comment thread src/Build/BackEnd/Components/Communications/NodeLauncher.cs Outdated
@baronfel baronfel marked this pull request as draft October 27, 2025 14:36
Comment thread src/Build/BackEnd/Components/Communications/NodeLauncher.cs Outdated
@baronfel baronfel force-pushed the counters branch 3 times, most recently from 568e4d3 to b52ea9f Compare January 9, 2026 17:18
@baronfel baronfel marked this pull request as ready for review January 12, 2026 17:44
github-actions Bot (Contributor) commented May 4, 2026

This pull request has been automatically closed because it has been open for more than 180 days with no recent activity.

If you believe this work is still relevant, please feel free to reopen or create a new pull request. Thank you for your contribution!

Note

🔒 Integrity filter blocked 44 items

The following items were blocked because they don't meet the GitHub integrity level.

  • #13678 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13676 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13675 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13673 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13661 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13660 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13653 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13651 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13650 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13585 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13577 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13576 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13559 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13556 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13548 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13546 list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • ... and 28 more items

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Close Stale Pull Requests · ● 954.2K

@github-actions github-actions Bot closed this May 4, 2026
3 participants