Reduce spin-waiting in the thread pool on Arm processors #115402

kouvel · 2025-05-08T15:52:24Z

Currently, the spin count is multiplied by 4 on Arm processors to avoid throughput regressions, but this appears to significantly increase CPU usage without much benefit
This change removes the multiplier, restoring the spin count on Arm processors to the same value as on x64. With this, throughput appears to be mostly similar, and CPU usage is significantly reduced in many cases.
There appear to be a few small throughput regressions in limited-connection high-throughput tests, but that seems to be mostly an artifact of limiting the connections and is not necessarily indicative of lower performance
- In limited-connection high-throughput tests, a request is sent on a connection once the response to the previous request is received. In bursty scenarios, spin-waiting more can reduce the response time for work items queued to the thread pool, resulting in a slightly earlier response compared with spin-waiting less. The difference is typically very short, in the order of low microseconds or less. When spin-waiting less with a limited number of connections, the slight delay in response results in a slight delay in the next request being sent, and this gets compounded. Effectively, the client ends up sending fewer requests per unit of time due to this artifact, hence the lower throughput. Due to the lower CPU usage with less spin-waiting, if more connections were used, the server can actually handle the same higher RPS at lower CPU usage and with roughly the same latencies.
- The same kind of artifact is seen in limited-connection high-throughput benchmarks to a larger degree when spin-waiting in the thread pool is disabled. Despite this change, in some scenarios it may still be more beneficial to disable spin-waiting (which many scenarios currently do without any significant loss in performance).

- Currently, the spin count is multiplied by 4 on Arm processors to avoid throughput regressions, but this appears to significantly increase CPU usage without much benefit - This change removes the multiplier, restoring the spin count on Arm processors to the same value as on x64. With this, throughput appears to be mostly similar, and CPU usage is significantly reduced in many cases. - There appear to be a few small throughput regressions in limited-connection high-throughput tests, but that seems to be mostly an artifact of limiting the connections and is not necessarily indicative of lower performance - In limited-connection high-throughput tests, a request is sent on a connection once the response to the previous request is received. In bursty scenarios, spin-waiting more can reduce the response time for work items queued to the thread pool, resulting in a slightly earlier response compared with spin-waiting less. The difference is typically very short, in the order of low microseconds or less. When spin-waiting less with a limited number of connections, the slight delay in response results in a slight delay in the next request being sent, and this gets compounded. Effectively, the client ends up sending fewer requests per unit of time due to this artifact, hence the lower throughput. Due to the lower CPU usage with less spin-waiting, if more connections were used, the server can actually handle the same higher RPS at lower CPU usage and with roughly the same latencies. - The same kind of artifact is seen in limited-connection high-throughput benchmarks to a larger degree when spin-waiting in the thread pool is disabled. Despite this change, in some scenarios it may still be more beneficial to disable spin-waiting (which many scenarios currently do without any significant loss in performance).

dotnet-policy-service · 2025-05-08T15:52:55Z

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Copilot

Pull Request Overview

This PR simplifies the spin-wait logic in the thread pool by removing the ARM-specific multiplier, thereby unifying the behavior across platforms and reducing unnecessary CPU usage. Key changes include:

Removing conditional compilation for ARM and related architectures.
Setting the semaphore spin count constant directly to 70.
Eliminating the unused baseline constant for spin count.

Comments suppressed due to low confidence (1)

src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs:15

Since the 'SemaphoreSpinCountDefaultBaseline' constant is no longer used after this change, consider removing it to reduce code clutter.

private const int SemaphoreSpinCountDefaultBaseline = 70;

Copilot · 2025-05-08T15:52:55Z

src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs

        private static partial class WorkerThread
        {
            private static readonly short ThreadsToKeepAlive = DetermineThreadsToKeepAlive();



Consider adding a comment that explains the removal of the ARM-specific spin multiplier and notes the performance improvements observed, to aid future maintainers.

Suggested change

// The ARM-specific spin multiplier was removed to simplify the code and ensure consistent behavior across architectures.

// Performance testing showed that a default spin count of 70 provides optimal performance on both ARM and non-ARM platforms.

kouvel · 2025-05-08T15:53:29Z

Some perf numbers from Cobalt 100 below.

Json, 48-proc VM, 4096 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	2,016,322		41.9		48145.8
140	2,014,717	-0.08%	41.0	-0.9	49187.7	2.16%
After: 70	2,012,294	-0.20%	40.1	-1.8	50177.8	4.22%
35	2,003,038	-0.66%	39.1	-2.8	51282.0	6.51%
17	1,994,986	-1.06%	39.0	-2.8	51099.4	6.13%
8	1,975,699	-2.01%	37.8	-4.1	52305.9	8.64%
0	1,962,333	-2.68%	38.3	-3.6	51216.1	6.38%

No regression in throughput, slight reduction in CPU usage
Reducing the spin count further appears to regress throughput slightly

Orchard, 48-proc VM, 64 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	29,806		43.2		690.0
140	29,595	-0.71%	42.9	-0.3	690.5	0.07%
After: 70	29,275	-1.78%	42.0	-1.2	697.5	1.09%
35	28,582	-4.10%	41.4	-1.8	691.2	0.18%
17	28,089	-5.76%	40.6	-2.6	691.6	0.24%
0	26,910	-9.72%	38.9	-4.3	692.4	0.36%

Small regression in throughput, slight reduction in CPU usage
Reducing the spin count further appears to regress throughput more significantly

Orchard, 48-proc VM, 16 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	11,442		14.9		767.3
140	11,484	0.36%	14.3	-0.7	805.2	4.94%
After: 70	11,365	-0.68%	13.4	-1.5	845.8	10.24%
0	10,806	-5.56%	12.2	-2.7	883.8	15.19%

No regression in throughput, small reduction in CPU usage. CPU efficiency (RPS per proc) improves with lower spin counts.
Disabling spin-waiting appears to reduce throughput further with a limited number of connections, but with better CPU efficiency

Orchard, 48-proc VM, 8 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	6,044		7.8		771.9
140	5,954	-1.48%	7.2	-0.6	824.2	6.78%
After: 70	5,997	-0.77%	6.9	-1.0	874.2	13.25%
0	5,644	-6.61%	6.1	-1.8	930.9	20.60%

Similar to results for 16 connections, differences are a bit more pronounced

Orchard with Postgres DB on separate machine, 16-proc VM, 160 connections, 1600 fixed RPS

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	1,601.5		2.486		644.2
140	1,601.0	-0.03%	2.217	-0.269	722.1	12.09%
After: 70	1,600.5	-0.06%	2.033	-0.454	787.5	22.24%
0	1,600.8	-0.05%	1.862	-0.624	859.6	33.43%

Significant reduction in CPU usage and improvement in CPU efficiency

Orchard with Postgres DB on separate machine, 16-proc VM, 20 connections, 200 fixed RPS

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	200.3		0.421		475.9
140	200.0	-0.12%	0.340	-0.081	588.7	23.69%
After: 70	200.0	-0.12%	0.298	-0.123	672.3	41.25%
0	200.0	-0.12%	0.265	-0.156	754.7	58.58%

Similar to results for 160 connections and 1600 fixed RPS, differences are much more pronounced

kouvel · 2025-05-08T15:54:42Z

Some perf numbers from aspnet-citrine-arm-lin below.

Json, 80 procs, 256 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	865,533		51.6		16761.6
After: 70	856,088	-1.09%	41.0	-10.6	20872.3	24.52%
0	793,820	-8.29%	35.2	-16.4	22558.1	34.58%

Json, 80 procs, 512 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	1,029,221		59.0		17431.9
After: 70	1,029,566	0.03%	49.1	-10.0	20973.0	20.31%
0	972,028	-5.56%	43.1	-15.9	22540.6	29.31%

Json, 80 procs, 1024 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	1,113,428		62.4		17829.3
After: 70	1,107,307	-0.55%	55.4	-7.1	19994.9	12.15%
0	1,091,217	-1.99%	50.5	-12.0	21623.0	21.28%

Fortunes, 80 procs, 256 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	413,150		46.9		8799.8
After: 70	390,563	-5.47%	36.8	-10.1	10601.7	20.48%
0	346,682	-16.09%	30.2	-16.7	11461.7	30.25%

Fortunes, 80 procs, 512 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	476,625		51.7		9217.9
After: 70	471,417	-1.09%	42.2	-9.5	11170.6	21.18%
0	436,519	-8.41%	35.2	-16.5	12387.6	34.39%

Fortunes, 80 procs, 1024 connections

Spin count	RPS	Diff	Procs used	Diff	RPS per proc used	Diff
Before: 280	486,773		51.7		9414.6
After: 70	497,833	2.27%	44.6	-7.1	11168.9	18.63%
0	471,301	-3.18%	37.2	-14.5	12663.2	34.51%

Small regression in throughput in Fortunes at 256 connections, though it's clear that the server can process higher RPS with more connections
Significant reduction in CPU usage and improvement in CPU efficiency
Disabling spin-waiting results in the best CPU efficiency, though with larger throughput regressions at limited lower connection counts

kouvel added this to the 10.0.0 milestone May 8, 2025

kouvel self-assigned this May 8, 2025

Copilot AI review requested due to automatic review settings May 8, 2025 15:52

kouvel added the area-System.Threading label May 8, 2025

Copilot AI reviewed May 8, 2025

View reviewed changes

EgorBo approved these changes May 8, 2025

View reviewed changes

kouvel merged commit 0fe82fb into dotnet:main May 12, 2025
140 of 143 checks passed

kouvel deleted the ArmTpSpin branch May 12, 2025 14:13

LoopedBard3 mentioned this pull request May 15, 2025

[Perf] Linux/arm64: 29 Regressions on 5/12/2025 3:12:34 PM +00:00 #115620

Closed

github-actions bot locked and limited conversation to collaborators Jun 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reduce spin-waiting in the thread pool on Arm processors #115402

Reduce spin-waiting in the thread pool on Arm processors #115402

Uh oh!

kouvel commented May 8, 2025

Uh oh!

dotnet-policy-service bot commented May 8, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 8, 2025

Uh oh!

kouvel commented May 8, 2025

Uh oh!

kouvel commented May 8, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants



	// The ARM-specific spin multiplier was removed to simplify the code and ensure consistent behavior across architectures.
	// Performance testing showed that a default spin count of 70 provides optimal performance on both ARM and non-ARM platforms.

Reduce spin-waiting in the thread pool on Arm processors #115402

Reduce spin-waiting in the thread pool on Arm processors #115402

Uh oh!

Conversation

kouvel commented May 8, 2025

Uh oh!

dotnet-policy-service bot commented May 8, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI May 8, 2025

Choose a reason for hiding this comment

Uh oh!

kouvel commented May 8, 2025

Uh oh!

kouvel commented May 8, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants