Reduce spin-waiting in the thread pool on Arm processors #115402
kouvel commented May 8, 2025
- Currently, the spin count is multiplied by 4 on Arm processors to avoid throughput regressions, but this appears to significantly increase CPU usage without much benefit.
- This change removes the multiplier, restoring the spin count on Arm processors to the same value as on x64. With this, throughput appears to be mostly similar, and CPU usage is significantly reduced in many cases.
- There appear to be a few small throughput regressions in limited-connection high-throughput tests, but these seem to be mostly an artifact of limiting the connections and are not necessarily indicative of lower performance.
- In limited-connection high-throughput tests, a request is sent on a connection only once the response to the previous request is received. In bursty scenarios, spin-waiting more can reduce the response time for work items queued to the thread pool, yielding a slightly earlier response than spin-waiting less. The difference is typically very short, on the order of a few microseconds or less. When spin-waiting less with a limited number of connections, the slight delay in each response slightly delays the next request, and this compounds: the client effectively sends fewer requests per unit of time, hence the lower measured throughput (see the worked model after this list). Because less spin-waiting also lowers CPU usage, if more connections were used, the server could handle the same higher RPS at lower CPU usage and with roughly the same latencies.
- The same kind of artifact appears in limited-connection high-throughput benchmarks to a larger degree when spin-waiting in the thread pool is disabled entirely. Even with this change, in some scenarios it may still be more beneficial to disable spin-waiting (which many scenarios currently do without any significant loss in performance).
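To put rough numbers on the compounding effect described above (illustrative values, not measurements from the benchmarks below): in a closed-loop test with $C$ connections, each connection sends its next request only after receiving the previous response, so throughput is approximately

$$\mathrm{RPS} \approx \frac{C}{L + \delta},$$

where $L$ is the baseline end-to-end latency and $\delta$ is the extra per-response delay from spin-waiting less. With $C = 16$, $L = 1\ \mathrm{ms}$, and $\delta = 5\ \mu\mathrm{s}$, RPS drops from $16{,}000$ to about $15{,}920$, roughly 0.5%, even though the server has spare capacity; adding connections would recover the throughput at lower CPU usage.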
Tagging subscribers to this area: @mangod9
Pull Request Overview
This PR simplifies the spin-wait logic in the thread pool by removing the ARM-specific multiplier, thereby unifying the behavior across platforms and reducing unnecessary CPU usage (a rough before/after sketch follows the list). Key changes include:
- Removing conditional compilation for ARM and related architectures.
- Setting the semaphore spin count constant directly to 70.
- Eliminating the unused baseline constant for spin count.
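As an illustration of the shape of the change, here is a minimal sketch. Only `SemaphoreSpinCountDefaultBaseline = 70` appears verbatim in this thread; the `SemaphoreSpinCountDefault` name, the class name, the exact `#if` conditions, and the multiplier placement are assumptions based on the description above, not copied from the diff.

```csharp
// Hypothetical container class, for illustration only.
internal static class PortableThreadPoolSketch
{
    // Before (sketch): on Arm, the baseline spin count was multiplied by 4
    // to avoid throughput regressions, at the cost of busy-wait CPU time:
    //
    //   private const int SemaphoreSpinCountDefaultBaseline = 70;
    //   #if TARGET_ARM || TARGET_ARM64
    //   private const int SemaphoreSpinCountDefault = SemaphoreSpinCountDefaultBaseline * 4;
    //   #else
    //   private const int SemaphoreSpinCountDefault = SemaphoreSpinCountDefaultBaseline;
    //   #endif

    // After: the same spin count on all architectures, which leaves the
    // baseline constant unused (the suppressed comment below suggests deleting it).
    private const int SemaphoreSpinCountDefault = 70;
}
```

Separately, for scenarios where disabling spin-waiting entirely is the better trade-off, the spin limit can be overridden to 0 via the thread pool's unfair-semaphore spin-limit configuration knob (if I recall correctly, the `System.Threading.ThreadPool.UnfairSemaphoreSpinLimit` AppContext setting, or the `DOTNET_ThreadPool_UnfairSemaphoreSpinLimit` environment variable).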
Comments suppressed due to low confidence (1)
src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs:15

- Since the 'SemaphoreSpinCountDefaultBaseline' constant is no longer used after this change, consider removing it to reduce code clutter.

```csharp
private const int SemaphoreSpinCountDefaultBaseline = 70;
```
```csharp
private static partial class WorkerThread
{
    private static readonly short ThreadsToKeepAlive = DetermineThreadsToKeepAlive();
```
Copilot AI commented May 8, 2025
Consider adding a comment that explains the removal of the ARM-specific spin multiplier and notes the performance improvements observed, to aid future maintainers.
```csharp
// The ARM-specific spin multiplier was removed to simplify the code and ensure consistent behavior across architectures.
// Performance testing showed that a default spin count of 70 provides optimal performance on both ARM and non-ARM platforms.
```
Some perf numbers from Cobalt 100 below (result tables omitted from this capture; the configurations were):

- Json, 48-proc VM, 4096 connections
- Orchard, 48-proc VM, 64 connections
- Orchard, 48-proc VM, 16 connections
- Orchard, 48-proc VM, 8 connections
- Orchard with Postgres DB on a separate machine, 16-proc VM, 160 connections, 1600 fixed RPS
- Orchard with Postgres DB on a separate machine, 16-proc VM, 20 connections, 200 fixed RPS
Some more perf numbers (result tables omitted; the configurations were):

- Json, 80 procs, 256 connections
- Json, 80 procs, 512 connections
- Json, 80 procs, 1024 connections
- Fortunes, 80 procs, 256 connections
- Fortunes, 80 procs, 512 connections
- Fortunes, 80 procs, 1024 connections