Skip to content

Conversation

kouvel
Copy link
Contributor

@kouvel kouvel commented May 8, 2025

  • Currently, the spin count is multiplied by 4 on Arm processors to avoid throughput regressions, but this appears to significantly increase CPU usage without much benefit
  • This change removes the multiplier, restoring the spin count on Arm processors to the same value as on x64. With this, throughput appears to be mostly similar, and CPU usage is significantly reduced in many cases.
  • There appear to be a few small throughput regressions in limited-connection high-throughput tests, but that seems to be mostly an artifact of limiting the connections and is not necessarily indicative of lower performance
    • In limited-connection high-throughput tests, a request is sent on a connection once the response to the previous request is received. In bursty scenarios, spin-waiting more can reduce the response time for work items queued to the thread pool, resulting in a slightly earlier response compared with spin-waiting less. The difference is typically very short, in the order of low microseconds or less. When spin-waiting less with a limited number of connections, the slight delay in response results in a slight delay in the next request being sent, and this gets compounded. Effectively, the client ends up sending fewer requests per unit of time due to this artifact, hence the lower throughput. Due to the lower CPU usage with less spin-waiting, if more connections were used, the server can actually handle the same higher RPS at lower CPU usage and with roughly the same latencies.
    • The same kind of artifact is seen in limited-connection high-throughput benchmarks to a larger degree when spin-waiting in the thread pool is disabled. Despite this change, in some scenarios it may still be more beneficial to disable spin-waiting (which many scenarios currently do without any significant loss in performance).

- Currently, the spin count is multiplied by 4 on Arm processors to avoid throughput regressions, but this appears to significantly increase CPU usage without much benefit
- This change removes the multiplier, restoring the spin count on Arm processors to the same value as on x64. With this, throughput appears to be mostly similar, and CPU usage is significantly reduced in many cases.
- There appear to be a few small throughput regressions in limited-connection high-throughput tests, but that seems to be mostly an artifact of limiting the connections and is not necessarily indicative of lower performance
  - In limited-connection high-throughput tests, a request is sent on a connection once the response to the previous request is received. In bursty scenarios, spin-waiting more can reduce the response time for work items queued to the thread pool, resulting in a slightly earlier response compared with spin-waiting less. The difference is typically very short, in the order of low microseconds or less. When spin-waiting less with a limited number of connections, the slight delay in response results in a slight delay in the next request being sent, and this gets compounded. Effectively, the client ends up sending fewer requests per unit of time due to this artifact, hence the lower throughput. Due to the lower CPU usage with less spin-waiting, if more connections were used, the server can actually handle the same higher RPS at lower CPU usage and with roughly the same latencies.
  - The same kind of artifact is seen in limited-connection high-throughput benchmarks to a larger degree when spin-waiting in the thread pool is disabled. Despite this change, in some scenarios it may still be more beneficial to disable spin-waiting (which many scenarios currently do without any significant loss in performance).
@kouvel kouvel added this to the 10.0.0 milestone May 8, 2025
@kouvel kouvel self-assigned this May 8, 2025
@Copilot Copilot AI review requested due to automatic review settings May 8, 2025 15:52
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR simplifies the spin-wait logic in the thread pool by removing the ARM-specific multiplier, thereby unifying the behavior across platforms and reducing unnecessary CPU usage. Key changes include:

  • Removing conditional compilation for ARM and related architectures.
  • Setting the semaphore spin count constant directly to 70.
  • Eliminating the unused baseline constant for spin count.
Comments suppressed due to low confidence (1)

src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs:15

  • Since the 'SemaphoreSpinCountDefaultBaseline' constant is no longer used after this change, consider removing it to reduce code clutter.
private const int SemaphoreSpinCountDefaultBaseline = 70;

private static partial class WorkerThread
{
private static readonly short ThreadsToKeepAlive = DetermineThreadsToKeepAlive();

Copy link

Copilot AI May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a comment that explains the removal of the ARM-specific spin multiplier and notes the performance improvements observed, to aid future maintainers.

Suggested change
// The ARM-specific spin multiplier was removed to simplify the code and ensure consistent behavior across architectures.
// Performance testing showed that a default spin count of 70 provides optimal performance on both ARM and non-ARM platforms.

Copilot uses AI. Check for mistakes.

@kouvel
Copy link
Contributor Author

kouvel commented May 8, 2025

Some perf numbers from Cobalt 100 below.

Json, 48-proc VM, 4096 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 2,016,322 41.9 48145.8
140 2,014,717 -0.08% 41.0 -0.9 49187.7 2.16%
After: 70 2,012,294 -0.20% 40.1 -1.8 50177.8 4.22%
35 2,003,038 -0.66% 39.1 -2.8 51282.0 6.51%
17 1,994,986 -1.06% 39.0 -2.8 51099.4 6.13%
8 1,975,699 -2.01% 37.8 -4.1 52305.9 8.64%
0 1,962,333 -2.68% 38.3 -3.6 51216.1 6.38%
  • No regression in throughput, slight reduction in CPU usage
  • Reducing the spin count further appears to regress throughput slightly

Orchard, 48-proc VM, 64 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 29,806 43.2 690.0
140 29,595 -0.71% 42.9 -0.3 690.5 0.07%
After: 70 29,275 -1.78% 42.0 -1.2 697.5 1.09%
35 28,582 -4.10% 41.4 -1.8 691.2 0.18%
17 28,089 -5.76% 40.6 -2.6 691.6 0.24%
0 26,910 -9.72% 38.9 -4.3 692.4 0.36%
  • Small regression in throughput, slight reduction in CPU usage
  • Reducing the spin count further appears to regress throughput more significantly

Orchard, 48-proc VM, 16 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 11,442 14.9 767.3
140 11,484 0.36% 14.3 -0.7 805.2 4.94%
After: 70 11,365 -0.68% 13.4 -1.5 845.8 10.24%
0 10,806 -5.56% 12.2 -2.7 883.8 15.19%
  • No regression in throughput, small reduction in CPU usage. CPU efficiency (RPS per proc) improves with lower spin counts.
  • Disabling spin-waiting appears to reduce throughput further with a limited number of connections, but with better CPU efficiency

Orchard, 48-proc VM, 8 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 6,044 7.8 771.9
140 5,954 -1.48% 7.2 -0.6 824.2 6.78%
After: 70 5,997 -0.77% 6.9 -1.0 874.2 13.25%
0 5,644 -6.61% 6.1 -1.8 930.9 20.60%
  • Similar to results for 16 connections, differences are a bit more pronounced

Orchard with Postgres DB on separate machine, 16-proc VM, 160 connections, 1600 fixed RPS

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 1,601.5 2.486 644.2
140 1,601.0 -0.03% 2.217 -0.269 722.1 12.09%
After: 70 1,600.5 -0.06% 2.033 -0.454 787.5 22.24%
0 1,600.8 -0.05% 1.862 -0.624 859.6 33.43%
  • Significant reduction in CPU usage and improvement in CPU efficiency

Orchard with Postgres DB on separate machine, 16-proc VM, 20 connections, 200 fixed RPS

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 200.3 0.421 475.9
140 200.0 -0.12% 0.340 -0.081 588.7 23.69%
After: 70 200.0 -0.12% 0.298 -0.123 672.3 41.25%
0 200.0 -0.12% 0.265 -0.156 754.7 58.58%
  • Similar to results for 160 connections and 1600 fixed RPS, differences are much more pronounced

@kouvel
Copy link
Contributor Author

kouvel commented May 8, 2025

Some perf numbers from aspnet-citrine-arm-lin below.

Json, 80 procs, 256 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 865,533 51.6 16761.6
After: 70 856,088 -1.09% 41.0 -10.6 20872.3 24.52%
0 793,820 -8.29% 35.2 -16.4 22558.1 34.58%

Json, 80 procs, 512 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 1,029,221 59.0 17431.9
After: 70 1,029,566 0.03% 49.1 -10.0 20973.0 20.31%
0 972,028 -5.56% 43.1 -15.9 22540.6 29.31%

Json, 80 procs, 1024 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 1,113,428 62.4 17829.3
After: 70 1,107,307 -0.55% 55.4 -7.1 19994.9 12.15%
0 1,091,217 -1.99% 50.5 -12.0 21623.0 21.28%

Fortunes, 80 procs, 256 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 413,150 46.9 8799.8
After: 70 390,563 -5.47% 36.8 -10.1 10601.7 20.48%
0 346,682 -16.09% 30.2 -16.7 11461.7 30.25%

Fortunes, 80 procs, 512 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 476,625 51.7 9217.9
After: 70 471,417 -1.09% 42.2 -9.5 11170.6 21.18%
0 436,519 -8.41% 35.2 -16.5 12387.6 34.39%

Fortunes, 80 procs, 1024 connections

Spin count RPS Diff Procs used Diff RPS per proc used Diff
Before: 280 486,773 51.7 9414.6
After: 70 497,833 2.27% 44.6 -7.1 11168.9 18.63%
0 471,301 -3.18% 37.2 -14.5 12663.2 34.51%
  • Small regression in throughput in Fortunes at 256 connections, though it's clear that the server can process higher RPS with more connections
  • Significant reduction in CPU usage and improvement in CPU efficiency
  • Disabling spin-waiting results in the best CPU efficiency, though with larger throughput regressions at limited lower connection counts

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants