
validator: Pin core in PoH speed test #5485

Closed
steviez wants to merge 1 commit into anza-xyz:master from steviez:val_pin_poh_speed_check

Conversation

@steviez

@steviez steviez commented Mar 25, 2025

Problem

Checking the PoH speed requires a Bank to derive the cluster hash rate as of #2447. By the time a Bank is available, many services have started up and are competing for CPU time.

Summary of Changes

Use a separate thread for the PoH speed test, and pin the thread to a specific core like PohService does

Offered as an alternative to #4185. I personally like this solution more as it:

  • Is more contained (i.e., no split)
  • Is more similar to what PohService actually does at steady state
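
A minimal sketch of the idea, assuming the core_affinity crate that PohService already uses (quoted later in this thread) and solana_sdk's hashv; the function name, thread name, and sample-size parameter here are illustrative, not the actual patch:

use std::{hint::black_box, thread, time::Instant};

use solana_sdk::hash::{hashv, Hash};

fn check_poh_speed_pinned(pinned_cpu_core: usize, hashes_sample_size: u64) -> u64 {
    // Run the check on its own thread so pinning does not disturb the
    // caller's affinity.
    thread::Builder::new()
        .name("solPohSpeedChk".to_string())
        .spawn(move || {
            // Pin like PohService does; proceed unpinned if core
            // enumeration is unavailable on this platform.
            if let Some(cores) = core_affinity::get_core_ids() {
                core_affinity::set_for_current(cores[pinned_cpu_core]);
            }
            let mut hash = Hash::default();
            let start = Instant::now();
            for _ in 0..hashes_sample_size {
                hash = hashv(&[hash.as_ref()]);
            }
            black_box(hash); // keep the loop from being optimized away
            let elapsed_ns = start.elapsed().as_nanos().max(1);
            // Computed hashes per second, as reported in the logs below.
            (hashes_sample_size as u128 * 1_000_000_000 / elapsed_ns) as u64
        })
        .expect("spawn PoH speed check thread")
        .join()
        .expect("join PoH speed check thread")
}

The join means startup still blocks on the check as before; only the affinity of the measuring thread changes.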

@steviez steviez force-pushed the val_pin_poh_speed_check branch from 9d6e006 to 44ec9f9 Compare March 25, 2025 22:10
@steviez
Author

steviez commented Mar 25, 2025

Tested this out. Some startup runs from a recent tip of master (mean = 14,442,587.5, stddev = 60,795.9):

[...] PoH speed check: computed hashes per second 14461433, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14422646, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14371642, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14474593, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14536324, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14388887, target hashes per second 10000000

and then this branch (mean = 14,349,034, stddev = 163,245.36):

[...] PoH speed check: computed hashes per second 14194282, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14441717, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14248035, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14545363, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14492728, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14172079, target hashes per second 10000000

These were run on the same machine a day apart. This data would suggest that the change actually degraded things; will need to poke around more but will obviously not be pushing the PR as-is if things degrade.

@codecov-commenter

codecov-commenter commented Mar 25, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 83.2%. Comparing base (9f273bc) to head (3e2fbdc).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5485   +/-   ##
=======================================
  Coverage    83.2%    83.2%           
=======================================
  Files         853      853           
  Lines      374743   374763   +20     
=======================================
+ Hits       311857   311874   +17     
- Misses      62886    62889    +3     

@alexpyattaev

This makes sense: pinning a 100% CPU-bound load to a core will make that core very hot. Normally, the scheduler would prefer to move that load to a colder core so the overheated one can cool off. Pinning makes that migration impossible, so the CPU will be forced to throttle.

@steviez
Author

steviez commented Mar 26, 2025

This makes sense: pinning a 100% CPU-bound load to a core will make that core very hot. Normally, the scheduler would prefer to move that load to a colder core so the overheated one can cool off. Pinning makes that migration impossible, so the CPU will be forced to throttle.

That's fair. For what it is worth, I think there is merit in pinning the PoH speed check thread to match PohService behavior, even if that means the speed check gives up some raw performance:

// PoH service runs in a tight loop, generating hashes as fast as possible.
// Let's dedicate one of the CPU cores to this thread so that it can gain
// from cache performance.
if let Some(cores) = core_affinity::get_core_ids() {
core_affinity::set_for_current(cores[pinned_cpu_core]);
}

There is no value in spitting out a faster number at startup that can't be realized at steady state.

As for my test results, the degradation could also be a symptom of improper system tuning and/or sampling (i.e., too few hashes). I might experiment a little more and/or send this over to a few validators who I think might be interested in playing around with this.

@alexpyattaev

If you want to accurately measure PoH rate, you would want to measure across e.g. 10 batches of 10ms rather than 1 batch of 100ms. Discard highest and lowest batch mean, average the rest.
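
A rough sketch of that scheme, a trimmed mean over fixed-duration batches (the helper name, chunk size, and reuse of solana_sdk's hashv are illustrative assumptions, not code from the PR):

use std::{hint::black_box, time::{Duration, Instant}};

use solana_sdk::hash::{hashv, Hash};

// Requires num_batches >= 3 so the trim leaves something to average.
fn trimmed_mean_hash_rate(num_batches: usize, batch_duration: Duration) -> u64 {
    // Check the clock once per chunk rather than per hash, so timing
    // overhead stays out of the hot loop.
    const CHUNK: u64 = 4096;
    let mut rates: Vec<u64> = (0..num_batches)
        .map(|_| {
            let mut hash = Hash::default();
            let mut num_hashes = 0u64;
            let start = Instant::now();
            while start.elapsed() < batch_duration {
                for _ in 0..CHUNK {
                    hash = hashv(&[hash.as_ref()]);
                }
                num_hashes += CHUNK;
            }
            black_box(hash); // keep the loop from being optimized away
            (num_hashes as u128 * 1_000_000_000 / start.elapsed().as_nanos().max(1)) as u64
        })
        .collect();
    rates.sort_unstable();
    let kept = &rates[1..rates.len() - 1]; // discard lowest and highest batch
    kept.iter().sum::<u64>() / kept.len() as u64
}

// e.g. trimmed_mean_hash_rate(10, Duration::from_millis(10)) for 10 batches of 10ms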


@alexpyattaev alexpyattaev left a comment

LGTM

Use a separate thread for the PoH speed test, and pin the thread to a
specific core like PohService does
@steviez steviez force-pushed the val_pin_poh_speed_check branch from 44ec9f9 to 3e2fbdc Compare July 21, 2025 05:41
@steviez
Author

steviez commented Jul 21, 2025

If you want to accurately measure PoH rate, you would want to measure across e.g. 10 batches of 10ms rather than 1 batch of 100ms. Discard highest and lowest batch mean, average the rest.

Yeah, this logic could potentially be revisited. I see that you adjusted your duration accordingly for multiple trials, but we definitely want to keep the total measurement duration low to avoid slowing node restart in any meaningful way.

If we want to tie it into what actually runs, we could do num_ticks_per_slot measurements of num_hashes_per_tick hashes.
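
(For scale, using back-of-envelope assumptions rather than figures from the PR: with the 10,000,000 hashes/sec target shown in the logs above, 62,500 hashes per tick and 64 ticks per slot would come to 4,000,000 hashes total, i.e. about 400 ms at the target rate and roughly 280 ms at the ~14.4M hashes/sec measured earlier.)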

That being said, I might argue we could break that into a separate PR. For the sake of a public paper trail, another item that has come up multiple times is isolating the PoH core so other threads are not scheduled on it.

@steviez steviez requested a review from alexpyattaev July 21, 2025 06:03
@steviez
Author

steviez commented Jul 21, 2025

I cleared your review since this one had been sitting so long and rebasing did not clear your ship-it. I might experiment a bit more just to see, so no need to review again immediately.

@steviez steviez marked this pull request as draft July 21, 2025 06:05
@t-nelson

i dunno how much we can expect an improvement from thread pinning without core isolation. being migrated away from an otherwise busy core is likely preferable. need to be sure that that core is never busy

that said, if we lean into the test too hard, we're going to diverge from normal runtime poh perf, so should likely be considering changing that config too (if they're different)
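
(For context, not from the thread: on Linux, guaranteeing a core is never busy typically means booting with the isolcpus= kernel parameter, e.g. isolcpus=2, or confining other tasks to a cpuset; pinning alone only fixes where the PoH thread runs, not who else the scheduler puts on that core.)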

@steviez
Author

steviez commented Jul 22, 2025

i dunno how much we can expect an improvement from thread pinning without core isolation. being migrated away from an otherwise busy core is likely preferable. need to be sure that that core is never busy

that said, if we lean into the test too hard, we're going to diverge from normal runtime poh perf, so should likely be considering changing that config too (if they're different)

The way I see it, there are two changes to make:

  1. Make the PoH test more closely match what PohService does
  2. Isolate the core for both the test and PohService

This PR accomplishes 1., as pinning (without isolation) is what PohService does:

// PoH service runs in a tight loop, generating hashes as fast as possible.
// Let's dedicate one of the CPU cores to this thread so that it can gain
// from cache performance.
if let Some(cores) = core_affinity::get_core_ids() {
core_affinity::set_for_current(cores[pinned_cpu_core]);
}

I think we want both changes, but I'm also currently thinking that we can make them in either order. I'm also looking at 2., so if someone has stronger convictions than me on the order of operations, we can merge one before the other.

@t-nelson

seems like doing this one first might make borderline hw fail more frequently with the std up 2.5x. i'm not sure we should be taking an (unexpected) step backward in anticipation of a future change clawing it all back

@steviez
Author

steviez commented Jul 22, 2025

seems like doing this one first might make borderline hw fail more frequently with the std up 2.5x. i'm not sure we should be taking an (unexpected) step backward in anticipation of a future change clawing it all back

Technically, isolating a core (but not yet using it for the test) could make the test flakier as there is now one less core available for all the competing threads.

So, maybe one PR is the move

@alexpyattaev

Would it be wrong to let PohService start, then observe its hash rate for 60 seconds rather than running a dedicated speed test?

@alexpyattaev alexpyattaev removed their request for review October 26, 2025 08:51
@github-actions

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the stale label Feb 19, 2026
@github-actions

This pull request was closed because it has been stale for 7 days with no activity.

@github-actions github-actions Bot closed this Feb 26, 2026
@steviez steviez deleted the val_pin_poh_speed_check branch February 26, 2026 16:12