validator: Pin core in PoH speed test by steviez · Pull Request #5485 · anza-xyz/agave

steviez · 2025-03-25T22:07:33Z

Problem

Checking the PoH speed requires a Bank to derive the cluster hash rate as of #2447. By the time a Bank is available, many services have started up and are competing for CPU time.

Summary of Changes

Use a separate thread for the PoH speed test, and pin the thread to a specific core like PohService does

Offered as an alternative #4185. I personally like this solution more as it:

Is more contained (ie no split)
Is more similar to what PohService actually does at steady state

steviez · 2025-03-25T22:21:36Z

Tested this out. Some startups from recent tip of master (mean = 14,442,587.5, stddev = 60,795.9):

[...] PoH speed check: computed hashes per second 14461433, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14422646, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14371642, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14474593, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14536324, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14388887, target hashes per second 10000000

and then this branch (mean = 14,349,034, stddev = 163,245.36):

[...] PoH speed check: computed hashes per second 14194282, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14441717, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14248035, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14545363, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14492728, target hashes per second 10000000
[...] PoH speed check: computed hashes per second 14172079, target hashes per second 10000000

These were run on the same machine a day apart. This data would suggest that the change actually degraded things; will need to poke around more but will obviously not be pushing the PR as-is if things degrade.

codecov-commenter · 2025-03-25T23:15:00Z

Codecov Report

Attention: Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 83.2%. Comparing base (9f273bc) to head (3e2fbdc).
Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5485   +/-   ##
=======================================
  Coverage    83.2%    83.2%           
=======================================
  Files         853      853           
  Lines      374743   374763   +20     
=======================================
+ Hits       311857   311874   +17     
- Misses      62886    62889    +3

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

alexpyattaev · 2025-03-26T08:02:41Z

This makes sense, pinning a core to a 100% CPU-bound load will make that core very hot. Normally, the CPU would prefer this load to be moved to a colder core so the overheated one can cool off. Pinning the core makes this process impossible, so CPU will be forced to throttle.

steviez · 2025-03-26T20:51:56Z

This makes sense, pinning a core to a 100% CPU-bound load will make that core very hot. Normally, the CPU would prefer this load to be moved to a colder core so the overheated one can cool off. Pinning the core makes this process impossible, so CPU will be forced to throttle.

That's fair. For what it is worth, I think there is merit in pinning the PoH speed check thread to match PohService behavior, even if that means the speed check has raw performance:

agave/poh/src/poh_service.rs

Lines 127 to 132 in 381a5d5

    
           // PoH service runs in a tight loop, generating hashes as fast as possible. 
        
           // Let's dedicate one of the CPU cores to this thread so that it can gain 
        
           // from cache performance. 
        
           if let Some(cores) = core_affinity::get_core_ids() { 
        
               core_affinity::set_for_current(cores[pinned_cpu_core]); 
        
           }

There is no value in spitting out a faster number at startup that can't be realized at steady state.

As for my test results, the degradation could also be a symptom of improper system tuning and/or sampling (ie too few hashes). Might experiment a little more and/or send this over to a few validators who I think might be interested in playing around with this

alexpyattaev · 2025-03-27T16:55:37Z

If you want to accurately measure PoH rate, you would want to measure across e.g. 10 batches of 10ms rather than 1 batch of 100ms. Discard highest and lowest batch mean, average the rest.

alexpyattaev

LGTM

Use a separate thread for the PoH speed test, and pin the thread to a specific core like PohService does

steviez · 2025-07-21T06:03:25Z

If you want to accurately measure PoH rate, you would want to measure across e.g. 10 batches of 10ms rather than 1 batch of 100ms. Discard highest and lowest batch mean, average the rest.

Yeah, this logic could potentially be revisited. I see that you adjusted your duration accordingly for multiple trials, but we definitely want to keep the total measurement duration low to avoid this from slowing node restart in any meaningful way.

If we want to tie it into what actually runs, we could do num_ticks_per_slot measurements of num_hashes_per_tick hashes.

That being said, I might argue we could break that into a separate PR. For the sake of a public papertrail, another item that has come up multiple times is isolating the PoH core so other threads are not scheduled on it

steviez · 2025-07-21T06:04:59Z

I cleared your review since this one had been sitting so long and rebasing did not clear your ship-it; might experiment a bit more just to see so no need to review again immediately.

t-nelson · 2025-07-22T16:22:15Z

i dunno how much we can expect an improvement from thread pinning without core isolation. being migrated away from an otherwise busy core is likely preferable. need to be sure that that core is never busy

that said, if we lean into the test too hard, we're going to diverge from normal runtime poh perf, so should likely be considering changing that config too (if they're different)

steviez · 2025-07-22T19:33:19Z

i dunno how much we can expect an improvement from thread pinning without core isolation. being migrated away from an otherwise busy core is likely preferable. need to be sure that that core is never busy

that said, if we lean into the test too hard, we're going to diverge from normal runtime poh perf, so should likely be considering changing that config too (if they're different)

The way I see it, there are two changes to make:

Make the PoH test more closely match what PohService does
Isolate the core for both the test and PohService

This PR accomplishes 1. as pinning (without isolation) is what PohService does:

agave/poh/src/poh_service.rs

Lines 130 to 135 in 64382a5

    
           // PoH service runs in a tight loop, generating hashes as fast as possible. 
        
           // Let's dedicate one of the CPU cores to this thread so that it can gain 
        
           // from cache performance. 
        
           if let Some(cores) = core_affinity::get_core_ids() { 
        
               core_affinity::set_for_current(cores[pinned_cpu_core]); 
        
           }

I think we want both changes, but I'm also currently thinking that we can make them in any order. I'm also looking at 2. so if someone has strong convictions than me on the order of operations, we can merge one before the other

t-nelson · 2025-07-22T19:41:51Z

seems like doing this one first might make borderline hw fail more frequently with the std up 2.5x. i'm not sure we should be taking a (n unexpected) step backward in anticipation of a future change clawing it all back

steviez · 2025-07-22T20:15:40Z

seems like doing this one first might make borderline hw fail more frequently with the std up 2.5x. i'm not sure we should be taking a (n unexpected) step backward in anticipation of a future change clawing it all back

Technically, isolating a core (but not yet using it for the test) could make the test flakier as there is now one less core available for all the competing threads.

So, maybe one PR is the move

alexpyattaev · 2025-07-22T20:58:06Z

Would it be wrong to let the PoHService start, then observe its hash rate for 60 seconds rather than making a dedicated speedtest?

github-actions · 2026-02-19T08:21:38Z

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions · 2026-02-26T08:22:03Z

This pull request was closed because it has been stale for 7 days with no activity.

steviez force-pushed the val_pin_poh_speed_check branch from 9d6e006 to 44ec9f9 Compare March 25, 2025 22:10

alexpyattaev approved these changes Apr 3, 2025

View reviewed changes

steviez mentioned this pull request Jul 21, 2025

validator: Split PoH speed measurement from check #4185

Closed

validator: Pin core in PoH speed test

3e2fbdc

Use a separate thread for the PoH speed test, and pin the thread to a specific core like PohService does

steviez force-pushed the val_pin_poh_speed_check branch from 44ec9f9 to 3e2fbdc Compare July 21, 2025 05:41

steviez requested a review from alexpyattaev July 21, 2025 06:03

steviez marked this pull request as draft July 21, 2025 06:05

alexpyattaev removed their request for review October 26, 2025 08:51

github-actions Bot added the stale label Feb 19, 2026

github-actions Bot closed this Feb 26, 2026

steviez deleted the val_pin_poh_speed_check branch February 26, 2026 16:12

Conversation

steviez commented Mar 25, 2025

Problem

Summary of Changes

Uh oh!

steviez commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

alexpyattaev commented Mar 26, 2025

Uh oh!

steviez commented Mar 26, 2025

Uh oh!

alexpyattaev commented Mar 27, 2025

Uh oh!

alexpyattaev left a comment

Choose a reason for hiding this comment

Uh oh!

steviez commented Jul 21, 2025

Uh oh!

steviez commented Jul 21, 2025

Uh oh!

t-nelson commented Jul 22, 2025

Uh oh!

steviez commented Jul 22, 2025

Uh oh!

t-nelson commented Jul 22, 2025

Uh oh!

steviez commented Jul 22, 2025

Uh oh!

alexpyattaev commented Jul 22, 2025

Uh oh!

github-actions Bot commented Feb 19, 2026

Uh oh!

github-actions Bot commented Feb 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

steviez commented Mar 25, 2025 •

edited

Loading

codecov-commenter commented Mar 25, 2025 •

edited

Loading