validator: Pin core in PoH speed test #5485
9d6e006 to 44ec9f9
|
Tested this out. Some startups from a recent tip of master (mean = 14,442,587.5, stddev = 60,795.9), and then from this branch (mean = 14,349,034, stddev = 163,245.36). These were run on the same machine a day apart. This data suggests that the change actually degraded things. I will need to poke around more, but obviously will not be pushing the PR as-is if things degrade.
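To put the two sets of figures side by side, here is a minimal sketch that computes the relative change in the mean and the ratio of the standard deviations from the numbers quoted above. The `Summary` type and both helper functions are hypothetical names introduced only for this illustration, and the sample counts behind each mean are not given in the thread, so this says nothing about statistical significance.

```rust
/// Summary statistics for one set of startup measurements (hashes per second).
#[derive(Clone, Copy)]
struct Summary {
    mean: f64,
    stddev: f64,
}

/// Relative change of the `new` mean versus the `base` mean, as a fraction.
/// A negative value means `new` is slower.
fn relative_mean_change(base: Summary, new: Summary) -> f64 {
    (new.mean - base.mean) / base.mean
}

/// How many times larger the `new` standard deviation is than the `base` one.
fn stddev_ratio(base: Summary, new: Summary) -> f64 {
    new.stddev / base.stddev
}

fn main() {
    // Figures copied from the comment above.
    let master = Summary { mean: 14_442_587.5, stddev: 60_795.9 };
    let branch = Summary { mean: 14_349_034.0, stddev: 163_245.36 };

    println!(
        "mean change: {:.2}%",
        100.0 * relative_mean_change(master, branch)
    );
    println!("stddev ratio: {:.2}x", stddev_ratio(master, branch));
}
```

With these inputs the mean drops by roughly 0.65% while the standard deviation grows by roughly 2.7x, so the spread changes far more than the mean does.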
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5485 +/- ##
=======================================
Coverage 83.2% 83.2%
=======================================
Files 853 853
Lines 374743 374763 +20
=======================================
+ Hits 311857 311874 +17
- Misses 62886 62889 +3
|
|
This makes sense: pinning a 100% CPU-bound load to a core will make that core very hot. Normally, the CPU would prefer this load to be moved to a colder core so the overheated one can cool off. Pinning makes that migration impossible, so the CPU will be forced to throttle.
That's fair. For what it is worth, I think there is merit in pinning the PoH speed check thread to match PohService behavior, even if that means the speed check has lower raw performance: Lines 127 to 132 in 381a5d5 There is no value in spitting out a faster number at startup that can't be realized at steady state. As for my test results, the degradation could also be a symptom of improper system tuning and/or sampling (i.e. too few hashes). I might experiment a little more and/or send this over to a few validators who I think might be interested in playing around with this.
|
If you want to accurately measure PoH rate, you would want to measure across e.g. 10 batches of 10ms rather than 1 batch of 100ms. Discard highest and lowest batch mean, average the rest. |
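The batching suggestion above can be sketched as follows. This is a minimal illustration under stated assumptions, not the validator's actual speed test: `measure_batches` and `trimmed_mean` are hypothetical helpers, and the workload in `main` is a stand-in arithmetic loop rather than real SHA-256 hashing.

```rust
use std::time::{Duration, Instant};

/// Mean of `rates` after discarding the single highest and single lowest value.
/// Returns `None` when fewer than three samples are available.
fn trimmed_mean(rates: &[f64]) -> Option<f64> {
    if rates.len() < 3 {
        return None;
    }
    let sum: f64 = rates.iter().sum();
    let max = rates.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let min = rates.iter().cloned().fold(f64::INFINITY, f64::min);
    Some((sum - max - min) / (rates.len() - 2) as f64)
}

/// Calls `work` (assumed to perform `hashes_per_call` hashes) repeatedly for
/// about `batch_len`, `batches` times, and returns one rate per batch.
fn measure_batches<F: FnMut()>(
    mut work: F,
    hashes_per_call: u64,
    batches: usize,
    batch_len: Duration,
) -> Vec<f64> {
    (0..batches)
        .map(|_| {
            let start = Instant::now();
            let mut hashes = 0u64;
            while start.elapsed() < batch_len {
                work();
                hashes += hashes_per_call;
            }
            hashes as f64 / start.elapsed().as_secs_f64()
        })
        .collect()
}

fn main() {
    // Stand-in workload (a linear congruential step), not real hashing.
    let mut state = 1u64;
    let work = || {
        for _ in 0..1_000 {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
        }
        std::hint::black_box(state);
    };

    // 10 batches of 10 ms each, as proposed in the comment above.
    let rates = measure_batches(work, 1_000, 10, Duration::from_millis(10));
    if let Some(rate) = trimmed_mean(&rates) {
        println!("trimmed mean: {rate:.0} ops/s over {} batches", rates.len());
    }
}
```

Total measurement time stays near 100 ms, the same budget as a single 100 ms batch, while one outlier in either direction no longer skews the result.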
Use a separate thread for the PoH speed test, and pin the thread to a specific core like PohService does
44ec9f9 to 3e2fbdc
Yeah, this logic could potentially be revisited. I see that you adjusted your duration accordingly for multiple trials, but we definitely want to keep the total measurement duration low to keep this from slowing node restart in any meaningful way. If we want to tie it into what actually runs, we could do That being said, I might argue we could break that into a separate PR. For the sake of a public paper trail, another item that has come up multiple times is isolating the PoH core so other threads are not scheduled on it.
|
I cleared your review since this one had been sitting so long and rebasing did not clear your ship-it. I might experiment a bit more just to see, so no need to review again immediately.
|
i dunno how much we can expect an improvement from thread pinning without core isolation. being migrated away from an otherwise busy core is likely preferable. need to be sure that that core is never busy. that said, if we lean into the test too hard, we're going to diverge from normal runtime poh perf, so should likely be considering changing that config too (if they're different)
The way I see it, there are two changes to make:
This PR accomplishes 1. as pinning (without isolation) is what PohService does: Lines 130 to 135 in 64382a5 I think we want both changes, but I'm also currently thinking that we can make them in any order. I'm also looking at 2., so if someone has stronger convictions than me on the order of operations, we can merge one before the other.
|
seems like doing this one first might make borderline hw fail more frequently with the std up 2.5x. i'm not sure we should be taking a (n unexpected) step backward in anticipation of a future change clawing it all back |
Technically, isolating a core (but not yet using it for the test) could make the test flakier as there is now one less core available for all the competing threads. So, maybe one PR is the move |
|
Would it be wrong to let the PoHService start, then observe its hash rate for 60 seconds rather than making a dedicated speedtest? |
|
This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
|
This pull request was closed because it has been stale for 7 days with no activity. |
Problem
Checking the PoH speed requires a Bank to derive the cluster hash rate as of #2447. By the time a Bank is available, many services have started up and are competing for CPU time.
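As a rough illustration of why a Bank is needed, the target rate can be derived from per-tick and per-slot parameters. The sketch below assumes the target is `hashes_per_tick * ticks_per_slot` divided by the slot duration. The function names are hypothetical and the numeric values in `main` are illustrative assumptions, not values taken from this PR.

```rust
use std::time::Duration;

/// Target cluster hash rate in hashes per second, assuming it is the number
/// of hashes in one slot divided by the slot duration.
fn target_hashes_per_second(
    hashes_per_tick: u64,
    ticks_per_slot: u64,
    slot_duration: Duration,
) -> f64 {
    (hashes_per_tick * ticks_per_slot) as f64 / slot_duration.as_secs_f64()
}

/// True when the locally measured rate meets or exceeds the target.
fn is_fast_enough(measured_hashes_per_second: f64, target: f64) -> bool {
    measured_hashes_per_second >= target
}

fn main() {
    // Illustrative parameters only (assumptions for this sketch).
    let target = target_hashes_per_second(62_500, 64, Duration::from_millis(400));
    // Mean measured on master, as reported in the conversation.
    let measured = 14_442_587.5;
    println!("target: {target:.0} hashes/s");
    println!("fast enough: {}", is_fast_enough(measured, target));
}
```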
Summary of Changes
Use a separate thread for the PoH speed test, and pin the thread to a specific core like PohService does
Offered as an alternative to #4185. I personally like this solution more as it:
PohService actually does at steady state
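The approach in the summary, a dedicated thread that is pinned before it measures, can be sketched with the standard library alone. This is a hedged illustration rather than the PR's code: `run_pinned` and the thread name are hypothetical, and the actual pinning is left to a caller-supplied `pin` hook because `std` has no affinity API. A real build would need an OS call or an affinity crate there.

```rust
use std::thread;

/// Spawns a dedicated, named thread, calls `pin` on it first, then runs
/// `measure` and returns its result. `pin` returns false if pinning failed.
fn run_pinned<P, M, T>(pin: P, measure: M) -> thread::Result<T>
where
    P: FnOnce() -> bool + Send + 'static,
    M: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .name("pohSpeedTest".to_string()) // hypothetical thread name
        .spawn(move || {
            // Pin before measuring so the whole measurement runs on one core.
            if !pin() {
                eprintln!("warning: could not pin the speed test thread");
            }
            measure()
        })
        .expect("failed to spawn speed test thread");
    handle.join()
}

fn main() {
    // Placeholder hook: always reports success without pinning anything.
    let pin = || true;
    // Stand-in workload instead of real hashing.
    let measure = || {
        let mut acc = 0u64;
        for i in 0..1_000_000u64 {
            acc = acc.wrapping_add(i);
        }
        acc
    };
    let result = run_pinned(pin, measure).expect("speed test thread panicked");
    println!("workload result: {result}");
}
```

Keeping the pinning behind a hook means the same hook could be shared with the long-running service, so the test and steady state use the same affinity policy.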