v3.1: Increase the duration of the EMA smoothing window (STREAM_LOAD_EMA_INTERVAL_COUNT) (backport of #10033) #10089
Closed
mergify[bot] wants to merge 1 commit into v3.1
…TERVAL_COUNT) (#10033)

streamer/TPU: increase STREAM_LOAD_EMA_INTERVAL_COUNT from 10 to 40

This constant controls the duration of the EMA smoothing window used to reduce sensitivity to short-lived load spikes at the start of a leader slot. Throttling is only triggered when saturation is sustained.

The value 40 was chosen based on simulations: at a max target TPS of ~400K, it allows the system to absorb a burst of ~50K transactions over ~40 ms before throttling activates.

There is no magic about N=40; the value should be tuned based on the size and duration of spikes we want to tolerate.

(cherry picked from commit 51ebbc4)

# Conflicts:
#	streamer/src/nonblocking/stream_throttle.rs
Cherry-pick of 51ebbc4 has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
@gregcusack @alexpyattaev It doesn't look like my first PR was merged. So this one can't be merged yet... This one should be backported first:
ya you're good! nothing to do on this one right now. as you mentioned, the other needs to be merged first
Note
This change is a follow-up to #9580.
The STREAM_LOAD_EMA_INTERVAL_COUNT constant controls the duration of the EMA smoothing window used to reduce sensitivity to short-lived load spikes at the start of a leader slot. With #9580 in place, throttling is only triggered when saturation is sustained (reaching 95% of max target).
Problem
With the current value of 10, the smoothing window is too short (see the simulation results below).
Summary of Changes
The value 40 was chosen based on simulations: at a max target TPS of ~400K, it allows the system to absorb a burst of ~50K transactions over ~40 ms before throttling activates.
There is no magic about N=40; the value should be tuned based on the size and duration of spikes we want to tolerate.
This choice was made based on simulations: the alpha in the EMA (new_ema = alpha * latest + (1 - alpha) * ema) is basically 2/(N+1), where N is STREAM_LOAD_EMA_INTERVAL_COUNT.
The larger N is, the slower the EMA grows (i.e., the larger a burst it can absorb). With N=10 (current code), alpha ≈ 0.18. For example, here's the EMA growth under sustained load of 1K / 5ms.
N=10 (alpha ≈ 0.18)
N=40 (alpha ≈ 0.049)
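To make the effect of N concrete, here is a minimal sketch of the EMA update quoted above. It is not the code in stream_throttle.rs: the function names (alpha, ema_step, intervals_to_reach) are hypothetical, it uses f64 where the real implementation may use integer arithmetic, and it assumes the EMA starts at zero and is updated once per interval under a constant load.

```rust
/// Smoothing factor for an N-interval EMA: alpha = 2 / (N + 1).
fn alpha(n: u32) -> f64 {
    2.0 / (n as f64 + 1.0)
}

/// One EMA update: new_ema = alpha * latest + (1 - alpha) * ema.
fn ema_step(ema: f64, latest: f64, alpha: f64) -> f64 {
    alpha * latest + (1.0 - alpha) * ema
}

/// Number of consecutive intervals at a constant load that an EMA starting
/// at zero needs before it reaches `fraction` of that load.
/// The load is normalized to 1.0, so the result does not depend on its size.
fn intervals_to_reach(n: u32, fraction: f64) -> u32 {
    let a = alpha(n);
    let mut ema = 0.0;
    let mut steps = 0;
    while ema < fraction {
        ema = ema_step(ema, 1.0, a);
        steps += 1;
    }
    steps
}

fn main() {
    for n in [10u32, 40] {
        println!(
            "N={}: alpha={:.3}, intervals to reach 95% = {}",
            n,
            alpha(n),
            intervals_to_reach(n, 0.95)
        );
    }
}
```

Under these assumptions the EMA needs 15 intervals to reach 95% of a sustained load with N=10 and 60 intervals with N=40, which is the slower growth described above.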
Below is simulated ingestion of ~60K transactions over 100ms with a spike at the beginning -- roughly corresponding to a pattern we recently saw on mds1 (mainnet), but at about 10x more traffic.
Note: throttling is activated at 95% of the target load (500K TPS) and deactivated at 90%. The quota of 40K basically means unthrottled.
N=10
N=40
With N=40, we can absorb ~50K transactions (with a spike) over ~40ms before throttling gets activated.
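The activation and deactivation thresholds mentioned in the note (on at 95% of the target load, off at 90%) form a hysteresis band. A rough sketch of that behavior follows. ThrottleGate and its methods are hypothetical names for illustration and are not taken from the streamer crate, and the choice of inclusive comparisons at the boundaries is an assumption.

```rust
/// Hypothetical hysteresis gate driven by the smoothed (EMA) load.
/// Assumption: throttling turns on at >= 95% of target and off at <= 90%.
struct ThrottleGate {
    target: f64,
    throttled: bool,
}

impl ThrottleGate {
    fn new(target: f64) -> Self {
        Self { target, throttled: false }
    }

    /// Feed the latest EMA load value; returns whether throttling is active.
    fn update(&mut self, ema_load: f64) -> bool {
        if !self.throttled && ema_load >= 0.95 * self.target {
            self.throttled = true;
        } else if self.throttled && ema_load <= 0.90 * self.target {
            self.throttled = false;
        }
        self.throttled
    }
}

fn main() {
    // Target of 500K TPS, as in the simulation note.
    let mut gate = ThrottleGate::new(500_000.0);
    for load in [400_000.0, 480_000.0, 460_000.0, 440_000.0] {
        println!("ema_load={} throttled={}", load, gate.update(load));
    }
}
```

Between the two thresholds the gate keeps its previous state, so a load that hovers around 95% does not flip throttling on and off on every update.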
This is an automatic backport of pull request #10033 done by [Mergify](https://mergify.com).