SWQOS: a better implementation #8880
Conversation
Force-pushed from 872c299 to bf5878b.
If all buckets are full, how much TPS (or bandwidth, or whatever) would that translate into? How many transactions during the next 10ms interval? I'm a little concerned that overfill represents past underutilization, but it is being granted to hungrier connections for future utilization; past underutilization has already been lost. In practice, I think we've IBRLed ingest enough that we can rely on the law of averages plus large numbers, oversubscribe a bit, and be fine. But theoretically there could be some burst-capacity problems.
The overflow from the last round gets reassigned to every active consumer in the current round, so at any given point in time there are at most 2x the target tokens in circulation. If we had zero usage for a while (all buckets full) and then TXs suddenly started arriving at full speed, we'd allow 2x the sustained TPS target for 10ms, then start throttling based on immediate load.
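To make the 2x bound above concrete, here is a minimal sketch of a round-based refill with overflow carryover. This is not the PR's actual code; `Bucket`, `refill_round`, and the even split of the carried bonus are illustrative assumptions.

```rust
// Illustrative sketch, not the PR implementation.
struct Bucket {
    tokens: u64,
    capacity: u64, // one round's fair share for this connection
}

// Refill every bucket once per 10ms round. Allocation that does not fit
// (the bucket was already full) is carried over and handed out as a bonus
// in the next round, capped at one round's total budget -- so at most 2x
// the per-round target number of tokens is ever outstanding.
fn refill_round(buckets: &mut [Bucket], carried: u64) -> u64 {
    let n = buckets.len().max(1) as u64;
    let bonus = carried / n;
    let mut overflow = 0u64;
    let mut round_total = 0u64;
    for b in buckets.iter_mut() {
        round_total += b.capacity;
        let want = b.capacity + bonus;
        let room = b.capacity.saturating_sub(b.tokens);
        // whatever does not fit in the bucket is carried to the next round
        overflow += want.saturating_sub(room);
        b.tokens = (b.tokens + want).min(b.capacity);
    }
    // cap the carryover at one round's budget: at most 2x target in flight
    overflow.min(round_total)
}
```

A full bucket contributes its entire allocation to the overflow, while an empty one absorbs its share; the cap on the returned overflow is what bounds the worst-case burst.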
It is one of the simplest ways to achieve O(N) complexity for the refill algorithm across N connection-table keys, which is desirable at 4000 connections. An exact solution generally requires O(N^2) compute, so it is out of the question. There are some O(N log N) solutions too, but they are more complex implementation-wise.
Yeah, I'm looking into some algorithms that do not require overflow but instead estimate true demand per connection and adjust the refill for the next round based on past demand. So if you did not use your tokens for a while, your refill rate would be zero (but your bucket stays full). Once you start using the bandwidth, your refill rate gets bumped to reflect your stake at the start of the next round.
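The demand-predicting idea can be sketched in a few lines. This is a hypothetical shape, not the PR's API: an idle connection keeps its full bucket but receives a zero refill rate, and a connection that consumed anything last round gets its stake-proportional share back.

```rust
// Hypothetical sketch of demand-based refill; names are illustrative.
// Returns the next round's refill for one connection.
fn next_refill(stake: u64, total_stake: u64, round_budget: u64, used_last_round: u64) -> u64 {
    if used_last_round == 0 {
        // idle last round: bucket stays full, but no fresh allocation
        // is reserved for this connection
        0
    } else {
        // active: refill proportional to stake (u128 to avoid overflow)
        (round_budget as u128 * stake as u128 / total_stake.max(1) as u128) as u64
    }
}
```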
Force-pushed from 19bab09 to da553c1.
Force-pushed from da553c1 to ee2c4f0.
Codecov Report
❌ Patch coverage is

```
@@            Coverage Diff             @@
##           master    #8880      +/-   ##
==========================================
+ Coverage    69.4%    82.7%    +13.3%
==========================================
  Files         847      855        +8
  Lines      256453   321259    +64806
==========================================
+ Hits       178012   265786    +87774
+ Misses      78441    55473    -22968
```
Force-pushed from 7b21256 to 8317646.
@bw-solana no more overflow reallocation; the algorithm now predicts future demand based on past use and allocates correctly in one go. This actually ends up being far more flexible too.
Force-pushed from 8317646 to e8d1dd4.
Force-pushed from e8d1dd4 to 696acb6.
```rust
impl QosController<SwQosConnectionContext> for SwQos {
    async fn async_init(&mut self) {
        tokio::spawn(refill_task(
```
This API is confusing to me. async_init() sounds like it’s doing some optional setup that’s only needed when the object is used in an async environment, but here it actually starts refill_task(), which looks like a core part of the functionality. Since this is the only place where refill_task() is called, it seems like the object can’t really be used in a sync environment anyway (or refill_task() would need to be called explicitly?). If that’s the case, why not move this into new() or new_async() (this would also remove mut in mut qos : T in run_server())?
Also, does async_init() take &mut self with the intention that the task handle will be stored at some point? As it is now, the task is detached, which makes things even less clear.
Also, won't this task become a bottleneck (by consistently missing its 10ms deadline) in the presence of 1000s of active connection tasks? And so most of these tasks would go to sleep before their stakes are recalculated...
By the way, if we do keep these sleeps for now, it looks like the tasks would only need to sleep until the start of the next 10ms window and wake up once their tokens are recalculated.
What if, instead of sleeping, they at least simply yield immediately (as a first step toward getting rid of the sleep)?
async_init is marked async to prevent users from erroneously calling it outside of a tokio context (which would result in a runtime panic); spawning a tokio task can only be done from inside the runtime. new_async is also an option, but it would break API similarity with SimpleQos.
The refill task needs to loop over at most 6000 connections. Each connection requires a few microseconds (the math is trivial, and it only acquires 2 locks). We could refill less often at the cost of a slower response to load changes; this needs to be tested.
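For intuition on why the per-connection cost is so low, here is a minimal model of one O(N) refill pass. The shape is assumed; the real task also walks the connection tables under their locks every 10ms.

```rust
// Illustrative model of a single refill pass over all connections.
// `tokens` stands in for the per-connection buckets.
fn refill_pass(tokens: &mut [u64], per_round: u64, cap: u64) {
    for t in tokens.iter_mut() {
        // trivial per-connection math: add this round's allocation, clamp
        *t = (*t + per_round).min(cap);
    }
}
```

A few arithmetic operations per entry is why even 20K connections fit comfortably into a 10ms budget.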
Added benchmarks in 9ddfd1b. One refill pass over 20K connections takes ~300 microseconds, so we can totally afford it even with the lock contention on the connection tables.
Force-pushed from c8d6204 to a6101ad.
Force-pushed from 47a8ecf to b313448.
Force-pushed from 9ddfd1b to 66be57a.
Force-pushed from fbf3c1f to daf349f.
@anza-xyz/networking
I still lean toward the reactive approach 🤔. A static TPS cap will likely always be wrong in some direction: too low and we leave capacity on the table, too high and we're not protecting anything. The actual capacity likely varies with the transaction mix and internal state, and hopefully improves with the pipeline optimisations over time. With reactive throttling, the pipeline's own processing rate becomes the clock, and there's zero overhead when the system isn't under pressure (which is most of the time). For back-pressure, the channel between the TPU and the processing pipeline has a bounded capacity (Kirill's changes); we could monitor its fill level. I understand the concern that without any TPU limit a burst could in principle overwhelm the pipeline before back-pressure propagates, and transactions would need to be discarded. But this is arguably an extreme case, and the same problem occurs with proactive throttling if the TPS target is set too high for a given load profile. In any case, I think it'd be best to put all the design options on the table, involve the relevant stakeholders, make a decision, and then execute (and stop arguing about it).
As long as a remote attacker can cause transactions that have been admitted via the TPU to get dropped at a later stage, legit users will be forced to send multiple copies to compensate. This was the case when UDP was used, and it was not pretty. Also, in the case of reactive throttling, we still need swqos to figure out whom to throttle once congestion starts. Even in a fully reactive approach you need accounting of who used how much and a mechanism to enforce throttling.
Yes, but the fundamental difference is that the artificial TPS cap would not be enforced by the TPU in this case.
Well, let us try to put it together and see how it performs.
I agree with @stablebits that we shouldn't throttle until necessary. We already had this debate in his PR here: #9580, and I don't think it is productive to keep going back and forth. @alessandrod has already helped us settle this debate and asked us to implement "(2) Reactive Throttling". Setting the TPS per client-stake is way too finicky and hard to get right. Accept, and throttle only under backpressure. And yes, we can use swqos when we actually want to start throttling.
Throttle whom, and by how much, "when backpressure"? We'd still need logic to figure out whom to throttle once backpressure kicks in. Which logic do we use for that?
Problems
- SWQOS introduces a separation between staked and unstaked connections with artificial stake thresholds.
- SWQOS uses EMA averaging whose parameters are not well documented and do not line up well with real loads (see also "Increase the duration of the EMA smoothing window (STREAM_LOAD_EMA_INTERVAL_COUNT)" #10033, which fixed some of that).
- We want TPS budget unused by connection X to be usable by other connections (including unstaked ones).
- We want strong guarantees that we will not throttle anyone unless overall load is high enough.
- We want no limit on the number of connections that can be open and idle (currently we "preallocate" bandwidth to unstaked connections).
- We want to lay the foundations for the streamer to directly control the connection's MAX_STREAM parameter instead of throttling reads as it does now.
Summary of Changes
Fixes #8863
Design Description & Overview
This way, any unused bandwidth gets redistributed to other peers in a stake-proportional manner. The code is about 1/5 the size of the original SWQOS + EMA.
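The stake-proportional redistribution can be sketched as follows. This is an illustrative model with assumed names, not the PR's code: the round budget is split among currently active peers by stake, so bandwidth left on the table by idle peers automatically flows to everyone else; an all-unstaked group is modelled as an even split.

```rust
// Illustrative sketch of stake-proportional allocation of a round budget
// among the currently active peers; names are assumptions.
fn allocate(round_budget: u64, active_stakes: &[u64]) -> Vec<u64> {
    let total: u128 = active_stakes.iter().map(|&s| s as u128).sum();
    if total == 0 {
        // only unstaked peers are active: split the budget evenly
        let n = active_stakes.len().max(1) as u64;
        return vec![round_budget / n; active_stakes.len()];
    }
    active_stakes
        .iter()
        .map(|&s| (round_budget as u128 * s as u128 / total) as u64)
        .collect()
}
```

Because only active peers appear in the input, an idle peer's share is not reserved for it; it is simply absorbed into everyone else's proportional cut for that round.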
Open questions:
Test results
5 staked IDs, each with 4 connections, 5ms latency
Targeting 500K TPS with new code:
stake proportionality:
Timelapse of transfers:
Legacy streamer (targeting 500K TPS):
Critically:
So clearly the legacy streamer could do better at reaching the target TPS.
