Runtime Upgrade (Big PoV) leading to collator peer reputation dropping (network stalled) #10359
Am I correct that the stall actually happens earlier than the disconnect due to reputation changes?
I'm having a bit of trouble fully interpreting the logs (but you can access all the validator/collator logs from the file). The stall started right when the parachain upgrade was applied: the block itself wasn't being included because of the timeout, which then led to the reputation issue.
The real issue is why fetching the collation takes so long. The timeout is 1 second and the size is 3 MB, so with a decent connection that transfer should take well under a second. 🤔
Yep, that makes sense. The disconnect is not the actual problem: even if it did not happen, the chain would still not make any progress, because the import would keep failing. Just dropping the reputation change is no good solution anyway, as it would make DoSing a parachain easier. Increasing the timeout will likely not help either; we only have two seconds in total, and if the initial upload takes longer than a second, it becomes very unlikely that the candidate is going to make it into a block. How good is the connection between validators and collators? That is, what bandwidth is available?
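To put a rough number on the bandwidth question, here is a back-of-the-envelope calculation (a sketch, assuming the ~3 MB PoV size and the 1-second fetch timeout mentioned in this thread) of the sustained throughput a validator-to-collator link would need:

```python
# Sustained bandwidth needed to move the PoV within the collation-fetch
# timeout. Sizes and timeout are taken from the numbers in this thread.
pov_bytes = 3_000_000   # ~3 MB PoV
timeout_s = 1.0         # collation fetch timeout

required_mbit_s = pov_bytes * 8 / timeout_s / 1_000_000
print(f"{required_mbit_s:.0f} Mbit/s")  # 24 Mbit/s, ignoring handshakes and slow start
```

24 Mbit/s sustained is easy for a well-connected data-center node, which is why a sub-second transfer seems like a reasonable expectation here, but the congestion-control discussion below the line complicates that picture.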
How big is the backing group? |
Interestingly, the time it takes to transfer a collation varies a lot between validators. Sending a collation to validator 3 takes 134 ms:
but sending the same collation to validator 2 takes almost 900 ms (the transfer to the next validator starts once the first transfer has finished):
Sometimes the logs don't really make sense; could it be that log lines are missing?
@eskimor I don't fully agree with your statement that 1 s is enough for 3 MB (at least not in a decentralized network). This brings me to the case where transferring 3 MB with an RTT of 200 ms over TCP, with an initial congestion window of 10 × MSS (64 kB), is going to take close to 1000 ms in some cases (when the connection is not pre-established or has gone idle). There are improvements to be made in that area (using QUIC instead of TCP, or changing the default configuration: paritytech/polkadot-sdk#908), but it is not realistic to expect 3 MB to always arrive in under 1 s (an RTT of 300 ms would go over).
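As a sanity check on the slow-start argument above, here is a toy model (my own sketch, not anything from the Polkadot codebase) of idealized TCP slow start, where the congestion window doubles every RTT with no loss and no bandwidth cap; the MSS value and initial window are assumptions based on common defaults:

```python
def slow_start_rtts(total_bytes: int, init_cwnd_bytes: int) -> int:
    """Round trips needed to deliver total_bytes under idealized TCP
    slow start: the window doubles each RTT, no loss, no bandwidth cap."""
    sent, cwnd, rtts = 0, init_cwnd_bytes, 0
    while sent < total_bytes:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts

MSS = 1460                  # typical Ethernet MSS in bytes (assumption)
init_cwnd = 10 * MSS        # RFC 6928 initial window of 10 segments
rtt_ms = 200                # RTT from the scenario in the comment above
pov_bytes = 3_000_000       # ~3 MB PoV

rtts = slow_start_rtts(pov_bytes, init_cwnd)
print(rtts, rtts * rtt_ms)  # 8 round trips -> 1600 ms spent ramping up
```

Under this simplified model a cold connection with a 200 ms RTT needs on the order of 1.5 s for 3 MB regardless of raw bandwidth, which supports the point that a 1 s timeout is tight in a geographically distributed network.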
There could be a few lines missing, but when I looked it was about 10 lines of log in total for the file. If you want a precise validator or collator log, I can retrieve the exact one.
I see. The problem is that without contextual execution (yes, we really need that ASAP) there is only so much we can do. We only have 2 seconds in total for the complete backing phase; if the initial PoV sending already takes more than a second, this is going to be tough. Things we can do about that very short term:
Longer term (our roadmap):
As the problem seems to be pressing right now, it would be great if you could try out those short-term solutions. Other than that, yes, precise logs might also help reveal things we can do about the issue right now.
Thank you @eskimor. All the short-term changes are to be applied on relay nodes, which is fine for our internal network, where this is not really pressing anyway, as those issues happen only in rare conditions (like a runtime upgrade/migration). However, they appear all the time on Moonriver, and I'm not sure what we can do in those conditions. All our collators have very good hardware (above the Polkadot requirements) and bandwidth, but we have zero control over the validators.
You mentioned that this happens in your highly centralized test setup as well. But there we really should not be seeing a 3 MB PoV take a second, assuming well-specced nodes. So something else must be going on; I think we really should be looking at the logs, so the exact ones would be very useful here. Other than that, it would also help pinpoint the issue if you could check whether increasing the timeout or dropping reputation changes helps in your setup. E.g. does a 10% increase in the timeout fix it, or does the issue remain even after doubling it? Those would be interesting data points, although the very first thing should definitely be a proper examination of the logs.
Ok, going forward with this issue, I think we should:
^^ Obsoleted by: paritytech/polkadot#4386
Conditions
Event
Logs
https://drive.google.com/file/d/1tNixS4n4_HHiaGz5bMURWM5-wGyGFmWb/view?usp=sharing
Observations
PoV for the block applying the runtime upgrade is 3000 kB:
Validators are reporting the collators:
Conclusion
When the PoV is too big and can't be fetched in time by the validators, they reduce the reputation of the collator and prevent it from sending any further blocks, stalling the network.