Runtime Upgrade (Big PoV) leading to collator peer reputation dropping (network stalled) #10359
Am I correct that the stall actually happens earlier than the disconnect due to reputation changes?
I'm having a bit of trouble fully interpreting the logs (but you can access all the validator/collator logs from the file). The stall started right when the parachain upgrade was applied: the block itself wasn't being included because of the timeout, which then led to the reputation issue.
The real issue is why fetching the collation takes so long. The timeout is 1 second and the size is 3 MB, so with a decent connection that transfer should take well under a second. 🤔
Yep, that makes sense. The disconnect is not the actual problem: even if it did not happen, the chain would still not make any progress, because the import would keep failing. Just dropping the reputation change is no good solution anyway, as it would make DoSing a parachain easier. Increasing the timeout will likely not help either; we only have two seconds in total, and if the initial upload takes longer than a second, it becomes very unlikely that the candidate is going to make it into a block. How good is the connection between validators and collators? That is, what bandwidth is available?
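To put a rough number on the bandwidth question, here is a back-of-the-envelope calculation (a sketch, assuming the ~3 MB PoV size and the 1-second fetch timeout mentioned in this thread) of the sustained throughput a validator-to-collator link would need:

```python
# Sustained bandwidth needed to move the PoV within the collation-fetch
# timeout. Sizes and timeout are taken from the numbers in this thread.
pov_bytes = 3_000_000   # ~3 MB PoV
timeout_s = 1.0         # collation fetch timeout

required_mbit_s = pov_bytes * 8 / timeout_s / 1_000_000
print(f"{required_mbit_s:.0f} Mbit/s")  # 24 Mbit/s, ignoring handshakes and slow start
```

24 Mbit/s sustained is easy for a well-connected data-center node, which is why a sub-second transfer seems like a reasonable expectation here, but the congestion-control discussion below the line complicates that picture.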
How big is the backing group? |
Interestingly, the time it takes to transfer a collation varies a lot between validators. Sending a collation to validator 3 takes 134 ms:
but sending the same collation to validator 2 takes almost 900 ms (the transfer to the next validator starts once the first transfer has finished):
Sometimes the logs don't really make sense; could it be that log lines are missing?
@eskimor I don't fully agree with your statement that 1 s is enough for 3 MB (at least not in a decentralized network). This brings me to the case where transferring 3 MB with an RTT of 200 ms over TCP, with an initial congestion window of 10 × MSS (64 kB), is going to take close to 1000 ms in some cases (when the connection is not pre-established or has gone idle). There are improvements to be made in that area (using QUIC instead of TCP, or changing the default configuration: paritytech/polkadot-sdk#908), but it is not realistic to expect 3 MB to always arrive in under 1 s (an RTT of 300 ms would go over).
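As a sanity check on the slow-start argument above, here is a toy model (my own sketch, not anything from the Polkadot codebase) of idealized TCP slow start, where the congestion window doubles every RTT with no loss and no bandwidth cap; the MSS value and initial window are assumptions based on common defaults:

```python
def slow_start_rtts(total_bytes: int, init_cwnd_bytes: int) -> int:
    """Round trips needed to deliver total_bytes under idealized TCP
    slow start: the window doubles each RTT, no loss, no bandwidth cap."""
    sent, cwnd, rtts = 0, init_cwnd_bytes, 0
    while sent < total_bytes:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts

MSS = 1460                  # typical Ethernet MSS in bytes (assumption)
init_cwnd = 10 * MSS        # RFC 6928 initial window of 10 segments
rtt_ms = 200                # RTT from the scenario in the comment above
pov_bytes = 3_000_000       # ~3 MB PoV

rtts = slow_start_rtts(pov_bytes, init_cwnd)
print(rtts, rtts * rtt_ms)  # 8 round trips -> 1600 ms spent ramping up
```

Under this simplified model a cold connection with a 200 ms RTT needs on the order of 1.5 s for 3 MB regardless of raw bandwidth, which supports the point that a 1 s timeout is tight in a geographically distributed network.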
There could be a few lines missing, but when I looked it was about 10 lines of log in total for the file. If you want a precise validator or collator log, I can retrieve the exact one.
I see. The problem is that without contextual execution (yes, we really need that ASAP) there is only so much we can do. We only have 2 seconds in total for the complete backing phase; if the initial PoV sending already takes more than a second, this is going to be tough. Things we can do about that very short term:
Longer term (our roadmap):
As the problem seems to be pressing right now, it would be great if you could try out those short-term solutions. Other than that, yes, precise logs might also help reveal things we can do about the issue right now.
Thank you @eskimor. All the short-term changes are to be applied on relay nodes, which is fine for our internal network, where this is not really pressing anyway, as those issues happen only in rare conditions (like a runtime upgrade/migration). However, they appear all the time on Moonriver, and I'm not sure what we can do in those conditions. All our collators have very good hardware (above the Polkadot requirements) and bandwidth, but we have zero control over the validators.
You mentioned that this happens in your highly centralized test setup as well. But there we really should not be seeing a 3 MB PoV take a second, assuming well-specced nodes. So something else must be going on; I think we really should be looking at the logs, so the exact ones would be very useful here. Other than that, it would also help pinpoint the issue if you could check whether increasing the timeout or dropping reputation changes helps in your setup. E.g. does a 10% increase in the timeout fix it, or does the issue remain even after doubling it? Those would be interesting data points, although the very first thing should definitely be a proper examination of the logs.
Ok, going forward with this issue, I think we should:
^^ Obsoleted by: paritytech/polkadot#4386
Conditions
Event
Logs
https://drive.google.com/file/d/1tNixS4n4_HHiaGz5bMURWM5-wGyGFmWb/view?usp=sharing
Observations
PoV for the block applying the runtime upgrade is 3000 kB:
Validators are reporting the collators:
Conclusion
When the PoV is too big and can't be fetched in time by the validators, they reduce the reputation of the collator and prevent it from sending any further blocks, stalling the network.