Make approval-distribution aggression a bit more robust and less spammy #6696

Merged
merged 14 commits into master from alexggh/aggression_potential_fixes
Dec 11, 2024

Conversation

alexggh
Contributor

@alexggh alexggh commented Nov 28, 2024

After finality started lagging on kusama around 2025-11-25 15:55:40, nodes started being overloaded with messages and some restarted with:

Subsystem approval-distribution-subsystem appears unresponsive when sending a message of type polkadot_node_subsystem_types::messages::ApprovalDistributionMessage. origin=polkadot_service::relay_chain_selection::SelectRelayChainInner<sc_client_db::Backend<sp_runtime::generic::block::Block<sp_runtime::generic::header::Header<u32, sp_runtime::traits::BlakeTwo256>, sp_runtime::OpaqueExtrinsic>>, polkadot_overseer::Handle>

I think this happened because our aggression in the current form is way too spammy and creates problems in situations where we already constructed blocks with a load of candidates to check, which is what happened around #25933682 before and after. However, aggression does help in the nightmare scenario where the network is segmented and sparsely connected, so I tend to think we shouldn't completely remove it.

The current configuration is:

l1_threshold: Some(16),
l2_threshold: Some(28),
resend_unfinalized_period: Some(8),

The way aggression works right now:

  1. After L1 is triggered, all nodes send all the messages they created to all the other nodes, on top of the messages they had already sent according to the topology.
  2. Because of resend_unfinalized_period, all the messages from step 1) are re-sent for each block every 8 blocks. For example, say we have blocks 1 to 24 unfinalized: at block 25 all messages for blocks 1 and 9 will be resent, and consequently at block 26 all messages for blocks 2 and 10 will be resent. This becomes worse as more blocks are created if backing backpressure has not kicked in yet. In total this logic makes each node receive 3 * total_number_of_messages_per_block.
  3. L2 aggression is way too spammy: when L2 aggression is enabled, all nodes send all messages of a block on GridXY, which means every message is received and sent by each node at least 2*sqrt(num_validators) times. On kusama that would be 66 * NUM_MESSAGES_AT_FIRST_UNFINALIZED_BLOCK, so even with a reasonable number of messages like 10K, which you can have if you escalated because of no-shows, you end up sending and receiving ~660k messages at once; I think that's what makes approval-distribution appear unresponsive on some nodes (see the back-of-the-envelope sketch after this list).
  4. Duplicate messages are received by the nodes, which in turn mark the sending node as banned, which may create more no-shows.
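
A back-of-the-envelope sketch of the L2 volume from point 3; the validator-set size and message count below are illustrative assumptions, not values taken from the code base:

```rust
// Back-of-the-envelope estimate of the L2 fan-out from point 3.
// Both inputs are illustrative assumptions, not values from the code base.
fn main() {
    let num_validators: f64 = 1_100.0; // roughly a kusama-sized validator set
    let messages_at_first_unfinalized_block: f64 = 10_000.0; // e.g. after no-show escalation

    // On GridXY a message travels along both a row and a column, so each node
    // sends/receives it roughly 2 * sqrt(num_validators) times.
    let grid_xy_fanout = 2.0 * num_validators.sqrt(); // ~66 on kusama

    let per_node = grid_xy_fanout * messages_at_first_unfinalized_block;
    println!("~{:.0} messages sent/received per node at once", per_node); // ~660k
}
```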

Proposed improvements:

  1. Make L2 trigger way later, at 64 blocks instead of 28; this should literally be the last resort, and until then we should let the approval-voting escalation mechanism do its thing and cover the no-shows.
  2. On L1 aggression, don't send messages for blocks too far from the first_unfinalized one: there is no point in sending the messages for block 20 if block 1 is still unfinalized.
  3. On L1 aggression, send messages, then back off for 3 * resend_unfinalized_period to give everyone time to clear up their queues (a minimal sketch of this gating follows the list).
  4. If aggression is enabled, accept duplicate messages from validators and don't punish them by reducing their reputation, which may create no-shows.
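
A minimal sketch of how the gating in points 2 and 3 could look; the constant values, the MAX_AGGRESSION_DISTANCE name and the should_resend helper are hypothetical, not the actual approval-distribution code:

```rust
// Illustrative constants; the real values live in the approval-distribution config.
const AGGRESSION_L1_THRESHOLD: u32 = 16;
const RESEND_UNFINALIZED_PERIOD: u32 = 8;
// Hypothetical: only blocks this close to the first unfinalized block keep being resent.
const MAX_AGGRESSION_DISTANCE: u32 = 8;

/// Decide whether L1 aggression should resend the messages of `block_number`,
/// given the first unfinalized block and the current chain tip.
fn should_resend(block_number: u32, first_unfinalized: u32, tip: u32) -> bool {
    let lag = tip.saturating_sub(first_unfinalized);
    // L1 aggression has not kicked in yet.
    if lag < AGGRESSION_L1_THRESHOLD {
        return false;
    }
    // Point 2: skip blocks far away from the first unfinalized block.
    if block_number.saturating_sub(first_unfinalized) > MAX_AGGRESSION_DISTANCE {
        return false;
    }
    // Point 3: back off, only resending every 3 * resend_unfinalized_period blocks.
    tip.saturating_sub(block_number) % (3 * RESEND_UNFINALIZED_PERIOD) == 0
}

fn main() {
    // Block 1 unfinalized with the tip at 40: aggression is active, but the
    // back-off window is closed, since (40 - 1) % 24 != 0.
    assert!(!should_resend(1, 1, 40));
    // At tip 49 the window opens again: (49 - 1) % 24 == 0.
    assert!(should_resend(1, 1, 49));
}
```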

After finality started lagging on kusama around 2025-11-25 15:55:40
validators started occasionally seeing this log when importing votes
covering more than one assignment.
```
Possible bug: Vote import failed
```

That happens because the assumption that assignments from the same
validator would have the same required routing doesn't hold after you
enable aggression: you might receive the first assignment, then modify
the routing for it in `enable_aggression`, then receive the second
assignment and the vote covering both assignments, so the routing for
the first and second assignment wouldn't match and we would fail to
import the vote.
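
A minimal illustration of the broken assumption, using hypothetical stand-in types rather than the real approval-distribution ones:

```rust
// Hypothetical stand-ins for the routing state kept per assignment; the real
// types in approval-distribution differ, this only illustrates the mismatch.
#[derive(PartialEq, Clone, Copy, Debug)]
enum RequiredRouting {
    GridX,  // routing assigned by the plain topology
    GridXY, // routing after enable_aggression bumped it
}

struct AssignmentState {
    required_routing: RequiredRouting,
}

/// A vote covering several assignments is only importable if all of them
/// share the same required routing; aggression breaks that assumption.
fn can_import_vote(covered: &[AssignmentState]) -> bool {
    covered
        .windows(2)
        .all(|pair| pair[0].required_routing == pair[1].required_routing)
}

fn main() {
    // First assignment imported, then enable_aggression bumps its routing...
    let first = AssignmentState { required_routing: RequiredRouting::GridXY };
    // ...the second assignment arrives afterwards with the plain routing.
    let second = AssignmentState { required_routing: RequiredRouting::GridX };
    // The vote covering both fails the equality assumption -> "Vote import failed".
    assert!(!can_import_vote(&[first, second]));
}
```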

From the logs I've seen, I don't think this is the reason the network
didn't fully recover until the failsafe kicked in, because the votes
had already been imported in approval-voting before this error.

Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh alexggh marked this pull request as draft November 28, 2024 15:03
@burdges

burdges commented Nov 28, 2024

Is this only approval votes? Or also approval assignment announcements? It's maybe both, but you do not notice the announcements much since we merged tranche zero?


I've worried that our topology sucks for ages now: We've two disjoint topologies which do not benefit from each other, a 2d grid and a random graph. I've no idea how one justifies two disjoint topologies with no synergy between them. At 1000 nodes the grid needs 31.6-1 ≈ 30 messages per layer for two layers. And the random graph gets pretty dense too.

Instead I'd propose roughly some unified slimmer topology, like..

A 3d grid, so 10-1=9 messages per layer for three layers, but which unifies with log2(1000) ≈ 10 extra random hops, or maybe it's ln(1000) ≈ 7, and which enforces some unified ttl slightly larger than 3.

If Alice -> Bob -> Carol is a grid path, then Carol still tries sending to all 18 of her neighbors who are not neighbors of Bob, as well as 10 random ones. Alice and Bob also sent to 10 randoms. If Dave is one of Carol's neighbors, and the ttl>4, then Dave tries sending to all 18 of his neighbors that're not neighbors of Carol, and another 10 randoms. If Rob is someone's random, then Rob sends to all 27 of his neighbors, and another 10 randoms.

We still have an aggression problem: Are we sending the message directly or asking the other side if they'd like the message? We'd ideally keep all messages small and send them unrequested.

You receive a message whenever someone among your 27 3d-grid neighbors receives it, so this does not get worse vs the 2d grid. In expectation, you also receive a message 10 times from random guys too, so that's 37 incoming messages per message on the network, likely much less than currently.
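
A quick numeric sketch of the fan-out comparison above, using the 1000-validator figures from the comment (the exact peer counts are approximations):

```rust
// Rough fan-out comparison for 1000 validators, mirroring the numbers above.
fn main() {
    let n: f64 = 1000.0;

    // Current 2d grid: ~(sqrt(n) - 1) peers per dimension, two dimensions.
    let grid_2d = 2.0 * (n.sqrt() - 1.0); // ~61 grid peers

    // Proposed 3d grid: ~(cbrt(n) - 1) peers per dimension, three dimensions,
    // plus ~log2(n) random hops per node.
    let grid_3d = 3.0 * (n.cbrt() - 1.0); // ~27 grid peers
    let random_hops = n.log2(); // ~10

    println!("2d grid peers per node:          ~{:.0}", grid_2d);
    println!("3d grid + random peers per node: ~{:.0}", grid_3d + random_hops); // ~37
}
```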

We might dampen the grid at exactly layer 3, meaning Carol sends to like 20 randoms or something, which assumes Bob was honest.

We still have a finishing problem caused by our ttl: What happens if I never get a message? We should think about how this impacts everything.

Anyways I've discussed this some with @chenda-w3f but it turns out the distributed systems theory says very little about what the really optimal scheme is here. It's maybe worth discussing at the retreat..

alexggh and others added 3 commits December 9, 2024 11:28
Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh alexggh marked this pull request as ready for review December 9, 2024 10:47
Contributor

@sandreim sandreim left a comment


Just missing some test, otherwise LGTM 🚀

Base automatically changed from alexggh/fix_possible_bug to master December 9, 2024 14:14
Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh alexggh added T8-polkadot This PR/Issue is related to/affects the Polkadot network. A4-needs-backport Pull request must be backported to all maintained releases. labels Dec 10, 2024
Contributor

@AndreiEres AndreiEres left a comment


Good job!

@alexggh alexggh enabled auto-merge December 10, 2024 11:18
@Polkadot-Forum

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/2025-11-25-kusama-parachains-spammening-aftermath/11108/1

@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12263100459
Failed job name: test-linux-stable

@alexggh alexggh added this pull request to the merge queue Dec 11, 2024
Merged via the queue into master with commit 85dd228 Dec 11, 2024
196 of 199 checks passed
@alexggh alexggh deleted the alexggh/aggression_potential_fixes branch December 11, 2024 09:08
@paritytech-cmd-bot-polkadot-sdk

Created backport PR for stable2407:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-6696-to-stable2407
git worktree add --checkout .worktree/backport-6696-to-stable2407 backport-6696-to-stable2407
cd .worktree/backport-6696-to-stable2407
git reset --hard HEAD^
git cherry-pick -x 85dd228d9a7d3d736e0dab7c0b084e3fc2c1b003
git push --force-with-lease

@paritytech-cmd-bot-polkadot-sdk

Successfully created backport PR for stable2409:

github-actions bot pushed a commit that referenced this pull request Dec 11, 2024
…my (#6696)

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>
(cherry picked from commit 85dd228)
@paritytech-cmd-bot-polkadot-sdk

Successfully created backport PR for stable2412:

github-actions bot pushed a commit that referenced this pull request Dec 11, 2024
…my (#6696)

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>
(cherry picked from commit 85dd228)
EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6696 into `stable2409` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6696 into `stable2412` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


Co-authored-by: Alexandru Gheorghe <[email protected]>
EgorPopelyaev pushed a commit that referenced this pull request Dec 11, 2024
Backport #6696 into `stable2407` from alexggh.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>