Conversation

@alexggh (Contributor) commented Jul 11, 2025

There are a few things wrong with the way we handle connecting to the validators in the backing group:

  1. `validators_to_connect` returns only validators in groups for which we already have a block to advertise, plus the last backing group we advertised something to. That means that if our backing group changes but we don't have anything to advertise, we keep trying to connect to the previous backing group, and the validator logs this and disconnects us immediately.
    On the validator you will see `Declared as collator for unneeded para`, and on the collator you will see Connect/Disconnect requests. This repeats every `reconnect_timeout` (4s from each active-leaves signal) until the collator advertises something to the new backing group. This is harmless, but it pollutes both the collator and the validator logs.

  2. A collator connects only when it has something to advertise to its backing group. This is a bit too late; we can improve it by having collators connect to the backing group immediately after they notice their assigned backing group.

  3. Staying connected to the last backing group we advertised something to does not work for elastic scaling, because we have several backing groups, and if the collator set is big enough that each collator authors just one block per group rotation, then we will always connect only at the moment we have a candidate to advertise.

Proposal to fix:

Have collators always connect to the backing group they are assigned to and keep the connection open until the backing group changes. Additionally, still try to connect when we have something to advertise or on timeout, to have more chances of being correctly connected.
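
A minimal sketch of the proposed behaviour, using illustrative names rather than the actual collator-protocol internals: the set of validators to connect to is derived directly from the current group assignment, independent of whether a collation exists yet.

```rust
use std::collections::HashSet;

/// Illustrative stand-in for the real authority discovery identifier.
#[derive(Clone, Hash, PartialEq, Eq)]
struct AuthorityDiscoveryId(String);

struct ConnectionState {
    /// Validators in the backing group we are currently assigned to.
    current_backing_group: HashSet<AuthorityDiscoveryId>,
}

impl ConnectionState {
    /// Previously this returned only groups we already had a collation to
    /// advertise to (plus the last group we advertised to); under the
    /// proposal it is simply the assigned group.
    fn validators_to_connect(&self) -> Vec<AuthorityDiscoveryId> {
        self.current_backing_group.iter().cloned().collect()
    }

    /// Called on group rotation: replacing the set means connection
    /// requests to the old group stop being issued.
    fn on_group_changed(&mut self, new_group: HashSet<AuthorityDiscoveryId>) {
        self.current_backing_group = new_group;
    }
}
```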

Todo

  • Confirm that the proposal does not have other undesired side effects.
  • Tests

```rust
/// as we learn the [`PeerId`]'s by `PeerConnected` events.
peer_ids: HashMap<PeerId, HashSet<AuthorityDiscoveryId>>,

/// Tracks which validators we want to stay connected to.
```

@alexggh (Contributor, Author) commented on this diff:

Not needed anymore, removed the entire class.

@lexnv (Contributor) left a comment:

Nice one!

@ordian (Contributor) left a comment:

> 2. A collator connects only when it has something to advertise to its backing group. This is a bit too late; we can improve it by having collators connect to the backing group immediately after they notice their assigned backing group.

The problem with that was that the number of collators of a parachain could substantially exceed the number of connection slots of validators. E.g. Moonbeam has around 80 collators. I think we set the limit high enough to alleviate the problem.

> 3. Staying connected to the last backing group we advertised something to does not work for elastic scaling, because we have several backing groups, and if the collator set is big enough that each collator authors just one block per group rotation, then we will always connect only at the moment we have a candidate to advertise.

Also keep in mind there might be forks: even without elastic scaling, we could in theory be connected to multiple backing groups because of that, since we connect to the union of the groups for the active view.
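
A small illustration of that point, with hypothetical types standing in for the subsystem ones: the connection target is the union of the backing groups across all active leaves, so forks alone can put more than one group in scope.

```rust
use std::collections::HashSet;

type LeafHash = [u8; 32];
type ValidatorIndex = u32;

/// Union of the backing groups assigned to us at each active leaf; with
/// forks (or elastic scaling) this can span several distinct groups.
fn validators_for_active_view(
    active_leaves: &[LeafHash],
    group_at_leaf: impl Fn(&LeafHash) -> HashSet<ValidatorIndex>,
) -> HashSet<ValidatorIndex> {
    active_leaves.iter().flat_map(|leaf| group_at_leaf(leaf)).collect()
}
```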


Overall the changes make sense, given that they are well tested.

Curious how much of this is expected to be affected by the collator protocol revamp @alindima

@alindima (Contributor) commented:

> The problem with that was that the number of collators of a parachain could substantially exceed the number of connection slots of validators. E.g. Moonbeam has around 80 collators. I think we set the limit high enough to alleviate the problem.

This is a good point. Moreover, currently up to 3 paras can share the same core, so they effectively share this limit (without ensuring that it's fairly distributed).

With the revamp, we split the overall connection limit to the number of paras within view that are assigned to this core (and try to ensure it's evenly distributed).

One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.

We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.
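
A hedged sketch of that idea; none of these names exist in the codebase, and the 6s lead time is borrowed from the follow-up implementation mentioned later in this thread. Cumulus announces the upcoming authoring slot, and the collator protocol only connects inside the lead window.

```rust
use std::time::Duration;

/// Hypothetical message from the block authoring side (cumulus): with Aura,
/// the collator knows exactly when its next slot starts.
enum CollatorProtocolMessage {
    ScheduledAuthoring { starts_in: Duration },
}

/// How far in advance we want to be connected to the backing group.
const CONNECT_LEAD_TIME: Duration = Duration::from_secs(6);

fn should_connect_to_backers(msg: &CollatorProtocolMessage) -> bool {
    match msg {
        // Connect only shortly before our own slot instead of keeping every
        // collator connected to the backing group at all times.
        CollatorProtocolMessage::ScheduledAuthoring { starts_in } =>
            *starts_in <= CONNECT_LEAD_TIME,
    }
}
```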

@alexggh (Contributor, Author) commented Jul 25, 2025

> One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.
> We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.

I like this idea very much, I'll try to have a look at how hard it would be to implement, hopefully not too hard.

@alindima (Contributor) commented:

> One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.
> We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.
>
> I like this idea very much, I'll try to have a look at how hard it would be to implement, hopefully not too hard.

CC: @skunert how does this sound?

@skunert (Contributor) commented Jul 25, 2025

> One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.
> We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.
>
> I like this idea very much, I'll try to have a look at how hard it would be to implement, hopefully not too hard.

So, can a collator mess around with this by providing false data? If they lie about their turn, it will mess up the connections, and I understand we have limited slots, right?

@alindima (Contributor) commented:

> One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.
> We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.
>
> I like this idea very much, I'll try to have a look at how hard it would be to implement, hopefully not too hard.
>
> So, can a collator mess around with this by providing false data? If they lie about their turn, it will mess up the connections, and I understand we have limited slots, right?

This is about the happy case, when collators use the implementation we provide.

We were previously connecting only after we built the collation. With this PR we will always maintain connections to the backing group (from all collators).

My idea was to have something in between: connect in advance but only once we know we'll have the right to author a collation in X amount of time.

The adversarial case you mention is handled by the revamped protocol (on the validator side).

@sandreim sandreim added the T0-node This PR/Issue is related to the topic “node”. label Oct 2, 2025
@sandreim sandreim moved this from Backlog to Review/Audit in progress in parachains team board Oct 6, 2025
@sandreim sandreim requested a review from alindima October 10, 2025 14:39
@paritytech-workflow-stopper commented:

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/18416013367
Failed job name: test-linux-stable

@sandreim sandreim enabled auto-merge October 20, 2025 11:18
@sandreim sandreim added this pull request to the merge queue Oct 20, 2025
Merged via the queue into master with commit db5c89f Oct 20, 2025
243 of 244 checks passed
@sandreim sandreim deleted the alexggh/cleanup_connecting_to_backing_group branch October 20, 2025 12:54
@github-project-automation github-project-automation bot moved this from Review/Audit in progress to Completed in parachains team board Oct 20, 2025
@paritytech-release-backport-bot

Created backport PR for stable2506:

Please cherry-pick the changes locally and resolve any conflicts.

```bash
git fetch origin backport-9178-to-stable2506
git worktree add --checkout .worktree/backport-9178-to-stable2506 backport-9178-to-stable2506
cd .worktree/backport-9178-to-stable2506
git reset --hard HEAD^
git cherry-pick -x db5c89ffb65503c766fec651cf4dabfa8c820398
git push --force-with-lease
```

github-merge-queue bot pushed a commit that referenced this pull request Nov 3, 2025
On top of #9178.
Implements a mechanism to pre-connect to backers, see
#9767 (comment)

How it works:
- connect to backers 6s before own slot starts
- disconnect from all backers and stop connecting as the relay chain advances, once our own slot has finished

TODO:
- [x] fix collator protocol tests
- [x] Explicitly disconnect from all backers when own slot has passed
- [x] add test coverage for new connect/disconnect notifications

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Andrei Sandu <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>
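
A rough sketch of the connect/disconnect window the commit above describes, with hypothetical types (the real logic lives in the collator protocol): connections open shortly before our own slot and are torn down once it has passed.

```rust
use std::time::{Duration, Instant};

/// Lead time from the commit message above: connect 6s before the slot.
const PRE_CONNECT_LEAD: Duration = Duration::from_secs(6);

/// Our own authoring slot, known in advance under Aura.
struct OwnSlot {
    start: Instant,
    end: Instant,
}

enum BackerAction {
    Connect,
    /// Explicitly drop all backer connections and stop issuing new requests.
    DisconnectAll,
    NoChange,
}

/// Evaluated as the relay chain advances.
fn on_relay_chain_advance(slot: &OwnSlot, now: Instant) -> BackerAction {
    if now >= slot.end {
        BackerAction::DisconnectAll
    } else if now + PRE_CONNECT_LEAD >= slot.start {
        BackerAction::Connect
    } else {
        BackerAction::NoChange
    }
}
```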
@sandreim sandreim added A4-backport-stable2509 Pull request must be backported to the stable2509 release branch and removed A4-backport-stable2506 Pull request must be backported to the stable2506 release branch labels Nov 3, 2025
@paritytech-release-backport-bot

Created backport PR for stable2509:

Please cherry-pick the changes locally and resolve any conflicts.

```bash
git fetch origin backport-9178-to-stable2509
git worktree add --checkout .worktree/backport-9178-to-stable2509 backport-9178-to-stable2509
cd .worktree/backport-9178-to-stable2509
git reset --hard HEAD^
git cherry-pick -x db5c89ffb65503c766fec651cf4dabfa8c820398
git push --force-with-lease
```

RomarQ pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Dec 3, 2025
RomarQ pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Dec 3, 2025