Conversation

@alexggh (Contributor) commented Jul 11, 2025

There are a few things wrong with the way we handle connecting to the validators in the backing group:

  1. `validators_to_connect` returns only validators in groups for which we already have a block to advertise, plus the last backing group we advertised something to. That means that if our backing group changes but we don't have anything to advertise, we keep trying to connect to the previous backing group, and the validator logs this and disconnects us immediately.
    On the validator you will see `Declared as collator for unneeded para`, and on the collator you will see Connect/Disconnect requests. This repeats every `reconnect_timeout` (4s from each active-leaves signal) until the collator advertises something to the new backing group. This is harmless, but it pollutes both the collator and the validator logs.

  2. A collator connects only when it has something to advertise to its backing group. This is a bit too late; we can improve it by having collators connect to the backing group immediately after they notice their assigned backing group.

  3. Staying connected to the last backing group we advertised something to does not work for elastic scaling, because we have several backing groups, and if the collator set is big enough that each collator authors just one block per group rotation, then we will always connect only at the moment we have a candidate to advertise.

Proposal to fix:

Have collators always connect to the backing group they are assigned to and keep the connection open until the backing group changes. Additionally, still try to connect when we have something to advertise or on timeout, to have more chances of being correctly connected.
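
A minimal sketch of the proposed behaviour, using illustrative names rather than the actual collator-protocol internals: the set of validators to connect to is derived directly from the current group assignment, independent of whether a collation exists yet.

```rust
use std::collections::HashSet;

/// Illustrative stand-in for the real authority discovery identifier.
#[derive(Clone, Hash, PartialEq, Eq)]
struct AuthorityDiscoveryId(String);

struct ConnectionState {
    /// Validators in the backing group we are currently assigned to.
    current_backing_group: HashSet<AuthorityDiscoveryId>,
}

impl ConnectionState {
    /// Previously this returned only groups we already had a collation to
    /// advertise to (plus the last group we advertised to); under the
    /// proposal it is simply the assigned group.
    fn validators_to_connect(&self) -> Vec<AuthorityDiscoveryId> {
        self.current_backing_group.iter().cloned().collect()
    }

    /// Called on group rotation: replacing the set means connection
    /// requests to the old group stop being issued.
    fn on_group_changed(&mut self, new_group: HashSet<AuthorityDiscoveryId>) {
        self.current_backing_group = new_group;
    }
}
```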

Todo

  • Confirm that the proposal does not have other undesired side effects.
  • Tests

```rust
/// as we learn the [`PeerId`]'s by `PeerConnected` events.
peer_ids: HashMap<PeerId, HashSet<AuthorityDiscoveryId>>,

/// Tracks which validators we want to stay connected to.
```

@alexggh (Contributor, Author) commented on this diff:

Not needed anymore, removed the entire class.

@lexnv (Contributor) left a comment:

Nice one!

@ordian (Contributor) left a comment:

> 2. A collator connects only when it has something to advertise to its backing group. This is a bit too late; we can improve it by having collators connect to the backing group immediately after they notice their assigned backing group.

The problem with that was that the number of collators of a parachain could substantially exceed the number of connection slots of validators. E.g. Moonbeam has around 80 collators. I think we set the limit high enough to alleviate the problem.

> 3. Staying connected to the last backing group we advertised something to does not work for elastic scaling, because we have several backing groups, and if the collator set is big enough that each collator authors just one block per group rotation, then we will always connect only at the moment we have a candidate to advertise.

Also keep in mind there might be forks: even without elastic scaling, we could in theory be connected to multiple backing groups because of that, since we connect to the union of the groups for the active view.
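
A small illustration of that point, with hypothetical types standing in for the subsystem ones: the connection target is the union of the backing groups across all active leaves, so forks alone can put more than one group in scope.

```rust
use std::collections::HashSet;

type LeafHash = [u8; 32];
type ValidatorIndex = u32;

/// Union of the backing groups assigned to us at each active leaf; with
/// forks (or elastic scaling) this can span several distinct groups.
fn validators_for_active_view(
    active_leaves: &[LeafHash],
    group_at_leaf: impl Fn(&LeafHash) -> HashSet<ValidatorIndex>,
) -> HashSet<ValidatorIndex> {
    active_leaves.iter().flat_map(|leaf| group_at_leaf(leaf)).collect()
}
```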


Overall the changes make sense, given that they are well tested.

Curious how much of this is expected to be affected by the collator protocol revamp @alindima

@alindima (Contributor) commented:

> The problem with that was that the number of collators of a parachain could substantially exceed the number of connection slots of validators. E.g. Moonbeam has around 80 collators. I think we set the limit high enough to alleviate the problem.

This is a good point. Moreover, currently up to 3 paras can share the same core, so they effectively share this limit (without ensuring that it's fairly distributed).

With the revamp, we split the overall connection limit to the number of paras within view that are assigned to this core (and try to ensure it's evenly distributed).

One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.

We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.
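
A hedged sketch of that idea; none of these names exist in the codebase, and the 6s lead time is borrowed from the follow-up implementation mentioned later in this thread. Cumulus announces the upcoming authoring slot, and the collator protocol only connects inside the lead window.

```rust
use std::time::Duration;

/// Hypothetical message from the block authoring side (cumulus): with Aura,
/// the collator knows exactly when its next slot starts.
enum CollatorProtocolMessage {
    ScheduledAuthoring { starts_in: Duration },
}

/// How far in advance we want to be connected to the backing group.
const CONNECT_LEAD_TIME: Duration = Duration::from_secs(6);

fn should_connect_to_backers(msg: &CollatorProtocolMessage) -> bool {
    match msg {
        // Connect only shortly before our own slot instead of keeping every
        // collator connected to the backing group at all times.
        CollatorProtocolMessage::ScheduledAuthoring { starts_in } =>
            *starts_in <= CONNECT_LEAD_TIME,
    }
}
```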

@alexggh (Contributor, Author) commented Jul 25, 2025

> One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.
> We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.

I like this idea very much, I'll try to have a look at how hard it would be to implement, hopefully not too hard.

@alindima (Contributor) commented:

> One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.
> We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.
>
> I like this idea very much, I'll try to have a look at how hard it would be to implement, hopefully not too hard.

CC: @skunert how does this sound?

@skunert (Contributor) commented Jul 25, 2025

> One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.
> We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.
>
> I like this idea very much, I'll try to have a look at how hard it would be to implement, hopefully not too hard.

So, can a collator mess around with this by providing false data? If they lie about their turn, it will mess up the connections, and I understand we have limited slots, right?

@alindima (Contributor) commented:

> One idea that could make this even smarter would be to have some other message sent to the collator protocol (by cumulus) which declares that it'll be our turn to author a block in X amount of time. With Aura, we know for sure when that is going to be.
> We get the advantage of being connected in advance without putting too much pressure on backing validators by connecting all collators to them at all times.
>
> I like this idea very much, I'll try to have a look at how hard it would be to implement, hopefully not too hard.
>
> So, can a collator mess around with this by providing false data? If they lie about their turn, it will mess up the connections, and I understand we have limited slots, right?

This is about the happy case, when collators use the implementation we provide.

We were previously connecting only after we built the collation. With this PR we will always maintain connections to the backing group (from all collators).

My idea was to have something in between: connect in advance but only once we know we'll have the right to author a collation in X amount of time.

The adversarial case you mention is handled by the revamped protocol (on the validator side).

@sandreim sandreim added the T0-node This PR/Issue is related to the topic “node”. label Oct 2, 2025
@sandreim sandreim moved this from Backlog to Review/Audit in progress in parachains team board Oct 6, 2025
@sandreim sandreim requested a review from alindima October 10, 2025 14:39
@paritytech-workflow-stopper commented:

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/18416013367
Failed job name: test-linux-stable

@sandreim sandreim enabled auto-merge October 20, 2025 11:18
@sandreim sandreim added this pull request to the merge queue Oct 20, 2025
Merged via the queue into master with commit db5c89f Oct 20, 2025
243 of 244 checks passed
@sandreim sandreim deleted the alexggh/cleanup_connecting_to_backing_group branch October 20, 2025 12:54
@github-project-automation github-project-automation bot moved this from Review/Audit in progress to Completed in parachains team board Oct 20, 2025
@paritytech-release-backport-bot

Created backport PR for stable2506:

Please cherry-pick the changes locally and resolve any conflicts.

```bash
git fetch origin backport-9178-to-stable2506
git worktree add --checkout .worktree/backport-9178-to-stable2506 backport-9178-to-stable2506
cd .worktree/backport-9178-to-stable2506
git reset --hard HEAD^
git cherry-pick -x db5c89ffb65503c766fec651cf4dabfa8c820398
git push --force-with-lease
```

github-merge-queue bot pushed a commit that referenced this pull request Nov 3, 2025
On top of #9178.
Implements a mechanism to pre-connect to backers, see
#9767 (comment)

How it works:
- connect to backers 6s before own slot starts
- disconnect from all backers and stop connecting as the relay chain advances, once our own slot has finished

TODO:
- [x] fix collator protocol tests
- [x] Explicitly disconnect from all backers when own slot has passed
- [x] add test coverage for new connect/disconnect notifications

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Signed-off-by: Andrei Sandu <[email protected]>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>
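
A rough sketch of the connect/disconnect window the commit above describes, with hypothetical types (the real logic lives in the collator protocol): connections open shortly before our own slot and are torn down once it has passed.

```rust
use std::time::{Duration, Instant};

/// Lead time from the commit message above: connect 6s before the slot.
const PRE_CONNECT_LEAD: Duration = Duration::from_secs(6);

/// Our own authoring slot, known in advance under Aura.
struct OwnSlot {
    start: Instant,
    end: Instant,
}

enum BackerAction {
    Connect,
    /// Explicitly drop all backer connections and stop issuing new requests.
    DisconnectAll,
    NoChange,
}

/// Evaluated as the relay chain advances.
fn on_relay_chain_advance(slot: &OwnSlot, now: Instant) -> BackerAction {
    if now >= slot.end {
        BackerAction::DisconnectAll
    } else if now + PRE_CONNECT_LEAD >= slot.start {
        BackerAction::Connect
    } else {
        BackerAction::NoChange
    }
}
```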
@sandreim sandreim added A4-backport-stable2509 Pull request must be backported to the stable2509 release branch and removed A4-backport-stable2506 Pull request must be backported to the stable2506 release branch labels Nov 3, 2025
@paritytech-release-backport-bot

Created backport PR for stable2509:

Please cherry-pick the changes locally and resolve any conflicts.

```bash
git fetch origin backport-9178-to-stable2509
git worktree add --checkout .worktree/backport-9178-to-stable2509 backport-9178-to-stable2509
cd .worktree/backport-9178-to-stable2509
git reset --hard HEAD^
git cherry-pick -x db5c89ffb65503c766fec651cf4dabfa8c820398
git push --force-with-lease
```

RomarQ pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Dec 3, 2025
RomarQ pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Dec 3, 2025