Skip to content

DHT bootnodes: rate limit the discovery attempts instead of limiting max retry count#8792

Merged
dmitry-markin merged 6 commits intomasterfrom
dm-dht-bootnodes-retry
Jun 19, 2025
Merged

DHT bootnodes: rate limit the discovery attempts instead of limiting max retry count#8792
dmitry-markin merged 6 commits intomasterfrom
dm-dht-bootnodes-retry

Conversation

@dmitry-markin
Copy link
Copy Markdown
Contributor

Instead of giving up after 5 discovery attempts, keep retrying with a delay of 30 seconds until the discovery succeeds.

This fixes a DHT bootnodes zombinet test execution on CI where individual nodes may start with a significant delay. This will also help should there be temporary connectivity issues leading to 5 failures in a row.

@dmitry-markin dmitry-markin requested review from lexnv and skunert June 6, 2025 15:16
@dmitry-markin dmitry-markin added the R0-no-crate-publish-required The change does not require any crates to be re-published. label Jun 6, 2025
@dmitry-markin dmitry-markin added the T0-node This PR/Issue is related to the topic “node”. label Jun 6, 2025
@dmitry-markin dmitry-markin changed the title DHT bootnodes: rate limit the discovery attempts istead of limiting max retry count DHT bootnodes: rate limit the discovery attempts instead of limiting max retry count Jun 6, 2025
Comment thread cumulus/client/bootnodes/src/discovery.rs Outdated
Comment thread cumulus/client/bootnodes/src/discovery.rs Outdated
Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@lexnv lexnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Left a few minor thoughts on the process itself. Wondering if we could simplify a bit the state machine needed for retrying. However, that could be a followup 🙏

Im wondering if it would make sense to have an exponential timer similar to what we have in the substrate/discovery? Where we might go from 8s to 16s to 32s and cap it there? 🤔

Comment thread cumulus/client/bootnodes/src/discovery.rs Outdated
Comment thread cumulus/client/bootnodes/src/discovery.rs
@dmitry-markin
Copy link
Copy Markdown
Contributor Author

Wondering if we could simplify a bit the state machine needed for retrying. However, that could be a followup 🙏

I don't think we can simplify it, because currently the state machine boils down to:

  1. discovery in progress -> do nothing
  2. no discovery in progress -> trigger retry in 30 seconds

Im wondering if it would make sense to have an exponential timer similar to what we have in the substrate/discovery? Where we might go from 8s to 16s to 32s and cap it there? 🤔

On real networks the discovery takes 2-3 minutes, so 8s or 32s won't make a big difference, while introducing the complexity of an exponential timer.

@paritytech-workflow-stopper
Copy link
Copy Markdown

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/15757558286
Failed job name: fmt

@dmitry-markin dmitry-markin added this pull request to the merge queue Jun 19, 2025
Merged via the queue into master with commit 35d8868 Jun 19, 2025
245 checks passed
@dmitry-markin dmitry-markin deleted the dm-dht-bootnodes-retry branch June 19, 2025 16:13
@github-project-automation github-project-automation Bot moved this to Blocked ⛔️ in Networking Jun 19, 2025
alvicsam pushed a commit that referenced this pull request Oct 17, 2025
…max retry count (#8792)

Instead of giving up after 5 discovery attempts, keep retrying with a
delay of 30 seconds until the discovery succeeds.

This fixes a DHT bootnodes zombinet test execution on CI where
individual nodes may start with a significant delay. This will also help
should there be temporary connectivity issues leading to 5 failures in a
row.

---------

Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

R0-no-crate-publish-required The change does not require any crates to be re-published. T0-node This PR/Issue is related to the topic “node”.

Projects

Status: Blocked ⛔️

Development

Successfully merging this pull request may close these issues.

4 participants