Skip to content

catchup: improve classBasedPeerSelector on small peer sets#6277

Merged
algorandskiy merged 5 commits intoalgorand:masterfrom
algorandskiy:pavel/p2p-catchup-fix
Mar 14, 2025
Merged

catchup: improve classBasedPeerSelector on small peer sets#6277
algorandskiy merged 5 commits intoalgorand:masterfrom
algorandskiy:pavel/p2p-catchup-fix

Conversation

@algorandskiy
Copy link
Copy Markdown
Contributor

@algorandskiy algorandskiy commented Mar 14, 2025

Summary

Per #6152 there was an issue with catchup on P2P networks that ended up to be a peer selector (classBasedPeerSelector) issue. Suppose there are only two Relays and no Archival nodes, and only one of the relays have all the blocks, and a node in question does not have DHT enabled. In this case it will only get ConnectedOut and PhonebookRelay peers (that are the same) and attempts to catchup from them. The following will happen:

  1. A relay without old blocks will be randomly selected and after couple of failures gets down ranked below a relay with blocks ( exact rank number is 3 but it does not really matter)
  2. Then the second relay get selected and about a hundred blocks get downloaded from it (exact number is 79 until its rank reaches 3, see below)
  3. peerSelector has a special logic to penalize a node as it is used in order to distribute load more evenly.
  4. After that the first node is selected again, fails, and the steps 2-3 repeat.
  5. Next, the classBasedPeerSelector instance determines ConnectedOut class had too many failures, and it is time to temporary disable this class. It advances to PhonebookRelay and the process repeats.
  6. Then this class gets banned as well, and no peers in remaining classes.
  7. Catchup service runs out of peers and exits until it resumes in 17 (agreement.defaultDeadlineTimeout) seconds again.

Fixed by decrementing failures counter in classBasedPeerSelector so that a single failing node does not prevent the entire class to give up on - the underlying peerSelector will punish it harder and harder until its rank passes the usage saturation bound (rank=23 for more context).

Additionally:

  1. Made rankSamples a linked list with node reusage instead of a slice => no extra mem allocs on long catchups.
  2. Fixed some types and uniformed fetchAndWrite logging.

Peer selector / catchup updates:

  1. History intro Fine tune catchup peer selection logic #2060
  2. Better 404 handling catchup: fetchAndWrite/fetchRound quit early on errNoBlockForRound #5809
  3. Class based peer selector to support archival role Network: Class-based Peer Selector #5937

Closes #6152

Test Plan

  1. Updated a unit test
  2. Run a local net catchup before and after the fix

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 14, 2025

Codecov Report

Attention: Patch coverage is 62.50000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 51.69%. Comparing base (f6cc4e2) to head (3df9c59).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
network/gossipNode.go 0.00% 12 Missing ⚠️
catchup/service.go 72.72% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6277      +/-   ##
==========================================
- Coverage   51.71%   51.69%   -0.02%     
==========================================
  Files         647      647              
  Lines       86888    86907      +19     
==========================================
- Hits        44934    44927       -7     
- Misses      39095    39119      +24     
- Partials     2859     2861       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jannotti
jannotti previously approved these changes Mar 14, 2025
Copy link
Copy Markdown
Contributor

@jannotti jannotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a suggestion, but ready to approve even if you don't like it!

gmalouf
gmalouf previously approved these changes Mar 14, 2025
Copy link
Copy Markdown
Contributor

@gmalouf gmalouf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sprinkle of comments, but this looks good to me.

@algorandskiy algorandskiy dismissed stale reviews from gmalouf and jannotti via 3df9c59 March 14, 2025 20:37
@algorandskiy algorandskiy merged commit 272ec02 into algorand:master Mar 14, 2025
19 checks passed
@algorandskiy algorandskiy deleted the pavel/p2p-catchup-fix branch March 16, 2026 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Catchup over P2P stops intermittently

3 participants