Skip to content

[Refactor] Tighten up high-level connection/peer handling#3713

Merged
vicsn merged 3 commits intoProvableHQ:stagingfrom
ljedrz:refactor/peer_handling
Jul 24, 2025
Merged

[Refactor] Tighten up high-level connection/peer handling#3713
vicsn merged 3 commits intoProvableHQ:stagingfrom
ljedrz:refactor/peer_handling

Conversation

@ljedrz
Copy link
Collaborator

@ljedrz ljedrz commented Jul 8, 2025

We currently have several high-level connection/peer-related collections, which provides us with a good overview when reading the objects, but isn't without potential issues: it involves multiple lock operations whenever a peer is added or removed, which can easily lead to race conditions.

This PR is an attempt to get rid of these issues altogether by collapsing several collections of simple items in favor of a single collection of more complex ones. The basic summary of the changes is this:

# InnerRouter<N: Network> {
#    (...)
-    /// The map of connected peer IPs to their peer handlers.
-    connected_peers: RwLock<HashMap<SocketAddr, Peer<N>>>,
-    /// The set of handshaking peers. While `Tcp` already recognizes the connecting IP addresses
-    /// and prevents duplicate outbound connection attempts to the same IP address, it is unable to
-    /// prevent simultaneous "two-way" connections between two peers (i.e. both nodes simultaneously
-    /// attempt to connect to each other). This set is used to prevent this from happening.
-    connecting_peers: Mutex<HashMap<SocketAddr, Option<Peer<N>>>>,
-    /// The set of candidate peer IPs.
-    candidate_peers: RwLock<HashSet<SocketAddr>>,
-    /// The set of restricted peer IPs.
-    restricted_peers: RwLock<HashMap<SocketAddr, Instant>>,
+    /// The collection of both candidate and connected peers.
+    peer_pool: RwLock<HashMap<SocketAddr, Peer<N>>>,

A short summary of the other changes involved:

  • the Peer becomes an enum with 3 possible variants: Candidate, Connecting, and Connected
  • the Peer is no longer Clone (to avoid any possibility of unintentional duplication)
  • new concrete types, CandidatePeer and ConnectedPeer, are isolated for added type safety
  • the completion of the high-level network handshake and the high-level disconnect now only involve accessing a single collection
  • the Resolver is only ever modified indirectly, through the Peer
  • the Resolver no longer needs to map from the listener address to the connected address (the peer_pool can do it)

I tried to limit the scope of the logic changes to minimum, and confirmed that they are working by running all the tests locally, plus the devnet script.

Fixes #3705.

@vicsn
Copy link
Collaborator

vicsn commented Jul 11, 2025

Triggered prerelease tests against this.

EDIT: Nevermind, still adding more clients to prerelease tests so it more accurately reflects the load.

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
@ljedrz ljedrz force-pushed the refactor/peer_handling branch from d61b6ad to 0758688 Compare July 17, 2025 14:55
@vicsn vicsn added the v4.1.0 label Jul 21, 2025
@vicsn
Copy link
Collaborator

vicsn commented Jul 21, 2025

Connections on a 40 validator 80 client network look healthy. The visible drops to 0 connections are due to expected restarts.
Screenshot 2025-07-21 at 15 26 55

Can you evaluate if the logs look good to you too @ljedrz ?
https://drive.google.com/file/d/1WZcRemR1vUWhWkeLEsdsl-Pp3c_HK1tl/view?usp=sharing

@vicsn vicsn removed the request for review from niklaslong July 21, 2025 15:50
Copy link
Contributor

@kaimast kaimast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are exciting as they improve the code quality and maintainability a lot.
I left a few comments, and we can hopefully get this merged soon :)

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
@ljedrz ljedrz requested a review from kaimast July 23, 2025 09:00
@ljedrz
Copy link
Collaborator Author

ljedrz commented Jul 23, 2025

@vicsn The logs look ok to me 👍.

@ljedrz
Copy link
Collaborator Author

ljedrz commented Jul 23, 2025

The CI failure is unrelated:

    Scanning Cargo.lock for vulnerabilities (603 crate dependencies)
Crate:     curve25519-dalek
Version:   4.2.0
Warning:   yanked
Dependency tree:
curve25519-dalek 4.2.0
└── ed25519-dalek 2.2.0
    └── zipsign-api 0.1.5
        └── self_update 0.42.0
            └── snarkos-cli 3.8.0
                └── snarkos 4.0.0

error: 1 denied warning found!

@kaimast
Copy link
Contributor

kaimast commented Jul 23, 2025

I had the audit issue too. cargo update should downgrade that package and fix it.

Copy link
Contributor

@kaimast kaimast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be good to merge now 🥳

@vicsn vicsn merged commit 6aba00e into ProvableHQ:staging Jul 24, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] The Resolver can get unaligned with the higher-level address maps

3 participants