Pull request overview
This PR fixes a critical bug in the connection management system where failed dial attempts were not properly cleaned up, causing peers to become permanently stuck in "dialing" state and preventing reconnection attempts. The fix adds the missing remove_dial() call in the OutgoingConnectionError event handler, allowing failed peers to be retried according to the peer_redialing configuration.
Additionally, the PR refactors DNS resolution by removing custom DNS resolution logic in favor of libp2p's built-in DNS support via the dns feature and .with_dns() transport layer, simplifying the codebase and delegating DNS handling to the library.
Key Changes
- Fixed stuck dial attempts by cleaning up the current_dials HashSet when outgoing connections fail
- Replaced custom DNS resolution with libp2p's built-in DNS transport layer
- Removed manual DNS resolution logic and associated tests that are now redundant
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| crates/node/gossip/src/driver.rs | Added remove_dial() call in OutgoingConnectionError handler to fix the bug preventing peer retry attempts |
| crates/node/gossip/src/gater.rs | Removed custom try_resolve_dns() method and simplified can_dial() to skip manual DNS resolution, delegating this to libp2p |
| crates/node/gossip/src/builder.rs | Added .with_dns() to SwarmBuilder to enable libp2p's DNS transport layer |
| crates/node/gossip/Cargo.toml | Added "dns" feature to libp2p dependency to support DNS multiaddrs |
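As a sketch of the Cargo.toml change (the version number and sibling features below are illustrative assumptions, not taken from the actual file):

```toml
# crates/node/gossip/Cargo.toml (illustrative; version and other features are assumptions)
[dependencies]
libp2p = { version = "0.53", features = ["dns", "tcp", "noise", "yamux", "tokio"] }
```

With the `dns` feature enabled, libp2p's `SwarmBuilder` exposes `.with_dns()`, which wraps the underlying transport so `/dns4`, `/dns6`, and `/dnsaddr` multiaddrs are resolved by the library itself.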
theochap
left a comment
Looks good overall, although I am wondering whether it is right to remove the ability to block IPs/subnets if we're using DNS addresses. Maybe a better solution would be either to add support for blocking DNS names or (probably the better solution) to just translate the DNS name into an IP in kona using into_socket_addrs
Head branch was pushed to by a user without write access (force-pushed 9d50522 to ddff23f).
# Description
1. Add the DNS module to SwarmBuilder to support DNS-based multiaddresses at the transport layer.
2. Fix a bug in DialConnections where an "unable to dial" error made it impossible to ever re-dial a peer, since the peer was never removed from the dial list.
Previously the Swarm did not have DNS capabilities, which resulted in the following error. After this change, peering works with DNS-based multiaddresses:
```
2025-12-10T18:49:47.003184Z DEBUG gossip: Outgoing connection error: Transport([(/dns4/kona-net-0-kona-reth-f-sequencer-2-p2p.primary.infra.dev.oplabs.cloud/tcp/9003/p2p/16Uiu2HAm3e6LBYw9JK5rcyE5rCANd2ZF5i53qAoCsaEbpvJgR6Uu, MultiaddrNotSupported(/dns4/kona-net-0-kona-reth-f-sequencer-2-p2p.primary.infra.dev.oplabs.cloud/tcp/9003/p2p/16Uiu2HAm3e6LBYw9JK5rcyE5rCANd2ZF5i53qAoCsaEbpvJgR6Uu))])
```
# Bug Summary: Stuck Dial Attempts Preventing Peer Connections
### Symptoms
- Kona nodes only connecting to 3 out of 8 peers in the network
- Discovery successfully finding 145 peers in the routing table
- Prometheus metrics showing 842+ kona_node_dial_peer_error{type="already_dialing"} errors and growing
### Root Cause
Missing cleanup in the SwarmEvent::OutgoingConnectionError handler. When a dial attempt fails asynchronously (network timeout, connection refused, DNS resolution failure, etc.), the peer ID was never removed from the current_dials HashSet in the connection gater. This caused the peer to be permanently stuck in the "dialing" state.
### Code Flow
1. Node discovers peers via discv5 (145 peers found)
2. Node attempts to dial discovered peers
3. Dial attempt starts → peer added to current_dials HashSet
4. If dial succeeds → ConnectionEstablished event → peer stays in current_dials (OK, protected by "already connected" check)
5. If dial fails → OutgoingConnectionError event → BUG: peer NOT removed from current_dials
6. Node tries to redial the failed peer later
7. can_dial() check fails with DialError::AlreadyDialing because the peer is still in current_dials

The peer can never be retried, despite the peer_redialing: 500 configuration.
### Before Fix
- Only 3/8 peers connected (37.5% connectivity)
- Failed peers blacklisted forever after the first attempt
- Network partition risk in production
- Peer redial configuration (peer_redialing: 500) effectively useless

### After Fix
- Failed dial attempts can be retried according to the peer_redialing config
- Should achieve full mesh connectivity (7/7 peers, excluding self)
- Proper network resilience against transient failures
### Discovery Process
- Started investigating the PMS dashboard showing 6 peers vs RPC showing 3
- Found PMS was exporting duplicate metric series (unrelated issue, fixed with max instead of sum)
- Confirmed the node was actually only connected to 3 peers via RPC (opp2p_peers, opp2p_peerStats)
- Discovered discv5 was working (145 peers in the table) but gossip connections were failing
- Examined dial error metrics and found 842 "already_dialing" errors
- Traced through the connection gater and gossip driver code
- Identified the missing cleanup in the OutgoingConnectionError event handler
### Testing Recommendations
- Monitor kona_node_dial_peer_error{type="already_dialing"} - it should stop increasing
- Monitor kona_node_swarm_peer_count - it should increase from 3 towards 7
- Check the opp2p_peerStats RPC after 5-10 minutes - it should show 7 connected peers
- Verify the PMS dashboard shows correct peer counts with the updated query