This repository was archived by the owner on Jan 16, 2026. It is now read-only.

feat: Add DNS to SwarmBuilder #3182

Merged
theochap merged 4 commits into op-rs:main from jelias2:jelias2/kona-remove-dial-on-error
Dec 18, 2025

Conversation

@jelias2 (Contributor) commented Dec 10, 2025

Description

  1. Add a DNS module to SwarmBuilder to support DNS-based multiaddresses at the transport layer.
  2. Fix a bug in DialConnections where an "unable to dial" error left a peer permanently un-redialable, since it was never removed from the dial list.

Previously, the Swarm did not have DNS capabilities, which resulted in the following error. After this change, peering works with DNS-based multiaddresses:

```
2025-12-10T18:49:47.003184Z DEBUG gossip: Outgoing connection error: Transport([(/dns4/kona-net-0-kona-reth-f-sequencer-2-p2p.primary.infra.dev.oplabs.cloud/tcp/9003/p2p/16Uiu2HAm3e6LBYw9JK5rcyE5rCANd2ZF5i53qAoCsaEbpvJgR6Uu, MultiaddrNotSupported(/dns4/kona-net-0-kona-reth-f-sequencer-2-p2p.primary.infra.dev.oplabs.cloud/tcp/9003/p2p/16Uiu2HAm3e6LBYw9JK5rcyE5rCANd2ZF5i53qAoCsaEbpvJgR6Uu))])
```
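At the transport layer, the fix delegates name resolution to libp2p itself. A minimal sketch of what the builder change looks like, assuming a recent rust-libp2p `SwarmBuilder` with the `dns` feature enabled; the exact method names, ordering, and transport configuration depend on the libp2p version kona pins, and `keypair`/`behaviour` are placeholders:

```rust
// Cargo.toml (sketch): libp2p = { version = "...", features = ["tokio", "tcp", "noise", "yamux", "dns"] }
let swarm = libp2p::SwarmBuilder::with_existing_identity(keypair)
    .with_tokio()
    .with_tcp(
        libp2p::tcp::Config::default(),
        libp2p::noise::Config::new,
        libp2p::yamux::Config::default,
    )?
    // Wrap the TCP transport in a DNS transport so /dns4 and /dns6
    // multiaddrs are resolved before dialing, instead of failing with
    // MultiaddrNotSupported as in the log above.
    .with_dns()?
    .with_behaviour(|_key| behaviour)?
    .build();
```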

Bug Summary: Stuck Dial Attempts Preventing Peer Connections

Symptoms

- Kona nodes only connecting to 3 out of 8 peers in the network
- Discovery successfully finding 145 peers in the routing table
- Prometheus metrics showing 842+ kona_node_dial_peer_error{type="already_dialing"} errors and growing

Root cause: missing cleanup in the SwarmEvent::OutgoingConnectionError handler. When a dial attempt fails asynchronously (network timeout, connection refused, DNS resolution failure, etc.), the peer ID was never removed from the current_dials HashSet in the connection gater, so the peer was permanently stuck in the "dialing" state.

Code Flow:

  1. Node discovers peers via discv5 (145 peers found)
  2. Node attempts to dial discovered peers
  3. Dial attempt starts → peer added to current_dials HashSet
  4. If dial succeeds → ConnectionEstablished event → peer stays in current_dials (OK, protected by the "already connected" check)
  5. If dial fails → OutgoingConnectionError event → BUG: peer NOT removed from current_dials
  6. Node tries to redial the failed peer later
  7. can_dial() check fails with DialError::AlreadyDialing because the peer is still in current_dials
  8. Peer can never be retried, despite the peer_redialing: 500 configuration

Before Fix:

- Only 3/8 peers connected (37.5% connectivity)
- Failed peers effectively blacklisted forever after the first attempt
- Network partition risk in production
- Peer redial configuration (peer_redialing: 500) effectively useless

After Fix:

- Failed dial attempts can be retried according to the peer_redialing config
- Should achieve full mesh connectivity (7/7 peers, excluding self)
- Proper network resilience against transient failures

Discovery Process

- Started investigating the PMS dashboard showing 6 peers vs the RPC showing 3
- Found PMS was exporting duplicate metric series (unrelated issue, fixed by aggregating with max instead of sum)
- Confirmed the node was actually only connected to 3 peers via RPC (opp2p_peers, opp2p_peerStats)
- Discovered discv5 was working (145 peers in the table) but gossip connections were failing
- Examined dial error metrics and found 842 "already_dialing" errors
- Traced through the connection gater and gossip driver code
- Identified the missing cleanup in the OutgoingConnectionError event handler

Testing Recommendations

- Monitor kona_node_dial_peer_error{type="already_dialing"} - it should stop increasing
- Monitor kona_node_swarm_peer_count - it should increase from 3 towards 7
- Check the opp2p_peerStats RPC after 5-10 minutes - it should show 7 connected peers
- Verify the PMS dashboard shows correct peer counts with the updated query

@codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.1%. Comparing base (86910c9) to head (9d50522).
✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
crates/node/gossip/src/driver.rs | 0.0% | 3 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (86910c9) and HEAD (9d50522). Click for more details.

HEAD has 19 fewer uploads than BASE:

Flag | BASE (86910c9) | HEAD (9d50522)
proof | 7 | 0
e2e | 11 | 0
unit | 2 | 1


@jelias2 jelias2 marked this pull request as ready for review December 11, 2025 14:54
Copilot AI review requested due to automatic review settings December 11, 2025 14:54
Copilot AI left a comment

Pull request overview

This PR fixes a critical bug in the connection management system where failed dial attempts were not properly cleaned up, causing peers to become permanently stuck in "dialing" state and preventing reconnection attempts. The fix adds the missing remove_dial() call in the OutgoingConnectionError event handler, allowing failed peers to be retried according to the peer_redialing configuration.

Additionally, the PR refactors DNS resolution by removing custom DNS resolution logic in favor of libp2p's built-in DNS support via the dns feature and .with_dns() transport layer, simplifying the codebase and delegating DNS handling to the library.

Key Changes

  • Fixed stuck dial attempts by cleaning up current_dials HashSet when outgoing connections fail
  • Replaced custom DNS resolution with libp2p's built-in DNS transport layer
  • Removed manual DNS resolution logic and associated tests that are now redundant

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File | Description
crates/node/gossip/src/driver.rs | Added a remove_dial() call in the OutgoingConnectionError handler to fix the bug preventing peer retry attempts
crates/node/gossip/src/gater.rs | Removed the custom try_resolve_dns() method and simplified can_dial() to skip manual DNS resolution, delegating it to libp2p
crates/node/gossip/src/builder.rs | Added .with_dns() to SwarmBuilder to enable libp2p's DNS transport layer
crates/node/gossip/Cargo.toml | Added the "dns" feature to the libp2p dependency to support DNS multiaddrs


@jelias2 jelias2 changed the title from "fix: Bug in SwarmEvent::OutgoingConnectionError handler - When an out…" to "feat: Add DNS to Swarm" on Dec 11, 2025
@jelias2 jelias2 changed the title from "feat: Add DNS to Swarm" to "feat: Add DNS to SwarmBuilder" on Dec 11, 2025
@theochap (Member) left a comment
Looks good overall, although I am wondering whether it is right to remove support for blocking IPs/subnets when we're using DNS addresses. Maybe a better solution would be to either add support for blocking DNS names or (probably the better solution) to translate the DNS name into an IP in kona using into_socket_addrs.

@op-will op-will enabled auto-merge December 17, 2025 20:14
auto-merge was automatically disabled December 18, 2025 15:56

Head branch was pushed to by a user without write access

@jelias2 jelias2 force-pushed the jelias2/kona-remove-dial-on-error branch from 9d50522 to ddff23f on December 18, 2025 15:56
@theochap theochap enabled auto-merge December 18, 2025 16:24
@theochap theochap added this pull request to the merge queue Dec 18, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 18, 2025
@theochap theochap added this pull request to the merge queue Dec 18, 2025
Merged via the queue into op-rs:main with commit 3d78173 Dec 18, 2025
46 of 51 checks passed
theochap pushed a commit to ethereum-optimism/optimism that referenced this pull request Jan 15, 2026
