Pull request overview
This PR fixes a critical bug in the connection management system where failed dial attempts were not properly cleaned up, causing peers to become permanently stuck in "dialing" state and preventing reconnection attempts. The fix adds the missing remove_dial() call in the OutgoingConnectionError event handler, allowing failed peers to be retried according to the peer_redialing configuration.
Additionally, the PR refactors DNS resolution by removing custom DNS resolution logic in favor of libp2p's built-in DNS support via the dns feature and .with_dns() transport layer, simplifying the codebase and delegating DNS handling to the library.
Key Changes
- Fixed stuck dial attempts by cleaning up the current_dials HashSet when outgoing connections fail
- Replaced custom DNS resolution with libp2p's built-in DNS transport layer
- Removed manual DNS resolution logic and associated tests that are now redundant
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| crates/node/gossip/src/driver.rs | Added remove_dial() call in OutgoingConnectionError handler to fix the bug preventing peer retry attempts |
| crates/node/gossip/src/gater.rs | Removed custom try_resolve_dns() method and simplified can_dial() to skip manual DNS resolution, delegating this to libp2p |
| crates/node/gossip/src/builder.rs | Added .with_dns() to SwarmBuilder to enable libp2p's DNS transport layer |
| crates/node/gossip/Cargo.toml | Added "dns" feature to libp2p dependency to support DNS multiaddrs |
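As a sketch of the Cargo.toml change (the version number and sibling features below are illustrative assumptions, not taken from the actual file):

```toml
# crates/node/gossip/Cargo.toml (illustrative; version and other features are assumptions)
[dependencies]
libp2p = { version = "0.53", features = ["dns", "tcp", "noise", "yamux", "tokio"] }
```

With the `dns` feature enabled, libp2p's `SwarmBuilder` exposes `.with_dns()`, which wraps the underlying transport so `/dns4`, `/dns6`, and `/dnsaddr` multiaddrs are resolved by the library itself.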
theochap
left a comment
Looks good overall, although I am wondering whether it is right to remove the ability to block IPs/subnets if we're using DNS addresses. Maybe a better solution would be either to add support for blocking DNS names or (probably the better solution) to just translate the DNS name into an IP in kona using into_socket_addrs
Head branch was pushed to by a user without write access (force-pushed 9d50522 to ddff23f).
# Description
1. Add the DNS module to SwarmBuilder to support DNS-based multiaddresses at the transport layer.
2. Fix a bug in DialConnections where an "unable to dial" error made it impossible to ever re-dial a peer, since the peer was never removed from the dial list.
Previously the Swarm did not have DNS capabilities, which resulted in the following error. After this change, peering works with DNS-based multiaddresses:
```
2025-12-10T18:49:47.003184Z DEBUG gossip: Outgoing connection error: Transport([(/dns4/kona-net-0-kona-reth-f-sequencer-2-p2p.primary.infra.dev.oplabs.cloud/tcp/9003/p2p/16Uiu2HAm3e6LBYw9JK5rcyE5rCANd2ZF5i53qAoCsaEbpvJgR6Uu, MultiaddrNotSupported(/dns4/kona-net-0-kona-reth-f-sequencer-2-p2p.primary.infra.dev.oplabs.cloud/tcp/9003/p2p/16Uiu2HAm3e6LBYw9JK5rcyE5rCANd2ZF5i53qAoCsaEbpvJgR6Uu))])
```
# Bug Summary: Stuck Dial Attempts Preventing Peer Connections
### Symptoms
- Kona nodes only connecting to 3 out of 8 peers in the network
- Discovery successfully finding 145 peers in the routing table
- Prometheus metrics showing 842+ kona_node_dial_peer_error{type="already_dialing"} errors and growing
### Root Cause
Missing cleanup in the SwarmEvent::OutgoingConnectionError handler. When a dial attempt fails asynchronously (network timeout, connection refused, DNS resolution failure, etc.), the peer ID was never removed from the current_dials HashSet in the connection gater. This caused the peer to be permanently stuck in the "dialing" state.
### Code Flow
1. Node discovers peers via discv5 (145 peers found)
2. Node attempts to dial discovered peers
3. Dial attempt starts → peer added to current_dials HashSet
4. If dial succeeds → ConnectionEstablished event → peer stays in current_dials (OK, protected by "already connected" check)
5. If dial fails → OutgoingConnectionError event → BUG: peer NOT removed from current_dials
6. Node tries to redial the failed peer later
7. can_dial() check fails with DialError::AlreadyDialing because the peer is still in current_dials

The peer can never be retried, despite the peer_redialing: 500 configuration.
### Before Fix
- Only 3/8 peers connected (37.5% connectivity)
- Failed peers blacklisted forever after the first attempt
- Network partition risk in production
- Peer redial configuration (peer_redialing: 500) effectively useless

### After Fix
- Failed dial attempts can be retried according to the peer_redialing config
- Should achieve full mesh connectivity (7/7 peers, excluding self)
- Proper network resilience against transient failures
### Discovery Process
- Started investigating the PMS dashboard showing 6 peers vs RPC showing 3
- Found PMS was exporting duplicate metric series (unrelated issue, fixed with max instead of sum)
- Confirmed the node was actually only connected to 3 peers via RPC (opp2p_peers, opp2p_peerStats)
- Discovered discv5 was working (145 peers in the table) but gossip connections were failing
- Examined dial error metrics and found 842 "already_dialing" errors
- Traced through the connection gater and gossip driver code
- Identified the missing cleanup in the OutgoingConnectionError event handler
### Testing Recommendations
- Monitor kona_node_dial_peer_error{type="already_dialing"} - it should stop increasing
- Monitor kona_node_swarm_peer_count - it should increase from 3 towards 7
- Check the opp2p_peerStats RPC after 5-10 minutes - it should show 7 connected peers
- Verify the PMS dashboard shows correct peer counts with the updated query