fix: close stale relay connection on ErrConnAlreadyExists to recover … #5866
fpenezic wants to merge 6 commits into netbirdio:main
Conversation
📝 Walkthrough

On WG handshake or ICE timeout, relay entries can be marked stale and evicted, and the ICE worker can rotate the local session ID. New relay client/manager APIs allow closing a relay connection by peer key, and WorkerRelay coordinates eviction-and-retry on duplicate-open collisions.
Sequence Diagram(s)

sequenceDiagram
participant Conn
participant WorkerRelay
participant RelayClient
participant WorkerICE
Conn->>WorkerRelay: onWGDisconnected (priority==Relay) / MarkStale()
WorkerRelay->>RelayClient: CloseConnByPeerKey(peerKey)
RelayClient-->>WorkerRelay: confirm closed
WorkerRelay->>WorkerICE: ResetSessionID() (if exists)
WorkerICE-->>WorkerRelay: new sessionID, remoteSessionID cleared
WorkerRelay->>Conn: continue disconnect handling / retry OpenConn if needed
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.
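A minimal sketch of the routing the review asks for, with simplified stand-in types (the real `Manager`, `RelayTrack`, and client signatures in `shared/relay/client` differ; the `srv` parameter and field names here are assumptions for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for the real types in shared/relay/client.
type relayClient struct{ conns map[string]bool }

// CloseConnByPeerKey removes the peer's entry if present and reports success.
func (c *relayClient) CloseConnByPeerKey(peerKey string) bool {
	if c.conns[peerKey] {
		delete(c.conns, peerKey)
		return true
	}
	return false
}

type track struct{ client *relayClient }

type Manager struct {
	mu           sync.Mutex
	relayClient  *relayClient      // client for the home server
	relayClients map[string]*track // foreign-server tracks, keyed by server address
}

// CloseConnByPeerKey prefers the relay client that owns the entry for srv,
// falling back to the home client when no foreign track exists.
func (m *Manager) CloseConnByPeerKey(srv, peerKey string) bool {
	m.mu.Lock()
	rt, ok := m.relayClients[srv]
	m.mu.Unlock()
	if ok && rt.client != nil {
		return rt.client.CloseConnByPeerKey(peerKey)
	}
	if m.relayClient != nil {
		return m.relayClient.CloseConnByPeerKey(peerKey)
	}
	return false
}

func main() {
	foreign := &relayClient{conns: map[string]bool{"peerA": true}}
	m := &Manager{
		relayClient:  &relayClient{conns: map[string]bool{}},
		relayClients: map[string]*track{"rel.example.com:443": {client: foreign}},
	}
	fmt.Println(m.CloseConnByPeerKey("rel.example.com:443", "peerA")) // closes the foreign entry
	fmt.Println(m.CloseConnByPeerKey("rel.example.com:443", "peerA")) // already removed
}
```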
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dcc48e08-b6cd-48da-ad60-56304d5f8378
📒 Files selected for processing (5)
- client/internal/peer/conn.go
- client/internal/peer/worker_ice.go
- client/internal/peer/worker_relay.go
- shared/relay/client/client.go
- shared/relay/client/manager.go
Force-pushed a9783e0 to 7d55ac2
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/peer/worker_relay.go`:
- Around line 66-75: The code currently treats relayClient.ErrConnAlreadyExists
as proof the existing relay is stale and unconditionally calls
w.relayManager.CloseConnByPeerKey; instead, check the existing connection's
health/state before closing or prevent concurrent OnNewOffer for the same peer.
Modify the branch handling relayClient.ErrConnAlreadyExists to first query the
existing connection via w.relayManager (e.g., a method like GetConnByPeerKey or
HasActiveConn) and only call w.relayManager.CloseConnByPeerKey when that
connection reports closed/stale (or fail a liveness check); otherwise reuse the
active connection or bail out/serialize the new offer (e.g., with a per-peer
lock) rather than tearing down a healthy relay, keeping the subsequent OpenConn
call only when a close was actually performed.
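One way to serialize offers per peer, as the comment suggests, is a per-peer lock map. This is a self-contained sketch with hypothetical names, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// offerGate hands out one mutex per peer key so concurrent OnNewOffer calls
// for the same peer are serialized instead of racing a close-and-reopen.
type offerGate struct {
	mu    sync.Mutex
	peers map[string]*sync.Mutex
}

// lockPeer returns the (lazily created) mutex for a peer key.
func (g *offerGate) lockPeer(key string) *sync.Mutex {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.peers == nil {
		g.peers = make(map[string]*sync.Mutex)
	}
	m, ok := g.peers[key]
	if !ok {
		m = &sync.Mutex{}
		g.peers[key] = m
	}
	return m
}

func main() {
	g := &offerGate{}
	opens := 0
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // ten concurrent offers for the same peer
		wg.Add(1)
		go func() {
			defer wg.Done()
			l := g.lockPeer("peerA")
			l.Lock()
			defer l.Unlock()
			if opens == 0 { // only the first offer opens a conn; the rest reuse it
				opens++
			}
		}()
	}
	wg.Wait()
	fmt.Println(opens)
}
```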
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 271f46a4-0e60-4871-8a37-51c6772eb90e
📒 Files selected for processing (5)
- client/internal/peer/conn.go
- client/internal/peer/worker_ice.go
- client/internal/peer/worker_relay.go
- shared/relay/client/client.go
- shared/relay/client/manager.go
🚧 Files skipped from review as they are similar to previous changes (3)
- client/internal/peer/conn.go
- shared/relay/client/manager.go
- shared/relay/client/client.go
Converting to draft - I discovered a regression during extended testing.

Bug: the close-and-retry path loops when the remote peer sends rapid successive offers.

Root cause: my earlier reasoning that ErrConnAlreadyExists always indicates a stale connection does not hold.

Plan: add a safeguard (e.g. a staleness flag) so the close-and-retry path only runs when the existing connection is actually stale.
Will push a follow-up commit once I have verified the fix in the same reproduction environment (PPPoE reconnect + IP rotation). |
…tunnel after NAT IP change
The previous fix unconditionally closed and reopened the existing relay conn whenever OnNewOffer hit ErrConnAlreadyExists. That caused an infinite tear-down/rebuild loop when the remote peer sent rapid successive offers (e.g. during reconnection): close → reopen → remote sees teardown → sends new offer → ErrConnAlreadyExists → close → loop.

Introduce a relayConnStale atomic flag that is set only when an event indicates the existing entry is no longer backed by a live peer session: local WG handshake timeout, relay server close, or explicit CloseConn. OnNewOffer only tears down and reopens when the flag is set; otherwise it reuses the existing healthy connection.

Signal sources for the flag:
- conn.onWGDisconnected (Relay path) → MarkStale before CloseConn
- WorkerRelay.CloseConn → marks stale on close
- WorkerRelay.onRelayClientDisconnected → marks stale on server close
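The flag described in the commit message can be sketched as follows. The names mirror the commit (a `relayConnStale` atomic flag, `MarkStale`), but `shouldEvict` and its CompareAndSwap consume semantics are assumptions about how a single-shot retry gate is typically built:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Minimal sketch of the relayConnStale gate. MarkStale is called from the
// listed signal sources; the duplicate-open path consumes the flag with
// CompareAndSwap so one stale event allows exactly one tear-down-and-retry.
type workerRelay struct{ relayConnStale atomic.Bool }

// MarkStale records that the existing relay entry is no longer backed by a
// live peer session (WG timeout, relay server close, explicit CloseConn).
func (w *workerRelay) MarkStale() { w.relayConnStale.Store(true) }

// shouldEvict reports whether ErrConnAlreadyExists may tear down the existing
// conn; clearing the flag means rapid follow-up offers reuse the conn.
func (w *workerRelay) shouldEvict() bool {
	return w.relayConnStale.CompareAndSwap(true, false)
}

func main() {
	w := &workerRelay{}
	fmt.Println(w.shouldEvict()) // false: conn presumed healthy, reuse it
	w.MarkStale()                // e.g. WG handshake timeout on the relay path
	fmt.Println(w.shouldEvict()) // true: tear down and reopen once
	fmt.Println(w.shouldEvict()) // false: no tear-down loop on rapid offers
}
```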
When WireGuard handshake times out while running over ICE, onWGDisconnected calls workerICE.Close(). Close() sets w.agent=nil synchronously before pion's ICE library fires the asynchronous ConnectionStateClosed event. By the time onConnectionStateChange runs closeAgent(), the `w.agent == agent` guard fails (w.agent is already nil) and the session ID is not rotated.

Without session ID rotation, the next ICE offer carries the same local session ID. The remote peer in OnNewOffer compares remoteSessionID against the incoming offer's sessionID, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state.

Observed in production: 30s reconnect loop after a NAT/conntrack event with the same offer session ID logged ~20+ times in a row, WG handshake never recovering despite "ICE connection succeeded" on every cycle.

Mirror the existing Relay-path behavior: explicitly rotate the session ID before closing the ICE worker so the remote peer always recreates its agent after a WG timeout on ICE.
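A toy model of the guard failure described above. The real types live in client/internal/peer/worker_ice.go and the race involves pion's asynchronous callbacks; this collapses it into a deterministic sequence with illustrative names:

```go
package main

import "fmt"

// Deterministic toy model: Close() nils the agent before the async close
// callback runs, so the `w.agent == agent` guard inside the callback can
// never pass and the session ID stays unrotated.
type iceAgent struct{}

type workerICE struct {
	agent     *iceAgent
	sessionID int
}

func (w *workerICE) rotateSessionID() { w.sessionID++ }

// closeAgent mirrors the callback path: rotate only if the agent still matches.
func (w *workerICE) closeAgent(a *iceAgent) {
	if w.agent == a { // fails: Close() already set w.agent = nil
		w.rotateSessionID()
	}
}

func main() {
	w := &workerICE{agent: &iceAgent{}, sessionID: 100}
	a := w.agent

	w.agent = nil   // Close() runs synchronously first
	w.closeAgent(a) // async ConnectionStateClosed arrives afterwards
	fmt.Println(w.sessionID) // still 100: remote peer will reuse its agent

	// The fix: rotate explicitly in onWGDisconnected before closing the worker.
	w.rotateSessionID()
	fmt.Println(w.sessionID) // 101: next offer forces agent recreation
}
```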
Force-pushed b0b9b85 to c0b3e4e
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 168-172: The code reads rt.relayClient without holding the
RelayTrack lock; fix by grabbing rt.RLock() when accessing the field, copy the
pointer to a local (e.g. rc := rt.relayClient) while holding the RLock, then
RUnlock and call rc.CloseConnByPeerKey(peerKey) only if rc != nil; this protects
RelayTrack.relayClient (used/initialized under rt.Lock() in openConnVia()) while
avoiding holding the track lock during the CloseConnByPeerKey call.
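The locking pattern the comment asks for, in a self-contained sketch with simplified stand-ins for `RelayTrack` and the relay client (the real types carry more state):

```go
package main

import (
	"fmt"
	"sync"
)

// Pattern: copy the pointer under RLock, release the track lock, then call
// the potentially blocking method on the copy.
type relayClient struct{ closed []string }

// CloseConnByPeerKey stands in for a call that may block on the network.
func (c *relayClient) CloseConnByPeerKey(peerKey string) {
	c.closed = append(c.closed, peerKey)
}

type RelayTrack struct {
	sync.RWMutex
	relayClient *relayClient // initialized under Lock() in openConnVia()
}

// closeOnTrack protects the field read without holding the track lock
// across the close call; returns false while the track is still populating.
func closeOnTrack(rt *RelayTrack, peerKey string) bool {
	rt.RLock()
	rc := rt.relayClient // copy while protected
	rt.RUnlock()
	if rc == nil {
		return false
	}
	rc.CloseConnByPeerKey(peerKey) // no track lock held here
	return true
}

func main() {
	rt := &RelayTrack{}
	fmt.Println(closeOnTrack(rt, "peerA")) // false: client not set yet

	rt.Lock()
	rt.relayClient = &relayClient{}
	rt.Unlock()
	fmt.Println(closeOnTrack(rt, "peerA")) // true: closed on the copy
}
```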
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4b1aed8a-feb3-4803-b3cf-18c9a265d315
📒 Files selected for processing (5)
- client/internal/peer/conn.go
- client/internal/peer/worker_ice.go
- client/internal/peer/worker_relay.go
- shared/relay/client/client.go
- shared/relay/client/manager.go
…eerKey

CloseConnByPeerKey was reading rt.relayClient without holding rt.RLock(), but openConnVia() initializes that field under rt.Lock(). The race could skip the stale-entry close while the track was still being populated, defeating the foreign-server cleanup path.

Take rt.RLock() to copy the pointer to a local, release the track lock, then call CloseConnByPeerKey on the copy - this protects the field without holding the track lock across a potentially blocking network call.

Addresses CodeRabbit review on PR netbirdio#5866.



Summary
Fix peer reconnection loops after a network event (PPPoE reconnect, NAT/conntrack flush, IP rotation) leaves transport state stale on either the relay or ICE path. Both manifest as the WireGuard tunnel going silent while the client UI reports `Connected`, with the connection never recovering on its own.

This PR addresses two distinct but related failure modes observed in production on a peer behind PPPoE NAT (Raspberry Pi) talking to a peer with a public IP (Oracle Cloud).
Problem 1 - Relay path: stale `ErrConnAlreadyExists` reuse

When the peer is running over relay and a network event invalidates the existing relay session, `WorkerRelay.OnNewOffer` calls `relayManager.OpenConn`, which returns `ErrConnAlreadyExists` because the relay client still holds a map entry for the peer key. The previous behavior was to silently bail out and reuse that entry - which is dead.

Fix (commits 1–4):
- On `ErrConnAlreadyExists`, close the existing relay conn and reopen it.
- Route `CloseConnByPeerKey` to the correct relay client (home vs. foreign server) - addresses CodeRabbit feedback.
- Gate the teardown behind a `relayConnStale` `atomic.Bool` flag so we only tear down when something has signaled that the entry is no longer backed by a live peer session. Without this gate, rapid successive offers from the remote peer cause an infinite tear-down/rebuild loop (observed: 3608 cycles in ~1 hour, ~9 Mbit/s of constant traffic).
- Signal sources for the flag: `conn.onWGDisconnected` (Relay path), `WorkerRelay.CloseConn`, `WorkerRelay.onRelayClientDisconnected`.

Problem 2 - ICE path: session ID not rotated on WG timeout
When the WireGuard handshake times out while running over ICE, `onWGDisconnected` calls `workerICE.Close()`. `Close()` sets `w.agent = nil` synchronously before pion's ICE library fires the asynchronous `ConnectionStateClosed` event. By the time `onConnectionStateChange` runs `closeAgent()`, the `w.agent == agent` guard fails (`w.agent` is already `nil`) and the session ID is not rotated.

Without rotation, the next ICE offer carries the same local session ID. The remote peer in `OnNewOffer` compares `remoteSessionID` against the incoming offer's `SessionID`, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state.

Observed in production: a 30s reconnect loop with the same offer session ID logged 70+ times in a row, "ICE connection succeeded" on every cycle, and the WG handshake never recovering. The UI shows `Connected, P2P` while the last successful WG handshake is 40+ minutes old.

Fix (commit 5): rotate the session ID explicitly in `onWGDisconnected` for the ICE case (mirroring the existing Relay-path behavior added in commit 1) so the remote peer always recreates its ICE agent after a WG timeout on ICE.

Test plan
- Builds cleanly (`go build ./...`)
- `ICE → Relay → ICE` transition (~4 seconds), no loop, WG handshake refreshed normally on the 3-minute cadence afterwards. Stable for 4+ hours post-event.

Caveat: unrelated Rosenpass interaction
During testing we hit a separate failure mode where Rosenpass (post-quantum PSK) was enabled in permissive mode on both peers. After an IP change, the WG tunnel could not recover because the local PSK state diverged from the remote peer's PSK (one side had a Rosenpass-managed PSK, the other had none). This produced the same external symptom (silent tunnel, UI says `Connected`) but is not addressed by this PR - the fixes here only cover the relay/ICE transport state. With Rosenpass disabled, both fixes recovered the tunnel cleanly across multiple PPPoE cycles. Filing the Rosenpass interaction as a separate issue.

Notes
The two problems share the same trigger (a network event invalidates per-peer state without invalidating the local control plane) but live in different transports, so the fixes are orthogonal. Each is necessary on its own.

Both fixes are strictly defensive - they can only fire on `ErrConnAlreadyExists` (relay) or after a WG handshake timeout (ICE), so there's no behavior change on the happy path.

Summary by CodeRabbit
Bug Fixes