fix: close stale relay connection on ErrConnAlreadyExists to recover … #5866

Open

fpenezic wants to merge 6 commits into netbirdio:main from fpenezic:fix/stale-relay-conn-recovery

Conversation

@fpenezic fpenezic commented Apr 12, 2026

Summary

Fix peer reconnection loops after a network event (PPPoE reconnect, NAT/conntrack flush, IP rotation) leaves transport state stale on either the relay or ICE path. Both manifest as the WireGuard tunnel going silent while the client UI reports Connected, with the connection never recovering on its own.

This PR addresses two distinct but related failure modes observed in production on a peer behind PPPoE NAT (Raspberry Pi) talking to a peer with a public IP (Oracle Cloud).

Problem 1 - Relay path: stale ErrConnAlreadyExists reuse

When the peer is running over relay and a network event invalidates the existing relay session, WorkerRelay.OnNewOffer calls relayManager.OpenConn, which returns ErrConnAlreadyExists because the relay client still holds a map entry for the peer key. The previous behavior was to silently bail out and reuse that entry, even though the entry is dead.

Fix (commits 1–4):

  • On ErrConnAlreadyExists, close the existing relay conn and reopen it.
  • Route CloseConnByPeerKey to the correct relay client (home vs. foreign server) - addresses CodeRabbit feedback.
  • Gate the close-and-retry behind a relayConnStale atomic.Bool flag so we only tear down when something has signaled that the entry is no longer backed by a live peer session. Without this gate, rapid successive offers from the remote peer cause an infinite tear-down/rebuild loop (observed: 3608 cycles in ~1 hour, ~9 Mbit/s of constant traffic).
  • Signal sources for the stale flag: conn.onWGDisconnected (Relay path), WorkerRelay.CloseConn, WorkerRelay.onRelayClientDisconnected.
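The gate can be sketched roughly as follows. The type layout and method signatures here are assumptions for illustration (the real WorkerRelay and relay manager carry much more state), but the control flow mirrors the bullets above: reuse on a plain ErrConnAlreadyExists, close-and-retry only when the stale flag was set.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errConnAlreadyExists = errors.New("connection already exists")

// relayManager is a stand-in for the real relay manager: OpenConn fails
// with errConnAlreadyExists while an entry for the peer key is present.
type relayManager struct {
	conns map[string]bool
}

func (m *relayManager) OpenConn(peerKey string) error {
	if m.conns[peerKey] {
		return errConnAlreadyExists
	}
	m.conns[peerKey] = true
	return nil
}

func (m *relayManager) CloseConnByPeerKey(peerKey string) {
	delete(m.conns, peerKey)
}

// workerRelay sketches the gate: OnNewOffer only tears down the existing
// entry when a disconnect signal has marked it stale.
type workerRelay struct {
	mgr            *relayManager
	peerKey        string
	relayConnStale atomic.Bool
}

// MarkStale is called from the signal sources listed above.
func (w *workerRelay) MarkStale() { w.relayConnStale.Store(true) }

func (w *workerRelay) OnNewOffer() (reused bool, err error) {
	err = w.mgr.OpenConn(w.peerKey)
	if !errors.Is(err, errConnAlreadyExists) {
		return false, err
	}
	// Only close-and-retry when the entry was flagged stale; otherwise
	// reuse the existing (healthy) connection and avoid the teardown loop.
	if !w.relayConnStale.CompareAndSwap(true, false) {
		return true, nil
	}
	w.mgr.CloseConnByPeerKey(w.peerKey)
	return false, w.mgr.OpenConn(w.peerKey)
}

func main() {
	w := &workerRelay{mgr: &relayManager{conns: map[string]bool{"peerA": true}}, peerKey: "peerA"}

	// Rapid successive offer with no stale signal: existing entry is reused.
	reused, err := w.OnNewOffer()
	fmt.Println(reused, err) // true <nil>

	// After a WG handshake timeout marks the entry stale, it is rebuilt once.
	w.MarkStale()
	reused, err = w.OnNewOffer()
	fmt.Println(reused, err) // false <nil>
}
```

CompareAndSwap consumes the flag atomically, so even overlapping offers can trigger at most one teardown per stale signal.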

Problem 2 - ICE path: session ID not rotated on WG timeout

When the WireGuard handshake times out while running over ICE, onWGDisconnected calls workerICE.Close(). Close() sets w.agent = nil synchronously before pion's ICE library fires the asynchronous ConnectionStateClosed event. By the time onConnectionStateChange runs closeAgent(), the w.agent == agent guard fails (w.agent is already nil) and the session ID is not rotated.

Without rotation, the next ICE offer carries the same local session ID. The remote peer in OnNewOffer compares remoteSessionID against the incoming offer's SessionID, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state.

Observed in production: 30s reconnect loop with the same offer session ID logged 70+ times in a row, "ICE connection succeeded" on every cycle, WG handshake never recovering. UI shows Connected, P2P while last successful WG handshake is 40+ minutes old.

Fix (commit 5): Rotate the session ID explicitly in onWGDisconnected for the ICE case (mirroring the existing Relay-path behavior we added in commit 1) so the remote peer always recreates its ICE agent after a WG timeout on ICE.
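A minimal model of the session-ID handshake described above. The names (ResetSessionID, remoteSessionID, OnNewOffer) follow the PR text, but the bookkeeping is deliberately simplified: a real implementation would rotate to a random ID rather than incrementing a counter.

```go
package main

import "fmt"

// iceWorker models only the session-ID bookkeeping on each side.
type iceWorker struct {
	localSessionID  uint64
	remoteSessionID uint64
}

// ResetSessionID rotates the local session ID so the next offer looks new
// to the remote peer, and clears the remembered remote ID so an incoming
// offer is never matched against stale state.
func (w *iceWorker) ResetSessionID() {
	w.localSessionID++ // illustrative; a real implementation uses a random ID
	w.remoteSessionID = 0
}

// OnNewOffer models the remote side: the ICE agent is recreated only when
// the offer carries a session ID different from the one already recorded.
func (w *iceWorker) OnNewOffer(offerSessionID uint64) (recreateAgent bool) {
	if offerSessionID == w.remoteSessionID {
		return false // same ID: stale agent and candidates reused (the bug)
	}
	w.remoteSessionID = offerSessionID
	return true
}

func main() {
	local := &iceWorker{localSessionID: 42}
	remote := &iceWorker{remoteSessionID: 42}

	// Without rotation the remote peer keeps skipping agent recreation.
	fmt.Println(remote.OnNewOffer(local.localSessionID)) // false

	// After a WG timeout, rotating the local ID forces recreation.
	local.ResetSessionID()
	fmt.Println(remote.OnNewOffer(local.localSessionID)) // true
}
```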

Test plan

  • Local build passes (go build ./...)
  • Deployed on two production peers (RPi behind PPPoE NAT, OCI public IP) for live testing
  • Verified stable P2P connection with WG handshake refreshing on the normal 3-minute cadence (no regression)
  • Verified no infinite tear-down loop on relay-active peer (the bug introduced by the v1 fix and resolved in v2)
  • Validated against a real PPPoE reconnect with public IP rotation: tunnel recovered automatically in a single ICE → Relay → ICE transition (~4 seconds), no loop, WG handshake refreshed normally on the 3-minute cadence afterwards. Stable for 4+ hours post-event.

Caveat: unrelated Rosenpass interaction

During testing we hit a separate failure mode where Rosenpass (post-quantum PSK) was enabled in permissive mode on both peers. After an IP change, the WG tunnel could not recover because the local PSK state diverged from the remote peer's PSK (one side had a Rosenpass-managed PSK, the other had none). This produced the same external symptom (silent tunnel, UI says Connected) but is not addressed by this PR - the fixes here only cover the relay/ICE transport state. With Rosenpass disabled, both fixes recovered the tunnel cleanly across multiple PPPoE cycles. Filing the Rosenpass interaction as a separate issue.

Notes

The two problems share the same trigger (network event invalidates per-peer state without invalidating the local control plane) but live in different transports, so the fixes are orthogonal. Each is necessary on its own:

  • Without the relay fix, peers stuck on relay keep looping through the relay close/reopen cycle.
  • Without the ICE fix, peers that fall back to relay then re-establish ICE get pinned to a stale ICE agent on the remote side.

Both fixes are strictly defensive - they can only fire on ErrConnAlreadyExists (relay) or after a WG handshake timeout (ICE), so there's no behavior change on the happy path.

Summary by CodeRabbit

Bug Fixes

  • Enhanced timeout handling for WireGuard handshakes with improved relay connection cleanup and ICE session reset
  • Improved ICE session management to ensure remote peers properly recreate connection agents
  • Optimized relay connection reuse to prevent unnecessary teardowns during concurrent connection attempts

@CLAassistant

CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

On WG handshake or ICE timeout, relay entries can be marked stale and evicted; ICE worker can rotate the local session ID. New relay client/manager APIs allow closing a relay connection by peer key and WorkerRelay coordinates eviction-and-retry on duplicate-open collisions.

Changes

Cohort / File(s) — Summary

  • Conn + ICE (client/internal/peer/conn.go, client/internal/peer/worker_ice.go): On WireGuard relay disconnects, mark the relay entry stale and call WorkerICE.ResetSessionID() (nil-guarded). Added WorkerICE.ResetSessionID() to rotate the local ICE session ID and clear remoteSessionID.
  • Relay worker (client/internal/peer/worker_relay.go): Track a relayConnStale flag, set it on close/disconnect, add MarkStale(); when OpenConn returns ErrConnAlreadyExists the worker closes the stale entry and retries OpenConn once.
  • Relay client & manager (shared/relay/client/client.go, shared/relay/client/manager.go): Add Client.CloseConnByPeerKey(peerKey) and Manager.CloseConnByPeerKey(serverAddress, peerKey) to evict/close a connection by hashed peerKey and unsubscribe/close its container.

Sequence Diagram(s)

sequenceDiagram
  participant Conn
  participant WorkerRelay
  participant RelayClient
  participant WorkerICE

  Conn->>WorkerRelay: onWGDisconnected (priority==Relay) / MarkStale()
  WorkerRelay->>RelayClient: CloseConnByPeerKey(peerKey)
  RelayClient-->>WorkerRelay: confirm closed
  WorkerRelay->>WorkerICE: ResetSessionID() (if exists)
  WorkerICE-->>WorkerRelay: new sessionID, remoteSessionID cleared
  WorkerRelay->>Conn: continue disconnect handling / retry OpenConn if needed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • pappz
  • mlsmaycon

Poem

🐰 I twitch my whiskers when sessions stale,
I nudge old relays down the trail,
I spin a new ICE name with glee,
Retry the hop, set stale things free,
Hooray — fresh paths for you and me!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check — ❓ Inconclusive: The PR description comprehensively documents both problems, fixes, test validation, and caveats, but the required template checklist and documentation sections are incomplete or unchecked. Resolution: complete the checklist by selecting the issue type (bug fix is most applicable), confirm whether documentation was added or is not needed, and provide any related docs PR URL.

✅ Passed checks (3 passed)

  • Title check — ✅ Passed: The title references fixing stale relay connection recovery on ErrConnAlreadyExists, a primary concern of the PR, though it doesn't capture the ICE session ID rotation fix.
  • Linked Issues check — ✅ Passed: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Skipped because no linked issues were found for this pull request.



Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.
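A minimal sketch of the home-vs-foreign routing this comment asks for. All field names, the map layout, and the string return value are assumptions for illustration, not the actual manager implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// relayClient stands in for the real shared/relay/client Client.
type relayClient struct{ addr string }

func (c *relayClient) CloseConnByPeerKey(peerKey string) string {
	return fmt.Sprintf("closed %s on %s", peerKey, c.addr)
}

// Manager sketches the routing: the home server's client is held directly,
// foreign servers live in a map keyed by server address.
type Manager struct {
	mu           sync.RWMutex
	relayClient  *relayClient            // home server
	relayClients map[string]*relayClient // foreign servers by address
}

// CloseConnByPeerKey routes the close to the client that actually owns the
// peer entry instead of always using the home client.
func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) string {
	m.mu.RLock()
	rc, ok := m.relayClients[serverAddress]
	m.mu.RUnlock()
	if !ok {
		rc = m.relayClient // fall back to the home server
	}
	if rc == nil {
		return "" // nothing to close
	}
	return rc.CloseConnByPeerKey(peerKey)
}

func main() {
	m := &Manager{
		relayClient:  &relayClient{addr: "home.example"},
		relayClients: map[string]*relayClient{"foreign.example": {addr: "foreign.example"}},
	}
	fmt.Println(m.CloseConnByPeerKey("foreign.example", "peerA")) // closed peerA on foreign.example
	fmt.Println(m.CloseConnByPeerKey("home.example", "peerA"))    // closed peerA on home.example
}
```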

📥 Commits

Reviewing files that changed from the base of the PR and between 5259e5d and d2a2ff5.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/manager.go
@pappz pappz self-requested a review April 13, 2026 07:53
@fpenezic fpenezic force-pushed the fix/stale-relay-conn-recovery branch from a9783e0 to 7d55ac2 Compare April 20, 2026 21:19
Contributor

@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/worker_relay.go`:
- Around line 66-75: The code currently treats relayClient.ErrConnAlreadyExists
as proof the existing relay is stale and unconditionally calls
w.relayManager.CloseConnByPeerKey; instead, check the existing connection's
health/state before closing or prevent concurrent OnNewOffer for the same peer.
Modify the branch handling relayClient.ErrConnAlreadyExists to first query the
existing connection via w.relayManager (e.g., a method like GetConnByPeerKey or
HasActiveConn) and only call w.relayManager.CloseConnByPeerKey when that
connection reports closed/stale (or fail a liveness check); otherwise reuse the
active connection or bail out/serialize the new offer (e.g., with a per-peer
lock) rather than tearing down a healthy relay, keeping the subsequent OpenConn
call only when a close was actually performed.

📥 Commits

Reviewing files that changed from the base of the PR and between a9783e0 and 7d55ac2.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • client/internal/peer/conn.go
  • shared/relay/client/manager.go
  • shared/relay/client/client.go

Comment thread client/internal/peer/worker_relay.go
@fpenezic fpenezic marked this pull request as draft April 22, 2026 22:02
@fpenezic
Author

Converting to draft - I discovered a regression during extended testing.

Bug: The ErrConnAlreadyExists retry loop can fire repeatedly (observed 3600+ times over ~1 hour) in multi-peer scenarios. When remote peers send offers in rapid succession (session renewal, ICE state changes, etc. - not only after a disconnect), each offer triggers close + reopen, and the remote side sees the reconnect as another state change and sends a fresh offer, producing an infinite tear-down/rebuild loop. Each iteration allocates a new wgProxy port on 127.0.0.1 while the previous one fails with use of closed network connection, generating significant background traffic (observed 9 Mbit/s constant on the affected peer).

Root cause: my earlier reasoning that OnNewOffer is only invoked after Guard detects a disconnect was incorrect. OnNewOffer also fires on normal offer/answer exchanges, so unconditionally treating ErrConnAlreadyExists as "stale by contract" is wrong - the original CodeRabbit concern about healthy connections being torn down was valid.

Plan: Add a safeguard so the close-and-retry path only runs when the existing connection is actually stale, e.g.:

  • Cooldown: skip close if the existing conn was (re)created within the last N seconds
  • Liveness check via WGWatcher handshake state
  • Per-peer serialization of OnNewOffer so rapid back-to-back offers don't overlap

Will push a follow-up commit once I have verified the fix in the same reproduction environment (PPPoE reconnect + IP rotation).

The previous fix unconditionally closed and reopened the existing relay
conn whenever OnNewOffer hit ErrConnAlreadyExists. That caused an
infinite tear-down/rebuild loop when the remote peer sent rapid
successive offers (e.g. during reconnection): close → reopen → remote
sees teardown → sends new offer → ErrConnAlreadyExists → close → loop.

Introduce a relayConnStale atomic flag that is set only when an event
indicates the existing entry is no longer backed by a live peer
session: local WG handshake timeout, relay server close, or explicit
CloseConn. OnNewOffer only tears down and reopens when the flag is
set; otherwise it reuses the existing healthy connection.

Signal sources for the flag:
- conn.onWGDisconnected (Relay path) → MarkStale before CloseConn
- WorkerRelay.CloseConn → marks stale on close
- WorkerRelay.onRelayClientDisconnected → marks stale on server close
When WireGuard handshake times out while running over ICE,
onWGDisconnected calls workerICE.Close(). Close() sets w.agent=nil
synchronously before pion's ICE library fires the asynchronous
ConnectionStateClosed event. By the time onConnectionStateChange
runs closeAgent(), the `w.agent == agent` guard fails (w.agent is
already nil) and the session ID is not rotated.

Without session ID rotation, the next ICE offer carries the same
local session ID. The remote peer in OnNewOffer compares
remoteSessionID against the incoming offer's sessionID, finds them
equal, and skips agent recreation — reusing the existing agent and
its stale candidates from the broken network state.

Observed in production: 30s reconnect loop after a NAT/conntrack
event with the same offer session ID logged ~20+ times in a row,
WG handshake never recovering despite "ICE connection succeeded"
on every cycle.

Mirror the existing Relay-path behavior: explicitly rotate the
session ID before closing the ICE worker so the remote peer always
recreates its agent after a WG timeout on ICE.
@fpenezic fpenezic force-pushed the fix/stale-relay-conn-recovery branch from b0b9b85 to c0b3e4e Compare April 25, 2026 09:32
@fpenezic fpenezic marked this pull request as ready for review April 28, 2026 06:32
Contributor

@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 168-172: The code reads rt.relayClient without holding the
RelayTrack lock; fix by grabbing rt.RLock() when accessing the field, copy the
pointer to a local (e.g. rc := rt.relayClient) while holding the RLock, then
RUnlock and call rc.CloseConnByPeerKey(peerKey) only if rc != nil; this protects
RelayTrack.relayClient (used/initialized under rt.Lock() in openConnVia()) while
avoiding holding the track lock during the CloseConnByPeerKey call.

📥 Commits

Reviewing files that changed from the base of the PR and between b0b9b85 and c0b3e4e.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/manager.go Outdated
…eerKey

CloseConnByPeerKey was reading rt.relayClient without holding rt.RLock(),
but openConnVia() initializes that field under rt.Lock(). Race could skip
the stale-entry close while the track was still being populated, defeating
the foreign-server cleanup path.

Take rt.RLock() to copy the pointer to a local, release the track lock,
then call CloseConnByPeerKey on the copy — protects the field without
holding the track lock across a potentially blocking network call.

Addresses CodeRabbit review on PR netbirdio#5866.
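The lock discipline this commit describes can be sketched as follows. RelayTrack is reduced here to the one field under discussion, and the method bodies are illustrative stand-ins, not the real client code:

```go
package main

import (
	"fmt"
	"sync"
)

type relayClient struct{ addr string }

// CloseConnByPeerKey stands in for a potentially blocking network call.
func (c *relayClient) CloseConnByPeerKey(peerKey string) string {
	return fmt.Sprintf("closed %s on %s", peerKey, c.addr)
}

// RelayTrack guards its relayClient field with its own lock, mirroring how
// openConnVia() initializes the field under rt.Lock().
type RelayTrack struct {
	sync.RWMutex
	relayClient *relayClient
}

// closeOnTrack copies the pointer under RLock, releases the lock, then calls
// the (blocking) close on the copy: the field is read safely without holding
// the track lock across the network call.
func closeOnTrack(rt *RelayTrack, peerKey string) string {
	rt.RLock()
	rc := rt.relayClient
	rt.RUnlock()
	if rc == nil {
		return "" // track still being populated; nothing to close yet
	}
	return rc.CloseConnByPeerKey(peerKey)
}

func main() {
	rt := &RelayTrack{relayClient: &relayClient{addr: "relay.example"}}
	fmt.Println(closeOnTrack(rt, "peerA")) // closed peerA on relay.example

	// A track whose client is not yet initialized is a safe no-op.
	fmt.Println(closeOnTrack(&RelayTrack{}, "peerA") == "") // true
}
```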