fix: close stale relay connection on ErrConnAlreadyExists to recover … #5866

Open

fpenezic wants to merge 6 commits into netbirdio:main from fpenezic:fix/stale-relay-conn-recovery

Conversation

@fpenezic fpenezic commented Apr 12, 2026

Summary

Fix peer reconnection loops after a network event (PPPoE reconnect, NAT/conntrack flush, IP rotation) leaves transport state stale on either the relay or ICE path. Both manifest as the WireGuard tunnel going silent while the client UI reports Connected, with the connection never recovering on its own.

This PR addresses two distinct but related failure modes observed in production on a peer behind PPPoE NAT (Raspberry Pi) talking to a peer with a public IP (Oracle Cloud).

Problem 1 - Relay path: stale ErrConnAlreadyExists reuse

When the peer is running over relay and a network event invalidates the existing relay session, WorkerRelay.OnNewOffer calls relayManager.OpenConn, which returns ErrConnAlreadyExists because the relay client still holds a map entry for the peer key. The previous behavior was to silently bail out and reuse that entry, even though the entry is dead.

Fix (commits 1–4):

  • On ErrConnAlreadyExists, close the existing relay conn and reopen it.
  • Route CloseConnByPeerKey to the correct relay client (home vs. foreign server) - addresses CodeRabbit feedback.
  • Gate the close-and-retry behind a relayConnStale atomic.Bool flag so we only tear down when something has signaled that the entry is no longer backed by a live peer session. Without this gate, rapid successive offers from the remote peer cause an infinite tear-down/rebuild loop (observed: 3608 cycles in ~1 hour, ~9 Mbit/s of constant traffic).
  • Signal sources for the stale flag: conn.onWGDisconnected (Relay path), WorkerRelay.CloseConn, WorkerRelay.onRelayClientDisconnected.
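The gate can be sketched roughly as follows. The type layout and method signatures here are assumptions for illustration (the real WorkerRelay and relay manager carry much more state), but the control flow mirrors the bullets above: reuse on a plain ErrConnAlreadyExists, close-and-retry only when the stale flag was set.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errConnAlreadyExists = errors.New("connection already exists")

// relayManager is a stand-in for the real relay manager: OpenConn fails
// with errConnAlreadyExists while an entry for the peer key is present.
type relayManager struct {
	conns map[string]bool
}

func (m *relayManager) OpenConn(peerKey string) error {
	if m.conns[peerKey] {
		return errConnAlreadyExists
	}
	m.conns[peerKey] = true
	return nil
}

func (m *relayManager) CloseConnByPeerKey(peerKey string) {
	delete(m.conns, peerKey)
}

// workerRelay sketches the gate: OnNewOffer only tears down the existing
// entry when a disconnect signal has marked it stale.
type workerRelay struct {
	mgr            *relayManager
	peerKey        string
	relayConnStale atomic.Bool
}

// MarkStale is called from the signal sources listed above.
func (w *workerRelay) MarkStale() { w.relayConnStale.Store(true) }

func (w *workerRelay) OnNewOffer() (reused bool, err error) {
	err = w.mgr.OpenConn(w.peerKey)
	if !errors.Is(err, errConnAlreadyExists) {
		return false, err
	}
	// Only close-and-retry when the entry was flagged stale; otherwise
	// reuse the existing (healthy) connection and avoid the teardown loop.
	if !w.relayConnStale.CompareAndSwap(true, false) {
		return true, nil
	}
	w.mgr.CloseConnByPeerKey(w.peerKey)
	return false, w.mgr.OpenConn(w.peerKey)
}

func main() {
	w := &workerRelay{mgr: &relayManager{conns: map[string]bool{"peerA": true}}, peerKey: "peerA"}

	// Rapid successive offer with no stale signal: existing entry is reused.
	reused, err := w.OnNewOffer()
	fmt.Println(reused, err) // true <nil>

	// After a WG handshake timeout marks the entry stale, it is rebuilt once.
	w.MarkStale()
	reused, err = w.OnNewOffer()
	fmt.Println(reused, err) // false <nil>
}
```

CompareAndSwap consumes the flag atomically, so even overlapping offers can trigger at most one teardown per stale signal.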

Problem 2 - ICE path: session ID not rotated on WG timeout

When the WireGuard handshake times out while running over ICE, onWGDisconnected calls workerICE.Close(). Close() sets w.agent = nil synchronously before pion's ICE library fires the asynchronous ConnectionStateClosed event. By the time onConnectionStateChange runs closeAgent(), the w.agent == agent guard fails (w.agent is already nil) and the session ID is not rotated.

Without rotation, the next ICE offer carries the same local session ID. The remote peer in OnNewOffer compares remoteSessionID against the incoming offer's SessionID, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state.

Observed in production: 30s reconnect loop with the same offer session ID logged 70+ times in a row, "ICE connection succeeded" on every cycle, WG handshake never recovering. UI shows Connected, P2P while last successful WG handshake is 40+ minutes old.

Fix (commit 5): Rotate the session ID explicitly in onWGDisconnected for the ICE case (mirroring the existing Relay-path behavior we added in commit 1) so the remote peer always recreates its ICE agent after a WG timeout on ICE.
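A minimal model of the session-ID handshake described above. The names (ResetSessionID, remoteSessionID, OnNewOffer) follow the PR text, but the bookkeeping is deliberately simplified: a real implementation would rotate to a random ID rather than incrementing a counter.

```go
package main

import "fmt"

// iceWorker models only the session-ID bookkeeping on each side.
type iceWorker struct {
	localSessionID  uint64
	remoteSessionID uint64
}

// ResetSessionID rotates the local session ID so the next offer looks new
// to the remote peer, and clears the remembered remote ID so an incoming
// offer is never matched against stale state.
func (w *iceWorker) ResetSessionID() {
	w.localSessionID++ // illustrative; a real implementation uses a random ID
	w.remoteSessionID = 0
}

// OnNewOffer models the remote side: the ICE agent is recreated only when
// the offer carries a session ID different from the one already recorded.
func (w *iceWorker) OnNewOffer(offerSessionID uint64) (recreateAgent bool) {
	if offerSessionID == w.remoteSessionID {
		return false // same ID: stale agent and candidates reused (the bug)
	}
	w.remoteSessionID = offerSessionID
	return true
}

func main() {
	local := &iceWorker{localSessionID: 42}
	remote := &iceWorker{remoteSessionID: 42}

	// Without rotation the remote peer keeps skipping agent recreation.
	fmt.Println(remote.OnNewOffer(local.localSessionID)) // false

	// After a WG timeout, rotating the local ID forces recreation.
	local.ResetSessionID()
	fmt.Println(remote.OnNewOffer(local.localSessionID)) // true
}
```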

Test plan

  • Local build passes (go build ./...)
  • Deployed on two production peers (RPi behind PPPoE NAT, OCI public IP) for live testing
  • Verified stable P2P connection with WG handshake refreshing on the normal 3-minute cadence (no regression)
  • Verified no infinite tear-down loop on relay-active peer (the bug introduced by the v1 fix and resolved in v2)
  • Validated against a real PPPoE reconnect with public IP rotation: tunnel recovered automatically in a single ICE → Relay → ICE transition (~4 seconds), no loop, WG handshake refreshed normally on the 3-minute cadence afterwards. Stable for 4+ hours post-event.

Caveat: unrelated Rosenpass interaction

During testing we hit a separate failure mode where Rosenpass (post-quantum PSK) was enabled in permissive mode on both peers. After an IP change, the WG tunnel could not recover because the local PSK state diverged from the remote peer's PSK (one side had a Rosenpass-managed PSK, the other had none). This produced the same external symptom (silent tunnel, UI says Connected) but is not addressed by this PR - the fixes here only cover the relay/ICE transport state. With Rosenpass disabled, both fixes recovered the tunnel cleanly across multiple PPPoE cycles. Filing the Rosenpass interaction as a separate issue.

Notes

The two problems share the same trigger (network event invalidates per-peer state without invalidating the local control plane) but live in different transports, so the fixes are orthogonal. Each is necessary on its own:

  • Without the relay fix, peers stuck on relay keep looping through the relay close/reopen cycle.
  • Without the ICE fix, peers that fall back to relay then re-establish ICE get pinned to a stale ICE agent on the remote side.

Both fixes are strictly defensive - they can only fire on ErrConnAlreadyExists (relay) or after a WG handshake timeout (ICE), so there's no behavior change on the happy path.

Summary by CodeRabbit

Bug Fixes

  • Enhanced timeout handling for WireGuard handshakes with improved relay connection cleanup and ICE session reset
  • Improved ICE session management to ensure remote peers properly recreate connection agents
  • Optimized relay connection reuse to prevent unnecessary teardowns during concurrent connection attempts

@CLAassistant

CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

On WG handshake or ICE timeout, relay entries can be marked stale and evicted; ICE worker can rotate the local session ID. New relay client/manager APIs allow closing a relay connection by peer key and WorkerRelay coordinates eviction-and-retry on duplicate-open collisions.

Changes

Cohort / File(s) — Summary

  • Conn + ICE (client/internal/peer/conn.go, client/internal/peer/worker_ice.go): On WireGuard relay disconnects, mark the relay entry stale and call WorkerICE.ResetSessionID() (nil-guarded). Added WorkerICE.ResetSessionID() to rotate the local ICE session ID and clear remoteSessionID.
  • Relay worker (client/internal/peer/worker_relay.go): Track a relayConnStale flag, set it on close/disconnect, add MarkStale(); when OpenConn returns ErrConnAlreadyExists the worker closes the stale entry and retries OpenConn once.
  • Relay client & manager (shared/relay/client/client.go, shared/relay/client/manager.go): Add Client.CloseConnByPeerKey(peerKey) and Manager.CloseConnByPeerKey(serverAddress, peerKey) to evict/close a connection by hashed peerKey and unsubscribe/close its container.

Sequence Diagram(s)

sequenceDiagram
  participant Conn
  participant WorkerRelay
  participant RelayClient
  participant WorkerICE

  Conn->>WorkerRelay: onWGDisconnected (priority==Relay) / MarkStale()
  WorkerRelay->>RelayClient: CloseConnByPeerKey(peerKey)
  RelayClient-->>WorkerRelay: confirm closed
  WorkerRelay->>WorkerICE: ResetSessionID() (if exists)
  WorkerICE-->>WorkerRelay: new sessionID, remoteSessionID cleared
  WorkerRelay->>Conn: continue disconnect handling / retry OpenConn if needed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • pappz
  • mlsmaycon

Poem

🐰 I twitch my whiskers when sessions stale,
I nudge old relays down the trail,
I spin a new ICE name with glee,
Retry the hop, set stale things free,
Hooray — fresh paths for you and me!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check — ❓ Inconclusive: The PR description comprehensively documents both problems, fixes, test validation, and caveats, but the required template checklist and documentation sections are incomplete or unchecked. Resolution: complete the checklist by selecting the issue type (bug fix is most applicable), confirm whether documentation was added or is not needed, and provide any related docs PR URL.

✅ Passed checks (3 passed)

  • Title check — ✅ Passed: The title references fixing stale relay connection recovery on ErrConnAlreadyExists, a primary concern of the PR, though it doesn't capture the ICE session ID rotation fix.
  • Linked Issues check — ✅ Passed: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Skipped because no linked issues were found for this pull request.



Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.
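A minimal sketch of the home-vs-foreign routing this comment asks for. All field names, the map layout, and the string return value are assumptions for illustration, not the actual manager implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// relayClient stands in for the real shared/relay/client Client.
type relayClient struct{ addr string }

func (c *relayClient) CloseConnByPeerKey(peerKey string) string {
	return fmt.Sprintf("closed %s on %s", peerKey, c.addr)
}

// Manager sketches the routing: the home server's client is held directly,
// foreign servers live in a map keyed by server address.
type Manager struct {
	mu           sync.RWMutex
	relayClient  *relayClient            // home server
	relayClients map[string]*relayClient // foreign servers by address
}

// CloseConnByPeerKey routes the close to the client that actually owns the
// peer entry instead of always using the home client.
func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) string {
	m.mu.RLock()
	rc, ok := m.relayClients[serverAddress]
	m.mu.RUnlock()
	if !ok {
		rc = m.relayClient // fall back to the home server
	}
	if rc == nil {
		return "" // nothing to close
	}
	return rc.CloseConnByPeerKey(peerKey)
}

func main() {
	m := &Manager{
		relayClient:  &relayClient{addr: "home.example"},
		relayClients: map[string]*relayClient{"foreign.example": {addr: "foreign.example"}},
	}
	fmt.Println(m.CloseConnByPeerKey("foreign.example", "peerA")) // closed peerA on foreign.example
	fmt.Println(m.CloseConnByPeerKey("home.example", "peerA"))    // closed peerA on home.example
}
```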

📥 Commits

Reviewing files that changed from the base of the PR and between 5259e5d and d2a2ff5.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/manager.go
@pappz pappz self-requested a review April 13, 2026 07:53
@fpenezic fpenezic force-pushed the fix/stale-relay-conn-recovery branch from a9783e0 to 7d55ac2 Compare April 20, 2026 21:19
Contributor

@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/worker_relay.go`:
- Around line 66-75: The code currently treats relayClient.ErrConnAlreadyExists
as proof the existing relay is stale and unconditionally calls
w.relayManager.CloseConnByPeerKey; instead, check the existing connection's
health/state before closing or prevent concurrent OnNewOffer for the same peer.
Modify the branch handling relayClient.ErrConnAlreadyExists to first query the
existing connection via w.relayManager (e.g., a method like GetConnByPeerKey or
HasActiveConn) and only call w.relayManager.CloseConnByPeerKey when that
connection reports closed/stale (or fail a liveness check); otherwise reuse the
active connection or bail out/serialize the new offer (e.g., with a per-peer
lock) rather than tearing down a healthy relay, keeping the subsequent OpenConn
call only when a close was actually performed.

📥 Commits

Reviewing files that changed from the base of the PR and between a9783e0 and 7d55ac2.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • client/internal/peer/conn.go
  • shared/relay/client/manager.go
  • shared/relay/client/client.go

Comment thread client/internal/peer/worker_relay.go
@fpenezic fpenezic marked this pull request as draft April 22, 2026 22:02
@fpenezic
Author

Converting to draft - I discovered a regression during extended testing.

Bug: The ErrConnAlreadyExists retry loop can fire repeatedly (observed 3600+ times over ~1 hour) in multi-peer scenarios. When remote peers send offers in rapid succession (session renewal, ICE state changes, etc. - not only after a disconnect), each offer triggers close + reopen, and the remote side sees the reconnect as another state change and sends a fresh offer, producing an infinite tear-down/rebuild loop. Each iteration allocates a new wgProxy port on 127.0.0.1 while the previous one fails with use of closed network connection, generating significant background traffic (observed 9 Mbit/s constant on the affected peer).

Root cause: my earlier reasoning that OnNewOffer is only invoked after Guard detects a disconnect was incorrect. OnNewOffer also fires on normal offer/answer exchanges, so unconditionally treating ErrConnAlreadyExists as "stale by contract" is wrong - the original CodeRabbit concern about healthy connections being torn down was valid.

Plan: Add a safeguard so the close-and-retry path only runs when the existing connection is actually stale, e.g.:

  • Cooldown: skip close if the existing conn was (re)created within the last N seconds
  • Liveness check via WGWatcher handshake state
  • Per-peer serialization of OnNewOffer so rapid back-to-back offers don't overlap

Will push a follow-up commit once I have verified the fix in the same reproduction environment (PPPoE reconnect + IP rotation).

The previous fix unconditionally closed and reopened the existing relay
conn whenever OnNewOffer hit ErrConnAlreadyExists. That caused an
infinite tear-down/rebuild loop when the remote peer sent rapid
successive offers (e.g. during reconnection): close → reopen → remote
sees teardown → sends new offer → ErrConnAlreadyExists → close → loop.

Introduce a relayConnStale atomic flag that is set only when an event
indicates the existing entry is no longer backed by a live peer
session: local WG handshake timeout, relay server close, or explicit
CloseConn. OnNewOffer only tears down and reopens when the flag is
set; otherwise it reuses the existing healthy connection.

Signal sources for the flag:
- conn.onWGDisconnected (Relay path) → MarkStale before CloseConn
- WorkerRelay.CloseConn → marks stale on close
- WorkerRelay.onRelayClientDisconnected → marks stale on server close
When WireGuard handshake times out while running over ICE,
onWGDisconnected calls workerICE.Close(). Close() sets w.agent=nil
synchronously before pion's ICE library fires the asynchronous
ConnectionStateClosed event. By the time onConnectionStateChange
runs closeAgent(), the `w.agent == agent` guard fails (w.agent is
already nil) and the session ID is not rotated.

Without session ID rotation, the next ICE offer carries the same
local session ID. The remote peer in OnNewOffer compares
remoteSessionID against the incoming offer's sessionID, finds them
equal, and skips agent recreation — reusing the existing agent and
its stale candidates from the broken network state.

Observed in production: 30s reconnect loop after a NAT/conntrack
event with the same offer session ID logged ~20+ times in a row,
WG handshake never recovering despite "ICE connection succeeded"
on every cycle.

Mirror the existing Relay-path behavior: explicitly rotate the
session ID before closing the ICE worker so the remote peer always
recreates its agent after a WG timeout on ICE.
@fpenezic fpenezic force-pushed the fix/stale-relay-conn-recovery branch from b0b9b85 to c0b3e4e Compare April 25, 2026 09:32
@fpenezic fpenezic marked this pull request as ready for review April 28, 2026 06:32
Contributor

@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 168-172: The code reads rt.relayClient without holding the
RelayTrack lock; fix by grabbing rt.RLock() when accessing the field, copy the
pointer to a local (e.g. rc := rt.relayClient) while holding the RLock, then
RUnlock and call rc.CloseConnByPeerKey(peerKey) only if rc != nil; this protects
RelayTrack.relayClient (used/initialized under rt.Lock() in openConnVia()) while
avoiding holding the track lock during the CloseConnByPeerKey call.

📥 Commits

Reviewing files that changed from the base of the PR and between b0b9b85 and c0b3e4e.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/manager.go Outdated
…eerKey

CloseConnByPeerKey was reading rt.relayClient without holding rt.RLock(),
but openConnVia() initializes that field under rt.Lock(). Race could skip
the stale-entry close while the track was still being populated, defeating
the foreign-server cleanup path.

Take rt.RLock() to copy the pointer to a local, release the track lock,
then call CloseConnByPeerKey on the copy — protects the field without
holding the track lock across a potentially blocking network call.

Addresses CodeRabbit review on PR netbirdio#5866.
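The lock discipline this commit describes can be sketched as follows. RelayTrack is reduced here to the one field under discussion, and the method bodies are illustrative stand-ins, not the real client code:

```go
package main

import (
	"fmt"
	"sync"
)

type relayClient struct{ addr string }

// CloseConnByPeerKey stands in for a potentially blocking network call.
func (c *relayClient) CloseConnByPeerKey(peerKey string) string {
	return fmt.Sprintf("closed %s on %s", peerKey, c.addr)
}

// RelayTrack guards its relayClient field with its own lock, mirroring how
// openConnVia() initializes the field under rt.Lock().
type RelayTrack struct {
	sync.RWMutex
	relayClient *relayClient
}

// closeOnTrack copies the pointer under RLock, releases the lock, then calls
// the (blocking) close on the copy: the field is read safely without holding
// the track lock across the network call.
func closeOnTrack(rt *RelayTrack, peerKey string) string {
	rt.RLock()
	rc := rt.relayClient
	rt.RUnlock()
	if rc == nil {
		return "" // track still being populated; nothing to close yet
	}
	return rc.CloseConnByPeerKey(peerKey)
}

func main() {
	rt := &RelayTrack{relayClient: &relayClient{addr: "relay.example"}}
	fmt.Println(closeOnTrack(rt, "peerA")) // closed peerA on relay.example

	// A track whose client is not yet initialized is a safe no-op.
	fmt.Println(closeOnTrack(&RelayTrack{}, "peerA") == "") // true
}
```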