Skip to content

[client] Reset WireGuard endpoint on ICE session change during relay fallback#5283

Merged
pappz merged 2 commits intomainfrom
refactor/reset-wg-endpoint
Feb 16, 2026
Merged

[client] Reset WireGuard endpoint on ICE session change during relay fallback#5283
pappz merged 2 commits intomainfrom
refactor/reset-wg-endpoint

Conversation

@pappz
Copy link
Copy Markdown
Collaborator

@pappz pappz commented Feb 9, 2026

When an ICE connection disconnects and falls back to relay, reset the WireGuard endpoint and handshake watcher if the remote peer's ICE session has changed. This ensures the controller re-establishes a fresh WireGuard handshake rather than waiting on a stale endpoint from the previous session.

Describe your changes

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced peer connection resilience during network disconnections and transitions between connection protocols.
    • Improved connection state management with better session change detection and endpoint cleanup.
    • Added reset mechanisms to ensure proper state recovery from connection disruptions.

When an ICE connection disconnects and falls back to relay, reset the
WireGuard endpoint and handshake watcher if the remote peer's ICE session
has changed. This ensures the controller re-establishes a fresh WireGuard
handshake rather than waiting on a stale endpoint from the previous session.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 9, 2026

📝 Walkthrough

Walkthrough

These changes introduce session-aware reset mechanisms for ICE peer connections. When an ICE session changes during disconnection, the system propagates this state through worker_ice, conn, endpoint, and wg_watcher modules to reset the endpoint address and restart handshake timing.

Changes

Cohort / File(s) Summary
ICE Session Change Detection
client/internal/peer/worker_ice.go
Added remoteSessionChanged flag to track session ID changes; modified closeAgent() to return bool indicating session change state and propagate it to callers.
Endpoint Reset Handler
client/internal/peer/conn.go
Updated onICEStateDisconnected() signature to accept sessionChanged bool parameter; added resetEndpoint() method to remove endpoint address and trigger watcher reset when session changes occur.
Endpoint Address Removal
client/internal/peer/endpoint.go
Added public RemoveEndpointAddress() method to EndpointUpdater for explicit endpoint address cleanup.
Watcher Reset Mechanism
client/internal/peer/wg_watcher.go
Introduced buffered resetCh channel and public Reset() method; modified periodic handshake loop to handle reset signals by restarting handshake timeout logic.

Sequence Diagram

sequenceDiagram
    participant WorkerICE
    participant Conn
    participant EndpointUpdater
    participant WgInterface
    participant WGWatcher
    
    WorkerICE->>WorkerICE: Detect session change via new offer
    WorkerICE->>WorkerICE: Set remoteSessionChanged = true
    WorkerICE->>WorkerICE: closeAgent() called
    WorkerICE->>Conn: onICEStateDisconnected(sessionChanged=true)
    Conn->>Conn: resetEndpoint() invoked
    Conn->>EndpointUpdater: RemoveEndpointAddress()
    EndpointUpdater->>WgInterface: RemoveEndpointAddress(RemoteKey)
    WgInterface-->>EndpointUpdater: Confirm removal
    Conn->>WGWatcher: Reset()
    WGWatcher->>WGWatcher: Signal resetCh
    WGWatcher->>WGWatcher: Restart handshake timeout logic
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

Suggested Reviewers

  • lixmal

Poem

🐰 When sessions shift and ICE breaks free,
A watcher resets with glee,
Endpoints cleared, the timeout reborn,
Fresh handshakes greet the morn!
Reset logic hops with care and grace,

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description covers the main objective and rationale, but the 'Describe your changes' section is empty, which is a required section in the template. Fill in the 'Describe your changes' section with a summary of the technical modifications made across the affected files.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately summarizes the main change: resetting the WireGuard endpoint when ICE disconnects during relay fallback due to session changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch refactor/reset-wg-endpoint

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
client/internal/peer/endpoint.go (1)

69-71: RemoveEndpointAddress does not acquire e.mu, unlike sibling methods.

Both ConfigureWGEndpoint and RemoveWgPeer acquire e.mu and cancel any in-flight delayed update before touching WireGuard state. RemoveEndpointAddress skips this, which means a concurrent scheduleDelayedUpdate goroutine could race and re-set the endpoint address immediately after removal.

Even though the current call site (resetEndpoint) is serialized by conn.mu and is immediately followed by ConfigureWGEndpoint (which cancels the delayed update), protecting this method at its own level is safer and consistent.

Proposed fix
 func (e *EndpointUpdater) RemoveEndpointAddress() error {
+	e.mu.Lock()
+	defer e.mu.Unlock()
+
+	e.waitForCloseTheDelayedUpdate()
 	return e.wgConfig.WgInterface.RemoveEndpointAddress(e.wgConfig.RemoteKey)
 }
client/internal/peer/conn.go (1)

437-444: wgProxyRelay.Work() is called twice (lines 437 and 444).

This appears to be a pre-existing duplicate, not introduced by this PR. The second call at line 444 is redundant since line 437 already activates the relay proxy. Consider removing the duplicate in a follow-up.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

defer w.muxAgent.Unlock()

sessionChanged := w.remoteSessionChanged
w.remoteSessionChanged = false
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both old and new agents consume this flag. Intentional?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not get it

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Agent A is alive and ICE connection is established with a WireGuard endpoint configured.
  2. New offer arrives (OnNewOffer):
    - Remote peer has a new session ID (different from current), meaning it wants to reconnect.
    - remoteSessionChanged = true is set, this is the signal that the WG endpoint needs to be reset.
    - Agent A is closed directly and replaced by a new Agent B.
    - w.agent now points to Agent B.
  3. Deprecated Agent A's callback fires (onConnectionStateChange → closeAgent):
    - Agent A transitions to disconnected/failed/closed.
    - closeAgent is called with Agent A.
    - but we must to reset the WireGuard, consume the flag

The problem is we have a race between the Agent A close callback and the Agent B on connected callback. If the Agent B connected faster then disconnect the old one then we have an inconsistent state. This is why we do all the time this double check but it is not enough.

@sonarqubecloud
Copy link
Copy Markdown

@pappz pappz merged commit baed6e4 into main Feb 16, 2026
52 of 55 checks passed
@pappz pappz deleted the refactor/reset-wg-endpoint branch February 16, 2026 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants