Detect stale ICE handshakes by Silex · Pull Request #5098 · netbirdio/netbird

Silex · 2026-01-13T11:57:07Z

Describe your changes

Add a WireGuard handshake watchdog for ICE connections so stale handshakes trigger a disconnect/reconnect, matching the relay path behavior.

WARNING: this code is untested

Issue ticket number and link

This would close #4769 (and maybe others)

Checklist

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

I added/updated documentation for this change
Documentation is not needed for this change (explain why)

Reason: internal connectivity watchdog behavior only; no user-facing change.

Summary by CodeRabbit

Bug Fixes
- Improved connection state monitoring and disconnection handling during ICE (Interactive Connectivity Establishment) transitions to enhance connection reliability.

coderabbitai · 2026-01-13T11:57:20Z

Walkthrough

Adds a WGWatcher field to the Conn struct that monitors ICE connection state transitions and manages watcher lifecycle across connection state changes (ready, active, disconnected). The watcher is initialized, enabled/disabled at appropriate lifecycle points, and cleaned up during connection closure.

Changes

Cohort / File(s)	Summary
ICE Connection State Monitoring `client/internal/peer/conn.go`	Introduces wgWatcherICE field and orchestrates its lifecycle: initialization in NewConn with dumpState, disabling during ICE setup (onICEConnectionIsReady), enabling after ICE upgrade for transition monitoring, disabling on disconnection, and cleanup in Close. Adds goroutines to manage watcher state during relay upgrade flows.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A watcher is born to keep watch and survey,
Following ICE states throughout the day,
When disconnections creep near with silent stealth,
We're there with a nudge for your network's health,
From ready to active, a lifecycle dance! 🎭

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Detect stale ICE handshakes' directly reflects the main change—adding a watchdog to detect stale ICE handshakes and trigger reconnection.
Description check	✅ Passed	The description includes a clear change summary, issue reference (#4769), checklist completion, and documentation justification, matching the repository template.
Linked Issues check	✅ Passed	The changes implement a WireGuard handshake watchdog for ICE connections [#4769] to detect and recover from stale handshakes, directly addressing the intermittent connectivity loss issue.
Out of Scope Changes check	✅ Passed	All changes in client/internal/peer/conn.go are directly scoped to implementing the ICE watchdog logic required by the linked issue; no unrelated modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

sonarqubecloud · 2026-01-13T11:57:44Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

client/internal/peer/conn.go (2)
427-444: Bug: Duplicate wgProxyRelay.Work() call.

wgProxyRelay.Work() is called twice in this block - once at line 431 and again at line 443. The second call appears to be unintentional and could cause unexpected behavior depending on the proxy's implementation.
🐛 Proposed fix
 		conn.wgWatcherWg.Add(1)
 		go func() {
 			defer conn.wgWatcherWg.Done()
 			conn.workerRelay.EnableWgWatcher(conn.ctx)
 		}()
-		conn.wgProxyRelay.Work()
 		conn.currentConnPriority = conntype.Relay

373-397: Add wgWatcherWg.Wait() before re-enabling the watcher to prevent concurrent watchers.

The TODO comment on line 374 correctly identifies that conn.wgWatcherWg.Wait() should be called before re-enabling the watcher. Without it, the previous watcher goroutine from an earlier Add(1) may still be running when EnableWgWatcher() spawns a new one (lines 393–397), creating a race condition. The Close() method demonstrates the proper pattern with defer conn.wgWatcherWg.Wait() before cleanup, which should be applied here as well.

🧹 Nitpick comments (1)

client/internal/peer/conn.go (1)

99-102: Testing recommendation: The PR description notes this code is untested.

Given the complexity of the state machine transitions (ICE ready → active → disconnected → relay fallback) and the concurrent nature of the watcher goroutines, consider adding test coverage for:

Stale handshake detection triggering disconnect/reconnect

Race conditions between watcher enable/disable cycles

Proper cleanup during Close() while watcher is active

Also applies to: 145-145

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94de656 and 43487ed.

📒 Files selected for processing (1)

client/internal/peer/conn.go

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2026-01-08T16:23:14.146Z

Learnt from: lixmal
Repo: netbirdio/netbird PR: 4792
File: client/iface/bind/ice_bind.go:34-44
Timestamp: 2026-01-08T16:23:14.146Z
Learning: In client/iface/bind/ice_bind.go, the fallback receiver path in CreateReceiverFn (lines 34-44) is intentionally a minimal stub to maintain interface compatibility with the wireguard-go fork. It does not include STUN filtering, activity recording, or UDP mux initialization because IPv6 is not yet supported in the udpmux, and this path primarily handles non-IPv4 PacketConn cases that aren't fully supported yet.

Applied to files:

client/internal/peer/conn.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: Build Cache
GitHub Check: Windows
GitHub Check: release_ui
GitHub Check: FreeBSD Port / Build & Test
GitHub Check: Darwin
GitHub Check: Android / Build
GitHub Check: release
GitHub Check: release_ui_darwin
GitHub Check: JS / Lint
GitHub Check: iOS / Build
GitHub Check: Linux
GitHub Check: Client / Unit
GitHub Check: Client / Unit
GitHub Check: Client / Unit

🔇 Additional comments (4)

client/internal/peer/conn.go (4)

99-102: LGTM!

The new wgWatcherICE field follows the existing architectural pattern where relay has its own watcher. The wgWatcherWg WaitGroup properly coordinates watcher goroutine lifecycle.

218-240: LGTM!

The cleanup sequence is correctly ordered:

Context cancellation signals all goroutines to stop

Watchers are explicitly disabled

Mutex is released before waiting on wgWatcherWg (due to defer LIFO), preventing deadlock since watcher callbacks acquire conn.mu
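
The defer ordering in the last point can be shown with a small Go sketch. The resource type and its methods are hypothetical and only mirror the shape described above; they are not taken from conn.go.

```go
package main

import "sync"

// resource is a hypothetical type used only to illustrate defer ordering.
type resource struct {
	mu     sync.Mutex
	wg     sync.WaitGroup
	events []string
}

// startWorker launches a goroutine that must take r.mu before it can finish,
// much like a watcher callback that locks the connection mutex.
func (r *resource) startWorker(stop <-chan struct{}) {
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		<-stop
		r.mu.Lock()
		r.events = append(r.events, "worker finished")
		r.mu.Unlock()
	}()
}

// shutdown registers wg.Wait first and mu.Unlock second. Deferred calls run
// in reverse order, so the mutex is released before Wait starts blocking and
// the worker can take the lock and exit.
func (r *resource) shutdown(stop chan struct{}) {
	defer r.wg.Wait()
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events = append(r.events, "closing")
	close(stop)
}
```

If the two defers were registered in the opposite order, Wait would run while the mutex is still held and the worker could never acquire it.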

410-419: Verify that DisableWgWatcher is safe to call from within its own callback.

onICEStateDisconnected is passed to EnableWgWatcher as a callback and calls DisableWgWatcher on the same watcher (line 419). If the callback is invoked synchronously within the watcher's goroutine, calling DisableWgWatcher from inside could cause deadlock or panic depending on WGWatcher's implementation.

129-145: No action needed. NewWGWatcher is infallible—it only returns a *WGWatcher pointer without any error return value. The function simply allocates and returns a struct literal, making error handling unnecessary.

Likely an incorrect or invalid review comment.

pappz · 2026-01-13T16:58:03Z

@Silex Why do you think this is necessary? ICE has keep-alives, so it can detect connectivity issues faster than the WireGuard key exchange.

Silex · 2026-01-14T08:49:08Z

In my experience, if ICE does have keep-alives, they do not catch this failure:

foo.bar.baz:
  NetBird IP: 100.70.205.198
  Public key: oWxONSNjVcitGblH6DA5OGDTh6PEIM6m4UwU37CODUE=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.168.106:51820/116.202.18.146:58980
  Relay server address:
  Last connection update: 3 hours, 36 minutes ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/354.8 KiB
  Quantum resistance: false
  Networks: -
  Latency: 51.9013ms

Note the line "Last WireGuard handshake: -", which stays like this for days. This peer is an AXIS camera running the netbird client, but I've seen it happen on Teltonika routers as well. It just stops doing handshakes, and the only solution is to run "netbird down/up" on one of the peers.

And when it happens, it's not with all peers, e.g. A cannot ping C but B can ping C just fine. Restarting netbird on either A or C fixes the problem of A not being able to ping C.

Unfortunately right now I don't have this problem anymore but when it happens again in a few days I'll do netbird debug bundle -S -U as requested in the linked issue.

Problem is I need to wait for the problem to come back and it usually takes around 24h/48h to trigger.

Silex · 2026-01-14T09:07:38Z

Here's a summary of the issue by ChatGPT, but we have yet to confirm that it correctly assessed the problem:

ICE vs WireGuard — Why NetBird Can Lose Handshakes

ICE and WireGuard are separate layers
- ICE handles NAT traversal and connectivity discovery.
- WireGuard handles encrypted data transport.

ICE keep-alives do not validate WireGuard health
- They keep NAT mappings open and confirm STUN reachability.
- They do not confirm that WireGuard packets are flowing or that a handshake is active.

WireGuard has no automatic recovery mechanism
- No per-peer reset
- No forced re-handshake
- No liveness timeout that triggers a reconnect

On Windows, UDP sockets can silently break
- Triggered by sleep, NIC power saving, or NAT rebinding
- Packets are sent but replies never arrive
- WireGuard remains stuck with Last handshake: -

ICE can report “connected” while WireGuard is effectively dead
- ICE sees a valid candidate
- NetBird considers the peer connected
- The data plane is black-holed

Only a full teardown recovers the connection
- netbird down/up or restarting the NetBird service
- This recreates sockets and forces a new WireGuard handshake

Bottom line:
ICE keep-alives do not protect WireGuard from stale or broken UDP paths, and WireGuard cannot recover on its own once this happens.

pappz · 2026-01-15T10:28:57Z

@Silex I just want to figure out the root cause of your issue. I see that a complete reconnection resolves the WireGuard connection, but it only covers up a bug. Basically, we have two conditions: an established ICE connection, and the WireGuard endpoint configured with the correct ICE candidate address. So, as a first step, we need to check the endpoint settings in the debug bundle.

Silex · 2026-01-15T11:15:03Z

Ah the problem is finally back! debug bundle incoming 🥳

Silex · 2026-01-15T11:29:04Z

Upload file key:
c33053488251f90ea4683596101892e50e3e31e8030de3052d05b2b3fe6d2468/232a5f43-a1cf-4e11-a0d6-9b1be0d824b6

Here is more information:

camera.peer.debug:
  NetBird IP: 100.70.205.198
  Public key: oWxONSNjVcitGblH6DA5OGDTh6PEIM6m4UwU37CODUE=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/prflx
  ICE candidate endpoints (Local/Remote): 83.173.249.42:51820/116.202.18.146:45889
  Relay server address:
  Last connection update: 3 hours, 42 minutes ago
  Last WireGuard handshake: 3 hours, 42 minutes ago
  Transfer status (received/sent) 180 B/309.9 KiB
  Quantum resistance: false
  Networks: -
  Latency: 62.3291ms

router.peer.debug:
  NetBird IP: 100.70.221.132
  Public key: +oBb+Eno1lY0y0iXdQruOGQsWNV4HiM02xZIgRpyvnQ=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/prflx
  ICE candidate endpoints (Local/Remote): 83.173.249.42:51820/116.202.18.146:40372
  Relay server address:
  Last connection update: 3 hours, 42 minutes ago
  Last WireGuard handshake: 16 seconds ago
  Transfer status (received/sent) 8.5 GiB/282.2 MiB
  Quantum resistance: false
  Networks: -
  Latency: 46.5375ms

As you can see, the camera's last WireGuard handshake was more than 3 hours ago. Tomorrow it will probably display - and never recover.

The camera has internet through a Teltonika (OpenWrt) router which also runs netbird and zerotier (we are in the process of switching to netbird once stability is there). Both are accessed from an NVR, and the NVR can ping the Teltonika "forever" but the camera is lost after 1-2 days.

Both run a fairly old version of netbird (0.37.1); the NVR runs 0.62.2. The self-hosted server runs 0.48.0. I believe this should not matter much, but I can try to upgrade to recent versions if you think something recent explicitly fixes this kind of problem.

The camera runs in userspace mode like this, because it cannot run as root at all:

export NB_USE_NETSTACK_MODE=true
export NB_ENABLE_NETSTACK_LOCAL_FORWARDING=true

I'm pretty sure I also noticed this behavior with the Teltonika routers, but given that the routers have ping reboot in place, it's harder to notice.

Silex · 2026-01-16T15:09:17Z

@pappz: do you need something else?

I'll soon upgrade everything to latest netbird version if that helps.

pappz · 2026-01-19T10:42:13Z

Thank you for the logs! It’s good to know about the old versions. This could be a hidden backward-compatibility issue, but I can’t find an explanation in the logs. Let’s go deeper.
Could you please retest with these environment variables? This will provide more detailed logs:

PIONS_LOG_DEBUG=all
NB_WG_DEBUG=true

I also started extending the WireGuard watcher to cover ICE connections for another use case. It will achieve similar results, but through a different approach. Instead of instantiating a new watcher, I’m planning to move the existing one out of the Relay worker. I’m hoping to finish this today or within the next couple of days.

pappz · 2026-01-19T17:13:34Z

@Silex here is the PR. If you’d like, you can take an early look and test this version.

Silex · 2026-01-20T07:32:49Z

@pappz thanks. First waiting for the problem to trigger again with PIONS_LOG_DEBUG=all so I can provide another trace...

Silex · 2026-01-22T09:31:11Z

@pappz: ok problem finally happened again:

PS C:\Users\stvs> netbird debug bundle -S -U
Local file:
C:\Windows\SystemTemp\netbird.debug.2019101234.zip
Upload file key:
c33053488251f90ea4683596101892e50e3e31e8030de3052d05b2b3fe6d2468/03221c7f-c077-438c-9023-1df49c7ed2f7

netbird.debug.2019101234.zip

By the way it's not very clear if the "Upload file key" is enough or if I should also upload the zip here like I did.

The camera peer that interests you is 100.70.205.198. The router it's attached to is 100.70.221.132.

pappz · 2026-01-26T08:06:12Z

Unfortunately, the logs did not contain the Pion logs for some reason :/
I will close this PR because the same changes have already been merged elsewhere and achieve similar behavior. Let’s continue the discussion on the issues page.

It’s enough to share the ID of the debug bundle. Based on that, we can download it.

pappz · 2026-01-26T08:07:59Z

Fixed in: #5133

Silex · 2026-01-26T08:58:21Z

Will test 0.64.1 and report

Detect stale ICE handshakes

43487ed

coderabbitai Bot reviewed Jan 13, 2026

pappz closed this Jan 26, 2026

Silex deleted the fix/ice-connection-watchdog branch January 26, 2026 08:08
