Detect stale ICE handshakes#5098
Conversation
📝 WalkthroughWalkthroughAdds a WGWatcher field to the Conn struct that monitors ICE connection state transitions and manages watcher lifecycle across connection state changes (ready, active, disconnected). The watcher is initialized, enabled/disabled at appropriate lifecycle points, and cleaned up during connection closure. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
client/internal/peer/conn.go (2)
427-444: Bug: DuplicatewgProxyRelay.Work()call.
wgProxyRelay.Work()is called twice in this block - once at line 431 and again at line 443. The second call appears to be unintentional and could cause unexpected behavior depending on the proxy's implementation.🐛 Proposed fix
conn.wgWatcherWg.Add(1) go func() { defer conn.wgWatcherWg.Done() conn.workerRelay.EnableWgWatcher(conn.ctx) }() - conn.wgProxyRelay.Work() conn.currentConnPriority = conntype.Relay
373-397: AddwgWatcherWg.Wait()before re-enabling the watcher to prevent concurrent watchers.The TODO comment on line 374 correctly identifies that
conn.wgWatcherWg.Wait()should be called before re-enabling the watcher. Without it, the previous watcher goroutine from an earlierAdd(1)may still be running whenEnableWgWatcher()spawns a new one (lines 393–397), creating a race condition. TheClose()method demonstrates the proper pattern withdefer conn.wgWatcherWg.Wait()before cleanup, which should be applied here as well.
🧹 Nitpick comments (1)
client/internal/peer/conn.go (1)
99-102: Testing recommendation: The PR description notes this code is untested.Given the complexity of the state machine transitions (ICE ready → active → disconnected → relay fallback) and the concurrent nature of the watcher goroutines, consider adding test coverage for:
- Stale handshake detection triggering disconnect/reconnect
- Race conditions between watcher enable/disable cycles
- Proper cleanup during
Close()while watcher is activeAlso applies to: 145-145
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
client/internal/peer/conn.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-08T16:23:14.146Z
Learnt from: lixmal
Repo: netbirdio/netbird PR: 4792
File: client/iface/bind/ice_bind.go:34-44
Timestamp: 2026-01-08T16:23:14.146Z
Learning: In client/iface/bind/ice_bind.go, the fallback receiver path in CreateReceiverFn (lines 34-44) is intentionally a minimal stub to maintain interface compatibility with the wireguard-go fork. It does not include STUN filtering, activity recording, or UDP mux initialization because IPv6 is not yet supported in the udpmux, and this path primarily handles non-IPv4 PacketConn cases that aren't fully supported yet.
Applied to files:
client/internal/peer/conn.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: Build Cache
- GitHub Check: Windows
- GitHub Check: release_ui
- GitHub Check: FreeBSD Port / Build & Test
- GitHub Check: Darwin
- GitHub Check: Android / Build
- GitHub Check: release
- GitHub Check: release_ui_darwin
- GitHub Check: JS / Lint
- GitHub Check: iOS / Build
- GitHub Check: Linux
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
🔇 Additional comments (4)
client/internal/peer/conn.go (4)
99-102: LGTM!The new
wgWatcherICEfield follows the existing architectural pattern where relay has its own watcher. ThewgWatcherWgWaitGroup properly coordinates watcher goroutine lifecycle.
218-240: LGTM!The cleanup sequence is correctly ordered:
- Context cancellation signals all goroutines to stop
- Watchers are explicitly disabled
- Mutex is released before waiting on
wgWatcherWg(due to defer LIFO), preventing deadlock since watcher callbacks acquireconn.mu
410-419: Verify thatDisableWgWatcheris safe to call from within its own callback.
onICEStateDisconnectedis passed toEnableWgWatcheras a callback and callsDisableWgWatcheron the same watcher (line 419). If the callback is invoked synchronously within the watcher's goroutine, callingDisableWgWatcherfrom inside could cause deadlock or panic depending on WGWatcher's implementation.
129-145: No action needed.NewWGWatcheris infallible—it only returns a*WGWatcherpointer without any error return value. The function simply allocates and returns a struct literal, making error handling unnecessary.Likely an incorrect or invalid review comment.
|
@Silex Why do you think this is necessary? ICE has keep-alives, so it can detect connectivity issues faster than the WireGuard key exchange. |
|
In my experience if ICE has keep-alives then it does not work: See And when it happens it's not with all peers, e.g A cannot ping C but B can ping C just fine. Restarting netbird on either A or C fixes the problem of A not being able to ping C. Unfortunately right now I don't have this problem anymore but when it happens again in a few days I'll do Problem is I need to wait for the problem to come back and it usually takes around 24h/48h to trigger. |
|
Here's a summary of the issue by ChatGPT, but we have yet to confirm that it correctly assed the problem: ICE vs WireGuard — Why NetBird Can Lose Handshakes
Bottom line: |
|
@Silex I just want to figure out the root cause of your issue. I see that a complete reconnection resolves the WireGuard connection, but it only covers up a bug. Basically, we have two conditions: an established ICE connection, and the WireGuard endpoint configured with the correct ICE candidate address. So, as a first step, we need to check the endpoint settings in the debug bundle. |
|
Ah the problem is finally back! debug bundle incoming 🥳 |
|
Upload file key: Here are more information: As you see the last wireshark handshake was 3 hours ago. Tomorrow it will probably display The camera has internet through a Teltonika (openWRT) router which also runs netbird and zerotier (we are in the process of switching to netbird once stability is there). Both are accessed from a NVR, and the NVR can ping the teltonika "forever" but the camera is lost after 1-2 days. Both run fairly old versions of netbird ( The camera runs in userspace mode like this, because it cannot run as root at all: export NB_USE_NETSTACK_MODE=true
export NB_ENABLE_NETSTACK_LOCAL_FORWARDING=trueI'm pretty sure I also noticed this behavior with the teltonika routers, but given the routers have ping reboot in place it's harder to notice. |
|
@pappz: do you need something else? I'll soon upgrade everything to latest netbird version if that helps. |
|
Thank you for the logs! It’s good to know about the old versions. This could be a hidden backward-compatibility issue, but I can’t find an explanation in the logs. Let’s go deeper. I also started extending the WireGuard watcher to cover ICE connections for another use case. It will achieve similar results, but through a different approach. Instead of instantiating a new watcher, I’m planning to move the existing one out of the Relay worker. I’m hoping to finish this today or within the next couple of days. |
|
@pappz thanks. First waiting for the problem to trigger again with |
|
@pappz: ok problem finally happened again: By the way it's not very clear if the "Upload file key" is enough or if I should also upload the zip here like I did. The camera peer that interests you is 100.70.205.198. The router it's attached to is 100.70.221.132. |
Unfortunately, the logs did not contain the Pion logs for some reason :/ It’s enough to share the ID of the debug bundle. Based on that, we can download it. |
|
Fixed in: #5133 |
|
Will test 0.64.1 and report |



Describe your changes
Add a WireGuard handshake watchdog for ICE connections so stale handshakes trigger a disconnect/reconnect, matching the relay path behavior.
WARNING: this code is untested
Issue ticket number and link
This would close #4769 (and maybe others)
Checklist
Documentation
Select exactly one:
Reason: internal connectivity watchdog behavior only; no user-facing change.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.