Detect stale ICE handshakes#5098

Closed
Silex wants to merge 1 commit into netbirdio:main from Silex:fix/ice-connection-watchdog

Conversation

@Silex
Contributor

@Silex Silex commented Jan 13, 2026

Describe your changes

Add a WireGuard handshake watchdog for ICE connections so stale handshakes trigger a disconnect/reconnect, matching the relay path behavior.

WARNING: this code is untested

Issue ticket number and link

This would close #4769 (and maybe others)

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Reason: internal connectivity watchdog behavior only; no user-facing change.

Summary by CodeRabbit

  • Bug Fixes
    • Improved connection state monitoring and disconnection handling during ICE (Interactive Connectivity Establishment) transitions to enhance connection reliability.

@coderabbitai
Contributor

coderabbitai Bot commented Jan 13, 2026

📝 Walkthrough

Walkthrough

Adds a WGWatcher field to the Conn struct that monitors ICE connection state transitions and manages watcher lifecycle across connection state changes (ready, active, disconnected). The watcher is initialized, enabled/disabled at appropriate lifecycle points, and cleaned up during connection closure.

Changes

| Cohort / File(s) | Summary |
|---|---|
| ICE Connection State Monitoring: client/internal/peer/conn.go | Introduces wgWatcherICE field and orchestrates its lifecycle: initialization in NewConn with dumpState, disabling during ICE setup (onICEConnectionIsReady), enabling after ICE upgrade for transition monitoring, disabling on disconnection, and cleanup in Close. Adds goroutines to manage watcher state during relay upgrade flows. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A watcher is born to keep watch and survey,
Following ICE states throughout the day,
When disconnections creep near with silent stealth,
We're there with a nudge for your network's health,
From ready to active, a lifecycle dance! 🎭

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Detect stale ICE handshakes' directly reflects the main change: adding a watchdog to detect stale ICE handshakes and trigger reconnection. |
| Description check | ✅ Passed | The description includes a clear change summary, issue reference (#4769), checklist completion, and documentation justification, matching the repository template. |
| Linked Issues check | ✅ Passed | The changes implement a WireGuard handshake watchdog for ICE connections [#4769] to detect and recover from stale handshakes, directly addressing the intermittent connectivity loss issue. |
| Out of Scope Changes check | ✅ Passed | All changes in client/internal/peer/conn.go are directly scoped to implementing the ICE watchdog logic required by the linked issue; no unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
client/internal/peer/conn.go (2)

427-444: Bug: Duplicate wgProxyRelay.Work() call.

wgProxyRelay.Work() is called twice in this block - once at line 431 and again at line 443. The second call appears to be unintentional and could cause unexpected behavior depending on the proxy's implementation.

🐛 Proposed fix
 		conn.wgWatcherWg.Add(1)
 		go func() {
 			defer conn.wgWatcherWg.Done()
 			conn.workerRelay.EnableWgWatcher(conn.ctx)
 		}()
-		conn.wgProxyRelay.Work()
 		conn.currentConnPriority = conntype.Relay

373-397: Add wgWatcherWg.Wait() before re-enabling the watcher to prevent concurrent watchers.

The TODO comment on line 374 correctly identifies that conn.wgWatcherWg.Wait() should be called before re-enabling the watcher. Without it, the previous watcher goroutine from an earlier Add(1) may still be running when EnableWgWatcher() spawns a new one (lines 393–397), creating a race condition. The Close() method demonstrates the proper pattern with defer conn.wgWatcherWg.Wait() before cleanup, which should be applied here as well.
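
The "wait before re-enabling" pattern can be sketched as below. This is a minimal illustration, not the PR's code: `watcherSlot` and its methods are hypothetical names, and the sketch assumes the watcher function itself never calls `restart` or `stop`, since that would block on the slot's own mutex.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// watcherSlot is a hypothetical helper (not part of NetBird) that guarantees
// at most one watcher goroutine is running at any time.
type watcherSlot struct {
	mu     sync.Mutex
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// restart cancels the previous watcher, waits for it to exit, and only then
// starts a new one.
func (s *watcherSlot) restart(parent context.Context, run func(ctx context.Context)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stopLocked()

	ctx, cancel := context.WithCancel(parent)
	s.cancel = cancel
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		run(ctx)
	}()
}

// stop cancels the current watcher, if any, and waits for it to exit.
func (s *watcherSlot) stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stopLocked()
}

// stopLocked must be called with s.mu held.
func (s *watcherSlot) stopLocked() {
	if s.cancel != nil {
		s.cancel()
		s.cancel = nil
	}
	s.wg.Wait()
}

// maxConcurrentWatchers restarts a slow-to-exit watcher several times and
// returns the highest number of watchers observed running simultaneously.
func maxConcurrentWatchers(restarts int) int32 {
	var running, peak atomic.Int32
	run := func(ctx context.Context) {
		n := running.Add(1)
		for {
			old := peak.Load()
			if n <= old || peak.CompareAndSwap(old, n) {
				break
			}
		}
		<-ctx.Done()
		time.Sleep(time.Millisecond) // simulate a slow shutdown
		running.Add(-1)
	}

	var slot watcherSlot
	for i := 0; i < restarts; i++ {
		slot.restart(context.Background(), run)
	}
	slot.stop()
	return peak.Load()
}

func main() {
	fmt.Println("peak concurrent watchers:", maxConcurrentWatchers(5))
}
```

Because each restart waits for the previous goroutine to finish, the peak stays at one even though every watcher is slow to shut down.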

🧹 Nitpick comments (1)
client/internal/peer/conn.go (1)

99-102: Testing recommendation: The PR description notes this code is untested.

Given the complexity of the state machine transitions (ICE ready → active → disconnected → relay fallback) and the concurrent nature of the watcher goroutines, consider adding test coverage for:

  1. Stale handshake detection triggering disconnect/reconnect
  2. Race conditions between watcher enable/disable cycles
  3. Proper cleanup during Close() while watcher is active

Also applies to: 145-145

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94de656 and 43487ed.

📒 Files selected for processing (1)
  • client/internal/peer/conn.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-08T16:23:14.146Z
Learnt from: lixmal
Repo: netbirdio/netbird PR: 4792
File: client/iface/bind/ice_bind.go:34-44
Timestamp: 2026-01-08T16:23:14.146Z
Learning: In client/iface/bind/ice_bind.go, the fallback receiver path in CreateReceiverFn (lines 34-44) is intentionally a minimal stub to maintain interface compatibility with the wireguard-go fork. It does not include STUN filtering, activity recording, or UDP mux initialization because IPv6 is not yet supported in the udpmux, and this path primarily handles non-IPv4 PacketConn cases that aren't fully supported yet.

Applied to files:

  • client/internal/peer/conn.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: Build Cache
  • GitHub Check: Windows
  • GitHub Check: release_ui
  • GitHub Check: FreeBSD Port / Build & Test
  • GitHub Check: Darwin
  • GitHub Check: Android / Build
  • GitHub Check: release
  • GitHub Check: release_ui_darwin
  • GitHub Check: JS / Lint
  • GitHub Check: iOS / Build
  • GitHub Check: Linux
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
🔇 Additional comments (4)
client/internal/peer/conn.go (4)

99-102: LGTM!

The new wgWatcherICE field follows the existing architectural pattern where relay has its own watcher. The wgWatcherWg WaitGroup properly coordinates watcher goroutine lifecycle.


218-240: LGTM!

The cleanup sequence is correctly ordered:

  1. Context cancellation signals all goroutines to stop
  2. Watchers are explicitly disabled
  3. Mutex is released before waiting on wgWatcherWg (due to defer LIFO), preventing deadlock since watcher callbacks acquire conn.mu
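
The defer ordering in point 3 can be demonstrated with a small, self-contained sketch. The `conn` type and its fields here are simplified stand-ins chosen for illustration, not the real Conn struct.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a simplified, hypothetical stand-in for the real Conn struct.
type conn struct {
	mu     sync.Mutex
	wg     sync.WaitGroup
	events []string
}

// callback mimics a watcher callback that needs the connection mutex.
func (c *conn) callback() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.events = append(c.events, "callback ran")
}

// Close relies on defer LIFO order: the Wait is deferred first, so it runs
// last, after the mutex has already been released.
func (c *conn) Close(stop chan struct{}) {
	defer c.wg.Wait() // deferred first, runs last
	c.mu.Lock()
	defer c.mu.Unlock() // deferred second, runs first

	c.events = append(c.events, "stop signalled")
	close(stop)
}

// closeWithRunningWatcher starts a watcher goroutine whose final step needs
// the mutex, then closes the connection and returns the recorded events.
func closeWithRunningWatcher() []string {
	c := &conn{}
	stop := make(chan struct{})
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		<-stop
		c.callback() // would block forever if Close still held c.mu while waiting
	}()
	c.Close(stop)
	return c.events
}

func main() {
	fmt.Println(closeWithRunningWatcher())
}
```

Swapping the two defers would make Close wait on the WaitGroup while still holding the mutex, and the watcher goroutine could then never finish its callback.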

410-419: Verify that DisableWgWatcher is safe to call from within its own callback.

onICEStateDisconnected is passed to EnableWgWatcher as a callback and calls DisableWgWatcher on the same watcher (line 419). If the callback is invoked synchronously within the watcher's goroutine, calling DisableWgWatcher from inside could cause deadlock or panic depending on WGWatcher's implementation.
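
One common way to make such a call safe is to invoke the callback only after the watcher has released its own lock. The sketch below assumes a mutex-guarded watcher and uses hypothetical names (`watcher`, `Enable`, `Disable`, `fire`); it does not reflect how WGWatcher is actually implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a hypothetical, simplified watcher guarded by a mutex.
type watcher struct {
	mu      sync.Mutex
	enabled bool
	onStale func()
}

// Enable arms the watcher with a callback.
func (w *watcher) Enable(onStale func()) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.enabled = true
	w.onStale = onStale
}

// Disable disarms the watcher. It takes the same mutex as fire.
func (w *watcher) Disable() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.enabled = false
	w.onStale = nil
}

// fire copies the callback under the lock and invokes it after unlocking, so
// the callback may call Disable without deadlocking. It reports whether the
// callback was invoked.
func (w *watcher) fire() bool {
	w.mu.Lock()
	cb, ok := w.onStale, w.enabled
	w.mu.Unlock()

	if !ok || cb == nil {
		return false
	}
	cb()
	return true
}

func main() {
	w := &watcher{}
	w.Enable(func() {
		fmt.Println("stale handshake detected, disabling watcher")
		w.Disable() // safe: fire no longer holds w.mu at this point
	})
	fmt.Println("first fire invoked callback:", w.fire())
	fmt.Println("second fire invoked callback:", w.fire())
}
```

If `fire` held the mutex while calling the callback, the nested `Disable` would block forever, because `sync.Mutex` is not reentrant.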


129-145: No action needed. NewWGWatcher is infallible—it only returns a *WGWatcher pointer without any error return value. The function simply allocates and returns a struct literal, making error handling unnecessary.

Likely an incorrect or invalid review comment.

@pappz
Collaborator

pappz commented Jan 13, 2026

@Silex Why do you think this is necessary? ICE has keep-alives, so it can detect connectivity issues faster than the WireGuard key exchange.

@Silex
Contributor Author

Silex commented Jan 14, 2026

In my experience, even with ICE keep-alives in place, this does not get detected:

foo.bar.baz:
  NetBird IP: 100.70.205.198
  Public key: oWxONSNjVcitGblH6DA5OGDTh6PEIM6m4UwU37CODUE=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.168.106:51820/116.202.18.146:58980
  Relay server address:
  Last connection update: 3 hours, 36 minutes ago
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/354.8 KiB
  Quantum resistance: false
  Networks: -
  Latency: 51.9013ms

See "Last WireGuard handshake: -", which stays like this for days. This is for a peer that is an AXIS camera running the netbird client, but I've seen it happen on Teltonika routers as well. It just stops doing handshakes, and the only solution is to run "netbird down/up" on one of the peers.

And when it happens it's not with all peers, e.g. A cannot ping C but B can ping C just fine. Restarting netbird on either A or C fixes the problem of A not being able to ping C.

Unfortunately right now I don't have this problem anymore but when it happens again in a few days I'll do netbird debug bundle -S -U as requested in the linked issue.

Problem is I need to wait for the problem to come back and it usually takes around 24h/48h to trigger.

@Silex
Contributor Author

Silex commented Jan 14, 2026

Here's a summary of the issue by ChatGPT, but we have yet to confirm that it correctly assessed the problem:


ICE vs WireGuard — Why NetBird Can Lose Handshakes

  • ICE and WireGuard are separate layers

    • ICE handles NAT traversal and connectivity discovery.
    • WireGuard handles encrypted data transport.
  • ICE keep-alives do not validate WireGuard health

    • They keep NAT mappings open and confirm STUN reachability.
    • They do not confirm that WireGuard packets are flowing or that a handshake is active.
  • WireGuard has no automatic recovery mechanism

    • No per-peer reset
    • No forced re-handshake
    • No liveness timeout that triggers a reconnect
  • On Windows, UDP sockets can silently break

    • Triggered by sleep, NIC power saving, or NAT rebinding
    • Packets are sent but replies never arrive
    • WireGuard remains stuck with Last handshake: -
  • ICE can report “connected” while WireGuard is effectively dead

    • ICE sees a valid candidate
    • NetBird considers the peer connected
    • The data plane is black-holed
  • Only a full teardown recovers the connection

    • netbird down/up or restarting the NetBird service
    • This recreates sockets and forces a new WireGuard handshake

Bottom line:
ICE keep-alives do not protect WireGuard from stale or broken UDP paths, and WireGuard cannot recover on its own once this happens.

@pappz
Collaborator

pappz commented Jan 15, 2026

@Silex I just want to figure out the root cause of your issue. I see that a complete reconnection resolves the WireGuard connection, but it only covers up a bug. Basically, we have two conditions: an established ICE connection, and the WireGuard endpoint configured with the correct ICE candidate address. So, as a first step, we need to check the endpoint settings in the debug bundle.
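
The second condition can be checked mechanically once both addresses are known. The sketch below assumes the WireGuard endpoint is expected to be the remote ICE candidate address itself, as described above; `endpointMatches` is a hypothetical helper, and the addresses are taken from the status output posted in this thread.

```go
package main

import (
	"fmt"
	"net/netip"
)

// endpointMatches reports whether the WireGuard peer endpoint and the remote
// ICE candidate refer to the same IP and port. Both arguments are "ip:port"
// strings. This is a hypothetical helper, not a NetBird function.
func endpointMatches(wgEndpoint, iceRemote string) (bool, error) {
	ep, err := netip.ParseAddrPort(wgEndpoint)
	if err != nil {
		return false, fmt.Errorf("parse WireGuard endpoint: %w", err)
	}
	cand, err := netip.ParseAddrPort(iceRemote)
	if err != nil {
		return false, fmt.Errorf("parse ICE candidate: %w", err)
	}
	// Unmap so that an IPv4-mapped IPv6 address compares equal to plain IPv4.
	ep = netip.AddrPortFrom(ep.Addr().Unmap(), ep.Port())
	cand = netip.AddrPortFrom(cand.Addr().Unmap(), cand.Port())
	return ep == cand, nil
}

func main() {
	ok, err := endpointMatches("116.202.18.146:45889", "116.202.18.146:45889")
	fmt.Println(ok, err)

	// Same host, different port: the endpoint does not match the candidate.
	ok, err = endpointMatches("116.202.18.146:40372", "116.202.18.146:45889")
	fmt.Println(ok, err)
}
```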

@Silex
Contributor Author

Silex commented Jan 15, 2026

Ah the problem is finally back! debug bundle incoming 🥳

@Silex
Contributor Author

Silex commented Jan 15, 2026

Upload file key:
c33053488251f90ea4683596101892e50e3e31e8030de3052d05b2b3fe6d2468/232a5f43-a1cf-4e11-a0d6-9b1be0d824b6

Here is more information:

camera.peer.debug:
  NetBird IP: 100.70.205.198
  Public key: oWxONSNjVcitGblH6DA5OGDTh6PEIM6m4UwU37CODUE=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/prflx
  ICE candidate endpoints (Local/Remote): 83.173.249.42:51820/116.202.18.146:45889
  Relay server address:
  Last connection update: 3 hours, 42 minutes ago
  Last WireGuard handshake: 3 hours, 42 minutes ago
  Transfer status (received/sent) 180 B/309.9 KiB
  Quantum resistance: false
  Networks: -
  Latency: 62.3291ms

router.peer.debug:
  NetBird IP: 100.70.221.132
  Public key: +oBb+Eno1lY0y0iXdQruOGQsWNV4HiM02xZIgRpyvnQ=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/prflx
  ICE candidate endpoints (Local/Remote): 83.173.249.42:51820/116.202.18.146:40372
  Relay server address:
  Last connection update: 3 hours, 42 minutes ago
  Last WireGuard handshake: 16 seconds ago
  Transfer status (received/sent) 8.5 GiB/282.2 MiB
  Quantum resistance: false
  Networks: -
  Latency: 46.5375ms

As you can see, the camera's last WireGuard handshake was more than 3 hours ago. Tomorrow it will probably display - and never recover.

The camera has internet through a Teltonika (OpenWrt) router which also runs netbird and zerotier (we are in the process of switching to netbird once stability is there). Both are accessed from an NVR, and the NVR can ping the Teltonika "forever" but the camera is lost after 1-2 days.

Both run a fairly old version of netbird (0.37.1); the NVR runs 0.62.2 and the self-hosted server runs 0.48.0. I believe this should not matter much, but I can try to upgrade to recent versions if you think something recent explicitly fixes this kind of problem.

The camera runs in userspace mode like this, because it cannot run as root at all:

export NB_USE_NETSTACK_MODE=true
export NB_ENABLE_NETSTACK_LOCAL_FORWARDING=true

I'm pretty sure I also noticed this behavior with the Teltonika routers, but since the routers have ping reboot in place, it's harder to notice.

@Silex
Contributor Author

Silex commented Jan 16, 2026

@pappz: do you need something else?

I'll soon upgrade everything to latest netbird version if that helps.

@pappz
Collaborator

pappz commented Jan 19, 2026

Thank you for the logs! It’s good to know about the old versions. This could be a hidden backward-compatibility issue, but I can’t find an explanation in the logs. Let’s go deeper.
Could you please retest with these environment variables? This will provide more detailed logs:

PIONS_LOG_DEBUG=all
NB_WG_DEBUG=true

I also started extending the WireGuard watcher to cover ICE connections for another use case. It will achieve similar results, but through a different approach. Instead of instantiating a new watcher, I’m planning to move the existing one out of the Relay worker. I’m hoping to finish this today or within the next couple of days.

@pappz
Collaborator

pappz commented Jan 19, 2026

@Silex here is the PR. If you’d like, you can take an early look and test this version.

@Silex
Contributor Author

Silex commented Jan 20, 2026

@pappz thanks. First waiting for the problem to trigger again with PIONS_LOG_DEBUG=all so I can provide another trace...

@Silex
Contributor Author

Silex commented Jan 22, 2026

@pappz: ok problem finally happened again:

PS C:\Users\stvs> netbird debug bundle -S -U
Local file:
C:\Windows\SystemTemp\netbird.debug.2019101234.zip
Upload file key:
c33053488251f90ea4683596101892e50e3e31e8030de3052d05b2b3fe6d2468/03221c7f-c077-438c-9023-1df49c7ed2f7

netbird.debug.2019101234.zip

By the way it's not very clear if the "Upload file key" is enough or if I should also upload the zip here like I did.

The camera peer that interests you is 100.70.205.198. The router it's attached to is 100.70.221.132.

@pappz
Collaborator

pappz commented Jan 26, 2026

Unfortunately, the logs did not contain the Pion logs for some reason :/
I will close this PR because the same changes have already been merged elsewhere and achieve similar behavior. Let’s continue the discussion on the issues page.

It’s enough to share the ID of the debug bundle. Based on that, we can download it.

@pappz pappz closed this Jan 26, 2026
@pappz
Collaborator

pappz commented Jan 26, 2026

Fixed in: #5133

@Silex Silex deleted the fix/ice-connection-watchdog branch January 26, 2026 08:08
@Silex
Contributor Author

Silex commented Jan 26, 2026

Will test 0.64.1 and report

Development

Successfully merging this pull request may close these issues.

Intermittent connectivity loss to routing peers requiring manual restart

2 participants