
[client] Fix DNS probe thread safety and avoid blocking engine sync #5576

Merged

pappz merged 5 commits into main from fix/dns-probe-thread-safety on Mar 13, 2026

Conversation

@pappz
Collaborator

@pappz pappz commented Mar 11, 2026

Refactor ProbeAvailability to prevent blocking the engine's sync mutex during slow DNS probes (a rough sketch of the resulting flow follows the list below).

  • The probe now derives its context from the server's own context (s.ctx) instead of accepting one from the caller.
  • A mutex ensures only one probe runs at a time; new calls cancel the previous probe before starting.
  • Fixes a data race in Stop() when accessing probeCancel without holding the probe mutex.
  • Fixes a race condition in the multierror handling.
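
A minimal sketch of what that serialization could look like, assuming the probeMu, probeCancel, and probeWg fields described in this PR (runProbes is a hypothetical helper standing in for the actual probe fan-out; the real code may differ):

func (s *DefaultServer) ProbeAvailability(ctx context.Context) {
	// ctx is accepted to match the handler signature; per the description
	// above, the probe's lifetime is bounded by the server's own s.ctx.
	s.probeMu.Lock()
	if s.probeCancel != nil {
		s.probeCancel() // a new call supersedes any probe still running
	}
	probeCtx, cancel := context.WithCancel(s.ctx)
	s.probeCancel = cancel
	s.probeWg.Add(1)
	s.probeMu.Unlock()

	defer s.probeWg.Done()
	defer cancel()
	s.runProbes(probeCtx) // hypothetical: fan the probe out to the registered resolvers
}

func (s *DefaultServer) Stop() {
	s.probeMu.Lock()
	if s.probeCancel != nil {
		s.probeCancel() // read under probeMu, avoiding the data race described above
	}
	s.probeMu.Unlock()
	s.probeWg.Wait() // let an in-flight probe observe the cancellation and return
	// ... remaining shutdown ...
}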

Describe your changes

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • Is it a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes
    • Prevented overlapping DNS probes and ensured probes are cleanly canceled during shutdown.
  • Performance
    • DNS availability checks now run asynchronously so updates don’t block main flows.
  • Chores
    • Improved probe lifecycle coordination and cancellation handling for safer, more reliable concurrent behavior.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 11, 2026

Important

Review skipped

This PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae20015a-ab1b-4990-884e-527f97f9f900

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

ProbeAvailability methods were made context-aware and probe lifecycles were synchronized: DefaultServer and upstream resolvers gained mutexes, cancel functions, and waitgroups to coordinate start/cancel/wait of probes; engine now launches DNS probes asynchronously.

Changes

  • Local Resolver (client/internal/dns/local/local.go): Signature change from Resolver.ProbeAvailability() to Resolver.ProbeAvailability(context.Context); body unchanged.
  • DNS Server (client/internal/dns/server.go, client/internal/dns/server_test.go): Added probeMu, probeCancel, and probeWg to DefaultServer; ProbeAvailability(ctx context.Context) is now serialized and cancellable; Stop() cancels and waits for in-flight probes; tests and mocks updated to the new signature.
  • Upstream Resolver (client/internal/dns/upstream.go, client/internal/dns/upstream_test.go): Added wg sync.WaitGroup to the upstream base; ProbeAvailability(ctx context.Context) replaces the parameterless variant; testNameserver gained baseCtx and externalCtx parameters; Stop/disable/waitUntilResponse coordinate via wg and combined contexts; tests updated to pass a context.
  • Engine Integration (client/internal/engine.go): updateNetworkMap now launches DNS probing in a background goroutine instead of calling it synchronously (see the sketch after this list).
  • Tests / Mocks (client/internal/dns/server_test.go, client/internal/dns/upstream_test.go): Mocks and tests updated to the new ProbeAvailability(context.Context) signature; call sites adjusted to provide a context.
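
On the engine side, the change is essentially to fire the probe from a goroutine instead of calling it inline. A sketch, assuming the DNS server and engine context are reachable as e.dnsServer and e.ctx (the surrounding updateNetworkMap code is elided):

// Launch the availability probe in the background so a slow upstream
// cannot hold updateNetworkMap and, with it, the engine's sync mutex.
go func() {
	e.dnsServer.ProbeAvailability(e.ctx)
}()

Because the server serializes probes and cancels a superseded one, triggering this on every network-map update stays safe even when updates arrive faster than probes complete.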

Sequence Diagram(s)

sequenceDiagram
    participant Engine as Engine
    participant Server as DefaultServer
    participant Upstream as UpstreamResolver
    participant NS as Nameserver

    Engine->>Server: ProbeAvailability(ctx)
    Server->>Server: probeMu.Lock / cancel prior probe / start new probeCtx
    Server->>Upstream: ProbeAvailability(probeCtx)
    Upstream->>Upstream: wg.Add / spawn per-host probes
    Upstream->>NS: testNameserver(baseCtx, externalCtx, addr)
    NS-->>Upstream: success/failure
    Upstream-->>Server: report result / update state
    Upstream->>Upstream: wg.Done for probes
    Server->>Server: probeWg wait / cancel probeCtx when done

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • mlsmaycon
  • lixmal

Poem

🐰
I hopped through contexts, mutex in paw,
I nudged a probe until it saw,
Waitgroups hummed a careful song,
Cancels kept the dance from running long,
DNS naps safe — carrot dreams all day long 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title accurately summarizes the two main changes: fixing DNS probe thread safety and preventing engine sync blocking.
  • Description check ✅ Passed: The description provides rationale and key technical details, though the specific issue/ticket number is missing and some template sections are incomplete.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/server.go`:
- Around line 531-539: The loop launches goroutines that read s.dnsMuxMap while
other code (e.g., UpdateDNSServer / updateMux) may rewrite it, causing
concurrent map iteration/write; to fix, while holding s.probeMu create a stable
snapshot slice of the handlers (e.g., []handlerWithStop by iterating s.dnsMuxMap
and appending each mux.handler), then release s.probeMu and range over that
slice to spawn goroutines calling handler.ProbeAvailability(probeCtx) (keep
existing wg logic), ensuring no map access occurs after the lock is released.

In `@client/internal/dns/upstream.go`:
- Around line 116-120: Stop() calls u.wg.Wait() without synchronization while
disable() calls u.wg.Add(1) concurrently; this can panic. Fix by serializing
access to the WaitGroup: acquire u.mutex (the same mutex used in disable())
around the sequence that updates cancellation state and calls u.wg.Wait() in
Stop(), and also perform the u.wg.Add(1) call inside disable() while holding
u.mutex (or check a guarded boolean like u.disabled/u.stopped before calling
Add) so no Add can race with Wait; update the implementations of Stop() and
disable() to use that mutex-protected pattern.
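
To make the first comment concrete, the snapshot-under-lock idea could look roughly like the sketch below; probeHandlers is a hypothetical helper, and the names (probeMu, probeWg, dnsMuxMap, mux.handler) follow the review text rather than the exact layout of server.go:

func (s *DefaultServer) probeHandlers(probeCtx context.Context) {
	s.probeMu.Lock()
	// Copy the handlers out while the lock is held so the goroutines below
	// never iterate dnsMuxMap concurrently with writers such as updateMux.
	handlers := make([]handlerWithStop, 0, len(s.dnsMuxMap))
	for _, mux := range s.dnsMuxMap {
		handlers = append(handlers, mux.handler)
	}
	s.probeMu.Unlock()

	for _, h := range handlers {
		s.probeWg.Add(1)
		go func(h handlerWithStop) {
			defer s.probeWg.Done()
			h.ProbeAvailability(probeCtx)
		}(h)
	}
}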

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 54237e4b-b421-49a8-b5ba-02727312b638

📥 Commits

Reviewing files that changed from the base of the PR and between 7a23c57 and 8a72699.

📒 Files selected for processing (6)
  • client/internal/dns/local/local.go
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_test.go
  • client/internal/engine.go

Add proper locking to prevent data races when accessing shared resources during DNS probe execution and Stop(). Update handlers snapshot logic to avoid conflicts with concurrent writers.

@pappz pappz requested a review from lixmal March 12, 2026 16:04
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 397-401: The recovery goroutine and the probe path race on
u.disabled because waitUntilResponse() clears it without holding u.mutex while
disable() reads/sets it; fix by ensuring every access to u.disabled is protected
by u.mutex (e.g., add u.mutex.Lock()/Unlock() around reads/writes in
waitUntilResponse() and in disable(), or introduce helper methods like
setDisabled(bool) and isDisabled() that acquire the mutex), and update any other
code paths touching u.disabled to use the same mutex-protected accessors so all
reads/writes are synchronized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e38232c-85d0-4c81-99fc-504bc8833bd7

📥 Commits

Reviewing files that changed from the base of the PR and between 8be63ef and 0390b42.

📒 Files selected for processing (1)
  • client/internal/dns/upstream.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)

389-401: ⚠️ Potential issue | 🔴 Critical

u.disabled is still racing with the recovery goroutine.

disable() is effectively serialized by u.mutex, but waitUntilResponse() still clears the same flag at Line 375 with no synchronization. That can double-start the recovery loop or skip a needed deactivate/reactivate cycle depending on timing. Reusing u.mutex for this flag will deadlock with Stop()'s locked wg.Wait(), so this likely needs an atomic.Bool or a dedicated state lock instead.

#!/bin/bash
echo "=== disabled access sites ==="
rg -n -C2 'u\.disabled' client/internal/dns/upstream.go
echo
echo "=== WaitGroup and mutex coordination ==="
rg -n -C2 'wg\.(Add|Wait|Done)|mutex\.(Lock|Unlock)' client/internal/dns/upstream.go
🐛 Safer direction
 type upstreamResolverBase struct {
-	disabled         bool
+	disabled         atomic.Bool
 	successCount     atomic.Int32
 	mutex            sync.Mutex
@@
-	u.disabled = false
+	u.disabled.Store(false)
 }
@@
 func (u *upstreamResolverBase) disable(err error) {
-	if u.disabled {
+	if !u.disabled.CompareAndSwap(false, true) {
 		return
 	}
@@
-	u.disabled = true
 	u.wg.Add(1)
 	go func() {
 		defer u.wg.Done()
 		u.waitUntilResponse()
 	}()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 389 - 401, The u.disabled flag
is raced between disable() and the recovery goroutine in waitUntilResponse()
because mutex use would deadlock with Stop()/wg.Wait(); change the boolean field
u.disabled to an atomic.Bool (or another atomic/state lock) and replace all
reads/writes in disable(), waitUntilResponse(), and any other access sites with
atomic operations (Load, Store, and where appropriate CompareAndSwap) so the
disable/reactivate flow is serialized without acquiring u.mutex and without
introducing wg deadlocks; ensure disable() checks atomic.Load(&u.disabled) and
sets via atomic.Store/CompareAndSwap, and waitUntilResponse() clears/reactivates
using atomic operations and still coordinates wg.Done() as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 285-310: The probe goroutine in the upstream worker logs and
appends every testNameserver error even when the probe was intentionally
canceled; before appending to errs or calling log.Warnf in the anonymous
goroutine (the closure that calls u.testNameserver with upstream), check if
ctx.Err() != nil or u.ctx.Err() != nil and if so simply return without
logging/appending; otherwise proceed to append the error to errs and warn as
before. Ensure you reference the same ctx and u.ctx used to control cancellation
so superseded probes bail out quietly.

---

Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 389-401: The u.disabled flag is raced between disable() and the
recovery goroutine in waitUntilResponse() because mutex use would deadlock with
Stop()/wg.Wait(); change the boolean field u.disabled to an atomic.Bool (or
another atomic/state lock) and replace all reads/writes in disable(),
waitUntilResponse(), and any other access sites with atomic operations (Load,
Store, and where appropriate CompareAndSwap) so the disable/reactivate flow is
serialized without acquiring u.mutex and without introducing wg deadlocks;
ensure disable() checks atomic.Load(&u.disabled) and sets via
atomic.Store/CompareAndSwap, and waitUntilResponse() clears/reactivates using
atomic operations and still coordinates wg.Done() as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b01fb9e-f7ca-4107-badb-09e6df278a7d

📥 Commits

Reviewing files that changed from the base of the PR and between 0390b42 and 75be643.

📒 Files selected for processing (1)
  • client/internal/dns/upstream.go

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
client/internal/dns/upstream.go (2)

287-292: ⚠️ Potential issue | 🟡 Minor

Skip expected errors from canceled probes.

With the new “cancel the previous probe” flow, a superseded probe now returns context.Canceled here during normal operation. Lines 289-292 still append and warn on that error, and the cancellation check at Lines 304-310 happens too late to prevent noisy false failures. Bail out before multierror.Append / Warnf when ctx.Err() or u.ctx.Err() is already set.

🔇 Quietly ignore superseded probe cancellations
 		go func(upstream netip.AddrPort) {
 			defer wg.Done()
 			err := u.testNameserver(u.ctx, ctx, upstream, 500*time.Millisecond)
 			if err != nil {
+				if (ctx != nil && ctx.Err() != nil) || u.ctx.Err() != nil {
+					return
+				}
 				mu.Lock()
 				errs = multierror.Append(errs, err)
 				mu.Unlock()
 				log.Warnf("probing upstream nameserver %s: %s", upstream, err)
 				return
#!/bin/bash
set -euo pipefail

echo "=== ProbeAvailability() error handling ==="
sed -n '283,310p' client/internal/dns/upstream.go

Also applies to: 304-310

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 287 - 292, The probe routine
should ignore expected context cancellations before treating them as errors:
after calling u.testNameserver(u.ctx, ctx, upstream, ...) (and in the later
cancellation-handling block around the same probe flow) check whether ctx.Err()
!= nil or u.ctx.Err() != nil and if so skip multierror.Append and log.Warnf for
that probe; specifically, before appending to errs or warning in the logic
referencing u.testNameserver and the subsequent cancellation-check block, bail
out early when either context is canceled so superseded probes are not recorded
as errors.

116-122: ⚠️ Potential issue | 🔴 Critical

Stop() can deadlock with a recovering probe.

Line 121 waits on u.wg while u.mutex is held, but the goroutine started at Lines 400-404 does not call Done() until waitUntilResponse() returns, and that success path tries to reacquire the same mutex at Lines 375-377. If shutdown races with a just-recovered upstream, the recovery goroutine blocks on u.mutex, Done() never runs, and Stop() hangs forever. Move the Wait() out from under the lock and gate late wg.Add(1) calls with a stopping flag (or equivalent) protected by the same mutex.

🐛 Safer shutdown pattern
 type upstreamResolverBase struct {
 	...
 	mutex            sync.Mutex
+	stopping         bool
 	reactivatePeriod time.Duration
 	upstreamTimeout  time.Duration
 	wg               sync.WaitGroup
 	...
 }

 func (u *upstreamResolverBase) Stop() {
 	log.Debugf("stopping serving DNS for upstreams %s", u.upstreamServers)
-	u.cancel()
-
 	u.mutex.Lock()
-	u.wg.Wait()
+	if u.stopping {
+		u.mutex.Unlock()
+		return
+	}
+	u.stopping = true
+	u.cancel()
 	u.mutex.Unlock()
+	u.wg.Wait()
 }

 func (u *upstreamResolverBase) disable(err error) {
+	u.mutex.Lock()
+	defer u.mutex.Unlock()
-	if u.disabled {
+	if u.disabled || u.stopping {
 		return
 	}
 	...
 	u.wg.Add(1)
 	go func() {
 		defer u.wg.Done()
 		u.waitUntilResponse()
 	}()
 }
#!/bin/bash
set -euo pipefail

echo "=== Stop() holds the mutex across Wait() ==="
sed -n '116,123p' client/internal/dns/upstream.go

echo
echo "=== waitUntilResponse() reacquires the same mutex before returning ==="
sed -n '372,378p' client/internal/dns/upstream.go

echo
echo "=== disable() defers Done() until waitUntilResponse() returns ==="
sed -n '396,404p' client/internal/dns/upstream.go

Also applies to: 375-377, 400-404

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 116 - 122, Stop() currently
holds u.mutex while calling u.wg.Wait(), which can deadlock with the recovery
goroutine that reacquires u.mutex before calling u.wg.Done(); fix by adding a
boolean stopping flag on upstreamResolverBase (e.g., u.stopping) protected by
u.mutex, set stopping = true inside Stop() while holding the mutex, call
u.cancel(), then unlock and call u.wg.Wait() outside the lock; in the
recovery/disable path (the goroutine that calls waitUntilResponse() and defers
u.wg.Done()) guard any late u.wg.Add(1) with the same mutex (acquire u.mutex, if
u.stopping then skip starting the goroutine, else call u.wg.Add(1) and release
the mutex) so no new goroutines are added after Stop() begins — update
upstreamResolverBase.Stop, the recovery/disable goroutine (where u.wg.Add/Done
and waitUntilResponse() are used), and add the u.stopping field to ensure safe
shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 287-292: The probe routine should ignore expected context
cancellations before treating them as errors: after calling
u.testNameserver(u.ctx, ctx, upstream, ...) (and in the later
cancellation-handling block around the same probe flow) check whether ctx.Err()
!= nil or u.ctx.Err() != nil and if so skip multierror.Append and log.Warnf for
that probe; specifically, before appending to errs or warning in the logic
referencing u.testNameserver and the subsequent cancellation-check block, bail
out early when either context is canceled so superseded probes are not recorded
as errors.
- Around line 116-122: Stop() currently holds u.mutex while calling u.wg.Wait(),
which can deadlock with the recovery goroutine that reacquires u.mutex before
calling u.wg.Done(); fix by adding a boolean stopping flag on
upstreamResolverBase (e.g., u.stopping) protected by u.mutex, set stopping =
true inside Stop() while holding the mutex, call u.cancel(), then unlock and
call u.wg.Wait() outside the lock; in the recovery/disable path (the goroutine
that calls waitUntilResponse() and defers u.wg.Done()) guard any late
u.wg.Add(1) with the same mutex (acquire u.mutex, if u.stopping then skip
starting the goroutine, else call u.wg.Add(1) and release the mutex) so no new
goroutines are added after Stop() begins — update upstreamResolverBase.Stop, the
recovery/disable goroutine (where u.wg.Add/Done and waitUntilResponse() are
used), and add the u.stopping field to ensure safe shutdown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 475d92be-c9d7-4eda-9416-d2dc0b84e438

📥 Commits

Reviewing files that changed from the base of the PR and between 75be643 and 126dd64.

📒 Files selected for processing (1)
  • client/internal/dns/upstream.go
