
[client] Fix DNS probe thread safety and avoid blocking engine sync #5576

Merged

pappz merged 5 commits into main from fix/dns-probe-thread-safety on Mar 13, 2026

Conversation

@pappz
Collaborator

@pappz pappz commented Mar 11, 2026

Refactor ProbeAvailability to prevent blocking the engine's sync mutex during slow DNS probes (a rough sketch of the resulting flow follows the list below).

  • The probe now derives its context from the server's own context (s.ctx) instead of accepting one from the caller.
  • A mutex ensures only one probe runs at a time; new calls cancel the previous probe before starting.
  • Fixes a data race in Stop() when accessing probeCancel without holding the probe mutex.
  • Fixes a race condition in the multierror handling.
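
A minimal sketch of what that serialization could look like, assuming the probeMu, probeCancel, and probeWg fields described in this PR (runProbes is a hypothetical helper standing in for the actual probe fan-out; the real code may differ):

func (s *DefaultServer) ProbeAvailability(ctx context.Context) {
	// ctx is accepted to match the handler signature; per the description
	// above, the probe's lifetime is bounded by the server's own s.ctx.
	s.probeMu.Lock()
	if s.probeCancel != nil {
		s.probeCancel() // a new call supersedes any probe still running
	}
	probeCtx, cancel := context.WithCancel(s.ctx)
	s.probeCancel = cancel
	s.probeWg.Add(1)
	s.probeMu.Unlock()

	defer s.probeWg.Done()
	defer cancel()
	s.runProbes(probeCtx) // hypothetical: fan the probe out to the registered resolvers
}

func (s *DefaultServer) Stop() {
	s.probeMu.Lock()
	if s.probeCancel != nil {
		s.probeCancel() // read under probeMu, avoiding the data race described above
	}
	s.probeMu.Unlock()
	s.probeWg.Wait() // let an in-flight probe observe the cancellation and return
	// ... remaining shutdown ...
}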

Describe your changes

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • Is it a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes
    • Prevented overlapping DNS probes and ensured probes are cleanly canceled during shutdown.
  • Performance
    • DNS availability checks now run asynchronously so updates don’t block main flows.
  • Chores
    • Improved probe lifecycle coordination and cancellation handling for safer, more reliable concurrent behavior.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 11, 2026

Important

Review skipped

This PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae20015a-ab1b-4990-884e-527f97f9f900

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

ProbeAvailability methods were made context-aware and probe lifecycles were synchronized: DefaultServer and upstream resolvers gained mutexes, cancel functions, and waitgroups to coordinate start/cancel/wait of probes; engine now launches DNS probes asynchronously.

Changes

  • Local Resolver (client/internal/dns/local/local.go): Signature change from Resolver.ProbeAvailability() to Resolver.ProbeAvailability(context.Context); body unchanged.
  • DNS Server (client/internal/dns/server.go, client/internal/dns/server_test.go): Added probeMu, probeCancel, and probeWg to DefaultServer; ProbeAvailability(ctx context.Context) is now serialized and cancellable; Stop() cancels and waits for in-flight probes; tests and mocks updated to the new signature.
  • Upstream Resolver (client/internal/dns/upstream.go, client/internal/dns/upstream_test.go): Added wg sync.WaitGroup to the upstream base; ProbeAvailability(ctx context.Context) replaces the parameterless variant; testNameserver gained baseCtx and externalCtx parameters; Stop/disable/waitUntilResponse coordinate via wg and combined contexts; tests updated to pass a context.
  • Engine Integration (client/internal/engine.go): updateNetworkMap now launches DNS probing in a background goroutine instead of calling it synchronously (see the sketch after this list).
  • Tests / Mocks (client/internal/dns/server_test.go, client/internal/dns/upstream_test.go): Mocks and tests updated to the new ProbeAvailability(context.Context) signature; call sites adjusted to provide a context.
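
On the engine side, the change is essentially to fire the probe from a goroutine instead of calling it inline. A sketch, assuming the DNS server and engine context are reachable as e.dnsServer and e.ctx (the surrounding updateNetworkMap code is elided):

// Launch the availability probe in the background so a slow upstream
// cannot hold updateNetworkMap and, with it, the engine's sync mutex.
go func() {
	e.dnsServer.ProbeAvailability(e.ctx)
}()

Because the server serializes probes and cancels a superseded one, triggering this on every network-map update stays safe even when updates arrive faster than probes complete.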

Sequence Diagram(s)

sequenceDiagram
    participant Engine as Engine
    participant Server as DefaultServer
    participant Upstream as UpstreamResolver
    participant NS as Nameserver

    Engine->>Server: ProbeAvailability(ctx)
    Server->>Server: probeMu.Lock / cancel prior probe / start new probeCtx
    Server->>Upstream: ProbeAvailability(probeCtx)
    Upstream->>Upstream: wg.Add / spawn per-host probes
    Upstream->>NS: testNameserver(baseCtx, externalCtx, addr)
    NS-->>Upstream: success/failure
    Upstream-->>Server: report result / update state
    Upstream->>Upstream: wg.Done for probes
    Server->>Server: probeWg wait / cancel probeCtx when done

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • mlsmaycon
  • lixmal

Poem

🐰
I hopped through contexts, mutex in paw,
I nudged a probe until it saw,
Waitgroups hummed a careful song,
Cancels kept the dance from running long,
DNS naps safe — carrot dreams all day long 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title accurately summarizes the two main changes: fixing DNS probe thread safety and preventing engine sync blocking.
  • Description check ✅ Passed: The description provides rationale and key technical details, though the specific issue/ticket number is missing and some template sections are incomplete.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/server.go`:
- Around line 531-539: The loop launches goroutines that read s.dnsMuxMap while
other code (e.g., UpdateDNSServer / updateMux) may rewrite it, causing
concurrent map iteration/write; to fix, while holding s.probeMu create a stable
snapshot slice of the handlers (e.g., []handlerWithStop by iterating s.dnsMuxMap
and appending each mux.handler), then release s.probeMu and range over that
slice to spawn goroutines calling handler.ProbeAvailability(probeCtx) (keep
existing wg logic), ensuring no map access occurs after the lock is released.

In `@client/internal/dns/upstream.go`:
- Around line 116-120: Stop() calls u.wg.Wait() without synchronization while
disable() calls u.wg.Add(1) concurrently; this can panic. Fix by serializing
access to the WaitGroup: acquire u.mutex (the same mutex used in disable())
around the sequence that updates cancellation state and calls u.wg.Wait() in
Stop(), and also perform the u.wg.Add(1) call inside disable() while holding
u.mutex (or check a guarded boolean like u.disabled/u.stopped before calling
Add) so no Add can race with Wait; update the implementations of Stop() and
disable() to use that mutex-protected pattern.
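
To make the first comment concrete, the snapshot-under-lock idea could look roughly like the sketch below; probeHandlers is a hypothetical helper, and the names (probeMu, probeWg, dnsMuxMap, mux.handler) follow the review text rather than the exact layout of server.go:

func (s *DefaultServer) probeHandlers(probeCtx context.Context) {
	s.probeMu.Lock()
	// Copy the handlers out while the lock is held so the goroutines below
	// never iterate dnsMuxMap concurrently with writers such as updateMux.
	handlers := make([]handlerWithStop, 0, len(s.dnsMuxMap))
	for _, mux := range s.dnsMuxMap {
		handlers = append(handlers, mux.handler)
	}
	s.probeMu.Unlock()

	for _, h := range handlers {
		s.probeWg.Add(1)
		go func(h handlerWithStop) {
			defer s.probeWg.Done()
			h.ProbeAvailability(probeCtx)
		}(h)
	}
}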

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 54237e4b-b421-49a8-b5ba-02727312b638

📥 Commits

Reviewing files that changed from the base of the PR and between 7a23c57 and 8a72699.

📒 Files selected for processing (6)
  • client/internal/dns/local/local.go
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_test.go
  • client/internal/engine.go

Add proper locking to prevent data races when accessing shared resources during DNS probe execution and Stop(). Update handlers snapshot logic to avoid conflicts with concurrent writers.

@pappz pappz requested a review from lixmal March 12, 2026 16:04
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 397-401: The recovery goroutine and the probe path race on
u.disabled because waitUntilResponse() clears it without holding u.mutex while
disable() reads/sets it; fix by ensuring every access to u.disabled is protected
by u.mutex (e.g., add u.mutex.Lock()/Unlock() around reads/writes in
waitUntilResponse() and in disable(), or introduce helper methods like
setDisabled(bool) and isDisabled() that acquire the mutex), and update any other
code paths touching u.disabled to use the same mutex-protected accessors so all
reads/writes are synchronized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e38232c-85d0-4c81-99fc-504bc8833bd7

📥 Commits

Reviewing files that changed from the base of the PR and between 8be63ef and 0390b42.

📒 Files selected for processing (1)
  • client/internal/dns/upstream.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)

389-401: ⚠️ Potential issue | 🔴 Critical

u.disabled is still racing with the recovery goroutine.

disable() is effectively serialized by u.mutex, but waitUntilResponse() still clears the same flag at Line 375 with no synchronization. That can double-start the recovery loop or skip a needed deactivate/reactivate cycle depending on timing. Reusing u.mutex for this flag will deadlock with Stop()'s locked wg.Wait(), so this likely needs an atomic.Bool or a dedicated state lock instead.

#!/bin/bash
echo "=== disabled access sites ==="
rg -n -C2 'u\.disabled' client/internal/dns/upstream.go
echo
echo "=== WaitGroup and mutex coordination ==="
rg -n -C2 'wg\.(Add|Wait|Done)|mutex\.(Lock|Unlock)' client/internal/dns/upstream.go
🐛 Safer direction
 type upstreamResolverBase struct {
-	disabled         bool
+	disabled         atomic.Bool
 	successCount     atomic.Int32
 	mutex            sync.Mutex
@@
-	u.disabled = false
+	u.disabled.Store(false)
 }
@@
 func (u *upstreamResolverBase) disable(err error) {
-	if u.disabled {
+	if !u.disabled.CompareAndSwap(false, true) {
 		return
 	}
@@
-	u.disabled = true
 	u.wg.Add(1)
 	go func() {
 		defer u.wg.Done()
 		u.waitUntilResponse()
 	}()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 389 - 401, The u.disabled flag
is raced between disable() and the recovery goroutine in waitUntilResponse()
because mutex use would deadlock with Stop()/wg.Wait(); change the boolean field
u.disabled to an atomic.Bool (or another atomic/state lock) and replace all
reads/writes in disable(), waitUntilResponse(), and any other access sites with
atomic operations (Load, Store, and where appropriate CompareAndSwap) so the
disable/reactivate flow is serialized without acquiring u.mutex and without
introducing wg deadlocks; ensure disable() checks atomic.Load(&u.disabled) and
sets via atomic.Store/CompareAndSwap, and waitUntilResponse() clears/reactivates
using atomic operations and still coordinates wg.Done() as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 285-310: The probe goroutine in the upstream worker logs and
appends every testNameserver error even when the probe was intentionally
canceled; before appending to errs or calling log.Warnf in the anonymous
goroutine (the closure that calls u.testNameserver with upstream), check if
ctx.Err() != nil or u.ctx.Err() != nil and if so simply return without
logging/appending; otherwise proceed to append the error to errs and warn as
before. Ensure you reference the same ctx and u.ctx used to control cancellation
so superseded probes bail out quietly.

---

Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 389-401: The u.disabled flag is raced between disable() and the
recovery goroutine in waitUntilResponse() because mutex use would deadlock with
Stop()/wg.Wait(); change the boolean field u.disabled to an atomic.Bool (or
another atomic/state lock) and replace all reads/writes in disable(),
waitUntilResponse(), and any other access sites with atomic operations (Load,
Store, and where appropriate CompareAndSwap) so the disable/reactivate flow is
serialized without acquiring u.mutex and without introducing wg deadlocks;
ensure disable() checks atomic.Load(&u.disabled) and sets via
atomic.Store/CompareAndSwap, and waitUntilResponse() clears/reactivates using
atomic operations and still coordinates wg.Done() as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b01fb9e-f7ca-4107-badb-09e6df278a7d

📥 Commits

Reviewing files that changed from the base of the PR and between 0390b42 and 75be643.

📒 Files selected for processing (1)
  • client/internal/dns/upstream.go

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
client/internal/dns/upstream.go (2)

287-292: ⚠️ Potential issue | 🟡 Minor

Skip expected errors from canceled probes.

With the new “cancel the previous probe” flow, a superseded probe now returns context.Canceled here during normal operation. Lines 289-292 still append and warn on that error, and the cancellation check at Lines 304-310 happens too late to prevent noisy false failures. Bail out before multierror.Append / Warnf when ctx.Err() or u.ctx.Err() is already set.

🔇 Quietly ignore superseded probe cancellations
 		go func(upstream netip.AddrPort) {
 			defer wg.Done()
 			err := u.testNameserver(u.ctx, ctx, upstream, 500*time.Millisecond)
 			if err != nil {
+				if (ctx != nil && ctx.Err() != nil) || u.ctx.Err() != nil {
+					return
+				}
 				mu.Lock()
 				errs = multierror.Append(errs, err)
 				mu.Unlock()
 				log.Warnf("probing upstream nameserver %s: %s", upstream, err)
 				return
#!/bin/bash
set -euo pipefail

echo "=== ProbeAvailability() error handling ==="
sed -n '283,310p' client/internal/dns/upstream.go

Also applies to: 304-310

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 287 - 292, The probe routine
should ignore expected context cancellations before treating them as errors:
after calling u.testNameserver(u.ctx, ctx, upstream, ...) (and in the later
cancellation-handling block around the same probe flow) check whether ctx.Err()
!= nil or u.ctx.Err() != nil and if so skip multierror.Append and log.Warnf for
that probe; specifically, before appending to errs or warning in the logic
referencing u.testNameserver and the subsequent cancellation-check block, bail
out early when either context is canceled so superseded probes are not recorded
as errors.

116-122: ⚠️ Potential issue | 🔴 Critical

Stop() can deadlock with a recovering probe.

Line 121 waits on u.wg while u.mutex is held, but the goroutine started at Lines 400-404 does not call Done() until waitUntilResponse() returns, and that success path tries to reacquire the same mutex at Lines 375-377. If shutdown races with a just-recovered upstream, the recovery goroutine blocks on u.mutex, Done() never runs, and Stop() hangs forever. Move the Wait() out from under the lock and gate late wg.Add(1) calls with a stopping flag (or equivalent) protected by the same mutex.

🐛 Safer shutdown pattern
 type upstreamResolverBase struct {
 	...
 	mutex            sync.Mutex
+	stopping         bool
 	reactivatePeriod time.Duration
 	upstreamTimeout  time.Duration
 	wg               sync.WaitGroup
 	...
 }

 func (u *upstreamResolverBase) Stop() {
 	log.Debugf("stopping serving DNS for upstreams %s", u.upstreamServers)
-	u.cancel()
-
 	u.mutex.Lock()
-	u.wg.Wait()
+	if u.stopping {
+		u.mutex.Unlock()
+		return
+	}
+	u.stopping = true
+	u.cancel()
 	u.mutex.Unlock()
+	u.wg.Wait()
 }

 func (u *upstreamResolverBase) disable(err error) {
+	u.mutex.Lock()
+	defer u.mutex.Unlock()
-	if u.disabled {
+	if u.disabled || u.stopping {
 		return
 	}
 	...
 	u.wg.Add(1)
 	go func() {
 		defer u.wg.Done()
 		u.waitUntilResponse()
 	}()
 }
#!/bin/bash
set -euo pipefail

echo "=== Stop() holds the mutex across Wait() ==="
sed -n '116,123p' client/internal/dns/upstream.go

echo
echo "=== waitUntilResponse() reacquires the same mutex before returning ==="
sed -n '372,378p' client/internal/dns/upstream.go

echo
echo "=== disable() defers Done() until waitUntilResponse() returns ==="
sed -n '396,404p' client/internal/dns/upstream.go

Also applies to: 375-377, 400-404

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 116 - 122, Stop() currently
holds u.mutex while calling u.wg.Wait(), which can deadlock with the recovery
goroutine that reacquires u.mutex before calling u.wg.Done(); fix by adding a
boolean stopping flag on upstreamResolverBase (e.g., u.stopping) protected by
u.mutex, set stopping = true inside Stop() while holding the mutex, call
u.cancel(), then unlock and call u.wg.Wait() outside the lock; in the
recovery/disable path (the goroutine that calls waitUntilResponse() and defers
u.wg.Done()) guard any late u.wg.Add(1) with the same mutex (acquire u.mutex, if
u.stopping then skip starting the goroutine, else call u.wg.Add(1) and release
the mutex) so no new goroutines are added after Stop() begins — update
upstreamResolverBase.Stop, the recovery/disable goroutine (where u.wg.Add/Done
and waitUntilResponse() are used), and add the u.stopping field to ensure safe
shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 287-292: The probe routine should ignore expected context
cancellations before treating them as errors: after calling
u.testNameserver(u.ctx, ctx, upstream, ...) (and in the later
cancellation-handling block around the same probe flow) check whether ctx.Err()
!= nil or u.ctx.Err() != nil and if so skip multierror.Append and log.Warnf for
that probe; specifically, before appending to errs or warning in the logic
referencing u.testNameserver and the subsequent cancellation-check block, bail
out early when either context is canceled so superseded probes are not recorded
as errors.
- Around line 116-122: Stop() currently holds u.mutex while calling u.wg.Wait(),
which can deadlock with the recovery goroutine that reacquires u.mutex before
calling u.wg.Done(); fix by adding a boolean stopping flag on
upstreamResolverBase (e.g., u.stopping) protected by u.mutex, set stopping =
true inside Stop() while holding the mutex, call u.cancel(), then unlock and
call u.wg.Wait() outside the lock; in the recovery/disable path (the goroutine
that calls waitUntilResponse() and defers u.wg.Done()) guard any late
u.wg.Add(1) with the same mutex (acquire u.mutex, if u.stopping then skip
starting the goroutine, else call u.wg.Add(1) and release the mutex) so no new
goroutines are added after Stop() begins — update upstreamResolverBase.Stop, the
recovery/disable goroutine (where u.wg.Add/Done and waitUntilResponse() are
used), and add the u.stopping field to ensure safe shutdown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 475d92be-c9d7-4eda-9416-d2dc0b84e438

📥 Commits

Reviewing files that changed from the base of the PR and between 75be643 and 126dd64.

📒 Files selected for processing (1)
  • client/internal/dns/upstream.go
