[client] Fix DNS probe thread safety and avoid blocking engine sync #5576
Conversation
Refactor ProbeAvailability to prevent blocking the engine's sync mutex during slow DNS probes. The probe now derives its context from the server's own context (s.ctx) instead of accepting one from the caller, and uses a mutex to ensure only one probe runs at a time — new calls cancel the previous probe before starting. Also fixes a data race in Stop() when accessing probeCancel without the probe mutex.
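The heart of the change is the serialized probe start. Below is a minimal, self-contained sketch of that pattern, reusing the probeMu, probeCancel, probeWg, and s.ctx names that appear in the walkthrough; the real DefaultServer has many more fields and its actual method body differs.

```go
package dns

import (
	"context"
	"sync"
)

// Reduced stand-in for the real DefaultServer in client/internal/dns/server.go;
// only the probe-coordination fields are shown.
type DefaultServer struct {
	ctx         context.Context
	probeMu     sync.Mutex
	probeCancel context.CancelFunc
	probeWg     sync.WaitGroup
}

// ProbeAvailability cancels a probe that is still running, waits for it to
// wind down, and then starts a new probe whose context derives from the
// server's own context (s.ctx) rather than from the caller.
func (s *DefaultServer) ProbeAvailability() {
	s.probeMu.Lock()
	if s.probeCancel != nil {
		s.probeCancel() // supersede the previous probe
	}
	s.probeWg.Wait()

	probeCtx, cancel := context.WithCancel(s.ctx)
	s.probeCancel = cancel
	s.probeWg.Add(1)
	s.probeMu.Unlock()

	defer func() {
		cancel()
		s.probeWg.Done()
	}()

	// ... probe each registered handler with probeCtx ...
	_ = probeCtx
}
```

Because the new probe's context hangs off s.ctx, shutting the server down should also cancel any in-flight probe instead of waiting on slow DNS queries.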
📝 Walkthrough
ProbeAvailability methods were made context-aware and probe lifecycles were synchronized: DefaultServer and upstream resolvers gained mutexes, cancel functions, and waitgroups to coordinate start/cancel/wait of probes; the engine now launches DNS probes asynchronously.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Engine as Engine
    participant Server as DefaultServer
    participant Upstream as UpstreamResolver
    participant NS as Nameserver
    Engine->>Server: ProbeAvailability(ctx)
    Server->>Server: probeMu.Lock / cancel prior probe / start new probeCtx
    Server->>Upstream: ProbeAvailability(probeCtx)
    Upstream->>Upstream: wg.Add / spawn per-host probes
    Upstream->>NS: testNameserver(baseCtx, externalCtx, addr)
    NS-->>Upstream: success/failure
    Upstream-->>Server: report result / update state
    Upstream->>Upstream: wg.Done for probes
    Server->>Server: probeWg wait / cancel probeCtx when done
```
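The first arrow in the diagram is the part that unblocks the engine: the probe call is launched from its own goroutine instead of running while the engine's sync mutex is held. A self-contained sketch of that hand-off follows; the engine, prober, and syncMsgMux names are illustrative stand-ins rather than the actual types in client/internal/engine.go.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// prober stands in for the DNS server; only ProbeAvailability matters here.
type prober struct{}

func (p *prober) ProbeAvailability() {
	time.Sleep(200 * time.Millisecond) // simulate a slow DNS probe
	fmt.Println("probe finished")
}

type engine struct {
	syncMsgMux sync.Mutex // hypothetical name for the engine's sync mutex
	dns        *prober
}

// handleSync sketches the engine side: the probe runs asynchronously, so the
// sync mutex is released immediately instead of waiting on slow probes.
func (e *engine) handleSync() {
	e.syncMsgMux.Lock()
	defer e.syncMsgMux.Unlock()

	// ... apply the network map / DNS update ...

	go e.dns.ProbeAvailability() // fire-and-forget; does not hold the mutex
}

func main() {
	e := &engine{dns: &prober{}}
	e.handleSync()
	time.Sleep(300 * time.Millisecond) // give the demo probe time to finish
}
```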
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/server.go`:
- Around line 531-539: The loop launches goroutines that read s.dnsMuxMap while
other code (e.g., UpdateDNSServer / updateMux) may rewrite it, causing
concurrent map iteration/write; to fix, while holding s.probeMu create a stable
snapshot slice of the handlers (e.g., []handlerWithStop by iterating s.dnsMuxMap
and appending each mux.handler), then release s.probeMu and range over that
slice to spawn goroutines calling handler.ProbeAvailability(probeCtx) (keep
existing wg logic), ensuring no map access occurs after the lock is released.
In `@client/internal/dns/upstream.go`:
- Around line 116-120: Stop() calls u.wg.Wait() without synchronization while
disable() calls u.wg.Add(1) concurrently; this can panic. Fix by serializing
access to the WaitGroup: acquire u.mutex (the same mutex used in disable())
around the sequence that updates cancellation state and calls u.wg.Wait() in
Stop(), and also perform the u.wg.Add(1) call inside disable() while holding
u.mutex (or check a guarded boolean like u.disabled/u.stopped before calling
Add) so no Add can race with Wait; update the implementations of Stop() and
disable() to use that mutex-protected pattern.
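The server.go comment above asks for a snapshot of the handlers to be taken under s.probeMu before any probe goroutine is spawned. A standalone sketch of that snapshot-then-probe shape is below; handlerWithStop is reduced to the single method used here, muxEntry stands in for the real map value type, and the local WaitGroup replaces the existing wg logic for brevity.

```go
package dns

import (
	"context"
	"sync"
)

// Reduced interface: the real handlerWithStop has more methods.
type handlerWithStop interface {
	ProbeAvailability(ctx context.Context)
}

// muxEntry is a stand-in for the real value type stored in dnsMuxMap.
type muxEntry struct{ handler handlerWithStop }

type DefaultServer struct {
	probeMu   sync.Mutex
	dnsMuxMap map[string]muxEntry
}

// probeHandlers (hypothetical helper) copies the handlers while holding
// probeMu, releases the lock, and only then spawns the probe goroutines, so
// no goroutine ever iterates dnsMuxMap concurrently with writers such as
// UpdateDNSServer / updateMux.
func (s *DefaultServer) probeHandlers(probeCtx context.Context) {
	s.probeMu.Lock()
	handlers := make([]handlerWithStop, 0, len(s.dnsMuxMap))
	for _, mux := range s.dnsMuxMap {
		handlers = append(handlers, mux.handler)
	}
	s.probeMu.Unlock()

	var wg sync.WaitGroup
	for _, handler := range handlers {
		wg.Add(1)
		go func(h handlerWithStop) {
			defer wg.Done()
			h.ProbeAvailability(probeCtx)
		}(handler)
	}
	wg.Wait()
}
```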
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 54237e4b-b421-49a8-b5ba-02727312b638
📒 Files selected for processing (6)
client/internal/dns/local/local.go
client/internal/dns/server.go
client/internal/dns/server_test.go
client/internal/dns/upstream.go
client/internal/dns/upstream_test.go
client/internal/engine.go
Add proper locking to prevent data races when accessing shared resources during DNS probe execution and Stop(). Update handlers snapshot logic to avoid conflicts with concurrent writers.
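For the Stop() half of that plan, one way to keep wg.Add from racing with wg.Wait is a guard flag checked under the same mutex, with the Wait moved outside the lock so a recovery goroutine that needs the mutex cannot wedge shutdown. The sketch below is an assumption-laden illustration: the stopped field does not exist in the original code, and waitUntilResponse is reduced to a stub.

```go
package dns

import (
	"sync"
	"time"
)

// Reduced stand-in for upstreamResolverBase; only the fields involved in the
// Stop()/disable() coordination are shown, and "stopped" is hypothetical.
type upstreamResolverBase struct {
	mutex   sync.Mutex
	wg      sync.WaitGroup
	cancel  func()
	stopped bool
}

// waitUntilResponse stands in for the real recovery loop.
func (u *upstreamResolverBase) waitUntilResponse() {
	time.Sleep(10 * time.Millisecond)
}

// Stop marks the resolver as stopped while holding the mutex, cancels
// outstanding work, then waits outside the lock so goroutines that still need
// the mutex can finish and call Done.
func (u *upstreamResolverBase) Stop() {
	u.mutex.Lock()
	u.stopped = true
	if u.cancel != nil {
		u.cancel()
	}
	u.mutex.Unlock()

	u.wg.Wait()
}

// disable registers its recovery goroutine only if Stop has not begun, so a
// late wg.Add can never race with the wg.Wait in Stop.
func (u *upstreamResolverBase) disable(err error) {
	u.mutex.Lock()
	if u.stopped {
		u.mutex.Unlock()
		return
	}
	u.wg.Add(1)
	u.mutex.Unlock()

	go func() {
		defer u.wg.Done()
		u.waitUntilResponse()
	}()
}
```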
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 397-401: The recovery goroutine and the probe path race on
u.disabled because waitUntilResponse() clears it without holding u.mutex while
disable() reads/sets it; fix by ensuring every access to u.disabled is protected
by u.mutex (e.g., add u.mutex.Lock()/Unlock() around reads/writes in
waitUntilResponse() and in disable(), or introduce helper methods like
setDisabled(bool) and isDisabled() that acquire the mutex), and update any other
code paths touching u.disabled to use the same mutex-protected accessors so all
reads/writes are synchronized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5e38232c-85d0-4c81-99fc-504bc8833bd7
📒 Files selected for processing (1)
client/internal/dns/upstream.go
Actionable comments posted: 1
♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)
389-401: ⚠️ Potential issue | 🔴 Critical
u.disabled is still racing with the recovery goroutine.
disable() is effectively serialized by u.mutex, but waitUntilResponse() still clears the same flag at Line 375 with no synchronization. That can double-start the recovery loop or skip a needed deactivate/reactivate cycle depending on timing. Reusing u.mutex for this flag will deadlock with Stop()'s locked wg.Wait(), so this likely needs an atomic.Bool or a dedicated state lock instead.

```bash
#!/bin/bash
echo "=== disabled access sites ==="
rg -n -C2 'u\.disabled' client/internal/dns/upstream.go
echo
echo "=== WaitGroup and mutex coordination ==="
rg -n -C2 'wg\.(Add|Wait|Done)|mutex\.(Lock|Unlock)' client/internal/dns/upstream.go
```

🐛 Safer direction

```diff
 type upstreamResolverBase struct {
-	disabled     bool
+	disabled     atomic.Bool
 	successCount atomic.Int32
 	mutex        sync.Mutex
@@
-	u.disabled = false
+	u.disabled.Store(false)
 }
@@
 func (u *upstreamResolverBase) disable(err error) {
-	if u.disabled {
+	if !u.disabled.CompareAndSwap(false, true) {
 		return
 	}
@@
-	u.disabled = true
 	u.wg.Add(1)
 	go func() {
 		defer u.wg.Done()
 		u.waitUntilResponse()
 	}()
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 389 - 401, The u.disabled flag is raced between disable() and the recovery goroutine in waitUntilResponse() because mutex use would deadlock with Stop()/wg.Wait(); change the boolean field u.disabled to an atomic.Bool (or another atomic/state lock) and replace all reads/writes in disable(), waitUntilResponse(), and any other access sites with atomic operations (Load, Store, and where appropriate CompareAndSwap) so the disable/reactivate flow is serialized without acquiring u.mutex and without introducing wg deadlocks; ensure disable() checks atomic.Load(&u.disabled) and sets via atomic.Store/CompareAndSwap, and waitUntilResponse() clears/reactivates using atomic operations and still coordinates wg.Done() as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 285-310: The probe goroutine in the upstream worker logs and
appends every testNameserver error even when the probe was intentionally
canceled; before appending to errs or calling log.Warnf in the anonymous
goroutine (the closure that calls u.testNameserver with upstream), check if
ctx.Err() != nil or u.ctx.Err() != nil and if so simply return without
logging/appending; otherwise proceed to append the error to errs and warn as
before. Ensure you reference the same ctx and u.ctx used to control cancellation
so superseded probes bail out quietly.
---
Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 389-401: The u.disabled flag is raced between disable() and the
recovery goroutine in waitUntilResponse() because mutex use would deadlock with
Stop()/wg.Wait(); change the boolean field u.disabled to an atomic.Bool (or
another atomic/state lock) and replace all reads/writes in disable(),
waitUntilResponse(), and any other access sites with atomic operations (Load,
Store, and where appropriate CompareAndSwap) so the disable/reactivate flow is
serialized without acquiring u.mutex and without introducing wg deadlocks;
ensure disable() checks atomic.Load(&u.disabled) and sets via
atomic.Store/CompareAndSwap, and waitUntilResponse() clears/reactivates using
atomic operations and still coordinates wg.Done() as before.
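The check the inline comment asks for amounts to treating context.Canceled from a superseded probe as expected rather than as a failure. A small self-contained demo of that bail-out pattern follows; probeOnce stands in for testNameserver and the surrounding error aggregation is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeOnce stands in for testNameserver: it returns the context error when
// the probe context is canceled before the simulated query completes.
func probeOnce(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a probe superseded by a newer one

	if err := probeOnce(ctx); err != nil {
		// Expected cancellation: bail out quietly instead of aggregating the
		// error and logging a warning.
		if errors.Is(err, context.Canceled) || ctx.Err() != nil {
			fmt.Println("probe superseded; not recorded as a failure")
			return
		}
		fmt.Println("probe failed:", err)
	}
}
```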
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8b01fb9e-f7ca-4107-badb-09e6df278a7d
📒 Files selected for processing (1)
client/internal/dns/upstream.go
♻️ Duplicate comments (2)
client/internal/dns/upstream.go (2)
287-292: ⚠️ Potential issue | 🟡 Minor
Skip expected errors from canceled probes.
With the new "cancel the previous probe" flow, a superseded probe now returns context.Canceled here during normal operation. Lines 289-292 still append and warn on that error, and the cancellation check at Lines 304-310 happens too late to prevent noisy false failures. Bail out before multierror.Append / Warnf when ctx.Err() or u.ctx.Err() is already set.

🔇 Quietly ignore superseded probe cancellations

```diff
 go func(upstream netip.AddrPort) {
 	defer wg.Done()
 	err := u.testNameserver(u.ctx, ctx, upstream, 500*time.Millisecond)
 	if err != nil {
+		if (ctx != nil && ctx.Err() != nil) || u.ctx.Err() != nil {
+			return
+		}
 		mu.Lock()
 		errs = multierror.Append(errs, err)
 		mu.Unlock()
 		log.Warnf("probing upstream nameserver %s: %s", upstream, err)
 		return
```

```bash
#!/bin/bash
set -euo pipefail
echo "=== ProbeAvailability() error handling ==="
sed -n '283,310p' client/internal/dns/upstream.go
```

Also applies to: 304-310
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 287 - 292, The probe routine should ignore expected context cancellations before treating them as errors: after calling u.testNameserver(u.ctx, ctx, upstream, ...) (and in the later cancellation-handling block around the same probe flow) check whether ctx.Err() != nil or u.ctx.Err() != nil and if so skip multierror.Append and log.Warnf for that probe; specifically, before appending to errs or warning in the logic referencing u.testNameserver and the subsequent cancellation-check block, bail out early when either context is canceled so superseded probes are not recorded as errors.
116-122: ⚠️ Potential issue | 🔴 Critical
Stop() can deadlock with a recovering probe.
Line 121 waits on u.wg while u.mutex is held, but the goroutine started at Lines 400-404 does not call Done() until waitUntilResponse() returns, and that success path tries to reacquire the same mutex at Lines 375-377. If shutdown races with a just-recovered upstream, the recovery goroutine blocks on u.mutex, Done() never runs, and Stop() hangs forever. Move the Wait() out from under the lock and gate late wg.Add(1) calls with a stopping flag (or equivalent) protected by the same mutex.

🐛 Safer shutdown pattern

```diff
 type upstreamResolverBase struct {
 	...
 	mutex            sync.Mutex
+	stopping         bool
 	reactivatePeriod time.Duration
 	upstreamTimeout  time.Duration
 	wg               sync.WaitGroup
 	...
 }

 func (u *upstreamResolverBase) Stop() {
 	log.Debugf("stopping serving DNS for upstreams %s", u.upstreamServers)
-	u.cancel()
 	u.mutex.Lock()
-	u.wg.Wait()
+	if u.stopping {
+		u.mutex.Unlock()
+		return
+	}
+	u.stopping = true
+	u.cancel()
 	u.mutex.Unlock()
+	u.wg.Wait()
 }

 func (u *upstreamResolverBase) disable(err error) {
+	u.mutex.Lock()
+	defer u.mutex.Unlock()
-	if u.disabled {
+	if u.disabled || u.stopping {
 		return
 	}
 	...
 	u.wg.Add(1)
 	go func() {
 		defer u.wg.Done()
 		u.waitUntilResponse()
 	}()
 }
```

```bash
#!/bin/bash
set -euo pipefail
echo "=== Stop() holds the mutex across Wait() ==="
sed -n '116,123p' client/internal/dns/upstream.go
echo
echo "=== waitUntilResponse() reacquires the same mutex before returning ==="
sed -n '372,378p' client/internal/dns/upstream.go
echo
echo "=== disable() defers Done() until waitUntilResponse() returns ==="
sed -n '396,404p' client/internal/dns/upstream.go
```

Also applies to: 375-377, 400-404
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 116 - 122, Stop() currently holds u.mutex while calling u.wg.Wait(), which can deadlock with the recovery goroutine that reacquires u.mutex before calling u.wg.Done(); fix by adding a boolean stopping flag on upstreamResolverBase (e.g., u.stopping) protected by u.mutex, set stopping = true inside Stop() while holding the mutex, call u.cancel(), then unlock and call u.wg.Wait() outside the lock; in the recovery/disable path (the goroutine that calls waitUntilResponse() and defers u.wg.Done()) guard any late u.wg.Add(1) with the same mutex (acquire u.mutex, if u.stopping then skip starting the goroutine, else call u.wg.Add(1) and release the mutex) so no new goroutines are added after Stop() begins — update upstreamResolverBase.Stop, the recovery/disable goroutine (where u.wg.Add/Done and waitUntilResponse() are used), and add the u.stopping field to ensure safe shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 287-292: The probe routine should ignore expected context
cancellations before treating them as errors: after calling
u.testNameserver(u.ctx, ctx, upstream, ...) (and in the later
cancellation-handling block around the same probe flow) check whether ctx.Err()
!= nil or u.ctx.Err() != nil and if so skip multierror.Append and log.Warnf for
that probe; specifically, before appending to errs or warning in the logic
referencing u.testNameserver and the subsequent cancellation-check block, bail
out early when either context is canceled so superseded probes are not recorded
as errors.
- Around line 116-122: Stop() currently holds u.mutex while calling u.wg.Wait(),
which can deadlock with the recovery goroutine that reacquires u.mutex before
calling u.wg.Done(); fix by adding a boolean stopping flag on
upstreamResolverBase (e.g., u.stopping) protected by u.mutex, set stopping =
true inside Stop() while holding the mutex, call u.cancel(), then unlock and
call u.wg.Wait() outside the lock; in the recovery/disable path (the goroutine
that calls waitUntilResponse() and defers u.wg.Done()) guard any late
u.wg.Add(1) with the same mutex (acquire u.mutex, if u.stopping then skip
starting the goroutine, else call u.wg.Add(1) and release the mutex) so no new
goroutines are added after Stop() begins — update upstreamResolverBase.Stop, the
recovery/disable goroutine (where u.wg.Add/Done and waitUntilResponse() are
used), and add the u.stopping field to ensure safe shutdown.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 475d92be-c9d7-4eda-9416-d2dc0b84e438
📒 Files selected for processing (1)
client/internal/dns/upstream.go



Refactor ProbeAvailability to prevent blocking the engine's sync mutex during slow DNS probes.
Summary by CodeRabbit: Bug Fixes, Performance, Chores