[client] stop upstream retry loop immediately on context cancellation #5403
…dler re-registration

Three related bugs in `upstreamResolverBase`:

1. `disable()` had no lock, so concurrent `ProbeAvailability()` calls (one per registered domain for the same upstream group) could all race past the `u.disabled` guard and each spawn a separate `waitUntilResponse` goroutine. Fixed by splitting `disable()` into `disable()` (acquires the mutex) and `disableLocked()` (assumes the mutex is held), with `ProbeAvailability()` using the latter since it already holds the mutex (see the sketch after this list).
2. `waitUntilResponse()` used a plain `ExponentialBackOff`, so after `Stop()` cancelled `u.ctx` the goroutine would keep sleeping for up to `reactivatePeriod` (30s) before noticing the cancellation. On wake it would call `reactivate()`, re-registering a discarded stale handler into the live DNS mux. Fixed by wrapping with `backoff.WithContext` so `Retry` returns immediately when the context is cancelled.
3. Context cancellation logged a bare "context canceled" WARN with no indication of which upstreams were affected. Downgraded to Debug and included the upstream server addresses in the message.
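For illustration, here is a minimal sketch of the locked/unlocked split described in point 1. The names mirror the description above, but the bodies are hypothetical, not the actual netbird code:

```go
package dns

import "sync"

type upstreamResolverBase struct {
	mutex    sync.Mutex
	disabled bool
}

// disable is the general entry point: it acquires the mutex itself.
func (u *upstreamResolverBase) disable() {
	u.mutex.Lock()
	defer u.mutex.Unlock()
	u.disableLocked()
}

// disableLocked assumes u.mutex is already held by the caller, making the
// check-and-set on u.disabled atomic so only one recovery goroutine is
// spawned per outage, no matter how many callers race in.
func (u *upstreamResolverBase) disableLocked() {
	if u.disabled {
		return
	}
	u.disabled = true
	go u.waitUntilResponse()
}

// ProbeAvailability already holds the mutex for its own bookkeeping, so it
// calls the *Locked variant instead of deadlocking on a re-lock.
func (u *upstreamResolverBase) ProbeAvailability() {
	u.mutex.Lock()
	defer u.mutex.Unlock()
	// ... probe each upstream; if all of them fail:
	u.disableLocked()
}

func (u *upstreamResolverBase) waitUntilResponse() {
	// retry loop elided; see the backoff sketch below
}
```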
📝 Walkthrough: `waitUntilResponse` now uses a context-bound backoff (`backoff.WithContext`), so the retry loop exits as soon as `u.ctx` is cancelled.
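To make the behavioral change in point 2 concrete, here is a runnable sketch, assuming the `github.com/cenkalti/backoff/v4` library: with `backoff.WithContext`, the sleep inside `Retry` is cancellable, so the loop returns as soon as the context is cancelled rather than after the pending backoff interval elapses.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = 0             // retry forever, like the upstream probe loop
	b.MaxInterval = 30 * time.Second // stand-in for reactivatePeriod

	// Cancel while the retry loop is sleeping between attempts.
	go func() {
		time.Sleep(100 * time.Millisecond)
		cancel()
	}()

	start := time.Now()
	err := backoff.Retry(func() error {
		return errors.New("upstream still unresponsive") // force another retry
	}, backoff.WithContext(b, ctx))

	// Prints "context canceled" after ~100ms; a plain ExponentialBackOff
	// would keep sleeping until the current interval expired.
	fmt.Println(err, "after", time.Since(start).Round(time.Millisecond))
}
```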
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)
364-368: ⚠️ Potential issue | 🟠 Major

`u.disabled` written without the mutex — data race survives.

`disableLocked()` correctly reads and writes `u.disabled` under the mutex, but `waitUntilResponse()` sets `u.disabled = false` (line 367) and calls `u.reactivate()` (line 366) without holding it. A concurrent `disable()` or `ProbeAvailability()` call can race on this field, which is the class of bug this PR aims to fix.

Suggested fix:
```diff
 	log.Infof("upstreams %s are responsive again. Adding them back to system", u.upstreamServersString())
 	u.successCount.Add(1)
-	u.reactivate()
-	u.disabled = false
+
+	u.mutex.Lock()
+	u.disabled = false
+	u.mutex.Unlock()
+
+	u.reactivate()
 }
```

The `reactivate()` callback acquires a different mutex (`s.mux` in the server), not `u.mutex`, so there is no deadlock risk from calling it either inside or outside the lock.
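For context, a standalone repro of this bug class (hypothetical code, not the actual resolver): running it with `go run -race` makes the detector flag the unsynchronized write racing with the locked one.

```go
package main

import (
	"sync"
	"time"
)

type resolver struct {
	mu       sync.Mutex
	disabled bool
}

func (r *resolver) disable() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.disabled = true // synchronized write
}

func (r *resolver) markResponsive() {
	r.disabled = false // racy: no r.mu held, as in the pre-fix waitUntilResponse
}

func main() {
	r := &resolver{}
	go r.disable()
	go r.markResponsive()
	time.Sleep(50 * time.Millisecond) // let both goroutines run
}
```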
🧹 Nitpick comments (1)
client/internal/dns/upstream.go (1)
354-362: Inconsistent logging for context cancellation.

`backoff.WithContext` correctly makes the retry sleep cancellable, but the operation's `select` at line 336 breaks the error chain when the context is cancelled at operation start. If the context is cancelled during sleep, `Retry` returns `context.Canceled` directly (Debug log). If cancelled before the operation starts, it returns the wrapped `fmt.Errorf` (Warn log). Wrap the context error to preserve the error type:

```diff
 select {
 case <-u.ctx.Done():
-	return backoff.Permanent(fmt.Errorf("exiting upstream retry loop for upstreams %s: parent context has been canceled", u.upstreamServersString()))
+	return backoff.Permanent(fmt.Errorf("exiting upstream retry loop for upstreams %s: %w", u.upstreamServersString(), u.ctx.Err()))
 default:
 }
```
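A small standalone illustration of why the `%w` verb matters here, using only the standard library: `errors.Is` matches the `context.Canceled` sentinel through a `%w` wrap but not through `%v`, which is exactly what decides between the Debug and Warn paths described above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	wrapped := fmt.Errorf("exiting upstream retry loop: %w", ctx.Err())
	opaque := fmt.Errorf("exiting upstream retry loop: %v", ctx.Err())

	fmt.Println(errors.Is(wrapped, context.Canceled)) // true: sentinel preserved
	fmt.Println(errors.Is(opaque, context.Canceled))  // false: chain broken
}
```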


