[client] stop upstream retry loop immediately on context cancellation #5403

Merged
pappz merged 2 commits into main from fix/close-dns-probes on Feb 20, 2026
Conversation

@pappz pappz commented Feb 20, 2026

stop upstream retry loop immediately on context cancellation

Describe your changes

Issue ticket number and link

Stack

Checklist

  • It is a bug fix
  • It is a typo/documentation fix
  • It is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes
    • Bound upstream retry logic to request context so retries stop when operations are cancelled.
    • Improved cancellation handling to avoid treating cancelled retries as errors.
    • Clarified logging to distinguish cancellation events from genuine upstream errors.

Commit: prevent duplicate waitUntilResponse goroutines and stale handler re-registration

Three related bugs in upstreamResolverBase:

1. disable() had no lock, so concurrent ProbeAvailability() calls (one per
  registered domain for the same upstream group) could all race past the
  u.disabled guard and each spawn a separate waitUntilResponse goroutine.
  Fixed by splitting disable() into disable() (acquires mutex) and
  disableLocked() (assumes mutex held), with ProbeAvailability() using
  the latter since it already holds the mutex.

2. waitUntilResponse() used a plain ExponentialBackOff, so after Stop()
  cancelled u.ctx the goroutine would keep sleeping up to reactivatePeriod
  (30s) before noticing the cancellation. On wake it would call reactivate(),
  re-registering a discarded stale handler into the live DNS mux.
  Fixed by wrapping with backoff.WithContext so Retry returns immediately
  when the context is cancelled.

3. Context cancellation logged a bare "context canceled" WARN with no
  indication of which upstreams were affected. Downgraded to Debug and
  include the upstream server addresses in the message.
coderabbitai Bot commented Feb 20, 2026

📝 Walkthrough

waitUntilResponse now uses a context-bound backoff (backoff.WithContext) and differentiates retry-loop exit logging between cancellations (debug) and other errors (warning).

Changes

DNS upstream retry/backoff — client/internal/dns/upstream.go
Replaced the plain backoff with backoff.WithContext(..., u.ctx) in waitUntilResponse; updated retry-loop exit logging to log at debug level on context cancellation and at warning level on other errors; removed the previous unconditional warning.
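The effect of binding the retry pause to the context can be illustrated with a stdlib-only sketch. The actual change uses cenkalti/backoff's WithContext; `sleepOrCancel` here is a hypothetical helper approximating what a context-bound backoff does between attempts:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepOrCancel waits up to d but returns early when ctx is cancelled.
// This is the behavior a context-bound backoff adds to each retry pause;
// a plain time.Sleep(d) would ignore cancellation and keep the goroutine
// alive for the full interval (up to 30s in the bug described above).
func sleepOrCancel(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel() // analogous to Stop() cancelling u.ctx
	}()

	start := time.Now()
	// A 30-second retry pause, like reactivatePeriod in the bug report.
	err := sleepOrCancel(ctx, 30*time.Second)
	// The wait ends in ~50ms instead of 30s.
	fmt.Println(err, time.Since(start) < time.Second)
}
```

The key point is that the pause and the cancellation signal race in one `select`, so cancellation is observed immediately rather than on the next wake-up.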

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested reviewers

  • lixmal
  • mlsmaycon

Poem

🐰 In tunnels of code I softly tread,
Backoff now holds my thread,
If canceled, I whisper low,
Else I shout so logs will know,
Hoppity-hop — retries in bed.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Description check — ✅ Passed: The PR description follows the template structure with completed checklist, issue identification as a bug fix, and documentation rationale clearly stated.
Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
Title check — ✅ Passed: The title accurately reflects the main change: stopping the upstream retry loop immediately on context cancellation by wrapping the backoff with backoff.WithContext.



@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)

364-368: ⚠️ Potential issue | 🟠 Major

u.disabled written without the mutex — data race survives.

disableLocked() correctly reads and writes u.disabled under the mutex, but waitUntilResponse() sets u.disabled = false (line 367) and calls u.reactivate() (line 366) without holding it. A concurrent disable() or ProbeAvailability() call can race on this field, which is the class of bug this PR aims to fix.

Suggested fix
 	log.Infof("upstreams %s are responsive again. Adding them back to system", u.upstreamServersString())
 	u.successCount.Add(1)
-	u.reactivate()
-	u.disabled = false
+
+	u.mutex.Lock()
+	u.disabled = false
+	u.mutex.Unlock()
+
+	u.reactivate()
 }

The reactivate() callback acquires a different mutex (s.mux in the server), not u.mutex, so there is no deadlock risk from calling it either inside or outside the lock.

🧹 Nitpick comments (1)
client/internal/dns/upstream.go (1)

354-362: Inconsistent logging for context cancellation. backoff.WithContext correctly makes the retry sleep cancellable, but the operation's select at line 336 breaks the error chain when context is cancelled at operation start. If context is cancelled during sleep, Retry returns context.Canceled directly (Debug log). If cancelled before the operation starts, it returns the wrapped fmt.Errorf (Warn log). Wrap the context error to preserve the error type:

 	select {
 	case <-u.ctx.Done():
-		return backoff.Permanent(fmt.Errorf("exiting upstream retry loop for upstreams %s: parent context has been canceled", u.upstreamServersString()))
+		return backoff.Permanent(fmt.Errorf("exiting upstream retry loop for upstreams %s: %w", u.upstreamServersString(), u.ctx.Err()))
 	default:
 	}

lixmal previously approved these changes Feb 20, 2026

@pappz pappz changed the title [client] prevent duplicate waitUntilResponse goroutines and stale handler re-registration [client] stop upstream retry loop immediately on context cancellation Feb 20, 2026
@pappz pappz merged commit 2a26cb4 into main Feb 20, 2026
42 checks passed
@pappz pappz deleted the fix/close-dns-probes branch February 20, 2026 13:44