[client] stop upstream retry loop immediately on context cancellation #5403

Merged
pappz merged 2 commits into main from fix/close-dns-probes on Feb 20, 2026
Conversation

@pappz pappz commented Feb 20, 2026

stop upstream retry loop immediately on context cancellation

Describe your changes

Issue ticket number and link

Stack

Checklist

  • It is a bug fix
  • It is a typo/documentation fix
  • It is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes
    • Bound upstream retry logic to request context so retries stop when operations are cancelled.
    • Improved cancellation handling to avoid treating cancelled retries as errors.
    • Clarified logging to distinguish cancellation events from genuine upstream errors.

Commit: prevent duplicate waitUntilResponse goroutines and stale handler re-registration

Three related bugs in upstreamResolverBase:

1. disable() had no lock, so concurrent ProbeAvailability() calls (one per
  registered domain for the same upstream group) could all race past the
  u.disabled guard and each spawn a separate waitUntilResponse goroutine.
  Fixed by splitting disable() into disable() (acquires mutex) and
  disableLocked() (assumes mutex held), with ProbeAvailability() using
  the latter since it already holds the mutex.

2. waitUntilResponse() used a plain ExponentialBackOff, so after Stop()
  cancelled u.ctx the goroutine would keep sleeping up to reactivatePeriod
  (30s) before noticing the cancellation. On wake it would call reactivate(),
  re-registering a discarded stale handler into the live DNS mux.
  Fixed by wrapping with backoff.WithContext so Retry returns immediately
  when the context is cancelled.

3. Context cancellation logged a bare "context canceled" WARN with no
  indication of which upstreams were affected. Downgraded to Debug and
  include the upstream server addresses in the message.
coderabbitai Bot commented Feb 20, 2026

📝 Walkthrough

waitUntilResponse now uses a context-bound backoff (backoff.WithContext) and differentiates retry-loop exit logging between cancellations (debug) and other errors (warning).

Changes

DNS upstream retry/backoff — client/internal/dns/upstream.go
Replaced the plain backoff with backoff.WithContext(..., u.ctx) in waitUntilResponse; updated retry-loop exit logging to log at debug level on context cancellation and at warning level on other errors; removed the previous unconditional warning.
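The effect of binding the retry pause to the context can be illustrated with a stdlib-only sketch. The actual change uses cenkalti/backoff's WithContext; `sleepOrCancel` here is a hypothetical helper approximating what a context-bound backoff does between attempts:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepOrCancel waits up to d but returns early when ctx is cancelled.
// This is the behavior a context-bound backoff adds to each retry pause;
// a plain time.Sleep(d) would ignore cancellation and keep the goroutine
// alive for the full interval (up to 30s in the bug described above).
func sleepOrCancel(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel() // analogous to Stop() cancelling u.ctx
	}()

	start := time.Now()
	// A 30-second retry pause, like reactivatePeriod in the bug report.
	err := sleepOrCancel(ctx, 30*time.Second)
	// The wait ends in ~50ms instead of 30s.
	fmt.Println(err, time.Since(start) < time.Second)
}
```

The key point is that the pause and the cancellation signal race in one `select`, so cancellation is observed immediately rather than on the next wake-up.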

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested reviewers

  • lixmal
  • mlsmaycon

Poem

🐰 In tunnels of code I softly tread,
Backoff now holds my thread,
If canceled, I whisper low,
Else I shout so logs will know,
Hoppity-hop — retries in bed.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Description check — ✅ Passed: The PR description follows the template structure with completed checklist, issue identification as a bug fix, and documentation rationale clearly stated.
Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
Title check — ✅ Passed: The title accurately reflects the main change: stopping the upstream retry loop immediately on context cancellation by wrapping the backoff with backoff.WithContext.



@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)

364-368: ⚠️ Potential issue | 🟠 Major

u.disabled written without the mutex — data race survives.

disableLocked() correctly reads and writes u.disabled under the mutex, but waitUntilResponse() sets u.disabled = false (line 367) and calls u.reactivate() (line 366) without holding it. A concurrent disable() or ProbeAvailability() call can race on this field, which is the class of bug this PR aims to fix.

Suggested fix
 	log.Infof("upstreams %s are responsive again. Adding them back to system", u.upstreamServersString())
 	u.successCount.Add(1)
-	u.reactivate()
-	u.disabled = false
+
+	u.mutex.Lock()
+	u.disabled = false
+	u.mutex.Unlock()
+
+	u.reactivate()
 }

The reactivate() callback acquires a different mutex (s.mux in the server), not u.mutex, so there is no deadlock risk from calling it either inside or outside the lock.

🧹 Nitpick comments (1)
client/internal/dns/upstream.go (1)

354-362: Inconsistent logging for context cancellation. backoff.WithContext correctly makes the retry sleep cancellable, but the operation's select at line 336 breaks the error chain when context is cancelled at operation start. If context is cancelled during sleep, Retry returns context.Canceled directly (Debug log). If cancelled before the operation starts, it returns the wrapped fmt.Errorf (Warn log). Wrap the context error to preserve the error type:

 	select {
 	case <-u.ctx.Done():
-		return backoff.Permanent(fmt.Errorf("exiting upstream retry loop for upstreams %s: parent context has been canceled", u.upstreamServersString()))
+		return backoff.Permanent(fmt.Errorf("exiting upstream retry loop for upstreams %s: %w", u.upstreamServersString(), u.ctx.Err()))
 	default:
 	}

lixmal previously approved these changes Feb 20, 2026

@pappz pappz changed the title [client] prevent duplicate waitUntilResponse goroutines and stale handler re-registration [client] stop upstream retry loop immediately on context cancellation Feb 20, 2026
@pappz pappz merged commit 2a26cb4 into main Feb 20, 2026
42 checks passed
@pappz pappz deleted the fix/close-dns-probes branch February 20, 2026 13:44