
[client] Add TTL-based refresh to mgmt DNS cache via handler chain#5945

Merged
lixmal merged 8 commits into main from
mgmt-cache-ttl-refresh
Apr 22, 2026
Conversation

@lixmal
Collaborator

@lixmal lixmal commented Apr 21, 2026

Describe your changes

The mgmt DNS cache was previously populated once and never refreshed, so if the mgmt/signal/relay/etc. hostname's IP changed (DDNS, cloud IP rotation, failover) the client would keep using the stale address until restart. This PR makes cache entries expire and refresh on a TTL, using stale-while-revalidate semantics so queries never block on a lookup.

The tricky part is where to send the refresh query. When NetBird has taken over the OS's DNS resolver, calling net.DefaultResolver would loop straight back into the mgmt cache itself (we'd serve our own stale answer to ourselves and "succeed"). So the refresh prefers the internal handler chain, bypassing the mgmt-cache priority and routing through the upstream / default / fallback handlers that actually know how to reach an external resolver. The OS fallback is only used when the chain has no root-zone handler at or below PriorityUpstream, which by construction means NetBird is not the system resolver and net.DefaultResolver is safe.
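The chain-versus-OS decision above can be sketched as follows. The `chain` type, `hasRootHandlerAtOrBelow`, `pickLookupPath`, and the priority constant are all illustrative stand-ins for the real handler chain and its HasRootHandlerAtOrBelow predicate.

```go
package main

import "fmt"

// chain stands in for the handler chain; it tracks the priorities of
// registered root-zone (".") handlers.
type chain struct {
	rootHandlerPriorities []int
}

const priorityUpstream = 100 // illustrative value, not the real constant

func (c *chain) hasRootHandlerAtOrBelow(max int) bool {
	for _, p := range c.rootHandlerPriorities {
		if p <= max {
			return true
		}
	}
	return false
}

// pickLookupPath prefers the chain; the OS resolver is only safe when no
// root-zone handler exists at or below the upstream priority, which by
// construction means NetBird is not the system resolver.
func pickLookupPath(c *chain) string {
	if c.hasRootHandlerAtOrBelow(priorityUpstream) {
		return "chain"
	}
	return "os"
}

func main() {
	withRoot := &chain{rootHandlerPriorities: []int{priorityUpstream}}
	fmt.Println(pickLookupPath(withRoot)) // chain
	fmt.Println(pickLookupPath(&chain{})) // os
}
```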

Pitfalls worth knowing:

  • Loop case: if the predicate ever gets it wrong (NetBird is the system resolver but no chain root handler), the OS path would loop. The flow is self-limiting via singleflight (at most one extra parked goroutine per cycle) but silently wrong: every "refresh" returns our own stale answer with a fresh timestamp. A lightweight detector arms only on the OS path and logs a single warn if a ServeDNS hit lands during an inflight OS-fallback refresh for the same question.
  • Fallback availability: the predicate depends on a root-zone handler being registered. This is true for NetBird-as-default-resolver (Primary nsgroup or addHostRootZone), and for Linux hosts where hostManagerWithOriginalNS captures the pre-takeover /etc/resolv.conf as a fallback. It does not fire on systemd-resolved Linux when no Primary nsgroup is configured (that manager doesn't expose original nameservers) — but that scenario is also the one where the OS path is safe, so it works out.
  • Persistent-failure noise: each failed refresh fires every ~30s (backoff window). The first failure logs at warn, subsequent failures at debug, so a broken upstream doesn't flood the log.
  • Goroutine fan-out: one refresh goroutine per stale hit is skipped when a refresh is already inflight for that question, keeping things quiet under bursty load.

The NB_MGMT_CACHE_TTL environment variable shortens the default 5-minute TTL for integration testing.

Issue ticket number and link

#5113

Stack

Checklist

  • It is a bug fix
  • It is a typo/documentation fix
  • It is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Internal behavior change; no user-facing surface or CLI/API.

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features

    • In-process DNS resolution API with root-handler detection by priority.
  • Improvements

    • Priority-aware handler dispatching and captured responses for handlers that write raw bytes.
    • Management cache: per-entry timestamps, stale-while-revalidate serving, async refresh with singleflight deduplication, backoff, and separate A/AAAA handling.
    • Chain-first resolution with OS fallback.
  • Bug Fixes

    • Better context/deadline handling and clearer failure signals when no handler returns a usable response.
  • Tests

    • Expanded coverage for dispatch, priority filtering, raw-response capture, timeouts, refresh/backoff, root detection, and cache refresh behaviors.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

Important

Review skipped

This PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b44ea6ea-a2e0-4b78-a864-8c3dce0dd45a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


Adds in-process handler-chain dispatch (ResolveInternal) with priority-aware filtering and response capture; exposes HasRootHandlerAtOrBelow. Management resolver cache now stores per-question records with stale-while-revalidate, singleflight refreshes, backoff/loop detection, per-family A/AAAA handling, and prefers resolving via the handler chain; tests and server wiring updated.

Changes

  • Handler Chain Core (client/internal/dns/handler_chain.go): Extracted the dispatch loop into dispatch(w, r, maxPriority); ServeDNS delegates to it. Added ResolveInternal(ctx, r, maxPriority), HasRootHandlerAtOrBelow(maxPriority), and internalResponseWriter to capture dns.Msg (supports WriteMsg and raw Write).
  • Handler Chain Tests (client/internal/dns/handler_chain_test.go): Added test handlers and suites for ResolveInternal (priority gating, empty-question error, raw Write capture, context deadline) and HasRootHandlerAtOrBelow behaviors.
  • Management Resolver Cache (client/internal/dns/mgmt/mgmt.go): Replaced map[dns.Question][]dns.RR with map[dns.Question]*cachedRecord (adds timestamps and refresh state); implemented stale-while-revalidate with async singleflight refresh, refresh backoff, inflight-refresh loop detection, a lookupViaChain/osLookup split, and per-family A/AAAA semantics in AddDomain/RemoveDomain.
  • Resolver Refresh Tests (client/internal/dns/mgmt/mgmt_refresh_test.go): New tests with a fakeChain validating fresh vs stale serving, async refresh/collapse, failure backoff suppression, loop-detection behavior, and AddDomain chain resolution.
  • Server Integration & Tests (client/internal/dns/server.go, client/internal/dns/mgmt/mgmt_test.go): Wired the management cache resolver to use the handler chain (SetChainResolver); added cache-TTL parsing and Resolver response-TTL tests.
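The response-capture idea behind internalResponseWriter can be sketched as below. The types here are deliberately simplified stand-ins for dns.ResponseWriter and dns.Msg from miekg/dns, not the real interfaces.

```go
package main

import "fmt"

// msg stands in for dns.Msg.
type msg struct{ answers []string }

// responseWriter stands in for the subset of dns.ResponseWriter we need.
type responseWriter interface {
	WriteMsg(*msg) error
}

// captureWriter stores the reply instead of sending it over a socket,
// so an in-process caller (the ResolveInternal role) can read it back.
type captureWriter struct{ response *msg }

func (w *captureWriter) WriteMsg(m *msg) error {
	w.response = m // capture instead of writing to the network
	return nil
}

// handler plays the role of a chain handler: it "answers" by calling
// WriteMsg on whatever writer it is given.
func handler(w responseWriter) {
	_ = w.WriteMsg(&msg{answers: []string{"203.0.113.10"}})
}

func main() {
	w := &captureWriter{}
	handler(w)
	if w.response != nil {
		fmt.Println(w.response.answers)
	}
}
```

Because handlers only ever see the interface, the same chain code serves both real sockets and in-process resolution.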

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Resolver as Management Resolver
    participant Chain as Handler Chain
    participant OS as OS Resolver

    Client->>Resolver: ServeDNS(query)
    Resolver->>Resolver: Check cache (fresh / stale / miss)
    alt Fresh
        Resolver->>Client: Return cached answer
    else Stale
        Resolver->>Client: Return stale answer
        Resolver->>Resolver: Schedule async refresh (singleflight)
        par Async refresh
            Resolver->>Chain: ResolveInternal(msg, maxPriority)
            alt Chain has root handler ≤ maxPriority
                Chain->>Chain: dispatch (priority-filtered handlers)
                Chain->>Resolver: Return dns.Msg
            else No suitable root handler
                Resolver->>OS: Fallback OS lookup
                OS->>Resolver: Return records
            end
            Resolver->>Resolver: Update cache or mark failure/backoff
        and Concurrent callers...
            Client->>Resolver: Subsequent queries (refresh collapsed)
        end
    else Cache miss
        Resolver->>Chain: ResolveInternal(...)
        alt Chain has root handler
            Chain->>Resolver: Return dns.Msg
        else
            Resolver->>OS: OS lookup
            OS->>Resolver: Return records
        end
        Resolver->>Resolver: Cache result
        Resolver->>Client: Serve answer
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • pappz
  • mlsmaycon

Poem

🐰 I hopped through chains both near and far,
I unwrapped writes and caught each byte,
Stale replies hummed while singleflight buzzed,
I sniffed one loop and kept the cache polite,
Now chain and resolver dance through night. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 18.75%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): The title accurately and concisely describes the primary change: adding TTL-based refresh to the mgmt DNS cache via the handler chain, which is the core objective of the PR.
  • Description check (✅ Passed): The description covers the changes, explains the problem being solved, describes the implementation strategy, documents pitfalls and safeguards, and completes all required template sections, including checklist items and the documentation rationale.
  • Linked Issues check (✅ Passed): Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/handler_chain.go`:
- Around line 337-344: Resolve the issue where internalResponseWriter.Write
ignores raw DNS packets by updating the internalResponseWriter.Write
implementation to unpack the provided byte slice into a dns.Msg and set
internalResponseWriter.response (similar to WriteMsg) so ResolveInternal can
detect a resolved handler; keep WriteMsg unchanged. Also add explicit no-op
comments to TsigTimersOnly and Hijack methods (and ensure
TsigStatus/Close/LocalAddr/RemoteAddr return values are intentional) to satisfy
static analysis; reference internalResponseWriter.Write,
internalResponseWriter.WriteMsg, and ResolveInternal when making changes.

In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 260-263: refreshQuestion currently writes the refreshed
cachedRecord unconditionally which can resurrect a domain removed concurrently
by RemoveDomain; modify refreshQuestion so that after obtaining records and
before assigning m.records[question] it acquires m.mutex and checks whether the
question was removed (e.g. verify m.records no longer contains an explicit
tombstone or that a removed-set/flag for question is not set) and only write the
new cachedRecord if the question is still expected to exist; use the same
m.mutex around the check-and-set to avoid races with RemoveDomain and reference
the symbols refreshQuestion, RemoveDomain, m.records, cachedRecord, m.mutex and
question in the implementation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7df51b33-7689-4e49-96a3-2d1439532a19

📥 Commits

Reviewing files that changed from the base of the PR and between 5a89e66 and 8b7eae4.

📒 Files selected for processing (5)
  • client/internal/dns/handler_chain.go
  • client/internal/dns/handler_chain_test.go
  • client/internal/dns/mgmt/mgmt.go
  • client/internal/dns/mgmt/mgmt_refresh_test.go
  • client/internal/dns/server.go

@LufeiCheng

This problem has been bothering me for a long time, thanks.

@lixmal lixmal force-pushed the mgmt-cache-ttl-refresh branch 2 times, most recently from 6f432ef to ec10dba on April 21, 2026 at 16:00
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
client/internal/dns/handler_chain.go (1)

355-361: SonarCloud S1186 still flags TsigTimersOnly/Hijack.

The CI hint shows both methods still failing S1186 ("Add a nested comment explaining why this function is empty…"). The rule generally requires the explanatory comment inside the function body, not a preceding godoc comment. Easiest fix:

Proposed fix
-// TsigTimersOnly is part of dns.ResponseWriter. Internal in-process queries
-// never use TSIG, so the setting is a no-op.
-func (w *internalResponseWriter) TsigTimersOnly(bool) {}
+// TsigTimersOnly is part of dns.ResponseWriter. Internal in-process queries
+// never use TSIG, so the setting is a no-op.
+func (w *internalResponseWriter) TsigTimersOnly(bool) {
+	// no-op: internal in-process queries never use TSIG
+}
 
-// Hijack is part of dns.ResponseWriter. Internal in-process queries have no
-// underlying connection to hijack.
-func (w *internalResponseWriter) Hijack() {}
+// Hijack is part of dns.ResponseWriter. Internal in-process queries have no
+// underlying connection to hijack.
+func (w *internalResponseWriter) Hijack() {
+	// no-op: no underlying connection to hijack
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/handler_chain.go` around lines 355 - 361, The
empty-method lints for internalResponseWriter.TsigTimersOnly and
internalResponseWriter.Hijack should include an inline comment inside the
function body explaining why they are intentionally no-ops; update each method
(TsigTimersOnly and Hijack) to keep the body empty but add a nested comment such
as "no-op: internal in-process queries never use TSIG / have no underlying
connection to hijack" so S1186 is satisfied and the intent is clear.
client/internal/dns/mgmt/mgmt.go (1)

24-40: cacheTTL() reads the environment on every ServeDNS call.

cacheTTL() does os.Getenv + possibly time.ParseDuration on the hot path (called in ServeDNS staleness check and again in splitIPsByType per refresh). NB_MGMT_CACHE_TTL is documented as integration/dev-only and not expected to change at runtime. Consider resolving it once:

Proposed fix
-func cacheTTL() time.Duration {
-	if v := os.Getenv(envMgmtCacheTTL); v != "" {
-		if d, err := time.ParseDuration(v); err == nil && d > 0 {
-			return d
-		}
-	}
-	return defaultTTL
-}
+var cacheTTLOnce sync.Once
+var resolvedCacheTTL time.Duration
+
+func cacheTTL() time.Duration {
+	cacheTTLOnce.Do(func() {
+		resolvedCacheTTL = defaultTTL
+		if v := os.Getenv(envMgmtCacheTTL); v != "" {
+			if d, err := time.ParseDuration(v); err == nil && d > 0 {
+				resolvedCacheTTL = d
+			}
+		}
+	})
+	return resolvedCacheTTL
+}

If tests rely on changing the env mid-process, keep the current behavior — otherwise this avoids per-query syscalls.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 24 - 40, The cacheTTL function
currently calls os.Getenv and time.ParseDuration on every ServeDNS/hot-path
call; replace this with a one-time resolution so NB_MGMT_CACHE_TTL is read and
parsed once at startup (or on first use) and stored in a package-level variable.
Implement this by introducing a package-level variable (e.g.,
resolvedMgmtCacheTTL) and initialize it in an init() or via sync.Once, using
envMgmtCacheTTL and defaultTTL as inputs; ensure cacheTTL simply returns the
stored duration and preserves the existing semantics (fallback to defaultTTL and
ignore invalid/empty values).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/dns/handler_chain.go`:
- Around line 355-361: The empty-method lints for
internalResponseWriter.TsigTimersOnly and internalResponseWriter.Hijack should
include an inline comment inside the function body explaining why they are
intentionally no-ops; update each method (TsigTimersOnly and Hijack) to keep the
body empty but add a nested comment such as "no-op: internal in-process queries
never use TSIG / have no underlying connection to hijack" so S1186 is satisfied
and the intent is clear.

In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 24-40: The cacheTTL function currently calls os.Getenv and
time.ParseDuration on every ServeDNS/hot-path call; replace this with a one-time
resolution so NB_MGMT_CACHE_TTL is read and parsed once at startup (or on first
use) and stored in a package-level variable. Implement this by introducing a
package-level variable (e.g., resolvedMgmtCacheTTL) and initialize it in an
init() or via sync.Once, using envMgmtCacheTTL and defaultTTL as inputs; ensure
cacheTTL simply returns the stored duration and preserves the existing semantics
(fallback to defaultTTL and ignore invalid/empty values).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f7c8fd74-0ad3-4c21-b12e-0f8a1ffc899e

📥 Commits

Reviewing files that changed from the base of the PR and between 8b7eae4 and 6f432ef.

📒 Files selected for processing (5)
  • client/internal/dns/handler_chain.go
  • client/internal/dns/handler_chain_test.go
  • client/internal/dns/mgmt/mgmt.go
  • client/internal/dns/mgmt/mgmt_refresh_test.go
  • client/internal/dns/server.go
✅ Files skipped from review due to trivial changes (2)
  • client/internal/dns/server.go
  • client/internal/dns/handler_chain_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/mgmt/mgmt_refresh_test.go

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
client/internal/dns/handler_chain.go (1)

293-316: Optional: short-circuit an already-canceled context before spawning the goroutine.

If the caller's context is already done on entry, the current code still spawns a goroutine and runs a full dispatch pass (which itself takes the chain read lock and may invoke handler logic) before the select observes ctx.Done(). A cheap pre-check avoids that wasted work and tightens cancellation semantics for the mgmt refresher, which routinely runs under deadlines.

♻️ Proposed early ctx check
 func (c *HandlerChain) ResolveInternal(ctx context.Context, r *dns.Msg, maxPriority int) (*dns.Msg, error) {
 	if len(r.Question) == 0 {
 		return nil, fmt.Errorf("empty question")
 	}
+	if err := ctx.Err(); err != nil {
+		return nil, fmt.Errorf("resolve %s: %w", strings.ToLower(r.Question[0].Name), err)
+	}
 
 	base := &internalResponseWriter{}
 	done := make(chan struct{})

Note: the acknowledged goroutine-drain-on-cancel behavior in the doc comment is fine as-is — the happens-before via close(done) plus the caller not reading base on the cancel path keeps this race-free. No change needed there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/handler_chain.go` around lines 293 - 316, The
ResolveInternal function should short-circuit when the provided context is
already canceled before spawning the goroutine: at the start of
HandlerChain.ResolveInternal (before creating base and launching the goroutine
that calls c.dispatch), check ctx.Err() or select on ctx.Done() and return the
same formatted context error ("resolve %s: %w" with the question name and
ctx.Err()) so you avoid taking the chain lock and running dispatch when the
caller's context is already done; keep the existing done/close(done)
goroutine-drain behavior for the normal path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/dns/handler_chain.go`:
- Around line 293-316: The ResolveInternal function should short-circuit when
the provided context is already canceled before spawning the goroutine: at the
start of HandlerChain.ResolveInternal (before creating base and launching the
goroutine that calls c.dispatch), check ctx.Err() or select on ctx.Done() and
return the same formatted context error ("resolve %s: %w" with the question name
and ctx.Err()) so you avoid taking the chain lock and running dispatch when the
caller's context is already done; keep the existing done/close(done)
goroutine-drain behavior for the normal path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 44955c40-25b0-4406-8033-266fe221712d

📥 Commits

Reviewing files that changed from the base of the PR and between 6f432ef and ec10dba.

📒 Files selected for processing (5)
  • client/internal/dns/handler_chain.go
  • client/internal/dns/handler_chain_test.go
  • client/internal/dns/mgmt/mgmt.go
  • client/internal/dns/mgmt/mgmt_refresh_test.go
  • client/internal/dns/server.go
✅ Files skipped from review due to trivial changes (3)
  • client/internal/dns/handler_chain_test.go
  • client/internal/dns/mgmt/mgmt_refresh_test.go
  • client/internal/dns/mgmt/mgmt.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/server.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 236-249: The current refresh logic treats len(records)==0 &&
err==nil as a failure; instead, in the refresh path inside the function that
calls lookupRecords (the shown block), if lookupRecords returns no records but
no error you must treat that as a successful refresh for that qtype:
remove/expire the cached question/qtype (do not call markRefreshFailed or
increment backoff) and return nil so callers stop serving stale family data.
Concretely, update the block that calls lookupRecords to detect err==nil &&
len(records)==0, call the cache eviction/path that deletes the specific
question/qtype (or invalidate the qtype entry) and return nil; only invoke
markRefreshFailed and the failure logging when err != nil. Also ensure
lookupRecords implementations on the chain path surface real NOERROR/NODATA as
(nil, nil) rather than converting them into an error so the above logic can act
on them.
- Around line 562-571: The current filtering of resp.Answer into filtered (based
on qtype) preserves upstream A/AAAA RRs owned by the canonical target and their
TTLs; instead, when building the cache entry for dnsName you must normalize
chain responses by synthesizing RRs that use the queried name (dnsName) with the
management cache TTL (cacheTTL()) rather than caching target-owned names or
upstream TTLs. Update the logic that constructs filtered from resp.Answer
(references: resp.Answer, filtered, qtype) to: detect CNAME chains, follow to
the final A/AAAA answers, create new RRs whose Name equals dnsName and whose TTL
equals cacheTTL(), and return/cache those synthesized records (dropping the
original target-owned owner names and upstream TTL values) so later lookups for
dnsName get records with the management-cache TTL.
- Around line 139-144: The current logic calls scheduleRefresh(question) which
unconditionally spawns a goroutine before singleflight deduplication, causing
many parked goroutines for bursty stale hits; instead, call the singleflight
Do/DoChan directly from the caller so deduplication happens before any goroutine
is spawned: replace the scheduleRefresh(question) invocation with a direct
singleflight.DoChan (or Do) call that performs the refresh work and returns the
result, using the same key (question) and refresh function, and remove/adjust
scheduleRefresh (and its internal goroutine spawn) accordingly; apply the same
change for the other occurrence around shouldRefresh/inflight (the 214-220
block) so only one goroutine is created per unique inflight refresh.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 155e0c0e-cd4c-49ee-b2b8-ff53e4569912

📥 Commits

Reviewing files that changed from the base of the PR and between ec10dba and db3153f.

📒 Files selected for processing (2)
  • client/internal/dns/handler_chain.go
  • client/internal/dns/mgmt/mgmt.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)

212-220: DoChan allocates a new result channel for each duplicate stale hit.

Using DoChan doesn't make duplicate callers share the same channel; it creates and appends a new buffered channel to the singleflight group's internal channel list for each call. Under a slow refresh, bursty stale hits grow that channel list until the refresh completes. Consider adding an explicit per-question "refresh scheduled" gate and running one goroutine per key with Do instead.

Possible shape
 type Resolver struct {
 	records       map[dns.Question]*cachedRecord
 	mgmtDomain    *domain.Domain
 	serverDomains *dnsconfig.ServerDomains
 	mutex         sync.RWMutex
 
 	chain            ChainResolver
 	chainMaxPriority int
 	refreshGroup     singleflight.Group
+	refreshScheduled map[dns.Question]struct{}
 
 	refreshing map[dns.Question]*atomic.Bool
 }
 
 func NewResolver() *Resolver {
 	return &Resolver{
-		records:    make(map[dns.Question]*cachedRecord),
-		refreshing: make(map[dns.Question]*atomic.Bool),
+		records:          make(map[dns.Question]*cachedRecord),
+		refreshing:        make(map[dns.Question]*atomic.Bool),
+		refreshScheduled:  make(map[dns.Question]struct{}),
 	}
 }
 
 func (m *Resolver) scheduleRefresh(question dns.Question) {
+	m.mutex.Lock()
+	if _, ok := m.refreshScheduled[question]; ok {
+		m.mutex.Unlock()
+		return
+	}
+	m.refreshScheduled[question] = struct{}{}
+	m.mutex.Unlock()
+
 	key := question.Name + "|" + dns.TypeToString[question.Qtype]
-	_ = m.refreshGroup.DoChan(key, func() (any, error) {
-		return nil, m.refreshQuestion(question)
-	})
+	go func() {
+		defer func() {
+			m.mutex.Lock()
+			delete(m.refreshScheduled, question)
+			m.mutex.Unlock()
+		}()
+		_, _, _ = m.refreshGroup.Do(key, func() (any, error) {
+			return nil, m.refreshQuestion(question)
+		})
+	}()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 212 - 220, The current
scheduleRefresh uses refreshGroup.DoChan which allocates a new result channel
per duplicate caller; instead, ensure only one goroutine per question by
invoking refreshGroup.Do inside a single spawned goroutine (or add an explicit
per-question gate) so duplicate stale hits don't create extra channels/parked
goroutines. Change scheduleRefresh to spawn a goroutine that calls
m.refreshGroup.Do(key, func() (any, error) { return nil,
m.refreshQuestion(question) }), or alternatively add a protected map/set (e.g.
m.refreshScheduled with a mutex or sync.Map) to short-circuit if a refresh is
already scheduled and clear it after the Do completes; reference symbols:
scheduleRefresh, refreshGroup, DoChan, Do, refreshQuestion, Resolver.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 190-201: lookupBoth/lookupViaChain can report a qtype as
successfully resolved with zero records, but AddDomain only writes cache entries
when len(records)>0 leaving stale per-family entries (e.g., old AAAA) behind;
update AddDomain (the code that writes m.records using dns.Question keys and
cachedRecord) to propagate per-family success status from
lookupBoth/lookupViaChain and either store an explicit empty cachedRecord
(cachedAt set) for qtypes that resolved successfully with zero records or delete
the specific m.records[dns.Question{..., Qtype: ...}] when the lookup returned a
successful NODATA, so stale old qtypes are removed; apply the same change to the
other symmetric block referenced around the 310-318 area.

---

Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 212-220: The current scheduleRefresh uses refreshGroup.DoChan
which allocates a new result channel per duplicate caller; instead, ensure only
one goroutine per question by invoking refreshGroup.Do inside a single spawned
goroutine (or add an explicit per-question gate) so duplicate stale hits don't
create extra channels/parked goroutines. Change scheduleRefresh to spawn a
goroutine that calls m.refreshGroup.Do(key, func() (any, error) { return nil,
m.refreshQuestion(question) }), or alternatively add a protected map/set (e.g.
m.refreshScheduled with a mutex or sync.Map) to short-circuit if a refresh is
already scheduled and clear it after the Do completes; reference symbols:
scheduleRefresh, refreshGroup, DoChan, Do, refreshQuestion, Resolver.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 66a7684d-72f8-4341-91c2-c289efa9e708

📥 Commits

Reviewing files that changed from the base of the PR and between db3153f and eafc7a5.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go

@lixmal lixmal force-pushed the mgmt-cache-ttl-refresh branch from eafc7a5 to 7486738 on April 21, 2026 at 16:37
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/handler_chain.go`:
- Around line 305-314: The select between <-done and <-ctx.Done() can randomly
choose the context error over a just-completed handler, discarding a valid
base.response; change the logic in handler_chain.go (around the select that
waits on done and ctx) so that after the select you check base.response (and
base.response.Rcode) first and return it when present/valid, only returning
ctx.Err() when no valid base.response exists (i.e., prefer base.response over
ctx.Err()); update the code paths that currently return fmt.Errorf("resolve %s:
%w", ...) to only do so when base.response is nil or refused.

📥 Commits

Reviewing files that changed from the base of the PR and between db3153f and 7486738.

📒 Files selected for processing (5)
  • client/internal/dns/handler_chain.go
  • client/internal/dns/handler_chain_test.go
  • client/internal/dns/mgmt/mgmt.go
  • client/internal/dns/mgmt/mgmt_refresh_test.go
  • client/internal/dns/server.go
✅ Files skipped from review due to trivial changes (1)
  • client/internal/dns/mgmt/mgmt_refresh_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • client/internal/dns/server.go
  • client/internal/dns/handler_chain_test.go
  • client/internal/dns/mgmt/mgmt.go

Comment thread client/internal/dns/handler_chain.go
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Comment thread client/internal/dns/mgmt/mgmt.go Outdated
@lixmal lixmal force-pushed the mgmt-cache-ttl-refresh branch from 4eb4529 to 236e99a on April 22, 2026 at 09:17
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 117-145: ServeDNS currently appends cached.records unchanged,
which can return RRs with a full TTL even if the cache is near expiry; implement
a TTL clamp by computing remaining := cacheTTL() - time.Since(cached.cachedAt)
(helper responseTTL(cached.cachedAt) returning uint32 seconds, zero if expired)
and clone the cached RRs while setting their Hdr.Ttl to that remaining TTL
(helper cloneRecordsWithTTL(records []dns.RR, ttl uint32) []dns.RR that
deep-copies A/AAAA records and sets Ttl); use
cloneRecordsWithTTL(cached.records, responseTTL(cached.cachedAt)) when building
resp.Answer, keeping the existing scheduleRefresh/inflight logic intact so only
the returned RRs have their TTL adjusted.
- Around line 249-267: The refresh must verify record identity, not just key
presence, to avoid stale in-flight refreshes overwriting newer entries: capture
the current cached pointer (e.g., cached := m.records[question]) in ServeDNS and
pass that pointer into the refresh goroutine; in the NODATA branch only delete
from m.records if the stored pointer equals the captured cached pointer; and in
the success path, before writing replace, check that m.records[question] is
still the same pointer (compare by pointer equality) and only then update
m.records[question] = &cachedRecord{...}; update ServeDNS to pass the captured
cached pointer into the refresh scheduling so the refresh can perform these
identity checks.

📥 Commits

Reviewing files that changed from the base of the PR and between e6c6241 and 4eb4529.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go

Comment thread client/internal/dns/mgmt/mgmt.go Outdated
Comment thread client/internal/dns/mgmt/mgmt.go
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
client/internal/dns/mgmt/mgmt.go (2)

217-221: ⚠️ Potential issue | 🟠 Major

Guard refresh mutations by cached-record identity.

The refresh path still mutates by question key only. An older in-flight refresh can race with AddDomain or remove+re-add, then delete, overwrite, or mark failure on a newer cache entry. Pass the *cachedRecord captured when scheduling and only mutate while m.records[question] == expected.

Proposed direction
-func (m *Resolver) scheduleRefresh(question dns.Question) {
+func (m *Resolver) scheduleRefresh(question dns.Question, expected *cachedRecord) {
 	key := question.Name + "|" + dns.TypeToString[question.Qtype]
 	_ = m.refreshGroup.DoChan(key, func() (any, error) {
-		return nil, m.refreshQuestion(question)
+		return nil, m.refreshQuestion(question, expected)
 	})
 }
-func (m *Resolver) refreshQuestion(question dns.Question) error {
+func (m *Resolver) refreshQuestion(question dns.Question, expected *cachedRecord) error {
@@
 	if len(records) == 0 {
 		m.mutex.Lock()
-		delete(m.records, question)
+		if m.records[question] == expected {
+			delete(m.records, question)
+		}
 		m.mutex.Unlock()
@@
 	m.mutex.Lock()
-	if _, stillCached := m.records[question]; !stillCached {
+	if m.records[question] != expected {
 		m.mutex.Unlock()
-		log.Debugf("skipping refresh write for domain=%s type=%s: entry was removed during refresh",
+		log.Debugf("skipping refresh write for domain=%s type=%s: entry changed during refresh",
 			d.SafeString(), dns.TypeToString[question.Qtype])
 		return nil
 	}

Also apply the same identity guard to failure marking:

-func (m *Resolver) markRefreshFailed(question dns.Question) int {
+func (m *Resolver) markRefreshFailed(question dns.Question, expected *cachedRecord) int {
 	m.mutex.Lock()
 	defer m.mutex.Unlock()
 	c, ok := m.records[question]
-	if !ok {
+	if !ok || c != expected {
 		return 0
 	}

Then call m.scheduleRefresh(question, cached) from ServeDNS.

Also applies to: 249-267, 289-298

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 217 - 221, Change
scheduleRefresh to accept the captured *cachedRecord expected and use it to
guard any mutations: update the signature of Resolver.scheduleRefresh(question
dns.Question, expected *cachedRecord), pass expected into the
refreshGroup.DoChan closure and inside the closure (and any code path that
mutates m.records or marks failures, e.g., in refreshQuestion or the
failure-marking sections) verify pointer identity with if m.records[question] !=
expected { return nil, nil } before making changes; update the call site in
ServeDNS to call m.scheduleRefresh(question, cached) and apply the same
identity-guard check to the failure-marking code paths currently referenced to
ensure only the expected cachedRecord is mutated.

117-145: ⚠️ Potential issue | 🟠 Major

Clamp served RR TTLs to the remaining cache lifetime.

Line 145 still returns cached RRs unchanged, so stale or near-expired entries can be handed to downstream DNS caches with a full cacheTTL() TTL. Clone the records and set TTL to the remaining in-process lifetime; expired stale answers should get TTL 0.

Proposed fix
 	if found {
-		stale := time.Since(cached.cachedAt) > cacheTTL()
+		age := time.Since(cached.cachedAt)
+		stale := age > cacheTTL()
 		inBackoff := !cached.lastFailedRefresh.IsZero() && time.Since(cached.lastFailedRefresh) < refreshBackoff
 		shouldRefresh = stale && !inBackoff
 	}
@@
-	resp.Answer = append(resp.Answer, cached.records...)
+	resp.Answer = append(resp.Answer, cloneRecordsWithTTL(cached.records, responseTTL(cached.cachedAt))...)

Add helpers outside ServeDNS:

func responseTTL(cachedAt time.Time) uint32 {
	remaining := cacheTTL() - time.Since(cachedAt)
	if remaining <= 0 {
		return 0
	}
	return uint32((remaining + time.Second - 1) / time.Second)
}

func cloneRecordsWithTTL(records []dns.RR, ttl uint32) []dns.RR {
	out := make([]dns.RR, 0, len(records))
	for _, rr := range records {
		switch r := rr.(type) {
		case *dns.A:
			cp := *r
			cp.Hdr.Ttl = ttl
			cp.A = append(net.IP(nil), r.A...)
			out = append(out, &cp)
		case *dns.AAAA:
			cp := *r
			cp.Hdr.Ttl = ttl
			cp.AAAA = append(net.IP(nil), r.AAAA...)
			out = append(out, &cp)
		}
	}
	return out
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 117 - 145, ServeDNS currently
appends cached.records unchanged which lets downstream caches see the full
cacheTTL; compute the remaining TTL from cached.cachedAt (helper
responseTTL(cachedAt) returns remaining seconds or 0) and clone the RRs with
that TTL (helper cloneRecordsWithTTL(records, ttl) should deep-copy A/AAAA
types: copy header TTL, copy IP slices) then append the cloned records to
resp.Answer instead of cached.records; place the TTL computation and clone call
just before building resp.Answer in ServeDNS and use the cloned slice when
setting resp.Answer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 217-221: Change scheduleRefresh to accept the captured
*cachedRecord expected and use it to guard any mutations: update the signature
of Resolver.scheduleRefresh(question dns.Question, expected *cachedRecord), pass
expected into the refreshGroup.DoChan closure and inside the closure (and any
code path that mutates m.records or marks failures, e.g., in refreshQuestion or
the failure-marking sections) verify pointer identity with if
m.records[question] != expected { return nil, nil } before making changes;
update the call site in ServeDNS to call m.scheduleRefresh(question, cached) and
apply the same identity-guard check to the failure-marking code paths currently
referenced to ensure only the expected cachedRecord is mutated.
- Around line 117-145: ServeDNS currently appends cached.records unchanged which
lets downstream caches see the full cacheTTL; compute the remaining TTL from
cached.cachedAt (helper responseTTL(cachedAt) returns remaining seconds or 0)
and clone the RRs with that TTL (helper cloneRecordsWithTTL(records, ttl) should
deep-copy A/AAAA types: copy header TTL, copy IP slices) then append the cloned
records to resp.Answer instead of cached.records; place the TTL computation and
clone call just before building resp.Answer in ServeDNS and use the cloned slice
when setting resp.Answer.


📥 Commits

Reviewing files that changed from the base of the PR and between 4eb4529 and 236e99a.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 180-196: The code currently treats the case where one family
returned an error and the other returned NODATA as success (caching nothing); in
lookupBoth handling inside the block around lookupBoth, detect when both
aRecords and aaaaRecords are empty AND at least one of errA or errAAAA is
non-nil, and return a combined error (e.g., using errors.Join) before
locking/applying records; keep the existing dual-NODATA behavior (both errs nil
and both slices empty) returning the specific "no A/AAAA records" error, but add
the new check (referencing lookupBoth, errA, errAAAA, aRecords, aaaaRecords,
applyFamilyRecords) to prevent AddDomain from reporting success with no usable
cached records.
- Around line 603-627: The code currently rewrites every A/AAAA in resp.Answer
to dnsName; instead build and consult an owners set before accepting A/AAAA:
create a helper (e.g., seedOwnersFromCNAMEs) that starts owners = {dnsName},
iterates resp.Answer to follow *dns.CNAME records (adding each CNAME.Target/Name
along the chain) and resolves indirect targets, then in the loop that builds
filtered only accept A/AAAA records whose rr.Header().Name is present in owners
(and still enforce Class & qtype checks); use the existing dnsName, resp.Answer,
filtered variables and keep TTL and slices.Clone behavior unchanged.

📥 Commits

Reviewing files that changed from the base of the PR and between 236e99a and 4c25ac6.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go

Comment thread client/internal/dns/mgmt/mgmt.go
Comment thread client/internal/dns/mgmt/mgmt.go Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)

452-464: Consider clearing m.refreshing[question] on RemoveDomain.

RemoveDomain deletes records for both families but leaves any entry in m.refreshing untouched. Ordinarily clearRefreshing (called via defer in lookupRecords) cleans up, but if RemoveDomain is followed immediately by AddDomain while an OS-fallback refresh is still inflight for the same question, a later ServeDNS hit during that window could still fire the "possible resolver loop" warning against the new entry. Low-risk — just noting it for completeness.

Optional cleanup
 	delete(m.records, dns.Question{Name: dnsName, Qtype: dns.TypeA, Qclass: dns.ClassINET})
 	delete(m.records, dns.Question{Name: dnsName, Qtype: dns.TypeAAAA, Qclass: dns.ClassINET})
+	delete(m.refreshing, dns.Question{Name: dnsName, Qtype: dns.TypeA, Qclass: dns.ClassINET})
+	delete(m.refreshing, dns.Question{Name: dnsName, Qtype: dns.TypeAAAA, Qclass: dns.ClassINET})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 452 - 464, RemoveDomain
currently deletes A/AAAA entries but leaves any inflight markers in
m.refreshing, which can trigger a "possible resolver loop" for a subsequent
AddDomain; update RemoveDomain (method Resolver.RemoveDomain) to also clear
m.refreshing for the two dns.Question keys (same keys used when deleting from
m.records) while holding m.mutex — or invoke the existing clearRefreshing helper
for those questions — so inflight refresh state is removed alongside the cached
records, keeping behavior consistent with lookupRecords/clearRefreshing and
preventing stale refreshing entries from affecting ServeDNS/AddDomain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 249-258: The warn-level log in refreshQuestion can emit
"consecutive failures=0" because markRefreshFailed(question, expected) returns 0
when the cache entry changed (no backoff); update refreshQuestion to skip or
lower the severity for that case: after calling markRefreshFailed(question,
expected) check if fails == 0 and either return err without logging or use
log.Debugf instead of log.Warnf; reference the markRefreshFailed and
refreshQuestion logic and the logf selection (log.Warnf / log.Debugf) so you
only log warnings when fails > 0.

---

Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 452-464: RemoveDomain currently deletes A/AAAA entries but leaves
any inflight markers in m.refreshing, which can trigger a "possible resolver
loop" for a subsequent AddDomain; update RemoveDomain (method
Resolver.RemoveDomain) to also clear m.refreshing for the two dns.Question keys
(same keys used when deleting from m.records) while holding m.mutex — or invoke
the existing clearRefreshing helper for those questions — so inflight refresh
state is removed alongside the cached records, keeping behavior consistent with
lookupRecords/clearRefreshing and preventing stale refreshing entries from
affecting ServeDNS/AddDomain.

📥 Commits

Reviewing files that changed from the base of the PR and between cdc1ff8 and 0b21d7c.

📒 Files selected for processing (3)
  • client/internal/dns/mgmt/mgmt.go
  • client/internal/dns/mgmt/mgmt_refresh_test.go
  • client/internal/dns/mgmt/mgmt_test.go
✅ Files skipped from review due to trivial changes (1)
  • client/internal/dns/mgmt/mgmt_refresh_test.go

Comment thread client/internal/dns/mgmt/mgmt.go
pappz previously approved these changes Apr 22, 2026
@sonarqubecloud

@lixmal lixmal merged commit 801de8c into main Apr 22, 2026
41 of 42 checks passed
@lixmal lixmal deleted the mgmt-cache-ttl-refresh branch April 22, 2026 13:10