[client] Add TTL-based refresh to mgmt DNS cache via handler chain #5945
Conversation
📝 Walkthrough
Adds in-process handler-chain dispatch (ResolveInternal) with priority-aware filtering and response capture; exposes HasRootHandlerAtOrBelow. Management resolver cache now stores per-question records with stale-while-revalidate, singleflight refreshes, backoff/loop detection, per-family A/AAAA handling, and prefers resolving via the handler chain; tests and server wiring updated.
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant Resolver as Management Resolver
participant Chain as Handler Chain
participant OS as OS Resolver
Client->>Resolver: ServeDNS(query)
Resolver->>Resolver: Check cache (fresh / stale / miss)
alt Fresh
Resolver->>Client: Return cached answer
else Stale
Resolver->>Client: Return stale answer
Resolver->>Resolver: Schedule async refresh (singleflight)
par Async refresh
Resolver->>Chain: ResolveInternal(msg, maxPriority)
alt Chain has root handler ≤ maxPriority
Chain->>Chain: dispatch (priority-filtered handlers)
Chain->>Resolver: Return dns.Msg
else No suitable root handler
Resolver->>OS: Fallback OS lookup
OS->>Resolver: Return records
end
Resolver->>Resolver: Update cache or mark failure/backoff
and Concurrent callers...
Client->>Resolver: Subsequent queries (refresh collapsed)
end
else Cache miss
Resolver->>Chain: ResolveInternal(...)
alt Chain has root handler
Chain->>Resolver: Return dns.Msg
else
Resolver->>OS: OS lookup
OS->>Resolver: Return records
end
Resolver->>Resolver: Cache result
Resolver->>Client: Serve answer
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/handler_chain.go`:
- Around line 337-344: Resolve the issue where internalResponseWriter.Write
ignores raw DNS packets by updating the internalResponseWriter.Write
implementation to unpack the provided byte slice into a dns.Msg and set
internalResponseWriter.response (similar to WriteMsg) so ResolveInternal can
detect a resolved handler; keep WriteMsg unchanged. Also add explicit no-op
comments to TsigTimersOnly and Hijack methods (and ensure
TsigStatus/Close/LocalAddr/RemoteAddr return values are intentional) to satisfy
static analysis; reference internalResponseWriter.Write,
internalResponseWriter.WriteMsg, and ResolveInternal when making changes (a sketch of this Write change follows after this list).
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 260-263: refreshQuestion currently writes the refreshed
cachedRecord unconditionally which can resurrect a domain removed concurrently
by RemoveDomain; modify refreshQuestion so that after obtaining records and
before assigning m.records[question] it acquires m.mutex and checks whether the
question was removed (e.g. verify m.records no longer contains an explicit
tombstone or that a removed-set/flag for question is not set) and only write the
new cachedRecord if the question is still expected to exist; use the same
m.mutex around the check-and-set to avoid races with RemoveDomain and reference
the symbols refreshQuestion, RemoveDomain, m.records, cachedRecord, m.mutex and
question in the implementation.
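A minimal sketch of the Write change suggested in the first comment above. The struct shown here (just a response field) is an assumption for illustration, not the repository's actual internalResponseWriter type:

```go
package dnschain

import "github.com/miekg/dns"

// internalResponseWriter is assumed here to capture the first response a
// handler produces; only the field relevant to the review comment is shown.
type internalResponseWriter struct {
	response *dns.Msg
}

// Write unpacks a raw wire-format packet into a dns.Msg so ResolveInternal can
// detect a resolved handler even when the handler calls Write instead of WriteMsg.
func (w *internalResponseWriter) Write(b []byte) (int, error) {
	msg := new(dns.Msg)
	if err := msg.Unpack(b); err != nil {
		return 0, err
	}
	w.response = msg
	return len(b), nil
}
```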
This problem has been bothering me for a long time, thanks.
Force-pushed 6f432ef to ec10dba.
🧹 Nitpick comments (2)
client/internal/dns/handler_chain.go (1)
355-361: SonarCloud S1186 still flags TsigTimersOnly/Hijack. The CI hint shows both methods still failing S1186 ("Add a nested comment explaining why this function is empty…"). The rule generally requires the explanatory comment inside the function body, not a preceding godoc comment. Easiest fix:
Proposed fix
-// TsigTimersOnly is part of dns.ResponseWriter. Internal in-process queries
-// never use TSIG, so the setting is a no-op.
-func (w *internalResponseWriter) TsigTimersOnly(bool) {}
+// TsigTimersOnly is part of dns.ResponseWriter. Internal in-process queries
+// never use TSIG, so the setting is a no-op.
+func (w *internalResponseWriter) TsigTimersOnly(bool) {
+	// no-op: internal in-process queries never use TSIG
+}

-// Hijack is part of dns.ResponseWriter. Internal in-process queries have no
-// underlying connection to hijack.
-func (w *internalResponseWriter) Hijack() {}
+// Hijack is part of dns.ResponseWriter. Internal in-process queries have no
+// underlying connection to hijack.
+func (w *internalResponseWriter) Hijack() {
+	// no-op: no underlying connection to hijack
+}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/handler_chain.go` around lines 355-361, the empty-method lints for internalResponseWriter.TsigTimersOnly and internalResponseWriter.Hijack should include an inline comment inside the function body explaining why they are intentionally no-ops; update each method (TsigTimersOnly and Hijack) to keep the body empty but add a nested comment such as "no-op: internal in-process queries never use TSIG / have no underlying connection to hijack" so S1186 is satisfied and the intent is clear.

client/internal/dns/mgmt/mgmt.go (1)
24-40: cacheTTL() reads the environment on every ServeDNS call.
cacheTTL() does os.Getenv plus possibly time.ParseDuration on the hot path (called in the ServeDNS staleness check and again in splitIPsByType per refresh). NB_MGMT_CACHE_TTL is documented as integration/dev-only and not expected to change at runtime. Consider resolving it once:

Proposed fix
-func cacheTTL() time.Duration {
-	if v := os.Getenv(envMgmtCacheTTL); v != "" {
-		if d, err := time.ParseDuration(v); err == nil && d > 0 {
-			return d
-		}
-	}
-	return defaultTTL
-}
+var cacheTTLOnce sync.Once
+var resolvedCacheTTL time.Duration
+
+func cacheTTL() time.Duration {
+	cacheTTLOnce.Do(func() {
+		resolvedCacheTTL = defaultTTL
+		if v := os.Getenv(envMgmtCacheTTL); v != "" {
+			if d, err := time.ParseDuration(v); err == nil && d > 0 {
+				resolvedCacheTTL = d
+			}
+		}
+	})
+	return resolvedCacheTTL
+}

If tests rely on changing the env mid-process, keep the current behavior; otherwise this avoids per-query syscalls.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` around lines 24 - 40, The cacheTTL function currently calls os.Getenv and time.ParseDuration on every ServeDNS/hot-path call; replace this with a one-time resolution so NB_MGMT_CACHE_TTL is read and parsed once at startup (or on first use) and stored in a package-level variable. Implement this by introducing a package-level variable (e.g., resolvedMgmtCacheTTL) and initialize it in an init() or via sync.Once, using envMgmtCacheTTL and defaultTTL as inputs; ensure cacheTTL simply returns the stored duration and preserves the existing semantics (fallback to defaultTTL and ignore invalid/empty values).
🧹 Nitpick comments (1)
client/internal/dns/handler_chain.go (1)
293-316: Optional: short-circuit an already-canceled context before spawning the goroutine. If the caller's context is already done on entry, the current code still spawns a goroutine and runs a full dispatch pass (which itself takes the chain read lock and may invoke handler logic) before the select observes ctx.Done(). A cheap pre-check avoids that wasted work and tightens cancellation semantics for the mgmt refresher, which routinely runs under deadlines.

♻️ Proposed early ctx check
 func (c *HandlerChain) ResolveInternal(ctx context.Context, r *dns.Msg, maxPriority int) (*dns.Msg, error) {
 	if len(r.Question) == 0 {
 		return nil, fmt.Errorf("empty question")
 	}
+	if err := ctx.Err(); err != nil {
+		return nil, fmt.Errorf("resolve %s: %w", strings.ToLower(r.Question[0].Name), err)
+	}
 	base := &internalResponseWriter{}
 	done := make(chan struct{})

Note: the acknowledged goroutine-drain-on-cancel behavior in the doc comment is fine as-is; the happens-before via close(done) plus the caller not reading base on the cancel path keeps this race-free. No change needed there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/handler_chain.go` around lines 293 - 316, The ResolveInternal function should short-circuit when the provided context is already canceled before spawning the goroutine: at the start of HandlerChain.ResolveInternal (before creating base and launching the goroutine that calls c.dispatch), check ctx.Err() or select on ctx.Done() and return the same formatted context error ("resolve %s: %w" with the question name and ctx.Err()) so you avoid taking the chain lock and running dispatch when the caller's context is already done; keep the existing done/close(done) goroutine-drain behavior for the normal path.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 236-249: The current refresh logic treats len(records)==0 &&
err==nil as a failure; instead, in the refresh path inside the function that
calls lookupRecords (the shown block), if lookupRecords returns no records but
no error you must treat that as a successful refresh for that qtype:
remove/expire the cached question/qtype (do not call markRefreshFailed or
increment backoff) and return nil so callers stop serving stale family data.
Concretely, update the block that calls lookupRecords to detect err==nil &&
len(records)==0, call the cache eviction/path that deletes the specific
question/qtype (or invalidate the qtype entry) and return nil; only invoke
markRefreshFailed and the failure logging when err != nil. Also ensure
lookupRecords implementations on the chain path surface real NOERROR/NODATA as
(nil, nil) rather than converting them into an error so the above logic can act
on them (a sketch of this NODATA handling follows after this list).
- Around line 562-571: The current filtering of resp.Answer into filtered (based
on qtype) preserves upstream A/AAAA RRs owned by the canonical target and their
TTLs; instead, when building the cache entry for dnsName you must normalize
chain responses by synthesizing RRs that use the queried name (dnsName) with the
management cache TTL (cacheTTL()) rather than caching target-owned names or
upstream TTLs. Update the logic that constructs filtered from resp.Answer
(references: resp.Answer, filtered, qtype) to: detect CNAME chains, follow to
the final A/AAAA answers, create new RRs whose Name equals dnsName and whose TTL
equals cacheTTL(), and return/cache those synthesized records (dropping the
original target-owned owner names and upstream TTL values) so later lookups for
dnsName get records with the management-cache TTL.
- Around line 139-144: The current logic calls scheduleRefresh(question) which
unconditionally spawns a goroutine before singleflight deduplication, causing
many parked goroutines for bursty stale hits; instead, call the singleflight
Do/DoChan directly from the caller so deduplication happens before any goroutine
is spawned: replace the scheduleRefresh(question) invocation with a direct
singleflight.DoChan (or Do) call that performs the refresh work and returns the
result, using the same key (question) and refresh function, and remove/adjust
scheduleRefresh (and its internal goroutine spawn) accordingly; apply the same
change for the other occurrence around shouldRefresh/inflight (the 214-220
block) so only one goroutine is created per unique inflight refresh.
Actionable comments posted: 1
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)
212-220: DoChan allocates a new result channel for each duplicate stale hit. Using DoChan doesn't make duplicate callers share the same channel; it creates and appends a new buffered channel to the singleflight group's internal channel list for each call. Under a slow refresh, bursty stale hits grow that channel list until the refresh completes. Consider adding an explicit per-question "refresh scheduled" gate and running one goroutine per key with Do instead.

Possible shape
 type Resolver struct {
 	records          map[dns.Question]*cachedRecord
 	mgmtDomain       *domain.Domain
 	serverDomains    *dnsconfig.ServerDomains
 	mutex            sync.RWMutex
 	chain            ChainResolver
 	chainMaxPriority int
 	refreshGroup     singleflight.Group
+	refreshScheduled map[dns.Question]struct{}
 	refreshing       map[dns.Question]*atomic.Bool
 }

 func NewResolver() *Resolver {
 	return &Resolver{
-		records:    make(map[dns.Question]*cachedRecord),
-		refreshing: make(map[dns.Question]*atomic.Bool),
+		records:          make(map[dns.Question]*cachedRecord),
+		refreshing:       make(map[dns.Question]*atomic.Bool),
+		refreshScheduled: make(map[dns.Question]struct{}),
 	}
 }

 func (m *Resolver) scheduleRefresh(question dns.Question) {
+	m.mutex.Lock()
+	if _, ok := m.refreshScheduled[question]; ok {
+		m.mutex.Unlock()
+		return
+	}
+	m.refreshScheduled[question] = struct{}{}
+	m.mutex.Unlock()
+
 	key := question.Name + "|" + dns.TypeToString[question.Qtype]
-	_ = m.refreshGroup.DoChan(key, func() (any, error) {
-		return nil, m.refreshQuestion(question)
-	})
+	go func() {
+		defer func() {
+			m.mutex.Lock()
+			delete(m.refreshScheduled, question)
+			m.mutex.Unlock()
+		}()
+		_, _, _ = m.refreshGroup.Do(key, func() (any, error) {
+			return nil, m.refreshQuestion(question)
+		})
+	}()
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` around lines 212 - 220, The current scheduleRefresh uses refreshGroup.DoChan which allocates a new result channel per duplicate caller; instead, ensure only one goroutine per question by invoking refreshGroup.Do inside a single spawned goroutine (or add an explicit per-question gate) so duplicate stale hits don't create extra channels/parked goroutines. Change scheduleRefresh to spawn a goroutine that calls m.refreshGroup.Do(key, func() (any, error) { return nil, m.refreshQuestion(question) }), or alternatively add a protected map/set (e.g. m.refreshScheduled with a mutex or sync.Map) to short-circuit if a refresh is already scheduled and clear it after the Do completes; reference symbols: scheduleRefresh, refreshGroup, DoChan, Do, refreshQuestion, Resolver.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 190-201: lookupBoth/lookupViaChain can report a qtype as
successfully resolved with zero records, but AddDomain only writes cache entries
when len(records)>0 leaving stale per-family entries (e.g., old AAAA) behind;
update AddDomain (the code that writes m.records using dns.Question keys and
cachedRecord) to propagate per-family success status from
lookupBoth/lookupViaChain and either store an explicit empty cachedRecord
(cachedAt set) for qtypes that resolved successfully with zero records or delete
the specific m.records[dns.Question{..., Qtype: ...}] when the lookup returned a
successful NODATA, so stale old qtypes are removed; apply the same change to the
other symmetric block referenced around the 310-318 area.
Force-pushed eafc7a5 to 7486738.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/handler_chain.go`:
- Around line 305-314: The select between <-done and <-ctx.Done() can randomly
choose the context error over a just-completed handler, discarding a valid
base.response; change the logic in handler_chain.go (around the select that
waits on done and ctx) so that after the select you check base.response (and
base.response.Rcode) first and return it when present/valid, only returning
ctx.Err() when no valid base.response exists (i.e., prefer base.response over
ctx.Err()); update the code paths that currently return fmt.Errorf("resolve %s:
%w", ...) to only do so when base.response is nil or refused.
Force-pushed 4eb4529 to 236e99a.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 117-145: ServeDNS currently appends cached.records unchanged,
which can return RRs with a full TTL even if the cache is near expiry; implement
a TTL clamp by computing remaining := cacheTTL() - time.Since(cached.cachedAt)
(helper responseTTL(cached.cachedAt) returning uint32 seconds, zero if expired)
and clone the cached RRs while setting their Hdr.Ttl to that remaining TTL
(helper cloneRecordsWithTTL(records []dns.RR, ttl uint32) []dns.RR that
deep-copies A/AAAA records and sets Ttl); use
cloneRecordsWithTTL(cached.records, responseTTL(cached.cachedAt)) when building
resp.Answer, keeping the existing scheduleRefresh/inflight logic intact so only
the returned RRs have their TTL adjusted.
- Around line 249-267: The refresh must verify record identity, not just key
presence, to avoid stale in-flight refreshes overwriting newer entries: capture
the current cached pointer (e.g., cached := m.records[question]) in ServeDNS and
pass that pointer into the refresh goroutine; in the NODATA branch only delete
from m.records if the stored pointer equals the captured cached pointer; and in
the success path, before writing replace, check that m.records[question] is
still the same pointer (compare by pointer equality) and only then update
m.records[question] = &cachedRecord{...}; update ServeDNS to pass the captured
cached pointer into the refresh scheduling so the refresh can perform these
identity checks.
♻️ Duplicate comments (2)
client/internal/dns/mgmt/mgmt.go (2)
217-221: ⚠️ Potential issue | 🟠 Major: Guard refresh mutations by cached-record identity.
The refresh path still mutates by question key only. An older in-flight refresh can race with AddDomain or remove+re-add, then delete, overwrite, or mark failure on a newer cache entry. Pass the *cachedRecord captured when scheduling and only mutate while m.records[question] == expected.

Proposed direction
-func (m *Resolver) scheduleRefresh(question dns.Question) {
+func (m *Resolver) scheduleRefresh(question dns.Question, expected *cachedRecord) {
 	key := question.Name + "|" + dns.TypeToString[question.Qtype]
 	_ = m.refreshGroup.DoChan(key, func() (any, error) {
-		return nil, m.refreshQuestion(question)
+		return nil, m.refreshQuestion(question, expected)
 	})
 }

-func (m *Resolver) refreshQuestion(question dns.Question) error {
+func (m *Resolver) refreshQuestion(question dns.Question, expected *cachedRecord) error {
@@
 	if len(records) == 0 {
 		m.mutex.Lock()
-		delete(m.records, question)
+		if m.records[question] == expected {
+			delete(m.records, question)
+		}
 		m.mutex.Unlock()
@@
 	m.mutex.Lock()
-	if _, stillCached := m.records[question]; !stillCached {
+	if m.records[question] != expected {
 		m.mutex.Unlock()
-		log.Debugf("skipping refresh write for domain=%s type=%s: entry was removed during refresh",
+		log.Debugf("skipping refresh write for domain=%s type=%s: entry changed during refresh",
 			d.SafeString(), dns.TypeToString[question.Qtype])
 		return nil
 	}

Also apply the same identity guard to failure marking:
-func (m *Resolver) markRefreshFailed(question dns.Question) int {
+func (m *Resolver) markRefreshFailed(question dns.Question, expected *cachedRecord) int {
 	m.mutex.Lock()
 	defer m.mutex.Unlock()
 	c, ok := m.records[question]
-	if !ok {
+	if !ok || c != expected {
 		return 0
 	}

Then call m.scheduleRefresh(question, cached) from ServeDNS.

Also applies to: 249-267, 289-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` around lines 217 - 221, Change scheduleRefresh to accept the captured *cachedRecord expected and use it to guard any mutations: update the signature of Resolver.scheduleRefresh(question dns.Question, expected *cachedRecord), pass expected into the refreshGroup.DoChan closure and inside the closure (and any code path that mutates m.records or marks failures, e.g., in refreshQuestion or the failure-marking sections) verify pointer identity with if m.records[question] != expected { return nil, nil } before making changes; update the call site in ServeDNS to call m.scheduleRefresh(question, cached) and apply the same identity-guard check to the failure-marking code paths currently referenced to ensure only the expected cachedRecord is mutated.
117-145: ⚠️ Potential issue | 🟠 Major: Clamp served RR TTLs to the remaining cache lifetime.
Line 145 still returns cached RRs unchanged, so stale or near-expired entries can be handed to downstream DNS caches with a full cacheTTL() TTL. Clone the records and set TTL to the remaining in-process lifetime; expired stale answers should get TTL 0.

Proposed fix
 if found {
-	stale := time.Since(cached.cachedAt) > cacheTTL()
+	age := time.Since(cached.cachedAt)
+	stale := age > cacheTTL()
 	inBackoff := !cached.lastFailedRefresh.IsZero() &&
 		time.Since(cached.lastFailedRefresh) < refreshBackoff
 	shouldRefresh = stale && !inBackoff
 }
@@
-	resp.Answer = append(resp.Answer, cached.records...)
+	resp.Answer = append(resp.Answer, cloneRecordsWithTTL(cached.records, responseTTL(cached.cachedAt))...)

Add helpers outside ServeDNS:

func responseTTL(cachedAt time.Time) uint32 {
	remaining := cacheTTL() - time.Since(cachedAt)
	if remaining <= 0 {
		return 0
	}
	return uint32((remaining + time.Second - 1) / time.Second)
}

func cloneRecordsWithTTL(records []dns.RR, ttl uint32) []dns.RR {
	out := make([]dns.RR, 0, len(records))
	for _, rr := range records {
		switch r := rr.(type) {
		case *dns.A:
			cp := *r
			cp.Hdr.Ttl = ttl
			cp.A = append(net.IP(nil), r.A...)
			out = append(out, &cp)
		case *dns.AAAA:
			cp := *r
			cp.Hdr.Ttl = ttl
			cp.AAAA = append(net.IP(nil), r.AAAA...)
			out = append(out, &cp)
		}
	}
	return out
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` around lines 117 - 145, ServeDNS currently appends cached.records unchanged which lets downstream caches see the full cacheTTL; compute the remaining TTL from cached.cachedAt (helper responseTTL(cachedAt) returns remaining seconds or 0) and clone the RRs with that TTL (helper cloneRecordsWithTTL(records, ttl) should deep-copy A/AAAA types: copy header TTL, copy IP slices) then append the cloned records to resp.Answer instead of cached.records; place the TTL computation and clone call just before building resp.Answer in ServeDNS and use the cloned slice when setting resp.Answer.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 180-196: The code currently treats the case where one family
returned an error and the other returned NODATA as success (caching nothing); in
lookupBoth handling inside the block around lookupBoth, detect when both
aRecords and aaaaRecords are empty AND at least one of errA or errAAAA is
non-nil, and return a combined error (e.g., using errors.Join) before
locking/applying records; keep the existing dual-NODATA behavior (both errs nil
and both slices empty) returning the specific "no A/AAAA records" error, but add
the new check (referencing lookupBoth, errA, errAAAA, aRecords, aaaaRecords,
applyFamilyRecords) to prevent AddDomain from reporting success with no usable
cached records.
- Around line 603-627: The code currently rewrites every A/AAAA in resp.Answer
to dnsName; instead build and consult an owners set before accepting A/AAAA:
create a helper (e.g., seedOwnersFromCNAMEs) that starts owners = {dnsName},
iterates resp.Answer to follow *dns.CNAME records (adding each CNAME.Target/Name
along the chain) and resolves indirect targets, then in the loop that builds
filtered only accept A/AAAA records whose rr.Header().Name is present in owners
(and still enforce Class & qtype checks); use the existing dnsName, resp.Answer,
filtered variables and keep TTL and slices.Clone behavior unchanged.
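A sketch of the owner-set filtering described in the second comment above; seedOwnersFromCNAMEs and filterByOwners are hypothetical names taken from the suggestion, not the actual implementation:

```go
package mgmtfilter

import (
	"strings"

	"github.com/miekg/dns"
)

func canon(name string) string { return strings.ToLower(dns.Fqdn(name)) }

// seedOwnersFromCNAMEs collects the owner names reachable from dnsName by
// following CNAME records in the answer section.
func seedOwnersFromCNAMEs(dnsName string, answers []dns.RR) map[string]struct{} {
	owners := map[string]struct{}{canon(dnsName): {}}
	// Keep following links until a pass adds nothing new (bounded by len(answers)).
	for i := 0; i < len(answers); i++ {
		changed := false
		for _, rr := range answers {
			cname, ok := rr.(*dns.CNAME)
			if !ok {
				continue
			}
			if _, fromKnown := owners[canon(cname.Hdr.Name)]; !fromKnown {
				continue
			}
			if _, seen := owners[canon(cname.Target)]; !seen {
				owners[canon(cname.Target)] = struct{}{}
				changed = true
			}
		}
		if !changed {
			break
		}
	}
	return owners
}

// filterByOwners keeps only records of the requested qtype whose owner name
// belongs to the CNAME chain rooted at the queried name.
func filterByOwners(answers []dns.RR, owners map[string]struct{}, qtype uint16) []dns.RR {
	var filtered []dns.RR
	for _, rr := range answers {
		h := rr.Header()
		if h.Class != dns.ClassINET || h.Rrtype != qtype {
			continue
		}
		if _, ok := owners[canon(h.Name)]; ok {
			filtered = append(filtered, rr)
		}
	}
	return filtered
}
```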
Actionable comments posted: 1
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)
452-464: Consider clearing m.refreshing[question] on RemoveDomain.
RemoveDomain deletes records for both families but leaves any entry in m.refreshing untouched. Ordinarily clearRefreshing (called via defer in lookupRecords) cleans up, but if RemoveDomain is followed immediately by AddDomain while an OS-fallback refresh is still inflight for the same question, a later ServeDNS hit during that window could still fire the "possible resolver loop" warning against the new entry. Low-risk; just noting it for completeness.

Optional cleanup
 delete(m.records, dns.Question{Name: dnsName, Qtype: dns.TypeA, Qclass: dns.ClassINET})
 delete(m.records, dns.Question{Name: dnsName, Qtype: dns.TypeAAAA, Qclass: dns.ClassINET})
+delete(m.refreshing, dns.Question{Name: dnsName, Qtype: dns.TypeA, Qclass: dns.ClassINET})
+delete(m.refreshing, dns.Question{Name: dnsName, Qtype: dns.TypeAAAA, Qclass: dns.ClassINET})

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/mgmt/mgmt.go` around lines 452 - 464, RemoveDomain currently deletes A/AAAA entries but leaves any inflight markers in m.refreshing, which can trigger a "possible resolver loop" for a subsequent AddDomain; update RemoveDomain (method Resolver.RemoveDomain) to also clear m.refreshing for the two dns.Question keys (same keys used when deleting from m.records) while holding m.mutex — or invoke the existing clearRefreshing helper for those questions — so inflight refresh state is removed alongside the cached records, keeping behavior consistent with lookupRecords/clearRefreshing and preventing stale refreshing entries from affecting ServeDNS/AddDomain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 249-258: The warn-level log in refreshQuestion can emit
"consecutive failures=0" because markRefreshFailed(question, expected) returns 0
when the cache entry changed (no backoff); update refreshQuestion to skip or
lower the severity for that case: after calling markRefreshFailed(question,
expected) check if fails == 0 and either return err without logging or use
log.Debugf instead of log.Warnf; reference the markRefreshFailed and
refreshQuestion logic and the logf selection (log.Warnf / log.Debugf) so you
only log warnings when fails > 0.
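A small sketch of the suggested severity gate; the function and message text are illustrative, assuming the usual logrus-as-log import used in the client:

```go
package mgmtlog

import log "github.com/sirupsen/logrus"

// logRefreshFailure only warns when the failure actually counted toward backoff;
// a zero return from markRefreshFailed means the cache entry changed underneath
// the refresh, which is expected and only worth a debug line.
func logRefreshFailure(domain string, fails int, err error) {
	if fails > 0 {
		log.Warnf("failed to refresh %s: %v (consecutive failures=%d)", domain, err, fails)
		return
	}
	log.Debugf("refresh for %s superseded by a newer cache entry: %v", domain, err)
}
```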
Describe your changes
The mgmt DNS cache was previously populated once and never refreshed, so if the mgmt/signal/relay/etc. hostname's IP changed (DDNS, cloud IP rotation, failover) the client would keep using the stale address until restart. This PR makes cache entries expire and refresh on a TTL, using stale-while-revalidate semantics so queries never block on a lookup.
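Below is a minimal sketch of that stale-while-revalidate path. The type and field names (cachedRecord, cachedAt, lastFailedRefresh, backoff) are illustrative stand-ins for the cache in client/internal/dns/mgmt, not its exact shape:

```go
package mgmtswr

import (
	"sync"
	"time"

	"github.com/miekg/dns"
)

// Illustrative shapes; names follow the description above, not the exact code.
type cachedRecord struct {
	records           []dns.RR
	cachedAt          time.Time
	lastFailedRefresh time.Time
}

type cache struct {
	mu      sync.RWMutex
	records map[dns.Question]*cachedRecord
	ttl     time.Duration // default 5m, overridable via NB_MGMT_CACHE_TTL
	backoff time.Duration // minimum gap between failed refresh attempts
}

// serve returns whatever is cached right away and, when the entry is past its
// TTL and not in failure backoff, hands it to refresh in the background. The
// query itself never blocks on a lookup.
func (c *cache) serve(q dns.Question, refresh func(dns.Question, *cachedRecord)) ([]dns.RR, bool) {
	c.mu.RLock()
	rec, ok := c.records[q]
	c.mu.RUnlock()
	if !ok {
		return nil, false
	}
	stale := time.Since(rec.cachedAt) > c.ttl
	inBackoff := !rec.lastFailedRefresh.IsZero() && time.Since(rec.lastFailedRefresh) < c.backoff
	if stale && !inBackoff {
		go refresh(q, rec) // deduplicated by singleflight in the real code
	}
	return rec.records, true
}
```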
The tricky part is where to send the refresh query. When NetBird has taken over the OS's DNS resolver, calling net.DefaultResolver would loop straight back into the mgmt cache itself (we'd serve our own stale answer to ourselves and "succeed"). So the refresh prefers the internal handler chain, bypassing the mgmt-cache priority and routing through the upstream / default / fallback handlers that actually know how to reach an external resolver. The OS fallback is only used when the chain has no root-zone handler at or below PriorityUpstream, which by construction means NetBird is not the system resolver and net.DefaultResolver is safe (a sketch of this decision follows the list below).

Pitfalls worth knowing:
- … singleflight (at most one extra parked goroutine per cycle) but silently wrong: every "refresh" returns our own stale answer with a fresh timestamp. A lightweight detector arms only on the OS path and logs a single warn if a ServeDNS hit lands during an inflight OS-fallback refresh for the same question.
- … addHostRootZone), and for Linux hosts where hostManagerWithOriginalNS captures the pre-takeover /etc/resolv.conf as a fallback. It does not fire on systemd-resolved Linux when no Primary nsgroup is configured (that manager doesn't expose original nameservers), but that scenario is also the one where the OS path is safe, so it works out.
- NB_MGMT_CACHE_TTL shortens the default 5min TTL for integration testing.
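As referenced above, here is a sketch of the refresh-target decision. HasRootHandlerAtOrBelow and ResolveInternal are the methods this PR adds to the handler chain; the surrounding interface, function, and record-building code are assumptions for illustration only:

```go
package mgmtrefresh

import (
	"context"
	"net"
	"time"

	"github.com/miekg/dns"
)

// chainResolver mirrors the two handler-chain methods the refresher relies on.
type chainResolver interface {
	HasRootHandlerAtOrBelow(maxPriority int) bool
	ResolveInternal(ctx context.Context, r *dns.Msg, maxPriority int) (*dns.Msg, error)
}

// refreshRecords prefers the in-process handler chain (bypassing the mgmt-cache
// priority) and only uses net.DefaultResolver when no root-zone handler exists
// at or below maxPriority, i.e. when NetBird is not the system resolver and the
// OS path cannot loop back into this cache.
func refreshRecords(ctx context.Context, chain chainResolver, maxPriority int, q dns.Question, ttl time.Duration) ([]dns.RR, error) {
	if chain != nil && chain.HasRootHandlerAtOrBelow(maxPriority) {
		msg := new(dns.Msg)
		msg.SetQuestion(q.Name, q.Qtype)
		resp, err := chain.ResolveInternal(ctx, msg, maxPriority)
		if err != nil {
			return nil, err
		}
		return resp.Answer, nil
	}

	// OS fallback: safe only because the chain has no matching root handler.
	network := "ip4"
	if q.Qtype == dns.TypeAAAA {
		network = "ip6"
	}
	ips, err := net.DefaultResolver.LookupIP(ctx, network, q.Name)
	if err != nil {
		return nil, err
	}
	var rrs []dns.RR
	for _, ip := range ips {
		hdr := dns.RR_Header{Name: q.Name, Class: dns.ClassINET, Ttl: uint32(ttl.Seconds())}
		switch {
		case q.Qtype == dns.TypeA && ip.To4() != nil:
			hdr.Rrtype = dns.TypeA
			rrs = append(rrs, &dns.A{Hdr: hdr, A: ip})
		case q.Qtype == dns.TypeAAAA && ip.To4() == nil:
			hdr.Rrtype = dns.TypeAAAA
			rrs = append(rrs, &dns.AAAA{Hdr: hdr, AAAA: ip})
		}
	}
	return rrs, nil
}
```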
Issue ticket number and link
#5113
Stack
Checklist
Documentation
Select exactly one:
Internal behavior change; no user-facing surface or CLI/API.
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Tests