Skip to content

[client] scope macOS DNS state keys by WireGuard interface name#5504

Open
siriobalmelli wants to merge 2 commits intonetbirdio:mainfrom
siriobalmelli:fix/dns-iface-scoping-v2
Open

[client] scope macOS DNS state keys by WireGuard interface name#5504
siriobalmelli wants to merge 2 commits intonetbirdio:mainfrom
siriobalmelli:fix/dns-iface-scoping-v2

Conversation

@siriobalmelli
Copy link
Copy Markdown

@siriobalmelli siriobalmelli commented Mar 4, 2026

Summary

Fixes a bug where multiple netbird instances on the same macOS host clobber each other's DNS entries in scutil, because the key format State:/Network/Service/NetBird-%s/DNS has no per-instance disambiguation.

Alleviates #446: while one instance wouldn't connect to multiple networks, removes the barrier to having multiple instances running simultaneously.

Problem

When two netbird instances run on the same host with different WireGuard interfaces (e.g. utun0 and utun1), they both write to the same scutil keys (NetBird-Match/DNS, NetBird-Search/DNS). The second instance silently overwrites the first's DNS configuration, breaking DNS for the first instance's network.

Solution

Add the WireGuard interface name as a component in all NetBird scutil key formats:

Key type Before After
Named NetBird-Match/DNS NetBird-utun0-Match/DNS
Batched NetBird-Match-0/DNS NetBird-utun0-Match-0/DNS

This builds on top of the domain-batching work from #5368 (which prevents silent domain loss with 60+ domains) — both features are now combined.

Upgrade migration

discoverExistingKeys() now discovers and removes legacy-format keys (without interface scope) on startup, so stale scutil entries from older versions are cleaned up automatically rather than left orphaned.

Changes

  • host_darwin.go: Updated key format constants to include interface name; added interfaceName field to systemConfigurator; updated all key generation in addBatchedDomains, addLocalDNS, discoverExistingKeys; fixed primaryServiceStateKeyFormat bug; updated getKeyWithInput signature
  • server_darwin.go: Pass s.wgInterface.Name() to newHostManager()
  • unclean_shutdown_darwin.go: Added InterfaceName field to ShutdownState for proper cleanup scoping after unclean shutdown
  • host_darwin_test.go: Updated all existing tests for new key format; added TestGetKeyWithInput, TestNewHostManagerWithInterfaceName, TestMultipleInterfacesGenerateDifferentKeys, TestShutdownStateIncludesInterfaceName, TestPrimaryServiceKeyFormatNotAffected, TestMultipleInstancesBatchedIsolation

Testing

Unit tests (no scutil required, any platform):

go test -short ./client/internal/dns/ -run "TestGetKeyWithInput|TestNewHostManager|TestSplitDomainsIntoBatches|TestMultipleInterfaces|TestShutdownState|TestPrimaryService"

Integration tests (macOS with scutil, requires admin):

go test ./client/internal/dns/ -run "TestDarwinDNS|TestMatchDomainBatching|TestOriginalNameservers|TestMultipleInstancesBatchedIsolation" -v -timeout 120s

This supersedes PR #4633 (which was opened from main and became unmergeable after #5368 landed).

cc @mlsmaycon and @lixmal

Summary by CodeRabbit

  • Bug Fixes

    • DNS state is now scoped per network interface; shutdown/restoration use the interface context and preserve legacy keys for seamless upgrades.
    • Shutdown state now records interface info to prevent cross-interface cleanup and improve reliability.
  • Tests

    • Expanded test coverage for per-interface keying, batched domain isolation, legacy migration, and multi-interface isolation.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change

Internal implementation detail: scopes macOS scutil DNS state keys by WireGuard interface name to prevent multi-instance conflicts. No user-facing API, CLI, or configuration changes.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Scopes macOS DNS state to a specific network interface by adding interfaceName to the host manager and ShutdownState, changing key formats and key-generation to include interface scoping, and adding legacy (non‑scoped) key discovery and JSON unmarshalling for upgrade compatibility.

Changes

Cohort / File(s) Summary
Interface-scoped DNS implementation
client/internal/dns/host_darwin.go
Adds interfaceName to systemConfigurator; changes newHostManager(interfaceName string); updates getKeyWithInput(format, iface, key) signature and all key construction to include interface scope; preserves legacy key discovery paths.
Unclean shutdown state & migration
client/internal/dns/unclean_shutdown_darwin.go
ShutdownState gains InterfaceName (json:"interface_name,omitempty") and CreatedKeys tag (json:"created_keys,omitempty"); adds UnmarshalJSON to handle legacy formats; Cleanup() can create a host manager using InterfaceName or fall back to legacy discovery mode.
Server init change
client/internal/dns/server_darwin.go
initialize now calls newHostManager(s.wgInterface.Name()), passing the interface name into host manager creation.
Tests updated for per-interface behavior
client/internal/dns/host_darwin_test.go
Adds testInterfaceName and wires interfaceName into tests; updates tests to validate interface-scoped key generation, cross-interface isolation, shutdown-state serialization, legacy handling, and that primary service key format remains unchanged.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • pascal-fischer
  • pappz
  • mlsmaycon

Poem

🐇 I hop through keys both new and old,

I tuck each tunnel its name to hold.
One interface per burrow, neat and spry,
Legacy crumbs I keep nearby.
Hooray — each DNS nap is snug and dry.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: scoping macOS DNS state keys by WireGuard interface name to prevent multi-instance conflicts.
Description check ✅ Passed The PR description provides comprehensive information including problem statement, solution, upgrade migration details, and organized changes across multiple files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@siriobalmelli siriobalmelli force-pushed the fix/dns-iface-scoping-v2 branch from a664a05 to 4b97324 Compare March 4, 2026 22:43
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
client/internal/dns/host_darwin_test.go (1)

620-654: Strengthen multi-instance isolation assertions.

This test currently validates key separation and domain counts, but not exact per-instance payload integrity. Using distinct domain sets and asserting full read-back equality would better catch cross-instance leakage.

♻️ Suggested test hardening
@@
-	domains1 := generateShortDomains(60) // forces 2 batches
-	domains2 := generateShortDomains(60)
+	domains1 := generateShortDomains(60) // forces 2 batches
+	domains2 := generateLongDomains(60)  // distinct payload for stronger isolation checks
@@
-	var got1, got2 []string
-	for key := range cfg1.createdKeys {
+	var got1, got2 []string
+	keys1 := make([]string, 0, len(cfg1.createdKeys))
+	for key := range cfg1.createdKeys {
+		keys1 = append(keys1, key)
+	}
+	sort.Strings(keys1)
+	for _, key := range keys1 {
 		got1 = append(got1, readDomainsFromKey(t, key)...)
 	}
-	for key := range cfg2.createdKeys {
+	keys2 := make([]string, 0, len(cfg2.createdKeys))
+	for key := range cfg2.createdKeys {
+		keys2 = append(keys2, key)
+	}
+	sort.Strings(keys2)
+	for _, key := range keys2 {
 		got2 = append(got2, readDomainsFromKey(t, key)...)
 	}
-	assert.Equal(t, len(domains1), len(got1), "all domains from instance 1 should be readable")
-	assert.Equal(t, len(domains2), len(got2), "all domains from instance 2 should be readable")
+	assert.Equal(t, domains1, got1, "instance 1 should read back only its own domains")
+	assert.Equal(t, domains2, got2, "instance 2 should read back only its own domains")
@@
 import (
@@
+	"sort"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/host_darwin_test.go` around lines 620 - 654, The test
should not only check counts and key separation but verify payload integrity per
instance: for each instance use cfg1.createdKeys and cfg2.createdKeys with
readDomainsFromKey to collect all domains, normalize into sets, and assert that
the set read from cfg1 equals the original domains1 set and the set read from
cfg2 equals domains2; additionally assert that the intersection of those two
read sets is empty to catch any cross-instance leakage (use addBatchedDomains,
createdKeys, readDomainsFromKey, and the domains1/domains2 variables to locate
the logic).
client/internal/dns/host_darwin.go (1)

199-207: Indexed key discovery assumes contiguous indices.

The for i := 0; ; i++ { ... break on first miss } pattern can miss higher stale keys when there is an index gap. Consider scanning all returned DNS keys for matching indexed prefixes instead of stopping at the first missing slot.

Also applies to: 218-227

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/host_darwin.go` around lines 199 - 207, The current loop
that uses for i := 0; ; i++ with key :=
fmt.Sprintf(netbirdDNSStateKeyIndexedFormat, s.interfaceName, suffix, i) stops
at the first missing index and can miss higher-numbered stale keys; change the
discovery to scan the full dnsKeys string/list and collect keys whose names
match the indexed prefix for each suffix instead of breaking on the first miss.
Concretely, for each suffix (searchSuffix and matchSuffix) build the static
prefix using s.interfaceName and suffix (based on
netbirdDNSStateKeyIndexedFormat) and iterate over dnsKeys, appending any entry
that has that prefix and the expected index pattern (or use strings.HasPrefix +
index parsing/regex) to keys; apply the same replacement to the similar block
around lines 218-227.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/dns/host_darwin_test.go`:
- Around line 620-654: The test should not only check counts and key separation
but verify payload integrity per instance: for each instance use
cfg1.createdKeys and cfg2.createdKeys with readDomainsFromKey to collect all
domains, normalize into sets, and assert that the set read from cfg1 equals the
original domains1 set and the set read from cfg2 equals domains2; additionally
assert that the intersection of those two read sets is empty to catch any
cross-instance leakage (use addBatchedDomains, createdKeys, readDomainsFromKey,
and the domains1/domains2 variables to locate the logic).

In `@client/internal/dns/host_darwin.go`:
- Around line 199-207: The current loop that uses for i := 0; ; i++ with key :=
fmt.Sprintf(netbirdDNSStateKeyIndexedFormat, s.interfaceName, suffix, i) stops
at the first missing index and can miss higher-numbered stale keys; change the
discovery to scan the full dnsKeys string/list and collect keys whose names
match the indexed prefix for each suffix instead of breaking on the first miss.
Concretely, for each suffix (searchSuffix and matchSuffix) build the static
prefix using s.interfaceName and suffix (based on
netbirdDNSStateKeyIndexedFormat) and iterate over dnsKeys, appending any entry
that has that prefix and the expected index pattern (or use strings.HasPrefix +
index parsing/regex) to keys; apply the same replacement to the similar block
around lines 218-227.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 761f0703-bb96-4cee-9b42-018f6a237f23

📥 Commits

Reviewing files that changed from the base of the PR and between 8e7b016 and a664a05.

📒 Files selected for processing (4)
  • client/internal/dns/host_darwin.go
  • client/internal/dns/host_darwin_test.go
  • client/internal/dns/server_darwin.go
  • client/internal/dns/unclean_shutdown_darwin.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
client/internal/dns/host_darwin_test.go (1)

548-553: Consider adding a negative test case for empty interface name.

The test validates the happy path, but newHostManager also returns an error when interfaceName is empty. Consider adding a test case to verify the error handling.

📝 Optional test addition
func TestNewHostManagerWithEmptyInterfaceName(t *testing.T) {
	manager, err := newHostManager("")
	require.Error(t, err)
	assert.Nil(t, manager)
	assert.Contains(t, err.Error(), "interfaceName must not be empty")
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/host_darwin_test.go` around lines 548 - 553, Add a
negative test in host_darwin_test.go that calls newHostManager with an empty
string and asserts it returns an error and a nil manager; specifically, create a
TestNewHostManagerWithEmptyInterfaceName that calls newHostManager("") and uses
require.Error to check an error was returned, assert.Nil on the manager, and
assert.Contains on err.Error() to check the message mentions the empty
interfaceName (e.g., "interfaceName must not be empty"), so the error-handling
branch of newHostManager is covered.
client/internal/dns/unclean_shutdown_darwin.go (1)

20-26: Clarify comment about legacy key cleanup path.

When InterfaceName is empty, returning nil causes the state manager to delete this state entry (per cleanupSingleState behavior in statemanager/manager.go). The comment says "legacy keys will be cleaned up on next successful start," which is accurate since discoverExistingKeys() independently discovers and removes legacy-format keys.

Consider making the comment more explicit about the cleanup mechanism:

📝 Suggested comment clarification
 func (s *ShutdownState) Cleanup() error {
 	if s.InterfaceName == "" {
-		// State file from an older version without interface name — skip cleanup
-		// rather than fail; legacy keys will be cleaned up on next successful start.
+		// State file from an older version without interface name — skip cleanup.
+		// Return nil to allow state deletion; legacy keys will be discovered and
+		// removed by discoverExistingKeys() on next successful start.
 		log.Warn("dns shutdown state has no interface name, skipping cleanup of legacy scutil keys")
 		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/unclean_shutdown_darwin.go` around lines 20 - 26, The
comment in ShutdownState.Cleanup is ambiguous about how legacy keys are removed;
update it to explicitly state that returning nil causes the state manager
(cleanupSingleState in statemanager/manager.go) to delete this state entry and
that legacy-format keys are handled separately by discoverExistingKeys() on next
successful start, so this code path intentionally skips cleanup when
InterfaceName == "" to rely on that discovery-based removal. Reference
ShutdownState.Cleanup and the discoverExistingKeys() behavior in your comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/dns/host_darwin_test.go`:
- Around line 548-553: Add a negative test in host_darwin_test.go that calls
newHostManager with an empty string and asserts it returns an error and a nil
manager; specifically, create a TestNewHostManagerWithEmptyInterfaceName that
calls newHostManager("") and uses require.Error to check an error was returned,
assert.Nil on the manager, and assert.Contains on err.Error() to check the
message mentions the empty interfaceName (e.g., "interfaceName must not be
empty"), so the error-handling branch of newHostManager is covered.

In `@client/internal/dns/unclean_shutdown_darwin.go`:
- Around line 20-26: The comment in ShutdownState.Cleanup is ambiguous about how
legacy keys are removed; update it to explicitly state that returning nil causes
the state manager (cleanupSingleState in statemanager/manager.go) to delete this
state entry and that legacy-format keys are handled separately by
discoverExistingKeys() on next successful start, so this code path intentionally
skips cleanup when InterfaceName == "" to rely on that discovery-based removal.
Reference ShutdownState.Cleanup and the discoverExistingKeys() behavior in your
comment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9b0ba2e-92b2-4b96-adf9-868791f42318

📥 Commits

Reviewing files that changed from the base of the PR and between a664a05 and 4b97324.

📒 Files selected for processing (4)
  • client/internal/dns/host_darwin.go
  • client/internal/dns/host_darwin_test.go
  • client/internal/dns/server_darwin.go
  • client/internal/dns/unclean_shutdown_darwin.go

@siriobalmelli
Copy link
Copy Markdown
Author

Thanks for the review @coderabbitai!

Addressed nitpicks #3 (negative test for empty interface name) and #4 (clarified comment) in the latest commit.

On the two remaining nitpicks:

Nitpick #1 — Strengthen multi-instance isolation assertions

Skipping intentionally. TestMultipleInstancesBatchedIsolation is an integration test that requires admin/sudo to write scutil State keys (scutil exits 0 but silently drops writes without privileges). The domain read-back assertions via readDomainsFromKey are therefore unreliable in most CI environments. The key-separation assertions — verifying that each instance's keys contain only its own interface name and that the two key sets don't overlap — are the meaningful part of the test and work without elevated privileges. Strengthening the read-back assertions would add complexity to a path that can't execute in standard CI.

Nitpick #2 — Indexed key discovery assumes contiguous indices

Skipping intentionally. The contiguous-index assumption is correct by construction: addBatchedDomains always writes indices 0, 1, 2, … in order, and removeKeysContaining removes them all atomically. A gap can only arise from manual scutil intervention or a partial write failure (which would be a much larger problem). This is also the same pattern used by the upstream batching code in #5368 — changing it here would diverge from the established convention without a concrete failure case to motivate it.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 5, 2026

Tip

For best results, initiate chat on the files or code changes.

@siriobalmelli — thanks for the thorough explanations!

Nitpicks #3 & #4 — confirmed, acknowledged.

Nitpick #1 — the reasoning makes sense. The key-separation assertions (no key overlap between instances, each key set scoped to its own interface name) are the meaningful invariant that can be verified without elevated privileges. Requiring sudo-gated read-back in standard CI would make the test fragile without adding real confidence. Happy to accept this tradeoff.

Nitpick #2 — agreed. The contiguous-index invariant is enforced by construction in addBatchedDomains and is consistent with the pattern established in #5368. Diverging from that convention without a concrete failure case would be premature. The reasoning holds.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
client/internal/dns/host_darwin_test.go (1)

627-660: Strengthen isolation validation with distinct domain sets and content checks.

This case currently uses the same domain list for both instances and only checks counts, so domain-content cross-over could still pass. Use different domain datasets per interface and assert each instance reads back its own expected domains.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/host_darwin_test.go` around lines 627 - 660, The test
currently uses identical domain lists and only checks counts; change it to use
distinct domain sets (domains1 vs domains2 already present) and replace the
length assertions with content assertions: after building got1 and got2 via
readDomainsFromKey, assert that got1 contains exactly the elements of domains1
and got2 contains exactly the elements of domains2 (use equality or set
comparison), and additionally assert that none of domains1 appear in got2 and
none of domains2 appear in got1 to ensure no cross-over; keep references to
cfg1.addBatchedDomains, cfg2.addBatchedDomains, cfg1.createdKeys,
cfg2.createdKeys and readDomainsFromKey when locating the code to modify.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/dns/host_darwin_test.go`:
- Around line 627-660: The test currently uses identical domain lists and only
checks counts; change it to use distinct domain sets (domains1 vs domains2
already present) and replace the length assertions with content assertions:
after building got1 and got2 via readDomainsFromKey, assert that got1 contains
exactly the elements of domains1 and got2 contains exactly the elements of
domains2 (use equality or set comparison), and additionally assert that none of
domains1 appear in got2 and none of domains2 appear in got1 to ensure no
cross-over; keep references to cfg1.addBatchedDomains, cfg2.addBatchedDomains,
cfg1.createdKeys, cfg2.createdKeys and readDomainsFromKey when locating the code
to modify.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a3c65aa9-4802-4d59-9a44-f4e4100742df

📥 Commits

Reviewing files that changed from the base of the PR and between 4b97324 and b23e076.

📒 Files selected for processing (2)
  • client/internal/dns/host_darwin_test.go
  • client/internal/dns/unclean_shutdown_darwin.go

@siriobalmelli siriobalmelli force-pushed the fix/dns-iface-scoping-v2 branch from b23e076 to fc4097a Compare March 12, 2026 08:03
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
client/internal/dns/host_darwin_test.go (1)

24-102: Add a legacy-state cleanup regression here.

TestDarwinDNSUncleanShutdownCleanup now exercises only the new state shape. Please add a case that loads a pre-change dns_state payload without InterfaceName (and ideally with the old CreatedKeys field name), since that upgrade path is the critical compatibility promise in this PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/host_darwin_test.go` around lines 24 - 102, Add a
legacy-state regression case to TestDarwinDNSUncleanShutdownCleanup that writes
a pre-change dns_state JSON payload (omitting InterfaceName and using the old
CreatedKeys field name) into the same stateFile before calling
statemanager.New(...) / sm2.LoadState(...); then retrieve the loaded
ShutdownState and call shutdownState.Cleanup() and assert created keys are
removed using checkDNSKeyExists and cleanup with removeTestDNSKey. Ensure the
test constructs the legacy JSON with the same key names the old version used
(e.g., "CreatedKeys" array and no "InterfaceName"), writes it to stateFile, and
then reuses the existing assertions around ShutdownState.Cleanup to verify
backward compatibility.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/dns/host_darwin_test.go`:
- Around line 24-102: Add a legacy-state regression case to
TestDarwinDNSUncleanShutdownCleanup that writes a pre-change dns_state JSON
payload (omitting InterfaceName and using the old CreatedKeys field name) into
the same stateFile before calling statemanager.New(...) / sm2.LoadState(...);
then retrieve the loaded ShutdownState and call shutdownState.Cleanup() and
assert created keys are removed using checkDNSKeyExists and cleanup with
removeTestDNSKey. Ensure the test constructs the legacy JSON with the same key
names the old version used (e.g., "CreatedKeys" array and no "InterfaceName"),
writes it to stateFile, and then reuses the existing assertions around
ShutdownState.Cleanup to verify backward compatibility.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3dd0d1c9-cc89-4967-a75e-d4c2e2a74f0e

📥 Commits

Reviewing files that changed from the base of the PR and between b23e076 and fc4097a.

📒 Files selected for processing (4)
  • client/internal/dns/host_darwin.go
  • client/internal/dns/host_darwin_test.go
  • client/internal/dns/server_darwin.go
  • client/internal/dns/unclean_shutdown_darwin.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/server_darwin.go

Comment thread client/internal/dns/unclean_shutdown_darwin.go
@siriobalmelli siriobalmelli force-pushed the fix/dns-iface-scoping-v2 branch from fc4097a to d07f09b Compare March 13, 2026 08:40
@siriobalmelli
Copy link
Copy Markdown
Author

Addressed the nitpick from the review as well: added TestShutdownStateUnmarshalLegacyJSON (6 table-driven cases covering legacy PascalCase, new snake_case, empty, and mixed-field payloads) and TestShutdownStateLegacyCleanupWithKeys — an end-to-end integration test that writes a legacy scutil key, deserializes legacy JSON through UnmarshalJSON, calls Cleanup(), and asserts the key is removed.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/host_darwin.go`:
- Around line 75-79: The call to
stateManager.UpdateState(&ShutdownState{InterfaceName: s.interfaceName}) must
not be ignored on failure; if UpdateState returns an error you must stop further
execution so the InterfaceName is persisted before creating scoped keys. Modify
the surrounding function (the one invoking UpdateState in host_darwin.go) to
propagate or handle the error immediately (e.g., return the error or abort
initialization) instead of logging and continuing; reference
stateManager.UpdateState, ShutdownState, s.interfaceName and ensure Cleanup will
see a persisted InterfaceName on restart.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d61f5b7d-cc79-4928-b4d2-eeb09e82358a

📥 Commits

Reviewing files that changed from the base of the PR and between fc4097a and d07f09b.

📒 Files selected for processing (4)
  • client/internal/dns/host_darwin.go
  • client/internal/dns/host_darwin_test.go
  • client/internal/dns/server_darwin.go
  • client/internal/dns/unclean_shutdown_darwin.go

Comment on lines +75 to +79
if err := stateManager.UpdateState(&ShutdownState{
InterfaceName: s.interfaceName,
}); err != nil {
log.Errorf("failed to update shutdown state: %s", err)
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't continue past a failed interface checkpoint.

This is the only state write that happens before we create any scoped NetBird-<iface>-* keys. If it fails and we keep going, an unclean restart can end up with no persisted InterfaceName, and Cleanup() then falls back to legacy non-scoped discovery only, leaving the new scoped keys behind.

🩹 Proposed fix
 	if err := stateManager.UpdateState(&ShutdownState{
 		InterfaceName: s.interfaceName,
 	}); err != nil {
-		log.Errorf("failed to update shutdown state: %s", err)
+		return fmt.Errorf("update dns shutdown state: %w", err)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err := stateManager.UpdateState(&ShutdownState{
InterfaceName: s.interfaceName,
}); err != nil {
log.Errorf("failed to update shutdown state: %s", err)
}
if err := stateManager.UpdateState(&ShutdownState{
InterfaceName: s.interfaceName,
}); err != nil {
return fmt.Errorf("update dns shutdown state: %w", err)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/host_darwin.go` around lines 75 - 79, The call to
stateManager.UpdateState(&ShutdownState{InterfaceName: s.interfaceName}) must
not be ignored on failure; if UpdateState returns an error you must stop further
execution so the InterfaceName is persisted before creating scoped keys. Modify
the surrounding function (the one invoking UpdateState in host_darwin.go) to
propagate or handle the error immediately (e.g., return the error or abort
initialization) instead of logging and continuing; reference
stateManager.UpdateState, ShutdownState, s.interfaceName and ensure Cleanup will
see a persisted InterfaceName on restart.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a false positive — no code change needed. Here's why:

  1. UpdateState can only fail if the state was never registered — the sole error path in setState() is errStateNotRegistered, which is a programming bug, not a runtime condition. If it fails here, every subsequent UpdateState call in this function (lines 95, 117, 127 via s.updateState(stateManager)) will also fail, so returning early from this one call doesn't improve the situation.

  2. Codebase convention is log-and-continue — 9 of 10 UpdateState call sites across the codebase (routemanager, engine_ssh, dns/systemd_linux, dns/network_manager_unix, dns/resolvconf_unix, dns/host_windows, updatemanager, etc.) all use log.Errorf/log.Warnf and continue. Changing just this one call to return an error would be inconsistent.

  3. The updateState() helper at line 136 also logs-and-continues — even if we returned early here, the same non-fatal pattern is used for all subsequent state updates in this function. Making only the first call fatal while the rest are non-fatal is incoherent.

  4. Returning an error here would break DNS entirelyapplyDNSConfig is the function that configures the client's DNS. Returning early because state persistence failed means DNS won't be configured at all, which is strictly worse than the theoretical risk of orphaned keys on an unclean shutdown.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siriobalmelli, you're right — I accept all four points. The errStateNotRegistered path is a programming invariant (not a runtime condition), and blocking DNS configuration because of a state-persistence failure would be strictly worse than the theoretical risk of orphaned keys. The log-and-continue pattern is consistent with the rest of the codebase. I'll retract this comment.


✏️ Learnings added
Learnt from: siriobalmelli
Repo: netbirdio/netbird PR: 5504
File: client/internal/dns/host_darwin.go:75-79
Timestamp: 2026-03-13T09:37:32.529Z
Learning: In netbirdio/netbird, `stateManager.UpdateState()` failures are handled with log.Errorf/log.Warnf and execution continues (never return an error). The sole error path is `errStateNotRegistered`, which is a programming bug not a runtime condition. All call sites across routemanager, engine_ssh, dns/systemd_linux, dns/network_manager_unix, dns/resolvconf_unix, dns/host_windows, and updatemanager follow this log-and-continue convention. Do not flag log-and-continue on UpdateState as a bug in this codebase.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: lixmal
Repo: netbirdio/netbird PR: 5202
File: client/internal/portforward/manager.go:219-222
Timestamp: 2026-01-30T14:43:13.106Z
Learning: In client/internal/portforward/manager.go, the state persistence pattern is designed for retry-on-failure: if DeletePortMapping fails in Stop(), the state is intentionally left intact (via early return) so that cleanupResidual() can attempt deletion again on the next startup. This is the intended behavior, not a bug.

Learnt from: pappz
Repo: netbirdio/netbird PR: 5448
File: client/server/server.go:142-146
Timestamp: 2026-02-25T17:44:13.439Z
Learning: In the netbirdio/netbird client/server/server.go file, the update manager is intentionally designed to operate only with GUI flows, not in headless/daemon mode. When the update manager is initialized in the Start() method, only CheckUpdateSuccess is called; the manager's Start() method is deliberately not invoked in the daemon lifecycle and is instead called only for GUI flows via startUpdateManagerForGUI().
</learning]

Learnt from: pappz
Repo: netbirdio/netbird PR: 5512
File: client/internal/metrics/push.go:167-181
Timestamp: 2026-03-13T09:34:29.533Z
Learning: In netbirdio/netbird PR `#5512`, the `NB_METRICS_INTERVAL` env var (`p.envInterval`) in `client/internal/metrics/push.go` does NOT bypass remote config version gating (`isVersionInRange`) or the `RefreshIfNeeded` nil check. The earlier documentation describing this override behavior was deprecated. Do not flag the `resolve()` function for failing to skip version/config checks when `envInterval > 0` — the current behavior is intentional.

Learnt from: bcmmbaga
Repo: netbirdio/netbird PR: 5441
File: management/server/user.go:745-749
Timestamp: 2026-02-24T19:32:13.189Z
Learning: In Go codebases like netbirdio/netbird, methods returning (T, error) should follow the convention: if an error is returned, propagate it and return early. When the caller handles the error immediately, explicit nil checks on the returned value are unnecessary; rely on the error to guard control flow and only inspect the value when err is nil. This guidance applies broadly to Store methods such as GetUserByUserID in Go code; prefer early return on error and avoid redundant nil checks after a successful error path.

Learnt from: lixmal
Repo: netbirdio/netbird PR: 5530
File: proxy/internal/udp/relay.go:276-278
Timestamp: 2026-03-07T13:59:16.506Z
Learning: For Go 1.25+ projects, use sync.WaitGroup.Go() to run goroutines because it handles Add(1) and defer Done() automatically. Do not flag wg.Go(...) as a compilation error in codebases that require Go 1.25 or newer. If supporting older Go versions, gate such usage with build tags or avoid wg.Go() to maintain compatibility; verify the module targets Go 1.25+ before adopting this pattern.

Multiple netbird instances on the same host (each with its own WireGuard
interface and state directory) clobber each other's scutil DNS entries
because the key format 'State:/Network/Service/NetBird-%s/DNS' has no
per-instance disambiguation.

Add the WireGuard interface name as a component in all NetBird scutil
key formats:
  - named keys:   NetBird-<iface>-<suffix>/DNS
  - batched keys: NetBird-<iface>-<suffix>-<index>/DNS

This preserves the domain-batching logic from netbirdio#5368 (which prevents
silent domain loss with 60+ domains) while adding the interface scoping
needed for multi-instance correctness.

Also fix a latent bug reintroduced by netbirdio#5368: primaryServiceStateKeyFormat
has only one %s placeholder (for the system service UUID), so it must be
formatted with fmt.Sprintf directly rather than getKeyWithInput.

On upgrade, discoverExistingKeys() also finds and removes legacy-format
keys (without interface scope) so stale entries are cleaned up rather
than left orphaned in scutil.

Fixes: netbirdio#446

Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
…sions

When upgrading from an older NetBird version after an unclean shutdown,
legacy scutil DNS keys (e.g. NetBird-Match/DNS) were permanently orphaned
because Cleanup() returned nil early when InterfaceName was empty, causing
the state manager to delete the state entry before any cleanup ran.

Two compounding issues:

1. The early return prevented discoverExistingKeys() from ever running.
   discoverExistingKeys() is only called when createdKeys is empty, which
   only happens during teardown — but by then the new instance has its own
   keys in createdKeys, so legacy keys were never discovered.

2. The json:"created_keys" tag introduced in the previous commit does not
   match the old PascalCase "CreatedKeys" field name (underscore vs
   camelCase are not equivalent under Go's case-insensitive matching rules),
   so CreatedKeys always deserialized as nil from old state files.

Fix both issues:
- Replace the early return with a bare systemConfigurator (empty
  interfaceName, empty createdKeys) that falls through to the normal
  cleanup path. discoverExistingKeys() will be called and its hardcoded
  legacy key probes will find and remove old non-scoped keys.
- Add UnmarshalJSON to handle both old "CreatedKeys" and new
  "created_keys" JSON field names, with new-format taking precedence
  when both are present.

Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
@siriobalmelli siriobalmelli force-pushed the fix/dns-iface-scoping-v2 branch from d07f09b to 1c8d5ce Compare March 26, 2026 08:18
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant