Skip to content

[management] Refactor network map controller#4789

Merged
pascal-fischer merged 54 commits intomainfrom
refactor/network-map-controller
Dec 2, 2025
Merged

[management] Refactor network map controller#4789
pascal-fischer merged 54 commits intomainfrom
refactor/network-map-controller

Conversation

@pascal-fischer
Copy link
Copy Markdown
Collaborator

@pascal-fischer pascal-fischer commented Nov 14, 2025

Describe your changes

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features

    • Centralized ephemeral peer handling with 10‑minute lifetime and batched peer add/update/delete/connect/disconnect hooks.
    • Stream counting/observability for active peer streams.
    • Secrets manager exposes WireGuard key accessor with proper initialization error handling.
  • Chores

    • Controller and peer management rewired to use new peers/ephemeral flows.
    • Cache store now accepts connection‑capacity tuning.
    • Management integrations dependency bumped.
  • Tests

    • Tests updated to reflect new batched flows and ephemeral manager wiring.

✏️ Tip: You can customize this high-level summary in your review settings.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
client/internal/engine_test.go (2)

1631-1631: Consider formatting for readability.

The controller construction is correct and properly uses the new EphemeralManager parameter. However, with 11 parameters, consider formatting across multiple lines to improve readability:

-	networkMapController := controller.NewController(context.Background(), store, metrics, updateManager, requestBuffer, server.MockIntegratedValidator{}, settingsMockManager, "netbird.selfhosted", port_forwarding.NewControllerMock(), manager.NewEphemeralManager(store, peersManager), config)
+	networkMapController := controller.NewController(
+		context.Background(),
+		store,
+		metrics,
+		updateManager,
+		requestBuffer,
+		server.MockIntegratedValidator{},
+		settingsMockManager,
+		"netbird.selfhosted",
+		port_forwarding.NewControllerMock(),
+		manager.NewEphemeralManager(store, peersManager),
+		config,
+	)

1641-1641: Server construction correctly updated.

The NewServer call properly reflects the refactored signature. The use of nil for appMetrics and authManager is appropriate for test setup.

Optionally, consider multi-line formatting for the 8 parameters to improve readability:

-	mgmtServer, err := nbgrpc.NewServer(config, accountManager, settingsMockManager, secretsManager, nil, nil, &server.MockIntegratedValidator{}, networkMapController)
+	mgmtServer, err := nbgrpc.NewServer(
+		config,
+		accountManager,
+		settingsMockManager,
+		secretsManager,
+		nil, // appMetrics
+		nil, // authManager
+		&server.MockIntegratedValidator{},
+		networkMapController,
+	)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbf92bb and e6c11d1.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • client/internal/engine_test.go (2 hunks)
  • go.mod (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod
🧰 Additional context used
🧬 Code graph analysis (1)
client/internal/engine_test.go (2)
management/internals/modules/peers/ephemeral/manager/ephemeral.go (1)
  • NewEphemeralManager (53-61)
management/internals/shared/grpc/server.go (1)
  • NewServer (77-127)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: Build Cache
  • GitHub Check: Client / Unit
  • GitHub Check: Android / Build
  • GitHub Check: iOS / Build
  • GitHub Check: Windows
  • GitHub Check: Darwin
  • GitHub Check: Check External GPL/AGPL Licenses
  • GitHub Check: Linux
  • GitHub Check: release
  • GitHub Check: release_ui_darwin
  • GitHub Check: Client / Unit
🔇 Additional comments (2)
client/internal/engine_test.go (2)

33-34: LGTM! Imports align with the refactored API.

The new imports for peers and ephemeral/manager are necessary to support the updated network map controller construction that now requires an EphemeralManager parameter.


1637-1640: LGTM! Proper error handling added.

The error handling for NewTimeBasedAuthSecretsManager correctly follows Go best practices by checking and propagating the error.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6c11d1 and 8afe24a.

📒 Files selected for processing (2)
  • management/internals/controllers/network_map/interface_mock.go (2 hunks)
  • management/internals/shared/grpc/server_test.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
management/internals/controllers/network_map/interface_mock.go (1)
management/internals/controllers/network_map/update_message.go (1)
  • UpdateMessage (7-9)
management/internals/shared/grpc/server_test.go (2)
management/internals/shared/grpc/token_mgr.go (1)
  • TimeBasedAuthSecretsManager (37-49)
encryption/message.go (2)
  • EncryptMessage (10-24)
  • DecryptMessage (27-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: Management / Benchmark (amd64, postgres)
  • GitHub Check: Management / Integration (amd64, postgres)
  • GitHub Check: Management / Benchmark (API) (amd64, postgres)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Management / Integration (amd64, sqlite)
  • GitHub Check: Management / Benchmark (API) (amd64, sqlite)
  • GitHub Check: Signal / Unit (386)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Management / Benchmark (amd64, sqlite)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Client / Unit
  • GitHub Check: release_ui_darwin
  • GitHub Check: release
  • GitHub Check: release_ui
  • GitHub Check: Windows
  • GitHub Check: Linux
  • GitHub Check: Darwin
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: Check External GPL/AGPL Licenses
  • GitHub Check: Android / Build
  • GitHub Check: iOS / Build
🔇 Additional comments (4)
management/internals/shared/grpc/server_test.go (2)

75-80: LGTM - Proper test setup with secrets manager.

Direct struct initialization is appropriate here for test isolation, and since GetWGKey() is the only method called, leaving other fields uninitialized is acceptable.


84-84: LGTM - Encryption/decryption calls correctly updated.

The refactoring properly switches from direct wgKey field access to secretsManager.GetWGKey() while maintaining correct key usage:

  • Line 84: Server's private key used for encrypting the message to the client
  • Line 98: Client's private key used for decrypting the server's response

Also applies to: 98-98

management/internals/controllers/network_map/interface_mock.go (2)

60-72: LGTM! New CountStreams method correctly implemented.

The mock implementation for the new CountStreams() method follows gomock conventions and correctly handles the return type.


160-200: LGTM! Batch peer lifecycle methods correctly implemented.

The new batch-oriented methods (OnPeersAdded, OnPeersDeleted, OnPeersUpdated) are well-structured and follow gomock conventions. The shift from singular to batch operations aligns with the PR's refactoring objectives.

Comment on lines +75 to 84
func (m *MockController) DisconnectPeers(ctx context.Context, accountId string, peerIDs []string) {
m.ctrl.T.Helper()
m.ctrl.Call(m, "DisconnectPeers", ctx, peerIDs)
m.ctrl.Call(m, "DisconnectPeers", ctx, accountId, peerIDs)
}

// DisconnectPeers indicates an expected call of DisconnectPeers.
func (mr *MockControllerMockRecorder) DisconnectPeers(ctx, peerIDs any) *gomock.Call {
func (mr *MockControllerMockRecorder) DisconnectPeers(ctx, accountId, peerIDs any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "DisconnectPeers", reflect.TypeOf((*MockController)(nil).DisconnectPeers), ctx, peerIDs)
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "DisconnectPeers", reflect.TypeOf((*MockController)(nil).DisconnectPeers), ctx, accountId, peerIDs)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Script executed:

#!/bin/bash
# Find and examine the source interface.go file in the same directory
fd "interface.go" management/internals/controllers/network_map/ | head -5

Repository: netbirdio/netbird


<shell_result>
Repository: netbirdio/netbird
Exit code: 0

stdout:

management/internals/controllers/network_map/interface.go

</shell_result>


Script executed:

#!/bin/bash
# Read the interface.go file to check parameter naming
wc -l management/internals/controllers/network_map/interface.go

Repository: netbirdio/netbird


<shell_result>
Repository: netbirdio/netbird
Exit code: 0

stdout:

39 management/internals/controllers/network_map/interface.go

</shell_result>


Script executed:

#!/bin/bash
# Read the full interface.go file
cat -n management/internals/controllers/network_map/interface.go

Repository: netbirdio/netbird


<shell_result>
Repository: netbirdio/netbird
Exit code: 0

stdout:

     1	package network_map
     2	
     3	//go:generate go run go.uber.org/mock/mockgen -package network_map -destination=interface_mock.go -source=./interface.go -build_flags=-mod=mod
     4	
     5	import (
     6		"context"
     7	
     8		nbdns "github.com/netbirdio/netbird/dns"
     9		nbpeer "github.com/netbirdio/netbird/management/server/peer"
    10		"github.com/netbirdio/netbird/management/server/posture"
    11		"github.com/netbirdio/netbird/management/server/types"
    12	)
    13	
    14	const (
    15		EnvNewNetworkMapBuilder  = "NB_EXPERIMENT_NETWORK_MAP"
    16		EnvNewNetworkMapAccounts = "NB_EXPERIMENT_NETWORK_MAP_ACCOUNTS"
    17	
    18		DnsForwarderPort           = nbdns.ForwarderServerPort
    19		OldForwarderPort           = nbdns.ForwarderClientPort
    20		DnsForwarderPortMinVersion = "v0.59.0"
    21	)
    22	
    23	type Controller interface {
    24		UpdateAccountPeers(ctx context.Context, accountID string) error
    25		UpdateAccountPeer(ctx context.Context, accountId string, peerId string) error
    26		BufferUpdateAccountPeers(ctx context.Context, accountID string) error
    27		GetValidatedPeerWithMap(ctx context.Context, isRequiresApproval bool, accountID string, p *nbpeer.Peer) (*nbpeer.Peer, *types.NetworkMap, []*posture.Checks, int64, error)
    28		GetDNSDomain(settings *types.Settings) string
    29		StartWarmup(context.Context)
    30		GetNetworkMap(ctx context.Context, peerID string) (*types.NetworkMap, error)
    31		CountStreams() int
    32	
    33		OnPeersUpdated(ctx context.Context, accountId string, peerIDs []string) error
    34		OnPeersAdded(ctx context.Context, accountID string, peerIDs []string) error
    35		OnPeersDeleted(ctx context.Context, accountID string, peerIDs []string) error
    36		DisconnectPeers(ctx context.Context, accountId string, peerIDs []string)
    37		OnPeerConnected(ctx context.Context, accountID string, peerID string) (chan *UpdateMessage, error)
    38		OnPeerDisconnected(ctx context.Context, accountID string, peerID string)
    39	}

</shell_result>


Parameter naming inconsistency exists in the source interface, which the mock faithfully reproduces.

The inconsistency flagged in the review comment is real. The source interface.go file shows mixed naming conventions across methods:

  • Lines 24, 26, 34, 35, 37, 38: accountID (uppercase 'D')
  • Lines 25, 33, 36: accountId (lowercase 'd')

Since the mock file is auto-generated from this interface (via mockgen on line 3 of interface.go), it correctly reproduces the source inconsistency. To resolve this issue, standardize the parameter name to accountID in the source interface definition at management/internals/controllers/network_map/interface.go (lines 25, 33, 36), which will automatically propagate to the regenerated mock file.

🤖 Prompt for AI Agents
In management/internals/controllers/network_map/interface_mock.go around lines
75 to 84, the mock reproduces a parameter naming inconsistency originating from
the source interface; open
management/internals/controllers/network_map/interface.go and rename the
inconsistent parameters on lines 25, 33, and 36 from accountId to accountID to
match the other methods, then regenerate the mock (run mockgen as used in the
project) so the mock file is updated accordingly.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
management/server/user_test.go (1)

1428-1443: Update TestUserAccountPeersUpdate comments to match new expectation

In the “deleting user with no linked peers” subtest, the comment still says “should not update account peers and not send peer update”, but the test body now calls peerShouldReceiveUpdate(t, updMsg), i.e., it expects a network‑map update.

If this behavioural change is intentional, the subtest description should be updated to reflect that deleting a user—even without linked peers—now triggers an update, otherwise the test is misleading.

management/server/account.go (1)

2065-2073: Perfect! I found the critical evidence. The git diff shows a definitive semantic change:

Before (previous code):

am.networkMapController.OnPeerUpdated(peer.AccountID, peer)

After (current code):

err = am.networkMapController.OnPeersUpdated(ctx, peer.AccountID, []string{peerID})
if err != nil {
    return fmt.Errorf("notify network map controller of peer update: %w", err)
}

The method was changed from OnPeerUpdated (which did not return or check errors) to OnPeersUpdated (which returns errors that are now checked and propagated). This means UpdatePeerIP now fails when the network-map notification fails, even though the peer IP is already successfully persisted in the database.


UpdatePeerIP now fails when network-map notification fails, despite IP being already persisted

Changed from calling OnPeerUpdated(peer.AccountID, peer) (fire-and-forget) to OnPeersUpdated(ctx, peer.AccountID, []string{peerID}) with error checking. The peer IP is saved in the transaction and committed before OnPeersUpdated is called—if the notification fails, the caller sees an error even though the update succeeded in the database.

If the intent is to keep notifications best-effort, consider logging the error instead of propagating it:

if err != nil {
    log.WithContext(ctx).Warnf("failed to notify network map of peer IP update: %w", err)
}
🧹 Nitpick comments (5)
management/server/cache/idp.go (1)

17-22: New DefaultIDPCacheOpenConn constant is fine; consider a brief doc comment

The new constant fits alongside the other cache defaults and is used consistently as the maxConn argument. Consider adding a short comment (e.g., “max open Redis connections for IDP cache”) so future readers understand its semantics at a glance.

management/server/account_test.go (2)

24-31: Ephemeral peers wiring in createManager looks correct; possible small simplification

Using ephemeral_manager.NewEphemeralManager(store, peers.NewManager(store, permissionsManager)) when constructing the network map controller matches the new production wiring and keeps tests close to real behaviour. The BuildManager call also correctly injects the constructed controller and the config.

If you ever need the controller and account manager to share a single mutable config object, consider constructing one cfg := &config.Config{} and passing it to both controller.NewController and BuildManager instead of allocating two separate zero-valued configs.

Also applies to: 2964-2966


3376-3392: Reuse DefaultIDPCacheOpenConn instead of hard‑coding 100 in cache tests

Both memory and Redis cache paths now call cache.NewStore(..., 100). To keep the tests aligned with the defined default and avoid magic numbers, consider using cache.DefaultIDPCacheOpenConn here as well.

management/server/user_test.go (2)

10-15: Mixed gomock imports across the repo; verify mocks use the intended package

This file imports go.uber.org/mock/gomock, while other tests (e.g., management/server/account_test.go) still import github.com/golang/mock/gomock. Types from those two packages are distinct, so any generated mocks (like network_map.NewMockController) must be built against the same gomock import path you use here.

Please double‑check that:

  • All newly generated mocks (e.g., under management/internals/controllers/network_map) import go.uber.org/mock/gomock, and
  • You don’t pass a *gomock.Controller from one package to mocks generated against the other.

If there is a mix, it may be worth standardising on the Uber fork everywhere to avoid confusion.


744-756: Verify OnPeersDeleted expectations vs actual DeleteUser behaviour

The DeleteUser tests now inject a networkMapControllerMock and expect OnPeersDeleted to be called. That’s good to assert the network map is notified, but a couple of details are worth checking:

  • In TestUser_DeleteUser_regularUser, you set a single expectation without .AnyTimes(), yet the same am and mock are reused across multiple subtests. If DeleteUser calls OnPeersDeleted more than once across those cases, this will fail. You may want to either create a fresh controller per subtest or add .AnyTimes() / explicit Times(n) to clarify expectations.
  • In TestUser_DeleteUser_RegularUsers, you already use .AnyTimes(), which looks more robust for multi‑user deletion.

Please run the tests and adjust the expectations if necessary.

Also applies to: 860-874

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd0cc43 and 3fc8c8b.

📒 Files selected for processing (10)
  • management/internals/shared/grpc/server.go (13 hunks)
  • management/internals/shared/grpc/server_test.go (2 hunks)
  • management/internals/shared/grpc/token_mgr.go (5 hunks)
  • management/server/account.go (2 hunks)
  • management/server/account_test.go (4 hunks)
  • management/server/cache/idp.go (1 hunks)
  • management/server/cache/idp_test.go (1 hunks)
  • management/server/cache/store.go (2 hunks)
  • management/server/cache/store_test.go (3 hunks)
  • management/server/user_test.go (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • management/internals/shared/grpc/server_test.go
🧰 Additional context used
🧬 Code graph analysis (5)
management/server/account.go (1)
management/server/cache/store.go (1)
  • NewStore (24-31)
management/server/cache/store_test.go (1)
management/server/cache/store.go (1)
  • NewStore (24-31)
management/server/user_test.go (3)
management/server/cache/store.go (1)
  • NewStore (24-31)
management/internals/controllers/network_map/interface_mock.go (1)
  • NewMockController (35-39)
management/server/account.go (1)
  • DefaultAccountManager (66-109)
management/server/cache/idp_test.go (2)
management/server/cache/store.go (1)
  • NewStore (24-31)
management/server/cache/idp.go (3)
  • DefaultIDPCacheExpirationMax (18-18)
  • DefaultIDPCacheCleanupInterval (20-20)
  • DefaultIDPCacheOpenConn (21-21)
management/internals/shared/grpc/server.go (3)
management/internals/shared/grpc/token_mgr.go (1)
  • SecretsManager (28-34)
shared/management/proto/management.pb.go (6)
  • ServerKeyResponse (1197-1208)
  • ServerKeyResponse (1223-1223)
  • ServerKeyResponse (1238-1240)
  • EncryptedMessage (273-284)
  • EncryptedMessage (299-299)
  • EncryptedMessage (314-316)
encryption/message.go (2)
  • EncryptMessage (10-24)
  • DecryptMessage (27-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: release
  • GitHub Check: release_ui_darwin
  • GitHub Check: Build Cache
  • GitHub Check: release_ui
  • GitHub Check: Android / Build
  • GitHub Check: JS / Lint
  • GitHub Check: Linux
  • GitHub Check: Darwin
  • GitHub Check: Windows
  • GitHub Check: iOS / Build
  • GitHub Check: Client / Unit
  • GitHub Check: Check External GPL/AGPL Licenses
🔇 Additional comments (14)
management/internals/shared/grpc/token_mgr.go (1)

53-99: LGTM! Constructor and key accessor properly implemented.

The refactor to generate and store the WireGuard key within the secrets manager is clean. The constructor properly handles key generation errors, and the GetWGKey() accessor provides controlled access to the key.

Minor observation: GetWGKey() always returns nil error since the key is already validated during initialization. While this simplifies the implementation, it means the error return exists mainly for API consistency or future extensibility.

management/internals/shared/grpc/server.go (9)

89-91: LGTM! Metrics integration updated correctly.

The migration from peersUpdateManager to networkMapController.CountStreams() aligns with the broader refactoring of peer lifecycle management.


151-160: LGTM! Key retrieval properly implemented with error handling.

The migration to secretsManager.GetWGKey() is clean, with appropriate error handling and logging. The response correctly uses the public key.


263-269: LGTM! Proper error handling with cleanup.

The error handling for OnPeerConnected correctly performs cleanup via cancelPeerRoutines before returning, ensuring the peer is properly disconnected if the connection process fails.


321-342: LGTM! Key retrieval and encryption follow correct pattern.

Key retrieval includes proper error handling with peer cleanup. The encrypted message correctly populates WgPubKey with the server's public key.


344-356: LGTM! Peer disconnection updated to use networkMapController.

The migration to networkMapController.OnPeerDisconnected is consistent with the broader refactoring of peer lifecycle management.


500-518: LGTM! Request parsing with proper key retrieval.

The key retrieval for message decryption includes appropriate error handling without exposing implementation details to clients.


617-633: LGTM! Login response encryption properly implemented.

Key retrieval and encryption follow the correct pattern, with proper error handling and the response correctly using the server's public key.


723-746: LGTM! Initial sync encryption follows correct pattern.

Key retrieval and message encryption are properly implemented with appropriate error handling and correct public key usage in the response.


828-870: LGTM! PKCE authorization flow correctly implements key retrieval and encryption.

Key operations are properly implemented with correct error handling and the response correctly uses the server's public key.

management/server/account.go (1)

239-245: Wiring DefaultIDPCacheOpenConn into BuildManager cache store is consistent

Passing nbcache.DefaultIDPCacheOpenConn into nbcache.NewStore cleanly propagates the intended Redis connection limit into the IDP cache layer. The change is self‑contained and looks correct.

management/server/cache/store_test.go (1)

17-41: Tests correctly adapted to new NewStore signature

All cache.NewStore invocations were updated to pass the new maxConn argument, and the rest of the test logic remains unchanged. This keeps the coverage for in‑memory and Redis stores intact.

Also applies to: 43-49, 51-69

management/server/cache/idp_test.go (1)

36-40: IDP cache tests now use the shared DefaultIDPCacheOpenConn

Using cache.DefaultIDPCacheOpenConn in the test aligns it with production defaults and avoids magic numbers. This looks good.

management/server/user_test.go (1)

552-556: User‑related tests correctly adapted to nbcache.NewStore’s new signature

Both TestUser_InviteNewUser and TestDefaultAccountManager_ExternalCache now pass nbcache.DefaultIDPCacheOpenConn into nbcache.NewStore, keeping test wiring consistent with production defaults and avoiding magic numbers. No issues here.

Also applies to: 1076-1082

Comment thread management/internals/shared/grpc/server.go
Comment on lines +24 to 31
func NewStore(ctx context.Context, maxTimeout, cleanupInterval time.Duration, maxConn int) (store.StoreInterface, error) {
redisAddr := os.Getenv(RedisStoreEnvVar)
if redisAddr != "" {
return getRedisStore(ctx, redisAddr)
return getRedisStore(ctx, redisAddr, maxConn)
}
goc := gocache.New(maxTimeout, cleanupInterval)
return gocache_store.NewGoCache(goc), nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n management/server/cache/store.go

Repository: netbirdio/netbird

Length of output: 2530


🏁 Script executed:

web_search

Repository: netbirdio/netbird

Length of output: 111


🌐 Web query:

go-redis/v9 redis.Options MaxActiveConns PoolSize connection pool fields

💡 Result:

Short answer: go-redis/v9 uses Options.PoolSize (not a MaxActiveConns field) to control the connection pool size. Key connection-pool / connection-related fields on redis.Options (v9) are:

  • PoolSize — maximum number of socket connections (default: 10 * GOMAXPROCS). [1][2]
  • MinIdleConns — minimum number of idle connections to keep. [1][2]
  • PoolTimeout — how long Get waits for a free connection when pool is exhausted (default: ReadTimeout + 1s). [1][2]
  • IdleTimeout — close connections idle longer than this (default ~5m; -1 disables). [1][2]
  • IdleCheckFrequency — frequency to run idle reaper (default 1m; -1 disables). [1][2]
  • MaxConnAge (ConnMaxLifetime) — retire/close connections older than this. [1][2]
  • DialTimeout / ReadTimeout / WriteTimeout — dial/read/write socket timeouts used by the client. [1][3]
  • OnConnect — callback executed when a new connection is created. [1][2]

Notes:

  • PoolSize applies per node for cluster clients. [1]
  • There is no built-in Options.MaxActiveConns in go-redis/v9; PoolSize is the primary knob. Some other Redis wrappers/libs expose a MaxActiveConns that behaves differently (e.g., limiting excessive connection creation), but for go-redis use PoolSize + MinIdleConns + timeouts to tune behavior and PoolTimeout to control wait behavior. [1][2][4]

Sources: official go-redis repo/docs and package reference. [1] GitHub go-redis repo options (v9). [2] pkg.go.dev / redis v9 Options docs. [3] Redis docs on production usage (timeouts). [4] examples/issues discussing pool behavior.


🌐 Web query:

go-redis/v9 redis.Options MaxIdleConns MinIdleConns valid fields

💡 Result:

Short answer: In go-redis v9 the Options struct uses MinIdleConns and MaxIdleConns as the valid fields for idle-connection control. MinIdleConns (minimum number of idle connections) remains, and MaxIdleConns (maximum number of idle connections) was added in v9 (replacing the old connection reaper behavior). See the Options documentation and v9 changelog. [1][2]

Sources:
[1] go-redis v9 Options docs. [pkg.go.dev/github.com/go-redis/redis/v9].
[2] go-redis v9 changelog (notes about MaxIdleConns and removed reaper).


🏁 Script executed:

# Check if there are any other issues or usages of these fields in the codebase
rg -n "MaxActiveConns" management/server/cache/ -A2 -B2

Repository: netbirdio/netbird

Length of output: 513


Fix redis.Options usage: MaxActiveConns does not exist; use PoolSize and validate maxConn

redis.Options from github.com/redis/go-redis/v9 includes PoolSize, MinIdleConns, MaxIdleConns, ConnMaxIdleTime, ConnMaxLifetime, and PoolTimeout, but it does not define MaxActiveConns. Assigning to options.MaxActiveConns on line 41 will fail at compile time.

If the intent is to cap total concurrent connections:

  • Use options.PoolSize = maxConn to bound the pool size.
  • Keep MinIdleConns and MaxIdleConns as fractions of maxConn, ensuring MaxIdleConns <= PoolSize.

Also consider guarding against non‑positive maxConn (e.g., default to a sane minimum or return an error) so callers can't accidentally configure an unusable pool.

 func getRedisStore(ctx context.Context, redisEnvAddr string, maxConn int) (store.StoreInterface, error) {
  	options, err := redis.ParseURL(redisEnvAddr)
  	if err != nil {
  		return nil, fmt.Errorf("parsing redis cache url: %s", err)
  	}
 
+	if maxConn <= 0 {
+		return nil, fmt.Errorf("maxConn must be > 0, got %d", maxConn)
+	}
+	options.PoolSize = maxConn
  	options.MaxIdleConns = int(math.Ceil(float64(maxConn) * 0.5)) // 50% of max conns
  	options.MinIdleConns = int(math.Ceil(float64(maxConn) * 0.1)) // 10% of max conns
-	options.MaxActiveConns = maxConn
  	options.ConnMaxIdleTime = 30 * time.Minute

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
management/server/cache/store.go around lines 24 to 31: the code is attempting
to set redis.Options.MaxActiveConns (which doesn't exist) and may not validate
maxConn; replace that assignment with options.PoolSize = maxConn, set
options.MinIdleConns and options.MaxIdleConns to sensible fractions (e.g.,
MinIdleConns = max(1, maxConn/4), MaxIdleConns = min(maxConn, maxConn/2)) while
ensuring MaxIdleConns <= PoolSize, and validate maxConn is positive (return an
error or default to a sane minimum like 2) before constructing redis.Options so
the pool is bounded correctly and the compiler error is fixed.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
management/internals/shared/grpc/server.go (1)

151-155: Consider extracting a helper for key retrieval.

The pattern of calling s.secretsManager.GetWGKey() with error handling is repeated throughout the file (7+ locations). While the current implementation is correct, extracting a helper function could improve consistency and maintainability:

func (s *Server) getServerKey(ctx context.Context) (wgtypes.Key, error) {
    key, err := s.secretsManager.GetWGKey()
    if err != nil {
        log.WithContext(ctx).Errorf("failed to retrieve WireGuard key: %v", err)
        return wgtypes.Key{}, status.Errorf(codes.Internal, "internal server error")
    }
    return key, nil
}

This would centralize error logging and message formatting.

Also applies to: 321-325, 507-510, 617-621, 723-726, 765-768, 828-831

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fc8c8b and 7b3b588.

📒 Files selected for processing (1)
  • management/internals/shared/grpc/server.go (13 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
management/internals/shared/grpc/server.go (2)
management/internals/shared/grpc/token_mgr.go (1)
  • SecretsManager (28-34)
encryption/message.go (2)
  • EncryptMessage (10-24)
  • DecryptMessage (27-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (25)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Management / Integration (amd64, postgres)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Management / Benchmark (amd64, postgres)
  • GitHub Check: Management / Benchmark (API) (amd64, postgres)
  • GitHub Check: Management / Benchmark (API) (amd64, sqlite)
  • GitHub Check: Management / Benchmark (amd64, sqlite)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: Client / Unit
  • GitHub Check: Check External GPL/AGPL Licenses
  • GitHub Check: Darwin
  • GitHub Check: Windows
  • GitHub Check: release_ui_darwin
  • GitHub Check: Linux
  • GitHub Check: release
  • GitHub Check: Client / Unit
  • GitHub Check: Android / Build
  • GitHub Check: Client / Unit
  • GitHub Check: iOS / Build
🔇 Additional comments (10)
management/internals/shared/grpc/server.go (10)

58-74: LGTM! Clean refactoring to centralize dependencies.

The removal of direct wgKey storage and the transition to networkMapController aligns well with the refactoring objectives. The updated metrics registration correctly uses networkMapController.CountStreams().

Also applies to: 77-127


151-160: LGTM! Secure key handling.

The endpoint now correctly retrieves the WireGuard key via secretsManager.GetWGKey() and returns only the public key. Error handling appropriately returns a generic message to avoid leaking internal details.


321-341: LGTM! Proper key retrieval and secure response construction.

The function correctly:

  • Retrieves the WireGuard key for encryption
  • Returns only the public key in the response (line 333)
  • Cleans up peer routines on any failure

352-352: LGTM! Consistent controller usage.

The update to use networkMapController.OnPeerDisconnected aligns with the refactoring to centralize peer lifecycle management.


507-517: LGTM! Secure request parsing.

The function correctly retrieves the WireGuard private key for message decryption and handles errors appropriately.


617-632: LGTM! Security fix confirmed - only public key returned.

The Login response correctly uses key.PublicKey().String() (line 630), ensuring the private WireGuard key is never exposed to clients. This addresses the critical security vulnerability mentioned in previous reviews.


723-745: LGTM! Consistent and secure key handling.

The initial sync correctly retrieves the key for encryption and returns only the public key in the response (line 735).


765-808: LGTM! Critical security vulnerability fixed.

Line 806 now correctly returns key.PublicKey().String() instead of exposing the private key. This resolves the critical security issue flagged in the previous review.

The endpoint properly:

  • Retrieves the key via secretsManager.GetWGKey()
  • Uses the private key for encryption/decryption
  • Returns only the public key to clients

828-869: LGTM! Secure key handling throughout.

The PKCE authorization flow correctly retrieves keys via secretsManager and returns only the public key in the response (line 867).


263-269: <!-- [scratchpad_start]
[task verification_1/1 - FINAL CONTEXT CHECK]

[observations from full context]
Line 249: SyncAndMarkPeer returns peer, netMap, postureChecks, dnsFwdPort, err

  • This returns the peer object that is used later
  • This is where the peer is marked as connected in accountManager
  • If this succeeds, peer exists and is marked connected

Line 256-260: sendInitialSync - can fail, but if it does, it's not followed by OnPeerConnected

  • On failure returns early

Line 263: OnPeerConnected uses peer.ID from the peer returned at line 249

  • This peer definitely exists (SyncAndMarkPeer succeeded)
  • The only way OnPeerConnected fails is if controller.GetPeerByID fails
  • But peer was just synced and marked at line 249, so it should exist

Line 267: cancelPeerRoutines is called on OnPeerConnected failure

  • This calls accountManager.OnPeerDisconnected with peer.Key (NOT peer.ID)
  • Then networkMapController.OnPeerDisconnected with peer.ID

[KEY INSIGHT]
Line 348 uses peer.Key (public key string)
Line 352 uses peer.ID (UUID)
Line 264 OnPeerConnected was called with peer.ID

So in cancelPeerRoutines:

  • accountManager.OnPeerDisconnected is called with peer.Key (mark as disconnected)
  • networkMapController.OnPeerDisconnected is called with peer.ID (cleanup manager state)

The peer definitely exists in the database because SyncAndMarkPeer succeeded.
So GetPeerByID in OnPeerDisconnected should succeed in the vast majority of cases.

[FINAL ASSESSMENT]
Review comment is CORRECT in flagging this as worth verifying:

  1. OnPeerConnected can fail (due to GetPeerByID)
  2. When it fails, OnPeerDisconnected is called for cleanup
  3. All operations are idempotent
  4. The concern about "calling disconnect for a peer that was never fully connected" is valid but NOT A BUG
  5. The code handles this correctly with idempotent operations

However, the review comment's concern could be more specific:

  • The actual failure point is GetPeerByID in OnPeerConnected
  • If that fails, no state is changed
  • OnPeerDisconnected is safe to call (idempotent)
  • No bug exists, but the pattern could be clearer

[CONCLUSION]
The review comment correctly identifies a potential concern about cleanup semantics, but upon verification:

  • All operations are idempotent
  • No actual bug or data corruption would occur
  • The code is safe, though the failure handling could be more explicit/clearer
    </scratchpad_start] -->

Based on my verification of the code, I can now provide the rewritten review comment:

Cleanup sequence when OnPeerConnected fails is safe due to idempotent operations, but warrants clarity.

When OnPeerConnected fails (line 263), cancelPeerRoutines is called which invokes OnPeerDisconnected. The failure can only occur at GetPeerByID within OnPeerConnected (line 114 of controller.go), which happens before any state changes.

Verified findings:

  1. closeChannel is idempotent (line 92-102 of updatechannel.go): handles missing channels gracefully, logging and returning safely if the channel was never created
  2. EphemeralPeersManager.OnPeerDisconnected is idempotent (line 111-126 of ephemeral.go): checks if peer is already on the disconnect list and returns early if so
  3. Failure at GetPeerByID: If OnPeerConnected fails here, no state is modified, and OnPeerDisconnected operations are safe no-ops

The cleanup sequence is functionally correct, though it could be more explicit about the expected failure mode and why it's safe to call disconnect cleanup when connection setup failed.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Dec 1, 2025

@pascal-fischer pascal-fischer merged commit 7193bd2 into main Dec 2, 2025
42 of 43 checks passed
@pascal-fischer pascal-fischer deleted the refactor/network-map-controller branch December 2, 2025 11:34
@coderabbitai coderabbitai Bot mentioned this pull request Mar 25, 2026
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants