[client] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab (stack 2/4) #6082
MichaelUray wants to merge 63 commits into netbirdio:main
Conversation
…nfig Additive change for issue netbirdio#5989 Phase 1. New fields use new tag numbers (11, 12, 13); existing fields (including LazyConnectionEnabled tag 6) are unchanged so old clients ignore the additions and old servers send UNSPECIFIED, which the new client maps back via the legacy boolean. Note: the regenerated pb.go files now report protoc v5.29.3 in their header (this branch was generated with locally-installed protoc 29.3 instead of upstream's v7.34.1). Functionally identical; header diff is the only delta beyond the actual schema additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines Mode enum (relay-forced, p2p, p2p-lazy, p2p-dynamic plus the client-only sentinels Unspecified and FollowServer), ParseString for CLI/env input, ToProto/FromProto for wire translation, and the two backwards-compat helpers ResolveLegacyLazyBool / ToLazyConnectionEnabled that bridge the old Settings.LazyConnectionEnabled boolean. Phase 1 of issue netbirdio#5989. Pure addition -- no existing callers touched in this commit; the engine/conn_mgr migration follows in subsequent commits in the same PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
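To make the shape of that API concrete, here is a minimal sketch of what such a package could look like (string-backed for readability; the actual netbird implementation may use a different representation, and the ToProto/FromProto wire translation is omitted):

```go
package connectionmode

import "fmt"

// Mode is sketched as a string enum. Only the four canonical values cross
// the wire; the last two are client-only sentinels.
type Mode string

const (
	ModeRelayForced  Mode = "relay-forced"
	ModeP2P          Mode = "p2p"
	ModeP2PLazy      Mode = "p2p-lazy"
	ModeP2PDynamic   Mode = "p2p-dynamic"
	ModeFollowServer Mode = "follow-server" // client-only sentinel
	ModeUnspecified  Mode = ""              // client-only sentinel
)

// ParseString maps CLI/env input onto a Mode, rejecting unknown values.
func ParseString(s string) (Mode, error) {
	switch Mode(s) {
	case ModeRelayForced, ModeP2P, ModeP2PLazy, ModeP2PDynamic, ModeFollowServer:
		return Mode(s), nil
	default:
		return ModeUnspecified, fmt.Errorf("unknown connection mode %q", s)
	}
}

// ResolveLegacyLazyBool bridges the old Settings.LazyConnectionEnabled boolean.
func ResolveLegacyLazyBool(lazyEnabled bool) Mode {
	if lazyEnabled {
		return ModeP2PLazy
	}
	return ModeP2P
}

// ToLazyConnectionEnabled is the reverse mapping sent to old clients.
// relay-forced cannot be expressed and degrades to false; how p2p-dynamic
// maps is an assumption in this sketch.
func (m Mode) ToLazyConnectionEnabled() bool {
	return m == ModeP2PLazy || m == ModeP2PDynamic
}
```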
…on warns NB_CONNECTION_MODE wins over the legacy pair (NB_FORCE_RELAY, NB_ENABLE_EXPERIMENTAL_LAZY_CONN); when the legacy pair is set together, NB_FORCE_RELAY wins (most-restrictive, mirrors the group-conflict rule from issue netbirdio#5990). Each legacy var emits a one-shot deprecation warning when it actually contributes to the resolved mode. NB_LAZY_CONN_INACTIVITY_THRESHOLD becomes an alias for the future relay_timeout setting and warns once. IsForceRelayed() is kept for callers that have not yet been migrated (conn.go, statusrecorder); they will be updated in the engine/conn refactor commits later in this PR. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
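Building on the sketch above (assuming `os` is imported and simplifying the truthy-value parsing), the documented precedence could be expressed roughly as follows; the resolveFromEnv helper is hypothetical, and the real code additionally emits the one-shot deprecation warnings:

```go
// resolveFromEnv illustrates the precedence order: NB_CONNECTION_MODE wins
// outright; within the legacy pair, NB_FORCE_RELAY beats
// NB_ENABLE_EXPERIMENTAL_LAZY_CONN (most-restrictive rule).
func resolveFromEnv() Mode {
	if v := os.Getenv("NB_CONNECTION_MODE"); v != "" {
		if m, err := ParseString(v); err == nil {
			return m
		}
	}
	if os.Getenv("NB_FORCE_RELAY") == "true" {
		return ModeRelayForced // real code warns: NB_FORCE_RELAY is deprecated
	}
	if os.Getenv("NB_ENABLE_EXPERIMENTAL_LAZY_CONN") == "true" {
		return ModeP2PLazy // real code warns: legacy lazy-conn var is deprecated
	}
	return ModeUnspecified
}
```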
Three new CLI flags map onto the new connection-mode plumbing:
- --connection-mode <relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server>
- --relay-timeout <seconds>
- --p2p-timeout <seconds>
Plumbed through three sites in cmd/up.go (SetConfigRequest, ConfigInput, LoginRequest), persisted in profilemanager.Config, and added as new fields on the daemon.proto IPC messages. Empty / not-changed flags fall back to the server-pushed value (which itself falls back to the legacy lazy_connection_enabled boolean for old servers). Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EngineConfig gains ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds. ConnMgr now stores the resolved Mode plus the raw inputs (env, config) so it can re-resolve when the server pushes a new PeerConfig. UpdatedRemoteFeatureFlag is renamed to UpdatedRemotePeerConfig and takes the full PeerConfig pointer; a thin shim with the old name delegates to it for callers that haven't been updated yet. connect.go copies the three new fields from profilemanager.Config into the EngineConfig builder, with a tolerant parser that logs and falls through to Unspecified on invalid input. Phase 1 of issue netbirdio#5989. peer/conn.go forwarding follows in C4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ConnConfig gains a Mode field forwarded from the engine. Open() now checks Mode == ModeRelayForced instead of calling the global env-reader IsForceRelayed(). The local 'forceRelay' variable name is renamed to 'skipICE' to make the new branching intent explicit. The PeerStateUpdate block at the end of Open() also reads from conn.config.Mode now, so the StatusRecorder sees the per-peer mode rather than the global env var. A single remaining caller of IsForceRelayed() (srWatcher.Start in engine.go) is left for a follow-up; that path uses a process-wide flag not per-peer state, so it can be migrated in Phase 2 once srWatcher itself learns about ConnectionMode. Phase 1 of issue netbirdio#5989. Engine forwarding (C5) follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
createPeerConn now reads ConnMgr.Mode() and copies it into peer.ConnConfig, so the per-peer Open() loop in conn.go can take the ModeRelayForced skip-ICE branch without reading the global env var. This is the last wiring commit for the client side of Phase 1; the server-side mgmt changes (Settings + OpenAPI + handler + audit + NetworkMap-build) follow in Section D. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three fields are nullable to distinguish 'use built-in default' (NULL) from explicit values (incl. 0 = never tear down). Copy() now deep-clones the new pointer fields via two small helpers. GORM AutoMigrate creates the new columns at first start; existing accounts have NULL in all three columns and resolve via the legacy LazyConnectionEnabled boolean. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
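The two helpers are presumably along these lines (the names cloneStringPtr / cloneUint32Ptr appear in the review comments further down; the bodies here are an assumption):

```go
// cloneStringPtr returns a pointer to a fresh copy of *p, or nil for nil,
// so Copy() never shares pointer-backed settings between account copies.
func cloneStringPtr(p *string) *string {
	if p == nil {
		return nil
	}
	v := *p
	return &v
}

// cloneUint32Ptr does the same for the nullable timeout fields.
func cloneUint32Ptr(p *uint32) *uint32 {
	if p == nil {
		return nil
	}
	v := *p
	return &v
}
```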
…tings Three new optional, nullable fields with descriptions of the NULL = built-in-default semantics and the Phase-1-vs-Phase-2 status of p2p-dynamic. Regenerated types.gen.go via the existing oapi-codegen tooling. The generated AccountSettingsConnectionMode enum has the canonical values relay-forced / p2p / p2p-lazy / p2p-dynamic, plus a Valid() helper for handler-side validation. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PUT /api/accounts/{id} now accepts connection_mode (validated against the four-value enum via the generated AccountSettingsConnectionMode.Valid()), p2p_timeout_seconds and relay_timeout_seconds. NULL in the JSON body keeps the existing value untouched (= "no client-side override on this round-trip"); explicit NULL-clear via the API uses a distinct PATCH-style call which is out of scope for Phase 1. The response payload mirrors the input fields back as nullable so the dashboard can distinguish "use default" from "explicit value". Phase 1 of issue netbirdio#5989. Audit-event emission follows in D5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AccountConnectionModeChanged (121), AccountRelayTimeoutChanged (122), AccountP2pTimeoutChanged (123) -- emitted from account.go when settings change. Per-peer / per-group event codes are reserved for Phase 3 (issue netbirdio#5990). Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handleConnectionModeSettings is invoked from the same diff-detection block as handleLazyConnectionSettings; emits one StoreEvent per changed field (ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds) with old/new values in the meta payload. Four small ptr-equality / deref helpers are added for nullable string and uint32 fields. They are package-private and named after the existing convention used elsewhere in the package. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elds Two changes in one commit because they're inseparable:
1. Move client/internal/peer/connectionmode/ to shared/connectionmode/. The package now needs to be importable from BOTH client/ and management/ (which is impossible while it lives under client/internal/ per Go's internal-package rule). All imports updated; tests pass on both sides.
2. Extend management/internals/shared/grpc/conversion.go::toPeerConfig to populate the three new PeerConfig fields (ConnectionMode, P2PTimeoutSeconds, RelayTimeoutSeconds) using the connectionmode helpers. The legacy LazyConnectionEnabled boolean is now derived from the resolved Mode via ToLazyConnectionEnabled() rather than copied verbatim from Settings -- this is the central backwards-compat contract: old clients see only the boolean, new clients prefer the explicit enum and ignore the bool.
Resolution rules (Phase 1, account-wide only; sketched in code below):
- Settings.ConnectionMode != nil and parses -> wins
- Otherwise -> ResolveLegacyLazyBool(LazyConnectionEnabled)
- timeouts: Settings.RelayTimeoutSeconds / P2pTimeoutSeconds when non-NULL, else 0 (= server has no preference; client uses built-in default)
Per-peer / per-group resolution comes in Phase 3 (netbirdio#5990). Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
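A compact sketch of that resolution contract, reusing the connectionmode names from the commit text (the function itself is hypothetical, not the actual conversion.go code):

```go
// resolvePushedMode picks the mode a NetworkMap build should advertise:
// an explicit, parseable Settings.ConnectionMode wins; otherwise the
// legacy boolean decides between p2p and p2p-lazy.
func resolvePushedMode(explicit *string, legacyLazy bool) connectionmode.Mode {
	mode := connectionmode.ResolveLegacyLazyBool(legacyLazy)
	if explicit != nil {
		if m, err := connectionmode.ParseString(*explicit); err == nil && m != connectionmode.ModeUnspecified {
			mode = m
		}
	}
	return mode
}
```

Old clients then receive mode.ToLazyConnectionEnabled() as the wire boolean, while new clients read the enum directly.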
Nine sub-cases cover the Phase-1 resolution matrix from spec section 3:
- no settings -> default (P2P + lazy=false)
- legacy bool only -> mapped via ResolveLegacyLazyBool
- explicit ConnectionMode -> wins over the legacy bool
- timeouts propagate
- garbage ConnectionMode value -> tolerant fallback to legacy bool
Particular attention to the structural compat gap: relay-forced cannot be expressed via the legacy boolean, so the wire field for old clients is sent as false. Documented in the spec, asserted here. Existing TestAccount_GetPeerNetworkMap remains green: existing test peers have ConnectionMode=NULL in Settings, which falls through to the legacy ResolveLegacyLazyBool(false) -> ModeP2P -> wire bool false. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of netbirdio#5989 needs to dynamically detach the ICE listener at runtime (when p2p-dynamic mode hits its ICE-inactivity threshold). Adds the RemoveICEListener method and extends the Add/RemoveICEListener pair to hold h.mu; Listen() now reads h.iceListener through readICEListener under the same mutex, fixing a latent race that didn't matter while AddICEListener was the only writer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
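A sketch of the mutex-guarded accessor pair described above (field and type names assumed; the real Handshaker holds considerably more state):

```go
// AddICEListener installs the ICE-offer listener under h.mu.
func (h *Handshaker) AddICEListener(l ICEListener) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.iceListener = l
}

// RemoveICEListener detaches it again; Phase 2 calls this when
// p2p-dynamic mode hits its ICE-inactivity threshold.
func (h *Handshaker) RemoveICEListener() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.iceListener = nil
}

// readICEListener is what Listen() now uses instead of a direct field
// read, closing the latent race with concurrent Add/Remove.
func (h *Handshaker) readICEListener() ICEListener {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.iceListener
}
```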
Phase 2 of netbirdio#5989 needs the daemon to fire ICE-teardown and relay-teardown at different idle thresholds. Adds NewManagerWithTwoTimers plus iceInactiveChan / relayInactiveChan; the periodic check loop fires each channel when its threshold elapses. Threshold == 0 disables that channel (peer.lazy: iceTimeout=0; peer.dynamic: both>0; eager modes don't register peers at all). NewManager (Phase-1 entry point) becomes a thin wrapper that delegates to newManager with iceTimeout=0, preserving p2p-lazy semantics exactly. The Phase-1 InactivePeersChan now aliases the new RelayInactiveChan so existing callers (engine.go p2p-lazy path) continue to work unchanged. Five sub-tests cover the timeout matrix:
- TwoTimers_OnlyICEFires (peer idle between thresholds)
- TwoTimers_BothFire (peer idle past both)
- TwoTimers_ICEDisabled (iceTimeout=0)
- TwoTimers_RelayDisabled (relayTimeout=0)
- TwoTimers_BothDisabled (both 0 = manager inert)
Plus Phase1_LazyEquivalence proves NewManager-call-site behaviour is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
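A toy version of the dual-threshold scan, under the assumption that the manager keeps a last-activity timestamp per peer (the real implementation reads WireGuard transfer counters, and field names differ):

```go
// scanIdle fires relay teardown first and skips the ICE channel for that
// peer, so a single scan never reports the same peer on both channels.
// A zero threshold disables its channel entirely.
func scanIdle(lastActivity map[string]time.Time, now time.Time,
	iceTimeout, relayTimeout time.Duration,
	iceInactive, relayInactive chan<- string) {
	for peerID, last := range lastActivity {
		idle := now.Sub(last)
		if relayTimeout > 0 && idle >= relayTimeout {
			relayInactive <- peerID // tear down the whole tunnel
			continue
		}
		if iceTimeout > 0 && idle >= iceTimeout {
			iceInactive <- peerID // tear down ICE only; relay stays up
		}
	}
}
```

The continue also illustrates the ordering fix requested in the inactivity-manager review comment further down: relay teardown subsumes ICE teardown, so double-enqueueing the same peer is avoided.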
Two-tier timeouts for Phase-2 p2p-dynamic. The deprecated InactivityThreshold pointer field becomes a backwards-compat alias that maps onto RelayInactivityThreshold when set, so Phase-1 callers that still pass it (engine.go, p2p-lazy tests) keep working without edits. resolvedTimeouts() helper centralises the alias logic. NewManager picks the appropriate inactivity-manager constructor based on the resolved timeouts; legacy callers continue through the Phase-1 single-timer code path. InactivityManager() getter exposes the inner manager so the conn_mgr can subscribe to ICEInactiveChan / RelayInactiveChan in the p2p-dynamic dispatch loop. Phase 2 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AttachICE registers the ICE-offer listener on the handshaker after the activity-detector observes traffic; emits a fresh offer so the remote side learns we are now ICE-capable. DetachICE removes the listener and calls workerICE.Close() so the ICE worker tears down without affecting the relay tunnel. Both methods are idempotent and use Handshaker.readICEListener() under its mutex, avoiding the race with Handshaker.Listen() that the prior direct field read had. Used in Phase 2 by the inactivity manager to flip ICE on/off independently of the relay tunnel. Refs netbirdio#5989 (Phase 2 / connection-mode-phase2).
Phase 2 of netbirdio#5989. ModeP2PDynamic now takes a third branch in Open(): workerICE is constructed eagerly (so the activity-trigger does not have to wait for ICE setup) but the AddICEListener call is skipped. The handshaker still has workerICE wired in via NewHandshaker, so buildOfferAnswer can include local ICE credentials when the listener is later attached by Conn.AttachICE() on activity-trigger. Behavior matrix in Open(): - ModeRelayForced: no workerICE, no ICE listener. - ModeP2P / ModeP2PLazy: workerICE created, listener registered eagerly (Phase-1 lazy-tunnel defer remains at conn_mgr level). - ModeP2PDynamic: workerICE created, listener deferred to AttachICE. Refs netbirdio#5989 (Phase 2).
Phase 2 of netbirdio#5989. Fixes the lazy/eager mismatch where a remote peer's GO_IDLE was silently dropped whenever the local manager was not in lazy mode, leaving the eager local end to immediately reconnect and defeating the remote's lazy/dynamic intent. DeactivatePeer now dispatches by locally-resolved connection mode: p2p-lazy -> existing lazy-manager teardown (whole tunnel) p2p-dynamic-> DetachICEForPeer (ICE only, relay tunnel stays up) p2p / relay-forced / unspecified -> silent no-op (eager modes deliberately ignore GO_IDLE because they are always-on) DetachICEForPeer is also the entry point used by the inactivity manager when its iceTimeout fires (engine wiring follows in E1). Lookup miss is treated as a no-op: a peer can be removed concurrently with a GO_IDLE or inactivity-timer event. Refs netbirdio#5989 (Phase 2).
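The dispatch reduces to a small mode switch; a hypothetical, dependency-free rendering (callback parameters stand in for the lazy-manager and DetachICEForPeer paths):

```go
// dispatchGoIdle applies a remote GO_IDLE according to the locally
// resolved mode, mirroring the rules above.
func dispatchGoIdle(mode connectionmode.Mode, tearDownTunnel, detachICE func()) {
	switch mode {
	case connectionmode.ModeP2PLazy:
		tearDownTunnel() // whole tunnel, Phase-1 lazy behavior
	case connectionmode.ModeP2PDynamic:
		detachICE() // ICE only; the relay tunnel stays up
	default:
		// p2p / relay-forced / unspecified: eager modes ignore GO_IDLE.
	}
}
```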
Phase 2 of netbirdio#5989. ConnMgr.Start now also brings up the lazy/dynamic manager when mode==ModeP2PDynamic. A new runDynamicInactivityLoop goroutine reads from inactivity.Manager.ICEInactiveChan() and RelayInactiveChan() and dispatches DetachICEForPeer or full Conn.Close per peer. initLazyManager populates Config.RelayInactivityThreshold from e.relayTimeoutSecs and Config.ICEInactivityThreshold from e.p2pTimeoutSecs (only in dynamic mode; in lazy mode iceTimeout stays 0 so the ICE channel never fires, preserving Phase-1 semantics exactly). UpdatedRemotePeerConfig generalized to handle the dynamic mode the same way it handled lazy: enable the manager on entry, tear down on exit, restart on lazy<->dynamic transitions so the new timeouts take effect. Refs netbirdio#5989 (Phase 2).
Phase-2 follow-up. The dynamic-mode iceTimeout was always 0 in the field because resolveConnectionMode only resolved relay-timeout, not p2p-timeout, and the manager-init only consulted e.p2pTimeoutSecs which never got updated from server pushes. Resolution chain for P2pTimeoutSeconds is now symmetric to relay: cfgP2pTimeout (client config) > serverPC.GetP2PTimeoutSeconds() (no env var; reserved for Phase 3). UpdatedRemotePeerConfig now also stores the resolved p2p-timeout and includes it in the early-return short-circuit so changes to the server-pushed value trigger a re-init. Refs netbirdio#5989 (Phase 2).
Phase 2 of netbirdio#5989. Closes the activity-attach loop: ActivatePeer is called by the engine when a remote peer signals it is active. In p2p-lazy this opens the relay tunnel; in p2p-dynamic the relay was already eager-opened, so we additionally call Conn.AttachICE() to register the deferred ICE listener and emit a fresh offer. Without this, p2p-dynamic peers would never upgrade from relay to ICE. Refs netbirdio#5989 (Phase 2).
…5989) Phase 3 introduces per-peer ICE-failure backoff. The server-pushed cap on the truncated-exponential schedule lives in this new field 14. Wire-format sentinels:
- 0 -> "not set"; the daemon falls back to the built-in 15 min default
- max -> user explicitly disabled backoff (Phase-2 behavior)
- N -> N seconds as the MaxInterval of cenkalti/backoff
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. Account-wide cap on the ICE-failure backoff schedule. Nullable: NULL = use daemon default (15 min), 0 = treated as "user-explicit disable" by the conversion layer (signaled via uint32-max sentinel on the wire). Plus deep-copy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. New field documented in the public API. Defaults to 900 seconds (15 min). 0 disables backoff. Regenerated types.gen.go via oapi-codegen. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. The PUT handler now persists the new account-wide ICE-backoff cap, and the GET response surfaces it back. NULL stays NULL (= "use daemon default"). Mirrors the Phase-1+2 plumbing of relay_timeout_seconds and p2p_timeout_seconds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. New activity-code 124 (AccountP2pRetryMaxChanged) plus the StoreEvent call inside handleConnectionModeSettings, mirroring the Phase-1 Relay-timeout and Phase-2 P2P-timeout patterns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apping Phase 3 of netbirdio#5989. The conversion layer translates between DB-NULL (= "use daemon default") and Settings-zero (= "user-explicit disable") using a uint32-max sentinel on the wire. Test matrix covers all three states: NULL, 0, and a normal positive value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
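The three-state mapping is small enough to sketch in full (assumes `import "math"`; helper names here are hypothetical, the sentinel semantics follow the commit text):

```go
// DB -> wire: NULL means "use daemon default" (wire 0); an explicit 0 means
// "user disabled backoff" and travels as the uint32-max sentinel.
func retryMaxToWire(dbValue *uint32) uint32 {
	switch {
	case dbValue == nil:
		return 0
	case *dbValue == 0:
		return math.MaxUint32
	default:
		return *dbValue
	}
}

// Wire -> daemon: the inverse; returns the cap in seconds plus a disabled flag.
func retryMaxFromWire(wire uint32) (capSecs uint32, disabled bool) {
	switch wire {
	case 0:
		return 900, false // daemon default, 15 min
	case math.MaxUint32:
		return 0, true // user-explicit disable
	default:
		return wire, false
	}
}
```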
Phase 3 of netbirdio#5989. Per-peer ICE-failure backoff using cenkalti/backoff v4 (already imported by Guard). InitialInterval 1m, Multiplier 2.0, RandomizationFactor 0.1 -- produces ~1m, ~2m, ~4m, ~8m, ~15m capped sequence with default cap. Public API: newIceBackoff, markFailure (returns delay for logging), markSuccess, Reset (alias), IsSuspended, Snapshot (read-only view for status output), SetMaxBackoff (live update). maxBackoff=0 disables the backoff entirely (Phase-2 behavior). Test matrix covers initial state, exponential growth, cap override, disabled mode, success-reset, hard-reset, suspended-expiry, and live SetMaxBackoff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
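For reference, the described schedule maps onto cenkalti/backoff v4 roughly like this (parameter values come from the commit message; the constructor name and wrapper are illustrative, not the PR's newIceBackoff):

```go
import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

// newIceExpBackoff builds the truncated-exponential schedule:
// ~1m, ~2m, ~4m, ~8m, then capped at maxBackoff (15m by default),
// each interval jittered by +/-10%.
func newIceExpBackoff(maxBackoff time.Duration) *backoff.ExponentialBackOff {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = time.Minute
	b.Multiplier = 2.0
	b.RandomizationFactor = 0.1
	b.MaxInterval = maxBackoff
	b.MaxElapsedTime = 0 // never stop retrying; the cap only bounds the interval
	return b
}
```

Each subsequent call to b.NextBackOff() then yields the next (jittered, capped) delay.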
Actionable comments posted: 18
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/peer/guard/guard.go (1)
138-147: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Synchronous callback may stall the reconnect loop
onNetworkChange() is called directly inside the select body. If the callback takes any non-trivial time — or tries to acquire a lock that the surrounding Conn code also holds during network-change processing — ctx.Done(), relayedConnDisconnected, and iCEConnDisconnected events are all frozen until it returns.
Per the PR description the callback is expected to "reset iceBackoff + recreate workerICE", which touches Conn's internal mutex and may spawn goroutines. Consider dispatching asynchronously to keep the loop responsive:
💡 Suggested fix
```diff
-		case <-srReconnectedChan:
-			g.log.Debugf("has network changes, reset reconnection ticker")
-			ticker.Stop()
-			ticker = g.newReconnectTicker(ctx)
-			tickerChannel = ticker.C
-			iceState.reset()
-			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
-			if g.onNetworkChange != nil {
-				g.onNetworkChange()
-			}
+		case <-srReconnectedChan:
+			g.log.Debugf("has network changes, reset reconnection ticker")
+			ticker.Stop()
+			ticker = g.newReconnectTicker(ctx)
+			tickerChannel = ticker.C
+			iceState.reset()
+			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+			// Fire in a goroutine so the reconnect loop stays responsive to ctx.Done()
+			// and other disconnect events while the callback runs.
+			if g.onNetworkChange != nil {
+				go g.onNetworkChange()
+			}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/guard/guard.go` around lines 138 - 147, The select case calling g.onNetworkChange() synchronously can block the reconnect loop; change it to invoke the callback asynchronously (e.g., spawn a goroutine) so the select can continue handling ctx.Done(), relayedConnDisconnected and iCEConnDisconnected immediately. Locate the case handling srReconnectedChan in guard.go (references: srReconnectedChan, ticker, g.newReconnectTicker, iceState.reset, g.onNetworkChange) and replace the direct call to g.onNetworkChange() with a non-blocking dispatch (start a goroutine, optionally recover/panic-handle and/or swap to a bounded worker or channel if you must limit concurrency) to avoid holding up the loop.
🧹 Nitpick comments (5)
management/server/types/settings_test.go (1)
5-20: ⚡ Quick win
Consider adding a ConnectionMode deep-copy test to cover cloneStringPtr
The test validates cloneUint32Ptr via P2pRetryMaxSeconds, but cloneStringPtr (used by ConnectionMode) is a separate code path with no test coverage. A mirror test would be low-effort:
🧪 Suggested addition
```diff
+func TestSettings_Copy_ConnectionMode(t *testing.T) {
+	mode := "p2p-dynamic"
+	src := &Settings{ConnectionMode: &mode}
+	dst := src.Copy()
+	if dst.ConnectionMode == nil {
+		t.Fatal("Copy lost ConnectionMode pointer")
+	}
+	if *dst.ConnectionMode != "p2p-dynamic" {
+		t.Fatalf("expected p2p-dynamic, got %s", *dst.ConnectionMode)
+	}
+	// Verify deep copy (different pointers).
+	*dst.ConnectionMode = "relay-forced"
+	if *src.ConnectionMode != "p2p-dynamic" {
+		t.Fatal("Copy did not deep-clone ConnectionMode")
+	}
+}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/types/settings_test.go` around lines 5 - 20, Add a new unit test mirroring TestSettings_Copy_P2pRetryMaxSeconds to validate deep-copy behavior for Settings.ConnectionMode: create a source Settings with ConnectionMode set to a string pointer, call Settings.Copy(), assert dst.ConnectionMode is non-nil and equals the original value, then mutate *dst.ConnectionMode and assert *src.ConnectionMode remains unchanged to verify cloneStringPtr produced a deep copy; reference the Settings struct, its Copy method, and cloneStringPtr in the test.
shared/management/http/api/openapi.yml (1)
357-369: ⚡ Quick win
Mark legacy lazy_connection_enabled as deprecated and state precedence.
With both fields present, migration intent is unclear unless deprecation and precedence are explicit in the schema.
🧭 Suggested compatibility clarification
```diff
       lazy_connection_enabled:
         x-experimental: true
+        deprecated: true
         description: Enables or disables experimental lazy connection
         type: boolean
         example: true
       connection_mode:
         x-experimental: true
         type: string
         enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
         nullable: true
         description: |
           Account-wide default peer-connection mode. NULL means "fall back
           to lazy_connection_enabled" for backwards compatibility.
+          When non-NULL, `connection_mode` takes precedence.
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/openapi.yml` around lines 357 - 369, Mark the legacy boolean field lazy_connection_enabled as deprecated and make precedence explicit: add deprecated: true to the lazy_connection_enabled schema and update its description to indicate it is legacy and overridden by connection_mode when connection_mode is non-null; update the connection_mode description to explicitly state that a non-null connection_mode takes precedence over lazy_connection_enabled and that NULL means "fall back to lazy_connection_enabled" for backwards compatibility (use the existing keys lazy_connection_enabled and connection_mode to locate the schema).
shared/connectionmode/mode_test.go (1)
77-94: 💤 Low value
ModeFollowServer missing from TestToLazyConnectionEnabled.
The table omits ModeFollowServer, so its ToLazyConnectionEnabled() return value is never asserted. If its behaviour changes it won't be caught. Consider adding a case (expected value is almost certainly false, matching ModeUnspecified).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/connectionmode/mode_test.go` around lines 77 - 94, The test TestToLazyConnectionEnabled omits ModeFollowServer so its ToLazyConnectionEnabled() behavior isn't asserted; add a test case entry in the cases table for ModeFollowServer with the expected boolean (likely false, matching ModeUnspecified) so the call to Mode.ToLazyConnectionEnabled() for ModeFollowServer is validated like the other entries.
client/internal/peer/status.go (1)
367-382: 💤 Low value
UpdatePeerIceBackoff silently ignores unknown peer keys — inconsistent with other update methods.
Every other UpdatePeer* method (e.g., UpdatePeerState, UpdateWireGuardPeerState, UpdatePeerFQDN) returns an error when the peer is not found. UpdatePeerIceBackoff returns nothing and silently no-ops. While a silent no-op is safe here (the peer may have been removed between the ICE failure and this call), the divergence from the established pattern makes it harder to detect misconfigured call sites during development. Consider returning an error for consistency, or add a brief comment explaining why the silent no-op is intentional.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 367 - 382, UpdatePeerIceBackoff currently silently no-ops when the peer key is missing, diverging from other update methods like UpdatePeerState / UpdateWireGuardPeerState / UpdatePeerFQDN which return an error; change UpdatePeerIceBackoff(pubKey string, snap BackoffSnapshot) to return an error, lock/unlock as before, and when peer not found return a descriptive error (e.g., fmt.Errorf("peer %s not found", pubKey)); update callers to handle the returned error, or if you intentionally want the silent no-op instead, add a short comment above UpdatePeerIceBackoff explaining why ignoring missing peers is acceptable to avoid surprise.
client/internal/peer/conn_test.go (1)
374-384: 💤 Low value
Test assertion is tied to an internal error string.
Line 381 checks strings.Contains(errMsg, "workerICE") — if the error message in AttachICE is ever reworded, this assertion passes vacuously. Consider asserting on a sentinel error value (errors.Is) instead of a substring, or at minimum add a comment acknowledging the coupling.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn_test.go` around lines 374 - 384, The test currently asserts on the error string from c.AttachICE() by checking for "workerICE", which is brittle; change the implementation of AttachICE to return or wrap a sentinel error (e.g., ErrNilWorkerICE) and update the test to use errors.Is(err, ErrNilWorkerICE) instead of substring matching; if adding a sentinel isn't possible right now, at minimum replace the substring check with errors.Is against an existing exported sentinel or add a clear comment documenting the coupling. Ensure AttachICE (and any callers) wrap the sentinel with fmt.Errorf("%w: ...", ErrNilWorkerICE) so errors.Is will work reliably.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@client/internal/conn_mgr.go`:
- Around line 176-188: When Rosenpass is enabled we must not leave e.mode set to
a lazy/dynamic variant while lazyConnMgr is not started; change ConnMgr.Start so
that if e.rosenpassEnabled is true you replace/downgrade e.mode to a non-lazy
equivalent (e.g., convert p2p-lazy/p2p-dynamic -> their non-lazy counterpart)
before returning, and also update UpdatedRemotePeerConfig to detect when newMode
is a lazy/dynamic variant while e.rosenpassEnabled is true and store the
downgraded non-lazy mode instead of newMode; reference ConnMgr.Start,
e.rosenpassEnabled, e.mode, lazyConnMgr, UpdatedRemotePeerConfig, newMode and
ActivatePeer when making this change.
In `@client/internal/lazyconn/inactivity/manager.go`:
- Around line 233-239: The scan can enqueue the same peer into both iceIdle and
relayIdle, causing races in the consumer; modify the manager scan so that when a
peer meets relayTimeout it is added to relayIdle and then skipped for ICE
notification (i.e., check relayIdle before adding to iceIdle), or alternatively
collapse to a single ordered notification list processed in priority order;
specifically update the block that uses m.relayTimeout and m.iceTimeout
(variables shown as relayIdle, iceIdle, checkTime, since) so relayIdle is
evaluated/assigned first and iceIdle insertion is gated on relayIdle not
containing peerID (or replace both maps with a single ordered notification for
the consumer).
In `@client/internal/lazyconn/manager/manager.go`:
- Around line 103-109: The code treats iceTO==0 && relayTO==0 as a legacy "not
migrated" case; change resolvedTimeouts() to also return a migration flag (e.g.,
resolvedTimeouts() -> (iceTO, relayTO, migrated bool)) and then update the
branch in manager.go to: if !migrated { keep legacy single-timer behavior via
inactivity.NewManager(wgIface, config.InactivityThreshold) } else if iceTO == 0
&& relayTO == 0 { explicitly disable inactivity (e.g., set m.inactivityManager
to a disabled/no-op manager or nil) } else { use
inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) } so intentional 0/0
does not fall back to the legacy manager; refer to resolvedTimeouts, iceTO,
relayTO, m.inactivityManager, NewManager and NewManagerWithTwoTimers when making
the change.
In `@client/internal/peer/conn.go`:
- Around line 202-212: The code currently conflates three states (disabled
sentinel, explicit 0 for "use default", and a literal non-zero seconds) so
Open() loses a prior "disabled backoff" sentinel; update the Open()/backoff init
logic and SetIceBackoffMax handling to preserve the explicit disable sentinel:
when reading conn.config.P2pRetryMaxSeconds in the backoff initialization
(backoffCap calculation and calls to newIceBackoff/SetMaxBackoff) first test for
the special disable sentinel value (e.g. ^uint32(0) or whatever sentinel you
use) and propagate that as a disabled state to conn.iceBackoff (e.g. leave
conn.iceBackoff as disabled or call a Disable/SetMaxBackoffDisabled path)
instead of treating it as a huge duration; likewise, make SetIceBackoffMax
distinguish between setting the disable sentinel, explicit 0 (meaning use
default cap), and non-zero seconds so it does not rewrite a disable sentinel
into 0 and thereby lose the disabled state across Close/Open cycles.
In `@client/internal/peer/env_test.go`:
- Line 25: Update the test case name string literal "connection_mode unparseable
falls through to legacy" to use the preferred spelling "unparsable" (i.e. change
"unparseable" → "unparsable") in the table entry currently containing
{"connection_mode unparseable falls through to legacy", "garbage", "true", "",
"", connectionmode.ModeRelayForced, 0} so that the test identifier matches
codespell expectations.
In `@client/internal/peer/status.go`:
- Around line 1372-1374: The IceBackoffNextRetry field is being set with
timestamppb.New even when peerState.IceBackoffNextRetry is the zero time, which
produces an invalid past proto timestamp; change the assignment that sets
IceBackoffNextRetry (the struct population using peerState) to guard with
peerState.IceBackoffNextRetry.IsZero() and only call timestamppb.New(...) when
non-zero, otherwise set the proto field to nil so consumers see an unset
timestamp; locate the code that sets IceBackoffNextRetry alongside
IceBackoffFailures/IceBackoffSuspended in status.go and apply this conditional.
In `@client/internal/peer/worker_ice.go`:
- Around line 218-231: The read/write of lastKnownState races because
IsConnected() reads it under muxAgent while onConnectionStateChange writes it
without locking; change lastKnownState to an atomic.Int32 (or int32 accessed via
sync/atomic) and replace writes in onConnectionStateChange with
atomic.StoreInt32(&w.lastKnownState, int32(ice.ConnectionStateX)) and the read
in IsConnected() with atomic.LoadInt32(&w.lastKnownState) (while still using
muxAgent to check w.agent != nil), or alternatively guard the writes in
onConnectionStateChange with muxAgent; update all places that set/read
lastKnownState to use the chosen atomic or lock-based approach to eliminate the
data race.
In `@client/server/server.go`:
- Around line 1518-1538: The comment claiming "all zero/empty when the engine
has not received PeerConfig" is incorrect because
ConnMgr.ServerPushedP2pRetryMaxSecs() substitutes a built-in default, so
spP2pRetMax will be non‑zero; fix this by either updating the comment to reflect
the actual semantics or (preferred) changing the getter behavior: modify
ConnMgr.ServerPushedP2pRetryMaxSecs() (in conn_mgr.go) to return 0 when the
stored peer-config value is unset instead of substituting the default, and keep
default application logic elsewhere (UI or a helper), so the code in server.go
that reads spP2pRetMax from cm.ServerPushedP2pRetryMaxSecs() can reliably use 0
as the "not yet received" sentinel.
In `@client/ui/client_ui.go`:
- Around line 656-666: parseUint32Field currently coerces empty, non-numeric,
and overflowed input to 0 which lets buildSetConfigRequest silently clear
overrides; change parseUint32Field to return (uint32, error) (or (uint32, bool)
if you prefer) and treat empty input as a distinct "no value" case rather than
returning 0, return a parse/overflow error for invalid inputs, and then update
buildSetConfigRequest (and the other caller around lines 741-750) to
propagate/handle the error: reject the request or show validation feedback
instead of sending 0 to the daemon. Ensure you validate numeric range (32-bit)
and non-digit characters in parseUint32Field and adjust all call sites to handle
the new return signature.
In `@management/internals/shared/grpc/conversion.go`:
- Around line 115-119: When parsing Settings.ConnectionMode inside the if
settings.ConnectionMode != nil block use
connectionmode.ParseString(*settings.ConnectionMode) as before but if it returns
an error emit a one-line warning indicating the bad value so admin typos are
visible; update the code around connectionmode.ParseString to log a warn (e.g.,
logger.Warnf or the package logger) with the invalid string and the parse error
when err != nil, otherwise keep the existing behavior of assigning resolvedMode
when m != connectionmode.ModeUnspecified.
- Around line 26-29: The initial peer connection path doesn't translate the
wire-format sentinel (p2pRetryMaxDisabledSentinel) to zero, so Conn.Open()
computes backoffCap from conn.config.P2pRetryMaxSeconds and treats the sentinel
as a huge duration; update Conn.Open() to detect if
conn.config.P2pRetryMaxSeconds equals p2pRetryMaxDisabledSentinel and treat it
as 0 (or alternatively perform the same sentinel→0 conversion earlier in
resolveConnectionMode() / engine.createPeerConnConfig() /
ConnMgr.P2pRetryMax()), then compute backoffCap :=
time.Duration(uint32ValueConvertedToZeroIfSentinel) * time.Second so sentinel
yields duration 0.
In `@management/server/account.go`:
- Around line 372-375: The change handlers currently only audit connection-mode
related updates but do not push new runtime config to live peers; update the
caller to trigger am.updateAccountPeers when any of ConnectionMode,
RelayTimeoutSeconds, P2pTimeoutSeconds, or P2pRetryMaxSeconds change.
Concretely: modify handleConnectionModeSettings (or have the handlers return a
bool) so the outer code that calls am.handleConnectionModeSettings(...) can
detect a change and then call am.updateAccountPeers(ctx, userID, accountID) (or
the existing updateAccountPeers signature) after the settings are saved/audited;
ensure the check compares oldSettings.ConnectionMode,
oldSettings.RelayTimeoutSeconds, oldSettings.P2pTimeoutSeconds, and
oldSettings.P2pRetryMaxSeconds against the newSettings versions and only calls
updateAccountPeers when any differ.
In `@management/server/http/handlers/accounts/accounts_handler.go`:
- Around line 229-240: The three timeout fields (req.Settings.P2pTimeoutSeconds,
P2pRetryMaxSeconds, RelayTimeoutSeconds) are being cast directly to uint32 which
allows negative JSON numbers to wrap to large unsigned values; update the
validation before casting in the handler that populates returnSettings: check
each pointer is non-nil and *value >= 0, and if a negative value is detected
return a validation/error response (or handle it per existing validation flow)
instead of casting; then safely convert to uint32 (e.g., v :=
uint32(*req.Settings.P2pTimeoutSeconds)) and assign to
returnSettings.P2pTimeoutSeconds, P2pRetryMaxSeconds, and RelayTimeoutSeconds
respectively.
In `@shared/management/http/api/openapi.yml`:
- Around line 367-372: The description for the "connection_mode" enum is
outdated and still states that "p2p-dynamic" is reserved/passes through; update
the OpenAPI description for the connection_mode field so it reflects the current
phase behavior (remove the claim that p2p-dynamic is reserved/pass-through) and
clearly document which modes are currently functional (relay-forced, p2p,
p2p-lazy) and how NULL interacts with lazy_connection_enabled for backwards
compatibility; edit the description text near the connection_mode property in
shared/management/http/api/openapi.yml and mention the accurate phase semantics
for p2p-dynamic (omit or mark as not supported rather than pass-through) and
ensure references to lazy_connection_enabled and Phase 1/Phase 2 are consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f6253fba-a00e-49ab-b17b-28539d6a1fa2
⛔ Files ignored due to path filters (4)
- client/proto/daemon.pb.go is excluded by !**/*.pb.go
- client/proto/daemon_grpc.pb.go is excluded by !**/*.pb.go
- shared/management/proto/management.pb.go is excluded by !**/*.pb.go
- shared/management/proto/proxy_service.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (46)
- client/android/client.go
- client/android/preferences.go
- client/cmd/root.go
- client/cmd/service.go
- client/cmd/service_installer.go
- client/cmd/up.go
- client/internal/conn_mgr.go
- client/internal/conn_mgr_test.go
- client/internal/connect.go
- client/internal/engine.go
- client/internal/lazyconn/env.go
- client/internal/lazyconn/inactivity/manager.go
- client/internal/lazyconn/inactivity/manager_test.go
- client/internal/lazyconn/manager/manager.go
- client/internal/peer/conn.go
- client/internal/peer/conn_test.go
- client/internal/peer/env.go
- client/internal/peer/env_test.go
- client/internal/peer/guard/guard.go
- client/internal/peer/handshaker.go
- client/internal/peer/handshaker_test.go
- client/internal/peer/ice_backoff.go
- client/internal/peer/ice_backoff_test.go
- client/internal/peer/status.go
- client/internal/peer/worker_ice.go
- client/internal/profilemanager/config.go
- client/proto/daemon.proto
- client/server/server.go
- client/status/status.go
- client/status/status_test.go
- client/ui/client_ui.go
- client/ui/const.go
- client/ui/event_handler.go
- management/internals/shared/grpc/conversion.go
- management/internals/shared/grpc/conversion_test.go
- management/server/account.go
- management/server/activity/codes.go
- management/server/http/handlers/accounts/accounts_handler.go
- management/server/http/handlers/accounts/accounts_handler_test.go
- management/server/types/settings.go
- management/server/types/settings_test.go
- shared/connectionmode/mode.go
- shared/connectionmode/mode_test.go
- shared/management/http/api/openapi.yml
- shared/management/http/api/types.gen.go
- shared/management/proto/management.proto
💤 Files with no reviewable changes (2)
- client/ui/event_handler.go
- client/ui/const.go
```go
// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := uint32(secs)
	p.configInput.RelayTimeoutSeconds = &v
}
```
Guard negative timeout inputs before casting to uint32.
These setters currently turn negative Android values into huge positive timeouts (-1 becomes 4294967295), which silently corrupts the persisted override instead of clearing/rejecting it. Clamp or reject < 0 before the cast for all three setters.
🛠️ Suggested fix
```diff
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }
```
Also applies to: 368-373, 388-393
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/android/preferences.go` around lines 347 - 352, Validate negative
timeout inputs before casting to uint32: in SetRelayTimeoutSeconds (and the
other two timeout setters around the same area) check if the incoming secs < 0
and, if so, clear the override (set the corresponding
p.configInput.*TimeoutSeconds = nil) instead of casting; otherwise cast to
uint32 and assign as before. Ensure you update all three setters (the
SetRelayTimeoutSeconds function and the two other timeout setter functions
referenced) with the same guard.
```go
if cmd.Flag(connectionModeFlag).Changed {
	req.ConnectionMode = &connectionMode
}
if cmd.Flag(relayTimeoutFlag).Changed {
	req.RelayTimeoutSeconds = &relayTimeoutSecs
}
if cmd.Flag(p2pTimeoutFlag).Changed {
	req.P2PTimeoutSeconds = &p2pTimeoutSecs
}
if cmd.Flag(p2pRetryMaxFlag).Changed {
	req.P2PRetryMaxSeconds = &p2pRetryMaxSecs
}
```
New connection-mode flags are silently dropped in daemon mode.
setupSetConfigReq populates req.ConnectionMode, req.RelayTimeoutSeconds, req.P2PTimeoutSeconds, and req.P2PRetryMaxSeconds, but server.go's SetConfig handler never reads or persists these four fields from the incoming *proto.SetConfigRequest. In daemon mode the client.SetConfig call at runInDaemonMode succeeds with no error, but the values are silently discarded — the user's --connection-mode override never takes effect.
The same applies to setupLoginRequest (lines 699-710): the populated fields are carried over the wire but server.go's Login handler does not persist them either.
The fix is to extract and apply these fields inside server.go's SetConfig:
🐛 Proposed fix in server.go `SetConfig`
```diff
 	if msg.Mtu != nil {
 		mtu := uint16(*msg.Mtu)
 		config.MTU = &mtu
 	}
+
+	if msg.ConnectionMode != nil {
+		config.ConnectionMode = *msg.ConnectionMode
+	}
+	if msg.RelayTimeoutSeconds != nil {
+		config.RelayTimeoutSeconds = *msg.RelayTimeoutSeconds
+	}
+	if msg.P2PTimeoutSeconds != nil {
+		config.P2pTimeoutSeconds = *msg.P2PTimeoutSeconds
+	}
+	if msg.P2PRetryMaxSeconds != nil {
+		config.P2pRetryMaxSeconds = *msg.P2PRetryMaxSeconds
+	}
 	if _, err := profilemanager.UpdateConfig(config); err != nil {
```
This is not high effort: the fix is four simple nil-checks in one location.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/cmd/up.go` around lines 442 - 453, The SetConfig handler in server.go
is ignoring the incoming optional fields (req.ConnectionMode,
req.RelayTimeoutSeconds, req.P2PTimeoutSeconds, req.P2PRetryMaxSeconds) so
daemon-mode client overrides are dropped; add four nil-checks in server.go's
SetConfig to apply each field when non-nil (e.g., if req.ConnectionMode != nil {
serverConfig.ConnectionMode = *req.ConnectionMode }) and persist/update the
server configuration/store accordingly; do the same for Login's handler to read
and persist the same four fields from the incoming request (setupSetConfigReq /
setupLoginRequest populate them) so the client flags are honored.
```go
newMode, newRelay, newP2P, newP2pRetry := resolveConnectionMode(
	e.envMode, e.envRelayTimeout, e.cfgMode, e.cfgRelayTimeout,
	e.cfgP2pTimeout, e.cfgP2pRetryMax, pc,
)

if newMode == e.mode && newRelay == e.relayTimeoutSecs &&
	newP2P == e.p2pTimeoutSecs && newP2pRetry == e.p2pRetryMaxSecs {
	return nil
}
prev := e.mode
e.mode = newMode
e.relayTimeoutSecs = newRelay
e.p2pTimeoutSecs = newP2P
e.p2pRetryMaxSecs = newP2pRetry
e.propagateP2pRetryMaxToConns()

wasManaged := modeUsesLazyMgr(prev)
isManaged := modeUsesLazyMgr(newMode)
modeChanged := prev != newMode

if modeChanged && wasManaged && !isManaged {
	log.Infof("lazy/dynamic connection manager disabled by management push (mode=%s)", newMode)
	e.closeManager(ctx)
	e.statusRecorder.UpdateLazyConnection(false)
	return nil
}

if modeChanged && wasManaged && isManaged {
	// Switching between lazy and dynamic at runtime: tear down the
	// existing manager so initLazyManager picks up the new timeouts.
	log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
	e.closeManager(ctx)
	e.statusRecorder.UpdateLazyConnection(false)
}

if isManaged && e.lazyConnMgr == nil {
	if e.rosenpassEnabled {
		log.Warnf("rosenpass enabled, ignoring lazy/dynamic mode push")
		return nil
	}
	log.Infof("lazy/dynamic connection manager enabled by management push (mode=%s)", newMode)
	e.initLazyManager(ctx)
	e.startModeSideEffects()
	return e.addPeersToLazyConnManager()
```
Restart the lazy manager when timeouts change inside the same managed mode.
relayTimeoutSecs and p2pTimeoutSecs are only consumed in initLazyManager(). Right now a server push from, say, p2p-dynamic(60/30) to p2p-dynamic(10/5) updates the fields on ConnMgr but keeps the existing manager instance alive, so the inactivity thresholds never actually change until a full restart or a mode flip.
Suggested fix
```diff
-	if modeChanged && wasManaged && isManaged {
+	timeoutsChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
+	if wasManaged && isManaged && (modeChanged || timeoutsChanged) {
 		// Switching between lazy and dynamic at runtime: tear down the
 		// existing manager so initLazyManager picks up the new timeouts.
-		log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
+		log.Infof("restarting lazy/dynamic manager to apply updated mode/timeouts (%s -> %s)", prev, newMode)
 		e.closeManager(ctx)
 		e.statusRecorder.UpdateLazyConnection(false)
 	}
```
Capture timeoutsChanged before overwriting the old timeout fields so the comparison is against the previous manager config.
Also applies to: 553-563
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/conn_mgr.go` around lines 277 - 320, The code updates
ConnMgr.relayTimeoutSecs, p2pTimeoutSecs and p2pRetryMaxSecs before deciding
whether to restart the existing lazy manager, so changes to timeouts inside the
same managed mode are ignored; fix by capturing whether the timeout values
actually changed (e.g. timeoutsChanged := newRelay != e.relayTimeoutSecs ||
newP2P != e.p2pTimeoutSecs || newP2pRetry != e.p2pRetryMaxSecs) before
overwriting the fields, then after setting e.mode and the timeout fields, if
isManaged && e.lazyConnMgr != nil && timeoutsChanged call e.closeManager(ctx),
e.statusRecorder.UpdateLazyConnection(false), and then re-init via
e.initLazyManager(ctx) followed by e.startModeSideEffects() and
e.addPeersToLazyConnManager(); keep existing logic for mode flips using
modeUsesLazyMgr, and ensure propagateP2pRetryMaxToConns() is still called after
updating the retry field.
```go
// NewManagerWithTwoTimers is the Phase-2 constructor. Pass 0 for either
// timeout to disable that teardown path. Both 0 leaves the manager
// running but inert (no channel ever fires) -- used by p2p / relay-forced
// modes that don't tear down workers.
func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
	if iceTimeout > 0 {
		log.Infof("ICE inactivity timeout: %v", iceTimeout)
	}
	if relayTimeout > 0 {
		log.Infof("relay inactivity timeout: %v", relayTimeout)
	}
	return newManager(iface, iceTimeout, relayTimeout)
```
Validate two-timer thresholds the same way the legacy constructor does.
NewManagerWithTwoTimers bypasses MinimumInactivityThreshold, so negative or sub-minute values are accepted on the new path. With a 1-minute polling loop, that makes small values misleading and negative values effectively mark every peer idle on the first tick.
🛠️ Suggested fix
func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
+ validateOptional := func(name string, v time.Duration) time.Duration {
+ if v == 0 {
+ return 0
+ }
+ if v < MinimumInactivityThreshold {
+ log.Warnf("%s inactivity timeout %v is too low, using minimum %v", name, v, MinimumInactivityThreshold)
+ return MinimumInactivityThreshold
+ }
+ return v
+ }
+
+ iceTimeout = validateOptional("ICE", iceTimeout)
+ relayTimeout = validateOptional("relay", relayTimeout)
+
if iceTimeout > 0 {
log.Infof("ICE inactivity timeout: %v", iceTimeout)
}
if relayTimeout > 0 {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/lazyconn/inactivity/manager.go` around lines 76 - 87,
NewManagerWithTwoTimers currently accepts negative or sub-minute durations;
validate and clamp iceTimeout and relayTimeout the same as the legacy
constructor by enforcing MinimumInactivityThreshold (and preventing negatives)
before calling newManager. In NewManagerWithTwoTimers, if iceTimeout or
relayTimeout is less than MinimumInactivityThreshold (or negative), set them to
MinimumInactivityThreshold, log the adjusted values for visibility (e.g.,
"adjusted ICE/relay inactivity timeout"), and then pass the sanitized timeouts
into newManager so both constructors behave identically.
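Since the suggested diff above is cut off mid-function, here is a complete sketch of the validated constructor; MinimumInactivityThreshold, newManager, WgInterface, and Manager are the names the review itself uses, while the validateOptional helper is illustrative:

func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
	// 0 keeps this constructor's documented "disabled" semantics; anything
	// else is clamped up to the legacy minimum so the 1-minute polling loop
	// stays meaningful and negatives cannot mark every peer idle.
	validateOptional := func(name string, v time.Duration) time.Duration {
		if v == 0 {
			return 0
		}
		if v < MinimumInactivityThreshold {
			log.Warnf("%s inactivity timeout %v is too low, using minimum %v", name, v, MinimumInactivityThreshold)
			return MinimumInactivityThreshold
		}
		return v
	}

	iceTimeout = validateOptional("ICE", iceTimeout)
	relayTimeout = validateOptional("relay", relayTimeout)

	if iceTimeout > 0 {
		log.Infof("ICE inactivity timeout: %v", iceTimeout)
	}
	if relayTimeout > 0 {
		log.Infof("relay inactivity timeout: %v", relayTimeout)
	}
	return newManager(iface, iceTimeout, relayTimeout)
}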
// p2pRetryMaxDisabledSentinel is the wire-format value that signals
// "user-explicit disable backoff" (uint32-max). The 0 wire-value is
// reserved for "not set, use daemon default". Phase 3 of #5989.
const p2pRetryMaxDisabledSentinel = ^uint32(0)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Locate where the wire P2PRetryMaxSeconds is consumed on the daemon side and
# check that both 0 and ^uint32(0) are handled.
rg -nP --type=go -C5 'P2PRetryMaxSeconds|P2pRetryMaxSeconds'
echo "---"
# Look for the sentinel constant or any uint32-max comparison in client code.
rg -nP --type=go -C3 '\^uint32\(0\)|math\.MaxUint32|0xffffffff'
Repository: netbirdio/netbird
Length of output: 50373
🏁 Script executed:
# Search for code that interprets the sentinel or handles uint32-max specially
rg -nP --type=go 'SetMaxBackoff|SetIceBackoffMax' -A 5 -B 2
echo "---"
# Search in engine or relevant client files for sentinel detection
rg -nP --type=go -C 3 'uint32.*max|math\.MaxUint32|\^uint32' client/internal/
echo "---"
# Look for where P2pRetryMaxSeconds is converted from wire format to time.Duration
rg -nP --type=go 'time\.Duration.*P2pRetryMax|P2pRetryMax.*time\.Duration' -A 2 -B 2
Repository: netbirdio/netbird
Length of output: 7903
🏁 Script executed:
sed -n '577,594p' client/internal/conn_mgr.go
echo "---"
# Also check engine.go for sentinel handling
rg -nP --type=go 'sentinelDisabled|uint32.*0\)|disable.*backoff' client/internal/engine.go -A 3 -B 3
Repository: netbirdio/netbird
Length of output: 543
🏁 Script executed:
# Find where P2pRetryMaxSeconds is initially set in engine config
rg -nP --type=go 'P2pRetryMaxSeconds:' client/internal/engine.go -B 5 -A 2
echo "---"
# Check resolveConnectionMode for sentinel handling
sed -n '109,170p' client/internal/conn_mgr.go
echo "---"
# Look for any conversion before the value reaches Conn.config
rg -nP --type=go 'connConfig.*P2p|PeerConnConfig.*P2p' client/internal/engine.go -B 2 -A 2
Repository: netbirdio/netbird
Length of output: 2418
Verify where the wire-format sentinel (^uint32(0)) is converted to disable-backoff (duration 0) in the initial peer-connection path.
The server-side conversion (nil→0, &0→sentinel, &N→N) is correct, and propagateP2pRetryMaxToConns() properly handles sentinel→0 for subsequent server updates. However, the initial Conn.Open() path at peer/conn.go:204 lacks sentinel detection: backoffCap := time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second will interpret sentinel as ~136 years instead of zero. The comment at peer/conn.go:98–99 claims "the resolver translates to time.Duration(0) at engine.go" but no such conversion occurs in resolveConnectionMode() or engine.createPeerConnConfig().
Either add sentinel-to-zero conversion before backoffCap is computed in Conn.Open(), or ensure the sentinel is translated in the initial config-resolution chain (e.g., in resolveConnectionMode() or ConnMgr.P2pRetryMax()).
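A hedged sketch of the missing branch in Conn.Open(); the constant name mirrors the client-side sentinel, and DefaultP2PRetryMax follows the review text (both are assumptions about the tree):

const p2pRetryMaxDisabledSentinel = ^uint32(0) // assumed mirrored on the client side

var backoffCap time.Duration
switch secs := conn.config.P2pRetryMaxSeconds; secs {
case 0:
	backoffCap = DefaultP2PRetryMax // wire 0: "not set, use daemon default"
case p2pRetryMaxDisabledSentinel:
	backoffCap = 0 // user explicitly disabled backoff
default:
	backoffCap = time.Duration(secs) * time.Second
}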
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/internals/shared/grpc/conversion.go` around lines 26 - 29, The
initial peer connection path doesn't translate the wire-format sentinel
(p2pRetryMaxDisabledSentinel) to zero, so Conn.Open() computes backoffCap from
conn.config.P2pRetryMaxSeconds and treats the sentinel as a huge duration;
update Conn.Open() to detect if conn.config.P2pRetryMaxSeconds equals
p2pRetryMaxDisabledSentinel and treat it as 0 (or alternatively perform the same
sentinel→0 conversion earlier in resolveConnectionMode() /
engine.createPeerConnConfig() / ConnMgr.P2pRetryMax()), then compute backoffCap
:= time.Duration(uint32ValueConvertedToZeroIfSentinel) * time.Second so sentinel
yields duration 0.
if settings.ConnectionMode != nil {
	if m, err := connectionmode.ParseString(*settings.ConnectionMode); err == nil && m != connectionmode.ModeUnspecified {
		resolvedMode = m
	}
}
Log when Settings.ConnectionMode fails to parse so admin typos are surfaced.
Right now an invalid value (say "p2p-typoed") is silently dropped — the function falls back to the legacy bool with no signal at all. A misconfigured admin will see "old" behavior with no clue why. A one-line warn log on the parse error is cheap and matches what the env-resolver in client/internal/peer/env.go already does for EnvKeyNBConnectionMode.
📝 Suggested change
resolvedMode := connectionmode.ResolveLegacyLazyBool(settings.LazyConnectionEnabled)
if settings.ConnectionMode != nil {
- if m, err := connectionmode.ParseString(*settings.ConnectionMode); err == nil && m != connectionmode.ModeUnspecified {
+ m, err := connectionmode.ParseString(*settings.ConnectionMode)
+ if err != nil {
+ log.Warnf("ignoring unknown Settings.ConnectionMode %q: %v", *settings.ConnectionMode, err)
+ } else if m != connectionmode.ModeUnspecified {
resolvedMode = m
}
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if settings.ConnectionMode != nil {
	m, err := connectionmode.ParseString(*settings.ConnectionMode)
	if err != nil {
		log.Warnf("ignoring unknown Settings.ConnectionMode %q: %v", *settings.ConnectionMode, err)
	} else if m != connectionmode.ModeUnspecified {
		resolvedMode = m
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/internals/shared/grpc/conversion.go` around lines 115 - 119, When
parsing Settings.ConnectionMode inside the if settings.ConnectionMode != nil
block use connectionmode.ParseString(*settings.ConnectionMode) as before but if
it returns an error emit a one-line warning indicating the bad value so admin
typos are visible; update the code around connectionmode.ParseString to log a
warn (e.g., logger.Warnf or the package logger) with the invalid string and the
parse error when err != nil, otherwise keep the existing behavior of assigning
resolvedMode when m != connectionmode.ModeUnspecified.
am.handleRoutingPeerDNSResolutionSettings(ctx, oldSettings, newSettings, userID, accountID)
am.handleLazyConnectionSettings(ctx, oldSettings, newSettings, userID, accountID)
am.handleConnectionModeSettings(ctx, oldSettings, newSettings, userID, accountID)
am.handlePeerLoginExpirationSettings(ctx, oldSettings, newSettings, userID, accountID)
Propagate connection-mode setting changes to peers, not just audit them.
This new audit hook lands, but updateAccountPeers is never flipped when ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds, or P2pRetryMaxSeconds change. As written, the account setting is saved and audited, but already-connected clients won't be pushed the new runtime config until some unrelated peer update happens.
🛠️ Suggested fix
if oldSettings.RoutingPeerDNSResolutionEnabled != newSettings.RoutingPeerDNSResolutionEnabled ||
oldSettings.LazyConnectionEnabled != newSettings.LazyConnectionEnabled ||
+ !equalStringPtr(oldSettings.ConnectionMode, newSettings.ConnectionMode) ||
+ !equalUint32Ptr(oldSettings.RelayTimeoutSeconds, newSettings.RelayTimeoutSeconds) ||
+ !equalUint32Ptr(oldSettings.P2pTimeoutSeconds, newSettings.P2pTimeoutSeconds) ||
+ !equalUint32Ptr(oldSettings.P2pRetryMaxSeconds, newSettings.P2pRetryMaxSeconds) ||
oldSettings.DNSDomain != newSettings.DNSDomain ||
oldSettings.AutoUpdateVersion != newSettings.AutoUpdateVersion ||
oldSettings.AutoUpdateAlways != newSettings.AutoUpdateAlways {
updateAccountPeers = true
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/account.go` around lines 372 - 375, The change handlers
currently only audit connection-mode related updates but do not push new runtime
config to live peers; update the caller to trigger am.updateAccountPeers when
any of ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds, or
P2pRetryMaxSeconds change. Concretely: modify handleConnectionModeSettings (or
have the handlers return a bool) so the outer code that calls
am.handleConnectionModeSettings(...) can detect a change and then call
am.updateAccountPeers(ctx, userID, accountID) (or the existing
updateAccountPeers signature) after the settings are saved/audited; ensure the
check compares oldSettings.ConnectionMode, oldSettings.RelayTimeoutSeconds,
oldSettings.P2pTimeoutSeconds, and oldSettings.P2pRetryMaxSeconds against the
newSettings versions and only calls updateAccountPeers when any differ.
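The suggested fix above assumes pointer-equality helpers that may not exist in account.go yet; a minimal sketch (the names equalStringPtr / equalUint32Ptr come from the diff, the bodies are illustrative):

func equalStringPtr(a, b *string) bool {
	if a == nil || b == nil {
		return a == b // equal only when both are nil
	}
	return *a == *b
}

func equalUint32Ptr(a, b *uint32) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}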
if req.Settings.P2pTimeoutSeconds != nil {
	v := uint32(*req.Settings.P2pTimeoutSeconds)
	returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
	v := uint32(*req.Settings.P2pRetryMaxSeconds)
	returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
	v := uint32(*req.Settings.RelayTimeoutSeconds)
	returnSettings.RelayTimeoutSeconds = &v
}
Reject negative timeout values before converting them to uint32.
These casts accept negative JSON values and wrap them into huge unsigned numbers, so a request like -1 becomes 4294967295 seconds instead of a validation error. Please validate all three timeout fields as >= 0 before the cast.
🛠️ Suggested fix
if req.Settings.P2pTimeoutSeconds != nil {
+ if *req.Settings.P2pTimeoutSeconds < 0 {
+ return nil, status.Errorf(status.InvalidArgument, "p2p_timeout_seconds must be >= 0")
+ }
v := uint32(*req.Settings.P2pTimeoutSeconds)
returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
+ if *req.Settings.P2pRetryMaxSeconds < 0 {
+ return nil, status.Errorf(status.InvalidArgument, "p2p_retry_max_seconds must be >= 0")
+ }
v := uint32(*req.Settings.P2pRetryMaxSeconds)
returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
+ if *req.Settings.RelayTimeoutSeconds < 0 {
+ return nil, status.Errorf(status.InvalidArgument, "relay_timeout_seconds must be >= 0")
+ }
v := uint32(*req.Settings.RelayTimeoutSeconds)
returnSettings.RelayTimeoutSeconds = &v
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if req.Settings.P2pTimeoutSeconds != nil {
	if *req.Settings.P2pTimeoutSeconds < 0 {
		return nil, status.Errorf(status.InvalidArgument, "p2p_timeout_seconds must be >= 0")
	}
	v := uint32(*req.Settings.P2pTimeoutSeconds)
	returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
	if *req.Settings.P2pRetryMaxSeconds < 0 {
		return nil, status.Errorf(status.InvalidArgument, "p2p_retry_max_seconds must be >= 0")
	}
	v := uint32(*req.Settings.P2pRetryMaxSeconds)
	returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
	if *req.Settings.RelayTimeoutSeconds < 0 {
		return nil, status.Errorf(status.InvalidArgument, "relay_timeout_seconds must be >= 0")
	}
	v := uint32(*req.Settings.RelayTimeoutSeconds)
	returnSettings.RelayTimeoutSeconds = &v
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
229 - 240, The three timeout fields (req.Settings.P2pTimeoutSeconds,
P2pRetryMaxSeconds, RelayTimeoutSeconds) are being cast directly to uint32 which
allows negative JSON numbers to wrap to large unsigned values; update the
validation before casting in the handler that populates returnSettings: check
each pointer is non-nil and *value >= 0, and if a negative value is detected
return a validation/error response (or handle it per existing validation flow)
instead of casting; then safely convert to uint32 (e.g., v :=
uint32(*req.Settings.P2pTimeoutSeconds)) and assign to
returnSettings.P2pTimeoutSeconds, P2pRetryMaxSeconds, and RelayTimeoutSeconds
respectively.
description: |
  Account-wide default peer-connection mode. NULL means
  "fall back to lazy_connection_enabled" for backwards compatibility.
  Phase 1 of issue #5989: relay-forced, p2p, and p2p-lazy are
  functional. p2p-dynamic is reserved (passes through as p2p in
  Phase 1; will become functional in Phase 2).
Update stale connection_mode phase semantics in the description.
This text says p2p-dynamic is reserved/pass-through, which conflicts with the current phase scope and can mislead API consumers.
✏️ Suggested doc update
connection_mode:
x-experimental: true
type: string
enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
nullable: true
description: |
Account-wide default peer-connection mode. NULL means
"fall back to lazy_connection_enabled" for backwards compatibility.
- Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
- functional. p2p-dynamic is reserved (passes through as p2p in
- Phase 1; will become functional in Phase 2).
+ relay-forced, p2p, p2p-lazy, and p2p-dynamic are supported.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@shared/management/http/api/openapi.yml` around lines 367 - 372, The
description for the "connection_mode" enum is outdated and still states that
"p2p-dynamic" is reserved/passes through; update the OpenAPI description for the
connection_mode field so it reflects the current phase behavior (remove the
claim that p2p-dynamic is reserved/pass-through) and clearly document which
modes are currently functional (relay-forced, p2p, p2p-lazy) and how NULL
interacts with lazy_connection_enabled for backwards compatibility; edit the
description text near the connection_mode property in
shared/management/http/api/openapi.yml and mention the accurate phase semantics
for p2p-dynamic (omit or mark as not supported rather than pass-through) and
ensure references to lazy_connection_enabled and Phase 1/Phase 2 are consistent.
…uest test setconfig_test.go reflection-tests (TestSetConfig_AllFieldsSaved + TestCLIFlags_MappedToSetConfig) tolerate the four new fields: ConnectionMode, P2PTimeoutSeconds, RelayTimeoutSeconds, P2PRetryMaxSeconds. They are in the proto so daemons can advertise them via GetConfig, but SetConfig RPC does not apply them at runtime today — the CLI sets them via 'netbird service install/reconfigure' writing the active profile file directly. Listed in the test maps so the structural drift-detector passes; wiring through SetConfig is a follow-up.
The pb.go regeneration on dev machines uses local protoc (v5.29.3 here) which causes the Proto Version Check workflow to flag the diff as a toolchain mismatch. Restore the upstream-main version strings (v6.33.1 for client/proto, v7.34.1 for shared/management/proto) — generated content matches; only the comment-line differs.
2b6f78b to 6bf6777
Actionable comments posted: 3
♻️ Duplicate comments (3)
client/internal/peer/worker_ice.go (1)
218-231: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
lastKnownState is still accessed without consistent synchronization — data race.
IsConnected() reads lastKnownState while holding muxAgent, but the writes inside onConnectionStateChange at lines 550 and 562 happen on the pion callback goroutine without acquiring muxAgent. This is a Go data race, and the new Conn.onNetworkChange path (in conn.go at line 1244) now reads through IsConnected() to decide whether to close the agent — a torn read can either skip Close() on a freshly-disconnected agent or close a still-Connected one, exactly the scenario the comment in onNetworkChange says to avoid.
Switch lastKnownState to atomic.Int32 (or similar) so reads and writes are consistent regardless of muxAgent:

-	// we record the last known state of the ICE agent to avoid duplicate on disconnected events
-	lastKnownState ice.ConnectionState
+	// Accessed from pion's state-change callback and from IsConnected()
+	// concurrently — keep atomic so reads on the network-change path are
+	// not torn against unlocked writes in onConnectionStateChange.
+	lastKnownState atomic.Int32 // ice.ConnectionState

-		lastKnownState: ice.ConnectionStateDisconnected,
+	}
+	w.lastKnownState.Store(int32(ice.ConnectionStateDisconnected))

 func (w *WorkerICE) IsConnected() bool {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()
-	return w.agent != nil && w.lastKnownState == ice.ConnectionStateConnected
+	return w.agent != nil && ice.ConnectionState(w.lastKnownState.Load()) == ice.ConnectionStateConnected
 }

 case ice.ConnectionStateConnected:
-	w.lastKnownState = ice.ConnectionStateConnected
+	w.lastKnownState.Store(int32(ice.ConnectionStateConnected))
 	w.logSuccessfulPaths(agent)
@@
-	if w.lastKnownState == ice.ConnectionStateConnected {
-		w.lastKnownState = ice.ConnectionStateDisconnected
+	if ice.ConnectionState(w.lastKnownState.Load()) == ice.ConnectionStateConnected {
+		w.lastKnownState.Store(int32(ice.ConnectionStateDisconnected))
 		w.conn.onICEStateDisconnected(sessionChanged)
 	}

Also applies to: 549-563
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/worker_ice.go` around lines 218 - 231, IsConnected reads lastKnownState under muxAgent but onConnectionStateChange writes it without locking, causing a data race; change the WorkerICE.lastKnownState from a plain int to an atomic.Int32 (or similar), update IsConnected() to use lastKnownState.Load() (while keeping muxAgent for agent nil-check), and update onConnectionStateChange to use lastKnownState.Store(newState) instead of assigning directly; ensure all other accesses (e.g., in the code paths around lines referenced) use atomic Load/Store and remove any unsynchronized direct reads/writes to lastKnownState while leaving agent access protected by muxAgent as before.
client/internal/peer/conn.go (1)
202-212: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Disable-backoff state is still not preserved across Open/Close cycles.
The ConnConfig comment at lines 98-100 says ^uint32(0) is the disable sentinel and the resolver translates it to 0 before reaching here, while the Open() block treats 0 as "use built-in default" (DefaultP2PRetryMax). These two semantics conflict: a peer that the resolver wanted disabled ends up with the default 15 m backoff. Likewise, SetIceBackoffMax(0) writes 0 into config.P2pRetryMaxSeconds, and a subsequent Close()/Open() cycle silently re-promotes the peer from "disabled" to the default schedule.
Use a distinct in-process representation for "disabled" so the three states (use default, disabled, explicit value) can each round-trip. For example, treat the wire ^uint32(0) as the in-process "disabled" sentinel rather than collapsing it to 0 at the resolver, so Open() can branch on it directly.

+const p2pRetryMaxDisabled = ^uint32(0)
+
-	backoffCap := time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second
-	if backoffCap == 0 {
-		backoffCap = DefaultP2PRetryMax
-	}
+	var backoffCap time.Duration
+	switch conn.config.P2pRetryMaxSeconds {
+	case 0:
+		backoffCap = DefaultP2PRetryMax
+	case p2pRetryMaxDisabled:
+		backoffCap = 0
+	default:
+		backoffCap = time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second
+	}
@@
 func (conn *Conn) SetIceBackoffMax(d time.Duration) {
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
-	conn.config.P2pRetryMaxSeconds = uint32(d / time.Second)
+	if d == 0 {
+		conn.config.P2pRetryMaxSeconds = p2pRetryMaxDisabled
+	} else {
+		conn.config.P2pRetryMaxSeconds = uint32(d / time.Second)
+	}
 	if conn.iceBackoff != nil {
 		conn.iceBackoff.SetMaxBackoff(d)
 	}
 }

The companion change in the resolver/engine.go must also stop translating ^uint32(0) → 0 — pass the sentinel through verbatim.
Also applies to: 1152-1163
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn.go` around lines 202 - 212, The code collapses the wire sentinel for "disabled" into 0 which conflicts with Open() treating 0 as "use default", causing disabled peers to be re-enabled; fix by introducing a distinct in-process sentinel for "disabled" (e.g. a named constant like DisabledP2pRetrySentinel) and preserve the resolver wire sentinel instead of translating it to 0 in resolver/engine.go, then update Open()'s logic around conn.config.P2pRetryMaxSeconds to: if it equals the disabled sentinel, configure conn.iceBackoff as disabled (or leave nil) so backoff is turned off; else if it equals 0 treat it as "use default" and set backoffCap = DefaultP2PRetryMax, otherwise use the explicit value to newIceBackoff / SetMaxBackoff; also ensure SetIceBackoffMax writes the same in-process sentinel when disabling so Close()/Open() round-trips.
client/android/preferences.go (1)
349-352: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Negative secs is silently aliased to a huge uint32; for SetP2pRetryMaxSeconds it collides with the disable sentinel.
A SetRelayTimeoutSeconds(-1) becomes 4294967295 seconds (~136 years), and the same bad cast on SetP2pRetryMaxSeconds(-1) yields exactly ^uint32(0), which is the documented wire-format sentinel for "user-explicit disable backoff" (see client/internal/peer/conn.go lines 98-100). Negative input therefore silently flips backoff into the disabled state without any validation. Clamp or reject < 0 in all three setters.

 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 370-373, 390-393
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/android/preferences.go` around lines 349 - 352, SetRelayTimeoutSeconds (and the two other setters that cast int64 to uint32, e.g., SetP2pRetryMaxSeconds and the comparable setter) currently cast negative secs into a huge uint32 (which can collide with the disable sentinel); update each setter to validate secs before casting: if secs < 0 then clamp to 0 (or return an error if you prefer API change) and only then assign via uint32(secs) into the pointer field so negative inputs cannot become ^uint32(0) or other unintended values.
🧹 Nitpick comments (6)
client/internal/peer/guard/guard.go (1)
57-62: ⚡ Quick win
SetOnNetworkChange after Start() is an undetected data race
The comment says "Set once before Start() is called; no lock needed." This is safe only if the goroutine spawning Start() happens strictly after SetOnNetworkChange returns, establishing a happens-before edge. However, nothing in the API enforces this — a caller that invokes SetOnNetworkChange after spawning go g.Start(...) introduces a data race on g.onNetworkChange with no diagnostic. Consider using sync/atomic to store the function pointer, or document this precondition with a //nolint:gocritic + go vet note so it surfaces in race-detector runs.
As a lightweight alternative, a sync/atomic.Pointer[func()] store/load pair makes the intent explicit and race-detector-clean with minimal overhead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/guard/guard.go` around lines 57 - 62, SetOnNetworkChange currently writes g.onNetworkChange without synchronization which can race with concurrent Start() readers; change the Guard implementation to store the callback via a sync/atomic pointer type (e.g. sync/atomic.Pointer[func()]) instead of the plain field and update all reads (e.g. in Start and any other goroutines that invoke the callback) to load via atomic.Load/Pointer (or the generic Atomic.Pointer API) before invoking; ensure SetOnNetworkChange uses atomic.Store/Pointer to set the callback and nil-checks use the atomic load so access is race-detector-clean and retains the same semantics.
client/internal/peer/ice_backoff.go (2)
82-92: 💤 Low value
Optional: clear suspended flag on lazy expiry to keep state self-consistent.
IsSuspended() returns false once nextRetry is in the past, but it leaves s.suspended = true. Any future caller that inspects the raw flag (or that adds non-time-based logic) will see a stale "suspended" state. The condition is duplicated in Snapshot() (line 131) and the two helpers can drift. Consider clearing the flag once it has lapsed — pure read API can become a write (still safe under s.mu), or simply expose only the time-based predicate.

 func (s *iceBackoffState) IsSuspended() bool {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	if !s.suspended {
 		return false
 	}
-	if time.Now().After(s.nextRetry) {
-		return false
-	}
-	return true
+	if time.Now().After(s.nextRetry) {
+		s.suspended = false
+		return false
+	}
+	return true
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/ice_backoff.go` around lines 82 - 92, IsSuspended() currently returns false when nextRetry has passed but leaves s.suspended true, causing inconsistent state with Snapshot(); update IsSuspended() (in iceBackoffState) to, under s.mu lock, check if s.suspended && time.Now().After(s.nextRetry) and if so clear s.suspended (and optionally reset nextRetry) before returning false so the in-memory flags remain self-consistent; ensure you modify only inside the locked section to avoid races and keep Snapshot() behavior aligned.
161-169: 💤 Low value
Confirm: SetMaxBackoff resets the in-flight exponential schedule.
The doc comment states the failure counter is preserved, which it is, but buildBackoff calls bo.Reset() so the schedule restarts at InitialInterval. A peer whose suspension interval has climbed to e.g. ~8 m will, after a server-pushed SetMaxBackoff to a different cap, drop back to ~1 m on the next failure. This is most likely intentional (new policy from server ⇒ fresh schedule), but the comment "preserves the failure counter" can be misread as "preserves the schedule"; consider clarifying.

-// SetMaxBackoff updates the cap. Called from ConnMgr.UpdatedRemotePeerConfig
-// when the server pushes a new value. Rebuilds the internal backoff with
-// the new schedule but preserves the failure counter.
+// SetMaxBackoff updates the cap. Called from ConnMgr.UpdatedRemotePeerConfig
+// when the server pushes a new value. Rebuilds the internal exponential
+// schedule from scratch (so the next markFailure starts at InitialInterval)
+// but preserves the cumulative failure counter for status reporting.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/ice_backoff.go` around lines 161 - 169, Update the comment for SetMaxBackoff on iceBackoffState to clarify that although the failure counter is preserved, the exponential backoff schedule is reset because SetMaxBackoff calls buildBackoff(d) which invokes bo.Reset(); explicitly state that the schedule (current interval) restarts at InitialInterval while the failure count remains unchanged so callers won't be misled into thinking the in-flight schedule is preserved.
client/internal/peer/conn.go (2)
1030-1056: 💤 Low value
Minor: redundant IsSuspended() check.
AttachICE checks iceBackoff.IsSuspended() (line 1034) and then calls attachICEListenerLocked, which performs the same check (line 1070). The first check returns early without an error; the second returns false (no-op). The behaviour is identical, so the first block is just an extra log path that fires before the listener attach. If the dual logging is intentional (different message), keep it; otherwise drop the outer check and rely on attachICEListenerLocked returning false to skip the SendOffer.

 func (conn *Conn) AttachICE() error {
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
-	if conn.iceBackoff != nil && conn.iceBackoff.IsSuspended() {
-		snap := conn.iceBackoff.Snapshot()
-		conn.Log.Debugf("ICE backoff active (failure #%d, retry at %s), staying on relay",
-			snap.Failures,
-			snap.NextRetry.Format("15:04:05"))
-		return nil
-	}
 	if conn.handshaker == nil {
 		return fmt.Errorf("AttachICE: handshaker not initialized (Open not called)")
 	}
 	if conn.workerICE == nil {
 		return fmt.Errorf("AttachICE: workerICE is nil (relay-forced mode)")
 	}
 	if !conn.attachICEListenerLocked() {
 		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn.go` around lines 1030 - 1056, Remove the redundant iceBackoff.IsSuspended() check and its logging from Conn.AttachICE and rely on attachICEListenerLocked to handle suspended backoff (so AttachICE no longer early-returns for suspension); specifically, delete the block that checks conn.iceBackoff != nil && conn.iceBackoff.IsSuspended() and its conn.Log.Debugf/return, leaving the rest of AttachICE (including the handshaker nil checks and the call to attachICEListenerLocked and SendOffer) intact so behavior is unchanged except for avoiding duplicate suspension logging.
19-20: 💤 Low value
Import group ordering.
shared/connectionmode is wedged between client/internal/metrics and client/internal/peer/conntype, breaking the client/... group's alphabetical ordering. goimports would put shared/... after the client/... group. Minor, but easy to fix.

 	"github.com/netbirdio/netbird/client/internal/metrics"
-	"github.com/netbirdio/netbird/shared/connectionmode"
 	"github.com/netbirdio/netbird/client/internal/peer/conntype"
 	"github.com/netbirdio/netbird/client/internal/peer/dispatcher"
 	"github.com/netbirdio/netbird/client/internal/peer/guard"
 	icemaker "github.com/netbirdio/netbird/client/internal/peer/ice"
 	"github.com/netbirdio/netbird/client/internal/peer/id"
 	"github.com/netbirdio/netbird/client/internal/peer/worker"
 	"github.com/netbirdio/netbird/client/internal/portforward"
 	"github.com/netbirdio/netbird/client/internal/stdnet"
 	"github.com/netbirdio/netbird/route"
 	relayClient "github.com/netbirdio/netbird/shared/relay/client"
+	"github.com/netbirdio/netbird/shared/connectionmode"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn.go` around lines 19 - 20, The import block in client/internal/peer/conn.go has shared/connectionmode placed between client/internal packages, breaking the client/... group's alphabetical grouping; reorder the imports so all client/internal/* imports (e.g., client/internal/metrics and client/internal/peer/conntype) are contiguous and alphabetically ordered, and move github.com/netbirdio/netbird/shared/connectionmode after the client/... group (matching goimports grouping).
client/internal/peer/env_test.go (1)
33-34: 💤 Low value
Use constants instead of hardcoded environment variable names.
Lines 31–32 already reference EnvKeyNBConnectionMode and EnvKeyNBForceRelay. Lines 33–34 hardcode "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and "NB_LAZY_CONN_INACTIVITY_THRESHOLD" as string literals.
The peer package defines these as private constants envEnableLazyConn and envInactivityThreshold (accessible within the same package), or import EnvEnableLazyConn and EnvInactivityThreshold from client/internal/lazyconn. Using constants ensures any rename is caught at compile time.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/env_test.go` around lines 33 - 34, Replace the hardcoded environment variable names in the test t.Setenv calls with the package constants: use envEnableLazyConn instead of "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and envInactivityThreshold instead of "NB_LAZY_CONN_INACTIVITY_THRESHOLD" (these constants are defined in the peer package); update the two t.Setenv calls in env_test.go to reference those constants so renames are caught at compile time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@client/android/preferences.go`:
- Around line 325-352: The setters store pointers to zero values instead of
clearing overrides, so change SetConnectionMode, SetRelayTimeoutSeconds (and
likewise SetP2pTimeoutSeconds and SetP2pRetryMaxSeconds) to set the
corresponding pointer to nil when the sentinel is passed (empty string for
SetConnectionMode, 0 for the timeout/setter methods); otherwise allocate and
assign a pointer to the non-zero value. Update p.configInput.ConnectionMode and
p.configInput.RelayTimeoutSeconds (and the P2P fields) accordingly so apply()'s
nil check correctly reverts to server-pushed values.
In `@client/internal/peer/guard/guard.go`:
- Around line 144-147: The select loop currently calls g.onNetworkChange()
synchronously which blocks responsiveness to ctx.Done() and other channels;
change it to dispatch the callback asynchronously: when g.onNetworkChange != nil
spawn a background goroutine to call g.onNetworkChange() from the guard's select
branch (e.g., in the guard loop where srReconnectedChan is handled) so the loop
can continue processing ctx.Done and other events. To avoid unbounded concurrent
invocations, add a simple coalescing mechanism (a boolean flag or non-blocking
send to a buffered channel) adjacent to the async dispatch to ensure rapid
reconnect events don’t spawn redundant goroutines unless the previous handler
has completed; keep the flag/chan state local to the guard instance that owns
g.onNetworkChange.
In `@client/proto/daemon.proto`:
- Around line 208-218: The daemon handlers in client/server/server.go currently
ignore the new LoginRequest.connection_mode, p2p_timeout_seconds,
relay_timeout_seconds, and p2p_retry_max_seconds (and the same fields on
SetConfigRequest), so RPC callers can send them but the daemon neither applies
nor persists them; update the request handling code paths that process
LoginRequest and SetConfigRequest in server.go to read these fields,
validate/parse connection_mode (using connectionmode.ParseString where
appropriate), apply them to the daemon's runtime config/state, and persist them
to the same storage path used by existing config fields so the GUI save flow and
subsequent calls reflect the overrides. Ensure the same wiring is added for any
helper methods that currently consume LoginRequest or SetConfigRequest (so both
request handling and persistence are consistent).
---
Duplicate comments:
In `@client/android/preferences.go`:
- Around line 349-352: SetRelayTimeoutSeconds (and the two other setters that
cast int64 to uint32, e.g., SetP2pRetryMaxSeconds and the comparable setter)
currently cast negative secs into a huge uint32 (which can collide with the
disable sentinel); update each setter to validate secs before casting: if secs <
0 then clamp to 0 (or return an error if you prefer API change) and only then
assign via uint32(secs) into the pointer field so negative inputs cannot become
^uint32(0) or other unintended values.
In `@client/internal/peer/conn.go`:
- Around line 202-212: The code collapses the wire sentinel for "disabled" into
0 which conflicts with Open() treating 0 as "use default", causing disabled
peers to be re-enabled; fix by introducing a distinct in-process sentinel for
"disabled" (e.g. a named constant like DisabledP2pRetrySentinel) and preserve
the resolver wire sentinel instead of translating it to 0 in resolver/engine.go,
then update Open()'s logic around conn.config.P2pRetryMaxSeconds to: if it
equals the disabled sentinel, configure conn.iceBackoff as disabled (or leave
nil) so backoff is turned off; else if it equals 0 treat it as "use default" and
set backoffCap = DefaultP2PRetryMax, otherwise use the explicit value to
newIceBackoff / SetMaxBackoff; also ensure SetIceBackoffMax writes the same
in-process sentinel when disabling so Close()/Open() round-trips.
In `@client/internal/peer/worker_ice.go`:
- Around line 218-231: IsConnected reads lastKnownState under muxAgent but
onConnectionStateChange writes it without locking, causing a data race; change
the WorkerICE.lastKnownState from a plain int to an atomic.Int32 (or similar),
update IsConnected() to use lastKnownState.Load() (while keeping muxAgent for
agent nil-check), and update onConnectionStateChange to use
lastKnownState.Store(newState) instead of assigning directly; ensure all other
accesses (e.g., in the code paths around lines referenced) use atomic Load/Store
and remove any unsynchronized direct reads/writes to lastKnownState while
leaving agent access protected by muxAgent as before.
---
Nitpick comments:
In `@client/internal/peer/conn.go`:
- Around line 1030-1056: Remove the redundant iceBackoff.IsSuspended() check and
its logging from Conn.AttachICE and rely on attachICEListenerLocked to handle
suspended backoff (so AttachICE no longer early-returns for suspension);
specifically, delete the block that checks conn.iceBackoff != nil &&
conn.iceBackoff.IsSuspended() and its conn.Log.Debugf/return, leaving the rest
of AttachICE (including the handshaker nil checks and the call to
attachICEListenerLocked and SendOffer) intact so behavior is unchanged except
for avoiding duplicate suspension logging.
- Around line 19-20: The import block in client/internal/peer/conn.go has
shared/connectionmode placed between client/internal packages, breaking the
client/... group's alphabetical grouping; reorder the imports so all
client/internal/* imports (e.g., client/internal/metrics and
client/internal/peer/conntype) are contiguous and alphabetically ordered, and
move github.com/netbirdio/netbird/shared/connectionmode after the client/...
group (matching goimports grouping).
In `@client/internal/peer/env_test.go`:
- Around line 33-34: Replace the hardcoded environment variable names in the
test t.Setenv calls with the package constants: use envEnableLazyConn instead of
"NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and envInactivityThreshold instead of
"NB_LAZY_CONN_INACTIVITY_THRESHOLD" (these constants are defined in the peer
package); update the two t.Setenv calls in env_test.go to reference those
constants so renames are caught at compile time.
In `@client/internal/peer/guard/guard.go`:
- Around line 57-62: SetOnNetworkChange currently writes g.onNetworkChange
without synchronization which can race with concurrent Start() readers; change
the Guard implementation to store the callback via a sync/atomic pointer type
(e.g. sync/atomic.Pointer[func()]) instead of the plain field and update all
reads (e.g. in Start and any other goroutines that invoke the callback) to load
via atomic.Load/Pointer (or the generic Atomic.Pointer API) before invoking;
ensure SetOnNetworkChange uses atomic.Store/Pointer to set the callback and
nil-checks use the atomic load so access is race-detector-clean and retains the
same semantics.
In `@client/internal/peer/ice_backoff.go`:
- Around line 82-92: IsSuspended() currently returns false when nextRetry has
passed but leaves s.suspended true, causing inconsistent state with Snapshot();
update IsSuspended() (in iceBackoffState) to, under s.mu lock, check if
s.suspended && time.Now().After(s.nextRetry) and if so clear s.suspended (and
optionally reset nextRetry) before returning false so the in-memory flags remain
self-consistent; ensure you modify only inside the locked section to avoid races
and keep Snapshot() behavior aligned.
- Around line 161-169: Update the comment for SetMaxBackoff on iceBackoffState
to clarify that although the failure counter is preserved, the exponential
backoff schedule is reset because SetMaxBackoff calls buildBackoff(d) which
invokes bo.Reset(); explicitly state that the schedule (current interval)
restarts at InitialInterval while the failure count remains unchanged so callers
won't be misled into thinking the in-flight schedule is preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d4622b57-deca-4419-b794-100392cfb3c4
⛔ Files ignored due to path filters (2)
- client/proto/daemon.pb.go is excluded by !**/*.pb.go
- shared/management/proto/management.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (20)
- client/android/client.go
- client/android/preferences.go
- client/cmd/service.go
- client/cmd/service_installer.go
- client/internal/conn_mgr.go
- client/internal/conn_mgr_test.go
- client/internal/engine.go
- client/internal/peer/conn.go
- client/internal/peer/env_test.go
- client/internal/peer/guard/guard.go
- client/internal/peer/handshaker.go
- client/internal/peer/ice_backoff.go
- client/internal/peer/ice_backoff_test.go
- client/internal/peer/worker_ice.go
- client/proto/daemon.proto
- client/server/server.go
- client/server/setconfig_test.go
- client/ui/client_ui.go
- client/ui/const.go
- client/ui/event_handler.go
💤 Files with no reviewable changes (2)
- client/ui/event_handler.go
- client/ui/const.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/internal/conn_mgr.go
// SetConnectionMode stores a local override for the connection mode.
// Pass an empty string to clear the override (revert to following the
// server-pushed value).
func (p *Preferences) SetConnectionMode(mode string) {
	m := mode
	p.configInput.ConnectionMode = &m
}

// GetRelayTimeoutSeconds returns the locally configured relay-worker
// inactivity timeout in seconds, or 0 if no override is set (follow
// server-pushed value, or built-in default if the server has none).
func (p *Preferences) GetRelayTimeoutSeconds() (int64, error) {
	if p.configInput.RelayTimeoutSeconds != nil {
		return int64(*p.configInput.RelayTimeoutSeconds), nil
	}
	cfg, err := profilemanager.ReadConfig(p.configInput.ConfigPath)
	if err != nil {
		return 0, err
	}
	return int64(cfg.RelayTimeoutSeconds), nil
}

// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := uint32(secs)
	p.configInput.RelayTimeoutSeconds = &v
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm there's no other "clear override" path that nils the pointer in apply() or elsewhere.
rg -nP --type=go -C2 'ConnectionMode|RelayTimeoutSeconds|P2pTimeoutSeconds|P2pRetryMaxSeconds' -g '!**/*_test.go' -g 'client/internal/profilemanager/**'
Repository: netbirdio/netbird
Length of output: 3375
🏁 Script executed:
#!/bin/bash
# Check the actual SetConnectionMode, SetRelayTimeoutSeconds, SetP2pTimeoutSeconds, SetP2pRetryMaxSeconds implementations
cat -n client/android/preferences.go | head -400 | tail -100
Repository: netbirdio/netbird
Length of output: 4108
🏁 Script executed:
#!/bin/bash
# Search for all occurrences of these four setter functions to ensure we catch all of them
rg -n 'func.*Set(ConnectionMode|RelayTimeoutSeconds|P2pTimeoutSeconds|P2pRetryMaxSeconds)' --type=go -g '!**/*_test.go'
Repository: netbirdio/netbird
Length of output: 426
🏁 Script executed:
#!/bin/bash
# Check if there are any other code paths that might clear these overrides (e.g., by setting to nil)
rg -n 'ConnectionMode.*=.*nil|RelayTimeoutSeconds.*=.*nil|P2pTimeoutSeconds.*=.*nil|P2pRetryMaxSeconds.*=.*nil' --type=go -g '!**/*_test.go'
Repository: netbirdio/netbird
Length of output: 3062
Fix the "clear override" logic to set pointer to nil, not to zero value.
The documented behaviour for SetConnectionMode, SetRelayTimeoutSeconds, SetP2pTimeoutSeconds, and SetP2pRetryMaxSeconds promises that passing 0 (or empty string) clears the override and reverts to the server-pushed value. However, each setter currently stores a pointer to the zero value, and the apply() function in client/internal/profilemanager/config.go (lines 608–626) checks only != nil before assigning. This overwrites the persisted server-pushed value with 0 or "" instead of clearing the override.
To honour the documented semantics, set the pointer to nil when the sentinel value is passed:
Suggested fix pattern
func (p *Preferences) SetConnectionMode(mode string) {
+ if mode == "" {
+ p.configInput.ConnectionMode = nil
+ return
+ }
m := mode
p.configInput.ConnectionMode = &m
}
@@
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+ if secs <= 0 {
+ p.configInput.RelayTimeoutSeconds = nil
+ return
+ }
v := uint32(secs)
p.configInput.RelayTimeoutSeconds = &v
}Apply the same pattern to SetP2pTimeoutSeconds and SetP2pRetryMaxSeconds.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// SetConnectionMode stores a local override for the connection mode.
// Pass an empty string to clear the override (revert to following the
// server-pushed value).
func (p *Preferences) SetConnectionMode(mode string) {
	if mode == "" {
		p.configInput.ConnectionMode = nil
		return
	}
	m := mode
	p.configInput.ConnectionMode = &m
}

// GetRelayTimeoutSeconds returns the locally configured relay-worker
// inactivity timeout in seconds, or 0 if no override is set (follow
// server-pushed value, or built-in default if the server has none).
func (p *Preferences) GetRelayTimeoutSeconds() (int64, error) {
	if p.configInput.RelayTimeoutSeconds != nil {
		return int64(*p.configInput.RelayTimeoutSeconds), nil
	}
	cfg, err := profilemanager.ReadConfig(p.configInput.ConfigPath)
	if err != nil {
		return 0, err
	}
	return int64(cfg.RelayTimeoutSeconds), nil
}

// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	if secs <= 0 {
		p.configInput.RelayTimeoutSeconds = nil
		return
	}
	v := uint32(secs)
	p.configInput.RelayTimeoutSeconds = &v
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/android/preferences.go` around lines 325 - 352, The setters store
pointers to zero values instead of clearing overrides, so change
SetConnectionMode, SetRelayTimeoutSeconds (and likewise SetP2pTimeoutSeconds and
SetP2pRetryMaxSeconds) to set the corresponding pointer to nil when the sentinel
is passed (empty string for SetConnectionMode, 0 for the timeout/setter
methods); otherwise allocate and assign a pointer to the non-zero value. Update
p.configInput.ConnectionMode and p.configInput.RelayTimeoutSeconds (and the P2P
fields) accordingly so apply()'s nil check correctly reverts to server-pushed
values.
| // Phase 3.5 (#5989): notify Conn to reset iceBackoff + recreate workerICE | ||
| if g.onNetworkChange != nil { | ||
| g.onNetworkChange() | ||
| } |
Synchronous onNetworkChange() call blocks ctx.Done() responsiveness
The callback is invoked on the guard's select goroutine. Per the PR description, it recreates the ICE worker — potentially involving non-trivial setup. Until the callback returns, the guard loop cannot react to ctx.Done() or any other channel (including another srReconnectedChan event). Consider dispatching it in a background goroutine:
⚡ Proposed fix
- // Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
- if g.onNetworkChange != nil {
- g.onNetworkChange()
- }
+ // Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+ // Dispatched asynchronously so the guard loop remains responsive to ctx.Done().
+ if g.onNetworkChange != nil {
+ go g.onNetworkChange()
+ }If the callback itself serializes on an internal mutex (as ICE worker recreation likely does), concurrent calls from rapid reconnect events are already serialized. If not, consider adding a flag or draining mechanism to coalesce multiple rapid network-change signals.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
// Dispatched asynchronously so the guard loop remains responsive to ctx.Done().
if g.onNetworkChange != nil {
	go g.onNetworkChange()
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/peer/guard/guard.go` around lines 144 - 147, The select loop
currently calls g.onNetworkChange() synchronously which blocks responsiveness to
ctx.Done() and other channels; change it to dispatch the callback
asynchronously: when g.onNetworkChange != nil spawn a background goroutine to
call g.onNetworkChange() from the guard's select branch (e.g., in the guard loop
where srReconnectedChan is handled) so the loop can continue processing ctx.Done
and other events. To avoid unbounded concurrent invocations, add a simple
coalescing mechanism (a boolean flag or non-blocking send to a buffered channel)
adjacent to the async dispatch to ensure rapid reconnect events don’t spawn
redundant goroutines unless the previous handler has completed; keep the
flag/chan state local to the guard instance that owns g.onNetworkChange.
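A sketch of one such coalescing mechanism; nothing beyond the onNetworkChange field is taken from the review, so the notifyCh field and the worker loop are illustrative:

import "context"

type guardNotifier struct {
	onNetworkChange func()
	notifyCh        chan struct{} // buffered with capacity 1
}

func newGuardNotifier(cb func()) *guardNotifier {
	return &guardNotifier{onNetworkChange: cb, notifyCh: make(chan struct{}, 1)}
}

// signal coalesces rapid reconnect events: if a notification is already
// pending, the send is dropped rather than queueing redundant work.
func (g *guardNotifier) signal() {
	select {
	case g.notifyCh <- struct{}{}:
	default: // already pending; coalesce
	}
}

// run drains notifications off the guard's select loop, so the loop itself
// stays responsive to ctx.Done().
func (g *guardNotifier) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-g.notifyCh:
			if g.onNetworkChange != nil {
				g.onNetworkChange()
			}
		}
	}
}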
// Phase 1 (#5989): peer-connection mode and idle timeouts.
// connection_mode is a string (relay-forced, p2p, p2p-lazy, p2p-dynamic,
// follow-server, or empty); parsed via connectionmode.ParseString at the
// daemon side. Empty means "no client-side override, use server value".
optional string connection_mode = 40;
optional uint32 p2p_timeout_seconds = 41;
optional uint32 relay_timeout_seconds = 42;

// Phase 3 (#5989): cap on the ICE-failure backoff schedule. 0 = use
// server-pushed value (or built-in default 15 min).
optional uint32 p2p_retry_max_seconds = 43;
Wire these new request fields through the daemon handlers.
client/server/server.go still ignores LoginRequest.connection_mode/... and SetConfigRequest.connection_mode/..., so callers can send these overrides and get a success response even though the daemon neither applies nor persists them. That leaves the new RPC surface out of sync with the implementation, especially for the GUI save flow.
Also applies to: 714-724
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/proto/daemon.proto` around lines 208 - 218, The daemon handlers in
client/server/server.go currently ignore the new LoginRequest.connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, and p2p_retry_max_seconds (and the
same fields on SetConfigRequest), so RPC callers can send them but the daemon
neither applies nor persists them; update the request handling code paths that
process LoginRequest and SetConfigRequest in server.go to read these fields,
validate/parse connection_mode (using connectionmode.ParseString where
appropriate), apply them to the daemon's runtime config/state, and persist them
to the same storage path used by existing config fields so the GUI save flow and
subsequent calls reflect the overrides. Ensure the same wiring is added for any
helper methods that currently consume LoginRequest or SetConfigRequest (so both
request handling and persistence are consistent).
…rsion pin
- conn_mgr.go: tagged switch on uint32 sentinel (QF1002 staticcheck)
- engine.go: //nolint:staticcheck on the deprecated peer.IsForceRelayed call site - the function is intentionally retained for Phase-1 backwards compat with daemons that haven't migrated to ResolveModeFromEnv yet
- lazyconn/manager.go: //nolint:staticcheck on the deprecated inactivity.NewManager fallback - kept on purpose for callers without resolved two-timer config (iceTimeout=0 && relayTimeout=0)
- proxy_service.pb.go: pin protoc version header to upstream-main v7.34.1 (Proto Version Check workflow flagged it; only the comment-line differs)
Phase 3.5 hotfix for netbirdio#5989. When Guard detects that the signal/relay layer has reconnected (typically after a network/interface change like LTE-modem replug or WiFi-roaming), the per-Conn iceBackoff is now reset to a fresh state AND the workerICE is recreated. Without this, after a long Phase-2-style suspension the daemon would remain stuck on Relay even when the new network conditions allow P2P: the AttachICE-gate is open (suspend window expired) but the previously DetachICE-closed workerICE never gets a fresh pion-agent, producing "ICE Agent is not initialized yet" warnings instead of new ICE attempts. Verified on badmitterndorf-r1 (LTE-replug -> public IP): without this fix all peers stayed Relayed; after daemon-restart 3/5 peers came up as P2P. With this fix the same transition is automatic. Spec section 5.1 listed Interface-Change as a Reset-Trigger but the implementation was missed in the original Phase-3 plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ow-up) The Phase-3.5 hotfix recreated workerICE on srReconnect but only swapped the handshaker listener; no offer was sent. Result: pion's agent stays nil until a peer-initiated offer arrives. HW-verified on badmitterndorf via 144x "ICE Agent is not initialized yet" warnings post-bounce with no peer transitioning back to P2P. Now we explicitly call handshaker.SendOffer() after the listener swap so the remote responds, the local OnNewOffer fires, and reCreateAgent populates the new agent. Refs netbirdio#5989 (Phase 3.5).
…gent Phase 3.6 hotfix for netbirdio#5989. OnRemoteCandidate previously dropped any candidate that arrived while w.agent was nil (= during the small race window between OnNewOffer kicking off reCreateAgent and the agent being assigned). After Phase-3.5 introduced in-process workerICE recreate on network change, this race becomes much more frequent -- the remote peer often replies with candidates before our local OnNewOffer handler has finished assigning the new agent. Buffer up to 64 candidates per peer, drained inside OnNewOffer right after w.agent is set. Buffer overflow logs a warning but is bounded to prevent OOM under a misbehaving peer. Verified: pre-fix on badmitterndorf showed ~150 "ICE Agent is not initialized yet" warnings per peer after a network bounce, with no P2P recovery within several minutes. After this fix the candidates survive the race and pion can complete the pair-checks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ead of agent" This reverts commit c948885.
…recreate path
Phase 3.5 originally closed workerICE and replaced the wrapper via
NewWorkerICE() + listener-swap. Empirically this caused ICE to fail
with a ~13s pair-check timeout after a network event (LTE-glitch on
badmitterndorf), while a fresh daemon-restart converged in <1s on
the exact same network conditions. Most likely cause: state-leak
between the old and new workerICE wrappers (sockets, stdnet bindings)
that prevented the new pion agent from gathering working candidates.
Refactored to reuse the existing workerICE wrapper. The well-tested
"existing agent + new offer with different sessionID -> tear down +
reCreateAgent" branch already in worker_ice.go (Phase 1 code) is the
blessed path for in-place recreate. We just:
1) Reset iceBackoff (counter -> 0)
2) Close the current pion agent (w.agent = nil)
3) handshaker.SendOffer() so the remote responds with a fresh
sessionID that flows back through the still-attached listener
The buffer/race fix that netbirdio#5805 originally proposed and pappz argued
against is NOT needed here -- the warnings happen in both fresh-start
(P2P succeeds) and recreate (P2P fails), proving the warnings are
benign noise that pion handles via retransmits. The actual bug was
in the wrapper-replacement logic of the original Phase-3.5 commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous Phase-3.7 hotfix called handshaker.SendOffer() from onNetworkChange. But Guard already drives sendOffer via its newReconnectTicker (800 ms initial, ~4 retries in the first 6 s) right after the same srReconnect event that fires our callback. Both paths firing produces ~5 sendOffer calls per peer in ~6 s.

Each offer routes to the remote's workerICE.OnNewOffer where (if the local sessionID has changed since the last reCreateAgent) it triggers a fresh tear-down + reCreateAgent. With 5 such triggers in rapid succession, the remote ICE-pair-checks never complete -- they get torn down before pion's GatherCandidates / Dial can finish. Empirically observed on badmitterndorf during real LTE-carrier glitches: peers got 5 sending-offer log lines in 6 s, then no "set ICE to active" event for several minutes -> stayed Relayed.

Fix: do nothing about offers in onNetworkChange. Just close the agent (w.agent = nil). The Guard's natural reconnect-burst drives the next sendOffer, the remote replies, and our existing "agent==nil + new offer -> reCreateAgent" path handles the recreation cleanly.

Refs netbirdio#5989 (Phase 3.7 follow-up).
… (Guard-Loop Fix from netbirdio#5805)

Phase 3.7c (netbirdio#5989) re-introduces the Guard-Loop Fix from MichaelUray's PR netbirdio#5805 that pappz did not accept (he believed sessionID-skip was sufficient). Empirically on badmitterndorf during LTE-carrier instability we now have hard evidence the bug is real cross-NAT too: uray-mic-dh received 5 different sessionIDs from the remote peer in 2 minutes (06:27-06:29). Each different sessionID triggered the in-place "tear-down + reCreateAgent" branch in OnNewOffer. With both sides' Guards firing fresh offers in lockstep, the in-flight ICE pair-checks (5-10 s) never completed -- each new offer tore down the previous attempt before pion could finish.

Fix: when agentConnecting==true, ignore any new offer regardless of sessionID. Let the in-flight attempt either succeed (-> agent stays) or fail (-> agentConnecting goes false naturally via closeAgent), THEN allow a fresh sessionID to trigger reCreateAgent.

Refs netbirdio#5805 (Guard-Loop Fix), Phase 3.7c of netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
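A minimal sketch of that gate, assuming agentConnecting lives next to w.agent under w.muxAgent (only agentConnecting, reCreateAgent, and the mutex name appear in the sources; the rest is placeholder):

```go
// Hypothetical shape of the 3.7c offer gate in worker_ice.go.
func (w *WorkerICE) OnNewOffer(remoteOfferAnswer *OfferAnswer) {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()

	// An in-flight ICE attempt (pair-checks take 5-10 s) must not be torn
	// down by a fresh sessionID; let it succeed or fail on its own first.
	if w.agentConnecting {
		w.log.Debugf("ignoring new offer: ICE attempt already in flight")
		return
	}

	// ... existing sessionID comparison / reCreateAgent logic runs below ...
}
```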
In p2p-dynamic mode the runDynamicInactivityLoop calls DetachICEForPeer when the per-peer iceTimeout fires, but does NOT call lazyConnMgr.DeactivatePeer. The lazy state therefore stays at watcherInactivity while the handshaker's iceListener is gone -- a sub-state the lazy manager does not represent.

Before this change, ConnMgr.ActivatePeer gated AttachICE behind lazyConnMgr.ActivatePeer's "found" flag, which is a one-shot edge (watcherActivity -> watcherInactivity). After an iceTimeout-detach, subsequent signal messages would call ActivatePeer with found=false, AttachICE would never run, and remote OFFERs would silently drop at handshaker.Listen():143 (iceListener==nil). The peer was stuck on relay forever even while both sides kept signaling normally.

Symptom observed on bm router (uray-mic-dh peer): no "set ICE to active" event for 22+ minutes, six different remote session IDs received, zero ICE attempts started. Recovery was only possible via daemon restart.

Fix:
1. ConnMgr.ActivatePeer: call AttachICE unconditionally for p2p-dynamic on every signal trigger. AttachICE is idempotent (returns nil if listener already attached, conn.go:1048) and honors iceBackoff.IsSuspended() so the failure-backoff is not bypassed.
2. handshaker.Listen: emit a Debug-level note when an OFFER/ANSWER arrives without an attached ICE listener (typical during ICE backoff suspension or relay-forced mode), so the dispatch state is observable in debug logs without alarming on healthy systems.

Verified live on bm router post-deploy:
- uray-mic-dh recovered to P2P (srflx/srflx) within seconds of restart, confirming the AttachICE flow runs end-to-end.
- ctb50-d (V7iprtVU... peer) cycles correctly: ICE failure -> backoff suspend -> backoff expire -> next signal triggers AttachICE -> ICE retry -> backoff doubles. Exact Phase-3 behavior, now also working post-iceTimeout-detach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
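A sketch of the ActivatePeer change described in fix 1; everything except AttachICE, its idempotency/backoff guarantees, and the "found" edge is a placeholder name:

```go
// Hypothetical shape of ConnMgr.ActivatePeer after Phase 3.7d.
func (m *ConnMgr) ActivatePeer(ctx context.Context, conn *peer.Conn) {
	if found := m.lazyConnMgr.ActivatePeer(conn.GetKey()); found {
		// One-shot activity edge: first signal after inactivity opens the conn.
		conn.Open(ctx)
	}
	if m.mode == connectionmode.ModeP2PDynamic {
		// Re-attach on every signal trigger, not just the activity edge.
		// Safe to call repeatedly: AttachICE returns nil when the listener is
		// already attached and honors iceBackoff.IsSuspended().
		if err := conn.AttachICE(); err != nil {
			log.Debugf("attach ICE: %v", err)
		}
	}
}
```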
…e 3.7e)

Phase 3.7d ensured AttachICE runs on every signal trigger via ConnMgr.ActivatePeer, but that fix only takes effect when the next signal arrives. After an LTE-modem replug the iceListener can end up detached for some peers (paths via onICEFailed -> DetachICE during the bounce window, plus concurrent state-change callbacks while the agent is being torn down). Until the next signal triggers AttachICE, every remote OFFER reaches handshaker.Listen() with iceListener==nil and is silently dropped. In one observed case ygW6ySPb stayed on relay for 5 minutes after the bounce despite uray-mic-dh sending nine OFFERs with fresh sessionIDs; recovery only happened after a manual daemon restart.

Phase 3.5's onNetworkChange already runs once per peer right after srReconnect: it resets iceBackoff and closes the workerICE agent. This is the natural place to also force the iceListener back on, so the Guard's reconnect-loop can drive a fresh sendOffer that actually reaches OnNewOffer instead of being dropped.

Refactor:
- Extract attachICEListenerLocked from AttachICE. The locked helper returns true when a new attachment was made and false on no-op (already attached, ICE backoff suspended, handshaker not initialised, or workerICE not present).
- AttachICE checks iceBackoff.IsSuspended first to preserve the existing relay-forced-mode error semantics in TestConn_AttachICE_NoOpWhenSuspended, then delegates to attachICEListenerLocked, then sends an offer if a new attachment was made.
- onNetworkChange calls attachICEListenerLocked after closing the workerICE agent. Deliberately does NOT call SendOffer because the Guard reconnect-ticker already issues one right after the same srReconnect event; sending another here would re-introduce the offer-storm Phase 3.7b removed.

Verified live on bm router with a 90 s management+signal blackhole:
- All 6 peers logged "ICE state reset on network change (agent closed; listener re-armed; Guard will resend offer)" within 1 ms of srReconnect.
- Peers whose listener had been detached (X+HhIybX, Rkqv) logged "ICE listener attached (locked path)" from the new helper.
- ygW6ySPb (uray-mic-dh), yhM26jA (dk20), dxvaVD2 (w11-test1) all reached "set ICE to active connection" within 75-85 s of the blackhole ending. Previously ygW6ySPb required a daemon restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hange (Phase 3.7f)

After Reset() is called for an srReconnect / network-change event, the first ICE pair-check often fails because pion is still working with NAT mappings that were torn down by the network event. Falling back to the normal 1-minute initial exponential interval after that single failure leaves the peer on relay for 60-100 seconds, far longer than the underlying connectivity actually warrants - the second attempt typically succeeds within a few seconds because the new LTE/Wi-Fi mapping is now warm.

Behaviour change:
- Track lastResetAt in iceBackoffState; Reset() stamps it.
- Inside markFailure, while time.Since(lastResetAt) < networkChangeGracePeriod (30 s), use a fixed networkChangeRetryDelay of 5 s and do NOT advance the long-term exponential schedule.
- Outside the grace window, behaviour is unchanged: normal exponential backoff capped at maxBackoff.

Verified live on bm router with a 60 s management+signal blackhole (equivalent to an LTE replug):
- Pre-3.7f real LTE bounce (08:03): uray-mic-dh stayed on relay for 103 s before P2P recovered. ICE failure #1 -> 60 s+ exponential suspend -> ICE retry -> success.
- Post-3.7f 60 s blackhole (08:18): srReconnect at 08:18:46.672, ICE failure #1 at 08:18:59 -> "suspending for 5s, next retry at 08:19:04", then "ICE success, resetting backoff (was 1 failures)" at 08:19:09. Total relay-only window: 23 s (4.5x faster). All four P2P-capable peers (uray-mic-d4, uray-mic-dh, dk20, w11-test1) reconverged within the same 23 s window.

Tests:
- TestIceBackoff_GracePeriodAfterReset_ShortDelay: first and second failure within the grace window both return networkChangeRetryDelay and do not advance the underlying ExponentialBackOff.
- TestIceBackoff_GraceExpired_NormalExponential: forcing lastResetAt into the past restores the ~1 m initial exponential delay.
- TestIceBackoff_NoGraceWithoutReset: a fresh state without an explicit Reset uses the normal exponential schedule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
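A hedged sketch of the first of those tests; the constructor name and markFailure returning the suspend delay are assumptions about ice_backoff.go:

```go
// Hypothetical shape of TestIceBackoff_GracePeriodAfterReset_ShortDelay.
func TestIceBackoff_GracePeriodAfterReset_ShortDelay(t *testing.T) {
	b := newIceBackoff()
	b.Reset() // stamps lastResetAt, arming the grace window

	// Both failures inside the window return the fixed short delay and must
	// not advance the underlying ExponentialBackOff schedule.
	for i := 0; i < 2; i++ {
		if d := b.markFailure(); d != networkChangeRetryDelay {
			t.Fatalf("failure %d: got %v, want %v", i+1, d, networkChangeRetryDelay)
		}
	}
}
```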
…nnected (Phase 3.7g)
Phase 3.5 added workerICE.Close to onNetworkChange so that the ICE
agent gets recreated after an LTE-modem replug or other network event.
That's correct when the underlying peer-to-peer UDP path actually
broke -- pion's ICE state machine has already gone Disconnected/Failed
and w.agent has been cleared by closeAgent in onConnectionStateChange.
But many srReconnect events are NOT real connectivity losses. A brief
signal/relay outage (gRPC keepalive timeout, transient IP-blacklist,
DNS hiccup) fires srReconnect even though peer-to-peer WG keepalives
between clients kept flowing the whole time. Closing the still-working
ICE agent in that case forces:
- workerICE.Close clears w.agent and triggers ICE.Failed callbacks
- onICEFailed marks a backoff failure and calls DetachICE
- WireGuard endpoint is removed (~1 s ping dropout)
- Next signal triggers OnNewOffer -> reCreateAgent (~12 s pair-check)
- Total observable interruption: 15-25 s for an event that should
have been a no-op
Verified live on the second badmitterndorf router (172A2, wired to
home LAN, identical netbird config except runs on stable connectivity):
Pre-3.7g (Phase 3.7f): a 60 s mgmt blackhole caused all four
P2P-capable peers to log "ICE disconnected, do not switch to Relay
... configure WireGuard endpoint to ..." and a full 21 s
ICE-renegotiation cycle even though each peer's WG endpoint ended
up at the SAME address it had before.
Post-3.7g: same 60 s mgmt blackhole, debug log shows
"network change: skipping workerICE.Close (ICE still Connected,
soft-fallback)" for every healthy peer. Zero state changes in
netbird status, zero ICE failures, zero ping dropout.
Implementation:
- WorkerICE.IsConnected returns true when w.agent != nil and
lastKnownState == ice.ConnectionStateConnected. Reads the same
state machine that drives onConnectionStateChange, so it's
authoritative.
- onNetworkChange wraps the workerICE.Close call in
`if !workerICE.IsConnected()`. The LTE-bounce path is unchanged
(pion has already cleared w.agent so IsConnected returns false
and Close runs as before).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
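Pulling Phases 3.5/3.7b/3.7e/3.7g together, the network-change handler converges on a small routine. A consolidated sketch; lock and field names beyond those quoted in the commits (iceBackoff.Reset, workerICE.IsConnected, attachICEListenerLocked) are assumptions:

```go
// Hypothetical consolidated onNetworkChange after Phase 3.7g.
func (conn *Conn) onNetworkChange() {
	conn.mu.Lock()
	defer conn.mu.Unlock()

	// Fresh backoff schedule: stamps lastResetAt so the 3.7f grace window engages.
	conn.iceBackoff.Reset()

	// 3.7g: a pure signal/relay blip leaves the P2P path intact; closing a
	// still-Connected agent would cost a 15-25 s renegotiation for nothing.
	if !conn.workerICE.IsConnected() {
		conn.workerICE.Close() // clears w.agent; the next offer triggers reCreateAgent
	}

	// 3.7e: re-arm the listener so the Guard-driven offer is not dropped at
	// handshaker.Listen with iceListener == nil.
	conn.attachICEListenerLocked()

	// Deliberately no SendOffer here: the Guard's reconnect ticker already
	// issues one after the same srReconnect event (3.7b offer-storm fix).
}
```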
…e 3.7h)

Real-world LTE-bounce traces showed Phase 3.7f's 30 s grace window only fit ~2 ICE attempts before the schedule jumped to a 1-minute exponential suspend. Each pair-check is ~12-15 s and the post-Reset delay was 5 s, so by the second failure (T+~34 s) we were already outside the grace window. Cold NAT mappings often need 3-4 attempts to prime, so peers behind a single LTE/Wi-Fi NAT routinely waited 2-3 minutes for P2P recovery instead of the ~30-50 s the underlying connectivity actually allowed.

Widen the window from 30 s to 60 s and shorten the retry delay from 5 s to 2 s. With these values:

T=0 s    Reset (srReconnect / network change)
T=~12 s  Attempt #1 fails (pion pair-check timeout)
T=~14 s  Attempt #2 starts
T=~26 s  Attempt #2 fails
T=~28 s  Attempt #3 starts
T=~40 s  Attempt #3 fails
T=~42 s  Attempt #4 starts (still inside 60 s grace)
T=~54 s  Attempt #4 fails / succeeds

Roughly twice as many priming attempts before exponential kicks in. An observed bm-LTE peer that previously needed 2 min 8 s for recovery should now converge inside the grace window.

Trade-off: chronically broken peers (behind a genuine symmetric NAT on both sides) generate ~2x more signal/STUN traffic in the first 60 s after a network change. After the grace expires the schedule falls back to the same exponential backoff as before, so the long-term cost is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
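A sketch of the grace-window arithmetic with the 3.7h constants; everything except lastResetAt, the constant names, and the ExponentialBackOff fallback is an assumption about ice_backoff.go:

```go
const (
	networkChangeGracePeriod = 60 * time.Second // 3.7h: widened from 30 s
	networkChangeRetryDelay  = 2 * time.Second  // 3.7h: shortened from 5 s
)

// markFailure returns how long to suspend ICE retries after a failure.
func (s *iceBackoffState) markFailure() time.Duration {
	// Inside the grace window after Reset(): fixed short delay, and the
	// long-term exponential schedule is deliberately not advanced.
	if time.Since(s.lastResetAt) < networkChangeGracePeriod {
		return networkChangeRetryDelay
	}
	// Outside the window: normal exponential backoff capped at maxBackoff.
	return s.backoff.NextBackOff()
}
```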
…onfigure
Headless setups today need two steps to seed the active profile with a
non-default connection mode:
netbird service install ...
netbird up --connection-mode p2p-dynamic --p2p-timeout 60 ...
Add the same four profile-level flags to the install + reconfigure
commands and persist them via profilemanager.UpdateOrCreateConfig so
the daemon picks them up on first start. Reuses the package-level
vars + constants already defined for `up`, so naming and validation
stay consistent.
Flags added (mirrored from upCmd):
--connection-mode relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server
--relay-timeout seconds (0 = use server default)
--p2p-timeout seconds (0 = use server default; only effective
in p2p-dynamic mode)
--p2p-retry-max seconds (0 = use server default; built-in 15 min
fallback when server has not pushed a value)
Only fields whose flag was Changed() are written; unset flags leave
the existing profile untouched. Honors --config when present,
otherwise uses profilemanager.DefaultConfigPath.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
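A hedged sketch of that Changed()-gated persistence; applyConnectionModeFlagsToProfile is named in the review below, while the flag variables and ConfigInput fields are assumptions mirroring the `up` command:

```go
// Only flags the user explicitly set are written; everything else leaves the
// existing profile untouched.
func applyConnectionModeFlagsToProfile(cmd *cobra.Command) error {
	input := profilemanager.ConfigInput{ConfigPath: configPath}

	if cmd.Flag(connectionModeFlag).Changed {
		// Validate before persisting, matching the up-command semantics.
		if _, err := connectionmode.ParseString(connectionMode); err != nil {
			return fmt.Errorf("invalid --connection-mode: %w", err)
		}
		input.ConnectionMode = &connectionMode
	}
	if cmd.Flag(relayTimeoutFlag).Changed {
		input.RelayTimeoutSeconds = &relayTimeoutSeconds
	}
	if cmd.Flag(p2pTimeoutFlag).Changed {
		input.P2pTimeoutSeconds = &p2pTimeoutSeconds
	}
	if cmd.Flag(p2pRetryMaxFlag).Changed {
		input.P2pRetryMaxSeconds = &p2pRetryMaxSeconds
	}

	_, err := profilemanager.UpdateOrCreateConfig(input)
	return err
}
```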
Mirrors the Android Advanced Settings UI in the Fyne-based desktop
client (Linux + Windows). The Network tab in NetBird Settings gains:
- Connection Mode dropdown: Follow server, relay-forced, p2p,
p2p-lazy, p2p-dynamic. "Follow server" clears any local override.
- Relay Timeout (s): only meaningful in p2p-lazy / p2p-dynamic,
disabled for other modes. Empty = use server default.
- P2P Timeout (s): only meaningful in p2p-dynamic. Empty = use
server default.
- P2P Retry-Max (s): only meaningful in p2p-dynamic. Empty = use
server default (or built-in 15-min fallback when the management
server has not pushed a value).
The dropdown's onChange handler enables/disables the timeout entries
to match the inactivity-manager's actual scope (no inactivity teardown
runs in relay-forced / p2p, so those modes get all three fields
disabled).
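That enable/disable coupling is simple to express with stock Fyne widgets; a sketch (variable names are illustrative, not the actual client_ui.go ones):

```go
import "fyne.io/fyne/v2/widget"

// buildModeControls wires the dropdown to the three timeout entries.
func buildModeControls() (*widget.Select, *widget.Entry, *widget.Entry, *widget.Entry) {
	relayTimeout := widget.NewEntry()
	p2pTimeout := widget.NewEntry()
	p2pRetryMax := widget.NewEntry()

	setEnabled := func(e *widget.Entry, on bool) {
		if on {
			e.Enable()
		} else {
			e.Disable()
		}
	}

	modeSelect := widget.NewSelect(
		[]string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic"},
		func(mode string) {
			// No inactivity teardown runs outside the lazy modes, so grey the
			// timeout fields out exactly where they have no effect.
			setEnabled(relayTimeout, mode == "p2p-lazy" || mode == "p2p-dynamic")
			setEnabled(p2pTimeout, mode == "p2p-dynamic")
			setEnabled(p2pRetryMax, mode == "p2p-dynamic")
		},
	)
	return modeSelect, relayTimeout, p2pTimeout, p2pRetryMax
}
```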
Daemon plumbing:
- GetConfigResponse gains connection_mode, p2p_timeout_seconds,
relay_timeout_seconds, p2p_retry_max_seconds (proto + regenerated
pb.go). server.go fills them from the active profile config so the
GUI hydrates correctly on open.
- SetConfigRequest already had these fields from Phase 1; the GUI's
buildSetConfigRequest now populates them on every save.
State plumbing in client_ui.go:
- 4 new fyne widgets (sConnectionMode + 3 entries) created in
showSettingsUI alongside the existing checks.
- 4 new state fields cached on serviceClient for change detection.
- hasConnectionModeChanges() compares dropdown + entries against
the cached state; saveSettings flow only sends a SetConfigRequest
when there's a real diff.
- parseUint32Field tolerates empty / non-numeric input (treated as
0 = no override) so the user can clear a field without errors.
The lazy-connection checkbox in the systray menu is left in place;
it acts as a quick toggle for the legacy boolean and is kept for
backwards compatibility with profiles that have not yet adopted the
new ConnectionMode field.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7h finalisation. GetConfigResponse gets 4 server_pushed_* fields (connection_mode + 3 timeouts in seconds). ConnMgr captures the raw values from each NetworkMap.PeerConfig and exposes them via ServerPushed*() accessors, all guarded by sync.RWMutex (covered by TestConnMgr_ServerPushedFieldsAreRaceSafe). Engine.ConnMgr() getter lets the daemon-RPC layer reach the values; Server.GetConfig fills them into every response. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
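A minimal, self-contained sketch of that RWMutex capture/accessor pattern (reduced to two fields for brevity; the real ConnMgr also captures the P2P timeout and retry-max, and the type/field names here are placeholders):

```go
import "sync"

// serverPushed holds the raw values from the latest NetworkMap.PeerConfig.
type serverPushed struct {
	mu                  sync.RWMutex
	connectionMode      string // wire value; mapped via connectionmode helpers in the real code
	relayTimeoutSeconds uint32
}

// capture is called with each PeerConfig push from the management server.
func (s *serverPushed) capture(mode string, relaySecs uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.connectionMode, s.relayTimeoutSeconds = mode, relaySecs
}

// ServerPushedRelayTimeoutSeconds is safe to call from the daemon-RPC goroutine.
func (s *serverPushed) ServerPushedRelayTimeoutSeconds() uint32 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.relayTimeoutSeconds
}
```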
Phase 3.7h finalisation. The Connection Mode dropdown's "Follow server" entry now suffixes "(currently: <mode>)" when the engine has received a PeerConfig; the timeout entries' placeholders show the actual server default seconds. The redundant "Enable Lazy Connections" tray submenu is removed since Connection Mode covers it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lues
Phase 3.7h finalisation. Adds Get/Set{ConnectionMode, RelayTimeoutSeconds,
P2pTimeoutSeconds, P2pRetryMaxSeconds} on Preferences and the matching
GetServerPushed* accessors on Client (via connMgrSafe helper). Required
by the Android Connection-Mode picker (in netbird-android repo).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (7)
client/internal/peer/guard/guard.go (1)
144-147: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Synchronous `onNetworkChange()` still blocks the guard select loop. The callback is still invoked synchronously inside the `srReconnectedChan` branch. Until it returns, `ctx.Done()`, `relayedConnDisconnected`, `iCEConnDisconnected`, and any further `srReconnectedChan` events are all blocked. This was flagged in the previous review and remains unresolved.

🛠 Proposed fix (async dispatch with single-flight coalescing)
```diff
+	// networkChangePending prevents redundant concurrent recreations.
+	var networkChangePending atomic.Bool
 	case <-srReconnectedChan:
 		g.log.Debugf("has network changes, reset reconnection ticker")
 		ticker.Stop()
 		ticker = g.newReconnectTicker(ctx)
 		tickerChannel = ticker.C
 		iceState.reset()
-		// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
-		if g.onNetworkChange != nil {
-			g.onNetworkChange()
-		}
+		// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+		// Dispatched asynchronously so the guard loop stays responsive to ctx.Done().
+		if g.onNetworkChange != nil && !networkChangePending.Swap(true) {
+			go func() {
+				defer networkChangePending.Store(false)
+				g.onNetworkChange()
+			}()
+		}
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/guard/guard.go` around lines 144 - 147, The srReconnectedChan branch currently calls g.onNetworkChange() synchronously inside the guard select loop, which blocks other cases; change it to dispatch the work asynchronously: when handling srReconnectedChan, if g.onNetworkChange != nil spawn a goroutine to invoke it instead of calling it inline and use a single-flight/coalescing mechanism (e.g., a singleflight.Group or an in-struct boolean guarded by a mutex/atomic) so concurrent srReconnectedChan events only trigger one inflight onNetworkChange invocation; ensure the async worker performs the existing reset iceBackoff and recreate workerICE logic and that you respect ctx cancellation to avoid leaking goroutines.

client/internal/peer/worker_ice.go (1)
218-230: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize `lastKnownState` across the callback and network-change paths.

`IsConnected()` at Lines 227-230 now drives the skip-close branch, but `onConnectionStateChange()` still writes `lastKnownState` without the same synchronization at Lines 550 and 562. That leaves a data race on the exact close-or-skip decision this PR added.

```bash
#!/bin/bash
rg -n -C2 '\blastKnownState\b' client/internal/peer/worker_ice.go
```

Also applies to: 545-563
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/worker_ice.go` around lines 218 - 230, IsConnected reads lastKnownState under w.muxAgent while onConnectionStateChange writes lastKnownState without locking, creating a race; fix by protecting all accesses to lastKnownState with the same mutex (w.muxAgent) — update onConnectionStateChange (the callback that sets lastKnownState at/around where it currently assigns at lines ~545-563) to acquire w.muxAgent.Lock() and defer w.muxAgent.Unlock() around writes (and any reads), ensuring symmetry with IsConnected and preventing the close-or-skip race.
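The symmetric locking the prompt describes is small; a sketch, assuming `muxAgent` guards the agent-adjacent state as the comment says (pion's `ice.ConnectionState` is the state type):

```go
// Hypothetical locked write in onConnectionStateChange, mirroring the
// locked read in IsConnected.
func (w *WorkerICE) onConnectionStateChange(state ice.ConnectionState) {
	w.muxAgent.Lock()
	w.lastKnownState = state
	w.muxAgent.Unlock()

	// ... existing disconnect/failure handling continues here ...
}
```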
client/internal/conn_mgr.go (2)

282-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply timeout pushes even when the managed mode itself doesn't change.

Lines 282-320 only restart the manager on `modeChanged`, but `initLazyManager()` consumes `relayTimeoutSecs` and `p2pTimeoutSecs` once at Lines 557-562. A push from, for example, `p2p-dynamic` (60/30) to `p2p-dynamic` (10/5) updates the fields but leaves the live manager running with the old thresholds until a full restart.

Also applies to: 553-562
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 282 - 320, The runtime ignores timeout-only pushes if the mode doesn’t change; detect when relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs change while a manager is active and apply them by restarting or updating the manager: in conn_mgr.go where modeChanged/wasManaged/isManaged are computed (and where lazyConnMgr is checked), add a branch that if isManaged && e.lazyConnMgr != nil and any of relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs differ from the manager’s current thresholds, call e.closeManager(ctx) then e.initLazyManager(ctx), e.startModeSideEffects(), and e.addPeersToLazyConnManager() (or alternatively propagate the new values into the running manager); reference modeUsesLazyMgr, lazyConnMgr, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs, propagateP2pRetryMaxToConns, initLazyManager, closeManager, startModeSideEffects, and addPeersToLazyConnManager when making the change.
181-188: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't leave `e.mode` in lazy/dynamic when Rosenpass prevents the manager from starting.

Lines 185-187 and 313-315 return early, but `e.mode` is still kept/set to `p2p-lazy` or `p2p-dynamic`. New peers then inherit that mode, `AddPeerConn()` falls back to `conn.Open()` at Lines 391-395 because no manager exists, and `ModeP2PDynamic` still defers ICE listener registration in client/internal/peer/conn.go at Lines 239-242. Since `ActivatePeer()` also returns early with no manager at Lines 449-451, those peers can stay relay-only indefinitely.

Also applies to: 312-320
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 181 - 188, When you bail out from starting the lazy/dynamic manager (both the modeUsesLazyMgr() check and the e.rosenpassEnabled check), clear/unset the lazy P2P mode on e.mode so it no longer advertises "p2p-lazy" or "p2p-dynamic" to new peers; specifically, before each early return set e.mode to a non-lazy value (i.e. remove the lazy/dynamic flag or assign the normal/direct P2P mode) so callers like AddPeerConn() and ActivatePeer() won’t treat peers as manager-backed (avoiding the conn.Open() fallback and deferred ICE registration in ModeP2PDynamic).

client/internal/peer/conn.go (1)
202-212: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the explicit "disable backoff" state round-trippable.

`propagateP2pRetryMaxToConns()` converts the wire sentinel to `time.Duration(0)`, but Line 1159 stores that back as plain `0`. After a later reopen, Lines 204-207 reinterpret `0` as "use default", so a user-explicit disable silently turns back into the 15-minute cap.

Also applies to: 1156-1162
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn.go` around lines 202 - 212, The explicit "disable backoff" sentinel is being lost because 0 is being reused for "use default"; introduce and use a distinct sentinel (e.g. const P2PRetryMaxDisabledSeconds = -1) and update both propagateP2pRetryMaxToConns() (the code around 1156-1162) and this initialization block in conn.go to interpret -1 as "disabled" (set backoffCap to 0 or disable the iceBackoff entirely) while treating 0 as "use built-in default" (map to DefaultP2PRetryMax); also ensure newIceBackoff()/iceBackoff.SetMaxBackoff accept/handle the disabled sentinel so the explicit disable round-trips unchanged.

client/ui/client_ui.go (1)
656-666: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not silently coerce invalid timeout input to `0`.

Invalid/overflow timeout text currently collapses to `0`, which can unintentionally clear overrides when saving settings. Parse should return an error and `buildSetConfigRequest` should surface it.

Proposed fix
```diff
-func parseUint32Field(text string) uint32 {
+func parseUint32Field(text string) (uint32, error) {
 	t := strings.TrimSpace(text)
 	if t == "" {
-		return 0
+		return 0, nil
 	}
 	v, err := strconv.ParseUint(t, 10, 32)
 	if err != nil {
-		return 0
+		return 0, fmt.Errorf("must be a whole number between 0 and %d", uint64(^uint32(0)))
 	}
-	return uint32(v)
+	return uint32(v), nil
 }
@@
-	relaySecs := parseUint32Field(s.iRelayTimeout.Text)
-	p2pSecs := parseUint32Field(s.iP2pTimeout.Text)
-	retrySecs := parseUint32Field(s.iP2pRetryMax.Text)
+	relaySecs, err := parseUint32Field(s.iRelayTimeout.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid relay timeout: %w", err)
+	}
+	p2pSecs, err := parseUint32Field(s.iP2pTimeout.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid P2P timeout: %w", err)
+	}
+	retrySecs, err := parseUint32Field(s.iP2pRetryMax.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid P2P retry-max: %w", err)
+	}
 	req.RelayTimeoutSeconds = &relaySecs
 	req.P2PTimeoutSeconds = &p2pSecs
 	req.P2PRetryMaxSeconds = &retrySecs
```

Also applies to: 741-750
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/client_ui.go` around lines 656 - 666, The parseUint32Field function currently swallows parse/overflow errors and returns 0 which can accidentally clear saved timeouts; change parseUint32Field to return (uint32, error) instead of silently coercing — validate strings.TrimSpace(text) then call strconv.ParseUint(..., 32) and return an explicit error on failure (including overflow), and update all callers (notably buildSetConfigRequest) to propagate or surface that error to the caller (e.g., return/build an error response instead of using 0); apply the same change to the other similar parser at the 741-750 block so all timeout parsing surfaces errors instead of returning 0.

client/android/preferences.go (1)
349-352: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard negative timeout inputs before casting to `uint32`.

Negative values currently wrap into very large `uint32` values and can silently corrupt timeout overrides. Treat `<= 0` as "clear override" (0) before casting.

Proposed fix
```diff
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.RelayTimeoutSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.P2pTimeoutSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.P2pRetryMaxSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }
```

Also applies to: 370-373, 390-393
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/android/preferences.go` around lines 349 - 352, Guard against negative timeout inputs in SetRelayTimeoutSeconds: treat any secs <= 0 as a clear override by using 0 instead of allowing a negative int64 to wrap when cast to uint32. Update SetRelayTimeoutSeconds to check secs <= 0 and set the config pointer to a uint32(0) in that case; otherwise cast secs to uint32 and assign as before. Apply the same pattern to the other timeout setter functions in this file that perform the uint32 cast so they also handle non-positive inputs safely.
🧹 Nitpick comments (1)
client/internal/lazyconn/manager/manager.go (1)
68-68: ⚡ Quick win

Remove unused `inactivityThreshold` field from Manager struct.

The field is assigned in `NewManager` (line 93) but never read; all timeout logic flows through `resolvedTimeouts()`. It's dead code from the migration and can be safely removed.

🧹 Removal diff
```diff
 type Manager struct {
 	engineCtx context.Context
 	peerStore *peerstore.Store
-	inactivityThreshold time.Duration
 	managedPeers map[string]*lazyconn.PeerConfig
@@
 	m := &Manager{
 		engineCtx: engineCtx,
 		peerStore: peerStore,
-		inactivityThreshold: inactivity.DefaultInactivityThreshold,
 		managedPeers: make(map[string]*lazyconn.PeerConfig),
```
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/lazyconn/manager/manager.go` at line 68, Remove the dead inactivityThreshold field from the Manager struct and eliminate its assignment in NewManager: delete the inactivityThreshold declaration in the Manager type, remove any parameter/argument or line in NewManager that sets m.inactivityThreshold, and ensure no other code reads it (timeout logic should remain using resolvedTimeouts()). Update related constructors/call sites to stop passing that value and run tests to confirm nothing else references inactivityThreshold.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@client/cmd/service_installer.go`:
- Around line 135-139: The call to applyConnectionModeFlagsToProfile currently
logs a warning on failure, allowing install/reconfigure to continue with
inconsistent connection-mode state; change this so that any error returned by
applyConnectionModeFlagsToProfile is propagated and causes the command to abort
(return the error) instead of printing a warning. Update both places where this
call appears (the install path around applyConnectionModeFlagsToProfile(cmd) and
the reconfigure path) to return the error up the call stack rather than using
cmd.PrintErrf, ensuring the command stops if persistence fails and no further
install/reconfigure steps run.
---
Duplicate comments:
In `@client/android/preferences.go`:
- Around line 349-352: Guard against negative timeout inputs in
SetRelayTimeoutSeconds: treat any secs <= 0 as a clear override by using 0
instead of allowing a negative int64 to wrap when cast to uint32. Update
SetRelayTimeoutSeconds to check secs <= 0 and set the config pointer to a
uint32(0) in that case; otherwise cast secs to uint32 and assign as before.
Apply the same pattern to the other timeout setter functions in this file that
perform the uint32 cast so they also handle non-positive inputs safely.
In `@client/internal/conn_mgr.go`:
- Around line 282-320: The runtime ignores timeout-only pushes if the mode
doesn’t change; detect when relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs
change while a manager is active and apply them by restarting or updating the
manager: in conn_mgr.go where modeChanged/wasManaged/isManaged are computed (and
where lazyConnMgr is checked), add a branch that if isManaged && e.lazyConnMgr
!= nil and any of relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs differ from
the manager’s current thresholds, call e.closeManager(ctx) then
e.initLazyManager(ctx), e.startModeSideEffects(), and
e.addPeersToLazyConnManager() (or alternatively propagate the new values into
the running manager); reference modeUsesLazyMgr, lazyConnMgr, relayTimeoutSecs,
p2pTimeoutSecs, p2pRetryMaxSecs, propagateP2pRetryMaxToConns, initLazyManager,
closeManager, startModeSideEffects, and addPeersToLazyConnManager when making
the change.
- Around line 181-188: When you bail out from starting the lazy/dynamic manager
(both the modeUsesLazyMgr() check and the e.rosenpassEnabled check), clear/unset
the lazy P2P mode on e.mode so it no longer advertises "p2p-lazy" or
"p2p-dynamic" to new peers; specifically, before each early return set e.mode to
a non-lazy value (i.e. remove the lazy/dynamic flag or assign the normal/direct
P2P mode) so callers like AddPeerConn() and ActivatePeer() won’t treat peers as
manager-backed (avoiding the conn.Open() fallback and deferred ICE registration
in ModeP2PDynamic).
In `@client/internal/peer/conn.go`:
- Around line 202-212: The explicit "disable backoff" sentinel is being lost
because 0 is being reused for "use default"; introduce and use a distinct
sentinel (e.g. const P2PRetryMaxDisabledSeconds = -1) and update both
propagateP2pRetryMaxToConns() (the code around 1156-1162) and this
initialization block in conn.go to interpret -1 as "disabled" (set backoffCap to
0 or disable the iceBackoff entirely) while treating 0 as "use built-in default"
(map to DefaultP2PRetryMax); also ensure
newIceBackoff()/iceBackoff.SetMaxBackoff accept/handle the disabled sentinel so
the explicit disable round-trips unchanged.
In `@client/internal/peer/guard/guard.go`:
- Around line 144-147: The srReconnectedChan branch currently calls
g.onNetworkChange() synchronously inside the guard select loop, which blocks
other cases; change it to dispatch the work asynchronously: when handling
srReconnectedChan, if g.onNetworkChange != nil spawn a goroutine to invoke it
instead of calling it inline and use a single-flight/coalescing mechanism (e.g.,
a singleflight.Group or an in-struct boolean guarded by a mutex/atomic) so
concurrent srReconnectedChan events only trigger one inflight onNetworkChange
invocation; ensure the async worker performs the existing reset iceBackoff and
recreate workerICE logic and that you respect ctx cancellation to avoid leaking
goroutines.
In `@client/internal/peer/worker_ice.go`:
- Around line 218-230: IsConnected reads lastKnownState under w.muxAgent while
onConnectionStateChange writes lastKnownState without locking, creating a race;
fix by protecting all accesses to lastKnownState with the same mutex
(w.muxAgent) — update onConnectionStateChange (the callback that sets
lastKnownState at/around where it currently assigns at lines ~545-563) to
acquire w.muxAgent.Lock() and defer w.muxAgent.Unlock() around writes (and any
reads), ensuring symmetry with IsConnected and preventing the close-or-skip
race.
In `@client/ui/client_ui.go`:
- Around line 656-666: The parseUint32Field function currently swallows
parse/overflow errors and returns 0 which can accidentally clear saved timeouts;
change parseUint32Field to return (uint32, error) instead of silently coercing —
validate strings.TrimSpace(text) then call strconv.ParseUint(..., 32) and return
an explicit error on failure (including overflow), and update all callers
(notably buildSetConfigRequest) to propagate or surface that error to the caller
(e.g., return/build an error response instead of using 0); apply the same change
to the other similar parser at the 741-750 block so all timeout parsing surfaces
errors instead of returning 0.
---
Nitpick comments:
In `@client/internal/lazyconn/manager/manager.go`:
- Line 68: Remove the dead inactivityThreshold field from the Manager struct and
eliminate its assignment in NewManager: delete the inactivityThreshold
declaration in the Manager type, remove any parameter/argument or line in
NewManager that sets m.inactivityThreshold, and ensure no other code reads it
(timeout logic should remain using resolvedTimeouts()). Update related
constructors/call sites to stop passing that value and run tests to confirm
nothing else references inactivityThreshold.
⛔ Files ignored due to path filters (1)
client/proto/daemon.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (19)
- client/android/client.go
- client/android/preferences.go
- client/cmd/service.go
- client/cmd/service_installer.go
- client/internal/conn_mgr.go
- client/internal/conn_mgr_test.go
- client/internal/engine.go
- client/internal/lazyconn/manager/manager.go
- client/internal/peer/conn.go
- client/internal/peer/guard/guard.go
- client/internal/peer/handshaker.go
- client/internal/peer/ice_backoff.go
- client/internal/peer/ice_backoff_test.go
- client/internal/peer/worker_ice.go
- client/proto/daemon.proto
- client/server/server.go
- client/ui/client_ui.go
- client/ui/const.go
- client/ui/event_handler.go
💤 Files with no reviewable changes (2)
- client/ui/const.go
- client/ui/event_handler.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/internal/peer/handshaker.go
```go
// Persist any profile-level connection-mode/timeout flags that
// were explicitly set so the daemon picks them up on first start.
if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
	cmd.PrintErrf("Warning: failed to persist connection-mode flags: %v\n", err)
}
```
Make connection-mode persistence failures abort the command.
These flags are not passed through buildServiceArguments(), so if applyConnectionModeFlagsToProfile fails here and execution continues, install/reconfigure completes with different connection-mode/timeout behavior than the user explicitly requested. This should return an error instead of downgrading to a warning.
Suggested change

```diff
-	if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
-		cmd.PrintErrf("Warning: failed to persist connection-mode flags: %v\n", err)
-	}
+	if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
+		return fmt.Errorf("persist connection-mode flags: %w", err)
+	}
```

Apply the same change in both the install and reconfigure paths.
Based on learnings, errors returned by these helpers should be propagated early rather than continuing with partially applied state.
Also applies to: 263-265
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/cmd/service_installer.go` around lines 135 - 139, The call to
applyConnectionModeFlagsToProfile currently logs a warning on failure, allowing
install/reconfigure to continue with inconsistent connection-mode state; change
this so that any error returned by applyConnectionModeFlagsToProfile is
propagated and causes the command to abort (return the error) instead of
printing a warning. Update both places where this call appears (the install path
around applyConnectionModeFlagsToProfile(cmd) and the reconfigure path) to
return the error up the call stack rather than using cmd.PrintErrf, ensuring the
command stops if persistence fails and no further install/reconfigure steps run.




[client] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab
Branch: pr/b-phase3.5-network-change → base pr/a-p2p-dynamic-foundation (stacked PR — depends on PR-A landing first)

Compare: https://github.com/netbirdio/netbird/compare/MichaelUray:netbird:pr/a-p2p-dynamic-foundation...MichaelUray:netbird:pr/b-phase3.5-network-change?expand=1
Summary
This PR adds network-change handling, signal-trigger ICE re-attach, a short-grace post-network-change ICE-failure delay (so we don't burn the exponential backoff on the well-known "first ICE pair-check after WiFi/cellular handover fails on stale NAT mappings"), the Guard-Loop fix from the closed PR #5805, and the GUI surface for Connection Mode + timeouts in the client UI.
This is the second of four stacked PRs implementing #5989. It builds on PR-A.
Why
The Phase-1+2+3 foundation (PR-A) handles steady-state operation well, but real devices do not stay in steady state: LTE modems get replugged, WiFi roams between access points, and laptops move between interfaces over the course of a day.
After such an event the existing ICE agent is bound to a departing local interface and pion's pair-checks fail. The exponential backoff (PR-A) then kicks in and the peer stays on relay for minutes.
Two related fixes are needed: reset the per-peer ICE state (backoff + agent + listener) as soon as the signal/relay layer reconnects, and keep the first post-change ICE failures on a short fixed retry delay so they do not burn the exponential backoff schedule.
Plus: previously closed PR #5805 ("Fix ICE reconnection loop and buffer candidates before agent initialization") had a known reconnect-loop where ICE worker rebuild on every signal-state change racked up offers faster than pion could complete pair-checks. The clean fix is to skip new offers while the ICE agent is still in
Connecting.

What's in this PR
Phase 3.5 — Reset on network change
- client/peer/conn: reset ICE backoff + recreate workerICE when the signal/relay layer reconnects after a network change.
- client/peer/conn: refactor onNetworkChange to use the in-place agent recreate path; drop SendOffer from onNetworkChange to fix the offer-storm; revert and replace an interim worker_ice candidate-buffering attempt with a clean approach.

Phase 3.7d — Signal-trigger re-attach

- client/conn_mgr: run AttachICE on every signal trigger in p2p-dynamic (it is idempotent and honors the ICE backoff), so a peer whose listener was detached by an iceTimeout can recover without a daemon restart.
Phase 3.7e — Listener inside onNetworkChange
- client/peer/conn: re-arm the ICE listener inside onNetworkChange itself so a network change fires a fresh listener registration without waiting for the next activity event.

Phase 3.7f + 3.7h — Grace window
- iceBackoff.markFailure: while inside networkChangeGracePeriod (60s) of the most recent Reset(), cap the suspend delay at networkChangeRetryDelay (2s) and do NOT advance the long-term exponential schedule. Outside the window the normal 1-minute initial backoff applies.

Phase 3.7g — Skip ICE-Close on NC when still connected
- client/peer/worker_ice: if the ICE agent is still in the Connected state when the network-change fires, don't call workerICE.Close() — the existing pair is fine, just refresh the listener.

#5805 cleanup — Guard-Loop fix
- worker_ice: skip new offers while the ICE agent is in Connecting. Drops the candidate-buffer approach the closed PR proposed in favour of a much simpler offer-skip during connecting.

GUI surface (Phase 3.7h GUI)
- client/cmd/service: persist connection-mode + timeouts on install/reconfigure.
- client/ui: Connection Mode + timeouts in Network tab.
- client: surface server-pushed connection-mode/timeouts via daemon-RPC.
- client/ui: "Follow-Server (currently: …)" display + remove the now-redundant Lazy menu.
- client/android: gomobile getters for ConnectionMode + ServerPushed values.

Tests
- go test ./client/internal/peer ./client/internal/peer/guard — pass (full ICE-backoff suite including the new grace-window cases).
- go build ./client/... — pass on linux/amd64, linux/arm64, windows/amd64.

Test plan
- Unit: ICE failure shortly after Reset() → grace window engages.

Use case
A laptop that switches between docked Ethernet, office WiFi, and tethered LTE during the day. Without this PR every transition burns the exponential backoff and the user sees "stuck on relay" for 5-15 minutes. With this PR the post-handover recovery is bounded at ~50s.
Linked work
Maintainers are welcome to push directly to this branch.
Summary by CodeRabbit
New Features
UI
Tests
Documentation
These changes are internal lifecycle / behavioural improvements; no user-visible API or CLI flag added that warrants new public docs. Existing flags/Settings already documented at netbirdio/docs cover the surface area.