[client] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab (stack 2/4) #6082

Open
MichaelUray wants to merge 63 commits into netbirdio:main from
MichaelUray:pr/b-phase3.5-network-change

Conversation


@MichaelUray MichaelUray commented May 5, 2026

[client] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab

Branch: pr/b-phase3.5-network-change → base pr/a-p2p-dynamic-foundation (stacked PR — depends on PR-A landing first)
Compare: https://github.com/netbirdio/netbird/compare/MichaelUray:netbird:pr/a-p2p-dynamic-foundation...MichaelUray:netbird:pr/b-phase3.5-network-change?expand=1


Summary

This PR adds network-change handling, signal-triggered ICE re-attach, a short post-network-change grace window for ICE failures (so the well-known first failed pair-check after a WiFi/cellular handover, which trips over stale NAT mappings, does not burn the exponential backoff), the Guard-loop fix from the closed PR #5805, and the GUI surface for Connection Mode + timeouts in the client UI.

This is the second of four stacked PRs implementing #5989. It builds on PR-A.

Why

The Phase-1+2+3 foundation (PR-A) handles steady-state operation well, but real devices do not stay in steady state:

  • Laptops roam between WiFi and cellular.
  • Mobile devices wake from Doze/StandBy with stale NAT mappings.
  • LTE modems briefly drop and re-attach.

After such an event the existing ICE agent is bound to a departing local interface and pion's pair-checks fail. The exponential backoff (PR-A) then kicks in and the peer stays on relay for minutes.

Two related fixes are needed:

  1. Re-attach ICE after a network change — recreate the ICE worker on a fresh socket, send a new offer. (Phases 3.5, 3.7d, 3.7e.)
  2. Short retry delay during the post-network-change grace window — the first pair-check often fails on stale mappings even when subsequent ones succeed. Falling back to the normal 1-minute initial backoff after that single failure leaves the peer on relay far longer than the underlying connectivity warrants. A 60-second grace window with 2-second retries lets ~4-5 attempts run while the new mapping primes. (Phases 3.7f, 3.7h.)

Plus: the previously closed PR #5805 ("Fix ICE reconnection loop and buffer candidates before agent initialization") exhibited a known reconnect loop in which rebuilding the ICE worker on every signal-state change queued up offers faster than pion could complete its pair checks. The clean fix is to skip sending new offers while the ICE agent is still in Connecting.

What's in this PR

Phase 3.5 — Reset on network change

  • client/peer/conn: reset ICE backoff + recreate workerICE when the signal/relay layer reconnects after a network change.
  • client/peer/conn: refactor onNetworkChange to use the in-place agent recreate path; drop SendOffer from onNetworkChange to fix offer-storm; revert and replace an interim worker_ice candidate-buffering attempt with a clean approach.

Phase 3.7d — Signal-trigger re-attach

  • Re-attach ICE on every signal trigger so a peer that lost ICE while the daemon was Disconnected from signaling resumes correctly when signaling is back.

Phase 3.7e — Listener inside onNetworkChange

  • Re-attach the ICE listener inside onNetworkChange itself so a network change fires a fresh listener registration without waiting for the next activity event.

Phase 3.7f + 3.7h — Grace window

  • iceBackoff.markFailure: while inside networkChangeGracePeriod (60s) of the most recent Reset(), cap the suspend delay at networkChangeRetryDelay (2s) and do NOT advance the long-term exponential schedule. Outside the window the normal 1-minute initial backoff applies.
  • Window widened from initial 30s to 60s and delay reduced from 5s to 2s after observing real-world LTE-bounce behaviour: cold NAT mappings often need 3-4 ICE attempts to prime, and the previous 30s window only fit ~2 attempts.

Phase 3.7g — Skip ICE-Close on NC when still connected

  • If ICE is still in Connected state when the network-change fires, don't call workerICE.Close() — the existing pair is fine, just refresh the listener.
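The skip-Close branch can be sketched like this. The types and function shapes are hypothetical stand-ins (the real logic lives around `workerICE` in `client/internal/peer`); only the decision rule comes from the description above.

```go
package main

import "fmt"

// ICEState is a simplified stand-in for pion's connection state.
type ICEState int

const (
	ICEDisconnected ICEState = iota
	ICEConnecting
	ICEConnected
)

// worker mimics the subset of workerICE used here; names are illustrative.
type worker struct{ state ICEState }

func (w *worker) Close()          { w.state = ICEDisconnected }
func (w *worker) State() ICEState { return w.state }

// onNetworkChange sketches Phase 3.7g: if the existing ICE pair survived the
// network change (state still Connected), keep it and only refresh the
// listener; otherwise tear the worker down so it is recreated on a fresh
// socket and a new offer is sent.
func onNetworkChange(w *worker, refreshListener func()) (closed bool) {
	if w.State() == ICEConnected {
		refreshListener() // existing pair is fine, just re-register
		return false
	}
	w.Close() // recreate path: fresh socket + new offer follow
	return true
}

func main() {
	w := &worker{state: ICEConnected}
	fmt.Println(onNetworkChange(w, func() {})) // false: pair kept
	w.state = ICEConnecting
	fmt.Println(onNetworkChange(w, func() {})) // true: worker closed
}
```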

#5805 cleanup — Guard-Loop fix

  • worker_ice: skip new offers while the ICE agent is in Connecting. Drops the candidate-buffer approach the closed PR proposed in favour of a much simpler offer-skip during connecting.
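The offer-skip guard can be illustrated with a small sketch. The guard type and method names here are invented for the example; the rule itself (drop offer triggers while the agent is mid-handshake, instead of buffering candidates) is the one described above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// agentState mirrors the three ICE agent phases relevant to the guard;
// purely illustrative, not the pion type.
type agentState int32

const (
	stateIdle agentState = iota
	stateConnecting
	stateConnected
)

// offerGuard sketches the #5805 fix: instead of buffering candidates,
// simply refuse to send a new offer while the agent is mid-handshake.
type offerGuard struct{ state atomic.Int32 }

func (g *offerGuard) setState(s agentState) { g.state.Store(int32(s)) }

// maybeSendOffer reports whether an offer was sent. While Connecting, the
// trigger is dropped so in-flight pair checks can finish first.
func (g *offerGuard) maybeSendOffer(send func()) bool {
	if agentState(g.state.Load()) == stateConnecting {
		return false // skip: avoids the offer storm / reconnect loop
	}
	send()
	return true
}

func main() {
	var g offerGuard
	sent := 0
	g.setState(stateConnecting)
	fmt.Println(g.maybeSendOffer(func() { sent++ })) // false: skipped
	g.setState(stateIdle)
	fmt.Println(g.maybeSendOffer(func() { sent++ })) // true: sent
	fmt.Println(sent)                                // 1
}
```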

GUI surface (Phase 3.7h GUI)

  • client/cmd/service: persist connection-mode + timeouts on install/reconfigure.
  • client/ui: Connection Mode + timeouts in Network tab.
  • client: surface server-pushed connection-mode/timeouts via daemon-RPC.
  • client/ui: "Follow-Server (currently: …)" display + remove the now-redundant Lazy menu.
  • client/android: gomobile getters for ConnectionMode + ServerPushed values.

Tests

  • go test ./client/internal/peer ./client/internal/peer/guard — pass (full ICE-backoff suite including the new grace-window cases).
  • go build ./client/... — pass on linux/amd64, linux/arm64, windows/amd64.
  • Hardware-validated against three OpenWrt routers (LAN host candidate, internet srflx, LTE srflx) plus a Windows 11 daemon. After a forced LTE-modem reset, the LTE peer recovers to P2P srflx within ~50 s instead of the 2-3 minutes the pre-3.7f code took.

Test plan

  • Grace window: failure inside 60s → 2s retry; failure after 60s → exponential restarts at ~1m.
  • No regression for the non-network-change failure path (still uses the exponential schedule).
  • Signal/relay reconnect drives Reset() → grace window engages.
  • Maintainer to verify: GUI changes on macOS / Linux ARM (I tested Windows + Linux/amd64 only).

Use case

A laptop that switches between docked Ethernet, office WiFi, and tethered LTE during the day. Without this PR every transition burns the exponential backoff and the user sees "stuck on relay" for 5-15 minutes. With this PR the post-handover recovery is bounded at ~50s.

Linked work

Maintainers are welcome to push directly to this branch.

Summary by CodeRabbit

  • New Features

    • Configurable connection modes (relay-forced, p2p, p2p-lazy, p2p-dynamic) with per-mode timeouts and local overrides.
    • Profile/install-time and CLI persistence for connection settings; server can push defaults surfaced to clients and UI.
    • ICE failure backoff with visible status (suspended, failures, next-retry) and graceful recovery.
  • UI

    • Network/settings UI updated to manage connection mode, timeout fields, and show server-provided hints.
  • Tests

    • Expanded test coverage across mode resolution, inactivity timers, ICE backoff, and related flows.

Documentation

  • Documentation is not needed

These changes are internal lifecycle / behavioural improvements; no user-visible API or CLI flag added that warrants new public docs. Existing flags/Settings already documented at netbirdio/docs cover the surface area.

MichaelUray and others added 30 commits May 5, 2026 22:31
…nfig

Additive change for issue netbirdio#5989 Phase 1. New fields use new tag numbers
(11, 12, 13); existing fields (including LazyConnectionEnabled tag 6) are
unchanged so old clients ignore the additions and old servers send
UNSPECIFIED, which the new client maps back via the legacy boolean.

Note: the regenerated pb.go files now report protoc v5.29.3 in their
header (this branch was generated with locally-installed protoc 29.3
instead of upstream's v7.34.1). Functionally identical; header diff is
the only delta beyond the actual schema additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines Mode enum (relay-forced, p2p, p2p-lazy, p2p-dynamic plus the
client-only sentinels Unspecified and FollowServer), ParseString for
CLI/env input, ToProto/FromProto for wire translation, and the two
backwards-compat helpers ResolveLegacyLazyBool / ToLazyConnectionEnabled
that bridge the old Settings.LazyConnectionEnabled boolean.

Phase 1 of issue netbirdio#5989. Pure addition -- no existing callers touched
in this commit; the engine/conn_mgr migration follows in subsequent
commits in the same PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
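A minimal sketch of what an enum of this shape could look like. This is not the actual `connectionmode` package: the constant ordering and the `ok` return are assumptions; the mode names, sentinels, and the `ParseString` / `ResolveLegacyLazyBool` roles come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// Mode mirrors the enum described in the commit; the two client-only
// sentinels come first.
type Mode int

const (
	ModeUnspecified Mode = iota
	ModeFollowServer
	ModeRelayForced
	ModeP2P
	ModeP2PLazy
	ModeP2PDynamic
)

// ParseString sketches a tolerant CLI/env parser: unknown input maps to
// ModeUnspecified with ok=false so callers can fall back to legacy settings.
func ParseString(s string) (Mode, bool) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "relay-forced":
		return ModeRelayForced, true
	case "p2p":
		return ModeP2P, true
	case "p2p-lazy":
		return ModeP2PLazy, true
	case "p2p-dynamic":
		return ModeP2PDynamic, true
	case "follow-server":
		return ModeFollowServer, true
	default:
		return ModeUnspecified, false
	}
}

// ResolveLegacyLazyBool sketches the backwards-compat bridge from the old
// Settings.LazyConnectionEnabled boolean to the new enum.
func ResolveLegacyLazyBool(lazy bool) Mode {
	if lazy {
		return ModeP2PLazy
	}
	return ModeP2P
}

func main() {
	m, ok := ParseString(" P2P-Dynamic ")
	fmt.Println(m == ModeP2PDynamic, ok)             // true true
	fmt.Println(ResolveLegacyLazyBool(true) == ModeP2PLazy) // true
}
```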
…on warns

NB_CONNECTION_MODE wins over the legacy pair (NB_FORCE_RELAY,
NB_ENABLE_EXPERIMENTAL_LAZY_CONN); when the legacy pair is set
together, NB_FORCE_RELAY wins (most-restrictive, mirrors the
group-conflict rule from issue netbirdio#5990).

Each legacy var emits a one-shot deprecation warning when it actually
contributes to the resolved mode. NB_LAZY_CONN_INACTIVITY_THRESHOLD
becomes an alias for the future relay_timeout setting and warns once.

IsForceRelayed() is kept for callers that have not yet been migrated
(conn.go, statusrecorder); they will be updated in the engine/conn
refactor commits later in this PR.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new CLI flags map onto the new connection-mode plumbing:
- --connection-mode <relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server>
- --relay-timeout <seconds>
- --p2p-timeout <seconds>

Plumbed through three sites in cmd/up.go (SetConfigRequest, ConfigInput,
LoginRequest), persisted in profilemanager.Config, and added as new
fields on the daemon.proto IPC messages. Empty / not-changed flags fall
back to the server-pushed value (which itself falls back to the legacy
lazy_connection_enabled boolean for old servers).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EngineConfig gains ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds.
ConnMgr now stores the resolved Mode plus the raw inputs (env, config)
so it can re-resolve when the server pushes a new PeerConfig.

UpdatedRemoteFeatureFlag is renamed to UpdatedRemotePeerConfig and
takes the full PeerConfig pointer; a thin shim with the old name
delegates to it for callers that haven't been updated yet.

connect.go copies the three new fields from profilemanager.Config into
the EngineConfig builder, with a tolerant parser that logs and falls
through to Unspecified on invalid input.

Phase 1 of issue netbirdio#5989. peer/conn.go forwarding follows in C4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ConnConfig gains a Mode field forwarded from the engine. Open() now
checks Mode == ModeRelayForced instead of calling the global env-reader
IsForceRelayed(). The local 'forceRelay' variable name is renamed to
'skipICE' to make the new branching intent explicit.

The PeerStateUpdate block at the end of Open() also reads from
conn.config.Mode now, so the StatusRecorder sees the per-peer mode
rather than the global env var.

A single remaining caller of IsForceRelayed() (srWatcher.Start in
engine.go) is left for a follow-up; that path uses a process-wide flag
not per-peer state, so it can be migrated in Phase 2 once srWatcher
itself learns about ConnectionMode.

Phase 1 of issue netbirdio#5989. Engine forwarding (C5) follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
createPeerConn now reads ConnMgr.Mode() and copies it into
peer.ConnConfig, so the per-peer Open() loop in conn.go can take the
ModeRelayForced skip-ICE branch without reading the global env var.

This is the last wiring commit for the client side of Phase 1; the
server-side mgmt changes (Settings + OpenAPI + handler + audit +
NetworkMap-build) follow in Section D.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three fields are nullable to distinguish 'use built-in default'
(NULL) from explicit values (incl. 0 = never tear down). Copy() now
deep-clones the new pointer fields via two small helpers.

GORM AutoMigrate creates the new columns at first start; existing
accounts have NULL in all three columns and resolve via the legacy
LazyConnectionEnabled boolean.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tings

Three new optional, nullable fields with descriptions of the NULL =
built-in-default semantics and the Phase-1-vs-Phase-2 status of
p2p-dynamic. Regenerated types.gen.go via the existing oapi-codegen
tooling.

The generated AccountSettingsConnectionMode enum has the canonical
values relay-forced / p2p / p2p-lazy / p2p-dynamic, plus a Valid()
helper for handler-side validation.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PUT /api/accounts/{id} now accepts connection_mode (validated against
the four-value enum via the generated AccountSettingsConnectionMode.Valid()),
p2p_timeout_seconds and relay_timeout_seconds. NULL in the
JSON body keeps the existing value untouched (= "no client-side
override on this round-trip"); explicit NULL-clear via API uses a
distinct PATCH-style call which is out-of-scope for Phase 1.

Response payload mirrors the input fields back as nullable so the
dashboard can distinguish "use default" from "explicit value".

Phase 1 of issue netbirdio#5989. Audit-event emission follows in D5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AccountConnectionModeChanged (121), AccountRelayTimeoutChanged (122),
AccountP2pTimeoutChanged (123) -- emitted from account.go when settings
change. Per-peer / per-group event codes are reserved for Phase 3
(issue netbirdio#5990).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handleConnectionModeSettings is invoked from the same diff-detection
block as handleLazyConnectionSettings; emits one StoreEvent per
changed field (ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds)
with old/new values in the meta payload.

Four small ptr-equality / deref helpers are added for nullable string
and uint32 fields. They are package-private and named after the
existing convention used elsewhere in the package.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elds

Two changes in one commit because they're inseparable:

1. Move client/internal/peer/connectionmode/ to shared/connectionmode/.
   The package now needs to be importable from BOTH client/ and
   management/ (which is impossible while it lives under client/internal/
   per Go's internal-package rule). All imports updated; tests pass on
   both sides.

2. Extend management/internals/shared/grpc/conversion.go::toPeerConfig
   to populate the three new PeerConfig fields (ConnectionMode,
   P2PTimeoutSeconds, RelayTimeoutSeconds) using the connectionmode
   helpers. The legacy LazyConnectionEnabled boolean is now derived
   from the resolved Mode via ToLazyConnectionEnabled() rather than
   copied verbatim from Settings -- this is the central backwards-compat
   contract: old clients see only the boolean, new clients prefer the
   explicit enum and ignore the bool.

Resolution rules (Phase 1, account-wide only):
- Settings.ConnectionMode != nil and parses -> wins
- Otherwise -> ResolveLegacyLazyBool(LazyConnectionEnabled)
- timeouts: Settings.RelayTimeoutSeconds / P2pTimeoutSeconds when non-NULL,
  else 0 (= server has no preference; client uses built-in default)

Per-peer / per-group resolution comes in Phase 3 (netbirdio#5990).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
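The resolution rules listed above can be sketched as follows. The struct and helper names are invented for illustration (the real code lives in `toPeerConfig` and the `connectionmode` helpers); the rules themselves — explicit parseable mode wins, garbage falls through to the legacy boolean, NULL timeouts become 0 — are quoted from the commit message.

```go
package main

import "fmt"

// settings models the nullable account fields; nil = "use built-in default".
type settings struct {
	ConnectionMode        *string
	LazyConnectionEnabled bool
	RelayTimeoutSeconds   *uint32
}

// resolveMode sketches the Phase-1 rule: an explicit, parseable
// ConnectionMode wins; otherwise the legacy boolean decides.
func resolveMode(s settings) string {
	if s.ConnectionMode != nil {
		switch m := *s.ConnectionMode; m {
		case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic":
			return m
		}
		// garbage value: tolerant fallback to the legacy boolean
	}
	if s.LazyConnectionEnabled {
		return "p2p-lazy"
	}
	return "p2p"
}

// resolveTimeout: non-NULL wins; NULL means "server has no preference" (0),
// so the client uses its built-in default.
func resolveTimeout(p *uint32) uint32 {
	if p != nil {
		return *p
	}
	return 0
}

func main() {
	mode := "relay-forced"
	fmt.Println(resolveMode(settings{ConnectionMode: &mode})) // relay-forced
	bad := "bogus"
	fmt.Println(resolveMode(settings{ConnectionMode: &bad, LazyConnectionEnabled: true})) // p2p-lazy
	fmt.Println(resolveTimeout(nil)) // 0
}
```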
Nine sub-cases cover the Phase-1 resolution matrix from spec section 3:
- no settings -> default (P2P + lazy=false)
- legacy bool only -> mapped via ResolveLegacyLazyBool
- explicit ConnectionMode -> wins over the legacy bool
- timeouts propagate
- garbage ConnectionMode value -> tolerant fallback to legacy bool

Particular attention to the structural compat gap: relay-forced cannot
be expressed via the legacy boolean, so the wire field for old clients
is sent as false. Documented in the spec, asserted here.

Existing TestAccount_GetPeerNetworkMap remains green: existing test
peers have ConnectionMode=NULL in Settings, falls through to the legacy
ResolveLegacyLazyBool(false) -> ModeP2P -> wire bool false.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of netbirdio#5989 needs to dynamically detach the ICE listener at runtime
(when p2p-dynamic mode hits its ICE-inactivity threshold). Adds the
RemoveICEListener method and extends the Add/RemoveICEListener pair to
hold h.mu; Listen() now reads h.iceListener through readICEListener
under the same mutex, fixing a latent race that didn't matter while
AddICEListener was the only writer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of netbirdio#5989 needs the daemon to fire ICE-teardown and relay-
teardown at different idle thresholds. Adds NewManagerWithTwoTimers
plus iceInactiveChan / relayInactiveChan; the periodic check loop
fires each channel when its threshold elapses. Threshold == 0 disables
that channel (peer.lazy: iceTimeout=0; peer.dynamic: both>0; eager
modes don't register peers at all).

NewManager (Phase-1 entry point) becomes a thin wrapper that
delegates to newManager with iceTimeout=0, preserving p2p-lazy
semantics exactly. The Phase-1 InactivePeersChan now aliases the
new RelayInactiveChan so existing callers (engine.go p2p-lazy path)
continue to work unchanged.

Five sub-tests cover the timeout matrix:
- TwoTimers_OnlyICEFires (peer idle between thresholds)
- TwoTimers_BothFire (peer idle past both)
- TwoTimers_ICEDisabled (iceTimeout=0)
- TwoTimers_RelayDisabled (relayTimeout=0)
- TwoTimers_BothDisabled (both 0 = manager inert)

Plus Phase1_LazyEquivalence proves NewManager-call-site behaviour is
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-tier timeouts for Phase-2 p2p-dynamic. The deprecated
InactivityThreshold pointer field becomes a backwards-compat alias
that maps onto RelayInactivityThreshold when set, so Phase-1 callers
that still pass it (engine.go, p2p-lazy tests) keep working without
edits.

resolvedTimeouts() helper centralises the alias logic. NewManager
picks the appropriate inactivity-manager constructor based on the
resolved timeouts; legacy callers continue through the Phase-1
single-timer code path.

InactivityManager() getter exposes the inner manager so the
conn_mgr can subscribe to ICEInactiveChan / RelayInactiveChan
in the p2p-dynamic dispatch loop.

Phase 2 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AttachICE registers the ICE-offer listener on the handshaker after the
activity-detector observes traffic; emits a fresh offer so the remote
side learns we are now ICE-capable. DetachICE removes the listener and
calls workerICE.Close() so the ICE worker tears down without affecting
the relay tunnel.

Both methods are idempotent and use Handshaker.readICEListener() under
its mutex, avoiding the race with Handshaker.Listen() that the prior
direct field read had. Used in Phase 2 by the inactivity manager to
flip ICE on/off independently of the relay tunnel.

Refs netbirdio#5989 (Phase 2 / connection-mode-phase2).
Phase 2 of netbirdio#5989. ModeP2PDynamic now takes a third branch in Open():
workerICE is constructed eagerly (so the activity-trigger does not
have to wait for ICE setup) but the AddICEListener call is skipped.
The handshaker still has workerICE wired in via NewHandshaker, so
buildOfferAnswer can include local ICE credentials when the listener
is later attached by Conn.AttachICE() on activity-trigger.

Behavior matrix in Open():
- ModeRelayForced: no workerICE, no ICE listener.
- ModeP2P / ModeP2PLazy: workerICE created, listener registered eagerly
  (Phase-1 lazy-tunnel defer remains at conn_mgr level).
- ModeP2PDynamic: workerICE created, listener deferred to AttachICE.

Refs netbirdio#5989 (Phase 2).
Phase 2 of netbirdio#5989. Fixes the lazy/eager mismatch where a remote peer's
GO_IDLE was silently dropped whenever the local manager was not in
lazy mode, leaving the eager local end to immediately reconnect and
defeating the remote's lazy/dynamic intent.

DeactivatePeer now dispatches by locally-resolved connection mode:
  p2p-lazy   -> existing lazy-manager teardown (whole tunnel)
  p2p-dynamic-> DetachICEForPeer (ICE only, relay tunnel stays up)
  p2p / relay-forced / unspecified -> silent no-op (eager modes
    deliberately ignore GO_IDLE because they are always-on)

DetachICEForPeer is also the entry point used by the inactivity
manager when its iceTimeout fires (engine wiring follows in E1).

Lookup miss is treated as a no-op: a peer can be removed concurrently
with a GO_IDLE or inactivity-timer event.

Refs netbirdio#5989 (Phase 2).
Phase 2 of netbirdio#5989. ConnMgr.Start now also brings up the lazy/dynamic
manager when mode==ModeP2PDynamic. A new runDynamicInactivityLoop
goroutine reads from inactivity.Manager.ICEInactiveChan() and
RelayInactiveChan() and dispatches DetachICEForPeer or full
Conn.Close per peer.

initLazyManager populates Config.RelayInactivityThreshold from
e.relayTimeoutSecs and Config.ICEInactivityThreshold from
e.p2pTimeoutSecs (only in dynamic mode; in lazy mode iceTimeout
stays 0 so the ICE channel never fires, preserving Phase-1
semantics exactly).

UpdatedRemotePeerConfig generalized to handle the dynamic mode
the same way it handled lazy: enable the manager on entry, tear
down on exit, restart on lazy<->dynamic transitions so the new
timeouts take effect.

Refs netbirdio#5989 (Phase 2).
Phase-2 follow-up. The dynamic-mode iceTimeout was always 0 in the
field because resolveConnectionMode only resolved relay-timeout, not
p2p-timeout, and the manager-init only consulted e.p2pTimeoutSecs
which never got updated from server pushes.

Resolution chain for P2pTimeoutSeconds is now symmetric to relay:
  cfgP2pTimeout (client config) > serverPC.GetP2PTimeoutSeconds()
(no env var; reserved for Phase 3).

UpdatedRemotePeerConfig now also stores the resolved p2p-timeout and
includes it in the early-return short-circuit so changes to the
server-pushed value trigger a re-init.

Refs netbirdio#5989 (Phase 2).
Phase 2 of netbirdio#5989. Closes the activity-attach loop: ActivatePeer is
called by the engine when a remote peer signals it is active. In
p2p-lazy this opens the relay tunnel; in p2p-dynamic the relay was
already eager-opened, so we additionally call Conn.AttachICE() to
register the deferred ICE listener and emit a fresh offer. Without
this, p2p-dynamic peers would never upgrade from relay to ICE.

Refs netbirdio#5989 (Phase 2).
…5989)

Phase 3 introduces per-peer ICE-failure backoff. Server-pushed cap
on the truncated-exponential schedule lives in this new field 14.

Wire-format sentinels:
  0   -> "not set", daemon falls back to built-in 15 min default
  max -> user explicitly disabled backoff (Phase-2 behavior)
  N   -> N seconds as the MaxInterval of cenkalti/backoff

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. Account-wide cap on the ICE-failure backoff
schedule. Nullable: NULL = use daemon default (15 min), 0 = treated
as "user-explicit disable" by the conversion layer (signaled via
uint32-max sentinel on the wire). Plus deep-copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. New field documented in the public API. Defaults
to 900 seconds (15 min). 0 disables backoff. Regenerated types.gen.go
via oapi-codegen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. The PUT handler now persists the new account-wide
ICE-backoff cap, and the GET response surfaces it back. NULL stays
NULL (= "use daemon default"). Mirrors the Phase-1+2 plumbing of
relay_timeout_seconds and p2p_timeout_seconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. New activity-code 124 (AccountP2pRetryMaxChanged)
plus the StoreEvent call inside handleConnectionModeSettings, mirroring
the Phase-1 Relay-timeout and Phase-2 P2P-timeout patterns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apping

Phase 3 of netbirdio#5989. The conversion layer translates between DB-NULL
(= "use daemon default") and Settings-zero (= "user-explicit disable")
using a uint32-max sentinel on the wire. Test matrix covers all
three states: NULL, 0, and a normal positive value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. Per-peer ICE-failure backoff using cenkalti/backoff
v4 (already imported by Guard). InitialInterval 1m, Multiplier 2.0,
RandomizationFactor 0.1 -- produces ~1m, ~2m, ~4m, ~8m, ~15m capped
sequence with default cap.

Public API: newIceBackoff, markFailure (returns delay for logging),
markSuccess, Reset (alias), IsSuspended, Snapshot (read-only view for
status output), SetMaxBackoff (live update).

maxBackoff=0 disables the backoff entirely (Phase-2 behavior).

Test matrix covers initial state, exponential growth, cap override,
disabled mode, success-reset, hard-reset, suspended-expiry, and live
SetMaxBackoff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/peer/guard/guard.go (1)

138-147: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Synchronous callback may stall the reconnect loop

onNetworkChange() is called directly inside the select body. If the callback takes any non-trivial time — or tries to acquire a lock that the surrounding Conn code also holds during network-change processing — ctx.Done(), relayedConnDisconnected, and iCEConnDisconnected events are all frozen until it returns.

Per the PR description the callback is expected to "reset iceBackoff + recreate workerICE", which touches Conn's internal mutex and may spawn goroutines. Consider dispatching asynchronously to keep the loop responsive:

💡 Suggested fix
-		case <-srReconnectedChan:
-			g.log.Debugf("has network changes, reset reconnection ticker")
-			ticker.Stop()
-			ticker = g.newReconnectTicker(ctx)
-			tickerChannel = ticker.C
-			iceState.reset()
-			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
-			if g.onNetworkChange != nil {
-				g.onNetworkChange()
-			}
+		case <-srReconnectedChan:
+			g.log.Debugf("has network changes, reset reconnection ticker")
+			ticker.Stop()
+			ticker = g.newReconnectTicker(ctx)
+			tickerChannel = ticker.C
+			iceState.reset()
+			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+			// Fire in a goroutine so the reconnect loop stays responsive to ctx.Done()
+			// and other disconnect events while the callback runs.
+			if g.onNetworkChange != nil {
+				go g.onNetworkChange()
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/guard/guard.go` around lines 138 - 147, The select case
calling g.onNetworkChange() synchronously can block the reconnect loop; change
it to invoke the callback asynchronously (e.g., spawn a goroutine) so the select
can continue handling ctx.Done(), relayedConnDisconnected and
iCEConnDisconnected immediately. Locate the case handling srReconnectedChan in
guard.go (references: srReconnectedChan, ticker, g.newReconnectTicker,
iceState.reset, g.onNetworkChange) and replace the direct call to
g.onNetworkChange() with a non-blocking dispatch (start a goroutine, optionally
recover/panic-handle and/or swap to a bounded worker or channel if you must
limit concurrency) to avoid holding up the loop.
🧹 Nitpick comments (5)
management/server/types/settings_test.go (1)

5-20: ⚡ Quick win

Consider adding a ConnectionMode deep-copy test to cover cloneStringPtr

The test validates cloneUint32Ptr via P2pRetryMaxSeconds, but cloneStringPtr (used by ConnectionMode) is a separate code path with no test coverage. A mirror test would be low-effort:

🧪 Suggested addition
+func TestSettings_Copy_ConnectionMode(t *testing.T) {
+	mode := "p2p-dynamic"
+	src := &Settings{ConnectionMode: &mode}
+	dst := src.Copy()
+	if dst.ConnectionMode == nil {
+		t.Fatal("Copy lost ConnectionMode pointer")
+	}
+	if *dst.ConnectionMode != "p2p-dynamic" {
+		t.Fatalf("expected p2p-dynamic, got %s", *dst.ConnectionMode)
+	}
+	// Verify deep copy (different pointers).
+	*dst.ConnectionMode = "relay-forced"
+	if *src.ConnectionMode != "p2p-dynamic" {
+		t.Fatal("Copy did not deep-clone ConnectionMode")
+	}
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/types/settings_test.go` around lines 5 - 20, Add a new unit
test mirroring TestSettings_Copy_P2pRetryMaxSeconds to validate deep-copy
behavior for Settings.ConnectionMode: create a source Settings with
ConnectionMode set to a string pointer, call Settings.Copy(), assert
dst.ConnectionMode is non-nil and equals the original value, then mutate
*dst.ConnectionMode and assert *src.ConnectionMode remains unchanged to verify
cloneStringPtr produced a deep copy; reference the Settings struct, its Copy
method, and cloneStringPtr in the test.
shared/management/http/api/openapi.yml (1)

357-369: ⚡ Quick win

Mark legacy lazy_connection_enabled as deprecated and state precedence.

With both fields present, migration intent is unclear unless deprecation and precedence are explicit in the schema.

🧭 Suggested compatibility clarification
         lazy_connection_enabled:
           x-experimental: true
+          deprecated: true
           description: Enables or disables experimental lazy connection
           type: boolean
           example: true
         connection_mode:
           x-experimental: true
           type: string
           enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
           nullable: true
           description: |
             Account-wide default peer-connection mode. NULL means
             "fall back to lazy_connection_enabled" for backwards compatibility.
+            When non-NULL, `connection_mode` takes precedence.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 357 - 369, Mark the
legacy boolean field lazy_connection_enabled as deprecated and make precedence
explicit: add deprecated: true to the lazy_connection_enabled schema and update
its description to indicate it is legacy and overridden by connection_mode when
connection_mode is non-null; update the connection_mode description to
explicitly state that a non-null connection_mode takes precedence over
lazy_connection_enabled and that NULL means "fall back to
lazy_connection_enabled" for backwards compatibility (use the existing keys
lazy_connection_enabled and connection_mode to locate the schema).
shared/connectionmode/mode_test.go (1)

77-94: 💤 Low value

ModeFollowServer missing from TestToLazyConnectionEnabled.

The table omits ModeFollowServer, so its ToLazyConnectionEnabled() return value is never asserted. If its behaviour changes it won't be caught. Consider adding a case (expected value is almost certainly false, matching ModeUnspecified).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/connectionmode/mode_test.go` around lines 77 - 94, The test
TestToLazyConnectionEnabled omits ModeFollowServer so its
ToLazyConnectionEnabled() behavior isn't asserted; add a test case entry in the
cases table for ModeFollowServer with the expected boolean (likely false,
matching ModeUnspecified) so the call to Mode.ToLazyConnectionEnabled() for
ModeFollowServer is validated like the other entries.
client/internal/peer/status.go (1)

367-382: 💤 Low value

UpdatePeerIceBackoff silently ignores unknown peer keys — inconsistent with other update methods.

Every other UpdatePeer* method (e.g., UpdatePeerState, UpdateWireGuardPeerState, UpdatePeerFQDN) returns an error when the peer is not found. UpdatePeerIceBackoff returns nothing and silently no-ops. While a silent no-op is safe here (the peer may have been removed between the ICE failure and this call), the divergence from the established pattern makes it harder to detect misconfigured call sites during development. Consider returning an error for consistency, or add a brief comment explaining why the silent no-op is intentional.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 367 - 382, UpdatePeerIceBackoff
currently silently no-ops when the peer key is missing, diverging from other
update methods like UpdatePeerState / UpdateWireGuardPeerState / UpdatePeerFQDN
which return an error; change UpdatePeerIceBackoff(pubKey string, snap
BackoffSnapshot) to return an error, lock/unlock as before, and when peer not
found return a descriptive error (e.g., fmt.Errorf("peer %s not found",
pubKey)); update callers to handle the returned error, or if you intentionally
want the silent no-op instead, add a short comment above UpdatePeerIceBackoff
explaining why ignoring missing peers is acceptable to avoid surprise.
client/internal/peer/conn_test.go (1)

374-384: 💤 Low value

Test assertion is tied to an internal error string.

Line 381 checks strings.Contains(errMsg, "workerICE") — if the error message in AttachICE is ever reworded, this assertion fails for the wrong reason. Consider asserting on a sentinel error value (errors.Is) instead of a substring, or at minimum add a comment acknowledging the coupling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_test.go` around lines 374 - 384, The test currently
asserts on the error string from c.AttachICE() by checking for "workerICE",
which is brittle; change the implementation of AttachICE to return or wrap a
sentinel error (e.g., ErrNilWorkerICE) and update the test to use errors.Is(err,
ErrNilWorkerICE) instead of substring matching; if adding a sentinel isn't
possible right now, at minimum replace the substring check with errors.Is
against an existing exported sentinel or add a clear comment documenting the
coupling. Ensure AttachICE (and any callers) wrap the sentinel with
fmt.Errorf("%w: ...", ErrNilWorkerICE) so errors.Is will work reliably.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/android/preferences.go`:
- Around line 347-352: Validate negative timeout inputs before casting to
uint32: in SetRelayTimeoutSeconds (and the other two timeout setters around the
same area) check if the incoming secs < 0 and, if so, clear the override (set
the corresponding p.configInput.*TimeoutSeconds = nil) instead of casting;
otherwise cast to uint32 and assign as before. Ensure you update all three
setters (the SetRelayTimeoutSeconds function and the two other timeout setter
functions referenced) with the same guard.

In `@client/cmd/up.go`:
- Around line 442-453: The SetConfig handler in server.go is ignoring the
incoming optional fields (req.ConnectionMode, req.RelayTimeoutSeconds,
req.P2PTimeoutSeconds, req.P2PRetryMaxSeconds) so daemon-mode client overrides
are dropped; add four nil-checks in server.go's SetConfig to apply each field
when non-nil (e.g., if req.ConnectionMode != nil { serverConfig.ConnectionMode =
*req.ConnectionMode }) and persist/update the server configuration/store
accordingly; do the same for Login's handler to read and persist the same four
fields from the incoming request (setupSetConfigReq / setupLoginRequest populate
them) so the client flags are honored.

In `@client/internal/conn_mgr.go`:
- Around line 277-320: The code updates ConnMgr.relayTimeoutSecs, p2pTimeoutSecs
and p2pRetryMaxSecs before deciding whether to restart the existing lazy
manager, so changes to timeouts inside the same managed mode are ignored; fix by
capturing whether the timeout values actually changed (e.g. timeoutsChanged :=
newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs || newP2pRetry !=
e.p2pRetryMaxSecs) before overwriting the fields, then after setting e.mode and
the timeout fields, if isManaged && e.lazyConnMgr != nil && timeoutsChanged call
e.closeManager(ctx), e.statusRecorder.UpdateLazyConnection(false), and then
re-init via e.initLazyManager(ctx) followed by e.startModeSideEffects() and
e.addPeersToLazyConnManager(); keep existing logic for mode flips using
modeUsesLazyMgr, and ensure propagateP2pRetryMaxToConns() is still called after
updating the retry field.
- Around line 176-188: When Rosenpass is enabled we must not leave e.mode set to
a lazy/dynamic variant while lazyConnMgr is not started; change ConnMgr.Start so
that if e.rosenpassEnabled is true you replace/downgrade e.mode to a non-lazy
equivalent (e.g., convert p2p-lazy/p2p-dynamic -> their non-lazy counterpart)
before returning, and also update UpdatedRemotePeerConfig to detect when newMode
is a lazy/dynamic variant while e.rosenpassEnabled is true and store the
downgraded non-lazy mode instead of newMode; reference ConnMgr.Start,
e.rosenpassEnabled, e.mode, lazyConnMgr, UpdatedRemotePeerConfig, newMode and
ActivatePeer when making this change.

In `@client/internal/lazyconn/inactivity/manager.go`:
- Around line 233-239: The scan can enqueue the same peer into both iceIdle and
relayIdle, causing races in the consumer; modify the manager scan so that when a
peer meets relayTimeout it is added to relayIdle and then skipped for ICE
notification (i.e., check relayIdle before adding to iceIdle), or alternatively
collapse to a single ordered notification list processed in priority order;
specifically update the block that uses m.relayTimeout and m.iceTimeout
(variables shown as relayIdle, iceIdle, checkTime, since) so relayIdle is
evaluated/assigned first and iceIdle insertion is gated on relayIdle not
containing peerID (or replace both maps with a single ordered notification for
the consumer).
- Around line 76-87: NewManagerWithTwoTimers currently accepts negative or
sub-minute durations; validate and clamp iceTimeout and relayTimeout the same as
the legacy constructor by enforcing MinimumInactivityThreshold (and preventing
negatives) before calling newManager. In NewManagerWithTwoTimers, if iceTimeout
or relayTimeout is less than MinimumInactivityThreshold (or negative), set them
to MinimumInactivityThreshold, log the adjusted values for visibility (e.g.,
"adjusted ICE/relay inactivity timeout"), and then pass the sanitized timeouts
into newManager so both constructors behave identically.

In `@client/internal/lazyconn/manager/manager.go`:
- Around line 103-109: The code treats iceTO==0 && relayTO==0 as a legacy "not
migrated" case; change resolvedTimeouts() to also return a migration flag (e.g.,
resolvedTimeouts() -> (iceTO, relayTO, migrated bool)) and then update the
branch in manager.go to: if !migrated { keep legacy single-timer behavior via
inactivity.NewManager(wgIface, config.InactivityThreshold) } else if iceTO == 0
&& relayTO == 0 { explicitly disable inactivity (e.g., set m.inactivityManager
to a disabled/no-op manager or nil) } else { use
inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) } so intentional 0/0
does not fall back to the legacy manager; refer to resolvedTimeouts, iceTO,
relayTO, m.inactivityManager, NewManager and NewManagerWithTwoTimers when making
the change.

In `@client/internal/peer/conn.go`:
- Around line 202-212: The code currently conflates three states (disabled
sentinel, explicit 0 for "use default", and a literal non-zero seconds) so
Open() loses a prior "disabled backoff" sentinel; update the Open()/backoff init
logic and SetIceBackoffMax handling to preserve the explicit disable sentinel:
when reading conn.config.P2pRetryMaxSeconds in the backoff initialization
(backoffCap calculation and calls to newIceBackoff/SetMaxBackoff) first test for
the special disable sentinel value (e.g. ^uint32(0) or whatever sentinel you
use) and propagate that as a disabled state to conn.iceBackoff (e.g. leave
conn.iceBackoff as disabled or call a Disable/SetMaxBackoffDisabled path)
instead of treating it as a huge duration; likewise, make SetIceBackoffMax
distinguish between setting the disable sentinel, explicit 0 (meaning use
default cap), and non-zero seconds so it does not rewrite a disable sentinel
into 0 and thereby lose the disabled state across Close/Open cycles.

In `@client/internal/peer/env_test.go`:
- Line 25: Update the test case name string literal "connection_mode unparseable
falls through to legacy" to use the preferred spelling "unparsable" (i.e. change
"unparseable" → "unparsable") in the table entry currently containing
{"connection_mode unparseable falls through to legacy", "garbage", "true", "",
"", connectionmode.ModeRelayForced, 0} so that the test identifier matches
codespell expectations.

In `@client/internal/peer/status.go`:
- Around line 1372-1374: The IceBackoffNextRetry field is being set with
timestamppb.New even when peerState.IceBackoffNextRetry is the zero time, which
produces an invalid past proto timestamp; change the assignment that sets
IceBackoffNextRetry (the struct population using peerState) to guard with
peerState.IceBackoffNextRetry.IsZero() and only call timestamppb.New(...) when
non-zero, otherwise set the proto field to nil so consumers see an unset
timestamp; locate the code that sets IceBackoffNextRetry alongside
IceBackoffFailures/IceBackoffSuspended in status.go and apply this conditional.

In `@client/internal/peer/worker_ice.go`:
- Around line 218-231: The read/write of lastKnownState races because
IsConnected() reads it under muxAgent while onConnectionStateChange writes it
without locking; change lastKnownState to an atomic.Int32 (or int32 accessed via
sync/atomic) and replace writes in onConnectionStateChange with
atomic.StoreInt32(&w.lastKnownState, int32(ice.ConnectionStateX)) and the read
in IsConnected() with atomic.LoadInt32(&w.lastKnownState) (while still using
muxAgent to check w.agent != nil), or alternatively guard the writes in
onConnectionStateChange with muxAgent; update all places that set/read
lastKnownState to use the chosen atomic or lock-based approach to eliminate the
data race.

In `@client/server/server.go`:
- Around line 1518-1538: The comment claiming "all zero/empty when the engine
has not received PeerConfig" is incorrect because
ConnMgr.ServerPushedP2pRetryMaxSecs() substitutes a built-in default, so
spP2pRetMax will be non‑zero; fix this by either updating the comment to reflect
the actual semantics or (preferred) changing the getter behavior: modify
ConnMgr.ServerPushedP2pRetryMaxSecs() (in conn_mgr.go) to return 0 when the
stored peer-config value is unset instead of substituting the default, and keep
default application logic elsewhere (UI or a helper), so the code in server.go
that reads spP2pRetMax from cm.ServerPushedP2pRetryMaxSecs() can reliably use 0
as the "not yet received" sentinel.

In `@client/ui/client_ui.go`:
- Around line 656-666: parseUint32Field currently coerces empty, non-numeric,
and overflowed input to 0 which lets buildSetConfigRequest silently clear
overrides; change parseUint32Field to return (uint32, error) (or (uint32, bool)
if you prefer) and treat empty input as a distinct "no value" case rather than
returning 0, return a parse/overflow error for invalid inputs, and then update
buildSetConfigRequest (and the other caller around lines 741-750) to
propagate/handle the error: reject the request or show validation feedback
instead of sending 0 to the daemon. Ensure you validate numeric range (32-bit)
and non-digit characters in parseUint32Field and adjust all call sites to handle
the new return signature.

In `@management/internals/shared/grpc/conversion.go`:
- Around line 115-119: When parsing Settings.ConnectionMode inside the if
settings.ConnectionMode != nil block use
connectionmode.ParseString(*settings.ConnectionMode) as before but if it returns
an error emit a one-line warning indicating the bad value so admin typos are
visible; update the code around connectionmode.ParseString to log a warn (e.g.,
logger.Warnf or the package logger) with the invalid string and the parse error
when err != nil, otherwise keep the existing behavior of assigning resolvedMode
when m != connectionmode.ModeUnspecified.
- Around line 26-29: The initial peer connection path doesn't translate the
wire-format sentinel (p2pRetryMaxDisabledSentinel) to zero, so Conn.Open()
computes backoffCap from conn.config.P2pRetryMaxSeconds and treats the sentinel
as a huge duration; update Conn.Open() to detect if
conn.config.P2pRetryMaxSeconds equals p2pRetryMaxDisabledSentinel and treat it
as 0 (or alternatively perform the same sentinel→0 conversion earlier in
resolveConnectionMode() / engine.createPeerConnConfig() /
ConnMgr.P2pRetryMax()), then compute backoffCap :=
time.Duration(uint32ValueConvertedToZeroIfSentinel) * time.Second so sentinel
yields duration 0.

In `@management/server/account.go`:
- Around line 372-375: The change handlers currently only audit connection-mode
related updates but do not push new runtime config to live peers; update the
caller to trigger am.updateAccountPeers when any of ConnectionMode,
RelayTimeoutSeconds, P2pTimeoutSeconds, or P2pRetryMaxSeconds change.
Concretely: modify handleConnectionModeSettings (or have the handlers return a
bool) so the outer code that calls am.handleConnectionModeSettings(...) can
detect a change and then call am.updateAccountPeers(ctx, userID, accountID) (or
the existing updateAccountPeers signature) after the settings are saved/audited;
ensure the check compares oldSettings.ConnectionMode,
oldSettings.RelayTimeoutSeconds, oldSettings.P2pTimeoutSeconds, and
oldSettings.P2pRetryMaxSeconds against the newSettings versions and only calls
updateAccountPeers when any differ.

In `@management/server/http/handlers/accounts/accounts_handler.go`:
- Around line 229-240: The three timeout fields (req.Settings.P2pTimeoutSeconds,
P2pRetryMaxSeconds, RelayTimeoutSeconds) are being cast directly to uint32 which
allows negative JSON numbers to wrap to large unsigned values; update the
validation before casting in the handler that populates returnSettings: check
each pointer is non-nil and *value >= 0, and if a negative value is detected
return a validation/error response (or handle it per existing validation flow)
instead of casting; then safely convert to uint32 (e.g., v :=
uint32(*req.Settings.P2pTimeoutSeconds)) and assign to
returnSettings.P2pTimeoutSeconds, P2pRetryMaxSeconds, and RelayTimeoutSeconds
respectively.

In `@shared/management/http/api/openapi.yml`:
- Around line 367-372: The description for the "connection_mode" enum is
outdated and still states that "p2p-dynamic" is reserved/passes through; update
the OpenAPI description for the connection_mode field so it reflects the current
phase behavior (remove the claim that p2p-dynamic is reserved/pass-through) and
clearly document which modes are currently functional (relay-forced, p2p,
p2p-lazy) and how NULL interacts with lazy_connection_enabled for backwards
compatibility; edit the description text near the connection_mode property in
shared/management/http/api/openapi.yml and mention the accurate phase semantics
for p2p-dynamic (omit or mark as not supported rather than pass-through) and
ensure references to lazy_connection_enabled and Phase 1/Phase 2 are consistent.

---

Outside diff comments:
In `@client/internal/peer/guard/guard.go`:
- Around line 138-147: The select case calling g.onNetworkChange() synchronously
can block the reconnect loop; change it to invoke the callback asynchronously
(e.g., spawn a goroutine) so the select can continue handling ctx.Done(),
relayedConnDisconnected and iCEConnDisconnected immediately. Locate the case
handling srReconnectedChan in guard.go (references: srReconnectedChan, ticker,
g.newReconnectTicker, iceState.reset, g.onNetworkChange) and replace the direct
call to g.onNetworkChange() with a non-blocking dispatch (start a goroutine,
optionally recover/panic-handle and/or swap to a bounded worker or channel if
you must limit concurrency) to avoid holding up the loop.

---

Nitpick comments:
In `@client/internal/peer/conn_test.go`:
- Around line 374-384: The test currently asserts on the error string from
c.AttachICE() by checking for "workerICE", which is brittle; change the
implementation of AttachICE to return or wrap a sentinel error (e.g.,
ErrNilWorkerICE) and update the test to use errors.Is(err, ErrNilWorkerICE)
instead of substring matching; if adding a sentinel isn't possible right now, at
minimum replace the substring check with errors.Is against an existing exported
sentinel or add a clear comment documenting the coupling. Ensure AttachICE (and
any callers) wrap the sentinel with fmt.Errorf("%w: ...", ErrNilWorkerICE) so
errors.Is will work reliably.

In `@client/internal/peer/status.go`:
- Around line 367-382: UpdatePeerIceBackoff currently silently no-ops when the
peer key is missing, diverging from other update methods like UpdatePeerState /
UpdateWireGuardPeerState / UpdatePeerFQDN which return an error; change
UpdatePeerIceBackoff(pubKey string, snap BackoffSnapshot) to return an error,
lock/unlock as before, and when peer not found return a descriptive error (e.g.,
fmt.Errorf("peer %s not found", pubKey)); update callers to handle the returned
error, or if you intentionally want the silent no-op instead, add a short
comment above UpdatePeerIceBackoff explaining why ignoring missing peers is
acceptable to avoid surprise.

In `@management/server/types/settings_test.go`:
- Around line 5-20: Add a new unit test mirroring
TestSettings_Copy_P2pRetryMaxSeconds to validate deep-copy behavior for
Settings.ConnectionMode: create a source Settings with ConnectionMode set to a
string pointer, call Settings.Copy(), assert dst.ConnectionMode is non-nil and
equals the original value, then mutate *dst.ConnectionMode and assert
*src.ConnectionMode remains unchanged to verify cloneStringPtr produced a deep
copy; reference the Settings struct, its Copy method, and cloneStringPtr in the
test.

In `@shared/connectionmode/mode_test.go`:
- Around line 77-94: The test TestToLazyConnectionEnabled omits ModeFollowServer
so its ToLazyConnectionEnabled() behavior isn't asserted; add a test case entry
in the cases table for ModeFollowServer with the expected boolean (likely false,
matching ModeUnspecified) so the call to Mode.ToLazyConnectionEnabled() for
ModeFollowServer is validated like the other entries.

In `@shared/management/http/api/openapi.yml`:
- Around line 357-369: Mark the legacy boolean field lazy_connection_enabled as
deprecated and make precedence explicit: add deprecated: true to the
lazy_connection_enabled schema and update its description to indicate it is
legacy and overridden by connection_mode when connection_mode is non-null;
update the connection_mode description to explicitly state that a non-null
connection_mode takes precedence over lazy_connection_enabled and that NULL
means "fall back to lazy_connection_enabled" for backwards compatibility (use
the existing keys lazy_connection_enabled and connection_mode to locate the
schema).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f6253fba-a00e-49ab-b17b-28539d6a1fa2

📥 Commits

Reviewing files that changed from the base of the PR and between b19b746 and 2b6f78b.

⛔ Files ignored due to path filters (4)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • client/proto/daemon_grpc.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/proxy_service.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (46)
  • client/android/client.go
  • client/android/preferences.go
  • client/cmd/root.go
  • client/cmd/service.go
  • client/cmd/service_installer.go
  • client/cmd/up.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/connect.go
  • client/internal/engine.go
  • client/internal/lazyconn/env.go
  • client/internal/lazyconn/inactivity/manager.go
  • client/internal/lazyconn/inactivity/manager_test.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/peer/conn.go
  • client/internal/peer/conn_test.go
  • client/internal/peer/env.go
  • client/internal/peer/env_test.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/handshaker_test.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/status.go
  • client/internal/peer/worker_ice.go
  • client/internal/profilemanager/config.go
  • client/proto/daemon.proto
  • client/server/server.go
  • client/status/status.go
  • client/status/status_test.go
  • client/ui/client_ui.go
  • client/ui/const.go
  • client/ui/event_handler.go
  • management/internals/shared/grpc/conversion.go
  • management/internals/shared/grpc/conversion_test.go
  • management/server/account.go
  • management/server/activity/codes.go
  • management/server/http/handlers/accounts/accounts_handler.go
  • management/server/http/handlers/accounts/accounts_handler_test.go
  • management/server/types/settings.go
  • management/server/types/settings_test.go
  • shared/connectionmode/mode.go
  • shared/connectionmode/mode_test.go
  • shared/management/http/api/openapi.yml
  • shared/management/http/api/types.gen.go
  • shared/management/proto/management.proto
💤 Files with no reviewable changes (2)
  • client/ui/event_handler.go
  • client/ui/const.go

Comment on lines +347 to +352
// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
v := uint32(secs)
p.configInput.RelayTimeoutSeconds = &v
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard negative timeout inputs before casting to uint32.

These setters currently turn negative Android values into huge positive timeouts (-1 becomes 4294967295), which silently corrupts the persisted override instead of clearing/rejecting it. Clamp or reject < 0 before the cast for all three setters.

🛠️ Suggested fix
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 368-373, 388-393

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 347 - 352, Validate negative
timeout inputs before casting to uint32: in SetRelayTimeoutSeconds (and the
other two timeout setters around the same area) check if the incoming secs < 0
and, if so, clear the override (set the corresponding
p.configInput.*TimeoutSeconds = nil) instead of casting; otherwise cast to
uint32 and assign as before. Ensure you update all three setters (the
SetRelayTimeoutSeconds function and the two other timeout setter functions
referenced) with the same guard.

Comment thread client/cmd/up.go
Comment on lines +442 to +453
if cmd.Flag(connectionModeFlag).Changed {
req.ConnectionMode = &connectionMode
}
if cmd.Flag(relayTimeoutFlag).Changed {
req.RelayTimeoutSeconds = &relayTimeoutSecs
}
if cmd.Flag(p2pTimeoutFlag).Changed {
req.P2PTimeoutSeconds = &p2pTimeoutSecs
}
if cmd.Flag(p2pRetryMaxFlag).Changed {
req.P2PRetryMaxSeconds = &p2pRetryMaxSecs
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

New connection-mode flags are silently dropped in daemon mode.

setupSetConfigReq populates req.ConnectionMode, req.RelayTimeoutSeconds, req.P2PTimeoutSeconds, and req.P2PRetryMaxSeconds, but server.go's SetConfig handler never reads or persists these four fields from the incoming *proto.SetConfigRequest. In daemon mode the client.SetConfig call at runInDaemonMode succeeds with no error, but the values are silently discarded — the user's --connection-mode override never takes effect.

The same applies to setupLoginRequest (lines 699-710): the populated fields are carried over the wire but server.go's Login handler does not persist them either.

The fix is to extract and apply these fields inside server.go's SetConfig:

🐛 Proposed fix in server.go `SetConfig`
 	if msg.Mtu != nil {
 		mtu := uint16(*msg.Mtu)
 		config.MTU = &mtu
 	}
+
+	if msg.ConnectionMode != nil {
+		config.ConnectionMode = *msg.ConnectionMode
+	}
+	if msg.RelayTimeoutSeconds != nil {
+		config.RelayTimeoutSeconds = *msg.RelayTimeoutSeconds
+	}
+	if msg.P2PTimeoutSeconds != nil {
+		config.P2pTimeoutSeconds = *msg.P2PTimeoutSeconds
+	}
+	if msg.P2PRetryMaxSeconds != nil {
+		config.P2pRetryMaxSeconds = *msg.P2PRetryMaxSeconds
+	}

 	if _, err := profilemanager.UpdateConfig(config); err != nil {

This is not a high-effort change: the fix is four simple nil-checks in one location.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/cmd/up.go` around lines 442 - 453, The SetConfig handler in server.go
is ignoring the incoming optional fields (req.ConnectionMode,
req.RelayTimeoutSeconds, req.P2PTimeoutSeconds, req.P2PRetryMaxSeconds) so
daemon-mode client overrides are dropped; add four nil-checks in server.go's
SetConfig to apply each field when non-nil (e.g., if req.ConnectionMode != nil {
serverConfig.ConnectionMode = *req.ConnectionMode }) and persist/update the
server configuration/store accordingly; do the same for Login's handler to read
and persist the same four fields from the incoming request (setupSetConfigReq /
setupLoginRequest populate them) so the client flags are honored.

Comment thread client/internal/conn_mgr.go
Comment on lines +277 to +320
newMode, newRelay, newP2P, newP2pRetry := resolveConnectionMode(
e.envMode, e.envRelayTimeout, e.cfgMode, e.cfgRelayTimeout,
e.cfgP2pTimeout, e.cfgP2pRetryMax, pc,
)

if newMode == e.mode && newRelay == e.relayTimeoutSecs &&
newP2P == e.p2pTimeoutSecs && newP2pRetry == e.p2pRetryMaxSecs {
return nil
}
prev := e.mode
e.mode = newMode
e.relayTimeoutSecs = newRelay
e.p2pTimeoutSecs = newP2P
e.p2pRetryMaxSecs = newP2pRetry
e.propagateP2pRetryMaxToConns()

wasManaged := modeUsesLazyMgr(prev)
isManaged := modeUsesLazyMgr(newMode)
modeChanged := prev != newMode

if modeChanged && wasManaged && !isManaged {
log.Infof("lazy/dynamic connection manager disabled by management push (mode=%s)", newMode)
e.closeManager(ctx)
e.statusRecorder.UpdateLazyConnection(false)
return nil
}

if modeChanged && wasManaged && isManaged {
// Switching between lazy and dynamic at runtime: tear down the
// existing manager so initLazyManager picks up the new timeouts.
log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
e.closeManager(ctx)
e.statusRecorder.UpdateLazyConnection(false)
}

if isManaged && e.lazyConnMgr == nil {
if e.rosenpassEnabled {
log.Warnf("rosenpass enabled, ignoring lazy/dynamic mode push")
return nil
}
log.Infof("lazy/dynamic connection manager enabled by management push (mode=%s)", newMode)
e.initLazyManager(ctx)
e.startModeSideEffects()
return e.addPeersToLazyConnManager()

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart the lazy manager when timeouts change inside the same managed mode.

relayTimeoutSecs and p2pTimeoutSecs are only consumed in initLazyManager(). Right now a server push from, say, p2p-dynamic(60/30) to p2p-dynamic(10/5) updates the fields on ConnMgr but keeps the existing manager instance alive, so the inactivity thresholds never actually change until a full restart or a mode flip.

Suggested fix
-	if modeChanged && wasManaged && isManaged {
+	timeoutsChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
+	if wasManaged && isManaged && (modeChanged || timeoutsChanged) {
 		// Switching between lazy and dynamic at runtime: tear down the
 		// existing manager so initLazyManager picks up the new timeouts.
-		log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
+		log.Infof("restarting lazy/dynamic manager to apply updated mode/timeouts (%s -> %s)", prev, newMode)
 		e.closeManager(ctx)
 		e.statusRecorder.UpdateLazyConnection(false)
 	}

Capture timeoutsChanged before overwriting the old timeout fields so the comparison is against the previous manager config.

Also applies to: 553-563

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 277 - 320, The code updates
ConnMgr.relayTimeoutSecs, p2pTimeoutSecs and p2pRetryMaxSecs before deciding
whether to restart the existing lazy manager, so changes to timeouts inside the
same managed mode are ignored; fix by capturing whether the timeout values
actually changed (e.g. timeoutsChanged := newRelay != e.relayTimeoutSecs ||
newP2P != e.p2pTimeoutSecs || newP2pRetry != e.p2pRetryMaxSecs) before
overwriting the fields, then after setting e.mode and the timeout fields, if
isManaged && e.lazyConnMgr != nil && timeoutsChanged call e.closeManager(ctx),
e.statusRecorder.UpdateLazyConnection(false), and then re-init via
e.initLazyManager(ctx) followed by e.startModeSideEffects() and
e.addPeersToLazyConnManager(); keep existing logic for mode flips using
modeUsesLazyMgr, and ensure propagateP2pRetryMaxToConns() is still called after
updating the retry field.

Comment on lines +76 to +87
// NewManagerWithTwoTimers is the Phase-2 constructor. Pass 0 for either
// timeout to disable that teardown path. Both 0 leaves the manager
// running but inert (no channel ever fires) -- used by p2p / relay-forced
// modes that don't tear down workers.
func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
if iceTimeout > 0 {
log.Infof("ICE inactivity timeout: %v", iceTimeout)
}
if relayTimeout > 0 {
log.Infof("relay inactivity timeout: %v", relayTimeout)
}
return newManager(iface, iceTimeout, relayTimeout)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate two-timer thresholds the same way the legacy constructor does.

NewManagerWithTwoTimers bypasses MinimumInactivityThreshold, so negative or sub-minute values are accepted on the new path. With a 1-minute polling loop, small values are misleading and negative values effectively mark every peer idle on the first tick.

🛠️ Suggested fix
 func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
+	validateOptional := func(name string, v time.Duration) time.Duration {
+		if v == 0 {
+			return 0
+		}
+		if v < MinimumInactivityThreshold {
+			log.Warnf("%s inactivity timeout %v is too low, using minimum %v", name, v, MinimumInactivityThreshold)
+			return MinimumInactivityThreshold
+		}
+		return v
+	}
+
+	iceTimeout = validateOptional("ICE", iceTimeout)
+	relayTimeout = validateOptional("relay", relayTimeout)
+
 	if iceTimeout > 0 {
 		log.Infof("ICE inactivity timeout: %v", iceTimeout)
 	}
 	if relayTimeout > 0 {
File: `client/internal/lazyconn/inactivity/manager.go`, lines 76-87.

Comment on lines +26 to +29
// p2pRetryMaxDisabledSentinel is the wire-format value that signals
// "user-explicit disable backoff" (uint32-max). The 0 wire-value is
// reserved for "not set, use daemon default". Phase 3 of #5989.
const p2pRetryMaxDisabledSentinel = ^uint32(0)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate where the wire P2PRetryMaxSeconds is consumed on the daemon side and
# check that both 0 and ^uint32(0) are handled.
rg -nP --type=go -C5 'P2PRetryMaxSeconds|P2pRetryMaxSeconds'
echo "---"
# Look for the sentinel constant or any uint32-max comparison in client code.
rg -nP --type=go -C3 '\^uint32\(0\)|math\.MaxUint32|0xffffffff'

Repository: netbirdio/netbird

Length of output: 50373


🏁 Script executed:

# Search for code that interprets the sentinel or handles uint32-max specially
rg -nP --type=go 'SetMaxBackoff|SetIceBackoffMax' -A 5 -B 2
echo "---"
# Search in engine or relevant client files for sentinel detection
rg -nP --type=go -C 3 'uint32.*max|math\.MaxUint32|\^uint32' client/internal/
echo "---"
# Look for where P2pRetryMaxSeconds is converted from wire format to time.Duration
rg -nP --type=go 'time\.Duration.*P2pRetryMax|P2pRetryMax.*time\.Duration' -A 2 -B 2

Repository: netbirdio/netbird

Length of output: 7903


🏁 Script executed:

sed -n '577,594p' client/internal/conn_mgr.go
echo "---"
# Also check engine.go for sentinel handling
rg -nP --type=go 'sentinelDisabled|uint32.*0\)|disable.*backoff' client/internal/engine.go -A 3 -B 3

Repository: netbirdio/netbird

Length of output: 543


🏁 Script executed:

# Find where P2pRetryMaxSeconds is initially set in engine config
rg -nP --type=go 'P2pRetryMaxSeconds:' client/internal/engine.go -B 5 -A 2
echo "---"
# Check resolveConnectionMode for sentinel handling  
sed -n '109,170p' client/internal/conn_mgr.go
echo "---"
# Look for any conversion before the value reaches Conn.config
rg -nP --type=go 'connConfig.*P2p|PeerConnConfig.*P2p' client/internal/engine.go -B 2 -A 2

Repository: netbirdio/netbird

Length of output: 2418


Verify where the wire-format sentinel (^uint32(0)) is converted to disable-backoff (duration 0) in the initial peer-connection path.

The server-side conversion (nil→0, &0→sentinel, &N→N) is correct, and propagateP2pRetryMaxToConns() properly handles sentinel→0 for subsequent server updates. However, the initial Conn.Open() path at peer/conn.go:204 lacks sentinel detection: backoffCap := time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second will interpret sentinel as ~136 years instead of zero. The comment at peer/conn.go:98–99 claims "the resolver translates to time.Duration(0) at engine.go" but no such conversion occurs in resolveConnectionMode() or engine.createPeerConnConfig().

Either add sentinel-to-zero conversion before backoffCap is computed in Conn.Open(), or ensure the sentinel is translated in the initial config-resolution chain (e.g., in resolveConnectionMode() or ConnMgr.P2pRetryMax()).

File: `management/internals/shared/grpc/conversion.go`, lines 26-29.

Comment on lines +115 to +119
if settings.ConnectionMode != nil {
if m, err := connectionmode.ParseString(*settings.ConnectionMode); err == nil && m != connectionmode.ModeUnspecified {
resolvedMode = m
}
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Log when Settings.ConnectionMode fails to parse so admin typos are surfaced.

Right now an invalid value (say "p2p-typoed") is silently dropped — the function falls back to the legacy bool with no signal at all. An admin who mistypes the value will see the "old" behavior with no clue why. A one-line warn log on the parse error is cheap and matches what the env-resolver in client/internal/peer/env.go already does for EnvKeyNBConnectionMode.

📝 Suggested change
 	resolvedMode := connectionmode.ResolveLegacyLazyBool(settings.LazyConnectionEnabled)
 	if settings.ConnectionMode != nil {
-		if m, err := connectionmode.ParseString(*settings.ConnectionMode); err == nil && m != connectionmode.ModeUnspecified {
+		m, err := connectionmode.ParseString(*settings.ConnectionMode)
+		if err != nil {
+			log.Warnf("ignoring unknown Settings.ConnectionMode %q: %v", *settings.ConnectionMode, err)
+		} else if m != connectionmode.ModeUnspecified {
 			resolvedMode = m
 		}
 	}
File: `management/internals/shared/grpc/conversion.go`, lines 115-119.

Comment on lines 372 to 375
am.handleRoutingPeerDNSResolutionSettings(ctx, oldSettings, newSettings, userID, accountID)
am.handleLazyConnectionSettings(ctx, oldSettings, newSettings, userID, accountID)
am.handleConnectionModeSettings(ctx, oldSettings, newSettings, userID, accountID)
am.handlePeerLoginExpirationSettings(ctx, oldSettings, newSettings, userID, accountID)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate connection-mode setting changes to peers, not just audit them.

This new audit hook lands, but updateAccountPeers is never flipped when ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds, or P2pRetryMaxSeconds change. As written, the account setting is saved and audited, but already-connected clients won't be pushed the new runtime config until some unrelated peer update happens.

🛠️ Suggested fix
 		if oldSettings.RoutingPeerDNSResolutionEnabled != newSettings.RoutingPeerDNSResolutionEnabled ||
 			oldSettings.LazyConnectionEnabled != newSettings.LazyConnectionEnabled ||
+			!equalStringPtr(oldSettings.ConnectionMode, newSettings.ConnectionMode) ||
+			!equalUint32Ptr(oldSettings.RelayTimeoutSeconds, newSettings.RelayTimeoutSeconds) ||
+			!equalUint32Ptr(oldSettings.P2pTimeoutSeconds, newSettings.P2pTimeoutSeconds) ||
+			!equalUint32Ptr(oldSettings.P2pRetryMaxSeconds, newSettings.P2pRetryMaxSeconds) ||
 			oldSettings.DNSDomain != newSettings.DNSDomain ||
 			oldSettings.AutoUpdateVersion != newSettings.AutoUpdateVersion ||
 			oldSettings.AutoUpdateAlways != newSettings.AutoUpdateAlways {
 			updateAccountPeers = true
 		}
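Note the diff references equalStringPtr / equalUint32Ptr helpers; if they do not already exist in the package, a single generic helper (Go 1.18+) could cover both. A sketch, not a claim about existing code:

```go
package main

import "fmt"

// ptrEqual treats two nil pointers as equal, a nil/non-nil pair as
// unequal, and otherwise compares the pointed-to values.
func ptrEqual[T comparable](a, b *T) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

func main() {
	mode := "p2p-lazy"
	other := "p2p"
	same := "p2p-lazy"
	fmt.Println(ptrEqual[string](nil, nil)) // true
	fmt.Println(ptrEqual(&mode, nil))       // false
	fmt.Println(ptrEqual(&mode, &other))    // false
	fmt.Println(ptrEqual(&mode, &same))     // true
}
```

The nil/non-nil distinction matters here because a nil pointer means "unset, inherit legacy behavior", which is semantically different from an explicit zero.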
File: `management/server/account.go`, lines 372-375.

Comment on lines +229 to +240
if req.Settings.P2pTimeoutSeconds != nil {
v := uint32(*req.Settings.P2pTimeoutSeconds)
returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
v := uint32(*req.Settings.P2pRetryMaxSeconds)
returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
v := uint32(*req.Settings.RelayTimeoutSeconds)
returnSettings.RelayTimeoutSeconds = &v
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject negative timeout values before converting them to uint32.

These casts accept negative JSON values and wrap them into huge unsigned numbers, so a request like -1 becomes 4294967295 seconds instead of a validation error. Please validate all three timeout fields as >= 0 before the cast.

🛠️ Suggested fix
 	if req.Settings.P2pTimeoutSeconds != nil {
+		if *req.Settings.P2pTimeoutSeconds < 0 {
+			return nil, status.Errorf(status.InvalidArgument, "p2p_timeout_seconds must be >= 0")
+		}
 		v := uint32(*req.Settings.P2pTimeoutSeconds)
 		returnSettings.P2pTimeoutSeconds = &v
 	}
 	if req.Settings.P2pRetryMaxSeconds != nil {
+		if *req.Settings.P2pRetryMaxSeconds < 0 {
+			return nil, status.Errorf(status.InvalidArgument, "p2p_retry_max_seconds must be >= 0")
+		}
 		v := uint32(*req.Settings.P2pRetryMaxSeconds)
 		returnSettings.P2pRetryMaxSeconds = &v
 	}
 	if req.Settings.RelayTimeoutSeconds != nil {
+		if *req.Settings.RelayTimeoutSeconds < 0 {
+			return nil, status.Errorf(status.InvalidArgument, "relay_timeout_seconds must be >= 0")
+		}
 		v := uint32(*req.Settings.RelayTimeoutSeconds)
 		returnSettings.RelayTimeoutSeconds = &v
 	}
File: `management/server/http/handlers/accounts/accounts_handler.go`, lines 229-240.

Comment on lines +367 to +372
description: |
Account-wide default peer-connection mode. NULL means
"fall back to lazy_connection_enabled" for backwards compatibility.
Phase 1 of issue #5989: relay-forced, p2p, and p2p-lazy are
functional. p2p-dynamic is reserved (passes through as p2p in
Phase 1; will become functional in Phase 2).

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale connection_mode phase semantics in the description.

This text says p2p-dynamic is reserved/pass-through, which conflicts with the current phase scope and can mislead API consumers.

✏️ Suggested doc update
         connection_mode:
           x-experimental: true
           type: string
           enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
           nullable: true
           description: |
             Account-wide default peer-connection mode. NULL means
             "fall back to lazy_connection_enabled" for backwards compatibility.
-            Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-            functional. p2p-dynamic is reserved (passes through as p2p in
-            Phase 1; will become functional in Phase 2).
+            relay-forced, p2p, p2p-lazy, and p2p-dynamic are supported.
File: `shared/management/http/api/openapi.yml`, lines 367-372.

@MichaelUray MichaelUray changed the title [2/4 stacked] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab [client] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab (stack 2/4) May 6, 2026
…uest test

setconfig_test.go reflection-tests (TestSetConfig_AllFieldsSaved +
TestCLIFlags_MappedToSetConfig) tolerate the four new fields:
ConnectionMode, P2PTimeoutSeconds, RelayTimeoutSeconds,
P2PRetryMaxSeconds. They are in the proto so daemons can advertise
them via GetConfig, but SetConfig RPC does not apply them at runtime
today — the CLI sets them via 'netbird service install/reconfigure'
writing the active profile file directly. Listed in the test maps so
the structural drift-detector passes; wiring through SetConfig is a
follow-up.
The pb.go regeneration on dev machines uses local protoc (v5.29.3 here)
which causes the Proto Version Check workflow to flag the diff as a
toolchain mismatch. Restore the upstream-main version strings (v6.33.1
for client/proto, v7.34.1 for shared/management/proto) — generated
content matches; only the comment-line differs.
@MichaelUray MichaelUray force-pushed the pr/b-phase3.5-network-change branch from 2b6f78b to 6bf6777 Compare May 6, 2026 06:21

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (3)
client/internal/peer/worker_ice.go (1)

218-231: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

lastKnownState is still accessed without consistent synchronization — data race.

IsConnected() reads lastKnownState while holding muxAgent, but the writes inside onConnectionStateChange at lines 550 and 562 happen on the pion callback goroutine without acquiring muxAgent. This is a Go data race, and the new Conn.onNetworkChange path (in conn.go at line 1244) now reads through IsConnected() to decide whether to close the agent — a torn read can either skip Close() on a freshly-disconnected agent or close a still-Connected one, exactly the scenario the comment in onNetworkChange says to avoid.

Switch lastKnownState to atomic.Int32 (or similar) so reads and writes are consistent regardless of muxAgent:

-	// we record the last known state of the ICE agent to avoid duplicate on disconnected events
-	lastKnownState ice.ConnectionState
+	// Accessed from pion's state-change callback and from IsConnected()
+	// concurrently — keep atomic so reads on the network-change path are
+	// not torn against unlocked writes in onConnectionStateChange.
+	lastKnownState atomic.Int32 // ice.ConnectionState
-		lastKnownState:    ice.ConnectionStateDisconnected,
+	}
+	w.lastKnownState.Store(int32(ice.ConnectionStateDisconnected))
 func (w *WorkerICE) IsConnected() bool {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()
-	return w.agent != nil && w.lastKnownState == ice.ConnectionStateConnected
+	return w.agent != nil && ice.ConnectionState(w.lastKnownState.Load()) == ice.ConnectionStateConnected
 }
 		case ice.ConnectionStateConnected:
-			w.lastKnownState = ice.ConnectionStateConnected
+			w.lastKnownState.Store(int32(ice.ConnectionStateConnected))
 			w.logSuccessfulPaths(agent)
@@
-			if w.lastKnownState == ice.ConnectionStateConnected {
-				w.lastKnownState = ice.ConnectionStateDisconnected
+			if ice.ConnectionState(w.lastKnownState.Load()) == ice.ConnectionStateConnected {
+				w.lastKnownState.Store(int32(ice.ConnectionStateDisconnected))
 				w.conn.onICEStateDisconnected(sessionChanged)
 			}

Also applies to: 549-563
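A minimal standalone model of the proposed fix shape — the callback goroutine stores the state word, IsConnected loads it, and no lock is needed for the state itself (illustrative int32 constants stand in for ice.ConnectionState):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Illustrative states standing in for ice.ConnectionState values.
const (
	stateDisconnected int32 = 0
	stateConnected    int32 = 1
)

type worker struct {
	lastKnownState atomic.Int32 // written by the callback, read by IsConnected
}

func (w *worker) IsConnected() bool {
	return w.lastKnownState.Load() == stateConnected
}

func main() {
	w := &worker{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // simulates pion's state-change callback goroutine
		defer wg.Done()
		w.lastKnownState.Store(stateConnected)
	}()
	wg.Wait()
	fmt.Println(w.IsConnected()) // true
}
```

In the real worker the agent nil-check still needs muxAgent; only the state word moves to atomic Load/Store.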

client/internal/peer/conn.go (1)

202-212: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Disable-backoff state is still not preserved across Open/Close cycles.

The ConnConfig comment at lines 98-100 says ^uint32(0) is the disable sentinel and the resolver translates it to 0 before reaching here, while the Open() block treats 0 as "use built-in default" (DefaultP2PRetryMax). These two semantics conflict: a peer that the resolver wanted disabled ends up with the default 15 m backoff. Likewise, SetIceBackoffMax(0) writes 0 into config.P2pRetryMaxSeconds, and a subsequent Close()/Open() cycle silently re-promotes the peer from "disabled" to the default schedule.

Use a distinct in-process representation for "disabled" so the three states (use default, disabled, explicit value) can each round-trip. For example, treat the wire ^uint32(0) as the in-process "disabled" sentinel rather than collapsing it to 0 at the resolver, so Open() can branch on it directly.

+const p2pRetryMaxDisabled = ^uint32(0)
+
-	backoffCap := time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second
-	if backoffCap == 0 {
-		backoffCap = DefaultP2PRetryMax
-	}
+	var backoffCap time.Duration
+	switch conn.config.P2pRetryMaxSeconds {
+	case 0:
+		backoffCap = DefaultP2PRetryMax
+	case p2pRetryMaxDisabled:
+		backoffCap = 0
+	default:
+		backoffCap = time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second
+	}
@@
 func (conn *Conn) SetIceBackoffMax(d time.Duration) {
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
-	conn.config.P2pRetryMaxSeconds = uint32(d / time.Second)
+	if d == 0 {
+		conn.config.P2pRetryMaxSeconds = p2pRetryMaxDisabled
+	} else {
+		conn.config.P2pRetryMaxSeconds = uint32(d / time.Second)
+	}
 	if conn.iceBackoff != nil {
 		conn.iceBackoff.SetMaxBackoff(d)
 	}
 }

The companion change in resolver/engine.go must also stop translating `^uint32(0)` to 0 — pass the sentinel through verbatim.

Also applies to: 1152-1163

client/android/preferences.go (1)

349-352: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Negative secs is silently aliased to a huge uint32; for SetP2pRetryMaxSeconds it collides with the disable sentinel.

A SetRelayTimeoutSeconds(-1) becomes 4294967295 seconds (~136 years), and the same bad cast on SetP2pRetryMaxSeconds(-1) yields exactly ^uint32(0), which is the documented wire-format sentinel for "user-explicit disable backoff" (see client/internal/peer/conn.go lines 98-100). Negative input therefore silently flips backoff into the disabled state without any validation. Clamp or reject < 0 in all three setters.

 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 370-373, 390-393
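The wrap is easy to demonstrate in isolation:

```go
package main

import "fmt"

func main() {
	// The hazard described above: converting a negative int64
	// straight to uint32 wraps instead of failing.
	secs := int64(-1)
	fmt.Println(uint32(secs))               // 4294967295
	fmt.Println(uint32(secs) == ^uint32(0)) // true: collides with the disable sentinel
}
```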

🧹 Nitpick comments (6)
client/internal/peer/guard/guard.go (1)

57-62: ⚡ Quick win

SetOnNetworkChange after Start() is an undetected data race

The comment says "Set once before Start() is called; no lock needed." This is safe only if the goroutine spawning Start() happens strictly after SetOnNetworkChange returns, establishing a happens-before edge. However, nothing in the API enforces this — a caller that invokes SetOnNetworkChange after spawning go g.Start(...) introduces a data race on g.onNetworkChange with no diagnostic. Consider using sync/atomic to store the function pointer, or document this precondition with a //nolint:gocritic + go vet note so it surfaces in race-detector runs.

As a lightweight alternative, a sync/atomic.Pointer[func()] store/load pair makes the intent explicit and race-detector-clean with minimal overhead.
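Sketch of that alternative — the names are illustrative, not the actual Guard API:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// guard stores the callback behind atomic.Pointer (Go 1.19+) so
// SetOnNetworkChange and the Start goroutine never race on a plain field.
type guard struct {
	onNetworkChange atomic.Pointer[func()]
}

func (g *guard) SetOnNetworkChange(f func()) {
	g.onNetworkChange.Store(&f)
}

// fire simulates the point in the guard loop that invokes the callback.
func (g *guard) fire() {
	if f := g.onNetworkChange.Load(); f != nil {
		(*f)()
	}
}

func main() {
	g := &guard{}
	g.fire() // no callback registered yet: safe no-op
	g.SetOnNetworkChange(func() { fmt.Println("network changed") })
	g.fire() // prints "network changed"
}
```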

client/internal/peer/ice_backoff.go (2)

82-92: 💤 Low value

Optional: clear suspended flag on lazy expiry to keep state self-consistent.

IsSuspended() returns false once nextRetry is in the past, but it leaves s.suspended = true. Any future caller that inspects the raw flag (or that adds non-time-based logic) will see a stale "suspended" state. The condition is duplicated in Snapshot() (line 131), so the two helpers can drift. Consider clearing the flag once it has lapsed (the pure read API becomes a write, still safe under s.mu), or simply expose only the time-based predicate.

 func (s *iceBackoffState) IsSuspended() bool {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	if !s.suspended {
 		return false
 	}
-	if time.Now().After(s.nextRetry) {
-		return false
-	}
-	return true
+	if time.Now().After(s.nextRetry) {
+		s.suspended = false
+		return false
+	}
+	return true
 }

161-169: 💤 Low value

Confirm: SetMaxBackoff resets the in-flight exponential schedule.

The doc comment states the failure counter is preserved, which it is, but buildBackoff calls bo.Reset() so the schedule restarts at InitialInterval. A peer that has climbed to e.g. ~8 m suspends will, after a server-pushed SetMaxBackoff to a different cap, drop back to ~1 m on the next failure. This is most likely intentional (new policy from server ⇒ fresh schedule), but the comment "preserves the failure counter" can be misread as "preserves the schedule"; consider clarifying.

-// SetMaxBackoff updates the cap. Called from ConnMgr.UpdatedRemotePeerConfig
-// when the server pushes a new value. Rebuilds the internal backoff with
-// the new schedule but preserves the failure counter.
+// SetMaxBackoff updates the cap. Called from ConnMgr.UpdatedRemotePeerConfig
+// when the server pushes a new value. Rebuilds the internal exponential
+// schedule from scratch (so the next markFailure starts at InitialInterval)
+// but preserves the cumulative failure counter for status reporting.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/ice_backoff.go` around lines 161 - 169, Update the
comment for SetMaxBackoff on iceBackoffState to clarify that although the
failure counter is preserved, the exponential backoff schedule is reset because
SetMaxBackoff calls buildBackoff(d) which invokes bo.Reset(); explicitly state
that the schedule (current interval) restarts at InitialInterval while the
failure count remains unchanged so callers won't be misled into thinking the
in-flight schedule is preserved.
client/internal/peer/conn.go (2)

1030-1056: 💤 Low value

Minor: redundant IsSuspended() check.

AttachICE checks iceBackoff.IsSuspended() (line 1034) and then calls attachICEListenerLocked, which performs the same check (line 1070). The first check returns early without an error; the second returns false (no-op). The behaviour is identical, so the first block is just an extra log path that fires before the listener attach. If the dual logging is intentional (different message), keep it; otherwise drop the outer check and rely on attachICEListenerLocked returning false to skip the SendOffer.

 func (conn *Conn) AttachICE() error {
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
 
-	if conn.iceBackoff != nil && conn.iceBackoff.IsSuspended() {
-		snap := conn.iceBackoff.Snapshot()
-		conn.Log.Debugf("ICE backoff active (failure #%d, retry at %s), staying on relay",
-			snap.Failures,
-			snap.NextRetry.Format("15:04:05"))
-		return nil
-	}
 	if conn.handshaker == nil {
 		return fmt.Errorf("AttachICE: handshaker not initialized (Open not called)")
 	}
 	if conn.workerICE == nil {
 		return fmt.Errorf("AttachICE: workerICE is nil (relay-forced mode)")
 	}
 
 	if !conn.attachICEListenerLocked() {
 		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 1030 - 1056, Remove the redundant
iceBackoff.IsSuspended() check and its logging from Conn.AttachICE and rely on
attachICEListenerLocked to handle suspended backoff (so AttachICE no longer
early-returns for suspension); specifically, delete the block that checks
conn.iceBackoff != nil && conn.iceBackoff.IsSuspended() and its
conn.Log.Debugf/return, leaving the rest of AttachICE (including the handshaker
nil checks and the call to attachICEListenerLocked and SendOffer) intact so
behavior is unchanged except for avoiding duplicate suspension logging.

19-20: 💤 Low value

Import group ordering.

shared/connectionmode is wedged between client/internal/metrics and client/internal/peer/conntype, breaking the client/... group's alphabetical ordering. goimports would put shared/... after the client/... group. Minor, but easy to fix.

 	"github.com/netbirdio/netbird/client/internal/metrics"
-	"github.com/netbirdio/netbird/shared/connectionmode"
 	"github.com/netbirdio/netbird/client/internal/peer/conntype"
 	"github.com/netbirdio/netbird/client/internal/peer/dispatcher"
 	"github.com/netbirdio/netbird/client/internal/peer/guard"
 	icemaker "github.com/netbirdio/netbird/client/internal/peer/ice"
 	"github.com/netbirdio/netbird/client/internal/peer/id"
 	"github.com/netbirdio/netbird/client/internal/peer/worker"
 	"github.com/netbirdio/netbird/client/internal/portforward"
 	"github.com/netbirdio/netbird/client/internal/stdnet"
 	"github.com/netbirdio/netbird/route"
+	"github.com/netbirdio/netbird/shared/connectionmode"
 	relayClient "github.com/netbirdio/netbird/shared/relay/client"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 19 - 20, The import block in
client/internal/peer/conn.go has shared/connectionmode placed between
client/internal packages, breaking the client/... group's alphabetical grouping;
reorder the imports so all client/internal/* imports (e.g.,
client/internal/metrics and client/internal/peer/conntype) are contiguous and
alphabetically ordered, and move
github.com/netbirdio/netbird/shared/connectionmode after the client/... group
(matching goimports grouping).
client/internal/peer/env_test.go (1)

33-34: 💤 Low value

Use constants instead of hardcoded environment variable names.

Lines 31–32 already reference EnvKeyNBConnectionMode and EnvKeyNBForceRelay. Lines 33–34 hardcode "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and "NB_LAZY_CONN_INACTIVITY_THRESHOLD" as string literals.

The peer package defines these as private constants envEnableLazyConn and envInactivityThreshold (accessible within the same package); alternatively, import EnvEnableLazyConn and EnvInactivityThreshold from client/internal/lazyconn. Using the constants ensures any rename is caught at compile time.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/env_test.go` around lines 33 - 34, Replace the hardcoded
environment variable names in the test t.Setenv calls with the package
constants: use envEnableLazyConn instead of "NB_ENABLE_EXPERIMENTAL_LAZY_CONN"
and envInactivityThreshold instead of "NB_LAZY_CONN_INACTIVITY_THRESHOLD" (these
constants are defined in the peer package); update the two t.Setenv calls in
env_test.go to reference those constants so renames are caught at compile time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/android/preferences.go`:
- Around line 325-352: The setters store pointers to zero values instead of
clearing overrides, so change SetConnectionMode, SetRelayTimeoutSeconds (and
likewise SetP2pTimeoutSeconds and SetP2pRetryMaxSeconds) to set the
corresponding pointer to nil when the sentinel is passed (empty string for
SetConnectionMode, 0 for the timeout/setter methods); otherwise allocate and
assign a pointer to the non-zero value. Update p.configInput.ConnectionMode and
p.configInput.RelayTimeoutSeconds (and the P2P fields) accordingly so apply()'s
nil check correctly reverts to server-pushed values.

In `@client/internal/peer/guard/guard.go`:
- Around line 144-147: The select loop currently calls g.onNetworkChange()
synchronously which blocks responsiveness to ctx.Done() and other channels;
change it to dispatch the callback asynchronously: when g.onNetworkChange != nil
spawn a background goroutine to call g.onNetworkChange() from the guard's select
branch (e.g., in the guard loop where srReconnectedChan is handled) so the loop
can continue processing ctx.Done and other events. To avoid unbounded concurrent
invocations, add a simple coalescing mechanism (a boolean flag or non-blocking
send to a buffered channel) adjacent to the async dispatch to ensure rapid
reconnect events don’t spawn redundant goroutines unless the previous handler
has completed; keep the flag/chan state local to the guard instance that owns
g.onNetworkChange.

In `@client/proto/daemon.proto`:
- Around line 208-218: The daemon handlers in client/server/server.go currently
ignore the new LoginRequest.connection_mode, p2p_timeout_seconds,
relay_timeout_seconds, and p2p_retry_max_seconds (and the same fields on
SetConfigRequest), so RPC callers can send them but the daemon neither applies
nor persists them; update the request handling code paths that process
LoginRequest and SetConfigRequest in server.go to read these fields,
validate/parse connection_mode (using connectionmode.ParseString where
appropriate), apply them to the daemon's runtime config/state, and persist them
to the same storage path used by existing config fields so the GUI save flow and
subsequent calls reflect the overrides. Ensure the same wiring is added for any
helper methods that currently consume LoginRequest or SetConfigRequest (so both
request handling and persistence are consistent).

---

Duplicate comments:
In `@client/android/preferences.go`:
- Around line 349-352: SetRelayTimeoutSeconds (and the two other setters that
cast int64 to uint32, e.g., SetP2pRetryMaxSeconds and the comparable setter)
currently cast negative secs into a huge uint32 (which can collide with the
disable sentinel); update each setter to validate secs before casting: if secs <
0 then clamp to 0 (or return an error if you prefer API change) and only then
assign via uint32(secs) into the pointer field so negative inputs cannot become
^uint32(0) or other unintended values.

In `@client/internal/peer/conn.go`:
- Around line 202-212: The code collapses the wire sentinel for "disabled" into
0 which conflicts with Open() treating 0 as "use default", causing disabled
peers to be re-enabled; fix by introducing a distinct in-process sentinel for
"disabled" (e.g. a named constant like DisabledP2pRetrySentinel) and preserve
the resolver wire sentinel instead of translating it to 0 in resolver/engine.go,
then update Open()'s logic around conn.config.P2pRetryMaxSeconds to: if it
equals the disabled sentinel, configure conn.iceBackoff as disabled (or leave
nil) so backoff is turned off; else if it equals 0 treat it as "use default" and
set backoffCap = DefaultP2PRetryMax, otherwise use the explicit value to
newIceBackoff / SetMaxBackoff; also ensure SetIceBackoffMax writes the same
in-process sentinel when disabling so Close()/Open() round-trips.

In `@client/internal/peer/worker_ice.go`:
- Around line 218-231: IsConnected reads lastKnownState under muxAgent but
onConnectionStateChange writes it without locking, causing a data race; change
the WorkerICE.lastKnownState from a plain int to an atomic.Int32 (or similar),
update IsConnected() to use lastKnownState.Load() (while keeping muxAgent for
agent nil-check), and update onConnectionStateChange to use
lastKnownState.Store(newState) instead of assigning directly; ensure all other
accesses (e.g., in the code paths around lines referenced) use atomic Load/Store
and remove any unsynchronized direct reads/writes to lastKnownState while
leaving agent access protected by muxAgent as before.

---

Nitpick comments:
In `@client/internal/peer/conn.go`:
- Around line 1030-1056: Remove the redundant iceBackoff.IsSuspended() check and
its logging from Conn.AttachICE and rely on attachICEListenerLocked to handle
suspended backoff (so AttachICE no longer early-returns for suspension);
specifically, delete the block that checks conn.iceBackoff != nil &&
conn.iceBackoff.IsSuspended() and its conn.Log.Debugf/return, leaving the rest
of AttachICE (including the handshaker nil checks and the call to
attachICEListenerLocked and SendOffer) intact so behavior is unchanged except
for avoiding duplicate suspension logging.
- Around line 19-20: The import block in client/internal/peer/conn.go has
shared/connectionmode placed between client/internal packages, breaking the
client/... group's alphabetical grouping; reorder the imports so all
client/internal/* imports (e.g., client/internal/metrics and
client/internal/peer/conntype) are contiguous and alphabetically ordered, and
move github.com/netbirdio/netbird/shared/connectionmode after the client/...
group (matching goimports grouping).

In `@client/internal/peer/env_test.go`:
- Around line 33-34: Replace the hardcoded environment variable names in the
test t.Setenv calls with the package constants: use envEnableLazyConn instead of
"NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and envInactivityThreshold instead of
"NB_LAZY_CONN_INACTIVITY_THRESHOLD" (these constants are defined in the peer
package); update the two t.Setenv calls in env_test.go to reference those
constants so renames are caught at compile time.

In `@client/internal/peer/guard/guard.go`:
- Around line 57-62: SetOnNetworkChange currently writes g.onNetworkChange
without synchronization which can race with concurrent Start() readers; change
the Guard implementation to store the callback via a sync/atomic pointer type
(e.g. sync/atomic.Pointer[func()]) instead of the plain field and update all
reads (e.g. in Start and any other goroutines that invoke the callback) to load
via atomic.Load/Pointer (or the generic Atomic.Pointer API) before invoking;
ensure SetOnNetworkChange uses atomic.Store/Pointer to set the callback and
nil-checks use the atomic load so access is race-detector-clean and retains the
same semantics.

In `@client/internal/peer/ice_backoff.go`:
- Around line 82-92: IsSuspended() currently returns false when nextRetry has
passed but leaves s.suspended true, causing inconsistent state with Snapshot();
update IsSuspended() (in iceBackoffState) to, under s.mu lock, check if
s.suspended && time.Now().After(s.nextRetry) and if so clear s.suspended (and
optionally reset nextRetry) before returning false so the in-memory flags remain
self-consistent; ensure you modify only inside the locked section to avoid races
and keep Snapshot() behavior aligned.
- Around line 161-169: Update the comment for SetMaxBackoff on iceBackoffState
to clarify that although the failure counter is preserved, the exponential
backoff schedule is reset because SetMaxBackoff calls buildBackoff(d) which
invokes bo.Reset(); explicitly state that the schedule (current interval)
restarts at InitialInterval while the failure count remains unchanged so callers
won't be misled into thinking the in-flight schedule is preserved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4622b57-deca-4419-b794-100392cfb3c4

📥 Commits

Reviewing files that changed from the base of the PR and between 2b6f78b and 6bf6777.

⛔ Files ignored due to path filters (2)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (20)
  • client/android/client.go
  • client/android/preferences.go
  • client/cmd/service.go
  • client/cmd/service_installer.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/engine.go
  • client/internal/peer/conn.go
  • client/internal/peer/env_test.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/worker_ice.go
  • client/proto/daemon.proto
  • client/server/server.go
  • client/server/setconfig_test.go
  • client/ui/client_ui.go
  • client/ui/const.go
  • client/ui/event_handler.go
💤 Files with no reviewable changes (2)
  • client/ui/event_handler.go
  • client/ui/const.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/conn_mgr.go

Comment on lines +325 to +352
// SetConnectionMode stores a local override for the connection mode.
// Pass an empty string to clear the override (revert to following the
// server-pushed value).
func (p *Preferences) SetConnectionMode(mode string) {
	m := mode
	p.configInput.ConnectionMode = &m
}

// GetRelayTimeoutSeconds returns the locally configured relay-worker
// inactivity timeout in seconds, or 0 if no override is set (follow
// server-pushed value, or built-in default if the server has none).
func (p *Preferences) GetRelayTimeoutSeconds() (int64, error) {
	if p.configInput.RelayTimeoutSeconds != nil {
		return int64(*p.configInput.RelayTimeoutSeconds), nil
	}
	cfg, err := profilemanager.ReadConfig(p.configInput.ConfigPath)
	if err != nil {
		return 0, err
	}
	return int64(cfg.RelayTimeoutSeconds), nil
}

// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := uint32(secs)
	p.configInput.RelayTimeoutSeconds = &v
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm there's no other "clear override" path that nils the pointer in apply() or elsewhere.
rg -nP --type=go -C2 'ConnectionMode|RelayTimeoutSeconds|P2pTimeoutSeconds|P2pRetryMaxSeconds' -g '!**/*_test.go' -g 'client/internal/profilemanager/**'

Repository: netbirdio/netbird

Length of output: 3375


🏁 Script executed:

#!/bin/bash
# Check the actual SetConnectionMode, SetRelayTimeoutSeconds, SetP2pTimeoutSeconds, SetP2pRetryMaxSeconds implementations
cat -n client/android/preferences.go | head -400 | tail -100

Repository: netbirdio/netbird

Length of output: 4108


🏁 Script executed:

#!/bin/bash
# Search for all occurrences of these four setter functions to ensure we catch all of them
rg -n 'func.*Set(ConnectionMode|RelayTimeoutSeconds|P2pTimeoutSeconds|P2pRetryMaxSeconds)' --type=go -g '!**/*_test.go'

Repository: netbirdio/netbird

Length of output: 426


🏁 Script executed:

#!/bin/bash
# Check if there are any other code paths that might clear these overrides (e.g., by setting to nil)
rg -n 'ConnectionMode.*=.*nil|RelayTimeoutSeconds.*=.*nil|P2pTimeoutSeconds.*=.*nil|P2pRetryMaxSeconds.*=.*nil' --type=go -g '!**/*_test.go'

Repository: netbirdio/netbird

Length of output: 3062


Fix the "clear override" logic to set pointer to nil, not to zero value.

The documented behaviour for SetConnectionMode, SetRelayTimeoutSeconds, SetP2pTimeoutSeconds, and SetP2pRetryMaxSeconds promises that passing 0 (or empty string) clears the override and reverts to the server-pushed value. However, each setter currently stores a pointer to the zero value, and the apply() function in client/internal/profilemanager/config.go (lines 608–626) checks only != nil before assigning. This overwrites the persisted server-pushed value with 0 or "" instead of clearing the override.

To honour the documented semantics, set the pointer to nil when the sentinel value is passed:

Suggested fix pattern
 func (p *Preferences) SetConnectionMode(mode string) {
+	if mode == "" {
+		p.configInput.ConnectionMode = nil
+		return
+	}
 	m := mode
 	p.configInput.ConnectionMode = &m
 }
@@
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs <= 0 {
+		p.configInput.RelayTimeoutSeconds = nil
+		return
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }

Apply the same pattern to SetP2pTimeoutSeconds and SetP2pRetryMaxSeconds.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// SetConnectionMode stores a local override for the connection mode.
// Pass an empty string to clear the override (revert to following the
// server-pushed value).
func (p *Preferences) SetConnectionMode(mode string) {
	m := mode
	p.configInput.ConnectionMode = &m
}

// GetRelayTimeoutSeconds returns the locally configured relay-worker
// inactivity timeout in seconds, or 0 if no override is set (follow
// server-pushed value, or built-in default if the server has none).
func (p *Preferences) GetRelayTimeoutSeconds() (int64, error) {
	if p.configInput.RelayTimeoutSeconds != nil {
		return int64(*p.configInput.RelayTimeoutSeconds), nil
	}
	cfg, err := profilemanager.ReadConfig(p.configInput.ConfigPath)
	if err != nil {
		return 0, err
	}
	return int64(cfg.RelayTimeoutSeconds), nil
}

// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := uint32(secs)
	p.configInput.RelayTimeoutSeconds = &v
}

// SetConnectionMode stores a local override for the connection mode.
// Pass an empty string to clear the override (revert to following the
// server-pushed value).
func (p *Preferences) SetConnectionMode(mode string) {
	if mode == "" {
		p.configInput.ConnectionMode = nil
		return
	}
	m := mode
	p.configInput.ConnectionMode = &m
}

// GetRelayTimeoutSeconds returns the locally configured relay-worker
// inactivity timeout in seconds, or 0 if no override is set (follow
// server-pushed value, or built-in default if the server has none).
func (p *Preferences) GetRelayTimeoutSeconds() (int64, error) {
	if p.configInput.RelayTimeoutSeconds != nil {
		return int64(*p.configInput.RelayTimeoutSeconds), nil
	}
	cfg, err := profilemanager.ReadConfig(p.configInput.ConfigPath)
	if err != nil {
		return 0, err
	}
	return int64(cfg.RelayTimeoutSeconds), nil
}

// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	if secs <= 0 {
		p.configInput.RelayTimeoutSeconds = nil
		return
	}
	v := uint32(secs)
	p.configInput.RelayTimeoutSeconds = &v
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 325 - 352, The setters store
pointers to zero values instead of clearing overrides, so change
SetConnectionMode, SetRelayTimeoutSeconds (and likewise SetP2pTimeoutSeconds and
SetP2pRetryMaxSeconds) to set the corresponding pointer to nil when the sentinel
is passed (empty string for SetConnectionMode, 0 for the timeout/setter
methods); otherwise allocate and assign a pointer to the non-zero value. Update
p.configInput.ConnectionMode and p.configInput.RelayTimeoutSeconds (and the P2P
fields) accordingly so apply()'s nil check correctly reverts to server-pushed
values.

Comment on lines +144 to +147
// Phase 3.5 (#5989): notify Conn to reset iceBackoff + recreate workerICE
if g.onNetworkChange != nil {
	g.onNetworkChange()
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Synchronous onNetworkChange() call blocks ctx.Done() responsiveness

The callback is invoked on the guard's select goroutine. Per the PR description, it recreates the ICE worker — potentially involving non-trivial setup. Until the callback returns, the guard loop cannot react to ctx.Done() or any other channel (including another srReconnectedChan event). Consider dispatching it in a background goroutine:

⚡ Proposed fix
-            // Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
-            if g.onNetworkChange != nil {
-                g.onNetworkChange()
-            }
+            // Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+            // Dispatched asynchronously so the guard loop remains responsive to ctx.Done().
+            if g.onNetworkChange != nil {
+                go g.onNetworkChange()
+            }

If the callback itself serializes on an internal mutex (as ICE worker recreation likely does), concurrent calls from rapid reconnect events are already serialized. If not, consider adding a flag or draining mechanism to coalesce multiple rapid network-change signals.

📝 Committable suggestion


Suggested change
// Phase 3.5 (#5989): notify Conn to reset iceBackoff + recreate workerICE
if g.onNetworkChange != nil {
	g.onNetworkChange()
}
// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
// Dispatched asynchronously so the guard loop remains responsive to ctx.Done().
if g.onNetworkChange != nil {
	go g.onNetworkChange()
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/guard/guard.go` around lines 144 - 147, The select loop
currently calls g.onNetworkChange() synchronously which blocks responsiveness to
ctx.Done() and other channels; change it to dispatch the callback
asynchronously: when g.onNetworkChange != nil spawn a background goroutine to
call g.onNetworkChange() from the guard's select branch (e.g., in the guard loop
where srReconnectedChan is handled) so the loop can continue processing ctx.Done
and other events. To avoid unbounded concurrent invocations, add a simple
coalescing mechanism (a boolean flag or non-blocking send to a buffered channel)
adjacent to the async dispatch to ensure rapid reconnect events don’t spawn
redundant goroutines unless the previous handler has completed; keep the
flag/chan state local to the guard instance that owns g.onNetworkChange.

Comment thread client/proto/daemon.proto
Comment on lines +208 to +218
  // Phase 1 (#5989): peer-connection mode and idle timeouts.
  // connection_mode is a string (relay-forced, p2p, p2p-lazy, p2p-dynamic,
  // follow-server, or empty); parsed via connectionmode.ParseString at the
  // daemon side. Empty means "no client-side override, use server value".
  optional string connection_mode = 40;
  optional uint32 p2p_timeout_seconds = 41;
  optional uint32 relay_timeout_seconds = 42;

  // Phase 3 (#5989): cap on the ICE-failure backoff schedule. 0 = use
  // server-pushed value (or built-in default 15 min).
  optional uint32 p2p_retry_max_seconds = 43;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire these new request fields through the daemon handlers.

client/server/server.go still ignores LoginRequest.connection_mode/... and SetConfigRequest.connection_mode/..., so callers can send these overrides and get a success response even though the daemon neither applies nor persists them. That leaves the new RPC surface out of sync with the implementation, especially for the GUI save flow.

Also applies to: 714-724

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/proto/daemon.proto` around lines 208 - 218, The daemon handlers in
client/server/server.go currently ignore the new LoginRequest.connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, and p2p_retry_max_seconds (and the
same fields on SetConfigRequest), so RPC callers can send them but the daemon
neither applies nor persists them; update the request handling code paths that
process LoginRequest and SetConfigRequest in server.go to read these fields,
validate/parse connection_mode (using connectionmode.ParseString where
appropriate), apply them to the daemon's runtime config/state, and persist them
to the same storage path used by existing config fields so the GUI save flow and
subsequent calls reflect the overrides. Ensure the same wiring is added for any
helper methods that currently consume LoginRequest or SetConfigRequest (so both
request handling and persistence are consistent).

MichaelUray and others added 20 commits May 6, 2026 06:53
…rsion pin

- conn_mgr.go: tagged switch on uint32 sentinel (QF1002 staticcheck)
- engine.go: //nolint:staticcheck on the deprecated peer.IsForceRelayed
  call site - the function is intentionally retained for Phase-1
  backwards compat with daemons that haven't migrated to ResolveModeFromEnv yet
- lazyconn/manager.go: //nolint:staticcheck on the deprecated
  inactivity.NewManager fallback - kept on purpose for callers without
  resolved two-timer config (iceTimeout=0 && relayTimeout=0)
- proxy_service.pb.go: pin protoc version header to upstream-main v7.34.1
  (Proto Version Check workflow flagged it; only the comment-line differs)
Phase 3.5 hotfix for netbirdio#5989. When Guard detects that the signal/relay
layer has reconnected (typically after a network/interface change
like LTE-modem replug or WiFi-roaming), the per-Conn iceBackoff is
now reset to a fresh state AND the workerICE is recreated.

Without this, after a long Phase-2-style suspension the daemon would
remain stuck on Relay even when the new network conditions allow P2P:
the AttachICE-gate is open (suspend window expired) but the previously
DetachICE-closed workerICE never gets a fresh pion-agent, producing
"ICE Agent is not initialized yet" warnings instead of new ICE
attempts.

Verified on badmitterndorf-r1 (LTE-replug -> public IP): without
this fix all peers stayed Relayed; after daemon-restart 3/5 peers
came up as P2P. With this fix the same transition is automatic.

Spec section 5.1 listed Interface-Change as a Reset-Trigger but the
implementation was missed in the original Phase-3 plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ow-up)

The Phase-3.5 hotfix recreated workerICE on srReconnect but only
swapped the handshaker listener; no offer was sent. Result: pion's
agent stays nil until a peer-initiated offer arrives. HW-verified
on badmitterndorf via 144x "ICE Agent is not initialized yet"
warnings post-bounce with no peer transitioning back to P2P.

Now we explicitly call handshaker.SendOffer() after the listener
swap so the remote responds, the local OnNewOffer fires, and
reCreateAgent populates the new agent.

Refs netbirdio#5989 (Phase 3.5).
…gent

Phase 3.6 hotfix for netbirdio#5989. OnRemoteCandidate previously dropped any
candidate that arrived while w.agent was nil (= during the small race
window between OnNewOffer kicking off reCreateAgent and the agent
being assigned). After Phase-3.5 introduced in-process workerICE
recreate on network change, this race becomes much more frequent --
the remote peer often replies with candidates before our local
OnNewOffer handler has finished assigning the new agent.

Buffer up to 64 candidates per peer, drained inside OnNewOffer right
after w.agent is set. Buffer overflow logs a warning but is bounded
to prevent OOM under a misbehaving peer.

Verified: pre-fix on badmitterndorf showed ~150 "ICE Agent is not
initialized yet" warnings per peer after a network bounce, with no
P2P recovery within several minutes. After this fix the candidates
survive the race and pion can complete the pair-checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…recreate path

Phase 3.5 originally closed workerICE and replaced the wrapper via
NewWorkerICE() + listener-swap. Empirically this caused ICE to fail
with a ~13s pair-check timeout after a network event (LTE-glitch on
badmitterndorf), while a fresh daemon-restart converged in <1s on
the exact same network conditions. Most likely cause: state-leak
between the old and new workerICE wrappers (sockets, stdnet bindings)
that prevented the new pion agent from gathering working candidates.

Refactored to reuse the existing workerICE wrapper. The well-tested
"existing agent + new offer with different sessionID -> tear down +
reCreateAgent" branch already in worker_ice.go (Phase 1 code) is the
blessed path for in-place recreate. We just:
  1) Reset iceBackoff (counter -> 0)
  2) Close the current pion agent (w.agent = nil)
  3) handshaker.SendOffer() so the remote responds with a fresh
     sessionID that flows back through the still-attached listener

The buffer/race fix that netbirdio#5805 originally proposed and pappz argued
against is NOT needed here -- the warnings happen in both fresh-start
(P2P succeeds) and recreate (P2P fails), proving the warnings are
benign noise that pion handles via retransmits. The actual bug was
in the wrapper-replacement logic of the original Phase-3.5 commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous Phase-3.7 hotfix called handshaker.SendOffer() from
onNetworkChange. But Guard already drives sendOffer via its
newReconnectTicker (800ms initial, ~4 retries in the first 6s)
right after the same srReconnect event that fires our callback.

Both paths firing produces ~5 sendOffer calls per peer in ~6s.
Each offer routes to the remote's workerICE.OnNewOffer where (if
the local sessionID has changed since the last reCreateAgent) it
triggers a fresh tear-down + reCreateAgent. With 5 such triggers
in rapid succession, the remote ICE-pair-checks never complete --
they get torn down before pion's GatherCandidates / Dial can finish.

Empirically observed on badmitterndorf during real LTE-carrier
glitches: peers got 5 sending-offer log lines in 6s, then no
"set ICE to active" event for several minutes -> stayed Relayed.

Fix: do nothing about offers in onNetworkChange. Just close the
agent (w.agent = nil). The Guard's natural reconnect-burst drives
the next sendOffer, the remote replies, and our existing
"agent==nil + new offer -> reCreateAgent" path handles the
recreation cleanly.

Refs netbirdio#5989 (Phase 3.7 follow-up).
… (Guard-Loop Fix from netbirdio#5805)

Phase 3.7c (netbirdio#5989) re-introduces the Guard-Loop Fix from MichaelUray's
PR netbirdio#5805 that pappz did not accept (he believed sessionID-skip was
sufficient). Empirically on badmitterndorf during LTE-carrier
instability we now have hard evidence the bug is real cross-NAT too:

uray-mic-dh received 5 different sessionIDs from the remote peer in
2 minutes (06:27-06:29). Each different sessionID triggered the
in-place "tear-down + reCreateAgent" branch in OnNewOffer. With
both sides' Guards firing fresh offers in lockstep, the in-flight
ICE pair-checks (5-10s) never completed -- each new offer tore
down the previous attempt before pion could finish.

Fix: when agentConnecting==true, ignore any new offer regardless of
sessionID. Let the in-flight attempt either succeed (-> agent stays)
or fail (-> agentConnecting goes false naturally via closeAgent),
THEN allow a fresh sessionID to trigger reCreateAgent.
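
The guard logic can be sketched as follows, with hypothetical names standing in for the real workerICE state (the actual tear-down + reCreateAgent branch is in worker_ice.go): while agentConnecting is true, every new offer is dropped regardless of sessionID, so an offer burst cannot tear down an in-flight pair-check.

```go
package main

// worker is a stand-in for the workerICE state relevant to the Guard-Loop fix.
type worker struct {
	agentConnecting bool
	sessionID       string
	recreates       int // counts tear-down + reCreateAgent cycles
}

func (w *worker) OnNewOffer(sessionID string) {
	if w.agentConnecting {
		return // in-flight attempt wins; let it succeed or fail first
	}
	if sessionID == w.sessionID {
		return // same session: nothing to do
	}
	w.sessionID = sessionID
	w.agentConnecting = true
	w.recreates++ // stands in for tear-down + reCreateAgent
}

// closeAgent is how agentConnecting naturally goes false on ICE failure,
// after which a fresh sessionID may trigger reCreateAgent again.
func (w *worker) closeAgent() { w.agentConnecting = false }
```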

Refs netbirdio#5805 (Guard-Loop Fix), Phase 3.7c of netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In p2p-dynamic mode the runDynamicInactivityLoop calls DetachICEForPeer
when the per-peer iceTimeout fires, but does NOT call
lazyConnMgr.DeactivatePeer. The lazy state therefore stays at
watcherInactivity while the handshaker's iceListener is gone -- a
sub-state the lazy manager does not represent.

Before this change, ConnMgr.ActivatePeer gated AttachICE behind
lazyConnMgr.ActivatePeer's "found" flag, which is a one-shot edge
(watcherActivity -> watcherInactivity). After an iceTimeout-detach,
subsequent signal messages would call ActivatePeer with found=false,
AttachICE would never run, and remote OFFERs would silently drop at
handshaker.Listen():143 (iceListener==nil). The peer was stuck on
relay forever even while both sides kept signaling normally.

Symptom observed on bm router (uray-mic-dh peer): no "set ICE to
active" event for 22+ minutes, six different remote session IDs
received, zero ICE attempts started. Recovery was only possible via
daemon restart.

Fix:
1. ConnMgr.ActivatePeer: call AttachICE unconditionally for
   p2p-dynamic on every signal trigger. AttachICE is idempotent
   (returns nil if listener already attached, conn.go:1048) and
   honors iceBackoff.IsSuspended() so the failure-backoff is not
   bypassed.

2. handshaker.Listen: emit a Debug-level note when an OFFER/ANSWER
   arrives without an attached ICE listener (typical during ICE
   backoff suspension or relay-forced mode), so the dispatch state
   is observable in debug logs without alarming on healthy systems.

Verified live on bm router post-deploy:
- uray-mic-dh recovered to P2P (srflx/srflx) within seconds of
  restart, confirming the AttachICE flow runs end-to-end.
- ctb50-d (V7iprtVU... peer) cycles correctly: ICE failure ->
  backoff suspend -> backoff expire -> next signal triggers
  AttachICE -> ICE retry -> backoff doubles. Exact Phase-3
  behavior, now also working post-iceTimeout-detach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e 3.7e)

Phase 3.7d ensured AttachICE runs on every signal trigger via
ConnMgr.ActivatePeer, but that fix only takes effect when the next
signal arrives. After an LTE-modem replug the iceListener can end up
detached for some peers (paths via onICEFailed -> DetachICE during the
bounce window, plus concurrent state-change callbacks while the agent
is being torn down). Until the next signal triggers AttachICE, every
remote OFFER reaches handshaker.Listen() with iceListener==nil and is
silently dropped. In one observed case ygW6ySPb stayed on relay for
5 minutes after the bounce despite uray-mic-dh sending nine OFFERs
with fresh sessionIDs; recovery only happened after a manual daemon
restart.

Phase 3.5's onNetworkChange already runs once per peer right after
srReconnect: it resets iceBackoff and closes the workerICE agent. This
is the natural place to also force the iceListener back on, so the
Guard's reconnect-loop can drive a fresh sendOffer that actually
reaches OnNewOffer instead of being dropped.

Refactor:
- Extract attachICEListenerLocked from AttachICE. The locked helper
  returns true when a new attachment was made and false on no-op
  (already attached, ICE backoff suspended, handshaker not initialised,
  or workerICE not present).
- AttachICE checks iceBackoff.IsSuspended first to preserve the
  existing relay-forced-mode error semantics in
  TestConn_AttachICE_NoOpWhenSuspended, then delegates to
  attachICEListenerLocked, then sends an offer if a new attachment was
  made.
- onNetworkChange calls attachICEListenerLocked after closing the
  workerICE agent. Deliberately does NOT call SendOffer because the
  Guard reconnect-ticker already issues one right after the same
  srReconnect event; sending another here would re-introduce the
  offer-storm Phase 3.7b removed.
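
The split between the two callers can be sketched like this, under assumed names (the real helper also checks handshaker initialisation and workerICE presence): both AttachICE and onNetworkChange re-arm the listener through the same locked helper, but only AttachICE follows up with an offer.

```go
package main

// conn is a stand-in for the fields attachICEListenerLocked consults.
type conn struct {
	listenerAttached bool
	backoffSuspended bool
	offersSent       int
}

// attachICEListenerLocked returns true only when a new attachment was made;
// already-attached and suspended states are no-ops.
func (c *conn) attachICEListenerLocked() bool {
	if c.listenerAttached || c.backoffSuspended {
		return false
	}
	c.listenerAttached = true
	return true
}

func (c *conn) AttachICE() {
	if c.backoffSuspended {
		return // preserve the relay-forced / suspension semantics
	}
	if c.attachICEListenerLocked() {
		c.offersSent++ // handshaker.SendOffer in the real code
	}
}

// onNetworkChange re-arms the listener but deliberately sends no offer;
// the Guard's reconnect ticker drives the next one.
func (c *conn) onNetworkChange() {
	c.attachICEListenerLocked()
}
```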

Verified live on bm router with a 90 s management+signal blackhole:
- All 6 peers logged "ICE state reset on network change (agent closed;
  listener re-armed; Guard will resend offer)" within 1 ms of
  srReconnect.
- Peers whose listener had been detached (X+HhIybX, Rkqv) logged
  "ICE listener attached (locked path)" from the new helper.
- ygW6ySPb (uray-mic-dh), yhM26jA (dk20), dxvaVD2 (w11-test1) all
  reached "set ICE to active connection" within 75-85 s of the
  blackhole ending. Previously ygW6ySPb required a daemon restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hange (Phase 3.7f)

After Reset() is called for an srReconnect / network-change event,
the first ICE pair-check often fails because pion is still working
with NAT mappings that were torn down by the network event. Falling
back to the normal 1-minute initial exponential interval after that
single failure leaves the peer on relay for 60-100 seconds, far
longer than the underlying connectivity actually warrants - the
second attempt typically succeeds within a few seconds because the
new LTE/Wi-Fi mapping is now warm.

Behaviour change:
- Track lastResetAt in iceBackoffState; Reset() stamps it.
- Inside markFailure, while time.Since(lastResetAt) <
  networkChangeGracePeriod (30 s), use a fixed networkChangeRetryDelay
  of 5 s and do NOT advance the long-term exponential schedule.
- Outside the grace window, behaviour is unchanged: normal exponential
  backoff capped at maxBackoff.

Verified live on bm router with a 60 s management+signal blackhole
(equivalent to an LTE replug):

  Pre-3.7f real LTE bounce (08:03): uray-mic-dh stayed on relay for
  103 s before P2P recovered. ICE failure #1 -> 60 s+ exponential
  suspend -> ICE retry -> success.

  Post-3.7f 60 s blackhole (08:18): srReconnect at 08:18:46.672,
  ICE failure #1 at 08:18:59 -> "suspending for 5s, next retry at
  08:19:04", then "ICE success, resetting backoff (was 1 failures)"
  at 08:19:09. Total relay-only window: 23 s (4.5x faster).

  All four P2P-capable peers (uray-mic-d4, uray-mic-dh, dk20,
  w11-test1) reconverged within the same 23 s window.

Tests:
- TestIceBackoff_GracePeriodAfterReset_ShortDelay: first and second
  failure within the grace window both return networkChangeRetryDelay
  and do not advance the underlying ExponentialBackOff.
- TestIceBackoff_GraceExpired_NormalExponential: forcing lastResetAt
  into the past restores the ~1 m initial exponential delay.
- TestIceBackoff_NoGraceWithoutReset: a fresh state without an
  explicit Reset uses the normal exponential schedule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nnected (Phase 3.7g)

Phase 3.5 added workerICE.Close to onNetworkChange so that the ICE
agent gets recreated after an LTE-modem replug or other network event.
That's correct when the underlying peer-to-peer UDP path actually
broke -- pion's ICE state machine has already gone Disconnected/Failed
and w.agent has been cleared by closeAgent in onConnectionStateChange.

But many srReconnect events are NOT real connectivity losses. A brief
signal/relay outage (gRPC keepalive timeout, transient IP-blacklist,
DNS hiccup) fires srReconnect even though peer-to-peer WG keepalives
between clients kept flowing the whole time. Closing the still-working
ICE agent in that case forces:

  - workerICE.Close clears w.agent and triggers ICE.Failed callbacks
  - onICEFailed marks a backoff failure and calls DetachICE
  - WireGuard endpoint is removed (~1 s ping dropout)
  - Next signal triggers OnNewOffer -> reCreateAgent (~12 s pair-check)
  - Total observable interruption: 15-25 s for an event that should
    have been a no-op

Verified live on the second badmitterndorf router (172A2, wired to
home LAN, identical netbird config except runs on stable connectivity):

  Pre-3.7g (Phase 3.7f): a 60 s mgmt blackhole caused all four
  P2P-capable peers to log "ICE disconnected, do not switch to Relay
  ... configure WireGuard endpoint to ..." and a full 21 s
  ICE-renegotiation cycle even though each peer's WG endpoint ended
  up at the SAME address it had before.

  Post-3.7g: same 60 s mgmt blackhole, debug log shows
  "network change: skipping workerICE.Close (ICE still Connected,
  soft-fallback)" for every healthy peer. Zero state changes in
  netbird status, zero ICE failures, zero ping dropout.

Implementation:
- WorkerICE.IsConnected returns true when w.agent != nil and
  lastKnownState == ice.ConnectionStateConnected. Reads the same
  state machine that drives onConnectionStateChange, so it's
  authoritative.
- onNetworkChange wraps the workerICE.Close call in
  `if !workerICE.IsConnected()`. The LTE-bounce path is unchanged
  (pion has already cleared w.agent so IsConnected returns false
  and Close runs as before).
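
The skip-close decision can be sketched as follows, with stand-in types for the pion state machine (the real IsConnected reads lastKnownState under the workerICE mutex): onNetworkChange tears the agent down only when it is not currently Connected, so benign signal-outage reconnects leave a healthy P2P path alone.

```go
package main

type connState int

const (
	stateNew connState = iota
	stateConnected
	stateDisconnected
)

// workerICE is a stand-in for the fields the skip-close branch consults.
type workerICE struct {
	agentPresent   bool
	lastKnownState connState
	closed         bool
}

func (w *workerICE) IsConnected() bool {
	return w.agentPresent && w.lastKnownState == stateConnected
}

func (w *workerICE) Close() {
	w.agentPresent = false
	w.closed = true
}

// onNetworkChange mirrors the commit's guard: close only when the agent is
// not still connected (on a real LTE bounce pion has already cleared it).
func onNetworkChange(w *workerICE) (closedAgent bool) {
	if w.IsConnected() {
		return false // soft-fallback: keep the working agent
	}
	w.Close()
	return true
}
```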

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e 3.7h)

Real-world LTE-bounce traces showed that Phase 3.7f's 30 s grace window
only fits ~2 ICE attempts before the schedule jumps to a 1-minute
exponential suspend. Each pair-check takes ~12-15 s and the post-Reset
delay was 5 s, so by the second failure (T+~34 s) we were already
outside the grace window. Cold NAT mappings often need 3-4 attempts
to prime, so peers behind a single LTE/Wi-Fi NAT routinely waited
2-3 minutes for P2P recovery instead of the ~30-50 s the underlying
connectivity actually allowed.

Widen the window from 30 s to 60 s and shorten the retry delay from
5 s to 2 s. With these values:

  T=0 s    Reset (srReconnect / network change)
  T=~12 s  Attempt #1 fails (pion pair-check timeout)
  T=~14 s  Attempt #2 starts
  T=~26 s  Attempt #2 fails
  T=~28 s  Attempt #3 starts
  T=~40 s  Attempt #3 fails
  T=~42 s  Attempt #4 starts (still inside 60 s grace)
  T=~54 s  Attempt #4 fails / succeeds

Roughly twice as many priming attempts run before the exponential
schedule kicks in. The bm-LTE peer that previously needed 2 min 8 s
to recover should now converge inside the grace window.

Trade-off: chronically broken peers (behind a genuine symmetric NAT
on both sides) generate ~2x more signal/STUN traffic in the first
60 s after a network change. After the grace expires the schedule
falls back to the same exponential backoff as before, so the long-
term cost is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onfigure

Headless setups today need two steps to seed the active profile with a
non-default connection mode:

  netbird service install ...
  netbird up --connection-mode p2p-dynamic --p2p-timeout 60 ...

Add the same four profile-level flags to the install + reconfigure
commands and persist them via profilemanager.UpdateOrCreateConfig so
the daemon picks them up on first start. Reuses the package-level
vars + constants already defined for `up`, so naming and validation
stay consistent.

Flags added (mirrored from upCmd):
  --connection-mode  relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server
  --relay-timeout    seconds (0 = use server default)
  --p2p-timeout      seconds (0 = use server default; only effective
                     in p2p-dynamic mode)
  --p2p-retry-max    seconds (0 = use server default; built-in 15 min
                     fallback when server has not pushed a value)

Only fields whose flag was Changed() are written; unset flags leave
the existing profile untouched. Honors --config when present,
otherwise uses profilemanager.DefaultConfigPath.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the Android Advanced Settings UI in the Fyne-based desktop
client (Linux + Windows). The Network tab in NetBird Settings gains:

  - Connection Mode dropdown: Follow server, relay-forced, p2p,
    p2p-lazy, p2p-dynamic. "Follow server" clears any local override.
  - Relay Timeout (s): only meaningful in p2p-lazy / p2p-dynamic,
    disabled for other modes. Empty = use server default.
  - P2P Timeout (s): only meaningful in p2p-dynamic. Empty = use
    server default.
  - P2P Retry-Max (s): only meaningful in p2p-dynamic. Empty = use
    server default (or built-in 15-min fallback when the management
    server has not pushed a value).

The dropdown's onChange handler enables/disables the timeout entries
to match the inactivity-manager's actual scope (no inactivity teardown
runs in relay-forced / p2p, so those modes get all three fields
disabled).

Daemon plumbing:
  - GetConfigResponse gains connection_mode, p2p_timeout_seconds,
    relay_timeout_seconds, p2p_retry_max_seconds (proto + regenerated
    pb.go). server.go fills them from the active profile config so the
    GUI hydrates correctly on open.
  - SetConfigRequest already had these fields from Phase 1; the GUI's
    buildSetConfigRequest now populates them on every save.

State plumbing in client_ui.go:
  - 4 new fyne widgets (sConnectionMode + 3 entries) created in
    showSettingsUI alongside the existing checks.
  - 4 new state fields cached on serviceClient for change detection.
  - hasConnectionModeChanges() compares dropdown + entries against
    the cached state; saveSettings flow only sends a SetConfigRequest
    when there's a real diff.
  - parseUint32Field tolerates empty / non-numeric input (treated as
    0 = no override) so the user can clear a field without errors.

The lazy-connection checkbox in the systray menu is left in place;
it acts as a quick toggle for the legacy boolean and is kept for
backwards compatibility with profiles that have not yet adopted the
new ConnectionMode field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7h finalisation. GetConfigResponse gets 4 server_pushed_* fields
(connection_mode + 3 timeouts in seconds). ConnMgr captures the raw
values from each NetworkMap.PeerConfig and exposes them via
ServerPushed*() accessors, all guarded by sync.RWMutex (covered by
TestConnMgr_ServerPushedFieldsAreRaceSafe). Engine.ConnMgr() getter
lets the daemon-RPC layer reach the values; Server.GetConfig fills
them into every response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7h finalisation. The Connection Mode dropdown's "Follow server"
entry now suffixes "(currently: <mode>)" when the engine has received a
PeerConfig; the timeout entries' placeholders show the actual server
default seconds. The redundant "Enable Lazy Connections" tray submenu
is removed since Connection Mode covers it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lues

Phase 3.7h finalisation. Adds Get/Set{ConnectionMode, RelayTimeoutSeconds,
P2pTimeoutSeconds, P2pRetryMaxSeconds} on Preferences and the matching
GetServerPushed* accessors on Client (via connMgrSafe helper). Required
by the Android Connection-Mode picker (in netbird-android repo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MichaelUray MichaelUray force-pushed the pr/b-phase3.5-network-change branch from 6bf6777 to b2b91ef Compare May 6, 2026 06:53
@sonarqubecloud

sonarqubecloud Bot commented May 6, 2026

Quality Gate failed

Failed conditions
7 New issues
7 New Code Smells (required ≤ 0)

See analysis details on SonarQube Cloud



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (7)
client/internal/peer/guard/guard.go (1)

144-147: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Synchronous onNetworkChange() still blocks the guard select loop

The callback is still invoked synchronously inside the srReconnectedChan branch. Until it returns, ctx.Done(), relayedConnDisconnected, iCEConnDisconnected, and any further srReconnectedChan events are all blocked. This was flagged in the previous review and remains unresolved.

🛠 Proposed fix (async dispatch with single-flight coalescing)
+		// networkChangePending prevents redundant concurrent recreations.
+		var networkChangePending atomic.Bool

 		case <-srReconnectedChan:
 			g.log.Debugf("has network changes, reset reconnection ticker")
 			ticker.Stop()
 			ticker = g.newReconnectTicker(ctx)
 			tickerChannel = ticker.C
 			iceState.reset()
-			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
-			if g.onNetworkChange != nil {
-				g.onNetworkChange()
-			}
+			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+			// Dispatched asynchronously so the guard loop stays responsive to ctx.Done().
+			if g.onNetworkChange != nil && !networkChangePending.Swap(true) {
+				go func() {
+					defer networkChangePending.Store(false)
+					g.onNetworkChange()
+				}()
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/guard/guard.go` around lines 144 - 147, The
srReconnectedChan branch currently calls g.onNetworkChange() synchronously
inside the guard select loop, which blocks other cases; change it to dispatch
the work asynchronously: when handling srReconnectedChan, if g.onNetworkChange
!= nil spawn a goroutine to invoke it instead of calling it inline and use a
single-flight/coalescing mechanism (e.g., a singleflight.Group or an in-struct
boolean guarded by a mutex/atomic) so concurrent srReconnectedChan events only
trigger one inflight onNetworkChange invocation; ensure the async worker
performs the existing reset iceBackoff and recreate workerICE logic and that you
respect ctx cancellation to avoid leaking goroutines.
client/internal/peer/worker_ice.go (1)

218-230: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize lastKnownState across the callback and network-change paths.

IsConnected() at Lines 227-230 now drives the skip-close branch, but onConnectionStateChange() still writes lastKnownState without the same synchronization at Lines 550 and 562. That leaves a data race on the exact close-or-skip decision this PR added.

#!/bin/bash
rg -n -C2 '\blastKnownState\b' client/internal/peer/worker_ice.go

Also applies to: 545-563

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 218 - 230, IsConnected reads
lastKnownState under w.muxAgent while onConnectionStateChange writes
lastKnownState without locking, creating a race; fix by protecting all accesses
to lastKnownState with the same mutex (w.muxAgent) — update
onConnectionStateChange (the callback that sets lastKnownState at/around where
it currently assigns at lines ~545-563) to acquire w.muxAgent.Lock() and defer
w.muxAgent.Unlock() around writes (and any reads), ensuring symmetry with
IsConnected and preventing the close-or-skip race.
client/internal/conn_mgr.go (2)

282-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply timeout pushes even when the managed mode itself doesn’t change.

Lines 282-320 only restart the manager on modeChanged, but initLazyManager() consumes relayTimeoutSecs and p2pTimeoutSecs once at Lines 557-562. A push from, for example, p2p-dynamic(60/30) to p2p-dynamic(10/5) updates the fields but leaves the live manager running with the old thresholds until a full restart.

Also applies to: 553-562

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 282 - 320, The runtime ignores
timeout-only pushes if the mode doesn’t change; detect when
relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs change while a manager is active
and apply them by restarting or updating the manager: in conn_mgr.go where
modeChanged/wasManaged/isManaged are computed (and where lazyConnMgr is
checked), add a branch that if isManaged && e.lazyConnMgr != nil and any of
relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs differ from the manager’s
current thresholds, call e.closeManager(ctx) then e.initLazyManager(ctx),
e.startModeSideEffects(), and e.addPeersToLazyConnManager() (or alternatively
propagate the new values into the running manager); reference modeUsesLazyMgr,
lazyConnMgr, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs,
propagateP2pRetryMaxToConns, initLazyManager, closeManager,
startModeSideEffects, and addPeersToLazyConnManager when making the change.

181-188: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t leave e.mode in lazy/dynamic when Rosenpass prevents the manager from starting.

Lines 185-187 and 313-315 return early, but e.mode is still kept/set to p2p-lazy or p2p-dynamic. New peers then inherit that mode, AddPeerConn() falls back to conn.Open() at Lines 391-395 because no manager exists, and ModeP2PDynamic still defers ICE listener registration in client/internal/peer/conn.go at Lines 239-242. Since ActivatePeer() also returns early with no manager at Lines 449-451, those peers can stay relay-only indefinitely.

Also applies to: 312-320

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 181 - 188, When you bail out from
starting the lazy/dynamic manager (both the modeUsesLazyMgr() check and the
e.rosenpassEnabled check), clear/unset the lazy P2P mode on e.mode so it no
longer advertises "p2p-lazy" or "p2p-dynamic" to new peers; specifically, before
each early return set e.mode to a non-lazy value (i.e. remove the lazy/dynamic
flag or assign the normal/direct P2P mode) so callers like AddPeerConn() and
ActivatePeer() won’t treat peers as manager-backed (avoiding the conn.Open()
fallback and deferred ICE registration in ModeP2PDynamic).
client/internal/peer/conn.go (1)

202-212: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the explicit “disable backoff” state round-trippable.

propagateP2pRetryMaxToConns() converts the wire sentinel to time.Duration(0), but Line 1159 stores that back as plain 0. After a later reopen, Lines 204-207 reinterpret 0 as “use default”, so a user-explicit disable silently turns back into the 15-minute cap.

Also applies to: 1156-1162

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 202 - 212, The explicit "disable
backoff" sentinel is being lost because 0 is being reused for "use default";
introduce and use a distinct sentinel (e.g. const P2PRetryMaxDisabledSeconds =
-1) and update both propagateP2pRetryMaxToConns() (the code around 1156-1162)
and this initialization block in conn.go to interpret -1 as "disabled" (set
backoffCap to 0 or disable the iceBackoff entirely) while treating 0 as "use
built-in default" (map to DefaultP2PRetryMax); also ensure
newIceBackoff()/iceBackoff.SetMaxBackoff accept/handle the disabled sentinel so
the explicit disable round-trips unchanged.
client/ui/client_ui.go (1)

656-666: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not silently coerce invalid timeout input to 0.

Invalid/overflow timeout text currently collapses to 0, which can unintentionally clear overrides when saving settings. Parse should return an error and buildSetConfigRequest should surface it.

Proposed fix
-func parseUint32Field(text string) uint32 {
+func parseUint32Field(text string) (uint32, error) {
 	t := strings.TrimSpace(text)
 	if t == "" {
-		return 0
+		return 0, nil
 	}
 	v, err := strconv.ParseUint(t, 10, 32)
 	if err != nil {
-		return 0
+		return 0, fmt.Errorf("must be a whole number between 0 and %d", uint64(^uint32(0)))
 	}
-	return uint32(v)
+	return uint32(v), nil
 }
@@
-	relaySecs := parseUint32Field(s.iRelayTimeout.Text)
-	p2pSecs := parseUint32Field(s.iP2pTimeout.Text)
-	retrySecs := parseUint32Field(s.iP2pRetryMax.Text)
+	relaySecs, err := parseUint32Field(s.iRelayTimeout.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid relay timeout: %w", err)
+	}
+	p2pSecs, err := parseUint32Field(s.iP2pTimeout.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid P2P timeout: %w", err)
+	}
+	retrySecs, err := parseUint32Field(s.iP2pRetryMax.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid P2P retry-max: %w", err)
+	}
 	req.RelayTimeoutSeconds = &relaySecs
 	req.P2PTimeoutSeconds = &p2pSecs
 	req.P2PRetryMaxSeconds = &retrySecs

Also applies to: 741-750

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/client_ui.go` around lines 656 - 666, The parseUint32Field function
currently swallows parse/overflow errors and returns 0 which can accidentally
clear saved timeouts; change parseUint32Field to return (uint32, error) instead
of silently coercing — validate strings.TrimSpace(text) then call
strconv.ParseUint(..., 32) and return an explicit error on failure (including
overflow), and update all callers (notably buildSetConfigRequest) to propagate
or surface that error to the caller (e.g., return/build an error response
instead of using 0); apply the same change to the other similar parser at the
741-750 block so all timeout parsing surfaces errors instead of returning 0.
client/android/preferences.go (1)

349-352: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard negative timeout inputs before casting to uint32.

Negative values currently wrap into very large uint32 values and can silently corrupt timeout overrides. Treat <= 0 as “clear override” (0) before casting.

Proposed fix
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.RelayTimeoutSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.P2pTimeoutSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.P2pRetryMaxSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 370-373, 390-393

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 349 - 352, Guard against negative
timeout inputs in SetRelayTimeoutSeconds: treat any secs <= 0 as a clear
override by using 0 instead of allowing a negative int64 to wrap when cast to
uint32. Update SetRelayTimeoutSeconds to check secs <= 0 and set the config
pointer to a uint32(0) in that case; otherwise cast secs to uint32 and assign as
before. Apply the same pattern to the other timeout setter functions in this
file that perform the uint32 cast so they also handle non-positive inputs
safely.
🧹 Nitpick comments (1)
client/internal/lazyconn/manager/manager.go (1)

68-68: ⚡ Quick win

Remove unused inactivityThreshold field from Manager struct.

The field is assigned in NewManager (line 93) but never read; all timeout logic flows through resolvedTimeouts(). It's dead code from the migration and can be safely removed.

🧹 Removal diff
 type Manager struct {
 	engineCtx           context.Context
 	peerStore           *peerstore.Store
-	inactivityThreshold time.Duration
 
 	managedPeers         map[string]*lazyconn.PeerConfig
 	m := &Manager{
 		engineCtx:            engineCtx,
 		peerStore:            peerStore,
-		inactivityThreshold:  inactivity.DefaultInactivityThreshold,
 		managedPeers:         make(map[string]*lazyconn.PeerConfig),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/manager/manager.go` at line 68, Remove the dead
inactivityThreshold field from the Manager struct and eliminate its assignment
in NewManager: delete the inactivityThreshold declaration in the Manager type,
remove any parameter/argument or line in NewManager that sets
m.inactivityThreshold, and ensure no other code reads it (timeout logic should
remain using resolvedTimeouts()). Update related constructors/call sites to stop
passing that value and run tests to confirm nothing else references
inactivityThreshold.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/cmd/service_installer.go`:
- Around line 135-139: The call to applyConnectionModeFlagsToProfile currently
logs a warning on failure, allowing install/reconfigure to continue with
inconsistent connection-mode state; change this so that any error returned by
applyConnectionModeFlagsToProfile is propagated and causes the command to abort
(return the error) instead of printing a warning. Update both places where this
call appears (the install path around applyConnectionModeFlagsToProfile(cmd) and
the reconfigure path) to return the error up the call stack rather than using
cmd.PrintErrf, ensuring the command stops if persistence fails and no further
install/reconfigure steps run.

---

Duplicate comments:
In `@client/android/preferences.go`:
- Around line 349-352: Guard against negative timeout inputs in
SetRelayTimeoutSeconds: treat any secs <= 0 as a clear override by using 0
instead of allowing a negative int64 to wrap when cast to uint32. Update
SetRelayTimeoutSeconds to check secs <= 0 and set the config pointer to a
uint32(0) in that case; otherwise cast secs to uint32 and assign as before.
Apply the same pattern to the other timeout setter functions in this file that
perform the uint32 cast so they also handle non-positive inputs safely.
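The guard the review asks for amounts to clamping before the cast. A minimal sketch, assuming a standalone helper (the name `secsToUint32` is illustrative, not one of netbird's actual setters):

```go
package main

import "fmt"

// secsToUint32 treats any non-positive input as an explicit "clear
// override" (0) instead of letting a negative int64 wrap around when
// converted to uint32 (uint32(int64(-5)) would be 4294967291).
func secsToUint32(secs int64) uint32 {
	if secs <= 0 {
		return 0
	}
	return uint32(secs)
}

func main() {
	fmt.Println(secsToUint32(-5)) // 0
	fmt.Println(secsToUint32(30)) // 30
}
```

The same two-line guard can be repeated in each timeout setter that performs the cast.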

In `@client/internal/conn_mgr.go`:
- Around line 282-320: The runtime ignores timeout-only pushes if the mode
doesn’t change; detect when relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs
change while a manager is active and apply them by restarting or updating the
manager: in conn_mgr.go where modeChanged/wasManaged/isManaged are computed (and
where lazyConnMgr is checked), add a branch that if isManaged && e.lazyConnMgr
!= nil and any of relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs differ from
the manager’s current thresholds, call e.closeManager(ctx) then
e.initLazyManager(ctx), e.startModeSideEffects(), and
e.addPeersToLazyConnManager() (or alternatively propagate the new values into
the running manager); reference modeUsesLazyMgr, lazyConnMgr, relayTimeoutSecs,
p2pTimeoutSecs, p2pRetryMaxSecs, propagateP2pRetryMaxToConns, initLazyManager,
closeManager, startModeSideEffects, and addPeersToLazyConnManager when making
the change.
- Around line 181-188: When you bail out from starting the lazy/dynamic manager
(both the modeUsesLazyMgr() check and the e.rosenpassEnabled check), clear/unset
the lazy P2P mode on e.mode so it no longer advertises "p2p-lazy" or
"p2p-dynamic" to new peers; specifically, before each early return set e.mode to
a non-lazy value (i.e. remove the lazy/dynamic flag or assign the normal/direct
P2P mode) so callers like AddPeerConn() and ActivatePeer() won’t treat peers as
manager-backed (avoiding the conn.Open() fallback and deferred ICE registration
in ModeP2PDynamic).
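The timeout-only-push branch from the first conn_mgr comment can be reduced to a comparison of the running manager's thresholds against the pushed ones. A self-contained sketch, where every type and name (`thresholds`, `manager`, `engine`, `applyConfig`) is a placeholder for the engine internals the review mentions:

```go
package main

import "fmt"

// thresholds holds the three pushed timeout knobs the review names.
type thresholds struct {
	relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs uint32
}

type manager struct{ t thresholds }

type engine struct{ mgr *manager }

// applyConfig restarts the manager when the mode stays managed but any
// threshold differs; the rebuild stands in for closeManager followed by
// initLazyManager/startModeSideEffects/addPeersToLazyConnManager.
func (e *engine) applyConfig(isManaged bool, t thresholds) (restarted bool) {
	if isManaged && e.mgr != nil && e.mgr.t != t {
		e.mgr = &manager{t: t}
		return true
	}
	return false
}

func main() {
	e := &engine{mgr: &manager{t: thresholds{relayTimeoutSecs: 35}}}
	fmt.Println(e.applyConfig(true, thresholds{relayTimeoutSecs: 60})) // true
	fmt.Println(e.applyConfig(true, thresholds{relayTimeoutSecs: 60})) // false
}
```

The alternative the comment allows — propagating the new values into the running manager — avoids the restart but requires the manager's thresholds to be safely mutable at runtime.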

In `@client/internal/peer/conn.go`:
- Around line 202-212: The explicit "disable backoff" sentinel is being lost
because 0 is being reused for "use default"; introduce and use a distinct
sentinel (e.g. const P2PRetryMaxDisabledSeconds = -1) and update both
propagateP2pRetryMaxToConns() (the code around 1156-1162) and this
initialization block in conn.go to interpret -1 as "disabled" (set backoffCap to
0 or disable the iceBackoff entirely) while treating 0 as "use built-in default"
(map to DefaultP2PRetryMax); also ensure
newIceBackoff()/iceBackoff.SetMaxBackoff accept/handle the disabled sentinel so
the explicit disable round-trips unchanged.

In `@client/internal/peer/guard/guard.go`:
- Around line 144-147: The srReconnectedChan branch currently calls
g.onNetworkChange() synchronously inside the guard select loop, which blocks
other cases; change it to dispatch the work asynchronously: when handling
srReconnectedChan, if g.onNetworkChange != nil spawn a goroutine to invoke it
instead of calling it inline and use a single-flight/coalescing mechanism (e.g.,
a singleflight.Group or an in-struct boolean guarded by a mutex/atomic) so
concurrent srReconnectedChan events only trigger one inflight onNetworkChange
invocation; ensure the async worker performs the existing reset iceBackoff and
recreate workerICE logic and that you respect ctx cancellation to avoid leaking
goroutines.

In `@client/internal/peer/worker_ice.go`:
- Around line 218-230: IsConnected reads lastKnownState under w.muxAgent while
onConnectionStateChange writes lastKnownState without locking, creating a race;
fix by protecting all accesses to lastKnownState with the same mutex
(w.muxAgent) — update onConnectionStateChange (the callback that sets
lastKnownState at/around where it currently assigns at lines ~545-563) to
acquire w.muxAgent.Lock() and defer w.muxAgent.Unlock() around writes (and any
reads), ensuring symmetry with IsConnected and preventing the close-or-skip
race.
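The locking symmetry the comment asks for looks like this in miniature — reader and writer take the same mutex around the shared field. Types and names here are illustrative stand-ins for the actual workerICE internals:

```go
package main

import (
	"fmt"
	"sync"
)

type connState int

const (
	stateDisconnected connState = iota
	stateConnected
)

type workerICE struct {
	muxAgent       sync.Mutex
	lastKnownState connState // only touched with muxAgent held
}

// onConnectionStateChange is the writer: it takes the same lock the
// reader uses, closing the race the review describes.
func (w *workerICE) onConnectionStateChange(s connState) {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()
	w.lastKnownState = s
}

// IsConnected is the reader; with both sides locked, `go test -race`
// no longer flags the access.
func (w *workerICE) IsConnected() bool {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()
	return w.lastKnownState == stateConnected
}

func main() {
	w := &workerICE{}
	w.onConnectionStateChange(stateConnected)
	fmt.Println(w.IsConnected()) // true
}
```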

In `@client/ui/client_ui.go`:
- Around line 656-666: The parseUint32Field function currently swallows
parse/overflow errors and returns 0 which can accidentally clear saved timeouts;
change parseUint32Field to return (uint32, error) instead of silently coercing —
validate strings.TrimSpace(text) then call strconv.ParseUint(..., 32) and return
an explicit error on failure (including overflow), and update all callers
(notably buildSetConfigRequest) to propagate or surface that error to the caller
(e.g., return/build an error response instead of using 0); apply the same change
to the other similar parser at the 741-750 block so all timeout parsing surfaces
errors instead of returning 0.
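The error-returning parser the comment suggests is small. A sketch, assuming empty input means "no override" while malformed or overflowing input must surface an error instead of being coerced to 0:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseUint32Field returns an explicit error for malformed or
// out-of-range input; ParseUint with bitSize 32 rejects overflow,
// so a typo can no longer silently clear a saved timeout.
func parseUint32Field(text string) (uint32, error) {
	text = strings.TrimSpace(text)
	if text == "" {
		return 0, nil // empty field: no override requested
	}
	v, err := strconv.ParseUint(text, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid timeout value %q: %w", text, err)
	}
	return uint32(v), nil
}

func main() {
	v, err := parseUint32Field(" 30 ")
	fmt.Println(v, err)
	_, err = parseUint32Field("not-a-number")
	fmt.Println(err != nil)
}
```

A caller such as buildSetConfigRequest would then propagate the error to the UI rather than sending 0 to the daemon.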

---

Nitpick comments:
In `@client/internal/lazyconn/manager/manager.go`:
- Line 68: Remove the dead inactivityThreshold field from the Manager struct and
eliminate its assignment in NewManager: delete the inactivityThreshold
declaration in the Manager type, remove any parameter/argument or line in
NewManager that sets m.inactivityThreshold, and ensure no other code reads it
(timeout logic should remain using resolvedTimeouts()). Update related
constructors/call sites to stop passing that value and run tests to confirm
nothing else references inactivityThreshold.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a37e050-20cd-4834-a220-5d197ba3378d

📥 Commits

Reviewing files that changed from the base of the PR and between 6bf6777 and b2b91ef.

⛔ Files ignored due to path filters (1)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (19)
  • client/android/client.go
  • client/android/preferences.go
  • client/cmd/service.go
  • client/cmd/service_installer.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/engine.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/peer/conn.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/worker_ice.go
  • client/proto/daemon.proto
  • client/server/server.go
  • client/ui/client_ui.go
  • client/ui/const.go
  • client/ui/event_handler.go
💤 Files with no reviewable changes (2)
  • client/ui/const.go
  • client/ui/event_handler.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/peer/handshaker.go

Comment on lines +135 to +139
// Persist any profile-level connection-mode/timeout flags that
// were explicitly set so the daemon picks them up on first start.
if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
	cmd.PrintErrf("Warning: failed to persist connection-mode flags: %v\n", err)
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make connection-mode persistence failures abort the command.

These flags are not passed through buildServiceArguments(), so if applyConnectionModeFlagsToProfile fails here and execution continues, install/reconfigure completes with different connection-mode/timeout behavior than the user explicitly requested. This should return an error instead of downgrading to a warning.

Suggested change
-		if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
-			cmd.PrintErrf("Warning: failed to persist connection-mode flags: %v\n", err)
-		}
+		if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
+			return fmt.Errorf("persist connection-mode flags: %w", err)
+		}

Apply the same change in both the install and reconfigure paths.

Based on learnings, methods returning errors should be propagated early rather than continuing with partially applied state.

Also applies to: 263-265

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/cmd/service_installer.go` around lines 135 - 139, The call to
applyConnectionModeFlagsToProfile currently logs a warning on failure, allowing
install/reconfigure to continue with inconsistent connection-mode state; change
this so that any error returned by applyConnectionModeFlagsToProfile is
propagated and causes the command to abort (return the error) instead of
printing a warning. Update both places where this call appears (the install path
around applyConnectionModeFlagsToProfile(cmd) and the reconfigure path) to
return the error up the call stack rather than using cmd.PrintErrf, ensuring the
command stops if persistence fails and no further install/reconfigure steps run.

