
[client, management] Phase 1+2+3 of #5989: ConnectionMode foundation, two-timer lifecycle, iceBackoff (stack 1/4) #6081

Open
MichaelUray wants to merge 44 commits into netbirdio:main from
MichaelUray:pr/a-p2p-dynamic-foundation

Conversation


@MichaelUray MichaelUray commented May 5, 2026

[client+management] Phase 1+2+3 of #5989: ConnectionMode foundation, two-timer lifecycle, iceBackoff

Branch: pr/a-p2p-dynamic-foundation → base main
Compare: https://github.com/netbirdio/netbird/compare/main...MichaelUray:netbird:pr/a-p2p-dynamic-foundation?expand=1


Summary

This PR lays the foundation for the work proposed in issue #5989 ("Consolidate connection-mode flags; add p2p-dynamic and p2p-dynamic-lazy modes"). It introduces a single first-class ConnectionMode enum on the wire and across Settings, replacing the asymmetric LazyConnectionEnabled / IsForceRelayed precedence with that enum. It also adds the per-peer two-timer lifecycle (P2pTimeoutSeconds for ICE-inactivity detach, RelayTimeoutSeconds for full close) and lands the per-peer iceBackoffState so failed pair checks don't loop forever (P2pRetryMaxSeconds controls the cap).

It supersedes the older PR #6047 (which only landed Phase 1 and is no longer mergeable).

Why

NetBird currently has two flags fighting for the same job:

  • LazyConnectionEnabled (account-wide bool) — eager vs. lazy initial connect.
  • IsForceRelayed (per-peer hint) — skip ICE entirely.

A user who wants "all my peers should start lazy, but on demand try P2P, and if P2P never comes up, fall back to relay" can't express that with two booleans. The proposed ConnectionMode is one of p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy, relay-forced. Each mode unambiguously selects one combination of (eager-vs-lazy, skip-ICE-yes-no, ICE-fallback-to-relay-yes-no).

The p2p-dynamic variants additionally need two timers (one for ICE-only inactivity, one for full conn close) and an exponential ICE-failure backoff so a chronically-broken NAT path can't pin a peer to relay-with-stalled-retries forever.
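A minimal sketch of how one combination could be derived per mode. This is illustrative only: the Mode type, constant names, and the exact (lazy, skip-ICE, relay-fallback) mapping are assumptions; the PR's real enum lives in its shared/connectionmode package.

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the proposed ConnectionMode enum.
type Mode int

const (
	ModeP2P Mode = iota
	ModeP2PLazy
	ModeP2PDynamic
	ModeP2PDynamicLazy
	ModeRelayForced
)

// Decompose shows one plausible mapping of each mode onto the three
// booleans named in the issue: eager-vs-lazy initial connect, skip ICE
// entirely, and fall back to relay when ICE never comes up.
func (m Mode) Decompose() (lazy, skipICE, relayFallback bool) {
	switch m {
	case ModeP2P:
		return false, false, false
	case ModeP2PLazy:
		return true, false, false
	case ModeP2PDynamic:
		return false, false, true
	case ModeP2PDynamicLazy:
		return true, false, true
	case ModeRelayForced:
		return false, true, true
	}
	return
}

func main() {
	lazy, skip, fb := ModeP2PDynamicLazy.Decompose()
	fmt.Println(lazy, skip, fb) // true false true
}
```

The point is that each enum value picks exactly one row of this matrix, which two independent booleans cannot express.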

What's in this PR

Phase 1 — Wire + Settings + CLI plumbing

  • New proto field PeerConfig.ConnectionMode (enum) + P2pTimeoutSeconds, RelayTimeoutSeconds.
  • New OpenAPI AccountSettings.connection_mode + the two timeouts; mgmt PUT handler accepts and audit-logs them.
  • Client: --connection-mode, --p2p-timeout, --relay-timeout CLI flags + env-var bridge with deprecation warnings for the old flags.
  • connectionmode package with Mode type, proto bridge, ResolveModeFromEnv.

Phase 2 — Lifecycle mechanism

  • Conn.AttachICE() / DetachICE() — register/remove the ICE listener on the handshaker on demand instead of at Open().
  • client/peer/conn.Open() defers ICE-listener registration in p2p-dynamic; the ICE worker is created but the dispatch is held back until the activity listener fires.
  • Two-timer client/lazyconn/inactivity (per-peer ICE-inactivity vs. relay-timeout, separate channels).
  • conn_mgr.ActivatePeer/DeactivatePeer per-mode plumbing for ICE attach/detach.
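The attach/detach mechanism above can be sketched as a guarded, idempotent pair of methods. This is a simplified illustration under assumptions: the struct, field names, and comments about the handshaker calls are invented for the sketch; only the AttachICE/DetachICE names come from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// conn sketches the on-demand ICE listener lifecycle.
type conn struct {
	mu          sync.Mutex
	iceAttached bool
}

// AttachICE registers the ICE listener on demand; safe to call repeatedly.
func (c *conn) AttachICE() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.iceAttached {
		return // idempotent: already attached
	}
	c.iceAttached = true
	// real code would call handshaker.AddICEListener(...) and emit a fresh offer
}

// DetachICE removes the listener; the relay tunnel is left untouched.
func (c *conn) DetachICE() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.iceAttached {
		return // idempotent: nothing to do
	}
	c.iceAttached = false
	// real code would call handshaker.RemoveICEListener() and workerICE.Close()
}

func main() {
	c := &conn{}
	c.AttachICE()
	c.AttachICE() // no-op on second call
	fmt.Println(c.iceAttached) // true
	c.DetachICE()
	fmt.Println(c.iceAttached) // false
}
```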

Phase 3 — Backoff

  • client/peer.iceBackoffState truncated exponential schedule; markFailure, markSuccess, Reset API.
  • Conn.AttachICE no-ops while suspended; pion ICE state changes drive markFailure/markSuccess.
  • conn_mgr resolves P2pRetryMaxSeconds from server-pushed PeerConfig and propagates updates to active conns.
  • netbird status -d shows the current backoff state (Failures, NextRetry, Suspended) per peer; line is suppressed when nextRetry has passed.

Tests

  • go test ./client/internal/peer ./client/internal/peer/guard ./client/internal/lazyconn/... — pass.
  • go build ./client/... ./management/... — pass on linux/amd64, linux/arm64, windows/amd64.
  • Hardware-validated end-to-end on a 4-peer testbed (Windows 11 daemon + 3 OpenWrt routers) running the full p2p-dynamic lifecycle (cold-start → P2P → idle-detach to Relay → idle-close to "Idle" → wake → P2P recovery). The follow-up PRs (B, C, D) build on this foundation and the full lifecycle test goes through them.

Test plan

  • Unit tests for iceBackoffState (initial, exponential, reset, suspended-expired, max-cap, disabled-when-zero, grace-after-reset, no-grace-without-reset).
  • mgmt/grpc tests cover toPeerConfig mode resolution paths.
  • All proto regenerated cleanly (bash client/proto/generate.sh, bash shared/management/proto/generate.sh).
  • Maintainer review of the proto schema for backwards compatibility on legacy clients (no behavioural change for unset / unknown enum values; existing LazyConnectionEnabled still honoured by older daemons).

Use case

Real-world deployment: a fleet of OpenWrt routers + a small number of always-on Linux/Windows boxes. For roaming/laptop peers we want strict relay (relay-forced) for predictable latency; for the routers we want p2p-dynamic so the daemon attempts P2P on demand but doesn't hold an ICE pair when the link is quiet. With the current two-flag approach this configuration is impossible from the dashboard.

Linked work

What's not here (follow-up PRs)

  • PR-B: Phase 3.5 + 3.7d-h network-change handling, signal-trigger re-attach, ICE-grace-window for stale NAT mappings, GUI mode tab.
  • PR-C: Phase 3.7i peer-status visibility (per-peer remote meta, conn-state pusher with adaptive heartbeat, Peers tab in client UI).
  • PR-D: Activity-trigger fast-path (relay-state activity → re-attach ICE), Guard activity-driven retry-budget reset, hardening (legacy-fallback, Codex review fixes), Windows ICE filter case-insensitive (subset, branched separately as PR-Q2), keepWgPeer routed-subnet fix.

The PRs are stacked: B's base = A's branch; C's base = B; D's base = C. Reviewing A first lets the foundation land standalone.

Maintainers are welcome to push directly to this branch.

Summary by CodeRabbit

  • New Features
    • Configurable peer connection modes (relay, p2p, p2p-lazy, p2p-dynamic), account/client timeout and retry settings, new CLI flags, per-peer ICE attach/detach and exponential backoff; status now reflects ICE backoff state.
  • Documentation
    • OpenAPI and HTTP API responses updated to expose new experimental connection-mode and timeout fields.
  • Tests
    • Extensive unit tests for mode resolution, inactivity managers, ICE lifecycle/backoff, and related flows.
  • Chores
    • Audit events added for connection-mode and timeout setting changes.

Documentation

  • Documentation is not needed

These changes are internal lifecycle / behavioural improvements; no user-visible API or CLI flag added that warrants new public docs. Existing flags/Settings already documented at netbirdio/docs cover the surface area.

MichaelUray and others added 30 commits May 5, 2026 22:31
…nfig

Additive change for issue netbirdio#5989 Phase 1. New fields use new tag numbers
(11, 12, 13); existing fields (including LazyConnectionEnabled tag 6) are
unchanged so old clients ignore the additions and old servers send
UNSPECIFIED, which the new client maps back via the legacy boolean.

Note: the regenerated pb.go files now report protoc v5.29.3 in their
header (this branch was generated with locally-installed protoc 29.3
instead of upstream's v7.34.1). Functionally identical; header diff is
the only delta beyond the actual schema additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines Mode enum (relay-forced, p2p, p2p-lazy, p2p-dynamic plus the
client-only sentinels Unspecified and FollowServer), ParseString for
CLI/env input, ToProto/FromProto for wire translation, and the two
backwards-compat helpers ResolveLegacyLazyBool / ToLazyConnectionEnabled
that bridge the old Settings.LazyConnectionEnabled boolean.

Phase 1 of issue netbirdio#5989. Pure addition -- no existing callers touched
in this commit; the engine/conn_mgr migration follows in subsequent
commits in the same PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on warns

NB_CONNECTION_MODE wins over the legacy pair (NB_FORCE_RELAY,
NB_ENABLE_EXPERIMENTAL_LAZY_CONN); when the legacy pair is set
together, NB_FORCE_RELAY wins (most-restrictive, mirrors the
group-conflict rule from issue netbirdio#5990).

Each legacy var emits a one-shot deprecation warning when it actually
contributes to the resolved mode. NB_LAZY_CONN_INACTIVITY_THRESHOLD
becomes an alias for the future relay_timeout setting and warns once.
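The precedence described above can be sketched as a pure function over the three env values. Hypothetical helper: the PR's real ResolveModeFromEnv lives in the connectionmode package and also emits the one-shot deprecation warnings, which are omitted here.

```go
package main

import "fmt"

// resolveModeFromEnv sketches the documented precedence:
// NB_CONNECTION_MODE > NB_FORCE_RELAY > NB_ENABLE_EXPERIMENTAL_LAZY_CONN.
func resolveModeFromEnv(connMode, forceRelay, lazyConn string) string {
	if connMode != "" {
		return connMode // NB_CONNECTION_MODE wins outright
	}
	if forceRelay == "true" {
		return "relay-forced" // legacy pair set together: most restrictive wins
	}
	if lazyConn == "true" {
		return "p2p-lazy" // legacy NB_ENABLE_EXPERIMENTAL_LAZY_CONN
	}
	return "follow-server" // nothing set locally
}

func main() {
	// legacy pair set together: NB_FORCE_RELAY wins
	fmt.Println(resolveModeFromEnv("", "true", "true")) // relay-forced
	// explicit mode beats both legacy vars
	fmt.Println(resolveModeFromEnv("p2p-dynamic", "true", "true")) // p2p-dynamic
}
```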

IsForceRelayed() is kept for callers that have not yet been migrated
(conn.go, statusrecorder); they will be updated in the engine/conn
refactor commits later in this PR.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new CLI flags map onto the new connection-mode plumbing:
- --connection-mode <relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server>
- --relay-timeout <seconds>
- --p2p-timeout <seconds>

Plumbed through three sites in cmd/up.go (SetConfigRequest, ConfigInput,
LoginRequest), persisted in profilemanager.Config, and added as new
fields on the daemon.proto IPC messages. Empty / not-changed flags fall
back to the server-pushed value (which itself falls back to the legacy
lazy_connection_enabled boolean for old servers).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EngineConfig gains ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds.
ConnMgr now stores the resolved Mode plus the raw inputs (env, config)
so it can re-resolve when the server pushes a new PeerConfig.

UpdatedRemoteFeatureFlag is renamed to UpdatedRemotePeerConfig and
takes the full PeerConfig pointer; a thin shim with the old name
delegates to it for callers that haven't been updated yet.

connect.go copies the three new fields from profilemanager.Config into
the EngineConfig builder, with a tolerant parser that logs and falls
through to Unspecified on invalid input.

Phase 1 of issue netbirdio#5989. peer/conn.go forwarding follows in C4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ConnConfig gains a Mode field forwarded from the engine. Open() now
checks Mode == ModeRelayForced instead of calling the global env-reader
IsForceRelayed(). The local 'forceRelay' variable name is renamed to
'skipICE' to make the new branching intent explicit.

The PeerStateUpdate block at the end of Open() also reads from
conn.config.Mode now, so the StatusRecorder sees the per-peer mode
rather than the global env var.

A single remaining caller of IsForceRelayed() (srWatcher.Start in
engine.go) is left for a follow-up; that path uses a process-wide flag
not per-peer state, so it can be migrated in Phase 2 once srWatcher
itself learns about ConnectionMode.

Phase 1 of issue netbirdio#5989. Engine forwarding (C5) follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
createPeerConn now reads ConnMgr.Mode() and copies it into
peer.ConnConfig, so the per-peer Open() loop in conn.go can take the
ModeRelayForced skip-ICE branch without reading the global env var.

This is the last wiring commit for the client side of Phase 1; the
server-side mgmt changes (Settings + OpenAPI + handler + audit +
NetworkMap-build) follow in Section D.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three fields are nullable to distinguish 'use built-in default'
(NULL) from explicit values (incl. 0 = never tear down). Copy() now
deep-clones the new pointer fields via two small helpers.
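The deep-clone pattern for nullable fields looks roughly like this. Illustrative sketch: the PR uses two small concrete helpers rather than this generic one.

```go
package main

import "fmt"

// clonePtr deep-copies a nullable scalar field: NULL (nil) stays nil,
// values are copied so the clone cannot alias the original pointer.
func clonePtr[T any](p *T) *T {
	if p == nil {
		return nil
	}
	v := *p
	return &v
}

func main() {
	mode := "p2p-dynamic"
	cp := clonePtr(&mode)
	mode = "relay-forced" // mutating the original must not affect the copy
	fmt.Println(*cp)      // p2p-dynamic
	fmt.Println(clonePtr[uint32](nil) == nil) // true
}
```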

GORM AutoMigrate creates the new columns at first start; existing
accounts have NULL in all three columns and resolve via the legacy
LazyConnectionEnabled boolean.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tings

Three new optional, nullable fields with descriptions of the NULL =
built-in-default semantics and the Phase-1-vs-Phase-2 status of
p2p-dynamic. Regenerated types.gen.go via the existing oapi-codegen
tooling.

The generated AccountSettingsConnectionMode enum has the canonical
values relay-forced / p2p / p2p-lazy / p2p-dynamic, plus a Valid()
helper for handler-side validation.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PUT /api/accounts/{id} now accepts connection_mode (validated against
the four-value enum via the generated
AccountSettingsConnectionMode.Valid()), p2p_timeout_seconds and
relay_timeout_seconds. NULL in the JSON body keeps the existing value
untouched (= "no client-side override on this round-trip"); explicit
NULL-clear via API uses a distinct PATCH-style call which is
out-of-scope for Phase 1.

Response payload mirrors the input fields back as nullable so the
dashboard can distinguish "use default" from "explicit value".

Phase 1 of issue netbirdio#5989. Audit-event emission follows in D5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AccountConnectionModeChanged (121), AccountRelayTimeoutChanged (122),
AccountP2pTimeoutChanged (123) -- emitted from account.go when settings
change. Per-peer / per-group event codes are reserved for Phase 3
(issue netbirdio#5990).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handleConnectionModeSettings is invoked from the same diff-detection
block as handleLazyConnectionSettings; emits one StoreEvent per
changed field (ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds)
with old/new values in the meta payload.

Four small ptr-equality / deref helpers are added for nullable string
and uint32 fields. They are package-private and named after the
existing convention used elsewhere in the package.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elds

Two changes in one commit because they're inseparable:

1. Move client/internal/peer/connectionmode/ to shared/connectionmode/.
   The package now needs to be importable from BOTH client/ and
   management/ (which is impossible while it lives under client/internal/
   per Go's internal-package rule). All imports updated; tests pass on
   both sides.

2. Extend management/internals/shared/grpc/conversion.go::toPeerConfig
   to populate the three new PeerConfig fields (ConnectionMode,
   P2PTimeoutSeconds, RelayTimeoutSeconds) using the connectionmode
   helpers. The legacy LazyConnectionEnabled boolean is now derived
   from the resolved Mode via ToLazyConnectionEnabled() rather than
   copied verbatim from Settings -- this is the central backwards-compat
   contract: old clients see only the boolean, new clients prefer the
   explicit enum and ignore the bool.

Resolution rules (Phase 1, account-wide only):
- Settings.ConnectionMode != nil and parses -> wins
- Otherwise -> ResolveLegacyLazyBool(LazyConnectionEnabled)
- timeouts: Settings.RelayTimeoutSeconds / P2pTimeoutSeconds when non-NULL,
  else 0 (= server has no preference; client uses built-in default)
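The resolution rules above amount to a small fallback chain. A sketch under assumptions: function and parser names here are illustrative stand-ins, not the PR's exact API.

```go
package main

import "fmt"

// parseMode is a stand-in for the connectionmode parser.
func parseMode(s string) (string, bool) {
	switch s {
	case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic":
		return s, true
	}
	return "", false
}

// resolveAccountMode sketches the Phase-1 rules: an explicit, parseable
// Settings.ConnectionMode wins; NULL or garbage falls through tolerantly
// to the legacy LazyConnectionEnabled boolean.
func resolveAccountMode(settingsMode *string, legacyLazy bool) string {
	if settingsMode != nil {
		if m, ok := parseMode(*settingsMode); ok {
			return m
		}
	}
	if legacyLazy {
		return "p2p-lazy"
	}
	return "p2p"
}

func main() {
	garbage := "warp-speed"
	fmt.Println(resolveAccountMode(&garbage, true)) // p2p-lazy: tolerant fallback
	fmt.Println(resolveAccountMode(nil, false))     // p2p: built-in default
}
```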

Per-peer / per-group resolution comes in Phase 3 (netbirdio#5990).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Nine sub-cases cover the Phase-1 resolution matrix from spec section 3:
- no settings -> default (P2P + lazy=false)
- legacy bool only -> mapped via ResolveLegacyLazyBool
- explicit ConnectionMode -> wins over the legacy bool
- timeouts propagate
- garbage ConnectionMode value -> tolerant fallback to legacy bool

Particular attention to the structural compat gap: relay-forced cannot
be expressed via the legacy boolean, so the wire field for old clients
is sent as false. Documented in the spec, asserted here.

Existing TestAccount_GetPeerNetworkMap remains green: existing test
peers have ConnectionMode=NULL in Settings, falls through to the legacy
ResolveLegacyLazyBool(false) -> ModeP2P -> wire bool false.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of netbirdio#5989 needs to dynamically detach the ICE listener at runtime
(when p2p-dynamic mode hits its ICE-inactivity threshold). Adds the
RemoveICEListener method and extends the Add/RemoveICEListener pair to
hold h.mu; Listen() now reads h.iceListener through readICEListener
under the same mutex, fixing a latent race that didn't matter while
AddICEListener was the only writer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of netbirdio#5989 needs the daemon to fire ICE-teardown and relay-
teardown at different idle thresholds. Adds NewManagerWithTwoTimers
plus iceInactiveChan / relayInactiveChan; the periodic check loop
fires each channel when its threshold elapses. Threshold == 0 disables
that channel (peer.lazy: iceTimeout=0; peer.dynamic: both>0; eager
modes don't register peers at all).

NewManager (Phase-1 entry point) becomes a thin wrapper that
delegates to newManager with iceTimeout=0, preserving p2p-lazy
semantics exactly. The Phase-1 InactivePeersChan now aliases the
new RelayInactiveChan so existing callers (engine.go p2p-lazy path)
continue to work unchanged.

Five sub-tests cover the timeout matrix:
- TwoTimers_OnlyICEFires (peer idle between thresholds)
- TwoTimers_BothFire (peer idle past both)
- TwoTimers_ICEDisabled (iceTimeout=0)
- TwoTimers_RelayDisabled (relayTimeout=0)
- TwoTimers_BothDisabled (both 0 = manager inert)

Plus Phase1_LazyEquivalence proves NewManager-call-site behaviour is
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-tier timeouts for Phase-2 p2p-dynamic. The deprecated
InactivityThreshold pointer field becomes a backwards-compat alias
that maps onto RelayInactivityThreshold when set, so Phase-1 callers
that still pass it (engine.go, p2p-lazy tests) keep working without
edits.

resolvedTimeouts() helper centralises the alias logic. NewManager
picks the appropriate inactivity-manager constructor based on the
resolved timeouts; legacy callers continue through the Phase-1
single-timer code path.

InactivityManager() getter exposes the inner manager so the
conn_mgr can subscribe to ICEInactiveChan / RelayInactiveChan
in the p2p-dynamic dispatch loop.

Phase 2 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AttachICE registers the ICE-offer listener on the handshaker after the
activity-detector observes traffic; emits a fresh offer so the remote
side learns we are now ICE-capable. DetachICE removes the listener and
calls workerICE.Close() so the ICE worker tears down without affecting
the relay tunnel.

Both methods are idempotent and use Handshaker.readICEListener() under
its mutex, avoiding the race with Handshaker.Listen() that the prior
direct field read had. Used in Phase 2 by the inactivity manager to
flip ICE on/off independently of the relay tunnel.

Refs netbirdio#5989 (Phase 2 / connection-mode-phase2).
Phase 2 of netbirdio#5989. ModeP2PDynamic now takes a third branch in Open():
workerICE is constructed eagerly (so the activity-trigger does not
have to wait for ICE setup) but the AddICEListener call is skipped.
The handshaker still has workerICE wired in via NewHandshaker, so
buildOfferAnswer can include local ICE credentials when the listener
is later attached by Conn.AttachICE() on activity-trigger.

Behavior matrix in Open():
- ModeRelayForced: no workerICE, no ICE listener.
- ModeP2P / ModeP2PLazy: workerICE created, listener registered eagerly
  (Phase-1 lazy-tunnel defer remains at conn_mgr level).
- ModeP2PDynamic: workerICE created, listener deferred to AttachICE.

Refs netbirdio#5989 (Phase 2).
Phase 2 of netbirdio#5989. Fixes the lazy/eager mismatch where a remote peer's
GO_IDLE was silently dropped whenever the local manager was not in
lazy mode, leaving the eager local end to immediately reconnect and
defeating the remote's lazy/dynamic intent.

DeactivatePeer now dispatches by locally-resolved connection mode:
  p2p-lazy   -> existing lazy-manager teardown (whole tunnel)
  p2p-dynamic-> DetachICEForPeer (ICE only, relay tunnel stays up)
  p2p / relay-forced / unspecified -> silent no-op (eager modes
    deliberately ignore GO_IDLE because they are always-on)

DetachICEForPeer is also the entry point used by the inactivity
manager when its iceTimeout fires (engine wiring follows in E1).

Lookup miss is treated as a no-op: a peer can be removed concurrently
with a GO_IDLE or inactivity-timer event.

Refs netbirdio#5989 (Phase 2).
Phase 2 of netbirdio#5989. ConnMgr.Start now also brings up the lazy/dynamic
manager when mode==ModeP2PDynamic. A new runDynamicInactivityLoop
goroutine reads from inactivity.Manager.ICEInactiveChan() and
RelayInactiveChan() and dispatches DetachICEForPeer or full
Conn.Close per peer.

initLazyManager populates Config.RelayInactivityThreshold from
e.relayTimeoutSecs and Config.ICEInactivityThreshold from
e.p2pTimeoutSecs (only in dynamic mode; in lazy mode iceTimeout
stays 0 so the ICE channel never fires, preserving Phase-1
semantics exactly).

UpdatedRemotePeerConfig generalized to handle the dynamic mode
the same way it handled lazy: enable the manager on entry, tear
down on exit, restart on lazy<->dynamic transitions so the new
timeouts take effect.

Refs netbirdio#5989 (Phase 2).
Phase-2 follow-up. The dynamic-mode iceTimeout was always 0 in the
field because resolveConnectionMode only resolved relay-timeout, not
p2p-timeout, and the manager-init only consulted e.p2pTimeoutSecs
which never got updated from server pushes.

Resolution chain for P2pTimeoutSeconds is now symmetric to relay:
  cfgP2pTimeout (client config) > serverPC.GetP2PTimeoutSeconds()
(no env var; reserved for Phase 3).

UpdatedRemotePeerConfig now also stores the resolved p2p-timeout and
includes it in the early-return short-circuit so changes to the
server-pushed value trigger a re-init.

Refs netbirdio#5989 (Phase 2).
Phase 2 of netbirdio#5989. Closes the activity-attach loop: ActivatePeer is
called by the engine when a remote peer signals it is active. In
p2p-lazy this opens the relay tunnel; in p2p-dynamic the relay was
already eager-opened, so we additionally call Conn.AttachICE() to
register the deferred ICE listener and emit a fresh offer. Without
this, p2p-dynamic peers would never upgrade from relay to ICE.

Refs netbirdio#5989 (Phase 2).
…5989)

Phase 3 introduces per-peer ICE-failure backoff. Server-pushed cap
on the truncated-exponential schedule lives in this new field 14.

Wire-format sentinels:
  0   -> "not set", daemon falls back to built-in 15 min default
  max -> user explicitly disabled backoff (Phase-2 behavior)
  N   -> N seconds as the MaxInterval of cenkalti/backoff

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. Account-wide cap on the ICE-failure backoff
schedule. Nullable: NULL = use daemon default (15 min), 0 = treated
as "user-explicit disable" by the conversion layer (signaled via
uint32-max sentinel on the wire). Plus deep-copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. New field documented in the public API. Defaults
to 900 seconds (15 min). 0 disables backoff. Regenerated types.gen.go
via oapi-codegen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. The PUT handler now persists the new account-wide
ICE-backoff cap, and the GET response surfaces it back. NULL stays
NULL (= "use daemon default"). Mirrors the Phase-1+2 plumbing of
relay_timeout_seconds and p2p_timeout_seconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. New activity-code 124 (AccountP2pRetryMaxChanged)
plus the StoreEvent call inside handleConnectionModeSettings, mirroring
the Phase-1 Relay-timeout and Phase-2 P2P-timeout patterns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apping

Phase 3 of netbirdio#5989. The conversion layer translates between DB-NULL
(= "use daemon default") and Settings-zero (= "user-explicit disable")
using a uint32-max sentinel on the wire. Test matrix covers all
three states: NULL, 0, and a normal positive value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. Per-peer ICE-failure backoff using cenkalti/backoff
v4 (already imported by Guard). InitialInterval 1m, Multiplier 2.0,
RandomizationFactor 0.1 -- produces ~1m, ~2m, ~4m, ~8m, ~15m capped
sequence with default cap.

Public API: newIceBackoff, markFailure (returns delay for logging),
markSuccess, Reset (alias), IsSuspended, Snapshot (read-only view for
status output), SetMaxBackoff (live update).

maxBackoff=0 disables the backoff entirely (Phase-2 behavior).

Test matrix covers initial state, exponential growth, cap override,
disabled mode, success-reset, hard-reset, suspended-expiry, and live
SetMaxBackoff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MichaelUray and others added 10 commits May 5, 2026 22:31
Phase 3 of netbirdio#5989. New ConnConfig.P2pRetryMaxSeconds field plus an
iceBackoff field on Conn. Open() initializes the backoff state with
the resolved cap; if cap is 0, uses DefaultP2PRetryMax (15 min).

The wire-format sentinel uint32-max for "user-explicit disable" is
translated to cap=0 by the resolver in engine.go (next commit), so
this code only sees positive caps OR 0-meaning-default.

Refs netbirdio#5989 (Phase 3).
Phase 3 of netbirdio#5989. The Phase-2 AttachICE primitive now consults the
iceBackoffState before doing anything. When suspended (= recent ICE
failure within the backoff window), AttachICE silently returns nil
and stays on relay. When the backoff expires, AttachICE proceeds
with the normal listener-attach + offer-send.

This is the actual bug fix for the reset loop where each
ICEConnDisconnected event re-enabled the 3-quick-retries budget. The
backoff is consulted inside AttachICE itself, so any path that triggers
AttachICE (activity, signal-reconnect, candidate-change) goes
through the gate.

Refs netbirdio#5989 (Phase 3).
Phase 3 of netbirdio#5989. ConnectionStateFailed -> onICEFailed (markFailure +
DetachICE), ConnectionStateConnected/Completed -> onICEConnected
(markSuccess). Closes the loop: actual ICE outcomes now drive the
backoff state machine.
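The two transitions can be sketched as a tiny dispatcher over the terminal ICE states. Illustrative: backoffState here is a stand-in for the PR's iceBackoffState, and the real hooks also call DetachICE on failure.

```go
package main

import "fmt"

// backoffState is a minimal stand-in for the PR's iceBackoffState.
type backoffState struct{ failures int }

func (b *backoffState) markFailure() { b.failures++ }
func (b *backoffState) markSuccess() { b.failures = 0 }

// onICEStateChange sketches how terminal pion states drive the backoff:
// Failed records a failure; Connected/Completed reset the schedule.
func onICEStateChange(state string, b *backoffState) {
	switch state {
	case "Failed":
		b.markFailure()
	case "Connected", "Completed":
		b.markSuccess()
	}
}

func main() {
	b := &backoffState{}
	onICEStateChange("Failed", b)
	onICEStateChange("Failed", b)
	fmt.Println(b.failures) // 2
	onICEStateChange("Connected", b)
	fmt.Println(b.failures) // 0
}
```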

This is what makes the AttachICE-gate from the previous commit
actually engage. Without these hooks, the backoff state would never
transition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. Extends resolveConnectionMode to a 4-tuple return
(mode, relay-timeout, p2p-timeout, p2p-retry-max). Resolution chain
analogous to relay/p2p timeouts: client-config wins over server-push.

ConnMgr struct stores both the resolved value (p2pRetryMaxSecs) and
the raw client-config input (cfgP2pRetryMax) so subsequent server
pushes can be re-resolved correctly.

UpdatedRemotePeerConfig early-return shortcut now also checks
p2p-retry-max for changes.

Refs netbirdio#5989 (Phase 3).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 3 of netbirdio#5989. New propagateP2pRetryMaxToConns() iterates all
active peers and calls Conn.SetIceBackoffMax with the freshly resolved
value. Translates the uint32-max sentinel ("user-explicit disable")
to time.Duration(0) for the daemon-side semantics. NULL on server
(wire 0) maps to peer.DefaultP2PRetryMax (15 min built-in default).

Conn.SetIceBackoffMax updates the iceBackoffState live or, if the
Conn is not yet opened, stashes the value in config for Open() to
pick up.

IceBackoffSnapshot exposes a read-only view for the status output
(implemented in Section E).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. Local override hierarchy mirrors p2p-timeout:
config.json profile field > CLI flag > server-pushed. New
entry points for the account-wide ICE-failure backoff cap:
- CLI: netbird up --p2p-retry-max=600
- Profile (config.json): "p2p_retry_max_seconds": 600

EngineConfig.P2pRetryMaxSeconds wired through connect.go and
profilemanager (ConfigInput + Config + updateConfig handler).
daemon.proto gains p2p_retry_max_seconds = 43 on LoginRequest
and SetConfigRequest; daemon.pb.go regenerated.

Removes the D1 placeholder in conn_mgr.go: cfgP2pRetryMax
now reads from EngineConfig.P2pRetryMaxSeconds.

Env var NB_P2P_RETRY_MAX_SECONDS not added (matches Phase-2
pattern: p2p-timeout has no env-var path either).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989. Each new Conn now receives the resolved cap via
ConnConfig. Conn.Open() then initializes its iceBackoffState. The
sentinel-translation (uint32-max -> 0) happens inside Conn so the
engine-level code stays simple.

Refs netbirdio#5989 (Phase 3).
Phase 3 (E1) of netbirdio#5989. Full pipeline implemented:
- daemon.proto: add iceBackoffFailures/NextRetry/Suspended fields to PeerState (field IDs 20-22)
- client/internal/peer/status.go: add IceBackoff* fields to State struct + UpdatePeerIceBackoff() method
- client/internal/peer/conn.go: push snapshot to statusRecorder after onICEFailed / onICEConnected
- client/status/status.go: wire fields through ToProtoFullStatus(), PeerStateDetailOutput, parsePeers()
- parsePeers() appends "ICE backoff: suspended for ..." line only when suspended
- client/status/status_test.go: update JSON/YAML expectations for new fields

When not suspended, the line is omitted (no noise on healthy peers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989 follow-up. The PeerState.IceBackoffSuspended flag is
only refreshed on ICE state-change events (markFailure / markSuccess),
so it stays true even after the suspension window has elapsed. The
status detail-output now also wall-clock-checks IceBackoffNextRetry
before printing the line, avoiding lines like "suspended for -1m45s"
between the expiry and the next event-driven snapshot push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RetryMaxSeconds proto changes

Picked-up commits in this branch added new fields to daemon.proto and
management.proto (ConnectionMode enum, P2pTimeoutSeconds,
RelayTimeoutSeconds, P2pRetryMaxSeconds, ICE-backoff fields). The
generated .pb.go files in the original Phase-3.7i branch were left
out-of-date during the cherry-pick; regenerated here so the package
builds.

coderabbitai Bot commented May 5, 2026

📝 Walkthrough


Adds a connection-mode system with env/CLI/server precedence, new proto/API fields, dual inactivity timers and per-peer ICE backoff, mode-driven conn lifecycle (Attach/Detach ICE), client/server wiring, and extensive tests across components.

Changes

Connection Mode, Timeouts, and ICE Backoff

| Layer | File(s) | Summary |
| --- | --- | --- |
| Enum & Parsing | shared/connectionmode/mode.go, shared/connectionmode/mode_test.go | New Mode type, parsing, proto mappings, legacy converters, and unit tests. |
| Proto / API Schema | shared/management/proto/management.proto, client/proto/daemon.proto, shared/management/http/api/openapi.yml, shared/management/http/api/types.gen.go | Added connection_mode, p2p_timeout_seconds, relay_timeout_seconds, p2p_retry_max_seconds to PeerConfig/LoginRequest/SetConfigRequest and OpenAPI/types. |
| Server Settings & Persistence | management/server/types/settings.go, management/server/types/settings_test.go | Settings gains nullable pointer fields for ConnectionMode and timeouts; deep-copy helpers and tests added. |
| Server Audit & Activity Codes | management/server/account.go, management/server/activity/codes.go | Account manager emits audit events for changes to connection-mode and timeout/retry settings; new activity codes and map entries added. |
| gRPC Conversion & Server Tests | management/internals/shared/grpc/conversion.go, management/internals/shared/grpc/conversion_test.go | toPeerConfig resolves ConnectionMode (settings override the legacy lazy flag), computes timeouts and the P2P retry sentinel, and maps them into the wire PeerConfig; tests added. |
| HTTP Handlers & Tests | management/server/http/handlers/accounts/accounts_handler.go, management/server/http/handlers/accounts/accounts_handler_test.go | API input validation for ConnectionMode; response includes the new fields; PUT settings test for p2p_retry_max_seconds added. |
| CLI Flags & Request Wiring | client/cmd/root.go, client/cmd/up.go | New CLI flags --connection-mode, --relay-timeout, --p2p-timeout, --p2p-retry-max; when set, flags populate LoginRequest, SetConfigRequest, and the local ConfigInput. |
| Profile & Engine Wiring | client/internal/profilemanager/config.go, client/internal/connect.go, client/internal/engine.go | Profile Config/ConfigInput accept the new fields; EngineConfig parses the connection mode and carries relay/p2p/retry values; the engine populates peer.ConnConfig.Mode and P2pRetryMaxSeconds. |
| ConnMgr Resolution & Orchestration | client/internal/conn_mgr.go, client/internal/conn_mgr_test.go | ConnMgr stores env/cfg/server inputs, resolves mode and timeouts (env > client > server), exposes Mode/RelayTimeout/P2pRetryMax accessors, starts or restarts the lazy/dynamic inactivity manager on changes, and propagates the P2P retry max to peers; tests validate precedence and actions. |
| LazyConn Inactivity Manager | client/internal/lazyconn/inactivity/manager.go, client/internal/lazyconn/inactivity/manager_test.go, client/internal/lazyconn/manager/manager.go, client/internal/lazyconn/env.go | Introduces a two-timer inactivity manager (ICE vs Relay) with separate channels and a Phase-1 compatibility alias; Manager selects single- or two-timer mode; tests for the combinations added; deprecation docs for the legacy env vars. |
| Peer ICE Backoff & Lifecycle | client/internal/peer/ice_backoff.go, client/internal/peer/ice_backoff_test.go, client/internal/peer/conn.go, client/internal/peer/conn_test.go, client/internal/peer/worker_ice.go | Per-peer exponential ICE backoff state (markFailure/markSuccess/SetMaxBackoff/Snapshot); ConnConfig adds Mode and P2pRetryMaxSeconds; Open initializes/updates the backoff; mode-driven skip/defer ICE logic; AttachICE/DetachICE added; the worker updates the backoff on success/failure; tests added. |
| Handshaker Thread-Safety | client/internal/peer/handshaker.go, client/internal/peer/handshaker_test.go | Guard the ICE listener with a mutex; add RemoveICEListener and readICEListener; Listen uses safe reads; tests added. |
| Status & CLI Output | client/internal/peer/status.go, client/status/status.go, client/status/status_test.go, client/server/setconfig_test.go | Peer state includes IceBackoffFailures/NextRetry/Suspended; Status.UpdatePeerIceBackoff updates the state; ToProto and the human-readable output include the new fields; tests and SetConfig expectations updated. |
| Misc Wiring & Tests | client/internal/connect.go, client/internal/conn_mgr_test.go, client/internal/lazyconn/*, client/internal/peer/env.go, client/internal/peer/env_test.go | Env parsing for NB_CONNECTION_MODE and deprecation warnings for the NB_LAZY_CONN_* timeout vars; many tests validate env resolution and manager behaviors. |

Sequence Diagram

sequenceDiagram
    participant Admin as Admin
    participant API as HTTP API
    participant Server as Management Server
    participant GRPC as gRPC Conversion
    participant Client as NetBird Client
    participant ConnMgr as ConnMgr
    participant Peer as Peer

    Admin->>API: PUT /accounts/{id} with connection_mode/timeouts
    API->>Server: validate & persist settings
    Server->>Server: emit audit events
    Server->>GRPC: toPeerConfig maps settings -> PeerConfig
    GRPC->>Client: push PeerConfig (mode, timeouts, retry_max)

    Client->>Client: ResolveModeFromEnv() (NB_CONNECTION_MODE, legacy flags)
    Client->>ConnMgr: init with env + CLI values
    ConnMgr->>ConnMgr: resolveConnectionMode(env > client > server)
    ConnMgr->>ConnMgr: start/restart lazy manager if mode/timeouts changed
    ConnMgr->>Peer: propagateP2pRetryMaxToConns -> update peer backoff max

    ConnMgr->>Peer: ActivatePeer (P2P dynamic) -> Peer.AttachICE()
    Peer->>Peer: ICE events -> onICEConnected()/onICEFailed() update backoff
    Peer->>ConnMgr: status -> Status.UpdatePeerIceBackoff()
Loading
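The env > client > server precedence in the diagram can be sketched as a simple first-non-empty resolution. Function and default are illustrative assumptions, not the real ConnMgr API.

```go
package main

import "fmt"

// resolveConnectionMode sketches the precedence described above: an env
// override wins over the client CLI value, which wins over the
// server-pushed setting. An empty string means "not set at that layer".
func resolveConnectionMode(env, cli, server string) string {
	for _, v := range []string{env, cli, server} {
		if v != "" {
			return v
		}
	}
	return "p2p" // assumed default when no layer sets a mode
}

func main() {
	// server says relay-forced, but the client flag wins
	fmt.Println(resolveConnectionMode("", "p2p-lazy", "relay-forced")) // p2p-lazy
	// env var trumps everything
	fmt.Println(resolveConnectionMode("p2p-dynamic", "p2p-lazy", "")) // p2p-dynamic
}
```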

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • pappz
  • lixmal

🐰 "I hopped through code where modes decide,
relay or P2P, with timers side by side.
Backoff counts and channels sing,
peers attach, detach — a lively spring.
A jittered retry, gentle and bright,
the network hums through day and night."


@MichaelUray MichaelUray changed the title Phase 1+2+3 of #5989: ConnectionMode foundation, two-timer lifecycle, iceBackoff [1/4 stacked] Phase 1+2+3 of #5989: ConnectionMode foundation, two-timer lifecycle, iceBackoff May 5, 2026
@MichaelUray
Contributor Author

Stack overview (PRs of issue #5989):

Maintainer note: each PR's branch is rebased on the previous PR's branch in the fork. Once this lands, the next PR can be rebased on main (its merge base will collapse to just its unique commits).

Plus: separate standalone — #6080 Windows ICE filter case-insensitive (no dependencies, applicable to current main directly).

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
management/server/types/settings.go (1)

99-126: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clone JWTAllowGroups in Copy().

Copy() still shares JWTAllowGroups with the source instance, so mutating the returned settings can mutate the original. This PR already tightens copy semantics for the new nullable fields, so leaving this slice aliased will keep producing surprising state bleed.

Suggested fix

```diff
 	settings := &Settings{
 		PeerLoginExpirationEnabled: s.PeerLoginExpirationEnabled,
 		PeerLoginExpiration:        s.PeerLoginExpiration,
 		JWTGroupsEnabled:           s.JWTGroupsEnabled,
 		JWTGroupsClaimName:         s.JWTGroupsClaimName,
 		GroupsPropagationEnabled:   s.GroupsPropagationEnabled,
-		JWTAllowGroups:             s.JWTAllowGroups,
+		JWTAllowGroups:             slices.Clone(s.JWTAllowGroups),
 		RegularUsersViewBlocked:    s.RegularUsersViewBlocked,
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/types/settings.go` around lines 99 - 126, Settings.Copy
currently returns a Settings that shares the JWTAllowGroups slice with the
source, causing mutations to bleed; update the Copy method (function
Settings.Copy) to clone JWTAllowGroups (e.g., use slices.Clone or an explicit
copy) when constructing the new Settings so the returned Settings has its own
independent slice just like PeerExposeGroups is cloned.
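The aliasing bug described here is easy to reproduce in isolation. The sketch below uses a trimmed-down Settings with only the affected field; the real struct has many more.

```go
package main

import (
	"fmt"
	"slices"
)

// Settings is a minimal stand-in for the server settings type,
// keeping only the slice field at issue.
type Settings struct {
	JWTAllowGroups []string
}

// Copy clones the slice so the returned Settings has its own backing
// array; without slices.Clone, mutating the copy mutates the original.
func (s *Settings) Copy() *Settings {
	return &Settings{JWTAllowGroups: slices.Clone(s.JWTAllowGroups)}
}

func main() {
	orig := &Settings{JWTAllowGroups: []string{"admins"}}
	cp := orig.Copy()
	cp.JWTAllowGroups[0] = "mutated"
	fmt.Println(orig.JWTAllowGroups[0]) // still "admins"
}
```

slices.Clone (Go 1.21+) allocates a new backing array of exactly the needed length, which is why the review suggests it over a manual copy loop.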
management/server/account.go (1)

332-338: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Push the new connection settings to live peers.

These fields are now saved and audited, but they never set updateAccountPeers. A change to ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds, or P2pRetryMaxSeconds therefore skips IncrementNetworkSerial and the later UpdateAccountPeers() call, so connected peers keep stale connection behavior until some unrelated update or reconnect.

Suggested fix

```diff
 		if oldSettings.RoutingPeerDNSResolutionEnabled != newSettings.RoutingPeerDNSResolutionEnabled ||
 			oldSettings.LazyConnectionEnabled != newSettings.LazyConnectionEnabled ||
+			!equalStringPtr(oldSettings.ConnectionMode, newSettings.ConnectionMode) ||
+			!equalUint32Ptr(oldSettings.RelayTimeoutSeconds, newSettings.RelayTimeoutSeconds) ||
+			!equalUint32Ptr(oldSettings.P2pTimeoutSeconds, newSettings.P2pTimeoutSeconds) ||
+			!equalUint32Ptr(oldSettings.P2pRetryMaxSeconds, newSettings.P2pRetryMaxSeconds) ||
 			oldSettings.DNSDomain != newSettings.DNSDomain ||
 			oldSettings.AutoUpdateVersion != newSettings.AutoUpdateVersion ||
 			oldSettings.AutoUpdateAlways != newSettings.AutoUpdateAlways {
 			updateAccountPeers = true
 		}
```

Also applies to: 372-375

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/account.go` around lines 332 - 338, The code only sets
updateAccountPeers when a subset of settings change; include comparisons for
ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds, and P2pRetryMaxSeconds
(the newly saved/audited connection settings) in the same conditional that sets
updateAccountPeers so that IncrementNetworkSerial() and UpdateAccountPeers() run
when any of those values change; make the same addition in the other identical
comparison block (the one around where other settings are checked) so both code
paths trigger updateAccountPeers when any connection-related field differs
between oldSettings and newSettings.
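The pointer-comparison helper referenced in the suggested fix could look like the sketch below; the name equalUint32Ptr matches the review text, but the actual helper in the codebase may differ. The key property is that nil ("unset") and a pointer to 0 ("explicitly zero") compare as different.

```go
package main

import "fmt"

// equalUint32Ptr treats two nil pointers as equal and a nil pointer as
// different from any concrete value, so "unset" and "explicit 0" remain
// distinguishable in the updateAccountPeers comparison.
func equalUint32Ptr(a, b *uint32) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

func main() {
	zero := uint32(0)
	fmt.Println(equalUint32Ptr(nil, nil))   // true: both unset
	fmt.Println(equalUint32Ptr(&zero, nil)) // false: explicit 0 vs unset
}
```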
client/cmd/up.go (1)

438-453: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject contradictory connection-mode overrides before building these payloads.

These builders can currently emit both LazyConnectionEnabled and ConnectionMode, so a command like --enable-lazy-connection=true --connection-mode=follow-server sends mutually inconsistent intent. Also, connectionMode is forwarded as an unchecked string here, so typos can be persisted and later fall through as “unspecified” instead of failing fast.

Please validate connectionMode before serialization and make the legacy lazy flag mutually exclusive with --connection-mode across all three payloads.

Also applies to: 572-583, 695-710

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/cmd/up.go` around lines 438 - 453, The code currently allows setting
both LazyConnectionEnabled and ConnectionMode together and forwards
connectionMode as an unchecked string; update the builders that set
req.LazyConnectionEnabled and req.ConnectionMode (the branches using
cmd.Flag(enableLazyConnectionFlag).Changed and
cmd.Flag(connectionModeFlag).Changed and the related payload constructors used
in the other two locations) to first validate connectionMode against the allowed
enum/values and return an error if invalid, and also reject the request early
when both flags are present (if cmd.Flag(enableLazyConnectionFlag).Changed &&
cmd.Flag(connectionModeFlag).Changed) with a clear error message; apply the same
mutually-exclusive check and validation to all three payload-building sites so
the legacy lazy flag cannot be used alongside --connection-mode and typos in
connectionMode fail fast.
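A minimal shape for that fail-fast validation, kept framework-free so it is self-contained; the function name and flag semantics here are illustrative, not the actual client/cmd/up.go code. In a Cobra-based CLI the mutual-exclusion half could likely also be expressed with cmd.MarkFlagsMutuallyExclusive.

```go
package main

import (
	"errors"
	"fmt"
)

// validateModeFlags sketches the check suggested above: reject the
// legacy lazy flag together with --connection-mode, and reject unknown
// mode strings before they are serialized into any request payload.
func validateModeFlags(lazyChanged, modeChanged bool, mode string) error {
	if lazyChanged && modeChanged {
		return errors.New("--enable-lazy-connection and --connection-mode are mutually exclusive")
	}
	valid := map[string]bool{
		"p2p": true, "p2p-lazy": true, "p2p-dynamic": true,
		"p2p-dynamic-lazy": true, "relay-forced": true,
	}
	if modeChanged && !valid[mode] {
		return fmt.Errorf("invalid --connection-mode %q", mode)
	}
	return nil
}

func main() {
	// both flags set: contradictory intent is rejected up front
	fmt.Println(validateModeFlags(true, true, "p2p"))
	// typo fails fast instead of persisting as "unspecified"
	fmt.Println(validateModeFlags(false, true, "follow-serverr"))
}
```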
🧹 Nitpick comments (7)
client/internal/peer/ice_backoff_test.go (1)

113-122: ⚡ Quick win

Add a transition test for disabling backoff while already suspended.

Current zero-cap coverage starts from a fresh state. Please add a case that does markFailure()SetMaxBackoff(0) and asserts immediate unsuspension plus cleared retry timestamp.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/ice_backoff_test.go` around lines 113 - 122, Add a test
that verifies disabling backoff while already suspended immediately clears
suspension: create s := newIceBackoff(somePositiveMax) then call s.markFailure()
and assert s.IsSuspended() is true, then call s.SetMaxBackoff(0) and assert
s.IsSuspended() is false and the internal retry timestamp (e.g., s.retryAt or
equivalent time field) is cleared/zero (use its IsZero check). Name the test
something like TestIceBackoff_DisableWhileSuspended and use newIceBackoff,
markFailure, SetMaxBackoff, IsSuspended and the retry timestamp field to locate
the logic.
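To show the transition end to end, here is a toy stand-in for the unexported iceBackoffState; the field and method names mirror the review comment, not the PR's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// backoffState is an illustrative stand-in for iceBackoffState with just
// enough state to demonstrate markFailure -> SetMaxBackoff(0).
type backoffState struct {
	max       time.Duration
	suspended bool
	nextRetry time.Time
}

func (b *backoffState) markFailure() {
	if b.max > 0 {
		b.suspended = true
		b.nextRetry = time.Now().Add(b.max)
	}
}

// SetMaxBackoff(0) disables backoff; it must also clear the suspension
// window, otherwise IsSuspended/snapshots would keep a stale NextRetry.
func (b *backoffState) SetMaxBackoff(d time.Duration) {
	b.max = d
	if d == 0 {
		b.suspended = false
		b.nextRetry = time.Time{}
	}
}

func main() {
	b := &backoffState{max: time.Minute}
	b.markFailure()
	fmt.Println(b.suspended) // true: suspended after a failure
	b.SetMaxBackoff(0)
	fmt.Println(b.suspended, b.nextRetry.IsZero()) // false true
}
```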
client/internal/lazyconn/inactivity/manager_test.go (1)

141-144: ⚡ Quick win

Restore newTicker after each test override.

These tests mutate a package-global and never put it back. That makes the suite order-dependent and brittle if later tests in this package rely on the real ticker or start running in parallel.

Suggested fix

```diff
 	fakeTick := make(chan time.Time, 1)
+	oldTicker := newTicker
 	newTicker = func(d time.Duration) Ticker {
 		return &fakeTickerMock{CChan: fakeTick}
 	}
+	t.Cleanup(func() {
+		newTicker = oldTicker
+	})
```

Also applies to: 181-184, 224-227, 264-267, 303-306, 341-344

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/inactivity/manager_test.go` around lines 141 - 144,
Tests in manager_test.go overwrite the package-global newTicker (e.g., setting
newTicker = func(...) Ticker { return &fakeTickerMock{CChan: fakeTick} }) but
never restore it, making the suite order-dependent; fix by capturing the
original function before each override (old := newTicker) and restoring it after
the test (use defer newTicker = old or t.Cleanup(func(){ newTicker = old }))
wherever newTicker is reassigned (instances around the fakeTickerMock overrides)
so the global is returned to its original value after each test.
management/internals/shared/grpc/conversion_test.go (1)

160-169: 💤 Low value

Reference the named sentinel constant in the assertion.

Line 166 asserts against the literal ^uint32(0); this test is in the same grpc package and can read the unexported p2pRetryMaxDisabledSentinel directly, which keeps the test honest if the sentinel value ever changes (e.g. moves to a different magic number) and makes the intent slightly clearer.

♻️ Proposed fix

```diff
-	if pc.P2PRetryMaxSeconds != ^uint32(0) {
-		t.Errorf("explicit 0 should map to uint32-max sentinel on the wire, got %d", pc.P2PRetryMaxSeconds)
+	if pc.P2PRetryMaxSeconds != p2pRetryMaxDisabledSentinel {
+		t.Errorf("explicit 0 should map to disabled-sentinel on the wire, got %d", pc.P2PRetryMaxSeconds)
 	}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/conversion_test.go` around lines 160 - 169,
Update TestToPeerConfig_P2pRetryMax_ExplicitDisable to assert against the
package's sentinel constant instead of the literal magic number: replace the
literal ^uint32(0) in the assertion with p2pRetryMaxDisabledSentinel so the test
uses the named sentinel from the grpc package (keep the rest of the test and the
call to toPeerConfigForTest unchanged).
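The wire convention behind this test can be sketched as follows. The constant name p2pRetryMaxDisabledSentinel is taken from the review text; the decode helper and default value are assumptions for illustration.

```go
package main

import "fmt"

// p2pRetryMaxDisabledSentinel mirrors the convention discussed above:
// uint32-max on the wire means "user explicitly disabled retries",
// while 0 means "unset, apply the default".
const p2pRetryMaxDisabledSentinel = ^uint32(0)

// effectiveRetryMax decodes the wire value into an effective cap plus a
// disabled flag, keeping "explicit 0" and "not set" distinguishable.
func effectiveRetryMax(wire, def uint32) (seconds uint32, disabled bool) {
	switch wire {
	case p2pRetryMaxDisabledSentinel:
		return 0, true // explicit disable
	case 0:
		return def, false // unset: fall back to the default
	default:
		return wire, false
	}
}

func main() {
	fmt.Println(effectiveRetryMax(p2pRetryMaxDisabledSentinel, 300)) // 0 true
	fmt.Println(effectiveRetryMax(0, 300))                           // 300 false
}
```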
client/internal/peer/env_test.go (1)

33-34: 💤 Low value

Use the package constants instead of literal env var names.

Lines 33–34 hardcode "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and "NB_LAZY_CONN_INACTIVITY_THRESHOLD" while the test lives in the same peer package and could reference envEnableLazyConn / envInactivityThreshold directly (you already do that for the exported EnvKeyNBConnectionMode / EnvKeyNBForceRelay two lines above). Keeping a single source of truth avoids silent test drift if a constant is ever renamed.

♻️ Proposed fix

```diff
-			t.Setenv("NB_ENABLE_EXPERIMENTAL_LAZY_CONN", c.envEnableLazy)
-			t.Setenv("NB_LAZY_CONN_INACTIVITY_THRESHOLD", c.envInactivity)
+			t.Setenv(envEnableLazyConn, c.envEnableLazy)
+			t.Setenv(envInactivityThreshold, c.envInactivity)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/env_test.go` around lines 33 - 34, The test currently
hardcodes the environment variable names "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and
"NB_LAZY_CONN_INACTIVITY_THRESHOLD"; change t.Setenv calls to use the package
constants envEnableLazyConn and envInactivityThreshold (the same constants used
elsewhere in this package) so the test references a single source of truth and
won't drift if names change.
client/internal/peer/conn.go (1)

1025-1054: 💤 Low value

AttachICE swallows SendOffer failures silently.

When SendOffer returns an error at line 1050, AttachICE only logs a warning and returns nil. The local listener is now attached but the remote peer has not been notified that we're ICE-capable, so until something else triggers a fresh offer (e.g. the guard goroutine), the dynamic-mode upgrade window is wasted. Worth at least propagating the error so callers in conn_mgr can decide whether to retry.

♻️ Proposed fix
 	if err := conn.handshaker.SendOffer(); err != nil {
-		conn.Log.Warnf("AttachICE: SendOffer failed: %v", err)
+		conn.Log.Warnf("AttachICE: SendOffer failed: %v", err)
+		return fmt.Errorf("AttachICE: send offer: %w", err)
 	}
 	return nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 1025 - 1054, AttachICE currently
adds the ICE listener (Conn.handshaker.AddICEListener with
Conn.workerICE.OnNewOffer) and then only logs SendOffer errors, leaving the
local listener attached while remote peer may never be notified; change
AttachICE so that if Conn.handshaker.SendOffer() returns an error it returns
that error to the caller instead of swallowing it, and ensure you undo the
listener attachment on failure (remove the listener or use a defer that removes
it when SendOffer fails) so state stays consistent; update callers in conn_mgr
to handle the returned error (retry/backoff) as needed.
client/internal/peer/conn_test.go (2)

362-384: 💤 Low value

Consider exposing a test helper instead of reaching into iceBackoff internals.

Lines 370-372 lock and directly mutate iceBackoff.mu and iceBackoff.nextRetry from the test. While same-package access is allowed in Go, this tightly couples the test to the unexported layout of iceBackoffState — any rename or restructuring of those fields silently breaks the test.

A small helper such as (b *iceBackoffState) forceExpiredForTest() or an exported Reset(nextRetry time.Time) test seam would decouple the test from internal state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_test.go` around lines 362 - 384, The test mutates
iceBackoff internals (locking iceBackoff.mu and setting iceBackoff.nextRetry)
which couples it to iceBackoffState layout; add a small test seam on
iceBackoffState (e.g., func (b *iceBackoffState) forceExpiredForTest() or func
(b *iceBackoffState) ResetNextRetry(t time.Time)) that performs the mutexed
update of nextRetry, then change TestConn_AttachICE_AfterBackoffExpiry to call
that helper after c.iceBackoff.markFailure() instead of touching mu/nextRetry
directly; this keeps Conn.AttachICE and newIceBackoff/markFailure unchanged but
removes direct field access from the test.

344-360: 💤 Low value

TestConn_AttachICE_NoOpWhenSuspended is implicitly order-dependent.

The test asserts err == nil with a nil workerICE only because AttachICE currently checks the backoff gate before the workerICE nil check. If that ordering ever changes, the test would either start returning a "nil workerICE" error (masking the suspend path) or panic, with no obvious test failure message pointing to the ordering assumption.

Consider adding an explicit comment documenting this dependency, or restructuring so the suspended path is exercised via a dedicated stub workerICE that records whether it was reached.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_test.go` around lines 344 - 360, The test
TestConn_AttachICE_NoOpWhenSuspended relies on AttachICE checking the backoff
gate before validating a nil workerICE — make this explicit and robust by
injecting a stub/fake workerICE into the Conn that will fail the test if its
attach path is invoked; set c.workerICE to that stub before calling
c.AttachICE(), keep c.iceBackoff.markFailure(), assert err == nil, and verify
the stub was not called and handshaker.readICEListener() remains nil. This
replaces the implicit nil-workerICE dependency with an explicit assertion that
the suspended path short-circuits AttachICE (referencing
TestConn_AttachICE_NoOpWhenSuspended, Conn.AttachICE, c.workerICE, and
handshaker.readICEListener()).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/internal/conn_mgr.go`:
- Around line 245-291: In UpdatedRemotePeerConfig: when the connection mode
remains managed (isManaged true) but any of the timeout/retry values change
(compare newRelay/newP2P/newP2pRetry against the previously-stored
e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs) you must restart the
running lazy manager so it picks up the new timers; call e.closeManager(ctx),
then re-init with e.initLazyManager(ctx), call e.startModeSideEffects(), and
re-add peers via e.addPeersToLazyConnManager() (also ensure
e.statusRecorder.UpdateLazyConnection(true) is set appropriately). Use the
existing symbols (UpdatedRemotePeerConfig, isManaged, e.lazyConnMgr,
closeManager, initLazyManager, startModeSideEffects, addPeersToLazyConnManager,
propagateP2pRetryMaxToConns) to locate and implement this branch.

In `@client/internal/engine.go`:
- Around line 1585-1586: The Mode and P2pRetryMaxSeconds values are only
snapshotted into peer.Conn at creation, so runtime changes via
ConnMgr.UpdatedRemotePeerConfig() won't affect existing conns; modify the code
so connections are reconfigured when the ConnMgr changes by adding an update
path: introduce or use methods on peer.Conn like SetMode(mode string) and
SetP2pRetryMax(seconds int) (or a single UpdateConfig) and, in the
ConnMgr.UpdatedRemotePeerConfig() handler inside engine (the code that currently
constructs conns with Mode: e.connMgr.Mode(), P2pRetryMaxSeconds:
e.connMgr.P2pRetryMax()), iterate existing e.peers and call those methods to
apply the new mode and retry value (and trigger any Open/AttachICE/DetachICE
transitions as needed) so live switches between p2p-lazy and p2p-dynamic take
effect without recreating connections.

In `@client/internal/lazyconn/manager/manager.go`:
- Around line 102-112: The kernel-bind path currently leaves m.inactivityManager
nil which causes panics when methods like Start, activateSinglePeer,
addActivePeer, DeactivatePeer and removePeer unconditionally dereference it; fix
this by ensuring m.inactivityManager is always a valid implementation: add a
lightweight no-op inactivity manager (e.g., inactivity.NewNoopManager returning
an object that implements the same interface as
inactivity.NewManager/NewManagerWithTwoTimers) and instantiate it in the else
branch where kernel mode is detected (replace the current log-only path), so all
subsequent calls to m.inactivityManager are safe without changing callers;
alternatively, if you prefer refusal, return an error from the constructor
instead of creating the manager (but pick one approach and apply it
consistently).

In `@client/internal/peer/conn.go`:
- Around line 96-101: The uint32 sentinel ^uint32(0) for "user-explicit disable"
is being lost; fix by preserving that sentinel in the config and handling it
explicitly: modify propagateP2pRetryMaxToConns() to pass the sentinel value (not
time.Duration(0)) into SetIceBackoffMax when the wire-value is the sentinel,
change SetIceBackoffMax(conn.config.P2pRetryMaxSeconds) so it stores the
sentinel (^uint32(0)) into conn.config.P2pRetryMaxSeconds instead of 0 when
disable was requested, and update conn.Open() to treat
conn.config.P2pRetryMaxSeconds == ^uint32(0) as "disabled (zero cap)" while
treating the genuine 0 (or other unset marker if you prefer) as "not set → apply
DefaultP2PRetryMax"; ensure all logic paths that previously wrote 0 for disable
now preserve/write the sentinel so the distinction remains.

In `@client/internal/peer/ice_backoff.go`:
- Around line 97-103: When resetting or disabling backoff (e.g., in markSuccess)
you must also clear the suspended retry timestamp so IsSuspended() and snapshots
do not retain a stale NextRetry; update markSuccess (and the corresponding
reset/disable handler in the 115-123 block) to set the internal nextRetry field
to the zero time (reset the suspension window) whenever you set s.suspended =
false and bo.Reset(), ensuring snapshots no longer report a stale NextRetry
after SetMaxBackoff(0) or a success.

In `@management/server/account.go`:
- Around line 459-489: The audit currently uses derefUint32Ptr (and
derefStringPtr for timeouts) which collapses nil and 0 to 0; update
handleConnectionModeSettings so timeout fields emit the raw pointer values
instead of dereferenced zeroed values (e.g., use oldSettings.RelayTimeoutSeconds
and newSettings.RelayTimeoutSeconds,
oldSettings.P2pTimeoutSeconds/newSettings.P2pTimeoutSeconds,
oldSettings.P2pRetryMaxSeconds/newSettings.P2pRetryMaxSeconds) in the StoreEvent
payload so nil stays nil and 0 stays 0; keep the equality checks
(equalUint32Ptr) but change the map entries for the three timeout events to pass
the pointer (or a small helper that returns either nil or uint32) instead of
derefUint32Ptr.

In `@management/server/http/handlers/accounts/accounts_handler.go`:
- Around line 223-240: The handler currently treats JSON null as "absent" so
ConnectionMode, P2pTimeoutSeconds, P2pRetryMaxSeconds and RelayTimeoutSeconds
can never be cleared once set; update the request DTO used by the accounts
update path (the req.Settings type referenced in accounts_handler.go) to
implement tri-state semantics (absent vs null vs value) — e.g. replace plain *T
fields with a custom Nullable[T] struct (or use *json.RawMessage / presence
flag) that records whether the field was provided and whether it was explicitly
null, change the handler logic in the update function to inspect that presence
flag and set
returnSettings.<ConnectionMode|P2pTimeoutSeconds|P2pRetryMaxSeconds|RelayTimeoutSeconds>
to nil when the client sent explicit null, set to the concrete value when
provided, and leave unchanged when absent, and update JSON unmarshalling/tests
accordingly.
- Around line 218-221: The handler currently returns a generic fmt.Errorf for an
invalid req.Settings.ConnectionMode which yields a 500; change this to return a
gRPC InvalidArgument error consistent with other validation failures in this
handler (e.g. use status.Errorf(codes.InvalidArgument, "invalid connection_mode
%q", modeStr) or the project’s util helper for invalid arguments) when
req.Settings.ConnectionMode.Valid() is false, referencing
req.Settings.ConnectionMode and modeStr so the error message includes the bad
value.
- Around line 229-240: The code casts int64 timeout fields from req.Settings
(P2pTimeoutSeconds, P2pRetryMaxSeconds, RelayTimeoutSeconds) directly to uint32
which can wrap on negative or too-large values; validate each non-nil
*req.Settings.* value is within 0..math.MaxUint32 before casting, and if out of
range return a validation error (HTTP 400) instead of assigning to
returnSettings.*; after the check perform v := uint32(*req.Settings.<Field>) and
set returnSettings.<Field> = &v.
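The bounds check described here can be sketched as a small helper; the function name and error wording are illustrative, not the handler's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// toUint32Checked rejects negative or too-large int64 values with a
// validation error instead of letting a direct uint32 cast silently wrap
// (e.g. -1 would otherwise become 4294967295).
func toUint32Checked(field string, v int64) (uint32, error) {
	if v < 0 || v > math.MaxUint32 {
		return 0, fmt.Errorf("%s must be between 0 and %d, got %d", field, int64(math.MaxUint32), v)
	}
	return uint32(v), nil
}

func main() {
	fmt.Println(toUint32Checked("p2p_timeout_seconds", 120))
	fmt.Println(toUint32Checked("p2p_timeout_seconds", -1))
}
```

In the handler this would run for each non-nil request field, mapping a failed check to an HTTP 400 rather than assigning the wrapped value to the settings.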

In `@shared/connectionmode/mode.go`:
- Around line 19-127: Add a new Mode constant for "p2p-dynamic-lazy" and wire it
through the conversions: define ModeP2PDynamicLazy alongside other Mode
constants, return "p2p-dynamic-lazy" from Mode.String(), accept
"p2p-dynamic-lazy" in ParseString(), map
mgmProto.ConnectionMode_CONNECTION_MODE_P2P_DYNAMIC_LAZY <-> ModeP2PDynamicLazy
in FromProto() and ToProto(), and make ModeP2PDynamicLazy return true from
ToLazyConnectionEnabled() (and ensure ResolveLegacyLazyBool behavior remains
correct). Update only the symbols ModeP2PDynamicLazy, String(), ParseString(),
FromProto(), ToProto(), ToLazyConnectionEnabled() to include this new mode and
keep ModeFollowServer behavior unchanged.

In `@shared/management/http/api/openapi.yml`:
- Line 365: The OpenAPI enum for connection_mode is missing the new value;
update the enum array under the connection_mode schema to include
"p2p-dynamic-lazy" alongside the existing values (relay-forced, p2p, p2p-lazy,
p2p-dynamic) so REST clients can set that mode; ensure you add the exact string
"p2p-dynamic-lazy" to the enum in the openapi YAML and run any spec
validation/generation steps you normally use.
- Around line 370-373: Update the stale documentation text in the OpenAPI
description that mentions "Phase 1 of issue `#5989`" and that "p2p-dynamic is
reserved" so it no longer misleads consumers; locate the block around the
p2p_timeout_seconds schema entry in openapi.yml (the paragraph starting with
"Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are functional...") and
replace or remove the phase-gating sentence to reflect the current status
(either remove the "p2p-dynamic is reserved" clause or update it to state that
p2p-dynamic is now functional), keeping the rest of the p2p_timeout_seconds
description consistent.

In `@shared/management/http/api/types.gen.go`:
- Around line 41-63: The generated enum AccountSettingsConnectionMode is missing
the new value "p2p-dynamic-lazy": add a new constant named
AccountSettingsConnectionModeP2pDynamicLazy with value "p2p-dynamic-lazy" to the
AccountSettingsConnectionMode block and include it in the
AccountSettingsConnectionMode.Valid() switch so the method returns true for that
value; update the OpenAPI source to include p2p-dynamic-lazy and regenerate
shared/management/http/api/types.gen.go so the generated code and Valid() check
are in sync.

---

Outside diff comments:
In `@client/cmd/up.go`:
- Around line 438-453: The code currently allows setting both
LazyConnectionEnabled and ConnectionMode together and forwards connectionMode as
an unchecked string; update the builders that set req.LazyConnectionEnabled and
req.ConnectionMode (the branches using
cmd.Flag(enableLazyConnectionFlag).Changed and
cmd.Flag(connectionModeFlag).Changed and the related payload constructors used
in the other two locations) to first validate connectionMode against the allowed
enum/values and return an error if invalid, and also reject the request early
when both flags are present (if cmd.Flag(enableLazyConnectionFlag).Changed &&
cmd.Flag(connectionModeFlag).Changed) with a clear error message; apply the same
mutually-exclusive check and validation to all three payload-building sites so
the legacy lazy flag cannot be used alongside --connection-mode and typos in
connectionMode fail fast.

In `@management/server/account.go`:
- Around line 332-338: The code only sets updateAccountPeers when a subset of
settings change; include comparisons for ConnectionMode, RelayTimeoutSeconds,
P2pTimeoutSeconds, and P2pRetryMaxSeconds (the newly saved/audited connection
settings) in the same conditional that sets updateAccountPeers so that
IncrementNetworkSerial() and UpdateAccountPeers() run when any of those values
change; make the same addition in the other identical comparison block (the one
around where other settings are checked) so both code paths trigger
updateAccountPeers when any connection-related field differs between oldSettings
and newSettings.

In `@management/server/types/settings.go`:
- Around line 99-126: Settings.Copy currently returns a Settings that shares the
JWTAllowGroups slice with the source, causing mutations to bleed; update the
Copy method (function Settings.Copy) to clone JWTAllowGroups (e.g., use
slices.Clone or an explicit copy) when constructing the new Settings so the
returned Settings has its own independent slice just like PeerExposeGroups is
cloned.
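The fix is a one-liner with `slices.Clone`. A minimal sketch, with `Settings` trimmed to the two slice fields (the real `Copy` also copies many scalars):

```go
package main

import (
	"fmt"
	"slices"
)

// Settings is reduced to the fields that matter for the aliasing bug.
type Settings struct {
	JWTAllowGroups   []string
	PeerExposeGroups []string
}

// Copy clones both slices so mutations on the copy cannot bleed back
// into the source (slices.Clone requires Go 1.21+).
func (s *Settings) Copy() *Settings {
	return &Settings{
		JWTAllowGroups:   slices.Clone(s.JWTAllowGroups), // the fix
		PeerExposeGroups: slices.Clone(s.PeerExposeGroups),
	}
}

func main() {
	src := &Settings{JWTAllowGroups: []string{"admins"}}
	cp := src.Copy()
	cp.JWTAllowGroups[0] = "mutated"
	fmt.Println(src.JWTAllowGroups[0]) // admins: source unaffected
}
```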

---

Nitpick comments:
In `@client/internal/lazyconn/inactivity/manager_test.go`:
- Around line 141-144: Tests in manager_test.go overwrite the package-global
newTicker (e.g., setting newTicker = func(...) Ticker { return
&fakeTickerMock{CChan: fakeTick} }) but never restore it, making the suite
order-dependent; fix by capturing the original function before each override
(old := newTicker) and restoring it after the test (use defer newTicker = old or
t.Cleanup(func(){ newTicker = old })) wherever newTicker is reassigned
(instances around the fakeTickerMock overrides) so the global is returned to its
original value after each test.

In `@client/internal/peer/conn_test.go`:
- Around line 362-384: The test mutates iceBackoff internals (locking
iceBackoff.mu and setting iceBackoff.nextRetry) which couples it to
iceBackoffState layout; add a small test seam on iceBackoffState (e.g., func (b
*iceBackoffState) forceExpiredForTest() or func (b *iceBackoffState)
ResetNextRetry(t time.Time)) that performs the mutexed update of nextRetry, then
change TestConn_AttachICE_AfterBackoffExpiry to call that helper after
c.iceBackoff.markFailure() instead of touching mu/nextRetry directly; this keeps
Conn.AttachICE and newIceBackoff/markFailure unchanged but removes direct field
access from the test.
- Around line 344-360: The test TestConn_AttachICE_NoOpWhenSuspended relies on
AttachICE checking the backoff gate before validating a nil workerICE — make
this explicit and robust by injecting a stub/fake workerICE into the Conn that
will fail the test if its attach path is invoked; set c.workerICE to that stub
before calling c.AttachICE(), keep c.iceBackoff.markFailure(), assert err ==
nil, and verify the stub was not called and handshaker.readICEListener() remains
nil. This replaces the implicit nil-workerICE dependency with an explicit
assertion that the suspended path short-circuits AttachICE (referencing
TestConn_AttachICE_NoOpWhenSuspended, Conn.AttachICE, c.workerICE, and
handshaker.readICEListener()).

In `@client/internal/peer/conn.go`:
- Around line 1025-1054: AttachICE currently adds the ICE listener
(Conn.handshaker.AddICEListener with Conn.workerICE.OnNewOffer) and then only
logs SendOffer errors, leaving the local listener attached while remote peer may
never be notified; change AttachICE so that if Conn.handshaker.SendOffer()
returns an error it returns that error to the caller instead of swallowing it,
and ensure you undo the listener attachment on failure (remove the listener or
use a defer that removes it when SendOffer fails) so state stays consistent;
update callers in conn_mgr to handle the returned error (retry/backoff) as
needed.
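The rollback-on-error shape looks like this. The handshaker type, `RemoveICEListener`, and the error are stand-ins invented for the sketch; only the attach / fail / undo / return pattern is the point:

```go
package main

import (
	"errors"
	"fmt"
)

type handshaker struct{ listener func() }

func (h *handshaker) AddICEListener(f func()) { h.listener = f }
func (h *handshaker) RemoveICEListener()      { h.listener = nil }
func (h *handshaker) SendOffer() error        { return errors.New("offer failed") }

type Conn struct{ handshaker *handshaker }

// AttachICE sketches the fix: attach the listener, and on SendOffer
// failure undo the attachment and surface the error to the caller
// instead of logging and leaving half-attached state behind.
func (c *Conn) AttachICE(onNewOffer func()) error {
	c.handshaker.AddICEListener(onNewOffer)
	if err := c.handshaker.SendOffer(); err != nil {
		c.handshaker.RemoveICEListener() // keep state consistent on failure
		return fmt.Errorf("send offer: %w", err)
	}
	return nil
}

func main() {
	c := &Conn{handshaker: &handshaker{}}
	err := c.AttachICE(func() {})
	fmt.Println(err != nil, c.handshaker.listener == nil) // true true
}
```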

In `@client/internal/peer/env_test.go`:
- Around line 33-34: The test currently hardcodes the environment variable names
"NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and "NB_LAZY_CONN_INACTIVITY_THRESHOLD";
change t.Setenv calls to use the package constants envEnableLazyConn and
envInactivityThreshold (the same constants used elsewhere in this package) so
the test references a single source of truth and won't drift if names change.

In `@client/internal/peer/ice_backoff_test.go`:
- Around line 113-122: Add a test that verifies disabling backoff while already
suspended immediately clears suspension: create s :=
newIceBackoff(somePositiveMax) then call s.markFailure() and assert
s.IsSuspended() is true, then call s.SetMaxBackoff(0) and assert s.IsSuspended()
is false and the internal retry timestamp (e.g., s.retryAt or equivalent time
field) is cleared/zero (use its IsZero check). Name the test something like
TestIceBackoff_DisableWhileSuspended and use newIceBackoff, markFailure,
SetMaxBackoff, IsSuspended and the retry timestamp field to locate the logic.

In `@management/internals/shared/grpc/conversion_test.go`:
- Around line 160-169: Update TestToPeerConfig_P2pRetryMax_ExplicitDisable to
assert against the package's sentinel constant instead of the literal magic
number: replace the literal ^uint32(0) in the assertion with
p2pRetryMaxDisabledSentinel so the test uses the named sentinel from the grpc
package (keep the rest of the test and the call to toPeerConfigForTest
unchanged).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34dd9c60-fb5b-443e-ae99-ca94bcb9fc13

📥 Commits

Reviewing files that changed from the base of the PR and between b19b746 and 75cbbe6.

⛔ Files ignored due to path filters (4)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • client/proto/daemon_grpc.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/proxy_service.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (37)
  • client/cmd/root.go
  • client/cmd/up.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/connect.go
  • client/internal/engine.go
  • client/internal/lazyconn/env.go
  • client/internal/lazyconn/inactivity/manager.go
  • client/internal/lazyconn/inactivity/manager_test.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/peer/conn.go
  • client/internal/peer/conn_test.go
  • client/internal/peer/env.go
  • client/internal/peer/env_test.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/handshaker_test.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/status.go
  • client/internal/peer/worker_ice.go
  • client/internal/profilemanager/config.go
  • client/proto/daemon.proto
  • client/status/status.go
  • client/status/status_test.go
  • management/internals/shared/grpc/conversion.go
  • management/internals/shared/grpc/conversion_test.go
  • management/server/account.go
  • management/server/activity/codes.go
  • management/server/http/handlers/accounts/accounts_handler.go
  • management/server/http/handlers/accounts/accounts_handler_test.go
  • management/server/types/settings.go
  • management/server/types/settings_test.go
  • shared/connectionmode/mode.go
  • shared/connectionmode/mode_test.go
  • shared/management/http/api/openapi.yml
  • shared/management/http/api/types.gen.go
  • shared/management/proto/management.proto

Comment on lines +245 to +291
func (e *ConnMgr) UpdatedRemotePeerConfig(ctx context.Context, pc *mgmProto.PeerConfig) error {
newMode, newRelay, newP2P, newP2pRetry := resolveConnectionMode(
e.envMode, e.envRelayTimeout, e.cfgMode, e.cfgRelayTimeout,
e.cfgP2pTimeout, e.cfgP2pRetryMax, pc,
)

if newMode == e.mode && newRelay == e.relayTimeoutSecs &&
newP2P == e.p2pTimeoutSecs && newP2pRetry == e.p2pRetryMaxSecs {
return nil
}
prev := e.mode
e.mode = newMode
e.relayTimeoutSecs = newRelay
e.p2pTimeoutSecs = newP2P
e.p2pRetryMaxSecs = newP2pRetry
e.propagateP2pRetryMaxToConns()

wasManaged := modeUsesLazyMgr(prev)
isManaged := modeUsesLazyMgr(newMode)
modeChanged := prev != newMode

if modeChanged && wasManaged && !isManaged {
log.Infof("lazy/dynamic connection manager disabled by management push (mode=%s)", newMode)
e.closeManager(ctx)
e.statusRecorder.UpdateLazyConnection(false)
return nil
}

if modeChanged && wasManaged && isManaged {
// Switching between lazy and dynamic at runtime: tear down the
// existing manager so initLazyManager picks up the new timeouts.
log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
e.closeManager(ctx)
e.statusRecorder.UpdateLazyConnection(false)
}

if isManaged && e.lazyConnMgr == nil {
if e.rosenpassEnabled {
log.Infof("rosenpass connection manager is enabled, lazy connection manager will not be started")
log.Warnf("rosenpass enabled, ignoring lazy/dynamic mode push")
return nil
}

log.Warnf("lazy connection manager is enabled by management feature flag")
log.Infof("lazy/dynamic connection manager enabled by management push (mode=%s)", newMode)
e.initLazyManager(ctx)
e.statusRecorder.UpdateLazyConnection(true)
e.startModeSideEffects()
return e.addPeersToLazyConnManager()
} else {
if e.lazyConnMgr == nil {
return nil
}
log.Infof("lazy connection manager is disabled by management feature flag")
e.closeManager(ctx)
e.statusRecorder.UpdateLazyConnection(false)
return nil
}
return nil

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart the lazy/dynamic manager when only the timeouts change.

newRelay/newP2P update the stored fields here, but the live inactivity manager keeps the old timer values because it is only built in initLazyManager(). Right now a management push that changes relay_timeout_seconds or p2p_timeout_seconds inside the same mode updates bookkeeping only; the active lazy/dynamic lifecycle keeps running with the previous thresholds until a reconnect.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 245 - 291, In
UpdatedRemotePeerConfig: when the connection mode remains managed (isManaged
true) but any of the timeout/retry values change (compare
newRelay/newP2P/newP2pRetry against the previously-stored
e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs) you must restart the
running lazy manager so it picks up the new timers; call e.closeManager(ctx),
then re-init with e.initLazyManager(ctx), call e.startModeSideEffects(), and
re-add peers via e.addPeersToLazyConnManager() (also ensure
e.statusRecorder.UpdateLazyConnection(true) is set appropriately). Use the
existing symbols (UpdatedRemotePeerConfig, isManaged, e.lazyConnMgr,
closeManager, initLazyManager, startModeSideEffects, addPeersToLazyConnManager,
propagateP2pRetryMaxToConns) to locate and implement this branch.
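The missing branch can be sketched as a comparison before the fields are overwritten. `ConnMgr` is reduced to stub fields and `restartLazyManager` stands in for the closeManager / initLazyManager / addPeersToLazyConnManager sequence named above; all names outside those are assumptions:

```go
package main

import "fmt"

// ConnMgr keeps only the state this branch compares.
type ConnMgr struct {
	relayTimeoutSecs, p2pTimeoutSecs uint32
	restarted                        bool
}

// restartLazyManager stands in for: closeManager, initLazyManager,
// startModeSideEffects, addPeersToLazyConnManager.
func (e *ConnMgr) restartLazyManager() { e.restarted = true }

// applyTimeouts sketches the fix: when the mode stays managed but the
// timers changed, restart the live inactivity manager so the new
// thresholds take effect before the next reconnect.
func (e *ConnMgr) applyTimeouts(newRelay, newP2P uint32, isManaged, modeChanged bool) {
	timersChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
	e.relayTimeoutSecs, e.p2pTimeoutSecs = newRelay, newP2P
	if isManaged && !modeChanged && timersChanged {
		e.restartLazyManager()
	}
}

func main() {
	e := &ConnMgr{relayTimeoutSecs: 600, p2pTimeoutSecs: 120}
	e.applyTimeouts(600, 180, true, false) // p2p timeout changed within the same mode
	fmt.Println(e.restarted)               // true
}
```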

Comment thread client/internal/engine.go
Comment on lines +1585 to +1586
Mode: e.connMgr.Mode(),
P2pRetryMaxSeconds: e.connMgr.P2pRetryMax(),

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

These mode-dependent conn settings are only applied at construction time.

Mode and P2pRetryMaxSeconds are snapshotted when the peer.Conn is created, but ConnMgr.UpdatedRemotePeerConfig() can change the resolved mode later without recreating existing conns. That means a live switch between p2p-lazy and p2p-dynamic can leave already-open peers running with the old Open/AttachICE/DetachICE behavior until each conn is rebuilt or explicitly reconfigured.

🤖 Prompt for AI Agents

In `@client/internal/engine.go` around lines 1585 - 1586, The Mode and
P2pRetryMaxSeconds values are only snapshotted into peer.Conn at creation, so
runtime changes via ConnMgr.UpdatedRemotePeerConfig() won't affect existing
conns; modify the code so connections are reconfigured when the ConnMgr changes
by adding an update path: introduce or use methods on peer.Conn like
SetMode(mode string) and SetP2pRetryMax(seconds int) (or a single UpdateConfig)
and, in the ConnMgr.UpdatedRemotePeerConfig() handler inside engine (the code
that currently constructs conns with Mode: e.connMgr.Mode(), P2pRetryMaxSeconds:
e.connMgr.P2pRetryMax()), iterate existing e.peers and call those methods to
apply the new mode and retry value (and trigger any Open/AttachICE/DetachICE
transitions as needed) so live switches between p2p-lazy and p2p-dynamic take
effect without recreating connections.
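The fan-out over live conns could look like this. `SetMode` and `SetP2pRetryMax` are the hypothetical setters the review proposes, not existing `peer.Conn` methods, and the peer map is a stand-in for the engine's bookkeeping:

```go
package main

import "fmt"

// peerConn stands in for peer.Conn with the proposed runtime setters.
type peerConn struct {
	mode        string
	retryMaxSec uint32
}

func (c *peerConn) SetMode(m string)          { c.mode = m }
func (c *peerConn) SetP2pRetryMax(sec uint32) { c.retryMaxSec = sec }

// reconfigurePeers pushes a newly resolved mode into every live conn
// instead of waiting for each conn to be rebuilt.
func reconfigurePeers(peers map[string]*peerConn, mode string, retryMaxSec uint32) {
	for _, c := range peers {
		c.SetMode(mode)
		c.SetP2pRetryMax(retryMaxSec)
	}
}

func main() {
	peers := map[string]*peerConn{"peerA": {mode: "p2p-lazy"}}
	reconfigurePeers(peers, "p2p-dynamic", 900)
	fmt.Println(peers["peerA"].mode, peers["peerA"].retryMaxSec) // p2p-dynamic 900
}
```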

Comment on lines 102 to 112
if wgIface.IsUserspaceBind() {
m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold)
iceTO, relayTO := config.resolvedTimeouts()
if iceTO == 0 && relayTO == 0 {
// Phase 1 / single-timer fallback when caller hasn't migrated.
m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold)
} else {
m.inactivityManager = inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO)
}
} else {
log.Warnf("inactivity manager not supported for kernel mode, wait for remote peer to close the connection")
}

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Kernel-bind fallback is unsafe here.

This leaves m.inactivityManager nil, but the rest of the manager still dereferences it unconditionally (Start, activateSinglePeer, addActivePeer, DeactivatePeer, removePeer). In lazy/dynamic mode on a kernel bind, that turns the “not supported” path into a runtime panic instead of a graceful no-op/fallback.

Please either refuse to construct/start this manager for kernel mode or make every inactivity-manager call path nil-safe.

🤖 Prompt for AI Agents

In `@client/internal/lazyconn/manager/manager.go` around lines 102 - 112, The
kernel-bind path currently leaves m.inactivityManager nil which causes panics
when methods like Start, activateSinglePeer, addActivePeer, DeactivatePeer and
removePeer unconditionally dereference it; fix this by ensuring
m.inactivityManager is always a valid implementation: add a lightweight no-op
inactivity manager (e.g., inactivity.NewNoopManager returning an object that
implements the same interface as inactivity.NewManager/NewManagerWithTwoTimers)
and instantiate it in the else branch where kernel mode is detected (replace the
current log-only path), so all subsequent calls to m.inactivityManager are safe
without changing callers; alternatively, if you prefer refusal, return an error
from the constructor instead of creating the manager (but pick one approach and
apply it consistently).
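The no-op variant is a few lines. The interface's method set here is assumed from the call sites the review lists (the real one may differ), and the constructor name is invented:

```go
package main

import "fmt"

// InactivityManager is the shape both the real manager and the no-op
// must satisfy (method set assumed from Start/addActivePeer/removePeer
// call sites).
type InactivityManager interface {
	Start()
	AddPeer(id string)
	RemovePeer(id string)
}

// noopManager makes every call path safe on kernel bind instead of
// leaving a nil field that panics on first use.
type noopManager struct{}

func (noopManager) Start()            {}
func (noopManager) AddPeer(string)    {}
func (noopManager) RemovePeer(string) {}

func main() {
	var m InactivityManager = noopManager{} // the kernel-mode else branch
	m.Start()
	m.AddPeer("peerA") // graceful no-op, no nil dereference
	fmt.Println("kernel mode: inactivity tracking disabled, no panic")
}
```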

Comment on lines +96 to +101
// P2pRetryMaxSeconds is the cap for the ICE-failure backoff schedule
// in p2p-dynamic mode. 0 = use built-in default (DefaultP2PRetryMax).
// Wire-format sentinel uint32-max (= ^uint32(0)) means "user-explicit
// disable", which the resolver translates to time.Duration(0) at
// engine.go before passing it here. Phase 3 of #5989.
P2pRetryMaxSeconds uint32

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify how engine.go translates the wire sentinel for P2pRetryMaxSeconds, and
# what DefaultP2PRetryMax actually is, to determine whether this is a doc bug
# or a logic bug.

echo "--- engine.go: how is P2PRetryMaxSeconds translated into ConnConfig? ---"
rg -nP --type=go -C5 'P2PRetryMaxSeconds|P2pRetryMaxSeconds' -g 'client/internal/engine*.go'

echo
echo "--- DefaultP2PRetryMax definition ---"
rg -nP --type=go -C2 '\bDefaultP2PRetryMax\b'

echo
echo "--- newIceBackoff: how is a 0 cap handled? ---"
rg -nP --type=go -C5 '\bnewIceBackoff\b'

echo
echo "--- iceBackoffState.SetMaxBackoff: how is a 0 cap handled? ---"
ast-grep --pattern $'func ($_ *iceBackoffState) SetMaxBackoff($_) {
  $$$
}'

echo
echo "--- conn_mgr.go: who calls SetIceBackoffMax and with what input? ---"
rg -nP --type=go -C5 '\bSetIceBackoffMax\b'

Repository: netbirdio/netbird

Length of output: 17107


🏁 Script executed:

rg -nP --type=go -A10 'func \(.*ConnMgr\) P2pRetryMax' client/internal/conn_mgr.go

Repository: netbirdio/netbird

Length of output: 362


🏁 Script executed:

rg -nP --type=go -A15 'func \(conn \*Conn\) SetIceBackoffMax' client/internal/peer/conn.go

Repository: netbirdio/netbird

Length of output: 657


🏁 Script executed:

sed -n '540,560p' client/internal/conn_mgr.go

Repository: netbirdio/netbird

Length of output: 640


🏁 Script executed:

rg -nP --type=go -B20 'user-explicit disable' client/internal/conn_mgr.go | head -40

Repository: netbirdio/netbird

Length of output: 777


Fix sentinel preservation through the backoff cap translation; user-explicit disable is currently reversed to default.

The documented path from wire sentinel to disable is broken. When a user explicitly disables via the management sentinel ^uint32(0), the flow is:

  1. propagateP2pRetryMaxToConns() correctly translates sentinel to time.Duration(0) and calls SetIceBackoffMax(0)
  2. SetIceBackoffMax() stores uint32(0) back into conn.config.P2pRetryMaxSeconds
  3. conn.Open() reads uint32(0) and treats it as "not set", applying DefaultP2PRetryMax instead

The field comment at lines 98–100 is misleading: it states the resolver translates the sentinel to time.Duration(0) but omits that Open() immediately overwrites 0 with the default. Since 0 cannot distinguish "user wants disable" from "not set, use default", the user's explicit disable intent is silently reversed to the default 15-minute backoff.

The same issue affects SetIceBackoffMax() when called before Open(): the sentinel-decoded 0 gets stored as uint32(0) in the config, and the next Open() call overwrites it with the default.

Solution: Preserve the sentinel value ^uint32(0) in the uint32 field so Open() can explicitly handle it—map it to disable (zero cap) separately from the "not set, use default" path (which remains 0 today).

Suggested fix
 // P2pRetryMaxSeconds is the cap for the ICE-failure backoff schedule
 // in p2p-dynamic mode. 0 = use built-in default (DefaultP2PRetryMax).
-// Wire-format sentinel uint32-max (= ^uint32(0)) means "user-explicit
-// disable", which the resolver translates to time.Duration(0) at
-// engine.go before passing it here. Phase 3 of `#5989`.
+// Wire-format sentinel uint32-max (= ^uint32(0)) means "user-explicit
+// disable" and is preserved as ^uint32(0) here so Open() can
+// distinguish it from the "not set" zero value.
 P2pRetryMaxSeconds uint32
-backoffCap := time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second
-if backoffCap == 0 {
+var backoffCap time.Duration
+switch conn.config.P2pRetryMaxSeconds {
+case 0:
     backoffCap = DefaultP2PRetryMax
+case ^uint32(0):
+    backoffCap = 0 // user-explicit disable; iceBackoff treats 0 as off
+default:
+    backoffCap = time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second
 }

And update propagateP2pRetryMaxToConns() in conn_mgr.go to pass the sentinel directly (currently it converts to 0).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// P2pRetryMaxSeconds is the cap for the ICE-failure backoff schedule
// in p2p-dynamic mode. 0 = use built-in default (DefaultP2PRetryMax).
// Wire-format sentinel uint32-max (= ^uint32(0)) means "user-explicit
// disable", which the resolver translates to time.Duration(0) at
// engine.go before passing it here. Phase 3 of #5989.
P2pRetryMaxSeconds uint32
// P2pRetryMaxSeconds is the cap for the ICE-failure backoff schedule
// in p2p-dynamic mode. 0 = use built-in default (DefaultP2PRetryMax).
// Wire-format sentinel uint32-max (= ^uint32(0)) means "user-explicit
// disable" and is preserved as ^uint32(0) here so Open() can
// distinguish it from the "not set" zero value.
P2pRetryMaxSeconds uint32
🤖 Prompt for AI Agents

In `@client/internal/peer/conn.go` around lines 96 - 101, The uint32 sentinel
^uint32(0) for "user-explicit disable" is being lost; fix by preserving that
sentinel in the config and handling it explicitly: modify
propagateP2pRetryMaxToConns() to pass the sentinel value (not time.Duration(0))
into SetIceBackoffMax when the wire-value is the sentinel, change
SetIceBackoffMax(conn.config.P2pRetryMaxSeconds) so it stores the sentinel
(^uint32(0)) into conn.config.P2pRetryMaxSeconds instead of 0 when disable was
requested, and update conn.Open() to treat conn.config.P2pRetryMaxSeconds ==
^uint32(0) as "disabled (zero cap)" while treating the genuine 0 (or other unset
marker if you prefer) as "not set → apply DefaultP2PRetryMax"; ensure all logic
paths that previously wrote 0 for disable now preserve/write the sentinel so the
distinction remains.

Comment on lines +97 to +103
func (s *iceBackoffState) markSuccess() {
s.mu.Lock()
defer s.mu.Unlock()
s.failures = 0
s.suspended = false
s.bo.Reset()
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear suspension window when backoff is reset/disabled.

Disabling backoff via SetMaxBackoff(0) can still leave IsSuspended() true until the old retry timestamp elapses, and markSuccess() keeps a stale NextRetry in snapshots. This can incorrectly gate ICE attach and misreport status.

💡 Proposed fix
 func (s *iceBackoffState) markSuccess() {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	s.failures = 0
 	s.suspended = false
+	s.nextRetry = time.Time{}
 	s.bo.Reset()
 }

 func (s *iceBackoffState) SetMaxBackoff(d time.Duration) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	if d == s.maxBackoff {
 		return
 	}
 	s.maxBackoff = d
 	s.bo = buildBackoff(d)
+	if d == 0 {
+		s.suspended = false
+		s.nextRetry = time.Time{}
+	}
 }

Also applies to: 115-123

🤖 Prompt for AI Agents

In `@client/internal/peer/ice_backoff.go` around lines 97 - 103, When resetting or
disabling backoff (e.g., in markSuccess) you must also clear the suspended retry
timestamp so IsSuspended() and snapshots do not retain a stale NextRetry; update
markSuccess (and the corresponding reset/disable handler in the 115-123 block)
to set the internal nextRetry field to the zero time (reset the suspension
window) whenever you set s.suspended = false and bo.Reset(), ensuring snapshots
no longer report a stale NextRetry after SetMaxBackoff(0) or a success.

Comment on lines +229 to +240
if req.Settings.P2pTimeoutSeconds != nil {
v := uint32(*req.Settings.P2pTimeoutSeconds)
returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
v := uint32(*req.Settings.P2pRetryMaxSeconds)
returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
v := uint32(*req.Settings.RelayTimeoutSeconds)
returnSettings.RelayTimeoutSeconds = &v
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate timeout bounds before casting to uint32.

These values arrive as int64 and are cast directly to uint32. Negative inputs or anything larger than uint32 wrap silently, so malformed requests can persist nonsensical timeouts (-1 becomes 4294967295, 4294967296 becomes 0). Reject out-of-range values before the cast.

Suggested fix
+	const maxUint32 = int64(^uint32(0))
+
 	if req.Settings.P2pTimeoutSeconds != nil {
+		if *req.Settings.P2pTimeoutSeconds < 0 || *req.Settings.P2pTimeoutSeconds > maxUint32 {
+			return nil, status.Errorf(status.InvalidArgument, "p2p_timeout_seconds must be between 0 and %d", maxUint32)
+		}
 		v := uint32(*req.Settings.P2pTimeoutSeconds)
 		returnSettings.P2pTimeoutSeconds = &v
 	}
 	if req.Settings.P2pRetryMaxSeconds != nil {
+		if *req.Settings.P2pRetryMaxSeconds < 0 || *req.Settings.P2pRetryMaxSeconds > maxUint32 {
+			return nil, status.Errorf(status.InvalidArgument, "p2p_retry_max_seconds must be between 0 and %d", maxUint32)
+		}
 		v := uint32(*req.Settings.P2pRetryMaxSeconds)
 		returnSettings.P2pRetryMaxSeconds = &v
 	}
 	if req.Settings.RelayTimeoutSeconds != nil {
+		if *req.Settings.RelayTimeoutSeconds < 0 || *req.Settings.RelayTimeoutSeconds > maxUint32 {
+			return nil, status.Errorf(status.InvalidArgument, "relay_timeout_seconds must be between 0 and %d", maxUint32)
+		}
 		v := uint32(*req.Settings.RelayTimeoutSeconds)
 		returnSettings.RelayTimeoutSeconds = &v
 	}

Suggested change
if req.Settings.P2pTimeoutSeconds != nil {
v := uint32(*req.Settings.P2pTimeoutSeconds)
returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
v := uint32(*req.Settings.P2pRetryMaxSeconds)
returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
v := uint32(*req.Settings.RelayTimeoutSeconds)
returnSettings.RelayTimeoutSeconds = &v
}
const maxUint32 = int64(^uint32(0))
if req.Settings.P2pTimeoutSeconds != nil {
if *req.Settings.P2pTimeoutSeconds < 0 || *req.Settings.P2pTimeoutSeconds > maxUint32 {
return nil, status.Errorf(status.InvalidArgument, "p2p_timeout_seconds must be between 0 and %d", maxUint32)
}
v := uint32(*req.Settings.P2pTimeoutSeconds)
returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
if *req.Settings.P2pRetryMaxSeconds < 0 || *req.Settings.P2pRetryMaxSeconds > maxUint32 {
return nil, status.Errorf(status.InvalidArgument, "p2p_retry_max_seconds must be between 0 and %d", maxUint32)
}
v := uint32(*req.Settings.P2pRetryMaxSeconds)
returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
if *req.Settings.RelayTimeoutSeconds < 0 || *req.Settings.RelayTimeoutSeconds > maxUint32 {
return nil, status.Errorf(status.InvalidArgument, "relay_timeout_seconds must be between 0 and %d", maxUint32)
}
v := uint32(*req.Settings.RelayTimeoutSeconds)
returnSettings.RelayTimeoutSeconds = &v
}
🤖 Prompt for AI Agents

In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
229 - 240, The code casts int64 timeout fields from req.Settings
(P2pTimeoutSeconds, P2pRetryMaxSeconds, RelayTimeoutSeconds) directly to uint32
which can wrap on negative or too-large values; validate each non-nil
*req.Settings.* value is within 0..math.MaxUint32 before casting, and if out of
range return a validation error (HTTP 400) instead of assigning to
returnSettings.*; after the check perform v := uint32(*req.Settings.<Field>) and
set returnSettings.<Field> = &v.

Comment on lines +19 to +127
const (
ModeUnspecified Mode = iota
ModeRelayForced
ModeP2P
ModeP2PLazy
ModeP2PDynamic
// ModeFollowServer is a client-side sentinel: setting this in the
// client config explicitly clears any local override so the
// server-pushed value (or its legacy fallback) is used. It MUST NOT
// be sent on the wire -- ToProto returns UNSPECIFIED for it.
ModeFollowServer
)

// String returns the canonical lower-kebab-case name of the mode.
func (m Mode) String() string {
switch m {
case ModeRelayForced:
return "relay-forced"
case ModeP2P:
return "p2p"
case ModeP2PLazy:
return "p2p-lazy"
case ModeP2PDynamic:
return "p2p-dynamic"
case ModeFollowServer:
return "follow-server"
default:
return ""
}
}

// ParseString accepts the canonical name (case-insensitive, surrounding
// whitespace tolerated) and returns the corresponding Mode. Empty input
// returns ModeUnspecified with no error. Unknown input returns
// ModeUnspecified with an error.
func ParseString(s string) (Mode, error) {
switch strings.ToLower(strings.TrimSpace(s)) {
case "":
return ModeUnspecified, nil
case "relay-forced":
return ModeRelayForced, nil
case "p2p":
return ModeP2P, nil
case "p2p-lazy":
return ModeP2PLazy, nil
case "p2p-dynamic":
return ModeP2PDynamic, nil
case "follow-server":
return ModeFollowServer, nil
default:
return ModeUnspecified, fmt.Errorf("unknown connection mode %q", s)
}
}

// FromProto translates a proto enum value to the internal Mode.
func FromProto(m mgmProto.ConnectionMode) Mode {
switch m {
case mgmProto.ConnectionMode_CONNECTION_MODE_RELAY_FORCED:
return ModeRelayForced
case mgmProto.ConnectionMode_CONNECTION_MODE_P2P:
return ModeP2P
case mgmProto.ConnectionMode_CONNECTION_MODE_P2P_LAZY:
return ModeP2PLazy
case mgmProto.ConnectionMode_CONNECTION_MODE_P2P_DYNAMIC:
return ModeP2PDynamic
default:
return ModeUnspecified
}
}

// ToProto translates the internal Mode to a proto enum value.
// ModeFollowServer is a client-side concept and intentionally maps to
// UNSPECIFIED so it never appears on the wire.
func (m Mode) ToProto() mgmProto.ConnectionMode {
switch m {
case ModeRelayForced:
return mgmProto.ConnectionMode_CONNECTION_MODE_RELAY_FORCED
case ModeP2P:
return mgmProto.ConnectionMode_CONNECTION_MODE_P2P
case ModeP2PLazy:
return mgmProto.ConnectionMode_CONNECTION_MODE_P2P_LAZY
case ModeP2PDynamic:
return mgmProto.ConnectionMode_CONNECTION_MODE_P2P_DYNAMIC
default:
return mgmProto.ConnectionMode_CONNECTION_MODE_UNSPECIFIED
}
}

// ResolveLegacyLazyBool maps the historical Settings.LazyConnectionEnabled
// boolean to the new Mode. Used when a new client receives an old server's
// PeerConfig (ConnectionMode = UNSPECIFIED) or when the management server
// has no explicit Settings.ConnectionMode set yet.
func ResolveLegacyLazyBool(lazy bool) Mode {
if lazy {
return ModeP2PLazy
}
return ModeP2P
}

// ToLazyConnectionEnabled is the inverse mapping for backwards-compat.
// Used by toPeerConfig() so old clients (which only know the boolean)
// still get a sensible behaviour.
//
// Note: ModeRelayForced cannot be expressed via the legacy boolean and
// falls back to false. This is a structural compat gap documented in the
// release notes; admins must set NB_FORCE_RELAY=true on old clients
// or upgrade them.
func (m Mode) ToLazyConnectionEnabled() bool {
return m == ModeP2PLazy

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

p2p-dynamic-lazy is missing from the canonical mode table.

The supported mode list for #5989 includes p2p-dynamic-lazy, but this package can't represent it anywhere: String(), ParseString(), FromProto(), ToProto(), and the legacy lazy fallback all omit it. Any config carrying that mode will currently fail parsing or get downgraded during conversion.

🤖 Prompt for AI Agents

In `@shared/connectionmode/mode.go` around lines 19 - 127, Add a new Mode constant
for "p2p-dynamic-lazy" and wire it through the conversions: define
ModeP2PDynamicLazy alongside other Mode constants, return "p2p-dynamic-lazy"
from Mode.String(), accept "p2p-dynamic-lazy" in ParseString(), map
mgmProto.ConnectionMode_CONNECTION_MODE_P2P_DYNAMIC_LAZY <-> ModeP2PDynamicLazy
in FromProto() and ToProto(), and make ModeP2PDynamicLazy return true from
ToLazyConnectionEnabled() (and ensure ResolveLegacyLazyBool behavior remains
correct). Update only the symbols ModeP2PDynamicLazy, String(), ParseString(),
FromProto(), ToProto(), ToLazyConnectionEnabled() to include this new mode and
keep ModeFollowServer behavior unchanged.

connection_mode:
x-experimental: true
type: string
enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

connection_mode enum is missing p2p-dynamic-lazy.

PR scope introduces p2p-dynamic-lazy, but the OpenAPI enum does not expose it, so REST clients cannot configure that mode through the account settings API.

Suggested fix
-          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` at line 365, The OpenAPI enum for
connection_mode is missing the new value; update the enum array under the
connection_mode schema to include "p2p-dynamic-lazy" alongside the existing
values (relay-forced, p2p, p2p-lazy, p2p-dynamic) so REST clients can set that
mode; ensure you add the exact string "p2p-dynamic-lazy" to the enum in the
openapi YAML and run any spec validation/generation steps you normally use.

Comment on lines +370 to +373
Phase 1 of issue #5989: relay-forced, p2p, and p2p-lazy are
functional. p2p-dynamic is reserved (passes through as p2p in
Phase 1; will become functional in Phase 2).
p2p_timeout_seconds:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale phase-gating text in public API description.

This description says p2p-dynamic is reserved and not functional yet, which is outdated for this PR and can mislead API consumers.

Suggested fix
-            Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-            functional. p2p-dynamic is reserved (passes through as p2p in
-            Phase 1; will become functional in Phase 2).
+            Supported values are relay-forced, p2p, p2p-lazy, p2p-dynamic,
+            and p2p-dynamic-lazy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 370 - 373, Update the
stale documentation text in the OpenAPI description that mentions "Phase 1 of
issue `#5989`" and that "p2p-dynamic is reserved" so it no longer misleads
consumers; locate the block around the p2p_timeout_seconds schema entry in
openapi.yml (the paragraph starting with "Phase 1 of issue `#5989`: relay-forced,
p2p, and p2p-lazy are functional...") and replace or remove the phase-gating
sentence to reflect the current status (either remove the "p2p-dynamic is
reserved" clause or update it to state that p2p-dynamic is now functional),
keeping the rest of the p2p_timeout_seconds description consistent.

Comment on lines +41 to +63
// Defines values for AccountSettingsConnectionMode.
const (
AccountSettingsConnectionModeP2p AccountSettingsConnectionMode = "p2p"
AccountSettingsConnectionModeP2pDynamic AccountSettingsConnectionMode = "p2p-dynamic"
AccountSettingsConnectionModeP2pLazy AccountSettingsConnectionMode = "p2p-lazy"
AccountSettingsConnectionModeRelayForced AccountSettingsConnectionMode = "relay-forced"
)

// Valid indicates whether the value is a known member of the AccountSettingsConnectionMode enum.
func (e AccountSettingsConnectionMode) Valid() bool {
switch e {
case AccountSettingsConnectionModeP2p:
return true
case AccountSettingsConnectionModeP2pDynamic:
return true
case AccountSettingsConnectionModeP2pLazy:
return true
case AccountSettingsConnectionModeRelayForced:
return true
default:
return false
}
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add p2p-dynamic-lazy to the generated AccountSettingsConnectionMode enum.

The PR introduces p2p-dynamic-lazy, but this public enum and Valid() omit it. That means OpenAPI-generated clients cannot round-trip the full connection-mode set, and the handler-side req.Settings.ConnectionMode.Valid() check will reject that mode even if the backend supports it. Please add it in the OpenAPI source and regenerate this file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated
enum AccountSettingsConnectionMode is missing the new value "p2p-dynamic-lazy":
add a new constant named AccountSettingsConnectionModeP2pDynamicLazy with value
"p2p-dynamic-lazy" to the AccountSettingsConnectionMode block and include it in
the AccountSettingsConnectionMode.Valid() switch so the method returns true for
that value; update the OpenAPI source to include p2p-dynamic-lazy and regenerate
shared/management/http/api/types.gen.go so the generated code and Valid() check
are in sync.
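For illustration only, the regenerated block would plausibly read as follows (the constant name AccountSettingsConnectionModeP2pDynamicLazy is an assumption following the generator's naming scheme; regenerate from the OpenAPI source rather than hand-editing types.gen.go):

```go
package main

import "fmt"

type AccountSettingsConnectionMode string

const (
	AccountSettingsConnectionModeP2p            AccountSettingsConnectionMode = "p2p"
	AccountSettingsConnectionModeP2pDynamic     AccountSettingsConnectionMode = "p2p-dynamic"
	AccountSettingsConnectionModeP2pDynamicLazy AccountSettingsConnectionMode = "p2p-dynamic-lazy"
	AccountSettingsConnectionModeP2pLazy        AccountSettingsConnectionMode = "p2p-lazy"
	AccountSettingsConnectionModeRelayForced    AccountSettingsConnectionMode = "relay-forced"
)

// Valid reports whether the value is a known member of the enum.
func (e AccountSettingsConnectionMode) Valid() bool {
	switch e {
	case AccountSettingsConnectionModeP2p,
		AccountSettingsConnectionModeP2pDynamic,
		AccountSettingsConnectionModeP2pDynamicLazy,
		AccountSettingsConnectionModeP2pLazy,
		AccountSettingsConnectionModeRelayForced:
		return true
	}
	return false
}

func main() {
	fmt.Println(AccountSettingsConnectionModeP2pDynamicLazy.Valid()) // true
	fmt.Println(AccountSettingsConnectionMode("bogus").Valid())      // false
}
```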

@MichaelUray MichaelUray changed the title [1/4 stacked] Phase 1+2+3 of #5989: ConnectionMode foundation, two-timer lifecycle, iceBackoff [client, management] Phase 1+2+3 of #5989: ConnectionMode foundation, two-timer lifecycle, iceBackoff (stack 1/4) May 6, 2026
…uest test

The setconfig_test.go reflection tests (TestSetConfig_AllFieldsSaved and
TestCLIFlags_MappedToSetConfig) now tolerate the four new fields:
ConnectionMode, P2PTimeoutSeconds, RelayTimeoutSeconds, and
P2PRetryMaxSeconds. They are in the proto so daemons can advertise
them via GetConfig, but SetConfig RPC does not apply them at runtime
today — the CLI sets them via 'netbird service install/reconfigure'
writing the active profile file directly. Listed in the test maps so
the structural drift-detector passes; wiring through SetConfig is a
follow-up.
@MichaelUray MichaelUray force-pushed the pr/a-p2p-dynamic-foundation branch from bad32d4 to 9ea349f Compare May 6, 2026 06:18
The pb.go regeneration on dev machines uses local protoc (v5.29.3 here)
which causes the Proto Version Check workflow to flag the diff as a
toolchain mismatch. Restore the upstream-main version strings (v6.33.1
for client/proto, v7.34.1 for shared/management/proto) — generated
content matches; only the comment-line differs.
…rsion pin

- conn_mgr.go: tagged switch on uint32 sentinel (QF1002 staticcheck)
- engine.go: //nolint:staticcheck on the deprecated peer.IsForceRelayed
  call site - the function is intentionally retained for Phase-1
  backwards compat with daemons that haven't migrated to ResolveModeFromEnv yet
- lazyconn/manager.go: //nolint:staticcheck on the deprecated
  inactivity.NewManager fallback - kept on purpose for callers without
  resolved two-timer config (iceTimeout=0 && relayTimeout=0)
- proxy_service.pb.go: pin protoc version header to upstream-main v7.34.1
  (Proto Version Check workflow flagged it; only the comment-line differs)
@sonarqubecloud

sonarqubecloud Bot commented May 6, 2026

Quality Gate failed

Failed conditions
5 New issues
5 New Code Smells (required ≤ 0)

See analysis details on SonarQube Cloud



@coderabbitai Bot left a comment


♻️ Duplicate comments (2)
client/internal/conn_mgr.go (1)

251-289: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart the managed lifecycle when only the inactivity timeouts change.

relayTimeoutSecs/p2pTimeoutSecs are applied only in initLazyManager(), but this branch only tears the manager down when the mode changes. A management push that keeps p2p-lazy/p2p-dynamic the same and only changes relay_timeout_seconds or p2p_timeout_seconds updates the stored fields without rebuilding the live manager, so it keeps running with the previous thresholds until reconnect.

♻️ Minimal direction
+	timeoutsChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
-	if modeChanged && wasManaged && isManaged {
+	if wasManaged && isManaged && (modeChanged || timeoutsChanged) {
 		// Switching between lazy and dynamic at runtime: tear down the
 		// existing manager so initLazyManager picks up the new timeouts.
 		log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
 		e.closeManager(ctx)
 		e.statusRecorder.UpdateLazyConnection(false)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 251 - 289, The code updates
relayTimeoutSecs and p2pTimeoutSecs but only restarts the lazy/dynamic manager
when the mode changes, so an active manager (lazyConnMgr != nil) will keep old
thresholds; modify the branch handling unchanged mode to detect when timeouts
(relayTimeoutSecs or p2pTimeoutSecs or p2pRetryMaxSecs) changed while
modeUsesLazyMgr(newMode) is true and lazyConnMgr != nil, and when they did: call
e.closeManager(ctx), e.statusRecorder.UpdateLazyConnection(false), then re-init
by calling e.initLazyManager(ctx), e.startModeSideEffects(), and return
e.addPeersToLazyConnManager() so the running manager is recreated with the new
timeouts (use existing helpers closeManager, initLazyManager,
startModeSideEffects, addPeersToLazyConnManager and keep
propagateP2pRetryMaxToConns for immediate propagation).
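The restart condition the comment asks for can be sketched as a small standalone predicate (the function name needsManagerRestart and the plain int parameters are stand-ins, not the engine's real API; the real code would feed it the fields compared in the diff above):

```go
package main

import "fmt"

// needsManagerRestart reports whether the lazy/dynamic manager must be torn
// down and rebuilt: either the mode flipped between managed variants, or the
// manager stays managed but an inactivity threshold changed underneath it.
func needsManagerRestart(wasManaged, isManaged, modeChanged bool,
	oldRelay, newRelay, oldP2P, newP2P int) bool {
	if !wasManaged || !isManaged {
		// Entering or leaving managed mode is handled by separate branches.
		return false
	}
	timeoutsChanged := oldRelay != newRelay || oldP2P != newP2P
	return modeChanged || timeoutsChanged
}

func main() {
	// Same mode, but relay_timeout_seconds pushed from 300 to 600: restart.
	fmt.Println(needsManagerRestart(true, true, false, 300, 600, 60, 60)) // true
	// Nothing changed: keep the running manager.
	fmt.Println(needsManagerRestart(true, true, false, 300, 300, 60, 60)) // false
}
```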
client/internal/engine.go (1)

1585-1586: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Reconfigure live peers when the resolved mode changes.

Mode is still only copied into peer.ConnConfig at construction time. UpdatedRemotePeerConfig() can change ConnMgr.mode later, but already-created peer.Conns will keep their old Open/AttachICE/DetachICE behavior until each peer is rebuilt, so live switches like p2p-lazy -> p2p-dynamic or p2p -> relay-forced won't fully take effect.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine.go` around lines 1585 - 1586, The code only sets Mode
into peer.ConnConfig at construction (via e.connMgr.Mode()) so when
UpdatedRemotePeerConfig() changes ConnMgr.mode live peers (peer.Conn) keep old
Open/AttachICE/DetachICE behavior; update UpdatedRemotePeerConfig() to detect a
mode change and reconfigure existing peers by iterating active peer.Conn
instances and updating or rebuilding their ConnConfig (e.g., set the new Mode
and P2pRetryMaxSeconds) or invoking a peer-level method to apply the new
connection policy (e.g., peer.UpdateConnConfig / peer.RebuildConnections /
peer.SetMode) so live switches like p2p-lazy ↔ p2p-dynamic or p2p ↔ relay-forced
take effect immediately.
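One hedged way to sketch the live propagation (Conn, SetMode, and applyModeToLivePeers below are illustrative stand-ins, not the real peer.Conn API, which would also need to re-evaluate Open/AttachICE/DetachICE under the new policy):

```go
package main

import (
	"fmt"
	"sync"
)

type Mode string

// Conn is a stand-in for peer.Conn with a mutable connection policy.
type Conn struct {
	mu   sync.Mutex
	mode Mode
}

// SetMode swaps the connection policy under the lock so concurrent
// readers never observe a torn value.
func (c *Conn) SetMode(m Mode) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.mode = m
}

func (c *Conn) Mode() Mode {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.mode
}

// applyModeToLivePeers pushes a newly resolved mode into every existing
// connection so a management update takes effect without a peer rebuild.
func applyModeToLivePeers(conns map[string]*Conn, newMode Mode) {
	for _, c := range conns {
		c.SetMode(newMode)
	}
}

func main() {
	conns := map[string]*Conn{"peerA": {mode: "p2p-lazy"}}
	applyModeToLivePeers(conns, "p2p-dynamic")
	fmt.Println(conns["peerA"].Mode()) // p2p-dynamic
}
```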
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@client/internal/conn_mgr.go`:
- Around line 251-289: The code updates relayTimeoutSecs and p2pTimeoutSecs but
only restarts the lazy/dynamic manager when the mode changes, so an active
manager (lazyConnMgr != nil) will keep old thresholds; modify the branch
handling unchanged mode to detect when timeouts (relayTimeoutSecs or
p2pTimeoutSecs or p2pRetryMaxSecs) changed while modeUsesLazyMgr(newMode) is
true and lazyConnMgr != nil, and when they did: call e.closeManager(ctx),
e.statusRecorder.UpdateLazyConnection(false), then re-init by calling
e.initLazyManager(ctx), e.startModeSideEffects(), and return
e.addPeersToLazyConnManager() so the running manager is recreated with the new
timeouts (use existing helpers closeManager, initLazyManager,
startModeSideEffects, addPeersToLazyConnManager and keep
propagateP2pRetryMaxToConns for immediate propagation).

In `@client/internal/engine.go`:
- Around line 1585-1586: The code only sets Mode into peer.ConnConfig at
construction (via e.connMgr.Mode()) so when UpdatedRemotePeerConfig() changes
ConnMgr.mode live peers (peer.Conn) keep old Open/AttachICE/DetachICE behavior;
update UpdatedRemotePeerConfig() to detect a mode change and reconfigure
existing peers by iterating active peer.Conn instances and updating or
rebuilding their ConnConfig (e.g., set the new Mode and P2pRetryMaxSeconds) or
invoking a peer-level method to apply the new connection policy (e.g.,
peer.UpdateConnConfig / peer.RebuildConnections / peer.SetMode) so live switches
like p2p-lazy ↔ p2p-dynamic or p2p ↔ relay-forced take effect immediately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eb05b04e-fb0e-4e46-938d-a6555e15d67d

📥 Commits

Reviewing files that changed from the base of the PR and between 745ef8f and 2ef31c0.

📒 Files selected for processing (3)
  • client/internal/conn_mgr.go
  • client/internal/engine.go
  • client/internal/lazyconn/manager/manager.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/lazyconn/manager/manager.go

