[client, management] Phase 3.7i hardening + activity-trigger fast-path + Codex-review fixes — closes #5989 (stack 4/4) #6084
Conversation
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds a connection-mode model with client/server timeouts and server-pushed overrides; implements per-peer ICE backoff, two-timer inactivity (ICE vs relay), a conn-state pusher with management snapshot plumbing, peer-connections store/router and HTTP endpoints, UI/Android exposure, many tests, and small infra utilities.

Changes

Connection Mode, Timeouts, and Conn-State Push (single DAG)
Sequence Diagram(s)

sequenceDiagram
autonumber
actor User
participant UI as Desktop UI
participant Client as Client Daemon
participant ConnMgr as ConnMgr
participant Engine as Engine
participant Mgmt as Management Server
User->>UI: Select connection mode & timeouts
UI->>Client: Build SetConfigRequest (mode, timeouts)
Client->>ConnMgr: persist configured overrides
ConnMgr->>ConnMgr: resolve effective mode/timeouts (env↦cfg↦server)
ConnMgr->>Engine: notify mode/timeout change
Engine->>Engine: resetPeersToLazyIdle / reconfigure peers
Engine->>Mgmt: SetEffectiveConnConfig (debounced)
Engine->>Mgmt: SyncPeerConnections (delta/full snapshot)
Mgmt->>Mgmt: store snapshot (MemoryStore) / route snapshot (SnapshotRouter)
User->>UI: Open Peers tab / request FullStatus
UI->>Client: GetFullStatus
Client->>Engine: request status
Engine->>Client: FullStatus (includes ICE backoff, groups, counters)
Client->>UI: Render peers tab (connection_type_extended, backoff, lastSeen, bytes)
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120+ minutes

Possibly related PRs
Suggested reviewers
Actionable comments posted: 3
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
client/internal/lazyconn/inactivity/manager_test.go (1)
35-37: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore newTicker after each test.

Every test overwrites the package-global ticker factory and leaves the fake in place. That makes later tests order-dependent and can hang/flap if they inherit the previous fake channel.
Suggested fix
  fakeTick := make(chan time.Time, 1)
+ prevTicker := newTicker
  newTicker = func(d time.Duration) Ticker {
      return &fakeTickerMock{CChan: fakeTick}
  }
+ t.Cleanup(func() {
+     newTicker = prevTicker
+ })

Also applies to: 76-78, 144-146, 184-186, 227-229, 267-269, 306-308, 347-349
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/lazyconn/inactivity/manager_test.go` around lines 35 - 37, The tests overwrite the package-global ticker factory newTicker and never restore it, causing test order dependency and flakes; for each test that sets newTicker (e.g., where fakeTickerMock and fakeTick are used) capture the original factory into a local variable before replacing it and defer restoring it (e.g., originalNewTicker := newTicker; defer func(){ newTicker = originalNewTicker }()) so the global is reset after the test; apply this to every test that assigns newTicker (lines around the other occurrences you noted) to ensure isolation.

docs/bugs/2026-05-04-user-peer-visibility-regression.md (1)

1-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Redact PII / private deployment details before committing this doc to the public repo.

This file contains identifiable, non-anonymized information that should not land in a public OSS repository:

- Reporter's real name: "Michael Uray" (line 3) and a verbatim user quote referencing "Georg".
- Real end-user identifier: `georg.stoisser-gigacher` (line 33).
- Real device hostname: `ctb50-d` (line 33).
- Real customer-location group names: `Lunz.am.See.FWR-access`, `Lunz.am.See.FWR.Rx` (lines 36–37) — these doxx a specific customer site.
- Real private deployment URL with non-default port: https://netbird.uplink.plant-control.net:44106 (line 69).

None of these identifiers are required for the technical content (regression cause, restored helpers, fix plan, performance trade-offs). Replace each with neutral placeholders so the doc still conveys the engineering context without leaking customer data. Also note that once committed, prior commits will retain this content in git history — if this branch hasn't been merged yet, please amend/force-push the redaction so the data never reaches main.

🛡️ Suggested redaction pattern
-**Reported:** 2026-05-04 by Michael Uray ("Georg sees only his own peer in the dashboard, not the Gegenstellen — that's not what we want, before each user saw their own peers PLUS their counterparts").
+**Reported:** 2026-05-04 by an internal operator: a `user`-role end user reported seeing only their own peer in the dashboard, no longer the routing peers their access policies allowed them to reach.
@@
-`georg.stoisser-gigacher` has 1 own peer (`ctb50-d`) and 17 auto_groups,
-including 16 `*-access` groups that source policies into `*-Rx` /
-`*-NW` destination groups (e.g. `Lunz.am.See.FWR-access` ->
-`Lunz.am.See.FWR.Rx`). With the regressed code, the dashboard
-shows only `ctb50-d`. Operationally useless -- the user wants to
-see the routing peers their device can reach.
+A `user`-role account had 1 own peer and ~17 auto-groups, most of
+which were `<site>-access` groups that source ACL policies into
+`<site>-Rx` / `<site>-NW` destination groups. With the regressed
+code, the dashboard showed only the user's own peer, hiding every
+counterpart their policies allowed them to reach.
@@
-    curl -sk -H "Authorization: Token <Georg's token>" \
-      https://netbird.uplink.plant-control.net:44106/api/peers \
+    curl -sk -H "Authorization: Token <user-token>" \
+      https://<management-host>/api/peers \
       | jq '. | length'

As per coding guidelines, compliance/privacy risks such as PII retention and exposure of user identifiers are treated as major issues.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/bugs/2026-05-04-user-peer-visibility-regression.md` around lines 1 - 99, This doc contains PII and customer-specific deployment details (reporter name "Michael Uray", user "georg.stoisser-gigacher", device "ctb50-d", group names like "Lunz.am.See.FWR-access"/"Lunz.am.See.FWR.Rx", and the URL https://netbird.uplink.plant-control.net:44106); redact each by replacing real names/hosts/groups/URLs with neutral placeholders (e.g., <REPORTER>, <USER>, <DEVICE>, <GROUP_X>, <PRIVATE_DEPLOYMENT_URL>) in docs/bugs/2026-05-04-user-peer-visibility-regression.md, update any inline quotes to anonymized text, then amend the local commit (git commit --amend) and force-push the branch (git push --force) so the branch tip no longer contains the sensitive strings; if the branch was already merged, open a follow-up to purge sensitive data from history and coordinate with security/Git admins.

management/server/peer.go (1)

1297-1355: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep GetPeer aligned with the list-endpoint visibility rules.

This path never applies RegularUsersViewBlocked, and the fallback returns status.Internal on an access miss. That lets a restrictable user who is blocked from GetPeers still fetch a known peer ID here, and unauthorized misses can bubble out as 500s instead of a deny/not-found.

Suggested direction
  func (am *DefaultAccountManager) GetPeer(ctx context.Context, accountID, peerID, userID string) (*nbpeer.Peer, error) {
      peer, err := am.Store.GetPeerByID(ctx, store.LockingStrengthNone, accountID, peerID)
      if err != nil {
          return nil, err
      }
@@
      user, err := am.Store.GetUserByUserID(ctx, store.LockingStrengthNone, userID)
      if err != nil {
          return nil, err
      }
+
+     settings, err := am.Store.GetAccountSettings(ctx, store.LockingStrengthNone, accountID)
+     if err != nil {
+         return nil, fmt.Errorf("failed to get account settings: %w", err)
+     }
+     if user.IsRestrictable() && settings.RegularUsersViewBlocked {
+         return nil, status.NewPermissionDeniedError()
+     }
@@
-     return nil, status.Errorf(status.Internal, "user %s has no access to peer %s under account %s", userID, peer.ID, accountID)
+     return nil, status.NewPermissionDeniedError()
  }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer.go` around lines 1297 - 1355, GetPeer currently skips the list-endpoint visibility rule (RegularUsersViewBlocked) and returns status.Internal on access misses; update GetPeer (and/or checkIfUserOwnsPeer) to honor the RegularUsersViewBlocked rule from permissionsManager (use the same permissions check flow you use for list visibility) before falling back to ownership checks, and if the user is blocked by RegularUsersViewBlocked return a NotFound/deny style error instead of status.Internal (replace the final status.Internal error in checkIfUserOwnsPeer with a NotFound/permission-denied response consistent with list behavior). Ensure you reference permissionsManager.ValidateUserPermissions / RegularUsersViewBlocked and adjust the error returned from checkIfUserOwnsPeer accordingly.

client/internal/peer/status.go (1)

682-697: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Critical: notifyPeerStateChangeListeners and notifyConnStateChange are invoked after d.mux.Unlock(), violating their documented lock contract.

Both helpers explicitly require the caller to hold d.mux:

- notifyPeerStateChangeListeners (Line 328) calls snapshotRouterPeersLocked, which does `for pid := range d.changeNotify` and reads d.peers. If SubscribeToPeerStateChanges (writes d.changeNotify at Line 1006‑1008) or AddPeer/RemovePeer runs concurrently, Go will panic with `fatal error: concurrent map iteration and map write`.
- notifyConnStateChange (Line 300) reads d.connStateListener, which SetConnStateListener writes under the lock — a data race per the Go memory model.

The same pattern is repeated in every new update*Locked helper:

| call site | helper invoked after unlock |
| --- | --- |
| Line 469 (updatePeerStateLocked) | notifyConnStateChange |
| Line 691, 695 (updatePeerICEStateLocked) | notifyPeerStateChangeListeners, notifyConnStateChange |
| Line 759, 763 (updatePeerRelayedStateLocked) | notifyPeerStateChangeListeners, notifyConnStateChange |
| Line 811 (updatePeerRelayedStateToDisconnectedLocked) | notifyConnStateChange |
| Line 862 (updatePeerICEStateToDisconnectedLocked) | notifyConnStateChange |

Two viable fixes:

- Capture under the lock, deliver after — build the closure (and the router-state snapshot) before d.mux.Unlock(), then invoke it later. This matches the design intent of the returned notifyFn.
- Make the helpers self-locking — drop the "caller must hold d.mux" precondition and have them acquire d.mux (or use atomic.Pointer for connStateListener).

Sketch of fix #1 for updatePeerICEStateLocked:

🔒 Proposed fix: build notifications under the lock
  notifyList := hasConnStatusChanged(oldStatus, receivedState.ConnStatus)
  notifyRouter := hasStatusOrRelayedChange(oldStatus, receivedState.ConnStatus, oldSnapshot.Relayed, receivedState.Relayed)
  routerSnapshot := d.snapshotRouterPeersLocked(receivedState.PubKey, notifyRouter)
  numPeers := d.numOfPeers()
  materialICE := hasMaterialICEChange(oldSnapshot, peerState)
+ var stateChangeSnapshot map[string]RouterState
+ if materialICE {
+     stateChangeSnapshot = d.snapshotRouterPeersLocked(receivedState.PubKey, true)
+ }
+ connStateNotify := func() {}
+ if notifyRouter {
+     connStateNotify = d.notifyConnStateChange(receivedState.PubKey, peerState)
+ }
  d.mux.Unlock()

  if notifyList {
      d.notifier.peerListChanged(numPeers)
  }
  if notifyRouter {
      d.dispatchRouterPeers(receivedState.PubKey, routerSnapshot)
  }
- if materialICE {
-     d.notifyPeerStateChangeListeners(receivedState.PubKey)
- }
-
- if hasStatusOrRelayedChange(oldStatus, receivedState.ConnStatus, oldSnapshot.Relayed, receivedState.Relayed) {
-     return d.notifyConnStateChange(receivedState.PubKey, peerState), nil
- }
- return func() {}, nil
+ if stateChangeSnapshot != nil {
+     go d.dispatchRouterPeers(receivedState.PubKey, stateChangeSnapshot)
+ }
+ return connStateNotify, nil
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 682 - 697, The problem is that notifyPeerStateChangeListeners and notifyConnStateChange are called after d.mux.Unlock(), violating their "caller must hold d.mux" contract and risking concurrent map iteration/write/data races; fix by capturing everything needed for notification while holding d.mux and returning a closure to invoke after unlocking. Concretely, in updatePeerStateLocked, updatePeerICEStateLocked, updatePeerRelayedStateLocked, updatePeerRelayedStateToDisconnectedLocked, and updatePeerICEStateToDisconnectedLocked: before calling d.mux.Unlock(), build the notify closure(s) and any router snapshot or listener references (e.g., result of snapshotRouterPeersLocked and the connStateListener) into local variables/closures, then unlock and invoke those closures (or return them via the existing notifyFn pattern); alternatively, if you prefer, make notifyPeerStateChangeListeners and notifyConnStateChange acquire d.mux internally, but do not call them without holding the lock. Ensure no accesses to d.changeNotify, d.peers, or d.connStateListener happen after unlocking.
🟠 Major comments (23)
client/internal/stdnet/filter_test.go-51-66 (1)
51-66: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Asymmetric assertion silently skips most of the table — tighten the checks.

The loop only fails in two situations:

- `!c.want && got` — a "should be filtered" case that wasn't filtered.
- `c.want && !got && GOOS == "windows" && name == "vEthernet (LAN)"` — one very specific Windows case.

Every other `c.want == true` row ("Ethernet USB", "OpenVPN 1", "WiFi", "vEthernet (External)", and "vEthernet (LAN)" on non-Windows runners) silently passes even if InterfaceFilter incorrectly drops them. That's most of the table.

Compounding this, the comment at lines 57–59 is now stale: since filter.go lowercases both sides, on Linux "vEthernet (LAN)" → "vethernet (lan)" does match the "veth" prefix and is filtered. The test only "passes" on Linux CI because the assertion is gated on Windows. So this test as written cannot regress on the very behavior its name promises to lock down.

Two options:

- Make `c.want == true` strict on the platform where it's meaningful (Windows for the vEthernet cases; all platforms for "Ethernet USB", "OpenVPN 1", "WiFi").
- Or split into per-GOOS subtests with explicit t.Skip so each case has clear, symmetric assertEqual semantics.

🛠️ Suggested tightening

  for _, c := range cases {
-     // The wgctrl branch can override on hosts where NetBird is
-     // running; tests run on a host where these names are not
-     // real interfaces, so the final return faithfully reflects
-     // the disallow-list logic.
-     got := allow(c.name)
-     // "veth*" prefix only filters on non-Windows; on Linux test
-     // runners "vEthernet (LAN)" still passes because of mixed
-     // case + the !Windows branch keeping the prefix match.
-     if !c.want && got {
-         t.Errorf("InterfaceFilter(%q) = true, want false (should be filtered)", c.name)
-     }
-     if c.want && !got && runtime.GOOS == "windows" && c.name == "vEthernet (LAN)" {
-         t.Fatalf("InterfaceFilter(%q) = false, want true on Windows (this is uray-mic-d4's default-route interface)", c.name)
-     }
+     got := allow(c.name)
+
+     // vEthernet (LAN) / vEthernet (External) only stay allowed on
+     // Windows; on non-Windows the case-insensitive "veth" prefix
+     // in the disallow list legitimately filters them.
+     isWindowsOnlyAllow := strings.HasPrefix(strings.ToLower(c.name), "vethernet")
+     if c.want && isWindowsOnlyAllow && runtime.GOOS != "windows" {
+         continue
+     }
+
+     if got != c.want {
+         t.Errorf("InterfaceFilter(%q) = %v, want %v (GOOS=%s)", c.name, got, c.want, runtime.GOOS)
+     }
  }

(Add "strings" to the imports.)
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter_test.go` around lines 51 - 66, The table-driven test's loop around allow(...) is asymmetric and lets many "want==true" rows silently pass; change the loop in Test (the for _, c := range cases block that calls allow) to assert equality for every case (if got != c.want { t.Fatalf/ t.Errorf(... ) }) and handle platform-specific expectations explicitly: either split into per-GOOS subtests (use t.Run with runtime.GOOS checks and t.Skip where not applicable) or keep a single loop but special-case only the Windows-specific vEthernet assertion (use runtime.GOOS == "windows" to adjust expected value for "vEthernet (LAN)"). Ensure you reference the allow(...) call/InterfaceFilter semantics when updating the messages so failures clearly show c.name, c.want and got.

client/internal/peer/worker_ice.go-218-230 (1)

218-230: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize lastKnownState with IsConnected.

IsConnected() reads lastKnownState under muxAgent, but onConnectionStateChange mutates the same field without that synchronization. That is a real race, and it can make the new network-change fast path preserve or tear down ICE based on stale state. Guard both reads and writes with muxAgent or switch the field to an atomic.

Also applies to: 545-570
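For illustration, a minimal sketch of the mutex option; the stand-in types below are assumptions, not the real WorkerICE struct or the pion state type:

package peer

import "sync"

// connState stands in for the real ICE connection-state type.
type connState int

const (
    stateDisconnected connState = iota
    stateConnected
)

type workerICE struct {
    muxAgent       sync.Mutex
    lastKnownState connState
}

// onConnectionStateChange takes the same lock that IsConnected reads
// under, so the fast path never observes a stale or torn write.
func (w *workerICE) onConnectionStateChange(s connState) {
    w.muxAgent.Lock()
    w.lastKnownState = s
    w.muxAgent.Unlock()
}

func (w *workerICE) IsConnected() bool {
    w.muxAgent.Lock()
    defer w.muxAgent.Unlock()
    return w.lastKnownState == stateConnected
}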
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/worker_ice.go` around lines 218 - 230, The IsConnected race: guard reads and writes of lastKnownState consistently—either acquire muxAgent around the mutation in onConnectionStateChange (and any other places noted around the 545-570 range) or convert lastKnownState to an atomic value and update/read it atomically; specifically, update the onConnectionStateChange method and any other writers to use muxAgent.Lock()/Unlock() (or atomic store) so IsConnected()'s read under muxAgent is synchronized with those writers.

client/internal/conn_state_pusher.go-293-298 (1)

293-298: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the push RPCs cancellable.

Both push paths use context.Background(). If the management RPC stalls, this goroutine blocks indefinitely and Stop() never returns. Please use a pusher-owned context and/or a bounded timeout so shutdown can interrupt an in-flight push.

Also applies to: 363-369
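A minimal sketch of the pusher-owned-context idea; the pushSink interface, Push signature, and 30-second bound are assumptions for illustration:

package internal

import (
    "context"
    "time"
)

// pushSink abstracts the management RPC used by the pusher.
type pushSink interface {
    Push(ctx context.Context, snapshot map[string]string) error
}

type connStatePusher struct {
    ctx    context.Context
    cancel context.CancelFunc
    sink   pushSink
}

func newConnStatePusher(sink pushSink) *connStatePusher {
    ctx, cancel := context.WithCancel(context.Background())
    return &connStatePusher{ctx: ctx, cancel: cancel, sink: sink}
}

// push derives a bounded context from the pusher-owned one, so a
// stalled RPC is abandoned after the timeout and Stop can interrupt
// an in-flight call.
func (p *connStatePusher) push(snapshot map[string]string) error {
    ctx, cancel := context.WithTimeout(p.ctx, 30*time.Second)
    defer cancel()
    return p.sink.Push(ctx, snapshot)
}

// Stop cancels the owned context; any blocked push returns promptly.
func (p *connStatePusher) Stop() { p.cancel() }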
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 293 - 298, Replace the use of context.Background() when calling p.sink.Push so the RPCs are cancellable: derive a context from a pusher-owned context (e.g., p.ctx or p.stopCtx) and/or wrap with a timeout via context.WithTimeout, pass that derived context to p.sink.Push (both in the PeerConnectionMap push and the other push path using p.sessionID), and ensure Stop() cancels the pusher-owned context (or calls the cancel func) so in-flight pushes are interrupted and Stop() can return.

client/internal/conn_state_pusher.go-382-396 (1)

382-396: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Peer removals never get propagated.

computeDeltaFromSource() only emits peers that still exist in the current snapshot. When a peer disappears, its old entry remains in lastPushed and there is no tombstone or full-snapshot-on-shrink path, so management will keep stale connection state until an explicit full snapshot is requested. This needs delete events or a forced full snapshot whenever the key set shrinks.
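A sketch of the shrink handling, assuming a boolean tombstone field is acceptable on the wire; everything except the computeDeltaFromSource/lastPushed names is hypothetical:

package internal

// peerConnState is a simplified stand-in for the pusher's per-peer
// snapshot entry; Removed is the hypothetical tombstone flag.
type peerConnState struct {
    ConnStatus string
    Removed    bool
}

// computeDelta emits changed entries plus tombstones for peers that
// vanished from the snapshot, and prunes them from lastPushed so the
// tracked key set cannot go stale.
func computeDelta(current, lastPushed map[string]peerConnState) map[string]peerConnState {
    delta := make(map[string]peerConnState)
    for key, state := range current {
        if prev, ok := lastPushed[key]; !ok || prev != state {
            delta[key] = state
            lastPushed[key] = state
        }
    }
    for key := range lastPushed {
        if _, ok := current[key]; !ok {
            delta[key] = peerConnState{Removed: true} // delete event
            delete(lastPushed, key)
        }
    }
    return delta
}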
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 382 - 396, computeDeltaFromSource currently only emits peers present in the snapshot and never reports removals; update it to detect when keys in p.lastPushed are absent from p.source.SnapshotAllRemotePeers() and emit corresponding "delete/tombstone" PeerStateChangeEvent entries (or a PeerStateChangeEvent that marks the peer as removed using whatever removed/disconnected field/constant exists) for each missing Pubkey, then remove those keys from p.lastPushed (or trigger the existing full-snapshot-on-shrink path if you prefer that alternative); keep references to p.lastPushed, computeDeltaFromSource, SnapshotAllRemotePeers, PeerStateChangeEvent and isMaterialChange when implementing the change.
shared/connectionmode/mode.go-19-29 (1)

19-29: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

p2p-dynamic-lazy still has no enum slot.

#5989's objective includes p2p-dynamic-lazy, but this Mode type stops at ModeP2PDynamic. Without a distinct value here, the parser and proto/config conversions cannot preserve that mode, so it will collapse into some other behavior during resolution or sync.
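A sketch of the extended enum; the existing constant names come from this review, while the ordering and the String mapping are assumptions:

package connectionmode

type Mode int

const (
    ModeFollowServer Mode = iota
    ModeRelayForced
    ModeP2P
    ModeP2PLazy
    ModeP2PDynamic
    ModeP2PDynamicLazy // new: gives "p2p-dynamic-lazy" a distinct slot
)

// String maps each mode to its wire/API spelling so the parser and
// proto/config conversions can round-trip the new value.
func (m Mode) String() string {
    switch m {
    case ModeRelayForced:
        return "relay-forced"
    case ModeP2P:
        return "p2p"
    case ModeP2PLazy:
        return "p2p-lazy"
    case ModeP2PDynamic:
        return "p2p-dynamic"
    case ModeP2PDynamicLazy:
        return "p2p-dynamic-lazy"
    default:
        return "follow-server"
    }
}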
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/connectionmode/mode.go` around lines 19 - 29, The Mode enum lacks a value for "p2p-dynamic-lazy" so Mode values like ModeP2PDynamic currently cannot represent it; add a new constant (e.g., ModeP2PDynamicLazy) into the const block alongside ModeP2PDynamic and ModeFollowServer, then update all conversion/serialization points that reference Mode (for example any ToProto/FromProto, parser, config marshaling/unmarshaling functions) to handle the new ModeP2PDynamicLazy case so the mode is preserved across parsing, proto conversion and sync.

client/internal/debouncer/debouncer.go-28-53 (1)

28-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stale timers can still run the newest callback.

Every AfterFunc closure reads shared state from d.fn. If an older timer has already fired when Trigger or Stop calls timer.Stop(), that stale closure can still run and invoke the latest callback, which breaks debounce semantics and lets Stop() leak work after shutdown.

Suggested fix

  type Debouncer struct {
      delay time.Duration
      mu    sync.Mutex
      timer *time.Timer
      fn    func()
+     gen   uint64
  }
@@
  func (d *Debouncer) Trigger(fn func()) {
      d.mu.Lock()
      defer d.mu.Unlock()
+     d.gen++
+     gen := d.gen
      d.fn = fn
      if d.timer != nil {
          d.timer.Stop()
      }
      d.timer = time.AfterFunc(d.delay, func() {
          d.mu.Lock()
+         if gen != d.gen {
+             d.mu.Unlock()
+             return
+         }
          f := d.fn
+         d.fn = nil
+         d.timer = nil
          d.mu.Unlock()
          if f != nil {
              f()
          }
      })
  }
@@
  func (d *Debouncer) Stop() {
      d.mu.Lock()
      defer d.mu.Unlock()
+     d.gen++
      if d.timer != nil {
          d.timer.Stop()
          d.timer = nil
      }
+     d.fn = nil
  }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/debouncer/debouncer.go` around lines 28 - 53, The Debouncer allows stale timers' closures to run and call the newest callback because AfterFunc closures read shared d.fn; fix by versioning/invalidation: add a generation counter (e.g. d.generation) and increment it whenever you replace/clear the timer (in Trigger and Stop); when creating the new timer in Debouncer.Trigger capture the current generation value and the callback into locals and have the closure check that the captured generation matches d.generation (under d.mu) before invoking the captured callback; also ensure Stop increments the generation and clears d.fn so any already-firing closures see a mismatched generation and return without calling the latest fn.

client/internal/profilemanager/config.go-178-203 (1)

178-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

All four timeout/connection-mode fields need explicit snake_case JSON tags per proto/API contracts.

The protobuf definitions and OpenAPI spec define these fields with snake_case JSON names: connection_mode, relay_timeout_seconds, and p2p_retry_max_seconds. Currently, ConnectionMode, RelayTimeoutSeconds, and P2pTimeoutSeconds lack explicit JSON tags and will serialize as PascalCase ("ConnectionMode", "RelayTimeoutSeconds", "P2pTimeoutSeconds"), while only P2pRetryMaxSeconds is correctly tagged "p2p_retry_max_seconds". This breaks serialization/deserialization consistency. Add JSON tags to all three: ConnectionMode → `json:"connection_mode,omitempty"`, RelayTimeoutSeconds → `json:"relay_timeout_seconds,omitempty"`, P2pTimeoutSeconds → `json:"p2p_timeout_seconds,omitempty"`.
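A sketch of the tagged fields only; the pointer types and the surrounding Config fields are assumptions:

package profilemanager

type Config struct {
    // Tags follow the snake_case names used by the proto and OpenAPI spec.
    ConnectionMode      *string `json:"connection_mode,omitempty"`
    RelayTimeoutSeconds *uint32 `json:"relay_timeout_seconds,omitempty"`
    P2pTimeoutSeconds   *uint32 `json:"p2p_timeout_seconds,omitempty"`
    P2pRetryMaxSeconds  *uint32 `json:"p2p_retry_max_seconds,omitempty"`
}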
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/profilemanager/config.go` around lines 178 - 203, The struct fields ConnectionMode, RelayTimeoutSeconds and P2pTimeoutSeconds are missing explicit snake_case JSON tags and will serialize as PascalCase; update their struct tags to match the proto/OpenAPI contract by adding json:"connection_mode,omitempty" to ConnectionMode, json:"relay_timeout_seconds,omitempty" to RelayTimeoutSeconds, and json:"p2p_timeout_seconds,omitempty" to P2pTimeoutSeconds (leave P2pRetryMaxSeconds as-is with json:"p2p_retry_max_seconds,omitempty"). Ensure tags are placed on the same field declarations as shown (ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds) so JSON marshal/unmarshal uses the correct snake_case names.

management/server/activity/codes.go-357-362 (1)

357-362: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the public event-code enum in sync with these new activities.

These codes are now emitted on the server side, but shared/management/http/api/types.gen.go still doesn't define matching EventActivityCode values. That leaves /api/events and notification-type consumers out of sync as soon as one of these events is returned.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/activity/codes.go` around lines 357 - 362, The public EventActivityCode enum (EventActivityCode) must be updated to include the new activity codes added in the server (AccountConnectionModeChanged, AccountRelayTimeoutChanged, AccountP2pTimeoutChanged, AccountP2pRetryMaxChanged, AccountLegacyLazyFallbackEnabledChanged, AccountLegacyLazyFallbackTimeoutChanged); open the generated API types file that defines EventActivityCode (types.gen.go) and add matching enum entries or re-run the generator that emits EventActivityCode so the shared API and server activity codes stay in sync, then rebuild to ensure /api/events consumers see the new values.

shared/management/http/api/types.gen.go-41-63 (1)

41-63: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regenerate this schema with the final connection-mode set.

The issue/PR objective includes p2p-dynamic-lazy, but this generated enum still exposes only four modes and the doc blocks still describe the old Phase-1 behavior. Any generated client or dashboard code built from this schema won't be able to accept or round-trip the final mode.

Also applies to: 1482-1487, 1576-1581
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated AccountSettingsConnectionMode enum and its Valid() method are missing the final "p2p-dynamic-lazy" mode and still reflect Phase-1 docs; regenerate or update the schema so the constants include AccountSettingsConnectionModeP2pDynamicLazy (value "p2p-dynamic-lazy"), add that constant to the switch in AccountSettingsConnectionMode.Valid(), and update the related doc comments (and the same changes for the other generated blocks referenced) so generated clients/dashboards can accept and round-trip the new mode.

client/android/preferences.go-325-330 (1)

325-330: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Map follow-server to clearing the override.

The getter docs advertise follow-server to gomobile callers, but this setter persists it as a literal mode string. That stores an invalid override instead of reverting to the server-pushed mode.

Suggested fix

  func (p *Preferences) SetConnectionMode(mode string) {
+     if mode == "" || mode == "follow-server" {
+         p.configInput.ConnectionMode = nil
+         return
+     }
      m := mode
      p.configInput.ConnectionMode = &m
  }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/android/preferences.go` around lines 325 - 330, The setter Preferences.SetConnectionMode currently persists the literal "follow-server" instead of clearing the override; update SetConnectionMode so that when mode is "" or "follow-server" it clears the override (set p.configInput.ConnectionMode = nil) to revert to server-pushed behavior, otherwise set the pointer to the provided mode string as before; locate this change in the Preferences.SetConnectionMode method to apply the mapping.

client/ui/peers_tab.go-319-320 (1)

319-320: ⚠️ Potential issue | 🟠 Major | 💤 Low value

Proto fields bytesRx and bytesTx are int64, not uint64.

The concern is valid. The proto definition (client/proto/daemon.proto:363-364) declares these as int64, and the generated getters return int64. Casting a negative int64 to uint64 would wrap around and render as an absurdly large transfer count. Either the proto fields should be uint64, or the code should validate the values are non-negative before casting.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/peers_tab.go` around lines 319 - 320, The proto getters p.GetBytesRx() and p.GetBytesTx() return int64, so avoid directly casting them to uint64; update the formatting in peers_tab.go where humanBytes(uint64(p.GetBytesRx())) / humanBytes(uint64(p.GetBytesTx())) is used — check each value from p.GetBytesRx() and p.GetBytesTx(), clamp negatives to 0 (or render a safe placeholder) and only convert to uint64 after ensuring non-negative, then pass the uint64 to humanBytes; alternatively, if you prefer schema change mention, change the proto fields to uint64 and regenerate, but the quick fix is to validate/clamp the int64 return values before conversion.

client/server/server.go-1540-1549 (1)

1540-1549: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard cfg.ServerSSHAllowed before dereferencing it.

This response path still treats ServerSSHAllowed as a required pointer, but the rest of the file handles it as optional. If an older or partially migrated profile leaves that field nil, GetConfig will panic here.

Proposed fix

+ serverSSHAllowed := true
+ if cfg.ServerSSHAllowed != nil {
+     serverSSHAllowed = *cfg.ServerSSHAllowed
+ }
+
  return &proto.GetConfigResponse{
      ManagementUrl:      managementURL.String(),
      PreSharedKey:       preSharedKey,
      AdminURL:           adminURL.String(),
      InterfaceName:      cfg.WgIface,
      WireguardPort:      int64(cfg.WgPort),
      Mtu:                int64(cfg.MTU),
      DisableAutoConnect: cfg.DisableAutoConnect,
-     ServerSSHAllowed:   *cfg.ServerSSHAllowed,
+     ServerSSHAllowed:   serverSSHAllowed,

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/server/server.go` around lines 1540 - 1549, The GetConfigResponse construction dereferences cfg.ServerSSHAllowed unguarded which can panic when it's nil; update the code building the proto.GetConfigResponse to check cfg.ServerSSHAllowed for nil and only set ServerSSHAllowed to *cfg.ServerSSHAllowed when non-nil (otherwise set a safe default or omit/set a nil/false-equivalent per proto semantics), i.e., guard the dereference of ServerSSHAllowed in the function that builds the response so the GetConfig path won't panic.

client/ui/network.go-379-384 (1)

379-384: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh the Peers tab when it is the active view.

Line 382 still routes the timer through getGridAndFilterFromTab(...), which treats the Peers tab as the default networks tab. In practice that means the visible Peers view never auto-refreshes, while the hidden all-networks grid keeps doing background refresh work. Branch on peersText here and call peersBundle.Refresh() instead.
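A rough sketch of the branch, with stand-in types since the real fyne tab API and fields of network.go are not reproduced here:

package ui

// Stand-ins for the pieces named in the review.
const peersText = "Peers"

type refreshable interface{ Refresh() }

type refreshState struct {
    activeTabText string
    peersBundle   refreshable
    refreshGrids  func() // wraps getGridAndFilterFromTab + updateNetworksSilent
}

// onTick branches on the active tab so the visible Peers view
// refreshes, while the hidden networks grids stop doing its work.
func (s *refreshState) onTick() {
    if s.activeTabText == peersText {
        s.peersBundle.Refresh()
        return
    }
    s.refreshGrids()
}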
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/network.go` around lines 379 - 384, The timer refresh currently always calls getGridAndFilterFromTab(tabs, allGrid, overlappingGrid, exitNodesGrid) which treats the Peers tab as networks and prevents the visible Peers view from auto-refreshing; update the branch in the timer handler to check if the active tab equals peersText and, if so, call peersBundle.Refresh() (and avoid calling getGridAndFilterFromTab or s.wNetworks.Content().Refresh() in that path), otherwise continue to call getGridAndFilterFromTab(...) and s.updateNetworksSilent(grid, f) so hidden networks still refresh as before.

client/ui/client_ui.go-656-666 (1)

656-666: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject invalid timeout text instead of silently converting it to 0.

parseUint32Field treats parse failures and overflow the same as an empty field, so values like abc or 999999999999 get saved as "use server default". That makes config edits lossy and very hard to diagnose.

Suggested direction

-func parseUint32Field(text string) uint32 {
+func parseUint32Field(text string) (uint32, error) {
     t := strings.TrimSpace(text)
     if t == "" {
-        return 0
+        return 0, nil
     }
     v, err := strconv.ParseUint(t, 10, 32)
     if err != nil {
-        return 0
+        return 0, fmt.Errorf("invalid timeout value %q", t)
     }
-    return uint32(v)
+    return uint32(v), nil
 }

Then validate/build with the returned error instead of silently sending zeroes.
Also applies to: 741-750
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/client_ui.go` around lines 656 - 666, parseUint32Field currently treats empty, invalid, and overflow inputs the same by returning 0, which silently converts bad user edits into "use server default"; change parseUint32Field to return (uint32, error) (or an explicit ok bool plus value) and return an error on invalid parse/overflow instead of 0, update callers (including the other similar helper at the 741-750 region) to check and propagate that error to the config validation/build path so invalid timeout text is rejected with a clear error rather than saved as zero.

management/server/http/handlers/peer_connections/handler.go-161-188 (1)

161-188: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Filter snapshot entries by the caller's peer visibility before returning them.

GetPeer only authorizes the source peer. buildResponse then emits every remote pubkey/endpoint in that peer's connection map, so a regular user who can see a routing peer via policy reachability can learn unrelated peers and their traffic metadata. Please drop or redact entries the caller cannot access under the same visibility rules used by GetPeer/GetPeers.
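A sketch of the filtering step, assuming a visibility helper with roughly this shape can be extracted from GetPeer/GetPeers; the entry type and function names are hypothetical:

package peerconnections

import "context"

// entry is a simplified stand-in for the handler's per-connection row.
type entry struct {
    RemotePubKey string
    Endpoint     string
}

// visibilityFn should implement the same rule GetPeer/GetPeers apply.
type visibilityFn func(ctx context.Context, accountID, userID, remotePubKey string) bool

// filterVisible drops entries the caller may not see before the
// handler serializes the response.
func filterVisible(ctx context.Context, accountID, userID string, in []entry, canSee visibilityFn) []entry {
    out := make([]entry, 0, len(in))
    for _, e := range in {
        if canSee(ctx, accountID, userID, e.RemotePubKey) {
            out = append(out, e)
        }
    }
    return out
}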
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/peer_connections/handler.go` around lines 161 - 188, buildResponse currently returns every entry in the connection map regardless of the caller's visibility; instead, apply the same visibility/filtering used by GetPeer/GetPeers before emitting entries. For each entry in buildResponse (refer to apiEntry assembly and the call to h.account.GetPeerByPubKey), call the same visibility-check helper used by GetPeer/GetPeers (or extract that logic into a shared helper like authorizePeerVisibility) and if the caller cannot see the remote peer then drop the entry or redact sensitive fields (remote pubkey, endpoint, RxBytes/TxBytes, latency) before appending to resp.Entries so only peers the caller is authorized to see are returned.

client/ui/client_ui.go-501-510 (1)

501-510: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expose p2p-dynamic-lazy everywhere the selector reads or writes modes.

The new mode is missing from the dropdown, the refresh options, the enable/disable switch, and the load-time switch. Opening settings on a profile already set to p2p-dynamic-lazy will fall back to "Follow server", and saving from there clears the override.

Suggested direction

  s.sConnectionMode = widget.NewSelect(
-     []string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic"},
+     []string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy"},
      func(string) { s.updateTimeoutEntriesEnabled() },
  )
@@
  s.sConnectionMode.Options = []string{
      s.followServerLabel(),
      "relay-forced",
      "p2p",
      "p2p-lazy",
      "p2p-dynamic",
+     "p2p-dynamic-lazy",
  }
@@
- case "p2p-dynamic":
+ case "p2p-dynamic", "p2p-dynamic-lazy":
      s.iRelayTimeout.Enable()
      s.iP2pTimeout.Enable()
      s.iP2pRetryMax.Enable()
@@
- case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic":
+ case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy":
      s.sConnectionMode.SetSelected(cfg.ConnectionMode)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/client_ui.go` around lines 501 - 510, The dropdown and related mode-handling code omit the new "p2p-dynamic-lazy" mode; add "p2p-dynamic-lazy" to the NewSelect options list (where s.sConnectionMode is created) and update every place that enumerates or branches on connection modes — e.g., the refresh/options arrays, the enable/disable switch logic, the load-time mapping that calls s.sConnectionMode.SetSelected or reads s.sConnectionMode.Selected, and the updateTimeoutEntriesEnabled function — so that checks that currently look for "p2p-dynamic" also accept "p2p-dynamic-lazy" (or include it as a new case) and saving/loading code preserves this exact string instead of falling back to "Follow server".

management/server/http/handlers/accounts/accounts_handler.go-231-283 (1)

231-283: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return InvalidArgument for these validation failures.

These branches build plain fmt.Errorf s, but updateAccount() forwards them to util.WriteError(). That makes malformed connection_mode / timeout input look like a server failure instead of a 4xx validation error. Please return status.Errorf(status.InvalidArgument, ...) here, including from validateUint32Timeout().

Suggested fix

  if req.Settings.ConnectionMode != nil {
      modeStr := string(*req.Settings.ConnectionMode)
      if !req.Settings.ConnectionMode.Valid() {
-         return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+         return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
      }
@@
      if v < 60 || v > 86400 {
-         return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+         return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
      }

  func validateUint32Timeout(name string, v int64) (uint32, error) {
      if v < 0 {
-         return 0, fmt.Errorf("invalid %s: %d (must be >= 0)", name, v)
+         return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (must be >= 0)", name, v)
      }
      if v > int64(math.MaxUint32) {
-         return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, math.MaxUint32)
+         return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (exceeds %d)", name, v, math.MaxUint32)
      }
      return uint32(v), nil
  }

Also applies to: 505-519
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler.go` around lines 231 - 283, The validation branches in accounts_handler.go (e.g., the connection_mode check, the legacy_lazy_fallback_timeout_seconds check, and calls to validateUint32Timeout for p2p_timeout_seconds, p2p_retry_max_seconds, relay_timeout_seconds) currently return plain fmt.Errorf which causes util.WriteError to treat them as server errors; change these to return gRPC status errors using status.Errorf(codes.InvalidArgument, ...) with the same message text, and update validateUint32Timeout to return a status.Error (or status.Errorf) on invalid input so its callers receive an InvalidArgument status; references: req.Settings.ConnectionMode, validateUint32Timeout, returnSettings.LegacyLazyFallbackTimeoutSeconds, and the timeout fields P2pTimeoutSeconds/P2pRetryMaxSeconds/RelayTimeoutSeconds.

client/internal/peer/conn.go-97-102 (1)

97-102: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

P2pRetryMaxSeconds currently can't represent "disabled".

The new flow collapses two different states into 0: "use DefaultP2PRetryMax" and "explicitly disable backoff". Open() interprets 0 as the default cap, and SetIceBackoffMax(0) stores the same value back into config, so an explicit disable is lost on first open/reopen. This needs a real tri-state through ConnConfig and the open/setter paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn.go` around lines 97 - 102, P2pRetryMaxSeconds currently conflates three states into 0; make it a tri-state by changing ConnConfig.P2pRetryMaxSeconds from uint32 to *uint32 (nil = use DefaultP2PRetryMax, pointer to 0 = explicit disable, pointer to >0 = explicit cap), update Open() to interpret nil/0/>0 accordingly (use DefaultP2PRetryMax for nil, treat 0 as disabled), and change SetIceBackoffMax to accept/store a *uint32 (or adapt its signature to set nil/0/value) so explicit disable isn't lost on reopen; apply the same pattern wherever P2pRetryMaxSeconds is consumed (e.g., the other referenced sites: lines ~222-227 and ~1428-1434) and adjust any code that previously relied on the sentinel ^uint32(0) to use the new pointer-tri-state semantics.

client/internal/conn_mgr.go-297-301 (1)

297-301: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Propagate live mode/timeout changes into existing peer.Conns.

This updates the resolved values on ConnMgr, but the only thing pushed into already-created peers is p2pRetryMax. The same peer.Conn objects are then reused by closeManager()/resetPeersToLazyIdle(), so their mode-sensitive logic keeps running with stale ConnConfig.Mode and timeout-derived behavior after a server push. A runtime switch to relay-forced or p2p-dynamic won't fully take effect until the peers are rebuilt.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 297 - 301, When updating e.mode, e.relayTimeoutSecs, e.p2pTimeoutSecs and e.p2pRetryMaxSecs, also push the new configuration into already-created peer.Conn objects so their ConnConfig.Mode and timeout-derived behavior are updated immediately; extend propagateP2pRetryMaxToConns() (or add a new helper called from this same place) to iterate existing peers and set each peer.Conn.ConnConfig.Mode, .RelayTimeout (or equivalent), .P2PTimeout and .P2PRetryMax fields from the ConnMgr's resolved values so closeManager() and resetPeersToLazyIdle() do not keep operating on stale settings. Ensure the update is safe for concurrent access to the peers collection.

management/internals/shared/grpc/conversion.go-282-289 (1)

282-289: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sentinel encoding inconsistency between configured_p2p_retry_max_secs and PeerConfig.P2pRetryMaxSeconds will cause false UI divergence warnings.

In toPeerConfig (lines 132-139), *settings.P2pRetryMaxSeconds == 0 is encoded on the wire as p2pRetryMaxDisabledSentinel (uint32 max) to mean "user explicitly disabled backoff", while wire-value 0 means "not set / use daemon default" — matching the wire-format semantics documented on PeerConfig.P2pRetryMaxSeconds in management.proto.

Here, cfgP2pRetryMax = derefUint32OrZero(c.Cfg.P2pRetryMaxSeconds) skips that transform: when an admin has explicitly disabled backoff (settings value 0), RemotePeerConfig.configured_p2p_retry_max_secs is sent as 0 while PeerConfig.P2pRetryMaxSeconds and the peer's reported EffectiveP2PRetryMaxSecs are sent as the sentinel max. UIs that compare "effective vs configured to spot local overrides (≠ → ⚠)" — as the proto comment on these fields states — will flag every legitimately backoff-disabled peer as diverging.

🛠️ Proposed fix: extract a small helper and apply the sentinel transform in both call sites
  var cfgConnMode string
  var cfgRelayTO, cfgP2pTO, cfgP2pRetryMax uint32
  if c.Cfg != nil {
      cfgConnMode = derefStringOrEmpty(c.Cfg.ConnectionMode)
      cfgRelayTO = derefUint32OrZero(c.Cfg.RelayTimeoutSeconds)
      cfgP2pTO = derefUint32OrZero(c.Cfg.P2pTimeoutSeconds)
-     cfgP2pRetryMax = derefUint32OrZero(c.Cfg.P2pRetryMaxSeconds)
+     cfgP2pRetryMax = encodeP2pRetryMaxWire(c.Cfg.P2pRetryMaxSeconds)
  }

// encodeP2pRetryMaxWire applies the wire-format sentinel for
// PeerConfig.P2pRetryMaxSeconds / configured_p2p_retry_max_secs:
//
//	nil   -> 0 (not set, use daemon default)
//	*v==0 -> max (user explicitly disabled backoff)
//	*v>0  -> *v
func encodeP2pRetryMaxWire(v *uint32) uint32 {
    if v == nil {
        return 0
    }
    if *v == 0 {
        return p2pRetryMaxDisabledSentinel
    }
    return *v
}

Then update toPeerConfig (lines 132-139) to use the same helper for symmetry.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/conversion.go` around lines 282 - 289, The configured_p2p_retry_max_secs wire-value is using derefUint32OrZero and therefore fails to apply the same sentinel mapping used by PeerConfig.P2pRetryMaxSeconds, causing false divergence; add a helper like encodeP2pRetryMaxWire(v *uint32) uint32 that returns 0 for nil, p2pRetryMaxDisabledSentinel for *v==0, and *v otherwise, then replace uses of derefUint32OrZero for the P2P retry field (e.g., the cfgP2pRetryMax assignment in this diff and the toPeerConfig mapping of settings.P2pRetryMaxSeconds) to call encodeP2pRetryMaxWire so both sides use the same wire encoding.

management/internals/shared/grpc/server.go-435-437 (1)

435-437: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add nil-guard on s.snapshotRouter to match the defensive SyncPeerConnections pattern.

The unconditional dereference of s.snapshotRouter.Register() on line 435 is inconsistent with the `if s.peerConnections != nil` check in SyncPeerConnections (line 1182). NewServer accepts both parameters with no documented non-nil contract, and multiple test files pass nil for peerConnRouter. Since those tests do invoke Sync() which calls handleUpdates(), the nil panic is a genuine risk.

Apply Option A (defensive guard):

- // Phase 3.7i (`#5989`): register for SnapshotRequest dispatch.
- snapshotCh := s.snapshotRouter.Register(peerKey.String())
- defer s.snapshotRouter.Unregister(peerKey.String(), snapshotCh)
+ // Phase 3.7i (`#5989`): register for SnapshotRequest dispatch.
+ var snapshotCh <-chan uint64
+ if s.snapshotRouter != nil {
+     snapshotCh = s.snapshotRouter.Register(peerKey.String())
+     defer s.snapshotRouter.Unregister(peerKey.String(), snapshotCh)
+ }

The existing `case nonce, ok := <-snapshotCh:` will safely block forever on a nil channel, degrading gracefully.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/server.go` around lines 435 - 437, The code unconditionally calls s.snapshotRouter.Register(...) and Unregister(...), causing a nil-deref when s.snapshotRouter is nil; add a defensive nil-guard around registration/unregistration similar to the SyncPeerConnections pattern: only call s.snapshotRouter.Register(peerKey.String()) and defer s.snapshotRouter.Unregister(peerKey.String(), snapshotCh) when s.snapshotRouter != nil, and handle snapshotCh being nil (the existing select/case will block on a nil channel as intended); update the block using the symbols snapshotCh, s.snapshotRouter.Register, s.snapshotRouter.Unregister and keep behavior when NewServer may receive a nil peerConnRouter.

shared/management/http/api/openapi.yml-379-405 (1)

379-405: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timeout defaults conflict with #5989 behavior goals.

Line 382 documents ICE idle default as 180 minutes, while Line 405 documents relay idle default as 5 minutes. #5989 objectives describe the opposite shape (ICE around 5 minutes, relay around 1 hour). Please align these descriptions (and values, if needed) with actual server behavior so clients don't apply wrong defaults.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/openapi.yml` around lines 379 - 405, The YAML descriptions and implied defaults are flipped: update the ICE idle timeout field (the description block that starts "Default ICE-worker idle timeout in seconds. 0 = never tear down.") to state the built-in default is ~5 minutes (NULL means use built-in default (5 minutes)) and change any example/value references accordingly, and update relay_timeout_seconds to state the built-in default is ~60 minutes (NULL means use built-in default (60 minutes)) so the documented defaults match the actual server behavior; adjust any example values and text to reflect these corrected defaults and keep p2p_retry_max_seconds unchanged.

shared/management/http/api/openapi.yml-362-372 (1)

362-372: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

connection_mode is missing p2p-dynamic-lazy and still documents stale phase behavior.

Line 365 omits p2p-dynamic-lazy, even though #5989 defines it as a first-class mode. This can block generated clients from sending/accepting valid values and causes an API contract mismatch.

Suggested OpenAPI fix

  connection_mode:
    x-experimental: true
    type: string
-   enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+   enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
    nullable: true
    description: |
      Account-wide default peer-connection mode. NULL means "fall
      back to lazy_connection_enabled" for backwards compatibility.
-     Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-     functional. p2p-dynamic is reserved (passes through as p2p in
-     Phase 1; will become functional in Phase 2).
+     Supported modes: relay-forced, p2p, p2p-lazy, p2p-dynamic,
+     and p2p-dynamic-lazy.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/openapi.yml` around lines 362 - 372, The OpenAPI schema for property connection_mode currently omits the valid enum value "p2p-dynamic-lazy" and contains stale phase-specific documentation; update the enum for connection_mode to include "p2p-dynamic-lazy" alongside relay-forced, p2p, p2p-lazy, and p2p-dynamic, keep nullable: true and x-experimental: true, and revise the description to remove the outdated "Phase 1/Phase 2" wording and instead document the current semantics and fallback behavior (NULL meaning fall back to lazy_connection_enabled) so generated clients accept and document the new mode correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 19da9f16-b34b-4316-be52-6651b3b8f39b
⛔ Files ignored due to path filters (7)
- client/proto/daemon.pb.go is excluded by !**/*.pb.go
- client/proto/daemon_grpc.pb.go is excluded by !**/*.pb.go
- flow/proto/flow.pb.go is excluded by !**/*.pb.go
- shared/management/proto/management.pb.go is excluded by !**/*.pb.go
- shared/management/proto/management_grpc.pb.go is excluded by !**/*.pb.go
- shared/management/proto/proxy_service.pb.go is excluded by !**/*.pb.go
- shared/signal/proto/signalexchange.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (108)
- .gitignore
- client/android/client.go
- client/android/peer_notifier.go
- client/android/preferences.go
- client/android/preferences_clamp_test.go
- client/cmd/root.go
- client/cmd/service.go
- client/cmd/service_installer.go
- client/cmd/testutil_test.go
- client/cmd/up.go
- client/iface/bind/activity.go
- client/iface/device/endpoint_manager.go
- client/internal/conn_mgr.go
- client/internal/conn_mgr_test.go
- client/internal/conn_state_pusher.go
- client/internal/conn_state_pusher_material_test.go
- client/internal/conn_state_pusher_test.go
- client/internal/conn_state_pusher_testhelper_test.go
- client/internal/connect.go
- client/internal/debouncer/debouncer.go
- client/internal/engine.go
- client/internal/engine_offline_debounce_test.go
- client/internal/engine_pusher_adapters.go
- client/internal/engine_test.go
- client/internal/lazyconn/activity/listener_bind_test.go
- client/internal/lazyconn/env.go
- client/internal/lazyconn/inactivity/manager.go
- client/internal/lazyconn/inactivity/manager_test.go
- client/internal/lazyconn/manager/manager.go
- client/internal/lazyconn/support.go
- client/internal/peer/conn.go
- client/internal/peer/conn_handover_order_test.go
- client/internal/peer/conn_lazy_keepwgpeer_test.go
- client/internal/peer/conn_test.go
- client/internal/peer/env.go
- client/internal/peer/env_test.go
- client/internal/peer/guard/guard.go
- client/internal/peer/guard/guard_test.go
- client/internal/peer/guard/ice_retry_state_test.go
- client/internal/peer/handshaker.go
- client/internal/peer/handshaker_test.go
- client/internal/peer/ice_backoff.go
- client/internal/peer/ice_backoff_test.go
- client/internal/peer/status.go
- client/internal/peer/status_debounce_test.go
- client/internal/peer/status_remote_meta_notify_test.go
- client/internal/peer/status_test.go
- client/internal/peer/worker_ice.go
- client/internal/peerstore/store.go
- client/internal/profilemanager/config.go
- client/internal/stdnet/filter.go
- client/internal/stdnet/filter_test.go
- client/proto/daemon.proto
- client/server/server.go
- client/server/server_test.go
- client/server/setconfig_test.go
- client/status/status.go
- client/status/status_test.go
- client/system/features.go
- client/system/features_test.go
- client/ui/client_ui.go
- client/ui/const.go
- client/ui/event_handler.go
- client/ui/network.go
- client/ui/peers_tab.go
- docs/bugs/2026-05-04-user-peer-visibility-regression.md
- docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
- management/internals/controllers/network_map/controller/controller.go
- management/internals/server/boot.go
- management/internals/shared/grpc/conversion.go
- management/internals/shared/grpc/conversion_test.go
- management/internals/shared/grpc/server.go
- management/server/account.go
- management/server/account/manager.go
- management/server/account/manager_mock.go
- management/server/activity/codes.go
- management/server/http/handler.go
- management/server/http/handlers/accounts/accounts_handler.go
- management/server/http/handlers/accounts/accounts_handler_test.go
- management/server/http/handlers/accounts/validate_uint32_timeout_test.go
- management/server/http/handlers/peer_connections/handler.go
- management/server/http/handlers/peer_connections/handler_test.go
- management/server/http/testing/testing_tools/channel/channel.go
- management/server/management_proto_test.go
- management/server/management_test.go
- management/server/mock_server/account_mock.go
- management/server/peer.go
- management/server/peer/peer.go
- management/server/peer/peer_test.go
- management/server/peer_connections/snapshot_router.go
- management/server/peer_connections/snapshot_router_test.go
- management/server/peer_connections/store.go
- management/server/peer_connections/store_test.go
- management/server/peer_test.go
- management/server/store/file_store.go
- management/server/store/sql_store.go
- management/server/types/settings.go
- management/server/types/settings_ptr_equal_test.go
- management/server/types/settings_test.go
- shared/connectionmode/mode.go
- shared/connectionmode/mode_test.go
- shared/management/client/client.go
- shared/management/client/client_test.go
- shared/management/client/grpc.go
- shared/management/client/mock.go
- shared/management/http/api/openapi.yml
- shared/management/http/api/types.gen.go
- shared/management/proto/management.proto
💤 Files with no reviewable changes (2)
- client/ui/const.go
- client/ui/event_handler.go
// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override. Negative values are clamped to 0;
// values larger than MaxUint32 are clamped to MaxUint32. The Android
// AdvancedFragment UI already clamps negatives but a Java caller using
// the bare gomobile API directly would otherwise wrap silently.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := clampUint32Seconds(secs)
	p.configInput.RelayTimeoutSeconds = &v
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cfg_file="$(fd -p 'config.go' client/internal/profilemanager | head -n1)"
if [[ -z "${cfg_file}" ]]; then
echo "Could not find client/internal/profilemanager/config.go" >&2
exit 1
fi
echo "=== ConfigInput / Config timeout fields ==="
rg -n -C3 'type ConfigInput struct|type Config struct|RelayTimeoutSeconds|P2pTimeoutSeconds|P2pRetryMaxSeconds|ConnectionMode' "$cfg_file"
echo
echo "=== Read / write paths ==="
rg -n -C5 'func ReadConfig|func UpdateOrCreateConfig' "$cfg_file"
echo
echo "=== Any nil-vs-zero normalization before persist ==="
rg -n -C3 'RelayTimeoutSeconds.*0|P2pTimeoutSeconds.*0|P2pRetryMaxSeconds.*0|ConnectionMode.*follow-server|nil' client/internal/profilemanager client/android

Repository: netbirdio/netbird
Length of output: 50374
🏁 Script executed:
cat -n client/android/preferences.go | sed -n '340,400p'

Repository: netbirdio/netbird
Length of output: 2798
🏁 Script executed:
cat -n client/android/preferences.go | sed -n '398,420p'

Repository: netbirdio/netbird
Length of output: 713
Fix setters to write nil when 0 is passed, not &0.
The documented behavior—"Pass 0 to clear the override"—is broken. When you call SetRelayTimeoutSeconds(0), the setter always creates a non-nil pointer to 0 and stores it in configInput.RelayTimeoutSeconds. When UpdateOrCreateConfig is called, it dereferences this pointer and writes literal 0 to the config file.
Per the Config struct documentation, local 0 means "follow server," so this persists an override (follow-server behavior) instead of clearing the local override. The same issue affects SetP2pTimeoutSeconds and SetP2pRetryMaxSeconds.
Each setter should check if the clamped value is 0 and write nil instead of &0:
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
v := clampUint32Seconds(secs)
if v == 0 {
p.configInput.RelayTimeoutSeconds = nil
} else {
p.configInput.RelayTimeoutSeconds = &v
}
}

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/android/preferences.go` around lines 347 - 355, The setters
(Preferences.SetRelayTimeoutSeconds, and likewise SetP2pTimeoutSeconds and
SetP2pRetryMaxSeconds) currently always take the clamped uint32 value v and
assign its address to p.configInput.* which creates a non-nil pointer for 0;
instead detect when clampUint32Seconds(secs) returns 0 and set the corresponding
p.configInput.RelayTimeoutSeconds / P2pTimeoutSeconds / P2pRetryMaxSeconds to
nil, otherwise set it to a pointer to v so passing 0 clears the override rather
than persisting a literal 0.
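A quick regression for the clear-on-zero contract could look like the sketch below. It is illustrative only: it assumes a zero-value Preferences is usable in tests (the real type may need a constructor), that the test lives in the same package as clampUint32Seconds, and that math and testing are imported.

func TestSetRelayTimeoutSeconds_ZeroClears(t *testing.T) {
	cases := []struct {
		in      int64
		wantNil bool
		want    uint32
	}{
		{in: 0, wantNil: true},  // documented: 0 clears the override
		{in: -5, wantNil: true}, // negatives clamp to 0, then clear
		{in: 30, want: 30},      // normal value stored as pointer
		{in: math.MaxUint32 + 1, want: math.MaxUint32}, // clamped high
	}
	for _, c := range cases {
		p := &Preferences{} // assumption: zero value is test-usable
		p.SetRelayTimeoutSeconds(c.in)
		got := p.configInput.RelayTimeoutSeconds
		if c.wantNil {
			if got != nil {
				t.Errorf("SetRelayTimeoutSeconds(%d): want nil, got %d", c.in, *got)
			}
			continue
		}
		if got == nil || *got != c.want {
			t.Errorf("SetRelayTimeoutSeconds(%d): want %d, got %v", c.in, c.want, got)
		}
	}
}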
case nonce, ok := <-snapshotCh:
	if !ok {
		continue
	}
	snapMsg := &network_map.UpdateMessage{
		Update: &proto.SyncResponse{
			SnapshotRequest: &proto.PeerSnapshotRequest{Nonce: nonce},
		},
	}
	if err := s.sendUpdate(ctx, accountID, peerKey, peer, snapMsg, srv, streamStartTime); err != nil {
		log.WithContext(ctx).Warnf("send snapshot request to %s: %v", peerKey.String(), err)
	}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect SnapshotRouter implementation to determine when it closes channels.
fd -t f 'snapshot_router.go' | xargs -I{} cat {}
ast-grep --pattern $'func ($_ *SnapshotRouter) Unregister($$$) {
$$$
}'
ast-grep --pattern $'func ($_ *SnapshotRouter) $_($$$) {
$$$
close($_)
$$$
}'

Repository: netbirdio/netbird
Length of output: 4855
Closed-channel busy-loop on fast-reconnect scenarios.
When a peer reconnects quickly, SnapshotRouter.Register() closes the previous stream's channel (documented behavior for fast-reconnect handling), but the old handleUpdates goroutine stays in its select loop reading from the now-closed channel: <-snapshotCh returns (0, false), the if !ok { continue } re-enters the select, and the closed case fires again immediately. The result is a tight spin that lasts until ctx.Done() fires and handleUpdates returns, at which point its deferred Unregister() finally runs.
Set snapshotCh = nil after closing so subsequent select evaluations skip this case (nil channels block forever in select):
Proposed fix
case nonce, ok := <-snapshotCh:
if !ok {
- continue
+ snapshotCh = nil
+ continue
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
case nonce, ok := <-snapshotCh:
	if !ok {
		snapshotCh = nil
		continue
	}
	snapMsg := &network_map.UpdateMessage{
		Update: &proto.SyncResponse{
			SnapshotRequest: &proto.PeerSnapshotRequest{Nonce: nonce},
		},
	}
	if err := s.sendUpdate(ctx, accountID, peerKey, peer, snapMsg, srv, streamStartTime); err != nil {
		log.WithContext(ctx).Warnf("send snapshot request to %s: %v", peerKey.String(), err)
	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/internals/shared/grpc/server.go` around lines 487 - 498, The
select case reading from snapshotCh can busy-loop when the channel is closed; in
the case branch where you detect the closed channel (if !ok), set snapshotCh =
nil so the select will skip that case (nil channels block) and then continue;
update the select-handling logic in the function/method containing snapshotCh
(the handleUpdates/select loop) to assign snapshotCh = nil immediately after
detecting closure before continuing to avoid the tight spin.
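The nil-channel trick is standard Go: receiving from a nil channel blocks forever, so assigning nil disables that select case without restructuring the loop. A self-contained demonstration with illustrative names, not PR code:

package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan int)
	done := time.After(50 * time.Millisecond)
	close(ch) // simulate the router closing a stale registration

	for {
		select {
		case v, ok := <-ch:
			if !ok {
				// Without this assignment the closed channel stays
				// selectable and the loop spins hot until done fires.
				ch = nil
				continue
			}
			fmt.Println("value:", v)
		case <-done:
			fmt.Println("clean exit; the nil channel never fired again")
			return
		}
	}
}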
Actionable comments posted: 6
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
management/server/peer/peer.go (1)
207-225: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

isEmpty not updated to consider the new Phase 3.7i fields.

The library context for this change states that PeerSystemMeta.isEmpty was updated to factor in the new fields, but the implementation still only inspects the legacy fields. A meta payload that only carried the new Effective*/SupportedFeatures data (e.g., a future minimal-meta path) would be silently treated as empty by UpdateMetaIfNew and skipped. In current flows real peers always populate Hostname/OS/etc., so this isn't actively triggered, but it's worth aligning to keep isEmpty and isEqual in sync.

🔧 Suggested alignment

 func (p PeerSystemMeta) isEmpty() bool {
 	return p.Hostname == "" &&
 		p.GoOS == "" &&
 		p.Kernel == "" &&
 		p.Core == "" &&
 		p.Platform == "" &&
 		p.OS == "" &&
 		p.OSVersion == "" &&
 		p.WtVersion == "" &&
 		p.UIVersion == "" &&
 		p.KernelVersion == "" &&
 		len(p.NetworkAddresses) == 0 &&
 		p.SystemSerialNumber == "" &&
 		p.SystemProductName == "" &&
 		p.SystemManufacturer == "" &&
 		p.Environment.Cloud == "" &&
 		p.Environment.Platform == "" &&
-		len(p.Files) == 0
+		len(p.Files) == 0 &&
+		p.EffectiveConnectionMode == "" &&
+		p.EffectiveRelayTimeoutSecs == 0 &&
+		p.EffectiveP2PTimeoutSecs == 0 &&
+		p.EffectiveP2PRetryMaxSecs == 0 &&
+		len(p.SupportedFeatures) == 0
 }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer/peer.go` around lines 207 - 225, PeerSystemMeta.isEmpty currently checks only legacy fields and doesn't consider the new Phase 3.7i fields (the Effective* fields and SupportedFeatures), so payloads containing only those new fields are treated as empty; update PeerSystemMeta.isEmpty to mirror the expanded checks in PeerSystemMeta.isEqual/UpdateMetaIfNew by including the Effective* fields (e.g., EffectiveOS, EffectivePlatform, EffectiveKernel, etc.) and SupportedFeatures (and any other new Phase 3.7i fields added to the struct), and ensure slices/maps are checked for length/non-nil the same way as in isEqual so UpdateMetaIfNew won't skip valid new-meta updates.
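One way to keep isEmpty and isEqual from drifting again is a small reflection canary in the same package: when PeerSystemMeta grows, the test fails and points at the two functions to update. A sketch only; the expected count below is hypothetical and must be set to the real struct's field count:

func TestPeerSystemMetaFieldCanary(t *testing.T) {
	// If this fails, a field was added to PeerSystemMeta: extend
	// isEmpty and isEqual, then bump the expected count here.
	const expectedFields = 22 // hypothetical; set to the actual count
	got := reflect.TypeOf(PeerSystemMeta{}).NumField()
	if got != expectedFields {
		t.Fatalf("PeerSystemMeta has %d fields, want %d; update isEmpty/isEqual", got, expectedFields)
	}
}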
🟠 Major comments (21)
docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md-44-44 (1)
44-44: ⚠️ Potential issue | 🟠 Major

Use dual-stack UDP socket instead of IPv4-only.

The proposed code restricts to "udp4", but the current sharedsock implementation explicitly creates both IPv4 and IPv6 raw sockets. The rest of the codebase, including ICEBind, DualStackPacketConn, and udpmux, expects dual-stack support. For consistency with existing infrastructure and to avoid breaking IPv6 connectivity, use "udp" (which handles both IPv4 and IPv6) or create separate IPv4 and IPv6 listeners:

udpConn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 0})

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` at line 44, The code currently calls net.ListenUDP with "udp4" which forces an IPv4-only socket; update the socket creation used by udpConn (the net.ListenUDP invocation) to use a dual-stack socket by switching the network string to "udp" or by creating separate IPv4 and IPv6 listeners so it matches the sharedsock/ICEBind/DualStackPacketConn/udpmux expectations and preserves IPv6 support; ensure the created socket(s) integrate with the existing ICEBind and DualStackPacketConn usage paths.

client/internal/stdnet/filter.go-51-54 (1)
51-54: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scope the "lo" prefix guard away from Windows.

Line 52 now filters any Windows interface whose friendly name starts with "Lo", not just the loopback pseudo-interface. Names like "Local Area Connection" would be dropped before the Windows-specific exceptions run, which can hide a real default-route NIC and recreate the relay-only failure you're trying to fix here.

Suggested fix

-	// Linux/macOS loopback prefix ("lo", "lo0").
-	if strings.HasPrefix(lowerIFace, "lo") {
+	// Unix loopback prefixes ("lo", "lo0").
+	if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") {
 		return false
 	}

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter.go` around lines 51 - 54, The current prefix guard uses strings.HasPrefix(lowerIFace, "lo") and wrongly drops Windows interfaces whose friendly name starts with "Lo"; change the condition to only apply on non-Windows platforms by checking runtime.GOOS (e.g. if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") { return false }) and add the runtime import if missing so Windows-specific exceptions still run for interfaces like "Local Area Connection".

docs/bugs/2026-05-04-user-peer-visibility-regression.md-33-70 (1)
33-70: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Redact user-identifiable and internal endpoint details in bug docs.
Line 33 and Lines 68-70 include a real-looking user identifier and internal host/path details. Please replace these with sanitized placeholders before merge to avoid persisting potentially sensitive identifiers/infrastructure metadata in git history.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/bugs/2026-05-04-user-peer-visibility-regression.md` around lines 33 - 70, Replace all user-identifiable and internal endpoint strings in this document: redact the user id `georg.stoisser-gigacher` and device id `ctb50-d` (and any other real names) with placeholders such as <USER_ID> and <PEER_ID>, and sanitize the curl example by replacing the real token and host/path `https://netbird.uplink.plant-control.net:44106/api/peers` with placeholders like <AUTH_TOKEN> and <INTERNAL_API_HOST>/api/peers (or remove the exact host entirely), ensuring no real identifiers remain in the text or examples (also scan for similar strings elsewhere in the file and replace them).

client/internal/debouncer/debouncer.go-31-42 (1)
31-42: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stale timer callback can execute the newest function (duplicate/early fire).

Because the callback reads shared d.fn (Line 37), an older time.AfterFunc that races with a newer Trigger can run the latest function before the new delay expires, and then the new timer runs it again. This violates last-write-wins debounce behavior.

Suggested fix (generation guard)

 type Debouncer struct {
 	delay time.Duration
 	mu    sync.Mutex
 	timer *time.Timer
 	fn    func()
+	gen   uint64
 }

 func (d *Debouncer) Trigger(fn func()) {
 	d.mu.Lock()
 	defer d.mu.Unlock()
+	d.gen++
+	currGen := d.gen
 	d.fn = fn
 	if d.timer != nil {
 		d.timer.Stop()
 	}
 	d.timer = time.AfterFunc(d.delay, func() {
 		d.mu.Lock()
-		f := d.fn
+		if currGen != d.gen {
+			d.mu.Unlock()
+			return
+		}
+		f := d.fn
+		d.fn = nil
+		d.timer = nil
 		d.mu.Unlock()
 		if f != nil {
 			f()
 		}
 	})
 }
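A standalone version of the generation guard makes the race easy to see: the superseded timer observes a stale generation and returns without firing. Illustrative sketch, runnable as-is:

package main

import (
	"fmt"
	"sync"
	"time"
)

type debouncer struct {
	mu    sync.Mutex
	delay time.Duration
	timer *time.Timer
	fn    func()
	gen   uint64
}

func (d *debouncer) Trigger(fn func()) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.gen++
	cur := d.gen
	d.fn = fn
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.delay, func() {
		d.mu.Lock()
		if cur != d.gen { // a newer Trigger superseded this timer
			d.mu.Unlock()
			return
		}
		f := d.fn
		d.fn = nil
		d.mu.Unlock()
		if f != nil {
			f()
		}
	})
}

func main() {
	d := &debouncer{delay: 30 * time.Millisecond}
	d.Trigger(func() { fmt.Println("first: should never print") })
	time.Sleep(10 * time.Millisecond)
	d.Trigger(func() { fmt.Println("second: only this prints") })
	time.Sleep(100 * time.Millisecond)
}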
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/debouncer/debouncer.go` around lines 31 - 42, The callback races by reading shared d.fn and can invoke a newer function; fix by introducing a generation guard on the debouncer: add a generation counter (e.g., d.generation), increment it inside the Trigger path when setting d.fn and creating d.timer, capture the current generation into a local variable in the time.AfterFunc closure, and inside the closure acquire d.mu then compare the captured generation to d.generation — only read/execute d.fn if the generations match so stale timers cannot run the latest function.

client/internal/peer/ice_backoff.go-178-197 (1)
178-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Consume the activity override atomically.

AllowActivityOverride() only checks the window; it does not mark the override as spent. Two activity callbacks racing here can both return true before either caller runs Reset(), which breaks the "one override per 5 minutes per peer" guarantee. It also ignores nextRetry, so an already-expired suspension still looks overrideable.

Suggested fix

 func (s *iceBackoffState) AllowActivityOverride() bool {
 	s.mu.Lock()
 	defer s.mu.Unlock()
-	if !s.suspended {
+	now := time.Now()
+	if !s.suspended || now.After(s.nextRetry) {
 		return false // not in backoff, nothing to override
 	}
-	if time.Since(s.lastResetAt) < activityOverrideMinInterval {
+	if now.Sub(s.lastResetAt) < activityOverrideMinInterval {
 		return false // too soon since last reset, respect rate limit
 	}
+	s.lastResetAt = now // consume the override budget before unlocking
 	return true
 }
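The underlying pattern is check-and-consume inside one critical section, so two racing callers can never both succeed. A generic sketch with illustrative names, not PR code:

import (
	"sync"
	"time"
)

type overrideGate struct {
	mu       sync.Mutex
	lastUsed time.Time
	minGap   time.Duration
}

// TryConsume reports whether the budget was available and, if so,
// spends it before unlocking; validation and consumption share the
// same lock so the guarantee holds under concurrency.
func (g *overrideGate) TryConsume() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	now := time.Now()
	if now.Sub(g.lastUsed) < g.minGap {
		return false // budget not replenished yet
	}
	g.lastUsed = now
	return true
}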
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/ice_backoff.go` around lines 178 - 197, AllowActivityOverride currently only inspects state and doesn't consume the override, allowing races; change it to atomically check and mark the override while holding s.mu: ensure s.suspended is true, ensure time.Since(s.lastResetAt) >= activityOverrideMinInterval, and ensure the suspension hasn't already expired by checking s.nextRetry <= time.Now() (or nextRetry.IsZero logic as appropriate); if all checks pass update s.lastResetAt = time.Now() (or otherwise record the override as spent) before returning true so the override cannot be reused by a concurrent caller; keep Reset() semantics intact.

management/server/account.go-477-506 (1)
477-506: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve nil vs 0 in the timeout audit payloads.

The change detection correctly treats nil and explicit 0 as different, but derefUint32Ptr() flattens both to 0 before storing the event. That makes a nil → 0 or 0 → nil transition audit as old=0, new=0, even though those states mean different things for these settings.

Suggested fix

 if !equalUint32Ptr(oldSettings.RelayTimeoutSeconds, newSettings.RelayTimeoutSeconds) {
 	am.StoreEvent(ctx, userID, accountID, accountID, activity.AccountRelayTimeoutChanged, map[string]any{
-		"old": derefUint32Ptr(oldSettings.RelayTimeoutSeconds),
-		"new": derefUint32Ptr(newSettings.RelayTimeoutSeconds),
+		"old": oldSettings.RelayTimeoutSeconds,
+		"new": newSettings.RelayTimeoutSeconds,
 	})
 }
 if !equalUint32Ptr(oldSettings.P2pTimeoutSeconds, newSettings.P2pTimeoutSeconds) {
 	am.StoreEvent(ctx, userID, accountID, accountID, activity.AccountP2pTimeoutChanged, map[string]any{
-		"old": derefUint32Ptr(oldSettings.P2pTimeoutSeconds),
-		"new": derefUint32Ptr(newSettings.P2pTimeoutSeconds),
+		"old": oldSettings.P2pTimeoutSeconds,
+		"new": newSettings.P2pTimeoutSeconds,
 	})
 }
 if !equalUint32Ptr(oldSettings.P2pRetryMaxSeconds, newSettings.P2pRetryMaxSeconds) {
 	am.StoreEvent(ctx, userID, accountID, accountID, activity.AccountP2pRetryMaxChanged, map[string]any{
-		"old": derefUint32Ptr(oldSettings.P2pRetryMaxSeconds),
-		"new": derefUint32Ptr(newSettings.P2pRetryMaxSeconds),
+		"old": oldSettings.P2pRetryMaxSeconds,
+		"new": newSettings.P2pRetryMaxSeconds,
 	})
 }

Also applies to: 537-541
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/account.go` around lines 477 - 506, The audit payload currently uses derefUint32Ptr(...) which turns nil and 0 into the same value; instead preserve nil vs non-nil by passing the pointer values through to the event payload. Replace derefUint32Ptr(oldSettings.X) / derefUint32Ptr(newSettings.X) with the actual pointer fields (oldSettings.RelayTimeoutSeconds, newSettings.RelayTimeoutSeconds, oldSettings.P2pTimeoutSeconds, newSettings.P2pTimeoutSeconds, oldSettings.P2pRetryMaxSeconds, newSettings.P2pRetryMaxSeconds, and similarly for LegacyLazyFallbackTimeoutSeconds) when calling am.StoreEvent so the map values reflect nil vs 0 while keeping the existing equalUint32Ptr checks and am.StoreEvent calls.

management/server/peer_connections/store.go-148-153 (1)
148-153: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expire stale entries before applying the nonce gate.

Line 148 can return false before Line 151 gets a chance to evict an expired entry, so a caller that keeps polling with a higher since value can leave dead peer maps in memory indefinitely. Check TTL first, then apply the nonce filter.

Suggested fix

 func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	e, ok := s.maps[peerPubKey]
 	if !ok {
 		return nil, false
 	}
-	if since > 0 && e.m.GetInResponseToNonce() < since {
-		return nil, false
-	}
 	if s.clock.Now().Sub(e.updatedAt) > s.ttl {
 		delete(s.maps, peerPubKey)
 		return nil, false
 	}
+	if since > 0 && e.m.GetInResponseToNonce() < since {
+		return nil, false
+	}
 	return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
 }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer_connections/store.go` around lines 148 - 153, The nonce check in the method returns before expired entries are evicted; change the order so you first check TTL and evict expired entries (use s.clock.Now().Sub(e.updatedAt) > s.ttl to delete s.maps[peerPubKey] and return nil,false) and only after that apply the nonce gate (if since > 0 && e.m.GetInResponseToNonce() < since return nil,false); also ensure you handle a nil entry (e) defensively before accessing fields like e.updatedAt or e.m.GetInResponseToNonce().

client/internal/peer/conn.go-1428-1434 (1)
1428-1434: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the live backoff cap aligned with the documented zero-as-default contract.

Open() normalizes P2pRetryMaxSeconds == 0 to DefaultP2PRetryMax, but SetIceBackoffMax(0) pushes a literal zero into iceBackoff.SetMaxBackoff. After a runtime config update back to "use default", the live backoff cap diverges from the config semantics.

Suggested fix

 func (conn *Conn) SetIceBackoffMax(d time.Duration) {
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
 	conn.config.P2pRetryMaxSeconds = uint32(d / time.Second)
+	liveMax := d
+	if liveMax == 0 {
+		liveMax = DefaultP2PRetryMax
+	}
 	if conn.iceBackoff != nil {
-		conn.iceBackoff.SetMaxBackoff(d)
+		conn.iceBackoff.SetMaxBackoff(liveMax)
 	}
 }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn.go` around lines 1428 - 1434, SetIceBackoffMax currently passes a literal zero into iceBackoff.SetMaxBackoff which breaks the "zero means use default" contract that Open() enforces; change SetIceBackoffMax (Conn.SetIceBackoffMax) to preserve the stored config value but when invoking conn.iceBackoff.SetMaxBackoff interpret d==0 as the effective default by substituting time.Duration(DefaultP2PRetryMax)*time.Second (referencing DefaultP2PRetryMax and Open() behavior). Keep conn.config.P2pRetryMaxSeconds set from the raw d (so config reflects the updated value), but call iceBackoff.SetMaxBackoff with the normalized cap when iceBackoff != nil.

client/internal/peer/worker_ice.go-103-117 (1)
103-117: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear agentConnecting on explicit shutdown, or this branch can wedge ICE forever.

With this new early-return, any WorkerICE.Close() during an in-flight dial can leave agentConnecting == true: Close() nils w.agent, and closeAgent() only clears the flag when w.agent == agent. After that, every future offer hits this branch and gets dropped permanently. The new Conn.DetachICE()/Conn.onNetworkChange() paths make that reachable.

Suggested fix

 func (w *WorkerICE) Close() {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()

 	if w.agent == nil {
 		return
 	}

 	w.agentDialerCancel()
 	if err := w.agent.Close(); err != nil {
 		w.log.Warnf("failed to close ICE agent: %s", err)
 	}
 	w.agent = nil
+	w.agentConnecting = false
+	w.remoteSessionChanged = false
+	w.remoteSessionID = ""
+	w.lastKnownState = ice.ConnectionStateDisconnected
 }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/worker_ice.go` around lines 103 - 117, The early-return when w.agentConnecting is true can permanently drop future offers after WorkerICE.Close() clears w.agent but leaves w.agentConnecting set; update the shutdown flow to clear the flag on explicit teardown: in WorkerICE.Close() (or in closeAgent() when invoked for an explicit shutdown or when agent == nil) set w.agentConnecting = false so the guard loop won't forever ignore incoming offers; touch the Close(), closeAgent(), and any DetachICE/onNetworkChange paths that nil w.agent to ensure they also clear agentConnecting to avoid wedging ICE.

shared/management/http/api/openapi.yml-379-406 (1)
379-406: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timeout default semantics look inconsistent with the phase objectives.

The documented defaults for p2p_timeout_seconds (180m) and relay_timeout_seconds (5m) appear opposite of the intended split-threshold behavior for this rollout. Please align these descriptions with the actual server defaults to avoid misconfiguration by API consumers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/openapi.yml` around lines 379 - 406, The OpenAPI descriptions for timeout defaults are inconsistent with server behavior; update the description text for p2p_timeout_seconds and relay_timeout_seconds so they match the actual server defaults (swap or correct the stated built-in defaults as implemented on the server) and clarify which modes each applies to; locate the YAML properties p2p_timeout_seconds and relay_timeout_seconds and change the quoted built-in default values and any explanatory text so p2p_timeout_seconds documents the server's actual default and relay_timeout_seconds documents its actual default (also ensure nullable/null semantics remain correct and any references to minutes vs seconds are accurate).

shared/management/http/api/openapi.yml-365-372 (1)
365-372: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

connection_mode schema is missing p2p-dynamic-lazy and still documents obsolete Phase-1 behavior.

This enum/docs block still describes p2p-dynamic as non-functional pass-through and does not include p2p-dynamic-lazy. That can cause client-side validation failures and incorrect API contract generation.

Suggested OpenAPI fix

 connection_mode:
   x-experimental: true
   type: string
-  enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+  enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
   nullable: true
   description: |
     Account-wide default peer-connection mode. NULL means
     "fall back to lazy_connection_enabled" for backwards
     compatibility.
-    Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-    functional. p2p-dynamic is reserved (passes through as p2p in
-    Phase 1; will become functional in Phase 2).
+    Supports relay-forced, p2p, p2p-lazy, p2p-dynamic, and
+    p2p-dynamic-lazy as defined by issue `#5989` rollout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/openapi.yml` around lines 365 - 372, The openapi enum for the connection_mode schema is out of date and missing the p2p-dynamic-lazy value and still documents obsolete "Phase 1" behavior; update the enum in the connection_mode schema to include "p2p-dynamic-lazy" (so enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]) and revise the description text to remove the Phase 1/pass-through wording and instead accurately describe the current semantics of p2p-dynamic and p2p-dynamic-lazy (and how NULL falls back to lazy_connection_enabled) so client validation and generated contracts match the implementation.

client/internal/peer/status.go-300-306 (1)
300-306: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Capture the connection-state listener before unlocking.

All of the new updatePeer*Locked paths call this helper after d.mux.Unlock(), but SetConnStateListener mutates d.connStateListener under the same mutex. Reading it here without synchronization introduces a real race on listener register/unregister.

Suggested pattern

-	d.mux.Unlock()
-
-	if hasConnStatusChanged(oldState, receivedState.ConnStatus) {
-		return d.notifyConnStateChange(receivedState.PubKey, peerState), nil
-	}
-	return func() {}, nil
+	notifyConn := func() {}
+	if hasConnStatusChanged(oldState, receivedState.ConnStatus) {
+		notifyConn = d.notifyConnStateChange(receivedState.PubKey, peerState)
+	}
+	d.mux.Unlock()
+	return notifyConn, nil

Apply the same capture-before-unlock pattern to the other updatePeer*Locked variants.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 300 - 306, The notifyConnStateChange function currently reads d.connStateListener without synchronization, creating a race with SetConnStateListener; modify the updatePeer*Locked callers to capture the listener while the mutex is held (e.g., listener := d.connStateListener) and pass that captured listener into notifyConnStateChange (or change notifyConnStateChange signature to accept a listener parameter) so that the listener is read under the lock and the returned callback can be safely invoked after d.mux.Unlock(); update all updatePeer*Locked variants to follow this capture-before-unlock pattern.

client/internal/peer/status_debounce_test.go-17-149 (1)
17-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix the errcheck lint failures (CI lint job is failing).

UpdatePeerICEState, UpdatePeerRemoteMeta, and UpdatePeerState all return errors that the tests drop on Lines 27, 37, 53, 80, 88, 129, 135. The lint pipeline is currently failing on every one of them. Best to surface real errors with require.NoError/assert.NoError so a regression in the recorder fails the test rather than silently passes.

♻️ Proposed fix (apply to all 7 sites)

-	rec.UpdatePeerICEState(State{
+	require.NoError(t, rec.UpdatePeerICEState(State{
 		PubKey:     key,
 		ConnStatus: StatusConnected,
 		// ...
-	})
+	}))

And add "github.com/stretchr/testify/require" to imports. Same treatment for the remaining UpdatePeerRemoteMeta/UpdatePeerState call sites at Lines 37, 53, 80, 88, 129, 135.

As per static analysis / pipeline failures: golangci-lint errcheck is failing on Lines 27, 37, 53, 80, 88, 129, 135.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status_debounce_test.go` around lines 17 - 149, The tests drop errors from recorder methods causing errcheck failures: replace the bare calls to UpdatePeerICEState, UpdatePeerRemoteMeta, and UpdatePeerState in TestStatus_DuringOfflineDebounce_LocalConnStateUnchanged, TestStatus_AfterDebouncedClose_StatusReflectsLocalIdle, TestStatus_DeriveExtended_DuringLivenessFlap (where used), and TestStatus_GetFullStatus_PreservesEffectiveAndBackoffFields with assertions that surface errors (e.g., require.NoError or assert.NoError) and add the "github.com/stretchr/testify/require" import; specifically wrap each call to UpdatePeerICEState, UpdatePeerRemoteMeta, and UpdatePeerState with require.NoError(t, err) (or assert.NoError) so test failures propagate and errcheck is satisfied.

client/ui/client_ui.go-501-510 (1)
501-510: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add p2p-dynamic-lazy everywhere this UI enumerates connection modes.

The selector, refresh logic, enablement switch, and load path only know about four modes. If a profile already has p2p-dynamic-lazy, the settings window will show Follow server, and saving any unrelated change will clear that override because buildSetConfigRequest() always sends the selected mode back.

♻️ Suggested update

-	[]string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic"},
+	[]string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy"},

 s.sConnectionMode.Options = []string{
 	s.followServerLabel(),
 	"relay-forced",
 	"p2p",
 	"p2p-lazy",
 	"p2p-dynamic",
+	"p2p-dynamic-lazy",
 }

-	case "p2p-dynamic":
+	case "p2p-dynamic", "p2p-dynamic-lazy":
 		s.iRelayTimeout.Enable()
 		s.iP2pTimeout.Enable()
 		s.iP2pRetryMax.Enable()

-	case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic":
+	case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy":
 		s.sConnectionMode.SetSelected(cfg.ConnectionMode)

Also applies to: 741-745, 855-866, 889-902, 1573-1598
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/client_ui.go` around lines 501 - 510, The connection-mode selector and related logic miss the "p2p-dynamic-lazy" option so profiles with that mode get mapped to "Follow server" and overwritten; update the options slice passed to s.sConnectionMode = widget.NewSelect (and any other NewSelect usages for connection modes) to include "p2p-dynamic-lazy", adjust any switch/if logic in updateTimeoutEntriesEnabled and the refresh/load code that checks modes to handle "p2p-dynamic-lazy" the same as "p2p-dynamic" (or the intended behavior), and ensure buildSetConfigRequest (and the save/load paths referenced around the other occurrences) preserves and emits "p2p-dynamic-lazy" when selected so the profile value isn't lost.

client/internal/conn_state_pusher.go-317-323 (1)
317-323: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A successful full snapshot should replace lastPushed, not merge into it.

flushFull() currently calls markPushed(), which only overwrites keys present in the snapshot. After a full snapshot that omits peer B, management will correctly drop B, but lastPushed will still retain it locally. If B later reappears with the same state, computeDeltaFromSource() suppresses the delta because it compares against that stale cached entry, so management never gets B back until another full snapshot.

♻️ Suggested update

 func (p *connStatePusher) markPushed(events []PeerStateChangeEvent) {
 	p.mu.Lock()
 	for _, ev := range events {
 		p.lastPushed[ev.Pubkey] = ev
 	}
 	p.mu.Unlock()
 }
+
+func (p *connStatePusher) replacePushed(events []PeerStateChangeEvent) {
+	next := make(map[string]PeerStateChangeEvent, len(events))
+	for _, ev := range events {
+		next[ev.Pubkey] = ev
+	}
+	p.mu.Lock()
+	p.lastPushed = next
+	p.mu.Unlock()
+}

-	p.markPushed(events)
+	p.replacePushed(events)

Also applies to: 346-379, 382-396
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 317 - 323, The bug: markPushed currently merges events into p.lastPushed which leaves stale entries after a full snapshot; when flushFull() calls markPushed it should replace the entire cache so omitted peers are removed. Fix by changing the behavior used by full snapshots: add or modify a function (e.g., markPushedFull or extend markPushed with a flag) so that when called from flushFull() it acquires p.mu, replaces p.lastPushed with a freshly allocated map populated only with the provided events (instead of writing/merging into the existing map), and then unlocks; keep the existing merge behavior for incremental updates used elsewhere (so computeDeltaFromSource compares against the correct state).

client/ui/client_ui.go-638-640 (1)
638-640: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't silently turn invalid timeout text into 0.

parseUint32Field() collapses every parse error to 0, and buildSetConfigRequest() uses that value directly. A typo like 30s or abc will clear the override instead of surfacing a validation error, and hasConnectionModeChanges() can even treat the bad input as "no change" when the stored value is already zero.

🛡️ Suggested direction

-func parseUint32Field(text string) uint32 {
+func parseUint32Field(text string) (uint32, error) {
 	t := strings.TrimSpace(text)
 	if t == "" {
-		return 0
+		return 0, nil
 	}
 	v, err := strconv.ParseUint(t, 10, 32)
 	if err != nil {
-		return 0
+		return 0, fmt.Errorf("must be an unsigned integer")
 	}
-	return uint32(v)
+	return uint32(v), nil
 }

Then validate the three timeout entries during save and only build the request when all non-empty values parse successfully.
Also applies to: 656-666, 745-750
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/client_ui.go` around lines 638 - 640, parseUint32Field currently converts any parse error to 0 which lets typos like "30s" silently clear overrides and makes hasConnectionModeChanges/buildSetConfigRequest behave incorrectly; update the save/validation flow to explicitly parse iRelayTimeout.Text, iP2pTimeout.Text, and iP2pRetryMax.Text using a parsing function that returns (value, error), validate each non-empty field before building the SetConfigRequest, surface validation errors to the user instead of substituting 0, and only call buildSetConfigRequest (and compare against relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs) when all parses succeeded so invalid input does not get treated as a valid zero override.

client/internal/conn_state_pusher.go-153-164 (1)
153-164: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the newest refresh nonce when the queue is full.

The comment says snapshot requests are coalesced to "the latest nonce", but this implementation drops the newest nonce once the 4-slot buffer fills. That means POST /connections/refresh can hand back a token that will never be echoed by the next full snapshot, so the caller waits against a nonce the pusher already discarded.

♻️ Suggested update

 func (p *connStatePusher) OnSnapshotRequest(nonce uint64) {
 	if p == nil {
 		return
 	}
 	select {
 	case p.snapshotReq <- nonce:
 	default:
+		select {
+		case <-p.snapshotReq:
+		default:
+		}
+		select {
+		case p.snapshotReq <- nonce:
+		default:
+		}
 	}
 }
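The drop-oldest, keep-newest idiom generalizes to any bounded coalescing queue: evict one stale element non-blockingly, then retry the send. Standalone sketch with illustrative names; with several concurrent producers the slot ends up holding some recent nonce, which is acceptable for coalescing:

func enqueueNewest(ch chan uint64, nonce uint64) {
	select {
	case ch <- nonce:
		return // fast path: buffer had room
	default:
	}
	select {
	case <-ch: // evict the oldest queued nonce
	default:
	}
	select {
	case ch <- nonce:
	default:
		// lost a race with another producer; their value is at
		// least as fresh, so dropping ours is fine
	}
}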
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 153 - 164, OnSnapshotRequest currently drops the newest nonce when p.snapshotReq is full; change it to retain the newest nonce by removing one oldest entry and then enqueueing the incoming nonce. Locate connStatePusher.OnSnapshotRequest and update the default branch to non-blockingly discard a single value from p.snapshotReq (e.g., a receive with select) and then attempt to send the new nonce into p.snapshotReq (retrying non-blockingly if needed) so the channel always ends up containing the most recent nonce rather than silently dropping it.

client/internal/engine.go-2759-2803 (1)
2759-2803: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Serialize the debounced close with syncMsgMux.

This timer callback mutates peer lifecycle state (statusRecorder, peerStore, connMgr, conn.Close) from its own goroutine, while the rest of connection management is serialized under e.syncMsgMux. That opens a race with handleSync, signal processing, and removePeer, so the debounce can close a conn mid-reconfigure or after concurrent teardown started.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/engine.go` around lines 2759 - 2803, The timer callback in Engine.scheduleRemoteOfflineClose mutates peer lifecycle state from its own goroutine and must be serialized with the existing e.syncMsgMux; modify the anonymous func passed to time.AfterFunc so it acquires e.syncMsgMux.Lock() at the start (and defers Unlock()) before performing the re-validation and any calls into e.statusRecorder, e.peerStore, e.connMgr or conn.Close; this ensures the debounce handler runs under the same synchronization used by handleSync/removePeer and prevents races while keeping the existing early-return checks intact.

management/internals/shared/grpc/conversion.go-315-326 (1)
315-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only mark liveness as authoritative when a status record exists.

Lines 315-321 correctly nil-check rPeer.Status, but Line 326 still sets ServerLivenessKnown = true even when no status data was available. In that case clients will treat live_online=false as a real offline signal instead of "unknown".

Suggested fix

 if rPeer.Status != nil {
 	if !rPeer.Status.LastSeen.IsZero() {
 		cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
 	}
 	cfg.LiveOnline = rPeer.Status.Connected
+	cfg.ServerLivenessKnown = true
 }
-// New servers always know per-peer liveness; signal that to new
-// clients so they can trust LiveOnline directly instead of
-// guessing from the LastSeenAtServer-zero heuristic. Old servers
-// leave this field at default (false) and clients fall back.
-cfg.ServerLivenessKnown = true

 dst = append(dst, cfg)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/conversion.go` around lines 315 - 326, The code wrongly sets cfg.ServerLivenessKnown = true unconditionally; change it so ServerLivenessKnown is only true when a status record exists by guarding the assignment with the same nil-check used above (i.e., set cfg.ServerLivenessKnown = true inside the if rPeer.Status != nil block or assign cfg.ServerLivenessKnown = (rPeer.Status != nil)), keeping existing handling of LastSeen and LiveOnline in the rPeer.Status branch.

management/internals/shared/grpc/server.go-487-490 (1)
487-490: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disable the snapshot case after channel closure.

On Line 489, continue leaves a closed snapshotCh permanently selectable, so this loop can spin hot and starve normal updates once the router closes that registration. Nil the channel or return after ok == false.

Suggested fix

 case nonce, ok := <-snapshotCh:
 	if !ok {
-		continue
+		snapshotCh = nil
+		continue
 	}
 	snapMsg := &network_map.UpdateMessage{

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/server.go` around lines 487 - 490, The snapshot case leaves a closed snapshotCh selectable and can cause a hot spin; when the receive yields ok == false, disable that case by setting snapshotCh = nil (or return from the surrounding goroutine) instead of continuing. Locate the receive case handling "nonce, ok := <-snapshotCh" and replace the "if !ok { continue }" with "snapshotCh = nil" (or an early return) so the select no longer considers the closed channel.

client/internal/conn_mgr.go-292-341 (1)
292-341: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart or reconfigure the lazy manager when only timeouts change.

This branch updates relayTimeoutSecs/p2pTimeoutSecs, but if the mode stays p2p-lazy or p2p-dynamic, the already-running lazyConnMgr keeps its old inactivity thresholds. A server-pushed timeout change therefore has no effect until a later mode flip or daemon restart.

Suggested direction

-	if modeChanged && wasManaged && isManaged {
+	timeoutChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
+
+	if (modeChanged || timeoutChanged) && wasManaged && isManaged {
 		// Switching between lazy and dynamic at runtime: tear down the
 		// existing manager so initLazyManager picks up the new timeouts.
-		log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
+		log.Infof("lazy/dynamic settings changed, restarting manager (mode=%s->%s)", prev, newMode)
 		e.closeManager(ctx)
 		e.statusRecorder.UpdateLazyConnection(false)
 	}

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 292 - 341, The code updates relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs but doesn't restart or reconfigure an already-running lazyConnMgr when mode stays managed, so timeout changes don't take effect; modify the branch after you've set e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs to detect when isManaged && e.lazyConnMgr != nil && mode didn't change but any of those timeout fields changed, then call e.closeManager(ctx) followed by e.initLazyManager(ctx) and e.startModeSideEffects() (and optionally e.resetPeersToLazyIdle(ctx) if you want immediate idle semantics), ensuring the manager picks up the new thresholds; use the existing functions closeManager, initLazyManager, startModeSideEffects and reference the fields relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs and lazyConnMgr.
🟡 Minor comments (8)
docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md-44-60 (1)
44-60: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resource leak on error path.

If any operation fails after net.ListenUDP succeeds (line 44), the udpConn socket will leak because it's not closed before returning the error. Add cleanup for partial failures.

🔧 Proposed fix with cleanup

 udpConn, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 0})
 if err != nil {
 	return nil, fmt.Errorf("listen udp for ICE: %w", err)
 }
-log.Infof("ICE using dedicated UDP port: %d (WireGuard kernel owns port %d)", udpConn.LocalAddr().(*net.UDPAddr).Port, t.wgPort)
+udpAddr, ok := udpConn.LocalAddr().(*net.UDPAddr)
+if !ok {
+	udpConn.Close()
+	return nil, fmt.Errorf("unexpected address type: %T", udpConn.LocalAddr())
+}
+log.Infof("ICE using dedicated UDP port: %d (WireGuard kernel owns port %d)", udpAddr.Port, t.wgPort)

 bindParams := udpmux.UniversalUDPMuxParams{
 	UDPConn: nbnet.WrapPacketConn(udpConn),

Note: Also verify that there's a cleanup path (likely in a Close() or Down() method) that closes t.udpMuxConn when the device is torn down.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` around lines 44 - 60, The code opens udpConn via net.ListenUDP and assigns it to t.udpMuxConn only after creating the udpmux and starting mux.ReadFromConn, which leaks udpConn if any subsequent operation fails; update the function to defer a cleanup that closes udpConn on error (e.g., capture success with a boolean or set t.udpMuxConn only after all steps succeed and call udpConn.Close() on early returns), ensure you stop the started goroutine or only start mux.ReadFromConn after assigning t.udpMux and t.udpMuxConn successfully, and verify the device Close()/Down() path closes t.udpMuxConn and stops the mux to avoid goroutine/socket leaks (referencing udpConn, bindParams, udpmux.NewUniversalUDPMuxDefault, mux.ReadFromConn, t.udpMuxConn, t.udpMux).

docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md-48-48 (1)
48-48: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Type assertion could panic.

The type assertion udpConn.LocalAddr().(*net.UDPAddr).Port will panic if LocalAddr() returns a different address type. While unlikely for a UDP connection, defensive code should check the type assertion result.

🛡️ Proposed fix with safe type assertion

-log.Infof("ICE using dedicated UDP port: %d (WireGuard kernel owns port %d)", udpConn.LocalAddr().(*net.UDPAddr).Port, t.wgPort)
+udpAddr, ok := udpConn.LocalAddr().(*net.UDPAddr)
+if !ok {
+	udpConn.Close()
+	return nil, fmt.Errorf("unexpected address type: %T", udpConn.LocalAddr())
+}
+log.Infof("ICE using dedicated UDP port: %d (WireGuard kernel owns port %d)", udpAddr.Port, t.wgPort)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` at line 48, The log line uses a direct type assertion udpConn.LocalAddr().(*net.UDPAddr).Port which can panic; update the code around the log.Infof call to perform a safe type assertion (addr, ok := udpConn.LocalAddr().(*net.UDPAddr)) and use addr.Port when ok, otherwise fall back to printing the address via udpConn.LocalAddr().String() or a placeholder; ensure you still include t.wgPort in the message and preserve the original log.Infof call semantics (use the safe value or fallback in the formatted string).

client/internal/stdnet/filter_test.go-51-65 (1)
51-65: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This loop does not assert most of the want == true cases.

Right now the test only checks negative cases, plus one Windows-only special case for vEthernet (LAN). If Ethernet USB, OpenVPN 1, WiFi, or vEthernet (External) start returning false, this test still passes. Since this is meant to pin Windows behavior, I'd either skip it off Windows or assert got == c.want for every case.

Suggested fix

 func TestInterfaceFilter_Windows_TargetedFiltering(t *testing.T) {
+	if runtime.GOOS != "windows" {
+		t.Skip("Windows-specific regression test")
+	}
+
 	disallow := []string{"wt", "wg", "veth", "br-", "lo", "docker"}
 	allow := InterfaceFilter(disallow)
@@
 	for _, c := range cases {
 		// The wgctrl branch can override on hosts where NetBird is
 		// running; tests run on a host where these names are not
 		// real interfaces, so the final return faithfully reflects
 		// the disallow-list logic.
 		got := allow(c.name)
-		// "veth*" prefix only filters on non-Windows; on Linux test
-		// runners "vEthernet (LAN)" still passes because of mixed
-		// case + the !Windows branch keeping the prefix match.
-		if !c.want && got {
-			t.Errorf("InterfaceFilter(%q) = true, want false (should be filtered)", c.name)
-		}
-		if c.want && !got && runtime.GOOS == "windows" && c.name == "vEthernet (LAN)" {
-			t.Fatalf("InterfaceFilter(%q) = false, want true on Windows (this is uray-mic-d4's default-route interface)", c.name)
+		if got != c.want {
+			t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want)
 		}
 	}
 }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter_test.go` around lines 51 - 65, The test loop in filter_test.go only asserts negative cases and one Windows special-case, so it can miss regressions where entries that should be allowed (c.want == true) start returning false; update the loop that iterates over cases and the call to allow(c.name) to assert equality for every case (i.e., if got != c.want { t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want) }) instead of the two separate conditionals, keeping the call to allow and the cases slice as-is (or alternatively skip the entire test when runtime.GOOS != "windows" if the intention is to only pin Windows behavior).

management/server/types/settings_ptr_equal_test.go-71-72 (1)
71-72: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename new to avoid the predeclared-identifier lint failure.

Line 71 shadows Go's predeclared new function. The predeclared linter is enabled in .golangci.yaml and will flag this violation. Rename the variable to updated or next to clear the check without changing test logic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/types/settings_ptr_equal_test.go` around lines 71 - 72, The test creates a variable named new which shadows Go's predeclared new; rename that variable (e.g., to updated or next) wherever it is assigned from old.Copy() and used to set ConnectionMode = &dynamic so the logic (old.Copy(), updated.ConnectionMode = &dynamic, and any subsequent assertions referencing the variable) remains unchanged and the predeclared-identifier lint error is resolved.

client/internal/peer/conn_handover_order_test.go-100-115 (1)
100-115: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test will false-fail if onWGTimeoutRecover is referenced before CloseConn() (e.g., a nil-guard).

strings.Index returns the first occurrence. A common pattern is if c.onWGTimeoutRecover != nil { ... } placed before the close switch, with the actual invocation after it. In that case idxCb < idxClose and the test fails despite the invariant being satisfied. Match the actual call site (e.g., search for onWGTimeoutRecover( ) and/or use strings.LastIndex to locate the invocation.

🛠️ Proposed fix

-	const cbField = "onWGTimeoutRecover"
-	if !strings.Contains(body, cbField) {
-		t.Fatalf("onWGDisconnected missing reference to %q — WG-timeout recovery is broken", cbField)
+	const cbField = "onWGTimeoutRecover"
+	const cbCall = "onWGTimeoutRecover("
+	if !strings.Contains(body, cbField) {
+		t.Fatalf("onWGDisconnected missing reference to %q — WG-timeout recovery is broken", cbField)
 	}
 	// The callback must be invoked AFTER the conn close switch (otherwise
 	// lazy mgr would be re-armed before the active workers are torn down).
 	idxClose := strings.Index(body, "workerRelay.CloseConn()")
-	idxCb := strings.Index(body, cbField)
+	idxCb := strings.LastIndex(body, cbCall)
 	if idxClose < 0 {
 		t.Fatalf("workerRelay.CloseConn() landmark missing")
 	}
+	if idxCb < 0 {
+		t.Fatalf("onWGDisconnected missing call site %q", cbCall)
+	}
 	if idxCb < idxClose {
 		t.Errorf("recover callback (idx %d) must come AFTER worker close (idx %d)", idxCb, idxClose)
 	}

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn_handover_order_test.go` around lines 100 - 115, The test currently uses strings.Index to find the first occurrence of "onWGTimeoutRecover" which can pick up nil-guards before the actual invocation and cause false failures; update the check in the onWGDisconnected test to locate the actual call site (e.g., search for the substring "onWGTimeoutRecover(") or use strings.LastIndex to find the final occurrence, then compare that index (idxCb) against the "workerRelay.CloseConn()" index (idxClose) to ensure the invocation occurs after the close; keep the existing error messages and landmarks (onWGDisconnected, onWGTimeoutRecover, workerRelay.CloseConn) but change the search logic so the test matches the real call site rather than the first reference.

management/server/http/handlers/peer_connections/handler_test.go-40-46 (1)
40-46: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Enforce account scoping in fake GetPeerByPubKey.

This fake ignores accountID, so tests can pass even if cross-account key lookups are accidentally allowed in handler logic.

Suggested hardening

-func (a *fakeAM) GetPeerByPubKey(_ context.Context, _, pubKey string) (*nbpeer.Peer, error) {
+func (a *fakeAM) GetPeerByPubKey(_ context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
 	p, ok := a.peersByKey[pubKey]
 	if !ok {
 		return nil, errors.New("not found")
 	}
+	if a.allowedAcc != "" && accountID != a.allowedAcc {
+		return nil, errors.New("not found")
+	}
+	if p.AccountID != "" && p.AccountID != accountID {
+		return nil, errors.New("not found")
+	}
 	return p, nil
 }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/peer_connections/handler_test.go` around lines 40 - 46, The fake GetPeerByPubKey implementation ignores the accountID parameter and looks up peers solely by pubKey (function: GetPeerByPubKey in fakeAM, map: peersByKey), allowing cross-account lookups in tests; update the fake to respect account scoping by using the accountID parameter (do not use the blank identifier), and change the lookup to be account-scoped — e.g., maintain peersByAccount map[string]map[string]*nbpeer.Peer or use a composite key including accountID when accessing peersByKey — and return "not found" when the pubKey is not present for that account.

client/cmd/service_installer.go-173-211 (1)
173-211: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent nil-guarding on cmd.Flag(...).Changed.

Lines 175-180 carefully nil-check each flag (f != nil && f.Changed), but Lines 194-205 dereference cmd.Flag(name).Changed without that guard. If any of the four flags ever isn't registered on installCmd/reconfigureCmd (e.g., refactor, build-tag-gated registration), the second block panics. Memoize the result of the first scan and reuse it:

♻️ Proposed consolidation

 func applyConnectionModeFlagsToProfile(cmd *cobra.Command) error {
-	anyChanged := false
-	for _, name := range []string{connectionModeFlag, relayTimeoutFlag, p2pTimeoutFlag, p2pRetryMaxFlag} {
-		if f := cmd.Flag(name); f != nil && f.Changed {
-			anyChanged = true
-			break
-		}
-	}
-	if !anyChanged {
+	changed := func(name string) bool {
+		f := cmd.Flag(name)
+		return f != nil && f.Changed
+	}
+	if !changed(connectionModeFlag) && !changed(relayTimeoutFlag) &&
+		!changed(p2pTimeoutFlag) && !changed(p2pRetryMaxFlag) {
 		return nil
 	}

 	cfgPath := profilemanager.DefaultConfigPath
 	if configPath != "" {
 		cfgPath = configPath
 	}
 	if cfgPath == "" {
 		return fmt.Errorf("default config path is not set on this platform; pass --config")
 	}

 	ic := profilemanager.ConfigInput{ConfigPath: cfgPath}
-	if cmd.Flag(connectionModeFlag).Changed {
+	if changed(connectionModeFlag) {
 		ic.ConnectionMode = &connectionMode
 	}
-	if cmd.Flag(relayTimeoutFlag).Changed {
+	if changed(relayTimeoutFlag) {
 		ic.RelayTimeoutSeconds = &relayTimeoutSecs
 	}
-	if cmd.Flag(p2pTimeoutFlag).Changed {
+	if changed(p2pTimeoutFlag) {
 		ic.P2pTimeoutSeconds = &p2pTimeoutSecs
 	}
-	if cmd.Flag(p2pRetryMaxFlag).Changed {
+	if changed(p2pRetryMaxFlag) {
 		ic.P2pRetryMaxSeconds = &p2pRetryMaxSecs
 	}

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/cmd/service_installer.go` around lines 173 - 211, In applyConnectionModeFlagsToProfile, avoid dereferencing cmd.Flag(...).Changed twice without nil-guards: during the initial scan memoize each flag pointer (e.g., map[string]*pflag.Flag or local variables for connectionModeFlag, relayTimeoutFlag, p2pTimeoutFlag, p2pRetryMaxFlag) and its .Changed state, then reuse those memoized pointers when building the profilemanager.ConfigInput (set ic.ConnectionMode, ic.RelayTimeoutSeconds, ic.P2pTimeoutSeconds, ic.P2pRetryMaxSeconds only if the corresponding memoized flag pointer is non-nil and .Changed). This preserves the original anyChanged behavior, prevents nil panics if a flag is not registered, and keeps references to the exact flags used when writing the config.
management/server/http/handlers/accounts/accounts_handler_test.go-408-426 (1)
408-426: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Indentation in expectedSettings is over-tabbed and will fail gofmt.
Lines 418-419 carry the indentation level of a deeply nested struct literal but appear inside a top-level expectedSettings declaration, so they're indented several tabs further than the surrounding fields. Run gofmt -w on the file:
♻️ Proposed fix
 expectedSettings := api.AccountSettings{
 	PeerLoginExpiration: 3600,
 	// ...
 	LazyConnectionEnabled: br(false),
-			LegacyLazyFallbackEnabled:        br(true),
-			LegacyLazyFallbackTimeoutSeconds: ir(3600),
+	LegacyLazyFallbackEnabled:        br(true),
+	LegacyLazyFallbackTimeoutSeconds: ir(3600),
 	DnsDomain: sr(""),
While at it, the LegacyLazyFallbackEnabled: br(true), lines added throughout the table at Lines 124-125, 152-153, 180-181, 208-209, 236-237, and 264-265 also have one fewer alignment space than the surrounding fields; gofmt will normalize them but it's worth a one-shot pass to keep the diff tidy.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler_test.go` around lines 408 - 426, The struct literal assigned to expectedSettings has incorrect/over-tabbed indentation (notably the LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds entries) which will fail gofmt; open the accounts_handler_test.go, fix the indentation of the expectedSettings block so all fields align consistently with the surrounding fields (including the other occurrences of LegacyLazyFallbackEnabled in the file), then run gofmt -w to normalize spacing across the file.
// IsConnected returns true when pion's ICE agent reports Connected and
// has not yet transitioned to Disconnected/Failed/Closed. Used by
// Conn.onNetworkChange (Phase 3.7g of #5989) to skip a needless
// workerICE.Close when an srReconnect/network-change event arrives but
// the existing P2P session is still alive end-to-end (typical for a
// brief signal-server outage while peer-to-peer UDP keeps flowing).
// Closing the agent in that case forces a 15-25 s renegotiation cycle
// and a Relay→ICE handover gap that the user would observe as a ping
// dropout, even though no real peer-to-peer connectivity loss occurred.
func (w *WorkerICE) IsConnected() bool {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()
	return w.agent != nil && w.lastKnownState == ice.ConnectionStateConnected
}
Synchronize lastKnownState consistently before using it as a liveness signal.
IsConnected() reads lastKnownState under muxAgent, but onConnectionStateChange() writes that field without the same lock. Now that Conn.onNetworkChange() uses IsConnected() to decide whether to tear down the agent, this is a real cross-goroutine data race and can produce stale keep/close decisions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/peer/worker_ice.go` around lines 218 - 230, The race occurs
because IsConnected() reads lastKnownState under w.muxAgent but
onConnectionStateChange() updates lastKnownState without that lock; fix by
synchronizing updates to lastKnownState with the same mutex (w.muxAgent) used in
IsConnected(), i.e., acquire w.muxAgent.Lock()/Unlock() (or use RLock/RUnlock
appropriately) in onConnectionStateChange() when writing lastKnownState and when
touching w.agent, so both WorkerICE.onConnectionStateChange and
WorkerICE.IsConnected use the same protection for lastKnownState and agent.
func (r *SnapshotRouter) Register(peerPubKey string) <-chan uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
Guard the public methods against a nil router.
The pipeline is already hitting this: Register panics when SnapshotRouter was not injected on a Sync path yet. Unregister and Request have the same receiver dereference and will fail the same way.
Suggested fix
 func (r *SnapshotRouter) Register(peerPubKey string) <-chan uint64 {
+	if r == nil {
+		return nil
+	}
 	r.mu.Lock()
 	defer r.mu.Unlock()
 	ch := make(chan uint64, 4)
@@
 func (r *SnapshotRouter) Unregister(peerPubKey string, token <-chan uint64) {
+	if r == nil {
+		return
+	}
 	r.mu.Lock()
 	defer r.mu.Unlock()
@@
 func (r *SnapshotRouter) Request(peerPubKey string, nonce uint64) bool {
+	if r == nil {
+		return false
+	}
 	r.mu.Lock()
 	defer r.mu.Unlock()
Also applies to: 41-44, 60-63
🧰 Tools
🪛 GitHub Actions: Darwin
[error] 23-23: panic: runtime error: invalid memory address or nil pointer dereference in SnapshotRouter.Register during Sync handling.
🪛 GitHub Actions: FreeBSD
[error] 23-23: Runtime panic: nil pointer dereference in SnapshotRouter.Register (receiver is nil)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer_connections/snapshot_router.go` around lines 22 - 24,
Guard Register, Unregister and Request on a nil SnapshotRouter by checking "if r
== nil" at the top of each method before touching r.mu or other fields; for
Register return a closed channel (so callers won't block) when r is nil, and for
Unregister and Request make them no-ops (just return) when r is nil. Update the
methods named Register, Unregister and Request on type SnapshotRouter to perform
this nil-receiver check before any dereference to avoid panics.
Force-pushed from 86fab00 to 82758b2
Actionable comments posted: 19
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
management/server/peer.go (1)
1297-1324: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Honor RegularUsersViewBlocked on the detail endpoint too.
GetPeers now hides everything for restrictable users when that setting is enabled, but GetPeer skips that gate and still returns the peer to the owner or any policy-reachable user who knows the ID. That reopens the visibility restriction through the detail route. Mirror the same settings check here before the direct/policy branches.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer.go` around lines 1297 - 1324, GetPeer currently skips the RegularUsersViewBlocked global setting and can leak peers via the detail route; update GetPeer to mirror GetPeers by checking the RegularUsersViewBlocked flag after resolving the user (or immediately after permissions validation) and, if the user is a restrictable/regular user and RegularUsersViewBlocked is enabled, return a permission-denied error instead of falling through to the admin/owner or policy reachability branches. Locate GetPeer and add the setting check (using the same predicate used by GetPeers), consult user via Store.GetUserByUserID and user.IsRestrictable()/IsRegular() as appropriate, and return a permission-denied status (use the project’s existing status error helper) when the check blocks access before calling peer-owner or checkIfUserOwnsPeer.
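For reference, a minimal Go sketch of the gate described above, assembled from the prompt's own references; the settings accessor, the IsRegular helper, and the status error shape are assumptions about this tree, not confirmed API:

// Hedged sketch: mirror the GetPeers visibility gate before GetPeer's
// owner/policy branches. Accessor names below are assumed, not verified.
func (am *DefaultAccountManager) blockRegularUserPeerView(ctx context.Context, accountID, userID string) error {
	settings, err := am.Store.GetAccountSettings(ctx, accountID) // assumed accessor
	if err != nil {
		return err
	}
	user, err := am.Store.GetUserByUserID(ctx, userID)
	if err != nil {
		return err
	}
	if settings.RegularUsersViewBlocked && user.IsRegular() {
		return status.Errorf(status.PermissionDenied, "peers are not visible to regular users")
	}
	return nil
}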
client/ui/network.go (1)
139-206: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Wrap Fyne widget updates from background goroutines in fyne.Do or fyne.DoAndWait.
In startAutoRefresh, the ticker fires on a background goroutine and calls updateNetworksSilent → updateNetworksWithMode, which performs multiple widget mutations (grid.Objects = nil, grid.Add, Refresh) without dispatching to the UI goroutine. Fyne v2.7.0 requires all widget and container mutations from background goroutines to be wrapped in fyne.Do (fire-and-forget) or fyne.DoAndWait (block until applied) to avoid race conditions.
Affected locations: updateNetworksWithMode (lines 139–206) and its call site within startAutoRefresh (lines 375–385).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/network.go` around lines 139 - 206, The updateNetworksWithMode function mutates Fyne widgets from a background goroutine (grid.Objects = nil, grid.Add, setting widget properties, Resize/Refresh) which must be run on the UI thread; wrap all UI/container mutations inside fyne.Do or fyne.DoAndWait to dispatch to the UI goroutine and avoid races. Specifically, update updateNetworksWithMode so that the body which clears grid, creates widgets (checkboxes, labels, Selects), sets their properties, calls grid.Add and Refresh are executed inside a single fyne.Do (or fyne.DoAndWait if the caller expects synchronous completion), and adjust the call site in startAutoRefresh (where updateNetworksSilent/updateNetworksWithMode is invoked from the ticker) to use fyne.Do or call the updated synchronous variant accordingly. Ensure you only move UI mutations inside the Do/DoAndWait block and keep non-UI work (filtering, sorting, building data) outside to avoid blocking the UI.
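For reference, a minimal sketch of that pattern, assuming a Fyne version where fyne.Do is available (v2.6+); fetchNetworkRows is a hypothetical stand-in for the non-UI data work:

package main

import (
	"time"

	"fyne.io/fyne/v2"
	"fyne.io/fyne/v2/app"
	"fyne.io/fyne/v2/container"
	"fyne.io/fyne/v2/widget"
)

// refreshLoop mirrors a startAutoRefresh-style ticker: it fires on a
// background goroutine, so every widget mutation goes through fyne.Do.
func refreshLoop(grid *fyne.Container, done <-chan struct{}) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			rows := fetchNetworkRows() // hypothetical helper: data work stays off the UI thread
			fyne.Do(func() {
				grid.Objects = nil // container mutations run on the UI goroutine
				for _, name := range rows {
					grid.Add(widget.NewLabel(name))
				}
				grid.Refresh()
			})
		}
	}
}

func fetchNetworkRows() []string { return []string{"net-a", "net-b"} }

func main() {
	a := app.New()
	w := a.NewWindow("networks")
	grid := container.NewVBox()
	w.SetContent(grid)
	done := make(chan struct{})
	go refreshLoop(grid, done)
	w.ShowAndRun()
	close(done)
}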
♻️ Duplicate comments (4)
client/internal/peer/worker_ice.go (1)
227-230: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Protect lastKnownState with muxAgent on writes too.
IsConnected() now reads lastKnownState under muxAgent, but onConnectionStateChange() still writes that field without the same lock. With Conn.onNetworkChange() using IsConnected() to decide whether to tear down the agent, this becomes a real cross-goroutine data race and can flip the keep/close decision.
Suggested fix
 func (w *WorkerICE) onConnectionStateChange(agent *icemaker.ThreadSafeAgent, dialerCancel context.CancelFunc) func(ice.ConnectionState) {
 	return func(state ice.ConnectionState) {
 		w.log.Debugf("ICE ConnectionState has changed to %s", state.String())
 		switch state {
 		case ice.ConnectionStateConnected:
-			w.lastKnownState = ice.ConnectionStateConnected
+			w.muxAgent.Lock()
+			w.lastKnownState = ice.ConnectionStateConnected
+			w.muxAgent.Unlock()
 			w.logSuccessfulPaths(agent)
 			// Phase 3 of `#5989`: reset backoff on ICE success.
 			w.conn.onICEConnected()
 			return
 		case ice.ConnectionStateFailed, ice.ConnectionStateDisconnected, ice.ConnectionStateClosed:
@@
 			sessionChanged := w.closeAgent(agent, dialerCancel)
-			if w.lastKnownState == ice.ConnectionStateConnected {
-				w.lastKnownState = ice.ConnectionStateDisconnected
+			w.muxAgent.Lock()
+			wasConnected := w.lastKnownState == ice.ConnectionStateConnected
+			w.lastKnownState = ice.ConnectionStateDisconnected
+			w.muxAgent.Unlock()
+			if wasConnected {
 				w.conn.onICEStateDisconnected(sessionChanged)
 			}
Also applies to: 545-571
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/worker_ice.go` around lines 227 - 230, IsConnected reads lastKnownState under muxAgent but onConnectionStateChange (and other writers around the 545-571 range) update lastKnownState without that lock, causing a race; modify onConnectionStateChange and any other locations that assign to WorkerICE.lastKnownState to acquire w.muxAgent.Lock()/Unlock() (or use RLock/RUnlock where appropriate) around writes so all reads/writes of lastKnownState are protected by muxAgent, keeping the lock usage consistent with WorkerICE.IsConnected and Conn.onNetworkChange checks.
client/android/preferences.go (1)
325-330: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Clear overrides by writing nil, not empty/zero pointers.
SetConnectionMode("") and the Set*TimeoutSeconds(0) setters still leave the corresponding ConfigInput fields non-nil. That means Commit() persists an explicit override instead of actually removing it, so callers cannot revert to “follow server” through this API.
Suggested fix
 func (p *Preferences) SetConnectionMode(mode string) {
+	if mode == "" {
+		p.configInput.ConnectionMode = nil
+		return
+	}
 	m := mode
 	p.configInput.ConnectionMode = &m
 }
@@
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
+	if v == 0 {
+		p.configInput.RelayTimeoutSeconds = nil
+		return
+	}
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
+	if v == 0 {
+		p.configInput.P2pTimeoutSeconds = nil
+		return
+	}
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
+	if v == 0 {
+		p.configInput.P2pRetryMaxSeconds = nil
+		return
+	}
 	p.configInput.P2pRetryMaxSeconds = &v
 }
Also applies to: 347-355, 371-376, 391-396
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/android/preferences.go` around lines 325 - 330, SetConnectionMode and the Set*TimeoutSeconds setters currently store empty/zero values as non-nil pointers, causing Commit() to persist explicit overrides instead of removing them; change each setter (e.g., SetConnectionMode and the timeout setters) to write nil to the corresponding p.configInput field when the input is the "clear" value (mode == "" or seconds == 0) and otherwise allocate and assign a pointer to the provided value (keep using p.configInput.ConnectionMode and the timeout fields to locate the spots to change).
client/internal/peer/status.go (1)
682-692: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Move notifyPeerStateChangeListeners back under the mutex.
These paths still invoke notifyPeerStateChangeListeners after d.mux.Unlock(), but that helper snapshots d.peers/d.changeNotify and is documented as lock-only. This can race with concurrent writers and panic on map access.
Also applies to: 758-760
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 682 - 692, The call to notifyPeerStateChangeListeners is executed after releasing the mutex (after d.mux.Unlock()), but that function snapshots d.peers/d.changeNotify and must be called while holding the lock; move the notifyPeerStateChangeListeners(receivedState.PubKey) call so it runs before d.mux.Unlock() in this block (and likewise for the other occurrence at the 758-760 region), keeping d.notifier.peerListChanged(numPeers) and d.dispatchRouterPeers(...) semantics unchanged and only reordering to hold d.mux during notifyPeerStateChangeListeners to avoid concurrent map access.
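A small sketch of the required ordering, using the names from the prompt; the surrounding types and helper signatures are assumed:

// Hedged sketch: the snapshot-taking notifier runs under d.mux, the
// re-entrant fan-outs keep their after-unlock semantics.
func (d *Status) applyPeerState(receivedState State, routerPeers []RouterPeer) {
	d.mux.Lock()
	d.peers[receivedState.PubKey] = receivedState
	numPeers := len(d.peers)
	// notifyPeerStateChangeListeners snapshots d.peers/d.changeNotify,
	// so it must be called while the mutex is still held.
	d.notifyPeerStateChangeListeners(receivedState.PubKey)
	d.mux.Unlock()
	d.notifier.peerListChanged(numPeers)
	d.dispatchRouterPeers(routerPeers)
}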
management/server/http/handlers/accounts/accounts_handler.go (1)
511-517: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
math.MaxUint32 still overflows the format arg on 32-bit builds.
This is the same build break that was already flagged earlier: the comparison uses int64(math.MaxUint32), but the bare math.MaxUint32 passed to fmt.Errorf still defaults to int and overflows on 386/arm32.
💡 Minimal fix
 if v > int64(math.MaxUint32) {
-	return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, math.MaxUint32)
+	return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, int64(math.MaxUint32))
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler.go` around lines 511 - 517, The error message in validateUint32Timeout still passes math.MaxUint32 as a plain value which can overflow on 32-bit builds; update the second fmt.Errorf call in validateUint32Timeout to cast the constant to a 64-bit integer (e.g. int64(math.MaxUint32)) so the formatted argument matches the %d verb and avoids overflow, ensuring the comparison int64(math.MaxUint32) already used and the error message use the same casted value.
🟡 Minor comments (6)
client/internal/peer/env_test.go-33-34 (1)
33-34: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Use the existing constants from env.go instead of hardcoded strings.
Lines 33–34 use bare string literals "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and "NB_LAZY_CONN_INACTIVITY_THRESHOLD". The peer package already defines these as unexported constants (envEnableLazyConn and envInactivityThreshold at lines 21–22 of env.go). Using the constants ensures that if the production definition is renamed, the test will also be updated, preventing silent divergence.
♻️ Proposed fix
-	t.Setenv("NB_ENABLE_EXPERIMENTAL_LAZY_CONN", c.envEnableLazy)
-	t.Setenv("NB_LAZY_CONN_INACTIVITY_THRESHOLD", c.envInactivity)
+	t.Setenv(envEnableLazyConn, c.envEnableLazy)
+	t.Setenv(envInactivityThreshold, c.envInactivity)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/env_test.go` around lines 33 - 34, Replace the hardcoded environment variable strings in the test with the package constants: use envEnableLazyConn instead of "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and envInactivityThreshold instead of "NB_LAZY_CONN_INACTIVITY_THRESHOLD" in the t.Setenv calls in client/internal/peer/env_test.go so the test references the same unexported constants defined in env.go (envEnableLazyConn, envInactivityThreshold).
client/internal/stdnet/filter.go-51-54 (1)
51-54: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Unconditional lo prefix can over-filter on Windows; gate it on GOOS.
The comment on line 51 says "Linux/macOS loopback prefix", but the check itself runs on every platform. With the new lowercase normalization on line 49, this also fires for any Windows interface name starting with Lo* (e.g. legacy "Local Area Connection"), causing false-positive filtering. It also makes the loopback pseudo-interface entry in windowsKnownBadSubstrings redundant.
Suggest gating it to non-Windows so Windows loopback is exclusively handled by windowsKnownBadSubstrings (which is the documented strategy on lines 11–22):
🛡️ Proposed fix
-	// Linux/macOS loopback prefix ("lo", "lo0").
-	if strings.HasPrefix(lowerIFace, "lo") {
-		return false
-	}
+	// Linux/macOS loopback prefix ("lo", "lo0"). Windows loopback is
+	// handled by windowsKnownBadSubstrings below to avoid false
+	// positives for interfaces like "Local Area Connection".
+	if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") {
+		return false
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter.go` around lines 51 - 54, The unconditional check using strings.HasPrefix(lowerIFace, "lo") is running on all OSes and mistakenly filters Windows interfaces; change it so the "lo" prefix check only runs on non-Windows (e.g., wrap it behind a runtime.GOOS != "windows" guard or equivalent) and remove the redundant Windows loopback entry reliance only on windowsKnownBadSubstrings; update the conditional around the lowerIFace "lo" prefix check (the code using lowerIFace and strings.HasPrefix) so Windows names are not filtered here and remain handled by windowsKnownBadSubstrings.
client/iface/device/endpoint_manager.go-15-21 (1)
15-21: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
"Always non-nil on userspace binds" overpromises for an interface method.
The comment implies that callers holding a valid EndpointManager can skip nil-checking ActivityRecorder()'s return value, yet:
- engine.go:651 defensively nil-checks the return (if rec := bind.ActivityRecorder(); rec != nil).
- The test mock returns nil from this method.
- Any future or custom EndpointManager implementation can legally return nil without violating the interface contract.
The documentation should say the return value may be nil, aligning with how every existing caller already handles it.
📝 Suggested comment correction
-	// ActivityRecorder exposes the per-bind ActivityRecorder so the
-	// engine can wire its OnActivity callback (Codex review 2026-05-05,
-	// fast-path Relay -> P2P upgrade trigger). Always non-nil on
-	// userspace binds. Kernel-mode WG returns nil from GetICEBind so
-	// callers MUST nil-check the EndpointManager itself before
-	// dereferencing.
+	// ActivityRecorder exposes the per-bind ActivityRecorder so the
+	// engine can wire its OnActivity callback (Codex review 2026-05-05,
+	// fast-path Relay -> P2P upgrade trigger). May return nil (e.g.,
+	// kernel-mode WG where GetICEBind returns nil, or test stubs).
+	// Callers MUST nil-check both the EndpointManager and the returned
+	// *ActivityRecorder before use.
 	ActivityRecorder() *bind.ActivityRecorder
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/iface/device/endpoint_manager.go` around lines 15 - 21, Update the doc comment for EndpointManager.ActivityRecorder to stop promising a non-nil result; change the wording to explicitly state the returned *bind.ActivityRecorder may be nil and callers must nil-check it before use (e.g., engine's current pattern if rec := bind.ActivityRecorder(); rec != nil). Reference the ActivityRecorder() method on the EndpointManager interface and mention that mocks and other implementations are allowed to return nil so callers should not assume a non-nil value.
client/internal/debouncer/debouncer.go-45-53 (1)
45-53: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Stop does not fully cancel a pending fn — clear d.fn to honor the docstring.
If time.AfterFunc has already fired and the callback is blocked acquiring d.mu when Stop is called, d.timer.Stop() returns false (timer already triggered), Stop returns, and then the AfterFunc callback proceeds to read d.fn and invoke it. The docstring states "Stop cancels any pending fn." but this race lets a stale fn run after Stop returned. Clearing d.fn under the lock closes most of the window.
🛡️ Proposed fix
 func (d *Debouncer) Stop() {
 	d.mu.Lock()
 	defer d.mu.Unlock()
 	if d.timer != nil {
 		d.timer.Stop()
 		d.timer = nil
 	}
+	d.fn = nil
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/debouncer/debouncer.go` around lines 45 - 53, The Stop method on Debouncer does not fully cancel a pending fn because if the timer callback is already fired and waiting on d.mu, Stop only stops the timer and doesn't clear d.fn; update Stop (method Debouncer.Stop) to acquire d.mu, set d.timer to nil and also set d.fn = nil while holding the lock to prevent a later callback from running the stale function, and ensure the timer's AfterFunc callback (the anonymous function that grabs d.mu and reads d.fn) checks d.fn under the same lock and returns without invoking if it's nil.
client/ui/peers_tab.go-61-71 (1)
61-71: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Clear stale peer data when refresh fails.
Both error paths only update summary and return. If the daemon becomes unavailable after a successful refresh, the tab keeps showing the previous breakdown and peer rows under an error banner, which makes stale state look current.
🧹 Suggested fix
 conn, err := s.getSrvClient(failFastTimeout)
 if err != nil {
-	fyne.Do(func() { summary.SetText("Error: " + err.Error()) })
+	fyne.Do(func() {
+		summary.SetText("Error: " + err.Error())
+		breakdown.SetText("")
+		listVBox.Objects = nil
+		listVBox.Refresh()
+	})
 	return
 }
 callCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 defer cancel()
 st, err := conn.Status(callCtx, &proto.StatusRequest{GetFullPeerStatus: true})
 if err != nil {
-	fyne.Do(func() { summary.SetText("Error: " + err.Error()) })
+	fyne.Do(func() {
+		summary.SetText("Error: " + err.Error())
+		breakdown.SetText("")
+		listVBox.Objects = nil
+		listVBox.Refresh()
+	})
 	return
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/peers_tab.go` around lines 61 - 71, When either getSrvClient(...) fails or conn.Status(...) returns an error, you must clear stale peer UI state instead of only updating summary: in both error paths wrap UI updates in fyne.Do to set summary to the error text, clear breakdown (e.g. breakdown.SetText("")) and remove/clear the peers rows/container (the peers table or peersContainer holding rows) so previous peer entries aren't shown under the error banner; apply this to the error branch after getSrvClient and the one after conn.Status to ensure stale data is always wiped when refresh fails.
client/server/server.go-1523-1572 (1)
1523-1572: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Don't mix requested-profile config with active-profile pushed state.
GetConfig() loads cfg from req.ProfileName/req.Username, but these sp* values always come from the currently running connectClient. When the caller asks for a non-active profile, the response combines one profile's persisted config with another profile's server-pushed hints. Gate these fields on the requested profile being the active/running one; otherwise return zero-values.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/server/server.go` around lines 1523 - 1572, GetConfig currently mixes the requested profile's persisted cfg with server-pushed runtime fields (spMode, spRelayTOSecs, spP2pTOSecs, spP2pRetMax) unconditionally; change this so those sp* fields are only populated from s.connectClient (Engine()->ConnMgr()) when the requested profile equals the active/running profile (compare req.ProfileName/req.Username against the active session identifiers used by the running connectClient), otherwise leave sp* as zero-values. Update the logic that sets spMode/spRelayTOSecs/spP2pTOSecs/spP2pRetMax (the block that queries s.connectClient.Engine().ConnMgr()) to be guarded by an isActiveProfile check and ensure the returned proto.GetConfigResponse uses those guarded values.
🧹 Nitpick comments (9)
management/server/peer/peer.go (1)
199-203: ⚡ Quick win
SupportedFeatures equality is order-sensitive — consider treating it as a set.
slices.Equal(p.SupportedFeatures, other.SupportedFeatures) returns false whenever the client emits the same features in a different order. If client/system/features.go ever changes how it builds the slice (e.g. iterates a map, adds a feature in the middle, or callers append in a different sequence), isEqual will start reporting "changed" on every sync, generating unnecessary writes/events. Either sort both sides before comparing or compare as a set.
♻️ Proposed fix
-	slices.Equal(p.SupportedFeatures, other.SupportedFeatures) &&
+	equalFeatureSets(p.SupportedFeatures, other.SupportedFeatures) &&

func equalFeatureSets(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	aa := slices.Clone(a)
	bb := slices.Clone(b)
	sort.Strings(aa)
	sort.Strings(bb)
	return slices.Equal(aa, bb)
}
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer/peer.go` around lines 199 - 203, The current comparison in isEqual uses slices.Equal(p.SupportedFeatures, other.SupportedFeatures), which is order-sensitive; change it to compare feature sets instead by implementing a helper (e.g., equalFeatureSets) that returns false if lengths differ, clones both slices, sorts them (or builds maps/sets) and then compares, and replace the slices.Equal call with equalFeatureSets(p.SupportedFeatures, other.SupportedFeatures) so identical features in different orders are treated as equal.
shared/management/http/api/openapi.yml (1)
383-431: ⚡ Quick win
Documented defaults should be encoded as OpenAPI default values.
Lines 391-392, 416-417, and 428-430 describe defaults in text, but the schema omits default (e.g., p2p_retry_max_seconds=900, legacy_lazy_fallback_enabled=true, legacy_lazy_fallback_timeout_seconds=3600). Add explicit defaults to keep generated clients/docs consistent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/openapi.yml` around lines 383 - 431, The OpenAPI schema omits explicit default values for documented defaults; add default: 900 (integer) to p2p_retry_max_seconds, default: true (boolean) to legacy_lazy_fallback_enabled, and default: 3600 (integer) to legacy_lazy_fallback_timeout_seconds in the YAML so generated clients/docs reflect the described defaults (keep the keys as p2p_retry_max_seconds, legacy_lazy_fallback_enabled, legacy_lazy_fallback_timeout_seconds and ensure types/formats remain unchanged).
client/internal/peer/guard/guard_test.go (1)
65-96: 💤 Low value
Test name overstates what is exercised.
TestGuard_PeerActivityResetsHourlyMode does not actually run reconnectLoopWithRetry; it manually reads from g.peerActivity and calls iceState.reset() itself, then asserts on iceState. The comment is honest about this, but the test would happily pass even if a future refactor accidentally dropped the case <-g.peerActivity: arm in reconnectLoopWithRetry (since the test simulates the case body itself). Consider running reconnectLoopWithRetry in a goroutine with a context cancel and observing iceState through the real loop — that would turn this into a true regression pin.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/guard/guard_test.go` around lines 65 - 96, The test TestGuard_PeerActivityResetsHourlyMode currently simulates the reconnectLoopWithRetry behaviour by reading from g.peerActivity and calling iceState.reset() itself; instead, start the actual reconnectLoopWithRetry (or the goroutine that runs it) with a cancellable context, call g.NotifyPeerActivity(), and then wait for the real loop to consume from g.peerActivity and perform the reset so you assert the real effects on the iceRetryState (check iceState.hourly and iceState.retries) — i.e., replace the manual select+iceState.reset() with running reconnectLoopWithRetry in a goroutine, use context cancellation to stop the loop, and assert that the loop cleared hourly mode and zeroed retries after NotifyPeerActivity().
client/internal/peer/conn_handover_order_test.go (2)
138-175: ⚖️ Poor tradeoff
extractFunctionBody brace counting is string/comment-blind.
Brace counting walks every {/} character even inside string literals and comments, so adding a Go format directive containing { or } to one of the tested functions in conn.go would silently break extraction (mismatched depth, wrong slice, or fatal "unbalanced braces"). Crude is fine for now, but if these tests become flaky after a conn.go log-string edit, replacing the helper with go/parser + ast.Inspect to grab the function's exact source range would be more robust. Optional since the comment already calls out the limitation.
🤖 Prompt for AI Agents
25-53: 💤 Low valueTest name vs target function disagree.
The test is named
TestConn_HandoverOrder_OnICEConnectedand the comment block referencesonICEConnected, butextractFunctionBodyactually inspectsonICEConnectionIsReady. Aligning the name to the inspected function would make CI failures easier to triage.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn_handover_order_test.go` around lines 25 - 53, The test name and its comment refer to onICEConnected but the call to extractFunctionBody(t, string(src), "onICEConnectionIsReady") targets a different function; update either the test name/comment to match onICEConnectionIsReady (e.g. rename TestConn_HandoverOrder_OnICEConnected to TestConn_HandoverOrder_OnICEConnectionIsReady and update the comment) or change the extractFunctionBody call to "onICEConnected" so the inspected function and test identity match; adjust any related references in this test (landmarks or error messages) to use the chosen function name.client/internal/peer/status.go (1)
1617-1644: ⚡ Quick winAvoid holding
d.muxacrosswgIface.FullStats().
FullStats()is an external/kernel-facing call. Keeping the status mutex locked for its full duration stalls unrelated peer and UI state updates behind it. Grabd.wgIfaceunder the lock, callFullStats()outside, then re-lock only to apply the snapshot.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 1617 - 1644, RefreshWireGuardStats currently holds d.mux while calling d.wgIface.FullStats(), which can block other operations; fix it by acquiring the lock only to read the wgIface pointer (and return nil if nil), then release the lock, call wgIface.FullStats() outside the lock, and finally re-acquire d.mux to apply the returned stats snapshot into d.peers (updating LastWireguardHandshake, BytesRx, BytesTx for each peer). Ensure you preserve the existing error wrapping from FullStats() and the behavior of skipping unknown public keys when updating the map.management/server/peer_test.go (1)
588-589: ⚡ Quick winMake the denial check specific.
Line 589 only asserts that some error happened after policies are removed. That lets unrelated failures in
GetPeersatisfy the test and weakens the visibility regression coverage. Assert the expected status/type here so the test only passes when access is denied for the right reason.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer_test.go` around lines 588 - 589, Replace the generic assert.Error with a specific assertion that the failure is an authorization/denial error from GetPeer: call manager.GetPeer(context.Background(), accountID, peer2.ID, someUser) and then assert that the returned error matches the expected permission-denied sentinel or status (e.g. errors.Is(err, ErrPermissionDenied) or check a typed error's StatusCode == 403 or assert.EqualError(t, err, ErrPermissionDenied.Error())). Use the actual project symbol for the sentinel (e.g., ErrPermissionDenied, ErrUnauthorized, or the API error type) instead of a generic error check so the test only passes when access is denied for the right reason.client/ui/peers_tab.go (1)
280-355: ⚡ Quick winSplit
buildPeerDetailTextbefore it grows further.This formatter is already doing multiple jobs: base fields, endpoint selection, full-details rendering, and ICE backoff formatting. Extracting those sections into small helpers should clear the current complexity warning and make future field additions less error-prone.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/peers_tab.go` around lines 280 - 355, The buildPeerDetailText function is doing too many jobs; split it into small helpers: create renderBasePeerFields(sb *strings.Builder, p *proto.PeerState) to write IP/FQDN/connection type/effective/configured modes/latency/last-seen/groups, renderConnectionEndpoints(sb *strings.Builder, p *proto.PeerState) to handle relayed vs local/remote endpoint printing and Relay server/local/remote endpoint logic, and renderFullPeerDetails(sb *strings.Builder, p *proto.PeerState) which calls a helper renderIceBackoff(sb *strings.Builder, p *proto.PeerState) to encapsulate the ICE backoff nextRetry/suspended/time-until logic; then have buildPeerDetailText simply create the strings.Builder, call renderBasePeerFields, renderConnectionEndpoints, and if full call renderFullPeerDetails, and return sb.String().management/internals/server/boot.go (1)
111-125: Plan for multi-replica management before relying on this state path.
PeerConnStoreandSnapshotRouterare process-local. In an HA deployment,SyncPeerConnectionstraffic landing on one replica and dashboard/API reads landing on another will see different peer-connection state. Please make sure rollout includes sticky routing or a shared backing/pubsub layer, otherwise visibility will flap across nodes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/server/boot.go` around lines 111 - 125, PeerConnStore and PeerConnRouter currently create process-local state (via Create with peer_connections.NewMemoryStore and peer_connections.NewSnapshotRouter), which will cause inconsistent visibility across replicas; update the initialization to use a shared backing or pub/sub layer (or make the Create wrapper accept an alternate implementation) so PeerConnStore() and PeerConnRouter() return a cluster-scoped store/router instead of an in-memory local one; specifically adjust the code paths that call PeerConnStore, PeerConnRouter, and the SyncPeerConnections flow to use the shared implementation (or guard rollout with sticky routing) so reads and SyncPeerConnections traffic observe the same state across replicas.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@client/iface/bind/activity.go`:
- Around line 114-121: The callback passed to record() (which triggers
AttachICEOnRelayActivity()) can acquire conn.mu and perform non-trivial checks
(including conn.handshaker.readICEListener()), violating the "MUST be cheap"
contract in activity.go:115; change the invocation to avoid holding the WG I/O
goroutine: instead of calling cb(...) synchronously, dispatch the work
off-thread (e.g., spawn a dedicated goroutine) or perform a non-blocking handoff
(send on a buffered channel with select { case ch<-payload: default: }) so that
record() returns immediately; ensure the dispatched worker invokes
AttachICEOnRelayActivity() and retains existing rate-gating semantics.
In `@client/internal/conn_mgr.go`:
- Around line 322-326: The code updates e.mode and timeout fields then returns
early when e.rosenpassEnabled blocks lazy/dynamic mode, leaving the instance in
an unsupported "lazy" state; change the flow so you either validate
rosenpassEnabled before mutating live state or, if already mutated, explicitly
set e.mode to a supported eager mode (e.g. "p2p-eager") and restore/clear any
lazy-specific timeout fields before returning; update the branch that checks
e.rosenpassEnabled (the spot referencing e.lazyConnMgr and e.rosenpassEnabled)
to perform this reset-or-reject behavior so the in-memory state never reflects a
disabled lazy/dynamic mode.
- Around line 292-329: The code only restarts the lazy manager when modeChanged,
so updates to relayTimeoutSecs or p2pTimeoutSecs on ConnMgr never get propagated
into the running manager.Config; modify the update path in the function that
sets newMode/newRelay/newP2P/newP2pRetry (the block around
propagateP2pRetryMaxToConns, modeUsesLazyMgr, closeManager, initLazyManager,
startModeSideEffects and lazyConnMgr) to detect when the inactivity timeouts
(relayTimeoutSecs or p2pTimeoutSecs) change while isManaged is true and
lazyConnMgr != nil and then restart the manager (call closeManager(ctx),
UpdateLazyConnection(false), then initLazyManager(ctx) and
startModeSideEffects()) so the new timeout values are snapshotted into the
manager.Config even when mode did not change.
In `@client/internal/conn_state_pusher.go`:
- Around line 153-164: The OnSnapshotRequest method currently enqueues a nonce
but can leave an old nonce in the channel causing the pusher to process a stale
value; change connStatePusher.OnSnapshotRequest to coalesce by draining
snapshotReq before sending: perform a non-blocking loop that reads and discards
enqueued nonces keeping only the most recent value, then non-blocking-send that
newest nonce into p.snapshotReq (so callers' latest InResponseToNonce is
preserved). Apply the same drain-to-newest fix to the other similar handler
using the same pattern (the peer/state-change request channel referenced around
lines 245-248).
- Around line 115-122: newSessionID currently ignores errors from rand.Read and
can loop forever; change newSessionID to return (uint64, error), check the error
from rand.Read and return it instead of retrying, and keep the non-zero check
for the generated id; update any callers that call newSessionID (e.g., the
pusher/constructor code in conn_state_pusher) to handle the error by propagating
it or providing a safe fallback (e.g., abort startup with a clear error) so
startup cannot hang indefinitely.
In `@client/internal/lazyconn/manager/manager.go`:
- Around line 102-109: The code currently treats iceTO==0 && relayTO==0 as a
signal to use the legacy inactivity.NewManager which restores a 24h relay
timeout; instead, only use the legacy constructor when the deprecated
InactivityThreshold field was actually used — i.e., check
config.InactivityThreshold (the old field) to decide to call
inactivity.NewManager(wgIface, config.InactivityThreshold), otherwise always
call inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) so explicit
zero timeouts from config.resolvedTimeouts() correctly disable teardown as
documented and set m.inactivityManager accordingly.
In `@client/internal/peer/status.go`:
- Around line 468-471: The closure returned after detecting a connection-state
change currently calls notifyConnStateChange which reads d.connStateListener
without holding d.mux, causing a race with SetConnStateListener; fix by
capturing the listener reference under the lock (e.g., read d.connStateListener
into a local variable while holding d.mux) and build the closure to call
notifyConnStateChange (or call the listener directly) using that captured
reference; apply the same pattern for the other similar sites (around the logic
referenced at lines ~694-697, ~762-765, ~810-813, ~861-864) so the listener is
read under d.mux before unlocking.
- Around line 519-544: The current branch in the peer-update code holds d.mux
while calling
d.notifyPeerListChanged()/d.notifyPeerStateChangeListeners(pubKey), which can
deadlock if listeners re-enter Status; change it to mirror other update paths
by: under the lock update st and d.peers[pubKey] and compute a local notify flag
(already done), then release the lock (remove defer or unlock before
notifications) and call d.notifyPeerListChanged() and
d.notifyPeerStateChangeListeners(pubKey) only after the mutex is unlocked;
reference: d.mux, d.peers, pubKey, st, notifyPeerListChanged,
notifyPeerStateChangeListeners.
In `@client/proto/daemon.proto`:
- Around line 744-754: SetConfig currently only copies ProfileName and Username
into ActiveProfileState, so the new connection override fields (connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, p2p_retry_max_seconds) are dropped
and not persisted; update the SetConfig handler (function SetConfig in
client/server/server.go) to copy these fields from the incoming SetConfig
request into ActiveProfileState (add corresponding fields/properties on
ActiveProfileState if missing and map from the proto names connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, p2p_retry_max_seconds), ensure any
serialization/storage code that persists ActiveProfileState is updated to
include them, and update tests to assert that saving a config with those fields
actually persists and is returned on subsequent reads.
- Around line 327-347: GetConfig currently returns the new proto fields
(connection_mode, p2p_timeout_seconds, relay_timeout_seconds,
p2p_retry_max_seconds, server_pushed_connection_mode,
server_pushed_relay_timeout_seconds, server_pushed_p2p_timeout_seconds,
server_pushed_p2p_retry_max_seconds) as empty/0 because the implementation in
GetConfig (server.go) never assigns them; fix by reading the local profile's
connection override values and the management server's last-received PeerConfig
and explicitly set the corresponding fields on the outgoing proto message (use
the same names in the daemon proto struct returned by GetConfig), leaving fields
empty/0 when no local override or PeerConfig exists so callers observe real
values rather than placeholders.
- Around line 208-218: Login currently only copies ProfileName and Username into
ActiveProfileState and ignores the new proto fields (connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, p2p_retry_max_seconds), so
connection overrides sent at login are dropped; update the Login handler to read
those four fields from the incoming request and populate the corresponding
fields on ActiveProfileState (or the existing connection-overrides
sub-structure) when creating or updating the profile on first-run, and ensure
the code path that persists/forwards ActiveProfileState includes these fields
(refer to the Login function, ActiveProfileState, ProfileName, and Username
symbols to locate where to add the assignments).
In `@client/ui/client_ui.go`:
- Around line 893-906: The gating currently switches on
s.sConnectionMode.Selected which treats the "Follow server" case as default and
disables all timeout inputs even when the effective mode is a server-pushed one;
change the logic to compute the effective mode (call selectedConnectionMode() or
use serverPushedMode when selectedConnectionMode() returns empty) and switch on
that value instead, then enable/disable s.iRelayTimeout, s.iP2pTimeout, and
s.iP2pRetryMax based on the effective mode ("p2p-lazy", "p2p-dynamic", default)
so following the server still allows the correct timeout fields to be editable
when the server-pushed mode requires it.
- Around line 656-665: parseUint32Field currently swallows parse errors and
returns 0 which silently treats invalid input as “no override”; change
parseUint32Field to return (uint32, error) (or add a companion
validateUint32Field) so callers can detect bad input, surface a validation error
to the UI, and avoid treating typos as the sentinel 0; update all callers (e.g.,
any code invoking parseUint32Field and logic in hasConnectionModeChanges()) to
handle the error path, show a validation message rather than writing 0, and only
use uint32 value when parsing succeeds.
In `@management/internals/shared/grpc/conversion.go`:
- Around line 315-327: The code sets cfg.ServerLivenessKnown = true
unconditionally even when rPeer.Status is nil, which incorrectly advertises
authoritative liveness; change the logic so cfg.ServerLivenessKnown is only set
to true inside the rPeer.Status != nil branch (where LiveOnline and
LastSeenAtServer are populated) and leave it false when rPeer.Status is absent
so clients fall back to the LastSeenAtServer heuristic; update the block around
rPeer.Status, cfg.LiveOnline, cfg.LastSeenAtServer and cfg.ServerLivenessKnown
accordingly.
In `@management/server/http/handlers/accounts/accounts_handler.go`:
- Around line 231-280: Replace plain fmt.Errorf errors with gRPC InvalidArgument
status errors so bad input yields a 4xx validation response: change the
connection_mode error in the accounts handler to return
status.Errorf(codes.InvalidArgument, "invalid connection_mode %q", modeStr) and
change the legacy_lazy_fallback_timeout_seconds branch to return
status.Errorf(codes.InvalidArgument, "invalid
legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v).
Also ensure validateUint32Timeout returns or surfaces InvalidArgument errors
(update validateUint32Timeout to return status.Errorf(codes.InvalidArgument,
...) for invalid ranges) and add imports for "google.golang.org/grpc/codes" and
"google.golang.org/grpc/status". This targets the connection_mode branch, the
legacy_lazy_fallback_timeout_seconds block, and the validateUint32Timeout
function so updateAccount -> util.WriteError receives typed InvalidArgument
errors.
In `@management/server/peer_connections/store.go`:
- Around line 141-155: GetWithNonceCheck on MemoryStore checks the nonce gate
before TTL, allowing expired entries to be returned as nonce-old and never
deleted; change GetWithNonceCheck to first evaluate TTL (compare
s.clock.Now().Sub(e.updatedAt) > s.ttl and delete from s.maps if expired) before
performing the since / InResponseToNonce check so stale entries are evicted even
when the nonce condition would cause an early return; adjust references inside
the GetWithNonceCheck method that touch e, e.m, e.updatedAt, s.maps, s.ttl and
s.clock accordingly.
In `@management/server/peer.go`:
- Line 1355: Replace the internal-server error return with an authorization
error: change the status code in the return that currently uses status.Internal
to an auth-level code (e.g., status.PermissionDenied) so callers get a
permission response instead of a 500; keep the existing message formatting that
includes userID, peer.ID and accountID in the return statement.
In `@shared/management/client/grpc.go`:
- Around line 483-492: The snapshot callback is invoked before applying the
SyncResponse which can cause snapshots to be built from pre-update state; move
the c.onSnapshotRequest invocation (the cb from c.onSnapshotRequest and
decryptedResp.GetSnapshotRequest()) to after the call to
msgHandler(decryptedResp) so the SyncResponse is applied first, i.e., call
msgHandler(decryptedResp) and only if it returns nil then acquire c.snapMu, read
c.onSnapshotRequest into cb, release the lock and invoke cb(req.GetNonce()) when
decryptedResp.GetSnapshotRequest() is non-nil.
In `@shared/management/http/api/types.gen.go`:
- Around line 41-63: The generated enum AccountSettingsConnectionMode and its
Valid() method are missing the p2p-dynamic-lazy value; update the OpenAPI spec
so the connection-mode enum includes "p2p-dynamic-lazy", regenerate the code,
and ensure the generated constants (e.g.,
AccountSettingsConnectionModeP2pDynamicLazy) and Valid() cover that value (so
Valid() returns true for AccountSettingsConnectionModeP2pDynamicLazy) instead of
manually editing types.gen.go.
---
Outside diff comments:
In `@client/ui/network.go`:
- Around line 139-206: The updateNetworksWithMode function mutates Fyne widgets
from a background goroutine (grid.Objects = nil, grid.Add, setting widget
properties, Resize/Refresh) which must be run on the UI thread; wrap all
UI/container mutations inside fyne.Do or fyne.DoAndWait to dispatch to the UI
goroutine and avoid races. Specifically, update updateNetworksWithMode so that
the body which clears grid, creates widgets (checkboxes, labels, Selects), sets
their properties, calls grid.Add and Refresh are executed inside a single
fyne.Do (or fyne.DoAndWait if the caller expects synchronous completion), and
adjust the call site in startAutoRefresh (where
updateNetworksSilent/updateNetworksWithMode is invoked from the ticker) to use
fyne.Do or call the updated synchronous variant accordingly. Ensure you only
move UI mutations inside the Do/DoAndWait block and keep non-UI work (filtering,
sorting, building data) outside to avoid blocking the UI.
In `@management/server/peer.go`:
- Around line 1297-1324: GetPeer currently skips the RegularUsersViewBlocked
global setting and can leak peers via the detail route; update GetPeer to mirror
GetPeers by checking the RegularUsersViewBlocked flag after resolving the user
(or immediately after permissions validation) and, if the user is a
restrictable/regular user and RegularUsersViewBlocked is enabled, return a
permission-denied error instead of falling through to the admin/owner or policy
reachability branches. Locate GetPeer and add the setting check (using the same
predicate used by GetPeers), consult user via Store.GetUserByUserID and
user.IsRestrictable()/IsRegular() as appropriate, and return a permission-denied
status (use the project’s existing status error helper) when the check blocks
access before calling peer-owner or checkIfUserOwnsPeer.
---
Minor comments:
In `@client/iface/device/endpoint_manager.go`:
- Around line 15-21: Update the doc comment for EndpointManager.ActivityRecorder
to stop promising a non-nil result; change the wording to explicitly state the
returned *bind.ActivityRecorder may be nil and callers must nil-check it before
use (e.g., engine's current pattern if rec := bind.ActivityRecorder(); rec !=
nil). Reference the ActivityRecorder() method on the EndpointManager interface
and mention that mocks and other implementations are allowed to return nil so
callers should not assume a non-nil value.
In `@client/internal/debouncer/debouncer.go`:
- Around line 45-53: The Stop method on Debouncer does not fully cancel a
pending fn because if the timer callback is already fired and waiting on d.mu,
Stop only stops the timer and doesn't clear d.fn; update Stop (method
Debouncer.Stop) to acquire d.mu, set d.timer to nil and also set d.fn = nil
while holding the lock to prevent a later callback from running the stale
function, and ensure the timer's AfterFunc callback (the anonymous function that
grabs d.mu and reads d.fn) checks d.fn under the same lock and returns without
invoking if it's nil.
In `@client/internal/peer/env_test.go`:
- Around line 33-34: Replace the hardcoded environment variable strings in the
test with the package constants: use envEnableLazyConn instead of
"NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and envInactivityThreshold instead of
"NB_LAZY_CONN_INACTIVITY_THRESHOLD" in the t.Setenv calls in
client/internal/peer/env_test.go so the test references the same unexported
constants defined in env.go (envEnableLazyConn, envInactivityThreshold).
In `@client/internal/stdnet/filter.go`:
- Around line 51-54: The unconditional check using strings.HasPrefix(lowerIFace,
"lo") is running on all OSes and mistakenly filters Windows interfaces; change
it so the "lo" prefix check only runs on non-Windows (e.g., wrap it behind a
runtime.GOOS != "windows" guard or equivalent) and remove the redundant Windows
loopback entry reliance only on windowsKnownBadSubstrings; update the
conditional around the lowerIFace "lo" prefix check (the code using lowerIFace
and strings.HasPrefix) so Windows names are not filtered here and remain handled
by windowsKnownBadSubstrings.
In `@client/server/server.go`:
- Around line 1523-1572: GetConfig currently mixes the requested profile's
persisted cfg with server-pushed runtime fields (spMode, spRelayTOSecs,
spP2pTOSecs, spP2pRetMax) unconditionally; change this so those sp* fields are
only populated from s.connectClient (Engine()->ConnMgr()) when the requested
profile equals the active/running profile (compare req.ProfileName/req.Username
against the active session identifiers used by the running connectClient),
otherwise leave sp* as zero-values. Update the logic that sets
spMode/spRelayTOSecs/spP2pTOSecs/spP2pRetMax (the block that queries
s.connectClient.Engine().ConnMgr()) to be guarded by an isActiveProfile check
and ensure the returned proto.GetConfigResponse uses those guarded values.
In `@client/ui/peers_tab.go`:
- Around line 61-71: When either getSrvClient(...) fails or conn.Status(...)
returns an error, you must clear stale peer UI state instead of only updating
summary: in both error paths wrap UI updates in fyne.Do to set summary to the
error text, clear breakdown (e.g. breakdown.SetText("")) and remove/clear the
peers rows/container (the peers table or peersContainer holding rows) so
previous peer entries aren't shown under the error banner; apply this to the
error branch after getSrvClient and the one after conn.Status to ensure stale
data is always wiped when refresh fails.
---
Duplicate comments:
In `@client/android/preferences.go`:
- Around line 325-330: SetConnectionMode and the Set*TimeoutSeconds setters
currently store empty/zero values as non-nil pointers, causing Commit() to
persist explicit overrides instead of removing them; change each setter (e.g.,
SetConnectionMode and the timeout setters) to write nil to the corresponding
p.configInput field when the input is the "clear" value (mode == "" or seconds
== 0) and otherwise allocate and assign a pointer to the provided value (keep
using p.configInput.ConnectionMode and the timeout fields to locate the spots to
change).
In `@client/internal/peer/status.go`:
- Around line 682-692: The call to notifyPeerStateChangeListeners is executed
after releasing the mutex (after d.mux.Unlock()), but that function snapshots
d.peers/d.changeNotify and must be called while holding the lock; move the
notifyPeerStateChangeListeners(receivedState.PubKey) call so it runs before
d.mux.Unlock() in this block (and likewise for the other occurrence at the
758-760 region), keeping d.notifier.peerListChanged(numPeers) and
d.dispatchRouterPeers(...) semantics unchanged and only reordering to hold d.mux
during notifyPeerStateChangeListeners to avoid concurrent map access.
In `@client/internal/peer/worker_ice.go`:
- Around line 227-230: IsConnected reads lastKnownState under muxAgent but
onConnectionStateChange (and other writers around the 545-571 range) update
lastKnownState without that lock, causing a race; modify onConnectionStateChange
and any other locations that assign to WorkerICE.lastKnownState to acquire
w.muxAgent.Lock()/Unlock() (or use RLock/RUnlock where appropriate) around
writes so all reads/writes of lastKnownState are protected by muxAgent, keeping
the lock usage consistent with WorkerICE.IsConnected and Conn.onNetworkChange
checks.
In `@management/server/http/handlers/accounts/accounts_handler.go`:
- Around line 511-517: The error message in validateUint32Timeout still passes
math.MaxUint32 as a plain value which can overflow on 32-bit builds; update the
second fmt.Errorf call in validateUint32Timeout to cast the constant to a 64-bit
integer (e.g. int64(math.MaxUint32)) so the formatted argument matches the %d
verb and avoids overflow, so that the comparison (which already uses
int64(math.MaxUint32)) and the error message rely on the same cast value.
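The cast itself is a one-liner; the field name and message shape below are illustrative:
return 0, fmt.Errorf("%s must be between 0 and %d", field, int64(math.MaxUint32))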
---
Nitpick comments:
In `@client/internal/peer/conn_handover_order_test.go`:
- Around line 138-175: The current extractFunctionBody helper blindly counts '{'
and '}' characters (in function extractFunctionBody) which miscounts braces
inside string literals or comments; replace this brittle scanner with a proper
Go AST-based extraction: parse the source with go/parser, use ast.Inspect to
find the *ast.FuncDecl matching the target name (methods are FuncDecls with a
receiver, so one case handles both forms), then use the node.Pos()/node.End()
token positions with the original src to slice out the exact function text;
update tests to call this new extractor instead of extractFunctionBody (see
the sketch after this file's notes).
- Around line 25-53: The test name and its comment refer to onICEConnected but
the call to extractFunctionBody(t, string(src), "onICEConnectionIsReady")
targets a different function; update either the test name/comment to match
onICEConnectionIsReady (e.g. rename TestConn_HandoverOrder_OnICEConnected to
TestConn_HandoverOrder_OnICEConnectionIsReady and update the comment) or change
the extractFunctionBody call to "onICEConnected" so the inspected function and
test identity match; adjust any related references in this test (landmarks or
error messages) to use the chosen function name.
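A self-contained sketch of the AST-based extractor suggested in the first note above (the helper name extractFuncSource is hypothetical):
import (
	"go/ast"
	"go/parser"
	"go/token"
)

// extractFuncSource returns the verbatim source text of the named
// function or method, or "" if no such declaration exists.
func extractFuncSource(src, name string) string {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, parser.ParseComments)
	if err != nil {
		return ""
	}
	var out string
	ast.Inspect(file, func(n ast.Node) bool {
		fd, ok := n.(*ast.FuncDecl)
		if !ok || fd.Name.Name != name {
			return true
		}
		// Pos/End resolve to byte offsets in src via the FileSet, so
		// braces inside strings or comments can no longer miscount.
		out = src[fset.Position(fd.Pos()).Offset:fset.Position(fd.End()).Offset]
		return false
	})
	return out
}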
In `@client/internal/peer/guard/guard_test.go`:
- Around line 65-96: The test TestGuard_PeerActivityResetsHourlyMode currently
simulates the reconnectLoopWithRetry behaviour by reading from g.peerActivity
and calling iceState.reset() itself; instead, start the actual
reconnectLoopWithRetry (or the goroutine that runs it) with a cancellable
context, call g.NotifyPeerActivity(), and then wait for the real loop to consume
from g.peerActivity and perform the reset so you assert the real effects on the
iceRetryState (check iceState.hourly and iceState.retries) — i.e., replace the
manual select+iceState.reset() with running reconnectLoopWithRetry in a
goroutine, use context cancellation to stop the loop, and assert that the loop
cleared hourly mode and zeroed retries after NotifyPeerActivity().
In `@client/internal/peer/status.go`:
- Around line 1617-1644: RefreshWireGuardStats currently holds d.mux while
calling d.wgIface.FullStats(), which can block other operations; fix it by
acquiring the lock only to read the wgIface pointer (and return nil if nil),
then release the lock, call wgIface.FullStats() outside the lock, and finally
re-acquire d.mux to apply the returned stats snapshot into d.peers (updating
LastWireguardHandshake, BytesRx, BytesTx for each peer). Ensure you preserve the
existing error wrapping from FullStats() and the behavior of skipping unknown
public keys when updating the map.
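A sketch of the lock/compute/relock shape described there; the snapshot fields (Peers, PublicKey, LastHandshake, RxBytes, TxBytes) are assumptions, not the real FullStats() types:
d.mux.Lock()
iface := d.wgIface
d.mux.Unlock()
if iface == nil {
	return nil
}
stats, err := iface.FullStats() // potentially slow; runs without d.mux held
if err != nil {
	return fmt.Errorf("full stats: %w", err)
}
d.mux.Lock()
defer d.mux.Unlock()
for _, s := range stats.Peers {
	p, ok := d.peers[s.PublicKey]
	if !ok {
		continue // unknown public key: skip, matching current behavior
	}
	p.LastWireguardHandshake = s.LastHandshake
	p.BytesRx, p.BytesTx = s.RxBytes, s.TxBytes
	d.peers[s.PublicKey] = p
}
return nil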
In `@client/ui/peers_tab.go`:
- Around line 280-355: The buildPeerDetailText function is doing too many jobs;
split it into small helpers: create renderBasePeerFields(sb *strings.Builder, p
*proto.PeerState) to write IP/FQDN/connection type/effective/configured
modes/latency/last-seen/groups, renderConnectionEndpoints(sb *strings.Builder, p
*proto.PeerState) to handle relayed vs local/remote endpoint printing and Relay
server/local/remote endpoint logic, and renderFullPeerDetails(sb
*strings.Builder, p *proto.PeerState) which calls a helper renderIceBackoff(sb
*strings.Builder, p *proto.PeerState) to encapsulate the ICE backoff
nextRetry/suspended/time-until logic; then have buildPeerDetailText simply
create the strings.Builder, call renderBasePeerFields,
renderConnectionEndpoints, and if full call renderFullPeerDetails, and return
sb.String().
In `@management/internals/server/boot.go`:
- Around line 111-125: PeerConnStore and PeerConnRouter currently create
process-local state (via Create with peer_connections.NewMemoryStore and
peer_connections.NewSnapshotRouter), which will cause inconsistent visibility
across replicas; update the initialization to use a shared backing or pub/sub
layer (or make the Create wrapper accept an alternate implementation) so
PeerConnStore() and PeerConnRouter() return a cluster-scoped store/router
instead of an in-memory local one; specifically adjust the code paths that call
PeerConnStore, PeerConnRouter, and the SyncPeerConnections flow to use the
shared implementation (or guard rollout with sticky routing) so reads and
SyncPeerConnections traffic observe the same state across replicas.
In `@management/server/peer_test.go`:
- Around line 588-589: Replace the generic assert.Error with a specific
assertion that the failure is an authorization/denial error from GetPeer: call
manager.GetPeer(context.Background(), accountID, peer2.ID, someUser) and then
assert that the returned error matches the expected permission-denied sentinel
or status (e.g. errors.Is(err, ErrPermissionDenied) or check a typed error's
StatusCode == 403 or assert.EqualError(t, err, ErrPermissionDenied.Error())).
Use the actual project symbol for the sentinel (e.g., ErrPermissionDenied,
ErrUnauthorized, or the API error type) instead of a generic error check so the
test only passes when access is denied for the right reason.
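For example, assuming the project exposes an ErrPermissionDenied sentinel (substitute the actual symbol):
_, err := manager.GetPeer(context.Background(), accountID, peer2.ID, someUser)
require.Error(t, err)
assert.True(t, errors.Is(err, ErrPermissionDenied),
	"expected permission-denied, got: %v", err)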
In `@management/server/peer/peer.go`:
- Around line 199-203: The current comparison in isEqual uses
slices.Equal(p.SupportedFeatures, other.SupportedFeatures), which is
order-sensitive; change it to compare feature sets instead by implementing a
helper (e.g., equalFeatureSets) that returns false if lengths differ, clones
both slices, sorts them (or builds maps/sets) and then compares, and replace the
slices.Equal call with equalFeatureSets(p.SupportedFeatures,
other.SupportedFeatures) so identical features in different orders are treated
as equal.
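A sketch of that helper using the Go 1.21+ slices package:
import "slices"

// equalFeatureSets reports whether two feature lists contain the same
// entries regardless of order; cloning avoids mutating the inputs.
func equalFeatureSets(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as, bs := slices.Clone(a), slices.Clone(b)
	slices.Sort(as)
	slices.Sort(bs)
	return slices.Equal(as, bs)
}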
In `@shared/management/http/api/openapi.yml`:
- Around line 383-431: The OpenAPI schema omits explicit default values for
documented defaults; add default: 900 (integer) to p2p_retry_max_seconds,
default: true (boolean) to legacy_lazy_fallback_enabled, and default: 3600
(integer) to legacy_lazy_fallback_timeout_seconds in the YAML so generated
clients/docs reflect the described defaults (keep the keys as
p2p_retry_max_seconds, legacy_lazy_fallback_enabled,
legacy_lazy_fallback_timeout_seconds and ensure types/formats remain unchanged).
if newMode == e.mode && newRelay == e.relayTimeoutSecs &&
	newP2P == e.p2pTimeoutSecs && newP2pRetry == e.p2pRetryMaxSecs {
	return nil
}
prev := e.mode
e.mode = newMode
e.relayTimeoutSecs = newRelay
e.p2pTimeoutSecs = newP2P
e.p2pRetryMaxSecs = newP2pRetry
e.propagateP2pRetryMaxToConns()

wasManaged := modeUsesLazyMgr(prev)
isManaged := modeUsesLazyMgr(newMode)
modeChanged := prev != newMode

if modeChanged && wasManaged && !isManaged {
	log.Infof("lazy/dynamic connection manager disabled by management push (mode=%s)", newMode)
	e.closeManager(ctx)
	e.statusRecorder.UpdateLazyConnection(false)
	return nil
}

if modeChanged && wasManaged && isManaged {
	// Switching between lazy and dynamic at runtime: tear down the
	// existing manager so initLazyManager picks up the new timeouts.
	log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
	e.closeManager(ctx)
	e.statusRecorder.UpdateLazyConnection(false)
}

if isManaged && e.lazyConnMgr == nil {
	if e.rosenpassEnabled {
		log.Warnf("rosenpass enabled, ignoring lazy/dynamic mode push")
		return nil
	}
	log.Infof("lazy/dynamic connection manager enabled by management push (mode=%s)", newMode)
	e.initLazyManager(ctx)
	e.startModeSideEffects()
Restart or live-reconfigure the manager when only inactivity timeouts change.
This path only restarts the lazy manager on modeChanged, but initLazyManager() snapshots relayTimeoutSecs and p2pTimeoutSecs into manager.Config once. If management changes just RelayTimeoutSeconds or P2PTimeoutSeconds, the new values are stored on ConnMgr but never reach the running inactivity manager until a full restart or a later mode flip.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/conn_mgr.go` around lines 292 - 329, The code only restarts
the lazy manager when modeChanged, so updates to relayTimeoutSecs or
p2pTimeoutSecs on ConnMgr never get propagated into the running manager.Config;
modify the update path in the function that sets
newMode/newRelay/newP2P/newP2pRetry (the block around
propagateP2pRetryMaxToConns, modeUsesLazyMgr, closeManager, initLazyManager,
startModeSideEffects and lazyConnMgr) to detect when the inactivity timeouts
(relayTimeoutSecs or p2pTimeoutSecs) change while isManaged is true and
lazyConnMgr != nil and then restart the manager (call closeManager(ctx),
UpdateLazyConnection(false), then initLazyManager(ctx) and
startModeSideEffects()) so the new timeout values are snapshotted into the
manager.Config even when mode did not change.
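Sketch of the extra condition, reusing the identifiers quoted above (prevRelay/prevP2P are hypothetical captures taken before the assignments overwrite the fields):
prevRelay, prevP2P := e.relayTimeoutSecs, e.p2pTimeoutSecs // capture before overwriting
// ... existing assignments of e.mode / e.relayTimeoutSecs / e.p2pTimeoutSecs ...
timeoutsChanged := newRelay != prevRelay || newP2P != prevP2P
if !modeChanged && isManaged && e.lazyConnMgr != nil && timeoutsChanged {
	log.Infof("inactivity timeouts changed, restarting lazy/dynamic manager")
	e.closeManager(ctx)
	e.statusRecorder.UpdateLazyConnection(false)
	e.initLazyManager(ctx)
	e.startModeSideEffects()
}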
func newSessionID() uint64 {
	var b [8]byte
	for {
		_, _ = rand.Read(b[:])
		if id := binary.BigEndian.Uint64(b[:]); id != 0 {
			return id
		}
	}
Handle rand.Read failures instead of retrying forever.
If rand.Read errors here, b remains zeroed and this loop spins forever during pusher construction. A small fallback or propagated error avoids hanging startup.
Suggested fix
func newSessionID() uint64 {
var b [8]byte
for {
- _, _ = rand.Read(b[:])
+ if _, err := rand.Read(b[:]); err != nil {
+ if id := uint64(time.Now().UnixNano()); id != 0 {
+ return id
+ }
+ continue
+ }
if id := binary.BigEndian.Uint64(b[:]); id != 0 {
return id
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/conn_state_pusher.go` around lines 115 - 122, newSessionID
currently ignores errors from rand.Read and can loop forever; change
newSessionID to return (uint64, error), check the error from rand.Read and
return it instead of retrying, and keep the non-zero check for the generated id;
update any callers that call newSessionID (e.g., the pusher/constructor code in
conn_state_pusher) to handle the error by propagating it or providing a safe
fallback (e.g., abort startup with a clear error) so startup cannot hang
indefinitely.
// OnSnapshotRequest enqueues a snapshot-request nonce. Non-blocking,
// coalescing — multiple requests in flight result in a single full
// snapshot with the latest nonce echoed. Same nil-guard rationale as
// OnPeerStateChange.
func (p *connStatePusher) OnSnapshotRequest(nonce uint64) {
	if p == nil {
		return
	}
	select {
	case p.snapshotReq <- nonce:
	default:
	}
Drain snapshotReq to the newest nonce before sending the full snapshot.
The comment promises latest-nonce coalescing, but the loop flushes the first queued nonce it reads. Under back-to-back refreshes, the caller can miss its own InResponseToNonce.
Suggested fix
case nonce := <-p.snapshotReq:
+ for {
+ select {
+ case newer := <-p.snapshotReq:
+ nonce = newer
+ default:
+ goto flushSnapshot
+ }
+ }
+ flushSnapshot:
if p.source != nil {
p.flushFull(p.source.SnapshotAllRemotePeers(), nonce)
}
 interval = p.tuning.baseInterval
Also applies to: 245-248
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/conn_state_pusher.go` around lines 153 - 164, The
OnSnapshotRequest method currently enqueues a nonce but can leave an old nonce
in the channel causing the pusher to process a stale value; change
connStatePusher.OnSnapshotRequest to coalesce by draining snapshotReq before
sending: perform a non-blocking loop that reads and discards enqueued nonces
keeping only the most recent value, then non-blocking-send that newest nonce
into p.snapshotReq (so callers' latest InResponseToNonce is preserved). Apply
the same drain-to-newest fix to the other similar handler using the same pattern
(the peer/state-change request channel referenced around lines 245-248).
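A label-free variant of the same drain-to-newest coalescing, as a sketch:
case nonce := <-p.snapshotReq:
	// Coalesce: discard older queued nonces, keeping only the newest.
	for drained := false; !drained; {
		select {
		case newer := <-p.snapshotReq:
			nonce = newer
		default:
			drained = true
		}
	}
	if p.source != nil {
		p.flushFull(p.source.SnapshotAllRemotePeers(), nonce)
	}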
if req.Settings.ConnectionMode != nil {
	modeStr := string(*req.Settings.ConnectionMode)
	if !req.Settings.ConnectionMode.Valid() {
		return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
	}
	// Persist as the canonical string. Important: returnSettings
	// is a fresh struct built from scratch by this handler -- if
	// the request body omits connection_mode (or sets JSON null,
	// which deserializes to a nil pointer), this whole block is
	// skipped AND returnSettings.ConnectionMode stays nil, which
	// the storage layer interprets as "clear the override". To
	// preserve the existing value the caller must include the
	// current value explicitly in the PUT body. This is also true
	// for the four timeout fields below.
	s := modeStr
	returnSettings.ConnectionMode = &s
}
if req.Settings.P2pTimeoutSeconds != nil {
	v, err := validateUint32Timeout("p2p_timeout_seconds", *req.Settings.P2pTimeoutSeconds)
	if err != nil {
		return nil, err
	}
	returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
	v, err := validateUint32Timeout("p2p_retry_max_seconds", *req.Settings.P2pRetryMaxSeconds)
	if err != nil {
		return nil, err
	}
	returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
	v, err := validateUint32Timeout("relay_timeout_seconds", *req.Settings.RelayTimeoutSeconds)
	if err != nil {
		return nil, err
	}
	returnSettings.RelayTimeoutSeconds = &v
}
if req.Settings.LegacyLazyFallbackEnabled != nil {
	returnSettings.LegacyLazyFallbackEnabled = *req.Settings.LegacyLazyFallbackEnabled
}
if req.Settings.LegacyLazyFallbackTimeoutSeconds != nil {
	// Phase 3.7i (#5989): legacy fallback timeout. Range chosen to
	// match the range an admin would plausibly set on a metered LTE
	// fleet: 60s lower bound (anything shorter just hammers
	// signaling), 86400s upper bound (24h - longer than that and the
	// fallback is effectively "never tear down").
	v := *req.Settings.LegacyLazyFallbackTimeoutSeconds
	if v < 60 || v > 86400 {
		return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
Return InvalidArgument for these new validation failures.
These branches return plain fmt.Errorf, so updateAccount() hands util.WriteError a generic error instead of a typed management/status error. Bad input for connection_mode and the timeout fields will therefore go down the 5xx path instead of returning a 4xx validation response.
💡 Minimal fix
if req.Settings.ConnectionMode != nil {
modeStr := string(*req.Settings.ConnectionMode)
if !req.Settings.ConnectionMode.Valid() {
- return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+ return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
}
@@
if req.Settings.LegacyLazyFallbackTimeoutSeconds != nil {
@@
v := *req.Settings.LegacyLazyFallbackTimeoutSeconds
if v < 60 || v > 86400 {
- return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+ return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
}
returnSettings.LegacyLazyFallbackTimeoutSeconds = uint32(v)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
231 - 280, Replace plain fmt.Errorf errors with gRPC InvalidArgument status
errors so bad input yields a 4xx validation response: change the connection_mode
error in the accounts handler to return status.Errorf(codes.InvalidArgument,
"invalid connection_mode %q", modeStr) and change the
legacy_lazy_fallback_timeout_seconds branch to return
status.Errorf(codes.InvalidArgument, "invalid
legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v).
Also ensure validateUint32Timeout returns or surfaces InvalidArgument errors
(update validateUint32Timeout to return status.Errorf(codes.InvalidArgument,
...) for invalid ranges) and add imports for "google.golang.org/grpc/codes" and
"google.golang.org/grpc/status". This targets the connection_mode branch, the
legacy_lazy_fallback_timeout_seconds block, and the validateUint32Timeout
function so updateAccount -> util.WriteError receives typed InvalidArgument
errors.
func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.maps[peerPubKey]
	if !ok {
		return nil, false
	}
	if since > 0 && e.m.GetInResponseToNonce() < since {
		return nil, false
	}
	if s.clock.Now().Sub(e.updatedAt) > s.ttl {
		delete(s.maps, peerPubKey)
		return nil, false
	}
	return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
Expire stale entries before the nonce gate.
GetWithNonceCheck returns on InResponseToNonce < since before it evaluates TTL, so entries queried only through the refresh path never get evicted once they expire. That breaks the store's advertised TTL cleanup and can leave dead per-peer snapshots resident indefinitely.
🐛 Suggested fix
func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
s.mu.Lock()
defer s.mu.Unlock()
e, ok := s.maps[peerPubKey]
if !ok {
return nil, false
}
- if since > 0 && e.m.GetInResponseToNonce() < since {
- return nil, false
- }
if s.clock.Now().Sub(e.updatedAt) > s.ttl {
delete(s.maps, peerPubKey)
return nil, false
}
+ if since > 0 && e.m.GetInResponseToNonce() < since {
+ return nil, false
+ }
return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer_connections/store.go` around lines 141 - 155,
GetWithNonceCheck on MemoryStore checks the nonce gate before TTL, allowing
expired entries to be returned as nonce-old and never deleted; change
GetWithNonceCheck to first evaluate TTL (compare s.clock.Now().Sub(e.updatedAt)
> s.ttl and delete from s.maps if expired) before performing the since /
InResponseToNonce check so stale entries are evicted even when the nonce
condition would cause an early return; adjust references inside the
GetWithNonceCheck method that touch e, e.m, e.updatedAt, s.maps, s.ttl and
s.clock accordingly.
	}
}

return nil, status.Errorf(status.Internal, "user %s has no access to peer %s under account %s", userID, peer.ID, accountID)
Return an auth error here instead of status.Internal.
“No access to this peer” is an expected authorization outcome on this path. Returning status.Internal turns it into a 500-class failure for callers instead of a permission/not-found response.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer.go` at line 1355, Replace the internal-server error
return with an authorization error: change the status code in the return that
currently uses status.Internal to an auth-level code (e.g.,
status.PermissionDenied) so callers get a permission response instead of a 500;
keep the existing message formatting that includes userID, peer.ID and accountID
in the return statement.
if req := decryptedResp.GetSnapshotRequest(); req != nil {
	c.snapMu.Lock()
	cb := c.onSnapshotRequest
	c.snapMu.Unlock()
	if cb != nil {
		cb(req.GetNonce())
	}
}

if err := msgHandler(decryptedResp); err != nil {
Run the snapshot callback after applying the same SyncResponse.
SyncResponse can carry snapshot_request alongside ordinary sync data. Invoking cb() first lets the client publish a snapshot built from pre-update peer state when both travel in the same frame.
Suggested ordering change
- if req := decryptedResp.GetSnapshotRequest(); req != nil {
- c.snapMu.Lock()
- cb := c.onSnapshotRequest
- c.snapMu.Unlock()
- if cb != nil {
- cb(req.GetNonce())
- }
- }
-
if err := msgHandler(decryptedResp); err != nil {
log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
}
+
+ if req := decryptedResp.GetSnapshotRequest(); req != nil {
+ c.snapMu.Lock()
+ cb := c.onSnapshotRequest
+ c.snapMu.Unlock()
+ if cb != nil {
+ cb(req.GetNonce())
+ }
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@shared/management/client/grpc.go` around lines 483 - 492, The snapshot
callback is invoked before applying the SyncResponse which can cause snapshots
to be built from pre-update state; move the c.onSnapshotRequest invocation (the
cb from c.onSnapshotRequest and decryptedResp.GetSnapshotRequest()) to after the
call to msgHandler(decryptedResp) so the SyncResponse is applied first, i.e.,
call msgHandler(decryptedResp) and only if it returns nil then acquire c.snapMu,
read c.onSnapshotRequest into cb, release the lock and invoke cb(req.GetNonce())
when decryptedResp.GetSnapshotRequest() is non-nil.
// Defines values for AccountSettingsConnectionMode.
const (
	AccountSettingsConnectionModeP2p         AccountSettingsConnectionMode = "p2p"
	AccountSettingsConnectionModeP2pDynamic  AccountSettingsConnectionMode = "p2p-dynamic"
	AccountSettingsConnectionModeP2pLazy     AccountSettingsConnectionMode = "p2p-lazy"
	AccountSettingsConnectionModeRelayForced AccountSettingsConnectionMode = "relay-forced"
)

// Valid indicates whether the value is a known member of the AccountSettingsConnectionMode enum.
func (e AccountSettingsConnectionMode) Valid() bool {
	switch e {
	case AccountSettingsConnectionModeP2p:
		return true
	case AccountSettingsConnectionModeP2pDynamic:
		return true
	case AccountSettingsConnectionModeP2pLazy:
		return true
	case AccountSettingsConnectionModeRelayForced:
		return true
	default:
		return false
	}
}
Expose p2p-dynamic-lazy in the generated connection-mode enum.
The #5989 contract in this PR still calls out p2p-dynamic-lazy, but AccountSettingsConnectionMode and Valid() only permit four values here. That leaves the HTTP API surface unable to represent the full mode set and generated clients will reject the missing mode if the backend starts using it. Please fix the OpenAPI source and regenerate this file rather than patching the generated output directly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated
enum AccountSettingsConnectionMode and its Valid() method are missing the
p2p-dynamic-lazy value; update the OpenAPI spec so the connection-mode enum
includes "p2p-dynamic-lazy", regenerate the code, and ensure the generated
constants (e.g., AccountSettingsConnectionModeP2pDynamicLazy) and Valid() cover
that value (so Valid() returns true for
AccountSettingsConnectionModeP2pDynamicLazy) instead of manually editing
types.gen.go.
Force-pushed from b07184d to 7f51b84
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
management/server/peer/peer.go (1)
207-225:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
isEmpty omits newly added 3.7i metadata fields, which can suppress valid updates.
If a payload only carries the new effective mode/timeout/features fields, this function still returns empty and UpdateMetaIfNew exits early.
💡 Suggested fix
 func (p PeerSystemMeta) isEmpty() bool {
 	return p.Hostname == "" &&
 		p.GoOS == "" &&
 		p.Kernel == "" &&
 		p.Core == "" &&
 		p.Platform == "" &&
 		p.OS == "" &&
 		p.OSVersion == "" &&
 		p.WtVersion == "" &&
 		p.UIVersion == "" &&
 		p.KernelVersion == "" &&
 		len(p.NetworkAddresses) == 0 &&
 		p.SystemSerialNumber == "" &&
 		p.SystemProductName == "" &&
 		p.SystemManufacturer == "" &&
 		p.Environment.Cloud == "" &&
 		p.Environment.Platform == "" &&
-		len(p.Files) == 0
+		len(p.Files) == 0 &&
+		p.EffectiveConnectionMode == "" &&
+		p.EffectiveRelayTimeoutSecs == 0 &&
+		p.EffectiveP2PTimeoutSecs == 0 &&
+		p.EffectiveP2PRetryMaxSecs == 0 &&
+		len(p.SupportedFeatures) == 0
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer/peer.go` around lines 207 - 225, The isEmpty method on PeerSystemMeta currently ignores the new 3.7i metadata fields so UpdateMetaIfNew can mistakenly treat payloads with only those fields as empty; update PeerSystemMeta.isEmpty to include the newly added fields (the effective mode, effective timeout, and features fields added for 3.7i) in the emptiness checks so that if any of those fields are set the method returns false and UpdateMetaIfNew will process the update.
management/server/account.go (1)
416-442:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Validate ConnectionMode before persisting it.
validateSettingsUpdate never rejects unknown mode strings, but toPeerConfig() later ignores parse failures and falls back to LazyConnectionEnabled. That lets an invalid admin setting save successfully while clients keep using the old mode. Reject unknown values here so misconfiguration fails fast.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@management/server/account.go` around lines 416 - 442, validateSettingsUpdate currently allows invalid ConnectionMode strings to be saved while toPeerConfig silently falls back to LazyConnectionEnabled; update validateSettingsUpdate to parse/validate newSettings.ConnectionMode (using the same parser used by toPeerConfig) and return a status.InvalidArgument error when the parse fails (include the invalid value in the message) so unknown modes are rejected before persisting.
management/server/peer.go (1)
1311-1323:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Reapply RegularUsersViewBlocked in GetPeer.
GetPeers returns nothing for restrictable users when settings.RegularUsersViewBlocked is enabled, but GetPeer now skips that gate and still allows owner/policy-reachable access if the caller knows a peer ID. That reopens a path the list endpoint explicitly blocks.
Suggested fix
 user, err := am.Store.GetUserByUserID(ctx, store.LockingStrengthNone, userID)
 if err != nil {
 	return nil, err
 }
+
+settings, err := am.Store.GetAccountSettings(ctx, store.LockingStrengthNone, accountID)
+if err != nil {
+	return nil, fmt.Errorf("failed to get account settings: %w", err)
+}
+if user.IsRestrictable() && settings.RegularUsersViewBlocked {
+	return nil, status.NewPermissionDeniedError()
+}
 // admin/service-user, or the peer owner -- direct access.
 if user.IsAdminOrServiceUser() || peer.UserID == userID {
 	return peer, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer.go` around lines 1311 - 1323, GetPeer currently bypasses the RegularUsersViewBlocked guard that GetPeers enforces, allowing restrictable users to access a peer by ID; update GetPeer to reapply the same check used in GetPeers: if settings.RegularUsersViewBlocked is enabled and the caller user is a restrictable/regular user, deny access (return an appropriate error) before performing the owner/admin or checkIfUserOwnsPeer logic. Locate the gate near the user lookup (am.Store.GetUserByUserID / GetPeer) and perform the setting + user-type check there so only admin/service users or callers allowed by the existing Restrictable policy proceed to the owner/policy reachability branch.
♻️ Duplicate comments (10)
client/internal/peer/worker_ice.go (1)
218-231:⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Data race on lastKnownState.
IsConnected() reads w.lastKnownState under muxAgent, but onConnectionStateChange() writes that same field at lines 550 and 562 without holding muxAgent. With Conn.onNetworkChange now using IsConnected() to decide whether to tear down the agent, this is a real cross-goroutine race that can produce stale keep/close decisions. Acquire w.muxAgent in the state-change callback when touching lastKnownState (and agent while you're at it).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/peer/worker_ice.go` around lines 218 - 231, IsConnected reads WorkerICE.lastKnownState under w.muxAgent but onConnectionStateChange writes that field without locking, causing a data race; update the onConnectionStateChange callback to acquire w.muxAgent before touching lastKnownState (and agent) so both reads/writes are synchronized with IsConnected; specifically, wrap assignments at the locations that set lastKnownState (and any writes to agent) in a w.muxAgent.Lock()/Unlock() pair inside the onConnectionStateChange handler to eliminate the race.
client/android/preferences.go (1)
328-331:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Setters still write &0/&"" instead of nil when "clearing".
The previous review flagged that SetRelayTimeoutSeconds(0) etc. always store a non-nil pointer to 0 in configInput, defeating the "Pass 0 to clear the override" contract. That bug persists in this revision and now also applies to SetConnectionMode(""):
SetConnectionMode("") at line 328-331 → &"" (docstring says empty-string clears the override)
SetRelayTimeoutSeconds(0) at line 352-355 → &0
SetP2pTimeoutSeconds(0) at line 373-376 → &0
SetP2pRetryMaxSeconds(0) at line 393-396 → &0
Per Config's docstring (config.go lines 180-199), local 0 means "follow server" — so persisting 0 is a different state than "no local override". For ConnectionMode, persisting "" similarly differs from "unset" in apply()'s comparison logic (config.go line 627).
🛡️ Proposed fix
 func (p *Preferences) SetConnectionMode(mode string) {
-	m := mode
-	p.configInput.ConnectionMode = &m
+	if mode == "" {
+		p.configInput.ConnectionMode = nil
+		return
+	}
+	m := mode
+	p.configInput.ConnectionMode = &m
 }

 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
-	p.configInput.RelayTimeoutSeconds = &v
+	if v == 0 {
+		p.configInput.RelayTimeoutSeconds = nil
+		return
+	}
+	p.configInput.RelayTimeoutSeconds = &v
 }

 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
-	p.configInput.P2pTimeoutSeconds = &v
+	if v == 0 {
+		p.configInput.P2pTimeoutSeconds = nil
+		return
+	}
+	p.configInput.P2pTimeoutSeconds = &v
 }

 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
-	p.configInput.P2pRetryMaxSeconds = &v
+	if v == 0 {
+		p.configInput.P2pRetryMaxSeconds = nil
+		return
+	}
+	p.configInput.P2pRetryMaxSeconds = &v
 }
Also applies to: 352-355, 373-376, 393-396
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@client/android/preferences.go` around lines 328 - 331, The setters are currently storing pointers to zero-values (e.g., &"" or &0) which prevents representing "unset" per Config's contract; update each setter (SetConnectionMode, SetRelayTimeoutSeconds, SetP2pTimeoutSeconds, SetP2pRetryMaxSeconds) to set the corresponding p.configInput field to nil when the passed value is the clearing sentinel (empty string for ConnectionMode, 0 for the timeout/retry setters), otherwise allocate and store a pointer to the provided value so callers can distinguish "unset" (nil) from an explicit zero value.
client/internal/lazyconn/manager/manager.go (1)
102-109:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don't treat explicit 0/0 timeouts as the legacy fallback (still applies).
This branch maps both new fields being zero to inactivity.NewManager(wgIface, config.InactivityThreshold). When the caller has migrated to the new API (InactivityThreshold == nil, ICEInactivityThreshold == 0, RelayInactivityThreshold == 0) — which is the legitimate "always-on, no inactivity teardown" configuration — this dereferences a nil pointer or restores the legacy 24h relay default depending on NewManager's signature. Either outcome silently violates the documented "0 disables that teardown path" contract. Gate the legacy constructor on config.InactivityThreshold != nil instead:
Suggested fix
 if wgIface.IsUserspaceBind() {
 	iceTO, relayTO := config.resolvedTimeouts()
-	if iceTO == 0 && relayTO == 0 {
+	if config.InactivityThreshold != nil &&
+		config.ICEInactivityThreshold == 0 &&
+		config.RelayInactivityThreshold == 0 {
 		// Phase 1 / single-timer fallback when caller hasn't been migrated.
 		m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold) //nolint:staticcheck // intentional Phase-1 single-timer fallback
 	} else {
 		m.inactivityManager = inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO)
 	}
 } else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/lazyconn/manager/manager.go` around lines 102 - 109, The current branch treats iceTO==0 && relayTO==0 as the legacy single-timer fallback and calls inactivity.NewManager(wgIface, config.InactivityThreshold), which incorrectly runs when the caller has migrated and intentionally set ICEInactivityThreshold==0 and RelayInactivityThreshold==0 with InactivityThreshold==nil (meaning "disable those teardowns"); update the logic in the manager initialization to only call inactivity.NewManager(...) when config.InactivityThreshold != nil (preserving the legacy behavior), otherwise call inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) (so explicit 0/0 from resolvedTimeouts() is honored as "disable"), and keep the existing use of wgIface, config.resolvedTimeouts(), inactivity.NewManager and inactivity.NewManagerWithTwoTimers identifiers to locate the change.
management/internals/shared/grpc/conversion.go (1)
315-327:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Only mark liveness as authoritative when rPeer.Status exists.
LiveOnline is populated only inside the nil guard, but ServerLivenessKnown is forced to true for every peer. When status is absent, new clients will interpret that as "server knows this peer is offline" instead of "liveness unknown".
Suggested fix
 if rPeer.Status != nil {
 	if !rPeer.Status.LastSeen.IsZero() {
 		cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
 	}
 	cfg.LiveOnline = rPeer.Status.Connected
+	cfg.ServerLivenessKnown = true
 }
-cfg.ServerLivenessKnown = true
 dst = append(dst, cfg)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@management/internals/shared/grpc/conversion.go` around lines 315 - 327, The code sets cfg.ServerLivenessKnown = true unconditionally even when rPeer.Status is nil; change this so ServerLivenessKnown is set to true only when rPeer.Status != nil (i.e. inside the existing nil-guard where LiveOnline is populated) or explicitly set to false when rPeer.Status is nil, ensuring the conversion logic (the block handling rPeer, cfg and rPeer.Status) only marks server liveness authoritative when status exists.
client/internal/conn_mgr.go (2)
287-341:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Restart the running lazy manager when only relay/ICE timeouts change.
initLazyManager() snapshots relayTimeoutSecs and p2pTimeoutSecs into the manager config once. In this update path a pure timeout change reaches e.relayTimeoutSecs/e.p2pTimeoutSecs, but because modeChanged == false the running manager is left on the old thresholds until some unrelated restart.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 287 - 341, When timeouts change but mode stays the same you must restart an already-running lazy manager so it picks up the new relay/p2p/retry thresholds: after computing newMode/newRelay/newP2P/newP2pRetry and before the branch that only handles e.lazyConnMgr == nil, add a check for isManaged && e.lazyConnMgr != nil && (newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs || newP2pRetry != e.p2pRetryMaxSecs) and in that case log the restart, call e.closeManager(ctx) (and e.statusRecorder.UpdateLazyConnection(false) if you need parity with other restarts), then call e.initLazyManager(ctx) and e.startModeSideEffects() so the running manager picks up the new timeouts; keep the existing assignments to e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs and the earlier e.propagateP2pRetryMaxToConns() call.
296-326:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don't keep a lazy/dynamic resolved mode when Rosenpass blocks it.
This branch returns after e.mode and the timeout fields have already been updated. The daemon can then report p2p-lazy/p2p-dynamic even though no lazy manager is running and peers were left on the eager path. Normalize or reject the pushed mode before mutating live state.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/conn_mgr.go` around lines 296 - 326, The code currently updates e.mode and timeout fields before checking e.rosenpassEnabled, so a pushed lazy/dynamic mode can be recorded even when Rosenpass prevents running a lazy manager; move the Rosenpass check ahead of mutating live state: first evaluate modeUsesLazyMgr(newMode) and if it returns true and e.rosenpassEnabled is true, log the warning and return (or reject the push) without changing e.mode or any timeout fields; only after this check mutate e.mode, e.relayTimeoutSecs, e.p2pTimeoutSecs, e.p2pRetryMaxSecs and call e.propagateP2pRetryMaxToConns(), then proceed with the existing manager start/stop logic (references: e.mode, modeUsesLazyMgr, e.rosenpassEnabled, e.relayTimeoutSecs, e.p2pTimeoutSecs, e.p2pRetryMaxSecs, e.propagateP2pRetryMaxToConns, e.closeManager, e.lazyConnMgr, e.statusRecorder.UpdateLazyConnection).
management/internals/shared/grpc/server.go (1)
487-489:⚠️ Potential issue | 🔴 Critical
Closed snapshotCh still busy-loops on fast reconnect.
When SnapshotRouter.Register() closes the old channel, this branch keeps snapshotCh live in the select, so the loop can spin immediately on the closed case until the stream exits.
Suggested fix
 case nonce, ok := <-snapshotCh:
 	if !ok {
+		snapshotCh = nil
 		continue
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@management/internals/shared/grpc/server.go` around lines 487 - 489, The select loop spins on a closed snapshotCh because the receive case stays selectable; in the case handling "case nonce, ok := <-snapshotCh:" when ok is false set snapshotCh to nil (or otherwise remove it from the select) so the closed channel is no longer selected, then continue; update the select loop in server.go to assign snapshotCh = nil on close (referencing snapshotCh and the register flow from SnapshotRouter.Register()) so fast reconnects don't busy-loop.
management/server/peer.go (1)
1355-1355:⚠️ Potential issue | 🟠 Major
Return an auth/not-found status here, not status.Internal.
This still turns a normal access miss into a 500-class error for callers.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer.go` at line 1355, The current return uses status.Internal which incorrectly produces a 500 for an access miss; change the status to an auth/not-found code (e.g. status.NotFound or the equivalent not-found constant used in your status package) so callers receive a 404-style response instead of Internal. Update the return that mentions userID, peer.ID and accountID to use the not-found status (or PermissionDenied if you prefer explicit auth failure) while keeping the same error message.
client/ui/client_ui.go (2)
889-907:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Timeout-field gating still reads the raw dropdown label, not the effective mode.
When the dropdown sits on Follow server (or Follow server (currently: <mode>)), the switch falls into default and disables all three timeout entries even when serverPushedMode is p2p-lazy/p2p-dynamic. That blocks legitimate client-side overrides while following the server. Switch on the effective mode (use selectedConnectionMode(), falling back to serverPushedMode when it returns "") instead of s.sConnectionMode.Selected.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/client_ui.go` around lines 889 - 907, The timeout gating currently switches on the raw dropdown label s.sConnectionMode.Selected; change updateTimeoutEntriesEnabled to determine the effective mode first by calling s.selectedConnectionMode() and, if that returns an empty string, falling back to s.serverPushedMode, then switch on that effective mode (instead of s.sConnectionMode.Selected) to enable/disable s.iRelayTimeout, s.iP2pTimeout and s.iP2pRetryMax so following-the-server modes like p2p-lazy/p2p-dynamic are honored.
656-666:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Invalid timeout input still silently collapses to 0 (no override).
parseUint32Field continues to swallow parse errors and returns 0, so a typo like "5m" or "abc" is persisted as "no override" and hasConnectionModeChanges then sees no change when the stored value is also 0. The user gets no validation feedback. Please surface a parse error to the caller and reject the save instead of mapping bad input to the sentinel.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/client_ui.go` around lines 656 - 666, The parseUint32Field function currently swallows parse errors and returns 0; change its signature to return (uint32, error) (e.g., parseUint32Field(text string) (uint32, error)), trim and if empty keep returning (0, nil) but on strconv.ParseUint failure return (0, fmt.Errorf("invalid uint32: %w", err)); update callers (such as hasConnectionModeChanges and the save/validate path) to handle the error, reject the save and show validation feedback to the user instead of silently treating bad input as the sentinel 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@client/internal/conn_state_pusher.go`:
- Around line 293-298: The Push calls in flushDelta and flushFull use
context.Background() which can block the loop and prevent Stop() from returning;
change those Push invocations (p.sink.Push(...)) to use a cancelable/timeout
context derived from the pusher's lifecycle context (e.g., p.ctx or a context
created in NewPusher) or wrap each call with context.WithTimeout so that Stop()
can cancel the parent context and pending Push calls unblock; ensure Stop()
cancels that context (or respects the timeout) before waiting on wg so wg.Wait()
can return.
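A sketch of the bounded push (p.ctx and the 5-second budget are assumptions; p.ctx is expected to be cancelled by Stop()):
pushCtx, cancel := context.WithTimeout(p.ctx, 5*time.Second)
defer cancel()
if err := p.sink.Push(pushCtx, msg); err != nil {
	log.Debugf("conn-state push dropped: %v", err) // Stop() cancellation lands here too
}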
In `@client/internal/engine_offline_debounce_test.go`:
- Around line 77-97: The tests read e.peerOfflineDebounce without holding the
engine's debounce mutex, causing races; update the assertions in
TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp and
TestCancelAllRemoteOfflineCloses_ClearsEverything to acquire the engine's mutex
that guards peerOfflineDebounce (use the same lock used by
scheduleRemoteOfflineClose/cancelAllRemoteOfflineCloses, e.g.
e.mu.Lock()/Unlock() or e.mu.RLock()/RUnlock()) around each
len(e.peerOfflineDebounce) check and around the setup verification so the map
reads are synchronized with
scheduleRemoteOfflineClose/cancelRemoteOfflineClose/cancelAllRemoteOfflineCloses.
In `@client/internal/lazyconn/inactivity/manager.go`:
- Around line 83-95: NewManagerWithTwoTimers currently accepts iceTimeout and
relayTimeout without enforcing MinimumInactivityThreshold, so short durations
(e.g., 30s) can be silently violated; update NewManagerWithTwoTimers to validate
and enforce the 1-minute floor by either clamping inputs to
MinimumInactivityThreshold or returning an error/zero for invalid values before
calling newManager: call validateInactivityThreshold (or inline the same check)
for both iceTimeout and relayTimeout, log or return when values are below
MinimumInactivityThreshold, and ensure newManager always receives durations >=
MinimumInactivityThreshold (or zero if caller intended to disable a path).
Ensure you reference the symbols MinimumInactivityThreshold,
validateInactivityThreshold, NewManagerWithTwoTimers, and newManager when making
the change.
In `@client/internal/peer/conn.go`:
- Around line 1213-1219: AttachICEOnRelayActivity drops conn.mu then calls
AttachICE(), but AttachICE() doesn't re-check conn state (opened, ctx.Err(),
current transport), so a concurrent Close() or transport transition can cause
reattachment on a torn-down connection; update AttachICE to re-acquire the
connection mutex (or otherwise validate under lock) at its start and return
without attaching if conn.opened is false, ctx.Err() != nil, or the
transport/state has changed from the expected relay state, and apply the same
revalidation logic to other callers (e.g., the similar relay handler paths
referenced around the other relay-related callers) to prevent attaching ICE
after teardown or transport transition.
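For the AttachICE revalidation above, a compile-checkable sketch; the Conn fields used here (opened, ctx, relayInUse) are stand-ins for the real state, not the actual struct layout:

```go
package peer

import (
	"context"
	"sync"
)

// Conn is a toy stand-in; the real type carries far more state.
type Conn struct {
	mu         sync.Mutex
	opened     bool
	ctx        context.Context
	relayInUse bool
}

// AttachICE re-checks liveness under the lock before doing any work, so
// a Close() racing with AttachICEOnRelayActivity cannot reattach ICE to
// a torn-down connection or one that already left the relay state.
func (conn *Conn) AttachICE() error {
	conn.mu.Lock()
	ok := conn.opened && conn.ctx.Err() == nil && conn.relayInUse
	conn.mu.Unlock()
	if !ok {
		return nil // closed or transport changed; nothing to attach
	}
	// ... actual ICE attachment would follow here ...
	return nil
}
```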
In `@client/internal/peer/status.go`:
- Around line 309-319: The helper notifyPeerListChanged currently claims "Caller
must hold d.mux" which encourages holding the lock across listener re-entry;
change its contract to "may be called with or without d.mux held; prefer after
unlock to avoid listener re-entry deadlock" and update implementations to
snapshot the peer count under the lock and call
d.notifier.peerListChanged(numPeers) after d.mux.Unlock(); specifically, stop
calling notifyPeerListChanged while d.mux is held in UpdatePeerRemoteMeta and
similar sites—replace those calls by capturing num := d.numOfPeers() while
locked and then call d.notifier.peerListChanged(num) after unlocking, or keep
notifyPeerListChanged as a thin wrapper but remove the "must hold d.mux" claim
and document the preferred-after-unlock usage.
In `@client/internal/stdnet/filter_test.go`:
- Around line 51-66: The test currently makes asymmetric assertions over the
table-driven "cases" using allow(c.name) and only fails for one false-positive
and a single hardcoded Windows false-negative; change the loop in the test to
assert symmetrically (if got != c.want { t.Errorf("InterfaceFilter(%q) = %v,
want %v", c.name, got, c.want) }) so every case is verified, and for any rows
intended only for Windows (e.g., the "vEthernet (LAN)"/other Windows-only cases)
wrap or skip those rows with runtime.GOOS == "windows" (or t.Skipf when not
windows) so platform-specific expectations are gated; also update the misleading
comment around the veth/vEthernet behavior to reflect the actual non-Windows
filtering behavior.
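The symmetric assertion loop could look like the following sketch; the fixture and the allow closure are invented to keep the example self-contained:

```go
package stdnet

import (
	"runtime"
	"testing"
)

func TestInterfaceFilter_Symmetric(t *testing.T) {
	cases := []struct {
		name        string
		want        bool
		windowsOnly bool // expectation only meaningful on Windows
	}{
		{name: "eth0", want: true},
		{name: "vEthernet (LAN)", want: false, windowsOnly: true},
	}
	// allow stands in for the filter under test.
	allow := func(name string) bool { return name == "eth0" }
	for _, c := range cases {
		if c.windowsOnly && runtime.GOOS != "windows" {
			t.Logf("skipping Windows-only case %q", c.name)
			continue
		}
		if got := allow(c.name); got != c.want {
			// Symmetric: false positives and false negatives both fail.
			t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want)
		}
	}
}
```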
In `@client/server/server.go`:
- Around line 1518-1538: The comment claims “All zero/empty when the engine has
not received PeerConfig yet” but ConnMgr.ServerPushedP2pRetryMaxSecs() returns a
non‑zero default, so either document the asymmetry or expose a raw accessor and
use it; implement ConnMgr.ServerPushedP2pRetryMaxSecsRaw() that returns the
internal serverPushedP2pRetryMaxSecs without fallback, then in server.go
populate spP2pRetMax from that raw accessor (or alternatively update the block
comment to explicitly state that spP2pRetMax will default to DefaultP2PRetryMax
when no PeerConfig has been received).
In `@client/ui/network.go`:
- Around line 97-103: In OnUnselected, don't resolve the grid via
tabs.Selected() (which can point at the newly selected tab) — use the provided
item parameter to determine which grid to clear so we clear the tab being left;
update the call site (OnUnselected) to pass the unselected TabItem into
getGridAndFilterFromTab (or add a helper that maps a *container.TabItem to the
corresponding grid) and then clear that returned grid (e.g., grid.Objects = nil)
for the unselected tab; reference symbols: OnUnselected, item
(*container.TabItem), getGridAndFilterFromTab, tabs.Selected(), and grids
allGrid/overlappingGrid/exitNodeGrid to locate and fix the logic.
In `@management/server/account/manager.go`:
- Around line 107-110: GetPeerByPubKey currently lacks the requesting user's
identity, so manager code cannot apply per-user visibility/RBAC when returning
peer metadata; update the GetPeerByPubKey signature (or its calling flow) to
accept and thread the requester identity (e.g., userID or a principal) from the
REST handlers through the manager layer (or extract it reliably from ctx) and
enforce the same visibility checks used by GetPeer before resolving/returning
nbpeer.Peer details to prevent leaking FQDNs/peer names.
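A sketch of what threading the requester identity through the lookup might look like; the interface and types here are illustrative, not the real manager API:

```go
package account

import "context"

// Peer is a placeholder for nbpeer.Peer.
type Peer struct{ /* elided */ }

type Manager interface {
	// GetPeerByPubKey now carries the requesting user so the manager can
	// apply the same visibility checks GetPeer performs before returning
	// peer metadata such as FQDNs and peer names.
	GetPeerByPubKey(ctx context.Context, accountID, userID, pubKey string) (*Peer, error)
}
```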
In `@management/server/http/handlers/accounts/accounts_handler_test.go`:
- Around line 408-426: Fix the gofmt failure by correcting the indentation of
the struct literal in TestAccountsHandler_PutSettings_P2pRetryMax: align the
fields LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds to match
the surrounding entries (same indentation as LazyConnectionEnabled, DnsDomain,
etc.) so the colon columns line up; after adjusting those two lines, run gofmt
-w to reformat the entire api.AccountSettings literal.
In `@management/server/http/handlers/peer_connections/handler_test.go`:
- Around line 40-45: The fakeAM.GetPeerByPubKey currently ignores the accountID
parameter which can hide cross-account lookup bugs; update GetPeerByPubKey to
validate the provided accountID the same way fakeAM.GetPeer does: look up the
peer by pubKey (from a.peersByKey or similar store) and ensure its AccountID
matches the incoming accountID, returning a not-found (or error) when the peer
is missing or the account IDs differ; reference the GetPeerByPubKey and GetPeer
methods to mirror the account check and error semantics.
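A self-contained sketch of the mirrored account check; the fake's types and error value are invented for the example:

```go
package peerconnections

import "errors"

var errPeerNotFound = errors.New("peer not found")

type peerRecord struct {
	AccountID string
	Key       string
}

type fakeAM struct {
	peersByKey map[string]*peerRecord
}

// GetPeerByPubKey mirrors GetPeer's semantics: a peer that exists under
// a different account is indistinguishable from a missing peer, so a
// handler passing the wrong account ID fails the test instead of passing.
func (a *fakeAM) GetPeerByPubKey(accountID, pubKey string) (*peerRecord, error) {
	p, ok := a.peersByKey[pubKey]
	if !ok || p.AccountID != accountID {
		return nil, errPeerNotFound
	}
	return p, nil
}
```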
In `@management/server/http/handlers/peer_connections/handler.go`:
- Around line 85-88: The call to h.account.GetPeer currently maps every error to
a 404; change the handling in the GetPeer call sites (the handler block around
peer, err := h.account.GetPeer(...) and the similar block at the other
occurrence) to inspect the returned error/type instead of blanketing it: detect
and return http.StatusNotFound only when the error indicates a true "not found"
or an intentionally-hidden permission denial (e.g., errors.Is(err,
store.ErrNotFound) or checking a management error code/Status() on the error
object), otherwise log the error and return http.StatusInternalServerError (or
the appropriate status for permission/validation errors). Ensure you preserve
context in the log and do not leak internal details to clients.
- Line 92: The handler currently ignores parse errors from strconv.ParseUint and
treats malformed ?since=... as 0; change the ParseUint call in the peer
connections handler to check the error returned (the
strconv.ParseUint(r.URL.Query().Get("since"), 10, 64) call) and, if the query
value is non-empty and parsing fails, respond with HTTP 400 (Bad Request)
indicating an invalid refresh token instead of proceeding with since=0; keep
successful parsing behavior unchanged and only reject when the since parameter
is present but malformed.
In `@management/server/peer/peer.go`:
- Around line 199-204: The equality check currently uses slices.Equal on
SupportedFeatures which is order-sensitive; change it to an order-insensitive
set comparison (e.g., convert p.SupportedFeatures and other.SupportedFeatures
into maps/sets or sort them before comparing) so two feature lists with the same
elements in different orders are treated equal; update the equality logic
surrounding EffectiveConnectionMode/.../SupportedFeatures and keep the existing
p.Flags.isEqual call unchanged.
In `@management/server/store/sql_store.go`:
- Around line 1641-1652: The DB int64→uint32 casts (sRelayTimeoutSeconds,
sP2pTimeoutSeconds, sP2pRetryMaxSeconds and the meta/legacy fields
metaEffectiveRelayTimeoutSecs, metaEffectiveP2PTimeoutSecs,
metaEffectiveP2PRetryMaxSecs and sLegacyLazyFallbackTimeoutSecs) must be guarded
against negative values and values > math.MaxUint32 before assigning into
account.Settings (RelayTimeoutSeconds, P2pTimeoutSeconds, P2pRetryMaxSeconds,
etc.); change each block to check that the source.Int64 is >= 0 and <=
math.MaxUint32 and only then create the uint32 v and assign &v to the
corresponding account.Settings field (otherwise leave the pointer nil or handle
as per existing pattern).
In `@shared/management/http/api/openapi.yml`:
- Around line 365-405: The OpenAPI schema's connection_mode enum is missing the
new value "p2p-dynamic-lazy" and several timeout field descriptions still
reference outdated phase-specific behavior; update the enum under
connection_mode to include "p2p-dynamic-lazy" and revise the descriptions of
p2p_timeout_seconds, p2p_retry_max_seconds, and relay_timeout_seconds to reflect
the final mode model (which modes each timeout applies to, default values, and
NULL semantics) so generated clients and validation match the server behavior.
---
Outside diff comments:
In `@management/server/account.go`:
- Around line 416-442: validateSettingsUpdate currently allows invalid
ConnectionMode strings to be saved while toPeerConfig silently falls back to
LazyConnectionEnabled; update validateSettingsUpdate to parse/validate
newSettings.ConnectionMode (using the same parser used by toPeerConfig) and
return a status.InvalidArgument error when the parse fails (include the invalid
value in the message) so unknown modes are rejected before persisting.
In `@management/server/peer.go`:
- Around line 1311-1323: GetPeer currently bypasses the RegularUsersViewBlocked
guard that GetPeers enforces, allowing restrictable users to access a peer by
ID; update GetPeer to reapply the same check used in GetPeers: if
settings.RegularUsersViewBlocked is enabled and the caller user is a
restrictable/regular user, deny access (return an appropriate error) before
performing the owner/admin or checkIfUserOwnsPeer logic. Locate the gate near
the user lookup (am.Store.GetUserByUserID / GetPeer) and perform the setting +
user-type check there so only admin/service users or callers allowed by the
existing Restrictable policy proceed to the owner/policy reachability branch.
In `@management/server/peer/peer.go`:
- Around line 207-225: The isEmpty method on PeerSystemMeta currently ignores
the new 3.7i metadata fields so UpdateMetaIfNew can mistakenly treat payloads
with only those fields as empty; update PeerSystemMeta.isEmpty to include the
newly added fields (the effective mode, effective timeout, and features fields
added for 3.7i) in the emptiness checks so that if any of those fields are set
the method returns false and UpdateMetaIfNew will process the update.
---
Duplicate comments:
In `@client/android/preferences.go`:
- Around line 328-331: The setters are currently storing pointers to zero-values
(e.g., &"" or &0) which prevents representing "unset" per Config's contract;
update each setter (SetConnectionMode, SetRelayTimeoutSeconds,
SetP2pTimeoutSeconds, SetP2pRetryMaxSeconds) to set the corresponding
p.configInput field to nil when the passed value is the clearing sentinel (empty
string for ConnectionMode, 0 for the timeout/retry setters), otherwise allocate
and store a pointer to the provided value so callers can distinguish "unset"
(nil) from an explicit zero value.
In `@client/internal/conn_mgr.go`:
- Around line 287-341: When timeouts change but mode stays the same you must
restart an already-running lazy manager so it picks up the new relay/p2p/retry
thresholds: after computing newMode/newRelay/newP2P/newP2pRetry and before the
branch that only handles e.lazyConnMgr == nil, add a check for isManaged &&
e.lazyConnMgr != nil && (newRelay != e.relayTimeoutSecs || newP2P !=
e.p2pTimeoutSecs || newP2pRetry != e.p2pRetryMaxSecs) and in that case log the
restart, call e.closeManager(ctx) (and
e.statusRecorder.UpdateLazyConnection(false) if you need parity with other
restarts), then call e.initLazyManager(ctx) and e.startModeSideEffects() so the
running manager picks up the new timeouts; keep the existing assignments to
e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs and the earlier
e.propagateP2pRetryMaxToConns() call.
- Around line 296-326: The code currently updates e.mode and timeout fields
before checking e.rosenpassEnabled, so a pushed lazy/dynamic mode can be
recorded even when Rosenpass prevents running a lazy manager; move the Rosenpass
check ahead of mutating live state: first evaluate modeUsesLazyMgr(newMode) and
if it returns true and e.rosenpassEnabled is true, log the warning and return
(or reject the push) without changing e.mode or any timeout fields; only after
this check mutate e.mode, e.relayTimeoutSecs, e.p2pTimeoutSecs,
e.p2pRetryMaxSecs and call e.propagateP2pRetryMaxToConns(), then proceed with
the existing manager start/stop logic (references: e.mode, modeUsesLazyMgr,
e.rosenpassEnabled, e.relayTimeoutSecs, e.p2pTimeoutSecs, e.p2pRetryMaxSecs,
e.propagateP2pRetryMaxToConns, e.closeManager, e.lazyConnMgr,
e.statusRecorder.UpdateLazyConnection).
In `@client/internal/lazyconn/manager/manager.go`:
- Around line 102-109: The current branch treats iceTO==0 && relayTO==0 as the
legacy single-timer fallback and calls inactivity.NewManager(wgIface,
config.InactivityThreshold), which incorrectly runs when the caller has migrated
and intentionally set ICEInactivityThreshold==0 and RelayInactivityThreshold==0
with InactivityThreshold==nil (meaning “disable those teardowns”); update the
logic in the manager initialization to only call inactivity.NewManager(...) when
config.InactivityThreshold != nil (preserving the legacy behavior), otherwise
call inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) (so explicit
0/0 from resolvedTimeouts() is honored as “disable”), and keep the existing use
of wgIface, config.resolvedTimeouts(), inactivity.NewManager and
inactivity.NewManagerWithTwoTimers identifiers to locate the change.
In `@client/internal/peer/worker_ice.go`:
- Around line 218-231: IsConnected reads WorkerICE.lastKnownState under
w.muxAgent but onConnectionStateChange writes that field without locking,
causing a data race; update the onConnectionStateChange callback to acquire
w.muxAgent before touching lastKnownState (and agent) so both reads/writes are
synchronized with IsConnected; specifically, wrap assignments at the locations
that set lastKnownState (and any writes to agent) in a
w.muxAgent.Lock()/Unlock() pair inside the onConnectionStateChange handler to
eliminate the race.
In `@client/ui/client_ui.go`:
- Around line 889-907: The timeout gating currently switches on the raw dropdown
label s.sConnectionMode.Selected; change updateTimeoutEntriesEnabled to
determine the effective mode first by calling s.selectedConnectionMode() and, if
that returns an empty string, falling back to s.serverPushedMode, then switch on
that effective mode (instead of s.sConnectionMode.Selected) to enable/disable
s.iRelayTimeout, s.iP2pTimeout and s.iP2pRetryMax so following-the-server modes
like p2p-lazy/p2p-dynamic are honored.
- Around line 656-666: The parseUint32Field function currently swallows parse
errors and returns 0; change its signature to return (uint32, error) (e.g.,
parseUint32Field(text string) (uint32, error)), trim and if empty keep returning
(0, nil) but on strconv.ParseUint failure return (0, fmt.Errorf("invalid uint32:
%w", err)); update callers (such as hasConnectionModeChanges and the
save/validate path) to handle the error, reject the save and show validation
feedback to the user instead of silently treating bad input as the sentinel 0.
In `@management/internals/shared/grpc/conversion.go`:
- Around line 315-327: The code sets cfg.ServerLivenessKnown = true
unconditionally even when rPeer.Status is nil; change this so
ServerLivenessKnown is set to true only when rPeer.Status != nil (i.e. inside
the existing nil-guard where LiveOnline is populated) or explicitly set to false
when rPeer.Status is nil, ensuring the conversion logic (the block handling
rPeer, cfg and rPeer.Status) only marks server liveness authoritative when
status exists.
In `@management/internals/shared/grpc/server.go`:
- Around line 487-489: The select loop spins on a closed snapshotCh because the
receive case stays selectable; in the case handling "case nonce, ok :=
<-snapshotCh:" when ok is false set snapshotCh to nil (or otherwise remove it
from the select) so the closed channel is no longer selected, then continue;
update the select loop in server.go to assign snapshotCh = nil on close
(referencing snapshotCh and the register flow from SnapshotRouter.Register()) so
fast reconnects don't busy-loop.
In `@management/server/peer.go`:
- Line 1355: The current return uses status.Internal which incorrectly produces
a 500 for an access miss; change the status to an auth/not-found code (e.g.
status.NotFound or the equivalent not-found constant used in your status
package) so callers receive a 404-style response instead of Internal. Update the
return that mentions userID, peer.ID and accountID to use the not-found status
(or PermissionDenied if you prefer explicit auth failure) while keeping the same
error message.
---
Nitpick comments:
In `@client/cmd/service_installer.go`:
- Around line 137-139: The current calls to applyConnectionModeFlagsToProfile
swallow persistence failures and only emit warnings; update the callers (notably
the reconfigure flow and the install flow where
applyConnectionModeFlagsToProfile is invoked) to surface failures: for the
reconfigure command return the error from applyConnectionModeFlagsToProfile so
the CLI exits non‑zero and the user is informed, and for the install command
either return the error up the call chain or convert the PrintErrf warning into
a returned error depending on desired UX; locate the calls to
applyConnectionModeFlagsToProfile and change their error handling to return the
error (or propagate it) instead of only logging a warning.
- Around line 194-205: Add defensive nil checks before reading .Changed on
per-flag lookups: for each flag access currently using
cmd.Flag(connectionModeFlag).Changed, cmd.Flag(relayTimeoutFlag).Changed,
cmd.Flag(p2pTimeoutFlag).Changed, and cmd.Flag(p2pRetryMaxFlag).Changed, first
assign the result to a local (e.g., f := cmd.Flag(connectionModeFlag)) and only
set ic.ConnectionMode, ic.RelayTimeoutSeconds, ic.P2pTimeoutSeconds, or
ic.P2pRetryMaxSeconds if f != nil && f.Changed; this mirrors the existing
guarded pattern used earlier and prevents nil-pointer panics when a flag isn't
registered for a command.
In `@client/internal/engine_offline_debounce_test.go`:
- Around line 23-31: The helper engineForDebounceTest should register cleanup
with testing.T to reliably stop any timers instead of relying on manual
test-tail cleanup: change engineForDebounceTest to accept t *testing.T,
construct the Engine with peerOfflineDebounce map[string]*time.Timer, then call
t.Cleanup with a closure that ranges over e.peerOfflineDebounce, stops each
*time.Timer and drains its channel if needed (and clears the map). Update all
test callers to pass t. Use the function name engineForDebounceTest and the
field peerOfflineDebounce to locate the code to modify.
In `@client/internal/engine_pusher_adapters.go`:
- Around line 47-50: The endpoint string logic is asymmetric: it initializes
from st.LocalIceCandidateEndpoint and only sets a combined "local <-> remote"
when both exist, causing remote-only values to be lost; update the logic around
endpoint (the variables st.LocalIceCandidateEndpoint and
st.RemoteIceCandidateEndpoint) to handle all three cases—both non-empty =>
"local <-> remote", only local => local, only remote => remote—so the UI
surfaces remote-only windows during partial state transitions.
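The three-case logic, sketched with illustrative names standing in for the two candidate-endpoint fields:

```go
package internal

// endpointLabel covers all three combinations so a remote-only candidate
// is no longer dropped during partial state transitions.
func endpointLabel(local, remote string) string {
	switch {
	case local != "" && remote != "":
		return local + " <-> " + remote
	case local != "":
		return local
	default:
		return remote // remote-only, or "" when neither side is known
	}
}
```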
In `@client/internal/engine.go`:
- Around line 1376-1402: The repeated system.GetInfoWithChecks(...) →
SetFlags(...) sequence leads to duplication; extract it into a private Engine
helper (e.g. func (e *Engine) systemInfoWithFlags() *system.Info) that performs
GetInfoWithChecks with the same fallback to system.GetInfo, logs on error, calls
info.SetFlags(...) with the exact Engine.config fields, and returns
*system.Info; then replace each inline block (the one inside
e.syncMetaDebouncer.Trigger and the other occurrences) to call
e.systemInfoWithFlags() and use its return value when calling
e.mgmClient.SyncMeta or other callers.
In `@client/internal/lazyconn/manager/manager.go`:
- Around line 610-625: The manager is holding managedPeersMu while calling
external/conn methods (peerStore.PeerConn -> conn.ResetIceBackoff,
conn.AttachICE, conn.NotifyGuardActivity), which is fragile; change the code to
grab the conn (and peerCfg.PublicKey if needed) while holding managedPeersMu via
peerStore.PeerConn, store them in local variables, then release/unlock
managedPeersMu before invoking ResetIceBackoff(), AttachICE(), and
NotifyGuardActivity() so external calls run without the manager lock held; keep
the same warning log path for AttachICE errors and preserve behavior but move
interaction with conn outside the lock.
In `@client/internal/peer/conn_handover_order_test.go`:
- Around line 138-175: The current extractFunctionBody implementation (function
extractFunctionBody) uses a manual brace counter that miscounts braces inside
string/rune literals, raw string/backticks, or comments; replace it by parsing
the source with go/parser and go/ast: parse src into an *ast.File, walk
declarations to find the ast.FuncDecl whose Name (and Receiver if looking for
"func (conn *Conn) Name") matches the requested name, then use the
funcDecl.Pos() and funcDecl.End() token.Pos values (via the token.File set from
the parser) to slice the original src and return the exact function body text;
this removes fragile rune-based counting and correctly handles strings/comments
and receiver vs plain func forms.
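A self-contained sketch of the go/parser approach for plain (non-method) functions; matching a `(conn *Conn)` receiver additionally means comparing the FuncDecl's receiver type:

```go
package astutil

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// extractFunctionBody returns the exact source text of the named function,
// using parser positions instead of a hand-rolled brace counter, so braces
// inside strings, runes, and comments can no longer skew the result.
func extractFunctionBody(src, name string) (string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, parser.ParseComments)
	if err != nil {
		return "", err
	}
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Name.Name != name {
			continue
		}
		start := fset.Position(fn.Pos()).Offset
		end := fset.Position(fn.End()).Offset
		return src[start:end], nil
	}
	return "", fmt.Errorf("function %q not found", name)
}
```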
In `@client/internal/peer/conn_lazy_keepwgpeer_test.go`:
- Around line 41-75: The test TestConn_Close_KeepWgPeerParameterPresent is
fragile because it asserts the exact Close signature string; replace it with a
behavioral test that constructs a Conn with a fake endpointUpdater (spy) and
calls conn.Close twice: once with keepWgPeer=false and once with
keepWgPeer=true, asserting that endpointUpdater.RemoveWgPeer() is called for the
false case and NOT called for the true case; retain the existing checks that the
guard appears before the call only if you still need a text check, but prefer
the fake endpointUpdater approach to verify the invariant without depending on
source formatting or parameter names.
- Around line 99-132: The current scanner in conn_lazy_keepwgpeer_test.go (the
loop that inspects lines/trim and looks for ".Close(") misses multi-line Close
calls and any receiver names other than "conn"/"peerConn"; replace the fragile
string-based scan with an AST-based check: parse the file with go/parser, walk
*ast.CallExpr nodes, identify selector expressions whose Sel.Name == "Close" and
whose X resolves to a *peer.Conn receiver (type-checking or simple
selector.Type/Import-qualified check), then examine the CallExpr.Args length and
fail the test if the second argument (keepWgPeer) is missing; ensure this covers
multi-line calls and arbitrary receiver identifiers instead of relying on string
ops like strings.Index or strings.Contains.
In `@client/internal/peer/guard/guard_test.go`:
- Around line 18-23: The helper newTestGuard currently returns (*Guard,
*SRWatcher) but callers ignore the SRWatcher; change newTestGuard to return only
*Guard by removing the *SRWatcher from the signature and return value (you can
still construct an SRWatcher inside the function for initialization but do not
return it), update the function body that creates sr := NewSRWatcher(...) and
the NewGuard(...) call accordingly, and update all call sites that do g, _ :=
newTestGuard(...) to use g := newTestGuard(...). Ensure references to SRWatcher
in tests are not relied upon before removing the return.
- Around line 65-96: TestGuard_PeerActivityResetsHourlyMode only exercises a
standalone iceRetryState and manually calls iceState.reset() after receiving on
g.peerActivity, so it doesn't exercise the real reconnectLoopWithRetry logic;
update the test to start the guard's real loop (call g.Start(ctx,
connStatusFunc) in a goroutine), use a connStatusFunc that forces hourly mode
(e.g., returning ConnStatusPartiallyConnected until budget exhausts), call
g.NotifyPeerActivity(), and then assert the loop reacted (hourly ticker cleared
/ short cadence observed) by observing behavior from the actual reconnect loop
instead of directly invoking iceRetryState.reset(); reference
TestGuard_PeerActivityResetsHourlyMode, g.Start, reconnectLoopWithRetry,
NotifyPeerActivity, iceRetryState.enterHourlyMode and reset, and g.peerActivity
to locate changes.
In `@client/internal/peer/guard/guard.go`:
- Around line 67-72: SetOnNetworkChange currently writes g.onNetworkChange
unsafely while reconnectLoopWithRetry reads it concurrently; change the API to
accept the callback at construction to enforce "must set before Start": add a cb
func() parameter to NewGuard and assign g.onNetworkChange inside NewGuard
(remove or deprecate SetOnNetworkChange), update all NewGuard call sites to pass
the callback, and ensure reconnectLoopWithRetry still reads g.onNetworkChange as
before; this makes the callback immutable after start and avoids race/torn-reads
without runtime synchronization.
- Around line 153-183: Extract the repeated ticker reset + iceState.reset() into
a helper method on the guard (e.g., func (g *Guard) resetTicker(ticker
**time.Ticker, tickerChannel *<-chan time.Time, ctx context.Context, iceState
*iceStateType)) and replace the duplicated blocks in the select (the cases for
g.relayedConnDisconnected, g.iCEConnDisconnected, g.peerActivity,
srReconnectedChan) with a single call to g.resetTicker(&ticker, &tickerChannel,
ctx, iceState); ensure the helper stops the old ticker, creates a new one via
g.newReconnectTicker(ctx), updates tickerChannel to the new ticker.C, and calls
iceState.reset(); keep the existing srReconnectedChan behavior of calling
g.onNetworkChange() after invoking g.resetTicker().
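One possible shape for the helper; returning the fresh ticker avoids the double-pointer signature and keeps each select case to a single assignment. Everything beyond the method names mentioned above is assumed:

```go
package guard

import (
	"context"
	"time"
)

type iceRetryState struct{ /* elided */ }

func (s *iceRetryState) reset() { /* elided */ }

type Guard struct{}

// newReconnectTicker stands in for the real cadence logic.
func (g *Guard) newReconnectTicker(ctx context.Context) *time.Ticker {
	return time.NewTicker(time.Second)
}

// resetTicker collapses the four duplicated select cases: stop the old
// ticker, start a fresh one, and clear the ICE retry state.
func (g *Guard) resetTicker(ctx context.Context, old *time.Ticker, ice *iceRetryState) *time.Ticker {
	old.Stop()
	ice.reset()
	return g.newReconnectTicker(ctx)
}
```

Each select case then becomes `ticker = g.resetTicker(ctx, ticker, iceState); tickerChannel = ticker.C`, with the srReconnectedChan case additionally calling `g.onNetworkChange()` afterwards.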
In `@client/internal/peer/ice_backoff.go`:
- Around line 178-197: The AllowActivityOverride() method currently returns true
and relies on the caller to call Reset(), which can be misused; change
AllowActivityOverride (or create TryActivityOverride) to perform the rate-limit
check and the reset atomically while holding s.mu: i.e., inside
iceBackoffState.AllowActivityOverride() check s.suspended and
time.Since(s.lastResetAt) < activityOverrideMinInterval, and if allowed
immediately update s.lastResetAt = time.Now() (or call s.Reset() while still
holding s.mu) before returning true so the 5-minute gate cannot be bypassed by a
missed external Reset; keep use of s.mu, s.suspended, s.lastResetAt and
activityOverrideMinInterval consistent and update callers/docs to stop calling
Reset() after AllowActivityOverride().
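A sketch of the atomic variant, assuming the override only applies while backoff is suspended; the trimmed-down type keeps just the fields named in the comment:

```go
package peer

import (
	"sync"
	"time"
)

const activityOverrideMinInterval = 5 * time.Minute

type iceBackoffState struct {
	mu          sync.Mutex
	suspended   bool
	lastResetAt time.Time
}

// TryActivityOverride combines the rate-limit check and the reset in one
// critical section, so the 5-minute gate holds even if a caller forgets
// (or races) the follow-up Reset() that the old API relied on.
func (s *iceBackoffState) TryActivityOverride() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.suspended || time.Since(s.lastResetAt) < activityOverrideMinInterval {
		return false
	}
	s.suspended = false // reset inline while still holding s.mu
	s.lastResetAt = time.Now()
	return true
}
```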
In `@client/internal/peer/status_remote_meta_notify_test.go`:
- Around line 55-65: The comment above the baseline UpdatePeerRemoteMeta is
contradictory: it says "no notification expected" while the test actually waits
for notifications via waitForCount(t, listener, 2, ...) and then resets
listener.peersChangedCount; update the comment to accurately state that this
initial UpdatePeerRemoteMeta does trigger the first flip/notification (because
default RemoteLiveOnline differs) or otherwise change the assertion to match the
stated intent; locate the block around UpdatePeerRemoteMeta("peerA",
RemoteMeta{LiveOnline: true, ServerLivenessKnown: true}), waitForCount, and
listener.peersChangedCount.Store and make the comment clearly reflect that the
first update causes a notification and we then reset the counter for subsequent
checks.
In `@client/internal/stdnet/filter.go`:
- Around line 47-93: InterfaceFilter currently has high cognitive complexity due
to inline Windows substring matching and the wgctrl probe; extract the
Windows-specific loop into a small helper like isWindowsKnownBad(lowerIFace
string, windowsKnownBadSubstrings []string) bool (called from InterfaceFilter)
and optionally move the WireGuard probe into a helper function (e.g.,
isWireGuardDevice(iFace string) bool) that handles wgctrl.New(), wg.Device and
closing, then call those helpers from InterfaceFilter; keep existing behavior
(skip "veth" special-case on Windows and preserve the iOS skip) and reuse
symbols windowsKnownBadSubstrings, InterfaceFilter, wgctrl.New(), and wg.Device
when relocating logic.
In `@client/ui/peers_tab.go`:
- Around line 280-356: buildPeerDetailText is over the SonarCloud cognitive
complexity limit due to many independent display sections; refactor by
extracting logical blocks into helper functions—e.g., create
appendConnectionType(sb *strings.Builder, p *proto.PeerState),
appendHandshakeAndEndpoints(sb, p), appendGroupsAndSeen(sb, p),
appendFullSection(sb, p) and appendICEBackoff(sb, p) —then have
buildPeerDetailText simply call these helpers in order and return sb.String();
ensure helper names match exactly (appendConnectionType,
appendHandshakeAndEndpoints, appendGroupsAndSeen, appendFullSection,
appendICEBackoff) so tests can target them and keep existing behavior (including
the ICE backoff wall-clock checks and conditional full-block prints).
- Around line 365-375: The function peerLatencyStr currently returns "-" for
both lat == nil and for durations that round down to zero, losing the
distinction between "unknown" and measured-zero/sub-microsecond latencies;
change it so lat == nil still returns "-" but otherwise examine the raw duration
d := lat.AsDuration(): if d == 0 return "0ns" (explicit measured zero), else if
d < time.Microsecond return "<1µs" (measured nonzero but below display
precision), and otherwise return d.Round(time.Microsecond).String(); update the
logic in peerLatencyStr accordingly so the nil check remains tied to unmeasured
and rounding is used only for display.
- Around line 319-320: The code casts proto int64 fields p.GetBytesRx() and
p.GetBytesTx() directly to uint64 when calling humanBytes, which is unsafe if
negatives appear; update the call site (around the fmt.Fprintf using humanBytes)
to normalize values first by clamping negatives to zero (e.g., compute rx :=
p.GetBytesRx(); if rx < 0 { rx = 0 } and same for tx) then cast the non-negative
int64 to uint64 for humanBytes, and/or add a short comment near the fmt.Fprintf
noting that proto fields are int64 but negatives are clamped because WireGuard
counters are expected non-negative.
In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md`:
- Line 13: The "### Task 1: Replace sharedsock with standard UDP socket in
TunKernelDevice.Up()" heading skips from h1 to h3; change it to an h2 to follow
markdownlint MD001 and proper heading increment (i.e., replace the leading "###"
with "##") so the heading level increments from the top-level h1 to h2; ensure
the rest of the document's heading hierarchy remains consistent after modifying
the "Task 1" heading.
- Around line 71-74: Update Step 3 so the description matches the command:
either change the step title/text "Build and verify compilation" to "Build and
verify compilation for arm64" to reflect the GOARCH=arm64 GOOS=linux go build
./client/ command, or replace the command with a local-arch build (go build
./client/) to keep the generic description; edit the Step 3 block that contains
the shell command to ensure consistency with Step 4's explicit cross-compile for
arm64.
- Line 48: The log line uses udpConn.LocalAddr().(*net.UDPAddr).Port which can
panic if LocalAddr() is nil or not a *net.UDPAddr; update the code around that
log (the call to udpConn.LocalAddr(), the log.Infof invocation) to first capture
addr := udpConn.LocalAddr(), check addr != nil, then perform a safe type
assertion udpAddr, ok := addr.(*net.UDPAddr) and only use udpAddr.Port when ok;
otherwise log a safe fallback (e.g., "unknown" or 0) so the code never panics
when LocalAddr() is nil or of an unexpected type.
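A panic-free sketch of that logging step; the message text and the zero fallback are placeholders:

```go
package device

import (
	"log"
	"net"
)

// logLocalPort guards both a nil LocalAddr() and an unexpected concrete
// type before dereferencing .Port, so the log line can never panic.
func logLocalPort(udpConn *net.UDPConn) {
	port := 0 // safe fallback when the address is unknown
	if addr := udpConn.LocalAddr(); addr != nil {
		if udpAddr, ok := addr.(*net.UDPAddr); ok {
			port = udpAddr.Port
		}
	}
	log.Printf("ICE UDP socket bound on port %d", port)
}
```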
In `@management/internals/controllers/network_map/controller/controller.go`:
- Around line 328-333: UpdateAccountPeer is calling
grpc.BuildGroupNamesByPeerID(account.Groups) which scans all account groups for
every peer; replace this with a targeted helper that only collects group-name
entries for the current peer (using peer.ID) or builds the needed slice directly
from peerGroups computed by account.GetPeerGroups(peerId). Implement a new
function (e.g., BuildGroupNamesForPeer or BuildGroupNamesFromPeerGroups) that
iterates either account.Groups once but filters by peerId or iterates peerGroups
to assemble the same shape used by grpc.ToSyncResponse, and use that helper in
place of grpc.BuildGroupNamesByPeerID so UpdateAccountPeer work is proportional
to the single peer rather than account size.
In `@management/internals/server/boot.go`:
- Around line 113-117: The hard-coded TTL value 1 * time.Hour in
BaseServer.PeerConnStore should be promoted to a named constant to make the
operational tuning knob discoverable and reusable; create a constant (e.g.,
peer_connections.DefaultStoreTTL) and replace the literal in the call to
peer_connections.NewMemoryStore inside BaseServer.PeerConnStore so the store
uses that constant, and update any other places that might reference the same
TTL to use the new named constant as well.
In `@management/server/http/handlers/accounts/accounts_handler_test.go`:
- Around line 353-429: The test TestAccountsHandler_PutSettings_P2pRetryMax
should capture the argument passed into the mock UpdateAccountSettingsFunc and
assert that the handler translated the API field into the domain
types.Settings.P2pRetryMaxSeconds (not just rely on the returned value). Modify
the mock UpdateAccountSettingsFunc to store its incoming parameter (e.g.,
newSettingsArg := &types.Settings{}) in a closure, return as before, then after
the request assert that newSettingsArg.P2pRetryMaxSeconds is non-nil and equals
600 (matching the request) so the API→types mapping for P2pRetryMaxSeconds is
validated. Ensure assertions reference UpdateAccountSettingsFunc,
TestAccountsHandler_PutSettings_P2pRetryMax, and
types.Settings.P2pRetryMaxSeconds.
In `@management/server/http/handlers/accounts/validate_uint32_timeout_test.go`:
- Around line 79-90: The test TestValidateUint32Timeout_PlainError currently
doesn't detect wrapped errors because it initializes unwrapped from err and then
only asserts unwrapped != nil; update the test for validateUint32Timeout so it
first asserts err is non-nil and then asserts errors.Unwrap(err) == nil
(ensuring the error is not a %w-wrapped error), or alternatively assert the
concrete type is not a gRPC status error (e.g. via status.FromError) if you want
to forbid status.Error wrappers; use the validateUint32Timeout call and
errors.Unwrap/ status.FromError to make the assertion explicit.
In `@management/server/peer_connections/store_test.go`:
- Around line 221-240: Add a companion test that mirrors
TestMemoryStore_MixedSessionAcceptsLegacyDelta but with the cached state having
SessionId==0 and the incoming delta tagged (non-zero) to ensure the seq-only
acceptance is symmetric; using newStoreWithClock create a store, s.Put a cached
map with SessionId 0 (Seq 5, FullSnapshot true, Entry LatencyMs 10), then s.Put
an incoming map with SessionId non-zero (e.g., sessionB, Seq 6, FullSnapshot
false, Entry LatencyMs 22), and assert via s.Get that the entry latency was
updated to 22; name the test clearly (e.g.,
TestMemoryStore_MixedSessionAcceptsTaggedDelta) and reuse the same helpers
(newStoreWithClock, s.Put, s.Get) as in the existing test.
- Around line 38-52: The test TestMemoryStore_DeepCopyOnReturn only mutates
RemotePubkey so a shallow copy that still shares the nested LastHandshake can
pass; update the test to also mutate the nested timestamp on got1 (e.g. change
got1.GetEntries()[0].LastHandshake) after s.Get and then re-fetch got2 and
assert that got2.GetEntries()[0].GetLastHandshake() still equals the original
timestamp from the fixture; reference the TestMemoryStore_DeepCopyOnReturn,
s.Put, s.Get, got1, got2 and the LastHandshake/timestamppb fields when making
this change to ensure the store performs a deep copy of nested protobuf fields.
In `@management/server/peer_test.go`:
- Line 1214: The test currently calls grpc.ToSyncResponse with a nil
groupNamesByPeerID and thus doesn't exercise appendRemotePeerConfig's branch
that populates cfg.Groups from c.GroupNamesByPeerID[rPeer.ID]; update the test
to call grpc.ToSyncResponse twice (or add a second case) passing a non-nil map
for the trailing groupNamesByPeerID argument that maps the test peer's ID to a
slice of group names, then assert that response.RemotePeers[0].Groups equals the
provided slice to verify the conversion consumed the map (refer to
grpc.ToSyncResponse, appendRemotePeerConfig, and
response.RemotePeers[0].Groups).
In `@management/server/store/sql_store.go`:
- Around line 1656-1663: Replace the hardcoded defaults with the shared
constants: set account.Settings.LegacyLazyFallbackEnabled =
types.DefaultLegacyLazyFallbackEnabled instead of true, and set
account.Settings.LegacyLazyFallbackTimeoutSeconds =
types.DefaultLegacyLazyFallbackTimeoutSeconds instead of 3600 when
sLegacyLazyFallbackTimeoutSecs is not valid; ensure the types package is
imported/available in sql_store.go and adjust any conversions to match the
uint32 type of LegacyLazyFallbackTimeoutSeconds.
In `@shared/management/proto/management.proto`:
- Around line 794-807: Update the comment for PeerConnectionMap.seq to
explicitly state its semantics: document that seq is a monotonic counter tied to
the current session_id (i.e., seq increments for each delta within the same
session_id), that a new full_snapshot (full_snapshot=true) or a change in
session_id indicates a reset/new baseline for seq, and describe fallback
semantics for legacy clients (session_id==0) where receivers should treat seq as
the only monotonic ordering key for that process and reset/interpret seq on
full_snapshot accordingly; reference the fields PeerConnectionMap.seq,
PeerConnectionMap.session_id and PeerConnectionMap.full_snapshot in the comment
so future implementers know how to compare and reset sequence numbers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
if err := p.sink.Push(context.Background(), &mgmProto.PeerConnectionMap{
    Seq:          seq,
    FullSnapshot: false,
    Entries:      entries,
    SessionId:    p.sessionID,
}); err != nil {
context.Background() in Push calls can deadlock Stop().
Both flushDelta and flushFull invoke p.sink.Push(context.Background(), ...) from the loop goroutine. If the management gRPC client hangs (slow server, broken connection, head-of-line blocking on retry), the loop will block inside Push, will not service <-p.stop, and Stop() → wg.Wait() will never return — taking daemon shutdown with it.
Derive a context from a per-pusher cancel that Stop() cancels first, or apply a hard context.WithTimeout on each push. Either approach also bounds individual push latency, which is desirable for the heartbeat path.
Suggested shape
type connStatePusher struct {
...
stop chan struct{}
+ ctx context.Context
+ cancel context.CancelFunc
wg sync.WaitGroup
}
func newConnStatePusherForTest(...) *connStatePusher {
p := &connStatePusher{...}
+ p.ctx, p.cancel = context.WithCancel(context.Background())
...
}
func (p *connStatePusher) Stop() {
+ p.cancel()
close(p.stop)
p.wg.Wait()
}
- if err := p.sink.Push(context.Background(), &mgmProto.PeerConnectionMap{...}); err != nil {
+ pushCtx, cancel := context.WithTimeout(p.ctx, 30*time.Second)
+ err := p.sink.Push(pushCtx, &mgmProto.PeerConnectionMap{...})
+ cancel()
+ if err != nil {
Also applies to: 363-369
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/conn_state_pusher.go` around lines 293 - 298, The Push calls
in flushDelta and flushFull use context.Background() which can block the loop
and prevent Stop() from returning; change those Push invocations
(p.sink.Push(...)) to use a cancelable/timeout context derived from the pusher's
lifecycle context (e.g., p.ctx or a context created in NewPusher) or wrap each
call with context.WithTimeout so that Stop() can cancel the parent context and
pending Push calls unblock; ensure Stop() cancels that context (or respects the
timeout) before waiting on wg so wg.Wait() can return.
func TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp(t *testing.T) {
    e := engineForDebounceTest()
    // must not panic
    e.cancelRemoteOfflineClose("never-scheduled")
    if len(e.peerOfflineDebounce) != 0 {
        t.Error("map must remain empty")
    }
}

func TestCancelAllRemoteOfflineCloses_ClearsEverything(t *testing.T) {
    e := engineForDebounceTest()
    for _, k := range []string{"a", "b", "c", "d"} {
        e.scheduleRemoteOfflineClose(k)
    }
    if len(e.peerOfflineDebounce) != 4 {
        t.Fatalf("setup: expected 4 timers, got %d", len(e.peerOfflineDebounce))
    }
    e.cancelAllRemoteOfflineCloses()
    if len(e.peerOfflineDebounce) != 0 {
        t.Errorf("cancel-all must clear the map, got %d entries", len(e.peerOfflineDebounce))
    }
}
Guard debounce-map length assertions with the mutex
len(e.peerOfflineDebounce) is read without locking on Line 81, Line 91, and Line 95. Since timers can mutate this map asynchronously, these reads can race.
Suggested patch
func TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp(t *testing.T) {
e := engineForDebounceTest()
// must not panic
e.cancelRemoteOfflineClose("never-scheduled")
- if len(e.peerOfflineDebounce) != 0 {
+ e.peerOfflineDebounceMu.Lock()
+ n := len(e.peerOfflineDebounce)
+ e.peerOfflineDebounceMu.Unlock()
+ if n != 0 {
t.Error("map must remain empty")
}
}
@@
for _, k := range []string{"a", "b", "c", "d"} {
e.scheduleRemoteOfflineClose(k)
}
- if len(e.peerOfflineDebounce) != 4 {
- t.Fatalf("setup: expected 4 timers, got %d", len(e.peerOfflineDebounce))
+ e.peerOfflineDebounceMu.Lock()
+ n := len(e.peerOfflineDebounce)
+ e.peerOfflineDebounceMu.Unlock()
+ if n != 4 {
+ t.Fatalf("setup: expected 4 timers, got %d", n)
}
e.cancelAllRemoteOfflineCloses()
- if len(e.peerOfflineDebounce) != 0 {
- t.Errorf("cancel-all must clear the map, got %d entries", len(e.peerOfflineDebounce))
+ e.peerOfflineDebounceMu.Lock()
+ n = len(e.peerOfflineDebounce)
+ e.peerOfflineDebounceMu.Unlock()
+ if n != 0 {
+ t.Errorf("cancel-all must clear the map, got %d entries", n)
}
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp(t *testing.T) {
    e := engineForDebounceTest()
    // must not panic
    e.cancelRemoteOfflineClose("never-scheduled")
    e.peerOfflineDebounceMu.Lock()
    n := len(e.peerOfflineDebounce)
    e.peerOfflineDebounceMu.Unlock()
    if n != 0 {
        t.Error("map must remain empty")
    }
}

func TestCancelAllRemoteOfflineCloses_ClearsEverything(t *testing.T) {
    e := engineForDebounceTest()
    for _, k := range []string{"a", "b", "c", "d"} {
        e.scheduleRemoteOfflineClose(k)
    }
    e.peerOfflineDebounceMu.Lock()
    n := len(e.peerOfflineDebounce)
    e.peerOfflineDebounceMu.Unlock()
    if n != 4 {
        t.Fatalf("setup: expected 4 timers, got %d", n)
    }
    e.cancelAllRemoteOfflineCloses()
    e.peerOfflineDebounceMu.Lock()
    n = len(e.peerOfflineDebounce)
    e.peerOfflineDebounceMu.Unlock()
    if n != 0 {
        t.Errorf("cancel-all must clear the map, got %d entries", n)
    }
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/engine_offline_debounce_test.go` around lines 77 - 97, The
tests read e.peerOfflineDebounce without holding the engine's debounce mutex,
causing races; update the assertions in
TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp and
TestCancelAllRemoteOfflineCloses_ClearsEverything to acquire the engine's mutex
that guards peerOfflineDebounce (use the same lock used by
scheduleRemoteOfflineClose/cancelAllRemoteOfflineCloses, e.g.
e.mu.Lock()/Unlock() or e.mu.RLock()/RUnlock()) around each
len(e.peerOfflineDebounce) check and around the setup verification so the map
reads are synchronized with
scheduleRemoteOfflineClose/cancelRemoteOfflineClose/cancelAllRemoteOfflineCloses.
// NewManagerWithTwoTimers is the Phase-2 constructor. Pass 0 for either
// timeout to disable that teardown path. Both 0 leaves the manager
// running but inert (no channel ever fires) -- used by p2p / relay-forced
// modes that don't tear down workers.
func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
    if iceTimeout > 0 {
        log.Infof("ICE inactivity timeout: %v", iceTimeout)
    }
    if relayTimeout > 0 {
        log.Infof("relay inactivity timeout: %v", relayTimeout)
    }
    return newManager(iface, iceTimeout, relayTimeout)
}
Enforce the 1-minute floor in NewManagerWithTwoTimers.
NewManagerWithTwoTimers bypasses validateInactivityThreshold, so callers can pass values below MinimumInactivityThreshold even though the sweep runs every minute. A 30-second timeout will therefore fire late and silently violate the configured value. Clamp or reject both timers here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/lazyconn/inactivity/manager.go` around lines 83 - 95,
NewManagerWithTwoTimers currently accepts iceTimeout and relayTimeout without
enforcing MinimumInactivityThreshold, so short durations (e.g., 30s) can be
silently violated; update NewManagerWithTwoTimers to validate and enforce the
1-minute floor by either clamping inputs to MinimumInactivityThreshold or
returning an error/zero for invalid values before calling newManager: call
validateInactivityThreshold (or inline the same check) for both iceTimeout and
relayTimeout, log or return when values are below MinimumInactivityThreshold,
and ensure newManager always receives durations >= MinimumInactivityThreshold
(or zero if caller intended to disable a path). Ensure you reference the symbols
MinimumInactivityThreshold, validateInactivityThreshold,
NewManagerWithTwoTimers, and newManager when making the change.
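A clamping sketch that preserves the documented "0 disables this path" contract while enforcing the floor for nonzero values; wiring it into the constructor is the assumed integration point:

```go
package inactivity

import (
	"log"
	"time"
)

const MinimumInactivityThreshold = time.Minute

// clampTimeout leaves 0 alone (path disabled) and raises any nonzero
// value below the floor up to MinimumInactivityThreshold, logging the
// adjustment so the silently-late-firing case becomes visible.
func clampTimeout(name string, d time.Duration) time.Duration {
	if d > 0 && d < MinimumInactivityThreshold {
		log.Printf("%s inactivity timeout %v below minimum, clamping to %v",
			name, d, MinimumInactivityThreshold)
		return MinimumInactivityThreshold
	}
	return d
}
```

NewManagerWithTwoTimers would then pass `clampTimeout("ICE", iceTimeout)` and `clampTimeout("relay", relayTimeout)` into newManager.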
// notifyPeerListChanged fires a peer-list-changed notification using the
// current peer count. Phase 3.7i: thin wrapper around the notifier so
// callers in UpdatePeerRemoteMeta and similar paths don't need to know
// about d.numOfPeers() and d.notifier internals.
//
// Caller must hold d.mux (this method reads d.peers/d.offlinePeers via
// numOfPeers and assumes consistent state).
func (d *Status) notifyPeerListChanged() {
    d.notifier.peerListChanged(d.numOfPeers())
}
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Misleading "Caller must hold d.mux" docstring on notifyPeerListChanged.
The implementation just calls d.notifier.peerListChanged(d.numOfPeers()). The notifier dispatch itself is the part that re-enters user code, and the established pattern across this file (e.g. lines 461-462, 684-685, 752-753, 854) is to fire it after d.mux.Unlock(). Documenting this helper as "caller must hold d.mux" actively encourages the deadlock pattern that exists at the UpdatePeerRemoteMeta call site (lines 540-541, 564). Either:
- Reword to "may be called with or without
d.muxheld; prefer after unlock to avoid listener re-entry deadlock", and inlined.numOfPeers()accordingly, or - Drop the helper and have callers use
d.notifier.peerListChanged(numPeers)directly withnumPeerssnapshotted under the lock — matching every other path in this file.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/peer/status.go` around lines 309 - 319, The helper
notifyPeerListChanged currently claims "Caller must hold d.mux" which encourages
holding the lock across listener re-entry; change its contract to "may be called
with or without d.mux held; prefer after unlock to avoid listener re-entry
deadlock" and update implementations to snapshot the peer count under the lock
and call d.notifier.peerListChanged(numPeers) after d.mux.Unlock();
specifically, stop calling notifyPeerListChanged while d.mux is held in
UpdatePeerRemoteMeta and similar sites—replace those calls by capturing num :=
d.numOfPeers() while locked and then call d.notifier.peerListChanged(num) after
unlocking, or keep notifyPeerListChanged as a thin wrapper but remove the "must
hold d.mux" claim and document the preferred-after-unlock usage.
peer, err := h.account.GetPeer(r.Context(), userAuth.AccountId, peerID, userAuth.UserId)
if err != nil {
    http.Error(w, "peer not found", http.StatusNotFound)
    return
Don’t turn every GetPeer failure into a 404.
These branches currently collapse permission-validation failures and unexpected store/backend errors into "peer not found". That makes real server faults indistinguishable from auth/not-found outcomes and will mislead clients during outages.
Suggested fix
peer, err := h.account.GetPeer(r.Context(), userAuth.AccountId, peerID, userAuth.UserId)
if err != nil {
- http.Error(w, "peer not found", http.StatusNotFound)
+ // Map expected auth/not-found outcomes explicitly; keep unexpected
+ // failures as 5xx so callers can distinguish them.
+ http.Error(w, "peer not found", http.StatusNotFound)
return
}
At minimum, inspect the returned management status/error type here and reserve 404 for not-found / intentionally-hidden permission misses.
Also applies to: 137-140
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/http/handlers/peer_connections/handler.go` around lines 85
- 88, The call to h.account.GetPeer currently maps every error to a 404; change
the handling in the GetPeer call sites (the handler block around peer, err :=
h.account.GetPeer(...) and the similar block at the other occurrence) to inspect
the returned error/type instead of blanketing it: detect and return
http.StatusNotFound only when the error indicates a true "not found" or an
intentionally-hidden permission denial (e.g., errors.Is(err, store.ErrNotFound)
or checking a management error code/Status() on the error object), otherwise log
the error and return http.StatusInternalServerError (or the appropriate status
for permission/validation errors). Ensure you preserve context in the log and do
not leak internal details to clients.
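A hedged sketch of the error mapping; errNotFound stands in for whatever not-found sentinel or status type the management layer actually exposes:

```go
package peerconnections

import (
	"errors"
	"log"
	"net/http"
)

// errNotFound is a placeholder for the store/management not-found sentinel.
var errNotFound = errors.New("not found")

// writePeerLookupError reserves 404 for genuine (or intentionally hidden)
// misses and surfaces everything else as a 500, so outages remain
// distinguishable from missing peers.
func writePeerLookupError(w http.ResponseWriter, err error) {
	if errors.Is(err, errNotFound) {
		http.Error(w, "peer not found", http.StatusNotFound)
		return
	}
	log.Printf("get peer failed: %v", err)
	http.Error(w, "internal error", http.StatusInternalServerError)
}
```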
}

pubkey := peer.Key
since, _ := strconv.ParseUint(r.URL.Query().Get("since"), 10, 64)
Reject malformed since tokens.
Right now ?since=abc is treated the same as since=0, so the handler can return cached data instead of telling the caller the refresh token is invalid.
Suggested fix
- since, _ := strconv.ParseUint(r.URL.Query().Get("since"), 10, 64)
+ var since uint64
+ if sinceRaw := r.URL.Query().Get("since"); sinceRaw != "" {
+ parsed, err := strconv.ParseUint(sinceRaw, 10, 64)
+ if err != nil {
+ http.Error(w, "invalid since value", http.StatusBadRequest)
+ return
+ }
+ since = parsed
+ }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/http/handlers/peer_connections/handler.go` at line 92, The
handler currently ignores parse errors from strconv.ParseUint and treats
malformed ?since=... as 0; change the ParseUint call in the peer connections
handler to check the error returned (the
strconv.ParseUint(r.URL.Query().Get("since"), 10, 64) call) and, if the query
value is non-empty and parsing fails, respond with HTTP 400 (Bad Request)
indicating an invalid refresh token instead of proceeding with since=0; keep
successful parsing behavior unchanged and only reject when the since parameter
is present but malformed.
p.EffectiveConnectionMode == other.EffectiveConnectionMode &&
    p.EffectiveRelayTimeoutSecs == other.EffectiveRelayTimeoutSecs &&
    p.EffectiveP2PTimeoutSecs == other.EffectiveP2PTimeoutSecs &&
    p.EffectiveP2PRetryMaxSecs == other.EffectiveP2PRetryMaxSecs &&
    slices.Equal(p.SupportedFeatures, other.SupportedFeatures) &&
    p.Flags.isEqual(other.Flags)
SupportedFeatures comparison is order-sensitive but capabilities are set-like.
Two equivalent feature sets in different orders will be treated as unequal, causing unnecessary metadata churn.
💡 Suggested fix
func (p PeerSystemMeta) isEqual(other PeerSystemMeta) bool {
+ pFeatures := slices.Clone(p.SupportedFeatures)
+ otherFeatures := slices.Clone(other.SupportedFeatures)
+ sort.Strings(pFeatures)
+ sort.Strings(otherFeatures)
+
return p.Hostname == other.Hostname &&
p.GoOS == other.GoOS &&
p.Kernel == other.Kernel &&
p.KernelVersion == other.KernelVersion &&
p.Core == other.Core &&
p.Platform == other.Platform &&
p.OS == other.OS &&
p.OSVersion == other.OSVersion &&
p.WtVersion == other.WtVersion &&
p.UIVersion == other.UIVersion &&
p.SystemSerialNumber == other.SystemSerialNumber &&
p.SystemProductName == other.SystemProductName &&
p.SystemManufacturer == other.SystemManufacturer &&
p.Environment.Cloud == other.Environment.Cloud &&
p.Environment.Platform == other.Environment.Platform &&
p.EffectiveConnectionMode == other.EffectiveConnectionMode &&
p.EffectiveRelayTimeoutSecs == other.EffectiveRelayTimeoutSecs &&
p.EffectiveP2PTimeoutSecs == other.EffectiveP2PTimeoutSecs &&
p.EffectiveP2PRetryMaxSecs == other.EffectiveP2PRetryMaxSecs &&
- slices.Equal(p.SupportedFeatures, other.SupportedFeatures) &&
+ slices.Equal(pFeatures, otherFeatures) &&
p.Flags.isEqual(other.Flags)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer/peer.go` around lines 199 - 204, The equality check
currently uses slices.Equal on SupportedFeatures which is order-sensitive;
change it to an order-insensitive set comparison (e.g., convert
p.SupportedFeatures and other.SupportedFeatures into maps/sets or sort them
before comparing) so two feature lists with the same elements in different
orders are treated equal; update the equality logic surrounding
EffectiveConnectionMode/.../SupportedFeatures and keep the existing
p.Flags.isEqual call unchanged.
if sRelayTimeoutSeconds.Valid {
    v := uint32(sRelayTimeoutSeconds.Int64)
    account.Settings.RelayTimeoutSeconds = &v
}
if sP2pTimeoutSeconds.Valid {
    v := uint32(sP2pTimeoutSeconds.Int64)
    account.Settings.P2pTimeoutSeconds = &v
}
if sP2pRetryMaxSeconds.Valid {
    v := uint32(sP2pRetryMaxSeconds.Int64)
    account.Settings.P2pRetryMaxSeconds = &v
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Direct int64->uint32 casts in sql_store.go:"
rg -n -C2 'uint32\((sRelayTimeoutSeconds|sP2pTimeoutSeconds|sP2pRetryMaxSeconds|sLegacyLazyFallbackTimeoutSecs|metaEffectiveRelayTimeoutSecs|metaEffectiveP2PTimeoutSecs|metaEffectiveP2PRetryMaxSecs)\.Int64\)' management/server/store/sql_store.go
echo
echo "Range guards around those variables (expected none currently):"
rg -n -C2 '(sRelayTimeoutSeconds|sP2pTimeoutSeconds|sP2pRetryMaxSeconds|sLegacyLazyFallbackTimeoutSecs|metaEffectiveRelayTimeoutSecs|metaEffectiveP2PTimeoutSecs|metaEffectiveP2PRetryMaxSecs).*(<\s*0|>\s*math\.MaxUint32)' management/server/store/sql_store.go || true
Repository: netbirdio/netbird
Length of output: 1323
Guard int64→uint32 conversions before assigning timeouts.
At lines 1642, 1646, 1650, 1660, 1923, 1926, and 1929, direct narrowing casts can silently wrap invalid DB values (negative or > math.MaxUint32) into unrelated timeout values.
💡 Suggested fix pattern
+import "math"
...
if sRelayTimeoutSeconds.Valid {
+ if sRelayTimeoutSeconds.Int64 < 0 || sRelayTimeoutSeconds.Int64 > math.MaxUint32 {
+ return nil, status.NewGetAccountFromStoreError(
+ fmt.Errorf("settings_relay_timeout_seconds out of range: %d", sRelayTimeoutSeconds.Int64),
+ )
+ }
v := uint32(sRelayTimeoutSeconds.Int64)
account.Settings.RelayTimeoutSeconds = &v
}
Apply the same bounds check pattern to:
- sP2pTimeoutSeconds
- sP2pRetryMaxSeconds
- sLegacyLazyFallbackTimeoutSecs
- metaEffectiveRelayTimeoutSecs
- metaEffectiveP2PTimeoutSecs
- metaEffectiveP2PRetryMaxSecs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/store/sql_store.go` around lines 1641 - 1652, The DB
int64→uint32 casts (sRelayTimeoutSeconds, sP2pTimeoutSeconds,
sP2pRetryMaxSeconds and the meta/legacy fields metaEffectiveRelayTimeoutSecs,
metaEffectiveP2PTimeoutSecs, metaEffectiveP2PRetryMaxSecs and
sLegacyLazyFallbackTimeoutSecs) must be guarded against negative values and
values > math.MaxUint32 before assigning into account.Settings
(RelayTimeoutSeconds, P2pTimeoutSeconds, P2pRetryMaxSeconds, etc.); change each
block to check that the source.Int64 is >= 0 and <= math.MaxUint32 and only then
create the uint32 v and assign &v to the corresponding account.Settings field
(otherwise leave the pointer nil or handle as per existing pattern).
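As a sketch of the guarded narrowing the prompt asks for; nullableUint32 is a hypothetical helper, and the real store code would wrap the error in status.NewGetAccountFromStoreError as shown in the fix pattern above:
import (
	"database/sql"
	"fmt"
	"math"
)

// nullableUint32 converts a nullable DB column to *uint32, rejecting
// values an int64 can carry but a uint32 cannot (negative or too large)
// instead of silently wrapping them.
func nullableUint32(name string, v sql.NullInt64) (*uint32, error) {
	if !v.Valid {
		return nil, nil
	}
	if v.Int64 < 0 || v.Int64 > math.MaxUint32 {
		return nil, fmt.Errorf("%s out of range: %d", name, v.Int64)
	}
	u := uint32(v.Int64)
	return &u, nil
}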
        enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
        nullable: true
        description: |
          Account-wide default peer-connection mode. NULL means
          "fall back to lazy_connection_enabled" for backwards compatibility.
          Phase 1 of issue #5989: relay-forced, p2p, and p2p-lazy are
          functional. p2p-dynamic is reserved (passes through as p2p in
          Phase 1; will become functional in Phase 2).
      p2p_timeout_seconds:
        x-experimental: true
        type: integer
        format: int64
        minimum: 0
        nullable: true
        description: |
          Default ICE-worker idle timeout in seconds. 0 = never tear down.
          Effective only in p2p-dynamic mode (added in Phase 2).
          NULL means "use built-in default" (180 minutes).
      p2p_retry_max_seconds:
        x-experimental: true
        type: integer
        format: int64
        minimum: 0
        nullable: true
        description: |
          Maximum interval between P2P retry attempts after consecutive
          ICE failures, in seconds. Default 900 (= 15 min). Set to 0 to
          disable backoff (always retry immediately, Phase-2 behavior).
          Effective only in p2p-dynamic mode (added in Phase 3).
        example: 900
      relay_timeout_seconds:
        x-experimental: true
        type: integer
        format: int64
        minimum: 0
        nullable: true
        description: |
          Default relay-worker idle timeout in seconds. 0 = never tear
          down. Effective in p2p-lazy and p2p-dynamic modes. Backwards-
          compat alias for NB_LAZY_CONN_INACTIVITY_THRESHOLD on the
          client. NULL means "use built-in default" (5 minutes).
Add p2p-dynamic-lazy to connection_mode enum and align timeout docs with final mode model
connection_mode is missing p2p-dynamic-lazy (Line 365), which is part of the mode set for this rollout. That creates contract drift: generated clients may reject/omit a valid server value, and schema validation can fail on updates. Also, timeout descriptions (Lines 380-405) still describe earlier phase behavior and can mislead consumers about which modes each timeout applies to.
Suggested OpenAPI fix
connection_mode:
x-experimental: true
type: string
- enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+ enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
nullable: true
description: |
Account-wide default peer-connection mode. NULL means
"fall back to lazy_connection_enabled" for backwards compatibility.
- Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
- functional. p2p-dynamic is reserved (passes through as p2p in
- Phase 1; will become functional in Phase 2).
+ Includes dynamic and dynamic-lazy rollout modes.
...
p2p_timeout_seconds:
...
description: |
Default ICE-worker idle timeout in seconds. 0 = never tear down.
- Effective only in p2p-dynamic mode (added in Phase 2).
+ Effective in dynamic modes.
...
relay_timeout_seconds:
...
description: |
Default relay-worker idle timeout in seconds. 0 = never tear
- down. Effective in p2p-lazy and p2p-dynamic modes. Backwards-
+ down. Effective in lazy/dynamic-lazy behavior. Backwards-
compat alias for NB_LAZY_CONN_INACTIVITY_THRESHOLD on the
          client. NULL means "use built-in default" (5 minutes).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
        nullable: true
        description: |
          Account-wide default peer-connection mode. NULL means
          "fall back to lazy_connection_enabled" for backwards compatibility.
          Includes dynamic and dynamic-lazy rollout modes.
      p2p_timeout_seconds:
        x-experimental: true
        type: integer
        format: int64
        minimum: 0
        nullable: true
        description: |
          Default ICE-worker idle timeout in seconds. 0 = never tear down.
          Effective in dynamic modes.
          NULL means "use built-in default" (180 minutes).
      p2p_retry_max_seconds:
        x-experimental: true
        type: integer
        format: int64
        minimum: 0
        nullable: true
        description: |
          Maximum interval between P2P retry attempts after consecutive
          ICE failures, in seconds. Default 900 (= 15 min). Set to 0 to
          disable backoff (always retry immediately, Phase-2 behavior).
          Effective only in p2p-dynamic mode (added in Phase 3).
        example: 900
      relay_timeout_seconds:
        x-experimental: true
        type: integer
        format: int64
        minimum: 0
        nullable: true
        description: |
          Default relay-worker idle timeout in seconds. 0 = never tear
          down. Effective in lazy/dynamic-lazy behavior. Backwards-
          compat alias for NB_LAZY_CONN_INACTIVITY_THRESHOLD on the
          client. NULL means "use built-in default" (5 minutes).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@shared/management/http/api/openapi.yml` around lines 365 - 405, The OpenAPI
schema's connection_mode enum is missing the new value "p2p-dynamic-lazy" and
several timeout field descriptions still reference outdated phase-specific
behavior; update the enum under connection_mode to include "p2p-dynamic-lazy"
and revise the descriptions of p2p_timeout_seconds, p2p_retry_max_seconds, and
relay_timeout_seconds to reflect the final mode model (which modes each timeout
applies to, default values, and NULL semantics) so generated clients and
validation match the server behavior.
Force-pushed from 0ed16b8 to b082536 (Compare)
Actionable comments posted: 15
♻️ Duplicate comments (17)
client/internal/lazyconn/manager/manager.go (1)
50-55: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Gate the Phase-1 inactivity fallback on the deprecated field actually being in use.
Line 52 still applies InactivityThreshold whenever relay is 0, and Line 104 still treats any effective 0/0 as "use the legacy single-timer manager." That means explicit ICEInactivityThreshold=0/RelayInactivityThreshold=0 can silently restore Phase-1 relay teardown, and mixed old/new config can override an explicit "relay always-on" setting with the deprecated field.
Suggested fix
func (c Config) resolvedTimeouts() (iceTimeout, relayTimeout time.Duration) {
	relay := c.RelayInactivityThreshold
-	if relay == 0 && c.InactivityThreshold != nil {
+	if c.InactivityThreshold != nil &&
+		c.ICEInactivityThreshold == 0 &&
+		c.RelayInactivityThreshold == 0 {
		relay = *c.InactivityThreshold
	}
	return c.ICEInactivityThreshold, relay
}
@@
	if wgIface.IsUserspaceBind() {
		iceTO, relayTO := config.resolvedTimeouts()
-		if iceTO == 0 && relayTO == 0 {
+		if config.InactivityThreshold != nil &&
+			config.ICEInactivityThreshold == 0 &&
+			config.RelayInactivityThreshold == 0 {
			// Phase 1 / single-timer fallback when caller hasn't migrated.
			m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold) //nolint:staticcheck // intentional Phase-1 single-timer fallback
		} else {
			m.inactivityManager = inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO)
		}
Also applies to: 103-107
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/lazyconn/manager/manager.go` around lines 50 - 55, The fallback to the deprecated single-timer behavior must only trigger when the deprecated InactivityThreshold was actually provided; update Config.resolvedTimeouts to indicate whether the deprecated field was used (e.g., return a bool or expose a method like Config.deprecatedInUse) and only copy InactivityThreshold into RelayInactivityThreshold when InactivityThreshold != nil; then change the manager selection logic (the code around where resolvedTimeouts is consumed, lines selecting Phase-1 vs new manager) to treat an effective 0/0 as legacy only if the deprecated flag is set, otherwise respect explicit zero values for ICEInactivityThreshold and RelayInactivityThreshold so mixed old/new configs cannot override an explicit "relay always-on" or restore Phase-1 unintentionally.
management/server/peer/peer.go (1)
199-203: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Compare SupportedFeatures as a set, not a slice.
Capabilities are set-like. slices.Equal makes metadata equality depend on wire order, so two equivalent feature sets can still churn peer metadata and trigger needless updates.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer/peer.go` around lines 199 - 203, The metadata equality currently uses slices.Equal on p.SupportedFeatures and other.SupportedFeatures which makes equality order-sensitive; change the comparison in the equality function (the method that compares peer metadata in peer.go) to treat SupportedFeatures as a set by comparing membership instead of order — e.g., build a temporary map/set of features from one slice and verify every element of the other slice exists and lengths match (or sort both slices deterministically before comparing), replacing the slices.Equal(p.SupportedFeatures, other.SupportedFeatures) check with this set-based comparison so equivalent feature lists in different orders are considered equal.
client/internal/stdnet/filter_test.go (1)
51-65: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Assert the table symmetrically and gate Windows-only rows.
The loop still only fails on false-positive allows and one hardcoded Windows false-negative. Most want: true cases are never checked, so regressions like filtering vEthernet (External) will still pass on non-Windows runners.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter_test.go` around lines 51 - 65, The test loop only checks false-positive allows and a single hardcoded Windows case; update the loop around allow(c.name) to assert both directions (if c.want && !got then t.Errorf("... want true"), and if !c.want && got then t.Errorf("... want false")), and gate Windows-only rows so they only assert on Windows runners (either add a boolean field like windowsOnly to the case struct and skip/assert based on runtime.GOOS, or check the existing Windows-specific name(s) such as "vEthernet (LAN)" before asserting on non-windows). Ensure you reference the allow function, c.want, runtime.GOOS, and the test helpers t.Errorf/t.Fatalf appropriately when making these changes.
client/internal/peer/status.go (3)
692-693: ⚠️ Potential issue | 🔴 Critical
Keep notifyPeerStateChangeListeners under d.mux.
This helper calls snapshotRouterPeersLocked, which expects d.mux to be held while reading d.peers and d.changeNotify. Invoking it after unlock reintroduces the concurrent-map race the earlier review called out. Move the materialICE/materialRelay notification back above d.mux.Unlock().
Also applies to: 760-761
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 692 - 693, Move the calls to notifyPeerStateChangeListeners back inside the critical section guarded by d.mux: hold d.mux while checking materialICE/materialRelay and calling notifyPeerStateChangeListeners so snapshotRouterPeersLocked sees d.peers and d.changeNotify under the lock; specifically, ensure the materialICE/materialRelay checks and the notifyPeerStateChangeListeners(receivedState.PubKey) / notifyPeerStateChangeListeners(receivedState.PubKey) calls occur before invoking d.mux.Unlock() (also fix the identical pattern at the other occurrence around lines 760-761).
542-543: ⚠️ Potential issue | 🟠 Major
Move peerListChanged out from under d.mux.
d.notifyPeerListChanged() can synchronously invoke listener code, and this path still does it before the deferred unlock runs. Any listener that re-enters Status can deadlock here. Snapshot numPeers under the lock, unlock, then fire the peer-list notification.
Also applies to: 565-566
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 542 - 543, The peer-list notification is being called while holding d.mux which can deadlock if listeners re-enter Status; modify the code in the Status method around notifyPeerListChanged (and the similar block at the other occurrence) to snapshot numPeers (or any needed state) while holding d.mux, release the lock, then call d.notifyPeerListChanged() outside the mutex; ensure you remove the notify call from under the lock and only reference the pre-captured values inside the notification invocation.
470-471: ⚠️ Potential issue | 🟠 Major
Capture the conn-state callback before unlocking.
notifyConnStateChange reads d.connStateListener, but each of these branches builds the closure after d.mux.Unlock(). A concurrent SetConnStateListener(nil/...) can race with that read and lose or misroute the event. Snapshot the closure under the lock, then return it after unlock.
🧩 Suggested pattern
-	d.mux.Unlock()
-
-	if notifyList {
-		d.notifier.peerListChanged(numPeers)
-	}
-	...
-	if hasStatusOrRelayedChange(oldStatus, receivedState.ConnStatus, oldSnapshot.Relayed, receivedState.Relayed) {
-		return d.notifyConnStateChange(receivedState.PubKey, peerState), nil
-	}
+	notifyConn := func() {}
+	if hasStatusOrRelayedChange(oldStatus, receivedState.ConnStatus, oldSnapshot.Relayed, receivedState.Relayed) {
+		notifyConn = d.notifyConnStateChange(receivedState.PubKey, peerState)
+	}
+
+	d.mux.Unlock()
+
+	if notifyList {
+		d.notifier.peerListChanged(numPeers)
+	}
+	...
+	return notifyConn, nil
Also applies to: 696-697, 764-765, 812-813, 863-864
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 470 - 471, The code calls notifyConnStateChange (and similar callbacks) after releasing d.mux which allows a concurrent SetConnStateListener to change d.connStateListener and race the callback; fix by snapshotting the listener under the lock (read d.connStateListener into a local variable while holding d.mux) and return/invoke that captured closure after unlock instead of accessing d.connStateListener later; apply this pattern for the occurrences around hasConnStatusChanged/notifyConnStateChange and the other mentioned branches (around lines handling conn-state at the other call sites).
shared/management/http/api/types.gen.go (1)
41-63: ⚠️ Potential issue | 🟠 Major
Expose p2p-dynamic-lazy in the generated connection-mode enum.
AccountSettingsConnectionMode and Valid() still only admit four values, but the Phase 3 contract in this PR includes p2p-dynamic-lazy. If the backend starts returning that mode, generated clients will reject it or fail to round-trip it. Please fix the OpenAPI source and regenerate this file instead of patching the generated output directly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated enum AccountSettingsConnectionMode is missing the new value "p2p-dynamic-lazy", causing Valid() and round-trips to reject it; update the OpenAPI specification to include the new enum value for AccountSettingsConnectionMode (so the generator emits a constant like AccountSettingsConnectionModeP2pDynamicLazy with the string "p2p-dynamic-lazy"), then regenerate types.gen.go so the Valid() switch in AccountSettingsConnectionMode also includes the new constant (and any related serialization/deserialization logic) rather than editing the generated file by hand.
shared/management/http/api/openapi.yml (1)
365-405: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
AccountSettings.connection_mode and timeout docs are still out of sync with the final mode model.
Line 365 is still missing p2p-dynamic-lazy in the enum, and Lines 380-405 still describe phase-specific applicability that no longer matches the rollout model. This can break generated clients/schema validation and misconfigure consumers relying on these descriptions.
Suggested OpenAPI patch
-        enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+        enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
@@
-          Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-          functional. p2p-dynamic is reserved (passes through as p2p in
-          Phase 1; will become functional in Phase 2).
+          Includes dynamic rollout modes (`p2p-dynamic`, `p2p-dynamic-lazy`).
@@
-          Effective only in p2p-dynamic mode (added in Phase 2).
+          Effective in dynamic modes.
@@
-          disable backoff (always retry immediately, Phase-2 behavior).
-          Effective only in p2p-dynamic mode (added in Phase 3).
+          disable backoff (always retry immediately).
+          Effective in dynamic modes.
@@
-          down. Effective in p2p-lazy and p2p-dynamic modes. Backwards-
+          down. Effective in relay-lazy modes (for example `p2p-lazy`
+          and `p2p-dynamic-lazy`). Backwards-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/openapi.yml` around lines 365 - 405, Update the AccountSettings.connection_mode enum to include "p2p-dynamic-lazy" and revise the descriptions for p2p_timeout_seconds, p2p_retry_max_seconds, and relay_timeout_seconds to reflect the final mode model (remove phase-specific rollout language and phase applicability notes); explicitly state which modes each timeout applies to, preserve nullable/default semantics and examples, and remove any "Phase X" wording so generated clients/schema validation match the final model.
client/internal/conn_state_pusher.go (3)
245-251: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Snapshot-request handling does not drain to the newest nonce.
The doc on OnSnapshotRequest (lines 153-156) promises latest-nonce coalescing, but the loop pulls the first queued nonce and runs flushFull with it. If the management server fired two refreshes back-to-back, the second InResponseToNonce may never be echoed (the first one bumps seq, then the second nonce gets popped and answered with stale state — or the buffer fills and the second nonce is dropped at OnSnapshotRequest's default case). Drain to the newest queued nonce before flushing. This was flagged previously and is still present.
Suggested fix
case nonce := <-p.snapshotReq:
+	// Coalesce: drain any newer queued nonces; we only need to
+	// answer the latest one with a single full snapshot.
+drainSnapshotReq:
+	for {
+		select {
+		case newer := <-p.snapshotReq:
+			nonce = newer
+		default:
+			break drainSnapshotReq
+		}
+	}
	if p.source != nil {
		p.flushFull(p.source.SnapshotAllRemotePeers(), nonce)
	}
	interval = p.tuning.baseInterval
	emptyTicks = 0
	timer.Reset(interval)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 245 - 251, The snapshot-request handling must drain p.snapshotReq to the newest nonce before calling p.flushFull so we coalesce to the latest InResponseToNonce; change the case handling that currently does "nonce := <-p.snapshotReq" to read the first nonce then loop with non-blocking receives (select with default) to consume any additional pending nonces and keep the most recent value, then call p.flushFull(p.source.SnapshotAllRemotePeers(), newestNonce); ensure you reference and update the logic around p.snapshotReq, flushFull, and any variables like interval/emptyTicks/timer so behavior after the flush remains unchanged.
293-298: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Push calls use context.Background() and can wedge Stop() indefinitely.
Both flushDelta (line 293) and flushFull (line 363) call p.sink.Push(context.Background(), …) from the loop goroutine. If the management gRPC client hangs (slow server, broken connection, head-of-line blocking on retry), the loop blocks inside Push, never services <-p.stop, and Stop() → wg.Wait() never returns — taking daemon shutdown with it. Either derive the context from a per-pusher cancel that Stop cancels first, or wrap each push in a context.WithTimeout. This was flagged previously and is still present.
Suggested shape
type connStatePusher struct {
	...
+	ctx    context.Context
+	cancel context.CancelFunc
	stop chan struct{}
	wg   sync.WaitGroup
}

func newConnStatePusherForTest(...) *connStatePusher {
-	p := &connStatePusher{...}
+	ctx, cancel := context.WithCancel(context.Background())
+	p := &connStatePusher{ctx: ctx, cancel: cancel, ...}
	...
}

func (p *connStatePusher) Stop() {
+	p.cancel()
	close(p.stop)
	p.wg.Wait()
}

-	if err := p.sink.Push(context.Background(), &mgmProto.PeerConnectionMap{...}); err != nil {
+	pushCtx, cancel := context.WithTimeout(p.ctx, 30*time.Second)
+	err := p.sink.Push(pushCtx, &mgmProto.PeerConnectionMap{...})
+	cancel()
+	if err != nil {
Also applies to: 363-369
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 293 - 298, flushDelta and flushFull call p.sink.Push with context.Background(), which can block the loop and prevent Stop/wg.Wait from returning; change both flushDelta and flushFull to use a cancellable/timeout context instead of context.Background() — either derive the context from a per-pusher context stored on the pusher (e.g., p.ctx) that Stop cancels, or wrap each Push call with context.WithTimeout and defer cancel so Push returns on Stop; ensure Stop cancels the pusher context (or relies on the timeout) before waiting on wg so the goroutine unblocks from p.sink.Push.
115-123: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
newSessionID will spin-loop forever if rand.Read ever returns an error.
rand.Read errors are ignored; on a system where getrandom is unavailable or the entropy source is wedged, b stays zeroed, the generated id is 0, and the loop never exits — newConnStatePusherForTest (and therefore Engine startup) hangs. Bound the retry, propagate the error, or fall back to a non-zero monotonic seed. This was flagged previously and is still present.
Suggested fix
func newSessionID() uint64 {
	var b [8]byte
	for {
-		_, _ = rand.Read(b[:])
+		if _, err := rand.Read(b[:]); err != nil {
+			if id := uint64(time.Now().UnixNano()); id != 0 {
+				return id
+			}
+			continue
+		}
		if id := binary.BigEndian.Uint64(b[:]); id != 0 {
			return id
		}
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 115 - 123, newSessionID currently ignores rand.Read errors and can spin forever returning 0; change newSessionID to return (uint64, error), check the error from rand.Read each iteration, limit retries (e.g. a small constant like 3-5 attempts), and if still failing return a non-zero fallback only if safe or better return a clear error; update callers such as newConnStatePusherForTest and any Engine startup call sites to handle the returned error (propagate or fail startup) rather than assuming a uint64 always succeeds.
management/server/http/handlers/accounts/accounts_handler.go (1)
234-234: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Plain fmt.Errorf still surfaces as 5xx instead of 4xx for these validation failures.
The validation branches for connection_mode (line 234), legacy_lazy_fallback_timeout_seconds (line 280), AutoUpdateVersion (line 291), and validateUint32Timeout (lines 513, 516) still return plain errors. util.WriteError will treat these as Internal/5xx instead of InvalidArgument/422 for bad input. This was raised previously and isn't marked addressed.
💡 Minimal fix
-	return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+	return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
@@
-	return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+	return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
@@
func validateUint32Timeout(name string, v int64) (uint32, error) {
	if v < 0 {
-		return 0, fmt.Errorf("invalid %s: %d (must be >= 0)", name, v)
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (must be >= 0)", name, v)
	}
	if v > int64(math.MaxUint32) {
-		return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
	}
	return uint32(v), nil
}
Note this will also break TestValidateUint32Timeout_PlainError — it should be removed or updated since the helper would no longer return a plain fmt.Errorf value.
Also applies to: 280-280, 291-291, 513-513, 516-516
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler.go` at line 234, Replace plain fmt.Errorf returns for validation failures with the utility error constructor that maps to InvalidArgument (so util.WriteError produces a 4xx), e.g. change the return in the connection_mode branch (modeStr), the legacy_lazy_fallback_timeout_seconds case, the AutoUpdateVersion branch, and inside validateUint32Timeout to return the project's InvalidArgument constructor with the same message; also update or remove TestValidateUint32Timeout_PlainError to reflect that validateUint32Timeout no longer returns a plain fmt.Errorf.
management/internals/shared/grpc/conversion.go (1)
315-327: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
ServerLivenessKnown is still set unconditionally outside the rPeer.Status != nil guard.
When rPeer.Status is nil, LiveOnline stays false (default) but ServerLivenessKnown is forced to true on line 326, telling new clients "the server authoritatively knows this peer is offline". That is exactly the regression flagged in the prior review and is the wrong signal — the client should fall back to the LastSeenAtServer heuristic in this case. Move the assignment inside the rPeer.Status != nil branch.
🐛 Proposed fix
	// nbpeer.Peer.Status is *PeerStatus; nil-guard before accessing.
	if rPeer.Status != nil {
		if !rPeer.Status.LastSeen.IsZero() {
			cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
		}
		cfg.LiveOnline = rPeer.Status.Connected
+		// New servers always know per-peer liveness; signal that to
+		// new clients so they can trust LiveOnline directly instead
+		// of guessing from the LastSeenAtServer-zero heuristic. Old
+		// servers leave this field at default (false) and clients
+		// fall back. Only set when Status is actually present.
+		cfg.ServerLivenessKnown = true
	}
-	// New servers always know per-peer liveness; signal that to new
-	// clients so they can trust LiveOnline directly instead of
-	// guessing from the LastSeenAtServer-zero heuristic. Old servers
-	// leave this field at default (false) and clients fall back.
-	cfg.ServerLivenessKnown = true
	dst = append(dst, cfg)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/conversion.go` around lines 315 - 327, The ServerLivenessKnown flag is being set unconditionally, which falsely signals authoritative liveness when rPeer.Status is nil; move the assignment of cfg.ServerLivenessKnown = true inside the rPeer.Status != nil branch (next to where cfg.LiveOnline and cfg.LastSeenAtServer are set) so ServerLivenessKnown is only true when rPeer.Status is non-nil, leaving the default false behavior for clients to fall back to the LastSeenAtServer heuristic.
management/server/http/handlers/accounts/accounts_handler_test.go (1)
418-419: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
gofmt will still reject this — fields 418-419 are over-indented (4 tabs vs 2 tabs of the surrounding block).
The neighboring fields in expectedSettings (lines 409-425) use two tabs, but LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds use four. This breaks gofmt-strict CI gates. The previous review flagged the same hunk; please run gofmt -w on the file so the colon column re-aligns across the whole struct literal.
Proposed fix
	LazyConnectionEnabled: br(false),
-			LegacyLazyFallbackEnabled:        br(true),
-			LegacyLazyFallbackTimeoutSeconds: ir(3600),
+	LegacyLazyFallbackEnabled:        br(true),
+	LegacyLazyFallbackTimeoutSeconds: ir(3600),
	DnsDomain: sr(""),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler_test.go` around lines 418 - 419, The struct literal in expectedSettings has two fields, LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds, over-indented (4 tabs) and misaligning the colon column; run gofmt -w on the file or manually re-indent those two fields to match the surrounding two-tab indentation so the colon alignment is consistent (look for the expectedSettings struct literal and the symbols LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds and correct their indentation).
shared/management/client/grpc.go (1)
483-493: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Run the snapshot callback after applying the same SyncResponse.
SnapshotRequest can arrive in the same frame as ordinary sync data. Calling cb() first lets the client publish a snapshot from pre-update peer state.
🩹 Suggested ordering change
-		if req := decryptedResp.GetSnapshotRequest(); req != nil {
-			c.snapMu.Lock()
-			cb := c.onSnapshotRequest
-			c.snapMu.Unlock()
-			if cb != nil {
-				cb(req.GetNonce())
-			}
-		}
-
		if err := msgHandler(decryptedResp); err != nil {
			log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
+			continue
+		}
+
+		if req := decryptedResp.GetSnapshotRequest(); req != nil {
+			c.snapMu.Lock()
+			cb := c.onSnapshotRequest
+			c.snapMu.Unlock()
+			if cb != nil {
+				cb(req.GetNonce())
+			}
		}
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/client/grpc.go` around lines 483 - 493, The snapshot callback is invoked before applying the SyncResponse, which can cause snapshots to be published from pre-update state; change the ordering so that handling the SyncResponse via msgHandler(decryptedResp) happens first and only after successful application (or at least after msgHandler returns nil) you call the snapshot callback (use c.onSnapshotRequest and req.GetNonce() as before), preserving the c.snapMu lock/unlock around reading c.onSnapshotRequest and keeping the existing nil checks around req and cb.
client/internal/engine_offline_debounce_test.go (1)
77-97: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Guard debounce-map length checks with peerOfflineDebounceMu.
These assertions read e.peerOfflineDebounce without the mutex while timers can mutate the same map asynchronously. That can trip the race detector and make the tests flaky.
🩹 Suggested fix
func TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp(t *testing.T) {
	e := engineForDebounceTest()
	// must not panic
	e.cancelRemoteOfflineClose("never-scheduled")
-	if len(e.peerOfflineDebounce) != 0 {
+	e.peerOfflineDebounceMu.Lock()
+	n := len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 0 {
		t.Error("map must remain empty")
	}
}
@@
	for _, k := range []string{"a", "b", "c", "d"} {
		e.scheduleRemoteOfflineClose(k)
	}
-	if len(e.peerOfflineDebounce) != 4 {
-		t.Fatalf("setup: expected 4 timers, got %d", len(e.peerOfflineDebounce))
+	e.peerOfflineDebounceMu.Lock()
+	n := len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 4 {
+		t.Fatalf("setup: expected 4 timers, got %d", n)
	}
	e.cancelAllRemoteOfflineCloses()
-	if len(e.peerOfflineDebounce) != 0 {
-		t.Errorf("cancel-all must clear the map, got %d entries", len(e.peerOfflineDebounce))
+	e.peerOfflineDebounceMu.Lock()
+	n = len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 0 {
+		t.Errorf("cancel-all must clear the map, got %d entries", n)
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/engine_offline_debounce_test.go` around lines 77 - 97, The test reads e.peerOfflineDebounce without holding the associated mutex, which can race with concurrent timer callbacks; wrap any access that checks len(e.peerOfflineDebounce) (in TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp and TestCancelAllRemoteOfflineCloses_ClearsEverything) with e.peerOfflineDebounceMu RLock/RUnlock (or Lock/Unlock) to safely read the map; locate uses around the assertions after calling engineForDebounceTest(), scheduleRemoteOfflineClose, cancelRemoteOfflineClose, and cancelAllRemoteOfflineCloses and protect them with the mutex to avoid race detector failures.
client/internal/conn_mgr.go (1)
287-340: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Timeout-only changes still don't reach the running lazy/dynamic manager.
When the management push only updates RelayTimeoutSeconds or P2PTimeoutSeconds (and the mode stays in p2p-lazy/p2p-dynamic), modeChanged is false so neither the closeManager branch (Line 314) nor the e.lazyConnMgr == nil branch (Line 322) fires. The new timeouts are stored on ConnMgr, but manager.Config.RelayInactivityThreshold/ICEInactivityThreshold were snapshotted once in initLazyManager (Lines 605-610) and the running inactivity manager keeps using the stale values until a full mode flip.
propagateP2pRetryMaxToConns() only handles the per-Conn ICE-backoff cap, not the lazy manager's two-timer thresholds. Consider detecting timeout-only changes when isManaged && lazyConnMgr != nil and either restarting the manager or pushing the new thresholds through a live-update path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 287 - 340, Timeout-only changes to relayTimeoutSecs/p2pTimeoutSecs are not applied to a running lazy manager because modeChanged is false; detect when isManaged && e.lazyConnMgr != nil and the newRelay/newP2P differ from the existing e.relayTimeoutSecs/e.p2pTimeoutSecs and handle it by restarting the manager so it snapshots the new thresholds: call e.closeManager(ctx), e.initLazyManager(ctx), e.startModeSideEffects(), and then return e.resetPeersToLazyIdle(ctx) (similar to the existing activation path). Keep the existing propagateP2pRetryMaxToConns() behavior for p2pRetry changes and use the same helper symbols (resolveConnectionMode, modeUsesLazyMgr, initLazyManager, closeManager, startModeSideEffects, resetPeersToLazyIdle, lazyConnMgr) to locate and implement the change.
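For illustration, a minimal sketch of the restart-based variant the prompt describes. The helper names come from the review; the branch condition and ordering are illustrative, not the PR's code:
	timeoutsChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
	if !modeChanged && timeoutsChanged && isManaged && e.lazyConnMgr != nil {
		// Restart the manager so the inactivity manager re-snapshots
		// the thresholds that initLazyManager captured at startup.
		e.closeManager(ctx)
		e.relayTimeoutSecs = newRelay
		e.p2pTimeoutSecs = newP2P
		e.initLazyManager(ctx)
		return e.resetPeersToLazyIdle(ctx)
	}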
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: db5bc397-20dd-4a93-a989-752a79f7ed0f
⛔ Files ignored due to path filters (2)
- client/proto/daemon.pb.go is excluded by !**/*.pb.go
- shared/management/proto/management.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (63)
- .gitignore
- client/android/client.go
- client/android/peer_notifier.go
- client/android/preferences.go
- client/android/preferences_clamp_test.go
- client/cmd/testutil_test.go
- client/iface/bind/activity.go
- client/iface/device/endpoint_manager.go
- client/internal/conn_mgr.go
- client/internal/conn_state_pusher.go
- client/internal/conn_state_pusher_material_test.go
- client/internal/conn_state_pusher_test.go
- client/internal/conn_state_pusher_testhelper_test.go
- client/internal/engine.go
- client/internal/engine_offline_debounce_test.go
- client/internal/lazyconn/activity/listener_bind_test.go
- client/internal/lazyconn/manager/manager.go
- client/internal/peer/conn.go
- client/internal/peer/conn_handover_order_test.go
- client/internal/peer/conn_lazy_keepwgpeer_test.go
- client/internal/peer/guard/guard.go
- client/internal/peer/guard/guard_test.go
- client/internal/peer/guard/ice_retry_state_test.go
- client/internal/peer/ice_backoff.go
- client/internal/peer/ice_backoff_test.go
- client/internal/peer/status.go
- client/internal/peer/status_debounce_test.go
- client/internal/peer/status_remote_meta_notify_test.go
- client/internal/peerstore/store.go
- client/internal/profilemanager/config.go
- client/internal/stdnet/filter.go
- client/internal/stdnet/filter_test.go
- client/proto/daemon.proto
- client/server/server_test.go
- client/system/features.go
- client/system/features_test.go
- client/ui/network.go
- client/ui/peers_tab.go
- docs/bugs/2026-05-04-user-peer-visibility-regression.md
- docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
- management/internals/server/boot.go
- management/internals/shared/grpc/conversion.go
- management/internals/shared/grpc/conversion_test.go
- management/internals/shared/grpc/server.go
- management/server/account.go
- management/server/activity/codes.go
- management/server/http/handlers/accounts/accounts_handler.go
- management/server/http/handlers/accounts/accounts_handler_test.go
- management/server/http/handlers/accounts/validate_uint32_timeout_test.go
- management/server/peer.go
- management/server/peer/peer.go
- management/server/peer_connections/store.go
- management/server/peer_connections/store_test.go
- management/server/peer_test.go
- management/server/store/file_store.go
- management/server/store/sql_store.go
- management/server/types/settings.go
- management/server/types/settings_ptr_equal_test.go
- management/server/types/settings_test.go
- shared/management/client/grpc.go
- shared/management/http/api/openapi.yml
- shared/management/http/api/types.gen.go
- shared/management/proto/management.proto
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (3)
- client/cmd/testutil_test.go
- client/android/peer_notifier.go
- management/server/types/settings_ptr_equal_test.go
| log.Warnf("lazy connection manager is enabled by management feature flag") | ||
| e.initLazyManager(ctx) | ||
| e.statusRecorder.UpdateLazyConnection(true) | ||
| return e.addPeersToLazyConnManager() | ||
| } else { | ||
| if e.lazyConnMgr == nil { | ||
| return nil | ||
| } | ||
| log.Infof("lazy connection manager is disabled by management feature flag") | ||
| newMode, newRelay, newP2P, newP2pRetry := resolveConnectionMode( | ||
| e.envMode, e.envRelayTimeout, e.cfgMode, e.cfgRelayTimeout, | ||
| e.cfgP2pTimeout, e.cfgP2pRetryMax, pc, | ||
| ) | ||
|
|
||
| if newMode == e.mode && newRelay == e.relayTimeoutSecs && | ||
| newP2P == e.p2pTimeoutSecs && newP2pRetry == e.p2pRetryMaxSecs { | ||
| return nil | ||
| } | ||
| prev := e.mode | ||
| e.mode = newMode | ||
| e.relayTimeoutSecs = newRelay | ||
| e.p2pTimeoutSecs = newP2P | ||
| e.p2pRetryMaxSecs = newP2pRetry | ||
| e.propagateP2pRetryMaxToConns() |
Resolved-mode/timeout fields are read concurrently without synchronization.
spMu protects the serverPushed* fields only. The "live" fields mode, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs are written here at Lines 297-300 (under engine syncMsgMux via the NetworkMap path) but read by:
- Mode()/RelayTimeout()/P2pTimeout()/P2pRetryMax() callers from arbitrary goroutines (e.g. the engine.scheduleRemoteOfflineClose timer callback at engine.go:2782, plus the Android UI / daemon-RPC paths the doc-comments call out).
- runDynamicInactivityLoop reads e.p2pTimeoutSecs/e.relayTimeoutSecs for its startup log line outside any lock.
This is a data race on a non-atomic connectionmode.Mode/uint32 write. Either extend spMu to cover the live fields too, or use atomic loads/stores for them so the accessors are safe to call without the engine mux.
Also applies to: 753-778
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/conn_mgr.go` around lines 286 - 301, The live config fields
(mode, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs) are written under
syncMsgMux but read concurrently (Mode(), RelayTimeout(), P2pTimeout(),
P2pRetryMax(), runDynamicInactivityLoop), causing a data race; fix by making
these fields updated/read with atomic ops instead of relying on spMu: change
their storage to atomic-compatible types (e.g. uint32/uint64 backing for
connectionmode.Mode and timeout seconds), use atomic.Store* in the update path
around resolveConnectionMode/propagateP2pRetryMaxToConns, and use atomic.Load*
in the accessor methods and in runDynamicInactivityLoop/startup logging so all
concurrent readers see safe, race-free values. Ensure all references to those
fields (including propagateP2pRetryMaxToConns and any callers listed) are
switched to the atomic loads/stores.
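A minimal sketch of the atomic-accessor option, assuming the mode enum has a uint32 underlying representation; the struct and field names here are illustrative, not the PR's:
import "sync/atomic"

// liveConnConfig models only the racy fields: the sync path uses
// Store under no extra lock, readers use Load from any goroutine,
// and the data race on the non-atomic fields disappears.
type liveConnConfig struct {
	mode      atomic.Uint32 // holds a connectionmode.Mode value
	relaySecs atomic.Uint32
	p2pSecs   atomic.Uint32
	retryMax  atomic.Uint32
}

func (c *liveConnConfig) SetMode(m uint32)     { c.mode.Store(m) }
func (c *liveConnConfig) Mode() uint32         { return c.mode.Load() }
func (c *liveConnConfig) RelayTimeout() uint32 { return c.relaySecs.Load() }
func (c *liveConnConfig) P2pTimeout() uint32   { return c.p2pSecs.Load() }
func (c *liveConnConfig) P2pRetryMax() uint32  { return c.retryMax.Load() }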
t := time.AfterFunc(remoteOfflineGracePeriod, func() {
	e.peerOfflineDebounceMu.Lock()
	delete(e.peerOfflineDebounce, pubKey)
	e.peerOfflineDebounceMu.Unlock()
	// Codex review: re-validate on fire. Several preconditions
	// must still hold:
	// 1. engine context not cancelled (Stop() in flight)
	// 2. connMgr still in p2p-dynamic mode (mode-switch racing)
	// 3. peer still has a peerConn AND status recorder still
	//    reports the peer as remote-offline (the live state
	//    could have flipped back without us cancelling — e.g.
	//    mgmt push for a different peer landed before this fire)
	// Without these checks the debounce fires blindly and can
	// tear down a perfectly good conn in any of those races.
	if e.ctx == nil || e.ctx.Err() != nil {
		return
	}
	if e.connMgr == nil || e.connMgr.Mode() != connectionmode.ModeP2PDynamic {
		return
	}
	if state, err := e.statusRecorder.GetPeer(pubKey); err == nil {
		if !state.RemoteServerLivenessKnown || state.RemoteLiveOnline {
			return
		}
	} else {
		return
	}
	conn, ok := e.peerStore.PeerConn(pubKey)
	if !ok {
		return
	}
	log.Infof("[peer: %s] remote went offline (debounced %s), closing local conn (p2p-dynamic)", pubKey, remoteOfflineGracePeriod)
	// Remote-offline close: keep the WG peer entry so that if the
	// remote comes back online and traffic flows, the route-mgr-
	// applied AllowedIPs are still in place. The lazy-mgr will
	// reactivate the peer through the activity listener.
	conn.Close(false, true)
})
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find all readers of ConnMgr.Mode()/RelayTimeout()/P2pTimeout()/P2pRetryMax()
# to confirm which ones run outside syncMsgMux.
rg -nP --type=go -C2 '\bconnMgr\.(Mode|RelayTimeout|P2pTimeout|P2pRetryMax)\b'
Repository: netbirdio/netbird
Length of output: 2018
🏁 Script executed:
#!/bin/bash
# Check function signatures and context for each call site
# Line 1360 context
echo "=== Line 1360 context ==="
sed -n '1350,1375p' client/internal/engine.go
echo -e "\n=== Line 1584 context ==="
sed -n '1570,1590p' client/internal/engine.go
echo -e "\n=== Lines 1821-1822 context ==="
sed -n '1810,1830p' client/internal/engine.go
echo -e "\n=== Line 2765-2804 timer context (full function) ==="
sed -n '2755,2810p' client/internal/engine.go
Repository: netbirdio/netbird
Length of output: 5498
🏁 Script executed:
#!/bin/bash
# Find the function containing line 1360
echo "=== Function containing line 1360 ==="
sed -n '1330,1365p' client/internal/engine.go | head -20
# Search for syncMsgMux lock patterns and their scope
echo -e "\n=== syncMsgMux usage patterns ==="
rg -n 'syncMsgMux\.(Lock|Unlock)' client/internal/engine.go
# Find where the lines in question are - identify function boundaries
echo -e "\n=== Looking for function defs near lines 1350-1380 ==="
sed -n '1320,1360p' client/internal/engine.go
Repository: netbirdio/netbird
Length of output: 2771
🏁 Script executed:
#!/bin/bash
# Find callers of updateNetworkMap
echo "=== Callers of updateNetworkMap ==="
rg -n 'updateNetworkMap\(' client/internal/engine.go
# Check the context around those call sites
echo -e "\n=== Context of updateNetworkMap calls ==="
sed -n '350,360p' client/internal/engine.go
echo "---"
sed -n '490,500p' client/internal/engine.go
# Find function containing createPeerConfig (line 1821-1822)
echo -e "\n=== Function def for lines 1821-1822 ==="
sed -n '1800,1830p' client/internal/engine.go
Repository: netbirdio/netbird
Length of output: 2151
🏁 Script executed:
#!/bin/bash
# Check context around line 1047 where updateNetworkMap is called
echo "=== Context around line 1047 (updateNetworkMap call) ==="
sed -n '1035,1055p' client/internal/engine.go
# Find function definition that contains line 1047
echo -e "\n=== Function containing line 1047 ==="
sed -n '1000,1050p' client/internal/engine.go | head -40
# Look at updateNetworkMap itself to see if it expects syncMsgMux
echo -e "\n=== Full updateNetworkMap function signature and first lines ==="
sed -n '1335,1450p' client/internal/engine.go
Repository: netbirdio/netbird
Length of output: 5980
🏁 Script executed:
#!/bin/bash
# Find function name containing line 1047
echo "=== Function name and signature for line 1047 ==="
sed -n '950,1047p' client/internal/engine.go | grep -A 100 "^func"
# Also check for any syncMsgMux locks in message handlers
echo -e "\n=== Check for syncMsgMux in OnSyncResponse or sync handlers ==="
rg -n 'func.*OnSync|func.*handleSync' client/internal/engine.go
# Check line 1584 context - is it in UpdatedRemotePeerConfig handler?
echo -e "\n=== Function containing line 1584 ==="
sed -n '1500,1590p' client/internal/engine.go | head -50
Repository: netbirdio/netbird
Length of output: 4180
🏁 Script executed:
#!/bin/bash
# Check where createPeerConfig is called
echo "=== Callers of createPeerConfig ==="
rg -n 'createPeerConfig\(' client/internal/engine.go
# Check context for those call sites
echo -e "\n=== addNewPeers function (likely caller of createPeerConfig) ==="
sed -n '1700,1750p' client/internal/engine.go | head -40
# Also check if line 1584 is indeed in updateNetworkMap
echo -e "\n=== Confirm line 1584 is in updateNetworkMap ==="
awk 'NR>=1335 && NR<=1600 {if (/^func /) print NR": "$0}' client/internal/engine.go
Repository: netbirdio/netbird
Length of output: 1684
🏁 Script executed:
#!/bin/bash
# Search for where lines 1821-1822 appear (in what function)
echo "=== Function containing line 1821 ==="
sed -n '1750,1830p' client/internal/engine.go | head -60
# Let's also search for function names around line 1800
echo -e "\n=== Function names near line 1800 ==="
awk 'NR>=1750 && NR<=1850 {if (/^func /) print NR": "$0}' client/internal/engine.go
# Now search for where connMgr reads happen in line 1584 context
echo -e "\n=== Verify line 1584 context (within updateNetworkMap or another function) ==="
awk 'NR>=1500 && NR<=1600 {if (/^func /) print NR": "$0}' client/internal/engine.go
Repository: netbirdio/netbird
Length of output: 2383
🏁 Script executed:
#!/bin/bash
# Find all function boundaries to understand which function contains line 1584
echo "=== All function definitions and their line numbers ==="
awk '/^func \(e \*Engine\)/ {print NR": "$0}' client/internal/engine.go | head -50
# Specifically check lines 1500-1700
echo -e "\n=== Function boundaries in range 1500-1700 ==="
awk 'NR>=1500 && NR<=1700 && /^func / {print NR": "$0}' client/internal/engine.go
Repository: netbirdio/netbird
Length of output: 3860
🏁 Script executed:
#!/bin/bash
# Check where addNewPeer is called from
echo "=== Callers of addNewPeer ==="
rg -n 'addNewPeer\(' client/internal/engine.go
# Check if addNewPeers is under syncMsgMux (which calls addNewPeer)
echo -e "\n=== Full addNewPeers function ==="
sed -n '1746,1756p' client/internal/engine.go
# Now check where addNewPeers is called
echo -e "\n=== Where addNewPeers is called ==="
rg -n 'addNewPeers\(' client/internal/engine.go -B 5 -A 1
Repository: netbirdio/netbird
Length of output: 1019
🏁 Script executed:
#!/bin/bash
# Verify line 1584 is indeed in updateNetworkMap and see the full context
echo "=== Full updateNetworkMap function (lines 1335-1606) ==="
sed -n '1335,1606p' client/internal/engine.go | tail -150
# Confirm the lock boundaries by checking handleSync
echo -e "\n=== handleSync lock scope verification ==="
sed -n '969,985p' client/internal/engine.go
Repository: netbirdio/netbird
Length of output: 6880
🏁 Script executed:
#!/bin/bash
# Final verification - check if UpdatedRemotePeerConfig modifies e.connMgr.mode
echo "=== UpdatedRemotePeerConfig writes ==="
rg -n 'UpdatedRemotePeerConfig' client/internal/engine.go -A 20
# Check connmgr.go for mode field access
echo -e "\n=== ConnMgr.Mode() implementation ==="
fd -e go conn_mgr.go | xargs grep -n "func.*Mode\|m\.mode\|\.mode ="Repository: netbirdio/netbird
Length of output: 1726
🏁 Script executed:
#!/bin/bash
# Check the Mode() implementation
echo "=== ConnMgr.Mode() full implementation ==="
sed -n '756,765p' client/internal/engine.go
# Check what field is being accessed
echo -e "\n=== Line 297 context (mode write) ==="
sed -n '290,310p' client/internal/engine.go
# Search for where connMgr.mode is written
echo -e "\n=== All writes to connMgr.mode or e.mode in engine.go ==="
rg -n 'e\.mode\s*=' client/internal/engine.go
# Check UpdatedRemotePeerConfig to see what it modifies
echo -e "\n=== UpdatedRemotePeerConfig in connmgr ==="
fd -e go conn_mgr.go | xargs grep -n 'UpdatedRemotePeerConfig' -A 30 | head -50
Repository: netbirdio/netbird
Length of output: 3193
🏁 Script executed:
#!/bin/bash
# Search directly in conn_mgr.go for the Mode() implementation
echo "=== Mode() implementation in conn_mgr.go ==="
fd -e go conn_mgr.go | xargs sed -n '756,765p'
# Also check the complete UpdatedRemotePeerConfig function
echo -e "\n=== UpdatedRemotePeerConfig function ==="
fd -e go conn_mgr.go | xargs grep -n 'func.*UpdatedRemotePeerConfig' -A 40
# Check if mode field is protected by a mutex
echo -e "\n=== ConnMgr struct fields ==="
fd -e go conn_mgr.go | xargs sed -n '40,110p'
Repository: netbirdio/netbird
Length of output: 4840
Timer goroutine reads connection-manager state without holding syncMsgMux.
The time.AfterFunc callback at line 2782 reads e.connMgr.Mode() without synchronization. Meanwhile, handleSync calls UpdatedRemotePeerConfig, which writes e.mode under syncMsgMux (line 297). The Mode() getter has no internal synchronization—it returns e.mode directly—so the timer observes unsynchronized reads/writes to shared mutable state during concurrent mode changes or engine shutdown.
Lock protection is inconsistent: spMu guards only the serverPushed* fields, not mode, relayTimeoutSecs, p2pTimeoutSecs, or p2pRetryMaxSecs. Other callers (lines 1360, 1366–1369, 1584, 1821–1822) are protected by syncMsgMux via handleSync, but the timer callback bypasses this.
Either guard the timer's reads with syncMsgMux, or make the ConnMgr accessors internally synchronized.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/engine.go` around lines 2765 - 2802, The timer callback reads
e.connMgr.Mode() without synchronization, causing a data race with
handleSync/UpdatedRemotePeerConfig which writes e.mode under syncMsgMux; fix by
acquiring e.syncMsgMux around the ConnMgr access in the time.AfterFunc callback
(wrap the nil check and Mode() call with e.syncMsgMux.Lock()/Unlock()), or
alternatively make ConnMgr.Mode() itself thread-safe; reference symbols:
time.AfterFunc callback, e.connMgr.Mode(), syncMsgMux, handleSync /
UpdatedRemotePeerConfig, and Mode().
case strings.Contains(trim, "lazyConnMgr.Close"),
	strings.Contains(trim, "activityManager.Close"),
	strings.Contains(trim, "im.Close"),
	strings.Contains(trim, "peerStore.Close"),
	strings.Contains(trim, "// "),
	!(strings.Contains(trim, "conn.Close(") || strings.Contains(trim, "peerConn.Close(")):
	continue
}
Inline-trailing-comments slip past the guard.
strings.Contains(trim, "// ") is intended to skip comment-only lines, but it also matches lines with trailing comments. For example, peerConn.Close(false) // todo add keepWgPeer contains // and is silently excluded — a regressed 1-arg call with any inline comment never gets flagged, defeating the test's purpose.
Detect leading-// instead so trailing comments don't bypass the check.
🛡️ Proposed fix
switch {
case strings.Contains(trim, "lazyConnMgr.Close"),
strings.Contains(trim, "activityManager.Close"),
strings.Contains(trim, "im.Close"),
strings.Contains(trim, "peerStore.Close"),
- strings.Contains(trim, "// "),
+ strings.HasPrefix(trim, "//"),
!(strings.Contains(trim, "conn.Close(") || strings.Contains(trim, "peerConn.Close(")):
continue
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
case strings.Contains(trim, "lazyConnMgr.Close"),
	strings.Contains(trim, "activityManager.Close"),
	strings.Contains(trim, "im.Close"),
	strings.Contains(trim, "peerStore.Close"),
	strings.HasPrefix(trim, "//"),
	!(strings.Contains(trim, "conn.Close(") || strings.Contains(trim, "peerConn.Close(")):
	continue
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/peer/conn_lazy_keepwgpeer_test.go` around lines 109 - 116,
The guard in the test currently uses strings.Contains(trim, "// ") which also
matches lines with trailing inline comments and lets regressed single-arg calls
slip through; update the condition in the case block that checks for comments to
detect only leading comment lines (e.g., replace the strings.Contains check with
a leading-check such as strings.HasPrefix(trim, "//")) so only full-line
comments are skipped while lines with trailing comments (like
peerConn.Close(false) // ...) are still evaluated by the
conn.Close/peerConn.Close checks.
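As a standalone illustration of the bug (inputs invented for the demo), compare how the two predicates classify a comment-only line versus a trailing-comment line:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	lines := []string{
		`// comment-only line`,
		`peerConn.Close(false) // todo add keepWgPeer`,
	}
	for _, l := range lines {
		trim := strings.TrimSpace(l)
		// Contains matches both lines, so a guard built on it skips the
		// second one entirely; HasPrefix matches only the first.
		fmt.Printf("%q Contains=%v HasPrefix=%v\n",
			trim, strings.Contains(trim, "// "), strings.HasPrefix(trim, "//"))
	}
}
```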
bindParams := udpmux.UniversalUDPMuxParams{
	UDPConn:   nbnet.WrapPacketConn(udpConn),
	Net:       t.transportNet,
	FilterFn:  t.filterFn,
	WGAddress: t.address,
	MTU:       t.mtu,
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how filterFn is set in device_kernel_unix.go
rg -n 'filterFn.*=' client/iface/device/device_kernel_unix.go
# Check for any sharedsock.NewIncomingSTUNFilter usage
rg -n 'sharedsock\.NewIncomingSTUNFilter' client/iface/device/device_kernel_unix.go
Repository: netbirdio/netbird
Length of output: 152
🏁 Script executed:
# Find the TunKernelDevice struct definition
rg -n 'type TunKernelDevice' client/iface/device/
# Search for filterFn field within the struct or its initialization
rg -n 'filterFn\s*:' client/iface/device/device_kernel_unix.go
# Search for where filterFn might be set as a field
rg -B5 -A5 'FilterFn:' client/iface/device/device_kernel_unix.go
Repository: netbirdio/netbird
Length of output: 411
🏁 Script executed:
# Read the device_kernel_unix.go file to see TunKernelDevice struct and filterFn initialization
head -150 client/iface/device/device_kernel_unix.go | cat -n
# Also search more broadly for filterFn assignments
rg -n 'filterFn' client/iface/device/device_kernel_unix.go
Repository: netbirdio/netbird
Length of output: 4633
🏁 Script executed:
# Check the full markdown file mentioned in the review to see all proposed changes
head -200 docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md | tail -150
Repository: netbirdio/netbird
Length of output: 4569
🏁 Script executed:
# Find udpmux.FilterFn type definition
rg -n 'type FilterFn' client/iface/
# Check if FilterFn can be nil in UniversalUDPMuxParams
rg -B5 -A10 'FilterFn:' client/iface/udpmux/
Repository: netbirdio/netbird
Length of output: 164
Clarify the role of FilterFn in the new separate-port approach.
The proposed code references t.filterFn in FilterFn: (line 53), but this field is never initialized and would be nil. This differs from the original code, which passes sharedsock.NewIncomingSTUNFilter() directly to sharedsock.Listen().
Confirm whether:
- FilterFn should remain nil (no filtering on the dedicated ICE port), OR
- FilterFn needs to be initialized with a filter function (and if so, what replaces sharedsock.NewIncomingSTUNFilter())
Without this clarification, it's unclear if the change is intentionally simplifying the filtering or if it's an incomplete migration.
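A sketch of the two options, using stand-in types rather than the real udpmux/sharedsock APIs (the STUN magic-cookie check mirrors what an incoming-STUN filter typically does):

```go
package main

import "fmt"

// Stand-ins for the types in the excerpt; the real udpmux/sharedsock
// APIs are not reproduced here.
type FilterFn func(pkt []byte) bool

type MuxParams struct {
	FilterFn FilterFn // nil means "accept everything"
}

// newIncomingSTUNFilter mimics the role of sharedsock.NewIncomingSTUNFilter:
// keep only packets carrying the STUN magic cookie at bytes 4..7.
func newIncomingSTUNFilter() FilterFn {
	return func(pkt []byte) bool {
		return len(pkt) >= 8 && pkt[4] == 0x21 && pkt[5] == 0x12 && pkt[6] == 0xA4 && pkt[7] == 0x42
	}
}

func accept(p MuxParams, pkt []byte) bool {
	if p.FilterFn == nil {
		return true // option 1: dedicated ICE port, no filtering
	}
	return p.FilterFn(pkt) // option 2: explicit STUN filter
}

func main() {
	stun := []byte{0, 1, 0, 0, 0x21, 0x12, 0xA4, 0x42}
	unfiltered := MuxParams{}
	filtered := MuxParams{FilterFn: newIncomingSTUNFilter()}
	fmt.Println(accept(unfiltered, stun), accept(filtered, stun), accept(filtered, []byte{1, 2, 3}))
}
```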
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` around lines
50 - 56, The FilterFn field in the udpmux.UniversalUDPMuxParams is currently set
to t.filterFn but that value is never initialized; decide whether the dedicated
ICE port should have no filtering or should use the same STUN filter previously
provided by sharedsock.NewIncomingSTUNFilter(). If you want no filtering leave
FilterFn nil but document this decision and remove any dead t.filterFn state; if
you need the STUN filter, initialize FilterFn with the equivalent function
(replace t.filterFn with sharedsock.NewIncomingSTUNFilter() or wire the filter
factory into the transport so t.filterFn is set during construction) and ensure
the constructor that creates t (or the call site that builds bindParams) sets
that function before creating udpmux.UniversalUDPMuxParams.
func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
	p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
	if err != nil {
		return nil, err
	}
	if p.AccountID != accountID {
		return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
	}
	return p, nil
}
Use a typed NotFound error and avoid leaking cross-account peer existence.
GetPeerByPeerPubKey is not scoped by accountID, so the cross-account check at line 1366 is the only barrier. Returning an untyped fmt.Errorf has two issues:
- The error message echoes back the caller-supplied pubKey, confirming the peer exists in some account — this is an information-disclosure leak across tenants when this method is reached from REST handlers (per the doc comment).
- Callers cannot distinguish "peer doesn't exist" from "peer exists in another account"; downstream HTTP handlers will likely surface this as 500 instead of 404.
Both branches should return the same typed not-found status:
🛡️ Proposed fix
func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
if err != nil {
return nil, err
}
if p.AccountID != accountID {
- return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
+ return nil, status.Errorf(status.NotFound, "peer not found")
}
return p, nil
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
	p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
	if err != nil {
		return nil, err
	}
	if p.AccountID != accountID {
		return nil, status.Errorf(status.NotFound, "peer not found")
	}
	return p, nil
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer.go` around lines 1361 - 1370, GetPeerByPubKey leaks
cross-account existence and returns an untyped fmt.Errorf; change
DefaultAccountManager.GetPeerByPubKey so both failure cases
(Store.GetPeerByPeerPubKey error indicating not found and the accountID
mismatch) return the same typed NotFound error instead of fmt.Errorf and do not
include the caller-supplied pubKey in the message. Use the project's canonical
not-found sentinel/type from the store package (e.g., store.ErrNotFound or
store.NewNotFoundError) when translating/store errors and when returning on
account mismatch so callers can distinguish not-found vs other errors and no
tenant information is exposed.
// Phase 3.7i (#5989): peer-self-reported runtime mode/timeouts. Stored
// alongside Hostname/Kernel/etc as meta. Empty when peer pre-dates 3.7i.
EffectiveConnectionMode   string `json:"effective_connection_mode,omitempty"`
EffectiveRelayTimeoutSecs uint32 `json:"effective_relay_timeout_secs,omitempty"`
EffectiveP2PTimeoutSecs   uint32 `json:"effective_p2p_timeout_secs,omitempty"`
EffectiveP2PRetryMaxSecs  uint32 `json:"effective_p2p_retry_max_secs,omitempty"`

// Phase 3.7i (#5989): capability keywords this client build advertises.
// Empty for peers that pre-date the field. Used by mgmt to decide
// whether to send legacy-compat fallback settings (e.g. downgrade to
// p2p-lazy when client lacks "p2p_dynamic"). See
// client/system/features.go for the canonical list.
SupportedFeatures []string `json:"supported_features,omitempty" gorm:"serializer:json"`
Include the new 3.7i fields in isEmpty().
UpdateMetaIfNew() returns early when meta.isEmpty() is true, but isEmpty() below still ignores Effective* and SupportedFeatures. A partial metadata update carrying only these new runtime/capability fields will be dropped and leave stale compatibility/runtime state behind.
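A trimmed-down sketch of the suggested change (the real PeerSystemMeta has many more fields; only the ones relevant here are modeled):

```go
package main

import "fmt"

// Trimmed-down PeerSystemMeta: only the fields relevant to the fix.
type PeerSystemMeta struct {
	Hostname                  string
	EffectiveConnectionMode   string
	EffectiveRelayTimeoutSecs uint32
	EffectiveP2PTimeoutSecs   uint32
	EffectiveP2PRetryMaxSecs  uint32
	SupportedFeatures         []string
}

// isEmpty must treat the 3.7i fields as substance, otherwise a
// capability-only update is dropped by UpdateMetaIfNew.
func (p PeerSystemMeta) isEmpty() bool {
	return p.Hostname == "" &&
		p.EffectiveConnectionMode == "" &&
		p.EffectiveRelayTimeoutSecs == 0 &&
		p.EffectiveP2PTimeoutSecs == 0 &&
		p.EffectiveP2PRetryMaxSecs == 0 &&
		len(p.SupportedFeatures) == 0
}

func main() {
	update := PeerSystemMeta{SupportedFeatures: []string{"p2p_dynamic"}}
	fmt.Println(update.isEmpty()) // false: the update is no longer discarded
}
```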
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/peer/peer.go` around lines 142 - 154, The isEmpty() method
currently ignores the new Phase 3.7i fields so UpdateMetaIfNew() can wrongly
return early; update isEmpty() to consider EffectiveConnectionMode,
EffectiveRelayTimeoutSecs, EffectiveP2PTimeoutSecs, EffectiveP2PRetryMaxSecs and
SupportedFeatures as making the meta non-empty (i.e., return false if any
Effective* field is non-zero/non-empty or SupportedFeatures has elements) so
partial updates that only carry these runtime/capability fields are applied.
b082536 to e8a298f · Compare
Actionable comments posted: 5
♻️ Duplicate comments (27)
shared/management/client/grpc.go (2)
483-494: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Run msgHandler before triggering snapshot callback.
On Line 483, invoking onSnapshotRequest before applying the same SyncResponse can trigger a snapshot from pre-update state. Move snapshot handling after msgHandler (and only when handler succeeds).
Suggested ordering fix
-	if req := decryptedResp.GetSnapshotRequest(); req != nil {
-		c.snapMu.Lock()
-		cb := c.onSnapshotRequest
-		c.snapMu.Unlock()
-		if cb != nil {
-			cb(req.GetNonce())
-		}
-	}
-
 	if err := msgHandler(decryptedResp); err != nil {
 		log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
+		continue
+	}
+
+	if req := decryptedResp.GetSnapshotRequest(); req != nil {
+		c.snapMu.Lock()
+		cb := c.onSnapshotRequest
+		c.snapMu.Unlock()
+		if cb != nil {
+			cb(req.GetNonce())
+		}
 	}
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/client/grpc.go` around lines 483 - 494, The snapshot callback is being invoked before the SyncResponse is applied; change the ordering so you call msgHandler(decryptedResp) first and only if it returns nil then acquire c.snapMu, read c.onSnapshotRequest and invoke the callback with req.GetNonce(); ensure you still check decryptedResp.GetSnapshotRequest() for non-nil but move that block after the msgHandler success path and preserve the use of c.snapMu when accessing c.onSnapshotRequest.
968-970: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Do not return nil metadata when info == nil.
Line 969 still drops SupportedFeatures and effective connection settings when system.Info is unavailable, which makes this client look legacy and can misapply fallback logic. Always return a PeerSystemMeta with capability/effective-config fields populated.
Suggested fix
 func infoToMetaData(info *system.Info, eff EffectiveConnConfig) *proto.PeerSystemMeta {
+	meta := &proto.PeerSystemMeta{
+		EffectiveConnectionMode:   eff.Mode,
+		EffectiveRelayTimeoutSecs: eff.RelayTimeoutSecs,
+		EffectiveP2PTimeoutSecs:   eff.P2PTimeoutSecs,
+		EffectiveP2PRetryMaxSecs:  eff.P2PRetryMaxSecs,
+		SupportedFeatures:         system.SupportedFeatures(),
+	}
 	if info == nil {
-		return nil
+		return meta
 	}
@@
-	return &proto.PeerSystemMeta{
+	meta.Hostname = info.Hostname
+	meta.GoOS = info.GoOS
+	meta.OS = info.OS
+	meta.Core = info.OSVersion
+	meta.OSVersion = info.OSVersion
+	meta.Platform = info.Platform
+	meta.Kernel = info.Kernel
+	meta.NetbirdVersion = info.NetbirdVersion
+	meta.UiVersion = info.UIVersion
+	meta.KernelVersion = info.KernelVersion
+	meta.NetworkAddresses = addresses
+	meta.SysSerialNumber = info.SystemSerialNumber
+	meta.SysManufacturer = info.SystemManufacturer
+	meta.SysProductName = info.SystemProductName
+	meta.Environment = &proto.Environment{
+		Cloud:    info.Environment.Cloud,
+		Platform: info.Environment.Platform,
+	}
+	meta.Files = files
+	meta.Flags = &proto.Flags{
+		RosenpassEnabled:      info.RosenpassEnabled,
+		RosenpassPermissive:   info.RosenpassPermissive,
+		ServerSSHAllowed:      info.ServerSSHAllowed,
+		DisableClientRoutes:   info.DisableClientRoutes,
+		DisableServerRoutes:   info.DisableServerRoutes,
+		DisableDNS:            info.DisableDNS,
+		DisableFirewall:       info.DisableFirewall,
+		BlockLANAccess:        info.BlockLANAccess,
+		BlockInbound:          info.BlockInbound,
+		LazyConnectionEnabled: info.LazyConnectionEnabled,
+	}
-	...
-	SupportedFeatures: system.SupportedFeatures(),
-	}
+	return meta
 }
Also applies to: 1026-1035
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/client/grpc.go` around lines 968 - 970, The function infoToMetaData currently returns nil when info == nil; instead always construct and return a proto.PeerSystemMeta populated with capability and effective-connection fields so the client does not appear legacy. Update infoToMetaData (and the analogous block around the second occurrence) to create a PeerSystemMeta even if info is nil: set SupportedFeatures (or the equivalent capabilities field) from defaults or an empty slice and populate the EffectiveConnConfig fields from the eff parameter so effective connection settings are always conveyed. Ensure the returned value is a non-nil *proto.PeerSystemMeta with SupportedFeatures and EffectiveConnConfig fields filled.
client/internal/conn_state_pusher_test.go (1)
199-205: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Wait for disabled latch, not only the first notification
At Line 202, sink.notif can fire before the Unimplemented error is processed, so the test may proceed before p.disabled is latched and become flaky.
Suggested fix
-	// Wait for the very first push (initial snapshot), which gets the
-	// Unimplemented error back and latches `disabled`.
-	select {
-	case <-sink.notif:
-	case <-time.After(500 * time.Millisecond):
-		t.Fatal("timed out waiting for first push attempt")
-	}
+	// Wait until the first push attempt is observed and disabled is latched.
+	deadline := time.After(500 * time.Millisecond)
+	for !p.disabled.Load() {
+		select {
+		case <-sink.notif:
+			// keep waiting until Unimplemented has been handled
+		case <-deadline:
+			t.Fatal("timed out waiting for pusher to latch disabled after first push attempt")
+		}
+	}
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher_test.go` around lines 199 - 205, The test currently proceeds on receiving sink.notif which may occur before the Unimplemented error handling latches p.disabled; change the wait to ensure the disabled latch is set: after receiving from sink.notif, block until p.disabled is observed (e.g., poll/check p.disabled or wait on a dedicated disabled channel/condition) with the same timeout (500ms) so the test only continues once p.disabled has been latched following the Unimplemented error.
client/internal/stdnet/filter_test.go (1)
51-65: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
This regression test still doesn't verify most want: true cases.
vEthernet (LAN)andvEthernet (External)are filtered by thevethprefix today, but this test still passes because those rows are never asserted symmetrically. Either skip this test on non-Windows or gate the Windows-only rows, then comparegot != c.wantfor every case.Proposed fix
 func TestInterfaceFilter_Windows_TargetedFiltering(t *testing.T) {
+	if runtime.GOOS != "windows" {
+		t.Skip("Windows-specific interface-name filtering")
+	}
+
 	disallow := []string{"wt", "wg", "veth", "br-", "lo", "docker"}
 	allow := InterfaceFilter(disallow)
@@
 	for _, c := range cases {
-		// The wgctrl branch can override on hosts where NetBird is
-		// running; tests run on a host where these names are not
-		// real interfaces, so the final return faithfully reflects
-		// the disallow-list logic.
 		got := allow(c.name)
-		// "veth*" prefix only filters on non-Windows; on Linux test
-		// runners "vEthernet (LAN)" still passes because of mixed
-		// case + the !Windows branch keeping the prefix match.
-		if !c.want && got {
-			t.Errorf("InterfaceFilter(%q) = true, want false (should be filtered)", c.name)
-		}
-		if c.want && !got && runtime.GOOS == "windows" && c.name == "vEthernet (LAN)" {
-			t.Fatalf("InterfaceFilter(%q) = false, want true on Windows (this is uray-mic-d4's default-route interface)", c.name)
+		if got != c.want {
+			t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want)
 		}
 	}
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter_test.go` around lines 51 - 65, The test only asserts one-sided failures; change the loop to assert symmetry by checking if got != c.want and calling t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want) so every case is verified. To preserve the special Windows-only expectations, skip or gate rows like "vEthernet (LAN)" and "vEthernet (External)" when runtime.GOOS != "windows" (e.g., if runtime.GOOS != "windows" && (c.name == "vEthernet (LAN)" || c.name == "vEthernet (External)") { continue }) so the allow(...) check and the unified got!=c.want assertion run correctly across platforms; update the references around the allow(...) call and the existing runtime.GOOS handling accordingly.
client/internal/stdnet/filter.go (1)
51-76: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Windows still filters Local Area Connection-style NICs.
"lo"prefix rejects any Windows adapter whose name starts with"Lo". That means interfaces likeLocal Area Connectionare still dropped before the Windows-specific exceptions can help, which recreates the false-negative this change was meant to fix.Proposed fix
-	// Linux/macOS loopback prefix ("lo", "lo0").
-	if strings.HasPrefix(lowerIFace, "lo") {
+	// Linux/macOS loopback prefix ("lo", "lo0").
+	if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") {
 		return false
 	}
@@
-	if sLower == "veth" && runtime.GOOS == "windows" {
+	if runtime.GOOS == "windows" && (sLower == "veth" || sLower == "lo") {
 		continue
 	}
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter.go` around lines 51 - 76, The current early check uses strings.HasPrefix(lowerIFace, "lo") which incorrectly filters Windows adapters like "Local Area Connection"; update that check to skip the "lo" prefix test on Windows (i.e., only treat "lo" as loopback when runtime.GOOS != "windows"). Also in the disallowList loop (symbols: disallowList, sLower) ensure if sLower == "lo" you similarly continue/skip when runtime.GOOS == "windows" so Windows NICs starting with "Lo" are not dropped; keep the existing special-case for "veth" unchanged.
management/server/http/handlers/accounts/accounts_handler_test.go (1)
408-426: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Indentation on lines 418-419 still appears inconsistent with surrounding fields.
Despite the previous comment being marked addressed, the displayed code shows LegacyLazyFallbackEnabled/LegacyLazyFallbackTimeoutSeconds with extra leading tabs vs. the rest of the api.AccountSettings{} literal (LazyConnectionEnabled, DnsDomain, …). If this still fails CI gofmt, please run gofmt -w on the file. The same pattern exists in the older tt table at lines 124-125, 152-153, 180-181, 208-209, 236-237, 264-265 — worth verifying those too.
LegacyLazyFallbackEnabled/LegacyLazyFallbackTimeoutSecondswith extra leading tabs vs. the rest of theapi.AccountSettings{}literal (LazyConnectionEnabled,DnsDomain, …). If this still fails CI gofmt, please rungofmt -won the file. The same pattern exists in the oldertttable at lines 124-125, 152-153, 180-181, 208-209, 236-237, 264-265 — worth verifying those too.#!/bin/bash # Confirm whether gofmt would reformat the file fd -e go accounts_handler_test.go --exec gofmt -d {} \;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler_test.go` around lines 408 - 426, The struct literal for api.AccountSettings contains inconsistent indentation for the fields LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds (they are indented with extra tabs compared to neighboring fields like LazyConnectionEnabled and DnsDomain); run gofmt (or manually align those two lines to match the surrounding field indentation) and also check the older tt table occurrences of LegacyLazyFallbackEnabled/LegacyLazyFallbackTimeoutSeconds mentioned in the comment (lines where tt entries exist) to ensure consistent indentation so the file passes gofmt.
management/server/peer.go (2)
1297-1356: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Return an authorization error (not status.Internal) when the user has no policy access to the peer.
"No access to this peer" is an expected authorization outcome when the user lacks peers:read, isn't an admin/service user, isn't the owner, and has no policy-reachable own peer to the target. Returning status.Internal here surfaces as a 5xx to API callers instead of a proper 403/404. Use status.PermissionDenied (or status.NotFound to also avoid existence disclosure for accounts whose peer ids are guessable).
🛡️ Proposed fix
- return nil, status.Errorf(status.Internal, "user %s has no access to peer %s under account %s", userID, peer.ID, accountID)
+ return nil, status.Errorf(status.PermissionDenied, "user %s has no access to peer %s under account %s", userID, peer.ID, accountID)
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer.go` around lines 1297 - 1356, The final error returned from checkIfUserOwnsPeer uses status.Internal but should be an authorization error; in checkIfUserOwnsPeer (called by GetPeer) replace the final return that currently does return nil, status.Errorf(status.Internal, ...) with an authorization-level status (e.g. status.PermissionDenied or status.NotFound) to surface a 403/404 instead of a 5xx; keep the same message or adjust to a generic denial message and ensure the change is applied in the checkIfUserOwnsPeer function so callers of GetPeer receive the correct permission error.
1358-1370: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
GetPeerByPubKey cross-account branch leaks tenant info and returns an untyped error.
Store.GetPeerByPeerPubKey is not scoped by accountID, so the cross-account check at line 1366 is the only barrier. Two issues remain:
- The error message echoes the caller-supplied pubKey and confirms the peer exists in some account → cross-tenant info disclosure when reached from REST handlers (per the doc comment).
- fmt.Errorf is untyped, so HTTP handlers cannot distinguish "not found" from "in another account" and will surface this as 5xx rather than 404.
🛡️ Proposed fix
 func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
 	p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
 	if err != nil {
 		return nil, err
 	}
 	if p.AccountID != accountID {
-		return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
+		return nil, status.Errorf(status.NotFound, "peer not found")
 	}
 	return p, nil
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer.go` around lines 1358 - 1370, GetPeerByPubKey leaks tenant data and returns an untyped error; change the cross-account branch in DefaultAccountManager.GetPeerByPubKey (which calls Store.GetPeerByPeerPubKey) to not echo pubKey or account details and to return a typed "not found" error that callers can detect (use an existing sentinel like store.ErrNotFound or define ErrPeerNotFound and return fmt.Errorf("%w", store.ErrNotFound) or errors.New/Wrap with that sentinel) so REST handlers get a 404 instead of 5xx and no tenant information is revealed.
client/internal/peer/ice_backoff.go (1)
199-210: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
SetMaxBackoff(0) still leaves stale suspension in place.
When d == 0 (backoff disabled via management push), s.suspended and s.nextRetry are not cleared, so IsSuspended() will keep blocking retries until the previously computed deadline despite backoff being turned off. Same fix as previously suggested applies.
🐛 Proposed fix
 func (s *iceBackoffState) SetMaxBackoff(d time.Duration) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	if d == s.maxBackoff {
 		return
 	}
 	s.maxBackoff = d
+	if d == 0 {
+		s.failures = 0
+		s.suspended = false
+		s.nextRetry = time.Time{}
+		s.bo = buildBackoff(0)
+		return
+	}
 	s.bo = buildBackoff(d)
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/ice_backoff.go` around lines 199 - 210, SetMaxBackoff currently updates s.maxBackoff and s.bo but when d == 0 it leaves a previous suspension and deadline in place so IsSuspended() still blocks; while holding the mutex in SetMaxBackoff, if d == 0 clear the suspension state by setting s.suspended = false and s.nextRetry = time.Time{} (zero value) before rebuilding s.bo (call buildBackoff(0) or nil as your backoff factory expects) so the failure counter is preserved but any stale suspension is removed; reference symbols: SetMaxBackoff, s.suspended, s.nextRetry, s.bo, buildBackoff, IsSuspended.
management/server/http/handlers/accounts/accounts_handler.go (1)
231-283: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Validation errors still need status.InvalidArgument wrapping.
fmt.Errorf at lines 234, 280 and inside validateUint32Timeout (513, 516) yields untyped errors, so util.WriteError surfaces them as 5xx. Bad input on connection_mode, legacy_lazy_fallback_timeout_seconds, and the new uint32 timeout fields will return Internal Server Error instead of 4xx validation responses. Wrap them at one of the two layers (handler-side, or inside validateUint32Timeout itself).
💡 Minimal fix at the validator boundary
 if req.Settings.ConnectionMode != nil {
 	modeStr := string(*req.Settings.ConnectionMode)
 	if !req.Settings.ConnectionMode.Valid() {
-		return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+		return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
 	}
@@
 	v := *req.Settings.LegacyLazyFallbackTimeoutSeconds
 	if v < 60 || v > 86400 {
-		return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+		return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
 	}
@@
 func validateUint32Timeout(name string, v int64) (uint32, error) {
 	if v < 0 {
-		return 0, fmt.Errorf("invalid %s: %d (must be >= 0)", name, v)
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (must be >= 0)", name, v)
 	}
 	if v > int64(math.MaxUint32) {
-		return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
 	}
 	return uint32(v), nil
 }
If you prefer keeping validateUint32Timeout as plain-error and wrapping at the handler, also update the corresponding test in validate_uint32_timeout_test.go accordingly.
Also applies to: 505-519
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler.go` around lines 231 - 283, The handler and validator are returning plain fmt.Errorf which surfaces as 5xx; change these validation errors to gRPC InvalidArgument errors so util.WriteError returns 4xx. Either (A) update validateUint32Timeout to return status.Errorf(codes.InvalidArgument, ...) for its range/parse errors (so callers like the handler get a properly-typed error), or (B) wrap the handler-side errors before returning (wrap the fmt.Errorf at the connection_mode check in the block handling req.Settings.ConnectionMode and the legacy_lazy_fallback_timeout_seconds check around req.Settings.LegacyLazyFallbackTimeoutSeconds with status.Errorf(codes.InvalidArgument, ...)). Reference validateUint32Timeout, req.Settings.ConnectionMode, returnSettings.ConnectionMode, and req.Settings.LegacyLazyFallbackTimeoutSeconds when making the change.
client/internal/profilemanager/config.go (1)
200-202: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
JSON tag for P2pRetryMaxSeconds is still inconsistent with neighboring fields.
ConnectionMode, RelayTimeoutSeconds, and P2pTimeoutSeconds use json:",omitempty" (PascalCase Go-field-name keys, matching every other field in Config), but P2pRetryMaxSeconds uniquely uses json:"p2p_retry_max_seconds,omitempty". This produces a mixed PascalCase + one snake_case key in the on-disk JSON and is a future-maintenance hazard for anyone reading or hand-editing the profile.
♻️ Proposed fix
 RelayTimeoutSeconds uint32 `json:",omitempty"`
 P2pTimeoutSeconds   uint32 `json:",omitempty"`
-P2pRetryMaxSeconds  uint32 `json:"p2p_retry_max_seconds,omitempty"`
+P2pRetryMaxSeconds  uint32 `json:",omitempty"`
ConnectionModeinstead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/profilemanager/config.go` around lines 200 - 202, The JSON tag for the struct field P2pRetryMaxSeconds is inconsistent with its neighbors; update the struct tags so they are consistent—either change P2pRetryMaxSeconds' tag to use the same PascalCase-empty-omitempty style (json:",omitempty") to match RelayTimeoutSeconds, P2pTimeoutSeconds and ConnectionMode, or (if snake_case is intended) change RelayTimeoutSeconds, P2pTimeoutSeconds and ConnectionMode to use the snake_case form so all timeout/connection fields use the same pattern; adjust the tag on the P2pRetryMaxSeconds field (and the other timeout/ConnectionMode tags if choosing snake_case) accordingly.
management/server/store/sql_store.go (1)
1641-1663: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Validate DB timeout/meta values before narrowing them to uint32.
These int64 -> uint32 casts still silently wrap negative or oversized values into unrelated timeouts. A bad row in Postgres would come back as a seemingly valid RelayTimeoutSeconds/P2pRetryMaxSecs/meta timeout instead of failing fast. Please range-check each NullInt64 (>= 0 && <= math.MaxUint32) before assigning, and return an error on invalid data.
Also applies to: 1922-1929
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/store/sql_store.go` around lines 1641 - 1663, Validate each sql NullInt64 before casting to uint32: check sRelayTimeoutSeconds, sP2pTimeoutSeconds, sP2pRetryMaxSecs (and the other block at lines 1922-1929) for value >= 0 and <= math.MaxUint32; if out of range return a descriptive error instead of silently casting, otherwise perform the cast and assign to account.Settings.RelayTimeoutSeconds / P2pTimeoutSeconds / P2pRetryMaxSecs; likewise validate sLegacyLazyFallbackTimeoutSecs before converting it to uint32 and return an error on invalid DB data.
client/internal/conn_mgr.go (3)
297-300: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don’t persist a lazy/dynamic mode when Rosenpass blocks that mode.
By the time the Rosenpass guard returns,
e.mode and the timeout fields have already been updated to the pushed lazy/dynamic values. That leaves the rest of ConnMgr observing managed-mode semantics even though the manager was never started. Reject or normalize the resolved mode before mutating the live state so Mode() and mode-dependent paths never advertise an unsupported configuration.
Also applies to: 322-326
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 297 - 300, Compute and validate/normalize the resolved mode before mutating ConnMgr state: instead of assigning newMode/newRelay/newP2P/newP2pRetry directly to e.mode, e.relayTimeoutSecs, e.p2pTimeoutSecs and e.p2pRetryMaxSecs, run the Rosenpass guard (or the existing validation logic) against the candidate newMode and only if allowed assign the lazy/dynamic values; if the guard rejects it, normalize to an allowed fallback (e.g., the supported static/managed mode) and only then mutate e.mode and the timeout fields so Mode() and mode-dependent paths never observe an unsupported configuration. Apply the same change to the other assignment block referenced (the similar statements at lines 322-326).
287-340: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Restart the lazy manager when only inactivity thresholds change.
initLazyManager() snapshots relayTimeoutSecs/p2pTimeoutSecs into manager.Config, but UpdatedRemotePeerConfig() only tears the manager down on modeChanged. If the server keeps the same managed mode and changes just one timeout, the running inactivity manager keeps the old thresholds until a later mode flip or process restart.
Also applies to: 601-611
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 287 - 340, resolveConnectionMode can change inactivity timeouts without changing mode, but the code only restarts the lazy manager when modeChanged; this leaves the running manager using stale thresholds. Detect when wasManaged && isManaged and any of newRelay, newP2P or newP2pRetry differ from e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs (i.e. timeoutChanged) and, similar to the existing mode-change handling for switching managed types, call e.closeManager(ctx), e.statusRecorder.UpdateLazyConnection(false), then e.initLazyManager(ctx), e.startModeSideEffects(), and return e.resetPeersToLazyIdle(ctx) so the new manager config and inactivity timers take effect immediately; reference resolveConnectionMode, initLazyManager, closeManager, propagateP2pRetryMaxToConns, startModeSideEffects, and resetPeersToLazyIdle.
35-45: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Protect the resolved live mode/timeout fields from concurrent reads.
spMu only covers the serverPushed* snapshot. The live fields (mode, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs) are still written in UpdatedRemotePeerConfig() and read concurrently from runDynamicInactivityLoop() and the public accessors with no synchronization, so this remains racy. Either guard them with the same lock or switch them to atomic-backed storage.
Also applies to: 231-240, 297-300, 756-778
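A minimal sketch of the atomic-backed option; the connMgr type below is a stand-in for the real struct, not its actual API:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Stand-in connMgr: the timeout fields are atomics, so the dynamic
// inactivity loop and public accessors can read them without holding
// the lock that the config-update path takes while writing.
type connMgr struct {
	mode             atomic.Value // stores string
	relayTimeoutSecs atomic.Uint32
	p2pTimeoutSecs   atomic.Uint32
}

func (m *connMgr) applyServerPush(mode string, relay, p2p uint32) {
	m.mode.Store(mode)
	m.relayTimeoutSecs.Store(relay)
	m.p2pTimeoutSecs.Store(p2p)
}

func (m *connMgr) Mode() string {
	if v := m.mode.Load(); v != nil {
		return v.(string)
	}
	return "" // Load before the first Store returns nil
}

func main() {
	m := &connMgr{}
	m.applyServerPush("p2p-dynamic", 120, 60)
	fmt.Println(m.Mode(), m.relayTimeoutSecs.Load(), m.p2pTimeoutSecs.Load())
}
```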
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 35 - 45, The live config fields mode, relayTimeoutSecs, p2pTimeoutSecs and p2pRetryMaxSecs are written in UpdatedRemotePeerConfig() but read concurrently from runDynamicInactivityLoop() and the public accessors, so make them concurrency-safe: either (A) guard all reads and writes with the existing spMu (use spMu in UpdatedRemotePeerConfig(), runDynamicInactivityLoop() and the accessor methods that reference these fields), or (B) convert the fields to atomic-backed storage (e.g., atomic.Value for mode and atomic.Uint32/uint64 for the timeouts) and update UpdatedRemotePeerConfig(), runDynamicInactivityLoop() and the accessors to use atomic loads/stores; apply the same change to the other affected groups mentioned (lines ~231-240, ~297-300, ~756-778) to eliminate the data race.
client/internal/peer/conn_lazy_keepwgpeer_test.go (1)
108-116: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Inline-trailing-comment lines bypass the arity check — duplicate of past open review.
strings.Contains(trim, "// ") at line 113 also matches lines like peerConn.Close(false) // keepWgPeer omitted, silently skipping them before the two-argument check fires. The fix from the prior review still applies.
🛡️ Proposed fix
- strings.Contains(trim, "// "),
+ strings.HasPrefix(trim, "//"),
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn_lazy_keepwgpeer_test.go` around lines 108 - 116, The test's switch uses strings.Contains(trim, "// ") which wrongly matches lines with code followed by inline comments (e.g., "peerConn.Close(false) // keepWgPeer omitted"), skipping the two-argument check; update the condition in the switch that currently references strings.Contains(trim, "// ") to instead detect comment-only lines (for example use strings.HasPrefix(strings.TrimSpace(trim), "//") or check that strings.Index(trim, "//") == 0) so only lines that are purely comments are skipped; keep the rest of the checks (conn.Close( and peerConn.Close() checks) intact and ensure you modify the case inside the same switch that references the variable trim.
client/internal/lazyconn/manager/manager.go (1)
102-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Legacy-fallback selector conflates "not migrated" with explicit zero-timeouts — duplicate of past open review.
iceTO == 0 && relayTO == 0 is true for two distinct cases:
- A caller that hasn't set any of the new thresholds (intended fallback to Phase-1 single-timer).
- A caller that explicitly sets both to 0 to disable all teardowns.
inactivity.NewManager(wgIface, nil). The correct discriminant is the presence of the deprecatedInactivityThresholdfield.🛡️ Proposed fix
-	if iceTO == 0 && relayTO == 0 {
-		// Phase 1 / single-timer fallback when caller hasn't migrated.
-		m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold) //nolint:staticcheck // intentional Phase-1 single-timer fallback
-	} else {
+	if config.InactivityThreshold != nil && iceTO == 0 && relayTO == 0 { //nolint:staticcheck
+		// Phase 1 / single-timer fallback: deprecated field present, new fields absent.
+		m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold)
+	} else {
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/lazyconn/manager/manager.go` around lines 102 - 109, The current conditional treats iceTO==0 && relayTO==0 as the indicator of the legacy single-timer fallback, which conflates "caller didn't migrate" with "caller explicitly disabled teardowns"; instead, base the legacy fallback on presence of the deprecated InactivityThreshold field. Change the condition around config.resolvedTimeouts()/InactivityThreshold so that if config.InactivityThreshold is non-nil you call inactivity.NewManager(wgIface, config.InactivityThreshold) (the legacy single-timer path), otherwise use inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) — this ensures explicit zero timeouts remain the two-timer (disable) case; reference wgIface, config.resolvedTimeouts(), config.InactivityThreshold, inactivity.NewManager and inactivity.NewManagerWithTwoTimers when making the change.
management/server/peer_connections/store.go (1)
141-155: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Expire stale entries before the nonce gate.
Returning on since before the TTL check means refresh-only callers never evict expired snapshots, so dead entries can live forever in the map.
Suggested fix
 func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	e, ok := s.maps[peerPubKey]
 	if !ok {
 		return nil, false
 	}
-	if since > 0 && e.m.GetInResponseToNonce() < since {
-		return nil, false
-	}
 	if s.clock.Now().Sub(e.updatedAt) > s.ttl {
 		delete(s.maps, peerPubKey)
 		return nil, false
 	}
+	if since > 0 && e.m.GetInResponseToNonce() < since {
+		return nil, false
+	}
 	return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer_connections/store.go` around lines 141 - 155, In MemoryStore.GetWithNonceCheck, expired entries are checked after the nonce gate causing stale entries to persist; change the order so the TTL eviction runs before the nonce comparison: while holding the lock in GetWithNonceCheck, call s.clock.Now().Sub(e.updatedAt) > s.ttl and delete from s.maps when expired (using the existing delete logic) before evaluating the since / e.m.GetInResponseToNonce() check so refresh-only callers will not retain expired snapshots.
management/server/peer/peer.go (2)
199-203: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Compare SupportedFeatures as a set, not a slice.
slices.Equal makes two equivalent capability sets compare unequal when protobuf serialization order changes, which will keep churning peer metadata unnecessarily.
Suggested fix
 func (p PeerSystemMeta) isEqual(other PeerSystemMeta) bool {
+	pFeatures := slices.Clone(p.SupportedFeatures)
+	otherFeatures := slices.Clone(other.SupportedFeatures)
+	sort.Strings(pFeatures)
+	sort.Strings(otherFeatures)
+
 	sort.Slice(p.NetworkAddresses, func(i, j int) bool {
 		return p.NetworkAddresses[i].Mac < p.NetworkAddresses[j].Mac
 	})
@@
 		p.Environment.Cloud == other.Environment.Cloud &&
 		p.Environment.Platform == other.Environment.Platform &&
 		p.EffectiveConnectionMode == other.EffectiveConnectionMode &&
 		p.EffectiveRelayTimeoutSecs == other.EffectiveRelayTimeoutSecs &&
 		p.EffectiveP2PTimeoutSecs == other.EffectiveP2PTimeoutSecs &&
 		p.EffectiveP2PRetryMaxSecs == other.EffectiveP2PRetryMaxSecs &&
-		slices.Equal(p.SupportedFeatures, other.SupportedFeatures) &&
+		slices.Equal(pFeatures, otherFeatures) &&
 		p.Flags.isEqual(other.Flags)
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer/peer.go` around lines 199 - 203, The comparison currently uses slices.Equal on p.SupportedFeatures vs other.SupportedFeatures which fails when protobuf reorders capabilities; change the comparison in the equality logic to treat SupportedFeatures as a set by converting each slice into a map/set (or by making a stable sort before comparing) and then compare those sets for equality; update the equality check that references p.SupportedFeatures and other.SupportedFeatures (the surrounding code using slices.Equal) to use the set-based comparison so logically equivalent feature lists no longer appear different.
142-154: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Include the new 3.7i fields in isEmpty().
UpdateMetaIfNew() still returns early when only Effective* or SupportedFeatures is present, so capability-only/runtime-only updates get dropped and stale peer metadata can persist.
Suggested fix
 func (p PeerSystemMeta) isEmpty() bool {
 	return p.Hostname == "" && p.GoOS == "" && p.Kernel == "" && p.Core == "" &&
 		p.Platform == "" && p.OS == "" && p.OSVersion == "" && p.WtVersion == "" &&
 		p.UIVersion == "" && p.KernelVersion == "" && len(p.NetworkAddresses) == 0 &&
 		p.SystemSerialNumber == "" && p.SystemProductName == "" && p.SystemManufacturer == "" &&
 		p.Environment.Cloud == "" && p.Environment.Platform == "" &&
-		len(p.Files) == 0
+		len(p.Files) == 0 &&
+		p.EffectiveConnectionMode == "" &&
+		p.EffectiveRelayTimeoutSecs == 0 &&
+		p.EffectiveP2PTimeoutSecs == 0 &&
+		p.EffectiveP2PRetryMaxSecs == 0 &&
+		len(p.SupportedFeatures) == 0
 }
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer/peer.go` around lines 142 - 154, The isEmpty() predicate currently ignores the new 3.7i fields so UpdateMetaIfNew() can incorrectly return early; update the Peer.isEmpty() implementation to treat any non-empty EffectiveConnectionMode string, any non-zero EffectiveRelayTimeoutSecs/EffectiveP2PTimeoutSecs/EffectiveP2PRetryMaxSecs, or a non-empty SupportedFeatures slice as non-empty (i.e., return false when any of those are present) so capability-only or runtime-only updates are not dropped by UpdateMetaIfNew().
client/internal/engine.go (1)
2765-2783: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Guard the timer's connMgr reads with syncMsgMux.
e.connMgr.Mode()outside the lock that protectsUpdatedRemotePeerConfig(), so the remote-offline debounce path races with mode changes and shutdown.Suggested fix
t := time.AfterFunc(remoteOfflineGracePeriod, func() { e.peerOfflineDebounceMu.Lock() delete(e.peerOfflineDebounce, pubKey) e.peerOfflineDebounceMu.Unlock() @@ - if e.ctx == nil || e.ctx.Err() != nil { + e.syncMsgMux.Lock() + ctx := e.ctx + mode := connectionmode.ModeUnspecified + if e.connMgr != nil { + mode = e.connMgr.Mode() + } + e.syncMsgMux.Unlock() + + if ctx == nil || ctx.Err() != nil { return } - if e.connMgr == nil || e.connMgr.Mode() != connectionmode.ModeP2PDynamic { + if mode != connectionmode.ModeP2PDynamic { return }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/engine.go` around lines 2765 - 2783, The timer callback reads e.connMgr.Mode() (and other engine state) without holding the engine's syncMsgMux, causing races with UpdatedRemotePeerConfig()/mode switches and shutdown; fix by acquiring e.syncMsgMux (or the engine's message sync mutex) before accessing e.connMgr or other mutable engine state in the debounce function and release it after the checks so the Mode() call and any related reads are performed under the same lock that UpdatedRemotePeerConfig() uses (refer to e.connMgr, e.syncMsgMux, e.ctx and the timer callback closing over pubKey/peerOfflineDebounce).
client/ui/peers_tab.go (1)
60-68: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Keep render() off the Fyne event thread.
showFull.OnChanged and the exported Refresh callback still invoke a function that does getSrvClient and a 5-second Status RPC synchronously, so the whole window can hang when the daemon is slow or unreachable.
Suggested fix
-	showFull.OnChanged = func(_ bool) { render() }
+	asyncRender := func() { go render() }
+	showFull.OnChanged = func(_ bool) { asyncRender() }
@@
-	return peersTabBundle{Content: content, ShowFull: showFull, Refresh: render}
+	return peersTabBundle{Content: content, ShowFull: showFull, Refresh: asyncRender}
Also applies to: 98-98, 126-126
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/peers_tab.go` around lines 60 - 68, render() (and the callbacks wired to showFull.OnChanged and the exported Refresh) perform blocking work (getSrvClient and conn.Status RPC) on the Fyne event thread; move that blocking work into a background goroutine and only marshal UI updates back onto the Fyne main thread using fyne.Do (or RunOnMain). Specifically, wrap the getSrvClient/Status call sequence (the conn := getSrvClient(...) and st, err := conn.Status(...) logic) in a goroutine started from render(), showFull.OnChanged handler, and Refresh, capture results/error, then call fyne.Do to set summary.SetText, update widgets, and any other UI state so the UI never blocks while waiting for the 5s RPC.
management/internals/shared/grpc/server.go (1)
497-500: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Nil out snapshotCh when the router closes it.
continuehere turns thehandleUpdatesloop into a tight spin until the stream exits.Suggested fix
case nonce, ok := <-snapshotCh: if !ok { + snapshotCh = nil continue }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/server.go` around lines 497 - 500, The select case receiving from snapshotCh in handleUpdates currently does "if !ok { continue }", which causes a tight spin because a closed channel remains selectable; change the handling so that when the receive returns !ok you set snapshotCh = nil (or otherwise remove it from the select) to stop further selects on the closed channel and avoid busy-looping, keeping the rest of handleUpdates behavior unchanged. Ensure you reference the variable snapshotCh in the select case and update any related logic that assumes snapshotCh may be non-nil.
client/internal/peer/status.go (3)
470-471: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Capture the conn-state callback before unlocking.
These sites still call
notifyConnStateChange(...) after d.mux.Unlock(), so the read of d.connStateListener races with SetConnStateListener. Build the closure while the mutex is still held, then return/invoke it after unlock.
Also applies to: 696-697, 764-765, 812-813, 863-864
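The capture-then-invoke pattern in a self-contained form (recorder is a simplified stand-in for the status recorder):

```go
package main

import (
	"fmt"
	"sync"
)

type recorder struct {
	mux      sync.Mutex
	state    string
	listener func(string)
}

func (r *recorder) SetListener(f func(string)) {
	r.mux.Lock()
	defer r.mux.Unlock()
	r.listener = f
}

// update snapshots the listener while the mutex is held, then invokes
// it after unlock, so the read never races with SetListener and the
// callback cannot deadlock by re-entering the recorder.
func (r *recorder) update(state string) {
	r.mux.Lock()
	r.state = state
	cb := r.listener // captured under the lock
	r.mux.Unlock()

	if cb != nil {
		cb(state)
	}
}

func main() {
	r := &recorder{}
	r.SetListener(func(s string) { fmt.Println("state:", s) })
	r.update("connected")
}
```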
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 470 - 471, The code currently calls notifyConnStateChange(...) while the mutex is already unlocked, causing a race on d.connStateListener; instead, capture the callback under the mutex (e.g. read d.connStateListener into a local closure/variable inside the critical section where hasConnStatusChanged(...) is checked and before calling d.mux.Unlock()), then release the lock and invoke the captured closure or call notifyConnStateChange using that local reference; apply the same pattern at the other occurrences mentioned (around lines where hasConnStatusChanged/notifyConnStateChange are used and SetConnStateListener may run).
692-693: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
notifyPeerStateChangeListeners is called without the required lock.
This helper immediately reads
d.peers/d.changeNotify via snapshotRouterPeersLocked, and its own docstring says the caller must hold d.mux. Moving these calls below Unlock() reintroduces the concurrent-access bug the earlier fix was addressing.
Suggested fix
-	d.mux.Unlock()
-
 	if materialICE {
 		d.notifyPeerStateChangeListeners(receivedState.PubKey)
 	}
+	d.mux.Unlock()
Mirror the same ordering in the relay path.
Also applies to: 760-761
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 692 - 693, The call to notifyPeerStateChangeListeners is being made without holding d.mux even though that helper (and snapshotRouterPeersLocked it calls) requires the caller to hold d.mux; fix by mirroring the relay-path ordering: acquire d.mux, call notifyPeerStateChangeListeners while still holding the lock (before calling Unlock()), then release the lock — ensure the same pattern is applied at both locations mentioned (around notifyPeerStateChangeListeners and snapshotRouterPeersLocked) so accesses to d.peers and d.changeNotify happen while d.mux is held.
542-544: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don’t fire peerListChanged while d.mux is held.
notifyPeerListChanged() synchronously enters notifier/listener code. With the current defer d.mux.Unlock(), any listener that re-enters Status can deadlock this path. Snapshot numPeers under the lock, release it, then notify.
Also applies to: 565-566
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 542 - 544, The code currently calls d.notifyPeerListChanged() and d.notifyPeerStateChangeListeners(pubKey) while holding d.mux (defer d.mux.Unlock()), which can deadlock if listeners re-enter Status; instead, inside the critical section capture the minimal state needed (e.g., snapshot numPeers and any other values used by the listeners, and the pubKey flag) then release d.mux and invoke d.notifyPeerListChanged() and d.notifyPeerStateChangeListeners(pubKey) after unlocking; apply the same change to the other occurrence around lines that call those notify methods (the 565-566 block) so notifications happen outside the lock.
management/internals/shared/grpc/conversion.go (1)
315-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Only mark liveness authoritative when rPeer.Status is present.
cfg.ServerLivenessKnown = true is still outside the rPeer.Status != nil guard. When the status record is missing, clients now treat LiveOnline=false as authoritative offline instead of using the legacy fallback heuristic.
Suggested fix
if rPeer.Status != nil { if !rPeer.Status.LastSeen.IsZero() { cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen) } cfg.LiveOnline = rPeer.Status.Connected + cfg.ServerLivenessKnown = true } - cfg.ServerLivenessKnown = true dst = append(dst, cfg)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/conversion.go` around lines 315 - 326, The ServerLivenessKnown flag is being set unconditionally which falsely marks liveness authoritative when rPeer.Status is nil; move the assignment of cfg.ServerLivenessKnown = true inside the rPeer.Status != nil guard (the same block that sets cfg.LastSeenAtServer and cfg.LiveOnline) so that ServerLivenessKnown is only true when rPeer.Status is present and clients will fall back to the legacy LastSeenAtServer heuristic when status is missing.
🧹 Nitpick comments (4)
client/iface/bind/activity.go (1)
98-122: 💤 Low value
Callback dispatch path is correct; reaffirm the "MUST be cheap" contract at registration sites.
The CAS-gated single-fire is a nice way to dedupe under packet-burst races: only the goroutine that wins the `CompareAndSwap` invokes `cb`, and the captured `cb` snapshot avoids holding the lock across the user callback.
The previously raised concern about the synchronous invocation on the WG read/write goroutine has been addressed by callers (per the prior thread, commits 2ef31c0–b07184d). Just note that the contract documented here on lines 117–119 is now load-bearing — any future `SetOnActivity` registration that calls into peer-state machinery directly (rather than dispatching to its own goroutine or a buffered channel) will reintroduce the original contention risk. Worth a brief comment on the engine-side registration site reaffirming the non-blocking expectation, if not already present; a sketch follows.
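For reference, a hedged sketch of a registration that honors that contract, assuming the handler forwards to a buffered channel (the names and buffer size here are illustrative, not the repo's actual wiring):
```go
package main

import "fmt"

// activityEvents decouples the recorder's hot path from peer-state
// machinery; 128 is an arbitrary buffer size for this sketch.
var activityEvents = make(chan string, 128)

// onActivity is the shape of handler the contract asks for: it only
// attempts a non-blocking send, so the WG read/write goroutine can
// never stall on it.
func onActivity(peerKey string) {
	select {
	case activityEvents <- peerKey:
	default: // buffer full: drop rather than block packet I/O
	}
}

func main() {
	done := make(chan struct{})
	go func() { // slow peer-state work lives on its own goroutine
		fmt.Println("activity from", <-activityEvents)
		close(done)
	}()
	onActivity("peerA")
	<-done
}
```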
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/iface/bind/activity.go` around lines 98 - 122, The CAS-protected callback path in ActivityRecorder.record correctly ensures only the winning goroutine invokes the callback (cb/onActivity), but you must explicitly reaffirm the "handler MUST be cheap or self-defer" contract at registration sites: update the SetOnActivity (or whichever code assigns ActivityRecorder.onActivity) to include a short comment and/or documentation that any callback registered must not block the WG read/write goroutine and should either spawn its own goroutine or enqueue to a buffered channel; reference ActivityRecorder.record, onActivity, and SetOnActivity in that note so future changes do not reintroduce blocking in the hot path.
client/internal/stdnet/filter.go (1)
47-92: ⚡ Quick win
Split the closure into small helpers before merge.
This function now mixes platform-specific rules, disallow-list matching, and `wgctrl` probing in one branchy closure, which is why Sonar is still flagging it for cognitive complexity. Extracting those checks into helpers (sketched below) should clear the gate and make regressions like the Windows `"lo"` case much easier to spot.
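A minimal sketch of that decomposition; the helper names (`isLoopbackLike`, `matchesDisallowList`) follow the suggestion in the prompt below and are assumptions, not the repo's current API:
```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// isLoopbackLike handles the "lo"/"lo0" prefix rule; Windows adapter
// names like "Local Area Connection" must not be caught by it.
func isLoopbackLike(name string) bool {
	return runtime.GOOS != "windows" && strings.HasPrefix(strings.ToLower(name), "lo")
}

// matchesDisallowList applies the configured prefixes, skipping the
// Unix-only tokens that false-positive on Windows.
func matchesDisallowList(name string, disallow []string) bool {
	lower := strings.ToLower(name)
	for _, token := range disallow {
		t := strings.ToLower(token)
		if runtime.GOOS == "windows" && (t == "veth" || t == "lo") {
			continue
		}
		if strings.HasPrefix(lower, t) {
			return true
		}
	}
	return false
}

// InterfaceFilter becomes pure orchestration; the real version would
// add a third helper wrapping the wgctrl device probe.
func InterfaceFilter(disallow []string) func(string) bool {
	return func(name string) bool {
		return !isLoopbackLike(name) && !matchesDisallowList(name, disallow)
	}
}

func main() {
	allow := InterfaceFilter([]string{"wt", "wg", "veth", "br-", "lo", "docker"})
	fmt.Println(allow("eth0"), allow("wg0")) // true false
}
```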
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter.go` around lines 47 - 92, The InterfaceFilter closure is doing platform checks, disallow-list prefix logic, and wgctrl probing in one function; split it into small helpers: implement isLoopbackLike(name string) bool (handle "lo"/"lo0" and Windows/ios exceptions), isWindowsKnownBad(name string) bool (use windowsKnownBadSubstrings), matchesDisallowList(name string, disallowList []string) bool (apply the veth/Windows special-case and ios exception), and isWireGuardDevice(name string) (wrap wgctrl.New(), Device lookup and Close); then refactor InterfaceFilter to call these helpers (keeping the same return semantics) so the closure contains only orchestration and cognitive complexity is reduced while preserving behavior of InterfaceFilter, windowsKnownBadSubstrings, and wgctrl probing.
client/internal/peer/conn.go (1)
1143-1231: 💤 Low value
Minor doc inaccuracy in the `AttachICEOnRelayActivity` gate 6 comment.
Gate 6 says "we had P2P at least once — avoids pointless retries for peers we never reached P2P with", but `everConnected` is also set in `onRelayConnectionIsReady` (line 656), so it's true after the first relay connection too, not exclusively P2P. The actual intent (distinguishing "ICE detached by design" from "never connected") is correct; only the comment is misleading.
📝 Suggested comment update
```diff
-// 6. everConnected must be true (we had P2P at least once -- avoids
-//    pointless retries for peers we never reached P2P with)
+// 6. everConnected must be true (peer has established at least one
+//    connection — relay or P2P — distinguishing "ICE detached for
+//    inactivity" from "brand new peer whose ICE listener hasn't been
+//    registered yet")
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn.go` around lines 1143 - 1231, Update the gate-6 comment in AttachICEOnRelayActivity to accurately describe everConnected: note that everConnected is set when any connection (including relay via onRelayConnectionIsReady) was established, not only P2P, and that the check is meant to distinguish peers that previously connected (P2P or relay) from peers that never connected; reference AttachICEOnRelayActivity, everConnected, and onRelayConnectionIsReady so the reader can see where everConnected is set.
client/internal/peer/conn_handover_order_test.go (1)
135-176: 💤 Low value
`extractFunctionBody`'s brace counter is fragile but acceptable for these tests.
The helper counts `{`/`}` without awareness of string literals or comment-enclosed braces. For the functions currently under test (`onICEConnectionIsReady`, `onGuardEvent`, `onWGDisconnected`, `onICEStateDisconnected`) this is fine, but a future function containing a string literal like `s := "map: {key: val}"` would give a wrong body slice. Worth noting for anyone who extends this helper; a literal-aware variant is sketched below.
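A literal-aware counter could look like this sketch (hypothetical helper; comment handling is omitted for brevity):
```go
package main

import "fmt"

// braceDepthDelta counts '{'/'}' on one line while skipping braces
// inside string and rune literals.
func braceDepthDelta(line string) int {
	depth := 0
	var inLiteral byte // 0 = none; otherwise '\'', '"' or '`'
	for i := 0; i < len(line); i++ {
		c := line[i]
		switch {
		case inLiteral != 0:
			if c == '\\' && inLiteral != '`' {
				i++ // skip the escaped character
			} else if c == inLiteral {
				inLiteral = 0
			}
		case c == '"' || c == '\'' || c == '`':
			inLiteral = c
		case c == '{':
			depth++
		case c == '}':
			depth--
		}
	}
	return depth
}

func main() {
	fmt.Println(braceDepthDelta(`s := "map: {key: val}"`)) // 0, not +1
}
```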
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn_handover_order_test.go` around lines 135 - 176, The brace counter in extractFunctionBody is fragile because it counts '{'/'}' even when they appear inside string literals or comments; update extractFunctionBody to ignore braces found inside single-quoted, double-quoted or backtick strings and inside // line-comments and /* block-comments (i.e., implement a small lexer-state machine around the existing loop), keeping the existing depth/openSeen logic and return behavior unchanged so the helper still extracts the function body correctly for onICEConnectionIsReady, onGuardEvent, onWGDisconnected, onICEStateDisconnected and future functions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3c4d3e39-19dd-4193-ae26-ba105f0bdd06
⛔ Files ignored due to path filters (2)
- `client/proto/daemon.pb.go` is excluded by `!**/*.pb.go`
- `shared/management/proto/management.pb.go` is excluded by `!**/*.pb.go`
📒 Files selected for processing (60)
- `.gitignore`
- `client/android/client.go`
- `client/android/peer_notifier.go`
- `client/android/preferences.go`
- `client/android/preferences_clamp_test.go`
- `client/iface/bind/activity.go`
- `client/iface/device/endpoint_manager.go`
- `client/internal/conn_mgr.go`
- `client/internal/conn_state_pusher.go`
- `client/internal/conn_state_pusher_material_test.go`
- `client/internal/conn_state_pusher_test.go`
- `client/internal/conn_state_pusher_testhelper_test.go`
- `client/internal/engine.go`
- `client/internal/engine_offline_debounce_test.go`
- `client/internal/lazyconn/activity/listener_bind_test.go`
- `client/internal/lazyconn/manager/manager.go`
- `client/internal/peer/conn.go`
- `client/internal/peer/conn_handover_order_test.go`
- `client/internal/peer/conn_lazy_keepwgpeer_test.go`
- `client/internal/peer/guard/guard.go`
- `client/internal/peer/guard/guard_test.go`
- `client/internal/peer/guard/ice_retry_state_test.go`
- `client/internal/peer/ice_backoff.go`
- `client/internal/peer/ice_backoff_test.go`
- `client/internal/peer/status.go`
- `client/internal/peer/status_debounce_test.go`
- `client/internal/peer/status_remote_meta_notify_test.go`
- `client/internal/peerstore/store.go`
- `client/internal/profilemanager/config.go`
- `client/internal/stdnet/filter.go`
- `client/internal/stdnet/filter_test.go`
- `client/proto/daemon.proto`
- `client/system/features.go`
- `client/system/features_test.go`
- `client/ui/network.go`
- `client/ui/peers_tab.go`
- `docs/bugs/2026-05-04-user-peer-visibility-regression.md`
- `docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md`
- `management/internals/shared/grpc/conversion.go`
- `management/internals/shared/grpc/conversion_test.go`
- `management/internals/shared/grpc/server.go`
- `management/server/account.go`
- `management/server/activity/codes.go`
- `management/server/http/handlers/accounts/accounts_handler.go`
- `management/server/http/handlers/accounts/accounts_handler_test.go`
- `management/server/http/handlers/accounts/validate_uint32_timeout_test.go`
- `management/server/peer.go`
- `management/server/peer/peer.go`
- `management/server/peer_connections/store.go`
- `management/server/peer_connections/store_test.go`
- `management/server/peer_test.go`
- `management/server/store/file_store.go`
- `management/server/store/sql_store.go`
- `management/server/types/settings.go`
- `management/server/types/settings_ptr_equal_test.go`
- `management/server/types/settings_test.go`
- `shared/management/client/grpc.go`
- `shared/management/http/api/openapi.yml`
- `shared/management/http/api/types.gen.go`
- `shared/management/proto/management.proto`
🚧 Files skipped from review as they are similar to previous changes (3)
- management/server/types/settings_ptr_equal_test.go
- .gitignore
- client/internal/conn_state_pusher.go
👮 Files not reviewed due to content moderation or server errors (13)
- shared/management/http/api/types.gen.go
- client/proto/daemon.proto
- shared/management/proto/management.proto
- management/server/store/file_store.go
- management/server/types/settings_test.go
- management/server/peer_connections/store_test.go
- client/internal/engine_offline_debounce_test.go
- client/internal/peer/status_remote_meta_notify_test.go
- management/server/peer_test.go
- docs/bugs/2026-05-04-user-peer-visibility-regression.md
- management/server/activity/codes.go
- management/server/account.go
- client/ui/network.go
```go
// Phase 3.5 (#5989): reset ICE backoff + recreate workerICE on network change.
// Set before Start() is called so the goroutine sees it without races.
if !skipICE {
	conn.guard.SetOnNetworkChange(conn.onNetworkChange)
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that guard.Start calls the network-change callback synchronously
# from its own goroutine (not in a sub-goroutine), which is the prerequisite
# for the deadlock.
rg -n "networkChange\|NetworkChange\|OnNetworkChange\|SetOnNetworkChange" --type=go -C3Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Check if the file exists and examine its actual content around the referenced lines
ls -la client/internal/peer/conn.go 2>/dev/null && echo "File exists" || echo "File not found"
# Check file size to understand scope
wc -l client/internal/peer/conn.go
# Read the actual content around lines 265-270
head -280 client/internal/peer/conn.go | tail -30
Repository: netbirdio/netbird
Length of output: 1390
🏁 Script executed:
# Search for any guard-related callback setup in the codebase
find . -name "*.go" -type f | head -20 | xargs grep -l "SetOnNetworkChange\|onNetworkChange" 2>/dev/null || echo "No matches found"
# Also search more broadly for guard-related patterns
rg "guard\." --type=go | head -20Repository: netbirdio/netbird
Length of output: 1821
🏁 Script executed:
# Read guard.go to understand SetOnNetworkChange
cat client/internal/peer/guard/guard.go | head -150
Repository: netbirdio/netbird
Length of output: 5339
🏁 Script executed:
# Find the onNetworkChange function in conn.go
rg -n "func.*onNetworkChange" client/internal/peer/conn.go -A 15Repository: netbirdio/netbird
Length of output: 521
🏁 Script executed:
# Find the Close() function and examine defer structure
rg -n "func.*Close\(\)" client/internal/peer/conn.go -A 50 | head -80Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Search for conn.wg.Wait() in conn.go
rg -n "wg\.Wait\(\)" client/internal/peer/conn.go -B 5 -A 5Repository: netbirdio/netbird
Length of output: 461
🏁 Script executed:
# Find where onNetworkChange is actually called in guard.go
rg -n "onNetworkChange\|\.onNetworkChange" client/internal/peer/guard/guard.go -B 2 -A 2Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Get the full Close() function to see defer structure
rg -n "func.*\(conn \*Conn\) Close" client/internal/peer/conn.go -A 70 | head -120Repository: netbirdio/netbird
Length of output: 2571
🏁 Script executed:
# Check if there's an onGuardEvent callback pattern mentioned in the review
rg -n "onGuardEvent\|eventCallback" client/internal/peer/conn.go -B 2 -A 2 | head -40Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Get the full reconnectLoopWithRetry function with complete select statement
sed -n '118,250p' client/internal/peer/guard/guard.go
Repository: netbirdio/netbird
Length of output: 3204
🏁 Script executed:
# Double-check by searching for srReconnectedChan case in the select
rg -n "srReconnectedChan\|case.*srWatcher\|case.*Reconnected" client/internal/peer/guard/guard.go -B 1 -A 3Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Verify guard goroutine is added to conn.wg
rg -n "guard.Start\|go.*guard\|conn.guard" client/internal/peer/conn.go -B 2 -A 2 | head -30Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Check if guard is started and in what goroutine
rg -n "func.*Start.*context\|guard.Start" client/internal/peer/conn.go -B 1 -A 3Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Look for where guard is started in relation to wg
sed -n '280,295p' client/internal/peer/conn.go
Repository: netbirdio/netbird
Length of output: 452
🏁 Script executed:
# Check what srWatcher does and if it can fire srReconnectedChan
rg -n "NewListener\|Listener\|srWatcher" client/internal/peer/guard/sr_watcher.go | head -20Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Verify conn.onGuardEvent exists to confirm the pattern
rg -n "func.*onGuardEvent\|eventCallback" client/internal/peer/conn.go -B 1 -A 5 | head -30Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Check sr_watcher.go directly
ls -la client/internal/peer/guard/sr_watcher.go
head -100 client/internal/peer/guard/sr_watcher.go
Repository: netbirdio/netbird
Length of output: 2515
🏁 Script executed:
# Find onGuardEvent definition/usage
rg -n "onGuardEvent" client/internal/peer/ -B 2 -A 5 | head -40Repository: netbirdio/netbird
Length of output: 3119
🏁 Script executed:
# Verify SRWatcher has actual event sending logic
cat client/internal/peer/guard/sr_watcher.go | tail -100
Repository: netbirdio/netbird
Length of output: 2364
🏁 Script executed:
# Find onGuardEvent definition to understand existing callback pattern
rg -n "func.*onGuardEvent" client/internal/peer/conn.go -A 20 | head -40Repository: netbirdio/netbird
Length of output: 1347
🏁 Script executed:
# Check if onICEFailed/onICEConnected exist and access iceBackoff without lock
rg -n "onICEFailed\|onICEConnected" client/internal/peer/conn.go -B 2 -A 5 | head -50Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Verify srReconnectedChan can have pending events when ctx.Done() fires
# Check if notify() is called from onReconnected/onICEChanged
rg -n "notify\|onReconnected\|onICEChanged" client/internal/peer/guard/sr_watcher.goRepository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Check the actual ctx/cancel flow to confirm sequence
rg -n "ctxCancel\|ctx.Done\|context.WithCancel" client/internal/peer/conn.go | head -20Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Final verification: confirm srReconnectedChan can have events queued
# Check the notify function calls notify() sends to buffered channels
grep -n "case listener <-" client/internal/peer/guard/sr_watcher.goRepository: netbirdio/netbird
Length of output: 96
🏁 Script executed:
# Verify the complete defer order in Close to ensure mu.Unlock runs after wg.Wait
sed -n '312,376p' client/internal/peer/conn.go | grep -n "defer\|conn.wg.Wait"
Repository: netbirdio/netbird
Length of output: 139
🏁 Script executed:
# Verify iceBackoff access patterns
rg -n "iceBackoff\." client/internal/peer/conn.go | head -30Repository: netbirdio/netbird
Length of output: 1614
🏁 Script executed:
# Check if iceBackoff has internal synchronization (mutex)
fd -t f "backoff" --type go | xargs grep -l "type.*Backoff\|iceBackoff" | head -5
Repository: netbirdio/netbird
Length of output: 231
🏁 Script executed:
# Look for iceBackoff type definition
rg -n "type.*iceBackoff\|iceBackoff.*struct" client/internal/peer/ -B 2 -A 5 | head -40Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Find iceBackoff type definition in the codebase
rg -n "type.*[Ii]ceBackoff" --type=go -B 1 -A 5Repository: netbirdio/netbird
Length of output: 514
🏁 Script executed:
# Check where iceBackoff is instantiated to understand its type
rg -n "iceBackoff\s*=" client/internal/peer/conn.go | head -5Repository: netbirdio/netbird
Length of output: 246
🏁 Script executed:
# Look for onICEFailed and onICEConnected callback definitions
rg -n "func.*onICEFailed\|func.*onICEConnected" client/internal/peer/conn.go -A 10 | head -40Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Confirm the iceBackoffState mutex protects all methods
cat client/internal/peer/ice_backoff.go | head -120
Repository: netbirdio/netbird
Length of output: 4044
🏁 Script executed:
# Verify markFailure and IsSuspended acquire the mutex
rg -n "func.*markFailure\|func.*IsSuspended\|func.*markSuccess" client/internal/peer/ice_backoff.go -A 3Repository: netbirdio/netbird
Length of output: 43
🏁 Script executed:
# Get the rest of ice_backoff.go to see all method definitions
tail -80 client/internal/peer/ice_backoff.go
Repository: netbirdio/netbird
Length of output: 3224
🏁 Script executed:
# Search more carefully for all iceBackoffState methods
rg -n "func \(s \*iceBackoffState\)" client/internal/peer/ice_backoff.goRepository: netbirdio/netbird
Length of output: 440
New onNetworkChange callback introduces a deadlock with Close()'s mutex-held wg.Wait().
Close() holds conn.mu for the entire duration of conn.wg.Wait() (line 374) because defer conn.mu.Unlock() (line 315) runs only after the function returns. The guard goroutine lives in conn.wg. After the new SetOnNetworkChange(conn.onNetworkChange) registration (line 268), if the guard's select simultaneously has ctx.Done() and an srReconnectChan event ready (when signal/relay reconnect), Go may choose the reconnect case and call onNetworkChange(), which immediately tries conn.mu.Lock() (line 1461). That goroutine blocks forever, conn.wg.Wait() never returns, and the daemon hangs.
The existing onGuardEvent callback was safe because it never acquires conn.mu. onNetworkChange is the first guard callback to do so, breaking the invariant.
Minimal mitigation—add a non-blocking ctx check before taking the lock in onNetworkChange:
🔒 Proposed fix
```diff
 func (conn *Conn) onNetworkChange() {
+	// Fast-path: if the connection is already being closed (ctx cancelled),
+	// skip the work to avoid deadlocking with Close(), which holds conn.mu
+	// while waiting for the guard goroutine via wg.Wait().
+	select {
+	case <-conn.ctx.Done():
+		return
+	default:
+	}
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
 	if conn.ctx.Err() != nil {
 		return
 	}
```
A structural fix (release `conn.mu` before `conn.wg.Wait()`) eliminates the root cause entirely but requires restructuring the `Close()` defers; a minimal sketch follows.
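A hedged sketch of that restructuring, with a minimal stand-in `Conn` (field names are assumptions; the real struct is much larger):
```go
package main

import (
	"context"
	"sync"
)

// Minimal stand-in for the real Conn.
type Conn struct {
	mu        sync.Mutex
	wg        sync.WaitGroup
	ctx       context.Context
	ctxCancel context.CancelFunc
}

// Close sketches the structural fix: mutate state under the lock,
// then wait for goroutines with the lock released, so a guard
// callback that needs conn.mu can finish instead of deadlocking.
func (conn *Conn) Close() {
	conn.mu.Lock()
	conn.ctxCancel() // wake the guard goroutine
	// ... tear down workers here ...
	conn.mu.Unlock() // release BEFORE waiting

	conn.wg.Wait()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	c := &Conn{ctx: ctx, ctxCancel: cancel}
	c.wg.Add(1)
	go func() { // stand-in guard goroutine
		defer c.wg.Done()
		<-c.ctx.Done()
		c.mu.Lock() // safe: Close no longer holds mu while waiting
		c.mu.Unlock()
	}()
	c.Close()
}
```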
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/peer/conn.go` around lines 265 - 270, The onNetworkChange
callback can deadlock because it attempts conn.mu.Lock() while Close() holds
conn.mu during conn.wg.Wait(); modify onNetworkChange to first perform a
non-blocking check of the connection's context (or a closed/closing flag) and
return immediately if the context is done or connection is closing, before
attempting to acquire conn.mu, so the guard goroutine will not block trying to
lock; locate the callback registered via
conn.guard.SetOnNetworkChange(conn.onNetworkChange) and add the early ctx.Done()
(or connIsClosing check) guard at the top of onNetworkChange, preserving
existing behavior when not closing.
```go
IceBackoffFailures:  int32(peerState.IceBackoffFailures),
IceBackoffNextRetry: timestamppb.New(peerState.IceBackoffNextRetry),
IceBackoffSuspended: peerState.IceBackoffSuspended,
```
Preserve unset semantics for `IceBackoffNextRetry`.
`timestamppb.New(time.Time{})` serializes as `0001-01-01T00:00:00Z`, not “missing”. For peers with no scheduled retry, clients will see a non-nil timestamp unless this field is guarded the same way as `LastSeenAtServer`.
Suggested fix
```diff
 	SshHostKey:         peerState.SSHHostKey,
 	IceBackoffFailures: int32(peerState.IceBackoffFailures),
-	IceBackoffNextRetry: timestamppb.New(peerState.IceBackoffNextRetry),
 	IceBackoffSuspended: peerState.IceBackoffSuspended,
 	// Phase 3.7i (`#5989`): per-peer remote meta fields.
 	ServerOnline: peerState.ServerOnline,
@@
 	ConnectionTypeExtended: DeriveConnectionTypeExtended(peerState),
 }
+if !peerState.IceBackoffNextRetry.IsZero() {
+	pbPeerState.IceBackoffNextRetry = timestamppb.New(peerState.IceBackoffNextRetry)
+}
 if !peerState.RemoteLastSeenAtServer.IsZero() {
 	pbPeerState.LastSeenAtServer = timestamppb.New(peerState.RemoteLastSeenAtServer)
 }
```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/peer/status.go` around lines 1750 - 1752, The field
IceBackoffNextRetry currently always sets a timestamp via
timestamppb.New(peerState.IceBackoffNextRetry) which yields a non-nil zero time
instead of preserving "unset" semantics; change the assignment for
IceBackoffNextRetry to mirror how LastSeenAtServer is handled by checking if
peerState.IceBackoffNextRetry.IsZero() (or equivalent) and only calling
timestamppb.New(...) when non-zero, otherwise leave the proto field nil so
clients see it as missing.
---

### Task 1: Replace sharedsock with standard UDP socket in TunKernelDevice.Up()
Fix heading-level jump to satisfy markdownlint MD001.
Line 13 jumps from higher-level context to ### directly; use ## here (or add an intermediate heading) to keep heading levels incrementing by one.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 13-13: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` at line 13,
Change the heading "Task 1: Replace sharedsock with standard UDP socket in
TunKernelDevice.Up()" from a third-level heading (###) to a second-level heading
(##) so markdown heading levels increment by one and satisfy MD001; locate the
heading text in the document and update its leading hashes accordingly (ensure
any surrounding headings remain correctly ordered).
Run: `cd /home/ai-agent/projects/netbird && GOOS=linux GOARCH=arm64 go build ./client/`
Expected: Successful build, no errors.

- [ ] **Step 4: Cross-compile for arm64 (OpenWrt router)**

```bash
cd /home/ai-agent/projects/netbird
GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -o /tmp/netbird-kernel-fix-arm64 ./client/
```
Avoid hard-coded local paths in build instructions.
Lines 73-81 use /home/ai-agent/projects/netbird, which makes the plan less portable for other environments. Prefer repo-relative steps (e.g., “from repo root, run …”).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` around lines
73 - 81, Update Step 4 (“Step 4: Cross-compile for arm64 (OpenWrt router)”) to
remove the hard-coded absolute path `/home/ai-agent/projects/netbird`; instead
instruct users to run the cross-compile from the repository root and use a
repo-relative client path (e.g., reference the ./client directory) while
retaining the environment variables (GOOS, GOARCH, CGO_ENABLED) and output
filename; edit the markdown block that currently contains the `cd
/home/ai-agent/projects/netbird` command to present the repo-relative
instruction and command invocation.
```go
// ApplyLegacyLazyFallbackDefaults seeds the two LegacyLazyFallback*
// fields if they are at the Go zero value. Idempotent — calling it on
// an already-populated Settings is a no-op. The "is at zero value"
// detection is intentionally simple: there is no semantic difference
// between "user explicitly turned the toggle off / set timeout to 0"
// and "field uninitialised", because we forbid 0 timeouts at the API
// layer (range [60, 86400]) and the false toggle case is preserved
// only when the field was already true and got copied verbatim. New
// codepaths that need to remember "user opted out" should use the API
// handler's path (which only ever sees the wire field).
func (s *Settings) ApplyLegacyLazyFallbackDefaults() {
	if s == nil {
		return
	}
	// timeout==0 is never valid, so we always rewrite. Toggle: only
	// reset to default true when the timeout was also zero (= field
	// freshly built, never touched), otherwise honour the explicit
	// false the caller put there.
	if s.LegacyLazyFallbackTimeoutSeconds == 0 {
		s.LegacyLazyFallbackEnabled = DefaultLegacyLazyFallbackEnabled
		s.LegacyLazyFallbackTimeoutSeconds = DefaultLegacyLazyFallbackTimeoutSeconds
	}
```
Don’t let timeout defaulting silently re-enable legacy fallback.
`ApplyLegacyLazyFallbackDefaults()` treats `LegacyLazyFallbackTimeoutSeconds == 0` as “freshly built” and then also forces `LegacyLazyFallbackEnabled = true`. That makes an explicit `false` impossible to preserve on any caller that constructs a fresh `Settings` and leaves the timeout unset, even though the field comments say this helper is used from the PUT path. Split the two concerns: seed the timeout here, but only seed the boolean in code paths that know the struct is brand new (see the sketch below).
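A minimal sketch of that split, with placeholder default values (the real constants live in the settings package):
```go
package main

import "fmt"

const (
	defaultLegacyLazyFallbackEnabled        = true
	defaultLegacyLazyFallbackTimeoutSeconds = 3600 // placeholder default
)

type Settings struct {
	LegacyLazyFallbackEnabled        bool
	LegacyLazyFallbackTimeoutSeconds uint32
}

// applyTimeoutDefault only repairs the never-valid zero timeout and
// leaves the toggle alone, so an explicit false survives.
func (s *Settings) applyTimeoutDefault() {
	if s != nil && s.LegacyLazyFallbackTimeoutSeconds == 0 {
		s.LegacyLazyFallbackTimeoutSeconds = defaultLegacyLazyFallbackTimeoutSeconds
	}
}

// newDefaultSettings is the only place that seeds the boolean: the
// code path that knows the struct is brand new.
func newDefaultSettings() *Settings {
	return &Settings{
		LegacyLazyFallbackEnabled:        defaultLegacyLazyFallbackEnabled,
		LegacyLazyFallbackTimeoutSeconds: defaultLegacyLazyFallbackTimeoutSeconds,
	}
}

func main() {
	s := &Settings{LegacyLazyFallbackEnabled: false} // explicit opt-out
	s.applyTimeoutDefault()
	fmt.Println(s.LegacyLazyFallbackEnabled, s.LegacyLazyFallbackTimeoutSeconds) // false 3600
	fmt.Println(newDefaultSettings().LegacyLazyFallbackEnabled)                  // true
}
```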
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/types/settings.go` around lines 248 - 269,
ApplyLegacyLazyFallbackDefaults currently treats
LegacyLazyFallbackTimeoutSeconds == 0 as “fresh” and forces
LegacyLazyFallbackEnabled = DefaultLegacyLazyFallbackEnabled, which prevents
callers from explicitly preserving false; change ApplyLegacyLazyFallbackDefaults
(in type Settings) to only seed LegacyLazyFallbackTimeoutSeconds when zero (set
to DefaultLegacyLazyFallbackTimeoutSeconds) and do NOT modify
LegacyLazyFallbackEnabled there, and instead seed LegacyLazyFallbackEnabled =
DefaultLegacyLazyFallbackEnabled only from code paths that create brand-new
Settings (e.g. the PUT/new-settings handler) so callers that leave timeout unset
but explicitly set Enabled=false are preserved.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
management/server/http/handlers/accounts/accounts_handler.go (1)
176-198: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Omitted mode/timeout fields still wipe existing account settings.
This function rebuilds `types.Settings` from scratch and only seeds the legacy-fallback fields. An older Dashboard or API client that updates unrelated settings without sending `connection_mode`, `p2p_timeout_seconds`, `p2p_retry_max_seconds`, or `relay_timeout_seconds` will clear the stored values on save, which breaks the backward-compatibility goal for this rollout. Seed these fields from the current persisted settings before applying request overrides, or merge against the existing settings instead of starting from zero, as sketched below.
Also applies to: 231-268
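A minimal sketch of the merge-before-override approach, modeling only the relevant fields (names and shapes here are assumptions, not the handler's actual types):
```go
package main

import "fmt"

type Settings struct {
	ConnectionMode      string
	P2PTimeoutSeconds   *uint32
	RelayTimeoutSeconds *uint32
}

type settingsRequest struct {
	ConnectionMode      *string
	P2PTimeoutSeconds   *uint32
	RelayTimeoutSeconds *uint32
}

// mergeSettings starts from a copy of what is persisted and only
// overwrites fields the request explicitly carries, so an older
// client that omits the new fields cannot wipe them.
func mergeSettings(current Settings, req settingsRequest) Settings {
	merged := current // seed every field from the stored settings
	if req.ConnectionMode != nil {
		merged.ConnectionMode = *req.ConnectionMode
	}
	if req.P2PTimeoutSeconds != nil {
		merged.P2PTimeoutSeconds = req.P2PTimeoutSeconds
	}
	if req.RelayTimeoutSeconds != nil {
		merged.RelayTimeoutSeconds = req.RelayTimeoutSeconds
	}
	return merged
}

func main() {
	t := uint32(300)
	stored := Settings{ConnectionMode: "p2p-lazy", RelayTimeoutSeconds: &t}
	got := mergeSettings(stored, settingsRequest{}) // legacy client: empty update
	fmt.Println(got.ConnectionMode, *got.RelayTimeoutSeconds) // p2p-lazy 300
}
```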
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler.go` around lines 176 - 198, The handler currently constructs a new types.Settings (returnSettings) from req.Settings which causes omission of older mode/timeout fields (connection_mode, p2p_timeout_seconds, p2p_retry_max_seconds, relay_timeout_seconds) to wipe persisted values; instead, initialize returnSettings by copying the existing persisted settings for those specific fields (or merge the persisted settings into returnSettings) before applying overrides from req.Settings so absent fields in the request preserve their stored values; update the logic around returnSettings, req.Settings and the per-field if-blocks to only overwrite a field when the request explicitly provides it, and ensure the same change is applied to the other block noted (lines 231-268) that rebuilds Settings elsewhere.
♻️ Duplicate comments (30)
client/internal/stdnet/filter_test.go (1)
23-66: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
This Windows regression test is still asymmetric and cross-platform brittle.
On non-Windows runners, the `vEthernet (...)` rows are expected to be filtered by the generic `veth` prefix path, but the loop never asserts most `want: true` cases, so the test still passes. Skip this test outside Windows (or split the Windows-only rows out) and use a symmetric `got != c.want` assertion so allowed interfaces are verified too.
Suggested adjustment
```diff
 func TestInterfaceFilter_Windows_TargetedFiltering(t *testing.T) {
+	if runtime.GOOS != "windows" {
+		t.Skip("Windows-specific interface-name filtering")
+	}
+
 	disallow := []string{"wt", "wg", "veth", "br-", "lo", "docker"}
 	allow := InterfaceFilter(disallow)
@@
 	for _, c := range cases {
-		// The wgctrl branch can override on hosts where NetBird is
-		// running; tests run on a host where these names are not
-		// real interfaces, so the final return faithfully reflects
-		// the disallow-list logic.
 		got := allow(c.name)
-		// "veth*" prefix only filters on non-Windows; on Linux test
-		// runners "vEthernet (LAN)" still passes because of mixed
-		// case + the !Windows branch keeping the prefix match.
-		if !c.want && got {
-			t.Errorf("InterfaceFilter(%q) = true, want false (should be filtered)", c.name)
-		}
-		if c.want && !got && runtime.GOOS == "windows" && c.name == "vEthernet (LAN)" {
-			t.Fatalf("InterfaceFilter(%q) = false, want true on Windows (this is uray-mic-d4's default-route interface)", c.name)
+		if got != c.want {
+			t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want)
 		}
 	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter_test.go` around lines 23 - 66, The test TestInterfaceFilter_Windows_TargetedFiltering is brittle cross-platform: update it to skip non-Windows runs (check runtime.GOOS != "windows") or separate Windows-only cases from generic cases, and change the checks to a symmetric assertion (if got != c.want { t.Fatalf/... }) so both allowed and filtered expectations are enforced; operate on the InterfaceFilter result (allow := InterfaceFilter(...); got := allow(c.name)) and use runtime.GOOS to gate Windows-specific rows like "vEthernet (LAN)".
client/internal/stdnet/filter.go (1)
49-76: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Windows still false-positives adapters starting with `lo`.
The earlier `Local Area Connection` regression is still alive here: after `strings.ToLower`, both the dedicated `HasPrefix(..., "lo")` branch and the generic disallow-list pass will reject any Windows adapter whose name starts with `lo`. Guarding only the loopback branch is not enough unless the `"lo"` token is also skipped in the Windows-specific prefix filtering.
Proposed fix
```diff
 // Linux/macOS loopback prefix ("lo", "lo0").
-if strings.HasPrefix(lowerIFace, "lo") {
+if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") {
 	return false
 }
@@
-if sLower == "veth" && runtime.GOOS == "windows" {
+if runtime.GOOS == "windows" && (sLower == "veth" || sLower == "lo") {
 	continue
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/stdnet/filter.go` around lines 49 - 76, The code currently rejects interfaces starting with "lo" both in the explicit loopback check (lowerIFace and HasPrefix(..., "lo")) and again when iterating disallowList, causing Windows adapters like "Local Area Connection" to be false-positive filtered; update the loop that iterates disallowList (the block referencing disallowList, sLower and runtime.GOOS) to skip the "lo" token on Windows (similar to the existing veth guard) so that when runtime.GOOS == "windows" and sLower == "lo" you continue without applying the Prefix check, leaving the explicit loopback branch as the sole handler for true loopbacks.
client/internal/lazyconn/manager/manager.go (1)
103-107: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Only use the legacy inactivity manager when the deprecated field is actually the source.
This still falls back to `inactivity.NewManager(...)` whenever both effective timeouts resolve to zero. If the caller explicitly set the new fields to `0`/`0` and left `InactivityThreshold` unset, that restores phase-1 default timeout behavior instead of honoring the intended always-on two-timer config.
Suggested fix
```diff
 if wgIface.IsUserspaceBind() {
 	iceTO, relayTO := config.resolvedTimeouts()
-	if iceTO == 0 && relayTO == 0 {
+	if config.InactivityThreshold != nil &&
+		config.ICEInactivityThreshold == 0 &&
+		config.RelayInactivityThreshold == 0 {
 		// Phase 1 / single-timer fallback when caller hasn't migrated.
 		m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold) //nolint:staticcheck // intentional Phase-1 single-timer fallback
 	} else {
 		m.inactivityManager = inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO)
 	}
 } else {
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/lazyconn/manager/manager.go` around lines 103 - 107, The current fallback to inactivity.NewManager when iceTO==0 && relayTO==0 incorrectly triggers even if the caller explicitly set the new timeouts to 0/0; change the logic so the legacy inactivity manager is only used when the deprecated InactivityThreshold is the actual source. Modify config.resolvedTimeouts() (or add a small accessor) to return a third boolean like usedDeprecated (or expose whether InactivityThreshold was the source), then in manager.go call that and only assign m.inactivityManager = inactivity.NewManager(...) when usedDeprecated is true (keeping the current wgIface and config.InactivityThreshold parameters); otherwise preserve the two-timer always-on behavior when new fields were explicitly set to 0/0.
management/server/peer/peer.go (1)
207-224: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
`isEmpty()` still omits the new Phase 3.7i fields.
`UpdateMetaIfNew` returns early without applying the update when `meta.isEmpty()` returns `true`. Since `isEmpty()` ignores `EffectiveConnectionMode`, `EffectiveRelayTimeoutSecs`, `EffectiveP2PTimeoutSecs`, `EffectiveP2PRetryMaxSecs`, and `SupportedFeatures`, a peer carrying only these new runtime/capability fields would have its update silently dropped.
This was raised in a previous review iteration.
🐛 Proposed fix
```diff
 func (p PeerSystemMeta) isEmpty() bool {
 	return p.Hostname == "" &&
 		p.GoOS == "" &&
 		p.Kernel == "" &&
 		p.Core == "" &&
 		p.Platform == "" &&
 		p.OS == "" &&
 		p.OSVersion == "" &&
 		p.WtVersion == "" &&
 		p.UIVersion == "" &&
 		p.KernelVersion == "" &&
 		len(p.NetworkAddresses) == 0 &&
 		p.SystemSerialNumber == "" &&
 		p.SystemProductName == "" &&
 		p.SystemManufacturer == "" &&
 		p.Environment.Cloud == "" &&
 		p.Environment.Platform == "" &&
-		len(p.Files) == 0
+		len(p.Files) == 0 &&
+		p.EffectiveConnectionMode == "" &&
+		p.EffectiveRelayTimeoutSecs == 0 &&
+		p.EffectiveP2PTimeoutSecs == 0 &&
+		p.EffectiveP2PRetryMaxSecs == 0 &&
+		len(p.SupportedFeatures) == 0
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer/peer.go` around lines 207 - 224, The isEmpty() method on PeerSystemMeta currently omits the Phase 3.7i runtime/capability fields so UpdateMetaIfNew can drop updates that only contain those values; update PeerSystemMeta.isEmpty() to include checks for EffectiveConnectionMode, EffectiveRelayTimeoutSecs, EffectiveP2PTimeoutSecs, EffectiveP2PRetryMaxSecs, and SupportedFeatures (e.g., ensure SupportedFeatures length == 0) alongside the existing field checks so a meta containing only those new fields is not treated as empty.
shared/management/http/api/openapi.yml (1)
365-405: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
`connection_mode` contract is still missing `p2p-dynamic-lazy`, and timeout docs are stale.
Line 365 omits a valid mode from the API enum, and lines 370-405 still document old phase behavior/mode applicability. This can cause client/schema rejection of server-valid values and misconfigure consumers.
Suggested OpenAPI patch
```diff
 connection_mode:
   x-experimental: true
   type: string
-  enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+  enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
   nullable: true
   description: |
     Account-wide default peer-connection mode. NULL means "fall back
     to lazy_connection_enabled" for backwards compatibility.
-    Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-    functional. p2p-dynamic is reserved (passes through as p2p in
-    Phase 1; will become functional in Phase 2).
+    Includes relay-forced, p2p, p2p-lazy, p2p-dynamic, and
+    p2p-dynamic-lazy.
 ...
 p2p_timeout_seconds:
 ...
   description: |
     Default ICE-worker idle timeout in seconds. 0 = never tear down.
-    Effective only in p2p-dynamic mode (added in Phase 2).
+    Effective in dynamic modes (p2p-dynamic, p2p-dynamic-lazy).
     NULL means "use built-in default" (180 minutes).
 ...
 p2p_retry_max_seconds:
 ...
   description: |
     Maximum interval between P2P retry attempts after consecutive
     ICE failures, in seconds. Default 900 (= 15 min). Set to 0 to
-    disable backoff (always retry immediately, Phase-2 behavior).
-    Effective only in p2p-dynamic mode (added in Phase 3).
+    disable backoff (always retry immediately).
+    Effective in dynamic modes (p2p-dynamic, p2p-dynamic-lazy).
 ...
 relay_timeout_seconds:
 ...
   description: |
     Default relay-worker idle timeout in seconds. 0 = never tear
-    down. Effective in p2p-lazy and p2p-dynamic modes. Backwards-
+    down. Effective in modes that support relay idle teardown
+    (p2p-lazy and p2p-dynamic-lazy). Backwards-
     compat alias for NB_LAZY_CONN_INACTIVITY_THRESHOLD on the client.
     NULL means "use built-in default" (5 minutes).
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/openapi.yml` around lines 365 - 405, The OpenAPI enum for connection_mode is missing the valid value "p2p-dynamic-lazy" and several descriptions are stale; add "p2p-dynamic-lazy" to the enum for connection_mode and update the descriptions for p2p_timeout_seconds, p2p_retry_max_seconds, and relay_timeout_seconds to remove Phase/X-phase language and explicitly state which concrete modes each field applies to (e.g., p2p_timeout_seconds and p2p_retry_max_seconds apply to p2p-dynamic and p2p-dynamic-lazy; relay_timeout_seconds applies to p2p-lazy and p2p-dynamic-lazy/p2p-dynamic as appropriate), and ensure nullable/default semantics and examples remain accurate for those fields.
shared/management/client/grpc.go (2)
968-970: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Always send capability/effective-config metadata, even when `system.Info` is unavailable.
When `info == nil`, this returns `nil` and drops `SupportedFeatures` plus the effective connection settings. On that path a current client becomes indistinguishable from a legacy one, which can trigger the wrong fallback mode.
Suggested fix
```diff
 func infoToMetaData(info *system.Info, eff EffectiveConnConfig) *proto.PeerSystemMeta {
-	if info == nil {
-		return nil
-	}
+	meta := &proto.PeerSystemMeta{
+		EffectiveConnectionMode:   eff.Mode,
+		EffectiveRelayTimeoutSecs: eff.RelayTimeoutSecs,
+		EffectiveP2PTimeoutSecs:   eff.P2PTimeoutSecs,
+		EffectiveP2PRetryMaxSecs:  eff.P2PRetryMaxSecs,
+		SupportedFeatures:         system.SupportedFeatures(),
+	}
+	if info == nil {
+		return meta
+	}
@@
-	return &proto.PeerSystemMeta{
-		Hostname:         info.Hostname,
-		GoOS:             info.GoOS,
-		OS:               info.OS,
-		Core:             info.OSVersion,
-		OSVersion:        info.OSVersion,
-		Platform:         info.Platform,
-		Kernel:           info.Kernel,
-		NetbirdVersion:   info.NetbirdVersion,
-		UiVersion:        info.UIVersion,
-		KernelVersion:    info.KernelVersion,
-		NetworkAddresses: addresses,
-		SysSerialNumber:  info.SystemSerialNumber,
-		SysManufacturer:  info.SystemManufacturer,
-		SysProductName:   info.SystemProductName,
-		Environment: &proto.Environment{
-			Cloud:    info.Environment.Cloud,
-			Platform: info.Environment.Platform,
-		},
-		Files: files,
-		Flags: &proto.Flags{
-			RosenpassEnabled:      info.RosenpassEnabled,
-			RosenpassPermissive:   info.RosenpassPermissive,
-			ServerSSHAllowed:      info.ServerSSHAllowed,
-			DisableClientRoutes:   info.DisableClientRoutes,
-			DisableServerRoutes:   info.DisableServerRoutes,
-			DisableDNS:            info.DisableDNS,
-			DisableFirewall:       info.DisableFirewall,
-			BlockLANAccess:        info.BlockLANAccess,
-			BlockInbound:          info.BlockInbound,
-			LazyConnectionEnabled: info.LazyConnectionEnabled,
-		},
-		EffectiveConnectionMode:   eff.Mode,
-		EffectiveRelayTimeoutSecs: eff.RelayTimeoutSecs,
-		EffectiveP2PTimeoutSecs:   eff.P2PTimeoutSecs,
-		EffectiveP2PRetryMaxSecs:  eff.P2PRetryMaxSecs,
-		SupportedFeatures:         system.SupportedFeatures(),
-	}
+	meta.Hostname = info.Hostname
+	meta.GoOS = info.GoOS
+	meta.OS = info.OS
+	meta.Core = info.OSVersion
+	meta.OSVersion = info.OSVersion
+	meta.Platform = info.Platform
+	meta.Kernel = info.Kernel
+	meta.NetbirdVersion = info.NetbirdVersion
+	meta.UiVersion = info.UIVersion
+	meta.KernelVersion = info.KernelVersion
+	meta.NetworkAddresses = addresses
+	meta.SysSerialNumber = info.SystemSerialNumber
+	meta.SysManufacturer = info.SystemManufacturer
+	meta.SysProductName = info.SystemProductName
+	meta.Environment = &proto.Environment{
+		Cloud:    info.Environment.Cloud,
+		Platform: info.Environment.Platform,
+	}
+	meta.Files = files
+	meta.Flags = &proto.Flags{
+		RosenpassEnabled:      info.RosenpassEnabled,
+		RosenpassPermissive:   info.RosenpassPermissive,
+		ServerSSHAllowed:      info.ServerSSHAllowed,
+		DisableClientRoutes:   info.DisableClientRoutes,
+		DisableServerRoutes:   info.DisableServerRoutes,
+		DisableDNS:            info.DisableDNS,
+		DisableFirewall:       info.DisableFirewall,
+		BlockLANAccess:        info.BlockLANAccess,
+		BlockInbound:          info.BlockInbound,
+		LazyConnectionEnabled: info.LazyConnectionEnabled,
+	}
+	return meta
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/client/grpc.go` around lines 968 - 970, The function infoToMetaData should never return nil so capability and effective-config metadata are always sent; modify infoToMetaData (and its use of proto.PeerSystemMeta and EffectiveConnConfig) to always allocate and return a &proto.PeerSystemMeta populated with SupportedFeatures (use info.SupportedFeatures when info != nil, otherwise an empty/zero-value slice), and fill in the EffectiveConnConfig-derived fields from eff so the client is never mistaken for a legacy client even when info == nil.
483-492: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Run the snapshot callback after the sync frame has been applied.
`SyncResponse` can carry ordinary peer-state updates and `SnapshotRequest` in the same frame. Calling `cb()` before `msgHandler(decryptedResp)` lets the client publish a snapshot from stale state, and it still fires when `msgHandler` rejects the frame.
Suggested ordering fix
```diff
-	if req := decryptedResp.GetSnapshotRequest(); req != nil {
-		c.snapMu.Lock()
-		cb := c.onSnapshotRequest
-		c.snapMu.Unlock()
-		if cb != nil {
-			cb(req.GetNonce())
-		}
-	}
-
 	if err := msgHandler(decryptedResp); err != nil {
 		log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
+		continue
+	}
+
+	if req := decryptedResp.GetSnapshotRequest(); req != nil {
+		c.snapMu.Lock()
+		cb := c.onSnapshotRequest
+		c.snapMu.Unlock()
+		if cb != nil {
+			cb(req.GetNonce())
+		}
 	}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/client/grpc.go` around lines 483 - 492, The snapshot callback is invoked before the sync frame is applied; change the order so you first call msgHandler(decryptedResp) and only if it returns nil then check decryptedResp.GetSnapshotRequest(), acquire c.snapMu to read c.onSnapshotRequest into a local cb, release the lock, and if cb != nil call cb(req.GetNonce()); ensure the callback is not invoked when msgHandler rejects the frame and keep using the same symbols: decryptedResp.GetSnapshotRequest(), msgHandler(decryptedResp), c.onSnapshotRequest, and req.GetNonce().
management/server/peer.go (2)
1329-1355: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Return an auth-level error here instead of `status.Internal`.
“No access to this peer” is an expected authorization result on this path. Returning `Internal` turns a normal denial into a 500-class failure for callers; a small sketch follows.
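A small self-contained sketch of the intended classification, using stand-in error values rather than the management `status` package (which distinguishes error kinds that map onto HTTP codes):
```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for an authorization-level error kind.
var errPermissionDenied = errors.New("permission denied")

func checkIfUserOwnsPeer(userOwnsPeer bool) error {
	if !userOwnsPeer {
		// An expected authorization denial: a 4xx-class error, not an
		// Internal one that callers would surface as a 500.
		return fmt.Errorf("%w: no access to this peer", errPermissionDenied)
	}
	return nil
}

func main() {
	err := checkIfUserOwnsPeer(false)
	fmt.Println(errors.Is(err, errPermissionDenied)) // true
}
```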
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer.go` around lines 1329 - 1355, The function DefaultAccountManager.checkIfUserOwnsPeer treats a normal "no access" outcome as an internal error; replace the final return that uses status.Internal with an authorization-level gRPC error such as status.PermissionDenied (or status.Unauthenticated if appropriate) so callers receive a 4xx auth error instead of a 500; update the error message text remain the same but use status.PermissionDenied(fmt.Sprintf(...)) in the return from checkIfUserOwnsPeer.
1361-1368: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don’t leak cross-account peer existence from `GetPeerByPubKey`.
This lookup is global by WireGuard pubkey, so the account check is the tenant boundary. Returning a distinct `fmt.Errorf` here both confirms that the peer exists in some other account and makes the failure hard for REST handlers to classify correctly. Both “not found” and “belongs to another account” should collapse to the same typed not-found response.
Suggested fix
```diff
 func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
 	p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
 	if err != nil {
 		return nil, err
 	}
 	if p.AccountID != accountID {
-		return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
+		return nil, status.Errorf(status.NotFound, "peer not found")
 	}
 	return p, nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer.go` around lines 1361 - 1368, GetPeerByPubKey currently returns a distinct fmt.Errorf when the found peer belongs to a different account, leaking existence across tenants; change DefaultAccountManager.GetPeerByPubKey so that if p.AccountID != accountID it returns the same typed "not found" error used by Store.GetPeerByPeerPubKey (e.g., the store's ErrNotFound or the same error value/type the store returns) instead of a new fmt.Errorf, so REST handlers cannot distinguish "exists in another account" from "not found."
client/internal/conn_mgr.go (2)
297-300:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftThe live mode/timeout getters still race with
UpdatedRemotePeerConfig.These fields are written here without synchronization and then exposed through
Mode(),RelayTimeout(),P2pTimeout(), andP2pRetryMax()to other goroutines. The UI/daemon-RPC paths and timer-driven readers can still observe a real data race.Also applies to: 753-778
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 297 - 300, The assignments to e.mode, e.relayTimeoutSecs, e.p2pTimeoutSecs and e.p2pRetryMaxSecs in UpdatedRemotePeerConfig are unsynchronized and race with the readers Mode(), RelayTimeout(), P2pTimeout(), and P2pRetryMax(); fix this by protecting these writes and reads with a shared lock: add (or reuse) a sync.RWMutex on the conn manager struct, wrap the block in UpdatedRemotePeerConfig that sets e.mode/e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs with mu.Lock()/defer mu.Unlock(), and change the getters Mode(), RelayTimeout(), P2pTimeout(), P2pRetryMax() to use mu.RLock()/mu.RUnlock() (or use atomic operations if the fields are simple scalars), ensuring all accesses use the same mutex.
287-340:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRestart/live-reconfigure the lazy manager when only the inactivity timeouts change.
`initLazyManager()` snapshots `relayTimeoutSecs` and `p2pTimeoutSecs` into `manager.Config` once. This path only tears the manager down on `modeChanged`, so a management push that changes just the timeout fields leaves the running manager on stale thresholds until a later mode flip or daemon restart.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_mgr.go` around lines 287 - 340, The lazy manager is not restarted when only inactivity timeout values change, leaving it running with stale relay/p2p thresholds; after computing newRelay/newP2P/newP2pRetry and updating e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs (and calling e.propagateP2pRetryMaxToConns()), detect when modeUsesLazyMgr(newMode) is true and e.lazyConnMgr != nil but timeouts or retry max actually changed (i.e., mode didn't change but newRelay/newP2P/newP2pRetry differ from previous stored values) and then restart the manager by calling e.closeManager(ctx), e.initLazyManager(ctx), e.startModeSideEffects(), and (if needed per current behavior) return e.resetPeersToLazyIdle(ctx) so the running manager picks up the new timeouts immediately.client/ui/peers_tab.go (1)
60-98:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
render()off the Fyne event thread.
showFull.OnChangedand the exportedRefreshcallback still run the blockinggetSrvClient/Statuspath synchronously, so a slow daemon can freeze the whole window for up to 5 seconds. Kick off the fetch in a goroutine and keep only the widget mutations insidefyne.Do(...).Proposed fix
- showFull.OnChanged = func(_ bool) { render() } + asyncRender := func() { go render() } + showFull.OnChanged = func(_ bool) { asyncRender() } @@ - return peersTabBundle{Content: content, ShowFull: showFull, Refresh: render} + return peersTabBundle{Content: content, ShowFull: showFull, Refresh: asyncRender}Also applies to: 126-126
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/peers_tab.go` around lines 60 - 98, The render function currently performs blocking work (getSrvClient, conn.Status) on the Fyne event thread; move the network/status fetch off the UI thread by invoking render in a new goroutine and ensure only UI mutations are executed inside fyne.Do. Concretely: keep the body of render the same but call it via go render() from showFull.OnChanged and from the exported Refresh callback (and any other places that call render), and ensure all calls to getSrvClient, context.WithTimeout, conn.Status, sorting and building of peer data happen outside fyne.Do while only summary.SetText, breakdown.SetText, listVBox.Add/Refresh and other widget updates occur inside fyne.Do.shared/management/http/api/types.gen.go (1)
41-63:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
p2p-dynamic-lazyis still missing from the public connection-mode enum.The PR contract now includes five modes, but generated clients from this schema can still only represent four. Anything relying on
AccountSettingsConnectionMode.Valid()will rejectp2p-dynamic-lazy, so the HTTP API still cannot round-trip the full mode set. Please update the OpenAPI source and regenerate this file rather than patching the generated output directly.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated enum AccountSettingsConnectionMode and its Valid() method are missing the "p2p-dynamic-lazy" constant, so Valid() will reject that mode; update the OpenAPI schema to include the new enum value (so the generator emits a new constant, e.g., AccountSettingsConnectionModeP2pDynamicLazy) and then re-run the codegen to regenerate shared/management/http/api/types.gen.go rather than hand-editing the generated file; ensure the generated Valid() switch in AccountSettingsConnectionMode includes the new constant name so "p2p-dynamic-lazy" is accepted.management/server/http/handlers/accounts/accounts_handler.go (1)
231-283:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReturn
InvalidArgumentfor these new validation failures.These branches still use
fmt.Errorf, so badconnection_mode/ timeout input reachesutil.WriteErroras a generic error and is likely surfaced as a 5xx instead of a 4xx validation response. Usestatus.Errorf(status.InvalidArgument, ...)here and insidevalidateUint32Timeout.Minimal fix
if req.Settings.ConnectionMode != nil { modeStr := string(*req.Settings.ConnectionMode) if !req.Settings.ConnectionMode.Valid() { - return nil, fmt.Errorf("invalid connection_mode %q", modeStr) + return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr) } @@ v := *req.Settings.LegacyLazyFallbackTimeoutSeconds if v < 60 || v > 86400 { - return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v) + return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v) } returnSettings.LegacyLazyFallbackTimeoutSeconds = uint32(v) } @@ func validateUint32Timeout(name string, v int64) (uint32, error) { if v < 0 { - return 0, fmt.Errorf("invalid %s: %d (must be >= 0)", name, v) + return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (must be >= 0)", name, v) } if v > int64(math.MaxUint32) { - return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32)) + return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32)) } return uint32(v), nil }Also applies to: 511-517
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler.go` around lines 231 - 283, Change validation errors from plain fmt.Errorf to gRPC status.InvalidArgument errors so they are returned as 4xx validation responses: replace fmt.Errorf(...) in the ConnectionMode validation branch (where it currently returns invalid connection_mode) with status.Errorf(status.InvalidArgument, ...), and update validateUint32Timeout to return status.Errorf(status.InvalidArgument, ...) for out-of-range timeout errors; ensure the callers in the branches that handle P2pTimeoutSeconds, P2pRetryMaxSeconds, RelayTimeoutSeconds check and propagate the new error type unchanged. Also convert the range-check error for LegacyLazyFallbackTimeoutSeconds (the fmt.Errorf(...) that checks v < 60 || v > 86400) to status.Errorf(status.InvalidArgument, ...) so that legacy_lazy_fallback_timeout_seconds validation also yields InvalidArgument.client/ui/network.go (1)
97-103: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear the tab being left, not the newly selected tab.

`OnUnselected` resolves the grid through `tabs.Selected()`, so a tab switch can wipe the wrong container. Leaving Exit-node networks for Peers still falls through the default branch and clears `allGrid`.

Suggested fix

 tabs.OnUnselected = func(item *container.TabItem) {
     // Only reset network grids when leaving a network tab; the
     // peers VBox manages its own state.
     if item != nil && item.Text != peersText {
-        grid, _ := getGridAndFilterFromTab(tabs, allGrid, overlappingGrid, exitNodeGrid)
-        grid.Objects = nil
+        switch item.Text {
+        case allNetworksText:
+            allGrid.Objects = nil
+        case overlappingNetworksText:
+            overlappingGrid.Objects = nil
+        case exitNodeNetworksText:
+            exitNodeGrid.Objects = nil
+        }
     }
 }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/ui/network.go` around lines 97 - 103, The code clears the wrong grid because getGridAndFilterFromTab is effectively using tabs.Selected(); change the logic to determine and clear the grid based on the unselected tab parameter (item) instead of the currently selected tab: in the tabs.OnUnselected handler use item (and item.Text) to pick the correct grid (via getGridAndFilterFromTab or a new small helper that accepts the TabItem or tab text) and then clear only that grid (e.g., call getGridAndFilterFromTab(item, allGrid, overlappingGrid, exitNodeGrid) or map item.Text to the matching grid) so leaving "Exit-node networks" won't clear allGrid or other unrelated grids; keep peersText, allGrid, overlappingGrid and exitNodeGrid references intact.

management/server/http/handlers/accounts/accounts_handler_test.go (1)
418-419: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the struct literal indentation in the new test.

Lines 418-419 are still over-indented relative to the surrounding fields, so `gofmt` will rewrite this hunk and can trip formatting gates.

Proposed fix

-            LegacyLazyFallbackEnabled:        br(true),
-            LegacyLazyFallbackTimeoutSeconds: ir(3600),
+        LegacyLazyFallbackEnabled:        br(true),
+        LegacyLazyFallbackTimeoutSeconds: ir(3600),

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/http/handlers/accounts/accounts_handler_test.go` around lines 418 - 419, The struct literal in accounts_handler_test.go has two over-indented fields (LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds using br(...) and ir(...)) which will be changed by gofmt; fix by aligning these fields' indentation with the other fields in the surrounding composite literal so their leading tabs/spaces match the surrounding entries (move LegacyLazyFallbackEnabled: br(true), and LegacyLazyFallbackTimeoutSeconds: ir(3600), left to the same column as adjacent fields).

management/server/peer_connections/store.go (1)
141-153: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expire the cached entry before the nonce gate.

`GetWithNonceCheck` returns on the `since` check before it evaluates TTL, so refresh-path polling can keep expired maps resident indefinitely instead of evicting them.

Proposed fix

 func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
     s.mu.Lock()
     defer s.mu.Unlock()
     e, ok := s.maps[peerPubKey]
     if !ok {
         return nil, false
     }
-    if since > 0 && e.m.GetInResponseToNonce() < since {
-        return nil, false
-    }
     if s.clock.Now().Sub(e.updatedAt) > s.ttl {
         delete(s.maps, peerPubKey)
         return nil, false
     }
+    if since > 0 && e.m.GetInResponseToNonce() < since {
+        return nil, false
+    }
     return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
 }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/server/peer_connections/store.go` around lines 141 - 153, GetWithNonceCheck currently checks the nonce gate before TTL so expired entries can persist; in MemoryStore.GetWithNonceCheck, move the TTL eviction check (compare s.clock.Now().Sub(e.updatedAt) > s.ttl, delete from s.maps and return false) to occur immediately after retrieving e (before evaluating e.m.GetInResponseToNonce()), so expired entries are removed and not allowed to pass the nonce check.

management/internals/shared/grpc/conversion.go (1)
315-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only mark liveness as authoritative when status is present.

`cfg.ServerLivenessKnown = true` currently runs even when `rPeer.Status` is nil, which turns “no status data” into “server says offline.” New clients will trust `LiveOnline=false` instead of falling back to the last-seen heuristic.

Proposed fix

 if rPeer.Status != nil {
     if !rPeer.Status.LastSeen.IsZero() {
         cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
     }
     cfg.LiveOnline = rPeer.Status.Connected
+    cfg.ServerLivenessKnown = true
 }
-cfg.ServerLivenessKnown = true
 dst = append(dst, cfg)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/conversion.go` around lines 315 - 326, The code unconditionally sets cfg.ServerLivenessKnown = true even when rPeer.Status is nil, causing clients to trust LiveOnline when no status exists; change the logic so cfg.ServerLivenessKnown is only set to true inside the rPeer.Status != nil branch (i.e., after you nil-guard and assign cfg.LiveOnline / LastSeenAtServer), leaving it false/default when rPeer.Status is nil so clients fall back to the LastSeen heuristic; adjust the block around rPeer.Status, cfg.LiveOnline and cfg.LastSeenAtServer accordingly.

client/internal/engine_offline_debounce_test.go (1)
77-97: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Lock the debounce map around these assertions.

These `len(e.peerOfflineDebounce)` reads bypass `peerOfflineDebounceMu`, so the tests can still race with timer-map mutation and flap under `-race`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/engine_offline_debounce_test.go` around lines 77 - 97, The tests read len(e.peerOfflineDebounce) without acquiring the debounce mutex, causing races; wrap those assertions in the engine's peerOfflineDebounceMu lock. Specifically, in TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp and TestCancelAllRemoteOfflineCloses_ClearsEverything acquire e.peerOfflineDebounceMu before checking len(e.peerOfflineDebounce) (and release after) so the reads are synchronized with scheduleRemoteOfflineClose/cancelAllRemoteOfflineCloses/cancelRemoteOfflineClose which mutate the map.

client/internal/peer/conn_lazy_keepwgpeer_test.go (1)
108-115: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Only skip full-line comments here.

`strings.Contains(trim, "// ")` also filters out real `conn.Close(...) // ...` lines, so a regressed 1-arg call with an inline comment would never be checked.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn_lazy_keepwgpeer_test.go` around lines 108 - 115, The condition that skips comment lines uses strings.Contains(trim, "// "), which also matches inline comments like "conn.Close(...) // ..." and hides real 1-arg calls; change that check to only skip full-line comments by using strings.HasPrefix(trim, "//") (i.e., update the switch condition that references trim alongside lazyConnMgr.Close, activityManager.Close, im.Close, peerStore.Close, conn.Close( and peerConn.Close( to use HasPrefix instead of Contains).

client/internal/peer/ice_backoff.go (1)
202-210: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear active suspension when max backoff is disabled.

If a peer is already suspended and management pushes `0`, this leaves `s.suspended` / `s.nextRetry` intact, so retries stay blocked until the old deadline even though backoff is now off.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/ice_backoff.go` around lines 202 - 210, SetMaxBackoff currently updates s.maxBackoff and s.bo but doesn't clear an active suspension, so if management sets max backoff to 0 while a peer is suspended the peer remains blocked by s.suspended/s.nextRetry. In iceBackoffState.SetMaxBackoff, after updating s.maxBackoff and s.bo when d == 0 clear the suspension state by setting s.suspended = false and resetting s.nextRetry to the zero time (time.Time{}); reference the SetMaxBackoff method, iceBackoffState type, and fields s.suspended and s.nextRetry so the change unblocks retries immediately when backoff is disabled.

client/internal/conn_state_pusher.go (3)
151-162: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The latest-nonce coalescing contract isn't actually enforced.

Older nonces can stay queued in `snapshotReq`, and the loop flushes the first pending value it reads. Under back-to-back requests, the caller can miss its own `InResponseToNonce`.

Also applies to: 243-246
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 151 - 162, OnSnapshotRequest currently may leave older nonces in p.snapshotReq so the pusher can flush a stale value; change OnSnapshotRequest (and the analogous code at the other occurrence) to coalesce by first draining any pending values from p.snapshotReq in a non-blocking loop and then send the latest nonce non-blocking — i.e., if p is nil return, then repeatedly try a non-blocking receive from p.snapshotReq to discard old nonces, and finally perform a non-blocking send of the new nonce so the channel only ever retains the most recent value (use the existing symbol p.snapshotReq and function connStatePusher.OnSnapshotRequest and replicate the same drain-then-send logic at the referenced second spot).
115-122: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't ignore `rand.Read` failures in `newSessionID`.

On error, `b` stays zeroed and this loop can spin forever during pusher construction. Propagate the error or use a bounded fallback instead of retrying blindly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 115 - 122, newSessionID currently ignores errors from rand.Read and can loop forever if Read fails; change newSessionID to return (uint64, error) (or an error alongside the ID) and propagate the rand.Read error instead of retrying blindly, or implement a bounded retry with a deterministic fallback; update callers that use newSessionID (e.g., pusher/session construction) to handle the returned error and fail fast if session ID generation fails.
291-296: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid `context.Background()` for push calls.

A hung `Push` blocks the loop goroutine, which means `<-p.stop` is never serviced and `Stop()` can stall forever in `wg.Wait()`. Use a cancelable lifecycle context or a per-call timeout for both delta and full pushes.

Also applies to: 361-367

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/conn_state_pusher.go` around lines 291 - 296, The Push calls currently use context.Background() (see p.sink.Push invocations) which can hang and prevent Stop() from returning; replace those with a cancelable context derived from the p lifecycle context (e.g., p.ctx or p.lifecycleCtx) or wrap each call in context.WithTimeout(p.ctx, pushTimeout) and pass that to p.sink.Push for both delta and full pushes (the call at p.sink.Push near seq and the similar one at the later block), and ensure Stop() cancels the lifecycle context so blocked Pushes are unblocked and wg.Wait() can complete.

client/internal/peer/status.go (3)
296-307: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Capture the conn-state listener before unlocking.

`notifyConnStateChange()` is still written as a lock-held helper, but these update paths call it only after `d.mux.Unlock()`. That leaves the read of `d.connStateListener` racing with `SetConnStateListener(nil/...)`, so the new callback plumbing is still not thread-safe.

Also applies to: 470-473, 696-699, 764-767, 812-815, 863-866
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 296 - 307, notifyConnStateChange currently reads d.connStateListener while callers may have already released d.mux, creating a race with SetConnStateListener; fix by capturing the listener while holding the lock and passing that captured value into the post-unlock closure. Concretely, either (A) change notifyConnStateChange to accept a listener parameter (e.g., notifyConnStateChange(listener func(string, State), peerPubKey string, peerState State) func()) and have callers read listener := d.connStateListener while d.mux is held and then call the new helper after unlock, or (B) update all call sites to read listener := d.connStateListener while holding d.mux and then call the existing notifyConnStateChange-like closure creator with that captured listener; ensure SetConnStateListener races are eliminated by always reading the listener under d.mux before unlocking.
309-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't fire `peerListChanged` while holding `d.mux`.

`UpdatePeerRemoteMeta()` still calls `d.notifyPeerListChanged()` before releasing `d.mux`. Since the notifier can re-enter `Status`, this path can still deadlock on remote-meta-only updates even though the rest of the file generally snapshots under lock and notifies afterwards.

Also applies to: 520-567
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 309 - 320, notifyPeerListChanged currently calls d.notifier.peerListChanged while still holding d.mux, which can deadlock because the notifier may re-enter Status; change the pattern to snapshot the required state under the lock (e.g., capture count := d.numOfPeers() or any other needed snapshot of d.peers/d.offlinePeers) and then release d.mux before calling d.notifier.peerListChanged(count). Apply the same fix to the other path referenced (the UpdatePeerRemoteMeta-related block around the second region) so no notifier calls occur while d.mux is held.
322-336: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Snapshot router state while `d.mux` is still held.

`notifyPeerStateChangeListeners()` still assumes `d.mux` is held because it reads `d.peers` and `d.changeNotify` through `snapshotRouterPeersLocked()`. The ICE and relay paths now call it after unlock, so those reads can still race with writers and hit the same concurrent-map bug that was previously reported.

Also applies to: 692-694, 760-762

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/status.go` around lines 322 - 336, notifyPeerStateChangeListeners currently calls snapshotRouterPeersLocked which reads d.peers and d.changeNotify and therefore must be executed while d.mux is held; some ICE/relay code paths call notifyPeerStateChangeListeners after unlocking and can race with writers. Fix by ensuring the snapshot happens under the lock: either (A) call notifyPeerStateChangeListeners while still holding d.mux in the ICE/relay paths (the callers referenced around the other occurrences), or (B) change notifyPeerStateChangeListeners to acquire d.mux internally before calling snapshotRouterPeersLocked and release it immediately after taking the snapshot, then spawn dispatchRouterPeers; apply the same pattern for all call sites (including the ones at the other noted locations) so snapshotRouterPeersLocked is always invoked with d.mux held.

management/internals/shared/grpc/server.go (1)
445-447: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Nil out `snapshotCh` after closure to avoid a tight spin.

When `SnapshotRouter.Register()` replaces a stream, the old `snapshotCh` can be closed. This branch still does `continue` on `!ok`, so the closed receive stays immediately ready and the goroutine busy-loops until `ctx.Done()`.

Possible minimal fix

 case nonce, ok := <-snapshotCh:
     if !ok {
-        continue
+        snapshotCh = nil
+        continue
     }

Also applies to: 497-500

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@management/internals/shared/grpc/server.go` around lines 445 - 447, The current receive loop uses snapshotCh from snapshotRouter.Register(peerKey.String()) and on detecting a closed channel (ok == false) it does continue, which causes a closed channel to remain immediately readable and the goroutine to busy-spin; fix this by setting snapshotCh = nil (or otherwise disabling reads) when you detect !ok so the select no longer reads the closed channel, and keep the existing defer s.snapshotRouter.Unregister(peerKey.String(), snapshotCh); apply the same niling change to the analogous receive logic referenced around the other occurrence (lines ~497-500) to prevent the same tight loop.

client/internal/engine.go (1)
2765-2783: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard the timer's `ConnMgr` read with the same lock as the writer.

The debounce timer still reads `e.connMgr.Mode()` outside `syncMsgMux`, while `updateNetworkMap()` updates the connection-manager config under that mutex. This is the same unsynchronized read/write pair that was flagged earlier, so the callback can still race with mode changes or shutdown.

Possible minimal fix

 t := time.AfterFunc(remoteOfflineGracePeriod, func() {
     e.peerOfflineDebounceMu.Lock()
     delete(e.peerOfflineDebounce, pubKey)
     e.peerOfflineDebounceMu.Unlock()
@@
     if e.ctx == nil || e.ctx.Err() != nil {
         return
     }
-    if e.connMgr == nil || e.connMgr.Mode() != connectionmode.ModeP2PDynamic {
+    e.syncMsgMux.Lock()
+    connMgr := e.connMgr
+    mode := connectionmode.ModeUnspecified
+    if connMgr != nil {
+        mode = connMgr.Mode()
+    }
+    e.syncMsgMux.Unlock()
+    if connMgr == nil || mode != connectionmode.ModeP2PDynamic {
         return
     }

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/engine.go` around lines 2765 - 2783, The debounce timer callback is reading e.connMgr.Mode() without holding the same syncMsgMux used by updateNetworkMap(), creating a race; modify the callback to acquire the same syncMsgMux (use the identical lock used in updateNetworkMap(), e.g. e.syncMsgMux.Lock()/Unlock() or RLock()/RUnlock()) around all accesses to e.connMgr and any related peer state checks so the read is synchronized with writers, taking care to acquire locks in the same order as other code paths to avoid deadlocks and release the lock before returning.

client/internal/peer/conn.go (1)
1171-1218: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Revalidate conn state inside `AttachICE()`.

`AttachICEOnRelayActivity()` still drops `conn.mu` before calling `AttachICE()`, and `AttachICE()` still never checks `opened` / `ctx.Err()` after reacquiring the lock. A concurrent `Close()` can therefore finish in between and this path will still reattach ICE plus send a fresh offer on a torn-down conn.

Possible minimal fix

 func (conn *Conn) AttachICE() error {
     conn.mu.Lock()
     defer conn.mu.Unlock()

+    if !conn.opened || conn.ctx == nil || conn.ctx.Err() != nil {
+        return fmt.Errorf("AttachICE: connection is closed")
+    }
+
     if conn.iceBackoff != nil && conn.iceBackoff.IsSuspended() {
         snap := conn.iceBackoff.Snapshot()
         conn.Log.Debugf("ICE backoff active (failure #%d, retry at %s), staying on relay", snap.Failures, snap.NextRetry.Format("15:04:05"))

Also applies to: 1280-1305
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@client/internal/peer/conn.go` around lines 1171 - 1218, AttachICE can be run after conn.mu is dropped in AttachICEOnRelayActivity, so Add state revalidation at the top of AttachICE immediately after reacquiring conn.mu: check that conn.opened is true, ctx.Err() is nil (or context still active), and other preconditions that AttachICEOnRelayActivity enforces (e.g., conn.currentConnPriority == conntype.Relay and handshaker/readICEListener still indicate attachment is valid); if any check fails, return early with a harmless error. Ensure these checks reference the existing symbols AttachICE, conn.mu, conn.opened, conn.ctx (or context), conn.currentConnPriority and handshaker so concurrent Close() or state changes don’t cause ICE to be attached on a torn-down connection.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 557b44c6-66de-483f-9506-4870e6b21175
⛔ Files ignored due to path filters (2)
client/proto/daemon.pb.go is excluded by !**/*.pb.go
shared/management/proto/management.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (60)
- .gitignore
- client/android/client.go
- client/android/peer_notifier.go
- client/android/preferences.go
- client/android/preferences_clamp_test.go
- client/iface/bind/activity.go
- client/iface/device/endpoint_manager.go
- client/internal/conn_mgr.go
- client/internal/conn_state_pusher.go
- client/internal/conn_state_pusher_material_test.go
- client/internal/conn_state_pusher_test.go
- client/internal/conn_state_pusher_testhelper_test.go
- client/internal/engine.go
- client/internal/engine_offline_debounce_test.go
- client/internal/lazyconn/activity/listener_bind_test.go
- client/internal/lazyconn/manager/manager.go
- client/internal/peer/conn.go
- client/internal/peer/conn_handover_order_test.go
- client/internal/peer/conn_lazy_keepwgpeer_test.go
- client/internal/peer/guard/guard.go
- client/internal/peer/guard/guard_test.go
- client/internal/peer/guard/ice_retry_state_test.go
- client/internal/peer/ice_backoff.go
- client/internal/peer/ice_backoff_test.go
- client/internal/peer/status.go
- client/internal/peer/status_debounce_test.go
- client/internal/peer/status_remote_meta_notify_test.go
- client/internal/peerstore/store.go
- client/internal/profilemanager/config.go
- client/internal/stdnet/filter.go
- client/internal/stdnet/filter_test.go
- client/proto/daemon.proto
- client/system/features.go
- client/system/features_test.go
- client/ui/network.go
- client/ui/peers_tab.go
- docs/bugs/2026-05-04-user-peer-visibility-regression.md
- docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
- management/internals/shared/grpc/conversion.go
- management/internals/shared/grpc/conversion_test.go
- management/internals/shared/grpc/server.go
- management/server/account.go
- management/server/activity/codes.go
- management/server/http/handlers/accounts/accounts_handler.go
- management/server/http/handlers/accounts/accounts_handler_test.go
- management/server/http/handlers/accounts/validate_uint32_timeout_test.go
- management/server/peer.go
- management/server/peer/peer.go
- management/server/peer_connections/store.go
- management/server/peer_connections/store_test.go
- management/server/peer_test.go
- management/server/store/file_store.go
- management/server/store/sql_store.go
- management/server/types/settings.go
- management/server/types/settings_ptr_equal_test.go
- management/server/types/settings_test.go
- shared/management/client/grpc.go
- shared/management/http/api/openapi.yml
- shared/management/http/api/types.gen.go
- shared/management/proto/management.proto
🚧 Files skipped from review as they are similar to previous changes (4)
- management/server/peer_connections/store_test.go
- management/server/http/handlers/accounts/validate_uint32_timeout_test.go
- management/server/store/sql_store.go
- docs/bugs/2026-05-04-user-peer-visibility-regression.md
if e.connStatePusher != nil {
    e.connStatePusher.Stop()
    e.connStatePusher = nil
}
Synchronize connStatePusher access across shutdown and callbacks.
The nil-checks prevent a panic, but Stop() still writes e.connStatePusher = nil while the status/snapshot handlers read the same field without any synchronization. That is still a real data race during shutdown, and it can surface as torn reads or calls into a half-stopped pusher under -race.
Use an atomic.Pointer, or unregister/replace the handlers before clearing the field so reads and writes use the same synchronization boundary.
Also applies to: 671-683
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/engine.go` around lines 363 - 366, The code has a data race
on the field connStatePusher: handlers (status/snapshot handlers) read
e.connStatePusher while shutdown calls Stop() and sets e.connStatePusher = nil
without synchronization. Fix by making access atomic or by
unregistering/replacing handlers before clearing the field: either convert
connStatePusher to an atomic.Pointer (use atomic.LoadPointer/StorePointer when
handlers read/write and call Stop via an atomic swap) or ensure shutdown first
unregisters the status/snapshot handlers (so no concurrent readers), then call
connStatePusher.Stop() and set the field to nil; update all reads in the
status/snapshot handler to use the same atomic/mutex-protected access. Ensure
references to connStatePusher and Stop() are the only sites changed.
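For illustration, a minimal runnable sketch of the atomic.Pointer variant suggested above. The connStatePusher here is a stand-in type, and onStatusEvent/shutdown are invented names; only the load-once / swap-to-nil pattern is the point.

package main

import (
	"fmt"
	"sync/atomic"
)

// connStatePusher is a stand-in for the engine's pusher type.
type connStatePusher struct{}

func (p *connStatePusher) Push(event string) { fmt.Println("pushed:", event) }
func (p *connStatePusher) Stop()             {}

// engine keeps the pusher behind an atomic.Pointer so handler reads and
// the shutdown write share one synchronization boundary.
type engine struct {
	pusher atomic.Pointer[connStatePusher]
}

// onStatusEvent models a status/snapshot callback: load once, nil-check,
// then only use the loaded copy (never re-read the field).
func (e *engine) onStatusEvent(event string) {
	if p := e.pusher.Load(); p != nil {
		p.Push(event)
	}
}

// shutdown swaps the pointer to nil first, so no new callback can pick
// the pusher up, then stops the instance it owned.
func (e *engine) shutdown() {
	if p := e.pusher.Swap(nil); p != nil {
		p.Stop()
	}
}

func main() {
	e := &engine{}
	e.pusher.Store(&connStatePusher{})
	e.onStatusEvent("peer-up")
	e.shutdown()
	e.onStatusEvent("peer-down") // no-op: pointer already nil
}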
func (c Config) resolvedTimeouts() (iceTimeout, relayTimeout time.Duration) {
    relay := c.RelayInactivityThreshold
    if relay == 0 && c.InactivityThreshold != nil {
        relay = *c.InactivityThreshold
    }
Don’t let the deprecated timeout bleed into mixed two-timer configs.
Line 52 still copies InactivityThreshold into the relay timeout whenever RelayInactivityThreshold == 0, even if ICEInactivityThreshold is explicitly set. That means a migrated p2p-dynamic config (ICE > 0, relay = 0) can unexpectedly inherit the old relay timeout and tear relay down, which contradicts the comment above this field and changes the new mode’s semantics.
Suggested fix
func (c Config) resolvedTimeouts() (iceTimeout, relayTimeout time.Duration) {
relay := c.RelayInactivityThreshold
- if relay == 0 && c.InactivityThreshold != nil {
+ if c.ICEInactivityThreshold == 0 && relay == 0 && c.InactivityThreshold != nil {
relay = *c.InactivityThreshold
}
return c.ICEInactivityThreshold, relay
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client/internal/lazyconn/manager/manager.go` around lines 50 - 54, The relay
timeout fallback is incorrectly using the deprecated InactivityThreshold
whenever RelayInactivityThreshold == 0 even if ICEInactivityThreshold is
explicitly set; update Config.resolvedTimeouts so that RelayInactivityThreshold
only falls back to InactivityThreshold when both RelayInactivityThreshold == 0
and ICEInactivityThreshold == 0 (and InactivityThreshold != nil), otherwise
leave relay as zero when ICEInactivityThreshold is non‑zero to preserve the new
mode semantics; modify the logic in the resolvedTimeouts method accordingly,
referencing the RelayInactivityThreshold, ICEInactivityThreshold, and
InactivityThreshold fields.
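To make the intended semantics concrete, here is a self-contained toy version of the fixed fallback rule. The field names are copied from the diff above; the Config type is a local stand-in, not the real manager config.

package main

import (
	"fmt"
	"time"
)

// Config mirrors only the three timeout fields discussed above.
type Config struct {
	ICEInactivityThreshold   time.Duration
	RelayInactivityThreshold time.Duration
	InactivityThreshold      *time.Duration // deprecated single timer
}

// resolvedTimeouts applies the deprecated value only when NEITHER new
// timer is set, so a migrated two-timer config keeps relay = 0 ("never").
func (c Config) resolvedTimeouts() (ice, relay time.Duration) {
	relay = c.RelayInactivityThreshold
	if c.ICEInactivityThreshold == 0 && relay == 0 && c.InactivityThreshold != nil {
		relay = *c.InactivityThreshold
	}
	return c.ICEInactivityThreshold, relay
}

func main() {
	old := 15 * time.Minute
	// Legacy-only config: deprecated value still maps onto the relay timer.
	ice, relay := Config{InactivityThreshold: &old}.resolvedTimeouts()
	fmt.Println(ice, relay) // 0s 15m0s
	// Migrated config (ICE set, relay intentionally 0): no bleed-through.
	ice, relay = Config{ICEInactivityThreshold: time.Minute, InactivityThreshold: &old}.resolvedTimeouts()
	fmt.Println(ice, relay) // 1m0s 0s
}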
}
mux := udpmux.NewUniversalUDPMuxDefault(bindParams)
go mux.ReadFromConn(t.ctx)
t.udpMuxConn = udpConn
Specify whether the t.udpMuxConn field needs to be added to the struct.
The plan assigns to t.udpMuxConn but doesn't clarify whether this field already exists in TunKernelDevice or needs to be added as part of this change. Include a note specifying the struct field addition if required, or confirm the field already exists.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` at line 59,
The assignment to t.udpMuxConn implies TunKernelDevice needs a field named
udpMuxConn; either add that field to the TunKernelDevice struct with the same
type as the udpConn variable (e.g., net.PacketConn or *net.UDPConn depending on
udpConn's type) and a short comment describing its role, or if the field already
exists, update the plan text to explicitly confirm it exists and show its type;
ensure any initialization/close responsibilities for udpMuxConn are handled
consistently with other connection fields in TunKernelDevice (e.g., in its
constructor/Close method).
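A tiny sketch of the ownership pattern the comment asks the plan to spell out. The device type, field name, and Close wiring below are assumptions taken from the plan snippet, not actual netbird code.

package main

import (
	"fmt"
	"net"
)

// device stands in for TunKernelDevice; udpMuxConn mirrors the field the
// plan assigns to, and Close shows the assumed cleanup responsibility.
type device struct {
	udpMuxConn net.PacketConn // owned by the device; closed in Close()
}

func (d *device) Close() error {
	if d.udpMuxConn != nil {
		return d.udpMuxConn.Close()
	}
	return nil
}

func main() {
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4zero, Port: 0})
	if err != nil {
		panic(err)
	}
	d := &device{udpMuxConn: conn}
	fmt.Println("mux socket:", conn.LocalAddr())
	fmt.Println("close:", d.Close())
}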
// AccountConnectionModeChanged indicates the account-wide ConnectionMode
// setting was changed (Phase 1 of issue #5989).
AccountConnectionModeChanged Activity = 121
Inaccurate phase reference in comment — should be Phase 3.7i, not Phase 1.
Line 236 says "Phase 1 of issue #5989", but this PR is Phase 3.7i. The other new constants on lines 244–245 and 248–252 correctly reference Phase 3 and Phase 3.7i.
✏️ Proposed fix
- // AccountConnectionModeChanged indicates the account-wide ConnectionMode
- // setting was changed (Phase 1 of issue `#5989`).
+ // AccountConnectionModeChanged indicates the account-wide ConnectionMode
+ // setting was changed (Phase 3.7i of issue `#5989`).
AccountConnectionModeChanged Activity = 121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@management/server/activity/codes.go` around lines 235 - 237, The comment
above the Activity constant AccountConnectionModeChanged incorrectly references
"Phase 1 of issue `#5989`"; update that comment to reference "Phase 3.7i of issue
`#5989`" so it matches the other new constants (e.g., the constants added around
AccountConnectionModeChanged and those at lines after it) and clearly documents
that AccountConnectionModeChanged Activity = 121 is part of Phase 3.7i.
// ConnectionMode controls how a peer establishes connections to other peers.
// Added in Phase 1 of the connection-mode consolidation (see issue #5989).
// CONNECTION_MODE_UNSPECIFIED is the proto default and means "fall back to
// the legacy LazyConnectionEnabled boolean field" -- required for backwards
// compatibility with old management servers that don't set this field.
enum ConnectionMode {
  CONNECTION_MODE_UNSPECIFIED = 0;
  CONNECTION_MODE_RELAY_FORCED = 1;
  CONNECTION_MODE_P2P = 2;
  CONNECTION_MODE_P2P_LAZY = 3;
  CONNECTION_MODE_P2P_DYNAMIC = 4;
}
Add CONNECTION_MODE_P2P_DYNAMIC_LAZY to the wire enum.
#5989's mode set includes p2p-dynamic-lazy, but this enum stops at CONNECTION_MODE_P2P_DYNAMIC. Without a wire value here, management cannot push or persist that mode through PeerConfig, so clients can never receive the dynamic-lazy behavior over the management protocol.
Proposed fix
enum ConnectionMode {
CONNECTION_MODE_UNSPECIFIED = 0;
CONNECTION_MODE_RELAY_FORCED = 1;
CONNECTION_MODE_P2P = 2;
CONNECTION_MODE_P2P_LAZY = 3;
CONNECTION_MODE_P2P_DYNAMIC = 4;
+ CONNECTION_MODE_P2P_DYNAMIC_LAZY = 5;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@shared/management/proto/management.proto` around lines 397 - 408, The
ConnectionMode proto enum is missing the new constant for the p2p-dynamic-lazy
mode, preventing management from encoding/decoding that mode; add a new enum
value named CONNECTION_MODE_P2P_DYNAMIC_LAZY (next numeric value after
CONNECTION_MODE_P2P_DYNAMIC) to the ConnectionMode enum so management can
push/persist the p2p-dynamic-lazy mode via PeerConfig and clients can receive
it.
e8a298f to 587ba07
…nfig Additive change for issue netbirdio#5989 Phase 1. New fields use new tag numbers (11, 12, 13); existing fields (including LazyConnectionEnabled tag 6) are unchanged so old clients ignore the additions and old servers send UNSPECIFIED, which the new client maps back via the legacy boolean. Note: the regenerated pb.go files now report protoc v5.29.3 in their header (this branch was generated with locally-installed protoc 29.3 instead of upstream's v7.34.1). Functionally identical; header diff is the only delta beyond the actual schema additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines Mode enum (relay-forced, p2p, p2p-lazy, p2p-dynamic plus the client-only sentinels Unspecified and FollowServer), ParseString for CLI/env input, ToProto/FromProto for wire translation, and the two backwards-compat helpers ResolveLegacyLazyBool / ToLazyConnectionEnabled that bridge the old Settings.LazyConnectionEnabled boolean. Phase 1 of issue netbirdio#5989. Pure addition -- no existing callers touched in this commit; the engine/conn_mgr migration follows in subsequent commits in the same PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
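For orientation, a hedged sketch of the shape such a mode package typically takes. The names mirror the commit text, but the signatures and spellings below are assumptions, not the merged API.

package main

import (
	"fmt"
	"strings"
)

// Mode mirrors the enum described in the commit message.
type Mode int

const (
	ModeUnspecified  Mode = iota // client-only sentinel
	ModeFollowServer             // client-only sentinel
	ModeRelayForced
	ModeP2P
	ModeP2PLazy
	ModeP2PDynamic
)

// ParseString maps CLI/env spellings onto Mode values.
func ParseString(s string) (Mode, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "relay-forced":
		return ModeRelayForced, nil
	case "p2p":
		return ModeP2P, nil
	case "p2p-lazy":
		return ModeP2PLazy, nil
	case "p2p-dynamic":
		return ModeP2PDynamic, nil
	case "follow-server", "":
		return ModeFollowServer, nil
	default:
		return ModeUnspecified, fmt.Errorf("unknown connection mode %q", s)
	}
}

// ResolveLegacyLazyBool bridges the old Settings.LazyConnectionEnabled
// boolean for servers that never send the new enum.
func ResolveLegacyLazyBool(lazy bool) Mode {
	if lazy {
		return ModeP2PLazy
	}
	return ModeP2P
}

func main() {
	m, _ := ParseString("p2p-dynamic")
	fmt.Println(m, ResolveLegacyLazyBool(true))
}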
…on warns NB_CONNECTION_MODE wins over the legacy pair (NB_FORCE_RELAY, NB_ENABLE_EXPERIMENTAL_LAZY_CONN); when the legacy pair is set together, NB_FORCE_RELAY wins (most-restrictive, mirrors the group-conflict rule from issue netbirdio#5990). Each legacy var emits a one-shot deprecation warning when it actually contributes to the resolved mode. NB_LAZY_CONN_INACTIVITY_THRESHOLD becomes an alias for the future relay_timeout setting and warns once. IsForceRelayed() is kept for callers that have not yet been migrated (conn.go, statusrecorder); they will be updated in the engine/conn refactor commits later in this PR. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
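A runnable toy of the precedence rule this commit describes. The env var names come from the commit; the returned mode strings and the "true" comparison are illustrative only.

package main

import (
	"fmt"
	"os"
)

// resolveEnvMode sketches the precedence: the new variable wins outright;
// within the legacy pair, force-relay (the more restrictive option) beats
// experimental lazy, mirroring the group-conflict rule from issue #5990.
func resolveEnvMode() string {
	if v := os.Getenv("NB_CONNECTION_MODE"); v != "" {
		return v
	}
	if os.Getenv("NB_FORCE_RELAY") == "true" {
		return "relay-forced" // wins even if lazy is also set
	}
	if os.Getenv("NB_ENABLE_EXPERIMENTAL_LAZY_CONN") == "true" {
		return "p2p-lazy"
	}
	return "follow-server"
}

func main() {
	os.Setenv("NB_FORCE_RELAY", "true")
	os.Setenv("NB_ENABLE_EXPERIMENTAL_LAZY_CONN", "true")
	fmt.Println(resolveEnvMode()) // relay-forced
}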
Three new CLI flags map onto the new connection-mode plumbing:
- --connection-mode <relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server>
- --relay-timeout <seconds>
- --p2p-timeout <seconds>

Plumbed through three sites in cmd/up.go (SetConfigRequest, ConfigInput, LoginRequest), persisted in profilemanager.Config, and added as new fields on the daemon.proto IPC messages. Empty / not-changed flags fall back to the server-pushed value (which itself falls back to the legacy lazy_connection_enabled boolean for old servers). Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EngineConfig gains ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds. ConnMgr now stores the resolved Mode plus the raw inputs (env, config) so it can re-resolve when the server pushes a new PeerConfig. UpdatedRemoteFeatureFlag is renamed to UpdatedRemotePeerConfig and takes the full PeerConfig pointer; a thin shim with the old name delegates to it for callers that haven't been updated yet. connect.go copies the three new fields from profilemanager.Config into the EngineConfig builder, with a tolerant parser that logs and falls through to Unspecified on invalid input. Phase 1 of issue netbirdio#5989. peer/conn.go forwarding follows in C4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ConnConfig gains a Mode field forwarded from the engine. Open() now checks Mode == ModeRelayForced instead of calling the global env-reader IsForceRelayed(). The local 'forceRelay' variable name is renamed to 'skipICE' to make the new branching intent explicit. The PeerStateUpdate block at the end of Open() also reads from conn.config.Mode now, so the StatusRecorder sees the per-peer mode rather than the global env var. A single remaining caller of IsForceRelayed() (srWatcher.Start in engine.go) is left for a follow-up; that path uses a process-wide flag not per-peer state, so it can be migrated in Phase 2 once srWatcher itself learns about ConnectionMode. Phase 1 of issue netbirdio#5989. Engine forwarding (C5) follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
createPeerConn now reads ConnMgr.Mode() and copies it into peer.ConnConfig, so the per-peer Open() loop in conn.go can take the ModeRelayForced skip-ICE branch without reading the global env var. This is the last wiring commit for the client side of Phase 1; the server-side mgmt changes (Settings + OpenAPI + handler + audit + NetworkMap-build) follow in Section D. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three fields are nullable to distinguish 'use built-in default' (NULL) from explicit values (incl. 0 = never tear down). Copy() now deep-clones the new pointer fields via two small helpers. GORM AutoMigrate creates the new columns at first start; existing accounts have NULL in all three columns and resolve via the legacy LazyConnectionEnabled boolean. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
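The deep-clone helpers the commit mentions reduce to something like the sketch below; the generic clonePtr and the two fields shown are assumptions for illustration.

package main

import "fmt"

// clonePtr deep-copies a nullable field so Copy() results don't alias the
// original; NULL ("use built-in default") stays nil.
func clonePtr[T any](p *T) *T {
	if p == nil {
		return nil
	}
	v := *p
	return &v
}

type settings struct {
	ConnectionMode      *string
	RelayTimeoutSeconds *uint32
}

func (s settings) Copy() settings {
	return settings{
		ConnectionMode:      clonePtr(s.ConnectionMode),
		RelayTimeoutSeconds: clonePtr(s.RelayTimeoutSeconds),
	}
}

func main() {
	mode := "p2p-dynamic"
	a := settings{ConnectionMode: &mode}
	b := a.Copy()
	*b.ConnectionMode = "p2p-lazy"
	fmt.Println(*a.ConnectionMode, *b.ConnectionMode) // p2p-dynamic p2p-lazy
}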
…tings Three new optional, nullable fields with descriptions of the NULL = built-in-default semantics and the Phase-1-vs-Phase-2 status of p2p-dynamic. Regenerated types.gen.go via the existing oapi-codegen tooling. The generated AccountSettingsConnectionMode enum has the canonical values relay-forced / p2p / p2p-lazy / p2p-dynamic, plus a Valid() helper for handler-side validation. Phase 1 of issue netbirdio#5989. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…E-backoff display

Finding 1 (PR-blocker): client/server compile fix. client/server/server_test.go now passes 13 args (peerConnStore + peerConnRouter as nil) to nbgrpc.NewServer matching the current signature. Also adds the four Phase 3.7i ConnectionMode fields to setconfig_test.go's expectedFields + fieldsWithoutCLIFlags maps — they're in the proto for GetConfig only; SetConfig RPC currently doesn't apply them (CLI sets them via service install/reconfigure writing the profile file directly). Documented as a wiring gap.

Finding 2: peer_connections.MemoryStore drops fresh full snapshots after daemon restart. store.go now ALWAYS replaces on full_snapshot=true, regardless of seq, because the pusher resets seq to 1 on every stream restart. Stale in-flight deltas from the closed stream cannot physically arrive after the new full snapshot (transport itself is gone). New regression test TestMemoryStore_FullSnapshotResetsEpoch covers the scenario plus the follow-up delta merge from the new session. TestMemoryStore_OutOfOrderDeltaDropped split out to keep the delta-out-of-order assertion intact.

Finding 3: UpdatePeerRemoteMeta now fires OnPeersListChanged when a UI-relevant field flips. status.go detects LiveOnline / ServerLivenessKnown / EffectiveConnectionMode flips and calls notifyPeerListChanged so the Android home/peers fragments — which only refresh on OnPeersListChanged — pick up the change immediately instead of at the next 30 s daemon-RPC poll. New tests:
- TestStatus_UpdatePeerRemoteMeta_LiveOnlineFlipNotifies
- TestStatus_UpdatePeerRemoteMeta_EffectiveModeChangeNotifies
- TestStatus_UpdatePeerRemoteMeta_NonMaterialFieldsDoNotNotify

Finding 4: ICE-backoff display now wall-clock aware on Windows. client/ui/peers_tab.go mirrors the CLI's status.go:797 pattern: only show "suspended for Xs (retry at ...)" while nextRetry is still in the future, otherwise show the next-retry timestamp, otherwise hide. The hard "SUSPENDED" line could remain stale for hours after the cool-down expired because the daemon snapshot only refreshes on ICE state-change events. Android Java UI receives the same fix in the netbird-android repo (submodule bump in companion commit).
The previous fix (full_snapshot=true always replaces) covers the
normal daemon-restart case but cannot detect a stale unary RPC from
the previous daemon process arriving AFTER the new process's full
snapshot. Because SyncPeerConnections is a unary RPC (not stream), a
retried in-flight delta can race past the snapshot and merge
old-session data into the fresh map (stale delta seq=51 beats new
session seq=2 under pure seq comparison).
Fix: add a session_id field to PeerConnectionMap, generated once per
daemon process via crypto/rand at conn_state_pusher construction.
Mgmt's MemoryStore.Put now drops any delta whose session_id doesn't
match the cached entry's. session_id=0 means "legacy / unset" and
falls back to the existing seq-only behaviour (so a partial fleet
upgrade doesn't silently drop pushes).
Changes:
- shared/management/proto/management.proto: PeerConnectionMap.session_id
field 5, uint64.
- client/internal/conn_state_pusher.go: connStatePusher.sessionID
set once via crypto/rand; flushDelta + flushFull both stamp it.
- management/server/peer_connections/store.go: drop deltas with
mismatched session_id when both sides advertise one.
Tests:
- TestMemoryStore_StaleDeltaFromOldSessionDropped: full snapshot
from session B arrives at seq=1 against cached session A seq=50,
then a stale delta from session A retries at seq=51 — must not
leak into session B's map.
- TestMemoryStore_LegacyZeroSessionFallsBackToSeqOnly: legacy
clients (session_id=0) keep working under seq-only rules.
- TestMemoryStore_MixedSessionAcceptsLegacyDelta: legacy delta
against a session-tagged cached state must not be dropped (fleet
upgrade hazard).
management.pb.go regenerated with protoc-gen-go v1.36.6 (matches the
existing header) — diff scoped to the new SessionId getter + struct
field + rawDesc entry.
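Condensed, the accept/drop rule described above looks roughly like the following toy model (not the real MemoryStore):

package main

import "fmt"

// entry mirrors the cached state: a sequence number plus the pusher's
// per-process session id (0 = legacy/unset).
type entry struct {
	seq       uint64
	sessionID uint64
}

// accept sketches the rule: full snapshots always replace; deltas must
// advance seq, and when BOTH sides advertise a session id the ids must
// match, so a retried delta from a dead daemon process is dropped.
func accept(cached, incoming entry, fullSnapshot bool) bool {
	if fullSnapshot {
		return true // new session's snapshot resets the epoch
	}
	if cached.sessionID != 0 && incoming.sessionID != 0 &&
		cached.sessionID != incoming.sessionID {
		return false // stale delta from the previous session
	}
	return incoming.seq > cached.seq
}

func main() {
	cached := entry{seq: 1, sessionID: 0xB} // session B snapshot just landed
	stale := entry{seq: 51, sessionID: 0xA} // old session A delta retries
	legacy := entry{seq: 2, sessionID: 0}   // pre-3.7i client, seq-only rules
	fmt.Println(accept(cached, stale, false))  // false
	fmt.Println(accept(cached, legacy, false)) // true
}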
Hardware test on 4998e5a surfaced a regression: my Item 1 fix (suppress reconnect-guard offers under p2p-dynamic when ICE was detached for inactivity) used "handshaker.readICEListener() == nil" as the detach signal. But the listener is ALSO nil for a brand-new peer that has never been connected — the initial setup attaches it. Result: the very first bootstrap offer fired by the lazy mgr's guard activation got suppressed, and dk20 saw all 3 BM routers stuck in "Connecting" forever after the wakeup ping.

Fix: track "have we ever been connected" via a new Conn.everConnected atomic.Bool, set on first successful configureConnection (ICE) or relay-connected transition. The skip branch now ALSO requires everConnected==true, so:
- brand-new peer -> first offer goes through (bootstrap works)
- was-connected -> ICE detach for inactivity skips offers (Item 1 intent preserved)
- ICE-failure-backoff -> existing 3-tries-then-hourly retry continues to handle it

Regression test: TestConn_OnGuardEvent_SkipOfferGatedOnEverConnected ensures the everConnected.Load() landmark stays above the skip-offer trace landmark in onGuardEvent's source — cheap static-text guard that fails immediately if a future refactor drops the gate.
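The gate itself reduces to a few lines; this is a toy model with invented shouldSkipOffer/onICEConnected names, only everConnected comes from the commit.

package main

import (
	"fmt"
	"sync/atomic"
)

// conn sketches the gate: skip reconnect offers only for peers that were
// connected before ICE was detached for inactivity.
type conn struct {
	everConnected atomic.Bool
}

func (c *conn) onICEConnected() { c.everConnected.Store(true) }

// shouldSkipOffer: a nil ICE listener alone is ambiguous (brand-new peer
// vs. detached-for-inactivity), so the gate also requires everConnected.
func (c *conn) shouldSkipOffer(iceListenerNil bool) bool {
	return iceListenerNil && c.everConnected.Load()
}

func main() {
	fresh := &conn{}
	fmt.Println(fresh.shouldSkipOffer(true)) // false: bootstrap offer goes out
	fresh.onICEConnected()
	fmt.Println(fresh.shouldSkipOffer(true)) // true: inactivity detach, suppress
}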
Hardware test on c9a47ed surfaced a stuck-state bug: after a WireGuard handshake timeout (3 min of no kernel-side handshake), the peer's active worker was closed via onWGDisconnected, BUT the lazy manager kept it in the "active" set with no activity listener attached. Result:
- Status frozen at "Connecting" indefinitely
- Local outbound traffic (ping etc.) silently dropped — no "activity detected" log, no offer fired
- Recovery only happened if the REMOTE peer happened to send an offer (signal-RPC path bypasses the activity listener)
- Confirmed on dk20: 0/10 ping responses to 572a2/5731A after WG timeout, no daemon log activity for 10+ minutes

Fix: add an onWGTimeoutRecover callback on Conn, wired by ConnMgr to ConnMgr.RecoverPeerToIdle(peerKey) which calls lazyConnMgr.DeactivatePeer(connID). DeactivatePeer already does the right thing (close conn fully, restart activity monitor) — it was just never invoked from the WG-timeout path. The callback is launched in a goroutine to avoid re-entering conn.mu (the recovery path closes the conn, which itself takes conn.mu).

Files:
- client/internal/peer/conn.go: new onWGTimeoutRecover field + SetOnWGTimeoutRecover setter, invoked at end of onWGDisconnected after the active worker close.
- client/internal/conn_mgr.go: new RecoverPeerToIdle method and callback wiring in AddPeerConn.
- client/internal/peer/conn_handover_order_test.go: static-text regression test enforces that onWGDisconnected references onWGTimeoutRecover and invokes it AFTER workerRelay.CloseConn().
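A minimal sketch of the callback wiring and the re-entrancy dodge described above (field and method names follow the commit text; everything else is illustrative):

package main

import (
	"fmt"
	"sync"
	"time"
)

// Conn models the hook: the callback runs in its own goroutine so the
// recovery path, which closes the conn and re-takes conn.mu, never
// re-enters the mutex held by onWGDisconnected.
type Conn struct {
	mu                 sync.Mutex
	onWGTimeoutRecover func()
}

func (c *Conn) SetOnWGTimeoutRecover(f func()) { c.onWGTimeoutRecover = f }

func (c *Conn) onWGDisconnected() {
	c.mu.Lock()
	defer c.mu.Unlock()
	// ...close the active worker first...
	if c.onWGTimeoutRecover != nil {
		go c.onWGTimeoutRecover() // avoid re-entering c.mu
	}
}

func main() {
	c := &Conn{}
	done := make(chan struct{})
	c.SetOnWGTimeoutRecover(func() {
		fmt.Println("DeactivatePeer: back to lazy idle")
		close(done)
	})
	c.onWGDisconnected()
	select {
	case <-done:
	case <-time.After(time.Second):
		fmt.Println("timeout")
	}
}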
Phase 3.7i (netbirdio#5989). Adds PeerSystemMeta.supported_features as a forward-compatible capability advertisement mechanism. The client ships a list of feature keywords (currently just "p2p_dynamic") that the management server can branch on to decide whether to send legacy-compat fallback settings to clients that pre-date a feature. Source of truth for the keyword list lives in client/system/features.go with a pin-down test in features_test.go to make adding new capabilities a deliberate two-edit operation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
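Sketch of the single-source-of-truth shape described here; only the "p2p_dynamic" keyword is taken from the commit, the function name is assumed.

package main

import "fmt"

// SupportedFeatures returns the capability keywords this client
// advertises via PeerSystemMeta.supported_features. The pin-down test
// asserts exactly this list, making each new capability a deliberate
// two-edit change.
func SupportedFeatures() []string {
	return []string{"p2p_dynamic"}
}

func main() {
	fmt.Println(SupportedFeatures())
}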
Phase 3.7i (netbirdio#5989). Two new account-level settings:
- LegacyLazyFallbackEnabled bool, default true
- LegacyLazyFallbackTimeoutSeconds uint32, default 3600

When the account ConnectionMode is p2p-dynamic, conversion.go (next commit) downgrades clients that don't advertise the "p2p_dynamic" capability to p2p-lazy with the configured timeout. Default is ON so older clients keep behaving sanely after an admin flips the mode. GORM AutoMigrate adds the columns. The pgx fast-path SELECT loads them with NullBool / NullInt64 backed defaults so pre-3.7i rows in the DB stay safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7i (netbirdio#5989). When the resolved account ConnectionMode is p2p-dynamic and the peer does NOT advertise the "p2p_dynamic" capability via PeerSystemMeta.SupportedFeatures, downgrade its PeerConfig to p2p-lazy with the admin-configured fallback timeout. This addresses Codex's compatibility concern: a default of p2p-dynamic on the account combined with an old client (that just ignores the unknown enum and falls back to LazyConnectionEnabled) would otherwise send LazyConnectionEnabled=false -> peers eagerly hold ICE forever -> battery / metered-link cost. With this override, old clients see LazyConnectionEnabled=true and the configured timeout, so they get the closest behavioural equivalent to p2p-dynamic that their proto vocabulary allows. Override is gated on the LegacyLazyFallbackEnabled toggle; admins who know their entire fleet is on a 3.7i+ build can disable it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
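The downgrade branch reduces to roughly the following (a sketch with assumed names and string-typed modes, not the real conversion.go):

package main

import (
	"fmt"
	"slices"
)

// resolvePeerMode: when the account mode is p2p-dynamic but the peer
// never advertised the "p2p_dynamic" capability, send it p2p-lazy plus
// the admin-configured fallback timeout instead.
func resolvePeerMode(accountMode string, features []string,
	fallbackEnabled bool, fallbackTimeoutSecs uint32) (string, uint32) {

	if accountMode == "p2p-dynamic" &&
		fallbackEnabled &&
		!slices.Contains(features, "p2p_dynamic") {
		return "p2p-lazy", fallbackTimeoutSecs
	}
	return accountMode, 0
}

func main() {
	// Old client: no capability advertised -> closest legacy equivalent.
	fmt.Println(resolvePeerMode("p2p-dynamic", nil, true, 3600))
	// 3.7i+ client keeps the real mode.
	fmt.Println(resolvePeerMode("p2p-dynamic", []string{"p2p_dynamic"}, true, 3600))
}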
Phase 3.7i (netbirdio#5989), follow-up to Codex PR review. New clients talking to an old (pre-3.7i) management server would otherwise hit codes.Unimplemented from SyncPeerConnections on every push attempt: once at the initial snapshot, once per peer state change, and once per heartbeat tick (60s). The retries are wasteful (gRPC roundtrip + wakeup on every metered link) and noisy in the daemon log. Detect Unimplemented once via status.FromError, log a single WARN for operator visibility, and latch a sticky `disabled` atomic.Bool that short-circuits all further flushDelta / flushFull paths. The events are still recorded in lastPushed so the dirty-state computation doesn't keep re-flagging them. Disabled state is per-pusher-instance, i.e. resets at the next daemon restart - if mgmt has been upgraded by then, the new pusher detects support naturally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
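A minimal sketch of the sticky latch, assuming the standard google.golang.org/grpc status/codes API; the pusher type and flush signature are invented for illustration.

package main

import (
	"fmt"
	"sync/atomic"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// pusher detects codes.Unimplemented once, warns once, then
// short-circuits every subsequent flush for the life of this instance.
type pusher struct {
	disabled atomic.Bool
}

func (p *pusher) flush(push func() error) {
	if p.disabled.Load() {
		return // mgmt already told us it doesn't support this RPC
	}
	if err := push(); err != nil {
		if s, ok := status.FromError(err); ok && s.Code() == codes.Unimplemented {
			fmt.Println("WARN: management does not support SyncPeerConnections; disabling pushes")
			p.disabled.Store(true)
		}
	}
}

func main() {
	p := &pusher{}
	oldServer := func() error { return status.Error(codes.Unimplemented, "unknown method") }
	p.flush(oldServer) // warns and latches
	p.flush(oldServer) // silent no-op
}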
Phase 3.7i (netbirdio#5989). Adds the two new account settings to the HTTP-API surface and emits audit events when they change: legacy_lazy_fallback_enabled (bool, default true) legacy_lazy_fallback_timeout_seconds (int, default 3600, range 60-86400) PUT /api/accounts/{id} validates the timeout range and rejects out-of-range values with HTTP 400. GET returns both fields unconditionally (defaults if the DB row predates 3.7i). Activity codes 125 + 126 emit on toggle/timeout change with old+new values in the meta payload. The change-detection block in handleAccountSettingsUpdate now treats both new fields as peer-update-triggering so legacy clients re-receive their PeerConfig with the new override timeout when an admin tweaks the setting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7i (netbirdio#5989), follow-up to Codex PR review. The GORM `default:true` / `default:3600` tags on Settings.LegacyLazyFallback* only fire on SQL INSERT, leaving every in-memory construction path at the Go zero value (false / 0). The conversion layer reads those zeros and disables the legacy-fallback for the account -- the exact opposite of the intended behaviour.

Fixes:
- types.DefaultLegacyLazyFallback{Enabled,TimeoutSeconds} constants + ApplyLegacyLazyFallbackDefaults helper (single source of truth)
- account.go (both NewAccount paths) seeds the new fields explicitly
- store/file_store.go applies defaults on every JSON-account load (existing on-disk JSON predates the fields entirely)
- http handler PUT path seeds the rebuilt-from-scratch returnSettings with defaults BEFORE per-field if-blocks; explicit user values (false toggle, custom timeout) still override
- http handler GET path returns defaults for accounts whose DB row predates the fields (zero-valued), keeping the API response in sync with what the conversion layer actually applies
- accounts_handler_test fixtures updated for the round-trip

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
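A sketch of what such a defaults helper can look like. The constants match the commit text; using timeout==0 as the "row predates the fields" signal is a simplifying assumption here, and the real helper may use a separate presence check for the boolean.

package main

import "fmt"

const (
	DefaultLegacyLazyFallbackEnabled        = true
	DefaultLegacyLazyFallbackTimeoutSeconds = 3600
)

type Settings struct {
	LegacyLazyFallbackEnabled        bool
	LegacyLazyFallbackTimeoutSeconds uint32
}

// ApplyLegacyLazyFallbackDefaults seeds Go-zero-valued rows (in-memory
// constructions, pre-3.7i JSON/DB loads) so the conversion layer never
// sees the accidental "fallback disabled, timeout 0" combination.
// Assumption: timeout 0 marks a row that predates the fields entirely.
func ApplyLegacyLazyFallbackDefaults(s *Settings) {
	if s.LegacyLazyFallbackTimeoutSeconds == 0 {
		s.LegacyLazyFallbackTimeoutSeconds = DefaultLegacyLazyFallbackTimeoutSeconds
		s.LegacyLazyFallbackEnabled = DefaultLegacyLazyFallbackEnabled
	}
}

func main() {
	var s Settings // zero values, as after loading an old JSON account
	ApplyLegacyLazyFallbackDefaults(&s)
	fmt.Println(s.LegacyLazyFallbackEnabled, s.LegacyLazyFallbackTimeoutSeconds) // true 3600
}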
Upstream PR netbirdio#6006 (commit db44848 "Drop netmap calculation on peer read", merged 2026-04-28) simplified GetPeers and GetPeer by removing the post-filter that expanded a regular user's own-peer list with peers reachable via account access policies. Reason given was the expense of GetPeerConnectionResources on large accounts. The visibility behaviour was lost in the process: a "user"-role account member now sees ONLY their directly-owned peers in the dashboard, not the routing peers / counterparts their policies authorise them to communicate with -- making the dashboard close to useless.

Reported by Michael Uray for georg.stoisser-gigacher: Georg has 1 own peer (ctb50-d) and 17 auto_groups feeding 10+ access policies, yet sees a single peer in the web UI. Promoting to auditor exposes ALL peers -- too much.

Fix: restore the two pre-netbirdio#6006 helpers verbatim (modulo a small optimisation to avoid the redundant GetAccountPeers call):
- getUserAccessiblePeers(ctx, accountID, ownPeers) -- merges the user's own peers with the result of GetPeerConnectionResources on each, returning the union.
- checkIfUserOwnsPeer(ctx, accountID, userID, peer) -- per-peer membership check used by the GET /api/peers/{id} path.

Plus call them from GetPeers / GetPeer respectively. Filter parity with admin path: the SQL store's GetAccountPeers applies LIKE-style nameFilter / ipFilter; for the user-role branch (GetUserPeers + ACL union) we apply the same substring matching in-process via a new filterPeersByNameAndIP helper. This keeps the public GetPeers contract identical for both roles. Unit tests cover the helper directly and the existing GetPeer test now covers both the policy-allows and policy-denies sub-cases.

Performance trade-off acknowledged: GetPeerConnectionResources is expensive on large accounts. We accept the cost because (a) only non-admin readers hit this branch, (b) typical own-peer count is small, and (c) a future cache layer can offset it without changing this contract.

See docs/bugs/2026-05-04-user-peer-visibility-regression.md (now bundled in this repo) for the full analysis and verification path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
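The restored union helper boils down to roughly this toy version (string peer IDs, connResources standing in for GetPeerConnectionResources):

package main

import "fmt"

// getUserAccessiblePeers merges the user's own peers with the peers
// reachable from each via access policies, deduplicating the union.
func getUserAccessiblePeers(ownPeers []string, connResources func(string) []string) []string {
	seen := map[string]bool{}
	var out []string
	add := func(id string) {
		if !seen[id] {
			seen[id] = true
			out = append(out, id)
		}
	}
	for _, p := range ownPeers {
		add(p)
		for _, r := range connResources(p) {
			add(r)
		}
	}
	return out
}

func main() {
	policies := func(p string) []string { return []string{"r1-pve5", "dk20"} }
	fmt.Println(getUserAccessiblePeers([]string{"ctb50-d"}, policies))
}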
Bug reported by Michael Uray on uray-mic-d4 (2026-05-04): toggling the VPN off while the "Peers and Networks" window is open spawns a modal "get client: failed to list routes: rpc error: code = Unknown desc = not connected" dialog. The dialog re-pops every 10s because the auto-refresh ticker keeps trying to ListNetworks against a daemon that just shut down its IPC pipe. Fix: split getFilteredNetworks / updateNetworks into two-mode helpers (loud + silent). The auto-refresh ticker uses the silent variant (logs only); the manual Refresh button still uses the loud variant because the user pressing Refresh expects feedback if it fails. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback (Michael Uray, 2026-05-04): the expanded peer-detail text in the "Peers and Networks" window cannot be marked or copied with the mouse, so a user who wants to paste an FQDN or IP into another tool has to retype it. Same for the network-range cells in the All/Overlapping/Exit-node tabs. Fix: set widget.Label.Selectable = true on the affected labels. Fyne 2.6+ supports the flag natively, no custom widget required. The header buttons (peer-row collapse/expand) stay non-selectable because they need to keep their tap-to-toggle behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported by Michael Uray on uray-mic-d4 (2026-05-04, debug bundle):
0/28 peers P2P, all relayed even on a network with valid public srflx
visibility. Pion ICE log showed selected candidate pairs like
[udp4 host 127.0.0.1:51820] <-> [udp4 srflx 81.16.19.15:51820]
[udp4 host 172.26.240.1:51820] <-> [udp4 host 41.66.90.143:40488]
[udp4 host 10.102.0.52:51820] <-> [udp4 srflx 64.141.62.202:60993]
These local addresses are unroutable from the remote peer (loopback,
Hyper-V Default Switch, internal VPN tunnel). Pion ICE never sees a
working pair, falls back to relay every time.
Root cause: client/internal/stdnet/filter.go used strings.HasPrefix
on the raw interface name. The disallow list ("lo", "veth", "docker",
"br-") is lowercase but Windows reports interface names in mixed
case ("Loopback Pseudo-Interface 1", "vEthernet (Default Switch)",
"Docker Desktop"). Case-sensitive HasPrefix matched none of them, so
*every* Windows interface slipped past the filter and Pion ICE
gathered host candidates from all of them.
The fix has three parts, to avoid over-filtering (Codex review caught a v1
that filtered ALL vEthernet*, including the user's actual default-
route external switch "vEthernet (LAN)"):
1. Lower-case both sides for the disallow-list prefix match. This
makes "lo" / "wt" / "wg" / "tailscale" / "zerotier" / "docker"
work uniformly on Linux and Windows.
2. On Windows, *additionally* skip a small targeted list of
well-known internal interfaces by case-insensitive substring:
- "loopback pseudo-interface" (127.0.0.1)
- "vethernet (default switch)" (Hyper-V NAT-only)
- "vethernet (wsl" (WSL2)
User-named Hyper-V external switches like "vEthernet (LAN)" are
LEFT ALONE — those are the host's real default route on uray-
mic-d4 and on every multi-NIC Hyper-V host.
3. The "veth" entry in disallowList is intentionally skipped on
Windows (handled by the targeted check above). The Linux veth
pair filtering stays unchanged.
Regression tests in filter_test.go pin both the kill list and the
keep list, with an explicit assertion that "vEthernet (LAN)" stays
allowed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
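A condensed sketch of the resulting filter decision; the function name, list contents, and the isWindows parameter are illustrative stand-ins for the logic in client/internal/stdnet/filter.go:

```go
package stdnet

import "strings"

// interfaceAllowed reports whether ICE may gather host candidates from
// the interface: a case-insensitive prefix match against the disallow
// list, plus a targeted Windows substring kill list that leaves
// user-named external switches ("vEthernet (LAN)") alone.
func interfaceAllowed(name string, isWindows bool) bool {
	lower := strings.ToLower(name)

	for _, prefix := range []string{"lo", "wt", "wg", "tailscale", "zerotier", "docker", "br-", "veth"} {
		if prefix == "veth" && isWindows {
			continue // Windows vEthernet handled by the targeted list below
		}
		if strings.HasPrefix(lower, prefix) {
			return false
		}
	}

	if isWindows {
		for _, sub := range []string{
			"loopback pseudo-interface",  // 127.0.0.1
			"vethernet (default switch)", // Hyper-V NAT-only
			"vethernet (wsl",             // WSL2
		} {
			if strings.Contains(lower, sub) {
				return false
			}
		}
	}
	return true // e.g. "vEthernet (LAN)" stays allowed
}
```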
…ffective_*
This is the same trap the in-tree comment 30 lines higher in
getAccount() warns about for settings_connection_mode -- but the
analogous Phase-3.7i columns on the peers table were never added to
the pgx fast-path:
- meta_supported_features
- meta_effective_connection_mode
- meta_effective_relay_timeout_secs
- meta_effective_p2_p_timeout_secs
- meta_effective_p2_p_retry_max_secs
Symptom on uray-mic-d4 + ctb50-d (debug bundles 2026-05-04):
22:52:10 mode=p2p-dynamic (Login response, in-memory peer.Meta)
22:52:16 mode=p2p-lazy (~5s later via pgx-loaded NetworkMap push)
The Login-response path uses the freshly-extracted PeerSystemMeta
from the Login request (with SupportedFeatures populated), so the
first PeerConfig correctly resolves to p2p-dynamic. Five seconds
later -- as soon as the next NetworkMap broadcast triggers the pgx
fast-path account loader -- toPeerConfig sees `Meta.SupportedFeatures
== nil` because the SELECT statement omits the column. Its
legacy-fallback check
slices.Contains(peer.Meta.SupportedFeatures, "p2p_dynamic")
returns false for EVERY 3.7i+ client, so resolvedMode silently
downgrades to p2p-lazy and the mode-flip cascades through every
peer's lazy-mgr ("peer reset to idle" x N -> 1610x "ICE Agent is
not initialized yet" -> ICE retries exhausted -> 0 P2P).
Verified by SQL: meta_supported_features for uray-mic-d4, ctb50-d,
dk20 all hold ["p2p_dynamic"] in the DB; the column was being
written correctly, just never read on this hot path.
Fix: SELECT the missing five columns and scan them into
sql.NullInt64 / sql.NullString / []byte holders, then unpack into
nbpeer.PeerSystemMeta exactly as the GORM path does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
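The holder pattern, sketched against a simplified meta struct; the real code scans many more columns and fills nbpeer.PeerSystemMeta directly, so field and column order here are assumptions:

```go
package store

import (
	"database/sql"
	"encoding/json"
)

// peerSystemMeta mirrors the handful of Phase-3.7i fields; the real
// struct is nbpeer.PeerSystemMeta.
type peerSystemMeta struct {
	SupportedFeatures         []string
	EffectiveConnectionMode   string
	EffectiveRelayTimeoutSecs uint32
	EffectiveP2PTimeoutSecs   uint32
	EffectiveP2PRetryMaxSecs  uint32
}

// scanPhase37iColumns shows the fix's shape: nullable SQL holders,
// unpacked only when valid, JSON-decoded for the features array.
func scanPhase37iColumns(rows *sql.Rows, meta *peerSystemMeta) error {
	var (
		features []byte         // meta_supported_features (JSON array)
		mode     sql.NullString // meta_effective_connection_mode
		relayTO  sql.NullInt64  // meta_effective_relay_timeout_secs
		p2pTO    sql.NullInt64  // meta_effective_p2_p_timeout_secs
		p2pRetry sql.NullInt64  // meta_effective_p2_p_retry_max_secs
	)
	if err := rows.Scan(&features, &mode, &relayTO, &p2pTO, &p2pRetry); err != nil {
		return err
	}
	if len(features) > 0 {
		if err := json.Unmarshal(features, &meta.SupportedFeatures); err != nil {
			return err
		}
	}
	if mode.Valid {
		meta.EffectiveConnectionMode = mode.String
	}
	if relayTO.Valid {
		meta.EffectiveRelayTimeoutSecs = uint32(relayTO.Int64)
	}
	if p2pTO.Valid {
		meta.EffectiveP2PTimeoutSecs = uint32(p2pTO.Int64)
	}
	if p2pRetry.Valid {
		meta.EffectiveP2PRetryMaxSecs = uint32(p2pRetry.Int64)
	}
	return nil
}
```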
…Ps survive

Pre-existing NetBird p2p-lazy bug, made more visible by the p2p-dynamic mode of Phase 3.7i (which adds the WG-handshake-timeout-recovery cycle): when the lazy manager deactivated a peer, peer.Conn.Close() called endpointUpdater.RemoveWgPeer() unconditionally, wiping every AllowedIP including the routed subnets the route-manager had appended in-place. The lazy listener then re-armed the peer with only the basic peer-IP /32. The route-manager's allowedIPsRefCounter went stale -- routed-subnet traffic was silently dropped by WG until either the next mgmt-side reconcile re-attached the prefix or the operator manually pinged the routing peer's NetBird IP to wake it up via the activity listener (which only matches the peer-IP /32).

Symptom (Michael Uray, 2026-05-04): from uray-mic-d4 (Windows) a ping to 192.168.91.220 (behind r1-pve5) timed out for ~30s, then a manual ping to r1-pve5's NetBird IP woke it, after which 192.168.91.220 was reachable again.

Fix: add a `keepWgPeer bool` parameter to peer.Conn.Close. Lazy-suspend callers pass true (keep the WG peer entry; the listener will UpdatePeer in place to switch the endpoint to the fake-IP); permanent-removal and fresh-Conn-cleanup callers pass false (current behaviour preserved). Updated callers (intent in parentheses):

  peerstore.Store.PeerConnIdle    -> true  (lazy idle)
  peerstore.Store.PeerConnClose   -> true  (lazy excluded)
  ConnMgr loop RelayInactiveChan  -> true  (lazy suspend)
  ConnMgr.RemovePeerConn          -> false (permanent remove)
  ConnMgr resetPeers (mode chg)   -> false (full reopen)
  Engine peer-already-exists      -> true  (live Conn owns WG entry)
  Engine remote-offline-debounce  -> true  (allow wake-up on return)

Regression tests in conn_lazy_keepwgpeer_test.go cover the Conn.Close signature, the !keepWgPeer guard, and every conn_mgr.go caller. Both fail on origin/main; both pass with this fix. Detailed mechanism analysis in docs/bugs/2026-05-04-lazy-wake-on-routed-subnet.md.

Refs: Codex review 2026-05-04 (precision on Windows fake-IP at 127.2.x.y vs Linux 127.0.1.x UDP listener; HA-routing pitfall flag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
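A sketch of the changed Close path with stubbed types; the endpointUpdater interface and RemoveWgPeer signature are assumptions, and the worker teardown is elided:

```go
package peer

import log "github.com/sirupsen/logrus"

// endpointUpdater is a stand-in for the real interface; only the call
// relevant to this fix is declared.
type endpointUpdater interface {
	RemoveWgPeer(publicKey string) error
}

type Conn struct {
	key      string
	endpoint endpointUpdater
}

// Close tears down the connection. keepWgPeer controls whether the
// WireGuard peer entry -- and the route-manager-appended AllowedIPs on
// it -- survives. Lazy-suspend callers pass true; permanent-removal
// and fresh-Conn-cleanup callers pass false (previous behaviour).
func (c *Conn) Close(keepWgPeer bool) {
	// ...stop ICE / relay workers, cancel watchers (elided)...

	if keepWgPeer {
		// The entry stays; the lazy listener will UpdatePeer in place
		// to switch the endpoint to the fake-IP, so routed-subnet
		// AllowedIPs keep flowing after the next wake.
		return
	}
	if err := c.endpoint.RemoveWgPeer(c.key); err != nil {
		log.Warnf("remove WG peer %s: %v", c.key, err)
	}
}
```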
… fix)
Reported by Michael Uray on dk20 ↔ ctb50-d ↔ BM-Routers (2026-05-04):
after the lazy/inactivity cycle suspends a peer (Connection type
goes through P2P → Relay → Idle), a subsequent ping to wake it up
opens the connection BUT NEVER ESTABLISHES ICE. The peer stays at
"Connecting" forever, ICMP times out, only a daemon restart on one
side cleans the state.
Live trace from dk20.client.log:
22:31:42 close peer connection (keepWgPeer=true)
22:31:42 created activity listener: 127.0.1.1:46189
... 28 minutes of idle ...
22:59:21 activity detected
22:59:21 detected peer activity
22:59:21 starting guard for reconnection with MaxInterval: 35.9s
...sixty-seven seconds of nothing...
Root cause: signal-trigger and activity-trigger paths were asymmetric.
Signal path (engine.go:1845 → ConnMgr.ActivatePeer):
ActivatePeer does conn.Open() + (for p2p-dynamic) conn.AttachICE().
AttachICE registers the ICE listener on the handshaker AND issues
SendOffer so the remote side learns we're ICE-capable again.
Activity path (lazy-mgr.onPeerActivity → peerStore.PeerConnOpen):
PeerConnOpen only calls conn.Open(). Open() recreates workerICE
but does NOT register the listener on the handshaker -- in
ModeP2PDynamic the registration is deliberately deferred to
AttachICE (deferICEListener=true at conn.go:258). With no
AttachICE call, handshaker.iceListener stays nil.
The guard's onGuardEvent (conn.go:751) then sees:
- everConnected.Load() == true (we ran before the cycle)
- handshaker.readICEListener() == nil (Open didn't attach for
p2p-dynamic)
- IceBackoff not suspended, no failures
and skips every offer with the comment
"guard: skip offer (ICE detached for inactivity, p2p-dynamic;
will re-attach on real traffic)"
But the only re-attach path was ConnMgr.ActivatePeer -- only fired
on signal messages, never on local activity. Result: dead loop, no
offer ever sent, ICE never converges, peer stuck on relay or fully
unreachable until manual restart.
Fix: lazy-mgr.onPeerActivity now also calls conn.AttachICE() after
PeerConnOpen. AttachICE is mode-safe:
- ModeP2P / ModeP2PLazy: listener already registered by Open,
attachICEListenerLocked returns false -> no-op
- ModeRelayForced: workerICE nil -> error returned, swallowed
- ModeP2PDynamic: listener registered + SendOffer issued, ICE
negotiation re-engages cleanly
Honours iceBackoff so the existing 3-tries-then-hourly retry policy
still applies on persistent failure. Symmetric to the signal path
in ConnMgr.ActivatePeer (conn_mgr.go:488).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
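The symmetric wake path, sketched with stub interfaces (the real Manager and peer store carry much more state; names here follow the commit text):

```go
package lazyconn

import (
	"context"

	log "github.com/sirupsen/logrus"
)

// conn and peerStore are stand-ins for the real client types; only the
// calls relevant to the fix are declared.
type conn interface{ AttachICE() error }

type peerStore interface {
	PeerConn(peerID string) (conn, bool)
	PeerConnOpen(ctx context.Context, peerID string) error
}

type Manager struct {
	engineCtx context.Context
	peers     peerStore
}

// onPeerActivity: after this fix the local-activity wake mirrors the
// signal path -- Open, then AttachICE, so p2p-dynamic actually sends
// an offer instead of idling with a nil ICE listener.
func (m *Manager) onPeerActivity(peerID string) {
	c, ok := m.peers.PeerConn(peerID)
	if !ok {
		return
	}
	if err := m.peers.PeerConnOpen(m.engineCtx, peerID); err != nil {
		log.Errorf("open peer connection %s: %v", peerID, err)
		return
	}
	// Mode-safe: no-op for p2p/p2p-lazy (listener already registered
	// by Open), swallowed error for relay-forced (workerICE nil); for
	// p2p-dynamic it registers the ICE listener and issues SendOffer.
	if err := c.AttachICE(); err != nil {
		log.Debugf("attach ICE on activity wake %s: %v", peerID, err)
	}
}
```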
Codex review point 4 (2026-05-04): "On real activity the backoff should either be deliberately respected or deliberately soft-reset / briefly bypassed, but it must not block P2P at random."

Reproduced on D95820 ↔ w11-test1 (Hetzendorf LAN ↔ Graz LAN, both behind A1 NAT, 2026-05-05):

  04:53:35 [w11-test1] ICE failure #1 (transient, e.g. concurrent wake-up race or first STUN binding miss)
  04:54:53 [w11-test1] ICE failure #2
  04:55:15 [w11-test1] ICE retries exhausted (3/3), switching to hourly retry

After that the connection stays Connection type=Relayed for an HOUR even when the user pings hard from one side. AttachICE early-returns on iceBackoff.IsSuspended(), so the previous activity-trigger fix (commit 4faa3f1) couldn't help: the listener gets re-attached only because the backoff blocks both Open() AND the offer.

Verified post-fix mechanism: a daemon restart resets the backoff in-process (init path), and ICE then converges to srflx/srflx 80.120.218.226:51821 ↔ 41.66.90.143:51820 in ~50s. So the NAT traversal is physically possible -- it's the in-process state that needs clearing on real user activity.

Fix: in Manager.onPeerActivity, call the new Conn.ResetIceBackoff() right before AttachICE. The lazy-mgr's local-activity listener fires on real outbound traffic to the peer's NetBird IP -- that's the strongest possible "user wants to talk to this peer" signal, much more reliable than the existing backoff retry policy. Resetting the counter is safe: if ICE keeps failing, the next 3 attempts re-enter hourly mode the same way as before, and signal-server load stays bounded.

The signal-trigger path (engine.go:1845 → ConnMgr.ActivatePeer → conn.AttachICE) is unchanged and still respects the backoff -- those triggers are not user-initiated and shouldn't override the failure policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review 2026-05-05 (refinement of the audit prompted by Michael
Uray): the lazy-mgr activity-listener only fires for peers in fully-
Idle state (post-Close, watcherActivity). For peers in Relay-state
(Open=true, ICE detached after iceTimeout, watcherInactivity), no
local activity ever triggered AttachICE. Result: a peer that fell
back to relay would stay relayed indefinitely even with continuous
user traffic, only escaping if remote sent an offer or the conn
fully cycled to Idle (5-8 min later).
Codex also flagged that the previous "ResetIceBackoff on every
activity" change was too aggressive for the relay path -- every
relay payload packet would have reset the failure backoff,
defeating the 3-tries-then-hourly retry policy that protects
metered links from useless retries when ICE is structurally broken.
Refined fix is two-layer:
1. Fast-path (this commit): ActivityRecorder gains an OnActivity
callback fired at most once per saveFrequency=5s per peer when
transport activity is observed. PeerRecord now carries the pubkey
so the callback knows which peer woke. Engine wires the callback
to Conn.AttachICEOnRelayActivity which gates on:
- mode == p2p-dynamic
- conn.opened
- currentConnPriority == Relay
- handshaker.iceListener == nil (ICE actually detached)
- !iceBackoff.IsSuspended() (respect failure backoff -- no reset)
- everConnected.Load() (had successful P2P at least once)
When all gates pass, AttachICE is called -- it sends a fresh offer,
the remote's OnNewOffer recreates its agent, and ICE re-converges
in seconds. The relay tunnel stays up across the upgrade for
continuity.
2. Activity-recorder note: only the receive path (ice_bind.go:232,
ice_bind.go:333) records into LastActivities today. Outbound-only
traffic is currently invisible to this fast-path; ping-style and
bidirectional traffic does fire correctly because the echo reply
counts as receive activity. A future commit can instrument the
ICEBind.Send path if pure-outbound wake becomes important.
The ResetIceBackoff call stays in lazy-mgr.onPeerActivity because
that fires only after the peer has been fully Closed (via relay-
inactivity). At that point user-driven outbound traffic IS the
authoritative signal that the user wants the peer back, so the
hourly-retry suspension is correctly overridden. The relay-path
trigger added here does NOT reset, by design.
Tests: ActivityRecorder OnActivity callback tested via existing
activity_test.go path; AttachICEOnRelayActivity gating verified by
build + targeted test runs (./client/iface/bind, ./client/internal/peer,
./client/internal/lazyconn/inactivity all green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
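The gate chain from layer 1, reduced to a pure decision function for illustration; the field names follow this description, not the in-tree Conn struct:

```go
package peer

// relayActivityGates captures the state AttachICEOnRelayActivity
// inspects before sending a fresh offer.
type relayActivityGates struct {
	ModeP2PDynamic   bool // resolved connection mode == p2p-dynamic
	Opened           bool // conn.opened
	RelayPriority    bool // currentConnPriority == Relay
	ICEListenerNil   bool // handshaker.iceListener == nil (detached)
	BackoffSuspended bool // iceBackoff.IsSuspended() -- respected, no reset
	EverConnected    bool // had a successful P2P at least once
}

// shouldReattachICE is the pure-decision core of the fast-path: all
// gates must agree; only then is AttachICE (fresh offer) issued and
// ICE re-converges while the relay tunnel stays up.
func shouldReattachICE(g relayActivityGates) bool {
	return g.ModeP2PDynamic &&
		g.Opened &&
		g.RelayPriority &&
		g.ICEListenerNil &&
		!g.BackoffSuspended &&
		g.EverConnected
}
```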
User-reported (Michael Uray, 2026-05-05) and Codex review point 5
follow-up: a peer in Relayed state (ICE detached after iceTimeout)
that hits a transient ICE failure during the next re-attempt enters
"3 retries exhausted -> hourly retry" mode and stays Relayed for a
full HOUR even when the user actively pings. The backoff was
designed to protect against truly broken paths, but the LTE-NAT
recovery window (15-30s) consistently exceeds the Guard's 3-fast-
retry timer (~12s), so legitimate paths get poisoned by transient
network jitter.
Reproduced live on ctb50-d (Lebring) <-> 572A2 (Graz) hop after
~3min idle: ICE detached, then transient re-attach failure, then
hourly retry. User pings every few seconds, sees Relayed at 35ms
instead of P2P at 11ms.
Codex review point 5 explicitly allowed an "optional, very
deliberate user-activity retry override with hard rate-limit". This
commit is that override:
- iceBackoff.AllowActivityOverride() returns true at most ONCE
per activityOverrideMinInterval (5min) per peer
- AttachICEOnRelayActivity calls AllowActivityOverride before
skipping on suspended backoff; on grant it Reset()s the backoff
and proceeds to AttachICE
- 5min lines up with the relayTimeout default -- after one
override window the conn would have cycled to Idle anyway,
freeing the backoff via the C->A wake path which already does
ResetIceBackoff
Signal-server load impact: at most one extra offer/answer pair per
peer per 5min when the path is genuinely broken. For the typical
case (transient drop with quick recovery) we converge to P2P in the
override and stay there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
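A sketch of the rate-limited override under these assumptions; the struct is abbreviated (the real one also carries the failure counter and more), and the read-only contract matches the invariant pinned two commits later:

```go
package peer

import (
	"sync"
	"time"
)

const activityOverrideMinInterval = 5 * time.Minute

// iceBackoff sketch: only the fields the override consults are shown.
type iceBackoff struct {
	mu          sync.Mutex
	suspended   bool
	lastResetAt time.Time
}

// AllowActivityOverride grants at most one user-activity-driven retry
// per activityOverrideMinInterval while the backoff is suspended. It
// is deliberately read-only: on a grant the caller Reset()s the
// backoff (which stamps lastResetAt) and proceeds to AttachICE.
func (b *iceBackoff) AllowActivityOverride() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.suspended {
		return false // not in hourly-retry mode; normal path applies
	}
	if time.Since(b.lastResetAt) < activityOverrideMinInterval {
		return false // rate limit: at most one override per peer per 5min
	}
	return true
}
```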
Codex review 2026-05-05 catch: the AllowActivityOverride doc-comment
promised the rate-limit window applies "since the last (success-,
network-change-, or override-driven) reset", but markSuccess() never
updated lastResetAt. Only Reset() did. Concrete bug: a peer that
briefly succeeded an ICE connect (markSuccess fires, failures = 0,
suspended = false, but lastResetAt is still from daemon-start
minutes-or-hours ago) and then failed 3x to enter hourly retry mode
would have AllowActivityOverride return true on the very next user
ping -- defeating the rate-limit's intent of bounding signal-server
storm to one offer per peer per 5min.
Fix is a single line in markSuccess: stamp lastResetAt = time.Now().
A successful ICE connect IS semantically a reset point -- the path
just demonstrably worked, the failure history before it is no longer
predictive. Both consumers of lastResetAt (markFailure's
networkChangeGracePeriod and AllowActivityOverride's rate limit)
benefit from the fresh stamp.
Tests:
- TestIceBackoff_MarkSuccessStampsLastResetAt: direct regression
pin -- markSuccess after a stale lastResetAt MUST update it to
~now.
- TestIceBackoff_AllowActivityOverride: covers the four corners of
the rate-limited override (not-suspended, recently-reset,
>5min-since-reset+suspended, and the post-Reset/markSuccess case).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
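The fix itself, sketched on the same abbreviated struct as in the previous sketch (plus the failures counter):

```go
package peer

import (
	"sync"
	"time"
)

// iceBackoff sketch (fields abbreviated from the real struct).
type iceBackoff struct {
	mu          sync.Mutex
	failures    int
	suspended   bool
	lastResetAt time.Time
}

// markSuccess: a successful ICE connect is semantically a reset point,
// so stamp lastResetAt too -- that stamp is the one-line fix. Both
// consumers (markFailure's networkChangeGracePeriod and
// AllowActivityOverride's rate limit) read the fresh value.
func (b *iceBackoff) markSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures = 0
	b.suspended = false
	b.lastResetAt = time.Now() // previously only Reset() stamped this
}
```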
…ant test

Codex review follow-up 2026-05-05: three small refinements to make the backoff semantics explicit and self-documenting, plus an invariant test.

1. The AttachICEOnRelayActivity doc-comment was outdated -- it said "iceBackoff must NOT be suspended" but didn't mention the rate-limited override added in commit 322adc9 / 1762601. Updated to reference iceBackoff.AllowActivityOverride explicitly so future readers know the override exists and where its rate limit lives.

2. onICEFailed now classifies the failure into one of three buckets in the log line:
   - first-attempt     : never reached P2P, this is the bootstrap try
   - post-success-drop : was P2P, ICE dropped (consent-freshness fail / NAT-mapping recovery race)
   - re-attach         : came out of detached state, retry failed
   Classification is best-effort (pion only tells us "Failed"; we infer from local everConnected + iceListener state). It helps future debugging when the user asks "why did this peer enter hourly retry mode" -- the log now distinguishes "first attempt couldn't pair" from "had P2P, lost it" without requiring full daemon-log correlation.

3. Added a doc-comment to onICEFailed explicitly stating that backoff is exclusively triggered by ConnectionStateFailed and NEVER by inactivity-driven detach or relay-timeout close. Codex review point 1: "backoff should only throttle broken or currently-unstable P2P paths". This pins that intent in code, not just docs.

4. New unit test TestIceBackoff_OnlyMarkFailureMutates documents the invariant: read-only calls (IsSuspended, Snapshot, AllowActivityOverride) must never mutate the backoff state. Prevents accidental side-effects in future refactors of those getters.

Existing tests still green:
  ok  github.com/netbirdio/netbird/client/internal/peer  0.011s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7i (netbirdio#5989) lifecycle testing (w11-test1, 2026-05-05) showed that after a full Idle -> Wake (C->A path), peers with non-LAN ICE candidates (srflx, e.g. Internet/LTE) often stay on Relay indefinitely while same-LAN peers (host candidates) recover instantly.

Root cause: the guard's per-cycle ICE retry budget (maxICERetries=3, then iceRetryInterval=1h) is consumed in the first ~5 s of pair-checking after wake (cold srflx mappings need several attempts to prime). Subsequent real user activity then finds the guard already in hourly mode and waits up to an hour before retrying.

Fix per Codex review 2026-05-05: route real user/transport activity into the guard so it explicitly resets the per-cycle budget and reconnect ticker. The hourly throttle stays in place for genuinely broken NAT paths but no longer blocks the "user keeps pinging" recovery path.

Mechanics:
- Guard gets a buffered peerActivity channel + NotifyPeerActivity() (mirrors SetICEConnDisconnected). reconnectLoopWithRetry treats it like a connection-disconnected event: ticker.Stop, fresh newReconnectTicker, iceState.reset.
- Conn.AttachICEOnRelayActivity (B->A relay-state activity) now calls conn.guard.NotifyPeerActivity after the AttachICE succeeds, complementing the existing iceBackoff override.
- Conn.NotifyGuardActivity wraps the call so the lazy-mgr does not have to peek into Conn internals; lazy-mgr.onPeerActivity invokes it after the post-Idle Open + AttachICE.

Tests:
- TestICERetryState_ResetClearsHourlyAndBudget pins the regression shape: hourly mode + reset must restore the full 3-attempt budget and clear the hourly ticker.
- TestGuard_NotifyPeerActivity_{NonBlockingCoalesce, NilSafe} cover the call contract for high-rate ActivityRecorder bursts and partially-initialised conns.
- TestGuard_PeerActivityResetsHourlyMode exercises the channel delivery + handler-equivalent reset.
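A sketch of the channel contract (capacity-1 buffer, nil-safe, non-blocking), with the loop handling shown as a comment; constructor and handler shape are assumptions:

```go
package guard

// Guard sketch: peerActivity is buffered with capacity 1 so high-rate
// ActivityRecorder bursts coalesce into one pending wake instead of
// blocking the caller.
type Guard struct {
	peerActivity chan struct{}
}

func NewGuard() *Guard {
	return &Guard{peerActivity: make(chan struct{}, 1)}
}

// NotifyPeerActivity is nil-safe and never blocks; a wake that is
// already pending absorbs the new one.
func (g *Guard) NotifyPeerActivity() {
	if g == nil || g.peerActivity == nil {
		return
	}
	select {
	case g.peerActivity <- struct{}{}:
	default:
	}
}

// In reconnectLoopWithRetry the signal is handled like a
// connection-disconnected event (sketch):
//
//	case <-g.peerActivity:
//		ticker.Stop()
//		ticker = newReconnectTicker() // fresh schedule
//		iceState.reset()              // full 3-attempt budget, hourly ticker cleared
```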
Generated proto code regenerated from .proto sources after the upstream/main rebase pulled in commits that altered the proto schema (PacketCapture fields). Hand-merged conflicts on the .proto files were resolved during the rebase; the .pb.go files are now consistent with them.

Phase 3.7i UI-push helpers Status.notifyPeerListChanged and Status.notifyPeerStateChangeListeners were lost in a mid-rebase reorder; re-add them as thin wrappers around the existing notifier.peerListChanged and snapshotRouterPeersLocked + dispatchRouterPeers paths, so the UpdatePeerRemoteMeta and updatePeer*StateLocked callsites compile.

Test mock mockEndpointManager.ActivityRecorder() added: the new EndpointManager interface method (added by the relay-state ICE re-attach fast-path commit) was missing on the test mock, breaking the listener_bind_test build. The mock returns nil — these tests exercise endpoint dispatch only, never the recorder.
Three fixes from Codex review 2026-05-05:

1. client/cmd/testutil_test.go: the nbgrpc.NewServer call updated to the current 13-arg signature (peer_connections.Store and *peer_connections.SnapshotRouter were added in Phase 3.7i and the test wasn't updated). The test now passes a fresh in-memory store and snapshot router so the package builds again.

2. .gitignore: management/netbird-mgmt added next to the existing management/management entry. The 51 MB ELF binary was a build artefact accidentally committed in the relay-state ICE re-attach commit; it is now removed from history via interactive rebase of that commit (see previous force-push) and ignored going forward.

3. ResetIceBackoff: stamp the new (cleared) snapshot into the StatusRecorder so `netbird status -d` and the daemon RPC stop advertising the stale "suspended" / "Failures=N" view after a reset. Previously only AttachICEOnRelayActivity's override path pushed the snapshot; the lazy-mgr-driven hard reset on Idle wake silently left the recorder out of date.
… + codespell fix

- proto regen across client/, shared/management/, shared/signal/, flow/
- protoc-version comment headers re-pinned to upstream-main values
- status_remote_meta_notify_test.go: codespell typo (atLeast -> minCount)
- client/internal/peer/status_debounce_test.go: explicit _ = on the rec.UpdatePeer*State/Meta calls so errcheck (golangci-lint) doesn't flag the discarded error returns. The tests deliberately ignore the return because the recorder is pre-populated with the peer.
- management/server/http/handlers/accounts/accounts_handler.go: cast math.MaxUint32 to uint64 when interpolating it into the error string. On 32-bit goreleaser builds (linux/386 etc.) the untyped constant 4294967295 cannot be passed where the printf %d formatter expects an int (it overflows). uint64 covers all archs cleanly.
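A minimal illustration of the 32-bit constant pitfall and the cast; the handler name and message wording are made up:

```go
package accounts

import (
	"fmt"
	"math"
)

// On linux/386 the untyped constant math.MaxUint32 (4294967295) does
// not fit in int, so passing it straight to a %d verb is a compile
// error ("constant 4294967295 overflows int"). Casting to uint64 is
// portable across all goreleaser targets.
func timeoutRangeError(field string) error {
	return fmt.Errorf("%s must be between 0 and %d", field, uint64(math.MaxUint32))
}
```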
- mgmt/internals/server/boot.go: collapse the PeerConnRouter() lambda to Create(s, peer_connections.NewSnapshotRouter) (unlambda)
- mgmt/server/types/settings_ptr_equal_test.go: rename 'new' to 'updated' to avoid shadowing the predeclared identifier
- client/internal/peer/conn_handover_order_test.go: if/else on rune -> tagged switch (QF1003 staticcheck)
- mgmt/server/http/handlers/accounts/validate_uint32_timeout_test.go: drop the redundant explicit error type from the var declaration (ST1023)
- client/internal/conn_mgr.go: add //nolint:unused on addPeersToLazyConnManager (deliberately retained for the eventual snapshot-import path; documented inline)
- client/internal/peer/status.go: replace 'notify := false; notify = ...' with 'var notify bool' to silence wastedassign
[client+management] Phase 3.7i hardening + activity-trigger fast-path + Codex-review fixes (closes #5989)
Branch:
pr/d-hardening-activity-trigger → base pr/c-phase3.7i-visibility (stacked PR — depends on PR-C landing first). Compare: https://github.com/netbirdio/netbird/compare/MichaelUray:netbird:pr/c-phase3.7i-visibility...MichaelUray:netbird:pr/d-hardening-activity-trigger?expand=1
Summary
This PR completes #5989. It adds the activity-trigger fast-path (relay-state activity from the WG ActivityRecorder re-attaches ICE, so a peer that fell to relay during quiet periods upgrades back to P2P on the next real packet), a rate-limited backoff override (real user activity is the strongest "I want this peer back" signal we have, allowed once per 5 minutes), and the Guard activity-driven retry-budget reset that Codex review identified as the missing piece for srflx-only peers after Idle wake. On top of that comes a substantial round of hardening: legacy fallback (older clients without the new ConnectionMode field cleanly map to p2p-lazy), peer visibility for the user RBAC role, the closed-PR pgx fix, the keepWgPeer routed-subnet fix, and several Codex-review-2/3 corrections.

This is the fourth and final stacked PR implementing #5989.
Why
After PR-C landed the visibility layer, real-world testing on a 4-peer testbed (Windows 11 daemon + 3 OpenWrt routers spread across LAN, Internet, LTE) revealed two specific lifecycle gaps:
1. Relay-state activity didn't trigger ICE re-attach. A peer that was P2P, fell to relay after the ICE-inactivity timer fired, then received fresh user traffic — it stayed on relay. The ActivityRecorder saw the packet, but no callback re-attached ICE. The fast-path here closes that loop: WG bind callback → Conn.AttachICEOnRelayActivity → register a fresh ICE listener with the handshaker.

2. After Idle wake the per-peer guard's 3-tries-then-hourly retry budget got consumed in the first ~5 seconds of pair-checking. For peers with non-LAN candidates (srflx Internet, LTE) cold mappings need several attempts to prime; the guard then declared the peer "exhausted" and escalated to hourly retry. Subsequent real user activity found the guard already in hourly mode and waited up to an hour before retrying.
Codex review identified the second one as a real PR-blocker: the iceBackoff state and the guard's retry counter are two parallel throttles, and only iceBackoff was reset on activity. The fix is Guard.NotifyPeerActivity() — a buffered channel signalled by user/transport activity; the reconnect loop treats it like a connection-disconnected event (ticker stop, fresh ticker, iceState.reset).

What's in this PR
Activity-trigger fast-path (the new feature)
- client/iface/bind/activity: extend ActivityRecorder with an optional OnActivity callback; per-peer PublicKey field on PeerRecord so the dispatch knows which peer fired.
- client/internal/lazyconn: AttachICE on activity wake-up so the lazy-mgr's "fake-IP packet detected" path actually drives a fresh ICE re-attach, not just an Open() call.
- client/lazyconn + peer: ResetIceBackoff on activity-trigger wake. Updates the StatusRecorder snapshot after reset (Codex-review follow-up).
- client/iface + peer + engine: relay-state ICE re-attach fast-path — engine wires ActivityRecorder.OnActivity → Conn.AttachICEOnRelayActivity(). Gates: mode = p2p-dynamic, conn open, current priority = Relay, listener detached, never-connected guard, activity-override rate limit.
- client/peer: rate-limited backoff override on relay-state activity. iceBackoff.AllowActivityOverride() allows ONE override per 5 minutes per peer; this is the rate limit that protects the signal server from amplification while still unsticking peers on a transient pair-check drop.
- client/peer: markSuccess stamps lastResetAt (Codex review caught: previously it didn't, so the rate-limit window never engaged).
- client/peer: clarify backoff-intent comments + classify failure types (first-attempt, post-success-drop, re-attach) for log diagnostics + invariant test.
- client/peer/guard: NotifyPeerActivity() + activity-driven reset of the ICE retry budget — the headline Codex-review fix. New buffered channel; the reconnect loop treats it like SetICEConnDisconnected (ticker stop, fresh newReconnectTicker, iceState.reset). Conn.AttachICEOnRelayActivity and lazyconn.manager.onPeerActivity both call it.

Hardening / Codex review rounds 2+3
- mgmt/types + store: LegacyLazyFallback{Enabled,TimeoutSeconds} settings.
- mgmt/conversion: legacy-client p2p-dynamic → p2p-lazy fallback when the remote daemon's SupportedFeatures doesn't include the new flag.
- proto + client + mgmt: SupportedFeatures capability advertisement on the wire.
- client/internal: latch conn_state_pusher disabled on Unimplemented (older mgmt servers).
- mgmt/http + activity: expose LegacyLazyFallback settings via API.
- client/internal: defensive nil-receiver guard on pusher entry points.
- mgmt: legacy-fallback defaults consistent across all construction paths (test + prod + migration).
- mgmt/peer: restore policy-aware peer visibility for the user role — without this, a non-admin user could not see the new peer-status counters because the visibility check was based on the old, pre-policy-aware code path.
- client/ui: silent auto-refresh on the Networks window when daemon IPC drops.
- client/ui: peer-detail + network-range text selectable + copyable.

Closed-PR-style quick fixes (could have been their own PRs but depend on Phase-3.7i fields)
- mgmt/store: pgx getPeers must SELECT meta_supported_features + meta_effective_* — without this the SQL row-marshal silently dropped the new columns and RemotePeerConfig came across as zero-values.
- client/stdnet: case-insensitive ICE interface filter (Windows P2P fix). (NOTE: this is also #PR-Q2 — a separate standalone PR for early review. If Q2 lands first, this commit can be cherry-pick-dropped here.)
- fix: keep the WG peer entry across lazy-suspend so routed-subnet AllowedIPs survive — without this the route-manager's appended AllowedIPs got discarded on the lazy deactivate path, and the next activity-wake re-opened the conn with the original (subset) AllowedIPs.

Round-2 follow-ups
- codex-review fixes: settings push, RelayServer materiality, uint32 validation, offline debounce.
- codex review round 2: debounce safety, pre-init drain, dashboard cache, docs, tests.
- hardening: explicit cancel hooks + handover-order regression tests.
- hardening: reconnect-guard inactivity-skip + UI ICE-backoff fields.
- codex review: 4 findings — server build, store epoch, meta notify, ICE-backoff display.
- codex follow-up: session_id epoch field for unary-RPC stale-delta safety.
- fix: gate guard skip-offer on everConnected (regression from Item 1).
- fix: WG-handshake-timeout recovery — push peer back to lazy-idle.
- client/lazyconn: IsSupported also accepts the 0.0.0-dev-… semver-padded form (so dev builds with 0.0.0-dev-<sha> versions are recognised as lazy-capable).
- client/engine: nil-guard connStatePusher closures during shutdown.
- client: reconnect-guard p2p-dynamic-aware + proactive close on remote-offline.
- client: hybrid "Relayed (negotiating P2P)" UI label during the wakeup window.
- client/ui (Win): colored status swatch on each peer row.
- client: Android refresh wg-stats on PeersList + bump default relay-timeout to 24h.
- peer-status: live_online from peer.Status.Connected for an accurate counter.

Codex post-rebase fixes (the very latest)
- client/cmd/testutil_test.go: nbgrpc.NewServer call updated to the current 13-arg signature (peer_connections.Store + *peer_connections.SnapshotRouter).
- .gitignore: management/netbird-mgmt ignored. (A 51 MB ELF was accidentally committed in an earlier iteration of the relay-state ICE re-attach commit; removed from this branch via interactive rebase — verified no commit in this branch contains the binary blob.)
- ResetIceBackoff: stamp the cleared snapshot into StatusRecorder so netbird status -d and the daemon RPC stop advertising stale "suspended"/"Failures=N" after reset.

Tests
- go test ./client/internal/peer ./client/internal/peer/guard ./client/internal/lazyconn/... — pass.
- go build ./client/... ./management/... — pass on linux/amd64, linux/arm64, windows/amd64.
- go test -c ./client/cmd — compiles. (Full test run blocked by sandbox /etc/systemd permissions, not a code-level regression.)

Known flake (not blocker)
TestICEBind_HandlesConcurrentMixedTraffic in client/iface/bind is upstream-owned (last touched in #5953) and reproducibly fails under -count=10 or -race on local UDP-loopback when IPv6 delivery dips below the threshold. Verified by git diff upstream/main -- client/iface/bind/ice_bind_test.go returning empty: no test code changes here. Not addressed in this PR; it would be a separate stabilisation PR for the test owner.

Test plan
- upstream/main (no untracked binaries; .gitignore corrected).
- Conn.AttachICEOnRelayActivity (mode + opened + currentConnPriority + listener-nil + iceBackoff override + everConnected — comments document each).
- SupportedFeatures set must cleanly degrade to p2p-lazy (regression-tested in mgmt/grpc).

Use case
The full p2p-dynamic experience: a fleet of intermittently-active peers (mix of LAN, Internet srflx, LTE srflx). They start lazy. On real user traffic they go P2P. On idle they detach ICE then close. On the next packet they re-open and re-establish P2P — always within a couple of seconds, regardless of whether the candidate type is host or srflx-only. ICE-backoff is bounded so a chronically-broken NAT path doesn't spam the signal server, but real user activity always overrides once per 5 minutes.
Linked work
Maintainers are welcome to push directly to this branch.
Summary by CodeRabbit
New Features
Bug Fixes
UI Improvements
Documentation
These changes are internal lifecycle / behavioural improvements; no user-visible API or CLI flag is added that would warrant new public docs. The existing flags/settings already documented at netbirdio/docs cover the surface area.