
[client, management] Phase 3.7i hardening + activity-trigger fast-path + Codex-review fixes — closes #5989 (stack 4/4) #6084

Open
MichaelUray wants to merge 147 commits into netbirdio:main from MichaelUray:pr/d-hardening-activity-trigger

Conversation

@MichaelUray
Contributor

@MichaelUray MichaelUray commented May 5, 2026

[client+management] Phase 3.7i hardening + activity-trigger fast-path + Codex-review fixes (closes #5989)

Branch: pr/d-hardening-activity-trigger → base pr/c-phase3.7i-visibility (stacked PR — depends on PR-C landing first)
Compare: https://github.com/netbirdio/netbird/compare/MichaelUray:netbird:pr/c-phase3.7i-visibility...MichaelUray:netbird:pr/d-hardening-activity-trigger?expand=1


Summary

This PR completes #5989. It adds the activity-trigger fast-path (relay-state activity seen by the WG ActivityRecorder re-attaches ICE, so a peer that fell to relay during a quiet period upgrades back to P2P on the next real packet), a rate-limited backoff override (real user activity is the strongest "I want this peer back" signal we have, so it is allowed to override backoff once per 5 minutes), and the Guard activity-driven retry-budget reset that Codex review identified as the missing piece for srflx-only peers after Idle wake. It also ships a substantial round of hardening: a legacy fallback (older clients without the new ConnectionMode field cleanly map to p2p-lazy), restored peer visibility for the user RBAC role, the closed-PR pgx fix, the keepWgPeer routed-subnet fix, and several Codex-review round-2/3 corrections.

This is the fourth and final stacked PR implementing #5989.

Why

After PR-C landed the visibility layer, real-world testing on a 4-peer testbed (a Windows 11 daemon plus 3 OpenWrt routers spread across LAN, Internet, and LTE) revealed two specific lifecycle gaps:

  1. Relay-state activity didn't trigger ICE re-attach. A peer that was P2P, fell to relay after the ICE-inactivity timer fired, then received fresh user traffic — stayed on relay. The ActivityRecorder saw the packet, but no callback re-attached ICE. The fast-path here closes that loop: WG bind callback → Conn.AttachICEOnRelayActivity → register a fresh ICE listener with the handshaker.

  2. After Idle wake, the per-peer guard's 3-tries-then-hourly retry budget got consumed in the first ~5 seconds of pair-checking. For peers with non-LAN candidates (srflx over Internet or LTE), cold NAT mappings need several attempts to prime; the guard then declared the peer "exhausted" and escalated to hourly retries. Subsequent real user activity found the guard already in hourly mode and could wait up to an hour before retrying.

Codex review identified the second one as a real PR-blocker: the iceBackoff state and the guard's retry counter are two parallel throttles, and only iceBackoff was reset on activity. The fix is Guard.NotifyPeerActivity() — a buffered channel signalled by user/transport activity; the reconnect loop treats a signal like a connection-disconnected event (stop the ticker, start a fresh one, reset iceState).
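The buffered-channel-plus-reset mechanic can be sketched in isolation. This is a hedged illustration, not the actual guard code: the Guard and NotifyPeerActivity names follow the PR text, but the ticker interval, the reset callback, and the loop structure are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Guard is a simplified stand-in for the per-peer reconnect guard.
type Guard struct {
	activity chan struct{} // buffered: at most one pending wake-up
}

func NewGuard() *Guard {
	return &Guard{activity: make(chan struct{}, 1)}
}

// NotifyPeerActivity never blocks the caller: if a wake-up is already
// pending, the new signal is coalesced into it.
func (g *Guard) NotifyPeerActivity() {
	select {
	case g.activity <- struct{}{}:
	default:
	}
}

// reconnectLoop treats an activity signal like a disconnect event:
// stop the current ticker, start a fresh one, reset retry state.
func (g *Guard) reconnectLoop(done <-chan struct{}, onReset func()) {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer func() { ticker.Stop() }()
	for {
		select {
		case <-ticker.C:
			// a periodic ICE retry attempt would go here
		case <-g.activity:
			ticker.Stop()
			ticker = time.NewTicker(50 * time.Millisecond)
			onReset() // stands in for iceState.reset()
		case <-done:
			return
		}
	}
}

func main() {
	g := NewGuard()
	done := make(chan struct{})
	resets := make(chan struct{}, 1)
	go g.reconnectLoop(done, func() { resets <- struct{}{} })
	g.NotifyPeerActivity()
	<-resets
	close(done)
	fmt.Println("retry budget reset on activity")
}
```

The one-slot buffer is what makes the signal safe to fire from hot paths: a burst of packets collapses into a single pending wake-up instead of queuing work.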

What's in this PR

Activity-trigger fast-path (the new feature)

  • client/iface/bind/activity: extend ActivityRecorder with an optional OnActivity callback; per-peer PublicKey field on PeerRecord so the dispatch knows which peer fired.
  • client/internal/lazyconn: AttachICE on activity wake-up so the lazy-mgr's "fake-IP packet detected" path actually drives a fresh ICE re-attach, not just an Open() call.
  • client/lazyconn+peer: ResetIceBackoff on activity-trigger wake. Updates the StatusRecorder snapshot after reset (Codex-review follow-up).
  • client/iface+peer+engine: relay-state ICE re-attach fast-path — the engine wires ActivityRecorder.OnActivity → Conn.AttachICEOnRelayActivity(). Gates: mode = p2p-dynamic, conn open, current priority = Relay, listener detached, never-connected guard, activity-override rate limit.
  • client/peer: rate-limited backoff override on relay-state activity. iceBackoff.AllowActivityOverride() allows ONE override per 5 minutes per peer; this is the rate limit that protects the signal server from amplification while still unsticking peers on a transient pair-check drop.
  • client/peer: markSuccess stamps lastResetAt (Codex review caught: previously it didn't, so the rate-limit window never engaged).
  • client/peer: clarify backoff-intent comments + classify failure types (first-attempt, post-success-drop, re-attach) for log diagnostics + invariant test.
  • client/peer/guard: NotifyPeerActivity() + activity-driven reset of ICE retry budget — the headline Codex-review fix. New buffered channel; the reconnect loop treats it like SetICEConnDisconnected (ticker stop, fresh newReconnectTicker, iceState.reset). Conn.AttachICEOnRelayActivity and lazyconn.manager.onPeerActivity both call it.

Hardening / Codex review rounds 2+3

  • mgmt/types+store: LegacyLazyFallback{Enabled,TimeoutSeconds} settings.
  • mgmt/conversion: legacy-client p2p-dynamic → p2p-lazy fallback when the remote daemon's SupportedFeatures doesn't include the new flag.
  • proto+client+mgmt: SupportedFeatures capability advertisement on the wire.
  • client/internal: latch conn_state_pusher disabled on Unimplemented (older mgmt servers).
  • mgmt/http+activity: expose LegacyLazyFallback settings via API.
  • client/internal: defensive nil-receiver guard on pusher entry points.
  • mgmt: legacy-fallback defaults consistent across all construction paths (test+prod+migration).
  • mgmt/peer: restore policy-aware peer visibility for user role — without this, a non-admin user could not see the new peer-status counters because the visibility check was based on the old, pre-policy-aware code path.
  • client/ui: silent auto-refresh on Networks window when daemon IPC drops.
  • client/ui: peer-detail + network-range text selectable + copyable.
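The legacy-fallback rule from the list above reduces to one capability check. The sketch below is illustrative only: the featureConnectionMode flag value and the effectiveMode helper are hypothetical names, not the actual mgmt/conversion code.

```go
package main

import "fmt"

// featureConnectionMode is a placeholder for the capability flag a
// new-enough daemon advertises in SupportedFeatures.
const featureConnectionMode = "connection-mode"

// effectiveMode degrades p2p-dynamic to p2p-lazy for clients that do
// not advertise the new ConnectionMode capability; all other modes
// pass through unchanged.
func effectiveMode(configured string, supportedFeatures []string) string {
	if configured != "p2p-dynamic" {
		return configured
	}
	for _, f := range supportedFeatures {
		if f == featureConnectionMode {
			return configured // peer understands the new field
		}
	}
	return "p2p-lazy" // legacy client: clean downgrade
}

func main() {
	fmt.Println(effectiveMode("p2p-dynamic", nil))                             // legacy: p2p-lazy
	fmt.Println(effectiveMode("p2p-dynamic", []string{featureConnectionMode})) // capable: p2p-dynamic
}
```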

Closed-PR-style quick fixes (could have been their own PRs but depend on Phase-3.7i fields)

  • mgmt/store: pgx getPeers must SELECT meta_supported_features + meta_effective_* — without this the SQL row-marshal silently dropped the new columns and RemotePeerConfig came across as zero-values.
  • client/stdnet: case-insensitive ICE interface filter (Windows P2P fix). (NOTE: this is also #PR-Q2 — separate standalone PR for early review. If Q2 lands first, this commit can be cherry-pick-dropped here.)
  • fix: keep WG peer entry across lazy-suspend so routed-subnet AllowedIPs survive — without this the route-manager's appended AllowedIPs got discarded on the lazy deactivate path, and the next activity-wake re-opened the conn with the original (subset) AllowedIPs.

Round-2 follow-ups

  • codex-review fixes: settings push, RelayServer materiality, uint32 validation, offline debounce.
  • codex review round 2: debounce safety, pre-init drain, dashboard cache, docs, tests.
  • hardening: explicit cancel hooks + handover-order regression tests.
  • hardening: reconnect-guard inactivity-skip + UI ICE-backoff fields.
  • codex review: 4 findings — server build, store epoch, meta notify, ICE-backoff display.
  • codex follow-up: session_id epoch field for unary-RPC stale-delta safety.
  • fix: gate guard skip-offer on everConnected (regression from Item 1).
  • fix: WG-handshake-timeout recovery — push peer back to lazy-idle.
  • client/lazyconn: IsSupported also accepts 0.0.0-dev-… semver-padded form (so dev builds with 0.0.0-dev-<sha> versions are recognised as lazy-capable).
  • client/engine: nil-guard connStatePusher closures during shutdown.
  • client: reconnect-guard p2p-dynamic-aware + proactive close on remote-offline.
  • client: hybrid "Relayed (negotiating P2P)" UI label during wakeup window.
  • client/ui (Win): colored status swatch on each peer row.
  • client: Android refresh wg-stats on PeersList + bump default relay-timeout to 24h.
  • peer-status: live_online from peer.Status.Connected for accurate counter.
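One bullet above notes that IsSupported now also accepts the semver-padded 0.0.0-dev form. A hedged sketch of that shape of check, with an assumed prefix and an assumed minimum-release placeholder (the real comparison would use proper semver parsing):

```go
package main

import (
	"fmt"
	"strings"
)

// isLazySupported is an illustrative capability check: dev builds carry
// a "0.0.0-dev-" padded version and are assumed to include the latest
// client code; released versions are compared against a minimum.
func isLazySupported(version string) bool {
	if strings.HasPrefix(version, "0.0.0-dev-") {
		return true // dev build: treat as lazy-capable
	}
	// Placeholder for a real semver comparison against the minimum
	// lazy-capable release; the threshold here is an assumption.
	return version >= "0.30.0"
}

func main() {
	fmt.Println(isLazySupported("0.0.0-dev-abc123")) // dev build accepted
	fmt.Println(isLazySupported("0.29.0"))           // below assumed minimum
}
```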

Codex post-rebase fixes (the very latest)

  • client/cmd/testutil_test.go: nbgrpc.NewServer call updated to current 13-arg signature (peer_connections.Store + *peer_connections.SnapshotRouter).
  • .gitignore: management/netbird-mgmt ignored. (A 51 MB ELF was accidentally committed in an earlier iteration of the relay-state ICE re-attach commit; removed from this branch via interactive rebase — verified no commit in this branch contains the binary blob.)
  • ResetIceBackoff: stamp the cleared snapshot into StatusRecorder so netbird status -d and the daemon RPC stop advertising stale "suspended"/"Failures=N" after reset.

Tests

  • go test ./client/internal/peer ./client/internal/peer/guard ./client/internal/lazyconn/... — pass.
  • go build ./client/... ./management/... — pass on linux/amd64, linux/arm64, windows/amd64.
  • go test -c ./client/cmd — compiles. (Full test run blocked by sandbox /etc/systemd permissions, not a code-level regression.)
  • Hardware-validated lifecycle test on 4-peer testbed:
    • All 5 transitions covered: Idle → P2P (cold start, 30 s), P2P → Relayed (4 min idle), Relayed → P2P (B→A fast-path, 5–10 s), Relayed → Idle (9 min idle, full close), Idle → P2P (C→A wake, 3 s for srflx-only peers).
    • Pre-Phase-3.7i activity-trigger fix: srflx-only peers (Internet, LTE) stuck on relay after Idle wake for >30 minutes due to guard hourly-mode.
    • With this PR: all 3 BM peers (LAN host, Internet srflx, LTE srflx) reach P2P within 30 s after Idle wake; ICE-backoff state shows correctly cleared after reset.

Known flake (not blocker)

TestICEBind_HandlesConcurrentMixedTraffic in client/iface/bind is upstream-owned (last touched in #5953) and reproducibly fails under -count=10 or -race on local UDP-loopback when IPv6 delivery dips below the threshold. Verified by git diff upstream/main -- client/iface/bind/ice_bind_test.go returning empty: no test code changes here. Not addressed in this PR; would be a separate stabilisation PR for the test owner.

Test plan

  • All Phase-3.7i unit tests (peer, guard, lazyconn/*).
  • iceBackoff full suite incl. activity-override + grace + markSuccess-stamps-lastResetAt + Reset-clears-hourly invariants.
  • Cherry-pick clean against upstream/main (no untracked binaries; .gitignore corrected).
  • Maintainer review of the activity-trigger gate ordering in Conn.AttachICEOnRelayActivity (mode + opened + currentConnPriority + listener-nil + iceBackoff override + everConnected — comments document each).
  • Maintainer review of the legacy-fallback path: a daemon without SupportedFeatures set must cleanly degrade to p2p-lazy (regression-tested in mgmt/grpc).

Use case

The full p2p-dynamic experience: a fleet of intermittently-active peers (mix of LAN, Internet srflx, LTE srflx). They start lazy. On real user traffic they go P2P. On idle they detach ICE then close. On the next packet they re-open and re-establish P2P — always within a couple of seconds, regardless of whether the candidate type is host or srflx-only. ICE-backoff is bounded so a chronically-broken NAT path doesn't spam the signal server, but real user activity always overrides once per 5 minutes.
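The lifecycle described above can be modeled as an explicit transition table, matching the five transitions listed in the hardware test results. State names mirror the PR prose; the table itself is an illustration, not the client's actual state machine.

```go
package main

import "fmt"

type state string

const (
	idle    state = "Idle"
	p2p     state = "P2P"
	relayed state = "Relayed"
)

// transitions encodes the p2p-dynamic lifecycle from the PR text:
// Idle wakes straight to P2P; P2P falls to relay on ICE inactivity;
// a relayed peer either upgrades back to P2P on activity or fully
// closes to Idle after the relay-inactivity timeout.
var transitions = map[state][]state{
	idle:    {p2p},
	p2p:     {relayed},
	relayed: {p2p, idle},
}

func canTransition(from, to state) bool {
	for _, s := range transitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(relayed, p2p))  // the activity fast-path
	fmt.Println(canTransition(idle, relayed)) // not listed: idle wakes straight to P2P
}
```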

Linked work

Maintainers are welcome to push directly to this branch.

Summary by CodeRabbit

  • New Features

    • Connection mode chooser (P2P, Relay‑Forced, P2P‑Lazy, P2P‑Dynamic) with editable relay/P2P timeouts and retry limits, plus server‑pushed defaults and profile/CLI overrides.
    • On‑demand per‑peer connection snapshots and live connection‑map pushes.
    • ICE backoff, richer per‑peer telemetry, and new aggregate peer counters surfaced in status.
  • Bug Fixes

    • Restored policy‑based peer visibility for non‑admin users.
  • UI Improvements

    • New Peers tab with detailed per‑peer views and controls to view/set connection mode and timeouts.

Documentation

  • Documentation is not needed

These changes are internal lifecycle / behavioural improvements; no user-visible API or CLI flag added that warrants new public docs. Existing flags/Settings already documented at netbirdio/docs cover the surface area.

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Adds a connection-mode model with client/server timeouts and server-pushed overrides; implements per-peer ICE backoff, two-timer inactivity (ICE vs relay), a conn-state pusher with management snapshot plumbing, peer-connections store/router and HTTP endpoints, UI/Android exposure, many tests, and small infra utilities.

Changes

Connection Mode, Timeouts, and Conn-State Push (single DAG)

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data Shape | shared/connectionmode/mode.go, shared/management/proto/management.proto, client/proto/daemon.proto, management/server/types/settings.go, client/internal/profilemanager/config.go, shared/management/http/api/types.gen.go | Introduce Mode enum and many new fields for connection mode, relay/P2P timeouts, and P2P retry max across proto, account Settings, profile ConfigInput, and API schemas. |
| Config/Parsing/Wiring | client/internal/connect.go, client/cmd/root.go, client/cmd/service.go, client/cmd/service_installer.go, client/cmd/up.go, client/internal/profilemanager/config.go | Add CLI flags, parse connection-mode/timeouts from env/config/flags, persist explicit flags into profile, and propagate into SetConfigRequest/LoginRequest/ConfigInput. |
| Runtime Resolution & Accessors | client/internal/conn_mgr.go, client/internal/conn_mgr_test.go, client/internal/connect.go | ConnMgr resolves effective mode/timeouts from env/cfg/server, stores server-pushed state, exposes accessors (Mode/RelayTimeout/P2pTimeout/P2pRetryMax and ServerPushed*). |
| Per-peer Behavior & Backoff | client/internal/peer/ice_backoff.go, client/internal/peer/conn.go, client/internal/peer/worker_ice.go, client/internal/peer/status.go, client/internal/peer/*_test.go | Add per-peer ICE backoff state and logic, extend Conn with Mode/backoff/keepWgPeer/allowed-IP accessor and new callbacks, integrate WorkerICE backoff callbacks, and surface backoff + remote meta into status/FullStatus. |
| Inactivity / Debounce | client/internal/lazyconn/inactivity/manager.go, client/internal/lazyconn/manager/manager.go, client/internal/engine.go, client/internal/engine_offline_debounce_test.go | Replace single inactivity timer with two independent timers (ICE vs Relay), expose separate channels, adapt constructors and engine integration, and add per-peer offline debounce timers. |
| Conn-State Pusher & Engine Adapters | client/internal/conn_state_pusher.go, client/internal/engine_pusher_adapters.go, client/internal/conn_state_pusher_*test.go | New connStatePusher producing delta/full PeerConnectionMap pushes with material-change detection, Unimplemented handling, and engine adapters converting engine status into push events. |
| Management Server Integration | management/internals/shared/grpc/conversion.go, management/internals/shared/grpc/server.go, management/internals/controllers/network_map/controller/controller.go | ToSyncResponse/conversion updated to accept groupNamesByPeerID context; per-peer effective/configured fields added; server accepts SyncPeerConnections RPC and wires SnapshotRouter and peer-connections store into sync/update paths. |
| Peer Connections Store & Router | management/server/peer_connections/store.go, management/server/peer_connections/snapshot_router.go, management/server/peer_connections/*_test.go | In-memory MemoryStore with TTL, delta/full merge and nonce semantics; SnapshotRouter for per-peer on-demand snapshot routing; comprehensive tests with fake clock. |
| HTTP Handler & REST | management/server/http/handlers/peer_connections/handler.go, management/server/http/handler.go, management/server/http/handlers/peer_connections/handler_test.go | New HTTP endpoints /peers/{peerId}/connections and /peers/{peerId}/connections/refresh, adapter to account manager, snapshot trigger and enriched JSON responses. |
| Client & UI Exposure | client/android/*.go, client/server/server.go, client/ui/*.go, client/system/features.go, client/android/preferences.go | Android PeerInfo and getters extended for server-pushed fields and status counters; GetConfig surfaces server-pushed fields; desktop UI adds Connection Mode widgets, per-mode timeout overrides, new Peers tab rendering enriched FullStatus; SupportedFeatures API added. |
| Infra & Helpers | client/internal/debouncer/debouncer.go, client/iface/bind/activity.go, client/iface/device/endpoint_manager.go, client/internal/peer/handshaker.go, client/internal/peer/env.go | Add Debouncer utility, ActivityRecorder onActivity callback, EndpointManager ActivityRecorder accessor, handshaker ICE listener add/remove/read, ResolveModeFromEnv with deprecation warnings. |
| Tests, Docs & Misc | many under client/internal/*, management/server/*, shared/connectionmode/*, client/android/*, client/ui/*, docs/*, .gitignore | Extensive new and updated unit/integration tests across engine, conn_mgr, pusher, store, router, peer guard, ice backoff, handover order, HTTP handlers, UI expectations; docs added; .gitignore updated. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant UI as Desktop UI
    participant Client as Client Daemon
    participant ConnMgr as ConnMgr
    participant Engine as Engine
    participant Mgmt as Management Server

    User->>UI: Select connection mode & timeouts
    UI->>Client: Build SetConfigRequest (mode, timeouts)
    Client->>ConnMgr: persist configured overrides
    ConnMgr->>ConnMgr: resolve effective mode/timeouts (env↦cfg↦server)
    ConnMgr->>Engine: notify mode/timeout change
    Engine->>Engine: resetPeersToLazyIdle / reconfigure peers
    Engine->>Mgmt: SetEffectiveConnConfig (debounced)
    Engine->>Mgmt: SyncPeerConnections (delta/full snapshot)
    Mgmt->>Mgmt: store snapshot (MemoryStore) / route snapshot (SnapshotRouter)
    User->>UI: Open Peers tab / request FullStatus
    UI->>Client: GetFullStatus
    Client->>Engine: request status
    Engine->>Client: FullStatus (includes ICE backoff, groups, counters)
    Client->>UI: Render peers tab (connection_type_extended, backoff, lastSeen, bytes)
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Possibly related PRs

  • netbirdio/netbird#5530: related — adds peer_connections MemoryStore and SnapshotRouter and updates nbgrpc.NewServer wiring, overlapping server-side snapshot/store plumbing.
  • netbirdio/netbird#5828: related — overlapping ICE/retry and peer connection guard/backoff behavior changes and tests.
  • netbirdio/netbird#4807: related — modifies network_map controller and ToSyncResponse/ToPeerConfig wiring similar to changes here.

Suggested reviewers

  • pascal-fischer
  • pappz

"I hop with joy, a config carrot in paw,
Modes and timeouts stitched into the law,
ICE naps and wakes, backoff counts to keep,
Snapshots hop through fields while servers sleep,
Peers bloom in lists — the rabbit claps—hip hooray! 🐇"


@MichaelUray
Contributor Author

Stack: #6081 → #6082 → #6083 → #6084 (this PR). Depends on #6083 landing first.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
client/internal/lazyconn/inactivity/manager_test.go (1)

35-37: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore newTicker after each test.

Every test overwrites the package-global ticker factory and leaves the fake in place. That makes later tests order-dependent and can hang/flap if they inherit the previous fake channel.

Suggested fix
```diff
 	fakeTick := make(chan time.Time, 1)
+	prevTicker := newTicker
 	newTicker = func(d time.Duration) Ticker {
 		return &fakeTickerMock{CChan: fakeTick}
 	}
+	t.Cleanup(func() {
+		newTicker = prevTicker
+	})
```

Also applies to: 76-78, 144-146, 184-186, 227-229, 267-269, 306-308, 347-349

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/inactivity/manager_test.go` around lines 35 - 37,
The tests overwrite the package-global ticker factory newTicker and never
restore it, causing test order dependency and flakes; for each test that sets
newTicker (e.g., where fakeTickerMock and fakeTick are used) capture the
original factory into a local variable before replacing it and defer restoring
it (e.g., originalNewTicker := newTicker; defer func(){ newTicker =
originalNewTicker }()) so the global is reset after the test; apply this to
every test that assigns newTicker (lines around the other occurrences you noted)
to ensure isolation.
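The save-and-restore discipline the finding asks for can be demonstrated outside the test framework. The global name newTicker mirrors the review; everything else below is illustrative, and the deferred restore gives the same guarantee as t.Cleanup (the original is restored even if the body panics).

```go
package main

import "fmt"

// newTicker stands in for the package-global factory the tests overwrite.
var newTicker = func() string { return "real ticker" }

// withFakeTicker swaps in a fake and guarantees the original factory is
// restored when the body returns, mirroring the t.Cleanup pattern from
// the suggested fix.
func withFakeTicker(body func()) {
	prev := newTicker
	newTicker = func() string { return "fake ticker" }
	defer func() { newTicker = prev }()
	body()
}

func main() {
	withFakeTicker(func() { fmt.Println(newTicker()) }) // fake active inside the scope
	fmt.Println(newTicker())                            // global restored afterwards
}
```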
docs/bugs/2026-05-04-user-peer-visibility-regression.md (1)

1-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Redact PII / private deployment details before committing this doc to the public repo.

This file contains identifiable, non-anonymized information that should not land in a public OSS repository:

  • Reporter's real name: "Michael Uray" (line 3) and a verbatim user quote referencing "Georg".
  • Real end-user identifier: georg.stoisser-gigacher (line 33).
  • Real device hostname: ctb50-d (line 33).
  • Real customer-location group names: Lunz.am.See.FWR-access, Lunz.am.See.FWR.Rx (lines 36–37) — these doxx a specific customer site.
  • Real private deployment URL with non-default port: https://netbird.uplink.plant-control.net:44106 (line 69).

None of these identifiers are required for the technical content (regression cause, restored helpers, fix plan, performance trade-offs). Replace each with neutral placeholders so the doc still conveys the engineering context without leaking customer data. Also note that once committed, prior commits will retain this content in git history — if this branch hasn't been merged yet, please amend/force-push the redaction so the data never reaches main.

🛡️ Suggested redaction pattern
```diff
-**Reported:** 2026-05-04 by Michael Uray ("Georg sees only his own peer in the dashboard, not the Gegenstellen — that's not what we want, before each user saw their own peers PLUS their counterparts").
+**Reported:** 2026-05-04 by an internal operator: a `user`-role end user reported seeing only their own peer in the dashboard, no longer the routing peers their access policies allowed them to reach.
@@
-`georg.stoisser-gigacher` has 1 own peer (`ctb50-d`) and 17 auto_groups,
-including 16 `*-access` groups that source policies into `*-Rx` /
-`*-NW` destination groups (e.g. `Lunz.am.See.FWR-access` ->
-`Lunz.am.See.FWR.Rx`). With the regressed code, the dashboard
-shows only `ctb50-d`. Operationally useless -- the user wants to
-see the routing peers their device can reach.
+A `user`-role account had 1 own peer and ~17 auto-groups, most of
+which were `<site>-access` groups that source ACL policies into
+`<site>-Rx` / `<site>-NW` destination groups. With the regressed
+code, the dashboard showed only the user's own peer, hiding every
+counterpart their policies allowed them to reach.
@@
-  curl -sk -H "Authorization: Token <Georg's token>" \
-       https://netbird.uplink.plant-control.net:44106/api/peers \
+  curl -sk -H "Authorization: Token <user-token>" \
+       https://<management-host>/api/peers \
        | jq '. | length'
```

As per coding guidelines, compliance/privacy risks such as PII retention and exposure of user identifiers are treated as major issues.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/bugs/2026-05-04-user-peer-visibility-regression.md` around lines 1 - 99,
This doc contains PII and customer-specific deployment details (reporter name
"Michael Uray", user "georg.stoisser-gigacher", device "ctb50-d", group names
like "Lunz.am.See.FWR-access"/"Lunz.am.See.FWR.Rx", and the URL
https://netbird.uplink.plant-control.net:44106); redact each by replacing real
names/hosts/groups/URLs with neutral placeholders (e.g., <REPORTER>, <USER>,
<DEVICE>, <GROUP_X>, <PRIVATE_DEPLOYMENT_URL>) in
docs/bugs/2026-05-04-user-peer-visibility-regression.md, update any inline
quotes to anonymized text, then amend the local commit (git commit --amend) and
force-push the branch (git push --force) so the branch tip no longer contains
the sensitive strings; if the branch was already merged, open a follow-up to
purge sensitive data from history and coordinate with security/Git admins.
management/server/peer.go (1)

1297-1355: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep GetPeer aligned with the list-endpoint visibility rules.

This path never applies RegularUsersViewBlocked, and the fallback returns status.Internal on an access miss. That lets a restrictable user who is blocked from GetPeers still fetch a known peer ID here, and unauthorized misses can bubble out as 500s instead of a deny/not-found.

Suggested direction
```diff
 func (am *DefaultAccountManager) GetPeer(ctx context.Context, accountID, peerID, userID string) (*nbpeer.Peer, error) {
 	peer, err := am.Store.GetPeerByID(ctx, store.LockingStrengthNone, accountID, peerID)
 	if err != nil {
 		return nil, err
 	}
@@
 	user, err := am.Store.GetUserByUserID(ctx, store.LockingStrengthNone, userID)
 	if err != nil {
 		return nil, err
 	}
+
+	settings, err := am.Store.GetAccountSettings(ctx, store.LockingStrengthNone, accountID)
+	if err != nil {
+		return nil, fmt.Errorf("failed to get account settings: %w", err)
+	}
+	if user.IsRestrictable() && settings.RegularUsersViewBlocked {
+		return nil, status.NewPermissionDeniedError()
+	}
@@
-	return nil, status.Errorf(status.Internal, "user %s has no access to peer %s under account %s", userID, peer.ID, accountID)
+	return nil, status.NewPermissionDeniedError()
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` around lines 1297 - 1355, GetPeer currently skips
the list-endpoint visibility rule (RegularUsersViewBlocked) and returns
status.Internal on access misses; update GetPeer (and/or checkIfUserOwnsPeer) to
honor the RegularUsersViewBlocked rule from permissionsManager (use the same
permissions check flow you use for list visibility) before falling back to
ownership checks, and if the user is blocked by RegularUsersViewBlocked return a
NotFound/deny style error instead of status.Internal (replace the final
status.Internal error in checkIfUserOwnsPeer with a NotFound/permission-denied
response consistent with list behavior). Ensure you reference
permissionsManager.ValidateUserPermissions / RegularUsersViewBlocked and adjust
the error returned from checkIfUserOwnsPeer accordingly.
client/internal/peer/status.go (1)

682-697: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Critical: notifyPeerStateChangeListeners and notifyConnStateChange are invoked after d.mux.Unlock(), violating their documented lock contract.

Both helpers explicitly require the caller to hold d.mux:

  • notifyPeerStateChangeListeners (Line 328) calls snapshotRouterPeersLocked, which does for pid := range d.changeNotify and reads d.peers. If SubscribeToPeerStateChanges (writes d.changeNotify at Line 1006‑1008) or AddPeer/RemovePeer runs concurrently, Go will panic with fatal error: concurrent map iteration and map write.
  • notifyConnStateChange (Line 300) reads d.connStateListener, which SetConnStateListener writes under the lock — a data race per the Go memory model.

The same pattern is repeated in every new update*Locked helper:

| Call site | Helper invoked after unlock |
| --- | --- |
| Line 469 (updatePeerStateLocked) | notifyConnStateChange |
| Lines 691, 695 (updatePeerICEStateLocked) | notifyPeerStateChangeListeners, notifyConnStateChange |
| Lines 759, 763 (updatePeerRelayedStateLocked) | notifyPeerStateChangeListeners, notifyConnStateChange |
| Line 811 (updatePeerRelayedStateToDisconnectedLocked) | notifyConnStateChange |
| Line 862 (updatePeerICEStateToDisconnectedLocked) | notifyConnStateChange |

Two viable fixes:

  1. Capture under the lock, deliver after — build the closure (and the router-state snapshot) before d.mux.Unlock(), then invoke it later. This matches the design intent of the returned notifyFn.
  2. Make the helpers self-locking — drop the "caller must hold d.mux" precondition and have them acquire d.mux (or use atomic.Pointer for connStateListener).

Sketch of fix #1 for updatePeerICEStateLocked:

🔒 Proposed fix: build notifications under the lock
```diff
 	notifyList := hasConnStatusChanged(oldStatus, receivedState.ConnStatus)
 	notifyRouter := hasStatusOrRelayedChange(oldStatus, receivedState.ConnStatus, oldSnapshot.Relayed, receivedState.Relayed)
 	routerSnapshot := d.snapshotRouterPeersLocked(receivedState.PubKey, notifyRouter)
 	numPeers := d.numOfPeers()
 	materialICE := hasMaterialICEChange(oldSnapshot, peerState)
+	var stateChangeSnapshot map[string]RouterState
+	if materialICE {
+		stateChangeSnapshot = d.snapshotRouterPeersLocked(receivedState.PubKey, true)
+	}
+	connStateNotify := func() {}
+	if notifyRouter {
+		connStateNotify = d.notifyConnStateChange(receivedState.PubKey, peerState)
+	}
 
 	d.mux.Unlock()
 
 	if notifyList {
 		d.notifier.peerListChanged(numPeers)
 	}
 	if notifyRouter {
 		d.dispatchRouterPeers(receivedState.PubKey, routerSnapshot)
 	}
-	if materialICE {
-		d.notifyPeerStateChangeListeners(receivedState.PubKey)
-	}
-
-	if hasStatusOrRelayedChange(oldStatus, receivedState.ConnStatus, oldSnapshot.Relayed, receivedState.Relayed) {
-		return d.notifyConnStateChange(receivedState.PubKey, peerState), nil
-	}
-	return func() {}, nil
+	if stateChangeSnapshot != nil {
+		go d.dispatchRouterPeers(receivedState.PubKey, stateChangeSnapshot)
+	}
+	return connStateNotify, nil
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 682 - 697, The problem is that
notifyPeerStateChangeListeners and notifyConnStateChange are called after
d.mux.Unlock(), violating their "caller must hold d.mux" contract and risking
concurrent map iteration/write/data races; fix by capturing everything needed
for notification while holding d.mux and returning a closure to invoke after
unlocking. Concretely, in updatePeerStateLocked, updatePeerICEStateLocked,
updatePeerRelayedStateLocked, updatePeerRelayedStateToDisconnectedLocked, and
updatePeerICEStateToDisconnectedLocked: before calling d.mux.Unlock(), build the
notify closure(s) and any router snapshot or listener references (e.g., result
of snapshotRouterPeersLocked and the connStateListener) into local
variables/closures, then unlock and invoke those closures (or return them via
the existing notifyFn pattern); alternatively, if you prefer, make
notifyPeerStateChangeListeners and notifyConnStateChange acquire d.mux
internally, but do not call them without holding the lock. Ensure no accesses to
d.changeNotify, d.peers, or d.connStateListener happen after unlocking.
🟠 Major comments (23)
client/internal/stdnet/filter_test.go-51-66 (1)

51-66: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Asymmetric assertion silently skips most of the table — tighten the checks.

The loop only fails in two situations:

  1. !c.want && got — a "should be filtered" case that wasn't filtered.
  2. c.want && !got && GOOS == "windows" && name == "vEthernet (LAN)" — one very specific Windows case.

Every other c.want == true row ("Ethernet USB", "OpenVPN 1", "WiFi", "vEthernet (External)", and "vEthernet (LAN)" on non-Windows runners) silently passes even if InterfaceFilter incorrectly drops them. That's most of the table.

Compounding this, the comment at lines 57–59 is now stale: since filter.go lowercases both sides, on Linux "vEthernet (LAN)" → "vethernet (lan)" does match the "veth" prefix and is filtered. The test only "passes" on Linux CI because the assertion is gated on Windows. So this test as written cannot regress on the very behavior its name promises to lock down.

Two options:

  • Make c.want == true strict on the platform where it's meaningful (Windows for the vEthernet cases; all platforms for "Ethernet USB", "OpenVPN 1", "WiFi").
  • Or split into per-GOOS subtests with explicit t.Skip so each case has clear, symmetric assertEqual semantics.
🛠️ Suggested tightening
 	for _, c := range cases {
-		// The wgctrl branch can override on hosts where NetBird is
-		// running; tests run on a host where these names are not
-		// real interfaces, so the final return faithfully reflects
-		// the disallow-list logic.
-		got := allow(c.name)
-		// "veth*" prefix only filters on non-Windows; on Linux test
-		// runners "vEthernet (LAN)" still passes because of mixed
-		// case + the !Windows branch keeping the prefix match.
-		if !c.want && got {
-			t.Errorf("InterfaceFilter(%q) = true, want false (should be filtered)", c.name)
-		}
-		if c.want && !got && runtime.GOOS == "windows" && c.name == "vEthernet (LAN)" {
-			t.Fatalf("InterfaceFilter(%q) = false, want true on Windows (this is uray-mic-d4's default-route interface)", c.name)
-		}
+		got := allow(c.name)
+
+		// vEthernet (LAN) / vEthernet (External) only stay allowed on
+		// Windows; on non-Windows the case-insensitive "veth" prefix
+		// in the disallow list legitimately filters them.
+		isWindowsOnlyAllow := strings.HasPrefix(strings.ToLower(c.name), "vethernet")
+		if c.want && isWindowsOnlyAllow && runtime.GOOS != "windows" {
+			continue
+		}
+
+		if got != c.want {
+			t.Errorf("InterfaceFilter(%q) = %v, want %v (GOOS=%s)", c.name, got, c.want, runtime.GOOS)
+		}
 	}

(Add "strings" to the imports.)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter_test.go` around lines 51 - 66, The table-driven
test's loop around allow(...) is asymmetric and lets many "want==true" rows
silently pass; change the loop in Test (the for _, c := range cases block that
calls allow) to assert equality for every case (if got != c.want { t.Fatalf/
t.Errorf(... ) }) and handle platform-specific expectations explicitly: either
split into per-GOOS subtests (use t.Run with runtime.GOOS checks and t.Skip
where not applicable) or keep a single loop but special-case only the
Windows-specific vEthernet assertion (use runtime.GOOS == "windows" to adjust
expected value for "vEthernet (LAN)"). Ensure you reference the allow(...)
call/InterfaceFilter semantics when updating the messages so failures clearly
show c.name, c.want and got.
client/internal/peer/worker_ice.go-218-230 (1)

218-230: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize lastKnownState with IsConnected.

IsConnected() reads lastKnownState under muxAgent, but onConnectionStateChange mutates the same field without that synchronization. That is a real race, and it can make the new network-change fast path preserve or tear down ICE based on stale state. Guard both reads and writes with muxAgent or switch the field to an atomic.

Also applies to: 545-570
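A minimal sketch of the atomic-field variant suggested above. The names (`worker`, `ConnectionState`, the two states) are placeholders standing in for the real `WorkerICE` fields and `ice.ConnectionState`; the point is only that every reader and writer goes through the same atomic, so `IsConnected()` can never observe a write racing past the muxAgent lock.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ConnectionState stands in for ice.ConnectionState (hypothetical here).
type ConnectionState int32

const (
	StateDisconnected ConnectionState = iota
	StateConnected
)

// worker sketches the atomic-value approach: lastKnownState is only ever
// touched via Store/Load, so no caller needs to hold muxAgent for it.
type worker struct {
	lastKnownState atomic.Int32
}

func (w *worker) onConnectionStateChange(s ConnectionState) {
	// writer: atomic store instead of a bare field assignment
	w.lastKnownState.Store(int32(s))
}

func (w *worker) IsConnected() bool {
	// reader: atomic load, consistent with every writer
	return ConnectionState(w.lastKnownState.Load()) == StateConnected
}

func main() {
	w := &worker{}
	w.onConnectionStateChange(StateConnected)
	fmt.Println(w.IsConnected())
}
```

The mutex variant works equally well; the atomic is just the smaller diff when only this one field is racy.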

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 218 - 230, The IsConnected
race: guard reads and writes of lastKnownState consistently—either acquire
muxAgent around the mutation in onConnectionStateChange (and any other places
noted around the 545-570 range) or convert lastKnownState to an atomic value and
update/read it atomically; specifically, update the onConnectionStateChange
method and any other writers to use muxAgent.Lock()/Unlock() (or atomic store)
so IsConnected()'s read under muxAgent is synchronized with those writers.
client/internal/conn_state_pusher.go-293-298 (1)

293-298: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the push RPCs cancellable.

Both push paths use context.Background(). If the management RPC stalls, this goroutine blocks indefinitely and Stop() never returns. Please use a pusher-owned context and/or a bounded timeout so shutdown can interrupt an in-flight push.

Also applies to: 363-369

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 293 - 298, Replace the use
of context.Background() when calling p.sink.Push so the RPCs are cancellable:
derive a context from a pusher-owned context (e.g., p.ctx or p.stopCtx) and/or
wrap with a timeout via context.WithTimeout, pass that derived context to
p.sink.Push (both in the PeerConnectionMap push and the other push path using
p.sessionID), and ensure Stop() cancels the pusher-owned context (or calls the
cancel func) so in-flight pushes are interrupted and Stop() can return.
client/internal/conn_state_pusher.go-382-396 (1)

382-396: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Peer removals never get propagated.

computeDeltaFromSource() only emits peers that still exist in the current snapshot. When a peer disappears, its old entry remains in lastPushed and there is no tombstone or full-snapshot-on-shrink path, so management will keep stale connection state until an explicit full snapshot is requested. This needs delete events or a forced full snapshot whenever the key set shrinks.
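The set-difference step this finding asks for can be sketched as follows; `peerState` and the map shapes are simplified stand-ins for the pusher's real per-peer wire state, not the actual types.

```go
package main

import "fmt"

// peerState stands in for the pusher's per-peer wire state.
type peerState struct{ connected bool }

// deltaWithRemovals sketches the missing removal pass: keys present in
// lastPushed but absent from the current snapshot are emitted as tombstones
// and pruned from lastPushed so the next delta stays consistent.
func deltaWithRemovals(lastPushed, snapshot map[string]peerState) (removed []string) {
	for key := range lastPushed {
		if _, ok := snapshot[key]; !ok {
			removed = append(removed, key)
			delete(lastPushed, key) // keep lastPushed aligned with what management knows
		}
	}
	return removed
}

func main() {
	last := map[string]peerState{"peerA": {true}, "peerB": {true}}
	snap := map[string]peerState{"peerA": {true}}
	fmt.Println(deltaWithRemovals(last, snap)) // peerB is the tombstoned peer
}
```

The alternative mentioned in the finding, forcing a full snapshot whenever the key set shrinks, trades a little bandwidth for not having to define a tombstone event on the wire.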

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 382 - 396,
computeDeltaFromSource currently only emits peers present in the snapshot and
never reports removals; update it to detect when keys in p.lastPushed are absent
from p.source.SnapshotAllRemotePeers() and emit corresponding "delete/tombstone"
PeerStateChangeEvent entries (or a PeerStateChangeEvent that marks the peer as
removed using whatever removed/disconnected field/constant exists) for each
missing Pubkey, then remove those keys from p.lastPushed (or trigger the
existing full-snapshot-on-shrink path if you prefer that alternative); keep
references to p.lastPushed, computeDeltaFromSource, SnapshotAllRemotePeers,
PeerStateChangeEvent and isMaterialChange when implementing the change.
shared/connectionmode/mode.go-19-29 (1)

19-29: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

p2p-dynamic-lazy still has no enum slot.

#5989's objective includes p2p-dynamic-lazy, but this Mode type stops at ModeP2PDynamic. Without a distinct value here, the parser and proto/config conversions cannot preserve that mode, so it will collapse into some other behavior during resolution or sync.
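A sketch of the missing enum slot and a round-tripping parser. The existing constant names here are assumptions inferred from the review text, not the file's actual contents; the essential point is that `p2p-dynamic-lazy` gets its own value and survives parse/serialize.

```go
package main

import "fmt"

// Mode mirrors the shape of shared/connectionmode.Mode; the surrounding
// constants are assumed from the review, only ModeP2PDynamicLazy is new.
type Mode string

const (
	ModeP2P            Mode = "p2p"
	ModeP2PLazy        Mode = "p2p-lazy"
	ModeP2PDynamic     Mode = "p2p-dynamic"
	ModeP2PDynamicLazy Mode = "p2p-dynamic-lazy" // the missing slot
)

// Parse must accept the new value so proto/config conversions preserve it
// instead of collapsing it into another mode during resolution or sync.
func Parse(s string) (Mode, error) {
	switch Mode(s) {
	case ModeP2P, ModeP2PLazy, ModeP2PDynamic, ModeP2PDynamicLazy:
		return Mode(s), nil
	}
	return "", fmt.Errorf("unknown connection mode %q", s)
}

func main() {
	m, err := Parse("p2p-dynamic-lazy")
	fmt.Println(m, err)
}
```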

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/connectionmode/mode.go` around lines 19 - 29, The Mode enum lacks a
value for "p2p-dynamic-lazy" so Mode values like ModeP2PDynamic currently cannot
represent it; add a new constant (e.g., ModeP2PDynamicLazy) into the const block
alongside ModeP2PDynamic and ModeFollowServer, then update all
conversion/serialization points that reference Mode (for example any
ToProto/FromProto, parser, config marshaling/unmarshaling functions) to handle
the new ModeP2PDynamicLazy case so the mode is preserved across parsing, proto
conversion and sync.
client/internal/debouncer/debouncer.go-28-53 (1)

28-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stale timers can still run the newest callback.

Every AfterFunc closure reads shared state from d.fn. If an older timer has already fired when Trigger or Stop calls timer.Stop(), that stale closure can still run and invoke the latest callback, which breaks debounce semantics and lets Stop() leak work after shutdown.

Suggested fix
 type Debouncer struct {
 	delay time.Duration
 	mu    sync.Mutex
 	timer *time.Timer
 	fn    func()
+	gen   uint64
 }
@@
 func (d *Debouncer) Trigger(fn func()) {
 	d.mu.Lock()
 	defer d.mu.Unlock()
+	d.gen++
+	gen := d.gen
 	d.fn = fn
 	if d.timer != nil {
 		d.timer.Stop()
 	}
 	d.timer = time.AfterFunc(d.delay, func() {
 		d.mu.Lock()
+		if gen != d.gen {
+			d.mu.Unlock()
+			return
+		}
 		f := d.fn
+		d.fn = nil
+		d.timer = nil
 		d.mu.Unlock()
 		if f != nil {
 			f()
 		}
 	})
 }
@@
 func (d *Debouncer) Stop() {
 	d.mu.Lock()
 	defer d.mu.Unlock()
+	d.gen++
 	if d.timer != nil {
 		d.timer.Stop()
 		d.timer = nil
 	}
+	d.fn = nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/debouncer/debouncer.go` around lines 28 - 53, The Debouncer
allows stale timers' closures to run and call the newest callback because
AfterFunc closures read shared d.fn; fix by versioning/invalidation: add a
generation counter (e.g. d.generation) and increment it whenever you
replace/clear the timer (in Trigger and Stop); when creating the new timer in
Debouncer.Trigger capture the current generation value and the callback into
locals and have the closure check that the captured generation matches
d.generation (under d.mu) before invoking the captured callback; also ensure
Stop increments the generation and clears d.fn so any already-firing closures
see a mismatched generation and return without calling the latest fn.
client/internal/profilemanager/config.go-178-203 (1)

178-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

All four timeout/connection-mode fields need explicit snake_case JSON tags per proto/API contracts.

The protobuf definitions and OpenAPI spec define these fields with snake_case JSON names: connection_mode, relay_timeout_seconds, and p2p_retry_max_seconds. Currently, ConnectionMode, RelayTimeoutSeconds, and P2pTimeoutSeconds lack explicit JSON tags and will serialize as PascalCase ("ConnectionMode", "RelayTimeoutSeconds", "P2pTimeoutSeconds"), while only P2pRetryMaxSeconds is correctly tagged "p2p_retry_max_seconds". This breaks serialization/deserialization consistency. Add JSON tags to all three: ConnectionMode → `json:"connection_mode,omitempty"`, RelayTimeoutSeconds → `json:"relay_timeout_seconds,omitempty"`, P2pTimeoutSeconds → `json:"p2p_timeout_seconds,omitempty"`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/profilemanager/config.go` around lines 178 - 203, The struct
fields ConnectionMode, RelayTimeoutSeconds and P2pTimeoutSeconds are missing
explicit snake_case JSON tags and will serialize as PascalCase; update their
struct tags to match the proto/OpenAPI contract by adding
json:"connection_mode,omitempty" to ConnectionMode,
json:"relay_timeout_seconds,omitempty" to RelayTimeoutSeconds, and
json:"p2p_timeout_seconds,omitempty" to P2pTimeoutSeconds (leave
P2pRetryMaxSeconds as-is with json:"p2p_retry_max_seconds,omitempty"). Ensure
tags are placed on the same field declarations as shown (ConnectionMode,
RelayTimeoutSeconds, P2pTimeoutSeconds) so JSON marshal/unmarshal uses the
correct snake_case names.
management/server/activity/codes.go-357-362 (1)

357-362: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the public event-code enum in sync with these new activities.

These codes are now emitted on the server side, but shared/management/http/api/types.gen.go still doesn't define matching EventActivityCode values. That leaves /api/events and notification-type consumers out of sync as soon as one of these events is returned.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/activity/codes.go` around lines 357 - 362, The public
EventActivityCode enum (EventActivityCode) must be updated to include the new
activity codes added in the server (AccountConnectionModeChanged,
AccountRelayTimeoutChanged, AccountP2pTimeoutChanged, AccountP2pRetryMaxChanged,
AccountLegacyLazyFallbackEnabledChanged,
AccountLegacyLazyFallbackTimeoutChanged); open the generated API types file that
defines EventActivityCode (types.gen.go) and add matching enum entries or re-run
the generator that emits EventActivityCode so the shared API and server activity
codes stay in sync, then rebuild to ensure /api/events consumers see the new
values.
shared/management/http/api/types.gen.go-41-63 (1)

41-63: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regenerate this schema with the final connection-mode set.

The issue/PR objective includes p2p-dynamic-lazy, but this generated enum still exposes only four modes and the doc blocks still describe the old Phase-1 behavior. Any generated client or dashboard code built from this schema won't be able to accept or round-trip the final mode.

Also applies to: 1482-1487, 1576-1581

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated
AccountSettingsConnectionMode enum and its Valid() method are missing the final
"p2p-dynamic-lazy" mode and still reflect Phase-1 docs; regenerate or update the
schema so the constants include AccountSettingsConnectionModeP2pDynamicLazy
(value "p2p-dynamic-lazy"), add that constant to the switch in
AccountSettingsConnectionMode.Valid(), and update the related doc comments (and
the same changes for the other generated blocks referenced) so generated
clients/dashboards can accept and round-trip the new mode.
client/android/preferences.go-325-330 (1)

325-330: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Map follow-server to clearing the override.

The getter docs advertise follow-server to gomobile callers, but this setter persists it as a literal mode string. That stores an invalid override instead of reverting to the server-pushed mode.

Suggested fix
 func (p *Preferences) SetConnectionMode(mode string) {
+	if mode == "" || mode == "follow-server" {
+		p.configInput.ConnectionMode = nil
+		return
+	}
 	m := mode
 	p.configInput.ConnectionMode = &m
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 325 - 330, The setter
Preferences.SetConnectionMode currently persists the literal "follow-server"
instead of clearing the override; update SetConnectionMode so that when mode is
"" or "follow-server" it clears the override (set p.configInput.ConnectionMode =
nil) to revert to server-pushed behavior, otherwise set the pointer to the
provided mode string as before; locate this change in the
Preferences.SetConnectionMode method to apply the mapping.
client/ui/peers_tab.go-319-320 (1)

319-320: ⚠️ Potential issue | 🟠 Major | 💤 Low value

Proto fields bytesRx and bytesTx are int64, not uint64.

The concern is valid. The proto definition (client/proto/daemon.proto:363-364) declares these as int64, and the generated getters return int64. Casting a negative int64 to uint64 would wrap around and render as an absurdly large transfer count. Either the proto fields should be uint64, or the code should validate the values are non-negative before casting.
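The quick-fix half of this finding is a one-line clamp before conversion; a minimal sketch (the helper name is illustrative, not from the codebase):

```go
package main

import "fmt"

// clampNonNegative guards the int64 -> uint64 conversion the review flags:
// a negative counter renders as 0 instead of wrapping to ~1.8e19 bytes.
func clampNonNegative(v int64) uint64 {
	if v < 0 {
		return 0
	}
	return uint64(v)
}

func main() {
	fmt.Println(clampNonNegative(-1), clampNonNegative(1024)) // 0 1024
}
```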

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/peers_tab.go` around lines 319 - 320, The proto getters
p.GetBytesRx() and p.GetBytesTx() return int64, so avoid directly casting them
to uint64; update the formatting in peers_tab.go where
humanBytes(uint64(p.GetBytesRx())) / humanBytes(uint64(p.GetBytesTx())) is used
— check each value from p.GetBytesRx() and p.GetBytesTx(), clamp negatives to 0
(or render a safe placeholder) and only convert to uint64 after ensuring
non-negative, then pass the uint64 to humanBytes; alternatively, if you prefer
schema change mention, change the proto fields to uint64 and regenerate, but the
quick fix is to validate/clamp the int64 return values before conversion.
client/server/server.go-1540-1549 (1)

1540-1549: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard cfg.ServerSSHAllowed before dereferencing it.

This response path still treats ServerSSHAllowed as a required pointer, but the rest of the file handles it as optional. If an older or partially migrated profile leaves that field nil, GetConfig will panic here.

Proposed fix
+	serverSSHAllowed := true
+	if cfg.ServerSSHAllowed != nil {
+		serverSSHAllowed = *cfg.ServerSSHAllowed
+	}
+
 	return &proto.GetConfigResponse{
 		ManagementUrl:                   managementURL.String(),
 		PreSharedKey:                    preSharedKey,
 		AdminURL:                        adminURL.String(),
 		InterfaceName:                   cfg.WgIface,
 		WireguardPort:                   int64(cfg.WgPort),
 		Mtu:                             int64(cfg.MTU),
 		DisableAutoConnect:              cfg.DisableAutoConnect,
-		ServerSSHAllowed:                *cfg.ServerSSHAllowed,
+		ServerSSHAllowed:                serverSSHAllowed,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/server/server.go` around lines 1540 - 1549, The GetConfigResponse
construction dereferences cfg.ServerSSHAllowed unguarded which can panic when
it's nil; update the code building the proto.GetConfigResponse to check
cfg.ServerSSHAllowed for nil and only set ServerSSHAllowed to
*cfg.ServerSSHAllowed when non-nil (otherwise set a safe default or omit/set a
nil/false-equivalent per proto semantics), i.e., guard the dereference of
ServerSSHAllowed in the function that builds the response so the GetConfig path
won’t panic.
client/ui/network.go-379-384 (1)

379-384: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh the Peers tab when it is the active view.

Line 382 still routes the timer through getGridAndFilterFromTab(...), which treats the Peers tab as the default networks tab. In practice that means the visible Peers view never auto-refreshes, while the hidden all-networks grid keeps doing background refresh work. Branch on peersText here and call peersBundle.Refresh() instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/network.go` around lines 379 - 384, The timer refresh currently
always calls getGridAndFilterFromTab(tabs, allGrid, overlappingGrid,
exitNodesGrid) which treats the Peers tab as networks and prevents the visible
Peers view from auto-refreshing; update the branch in the timer handler to check
if the active tab equals peersText and, if so, call peersBundle.Refresh() (and
avoid calling getGridAndFilterFromTab or s.wNetworks.Content().Refresh() in that
path), otherwise continue to call getGridAndFilterFromTab(...) and
s.updateNetworksSilent(grid, f) so hidden networks still refresh as before.
client/ui/client_ui.go-656-666 (1)

656-666: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject invalid timeout text instead of silently converting it to 0.

parseUint32Field treats parse failures and overflow the same as an empty field, so values like abc or 999999999999 get saved as “use server default”. That makes config edits lossy and very hard to diagnose.

Suggested direction
-func parseUint32Field(text string) uint32 {
+func parseUint32Field(text string) (uint32, error) {
 	t := strings.TrimSpace(text)
 	if t == "" {
-		return 0
+		return 0, nil
 	}
 	v, err := strconv.ParseUint(t, 10, 32)
 	if err != nil {
-		return 0
+		return 0, fmt.Errorf("invalid timeout value %q", t)
 	}
-	return uint32(v)
+	return uint32(v), nil
 }

Then validate/build with the returned error instead of silently sending zeroes.

Also applies to: 741-750

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/client_ui.go` around lines 656 - 666, parseUint32Field currently
treats empty, invalid, and overflow inputs the same by returning 0, which
silently converts bad user edits into "use server default"; change
parseUint32Field to return (uint32, error) (or an explicit ok bool plus value)
and return an error on invalid parse/overflow instead of 0, update callers
(including the other similar helper at the 741-750 region) to check and
propagate that error to the config validation/build path so invalid timeout text
is rejected with a clear error rather than saved as zero.
management/server/http/handlers/peer_connections/handler.go-161-188 (1)

161-188: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Filter snapshot entries by the caller’s peer visibility before returning them.

GetPeer only authorizes the source peer. buildResponse then emits every remote pubkey/endpoint in that peer’s connection map, so a regular user who can see a routing peer via policy reachability can learn unrelated peers and their traffic metadata. Please drop or redact entries the caller cannot access under the same visibility rules used by GetPeer/GetPeers.
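The filtering step might look like the following sketch, where `canSee` stands in for whatever shared visibility helper GetPeer/GetPeers already use (all names here are hypothetical):

```go
package main

import "fmt"

// entry stands in for one remote-peer row in the connection-map response.
type entry struct{ remotePubKey string }

// filterVisible drops entries the caller is not authorized to see, using
// the same predicate the GetPeer/GetPeers handlers apply.
func filterVisible(entries []entry, canSee func(pubKey string) bool) []entry {
	visible := entries[:0:0] // fresh backing array; never aliases the input
	for _, e := range entries {
		if canSee(e.remotePubKey) {
			visible = append(visible, e)
		}
	}
	return visible
}

func main() {
	entries := []entry{{"peerA"}, {"peerB"}, {"peerC"}}
	allowed := map[string]bool{"peerA": true, "peerC": true}
	out := filterVisible(entries, func(k string) bool { return allowed[k] })
	fmt.Println(len(out)) // 2
}
```

Redacting fields instead of dropping entries is the other option the finding allows; dropping is simpler and leaks nothing.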

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/peer_connections/handler.go` around lines 161
- 188, buildResponse currently returns every entry in the connection map
regardless of the caller’s visibility; instead, apply the same
visibility/filtering used by GetPeer/GetPeers before emitting entries. For each
entry in buildResponse (refer to apiEntry assembly and the call to
h.account.GetPeerByPubKey), call the same visibility-check helper used by
GetPeer/GetPeers (or extract that logic into a shared helper like
authorizePeerVisibility) and if the caller cannot see the remote peer then drop
the entry or redact sensitive fields (remote pubkey, endpoint, RxBytes/TxBytes,
latency) before appending to resp.Entries so only peers the caller is authorized
to see are returned.
client/ui/client_ui.go-501-510 (1)

501-510: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expose p2p-dynamic-lazy everywhere the selector reads or writes modes.

The new mode is missing from the dropdown, the refresh options, the enable/disable switch, and the load-time switch. Opening settings on a profile already set to p2p-dynamic-lazy will fall back to “Follow server”, and saving from there clears the override.

Suggested direction
 s.sConnectionMode = widget.NewSelect(
-	[]string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic"},
+	[]string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy"},
 	func(string) { s.updateTimeoutEntriesEnabled() },
 )
@@
 	s.sConnectionMode.Options = []string{
 		s.followServerLabel(),
 		"relay-forced",
 		"p2p",
 		"p2p-lazy",
 		"p2p-dynamic",
+		"p2p-dynamic-lazy",
 	}
@@
-	case "p2p-dynamic":
+	case "p2p-dynamic", "p2p-dynamic-lazy":
 		s.iRelayTimeout.Enable()
 		s.iP2pTimeout.Enable()
 		s.iP2pRetryMax.Enable()
@@
-	case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic":
+	case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy":
 		s.sConnectionMode.SetSelected(cfg.ConnectionMode)

Also applies to: 855-907, 1573-1598

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/client_ui.go` around lines 501 - 510, The dropdown and related
mode-handling code omit the new "p2p-dynamic-lazy" mode; add "p2p-dynamic-lazy"
to the NewSelect options list (where s.sConnectionMode is created) and update
every place that enumerates or branches on connection modes — e.g., the
refresh/options arrays, the enable/disable switch logic, the load-time mapping
that calls s.sConnectionMode.SetSelected or reads s.sConnectionMode.Selected,
and the updateTimeoutEntriesEnabled function — so that checks that currently
look for "p2p-dynamic" also accept "p2p-dynamic-lazy" (or include it as a new
case) and saving/loading code preserves this exact string instead of falling
back to "Follow server".
management/server/http/handlers/accounts/accounts_handler.go-231-283 (1)

231-283: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return InvalidArgument for these validation failures.

These branches build plain fmt.Errorfs, but updateAccount() forwards them to util.WriteError(). That makes malformed connection_mode / timeout input look like a server failure instead of a 4xx validation error. Please return status.Errorf(status.InvalidArgument, ...) here, including from validateUint32Timeout().

Suggested fix
 	if req.Settings.ConnectionMode != nil {
 		modeStr := string(*req.Settings.ConnectionMode)
 		if !req.Settings.ConnectionMode.Valid() {
-			return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+			return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
 		}
@@
 		if v < 60 || v > 86400 {
-			return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+			return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
 		}
 func validateUint32Timeout(name string, v int64) (uint32, error) {
 	if v < 0 {
-		return 0, fmt.Errorf("invalid %s: %d (must be >= 0)", name, v)
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (must be >= 0)", name, v)
 	}
 	if v > int64(math.MaxUint32) {
-		return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, math.MaxUint32)
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (exceeds %d)", name, v, math.MaxUint32)
 	}
 	return uint32(v), nil
 }

Also applies to: 505-519

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
231 - 283, The validation branches in accounts_handler.go (e.g., the
connection_mode check, the legacy_lazy_fallback_timeout_seconds check, and calls
to validateUint32Timeout for p2p_timeout_seconds, p2p_retry_max_seconds,
relay_timeout_seconds) currently return plain fmt.Errorf which causes
util.WriteError to treat them as server errors; change these to return gRPC
status errors using status.Errorf(codes.InvalidArgument, ...) with the same
message text, and update validateUint32Timeout to return a status.Error (or
status.Errorf) on invalid input so its callers receive an InvalidArgument
status; references: req.Settings.ConnectionMode, validateUint32Timeout,
returnSettings.LegacyLazyFallbackTimeoutSeconds, and the timeout fields
P2pTimeoutSeconds/P2pRetryMaxSeconds/RelayTimeoutSeconds.
client/internal/peer/conn.go-97-102 (1)

97-102: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

P2pRetryMaxSeconds currently can't represent “disabled”.

The new flow collapses two different states into 0: “use DefaultP2PRetryMax” and “explicitly disable backoff”. Open() interprets 0 as the default cap, and SetIceBackoffMax(0) stores the same value back into config, so an explicit disable is lost on first open/reopen. This needs a real tri-state through ConnConfig and the open/setter paths.

Also applies to: 222-227, 1428-1434
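The pointer tri-state proposed above can be sketched as a single resolver; the default constant's value is a placeholder, not the real `DefaultP2PRetryMax`:

```go
package main

import "fmt"

const defaultP2PRetryMaxSecs uint32 = 1800 // stand-in for DefaultP2PRetryMax

// resolveRetryMax implements the tri-state the review asks for:
//
//	nil   -> use the daemon default
//	*v==0 -> backoff explicitly disabled
//	*v>0  -> explicit cap
func resolveRetryMax(v *uint32) (maxSecs uint32, disabled bool) {
	switch {
	case v == nil:
		return defaultP2PRetryMaxSecs, false
	case *v == 0:
		return 0, true
	default:
		return *v, false
	}
}

func main() {
	zero := uint32(0)
	maxSecs, disabled := resolveRetryMax(&zero)
	fmt.Println(maxSecs, disabled) // 0 true: explicit disable survives
}
```

With this shape, `Open()` and `SetIceBackoffMax` pass the pointer through unchanged, so an explicit disable is no longer overwritten by the default on reopen.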

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 97 - 102, P2pRetryMaxSeconds
currently conflates three states into 0; make it a tri-state by changing
ConnConfig.P2pRetryMaxSeconds from uint32 to *uint32 (nil = use
DefaultP2PRetryMax, pointer to 0 = explicit disable, pointer to >0 = explicit
cap), update Open() to interpret nil/0/>0 accordingly (use DefaultP2PRetryMax
for nil, treat 0 as disabled), and change SetIceBackoffMax to accept/store a
*uint32 (or adapt its signature to set nil/0/value) so explicit disable isn't
lost on reopen; apply the same pattern wherever P2pRetryMaxSeconds is consumed
(e.g., the other referenced sites: lines ~222-227 and ~1428-1434) and adjust any
code that previously relied on the sentinel ^uint32(0) to use the new
pointer-tri-state semantics.
client/internal/conn_mgr.go-297-301 (1)

297-301: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Propagate live mode/timeout changes into existing peer.Conns.

This updates the resolved values on ConnMgr, but the only thing pushed into already-created peers is p2pRetryMax. The same peer.Conn objects are then reused by closeManager() / resetPeersToLazyIdle(), so their mode-sensitive logic keeps running with stale ConnConfig.Mode and timeout-derived behavior after a server push. A runtime switch to relay-forced or p2p-dynamic won't fully take effect until the peers are rebuilt.
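A sketch of the propagation helper, with both locks held in a fixed order; all types and field names here are simplified stand-ins for `ConnMgr`, `peer.Conn`, and its `ConnConfig`, not the real API.

```go
package main

import (
	"fmt"
	"sync"
)

// connSettings stands in for the mode/timeout fields on peer.Conn's config.
type connSettings struct {
	mode            string
	relayTimeoutSec uint32
}

type conn struct {
	mu  sync.Mutex
	cfg connSettings
}

func (c *conn) applySettings(s connSettings) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cfg = s
}

type connMgr struct {
	mu    sync.Mutex
	peers map[string]*conn
}

// propagateSettings sketches extending propagateP2pRetryMaxToConns: push the
// full resolved settings into every live conn so closeManager and
// resetPeersToLazyIdle see the new mode immediately, not after a rebuild.
func (m *connMgr) propagateSettings(s connSettings) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, c := range m.peers {
		c.applySettings(s)
	}
}

func main() {
	m := &connMgr{peers: map[string]*conn{"peerA": {}}}
	m.propagateSettings(connSettings{mode: "relay-forced", relayTimeoutSec: 120})
	fmt.Println(m.peers["peerA"].cfg.mode)
}
```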

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 297 - 301, When updating e.mode,
e.relayTimeoutSecs, e.p2pTimeoutSecs and e.p2pRetryMaxSecs, also push the new
configuration into already-created peer.Conn objects so their ConnConfig.Mode
and timeout-derived behavior are updated immediately; extend
propagateP2pRetryMaxToConns() (or add a new helper called from this same place)
to iterate existing peers and set each peer.Conn.ConnConfig.Mode, .RelayTimeout
(or equivalent), .P2PTimeout and .P2PRetryMax fields from the ConnMgr's resolved
values so closeManager() and resetPeersToLazyIdle() do not keep operating on
stale settings. Ensure the update is safe for concurrent access to the peers
collection.
management/internals/shared/grpc/conversion.go-282-289 (1)

282-289: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sentinel encoding inconsistency between configured_p2p_retry_max_secs and PeerConfig.P2pRetryMaxSeconds will cause false UI divergence warnings.

In toPeerConfig (lines 132-139), *settings.P2pRetryMaxSeconds == 0 is encoded on the wire as p2pRetryMaxDisabledSentinel (uint32 max) to mean "user explicitly disabled backoff", while wire-value 0 means "not set / use daemon default" — matching the wire-format semantics documented on PeerConfig.P2pRetryMaxSeconds in management.proto.

Here, cfgP2pRetryMax = derefUint32OrZero(c.Cfg.P2pRetryMaxSeconds) skips that transform: when an admin has explicitly disabled backoff (settings value 0), RemotePeerConfig.configured_p2p_retry_max_secs is sent as 0 while PeerConfig.P2pRetryMaxSeconds and the peer's reported EffectiveP2PRetryMaxSecs are sent as the sentinel max. UIs that compare "effective vs configured to spot local overrides (≠ → ⚠)" — as the proto comment on these fields states — will flag every legitimately backoff-disabled peer as diverging.

🛠️ Proposed fix: extract a small helper and apply the sentinel transform in both call sites
 	var cfgConnMode string
 	var cfgRelayTO, cfgP2pTO, cfgP2pRetryMax uint32
 	if c.Cfg != nil {
 		cfgConnMode = derefStringOrEmpty(c.Cfg.ConnectionMode)
 		cfgRelayTO = derefUint32OrZero(c.Cfg.RelayTimeoutSeconds)
 		cfgP2pTO = derefUint32OrZero(c.Cfg.P2pTimeoutSeconds)
-		cfgP2pRetryMax = derefUint32OrZero(c.Cfg.P2pRetryMaxSeconds)
+		cfgP2pRetryMax = encodeP2pRetryMaxWire(c.Cfg.P2pRetryMaxSeconds)
 	}
// encodeP2pRetryMaxWire applies the wire-format sentinel for
// PeerConfig.P2pRetryMaxSeconds / configured_p2p_retry_max_secs:
//
//	nil    -> 0   (not set, use daemon default)
//	*v==0  -> max (user explicitly disabled backoff)
//	*v>0   -> *v
func encodeP2pRetryMaxWire(v *uint32) uint32 {
	if v == nil {
		return 0
	}
	if *v == 0 {
		return p2pRetryMaxDisabledSentinel
	}
	return *v
}

Then update toPeerConfig (lines 132-139) to use the same helper for symmetry.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/conversion.go` around lines 282 - 289, The
configured_p2p_retry_max_secs wire-value is using derefUint32OrZero and
therefore fails to apply the same sentinel mapping used by
PeerConfig.P2pRetryMaxSeconds, causing false divergence; add a helper like
encodeP2pRetryMaxWire(v *uint32) uint32 that returns 0 for nil,
p2pRetryMaxDisabledSentinel for *v==0, and *v otherwise, then replace uses of
derefUint32OrZero for the P2P retry field (e.g., the cfgP2pRetryMax assignment
in this diff and the toPeerConfig mapping of settings.P2pRetryMaxSeconds) to
call encodeP2pRetryMaxWire so both sides use the same wire encoding.
management/internals/shared/grpc/server.go-435-437 (1)

435-437: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add nil-guard on s.snapshotRouter to match defensive SyncPeerConnections pattern.

The unconditional dereference of s.snapshotRouter.Register() on line 435 is inconsistent with the if s.peerConnections != nil check in SyncPeerConnections (line 1182). NewServer accepts both parameters with no documented non-nil contract, and multiple test files pass nil for peerConnRouter. Since those tests do invoke Sync() which calls handleUpdates(), the nil panic is a genuine risk.

Apply Option A (defensive guard):

-	// Phase 3.7i (`#5989`): register for SnapshotRequest dispatch.
-	snapshotCh := s.snapshotRouter.Register(peerKey.String())
-	defer s.snapshotRouter.Unregister(peerKey.String(), snapshotCh)
+	// Phase 3.7i (`#5989`): register for SnapshotRequest dispatch.
+	var snapshotCh <-chan uint64
+	if s.snapshotRouter != nil {
+		snapshotCh = s.snapshotRouter.Register(peerKey.String())
+		defer s.snapshotRouter.Unregister(peerKey.String(), snapshotCh)
+	}

The existing case nonce, ok := <-snapshotCh: will safely block forever on a nil channel, degrading gracefully.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/server.go` around lines 435 - 437, The code
unconditionally calls s.snapshotRouter.Register(...) and Unregister(...),
causing a nil-deref when s.snapshotRouter is nil; add a defensive nil-guard
around registration/unregistration similar to the SyncPeerConnections pattern:
only call s.snapshotRouter.Register(peerKey.String()) and defer
s.snapshotRouter.Unregister(peerKey.String(), snapshotCh) when s.snapshotRouter
!= nil, and handle snapshotCh being nil (the existing select/case will block on
a nil channel as intended); update the block using the symbols snapshotCh,
s.snapshotRouter.Register, s.snapshotRouter.Unregister and keep behavior when
NewServer may receive a nil peerConnRouter.
shared/management/http/api/openapi.yml-379-405 (1)

379-405: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timeout defaults conflict with #5989 behavior goals

Line 382 documents ICE idle default as 180 minutes, while Line 405 documents relay idle default as 5 minutes. #5989 objectives describe the opposite shape (ICE around 5 minutes, relay around 1 hour). Please align these descriptions (and values, if needed) with actual server behavior so clients don’t apply wrong defaults.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 379 - 405, The YAML
descriptions and implied defaults are flipped: update the ICE idle timeout field
(the description block that starts "Default ICE-worker idle timeout in seconds.
0 = never tear down.") to state the built-in default is ~5 minutes (NULL means
use built-in default (5 minutes)) and change any example/value references
accordingly, and update relay_timeout_seconds to state the built-in default is
~60 minutes (NULL means use built-in default (60 minutes)) so the documented
defaults match the actual server behavior; adjust any example values and text to
reflect these corrected defaults and keep p2p_retry_max_seconds unchanged.
shared/management/http/api/openapi.yml-362-372 (1)

362-372: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

connection_mode is missing p2p-dynamic-lazy and still documents stale phase behavior

Line 365 omits p2p-dynamic-lazy, even though #5989 defines it as a first-class mode. This can block generated clients from sending/accepting valid values and causes an API contract mismatch.

Suggested OpenAPI fix
         connection_mode:
           x-experimental: true
           type: string
-          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
           nullable: true
           description: |
             Account-wide default peer-connection mode. NULL means
             "fall back to lazy_connection_enabled" for backwards compatibility.
-            Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-            functional. p2p-dynamic is reserved (passes through as p2p in
-            Phase 1; will become functional in Phase 2).
+            Supported modes: relay-forced, p2p, p2p-lazy, p2p-dynamic,
+            and p2p-dynamic-lazy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 362 - 372, The OpenAPI
schema for property connection_mode currently omits the valid enum value
"p2p-dynamic-lazy" and contains stale phase-specific documentation; update the
enum for connection_mode to include "p2p-dynamic-lazy" alongside relay-forced,
p2p, p2p-lazy, and p2p-dynamic, keep nullable: true and x-experimental: true,
and revise the description to remove the outdated "Phase 1/Phase 2" wording and
instead document the current semantics and fallback behavior (NULL meaning fall
back to lazy_connection_enabled) so generated clients accept and document the
new mode correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 19da9f16-b34b-4316-be52-6651b3b8f39b

📥 Commits

Reviewing files that changed from the base of the PR and between b19b746 and 86fab00.

⛔ Files ignored due to path filters (7)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • client/proto/daemon_grpc.pb.go is excluded by !**/*.pb.go
  • flow/proto/flow.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management_grpc.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/proxy_service.pb.go is excluded by !**/*.pb.go
  • shared/signal/proto/signalexchange.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (108)
  • .gitignore
  • client/android/client.go
  • client/android/peer_notifier.go
  • client/android/preferences.go
  • client/android/preferences_clamp_test.go
  • client/cmd/root.go
  • client/cmd/service.go
  • client/cmd/service_installer.go
  • client/cmd/testutil_test.go
  • client/cmd/up.go
  • client/iface/bind/activity.go
  • client/iface/device/endpoint_manager.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/conn_state_pusher.go
  • client/internal/conn_state_pusher_material_test.go
  • client/internal/conn_state_pusher_test.go
  • client/internal/conn_state_pusher_testhelper_test.go
  • client/internal/connect.go
  • client/internal/debouncer/debouncer.go
  • client/internal/engine.go
  • client/internal/engine_offline_debounce_test.go
  • client/internal/engine_pusher_adapters.go
  • client/internal/engine_test.go
  • client/internal/lazyconn/activity/listener_bind_test.go
  • client/internal/lazyconn/env.go
  • client/internal/lazyconn/inactivity/manager.go
  • client/internal/lazyconn/inactivity/manager_test.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/lazyconn/support.go
  • client/internal/peer/conn.go
  • client/internal/peer/conn_handover_order_test.go
  • client/internal/peer/conn_lazy_keepwgpeer_test.go
  • client/internal/peer/conn_test.go
  • client/internal/peer/env.go
  • client/internal/peer/env_test.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/guard/guard_test.go
  • client/internal/peer/guard/ice_retry_state_test.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/handshaker_test.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/status.go
  • client/internal/peer/status_debounce_test.go
  • client/internal/peer/status_remote_meta_notify_test.go
  • client/internal/peer/status_test.go
  • client/internal/peer/worker_ice.go
  • client/internal/peerstore/store.go
  • client/internal/profilemanager/config.go
  • client/internal/stdnet/filter.go
  • client/internal/stdnet/filter_test.go
  • client/proto/daemon.proto
  • client/server/server.go
  • client/server/server_test.go
  • client/server/setconfig_test.go
  • client/status/status.go
  • client/status/status_test.go
  • client/system/features.go
  • client/system/features_test.go
  • client/ui/client_ui.go
  • client/ui/const.go
  • client/ui/event_handler.go
  • client/ui/network.go
  • client/ui/peers_tab.go
  • docs/bugs/2026-05-04-user-peer-visibility-regression.md
  • docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
  • management/internals/controllers/network_map/controller/controller.go
  • management/internals/server/boot.go
  • management/internals/shared/grpc/conversion.go
  • management/internals/shared/grpc/conversion_test.go
  • management/internals/shared/grpc/server.go
  • management/server/account.go
  • management/server/account/manager.go
  • management/server/account/manager_mock.go
  • management/server/activity/codes.go
  • management/server/http/handler.go
  • management/server/http/handlers/accounts/accounts_handler.go
  • management/server/http/handlers/accounts/accounts_handler_test.go
  • management/server/http/handlers/accounts/validate_uint32_timeout_test.go
  • management/server/http/handlers/peer_connections/handler.go
  • management/server/http/handlers/peer_connections/handler_test.go
  • management/server/http/testing/testing_tools/channel/channel.go
  • management/server/management_proto_test.go
  • management/server/management_test.go
  • management/server/mock_server/account_mock.go
  • management/server/peer.go
  • management/server/peer/peer.go
  • management/server/peer/peer_test.go
  • management/server/peer_connections/snapshot_router.go
  • management/server/peer_connections/snapshot_router_test.go
  • management/server/peer_connections/store.go
  • management/server/peer_connections/store_test.go
  • management/server/peer_test.go
  • management/server/store/file_store.go
  • management/server/store/sql_store.go
  • management/server/types/settings.go
  • management/server/types/settings_ptr_equal_test.go
  • management/server/types/settings_test.go
  • shared/connectionmode/mode.go
  • shared/connectionmode/mode_test.go
  • shared/management/client/client.go
  • shared/management/client/client_test.go
  • shared/management/client/grpc.go
  • shared/management/client/mock.go
  • shared/management/http/api/openapi.yml
  • shared/management/http/api/types.gen.go
  • shared/management/proto/management.proto
💤 Files with no reviewable changes (2)
  • client/ui/const.go
  • client/ui/event_handler.go

Comment on lines +347 to +355
// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override. Negative values are clamped to 0;
// values larger than MaxUint32 are clamped to MaxUint32. The Android
// AdvancedFragment UI already clamps negatives but a Java caller using
// the bare gomobile API directly would otherwise wrap silently.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := clampUint32Seconds(secs)
	p.configInput.RelayTimeoutSeconds = &v
}
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

cfg_file="$(fd -p 'config.go' client/internal/profilemanager | head -n1)"
if [[ -z "${cfg_file}" ]]; then
  echo "Could not find client/internal/profilemanager/config.go" >&2
  exit 1
fi

echo "=== ConfigInput / Config timeout fields ==="
rg -n -C3 'type ConfigInput struct|type Config struct|RelayTimeoutSeconds|P2pTimeoutSeconds|P2pRetryMaxSeconds|ConnectionMode' "$cfg_file"

echo
echo "=== Read / write paths ==="
rg -n -C5 'func ReadConfig|func UpdateOrCreateConfig' "$cfg_file"

echo
echo "=== Any nil-vs-zero normalization before persist ==="
rg -n -C3 'RelayTimeoutSeconds.*0|P2pTimeoutSeconds.*0|P2pRetryMaxSeconds.*0|ConnectionMode.*follow-server|nil' client/internal/profilemanager client/android

Repository: netbirdio/netbird

Length of output: 50374


🏁 Script executed:

cat -n client/android/preferences.go | sed -n '340,400p'

Repository: netbirdio/netbird

Length of output: 2798


🏁 Script executed:

cat -n client/android/preferences.go | sed -n '398,420p'

Repository: netbirdio/netbird

Length of output: 713


Fix setters to write nil when 0 is passed, not &0.

The documented behavior—"Pass 0 to clear the override"—is broken. When you call SetRelayTimeoutSeconds(0), the setter always creates a non-nil pointer to 0 and stores it in configInput.RelayTimeoutSeconds. When UpdateOrCreateConfig is called, it dereferences this pointer and writes literal 0 to the config file.

Per the Config struct documentation, local 0 means "follow server," so this persists an override (follow-server behavior) instead of clearing the local override. The same issue affects SetP2pTimeoutSeconds and SetP2pRetryMaxSeconds.

Each setter should check if the clamped value is 0 and write nil instead of &0:

func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := clampUint32Seconds(secs)
	if v == 0 {
		p.configInput.RelayTimeoutSeconds = nil
	} else {
		p.configInput.RelayTimeoutSeconds = &v
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 347 - 355, The setters
(Preferences.SetRelayTimeoutSeconds, and likewise SetP2pTimeoutSeconds and
SetP2pRetryMaxSeconds) currently always take the clamped uint32 value v and
assign its address to p.configInput.* which creates a non-nil pointer for 0;
instead detect when clampUint32Seconds(secs) returns 0 and set the corresponding
p.configInput.RelayTimeoutSeconds / P2pTimeoutSeconds / P2pRetryMaxSeconds to
nil, otherwise set it to a pointer to v so passing 0 clears the override rather
than persisting a literal 0.

Comment on lines +487 to +498
		case nonce, ok := <-snapshotCh:
			if !ok {
				continue
			}
			snapMsg := &network_map.UpdateMessage{
				Update: &proto.SyncResponse{
					SnapshotRequest: &proto.PeerSnapshotRequest{Nonce: nonce},
				},
			}
			if err := s.sendUpdate(ctx, accountID, peerKey, peer, snapMsg, srv, streamStartTime); err != nil {
				log.WithContext(ctx).Warnf("send snapshot request to %s: %v", peerKey.String(), err)
			}
⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect SnapshotRouter implementation to determine when it closes channels.
fd -t f 'snapshot_router.go' | xargs -I{} cat {}
ast-grep --pattern $'func ($_ *SnapshotRouter) Unregister($$$) {
  $$$
}'
ast-grep --pattern $'func ($_ *SnapshotRouter) $_($$$) {
  $$$
  close($_)
  $$$
}'

Repository: netbirdio/netbird

Length of output: 4855


Closed-channel busy-loop on fast-reconnect scenarios.

When a peer reconnects quickly, SnapshotRouter.Register() closes the previous stream's channel (documented behavior for fast-reconnect handling). The old handleUpdates goroutine, however, stays in the select loop reading from the now-closed channel: <-snapshotCh immediately returns (0, false), the if !ok { continue } re-enters the select, and the closed case fires again at once. The result is a tight spin that lasts until ctx.Done() fires or handleUpdates returns and its deferred Unregister() runs.

Set snapshotCh = nil after closing so subsequent select evaluations skip this case (nil channels block forever in select):

Proposed fix
 		case nonce, ok := <-snapshotCh:
 			if !ok {
-				continue
+				snapshotCh = nil
+				continue
 			}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
		case nonce, ok := <-snapshotCh:
			if !ok {
				continue
			}
			snapMsg := &network_map.UpdateMessage{
				Update: &proto.SyncResponse{
					SnapshotRequest: &proto.PeerSnapshotRequest{Nonce: nonce},
				},
			}
			if err := s.sendUpdate(ctx, accountID, peerKey, peer, snapMsg, srv, streamStartTime); err != nil {
				log.WithContext(ctx).Warnf("send snapshot request to %s: %v", peerKey.String(), err)
			}
		case nonce, ok := <-snapshotCh:
			if !ok {
				snapshotCh = nil
				continue
			}
			snapMsg := &network_map.UpdateMessage{
				Update: &proto.SyncResponse{
					SnapshotRequest: &proto.PeerSnapshotRequest{Nonce: nonce},
				},
			}
			if err := s.sendUpdate(ctx, accountID, peerKey, peer, snapMsg, srv, streamStartTime); err != nil {
				log.WithContext(ctx).Warnf("send snapshot request to %s: %v", peerKey.String(), err)
			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/server.go` around lines 487 - 498, The
select case reading from snapshotCh can busy-loop when the channel is closed; in
the case branch where you detect the closed channel (if !ok), set snapshotCh =
nil so the select will skip that case (nil channels block) and then continue;
update the select-handling logic in the function/method containing snapshotCh
(the handleUpdates/select loop) to assign snapshotCh = nil immediately after
detecting closure before continuing to avoid the tight spin.

Comment thread management/server/management_proto_test.go
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
management/server/peer/peer.go (1)

207-225: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

isEmpty not updated to consider the new Phase 3.7i fields.

The library context for this change states that PeerSystemMeta.isEmpty was updated to factor in the new fields, but the implementation still only inspects the legacy fields. A meta payload that only carried the new Effective* / SupportedFeatures data (e.g., a future minimal-meta path) would be silently treated as empty by UpdateMetaIfNew and skipped. In current flows real peers always populate Hostname/OS/etc., so this isn't actively triggered, but it's worth aligning to keep isEmpty and isEqual in sync.

🔧 Suggested alignment
 func (p PeerSystemMeta) isEmpty() bool {
 	return p.Hostname == "" &&
 		p.GoOS == "" &&
 		p.Kernel == "" &&
 		p.Core == "" &&
 		p.Platform == "" &&
 		p.OS == "" &&
 		p.OSVersion == "" &&
 		p.WtVersion == "" &&
 		p.UIVersion == "" &&
 		p.KernelVersion == "" &&
 		len(p.NetworkAddresses) == 0 &&
 		p.SystemSerialNumber == "" &&
 		p.SystemProductName == "" &&
 		p.SystemManufacturer == "" &&
 		p.Environment.Cloud == "" &&
 		p.Environment.Platform == "" &&
-		len(p.Files) == 0
+		len(p.Files) == 0 &&
+		p.EffectiveConnectionMode == "" &&
+		p.EffectiveRelayTimeoutSecs == 0 &&
+		p.EffectiveP2PTimeoutSecs == 0 &&
+		p.EffectiveP2PRetryMaxSecs == 0 &&
+		len(p.SupportedFeatures) == 0
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer/peer.go` around lines 207 - 225,
PeerSystemMeta.isEmpty currently checks only legacy fields and doesn't consider
the new Phase 3.7i fields (the Effective* fields and SupportedFeatures), so
payloads containing only those new fields are treated as empty; update
PeerSystemMeta.isEmpty to mirror the expanded checks in
PeerSystemMeta.isEqual/UpdateMetaIfNew by including the Effective* fields (e.g.,
EffectiveOS, EffectivePlatform, EffectiveKernel, etc.) and SupportedFeatures
(and any other new Phase 3.7i fields added to the struct), and ensure
slices/maps are checked for length/non-nil the same way as in isEqual so
UpdateMetaIfNew won't skip valid new-meta updates.
🟠 Major comments (21)
docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md-44-44 (1)

44-44: ⚠️ Potential issue | 🟠 Major

Use dual-stack UDP socket instead of IPv4-only.

The proposed code restricts to "udp4", but the current sharedsock implementation explicitly creates both IPv4 and IPv6 raw sockets. The rest of the codebase—including ICEBind, DualStackPacketConn, and udpmux—expects dual-stack support. For consistency with existing infrastructure and to avoid breaking IPv6 connectivity, use "udp" (which handles both IPv4 and IPv6) or create separate IPv4 and IPv6 listeners:

udpConn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 0})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` at line 44,
The code currently calls net.ListenUDP with "udp4" which forces an IPv4-only
socket; update the socket creation used by udpConn (the net.ListenUDP
invocation) to use a dual-stack socket by switching the network string to "udp"
or by creating separate IPv4 and IPv6 listeners so it matches the
sharedsock/ICEBind/DualStackPacketConn/udpmux expectations and preserves IPv6
support; ensure the created socket(s) integrate with the existing ICEBind and
DualStackPacketConn usage paths.
client/internal/stdnet/filter.go-51-54 (1)

51-54: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scope the lo prefix guard away from Windows.

Line 52 now filters any Windows interface whose friendly name starts with Lo, not just the loopback pseudo-interface. Names like Local Area Connection would be dropped before the Windows-specific exceptions run, which can hide a real default-route NIC and recreate the relay-only failure you’re trying to fix here.

Suggested fix
-		// Linux/macOS loopback prefix ("lo", "lo0").
-		if strings.HasPrefix(lowerIFace, "lo") {
+		// Unix loopback prefixes ("lo", "lo0").
+		if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") {
 			return false
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter.go` around lines 51 - 54, The current prefix
guard uses strings.HasPrefix(lowerIFace, "lo") and wrongly drops Windows
interfaces whose friendly name starts with "Lo"; change the condition to only
apply on non-Windows platforms by checking runtime.GOOS (e.g. if runtime.GOOS !=
"windows" && strings.HasPrefix(lowerIFace, "lo") { return false }) and add the
runtime import if missing so Windows-specific exceptions still run for
interfaces like "Local Area Connection".
docs/bugs/2026-05-04-user-peer-visibility-regression.md-33-70 (1)

33-70: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Redact user-identifiable and internal endpoint details in bug docs.

Line 33 and Lines 68-70 include a real-looking user identifier and internal host/path details. Please replace these with sanitized placeholders before merge to avoid persisting potentially sensitive identifiers/infrastructure metadata in git history.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/bugs/2026-05-04-user-peer-visibility-regression.md` around lines 33 -
70, Replace all user-identifiable and internal endpoint strings in this
document: redact the user id `georg.stoisser-gigacher` and device id `ctb50-d`
(and any other real names) with placeholders such as <USER_ID> and <PEER_ID>,
and sanitize the curl example by replacing the real token and host/path
`https://netbird.uplink.plant-control.net:44106/api/peers` with placeholders
like <AUTH_TOKEN> and <INTERNAL_API_HOST>/api/peers (or remove the exact host
entirely), ensuring no real identifiers remain in the text or examples (also
scan for similar strings elsewhere in the file and replace them).
client/internal/debouncer/debouncer.go-31-42 (1)

31-42: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stale timer callback can execute the newest function (duplicate/early fire).

Because the callback reads shared d.fn (Line 37), an older time.AfterFunc that races with a newer Trigger can run the latest function before the new delay expires, and then the new timer runs it again. This violates last-write-wins debounce behavior.

Suggested fix (generation guard)
 type Debouncer struct {
 	delay time.Duration
 	mu    sync.Mutex
 	timer *time.Timer
 	fn    func()
+	gen   uint64
 }

 func (d *Debouncer) Trigger(fn func()) {
 	d.mu.Lock()
 	defer d.mu.Unlock()
+	d.gen++
+	currGen := d.gen
 	d.fn = fn
 	if d.timer != nil {
 		d.timer.Stop()
 	}
 	d.timer = time.AfterFunc(d.delay, func() {
 		d.mu.Lock()
-		f := d.fn
+		if currGen != d.gen {
+			d.mu.Unlock()
+			return
+		}
+		f := d.fn
+		d.fn = nil
+		d.timer = nil
 		d.mu.Unlock()
 		if f != nil {
 			f()
 		}
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/debouncer/debouncer.go` around lines 31 - 42, The callback
races by reading shared d.fn and can invoke a newer function; fix by introducing
a generation guard on the debouncer: add a generation counter (e.g.,
d.generation), increment it inside the Trigger path when setting d.fn and
creating d.timer, capture the current generation into a local variable in the
time.AfterFunc closure, and inside the closure acquire d.mu then compare the
captured generation to d.generation — only read/execute d.fn if the generations
match so stale timers cannot run the latest function.
client/internal/peer/ice_backoff.go-178-197 (1)

178-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Consume the activity override atomically.

AllowActivityOverride() only checks the window; it does not mark the override as spent. Two activity callbacks racing here can both return true before either caller runs Reset(), which breaks the “one override per 5 minutes per peer” guarantee. It also ignores nextRetry, so an already-expired suspension still looks overrideable.

Suggested fix
 func (s *iceBackoffState) AllowActivityOverride() bool {
 	s.mu.Lock()
 	defer s.mu.Unlock()
-	if !s.suspended {
+	now := time.Now()
+	if !s.suspended || now.After(s.nextRetry) {
 		return false // not in backoff, nothing to override
 	}
-	if time.Since(s.lastResetAt) < activityOverrideMinInterval {
+	if now.Sub(s.lastResetAt) < activityOverrideMinInterval {
 		return false // too soon since last reset, respect rate limit
 	}
+	s.lastResetAt = now // consume the override budget before unlocking
 	return true
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/ice_backoff.go` around lines 178 - 197,
AllowActivityOverride currently only inspects state and doesn't consume the
override, allowing races; change it to atomically check and mark the override
while holding s.mu: ensure s.suspended is true, ensure time.Since(s.lastResetAt)
>= activityOverrideMinInterval, and ensure the suspension hasn’t already expired
by checking s.nextRetry <= time.Now() (or nextRetry.IsZero logic as
appropriate); if all checks pass update s.lastResetAt = time.Now() (or otherwise
record the override as spent) before returning true so the override cannot be
reused by a concurrent caller; keep Reset() semantics intact.
management/server/account.go-477-506 (1)

477-506: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve nil vs 0 in the timeout audit payloads.

The change detection correctly treats nil and explicit 0 as different, but derefUint32Ptr() flattens both to 0 before storing the event. That makes a nil → 0 or 0 → nil transition audit as old=0, new=0, even though those states mean different things for these settings.

Suggested fix
 	if !equalUint32Ptr(oldSettings.RelayTimeoutSeconds, newSettings.RelayTimeoutSeconds) {
 		am.StoreEvent(ctx, userID, accountID, accountID, activity.AccountRelayTimeoutChanged, map[string]any{
-			"old": derefUint32Ptr(oldSettings.RelayTimeoutSeconds),
-			"new": derefUint32Ptr(newSettings.RelayTimeoutSeconds),
+			"old": oldSettings.RelayTimeoutSeconds,
+			"new": newSettings.RelayTimeoutSeconds,
 		})
 	}
 	if !equalUint32Ptr(oldSettings.P2pTimeoutSeconds, newSettings.P2pTimeoutSeconds) {
 		am.StoreEvent(ctx, userID, accountID, accountID, activity.AccountP2pTimeoutChanged, map[string]any{
-			"old": derefUint32Ptr(oldSettings.P2pTimeoutSeconds),
-			"new": derefUint32Ptr(newSettings.P2pTimeoutSeconds),
+			"old": oldSettings.P2pTimeoutSeconds,
+			"new": newSettings.P2pTimeoutSeconds,
 		})
 	}
 	if !equalUint32Ptr(oldSettings.P2pRetryMaxSeconds, newSettings.P2pRetryMaxSeconds) {
 		am.StoreEvent(ctx, userID, accountID, accountID, activity.AccountP2pRetryMaxChanged, map[string]any{
-			"old": derefUint32Ptr(oldSettings.P2pRetryMaxSeconds),
-			"new": derefUint32Ptr(newSettings.P2pRetryMaxSeconds),
+			"old": oldSettings.P2pRetryMaxSeconds,
+			"new": newSettings.P2pRetryMaxSeconds,
 		})
 	}

Also applies to: 537-541

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/account.go` around lines 477 - 506, The audit payload
currently uses derefUint32Ptr(...) which turns nil and 0 into the same value;
instead preserve nil vs non-nil by passing the pointer values through to the
event payload. Replace derefUint32Ptr(oldSettings.X) /
derefUint32Ptr(newSettings.X) with the actual pointer fields
(oldSettings.RelayTimeoutSeconds, newSettings.RelayTimeoutSeconds,
oldSettings.P2pTimeoutSeconds, newSettings.P2pTimeoutSeconds,
oldSettings.P2pRetryMaxSeconds, newSettings.P2pRetryMaxSeconds, and similarly
for LegacyLazyFallbackTimeoutSeconds) when calling am.StoreEvent so the map
values reflect nil vs 0 while keeping the existing equalUint32Ptr checks and
am.StoreEvent calls.
management/server/peer_connections/store.go-148-153 (1)

148-153: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expire stale entries before applying the nonce gate.

Line 148 can return false before Line 151 gets a chance to evict an expired entry, so a caller that keeps polling with a higher since value can leave dead peer maps in memory indefinitely. Check TTL first, then apply the nonce filter.

Suggested fix
 func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	e, ok := s.maps[peerPubKey]
 	if !ok {
 		return nil, false
 	}
-	if since > 0 && e.m.GetInResponseToNonce() < since {
-		return nil, false
-	}
 	if s.clock.Now().Sub(e.updatedAt) > s.ttl {
 		delete(s.maps, peerPubKey)
 		return nil, false
 	}
+	if since > 0 && e.m.GetInResponseToNonce() < since {
+		return nil, false
+	}
 	return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer_connections/store.go` around lines 148 - 153, The
nonce check in the method returns before expired entries are evicted; change the
order so you first check TTL and evict expired entries (use
s.clock.Now().Sub(e.updatedAt) > s.ttl to delete s.maps[peerPubKey] and return
nil,false) and only after that apply the nonce gate (if since > 0 &&
e.m.GetInResponseToNonce() < since return nil,false); also ensure you handle a
nil entry (e) defensively before accessing fields like e.updatedAt or
e.m.GetInResponseToNonce().
client/internal/peer/conn.go-1428-1434 (1)

1428-1434: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the live backoff cap aligned with the documented zero-as-default contract.

Open() normalizes P2pRetryMaxSeconds == 0 to DefaultP2PRetryMax, but SetIceBackoffMax(0) pushes a literal zero into iceBackoff.SetMaxBackoff. After a runtime config update back to “use default”, the live backoff cap diverges from the config semantics.

Suggested fix
func (conn *Conn) SetIceBackoffMax(d time.Duration) {
	conn.mu.Lock()
	defer conn.mu.Unlock()
	conn.config.P2pRetryMaxSeconds = uint32(d / time.Second)
+	liveMax := d
+	if liveMax == 0 {
+		liveMax = DefaultP2PRetryMax
+	}
 	if conn.iceBackoff != nil {
-		conn.iceBackoff.SetMaxBackoff(d)
+		conn.iceBackoff.SetMaxBackoff(liveMax)
 	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 1428 - 1434, SetIceBackoffMax
currently passes a literal zero into iceBackoff.SetMaxBackoff which breaks the
“zero means use default” contract that Open() enforces; change SetIceBackoffMax
(Conn.SetIceBackoffMax) to preserve the stored config value but when invoking
conn.iceBackoff.SetMaxBackoff interpret d==0 as the effective default by
substituting time.Duration(DefaultP2PRetryMax)*time.Second (referencing
DefaultP2PRetryMax and Open() behavior). Keep conn.config.P2pRetryMaxSeconds set
from the raw d (so config reflects the updated value), but call
iceBackoff.SetMaxBackoff with the normalized cap when iceBackoff != nil.
client/internal/peer/worker_ice.go-103-117 (1)

103-117: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear agentConnecting on explicit shutdown, or this branch can wedge ICE forever.

With this new early-return, any WorkerICE.Close() during an in-flight dial can leave agentConnecting == true: Close() nils w.agent, and closeAgent() only clears the flag when w.agent == agent. After that, every future offer hits this branch and gets dropped permanently. The new Conn.DetachICE() / Conn.onNetworkChange() paths make that reachable.

Suggested fix
func (w *WorkerICE) Close() {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()

	if w.agent == nil {
		return
	}

	w.agentDialerCancel()
	if err := w.agent.Close(); err != nil {
		w.log.Warnf("failed to close ICE agent: %s", err)
	}

	w.agent = nil
+	w.agentConnecting = false
+	w.remoteSessionChanged = false
+	w.remoteSessionID = ""
+	w.lastKnownState = ice.ConnectionStateDisconnected
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 103 - 117, The early-return
when w.agentConnecting is true can permanently drop future offers after
WorkerICE.Close() clears w.agent but leaves w.agentConnecting set; update the
shutdown flow to clear the flag on explicit teardown: in WorkerICE.Close() (or
in closeAgent() when invoked for an explicit shutdown or when agent == nil) set
w.agentConnecting = false so the guard loop won't forever ignore incoming
offers; touch the Close(), closeAgent(), and any DetachICE/onNetworkChange paths
that nil w.agent to ensure they also clear agentConnecting to avoid wedging ICE.
shared/management/http/api/openapi.yml-379-406 (1)

379-406: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timeout default semantics look inconsistent with the phase objectives

The documented defaults for p2p_timeout_seconds (180m) and relay_timeout_seconds (5m) appear to be the opposite of the intended split-threshold behavior for this rollout. Please align these descriptions with the actual server defaults to avoid misconfiguration by API consumers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 379 - 406, The OpenAPI
descriptions for timeout defaults are inconsistent with server behavior; update
the description text for p2p_timeout_seconds and relay_timeout_seconds so they
match the actual server defaults (swap or correct the stated built-in defaults
as implemented on the server) and clarify which modes each applies to; locate
the YAML properties p2p_timeout_seconds and relay_timeout_seconds and change the
quoted built-in default values and any explanatory text so p2p_timeout_seconds
documents the server's actual default and relay_timeout_seconds documents its
actual default (also ensure nullable/null semantics remain correct and any
references to minutes vs seconds are accurate).
shared/management/http/api/openapi.yml-365-372 (1)

365-372: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

connection_mode schema is missing p2p-dynamic-lazy and still documents obsolete Phase-1 behavior

This enum/docs block still describes p2p-dynamic as non-functional pass-through and does not include p2p-dynamic-lazy. That can cause client-side validation failures and incorrect API contract generation.

Suggested OpenAPI fix
         connection_mode:
           x-experimental: true
           type: string
-          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
           nullable: true
           description: |
             Account-wide default peer-connection mode. NULL means
             "fall back to lazy_connection_enabled" for backwards compatibility.
-            Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-            functional. p2p-dynamic is reserved (passes through as p2p in
-            Phase 1; will become functional in Phase 2).
+            Supports relay-forced, p2p, p2p-lazy, p2p-dynamic, and
+            p2p-dynamic-lazy as defined by issue `#5989` rollout.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 365 - 372, The openapi
enum for the connection_mode schema is out of date and missing the
p2p-dynamic-lazy value and still documents obsolete "Phase 1" behavior; update
the enum in the connection_mode schema to include "p2p-dynamic-lazy" (so enum:
[relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]) and revise the
description text to remove the Phase 1/pass-through wording and instead
accurately describe the current semantics of p2p-dynamic and p2p-dynamic-lazy
(and how NULL falls back to lazy_connection_enabled) so client validation and
generated contracts match the implementation.
client/internal/peer/status.go-300-306 (1)

300-306: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Capture the connection-state listener before unlocking.

All of the new updatePeer*Locked paths call this helper after d.mux.Unlock(), but SetConnStateListener mutates d.connStateListener under the same mutex. Reading it here without synchronization introduces a real race on listener register/unregister.

Suggested pattern
- d.mux.Unlock()
-
- if hasConnStatusChanged(oldState, receivedState.ConnStatus) {
- 	return d.notifyConnStateChange(receivedState.PubKey, peerState), nil
- }
- return func() {}, nil
+ notifyConn := func() {}
+ if hasConnStatusChanged(oldState, receivedState.ConnStatus) {
+ 	notifyConn = d.notifyConnStateChange(receivedState.PubKey, peerState)
+ }
+ d.mux.Unlock()
+ return notifyConn, nil

Apply the same capture-before-unlock pattern to the other updatePeer*Locked variants.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 300 - 306, The
notifyConnStateChange function currently reads d.connStateListener without
synchronization, creating a race with SetConnStateListener; modify the
updatePeer*Locked callers to capture the listener while the mutex is held (e.g.,
listener := d.connStateListener) and pass that captured listener into
notifyConnStateChange (or change notifyConnStateChange signature to accept a
listener parameter) so that the listener is read under the lock and the returned
callback can be safely invoked after d.mux.Unlock(); update all
updatePeer*Locked variants to follow this capture-before-unlock pattern.
client/internal/peer/status_debounce_test.go-17-149 (1)

17-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix the errcheck lint failures (CI lint job is failing).

UpdatePeerICEState, UpdatePeerRemoteMeta, and UpdatePeerState all return errors that the tests drop on Lines 27, 37, 53, 80, 88, 129, 135. The lint pipeline is currently failing on every one of them. Best to surface real errors with require.NoError/assert.NoError so a regression in the recorder fails the test rather than silently passes.

♻️ Proposed fix (apply to all 7 sites)
-	rec.UpdatePeerICEState(State{
+	require.NoError(t, rec.UpdatePeerICEState(State{
 		PubKey:                     key,
 		ConnStatus:                 StatusConnected,
 		// ...
-	})
+	}))

And add "github.com/stretchr/testify/require" to imports. Same treatment for the remaining UpdatePeerRemoteMeta / UpdatePeerState call sites at Lines 37, 53, 80, 88, 129, 135.

As per static analysis / pipeline failures: golangci-lint errcheck is failing on Lines 27, 37, 53, 80, 88, 129, 135.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status_debounce_test.go` around lines 17 - 149, The
tests drop errors from recorder methods causing errcheck failures: replace the
bare calls to UpdatePeerICEState, UpdatePeerRemoteMeta, and UpdatePeerState in
TestStatus_DuringOfflineDebounce_LocalConnStateUnchanged,
TestStatus_AfterDebouncedClose_StatusReflectsLocalIdle,
TestStatus_DeriveExtended_DuringLivenessFlap (where used), and
TestStatus_GetFullStatus_PreservesEffectiveAndBackoffFields with assertions that
surface errors (e.g., require.NoError or assert.NoError) and add the
"github.com/stretchr/testify/require" import; specifically wrap each call to
UpdatePeerICEState, UpdatePeerRemoteMeta, and UpdatePeerState with
require.NoError(t, err) (or assert.NoError) so test failures propagate and
errcheck is satisfied.
client/ui/client_ui.go-501-510 (1)

501-510: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add p2p-dynamic-lazy everywhere this UI enumerates connection modes.

The selector, refresh logic, enablement switch, and load path only know about four modes. If a profile already has p2p-dynamic-lazy, the settings window will show Follow server, and saving any unrelated change will clear that override because buildSetConfigRequest() always sends the selected mode back.

♻️ Suggested update
-		[]string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic"},
+		[]string{"Follow server", "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy"},
 	s.sConnectionMode.Options = []string{
 		s.followServerLabel(),
 		"relay-forced",
 		"p2p",
 		"p2p-lazy",
 		"p2p-dynamic",
+		"p2p-dynamic-lazy",
 	}
-	case "p2p-dynamic":
+	case "p2p-dynamic", "p2p-dynamic-lazy":
 		s.iRelayTimeout.Enable()
 		s.iP2pTimeout.Enable()
 		s.iP2pRetryMax.Enable()
-		case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic":
+		case "relay-forced", "p2p", "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy":
 			s.sConnectionMode.SetSelected(cfg.ConnectionMode)

Also applies to: 741-745, 855-866, 889-902, 1573-1598

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/client_ui.go` around lines 501 - 510, The connection-mode selector
and related logic miss the "p2p-dynamic-lazy" option so profiles with that mode
get mapped to "Follow server" and overwritten; update the options slice passed
to s.sConnectionMode = widget.NewSelect (and any other NewSelect usages for
connection modes) to include "p2p-dynamic-lazy", adjust any switch/if logic in
updateTimeoutEntriesEnabled and the refresh/load code that checks modes to
handle "p2p-dynamic-lazy" the same as "p2p-dynamic" (or the intended behavior),
and ensure buildSetConfigRequest (and the save/load paths referenced around the
other occurrences) preserves and emits "p2p-dynamic-lazy" when selected so the
profile value isn't lost.
client/internal/conn_state_pusher.go-317-323 (1)

317-323: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A successful full snapshot should replace lastPushed, not merge into it.

flushFull() currently calls markPushed(), which only overwrites keys present in the snapshot. After a full snapshot that omits peer B, management will correctly drop B, but lastPushed will still retain it locally. If B later reappears with the same state, computeDeltaFromSource() suppresses the delta because it compares against that stale cached entry, so management never gets B back until another full snapshot.

♻️ Suggested update
 func (p *connStatePusher) markPushed(events []PeerStateChangeEvent) {
 	p.mu.Lock()
 	for _, ev := range events {
 		p.lastPushed[ev.Pubkey] = ev
 	}
 	p.mu.Unlock()
 }
+
+func (p *connStatePusher) replacePushed(events []PeerStateChangeEvent) {
+	next := make(map[string]PeerStateChangeEvent, len(events))
+	for _, ev := range events {
+		next[ev.Pubkey] = ev
+	}
+	p.mu.Lock()
+	p.lastPushed = next
+	p.mu.Unlock()
+}
-	p.markPushed(events)
+	p.replacePushed(events)

Also applies to: 346-379, 382-396

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 317 - 323, The bug:
markPushed currently merges events into p.lastPushed which leaves stale entries
after a full snapshot; when flushFull() calls markPushed it should replace the
entire cache so omitted peers are removed. Fix by changing the behavior used by
full snapshots: add or modify a function (e.g., markPushedFull or extend
markPushed with a flag) so that when called from flushFull() it acquires p.mu,
replaces p.lastPushed with a freshly allocated map populated only with the
provided events (instead of writing/merging into the existing map), and then
unlocks; keep the existing merge behavior for incremental updates used elsewhere
(so computeDeltaFromSource compares against the correct state).
client/ui/client_ui.go-638-640 (1)

638-640: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't silently turn invalid timeout text into 0.

parseUint32Field() collapses every parse error to 0, and buildSetConfigRequest() uses that value directly. A typo like 30s or abc will clear the override instead of surfacing a validation error, and hasConnectionModeChanges() can even treat the bad input as “no change” when the stored value is already zero.

🛡️ Suggested direction
-func parseUint32Field(text string) uint32 {
+func parseUint32Field(text string) (uint32, error) {
 	t := strings.TrimSpace(text)
 	if t == "" {
-		return 0
+		return 0, nil
 	}
 	v, err := strconv.ParseUint(t, 10, 32)
 	if err != nil {
-		return 0
+		return 0, fmt.Errorf("must be an unsigned integer")
 	}
-	return uint32(v)
+	return uint32(v), nil
 }

Then validate the three timeout entries during save and only build the request when all non-empty values parse successfully.

Also applies to: 656-666, 745-750

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/client_ui.go` around lines 638 - 640, parseUint32Field currently
converts any parse error to 0 which lets typos like "30s" silently clear
overrides and makes hasConnectionModeChanges/buildSetConfigRequest behave
incorrectly; update the save/validation flow to explicitly parse
iRelayTimeout.Text, iP2pTimeout.Text, and iP2pRetryMax.Text using a parsing
function that returns (value, error), validate each non-empty field before
building the SetConfigRequest, surface validation errors to the user instead of
substituting 0, and only call buildSetConfigRequest (and compare against
relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs) when all parses succeeded so
invalid input does not get treated as a valid zero override.
client/internal/conn_state_pusher.go-153-164 (1)

153-164: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the newest refresh nonce when the queue is full.

The comment says snapshot requests are coalesced to “the latest nonce”, but this implementation drops the newest nonce once the 4-slot buffer fills. That means POST /connections/refresh can hand back a token that will never be echoed by the next full snapshot, so the caller waits against a nonce the pusher already discarded.

♻️ Suggested update
 func (p *connStatePusher) OnSnapshotRequest(nonce uint64) {
 	if p == nil {
 		return
 	}
 	select {
 	case p.snapshotReq <- nonce:
 	default:
+		select {
+		case <-p.snapshotReq:
+		default:
+		}
+		select {
+		case p.snapshotReq <- nonce:
+		default:
+		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 153 - 164,
OnSnapshotRequest currently drops the newest nonce when p.snapshotReq is full;
change it to retain the newest nonce by removing one oldest entry and then
enqueueing the incoming nonce. Locate connStatePusher.OnSnapshotRequest and
update the default branch to non-blockingly discard a single value from
p.snapshotReq (e.g., a receive with select) and then attempt to send the new
nonce into p.snapshotReq (retrying non-blockingly if needed) so the channel
always ends up containing the most recent nonce rather than silently dropping
it.
client/internal/engine.go-2759-2803 (1)

2759-2803: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Serialize the debounced close with syncMsgMux.

This timer callback mutates peer lifecycle state (statusRecorder, peerStore, connMgr, conn.Close) from its own goroutine, while the rest of connection management is serialized under e.syncMsgMux. That opens a race with handleSync, signal processing, and removePeer, so the debounce can close a conn mid-reconfigure or after a concurrent teardown has already started.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine.go` around lines 2759 - 2803, The timer callback in
Engine.scheduleRemoteOfflineClose mutates peer lifecycle state from its own
goroutine and must be serialized with the existing e.syncMsgMux; modify the
anonymous func passed to time.AfterFunc so it acquires e.syncMsgMux.Lock() at
the start (and defers Unlock()) before performing the re-validation and any
calls into e.statusRecorder, e.peerStore, e.connMgr or conn.Close; this ensures
the debounce handler runs under the same synchronization used by
handleSync/removePeer and prevents races while keeping the existing early-return
checks intact.
management/internals/shared/grpc/conversion.go-315-326 (1)

315-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only mark liveness as authoritative when a status record exists.

Lines 315-321 correctly nil-check rPeer.Status, but Line 326 still sets ServerLivenessKnown = true even when no status data was available. In that case clients will treat live_online=false as a real offline signal instead of “unknown”.

Suggested fix
 		if rPeer.Status != nil {
 			if !rPeer.Status.LastSeen.IsZero() {
 				cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
 			}
 			cfg.LiveOnline = rPeer.Status.Connected
+			cfg.ServerLivenessKnown = true
 		}
-		// New servers always know per-peer liveness; signal that to new
-		// clients so they can trust LiveOnline directly instead of
-		// guessing from the LastSeenAtServer-zero heuristic. Old servers
-		// leave this field at default (false) and clients fall back.
-		cfg.ServerLivenessKnown = true
 		dst = append(dst, cfg)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/conversion.go` around lines 315 - 326, The
code wrongly sets cfg.ServerLivenessKnown = true unconditionally; change it so
ServerLivenessKnown is only true when a status record exists by guarding the
assignment with the same nil-check used above (i.e., set cfg.ServerLivenessKnown
= true inside the if rPeer.Status != nil block or assign cfg.ServerLivenessKnown
= (rPeer.Status != nil)), keeping existing handling of LastSeen and LiveOnline
in the rPeer.Status branch.
management/internals/shared/grpc/server.go-487-490 (1)

487-490: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disable the snapshot case after channel closure.

On Line 489, continue leaves a closed snapshotCh permanently selectable, so this loop can spin hot and starve normal updates once the router closes that registration. Nil the channel or return after ok == false.

Suggested fix
 		case nonce, ok := <-snapshotCh:
 			if !ok {
-				continue
+				snapshotCh = nil
+				continue
 			}
 			snapMsg := &network_map.UpdateMessage{
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/server.go` around lines 487 - 490, The
snapshot case leaves a closed snapshotCh selectable and can cause a hot spin;
when the receive yields ok == false, disable that case by setting snapshotCh =
nil (or return from the surrounding goroutine) instead of continuing. Locate the
receive case handling "nonce, ok := <-snapshotCh" and replace the "if !ok {
continue }" with "snapshotCh = nil" (or an early return) so the select no longer
considers the closed channel.
client/internal/conn_mgr.go-292-341 (1)

292-341: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart or reconfigure the lazy manager when only timeouts change.

This branch updates relayTimeoutSecs / p2pTimeoutSecs, but if the mode stays p2p-lazy or p2p-dynamic, the already-running lazyConnMgr keeps its old inactivity thresholds. A server-pushed timeout change therefore has no effect until a later mode flip or daemon restart.

Suggested direction
-	if modeChanged && wasManaged && isManaged {
+	timeoutChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
+
+	if (modeChanged || timeoutChanged) && wasManaged && isManaged {
 		// Switching between lazy and dynamic at runtime: tear down the
-		// existing manager so initLazyManager picks up the new timeouts.
-		log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
+		// existing manager so initLazyManager picks up the new timeouts.
+		log.Infof("lazy/dynamic settings changed, restarting manager (mode=%s->%s)", prev, newMode)
 		e.closeManager(ctx)
 		e.statusRecorder.UpdateLazyConnection(false)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 292 - 341, The code updates
relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs but doesn't restart or
reconfigure an already-running lazyConnMgr when mode stays managed, so timeout
changes don't take effect; modify the branch after you've set
e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs to detect when isManaged
&& e.lazyConnMgr != nil && mode didn't change but any of those timeout fields
changed, then call e.closeManager(ctx) followed by e.initLazyManager(ctx) and
e.startModeSideEffects() (and optionally e.resetPeersToLazyIdle(ctx) if you want
immediate idle semantics), ensuring the manager picks up the new thresholds; use
the existing functions closeManager, initLazyManager, startModeSideEffects and
reference the fields relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs and
lazyConnMgr.
🟡 Minor comments (8)
docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md-44-60 (1)

44-60: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resource leak on error path.

If any operation fails after net.ListenUDP succeeds (line 44), the udpConn socket will leak because it's not closed before returning the error. Add cleanup for partial failures.

🔧 Proposed fix with cleanup
 	udpConn, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 0})
 	if err != nil {
 		return nil, fmt.Errorf("listen udp for ICE: %w", err)
 	}
-	log.Infof("ICE using dedicated UDP port: %d (WireGuard kernel owns port %d)", udpConn.LocalAddr().(*net.UDPAddr).Port, t.wgPort)
+	udpAddr, ok := udpConn.LocalAddr().(*net.UDPAddr)
+	if !ok {
+		udpConn.Close()
+		return nil, fmt.Errorf("unexpected address type: %T", udpConn.LocalAddr())
+	}
+	log.Infof("ICE using dedicated UDP port: %d (WireGuard kernel owns port %d)", udpAddr.Port, t.wgPort)
 
 	bindParams := udpmux.UniversalUDPMuxParams{
 		UDPConn:   nbnet.WrapPacketConn(udpConn),

Note: Also verify that there's a cleanup path (likely in a Close() or Down() method) that closes t.udpMuxConn when the device is torn down.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` around lines
44 - 60, The code opens udpConn via net.ListenUDP and assigns it to t.udpMuxConn
only after creating the udpmux and starting mux.ReadFromConn, which leaks
udpConn if any subsequent operation fails; update the function to defer a
cleanup that closes udpConn on error (e.g., capture success with a boolean or
set t.udpMuxConn only after all steps succeed and call udpConn.Close() on early
returns), ensure you stop the started goroutine or only start mux.ReadFromConn
after assigning t.udpMux and t.udpMuxConn successfully, and verify the device
Close()/Down() path closes t.udpMuxConn and stops the mux to avoid
goroutine/socket leaks (referencing udpConn, bindParams,
udpmux.NewUniversalUDPMuxDefault, mux.ReadFromConn, t.udpMuxConn, t.udpMux).
docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md-48-48 (1)

48-48: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Type assertion could panic.

The type assertion udpConn.LocalAddr().(*net.UDPAddr).Port will panic if LocalAddr() returns a different address type. While unlikely for a UDP connection, defensive code should check the type assertion result.

🛡️ Proposed fix with safe type assertion
-	log.Infof("ICE using dedicated UDP port: %d (WireGuard kernel owns port %d)", udpConn.LocalAddr().(*net.UDPAddr).Port, t.wgPort)
+	udpAddr, ok := udpConn.LocalAddr().(*net.UDPAddr)
+	if !ok {
+		udpConn.Close()
+		return nil, fmt.Errorf("unexpected address type: %T", udpConn.LocalAddr())
+	}
+	log.Infof("ICE using dedicated UDP port: %d (WireGuard kernel owns port %d)", udpAddr.Port, t.wgPort)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` at line 48,
The log line uses a direct type assertion
udpConn.LocalAddr().(*net.UDPAddr).Port which can panic; update the code around
the log.Infof call to perform a safe type assertion (addr, ok :=
udpConn.LocalAddr().(*net.UDPAddr)) and use addr.Port when ok, otherwise fall
back to printing the address via udpConn.LocalAddr().String() or a placeholder;
ensure you still include t.wgPort in the message and preserve the original
log.Infof call semantics (use the safe value or fallback in the formatted
string).
client/internal/stdnet/filter_test.go-51-65 (1)

51-65: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This loop does not assert most of the want == true cases.

Right now the test only checks negative cases, plus one Windows-only special case for vEthernet (LAN). If Ethernet USB, OpenVPN 1, WiFi, or vEthernet (External) start returning false, this test still passes. Since this is meant to pin Windows behavior, I'd either skip it on non-Windows platforms or assert got == c.want for every case.

Suggested fix
 func TestInterfaceFilter_Windows_TargetedFiltering(t *testing.T) {
+	if runtime.GOOS != "windows" {
+		t.Skip("Windows-specific regression test")
+	}
+
 	disallow := []string{"wt", "wg", "veth", "br-", "lo", "docker"}
 	allow := InterfaceFilter(disallow)
@@
 	for _, c := range cases {
 		// The wgctrl branch can override on hosts where NetBird is
 		// running; tests run on a host where these names are not
 		// real interfaces, so the final return faithfully reflects
 		// the disallow-list logic.
 		got := allow(c.name)
-		// "veth*" prefix only filters on non-Windows; on Linux test
-		// runners "vEthernet (LAN)" still passes because of mixed
-		// case + the !Windows branch keeping the prefix match.
-		if !c.want && got {
-			t.Errorf("InterfaceFilter(%q) = true, want false (should be filtered)", c.name)
-		}
-		if c.want && !got && runtime.GOOS == "windows" && c.name == "vEthernet (LAN)" {
-			t.Fatalf("InterfaceFilter(%q) = false, want true on Windows (this is uray-mic-d4's default-route interface)", c.name)
+		if got != c.want {
+			t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want)
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter_test.go` around lines 51 - 65, The test loop in
filter_test.go only asserts negative cases and one Windows special-case, so it
can miss regressions where entries that should be allowed (c.want == true) start
returning false; update the loop that iterates over cases and the call to
allow(c.name) to assert equality for every case (i.e., if got != c.want {
t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want) }) instead of
the two separate conditionals, keeping the call to allow and the cases slice
as-is (or alternatively skip the entire test when runtime.GOOS != "windows" if
the intention is to only pin Windows behavior).
management/server/types/settings_ptr_equal_test.go-71-72 (1)

71-72: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename new to avoid the predeclared-identifier lint failure.

Line 71 shadows Go's predeclared new function. The predeclared linter is enabled in .golangci.yaml and will flag this violation. Rename the variable to updated or next to clear the check without changing test logic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/types/settings_ptr_equal_test.go` around lines 71 - 72, The
test creates a variable named new which shadows Go's predeclared new; rename
that variable (e.g., to updated or next) wherever it is assigned from old.Copy()
and used to set ConnectionMode = &dynamic so the logic (old.Copy(),
updated.ConnectionMode = &dynamic, and any subsequent assertions referencing the
variable) remains unchanged and the predeclared-identifier lint error is
resolved.
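The shadowing issue is easy to reproduce in isolation. A minimal sketch (hypothetical Settings type and field names, not the repo's actual settings struct) showing why the rename keeps the predeclared new() usable:

```go
package main

import "fmt"

type Settings struct{ ConnectionMode string }

// copyWithMode clones s and overrides the mode on the copy. The copy is
// named `updated` — had it been named `new`, the predeclared new()
// below would be shadowed and fail to compile in this scope.
func copyWithMode(s *Settings, mode string) *Settings {
	updated := *s
	updated.ConnectionMode = mode
	_ = new(Settings) // predeclared new() remains usable
	return &updated
}

func main() {
	old := &Settings{ConnectionMode: "p2p-lazy"}
	upd := copyWithMode(old, "dynamic")
	fmt.Println(old.ConnectionMode, upd.ConnectionMode) // original is untouched
}
```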
client/internal/peer/conn_handover_order_test.go-100-115 (1)

100-115: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test will false-fail if onWGTimeoutRecover is referenced before CloseConn() (e.g., a nil-guard).

strings.Index returns the first occurrence. A common pattern is if c.onWGTimeoutRecover != nil { ... } placed before the close switch, with the actual invocation after it. In that case idxCb < idxClose and the test fails even though the invariant holds. Match the actual call site (e.g., search for "onWGTimeoutRecover(") and/or use strings.LastIndex to locate the invocation.

🛠️ Proposed fix
-	const cbField = "onWGTimeoutRecover"
-	if !strings.Contains(body, cbField) {
-		t.Fatalf("onWGDisconnected missing reference to %q — WG-timeout recovery is broken", cbField)
+	const cbField = "onWGTimeoutRecover"
+	const cbCall = "onWGTimeoutRecover("
+	if !strings.Contains(body, cbField) {
+		t.Fatalf("onWGDisconnected missing reference to %q — WG-timeout recovery is broken", cbField)
 	}
 	// The callback must be invoked AFTER the conn close switch (otherwise
 	// lazy mgr would be re-armed before the active workers are torn down).
 	idxClose := strings.Index(body, "workerRelay.CloseConn()")
-	idxCb := strings.Index(body, cbField)
+	idxCb := strings.LastIndex(body, cbCall)
 	if idxClose < 0 {
 		t.Fatalf("workerRelay.CloseConn() landmark missing")
 	}
+	if idxCb < 0 {
+		t.Fatalf("onWGDisconnected missing call site %q", cbCall)
+	}
 	if idxCb < idxClose {
 		t.Errorf("recover callback (idx %d) must come AFTER worker close (idx %d)", idxCb, idxClose)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_handover_order_test.go` around lines 100 - 115, The
test currently uses strings.Index to find the first occurrence of
"onWGTimeoutRecover" which can pick up nil-guards before the actual invocation
and cause false failures; update the check in the onWGDisconnected test to
locate the actual call site (e.g., search for the substring
"onWGTimeoutRecover(") or use strings.LastIndex to find the final occurrence,
then compare that index (idxCb) against the "workerRelay.CloseConn()" index
(idxClose) to ensure the invocation occurs after the close; keep the existing
error messages and landmarks (onWGDisconnected, onWGTimeoutRecover,
workerRelay.CloseConn) but change the search logic so the test matches the real
call site rather than the first reference.
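The Index-vs-LastIndex pitfall can be demonstrated standalone. A minimal sketch, with a hypothetical callAfterClose helper standing in for the test's landmark logic:

```go
package main

import (
	"fmt"
	"strings"
)

// callAfterClose mirrors the fixed test: locate the actual invocation
// (trailing "(") rather than the first mention of the field, then
// require it to come after the CloseConn landmark.
func callAfterClose(body string) bool {
	idxClose := strings.Index(body, "workerRelay.CloseConn()")
	idxCb := strings.LastIndex(body, "onWGTimeoutRecover(")
	return idxClose >= 0 && idxCb > idxClose
}

func main() {
	// A nil-guard mentions the field before the close; the real call comes after.
	body := "if c.onWGTimeoutRecover != nil { /* guard */ }\n" +
		"c.workerRelay.CloseConn()\n" +
		"c.onWGTimeoutRecover()\n"

	naive := strings.Index(body, "onWGTimeoutRecover") // hits the guard, not the call
	fmt.Println(naive < strings.Index(body, "workerRelay.CloseConn()")) // true: Index alone would false-fail
	fmt.Println(callAfterClose(body))                                   // true: matching the call site passes
}
```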
management/server/http/handlers/peer_connections/handler_test.go-40-46 (1)

40-46: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Enforce account scoping in fake GetPeerByPubKey.

This fake ignores accountID, so tests can pass even if cross-account key lookups are accidentally allowed in handler logic.

Suggested hardening
-func (a *fakeAM) GetPeerByPubKey(_ context.Context, _, pubKey string) (*nbpeer.Peer, error) {
+func (a *fakeAM) GetPeerByPubKey(_ context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
 	p, ok := a.peersByKey[pubKey]
 	if !ok {
 		return nil, errors.New("not found")
 	}
+	if a.allowedAcc != "" && accountID != a.allowedAcc {
+		return nil, errors.New("not found")
+	}
+	if p.AccountID != "" && p.AccountID != accountID {
+		return nil, errors.New("not found")
+	}
 	return p, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/peer_connections/handler_test.go` around
lines 40 - 46, The fake GetPeerByPubKey implementation ignores the accountID
parameter and looks up peers solely by pubKey (function: GetPeerByPubKey in
fakeAM, map: peersByKey), allowing cross-account lookups in tests; update the
fake to respect account scoping by using the accountID parameter (do not use the
blank identifier), and change the lookup to be account-scoped — e.g., maintain
peersByAccount map[string]map[string]*nbpeer.Peer or use a composite key
including accountID when accessing peersByKey — and return "not found" when the
pubKey is not present for that account.
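One way to realize the account-scoped fake is the nested-map shape the prompt suggests. A minimal sketch (hypothetical Peer and fakeAM shapes, not the repo's actual types, and without the real mock's other methods):

```go
package main

import (
	"errors"
	"fmt"
)

type Peer struct{ AccountID, Key string }

// fakeAM sketches the hardened test double: lookups are keyed by
// (accountID, pubKey), so a pubKey that exists under another account
// cannot be resolved cross-account.
type fakeAM struct {
	peersByAccount map[string]map[string]*Peer
}

func (a *fakeAM) GetPeerByPubKey(accountID, pubKey string) (*Peer, error) {
	// Indexing a missing outer key yields a nil inner map, and reading
	// a nil map is safe in Go — both cases fall through to "not found".
	p, ok := a.peersByAccount[accountID][pubKey]
	if !ok {
		return nil, errors.New("not found")
	}
	return p, nil
}

func main() {
	am := &fakeAM{peersByAccount: map[string]map[string]*Peer{
		"acc1": {"pk1": {AccountID: "acc1", Key: "pk1"}},
	}}
	_, errSame := am.GetPeerByPubKey("acc1", "pk1")
	_, errCross := am.GetPeerByPubKey("acc2", "pk1")
	fmt.Println(errSame == nil, errCross != nil) // same-account hit, cross-account miss
}
```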
client/cmd/service_installer.go-173-211 (1)

173-211: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent nil-guarding on cmd.Flag(...).Changed.

Lines 175-180 carefully nil-check each flag (f != nil && f.Changed), but Lines 194-205 dereference cmd.Flag(name).Changed without that guard. If any of the four flags is ever left unregistered on installCmd/reconfigureCmd (e.g., after a refactor, or behind build-tag-gated registration), the second block panics. Memoize the result of the first scan and reuse it:

♻️ Proposed consolidation
 func applyConnectionModeFlagsToProfile(cmd *cobra.Command) error {
-	anyChanged := false
-	for _, name := range []string{connectionModeFlag, relayTimeoutFlag, p2pTimeoutFlag, p2pRetryMaxFlag} {
-		if f := cmd.Flag(name); f != nil && f.Changed {
-			anyChanged = true
-			break
-		}
-	}
-	if !anyChanged {
+	changed := func(name string) bool {
+		f := cmd.Flag(name)
+		return f != nil && f.Changed
+	}
+	if !changed(connectionModeFlag) && !changed(relayTimeoutFlag) &&
+		!changed(p2pTimeoutFlag) && !changed(p2pRetryMaxFlag) {
 		return nil
 	}

 	cfgPath := profilemanager.DefaultConfigPath
 	if configPath != "" {
 		cfgPath = configPath
 	}
 	if cfgPath == "" {
 		return fmt.Errorf("default config path is not set on this platform; pass --config")
 	}

 	ic := profilemanager.ConfigInput{ConfigPath: cfgPath}
-	if cmd.Flag(connectionModeFlag).Changed {
+	if changed(connectionModeFlag) {
 		ic.ConnectionMode = &connectionMode
 	}
-	if cmd.Flag(relayTimeoutFlag).Changed {
+	if changed(relayTimeoutFlag) {
 		ic.RelayTimeoutSeconds = &relayTimeoutSecs
 	}
-	if cmd.Flag(p2pTimeoutFlag).Changed {
+	if changed(p2pTimeoutFlag) {
 		ic.P2pTimeoutSeconds = &p2pTimeoutSecs
 	}
-	if cmd.Flag(p2pRetryMaxFlag).Changed {
+	if changed(p2pRetryMaxFlag) {
 		ic.P2pRetryMaxSeconds = &p2pRetryMaxSecs
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/cmd/service_installer.go` around lines 173 - 211, In
applyConnectionModeFlagsToProfile, avoid dereferencing cmd.Flag(...).Changed
twice without nil-guards: during the initial scan memoize each flag pointer
(e.g., map[string]*pflag.Flag or local variables for connectionModeFlag,
relayTimeoutFlag, p2pTimeoutFlag, p2pRetryMaxFlag) and its .Changed state, then
reuse those memoized pointers when building the profilemanager.ConfigInput (set
ic.ConnectionMode, ic.RelayTimeoutSeconds, ic.P2pTimeoutSeconds,
ic.P2pRetryMaxSeconds only if the corresponding memoized flag pointer is non-nil
and .Changed). This preserves the original anyChanged behavior, prevents nil
panics if a flag is not registered, and keeps references to the exact flags used
when writing the config.
management/server/http/handlers/accounts/accounts_handler_test.go-408-426 (1)

408-426: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Indentation in expectedSettings is over-tabbed and will fail gofmt.

Lines 418-419 carry the indentation level of a deeply nested struct literal but appear inside a top-level expectedSettings declaration, so they're indented several tabs further than the surrounding fields. Run gofmt -w on the file:

♻️ Proposed fix
 	expectedSettings := api.AccountSettings{
 		PeerLoginExpiration:             3600,
 		// ...
 		LazyConnectionEnabled:           br(false),
-				LegacyLazyFallbackEnabled:      br(true),
-				LegacyLazyFallbackTimeoutSeconds: ir(3600),
+		LegacyLazyFallbackEnabled:        br(true),
+		LegacyLazyFallbackTimeoutSeconds: ir(3600),
 		DnsDomain:                       sr(""),

While at it, the LegacyLazyFallbackEnabled: br(true), lines added throughout the table at Lines 124-125, 152-153, 180-181, 208-209, 236-237, and 264-265 also have one fewer alignment space than the surrounding fields; gofmt will normalize them but it's worth a one-shot pass to keep the diff tidy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler_test.go` around
lines 408 - 426, The struct literal assigned to expectedSettings has
incorrect/over-tabbed indentation (notably the LegacyLazyFallbackEnabled and
LegacyLazyFallbackTimeoutSeconds entries) which will fail gofmt; open the
accounts_handler_test.go, fix the indentation of the expectedSettings block so
all fields align consistently with the surrounding fields (including the other
occurrences of LegacyLazyFallbackEnabled in the file), then run gofmt -w to
normalize spacing across the file.

Comment thread client/internal/peer/env_test.go Outdated
Comment thread client/internal/peer/status.go
Comment on lines +218 to +230
// IsConnected returns true when pion's ICE agent reports Connected and
// has not yet transitioned to Disconnected/Failed/Closed. Used by
// Conn.onNetworkChange (Phase 3.7g of #5989) to skip a needless
// workerICE.Close when an srReconnect/network-change event arrives but
// the existing P2P session is still alive end-to-end (typical for a
// brief signal-server outage while peer-to-peer UDP keeps flowing).
// Closing the agent in that case forces a 15-25 s renegotiation cycle
// and a Relay→ICE handover gap that the user would observe as a ping
// dropout, even though no real peer-to-peer connectivity loss occurred.
func (w *WorkerICE) IsConnected() bool {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()
	return w.agent != nil && w.lastKnownState == ice.ConnectionStateConnected

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Synchronize lastKnownState consistently before using it as a liveness signal.

IsConnected() reads lastKnownState under muxAgent, but onConnectionStateChange() writes that field without the same lock. Now that Conn.onNetworkChange() uses IsConnected() to decide whether to tear down the agent, this is a real cross-goroutine data race and can produce stale keep/close decisions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 218 - 230, The race occurs
because IsConnected() reads lastKnownState under w.muxAgent but
onConnectionStateChange() updates lastKnownState without that lock; fix by
synchronizing updates to lastKnownState with the same mutex (w.muxAgent) used in
IsConnected(), i.e., acquire w.muxAgent.Lock()/Unlock() (or use RLock/RUnlock
appropriately) in onConnectionStateChange() when writing lastKnownState and when
touching w.agent, so both WorkerICE.onConnectionStateChange and
WorkerICE.IsConnected use the same protection for lastKnownState and agent.

Comment thread management/server/http/handlers/accounts/accounts_handler.go
Comment on lines +22 to +24
func (r *SnapshotRouter) Register(peerPubKey string) <-chan uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard the public methods against a nil router.

The pipeline is already hitting this: Register panics when SnapshotRouter has not yet been injected on a Sync path. Unregister and Request dereference the same receiver and will fail the same way.

Suggested fix
 func (r *SnapshotRouter) Register(peerPubKey string) <-chan uint64 {
+	if r == nil {
+		return nil
+	}
 	r.mu.Lock()
 	defer r.mu.Unlock()
 	ch := make(chan uint64, 4)
@@
 func (r *SnapshotRouter) Unregister(peerPubKey string, token <-chan uint64) {
+	if r == nil {
+		return
+	}
 	r.mu.Lock()
 	defer r.mu.Unlock()
@@
 func (r *SnapshotRouter) Request(peerPubKey string, nonce uint64) bool {
+	if r == nil {
+		return false
+	}
 	r.mu.Lock()
 	defer r.mu.Unlock()

Also applies to: 41-44, 60-63

🧰 Tools
🪛 GitHub Actions: Darwin

[error] 23-23: panic: runtime error: invalid memory address or nil pointer dereference in SnapshotRouter.Register during Sync handling.

🪛 GitHub Actions: FreeBSD

[error] 23-23: Runtime panic: nil pointer dereference in SnapshotRouter.Register (receiver is nil)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer_connections/snapshot_router.go` around lines 22 - 24,
Guard Register, Unregister and Request on a nil SnapshotRouter by checking "if r
== nil" at the top of each method before touching r.mu or other fields; for
Register return a closed channel (so callers won't block) when r is nil, and for
Unregister and Request make them no-ops (just return) when r is nil. Update the
methods named Register, Unregister and Request on type SnapshotRouter to perform
this nil-receiver check before any dereference to avoid panics.
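Go permits calling a method on a nil pointer receiver as long as the method never dereferences it, which is what makes the suggested guard work. A minimal sketch (simplified SnapshotRouter without the real type's mutex or Unregister/Request; single-goroutine demo only):

```go
package main

import "fmt"

type SnapshotRouter struct {
	channels map[string]chan uint64
}

// Register tolerates a nil receiver: the guard turns the "router was
// not injected yet" case into a harmless nil channel instead of a
// runtime panic during Sync handling.
func (r *SnapshotRouter) Register(peerPubKey string) <-chan uint64 {
	if r == nil {
		return nil
	}
	if r.channels == nil {
		r.channels = make(map[string]chan uint64)
	}
	ch := make(chan uint64, 4)
	r.channels[peerPubKey] = ch
	return ch
}

func main() {
	var r *SnapshotRouter // simulates the un-injected Sync path
	fmt.Println(r.Register("peerA") == nil) // true: no panic, nil channel

	r2 := &SnapshotRouter{}
	fmt.Println(r2.Register("peerA") != nil) // true: real channel
}
```

Note the trade-off the AI prompt mentions: returning nil means a caller that receives from the channel blocks forever, so a closed channel is the safer sentinel when receivers cannot be audited.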

@MichaelUray MichaelUray changed the title [4/4 stacked] Phase 3.7i hardening + activity-trigger fast-path + Codex-review fixes (closes #5989) [client, management] Phase 3.7i hardening + activity-trigger fast-path + Codex-review fixes — closes #5989 (stack 4/4) May 6, 2026
@MichaelUray MichaelUray force-pushed the pr/d-hardening-activity-trigger branch from 86fab00 to 82758b2 Compare May 6, 2026 06:24
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 19

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
management/server/peer.go (1)

1297-1324: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor RegularUsersViewBlocked on the detail endpoint too.

GetPeers now hides everything for restrictable users when that setting is enabled, but GetPeer skips that gate and still returns the peer to the owner or any policy-reachable user who knows the ID. That bypasses the visibility restriction via the detail route. Mirror the same settings check here before the direct/policy branches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` around lines 1297 - 1324, GetPeer currently skips
the RegularUsersViewBlocked global setting and can leak peers via the detail
route; update GetPeer to mirror GetPeers by checking the RegularUsersViewBlocked
flag after resolving the user (or immediately after permissions validation) and,
if the user is a restrictable/regular user and RegularUsersViewBlocked is
enabled, return a permission-denied error instead of falling through to the
admin/owner or policy reachability branches. Locate GetPeer and add the setting
check (using the same predicate used by GetPeers), consult user via
Store.GetUserByUserID and user.IsRestrictable()/IsRegular() as appropriate, and
return a permission-denied status (use the project’s existing status error
helper) when the check blocks access before calling peer-owner or
checkIfUserOwnsPeer.
client/ui/network.go (1)

139-206: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap Fyne widget updates from background goroutines in fyne.Do or fyne.DoAndWait.

In startAutoRefresh, the ticker fires on a background goroutine and calls updateNetworksSilent → updateNetworksWithMode, which performs multiple widget mutations (grid.Objects = nil, grid.Add, Refresh) without dispatching to the UI goroutine. Fyne v2.7.0 requires all widget and container mutations from background goroutines to be wrapped in fyne.Do (fire-and-forget) or fyne.DoAndWait (block until applied) to avoid race conditions.

Affected locations: updateNetworksWithMode (lines 139–206) and its call site within startAutoRefresh (lines 375–385).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/network.go` around lines 139 - 206, The updateNetworksWithMode
function mutates Fyne widgets from a background goroutine (grid.Objects = nil,
grid.Add, setting widget properties, Resize/Refresh) which must be run on the UI
thread; wrap all UI/container mutations inside fyne.Do or fyne.DoAndWait to
dispatch to the UI goroutine and avoid races. Specifically, update
updateNetworksWithMode so that the body which clears grid, creates widgets
(checkboxes, labels, Selects), sets their properties, calls grid.Add and Refresh
are executed inside a single fyne.Do (or fyne.DoAndWait if the caller expects
synchronous completion), and adjust the call site in startAutoRefresh (where
updateNetworksSilent/updateNetworksWithMode is invoked from the ticker) to use
fyne.Do or call the updated synchronous variant accordingly. Ensure you only
move UI mutations inside the Do/DoAndWait block and keep non-UI work (filtering,
sorting, building data) outside to avoid blocking the UI.
♻️ Duplicate comments (4)
client/internal/peer/worker_ice.go (1)

227-230: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Protect lastKnownState with muxAgent on writes too.

IsConnected() now reads lastKnownState under muxAgent, but onConnectionStateChange() still writes that field without the same lock. With Conn.onNetworkChange() using IsConnected() to decide whether to tear down the agent, this becomes a real cross-goroutine data race and can flip the keep/close decision.

Suggested fix
 func (w *WorkerICE) onConnectionStateChange(agent *icemaker.ThreadSafeAgent, dialerCancel context.CancelFunc) func(ice.ConnectionState) {
 	return func(state ice.ConnectionState) {
 		w.log.Debugf("ICE ConnectionState has changed to %s", state.String())
 		switch state {
 		case ice.ConnectionStateConnected:
-			w.lastKnownState = ice.ConnectionStateConnected
+			w.muxAgent.Lock()
+			w.lastKnownState = ice.ConnectionStateConnected
+			w.muxAgent.Unlock()
 			w.logSuccessfulPaths(agent)
 			// Phase 3 of `#5989`: reset backoff on ICE success.
 			w.conn.onICEConnected()
 			return
 		case ice.ConnectionStateFailed, ice.ConnectionStateDisconnected, ice.ConnectionStateClosed:
@@
 			sessionChanged := w.closeAgent(agent, dialerCancel)
 
-			if w.lastKnownState == ice.ConnectionStateConnected {
-				w.lastKnownState = ice.ConnectionStateDisconnected
+			w.muxAgent.Lock()
+			wasConnected := w.lastKnownState == ice.ConnectionStateConnected
+			w.lastKnownState = ice.ConnectionStateDisconnected
+			w.muxAgent.Unlock()
+			if wasConnected {
 				w.conn.onICEStateDisconnected(sessionChanged)
 			}

Also applies to: 545-571

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 227 - 230, IsConnected reads
lastKnownState under muxAgent but onConnectionStateChange (and other writers
around the 545-571 range) update lastKnownState without that lock, causing a
race; modify onConnectionStateChange and any other locations that assign to
WorkerICE.lastKnownState to acquire w.muxAgent.Lock()/Unlock() (or use
RLock/RUnlock where appropriate) around writes so all reads/writes of
lastKnownState are protected by muxAgent, keeping the lock usage consistent with
WorkerICE.IsConnected and Conn.onNetworkChange checks.
client/android/preferences.go (1)

325-330: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear overrides by writing nil, not empty/zero pointers.

SetConnectionMode("") and the Set*TimeoutSeconds(0) setters still leave the corresponding ConfigInput fields non-nil. That means Commit() persists an explicit override instead of actually removing it, so callers cannot revert to “follow server” through this API.

Suggested fix
 func (p *Preferences) SetConnectionMode(mode string) {
+	if mode == "" {
+		p.configInput.ConnectionMode = nil
+		return
+	}
 	m := mode
 	p.configInput.ConnectionMode = &m
 }
@@
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
+	if v == 0 {
+		p.configInput.RelayTimeoutSeconds = nil
+		return
+	}
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
+	if v == 0 {
+		p.configInput.P2pTimeoutSeconds = nil
+		return
+	}
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
+	if v == 0 {
+		p.configInput.P2pRetryMaxSeconds = nil
+		return
+	}
 	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 347-355, 371-376, 391-396

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 325 - 330, SetConnectionMode and
the Set*TimeoutSeconds setters currently store empty/zero values as non-nil
pointers, causing Commit() to persist explicit overrides instead of removing
them; change each setter (e.g., SetConnectionMode and the timeout setters) to
write nil to the corresponding p.configInput field when the input is the "clear"
value (mode == "" or seconds == 0) and otherwise allocate and assign a pointer
to the provided value (keep using p.configInput.ConnectionMode and the timeout
fields to locate the spots to change).
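The clear-on-sentinel pattern looks like this in isolation. A minimal sketch (hypothetical ConfigInput/Preferences shapes mirroring the setter in question, without Commit or the timeout clamping):

```go
package main

import "fmt"

type ConfigInput struct {
	ConnectionMode *string
}

type Preferences struct{ configInput ConfigInput }

// SetConnectionMode treats "" as "remove the override": the pointer is
// set to nil so a later commit persists no explicit value and the
// client reverts to following the server-side setting.
func (p *Preferences) SetConnectionMode(mode string) {
	if mode == "" {
		p.configInput.ConnectionMode = nil
		return
	}
	m := mode
	p.configInput.ConnectionMode = &m
}

func main() {
	p := &Preferences{}
	p.SetConnectionMode("dynamic")
	fmt.Println(p.configInput.ConnectionMode != nil) // true: override set
	p.SetConnectionMode("")                          // revert to "follow server"
	fmt.Println(p.configInput.ConnectionMode == nil) // true: override cleared
}
```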
client/internal/peer/status.go (1)

682-692: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Move notifyPeerStateChangeListeners back under the mutex.

These paths still invoke notifyPeerStateChangeListeners after d.mux.Unlock(), but that helper snapshots d.peers/d.changeNotify and is documented as lock-only. This can race with concurrent writers and panic on map access.

Also applies to: 758-760

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 682 - 692, The call to
notifyPeerStateChangeListeners is executed after releasing the mutex (after
d.mux.Unlock()), but that function snapshots d.peers/d.changeNotify and must be
called while holding the lock; move the
notifyPeerStateChangeListeners(receivedState.PubKey) call so it runs before
d.mux.Unlock() in this block (and likewise for the other occurrence at the
758-760 region), keeping d.notifier.peerListChanged(numPeers) and
d.dispatchRouterPeers(...) semantics unchanged and only reordering to hold d.mux
during notifyPeerStateChangeListeners to avoid concurrent map access.
management/server/http/handlers/accounts/accounts_handler.go (1)

511-517: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

math.MaxUint32 still overflows the format arg on 32-bit builds.

This is the same build break that was already flagged earlier: the comparison uses int64(math.MaxUint32), but the bare math.MaxUint32 passed to fmt.Errorf still defaults to int and overflows on 386/arm32.

💡 Minimal fix
 	if v > int64(math.MaxUint32) {
-		return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, math.MaxUint32)
+		return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, int64(math.MaxUint32))
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
511 - 517, The error message in validateUint32Timeout still passes
math.MaxUint32 as a plain value which can overflow on 32-bit builds; update the
second fmt.Errorf call in validateUint32Timeout to cast the constant to a 64-bit
integer (e.g. int64(math.MaxUint32)) so the formatted argument matches the %d
verb and avoids overflow, ensuring the comparison int64(math.MaxUint32) already
used and the error message use the same casted value.
🟡 Minor comments (6)
client/internal/peer/env_test.go-33-34 (1)

33-34: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the existing constants from env.go instead of hardcoded strings.

Lines 33–34 use bare string literals "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and "NB_LAZY_CONN_INACTIVITY_THRESHOLD". The peer package already defines these as unexported constants (envEnableLazyConn and envInactivityThreshold at lines 21–22 of env.go). Using the constants ensures that if the production definition is renamed, the test will also be updated, preventing silent divergence.

♻️ Proposed fix
-			t.Setenv("NB_ENABLE_EXPERIMENTAL_LAZY_CONN", c.envEnableLazy)
-			t.Setenv("NB_LAZY_CONN_INACTIVITY_THRESHOLD", c.envInactivity)
+			t.Setenv(envEnableLazyConn, c.envEnableLazy)
+			t.Setenv(envInactivityThreshold, c.envInactivity)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/env_test.go` around lines 33 - 34, Replace the hardcoded
environment variable strings in the test with the package constants: use
envEnableLazyConn instead of "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and
envInactivityThreshold instead of "NB_LAZY_CONN_INACTIVITY_THRESHOLD" in the
t.Setenv calls in client/internal/peer/env_test.go so the test references the
same unexported constants defined in env.go (envEnableLazyConn,
envInactivityThreshold).
client/internal/stdnet/filter.go-51-54 (1)

51-54: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unconditional lo prefix can over-filter on Windows; gate it on GOOS.

The comment on line 51 says "Linux/macOS loopback prefix", but the check itself runs on every platform. With the new lowercase normalization on line 49, this also fires for any Windows interface name starting with Lo* (e.g. legacy "Local Area Connection"), causing false-positive filtering. It also makes the loopback pseudo-interface entry in windowsKnownBadSubstrings redundant.

Suggest gating it to non-Windows so Windows loopback is exclusively handled by windowsKnownBadSubstrings (which is the documented strategy on lines 11–22):

🛡️ Proposed fix
-		// Linux/macOS loopback prefix ("lo", "lo0").
-		if strings.HasPrefix(lowerIFace, "lo") {
-			return false
-		}
+		// Linux/macOS loopback prefix ("lo", "lo0"). Windows loopback is
+		// handled by windowsKnownBadSubstrings below to avoid false
+		// positives for interfaces like "Local Area Connection".
+		if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") {
+			return false
+		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter.go` around lines 51 - 54, The unconditional
check using strings.HasPrefix(lowerIFace, "lo") is running on all OSes and
mistakenly filters Windows interfaces; change it so the "lo" prefix check only
runs on non-Windows (e.g., wrap it behind a runtime.GOOS != "windows" guard or
equivalent) and remove the redundant Windows loopback entry reliance only on
windowsKnownBadSubstrings; update the conditional around the lowerIFace "lo"
prefix check (the code using lowerIFace and strings.HasPrefix) so Windows names
are not filtered here and remain handled by windowsKnownBadSubstrings.
client/iface/device/endpoint_manager.go-15-21 (1)

15-21: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

"Always non-nil on userspace binds" overpromises for an interface method.

The comment implies that callers holding a valid EndpointManager can skip nil-checking ActivityRecorder()'s return value, yet:

  • engine.go:651 defensively nil-checks the return (if rec := bind.ActivityRecorder(); rec != nil).
  • The test mock returns nil from this method.
  • Any future or custom EndpointManager implementation can legally return nil without violating the interface contract.

The documentation should say the return value may be nil, aligning with how every existing caller already handles it.

📝 Suggested comment correction
-	// ActivityRecorder exposes the per-bind ActivityRecorder so the
-	// engine can wire its OnActivity callback (Codex review 2026-05-05,
-	// fast-path Relay -> P2P upgrade trigger). Always non-nil on
-	// userspace binds. Kernel-mode WG returns nil from GetICEBind so
-	// callers MUST nil-check the EndpointManager itself before
-	// dereferencing.
+	// ActivityRecorder exposes the per-bind ActivityRecorder so the
+	// engine can wire its OnActivity callback (Codex review 2026-05-05,
+	// fast-path Relay -> P2P upgrade trigger). May return nil (e.g.,
+	// kernel-mode WG where GetICEBind returns nil, or test stubs).
+	// Callers MUST nil-check both the EndpointManager and the returned
+	// *ActivityRecorder before use.
	ActivityRecorder() *bind.ActivityRecorder
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/iface/device/endpoint_manager.go` around lines 15 - 21, Update the doc
comment for EndpointManager.ActivityRecorder to stop promising a non-nil result;
change the wording to explicitly state the returned *bind.ActivityRecorder may
be nil and callers must nil-check it before use (e.g., engine's current pattern
if rec := bind.ActivityRecorder(); rec != nil). Reference the ActivityRecorder()
method on the EndpointManager interface and mention that mocks and other
implementations are allowed to return nil so callers should not assume a non-nil
value.
client/internal/debouncer/debouncer.go-45-53 (1)

45-53: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stop does not fully cancel a pending fn — clear d.fn to honor the docstring.

If time.AfterFunc has already fired and the callback is blocked acquiring d.mu when Stop is called, d.timer.Stop() returns false (timer already triggered), Stop returns, and then the AfterFunc callback proceeds to read d.fn and invoke it. The docstring states "Stop cancels any pending fn." but this race lets a stale fn run after Stop returned. Clearing d.fn under the lock closes most of the window.

🛡️ Proposed fix
 func (d *Debouncer) Stop() {
 	d.mu.Lock()
 	defer d.mu.Unlock()
 	if d.timer != nil {
 		d.timer.Stop()
 		d.timer = nil
 	}
+	d.fn = nil
 }
🤖 Prompt for AI Agents

In `@client/internal/debouncer/debouncer.go` around lines 45 - 53, The Stop method
on Debouncer does not fully cancel a pending fn because if the timer callback is
already fired and waiting on d.mu, Stop only stops the timer and doesn't clear
d.fn; update Stop (method Debouncer.Stop) to acquire d.mu, set d.timer to nil
and also set d.fn = nil while holding the lock to prevent a later callback from
running the stale function, and ensure the timer's AfterFunc callback (the
anonymous function that grabs d.mu and reads d.fn) checks d.fn under the same
lock and returns without invoking if it's nil.
client/ui/peers_tab.go-61-71 (1)

61-71: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear stale peer data when refresh fails.

Both error paths only update summary and return. If the daemon becomes unavailable after a successful refresh, the tab keeps showing the previous breakdown and peer rows under an error banner, which makes stale state look current.

🧹 Suggested fix
 		conn, err := s.getSrvClient(failFastTimeout)
 		if err != nil {
-			fyne.Do(func() { summary.SetText("Error: " + err.Error()) })
+			fyne.Do(func() {
+				summary.SetText("Error: " + err.Error())
+				breakdown.SetText("")
+				listVBox.Objects = nil
+				listVBox.Refresh()
+			})
 			return
 		}
 		callCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 		defer cancel()
 		st, err := conn.Status(callCtx, &proto.StatusRequest{GetFullPeerStatus: true})
 		if err != nil {
-			fyne.Do(func() { summary.SetText("Error: " + err.Error()) })
+			fyne.Do(func() {
+				summary.SetText("Error: " + err.Error())
+				breakdown.SetText("")
+				listVBox.Objects = nil
+				listVBox.Refresh()
+			})
 			return
 		}
🤖 Prompt for AI Agents

In `@client/ui/peers_tab.go` around lines 61 - 71, When either getSrvClient(...)
fails or conn.Status(...) returns an error, you must clear stale peer UI state
instead of only updating summary: in both error paths wrap UI updates in fyne.Do
to set summary to the error text, clear breakdown (e.g. breakdown.SetText(""))
and remove/clear the peers rows/container (the peers table or peersContainer
holding rows) so previous peer entries aren't shown under the error banner;
apply this to the error branch after getSrvClient and the one after conn.Status
to ensure stale data is always wiped when refresh fails.
client/server/server.go-1523-1572 (1)

1523-1572: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't mix requested-profile config with active-profile pushed state.

GetConfig() loads cfg from req.ProfileName/req.Username, but these sp* values always come from the currently running connectClient. When the caller asks for a non-active profile, the response combines one profile's persisted config with another profile's server-pushed hints. Gate these fields on the requested profile being the active/running one; otherwise return zero-values.

🤖 Prompt for AI Agents

In `@client/server/server.go` around lines 1523 - 1572, GetConfig currently mixes
the requested profile's persisted cfg with server-pushed runtime fields (spMode,
spRelayTOSecs, spP2pTOSecs, spP2pRetMax) unconditionally; change this so those
sp* fields are only populated from s.connectClient (Engine()->ConnMgr()) when
the requested profile equals the active/running profile (compare
req.ProfileName/req.Username against the active session identifiers used by the
running connectClient), otherwise leave sp* as zero-values. Update the logic
that sets spMode/spRelayTOSecs/spP2pTOSecs/spP2pRetMax (the block that queries
s.connectClient.Engine().ConnMgr()) to be guarded by an isActiveProfile check
and ensure the returned proto.GetConfigResponse uses those guarded values.
🧹 Nitpick comments (9)
management/server/peer/peer.go (1)

199-203: ⚡ Quick win

SupportedFeatures equality is order-sensitive — consider treating it as a set.

slices.Equal(p.SupportedFeatures, other.SupportedFeatures) returns false whenever the client emits the same features in a different order. If client/system/features.go ever changes how it builds the slice (e.g. iterates a map, adds a feature in the middle, or callers append in a different sequence), isEqual will start reporting "changed" on every sync, generating unnecessary writes/events. Either sort both sides before comparing or compare as a set.

♻️ Proposed fix
-		slices.Equal(p.SupportedFeatures, other.SupportedFeatures) &&
+		equalFeatureSets(p.SupportedFeatures, other.SupportedFeatures) &&
func equalFeatureSets(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	aa := slices.Clone(a)
	bb := slices.Clone(b)
	slices.Sort(aa)
	slices.Sort(bb)
	return slices.Equal(aa, bb)
}
🤖 Prompt for AI Agents

In `@management/server/peer/peer.go` around lines 199 - 203, The current
comparison in isEqual uses slices.Equal(p.SupportedFeatures,
other.SupportedFeatures), which is order-sensitive; change it to compare feature
sets instead by implementing a helper (e.g., equalFeatureSets) that returns
false if lengths differ, clones both slices, sorts them (or builds maps/sets)
and then compares, and replace the slices.Equal call with
equalFeatureSets(p.SupportedFeatures, other.SupportedFeatures) so identical
features in different orders are treated as equal.
shared/management/http/api/openapi.yml (1)

383-431: ⚡ Quick win

Documented defaults should be encoded as OpenAPI default values.

Lines 391-392, 416-417, and 428-430 describe defaults in text, but the schema omits default (e.g., p2p_retry_max_seconds=900, legacy_lazy_fallback_enabled=true, legacy_lazy_fallback_timeout_seconds=3600). Add explicit defaults to keep generated clients/docs consistent.

🤖 Prompt for AI Agents

In `@shared/management/http/api/openapi.yml` around lines 383 - 431, The OpenAPI
schema omits explicit default values for documented defaults; add default: 900
(integer) to p2p_retry_max_seconds, default: true (boolean) to
legacy_lazy_fallback_enabled, and default: 3600 (integer) to
legacy_lazy_fallback_timeout_seconds in the YAML so generated clients/docs
reflect the described defaults (keep the keys as p2p_retry_max_seconds,
legacy_lazy_fallback_enabled, legacy_lazy_fallback_timeout_seconds and ensure
types/formats remain unchanged).
client/internal/peer/guard/guard_test.go (1)

65-96: 💤 Low value

Test name overstates what is exercised.

TestGuard_PeerActivityResetsHourlyMode does not actually run reconnectLoopWithRetry; it manually reads from g.peerActivity and calls iceState.reset() itself, then asserts on iceState. The comment is honest about this, but the test would happily pass even if a future refactor accidentally dropped the case <-g.peerActivity: arm in reconnectLoopWithRetry (since the test simulates the case body itself). Consider running reconnectLoopWithRetry in a goroutine with a context cancel and observing iceState through the real loop; that would turn this into a true regression pin.

🤖 Prompt for AI Agents

In `@client/internal/peer/guard/guard_test.go` around lines 65 - 96, The test
TestGuard_PeerActivityResetsHourlyMode currently simulates the
reconnectLoopWithRetry behaviour by reading from g.peerActivity and calling
iceState.reset() itself; instead, start the actual reconnectLoopWithRetry (or
the goroutine that runs it) with a cancellable context, call
g.NotifyPeerActivity(), and then wait for the real loop to consume from
g.peerActivity and perform the reset so you assert the real effects on the
iceRetryState (check iceState.hourly and iceState.retries) — i.e., replace the
manual select+iceState.reset() with running reconnectLoopWithRetry in a
goroutine, use context cancellation to stop the loop, and assert that the loop
cleared hourly mode and zeroed retries after NotifyPeerActivity().
client/internal/peer/conn_handover_order_test.go (2)

138-175: ⚖️ Poor tradeoff

extractFunctionBody brace counting is string/comment-blind.

Brace counting walks every {/} character even inside string literals and comments, so adding a Go format directive containing { or } to one of the tested functions in conn.go would silently break extraction (mismatched depth, wrong slice, or fatal "unbalanced braces"). Crude is fine for now, but if these tests become flaky after a conn.go log-string edit, replacing the helper with go/parser + ast.Inspect to grab the function's exact source range would be more robust. Optional since the comment already calls out the limitation.

🤖 Prompt for AI Agents

In `@client/internal/peer/conn_handover_order_test.go` around lines 138 - 175, The
current extractFunctionBody helper blindly counts '{' and '}' characters (in
function extractFunctionBody) which miscounts braces inside string literals or
comments; replace this brittle scanner with a proper Go AST-based extraction:
parse the source with go/parser, use ast.Inspect to find the FuncDecl or
MethodDecl node matching the target name (handle both receiver method form and
plain func), then use the node.Pos()/node.End() token positions with the
original src to slice out the exact function text; update tests to call this new
extractor instead of extractFunctionBody.

25-53: 💤 Low value

Test name vs target function disagree.

The test is named TestConn_HandoverOrder_OnICEConnected and the comment block references onICEConnected, but extractFunctionBody actually inspects onICEConnectionIsReady. Aligning the name to the inspected function would make CI failures easier to triage.

🤖 Prompt for AI Agents

In `@client/internal/peer/conn_handover_order_test.go` around lines 25 - 53, The
test name and its comment refer to onICEConnected but the call to
extractFunctionBody(t, string(src), "onICEConnectionIsReady") targets a
different function; update either the test name/comment to match
onICEConnectionIsReady (e.g. rename TestConn_HandoverOrder_OnICEConnected to
TestConn_HandoverOrder_OnICEConnectionIsReady and update the comment) or change
the extractFunctionBody call to "onICEConnected" so the inspected function and
test identity match; adjust any related references in this test (landmarks or
error messages) to use the chosen function name.
client/internal/peer/status.go (1)

1617-1644: ⚡ Quick win

Avoid holding d.mux across wgIface.FullStats().

FullStats() is an external/kernel-facing call. Keeping the status mutex locked for its full duration stalls unrelated peer and UI state updates behind it. Grab d.wgIface under the lock, call FullStats() outside, then re-lock only to apply the snapshot.

🤖 Prompt for AI Agents

In `@client/internal/peer/status.go` around lines 1617 - 1644,
RefreshWireGuardStats currently holds d.mux while calling d.wgIface.FullStats(),
which can block other operations; fix it by acquiring the lock only to read the
wgIface pointer (and return nil if nil), then release the lock, call
wgIface.FullStats() outside the lock, and finally re-acquire d.mux to apply the
returned stats snapshot into d.peers (updating LastWireguardHandshake, BytesRx,
BytesTx for each peer). Ensure you preserve the existing error wrapping from
FullStats() and the behavior of skipping unknown public keys when updating the
map.
management/server/peer_test.go (1)

588-589: ⚡ Quick win

Make the denial check specific.

Line 589 only asserts that some error happened after policies are removed. That lets unrelated failures in GetPeer satisfy the test and weakens the visibility regression coverage. Assert the expected status/type here so the test only passes when access is denied for the right reason.

🤖 Prompt for AI Agents

In `@management/server/peer_test.go` around lines 588 - 589, Replace the generic
assert.Error with a specific assertion that the failure is an
authorization/denial error from GetPeer: call
manager.GetPeer(context.Background(), accountID, peer2.ID, someUser) and then
assert that the returned error matches the expected permission-denied sentinel
or status (e.g. errors.Is(err, ErrPermissionDenied) or check a typed error's
StatusCode == 403 or assert.EqualError(t, err, ErrPermissionDenied.Error())).
Use the actual project symbol for the sentinel (e.g., ErrPermissionDenied,
ErrUnauthorized, or the API error type) instead of a generic error check so the
test only passes when access is denied for the right reason.
client/ui/peers_tab.go (1)

280-355: ⚡ Quick win

Split buildPeerDetailText before it grows further.

This formatter is already doing multiple jobs: base fields, endpoint selection, full-details rendering, and ICE backoff formatting. Extracting those sections into small helpers should clear the current complexity warning and make future field additions less error-prone.

🤖 Prompt for AI Agents

In `@client/ui/peers_tab.go` around lines 280 - 355, The buildPeerDetailText
function is doing too many jobs; split it into small helpers: create
renderBasePeerFields(sb *strings.Builder, p *proto.PeerState) to write
IP/FQDN/connection type/effective/configured modes/latency/last-seen/groups,
renderConnectionEndpoints(sb *strings.Builder, p *proto.PeerState) to handle
relayed vs local/remote endpoint printing and Relay server/local/remote endpoint
logic, and renderFullPeerDetails(sb *strings.Builder, p *proto.PeerState) which
calls a helper renderIceBackoff(sb *strings.Builder, p *proto.PeerState) to
encapsulate the ICE backoff nextRetry/suspended/time-until logic; then have
buildPeerDetailText simply create the strings.Builder, call
renderBasePeerFields, renderConnectionEndpoints, and if full call
renderFullPeerDetails, and return sb.String().
management/internals/server/boot.go (1)

111-125: Plan for multi-replica management before relying on this state path.

PeerConnStore and SnapshotRouter are process-local. In an HA deployment, SyncPeerConnections traffic landing on one replica and dashboard/API reads landing on another will see different peer-connection state. Please make sure the rollout includes sticky routing or a shared backing/pubsub layer; otherwise visibility will flap across nodes.

🤖 Prompt for AI Agents

In `@management/internals/server/boot.go` around lines 111 - 125, PeerConnStore
and PeerConnRouter currently create process-local state (via Create with
peer_connections.NewMemoryStore and peer_connections.NewSnapshotRouter), which
will cause inconsistent visibility across replicas; update the initialization to
use a shared backing or pub/sub layer (or make the Create wrapper accept an
alternate implementation) so PeerConnStore() and PeerConnRouter() return a
cluster-scoped store/router instead of an in-memory local one; specifically
adjust the code paths that call PeerConnStore, PeerConnRouter, and the
SyncPeerConnections flow to use the shared implementation (or guard rollout with
sticky routing) so reads and SyncPeerConnections traffic observe the same state
across replicas.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/iface/bind/activity.go`:
- Around line 114-121: The callback passed to record() (which triggers
AttachICEOnRelayActivity()) can acquire conn.mu and perform non-trivial checks
(including conn.handshaker.readICEListener()), violating the "MUST be cheap"
contract in activity.go:115; change the invocation to avoid holding the WG I/O
goroutine: instead of calling cb(...) synchronously, dispatch the work
off-thread (e.g., spawn a dedicated goroutine) or perform a non-blocking handoff
(send on a buffered channel with select { case ch<-payload: default: }) so that
record() returns immediately; ensure the dispatched worker invokes
AttachICEOnRelayActivity() and retains existing rate-gating semantics.

In `@client/internal/conn_mgr.go`:
- Around line 322-326: The code updates e.mode and timeout fields then returns
early when e.rosenpassEnabled blocks lazy/dynamic mode, leaving the instance in
an unsupported "lazy" state; change the flow so you either validate
rosenpassEnabled before mutating live state or, if already mutated, explicitly
set e.mode to a supported eager mode (e.g. "p2p-eager") and restore/clear any
lazy-specific timeout fields before returning; update the branch that checks
e.rosenpassEnabled (the spot referencing e.lazyConnMgr and e.rosenpassEnabled)
to perform this reset-or-reject behavior so the in-memory state never reflects a
disabled lazy/dynamic mode.
- Around line 292-329: The code only restarts the lazy manager when modeChanged,
so updates to relayTimeoutSecs or p2pTimeoutSecs on ConnMgr never get propagated
into the running manager.Config; modify the update path in the function that
sets newMode/newRelay/newP2P/newP2pRetry (the block around
propagateP2pRetryMaxToConns, modeUsesLazyMgr, closeManager, initLazyManager,
startModeSideEffects and lazyConnMgr) to detect when the inactivity timeouts
(relayTimeoutSecs or p2pTimeoutSecs) change while isManaged is true and
lazyConnMgr != nil and then restart the manager (call closeManager(ctx),
UpdateLazyConnection(false), then initLazyManager(ctx) and
startModeSideEffects()) so the new timeout values are snapshotted into the
manager.Config even when mode did not change.

In `@client/internal/conn_state_pusher.go`:
- Around line 153-164: The OnSnapshotRequest method currently enqueues a nonce
but can leave an old nonce in the channel causing the pusher to process a stale
value; change connStatePusher.OnSnapshotRequest to coalesce by draining
snapshotReq before sending: perform a non-blocking loop that reads and discards
enqueued nonces keeping only the most recent value, then non-blocking-send that
newest nonce into p.snapshotReq (so callers' latest InResponseToNonce is
preserved). Apply the same drain-to-newest fix to the other similar handler
using the same pattern (the peer/state-change request channel referenced around
lines 245-248).
- Around line 115-122: newSessionID currently ignores errors from rand.Read and
can loop forever; change newSessionID to return (uint64, error), check the error
from rand.Read and return it instead of retrying, and keep the non-zero check
for the generated id; update any callers that call newSessionID (e.g., the
pusher/constructor code in conn_state_pusher) to handle the error by propagating
it or providing a safe fallback (e.g., abort startup with a clear error) so
startup cannot hang indefinitely.

In `@client/internal/lazyconn/manager/manager.go`:
- Around line 102-109: The code currently treats iceTO==0 && relayTO==0 as a
signal to use the legacy inactivity.NewManager which restores a 24h relay
timeout; instead, only use the legacy constructor when the deprecated
InactivityThreshold field was actually used — i.e., check
config.InactivityThreshold (the old field) to decide to call
inactivity.NewManager(wgIface, config.InactivityThreshold), otherwise always
call inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) so explicit
zero timeouts from config.resolvedTimeouts() correctly disable teardown as
documented and set m.inactivityManager accordingly.

In `@client/internal/peer/status.go`:
- Around line 468-471: The closure returned after detecting a connection-state
change currently calls notifyConnStateChange which reads d.connStateListener
without holding d.mux, causing a race with SetConnStateListener; fix by
capturing the listener reference under the lock (e.g., read d.connStateListener
into a local variable while holding d.mux) and build the closure to call
notifyConnStateChange (or call the listener directly) using that captured
reference; apply the same pattern for the other similar sites (around the logic
referenced at lines ~694-697, ~762-765, ~810-813, ~861-864) so the listener is
read under d.mux before unlocking.
- Around line 519-544: The current branch in the peer-update code holds d.mux
while calling
d.notifyPeerListChanged()/d.notifyPeerStateChangeListeners(pubKey), which can
deadlock if listeners re-enter Status; change it to mirror other update paths
by: under the lock update st and d.peers[pubKey] and compute a local notify flag
(already done), then release the lock (remove defer or unlock before
notifications) and call d.notifyPeerListChanged() and
d.notifyPeerStateChangeListeners(pubKey) only after the mutex is unlocked;
reference: d.mux, d.peers, pubKey, st, notifyPeerListChanged,
notifyPeerStateChangeListeners.

In `@client/proto/daemon.proto`:
- Around line 744-754: SetConfig currently only copies ProfileName and Username
into ActiveProfileState, so the new connection override fields (connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, p2p_retry_max_seconds) are dropped
and not persisted; update the SetConfig handler (function SetConfig in
client/server/server.go) to copy these fields from the incoming SetConfig
request into ActiveProfileState (add corresponding fields/properties on
ActiveProfileState if missing and map from the proto names connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, p2p_retry_max_seconds), ensure any
serialization/storage code that persists ActiveProfileState is updated to
include them, and update tests to assert that saving a config with those fields
actually persists and is returned on subsequent reads.
- Around line 327-347: GetConfig currently returns the new proto fields
(connection_mode, p2p_timeout_seconds, relay_timeout_seconds,
p2p_retry_max_seconds, server_pushed_connection_mode,
server_pushed_relay_timeout_seconds, server_pushed_p2p_timeout_seconds,
server_pushed_p2p_retry_max_seconds) as empty/0 because the implementation in
GetConfig (server.go) never assigns them; fix by reading the local profile's
connection override values and the management server's last-received PeerConfig
and explicitly set the corresponding fields on the outgoing proto message (use
the same names in the daemon proto struct returned by GetConfig), leaving fields
empty/0 when no local override or PeerConfig exists so callers observe real
values rather than placeholders.
- Around line 208-218: Login currently only copies ProfileName and Username into
ActiveProfileState and ignores the new proto fields (connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, p2p_retry_max_seconds), so
connection overrides sent at login are dropped; update the Login handler to read
those four fields from the incoming request and populate the corresponding
fields on ActiveProfileState (or the existing connection-overrides
sub-structure) when creating or updating the profile on first-run, and ensure
the code path that persists/forwards ActiveProfileState includes these fields
(refer to the Login function, ActiveProfileState, ProfileName, and Username
symbols to locate where to add the assignments).

In `@client/ui/client_ui.go`:
- Around line 893-906: The gating currently switches on
s.sConnectionMode.Selected which treats the "Follow server" case as default and
disables all timeout inputs even when the effective mode is a server-pushed one;
change the logic to compute the effective mode (call selectedConnectionMode() or
use serverPushedMode when selectedConnectionMode() returns empty) and switch on
that value instead, then enable/disable s.iRelayTimeout, s.iP2pTimeout, and
s.iP2pRetryMax based on the effective mode ("p2p-lazy", "p2p-dynamic", default)
so following the server still allows the correct timeout fields to be editable
when the server-pushed mode requires it.
- Around line 656-665: parseUint32Field currently swallows parse errors and
returns 0 which silently treats invalid input as “no override”; change
parseUint32Field to return (uint32, error) (or add a companion
validateUint32Field) so callers can detect bad input, surface a validation error
to the UI, and avoid treating typos as the sentinel 0; update all callers (e.g.,
any code invoking parseUint32Field and logic in hasConnectionModeChanges()) to
handle the error path, show a validation message rather than writing 0, and only
use uint32 value when parsing succeeds.

In `@management/internals/shared/grpc/conversion.go`:
- Around line 315-327: The code sets cfg.ServerLivenessKnown = true
unconditionally even when rPeer.Status is nil, which incorrectly advertises
authoritative liveness; change the logic so cfg.ServerLivenessKnown is only set
to true inside the rPeer.Status != nil branch (where LiveOnline and
LastSeenAtServer are populated) and leave it false when rPeer.Status is absent
so clients fall back to the LastSeenAtServer heuristic; update the block around
rPeer.Status, cfg.LiveOnline, cfg.LastSeenAtServer and cfg.ServerLivenessKnown
accordingly.

In `@management/server/http/handlers/accounts/accounts_handler.go`:
- Around line 231-280: Replace plain fmt.Errorf errors with gRPC InvalidArgument
status errors so bad input yields a 4xx validation response: change the
connection_mode error in the accounts handler to return
status.Errorf(codes.InvalidArgument, "invalid connection_mode %q", modeStr) and
change the legacy_lazy_fallback_timeout_seconds branch to return
status.Errorf(codes.InvalidArgument, "invalid
legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v).
Also ensure validateUint32Timeout returns or surfaces InvalidArgument errors
(update validateUint32Timeout to return status.Errorf(codes.InvalidArgument,
...) for invalid ranges) and add imports for "google.golang.org/grpc/codes" and
"google.golang.org/grpc/status". This targets the connection_mode branch, the
legacy_lazy_fallback_timeout_seconds block, and the validateUint32Timeout
function so updateAccount -> util.WriteError receives typed InvalidArgument
errors.

In `@management/server/peer_connections/store.go`:
- Around line 141-155: GetWithNonceCheck on MemoryStore checks the nonce gate
before TTL, allowing expired entries to be returned as nonce-old and never
deleted; change GetWithNonceCheck to first evaluate TTL (compare
s.clock.Now().Sub(e.updatedAt) > s.ttl and delete from s.maps if expired) before
performing the since / InResponseToNonce check so stale entries are evicted even
when the nonce condition would cause an early return; adjust references inside
the GetWithNonceCheck method that touch e, e.m, e.updatedAt, s.maps, s.ttl and
s.clock accordingly.

In `@management/server/peer.go`:
- Line 1355: Replace the internal-server error return with an authorization
error: change the status code in the return that currently uses status.Internal
to an auth-level code (e.g., status.PermissionDenied) so callers get a
permission response instead of a 500; keep the existing message formatting that
includes userID, peer.ID and accountID in the return statement.

In `@shared/management/client/grpc.go`:
- Around line 483-492: The snapshot callback is invoked before applying the
SyncResponse which can cause snapshots to be built from pre-update state; move
the c.onSnapshotRequest invocation (the cb from c.onSnapshotRequest and
decryptedResp.GetSnapshotRequest()) to after the call to
msgHandler(decryptedResp) so the SyncResponse is applied first, i.e., call
msgHandler(decryptedResp) and only if it returns nil then acquire c.snapMu, read
c.onSnapshotRequest into cb, release the lock and invoke cb(req.GetNonce()) when
decryptedResp.GetSnapshotRequest() is non-nil.

In `@shared/management/http/api/types.gen.go`:
- Around line 41-63: The generated enum AccountSettingsConnectionMode and its
Valid() method are missing the p2p-dynamic-lazy value; update the OpenAPI spec
so the connection-mode enum includes "p2p-dynamic-lazy", regenerate the code,
and ensure the generated constants (e.g.,
AccountSettingsConnectionModeP2pDynamicLazy) and Valid() cover that value (so
Valid() returns true for AccountSettingsConnectionModeP2pDynamicLazy) instead of
manually editing types.gen.go.

---

Outside diff comments:
In `@client/ui/network.go`:
- Around line 139-206: The updateNetworksWithMode function mutates Fyne widgets
from a background goroutine (grid.Objects = nil, grid.Add, setting widget
properties, Resize/Refresh) which must be run on the UI thread; wrap all
UI/container mutations inside fyne.Do or fyne.DoAndWait to dispatch to the UI
goroutine and avoid races. Specifically, update updateNetworksWithMode so that
the body which clears grid, creates widgets (checkboxes, labels, Selects), sets
their properties, calls grid.Add and Refresh are executed inside a single
fyne.Do (or fyne.DoAndWait if the caller expects synchronous completion), and
adjust the call site in startAutoRefresh (where
updateNetworksSilent/updateNetworksWithMode is invoked from the ticker) to use
fyne.Do or call the updated synchronous variant accordingly. Ensure you only
move UI mutations inside the Do/DoAndWait block and keep non-UI work (filtering,
sorting, building data) outside to avoid blocking the UI.

In `@management/server/peer.go`:
- Around line 1297-1324: GetPeer currently skips the RegularUsersViewBlocked
global setting and can leak peers via the detail route; update GetPeer to mirror
GetPeers by checking the RegularUsersViewBlocked flag after resolving the user
(or immediately after permissions validation) and, if the user is a
restrictable/regular user and RegularUsersViewBlocked is enabled, return a
permission-denied error instead of falling through to the admin/owner or policy
reachability branches. Locate GetPeer and add the setting check (using the same
predicate used by GetPeers), consult user via Store.GetUserByUserID and
user.IsRestrictable()/IsRegular() as appropriate, and return a permission-denied
status (use the project’s existing status error helper) when the check blocks
access before calling peer-owner or checkIfUserOwnsPeer.

---

Minor comments:
In `@client/iface/device/endpoint_manager.go`:
- Around line 15-21: Update the doc comment for EndpointManager.ActivityRecorder
to stop promising a non-nil result; change the wording to explicitly state the
returned *bind.ActivityRecorder may be nil and callers must nil-check it before
use (e.g., engine's current pattern if rec := bind.ActivityRecorder(); rec !=
nil). Reference the ActivityRecorder() method on the EndpointManager interface
and mention that mocks and other implementations are allowed to return nil so
callers should not assume a non-nil value.

In `@client/internal/debouncer/debouncer.go`:
- Around line 45-53: The Stop method on Debouncer does not fully cancel a
pending fn: if the timer callback has already fired and is waiting on d.mu,
Stop only stops the timer and does not clear d.fn. Update Stop (method
Debouncer.Stop) to acquire d.mu and, while holding the lock, set both d.timer
and d.fn to nil so a later callback cannot run the stale function, and ensure
the timer's AfterFunc callback (the anonymous function that grabs d.mu and
reads d.fn) checks d.fn under the same lock and returns without invoking it
when nil.
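A minimal sketch of that Stop/AfterFunc handshake, using a hypothetical Debouncer shape (the real type in client/internal/debouncer has a different surface):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Debouncer struct {
	mu    sync.Mutex
	delay time.Duration
	timer *time.Timer
	fn    func()
}

func (d *Debouncer) Trigger(fn func()) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.fn = fn
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.delay, func() {
		d.mu.Lock()
		pending := d.fn // re-read under the lock: Stop may have cleared it
		d.fn = nil
		d.mu.Unlock()
		if pending != nil {
			pending()
		}
	})
}

// Stop clears both timer and fn under the lock, so a callback that
// already fired and is waiting on d.mu finds fn == nil and returns.
func (d *Debouncer) Stop() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
		d.timer = nil
	}
	d.fn = nil
}

// stoppedCallbackRuns reports whether a callback still runs after Stop.
func stoppedCallbackRuns() bool {
	var mu sync.Mutex
	var ran bool
	d := &Debouncer{delay: 5 * time.Millisecond}
	d.Trigger(func() { mu.Lock(); ran = true; mu.Unlock() })
	d.Stop()
	time.Sleep(30 * time.Millisecond)
	mu.Lock()
	defer mu.Unlock()
	return ran
}

func main() {
	fmt.Println("callback ran after Stop:", stoppedCallbackRuns())
}
```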

In `@client/internal/peer/env_test.go`:
- Around line 33-34: Replace the hardcoded environment variable strings in the
test with the package constants: use envEnableLazyConn instead of
"NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and envInactivityThreshold instead of
"NB_LAZY_CONN_INACTIVITY_THRESHOLD" in the t.Setenv calls in
client/internal/peer/env_test.go so the test references the same unexported
constants defined in env.go (envEnableLazyConn, envInactivityThreshold).

In `@client/internal/stdnet/filter.go`:
- Around line 51-54: The unconditional check using strings.HasPrefix(lowerIFace,
"lo") runs on all OSes and mistakenly filters Windows interfaces; change it so
the "lo" prefix check only runs on non-Windows (e.g., wrap it behind a
runtime.GOOS != "windows" guard or equivalent) and let Windows loopback names
be handled solely by windowsKnownBadSubstrings. Update the conditional around
the lowerIFace "lo" prefix check (the code using lowerIFace and
strings.HasPrefix) so Windows names are not filtered here and remain handled
by windowsKnownBadSubstrings.
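A sketch of the guarded check; filteredByLoPrefix is hypothetical and takes goos as a parameter so both branches are testable (the real code would consult runtime.GOOS directly):

```go
package main

import (
	"fmt"
	"strings"
)

// filteredByLoPrefix applies the "lo" prefix heuristic only off
// Windows, so names like "Local Area Connection" are left for the
// Windows substring denylist (windowsKnownBadSubstrings in the real
// code) instead of being dropped by the prefix check.
func filteredByLoPrefix(goos, iface string) bool {
	lower := strings.ToLower(iface)
	return goos != "windows" && strings.HasPrefix(lower, "lo")
}

func main() {
	fmt.Println(filteredByLoPrefix("linux", "lo"))                      // true
	fmt.Println(filteredByLoPrefix("windows", "Local Area Connection")) // false: not filtered here
}
```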

In `@client/server/server.go`:
- Around line 1523-1572: GetConfig currently mixes the requested profile's
persisted cfg with server-pushed runtime fields (spMode, spRelayTOSecs,
spP2pTOSecs, spP2pRetMax) unconditionally; change this so those sp* fields are
only populated from s.connectClient (Engine()->ConnMgr()) when the requested
profile equals the active/running profile (compare req.ProfileName/req.Username
against the active session identifiers used by the running connectClient),
otherwise leave sp* as zero-values. Update the logic that sets
spMode/spRelayTOSecs/spP2pTOSecs/spP2pRetMax (the block that queries
s.connectClient.Engine().ConnMgr()) to be guarded by an isActiveProfile check
and ensure the returned proto.GetConfigResponse uses those guarded values.

In `@client/ui/peers_tab.go`:
- Around line 61-71: When either getSrvClient(...) fails or conn.Status(...)
returns an error, clear stale peer UI state instead of only updating the
summary: in both error paths, wrap the UI updates in fyne.Do to set the
summary to the error text, clear the breakdown (e.g. breakdown.SetText("")),
and remove/clear the peers rows/container (the peers table or peersContainer
holding rows) so previous peer entries are not shown under the error banner;
apply this to the error branch after getSrvClient and the one after
conn.Status so stale data is always wiped when a refresh fails.

---

Duplicate comments:
In `@client/android/preferences.go`:
- Around line 325-330: SetConnectionMode and the Set*TimeoutSeconds setters
currently store empty/zero values as non-nil pointers, causing Commit() to
persist explicit overrides instead of removing them; change each setter (e.g.,
SetConnectionMode and the timeout setters) to write nil to the corresponding
p.configInput field when the input is the "clear" value (mode == "" or seconds
== 0) and otherwise allocate and assign a pointer to the provided value (keep
using p.configInput.ConnectionMode and the timeout fields to locate the spots to
change).

In `@client/internal/peer/status.go`:
- Around line 682-692: The call to notifyPeerStateChangeListeners is executed
after releasing the mutex (after d.mux.Unlock()), but that function snapshots
d.peers/d.changeNotify and must be called while holding the lock; move the
notifyPeerStateChangeListeners(receivedState.PubKey) call so it runs before
d.mux.Unlock() in this block (and likewise for the other occurrence at the
758-760 region), keeping d.notifier.peerListChanged(numPeers) and
d.dispatchRouterPeers(...) semantics unchanged and only reordering to hold d.mux
during notifyPeerStateChangeListeners to avoid concurrent map access.

In `@client/internal/peer/worker_ice.go`:
- Around line 227-230: IsConnected reads lastKnownState under muxAgent but
onConnectionStateChange (and other writers around the 545-571 range) update
lastKnownState without that lock, causing a race; modify onConnectionStateChange
and any other locations that assign to WorkerICE.lastKnownState to acquire
w.muxAgent.Lock()/Unlock() (or use RLock/RUnlock where appropriate) around
writes so all reads/writes of lastKnownState are protected by muxAgent, keeping
the lock usage consistent with WorkerICE.IsConnected and Conn.onNetworkChange
checks.
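A sketch of that locking discipline, with an integer standing in for the ICE connection state and a plain mutex standing in for muxAgent:

```go
package main

import (
	"fmt"
	"sync"
)

// workerICE sketches the rule the comment asks for: every read AND
// write of lastKnownState goes through the same mutex, matching the
// lock already used by IsConnected in the real WorkerICE.
type workerICE struct {
	muxAgent       sync.Mutex
	lastKnownState int // stand-in for the ICE connection state enum
}

// onConnectionStateChange now writes under the lock instead of racily.
func (w *workerICE) onConnectionStateChange(s int) {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()
	w.lastKnownState = s
}

// IsConnected reads under the same lock, as in the reviewed code.
func (w *workerICE) IsConnected() bool {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()
	return w.lastKnownState == 1 // 1 plays "connected" in this sketch
}

func main() {
	w := &workerICE{}
	w.onConnectionStateChange(1)
	fmt.Println(w.IsConnected())
}
```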

In `@management/server/http/handlers/accounts/accounts_handler.go`:
- Around line 511-517: The error message in validateUint32Timeout still passes
math.MaxUint32 as a plain value which can overflow on 32-bit builds; update the
second fmt.Errorf call in validateUint32Timeout to cast the constant to a 64-bit
integer (e.g. int64(math.MaxUint32)) so the formatted argument matches the %d
verb and avoids overflow, ensuring the comparison int64(math.MaxUint32) already
used and the error message use the same casted value.

---

Nitpick comments:
In `@client/internal/peer/conn_handover_order_test.go`:
- Around line 138-175: The current extractFunctionBody helper blindly counts '{'
and '}' characters (in function extractFunctionBody), which miscounts braces
inside string literals or comments; replace this brittle scanner with a proper
Go AST-based extraction: parse the source with go/parser, use ast.Inspect to
find the FuncDecl node matching the target name (a method is a FuncDecl with a
non-nil Recv field, so handle both the receiver-method form and plain func),
then use the node.Pos()/node.End() token positions with the original src to
slice out the exact function text; update tests to call this new extractor
instead of extractFunctionBody.
- Around line 25-53: The test name and its comment refer to onICEConnected but
the call to extractFunctionBody(t, string(src), "onICEConnectionIsReady")
targets a different function; update either the test name/comment to match
onICEConnectionIsReady (e.g. rename TestConn_HandoverOrder_OnICEConnected to
TestConn_HandoverOrder_OnICEConnectionIsReady and update the comment) or change
the extractFunctionBody call to "onICEConnected" so the inspected function and
test identity match; adjust any related references in this test (landmarks or
error messages) to use the chosen function name.

In `@client/internal/peer/guard/guard_test.go`:
- Around line 65-96: The test TestGuard_PeerActivityResetsHourlyMode currently
simulates the reconnectLoopWithRetry behaviour by reading from g.peerActivity
and calling iceState.reset() itself; instead, start the actual
reconnectLoopWithRetry (or the goroutine that runs it) with a cancellable
context, call g.NotifyPeerActivity(), and then wait for the real loop to consume
from g.peerActivity and perform the reset so you assert the real effects on the
iceRetryState (check iceState.hourly and iceState.retries) — i.e., replace the
manual select+iceState.reset() with running reconnectLoopWithRetry in a
goroutine, use context cancellation to stop the loop, and assert that the loop
cleared hourly mode and zeroed retries after NotifyPeerActivity().

In `@client/internal/peer/status.go`:
- Around line 1617-1644: RefreshWireGuardStats currently holds d.mux while
calling d.wgIface.FullStats(), which can block other operations; fix it by
acquiring the lock only to read the wgIface pointer (and return nil if nil),
then release the lock, call wgIface.FullStats() outside the lock, and finally
re-acquire d.mux to apply the returned stats snapshot into d.peers (updating
LastWireguardHandshake, BytesRx, BytesTx for each peer). Ensure you preserve the
existing error wrapping from FullStats() and the behavior of skipping unknown
public keys when updating the map.
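The lock-narrowing pattern can be sketched with hypothetical stand-in types (iface, statusRecorder, fullStats); only the lock/unlock choreography mirrors the suggestion, not the real NetBird types:

```go
package main

import (
	"fmt"
	"sync"
)

// fullStats stands in for the WireGuard stats snapshot: pubkey -> bytes.
type fullStats map[string]int64

type iface struct{}

func (i *iface) FullStats() (fullStats, error) {
	return fullStats{"peerA": 42}, nil
}

type statusRecorder struct {
	mu      sync.Mutex
	wgIface *iface
	peers   map[string]int64
}

// RefreshStats holds the mutex only to read the iface pointer and
// later to apply the snapshot; the potentially slow FullStats call
// runs unlocked, so other status operations are not blocked by it.
func (d *statusRecorder) RefreshStats() error {
	d.mu.Lock()
	wg := d.wgIface
	d.mu.Unlock()
	if wg == nil {
		return nil
	}
	stats, err := wg.FullStats() // outside the lock: may block
	if err != nil {
		return fmt.Errorf("get wg stats: %w", err) // preserve error wrapping
	}
	d.mu.Lock()
	defer d.mu.Unlock()
	for key, rx := range stats {
		if _, ok := d.peers[key]; !ok {
			continue // skip unknown public keys, as the original does
		}
		d.peers[key] = rx
	}
	return nil
}

func main() {
	d := &statusRecorder{wgIface: &iface{}, peers: map[string]int64{"peerA": 0}}
	_ = d.RefreshStats()
	fmt.Println(d.peers["peerA"]) // 42
}
```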

In `@client/ui/peers_tab.go`:
- Around line 280-355: The buildPeerDetailText function is doing too many jobs;
split it into small helpers: create renderBasePeerFields(sb *strings.Builder, p
*proto.PeerState) to write IP/FQDN/connection type/effective/configured
modes/latency/last-seen/groups, renderConnectionEndpoints(sb *strings.Builder, p
*proto.PeerState) to handle relayed vs local/remote endpoint printing and Relay
server/local/remote endpoint logic, and renderFullPeerDetails(sb
*strings.Builder, p *proto.PeerState) which calls a helper renderIceBackoff(sb
*strings.Builder, p *proto.PeerState) to encapsulate the ICE backoff
nextRetry/suspended/time-until logic; then have buildPeerDetailText simply
create the strings.Builder, call renderBasePeerFields,
renderConnectionEndpoints, and if full call renderFullPeerDetails, and return
sb.String().

In `@management/internals/server/boot.go`:
- Around line 111-125: PeerConnStore and PeerConnRouter currently create
process-local state (via Create with peer_connections.NewMemoryStore and
peer_connections.NewSnapshotRouter), which will cause inconsistent visibility
across replicas; update the initialization to use a shared backing or pub/sub
layer (or make the Create wrapper accept an alternate implementation) so
PeerConnStore() and PeerConnRouter() return a cluster-scoped store/router
instead of an in-memory local one; specifically adjust the code paths that call
PeerConnStore, PeerConnRouter, and the SyncPeerConnections flow to use the
shared implementation (or guard rollout with sticky routing) so reads and
SyncPeerConnections traffic observe the same state across replicas.

In `@management/server/peer_test.go`:
- Around line 588-589: Replace the generic assert.Error with a specific
assertion that the failure is an authorization/denial error from GetPeer: call
manager.GetPeer(context.Background(), accountID, peer2.ID, someUser) and then
assert that the returned error matches the expected permission-denied sentinel
or status (e.g. errors.Is(err, ErrPermissionDenied) or check a typed error's
StatusCode == 403 or assert.EqualError(t, err, ErrPermissionDenied.Error())).
Use the actual project symbol for the sentinel (e.g., ErrPermissionDenied,
ErrUnauthorized, or the API error type) instead of a generic error check so the
test only passes when access is denied for the right reason.

In `@management/server/peer/peer.go`:
- Around line 199-203: The current comparison in isEqual uses
slices.Equal(p.SupportedFeatures, other.SupportedFeatures), which is
order-sensitive; change it to compare feature sets instead by implementing a
helper (e.g., equalFeatureSets) that returns false if lengths differ, clones
both slices, sorts them (or builds maps/sets) and then compares, and replace the
slices.Equal call with equalFeatureSets(p.SupportedFeatures,
other.SupportedFeatures) so identical features in different orders are treated
as equal.
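A sketch of the suggested helper; equalFeatureSets is hypothetical and only illustrates the clone-sort-compare approach the comment outlines:

```go
package main

import (
	"fmt"
	"slices"
)

// equalFeatureSets treats the two slices as sets: it clones and sorts
// copies (leaving the inputs untouched) before comparing, so identical
// features in different orders compare equal.
func equalFeatureSets(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := slices.Clone(a)
	bs := slices.Clone(b)
	slices.Sort(as)
	slices.Sort(bs)
	return slices.Equal(as, bs)
}

func main() {
	fmt.Println(equalFeatureSets([]string{"lazy", "relay"}, []string{"relay", "lazy"})) // true
	fmt.Println(equalFeatureSets([]string{"lazy"}, []string{"relay"}))                  // false
}
```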

In `@shared/management/http/api/openapi.yml`:
- Around line 383-431: The OpenAPI schema omits explicit default values for
documented defaults; add default: 900 (integer) to p2p_retry_max_seconds,
default: true (boolean) to legacy_lazy_fallback_enabled, and default: 3600
(integer) to legacy_lazy_fallback_timeout_seconds in the YAML so generated
clients/docs reflect the described defaults (keep the keys as
p2p_retry_max_seconds, legacy_lazy_fallback_enabled,
legacy_lazy_fallback_timeout_seconds and ensure types/formats remain unchanged).
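The requested change amounts to three default: keys; a hypothetical, abbreviated fragment against the field names the comment lists (real entries live inside the full schema and carry descriptions/formats):

```yaml
p2p_retry_max_seconds:
  type: integer
  default: 900
legacy_lazy_fallback_enabled:
  type: boolean
  default: true
legacy_lazy_fallback_timeout_seconds:
  type: integer
  default: 3600
```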

Comment thread client/iface/bind/activity.go
Comment on lines +292 to +329
if newMode == e.mode && newRelay == e.relayTimeoutSecs &&
newP2P == e.p2pTimeoutSecs && newP2pRetry == e.p2pRetryMaxSecs {
return nil
}
prev := e.mode
e.mode = newMode
e.relayTimeoutSecs = newRelay
e.p2pTimeoutSecs = newP2P
e.p2pRetryMaxSecs = newP2pRetry
e.propagateP2pRetryMaxToConns()

wasManaged := modeUsesLazyMgr(prev)
isManaged := modeUsesLazyMgr(newMode)
modeChanged := prev != newMode

if modeChanged && wasManaged && !isManaged {
log.Infof("lazy/dynamic connection manager disabled by management push (mode=%s)", newMode)
e.closeManager(ctx)
e.statusRecorder.UpdateLazyConnection(false)
return nil
}

if modeChanged && wasManaged && isManaged {
// Switching between lazy and dynamic at runtime: tear down the
// existing manager so initLazyManager picks up the new timeouts.
log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
e.closeManager(ctx)
e.statusRecorder.UpdateLazyConnection(false)
}

if isManaged && e.lazyConnMgr == nil {
if e.rosenpassEnabled {
log.Warnf("rosenpass enabled, ignoring lazy/dynamic mode push")
return nil
}
log.Infof("lazy/dynamic connection manager enabled by management push (mode=%s)", newMode)
e.initLazyManager(ctx)
e.startModeSideEffects()

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart or live-reconfigure the manager when only inactivity timeouts change.

This path only restarts the lazy manager on modeChanged, but initLazyManager() snapshots relayTimeoutSecs and p2pTimeoutSecs into manager.Config once. If management changes just RelayTimeoutSeconds or P2PTimeoutSeconds, the new values are stored on ConnMgr but never reach the running inactivity manager until a full restart or a later mode flip.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 292 - 329, The code only restarts
the lazy manager when modeChanged, so updates to relayTimeoutSecs or
p2pTimeoutSecs on ConnMgr never get propagated into the running manager.Config;
modify the update path in the function that sets
newMode/newRelay/newP2P/newP2pRetry (the block around
propagateP2pRetryMaxToConns, modeUsesLazyMgr, closeManager, initLazyManager,
startModeSideEffects and lazyConnMgr) to detect when the inactivity timeouts
(relayTimeoutSecs or p2pTimeoutSecs) change while isManaged is true and
lazyConnMgr != nil and then restart the manager (call closeManager(ctx),
UpdateLazyConnection(false), then initLazyManager(ctx) and
startModeSideEffects()) so the new timeout values are snapshotted into the
manager.Config even when mode did not change.

Comment thread client/internal/conn_mgr.go
Comment on lines +115 to +122
func newSessionID() uint64 {
var b [8]byte
for {
_, _ = rand.Read(b[:])
if id := binary.BigEndian.Uint64(b[:]); id != 0 {
return id
}
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle rand.Read failures instead of retrying forever.

If rand.Read errors here, b remains zeroed and this loop spins forever during pusher construction. A small fallback or propagated error avoids hanging startup.

Suggested fix
 func newSessionID() uint64 {
 	var b [8]byte
 	for {
-		_, _ = rand.Read(b[:])
+		if _, err := rand.Read(b[:]); err != nil {
+			if id := uint64(time.Now().UnixNano()); id != 0 {
+				return id
+			}
+			continue
+		}
 		if id := binary.BigEndian.Uint64(b[:]); id != 0 {
 			return id
 		}
 	}
 }
🤖 Prompt for AI Agents

In `@client/internal/conn_state_pusher.go` around lines 115 - 122, newSessionID
currently ignores errors from rand.Read and can loop forever; change
newSessionID to return (uint64, error), check the error from rand.Read and
return it instead of retrying, and keep the non-zero check for the generated id;
update any callers that call newSessionID (e.g., the pusher/constructor code in
conn_state_pusher) to handle the error by propagating it or providing a safe
fallback (e.g., abort startup with a clear error) so startup cannot hang
indefinitely.
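A sketch of the error-returning variant the prompt describes, assuming callers are updated to propagate the error instead of hanging:

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// newSessionID returns a random non-zero 64-bit id, propagating
// rand.Read failures to the caller instead of retrying forever on a
// zeroed buffer. The loop only repeats on the (astronomically rare)
// all-zero read.
func newSessionID() (uint64, error) {
	var b [8]byte
	for {
		if _, err := rand.Read(b[:]); err != nil {
			return 0, fmt.Errorf("read session id entropy: %w", err)
		}
		if id := binary.BigEndian.Uint64(b[:]); id != 0 {
			return id, nil
		}
	}
}

func main() {
	id, err := newSessionID()
	fmt.Println(id != 0, err == nil)
}
```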

Comment on lines +153 to +164
// OnSnapshotRequest enqueues a snapshot-request nonce. Non-blocking,
// coalescing — multiple requests in flight result in a single full
// snapshot with the latest nonce echoed. Same nil-guard rationale as
// OnPeerStateChange.
func (p *connStatePusher) OnSnapshotRequest(nonce uint64) {
if p == nil {
return
}
select {
case p.snapshotReq <- nonce:
default:
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drain snapshotReq to the newest nonce before sending the full snapshot.

The comment promises latest-nonce coalescing, but the loop flushes the first queued nonce it reads. Under back-to-back refreshes, the caller can miss its own InResponseToNonce.

Suggested fix
 		case nonce := <-p.snapshotReq:
+			for {
+				select {
+				case newer := <-p.snapshotReq:
+					nonce = newer
+				default:
+					goto flushSnapshot
+				}
+			}
+		flushSnapshot:
 			if p.source != nil {
 				p.flushFull(p.source.SnapshotAllRemotePeers(), nonce)
 			}
 			interval = p.tuning.baseInterval

Also applies to: 245-248

🤖 Prompt for AI Agents

In `@client/internal/conn_state_pusher.go` around lines 153 - 164, The
OnSnapshotRequest method currently enqueues a nonce but can leave an old nonce
in the channel causing the pusher to process a stale value; change
connStatePusher.OnSnapshotRequest to coalesce by draining snapshotReq before
sending: perform a non-blocking loop that reads and discards enqueued nonces
keeping only the most recent value, then non-blocking-send that newest nonce
into p.snapshotReq (so callers' latest InResponseToNonce is preserved). Apply
the same drain-to-newest fix to the other similar handler using the same pattern
(the peer/state-change request channel referenced around lines 245-248).
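The drain-to-newest coalescing can be sketched with a hypothetical drainToNewest helper (the real fix would inline this in the pusher's select loop):

```go
package main

import "fmt"

// drainToNewest reads everything queued on ch without blocking and
// keeps only the last (newest) nonce, starting from the value already
// received, so the caller's latest InResponseToNonce is the one echoed.
func drainToNewest(first uint64, ch chan uint64) uint64 {
	nonce := first
	for {
		select {
		case newer := <-ch:
			nonce = newer
		default:
			return nonce
		}
	}
}

func main() {
	ch := make(chan uint64, 4)
	ch <- 11
	ch <- 12
	ch <- 13
	first := <-ch
	fmt.Println(drainToNewest(first, ch)) // 13: stale nonces 11 and 12 are discarded
}
```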

Comment on lines +231 to +280
if req.Settings.ConnectionMode != nil {
modeStr := string(*req.Settings.ConnectionMode)
if !req.Settings.ConnectionMode.Valid() {
return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
}
// Persist as the canonical string. Important: returnSettings
// is a fresh struct built from scratch by this handler -- if
// the request body omits connection_mode (or sets JSON null,
// which deserializes to a nil pointer), this whole block is
// skipped AND returnSettings.ConnectionMode stays nil, which
// the storage layer interprets as "clear the override". To
// preserve the existing value the caller must include the
// current value explicitly in the PUT body. This is also true
// for the four timeout fields below.
s := modeStr
returnSettings.ConnectionMode = &s
}
if req.Settings.P2pTimeoutSeconds != nil {
v, err := validateUint32Timeout("p2p_timeout_seconds", *req.Settings.P2pTimeoutSeconds)
if err != nil {
return nil, err
}
returnSettings.P2pTimeoutSeconds = &v
}
if req.Settings.P2pRetryMaxSeconds != nil {
v, err := validateUint32Timeout("p2p_retry_max_seconds", *req.Settings.P2pRetryMaxSeconds)
if err != nil {
return nil, err
}
returnSettings.P2pRetryMaxSeconds = &v
}
if req.Settings.RelayTimeoutSeconds != nil {
v, err := validateUint32Timeout("relay_timeout_seconds", *req.Settings.RelayTimeoutSeconds)
if err != nil {
return nil, err
}
returnSettings.RelayTimeoutSeconds = &v
}
if req.Settings.LegacyLazyFallbackEnabled != nil {
returnSettings.LegacyLazyFallbackEnabled = *req.Settings.LegacyLazyFallbackEnabled
}
if req.Settings.LegacyLazyFallbackTimeoutSeconds != nil {
// Phase 3.7i (#5989): legacy fallback timeout. Range chosen to
// match the range an admin would plausibly set on a metered LTE
// fleet: 60s lower bound (anything shorter just hammers
// signaling), 86400s upper bound (24h - longer than that and the
// fallback is effectively "never tear down").
v := *req.Settings.LegacyLazyFallbackTimeoutSeconds
if v < 60 || v > 86400 {
return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return InvalidArgument for these new validation failures.

These branches return plain fmt.Errorf, so updateAccount() hands util.WriteError a generic error instead of a typed management/status error. Bad input for connection_mode and the timeout fields will therefore go down the 5xx path instead of returning a 4xx validation response.

💡 Minimal fix
 	if req.Settings.ConnectionMode != nil {
 		modeStr := string(*req.Settings.ConnectionMode)
 		if !req.Settings.ConnectionMode.Valid() {
-			return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+			return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
 		}
@@
 	if req.Settings.LegacyLazyFallbackTimeoutSeconds != nil {
@@
 		v := *req.Settings.LegacyLazyFallbackTimeoutSeconds
 		if v < 60 || v > 86400 {
-			return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+			return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
 		}
 		returnSettings.LegacyLazyFallbackTimeoutSeconds = uint32(v)
 	}
🤖 Prompt for AI Agents

In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
231 - 280, Replace plain fmt.Errorf errors with gRPC InvalidArgument status
errors so bad input yields a 4xx validation response: change the connection_mode
error in the accounts handler to return status.Errorf(codes.InvalidArgument,
"invalid connection_mode %q", modeStr) and change the
legacy_lazy_fallback_timeout_seconds branch to return
status.Errorf(codes.InvalidArgument, "invalid
legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v).
Also ensure validateUint32Timeout returns or surfaces InvalidArgument errors
(update validateUint32Timeout to return status.Errorf(codes.InvalidArgument,
...) for invalid ranges) and add imports for "google.golang.org/grpc/codes" and
"google.golang.org/grpc/status". This targets the connection_mode branch, the
legacy_lazy_fallback_timeout_seconds block, and the validateUint32Timeout
function so updateAccount -> util.WriteError receives typed InvalidArgument
errors.

Comment on lines +141 to +155
func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
s.mu.Lock()
defer s.mu.Unlock()
e, ok := s.maps[peerPubKey]
if !ok {
return nil, false
}
if since > 0 && e.m.GetInResponseToNonce() < since {
return nil, false
}
if s.clock.Now().Sub(e.updatedAt) > s.ttl {
delete(s.maps, peerPubKey)
return nil, false
}
return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expire stale entries before the nonce gate.

GetWithNonceCheck returns on InResponseToNonce < since before it evaluates TTL, so entries queried only through the refresh path never get evicted once they expire. That breaks the store's advertised TTL cleanup and can leave dead per-peer snapshots resident indefinitely.

🐛 Suggested fix
 func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	e, ok := s.maps[peerPubKey]
 	if !ok {
 		return nil, false
 	}
-	if since > 0 && e.m.GetInResponseToNonce() < since {
-		return nil, false
-	}
 	if s.clock.Now().Sub(e.updatedAt) > s.ttl {
 		delete(s.maps, peerPubKey)
 		return nil, false
 	}
+	if since > 0 && e.m.GetInResponseToNonce() < since {
+		return nil, false
+	}
 	return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
 }
🤖 Prompt for AI Agents

In `@management/server/peer_connections/store.go` around lines 141 - 155,
GetWithNonceCheck on MemoryStore checks the nonce gate before TTL, allowing
expired entries to be returned as nonce-old and never deleted; change
GetWithNonceCheck to first evaluate TTL (compare s.clock.Now().Sub(e.updatedAt)
> s.ttl and delete from s.maps if expired) before performing the since /
InResponseToNonce check so stale entries are evicted even when the nonce
condition would cause an early return; adjust references inside the
GetWithNonceCheck method that touch e, e.m, e.updatedAt, s.maps, s.ttl and
s.clock accordingly.

Comment thread management/server/peer.go
}
}

return nil, status.Errorf(status.Internal, "user %s has no access to peer %s under account %s", userID, peer.ID, accountID)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return an auth error here instead of status.Internal.

“No access to this peer” is an expected authorization outcome on this path. Returning status.Internal turns it into a 500-class failure for callers instead of a permission/not-found response.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` at line 1355, Replace the internal-server error
return with an authorization error: change the status code in the return that
currently uses status.Internal to an auth-level code (e.g.,
status.PermissionDenied) so callers get a permission response instead of a 500;
keep the existing message formatting that includes userID, peer.ID and accountID
in the return statement.

Comment on lines +483 to 492
if req := decryptedResp.GetSnapshotRequest(); req != nil {
c.snapMu.Lock()
cb := c.onSnapshotRequest
c.snapMu.Unlock()
if cb != nil {
cb(req.GetNonce())
}
}

if err := msgHandler(decryptedResp); err != nil {

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run the snapshot callback after applying the same SyncResponse.

SyncResponse can carry snapshot_request alongside ordinary sync data. Invoking cb() first lets the client publish a snapshot built from pre-update peer state when both travel in the same frame.

Suggested ordering change
-		if req := decryptedResp.GetSnapshotRequest(); req != nil {
-			c.snapMu.Lock()
-			cb := c.onSnapshotRequest
-			c.snapMu.Unlock()
-			if cb != nil {
-				cb(req.GetNonce())
-			}
-		}
-
 		if err := msgHandler(decryptedResp); err != nil {
 			log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
 		}
+
+		if req := decryptedResp.GetSnapshotRequest(); req != nil {
+			c.snapMu.Lock()
+			cb := c.onSnapshotRequest
+			c.snapMu.Unlock()
+			if cb != nil {
+				cb(req.GetNonce())
+			}
+		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/client/grpc.go` around lines 483 - 492, The snapshot
callback is invoked before applying the SyncResponse which can cause snapshots
to be built from pre-update state; move the c.onSnapshotRequest invocation (the
cb from c.onSnapshotRequest and decryptedResp.GetSnapshotRequest()) to after the
call to msgHandler(decryptedResp) so the SyncResponse is applied first, i.e.,
call msgHandler(decryptedResp) and only if it returns nil then acquire c.snapMu,
read c.onSnapshotRequest into cb, release the lock and invoke cb(req.GetNonce())
when decryptedResp.GetSnapshotRequest() is non-nil.

Comment on lines +41 to +63
// Defines values for AccountSettingsConnectionMode.
const (
AccountSettingsConnectionModeP2p AccountSettingsConnectionMode = "p2p"
AccountSettingsConnectionModeP2pDynamic AccountSettingsConnectionMode = "p2p-dynamic"
AccountSettingsConnectionModeP2pLazy AccountSettingsConnectionMode = "p2p-lazy"
AccountSettingsConnectionModeRelayForced AccountSettingsConnectionMode = "relay-forced"
)

// Valid indicates whether the value is a known member of the AccountSettingsConnectionMode enum.
func (e AccountSettingsConnectionMode) Valid() bool {
switch e {
case AccountSettingsConnectionModeP2p:
return true
case AccountSettingsConnectionModeP2pDynamic:
return true
case AccountSettingsConnectionModeP2pLazy:
return true
case AccountSettingsConnectionModeRelayForced:
return true
default:
return false
}
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expose p2p-dynamic-lazy in the generated connection-mode enum.

The #5989 contract in this PR still calls out p2p-dynamic-lazy, but AccountSettingsConnectionMode and Valid() only permit four values here. That leaves the HTTP API surface unable to represent the full mode set and generated clients will reject the missing mode if the backend starts using it. Please fix the OpenAPI source and regenerate this file rather than patching the generated output directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated
enum AccountSettingsConnectionMode and its Valid() method are missing the
p2p-dynamic-lazy value; update the OpenAPI spec so the connection-mode enum
includes "p2p-dynamic-lazy", regenerate the code, and ensure the generated
constants (e.g., AccountSettingsConnectionModeP2pDynamicLazy) and Valid() cover
that value (so Valid() returns true for
AccountSettingsConnectionModeP2pDynamicLazy) instead of manually editing
types.gen.go.
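
For reference, a standalone sketch of what the regenerated enum would look like once the OpenAPI spec carries all five modes; the constant names mirror oapi-codegen's naming convention but are assumptions about the regenerated output.

```go
package main

import "fmt"

// AccountSettingsConnectionMode with the full five-mode set from the #5989
// contract, including the p2p-dynamic-lazy value the current generated file
// omits. In the repo this comes from regenerating types.gen.go, not hand-edits.
type AccountSettingsConnectionMode string

const (
	ModeP2p            AccountSettingsConnectionMode = "p2p"
	ModeP2pDynamic     AccountSettingsConnectionMode = "p2p-dynamic"
	ModeP2pDynamicLazy AccountSettingsConnectionMode = "p2p-dynamic-lazy"
	ModeP2pLazy        AccountSettingsConnectionMode = "p2p-lazy"
	ModeRelayForced    AccountSettingsConnectionMode = "relay-forced"
)

// Valid reports whether the value is a known member of the enum.
func (e AccountSettingsConnectionMode) Valid() bool {
	switch e {
	case ModeP2p, ModeP2pDynamic, ModeP2pDynamicLazy, ModeP2pLazy, ModeRelayForced:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(ModeP2pDynamicLazy.Valid()) // true once the spec includes the mode
}
```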

@MichaelUray MichaelUray force-pushed the pr/d-hardening-activity-trigger branch 2 times, most recently from b07184d to 7f51b84 on May 6, 2026 07:18

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
management/server/peer/peer.go (1)

207-225: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

isEmpty omits newly added 3.7i metadata fields, which can suppress valid updates.

If a payload only carries the new effective mode/timeout/features fields, this function still returns empty and UpdateMetaIfNew exits early.

💡 Suggested fix
 func (p PeerSystemMeta) isEmpty() bool {
 	return p.Hostname == "" &&
 		p.GoOS == "" &&
 		p.Kernel == "" &&
 		p.Core == "" &&
 		p.Platform == "" &&
 		p.OS == "" &&
 		p.OSVersion == "" &&
 		p.WtVersion == "" &&
 		p.UIVersion == "" &&
 		p.KernelVersion == "" &&
 		len(p.NetworkAddresses) == 0 &&
 		p.SystemSerialNumber == "" &&
 		p.SystemProductName == "" &&
 		p.SystemManufacturer == "" &&
 		p.Environment.Cloud == "" &&
 		p.Environment.Platform == "" &&
-		len(p.Files) == 0
+		len(p.Files) == 0 &&
+		p.EffectiveConnectionMode == "" &&
+		p.EffectiveRelayTimeoutSecs == 0 &&
+		p.EffectiveP2PTimeoutSecs == 0 &&
+		p.EffectiveP2PRetryMaxSecs == 0 &&
+		len(p.SupportedFeatures) == 0
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer/peer.go` around lines 207 - 225, The isEmpty method on
PeerSystemMeta currently ignores the new 3.7i metadata fields so UpdateMetaIfNew
can mistakenly treat payloads with only those fields as empty; update
PeerSystemMeta.isEmpty to include the newly added fields (the effective mode,
effective timeout, and features fields added for 3.7i) in the emptiness checks
so that if any of those fields are set the method returns false and
UpdateMetaIfNew will process the update.
management/server/account.go (1)

416-442: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate ConnectionMode before persisting it.

validateSettingsUpdate never rejects unknown mode strings, but toPeerConfig() later ignores parse failures and falls back to LazyConnectionEnabled. That lets an invalid admin setting save successfully while clients keep using the old mode. Reject unknown values here so misconfiguration fails fast.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/account.go` around lines 416 - 442, validateSettingsUpdate
currently allows invalid ConnectionMode strings to be saved while toPeerConfig
silently falls back to LazyConnectionEnabled; update validateSettingsUpdate to
parse/validate newSettings.ConnectionMode (using the same parser used by
toPeerConfig) and return a status.InvalidArgument error when the parse fails
(include the invalid value in the message) so unknown modes are rejected before
persisting.
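
A minimal sketch of the fail-fast validation, assuming the five-mode set from the #5989 contract; the function name and error wording are illustrative, and the real code would share the parser already used by toPeerConfig.

```go
package main

import (
	"fmt"
	"strings"
)

var knownModes = []string{"p2p", "p2p-dynamic", "p2p-dynamic-lazy", "p2p-lazy", "relay-forced"}

// validateConnectionMode rejects unknown mode strings at save time, so a
// misconfigured admin setting fails fast instead of being persisted while
// clients silently fall back to the old mode.
func validateConnectionMode(mode string) error {
	if mode == "" {
		return nil // not set: nothing to validate
	}
	for _, m := range knownModes {
		if mode == m {
			return nil
		}
	}
	return fmt.Errorf("invalid connection mode %q, expected one of: %s",
		mode, strings.Join(knownModes, ", "))
}

func main() {
	fmt.Println(validateConnectionMode("p2p-lazy"))           // <nil>
	fmt.Println(validateConnectionMode("p2p-turbo") != nil)   // true: rejected before persisting
}
```

In the handler this error would map to status.InvalidArgument, echoing the offending value back to the admin.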
management/server/peer.go (1)

1311-1323: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reapply RegularUsersViewBlocked in GetPeer.

GetPeers returns nothing for restrictable users when settings.RegularUsersViewBlocked is enabled, but GetPeer now skips that gate and still allows owner/policy-reachable access if the caller knows a peer ID. That reopens a path the list endpoint explicitly blocks.

Suggested fix
 	user, err := am.Store.GetUserByUserID(ctx, store.LockingStrengthNone, userID)
 	if err != nil {
 		return nil, err
 	}
+
+	settings, err := am.Store.GetAccountSettings(ctx, store.LockingStrengthNone, accountID)
+	if err != nil {
+		return nil, fmt.Errorf("failed to get account settings: %w", err)
+	}
+	if user.IsRestrictable() && settings.RegularUsersViewBlocked {
+		return nil, status.NewPermissionDeniedError()
+	}
 
 	// admin/service-user, or the peer owner -- direct access.
 	if user.IsAdminOrServiceUser() || peer.UserID == userID {
 		return peer, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` around lines 1311 - 1323, GetPeer currently
bypasses the RegularUsersViewBlocked guard that GetPeers enforces, allowing
restrictable users to access a peer by ID; update GetPeer to reapply the same
check used in GetPeers: if settings.RegularUsersViewBlocked is enabled and the
caller user is a restrictable/regular user, deny access (return an appropriate
error) before performing the owner/admin or checkIfUserOwnsPeer logic. Locate
the gate near the user lookup (am.Store.GetUserByUserID / GetPeer) and perform
the setting + user-type check there so only admin/service users or callers
allowed by the existing Restrictable policy proceed to the owner/policy
reachability branch.
♻️ Duplicate comments (10)
client/internal/peer/worker_ice.go (1)

218-231: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Data race on lastKnownState.

IsConnected() reads w.lastKnownState under muxAgent, but onConnectionStateChange() writes that same field at lines 550 and 562 without holding muxAgent. With Conn.onNetworkChange now using IsConnected() to decide whether to tear down the agent, this is a real cross-goroutine race that can produce stale keep/close decisions. Acquire w.muxAgent in the state-change callback when touching lastKnownState (and agent while you're at it).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 218 - 231, IsConnected reads
WorkerICE.lastKnownState under w.muxAgent but onConnectionStateChange writes
that field without locking, causing a data race; update the
onConnectionStateChange callback to acquire w.muxAgent before touching
lastKnownState (and agent) so both reads/writes are synchronized with
IsConnected; specifically, wrap assignments at the locations that set
lastKnownState (and any writes to agent) in a w.muxAgent.Lock()/Unlock() pair
inside the onConnectionStateChange handler to eliminate the race.
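
A condensed sketch of the fix: both the state-change callback and the reader take the same mutex, so every access to lastKnownState is synchronized. Field and method names mirror the review comment, not the exact repo types, and the string state is a stand-in for the ICE connection-state enum.

```go
package main

import (
	"fmt"
	"sync"
)

type worker struct {
	muxAgent       sync.Mutex
	lastKnownState string
}

// onConnectionStateChange now takes muxAgent before writing lastKnownState;
// previously this write happened without the lock, racing with IsConnected.
func (w *worker) onConnectionStateChange(state string) {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()
	w.lastKnownState = state
}

// IsConnected reads lastKnownState under the same lock.
func (w *worker) IsConnected() bool {
	w.muxAgent.Lock()
	defer w.muxAgent.Unlock()
	return w.lastKnownState == "connected"
}

func main() {
	w := &worker{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.onConnectionStateChange("connected")
			_ = w.IsConnected()
		}()
	}
	wg.Wait()
	fmt.Println(w.IsConnected()) // true, and clean under `go test -race`
}
```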
client/android/preferences.go (1)

328-331: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Setters still write &0 / &"" instead of nil when "clearing".

The previous review flagged that SetRelayTimeoutSeconds(0) etc. always store a non-nil pointer to 0 in configInput, defeating the "Pass 0 to clear the override" contract. That bug persists in this revision and now also applies to SetConnectionMode(""):

  • SetConnectionMode("") at line 328-331 → &"" (docstring says empty-string clears the override)
  • SetRelayTimeoutSeconds(0) at line 352-355 → &0
  • SetP2pTimeoutSeconds(0) at line 373-376 → &0
  • SetP2pRetryMaxSeconds(0) at line 393-396 → &0

Per Config's docstring (config.go lines 180-199), local 0 means "follow server" — so persisting 0 is a different state than "no local override". For ConnectionMode, persisting "" similarly differs from "unset" in apply()'s comparison logic (config.go line 627).

🛡️ Proposed fix
 func (p *Preferences) SetConnectionMode(mode string) {
-	m := mode
-	p.configInput.ConnectionMode = &m
+	if mode == "" {
+		p.configInput.ConnectionMode = nil
+		return
+	}
+	m := mode
+	p.configInput.ConnectionMode = &m
 }
 
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
-	p.configInput.RelayTimeoutSeconds = &v
+	if v == 0 {
+		p.configInput.RelayTimeoutSeconds = nil
+		return
+	}
+	p.configInput.RelayTimeoutSeconds = &v
 }
 
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
-	p.configInput.P2pTimeoutSeconds = &v
+	if v == 0 {
+		p.configInput.P2pTimeoutSeconds = nil
+		return
+	}
+	p.configInput.P2pTimeoutSeconds = &v
 }
 
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
 	v := clampUint32Seconds(secs)
-	p.configInput.P2pRetryMaxSeconds = &v
+	if v == 0 {
+		p.configInput.P2pRetryMaxSeconds = nil
+		return
+	}
+	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 352-355, 373-376, 393-396

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 328 - 331, The setters are
currently storing pointers to zero-values (e.g., &"" or &0) which prevents
representing "unset" per Config's contract; update each setter
(SetConnectionMode, SetRelayTimeoutSeconds, SetP2pTimeoutSeconds,
SetP2pRetryMaxSeconds) to set the corresponding p.configInput field to nil when
the passed value is the clearing sentinel (empty string for ConnectionMode, 0
for the timeout/retry setters), otherwise allocate and store a pointer to the
provided value so callers can distinguish "unset" (nil) from an explicit zero
value.
client/internal/lazyconn/manager/manager.go (1)

102-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't treat explicit 0/0 timeouts as the legacy fallback (still applies).

This branch maps both new fields being zero to inactivity.NewManager(wgIface, config.InactivityThreshold). When the caller has migrated to the new API (InactivityThreshold == nil, ICEInactivityThreshold == 0, RelayInactivityThreshold == 0) — which is the legitimate "always-on, no inactivity teardown" configuration — this dereferences a nil pointer or restores the legacy 24h relay default depending on NewManager's signature. Either outcome silently violates the documented "0 disables that teardown path" contract.

Gate the legacy constructor on config.InactivityThreshold != nil instead:

Suggested fix
 	if wgIface.IsUserspaceBind() {
 		iceTO, relayTO := config.resolvedTimeouts()
-		if iceTO == 0 && relayTO == 0 {
+		if config.InactivityThreshold != nil &&
+			config.ICEInactivityThreshold == 0 &&
+			config.RelayInactivityThreshold == 0 {
 			// Phase 1 / single-timer fallback when caller hasn't been migrated.
 			m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold) //nolint:staticcheck // intentional Phase-1 single-timer fallback
 		} else {
 			m.inactivityManager = inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO)
 		}
 	} else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/manager/manager.go` around lines 102 - 109, The
current branch treats iceTO==0 && relayTO==0 as the legacy single-timer fallback
and calls inactivity.NewManager(wgIface, config.InactivityThreshold), which
incorrectly runs when the caller has migrated and intentionally set
ICEInactivityThreshold==0 and RelayInactivityThreshold==0 with
InactivityThreshold==nil (meaning “disable those teardowns”); update the logic
in the manager initialization to only call inactivity.NewManager(...) when
config.InactivityThreshold != nil (preserving the legacy behavior), otherwise
call inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) (so explicit
0/0 from resolvedTimeouts() is honored as “disable”), and keep the existing use
of wgIface, config.resolvedTimeouts(), inactivity.NewManager and
inactivity.NewManagerWithTwoTimers identifiers to locate the change.
management/internals/shared/grpc/conversion.go (1)

315-327: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only mark liveness as authoritative when rPeer.Status exists.

LiveOnline is populated only inside the nil guard, but ServerLivenessKnown is forced to true for every peer. When status is absent, new clients will interpret that as “server knows this peer is offline” instead of “liveness unknown”.

Suggested fix
 		if rPeer.Status != nil {
 			if !rPeer.Status.LastSeen.IsZero() {
 				cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
 			}
 			cfg.LiveOnline = rPeer.Status.Connected
+			cfg.ServerLivenessKnown = true
 		}
-		cfg.ServerLivenessKnown = true
 		dst = append(dst, cfg)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/conversion.go` around lines 315 - 327, The
code sets cfg.ServerLivenessKnown = true unconditionally even when rPeer.Status
is nil; change this so ServerLivenessKnown is set to true only when rPeer.Status
!= nil (i.e. inside the existing nil-guard where LiveOnline is populated) or
explicitly set to false when rPeer.Status is nil, ensuring the conversion logic
(the block handling rPeer, cfg and rPeer.Status) only marks server liveness
authoritative when status exists.
client/internal/conn_mgr.go (2)

287-341: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart the running lazy manager when only relay/ICE timeouts change.

initLazyManager() snapshots relayTimeoutSecs and p2pTimeoutSecs into the manager config once. In this update path a pure timeout change reaches e.relayTimeoutSecs/e.p2pTimeoutSecs, but because modeChanged == false the running manager is left on the old thresholds until some unrelated restart.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 287 - 341, When timeouts change but
mode stays the same you must restart an already-running lazy manager so it picks
up the new relay/p2p/retry thresholds: after computing
newMode/newRelay/newP2P/newP2pRetry and before the branch that only handles
e.lazyConnMgr == nil, add a check for isManaged && e.lazyConnMgr != nil &&
(newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs || newP2pRetry !=
e.p2pRetryMaxSecs) and in that case log the restart, call e.closeManager(ctx)
(and e.statusRecorder.UpdateLazyConnection(false) if you need parity with other
restarts), then call e.initLazyManager(ctx) and e.startModeSideEffects() so the
running manager picks up the new timeouts; keep the existing assignments to
e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs and the earlier
e.propagateP2pRetryMaxToConns() call.
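
The restart decision described above can be isolated into a small predicate; this is a sketch under the assumption that mode changes already go through the existing close/init path, and the type and function names are illustrative rather than the repo's.

```go
package main

import "fmt"

// timeouts snapshots the three values the lazy manager captures at init.
type timeouts struct {
	relaySecs, p2pSecs, p2pRetryMaxSecs uint32
}

// needsManagerRestart reports whether a running lazy manager must be closed
// and re-initialized: the mode is unchanged, but at least one pushed timeout
// differs from the manager's snapshot.
func needsManagerRestart(managerRunning, modeChanged bool, current, pushed timeouts) bool {
	if !managerRunning || modeChanged {
		return false // mode changes are handled by the existing restart path
	}
	return current != pushed
}

func main() {
	cur := timeouts{relaySecs: 600, p2pSecs: 180, p2pRetryMaxSecs: 300}
	pushed := timeouts{relaySecs: 120, p2pSecs: 180, p2pRetryMaxSecs: 300}
	fmt.Println(needsManagerRestart(true, false, cur, pushed)) // true: pick up the new relay timeout
}
```

When the predicate fires, the update path would log the restart, close the manager, and re-run init so the new thresholds are captured.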

296-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't keep a lazy/dynamic resolved mode when Rosenpass blocks it.

This branch returns after e.mode and the timeout fields have already been updated. The daemon can then report p2p-lazy/p2p-dynamic even though no lazy manager is running and peers were left on the eager path. Normalize or reject the pushed mode before mutating live state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 296 - 326, The code currently
updates e.mode and timeout fields before checking e.rosenpassEnabled, so a
pushed lazy/dynamic mode can be recorded even when Rosenpass prevents running a
lazy manager; move the Rosenpass check ahead of mutating live state: first
evaluate modeUsesLazyMgr(newMode) and if it returns true and e.rosenpassEnabled
is true, log the warning and return (or reject the push) without changing e.mode
or any timeout fields; only after this check mutate e.mode, e.relayTimeoutSecs,
e.p2pTimeoutSecs, e.p2pRetryMaxSecs and call e.propagateP2pRetryMaxToConns(),
then proceed with the existing manager start/stop logic (references: e.mode,
modeUsesLazyMgr, e.rosenpassEnabled, e.relayTimeoutSecs, e.p2pTimeoutSecs,
e.p2pRetryMaxSecs, e.propagateP2pRetryMaxToConns, e.closeManager, e.lazyConnMgr,
e.statusRecorder.UpdateLazyConnection).
management/internals/shared/grpc/server.go (1)

487-489: ⚠️ Potential issue | 🔴 Critical

Closed snapshotCh still busy-loops on fast reconnect.

When SnapshotRouter.Register() closes the old channel, this branch keeps snapshotCh live in the select, so the loop can spin immediately on the closed case until the stream exits.

Suggested fix
 		case nonce, ok := <-snapshotCh:
 			if !ok {
+				snapshotCh = nil
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/server.go` around lines 487 - 489, The
select loop spins on a closed snapshotCh because the receive case stays
selectable; in the case handling "case nonce, ok := <-snapshotCh:" when ok is
false set snapshotCh to nil (or otherwise remove it from the select) so the
closed channel is no longer selected, then continue; update the select loop in
server.go to assign snapshotCh = nil on close (referencing snapshotCh and the
register flow from SnapshotRouter.Register()) so fast reconnects don't
busy-loop.
management/server/peer.go (1)

1355-1355: ⚠️ Potential issue | 🟠 Major

Return an auth/not-found status here, not status.Internal.

This still turns a normal access miss into a 500-class error for callers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` at line 1355, The current return uses
status.Internal which incorrectly produces a 500 for an access miss; change the
status to an auth/not-found code (e.g. status.NotFound or the equivalent
not-found constant used in your status package) so callers receive a 404-style
response instead of Internal. Update the return that mentions userID, peer.ID
and accountID to use the not-found status (or PermissionDenied if you prefer
explicit auth failure) while keeping the same error message.
client/ui/client_ui.go (2)

889-907: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timeout-field gating still reads the raw dropdown label, not the effective mode.

When the dropdown sits on Follow server (or Follow server (currently: <mode>)), the switch falls into default and disables all three timeout entries even when serverPushedMode is p2p-lazy / p2p-dynamic. That blocks legitimate client-side overrides while following the server. Switch on the effective mode (use selectedConnectionMode(), falling back to serverPushedMode when it returns "") instead of s.sConnectionMode.Selected.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/client_ui.go` around lines 889 - 907, The timeout gating currently
switches on the raw dropdown label s.sConnectionMode.Selected; change
updateTimeoutEntriesEnabled to determine the effective mode first by calling
s.selectedConnectionMode() and, if that returns an empty string, falling back to
s.serverPushedMode, then switch on that effective mode (instead of
s.sConnectionMode.Selected) to enable/disable s.iRelayTimeout, s.iP2pTimeout and
s.iP2pRetryMax so following-the-server modes like p2p-lazy/p2p-dynamic are
honored.
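
The suggested gating can be reduced to two small helpers; this is a sketch with illustrative names, assuming the mode strings from the #5989 contract and that an empty selection means "Follow server".

```go
package main

import "fmt"

// effectiveConnectionMode prefers the explicit dropdown selection and falls
// back to the server-pushed mode when the selection is empty ("Follow server").
func effectiveConnectionMode(selected, serverPushed string) string {
	if selected != "" {
		return selected
	}
	return serverPushed
}

// timeoutEntriesEnabled gates the three timeout fields on the effective mode,
// not on the raw dropdown label.
func timeoutEntriesEnabled(mode string) bool {
	switch mode {
	case "p2p-lazy", "p2p-dynamic", "p2p-dynamic-lazy":
		return true
	default:
		return false
	}
}

func main() {
	// Dropdown on "Follow server" (empty selection) while the server pushes p2p-lazy:
	mode := effectiveConnectionMode("", "p2p-lazy")
	fmt.Println(timeoutEntriesEnabled(mode)) // true: client-side overrides stay editable
}
```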

656-666: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Invalid timeout input still silently collapses to 0 (no override).

parseUint32Field continues to swallow parse errors and returns 0, so a typo like "5m" or "abc" is persisted as “no override” and hasConnectionModeChanges then sees no change when the stored value is also 0. The user gets no validation feedback. Please surface a parse error to the caller and reject the save instead of mapping bad input to the sentinel.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/client_ui.go` around lines 656 - 666, The parseUint32Field function
currently swallows parse errors and returns 0; change its signature to return
(uint32, error) (e.g., parseUint32Field(text string) (uint32, error)), trim and
if empty keep returning (0, nil) but on strconv.ParseUint failure return (0,
fmt.Errorf("invalid uint32: %w", err)); update callers (such as
hasConnectionModeChanges and the save/validate path) to handle the error, reject
the save and show validation feedback to the user instead of silently treating
bad input as the sentinel 0.
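
A sketch of the suggested signature change: empty input still means "no override" (0, nil), but malformed input now returns an error the save path can surface instead of silently storing the sentinel 0.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseUint32Field parses a timeout entry. Empty text is the documented
// "no override" state; anything else must be a valid base-10 uint32.
func parseUint32Field(text string) (uint32, error) {
	text = strings.TrimSpace(text)
	if text == "" {
		return 0, nil
	}
	v, err := strconv.ParseUint(text, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid number %q: %w", text, err)
	}
	return uint32(v), nil
}

func main() {
	if _, err := parseUint32Field("5m"); err != nil {
		fmt.Println("rejected:", err) // the save is blocked and the user gets feedback
	}
}
```

Callers such as hasConnectionModeChanges would check the error, abort the save, and show the validation message rather than comparing a collapsed 0 against a stored 0.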
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/internal/conn_state_pusher.go`:
- Around line 293-298: The Push calls in flushDelta and flushFull use
context.Background() which can block the loop and prevent Stop() from returning;
change those Push invocations (p.sink.Push(...)) to use a cancelable/timeout
context derived from the pusher's lifecycle context (e.g., p.ctx or a context
created in NewPusher) or wrap each call with context.WithTimeout so that Stop()
can cancel the parent context and pending Push calls unblock; ensure Stop()
cancels that context (or respects the timeout) before waiting on wg so wg.Wait()
can return.

In `@client/internal/engine_offline_debounce_test.go`:
- Around line 77-97: The tests read e.peerOfflineDebounce without holding the
engine's debounce mutex, causing races; update the assertions in
TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp and
TestCancelAllRemoteOfflineCloses_ClearsEverything to acquire the engine's mutex
that guards peerOfflineDebounce (use the same lock used by
scheduleRemoteOfflineClose/cancelAllRemoteOfflineCloses, e.g.
e.mu.Lock()/Unlock() or e.mu.RLock()/RUnlock()) around each
len(e.peerOfflineDebounce) check and around the setup verification so the map
reads are synchronized with
scheduleRemoteOfflineClose/cancelRemoteOfflineClose/cancelAllRemoteOfflineCloses.

In `@client/internal/lazyconn/inactivity/manager.go`:
- Around line 83-95: NewManagerWithTwoTimers currently accepts iceTimeout and
relayTimeout without enforcing MinimumInactivityThreshold, so short durations
(e.g., 30s) can be silently violated; update NewManagerWithTwoTimers to validate
and enforce the 1-minute floor by either clamping inputs to
MinimumInactivityThreshold or returning an error/zero for invalid values before
calling newManager: call validateInactivityThreshold (or inline the same check)
for both iceTimeout and relayTimeout, log or return when values are below
MinimumInactivityThreshold, and ensure newManager always receives durations >=
MinimumInactivityThreshold (or zero if caller intended to disable a path).
Ensure you reference the symbols MinimumInactivityThreshold,
validateInactivityThreshold, NewManagerWithTwoTimers, and newManager when making
the change.

In `@client/internal/peer/conn.go`:
- Around line 1213-1219: AttachICEOnRelayActivity drops conn.mu then calls
AttachICE(), but AttachICE() doesn't re-check conn state (opened, ctx.Err(),
current transport), so a concurrent Close() or transport transition can cause
reattachment on a torn-down connection; update AttachICE to re-acquire the
connection mutex (or otherwise validate under lock) at its start and return
without attaching if conn.opened is false, ctx.Err() != nil, or the
transport/state has changed from the expected relay state, and apply the same
revalidation logic to other callers (e.g., the similar relay handler paths
referenced around the other relay-related callers) to prevent attaching ICE
after teardown or transport transition.

In `@client/internal/peer/status.go`:
- Around line 309-319: The helper notifyPeerListChanged currently claims "Caller
must hold d.mux" which encourages holding the lock across listener re-entry;
change its contract to "may be called with or without d.mux held; prefer after
unlock to avoid listener re-entry deadlock" and update implementations to
snapshot the peer count under the lock and call
d.notifier.peerListChanged(numPeers) after d.mux.Unlock(); specifically, stop
calling notifyPeerListChanged while d.mux is held in UpdatePeerRemoteMeta and
similar sites—replace those calls by capturing num := d.numOfPeers() while
locked and then call d.notifier.peerListChanged(num) after unlocking, or keep
notifyPeerListChanged as a thin wrapper but remove the "must hold d.mux" claim
and document the preferred-after-unlock usage.

In `@client/internal/stdnet/filter_test.go`:
- Around line 51-66: The test currently makes asymmetric assertions over the
table-driven "cases" using allow(c.name) and only fails for one false-positive
and a single hardcoded Windows false-negative; change the loop in the test to
assert symmetrically (if got != c.want { t.Errorf("InterfaceFilter(%q) = %v,
want %v", c.name, got, c.want) }) so every case is verified, and for any rows
intended only for Windows (e.g., the "vEthernet (LAN)"/other Windows-only cases)
wrap or skip those rows with runtime.GOOS == "windows" (or t.Skipf when not
windows) so platform-specific expectations are gated; also update the misleading
comment around the veth/vEthernet behavior to reflect the actual non-Windows
filtering behavior.
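
The symmetric table-driven shape can be sketched like this; `allow` stands in for the real InterfaceFilter, and `checkCases` returns the failure messages a `t.Errorf` loop would emit, with Windows-only rows gated on the GOOS value.

```go
package main

import (
	"fmt"
	"strings"
)

type filterCase struct {
	name        string
	want        bool
	windowsOnly bool
}

// allow is an illustrative filter, not the repo's InterfaceFilter.
func allow(name string) bool {
	return !strings.HasPrefix(name, "veth") && !strings.HasPrefix(name, "vEthernet")
}

// checkCases asserts every row symmetrically: got != want is always a failure,
// and platform-specific rows are skipped unless goos matches (mirrors t.Skipf).
func checkCases(cases []filterCase, goos string) []string {
	var failures []string
	for _, c := range cases {
		if c.windowsOnly && goos != "windows" {
			continue
		}
		if got := allow(c.name); got != c.want {
			failures = append(failures,
				fmt.Sprintf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want))
		}
	}
	return failures
}

func main() {
	cases := []filterCase{
		{name: "eth0", want: true},
		{name: "veth12ab", want: false},
		{name: "vEthernet (LAN)", want: false, windowsOnly: true},
	}
	fmt.Println(len(checkCases(cases, "linux"))) // 0: all non-gated rows pass symmetrically
}
```

In the real test the loop body is simply `if got != c.want { t.Errorf(...) }` with `runtime.GOOS` supplying the platform.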

In `@client/server/server.go`:
- Around line 1518-1538: The comment claims “All zero/empty when the engine has
not received PeerConfig yet” but ConnMgr.ServerPushedP2pRetryMaxSecs() returns a
non‑zero default, so either document the asymmetry or expose a raw accessor and
use it; implement ConnMgr.ServerPushedP2pRetryMaxSecsRaw() that returns the
internal serverPushedP2pRetryMaxSecs without fallback, then in server.go
populate spP2pRetMax from that raw accessor (or alternatively update the block
comment to explicitly state that spP2pRetMax will default to DefaultP2PRetryMax
when no PeerConfig has been received).

In `@client/ui/network.go`:
- Around line 97-103: In OnUnselected, don't resolve the grid via
tabs.Selected() (which can point at the newly selected tab) — use the provided
item parameter to determine which grid to clear so we clear the tab being left;
update the call site (OnUnselected) to pass the unselected TabItem into
getGridAndFilterFromTab (or add a helper that maps a *container.TabItem to the
corresponding grid) and then clear that returned grid (e.g., grid.Objects = nil)
for the unselected tab; reference symbols: OnUnselected, item
(*container.TabItem), getGridAndFilterFromTab, tabs.Selected(), and grids
allGrid/overlappingGrid/exitNodeGrid to locate and fix the logic.

In `@management/server/account/manager.go`:
- Around line 107-110: GetPeerByPubKey currently lacks the requesting user's
identity, so manager code cannot apply per-user visibility/RBAC when returning
peer metadata; update the GetPeerByPubKey signature (or its calling flow) to
accept and thread the requester identity (e.g., userID or a principal) from the
REST handlers through the manager layer (or extract it reliably from ctx) and
enforce the same visibility checks used by GetPeer before resolving/returning
nbpeer.Peer details to prevent leaking FQDNs/peer names.

In `@management/server/http/handlers/accounts/accounts_handler_test.go`:
- Around line 408-426: Fix the gofmt failure by correcting the indentation of
the struct literal in TestAccountsHandler_PutSettings_P2pRetryMax: align the
fields LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds to match
the surrounding entries (same indentation as LazyConnectionEnabled, DnsDomain,
etc.) so the colon columns line up; after adjusting those two lines, run gofmt
-w to reformat the entire api.AccountSettings literal.

In `@management/server/http/handlers/peer_connections/handler_test.go`:
- Around line 40-45: The fakeAM.GetPeerByPubKey currently ignores the accountID
parameter which can hide cross-account lookup bugs; update GetPeerByPubKey to
validate the provided accountID the same way fakeAM.GetPeer does: look up the
peer by pubKey (from a.peersByKey or similar store) and ensure its AccountID
matches the incoming accountID, returning a not-found (or error) when the peer
is missing or the account IDs differ; reference the GetPeerByPubKey and GetPeer
methods to mirror the account check and error semantics.

In `@management/server/http/handlers/peer_connections/handler.go`:
- Around line 85-88: The call to h.account.GetPeer currently maps every error to
a 404; change the handling in the GetPeer call sites (the handler block around
peer, err := h.account.GetPeer(...) and the similar block at the other
occurrence) to inspect the returned error/type instead of blanketing it: detect
and return http.StatusNotFound only when the error indicates a true "not found"
or an intentionally-hidden permission denial (e.g., errors.Is(err,
store.ErrNotFound) or checking a management error code/Status() on the error
object), otherwise log the error and return http.StatusInternalServerError (or
the appropriate status for permission/validation errors). Ensure you preserve
context in the log and do not leak internal details to clients.
- Line 92: The handler currently ignores parse errors from strconv.ParseUint and
treats malformed ?since=... as 0; change the ParseUint call in the peer
connections handler to check the error returned (the
strconv.ParseUint(r.URL.Query().Get("since"), 10, 64) call) and, if the query
value is non-empty and parsing fails, respond with HTTP 400 (Bad Request)
indicating an invalid since parameter instead of proceeding with since=0; keep
successful parsing behavior unchanged and only reject when the since parameter
is present but malformed.

In `@management/server/peer/peer.go`:
- Around line 199-204: The equality check currently uses slices.Equal on
SupportedFeatures which is order-sensitive; change it to an order-insensitive
set comparison (e.g., convert p.SupportedFeatures and other.SupportedFeatures
into maps/sets or sort them before comparing) so two feature lists with the same
elements in different orders are treated equal; update the equality logic
surrounding EffectiveConnectionMode/.../SupportedFeatures and keep the existing
p.Flags.isEqual call unchanged.
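One way to make the comparison order-insensitive is a multiset compare via counting, sketched below as a plausible stand-in for the flagged `slices.Equal` call (the helper name is an assumption):

```go
package main

import "fmt"

// featuresEqual treats two SupportedFeatures slices as equal when they
// contain the same elements regardless of order (and with the same
// multiplicities), avoiding the order-sensitivity of slices.Equal.
func featuresEqual(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[string]int, len(a))
	for _, f := range a {
		counts[f]++
	}
	for _, f := range b {
		counts[f]--
		if counts[f] < 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(featuresEqual([]string{"lazy", "dynamic"}, []string{"dynamic", "lazy"})) // true
	fmt.Println(featuresEqual([]string{"lazy"}, []string{"dynamic"}))                    // false
}
```

Sorting copies of both slices before `slices.Equal` would work too; the counting form avoids mutating or reallocating on every equality check.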

In `@management/server/store/sql_store.go`:
- Around line 1641-1652: The DB int64→uint32 casts (sRelayTimeoutSeconds,
sP2pTimeoutSeconds, sP2pRetryMaxSeconds and the meta/legacy fields
metaEffectiveRelayTimeoutSecs, metaEffectiveP2PTimeoutSecs,
metaEffectiveP2PRetryMaxSecs and sLegacyLazyFallbackTimeoutSecs) must be guarded
against negative values and values > math.MaxUint32 before assigning into
account.Settings (RelayTimeoutSeconds, P2pTimeoutSeconds, P2pRetryMaxSeconds,
etc.); change each block to check that the source.Int64 is >= 0 and <=
math.MaxUint32 and only then create the uint32 v and assign &v to the
corresponding account.Settings field (otherwise leave the pointer nil or handle
as per existing pattern).

In `@shared/management/http/api/openapi.yml`:
- Around line 365-405: The OpenAPI schema's connection_mode enum is missing the
new value "p2p-dynamic-lazy" and several timeout field descriptions still
reference outdated phase-specific behavior; update the enum under
connection_mode to include "p2p-dynamic-lazy" and revise the descriptions of
p2p_timeout_seconds, p2p_retry_max_seconds, and relay_timeout_seconds to reflect
the final mode model (which modes each timeout applies to, default values, and
NULL semantics) so generated clients and validation match the server behavior.

---

Outside diff comments:
In `@management/server/account.go`:
- Around line 416-442: validateSettingsUpdate currently allows invalid
ConnectionMode strings to be saved while toPeerConfig silently falls back to
LazyConnectionEnabled; update validateSettingsUpdate to parse/validate
newSettings.ConnectionMode (using the same parser used by toPeerConfig) and
return a status.InvalidArgument error when the parse fails (include the invalid
value in the message) so unknown modes are rejected before persisting.

In `@management/server/peer.go`:
- Around line 1311-1323: GetPeer currently bypasses the RegularUsersViewBlocked
guard that GetPeers enforces, allowing restrictable users to access a peer by
ID; update GetPeer to reapply the same check used in GetPeers: if
settings.RegularUsersViewBlocked is enabled and the caller user is a
restrictable/regular user, deny access (return an appropriate error) before
performing the owner/admin or checkIfUserOwnsPeer logic. Locate the gate near
the user lookup (am.Store.GetUserByUserID / GetPeer) and perform the setting +
user-type check there so only admin/service users or callers allowed by the
existing Restrictable policy proceed to the owner/policy reachability branch.

In `@management/server/peer/peer.go`:
- Around line 207-225: The isEmpty method on PeerSystemMeta currently ignores
the new 3.7i metadata fields so UpdateMetaIfNew can mistakenly treat payloads
with only those fields as empty; update PeerSystemMeta.isEmpty to include the
newly added fields (the effective mode, effective timeout, and features fields
added for 3.7i) in the emptiness checks so that if any of those fields are set
the method returns false and UpdateMetaIfNew will process the update.

---

Duplicate comments:
In `@client/android/preferences.go`:
- Around line 328-331: The setters are currently storing pointers to zero-values
(e.g., &"" or &0) which prevents representing "unset" per Config's contract;
update each setter (SetConnectionMode, SetRelayTimeoutSeconds,
SetP2pTimeoutSeconds, SetP2pRetryMaxSeconds) to set the corresponding
p.configInput field to nil when the passed value is the clearing sentinel (empty
string for ConnectionMode, 0 for the timeout/retry setters), otherwise allocate
and store a pointer to the provided value so callers can distinguish "unset"
(nil) from an explicit zero value.
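The setter contract above can be sketched like this (struct shape and two representative setters only; the exact fields are assumptions): the clearing sentinel stores nil, anything else stores a pointer to a fresh copy.

```go
package main

import "fmt"

// configInput mirrors the nil-means-unset contract described above.
type configInput struct {
	ConnectionMode      *string
	RelayTimeoutSeconds *uint32
}

type preferences struct{ configInput configInput }

// SetConnectionMode: empty string is the clearing sentinel -> nil.
func (p *preferences) SetConnectionMode(mode string) {
	if mode == "" {
		p.configInput.ConnectionMode = nil
		return
	}
	p.configInput.ConnectionMode = &mode
}

// SetRelayTimeoutSeconds: 0 is the clearing sentinel -> nil.
func (p *preferences) SetRelayTimeoutSeconds(secs uint32) {
	if secs == 0 {
		p.configInput.RelayTimeoutSeconds = nil
		return
	}
	p.configInput.RelayTimeoutSeconds = &secs
}

func main() {
	p := &preferences{}
	p.SetConnectionMode("p2p-lazy")
	fmt.Println(p.configInput.ConnectionMode != nil) // explicit value stored
	p.SetConnectionMode("")
	fmt.Println(p.configInput.ConnectionMode == nil) // sentinel clears to nil
}
```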

In `@client/internal/conn_mgr.go`:
- Around line 287-341: When timeouts change but mode stays the same you must
restart an already-running lazy manager so it picks up the new relay/p2p/retry
thresholds: after computing newMode/newRelay/newP2P/newP2pRetry and before the
branch that only handles e.lazyConnMgr == nil, add a check for isManaged &&
e.lazyConnMgr != nil && (newRelay != e.relayTimeoutSecs || newP2P !=
e.p2pTimeoutSecs || newP2pRetry != e.p2pRetryMaxSecs) and in that case log the
restart, call e.closeManager(ctx) (and
e.statusRecorder.UpdateLazyConnection(false) if you need parity with other
restarts), then call e.initLazyManager(ctx) and e.startModeSideEffects() so the
running manager picks up the new timeouts; keep the existing assignments to
e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs and the earlier
e.propagateP2pRetryMaxToConns() call.
- Around line 296-326: The code currently updates e.mode and timeout fields
before checking e.rosenpassEnabled, so a pushed lazy/dynamic mode can be
recorded even when Rosenpass prevents running a lazy manager; move the Rosenpass
check ahead of mutating live state: first evaluate modeUsesLazyMgr(newMode) and
if it returns true and e.rosenpassEnabled is true, log the warning and return
(or reject the push) without changing e.mode or any timeout fields; only after
this check mutate e.mode, e.relayTimeoutSecs, e.p2pTimeoutSecs,
e.p2pRetryMaxSecs and call e.propagateP2pRetryMaxToConns(), then proceed with
the existing manager start/stop logic (references: e.mode, modeUsesLazyMgr,
e.rosenpassEnabled, e.relayTimeoutSecs, e.p2pTimeoutSecs, e.p2pRetryMaxSecs,
e.propagateP2pRetryMaxToConns, e.closeManager, e.lazyConnMgr,
e.statusRecorder.UpdateLazyConnection).

In `@client/internal/lazyconn/manager/manager.go`:
- Around line 102-109: The current branch treats iceTO==0 && relayTO==0 as the
legacy single-timer fallback and calls inactivity.NewManager(wgIface,
config.InactivityThreshold), which incorrectly runs when the caller has migrated
and intentionally set ICEInactivityThreshold==0 and RelayInactivityThreshold==0
with InactivityThreshold==nil (meaning “disable those teardowns”); update the
logic in the manager initialization to only call inactivity.NewManager(...) when
config.InactivityThreshold != nil (preserving the legacy behavior), otherwise
call inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) (so explicit
0/0 from resolvedTimeouts() is honored as “disable”), and keep the existing use
of wgIface, config.resolvedTimeouts(), inactivity.NewManager and
inactivity.NewManagerWithTwoTimers identifiers to locate the change.

In `@client/internal/peer/worker_ice.go`:
- Around line 218-231: IsConnected reads WorkerICE.lastKnownState under
w.muxAgent but onConnectionStateChange writes that field without locking,
causing a data race; update the onConnectionStateChange callback to acquire
w.muxAgent before touching lastKnownState (and agent) so both reads/writes are
synchronized with IsConnected; specifically, wrap assignments at the locations
that set lastKnownState (and any writes to agent) in a
w.muxAgent.Lock()/Unlock() pair inside the onConnectionStateChange handler to
eliminate the race.

In `@client/ui/client_ui.go`:
- Around line 889-907: The timeout gating currently switches on the raw dropdown
label s.sConnectionMode.Selected; change updateTimeoutEntriesEnabled to
determine the effective mode first by calling s.selectedConnectionMode() and, if
that returns an empty string, falling back to s.serverPushedMode, then switch on
that effective mode (instead of s.sConnectionMode.Selected) to enable/disable
s.iRelayTimeout, s.iP2pTimeout and s.iP2pRetryMax so following-the-server modes
like p2p-lazy/p2p-dynamic are honored.
- Around line 656-666: The parseUint32Field function currently swallows parse
errors and returns 0; change its signature to return (uint32, error) (e.g.,
parseUint32Field(text string) (uint32, error)), trim and if empty keep returning
(0, nil) but on strconv.ParseUint failure return (0, fmt.Errorf("invalid uint32:
%w", err)); update callers (such as hasConnectionModeChanges and the
save/validate path) to handle the error, reject the save and show validation
feedback to the user instead of silently treating bad input as the sentinel 0.
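The revised `parseUint32Field` signature could look like this (a sketch of the suggestion, not the real `client_ui.go` code): empty input stays `(0, nil)`, malformed input surfaces an error for the save/validate path to reject.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseUint32Field parses a timeout entry: empty means "unset" and
// returns (0, nil); a malformed value returns a wrapped error instead
// of silently becoming the sentinel 0.
func parseUint32Field(text string) (uint32, error) {
	text = strings.TrimSpace(text)
	if text == "" {
		return 0, nil
	}
	v, err := strconv.ParseUint(text, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid uint32: %w", err)
	}
	return uint32(v), nil
}

func main() {
	v, err := parseUint32Field(" 600 ")
	fmt.Println(v, err) // 600 <nil>
	_, err = parseUint32Field("6oo")
	fmt.Println(err != nil) // true: caller can reject the save
}
```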

In `@management/internals/shared/grpc/conversion.go`:
- Around line 315-327: The code sets cfg.ServerLivenessKnown = true
unconditionally even when rPeer.Status is nil; change this so
ServerLivenessKnown is set to true only when rPeer.Status != nil (i.e. inside
the existing nil-guard where LiveOnline is populated) or explicitly set to false
when rPeer.Status is nil, ensuring the conversion logic (the block handling
rPeer, cfg and rPeer.Status) only marks server liveness authoritative when
status exists.
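The nil-guarded conversion can be sketched with placeholder types standing in for the generated proto structs:

```go
package main

import "fmt"

// Stand-ins for the generated proto/config types; field names follow
// the comment above but are assumptions.
type peerStatus struct{ Online bool }

type remotePeer struct{ Status *peerStatus }

type peerConfig struct {
	LiveOnline          bool
	ServerLivenessKnown bool
}

// convert only marks server liveness authoritative inside the same
// nil-guard that populates LiveOnline.
func convert(rPeer *remotePeer) *peerConfig {
	cfg := &peerConfig{}
	if rPeer.Status != nil {
		cfg.LiveOnline = rPeer.Status.Online
		cfg.ServerLivenessKnown = true
	}
	return cfg
}

func main() {
	fmt.Println(convert(&remotePeer{}).ServerLivenessKnown)                         // false: no status
	fmt.Println(convert(&remotePeer{Status: &peerStatus{Online: true}}).LiveOnline) // true
}
```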

In `@management/internals/shared/grpc/server.go`:
- Around line 487-489: The select loop spins on a closed snapshotCh because the
receive case stays selectable; in the case handling "case nonce, ok :=
<-snapshotCh:" when ok is false set snapshotCh to nil (or otherwise remove it
from the select) so the closed channel is no longer selected, then continue;
update the select loop in server.go to assign snapshotCh = nil on close
(referencing snapshotCh and the register flow from SnapshotRouter.Register()) so
fast reconnects don't busy-loop.
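The nil-channel idiom being asked for is standard Go: a receive from a nil channel blocks forever, so assigning `snapshotCh = nil` after close removes that case from the select. A minimal demonstration (payload type and timings are placeholders):

```go
package main

import (
	"fmt"
	"time"
)

// drain receives nonces until snapshotCh closes, then keeps selecting
// without spinning because the closed channel is nil-ed out.
func drain(snapshotCh chan uint64) int {
	received := 0
	timeout := time.After(50 * time.Millisecond)
	for {
		select {
		case nonce, ok := <-snapshotCh:
			if !ok {
				// Closed channel: nil it so this case is never
				// selectable again, instead of busy-looping on
				// zero values.
				snapshotCh = nil
				continue
			}
			_ = nonce
			received++
		case <-timeout:
			return received
		}
	}
}

func main() {
	snapshotCh := make(chan uint64, 2)
	snapshotCh <- 1
	snapshotCh <- 2
	close(snapshotCh)
	fmt.Println(drain(snapshotCh)) // 2: both values delivered, then the loop parks
}
```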

In `@management/server/peer.go`:
- Line 1355: The current return uses status.Internal which incorrectly produces
a 500 for an access miss; change the status to an auth/not-found code (e.g.
status.NotFound or the equivalent not-found constant used in your status
package) so callers receive a 404-style response instead of Internal. Update the
return that mentions userID, peer.ID and accountID to use the not-found status
(or PermissionDenied if you prefer explicit auth failure) while keeping the same
error message.

---

Nitpick comments:
In `@client/cmd/service_installer.go`:
- Around line 137-139: The current calls to applyConnectionModeFlagsToProfile
swallow persistence failures and only emit warnings; update the callers (notably
the reconfigure flow and the install flow where
applyConnectionModeFlagsToProfile is invoked) to surface failures: for the
reconfigure command return the error from applyConnectionModeFlagsToProfile so
the CLI exits non‑zero and the user is informed, and for the install command
either return the error up the call chain or convert the PrintErrf warning into
a returned error depending on desired UX; locate the calls to
applyConnectionModeFlagsToProfile and change their error handling to return the
error (or propagate it) instead of only logging a warning.
- Around line 194-205: Add defensive nil checks before reading .Changed on
per-flag lookups: for each flag access currently using
cmd.Flag(connectionModeFlag).Changed, cmd.Flag(relayTimeoutFlag).Changed,
cmd.Flag(p2pTimeoutFlag).Changed, and cmd.Flag(p2pRetryMaxFlag).Changed, first
assign the result to a local (e.g., f := cmd.Flag(connectionModeFlag)) and only
set ic.ConnectionMode, ic.RelayTimeoutSeconds, ic.P2pTimeoutSeconds, or
ic.P2pRetryMaxSeconds if f != nil && f.Changed; this mirrors the existing
guarded pattern used earlier and prevents nil-pointer panics when a flag isn't
registered for a command.

In `@client/internal/engine_offline_debounce_test.go`:
- Around line 23-31: The helper engineForDebounceTest should register cleanup
with testing.T to reliably stop any timers instead of relying on manual
test-tail cleanup: change engineForDebounceTest to accept t *testing.T,
construct the Engine with peerOfflineDebounce map[string]*time.Timer, then call
t.Cleanup with a closure that ranges over e.peerOfflineDebounce, stops each
*time.Timer and drains its channel if needed (and clears the map). Update all
test callers to pass t. Use the function name engineForDebounceTest and the
field peerOfflineDebounce to locate the code to modify.

In `@client/internal/engine_pusher_adapters.go`:
- Around line 47-50: The endpoint string logic is asymmetric: it initializes
from st.LocalIceCandidateEndpoint and only sets a combined "local <-> remote"
when both exist, causing remote-only values to be lost; update the logic around
endpoint (the variables st.LocalIceCandidateEndpoint and
st.RemoteIceCandidateEndpoint) to handle all three cases—both non-empty =>
"local <-> remote", only local => local, only remote => remote—so the UI
surfaces remote-only windows during partial state transitions.
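The three cases reduce to a small helper (the function name is illustrative, not the engine's actual code):

```go
package main

import "fmt"

// formatEndpoint covers all three cases: both set -> "local <-> remote",
// only local -> local, only remote -> remote, so remote-only values
// are no longer dropped.
func formatEndpoint(local, remote string) string {
	switch {
	case local != "" && remote != "":
		return local + " <-> " + remote
	case local != "":
		return local
	default:
		return remote
	}
}

func main() {
	fmt.Println(formatEndpoint("10.0.0.1:51820", "203.0.113.7:51820"))
	fmt.Println(formatEndpoint("10.0.0.1:51820", ""))
	fmt.Println(formatEndpoint("", "203.0.113.7:51820")) // remote-only survives
}
```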

In `@client/internal/engine.go`:
- Around line 1376-1402: The repeated system.GetInfoWithChecks(...) →
SetFlags(...) sequence leads to duplication; extract it into a private Engine
helper (e.g. func (e *Engine) systemInfoWithFlags() *system.Info) that performs
GetInfoWithChecks with the same fallback to system.GetInfo, logs on error, calls
info.SetFlags(...) with the exact Engine.config fields, and returns
*system.Info; then replace each inline block (the one inside
e.syncMetaDebouncer.Trigger and the other occurrences) to call
e.systemInfoWithFlags() and use its return value when calling
e.mgmClient.SyncMeta or other callers.

In `@client/internal/lazyconn/manager/manager.go`:
- Around line 610-625: The manager is holding managedPeersMu while calling
external/conn methods (peerStore.PeerConn -> conn.ResetIceBackoff,
conn.AttachICE, conn.NotifyGuardActivity), which is fragile; change the code to
grab the conn (and peerCfg.PublicKey if needed) while holding managedPeersMu via
peerStore.PeerConn, store them in local variables, then release/unlock
managedPeersMu before invoking ResetIceBackoff(), AttachICE(), and
NotifyGuardActivity() so external calls run without the manager lock held; keep
the same warning log path for AttachICE errors and preserve behavior but move
interaction with conn outside the lock.

In `@client/internal/peer/conn_handover_order_test.go`:
- Around line 138-175: The current extractFunctionBody implementation (function
extractFunctionBody) uses a manual brace counter that miscounts braces inside
string/rune literals, raw string/backticks, or comments; replace it by parsing
the source with go/parser and go/ast: parse src into an *ast.File, walk
declarations to find the ast.FuncDecl whose Name (and Receiver if looking for
"func (conn *Conn) Name") matches the requested name, then use the
funcDecl.Pos() and funcDecl.End() token.Pos values (via the token.File set from
the parser) to slice the original src and return the exact function body text;
this removes fragile rune-based counting and correctly handles strings/comments
and receiver vs plain func forms.

In `@client/internal/peer/conn_lazy_keepwgpeer_test.go`:
- Around line 41-75: The test TestConn_Close_KeepWgPeerParameterPresent is
fragile because it asserts the exact Close signature string; replace it with a
behavioral test that constructs a Conn with a fake endpointUpdater (spy) and
calls conn.Close twice: once with keepWgPeer=false and once with
keepWgPeer=true, asserting that endpointUpdater.RemoveWgPeer() is called for the
false case and NOT called for the true case; retain the existing checks that the
guard appears before the call only if you still need a text check, but prefer
the fake endpointUpdater approach to verify the invariant without depending on
source formatting or parameter names.
- Around line 99-132: The current scanner in conn_lazy_keepwgpeer_test.go (the
loop that inspects lines/trim and looks for ".Close(") misses multi-line Close
calls and any receiver names other than "conn"/"peerConn"; replace the fragile
string-based scan with an AST-based check: parse the file with go/parser, walk
*ast.CallExpr nodes, identify selector expressions whose Sel.Name == "Close" and
whose X resolves to a *peer.Conn receiver (type-checking or simple
selector.Type/Import-qualified check), then examine the CallExpr.Args length and
fail the test if the second argument (keepWgPeer) is missing; ensure this covers
multi-line calls and arbitrary receiver identifiers instead of relying on string
ops like strings.Index or strings.Contains.

In `@client/internal/peer/guard/guard_test.go`:
- Around line 18-23: The helper newTestGuard currently returns (*Guard,
*SRWatcher) but callers ignore the SRWatcher; change newTestGuard to return only
*Guard by removing the *SRWatcher from the signature and return value (you can
still construct an SRWatcher inside the function for initialization but do not
return it), update the function body that creates sr := NewSRWatcher(...) and
the NewGuard(...) call accordingly, and update all call sites that do g, _ :=
newTestGuard(...) to use g := newTestGuard(...). Ensure references to SRWatcher
in tests are not relied upon before removing the return.
- Around line 65-96: TestGuard_PeerActivityResetsHourlyMode only exercises a
standalone iceRetryState and manually calls iceState.reset() after receiving on
g.peerActivity, so it doesn't exercise the real reconnectLoopWithRetry logic;
update the test to start the guard's real loop (call g.Start(ctx,
connStatusFunc) in a goroutine), use a connStatusFunc that forces hourly mode
(e.g., returning ConnStatusPartiallyConnected until budget exhausts), call
g.NotifyPeerActivity(), and then assert the loop reacted (hourly ticker cleared
/ short cadence observed) by observing behavior from the actual reconnect loop
instead of directly invoking iceRetryState.reset(); reference
TestGuard_PeerActivityResetsHourlyMode, g.Start, reconnectLoopWithRetry,
NotifyPeerActivity, iceRetryState.enterHourlyMode and reset, and g.peerActivity
to locate changes.

In `@client/internal/peer/guard/guard.go`:
- Around line 67-72: SetOnNetworkChange currently writes g.onNetworkChange
unsafely while reconnectLoopWithRetry reads it concurrently; change the API to
accept the callback at construction to enforce "must set before Start": add a cb
func() parameter to NewGuard and assign g.onNetworkChange inside NewGuard
(remove or deprecate SetOnNetworkChange), update all NewGuard call sites to pass
the callback, and ensure reconnectLoopWithRetry still reads g.onNetworkChange as
before; this makes the callback immutable after start and avoids race/torn-reads
without runtime synchronization.
- Around line 153-183: Extract the repeated ticker reset + iceState.reset() into
a helper method on the guard (e.g., func (g *Guard) resetTicker(ticker
**time.Ticker, tickerChannel *<-chan time.Time, ctx context.Context, iceState
*iceStateType)) and replace the duplicated blocks in the select (the cases for
g.relayedConnDisconnected, g.iCEConnDisconnected, g.peerActivity,
srReconnectedChan) with a single call to g.resetTicker(&ticker, &tickerChannel,
ctx, iceState); ensure the helper stops the old ticker, creates a new one via
g.newReconnectTicker(ctx), updates tickerChannel to the new ticker.C, and calls
iceState.reset(); keep the existing srReconnectedChan behavior of calling
g.onNetworkChange() after invoking g.resetTicker().

In `@client/internal/peer/ice_backoff.go`:
- Around line 178-197: The AllowActivityOverride() method currently returns true
and relies on the caller to call Reset(), which can be misused; change
AllowActivityOverride (or create TryActivityOverride) to perform the rate-limit
check and the reset atomically while holding s.mu: i.e., inside
iceBackoffState.AllowActivityOverride() check s.suspended and
time.Since(s.lastResetAt) < activityOverrideMinInterval, and if allowed
immediately update s.lastResetAt = time.Now() (or call s.Reset() while still
holding s.mu) before returning true so the 5-minute gate cannot be bypassed by a
missed external Reset; keep use of s.mu, s.suspended, s.lastResetAt and
activityOverrideMinInterval consistent and update callers/docs to stop calling
Reset() after AllowActivityOverride().

In `@client/internal/peer/status_remote_meta_notify_test.go`:
- Around line 55-65: The comment above the baseline UpdatePeerRemoteMeta is
contradictory: it says "no notification expected" while the test actually waits
for notifications via waitForCount(t, listener, 2, ...) and then resets
listener.peersChangedCount; update the comment to accurately state that this
initial UpdatePeerRemoteMeta does trigger the first flip/notification (because
default RemoteLiveOnline differs) or otherwise change the assertion to match the
stated intent; locate the block around UpdatePeerRemoteMeta("peerA",
RemoteMeta{LiveOnline: true, ServerLivenessKnown: true}), waitForCount, and
listener.peersChangedCount.Store and make the comment clearly reflect that the
first update causes a notification and we then reset the counter for subsequent
checks.

In `@client/internal/stdnet/filter.go`:
- Around line 47-93: InterfaceFilter currently has high cognitive complexity due
to inline Windows substring matching and the wgctrl probe; extract the
Windows-specific loop into a small helper like isWindowsKnownBad(lowerIFace
string, windowsKnownBadSubstrings []string) bool (called from InterfaceFilter)
and optionally move the WireGuard probe into a helper function (e.g.,
isWireGuardDevice(iFace string) bool) that handles wgctrl.New(), wg.Device and
closing, then call those helpers from InterfaceFilter; keep existing behavior
(skip "veth" special-case on Windows and preserve the iOS skip) and reuse
symbols windowsKnownBadSubstrings, InterfaceFilter, wgctrl.New(), and wg.Device
when relocating logic.

In `@client/ui/peers_tab.go`:
- Around line 280-356: buildPeerDetailText is over the SonarCloud cognitive
complexity limit due to many independent display sections; refactor by
extracting logical blocks into helper functions—e.g., create
appendConnectionType(sb *strings.Builder, p *proto.PeerState),
appendHandshakeAndEndpoints(sb, p), appendGroupsAndSeen(sb, p),
appendFullSection(sb, p) and appendICEBackoff(sb, p) —then have
buildPeerDetailText simply call these helpers in order and return sb.String();
ensure helper names match exactly (appendConnectionType,
appendHandshakeAndEndpoints, appendGroupsAndSeen, appendFullSection,
appendICEBackoff) so tests can target them and keep existing behavior (including
the ICE backoff wall-clock checks and conditional full-block prints).
- Around line 365-375: The function peerLatencyStr currently returns "-" for
both lat == nil and for durations that round down to zero, losing the
distinction between "unknown" and measured-zero/sub-microsecond latencies;
change it so lat == nil still returns "-" but otherwise examine the raw duration
d := lat.AsDuration(): if d == 0 return "0ns" (explicit measured zero), else if
d < time.Microsecond return "<1µs" (measured nonzero but below display
precision), and otherwise return d.Round(time.Microsecond).String(); update the
logic in peerLatencyStr accordingly so the nil check remains tied to unmeasured
and rounding is used only for display.
- Around line 319-320: The code casts proto int64 fields p.GetBytesRx() and
p.GetBytesTx() directly to uint64 when calling humanBytes, which is unsafe if
negatives appear; update the call site (around the fmt.Fprintf using humanBytes)
to normalize values first by clamping negatives to zero (e.g., compute rx :=
p.GetBytesRx(); if rx < 0 { rx = 0 } and same for tx) then cast the non-negative
int64 to uint64 for humanBytes, and/or add a short comment near the fmt.Fprintf
noting that proto fields are int64 but negatives are clamped because WireGuard
counters are expected non-negative.

In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md`:
- Line 13: The "### Task 1: Replace sharedsock with standard UDP socket in
TunKernelDevice.Up()" heading skips from h1 to h3; change it to an h2 to follow
markdownlint MD001 and proper heading increment (i.e., replace the leading "###"
with "##") so the heading level increments from the top-level h1 to h2; ensure
the rest of the document's heading hierarchy remains consistent after modifying
the "Task 1" heading.
- Around line 71-74: Update Step 3 so the description matches the command:
either change the step title/text "Build and verify compilation" to "Build and
verify compilation for arm64" to reflect the GOARCH=arm64 GOOS=linux go build
./client/ command, or replace the command with a local-arch build (go build
./client/) to keep the generic description; edit the Step 3 block that contains
the shell command to ensure consistency with Step 4's explicit cross-compile for
arm64.
- Line 48: The log line uses udpConn.LocalAddr().(*net.UDPAddr).Port which can
panic if LocalAddr() is nil or not a *net.UDPAddr; update the code around that
log (the call to udpConn.LocalAddr(), the log.Infof invocation) to first capture
addr := udpConn.LocalAddr(), check addr != nil, then perform a safe type
assertion udpAddr, ok := addr.(*net.UDPAddr) and only use udpAddr.Port when ok;
otherwise log a safe fallback (e.g., "unknown" or 0) so the code never panics
when LocalAddr() is nil or of an unexpected type.
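A safe variant of that logging path, sketched as a helper (the name and the -1 fallback are assumptions): capture `LocalAddr()` once, nil-check it, and use the two-value type assertion so nothing can panic.

```go
package main

import (
	"fmt"
	"net"
)

// localUDPPort returns the bound UDP port, or -1 as a safe fallback
// when the connection, its address, or the address type is unexpected.
func localUDPPort(conn net.PacketConn) int {
	if conn == nil {
		return -1
	}
	addr := conn.LocalAddr()
	if addr == nil {
		return -1
	}
	udpAddr, ok := addr.(*net.UDPAddr) // two-value form: no panic
	if !ok {
		return -1
	}
	return udpAddr.Port
}

func main() {
	conn, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println(localUDPPort(conn) > 0) // a real ephemeral port
	fmt.Println(localUDPPort(nil))      // -1: fallback instead of a panic
}
```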

In `@management/internals/controllers/network_map/controller/controller.go`:
- Around line 328-333: UpdateAccountPeer is calling
grpc.BuildGroupNamesByPeerID(account.Groups) which scans all account groups for
every peer; replace this with a targeted helper that only collects group-name
entries for the current peer (using peer.ID) or builds the needed slice directly
from peerGroups computed by account.GetPeerGroups(peerId). Implement a new
function (e.g., BuildGroupNamesForPeer or BuildGroupNamesFromPeerGroups) that
iterates either account.Groups once but filters by peerId or iterates peerGroups
to assemble the same shape used by grpc.ToSyncResponse, and use that helper in
place of grpc.BuildGroupNamesByPeerID so UpdateAccountPeer work is proportional
to the single peer rather than account size.

In `@management/internals/server/boot.go`:
- Around line 113-117: The hard-coded TTL value 1 * time.Hour in
BaseServer.PeerConnStore should be promoted to a named constant to make the
operational tuning knob discoverable and reusable; create a constant (e.g.,
peer_connections.DefaultStoreTTL) and replace the literal in the call to
peer_connections.NewMemoryStore inside BaseServer.PeerConnStore so the store
uses that constant, and update any other places that might reference the same
TTL to use the new named constant as well.

In `@management/server/http/handlers/accounts/accounts_handler_test.go`:
- Around line 353-429: The test TestAccountsHandler_PutSettings_P2pRetryMax
should capture the argument passed into the mock UpdateAccountSettingsFunc and
assert that the handler translated the API field into the domain
types.Settings.P2pRetryMaxSeconds (not just rely on the returned value). Modify
the mock UpdateAccountSettingsFunc to store its incoming parameter (e.g.,
newSettingsArg := &types.Settings{}) in a closure, return as before, then after
the request assert that newSettingsArg.P2pRetryMaxSeconds is non-nil and equals
600 (matching the request) so the API→types mapping for P2pRetryMaxSeconds is
validated. Ensure assertions reference UpdateAccountSettingsFunc,
TestAccountsHandler_PutSettings_P2pRetryMax, and
types.Settings.P2pRetryMaxSeconds.

In `@management/server/http/handlers/accounts/validate_uint32_timeout_test.go`:
- Around line 79-90: The test TestValidateUint32Timeout_PlainError currently
doesn't detect wrapped errors because it initializes unwrapped from err and then
only asserts unwrapped != nil; update the test for validateUint32Timeout so it
first asserts err is non-nil and then asserts errors.Unwrap(err) == nil
(ensuring the error is not a %w-wrapped error), or alternatively assert the
concrete type is not a gRPC status error (e.g. via status.FromError) if you want
to forbid status.Error wrappers; use the validateUint32Timeout call and
errors.Unwrap/ status.FromError to make the assertion explicit.

In `@management/server/peer_connections/store_test.go`:
- Around line 221-240: Add a companion test that mirrors
TestMemoryStore_MixedSessionAcceptsLegacyDelta but with the cached state having
SessionId==0 and the incoming delta tagged (non-zero) to ensure the seq-only
acceptance is symmetric; using newStoreWithClock create a store, s.Put a cached
map with SessionId 0 (Seq 5, FullSnapshot true, Entry LatencyMs 10), then s.Put
an incoming map with SessionId non-zero (e.g., sessionB, Seq 6, FullSnapshot
false, Entry LatencyMs 22), and assert via s.Get that the entry latency was
updated to 22; name the test clearly (e.g.,
TestMemoryStore_MixedSessionAcceptsTaggedDelta) and reuse the same helpers
(newStoreWithClock, s.Put, s.Get) as in the existing test.
- Around line 38-52: The test TestMemoryStore_DeepCopyOnReturn only mutates
RemotePubkey so a shallow copy that still shares the nested LastHandshake can
pass; update the test to also mutate the nested timestamp on got1 (e.g. change
got1.GetEntries()[0].LastHandshake) after s.Get and then re-fetch got2 and
assert that got2.GetEntries()[0].GetLastHandshake() still equals the original
timestamp from the fixture; reference the TestMemoryStore_DeepCopyOnReturn,
s.Put, s.Get, got1, got2 and the LastHandshake/timestamppb fields when making
this change to ensure the store performs a deep copy of nested protobuf fields.

In `@management/server/peer_test.go`:
- Line 1214: The test currently calls grpc.ToSyncResponse with a nil
groupNamesByPeerID and thus doesn't exercise appendRemotePeerConfig's branch
that populates cfg.Groups from c.GroupNamesByPeerID[rPeer.ID]; update the test
to call grpc.ToSyncResponse twice (or add a second case) passing a non-nil map
for the trailing groupNamesByPeerID argument that maps the test peer's ID to a
slice of group names, then assert that response.RemotePeers[0].Groups equals the
provided slice to verify the conversion consumed the map (refer to
grpc.ToSyncResponse, appendRemotePeerConfig, and
response.RemotePeers[0].Groups).

In `@management/server/store/sql_store.go`:
- Around line 1656-1663: Replace the hardcoded defaults with the shared
constants: set account.Settings.LegacyLazyFallbackEnabled =
types.DefaultLegacyLazyFallbackEnabled instead of true, and set
account.Settings.LegacyLazyFallbackTimeoutSeconds =
types.DefaultLegacyLazyFallbackTimeoutSeconds instead of 3600 when
sLegacyLazyFallbackTimeoutSecs is not valid; ensure the types package is
imported/available in sql_store.go and adjust any conversions to match the
uint32 type of LegacyLazyFallbackTimeoutSeconds.

In `@shared/management/proto/management.proto`:
- Around line 794-807: Update the comment for PeerConnectionMap.seq to
explicitly state its semantics: document that seq is a monotonic counter tied to
the current session_id (i.e., seq increments for each delta within the same
session_id), that a new full_snapshot (full_snapshot=true) or a change in
session_id indicates a reset/new baseline for seq, and describe fallback
semantics for legacy clients (session_id==0) where receivers should treat seq as
the only monotonic ordering key for that process and reset/interpret seq on
full_snapshot accordingly; reference the fields PeerConnectionMap.seq,
PeerConnectionMap.session_id and PeerConnectionMap.full_snapshot in the comment
so future implementers know how to compare and reset sequence numbers.
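
The comparison-and-reset rules described above can be sketched as a small decision function. The names and exact rules here are an illustration of the documented semantics (seq monotonic within a session_id, full_snapshot resets the baseline, session_id==0 for legacy senders), not the real receiver implementation:

```go
package main

import "fmt"

// shouldAccept sketches the documented ordering rules for
// PeerConnectionMap.seq / session_id / full_snapshot. A full snapshot
// always establishes a new baseline; a changed session_id resets the
// sequence space; otherwise (including the legacy session_id==0 case)
// seq is the only monotonic ordering key and must advance.
func shouldAccept(cachedSession, cachedSeq, inSession, inSeq uint64, fullSnapshot bool) bool {
	if fullSnapshot {
		return true // new baseline, regardless of seq
	}
	if inSession != cachedSession {
		return true // sender restarted: new sequence space
	}
	return inSeq > cachedSeq // same session: seq must be strictly newer
}

func main() {
	fmt.Println(shouldAccept(7, 5, 7, 6, false)) // same session, seq advanced
	fmt.Println(shouldAccept(7, 5, 7, 5, false)) // stale delta, rejected
	fmt.Println(shouldAccept(0, 9, 0, 3, true))  // full snapshot resets baseline
}
```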
Comment on lines +293 to +298
if err := p.sink.Push(context.Background(), &mgmProto.PeerConnectionMap{
Seq: seq,
FullSnapshot: false,
Entries: entries,
SessionId: p.sessionID,
}); err != nil {

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

context.Background() in Push calls can deadlock Stop().

Both flushDelta and flushFull invoke p.sink.Push(context.Background(), ...) from the loop goroutine. If the management gRPC client hangs (slow server, broken connection, head-of-line blocking on retry), the loop will block inside Push, will not service <-p.stop, and Stop()'s wg.Wait() will never return — taking daemon shutdown with it.

Derive a context from a per-pusher cancel that Stop() cancels first, or apply a hard context.WithTimeout on each push. Either approach also bounds individual push latency, which is desirable for the heartbeat path.

Suggested shape
 type connStatePusher struct {
 	...
 	stop         chan struct{}
+	ctx          context.Context
+	cancel       context.CancelFunc
 	wg           sync.WaitGroup
 }
 
 func newConnStatePusherForTest(...) *connStatePusher {
 	p := &connStatePusher{...}
+	p.ctx, p.cancel = context.WithCancel(context.Background())
 	...
 }
 
 func (p *connStatePusher) Stop() {
+	p.cancel()
 	close(p.stop)
 	p.wg.Wait()
 }
 
-	if err := p.sink.Push(context.Background(), &mgmProto.PeerConnectionMap{...}); err != nil {
+	pushCtx, cancel := context.WithTimeout(p.ctx, 30*time.Second)
+	err := p.sink.Push(pushCtx, &mgmProto.PeerConnectionMap{...})
+	cancel()
+	if err != nil {

Also applies to: 363-369

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 293 - 298, The Push calls
in flushDelta and flushFull use context.Background() which can block the loop
and prevent Stop() from returning; change those Push invocations
(p.sink.Push(...)) to use a cancelable/timeout context derived from the pusher's
lifecycle context (e.g., p.ctx or a context created in NewPusher) or wrap each
call with context.WithTimeout so that Stop() can cancel the parent context and
pending Push calls unblock; ensure Stop() cancels that context (or respects the
timeout) before waiting on wg so wg.Wait() can return.

Comment on lines +77 to +97
func TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp(t *testing.T) {
e := engineForDebounceTest()
// must not panic
e.cancelRemoteOfflineClose("never-scheduled")
if len(e.peerOfflineDebounce) != 0 {
t.Error("map must remain empty")
}
}

func TestCancelAllRemoteOfflineCloses_ClearsEverything(t *testing.T) {
e := engineForDebounceTest()
for _, k := range []string{"a", "b", "c", "d"} {
e.scheduleRemoteOfflineClose(k)
}
if len(e.peerOfflineDebounce) != 4 {
t.Fatalf("setup: expected 4 timers, got %d", len(e.peerOfflineDebounce))
}
e.cancelAllRemoteOfflineCloses()
if len(e.peerOfflineDebounce) != 0 {
t.Errorf("cancel-all must clear the map, got %d entries", len(e.peerOfflineDebounce))
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard debounce-map length assertions with the mutex

len(e.peerOfflineDebounce) is read without locking on Line 81, Line 91, and Line 95. Since timers can mutate this map asynchronously, these reads can race.

Suggested patch
 func TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp(t *testing.T) {
 	e := engineForDebounceTest()
 	// must not panic
 	e.cancelRemoteOfflineClose("never-scheduled")
-	if len(e.peerOfflineDebounce) != 0 {
+	e.peerOfflineDebounceMu.Lock()
+	n := len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 0 {
 		t.Error("map must remain empty")
 	}
 }
@@
 	for _, k := range []string{"a", "b", "c", "d"} {
 		e.scheduleRemoteOfflineClose(k)
 	}
-	if len(e.peerOfflineDebounce) != 4 {
-		t.Fatalf("setup: expected 4 timers, got %d", len(e.peerOfflineDebounce))
+	e.peerOfflineDebounceMu.Lock()
+	n := len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 4 {
+		t.Fatalf("setup: expected 4 timers, got %d", n)
 	}
 	e.cancelAllRemoteOfflineCloses()
-	if len(e.peerOfflineDebounce) != 0 {
-		t.Errorf("cancel-all must clear the map, got %d entries", len(e.peerOfflineDebounce))
+	e.peerOfflineDebounceMu.Lock()
+	n = len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 0 {
+		t.Errorf("cancel-all must clear the map, got %d entries", n)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine_offline_debounce_test.go` around lines 77 - 97, The
tests read e.peerOfflineDebounce without holding the engine's debounce mutex,
causing races; update the assertions in
TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp and
TestCancelAllRemoteOfflineCloses_ClearsEverything to acquire the engine's mutex
that guards peerOfflineDebounce (use the same lock used by
scheduleRemoteOfflineClose/cancelAllRemoteOfflineCloses, e.g.
e.mu.Lock()/Unlock() or e.mu.RLock()/RUnlock()) around each
len(e.peerOfflineDebounce) check and around the setup verification so the map
reads are synchronized with
scheduleRemoteOfflineClose/cancelRemoteOfflineClose/cancelAllRemoteOfflineCloses.
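
The locking shape the suggested patch uses — snapshot the map length under the same mutex that guards writers, then assert on the snapshot — looks like this in miniature. The field names mirror the test, not the real engine:

```go
package main

import (
	"fmt"
	"sync"
)

type engine struct {
	mu                  sync.Mutex
	peerOfflineDebounce map[string]struct{}
}

func (e *engine) schedule(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.peerOfflineDebounce[key] = struct{}{}
}

// debounceLen reads the map length under the writers' mutex, so a
// concurrent timer callback cannot race the read.
func (e *engine) debounceLen() int {
	e.mu.Lock()
	defer e.mu.Unlock()
	return len(e.peerOfflineDebounce)
}

func main() {
	e := &engine{peerOfflineDebounce: map[string]struct{}{}}
	var wg sync.WaitGroup
	for _, k := range []string{"a", "b", "c", "d"} {
		wg.Add(1)
		go func(k string) { defer wg.Done(); e.schedule(k) }(k)
	}
	wg.Wait()
	fmt.Println(e.debounceLen()) // 4, read without racing the writers
}
```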

Comment on lines +83 to +95
// NewManagerWithTwoTimers is the Phase-2 constructor. Pass 0 for either
// timeout to disable that teardown path. Both 0 leaves the manager
// running but inert (no channel ever fires) -- used by p2p / relay-forced
// modes that don't tear down workers.
func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
if iceTimeout > 0 {
log.Infof("ICE inactivity timeout: %v", iceTimeout)
}
if relayTimeout > 0 {
log.Infof("relay inactivity timeout: %v", relayTimeout)
}
return newManager(iface, iceTimeout, relayTimeout)
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Enforce the 1-minute floor in NewManagerWithTwoTimers.

NewManagerWithTwoTimers bypasses validateInactivityThreshold, so callers can pass values below MinimumInactivityThreshold even though the sweep runs every minute. A 30-second timeout will therefore fire late and silently violate the configured value. Clamp or reject both timers here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/inactivity/manager.go` around lines 83 - 95,
NewManagerWithTwoTimers currently accepts iceTimeout and relayTimeout without
enforcing MinimumInactivityThreshold, so short durations (e.g., 30s) can be
silently violated; update NewManagerWithTwoTimers to validate and enforce the
1-minute floor by either clamping inputs to MinimumInactivityThreshold or
returning an error/zero for invalid values before calling newManager: call
validateInactivityThreshold (or inline the same check) for both iceTimeout and
relayTimeout, log or return when values are below MinimumInactivityThreshold,
and ensure newManager always receives durations >= MinimumInactivityThreshold
(or zero if caller intended to disable a path). Ensure you reference the symbols
MinimumInactivityThreshold, validateInactivityThreshold,
NewManagerWithTwoTimers, and newManager when making the change.

Comment thread client/internal/peer/conn.go
Comment on lines +309 to +319
// notifyPeerListChanged fires a peer-list-changed notification using the
// current peer count. Phase 3.7i: thin wrapper around the notifier so
// callers in UpdatePeerRemoteMeta and similar paths don't need to know
// about d.numOfPeers() and d.notifier internals.
//
// Caller must hold d.mux (this method reads d.peers/d.offlinePeers via
// numOfPeers and assumes consistent state).
func (d *Status) notifyPeerListChanged() {
d.notifier.peerListChanged(d.numOfPeers())
}


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Misleading "Caller must hold d.mux" docstring on notifyPeerListChanged.

The implementation just calls d.notifier.peerListChanged(d.numOfPeers()). The notifier dispatch itself is the part that re-enters user code, and the established pattern across this file (e.g. lines 461-462, 684-685, 752-753, 854) is to fire it after d.mux.Unlock(). Documenting this helper as "caller must hold d.mux" actively encourages the deadlock pattern that exists at the UpdatePeerRemoteMeta call site (lines 540-541, 564). Either:

  1. Reword to "may be called with or without d.mux held; prefer after unlock to avoid listener re-entry deadlock", and inline d.numOfPeers() accordingly, or
  2. Drop the helper and have callers use d.notifier.peerListChanged(numPeers) directly with numPeers snapshotted under the lock — matching every other path in this file.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 309 - 319, The helper
notifyPeerListChanged currently claims "Caller must hold d.mux" which encourages
holding the lock across listener re-entry; change its contract to "may be called
with or without d.mux held; prefer after unlock to avoid listener re-entry
deadlock" and update implementations to snapshot the peer count under the lock
and call d.notifier.peerListChanged(numPeers) after d.mux.Unlock();
specifically, stop calling notifyPeerListChanged while d.mux is held in
UpdatePeerRemoteMeta and similar sites—replace those calls by capturing num :=
d.numOfPeers() while locked and then call d.notifier.peerListChanged(num) after
unlocking, or keep notifyPeerListChanged as a thin wrapper but remove the "must
hold d.mux" claim and document the preferred-after-unlock usage.

Comment on lines +85 to +88
peer, err := h.account.GetPeer(r.Context(), userAuth.AccountId, peerID, userAuth.UserId)
if err != nil {
http.Error(w, "peer not found", http.StatusNotFound)
return

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t turn every GetPeer failure into a 404.

These branches currently collapse permission-validation failures and unexpected store/backend errors into "peer not found". That makes real server faults indistinguishable from auth/not-found outcomes and will mislead clients during outages.

Suggested fix
 	peer, err := h.account.GetPeer(r.Context(), userAuth.AccountId, peerID, userAuth.UserId)
 	if err != nil {
-		http.Error(w, "peer not found", http.StatusNotFound)
+		// Map expected auth/not-found outcomes explicitly; keep unexpected
+		// failures as 5xx so callers can distinguish them.
+		http.Error(w, "peer not found", http.StatusNotFound)
 		return
 	}

At minimum, inspect the returned management status/error type here and reserve 404 for not-found / intentionally-hidden permission misses.

Also applies to: 137-140

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/peer_connections/handler.go` around lines 85
- 88, The call to h.account.GetPeer currently maps every error to a 404; change
the handling in the GetPeer call sites (the handler block around peer, err :=
h.account.GetPeer(...) and the similar block at the other occurrence) to inspect
the returned error/type instead of blanketing it: detect and return
http.StatusNotFound only when the error indicates a true "not found" or an
intentionally-hidden permission denial (e.g., errors.Is(err, store.ErrNotFound)
or checking a management error code/Status() on the error object), otherwise log
the error and return http.StatusInternalServerError (or the appropriate status
for permission/validation errors). Ensure you preserve context in the log and do
not leak internal details to clients.

}

pubkey := peer.Key
since, _ := strconv.ParseUint(r.URL.Query().Get("since"), 10, 64)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject malformed since tokens.

Right now ?since=abc is treated the same as since=0, so the handler can return cached data instead of telling the caller the refresh token is invalid.

Suggested fix
-	since, _ := strconv.ParseUint(r.URL.Query().Get("since"), 10, 64)
+	var since uint64
+	if sinceRaw := r.URL.Query().Get("since"); sinceRaw != "" {
+		parsed, err := strconv.ParseUint(sinceRaw, 10, 64)
+		if err != nil {
+			http.Error(w, "invalid since value", http.StatusBadRequest)
+			return
+		}
+		since = parsed
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/peer_connections/handler.go` at line 92, The
handler currently ignores parse errors from strconv.ParseUint and treats
malformed ?since=... as 0; change the ParseUint call in the peer connections
handler to check the error returned (the
strconv.ParseUint(r.URL.Query().Get("since"), 10, 64) call) and, if the query
value is non-empty and parsing fails, respond with HTTP 400 (Bad Request)
indicating an invalid refresh token instead of proceeding with since=0; keep
successful parsing behavior unchanged and only reject when the since parameter
is present but malformed.
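
The strict parsing behavior can be factored into a small helper: absent means 0, malformed becomes an error the handler can turn into HTTP 400. This is a sketch of the pattern, not the handler's actual code:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseSince treats an empty query value as 0 (no refresh token) and
// rejects anything that is present but not a valid uint64.
func parseSince(raw string) (uint64, error) {
	if raw == "" {
		return 0, nil
	}
	v, err := strconv.ParseUint(raw, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid since value %q", raw)
	}
	return v, nil
}

func main() {
	fmt.Println(parseSince(""))   // absent: 0, no error
	fmt.Println(parseSince("42")) // valid: 42, no error
	_, err := parseSince("abc")
	fmt.Println(err != nil) // malformed: handler should answer 400
}
```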

Comment on lines +199 to 204
p.EffectiveConnectionMode == other.EffectiveConnectionMode &&
p.EffectiveRelayTimeoutSecs == other.EffectiveRelayTimeoutSecs &&
p.EffectiveP2PTimeoutSecs == other.EffectiveP2PTimeoutSecs &&
p.EffectiveP2PRetryMaxSecs == other.EffectiveP2PRetryMaxSecs &&
slices.Equal(p.SupportedFeatures, other.SupportedFeatures) &&
p.Flags.isEqual(other.Flags)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

SupportedFeatures comparison is order-sensitive but capabilities are set-like.

Two equivalent feature sets in different orders will be treated as unequal, causing unnecessary metadata churn.

💡 Suggested fix
 func (p PeerSystemMeta) isEqual(other PeerSystemMeta) bool {
+	pFeatures := slices.Clone(p.SupportedFeatures)
+	otherFeatures := slices.Clone(other.SupportedFeatures)
+	sort.Strings(pFeatures)
+	sort.Strings(otherFeatures)
+
 	return p.Hostname == other.Hostname &&
 		p.GoOS == other.GoOS &&
 		p.Kernel == other.Kernel &&
 		p.KernelVersion == other.KernelVersion &&
 		p.Core == other.Core &&
 		p.Platform == other.Platform &&
 		p.OS == other.OS &&
 		p.OSVersion == other.OSVersion &&
 		p.WtVersion == other.WtVersion &&
 		p.UIVersion == other.UIVersion &&
 		p.SystemSerialNumber == other.SystemSerialNumber &&
 		p.SystemProductName == other.SystemProductName &&
 		p.SystemManufacturer == other.SystemManufacturer &&
 		p.Environment.Cloud == other.Environment.Cloud &&
 		p.Environment.Platform == other.Environment.Platform &&
 		p.EffectiveConnectionMode == other.EffectiveConnectionMode &&
 		p.EffectiveRelayTimeoutSecs == other.EffectiveRelayTimeoutSecs &&
 		p.EffectiveP2PTimeoutSecs == other.EffectiveP2PTimeoutSecs &&
 		p.EffectiveP2PRetryMaxSecs == other.EffectiveP2PRetryMaxSecs &&
-		slices.Equal(p.SupportedFeatures, other.SupportedFeatures) &&
+		slices.Equal(pFeatures, otherFeatures) &&
 		p.Flags.isEqual(other.Flags)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer/peer.go` around lines 199 - 204, The equality check
currently uses slices.Equal on SupportedFeatures which is order-sensitive;
change it to an order-insensitive set comparison (e.g., convert
p.SupportedFeatures and other.SupportedFeatures into maps/sets or sort them
before comparing) so two feature lists with the same elements in different
orders are treated equal; update the equality logic surrounding
EffectiveConnectionMode/.../SupportedFeatures and keep the existing
p.Flags.isEqual call unchanged.
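
The sort-clones-then-compare approach looks like this in isolation; cloning keeps the callers' slices untouched, which matters since isEqual must not mutate its receivers:

```go
package main

import (
	"fmt"
	"slices"
)

// featuresEqual compares two capability lists as sets: sort copies,
// then compare element-wise, so wire order no longer matters.
func featuresEqual(a, b []string) bool {
	ac, bc := slices.Clone(a), slices.Clone(b)
	slices.Sort(ac)
	slices.Sort(bc)
	return slices.Equal(ac, bc)
}

func main() {
	fmt.Println(featuresEqual([]string{"flow", "dns"}, []string{"dns", "flow"})) // true
	fmt.Println(featuresEqual([]string{"flow"}, []string{"dns"}))               // false
}
```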

Comment on lines +1641 to +1652
if sRelayTimeoutSeconds.Valid {
v := uint32(sRelayTimeoutSeconds.Int64)
account.Settings.RelayTimeoutSeconds = &v
}
if sP2pTimeoutSeconds.Valid {
v := uint32(sP2pTimeoutSeconds.Int64)
account.Settings.P2pTimeoutSeconds = &v
}
if sP2pRetryMaxSeconds.Valid {
v := uint32(sP2pRetryMaxSeconds.Int64)
account.Settings.P2pRetryMaxSeconds = &v
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Direct int64->uint32 casts in sql_store.go:"
rg -n -C2 'uint32\((sRelayTimeoutSeconds|sP2pTimeoutSeconds|sP2pRetryMaxSeconds|sLegacyLazyFallbackTimeoutSecs|metaEffectiveRelayTimeoutSecs|metaEffectiveP2PTimeoutSecs|metaEffectiveP2PRetryMaxSecs)\.Int64\)' management/server/store/sql_store.go

echo
echo "Range guards around those variables (expected none currently):"
rg -n -C2 '(sRelayTimeoutSeconds|sP2pTimeoutSeconds|sP2pRetryMaxSeconds|sLegacyLazyFallbackTimeoutSecs|metaEffectiveRelayTimeoutSecs|metaEffectiveP2PTimeoutSecs|metaEffectiveP2PRetryMaxSecs).*(<\s*0|>\s*math\.MaxUint32)' management/server/store/sql_store.go || true

Repository: netbirdio/netbird

Length of output: 1323


Guard int64→uint32 conversions before assigning timeouts.

At lines 1642, 1646, 1650, 1660, 1923, 1926, and 1929, direct narrowing casts can silently wrap invalid DB values (negative or > math.MaxUint32) into unrelated timeout values.

💡 Suggested fix pattern
+import "math"
...
 if sRelayTimeoutSeconds.Valid {
+    if sRelayTimeoutSeconds.Int64 < 0 || sRelayTimeoutSeconds.Int64 > math.MaxUint32 {
+        return nil, status.NewGetAccountFromStoreError(
+            fmt.Errorf("settings_relay_timeout_seconds out of range: %d", sRelayTimeoutSeconds.Int64),
+        )
+    }
     v := uint32(sRelayTimeoutSeconds.Int64)
     account.Settings.RelayTimeoutSeconds = &v
 }

Apply the same bounds check pattern to:

  • sP2pTimeoutSeconds
  • sP2pRetryMaxSeconds
  • sLegacyLazyFallbackTimeoutSecs
  • metaEffectiveRelayTimeoutSecs
  • metaEffectiveP2PTimeoutSecs
  • metaEffectiveP2PRetryMaxSecs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/store/sql_store.go` around lines 1641 - 1652, The DB
int64→uint32 casts (sRelayTimeoutSeconds, sP2pTimeoutSeconds,
sP2pRetryMaxSeconds and the meta/legacy fields metaEffectiveRelayTimeoutSecs,
metaEffectiveP2PTimeoutSecs, metaEffectiveP2PRetryMaxSecs and
sLegacyLazyFallbackTimeoutSecs) must be guarded against negative values and
values > math.MaxUint32 before assigning into account.Settings
(RelayTimeoutSeconds, P2pTimeoutSeconds, P2pRetryMaxSeconds, etc.); change each
block to check that the source.Int64 is >= 0 and <= math.MaxUint32 and only then
create the uint32 v and assign &v to the corresponding account.Settings field
(otherwise leave the pointer nil or handle as per existing pattern).

Comment on lines +365 to +405
enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
nullable: true
description: |
Account-wide default peer-connection mode. NULL means
"fall back to lazy_connection_enabled" for backwards compatibility.
Phase 1 of issue #5989: relay-forced, p2p, and p2p-lazy are
functional. p2p-dynamic is reserved (passes through as p2p in
Phase 1; will become functional in Phase 2).
p2p_timeout_seconds:
x-experimental: true
type: integer
format: int64
minimum: 0
nullable: true
description: |
Default ICE-worker idle timeout in seconds. 0 = never tear down.
Effective only in p2p-dynamic mode (added in Phase 2).
NULL means "use built-in default" (180 minutes).
p2p_retry_max_seconds:
x-experimental: true
type: integer
format: int64
minimum: 0
nullable: true
description: |
Maximum interval between P2P retry attempts after consecutive
ICE failures, in seconds. Default 900 (= 15 min). Set to 0 to
disable backoff (always retry immediately, Phase-2 behavior).
Effective only in p2p-dynamic mode (added in Phase 3).
example: 900
relay_timeout_seconds:
x-experimental: true
type: integer
format: int64
minimum: 0
nullable: true
description: |
Default relay-worker idle timeout in seconds. 0 = never tear
down. Effective in p2p-lazy and p2p-dynamic modes. Backwards-
compat alias for NB_LAZY_CONN_INACTIVITY_THRESHOLD on the
client. NULL means "use built-in default" (5 minutes).

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add p2p-dynamic-lazy to connection_mode enum and align timeout docs with final mode model

connection_mode is missing p2p-dynamic-lazy (Line 365), which is part of the mode set for this rollout. That creates contract drift: generated clients may reject/omit a valid server value, and schema validation can fail on updates. Also, timeout descriptions (Lines 380-405) still describe earlier phase behavior and can mislead consumers about which modes each timeout applies to.

Suggested OpenAPI fix
         connection_mode:
           x-experimental: true
           type: string
-          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
           nullable: true
           description: |
             Account-wide default peer-connection mode. NULL means
             "fall back to lazy_connection_enabled" for backwards compatibility.
-            Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-            functional. p2p-dynamic is reserved (passes through as p2p in
-            Phase 1; will become functional in Phase 2).
+            Includes dynamic and dynamic-lazy rollout modes.
...
         p2p_timeout_seconds:
...
           description: |
             Default ICE-worker idle timeout in seconds. 0 = never tear down.
-            Effective only in p2p-dynamic mode (added in Phase 2).
+            Effective in dynamic modes.
...
         relay_timeout_seconds:
...
           description: |
             Default relay-worker idle timeout in seconds. 0 = never tear
-            down. Effective in p2p-lazy and p2p-dynamic modes. Backwards-
+            down. Effective in lazy/dynamic-lazy behavior. Backwards-
             compat alias for NB_LAZY_CONN_INACTIVITY_THRESHOLD on the
             client. NULL means "use built-in default" (5 minutes).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 365 - 405, The OpenAPI
schema's connection_mode enum is missing the new value "p2p-dynamic-lazy" and
several timeout field descriptions still reference outdated phase-specific
behavior; update the enum under connection_mode to include "p2p-dynamic-lazy"
and revise the descriptions of p2p_timeout_seconds, p2p_retry_max_seconds, and
relay_timeout_seconds to reflect the final mode model (which modes each timeout
applies to, default values, and NULL semantics) so generated clients and
validation match the server behavior.

@MichaelUray MichaelUray force-pushed the pr/d-hardening-activity-trigger branch 2 times, most recently from 0ed16b8 to b082536 Compare May 6, 2026 07:51
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 15

♻️ Duplicate comments (17)
client/internal/lazyconn/manager/manager.go (1)

50-55: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the Phase-1 inactivity fallback on the deprecated field actually being in use.

Line 52 still applies InactivityThreshold whenever relay is 0, and Line 104 still treats any effective 0/0 as “use the legacy single-timer manager.” That means explicit ICEInactivityThreshold=0 / RelayInactivityThreshold=0 can silently restore Phase-1 relay teardown, and mixed old/new config can override an explicit “relay always-on” setting with the deprecated field.

Suggested fix
 func (c Config) resolvedTimeouts() (iceTimeout, relayTimeout time.Duration) {
 	relay := c.RelayInactivityThreshold
-	if relay == 0 && c.InactivityThreshold != nil {
+	if c.InactivityThreshold != nil &&
+		c.ICEInactivityThreshold == 0 &&
+		c.RelayInactivityThreshold == 0 {
 		relay = *c.InactivityThreshold
 	}
 	return c.ICEInactivityThreshold, relay
 }
@@
 	if wgIface.IsUserspaceBind() {
 		iceTO, relayTO := config.resolvedTimeouts()
-		if iceTO == 0 && relayTO == 0 {
+		if config.InactivityThreshold != nil &&
+			config.ICEInactivityThreshold == 0 &&
+			config.RelayInactivityThreshold == 0 {
 			// Phase 1 / single-timer fallback when caller hasn't migrated.
 			m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold) //nolint:staticcheck // intentional Phase-1 single-timer fallback
 		} else {
 			m.inactivityManager = inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO)
 		}

Also applies to: 103-107

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/manager/manager.go` around lines 50 - 55, The
fallback to the deprecated single-timer behavior must only trigger when the
deprecated InactivityThreshold was actually provided; update
Config.resolvedTimeouts to indicate whether the deprecated field was used (e.g.,
return a bool or expose a method like Config.deprecatedInUse) and only copy
InactivityThreshold into RelayInactivityThreshold when InactivityThreshold !=
nil; then change the manager selection logic (the code around where
resolvedTimeouts is consumed, lines selecting Phase-1 vs new manager) to treat
an effective 0/0 as legacy only if the deprecated flag is set, otherwise respect
explicit zero values for ICEInactivityThreshold and RelayInactivityThreshold so
mixed old/new configs cannot override an explicit "relay always-on" or restore
Phase-1 unintentionally.
management/server/peer/peer.go (1)

199-203: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Compare SupportedFeatures as a set, not a slice.

Capabilities are set-like. slices.Equal makes metadata equality depend on wire order, so two equivalent feature sets can still churn peer metadata and trigger needless updates.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer/peer.go` around lines 199 - 203, The metadata equality
currently uses slices.Equal on p.SupportedFeatures and other.SupportedFeatures
which makes equality order-sensitive; change the comparison in the equality
function (the method that compares peer metadata in peer.go) to treat
SupportedFeatures as a set by comparing membership instead of order — e.g.,
build a temporary map/set of features from one slice and verify every element of
the other slice exists and lengths match (or sort both slices deterministically
before comparing), replacing the slices.Equal(p.SupportedFeatures,
other.SupportedFeatures) check with this set-based comparison so equivalent
feature lists in different orders are considered equal.
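The set-based comparison the comment asks for can be sketched as follows. `equalAsSets` is a hypothetical helper (not part of the NetBird codebase); it counts occurrences so duplicate entries are still distinguished, which a plain sort-and-compare would also achieve:

```go
package main

import "fmt"

// equalAsSets reports whether two feature lists contain the same
// elements regardless of wire order. Occurrences are counted, so
// {"a","a","b"} is NOT equal to {"a","b","b"}.
func equalAsSets(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[string]int, len(a))
	for _, f := range a {
		counts[f]++
	}
	for _, f := range b {
		counts[f]--
		if counts[f] < 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equalAsSets([]string{"lazy", "dynamic"}, []string{"dynamic", "lazy"})) // true
	fmt.Println(equalAsSets([]string{"lazy"}, []string{"lazy", "lazy"}))               // false
}
```

Dropping this in place of `slices.Equal(p.SupportedFeatures, other.SupportedFeatures)` would make metadata equality order-insensitive without allocating on the fast path where lengths already differ.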
client/internal/stdnet/filter_test.go (1)

51-65: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the table symmetrically and gate Windows-only rows.

The loop still only fails on false-positive allows and one hardcoded Windows false-negative. Most `want: true` cases are never checked, so regressions like filtering vEthernet (External) will still pass on non-Windows runners.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter_test.go` around lines 51 - 65, The test loop
only checks false-positive allows and a single hardcoded Windows case; update
the loop around allow(c.name) to assert both directions (if c.want && !got then
t.Errorf("... want true"), and if !c.want && got then t.Errorf("... want
false")), and gate Windows-only rows so they only assert on Windows runners
(either add a boolean field like windowsOnly to the case struct and skip/assert
based on runtime.GOOS, or check the existing Windows-specific name(s) such as
"vEthernet (LAN)" before asserting on non-windows). Ensure you reference the
allow function, c.want, runtime.GOOS, and the test helpers t.Errorf/t.Fatalf
appropriately when making these changes.
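A minimal sketch of the symmetric, OS-gated table check, with `allow` as a hypothetical stand-in for the production filter (the real one lives in `client/internal/stdnet/filter.go`); the `filterCase` struct and `checkCases` helper are illustrative names:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// Hypothetical stand-in for the production interface filter:
// reject names that look like virtual adapters.
func allow(name string) bool {
	return !strings.HasPrefix(name, "vEthernet")
}

type filterCase struct {
	name        string
	want        bool
	windowsOnly bool // assert this row only on Windows runners
}

// checkCases asserts both directions (want==true AND want==false)
// and skips Windows-only rows on other platforms, returning one
// message per mismatch.
func checkCases(cases []filterCase, goos string) []string {
	var failures []string
	for _, c := range cases {
		if c.windowsOnly && goos != "windows" {
			continue
		}
		if got := allow(c.name); got != c.want {
			failures = append(failures,
				fmt.Sprintf("allow(%q) = %v, want %v", c.name, got, c.want))
		}
	}
	return failures
}

func main() {
	cases := []filterCase{
		{name: "eth0", want: true},
		{name: "vEthernet (WSL)", want: false, windowsOnly: true},
	}
	fmt.Println(checkCases(cases, runtime.GOOS))
}
```

In the real test the loop body would call `t.Errorf` instead of collecting strings, but the symmetry and the `windowsOnly` gate carry over directly.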
client/internal/peer/status.go (3)

692-693: ⚠️ Potential issue | 🔴 Critical

Keep notifyPeerStateChangeListeners under d.mux.

This helper calls snapshotRouterPeersLocked, which expects d.mux to be held while reading d.peers and d.changeNotify. Invoking it after unlock reintroduces the concurrent-map race the earlier review called out. Move the materialICE/materialRelay notification back above d.mux.Unlock().

Also applies to: 760-761

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 692 - 693, Move the calls to
notifyPeerStateChangeListeners back inside the critical section guarded by
d.mux: hold d.mux while checking materialICE/materialRelay and calling
notifyPeerStateChangeListeners so snapshotRouterPeersLocked sees d.peers and
d.changeNotify under the lock; specifically, ensure the
materialICE/materialRelay checks and the
notifyPeerStateChangeListeners(receivedState.PubKey) /
notifyPeerStateChangeListeners(receivedState.PubKey) calls occur before invoking
d.mux.Unlock() (also fix the identical pattern at the other occurrence around
lines 760-761).

542-543: ⚠️ Potential issue | 🟠 Major

Move peerListChanged out from under d.mux.

d.notifyPeerListChanged() can synchronously invoke listener code, and this path still does it before the deferred unlock runs. Any listener that re-enters Status can deadlock here. Snapshot numPeers under the lock, unlock, then fire the peer-list notification.

Also applies to: 565-566

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 542 - 543, The peer-list
notification is being called while holding d.mux which can deadlock if listeners
re-enter Status; modify the code in the Status method around
notifyPeerListChanged (and the similar block at the other occurrence) to
snapshot numPeers (or any needed state) while holding d.mux, release the lock,
then call d.notifyPeerListChanged() outside the mutex; ensure you remove the
notify call from under the lock and only reference the pre-captured values
inside the notification invocation.
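The snapshot-then-notify pattern described above can be sketched in isolation. The `recorder` type and its fields are illustrative, not the real `Status` struct; the point is that the shared state and the listener are both captured under the mutex, and the callback fires only after unlock, so a re-entrant listener cannot deadlock:

```go
package main

import (
	"fmt"
	"sync"
)

// recorder is a toy stand-in for the status recorder: a peer map
// guarded by a mutex plus a listener that may re-enter the lock.
type recorder struct {
	mux      sync.Mutex
	peers    map[string]struct{}
	onChange func(numPeers int)
}

func (r *recorder) addPeer(key string) {
	r.mux.Lock()
	r.peers[key] = struct{}{}
	numPeers := len(r.peers) // snapshot the state under the lock
	notify := r.onChange     // capture the listener under the lock too
	r.mux.Unlock()

	if notify != nil {
		notify(numPeers) // fired outside the lock: re-entry is safe
	}
}

func main() {
	r := &recorder{peers: make(map[string]struct{})}
	r.onChange = func(n int) {
		// A re-entrant listener: this would deadlock if the
		// callback were still invoked while holding r.mux.
		r.mux.Lock()
		defer r.mux.Unlock()
		fmt.Println("peers now:", n)
	}
	r.addPeer("wg-pub-key-1") // prints "peers now: 1"
}
```

Capturing `notify` under the lock also covers the related race flagged below for `SetConnStateListener`: the callback invoked is always the one registered at snapshot time.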

470-471: ⚠️ Potential issue | 🟠 Major

Capture the conn-state callback before unlocking.

notifyConnStateChange reads d.connStateListener, but each of these branches builds the closure after d.mux.Unlock(). A concurrent SetConnStateListener(nil/...) can race with that read and lose or misroute the event. Snapshot the closure under the lock, then return it after unlock.

🧩 Suggested pattern
- d.mux.Unlock()
-
- if notifyList {
- 	d.notifier.peerListChanged(numPeers)
- }
- ...
- if hasStatusOrRelayedChange(oldStatus, receivedState.ConnStatus, oldSnapshot.Relayed, receivedState.Relayed) {
- 	return d.notifyConnStateChange(receivedState.PubKey, peerState), nil
- }
+ notifyConn := func() {}
+ if hasStatusOrRelayedChange(oldStatus, receivedState.ConnStatus, oldSnapshot.Relayed, receivedState.Relayed) {
+ 	notifyConn = d.notifyConnStateChange(receivedState.PubKey, peerState)
+ }
+
+ d.mux.Unlock()
+
+ if notifyList {
+ 	d.notifier.peerListChanged(numPeers)
+ }
+ ...
+ return notifyConn, nil

Also applies to: 696-697, 764-765, 812-813, 863-864

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 470 - 471, The code calls
notifyConnStateChange (and similar callbacks) after releasing d.mux which allows
a concurrent SetConnStateListener to change d.connStateListener and race the
callback; fix by snapshotting the listener under the lock (read
d.connStateListener into a local variable while holding d.mux) and return/invoke
that captured closure after unlock instead of accessing d.connStateListener
later; apply this pattern for the occurrences around
hasConnStatusChanged/notifyConnStateChange and the other mentioned branches
(around lines handling conn-state at the other call sites).
shared/management/http/api/types.gen.go (1)

41-63: ⚠️ Potential issue | 🟠 Major

Expose p2p-dynamic-lazy in the generated connection-mode enum.

AccountSettingsConnectionMode and Valid() still only admit four values, but the Phase 3 contract in this PR includes p2p-dynamic-lazy. If the backend starts returning that mode, generated clients will reject it or fail to round-trip it. Please fix the OpenAPI source and regenerate this file instead of patching the generated output directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated
enum AccountSettingsConnectionMode is missing the new value "p2p-dynamic-lazy",
causing Valid() and round-trips to reject it; update the OpenAPI specification
to include the new enum value for AccountSettingsConnectionMode (so the
generator emits a constant like AccountSettingsConnectionModeP2pDynamicLazy with
the string "p2p-dynamic-lazy"), then regenerate types.gen.go so the Valid()
switch in AccountSettingsConnectionMode also includes the new constant (and any
related serialization/deserialization logic) rather than editing the generated
file by hand.
shared/management/http/api/openapi.yml (1)

365-405: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

AccountSettings.connection_mode and timeout docs are still out of sync with the final mode model.

Line 365 is still missing p2p-dynamic-lazy in the enum, and Lines 380-405 still describe phase-specific applicability that no longer matches the rollout model. This can break generated clients/schema validation and misconfigure consumers relying on these descriptions.

Suggested OpenAPI patch
-          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
@@
-            Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-            functional. p2p-dynamic is reserved (passes through as p2p in
-            Phase 1; will become functional in Phase 2).
+            Includes dynamic rollout modes (`p2p-dynamic`, `p2p-dynamic-lazy`).
@@
-            Effective only in p2p-dynamic mode (added in Phase 2).
+            Effective in dynamic modes.
@@
-            disable backoff (always retry immediately, Phase-2 behavior).
-            Effective only in p2p-dynamic mode (added in Phase 3).
+            disable backoff (always retry immediately).
+            Effective in dynamic modes.
@@
-            down. Effective in p2p-lazy and p2p-dynamic modes. Backwards-
+            down. Effective in relay-lazy modes (for example `p2p-lazy`
+            and `p2p-dynamic-lazy`). Backwards-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 365 - 405, Update the
AccountSettings.connection_mode enum to include "p2p-dynamic-lazy" and revise
the descriptions for p2p_timeout_seconds, p2p_retry_max_seconds, and
relay_timeout_seconds to reflect the final mode model (remove phase-specific
rollout language and phase applicability notes); explicitly state which modes
each timeout applies to (e.g., p2p_timeout_seconds applies to p2p-dynamic and
p2p-dynamic-lazy, p2p_retry_max_seconds applies to p2p-dynamic and
p2p-dynamic-lazy backoff behavior, relay_timeout_seconds applies to p2p-lazy and
p2p-dynamic-lazy and acts as the client NB_LAZY_CONN_INACTIVITY_THRESHOLD
alias), preserve nullable/default semantics and examples (e.g., default 180
minutes for p2p_timeout_seconds, example: 900 for p2p_retry_max_seconds) and
remove any "Phase X" wording so generated clients/schema validation match the
final model.
client/internal/conn_state_pusher.go (3)

245-251: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Snapshot-request handling does not drain to the newest nonce.

The doc on OnSnapshotRequest (lines 153-156) promises latest-nonce coalescing, but the loop pulls the first queued nonce and runs flushFull with it. If the management server fired two refreshes back-to-back, the second InResponseToNonce may never be echoed (the first one bumps seq, then the second nonce gets popped and answered with stale state — or the buffer fills and the second nonce is dropped at OnSnapshotRequest's default case). Drain to the newest queued nonce before flushing. This was flagged previously and is still present.

Suggested fix
 		case nonce := <-p.snapshotReq:
+			// Coalesce: drain any newer queued nonces; we only need to
+			// answer the latest one with a single full snapshot.
+		drainSnapshotReq:
+			for {
+				select {
+				case newer := <-p.snapshotReq:
+					nonce = newer
+				default:
+					break drainSnapshotReq
+				}
+			}
 			if p.source != nil {
 				p.flushFull(p.source.SnapshotAllRemotePeers(), nonce)
 			}
 			interval = p.tuning.baseInterval
 			emptyTicks = 0
 			timer.Reset(interval)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 245 - 251, The
snapshot-request handling must drain p.snapshotReq to the newest nonce before
calling p.flushFull so we coalesce to the latest InResponseToNonce; change the
case handling that currently does "nonce := <-p.snapshotReq" to read the first
nonce then loop with non-blocking receives (select with default) to consume any
additional pending nonces and keep the most recent value, then call
p.flushFull(p.source.SnapshotAllRemotePeers(), newestNonce); ensure you
reference and update the logic around p.snapshotReq, flushFull, and any
variables like interval/emptyTicks/timer so behavior after the flush remains
unchanged.

293-298: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Push calls use context.Background() and can wedge Stop() indefinitely.

Both flushDelta (line 293) and flushFull (line 363) call p.sink.Push(context.Background(), …) from the loop goroutine. If the management gRPC client hangs (slow server, broken connection, head-of-line blocking on retry), the loop blocks inside Push, never services <-p.stop, and Stop()'s wg.Wait() never returns — taking daemon shutdown with it. Either derive the context from a per-pusher cancel that Stop cancels first, or wrap each push in a context.WithTimeout. This was flagged previously and is still present.

Suggested shape
 type connStatePusher struct {
 	...
+	ctx          context.Context
+	cancel       context.CancelFunc
 	stop         chan struct{}
 	wg           sync.WaitGroup
 }

 func newConnStatePusherForTest(...) *connStatePusher {
-	p := &connStatePusher{...}
+	ctx, cancel := context.WithCancel(context.Background())
+	p := &connStatePusher{ctx: ctx, cancel: cancel, ...}
 	...
 }

 func (p *connStatePusher) Stop() {
+	p.cancel()
 	close(p.stop)
 	p.wg.Wait()
 }

-	if err := p.sink.Push(context.Background(), &mgmProto.PeerConnectionMap{...}); err != nil {
+	pushCtx, cancel := context.WithTimeout(p.ctx, 30*time.Second)
+	err := p.sink.Push(pushCtx, &mgmProto.PeerConnectionMap{...})
+	cancel()
+	if err != nil {

Also applies to: 363-369

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 293 - 298, flushDelta and
flushFull call p.sink.Push with context.Background(), which can block the loop
and prevent Stop/wg.Wait from returning; change both flushDelta and flushFull to
use a cancellable/timeout context instead of context.Background() — either
derive the context from a per-pusher context stored on the pusher (e.g., p.ctx)
that Stop cancels, or wrap each Push call with context.WithTimeout and defer
cancel so Push returns on Stop; ensure Stop cancels the pusher context (or
relies on the timeout) before waiting on wg so the goroutine unblocks from
p.sink.Push.

115-123: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

newSessionID will spin-loop forever if rand.Read ever returns an error.

rand.Read errors are ignored; on a system where getrandom is unavailable or the entropy source is wedged, b stays zeroed, the generated id is 0, and the loop never exits — newConnStatePusherForTest (and therefore Engine startup) hangs. Bound the retry, propagate the error, or fall back to a non-zero monotonic seed. This was flagged previously and is still present.

Suggested fix
 func newSessionID() uint64 {
 	var b [8]byte
 	for {
-		_, _ = rand.Read(b[:])
+		if _, err := rand.Read(b[:]); err != nil {
+			if id := uint64(time.Now().UnixNano()); id != 0 {
+				return id
+			}
+			continue
+		}
 		if id := binary.BigEndian.Uint64(b[:]); id != 0 {
 			return id
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 115 - 123, newSessionID
currently ignores rand.Read errors and can spin forever returning 0; change
newSessionID to return (uint64, error), check the error from rand.Read each
iteration, limit retries (e.g. a small constant like 3-5 attempts), and if still
failing return a non-zero fallback only if safe or better return a clear error;
update callers such as newConnStatePusherForTest and any Engine startup call
sites to handle the returned error (propagate or fail startup) rather than
assuming a uint64 always succeeds.
management/server/http/handlers/accounts/accounts_handler.go (1)

234-234: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Plain fmt.Errorf still surfaces as 5xx instead of 4xx for these validation failures.

The validation branches for connection_mode (line 234), legacy_lazy_fallback_timeout_seconds (line 280), AutoUpdateVersion (line 291), and validateUint32Timeout (lines 513, 516) still return plain errors. util.WriteError will treat these as Internal/5xx instead of InvalidArgument/422 for bad input. This was raised previously and isn't marked addressed.

💡 Minimal fix
-		return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+		return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
@@
-			return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+			return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
@@
 func validateUint32Timeout(name string, v int64) (uint32, error) {
 	if v < 0 {
-		return 0, fmt.Errorf("invalid %s: %d (must be >= 0)", name, v)
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (must be >= 0)", name, v)
 	}
 	if v > int64(math.MaxUint32) {
-		return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
 	}
 	return uint32(v), nil
 }

Note this will also break TestValidateUint32Timeout_PlainError — it should be removed or updated since the helper would no longer return a plain fmt.Errorf value.

Also applies to: 280-280, 291-291, 513-513, 516-516

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler.go` at line 234,
Replace plain fmt.Errorf returns for validation failures with the utility error
constructor that maps to InvalidArgument (so util.WriteError produces a 4xx),
e.g. change the return in the connection_mode branch (modeStr), the
legacy_lazy_fallback_timeout_seconds case, the AutoUpdateVersion branch, and
inside validateUint32Timeout to return util.NewInvalidArgumentErrorf(...) (or
the project’s equivalent constructor) with the same message; also update or
remove TestValidateUint32Timeout_PlainError to reflect that
validateUint32Timeout no longer returns a plain fmt.Errorf.
management/internals/shared/grpc/conversion.go (1)

315-327: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ServerLivenessKnown is still set unconditionally outside the rPeer.Status != nil guard.

When rPeer.Status is nil, LiveOnline stays false (default) but ServerLivenessKnown is forced to true on line 326, telling new clients "the server authoritatively knows this peer is offline". That is exactly the regression flagged in the prior review and is the wrong signal — the client should fall back to the LastSeenAtServer heuristic in this case. Move the assignment inside the rPeer.Status != nil branch.

🐛 Proposed fix
 		// nbpeer.Peer.Status is *PeerStatus; nil-guard before accessing.
 		if rPeer.Status != nil {
 			if !rPeer.Status.LastSeen.IsZero() {
 				cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
 			}
 			cfg.LiveOnline = rPeer.Status.Connected
+			// New servers always know per-peer liveness; signal that to
+			// new clients so they can trust LiveOnline directly instead
+			// of guessing from the LastSeenAtServer-zero heuristic. Old
+			// servers leave this field at default (false) and clients
+			// fall back. Only set when Status is actually present.
+			cfg.ServerLivenessKnown = true
 		}
-		// New servers always know per-peer liveness; signal that to new
-		// clients so they can trust LiveOnline directly instead of
-		// guessing from the LastSeenAtServer-zero heuristic. Old servers
-		// leave this field at default (false) and clients fall back.
-		cfg.ServerLivenessKnown = true
 		dst = append(dst, cfg)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/conversion.go` around lines 315 - 327, The
ServerLivenessKnown flag is being set unconditionally, which falsely signals
authoritative liveness when rPeer.Status is nil; move the assignment of
cfg.ServerLivenessKnown = true inside the rPeer.Status != nil branch (next to
where cfg.LiveOnline and cfg.LastSeenAtServer are set) so ServerLivenessKnown is
only true when rPeer.Status is non-nil, leaving the default false behavior for
clients to fall back to the LastSeenAtServer heuristic.
management/server/http/handlers/accounts/accounts_handler_test.go (1)

418-419: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

gofmt will still reject this — fields 418-419 are over-indented (4 tabs vs 2 tabs of the surrounding block).

The neighboring fields in expectedSettings (lines 409-425) use two tabs, but LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds use four. This breaks gofmt-strict CI gates. The previous review flagged the same hunk; please run gofmt -w on the file so the colon column re-aligns across the whole struct literal.

Proposed fix
 		LazyConnectionEnabled:           br(false),
-				LegacyLazyFallbackEnabled:      br(true),
-				LegacyLazyFallbackTimeoutSeconds: ir(3600),
+		LegacyLazyFallbackEnabled:        br(true),
+		LegacyLazyFallbackTimeoutSeconds: ir(3600),
 		DnsDomain:                       sr(""),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler_test.go` around
lines 418 - 419, The struct literal in expectedSettings has two fields,
LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds, over-indented (4
tabs) and misaligning the colon column; run gofmt -w on the file or manually
re-indent those two fields to match the surrounding two-tab indentation so the
colon alignment is consistent (look for the expectedSettings struct literal and
the symbols LegacyLazyFallbackEnabled and LegacyLazyFallbackTimeoutSeconds and
correct their indentation).
shared/management/client/grpc.go (1)

483-493: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run the snapshot callback after applying the same SyncResponse.

SnapshotRequest can arrive in the same frame as ordinary sync data. Calling cb() first lets the client publish a snapshot from pre-update peer state.

🩹 Suggested ordering change
-		if req := decryptedResp.GetSnapshotRequest(); req != nil {
-			c.snapMu.Lock()
-			cb := c.onSnapshotRequest
-			c.snapMu.Unlock()
-			if cb != nil {
-				cb(req.GetNonce())
-			}
-		}
-
 		if err := msgHandler(decryptedResp); err != nil {
 			log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
+			continue
+		}
+
+		if req := decryptedResp.GetSnapshotRequest(); req != nil {
+			c.snapMu.Lock()
+			cb := c.onSnapshotRequest
+			c.snapMu.Unlock()
+			if cb != nil {
+				cb(req.GetNonce())
+			}
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/client/grpc.go` around lines 483 - 493, The snapshot
callback is invoked before applying the SyncResponse, which can cause snapshots
to be published from pre-update state; change the ordering so that handling the
SyncResponse via msgHandler(decryptedResp) happens first and only after
successful application (or at least after msgHandler returns nil) you call the
snapshot callback (use c.onSnapshotRequest and req.GetNonce() as before),
preserving the c.snapMu lock/unlock around reading c.onSnapshotRequest and
keeping the existing nil checks around req and cb.
client/internal/engine_offline_debounce_test.go (1)

77-97: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard debounce-map length checks with peerOfflineDebounceMu.

These assertions read e.peerOfflineDebounce without the mutex while timers can mutate the same map asynchronously. That can trip the race detector and make the tests flaky.

🩹 Suggested fix
 func TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp(t *testing.T) {
 	e := engineForDebounceTest()
 	// must not panic
 	e.cancelRemoteOfflineClose("never-scheduled")
-	if len(e.peerOfflineDebounce) != 0 {
+	e.peerOfflineDebounceMu.Lock()
+	n := len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 0 {
 		t.Error("map must remain empty")
 	}
 }
@@
 	for _, k := range []string{"a", "b", "c", "d"} {
 		e.scheduleRemoteOfflineClose(k)
 	}
-	if len(e.peerOfflineDebounce) != 4 {
-		t.Fatalf("setup: expected 4 timers, got %d", len(e.peerOfflineDebounce))
+	e.peerOfflineDebounceMu.Lock()
+	n := len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 4 {
+		t.Fatalf("setup: expected 4 timers, got %d", n)
 	}
 	e.cancelAllRemoteOfflineCloses()
-	if len(e.peerOfflineDebounce) != 0 {
-		t.Errorf("cancel-all must clear the map, got %d entries", len(e.peerOfflineDebounce))
+	e.peerOfflineDebounceMu.Lock()
+	n = len(e.peerOfflineDebounce)
+	e.peerOfflineDebounceMu.Unlock()
+	if n != 0 {
+		t.Errorf("cancel-all must clear the map, got %d entries", n)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine_offline_debounce_test.go` around lines 77 - 97, The
test reads e.peerOfflineDebounce without holding the associated mutex, which can
race with concurrent timer callbacks; wrap any access that checks
len(e.peerOfflineDebounce) (in TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp
and TestCancelAllRemoteOfflineCloses_ClearsEverything) with
e.peerOfflineDebounceMu RLock/RUnlock (or Lock/Unlock) to safely read the map;
locate uses around the assertions after calling engineForDebounceTest(),
scheduleRemoteOfflineClose, cancelRemoteOfflineClose, and
cancelAllRemoteOfflineCloses and protect them with the mutex to avoid race
detector failures.
client/internal/conn_mgr.go (1)

287-340: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Timeout-only changes still don't reach the running lazy/dynamic manager.

When the management push only updates RelayTimeoutSeconds or P2PTimeoutSeconds (and the mode stays in p2p-lazy/p2p-dynamic), modeChanged is false so neither the closeManager branch (Line 314) nor the e.lazyConnMgr == nil branch (Line 322) fires. The new timeouts are stored on ConnMgr, but manager.Config.RelayInactivityThreshold / ICEInactivityThreshold were snapshotted once in initLazyManager (Lines 605-610) and the running inactivity manager keeps using the stale values until a full mode flip.

propagateP2pRetryMaxToConns() only handles the per-Conn ICE-backoff cap, not the lazy manager's two-timer thresholds. Consider detecting timeout-only changes when isManaged && lazyConnMgr != nil and either restarting the manager or pushing the new thresholds through a live-update path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 287 - 340, Timeout-only changes to
relayTimeoutSecs/p2pTimeoutSecs are not applied to a running lazy manager
because modeChanged is false; detect when isManaged && e.lazyConnMgr != nil and
the newRelay/newP2P differ from the existing e.relayTimeoutSecs/e.p2pTimeoutSecs
and handle it by restarting the manager so it snapshots the new thresholds: call
e.closeManager(ctx), e.initLazyManager(ctx), e.startModeSideEffects(), and then
return e.resetPeersToLazyIdle(ctx) (similar to the existing activation path).
Keep the existing propagateP2pRetryMaxToConns() behavior for p2pRetry changes
and use the same helper symbols (resolveConnectionMode, modeUsesLazyMgr,
initLazyManager, closeManager, startModeSideEffects, resetPeersToLazyIdle,
lazyConnMgr) to locate and implement the change.
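The missing check can be sketched as a pure decision function. `lazySettings` and `needsManagerRestart` are hypothetical names (the real state lives on `ConnMgr` and `manager.Config`); the sketch only shows the predicate that should gate the restart path when the mode itself did not change:

```go
package main

import "fmt"

// lazySettings is an illustrative snapshot of the fields the lazy
// manager copies at init time.
type lazySettings struct {
	mode             string
	relayTimeoutSecs uint32
	p2pTimeoutSecs   uint32
}

// needsManagerRestart reports whether the running lazy/dynamic
// manager must be restarted. Mode flips already restart via the
// existing path; the gap is a timeout-only change while the manager
// is running, because its thresholds were snapshotted at init.
func needsManagerRestart(old, updated lazySettings, managerRunning bool) bool {
	if old.mode != updated.mode {
		return true
	}
	if !managerRunning {
		return false // next init will pick up the new values anyway
	}
	return old.relayTimeoutSecs != updated.relayTimeoutSecs ||
		old.p2pTimeoutSecs != updated.p2pTimeoutSecs
}

func main() {
	base := lazySettings{mode: "p2p-lazy", relayTimeoutSecs: 600, p2pTimeoutSecs: 10800}
	upd := base
	upd.relayTimeoutSecs = 900
	fmt.Println(needsManagerRestart(base, upd, true)) // true
	fmt.Println(needsManagerRestart(base, base, true)) // false
}
```

Wiring this predicate into UpdateSettings and taking the close/init/reset path on `true` would let timeout-only pushes reach the two-timer inactivity manager without waiting for a full mode flip; a live-update path that pushes the new thresholds into the running manager is the heavier but restart-free alternative.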

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: db5bc397-20dd-4a93-a989-752a79f7ed0f

📥 Commits

Reviewing files that changed from the base of the PR and between b07184d and b082536.

⛔ Files ignored due to path filters (2)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (63)
  • .gitignore
  • client/android/client.go
  • client/android/peer_notifier.go
  • client/android/preferences.go
  • client/android/preferences_clamp_test.go
  • client/cmd/testutil_test.go
  • client/iface/bind/activity.go
  • client/iface/device/endpoint_manager.go
  • client/internal/conn_mgr.go
  • client/internal/conn_state_pusher.go
  • client/internal/conn_state_pusher_material_test.go
  • client/internal/conn_state_pusher_test.go
  • client/internal/conn_state_pusher_testhelper_test.go
  • client/internal/engine.go
  • client/internal/engine_offline_debounce_test.go
  • client/internal/lazyconn/activity/listener_bind_test.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/peer/conn.go
  • client/internal/peer/conn_handover_order_test.go
  • client/internal/peer/conn_lazy_keepwgpeer_test.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/guard/guard_test.go
  • client/internal/peer/guard/ice_retry_state_test.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/status.go
  • client/internal/peer/status_debounce_test.go
  • client/internal/peer/status_remote_meta_notify_test.go
  • client/internal/peerstore/store.go
  • client/internal/profilemanager/config.go
  • client/internal/stdnet/filter.go
  • client/internal/stdnet/filter_test.go
  • client/proto/daemon.proto
  • client/server/server_test.go
  • client/system/features.go
  • client/system/features_test.go
  • client/ui/network.go
  • client/ui/peers_tab.go
  • docs/bugs/2026-05-04-user-peer-visibility-regression.md
  • docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
  • management/internals/server/boot.go
  • management/internals/shared/grpc/conversion.go
  • management/internals/shared/grpc/conversion_test.go
  • management/internals/shared/grpc/server.go
  • management/server/account.go
  • management/server/activity/codes.go
  • management/server/http/handlers/accounts/accounts_handler.go
  • management/server/http/handlers/accounts/accounts_handler_test.go
  • management/server/http/handlers/accounts/validate_uint32_timeout_test.go
  • management/server/peer.go
  • management/server/peer/peer.go
  • management/server/peer_connections/store.go
  • management/server/peer_connections/store_test.go
  • management/server/peer_test.go
  • management/server/store/file_store.go
  • management/server/store/sql_store.go
  • management/server/types/settings.go
  • management/server/types/settings_ptr_equal_test.go
  • management/server/types/settings_test.go
  • shared/management/client/grpc.go
  • shared/management/http/api/openapi.yml
  • shared/management/http/api/types.gen.go
  • shared/management/proto/management.proto
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (3)
  • client/cmd/testutil_test.go
  • client/android/peer_notifier.go
  • management/server/types/settings_ptr_equal_test.go

Comment on lines 286 to +301

	log.Warnf("lazy connection manager is enabled by management feature flag")
	e.initLazyManager(ctx)
	e.statusRecorder.UpdateLazyConnection(true)
	return e.addPeersToLazyConnManager()
} else {
	if e.lazyConnMgr == nil {
		return nil
	}
	log.Infof("lazy connection manager is disabled by management feature flag")
	newMode, newRelay, newP2P, newP2pRetry := resolveConnectionMode(
		e.envMode, e.envRelayTimeout, e.cfgMode, e.cfgRelayTimeout,
		e.cfgP2pTimeout, e.cfgP2pRetryMax, pc,
	)

	if newMode == e.mode && newRelay == e.relayTimeoutSecs &&
		newP2P == e.p2pTimeoutSecs && newP2pRetry == e.p2pRetryMaxSecs {
		return nil
	}
	prev := e.mode
	e.mode = newMode
	e.relayTimeoutSecs = newRelay
	e.p2pTimeoutSecs = newP2P
	e.p2pRetryMaxSecs = newP2pRetry
	e.propagateP2pRetryMaxToConns()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolved-mode/timeout fields are read concurrently without synchronization.

spMu protects the serverPushed* fields only. The "live" fields mode, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs are written here at Lines 297-300 (under engine syncMsgMux via the NetworkMap path) but read by:

  • Mode()/RelayTimeout()/P2pTimeout()/P2pRetryMax() callers from arbitrary goroutines (e.g. engine.scheduleRemoteOfflineClose timer callback at engine.go:2782, plus the Android UI / daemon-RPC paths the doc-comments call out).
  • runDynamicInactivityLoop reads e.p2pTimeoutSecs/e.relayTimeoutSecs for its startup log line outside any lock.

This is a data race on a non-atomic connectionmode.Mode/uint32 write. Either extend spMu to cover the live fields too, or use atomic loads/stores for them so the accessors are safe to call without the engine mux.

Also applies to: 753-778

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 286 - 301, The live config fields
(mode, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs) are written under
syncMsgMux but read concurrently (Mode(), RelayTimeout(), P2pTimeout(),
P2pRetryMax(), runDynamicInactivityLoop), causing a data race; fix by making
these fields updated/read with atomic ops instead of relying on spMu: change
their storage to atomic-compatible types (e.g. uint32/uint64 backing for
connectionmode.Mode and timeout seconds), use atomic.Store* in the update path
around resolveConnectionMode/propagateP2pRetryMaxToConns, and use atomic.Load*
in the accessor methods and in runDynamicInactivityLoop/startup logging so all
concurrent readers see safe, race-free values. Ensure all references to those
fields (including propagateP2pRetryMaxToConns and any callers listed) are
switched to the atomic loads/stores.
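The atomic-storage option suggested above can be sketched as follows. This is an illustrative stand-in only: `connMgr`, `applyResolvedMode`, and the mode constants here are simplified names, not the real netbird types.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const (
	modeP2PLazy uint32 = iota
	modeP2PDynamic
)

// connMgr keeps the hot-path fields in atomics so the accessors are safe
// to call from any goroutine without holding the engine mutex.
type connMgr struct {
	mode             atomic.Uint32
	relayTimeoutSecs atomic.Uint32
}

func (m *connMgr) Mode() uint32         { return m.mode.Load() }
func (m *connMgr) RelayTimeout() uint32 { return m.relayTimeoutSecs.Load() }

// applyResolvedMode is the writer side (the NetworkMap update path in the
// review comment); concurrent readers always see a consistent value.
func (m *connMgr) applyResolvedMode(mode, relaySecs uint32) {
	m.mode.Store(mode)
	m.relayTimeoutSecs.Store(relaySecs)
}

func main() {
	var m connMgr
	m.applyResolvedMode(modeP2PDynamic, 35)
	fmt.Println(m.Mode() == modeP2PDynamic, m.RelayTimeout())
}
```

`atomic.Uint32` (Go 1.19+) avoids the mixed-lock-scope problem entirely: no reader needs to know which mutex the writer held.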

Comment thread client/internal/conn_mgr.go
Comment thread client/internal/conn_state_pusher_test.go
Comment thread client/internal/engine.go
Comment on lines +2765 to +2802
t := time.AfterFunc(remoteOfflineGracePeriod, func() {
	e.peerOfflineDebounceMu.Lock()
	delete(e.peerOfflineDebounce, pubKey)
	e.peerOfflineDebounceMu.Unlock()
	// Codex review: re-validate on fire. Several preconditions
	// must still hold:
	// 1. engine context not cancelled (Stop() in flight)
	// 2. connMgr still in p2p-dynamic mode (mode-switch racing)
	// 3. peer still has a peerConn AND status recorder still
	//    reports the peer as remote-offline (the live state
	//    could have flipped back without us cancelling — e.g.
	//    mgmt push for a different peer landed before this fire)
	// Without these checks the debounce fires blindly and can
	// tear down a perfectly good conn in any of those races.
	if e.ctx == nil || e.ctx.Err() != nil {
		return
	}
	if e.connMgr == nil || e.connMgr.Mode() != connectionmode.ModeP2PDynamic {
		return
	}
	if state, err := e.statusRecorder.GetPeer(pubKey); err == nil {
		if !state.RemoteServerLivenessKnown || state.RemoteLiveOnline {
			return
		}
	} else {
		return
	}
	conn, ok := e.peerStore.PeerConn(pubKey)
	if !ok {
		return
	}
	log.Infof("[peer: %s] remote went offline (debounced %s), closing local conn (p2p-dynamic)", pubKey, remoteOfflineGracePeriod)
	// Remote-offline close: keep the WG peer entry so that if the
	// remote comes back online and traffic flows, the route-mgr-
	// applied AllowedIPs are still in place. The lazy-mgr will
	// reactivate the peer through the activity listener.
	conn.Close(false, true)
})

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all readers of ConnMgr.Mode()/RelayTimeout()/P2pTimeout()/P2pRetryMax()
# to confirm which ones run outside syncMsgMux.
rg -nP --type=go -C2 '\bconnMgr\.(Mode|RelayTimeout|P2pTimeout|P2pRetryMax)\b'

Repository: netbirdio/netbird

Length of output: 2018


🏁 Script executed:

#!/bin/bash
# Check function signatures and context for each call site
# Line 1360 context
echo "=== Line 1360 context ==="
sed -n '1350,1375p' client/internal/engine.go

echo -e "\n=== Line 1584 context ==="
sed -n '1570,1590p' client/internal/engine.go

echo -e "\n=== Lines 1821-1822 context ==="
sed -n '1810,1830p' client/internal/engine.go

echo -e "\n=== Line 2765-2804 timer context (full function) ==="
sed -n '2755,2810p' client/internal/engine.go

Repository: netbirdio/netbird

Length of output: 5498


🏁 Script executed:

#!/bin/bash
# Find the function containing line 1360
echo "=== Function containing line 1360 ==="
sed -n '1330,1365p' client/internal/engine.go | head -20

# Search for syncMsgMux lock patterns and their scope
echo -e "\n=== syncMsgMux usage patterns ==="
rg -n 'syncMsgMux\.(Lock|Unlock)' client/internal/engine.go

# Find where the lines in question are - identify function boundaries
echo -e "\n=== Looking for function defs near lines 1350-1380 ==="
sed -n '1320,1360p' client/internal/engine.go

Repository: netbirdio/netbird

Length of output: 2771


🏁 Script executed:

#!/bin/bash
# Find callers of updateNetworkMap
echo "=== Callers of updateNetworkMap ==="
rg -n 'updateNetworkMap\(' client/internal/engine.go

# Check the context around those call sites
echo -e "\n=== Context of updateNetworkMap calls ==="
sed -n '350,360p' client/internal/engine.go
echo "---"
sed -n '490,500p' client/internal/engine.go

# Find function containing createPeerConfig (line 1821-1822)
echo -e "\n=== Function def for lines 1821-1822 ==="
sed -n '1800,1830p' client/internal/engine.go

Repository: netbirdio/netbird

Length of output: 2151


🏁 Script executed:

#!/bin/bash
# Check context around line 1047 where updateNetworkMap is called
echo "=== Context around line 1047 (updateNetworkMap call) ==="
sed -n '1035,1055p' client/internal/engine.go

# Find function definition that contains line 1047
echo -e "\n=== Function containing line 1047 ==="
sed -n '1000,1050p' client/internal/engine.go | head -40

# Look at updateNetworkMap itself to see if it expects syncMsgMux
echo -e "\n=== Full updateNetworkMap function signature and first lines ==="
sed -n '1335,1450p' client/internal/engine.go

Repository: netbirdio/netbird

Length of output: 5980


🏁 Script executed:

#!/bin/bash
# Find function name containing line 1047
echo "=== Function name and signature for line 1047 ==="
sed -n '950,1047p' client/internal/engine.go | grep -A 100 "^func"

# Also check for any syncMsgMux locks in message handlers
echo -e "\n=== Check for syncMsgMux in OnSyncResponse or sync handlers ==="
rg -n 'func.*OnSync|func.*handleSync' client/internal/engine.go

# Check line 1584 context - is it in UpdatedRemotePeerConfig handler?
echo -e "\n=== Function containing line 1584 ==="
sed -n '1500,1590p' client/internal/engine.go | head -50

Repository: netbirdio/netbird

Length of output: 4180


🏁 Script executed:

#!/bin/bash
# Check where createPeerConfig is called
echo "=== Callers of createPeerConfig ==="
rg -n 'createPeerConfig\(' client/internal/engine.go

# Check context for those call sites
echo -e "\n=== addNewPeers function (likely caller of createPeerConfig) ==="
sed -n '1700,1750p' client/internal/engine.go | head -40

# Also check if line 1584 is indeed in updateNetworkMap
echo -e "\n=== Confirm line 1584 is in updateNetworkMap ==="
awk 'NR>=1335 && NR<=1600 {if (/^func /) print NR": "$0}' client/internal/engine.go

Repository: netbirdio/netbird

Length of output: 1684


🏁 Script executed:

#!/bin/bash
# Search for where lines 1821-1822 appear (in what function)
echo "=== Function containing line 1821 ==="
sed -n '1750,1830p' client/internal/engine.go | head -60

# Let's also search for function names around line 1800
echo -e "\n=== Function names near line 1800 ==="
awk 'NR>=1750 && NR<=1850 {if (/^func /) print NR": "$0}' client/internal/engine.go

# Now search for where connMgr reads happen in line 1584 context
echo -e "\n=== Verify line 1584 context (within updateNetworkMap or another function) ==="
awk 'NR>=1500 && NR<=1600 {if (/^func /) print NR": "$0}' client/internal/engine.go

Repository: netbirdio/netbird

Length of output: 2383


🏁 Script executed:

#!/bin/bash
# Find all function boundaries to understand which function contains line 1584
echo "=== All function definitions and their line numbers ==="
awk '/^func \(e \*Engine\)/ {print NR": "$0}' client/internal/engine.go | head -50

# Specifically check lines 1500-1700
echo -e "\n=== Function boundaries in range 1500-1700 ==="
awk 'NR>=1500 && NR<=1700 && /^func / {print NR": "$0}' client/internal/engine.go

Repository: netbirdio/netbird

Length of output: 3860


🏁 Script executed:

#!/bin/bash
# Check where addNewPeer is called from
echo "=== Callers of addNewPeer ==="
rg -n 'addNewPeer\(' client/internal/engine.go

# Check if addNewPeers is under syncMsgMux (which calls addNewPeer)
echo -e "\n=== Full addNewPeers function ==="
sed -n '1746,1756p' client/internal/engine.go

# Now check where addNewPeers is called
echo -e "\n=== Where addNewPeers is called ==="
rg -n 'addNewPeers\(' client/internal/engine.go -B 5 -A 1

Repository: netbirdio/netbird

Length of output: 1019


🏁 Script executed:

#!/bin/bash
# Verify line 1584 is indeed in updateNetworkMap and see the full context
echo "=== Full updateNetworkMap function (lines 1335-1606) ==="
sed -n '1335,1606p' client/internal/engine.go | tail -150

# Confirm the lock boundaries by checking handleSync
echo -e "\n=== handleSync lock scope verification ==="
sed -n '969,985p' client/internal/engine.go

Repository: netbirdio/netbird

Length of output: 6880


🏁 Script executed:

#!/bin/bash
# Final verification - check if UpdatedRemotePeerConfig modifies e.connMgr.mode
echo "=== UpdatedRemotePeerConfig writes ==="
rg -n 'UpdatedRemotePeerConfig' client/internal/engine.go -A 20

# Check connmgr.go for mode field access
echo -e "\n=== ConnMgr.Mode() implementation ==="
fd -e go conn_mgr.go | xargs grep -n "func.*Mode\|m\.mode\|\.mode ="

Repository: netbirdio/netbird

Length of output: 1726


🏁 Script executed:

#!/bin/bash
# Check the Mode() implementation
echo "=== ConnMgr.Mode() full implementation ==="
sed -n '756,765p' client/internal/engine.go

# Check what field is being accessed
echo -e "\n=== Line 297 context (mode write) ==="
sed -n '290,310p' client/internal/engine.go

# Search for where connMgr.mode is written
echo -e "\n=== All writes to connMgr.mode or e.mode in engine.go ==="
rg -n 'e\.mode\s*=' client/internal/engine.go

# Check UpdatedRemotePeerConfig to see what it modifies
echo -e "\n=== UpdatedRemotePeerConfig in connmgr ==="
fd -e go conn_mgr.go | xargs grep -n 'UpdatedRemotePeerConfig' -A 30 | head -50

Repository: netbirdio/netbird

Length of output: 3193


🏁 Script executed:

#!/bin/bash
# Search directly in conn_mgr.go for the Mode() implementation
echo "=== Mode() implementation in conn_mgr.go ==="
fd -e go conn_mgr.go | xargs sed -n '756,765p'

# Also check the complete UpdatedRemotePeerConfig function
echo -e "\n=== UpdatedRemotePeerConfig function ==="
fd -e go conn_mgr.go | xargs grep -n 'func.*UpdatedRemotePeerConfig' -A 40

# Check if mode field is protected by a mutex
echo -e "\n=== ConnMgr struct fields ==="
fd -e go conn_mgr.go | xargs sed -n '40,110p'

Repository: netbirdio/netbird

Length of output: 4840


Timer goroutine reads connection-manager state without holding syncMsgMux.

The time.AfterFunc callback at line 2782 reads e.connMgr.Mode() without synchronization. Meanwhile, handleSync calls UpdatedRemotePeerConfig, which writes e.mode under syncMsgMux (line 297). The Mode() getter has no internal synchronization—it returns e.mode directly—so the timer observes unsynchronized reads/writes to shared mutable state during concurrent mode changes or engine shutdown.

Lock protection is inconsistent: spMu guards only the serverPushed* fields, not mode, relayTimeoutSecs, p2pTimeoutSecs, or p2pRetryMaxSecs. Other callers (lines 1360, 1366–1369, 1584, 1821–1822) are protected by syncMsgMux via handleSync, but the timer callback bypasses this.

Either guard the timer's reads with syncMsgMux, or make the ConnMgr accessors internally synchronized.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine.go` around lines 2765 - 2802, The timer callback reads
e.connMgr.Mode() without synchronization, causing a data race with
handleSync/UpdatedRemotePeerConfig which writes e.mode under syncMsgMux; fix by
acquiring e.syncMsgMux around the ConnMgr access in the time.AfterFunc callback
(wrap the nil check and Mode() call with e.syncMsgMux.Lock()/Unlock()), or
alternatively make ConnMgr.Mode() itself thread-safe; reference symbols:
time.AfterFunc callback, e.connMgr.Mode(), syncMsgMux, handleSync /
UpdatedRemotePeerConfig, and Mode().

Comment on lines +109 to +116
case strings.Contains(trim, "lazyConnMgr.Close"),
	strings.Contains(trim, "activityManager.Close"),
	strings.Contains(trim, "im.Close"),
	strings.Contains(trim, "peerStore.Close"),
	strings.Contains(trim, "// "),
	!(strings.Contains(trim, "conn.Close(") || strings.Contains(trim, "peerConn.Close(")):
	continue
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inline-trailing-comments slip past the guard.

strings.Contains(trim, "// ") is intended to skip comment-only lines, but it also matches lines with trailing comments. For example, peerConn.Close(false) // todo add keepWgPeer contains // and is silently excluded — a regressed 1-arg call with any inline comment never gets flagged, defeating the test's purpose.

Detect leading-// instead so trailing comments don't bypass the check.

🛡️ Proposed fix
 		switch {
 		case strings.Contains(trim, "lazyConnMgr.Close"),
 			strings.Contains(trim, "activityManager.Close"),
 			strings.Contains(trim, "im.Close"),
 			strings.Contains(trim, "peerStore.Close"),
-			strings.Contains(trim, "// "),
+			strings.HasPrefix(trim, "//"),
 			!(strings.Contains(trim, "conn.Close(") || strings.Contains(trim, "peerConn.Close(")):
 			continue
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
case strings.Contains(trim, "lazyConnMgr.Close"),
strings.Contains(trim, "activityManager.Close"),
strings.Contains(trim, "im.Close"),
strings.Contains(trim, "peerStore.Close"),
strings.Contains(trim, "// "),
!(strings.Contains(trim, "conn.Close(") || strings.Contains(trim, "peerConn.Close(")):
continue
}
case strings.Contains(trim, "lazyConnMgr.Close"),
strings.Contains(trim, "activityManager.Close"),
strings.Contains(trim, "im.Close"),
strings.Contains(trim, "peerStore.Close"),
strings.HasPrefix(trim, "//"),
!(strings.Contains(trim, "conn.Close(") || strings.Contains(trim, "peerConn.Close(")):
continue
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_lazy_keepwgpeer_test.go` around lines 109 - 116,
The guard in the test currently uses strings.Contains(trim, "// ") which also
matches lines with trailing inline comments and lets regressed single-arg calls
slip through; update the condition in the case block that checks for comments to
detect only leading comment lines (e.g., replace the strings.Contains check with
a leading-check such as strings.HasPrefix(trim, "//")) so only full-line
comments are skipped while lines with trailing comments (like
peerConn.Close(false) // ...) are still evaluated by the
conn.Close/peerConn.Close checks.
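A quick demonstration of the `Contains` vs `HasPrefix` difference the fix relies on; the helper names and sample line are illustrative, not from the test file.

```go
package main

import (
	"fmt"
	"strings"
)

// skippedAsComment mirrors the buggy guard: any "// " substring matches,
// including a trailing comment on an otherwise-real call.
func skippedAsComment(trim string) bool { return strings.Contains(trim, "// ") }

// skippedAsCommentFixed matches only lines that begin with a comment.
func skippedAsCommentFixed(trim string) bool { return strings.HasPrefix(trim, "//") }

func main() {
	withTrailing := `peerConn.Close(false) // todo add keepWgPeer`
	fmt.Println(skippedAsComment(withTrailing))      // buggy guard skips the regressed call
	fmt.Println(skippedAsCommentFixed(withTrailing)) // fixed guard still evaluates it
}
```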

Comment on lines +50 to +56
bindParams := udpmux.UniversalUDPMuxParams{
	UDPConn:   nbnet.WrapPacketConn(udpConn),
	Net:       t.transportNet,
	FilterFn:  t.filterFn,
	WGAddress: t.address,
	MTU:       t.mtu,
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how filterFn is set in device_kernel_unix.go
rg -n 'filterFn.*=' client/iface/device/device_kernel_unix.go

# Check for any sharedsock.NewIncomingSTUNFilter usage
rg -n 'sharedsock\.NewIncomingSTUNFilter' client/iface/device/device_kernel_unix.go

Repository: netbirdio/netbird

Length of output: 152


🏁 Script executed:

# Find the TunKernelDevice struct definition
rg -n 'type TunKernelDevice' client/iface/device/

# Search for filterFn field within the struct or its initialization
rg -n 'filterFn\s*:' client/iface/device/device_kernel_unix.go

# Search for where filterFn might be set as a field
rg -B5 -A5 'FilterFn:' client/iface/device/device_kernel_unix.go

Repository: netbirdio/netbird

Length of output: 411


🏁 Script executed:

# Read the device_kernel_unix.go file to see TunKernelDevice struct and filterFn initialization
head -150 client/iface/device/device_kernel_unix.go | cat -n

# Also search more broadly for filterFn assignments
rg -n 'filterFn' client/iface/device/device_kernel_unix.go

Repository: netbirdio/netbird

Length of output: 4633


🏁 Script executed:

# Check the full markdown file mentioned in the review to see all proposed changes
head -200 docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md | tail -150

Repository: netbirdio/netbird

Length of output: 4569


🏁 Script executed:

# Find udpmux.FilterFn type definition
rg -n 'type FilterFn' client/iface/

# Check if FilterFn can be nil in UniversalUDPMuxParams
rg -B5 -A10 'FilterFn:' client/iface/udpmux/

Repository: netbirdio/netbird

Length of output: 164


Clarify the role of FilterFn in the new separate-port approach.

The proposed code references t.filterFn in FilterFn: (line 53), but this field is never initialized and would be nil. This differs from the original code, which passes sharedsock.NewIncomingSTUNFilter() directly to sharedsock.Listen().

Confirm whether:

  1. FilterFn should remain nil (no filtering on the dedicated ICE port), OR
  2. FilterFn needs to be initialized with a filter function (and if so, what replaces sharedsock.NewIncomingSTUNFilter())

Without this clarification, it's unclear if the change is intentionally simplifying the filtering or if it's an incomplete migration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` around lines
50 - 56, The FilterFn field in the udpmux.UniversalUDPMuxParams is currently set
to t.filterFn but that value is never initialized; decide whether the dedicated
ICE port should have no filtering or should use the same STUN filter previously
provided by sharedsock.NewIncomingSTUNFilter(). If you want no filtering leave
FilterFn nil but document this decision and remove any dead t.filterFn state; if
you need the STUN filter, initialize FilterFn with the equivalent function
(replace t.filterFn with sharedsock.NewIncomingSTUNFilter() or wire the filter
factory into the transport so t.filterFn is set during construction) and ensure
the constructor that creates t (or the call site that builds bindParams) sets
that function before creating udpmux.UniversalUDPMuxParams.

Comment thread docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
Comment thread management/server/peer.go
Comment on lines +1361 to +1370
func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
	p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
	if err != nil {
		return nil, err
	}
	if p.AccountID != accountID {
		return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
	}
	return p, nil
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a typed NotFound error and avoid leaking cross-account peer existence.

GetPeerByPeerPubKey is not scoped by accountID, so the cross-account check at line 1366 is the only barrier. Returning an untyped fmt.Errorf has two issues:

  1. The error message echoes back the caller-supplied pubKey, confirming the peer exists in some account — this is an information-disclosure leak across tenants when this method is reached from REST handlers (per the doc comment).
  2. Callers cannot distinguish "peer doesn't exist" from "peer exists in another account"; downstream HTTP handlers will likely surface this as 500 instead of 404.

Both branches should return the same typed not-found status:

🛡️ Proposed fix
 func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
 	p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
 	if err != nil {
 		return nil, err
 	}
 	if p.AccountID != accountID {
-		return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
+		return nil, status.Errorf(status.NotFound, "peer not found")
 	}
 	return p, nil
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
if err != nil {
return nil, err
}
if p.AccountID != accountID {
return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
}
return p, nil
}
func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
if err != nil {
return nil, err
}
if p.AccountID != accountID {
return nil, status.Errorf(status.NotFound, "peer not found")
}
return p, nil
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` around lines 1361 - 1370, GetPeerByPubKey leaks
cross-account existence and returns an untyped fmt.Errorf; change
DefaultAccountManager.GetPeerByPubKey so both failure cases
(Store.GetPeerByPeerPubKey error indicating not found and the accountID
mismatch) return the same typed NotFound error instead of fmt.Errorf and do not
include the caller-supplied pubKey in the message. Use the project's canonical
not-found sentinel/type from the store package (e.g., store.ErrNotFound or
store.NewNotFoundError) when translating/store errors and when returning on
account mismatch so callers can distinguish not-found vs other errors and no
tenant information is exposed.

Comment on lines +142 to +154
// Phase 3.7i (#5989): peer-self-reported runtime mode/timeouts. Stored
// alongside Hostname/Kernel/etc as meta. Empty when peer pre-dates 3.7i.
EffectiveConnectionMode string `json:"effective_connection_mode,omitempty"`
EffectiveRelayTimeoutSecs uint32 `json:"effective_relay_timeout_secs,omitempty"`
EffectiveP2PTimeoutSecs uint32 `json:"effective_p2p_timeout_secs,omitempty"`
EffectiveP2PRetryMaxSecs uint32 `json:"effective_p2p_retry_max_secs,omitempty"`

// Phase 3.7i (#5989): capability keywords this client build advertises.
// Empty for peers that pre-date the field. Used by mgmt to decide
// whether to send legacy-compat fallback settings (e.g. downgrade to
// p2p-lazy when client lacks "p2p_dynamic"). See
// client/system/features.go for the canonical list.
SupportedFeatures []string `json:"supported_features,omitempty" gorm:"serializer:json"`

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include the new 3.7i fields in isEmpty().

UpdateMetaIfNew() returns early when meta.isEmpty() is true, but isEmpty() below still ignores Effective* and SupportedFeatures. A partial metadata update carrying only these new runtime/capability fields will be dropped and leave stale compatibility/runtime state behind.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer/peer.go` around lines 142 - 154, The isEmpty() method
currently ignores the new Phase 3.7i fields so UpdateMetaIfNew() can wrongly
return early; update isEmpty() to consider EffectiveConnectionMode,
EffectiveRelayTimeoutSecs, EffectiveP2PTimeoutSecs, EffectiveP2PRetryMaxSecs and
SupportedFeatures as making the meta non-empty (i.e., return false if any
Effective* field is non-zero/non-empty or SupportedFeatures has elements) so
partial updates that only carry these runtime/capability fields are applied.
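The suggested `isEmpty()` extension might look like the sketch below. The struct is trimmed to the fields under discussion plus one legacy field, so it is illustrative only, not the full `PeerSystemMeta`.

```go
package main

import "fmt"

type PeerSystemMeta struct {
	Hostname                  string
	EffectiveConnectionMode   string
	EffectiveRelayTimeoutSecs uint32
	EffectiveP2PTimeoutSecs   uint32
	EffectiveP2PRetryMaxSecs  uint32
	SupportedFeatures         []string
}

// isEmpty now treats any 3.7i runtime/capability field as meaningful, so
// a partial update carrying only those fields is no longer dropped.
func (m PeerSystemMeta) isEmpty() bool {
	return m.Hostname == "" &&
		m.EffectiveConnectionMode == "" &&
		m.EffectiveRelayTimeoutSecs == 0 &&
		m.EffectiveP2PTimeoutSecs == 0 &&
		m.EffectiveP2PRetryMaxSecs == 0 &&
		len(m.SupportedFeatures) == 0
}

func main() {
	partial := PeerSystemMeta{SupportedFeatures: []string{"p2p_dynamic"}}
	fmt.Println(partial.isEmpty()) // capability-only update counts as non-empty
}
```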

Comment thread shared/management/client/grpc.go
@MichaelUray MichaelUray force-pushed the pr/d-hardening-activity-trigger branch from b082536 to e8a298f Compare May 6, 2026 13:39
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (27)
shared/management/client/grpc.go (2)

483-494: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run msgHandler before triggering snapshot callback.

On Line 483, invoking onSnapshotRequest before applying the same SyncResponse can trigger a snapshot from pre-update state. Move snapshot handling after msgHandler (and only when handler succeeds).

Suggested ordering fix
-		if req := decryptedResp.GetSnapshotRequest(); req != nil {
-			c.snapMu.Lock()
-			cb := c.onSnapshotRequest
-			c.snapMu.Unlock()
-			if cb != nil {
-				cb(req.GetNonce())
-			}
-		}
-
 		if err := msgHandler(decryptedResp); err != nil {
 			log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
+			continue
+		}
+
+		if req := decryptedResp.GetSnapshotRequest(); req != nil {
+			c.snapMu.Lock()
+			cb := c.onSnapshotRequest
+			c.snapMu.Unlock()
+			if cb != nil {
+				cb(req.GetNonce())
+			}
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/client/grpc.go` around lines 483 - 494, The snapshot
callback is being invoked before the SyncResponse is applied; change the
ordering so you call msgHandler(decryptedResp) first and only if it returns nil
then acquire c.snapMu, read c.onSnapshotRequest and invoke the callback with
req.GetNonce(); ensure you still check decryptedResp.GetSnapshotRequest() for
non-nil but move that block after the msgHandler success path and preserve the
use of c.snapMu when accessing c.onSnapshotRequest.

968-970: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not return nil metadata when info == nil.

Line 969 still drops SupportedFeatures and effective connection settings when system.Info is unavailable, which makes this client look legacy and can cause management to misapply fallback logic. Always return a PeerSystemMeta with capability/effective-config fields populated.

Suggested fix
 func infoToMetaData(info *system.Info, eff EffectiveConnConfig) *proto.PeerSystemMeta {
+	meta := &proto.PeerSystemMeta{
+		EffectiveConnectionMode:   eff.Mode,
+		EffectiveRelayTimeoutSecs: eff.RelayTimeoutSecs,
+		EffectiveP2PTimeoutSecs:   eff.P2PTimeoutSecs,
+		EffectiveP2PRetryMaxSecs:  eff.P2PRetryMaxSecs,
+		SupportedFeatures:         system.SupportedFeatures(),
+	}
 	if info == nil {
-		return nil
+		return meta
 	}
@@
-	return &proto.PeerSystemMeta{
+	meta.Hostname = info.Hostname
+	meta.GoOS = info.GoOS
+	meta.OS = info.OS
+	meta.Core = info.OSVersion
+	meta.OSVersion = info.OSVersion
+	meta.Platform = info.Platform
+	meta.Kernel = info.Kernel
+	meta.NetbirdVersion = info.NetbirdVersion
+	meta.UiVersion = info.UIVersion
+	meta.KernelVersion = info.KernelVersion
+	meta.NetworkAddresses = addresses
+	meta.SysSerialNumber = info.SystemSerialNumber
+	meta.SysManufacturer = info.SystemManufacturer
+	meta.SysProductName = info.SystemProductName
+	meta.Environment = &proto.Environment{
+		Cloud:    info.Environment.Cloud,
+		Platform: info.Environment.Platform,
+	}
+	meta.Files = files
+	meta.Flags = &proto.Flags{
+		RosenpassEnabled:      info.RosenpassEnabled,
+		RosenpassPermissive:   info.RosenpassPermissive,
+		ServerSSHAllowed:      info.ServerSSHAllowed,
+		DisableClientRoutes:   info.DisableClientRoutes,
+		DisableServerRoutes:   info.DisableServerRoutes,
+		DisableDNS:            info.DisableDNS,
+		DisableFirewall:       info.DisableFirewall,
+		BlockLANAccess:        info.BlockLANAccess,
+		BlockInbound:          info.BlockInbound,
+		LazyConnectionEnabled: info.LazyConnectionEnabled,
+	}
-		...
-		SupportedFeatures: system.SupportedFeatures(),
-	}
+	return meta
 }

Also applies to: 1026-1035

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/client/grpc.go` around lines 968 - 970, The function
infoToMetaData currently returns nil when info == nil; instead always construct
and return a proto.PeerSystemMeta populated with capability and
effective-connection fields so the client does not appear legacy. Update
infoToMetaData (and the analogous block around the second occurrence) to create
a PeerSystemMeta even if info is nil: set SupportedFeatures (or the equivalent
capabilities field) from defaults or an empty slice and populate the
EffectiveConnConfig fields from the eff parameter so effective connection
settings are always conveyed. Ensure the returned value is a non-nil
*proto.PeerSystemMeta with SupportedFeatures and EffectiveConnConfig fields
filled.
client/internal/conn_state_pusher_test.go (1)

199-205: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Wait for disabled latch, not only the first notification

At Line 202, sink.notif can fire before the Unimplemented error is processed, so the test may proceed before p.disabled is latched, making it flaky.

Suggested fix
-	// Wait for the very first push (initial snapshot), which gets the
-	// Unimplemented error back and latches `disabled`.
-	select {
-	case <-sink.notif:
-	case <-time.After(500 * time.Millisecond):
-		t.Fatal("timed out waiting for first push attempt")
-	}
+	// Wait until the first push attempt is observed and disabled is latched.
+	deadline := time.After(500 * time.Millisecond)
+	for !p.disabled.Load() {
+		select {
+		case <-sink.notif:
+			// keep waiting until Unimplemented has been handled
+		case <-deadline:
+			t.Fatal("timed out waiting for pusher to latch disabled after first push attempt")
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher_test.go` around lines 199 - 205, The test
currently proceeds on receiving sink.notif which may occur before the
Unimplemented error handling latches p.disabled; change the wait to ensure the
disabled latch is set: after receiving from sink.notif, block until p.disabled
is observed (e.g., poll/check p.disabled or wait on a dedicated disabled
channel/condition) with the same timeout (500ms) so the test only continues once
p.disabled has been latched following the Unimplemented error.
client/internal/stdnet/filter_test.go (1)

51-65: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This regression test still doesn't verify most want: true cases.

Line 60 only catches false-positive allows, and Line 63 special-cases one Windows false-negative. On Linux/macOS, vEthernet (LAN) and vEthernet (External) are filtered by the veth prefix today, but this test still passes because those rows are never asserted symmetrically. Either skip this test on non-Windows or gate the Windows-only rows, then compare got != c.want for every case.

Proposed fix
 func TestInterfaceFilter_Windows_TargetedFiltering(t *testing.T) {
+	if runtime.GOOS != "windows" {
+		t.Skip("Windows-specific interface-name filtering")
+	}
+
 	disallow := []string{"wt", "wg", "veth", "br-", "lo", "docker"}
 	allow := InterfaceFilter(disallow)
@@
 	for _, c := range cases {
-		// The wgctrl branch can override on hosts where NetBird is
-		// running; tests run on a host where these names are not
-		// real interfaces, so the final return faithfully reflects
-		// the disallow-list logic.
 		got := allow(c.name)
-		// "veth*" prefix only filters on non-Windows; on Linux test
-		// runners "vEthernet (LAN)" still passes because of mixed
-		// case + the !Windows branch keeping the prefix match.
-		if !c.want && got {
-			t.Errorf("InterfaceFilter(%q) = true, want false (should be filtered)", c.name)
-		}
-		if c.want && !got && runtime.GOOS == "windows" && c.name == "vEthernet (LAN)" {
-			t.Fatalf("InterfaceFilter(%q) = false, want true on Windows (this is uray-mic-d4's default-route interface)", c.name)
+		if got != c.want {
+			t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want)
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter_test.go` around lines 51 - 65, The test only
asserts one-sided failures; change the loop to assert symmetry by checking if
got != c.want and calling t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name,
got, c.want) so every case is verified. To preserve the special Windows-only
expectations, skip or gate rows like "vEthernet (LAN)" and "vEthernet
(External)" when runtime.GOOS != "windows" (e.g., if runtime.GOOS != "windows"
&& (c.name == "vEthernet (LAN)" || c.name == "vEthernet (External)") { continue
}) so the allow(...) check and the unified got!=c.want assertion run correctly
across platforms; update the references around the allow(...) call and the
existing runtime.GOOS handling accordingly.
client/internal/stdnet/filter.go (1)

51-76: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Windows still filters Local Area Connection-style NICs.

At Line 52 and again via the disallow-list loop at Line 76, the case-folded "lo" prefix rejects any Windows adapter whose name starts with "Lo". That means interfaces like Local Area Connection are still dropped before the Windows-specific exceptions can help, which recreates the false-negative this change was meant to fix.

Proposed fix
-		// Linux/macOS loopback prefix ("lo", "lo0").
-		if strings.HasPrefix(lowerIFace, "lo") {
+		// Linux/macOS loopback prefix ("lo", "lo0").
+		if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") {
 			return false
 		}
@@
-			if sLower == "veth" && runtime.GOOS == "windows" {
+			if runtime.GOOS == "windows" && (sLower == "veth" || sLower == "lo") {
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter.go` around lines 51 - 76, The current early
check uses strings.HasPrefix(lowerIFace, "lo") which incorrectly filters Windows
adapters like "Local Area Connection"; update that check to skip the "lo" prefix
test on Windows (i.e., only treat "lo" as loopback when runtime.GOOS !=
"windows"). Also in the disallowList loop (symbols: disallowList, sLower) ensure
if sLower == "lo" you similarly continue/skip when runtime.GOOS == "windows" so
Windows NICs starting with "Lo" are not dropped; keep the existing special-case
for "veth" unchanged.
management/server/http/handlers/accounts/accounts_handler_test.go (1)

408-426: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Indentation on lines 418-419 still appears inconsistent with surrounding fields.

Despite the previous comment being marked addressed, the displayed code shows LegacyLazyFallbackEnabled / LegacyLazyFallbackTimeoutSeconds with extra leading tabs vs. the rest of the api.AccountSettings{} literal (LazyConnectionEnabled, DnsDomain, …). If this still fails CI gofmt, please run gofmt -w on the file. The same pattern exists in the older tt table at lines 124-125, 152-153, 180-181, 208-209, 236-237, 264-265 — worth verifying those too.

#!/bin/bash
# Confirm whether gofmt would reformat the file
fd -e go accounts_handler_test.go --exec gofmt -d {} \;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler_test.go` around
lines 408 - 426, The struct literal for api.AccountSettings contains
inconsistent indentation for the fields LegacyLazyFallbackEnabled and
LegacyLazyFallbackTimeoutSeconds (they are indented with extra tabs compared to
neighboring fields like LazyConnectionEnabled and DnsDomain); run gofmt (or
manually align those two lines to match the surrounding field indentation) and
also check the older tt table occurrences of
LegacyLazyFallbackEnabled/LegacyLazyFallbackTimeoutSeconds mentioned in the
comment (lines where tt entries exist) to ensure consistent indentation so the
file passes gofmt.
management/server/peer.go (2)

1297-1356: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return an authorization error (not status.Internal) when the user has no policy access to the peer.

"No access to this peer" is an expected authorization outcome when the user lacks peers:read, isn't an admin/service user, isn't the owner, and has no policy-reachable own peer to the target. Returning status.Internal here surfaces as a 5xx to API callers instead of a proper 403/404. Use status.PermissionDenied (or status.NotFound to also avoid existence disclosure for accounts whose peer ids are guessable).

🛡️ Proposed fix
-	return nil, status.Errorf(status.Internal, "user %s has no access to peer %s under account %s", userID, peer.ID, accountID)
+	return nil, status.Errorf(status.PermissionDenied, "user %s has no access to peer %s under account %s", userID, peer.ID, accountID)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` around lines 1297 - 1356, The final error returned
from checkIfUserOwnsPeer uses status.Internal but should be an authorization
error; in checkIfUserOwnsPeer (called by GetPeer) replace the final return that
currently does return nil, status.Errorf(status.Internal, ...) with an
authorization-level status (e.g. status.PermissionDenied or status.NotFound) to
surface a 403/404 instead of a 5xx; keep the same message or adjust to a generic
denial message and ensure the change is applied in the checkIfUserOwnsPeer
function so callers of GetPeer receive the correct permission error.

1358-1370: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

GetPeerByPubKey cross-account branch leaks tenant info and returns an untyped error.

Store.GetPeerByPeerPubKey is not scoped by accountID, so the cross-account check at line 1366 is the only barrier. Two issues remain:

  1. The error message echoes the caller-supplied pubKey and confirms the peer exists in some account → cross-tenant info disclosure when reached from REST handlers (per the doc comment).
  2. fmt.Errorf is untyped, so HTTP handlers cannot distinguish "not found" from "in another account" and will surface this as 5xx rather than 404.
🛡️ Proposed fix
 func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
 	p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
 	if err != nil {
 		return nil, err
 	}
 	if p.AccountID != accountID {
-		return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
+		return nil, status.Errorf(status.NotFound, "peer not found")
 	}
 	return p, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` around lines 1358 - 1370, GetPeerByPubKey leaks
tenant data and returns an untyped error; change the cross-account branch in
DefaultAccountManager.GetPeerByPubKey (which calls Store.GetPeerByPeerPubKey) to
not echo pubKey or account details and to return a typed "not found" error that
callers can detect (use an existing sentinel like store.ErrNotFound or define
ErrPeerNotFound and return fmt.Errorf("%w", store.ErrNotFound) or
errors.New/Wrap with that sentinel) so REST handlers get a 404 instead of 5xx
and no tenant information is revealed.
client/internal/peer/ice_backoff.go (1)

199-210: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

SetMaxBackoff(0) still leaves stale suspension in place.

When d == 0 (backoff disabled via management push), s.suspended and s.nextRetry are not cleared, so IsSuspended() will keep blocking retries until the previously computed deadline despite backoff being turned off. Same fix as previously suggested applies.

🐛 Proposed fix
 func (s *iceBackoffState) SetMaxBackoff(d time.Duration) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	if d == s.maxBackoff {
 		return
 	}
 	s.maxBackoff = d
+	if d == 0 {
+		s.failures = 0
+		s.suspended = false
+		s.nextRetry = time.Time{}
+		s.bo = buildBackoff(0)
+		return
+	}
 	s.bo = buildBackoff(d)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/ice_backoff.go` around lines 199 - 210, SetMaxBackoff
currently updates s.maxBackoff and s.bo but when d == 0 it leaves a previous
suspension and deadline in place so IsSuspended() still blocks; while holding
the mutex in SetMaxBackoff, if d == 0 clear the suspension state by resetting
s.failures to 0, setting s.suspended = false and s.nextRetry = time.Time{}
(zero value) before rebuilding s.bo (call buildBackoff(0) or nil as your
backoff factory expects) so any stale suspension is removed; reference
symbols: SetMaxBackoff, s.failures, s.suspended, s.nextRetry, s.bo,
buildBackoff, IsSuspended.
management/server/http/handlers/accounts/accounts_handler.go (1)

231-283: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validation errors still need status.InvalidArgument wrapping.

fmt.Errorf at lines 234, 280 and inside validateUint32Timeout (513, 516) yields untyped errors, so util.WriteError surfaces them as 5xx. Bad input on connection_mode, legacy_lazy_fallback_timeout_seconds, and the new uint32 timeout fields will return Internal Server Error instead of 4xx validation responses. Wrap them at one of the two layers (handler-side, or inside validateUint32Timeout itself).

💡 Minimal fix at the validator boundary
 	if req.Settings.ConnectionMode != nil {
 		modeStr := string(*req.Settings.ConnectionMode)
 		if !req.Settings.ConnectionMode.Valid() {
-			return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+			return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
 		}
@@
 		v := *req.Settings.LegacyLazyFallbackTimeoutSeconds
 		if v < 60 || v > 86400 {
-			return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+			return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
 		}
@@
 func validateUint32Timeout(name string, v int64) (uint32, error) {
 	if v < 0 {
-		return 0, fmt.Errorf("invalid %s: %d (must be >= 0)", name, v)
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (must be >= 0)", name, v)
 	}
 	if v > int64(math.MaxUint32) {
-		return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
 	}
 	return uint32(v), nil
 }

If you prefer keeping validateUint32Timeout as plain-error and wrapping at the handler, also update the corresponding test in validate_uint32_timeout_test.go accordingly.

Also applies to: 505-519

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
231 - 283, The handler and validator are returning plain fmt.Errorf which
surfaces as 5xx; change these validation errors to gRPC InvalidArgument errors
so util.WriteError returns 4xx. Either (A) update validateUint32Timeout to
return status.Errorf(codes.InvalidArgument, ...) for its range/parse errors (so
callers like the handler get a properly-typed error), or (B) wrap the
handler-side errors before returning (wrap the fmt.Errorf at the connection_mode
check in the block handling req.Settings.ConnectionMode and the
legacy_lazy_fallback_timeout_seconds check around
req.Settings.LegacyLazyFallbackTimeoutSeconds with
status.Errorf(codes.InvalidArgument, ...)). Reference validateUint32Timeout,
req.Settings.ConnectionMode, returnSettings.ConnectionMode, and
req.Settings.LegacyLazyFallbackTimeoutSeconds when making the change.
client/internal/profilemanager/config.go (1)

200-202: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

JSON tag for P2pRetryMaxSeconds is still inconsistent with neighboring fields.

ConnectionMode, RelayTimeoutSeconds, and P2pTimeoutSeconds use json:",omitempty" (PascalCase Go-field-name keys, matching every other field in Config), but P2pRetryMaxSeconds uniquely uses json:"p2p_retry_max_seconds,omitempty". This produces a mixed PascalCase + one snake_case key in the on-disk JSON and is a future-maintenance hazard for anyone reading or hand-editing the profile.

♻️ Proposed fix
 	RelayTimeoutSeconds uint32 `json:",omitempty"`
 	P2pTimeoutSeconds   uint32 `json:",omitempty"`
-	P2pRetryMaxSeconds  uint32 `json:"p2p_retry_max_seconds,omitempty"`
+	P2pRetryMaxSeconds  uint32 `json:",omitempty"`

If snake_case is intentional (e.g., to match a wire format), apply it consistently to all three timeout fields and ConnectionMode instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/profilemanager/config.go` around lines 200 - 202, The JSON
tag for the struct field P2pRetryMaxSeconds is inconsistent with its neighbors;
update the struct tags so they are consistent—either change P2pRetryMaxSeconds'
tag to use the same PascalCase-empty-omitempty style (json:",omitempty") to
match RelayTimeoutSeconds, P2pTimeoutSeconds and ConnectionMode, or (if
snake_case is intended) change RelayTimeoutSeconds, P2pTimeoutSeconds and
ConnectionMode to use the snake_case form so all timeout/connection fields use
the same pattern; adjust the tag on the P2pRetryMaxSeconds field (and the other
timeout/ConnectionMode tags if choosing snake_case) accordingly.
management/server/store/sql_store.go (1)

1641-1663: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate DB timeout/meta values before narrowing them to uint32.

These int64 -> uint32 casts still silently wrap negative or oversized values into unrelated timeouts. A bad row in Postgres would come back as a seemingly valid RelayTimeoutSeconds/P2pRetryMaxSeconds/meta timeout instead of failing fast. Please range-check each NullInt64 (>= 0 && <= math.MaxUint32) before assigning, and return an error on invalid data.

Also applies to: 1922-1929

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/store/sql_store.go` around lines 1641 - 1663, Validate each
sql NullInt64 before casting to uint32: check sRelayTimeoutSeconds,
sP2pTimeoutSeconds, sP2pRetryMaxSeconds (and the other block at lines 1922-1929)
for value >= 0 and <= math.MaxUint32; if out of range return a descriptive error
instead of silently casting, otherwise perform the cast and assign to
account.Settings.RelayTimeoutSeconds / P2pTimeoutSeconds / P2pRetryMaxSeconds;
likewise validate sLegacyLazyFallbackTimeoutSecs before converting it to uint32
and return an error on invalid DB data.
client/internal/conn_mgr.go (3)

297-300: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t persist a lazy/dynamic mode when Rosenpass blocks that mode.

By the time the Rosenpass guard returns, e.mode and the timeout fields have already been updated to the pushed lazy/dynamic values. That leaves the rest of ConnMgr observing managed-mode semantics even though the manager was never started. Reject or normalize the resolved mode before mutating the live state so Mode() and mode-dependent paths never advertise an unsupported configuration.

Also applies to: 322-326
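The "normalize before mutating" shape can be sketched as below. The mode strings, the `rosenpassEnabled` flag, and the single-assignment structure are illustrative stand-ins for the real ConnMgr fields, not the actual implementation:

```go
package main

import "fmt"

type connMgr struct {
	rosenpassEnabled bool
	mode             string
}

// resolveAllowedMode validates the pushed candidate first; only the
// normalized result is ever assigned, so Mode() never advertises a
// configuration the Rosenpass guard would have rejected.
func (e *connMgr) resolveAllowedMode(pushed string) string {
	if e.rosenpassEnabled && (pushed == "p2p-lazy" || pushed == "dynamic") {
		return "p2p" // Rosenpass blocks lazy/dynamic: fall back, don't half-apply
	}
	return pushed
}

func (e *connMgr) applyPushedMode(pushed string) {
	e.mode = e.resolveAllowedMode(pushed) // single mutation point
}

func main() {
	e := &connMgr{rosenpassEnabled: true}
	e.applyPushedMode("p2p-lazy")
	fmt.Println(e.mode) // falls back instead of advertising an unsupported mode
}
```

The key property is that the guard runs against the candidate value, so a rejection leaves the live fields untouched rather than partially updated.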

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 297 - 300, Compute and
validate/normalize the resolved mode before mutating ConnMgr state: instead of
assigning newMode/newRelay/newP2P/newP2pRetry directly to e.mode,
e.relayTimeoutSecs, e.p2pTimeoutSecs and e.p2pRetryMaxSecs, run the Rosenpass
guard (or the existing validation logic) against the candidate newMode and only
if allowed assign the lazy/dynamic values; if the guard rejects it, normalize to
an allowed fallback (e.g., the supported static/managed mode) and only then
mutate e.mode and the timeout fields so Mode() and mode-dependent paths never
observe an unsupported configuration. Apply the same change to the other
assignment block referenced (the similar statements at lines 322-326).

287-340: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart the lazy manager when only inactivity thresholds change.

initLazyManager() snapshots relayTimeoutSecs/p2pTimeoutSecs into manager.Config, but UpdatedRemotePeerConfig() only tears the manager down on modeChanged. If the server keeps the same managed mode and changes just one timeout, the running inactivity manager keeps the old thresholds until a later mode flip or process restart.

Also applies to: 601-611
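The restart condition described above can be expressed as a small predicate. Names here are illustrative (the real code would compare the ConnMgr's live fields against the resolved push):

```go
package main

import "fmt"

// timeouts bundles the inactivity thresholds snapshotted into the lazy
// manager's config; comparable, so struct equality suffices.
type timeouts struct{ relaySecs, p2pSecs, p2pRetryMaxSecs uint32 }

// needsManagerRestart is true when the mode flips, or when the mode
// stays managed but any threshold changed, so the running inactivity
// manager never keeps stale values until the next mode flip.
func needsManagerRestart(modeChanged, wasManaged, isManaged bool, prev, next timeouts) bool {
	if modeChanged {
		return true
	}
	return wasManaged && isManaged && prev != next
}

func main() {
	old := timeouts{relaySecs: 600, p2pSecs: 300, p2pRetryMaxSecs: 60}
	upd := timeouts{relaySecs: 900, p2pSecs: 300, p2pRetryMaxSecs: 60}
	fmt.Println(needsManagerRestart(false, true, true, old, upd)) // thresholds changed
}
```

On a true result the existing teardown/re-init sequence (close manager, re-init, reset peers to lazy idle) would run just as it does for a mode change.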

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 287 - 340, resolveConnectionMode
can change inactivity timeouts without changing mode, but the code only restarts
the lazy manager when modeChanged; this leaves the running manager using stale
thresholds. Detect when wasManaged && isManaged and any of newRelay, newP2P or
newP2pRetry differ from e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs
(i.e. timeoutChanged) and, similar to the existing mode-change handling for
switching managed types, call e.closeManager(ctx),
e.statusRecorder.UpdateLazyConnection(false), then e.initLazyManager(ctx),
e.startModeSideEffects(), and return e.resetPeersToLazyIdle(ctx) so the new
manager config and inactivity timers take effect immediately; reference
resolveConnectionMode, initLazyManager, closeManager,
propagateP2pRetryMaxToConns, startModeSideEffects, and resetPeersToLazyIdle.

35-45: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Protect the resolved live mode/timeout fields from concurrent reads.

spMu only covers the serverPushed* snapshot. The live fields (mode, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs) are still written in UpdatedRemotePeerConfig() and read concurrently from runDynamicInactivityLoop() and the public accessors with no synchronization, so this remains racy. Either guard them with the same lock or switch them to atomic-backed storage.

Also applies to: 231-240, 297-300, 756-778
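Option (B) from the finding, atomic-backed storage, can be sketched like this. The `liveConfig` type and its fields are stand-ins for the real ConnMgr fields, shown only to illustrate lock-free reads:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// liveConfig holds the resolved live mode and one timeout behind
// atomics, so the inactivity loop and public accessors can read them
// without taking the writer's lock.
type liveConfig struct {
	mode             atomic.Value // stores string
	relayTimeoutSecs atomic.Uint32
}

func (c *liveConfig) update(mode string, relaySecs uint32) {
	c.mode.Store(mode)
	c.relayTimeoutSecs.Store(relaySecs)
}

// Mode tolerates the never-stored zero value by returning "".
func (c *liveConfig) Mode() string {
	if v, ok := c.mode.Load().(string); ok {
		return v
	}
	return ""
}

func main() {
	var c liveConfig
	c.update("dynamic", 300)
	fmt.Println(c.Mode(), c.relayTimeoutSecs.Load())
}
```

Note that atomics make each field individually race-free but do not make a multi-field update appear atomic as a group; if readers need a consistent snapshot of all four fields, guarding them with the existing spMu (option A) is the simpler choice.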

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 35 - 45, The live config fields
mode, relayTimeoutSecs, p2pTimeoutSecs and p2pRetryMaxSecs are written in
UpdatedRemotePeerConfig() but read concurrently from runDynamicInactivityLoop()
and the public accessors, so make them concurrency-safe: either (A) guard all
reads and writes with the existing spMu (use spMu in UpdatedRemotePeerConfig(),
runDynamicInactivityLoop() and the accessor methods that reference these
fields), or (B) convert the fields to atomic-backed storage (e.g., atomic.Value
for mode and atomic.Uint32/uint64 for the timeouts) and update
UpdatedRemotePeerConfig(), runDynamicInactivityLoop() and the accessors to use
atomic loads/stores; apply the same change to the other affected groups
mentioned (lines ~231-240, ~297-300, ~756-778) to eliminate the data race.
client/internal/peer/conn_lazy_keepwgpeer_test.go (1)

108-116: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inline-trailing-comment lines bypass the arity check — duplicate of past open review.

strings.Contains(trim, "// ") at line 113 also matches lines like peerConn.Close(false) // keepWgPeer omitted, silently skipping them before the two-argument check fires. The fix from the prior review still applies.

🛡️ Proposed fix
-		strings.Contains(trim, "// "),
+		strings.HasPrefix(trim, "//"),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_lazy_keepwgpeer_test.go` around lines 108 - 116,
The test's switch uses strings.Contains(trim, "// ") which wrongly matches lines
with code followed by inline comments (e.g., "peerConn.Close(false) //
keepWgPeer omitted"), skipping the two-argument check; update the condition in
the switch that currently references strings.Contains(trim, "// ") to instead
detect comment-only lines (for example use
strings.HasPrefix(strings.TrimSpace(trim), "//") or check that
strings.Index(trim, "//") == 0) so only lines that are purely comments are
skipped; keep the rest of the checks (conn.Close( and peerConn.Close() checks)
intact and ensure you modify the case inside the same switch that references the
variable trim.
client/internal/lazyconn/manager/manager.go (1)

102-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Legacy-fallback selector conflates "not migrated" with explicit zero-timeouts — duplicate of past open review.

iceTO == 0 && relayTO == 0 is true for two distinct cases:

  1. A caller that hasn't set any of the new thresholds (intended fallback to Phase-1 single-timer).
  2. A caller that explicitly sets both to 0 to disable all teardowns.

Case 2 falls into the Phase-1 single-timer path, which then silently re-enables the default inactivity teardown via inactivity.NewManager(wgIface, nil). The correct discriminant is the presence of the deprecated InactivityThreshold field.

🛡️ Proposed fix
-		if iceTO == 0 && relayTO == 0 {
-			// Phase 1 / single-timer fallback when caller hasn't migrated.
-			m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold) //nolint:staticcheck // intentional Phase-1 single-timer fallback
-		} else {
+		if config.InactivityThreshold != nil && iceTO == 0 && relayTO == 0 { //nolint:staticcheck
+			// Phase 1 / single-timer fallback: deprecated field present, new fields absent.
+			m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold)
+		} else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/manager/manager.go` around lines 102 - 109, The
current conditional treats iceTO==0 && relayTO==0 as the indicator of the legacy
single-timer fallback, which conflates "caller didn't migrate" with "caller
explicitly disabled teardowns"; instead, base the legacy fallback on presence of
the deprecated InactivityThreshold field. Change the condition around
config.resolvedTimeouts()/InactivityThreshold so that if
config.InactivityThreshold is non-nil you call inactivity.NewManager(wgIface,
config.InactivityThreshold) (the legacy single-timer path), otherwise use
inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) — this ensures
explicit zero timeouts remain the two-timer (disable) case; reference wgIface,
config.resolvedTimeouts(), config.InactivityThreshold, inactivity.NewManager and
inactivity.NewManagerWithTwoTimers when making the change.
management/server/peer_connections/store.go (1)

141-155: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expire stale entries before the nonce gate.

Returning early on the since (nonce) check before the TTL check means refresh-only callers never evict expired snapshots, so dead entries can live in the map forever.

Suggested fix
 func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	e, ok := s.maps[peerPubKey]
 	if !ok {
 		return nil, false
 	}
-	if since > 0 && e.m.GetInResponseToNonce() < since {
-		return nil, false
-	}
 	if s.clock.Now().Sub(e.updatedAt) > s.ttl {
 		delete(s.maps, peerPubKey)
 		return nil, false
 	}
+	if since > 0 && e.m.GetInResponseToNonce() < since {
+		return nil, false
+	}
 	return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer_connections/store.go` around lines 141 - 155, In
MemoryStore.GetWithNonceCheck, expired entries are checked after the nonce gate
causing stale entries to persist; change the order so the TTL eviction runs
before the nonce comparison: while holding the lock in GetWithNonceCheck, call
s.clock.Now().Sub(e.updatedAt) > s.ttl and delete from s.maps when expired
(using the existing delete logic) before evaluating the since /
e.m.GetInResponseToNonce() check so refresh-only callers will not retain expired
snapshots.
management/server/peer/peer.go (2)

199-203: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Compare SupportedFeatures as a set, not a slice.

slices.Equal makes two equivalent capability sets compare unequal when protobuf serialization order changes, which will keep churning peer metadata unnecessarily.

Suggested fix
 func (p PeerSystemMeta) isEqual(other PeerSystemMeta) bool {
+	pFeatures := slices.Clone(p.SupportedFeatures)
+	otherFeatures := slices.Clone(other.SupportedFeatures)
+	sort.Strings(pFeatures)
+	sort.Strings(otherFeatures)
+
 	sort.Slice(p.NetworkAddresses, func(i, j int) bool {
 		return p.NetworkAddresses[i].Mac < p.NetworkAddresses[j].Mac
 	})
@@
 		p.Environment.Cloud == other.Environment.Cloud &&
 		p.Environment.Platform == other.Environment.Platform &&
 		p.EffectiveConnectionMode == other.EffectiveConnectionMode &&
 		p.EffectiveRelayTimeoutSecs == other.EffectiveRelayTimeoutSecs &&
 		p.EffectiveP2PTimeoutSecs == other.EffectiveP2PTimeoutSecs &&
 		p.EffectiveP2PRetryMaxSecs == other.EffectiveP2PRetryMaxSecs &&
-		slices.Equal(p.SupportedFeatures, other.SupportedFeatures) &&
+		slices.Equal(pFeatures, otherFeatures) &&
 		p.Flags.isEqual(other.Flags)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer/peer.go` around lines 199 - 203, The comparison
currently uses slices.Equal on p.SupportedFeatures vs other.SupportedFeatures
which fails when protobuf reorders capabilities; change the comparison in the
equality logic to treat SupportedFeatures as a set by converting each slice into
a map/set (or by making a stable sort before comparing) and then compare those
sets for equality; update the equality check that references p.SupportedFeatures
and other.SupportedFeatures (the surrounding code using slices.Equal) to use the
set-based comparison so logically equivalent feature lists no longer appear
different.

142-154: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include the new 3.7i fields in isEmpty().

UpdateMetaIfNew() still returns early when only Effective* or SupportedFeatures is present, so capability-only/runtime-only updates get dropped and stale peer metadata can persist.

Suggested fix
 func (p PeerSystemMeta) isEmpty() bool {
 	return p.Hostname == "" &&
 		p.GoOS == "" &&
 		p.Kernel == "" &&
 		p.Core == "" &&
 		p.Platform == "" &&
 		p.OS == "" &&
 		p.OSVersion == "" &&
 		p.WtVersion == "" &&
 		p.UIVersion == "" &&
 		p.KernelVersion == "" &&
 		len(p.NetworkAddresses) == 0 &&
 		p.SystemSerialNumber == "" &&
 		p.SystemProductName == "" &&
 		p.SystemManufacturer == "" &&
 		p.Environment.Cloud == "" &&
 		p.Environment.Platform == "" &&
-		len(p.Files) == 0
+		len(p.Files) == 0 &&
+		p.EffectiveConnectionMode == "" &&
+		p.EffectiveRelayTimeoutSecs == 0 &&
+		p.EffectiveP2PTimeoutSecs == 0 &&
+		p.EffectiveP2PRetryMaxSecs == 0 &&
+		len(p.SupportedFeatures) == 0
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer/peer.go` around lines 142 - 154, The isEmpty()
predicate currently ignores the new 3.7i fields so UpdateMetaIfNew() can
incorrectly return early; update the Peer.isEmpty() implementation to treat any
non-empty EffectiveConnectionMode string, any non-zero
EffectiveRelayTimeoutSecs/EffectiveP2PTimeoutSecs/EffectiveP2PRetryMaxSecs, or a
non-empty SupportedFeatures slice as non-empty (i.e., return false when any of
those are present) so capability-only or runtime-only updates are not dropped by
UpdateMetaIfNew().
client/internal/engine.go (1)

2765-2783: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard the timer's connMgr reads with syncMsgMux.

This callback still reads e.connMgr.Mode() outside the lock that protects UpdatedRemotePeerConfig(), so the remote-offline debounce path races with mode changes and shutdown.

Suggested fix
 	t := time.AfterFunc(remoteOfflineGracePeriod, func() {
 		e.peerOfflineDebounceMu.Lock()
 		delete(e.peerOfflineDebounce, pubKey)
 		e.peerOfflineDebounceMu.Unlock()
@@
-		if e.ctx == nil || e.ctx.Err() != nil {
+		e.syncMsgMux.Lock()
+		ctx := e.ctx
+		mode := connectionmode.ModeUnspecified
+		if e.connMgr != nil {
+			mode = e.connMgr.Mode()
+		}
+		e.syncMsgMux.Unlock()
+
+		if ctx == nil || ctx.Err() != nil {
 			return
 		}
-		if e.connMgr == nil || e.connMgr.Mode() != connectionmode.ModeP2PDynamic {
+		if mode != connectionmode.ModeP2PDynamic {
 			return
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine.go` around lines 2765 - 2783, The timer callback reads
e.connMgr.Mode() (and other engine state) without holding the engine's
syncMsgMux, causing races with UpdatedRemotePeerConfig()/mode switches and
shutdown; fix by acquiring e.syncMsgMux (or the engine's message sync mutex)
before accessing e.connMgr or other mutable engine state in the debounce
function and release it after the checks so the Mode() call and any related
reads are performed under the same lock that UpdatedRemotePeerConfig() uses
(refer to e.connMgr, e.syncMsgMux, e.ctx and the timer callback closing over
pubKey/peerOfflineDebounce).
client/ui/peers_tab.go (1)

60-68: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep render() off the Fyne event thread.

showFull.OnChanged and the exported Refresh callback still invoke a function that does getSrvClient and a 5-second Status RPC synchronously, so the whole window can hang when the daemon is slow or unreachable.

Suggested fix
-	showFull.OnChanged = func(_ bool) { render() }
+	asyncRender := func() { go render() }
+	showFull.OnChanged = func(_ bool) { asyncRender() }
@@
-	return peersTabBundle{Content: content, ShowFull: showFull, Refresh: render}
+	return peersTabBundle{Content: content, ShowFull: showFull, Refresh: asyncRender}

Also applies to: 98-98, 126-126

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/peers_tab.go` around lines 60 - 68, render() (and the callbacks
wired to showFull.OnChanged and the exported Refresh) perform blocking work
(getSrvClient and conn.Status RPC) on the Fyne event thread; move that blocking
work into a background goroutine and only marshal UI updates back onto the Fyne
main thread using fyne.Do (or RunOnMain). Specifically, wrap the
getSrvClient/Status call sequence (the conn := getSrvClient(...) and st, err :=
conn.Status(...) logic) in a goroutine started from render(), showFull.OnChanged
handler, and Refresh, capture results/error, then call fyne.Do to set
summary.SetText, update widgets, and any other UI state so the UI never blocks
while waiting for the 5s RPC.
management/internals/shared/grpc/server.go (1)

497-500: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Nil out snapshotCh when the router closes it.

A closed channel stays ready forever, so continue here turns the handleUpdates loop into a tight spin until the stream exits.

Suggested fix
 		case nonce, ok := <-snapshotCh:
 			if !ok {
+				snapshotCh = nil
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/server.go` around lines 497 - 500, The
select case receiving from snapshotCh in handleUpdates currently does "if !ok {
continue }", which causes a tight spin because a closed channel remains
selectable; change the handling so that when the receive returns !ok you set
snapshotCh = nil (or otherwise remove it from the select) to stop further
selects on the closed channel and avoid busy-looping, keeping the rest of
handleUpdates behavior unchanged. Ensure you reference the variable snapshotCh
in the select case and update any related logic that assumes snapshotCh may be
non-nil.
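The nil-out trick works because of how select treats closed versus nil channels: a receive from a closed channel succeeds immediately forever, while a receive from a nil channel blocks, effectively removing that case from the select. A standalone sketch (drainUntilClosed is a hypothetical stand-in for handleUpdates, not the server code):

```go
package main

import (
	"fmt"
	"time"
)

// drainUntilClosed consumes values until the channel is closed. Without
// the "ch = nil" assignment, the closed channel would stay selectable and
// the loop would spin hot on zero-value receives.
func drainUntilClosed(ch <-chan int) int {
	sum := 0
	for ch != nil {
		select {
		case v, ok := <-ch:
			if !ok {
				ch = nil // disable this case; loop condition then exits
				continue
			}
			sum += v
		case <-time.After(time.Second):
			return sum // only reached if the sender never closes ch
		}
	}
	return sum
}

func main() {
	ch := make(chan int, 2)
	ch <- 1
	ch <- 2
	close(ch)
	fmt.Println(drainUntilClosed(ch)) // 3
}
```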
client/internal/peer/status.go (3)

470-471: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Capture the conn-state callback before unlocking.

These sites still call notifyConnStateChange(...) after d.mux.Unlock(), so the read of d.connStateListener races with SetConnStateListener. Build the closure while the mutex is still held, then return/invoke it after unlock.

Also applies to: 696-697, 764-765, 812-813, 863-864

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 470 - 471, The code currently
calls notifyConnStateChange(...) while the mutex is already unlocked, causing a
race on d.connStateListener; instead, capture the callback under the mutex (e.g.
read d.connStateListener into a local closure/variable inside the critical
section where hasConnStatusChanged(...) is checked and before calling
d.mux.Unlock()), then release the lock and invoke the captured closure or call
notifyConnStateChange using that local reference; apply the same pattern at the
other occurrences mentioned (around lines where
hasConnStatusChanged/notifyConnStateChange are used and SetConnStateListener may
run).
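The capture-before-unlock pattern the prompt describes, reduced to a minimal standalone sketch (notifier and its fields are illustrative, not the real status recorder):

```go
package main

import (
	"fmt"
	"sync"
)

// notifier reads its listener into a local variable while the mutex is
// held and invokes it only after the lock is released, so a concurrent
// SetListener can never race the read and the callback runs unlocked.
type notifier struct {
	mu       sync.Mutex
	state    string
	listener func(string)
}

func (n *notifier) SetListener(fn func(string)) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.listener = fn
}

func (n *notifier) UpdateState(s string) {
	n.mu.Lock()
	n.state = s
	cb := n.listener // capture under the lock
	n.mu.Unlock()
	if cb != nil {
		cb(s) // invoke outside the lock; safe even if cb re-enters notifier
	}
}

func main() {
	n := &notifier{}
	n.SetListener(func(s string) { fmt.Println("state:", s) })
	n.UpdateState("connected")
}
```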

692-693: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

notifyPeerStateChangeListeners is called without the required lock.

This helper immediately reads d.peers/d.changeNotify via snapshotRouterPeersLocked, and its own docstring says the caller must hold d.mux. Moving these calls below Unlock() reintroduces the concurrent-access bug the earlier fix was addressing.

Suggested fix
-	d.mux.Unlock()
-
 	if materialICE {
 		d.notifyPeerStateChangeListeners(receivedState.PubKey)
 	}
+	d.mux.Unlock()

Mirror the same ordering in the relay path.

Also applies to: 760-761

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 692 - 693, The call to
notifyPeerStateChangeListeners is being made without holding d.mux even though
that helper (and snapshotRouterPeersLocked it calls) requires the caller to hold
d.mux; fix by mirroring the relay-path ordering: acquire d.mux, call
notifyPeerStateChangeListeners while still holding the lock (before calling
Unlock()), then release the lock — ensure the same pattern is applied at both
locations mentioned (around notifyPeerStateChangeListeners and
snapshotRouterPeersLocked) so accesses to d.peers and d.changeNotify happen
while d.mux is held.

542-544: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t fire peerListChanged while d.mux is held.

notifyPeerListChanged() synchronously enters notifier/listener code. With the current defer d.mux.Unlock(), any listener that re-enters Status can deadlock this path. Snapshot numPeers under the lock, release it, then notify.

Also applies to: 565-566

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 542 - 544, The code currently
calls d.notifyPeerListChanged() and d.notifyPeerStateChangeListeners(pubKey)
while holding d.mux (defer d.mux.Unlock()), which can deadlock if listeners
re-enter Status; instead, inside the critical section capture the minimal state
needed (e.g., snapshot numPeers and any other values used by the listeners, and
the pubKey flag) then release d.mux and invoke d.notifyPeerListChanged() and
d.notifyPeerStateChangeListeners(pubKey) after unlocking; apply the same change
to the other occurrence around lines that call those notify methods (the 565-566
block) so notifications happen outside the lock.
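The snapshot-then-notify ordering can be shown with a toy store type (all names hypothetical). Go's sync.Mutex is not reentrant, so the listener below would deadlock if AddPeer still held the lock when firing; snapshotting under the lock and notifying after Unlock makes re-entry safe:

```go
package main

import (
	"fmt"
	"sync"
)

// store fires its change listener only after releasing the mutex, so a
// listener that re-enters Status() cannot deadlock the caller.
type store struct {
	mu       sync.Mutex
	peers    map[string]bool
	onChange func(numPeers int)
}

func (s *store) Status() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.peers)
}

func (s *store) AddPeer(key string) {
	s.mu.Lock()
	s.peers[key] = true
	n := len(s.peers) // snapshot the state the listener needs
	cb := s.onChange
	s.mu.Unlock()
	if cb != nil {
		cb(n) // fired outside the lock
	}
}

func main() {
	s := &store{peers: map[string]bool{}}
	s.onChange = func(n int) {
		// Re-enters Status(); would deadlock if AddPeer still held s.mu.
		fmt.Println("peers now:", n, "status says:", s.Status())
	}
	s.AddPeer("peer-1")
}
```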
management/internals/shared/grpc/conversion.go (1)

315-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only mark liveness authoritative when rPeer.Status is present.

cfg.ServerLivenessKnown = true is still outside the rPeer.Status != nil guard. When the status record is missing, clients now treat LiveOnline=false as authoritative offline instead of using the legacy fallback heuristic.

Suggested fix
 		if rPeer.Status != nil {
 			if !rPeer.Status.LastSeen.IsZero() {
 				cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
 			}
 			cfg.LiveOnline = rPeer.Status.Connected
+			cfg.ServerLivenessKnown = true
 		}
-		cfg.ServerLivenessKnown = true
 		dst = append(dst, cfg)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/conversion.go` around lines 315 - 326, The
ServerLivenessKnown flag is being set unconditionally which falsely marks
liveness authoritative when rPeer.Status is nil; move the assignment of
cfg.ServerLivenessKnown = true inside the rPeer.Status != nil guard (the same
block that sets cfg.LastSeenAtServer and cfg.LiveOnline) so that
ServerLivenessKnown is only true when rPeer.Status is present and clients will
fall back to the legacy LastSeenAtServer heuristic when status is missing.
🧹 Nitpick comments (4)
client/iface/bind/activity.go (1)

98-122: 💤 Low value

Callback dispatch path is correct; reaffirm the "MUST be cheap" contract at registration sites.

The CAS-gated single-fire is a nice way to dedupe under packet-burst races: only the goroutine that wins the CompareAndSwap invokes cb, and the captured cb snapshot avoids holding the lock across the user callback.

The previously-raised concern about the synchronous invocation on the WG read/write goroutine has been addressed by callers (per the prior thread, commits 2ef31c0–b07184d). Just note that the contract documented here on lines 117–119 is now load-bearing — any future SetOnActivity registration that calls into peer-state machinery directly (rather than dispatching to its own goroutine / buffered channel) will reintroduce the original contention risk. Worth a brief comment on the engine-side registration site reaffirming the non-blocking expectation, if not already present.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/iface/bind/activity.go` around lines 98 - 122, The CAS-protected
callback path in ActivityRecorder.record correctly ensures only the winning
goroutine invokes the callback (cb/onActivity), but you must explicitly reaffirm
the "handler MUST be cheap or self-defer" contract at registration sites: update
the SetOnActivity (or whichever code assigns ActivityRecorder.onActivity) to
include a short comment and/or documentation that any callback registered must
not block the WG read/write goroutine and should either spawn its own goroutine
or enqueue to a buffered channel; reference ActivityRecorder.record, onActivity,
and SetOnActivity in that note so future changes do not reintroduce blocking in
the hot path.
client/internal/stdnet/filter.go (1)

47-92: ⚡ Quick win

Split the closure into small helpers before merge.

This function now mixes platform-specific rules, disallow-list matching, and wgctrl probing in one branchy closure, which is why Sonar is still flagging it for cognitive complexity. Extracting those checks into helpers should clear the gate and make regressions like the Windows "lo" case much easier to spot.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter.go` around lines 47 - 92, The InterfaceFilter
closure is doing platform checks, disallow-list prefix logic, and wgctrl probing
in one function; split it into small helpers: implement isLoopbackLike(name
string) bool (handle "lo"/"lo0" and Windows/ios exceptions),
isWindowsKnownBad(name string) bool (use windowsKnownBadSubstrings),
matchesDisallowList(name string, disallowList []string) bool (apply the
veth/Windows special-case and ios exception), and isWireGuardDevice(name string)
(wrap wgctrl.New(), Device lookup and Close); then refactor InterfaceFilter to
call these helpers (keeping the same return semantics) so the closure contains
only orchestration and cognitive complexity is reduced while preserving behavior
of InterfaceFilter, windowsKnownBadSubstrings, and wgctrl probing.
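A toy version of the suggested split, with made-up rules standing in for the real platform checks; only the orchestration shape matters here, so the filter closure stays flat while each predicate lives in its own small, testable helper:

```go
package main

import (
	"fmt"
	"strings"
)

// isLoopbackLike is a stand-in for the real loopback rules (which carry
// Windows/iOS exceptions in the actual code).
func isLoopbackLike(name string) bool {
	return name == "lo" || name == "lo0"
}

// matchesDisallowList checks the prefix disallow list; the real helper
// would also carry the veth/Windows special cases.
func matchesDisallowList(name string, disallow []string) bool {
	for _, prefix := range disallow {
		if strings.HasPrefix(name, prefix) {
			return true
		}
	}
	return false
}

// interfaceFilter is pure orchestration: each rejection reason is one
// helper call, keeping cognitive complexity low.
func interfaceFilter(disallow []string) func(string) bool {
	return func(name string) bool {
		if isLoopbackLike(name) {
			return false
		}
		if matchesDisallowList(name, disallow) {
			return false
		}
		return true // a wgctrl-probe helper would slot in here as another check
	}
}

func main() {
	keep := interfaceFilter([]string{"veth", "docker"})
	fmt.Println(keep("eth0"), keep("lo"), keep("veth1a2b")) // true false false
}
```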
client/internal/peer/conn.go (1)

1143-1231: 💤 Low value

Minor doc inaccuracy in AttachICEOnRelayActivity gate 6 comment.

Gate 6 says "we had P2P at least once — avoids pointless retries for peers we never reached P2P with", but everConnected is also set in onRelayConnectionIsReady (line 656), so it's true after the first relay connection too, not exclusively P2P. The actual intent (distinguishing "ICE detached by design" from "never connected") is correct; only the comment is misleading.

📝 Suggested comment update
-//   6. everConnected must be true (we had P2P at least once -- avoids
-//      pointless retries for peers we never reached P2P with)
+//   6. everConnected must be true (peer has established at least one
+//      connection — relay or P2P — distinguishing "ICE detached for
+//      inactivity" from "brand new peer whose ICE listener hasn't been
+//      registered yet").
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 1143 - 1231, Update the gate-6
comment in AttachICEOnRelayActivity to accurately describe everConnected: note
that everConnected is set when any connection (including relay via
onRelayConnectionIsReady) was established, not only P2P, and that the check is
meant to distinguish peers that previously connected (P2P or relay) from peers
that never connected; reference AttachICEOnRelayActivity, everConnected, and
onRelayConnectionIsReady so the reader can see where everConnected is set.
client/internal/peer/conn_handover_order_test.go (1)

135-176: 💤 Low value

extractFunctionBody brace counter is fragile but acceptable for these tests.

The helper counts {/} without awareness of string literals or comment-enclosed braces. For the functions currently under test (onICEConnectionIsReady, onGuardEvent, onWGDisconnected, onICEStateDisconnected) this is fine, but a future function containing a string literal like s := "map: {key: val}" would give a wrong body slice. Worth noting for anyone who extends this helper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_handover_order_test.go` around lines 135 - 176, The
brace counter in extractFunctionBody is fragile because it counts '{'/'}' even
when they appear inside string literals or comments; update extractFunctionBody
to ignore braces found inside single-quoted, double-quoted or backtick strings
and inside // line-comments and /* block-comments (i.e., implement a small
lexer-state machine around the existing loop), keeping the existing
depth/openSeen logic and return behavior unchanged so the helper still extracts
the function body correctly for onICEConnectionIsReady, onGuardEvent,
onWGDisconnected, onICEStateDisconnected and future functions.
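A sketch of the lexer-state upgrade the prompt suggests, handling double-quoted strings, raw strings, and // line comments (block comments and rune literals are omitted for brevity; netBraceDepth is an illustrative helper, not the test's actual code):

```go
package main

import "fmt"

// netBraceDepth returns the net brace depth of a Go snippet while
// ignoring braces inside string literals and // comments, via a small
// state machine over the byte stream.
func netBraceDepth(src string) int {
	depth := 0
	inStr, inRaw, inComment := false, false, false
	for i := 0; i < len(src); i++ {
		c := src[i]
		switch {
		case inComment:
			if c == '\n' {
				inComment = false
			}
		case inStr:
			if c == '\\' {
				i++ // skip the escaped character
			} else if c == '"' {
				inStr = false
			}
		case inRaw:
			if c == '`' {
				inRaw = false
			}
		default:
			switch c {
			case '/':
				if i+1 < len(src) && src[i+1] == '/' {
					inComment = true
				}
			case '"':
				inStr = true
			case '`':
				inRaw = true
			case '{':
				depth++
			case '}':
				depth--
			}
		}
	}
	return depth
}

func main() {
	fmt.Println(netBraceDepth(`s := "map: {key: val}"`)) // 0 (braces inside the string are ignored)
	fmt.Println(netBraceDepth("f() { // }\n}"))          // 0 (the commented brace is ignored)
}
```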

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3c4d3e39-19dd-4193-ae26-ba105f0bdd06

📥 Commits

Reviewing files that changed from the base of the PR and between b082536 and e8a298f.

⛔ Files ignored due to path filters (2)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (60)
  • .gitignore
  • client/android/client.go
  • client/android/peer_notifier.go
  • client/android/preferences.go
  • client/android/preferences_clamp_test.go
  • client/iface/bind/activity.go
  • client/iface/device/endpoint_manager.go
  • client/internal/conn_mgr.go
  • client/internal/conn_state_pusher.go
  • client/internal/conn_state_pusher_material_test.go
  • client/internal/conn_state_pusher_test.go
  • client/internal/conn_state_pusher_testhelper_test.go
  • client/internal/engine.go
  • client/internal/engine_offline_debounce_test.go
  • client/internal/lazyconn/activity/listener_bind_test.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/peer/conn.go
  • client/internal/peer/conn_handover_order_test.go
  • client/internal/peer/conn_lazy_keepwgpeer_test.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/guard/guard_test.go
  • client/internal/peer/guard/ice_retry_state_test.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/status.go
  • client/internal/peer/status_debounce_test.go
  • client/internal/peer/status_remote_meta_notify_test.go
  • client/internal/peerstore/store.go
  • client/internal/profilemanager/config.go
  • client/internal/stdnet/filter.go
  • client/internal/stdnet/filter_test.go
  • client/proto/daemon.proto
  • client/system/features.go
  • client/system/features_test.go
  • client/ui/network.go
  • client/ui/peers_tab.go
  • docs/bugs/2026-05-04-user-peer-visibility-regression.md
  • docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
  • management/internals/shared/grpc/conversion.go
  • management/internals/shared/grpc/conversion_test.go
  • management/internals/shared/grpc/server.go
  • management/server/account.go
  • management/server/activity/codes.go
  • management/server/http/handlers/accounts/accounts_handler.go
  • management/server/http/handlers/accounts/accounts_handler_test.go
  • management/server/http/handlers/accounts/validate_uint32_timeout_test.go
  • management/server/peer.go
  • management/server/peer/peer.go
  • management/server/peer_connections/store.go
  • management/server/peer_connections/store_test.go
  • management/server/peer_test.go
  • management/server/store/file_store.go
  • management/server/store/sql_store.go
  • management/server/types/settings.go
  • management/server/types/settings_ptr_equal_test.go
  • management/server/types/settings_test.go
  • shared/management/client/grpc.go
  • shared/management/http/api/openapi.yml
  • shared/management/http/api/types.gen.go
  • shared/management/proto/management.proto
🚧 Files skipped from review as they are similar to previous changes (3)
  • management/server/types/settings_ptr_equal_test.go
  • .gitignore
  • client/internal/conn_state_pusher.go
👮 Files not reviewed due to content moderation or server errors (13)
  • shared/management/http/api/types.gen.go
  • client/proto/daemon.proto
  • shared/management/proto/management.proto
  • management/server/store/file_store.go
  • management/server/types/settings_test.go
  • management/server/peer_connections/store_test.go
  • client/internal/engine_offline_debounce_test.go
  • client/internal/peer/status_remote_meta_notify_test.go
  • management/server/peer_test.go
  • docs/bugs/2026-05-04-user-peer-visibility-regression.md
  • management/server/activity/codes.go
  • management/server/account.go
  • client/ui/network.go

Comment on lines +265 to 270
// Phase 3.5 (#5989): reset ICE backoff + recreate workerICE on network change.
// Set before Start() is called so the goroutine sees it without races.
if !skipICE {
conn.guard.SetOnNetworkChange(conn.onNetworkChange)
}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that guard.Start calls the network-change callback synchronously
# from its own goroutine (not in a sub-goroutine), which is the prerequisite
# for the deadlock.
rg -n "networkChange\|NetworkChange\|OnNetworkChange\|SetOnNetworkChange" --type=go -C3

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Check if the file exists and examine its actual content around the referenced lines
ls -la client/internal/peer/conn.go 2>/dev/null && echo "File exists" || echo "File not found"

# Check file size to understand scope
wc -l client/internal/peer/conn.go

# Read the actual content around lines 265-270
head -280 client/internal/peer/conn.go | tail -30

Repository: netbirdio/netbird

Length of output: 1390


🏁 Script executed:

# Search for any guard-related callback setup in the codebase
find . -name "*.go" -type f | head -20 | xargs grep -l "SetOnNetworkChange\|onNetworkChange" 2>/dev/null || echo "No matches found"

# Also search more broadly for guard-related patterns
rg "guard\." --type=go | head -20

Repository: netbirdio/netbird

Length of output: 1821


🏁 Script executed:

# Read guard.go to understand SetOnNetworkChange
cat client/internal/peer/guard/guard.go | head -150

Repository: netbirdio/netbird

Length of output: 5339


🏁 Script executed:

# Find the onNetworkChange function in conn.go
rg -n "func.*onNetworkChange" client/internal/peer/conn.go -A 15

Repository: netbirdio/netbird

Length of output: 521


🏁 Script executed:

# Find the Close() function and examine defer structure
rg -n "func.*Close\(\)" client/internal/peer/conn.go -A 50 | head -80

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Search for conn.wg.Wait() in conn.go
rg -n "wg\.Wait\(\)" client/internal/peer/conn.go -B 5 -A 5

Repository: netbirdio/netbird

Length of output: 461


🏁 Script executed:

# Find where onNetworkChange is actually called in guard.go
rg -n "onNetworkChange\|\.onNetworkChange" client/internal/peer/guard/guard.go -B 2 -A 2

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Get the full Close() function to see defer structure
rg -n "func.*\(conn \*Conn\) Close" client/internal/peer/conn.go -A 70 | head -120

Repository: netbirdio/netbird

Length of output: 2571


🏁 Script executed:

# Check if there's an onGuardEvent callback pattern mentioned in the review
rg -n "onGuardEvent\|eventCallback" client/internal/peer/conn.go -B 2 -A 2 | head -40

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Get the full reconnectLoopWithRetry function with complete select statement
sed -n '118,250p' client/internal/peer/guard/guard.go

Repository: netbirdio/netbird

Length of output: 3204


🏁 Script executed:

# Double-check by searching for srReconnectedChan case in the select
rg -n "srReconnectedChan\|case.*srWatcher\|case.*Reconnected" client/internal/peer/guard/guard.go -B 1 -A 3

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Verify guard goroutine is added to conn.wg
rg -n "guard.Start\|go.*guard\|conn.guard" client/internal/peer/conn.go -B 2 -A 2 | head -30

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Check if guard is started and in what goroutine
rg -n "func.*Start.*context\|guard.Start" client/internal/peer/conn.go -B 1 -A 3

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Look for where guard is started in relation to wg
sed -n '280,295p' client/internal/peer/conn.go

Repository: netbirdio/netbird

Length of output: 452


🏁 Script executed:

# Check what srWatcher does and if it can fire srReconnectedChan
rg -n "NewListener\|Listener\|srWatcher" client/internal/peer/guard/sr_watcher.go | head -20

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Verify conn.onGuardEvent exists to confirm the pattern
rg -n "func.*onGuardEvent\|eventCallback" client/internal/peer/conn.go -B 1 -A 5 | head -30

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Check sr_watcher.go directly
ls -la client/internal/peer/guard/sr_watcher.go
head -100 client/internal/peer/guard/sr_watcher.go

Repository: netbirdio/netbird

Length of output: 2515


🏁 Script executed:

# Find onGuardEvent definition/usage
rg -n "onGuardEvent" client/internal/peer/ -B 2 -A 5 | head -40

Repository: netbirdio/netbird

Length of output: 3119


🏁 Script executed:

# Verify SRWatcher has actual event sending logic
cat client/internal/peer/guard/sr_watcher.go | tail -100

Repository: netbirdio/netbird

Length of output: 2364


🏁 Script executed:

# Find onGuardEvent definition to understand existing callback pattern
rg -n "func.*onGuardEvent" client/internal/peer/conn.go -A 20 | head -40

Repository: netbirdio/netbird

Length of output: 1347


🏁 Script executed:

# Check if onICEFailed/onICEConnected exist and access iceBackoff without lock
rg -n "onICEFailed\|onICEConnected" client/internal/peer/conn.go -B 2 -A 5 | head -50

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Verify srReconnectedChan can have pending events when ctx.Done() fires
# Check if notify() is called from onReconnected/onICEChanged
rg -n "notify\|onReconnected\|onICEChanged" client/internal/peer/guard/sr_watcher.go

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Check the actual ctx/cancel flow to confirm sequence
rg -n "ctxCancel\|ctx.Done\|context.WithCancel" client/internal/peer/conn.go | head -20

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Final verification: confirm srReconnectedChan can have events queued
# Check the notify function calls notify() sends to buffered channels
grep -n "case listener <-" client/internal/peer/guard/sr_watcher.go

Repository: netbirdio/netbird

Length of output: 96


🏁 Script executed:

# Verify the complete defer order in Close to ensure mu.Unlock runs after wg.Wait
sed -n '312,376p' client/internal/peer/conn.go | grep -n "defer\|conn.wg.Wait"

Repository: netbirdio/netbird

Length of output: 139


🏁 Script executed:

# Verify iceBackoff access patterns
rg -n "iceBackoff\." client/internal/peer/conn.go | head -30

Repository: netbirdio/netbird

Length of output: 1614


🏁 Script executed:

# Check if iceBackoff has internal synchronization (mutex)
fd -t f "backoff" --type go | xargs grep -l "type.*Backoff\|iceBackoff" | head -5

Repository: netbirdio/netbird

Length of output: 231


🏁 Script executed:

# Look for iceBackoff type definition
rg -n "type.*iceBackoff\|iceBackoff.*struct" client/internal/peer/ -B 2 -A 5 | head -40

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Find iceBackoff type definition in the codebase
rg -n "type.*[Ii]ceBackoff" --type=go -B 1 -A 5

Repository: netbirdio/netbird

Length of output: 514


🏁 Script executed:

# Check where iceBackoff is instantiated to understand its type
rg -n "iceBackoff\s*=" client/internal/peer/conn.go | head -5

Repository: netbirdio/netbird

Length of output: 246


🏁 Script executed:

# Look for onICEFailed and onICEConnected callback definitions
rg -n "func.*onICEFailed\|func.*onICEConnected" client/internal/peer/conn.go -A 10 | head -40

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Confirm the iceBackoffState mutex protects all methods
cat client/internal/peer/ice_backoff.go | head -120

Repository: netbirdio/netbird

Length of output: 4044


🏁 Script executed:

# Verify markFailure and IsSuspended acquire the mutex
rg -n "func.*markFailure\|func.*IsSuspended\|func.*markSuccess" client/internal/peer/ice_backoff.go -A 3

Repository: netbirdio/netbird

Length of output: 43


🏁 Script executed:

# Get the rest of ice_backoff.go to see all method definitions
tail -80 client/internal/peer/ice_backoff.go

Repository: netbirdio/netbird

Length of output: 3224


🏁 Script executed:

# Search more carefully for all iceBackoffState methods
rg -n "func \(s \*iceBackoffState\)" client/internal/peer/ice_backoff.go

Repository: netbirdio/netbird

Length of output: 440


New onNetworkChange callback introduces a deadlock with Close()'s mutex-held wg.Wait().

Close() holds conn.mu for the entire duration of conn.wg.Wait() (line 374) because defer conn.mu.Unlock() (line 315) runs only after the function returns. The guard goroutine lives in conn.wg. After the new SetOnNetworkChange(conn.onNetworkChange) registration (line 268), if the guard's select simultaneously has ctx.Done() and an srReconnectedChan event ready (after a signal/relay reconnect), Go may choose the reconnect case and call onNetworkChange(), which immediately tries conn.mu.Lock() (line 1461). That goroutine blocks forever, conn.wg.Wait() never returns, and the daemon hangs.

The existing onGuardEvent callback was safe because it never acquires conn.mu. onNetworkChange is the first guard callback to do so, breaking the invariant.

Minimal mitigation—add a non-blocking ctx check before taking the lock in onNetworkChange:

🔒 Proposed fix
 func (conn *Conn) onNetworkChange() {
+	// Fast-path: if the connection is already being closed (ctx cancelled),
+	// skip the work to avoid deadlocking with Close(), which holds conn.mu
+	// while waiting for the guard goroutine via wg.Wait().
+	select {
+	case <-conn.ctx.Done():
+		return
+	default:
+	}
 	conn.mu.Lock()
 	defer conn.mu.Unlock()

 	if conn.ctx.Err() != nil {
 		return
 	}

A structural fix (release conn.mu before conn.wg.Wait()) eliminates the root cause entirely but requires restructuring the Close() defers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 265 - 270, The onNetworkChange
callback can deadlock because it attempts conn.mu.Lock() while Close() holds
conn.mu during conn.wg.Wait(); modify onNetworkChange to first perform a
non-blocking check of the connection's context (or a closed/closing flag) and
return immediately if the context is done or connection is closing, before
attempting to acquire conn.mu, so the guard goroutine will not block trying to
lock; locate the callback registered via
conn.guard.SetOnNetworkChange(conn.onNetworkChange) and add the early ctx.Done()
(or connIsClosing check) guard at the top of onNetworkChange, preserving
existing behavior when not closing.

Comment on lines +1750 to +1752
			IceBackoffFailures:         int32(peerState.IceBackoffFailures),
			IceBackoffNextRetry:        timestamppb.New(peerState.IceBackoffNextRetry),
			IceBackoffSuspended:        peerState.IceBackoffSuspended,

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve unset semantics for IceBackoffNextRetry.

timestamppb.New(time.Time{}) serializes as 0001-01-01T00:00:00Z, not “missing”. For peers with no scheduled retry, clients will see a non-nil timestamp unless this is guarded the same way as LastSeenAtServer.

Suggested fix
 			SshHostKey:                 peerState.SSHHostKey,
 			IceBackoffFailures:         int32(peerState.IceBackoffFailures),
-			IceBackoffNextRetry:        timestamppb.New(peerState.IceBackoffNextRetry),
 			IceBackoffSuspended:        peerState.IceBackoffSuspended,
 			// Phase 3.7i (`#5989`): per-peer remote meta fields.
 			ServerOnline:               peerState.ServerOnline,
@@
 			ConnectionTypeExtended:     DeriveConnectionTypeExtended(peerState),
 		}
+		if !peerState.IceBackoffNextRetry.IsZero() {
+			pbPeerState.IceBackoffNextRetry = timestamppb.New(peerState.IceBackoffNextRetry)
+		}
 		if !peerState.RemoteLastSeenAtServer.IsZero() {
 			pbPeerState.LastSeenAtServer = timestamppb.New(peerState.RemoteLastSeenAtServer)
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 1750 - 1752, The field
IceBackoffNextRetry currently always sets a timestamp via
timestamppb.New(peerState.IceBackoffNextRetry) which yields a non-nil zero time
instead of preserving "unset" semantics; change the assignment for
IceBackoffNextRetry to mirror how LastSeenAtServer is handled by checking if
peerState.IceBackoffNextRetry.IsZero() (or equivalent) and only calling
timestamppb.New(...) when non-zero, otherwise leave the proto field nil so
clients see it as missing.


---

### Task 1: Replace sharedsock with standard UDP socket in TunKernelDevice.Up()

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix heading-level jump to satisfy markdownlint MD001.

Line 13 jumps from higher-level context to ### directly; use ## here (or add an intermediate heading) to keep heading levels incrementing by one.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 13-13: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` at line 13,
Change the heading "Task 1: Replace sharedsock with standard UDP socket in
TunKernelDevice.Up()" from a third-level heading (###) to a second-level heading
(##) so markdown heading levels increment by one and satisfy MD001; locate the
heading text in the document and update its leading hashes accordingly (ensure
any surrounding headings remain correctly ordered).

Comment on lines +73 to +81
Run: `cd /home/ai-agent/projects/netbird && GOOS=linux GOARCH=arm64 go build ./client/`
Expected: Successful build, no errors.

- [ ] **Step 4: Cross-compile for arm64 (OpenWrt router)**

```bash
cd /home/ai-agent/projects/netbird
GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -o /tmp/netbird-kernel-fix-arm64 ./client/
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid hard-coded local paths in build instructions.

Lines 73-81 use /home/ai-agent/projects/netbird, which makes the plan less portable for other environments. Prefer repo-relative steps (e.g., “from repo root, run …”).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` around lines
73 - 81, Update Step 4 (“Step 4: Cross-compile for arm64 (OpenWrt router)”) to
remove the hard-coded absolute path `/home/ai-agent/projects/netbird`; instead
instruct users to run the cross-compile from the repository root and use a
repo-relative client path (e.g., reference the ./client directory) while
retaining the environment variables (GOOS, GOARCH, CGO_ENABLED) and output
filename; edit the markdown block that currently contains the `cd
/home/ai-agent/projects/netbird` command to present the repo-relative
instruction and command invocation.

Comment on lines +248 to +269
// ApplyLegacyLazyFallbackDefaults seeds the two LegacyLazyFallback*
// fields if they are at the Go zero value. Idempotent — calling it on
// an already-populated Settings is a no-op. The "is at zero value"
// detection is intentionally simple: there is no semantic difference
// between "user explicitly turned the toggle off / set timeout to 0"
// and "field uninitialised", because we forbid 0 timeouts at the API
// layer (range [60, 86400]) and the false toggle case is preserved
// only when the field was already true and got copied verbatim. New
// codepaths that need to remember "user opted out" should use the API
// handler's path (which only ever sees the wire field).
func (s *Settings) ApplyLegacyLazyFallbackDefaults() {
	if s == nil {
		return
	}
	// timeout==0 is never valid, so we always rewrite. Toggle: only
	// reset to default true when the timeout was also zero (= field
	// freshly built, never touched), otherwise honour the explicit
	// false the caller put there.
	if s.LegacyLazyFallbackTimeoutSeconds == 0 {
		s.LegacyLazyFallbackEnabled = DefaultLegacyLazyFallbackEnabled
		s.LegacyLazyFallbackTimeoutSeconds = DefaultLegacyLazyFallbackTimeoutSeconds
	}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don’t let timeout defaulting silently re-enable legacy fallback.

ApplyLegacyLazyFallbackDefaults() treats LegacyLazyFallbackTimeoutSeconds == 0 as “freshly built” and then also forces LegacyLazyFallbackEnabled = true. That makes an explicit false impossible to preserve on any caller that constructs a fresh Settings and leaves the timeout unset, even though the field comments say this helper is used from the PUT path. Split the two concerns: seed the timeout here, but only seed the boolean in code paths that know the struct is brand new.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/types/settings.go` around lines 248 - 269,
ApplyLegacyLazyFallbackDefaults currently treats
LegacyLazyFallbackTimeoutSeconds == 0 as “fresh” and forces
LegacyLazyFallbackEnabled = DefaultLegacyLazyFallbackEnabled, which prevents
callers from explicitly preserving false; change ApplyLegacyLazyFallbackDefaults
(in type Settings) to only seed LegacyLazyFallbackTimeoutSeconds when zero (set
to DefaultLegacyLazyFallbackTimeoutSeconds) and do NOT modify
LegacyLazyFallbackEnabled there, and instead seed LegacyLazyFallbackEnabled =
DefaultLegacyLazyFallbackEnabled only from code paths that create brand-new
Settings (e.g. the PUT/new-settings handler) so callers that leave timeout unset
but explicitly set Enabled=false are preserved.
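A minimal sketch of the proposed split, on a hypothetical trimmed-down Settings (field names follow this comment; the default values are assumed for illustration):

```go
package main

import "fmt"

// Trimmed stand-in for types.Settings.
type Settings struct {
	LegacyLazyFallbackEnabled        bool
	LegacyLazyFallbackTimeoutSeconds int
}

const (
	defaultEnabled        = true
	defaultTimeoutSeconds = 120 // assumed value, for illustration only
)

// applyTimeoutDefault seeds only the timeout, as the review proposes:
// safe on any Settings, an explicit Enabled=false is left untouched.
func applyTimeoutDefault(s *Settings) {
	if s == nil {
		return
	}
	if s.LegacyLazyFallbackTimeoutSeconds == 0 {
		s.LegacyLazyFallbackTimeoutSeconds = defaultTimeoutSeconds
	}
}

// newSettings seeds the boolean only where the struct is known brand new.
func newSettings() *Settings {
	s := &Settings{LegacyLazyFallbackEnabled: defaultEnabled}
	applyTimeoutDefault(s)
	return s
}

func main() {
	// Explicit opt-out with the timeout left unset: must stay false.
	optedOut := &Settings{LegacyLazyFallbackEnabled: false}
	applyTimeoutDefault(optedOut)
	fmt.Println(optedOut.LegacyLazyFallbackEnabled, optedOut.LegacyLazyFallbackTimeoutSeconds)
}
```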

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
management/server/http/handlers/accounts/accounts_handler.go (1)

176-198: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Omitted mode/timeout fields still wipe existing account settings.

This function rebuilds types.Settings from scratch and only seeds the legacy-fallback fields. An older Dashboard or API client that updates unrelated settings without sending connection_mode, p2p_timeout_seconds, p2p_retry_max_seconds, or relay_timeout_seconds will clear the stored values on save, which breaks the backward-compatibility goal for this rollout. Seed these fields from the current persisted settings before applying request overrides, or merge against the existing settings instead of starting from zero.

Also applies to: 231-268

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
176 - 198, The handler currently constructs a new types.Settings
(returnSettings) from req.Settings which causes omission of older mode/timeout
fields (connection_mode, p2p_timeout_seconds, p2p_retry_max_seconds,
relay_timeout_seconds) to wipe persisted values; instead, initialize
returnSettings by copying the existing persisted settings for those specific
fields (or merge the persisted settings into returnSettings) before applying
overrides from req.Settings so absent fields in the request preserve their
stored values; update the logic around returnSettings, req.Settings and the
per-field if-blocks to only overwrite a field when the request explicitly
provides it, and ensure the same change is applied to the other block noted
(lines 231-268) that rebuilds Settings elsewhere.
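The merge-instead-of-rebuild pattern in isolation, sketched with hypothetical request/settings types (nil pointer = field omitted on the wire; the real handler has many more fields):

```go
package main

import "fmt"

// Hypothetical wire request: nil pointer means the client omitted the field.
type settingsRequest struct {
	ConnectionMode     *string
	RelayTimeoutSecond *int
}

// Trimmed stand-in for the persisted account settings.
type settings struct {
	ConnectionMode     string
	RelayTimeoutSecond int
}

// mergeSettings starts from the persisted copy and overwrites only the
// fields the request explicitly carries — the pattern the review asks for.
func mergeSettings(persisted settings, req settingsRequest) settings {
	out := persisted // seed from what is stored, not from zero
	if req.ConnectionMode != nil {
		out.ConnectionMode = *req.ConnectionMode
	}
	if req.RelayTimeoutSecond != nil {
		out.RelayTimeoutSecond = *req.RelayTimeoutSecond
	}
	return out
}

func main() {
	stored := settings{ConnectionMode: "p2p-dynamic", RelayTimeoutSecond: 300}
	mode := "p2p-lazy"
	// Request updates the mode but omits the timeout: stored value survives.
	got := mergeSettings(stored, settingsRequest{ConnectionMode: &mode})
	fmt.Println(got.ConnectionMode, got.RelayTimeoutSecond)
}
```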
♻️ Duplicate comments (30)
client/internal/stdnet/filter_test.go (1)

23-66: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

This Windows regression test is still asymmetric and cross-platform brittle.

On non-Windows runners, the vEthernet (...) rows are expected to be filtered by the generic veth prefix path, but the loop never asserts most want: true cases, so the test still passes. Skip this test outside Windows (or split the Windows-only rows out) and use a symmetric got != c.want assertion so allowed interfaces are verified too.

Suggested adjustment
 func TestInterfaceFilter_Windows_TargetedFiltering(t *testing.T) {
+	if runtime.GOOS != "windows" {
+		t.Skip("Windows-specific interface-name filtering")
+	}
+
 	disallow := []string{"wt", "wg", "veth", "br-", "lo", "docker"}
 	allow := InterfaceFilter(disallow)
@@
 	for _, c := range cases {
-		// The wgctrl branch can override on hosts where NetBird is
-		// running; tests run on a host where these names are not
-		// real interfaces, so the final return faithfully reflects
-		// the disallow-list logic.
 		got := allow(c.name)
-		// "veth*" prefix only filters on non-Windows; on Linux test
-		// runners "vEthernet (LAN)" still passes because of mixed
-		// case + the !Windows branch keeping the prefix match.
-		if !c.want && got {
-			t.Errorf("InterfaceFilter(%q) = true, want false (should be filtered)", c.name)
-		}
-		if c.want && !got && runtime.GOOS == "windows" && c.name == "vEthernet (LAN)" {
-			t.Fatalf("InterfaceFilter(%q) = false, want true on Windows (this is uray-mic-d4's default-route interface)", c.name)
+		if got != c.want {
+			t.Errorf("InterfaceFilter(%q) = %v, want %v", c.name, got, c.want)
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter_test.go` around lines 23 - 66, The test
TestInterfaceFilter_Windows_TargetedFiltering is brittle cross-platform: update
it to skip non-Windows runs (check runtime.GOOS != "windows") or separate
Windows-only cases from generic cases, and change the checks to a symmetric
assertion (if got != c.want { t.Fatalf/... }) so both allowed and filtered
expectations are enforced; operate on the InterfaceFilter result (allow :=
InterfaceFilter(...); got := allow(c.name)) and use runtime.GOOS to gate
Windows-specific rows like "vEthernet (LAN)".
client/internal/stdnet/filter.go (1)

49-76: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Windows still false-positives adapters starting with lo.

The earlier Local Area Connection regression is still alive here: after strings.ToLower, both the dedicated HasPrefix(..., "lo") branch and the generic disallow-list pass will reject any Windows adapter whose name starts with lo. Guarding only the loopback branch is not enough unless the "lo" token is also skipped in the Windows-specific prefix filtering.

Proposed fix
 		// Linux/macOS loopback prefix ("lo", "lo0").
-		if strings.HasPrefix(lowerIFace, "lo") {
+		if runtime.GOOS != "windows" && strings.HasPrefix(lowerIFace, "lo") {
 			return false
 		}
@@
-			if sLower == "veth" && runtime.GOOS == "windows" {
+			if runtime.GOOS == "windows" && (sLower == "veth" || sLower == "lo") {
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/stdnet/filter.go` around lines 49 - 76, The code currently
rejects interfaces starting with "lo" both in the explicit loopback check
(lowerIFace and HasPrefix(..., "lo")) and again when iterating disallowList,
causing Windows adapters like "Local Area Connection" to be false-positive
filtered; update the loop that iterates disallowList (the block referencing
disallowList, sLower and runtime.GOOS) to skip the "lo" token on Windows
(similar to the existing veth guard) so that when runtime.GOOS == "windows" and
sLower == "lo" you continue without applying the Prefix check, leaving the
explicit loopback branch as the sole handler for true loopbacks.
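A self-contained sketch of the combined fix, with goos passed in explicitly so both branches can be exercised off-Windows (the real filter reads runtime.GOOS):

```go
package main

import (
	"fmt"
	"strings"
)

// allowIface reproduces the review's proposed logic: the "lo" loopback
// prefix check only applies off Windows, and on Windows the "veth" and
// "lo" disallow tokens are skipped so real adapters like
// "vEthernet (LAN)" and "Local Area Connection" pass.
func allowIface(goos, name string, disallow []string) bool {
	lower := strings.ToLower(name)
	// Linux/macOS loopback prefix ("lo", "lo0").
	if goos != "windows" && strings.HasPrefix(lower, "lo") {
		return false
	}
	for _, d := range disallow {
		dl := strings.ToLower(d)
		if goos == "windows" && (dl == "veth" || dl == "lo") {
			continue // tokens that false-positive Windows adapter names
		}
		if strings.HasPrefix(lower, dl) {
			return false
		}
	}
	return true
}

func main() {
	disallow := []string{"wt", "wg", "veth", "br-", "lo", "docker"}
	fmt.Println(allowIface("windows", "Local Area Connection", disallow)) // kept on Windows
	fmt.Println(allowIface("linux", "lo", disallow))                      // still filtered on Linux
}
```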
client/internal/lazyconn/manager/manager.go (1)

103-107: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only use the legacy inactivity manager when the deprecated field is actually the source.

This still falls back to inactivity.NewManager(...) whenever both effective timeouts resolve to zero. If the caller explicitly set the new fields to 0/0 and left InactivityThreshold unset, that restores phase-1 default timeout behavior instead of honoring the intended always-on two-timer config.

Suggested fix
 	if wgIface.IsUserspaceBind() {
 		iceTO, relayTO := config.resolvedTimeouts()
-		if iceTO == 0 && relayTO == 0 {
+		if config.InactivityThreshold != nil &&
+			config.ICEInactivityThreshold == 0 &&
+			config.RelayInactivityThreshold == 0 {
 			// Phase 1 / single-timer fallback when caller hasn't migrated.
 			m.inactivityManager = inactivity.NewManager(wgIface, config.InactivityThreshold) //nolint:staticcheck // intentional Phase-1 single-timer fallback
 		} else {
 			m.inactivityManager = inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO)
 		}
 	} else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/manager/manager.go` around lines 103 - 107, The
current fallback to inactivity.NewManager when iceTO==0 && relayTO==0
incorrectly triggers even if the caller explicitly set the new timeouts to 0/0;
change the logic so the legacy inactivity manager is only used when the
deprecated InactivityThreshold is the actual source. Modify
config.resolvedTimeouts() (or add a small accessor) to return a third boolean
like usedDeprecated (or expose whether InactivityThreshold was the source), then
in manager.go call that and only assign m.inactivityManager =
inactivity.NewManager(...) when usedDeprecated is true (keeping the current
wgIface and config.InactivityThreshold parameters); otherwise preserve the
two-timer always-on behavior when new fields were explicitly set to 0/0.
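One possible shape for the three-value resolvedTimeouts, sketched on a trimmed-down config (field names come from this comment; the real struct differs):

```go
package main

import "fmt"

// Trimmed stand-in: the pointer mirrors the deprecated single-timer
// field, the ints are the new two-timer fields.
type config struct {
	InactivityThreshold      *int // deprecated Phase-1 field
	ICEInactivityThreshold   int
	RelayInactivityThreshold int
}

// resolvedTimeouts returns the effective timeouts plus whether the
// deprecated field is the actual source, as the review suggests.
func (c config) resolvedTimeouts() (iceTO, relayTO int, usedDeprecated bool) {
	if c.InactivityThreshold != nil && c.ICEInactivityThreshold == 0 && c.RelayInactivityThreshold == 0 {
		return 0, 0, true // caller never migrated: legacy single-timer path
	}
	return c.ICEInactivityThreshold, c.RelayInactivityThreshold, false
}

func main() {
	legacy := 600
	_, _, dep := config{InactivityThreshold: &legacy}.resolvedTimeouts()
	fmt.Println(dep) // legacy caller: deprecated field is the source

	_, _, dep = config{}.resolvedTimeouts() // explicit 0/0, no deprecated field
	fmt.Println(dep)
}
```

The caller then picks inactivity.NewManager only when usedDeprecated is true, so an explicit 0/0 in the new fields keeps the two-timer manager.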
management/server/peer/peer.go (1)

207-224: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

isEmpty() still omits the new Phase 3.7i fields.

UpdateMetaIfNew returns early without applying the update when meta.isEmpty() returns true. Since isEmpty() ignores EffectiveConnectionMode, EffectiveRelayTimeoutSecs, EffectiveP2PTimeoutSecs, EffectiveP2PRetryMaxSecs, and SupportedFeatures, a peer carrying only these new runtime/capability fields would have its update silently dropped.

This was raised in a previous review iteration.

🐛 Proposed fix
 func (p PeerSystemMeta) isEmpty() bool {
 	return p.Hostname == "" &&
 		p.GoOS == "" &&
 		p.Kernel == "" &&
 		p.Core == "" &&
 		p.Platform == "" &&
 		p.OS == "" &&
 		p.OSVersion == "" &&
 		p.WtVersion == "" &&
 		p.UIVersion == "" &&
 		p.KernelVersion == "" &&
 		len(p.NetworkAddresses) == 0 &&
 		p.SystemSerialNumber == "" &&
 		p.SystemProductName == "" &&
 		p.SystemManufacturer == "" &&
 		p.Environment.Cloud == "" &&
 		p.Environment.Platform == "" &&
-		len(p.Files) == 0
+		len(p.Files) == 0 &&
+		p.EffectiveConnectionMode == "" &&
+		p.EffectiveRelayTimeoutSecs == 0 &&
+		p.EffectiveP2PTimeoutSecs == 0 &&
+		p.EffectiveP2PRetryMaxSecs == 0 &&
+		len(p.SupportedFeatures) == 0
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer/peer.go` around lines 207 - 224, The isEmpty() method
on PeerSystemMeta currently omits the Phase 3.7i runtime/capability fields so
UpdateMetaIfNew can drop updates that only contain those values; update
PeerSystemMeta.isEmpty() to include checks for EffectiveConnectionMode,
EffectiveRelayTimeoutSecs, EffectiveP2PTimeoutSecs, EffectiveP2PRetryMaxSecs,
and SupportedFeatures (e.g., ensure SupportedFeatures length == 0) alongside the
existing field checks so a meta containing only those new fields is not treated
as empty.
shared/management/http/api/openapi.yml (1)

365-405: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

connection_mode contract is still missing p2p-dynamic-lazy, and timeout docs are stale

Line 365 omits a valid mode from the API enum, and Lines 370-405 still document old phase behavior/mode applicability. This can cause client/schema rejection of server-valid values and misconfigure consumers.

Suggested OpenAPI patch
         connection_mode:
           x-experimental: true
           type: string
-          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
+          enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic, p2p-dynamic-lazy]
           nullable: true
           description: |
             Account-wide default peer-connection mode. NULL means
             "fall back to lazy_connection_enabled" for backwards compatibility.
-            Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-            functional. p2p-dynamic is reserved (passes through as p2p in
-            Phase 1; will become functional in Phase 2).
+            Includes relay-forced, p2p, p2p-lazy, p2p-dynamic, and
+            p2p-dynamic-lazy.
...
         p2p_timeout_seconds:
...
           description: |
             Default ICE-worker idle timeout in seconds. 0 = never tear down.
-            Effective only in p2p-dynamic mode (added in Phase 2).
+            Effective in dynamic modes (p2p-dynamic, p2p-dynamic-lazy).
             NULL means "use built-in default" (180 minutes).
...
         p2p_retry_max_seconds:
...
           description: |
             Maximum interval between P2P retry attempts after consecutive
             ICE failures, in seconds. Default 900 (= 15 min). Set to 0 to
-            disable backoff (always retry immediately, Phase-2 behavior).
-            Effective only in p2p-dynamic mode (added in Phase 3).
+            disable backoff (always retry immediately).
+            Effective in dynamic modes (p2p-dynamic, p2p-dynamic-lazy).
...
         relay_timeout_seconds:
...
           description: |
             Default relay-worker idle timeout in seconds. 0 = never tear
-            down. Effective in p2p-lazy and p2p-dynamic modes. Backwards-
+            down. Effective in modes that support relay idle teardown
+            (p2p-lazy and p2p-dynamic-lazy). Backwards-
             compat alias for NB_LAZY_CONN_INACTIVITY_THRESHOLD on the
             client. NULL means "use built-in default" (5 minutes).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 365 - 405, The OpenAPI
enum for connection_mode is missing the valid value "p2p-dynamic-lazy" and
several descriptions are stale; add "p2p-dynamic-lazy" to the enum for
connection_mode and update the descriptions for p2p_timeout_seconds,
p2p_retry_max_seconds, and relay_timeout_seconds to remove Phase/X-phase
language and explicitly state which concrete modes each field applies to (e.g.,
p2p_timeout_seconds and p2p_retry_max_seconds apply to p2p-dynamic and
p2p-dynamic-lazy; relay_timeout_seconds applies to p2p-lazy and
p2p-dynamic-lazy/p2p-dynamic as appropriate), and ensure nullable/default
semantics and examples remain accurate for those fields.
shared/management/client/grpc.go (2)

968-970: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always send capability/effective-config metadata, even when system.Info is unavailable.

When info == nil, this returns nil and drops SupportedFeatures plus the effective connection settings. On that path a current client becomes indistinguishable from a legacy one, which can trigger the wrong fallback mode.

Suggested fix
 func infoToMetaData(info *system.Info, eff EffectiveConnConfig) *proto.PeerSystemMeta {
-	if info == nil {
-		return nil
-	}
+	meta := &proto.PeerSystemMeta{
+		EffectiveConnectionMode:   eff.Mode,
+		EffectiveRelayTimeoutSecs: eff.RelayTimeoutSecs,
+		EffectiveP2PTimeoutSecs:   eff.P2PTimeoutSecs,
+		EffectiveP2PRetryMaxSecs:  eff.P2PRetryMaxSecs,
+		SupportedFeatures:         system.SupportedFeatures(),
+	}
+	if info == nil {
+		return meta
+	}
@@
-	return &proto.PeerSystemMeta{
+	meta.Hostname = info.Hostname
+	meta.GoOS = info.GoOS
+	meta.OS = info.OS
+	meta.Core = info.OSVersion
+	meta.OSVersion = info.OSVersion
+	meta.Platform = info.Platform
+	meta.Kernel = info.Kernel
+	meta.NetbirdVersion = info.NetbirdVersion
+	meta.UiVersion = info.UIVersion
+	meta.KernelVersion = info.KernelVersion
+	meta.NetworkAddresses = addresses
+	meta.SysSerialNumber = info.SystemSerialNumber
+	meta.SysManufacturer = info.SystemManufacturer
+	meta.SysProductName = info.SystemProductName
+	meta.Environment = &proto.Environment{
+		Cloud:    info.Environment.Cloud,
+		Platform: info.Environment.Platform,
+	}
+	meta.Files = files
+	meta.Flags = &proto.Flags{
+		RosenpassEnabled:      info.RosenpassEnabled,
+		RosenpassPermissive:   info.RosenpassPermissive,
+		ServerSSHAllowed:      info.ServerSSHAllowed,
+		DisableClientRoutes:   info.DisableClientRoutes,
+		DisableServerRoutes:   info.DisableServerRoutes,
+		DisableDNS:            info.DisableDNS,
+		DisableFirewall:       info.DisableFirewall,
+		BlockLANAccess:        info.BlockLANAccess,
+		BlockInbound:          info.BlockInbound,
+		LazyConnectionEnabled: info.LazyConnectionEnabled,
+	}
+	return meta
-		Hostname:         info.Hostname,
-		GoOS:             info.GoOS,
-		OS:               info.OS,
-		Core:             info.OSVersion,
-		OSVersion:        info.OSVersion,
-		Platform:         info.Platform,
-		Kernel:           info.Kernel,
-		NetbirdVersion:   info.NetbirdVersion,
-		UiVersion:        info.UIVersion,
-		KernelVersion:    info.KernelVersion,
-		NetworkAddresses: addresses,
-		SysSerialNumber:  info.SystemSerialNumber,
-		SysManufacturer:  info.SystemManufacturer,
-		SysProductName:   info.SystemProductName,
-		Environment: &proto.Environment{
-			Cloud:    info.Environment.Cloud,
-			Platform: info.Environment.Platform,
-		},
-		Files: files,
-		Flags: &proto.Flags{
-			RosenpassEnabled:    info.RosenpassEnabled,
-			RosenpassPermissive: info.RosenpassPermissive,
-			ServerSSHAllowed:    info.ServerSSHAllowed,
-			DisableClientRoutes: info.DisableClientRoutes,
-			DisableServerRoutes: info.DisableServerRoutes,
-			DisableDNS:          info.DisableDNS,
-			DisableFirewall:     info.DisableFirewall,
-			BlockLANAccess:      info.BlockLANAccess,
-			BlockInbound:        info.BlockInbound,
-			LazyConnectionEnabled: info.LazyConnectionEnabled,
-		},
-		EffectiveConnectionMode:   eff.Mode,
-		EffectiveRelayTimeoutSecs: eff.RelayTimeoutSecs,
-		EffectiveP2PTimeoutSecs:   eff.P2PTimeoutSecs,
-		EffectiveP2PRetryMaxSecs:  eff.P2PRetryMaxSecs,
-		SupportedFeatures:         system.SupportedFeatures(),
-	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/client/grpc.go` around lines 968 - 970, The function
infoToMetaData should never return nil so capability and effective-config
metadata are always sent; modify infoToMetaData (and its use of
proto.PeerSystemMeta and EffectiveConnConfig) to always allocate and return a
&proto.PeerSystemMeta populated with SupportedFeatures (use
info.SupportedFeatures when info != nil, otherwise an empty/zero-value slice),
and fill in the EffectiveConnConfig-derived fields from eff so the client is
never mistaken for a legacy client even when info == nil.

483-492: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run the snapshot callback after the sync frame has been applied.

SyncResponse can carry ordinary peer-state updates and SnapshotRequest in the same frame. Calling cb() before msgHandler(decryptedResp) lets the client publish a snapshot from stale state, and it still fires when msgHandler rejects the frame.

Suggested ordering fix
-		if req := decryptedResp.GetSnapshotRequest(); req != nil {
-			c.snapMu.Lock()
-			cb := c.onSnapshotRequest
-			c.snapMu.Unlock()
-			if cb != nil {
-				cb(req.GetNonce())
-			}
-		}
-
 		if err := msgHandler(decryptedResp); err != nil {
 			log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
+			continue
+		}
+
+		if req := decryptedResp.GetSnapshotRequest(); req != nil {
+			c.snapMu.Lock()
+			cb := c.onSnapshotRequest
+			c.snapMu.Unlock()
+			if cb != nil {
+				cb(req.GetNonce())
+			}
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/client/grpc.go` around lines 483 - 492, The snapshot
callback is invoked before the sync frame is applied; change the order so you
first call msgHandler(decryptedResp) and only if it returns nil then check
decryptedResp.GetSnapshotRequest(), acquire c.snapMu to read c.onSnapshotRequest
into a local cb, release the lock, and if cb != nil call cb(req.GetNonce());
ensure the callback is not invoked when msgHandler rejects the frame and keep
using the same symbols: decryptedResp.GetSnapshotRequest(),
msgHandler(decryptedResp), c.onSnapshotRequest, and req.GetNonce().
management/server/peer.go (2)

1329-1355: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return an auth-level error here instead of status.Internal.

“No access to this peer” is an expected authorization result on this path. Returning Internal turns a normal denial into a 500-class failure for callers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` around lines 1329 - 1355, The function
DefaultAccountManager.checkIfUserOwnsPeer treats a normal "no access" outcome as
an internal error; replace the final return that uses status.Internal with an
authorization-level gRPC error such as status.PermissionDenied (or
status.Unauthenticated if appropriate) so callers receive a 4xx auth error
instead of a 500; update the error message text remain the same but use
status.PermissionDenied(fmt.Sprintf(...)) in the return from
checkIfUserOwnsPeer.

1361-1368: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t leak cross-account peer existence from GetPeerByPubKey.

This lookup is global by WireGuard pubkey, so the account check is the tenant boundary. Returning a distinct fmt.Errorf here both confirms that the peer exists in some other account and makes the failure hard for REST handlers to classify correctly. Both “not found” and “belongs to another account” should collapse to the same typed not-found response.

Suggested fix
 func (am *DefaultAccountManager) GetPeerByPubKey(ctx context.Context, accountID, pubKey string) (*nbpeer.Peer, error) {
 	p, err := am.Store.GetPeerByPeerPubKey(ctx, store.LockingStrengthNone, pubKey)
 	if err != nil {
 		return nil, err
 	}
 	if p.AccountID != accountID {
-		return nil, fmt.Errorf("peer with pubkey %s not in account %s", pubKey, accountID)
+		return nil, status.Errorf(status.NotFound, "peer not found")
 	}
 	return p, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer.go` around lines 1361 - 1368, GetPeerByPubKey
currently returns a distinct fmt.Errorf when the found peer belongs to a
different account, leaking existence across tenants; change
DefaultAccountManager.GetPeerByPubKey so that if p.AccountID != accountID it
returns the same typed "not found" error used by Store.GetPeerByPeerPubKey
(e.g., the store's ErrNotFound or the same error value/type the store returns)
instead of a new fmt.Errorf, so REST handlers cannot distinguish "exists in
another account" from "not found."
client/internal/conn_mgr.go (2)

297-300: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The live mode/timeout getters still race with UpdatedRemotePeerConfig.

These fields are written here without synchronization and then exposed through Mode(), RelayTimeout(), P2pTimeout(), and P2pRetryMax() to other goroutines. The UI/daemon-RPC paths and timer-driven readers can still observe a real data race.

Also applies to: 753-778

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 297 - 300, The assignments to
e.mode, e.relayTimeoutSecs, e.p2pTimeoutSecs and e.p2pRetryMaxSecs in
UpdatedRemotePeerConfig are unsynchronized and race with the readers Mode(),
RelayTimeout(), P2pTimeout(), and P2pRetryMax(); fix this by protecting these
writes and reads with a shared lock: add (or reuse) a sync.RWMutex on the conn
manager struct, wrap the block in UpdatedRemotePeerConfig that sets
e.mode/e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs with
mu.Lock()/defer mu.Unlock(), and change the getters Mode(), RelayTimeout(),
P2pTimeout(), P2pRetryMax() to use mu.RLock()/mu.RUnlock() (or use atomic
operations if the fields are simple scalars), ensuring all accesses use the same
mutex.

287-340: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart/live-reconfigure the lazy manager when only the inactivity timeouts change.

initLazyManager() snapshots relayTimeoutSecs and p2pTimeoutSecs into manager.Config once. This path only tears the manager down on modeChanged, so a management push that changes just the timeout fields leaves the running manager on stale thresholds until a later mode flip or daemon restart.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 287 - 340, The lazy manager is not
restarted when only inactivity timeout values change, leaving it running with
stale relay/p2p thresholds; after computing newRelay/newP2P/newP2pRetry and
updating e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs (and calling
e.propagateP2pRetryMaxToConns()), detect when modeUsesLazyMgr(newMode) is true
and e.lazyConnMgr != nil but timeouts or retry max actually changed (i.e., mode
didn't change but newRelay/newP2P/newP2pRetry differ from previous stored
values) and then restart the manager by calling e.closeManager(ctx),
e.initLazyManager(ctx), e.startModeSideEffects(), and (if needed per current
behavior) return e.resetPeersToLazyIdle(ctx) so the running manager picks up the
new timeouts immediately.
client/ui/peers_tab.go (1)

60-98: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep render() off the Fyne event thread.

showFull.OnChanged and the exported Refresh callback still run the blocking getSrvClient/Status path synchronously, so a slow daemon can freeze the whole window for up to 5 seconds. Kick off the fetch in a goroutine and keep only the widget mutations inside fyne.Do(...).

Proposed fix
-	showFull.OnChanged = func(_ bool) { render() }
+	asyncRender := func() { go render() }
+	showFull.OnChanged = func(_ bool) { asyncRender() }
@@
-	return peersTabBundle{Content: content, ShowFull: showFull, Refresh: render}
+	return peersTabBundle{Content: content, ShowFull: showFull, Refresh: asyncRender}

Also applies to: 126-126

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/peers_tab.go` around lines 60 - 98, The render function currently
performs blocking work (getSrvClient, conn.Status) on the Fyne event thread;
move the network/status fetch off the UI thread by invoking render in a new
goroutine and ensure only UI mutations are executed inside fyne.Do. Concretely:
keep the body of render the same but call it via go render() from
showFull.OnChanged and from the exported Refresh callback (and any other places
that call render), and ensure all calls to getSrvClient, context.WithTimeout,
conn.Status, sorting and building of peer data happen outside fyne.Do while only
summary.SetText, breakdown.SetText, listVBox.Add/Refresh and other widget
updates occur inside fyne.Do.
shared/management/http/api/types.gen.go (1)

41-63: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

p2p-dynamic-lazy is still missing from the public connection-mode enum.

The PR contract now includes five modes, but generated clients from this schema can still only represent four. Anything relying on AccountSettingsConnectionMode.Valid() will reject p2p-dynamic-lazy, so the HTTP API still cannot round-trip the full mode set. Please update the OpenAPI source and regenerate this file rather than patching the generated output directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/types.gen.go` around lines 41 - 63, The generated
enum AccountSettingsConnectionMode and its Valid() method are missing the
"p2p-dynamic-lazy" constant, so Valid() will reject that mode; update the
OpenAPI schema to include the new enum value (so the generator emits a new
constant, e.g., AccountSettingsConnectionModeP2pDynamicLazy) and then re-run the
codegen to regenerate shared/management/http/api/types.gen.go rather than
hand-editing the generated file; ensure the generated Valid() switch in
AccountSettingsConnectionMode includes the new constant name so
"p2p-dynamic-lazy" is accepted.
management/server/http/handlers/accounts/accounts_handler.go (1)

231-283: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return InvalidArgument for these new validation failures.

These branches still use fmt.Errorf, so bad connection_mode / timeout input reaches util.WriteError as a generic error and is likely surfaced as a 5xx instead of a 4xx validation response. Use status.Errorf(status.InvalidArgument, ...) here and inside validateUint32Timeout.

Minimal fix
 	if req.Settings.ConnectionMode != nil {
 		modeStr := string(*req.Settings.ConnectionMode)
 		if !req.Settings.ConnectionMode.Valid() {
-			return nil, fmt.Errorf("invalid connection_mode %q", modeStr)
+			return nil, status.Errorf(status.InvalidArgument, "invalid connection_mode %q", modeStr)
 		}
@@
 		v := *req.Settings.LegacyLazyFallbackTimeoutSeconds
 		if v < 60 || v > 86400 {
-			return nil, fmt.Errorf("invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
+			return nil, status.Errorf(status.InvalidArgument, "invalid legacy_lazy_fallback_timeout_seconds %d (must be between 60 and 86400)", v)
 		}
 		returnSettings.LegacyLazyFallbackTimeoutSeconds = uint32(v)
 	}
@@
 func validateUint32Timeout(name string, v int64) (uint32, error) {
 	if v < 0 {
-		return 0, fmt.Errorf("invalid %s: %d (must be >= 0)", name, v)
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (must be >= 0)", name, v)
 	}
 	if v > int64(math.MaxUint32) {
-		return 0, fmt.Errorf("invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
+		return 0, status.Errorf(status.InvalidArgument, "invalid %s: %d (exceeds %d)", name, v, uint64(math.MaxUint32))
 	}
 	return uint32(v), nil
 }

Also applies to: 511-517

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
231 - 283, Change validation errors from plain fmt.Errorf to gRPC
status.InvalidArgument errors so they are returned as 4xx validation responses:
replace fmt.Errorf(...) in the ConnectionMode validation branch (where it
currently returns invalid connection_mode) with
status.Errorf(status.InvalidArgument, ...), and update validateUint32Timeout to
return status.Errorf(status.InvalidArgument, ...) for out-of-range timeout
errors; ensure the callers in the branches that handle P2pTimeoutSeconds,
P2pRetryMaxSeconds, RelayTimeoutSeconds check and propagate the new error type
unchanged. Also convert the range-check error for
LegacyLazyFallbackTimeoutSeconds (the fmt.Errorf(...) that checks v < 60 || v >
86400) to status.Errorf(status.InvalidArgument, ...) so that
legacy_lazy_fallback_timeout_seconds validation also yields InvalidArgument.
client/ui/network.go (1)

97-103: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear the tab being left, not the newly selected tab.

OnUnselected resolves the grid through tabs.Selected(), so a tab switch can wipe the wrong container. Leaving Exit-node networks for Peers still falls through the default branch and clears allGrid.

Suggested fix
 	tabs.OnUnselected = func(item *container.TabItem) {
 		// Only reset network grids when leaving a network tab; the
 		// peers VBox manages its own state.
 		if item != nil && item.Text != peersText {
-			grid, _ := getGridAndFilterFromTab(tabs, allGrid, overlappingGrid, exitNodeGrid)
-			grid.Objects = nil
+			switch item.Text {
+			case allNetworksText:
+				allGrid.Objects = nil
+			case overlappingNetworksText:
+				overlappingGrid.Objects = nil
+			case exitNodeNetworksText:
+				exitNodeGrid.Objects = nil
+			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/network.go` around lines 97 - 103, The code clears the wrong grid
because getGridAndFilterFromTab is effectively using tabs.Selected(); change the
logic to determine and clear the grid based on the unselected tab parameter
(item) instead of the currently selected tab: in the tabs.OnUnselected handler
use item (and item.Text) to pick the correct grid (via getGridAndFilterFromTab
or a new small helper that accepts the TabItem or tab text) and then clear only
that grid (e.g., call getGridAndFilterFromTab(item, allGrid, overlappingGrid,
exitNodeGrid) or map item.Text to the matching grid) so leaving "Exit-node
networks" won't clear allGrid or other unrelated grids; keep peersText, allGrid,
overlappingGrid and exitNodeGrid references intact.
management/server/http/handlers/accounts/accounts_handler_test.go (1)

418-419: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the struct literal indentation in the new test.

Lines 418-419 are still over-indented relative to the surrounding fields, so gofmt will rewrite this hunk and can trip formatting gates.

Proposed fix
-				LegacyLazyFallbackEnabled:      br(true),
-				LegacyLazyFallbackTimeoutSeconds: ir(3600),
+		LegacyLazyFallbackEnabled:      br(true),
+		LegacyLazyFallbackTimeoutSeconds: ir(3600),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler_test.go` around
lines 418 - 419, The struct literal in accounts_handler_test.go has two
over-indented fields (LegacyLazyFallbackEnabled and
LegacyLazyFallbackTimeoutSeconds using br(...) and ir(...)) which will be
changed by gofmt; fix by aligning these fields' indentation with the other
fields in the surrounding composite literal so their leading tabs/spaces match
the surrounding entries (move LegacyLazyFallbackEnabled: br(true), and
LegacyLazyFallbackTimeoutSeconds: ir(3600), left to the same column as adjacent
fields).
management/server/peer_connections/store.go (1)

141-153: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expire the cached entry before the nonce gate.

GetWithNonceCheck returns on the since check before it evaluates TTL, so refresh-path polling can keep expired maps resident indefinitely instead of evicting them.

Proposed fix
 func (s *MemoryStore) GetWithNonceCheck(peerPubKey string, since uint64) (*mgmProto.PeerConnectionMap, bool) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	e, ok := s.maps[peerPubKey]
 	if !ok {
 		return nil, false
 	}
-	if since > 0 && e.m.GetInResponseToNonce() < since {
-		return nil, false
-	}
 	if s.clock.Now().Sub(e.updatedAt) > s.ttl {
 		delete(s.maps, peerPubKey)
 		return nil, false
 	}
+	if since > 0 && e.m.GetInResponseToNonce() < since {
+		return nil, false
+	}
 	return proto.Clone(e.m).(*mgmProto.PeerConnectionMap), true
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/peer_connections/store.go` around lines 141 - 153,
GetWithNonceCheck currently checks the nonce gate before TTL so expired entries
can persist; in MemoryStore.GetWithNonceCheck, move the TTL eviction check
(compare s.clock.Now().Sub(e.updatedAt) > s.ttl, delete from s.maps and return
false) to occur immediately after retrieving e (before evaluating
e.m.GetInResponseToNonce()), so expired entries are removed and not allowed to
pass the nonce check.
management/internals/shared/grpc/conversion.go (1)

315-326: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only mark liveness as authoritative when status is present.

cfg.ServerLivenessKnown = true currently runs even when rPeer.Status is nil, which turns “no status data” into “server says offline.” New clients will trust LiveOnline=false instead of falling back to the last-seen heuristic.

Proposed fix
 		if rPeer.Status != nil {
 			if !rPeer.Status.LastSeen.IsZero() {
 				cfg.LastSeenAtServer = timestamppb.New(rPeer.Status.LastSeen)
 			}
 			cfg.LiveOnline = rPeer.Status.Connected
+			cfg.ServerLivenessKnown = true
 		}
-		cfg.ServerLivenessKnown = true
 		dst = append(dst, cfg)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/conversion.go` around lines 315 - 326, The
code unconditionally sets cfg.ServerLivenessKnown = true even when rPeer.Status
is nil, causing clients to trust LiveOnline when no status exists; change the
logic so cfg.ServerLivenessKnown is only set to true inside the rPeer.Status !=
nil branch (i.e., after you nil-guard and assign cfg.LiveOnline /
LastSeenAtServer), leaving it false/default when rPeer.Status is nil so clients
fall back to the LastSeen heuristic; adjust the block around rPeer.Status,
cfg.LiveOnline and cfg.LastSeenAtServer accordingly.
client/internal/engine_offline_debounce_test.go (1)

77-97: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Lock the debounce map around these assertions.

These len(e.peerOfflineDebounce) reads bypass peerOfflineDebounceMu, so the tests can still race with timer-map mutation and flap under -race.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine_offline_debounce_test.go` around lines 77 - 97, The
tests read len(e.peerOfflineDebounce) without acquiring the debounce mutex,
causing races; wrap those assertions in the engine's peerOfflineDebounceMu lock.
Specifically, in TestCancelRemoteOfflineClose_OnAbsentPeer_NoOp and
TestCancelAllRemoteOfflineCloses_ClearsEverything acquire
e.peerOfflineDebounceMu before checking len(e.peerOfflineDebounce) (and release
after) so the reads are synchronized with
scheduleRemoteOfflineClose/cancelAllRemoteOfflineCloses/cancelRemoteOfflineClose
which mutate the map.
client/internal/peer/conn_lazy_keepwgpeer_test.go (1)

108-115: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Only skip full-line comments here.

strings.Contains(trim, "// ") also filters out real conn.Close(...) // ... lines, so a regressed 1-arg call with an inline comment would never be checked.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_lazy_keepwgpeer_test.go` around lines 108 - 115,
The condition that skips comment lines uses strings.Contains(trim, "// "), which
also matches inline comments like "conn.Close(...) // ..." and hides real 1-arg
calls; change that check to only skip full-line comments by using
strings.HasPrefix(trim, "//") (i.e., update the switch condition that references
trim alongside lazyConnMgr.Close, activityManager.Close, im.Close,
peerStore.Close, conn.Close( and peerConn.Close( to use HasPrefix instead of
Contains).
client/internal/peer/ice_backoff.go (1)

202-210: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear active suspension when max backoff is disabled.

If a peer is already suspended and management pushes 0, this leaves s.suspended/s.nextRetry intact, so retries stay blocked until the old deadline even though backoff is now off.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/ice_backoff.go` around lines 202 - 210, SetMaxBackoff
currently updates s.maxBackoff and s.bo but doesn't clear an active suspension,
so if management sets max backoff to 0 while a peer is suspended the peer
remains blocked by s.suspended/s.nextRetry. In iceBackoffState.SetMaxBackoff,
after updating s.maxBackoff and s.bo when d == 0 clear the suspension state by
setting s.suspended = false and resetting s.nextRetry to the zero time
(time.Time{}); reference the SetMaxBackoff method, iceBackoffState type, and
fields s.suspended and s.nextRetry so the change unblocks retries immediately
when backoff is disabled.
client/internal/conn_state_pusher.go (3)

151-162: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The latest-nonce coalescing contract isn't actually enforced.

Older nonces can stay queued in snapshotReq, and the loop flushes the first pending value it reads. Under back-to-back requests, the caller can miss its own InResponseToNonce.

Also applies to: 243-246

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 151 - 162,
OnSnapshotRequest currently may leave older nonces in p.snapshotReq so the
pusher can flush a stale value; change OnSnapshotRequest (and the analogous code
at the other occurrence) to coalesce by first draining any pending values from
p.snapshotReq in a non-blocking loop and then send the latest nonce non-blocking
— i.e., if p is nil return, then repeatedly try a non-blocking receive from
p.snapshotReq to discard old nonces, and finally perform a non-blocking send of
the new nonce so the channel only ever retains the most recent value (use the
existing symbol p.snapshotReq and function connStatePusher.OnSnapshotRequest and
replicate the same drain-then-send logic at the referenced second spot).

115-122: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't ignore rand.Read failures in newSessionID.

On error, b stays zeroed and this loop can spin forever during pusher construction. Propagate the error or use a bounded fallback instead of retrying blindly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 115 - 122, newSessionID
currently ignores errors from rand.Read and can loop forever if Read fails;
change newSessionID to return (uint64, error) (or an error alongside the ID) and
propagate the rand.Read error instead of retrying blindly, or implement a
bounded retry with a deterministic fallback; update callers that use
newSessionID (e.g., pusher/session construction) to handle the returned error
and fail fast if session ID generation fails.

291-296: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid context.Background() for push calls.

A hung Push blocks the loop goroutine, which means <-p.stop is never serviced and Stop() can stall forever in wg.Wait(). Use a cancelable lifecycle context or a per-call timeout for both delta and full pushes.

Also applies to: 361-367

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_state_pusher.go` around lines 291 - 296, The Push calls
currently use context.Background() (see p.sink.Push invocations) which can hang
and prevent Stop() from returning; replace those with a cancelable context
derived from the p lifecycle context (e.g., p.ctx or p.lifecycleCtx) or wrap
each call in context.WithTimeout(p.ctx, pushTimeout) and pass that to
p.sink.Push for both delta and full pushes (the call at p.sink.Push near seq and
the similar one at the later block), and ensure Stop() cancels the lifecycle
context so blocked Pushes are unblocked and wg.Wait() can complete.
client/internal/peer/status.go (3)

296-307: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Capture the conn-state listener before unlocking.

notifyConnStateChange() is still written as a lock-held helper, but these update paths call it only after d.mux.Unlock(). That leaves the read of d.connStateListener racing with SetConnStateListener(nil/...), so the new callback plumbing is still not thread-safe.

Also applies to: 470-473, 696-699, 764-767, 812-815, 863-866

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 296 - 307, notifyConnStateChange
currently reads d.connStateListener while callers may have already released
d.mux, creating a race with SetConnStateListener; fix by capturing the listener
while holding the lock and passing that captured value into the post-unlock
closure. Concretely, either (A) change notifyConnStateChange to accept a
listener parameter (e.g., notifyConnStateChange(listener func(string, State),
peerPubKey string, peerState State) func()) and have callers read listener :=
d.connStateListener while d.mux is held and then call the new helper after
unlock, or (B) update all call sites to read listener := d.connStateListener
while holding d.mux and then call the existing notifyConnStateChange-like
closure creator with that captured listener; ensure SetConnStateListener races
are eliminated by always reading the listener under d.mux before unlocking.

309-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't fire peerListChanged while holding d.mux.

UpdatePeerRemoteMeta() still calls d.notifyPeerListChanged() before releasing d.mux. Since the notifier can re-enter Status, this path can still deadlock on remote-meta-only updates even though the rest of the file generally snapshots under lock and notifies afterwards.

Also applies to: 520-567

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 309 - 320, notifyPeerListChanged
currently calls d.notifier.peerListChanged while still holding d.mux, which can
deadlock because the notifier may re-enter Status; change the pattern to
snapshot the required state under the lock (e.g., capture count :=
d.numOfPeers() or any other needed snapshot of d.peers/d.offlinePeers) and then
release d.mux before calling d.notifier.peerListChanged(count). Apply the same
fix to the other path referenced (the UpdatePeerRemoteMeta-related block around
the second region) so no notifier calls occur while d.mux is held.

322-336: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Snapshot router state while d.mux is still held.

notifyPeerStateChangeListeners() still assumes d.mux is held because it reads d.peers and d.changeNotify through snapshotRouterPeersLocked(). The ICE and relay paths now call it after unlock, so those reads can still race with writers and hit the same concurrent-map bug that was previously reported.

Also applies to: 692-694, 760-762

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 322 - 336,
notifyPeerStateChangeListeners currently calls snapshotRouterPeersLocked which
reads d.peers and d.changeNotify and therefore must be executed while d.mux is
held; some ICE/relay code paths call notifyPeerStateChangeListeners after
unlocking and can race with writers. Fix by ensuring the snapshot happens under
the lock: either (A) call notifyPeerStateChangeListeners while still holding
d.mux in the ICE/relay paths (the callers referenced around the other
occurrences), or (B) change notifyPeerStateChangeListeners to acquire d.mux
internally before calling snapshotRouterPeersLocked and release it immediately
after taking the snapshot, then spawn dispatchRouterPeers; apply the same
pattern for all call sites (including the ones at the other noted locations) so
snapshotRouterPeersLocked is always invoked with d.mux held.
management/internals/shared/grpc/server.go (1)

445-447: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Nil out snapshotCh after closure to avoid a tight spin.

When SnapshotRouter.Register() replaces a stream, the old snapshotCh can be closed. This branch still does continue on !ok, so the closed receive stays immediately ready and the goroutine busy-loops until ctx.Done().

Possible minimal fix
 		case nonce, ok := <-snapshotCh:
 			if !ok {
-				continue
+				snapshotCh = nil
+				continue
 			}

Also applies to: 497-500

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/server.go` around lines 445 - 447, The
current receive loop uses snapshotCh from
snapshotRouter.Register(peerKey.String()) and on detecting a closed channel (ok
== false) it does continue, which causes a closed channel to remain immediately
readable and the goroutine to busy-spin; fix this by setting snapshotCh = nil
(or otherwise disabling reads) when you detect !ok so the select no longer reads
the closed channel, and keep the existing defer
s.snapshotRouter.Unregister(peerKey.String(), snapshotCh); apply the same niling
change to the analogous receive logic referenced around the other occurrence
(lines ~497-500) to prevent the same tight loop.
client/internal/engine.go (1)

2765-2783: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard the timer's ConnMgr read with the same lock as the writer.

The debounce timer still reads e.connMgr.Mode() outside syncMsgMux, while updateNetworkMap() updates the connection-manager config under that mutex. This is the same unsynchronized read/write pair that was flagged earlier, so the callback can still race with mode changes or shutdown.

Possible minimal fix
 	t := time.AfterFunc(remoteOfflineGracePeriod, func() {
 		e.peerOfflineDebounceMu.Lock()
 		delete(e.peerOfflineDebounce, pubKey)
 		e.peerOfflineDebounceMu.Unlock()
@@
 		if e.ctx == nil || e.ctx.Err() != nil {
 			return
 		}
-		if e.connMgr == nil || e.connMgr.Mode() != connectionmode.ModeP2PDynamic {
+		e.syncMsgMux.Lock()
+		connMgr := e.connMgr
+		mode := connectionmode.ModeUnspecified
+		if connMgr != nil {
+			mode = connMgr.Mode()
+		}
+		e.syncMsgMux.Unlock()
+		if connMgr == nil || mode != connectionmode.ModeP2PDynamic {
 			return
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine.go` around lines 2765 - 2783, The debounce timer
callback is reading e.connMgr.Mode() without holding the same syncMsgMux used by
updateNetworkMap(), creating a race; modify the callback to acquire the same
syncMsgMux (use the identical lock used in updateNetworkMap(), e.g.
e.syncMsgMux.Lock()/Unlock() or RLock()/RUnlock()) around all accesses to
e.connMgr and any related peer state checks so the read is synchronized with
writers, taking care to acquire locks in the same order as other code paths to
avoid deadlocks and release the lock before returning.
client/internal/peer/conn.go (1)

1171-1218: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Revalidate conn state inside AttachICE().

AttachICEOnRelayActivity() still drops conn.mu before calling AttachICE(), and AttachICE() still never checks opened/ctx.Err() after reacquiring the lock. A concurrent Close() can therefore finish in between, and this path will still reattach ICE and send a fresh offer on a torn-down conn.

Possible minimal fix
 func (conn *Conn) AttachICE() error {
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
 
+	if !conn.opened || conn.ctx == nil || conn.ctx.Err() != nil {
+		return fmt.Errorf("AttachICE: connection is closed")
+	}
+
 	if conn.iceBackoff != nil && conn.iceBackoff.IsSuspended() {
 		snap := conn.iceBackoff.Snapshot()
 		conn.Log.Debugf("ICE backoff active (failure #%d, retry at %s), staying on relay",
 			snap.Failures,
 			snap.NextRetry.Format("15:04:05"))

Also applies to: 1280-1305

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 1171 - 1218, AttachICE can be run
after conn.mu is dropped in AttachICEOnRelayActivity, so Add state revalidation
at the top of AttachICE immediately after reacquiring conn.mu: check that
conn.opened is true, ctx.Err() is nil (or context still active), and other
preconditions that AttachICEOnRelayActivity enforces (e.g.,
conn.currentConnPriority == conntype.Relay and handshaker/readICEListener still
indicate attachment is valid); if any check fails, return early with a harmless
error. Ensure these checks reference the existing symbols AttachICE, conn.mu,
conn.opened, conn.ctx (or context), conn.currentConnPriority and handshaker so
concurrent Close() or state changes don’t cause ICE to be attached on a
torn-down connection.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 557b44c6-66de-483f-9506-4870e6b21175

📥 Commits

Reviewing files that changed from the base of the PR and between b082536 and e8a298f.

⛔ Files ignored due to path filters (2)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (60)
  • .gitignore
  • client/android/client.go
  • client/android/peer_notifier.go
  • client/android/preferences.go
  • client/android/preferences_clamp_test.go
  • client/iface/bind/activity.go
  • client/iface/device/endpoint_manager.go
  • client/internal/conn_mgr.go
  • client/internal/conn_state_pusher.go
  • client/internal/conn_state_pusher_material_test.go
  • client/internal/conn_state_pusher_test.go
  • client/internal/conn_state_pusher_testhelper_test.go
  • client/internal/engine.go
  • client/internal/engine_offline_debounce_test.go
  • client/internal/lazyconn/activity/listener_bind_test.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/peer/conn.go
  • client/internal/peer/conn_handover_order_test.go
  • client/internal/peer/conn_lazy_keepwgpeer_test.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/guard/guard_test.go
  • client/internal/peer/guard/ice_retry_state_test.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/status.go
  • client/internal/peer/status_debounce_test.go
  • client/internal/peer/status_remote_meta_notify_test.go
  • client/internal/peerstore/store.go
  • client/internal/profilemanager/config.go
  • client/internal/stdnet/filter.go
  • client/internal/stdnet/filter_test.go
  • client/proto/daemon.proto
  • client/system/features.go
  • client/system/features_test.go
  • client/ui/network.go
  • client/ui/peers_tab.go
  • docs/bugs/2026-05-04-user-peer-visibility-regression.md
  • docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
  • management/internals/shared/grpc/conversion.go
  • management/internals/shared/grpc/conversion_test.go
  • management/internals/shared/grpc/server.go
  • management/server/account.go
  • management/server/activity/codes.go
  • management/server/http/handlers/accounts/accounts_handler.go
  • management/server/http/handlers/accounts/accounts_handler_test.go
  • management/server/http/handlers/accounts/validate_uint32_timeout_test.go
  • management/server/peer.go
  • management/server/peer/peer.go
  • management/server/peer_connections/store.go
  • management/server/peer_connections/store_test.go
  • management/server/peer_test.go
  • management/server/store/file_store.go
  • management/server/store/sql_store.go
  • management/server/types/settings.go
  • management/server/types/settings_ptr_equal_test.go
  • management/server/types/settings_test.go
  • shared/management/client/grpc.go
  • shared/management/http/api/openapi.yml
  • shared/management/http/api/types.gen.go
  • shared/management/proto/management.proto
🚧 Files skipped from review as they are similar to previous changes (4)
  • management/server/peer_connections/store_test.go
  • management/server/http/handlers/accounts/validate_uint32_timeout_test.go
  • management/server/store/sql_store.go
  • docs/bugs/2026-05-04-user-peer-visibility-regression.md

Comment thread client/internal/engine.go
Comment on lines +363 to +366
if e.connStatePusher != nil {
e.connStatePusher.Stop()
e.connStatePusher = nil
}
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize connStatePusher access across shutdown and callbacks.

The nil-checks prevent a panic, but Stop() still writes e.connStatePusher = nil while the status/snapshot handlers read the same field without any synchronization. That is still a real data race during shutdown, and it can surface as torn reads or calls into a half-stopped pusher under -race.

Use an atomic.Pointer, or unregister/replace the handlers before clearing the field so reads and writes use the same synchronization boundary.

Also applies to: 671-683

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/engine.go` around lines 363 - 366, The code has a data race
on the field connStatePusher: handlers (status/snapshot handlers) read
e.connStatePusher while shutdown calls Stop() and sets e.connStatePusher = nil
without synchronization. Fix by making access atomic or by
unregistering/replacing handlers before clearing the field: either convert
connStatePusher to an atomic.Pointer (use atomic.LoadPointer/StorePointer when
handlers read/write and call Stop via an atomic swap) or ensure shutdown first
unregisters the status/snapshot handlers (so no concurrent readers), then call
connStatePusher.Stop() and set the field to nil; update all reads in the
status/snapshot handler to use the same atomic/mutex-protected access. Ensure
references to connStatePusher and Stop() are the only sites changed.
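The swap-then-stop pattern the review suggests can be sketched in isolation. This is a minimal illustration of the `atomic.Pointer` approach, not the actual engine code; the `pusher`/`engine` types here are stand-ins for `connStatePusher` and the real `Engine`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pusher stands in for the real connStatePusher.
type pusher struct{ stopped bool }

func (p *pusher) Stop() { p.stopped = true }

type engine struct {
	connStatePusher atomic.Pointer[pusher]
}

// onStatusEvent is what a status/snapshot handler would do: load the
// pointer once, then act only on a non-nil snapshot.
func (e *engine) onStatusEvent() bool {
	if p := e.connStatePusher.Load(); p != nil {
		_ = p // push the update via p
		return true
	}
	return false
}

// shutdown swaps the pointer to nil first, so no new handler can grab
// the pusher, then stops the instance it swapped out.
func (e *engine) shutdown() {
	if p := e.connStatePusher.Swap(nil); p != nil {
		p.Stop()
	}
}

func main() {
	e := &engine{}
	e.connStatePusher.Store(&pusher{})
	fmt.Println(e.onStatusEvent()) // true: pusher present
	e.shutdown()
	fmt.Println(e.onStatusEvent()) // false: handlers see nil after the swap
}
```

Because `Swap` and `Load` go through the same atomic, a handler either sees the live pusher or nil, never a torn read.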

Comment thread client/internal/lazyconn/manager/manager.go
Comment on lines +50 to +54
func (c Config) resolvedTimeouts() (iceTimeout, relayTimeout time.Duration) {
relay := c.RelayInactivityThreshold
if relay == 0 && c.InactivityThreshold != nil {
relay = *c.InactivityThreshold
}
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t let the deprecated timeout bleed into mixed two-timer configs.

Line 52 still copies InactivityThreshold into the relay timeout whenever RelayInactivityThreshold == 0, even if ICEInactivityThreshold is explicitly set. That means a migrated p2p-dynamic config (ICE > 0, relay = 0) can unexpectedly inherit the old relay timeout and tear relay down, which contradicts the comment above this field and changes the new mode’s semantics.

Suggested fix
 func (c Config) resolvedTimeouts() (iceTimeout, relayTimeout time.Duration) {
 	relay := c.RelayInactivityThreshold
-	if relay == 0 && c.InactivityThreshold != nil {
+	if c.ICEInactivityThreshold == 0 && relay == 0 && c.InactivityThreshold != nil {
 		relay = *c.InactivityThreshold
 	}
 	return c.ICEInactivityThreshold, relay
 }
🤖 Prompt for AI Agents

In `@client/internal/lazyconn/manager/manager.go` around lines 50 - 54, The relay
timeout fallback is incorrectly using the deprecated InactivityThreshold
whenever RelayInactivityThreshold == 0 even if ICEInactivityThreshold is
explicitly set; update Config.resolvedTimeouts so that RelayInactivityThreshold
only falls back to InactivityThreshold when both RelayInactivityThreshold == 0
and ICEInactivityThreshold == 0 (and InactivityThreshold != nil), otherwise
leave relay as zero when ICEInactivityThreshold is non‑zero to preserve the new
mode semantics; modify the logic in the resolvedTimeouts method accordingly,
referencing the RelayInactivityThreshold, ICEInactivityThreshold, and
InactivityThreshold fields.

Comment thread docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md
Comment on line 59
}
mux := udpmux.NewUniversalUDPMuxDefault(bindParams)
go mux.ReadFromConn(t.ctx)
t.udpMuxConn = udpConn
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Specify whether the t.udpMuxConn field needs to be added to the struct.

The plan assigns to t.udpMuxConn but doesn't clarify whether this field already exists in TunKernelDevice or needs to be added as part of this change. Include a note specifying the struct field addition if required, or confirm the field already exists.

🤖 Prompt for AI Agents

In `@docs/superpowers/plans/2026-04-06-kernel-ice-separate-port.md` at line 59,
The assignment to t.udpMuxConn implies TunKernelDevice needs a field named
udpMuxConn; either add that field to the TunKernelDevice struct with the same
type as the udpConn variable (e.g., net.PacketConn or *net.UDPConn depending on
udpConn's type) and a short comment describing its role, or if the field already
exists, update the plan text to explicitly confirm it exists and show its type;
ensure any initialization/close responsibilities for udpMuxConn are handled
consistently with other connection fields in TunKernelDevice (e.g., in its
constructor/Close method).

Comment thread management/server/activity/codes.go
Comment on lines +235 to +237
// AccountConnectionModeChanged indicates the account-wide ConnectionMode
// setting was changed (Phase 1 of issue #5989).
AccountConnectionModeChanged Activity = 121
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inaccurate phase reference in comment — should be Phase 3.7i, not Phase 1.

Line 236 says "Phase 1 of issue #5989", but this PR is Phase 3.7i. The other new constants on lines 244–245 and 248–252 correctly reference Phase 3 and Phase 3.7i.

✏️ Proposed fix
-	// AccountConnectionModeChanged indicates the account-wide ConnectionMode
-	// setting was changed (Phase 1 of issue `#5989`).
+	// AccountConnectionModeChanged indicates the account-wide ConnectionMode
+	// setting was changed (Phase 3.7i of issue `#5989`).
	AccountConnectionModeChanged Activity = 121
🤖 Prompt for AI Agents

In `@management/server/activity/codes.go` around lines 235 - 237, The comment
above the Activity constant AccountConnectionModeChanged incorrectly references
"Phase 1 of issue `#5989`"; update that comment to reference "Phase 3.7i of issue
`#5989`" so it matches the other new constants (e.g., the constants added around
AccountConnectionModeChanged and those at lines after it) and clearly documents
that AccountConnectionModeChanged Activity = 121 is part of Phase 3.7i.

Comment thread shared/management/proto/management.proto
Comment on lines +397 to +408
// ConnectionMode controls how a peer establishes connections to other peers.
// Added in Phase 1 of the connection-mode consolidation (see issue #5989).
// CONNECTION_MODE_UNSPECIFIED is the proto default and means "fall back to
// the legacy LazyConnectionEnabled boolean field" -- required for backwards
// compatibility with old management servers that don't set this field.
enum ConnectionMode {
CONNECTION_MODE_UNSPECIFIED = 0;
CONNECTION_MODE_RELAY_FORCED = 1;
CONNECTION_MODE_P2P = 2;
CONNECTION_MODE_P2P_LAZY = 3;
CONNECTION_MODE_P2P_DYNAMIC = 4;
}
Contributor

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Add CONNECTION_MODE_P2P_DYNAMIC_LAZY to the wire enum.

#5989's mode set includes p2p-dynamic-lazy, but this enum stops at CONNECTION_MODE_P2P_DYNAMIC. Without a wire value here, management cannot push or persist that mode through PeerConfig, so clients can never receive the dynamic-lazy behavior over the management protocol.

Proposed fix
 enum ConnectionMode {
   CONNECTION_MODE_UNSPECIFIED = 0;
   CONNECTION_MODE_RELAY_FORCED = 1;
   CONNECTION_MODE_P2P = 2;
   CONNECTION_MODE_P2P_LAZY = 3;
   CONNECTION_MODE_P2P_DYNAMIC = 4;
+  CONNECTION_MODE_P2P_DYNAMIC_LAZY = 5;
 }
🤖 Prompt for AI Agents

In `@shared/management/proto/management.proto` around lines 397 - 408, The
ConnectionMode proto enum is missing the new constant for the p2p-dynamic-lazy
mode, preventing management from encoding/decoding that mode; add a new enum
value named CONNECTION_MODE_P2P_DYNAMIC_LAZY (next numeric value after
CONNECTION_MODE_P2P_DYNAMIC) to the ConnectionMode enum so management can
push/persist the p2p-dynamic-lazy mode via PeerConfig and clients can receive
it.
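The backwards-compat contract the enum comment describes — UNSPECIFIED (or any value an older client doesn't know) falls back to the legacy `LazyConnectionEnabled` boolean — can be sketched as a small mapping function. The constant and function names below are illustrative, not the real conversion-layer API in conversion.go.

```go
package main

import "fmt"

// Wire values mirroring the proto enum above; an old server sends 0.
const (
	ModeUnspecified    = 0
	ModeRelayForced    = 1
	ModeP2P            = 2
	ModeP2PLazy        = 3
	ModeP2PDynamic     = 4
	ModeP2PDynamicLazy = 5 // the value the review asks to add
)

// resolveMode maps a wire value to a client mode. UNSPECIFIED — and any
// future value this client doesn't recognise — falls back to the legacy
// LazyConnectionEnabled boolean, as the enum comment requires.
func resolveMode(wire int, legacyLazy bool) string {
	switch wire {
	case ModeRelayForced:
		return "relay-forced"
	case ModeP2P:
		return "p2p"
	case ModeP2PLazy:
		return "p2p-lazy"
	case ModeP2PDynamic:
		return "p2p-dynamic"
	case ModeP2PDynamicLazy:
		return "p2p-dynamic-lazy"
	default: // UNSPECIFIED, or a newer server's unknown value
		if legacyLazy {
			return "p2p-lazy"
		}
		return "p2p"
	}
}

func main() {
	fmt.Println(resolveMode(ModeUnspecified, true))     // old server, lazy on
	fmt.Println(resolveMode(ModeP2PDynamicLazy, false)) // new wire value
}
```

The `default` branch is what makes adding `CONNECTION_MODE_P2P_DYNAMIC_LAZY = 5` safe for older clients: they degrade to the legacy boolean instead of misbehaving.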

@MichaelUray MichaelUray force-pushed the pr/d-hardening-activity-trigger branch from e8a298f to 587ba07 Compare May 6, 2026 15:01
MichaelUray and others added 9 commits May 6, 2026 16:15
…nfig

Additive change for issue netbirdio#5989 Phase 1. New fields use new tag numbers
(11, 12, 13); existing fields (including LazyConnectionEnabled tag 6) are
unchanged so old clients ignore the additions and old servers send
UNSPECIFIED, which the new client maps back via the legacy boolean.

Note: the regenerated pb.go files now report protoc v5.29.3 in their
header (this branch was generated with locally-installed protoc 29.3
instead of upstream's v7.34.1). Functionally identical; header diff is
the only delta beyond the actual schema additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines Mode enum (relay-forced, p2p, p2p-lazy, p2p-dynamic plus the
client-only sentinels Unspecified and FollowServer), ParseString for
CLI/env input, ToProto/FromProto for wire translation, and the two
backwards-compat helpers ResolveLegacyLazyBool / ToLazyConnectionEnabled
that bridge the old Settings.LazyConnectionEnabled boolean.

Phase 1 of issue netbirdio#5989. Pure addition -- no existing callers touched
in this commit; the engine/conn_mgr migration follows in subsequent
commits in the same PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on warns

NB_CONNECTION_MODE wins over the legacy pair (NB_FORCE_RELAY,
NB_ENABLE_EXPERIMENTAL_LAZY_CONN); when the legacy pair is set
together, NB_FORCE_RELAY wins (most-restrictive, mirrors the
group-conflict rule from issue netbirdio#5990).

Each legacy var emits a one-shot deprecation warning when it actually
contributes to the resolved mode. NB_LAZY_CONN_INACTIVITY_THRESHOLD
becomes an alias for the future relay_timeout setting and warns once.

IsForceRelayed() is kept for callers that have not yet been migrated
(conn.go, statusrecorder); they will be updated in the engine/conn
refactor commits later in this PR.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
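The precedence rules this commit describes can be sketched as a pure function. Values are passed in rather than read from the environment to keep the sketch testable; the function name and the `"unspecified"` sentinel are illustrative, not the actual resolver in the client.

```go
package main

import "fmt"

// resolveEnvMode sketches the documented precedence: NB_CONNECTION_MODE
// wins outright; otherwise the legacy pair is consulted, and when both
// legacy vars are set NB_FORCE_RELAY wins (the most-restrictive rule).
func resolveEnvMode(connectionMode string, forceRelay, lazyConn bool) string {
	if connectionMode != "" {
		return connectionMode // NB_CONNECTION_MODE beats the legacy pair
	}
	if forceRelay {
		return "relay-forced" // wins over the lazy flag when both are set
	}
	if lazyConn {
		return "p2p-lazy"
	}
	return "unspecified" // fall through to config / server-pushed mode
}

func main() {
	fmt.Println(resolveEnvMode("p2p-dynamic", true, true)) // new var wins
	fmt.Println(resolveEnvMode("", true, true))            // most-restrictive
}
```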
Three new CLI flags map onto the new connection-mode plumbing:
- --connection-mode <relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server>
- --relay-timeout <seconds>
- --p2p-timeout <seconds>

Plumbed through three sites in cmd/up.go (SetConfigRequest, ConfigInput,
LoginRequest), persisted in profilemanager.Config, and added as new
fields on the daemon.proto IPC messages. Empty / not-changed flags fall
back to the server-pushed value (which itself falls back to the legacy
lazy_connection_enabled boolean for old servers).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EngineConfig gains ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds.
ConnMgr now stores the resolved Mode plus the raw inputs (env, config)
so it can re-resolve when the server pushes a new PeerConfig.

UpdatedRemoteFeatureFlag is renamed to UpdatedRemotePeerConfig and
takes the full PeerConfig pointer; a thin shim with the old name
delegates to it for callers that haven't been updated yet.

connect.go copies the three new fields from profilemanager.Config into
the EngineConfig builder, with a tolerant parser that logs and falls
through to Unspecified on invalid input.

Phase 1 of issue netbirdio#5989. peer/conn.go forwarding follows in C4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ConnConfig gains a Mode field forwarded from the engine. Open() now
checks Mode == ModeRelayForced instead of calling the global env-reader
IsForceRelayed(). The local 'forceRelay' variable name is renamed to
'skipICE' to make the new branching intent explicit.

The PeerStateUpdate block at the end of Open() also reads from
conn.config.Mode now, so the StatusRecorder sees the per-peer mode
rather than the global env var.

A single remaining caller of IsForceRelayed() (srWatcher.Start in
engine.go) is left for a follow-up; that path uses a process-wide flag
not per-peer state, so it can be migrated in Phase 2 once srWatcher
itself learns about ConnectionMode.

Phase 1 of issue netbirdio#5989. Engine forwarding (C5) follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
createPeerConn now reads ConnMgr.Mode() and copies it into
peer.ConnConfig, so the per-peer Open() loop in conn.go can take the
ModeRelayForced skip-ICE branch without reading the global env var.

This is the last wiring commit for the client side of Phase 1; the
server-side mgmt changes (Settings + OpenAPI + handler + audit +
NetworkMap-build) follow in Section D.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three fields are nullable to distinguish 'use built-in default'
(NULL) from explicit values (incl. 0 = never tear down). Copy() now
deep-clones the new pointer fields via two small helpers.

GORM AutoMigrate creates the new columns at first start; existing
accounts have NULL in all three columns and resolve via the legacy
LazyConnectionEnabled boolean.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tings

Three new optional, nullable fields with descriptions of the NULL =
built-in-default semantics and the Phase-1-vs-Phase-2 status of
p2p-dynamic. Regenerated types.gen.go via the existing oapi-codegen
tooling.

The generated AccountSettingsConnectionMode enum has the canonical
values relay-forced / p2p / p2p-lazy / p2p-dynamic, plus a Valid()
helper for handler-side validation.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MichaelUray and others added 28 commits May 6, 2026 16:16
…E-backoff display

Finding 1 (PR-blocker): client/server compile fix.
  client/server/server_test.go now passes 13 args (peerConnStore +
  peerConnRouter as nil) to nbgrpc.NewServer matching the current
  signature. Also adds the four Phase 3.7i ConnectionMode fields to
  setconfig_test.go's expectedFields + fieldsWithoutCLIFlags maps —
  they're in the proto for GetConfig only; SetConfig RPC currently
  doesn't apply them (CLI sets them via service install/reconfigure
  writing the profile file directly). Documented as a wiring gap.

Finding 2: peer_connections.MemoryStore drops fresh full snapshots
after daemon restart.
  store.go now ALWAYS replaces on full_snapshot=true, regardless of
  seq, because the pusher resets seq to 1 on every stream restart.
  Stale in-flight deltas from the closed stream cannot physically
  arrive after the new full snapshot (transport itself is gone).
  New regression test TestMemoryStore_FullSnapshotResetsEpoch covers
  the scenario plus the follow-up delta merge from the new session.
  TestMemoryStore_OutOfOrderDeltaDropped split out to keep the
  delta-out-of-order assertion intact.

Finding 3: UpdatePeerRemoteMeta now fires OnPeersListChanged when a
UI-relevant field flips.
  status.go detects LiveOnline / ServerLivenessKnown /
  EffectiveConnectionMode flips and calls notifyPeerListChanged so
  the Android home/peers fragments — which only refresh on
  OnPeersListChanged — pick up the change immediately instead of at
  the next 30 s daemon-RPC poll. New tests:
  - TestStatus_UpdatePeerRemoteMeta_LiveOnlineFlipNotifies
  - TestStatus_UpdatePeerRemoteMeta_EffectiveModeChangeNotifies
  - TestStatus_UpdatePeerRemoteMeta_NonMaterialFieldsDoNotNotify

Finding 4: ICE-backoff display now wall-clock aware on Windows.
  client/ui/peers_tab.go mirrors the CLI's status.go:797 pattern: only
  show "suspended for Xs (retry at ...)" while nextRetry is still in
  the future, otherwise show the next-retry timestamp, otherwise
  hide. The hard "SUSPENDED" line could remain stale for hours after
  the cool-down expired because the daemon snapshot only refreshes
  on ICE state-change events.

Android Java UI receives the same fix in the netbird-android repo
(submodule bump in companion commit).
The previous fix (full_snapshot=true always replaces) covers the
normal daemon-restart case but cannot detect a stale unary RPC from
the previous daemon process arriving AFTER the new process's full
snapshot. Because SyncPeerConnections is a unary RPC (not stream), a
retried in-flight delta can race past the snapshot and merge
old-session data into the fresh map (stale delta seq=51 beats new
session seq=2 under pure seq comparison).

Fix: add a session_id field to PeerConnectionMap, generated once per
daemon process via crypto/rand at conn_state_pusher construction.
Mgmt's MemoryStore.Put now drops any delta whose session_id doesn't
match the cached entry's. session_id=0 means "legacy / unset" and
falls back to the existing seq-only behaviour (so a partial fleet
upgrade doesn't silently drop pushes).

Changes:
  - shared/management/proto/management.proto: PeerConnectionMap.session_id
    field 5, uint64.
  - client/internal/conn_state_pusher.go: connStatePusher.sessionID
    set once via crypto/rand; flushDelta + flushFull both stamp it.
  - management/server/peer_connections/store.go: drop deltas with
    mismatched session_id when both sides advertise one.

Tests:
  - TestMemoryStore_StaleDeltaFromOldSessionDropped: full snapshot
    from session B arrives at seq=1 against cached session A seq=50,
    then a stale delta from session A retries at seq=51 — must not
    leak into session B's map.
  - TestMemoryStore_LegacyZeroSessionFallsBackToSeqOnly: legacy
    clients (session_id=0) keep working under seq-only rules.
  - TestMemoryStore_MixedSessionAcceptsLegacyDelta: legacy delta
    against a session-tagged cached state must not be dropped (fleet
    upgrade hazard).

management.pb.go regenerated with protoc-gen-go v1.36.6 (matches the
existing header) — diff scoped to the new SessionId getter + struct
field + rawDesc entry.
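The merge rule can be condensed into a few lines. This is a stand-in sketch, not the real `MemoryStore.Put`; `snapshot` carries only the fields the drop rule needs.

```go
package main

import "fmt"

// snapshot is a minimal stand-in for PeerConnectionMap.
type snapshot struct {
	sessionID uint64 // 0 = legacy / unset
	seq       uint64
	full      bool
}

type store struct{ cur *snapshot }

// put sketches the rule described above: full snapshots always replace,
// deltas from a different session are dropped (when BOTH sides advertise
// a session), and a zero session_id falls back to seq-only ordering so a
// partial fleet upgrade keeps working.
func (s *store) put(in snapshot) bool {
	if in.full || s.cur == nil {
		s.cur = &in
		return true
	}
	if in.sessionID != 0 && s.cur.sessionID != 0 && in.sessionID != s.cur.sessionID {
		return false // stale delta retried from a previous daemon process
	}
	if in.seq <= s.cur.seq {
		return false // out-of-order delta within the same session
	}
	s.cur = &in
	return true
}

func main() {
	st := &store{}
	st.put(snapshot{sessionID: 2, seq: 1, full: true}) // new session's snapshot
	// Stale retried delta from the old session: seq 51 > 1, wrong session.
	fmt.Println(st.put(snapshot{sessionID: 1, seq: 51})) // false
	// In-session follow-up delta merges normally.
	fmt.Println(st.put(snapshot{sessionID: 2, seq: 2})) // true
	// Legacy delta (session_id=0) is not dropped on session grounds.
	fmt.Println(st.put(snapshot{sessionID: 0, seq: 3})) // true
}
```

Note the last case: because the drop requires both session IDs to be non-zero, a legacy client pushing against a session-tagged cached state still merges, matching the fleet-upgrade test named above.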
Hardware test on 4998e5a surfaced a regression: my Item 1 fix
(suppress reconnect-guard offers under p2p-dynamic when ICE was
detached for inactivity) used "handshaker.readICEListener() == nil"
as the detach signal. But the listener is ALSO nil for a brand-new
peer that has never been connected — the initial setup attaches it.
Result: the very first bootstrap offer fired by the lazy mgr's
guard activation got suppressed, and dk20 saw all 3 BM routers
stuck in "Connecting" forever after the wakeup ping.

Fix: track "have we ever been connected" via a new
Conn.everConnected atomic.Bool, set on first successful
configureConnection (ICE) or relay-connected transition. The skip
branch now ALSO requires everConnected==true, so:
  - brand-new peer  -> first offer goes through (bootstrap works)
  - was-connected   -> ICE detach for inactivity skips offers
                       (Item 1 intent preserved)
  - ICE-failure-backoff -> existing 3-tries-then-hourly retry
                          continues to handle it

Regression test: TestConn_OnGuardEvent_SkipOfferGatedOnEverConnected
ensures the everConnected.Load() landmark stays above the skip-offer
trace landmark in onGuardEvent's source — cheap static-text guard
that fails immediately if a future refactor drops the gate.
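The corrected gate can be reduced to two booleans. This sketch assumes `iceAttached` as a stand-in for "handshaker.readICEListener() != nil"; the real `Conn` carries far more state.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// conn sketches just the offer-suppression gate.
type conn struct {
	everConnected atomic.Bool // set on first successful ICE or relay connect
	iceAttached   bool        // stand-in for readICEListener() != nil
}

// shouldSkipOffer reproduces the corrected rule: suppress the
// reconnect-guard offer only when ICE was detached for inactivity on a
// peer that has ALREADY been connected once. A brand-new peer (listener
// also nil, but never connected) must let its bootstrap offer through.
func (c *conn) shouldSkipOffer() bool {
	return !c.iceAttached && c.everConnected.Load()
}

func main() {
	fresh := &conn{}                     // never connected, listener nil
	fmt.Println(fresh.shouldSkipOffer()) // false: bootstrap offer fires

	idle := &conn{}
	idle.everConnected.Store(true)      // was connected, detached for inactivity
	fmt.Println(idle.shouldSkipOffer()) // true: offer suppressed
}
```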
Hardware test on c9a47ed surfaced a stuck-state bug: after a
WireGuard handshake timeout (3 min of no kernel-side handshake), the
peer's active worker was closed via onWGDisconnected, BUT the lazy
manager kept it in the "active" set with no activity listener
attached. Result:
  - Status frozen at "Connecting" indefinitely
  - Local outbound traffic (ping etc.) silently dropped — no
    "activity detected" log, no offer fired
  - Recovery only happened if the REMOTE peer happened to send an
    offer (signal-RPC path bypasses the activity listener)
  - Confirmed on dk20: 0/10 ping responses to 572a2/5731A after
    WG timeout, no daemon log activity for 10+ minutes

Fix: add an onWGTimeoutRecover callback on Conn, wired by ConnMgr
to ConnMgr.RecoverPeerToIdle(peerKey) which calls
lazyConnMgr.DeactivatePeer(connID). DeactivatePeer already does the
right thing (close conn fully, restart activity monitor) — it was
just never invoked from the WG-timeout path. The callback is
launched in a goroutine to avoid re-entering conn.mu (the recovery
path closes the conn, which itself takes conn.mu).

Files:
  - client/internal/peer/conn.go: new onWGTimeoutRecover field +
    SetOnWGTimeoutRecover setter, invoked at end of onWGDisconnected
    after the active worker close.
  - client/internal/conn_mgr.go: new RecoverPeerToIdle method and
    callback wiring in AddPeerConn.
  - client/internal/peer/conn_handover_order_test.go: static-text
    regression test enforces that onWGDisconnected references
    onWGTimeoutRecover and invokes it AFTER workerRelay.CloseConn().
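The re-entrancy hazard the goroutine avoids can be shown in miniature. This is a stripped-down sketch, not the real `Conn`/`ConnMgr` wiring: `onWGDisconnected` holds `conn.mu` while it runs, so invoking the recovery callback inline would deadlock the moment the recovery path (which closes the conn) re-takes the same mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// conn sketches only the callback wiring.
type conn struct {
	mu                 sync.Mutex
	onWGTimeoutRecover func()
}

func (c *conn) SetOnWGTimeoutRecover(f func()) { c.onWGTimeoutRecover = f }

// onWGDisconnected runs under conn.mu (as the real handler does); the
// recovery callback is launched in a goroutine because the recovery
// path closes the conn, which itself takes conn.mu.
func (c *conn) onWGDisconnected(done chan<- struct{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// ... close the active worker here ...
	if c.onWGTimeoutRecover != nil {
		go func() {
			c.onWGTimeoutRecover()
			close(done)
		}()
	}
}

func main() {
	c := &conn{}
	recovered := false
	done := make(chan struct{})
	c.SetOnWGTimeoutRecover(func() {
		c.mu.Lock() // would deadlock if invoked inline under conn.mu
		defer c.mu.Unlock()
		recovered = true
	})
	c.onWGDisconnected(done)
	<-done
	fmt.Println(recovered) // true
}
```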
Phase 3.7i (netbirdio#5989). Adds PeerSystemMeta.supported_features as a
forward-compatible capability advertisement mechanism. The client
ships a list of feature keywords (currently just "p2p_dynamic")
that the management server can branch on to decide whether to send
legacy-compat fallback settings to clients that pre-date a feature.

Source of truth for the keyword list lives in
client/system/features.go with a pin-down test in features_test.go
to make adding new capabilities a deliberate two-edit operation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7i (netbirdio#5989). Two new account-level settings:

  LegacyLazyFallbackEnabled        bool   default true
  LegacyLazyFallbackTimeoutSeconds uint32 default 3600

When the account ConnectionMode is p2p-dynamic, conversion.go (next
commit) downgrades clients that don't advertise the "p2p_dynamic"
capability to p2p-lazy with the configured timeout. Default is ON so
older clients keep behaving sanely after an admin flips the mode.

GORM AutoMigrate adds the columns. The pgx fast-path SELECT loads
them with NullBool / NullInt64 backed defaults so pre-3.7i rows in
the DB stay safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7i (netbirdio#5989). When the resolved account ConnectionMode is
p2p-dynamic and the peer does NOT advertise the "p2p_dynamic"
capability via PeerSystemMeta.SupportedFeatures, downgrade its
PeerConfig to p2p-lazy with the admin-configured fallback timeout.

This addresses Codex's compatibility concern: a default of
p2p-dynamic on the account combined with an old client (that just
ignores the unknown enum and falls back to LazyConnectionEnabled)
would otherwise send LazyConnectionEnabled=false -> peers eagerly
hold ICE forever -> battery / metered-link cost. With this
override, old clients see LazyConnectionEnabled=true and the
configured timeout, so they get the closest behavioural equivalent
to p2p-dynamic that their proto vocabulary allows.

Override is gated on the LegacyLazyFallbackEnabled toggle; admins
who know their entire fleet is on a 3.7i+ build can disable it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7i (netbirdio#5989), follow-up to Codex PR review. New clients
talking to an old (pre-3.7i) management server would otherwise hit
codes.Unimplemented from SyncPeerConnections on every push attempt:
once at the initial snapshot, once per peer state change, and once
per heartbeat tick (60s). The retries are wasteful (gRPC roundtrip
+ wakeup on every metered link) and noisy in the daemon log.

Detect Unimplemented once via status.FromError, log a single WARN
for operator visibility, and latch a sticky `disabled` atomic.Bool
that short-circuits all further flushDelta / flushFull paths. The
events are still recorded in lastPushed so the dirty-state
computation doesn't keep re-flagging them.

Disabled state is per-pusher-instance, i.e. resets at the next
daemon restart - if mgmt has been upgraded by then, the new pusher
detects support naturally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
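The sticky latch can be sketched without the gRPC plumbing. Here a sentinel error stands in for a `codes.Unimplemented` status (the real pusher classifies via `status.FromError`); the `pusher` type and its counters are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errUnimplemented stands in for a gRPC codes.Unimplemented status.
var errUnimplemented = errors.New("unimplemented")

type pusher struct {
	disabled atomic.Bool // sticky: set once, never cleared in this process
	warns    int
	calls    int
}

// push short-circuits once the latch is set, so an old management
// server costs exactly one failed RPC plus one operator-visible WARN
// instead of a retry per snapshot, state change, and heartbeat tick.
func (p *pusher) push(rpc func() error) {
	if p.disabled.Load() {
		return
	}
	p.calls++
	if err := rpc(); errors.Is(err, errUnimplemented) {
		if p.disabled.CompareAndSwap(false, true) {
			p.warns++ // the single WARN log
		}
	}
}

func main() {
	p := &pusher{}
	for i := 0; i < 5; i++ { // snapshot + deltas + heartbeat ticks
		p.push(func() error { return errUnimplemented })
	}
	fmt.Println(p.calls, p.warns) // 1 1: only the first attempt hits the wire
}
```

Because the latch lives on the pusher instance, a daemon restart naturally re-probes an upgraded management server, as the commit message notes.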
Phase 3.7i (netbirdio#5989). Adds the two new account settings to the
HTTP-API surface and emits audit events when they change:

  legacy_lazy_fallback_enabled         (bool, default true)
  legacy_lazy_fallback_timeout_seconds (int,  default 3600,
                                        range 60-86400)

PUT /api/accounts/{id} validates the timeout range and rejects
out-of-range values with HTTP 400. GET returns both fields
unconditionally (defaults if the DB row predates 3.7i).

Activity codes 125 + 126 emit on toggle/timeout change with
old+new values in the meta payload.

The change-detection block in handleAccountSettingsUpdate now
treats both new fields as peer-update-triggering so legacy clients
re-receive their PeerConfig with the new override timeout when an
admin tweaks the setting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7i (netbirdio#5989), follow-up to Codex PR review. The GORM
`default:true` / `default:3600` tags on Settings.LegacyLazyFallback*
only fire on SQL INSERT, leaving every in-memory construction path at
the Go zero value (false / 0). The conversion layer reads those
zeros and disables the legacy-fallback for the account -- the exact
opposite of the intended behaviour.

Fixes:
- types.DefaultLegacyLazyFallback{Enabled,TimeoutSeconds} constants
  + ApplyLegacyLazyFallbackDefaults helper (single source of truth)
- account.go (both NewAccount paths) seeds the new fields explicitly
- store/file_store.go applies defaults on every JSON-account load
  (existing on-disk JSON predates the fields entirely)
- http handler PUT path seeds the rebuilt-from-scratch returnSettings
  with defaults BEFORE per-field if-blocks; explicit user values
  (false toggle, custom timeout) still override
- http handler GET path returns defaults for accounts whose DB row
  predates the fields (zero-valued), keeping the API response in
  sync with what the conversion layer actually applies
- accounts_handler_test fixtures updated for the round-trip

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
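The seed-at-construction pattern the commit describes can be sketched as follows. The struct mirrors only the two fields under discussion (the real `Settings` is much larger), and the helper names are illustrative stand-ins for the `types.DefaultLegacyLazyFallback*` constants and `ApplyLegacyLazyFallbackDefaults`.

```go
package main

import "fmt"

const (
	defaultLegacyLazyFallbackEnabled        = true
	defaultLegacyLazyFallbackTimeoutSeconds = uint32(3600)
)

// settings mirrors only the two new fields.
type settings struct {
	LegacyLazyFallbackEnabled        bool
	LegacyLazyFallbackTimeoutSeconds uint32
}

// newSettings seeds the defaults at construction time, because GORM
// `default:` tags only fire on SQL INSERT — every in-memory path would
// otherwise read the Go zero values (false / 0) as "fallback disabled".
func newSettings() settings {
	return settings{
		LegacyLazyFallbackEnabled:        defaultLegacyLazyFallbackEnabled,
		LegacyLazyFallbackTimeoutSeconds: defaultLegacyLazyFallbackTimeoutSeconds,
	}
}

// applyLegacyLazyFallbackDefaults backfills rows/JSON that predate the
// fields. Only the timeout is recoverable here: a zero bool is
// indistinguishable from an explicit false, which is why the HTTP PUT
// path seeds defaults BEFORE the per-field override blocks instead.
func applyLegacyLazyFallbackDefaults(s *settings) {
	if s.LegacyLazyFallbackTimeoutSeconds == 0 {
		s.LegacyLazyFallbackTimeoutSeconds = defaultLegacyLazyFallbackTimeoutSeconds
	}
}

func main() {
	s := newSettings()
	fmt.Println(s.LegacyLazyFallbackEnabled, s.LegacyLazyFallbackTimeoutSeconds) // true 3600

	var old settings // as loaded from a pre-3.7i JSON account file
	applyLegacyLazyFallbackDefaults(&old)
	fmt.Println(old.LegacyLazyFallbackTimeoutSeconds) // 3600
}
```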
Upstream PR netbirdio#6006 (commit db44848 "Drop netmap calculation on peer
read", merged 2026-04-28) simplified GetPeers and GetPeer by removing
the post-filter that expanded a regular user's own-peer list with
peers reachable via account access policies. Reason given was the
expense of GetPeerConnectionResources on large accounts. The
visibility behaviour was lost in the process: a "user"-role account
member now sees ONLY their directly-owned peers in the dashboard,
not the routing peers / counterparts their policies authorise them
to communicate with -- making the dashboard close to useless.

Reported by Michael Uray for georg.stoisser-gigacher: Georg has 1
own peer (ctb50-d) and 17 auto_groups feeding 10+ access policies,
yet sees a single peer in the web UI. Promoting to auditor exposes
ALL peers -- too much.

Fix: restore the two pre-netbirdio#6006 helpers verbatim (modulo a small
optimisation to avoid the redundant GetAccountPeers call):

  - getUserAccessiblePeers(ctx, accountID, ownPeers) -- merges the
    user's own peers with the result of GetPeerConnectionResources
    on each, returning the union.
  - checkIfUserOwnsPeer(ctx, accountID, userID, peer) -- per-peer
    membership check used by the GET /api/peers/{id} path.

Plus call them from GetPeers / GetPeer respectively.

Filter parity with admin path: the SQL store's GetAccountPeers
applies LIKE-style nameFilter / ipFilter; for the user-role branch
(GetUserPeers + ACL union) we apply the same substring matching
in-process via a new filterPeersByNameAndIP helper. This keeps the
public GetPeers contract identical for both roles. Unit tests cover
the helper directly and the existing GetPeer test now covers both
the policy-allows and policy-denies sub-cases.

Performance trade-off acknowledged: GetPeerConnectionResources is
expensive on large accounts. We accept the cost because (a) only
non-admin readers hit this branch, (b) typical own-peer count is
small, and (c) a future cache layer can offset it without changing
this contract. See docs/bugs/2026-05-04-user-peer-visibility-regression.md
(now bundled in this repo) for the full analysis and verification path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug reported by Michael Uray on uray-mic-d4 (2026-05-04): toggling the
VPN off while the "Peers and Networks" window is open spawns a modal
"get client: failed to list routes: rpc error: code = Unknown desc =
not connected" dialog. The dialog re-pops every 10s because the
auto-refresh ticker keeps trying to ListNetworks against a daemon
that just shut down its IPC pipe.

Fix: split getFilteredNetworks / updateNetworks into two-mode helpers
(loud + silent). The auto-refresh ticker uses the silent variant
(logs only); the manual Refresh button still uses the loud variant
because the user pressing Refresh expects feedback if it fails.
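The loud/silent split can be sketched like this. A hedged illustration only: `listNetworks` stands in for the daemon RPC, and the helper names are illustrative rather than the actual Fyne-client identifiers:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// listNetworks stands in for the daemon RPC call; hypothetical for this sketch.
func listNetworks(connected bool) ([]string, error) {
	if !connected {
		return nil, errors.New("not connected")
	}
	return []string{"192.168.91.0/24"}, nil
}

// updateNetworksSilent is the auto-refresh-ticker variant: failures are
// logged, never surfaced as a modal dialog.
func updateNetworksSilent(connected bool) []string {
	nets, err := listNetworks(connected)
	if err != nil {
		log.Printf("silent refresh skipped: %v", err)
		return nil
	}
	return nets
}

// updateNetworksLoud is the manual-Refresh variant: the error propagates
// so the caller can show feedback to the user who pressed the button.
func updateNetworksLoud(connected bool) ([]string, error) {
	return listNetworks(connected)
}

func main() {
	fmt.Println(updateNetworksSilent(false) == nil) // daemon down: logged only
	_, err := updateNetworksLoud(false)
	fmt.Println(err != nil) // daemon down: caller shows the dialog
}
```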

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback (Michael Uray, 2026-05-04): the expanded peer-detail
text in the "Peers and Networks" window cannot be marked or copied
with the mouse, so a user who wants to paste an FQDN or IP into
another tool has to retype it. Same for the network-range cells in
the All/Overlapping/Exit-node tabs.

Fix: set widget.Label.Selectable = true on the affected labels.
Fyne 2.6+ supports the flag natively, no custom widget required.
The header buttons (peer-row collapse/expand) stay non-selectable
because they need to keep their tap-to-toggle behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported by Michael Uray on uray-mic-d4 (2026-05-04, debug bundle):
0/28 peers P2P, all relayed even on a network with valid public srflx
visibility. Pion ICE log showed selected candidate pairs like

  [udp4 host 127.0.0.1:51820]    <-> [udp4 srflx 81.16.19.15:51820]
  [udp4 host 172.26.240.1:51820] <-> [udp4 host 41.66.90.143:40488]
  [udp4 host 10.102.0.52:51820]  <-> [udp4 srflx 64.141.62.202:60993]

These local addresses are unroutable from the remote peer (loopback,
Hyper-V Default Switch, internal VPN tunnel). Pion ICE never sees a
working pair, falls back to relay every time.

Root cause: client/internal/stdnet/filter.go used strings.HasPrefix
on the raw interface name. The disallow list ("lo", "veth", "docker",
"br-") is lowercase but Windows reports interface names in mixed
case ("Loopback Pseudo-Interface 1", "vEthernet (Default Switch)",
"Docker Desktop"). Case-sensitive HasPrefix matched none of them, so
*every* Windows interface slipped past the filter and Pion ICE
gathered host candidates from all of them.

Fix is two-part to avoid over-filtering (Codex review caught a v1
that filtered ALL vEthernet*, including the user's actual default-
route external switch "vEthernet (LAN)"):

 1. Lower-case both sides for the disallow-list prefix match. This
    makes "lo" / "wt" / "wg" / "tailscale" / "zerotier" / "docker"
    work uniformly on Linux and Windows.

 2. On Windows, *additionally* skip a small targeted list of
    well-known internal interfaces by case-insensitive substring:
       - "loopback pseudo-interface"     (127.0.0.1)
       - "vethernet (default switch)"    (Hyper-V NAT-only)
       - "vethernet (wsl"                (WSL2)
    User-named Hyper-V external switches like "vEthernet (LAN)" are
    LEFT ALONE — those are the host's real default route on uray-
    mic-d4 and on every multi-NIC Hyper-V host.

 3. The "veth" entry in disallowList is intentionally skipped on
    Windows (handled by the targeted check above). The Linux veth
    pair filtering stays unchanged.

Regression tests in filter_test.go pin both the kill list and the
keep list, with an explicit assertion that "vEthernet (LAN)" stays
allowed.
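The two-part filter can be sketched as below. This is a hedged reconstruction of the described behaviour, not the real filter.go (function and variable names are illustrative); the GOOS is passed as a parameter so the kill/keep list is testable off-Windows:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical reconstruction of the two-part filter described above.
var disallowPrefixes = []string{"lo", "wt", "wg", "tailscale", "zerotier", "docker", "br-", "veth"}

// Targeted Windows-internal interfaces, matched case-insensitively by substring.
var windowsInternalSubstrings = []string{
	"loopback pseudo-interface",  // 127.0.0.1
	"vethernet (default switch)", // Hyper-V NAT-only
	"vethernet (wsl",             // WSL2
}

func isAllowedInterface(name, goos string) bool {
	lower := strings.ToLower(name)
	for _, p := range disallowPrefixes {
		if p == "veth" && goos == "windows" {
			continue // on Windows, handled by the targeted substring list below
		}
		if strings.HasPrefix(lower, p) { // lower-cased both sides: part 1 of the fix
			return false
		}
	}
	if goos == "windows" {
		for _, s := range windowsInternalSubstrings {
			if strings.Contains(lower, s) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(isAllowedInterface("vEthernet (Default Switch)", "windows")) // filtered
	fmt.Println(isAllowedInterface("vEthernet (LAN)", "windows"))            // kept: real default route
	fmt.Println(isAllowedInterface("veth0a1b2c", "linux"))                   // Linux veth pair filtered
}
```

The key asymmetry is the `veth` skip on Windows: a blanket `veth` prefix would kill every `vEthernet (…)` switch, so Windows relies on the targeted substring list instead.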

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ffective_*

Same trap the in-tree comment 30 lines higher in getAccount() warns
about for settings_connection_mode -- but the analogous Phase-3.7i
columns on the peers table were never added to the pgx fast-path:

  - meta_supported_features
  - meta_effective_connection_mode
  - meta_effective_relay_timeout_secs
  - meta_effective_p2_p_timeout_secs
  - meta_effective_p2_p_retry_max_secs

Symptom on uray-mic-d4 + ctb50-d (debug bundles 2026-05-04):
   22:52:10 mode=p2p-dynamic  (Login response, in-memory peer.Meta)
   22:52:16 mode=p2p-lazy     (~5s later via pgx-loaded NetworkMap push)

The Login-response path uses the freshly-extracted PeerSystemMeta
from the Login request (with SupportedFeatures populated), so the
first PeerConfig correctly resolves to p2p-dynamic. Five seconds
later -- as soon as the next NetworkMap broadcast triggers the pgx
fast-path account loader -- toPeerConfig sees `Meta.SupportedFeatures
== nil` because the SELECT statement omits the column. Its
legacy-fallback check
   slices.Contains(peer.Meta.SupportedFeatures, "p2p_dynamic")
returns false for EVERY 3.7i+ client, so resolvedMode silently
downgrades to p2p-lazy and the mode-flip cascades through every
peer's lazy-mgr ("peer reset to idle" x N -> 1610x "ICE Agent is
not initialized yet" -> ICE retries exhausted -> 0 P2P).

Verified by SQL: meta_supported_features for uray-mic-d4, ctb50-d,
dk20 all hold ["p2p_dynamic"] in the DB; the column was being
written correctly, just never read on this hot path.

Fix: SELECT the missing five columns and scan them into
sql.NullInt64 / sql.NullString / []byte holders, then unpack into
nbpeer.PeerSystemMeta exactly as the GORM path does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Ps survive

Pre-existing NetBird p2p-lazy bug, made more visible by the p2p-dynamic
mode of Phase 3.7i (which adds the WG-handshake-timeout-recovery cycle):
when the lazy manager deactivated a peer, peer.Conn.Close() called
endpointUpdater.RemoveWgPeer() unconditionally, wiping every AllowedIP
including the routed subnets the route-manager had appended in-place.
The lazy listener then re-armed the peer with only the basic peer-IP
/32. The route-manager's allowedIPsRefCounter went stale -- routed-
subnet traffic was silently dropped by WG until either the next mgmt-
side reconcile re-attached the prefix or the operator manually pinged
the routing peer's NetBird IP to wake it up via the activity listener
(which only matches the peer-IP /32).

Symptom (Michael Uray, 2026-05-04): from uray-mic-d4 (Windows) ping to
192.168.91.220 (behind r1-pve5) timed out for ~30s, then a manual ping
to r1-pve5's NetBird IP woke it, after which 192.168.91.220 was reachable
again.

Fix: add `keepWgPeer bool` parameter to peer.Conn.Close. Lazy-suspend
callers pass true (keep the WG peer entry; the listener will UpdatePeer
in place to switch the endpoint to the fake-IP); permanent-removal and
fresh-Conn-cleanup callers pass false (current behaviour preserved).

Updated callers (intent in parentheses):
  peerstore.Store.PeerConnIdle  -> true  (lazy idle)
  peerstore.Store.PeerConnClose -> true  (lazy excluded)
  ConnMgr loop RelayInactiveChan -> true (lazy suspend)
  ConnMgr.RemovePeerConn        -> false (permanent remove)
  ConnMgr resetPeers (mode chg) -> false (full reopen)
  Engine peer-already-exists    -> true  (live Conn owns WG entry)
  Engine remote-offline-debounce -> true (allow wake-up on return)
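The guard itself is small; a minimal sketch of the changed signature, with a fake `endpointUpdater` standing in for the real WG interface plumbing (names are illustrative):

```go
package main

import "fmt"

// endpointUpdater stand-in: records whether the WG peer entry was removed.
type endpointUpdater struct{ removed bool }

func (e *endpointUpdater) RemoveWgPeer() { e.removed = true }

type Conn struct{ updater *endpointUpdater }

// Close mirrors the fixed signature: lazy-suspend callers pass
// keepWgPeer=true so routed-subnet AllowedIPs survive the cycle;
// permanent-removal callers pass false (previous behaviour).
func (c *Conn) Close(keepWgPeer bool) {
	// ...tear down ICE / relay workers here...
	if !keepWgPeer {
		c.updater.RemoveWgPeer()
	}
}

func main() {
	lazy := &Conn{updater: &endpointUpdater{}}
	lazy.Close(true) // lazy idle: WG entry and its AllowedIPs kept
	fmt.Println(lazy.updater.removed)

	gone := &Conn{updater: &endpointUpdater{}}
	gone.Close(false) // permanent removal: entry wiped as before
	fmt.Println(gone.updater.removed)
}
```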

Regression tests in conn_lazy_keepwgpeer_test.go cover the Conn.Close
signature + the !keepWgPeer guard + every conn_mgr.go caller. Both fail
on origin/main; both pass with this fix.

Detailed mechanism analysis in
docs/bugs/2026-05-04-lazy-wake-on-routed-subnet.md.

Refs: Codex review 2026-05-04 (precision on Windows fake-IP at
127.2.x.y vs Linux 127.0.1.x UDP listener; HA-routing pitfall flag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… fix)

Reported by Michael Uray on dk20 ↔ ctb50-d ↔ BM-Routers (2026-05-04):
after the lazy/inactivity cycle suspends a peer (Connection type
goes through P2P → Relay → Idle), a subsequent ping to wake it up
opens the connection BUT NEVER ESTABLISHES ICE. The peer stays at
"Connecting" forever, ICMP times out, only a daemon restart on one
side cleans the state.

Live trace from dk20.client.log:
  22:31:42 close peer connection (keepWgPeer=true)
  22:31:42 created activity listener: 127.0.1.1:46189
  ... 28 minutes of idle ...
  22:59:21 activity detected
  22:59:21 detected peer activity
  22:59:21 starting guard for reconnection with MaxInterval: 35.9s
  ...sixty-seven seconds of nothing...

Root cause: signal-trigger and activity-trigger paths were asymmetric.

  Signal path (engine.go:1845 → ConnMgr.ActivatePeer):
    ActivatePeer does conn.Open() + (for p2p-dynamic) conn.AttachICE().
    AttachICE registers the ICE listener on the handshaker AND issues
    SendOffer so the remote side learns we're ICE-capable again.

  Activity path (lazy-mgr.onPeerActivity → peerStore.PeerConnOpen):
    PeerConnOpen only calls conn.Open(). Open() recreates workerICE
    but does NOT register the listener on the handshaker -- in
    ModeP2PDynamic the registration is deliberately deferred to
    AttachICE (deferICEListener=true at conn.go:258). With no
    AttachICE call, handshaker.iceListener stays nil.

The guard's onGuardEvent (conn.go:751) then sees:
  - everConnected.Load() == true (we ran before the cycle)
  - handshaker.readICEListener() == nil (Open didn't attach for
    p2p-dynamic)
  - IceBackoff not suspended, no failures
and skips every offer with the comment
  "guard: skip offer (ICE detached for inactivity, p2p-dynamic;
   will re-attach on real traffic)"

But the only re-attach path was ConnMgr.ActivatePeer -- only fired
on signal messages, never on local activity. Result: dead loop, no
offer ever sent, ICE never converges, peer stuck on relay or fully
unreachable until manual restart.

Fix: lazy-mgr.onPeerActivity now also calls conn.AttachICE() after
PeerConnOpen. AttachICE is mode-safe:
  - ModeP2P / ModeP2PLazy: listener already registered by Open,
    attachICEListenerLocked returns false -> no-op
  - ModeRelayForced: workerICE nil -> error returned, swallowed
  - ModeP2PDynamic: listener registered + SendOffer issued, ICE
    negotiation re-engages cleanly

Honours iceBackoff so the existing 3-tries-then-hourly retry policy
still applies on persistent failure. Symmetric to the signal path
in ConnMgr.ActivatePeer (conn_mgr.go:488).
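The mode-safety described above can be sketched as a small state machine. This is a reduced illustration of the described behaviour (the real Conn does far more); the boolean fields stand in for the handshaker listener registration and the signal-server offer:

```go
package main

import (
	"errors"
	"fmt"
)

type Mode int

const (
	ModeP2P Mode = iota
	ModeP2PLazy
	ModeP2PDynamic
	ModeRelayForced
)

// conn models only the asymmetry: whether the ICE listener is registered
// on the handshaker, and whether an offer went out.
type conn struct {
	mode        Mode
	iceListener bool
	offerSent   bool
}

// Open registers the listener immediately except in p2p-dynamic, where
// registration is deliberately deferred to AttachICE (deferICEListener),
// and relay-forced, which has no ICE worker at all.
func (c *conn) Open() {
	if c.mode != ModeP2PDynamic && c.mode != ModeRelayForced {
		c.iceListener = true
	}
}

// AttachICE is mode-safe: a no-op when the listener is already attached,
// an error for relay-forced, and listener registration + offer otherwise.
func (c *conn) AttachICE() error {
	switch {
	case c.mode == ModeRelayForced:
		return errors.New("no ICE worker in relay-forced mode")
	case c.iceListener:
		return nil // already registered by Open: no-op
	default:
		c.iceListener = true
		c.offerSent = true // SendOffer: remote learns we're ICE-capable again
		return nil
	}
}

// onPeerActivity is the fixed lazy-mgr path: Open alone left p2p-dynamic
// with a nil listener, so AttachICE now always follows.
func onPeerActivity(c *conn) {
	c.Open()
	_ = c.AttachICE() // relay-forced error swallowed, as in the fix
}

func main() {
	c := &conn{mode: ModeP2PDynamic}
	onPeerActivity(c)
	fmt.Println(c.iceListener, c.offerSent) // listener attached, offer sent
}
```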

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review point 4 (2026-05-04), translated from German: "On real
activity, the backoff should either be deliberately respected or
deliberately soft-reset/briefly bridged, but it must not block P2P by
accident."

Reproduced on D95820 ↔ w11-test1 (Hetzendorf LAN ↔ Graz LAN, both
behind A1 NAT, 2026-05-05):

  04:53:35 [w11-test1] ICE failure #1 (transient, e.g. concurrent
                                       wake-up race or first STUN
                                       binding miss)
  04:54:53 [w11-test1] ICE failure #2
  04:55:15 [w11-test1] ICE retries exhausted (3/3),
                       switching to hourly retry

After that the connection stays Connection type=Relayed for an HOUR
even when the user pings hard from one side. AttachICE early-returns
on iceBackoff.IsSuspended(), so the previous activity-trigger fix
(commit 4faa3f1) couldn't help: the listener never gets re-attached
because the suspended backoff blocks both the re-attach AND the offer.

Verified post-fix mechanism: a daemon restart resets the backoff
in-process (init path), and ICE then converges to srflx/srflx
80.120.218.226:51821 ↔ 41.66.90.143:51820 in ~50s. So the NAT
traversal is physically possible -- it's the in-process state that
needs clearing on real user activity.

Fix: in Manager.onPeerActivity, call new Conn.ResetIceBackoff() right
before AttachICE. The lazy-mgr's local-activity listener fires on
real outbound traffic to the peer's NetBird IP -- that's the
strongest possible "user wants to talk to this peer" signal, much
more reliable than the existing backoff retry policy. Resetting the
counter is safe: if ICE keeps failing, the next 3 attempts will
re-enter hourly mode the same way as before, and signal-server load
stays bounded.

Signal-trigger path (engine.go:1845 → ConnMgr.ActivatePeer →
conn.AttachICE) is unchanged and still respects the backoff -- those
triggers are not user-initiated and shouldn't override the failure
policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review 2026-05-05 (refinement of the audit prompted by Michael
Uray): the lazy-mgr activity-listener only fires for peers in fully-
Idle state (post-Close, watcherActivity). For peers in Relay-state
(Open=true, ICE detached after iceTimeout, watcherInactivity), no
local activity ever triggered AttachICE. Result: a peer that fell
back to relay would stay relayed indefinitely even with continuous
user traffic, only escaping if remote sent an offer or the conn
fully cycled to Idle (5-8 min later).

Codex also flagged that the previous "ResetIceBackoff on every
activity" change was too aggressive for the relay path -- every
relay payload packet would have reset the failure backoff,
defeating the 3-tries-then-hourly retry policy that protects
metered links from useless retries when ICE is structurally broken.

Refined fix is two-layer:

1. Fast-path (this commit): ActivityRecorder gains an OnActivity
   callback fired at most once per saveFrequency=5s per peer when
   transport activity is observed. PeerRecord now carries the pubkey
   so the callback knows which peer woke. Engine wires the callback
   to Conn.AttachICEOnRelayActivity which gates on:
     - mode == p2p-dynamic
     - conn.opened
     - currentConnPriority == Relay
     - handshaker.iceListener == nil (ICE actually detached)
     - !iceBackoff.IsSuspended() (respect failure backoff -- no reset)
     - everConnected.Load() (had successful P2P at least once)
   When all gates pass, AttachICE is called -- it sends a fresh offer,
   the remote's OnNewOffer recreates its agent, and ICE re-converges
   in seconds. The relay tunnel stays up across the upgrade for
   continuity.

2. Activity-recorder note: only the receive path (ice_bind.go:232,
   ice_bind.go:333) records into LastActivities today. Outbound-only
   traffic is currently invisible to this fast-path; ping-style and
   bidirectional traffic does fire correctly because the echo reply
   counts as receive activity. A future commit can instrument the
   ICEBind.Send path if pure-outbound wake becomes important.

The ResetIceBackoff call stays in lazy-mgr.onPeerActivity because
that fires only after the peer has been fully Closed (via relay-
inactivity). At that point user-driven outbound traffic IS the
authoritative signal that the user wants the peer back, so the
hourly-retry suspension is correctly overridden. The relay-path
trigger added here does NOT reset, by design.
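The gate chain of the fast-path can be written down compactly. A sketch under stated assumptions: the struct bundles the listed conditions into plain booleans with illustrative names, whereas the real code reads them from live Conn state:

```go
package main

import "fmt"

// gateState bundles the conditions AttachICEOnRelayActivity checks before
// re-attaching ICE on relay-state activity; field names are illustrative.
type gateState struct {
	p2pDynamic    bool // mode == p2p-dynamic
	opened        bool // conn.opened
	priorityRelay bool // currentConnPriority == Relay
	iceDetached   bool // handshaker.iceListener == nil
	suspended     bool // iceBackoff.IsSuspended()
	everConnected bool // had successful P2P at least once
}

// shouldReattach returns true only when every gate passes; note the
// failure backoff is respected here (no reset), unlike the post-Idle
// wake path in lazy-mgr.onPeerActivity.
func shouldReattach(s gateState) bool {
	return s.p2pDynamic && s.opened && s.priorityRelay &&
		s.iceDetached && !s.suspended && s.everConnected
}

func main() {
	ok := gateState{p2pDynamic: true, opened: true, priorityRelay: true,
		iceDetached: true, suspended: false, everConnected: true}
	fmt.Println(shouldReattach(ok)) // fresh offer, ICE re-converges

	suspended := ok
	suspended.suspended = true
	fmt.Println(shouldReattach(suspended)) // failure backoff respected
}
```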

Tests: ActivityRecorder OnActivity callback tested via existing
activity_test.go path; AttachICEOnRelayActivity gating verified by
build + targeted test runs (./client/iface/bind, ./client/internal/peer,
./client/internal/lazyconn/inactivity all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported (Michael Uray, 2026-05-05) and Codex review point 5
follow-up: a peer in Relayed state (ICE detached after iceTimeout)
that hits a transient ICE failure during the next re-attempt enters
"3 retries exhausted -> hourly retry" mode and stays Relayed for a
full HOUR even when the user actively pings. The backoff was
designed to protect against truly broken paths, but the LTE-NAT
recovery window (15-30s) consistently exceeds the Guard's 3-fast-
retry timer (~12s), so legitimate paths get poisoned by transient
network jitter.

Reproduced live on ctb50-d (Lebring) <-> 572A2 (Graz) hop after
~3min idle: ICE detached, then transient re-attach failure, then
hourly retry. User pings every few seconds, sees Relayed at 35ms
instead of P2P at 11ms.

Codex review point 5 explicitly allowed an "optional, very
deliberate user-activity retry override with hard rate-limit". This
commit is that override:

  - iceBackoff.AllowActivityOverride() returns true at most ONCE
    per activityOverrideMinInterval (5min) per peer
  - AttachICEOnRelayActivity calls AllowActivityOverride before
    skipping on suspended backoff; on grant it Reset()s the backoff
    and proceeds to AttachICE
  - 5min lines up with the relayTimeout default -- after one
    override window the conn would have cycled to Idle anyway,
    freeing the backoff via the C->A wake path which already does
    ResetIceBackoff

Signal-server load impact: at most one extra offer/answer pair per
peer per 5min when the path is genuinely broken. For the typical
case (transient drop with quick recovery) we converge to P2P in the
override and stay there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review 2026-05-05 catch: the AllowActivityOverride doc-comment
promised the rate-limit window applies "since the last (success-,
network-change-, or override-driven) reset", but markSuccess() never
updated lastResetAt. Only Reset() did. Concrete bug: a peer that
briefly succeeded an ICE connect (markSuccess fires, failures = 0,
suspended = false, but lastResetAt is still from daemon-start
minutes-or-hours ago) and then failed 3x to enter hourly retry mode
would have AllowActivityOverride return true on the very next user
ping -- defeating the rate-limit's intent of bounding signal-server
storm to one offer per peer per 5min.

Fix is a single line in markSuccess: stamp lastResetAt = time.Now().
A successful ICE connect IS semantically a reset point -- the path
just demonstrably worked, the failure history before it is no longer
predictive. Both consumers of lastResetAt (markFailure's
networkChangeGracePeriod and AllowActivityOverride's rate limit)
benefit from the fresh stamp.

Tests:
  - TestIceBackoff_MarkSuccessStampsLastResetAt: direct regression
    pin -- markSuccess after a stale lastResetAt MUST update it to
    ~now.
  - TestIceBackoff_AllowActivityOverride: covers the four corners of
    the rate-limited override (not-suspended, recently-reset,
    >5min-since-reset+suspended, and the post-Reset/markSuccess case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ant test

Codex review follow-up 2026-05-05: three small refinements to make the
backoff semantic explicit and self-documenting.

1. AttachICEOnRelayActivity doc-comment was outdated -- it said
   "iceBackoff must NOT be suspended" but didn't mention the rate-
   limited override added in commit 322adc9 / 1762601. Updated
   to reference iceBackoff.AllowActivityOverride explicitly so future
   readers know the override exists and where its rate limit lives.

2. onICEFailed now classifies the failure into one of three buckets
   in the log line:
     - first-attempt    : never reached P2P, this is the bootstrap try
     - post-success-drop: was P2P, ICE dropped (consent-freshness
                          fail / NAT-mapping recovery race)
     - re-attach        : came out of detached state, retry failed
   Classification is best-effort (pion only tells us "Failed"; we
   infer from local everConnected + iceListener state). Helps future
   debugging when the user asks "why did this peer enter hourly
   retry mode" -- the log now distinguishes "first attempt couldn't
   pair" from "had P2P, lost it" without requiring full daemon-log
   correlation.

3. Added doc-comment to onICEFailed explicitly stating that backoff
   is exclusively triggered by ConnectionStateFailed and NEVER by
   inactivity-driven detach or relay-timeout close. Codex review
   point 1: "Backoff sollte nur kaputte oder aktuell nicht stabile
   P2P-Pfade bremsen". This pins that intent in code, not just docs.

4. New unit test TestIceBackoff_OnlyMarkFailureMutates documents the
   invariant: read-only calls (IsSuspended, Snapshot,
   AllowActivityOverride) must never mutate the backoff state.
   Prevents accidental side-effects in future refactors of those
   getters.

Existing tests still green:
  ok  github.com/netbirdio/netbird/client/internal/peer 0.011s
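The best-effort classification from point 2 reduces to a two-flag decision. A hedged sketch only: the real inference reads live everConnected and iceListener state, and pion reports nothing beyond "Failed", so this mapping is an assumption about the described buckets, not the exact in-tree logic:

```go
package main

import "fmt"

// classifyICEFailure infers a log-label bucket from local state; pion
// only tells us "Failed", so this is best-effort inference.
func classifyICEFailure(everConnected, listenerAttached bool) string {
	switch {
	case !everConnected:
		return "first-attempt" // never reached P2P: bootstrap try failed
	case listenerAttached:
		return "post-success-drop" // was P2P, ICE dropped live
	default:
		return "re-attach" // came out of detached state, retry failed
	}
}

func main() {
	fmt.Println(classifyICEFailure(false, false))
	fmt.Println(classifyICEFailure(true, true))
	fmt.Println(classifyICEFailure(true, false))
}
```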

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3.7i (netbirdio#5989) lifecycle test (w11-test1, 2026-05-05) showed
that after a full Idle -> Wake (C->A path), peers with non-LAN
ICE candidates (srflx, e.g. Internet/LTE) often stay on Relay
indefinitely while same-LAN peers (host candidates) recover
instantly. Root cause: the guard's per-cycle ICE retry budget
(maxICERetries=3, then iceRetryInterval=1h) is consumed in the
first ~5 s of pair-checking after wake (cold srflx mappings need
several attempts to prime). Subsequent real user activity then
finds the guard already in hourly mode and waits up to an hour
before retrying.

Fix per Codex review 2026-05-05: route real user/transport
activity into the guard so it explicitly resets the per-cycle
budget and reconnect ticker. The hourly throttle stays in place
for genuinely broken NAT paths but no longer blocks the
"user keeps pinging" recovery path.

Mechanics:
- Guard gets a buffered peerActivity channel + NotifyPeerActivity()
  (mirrors SetICEConnDisconnected). reconnectLoopWithRetry treats
  it like a connection-disconnected event: ticker.Stop, fresh
  newReconnectTicker, iceState.reset.
- Conn.AttachICEOnRelayActivity (B->A relay-state activity) now
  calls conn.guard.NotifyPeerActivity after the AttachICE
  succeeds, complementing the existing iceBackoff override.
- Conn.NotifyGuardActivity wraps the call so the lazy-mgr does
  not have to peek into Conn internals; lazy-mgr.onPeerActivity
  invokes it after the post-Idle Open + AttachICE.

Tests:
- TestICERetryState_ResetClearsHourlyAndBudget pins the regression
  shape: hourly mode + reset must restore the full 3-attempt
  budget and clear the hourly ticker.
- TestGuard_NotifyPeerActivity_{NonBlockingCoalesce, NilSafe}
  cover the call contract for high-rate ActivityRecorder bursts
  and partially-initialised conns.
- TestGuard_PeerActivityResetsHourlyMode exercises the channel
  delivery + handler-equivalent reset.
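The non-blocking, coalescing notify contract can be sketched in a few lines. Field and method names follow the commit text; the buffered-capacity-1 channel is an assumption consistent with the coalescing test name:

```go
package main

import "fmt"

// Guard sketch: a buffered channel of capacity 1 coalesces bursts from
// the ActivityRecorder into a single pending wake-up.
type Guard struct {
	peerActivity chan struct{}
}

// NotifyPeerActivity never blocks: if a wake-up is already queued, the
// new one coalesces into it. Nil-safe for partially-initialised conns.
func (g *Guard) NotifyPeerActivity() {
	if g == nil || g.peerActivity == nil {
		return
	}
	select {
	case g.peerActivity <- struct{}{}:
	default: // already pending: coalesce
	}
}

func main() {
	g := &Guard{peerActivity: make(chan struct{}, 1)}
	for i := 0; i < 100; i++ { // high-rate burst from the recorder
		g.NotifyPeerActivity()
	}
	fmt.Println(len(g.peerActivity)) // coalesced to one pending event

	var nilGuard *Guard
	nilGuard.NotifyPeerActivity() // nil-safe no-op
	fmt.Println("ok")
}
```

On the receive side, reconnectLoopWithRetry drains this channel like a disconnect event: stop the ticker, start a fresh one, reset the ICE retry budget.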
Generated proto code regenerated from .proto sources after upstream/main
rebase pulled in commits that altered the proto schema (PacketCapture
fields). Hand-merged conflicts on the .proto files were resolved during
the rebase; the .pb.go files are now consistent with them.

Phase 3.7i UI-push helpers Status.notifyPeerListChanged and
Status.notifyPeerStateChangeListeners were lost in mid-rebase reorder;
re-add them as thin wrappers around the existing notifier.peerListChanged
and snapshotRouterPeersLocked + dispatchRouterPeers paths, so the
UpdatePeerRemoteMeta and updatePeer*StateLocked callsites compile.

Test mock mockEndpointManager.ActivityRecorder() added: the new
EndpointManager interface method (added by the relay-state ICE
re-attach fast-path commit) was missing on the test mock, breaking
the listener_bind_test build. The mock returns nil — these tests
exercise endpoint dispatch only, never the recorder.
Three fixes from Codex review 2026-05-05:

1. client/cmd/testutil_test.go: nbgrpc.NewServer call updated to the
   current 13-arg signature (peer_connections.Store and
   *peer_connections.SnapshotRouter were added in Phase 3.7i and the
   test wasn't updated). The test now passes a fresh in-memory store
   and snapshot router so the package builds again.

2. .gitignore: management/netbird-mgmt added next to the existing
   management/management entry. The 51 MB ELF binary was a build
   artefact accidentally committed in the relay-state ICE re-attach
   commit; it is now removed from history via interactive rebase of
   that commit (see previous force-push) and ignored going forward.

3. ResetIceBackoff: stamp the new (cleared) snapshot into the
   StatusRecorder so `netbird status -d` and the daemon RPC stop
   advertising the stale "suspended" / "Failures=N" view after a
   reset. Previously only AttachICEOnRelayActivity's override path
   pushed the snapshot; the lazy-mgr-driven hard reset on Idle wake
   silently left the recorder out of date.
… + codespell fix

- proto regen across client/, shared/management/, shared/signal/, flow/
- protoc-version comment headers re-pinned to upstream-main values
- status_remote_meta_notify_test.go: codespell typo (atLeast -> minCount)
- client/internal/peer/status_debounce_test.go: explicit _ = on
  rec.UpdatePeer*State/Meta calls so errcheck (golangci-lint) doesn't
  flag the discarded error returns. The tests deliberately ignore
  the return because the recorder is pre-populated with the peer.

- management/server/http/handlers/accounts/accounts_handler.go:
  cast math.MaxUint32 to uint64 when interpolating it into the error
  string. On 32-bit goreleaser builds (linux/386 etc.) the untyped
  constant 4294967295 cannot be passed where the printf %d formatter
  expects an int (overflows). uint64 covers all archs cleanly.
- mgmt/internals/server/boot.go: collapse PeerConnRouter() lambda to
  Create(s, peer_connections.NewSnapshotRouter) (unlambda)
- mgmt/server/types/settings_ptr_equal_test.go: rename 'new' to
  'updated' to avoid shadowing the predeclared identifier
- client/internal/peer/conn_handover_order_test.go: if/else on rune
  -> tagged switch (QF1003 staticcheck)
- mgmt/server/http/handlers/accounts/validate_uint32_timeout_test.go:
  drop redundant explicit error type from var declaration (ST1023)
- client/internal/conn_mgr.go: add //nolint:unused on
  addPeersToLazyConnManager (deliberately retained for the eventual
  snapshot-import path; documented inline)
- client/internal/peer/status.go: replace 'notify := false; notify =
  ...' with 'var notify bool' to silence wastedassign
@MichaelUray force-pushed the pr/d-hardening-activity-trigger branch from 587ba07 to 6aac4ee on May 6, 2026 16:16

sonarqubecloud Bot commented May 6, 2026

Quality Gate failed

Failed conditions
26 New issues
26 New Code Smells (required ≤ 0)

See analysis details on SonarQube Cloud



Development

Successfully merging this pull request may close these issues.

Consolidate connection-mode flags; add p2p-dynamic and p2p-dynamic-lazy modes
