[relay] evict foreign client cache on disconnect #6015
When a foreign relay's TCP connection drops, the manager's onServerDisconnected handler only triggered reconnect logic for the home server; the disconnected foreign entry stayed in the relayClients cache. Subsequent OpenConn calls reused the closed client until the 60-second cleanup tick evicted it, breaking peer connectivity through that relay for up to a minute.
Evict the foreign entry from the cache on disconnect so the next OpenConn dials a fresh client.
Also:
- Make the reconnect backoff cap configurable via WithMaxBackoffInterval ManagerOption; the previous hard-coded 60s constant forced TestAutoReconnect to sleep ~61s. Test now polls Ready() and finishes in ~2s.
- Add NB_HOME_RELAY_SERVERS env var that overrides the relay URL list received from management, so a peer can be pinned to a specific home relay (used by the netbird-conn-lab Edge 4 reproducer).
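The eviction behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual manager code: the `manager` and `relayClient` types, the `dials` counter, and the example addresses are hypothetical stand-ins, and only the names `relayClients`, `OpenConn`, and `onServerDisconnected` come from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// relayClient is a hypothetical stand-in for a connection to one relay server.
type relayClient struct {
	addr   string
	closed bool
}

// manager is a simplified, hypothetical relay manager that caches one
// client per foreign relay address.
type manager struct {
	mu           sync.Mutex
	homeAddr     string
	relayClients map[string]*relayClient
	dials        int // counts fresh dials, for illustration only
}

func newManager(homeAddr string) *manager {
	return &manager{homeAddr: homeAddr, relayClients: make(map[string]*relayClient)}
}

// openConn returns the cached client for addr, dialing a new one on a cache miss.
func (m *manager) openConn(addr string) *relayClient {
	m.mu.Lock()
	defer m.mu.Unlock()
	if c, ok := m.relayClients[addr]; ok {
		return c
	}
	m.dials++
	c := &relayClient{addr: addr}
	m.relayClients[addr] = c
	return c
}

// onServerDisconnected evicts a disconnected foreign client so that the next
// openConn dials again instead of reusing the closed client.
func (m *manager) onServerDisconnected(addr string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if addr == m.homeAddr {
		// The home server is handled by reconnect logic (omitted here).
		return
	}
	if c, ok := m.relayClients[addr]; ok {
		c.closed = true
		delete(m.relayClients, addr)
	}
}

func main() {
	m := newManager("rels://home.example:443")
	first := m.openConn("rels://foreign.example:443")
	m.onServerDisconnected("rels://foreign.example:443")
	second := m.openConn("rels://foreign.example:443")
	fmt.Println(first.closed, second.closed, m.dials) // prints: true false 2
}
```

Without the `delete` in `onServerDisconnected`, the second `openConn` would return the closed client until a periodic cleanup removed it, which matches the failure mode the description reports.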
Walkthrough
Adds an environment-driven relay server URL override (
Sequence Diagram
sequenceDiagram
participant Connect as Connect code
participant Peer as peer.env
participant Env as Environment
participant Manager as RelayManager/Manager
participant Guard as Guard
Connect->>Peer: OverrideRelayURLs()
Peer->>Env: read NB_HOME_RELAY_SERVERS
alt override present
Env-->>Peer: return URLs, true
Peer-->>Connect: overridden URL list
Connect->>Connect: log override key + URLs
Connect->>Manager: NewManager(..., overriddenURLs, opts...)
else no override
Env-->>Peer: nil, false
Peer-->>Connect: nil
Connect->>Manager: NewManager(..., originalURLs, opts...)
end
Manager->>Guard: NewGuard(serverPicker, maxBackoffInterval)
Guard->>Guard: use maxBackoffInterval for exponentTicker/backoff
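The last two steps of the diagram, passing a configurable cap from the manager to the guard's backoff, might look roughly like the sketch below. It assumes a conventional functional-options pattern; only `WithMaxBackoffInterval`, `ManagerOption`, and the previous 60s cap are taken from the PR. The `manager` struct, `nextBackoff`, and the choice to ignore non-positive values are hypothetical, and the real guard may rely on a backoff library rather than hand-rolled doubling.

```go
package main

import (
	"fmt"
	"time"
)

// defaultMaxBackoff mirrors the previously hard-coded 60s cap from the PR description.
const defaultMaxBackoff = 60 * time.Second

// manager is a hypothetical, trimmed-down stand-in for the relay manager.
type manager struct {
	maxBackoffInterval time.Duration
}

// ManagerOption configures a manager at construction time.
type ManagerOption func(*manager)

// WithMaxBackoffInterval overrides the reconnect backoff cap.
// Ignoring non-positive values is an assumption of this sketch.
func WithMaxBackoffInterval(d time.Duration) ManagerOption {
	return func(m *manager) {
		if d > 0 {
			m.maxBackoffInterval = d
		}
	}
}

// newManager applies the options on top of the default cap.
func newManager(opts ...ManagerOption) *manager {
	m := &manager{maxBackoffInterval: defaultMaxBackoff}
	for _, opt := range opts {
		opt(m)
	}
	return m
}

// nextBackoff doubles the current interval and clamps it to max.
func nextBackoff(current, max time.Duration) time.Duration {
	next := current * 2
	if next > max {
		return max
	}
	return next
}

func main() {
	m := newManager(WithMaxBackoffInterval(2 * time.Second))
	d := 500 * time.Millisecond
	for i := 0; i < 4; i++ {
		fmt.Println(d) // prints 500ms, 1s, 2s, 2s on successive iterations
		d = nextBackoff(d, m.maxBackoffInterval)
	}
}
```

A test can pass a short cap through the option so that reconnect attempts stay frequent, then poll for readiness instead of sleeping past a fixed 60s interval.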
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/peer/env.go`:
- Around line 27-40: The OverrideRelayURLs function currently returns true
whenever the environment variable EnvKeyNBHomeRelayServers is set, even if
parsing yields an empty slice (only separators/whitespace); update
OverrideRelayURLs to treat an empty parsed result as inactive by returning (nil,
false) when the constructed urls slice has length 0 — i.e., after trimming and
appending in OverrideRelayURLs, check len(urls) and only return (urls, true)
when len(urls) > 0, otherwise return nil, false.
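The suggested fix can be sketched as below. This is an illustration rather than the code in client/internal/peer/env.go: the comma separator and the `parseRelayURLs` helper are assumptions, while `OverrideRelayURLs`, `EnvKeyNBHomeRelayServers`, and the `NB_HOME_RELAY_SERVERS` variable name come from the review.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// EnvKeyNBHomeRelayServers names the env var that overrides the relay URL list.
const EnvKeyNBHomeRelayServers = "NB_HOME_RELAY_SERVERS"

// parseRelayURLs is a hypothetical helper. It splits on commas (an assumed
// separator), trims whitespace, and drops empty entries. A result with no
// URLs is reported as "no override".
func parseRelayURLs(raw string) ([]string, bool) {
	var urls []string
	for _, part := range strings.Split(raw, ",") {
		if u := strings.TrimSpace(part); u != "" {
			urls = append(urls, u)
		}
	}
	if len(urls) == 0 {
		return nil, false
	}
	return urls, true
}

// OverrideRelayURLs reports an active override only when the env var is set
// and yields at least one URL.
func OverrideRelayURLs() ([]string, bool) {
	raw, ok := os.LookupEnv(EnvKeyNBHomeRelayServers)
	if !ok {
		return nil, false
	}
	return parseRelayURLs(raw)
}

func main() {
	// Only separators and whitespace: treated the same as an unset variable.
	_ = os.Setenv(EnvKeyNBHomeRelayServers, " , ,")
	urls, ok := OverrideRelayURLs()
	fmt.Println(len(urls), ok) // prints: 0 false

	_ = os.Setenv(EnvKeyNBHomeRelayServers, "rels://relay.example:443")
	urls, ok = OverrideRelayURLs()
	fmt.Println(urls, ok) // prints: [rels://relay.example:443] true
}
```

Callers can then replace the management-provided list only when the second return value is true, so a malformed variable never leaves the peer without relays.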
📒 Files selected for processing (6)
- client/internal/connect.go
- client/internal/engine.go
- client/internal/peer/env.go
- shared/relay/client/guard.go
- shared/relay/client/manager.go
- shared/relay/client/manager_test.go
Returning (urls=[], ok=true) when the env var contained only separators or whitespace caused callers to wipe the mgmt-provided relay list, leaving the peer with no relays. Treat a parsed-empty result the same as an unset env.