Skip to content

[relay] evict foreign client cache on disconnect#6015

Merged
pappz merged 2 commits intomainfrom
fix/relay-foreign-disconnect-evict
Apr 28, 2026
Merged

[relay] evict foreign client cache on disconnect#6015
pappz merged 2 commits intomainfrom
fix/relay-foreign-disconnect-evict

Conversation

@pappz
Copy link
Copy Markdown
Collaborator

@pappz pappz commented Apr 28, 2026

When a foreign relay's TCP connection drops, the manager's onServerDisconnected handler only triggered reconnect logic for the home server; the disconnected foreign entry stayed in the relayClients cache. Subsequent OpenConn calls reused the closed client until the 60-second cleanup tick evicted it, breaking peer connectivity through that relay for up to a minute.

Evict the foreign entry from the cache on disconnect so the next OpenConn dials a fresh client.

Also:

  • Make the reconnect backoff cap configurable via WithMaxBackoffInterval ManagerOption; the previous hard-coded 60s constant forced TestAutoReconnect to sleep ~61s. Test now polls Ready() and finishes in ~2s.
  • Add NB_HOME_RELAY_SERVERS env var that overrides the relay URL list received from management, so a peer can be pinned to a specific home relay (used by the netbird-conn-lab Edge 4 reproducer).

Describe your changes

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features

    • Environment variable support to override relay server URLs at runtime (with informational logging when an override is used).
  • Improvements

    • Configurable reconnect backoff interval per relay manager instance.
    • Foreign relay disconnects now evict stale cached clients to ensure fresh reconnections.
  • Tests

    • Improved test reliability by using readiness-based connection polling.

When a foreign relay's TCP connection drops, the manager's
onServerDisconnected handler only triggered reconnect logic for the
home server; the disconnected foreign entry stayed in the relayClients
cache. Subsequent OpenConn calls reused the closed client until the
60-second cleanup tick evicted it, breaking peer connectivity through
that relay for up to a minute.

Evict the foreign entry from the cache on disconnect so the next
OpenConn dials a fresh client.

Also:
- Make the reconnect backoff cap configurable via WithMaxBackoffInterval
  ManagerOption; the previous hard-coded 60s constant forced
  TestAutoReconnect to sleep ~61s. Test now polls Ready() and finishes
  in ~2s.
- Add NB_HOME_RELAY_SERVERS env var that overrides the relay URL list
  received from management, so a peer can be pinned to a specific home
  relay (used by the netbird-conn-lab Edge 4 reproducer).
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2fdc8187-2ba8-4b43-938f-debd9eb70536

📥 Commits

Reviewing files that changed from the base of the PR and between 3dcf559 and 10a4ed1.

📒 Files selected for processing (1)
  • client/internal/peer/env.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/peer/env.go

📝 Walkthrough

Walkthrough

Adds an environment-driven relay server URL override (NB_HOME_RELAY_SERVERS) that, when present, replaces relay URLs received from management before manager construction. Refactors relay reconnection backoff to be configurable per Guard/Manager via options instead of a fixed constant.

Changes

Cohort / File(s) Summary
Relay URL Override
client/internal/peer/env.go, client/internal/connect.go, client/internal/engine.go
Adds EnvKeyNBHomeRelayServers and OverrideRelayURLs(); connect/engine call the helper and, when active, log the env key and use the overridden URL list instead of the management-provided URLs before creating relay managers.
Reconnection Backoff Configuration
shared/relay/client/guard.go, shared/relay/client/manager.go
Replaces fixed reconnect timeout with per-instance maxBackoffInterval; NewGuard now accepts the interval and exponentTicker became a Guard method. NewManager accepts ManagerOption and applies WithMaxBackoffInterval.
Manager Tests / Readiness Polling
shared/relay/client/manager_test.go
TestAutoReconnect uses WithMaxBackoffInterval(2*time.Second) and replaces fixed sleep with a waitForReady polling helper bounded to 15s.

Sequence Diagram

sequenceDiagram
    participant Connect as Connect code
    participant Peer as peer.env
    participant Env as Environment
    participant Manager as RelayManager/Manager
    participant Guard as Guard

    Connect->>Peer: OverrideRelayURLs()
    Peer->>Env: read NB_HOME_RELAY_SERVERS
    alt override present
        Env-->>Peer: return URLs, true
        Peer-->>Connect: overridden URL list
        Connect->>Connect: log override key + URLs
        Connect->>Manager: NewManager(..., overriddenURLs, opts...)
    else no override
        Env-->>Peer: nil, false
        Peer-->>Connect: nil
        Connect->>Manager: NewManager(..., originalURLs, opts...)
    end
    Manager->>Guard: NewGuard(serverPicker, maxBackoffInterval)
    Guard->>Guard: use maxBackoffInterval for exponentTicker/backoff
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • mlsmaycon
  • lixmal

Poem

🐰
I sniff the vars where servers hide,
NB_HOME whispers URLs worldwide,
Guards learn to wait in measured beats,
Reconnects dance on softer streets,
Hooray — new hops for tiny feet! 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes a key bug fix: evicting the foreign relay client cache on disconnect, which directly addresses the main performance issue in this PR.
Description check ✅ Passed The description provides context for the main bug fix, lists additional changes, marks the bug fix checkbox, and confirms documentation is not needed; however, it lacks the 'Describe your changes' section details despite selecting that template structure.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/relay-foreign-disconnect-evict

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/env.go`:
- Around line 27-40: The OverrideRelayURLs function currently returns true
whenever the environment variable EnvKeyNBHomeRelayServers is set, even if
parsing yields an empty slice (only separators/whitespace); update
OverrideRelayURLs to treat an empty parsed result as inactive by returning (nil,
false) when the constructed urls slice has length 0 — i.e., after trimming and
appending in OverrideRelayURLs, check len(urls) and only return (urls, true)
when len(urls) > 0, otherwise return nil, false.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 23a96584-cf35-4535-b720-40aedf785102

📥 Commits

Reviewing files that changed from the base of the PR and between 6f0eff3 and 3dcf559.

📒 Files selected for processing (6)
  • client/internal/connect.go
  • client/internal/engine.go
  • client/internal/peer/env.go
  • shared/relay/client/guard.go
  • shared/relay/client/manager.go
  • shared/relay/client/manager_test.go

Comment thread client/internal/peer/env.go
Returning (urls=[], ok=true) when the env var contained only separators or
whitespace caused callers to wipe the mgmt-provided relay list, leaving the
peer with no relays. Treat a parsed-empty result the same as an unset env.
@sonarqubecloud
Copy link
Copy Markdown

@pappz pappz merged commit 8fc4265 into main Apr 28, 2026
42 of 43 checks passed
@pappz pappz deleted the fix/relay-foreign-disconnect-evict branch April 28, 2026 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants