[management] fix skip of ephemeral peers on deletion #5206
📝 Walkthrough
Logging verbosity and wording were adjusted, and peer-deletion error handling was changed to skip problematic peers: NotFound and transactional errors are now logged at trace and error level, respectively, and the deletion process continues with the remaining peers instead of returning early.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Pull request overview
This PR fixes an issue where ephemeral peer deletion would fail for the entire batch if any single peer encountered an error. The fix allows the deletion process to continue processing remaining peers even when individual deletions fail.
Changes:
- Modified `DeletePeers` to gracefully handle NotFound errors by skipping already-deleted peers
- Changed error handling from aborting the loop to logging errors and continuing with the remaining peers
- Enhanced trace logging to provide better visibility into peer deletion and skip decisions
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| management/internals/modules/peers/manager.go | Added graceful handling for NotFound errors, enhanced trace logging for skipped peers, and changed from returning errors to logging them and continuing the loop |
| management/internals/modules/peers/ephemeral/manager/ephemeral.go | Updated logging levels and improved error message wording for batch peer deletion |
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@management/internals/modules/peers/manager.go`, around lines 161-164: the `DeletePeers` function currently declares an error return but always logs errors and continues (best-effort), which is misleading. Change the `DeletePeers` signature to return no error and treat it explicitly as a best-effort/ephemeral cleanup routine; update its single caller (the ephemeral peer expiration cleanup) to stop expecting an error; and keep the existing per-peer log-on-error-and-continue behavior (`log.WithContext(ctx).Errorf(...)` and `continue`) so the contract matches the implementation. Alternatively, if you prefer to preserve an error return, replace it with an aggregated error (e.g., collect errors into a slice or use a multierror) and return that summary instead. Pick one approach and apply it consistently to `DeletePeers` and its caller.
🧹 Nitpick comments (1)
management/internals/modules/peers/manager.go (1)
118-124: Minor: `time.Now()` called twice may log a different threshold than evaluated.

The threshold is computed with `time.Now()` both in the condition (line 118) and in the log message (lines 122-123). The slight time difference between these calls means the logged threshold won't exactly match what was evaluated, which could cause confusion during debugging.

🔧 Suggested fix: compute threshold once

```diff
+	threshold := time.Now().Add(-(ephemeral.EphemeralLifeTime - 10*time.Second))
-	if checkConnected && (peer.Status.Connected || peer.Status.LastSeen.After(time.Now().Add(-(ephemeral.EphemeralLifeTime - 10*time.Second)))) {
+	if checkConnected && (peer.Status.Connected || peer.Status.LastSeen.After(threshold)) {
 		log.WithContext(ctx).Tracef("DeletePeers: peer %s skipped (connected=%t, lastSeen=%s, threshold=%s, ephemeral=%t)",
 			peerID, peer.Status.Connected, peer.Status.LastSeen.Format(time.RFC3339),
-			time.Now().Add(-(ephemeral.EphemeralLifeTime - 10*time.Second)).Format(time.RFC3339),
+			threshold.Format(time.RFC3339),
 			peer.Ephemeral)
 		return nil
 	}
```
```diff
 	if err != nil {
-		return err
+		log.WithContext(ctx).Errorf("DeletePeers: failed to delete peer %s: %v", peerID, err)
+		continue
 	}
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's examine the file structure and find the DeletePeers function
cd management/internals/modules/peers
wc -l manager.go
```
Repository: netbirdio/netbird
Length of output: 75

🏁 Script executed:
```shell
# Read the DeletePeers function - lines around 161-179 to see full context
sed -n '140,185p' management/internals/modules/peers/manager.go
```
Repository: netbirdio/netbird
Length of output: 1304

🏁 Script executed:
```shell
# Search for all callers of DeletePeers function
rg -n 'DeletePeers\s*\(' --type=go -g '!*_mock.go' -g '!*_test.go'
```
Repository: netbirdio/netbird
Length of output: 568

🏁 Script executed:
```shell
# Read the caller context in ephemeral manager
sed -n '180,210p' management/internals/modules/peers/ephemeral/manager/ephemeral.go
```
Repository: netbirdio/netbird
Length of output: 891

🏁 Script executed:
```shell
# Search for tests of DeletePeers to understand intended behavior
rg -n 'DeletePeers' --type=go -A 10 -B 2 | grep -E '(test|Test|_test\.go)' -A 10 -B 2
```
Repository: netbirdio/netbird
Length of output: 6933

🏁 Script executed:
```shell
# Look for all test files that test DeletePeers
find . -name '*_test.go' -type f -exec grep -l 'DeletePeers' {} \;
```
Repository: netbirdio/netbird
Length of output: 133

🏁 Script executed:
```shell
# Read the DeletePeers function signature and any comments
sed -n '99,110p' management/internals/modules/peers/manager.go
```
Repository: netbirdio/netbird
Length of output: 630

🏁 Script executed:
```shell
# Also check if there's an interface definition that documents expected behavior
rg -n 'interface Manager' -A 20 management/internals/modules/peers/manager.go
```
Repository: netbirdio/netbird
Length of output: 43

🏁 Script executed:
```shell
# Check the interface definition for DeletePeers
rg -n 'DeletePeers' management/internals/modules/peers/manager.go | head -5
```
Repository: netbirdio/netbird
Length of output: 619

🏁 Script executed:
```shell
# Look for the interface definition in the same file
sed -n '1,50p' management/internals/modules/peers/manager.go
```
Repository: netbirdio/netbird
Length of output: 2218
The log-and-continue behavior is intentional for ephemeral cleanup, but the misleading error contract should be clarified.
The function's design (logging errors and continuing to the next peer without failing) is appropriate for ephemeral peer cleanup, since temporary peers will expire anyway. However, the function signature declares an error return type but always returns nil, which is misleading. The single caller (ephemeral peer expiration cleanup) handles this correctly, but the design could confuse future developers or cause bugs if the function is reused in other contexts.
Consider documenting this as best-effort cleanup, or distinguish between ephemeral cleanup (which can be best-effort) and other deletion scenarios that may require stricter error handling. Alternatively, return an error summary or retry count to make the contract explicit.



Describe your changes
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
bug fix
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
Bug Fixes
Chores