[client] add reset for management backoff #4935
Conversation
Walkthrough: Adds a shared backoff instance in Sync(), passes it into handleStream, and resets it after a successful connection so long-lived streams do not exhaust MaxElapsedTime.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
shared/management/client/grpc.go (1)
159-159: LGTM: Backoff reset after successful connection.

The reset correctly addresses the bug where long-lived connections (exceeding the MaxElapsedTime of 3 months) would cause the backoff to give up immediately on the next disconnection. By resetting after each successful connection, the elapsed-time counter restarts, ensuring the retry mechanism continues to work properly.
The reset is placed at the right point: after the stream is successfully established but before blocking on event reception.
Consider adding a brief comment explaining the reset:
log.Infof("connected to the Management Service stream") + // Reset backoff after successful connection to prevent MaxElapsedTime accumulation + // across connection cycles. Without this, long-lived connections would cause + // immediate retry failure after disconnection. backOff.Reset()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
shared/management/client/grpc.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
util/retry.go (1)
Retry(10-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Client / Unit (386)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Client / Unit
- GitHub Check: release
- GitHub Check: release_ui_darwin
- GitHub Check: Client / Unit
- GitHub Check: JS / Lint
- GitHub Check: Client / Unit
- GitHub Check: Linux
- GitHub Check: Darwin
- GitHub Check: Windows
- GitHub Check: Android / Build
- GitHub Check: iOS / Build
🔇 Additional comments (4)
shared/management/client/grpc.go (4)
114-115: LGTM: Backoff variable initialization.

The introduction of a local backOff variable that will be reused across retry attempts is correct and necessary for the reset mechanism to work properly.

133-133: LGTM: Backoff passed to handleStream.

Passing the backOff instance to handleStream allows it to reset the backoff state after successful connection establishment.

136-136: LGTM: Consistent backoff usage in retry loop.

Using the same backOff instance throughout the retry loop ensures proper backoff state management and enables the reset mechanism to function correctly.

145-145: LGTM: Signature updated to accept backoff parameter.

The signature change is necessary to support the backoff reset functionality.
1. backOff.Reset is already called by backoff.Retry, so it is not required during initialization.
2. Remove unnecessary empty lines.
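For context, a tiny sketch of the first point, assuming cenkalti/backoff/v4 semantics (Retry resets the policy before the first attempt):

```go
package main

import (
	"errors"
	"fmt"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	bo := backoff.NewExponentialBackOff()
	// Calling bo.Reset() here would be redundant: backoff.Retry resets
	// the policy itself before running the first attempt.
	err := backoff.Retry(func() error {
		// A permanent error stops retrying immediately, keeping the demo short.
		return backoff.Permanent(errors.New("done"))
	}, bo)
	fmt.Println(err) // prints "done"
}
```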
Actionable comments posted: 2
🧹 Nitpick comments (1)
shared/management/client/grpc.go (1)
162-163: Consider moving the backoff reset to after the first event is successfully received.

The current placement resets the backoff after receiveEvents returns, regardless of whether any events were successfully received. If the stream fails immediately (before receiving events), the backoff is still reset, which may be too aggressive. Consider moving the reset inside receiveEvents after the first event is successfully processed. This ensures the backoff is only reset for "truly" successful connections.

🔎 Proposed refactoring

In handleStream, remove the reset:

```diff
 // blocking until error
 err = c.receiveEvents(stream, serverPubKey, msgHandler)
-backOff.Reset()
 if err != nil {
```

In receiveEvents, add a flag and reset after the first event:

```diff
-func (c *GrpcClient) receiveEvents(stream proto.ManagementService_SyncClient, serverPubKey wgtypes.Key, msgHandler func(msg *proto.SyncResponse) error) error {
+func (c *GrpcClient) receiveEvents(stream proto.ManagementService_SyncClient, serverPubKey wgtypes.Key, msgHandler func(msg *proto.SyncResponse) error, backOff backoff.BackOff) error {
+	firstEventReceived := false
 	for {
 		update, err := stream.Recv()
 		if err == io.EOF {
 			log.Debugf("Management stream has been closed by server: %s", err)
 			return err
 		}
 		if err != nil {
 			log.Debugf("disconnected from Management Service sync stream: %v", err)
 			return err
 		}

 		log.Debugf("got an update message from Management Service")

 		decryptedResp := &proto.SyncResponse{}
 		err = encryption.DecryptMessage(serverPubKey, c.key, update.Body, decryptedResp)
 		if err != nil {
 			log.Errorf("failed decrypting update message from Management Service: %s", err)
 			return err
 		}
+
+		if !firstEventReceived {
+			backOff.Reset()
+			firstEventReceived = true
+		}

 		if err := msgHandler(decryptedResp); err != nil {
 			log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
 		}
 	}
 }
```

Then update the call in handleStream:

```diff
-err = c.receiveEvents(stream, serverPubKey, msgHandler)
+err = c.receiveEvents(stream, serverPubKey, msgHandler, backOff)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
shared/management/client/grpc.go
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
util/retry.go (1)
Retry(10-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Client / Unit (386)
- GitHub Check: Signal / Unit (386)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Windows
- GitHub Check: Linux
- GitHub Check: Darwin
- GitHub Check: Android / Build
- GitHub Check: iOS / Build
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
- GitHub Check: release_ui
- GitHub Check: release
- GitHub Check: release_ui_darwin
- GitHub Check: JS / Lint
- GitHub Check: Client / Unit
🔇 Additional comments (2)
shared/management/client/grpc.go (2)
114-115: Backoff instance correctly shared across retries.

Creating the backoff once in Sync() and reusing it is the correct approach for addressing the MaxElapsedTime exhaustion issue described in the PR objectives.

144-145: LGTM!

The signature change correctly adds the backoff parameter to support the shared backoff pattern. Since this is a private method, there's no impact on external callers.
```diff
 	return c.handleStream(ctx, *serverPubKey, sysInfo, msgHandler, backOff)
 }

-	err := backoff.Retry(operation, defaultBackoff(ctx))
+	err := backoff.Retry(operation, backOff)
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -name "grpc.go" -path "*/management/client/*" | head -1
```

Repository: netbirdio/netbird
Length of output: 95
🏁 Script executed:
grep -n "backoff" shared/management/client/grpc.go | head -30Repository: netbirdio/netbird
Length of output: 933
🏁 Script executed:
```shell
find . -type f -name "grpc.go" | grep -E "(management|client)" | head -5
```

Repository: netbirdio/netbird
Length of output: 126
🏁 Script executed:
```shell
sed -n '130,170p' shared/management/client/grpc.go
```

Repository: netbirdio/netbird
Length of output: 1346
Do not reset backoff from within the operation function.
Calling backOff.Reset() inside handleStream (line 163) while backoff.Retry is managing the same instance corrupts retry state. The ExponentialBackOff implementation is not thread-safe. Use a separate backoff instance for the stream handler, or avoid resetting within the operation.
🤖 Prompt for AI Agents
In shared/management/client/grpc.go around lines 133 to 136, the operation
passed to backoff.Retry calls handleStream which currently resets the shared
backOff instance (see line ~163), corrupting ExponentialBackOff state; change
the code so handleStream uses its own backoff instance (copy/new
ExponentialBackOff) or remove any Reset() calls inside handleStream and only let
the backoff.Retry caller manage the shared backOff lifecycle; ensure any
per-stream retry logic constructs and manages its own backoff object so the
backoff.Retry call operates on an instance that is not mutated concurrently.
In this specific case, handleStream is only ever called from Sync, and the entire call chain (backoff.Retry → operation → handleStream) runs within a single goroutine. There is no concurrent access to the backOff instance, so calling Reset() here does not introduce a race condition.
This pattern is also already used in the signal client with the same retry/backoff approach:
netbird/shared/signal/client/grpc.go
Lines 165 to 178 in b7e98ac
I think the backoff.Retry API is primarily designed for short-lived, one-shot operations (e.g. HTTP requests), where the lifecycle of the backoff instance clearly maps to a single attempt sequence. In our case, however, we are using it to manage a long-lived streaming connection, where a single "attempt" may last minutes or hours, and a successful stream session is logically equivalent to a completed retry cycle. If we want to completely avoid semantic ambiguity or incorrect assumptions about backoff ownership, the clean solution would be a structural change, but that would require a broader refactor rather than a localized fix.
Would it be acceptable to keep the current PR as-is and revisit the overall retry/backoff structure in the NetBird client in a future PR?
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the signal client implementation to verify the user's claim
cat shared/signal/client/grpc.go | sed -n '160,185p'
Length of output: 935
🏁 Script executed:
#!/bin/bash
# Also check the broader context of how backoff is used in signal client
rg -n "backoff.Reset\(\)" shared/signal/client/grpc.go -A 3 -B 3
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Check if there are any other places in the codebase using similar pattern
rg -n "backoff.Retry.*operation.*handleStream" --type go
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Look at the full Sync method in signal client to understand the pattern
cat shared/signal/client/grpc.go | sed -n '90,200p'
Length of output: 3783
@gamerslouis, you're absolutely right. I verified the signal client implementation and it uses the exact same pattern, with backOff.Reset() at lines 175-177, including a helpful comment explaining the rationale:

```go
// we need this reset because after a successful connection and a consequent error, backoff lib doesn't
// reset times and next try will start with a long delay
backOff.Reset()
```

Your points are well-taken:
- No concurrency issue in this single-goroutine flow
- The pattern is already established in the codebase
- The semantic concern I raised doesn't apply when intentionally using backoff.Retry to manage long-lived streaming connections
Suggestion: Consider adding a similar explanatory comment at line 163 in handleStream to document why the reset is needed, making the intent explicit for future maintainers:
```go
// Reset backoff after stream ends so next retry doesn't start with accumulated delay
backOff.Reset()
```

The PR is acceptable as-is. If you'd like to track the broader structural refactor you mentioned (potentially moving away from backoff.Retry for streaming connections), I can help create a GitHub issue to document the discussion and proposed improvements for future work. Would that be helpful?
🧬 Code Graph Analysis Results
[shared/management/client/grpc.go:37-60]
Summary: NewClient uses a short-lived retry loop to establish a gRPC connection to the Management Service.
- Function: NewClient(ctx context.Context, addr string, ourPrivateKey wgtypes.Key, tlsEnabled bool) (*GrpcClient, error)
- Behavior:
- Defines operation as a function that attempts to create a connection via nbgrpc.CreateConnection with the provided addr, TLS flag, and a component marker (Management component).
- If creation fails, operation returns an error; otherwise, nil.
- Executes the operation inside backoff.Retry(operation, nbgrpc.Backoff(ctx)).
- On unrecoverable failure, logs and returns the error.
- Error handling:
- Retry continues according to the backoff strategy nbgrpc.Backoff(ctx) until success or a permanent error surfaces.
Rationale: This shows how a single connection establishment workflow uses the project's backoff/backoff.Retry mechanism to manage transient failures during setup, aligning with the user comment about Retry semantics and backoff ownership.
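As a rough sketch of that setup flow: dial here is a hypothetical stand-in for nbgrpc.CreateConnection, and a plain context-aware exponential policy stands in for nbgrpc.Backoff(ctx); neither matches NetBird's exact signatures:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// dial is a hypothetical stand-in for nbgrpc.CreateConnection; it is not
// NetBird's actual API.
func dial(ctx context.Context, addr string, tlsEnabled bool) error {
	return errors.New("connection refused") // pretend the service is down
}

func newClient(ctx context.Context, addr string, tlsEnabled bool) error {
	operation := func() error {
		// Each attempt tries to establish the connection; any error is
		// treated as transient and retried per the backoff policy.
		return dial(ctx, addr, tlsEnabled)
	}
	// Context-aware policy standing in for nbgrpc.Backoff(ctx).
	policy := backoff.WithContext(backoff.NewExponentialBackOff(), ctx)
	if err := backoff.Retry(operation, policy); err != nil {
		log.Printf("failed creating connection to Management Service: %v", err)
		return err
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_ = newClient(ctx, "mgmt.example.com:443", true)
}
```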
[shared/management/client/grpc.go:65-85]
Summary: defaultBackoff constructs a backoff strategy used for general retry/backoff logic.
- Function: defaultBackoff(ctx context.Context) backoff.BackOff
- Behavior:
- Creates a backoff.ExponentialBackOff with:
- InitialInterval: 800ms
- RandomizationFactor: 1
- Multiplier: 1.7
- MaxInterval: 10s
- MaxElapsedTime: 3 months
- Stop: backoff.Stop
- Clock: backoff.SystemClock
- Wraps the backoff in backoff.WithContext(..., ctx) to tie cancellation/timeouts to the provided context.
- Return: a backoff.BackOff instance ready to drive Retry loops.
Rationale: This captures how the code defines a long-lived, context-aware backoff strategy used by Retry calls elsewhere (e.g., in Sync and Login flows). It’s central to understanding how backoff ownership and lifecycle are intended to map to the operation lifecycle.
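From the values listed above, the strategy can be sketched like so (assuming github.com/cenkalti/backoff/v4; the field values come from the summary, not from the source file itself):

```go
package client

import (
	"context"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// defaultBackoff as described above: exponential growth capped at 10s,
// giving up entirely after roughly 3 months, tied to the caller's context.
func defaultBackoff(ctx context.Context) backoff.BackOff {
	b := &backoff.ExponentialBackOff{
		InitialInterval:     800 * time.Millisecond,
		RandomizationFactor: 1,
		Multiplier:          1.7,
		MaxInterval:         10 * time.Second,
		MaxElapsedTime:      3 * 30 * 24 * time.Hour, // ~3 months
		Stop:                backoff.Stop,
		Clock:               backoff.SystemClock,
	}
	// Note: backoff.Retry calls Reset() before the first attempt, so the
	// zero start time of this literal is not a problem in practice.
	return backoff.WithContext(b, ctx)
}
```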
[shared/management/client/grpc.go:102-140]
Summary: handleStream uses a streaming RPC with a backoff object and explicitly resets the backoff after a stream attempt completes.
- Function: (c *GrpcClient) handleStream(ctx context.Context, serverPubKey wgtypes.Key, sysInfo *system.Info, msgHandler func(*proto.SyncResponse) error, backOff backoff.BackOff) error
- Behavior:
- Establishes a streaming Sync connection via connectToStream.
- On success, logs and notifies connection.
- Enters a blocking loop via receiveEvents to process incoming messages.
- After receiveEvents returns (i.e., the stream ends or errors), calls backOff.Reset().
- If receiveEvents returns an error, it maps certain gRPC status codes to retryable vs. permanent errors:
- PermissionDenied: return backoff.Permanent(err)
- Canceled: interpret as shutdown and return nil
- Others: log warning and return err to trigger a retry at a higher level
- Backoff interaction:
- The function receives backOff as a parameter and resets it after each stream processing attempt, aligning with the comment about reusing a single backoff instance for a long-lived streaming connection within a single goroutine.
Rationale: This directly relates to the user’s point about Reset() on backOff within a single-goroutine flow, the absence of races for this usage, and the broader discussion about how backoff/backoff.Retry is intended to be used with streaming connections. It also documents the exact control flow around stream handling, error categorization, and retry triggering.
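A compilable sketch of that error classification, assuming google.golang.org/grpc/status and cenkalti/backoff/v4; classifyStreamErr is a hypothetical helper name, not the actual method:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cenkalti/backoff/v4"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classifyStreamErr mirrors the mapping described above: PermissionDenied
// stops retrying, Canceled is treated as a clean shutdown, and anything
// else is returned as-is so the outer backoff.Retry loop tries again.
func classifyStreamErr(err error) error {
	switch status.Code(err) {
	case codes.PermissionDenied:
		return backoff.Permanent(err) // non-retryable
	case codes.Canceled:
		return nil // client is shutting down, not a failure
	default:
		log.Printf("disconnected from Management Service: %v", err)
		return err // retryable: Retry will back off and reconnect
	}
}

func main() {
	err := classifyStreamErr(status.Error(codes.PermissionDenied, "peer blocked"))
	fmt.Println(err) // permanent error: backoff.Retry stops immediately
}
```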
[shared/management/client/grpc.go: additional note]
Context: The user comment references a pattern used in the signal client with the same retry/backoff approach (and a link to a specific lines range). This file’s own usage of backoff and the handleStream pattern is the most directly relevant to that discussion and to understanding how backoff is intended to be owned and Reset in this codebase. If you want to compare, inspect shared/signal/client/grpc.go around the analogous connection/setup code (not provided here).
Actionable comments posted: 0
🧹 Nitpick comments (1)
shared/management/client/grpc.go (1)
163-165: Backoff reset logic is correct.

The reset after receiveEvents completes addresses the issue where long-lived connections exhaust MaxElapsedTime and prevent subsequent retries. The single-threaded execution ensures no race conditions.
The comment accurately describes the reset rationale, though you could optionally make it more explicit about MaxElapsedTime:
Optional: More explicit comment
```diff
-// we need this reset because after a successful connection and a consequent error, backoff lib doesn't
-// reset times and next try will start with a long delay
+// Reset backoff after a successful connection period. Without this reset,
+// long-lived connections can exhaust MaxElapsedTime and cause immediate
+// retry failure, or accumulate large intervals causing long retry delays.
 backOff.Reset()
```

Minor optimization opportunity: the reset currently happens before the error types are checked (line 165 vs lines 169-171). You could move the reset to after line 171 to avoid resetting when returning permanent errors, though this has negligible impact since the retry loop stops anyway.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
shared/management/client/grpc.go
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
util/retry.go (1)
Retry(10-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Client / Unit (386)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Windows
- GitHub Check: JS / Lint
- GitHub Check: Linux
- GitHub Check: Darwin
- GitHub Check: release_ui
- GitHub Check: Client / Unit
- GitHub Check: release
- GitHub Check: release_ui_darwin
- GitHub Check: Client / Unit
- GitHub Check: Android / Build
- GitHub Check: Client / Unit
- GitHub Check: iOS / Build
🔇 Additional comments (3)
shared/management/client/grpc.go (3)
114-115: LGTM - Backoff instance properly scoped.

Creating the backoff instance at the Sync level and passing it to handleStream enables the reset logic to work correctly. The instance is properly scoped to the Sync call lifetime.

133-136: LGTM - Backoff properly shared between Sync and handleStream.

The shared backoff instance is correctly passed to both handleStream (line 133) and backoff.Retry (line 136), enabling the reset logic to affect subsequent retry attempts.

144-145: LGTM - Method signature correctly updated.

The handleStream signature change to accept a backoff.BackOff parameter is necessary and correctly implemented.
Describe your changes
Reset the client's management gRPC backoff after a successful connection to the management API.
Current Situation:
If the connection duration exceeds MaxElapsedTime, then when the connection is interrupted the backoff gives up immediately (its elapsed-time budget is already exhausted) and no retry is actually performed.
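A tiny sketch of that failure mode and the fix, assuming github.com/cenkalti/backoff/v4 (MaxElapsedTime shrunk to milliseconds so the demo runs instantly):

```go
package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	bo := backoff.NewExponentialBackOff()
	bo.MaxElapsedTime = 100 * time.Millisecond // stands in for 3 months
	bo.Reset()                                 // retry cycle starts here

	time.Sleep(200 * time.Millisecond) // stands in for a months-long stream

	// The elapsed-time budget is spent, so the backoff gives up
	// immediately instead of scheduling a retry.
	fmt.Println(bo.NextBackOff() == backoff.Stop) // true

	bo.Reset() // the fix: reset after each successful connection
	fmt.Println(bo.NextBackOff() == backoff.Stop) // false: retries work again
}
```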
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__