
[client] add reset for management backoff#4935

Merged
pappz merged 4 commits into netbirdio:main from gamerslouis:fix/backoff-reset
Dec 30, 2025

Conversation

@gamerslouis
Contributor

@gamerslouis gamerslouis commented Dec 11, 2025

Describe your changes

Reset the client's management gRPC backoff after a successful connection to the management API.

Current Situation:
If the connection's lifetime exceeds MaxElapsedTime, then when the connection is interrupted the backoff gives up immediately because its elapsed-time budget is already spent, so no retry is actually performed.

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes
    • Improved connection reliability by unifying and preserving retry backoff across reconnects and streaming. The shared backoff is now maintained and reset at appropriate points, yielding more predictable reconnections, faster recovery from transient network issues, and fewer unnecessary reconnect attempts.


@coderabbitai
Contributor

coderabbitai Bot commented Dec 11, 2025

Walkthrough

Adds a shared backoff instance in GrpcClient.Sync, passes it into handleStream (signature changed), uses the shared backoff for initial retries, and resets the backoff on successful stream setup and after processing received events.

Changes

Cohort / File(s) Summary
Grpc client backoff propagation
shared/management/client/grpc.go
Introduces a local backOff in GrpcClient.Sync; uses it for the initial retry loop; changes handleStream signature to accept backoff.BackOff; passes the shared backoff into handleStream; resets the backoff on successful stream setup and after processing events; replaces previous fresh-Retry usage with the shared instance.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title '[client] add reset for management backoff' is specific and directly related to the main change of resetting backoff on successful connection.
  • Description check — ✅ Passed: The PR description covers the key sections: it explains the change, describes the current situation/bug, marks it as a bug fix, and confirms documentation is not needed. However, it lacks an issue ticket number link.
  • Docstring coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
shared/management/client/grpc.go (1)

159-159: LGTM: Backoff reset after successful connection.

The reset correctly addresses the bug where long-lived connections (exceeding MaxElapsedTime of 3 months) would cause the backoff to give up immediately on the next disconnection. By resetting after each successful connection, the elapsed time counter restarts, ensuring the retry mechanism continues to work properly.

The reset is placed at the right point—after the stream is successfully established but before blocking on event reception.

Consider adding a brief comment explaining the reset:

 	log.Infof("connected to the Management Service stream")
+	// Reset backoff after successful connection to prevent MaxElapsedTime accumulation
+	// across connection cycles. Without this, long-lived connections would cause
+	// immediate retry failure after disconnection.
 	backOff.Reset()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44851e0 and 1299d56.

📒 Files selected for processing (1)
  • shared/management/client/grpc.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
util/retry.go (1)
  • Retry (10-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: release
  • GitHub Check: release_ui_darwin
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: Client / Unit
  • GitHub Check: Linux
  • GitHub Check: Darwin
  • GitHub Check: Windows
  • GitHub Check: Android / Build
  • GitHub Check: iOS / Build
🔇 Additional comments (4)
shared/management/client/grpc.go (4)

114-115: LGTM: Backoff variable initialization.

The introduction of a local backOff variable that will be reused across retry attempts is correct and necessary for the reset mechanism to work properly.


133-133: LGTM: Backoff passed to handleStream.

Passing the backOff instance to handleStream allows it to reset the backoff state after successful connection establishment.


136-136: LGTM: Consistent backoff usage in retry loop.

Using the same backOff instance throughout the retry loop ensures proper backoff state management and enables the reset mechanism to function correctly.


145-145: LGTM: Signature updated to accept backoff parameter.

The signature change is necessary to support the backoff reset functionality.

1. backOff.Reset is already called by backoff.Retry, so it is not required during initialization.
2. Remove unnecessary empty lines.
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
shared/management/client/grpc.go (1)

162-163: Consider moving the backoff reset to after successfully receiving the first event.

The current placement resets the backoff after receiveEvents returns, regardless of whether any events were successfully received. If the stream fails immediately (before receiving events), the backoff is still reset, which may be too aggressive.

Consider moving the reset inside receiveEvents after successfully processing the first event. This ensures the backoff is only reset for "truly" successful connections.

🔎 Proposed refactoring

Move the reset into receiveEvents after the first successful event:

In handleStream, remove the reset:

 	// blocking until error
 	err = c.receiveEvents(stream, serverPubKey, msgHandler)
-	backOff.Reset()
 	if err != nil {

In receiveEvents, add a flag and reset after first event:

-func (c *GrpcClient) receiveEvents(stream proto.ManagementService_SyncClient, serverPubKey wgtypes.Key, msgHandler func(msg *proto.SyncResponse) error) error {
+func (c *GrpcClient) receiveEvents(stream proto.ManagementService_SyncClient, serverPubKey wgtypes.Key, msgHandler func(msg *proto.SyncResponse) error, backOff backoff.BackOff) error {
+	firstEventReceived := false
 	for {
 		update, err := stream.Recv()
 		if err == io.EOF {
 			log.Debugf("Management stream has been closed by server: %s", err)
 			return err
 		}
 		if err != nil {
 			log.Debugf("disconnected from Management Service sync stream: %v", err)
 			return err
 		}
 
 		log.Debugf("got an update message from Management Service")
 		decryptedResp := &proto.SyncResponse{}
 		err = encryption.DecryptMessage(serverPubKey, c.key, update.Body, decryptedResp)
 		if err != nil {
 			log.Errorf("failed decrypting update message from Management Service: %s", err)
 			return err
 		}
+
+		if !firstEventReceived {
+			backOff.Reset()
+			firstEventReceived = true
+		}
 
 		if err := msgHandler(decryptedResp); err != nil {
 			log.Errorf("failed handling an update message received from Management Service: %v", err.Error())
 		}
 	}
 }

Then update the call in handleStream:

-	err = c.receiveEvents(stream, serverPubKey, msgHandler)
+	err = c.receiveEvents(stream, serverPubKey, msgHandler, backOff)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6500b0 and e2d8205.

📒 Files selected for processing (1)
  • shared/management/client/grpc.go
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
util/retry.go (1)
  • Retry (10-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Signal / Unit (386)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Windows
  • GitHub Check: Linux
  • GitHub Check: Darwin
  • GitHub Check: Android / Build
  • GitHub Check: iOS / Build
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: release_ui
  • GitHub Check: release
  • GitHub Check: release_ui_darwin
  • GitHub Check: JS / Lint
  • GitHub Check: Client / Unit
🔇 Additional comments (2)
shared/management/client/grpc.go (2)

114-115: Backoff instance correctly shared across retries.

Creating the backoff once in Sync() and reusing it is the correct approach for addressing the MaxElapsedTime exhaustion issue described in the PR objectives.


144-145: LGTM!

The signature change correctly adds the backoff parameter to support the shared backoff pattern. Since this is a private method, there's no impact on external callers.

Comment on lines +133 to +136
return c.handleStream(ctx, *serverPubKey, sysInfo, msgHandler, backOff)
}

err := backoff.Retry(operation, defaultBackoff(ctx))
err := backoff.Retry(operation, backOff)
Contributor

@coderabbitai coderabbitai Bot Dec 23, 2025


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "grpc.go" -path "*/management/client/*" | head -1

Repository: netbirdio/netbird

Length of output: 95


🏁 Script executed:

grep -n "backoff" shared/management/client/grpc.go | head -30

Repository: netbirdio/netbird

Length of output: 933


🏁 Script executed:

find . -type f -name "grpc.go" | grep -E "(management|client)" | head -5

Repository: netbirdio/netbird

Length of output: 126


🏁 Script executed:

sed -n '130,170p' shared/management/client/grpc.go

Repository: netbirdio/netbird

Length of output: 1346


Do not reset backoff from within the operation function.

Calling backOff.Reset() inside handleStream (line 163) while backoff.Retry is managing the same instance corrupts retry state. The ExponentialBackOff implementation is not thread-safe. Use a separate backoff instance for the stream handler, or avoid resetting within the operation.

🤖 Prompt for AI Agents
In shared/management/client/grpc.go around lines 133 to 136, the operation
passed to backoff.Retry calls handleStream which currently resets the shared
backOff instance (see line ~163), corrupting ExponentialBackOff state; change
the code so handleStream uses its own backoff instance (copy/new
ExponentialBackOff) or remove any Reset() calls inside handleStream and only let
the backoff.Retry caller manage the shared backOff lifecycle; ensure any
per-stream retry logic constructs and manages its own backoff object so the
backoff.Retry call operates on an instance that is not mutated concurrently.

Contributor Author


In this specific case, handleStream is only ever called from Sync, and the entire call chain (backoff.Retry → operation → handleStream) runs within a single goroutine. There is no concurrent access to the backOff instance, so calling Reset() here does not introduce a race condition.

This pattern is also already used in the signal client with the same retry/backoff approach:

// start receiving messages from the Signal stream (from other peers through signal)
err = c.receive(stream)
if err != nil {
if s, ok := status.FromError(err); ok && s.Code() == codes.Canceled {
log.Debugf("signal connection context has been canceled, this usually indicates shutdown")
return nil
}
// we need this reset because after a successful connection and a consequent error, backoff lib doesn't
// reset times and next try will start with a long delay
backOff.Reset()
c.notifyDisconnected(err)
log.Warnf("disconnected from the Signal service but will retry silently. Reason: %v", err)
return err
}

I think the backoff.Retry API is primarily designed for short-lived, one-shot operations (e.g. HTTP requests), where the lifecycle of the backoff instance clearly maps to a single attempt sequence. In our case, however, we are using it to manage a long-lived streaming connection, where a single "attempt" may last minutes or hours, and a successful stream session is logically equivalent to a completed retry cycle. If we want to completely avoid any semantic ambiguity or incorrect assumptions about backoff ownership, the clean solution would be a structural change, but that would require a broader refactor rather than a localized fix.

Would it be acceptable to keep the current PR as-is and revisit the overall retry/backoff structure in the NetBird client in a future PR?

Contributor


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the signal client implementation to verify the user's claim
cat shared/signal/client/grpc.go | sed -n '160,185p'

Length of output: 935


🏁 Script executed:

#!/bin/bash
# Also check the broader context of how backoff is used in signal client
rg -n "backoff.Reset\(\)" shared/signal/client/grpc.go -A 3 -B 3

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if there are any other places in the codebase using similar pattern
rg -n "backoff.Retry.*operation.*handleStream" --type go

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look at the full Sync method in signal client to understand the pattern
cat shared/signal/client/grpc.go | sed -n '90,200p'

Length of output: 3783


@gamerslouis, you're absolutely right. I verified the signal client implementation and it uses the exact same pattern with backOff.Reset() at line 175-177, including a helpful comment explaining the rationale:

// we need this reset because after a successful connection and a consequent error, backoff lib doesn't
// reset times and next try will start with a long delay
backOff.Reset()

Your points are well-taken:

  • No concurrency issue in this single-goroutine flow
  • The pattern is already established in the codebase
  • The semantic concern I raised doesn't apply when intentionally using backoff.Retry to manage long-lived streaming connections

Suggestion: Consider adding a similar explanatory comment at line 163 in handleStream to document why the reset is needed, making the intent explicit for future maintainers:

// Reset backoff after stream ends so next retry doesn't start with accumulated delay
backOff.Reset()

The PR is acceptable as-is. If you'd like to track the broader structural refactor you mentioned (potentially moving away from backoff.Retry for streaming connections), I can help create a GitHub issue to document the discussion and proposed improvements for future work. Would that be helpful?


🧬 Code Graph Analysis Results

[shared/management/client/grpc.go:37-60]
Summary: NewClient uses a short-lived retry loop to establish a gRPC connection to the Management Service.

  • Function: NewClient(ctx context.Context, addr string, ourPrivateKey wgtypes.Key, tlsEnabled bool) (*GrpcClient, error)
  • Behavior:
    • Defines operation as a function that attempts to create a connection via nbgrpc.CreateConnection with the provided addr, TLS flag, and a component marker (Management component).
    • If creation fails, operation returns an error; otherwise, nil.
    • Executes the operation inside backoff.Retry(operation, nbgrpc.Backoff(ctx)).
    • On unrecoverable failure, logs and returns the error.
  • Error handling:
    • Retry continues according to the backoff strategy nbgrpc.Backoff(ctx) until success or a permanent error surfaces.

Rationale: This shows how a single connection establishment workflow uses the project's backoff/backoff.Retry mechanism to manage transient failures during setup, aligning with the user comment about Retry semantics and backoff ownership.

[shared/management/client/grpc.go:65-85]
Summary: defaultBackoff constructs a backoff strategy used for general retry/backoff logic.

  • Function: defaultBackoff(ctx context.Context) backoff.BackOff
  • Behavior:
    • Creates a backoff.ExponentialBackOff with:
      • InitialInterval: 800ms
      • RandomizationFactor: 1
      • Multiplier: 1.7
      • MaxInterval: 10s
      • MaxElapsedTime: 3 months
      • Stop: backoff.Stop
      • Clock: backoff.SystemClock
    • Wraps the backoff in backoff.WithContext(..., ctx) to tie cancellation/timeouts to the provided context.
  • Return: a backoff.BackOff instance ready to drive Retry loops.

Rationale: This captures how the code defines a long-lived, context-aware backoff strategy used by Retry calls elsewhere (e.g., in Sync and Login flows). It’s central to understanding how backoff ownership and lifecycle are intended to map to the operation lifecycle.

[shared/management/client/grpc.go:102-140]
Summary: handleStream uses a streaming RPC with a backoff object and explicitly resets the backoff after a stream attempt completes.

  • Function: (c *GrpcClient) handleStream(ctx context.Context, serverPubKey wgtypes.Key, sysInfo *system.Info, msgHandler func(*proto.SyncResponse) error, backOff backoff.BackOff) error
  • Behavior:
    • Establishes a streaming Sync connection via connectToStream.
    • On success, logs and notifies connection.
    • Enters a blocking loop via receiveEvents to process incoming messages.
    • After receiveEvents returns (i.e., the stream ends or errors), calls backOff.Reset().
    • If receiveEvents returns an error, it maps certain gRPC status codes to retryable vs. permanent errors:
      • PermissionDenied: return backoff.Permanent(err)
      • Canceled: interpret as shutdown and return nil
      • Others: log warning and return err to trigger a retry at a higher level
  • Backoff interaction:
    • The function receives backOff as a parameter and resets it after each stream processing attempt, aligning with the comment about reusing a single backoff instance for a long-lived streaming connection within a single goroutine.

Rationale: This directly relates to the user’s point about Reset() on backOff within a single-goroutine flow, the absence of races for this usage, and the broader discussion about how backoff/backoff.Retry is intended to be used with streaming connections. It also documents the exact control flow around stream handling, error categorization, and retry triggering.

[shared/management/client/grpc.go: additional note]
Context: The user comment references a pattern used in the signal client with the same retry/backoff approach (and a link to a specific lines range). This file’s own usage of backoff and the handleStream pattern is the most directly relevant to that discussion and to understanding how backoff is intended to be owned and Reset in this codebase. If you want to compare, inspect shared/signal/client/grpc.go around the analogous connection/setup code (not provided here).

Comment thread shared/management/client/grpc.go
@sonarqubecloud
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
shared/management/client/grpc.go (1)

163-165: Backoff reset logic is correct.

The reset after receiveEvents completes addresses the issue where long-lived connections exhaust MaxElapsedTime and prevent subsequent retries. The single-threaded execution ensures no race conditions.

The comment accurately describes the reset rationale, though you could optionally make it more explicit about MaxElapsedTime:

Optional: More explicit comment
-	// we need this reset because after a successful connection and a consequent error, backoff lib doesn't
-	// reset times and next try will start with a long delay
+	// Reset backoff after a successful connection period. Without this reset,
+	// long-lived connections can exhaust MaxElapsedTime and cause immediate
+	// retry failure, or accumulate large intervals causing long retry delays.
 	backOff.Reset()

Minor optimization opportunity: The reset currently happens before checking error types (line 165 vs lines 169-171). You could move the reset to after line 171 to avoid resetting when returning permanent errors, though this has negligible impact since the retry loop stops anyway.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2d8205 and 352ca85.

📒 Files selected for processing (1)
  • shared/management/client/grpc.go
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
util/retry.go (1)
  • Retry (10-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Windows
  • GitHub Check: JS / Lint
  • GitHub Check: Linux
  • GitHub Check: Darwin
  • GitHub Check: release_ui
  • GitHub Check: Client / Unit
  • GitHub Check: release
  • GitHub Check: release_ui_darwin
  • GitHub Check: Client / Unit
  • GitHub Check: Android / Build
  • GitHub Check: Client / Unit
  • GitHub Check: iOS / Build
🔇 Additional comments (3)
shared/management/client/grpc.go (3)

114-115: LGTM - Backoff instance properly scoped.

Creating the backoff instance at the Sync level and passing it to handleStream enables the reset logic to work correctly. The instance is properly scoped to the Sync call lifetime.


133-136: LGTM - Backoff properly shared between Sync and handleStream.

The shared backoff instance is correctly passed to both handleStream (line 133) and backoff.Retry (line 136), enabling the reset logic to affect subsequent retry attempts.


144-145: LGTM - Method signature correctly updated.

The handleStream signature change to accept a backoff.BackOff parameter is necessary and correctly implemented.

@pappz pappz merged commit e11970e into netbirdio:main Dec 30, 2025
37 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Apr 2, 2026
7 tasks
