11 changes: 8 additions & 3 deletions shared/management/client/grpc.go
@@ -111,6 +111,8 @@ func (c *GrpcClient) ready() bool {
// Sync wraps the real client's Sync endpoint call and takes care of retries and encryption/decryption of messages
// Blocking request. The result will be sent via msgHandler callback function
func (c *GrpcClient) Sync(ctx context.Context, sysInfo *system.Info, msgHandler func(msg *proto.SyncResponse) error) error {
backOff := defaultBackoff(ctx)

operation := func() error {
log.Debugf("management connection state %v", c.conn.GetState())
connState := c.conn.GetState()
@@ -128,10 +130,10 @@ func (c *GrpcClient) Sync(ctx context.Context, sysInfo *system.Info, msgHandler
return err
}

return c.handleStream(ctx, *serverPubKey, sysInfo, msgHandler)
return c.handleStream(ctx, *serverPubKey, sysInfo, msgHandler, backOff)
}

err := backoff.Retry(operation, defaultBackoff(ctx))
err := backoff.Retry(operation, backOff)
Comment on lines +133 to +136
Contributor @coderabbitai (Bot) commented on Dec 23, 2025

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "grpc.go" -path "*/management/client/*" | head -1

Repository: netbirdio/netbird

Length of output: 95


🏁 Script executed:

grep -n "backoff" shared/management/client/grpc.go | head -30

Repository: netbirdio/netbird

Length of output: 933


🏁 Script executed:

find . -type f -name "grpc.go" | grep -E "(management|client)" | head -5

Repository: netbirdio/netbird

Length of output: 126


🏁 Script executed:

sed -n '130,170p' shared/management/client/grpc.go

Repository: netbirdio/netbird

Length of output: 1346


Do not reset backoff from within the operation function.

Calling backOff.Reset() inside handleStream (line 163) while backoff.Retry is managing the same instance corrupts retry state. The ExponentialBackOff implementation is not thread-safe. Use a separate backoff instance for the stream handler, or avoid resetting within the operation.

🤖 Prompt for AI Agents
In shared/management/client/grpc.go around lines 133 to 136, the operation
passed to backoff.Retry calls handleStream which currently resets the shared
backOff instance (see line ~163), corrupting ExponentialBackOff state; change
the code so handleStream uses its own backoff instance (copy/new
ExponentialBackOff) or remove any Reset() calls inside handleStream and only let
the backoff.Retry caller manage the shared backOff lifecycle; ensure any
per-stream retry logic constructs and manages its own backoff object so the
backoff.Retry call operates on an instance that is not mutated concurrently.
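
A minimal sketch of the suggested alternative, assuming the cenkalti/backoff/v4 API already used in this file; streamBackoff is a hypothetical name introduced here for illustration, not something in the PR:

operation := func() error {
	// ... connection state check and server key fetch, as in Sync ...

	// hypothetical per-stream backoff, owned entirely by this attempt, so the
	// instance driving backoff.Retry below is never mutated inside the operation
	streamBackoff := backoff.WithContext(backoff.NewExponentialBackOff(), ctx)
	return c.handleStream(ctx, *serverPubKey, sysInfo, msgHandler, streamBackoff)
}

// the retry loop keeps exclusive ownership of its own backoff instance
err := backoff.Retry(operation, defaultBackoff(ctx))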

Contributor Author @gamerslouis replied:

In this specific case, handleStream is only ever called from Sync, and the entire call chain (backoff.Retry → operation → handleStream) runs within a single goroutine. There is no concurrent access to the backOff instance, so calling Reset() here does not introduce a race condition.

This pattern is already used in the signal client with the same retry/backoff approach:

// start receiving messages from the Signal stream (from other peers through signal)
err = c.receive(stream)
if err != nil {
if s, ok := status.FromError(err); ok && s.Code() == codes.Canceled {
log.Debugf("signal connection context has been canceled, this usually indicates shutdown")
return nil
}
// we need this reset because after a successful connection and a consequent error, backoff lib doesn't
// reset times and next try will start with a long delay
backOff.Reset()
c.notifyDisconnected(err)
log.Warnf("disconnected from the Signal service but will retry silently. Reason: %v", err)
return err
}

I think the backoff.Retry API is primarily designed for short-lived, one-shot operations (e.g. HTTP requests), where the lifecycle of the backoff instance clearly maps to a single attempt sequence. In our case, however, we are using it to manage a long-lived streaming connection, where a single “attempt” may last minutes or hours, and a successful stream session is logically equivalent to a completed retry cycle. If we want to completely avoid any semantic ambiguity or incorrect assumptions about backoff ownership, the clean solution would be a structural change, but that would require a broader refactor rather than a localized fix.

Would it be acceptable to keep the current PR as-is and revisit the overall retry/backoff structure in the NetBird client in a future PR?
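
As a rough illustration of why the reset matters, here is the backoff library behavior in isolation (illustrative only, not code from this PR):

b := backoff.NewExponentialBackOff()

_ = b.NextBackOff() // first failure: roughly InitialInterval
_ = b.NextBackOff() // second failure: the interval grows by Multiplier
// ... a later attempt succeeds and the stream then stays up for hours ...

// Without Reset(), the next NextBackOff() continues from the already grown
// interval, so a fresh disconnect would wait far longer than necessary.
b.Reset() // the next retry cycle starts from InitialInterval again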

Contributor @coderabbitai (Bot) replied:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the signal client implementation to verify the user's claim
cat shared/signal/client/grpc.go | sed -n '160,185p'

Length of output: 935


🏁 Script executed:

#!/bin/bash
# Also check the broader context of how backoff is used in signal client
rg -n "backoff.Reset\(\)" shared/signal/client/grpc.go -A 3 -B 3

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if there are any other places in the codebase using similar pattern
rg -n "backoff.Retry.*operation.*handleStream" --type go

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look at the full Sync method in signal client to understand the pattern
cat shared/signal/client/grpc.go | sed -n '90,200p'

Length of output: 3783


@gamerslouis, you're absolutely right. I verified the signal client implementation and it uses the exact same pattern with backOff.Reset() at lines 175-177, including a helpful comment explaining the rationale:

// we need this reset because after a successful connection and a consequent error, backoff lib doesn't
// reset times and next try will start with a long delay
backOff.Reset()

Your points are well-taken:

  • No concurrency issue in this single-goroutine flow
  • The pattern is already established in the codebase
  • The semantic concern I raised doesn't apply when intentionally using backoff.Retry to manage long-lived streaming connections

Suggestion: Consider adding a similar explanatory comment at line 163 in handleStream to document why the reset is needed, making the intent explicit for future maintainers:

// Reset backoff after stream ends so next retry doesn't start with accumulated delay
backOff.Reset()

The PR is acceptable as-is. If you'd like to track the broader structural refactor you mentioned (potentially moving away from backoff.Retry for streaming connections), I can help create a GitHub issue to document the discussion and proposed improvements for future work. Would that be helpful?


🧬 Code Graph Analysis Results

[shared/management/client/grpc.go:37-60]
Summary: NewClient uses a short-lived retry loop to establish a gRPC connection to the Management Service.

  • Function: NewClient(ctx context.Context, addr string, ourPrivateKey wgtypes.Key, tlsEnabled bool) (*GrpcClient, error)
  • Behavior:
    • Defines operation as a function that attempts to create a connection via nbgrpc.CreateConnection with the provided addr, TLS flag, and a component marker (Management component).
    • If creation fails, operation returns an error; otherwise, nil.
    • Executes the operation inside backoff.Retry(operation, nbgrpc.Backoff(ctx)).
    • On unrecoverable failure, logs and returns the error.
  • Error handling:
    • Retry continues according to the backoff strategy nbgrpc.Backoff(ctx) until success or a permanent error surfaces.

Rationale: This shows how a single connection establishment workflow uses the project's backoff/backoff.Retry mechanism to manage transient failures during setup, aligning with the user comment about Retry semantics and backoff ownership.
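
A loose sketch of that flow; the nbgrpc helper signature and variable names here are assumptions based on this summary, not the real declarations:

operation := func() error {
	// hypothetical signature; per the summary the real helper also takes a component marker
	conn, err := nbgrpc.CreateConnection(ctx, addr, tlsEnabled)
	if err != nil {
		return err // transient failure: backoff.Retry will attempt again
	}
	client.conn = conn // client is the *GrpcClient being constructed
	return nil
}

if err := backoff.Retry(operation, nbgrpc.Backoff(ctx)); err != nil {
	log.Errorf("failed to connect to the Management Service: %v", err)
	return nil, err
}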

[shared/management/client/grpc.go:65-85]
Summary: defaultBackoff constructs a backoff strategy used for general retry/backoff logic.

  • Function: defaultBackoff(ctx context.Context) backoff.BackOff
  • Behavior:
    • Creates a backoff.ExponentialBackOff with:
      • InitialInterval: 800ms
      • RandomizationFactor: 1
      • Multiplier: 1.7
      • MaxInterval: 10s
      • MaxElapsedTime: 3 months
      • Stop: backoff.Stop
      • Clock: backoff.SystemClock
    • Wraps the backoff in backoff.WithContext(..., ctx) to tie cancellation/timeouts to the provided context.
  • Return: a backoff.BackOff instance ready to drive Retry loops.

Rationale: This captures how the code defines a long-lived, context-aware backoff strategy used by Retry calls elsewhere (e.g., in Sync and Login flows). It’s central to understanding how backoff ownership and lifecycle are intended to map to the operation lifecycle.
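
A reconstructed sketch of defaultBackoff from the field values listed above (the duration expressions are assumptions; only the resulting values are reported by the analysis):

func defaultBackoff(ctx context.Context) backoff.BackOff {
	return backoff.WithContext(&backoff.ExponentialBackOff{
		InitialInterval:     800 * time.Millisecond,
		RandomizationFactor: 1,
		Multiplier:          1.7,
		MaxInterval:         10 * time.Second,
		MaxElapsedTime:      3 * 30 * 24 * time.Hour, // roughly three months
		Stop:                backoff.Stop,
		Clock:               backoff.SystemClock,
	}, ctx)
}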

[shared/management/client/grpc.go:102-140]
Summary: handleStream uses a streaming RPC with a backoff object and explicitly resets the backoff after a stream attempt completes.

  • Function: (c *GrpcClient) handleStream(ctx context.Context, serverPubKey wgtypes.Key, sysInfo *system.Info, msgHandler func(*proto.SyncResponse) error, backOff backoff.BackOff) error
  • Behavior:
    • Establishes a streaming Sync connection via connectToStream.
    • On success, logs and notifies connection.
    • Enters a blocking loop via receiveEvents to process incoming messages.
    • After receiveEvents returns (i.e., the stream ends or errors), calls backOff.Reset().
    • If receiveEvents returns an error, it maps certain gRPC status codes to retryable vs. permanent errors:
      • PermissionDenied: return backoff.Permanent(err)
      • Canceled: interpret as shutdown and return nil
      • Others: log warning and return err to trigger a retry at a higher level
  • Backoff interaction:
    • The function receives backOff as a parameter and resets it after each stream processing attempt, aligning with the comment about reusing a single backoff instance for a long-lived streaming connection within a single goroutine.

Rationale: This directly relates to the user’s point about Reset() on backOff within a single-goroutine flow, the absence of races for this usage, and the broader discussion about how backoff/backoff.Retry is intended to be used with streaming connections. It also documents the exact control flow around stream handling, error categorization, and retry triggering.
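
A condensed sketch of that error handling, pieced together from this summary and the diff below rather than the verbatim source:

err = c.receiveEvents(stream, serverPubKey, msgHandler)
// reset so the next retry cycle starts from the initial interval again
backOff.Reset()
if err != nil {
	c.notifyDisconnected(err)
	s, _ := gstatus.FromError(err)
	switch s.Code() {
	case codes.PermissionDenied:
		return backoff.Permanent(err) // unrecoverable: stop retrying
	case codes.Canceled:
		return nil // context canceled, treated as a normal shutdown
	default:
		return err // anything else is retryable and bubbles up to backoff.Retry
	}
}
return nil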

[shared/management/client/grpc.go: additional note]
Context: The user comment references a pattern used in the signal client with the same retry/backoff approach (and a link to a specific line range). This file’s own usage of backoff and the handleStream pattern is the most directly relevant to that discussion and to understanding how backoff is intended to be owned and reset in this codebase. If you want to compare, inspect shared/signal/client/grpc.go around the analogous connection/setup code (not provided here).

if err != nil {
log.Warnf("exiting the Management service connection retry loop due to the unrecoverable error: %s", err)
}
@@ -140,7 +142,7 @@ func (c *GrpcClient) Sync(ctx context.Context, sysInfo *system.Info, msgHandler
}

func (c *GrpcClient) handleStream(ctx context.Context, serverPubKey wgtypes.Key, sysInfo *system.Info,
msgHandler func(msg *proto.SyncResponse) error) error {
msgHandler func(msg *proto.SyncResponse) error, backOff backoff.BackOff) error {
ctx, cancelStream := context.WithCancel(ctx)
defer cancelStream()

@@ -158,6 +160,9 @@ func (c *GrpcClient) handleStream(ctx context.Context, serverPubKey wgtypes.Key,

// blocking until error
err = c.receiveEvents(stream, serverPubKey, msgHandler)
// we need this reset because after a successful connection and a consequent error, backoff lib doesn't
// reset times and next try will start with a long delay
backOff.Reset()
coderabbitai[bot] marked this conversation as resolved.
if err != nil {
c.notifyDisconnected(err)
s, _ := gstatus.FromError(err)