[management] Fix sync metrics#4939
Conversation
Walkthrough
Adds request duration tracking to gRPC Sync operations. Records the start time after semaphore acquisition, calculates the total duration, and records it via the new CountSyncRequestDuration metric.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
management/internals/shared/grpc/server.go (2 hunks)
management/server/telemetry/grpc_metrics.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
management/internals/shared/grpc/server.go (2)
monotime/time.go (2)
Now (28-31), Since (33-35)
management/server/telemetry/grpc_metrics.go (1)
GRPCMetrics(15-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Build Cache
- GitHub Check: release
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
- GitHub Check: Android / Build
- GitHub Check: Linux
- GitHub Check: Windows
- GitHub Check: Client / Unit
- GitHub Check: Darwin
- GitHub Check: JS / Lint
- GitHub Check: release_ui_darwin
🔇 Additional comments (3)
management/server/telemetry/grpc_metrics.go (2)
24-24: LGTM! Histogram initialization follows established patterns.
The new syncRequestDuration histogram field and initialization correctly mirror the existing loginRequestDuration implementation, with appropriate OpenTelemetry configuration. Also applies to: 88-94, 125-125
165-168: Method implementation is correct.
The CountSyncRequestDuration method properly records the duration. The accountID parameter is currently unused, unlike in CountLoginRequestDuration where it supports high-latency tracking with attributes. This appears intentional for future extensibility.
management/internals/shared/grpc/server.go (1)
174-174: Verify the timing measurement start point.
The reqStart is captured after semaphore acquisition (Line 172), excluding queue wait time from the metric. In contrast, the Login function (Line 518) captures reqStart at the very beginning, including all overhead. This inconsistency makes it difficult to compare Sync vs Login performance metrics.

#!/bin/bash
# Verify if there's documentation or comments explaining the intended metric behavior
rg -n -C3 -P 'reqStart.*time\.Now|CountSyncRequestDuration|CountLoginRequestDuration' --type=go
if s.appMetrics != nil {
    s.appMetrics.GRPCMetrics().CountSyncRequestDuration(time.Since(reqStart), accountID)
}
Duration metric not recorded on error paths.
The duration is only recorded on the happy path (Line 267), but multiple error paths (Lines 182, 197, 214-218, 245-246, 252-253, 259-261) exit early without recording the metric. This creates incomplete telemetry data that only reflects successful syncs. The Login function avoids this issue by using a defer statement (Lines 558-562), ensuring the duration is always captured.
Apply this diff to ensure duration is always recorded:
func (s *Server) Sync(req *proto.EncryptedMessage, srv proto.ManagementService_SyncServer) error {
if s.syncSem.Load() >= s.syncLim {
return status.Errorf(codes.ResourceExhausted, "too many concurrent sync requests, please try again later")
}
s.syncSem.Add(1)
reqStart := time.Now()
+
+ ctx := srv.Context()
+
+ // Declare accountID early so defer can access it
+ var accountID string
+
+ defer func() {
+ if s.appMetrics != nil {
+ s.appMetrics.GRPCMetrics().CountSyncRequestDuration(time.Since(reqStart), accountID)
+ }
+ }()
- ctx := srv.Context()
syncReq := &proto.SyncRequest{}
peerKey, err := s.parseRequest(ctx, req, syncReq)
if err != nil {
s.syncSem.Add(-1)
return err
}
realIP := getRealIP(ctx)
sRealIP := realIP.String()
peerMeta := extractPeerMeta(ctx, syncReq.GetMeta())
metahashed := metaHash(peerMeta, sRealIP)
if !s.loginFilter.allowLogin(peerKey.String(), metahashed) {
if s.appMetrics != nil {
s.appMetrics.GRPCMetrics().CountSyncRequestBlocked()
}
if s.logBlockedPeers {
log.WithContext(ctx).Tracef("peer %s with meta hash %d is blocked from syncing", peerKey.String(), metahashed)
}
if s.blockPeersWithSameConfig {
s.syncSem.Add(-1)
return mapError(ctx, internalStatus.ErrPeerAlreadyLoggedIn)
}
}
if s.appMetrics != nil {
s.appMetrics.GRPCMetrics().CountSyncRequest()
}
// nolint:staticcheck
ctx = context.WithValue(ctx, nbContext.PeerIDKey, peerKey.String())
- accountID, err := s.accountManager.GetAccountIDForPeerKey(ctx, peerKey.String())
+ accountID, err = s.accountManager.GetAccountIDForPeerKey(ctx, peerKey.String())
if err != nil {
// nolint:staticcheck
ctx = context.WithValue(ctx, nbContext.AccountIDKey, "UNKNOWN")
+ accountID = "UNKNOWN"
log.WithContext(ctx).Tracef("peer %s is not registered", peerKey.String())
if errStatus, ok := internalStatus.FromError(err); ok && errStatus.Type() == internalStatus.NotFound {
s.syncSem.Add(-1)
return status.Errorf(codes.PermissionDenied, "peer is not registered")
}
s.syncSem.Add(-1)
return err
}
// nolint:staticcheck
ctx = context.WithValue(ctx, nbContext.AccountIDKey, accountID)
start := time.Now()
unlock := s.acquirePeerLockByUID(ctx, peerKey.String())
defer func() {
if unlock != nil {
unlock()
}
}()
log.WithContext(ctx).Tracef("acquired peer lock for peer %s took %v", peerKey.String(), time.Since(start))
log.WithContext(ctx).Debugf("Sync request from peer [%s] [%s]", req.WgPubKey, sRealIP)
if syncReq.GetMeta() == nil {
log.WithContext(ctx).Tracef("peer system meta has to be provided on sync. Peer %s, remote addr %s", peerKey.String(), realIP)
}
metahash := metaHash(peerMeta, realIP.String())
s.loginFilter.addLogin(peerKey.String(), metahash)
peer, netMap, postureChecks, dnsFwdPort, err := s.accountManager.SyncAndMarkPeer(ctx, accountID, peerKey.String(), peerMeta, realIP)
if err != nil {
log.WithContext(ctx).Debugf("error while syncing peer %s: %v", peerKey.String(), err)
s.syncSem.Add(-1)
return mapError(ctx, err)
}
err = s.sendInitialSync(ctx, peerKey, peer, netMap, postureChecks, srv, dnsFwdPort)
if err != nil {
log.WithContext(ctx).Debugf("error while sending initial sync for %s: %v", peerKey.String(), err)
s.syncSem.Add(-1)
return err
}
updates, err := s.networkMapController.OnPeerConnected(ctx, accountID, peer.ID)
if err != nil {
log.WithContext(ctx).Debugf("error while notify peer connected for %s: %v", peerKey.String(), err)
s.syncSem.Add(-1)
s.cancelPeerRoutines(ctx, accountID, peer)
return err
}
s.secretsManager.SetupRefresh(ctx, accountID, peer.ID)
- if s.appMetrics != nil {
- s.appMetrics.GRPCMetrics().CountSyncRequestDuration(time.Since(reqStart), accountID)
- }
-
unlock()
unlock = nil
s.syncSem.Add(-1)
return s.handleUpdates(ctx, accountID, peerKey, peer, updates, srv)
}

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In management/internals/shared/grpc/server.go around lines 266-268, the sync
duration metric is only recorded on the happy path; add a defer immediately
after reqStart is set that calls
s.appMetrics.GRPCMetrics().CountSyncRequestDuration(time.Since(reqStart),
accountID) (guarded by s.appMetrics != nil) so the duration is recorded on all
exits, and remove the existing inline happy-path-only metric call to avoid
double-reporting.
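The defer-based fix suggested above can be reduced to a self-contained sketch. Names here are illustrative stand-ins, not NetBird's actual types; the point is that declaring accountID before the defer lets the closure observe its final value, and the metric is recorded on every exit path, including early error returns:

```go
package main

import (
	"fmt"
	"time"
)

// metricsRecorder stands in for GRPCMetrics: it just remembers every
// duration handed to it.
type metricsRecorder struct {
	syncDurations []time.Duration
}

func (m *metricsRecorder) CountSyncRequestDuration(d time.Duration, accountID string) {
	m.syncDurations = append(m.syncDurations, d)
}

// syncLike mirrors the suggested structure: accountID is declared before
// the defer, so the deferred closure sees whatever value it holds when the
// function returns, on both error and happy paths.
func syncLike(m *metricsRecorder, fail bool) error {
	reqStart := time.Now()
	var accountID string
	defer func() {
		if m != nil {
			m.CountSyncRequestDuration(time.Since(reqStart), accountID)
		}
	}()
	if fail {
		return fmt.Errorf("simulated early error return")
	}
	accountID = "account-1"
	return nil
}

func main() {
	m := &metricsRecorder{}
	_ = syncLike(m, true)  // error path
	_ = syncLike(m, false) // happy path
	fmt.Println(len(m.syncDurations)) // 2
}
```

Note the defer runs even when the function panics, so this pattern also survives unexpected failures that an inline happy-path call would miss.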



Describe your changes
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit