[management] Fix sync metrics #4939

Merged
pascal-fischer merged 1 commit into main from fix/sync-metrics
Dec 11, 2025
pascal-fischer (Collaborator) commented Dec 11, 2025

Describe your changes

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Chores
    • Improved internal metrics collection for synchronization requests to enable better performance monitoring.

coderabbitai Bot (Contributor) commented Dec 11, 2025

Walkthrough

Adds request duration tracking to gRPC Sync operations. Records the start time after semaphore acquisition, calculates total duration, and records it via a new CountSyncRequestDuration metric method that stores the duration in a histogram.

Changes

| Cohort / File(s) | Summary |
|---|---|
| gRPC Sync Request Timing: management/internals/shared/grpc/server.go | Adds start time recording after semaphore acquisition and calls the new CountSyncRequestDuration metric method with the calculated duration before unlocking. |
| gRPC Metrics Recording: management/server/telemetry/grpc_metrics.go | Adds a new public method CountSyncRequestDuration(duration time.Duration, accountID string) that records the request duration to the syncRequestDuration histogram. |
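
The metric method summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `durationHistogram` and `grpcMetrics` are hypothetical stand-ins for the OpenTelemetry histogram and the real `GRPCMetrics` type, and recording in milliseconds is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// durationHistogram is a hypothetical in-memory stand-in for a histogram
// instrument; it only stores the recorded samples.
type durationHistogram struct {
	mu      sync.Mutex
	samples []int64
}

// Record appends one sample (assumed here to be in milliseconds).
func (h *durationHistogram) Record(ms int64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.samples = append(h.samples, ms)
}

// Count returns how many samples have been recorded.
func (h *durationHistogram) Count() int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.samples)
}

// grpcMetrics mirrors only the field discussed in the walkthrough.
type grpcMetrics struct {
	syncRequestDuration *durationHistogram
}

// CountSyncRequestDuration records the duration of one sync request.
// accountID is accepted but unused, matching the review note.
func (m *grpcMetrics) CountSyncRequestDuration(duration time.Duration, accountID string) {
	_ = accountID
	m.syncRequestDuration.Record(duration.Milliseconds())
}

func main() {
	m := &grpcMetrics{syncRequestDuration: &durationHistogram{}}
	m.CountSyncRequestDuration(150*time.Millisecond, "account-1")
	fmt.Println("samples recorded:", m.syncRequestDuration.Count())
}
```

Keeping `accountID` in the signature leaves room to attach it as an attribute later without changing call sites.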

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify timing is recorded at the correct point in the request lifecycle (after semaphore, before unlock)
  • Confirm histogram recording in CountSyncRequestDuration is properly initialized and functional
  • Note: accountID parameter is present but unused; clarify if reserved for future attribute labeling

Suggested reviewers

  • crn4

Poem

🐰 Hop, hop, the timers tick,
Counting each request, quick and slick,
Metrics bloom where durations flow,
Sync requests now steal the show! ⏱️✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is largely incomplete with missing critical information including what changes were made, the issue reference, and root cause explanation. | Add a detailed explanation of the changes, reference the related issue ticket, and explain why this fix addresses the sync metrics problem. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding sync request duration metrics to the management service. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94d34dc and 50edef2.

📒 Files selected for processing (2)
  • management/internals/shared/grpc/server.go (2 hunks)
  • management/server/telemetry/grpc_metrics.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
management/internals/shared/grpc/server.go (2)
monotime/time.go (2)
  • Now (28-31)
  • Since (33-35)
management/server/telemetry/grpc_metrics.go (1)
  • GRPCMetrics (15-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Build Cache
  • GitHub Check: release
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: Android / Build
  • GitHub Check: Linux
  • GitHub Check: Windows
  • GitHub Check: Client / Unit
  • GitHub Check: Darwin
  • GitHub Check: JS / Lint
  • GitHub Check: release_ui_darwin
🔇 Additional comments (3)
management/server/telemetry/grpc_metrics.go (2)

24-24: LGTM! Histogram initialization follows established patterns.

The new syncRequestDuration histogram field and initialization correctly mirror the existing loginRequestDuration implementation, with appropriate OpenTelemetry configuration.

Also applies to: 88-94, 125-125


165-168: Method implementation is correct.

The CountSyncRequestDuration method properly records the duration. The accountID parameter is currently unused, unlike in CountLoginRequestDuration where it supports high-latency tracking with attributes. This appears intentional for future extensibility.

management/internals/shared/grpc/server.go (1)

174-174: Verify the timing measurement start point.

The reqStart is captured after semaphore acquisition (Line 172), excluding queue wait time from the metric. In contrast, the Login function (Line 518) captures reqStart at the very beginning, including all overhead. This inconsistency makes it difficult to compare Sync vs Login performance metrics.

#!/bin/bash
# Verify if there's documentation or comments explaining the intended metric behavior
rg -n -C3 -P 'reqStart.*time\.Now|CountSyncRequestDuration|CountLoginRequestDuration' --type=go
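
The effect of the two start points described in this comment can be illustrated with a small deterministic sketch. `fakeClock` and `handle` are hypothetical names, and the example simply assumes that some time passes between entering the handler and being admitted past the concurrency limit.

```go
package main

import (
	"fmt"
	"time"
)

// fakeClock is a hypothetical deterministic clock, so the example needs no sleeps.
type fakeClock struct{ now time.Time }

func (c *fakeClock) Now() time.Time          { return c.now }
func (c *fakeClock) Advance(d time.Duration) { c.now = c.now.Add(d) }

// handle simulates one request: preAdmission elapses before the request is
// admitted, then work elapses while it is processed. It returns the duration
// measured from handler entry and the duration measured from admission.
func handle(c *fakeClock, preAdmission, work time.Duration) (fromEntry, fromAdmission time.Duration) {
	entry := c.Now() // start point at the very beginning
	c.Advance(preAdmission)
	admitted := c.Now() // start point after admission
	c.Advance(work)
	end := c.Now()
	return end.Sub(entry), end.Sub(admitted)
}

func main() {
	c := &fakeClock{}
	fromEntry, fromAdmission := handle(c, 30*time.Millisecond, 70*time.Millisecond)
	fmt.Println(fromEntry, fromAdmission) // prints "100ms 70ms"
}
```

Two metrics that use different start points differ by exactly the pre-admission time, so they are only comparable when that time is negligible.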

Comment on lines +266 to +268
if s.appMetrics != nil {
	s.appMetrics.GRPCMetrics().CountSyncRequestDuration(time.Since(reqStart), accountID)
}

⚠️ Potential issue | 🟠 Major

Duration metric not recorded on error paths.

The duration is only recorded on the happy path (Line 267), but multiple error paths (Lines 182, 197, 214-218, 245-246, 252-253, 259-261) exit early without recording the metric. This creates incomplete telemetry data that only reflects successful syncs. The Login function avoids this issue by using a defer statement (Lines 558-562), ensuring the duration is always captured.

Apply this diff to ensure duration is always recorded:

 func (s *Server) Sync(req *proto.EncryptedMessage, srv proto.ManagementService_SyncServer) error {
 	if s.syncSem.Load() >= s.syncLim {
 		return status.Errorf(codes.ResourceExhausted, "too many concurrent sync requests, please try again later")
 	}
 	s.syncSem.Add(1)
 
 	reqStart := time.Now()
+
+	ctx := srv.Context()
+
+	// Declare accountID early so defer can access it
+	var accountID string
+
+	defer func() {
+		if s.appMetrics != nil {
+			s.appMetrics.GRPCMetrics().CountSyncRequestDuration(time.Since(reqStart), accountID)
+		}
+	}()
 
-	ctx := srv.Context()
 
 	syncReq := &proto.SyncRequest{}
 	peerKey, err := s.parseRequest(ctx, req, syncReq)
 	if err != nil {
 		s.syncSem.Add(-1)
 		return err
 	}
 	realIP := getRealIP(ctx)
 	sRealIP := realIP.String()
 	peerMeta := extractPeerMeta(ctx, syncReq.GetMeta())
 	metahashed := metaHash(peerMeta, sRealIP)
 	if !s.loginFilter.allowLogin(peerKey.String(), metahashed) {
 		if s.appMetrics != nil {
 			s.appMetrics.GRPCMetrics().CountSyncRequestBlocked()
 		}
 		if s.logBlockedPeers {
 			log.WithContext(ctx).Tracef("peer %s with meta hash %d is blocked from syncing", peerKey.String(), metahashed)
 		}
 		if s.blockPeersWithSameConfig {
 			s.syncSem.Add(-1)
 			return mapError(ctx, internalStatus.ErrPeerAlreadyLoggedIn)
 		}
 	}
 
 	if s.appMetrics != nil {
 		s.appMetrics.GRPCMetrics().CountSyncRequest()
 	}
 
 	// nolint:staticcheck
 	ctx = context.WithValue(ctx, nbContext.PeerIDKey, peerKey.String())
 
-	accountID, err := s.accountManager.GetAccountIDForPeerKey(ctx, peerKey.String())
+	accountID, err = s.accountManager.GetAccountIDForPeerKey(ctx, peerKey.String())
 	if err != nil {
 		// nolint:staticcheck
 		ctx = context.WithValue(ctx, nbContext.AccountIDKey, "UNKNOWN")
+		accountID = "UNKNOWN"
 		log.WithContext(ctx).Tracef("peer %s is not registered", peerKey.String())
 		if errStatus, ok := internalStatus.FromError(err); ok && errStatus.Type() == internalStatus.NotFound {
 			s.syncSem.Add(-1)
 			return status.Errorf(codes.PermissionDenied, "peer is not registered")
 		}
 		s.syncSem.Add(-1)
 		return err
 	}
 
 	// nolint:staticcheck
 	ctx = context.WithValue(ctx, nbContext.AccountIDKey, accountID)
 
 	start := time.Now()
 	unlock := s.acquirePeerLockByUID(ctx, peerKey.String())
 	defer func() {
 		if unlock != nil {
 			unlock()
 		}
 	}()
 	log.WithContext(ctx).Tracef("acquired peer lock for peer %s took %v", peerKey.String(), time.Since(start))
 
 	log.WithContext(ctx).Debugf("Sync request from peer [%s] [%s]", req.WgPubKey, sRealIP)
 
 	if syncReq.GetMeta() == nil {
 		log.WithContext(ctx).Tracef("peer system meta has to be provided on sync. Peer %s, remote addr %s", peerKey.String(), realIP)
 	}
 
 	metahash := metaHash(peerMeta, realIP.String())
 	s.loginFilter.addLogin(peerKey.String(), metahash)
 
 	peer, netMap, postureChecks, dnsFwdPort, err := s.accountManager.SyncAndMarkPeer(ctx, accountID, peerKey.String(), peerMeta, realIP)
 	if err != nil {
 		log.WithContext(ctx).Debugf("error while syncing peer %s: %v", peerKey.String(), err)
 		s.syncSem.Add(-1)
 		return mapError(ctx, err)
 	}
 
 	err = s.sendInitialSync(ctx, peerKey, peer, netMap, postureChecks, srv, dnsFwdPort)
 	if err != nil {
 		log.WithContext(ctx).Debugf("error while sending initial sync for %s: %v", peerKey.String(), err)
 		s.syncSem.Add(-1)
 		return err
 	}
 
 	updates, err := s.networkMapController.OnPeerConnected(ctx, accountID, peer.ID)
 	if err != nil {
 		log.WithContext(ctx).Debugf("error while notify peer connected for %s: %v", peerKey.String(), err)
 		s.syncSem.Add(-1)
 		s.cancelPeerRoutines(ctx, accountID, peer)
 		return err
 	}
 
 	s.secretsManager.SetupRefresh(ctx, accountID, peer.ID)
 
-	if s.appMetrics != nil {
-		s.appMetrics.GRPCMetrics().CountSyncRequestDuration(time.Since(reqStart), accountID)
-	}
-
 	unlock()
 	unlock = nil
 
 	s.syncSem.Add(-1)
 
 	return s.handleUpdates(ctx, accountID, peerKey, peer, updates, srv)
 }
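
The defer pattern used in the diff above can be shown in isolation. This is a simplified sketch under stated assumptions: `recorder`, `syncOnce`, and `errNotRegistered` are hypothetical names, and the real handler has many more exit paths.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// recorder is a hypothetical stand-in for the metrics sink.
type recorder struct {
	calls         int
	lastAccountID string
}

func (r *recorder) CountSyncRequestDuration(d time.Duration, accountID string) {
	r.calls++
	r.lastAccountID = accountID
}

var errNotRegistered = errors.New("peer is not registered")

// syncOnce records its duration on every exit path because the metric call
// lives in a deferred closure registered right after reqStart is captured.
func syncOnce(r *recorder, fail bool) error {
	reqStart := time.Now()

	// Declared early so the deferred closure sees whatever value is set later.
	var accountID string

	defer func() {
		if r != nil {
			r.CountSyncRequestDuration(time.Since(reqStart), accountID)
		}
	}()

	if fail {
		accountID = "UNKNOWN"
		return errNotRegistered // early return still triggers the defer
	}

	accountID = "account-1"
	return nil
}

func main() {
	r := &recorder{}
	_ = syncOnce(r, false)
	_ = syncOnce(r, true)
	fmt.Println("durations recorded:", r.calls)
}
```

Because the closure captures `accountID` by reference, it reports the value assigned before the return, including the "UNKNOWN" fallback on the error path.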

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In management/internals/shared/grpc/server.go around lines 266-268, the sync
duration metric is only recorded on the happy path; add a defer immediately
after reqStart is set that calls
s.appMetrics.GRPCMetrics().CountSyncRequestDuration(time.Since(reqStart),
accountID) (guarded by s.appMetrics != nil) so the duration is recorded on all
exits, and remove the existing inline happy-path-only metric call to avoid
double-reporting.

@pascal-fischer pascal-fischer merged commit 90e3b80 into main Dec 11, 2025
40 checks passed
@pascal-fischer pascal-fischer deleted the fix/sync-metrics branch December 11, 2025 19:11