Skip to content

[management] monitoring updates#4937

Merged
pascal-fischer merged 3 commits intomainfrom
chore/monitoring-update
Dec 11, 2025
Merged

[management] monitoring updates#4937
pascal-fischer merged 3 commits intomainfrom
chore/monitoring-update

Conversation

@pascal-fischer
Copy link
Copy Markdown
Collaborator

@pascal-fischer pascal-fischer commented Dec 11, 2025

Describe your changes

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Chores

    • Removed per-request duration tracking for one type of gRPC call and simplified related metrics.
    • Lowered verbosity for some peer-blocking logs and moved a login debug log later in the flow.
    • Ensured request context is initialized and propagates account/user identifiers.
  • Bug Fixes

    • HTTP handler now injects account and user IDs into the request context for downstream logging/metrics.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 11, 2025

Walkthrough

Removed per-request sync latency tracking and its metric; lowered blocked/login log levels and moved a Login log later in the flow; ensured accountID is injected into context before GetAccountUsers; added HTTP middleware propagation of AccountId/UserId into request context for downstream logging/metrics.

Changes

Cohort / File(s) Summary
gRPC server instrumentation
management/internals/shared/grpc/server.go
Removed Sync request timing and the call to CountSyncRequestDuration; changed blocked-log level from Warnf to Tracef; moved initial Login debug log later (after account ID is known); changed blocked-login log level to Tracef.
Telemetry & metrics
management/server/telemetry/grpc_metrics.go
Removed syncRequestHighLatencyCounter field and CountSyncRequestDuration method; retained login latency tracking and related counters.
HTTP middleware context propagation
management/server/telemetry/http_api_metrics.go
After handling the wrapped response, extracts auth from request context and, on success, injects nbContext.AccountIDKey and nbContext.UserIDKey into the local context for downstream logging/metrics; preserves existing error handling.
Account context setup
management/server/account.go
Ensure non-nil ctx (use context.Background() if nil) and inject nbcontext.AccountIDKey via context.WithValue before calling GetAccountUsers (with a // nolint:staticcheck directive).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review points:
    • Confirm no remaining callers of CountSyncRequestDuration or usage of the removed syncRequestHighLatencyCounter.
    • Verify that reduced log levels (Warnf→Tracef) still meet observability needs.
    • Ensure middleware AccountIDKey/UserIDKey usage aligns with downstream consumers.
    • Validate context.WithValue placement and the nolint justification.

Possibly related PRs

Suggested reviewers

  • crn4

Poem

🐰 I hopped through logs and nudged a key,

moved a login whisper till the ID I see,
sync durations slipped away like dew,
account IDs now travel through,
small tidy hops — a rabbit's happy plea.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is largely empty with only the template present; critical sections like 'Describe your changes' and 'Issue ticket number and link' are completely unfilled. Fill in the 'Describe your changes' section with a clear explanation of the modifications, provide the issue ticket number/link, and explain why documentation updates were not needed.
Title check ❓ Inconclusive The title '[management] monitoring updates' is vague and generic, using non-descriptive terms that don't clearly convey the specific changes made to the codebase. Use a more specific title that describes the actual changes, such as 'Remove sync request duration metrics and update context propagation' or similar.
✅ Passed checks (1 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch chore/monitoring-update

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 200b7a8 and df0b77a.

📒 Files selected for processing (1)
  • management/server/telemetry/http_api_metrics.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
management/server/telemetry/http_api_metrics.go (1)
management/server/context/auth.go (1)
  • GetUserAuthFromContext (25-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
  • GitHub Check: Management / Benchmark (API) (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Management / Benchmark (API) (amd64, postgres)
  • GitHub Check: Management / Benchmark (amd64, sqlite)
  • GitHub Check: Management / Benchmark (amd64, postgres)
  • GitHub Check: Management / Integration (amd64, postgres)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Management / Integration (amd64, sqlite)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: release
  • GitHub Check: Windows
  • GitHub Check: Linux
  • GitHub Check: Android / Build
  • GitHub Check: iOS / Build
  • GitHub Check: Client / Unit

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
management/server/account.go (1)

790-792: Context enrichment for AccountID looks good; consider using the string form for consistency

Setting nbcontext.AccountIDKey on the context before GetAccountUsers is consistent with the rest of the codebase and helps downstream logging/telemetry.

Since you already compute accountIDString := fmt.Sprintf("%v", accountID), you might want to store that string value in the context instead of the raw any to keep the value type uniform across all AccountID usages:

-	// nolint:staticcheck
-	ctx = context.WithValue(ctx, nbcontext.AccountIDKey, accountID)
+	// nolint:staticcheck
+	ctx = context.WithValue(ctx, nbcontext.AccountIDKey, accountIDString)

Not mandatory, but it can avoid surprises if something later assumes a string AccountID in the context.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44851e0 and 2e371b3.

📒 Files selected for processing (4)
  • management/internals/shared/grpc/server.go (1 hunks)
  • management/server/account.go (1 hunks)
  • management/server/telemetry/grpc_metrics.go (0 hunks)
  • management/server/telemetry/http_api_metrics.go (2 hunks)
💤 Files with no reviewable changes (1)
  • management/server/telemetry/grpc_metrics.go
🧰 Additional context used
🧬 Code graph analysis (2)
management/server/account.go (2)
shared/context/keys.go (1)
  • AccountIDKey (5-5)
management/server/context/keys.go (1)
  • AccountIDKey (7-7)
management/server/telemetry/http_api_metrics.go (2)
management/server/context/auth.go (1)
  • GetUserAuthFromContext (25-30)
shared/management/http/util/util.go (1)
  • WriteError (84-120)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Management / Benchmark (API) (amd64, postgres)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Benchmark (API) (amd64, sqlite)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Management / Integration (amd64, postgres)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Signal / Unit (386)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Management / Benchmark (amd64, postgres)
  • GitHub Check: Management / Benchmark (amd64, sqlite)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Management / Integration (amd64, sqlite)
  • GitHub Check: Linux
  • GitHub Check: Windows
  • GitHub Check: Darwin
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: iOS / Build
  • GitHub Check: release_ui_darwin
  • GitHub Check: Android / Build
  • GitHub Check: release
  • GitHub Check: release_ui
🔇 Additional comments (1)
management/internals/shared/grpc/server.go (1)

550-550: Logging after AccountID context setup is appropriate

Moving the login debug log to after PeerIDKey and AccountIDKey are set on ctx ensures logs carry richer context without changing request handling. Looks good.

Comment on lines 18 to 19
"github.com/netbirdio/netbird/shared/management/http/util"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find the actual implementation in http_api_metrics.go
cat management/server/telemetry/http_api_metrics.go | head -250

Repository: netbirdio/netbird

Length of output: 7416


🏁 Script executed:

# Now let's verify the context issue and how auth middleware works
# 1. Check GetUserAuthFromContext implementation
rg -n "GetUserAuthFromContext" -A 10 management | head -50

# 2. Find where UserAuth is set in context
rg -n "UserAuthContextKey|WithValue.*UserAuth" -B 2 -A 2 management | head -100

Repository: netbirdio/netbird

Length of output: 4923


🏁 Script executed:

# Find where auth middleware sets UserAuth in context
rg -n "SetUserAuthInContext|UserAuthContextKey.*WithValue" -B 3 -A 3 management | head -80

# Find auth middleware implementations
rg -n "type.*Auth.*Middleware|AuthenticationMiddleware|JWTMiddleware" -B 2 -A 5 management | head -100

Repository: netbirdio/netbird

Length of output: 3122


🏁 Script executed:

# Check AuthMiddleware Handler implementation
rg -n "func.*Handler|type.*Handler" management/server/http/middleware/auth_middleware.go -A 20 | head -100

# Look for the actual Handler function that wraps h http.Handler
cat management/server/http/middleware/auth_middleware.go | sed -n '1,150p'

Repository: netbirdio/netbird

Length of output: 5776


🏁 Script executed:

# Check checkJWTFromRequest to confirm SetUserAuthInRequest is called
cat management/server/http/middleware/auth_middleware.go | sed -n '135,250p'

# Find router setup to see middleware order
rg -n "NewAuthMiddleware|NewMetricsMiddleware|mux.NewRouter" management/server/http -B 2 -A 5 | head -150

Repository: netbirdio/netbird

Length of output: 15636


🏁 Script executed:

# Find how middlewares are chained/composed with the router
cat management/server/http/handler.go | sed -n '100,180p'

Repository: netbirdio/netbird

Length of output: 1628


🏁 Script executed:

# Check WriteError implementation to confirm it writes headers and body
rg -n "func WriteError" -A 15 shared/management/http/util

Repository: netbirdio/netbird

Length of output: 2385


🏁 Script executed:

# Verify that context.WithValue creates a NEW context and doesn't mutate the original
rg -n "func WithValue" -B 2 -A 5 --type go | head -30

# Also double-check that the modified request is not accessible after h.ServeHTTP returns
rg -n "SetUserAuthInRequest" -B 3 -A 3 management/server/context/auth.go

Repository: netbirdio/netbird

Length of output: 281


HTTP metrics middleware should not write a new error response after the handler or depend critically on UserAuth

Two problems here can break or distort HTTP responses:

  1. Reading UserAuth from the wrong context source

    The middleware chain order is: metrics → cors → auth → handler. When auth runs inside h.ServeHTTP, it sets UserAuth on a modified request via SetUserAuthInRequest(r, userAuth), which returns a new *http.Request with enriched context. This modified request is only visible to downstream handlers within the h.ServeHTTP call. After h.ServeHTTP returns, the metrics middleware's original outer r still holds its original context without UserAuth. Calling GetUserAuthFromContext(r.Context()) here will always fail because the context mutations that happen inside h.ServeHTTP are not visible to the original request object.

  2. Writing an error response after the wrapped handler already responded

    At this point h.ServeHTTP has already run and may have:

    • Written headers and a body (successful or error response), or
    • Written its own error via util.WriteError.

    Calling WriteError here will:

    • Write headers again (the WrappedResponseWriter will silently ignore this due to the wroteHeader flag).
    • Append a JSON error body to an already-sent response, creating a malformed response.
    • Log the same error twice (once in the handler, once in metrics).
    • Change behavior for endpoints that legitimately don't attach UserAuth (e.g. health checks, login, or any unauthenticated route), even though the underlying handler completed successfully.

    Middleware responsible for metrics/logging should be observational, not alter HTTP semantics when it can't find optional context like UserAuth.

Suggested fix

Make UserAuth enrichment best-effort and non-fatal, and avoid writing a new error response here:

-		h.ServeHTTP(w, r.WithContext(ctx))
-
-		userAuth, err := nbContext.GetUserAuthFromContext(r.Context())
-		if err != nil {
-			util.WriteError(r.Context(), err, w)
-			return
-		}
-
-		//nolint
-		ctx = context.WithValue(ctx, nbContext.AccountIDKey, userAuth.AccountId)
-		//nolint
-		ctx = context.WithValue(ctx, nbContext.UserIDKey, userAuth.UserId)
+		h.ServeHTTP(w, r.WithContext(ctx))
+
+		// Best-effort enrichment: if user auth is present, add account/user IDs to the logging context.
+		// If not present (e.g. unauthenticated endpoints or failed auth), just skip enrichment.
+		if userAuth, err := nbContext.GetUserAuthFromContext(ctx); err == nil {
+			//nolint
+			ctx = context.WithValue(ctx, nbContext.AccountIDKey, userAuth.AccountId)
+			//nolint
+			ctx = context.WithValue(ctx, nbContext.UserIDKey, userAuth.UserId)
+		}

Key points:

  • No WriteError here; errors from auth/handlers should be handled in their own middleware/handlers.
  • Check ctx (the enriched context passed into the handler) instead of r.Context(). While this doesn't fully solve the context visibility problem, it's more correct and keeps the logic aligned. The real fix requires verifying how the HTTP auth middleware is wired to ensure UserAuth is actually reachable at the appropriate point.

This keeps metrics/logging behavior non-invasive while still enriching logs when user/account information is available.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In management/server/telemetry/http_api_metrics.go around lines 18–19, the
metrics middleware currently reads UserAuth from the original request context
after calling h.ServeHTTP and may call util.WriteError if auth info is missing;
instead make UserAuth enrichment best-effort by reading from the
handler-provided ctx (the context used inside ServeHTTP) rather than
r.Context(), do not call util.WriteError or otherwise mutate the response after
h.ServeHTTP returns, and only add user/account fields to logs/metrics when the
UserAuth lookup succeeds; keep the middleware observational—log presence/absence
of UserAuth but do not alter HTTP semantics or write responses.

crn4
crn4 previously approved these changes Dec 11, 2025
@sonarqubecloud
Copy link
Copy Markdown

@pascal-fischer pascal-fischer merged commit 94d34dc into main Dec 11, 2025
39 of 41 checks passed
@pascal-fischer pascal-fischer deleted the chore/monitoring-update branch December 11, 2025 17:29
@coderabbitai coderabbitai Bot mentioned this pull request Dec 11, 2025
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants