
[client] Feature/client metrics #5512

Merged
mlsmaycon merged 58 commits into main from feature/client-metrics-push on Mar 22, 2026

Conversation

pappz (Collaborator) commented Mar 5, 2026

Describe your changes

Adds opt-in client metrics push for tracking connection performance. Metrics are collected in InfluxDB line protocol format, buffered locally, and pushed periodically to an ingest server.

  • Disabled by default; enable push via NB_METRICS_PUSH_ENABLED=true
  • Collection is always active (for debug bundles); only the push is opt-in
  • Push behavior is controlled by a remote configuration endpoint with env var overrides for development/testing
  • Buffer is hard-limited by age (5 days) and size (5 MB) for bounded memory usage (see the trimming sketch after this list)
  • Debug bundles include collected metrics in metrics.txt
  • Includes an InfluxDB ingest server, Grafana dashboards, and a Docker Compose dev stack
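
For a rough picture of how such a bounded buffer can work, here is a minimal sketch assuming the 5-day / 5 MB limits from the list above; the types and the trimming logic are illustrative, not the PR's actual code in the metrics package.

```go
// Sketch of age- and size-bounded sample buffering. The limits (5 days, 5 MB)
// come from the PR description; everything else (type and function names) is
// illustrative rather than the real trimming rules in influxdb.go.
package main

import (
	"fmt"
	"time"
)

const (
	maxSampleAge  = 5 * 24 * time.Hour
	maxBufferSize = 5 * 1024 * 1024 // bytes
)

type sample struct {
	line string    // InfluxDB line-protocol entry
	ts   time.Time // when the sample was recorded
}

// trim drops samples older than maxSampleAge, then drops the oldest remaining
// samples until the buffer fits within maxBufferSize.
func trim(buf []sample, now time.Time) []sample {
	kept := buf[:0]
	for _, s := range buf {
		if now.Sub(s.ts) <= maxSampleAge {
			kept = append(kept, s)
		}
	}
	size := 0
	for _, s := range kept {
		size += len(s.line)
	}
	for len(kept) > 0 && size > maxBufferSize {
		size -= len(kept[0].line)
		kept = kept[1:]
	}
	return kept
}

func main() {
	now := time.Now()
	buf := []sample{
		{line: "netbird_sync duration_seconds=0.2", ts: now.Add(-6 * 24 * time.Hour)}, // too old, dropped
		{line: "netbird_login duration_seconds=1.1", ts: now},
	}
	fmt.Println(len(trim(buf, now))) // 1
}
```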

Metrics Collected

  • netbird_peer_connection: fields signaling_to_connection_seconds, connection_to_wg_handshake_seconds, total_seconds (per-peer connection stage timings)
  • netbird_sync: field duration_seconds (management sync message processing time)
  • netbird_login: field duration_seconds (login/auth duration, success/failure)

All metrics include tags: deployment_type, version, os, arch, peer_id (hashed).
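
For a concrete view of the wire format, the sketch below prints one netbird_peer_connection sample in InfluxDB line protocol with the tag set above. The tag and field values are invented for illustration, and the snippet is not the PR's exporter.

```go
// Illustration only: one netbird_peer_connection sample rendered as an
// InfluxDB line-protocol entry (measurement,tags fields timestamp). Values
// are made up; the PR's actual formatting lives in influxdb.go.
package main

import (
	"fmt"
	"time"
)

func main() {
	measurement := "netbird_peer_connection"
	tags := "deployment_type=selfhosted,version=0.39.0,os=linux,arch=amd64,peer_id=a1b2c3d4e5f60718"
	fields := "signaling_to_connection_seconds=1.42,connection_to_wg_handshake_seconds=0.37,total_seconds=1.79"
	// Line-protocol timestamps default to nanoseconds; the precision the PR's
	// exporter actually uses is not shown here.
	fmt.Printf("%s,%s %s %d\n", measurement, tags, fields, time.Now().UnixNano())
}
```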

Privacy

  • Peer IDs are truncated SHA-256 hashes of WireGuard public keys (not reversible)
  • No IP addresses, peer names, or configuration details are sent
  • Connection pair IDs are deterministic hashes (same pair produces same ID from either side)
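
A minimal sketch of how these two properties can be obtained, assuming truncation to 8 bytes (16 hex chars, matching the ingest validation below) and an order-independent pair hash; the PR's actual derivation may differ in detail.

```go
// Sketch only: one plausible way to derive an anonymised peer ID and a
// deterministic connection pair ID. Only the stated properties (truncated
// SHA-256, 16 hex chars, order-independent pair ID) are taken from the PR.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// anonPeerID hashes a WireGuard public key and keeps the first 8 bytes
// (16 hex characters), so the original key cannot be recovered.
func anonPeerID(wgPubKey string) string {
	sum := sha256.Sum256([]byte(wgPubKey))
	return hex.EncodeToString(sum[:8])
}

// pairID is order-independent: both peers compute the same value for the same
// connection, which allows deduplicating reconnect samples.
func pairID(a, b string) string {
	lo, hi := a, b
	if lo > hi {
		lo, hi = hi, lo
	}
	sum := sha256.Sum256([]byte(lo + "|" + hi))
	return hex.EncodeToString(sum[:8])
}

func main() {
	p1 := anonPeerID("pubkey-A")
	p2 := anonPeerID("pubkey-B")
	fmt.Println(p1, p2, pairID(p1, p2) == pairID(p2, p1)) // true
}
```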

Ingest Server Validation

  • Allowlisted measurements, fields, and tags (rejects unknown)
  • Duration fields capped at 300s
  • Tag values capped at 64 chars
  • Peer ID format validated (16 hex chars)
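
Illustrative sketch of the kind of checks described above; the identifier names are hypothetical, and the real validation lives in the ingest server (client/internal/metrics/infra/ingest/main.go).

```go
// Illustrative only: allowlisted measurements/fields, duration bounds, tag
// length caps, and peer ID format checks as listed in the PR description.
package main

import (
	"fmt"
	"regexp"
)

var (
	allowedFields = map[string]map[string]bool{
		"netbird_peer_connection": {
			"signaling_to_connection_seconds":    true,
			"connection_to_wg_handshake_seconds": true,
			"total_seconds":                      true,
		},
		"netbird_sync":  {"duration_seconds": true},
		"netbird_login": {"duration_seconds": true},
	}
	peerIDRe = regexp.MustCompile(`^[0-9a-f]{16}$`)
)

const (
	maxDurationSeconds = 300.0
	maxTagValueLen     = 64
)

func validateField(measurement, field string, value float64) error {
	fields, ok := allowedFields[measurement]
	if !ok {
		return fmt.Errorf("unknown measurement %q", measurement)
	}
	if !fields[field] {
		return fmt.Errorf("unknown field %q for measurement %q", field, measurement)
	}
	if value < 0 || value > maxDurationSeconds {
		return fmt.Errorf("field %s=%g out of bounds", field, value)
	}
	return nil
}

func validateTagValue(v string) error {
	if len(v) > maxTagValueLen {
		return fmt.Errorf("tag value longer than %d chars", maxTagValueLen)
	}
	return nil
}

func validatePeerID(id string) error {
	if !peerIDRe.MatchString(id) {
		return fmt.Errorf("peer id %q is not 16 hex chars", id)
	}
	return nil
}

func main() {
	fmt.Println(validateField("netbird_sync", "duration_seconds", 1.2)) // <nil>
	fmt.Println(validateTagValue("linux"))                              // <nil>
	fmt.Println(validatePeerID("a1b2c3d4e5f60718"))                     // <nil>
}
```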

Environment Variables

  • NB_METRICS_PUSH_ENABLED (default: false): Enable metrics push to the backend
  • NB_METRICS_FORCE_SENDING (default: false): Skip remote config and push unconditionally
  • NB_METRICS_CONFIG_URL (default: https://ingest.netbird.io/config): Remote push config endpoint
  • NB_METRICS_SERVER_URL (default: from remote config): Override ingest server URL
  • NB_METRICS_INTERVAL (default: from remote config): Override push interval (e.g. "30m", "1h")
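
As a small usage sketch, an NB_METRICS_INTERVAL-style value can be parsed with time.ParseDuration; in the real client an empty value means the remote config interval is used, so the fallback constant below is only a placeholder.

```go
// Sketch, not the PR's code: parsing an NB_METRICS_INTERVAL-style override
// ("30m", "1h"). The fallback value is a placeholder; in the actual client an
// empty env var defers to the remote config interval instead.
package main

import (
	"fmt"
	"os"
	"time"
)

const fallbackInterval = 1 * time.Hour // illustrative placeholder only

func pushInterval() time.Duration {
	raw := os.Getenv("NB_METRICS_INTERVAL")
	if raw == "" {
		return fallbackInterval
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		// Guard against zero/negative intervals so a timer reset with this
		// value cannot busy-loop (see the review notes further down).
		return fallbackInterval
	}
	return d
}

func main() {
	fmt.Println(pushInterval())
}
```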

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Summary by CodeRabbit

  • New Features

    • Added anonymized client metrics (connection stages, sync duration, login timing) with optional periodic push and local ingest endpoint.
    • Exposed metrics in debug bundles and added pprof profiling support for diagnostics.
    • Included a Grafana dashboard and local metrics stack for visualization and local development.
  • Documentation

    • Added comprehensive metrics documentation, config templates, and setup instructions for the local metrics environment.

pappz added 20 commits January 12, 2026 12:48
Implements a comprehensive client metrics system to track peer connection
stages and performance. The system supports multiple backend implementations
(OpenTelemetry, VictoriaMetrics, and no-op) and tracks detailed connection
stage durations from creation through WireGuard handshake.

Key changes:
- Add metrics package with pluggable backend implementations
- Implement OpenTelemetry metrics backend
- Implement VictoriaMetrics metrics backend
- Add no-op metrics implementation for disabled state
- Track connection stages: creation, semaphore, signaling, connection ready, and WireGuard handshake
- Move WireGuard watcher functionality to conn.go
- Refactor engine to integrate metrics tracking
- Add metrics export endpoint in debug server
# Conflicts:
#	client/internal/debug/debug.go
#	client/internal/engine.go
#	client/server/debug.go
Integrate Netbird version into VictoriaMetrics backend and metrics labels. Update `ClientMetrics` constructor and metric name formatting to include version information.
Introduce `RecordSyncDuration` for measuring sync message processing time. Update all metrics implementations (VictoriaMetrics, no-op) to support the new method. Refactor `ClientMetrics` to use `AgentInfo` for static agent data.
…uctor

Eliminate unused `noopMetrics` and refactor `ClientMetrics` to always use the VictoriaMetrics implementation. Update associated logic to reflect these changes.
Calculate total duration for both initial connections and reconnections, accounting for different timestamp scenarios. Update `Export` method to include Prometheus HELP comments.
# Conflicts:
#	client/internal/engine.go
#	client/internal/peer/conn.go
The semaphore queuing stage (Created → SemaphoreAcquired) is no longer
tracked. Connection metrics now start from SignalingReceived. Updated
docs and Grafana dashboard accordingly.
…ility

Introduce remoteconfig.Manager that fetches a remote JSON config to control
metrics push interval and restrict pushing to a specific agent version
range. When NB_METRICS_INTERVAL is set, remote config is bypassed
entirely for local override.
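
A hedged sketch of the config shape this commit describes; the JSON field names are guesses based on the CONFIG_* variables in the infra .env template later in this thread, not necessarily the manager's actual schema.

```go
// Sketch of a remote push config with server URL, version range, and interval,
// as described in the commit message above. Field names are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type remoteConfig struct {
	ServerURL     string `json:"server_url"`
	VersionSince  string `json:"version_since"`
	VersionUntil  string `json:"version_until"`
	PeriodMinutes int    `json:"period_minutes"`
}

func (c remoteConfig) interval() time.Duration {
	return time.Duration(c.PeriodMinutes) * time.Minute
}

func main() {
	raw := []byte(`{"server_url":"https://ingest.example.com","version_since":"0.0.0","version_until":"99.99.99","period_minutes":5}`)
	var cfg remoteConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.ServerURL, cfg.interval())
}
```
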
coderabbitai (Bot) commented Mar 5, 2026

📝 Walkthrough

Walkthrough

Adds a new client metrics subsystem (collection, buffering, export) with an InfluxDB backend, periodic push with remote config, peer-connection instrumentation, debug bundle integration, ingest+infra for local development, and minor API additions to management client and test workflow.

Changes

  • Core metrics types & config (client/internal/metrics/connection_type.go, client/internal/metrics/deployment_type.go, client/internal/metrics/env.go, client/internal/metrics/metrics.go, client/internal/metrics/metrics_default.go, client/internal/metrics/metrics_js.go): Adds AgentInfo, ClientMetrics, ConnectionStageTimestamps, deployment detection, env-driven config constants/helpers, and a build-tag aware constructor (native vs WASM nil).
  • InfluxDB backend & tests (client/internal/metrics/influxdb.go, client/internal/metrics/influxdb_test.go): Implements the InfluxDB line-protocol backend, sample buffering, trimming rules, Export/Reset, and tests validating formatting, trimming, and recorded metrics.
  • Push orchestration & tests (client/internal/metrics/push.go, client/internal/metrics/push_test.go): Implements Push and PushConfig, the periodic export loop, gzip POST to the server with X-Peer-ID, version gating, server/interval resolution, and extensive unit tests.
  • Remote config manager & tests (client/internal/metrics/remoteconfig/manager.go, .../manager_test.go): Adds a remote config fetcher with caching/min-refresh, JSON parsing into Config (server URL, version range, interval), and tests for caching and error cases.
  • Engine & connect wiring (client/internal/connect.go, client/internal/engine.go, client/internal/engine_test.go): Wires ClientMetrics into ConnectClient/EngineServices, starts/stops the push lifecycle, updates agent info after management connection, records login/sync durations, exposes Engine.GetClientMetrics().
  • Peer instrumentation (client/internal/peer/conn.go, client/internal/peer/handshaker.go, client/internal/peer/metrics_saver.go, client/internal/peer/wg_watcher.go, client/internal/peer/..._test.go): Adds a MetricsRecorder interface and MetricsStages struct, propagates timestamps through the handshaker/wg watcher, records signaling/ready/handshake timestamps, and emits connection metrics. Tests updated/added.
  • Debug bundle integration (client/internal/debug/debug.go, client/server/debug.go): Adds a MetricsExporter interface to the debug bundle generator deps and embeds exported metrics (metrics.txt) into debug archives.
  • Management client API (shared/management/client/client.go, shared/management/client/grpc.go, shared/management/client/mock.go): Adds GetServerURL() to the Client interface and implements the accessor in GrpcClient and MockClient.
  • Ingest service & infra (client/internal/metrics/infra/ingest/main.go, .../Dockerfile, go.mod, main_test.go): Adds an HTTP ingest service that validates InfluxDB line protocol, authenticates via X-Peer-ID, forwards to the InfluxDB Write API, exposes /config and /health endpoints, and includes unit tests.
  • Local infra & provisioning (client/internal/metrics/infra/docker-compose.yml, .../grafana/..., influxdb/scripts/create-tokens.sh, .env.example, README.md, .gitignore): Adds a Docker Compose stack (ingest, influxdb, grafana), Grafana provisioning (datasource/dashboard JSON), a token script, an env template, and a README for running the local metrics stack.
  • CI: Windows tests script (.github/workflows/golang-test-windows.yml): Replaces the long inline test invocation with a generated PowerShell/command script that collects the package list and runs tests via run-tests.cmd, capturing output to test-out.txt.

Sequence Diagram(s)

sequenceDiagram
    participant Conn as Conn
    participant Handshaker as Handshaker
    participant MetricsStages as MetricsStages
    participant Engine as Engine
    participant Metrics as ClientMetrics

    Conn->>MetricsStages: Initialize()
    Conn->>Handshaker: NewHandshaker(..., metricsStages)
    Handshaker->>MetricsStages: RecordSignalingReceived()
    Handshaker->>MetricsStages: RecordConnectionReady(now)
    Conn->>MetricsStages: RecordWGHandshakeSuccess(now)
    Conn->>Engine: recordConnectionMetrics()
    Engine->>Metrics: RecordConnectionStages(pubKey, type, isReconn, timestamps)
    Metrics->>Metrics: anonymize peer IDs, buffer sample
sequenceDiagram
    participant Engine as Engine
    participant ClientMetrics as ClientMetrics
    participant Push as Push
    participant RemoteConfig as RemoteConfig
    participant Ingest as Ingest/HTTP
    participant Influx as InfluxDB

    Engine->>ClientMetrics: StartPush(ctx, cfg)
    ClientMetrics->>Push: NewPush(...)
    Push->>RemoteConfig: RefreshIfNeeded()
    RemoteConfig->>Ingest: GET /config (if present)
    Push->>ClientMetrics: Export() -> line-protocol
    Push->>Push: gzipCompress(payload)
    Push->>Ingest: POST /write with X-Peer-ID
    Ingest->>Influx: Forward to InfluxDB Write API
    Influx-->>Push: 2xx success
    Push->>ClientMetrics: Reset() on success

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • pappz

Poem

🐰 Hop, I gathered every timestamp and thread,

From signaling sparks to handshakes ahead.
I nibble bytes, gzip them snug and tight,
Send them upstream where dashboards light.
A carrot-toast to metrics now in flight!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 19.42%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check (❓ Inconclusive): The title '[client] Feature/client metrics' is vague and generic, using non-descriptive terms that don't convey the specific purpose of the changeset. Resolution: revise the title to be more specific about the primary change, e.g. 'Add opt-in client metrics collection and push to ingest server' or 'Implement client-side metrics tracking for connection performance'.
✅ Passed checks (1 passed)
  • Description check (✅ Passed): The PR description is comprehensive and well-structured. It explains the feature, provides detailed configuration options, lists collected metrics with descriptions, addresses privacy concerns, and includes validation details. The checklist is properly completed with the feature enhancement box checked.


pappz (Collaborator, Author) commented Mar 5, 2026

@RabbitAI review this draft PR

pappz and others added 8 commits March 5, 2026 15:06
Replace NewClientMetrics in metrics.go with a WASM-specific stub in metrics_js.go, returning nil for compatibility with JS builds. Simplify method usage for WASM targets.
Replace cumulative Prometheus histograms with timestamped point-in-time
samples that are pushed once and cleared. This fixes metrics for sparse
events (connections/syncs that happen once at startup) where rate() and
increase() produced incorrect or empty results.

Changes:
- Switch from VictoriaMetrics histogram library to raw Prometheus text
  format with explicit millisecond timestamps
- Reset samples after successful push (no resending stale data)
- Rename connection_to_handshake → connection_to_wg_handshake
- Add netbird_peer_connection_count metric for ICE vs Relay tracking
- Simplify dashboard: point-based scatter plots, donut pie chart
- Add maxStalenessInterval=1m to VictoriaMetrics to prevent forward-fill
- Fix deployment_type Unknown returning "selfhosted" instead of "unknown"
- Fix inverted shouldPush condition in push.go
Add influxdb.go with timestamped line protocol export for sparse
one-shot events. Restore victoria.go to use proper Prometheus
histograms. Update Grafana dashboards, add InfluxDB datasource,
and update docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix StopPush not clearing push state, preventing restart
- Fix race condition reading currentConnPriority without lock in recordConnectionMetrics
- Fix stale comment referencing old metrics server URL
- Update docker-compose for InfluxDB: add scoped tokens, .env config, init scripts
- Rename docker-compose.victoria.yml to docker-compose.yml
Introduce peer_id and connection_pair_id tags to InfluxDB metrics.
Public keys are hashed (truncated SHA-256) for anonymisation. The
connection pair ID is deterministic regardless of which side computes
it, enabling deduplication of reconnections in the ICE vs Relay
dashboard. Also pin Grafana to v11.6.0 for file-based provisioning
and fix datasource UID references.
coderabbitai (Bot) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
client/internal/metrics/infra/README.md (3)

32-39: Optional: Add language specifier to fenced code block for linter compliance.

The ASCII flow diagram is inside a fenced code block without a language specifier. Adding text after the opening backticks would silence the markdownlint warning.

📝 Proposed fix
-```
+```text
 Client ──POST──▶ Ingest Server (:8087) ──▶ InfluxDB (internal)
                   │
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/README.md` around lines 32 - 39, Update the
fenced code block in client/internal/metrics/infra/README.md that contains the
ASCII flow diagram so the opening fence includes a language specifier (use
"text"); locate the block starting with ``` and change it to ```text to satisfy
markdownlint and silence the linter warning while leaving the diagram contents
unchanged.

167-170: Optional: Show explicit token parameter in InfluxDB query example.

The comment mentions "using admin token from .env", but the command doesn't show how to pass it. Adding an explicit --token parameter would make the example more self-contained for developers unfamiliar with the InfluxDB CLI.

📝 Proposed enhancement
 # Query via InfluxDB (using admin token from .env)
 docker compose exec influxdb influx query \
   'from(bucket: "metrics") |> range(start: -1h)' \
-  --org netbird
+  --org netbird \
+  --token <INFLUXDB_ADMIN_TOKEN>

Alternatively, clarify in the comment whether the token is auto-discovered within the container or needs to be explicitly passed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/README.md` around lines 167 - 170, Update the
InfluxDB query example to explicitly show how to pass the admin token: modify
the `docker compose exec influxdb influx query` command to include the `--token`
flag (e.g., `--token $INFLUX_ADMIN_TOKEN`) so readers know where to supply the
token, or alternately add a one-line note clarifying whether the token is
auto-discovered inside the container; reference the exact command string `docker
compose exec influxdb influx query` and the `--token` flag in your change.

17-26: Optional: Add language specifier to fenced code block for linter compliance.

The ASCII diagram is inside a fenced code block without a language specifier. Adding text after the opening backticks would silence the markdownlint warning while preserving readability.

📝 Proposed fix
-```
+```text
 Daemon Layer (connect.go)
   ├─ Creates ClientMetrics instance once
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/README.md` around lines 17 - 26, Add a language
specifier to the fenced code block in client/internal/metrics/infra/README.md
(the ASCII diagram containing "Daemon Layer (connect.go)" and "Engine Layer
(engine.go)") by changing the opening fence from ``` to ```text so markdownlint
stops warning; keep the diagram content unchanged and only modify the opening
backticks to include "text".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/debug/debug.go`:
- Line 56: Update the manifest description for "metrics.txt" to note it is only
included when metrics collection is enabled and there are buffered samples
(i.e., when addMetrics() writes output from Export() rather than when Export()
returns no data); change the manifest string near the metrics file description
to explicitly mention that the file is skipped if Export() returns no samples,
and ensure references to addMetrics() and Export() are accurate so readers
understand the empty-buffer case.

---

Nitpick comments:
In `@client/internal/metrics/infra/README.md`:
- Around line 32-39: Update the fenced code block in
client/internal/metrics/infra/README.md that contains the ASCII flow diagram so
the opening fence includes a language specifier (use "text"); locate the block
starting with ``` and change it to ```text to satisfy markdownlint and silence
the linter warning while leaving the diagram contents unchanged.
- Around line 167-170: Update the InfluxDB query example to explicitly show how
to pass the admin token: modify the `docker compose exec influxdb influx query`
command to include the `--token` flag (e.g., `--token $INFLUX_ADMIN_TOKEN`) so
readers know where to supply the token, or alternately add a one-line note
clarifying whether the token is auto-discovered inside the container; reference
the exact command string `docker compose exec influxdb influx query` and the
`--token` flag in your change.
- Around line 17-26: Add a language specifier to the fenced code block in
client/internal/metrics/infra/README.md (the ASCII diagram containing "Daemon
Layer (connect.go)" and "Engine Layer (engine.go)") by changing the opening
fence from ``` to ```text so markdownlint stops warning; keep the diagram
content unchanged and only modify the opening backticks to include "text".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52cd1f1c-bc2d-4674-92e2-7bc4528efdb0

📥 Commits

Reviewing files that changed from the base of the PR and between d8118df and 804cd5d.

📒 Files selected for processing (4)
  • client/internal/debug/debug.go
  • client/internal/metrics/infra/README.md
  • client/internal/metrics/remoteconfig/manager.go
  • client/internal/peer/metrics_saver.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • client/internal/metrics/remoteconfig/manager.go
  • client/internal/peer/metrics_saver.go

coderabbitai (Bot) left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/metrics/push.go`:
- Around line 192-246: The push method in Push (func (p *Push) push) currently
calls p.metrics.Export(...) and later p.metrics.Reset(), which can drop samples
recorded between the two calls; change the metrics API to provide an atomic
drain/swap operation (e.g., ExportAndReset or Drain that returns the snapshot
buffer) and update Push.push to call that single atomic method instead of Export
followed by Reset so the snapshot and clearing happen atomically and no samples
are lost.
- Around line 167-181: The env override currently still early-returns when
RefreshIfNeeded() fails or when isVersionInRange() is false; change the logic so
that when p.envInterval > 0 you still call p.configManager.RefreshIfNeeded(ctx)
to resolve push URL/metadata but do NOT make the interval decision or skip
pushes based on config==nil or isVersionInRange(); specifically, set interval =
p.envInterval first, call RefreshIfNeeded to populate config if available (use
config only for URL/endpoint resolution), and bypass the config nil-check and
the isVersionInRange(p.agentVersion, config.VersionSince, config.VersionUntil)
check when p.envInterval > 0 so the NB_METRICS_INTERVAL override works even if
RefreshIfNeeded fails or the agent is out of configured version range.
- Around line 149-156: The timer is being reset with whatever p.resolve(ctx)
returns which can be zero or negative; clamp the returned interval before
calling timer.Reset to a safe minimum (e.g., at least 1s or another configured
MIN_PUSH_INTERVAL) to avoid tight busy-loops. Update the code around
p.resolve(ctx) and timer.Reset: call p.resolve(ctx) -> validate interval (if <=
0 then set to MIN_PUSH_INTERVAL) -> then call timer.Reset(interval); ensure any
MIN_PUSH_INTERVAL constant is defined near the push logic and referenced in the
push/p.resolve flow so push(ctx, pushURL) and timer.Reset use the validated
value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f1058da-2471-4b0e-927f-049a30472a13

📥 Commits

Reviewing files that changed from the base of the PR and between 804cd5d and b4cd717.

📒 Files selected for processing (1)
  • client/internal/metrics/push.go

coderabbitai (Bot) left a comment

♻️ Duplicate comments (2)
client/internal/metrics/push.go (2)

149-156: ⚠️ Potential issue | 🟠 Major

Clamp interval before timer.Reset to prevent busy-loop.

While NewPush validates envInterval for force-sending mode, resolve() can still return a zero or negative interval from config.Interval (remote config) when p.envInterval == 0. This would cause timer.Reset(0) to fire immediately, creating a busy-loop.

Suggested guard
 		pushURL, interval := p.resolve(ctx)
+		if interval <= 0 {
+			log.Warnf("invalid metrics push interval %s, falling back to %s", interval, defaultPushInterval)
+			interval = defaultPushInterval
+		}
 		if pushURL != "" {
 			if err := p.push(ctx, pushURL); err != nil {
 				log.Errorf("failed to push metrics: %v", err)
 			}
 		}
 
 		timer.Reset(interval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 149 - 156, The call to
timer.Reset(interval) can receive zero/negative intervals from p.resolve() (when
p.envInterval==0 and config.Interval is invalid), causing a busy-loop; update
the logic around p.resolve(ctx) and timer.Reset to clamp the resolved interval
to a safe minimum (e.g. if interval <= 0 then set interval = defaultMinInterval
or a constant like time.Second) before calling timer.Reset; reference p.resolve,
p.envInterval and the timer.Reset call so you locate and replace the unguarded
reset with a clamped value.

167-182: ⚠️ Potential issue | 🟠 Major

NB_METRICS_INTERVAL does not bypass version gating as documented.

When p.envInterval > 0, the code still checks isVersionInRange() at line 178 and skips pushes if the version is out of range. This contradicts the documented behavior that NB_METRICS_INTERVAL should "override remote config period/version filtering."

Suggested fix to bypass version check when env interval is set
 func (p *Push) resolve(ctx context.Context) (pushURL string, interval time.Duration) {
 	if p.config.ForceSending {
 		return p.resolveServerURL(nil), p.envInterval
 	}

+	// When env interval is set, bypass version gating but still use remote config for URL
+	if p.envInterval > 0 {
+		var remoteServerURL *url.URL
+		if cfg := p.configManager.RefreshIfNeeded(ctx); cfg != nil {
+			remoteServerURL = &cfg.ServerURL
+		}
+		return p.resolveServerURL(remoteServerURL), p.envInterval
+	}
+
 	config := p.configManager.RefreshIfNeeded(ctx)
 	if config == nil {
 		log.Debug("no metrics push config available, waiting to retry")
 		return "", defaultPushInterval
 	}

-	interval = config.Interval
-	if p.envInterval > 0 {
-		interval = p.envInterval
-	}
+	interval = config.Interval

 	if !isVersionInRange(p.agentVersion, config.VersionSince, config.VersionUntil) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 167 - 182, The env var override
(p.envInterval) currently gets applied but still falls through to the version
gate; change the logic so that when p.envInterval > 0 the code bypasses the
version check entirely. Concretely, in the RefreshIfNeeded/interval branch
adjust control flow around p.envInterval and isVersionInRange (e.g., set
interval = p.envInterval and short-circuit returning "" and interval or only
evaluate isVersionInRange when p.envInterval == 0) so that p.envInterval
overrides remote config period/version filtering; reference p.envInterval,
config.Interval, isVersionInRange and p.agentVersion to locate the code to
modify.
🧹 Nitpick comments (1)
client/internal/metrics/influxdb.go (1)

142-167: Consider deterministic field ordering for reproducibility.

Go map iteration order is non-deterministic, so the field order in the output will vary between exports. While InfluxDB accepts any field order, this can complicate debugging and testing.

Optional: Use sorted keys for deterministic output
+import "sort"
+
 // writeSamples writes samples in InfluxDB line protocol format.
 func writeSamples(w io.Writer, samples []influxSample) error {
 	for _, s := range samples {
 		if _, err := fmt.Fprintf(w, "%s,%s ", s.measurement, s.tags); err != nil {
 			return err
 		}

+		keys := make([]string, 0, len(s.fields))
+		for k := range s.fields {
+			keys = append(keys, k)
+		}
+		sort.Strings(keys)
+
-		first := true
-		for k, v := range s.fields {
-			if !first {
+		for i, k := range keys {
+			if i > 0 {
 				if _, err := fmt.Fprint(w, ","); err != nil {
 					return err
 				}
 			}
-			if _, err := fmt.Fprintf(w, "%s=%g", k, v); err != nil {
+			if _, err := fmt.Fprintf(w, "%s=%g", k, s.fields[k]); err != nil {
 				return err
 			}
-			first = false
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/influxdb.go` around lines 142 - 167, The writeSamples
function uses map iteration over influxSample.fields which yields
non-deterministic field order; to make output reproducible, collect the map keys
into a slice, sort them (e.g., sort.Strings), then iterate the sorted keys to
format fields instead of ranging directly over s.fields. Update writeSamples to
build a []string keys from s.fields, sort it, and use keys[i] to access values
when writing each field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@client/internal/metrics/push.go`:
- Around line 149-156: The call to timer.Reset(interval) can receive
zero/negative intervals from p.resolve() (when p.envInterval==0 and
config.Interval is invalid), causing a busy-loop; update the logic around
p.resolve(ctx) and timer.Reset to clamp the resolved interval to a safe minimum
(e.g. if interval <= 0 then set interval = defaultMinInterval or a constant like
time.Second) before calling timer.Reset; reference p.resolve, p.envInterval and
the timer.Reset call so you locate and replace the unguarded reset with a
clamped value.
- Around line 167-182: The env var override (p.envInterval) currently gets
applied but still falls through to the version gate; change the logic so that
when p.envInterval > 0 the code bypasses the version check entirely. Concretely,
in the RefreshIfNeeded/interval branch adjust control flow around p.envInterval
and isVersionInRange (e.g., set interval = p.envInterval and short-circuit
returning "" and interval or only evaluate isVersionInRange when p.envInterval
== 0) so that p.envInterval overrides remote config period/version filtering;
reference p.envInterval, config.Interval, isVersionInRange and p.agentVersion to
locate the code to modify.

---

Nitpick comments:
In `@client/internal/metrics/influxdb.go`:
- Around line 142-167: The writeSamples function uses map iteration over
influxSample.fields which yields non-deterministic field order; to make output
reproducible, collect the map keys into a slice, sort them (e.g., sort.Strings),
then iterate the sorted keys to format fields instead of ranging directly over
s.fields. Update writeSamples to build a []string keys from s.fields, sort it,
and use keys[i] to access values when writing each field.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 89647518-c342-4244-af60-8e129b3ecf24

📥 Commits

Reviewing files that changed from the base of the PR and between b4cd717 and d272755.

📒 Files selected for processing (4)
  • client/internal/metrics/influxdb.go
  • client/internal/metrics/metrics.go
  • client/internal/metrics/push.go
  • client/internal/metrics/push_test.go

coderabbitai (Bot) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
client/internal/metrics/push.go (1)

271-283: Double-close of gzip writer.

The deferred gz.Close() at line 275 will be called even after the explicit gz.Close() at line 279. While gzip.Writer.Close() is safe to call multiple times, the defer is unnecessary overhead.

Remove redundant defer
 func gzipCompress(data []byte) (*bytes.Buffer, error) {
 	var buf bytes.Buffer
 	gz := gzip.NewWriter(&buf)
-	defer func() { _ = gz.Close() }()
 	if _, err := gz.Write(data); err != nil {
+		_ = gz.Close()
 		return nil, err
 	}
 	if err := gz.Close(); err != nil {
 		return nil, err
 	}
 	return &buf, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 271 - 283, In gzipCompress
remove the redundant deferred close of the gzip writer: delete the defer func()
{ _ = gz.Close() }() and keep the explicit gz.Close() call that checks and
returns any error; ensure gz is closed exactly once (via the explicit if err :=
gz.Close(); err != nil { return nil, err }) so the function still returns &buf
on success and propagates close/write errors from the gzip.Writer.
client/internal/metrics/env.go (1)

74-87: Consider validating URL scheme for security.

The function parses the URL but doesn't validate that it uses an expected scheme (e.g., http or https). A misconfigured URL with an unexpected scheme could cause issues or security concerns.

Optional: Add scheme validation
 func getMetricsServerURL() *url.URL {
 	envURL := os.Getenv(EnvMetricsServerURL)
 	if envURL == "" {
 		return nil
 	}
 	parsed, err := url.Parse(envURL)
 	if err != nil {
 		log.Warnf("invalid metrics server URL from env: %v", err)
 		return nil
 	}
+	if parsed.Scheme != "http" && parsed.Scheme != "https" {
+		log.Warnf("invalid metrics server URL scheme %q, expected http or https", parsed.Scheme)
+		return nil
+	}
 	return parsed
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/env.go` around lines 74 - 87, getMetricsServerURL
currently parses EnvMetricsServerURL but doesn't validate the URL scheme; update
getMetricsServerURL to reject URLs that don't use an allowed scheme (e.g.,
"http" or "https") by checking parsed.Scheme after url.Parse and logging a
warning + returning nil for disallowed schemes. Use the EnvMetricsServerURL
constant and the parsed variable from url.Parse within getMetricsServerURL to
perform this validation and ensure only expected schemes are accepted.
client/internal/metrics/influxdb.go (1)

142-167: Map iteration order may cause non-deterministic output.

The fields map iteration order is non-deterministic in Go. While InfluxDB line protocol doesn't require a specific field order, non-deterministic output can make debugging and testing harder. This is a minor concern since InfluxDB handles it correctly.

Optional: Sort field keys for deterministic output
+import "sort"
+
 func writeSamples(w io.Writer, samples []influxSample) error {
 	for _, s := range samples {
 		if _, err := fmt.Fprintf(w, "%s,%s ", s.measurement, s.tags); err != nil {
 			return err
 		}

+		// Sort field keys for deterministic output
+		keys := make([]string, 0, len(s.fields))
+		for k := range s.fields {
+			keys = append(keys, k)
+		}
+		sort.Strings(keys)
+
 		first := true
-		for k, v := range s.fields {
+		for _, k := range keys {
+			v := s.fields[k]
 			if !first {
 				if _, err := fmt.Fprint(w, ","); err != nil {
 					return err
 				}
 			}
 			if _, err := fmt.Fprintf(w, "%s=%g", k, v); err != nil {
 				return err
 			}
 			first = false
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/influxdb.go` around lines 142 - 167, The writeSamples
function uses map iteration over s.fields which is non-deterministic; to make
output deterministic for testing/debugging, gather the keys from s.fields inside
writeSamples, sort them (e.g., using sort.Strings), and then iterate over the
sorted keys to write fields instead of ranging directly over the map; update the
loop that currently reads "for k, v := range s.fields" to iterate the sorted key
slice and fetch values with s.fields[k].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/metrics/push.go`:
- Around line 196-252: The push method currently calls p.metrics.ExportAndReset
which clears metrics immediately, causing permanent loss if the HTTP send
(p.client.Do) fails; change the flow to avoid clearing until push succeeds: use
a non-destructive export (replace ExportAndReset with an Export that writes to
buf but does not clear, or call p.metrics.Export(&buf) then only call
p.metrics.Reset() after a successful HTTP response), and if your metrics API has
an import/requeue (e.g., Import or AddSamples) use that to re-enqueue samples on
send failure; update push to use gzipCompress on the exported bytes, send via
p.client.Do, and only clear/reset p.metrics after resp.StatusCode is 2xx (or on
confirmed success).

---

Nitpick comments:
In `@client/internal/metrics/env.go`:
- Around line 74-87: getMetricsServerURL currently parses EnvMetricsServerURL
but doesn't validate the URL scheme; update getMetricsServerURL to reject URLs
that don't use an allowed scheme (e.g., "http" or "https") by checking
parsed.Scheme after url.Parse and logging a warning + returning nil for
disallowed schemes. Use the EnvMetricsServerURL constant and the parsed variable
from url.Parse within getMetricsServerURL to perform this validation and ensure
only expected schemes are accepted.

In `@client/internal/metrics/influxdb.go`:
- Around line 142-167: The writeSamples function uses map iteration over
s.fields which is non-deterministic; to make output deterministic for
testing/debugging, gather the keys from s.fields inside writeSamples, sort them
(e.g., using sort.Strings), and then iterate over the sorted keys to write
fields instead of ranging directly over the map; update the loop that currently
reads "for k, v := range s.fields" to iterate the sorted key slice and fetch
values with s.fields[k].

In `@client/internal/metrics/push.go`:
- Around line 271-283: In gzipCompress remove the redundant deferred close of
the gzip writer: delete the defer func() { _ = gz.Close() }() and keep the
explicit gz.Close() call that checks and returns any error; ensure gz is closed
exactly once (via the explicit if err := gz.Close(); err != nil { return nil,
err }) so the function still returns &buf on success and propagates close/write
errors from the gzip.Writer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c19eaf51-10e2-4e05-aa2b-f2a32913175e

📥 Commits

Reviewing files that changed from the base of the PR and between b4cd717 and 1085ad0.

📒 Files selected for processing (5)
  • client/internal/metrics/env.go
  • client/internal/metrics/influxdb.go
  • client/internal/metrics/metrics.go
  • client/internal/metrics/push.go
  • client/internal/metrics/push_test.go

coderabbitai (Bot) left a comment

🧹 Nitpick comments (2)
client/internal/metrics/push.go (1)

272-284: Deferred gz.Close() masks error and is redundant.

Line 276 defers gz.Close() but line 280 explicitly calls gz.Close() to capture the error. The deferred call will execute again after the explicit close, which is harmless but wasteful. More importantly, if the explicit gz.Close() on line 280 returns an error, the deferred close will still run and its error is discarded.

Consider removing the defer and relying solely on the explicit close:

Suggested fix
 func gzipCompress(data []byte) (*bytes.Buffer, error) {
 	var buf bytes.Buffer
 	gz := gzip.NewWriter(&buf)
-	defer func() { _ = gz.Close() }()
 	if _, err := gz.Write(data); err != nil {
+		_ = gz.Close()
 		return nil, err
 	}
 	if err := gz.Close(); err != nil {
 		return nil, err
 	}
 	return &buf, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 272 - 284, In gzipCompress,
remove the deferred gz.Close() and rely on the explicit gz.Close() call to
capture and return any close error: locate the gzipCompress function, delete the
defer func() { _ = gz.Close() }() and ensure the function writes to gz, checks
the write error, then calls gz.Close() once and returns any error from that
Close before returning the buffer.
client/internal/metrics/influxdb.go (1)

119-150: Non-deterministic field order in InfluxDB line protocol output.

The for k, v := range s.fields loop (line 133) iterates over a map with non-deterministic order. While InfluxDB accepts fields in any order, this makes the output non-reproducible across runs, which can complicate debugging and testing.

Consider sorting field keys for consistent output:

Suggested fix for deterministic field order
+import "sort"
+
 func (m *influxDBMetrics) Export(w io.Writer) error {
 	m.mu.Lock()
 	samples := make([]influxSample, len(m.samples))
 	copy(samples, m.samples)
 	m.mu.Unlock()

 	for _, s := range samples {
 		if _, err := fmt.Fprintf(w, "%s,%s ", s.measurement, s.tags); err != nil {
 			return err
 		}

+		keys := make([]string, 0, len(s.fields))
+		for k := range s.fields {
+			keys = append(keys, k)
+		}
+		sort.Strings(keys)
+
-		first := true
-		for k, v := range s.fields {
-			if !first {
+		for i, k := range keys {
+			if i > 0 {
 				if _, err := fmt.Fprint(w, ","); err != nil {
 					return err
 				}
 			}
-			if _, err := fmt.Fprintf(w, "%s=%g", k, v); err != nil {
+			if _, err := fmt.Fprintf(w, "%s=%g", k, s.fields[k]); err != nil {
 				return err
 			}
-			first = false
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/influxdb.go` around lines 119 - 150, The Export
method produces non-deterministic field order because it iterates directly over
the map s.fields; to fix, in influxDBMetrics.Export (inside the loop over
samples) collect the map keys from s.fields into a slice, sort them (e.g.,
sort.Strings), and then iterate over the sorted keys to write fields (replace
the for k, v := range s.fields loop with a deterministic loop over sorted keys),
preserving the existing formatting/escaping logic and the first-flag comma
handling for fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/metrics/influxdb.go`:
- Around line 119-150: The Export method produces non-deterministic field order
because it iterates directly over the map s.fields; to fix, in
influxDBMetrics.Export (inside the loop over samples) collect the map keys from
s.fields into a slice, sort them (e.g., sort.Strings), and then iterate over the
sorted keys to write fields (replace the for k, v := range s.fields loop with a
deterministic loop over sorted keys), preserving the existing
formatting/escaping logic and the first-flag comma handling for fields.

In `@client/internal/metrics/push.go`:
- Around line 272-284: In gzipCompress, remove the deferred gz.Close() and rely
on the explicit gz.Close() call to capture and return any close error: locate
the gzipCompress function, delete the defer func() { _ = gz.Close() }() and
ensure the function writes to gz, checks the write error, then calls gz.Close()
once and returns any error from that Close before returning the buffer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e00590ca-aacb-4792-8a6e-3b6563af10d5

📥 Commits

Reviewing files that changed from the base of the PR and between 1085ad0 and 44edbfd.

📒 Files selected for processing (4)
  • client/internal/metrics/influxdb.go
  • client/internal/metrics/metrics.go
  • client/internal/metrics/push.go
  • client/internal/metrics/push_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/metrics/push_test.go

lixmal added 5 commits March 18, 2026 19:31
- Rename NB_METRICS_ENABLED to NB_METRICS_PUSH_ENABLED to clarify that
  collection is always active (for debug bundles) and only push is opt-in
- Change default config URL from staging to production (ingest.netbird.io)
- Delete broken Prometheus dashboard (used non-existent metric names)
- Delete unused VictoriaMetrics datasource config
- Replace committed .env with .env.example containing placeholder values
- Wire Grafana admin credentials through env vars in docker-compose
- Make metricsStages a pointer to prevent reset-vs-write race on reconnect
- Fix typed-nil interface in debug bundle path (GetClientMetrics)
- Use deterministic field order in InfluxDB Export (sorted keys)
- Replace Authorization header with X-Peer-ID for metrics push
- Fix ingest server timeout to use time.Second instead of float
- Fix gzip double-close, stale comments, trim log levels
- Add tests for influxdb.go and MetricsStages
- Add netbird_login measurement recording login/auth duration to management
  server, with success/failure result tag
- Validate InfluxDB tags against per-measurement allowlists in ingest server
  to prevent arbitrary tag injection
- Cap all duration fields (*_seconds) at 300s instead of only total_seconds
- Add ingest server tests for tag/field validation, bounds, and auth
…s-push

# Conflicts:
#	client/internal/connect.go
#	client/internal/engine.go
#	client/internal/engine_test.go
coderabbitai (Bot) left a comment

Actionable comments posted: 5

♻️ Duplicate comments (3)
client/internal/metrics/infra/ingest/main.go (2)

115-116: ⚠️ Potential issue | 🟠 Major

Use explicit HTTP server timeouts instead of bare ListenAndServe.

http.ListenAndServe here runs with zero server-side read/write/idle timeouts, which can be abused by slow clients to hold sockets indefinitely.

🛠️ Suggested fix
-	if err := http.ListenAndServe(listenAddr, nil); err != nil { //nolint:gosec
+	srv := &http.Server{
+		Addr:              listenAddr,
+		Handler:           nil,
+		ReadHeaderTimeout: 5 * time.Second,
+		ReadTimeout:       15 * time.Second,
+		WriteTimeout:      15 * time.Second,
+		IdleTimeout:       60 * time.Second,
+	}
+	if err := srv.ListenAndServe(); err != nil { //nolint:gosec
 		log.Fatal(err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/ingest/main.go` around lines 115 - 116, Replace
the bare http.ListenAndServe(listenAddr, nil) call with an explicit http.Server
instance configured with sensible timeouts (set ReadTimeout, WriteTimeout, and
IdleTimeout) and the same Addr and Handler (use listenAddr and nil or the
existing mux/handler), then start it via server.ListenAndServe() and handle the
returned error; update the code around the ListenAndServe call in main.go to
construct the server, assign server.Addr = listenAddr, server.Handler = nil (or
the existing handler), set server.ReadTimeout/WriteTimeout/IdleTimeout to
appropriate durations, and use server.ListenAndServe() instead of the bare
function to mitigate slow-client resource exhaustion.

177-190: ⚠️ Potential issue | 🔴 Critical

X-Peer-ID validation is identification, not authentication.

Any caller can forge a 16-char hex X-Peer-ID and write to / if this service is reachable outside a trusted network. Please gate ingest writes with actual authN/authZ (e.g., shared token/HMAC/MTLS or trusted ingress ACLs).

#!/bin/bash
set -euo pipefail

# Verify whether ingest is exposed and whether any auth middleware/token check exists.
fd -t f 'docker-compose*.yml' client/internal/metrics/infra
rg -nP 'ingest|8087|ports:' client/internal/metrics/infra -C3
rg -nP --type=go 'X-Peer-ID|Authorization|Bearer|HMAC|mTLS|validateAuth' client/internal/metrics/infra -C3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/ingest/main.go` around lines 177 - 190,
validateAuth currently only checks format of X-Peer-ID (identification) but does
not authenticate/authorize writes; replace or augment this with real
authN/authZ: either implement token/HMAC verification against a server-side
secret (check Authorization header or HMAC on payload), require mTLS client cert
validation in the HTTP server, or enforce a trusted ingress ACL before accepting
writes to the ingest endpoint; update the validateAuth function (or add
middleware invoked before handlers that accept ingest writes) to verify the
chosen credential mechanism and return 401/403 on failure, and if you
intentionally keep only format checks, rename validateAuth to validatePeerID and
add explicit comments so callers know it does not perform authentication.
client/internal/metrics/push.go (1)

197-252: ⚠️ Potential issue | 🟠 Major

Potential sample-loss window between Export() and Reset() on successful push.

If new samples are recorded after Export() but before Reset(), they can be cleared without ever being sent unless reset is snapshot-aware.

#!/bin/bash
set -euo pipefail

# Inspect metrics interface and concrete Export/Reset behavior.
rg -nP --type=go 'type\s+metricsImplementation\s+interface' client/internal/metrics -C6
rg -nP --type=go 'func\s+\(.*\)\s+Export\(|func\s+\(.*\)\s+Reset\(' client/internal/metrics -C10

# Focus on InfluxDB metrics implementation used by this push path.
rg -nP --type=go 'func\s+\(m \*influxDBMetrics\)\s+Export\(|func\s+\(m \*influxDBMetrics\)\s+Reset\(' client/internal/metrics/influxdb.go -C25

Expected verification outcome: if Reset() truncates the live buffer without correlating to the exported snapshot, this is a real data-loss race and needs an atomic drain/swap API.

🧹 Nitpick comments (1)
client/internal/metrics/infra/.env.example (1)

8-9: Reorder keys to keep dotenv-linter output clean.

dotenv-linter reports key ordering warnings for these blocks. Reordering avoids recurring lint noise in this file.

♻️ Optional reorder to satisfy linter warnings
-# Grafana admin credentials
-GRAFANA_ADMIN_USER=admin
-GRAFANA_ADMIN_PASSWORD=changeme
+# Grafana admin credentials
+GRAFANA_ADMIN_PASSWORD=changeme
+GRAFANA_ADMIN_USER=admin

 # Remote config served by ingest at /config
 # Set CONFIG_METRICS_SERVER_URL to the ingest server's public address to enable
 CONFIG_METRICS_SERVER_URL=
+CONFIG_PERIOD_MINUTES=5
 CONFIG_VERSION_SINCE=0.0.0
 CONFIG_VERSION_UNTIL=99.99.99
-CONFIG_PERIOD_MINUTES=5

Also applies to: 14-16

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/.env.example` around lines 8 - 9, Reorder the
environment variable keys in the .env example blocks so they match the project's
dotenv-linter expected ordering (e.g., alphabetically) to remove lint warnings;
specifically adjust the block containing GRAFANA_ADMIN_USER and
GRAFANA_ADMIN_PASSWORD (and the similar block at lines 14–16) so the keys appear
in the same sorted order used elsewhere in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/metrics/env.go`:
- Around line 83-88: The current parsing of NB_METRICS_SERVER_URL using
url.Parse in the function that handles metrics server override accepts relative
or malformed inputs; after url.Parse(envURL) (see the parsed variable) validate
that parsed.Scheme is non-empty and is either "http" or "https" and that
parsed.Host is non-empty before returning parsed; if those checks fail, call
log.Warnf with a clear message about invalid absolute HTTP(S) URL and return nil
so the override is rejected. Ensure you update the code path that returns parsed
(the block with parsed, err := url.Parse(envURL)) to perform these extra checks.

In `@client/internal/metrics/infra/README.md`:
- Around line 35-37: Update the architecture README to reflect that the ingest
server now also allowlists tag keys per measurement: change the bullet list
under "Validates line protocol" (which currently mentions "Whitelists
measurements & fields") to explicitly state "Whitelists measurements, fields &
tags" and add a short note that the ingest server validates tag keys per
measurement (i.e., tag key allowlist is enforced along with measurement/field
validation in the ingest server).
- Line 17: The markdown has unlabeled fenced code blocks that trigger
markdownlint MD040; update each fenced block in the README examples to include a
language identifier (e.g., ```text or ```bash) so the fences are explicitly
marked. Locate the unlabeled triple-backtick blocks and add an appropriate
language token to the opening fence to satisfy MD040 while preserving the
existing content.
- Around line 41-42: The README bullet claiming "No client-side auth required"
is misleading because the ingest server still requires the X-Peer-ID header;
update both occurrences (the bullet at "No client-side auth required" and the
repeated text later) to a clearer phrase such as "No secret/token-based client
auth; X-Peer-ID identifier header is required" so readers understand an
identifier header is mandatory while no token is expected.

In `@client/internal/metrics/push.go`:
- Around line 125-126: Update the Start() comment so it accurately describes
current behavior: state that setting NB_METRICS_INTERVAL only overrides the
scheduling interval but does not bypass resolve() or RefreshIfNeeded() and does
not ignore version gating; clarify that only explicit force-sending mode
(forceSend) will skip version checks/remote config, while the env var merely
changes how often resolve() runs. Reference Start(), resolve(),
RefreshIfNeeded(), NB_METRICS_INTERVAL and force-sending mode in the comment so
readers understand the exact runtime behavior.

---

Duplicate comments:
In `@client/internal/metrics/infra/ingest/main.go`:
- Around line 115-116: Replace the bare http.ListenAndServe(listenAddr, nil)
call with an explicit http.Server instance configured with sensible timeouts
(set ReadTimeout, WriteTimeout, and IdleTimeout) and the same Addr and Handler
(use listenAddr and nil or the existing mux/handler), then start it via
server.ListenAndServe() and handle the returned error; update the code around
the ListenAndServe call in main.go to construct the server, assign server.Addr =
listenAddr, server.Handler = nil (or the existing handler), set
server.ReadTimeout/WriteTimeout/IdleTimeout to appropriate durations, and use
server.ListenAndServe() instead of the bare function to mitigate slow-client
resource exhaustion.
- Around line 177-190: validateAuth currently only checks format of X-Peer-ID
(identification) but does not authenticate/authorize writes; replace or augment
this with real authN/authZ: either implement token/HMAC verification against a
server-side secret (check Authorization header or HMAC on payload), require mTLS
client cert validation in the HTTP server, or enforce a trusted ingress ACL
before accepting writes to the ingest endpoint; update the validateAuth function
(or add middleware invoked before handlers that accept ingest writes) to verify
the chosen credential mechanism and return 401/403 on failure, and if you
intentionally keep only format checks, rename validateAuth to validatePeerID and
add explicit comments so callers know it does not perform authentication.

---

Nitpick comments:
In `@client/internal/metrics/infra/.env.example`:
- Around line 8-9: Reorder the environment variable keys in the .env example
blocks so they match the project's dotenv-linter expected ordering (e.g.,
alphabetically) to remove lint warnings; specifically adjust the block
containing GRAFANA_ADMIN_USER and GRAFANA_ADMIN_PASSWORD (and the similar block
at lines 14–16) so the keys appear in the same sorted order used elsewhere in
the file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1db4ad3b-9fa1-48e2-8882-4776a2b20bdc

📥 Commits

Reviewing files that changed from the base of the PR and between 44edbfd and c44b797.

⛔ Files ignored due to path filters (1)
  • client/internal/metrics/infra/ingest/go.sum is excluded by !**/*.sum
📒 Files selected for processing (18)
  • client/internal/connect.go
  • client/internal/metrics/env.go
  • client/internal/metrics/influxdb.go
  • client/internal/metrics/influxdb_test.go
  • client/internal/metrics/infra/.env.example
  • client/internal/metrics/infra/.gitignore
  • client/internal/metrics/infra/README.md
  • client/internal/metrics/infra/docker-compose.yml
  • client/internal/metrics/infra/grafana/provisioning/dashboards/json/netbird-influxdb-metrics.json
  • client/internal/metrics/infra/ingest/go.mod
  • client/internal/metrics/infra/ingest/main.go
  • client/internal/metrics/infra/ingest/main_test.go
  • client/internal/metrics/metrics.go
  • client/internal/metrics/push.go
  • client/internal/metrics/push_test.go
  • client/internal/peer/conn.go
  • client/internal/peer/metrics_saver_test.go
  • client/server/debug.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • client/internal/metrics/influxdb.go
  • client/internal/peer/conn.go
  • client/internal/metrics/infra/docker-compose.yml
  • client/internal/metrics/push_test.go
  • client/server/debug.go
  • client/internal/metrics/infra/ingest/go.mod

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
client/internal/metrics/infra/README.md (1)

150-153: ⚠️ Potential issue | 🟡 Minor

Verify .env.example exists, or update the setup step to use the tracked .env.

The quickstart depends on cp .env.example .env; if the template is missing, setup breaks. Please verify and align the instructions accordingly.

```bash
#!/bin/bash
set -euo pipefail

echo "Checking for env templates in metrics infra..."
fd -t f '^\.env(\.example)?$' client/internal/metrics/infra

echo
echo "Tracked env-related files under client/internal/metrics/infra:"
git ls-files client/internal/metrics/infra | rg -n '\.env(\.example)?$' || true

echo
echo "README references to env files:"
rg -n 'env\.example|\.env' client/internal/metrics/infra/README.md -C2
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/README.md` around lines 150 - 153, The README
step that runs "cp .env.example .env" can fail if .env.example is not present;
update the README (client/internal/metrics/infra/README.md) to either (a) verify
and instruct the user to create .env from the tracked .env file (if present) or
(b) instruct to copy from .env.example only when that template exists and
provide the alternative command to create a .env manually; specifically
reference the cp .env.example .env instruction and add a short check/notice
advising to confirm the presence of .env.example (or use the tracked .env)
before running the copy.
client/internal/metrics/push.go (1)

200-252: ⚠️ Potential issue | 🟠 Major

Reintroduced sample-loss window between export and reset.

Export() snapshots current samples, then the HTTP request runs, and only after success Reset() clears the live buffer. Metrics recorded between those two calls are dropped on success.

Suggested direction:

```diff
- if err := p.metrics.Export(&buf); err != nil {
+ if err := p.metrics.ExportAndReset(&buf); err != nil {
    return fmt.Errorf("export metrics: %w", err)
  }
...
- p.metrics.Reset()
  return nil
```

(Requires adding an atomic ExportAndReset/Drain operation in the metrics implementation and interface.)
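
A minimal sketch of such an atomic drain, assuming the collector buffers line-protocol samples in a mutex-guarded slice (type and field names are illustrative):

```go
package metrics

import (
	"fmt"
	"io"
	"sync"
)

// Collector stands in for the concrete type behind p.metrics; only the drain
// pattern matters here.
type Collector struct {
	mu      sync.Mutex
	samples []string // buffered InfluxDB line-protocol samples
}

// ExportAndReset writes all buffered samples to w and clears the buffer in a
// single critical section, so samples recorded while the HTTP push is in
// flight land in the fresh buffer instead of being dropped.
func (c *Collector) ExportAndReset(w io.Writer) error {
	c.mu.Lock()
	samples := c.samples
	c.samples = nil
	c.mu.Unlock()

	for _, s := range samples {
		if _, err := fmt.Fprintln(w, s); err != nil {
			return fmt.Errorf("write sample: %w", err)
		}
	}
	return nil
}
```

If the push then fails, the caller could re-queue the drained samples to preserve the current retry-on-failure behavior.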

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 200 - 252, The current flow
calls p.metrics.Export(&buf) and only calls p.metrics.Reset() after a successful
HTTP push, which drops samples recorded between Export and Reset; change the
metrics API to provide an atomic export-and-reset (e.g., ExportAndReset(ctx,
&buf) or Drain(&buf)) on the metrics implementation and interface, then call
that single atomic method in push.go instead of separate Export and Reset so
samples recorded during the push are captured safely; update the implementation
of p.metrics (the concrete type behind p.metrics) to perform the
snapshot-and-clear atomically and update any callers to use the new
ExportAndReset/Drain method.
🧹 Nitpick comments (1)
client/internal/engine.go (1)

822-824: Lower sync timing log level to avoid info-log churn.

This path runs for every sync; using Info will be noisy in normal operation. Prefer Debug for per-sync duration logs.

Suggested tweak:

```diff
- log.Infof("sync finished in %s", duration)
+ log.Debugf("sync finished in %s", duration)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/engine.go` around lines 822 - 824, The per-sync duration log
currently uses log.Infof and is too noisy; change the log level to debug by
replacing the Infof call with log.Debugf (i.e., update the call around the
duration := time.Since(started) / log.Infof("sync finished in %s", duration) to
use Debugf) so per-sync timing is logged at debug level while keeping
e.clientMetrics.RecordSyncDuration(e.ctx, duration) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/metrics/infra/README.md`:
- Around line 130-133: Clarify that the described precedence for
NB_METRICS_SERVER_URL / NB_METRICS_INTERVAL vs remote config values applies only
after eligibility checks (e.g., isVersionInRange and remote-config
refresh/eligibility gating) have passed; update the README text to add one
sentence stating that environment variables do not bypass eligibility/version
gating or remote-config refresh logic (for example: NB_METRICS_INTERVAL does not
bypass isVersionInRange or remote-config refresh checks), so env vars only
select the value if the push is allowed by those checks.

In `@management/cmd/pprof.go`:
- Around line 26-30: The pprof goroutine uses log.Fatalf which exits the whole
process on bind failure; update the pprof function (pprof) to avoid terminating
the process by replacing log.Fatalf("pprof server: %v", err) with a non-fatal
error log (e.g., log.Errorf or log.Warnf) and simply return so the application
continues running when http.ListenAndServe fails (remember this is launched via
go pprof(addr) from init()).

In `@signal/cmd/pprof.go`:
- Around line 13-30: Remove the automatic startup in pprof.go by deleting the
init() that calls pprof(pprofAddr()) (so pprof.go should only expose pprofAddr()
and pprof(listenAddr string) but not start itself); ensure the single pprof
startup path is the explicit startPprof() invoked from RunE (or the existing
startPprof entrypoint) so only one listener is created and the process won't
fatally exit due to address contention between localhost/127.0.0.1.
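
Taking the two pprof findings together, a sketch of the intended shape for either cmd package; pprofAddr, pprof, and startPprof are the names used in the prompts above, while the env var and default address are assumptions:

```go
package cmd

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
	"os"

	log "github.com/sirupsen/logrus"
)

// pprofAddr returns the listen address; NB_PPROF_ADDR is an assumed override.
func pprofAddr() string {
	if addr := os.Getenv("NB_PPROF_ADDR"); addr != "" {
		return addr
	}
	return "localhost:6060"
}

// pprof serves the profiling endpoint; a bind failure is logged and the
// goroutine simply returns instead of taking the whole process down.
func pprof(listenAddr string) {
	if err := http.ListenAndServe(listenAddr, nil); err != nil {
		log.Errorf("pprof server: %v", err)
	}
}

// startPprof is the single explicit entry point, invoked from the command's
// RunE; there is intentionally no init() starting a second listener.
func startPprof() {
	go pprof(pprofAddr())
}
```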

---

Duplicate comments:
In `@client/internal/metrics/infra/README.md`:
- Around line 150-153: The README step that runs "cp .env.example .env" can fail
if .env.example is not present; update the README
(client/internal/metrics/infra/README.md) to either (a) verify and instruct the
user to create .env from the tracked .env file (if present) or (b) instruct to
copy from .env.example only when that template exists and provide the
alternative command to create a .env manually; specifically reference the cp
.env.example .env instruction and add a short check/notice advising to confirm
the presence of .env.example (or use the tracked .env) before running the copy.

In `@client/internal/metrics/push.go`:
- Around line 200-252: The current flow calls p.metrics.Export(&buf) and only
calls p.metrics.Reset() after a successful HTTP push, which drops samples
recorded between Export and Reset; change the metrics API to provide an atomic
export-and-reset (e.g., ExportAndReset(ctx, &buf) or Drain(&buf)) on the metrics
implementation and interface, then call that single atomic method in push.go
instead of separate Export and Reset so samples recorded during the push are
captured safely; update the implementation of p.metrics (the concrete type
behind p.metrics) to perform the snapshot-and-clear atomically and update any
callers to use the new ExportAndReset/Drain method.

---

Nitpick comments:
In `@client/internal/engine.go`:
- Around line 822-824: The per-sync duration log currently uses log.Infof and is
too noisy; change the log level to debug by replacing the Infof call with
log.Debugf (i.e., update the call around the duration := time.Since(started) /
log.Infof("sync finished in %s", duration) to use Debugf) so per-sync timing is
logged at debug level while keeping e.clientMetrics.RecordSyncDuration(e.ctx,
duration) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d759f7b-2489-437a-acd2-283cb0bf03d3

📥 Commits

Reviewing files that changed from the base of the PR and between c44b797 and 59f8c5d.

📒 Files selected for processing (10)
  • client/internal/connect.go
  • client/internal/debug/debug.go
  • client/internal/engine.go
  • client/internal/engine_test.go
  • client/internal/metrics/env.go
  • client/internal/metrics/infra/README.md
  • client/internal/metrics/push.go
  • management/cmd/pprof.go
  • shared/management/client/grpc.go
  • signal/cmd/pprof.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • shared/management/client/grpc.go
  • client/internal/metrics/env.go
  • client/internal/debug/debug.go

Collaborator

@mlsmaycon mlsmaycon left a comment

We should go with metrics bound to another URL

@mlsmaycon mlsmaycon changed the title from [client] Feature/client metrics push to [client] Feature/client metrics on Mar 22, 2026
@mlsmaycon mlsmaycon merged commit 91f0d5c into main Mar 22, 2026
57 of 58 checks passed
@mlsmaycon mlsmaycon deleted the feature/client-metrics-push branch March 22, 2026 11:45

3 participants