
[client] Feature/client metrics #5512

Merged
mlsmaycon merged 58 commits into main from feature/client-metrics-push on Mar 22, 2026

Conversation

pappz (Collaborator) commented Mar 5, 2026

Describe your changes

Adds opt-in client metrics push for tracking connection performance. Metrics are collected in InfluxDB line protocol format, buffered locally, and pushed periodically to an ingest server.

  • Disabled by default; enable push via NB_METRICS_PUSH_ENABLED=true
  • Collection is always active (for debug bundles); only the push is opt-in
  • Push behavior is controlled by a remote configuration endpoint with env var overrides for development/testing
  • Buffer is hard-limited by age (5 days) and size (5 MB) for bounded memory usage (see the trimming sketch after this list)
  • Debug bundles include collected metrics in metrics.txt
  • Includes an InfluxDB ingest server, Grafana dashboards, and a Docker Compose dev stack
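
For a rough picture of how such a bounded buffer can work, here is a minimal sketch assuming the 5-day / 5 MB limits from the list above; the types and the trimming logic are illustrative, not the PR's actual code in the metrics package.

```go
// Sketch of age- and size-bounded sample buffering. The limits (5 days, 5 MB)
// come from the PR description; everything else (type and function names) is
// illustrative rather than the real trimming rules in influxdb.go.
package main

import (
	"fmt"
	"time"
)

const (
	maxSampleAge  = 5 * 24 * time.Hour
	maxBufferSize = 5 * 1024 * 1024 // bytes
)

type sample struct {
	line string    // InfluxDB line-protocol entry
	ts   time.Time // when the sample was recorded
}

// trim drops samples older than maxSampleAge, then drops the oldest remaining
// samples until the buffer fits within maxBufferSize.
func trim(buf []sample, now time.Time) []sample {
	kept := buf[:0]
	for _, s := range buf {
		if now.Sub(s.ts) <= maxSampleAge {
			kept = append(kept, s)
		}
	}
	size := 0
	for _, s := range kept {
		size += len(s.line)
	}
	for len(kept) > 0 && size > maxBufferSize {
		size -= len(kept[0].line)
		kept = kept[1:]
	}
	return kept
}

func main() {
	now := time.Now()
	buf := []sample{
		{line: "netbird_sync duration_seconds=0.2", ts: now.Add(-6 * 24 * time.Hour)}, // too old, dropped
		{line: "netbird_login duration_seconds=1.1", ts: now},
	}
	fmt.Println(len(trim(buf, now))) // 1
}
```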

Metrics Collected

  • netbird_peer_connection: fields signaling_to_connection_seconds, connection_to_wg_handshake_seconds, total_seconds (per-peer connection stage timings)
  • netbird_sync: field duration_seconds (management sync message processing time)
  • netbird_login: field duration_seconds (login/auth duration, success/failure)

All metrics include tags: deployment_type, version, os, arch, peer_id (hashed).
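
For a concrete view of the wire format, the sketch below prints one netbird_peer_connection sample in InfluxDB line protocol with the tag set above. The tag and field values are invented for illustration, and the snippet is not the PR's exporter.

```go
// Illustration only: one netbird_peer_connection sample rendered as an
// InfluxDB line-protocol entry (measurement,tags fields timestamp). Values
// are made up; the PR's actual formatting lives in influxdb.go.
package main

import (
	"fmt"
	"time"
)

func main() {
	measurement := "netbird_peer_connection"
	tags := "deployment_type=selfhosted,version=0.39.0,os=linux,arch=amd64,peer_id=a1b2c3d4e5f60718"
	fields := "signaling_to_connection_seconds=1.42,connection_to_wg_handshake_seconds=0.37,total_seconds=1.79"
	// Line-protocol timestamps default to nanoseconds; the precision the PR's
	// exporter actually uses is not shown here.
	fmt.Printf("%s,%s %s %d\n", measurement, tags, fields, time.Now().UnixNano())
}
```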

Privacy

  • Peer IDs are truncated SHA-256 hashes of WireGuard public keys (not reversible)
  • No IP addresses, peer names, or configuration details are sent
  • Connection pair IDs are deterministic hashes (same pair produces same ID from either side)
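
A minimal sketch of how these two properties can be obtained, assuming truncation to 8 bytes (16 hex chars, matching the ingest validation below) and an order-independent pair hash; the PR's actual derivation may differ in detail.

```go
// Sketch only: one plausible way to derive an anonymised peer ID and a
// deterministic connection pair ID. Only the stated properties (truncated
// SHA-256, 16 hex chars, order-independent pair ID) are taken from the PR.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// anonPeerID hashes a WireGuard public key and keeps the first 8 bytes
// (16 hex characters), so the original key cannot be recovered.
func anonPeerID(wgPubKey string) string {
	sum := sha256.Sum256([]byte(wgPubKey))
	return hex.EncodeToString(sum[:8])
}

// pairID is order-independent: both peers compute the same value for the same
// connection, which allows deduplicating reconnect samples.
func pairID(a, b string) string {
	lo, hi := a, b
	if lo > hi {
		lo, hi = hi, lo
	}
	sum := sha256.Sum256([]byte(lo + "|" + hi))
	return hex.EncodeToString(sum[:8])
}

func main() {
	p1 := anonPeerID("pubkey-A")
	p2 := anonPeerID("pubkey-B")
	fmt.Println(p1, p2, pairID(p1, p2) == pairID(p2, p1)) // true
}
```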

Ingest Server Validation

  • Allowlisted measurements, fields, and tags (rejects unknown)
  • Duration fields capped at 300s
  • Tag values capped at 64 chars
  • Peer ID format validated (16 hex chars)
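
Illustrative sketch of the kind of checks described above; the identifier names are hypothetical, and the real validation lives in the ingest server (client/internal/metrics/infra/ingest/main.go).

```go
// Illustrative only: allowlisted measurements/fields, duration bounds, tag
// length caps, and peer ID format checks as listed in the PR description.
package main

import (
	"fmt"
	"regexp"
)

var (
	allowedFields = map[string]map[string]bool{
		"netbird_peer_connection": {
			"signaling_to_connection_seconds":    true,
			"connection_to_wg_handshake_seconds": true,
			"total_seconds":                      true,
		},
		"netbird_sync":  {"duration_seconds": true},
		"netbird_login": {"duration_seconds": true},
	}
	peerIDRe = regexp.MustCompile(`^[0-9a-f]{16}$`)
)

const (
	maxDurationSeconds = 300.0
	maxTagValueLen     = 64
)

func validateField(measurement, field string, value float64) error {
	fields, ok := allowedFields[measurement]
	if !ok {
		return fmt.Errorf("unknown measurement %q", measurement)
	}
	if !fields[field] {
		return fmt.Errorf("unknown field %q for measurement %q", field, measurement)
	}
	if value < 0 || value > maxDurationSeconds {
		return fmt.Errorf("field %s=%g out of bounds", field, value)
	}
	return nil
}

func validateTagValue(v string) error {
	if len(v) > maxTagValueLen {
		return fmt.Errorf("tag value longer than %d chars", maxTagValueLen)
	}
	return nil
}

func validatePeerID(id string) error {
	if !peerIDRe.MatchString(id) {
		return fmt.Errorf("peer id %q is not 16 hex chars", id)
	}
	return nil
}

func main() {
	fmt.Println(validateField("netbird_sync", "duration_seconds", 1.2)) // <nil>
	fmt.Println(validateTagValue("linux"))                              // <nil>
	fmt.Println(validatePeerID("a1b2c3d4e5f60718"))                     // <nil>
}
```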

Environment Variables

  • NB_METRICS_PUSH_ENABLED (default: false): Enable metrics push to the backend
  • NB_METRICS_FORCE_SENDING (default: false): Skip remote config and push unconditionally
  • NB_METRICS_CONFIG_URL (default: https://ingest.netbird.io/config): Remote push config endpoint
  • NB_METRICS_SERVER_URL (default: from remote config): Override ingest server URL
  • NB_METRICS_INTERVAL (default: from remote config): Override push interval (e.g. "30m", "1h")
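
As a small usage sketch, an NB_METRICS_INTERVAL-style value can be parsed with time.ParseDuration; in the real client an empty value means the remote config interval is used, so the fallback constant below is only a placeholder.

```go
// Sketch, not the PR's code: parsing an NB_METRICS_INTERVAL-style override
// ("30m", "1h"). The fallback value is a placeholder; in the actual client an
// empty env var defers to the remote config interval instead.
package main

import (
	"fmt"
	"os"
	"time"
)

const fallbackInterval = 1 * time.Hour // illustrative placeholder only

func pushInterval() time.Duration {
	raw := os.Getenv("NB_METRICS_INTERVAL")
	if raw == "" {
		return fallbackInterval
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		// Guard against zero/negative intervals so a timer reset with this
		// value cannot busy-loop (see the review notes further down).
		return fallbackInterval
	}
	return d
}

func main() {
	fmt.Println(pushInterval())
}
```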

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Summary by CodeRabbit

  • New Features

    • Added anonymized client metrics (connection stages, sync duration, login timing) with optional periodic push and local ingest endpoint.
    • Exposed metrics in debug bundles and added pprof profiling support for diagnostics.
    • Included a Grafana dashboard and local metrics stack for visualization and local development.
  • Documentation

    • Added comprehensive metrics documentation, config templates, and setup instructions for the local metrics environment.

pappz added 20 commits January 12, 2026 12:48
Implements a comprehensive client metrics system to track peer connection
stages and performance. The system supports multiple backend implementations
(OpenTelemetry, VictoriaMetrics, and no-op) and tracks detailed connection
stage durations from creation through WireGuard handshake.

Key changes:
- Add metrics package with pluggable backend implementations
- Implement OpenTelemetry metrics backend
- Implement VictoriaMetrics metrics backend
- Add no-op metrics implementation for disabled state
- Track connection stages: creation, semaphore, signaling, connection ready, and WireGuard handshake
- Move WireGuard watcher functionality to conn.go
- Refactor engine to integrate metrics tracking
- Add metrics export endpoint in debug server
# Conflicts:
#	client/internal/debug/debug.go
#	client/internal/engine.go
#	client/server/debug.go
Integrate Netbird version into VictoriaMetrics backend and metrics labels. Update `ClientMetrics` constructor and metric name formatting to include version information.
Introduce `RecordSyncDuration` for measuring sync message processing time. Update all metrics implementations (VictoriaMetrics, no-op) to support the new method. Refactor `ClientMetrics` to use `AgentInfo` for static agent data.
…uctor

Eliminate unused `noopMetrics` and refactor `ClientMetrics` to always use the VictoriaMetrics implementation. Update associated logic to reflect these changes.
Calculate total duration for both initial connections and reconnections, accounting for different timestamp scenarios. Update `Export` method to include Prometheus HELP comments.
# Conflicts:
#	client/internal/engine.go
#	client/internal/peer/conn.go
The semaphore queuing stage (Created → SemaphoreAcquired) is no longer
tracked. Connection metrics now start from SignalingReceived. Updated
docs and Grafana dashboard accordingly.
…ility

Introduce remoteconfig.Manager that fetches a remote JSON config to control
metrics push interval and restrict pushing to a specific agent version
range. When NB_METRICS_INTERVAL is set, remote config is bypassed
entirely for local override.
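
A hedged sketch of the config shape this commit describes; the JSON field names are guesses based on the CONFIG_* variables in the infra .env template later in this thread, not necessarily the manager's actual schema.

```go
// Sketch of a remote push config with server URL, version range, and interval,
// as described in the commit message above. Field names are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type remoteConfig struct {
	ServerURL     string `json:"server_url"`
	VersionSince  string `json:"version_since"`
	VersionUntil  string `json:"version_until"`
	PeriodMinutes int    `json:"period_minutes"`
}

func (c remoteConfig) interval() time.Duration {
	return time.Duration(c.PeriodMinutes) * time.Minute
}

func main() {
	raw := []byte(`{"server_url":"https://ingest.example.com","version_since":"0.0.0","version_until":"99.99.99","period_minutes":5}`)
	var cfg remoteConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.ServerURL, cfg.interval())
}
```
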
coderabbitai (Bot) commented Mar 5, 2026

📝 Walkthrough

Walkthrough

Adds a new client metrics subsystem (collection, buffering, export) with an InfluxDB backend, periodic push with remote config, peer-connection instrumentation, debug bundle integration, ingest+infra for local development, and minor API additions to management client and test workflow.

Changes

  • Core metrics types & config (client/internal/metrics/connection_type.go, client/internal/metrics/deployment_type.go, client/internal/metrics/env.go, client/internal/metrics/metrics.go, client/internal/metrics/metrics_default.go, client/internal/metrics/metrics_js.go): Adds AgentInfo, ClientMetrics, ConnectionStageTimestamps, deployment detection, env-driven config constants/helpers, and a build-tag aware constructor (native vs WASM nil).
  • InfluxDB backend & tests (client/internal/metrics/influxdb.go, client/internal/metrics/influxdb_test.go): Implements the InfluxDB line-protocol backend, sample buffering, trimming rules, Export/Reset, and tests validating formatting, trimming, and recorded metrics.
  • Push orchestration & tests (client/internal/metrics/push.go, client/internal/metrics/push_test.go): Implements Push and PushConfig, the periodic export loop, gzip POST to the server with X-Peer-ID, version gating, server/interval resolution, and extensive unit tests.
  • Remote config manager & tests (client/internal/metrics/remoteconfig/manager.go, .../manager_test.go): Adds a remote config fetcher with caching/min-refresh, JSON parsing into Config (server URL, version range, interval), and tests for caching and error cases.
  • Engine & connect wiring (client/internal/connect.go, client/internal/engine.go, client/internal/engine_test.go): Wires ClientMetrics into ConnectClient/EngineServices, starts/stops the push lifecycle, updates agent info after management connection, records login/sync durations, exposes Engine.GetClientMetrics().
  • Peer instrumentation (client/internal/peer/conn.go, client/internal/peer/handshaker.go, client/internal/peer/metrics_saver.go, client/internal/peer/wg_watcher.go, client/internal/peer/..._test.go): Adds a MetricsRecorder interface and MetricsStages struct, propagates timestamps through the handshaker/wg watcher, records signaling/ready/handshake timestamps, and emits connection metrics. Tests updated/added.
  • Debug bundle integration (client/internal/debug/debug.go, client/server/debug.go): Adds a MetricsExporter interface to the debug bundle generator deps and embeds exported metrics (metrics.txt) into debug archives.
  • Management client API (shared/management/client/client.go, shared/management/client/grpc.go, shared/management/client/mock.go): Adds GetServerURL() to the Client interface and implements the accessor in GrpcClient and MockClient.
  • Ingest service & infra (client/internal/metrics/infra/ingest/main.go, .../Dockerfile, go.mod, main_test.go): Adds an HTTP ingest service that validates InfluxDB line protocol, authenticates via X-Peer-ID, forwards to the InfluxDB Write API, exposes /config and /health endpoints, and includes unit tests.
  • Local infra & provisioning (client/internal/metrics/infra/docker-compose.yml, .../grafana/..., influxdb/scripts/create-tokens.sh, .env.example, README.md, .gitignore): Adds a Docker Compose stack (ingest, influxdb, grafana), Grafana provisioning (datasource/dashboard JSON), a token script, an env template, and a README for running the local metrics stack.
  • CI: Windows tests script (.github/workflows/golang-test-windows.yml): Replaces the long inline test invocation with a generated PowerShell/command script that collects the package list and runs tests via run-tests.cmd, capturing output to test-out.txt.

Sequence Diagram(s)

sequenceDiagram
    participant Conn as Conn
    participant Handshaker as Handshaker
    participant MetricsStages as MetricsStages
    participant Engine as Engine
    participant Metrics as ClientMetrics

    Conn->>MetricsStages: Initialize()
    Conn->>Handshaker: NewHandshaker(..., metricsStages)
    Handshaker->>MetricsStages: RecordSignalingReceived()
    Handshaker->>MetricsStages: RecordConnectionReady(now)
    Conn->>MetricsStages: RecordWGHandshakeSuccess(now)
    Conn->>Engine: recordConnectionMetrics()
    Engine->>Metrics: RecordConnectionStages(pubKey, type, isReconn, timestamps)
    Metrics->>Metrics: anonymize peer IDs, buffer sample
sequenceDiagram
    participant Engine as Engine
    participant ClientMetrics as ClientMetrics
    participant Push as Push
    participant RemoteConfig as RemoteConfig
    participant Ingest as Ingest/HTTP
    participant Influx as InfluxDB

    Engine->>ClientMetrics: StartPush(ctx, cfg)
    ClientMetrics->>Push: NewPush(...)
    Push->>RemoteConfig: RefreshIfNeeded()
    RemoteConfig->>Ingest: GET /config (if present)
    Push->>ClientMetrics: Export() -> line-protocol
    Push->>Push: gzipCompress(payload)
    Push->>Ingest: POST /write with X-Peer-ID
    Ingest->>Influx: Forward to InfluxDB Write API
    Influx-->>Push: 2xx success
    Push->>ClientMetrics: Reset() on success

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • pappz

Poem

🐰 Hop, I gathered every timestamp and thread,

From signaling sparks to handshakes ahead.
I nibble bytes, gzip them snug and tight,
Send them upstream where dashboards light.
A carrot-toast to metrics now in flight!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 19.42%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check (❓ Inconclusive): The title '[client] Feature/client metrics' is vague and generic, using non-descriptive terms that don't convey the specific purpose of the changeset. Resolution: revise the title to be more specific about the primary change, e.g. 'Add opt-in client metrics collection and push to ingest server' or 'Implement client-side metrics tracking for connection performance'.
✅ Passed checks (1 passed)
  • Description check (✅ Passed): The PR description is comprehensive and well-structured. It explains the feature, provides detailed configuration options, lists collected metrics with descriptions, addresses privacy concerns, and includes validation details. The checklist is properly completed with the feature enhancement box checked.


pappz (Collaborator, Author) commented Mar 5, 2026

@RabbitAI review this draft PR

pappz and others added 8 commits March 5, 2026 15:06
Replace NewClientMetrics in metrics.go with a WASM-specific stub in metrics_js.go, returning nil for compatibility with JS builds. Simplify method usage for WASM targets.
Replace cumulative Prometheus histograms with timestamped point-in-time
samples that are pushed once and cleared. This fixes metrics for sparse
events (connections/syncs that happen once at startup) where rate() and
increase() produced incorrect or empty results.

Changes:
- Switch from VictoriaMetrics histogram library to raw Prometheus text
  format with explicit millisecond timestamps
- Reset samples after successful push (no resending stale data)
- Rename connection_to_handshake → connection_to_wg_handshake
- Add netbird_peer_connection_count metric for ICE vs Relay tracking
- Simplify dashboard: point-based scatter plots, donut pie chart
- Add maxStalenessInterval=1m to VictoriaMetrics to prevent forward-fill
- Fix deployment_type Unknown returning "selfhosted" instead of "unknown"
- Fix inverted shouldPush condition in push.go
Add influxdb.go with timestamped line protocol export for sparse
one-shot events. Restore victoria.go to use proper Prometheus
histograms. Update Grafana dashboards, add InfluxDB datasource,
and update docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix StopPush not clearing push state, preventing restart
- Fix race condition reading currentConnPriority without lock in recordConnectionMetrics
- Fix stale comment referencing old metrics server URL
- Update docker-compose for InfluxDB: add scoped tokens, .env config, init scripts
- Rename docker-compose.victoria.yml to docker-compose.yml
Introduce peer_id and connection_pair_id tags to InfluxDB metrics.
Public keys are hashed (truncated SHA-256) for anonymisation. The
connection pair ID is deterministic regardless of which side computes
it, enabling deduplication of reconnections in the ICE vs Relay
dashboard. Also pin Grafana to v11.6.0 for file-based provisioning
and fix datasource UID references.
coderabbitai (Bot) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
client/internal/metrics/infra/README.md (3)

32-39: Optional: Add language specifier to fenced code block for linter compliance.

The ASCII flow diagram is inside a fenced code block without a language specifier. Adding text after the opening backticks would silence the markdownlint warning.

📝 Proposed fix
-```
+```text
 Client ──POST──▶ Ingest Server (:8087) ──▶ InfluxDB (internal)
                   │
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/README.md` around lines 32 - 39, Update the
fenced code block in client/internal/metrics/infra/README.md that contains the
ASCII flow diagram so the opening fence includes a language specifier (use
"text"); locate the block starting with ``` and change it to ```text to satisfy
markdownlint and silence the linter warning while leaving the diagram contents
unchanged.

167-170: Optional: Show explicit token parameter in InfluxDB query example.

The comment mentions "using admin token from .env", but the command doesn't show how to pass it. Adding an explicit --token parameter would make the example more self-contained for developers unfamiliar with the InfluxDB CLI.

📝 Proposed enhancement
 # Query via InfluxDB (using admin token from .env)
 docker compose exec influxdb influx query \
   'from(bucket: "metrics") |> range(start: -1h)' \
-  --org netbird
+  --org netbird \
+  --token <INFLUXDB_ADMIN_TOKEN>

Alternatively, clarify in the comment whether the token is auto-discovered within the container or needs to be explicitly passed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/README.md` around lines 167 - 170, Update the
InfluxDB query example to explicitly show how to pass the admin token: modify
the `docker compose exec influxdb influx query` command to include the `--token`
flag (e.g., `--token $INFLUX_ADMIN_TOKEN`) so readers know where to supply the
token, or alternately add a one-line note clarifying whether the token is
auto-discovered inside the container; reference the exact command string `docker
compose exec influxdb influx query` and the `--token` flag in your change.

17-26: Optional: Add language specifier to fenced code block for linter compliance.

The ASCII diagram is inside a fenced code block without a language specifier. Adding text after the opening backticks would silence the markdownlint warning while preserving readability.

📝 Proposed fix
-```
+```text
 Daemon Layer (connect.go)
   ├─ Creates ClientMetrics instance once
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/README.md` around lines 17 - 26, Add a language
specifier to the fenced code block in client/internal/metrics/infra/README.md
(the ASCII diagram containing "Daemon Layer (connect.go)" and "Engine Layer
(engine.go)") by changing the opening fence from ``` to ```text so markdownlint
stops warning; keep the diagram content unchanged and only modify the opening
backticks to include "text".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/debug/debug.go`:
- Line 56: Update the manifest description for "metrics.txt" to note it is only
included when metrics collection is enabled and there are buffered samples
(i.e., when addMetrics() writes output from Export() rather than when Export()
returns no data); change the manifest string near the metrics file description
to explicitly mention that the file is skipped if Export() returns no samples,
and ensure references to addMetrics() and Export() are accurate so readers
understand the empty-buffer case.

---

Nitpick comments:
In `@client/internal/metrics/infra/README.md`:
- Around line 32-39: Update the fenced code block in
client/internal/metrics/infra/README.md that contains the ASCII flow diagram so
the opening fence includes a language specifier (use "text"); locate the block
starting with ``` and change it to ```text to satisfy markdownlint and silence
the linter warning while leaving the diagram contents unchanged.
- Around line 167-170: Update the InfluxDB query example to explicitly show how
to pass the admin token: modify the `docker compose exec influxdb influx query`
command to include the `--token` flag (e.g., `--token $INFLUX_ADMIN_TOKEN`) so
readers know where to supply the token, or alternately add a one-line note
clarifying whether the token is auto-discovered inside the container; reference
the exact command string `docker compose exec influxdb influx query` and the
`--token` flag in your change.
- Around line 17-26: Add a language specifier to the fenced code block in
client/internal/metrics/infra/README.md (the ASCII diagram containing "Daemon
Layer (connect.go)" and "Engine Layer (engine.go)") by changing the opening
fence from ``` to ```text so markdownlint stops warning; keep the diagram
content unchanged and only modify the opening backticks to include "text".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52cd1f1c-bc2d-4674-92e2-7bc4528efdb0

📥 Commits

Reviewing files that changed from the base of the PR and between d8118df and 804cd5d.

📒 Files selected for processing (4)
  • client/internal/debug/debug.go
  • client/internal/metrics/infra/README.md
  • client/internal/metrics/remoteconfig/manager.go
  • client/internal/peer/metrics_saver.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • client/internal/metrics/remoteconfig/manager.go
  • client/internal/peer/metrics_saver.go

coderabbitai (Bot) left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/metrics/push.go`:
- Around line 192-246: The push method in Push (func (p *Push) push) currently
calls p.metrics.Export(...) and later p.metrics.Reset(), which can drop samples
recorded between the two calls; change the metrics API to provide an atomic
drain/swap operation (e.g., ExportAndReset or Drain that returns the snapshot
buffer) and update Push.push to call that single atomic method instead of Export
followed by Reset so the snapshot and clearing happen atomically and no samples
are lost.
- Around line 167-181: The env override currently still early-returns when
RefreshIfNeeded() fails or when isVersionInRange() is false; change the logic so
that when p.envInterval > 0 you still call p.configManager.RefreshIfNeeded(ctx)
to resolve push URL/metadata but do NOT make the interval decision or skip
pushes based on config==nil or isVersionInRange(); specifically, set interval =
p.envInterval first, call RefreshIfNeeded to populate config if available (use
config only for URL/endpoint resolution), and bypass the config nil-check and
the isVersionInRange(p.agentVersion, config.VersionSince, config.VersionUntil)
check when p.envInterval > 0 so the NB_METRICS_INTERVAL override works even if
RefreshIfNeeded fails or the agent is out of configured version range.
- Around line 149-156: The timer is being reset with whatever p.resolve(ctx)
returns which can be zero or negative; clamp the returned interval before
calling timer.Reset to a safe minimum (e.g., at least 1s or another configured
MIN_PUSH_INTERVAL) to avoid tight busy-loops. Update the code around
p.resolve(ctx) and timer.Reset: call p.resolve(ctx) -> validate interval (if <=
0 then set to MIN_PUSH_INTERVAL) -> then call timer.Reset(interval); ensure any
MIN_PUSH_INTERVAL constant is defined near the push logic and referenced in the
push/p.resolve flow so push(ctx, pushURL) and timer.Reset use the validated
value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f1058da-2471-4b0e-927f-049a30472a13

📥 Commits

Reviewing files that changed from the base of the PR and between 804cd5d and b4cd717.

📒 Files selected for processing (1)
  • client/internal/metrics/push.go

coderabbitai (Bot) left a comment

♻️ Duplicate comments (2)
client/internal/metrics/push.go (2)

149-156: ⚠️ Potential issue | 🟠 Major

Clamp interval before timer.Reset to prevent busy-loop.

While NewPush validates envInterval for force-sending mode, resolve() can still return a zero or negative interval from config.Interval (remote config) when p.envInterval == 0. This would cause timer.Reset(0) to fire immediately, creating a busy-loop.

Suggested guard
 		pushURL, interval := p.resolve(ctx)
+		if interval <= 0 {
+			log.Warnf("invalid metrics push interval %s, falling back to %s", interval, defaultPushInterval)
+			interval = defaultPushInterval
+		}
 		if pushURL != "" {
 			if err := p.push(ctx, pushURL); err != nil {
 				log.Errorf("failed to push metrics: %v", err)
 			}
 		}
 
 		timer.Reset(interval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 149 - 156, The call to
timer.Reset(interval) can receive zero/negative intervals from p.resolve() (when
p.envInterval==0 and config.Interval is invalid), causing a busy-loop; update
the logic around p.resolve(ctx) and timer.Reset to clamp the resolved interval
to a safe minimum (e.g. if interval <= 0 then set interval = defaultMinInterval
or a constant like time.Second) before calling timer.Reset; reference p.resolve,
p.envInterval and the timer.Reset call so you locate and replace the unguarded
reset with a clamped value.

167-182: ⚠️ Potential issue | 🟠 Major

NB_METRICS_INTERVAL does not bypass version gating as documented.

When p.envInterval > 0, the code still checks isVersionInRange() at line 178 and skips pushes if the version is out of range. This contradicts the documented behavior that NB_METRICS_INTERVAL should "override remote config period/version filtering."

Suggested fix to bypass version check when env interval is set
 func (p *Push) resolve(ctx context.Context) (pushURL string, interval time.Duration) {
 	if p.config.ForceSending {
 		return p.resolveServerURL(nil), p.envInterval
 	}

+	// When env interval is set, bypass version gating but still use remote config for URL
+	if p.envInterval > 0 {
+		var remoteServerURL *url.URL
+		if cfg := p.configManager.RefreshIfNeeded(ctx); cfg != nil {
+			remoteServerURL = &cfg.ServerURL
+		}
+		return p.resolveServerURL(remoteServerURL), p.envInterval
+	}
+
 	config := p.configManager.RefreshIfNeeded(ctx)
 	if config == nil {
 		log.Debug("no metrics push config available, waiting to retry")
 		return "", defaultPushInterval
 	}

-	interval = config.Interval
-	if p.envInterval > 0 {
-		interval = p.envInterval
-	}
+	interval = config.Interval

 	if !isVersionInRange(p.agentVersion, config.VersionSince, config.VersionUntil) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 167 - 182, The env var override
(p.envInterval) currently gets applied but still falls through to the version
gate; change the logic so that when p.envInterval > 0 the code bypasses the
version check entirely. Concretely, in the RefreshIfNeeded/interval branch
adjust control flow around p.envInterval and isVersionInRange (e.g., set
interval = p.envInterval and short-circuit returning "" and interval or only
evaluate isVersionInRange when p.envInterval == 0) so that p.envInterval
overrides remote config period/version filtering; reference p.envInterval,
config.Interval, isVersionInRange and p.agentVersion to locate the code to
modify.
🧹 Nitpick comments (1)
client/internal/metrics/influxdb.go (1)

142-167: Consider deterministic field ordering for reproducibility.

Go map iteration order is non-deterministic, so the field order in the output will vary between exports. While InfluxDB accepts any field order, this can complicate debugging and testing.

Optional: Use sorted keys for deterministic output
+import "sort"
+
 // writeSamples writes samples in InfluxDB line protocol format.
 func writeSamples(w io.Writer, samples []influxSample) error {
 	for _, s := range samples {
 		if _, err := fmt.Fprintf(w, "%s,%s ", s.measurement, s.tags); err != nil {
 			return err
 		}

+		keys := make([]string, 0, len(s.fields))
+		for k := range s.fields {
+			keys = append(keys, k)
+		}
+		sort.Strings(keys)
+
-		first := true
-		for k, v := range s.fields {
-			if !first {
+		for i, k := range keys {
+			if i > 0 {
 				if _, err := fmt.Fprint(w, ","); err != nil {
 					return err
 				}
 			}
-			if _, err := fmt.Fprintf(w, "%s=%g", k, v); err != nil {
+			if _, err := fmt.Fprintf(w, "%s=%g", k, s.fields[k]); err != nil {
 				return err
 			}
-			first = false
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/influxdb.go` around lines 142 - 167, The writeSamples
function uses map iteration over influxSample.fields which yields
non-deterministic field order; to make output reproducible, collect the map keys
into a slice, sort them (e.g., sort.Strings), then iterate the sorted keys to
format fields instead of ranging directly over s.fields. Update writeSamples to
build a []string keys from s.fields, sort it, and use keys[i] to access values
when writing each field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@client/internal/metrics/push.go`:
- Around line 149-156: The call to timer.Reset(interval) can receive
zero/negative intervals from p.resolve() (when p.envInterval==0 and
config.Interval is invalid), causing a busy-loop; update the logic around
p.resolve(ctx) and timer.Reset to clamp the resolved interval to a safe minimum
(e.g. if interval <= 0 then set interval = defaultMinInterval or a constant like
time.Second) before calling timer.Reset; reference p.resolve, p.envInterval and
the timer.Reset call so you locate and replace the unguarded reset with a
clamped value.
- Around line 167-182: The env var override (p.envInterval) currently gets
applied but still falls through to the version gate; change the logic so that
when p.envInterval > 0 the code bypasses the version check entirely. Concretely,
in the RefreshIfNeeded/interval branch adjust control flow around p.envInterval
and isVersionInRange (e.g., set interval = p.envInterval and short-circuit
returning "" and interval or only evaluate isVersionInRange when p.envInterval
== 0) so that p.envInterval overrides remote config period/version filtering;
reference p.envInterval, config.Interval, isVersionInRange and p.agentVersion to
locate the code to modify.

---

Nitpick comments:
In `@client/internal/metrics/influxdb.go`:
- Around line 142-167: The writeSamples function uses map iteration over
influxSample.fields which yields non-deterministic field order; to make output
reproducible, collect the map keys into a slice, sort them (e.g., sort.Strings),
then iterate the sorted keys to format fields instead of ranging directly over
s.fields. Update writeSamples to build a []string keys from s.fields, sort it,
and use keys[i] to access values when writing each field.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 89647518-c342-4244-af60-8e129b3ecf24

📥 Commits

Reviewing files that changed from the base of the PR and between b4cd717 and d272755.

📒 Files selected for processing (4)
  • client/internal/metrics/influxdb.go
  • client/internal/metrics/metrics.go
  • client/internal/metrics/push.go
  • client/internal/metrics/push_test.go

coderabbitai (Bot) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
client/internal/metrics/push.go (1)

271-283: Double-close of gzip writer.

The deferred gz.Close() at line 275 will be called even after the explicit gz.Close() at line 279. While gzip.Writer.Close() is safe to call multiple times, the defer is unnecessary overhead.

Remove redundant defer
 func gzipCompress(data []byte) (*bytes.Buffer, error) {
 	var buf bytes.Buffer
 	gz := gzip.NewWriter(&buf)
-	defer func() { _ = gz.Close() }()
 	if _, err := gz.Write(data); err != nil {
+		_ = gz.Close()
 		return nil, err
 	}
 	if err := gz.Close(); err != nil {
 		return nil, err
 	}
 	return &buf, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 271 - 283, In gzipCompress
remove the redundant deferred close of the gzip writer: delete the defer func()
{ _ = gz.Close() }() and keep the explicit gz.Close() call that checks and
returns any error; ensure gz is closed exactly once (via the explicit if err :=
gz.Close(); err != nil { return nil, err }) so the function still returns &buf
on success and propagates close/write errors from the gzip.Writer.
client/internal/metrics/env.go (1)

74-87: Consider validating URL scheme for security.

The function parses the URL but doesn't validate that it uses an expected scheme (e.g., http or https). A misconfigured URL with an unexpected scheme could cause issues or security concerns.

Optional: Add scheme validation
 func getMetricsServerURL() *url.URL {
 	envURL := os.Getenv(EnvMetricsServerURL)
 	if envURL == "" {
 		return nil
 	}
 	parsed, err := url.Parse(envURL)
 	if err != nil {
 		log.Warnf("invalid metrics server URL from env: %v", err)
 		return nil
 	}
+	if parsed.Scheme != "http" && parsed.Scheme != "https" {
+		log.Warnf("invalid metrics server URL scheme %q, expected http or https", parsed.Scheme)
+		return nil
+	}
 	return parsed
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/env.go` around lines 74 - 87, getMetricsServerURL
currently parses EnvMetricsServerURL but doesn't validate the URL scheme; update
getMetricsServerURL to reject URLs that don't use an allowed scheme (e.g.,
"http" or "https") by checking parsed.Scheme after url.Parse and logging a
warning + returning nil for disallowed schemes. Use the EnvMetricsServerURL
constant and the parsed variable from url.Parse within getMetricsServerURL to
perform this validation and ensure only expected schemes are accepted.
client/internal/metrics/influxdb.go (1)

142-167: Map iteration order may cause non-deterministic output.

The fields map iteration order is non-deterministic in Go. While InfluxDB line protocol doesn't require a specific field order, non-deterministic output can make debugging and testing harder. This is a minor concern since InfluxDB handles it correctly.

Optional: Sort field keys for deterministic output
+import "sort"
+
 func writeSamples(w io.Writer, samples []influxSample) error {
 	for _, s := range samples {
 		if _, err := fmt.Fprintf(w, "%s,%s ", s.measurement, s.tags); err != nil {
 			return err
 		}

+		// Sort field keys for deterministic output
+		keys := make([]string, 0, len(s.fields))
+		for k := range s.fields {
+			keys = append(keys, k)
+		}
+		sort.Strings(keys)
+
 		first := true
-		for k, v := range s.fields {
+		for _, k := range keys {
+			v := s.fields[k]
 			if !first {
 				if _, err := fmt.Fprint(w, ","); err != nil {
 					return err
 				}
 			}
 			if _, err := fmt.Fprintf(w, "%s=%g", k, v); err != nil {
 				return err
 			}
 			first = false
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/influxdb.go` around lines 142 - 167, The writeSamples
function uses map iteration over s.fields which is non-deterministic; to make
output deterministic for testing/debugging, gather the keys from s.fields inside
writeSamples, sort them (e.g., using sort.Strings), and then iterate over the
sorted keys to write fields instead of ranging directly over the map; update the
loop that currently reads "for k, v := range s.fields" to iterate the sorted key
slice and fetch values with s.fields[k].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/metrics/push.go`:
- Around line 196-252: The push method currently calls p.metrics.ExportAndReset
which clears metrics immediately, causing permanent loss if the HTTP send
(p.client.Do) fails; change the flow to avoid clearing until push succeeds: use
a non-destructive export (replace ExportAndReset with an Export that writes to
buf but does not clear, or call p.metrics.Export(&buf) then only call
p.metrics.Reset() after a successful HTTP response), and if your metrics API has
an import/requeue (e.g., Import or AddSamples) use that to re-enqueue samples on
send failure; update push to use gzipCompress on the exported bytes, send via
p.client.Do, and only clear/reset p.metrics after resp.StatusCode is 2xx (or on
confirmed success).

---

Nitpick comments:
In `@client/internal/metrics/env.go`:
- Around line 74-87: getMetricsServerURL currently parses EnvMetricsServerURL
but doesn't validate the URL scheme; update getMetricsServerURL to reject URLs
that don't use an allowed scheme (e.g., "http" or "https") by checking
parsed.Scheme after url.Parse and logging a warning + returning nil for
disallowed schemes. Use the EnvMetricsServerURL constant and the parsed variable
from url.Parse within getMetricsServerURL to perform this validation and ensure
only expected schemes are accepted.

In `@client/internal/metrics/influxdb.go`:
- Around line 142-167: The writeSamples function uses map iteration over
s.fields which is non-deterministic; to make output deterministic for
testing/debugging, gather the keys from s.fields inside writeSamples, sort them
(e.g., using sort.Strings), and then iterate over the sorted keys to write
fields instead of ranging directly over the map; update the loop that currently
reads "for k, v := range s.fields" to iterate the sorted key slice and fetch
values with s.fields[k].

In `@client/internal/metrics/push.go`:
- Around line 271-283: In gzipCompress remove the redundant deferred close of
the gzip writer: delete the defer func() { _ = gz.Close() }() and keep the
explicit gz.Close() call that checks and returns any error; ensure gz is closed
exactly once (via the explicit if err := gz.Close(); err != nil { return nil,
err }) so the function still returns &buf on success and propagates close/write
errors from the gzip.Writer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c19eaf51-10e2-4e05-aa2b-f2a32913175e

📥 Commits

Reviewing files that changed from the base of the PR and between b4cd717 and 1085ad0.

📒 Files selected for processing (5)
  • client/internal/metrics/env.go
  • client/internal/metrics/influxdb.go
  • client/internal/metrics/metrics.go
  • client/internal/metrics/push.go
  • client/internal/metrics/push_test.go

coderabbitai (Bot) left a comment

🧹 Nitpick comments (2)
client/internal/metrics/push.go (1)

272-284: Deferred gz.Close() masks error and is redundant.

Line 276 defers gz.Close() but line 280 explicitly calls gz.Close() to capture the error. The deferred call will execute again after the explicit close, which is harmless but wasteful. More importantly, if the explicit gz.Close() on line 280 returns an error, the deferred close will still run and its error is discarded.

Consider removing the defer and relying solely on the explicit close:

Suggested fix
 func gzipCompress(data []byte) (*bytes.Buffer, error) {
 	var buf bytes.Buffer
 	gz := gzip.NewWriter(&buf)
-	defer func() { _ = gz.Close() }()
 	if _, err := gz.Write(data); err != nil {
+		_ = gz.Close()
 		return nil, err
 	}
 	if err := gz.Close(); err != nil {
 		return nil, err
 	}
 	return &buf, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 272 - 284, In gzipCompress,
remove the deferred gz.Close() and rely on the explicit gz.Close() call to
capture and return any close error: locate the gzipCompress function, delete the
defer func() { _ = gz.Close() }() and ensure the function writes to gz, checks
the write error, then calls gz.Close() once and returns any error from that
Close before returning the buffer.
client/internal/metrics/influxdb.go (1)

119-150: Non-deterministic field order in InfluxDB line protocol output.

The for k, v := range s.fields loop (line 133) iterates over a map with non-deterministic order. While InfluxDB accepts fields in any order, this makes the output non-reproducible across runs, which can complicate debugging and testing.

Consider sorting field keys for consistent output:

Suggested fix for deterministic field order
+import "sort"
+
 func (m *influxDBMetrics) Export(w io.Writer) error {
 	m.mu.Lock()
 	samples := make([]influxSample, len(m.samples))
 	copy(samples, m.samples)
 	m.mu.Unlock()

 	for _, s := range samples {
 		if _, err := fmt.Fprintf(w, "%s,%s ", s.measurement, s.tags); err != nil {
 			return err
 		}

+		keys := make([]string, 0, len(s.fields))
+		for k := range s.fields {
+			keys = append(keys, k)
+		}
+		sort.Strings(keys)
+
-		first := true
-		for k, v := range s.fields {
-			if !first {
+		for i, k := range keys {
+			if i > 0 {
 				if _, err := fmt.Fprint(w, ","); err != nil {
 					return err
 				}
 			}
-			if _, err := fmt.Fprintf(w, "%s=%g", k, v); err != nil {
+			if _, err := fmt.Fprintf(w, "%s=%g", k, s.fields[k]); err != nil {
 				return err
 			}
-			first = false
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/influxdb.go` around lines 119 - 150, The Export
method produces non-deterministic field order because it iterates directly over
the map s.fields; to fix, in influxDBMetrics.Export (inside the loop over
samples) collect the map keys from s.fields into a slice, sort them (e.g.,
sort.Strings), and then iterate over the sorted keys to write fields (replace
the for k, v := range s.fields loop with a deterministic loop over sorted keys),
preserving the existing formatting/escaping logic and the first-flag comma
handling for fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/metrics/influxdb.go`:
- Around line 119-150: The Export method produces non-deterministic field order
because it iterates directly over the map s.fields; to fix, in
influxDBMetrics.Export (inside the loop over samples) collect the map keys from
s.fields into a slice, sort them (e.g., sort.Strings), and then iterate over the
sorted keys to write fields (replace the for k, v := range s.fields loop with a
deterministic loop over sorted keys), preserving the existing
formatting/escaping logic and the first-flag comma handling for fields.

In `@client/internal/metrics/push.go`:
- Around line 272-284: In gzipCompress, remove the deferred gz.Close() and rely
on the explicit gz.Close() call to capture and return any close error: locate
the gzipCompress function, delete the defer func() { _ = gz.Close() }() and
ensure the function writes to gz, checks the write error, then calls gz.Close()
once and returns any error from that Close before returning the buffer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e00590ca-aacb-4792-8a6e-3b6563af10d5

📥 Commits

Reviewing files that changed from the base of the PR and between 1085ad0 and 44edbfd.

📒 Files selected for processing (4)
  • client/internal/metrics/influxdb.go
  • client/internal/metrics/metrics.go
  • client/internal/metrics/push.go
  • client/internal/metrics/push_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/metrics/push_test.go

lixmal added 5 commits March 18, 2026 19:31
- Rename NB_METRICS_ENABLED to NB_METRICS_PUSH_ENABLED to clarify that
  collection is always active (for debug bundles) and only push is opt-in
- Change default config URL from staging to production (ingest.netbird.io)
- Delete broken Prometheus dashboard (used non-existent metric names)
- Delete unused VictoriaMetrics datasource config
- Replace committed .env with .env.example containing placeholder values
- Wire Grafana admin credentials through env vars in docker-compose
- Make metricsStages a pointer to prevent reset-vs-write race on reconnect
- Fix typed-nil interface in debug bundle path (GetClientMetrics)
- Use deterministic field order in InfluxDB Export (sorted keys)
- Replace Authorization header with X-Peer-ID for metrics push
- Fix ingest server timeout to use time.Second instead of float
- Fix gzip double-close, stale comments, trim log levels
- Add tests for influxdb.go and MetricsStages
- Add netbird_login measurement recording login/auth duration to management
  server, with success/failure result tag
- Validate InfluxDB tags against per-measurement allowlists in ingest server
  to prevent arbitrary tag injection
- Cap all duration fields (*_seconds) at 300s instead of only total_seconds
- Add ingest server tests for tag/field validation, bounds, and auth
…s-push

# Conflicts:
#	client/internal/connect.go
#	client/internal/engine.go
#	client/internal/engine_test.go
coderabbitai (Bot) left a comment

Actionable comments posted: 5

♻️ Duplicate comments (3)
client/internal/metrics/infra/ingest/main.go (2)

115-116: ⚠️ Potential issue | 🟠 Major

Use explicit HTTP server timeouts instead of bare ListenAndServe.

http.ListenAndServe here runs with zero server-side read/write/idle timeouts, which can be abused by slow clients to hold sockets indefinitely.

🛠️ Suggested fix
-	if err := http.ListenAndServe(listenAddr, nil); err != nil { //nolint:gosec
+	srv := &http.Server{
+		Addr:              listenAddr,
+		Handler:           nil,
+		ReadHeaderTimeout: 5 * time.Second,
+		ReadTimeout:       15 * time.Second,
+		WriteTimeout:      15 * time.Second,
+		IdleTimeout:       60 * time.Second,
+	}
+	if err := srv.ListenAndServe(); err != nil { //nolint:gosec
 		log.Fatal(err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/ingest/main.go` around lines 115 - 116, Replace
the bare http.ListenAndServe(listenAddr, nil) call with an explicit http.Server
instance configured with sensible timeouts (set ReadTimeout, WriteTimeout, and
IdleTimeout) and the same Addr and Handler (use listenAddr and nil or the
existing mux/handler), then start it via server.ListenAndServe() and handle the
returned error; update the code around the ListenAndServe call in main.go to
construct the server, assign server.Addr = listenAddr, server.Handler = nil (or
the existing handler), set server.ReadTimeout/WriteTimeout/IdleTimeout to
appropriate durations, and use server.ListenAndServe() instead of the bare
function to mitigate slow-client resource exhaustion.

177-190: ⚠️ Potential issue | 🔴 Critical

X-Peer-ID validation is identification, not authentication.

Any caller can forge a 16-char hex X-Peer-ID and write to / if this service is reachable outside a trusted network. Please gate ingest writes with actual authN/authZ (e.g., shared token/HMAC/MTLS or trusted ingress ACLs).

#!/bin/bash
set -euo pipefail

# Verify whether ingest is exposed and whether any auth middleware/token check exists.
fd -t f 'docker-compose*.yml' client/internal/metrics/infra
rg -nP 'ingest|8087|ports:' client/internal/metrics/infra -C3
rg -nP --type=go 'X-Peer-ID|Authorization|Bearer|HMAC|mTLS|validateAuth' client/internal/metrics/infra -C3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/ingest/main.go` around lines 177 - 190,
validateAuth currently only checks format of X-Peer-ID (identification) but does
not authenticate/authorize writes; replace or augment this with real
authN/authZ: either implement token/HMAC verification against a server-side
secret (check Authorization header or HMAC on payload), require mTLS client cert
validation in the HTTP server, or enforce a trusted ingress ACL before accepting
writes to the ingest endpoint; update the validateAuth function (or add
middleware invoked before handlers that accept ingest writes) to verify the
chosen credential mechanism and return 401/403 on failure, and if you
intentionally keep only format checks, rename validateAuth to validatePeerID and
add explicit comments so callers know it does not perform authentication.
client/internal/metrics/push.go (1)

197-252: ⚠️ Potential issue | 🟠 Major

Potential sample-loss window between Export() and Reset() on successful push.

If new samples are recorded after Export() but before Reset(), they can be cleared without ever being sent unless reset is snapshot-aware.

#!/bin/bash
set -euo pipefail

# Inspect metrics interface and concrete Export/Reset behavior.
rg -nP --type=go 'type\s+metricsImplementation\s+interface' client/internal/metrics -C6
rg -nP --type=go 'func\s+\(.*\)\s+Export\(|func\s+\(.*\)\s+Reset\(' client/internal/metrics -C10

# Focus on InfluxDB metrics implementation used by this push path.
rg -nP --type=go 'func\s+\(m \*influxDBMetrics\)\s+Export\(|func\s+\(m \*influxDBMetrics\)\s+Reset\(' client/internal/metrics/influxdb.go -C25

Expected verification outcome: if Reset() truncates the live buffer without correlating to the exported snapshot, this is a real data-loss race and needs an atomic drain/swap API.

🧹 Nitpick comments (1)
client/internal/metrics/infra/.env.example (1)

8-9: Reorder keys to keep dotenv-linter output clean.

dotenv-linter reports key ordering warnings for these blocks. Reordering avoids recurring lint noise in this file.

♻️ Optional reorder to satisfy linter warnings
-# Grafana admin credentials
-GRAFANA_ADMIN_USER=admin
-GRAFANA_ADMIN_PASSWORD=changeme
+# Grafana admin credentials
+GRAFANA_ADMIN_PASSWORD=changeme
+GRAFANA_ADMIN_USER=admin

 # Remote config served by ingest at /config
 # Set CONFIG_METRICS_SERVER_URL to the ingest server's public address to enable
 CONFIG_METRICS_SERVER_URL=
+CONFIG_PERIOD_MINUTES=5
 CONFIG_VERSION_SINCE=0.0.0
 CONFIG_VERSION_UNTIL=99.99.99
-CONFIG_PERIOD_MINUTES=5

Also applies to: 14-16

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/.env.example` around lines 8 - 9, Reorder the
environment variable keys in the .env example blocks so they match the project's
dotenv-linter expected ordering (e.g., alphabetically) to remove lint warnings;
specifically adjust the block containing GRAFANA_ADMIN_USER and
GRAFANA_ADMIN_PASSWORD (and the similar block at lines 14–16) so the keys appear
in the same sorted order used elsewhere in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/metrics/env.go`:
- Around line 83-88: The current parsing of NB_METRICS_SERVER_URL using
url.Parse in the function that handles metrics server override accepts relative
or malformed inputs; after url.Parse(envURL) (see the parsed variable) validate
that parsed.Scheme is non-empty and is either "http" or "https" and that
parsed.Host is non-empty before returning parsed; if those checks fail, call
log.Warnf with a clear message about invalid absolute HTTP(S) URL and return nil
so the override is rejected. Ensure you update the code path that returns parsed
(the block with parsed, err := url.Parse(envURL)) to perform these extra checks.

In `@client/internal/metrics/infra/README.md`:
- Around line 35-37: Update the architecture README to reflect that the ingest
server now also allowlists tag keys per measurement: change the bullet list
under "Validates line protocol" (which currently mentions "Whitelists
measurements & fields") to explicitly state "Whitelists measurements, fields &
tags" and add a short note that the ingest server validates tag keys per
measurement (i.e., tag key allowlist is enforced along with measurement/field
validation in the ingest server).
- Line 17: The markdown has unlabeled fenced code blocks that trigger
markdownlint MD040; update each fenced block in the README examples to include a
language identifier (e.g., ```text or ```bash) so the fences are explicitly
marked. Locate the unlabeled triple-backtick blocks and add an appropriate
language token to the opening fence to satisfy MD040 while preserving the
existing content.
- Around line 41-42: The README bullet claiming "No client-side auth required"
is misleading because the ingest server still requires the X-Peer-ID header;
update both occurrences (the bullet at "No client-side auth required" and the
repeated text later) to a clearer phrase such as "No secret/token-based client
auth; X-Peer-ID identifier header is required" so readers understand an
identifier header is mandatory while no token is expected.

In `@client/internal/metrics/push.go`:
- Around line 125-126: Update the Start() comment so it accurately describes
current behavior: state that setting NB_METRICS_INTERVAL only overrides the
scheduling interval but does not bypass resolve() or RefreshIfNeeded() and does
not ignore version gating; clarify that only explicit force-sending mode
(forceSend) will skip version checks/remote config, while the env var merely
changes how often resolve() runs. Reference Start(), resolve(),
RefreshIfNeeded(), NB_METRICS_INTERVAL and force-sending mode in the comment so
readers understand the exact runtime behavior.

---

Duplicate comments:
In `@client/internal/metrics/infra/ingest/main.go`:
- Around line 115-116: Replace the bare http.ListenAndServe(listenAddr, nil)
call with an explicit http.Server instance configured with sensible timeouts
(set ReadTimeout, WriteTimeout, and IdleTimeout) and the same Addr and Handler
(use listenAddr and nil or the existing mux/handler), then start it via
server.ListenAndServe() and handle the returned error; update the code around
the ListenAndServe call in main.go to construct the server, assign server.Addr =
listenAddr, server.Handler = nil (or the existing handler), set
server.ReadTimeout/WriteTimeout/IdleTimeout to appropriate durations, and use
server.ListenAndServe() instead of the bare function to mitigate slow-client
resource exhaustion.
- Around line 177-190: validateAuth currently only checks format of X-Peer-ID
(identification) but does not authenticate/authorize writes; replace or augment
this with real authN/authZ: either implement token/HMAC verification against a
server-side secret (check Authorization header or HMAC on payload), require mTLS
client cert validation in the HTTP server, or enforce a trusted ingress ACL
before accepting writes to the ingest endpoint; update the validateAuth function
(or add middleware invoked before handlers that accept ingest writes) to verify
the chosen credential mechanism and return 401/403 on failure, and if you
intentionally keep only format checks, rename validateAuth to validatePeerID and
add explicit comments so callers know it does not perform authentication.

---

Nitpick comments:
In `@client/internal/metrics/infra/.env.example`:
- Around line 8-9: Reorder the environment variable keys in the .env example
blocks so they match the project's dotenv-linter expected ordering (e.g.,
alphabetically) to remove lint warnings; specifically adjust the block
containing GRAFANA_ADMIN_USER and GRAFANA_ADMIN_PASSWORD (and the similar block
at lines 14–16) so the keys appear in the same sorted order used elsewhere in
the file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1db4ad3b-9fa1-48e2-8882-4776a2b20bdc

📥 Commits

Reviewing files that changed from the base of the PR and between 44edbfd and c44b797.

⛔ Files ignored due to path filters (1)
  • client/internal/metrics/infra/ingest/go.sum is excluded by !**/*.sum
📒 Files selected for processing (18)
  • client/internal/connect.go
  • client/internal/metrics/env.go
  • client/internal/metrics/influxdb.go
  • client/internal/metrics/influxdb_test.go
  • client/internal/metrics/infra/.env.example
  • client/internal/metrics/infra/.gitignore
  • client/internal/metrics/infra/README.md
  • client/internal/metrics/infra/docker-compose.yml
  • client/internal/metrics/infra/grafana/provisioning/dashboards/json/netbird-influxdb-metrics.json
  • client/internal/metrics/infra/ingest/go.mod
  • client/internal/metrics/infra/ingest/main.go
  • client/internal/metrics/infra/ingest/main_test.go
  • client/internal/metrics/metrics.go
  • client/internal/metrics/push.go
  • client/internal/metrics/push_test.go
  • client/internal/peer/conn.go
  • client/internal/peer/metrics_saver_test.go
  • client/server/debug.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • client/internal/metrics/influxdb.go
  • client/internal/peer/conn.go
  • client/internal/metrics/infra/docker-compose.yml
  • client/internal/metrics/push_test.go
  • client/server/debug.go
  • client/internal/metrics/infra/ingest/go.mod

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
client/internal/metrics/infra/README.md (1)

150-153: ⚠️ Potential issue | 🟡 Minor

Verify .env.example exists, or update the setup step to use the tracked .env.

The quickstart depends on cp .env.example .env; if the template is missing, setup breaks. Please verify and align the instructions accordingly.

```bash
#!/bin/bash
set -euo pipefail

echo "Checking for env templates in metrics infra..."
fd -t f '^\.env(\.example)?$' client/internal/metrics/infra

echo
echo "Tracked env-related files under client/internal/metrics/infra:"
git ls-files client/internal/metrics/infra | rg -n '\.env(\.example)?$' || true

echo
echo "README references to env files:"
rg -n 'env\.example|\.env' client/internal/metrics/infra/README.md -C2
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/infra/README.md` around lines 150 - 153, The README
step that runs "cp .env.example .env" can fail if .env.example is not present;
update the README (client/internal/metrics/infra/README.md) to either (a) verify
and instruct the user to create .env from the tracked .env file (if present) or
(b) instruct to copy from .env.example only when that template exists and
provide the alternative command to create a .env manually; specifically
reference the cp .env.example .env instruction and add a short check/notice
advising to confirm the presence of .env.example (or use the tracked .env)
before running the copy.
client/internal/metrics/push.go (1)

200-252: ⚠️ Potential issue | 🟠 Major

Reintroduced sample-loss window between export and reset.

Export() snapshots current samples, then the HTTP request runs, and only after success Reset() clears the live buffer. Metrics recorded between those two calls are dropped on success.

Suggested direction:

```diff
- if err := p.metrics.Export(&buf); err != nil {
+ if err := p.metrics.ExportAndReset(&buf); err != nil {
    return fmt.Errorf("export metrics: %w", err)
  }
...
- p.metrics.Reset()
  return nil
```

(Requires adding an atomic ExportAndReset/Drain operation in the metrics implementation and interface.)
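
A minimal sketch of such an atomic drain, assuming the collector buffers line-protocol samples in a mutex-guarded slice (type and field names are illustrative):

```go
package metrics

import (
	"fmt"
	"io"
	"sync"
)

// Collector stands in for the concrete type behind p.metrics; only the drain
// pattern matters here.
type Collector struct {
	mu      sync.Mutex
	samples []string // buffered InfluxDB line-protocol samples
}

// ExportAndReset writes all buffered samples to w and clears the buffer in a
// single critical section, so samples recorded while the HTTP push is in
// flight land in the fresh buffer instead of being dropped.
func (c *Collector) ExportAndReset(w io.Writer) error {
	c.mu.Lock()
	samples := c.samples
	c.samples = nil
	c.mu.Unlock()

	for _, s := range samples {
		if _, err := fmt.Fprintln(w, s); err != nil {
			return fmt.Errorf("write sample: %w", err)
		}
	}
	return nil
}
```

If the push then fails, the caller could re-queue the drained samples to preserve the current retry-on-failure behavior.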

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/metrics/push.go` around lines 200 - 252, The current flow
calls p.metrics.Export(&buf) and only calls p.metrics.Reset() after a successful
HTTP push, which drops samples recorded between Export and Reset; change the
metrics API to provide an atomic export-and-reset (e.g., ExportAndReset(ctx,
&buf) or Drain(&buf)) on the metrics implementation and interface, then call
that single atomic method in push.go instead of separate Export and Reset so
samples recorded during the push are captured safely; update the implementation
of p.metrics (the concrete type behind p.metrics) to perform the
snapshot-and-clear atomically and update any callers to use the new
ExportAndReset/Drain method.
🧹 Nitpick comments (1)
client/internal/engine.go (1)

822-824: Lower sync timing log level to avoid info-log churn.

This path runs for every sync; using Info will be noisy in normal operation. Prefer Debug for per-sync duration logs.

Suggested tweak:

```diff
- log.Infof("sync finished in %s", duration)
+ log.Debugf("sync finished in %s", duration)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/engine.go` around lines 822 - 824, The per-sync duration log
currently uses log.Infof and is too noisy; change the log level to debug by
replacing the Infof call with log.Debugf (i.e., update the call around the
duration := time.Since(started) / log.Infof("sync finished in %s", duration) to
use Debugf) so per-sync timing is logged at debug level while keeping
e.clientMetrics.RecordSyncDuration(e.ctx, duration) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/metrics/infra/README.md`:
- Around line 130-133: Clarify that the described precedence for
NB_METRICS_SERVER_URL / NB_METRICS_INTERVAL vs remote config values applies only
after eligibility checks (e.g., isVersionInRange and remote-config
refresh/eligibility gating) have passed; update the README text to add one
sentence stating that environment variables do not bypass eligibility/version
gating or remote-config refresh logic (for example: NB_METRICS_INTERVAL does not
bypass isVersionInRange or remote-config refresh checks), so env vars only
select the value if the push is allowed by those checks.

In `@management/cmd/pprof.go`:
- Around line 26-30: The pprof goroutine uses log.Fatalf which exits the whole
process on bind failure; update the pprof function (pprof) to avoid terminating
the process by replacing log.Fatalf("pprof server: %v", err) with a non-fatal
error log (e.g., log.Errorf or log.Warnf) and simply return so the application
continues running when http.ListenAndServe fails (remember this is launched via
go pprof(addr) from init()).

In `@signal/cmd/pprof.go`:
- Around line 13-30: Remove the automatic startup in pprof.go by deleting the
init() that calls pprof(pprofAddr()) (so pprof.go should only expose pprofAddr()
and pprof(listenAddr string) but not start itself); ensure the single pprof
startup path is the explicit startPprof() invoked from RunE (or the existing
startPprof entrypoint) so only one listener is created and the process won't
fatally exit due to address contention between localhost/127.0.0.1.
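
Taking the two pprof findings together, a sketch of the intended shape for either cmd package; pprofAddr, pprof, and startPprof are the names used in the prompts above, while the env var and default address are assumptions:

```go
package cmd

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
	"os"

	log "github.com/sirupsen/logrus"
)

// pprofAddr returns the listen address; NB_PPROF_ADDR is an assumed override.
func pprofAddr() string {
	if addr := os.Getenv("NB_PPROF_ADDR"); addr != "" {
		return addr
	}
	return "localhost:6060"
}

// pprof serves the profiling endpoint; a bind failure is logged and the
// goroutine simply returns instead of taking the whole process down.
func pprof(listenAddr string) {
	if err := http.ListenAndServe(listenAddr, nil); err != nil {
		log.Errorf("pprof server: %v", err)
	}
}

// startPprof is the single explicit entry point, invoked from the command's
// RunE; there is intentionally no init() starting a second listener.
func startPprof() {
	go pprof(pprofAddr())
}
```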

---

Duplicate comments:
In `@client/internal/metrics/infra/README.md`:
- Around line 150-153: The README step that runs "cp .env.example .env" can fail
if .env.example is not present; update the README
(client/internal/metrics/infra/README.md) to either (a) verify and instruct the
user to create .env from the tracked .env file (if present) or (b) instruct to
copy from .env.example only when that template exists and provide the
alternative command to create a .env manually; specifically reference the cp
.env.example .env instruction and add a short check/notice advising to confirm
the presence of .env.example (or use the tracked .env) before running the copy.

In `@client/internal/metrics/push.go`:
- Around line 200-252: The current flow calls p.metrics.Export(&buf) and only
calls p.metrics.Reset() after a successful HTTP push, which drops samples
recorded between Export and Reset; change the metrics API to provide an atomic
export-and-reset (e.g., ExportAndReset(ctx, &buf) or Drain(&buf)) on the metrics
implementation and interface, then call that single atomic method in push.go
instead of separate Export and Reset so samples recorded during the push are
captured safely; update the implementation of p.metrics (the concrete type
behind p.metrics) to perform the snapshot-and-clear atomically and update any
callers to use the new ExportAndReset/Drain method.

---

Nitpick comments:
In `@client/internal/engine.go`:
- Around line 822-824: The per-sync duration log currently uses log.Infof and is
too noisy; change the log level to debug by replacing the Infof call with
log.Debugf (i.e., update the call around the duration := time.Since(started) /
log.Infof("sync finished in %s", duration) to use Debugf) so per-sync timing is
logged at debug level while keeping e.clientMetrics.RecordSyncDuration(e.ctx,
duration) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d759f7b-2489-437a-acd2-283cb0bf03d3

📥 Commits

Reviewing files that changed from the base of the PR and between c44b797 and 59f8c5d.

📒 Files selected for processing (10)
  • client/internal/connect.go
  • client/internal/debug/debug.go
  • client/internal/engine.go
  • client/internal/engine_test.go
  • client/internal/metrics/env.go
  • client/internal/metrics/infra/README.md
  • client/internal/metrics/push.go
  • management/cmd/pprof.go
  • shared/management/client/grpc.go
  • signal/cmd/pprof.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • shared/management/client/grpc.go
  • client/internal/metrics/env.go
  • client/internal/debug/debug.go

Collaborator

@mlsmaycon mlsmaycon left a comment

We should go with metrics bound to another URL

@mlsmaycon mlsmaycon changed the title from [client] Feature/client metrics push to [client] Feature/client metrics on Mar 22, 2026
@mlsmaycon mlsmaycon merged commit 91f0d5c into main Mar 22, 2026
57 of 58 checks passed
@mlsmaycon mlsmaycon deleted the feature/client-metrics-push branch March 22, 2026 11:45

3 participants