
[proxy] refactor metrics and add usage logs #5533

Merged
mlsmaycon merged 5 commits into main from feature/move-proxy-to-opentelemetry
Mar 9, 2026

Conversation

Collaborator

@pascal-fischer pascal-fischer commented Mar 7, 2026

Describe your changes

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features

    • Access logs now include bytes_upload and bytes_download (API and schemas updated, fields required).
    • Certificate issuance duration is now recorded as a metric.
  • Refactor

    • Metrics switched from Prometheus client to OpenTelemetry-backed meters; health endpoint now exposes OpenMetrics via OTLP exporter.
  • Tests

    • Metric tests updated to use OpenTelemetry Prometheus exporter and MeterProvider.

Contributor

coderabbitai Bot commented Mar 7, 2026

📝 Walkthrough

Walkthrough

Migrate proxy metrics from Prometheus client to OpenTelemetry (meters/instruments), rewire Server to create an OTel Prometheus exporter and MeterProvider, add certificate issuance recording, and extend access logging with per-request bytes (upload/download) plus related API/proto/type changes and domain usage tracking.

Changes

Cohort / File(s) Summary
Metrics Core & Tests
proxy/internal/metrics/metrics.go, proxy/internal/metrics/metrics_test.go
Replaced Prometheus instruments with OpenTelemetry metric instruments; New(ctx, meter) (*Metrics, error) initializes instruments via Meter. Added mappings mutex/map, per-host path tracking, and RecordCertificateIssuance. Tests now create an OTel Prometheus exporter, MeterProvider, and Meter.
Server Integration
proxy/server.go
Server now builds an OTel Prometheus exporter and MeterProvider, obtains a Meter (pkg path), initializes proxymetrics.Metrics, and updates startHealthServer to use the global Prometheus gatherer. ACME manager is now passed the metrics instance.
ACME Manager
proxy/internal/acme/manager.go, proxy/internal/acme/manager_test.go
Added metricsRecorder interface and Manager.metrics field; NewManager signature gains a final metrics parameter. prefetchCertificate records certificate issuance duration when metrics present. Tests updated to pass nil.
Access Logging (runtime)
proxy/internal/accesslog/logger.go, proxy/internal/accesslog/middleware.go, proxy/internal/accesslog/statuswriter.go
Track bytes uploaded/downloaded: statusWriter.Write and bodyCounter capture sizes; middleware attaches BytesUpload/BytesDownload to log entries and invokes domain usage tracking. Logger holds domainUsage state, cleanup routine, Close() to cancel cleanup.
API / Proto / Types
shared/management/http/api/openapi.yml, shared/management/proto/proxy_service.proto, shared/management/http/api/types.gen.go, management/internals/modules/reverseproxy/accesslogs/accesslogentry.go
Added bytes_upload and bytes_download to OpenAPI schema, protobuf AccessLog, generated API types, and DB model mappings. types.gen.go also adds many Valid() enum helper methods and small JSON tag adjustments.
Misc tests & wiring
proxy/internal/metrics/metrics_test.go, proxy/server.go (test impact)
Tests and server wiring updated to use OTel exporter/MeterProvider; constructors now return errors handled in tests/startup.

Sequence Diagram(s)

sequenceDiagram
    participant Server
    participant Exporter as OTelPrometheusExporter
    participant MP as MeterProvider
    participant Meter
    participant Metrics
    participant ACME
    participant HTTP as Requests/Health

    Server->>Exporter: prometheus.New()
    Server->>MP: metric.NewMeterProvider(metric.WithReader(Exporter))
    Server->>Meter: MP.Meter(pkg)
    Server->>Metrics: New(ctx, Meter)
    HTTP->>Metrics: record request (Add/Record with ctx)
    Metrics->>Meter: use instruments (Counter/Histogram)
    ACME->>Metrics: RecordCertificateIssuance(duration) on issuance
    MP->>Exporter: expose metrics for /metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • bcmmbaga
  • mlsmaycon

Poem

🐰
I hop on meters, soft and spry,
Counters click as requests go by,
I nibble bytes, both out and in,
Histograms hum where metrics spin,
A carrot clap for telemetry high 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description lacks essential details beyond the template structure; no implementation description, technical rationale, or explanation of changes is provided despite this being a significant refactor affecting metrics, access logging, and certificate handling. Resolution: add a detailed description of what was changed and why, explaining the OpenTelemetry migration, access-log enhancements, and certificate metrics integration; include context on any breaking changes or migration impacts.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title "[proxy] refactor metrics and add usage logs" accurately summarizes the primary changes: migrating metrics from Prometheus to OpenTelemetry and introducing usage tracking via access logs with byte counters.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
proxy/internal/metrics/metrics.go (1)

61-74: Use UCUM ms and let the exporter add the unit suffix.

Line 63 and Line 72 encode the duration unit twice: once in the instrument name and again in WithUnit. The OTEL metric API expects UCUM units, and the Prometheus exporter already appends unit suffixes during translation, so this pairing is likely to produce awkward/non-standard metric names. Prefer proxy.http.request.duration / proxy.backend.duration with metric.WithUnit("ms"). (pkg.go.dev)

♻️ Proposed cleanup
 requestDuration, err := meter.Int64Histogram(
-	"proxy.http.request.duration.ms",
-	metric.WithUnit("milliseconds"),
+	"proxy.http.request.duration",
+	metric.WithUnit("ms"),
 	metric.WithDescription("Duration of requests made to the netbird proxy"),
 )

 backendDuration, err := meter.Int64Histogram(
-	"proxy.backend.duration.ms",
-	metric.WithUnit("milliseconds"),
+	"proxy.backend.duration",
+	metric.WithUnit("ms"),
 	metric.WithDescription("Duration of peer round trip time from the netbird proxy"),
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/internal/metrics/metrics.go` around lines 61 - 74, The metric names
currently include a ".ms" suffix and also pass metric.WithUnit("milliseconds");
update the Int64Histogram calls that create requestDuration and backendDuration
to use UCUM unit "ms" in metric.WithUnit and remove the ".ms" suffix from the
instrument names (use "proxy.http.request.duration" and
"proxy.backend.duration"), ensuring the exporter will append any display
suffixes; leave the descriptions unchanged and keep using meter.Int64Histogram
for both metrics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/internal/metrics/metrics.go`:
- Around line 141-148: The metrics are accumulating on every MODIFY because
AddMapping blindly adds counts; make AddMapping/RemoveMapping idempotent by
tracking mapping state and applying only the delta: add a map field on Metrics
(e.g., mappings map[string]int) keyed by a stable mapping identifier (domain or
ID) to store the last len(Paths) and presence; in AddMapping check if the
mapping key exists—if not, increment configuredDomains by 1 and add len(Paths);
if it exists compute diff = len(Paths) - prev and add diff to totalPaths; in
RemoveMapping use the stored prev to subtract totalPaths and decrement
configuredDomains and delete the map entry; update function names
Metrics.AddMapping and Metrics.RemoveMapping accordingly.
- Around line 107-119: The post-request updates in Metrics.Middleware
(m.activeRequests.Add and m.requestDuration.Record) can be skipped if
next.ServeHTTP panics; move these updates into a deferred closure so they always
run: after incrementing m.activeRequests and capturing start := time.Now() in
Middleware, add defer func() { m.activeRequests.Add(m.ctx, -1);
m.requestDuration.Record(m.ctx, time.Since(start).Milliseconds()) }(), then call
next.ServeHTTP(interceptor, r) as before; this ensures the in-flight counter
(m.activeRequests) and duration recording are always executed even on panic.

In `@proxy/server.go`:
- Around line 158-165: The Prometheus exporter is being created with
prometheus.New() which uses the global DefaultRegisterer and can leak registered
collectors on partial failure; replace this by creating a per-server registry
via prometheus.NewRegistry(), pass that registry into
prometheus.WithRegisterer(reg) when constructing the exporter (instead of
prometheus.New()), and thread that registry into startHealthServer so it scrapes
the per-server registry rather than prometheus.DefaultGatherer; ensure exporter
creation and startHealthServer use the same registerer/registry to avoid
AlreadyRegisteredError across retries or multiple Server instances (affects the
exporter creation site, startHealthServer invocation, and any initialization
steps like proxymetrics.New, dialManagement, configureTLS that may fail).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d468d4ce-acf5-4ed4-a8ea-2b0e120bdbfe

📥 Commits

Reviewing files that changed from the base of the PR and between 5c20f13 and c2fec57.

📒 Files selected for processing (3)
  • proxy/internal/metrics/metrics.go
  • proxy/internal/metrics/metrics_test.go
  • proxy/server.go

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
proxy/internal/metrics/metrics.go (1)

114-123: Use request context for request-scoped metrics to enable trace correlation.

These Add and Record calls use m.ctx, preventing the OpenTelemetry SDK from attaching trace/span IDs as exemplars. For trace correlation via exemplars, request-scoped metrics must use the request's context (which carries the active server span). Use r.Context() in the middleware handler and req.Context() in the RoundTripper to allow the SDK to inspect the sampled span context and attach trace correlation.

♻️ Suggested change
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-	m.requestsTotal.Add(m.ctx, 1)
-	m.activeRequests.Add(m.ctx, 1)
+	reqCtx := r.Context()
+	m.requestsTotal.Add(reqCtx, 1)
+	m.activeRequests.Add(reqCtx, 1)

	interceptor := &responseInterceptor{PassthroughWriter: responsewriter.New(w)}

	start := time.Now()
	defer func() {
		duration := time.Since(start)
-		m.activeRequests.Add(m.ctx, -1)
-		m.requestDuration.Record(m.ctx, duration.Milliseconds())
+		m.activeRequests.Add(reqCtx, -1)
+		m.requestDuration.Record(reqCtx, duration.Milliseconds())
	}()

	next.ServeHTTP(interceptor, r)
})
-	m.backendDuration.Record(m.ctx, duration.Milliseconds())
+	m.backendDuration.Record(req.Context(), duration.Milliseconds())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/internal/metrics/metrics.go` around lines 114 - 123, Replace the
package-level m.ctx usage with the incoming request's context so exemplars can
be attached: in the middleware handler use r.Context() for calls to
m.requestsTotal.Add, m.activeRequests.Add and m.requestDuration.Record (capture
ctx := r.Context() at the top of the handler and use that inside the deferred
closure), and in the HTTP RoundTripper use req.Context() for any metrics
recording; update the responseInterceptor/handler code paths (symbols:
m.requestsTotal.Add, m.activeRequests.Add, requestDuration.Record,
responseInterceptor) to accept and use the request-scoped ctx instead of m.ctx
so the OpenTelemetry SDK can correlate traces via exemplars.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 01f534b1-81a1-40cb-ae6f-669ccfd96f12

📥 Commits

Reviewing files that changed from the base of the PR and between c2fec57 and 9ab6138.

📒 Files selected for processing (1)
  • proxy/internal/metrics/metrics.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
proxy/internal/accesslog/middleware.go (1)

50-54: ⚠️ Potential issue | 🟠 Major

Don't retain raw Host values in the long-lived usage map.

This key is derived from the client-controlled Host header, and trackUsage() keeps it in Logger.domainUsage for up to 24 hours. On a public proxy, a stream of random hosts will create unbounded map growth and turns this into an easy memory-pressure vector. At minimum, only track requests that resolved to a known service; ideally, key usage on a validated mapping/service identifier instead of the raw header.

Minimal mitigation
-		l.trackUsage(host, bytesUpload+bytesDownload)
+		if capturedData.GetServiceId() != "" {
+			l.trackUsage(host, bytesUpload+bytesDownload)
+		}

Also applies to: 83-84

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/internal/accesslog/middleware.go` around lines 50 - 54, The code
currently uses the raw client-controlled header (r.Host via net.SplitHostPort)
as the key stored in Logger.domainUsage by trackUsage, which can be abused to
bloat the map; change trackUsage and the middleware so you never insert the
unvalidated r.Host into Logger.domainUsage: resolve the incoming host to a known
service identifier (e.g. via your service mapping/lookup function) and only
increment domainUsage when that lookup returns a known service, or else
drop/ignore the entry; specifically, update the middleware around
net.SplitHostPort/r.Host and the trackUsage call to pass a validated service key
(not the raw host) and ensure trackUsage/Logger.domainUsage only accepts those
validated service IDs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/internal/accesslog/logger.go`:
- Around line 58-68: NewLogger currently starts the background goroutine
cleanupStaleUsage using a local context that is never canceled because callers
keep the Logger in a local var and never call Close; modify the wiring so the
cleanup goroutine is tied to the server lifecycle: either (A) change Server to
store the *Logger created by NewLogger (reference Logger, NewLogger) as a field
and call its Close/cleanupCancel during server shutdown, or (B) change NewLogger
to accept a parent context (or the server's context) and use that instead of
context.Background() for cleanupStaleUsage; ensure cleanupStaleUsage is started
with the passed context and that the server shutdown path cancels the context or
invokes Logger.Close to stop the goroutine (affects calls around NewLogger and
the cleanupStaleUsage invocation).

In `@proxy/internal/accesslog/middleware.go`:
- Around line 56-60: The middleware currently sets bytesUpload from
r.ContentLength (coerced to 0 when -1), which misses streamed/chunked uploads;
wrap the incoming request body with a counting ReadCloser (e.g.,
countingReadCloser) before passing to the handler chain and use its accumulated
counter instead of r.ContentLength (update bytesUpload assignment to read
countingReadCloser.n after the request is processed); ensure the wrapper
delegates Close to the underlying ReadCloser and update references around
bytesUpload and the sw.bytesWritten logic to account for the counted upload
bytes.

In `@shared/management/http/api/openapi.yml`:
- Around line 2825-2834: Add a non-negative constraint to the size fields by
adding "minimum: 0" to both bytes_upload and bytes_download in the OpenAPI
schema so validators and generated clients reject negative values; locate the
bytes_upload and bytes_download property definitions in
shared/management/http/api/openapi.yml and add the minimum: 0 key under each
property alongside type/format/description/example.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bdfc522f-f743-4db1-ac14-573f0430b4bd

📥 Commits

Reviewing files that changed from the base of the PR and between 9ab6138 and d9418dd.

⛔ Files ignored due to path filters (2)
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/proxy_service.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (7)
  • management/internals/modules/reverseproxy/accesslogs/accesslogentry.go
  • proxy/internal/accesslog/logger.go
  • proxy/internal/accesslog/middleware.go
  • proxy/internal/accesslog/statuswriter.go
  • shared/management/http/api/openapi.yml
  • shared/management/http/api/types.gen.go
  • shared/management/proto/proxy_service.proto


sonarqubecloud Bot commented Mar 9, 2026

Quality Gate passed

Issues
0 New issues
3 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
proxy/internal/acme/manager.go (1)

161-172: ⚠️ Potential issue | 🟠 Major

Metric name RecordCertificateIssuance is misleading — it measures certificate prefetch latency including cache hits, not actual ACME issuance.

mgr.GetCertificate(...) returns cached certificates when available, so the elapsed time at lines 161–172 includes both cache hits (which are fast) and actual issuance/renewal (which is slow). The metric name implies issuance-only timing. Either gate the record on a real issuance path or rename the metric to reflect that it measures prefetch duration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/internal/acme/manager.go` around lines 161 - 172, The metric currently
recorded after mgr.GetCertificate(hello) is misleading—RecordCertificateIssuance
is used for both cache hits and real ACME issuance; either only record on a real
issuance path or rename the metric to reflect prefetch timing. Fix by changing
the metrics call here to a prefetch-specific metric (e.g., call
mgr.metrics.RecordCertificatePrefetchDuration(elapsed)) and update the metrics
interface/implementation accordingly, or alter mgr.GetCertificate/its caller to
return an issuance flag and only call
mgr.metrics.RecordCertificateIssuance(elapsed) when that flag indicates an
actual ACME issuance; update references to RecordCertificateIssuance and any
tests to match the chosen approach.
♻️ Duplicate comments (1)
proxy/server.go (1)

158-165: ⚠️ Potential issue | 🟠 Major

Keep the OTel exporter off the global Prometheus registry.

prometheus.New() registers against the default registerer, and startHealthServer scrapes the default gatherer. If startup fails after Line 158, or ListenAndServe is invoked again in the same process, the first exporter stays registered and the next init can trip AlreadyRegisteredError. Use a per-server registry and thread it into both exporter creation and startHealthServer.

🧭 Proposed direction
- exporter, err := prometheus.New()
+ reg := prometheus2.NewRegistry()
+ exporter, err := prometheus.New(prometheus.WithRegisterer(reg))
  if err != nil {
  	return fmt.Errorf("create prometheus exporter: %w", err)
  }

  provider := metric.NewMeterProvider(metric.WithReader(exporter))
@@
- if err := s.startHealthServer(); err != nil {
+ if err := s.startHealthServer(reg); err != nil {
  	return err
  }
@@
-func (s *Server) startHealthServer() error {
+func (s *Server) startHealthServer(gatherer prometheus2.Gatherer) error {
@@
- s.healthServer = health.NewServer(healthAddr, s.healthChecker, s.Logger, promhttp.HandlerFor(prometheus2.DefaultGatherer, promhttp.HandlerOpts{EnableOpenMetrics: true}))
+ s.healthServer = health.NewServer(healthAddr, s.healthChecker, s.Logger, promhttp.HandlerFor(gatherer, promhttp.HandlerOpts{EnableOpenMetrics: true}))

Also applies to: 210-210, 301-306

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/server.go` around lines 158 - 165, prometheus.New() registers its
exporter on the global Prometheus registerer which can cause
AlreadyRegisteredError on repeated inits; change the initialization in the
Server setup so you create a dedicated prometheus.Registry and pass it to the
OTel exporter (use the exporter option that accepts a custom registerer, e.g.
WithRegisterer) instead of calling prometheus.New(), then thread that same
registry into startHealthServer so the health scrape uses the per-server
registry; update the code around prometheus.New(), exporter, provider :=
metric.NewMeterProvider(...) and the call sites of startHealthServer to accept
and use the custom registry.
🧹 Nitpick comments (2)
proxy/internal/accesslog/middleware.go (1)

35-41: Please add a regression test for chunked upload + streaming download paths.

These counters now drive both access-log fields and per-domain usage accounting, so a focused integration test here would help prevent quiet drift back to header-based or partial accounting.

Also applies to: 64-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/internal/accesslog/middleware.go` around lines 35 - 41, Add an
integration regression test that exercises chunked upload and streaming download
through the access-log middleware to ensure byte counters drive both access-log
fields and per-domain usage accounting: simulate a client POST with
Transfer-Encoding: chunked that writes the request body in multiple chunks and
assert bodyCounter (the r.Body wrapper) reports the total bytesRead reflected in
access-log fields; simulate a streaming GET where the response is written in
chunks and assert the response-wrapping counters (the middleware's response
writer wrapper used for bytesWritten) and the per-domain usage accounting totals
are updated accordingly. Place the test in the accesslog package test suite and
assert both access-log output and the per-domain usage accounting store values
match the actual bytes transferred for upload and download. Ensure the test
covers the code paths that wrap r.Body with bodyCounter and the response writer
wrapper so future changes cannot regress to header-based or partial accounting.
proxy/internal/acme/manager.go (1)

71-81: NewManager’s positional signature is getting too brittle.

The new metrics dependency pushes this constructor to eight arguments, and the call sites are already down to nil, nil, "", nil. A small config struct or functional options layer would make future additions much cheaper and less error-prone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/internal/acme/manager.go` around lines 71 - 81, NewManager’s 8-argument
positional signature (NewManager(certDir, acmeURL, eabKID, eabHMACKey string,
notifier certificateNotifier, logger *log.Logger, lockMethod CertLockMethod,
metrics metricsRecorder)) is brittle; refactor it to accept a single
configuration object or functional options instead. Create a ManagerConfig (or
use functional Option funcs) that contains certDir, acmeURL, eabKID, eabHMACKey,
certificateNotifier, logger, CertLockMethod, and metricsRecorder, update
NewManager to accept that config (or variadic options), construct the Manager
using config fields and newCertLocker(config.lockMethod, config.certDir,
config.logger), and update all call sites to pass the config/options (replacing
nil placeholders) to simplify future dependency additions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 54010015-0ab7-449e-9c25-2c0d7376fe50

📥 Commits

Reviewing files that changed from the base of the PR and between d9418dd and 8db71b5.

📒 Files selected for processing (6)
  • proxy/internal/accesslog/middleware.go
  • proxy/internal/accesslog/statuswriter.go
  • proxy/internal/acme/manager.go
  • proxy/internal/acme/manager_test.go
  • proxy/internal/metrics/metrics.go
  • proxy/server.go

@mlsmaycon mlsmaycon changed the title from "[proxy] switch proxy to use opentelemetry" to "[proxy] refactor metrics and add usage logs" on Mar 9, 2026
Comment thread proxy/internal/accesslog/logger.go
if usage.bytesTransferred >= bytesThreshold {
elapsed := time.Since(usage.bytesStartTime)
bytesInGB := float64(usage.bytesTransferred) / (1024 * 1024 * 1024)
l.logger.WithFields(log.Fields{
Collaborator


should log based on a modulo calculation

Comment thread proxy/server.go
"github.com/cenkalti/backoff/v4"
"github.com/pires/go-proxyproto"
"github.com/prometheus/client_golang/prometheus"
prometheus2 "github.com/prometheus/client_golang/prometheus"
Collaborator


clientPrometheus is a better name

Comment thread proxy/server.go
prometheus2 "github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
log "github.com/sirupsen/logrus"
"go.opentelemetry.io/otel/exporters/prometheus"
Collaborator


exporterPrometheus is a better name

newPathCount := len(mapping.Paths)
oldPathCount, exists := m.mappingPaths[mapping.Host]

if !exists {
Collaborator


Can you share more on why you chose to handle this here and not move the call from update?

Collaborator Author


Because otherwise I could just count domains, not paths. We could argue whether we need the paths counter.

Collaborator

@mlsmaycon mlsmaycon left a comment


The other notes can be changed in another PR.

@mlsmaycon mlsmaycon merged commit f884299 into main Mar 9, 2026
47 checks passed
@mlsmaycon mlsmaycon deleted the feature/move-proxy-to-opentelemetry branch March 9, 2026 17:45