fix(exporter/prometheus): add background cleanup for expired metric families#47152
Conversation
|
Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib. Important reminders:
A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3918fc5ddd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent unbounded memory growth in the Prometheus exporter when /metrics is not being scraped by adding periodic background cleanup independent of Collect().
Changes:
- Start a background ticker in
Start()to periodically call metric cleanup logic. - Add a unit test intended to validate cleanup without a scrape.
- Add a changelog entry documenting the fix.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
exporter/prometheusexporter/prometheus.go |
Adds a background goroutine (ticker + stop channel) to run cleanup on an interval and stop on shutdown. |
exporter/prometheusexporter/prometheus_test.go |
Adds a test intended to ensure cleanup happens without a Prometheus scrape. |
.chloggen/fix-prometheus-exporter-metric-family-leak.yaml |
Adds release note entry for the change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // TestPrometheusExporter_BackgroundMetricFamilyCleanup verifies that expired entries | ||
| // in the collector's metricFamilies map are evicted by the background goroutine | ||
| // even when no Prometheus scrape occurs. This covers the memory leak described in | ||
| // https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/41123 | ||
| func TestPrometheusExporter_BackgroundMetricFamilyCleanup(t *testing.T) { | ||
| addr := testutil.GetAvailableLocalAddress(t) | ||
| cfg := &Config{ | ||
| ServerConfig: confighttp.ServerConfig{ | ||
| NetAddr: confignet.AddrConfig{ | ||
| Transport: "tcp", | ||
| Endpoint: addr, | ||
| }, | ||
| }, | ||
| MetricExpiration: 50 * time.Millisecond, | ||
| } | ||
|
|
||
| factory := NewFactory() | ||
| set := exportertest.NewNopSettings(metadata.Type) | ||
| exp, err := factory.CreateMetrics(t.Context(), set, cfg) | ||
| require.NoError(t, err) | ||
|
|
||
| require.NoError(t, exp.Start(t.Context(), componenttest.NewNopHost())) | ||
| defer func() { | ||
| require.NoError(t, exp.Shutdown(t.Context())) | ||
| }() | ||
|
|
||
| c := exp.(*wrapMetricsExporter).exporter.collector | ||
|
|
||
| gaugeType := io_prometheus_client.MetricType_GAUGE | ||
| // Seed the metricFamilies map with an entry whose lastSeen is already expired. | ||
| c.metricFamilies.Store("stale_metric", metricFamily{ | ||
| lastSeen: time.Now().Add(-10 * time.Minute), | ||
| mf: &io_prometheus_client.MetricFamily{ | ||
| Name: proto.String("stale_metric"), | ||
| Help: proto.String("should be cleaned up"), | ||
| Type: &gaugeType, | ||
| }, | ||
| }) | ||
| // Also store a fresh entry that must survive cleanup. | ||
| // Use a far-future lastSeen so the entry never expires during the test. | ||
| c.metricFamilies.Store("fresh_metric", metricFamily{ | ||
| lastSeen: time.Now().Add(time.Hour), | ||
| mf: &io_prometheus_client.MetricFamily{ | ||
| Name: proto.String("fresh_metric"), | ||
| Help: proto.String("should remain"), | ||
| Type: &gaugeType, | ||
| }, | ||
| }) |
There was a problem hiding this comment.
This test seeds collector.metricFamilies and verifies the background goroutine removes the stale entry without a scrape, but metricFamilies are normally only populated during Collect() (i.e., during a scrape). To actually cover the reported no-scrape memory growth, the test should seed/observe the accumulator’s registeredMetrics (time series signatures created during ConsumeMetrics/Accumulate) and assert they’re expired by the background cleanup without any scrape.
3918fc5 to
8bf95bd
Compare
|
Updated the implementation — the background goroutine now cleans both layers: collector Also fixed Shutdown to be idempotent (nil out Test now covers both layers — seeds stale + fresh entries in both Would appreciate a look when you get a chance @dashpole — this was surfaced by #45332. |
|
|
||
| # A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`). | ||
| note: "Ensure router operator does not split batches of entries" | ||
| note: Add background goroutine to periodically clean up expired metric families independent of Prometheus scraping. |
There was a problem hiding this comment.
Please, focus more on the solved problem than in the implementation
There was a problem hiding this comment.
Good call — updated the note and subtext to focus on the user-facing problem (unbounded memory growth) rather than the implementation details.
8bf95bd to
908cf6b
Compare
…amilies Previously, stale metric families were only evicted during Collect() calls triggered by Prometheus scrapes. If no scraper was active, expired entries accumulated in the metricFamilies sync.Map indefinitely, causing unbounded memory growth. Add a background goroutine in Start() that ticks at the configured MetricExpiration interval and calls cleanupMetricFamilies() independently of scraping. The goroutine is stopped on Shutdown() via a stop channel. Fixes open-telemetry#41123 Signed-off-by: Rajneesh180 <rajneeshrehsaan48@gmail.com>
ArthurSens
left a comment
There was a problem hiding this comment.
Thanks for working on this, really good usage of AI tooling on this one :)
I have just one suggestion so far
| require.Eventually(t, func() bool { | ||
| _, mfFound := c.metricFamilies.Load("stale_metric") | ||
| _, accFound := a.registeredMetrics.Load("stale_acc_key") | ||
| return !mfFound && !accFound | ||
| }, 2*time.Second, 25*time.Millisecond, "stale entries were not removed by background cleanup") |
There was a problem hiding this comment.
Couldn't we use synctest instead of a time-dependent test?
Replace time-dependent require.Eventually with testing/synctest for deterministic fake-clock testing. Moves TestPrometheusExporter_BackgroundCleanup to a separate file with //go:build goexperiment.synctest build tag, matching the pattern used in deltatocumulativeprocessor/stale_test.go. Signed-off-by: Rajneesh Rana <rajneeshrana180@gmail.com> Signed-off-by: Rajneesh180 <rajneeshrehsaan48@gmail.com>
|
Yeah good call — synctest is already used in deltatocumulativeprocessor for exactly this kind of thing. Replaced the |
Merge background cleanup test into prometheus_test.go using synctest.Test (stable in Go 1.25+) instead of the experimental synctest.Run. Remove the separate test file and goexperiment build tag. Strip unnecessary comments.
|
Thank you for your contribution @Rajneesh180! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help. |
…amilies (open-telemetry#47152) #### Description `cleanupMetricFamilies()` is currently only called inside `Collect()`, which runs when Prometheus scrapes the `/metrics` endpoint. If no scraper is active, expired entries in the `metricFamilies` sync.Map accumulate indefinitely, causing unbounded memory growth. This adds a background goroutine in `Start()` that ticks at the configured `MetricExpiration` interval and calls `cleanupMetricFamilies()` independently of scraping. The goroutine is cleanly stopped on `Shutdown()` via a stop channel. #### Link to tracking issue Fixes open-telemetry#41123 #### Testing Added `TestPrometheusExporter_BackgroundMetricFamilyCleanup` which: - Creates an exporter with a short MetricExpiration (50ms) - Seeds the collector's `metricFamilies` map with one stale entry (lastSeen 10 minutes ago) and one fresh entry - Asserts via `require.Eventually` that the stale entry is evicted by the background goroutine **without any Prometheus scrape** - Asserts the fresh entry survives Full test suite passes (`go test -count=1 ./...`). #### Documentation Changelog entry added in `.chloggen/fix-prometheus-exporter-metric-family-leak.yaml`. --------- Signed-off-by: Rajneesh180 <rajneeshrehsaan48@gmail.com> Signed-off-by: Rajneesh Rana <rajneeshrana180@gmail.com>
Description
cleanupMetricFamilies()is currently only called insideCollect(), which runs when Prometheus scrapes the/metricsendpoint. If no scraper is active, expired entries in themetricFamiliessync.Map accumulate indefinitely, causing unbounded memory growth.This adds a background goroutine in
Start()that ticks at the configuredMetricExpirationinterval and callscleanupMetricFamilies()independently of scraping. The goroutine is cleanly stopped onShutdown()via a stop channel.Link to tracking issue
Fixes #41123
Testing
Added
TestPrometheusExporter_BackgroundMetricFamilyCleanupwhich:metricFamiliesmap with one stale entry (lastSeen 10 minutes ago) and one fresh entryrequire.Eventuallythat the stale entry is evicted by the background goroutine without any Prometheus scrapeFull test suite passes (
go test -count=1 ./...).Documentation
Changelog entry added in
.chloggen/fix-prometheus-exporter-metric-family-leak.yaml.