Skip to content

fix(exporter/prometheus): add background cleanup for expired metric families#47152

Merged
edmocosta merged 3 commits into
open-telemetry:mainfrom
Rajneesh180:fix/prometheus-exporter-metric-family-leak
Apr 2, 2026
Merged

fix(exporter/prometheus): add background cleanup for expired metric families#47152
edmocosta merged 3 commits into
open-telemetry:mainfrom
Rajneesh180:fix/prometheus-exporter-metric-family-leak

Conversation

@Rajneesh180

Copy link
Copy Markdown
Contributor

Description

cleanupMetricFamilies() is currently only called inside Collect(), which runs when Prometheus scrapes the /metrics endpoint. If no scraper is active, expired entries in the metricFamilies sync.Map accumulate indefinitely, causing unbounded memory growth.

This adds a background goroutine in Start() that ticks at the configured MetricExpiration interval and calls cleanupMetricFamilies() independently of scraping. The goroutine is cleanly stopped on Shutdown() via a stop channel.

Link to tracking issue

Fixes #41123

Testing

Added TestPrometheusExporter_BackgroundMetricFamilyCleanup which:

  • Creates an exporter with a short MetricExpiration (50ms)
  • Seeds the collector's metricFamilies map with one stale entry (lastSeen 10 minutes ago) and one fresh entry
  • Asserts via require.Eventually that the stale entry is evicted by the background goroutine without any Prometheus scrape
  • Asserts the fresh entry survives

Full test suite passes (go test -count=1 ./...).

Documentation

Changelog entry added in .chloggen/fix-prometheus-exporter-metric-family-leak.yaml.

@Rajneesh180 Rajneesh180 requested review from a team, ArthurSens and dashpole as code owners March 24, 2026 12:49
Copilot AI review requested due to automatic review settings March 24, 2026 12:49
@github-actions

Copy link
Copy Markdown
Contributor

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

Important reminders:

A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3918fc5ddd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread exporter/prometheusexporter/prometheus.go

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent unbounded memory growth in the Prometheus exporter when /metrics is not being scraped by adding periodic background cleanup independent of Collect().

Changes:

  • Start a background ticker in Start() to periodically call metric cleanup logic.
  • Add a unit test intended to validate cleanup without a scrape.
  • Add a changelog entry documenting the fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
exporter/prometheusexporter/prometheus.go Adds a background goroutine (ticker + stop channel) to run cleanup on an interval and stop on shutdown.
exporter/prometheusexporter/prometheus_test.go Adds a test intended to ensure cleanup happens without a Prometheus scrape.
.chloggen/fix-prometheus-exporter-metric-family-leak.yaml Adds release note entry for the change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread exporter/prometheusexporter/prometheus.go
Comment on lines +726 to +773
// TestPrometheusExporter_BackgroundMetricFamilyCleanup verifies that expired entries
// in the collector's metricFamilies map are evicted by the background goroutine
// even when no Prometheus scrape occurs. This covers the memory leak described in
// https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/41123
func TestPrometheusExporter_BackgroundMetricFamilyCleanup(t *testing.T) {
addr := testutil.GetAvailableLocalAddress(t)
cfg := &Config{
ServerConfig: confighttp.ServerConfig{
NetAddr: confignet.AddrConfig{
Transport: "tcp",
Endpoint: addr,
},
},
MetricExpiration: 50 * time.Millisecond,
}

factory := NewFactory()
set := exportertest.NewNopSettings(metadata.Type)
exp, err := factory.CreateMetrics(t.Context(), set, cfg)
require.NoError(t, err)

require.NoError(t, exp.Start(t.Context(), componenttest.NewNopHost()))
defer func() {
require.NoError(t, exp.Shutdown(t.Context()))
}()

c := exp.(*wrapMetricsExporter).exporter.collector

gaugeType := io_prometheus_client.MetricType_GAUGE
// Seed the metricFamilies map with an entry whose lastSeen is already expired.
c.metricFamilies.Store("stale_metric", metricFamily{
lastSeen: time.Now().Add(-10 * time.Minute),
mf: &io_prometheus_client.MetricFamily{
Name: proto.String("stale_metric"),
Help: proto.String("should be cleaned up"),
Type: &gaugeType,
},
})
// Also store a fresh entry that must survive cleanup.
// Use a far-future lastSeen so the entry never expires during the test.
c.metricFamilies.Store("fresh_metric", metricFamily{
lastSeen: time.Now().Add(time.Hour),
mf: &io_prometheus_client.MetricFamily{
Name: proto.String("fresh_metric"),
Help: proto.String("should remain"),
Type: &gaugeType,
},
})

Copilot AI Mar 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seeds collector.metricFamilies and verifies the background goroutine removes the stale entry without a scrape, but metricFamilies are normally only populated during Collect() (i.e., during a scrape). To actually cover the reported no-scrape memory growth, the test should seed/observe the accumulator’s registeredMetrics (time series signatures created during ConsumeMetrics/Accumulate) and assert they’re expired by the background cleanup without any scrape.

Copilot uses AI. Check for mistakes.
@Rajneesh180 Rajneesh180 force-pushed the fix/prometheus-exporter-metric-family-leak branch from 3918fc5 to 8bf95bd Compare March 25, 2026 11:45
@Rajneesh180

Copy link
Copy Markdown
Contributor Author

Updated the implementation — the background goroutine now cleans both layers: collector metricFamilies and accumulator registeredMetrics. The accumulator's internal sync.Map was actually the real source of unbounded growth when no scrapes happen, since metricFamilies only gets populated during Collect(). Added cleanupExpired() to the accumulator interface with the same expiration logic that Collect() already uses inline, so no new semantics.

Also fixed Shutdown to be idempotent (nil out stopCh after close), and passed the stop channel as a goroutine argument to avoid a data race between the background goroutine reading the field and Shutdown writing it.

Test now covers both layers — seeds stale + fresh entries in both registeredMetrics and metricFamilies, verifies the background goroutine evicts only the expired ones without any scrape. Full suite passes clean including goleak.

Would appreciate a look when you get a chance @dashpole — this was surfaced by #45332.


# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: "Ensure router operator does not split batches of entries"
note: Add background goroutine to periodically clean up expired metric families independent of Prometheus scraping.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, focus more on the solved problem than in the implementation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — updated the note and subtext to focus on the user-facing problem (unbounded memory growth) rather than the implementation details.

@Rajneesh180 Rajneesh180 force-pushed the fix/prometheus-exporter-metric-family-leak branch from 8bf95bd to 908cf6b Compare March 27, 2026 11:17
…amilies

Previously, stale metric families were only evicted during Collect() calls
triggered by Prometheus scrapes. If no scraper was active, expired entries
accumulated in the metricFamilies sync.Map indefinitely, causing unbounded
memory growth.

Add a background goroutine in Start() that ticks at the configured
MetricExpiration interval and calls cleanupMetricFamilies() independently
of scraping. The goroutine is stopped on Shutdown() via a stop channel.

Fixes open-telemetry#41123

Signed-off-by: Rajneesh180 <rajneeshrehsaan48@gmail.com>

@ArthurSens ArthurSens left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, really good usage of AI tooling on this one :)

I have just one suggestion so far

Comment on lines +795 to +799
require.Eventually(t, func() bool {
_, mfFound := c.metricFamilies.Load("stale_metric")
_, accFound := a.registeredMetrics.Load("stale_acc_key")
return !mfFound && !accFound
}, 2*time.Second, 25*time.Millisecond, "stale entries were not removed by background cleanup")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we use synctest instead of a time-dependent test?

Replace time-dependent require.Eventually with testing/synctest for
deterministic fake-clock testing. Moves TestPrometheusExporter_BackgroundCleanup
to a separate file with //go:build goexperiment.synctest build tag, matching
the pattern used in deltatocumulativeprocessor/stale_test.go.

Signed-off-by: Rajneesh Rana <rajneeshrana180@gmail.com>
Signed-off-by: Rajneesh180 <rajneeshrehsaan48@gmail.com>
@Rajneesh180

Copy link
Copy Markdown
Contributor Author

Yeah good call — synctest is already used in deltatocumulativeprocessor for exactly this kind of thing. Replaced the require.Eventually approach with synctest.Run so the cleanup fires deterministically without real timers. Pushed.

Comment thread exporter/prometheusexporter/prometheus_synctest_test.go Outdated
Comment thread exporter/prometheusexporter/prometheus_synctest_test.go Outdated
Merge background cleanup test into prometheus_test.go using
synctest.Test (stable in Go 1.25+) instead of the experimental
synctest.Run. Remove the separate test file and goexperiment build tag.
Strip unnecessary comments.
@ArthurSens ArthurSens added the ready to merge Code review completed; ready to merge by maintainers label Apr 2, 2026
@edmocosta edmocosta merged commit a659705 into open-telemetry:main Apr 2, 2026
207 checks passed
@otelbot

otelbot Bot commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Thank you for your contribution @Rajneesh180! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey. If you are getting started contributing, you can also join the CNCF Slack channel #opentelemetry-new-contributors to ask for guidance and get help.

araiu pushed a commit to araiu/opentelemetry-collector-contrib that referenced this pull request Apr 6, 2026
…amilies (open-telemetry#47152)

#### Description

`cleanupMetricFamilies()` is currently only called inside `Collect()`,
which runs when Prometheus scrapes the `/metrics` endpoint. If no
scraper is active, expired entries in the `metricFamilies` sync.Map
accumulate indefinitely, causing unbounded memory growth.

This adds a background goroutine in `Start()` that ticks at the
configured `MetricExpiration` interval and calls
`cleanupMetricFamilies()` independently of scraping. The goroutine is
cleanly stopped on `Shutdown()` via a stop channel.

#### Link to tracking issue

Fixes open-telemetry#41123

#### Testing

Added `TestPrometheusExporter_BackgroundMetricFamilyCleanup` which:
- Creates an exporter with a short MetricExpiration (50ms)
- Seeds the collector's `metricFamilies` map with one stale entry
(lastSeen 10 minutes ago) and one fresh entry
- Asserts via `require.Eventually` that the stale entry is evicted by
the background goroutine **without any Prometheus scrape**
- Asserts the fresh entry survives

Full test suite passes (`go test -count=1 ./...`).

#### Documentation

Changelog entry added in
`.chloggen/fix-prometheus-exporter-metric-family-leak.yaml`.

---------

Signed-off-by: Rajneesh180 <rajneeshrehsaan48@gmail.com>
Signed-off-by: Rajneesh Rana <rajneeshrana180@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

exporter/prometheus first-time contributor PRs made by new contributors ready to merge Code review completed; ready to merge by maintainers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

exporter/prometheus: Metrics clean up should be independent of Collect()

7 participants