fix(database_observability): Ensure that `connection_info` metric is only emitted for a given DB instance when it is available by rgeyer · Pull Request #5707 · grafana/alloy

rgeyer · 2026-03-04T21:58:21Z

Brief description of Pull Request

We noticed a discrepancy between the number of DB instances reported in the dbo11y app, and in the billing data.

This is because the two use different metrics and queries:

App: last_over_time((count(count by (server_id) (mysql_perf_schema_events_statements_total{job=~"integrations/db-o11y", digest=~".+", } or pg_stat_statements_calls_total{job=~"integrations/db-o11y", queryid=~".+", })))[$__range:])

Billing: {__name__="database_observability_connection_info", server_id!="", job="integrations/db-o11y"}

Further research revealed that a database that connected successfully and later became unreachable would continue emitting the database_observability_connection_info metric. This conflicts with our desired billing model, where a customer can "turn off" a database and not be billed for the time it is down.

Notes to the Reviewer

This starts yet-another-goroutine to check the DB connectivity on a regular cadence (every 1 minute) and unregister the metric if there are 3 consecutive failures. It will also re-register the metric if the database ping is successful 3 consecutive times after a failure.
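The threshold behaviour described above can be sketched as a small state tracker. This is only an illustration, assuming a threshold of 3 in both directions; the type and method names (`connState`, `observe`) are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// connState tracks consecutive ping results and decides when the metric
// should be registered or unregistered. Hypothetical names, not the PR's code.
type connState struct {
	threshold  int  // consecutive results needed to flip the state
	registered bool // whether the metric is currently exposed
	successes  int  // current run of successful pings
	failures   int  // current run of failed pings
}

// observe records one ping result and reports whether the registered
// state changed as a result.
func (s *connState) observe(pingOK bool) (changed bool) {
	if pingOK {
		s.successes++
		s.failures = 0
		if !s.registered && s.successes >= s.threshold {
			s.registered = true
			return true
		}
		return false
	}
	s.failures++
	s.successes = 0
	if s.registered && s.failures >= s.threshold {
		s.registered = false
		return true
	}
	return false
}

// String summarises the tracker for logging.
func (s *connState) String() string {
	return fmt.Sprintf("registered=%v successes=%d failures=%d", s.registered, s.successes, s.failures)
}

func main() {
	s := &connState{threshold: 3, registered: true}
	// Two failures, a recovery, three failures, then three successes.
	for _, ok := range []bool{false, false, true, false, false, false, true, true, true} {
		changed := s.observe(ok)
		fmt.Printf("ping ok=%v changed=%v %s\n", ok, changed, s)
	}
}
```

A single failed or successful ping never flips the state on its own, which is what keeps a flapping connection from toggling the billing metric on every check.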

~~This now performs the checks in the main component.go file for each dbo11y component, reusing the goroutine that's already present, with different timers. Should be a bit cleaner.~~

Never mind, we went back to putting the checks in the connection_info collector. Final decision :)

This will increase pings/connects to the database, but the alternative is having some other collector (which may not be running) report to this collector, which seems the worse of two evils.

PR Checklist

Tests updated

Screenshot from the local test. This is starting alloy monitoring a database which is unavailable, starting the database, then stopping, and resuming it.

gaantunes · 2026-03-04T22:40:50Z

+		}
+	}
+	ctx, cancel = context.WithCancel(ctx)
+	go func() {


Could we use the component.go existing goroutine instead of starting a new one? I think it would make it cleaner.

Hmm yeah I'm wondering if it's easier to build this on top of the reconnection logic. Today we keep the collectors in "running" state even if the database connection is lost. Maybe we should just "stop()" collectors until database connection retries succeed?

It makes sense, but I wonder if it may cause unnecessary stop/starts in a flapping db connection.
We would also need to separate the logs collector from the pool, since it is independent of the db connection once it is started. Right now it only counts errors, but once we implement detailed error scraping it will be useful for troubleshooting a flapping db.

Refactored this to (mostly) reuse the existing reconnect logic.

The existing logic seems to only be used when the db is unavailable on first startup, so there is an additional check now to ensure that the DB remains connected, and stops the connection_info collector if the DB becomes unavailable.

I'm a bit confused by the various tickers, though. So, to recap, it seems we have:

the existing ticker loop for "initial" connection

a new ticker loop dedicated to connection_info with
I understand we want to keep those activities separate, but I'm wondering if we should just put the new logic inside the connection_info collector? i.e. that collector would have its own loop like other collectors

The only real reason for having two tickers is that they're on different cadences (30s and 60s).

We could probably combine them, and checking for DB connectivity every 30s is probably not a big deal. 🤔

gaantunes · 2026-03-04T22:49:21Z

+					consecutiveSuccesses++
+					if consecutiveSuccesses >= threshold {
+						registry.MustRegister(infoMetric)
+						infoMetric.WithLabelValues(labelValues[0], labelValues[1], labelValues[2], labelValues[3], labelValues[4], labelValues[5]).Set(1)


Can we make labelValues a typed struct instead? This is hard to read and understand.
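For illustration, a typed struct along these lines might look like the sketch below. The field names and label names are assumptions made up for the example (the diff only shows six positional values), and the Prometheus call is mentioned in a comment only so the snippet stays free of external dependencies.

```go
package main

import "fmt"

// connectionLabels is a hypothetical typed replacement for a positional
// []string of label values. Field names are illustrative assumptions.
type connectionLabels struct {
	ProviderName    string
	ProviderRegion  string
	ProviderAccount string
	InstanceID      string
	Engine          string
	EngineVersion   string
}

// labelNames lists the (assumed) label names in the order values() emits them.
var labelNames = []string{
	"provider_name", "provider_region", "provider_account",
	"db_instance_identifier", "engine", "engine_version",
}

// values returns the label values in the same order as labelNames, so the
// ordering lives in exactly one place instead of at every call site.
func (l connectionLabels) values() []string {
	return []string{
		l.ProviderName, l.ProviderRegion, l.ProviderAccount,
		l.InstanceID, l.Engine, l.EngineVersion,
	}
}

// pairs renders name="value" strings, handy for logs and test failures.
func (l connectionLabels) pairs() []string {
	vals := l.values()
	out := make([]string, len(vals))
	for i, v := range vals {
		out[i] = fmt.Sprintf("%s=%q", labelNames[i], v)
	}
	return out
}

func main() {
	l := connectionLabels{
		ProviderName: "aws", ProviderRegion: "us-east-1", ProviderAccount: "123456789012",
		InstanceID: "orders-db", Engine: "mysql", EngineVersion: "8.0.36",
	}
	// With a prometheus.GaugeVec the call site would become:
	//   infoMetric.WithLabelValues(l.values()...).Set(1)
	for _, p := range l.pairs() {
		fmt.Println(p)
	}
}
```

Call sites then read fields by name, and the variadic spread replaces the six index expressions.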

cristiangreco · 2026-03-09T11:04:24Z

 }

+// hasConnectionInfoCollector reports whether the connection_info collector is currently in c.collectors.
+// Must be called with c.mut held.


Nit: given this and the next function are only called from one place, maybe they could take care of mutex handling themselves rather than carrying this reminder?

Yeah.. I don't have a strong opinion here. They're called in a few places, and on lines 321-325 there is only a single mutex lock rather than two separate ones.

We could just make each of those helper functions hold the mutex individually tho, since they are always accessing a property that needs to avoid thread contention. 🤔

Made this change
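A rough sketch of what helpers that lock internally could look like is below. The `component` type here is a stripped-down stand-in, and `removeConnectionInfoCollector`, `describe`, and the map-based `collectors` field are hypothetical; only `hasConnectionInfoCollector`, `c.mut`, and `c.collectors` are names that appear in the diff.

```go
package main

import (
	"fmt"
	"sync"
)

const connectionInfoName = "connection_info"

// component is a minimal stand-in for a dbo11y component. The real type
// holds much more state; only the mutex and collector set matter here.
type component struct {
	mut        sync.Mutex
	collectors map[string]struct{}
}

// hasConnectionInfoCollector reports whether the connection_info collector
// is currently in c.collectors. It takes c.mut itself, so callers must NOT
// already hold the lock.
func (c *component) hasConnectionInfoCollector() bool {
	c.mut.Lock()
	defer c.mut.Unlock()
	_, ok := c.collectors[connectionInfoName]
	return ok
}

// removeConnectionInfoCollector drops the collector from the set, again
// acquiring c.mut internally.
func (c *component) removeConnectionInfoCollector() {
	c.mut.Lock()
	defer c.mut.Unlock()
	delete(c.collectors, connectionInfoName)
}

// describe goes through the locking helper instead of reading the map directly.
func (c *component) describe() string {
	return fmt.Sprintf("connection_info running: %v", c.hasConnectionInfoCollector())
}

func main() {
	c := &component{collectors: map[string]struct{}{connectionInfoName: {}}}
	fmt.Println(c.describe())
	c.removeConnectionInfoCollector()
	fmt.Println(c.describe())
}
```

The trade-off raised in the thread still applies: a check followed by a removal is now two separate critical sections, so another goroutine can change the set between the two calls.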

matthewnolf · 2026-03-10T21:08:43Z

I'm a bit out of the loop on this topic, but the changes seem fairly involved, with multiple tickers, and because they live in component.go they are not tested at all. Is there a reason we can't go with something simpler, like passing the DB connection to the connection_info collector, which could simply unregister the metric if a ping fails?

rgeyer · 2026-03-10T21:19:08Z

@matthewnolf That was the initial implementation. Each connection_info.go collector started its own goroutine per DB. It pinged the DB every 60 seconds and deregistered the metric from the registry when the DB was unreachable.

This thread is what prompted the changes. I may have overly influenced that, given that my initial PR description pointed out the additional goroutines and DB contacts as a negative thing. Perhaps it's fine?

matthewnolf · 2026-03-11T09:53:32Z

But what is stopping the connection_info collector from behaving like all the others, taking a DB connection and running in a loop? The loop in this case would ping the DB and register/unregister the metric.
Overall I'd prefer to keep logic out of component.go, which is mainly for wiring and is less tested.
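The shape being proposed here, a collector that owns its own ping loop, could be sketched roughly as follows. Everything in it is an assumption for illustration: `runMonitor`, `pinger`, `scriptedPinger`, and `replay` are hypothetical names, the tick channel stands in for a `time.Ticker`, and the consecutive-result threshold is deliberately left out to keep the loop itself visible.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pinger is the only part of *sql.DB this sketch needs; *sql.DB satisfies it.
type pinger interface {
	PingContext(ctx context.Context) error
}

// runMonitor pings db once per tick and reports the result through
// setAvailable. It returns when ctx is cancelled or ticks is closed.
// In a real collector, ticks would come from time.NewTicker(interval).C and
// setAvailable would register or unregister the metric.
func runMonitor(ctx context.Context, db pinger, ticks <-chan time.Time, setAvailable func(bool)) {
	for {
		select {
		case <-ctx.Done():
			return
		case _, ok := <-ticks:
			if !ok {
				return
			}
			setAvailable(db.PingContext(ctx) == nil)
		}
	}
}

var errUnreachable = errors.New("database unreachable")

// scriptedPinger is a test double that returns canned ping results in order.
type scriptedPinger struct {
	results []error
	next    int
}

func (p *scriptedPinger) PingContext(context.Context) error {
	err := p.results[p.next]
	p.next++
	return err
}

// replay drives runMonitor with one tick per scripted result and returns
// the availability values it reported.
func replay(results []error) []bool {
	ticks := make(chan time.Time, len(results))
	for range results {
		ticks <- time.Time{}
	}
	close(ticks)

	var seen []bool
	runMonitor(context.Background(), &scriptedPinger{results: results}, ticks, func(up bool) {
		seen = append(seen, up)
	})
	return seen
}

// report formats one observation for printing.
func report(tick int, up bool) string {
	return fmt.Sprintf("tick %d: metric exposed = %v", tick, up)
}

func main() {
	for i, up := range replay([]error{nil, errUnreachable, nil}) {
		fmt.Println(report(i+1, up))
	}
}
```

Passing the tick channel in, rather than creating the ticker inside the loop, is one way to make this kind of collector testable without sleeping.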

rgeyer · 2026-03-11T16:15:25Z

Nothing at all, and in fact, that was my initial implementation (see here), but there was discussion about changing it in the thread I linked above.

I'm not particular to either approach, but we should come to some consensus and get this merged.

Thoughts @matthewnolf @cristiangreco @gaantunes ?

gaantunes · 2026-03-11T20:53:35Z

My initial PR comment was aimed more at the additional ticker being added for this new functionality. I agree that keeping it inside the collectors looks cleaner, as Matt suggests, but Ryan's consecutive-failure threshold is important to avoid needless billing flapping.

…trengthen tests Remove the dead metricLabelValues field from the MySQL and Postgres ConnectionInfo structs, replace manual labelValues[0..5] index expansion with labelValues... spread throughout, fix a latent flakiness bug in TestRunConnectionInfoMonitor_ReregistersAfterConsecutiveSuccesses (the weak nil-guard masked a real timing issue where exhausted mock expectations caused the metric to be re-unregistered before the assertion), and add collector-level tests covering Stop unregistering the metric and the monitor goroutine lifecycle when a DB connection is provided.

cristiangreco · 2026-03-12T19:58:21Z

@rgeyer approved, and renamed the PR to get a nicer changelog entry

Add a shared connection_info_monitor component which is used to periodically check database connectivity and register/deregister the database_observability_connection_info metric which is used for billing purposes

524bb09

rgeyer requested a review from a team as a code owner March 4, 2026 21:58

gofmt

feb6b69

gaantunes mentioned this pull request Mar 4, 2026

fix: suppress connection_info metric when DB is unreachable using component-level ping #5708

Closed

6 tasks

gaantunes reviewed Mar 4, 2026

rgeyer requested review from cristiangreco and gaantunes March 6, 2026 18:44

cristiangreco reviewed Mar 9, 2026

rgeyer requested a review from cristiangreco March 10, 2026 18:42

rgeyer force-pushed the rgeyer/fix/dbo11y-connection-info-disconnect branch from 62a7d3a to 56ac9b3 Compare March 10, 2026 18:44

rgeyer added 2 commits March 11, 2026 15:06

Merge branch 'main' into rgeyer/fix/dbo11y-connection-info-disconnect

ee935fa

rgeyer force-pushed the rgeyer/fix/dbo11y-connection-info-disconnect branch from 5e0ec9d to 2f3798b Compare March 11, 2026 23:09

cristiangreco changed the title ~~fix: Ensure that database_observability_connection_info is only emitted for a given DB instance when it is available.~~ fix(database_observability): Ensure that connection_info metric is only emitted for a given DB instance when it is available Mar 12, 2026

cristiangreco approved these changes Mar 12, 2026

rgeyer merged commit bf0c3dc into main Mar 13, 2026
48 checks passed

rgeyer deleted the rgeyer/fix/dbo11y-connection-info-disconnect branch March 13, 2026 15:44

grafana-alloybot Bot mentioned this pull request Mar 13, 2026

chore(main): Release 1.15.0 #5170

Merged

github-actions Bot added the frozen-due-to-age label Mar 28, 2026

github-actions Bot locked as resolved and limited conversation to collaborators Mar 28, 2026
