
fix(database_observability): Ensure that connection_info metric is only emitted for a given DB instance when it is available#5707

Merged
rgeyer merged 4 commits into main from rgeyer/fix/dbo11y-connection-info-disconnect
Mar 13, 2026

Conversation

@rgeyer
Contributor

@rgeyer rgeyer commented Mar 4, 2026

Brief description of Pull Request

We noticed a discrepancy between the number of DB instances reported in the dbo11y app, and in the billing data.

This is because the two use different metrics/queries, namely:

App: last_over_time((count(count by (server_id) (mysql_perf_schema_events_statements_total{job=~"integrations/db-o11y", digest=~".+", } or pg_stat_statements_calls_total{job=~"integrations/db-o11y", queryid=~".+", })))[$__range:])

Billing: {__name__="database_observability_connection_info", server_id!="", job="integrations/db-o11y"}

Further research revealed that a database which connected successfully and later became unreachable would continue to emit the database_observability_connection_info metric. This conflicts with our desired billing model, in which a customer can "turn off" a database and not be billed for the time it is down.

Notes to the Reviewer

This starts yet-another-goroutine to check the DB connectivity on a regular cadence (every 1 minute) and unregister the metric if there are 3 consecutive failures. It will also re-register the metric if the database ping is successful 3 consecutive times after a failure.

This now performs the checks in the main component.go file for each dbo11y component, reusing the goroutine that's already present, with different timers. Should be a bit cleaner.

Never mind, we went back to putting the checks in the connection_info collector. Final decision :)

This will increase pings/connects to the database, but the alternative is having some other collector (which may not be running) report to this collector, which seems the worse of two evils.
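The consecutive-failure and consecutive-success thresholds described above can be sketched as a small state machine. This is a minimal illustration only, not the code from this PR: the `connState` type and its `observe` method are hypothetical names, and the actual register/unregister calls against the metric registry are left to the caller. A collector that starts while the database is unavailable would begin with `registered: false`.

```go
package main

import "fmt"

// connState tracks whether the info metric should currently be exposed.
// All names here are hypothetical; they are not taken from the PR.
type connState struct {
	threshold  int  // consecutive results needed to flip state
	registered bool // whether the metric is currently registered
	failures   int  // consecutive failed pings
	successes  int  // consecutive successful pings
}

// observe records one ping result and reports whether the registration
// state flipped as a result of it.
func (s *connState) observe(pingOK bool) (changed bool) {
	if pingOK {
		s.failures = 0
		s.successes++
		if !s.registered && s.successes >= s.threshold {
			s.registered = true
			return true
		}
		return false
	}
	s.successes = 0
	s.failures++
	if s.registered && s.failures >= s.threshold {
		s.registered = false
		return true
	}
	return false
}

func main() {
	s := &connState{threshold: 3, registered: true}
	// One blip, a recovery, a real outage, then a sustained recovery.
	results := []bool{false, true, false, false, false, true, true, true}
	for i, ok := range results {
		changed := s.observe(ok)
		fmt.Printf("check %d ok=%t registered=%t changed=%t\n", i+1, ok, s.registered, changed)
	}
}
```

With a threshold of 3, a single failed ping does not unregister the metric, which is the property that keeps a flapping connection from toggling the billing signal on every check.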

PR Checklist

  • Tests updated

Screenshot from the local test. It shows Alloy starting while the monitored database is unavailable, then the database being started, stopped, and resumed.

image

…dically check database connectivity and register/deregister the database_observability_connection_info metric which is used for billing purposes
@rgeyer rgeyer requested a review from a team as a code owner March 4, 2026 21:58
}
}
ctx, cancel = context.WithCancel(ctx)
go func() {
Contributor

Could we use the component.go existing goroutine instead of starting a new one? I think it would make it cleaner.

Contributor

Hmm yeah I'm wondering if it's easier to build this on top of the reconnection logic. Today we keep the collectors in "running" state even if the database connection is lost. Maybe we should just "stop()" collectors until database connection retries succeed?

Contributor

It makes sense, but I wonder if it may cause unnecessary stop/starts in a flapping db connection.
We would also need to separate the logs collector from the pool, since it is independent of the db connection once it is started. Right now it only counts errors, but once we implement detailed error scraping it will be useful for troubleshooting a flapping db.

Contributor Author

Refactored this to (mostly) reuse the existing reconnect logic.

The existing logic seems to be used only when the DB is unavailable on first startup, so there is now an additional check that ensures the DB remains connected and stops the connection_info collector if the DB becomes unavailable.

Contributor

I'm a bit confused by the various tickers, though. So, to recap, it seems we have:

  • the existing ticker loop for "initial" connection
  • a new ticker loop dedicated to connection_info with

I understand we want to keep those activities separate, but I'm wondering if we should just put the new logic inside the connection_info collector, i.e. that collector would have its own loop like other collectors?

Contributor Author

The only real reason for having two tickers is that they're on different cadences (30s and 60s).

We could probably combine them, and checking for DB connectivity every 30s is probably not a big deal. 🤔

consecutiveSuccesses++
if consecutiveSuccesses >= threshold {
registry.MustRegister(infoMetric)
infoMetric.WithLabelValues(labelValues[0], labelValues[1], labelValues[2], labelValues[3], labelValues[4], labelValues[5]).Set(1)
Contributor

Can we make labelValues a typed struct instead? this is hard to read and understand.
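One way to read this suggestion is sketched below. It is an assumption-laden illustration, not the struct that was merged: the type name `connectionInfoLabels`, its field names, and the `joinLabels` helper are all hypothetical. `joinLabels` only stands in for a variadic API such as a Prometheus metric vector's `WithLabelValues(lvs ...string)`, so that the example stays free of external dependencies.

```go
package main

import (
	"fmt"
	"strings"
)

// connectionInfoLabels is a hypothetical typed replacement for a positional
// []string holding six label values. The field names are illustrative only.
type connectionInfoLabels struct {
	ProviderName    string
	ProviderRegion  string
	ProviderAccount string
	InstanceID      string
	Engine          string
	EngineVersion   string
}

// values returns the label values in a single fixed order, so the ordering
// is defined in one place instead of at every call site.
func (l connectionInfoLabels) values() []string {
	return []string{
		l.ProviderName,
		l.ProviderRegion,
		l.ProviderAccount,
		l.InstanceID,
		l.Engine,
		l.EngineVersion,
	}
}

// joinLabels is a stand-in for any variadic consumer of label values.
func joinLabels(lvs ...string) string {
	return strings.Join(lvs, ",")
}

func main() {
	labels := connectionInfoLabels{
		ProviderName:    "aws",
		ProviderRegion:  "us-east-1",
		ProviderAccount: "123",
		InstanceID:      "db-1",
		Engine:          "mysql",
		EngineVersion:   "8.0",
	}
	// The slice is spread with "..." rather than indexed element by element.
	fmt.Println(joinLabels(labels.values()...))
}
```

Named fields make each value's meaning visible at the point of construction, and spreading the slice avoids the `labelValues[0]` through `labelValues[5]` expansion shown in the diff.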

}

// hasConnectionInfoCollector reports whether the connection_info collector is currently in c.collectors.
// Must be called with c.mut held.
Contributor

Nit: given that this and the next function are only called from one place, maybe they could take care of the mutex handling themselves rather than relying on this reminder?

Contributor Author

Yeah.. I don't have a strong opinion here. They're called in a few places, and on lines 321-325 there is only a single mutex lock rather than two separate ones.

We could just make each of those helper functions hold the mutex individually tho, since they are always accessing a property that needs to avoid thread contention. 🤔

Contributor Author

Made this change
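The pattern discussed in this thread, where each helper acquires the mutex itself instead of documenting "must be called with c.mut held", might look roughly like the sketch below. The `component`, `collector`, and `namedCollector` types and the helper names are hypothetical and simplified; whether the real component uses `sync.Mutex` or `sync.RWMutex` is not stated in the thread.

```go
package main

import (
	"fmt"
	"sync"
)

// collector is a hypothetical minimal interface for this sketch.
type collector interface{ Name() string }

// namedCollector is a trivial collector used only for illustration.
type namedCollector string

func (n namedCollector) Name() string { return string(n) }

// component is a simplified stand-in for a type owning a guarded slice.
type component struct {
	mut        sync.RWMutex
	collectors []collector
}

// hasCollector takes the read lock itself, so callers need no reminder.
func (c *component) hasCollector(name string) bool {
	c.mut.RLock()
	defer c.mut.RUnlock()
	for _, col := range c.collectors {
		if col.Name() == name {
			return true
		}
	}
	return false
}

// removeCollector takes the write lock itself and reports whether a
// collector with the given name was found and removed.
func (c *component) removeCollector(name string) bool {
	c.mut.Lock()
	defer c.mut.Unlock()
	for i, col := range c.collectors {
		if col.Name() == name {
			c.collectors = append(c.collectors[:i], c.collectors[i+1:]...)
			return true
		}
	}
	return false
}

func main() {
	c := &component{collectors: []collector{
		namedCollector("connection_info"),
		namedCollector("other"),
	}}
	fmt.Println(c.hasCollector("connection_info"))    // true
	fmt.Println(c.removeCollector("connection_info")) // true
	fmt.Println(c.hasCollector("connection_info"))    // false
}
```

The trade-off raised in the thread still applies: Go mutexes are not reentrant, so these helpers must not be called while the caller already holds `c.mut`, and a check followed by a removal is no longer a single atomic step.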

@rgeyer rgeyer requested a review from cristiangreco March 10, 2026 18:42
@rgeyer rgeyer force-pushed the rgeyer/fix/dbo11y-connection-info-disconnect branch from 62a7d3a to 56ac9b3 Compare March 10, 2026 18:44
@matthewnolf
Contributor

I'm coming to this a bit out of the loop, but the changes seem fairly involved, with multiple tickers, and because they live in component.go they are not tested at all. Is there a reason we can't go with something simpler, like passing the DB connection to the connection_info collector, which can simply unregister the metric if a ping fails?

@rgeyer
Contributor Author

rgeyer commented Mar 10, 2026

@matthewnolf That was the initial implementation. A separate goroutine was started in each connection_info.go collector, one per DB. It pinged the DB every 60 seconds and deregistered the metric from the registry when the pings failed.

This thread is what prompted the changes. I may have overly influenced that, given that my initial PR description pointed out the additional goroutines and DB contacts as a negative thing. Perhaps it's fine?

@matthewnolf
Contributor

But what is stopping the connection_info collector from behaving like all the others: taking a DB connection and running in a loop? The loop in this case would ping the DB and register/unregister the metric.
Overall I'd prefer to keep logic out of component.go, which is mainly for wiring and is less tested.

@rgeyer
Contributor Author

rgeyer commented Mar 11, 2026

Nothing at all, and in fact, that was my initial implementation (see here), but there was discussion about changing it in the thread I linked above.

I'm not partial to either approach, but we should come to some consensus and get this merged.

Thoughts @matthewnolf @cristiangreco @gaantunes ?

@gaantunes
Contributor

My initial PR comment was aimed more at the additional ticker being added for this new functionality. I agree that keeping the logic inside the collectors, as Matt suggests, looks cleaner, but Ryan's consecutive-failure threshold is important to avoid unnecessary billing flapping.

rgeyer added 2 commits March 11, 2026 15:06
…trengthen tests

Remove the dead metricLabelValues field from the MySQL and Postgres
ConnectionInfo structs, replace manual labelValues[0..5] index expansion
with labelValues... spread throughout, fix a latent flakiness bug in
TestRunConnectionInfoMonitor_ReregistersAfterConsecutiveSuccesses (the
weak nil-guard masked a real timing issue where exhausted mock
expectations caused the metric to be re-unregistered before the
assertion), and add collector-level tests covering Stop unregistering the
metric and the monitor goroutine lifecycle when a DB connection is
provided.
@rgeyer rgeyer force-pushed the rgeyer/fix/dbo11y-connection-info-disconnect branch from 5e0ec9d to 2f3798b Compare March 11, 2026 23:09
@cristiangreco cristiangreco changed the title fix: Ensure that database_observability_connection_info is only emitted for a given DB instance when it is available. fix(database_observability): Ensure that connection_info metric is only emitted for a given DB instance when it is available Mar 12, 2026
@cristiangreco
Contributor

@rgeyer approved, and renamed the PR to get a nicer changelog entry

@rgeyer rgeyer merged commit bf0c3dc into main Mar 13, 2026
48 checks passed
@rgeyer rgeyer deleted the rgeyer/fix/dbo11y-connection-info-disconnect branch March 13, 2026 15:44
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Mar 28, 2026

4 participants