fix(database_observability): Ensure that connection_info metric is only emitted for a given DB instance when it is available#5707
Conversation
…dically check database connectivity and register/deregister the database_observability_connection_info metric which is used for billing purposes
| } | ||
| } | ||
| ctx, cancel = context.WithCancel(ctx) | ||
| go func() { |
There was a problem hiding this comment.
Could we use the component.go existing goroutine instead of starting a new one? I think it would make it cleaner.
There was a problem hiding this comment.
Hmm yeah I'm wondering if it's easier to build this on top of the reconnection logic. Today we keep the collectors in "running" state even if the database connection is lost. Maybe we should just "stop()" collectors until database connection retries succeed?
There was a problem hiding this comment.
It makes sense, but I wonder if it may cause unnecessary stop/starts in a flapping db connection.
We would also need to separate the logs collector from the pool, since it is independent from the db connection, once it is started. Right now it only count errors, but once we implement detailed error scraping it will be useful to troubleshoot a flapping db.
There was a problem hiding this comment.
Refactored this to (mostly) reuse the existing reconnect logic.
The existing logic seems to only be used when the db is unavailable on first startup, so there is an additional check now to ensure that the DB remains connected, and stops the connection_info collector if the DB becomes unavailable.
There was a problem hiding this comment.
I'm a bit confused though by the various tickets though. So, to recap, it seems we have:
- the existing ticker loop for "initial" connection
- a new ticker loop dedicated to
connection_infowith
I understand we want to keep those activities separate, but I'm wondering if we should just put the new logic inside theconnection_infocollector? i.e. that collector would have its own loop like other collectors
There was a problem hiding this comment.
The only real reason for having two tickers is that they're on different cadences (30s and 60s).
We could probably combine them, and checking for DB connectivity every 30s is probably not a big deal. 🤔
| consecutiveSuccesses++ | ||
| if consecutiveSuccesses >= threshold { | ||
| registry.MustRegister(infoMetric) | ||
| infoMetric.WithLabelValues(labelValues[0], labelValues[1], labelValues[2], labelValues[3], labelValues[4], labelValues[5]).Set(1) |
There was a problem hiding this comment.
Can we make labelValues a typed struct instead? this is hard to read and understand.
| } | ||
|
|
||
| // hasConnectionInfoCollector reports whether the connection_info collector is currently in c.collectors. | ||
| // Must be called with c.mut held. |
There was a problem hiding this comment.
Nit: given this and the next functions are just from one place, maybe they could take care of mutex handling rather than having this reminder?
There was a problem hiding this comment.
Yeah.. I don't have a strong opinion here. They're called in a few places, and on lines 321-325 there is only a single mutex lock rather than two separate ones.
We could just make each of those helper functions hold the mutex individually tho, since they are always accessing a property that needs to avoid thread contention. 🤔
62a7d3a to
56ac9b3
Compare
|
I'm coming from being a bit out of the loop on this topic, but the changes involved seem fairly involved with multiple tickers and being in |
@matthewnolf That was the initial implementation. There was a separate goroutine started in each of the This thread is what prompted the changes. I may have overly influenced that, given that my initial PR description pointed out the additional goroutines and DB contacts as a negative thing. Perhaps it's fine? |
But what is stopping the |
Nothing at all, and in fact, that was my initial implementation (see here), but there was discussion about changing it in the thread I linked above. I'm not particular to either approach, but we should come to some consensus and get this merged. Thoughts @matthewnolf @cristiangreco @gaantunes ? |
My initial PR comment was more targeted on the additional ticker being added for this new functionality. I agree that keeping it inside the collectors look cleaner on Matt's PR suggestion, but Ryan's logic for a consecutive failure threshold is important to avoid unneeded billing flapping. |
…trengthen tests Remove the dead metricLabelValues field from the MySQL and Postgres ConnectionInfo structs, replace manual labelValues[0..5] index expansion with labelValues... spread throughout, fix a latent flakiness bug in TestRunConnectionInfoMonitor_ReregistersAfterConsecutiveSuccesses (the weak nil-guard masked a real timing issue where exhausted mock expectations caused the metric to be re-unregistered before the assertion), and add collector-level tests covering Stop unregistering the metric and the monitor goroutine lifecycle when a DB connection is provided.
5e0ec9d to
2f3798b
Compare
connection_info metric is only emitted for a given DB instance when it is available
|
@rgeyer approved, and renamed the PR to get a nicer changelog entry |
Brief description of Pull Request
We noticed a discrepancy between the number of DB instances reported in the dbo11y app, and in the billing data.
This is due to the fact that we are using different metrics/queries, namely.
App:
last_over_time((count(count by (server_id) (mysql_perf_schema_events_statements_total{job=~"integrations/db-o11y", digest=~".+", } or pg_stat_statements_calls_total{job=~"integrations/db-o11y", queryid=~".+", })))[$__range:])Billing:
{__name__="database_observability_connection_info", server_id!="", job="integrations/db-o11y"}Further research revealed that a database which successfully connected, then at some point was unable to be reached would continue emitting the
database_observability_connection_infometric. This is in opposition to our desired billing model, where a customer could "turn off" a database and not be billed for the time it is down.Notes to the Reviewer
This starts yet-another-goroutine to check the DB connectivity on a regular cadence (every 1 minute) and unregister the metric if there are 3 consecutive failures. It will also re-register the metric if the database ping is successful 3 consecutive times after a failure.This now performs the checks in the maincomponent.gofile for each dbo11y component, reusing the goroutine that's already present, with different timers. Should be a bit cleaner.Nevermind, we went back to putting the checks in the connection_info collector. Final decision :)
This will increase pings/connects to the database, but the alternative is having some other collector (which may not be running) report to this collector, which seems the worse of two evils.
PR Checklist
Screenshot from the local test. This is starting alloy monitoring a database which is unavailable, starting the database, then stopping, and resuming it.