Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: metric observer & pool wait timing #1900
feat: metric observer & pool wait timing #1900
Changes from all commits
b7e797e
1d25e95
1b5b2e4
2d3f260
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is honestly pretty overengineered. I think for most cases something like this would suffice:
In other cases, we wouldn't be timing the execution of a single
Future
and so would need a different approach anyway.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with this, but it's flawed in the case of timeouts being hit. A simplified example of the usage for measuring the pool wait is:
When this is called,
time_execution()
is called, and the instant is read. Lets say there are no permits available, thenfut.await
intime_execution()
returnsPoll::Pending
. Eventually the timeout may fire before the future is woken andfut
, thetime_execution()
future, and their respective state (including the instant) is dropped without ever being polled again.This will cause us to incorrectly miss recording the duration of time spent waiting on a pool permit when the connection timeout is hit. This case is the most critical example of connection pool starvation so capturing it is important.
I attempted to describe this situation in the docs, but feel free to suggest improvements 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For timeouts, I think that should be a completely different method of the observer that's invoked. It doesn't make sense to record the wait time when it times out because that's going to be basically just recording the timeout duration itself, within a margin of error.
On that note, while it's out of scope of this PR, I've been meaning to revise the definition of
PoolOptions::connect_timeout()
because I think it's misleading.connect_timeout
should probably be specifically for timing out on connection attempts to the database, with a separateacquire_timeout
capping the maximum wait time for a permit (or maybe just the wholePool::acquire()
call itself).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for separating out the
connect_timeout()
andacquire_timeout()
(and maybe even anexec_timeout()
that covers only the query exec portion, or an "overall" timeout that includes all three phases) - I think this would be much more intuitive.I agree emitting a "a timeout has occurred" event would be useful in order to track the number of queries that fail due to timeouts (regardless of which phase it was in) - perhaps this can be added in a follow-up PR.
I'm not sure I agree, as the pool wait == timeout value case is both valid and useful information, but perhaps we're trying to answer different questions with this metric.
I hope to answer "how saturated is my request pool", which doesn't place any importance on whether or not a timeout occurs. If we capture the pool wait time irrespective of a timeout being hit, and we know the configured timeout value, we can divide the actual wait time by the configured timeout value, and easily compute a relative "percentage of time spent waiting for a connection" which may be ~100% in the case of hitting a timeout during the semaphore wait. Should the timeout occur almost immediately after the permit has been granted, we'd get a 99% pool wait. If no timeout occurs, we still produce a relative measurement of pool saturation - i.e. waiting for half the configured timeout would give us a 50% pool wait measurement. It would be easy to create an alert for "pool wait > 50% of timeout" for example.
What would the benefit of emitting a different metric for a timeout be? If a timeout triggers a different method as you suggest, what does answering the saturation question look like if the timeout occurs during/immediately after/significantly after the permit has been granted? Would always emitting the pool wait duration + an "a timeout occurred" event be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Counterargument: if you do want to record timeouts as part of the wait time, you can simply forward to that method from the one that's called on a timeout, e.g.:
However, as you want to implement it, if a user for some reason does not want timeouts to factor in to the permit wait time metric then they have to manually filter those out:
The former is simpler.
A separate callback for acquires timing out makes it easier for an application to flag itself as "unhealthy". If the pool is at 95% saturation, that is probably a good reason to alert someone or signal to a monitoring daemon to spin up extra capacity but 95% doesn't mean that requests are timing out yet. Having to check that the wait time equals the timeout to signal an unhealthy status unnecessarily complicates the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply forwarding a general timeout event to
permit_wait_time
doesn't work - it doesn't account for timeouts occurring after a permit has been acquired, such as "timed out during query I/O". It's not a permit wait timeout, so it doesn't make sense to then record it as a pool wait timeout.Why would we accept the duration in
fn acquire_timed_out(&self, total_wait: Duration)
? As you said:I'm not really sure what use case we're trying to cover with the proposed changes. Is this:
the use case you're trying to satisfy? If so, we could always emit a
pool_wait_time()
metric once the pool wait is complete/timed out, and any time a timeout occurs (including during the pool wait) emit atimeout_occurred()
event:metrics.pool_wait_time(duration)
metrics.timeout_occurred()
Or, how about we instrument each phase of the query, and add a "did time out" flag to the observer fn? So for a timeout during the pool wait, we would see the following calls:
metrics.pool_wait_timeout(duration, timeout=true)
and for timeouts occurring after the pool wait, during query I/O:
metrics.pool_wait_time(duration, timeout=false)
metrics.query_exec_time(duration, timeout=true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of "should timeouts be included in the wait metric?" I'm with @domodwyer – I consider the metric to be "how long did we wait in a queue before either succeeding or failing (for whatever reason) to acquire a connection". If it fails because of a timeout, then that's OK, that still counts.
If the pool is completely saturated and no connections are getting released, I would expect to see the metrics telling me that the wait times are always the timeout period. I would also like to see metrics telling me about connection errors (including failing to acquire a connection), as the E in USE. These two together should give me a pretty good idea of what's going on.