
feat: metric observer & pool wait timing #1900

Closed
wants to merge 4 commits

Conversation

domodwyer

This PR implements observer-based Pool metric instrumentation, and captures the time taken to acquire a connection permit from the connection pool.

A follow-on from #1860.


  • feat: metric observer & pool wait timing (b7e797e)

    This commit adds the public PoolMetricsObserver trait for end users to 
    optionally implement, in order to receive metric / event data from the 
    underlying connection Pool.
    
    PoolMetricsObserver currently exposes a single metric: the duration of time
    spent waiting for a permit from the connection pool semaphore. This allows
    users to quantify the saturation of the connection pool
    (USE-style metrics) and identify when the workload exceeds the available pool
    capacity.
    
    A default no-op method implementation is provided so end-users can choose
    which metrics they care about, and not have to implement the full set of
    methods.
    

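For illustration, a minimal sketch of what the observer described above might look like (the names, bounds, and exact method set are assumptions; TotalWait is a made-up example implementation):

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Sketch only: the real PoolMetricsObserver in this PR may differ in naming,
// bounds, and the number of methods it exposes.
pub trait PoolMetricsObserver: Send + Sync + 'static {
    /// Time spent waiting for a permit from the pool semaphore.
    /// The default is a no-op, so implementers only override the metrics
    /// they care about.
    fn permit_wait_time(&self, _time: Duration) {}
}

/// Toy observer that accumulates the total wait time in microseconds.
#[derive(Default)]
pub struct TotalWait(AtomicU64);

impl PoolMetricsObserver for TotalWait {
    fn permit_wait_time(&self, time: Duration) {
        self.0.fetch_add(time.as_micros() as u64, Ordering::Relaxed);
    }
}
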
/// This instrumentation therefore (approximately) measures the time between
/// first poll and yielding a value, or the future timing out and being dropped.
#[pin_project(PinnedDrop)]
struct PollWallClockRecorder<F, R>
Collaborator

This is honestly pretty overengineered. I think for most cases something like this would suffice:

use std::future::Future;
use std::time::{Duration, Instant};

async fn time_execution<F: Future>(fut: F) -> (F::Output, Duration) {
    // Won't be collected until the future is first polled
    let started_at = Instant::now();
    let ret = fut.await;
    (ret, started_at.elapsed())
}

In other cases, we wouldn't be timing the execution of a single Future and so would need a different approach anyway.

Author

I started with this, but it's flawed in the case of timeouts being hit. A simplified example of the usage for measuring the pool wait is:

        sqlx_rt::timeout(timeout, async {
            let (permit, duration) = time_execution(semaphore.acquire(1)).await;

            // do stuff with permit
        }).await;

When this is called, time_execution() is invoked and the Instant is read. Let's say there are no permits available: fut.await in time_execution() returns Poll::Pending. Eventually the timeout may fire before the future is woken, and fut, the time_execution() future, and their respective state (including the Instant) are dropped without ever being polled again.

This causes us to incorrectly miss recording the time spent waiting on a pool permit when the connection timeout is hit. That case is the most critical example of connection pool starvation, so capturing it is important.

I attempted to describe this situation in the docs, but feel free to suggest improvements 👍
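
To make the drop-aware approach concrete, here is a rough sketch assuming the pin-project crate. The actual PollWallClockRecorder in this PR differs in naming and detail; this simplified version records only on drop, which for a completed future happens just after it yields.

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

use pin_project::{pin_project, pinned_drop};

// Simplified sketch: start timing on the first poll and record on drop, so the
// wait is captured even when an enclosing timeout drops the future while it is
// still Pending.
#[pin_project(PinnedDrop)]
struct TimedFuture<F, R: Fn(Duration)> {
    #[pin]
    fut: F,
    started_at: Option<Instant>,
    record: R,
}

impl<F: Future, R: Fn(Duration)> Future for TimedFuture<F, R> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.project();
        // Start the clock on the first poll, not at construction time.
        this.started_at.get_or_insert_with(Instant::now);
        this.fut.poll(cx)
    }
}

#[pinned_drop]
impl<F, R: Fn(Duration)> PinnedDrop for TimedFuture<F, R> {
    fn drop(self: Pin<&mut Self>) {
        let this = self.project();
        if let Some(started_at) = this.started_at {
            // Runs whether the future completed or was dropped mid-wait by a
            // timeout, so the wait duration is never silently lost.
            (this.record)(started_at.elapsed());
        }
    }
}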

Collaborator

For timeouts, I think that should be a completely different method of the observer that's invoked. It doesn't make sense to record the wait time when it times out because that's going to be basically just recording the timeout duration itself, within a margin of error.

On that note, while it's out of scope of this PR, I've been meaning to revise the definition of PoolOptions::connect_timeout() because I think it's misleading. connect_timeout should probably be specifically for timing out on connection attempts to the database, with a separate acquire_timeout capping the maximum wait time for a permit (or maybe just the whole Pool::acquire() call itself).

Author

+1 for separating out the connect_timeout() and acquire_timeout() (and maybe even an exec_timeout() that covers only the query exec portion, or an "overall" timeout that includes all three phases) - I think this would be much more intuitive.

I agree emitting a "timeout has occurred" event would be useful in order to track the number of queries that fail due to timeouts (regardless of which phase the timeout occurred in) - perhaps this can be added in a follow-up PR.

It doesn't make sense to record the wait time when it times out

I'm not sure I agree, as the pool wait == timeout value case is both valid and useful information, but perhaps we're trying to answer different questions with this metric.

I hope to answer "how saturated is my connection pool?", which doesn't place any importance on whether or not a timeout occurs. If we capture the pool wait time irrespective of a timeout being hit, and we know the configured timeout value, we can divide the actual wait time by the configured timeout to compute a relative "percentage of time spent waiting for a connection". That would be ~100% when the timeout fires during the semaphore wait, and ~99% should the timeout occur almost immediately after the permit has been granted. If no timeout occurs, we still produce a relative measurement of pool saturation - i.e. waiting for half the configured timeout would give us a 50% pool wait measurement. It would be easy to create an alert for "pool wait > 50% of timeout", for example.

What would the benefit of emitting a different metric for a timeout be? If a timeout triggers a different method as you suggest, what does answering the saturation question look like if the timeout occurs during/immediately after/significantly after the permit has been granted? Would always emitting the pool wait duration + an "a timeout occurred" event be sufficient?
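
As a quick illustration of that arithmetic (the 30-second timeout below is an arbitrary example value, not a sqlx default):

use std::time::Duration;

// Relative pool saturation: the fraction of the configured timeout that was
// actually spent waiting for a permit.
fn pool_saturation(wait: Duration, acquire_timeout: Duration) -> f64 {
    wait.as_secs_f64() / acquire_timeout.as_secs_f64()
}

fn main() {
    let timeout = Duration::from_secs(30);
    // Waiting half the configured timeout reads as 50% saturation...
    assert_eq!(pool_saturation(Duration::from_secs(15), timeout), 0.5);
    // ...and hitting the timeout reads as ~100%.
    assert_eq!(pool_saturation(timeout, timeout), 1.0);
}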

Collaborator

Counterargument: if you do want to record timeouts as part of the wait time, you can simply forward to that method from the one that's called on a timeout, e.g.:

fn acquire_timed_out(&self, total_wait: Duration) {
    self.permit_wait_time(total_wait);
}

However, as you want to implement it, if a user for some reason does not want timeouts to factor into the permit wait time metric, then they have to manually filter those out:

fn permit_wait_time(&self, time: Duration) {
    // Even more complicated if they have to get this value from configuration.
    if time == ACQUIRE_TIMEOUT {
        return;
    }

    ...
}

The former is simpler.

A separate callback for acquires timing out makes it easier for an application to flag itself as "unhealthy". If the pool is at 95% saturation, that is probably a good reason to alert someone or signal to a monitoring daemon to spin up extra capacity but 95% doesn't mean that requests are timing out yet. Having to check that the wait time equals the timeout to signal an unhealthy status unnecessarily complicates the logic.

Author

Simply forwarding a general timeout event to permit_wait_time doesn't work - it doesn't account for timeouts occurring after a permit has been acquired, such as "timed out during query I/O". It's not a permit wait timeout, so it doesn't make sense to then record it as a pool wait timeout.

Why would we accept the duration in fn acquire_timed_out(&self, total_wait: Duration)? As you said:

It doesn't make sense to record the wait time when it times out because that's going to be basically just recording the timeout duration itself, within a margin of error.

I'm not really sure what use case we're trying to cover with the proposed changes. Is this:

A separate callback for acquires timing out makes it easier for an application to flag itself as "unhealthy".

the use case you're trying to satisfy? If so, we could always emit a pool_wait_time() metric once the pool wait is complete/timed out, and any time a timeout occurs (including during the pool wait) emit a timeout_occurred() event:

  • metrics.pool_wait_time(duration)
  • metrics.timeout_occurred()

Or, how about we instrument each phase of the query, and add a "did time out" flag to the observer fn? So for a timeout during the pool wait, we would see the following calls:

  • metrics.pool_wait_time(duration, timeout=true)

and for timeouts occurring after the pool wait, during query I/O:

  • metrics.pool_wait_time(duration, timeout=false)
  • metrics.query_exec_time(duration, timeout=true)
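
A hypothetical shape for that per-phase, flag-carrying observer (neither method name exists in sqlx; they only illustrate the proposal):

use std::time::Duration;

// Each phase reports its duration plus whether the configured timeout fired
// during that phase. Defaults are no-ops.
trait PhaseMetrics {
    fn pool_wait_time(&self, _duration: Duration, _timed_out: bool) {}
    fn query_exec_time(&self, _duration: Duration, _timed_out: bool) {}
}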


In terms of "should timeouts be included in the wait metric?" I'm with @domodwyer – I consider the metric to be "how long did we wait in a queue before either succeeding or failing (for whatever reason) to acquire a connection". If it fails because of a timeout, then that's OK, that still counts.

If the pool is completely saturated and no connections are getting released, I would expect to see the metrics telling me that the wait times are always the timeout period. I would also like to see metrics telling me about connection errors (including failing to acquire a connection), as the E in USE. These two together should give me a pretty good idea of what's going on.

  • Reduces call-site boilerplate, allowing the caller to "always" call the metric wrapper and let the impl decide what to do with the results, turning a None into a no-op and passing the call into the child if Some.
  • Changes the connection options "metrics_observer" method to accept an impl PoolMetricsObserver.
  • Removes some whitespace that caused cargo fmt to bail.
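
A sketch of that wrapper pattern, reusing the PoolMetricsObserver sketch from earlier (the wrapper name and field here are illustrative, not the PR's actual types):

use std::sync::Arc;
use std::time::Duration;

// Call sites always invoke the wrapper; it decides whether there is a child
// observer to forward to, turning the unconfigured case into a no-op.
#[derive(Clone, Default)]
struct MetricsWrapper(Option<Arc<dyn PoolMetricsObserver>>);

impl MetricsWrapper {
    fn permit_wait_time(&self, time: Duration) {
        if let Some(observer) = &self.0 {
            observer.permit_wait_time(time);
        }
    }
}
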
@domodwyer
Author

I think this is good for another pass 👍
