feat: expose pool permit wait times #1860
Conversation
This commit adds a new pool_wait_duration() method to Pool, returning the total duration this Pool instance has spent waiting to acquire a permit from the pool semaphore (rounded to the nearest millisecond). This allows users to quantify the saturation of the connection pool (USE-style metrics) and identify when the workload exceeds the available pool capacity. Both measuring the pool permit wait duration and retrieving the measured value are cheap operations.
An upper bound of the pool wait time can be calculated in the pool_wait_duration_counter_increases() test to further assert correctness.
```diff
@@ -158,6 +158,7 @@ memchr = { version = "2.4.1", default-features = false }
 num-bigint = { version = "0.4.0", default-features = false, optional = true, features = ["std"] }
 once_cell = "1.9.0"
 percent-encoding = "2.1.0"
+pin-project = "1.0.10"
```
This was already a transitive dependency, so it shouldn't slow down compilation.
I was going to write a manual projection impl, but the unsafe lint pushed me towards the extra dependency 👍
```rust
pub(super) fn pool_wait_duration(&self) -> Duration {
    Duration::from_millis(self.pool_wait_duration_ms.load(Ordering::Relaxed))
}
```
By always returning a `Duration` to the user, we hide the implementation details of the atomic measured value's precision / unit conversions.
```rust
// Pass through the poll to the inner future.
this.fut.poll(cx)
```
We could inspect the result of the poll here and emit the elapsed duration when we see a `Poll::Ready` instead of at drop time for a more accurate measurement, but given the usage and the negligible time difference I thought it was best to go for a simple impl first - though I'm happy to build it up.
```rust
// The duration of time conn_2 was blocked should be recorded in the pool
// wait metric.
let wait = pool.pool_wait_duration();
assert!(
    wait.as_millis() as u64 >= DELAY_MS,
    "expected at least {}, got {}",
    DELAY_MS,
    wait.as_millis()
);

assert!(
    wait < upper_bound,
    "expected at most {:?}, got {:?}",
    upper_bound,
    wait
);
```
This shouldn't flake (time-based test 💣) but obviously cannot assert the exact value 👍
Sorry for the delay. Got busy with work stuff.

I like the idea of adding a way to get metrics out of the pool. If you're sampling a cumulative counter, something like an exponential moving average might be more useful, but I think I'd like to see an API where the pool provides a way to set a callback to get these timings as they're generated.

One option I'm thinking is to have a trait that the user can implement for the timings they care about:

```rust
// bikeshedding
pub trait PoolMetricsObserver<DB: Database> {
    /// Receives the time a [`Pool::acquire()`] call waited for a connection permit.
    ///
    /// If this value is spiking, it may indicate the pool is reaching saturation.
    fn permit_wait_time(&self, pool: &Pool<DB>, time: std::time::Duration) {}
}
```

All the methods would have default implementations so the user can implement only what they care about, and then this would be settable when configuring the pool. The other option is using an external metrics facade crate.
Hey @abonander

We actually record query latency / count per query separately, so combining the total count of all queries this higher-level instrumentation gives us with the total pool wait duration exposed in this PR gives us the average wait per call (and importantly, the rate of change of this value over time). I don't expect others will have this extra layer though.

I opened this PR with a small change to increase the odds of getting it accepted, but truthfully I too would prefer a more sophisticated callback mechanism so we can capture the latency distribution in a histogram (like we do for the overall call latency) rather than only being able to derive the average wait. It's the p90's that get ya!

I'm pleased you're up for this. Personally I'd prefer the optional-metric-observer approach over relying on a metrics facade.

I'm happy to implement this 👍
@domodwyer did you want to pivot this PR to that implementation or open a new one? I could also take a shot at it if you're busy.
I have actually implemented the observer changes already, just need to tidy it up - I'll try and push it over the line today. I plan on opening a new PR to keep things clean 👍
First off, a big show of appreciation to the authors for this library - it must be a lot of work, but it's great 🎉

We (@influxdata / IOx) have been making use of sqlx to interact with our database catalog, a fundamental building block that maintains metadata state for the system (the store used to back the equivalent of `INFORMATION_SCHEMA` views in other databases). This can be a frequently hit resource. While we're careful to minimise requests to the catalog, in some situations we have been saturating the available connections in the connection pool, leading to long waits to acquire a connection (and at times, timeouts being hit waiting for this to complete).

We would love a way to measure this saturation, so this PR proposes capturing a cumulative sum of pool permit wait times and exposing it through a `Pool::pool_wait_duration()` call.

We have tried to tackle this without an invasive change to sqlx, but the alternatives either lead to inaccurate measurements (decorating the `Acquire` trait conflates pool wait time with connection dial time, for example) or are impractical (polling `Pool::num_idle()` and building a histogram of values is expensive, and the low sampling rate relative to the high connection churn rate makes it practically useless).

Hopefully it is a helpful metric for others too 🎉
feat: expose pool permit wait times (bdaea31)
test: assert pool wait upper bound (226539d)