Add GIL contention metric to Prometheus #7651
Conversation
docs/source/prometheus.rst (outdated)

.. note::
   Requires ``gilknocker`` to be installed, and
   ``distributed.admin.system-monitor.gil-contention.enabled``
   configuration to be set.
I'm curious, is this on by default?
No, it is off by default.
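(For reference, a minimal sketch of opting in from Python, using the standard dask.config.set API and the config key from the docs snippet above; it assumes gilknocker is installed alongside distributed.)

```python
import dask

# Opt in to GIL contention monitoring before starting the scheduler/workers.
# `gilknocker` must also be installed in the same environment.
dask.config.set(
    {"distributed.admin.system-monitor.gil-contention.enabled": True}
)
```

The same key can also be set in the dask YAML config or via the corresponding DASK_ environment variable.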
I'll express a desire to have this eventually (in the next couple of months) on by default, at least in Coiled. There are a few ways to get there. For example, we could:
- Turn the configuration on by default in the Dask config, still gated on gil_knocker being installed (see the sketch after this list)
- Have Coiled depend on gil_knocker and rely on package-sync to distribute that package
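A rough sketch of the kind of gating the first bullet implies: enabled in config, but only actually active when gilknocker imports. The helper name is illustrative, not something from this PR.

```python
import dask.config


def gil_monitoring_active() -> bool:
    """Illustrative helper: config flag on by default, still gated on gilknocker."""
    if not dask.config.get(
        "distributed.admin.system-monitor.gil-contention.enabled", default=False
    ):
        return False
    try:
        import gilknocker  # noqa: F401
    except ImportError:
        return False
    return True
```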
Some questions if we wanted to go further than that:
- Are there negative side effects to running gil_knocker in practice? Is it reasonable to not want it to be on? The cost here is less than a percent?
- How difficult is it to distribute gil_knocker? Do we have conda packages on conda-forge? Are wheels easy to build?
> Are there negative side effects to running gil_knocker in practice? Is it reasonable to not want it to be on? The cost here is less than a percent?
We've noted that some benchmarks are affected when using the default 1 millisecond polling interval. In some rough benchmarking, the 1ms interval appears to reflect what py-spy reports in most test cases, and bumping it to 5ms removes any detectable performance impact AFAICT, but it is much less representative of actual GIL contention.
In those benchmarks, some are not affected (but most have at least some impact), with the worst offender being test_h2o.
For example, with 5ms polling, a reported 30% contention likely corresponds to much higher actual contention, maybe in the 60-90% range. But this varies wildly depending on the type of program being run and how frequently and for how long it holds the GIL in the first place; the metric therefore becomes more of a rough indicator, and the user could then opt for a shorter polling interval to get a closer idea of actual contention. Maybe this is a 'safe' option: default on with a 5ms polling interval?
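For concreteness, a rough standalone sketch (not this PR's implementation) of comparing reported contention at different polling intervals. The gilknocker names used here (KnockKnock, polling_interval_micros, contention_metric) are from memory and may differ between versions:

```python
import time

from gilknocker import KnockKnock


def busy_python_loop(seconds: float) -> None:
    """Pure-Python loop that holds the GIL most of the time."""
    end = time.monotonic() + seconds
    x = 0
    while time.monotonic() < end:
        x += 1


for interval_us in (1_000, 5_000):  # 1ms vs 5ms polling
    knocker = KnockKnock(polling_interval_micros=interval_us)
    knocker.start()
    busy_python_loop(2.0)
    knocker.stop()
    # Smaller intervals (more frequent polling) tend to track py-spy more
    # closely, at the cost of more overhead on the monitored program.
    print(f"{interval_us} us polling -> contention ~{knocker.contention_metric:.2f}")
```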
> How difficult is it to distribute gil_knocker? Do we have conda packages on conda-forge? Are wheels easy to build?
Building wheels is pretty easy, with a decent amount of support already, and the setup can be expanded slightly by reusing the build from another project I maintain, cramjam, which builds wheels for basically every possible platform.
However, if having conda packages is a requirement, it appears such builds need to be replicated on the conda-forge feedstock, which is slightly annoying, especially when a package like cramjam or gilknocker has no dependencies.
Had some other thoughts on lessening the effect that a high monitoring frequency might have, over here: milesgranger/gilknocker#9
After making the updates in gilknocker and re-running the same A/B benchmarks, the effect appears to be much smaller, and it could be lowered further by allowing larger gaps between samples and/or increasing the polling interval (lowering the frequency). Here are some A/B benchmarks after the updates using a 1ms interval (the same interval as above, before the updates). Planning to run 1ms vs 2ms later today.
I would also like to see this on by default if gilknocker is installed. I wonder if the interval could be made configurable, with the default set to some value that doesn't impact performance, and then we could document how to increase the polling frequency for better accuracy, with the appropriate performance warnings.
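A sketch of how that could look for a user. The enabled key is the one documented above; the interval key name here is an assumption, not something confirmed in this diff:

```python
import dask

dask.config.set(
    {
        # Documented key: turn the GIL contention metric on.
        "distributed.admin.system-monitor.gil-contention.enabled": True,
        # Hypothetical companion key: trade accuracy for lower overhead
        # by polling less frequently than the 1ms default.
        "distributed.admin.system-monitor.gil-contention.interval": "5ms",
    }
)
```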
Sorry for the noise here, but with gilknocker==0.4.0 in the updated benchmarks there appears to be no noticeable impact on performance with a 1ms polling interval (the default in the config). The above (outdated) charts were with gilknocker==0.3.0 with 1ms polling.
Sweet no worries then!
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files ±0, 26 suites ±0, 14h 27m 34s ⏱️ +57m 4s

For more details on these failures and errors, see this check. Results for commit 5b708f8. ± Comparison against base commit d1080d2. This pull request skips 1 test.

♻️ This comment has been updated with latest results.
Force-pushed from d890b19 to d604cbb
Force-pushed from d604cbb to 77a520d
Force-pushed from fe9dbc5 to 77a520d
Force-pushed from 65e17b6 to 8149a4f
Hm, I'm surprised the values are >1.0. I was hoping that wouldn't happen. Maybe…
I don't think irate will help (but I can share a plot later). We saw many cases where the value went up by 6 in a 5s interval.
Since…
EDIT: sorry, I see what you're saying; we've incremented the counter by 6 since the last scrape?
Is there anything I can do to help in this PR, or is this more of a Prometheus/Grafana internal thing?
I think the approach of reporting the total since the last scrape isn't ideal and is probably what's leading to sometimes getting a value of 6 in a 5s interval (likely because of the offset between when the value is added and when it's scraped). My take is that this non-ideal approach is fine for now, because a better approach would be non-trivial, and having the data in a non-ideal state is better than not having it at all.
Yeah, I'm just wondering whether the not-ideal counter is better than the not-ideal gauge. Maybe having both a counter and a gauge would be most helpful at first, to figure out which works better.
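For reference, a rough sketch of what exposing both might look like with prometheus_client custom collectors; the metric names and monitor attributes here are illustrative, not the ones used in this PR:

```python
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily


def collect_gil_contention(monitor):
    """Illustrative collector body yielding both a counter and a gauge."""
    # Counter: monotonically increasing sum of contention observed at each
    # monitor tick; rate()/irate() over a scrape window then recovers the
    # average contention for that window.
    yield CounterMetricFamily(
        "dask_worker_gil_contention_total",
        "Cumulative GIL contention observed since process start",
        value=monitor.cumulative_gil_contention,  # illustrative attribute
    )
    # Gauge: contention during the most recent monitor tick only; easier to
    # read directly, but just as sensitive to the offset between when the
    # value is updated and when Prometheus scrapes it.
    yield GaugeMetricFamily(
        "dask_worker_gil_contention",
        "GIL contention during the most recent monitor interval",
        value=monitor.latest_gil_contention,  # illustrative attribute
    )
```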
Force-pushed from 8149a4f to 5b708f8
Shall I add back the Gauge to have both then?
The code overall looks good to me; there's some cleanup that still needs to be done.

> Shall I add back the Gauge to have both then?

I have no opinion on that; just having a counter seems fine by me. I also don't fully understand how a gauge would solve the particular issue that has been discussed. From what I understand, it would run into the same offsetting problem.

> Still curious if adding the Histogram would be useful?

I'd think about this in another increment. Adding a Prometheus-native histogram is (IIRC) blocked by a refactoring of our Prometheus setup; the alternative would be adding crick-based metrics, but those explicitly expose some quantiles, so I'm not a big fan of those.
Thanks for the review, @hendrikmakait. I'm gleaning that the opinion is not to add back the Gauge then. :)
Exposes the GIL contention metrics on Prometheus for Scheduler and Worker. Only applies when `distributed.admin.system-monitor.gil-contention.enabled` is set and `gilknocker` is installed.
gilknocker 0.4.0 supports a sleeping interval, which further reduces program load while maintaining good GIL monitoring: https://github.com/milesgranger/gilknocker/releases/tag/v0.4.0
Co-authored-by: Gabe Joseph <[email protected]>
Co-authored-by: Hendrik Makait <[email protected]>
Force-pushed from eb6ef14 to cb3a19e
Thanks @milesgranger! This overall looks good to me; I have one minor cleanup nit that was caused by my previous suggestions.
Exposes the GIL contention metrics on Prometheus for Scheduler and Worker. Only applies when `distributed.admin.system-monitor.gil-contention.enabled` is set and `gilknocker` is installed. Part of #7290

- pre-commit run --all-files