Add GIL contention metric to Prometheus #7651

Merged

Conversation

@milesgranger (Contributor) commented Mar 14, 2023:

Exposes the GIL contention metric to Prometheus
for the Scheduler and Worker. Only applies when
distributed.admin.system-monitor.gil-contention.enabled is set and gilknocker is installed.

Part of #7290

  • Tests added / passed
  • Passes pre-commit run --all-files

.. note::
Requires ``gilknocker`` to be installed, and
``distributed.admin.system-monitor.gil-contention.enabled``
configuration to be set.
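
For anyone who wants to try this out locally, here is a minimal sketch of enabling it from Python. The enabled key is the one named above; the interval key name below is an assumption and may be spelled differently in your version of distributed.

```python
# Minimal sketch: enable the GIL contention monitor before creating any workers.
# The "enabled" key is the one referenced in this PR; the "interval" key name is
# an assumption and may differ.  Requires gilknocker to be importable.
import dask
from distributed import Client, LocalCluster

dask.config.set(
    {
        "distributed.admin.system-monitor.gil-contention.enabled": True,
        # Assumed key for the polling interval -- check distributed.yaml for the
        # exact name and default in your version.
        "distributed.admin.system-monitor.gil-contention.interval": "1ms",
    }
)

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2))
    # The Prometheus endpoint on each scheduler/worker dashboard (/metrics)
    # should now include the GIL contention counter.
```
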
Member:

I'm curious, is this on by default?

@milesgranger (Contributor, Author):

No, it is off by default.

Member:

I'll express a desire to have this eventually (in the next couple of months) on by default, at least in Coiled. There are a few ways to get there. For example we could ...

  1. Turn the configuration on by default in Dask config (still gated on gilknocker being installed)
  2. Have Coiled depend on gilknocker and plan on package-sync to distribute that package

Some questions if we wanted to go further than that:

  1. Are there negative side effects to running gilknocker in practice? Is it reasonable to not want it to be on? The cost here is less than a percent?
  2. How difficult is it to distribute gilknocker? Do we have conda packages on conda-forge? Are wheels easy to build?

@milesgranger (Contributor, Author):

Are there negative side effects to running gilknocker in practice? Is it reasonable to not want it to be on? The cost here is less than a percent?

We've noted that some benchmarks are affected when using the default 1 millisecond polling interval. In some rough benchmarking, this 1ms interval appears to reflect what py-spy reports in most test cases; bumping it to 5ms removes any detectable performance impact AFAICT, but is much less representative of the actual GIL contention.

In those benchmarks, some are not affected (but most show at least some impact), with the worst offender being test_h2o:

[A/B benchmark comparison chart]

For example, with 5ms polling a reported 30% contention likely corresponds to something much higher, maybe 60-90%. But this varies wildly depending on the type of program being run and how frequently and for how long it holds the GIL in the first place; the metric then becomes more of a rough indicator, and the user can opt to increase the polling frequency to get a closer idea of the actual contention. Maybe this is a 'safe' option: default on with a 5ms polling interval?
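
For context, here is a minimal sketch of what driving gilknocker directly at a given polling interval looks like. The constructor keyword and attribute names are taken from memory of gilknocker's README and should be treated as assumptions that may differ between versions.

```python
# Minimal sketch of measuring GIL contention with gilknocker directly -- not the
# code added in this PR.  Constructor kwarg and attribute names are assumptions
# based on gilknocker's README and may differ between versions.
import time

from gilknocker import KnockKnock

knocker = KnockKnock(polling_interval_micros=1_000)  # ~1ms polling, as discussed above
knocker.start()

# Hold the GIL in a tight pure-Python loop for about a second.
start = time.perf_counter()
while time.perf_counter() - start < 1.0:
    sum(range(1_000))

knocker.stop()
# Fraction between 0 and 1: how often the sampling thread found the GIL held.
print(knocker.contention_metric)
```

The trade-off above then comes down to the interval passed in: shorter intervals track tools like py-spy more closely, while longer intervals cost less but understate the actual contention.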


How difficult is it to distribute gilknocker? Do we have conda packages on conda-forge? Are wheels easy to build?

Building wheels is pretty easy; there's a decent amount of platform support already, and it could be expanded slightly by reusing the build setup from a different project I maintain, cramjam, which builds wheels for basically every possible platform.

However, if having conda packages is a requirement, it appears such builds need to be replicated on the conda-forge feedstock, which is slightly annoying, especially when a package like cramjam or gilknocker has no dependencies.

@milesgranger (Contributor, Author):

Had some other thoughts on lessening the effect a high monitoring frequency might have, over here: milesgranger/gilknocker#9

@milesgranger (Contributor, Author) commented Mar 16, 2023:

After making the updates in gilknocker and re-running the same A/B benchmarks, the effect appears to be much smaller, and it could be lowered further by allowing for larger gaps between samples and/or bumping the polling interval to a lower frequency. Here are some A/B benchmarks after the updates using a 1ms interval (the same frequency as above, before the updates). Planning to run 1ms vs 2ms later today.

[A/B benchmark comparison chart]

Member:

I would also like to see this on by default if gilknocker is installed. I wonder if the interval could be made configurable, with the default set to some value that doesn't impact performance, and with documentation on how to increase it for better accuracy along with the appropriate performance warnings?

@milesgranger (Contributor, Author) commented Mar 16, 2023:

Sorry for the noise here, but with gilknocker==0.4.0 the updated benchmarks show no noticeable impact on performance with a 1ms polling interval (the default in the config). The above (outdated) charts were with gilknocker==0.3.0 and 1ms polling.

Member:

Sweet, no worries then!

@mrocklin (Member):

cc @ntabris @gjoseph92

github-actions bot (Contributor) commented Mar 14, 2023:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files ±0    26 suites ±0    14h 27m 34s ⏱️ +57m 4s
3 545 tests ±0:  3 432 passed ✔️ −2,  106 skipped 💤 +1,  6 failed ±0,  1 errored 🔥 +1
44 846 runs +1:  42 583 passed ✔️ −2,  2 236 skipped 💤 +2,  26 failed ±0,  1 errored 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 5b708f8. ± Comparison against base commit d1080d2.

This pull request skips 1 test.
distributed.protocol.tests.test_protocol ‑ test_large_messages

♻️ This comment has been updated with latest results.

@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch 2 times, most recently from d890b19 to d604cbb on March 14, 2023 13:29
@milesgranger changed the title to "[WIP] Add GIL contention metric to Prometheus" on Mar 14, 2023
@milesgranger marked this pull request as draft on March 14, 2023 14:16
@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from d604cbb to 77a520d on March 16, 2023 08:35
@milesgranger changed the title back to "Add GIL contention metric to Prometheus" on Mar 16, 2023
@milesgranger marked this pull request as ready for review on March 16, 2023 09:37
@milesgranger requested a review from fjetter as a code owner on March 16, 2023 09:37
@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from fe9dbc5 to 77a520d on March 16, 2023 13:39
@ntabris (Contributor) commented Mar 21, 2023:

Here's an example of plotting rate(dask_worker_gil_contention_total{cluster_id="$cluster_id"}[$__rate_interval]):

[Grafana plot of the worker GIL contention rate]

@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from 65e17b6 to 8149a4f on March 21, 2023 13:29
@gjoseph92 (Collaborator):

Hm, I'm surprised the values are >1.0. I was hoping that wouldn't happen. Maybe irate?

@ntabris (Contributor) commented Mar 22, 2023:

I don't think irate will help (but I can share a plot later). We got many cases where the value went up by 6 in a 5s interval.

@gjoseph92 (Collaborator) commented Mar 22, 2023:

Since irate is the change between only the last two data points, shouldn't it be impossible (by our construction) for that to be >1?

EDIT: sorry, I see what you're saying; we've incremented the counter by 6 since the last scrape?

@milesgranger (Contributor, Author):

Is there anything I can do to help in this PR, or is this more of a Prometheus/Grafana internal thing?

@ntabris (Contributor) commented Mar 30, 2023:

I think the approach of reporting the total since the last scrape isn't ideal and is probably what leads to sometimes getting a value of 6 in a 5s interval (likely because of the offset between when the value is added and when it's scraped). My take is that this non-ideal thing is fine for now: a better approach would be non-trivial, and having the data in a non-ideal state is better than not having it at all.

@gjoseph92 (Collaborator):

Yeah, I'm just wondering if the not-ideal counter is better than the not-ideal gauge? Maybe having both a counter and a gauge would be most helpful at first, to figure out which works better.
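
To make the counter-vs-gauge trade-off concrete, here is a minimal prometheus_client sketch (not distributed's actual collector; the monitor object and its attributes are hypothetical stand-ins) exposing both flavours side by side:

```python
# Minimal sketch, not distributed's actual Prometheus collector.  The `monitor`
# object and its attributes are hypothetical stand-ins for the system monitor's
# GIL data.
from types import SimpleNamespace

from prometheus_client import REGISTRY
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily


class GILContentionCollector:
    def __init__(self, monitor):
        self.monitor = monitor

    def collect(self):
        # Counter flavour: a monotonically increasing total that Prometheus can
        # rate()/increase() over any window, so scrape/update offsets average
        # out over time.  (Clients typically expose the sample with a `_total`
        # suffix.)
        yield CounterMetricFamily(
            "dask_worker_gil_contention",
            "Cumulative GIL contention observed by the system monitor",
            value=self.monitor.cumulative_contention,  # hypothetical attribute
        )
        # Gauge flavour: the fraction observed during the most recent monitoring
        # interval -- directly readable, but only a point-in-time sample.
        yield GaugeMetricFamily(
            "dask_worker_gil_contention_fraction",
            "GIL contention fraction over the last monitoring interval",
            value=self.monitor.latest_contention,  # hypothetical attribute
        )


# Dummy data so the sketch runs standalone.
monitor = SimpleNamespace(cumulative_contention=12.5, latest_contention=0.3)
REGISTRY.register(GILContentionCollector(monitor))
```

The counter needs rate()/increase() at query time but tolerates the scrape/update offset better in aggregate; the gauge reads directly in Grafana but only reflects the last sample before each scrape.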

@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from 8149a4f to 5b708f8 on April 3, 2023 09:35
@milesgranger (Contributor, Author):

Shall I add back the Gauge to have both then?
Still curious if adding the Histogram would be useful?

@hendrikmakait self-requested a review on April 6, 2023 16:16
@hendrikmakait (Member) left a comment:

The code overall looks good to me; there's some cleanup that still needs to be done.

Shall I add back the Gauge to have both then?

I have no opinion on that; just having a counter seems fine by me. I also don't fully understand how a gauge would solve the particular issue that has been discussed. From what I understand, this should run into the same offsetting problem.

Still curious if adding the Histogram would be useful?

I'd think about this in another increment. Adding a Prometheus-native histogram is (IIRC) blocked by a refactoring of our Prometheus setup; the alternative would be adding crick-based metrics, but those explicitly expose some quantiles, so I'm not a big fan of those.

Resolved review threads (outdated): docs/source/prometheus.rst (×3), distributed/http/tests/test_core.py
@milesgranger (Contributor, Author):

Thanks for the review @hendrikmakait

I have no opinion on that; just having a counter seems fine by me. I also don't fully understand how a gauge would solve the particular issue that has been discussed. From what I understand, this should run into the same offsetting problem

I'm gleaning the opinion is not to add back the Gauge then. :)

@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from eb6ef14 to cb3a19e on April 11, 2023 10:12
@hendrikmakait (Member) left a comment:

Thanks @milesgranger! This overall looks good to me; I have one minor cleanup nit that was caused by my previous suggestions.

Resolved review thread (outdated): distributed/http/tests/test_core.py
@hendrikmakait merged commit e1944ec into dask:main on April 11, 2023
@milesgranger deleted the 7290-pt3-gil-metric-in-prometheus branch on April 11, 2023 12:27