Add GIL contention metric to Prometheus #7651

Merged

Conversation

@milesgranger (Contributor) commented Mar 14, 2023:

Exposes the GIL contention metric to Prometheus
for the Scheduler and Worker. Only applies when
distributed.admin.system-monitor.gil-contention.enabled is set and gilknocker is installed.

Part of #7290

  • Tests added / passed
  • Passes pre-commit run --all-files

.. note::
Requires ``gilknocker`` to be installed, and
``distributed.admin.system-monitor.gil-contention.enabled``
configuration to be set.
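
For anyone who wants to try this out locally, here is a minimal sketch of enabling it from Python. The enabled key is the one named above; the interval key name below is an assumption and may be spelled differently in your version of distributed.

```python
# Minimal sketch: enable the GIL contention monitor before creating any workers.
# The "enabled" key is the one referenced in this PR; the "interval" key name is
# an assumption and may differ.  Requires gilknocker to be importable.
import dask
from distributed import Client, LocalCluster

dask.config.set(
    {
        "distributed.admin.system-monitor.gil-contention.enabled": True,
        # Assumed key for the polling interval -- check distributed.yaml for the
        # exact name and default in your version.
        "distributed.admin.system-monitor.gil-contention.interval": "1ms",
    }
)

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2))
    # The Prometheus endpoint on each scheduler/worker dashboard (/metrics)
    # should now include the GIL contention counter.
```
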
Member:

I'm curious, is this on by default?

@milesgranger (Contributor, Author):

No, it is off by default.

Member:

I'll express a desire to have this eventually (in the next couple of months) on by default, at least in Coiled. There are a few ways to get there. For example we could ...

  1. Turn the configuration on by default in Dask config (still gated on gilknocker being installed)
  2. Have Coiled depend on gilknocker and plan on package-sync to distribute that package

Some questions if we wanted to go further than that:

  1. Are there negative side effects to running gilknocker in practice? Is it reasonable to not want it to be on? The cost here is less than a percent?
  2. How difficult is it to distribute gilknocker? Do we have conda packages on conda-forge? Are wheels easy to build?

@milesgranger (Contributor, Author):

Are there negative side effects to running gilknocker in practice? Is it reasonable to not want it to be on? The cost here is less than a percent?

We've noted that some benchmarks are affected when using the default 1 millisecond polling interval. In some rough benchmarking, this 1ms interval appears to reflect what py-spy reports in most test cases; bumping it to 5ms removes any detectable performance impact AFAICT, but is much less representative of the actual GIL contention.

In those benchmarks, some are not affected (but most show at least some impact), with the worst offender being test_h2o:

[A/B benchmark comparison chart]

For example, with 5ms polling a reported 30% contention likely corresponds to something much higher, maybe 60-90%. But this varies wildly depending on the type of program being run and how frequently and for how long it holds the GIL in the first place; the metric then becomes more of a rough indicator, and the user can opt to increase the polling frequency to get a closer idea of the actual contention. Maybe this is a 'safe' option: default on with a 5ms polling interval?
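
For context, here is a minimal sketch of what driving gilknocker directly at a given polling interval looks like. The constructor keyword and attribute names are taken from memory of gilknocker's README and should be treated as assumptions that may differ between versions.

```python
# Minimal sketch of measuring GIL contention with gilknocker directly -- not the
# code added in this PR.  Constructor kwarg and attribute names are assumptions
# based on gilknocker's README and may differ between versions.
import time

from gilknocker import KnockKnock

knocker = KnockKnock(polling_interval_micros=1_000)  # ~1ms polling, as discussed above
knocker.start()

# Hold the GIL in a tight pure-Python loop for about a second.
start = time.perf_counter()
while time.perf_counter() - start < 1.0:
    sum(range(1_000))

knocker.stop()
# Fraction between 0 and 1: how often the sampling thread found the GIL held.
print(knocker.contention_metric)
```

The trade-off above then comes down to the interval passed in: shorter intervals track tools like py-spy more closely, while longer intervals cost less but understate the actual contention.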


How difficult is it to distribute gilknocker? Do we have conda packages on conda-forge? Are wheels easy to build?

Building wheels is pretty easy; there's a decent amount of platform support already, and it could be expanded slightly by reusing the build setup from a different project I maintain, cramjam, which builds wheels for basically every possible platform.

However, if having conda packages is a requirement, it appears such builds need to be replicated on the conda-forge feedstock, which is slightly annoying, especially when a package like cramjam or gilknocker has no dependencies.

@milesgranger (Contributor, Author):

Had some other thoughts on lessening the effect a high monitoring frequency might have, over here: milesgranger/gilknocker#9

@milesgranger (Contributor, Author) commented Mar 16, 2023:

After making the updates in gilknocker and re-running the same A/B benchmarks, the effect appears to be much smaller, and it could be lowered further by allowing for larger gaps between samples and/or bumping the polling interval to a lower frequency. Here are some A/B benchmarks after the updates using a 1ms interval (the same frequency as above, before the updates). Planning to run 1ms vs 2ms later today.

[A/B benchmark comparison chart]

Member:

I would also like to see this on by default if gilknocker is installed. I wonder if the interval could be made configurable, with the default set to some value that doesn't impact performance, and with documentation on how to increase it for better accuracy along with the appropriate performance warnings?

@milesgranger (Contributor, Author) commented Mar 16, 2023:

Sorry for the noise here, but with gilknocker==0.4.0 the updated benchmarks show no noticeable impact on performance with a 1ms polling interval (the default in the config). The above (outdated) charts were with gilknocker==0.3.0 and 1ms polling.

Member:

Sweet, no worries then!

@mrocklin (Member):

cc @ntabris @gjoseph92

github-actions bot (Contributor) commented Mar 14, 2023:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

26 files ±0    26 suites ±0    14h 27m 34s ⏱️ +57m 4s
3 545 tests ±0:  3 432 passed ✔️ −2,  106 skipped 💤 +1,  6 failed ±0,  1 errored 🔥 +1
44 846 runs +1:  42 583 passed ✔️ −2,  2 236 skipped 💤 +2,  26 failed ±0,  1 errored 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 5b708f8. ± Comparison against base commit d1080d2.

This pull request skips 1 test.
distributed.protocol.tests.test_protocol ‑ test_large_messages

♻️ This comment has been updated with latest results.

@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch 2 times, most recently from d890b19 to d604cbb on March 14, 2023 13:29
@milesgranger changed the title to "[WIP] Add GIL contention metric to Prometheus" on Mar 14, 2023
@milesgranger marked this pull request as draft on March 14, 2023 14:16
@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from d604cbb to 77a520d on March 16, 2023 08:35
@milesgranger changed the title back to "Add GIL contention metric to Prometheus" on Mar 16, 2023
@milesgranger marked this pull request as ready for review on March 16, 2023 09:37
@milesgranger requested a review from fjetter as a code owner on March 16, 2023 09:37
@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from fe9dbc5 to 77a520d on March 16, 2023 13:39
@ntabris (Contributor) commented Mar 21, 2023:

Here's an example of plotting rate(dask_worker_gil_contention_total{cluster_id="$cluster_id"}[$__rate_interval]):

[Grafana plot of the worker GIL contention rate]

@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from 65e17b6 to 8149a4f on March 21, 2023 13:29
@gjoseph92 (Collaborator):

Hm, I'm surprised the values are >1.0. I was hoping that wouldn't happen. Maybe irate?

@ntabris (Contributor) commented Mar 22, 2023:

I don't think irate will help (but I can share a plot later). We got many cases where the value went up by 6 in a 5s interval.

@gjoseph92 (Collaborator) commented Mar 22, 2023:

Since irate is the change between only the last two data points, shouldn't it be impossible (by our construction) for that to be >1?

EDIT: sorry, I see what you're saying; we've incremented the counter by 6 since the last scrape?

@milesgranger (Contributor, Author):

Is there anything I can do to help in this PR, or is this more of a Prometheus/Grafana internal thing?

@ntabris (Contributor) commented Mar 30, 2023:

I think the approach of reporting the total since the last scrape isn't ideal and is probably what leads to sometimes getting a value of 6 in a 5s interval (likely because of the offset between when the value is added and when it's scraped). My take is that this non-ideal thing is fine for now: a better approach would be non-trivial, and having the data in a non-ideal state is better than not having it at all.

@gjoseph92 (Collaborator):

Yeah, I'm just wondering if the not-ideal counter is better than the not-ideal gauge? Maybe having both a counter and a gauge would be most helpful at first, to figure out which works better.
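
To make the counter-vs-gauge trade-off concrete, here is a minimal prometheus_client sketch (not distributed's actual collector; the monitor object and its attributes are hypothetical stand-ins) exposing both flavours side by side:

```python
# Minimal sketch, not distributed's actual Prometheus collector.  The `monitor`
# object and its attributes are hypothetical stand-ins for the system monitor's
# GIL data.
from types import SimpleNamespace

from prometheus_client import REGISTRY
from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily


class GILContentionCollector:
    def __init__(self, monitor):
        self.monitor = monitor

    def collect(self):
        # Counter flavour: a monotonically increasing total that Prometheus can
        # rate()/increase() over any window, so scrape/update offsets average
        # out over time.  (Clients typically expose the sample with a `_total`
        # suffix.)
        yield CounterMetricFamily(
            "dask_worker_gil_contention",
            "Cumulative GIL contention observed by the system monitor",
            value=self.monitor.cumulative_contention,  # hypothetical attribute
        )
        # Gauge flavour: the fraction observed during the most recent monitoring
        # interval -- directly readable, but only a point-in-time sample.
        yield GaugeMetricFamily(
            "dask_worker_gil_contention_fraction",
            "GIL contention fraction over the last monitoring interval",
            value=self.monitor.latest_contention,  # hypothetical attribute
        )


# Dummy data so the sketch runs standalone.
monitor = SimpleNamespace(cumulative_contention=12.5, latest_contention=0.3)
REGISTRY.register(GILContentionCollector(monitor))
```

The counter needs rate()/increase() at query time but tolerates the scrape/update offset better in aggregate; the gauge reads directly in Grafana but only reflects the last sample before each scrape.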

@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from 8149a4f to 5b708f8 on April 3, 2023 09:35
@milesgranger (Contributor, Author):

Shall I add back the Gauge to have both then?
Still curious if adding the Histogram would be useful?

@hendrikmakait self-requested a review on April 6, 2023 16:16
@hendrikmakait (Member) left a comment:

The code overall looks good to me; there's some cleanup that still needs to be done.

Shall I add back the Gauge to have both then?

I have no opinion on that; just having a counter seems fine by me. I also don't fully understand how a gauge would solve the particular issue that has been discussed. From what I understand, this should run into the same offsetting problem.

Still curious if adding the Histogram would be useful?

I'd think about this in another increment. Adding a Prometheus-native histogram is (IIRC) blocked by a refactoring of our Prometheus setup; the alternative would be adding crick-based metrics, but those explicitly expose some quantiles, so I'm not a big fan of those.

Resolved review threads (outdated): docs/source/prometheus.rst (×3), distributed/http/tests/test_core.py
@milesgranger (Contributor, Author):

Thanks for the review @hendrikmakait

I have no opinion on that; just having a counter seems fine by me. I also don't fully understand how a gauge would solve the particular issue that has been discussed. From what I understand, this should run into the same offsetting problem

I'm gleaning the opinion is not to add back the Gauge then. :)

@milesgranger force-pushed the 7290-pt3-gil-metric-in-prometheus branch from eb6ef14 to cb3a19e on April 11, 2023 10:12
@hendrikmakait (Member) left a comment:

Thanks @milesgranger! This overall looks good to me; I have one minor cleanup nit that was caused by my previous suggestions.

Resolved review thread (outdated): distributed/http/tests/test_core.py
@hendrikmakait merged commit e1944ec into dask:main on April 11, 2023
@milesgranger deleted the 7290-pt3-gil-metric-in-prometheus branch on April 11, 2023 12:27