Cluster wait #6700

idorrington92 · 2022-07-09T16:23:15Z

Closes #6346

Tests added / passed
Passes pre-commit run --all-files

…if it can

…er_wait

GPUtester · 2022-07-09T16:23:17Z

Can one of the admins verify this patch?

github-actions · 2022-07-09T17:24:02Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      26 files ±  0       26 suites ±0 12h 50m 9s ⏱️ - 8m 54s
  3 552 tests +  1   3 444 ✔️ -   2   105 💤 ±0 2 ❌ +2 1 🔥 +1
44 952 runs +13 42 826 ✔️ +13 2 121 💤 - 5 4 ❌ +4 1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit c95aafd. ± Comparison against base commit 1baa5ff.

♻️ This comment has been updated with latest results.

quasiben · 2022-07-09T17:24:18Z

add to allowlist

idorrington92 · 2022-07-12T21:37:26Z

add to allowlist

@quasiben Sorry, what does this mean?

Also, I've got 3 failing tests in test_ssh. Is it possible to run these locally? They just fail for me, saying cluster failed to start. This happens on main branch as well. I'm not really sure how to debug them otherwise...

quasiben · 2022-07-12T21:50:15Z

@idorrington92 the comment allows us to run your PR against a GPU CI system:
https://docs.dask.org/en/stable/develop.html#gpu-ci

Distributed has been flaky in the past and recently there has been a concerted effort to stabilize the CI test:
https://dask.github.io/distributed/test_report.html

As you can see in the linked report, test_dashboard_port_zero is a known flaky test

I would suggest @jacobtomlinson review this PR but he is currently at SciPy (as are other dask maintainers) so it may be a little bit before you hear back

jacobtomlinson

Many thanks for raising this!

distributed/client.py

…rkers is not implemented

…rge didn't remove it from mine

jacobtomlinson

Thanks for addressing the review feedback.

It looks like the linter isn't happy. If you haven't installed pre-commit you'll want to do that.

Also it looks like some of the test failures might be related, especially this one.

_____________________ test_ssh_nprocs_renamed_to_n_workers _____________________
/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/site-packages/coverage/data.py:130: CoverageWarning: Data file '/home/runner/work/distributed/distributed/.coverage.fv-az302-754.15681.346903' doesn't seem to be a coverage data file: cannot unpack non-iterable NoneType object
  data._warn(str(exc))
@gen_test()
asyncdeftest_ssh_nprocs_renamed_to_n_workers():
with pytest.warns(FutureWarning, match="renamed to n_workers"):
asyncwith SSHCluster(
                ["127.0.0.1"] * 3,
                connect_options=dict(known_hosts=None),
                asynchronous=True,
                scheduler_options={"idle_timeout": "5s"},
                worker_options={"death_timeout": "5s", "nprocs": 2},
            ) as cluster:
assertlen(cluster.workers) == 2
asyncwith Client(cluster, asynchronous=True) as client:
>                   await client.wait_for_workers(4)
distributed/deploy/tests/test_ssh.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = SSHCluster(SSHCluster, 'tcp://10.1.0.103:42191', workers=0, threads=0, memory=0 B)
n_workers = 4, timeout = None
asyncdef_wait_for_workers(self, n_workers=0, timeout=None):
>       info = self.scheduler.identity()
E       AttributeError: 'Scheduler' object has no attribute 'identity'
distributed/deploy/cluster.py:534: AttributeError

Could you take a look?

idorrington92 · 2022-07-25T10:17:39Z

Doesn't it say listing/pre-commit hooks passed ok?

I agree that test failure is related but I can't run it locally as I get an error saying cluster failed to start.

Is there something I need to do in order to run it locally? Otherwise the only thing I've got to go on is the automated tests here...

jacobtomlinson · 2022-07-25T10:34:21Z

Apologies I was looking at Linting / Unit Test Results (pull_request) which actually has nothing to do with linting, sorry for the noise.

Can you run ssh localhost without having to enter a password?

idorrington92 · 2022-07-25T21:10:01Z

Thanks @jacobtomlinson, that fixed the error I was getting. I would have taken ages to think of that.

I had a little look at the bug that was coming up in the tests, but it's not obvious to me why it's happening. I'll have proper look later in the week.

…tity

…e using a cluster

…viour of client.wait_for_workers

mrocklin · 2022-12-26T20:22:08Z

Hi @idorrington92 sorry that this hasn't gotten much love. I suspect that most folks are out during the holidays. I'm holding down the fort for the moment.

If I were reviewing this initially I probably would have said "the try-except logic around cluster.wait_for_workers seems super simple and good to me 👍 . However, the implementation for Cluster.wait_for_workers and the changes to update_scheduler_info seem also probably good, but like an area that might be error prone. Do we actually want/need this for some reason? If not I'd be inclined to skip this and stay with status quo. It seems to me like the upstream issue was mostly asking for the ability of downstream clusters to have their wait_for_worker methods respected, not tha we implement one for all Cluster classes.

My preference is to drop the implementation (even if this means dropping the test), and just stick with the try-except logic that you have in client.py (which seems super straightforward). However, I appreciate that you've also gone through several rounds of review, and having expectations change on you after so long probably isn't fun.

Do you have any thoughts on the above? Is there a solid reason to implement wait_for_workers and update_scheduler_info that I'm not seeing here?

mrocklin · 2022-12-26T20:23:36Z

Said a different way, I'm more than happy to merge the client.py changes immediately. I'd want to look over the other changes more thoroughly, which would take me some time. Or somoene like @jacobtomlinson who has already looked over them could merge if he's feeling confident.

fjetter · 2023-01-12T09:40:50Z

distributed/deploy/cluster.py

+        await self._scheduler_info_comm.write({"op": "identity"})
+        self.scheduler_info = SchedulerInfo(await self._scheduler_info_comm.read())


You can simply do

self.scheduler_info = SchedulerInfo(await self._scheduler.identity())

This will allow this connection to be reused and handles the write/read part of this call. Specifically you will not need to manage the lifecycle of the comm yourself.

FWIW I think the same should be done with the _watch_worker_status_comm. I don't see a reason why these calls would deserve a dedicated connection. I assume _watch_worker_status_comm was introduced before we had a connection pool

Hi @fjetter,
I get an attribute error when making this change and using a LocalCluster.

It was quite a while ago that I wrote this code, but I'm pretty sure the reason I went down the creating-a-comm route was because clusters don't (necessarily?) have a _scheduler attribute.

Am I missing something? Should there be a _scheduler attribute?

Sorry, this should've been self.scheduler_comm.

This should always be set if a cluster is already started. The implementation is a bit messy, though. Cluster doesn't actually set it but SpecCluster does... regardless, having this set is a requirement since we're using this in many other places as well, i.e. it is safe to assume that subclasses implement this as well

That worked! I've removed all the messy comm stuff now :)
Thank you :)

fjetter · 2023-01-12T09:42:51Z

distributed/client.py

+            # Most likely, either self.cluster is None, or the cluster has not
+            # implemented a wait_for_workers method


nit: You added wait_for_workers to the base class, i.e. every cluster will come equipped with this. I typically prefer dealing with these situations by being explicit, i.e. if cluster is None; ...

I agree, changed it to an if-else

…er_wait

idorrington92 · 2023-01-19T17:27:35Z

Hi @idorrington92 sorry that this hasn't gotten much love. I suspect that most folks are out during the holidays. I'm holding down the fort for the moment.

If I were reviewing this initially I probably would have said "the try-except logic around cluster.wait_for_workers seems super simple and good to me +1 . However, the implementation for Cluster.wait_for_workers and the changes to update_scheduler_info seem also probably good, but like an area that might be error prone. Do we actually want/need this for some reason? If not I'd be inclined to skip this and stay with status quo. It seems to me like the upstream issue was mostly asking for the ability of downstream clusters to have their wait_for_worker methods respected, not tha we implement one for all Cluster classes.

My preference is to drop the implementation (even if this means dropping the test), and just stick with the try-except logic that you have in client.py (which seems super straightforward). However, I appreciate that you've also gone through several rounds of review, and having expectations change on you after so long probably isn't fun.

Do you have any thoughts on the above? Is there a solid reason to implement wait_for_workers and update_scheduler_info that I'm not seeing here?

Hi @mrocklin,
Thank you for your quick reply :)
I see what you mean. I took the issue quite literally and just implemented the changes that were requested. But perhaps a simple try-catch in client.py would allow clusters to do everything the issue really needs. I've not used clusters much, so don't know from a users perspective whether cluster.wait_for_workers is very useful.

FYI - Following the comment from @fjetter above, I've replaced that try-catch with an if-else, so that will need to be reverted if we do go down that route.

I'm happy either way, I learned a lot from working on this issue, and would rather remove most of my changes and contribute something useful than add a load of code that'll cause problems later on :)

I think it's better if people with more user knowledge than me decide which route we go down though :)

idorrington92 · 2023-04-05T15:25:42Z

Just pinging again to see if we can merge this now?

jacobtomlinson

Sure. Could you fix up the merge conflicts?

…er_wait

idorrington92 · 2023-04-10T15:50:54Z

I've handled the merge conflicts. I'm getting some failing tests, but I'm seeing these test failures in other PRs as well so don't think it's to do with my changes...

…er_wait

jacobtomlinson · 2023-04-20T11:13:41Z

It looks like distributed/deploy/tests/test_cluster.py::test_cluster_wait_for_worker is failing a bunch here so it's certainly related to these changes. Would you mind taking another look?

jacobtomlinson

The CI failures seem to be happening consistently with Python 3.8 only.

I had a skim over the code and can't see anything that sticks out as being a syntax problem but maybe you could test locally with 3.8 to find the problem?

jacobtomlinson · 2023-04-20T14:00:47Z

distributed/deploy/cluster.py

+            self.scheduler_info = SchedulerInfo(await self.scheduler_comm.identity())
+
+    def wait_for_workers(
+        self, n_workers: int | str = no_default, timeout: float | None = None


Writing unions with a pipe like this is supported in Python >=3.10. Not sure if that's related to the issues we are seeing?

Agreed it's 3.8. I do get same error locally. Changing (or removing) type hints doesn't fix it.

It's a thread leaking issue, which sounds much lower level that anything I've done here, but I'll keep looking in to it

I've managed to fix it. I noticed the other tests in test_cluster used "async with" to start the cluster, and didn't pass in the loop fixture, so I copied that. I'd be lying if I said I knew what went wrong and how this fixed it though...

jacobtomlinson

Thanks for fixing this up. CI failures now look unrelated.

Thanks for pushing this along, sorry that it's taken so long to get in.

* Moving wait_for_worker logic to cluster, and having client call that if it can * Adding test for cluster.wait_for_workers * use try and except to catch case where cluster is none or wait_for_workers is not implemented * linting * This test has been removed on main branch, but for some reason git merge didn't remove it from mine * Cluster has to use scheduler_info attribute instead of scheduler.identity * lint * reverting * need to use cluster.scale when client.wait_for_workers is called while using a cluster * need to use scheduler_info. Also, using cluster.scale to emulate behaviour of client.wait_for_workers * using scheduler_info and dont need to call scale anymore * lint * adding gen_test decorator * Don't think we need to scale at start of wait_for_workers * self.scheduler_info does not update worker status from init to running, so need to request status again * Use Status * Scale was fixing the nworkers test because it forced the worker status to update. Now that worker status is checked we don't need this (and shouldn't have really included it anyway) * Refactoring * Fixing type information * Experimenting with creating new comm * Create separate comm in _start and use that to update scheduler_info * Close new comm * initialise scheduler_info_comm * Don't allow n_workers to be zero for cluster wait_for_workers * Adding return type * Change try-catch to be an explicit if-else * Check explicitly for cluster is none, as I think it's clearer * linting * use scheduler_comm instead of opening new comm * remove update_scheduler_info method * pre-commit changes * Reduce number of works to see if it fixes github tests * Changing test to make it work in python 3.8

consideRatio · 2023-12-31T13:59:58Z

This was a breaking change for dask-gateway's client i think, because it assumes that its available instead of opting in to it if it is.

Can we complement this PR with a conditional check to verify the new function is available or similar? Possibly emitting a warning or similar?

If this is confirmed to being a resonable call, i can open a PR but since I'm a novice in this repo's code base it would be good to have a signal it could make sense at all.

idorrington92 added 3 commits July 9, 2022 14:53

Moving wait_for_worker logic to cluster, and having client call that …

4f850e8

…if it can

Merge branch 'main' of https://github.com/dask/distributed into clust…

ec66806

…er_wait

Adding test for cluster.wait_for_workers

4ee2739

jacobtomlinson reviewed Jul 19, 2022

View reviewed changes

distributed/client.py Outdated Show resolved Hide resolved

idorrington92 and others added 5 commits July 20, 2022 21:52

use try and except to catch case where cluster is none or wait_for_wo…

f9a840b

…rkers is not implemented

Merge branch 'main' into cluster_wait

5439674

linting

ca778c7

Merge remote-tracking branch 'upstream/main' into cluster_wait

8f532e9

This test has been removed on main branch, but for some reason git me…

78985cc

…rge didn't remove it from mine

jacobtomlinson reviewed Jul 25, 2022

View reviewed changes

idorrington92 added 12 commits August 6, 2022 10:35

Cluster has to use scheduler_info attribute instead of scheduler.iden…

aad5d56

…tity

Merge remote-tracking branch 'upstream/main' into cluster_wait

b1ad4d5

lint

ef96549

reverting

e682058

Merge remote-tracking branch 'upstream/main' into cluster_wait

cdf2c0b

need to use cluster.scale when client.wait_for_workers is called whil…

45cf486

…e using a cluster

need to use scheduler_info. Also, using cluster.scale to emulate beha…

088e696

…viour of client.wait_for_workers

using scheduler_info and dont need to call scale anymore

4861151

lint

1d9705c

Merge remote-tracking branch 'upstream/main' into cluster_wait

97f4aa4

adding gen_test decorator

c3907b7

Don't think we need to scale at start of wait_for_workers

ac1dbc5

fjetter reviewed Jan 12, 2023

View reviewed changes

idorrington92 added 3 commits January 19, 2023 16:48

Merge branch 'main' of https://github.com/dask/distributed into clust…

5789bfd

…er_wait

Change try-catch to be an explicit if-else

000febc

Check explicitly for cluster is none, as I think it's clearer

3d573e5

idorrington92 added 3 commits January 19, 2023 18:30

linting

2b1713a

use scheduler_comm instead of opening new comm

8877a58

remove update_scheduler_info method

005dd69

jacobtomlinson reviewed Apr 6, 2023

View reviewed changes

idorrington92 added 2 commits April 10, 2023 16:00

Merge branch 'main' of https://github.com/dask/distributed into clust…

db9dc71

…er_wait

pre-commit changes

494d3fc

Merge branch 'main' of https://github.com/dask/distributed into clust…

21277ce

…er_wait

Reduce number of works to see if it fixes github tests

d277ea6

jacobtomlinson reviewed Apr 20, 2023

View reviewed changes

Changing test to make it work in python 3.8

c95aafd

idorrington92 closed this Apr 20, 2023

idorrington92 reopened this Apr 20, 2023

jacobtomlinson approved these changes Apr 20, 2023

View reviewed changes

jacobtomlinson merged commit 76bbfaf into dask:main Apr 20, 2023

TomAugspurger mentioned this pull request Dec 16, 2023

AttributeError: 'GatewayCluster' object has no attribute 'wait_for_workers' dask/dask-gateway#782

Closed

consideRatio mentioned this pull request Jan 4, 2024

Delegate wait_for_workers to cluster instances only when implemented #8441

Merged

2 tasks

consideRatio mentioned this pull request Jan 16, 2024

Decide on wait_for_workers implementation in client cluster object dask/dask-gateway#798

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cluster wait #6700

Cluster wait #6700

idorrington92 commented Jul 9, 2022 •

edited

Loading

GPUtester commented Jul 9, 2022

github-actions bot commented Jul 9, 2022 •

edited

Loading

quasiben commented Jul 9, 2022

idorrington92 commented Jul 12, 2022

quasiben commented Jul 12, 2022

jacobtomlinson left a comment

jacobtomlinson left a comment

idorrington92 commented Jul 25, 2022

jacobtomlinson commented Jul 25, 2022

idorrington92 commented Jul 25, 2022

mrocklin commented Dec 26, 2022

mrocklin commented Dec 26, 2022

fjetter Jan 12, 2023

idorrington92 Jan 19, 2023

fjetter Jan 20, 2023

idorrington92 Jan 20, 2023

fjetter Jan 12, 2023

idorrington92 Jan 19, 2023

idorrington92 commented Jan 19, 2023

idorrington92 commented Apr 5, 2023

jacobtomlinson left a comment

idorrington92 commented Apr 10, 2023

jacobtomlinson commented Apr 20, 2023

jacobtomlinson left a comment

jacobtomlinson Apr 20, 2023

idorrington92 Apr 20, 2023

idorrington92 Apr 20, 2023

jacobtomlinson left a comment

consideRatio commented Dec 31, 2023 •

edited

Loading

		await self._scheduler_info_comm.write({"op": "identity"})
		self.scheduler_info = SchedulerInfo(await self._scheduler_info_comm.read())

		# Most likely, either self.cluster is None, or the cluster has not
		# implemented a wait_for_workers method

Cluster wait #6700

Cluster wait #6700

Conversation

idorrington92 commented Jul 9, 2022 • edited Loading

GPUtester commented Jul 9, 2022

github-actions bot commented Jul 9, 2022 • edited Loading

Unit Test Results

quasiben commented Jul 9, 2022

idorrington92 commented Jul 12, 2022

quasiben commented Jul 12, 2022

jacobtomlinson left a comment

Choose a reason for hiding this comment

jacobtomlinson left a comment

Choose a reason for hiding this comment

idorrington92 commented Jul 25, 2022

jacobtomlinson commented Jul 25, 2022

idorrington92 commented Jul 25, 2022

mrocklin commented Dec 26, 2022

mrocklin commented Dec 26, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

idorrington92 commented Jan 19, 2023

idorrington92 commented Apr 5, 2023

jacobtomlinson left a comment

Choose a reason for hiding this comment

idorrington92 commented Apr 10, 2023

jacobtomlinson commented Apr 20, 2023

jacobtomlinson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jacobtomlinson left a comment

Choose a reason for hiding this comment

consideRatio commented Dec 31, 2023 • edited Loading

idorrington92 commented Jul 9, 2022 •

edited

Loading

github-actions bot commented Jul 9, 2022 •

edited

Loading

consideRatio commented Dec 31, 2023 •

edited

Loading