
[autoscaler] Fix potential dead lock in local provider #49909

Open · wants to merge 1 commit from fix-on-prem-dead-lock into master

Conversation


@jyakaranda commented on Jan 17, 2025

Why are these changes needed?

Launching a Ray (version 2.35.0) cluster in on-prem mode sometimes hangs while starting worker nodes.

By attaching to the running monitor.py process on the head node,
I captured the following stack traces, which show several threads waiting on locks.

Thread 30 (Thread 0x7faafc9ff700 (LWP 2071927)):
Traceback (most recent call first):
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 101, in get
    with self.lock:
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 250, in set_node_tags
    info = self.state.get()[node_id]
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 592, in do_update
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 1190, in run
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()

Thread 29 (Thread 0x7faafd7ff700 (LWP 2071925)):
Traceback (most recent call first):
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 101, in get
    with self.lock:
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 226, in node_tags
    return self.state.get()[node_id]["tags"]
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1282, in _get_node_type_specific_fields
    node_tags = self.provider.node_tags(node_id)
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1612, in spawn_updater
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()

Thread 28 (Thread 0x7faafe5ff700 (LWP 2071924)):
Traceback (most recent call first):
  <built-in method sleep of module object at remote 0x7fab1a108630>
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/filelock/_api.py", line 344, in acquire
    time.sleep(poll_interval)
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/filelock/_api.py", line 376, in __enter__
    self.acquire()
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 102, in get
    with self.file_lock:
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 226, in node_tags
    return self.state.get()[node_id]["tags"]
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1282, in _get_node_type_specific_fields
    node_tags = self.provider.node_tags(node_id)
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1612, in spawn_updater
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()

Analysis: A deadlock can occur when one thread is creating a node via create_node (holding state.file_lock) while another thread is reading worker info via state.get() (holding state.lock); each then blocks waiting for the lock the other holds.
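
To make the cycle concrete, here is a minimal, self-contained sketch (not Ray code): two plain threading.Lock objects stand in for state.lock and state.file_lock, and the hypothetical functions create_node_like / node_tags_like mimic the acquisition order seen in the stack traces above.

import threading
import time

# Stand-ins for ClusterState.lock (in-process lock) and ClusterState.file_lock
# (the on-disk FileLock); plain locks are enough to show the ordering problem.
lock = threading.Lock()
file_lock = threading.Lock()

def create_node_like():
    with file_lock:          # create_node takes file_lock first...
        time.sleep(0.1)
        with lock:           # ...then state.get() inside it needs lock
            pass

def node_tags_like():
    with lock:               # state.get() takes lock first...
        time.sleep(0.1)
        with file_lock:      # ...then needs file_lock
            pass

a = threading.Thread(target=create_node_like, daemon=True)
b = threading.Thread(target=node_tags_like, daemon=True)
a.start(); b.start()
a.join(timeout=2); b.join(timeout=2)
# Both threads are still alive: each holds one lock and waits on the other (AB/BA deadlock).
print("deadlocked:", a.is_alive() and b.is_alive())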

Solution: Acquire state.lock before state.file_lock, as ClusterState already does; keeping the lock ordering consistent prevents this deadlock.
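
A hedged sketch of the proposed ordering, continuing the stand-in names above (an illustration, not the actual diff): any path that needs both locks takes the in-process lock first and the file lock second, matching ClusterState.get()/put(), so the AB/BA cycle above cannot form.

def create_node_like_fixed():
    # Same order as ClusterState.get()/put(): in-process lock first,
    # file lock second, so no thread can hold file_lock while waiting for lock.
    with lock:
        with file_lock:
            pass             # read/modify/write the shared cluster state here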

Related issue number

  1. [Ray Cluster] Ray cluster stuck after using ray up #38718
  2. [ray local cluster] nodes marked as uninitialized #39565
  3. [OnPrem][ClusterLauncher] Only provisions head node #46204

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jyakaranda requested review from hongchaodeng and a team as code owners on January 17, 2025 06:43
@jyakaranda (Author)

@hongchaodeng @kevin85421 @jjyao PTAL

@jyakaranda commented on Jan 24, 2025

@hongchaodeng @kevin85421 @jjyao PTAL

ping~
(to avoid getting the stale label :)

@jyakaranda force-pushed the fix-on-prem-dead-lock branch from 5ac16ae to 2fece28 on February 5, 2025 02:24

@heliar-k commented on Feb 6, 2025

I have encountered the same deadlock. Thanks for this PR; I have fixed it temporarily with a patch.
