
[autoscaler] Fix potential dead lock in local provider #49909

Open · wants to merge 1 commit from fix-on-prem-dead-lock into master

Conversation


@jyakaranda commented on Jan 17, 2025

Why are these changes needed?

Launching a Ray (version 2.35.0) cluster in on-prem mode sometimes hangs while starting worker nodes.

By attaching to the running monitor.py process on the head node,
I captured the following stack traces, which show several threads waiting on locks.

Thread 30 (Thread 0x7faafc9ff700 (LWP 2071927)):
Traceback (most recent call first):
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 101, in get
    with self.lock:
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 250, in set_node_tags
    info = self.state.get()[node_id]
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 592, in do_update
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 1190, in run
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()

Thread 29 (Thread 0x7faafd7ff700 (LWP 2071925)):
Traceback (most recent call first):
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 101, in get
    with self.lock:
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 226, in node_tags
    return self.state.get()[node_id]["tags"]
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1282, in _get_node_type_specific_fields
    node_tags = self.provider.node_tags(node_id)
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1612, in spawn_updater
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()

Thread 28 (Thread 0x7faafe5ff700 (LWP 2071924)):
Traceback (most recent call first):
  <built-in method sleep of module object at remote 0x7fab1a108630>
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/filelock/_api.py", line 344, in acquire
    time.sleep(poll_interval)
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/filelock/_api.py", line 376, in __enter__
    self.acquire()
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 102, in get
    with self.file_lock:
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/local/node_provider.py", line 226, in node_tags
    return self.state.get()[node_id]["tags"]
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1282, in _get_node_type_specific_fields
    node_tags = self.provider.node_tags(node_id)
  File "/home/test/miniconda/envs/test/lib/python3.9/site-packages/ray/autoscaler/_private/autoscaler.py", line 1612, in spawn_updater
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/test/miniconda/envs/test/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()

Analysis: A deadlock can occur when one thread is creating a node via create_node (holding state.file_lock) while another thread is reading worker info via state.get() (holding state.lock); each then blocks waiting for the lock the other holds.
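
To make the cycle concrete, here is a minimal, self-contained sketch (not Ray code): two plain threading.Lock objects stand in for state.lock and state.file_lock, and the hypothetical functions create_node_like / node_tags_like mimic the acquisition order seen in the stack traces above.

import threading
import time

# Stand-ins for ClusterState.lock (in-process lock) and ClusterState.file_lock
# (the on-disk FileLock); plain locks are enough to show the ordering problem.
lock = threading.Lock()
file_lock = threading.Lock()

def create_node_like():
    with file_lock:          # create_node takes file_lock first...
        time.sleep(0.1)
        with lock:           # ...then state.get() inside it needs lock
            pass

def node_tags_like():
    with lock:               # state.get() takes lock first...
        time.sleep(0.1)
        with file_lock:      # ...then needs file_lock
            pass

a = threading.Thread(target=create_node_like, daemon=True)
b = threading.Thread(target=node_tags_like, daemon=True)
a.start(); b.start()
a.join(timeout=2); b.join(timeout=2)
# Both threads are still alive: each holds one lock and waits on the other (AB/BA deadlock).
print("deadlocked:", a.is_alive() and b.is_alive())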

Solution: Acquire state.lock before state.file_lock, as ClusterState already does; keeping the lock ordering consistent prevents this deadlock.
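
A hedged sketch of the proposed ordering, continuing the stand-in names above (an illustration, not the actual diff): any path that needs both locks takes the in-process lock first and the file lock second, matching ClusterState.get()/put(), so the AB/BA cycle above cannot form.

def create_node_like_fixed():
    # Same order as ClusterState.get()/put(): in-process lock first,
    # file lock second, so no thread can hold file_lock while waiting for lock.
    with lock:
        with file_lock:
            pass             # read/modify/write the shared cluster state here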

Related issue number

  1. [Ray Cluster] Ray cluster stuck after using ray up #38718
  2. [ray local cluster] nodes marked as uninitialized #39565
  3. [OnPrem][ClusterLauncher] Only provisions head node #46204

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jyakaranda requested review from hongchaodeng and a team as code owners on January 17, 2025 06:43
@jyakaranda (Author)

@hongchaodeng @kevin85421 @jjyao PTAL

@jyakaranda commented on Jan 24, 2025

@hongchaodeng @kevin85421 @jjyao PTAL

ping~
(to avoid getting the stale label :)

@jyakaranda force-pushed the fix-on-prem-dead-lock branch from 5ac16ae to 2fece28 on February 5, 2025 02:24

@heliar-k commented on Feb 6, 2025

I have encountered the same deadlock. Thanks for this PR; I have fixed it temporarily with a patch.
