Conversation

@Sparks0219
Contributor

@Sparks0219 Sparks0219 commented Nov 24, 2025

Briefly describe what this PR accomplishes and why it's needed.

This PR was motivated by #58018, where methods of the GCS node info accessor can be called from the user's Python cancel thread, which can cause thread-safety issues. I went with the trivial solution of adding a mutex that guards the node_cache_address_and_liveness_ cache. The one downside is that instead of returning pointers to the GcsNodeAddressAndLiveness objects in the cache, the accessor now returns them by value. I didn't want to expose the mutex that guards the cache outside of the accessor, since I think that sets a bad precedent and will create a mess.
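For readers skimming the diff, here is a minimal self-contained sketch of the locking pattern the description refers to. It is not the actual Ray accessor: the key type, the struct fields, std::unordered_map (the real code uses absl::flat_hash_map with NodeID keys), and the method bodies are simplified stand-ins.

```cpp
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Simplified stand-in for rpc::GcsNodeAddressAndLiveness.
struct NodeAddressAndLiveness {
  std::string ip_address;
  int port = 0;
  bool is_dead = false;
};

// Sketch of a node-info accessor whose cache is guarded by a mutex and whose
// getters return copies, so the lock never has to be exposed to callers.
class NodeInfoAccessorSketch {
 public:
  void UpsertNode(const std::string &node_id, NodeAddressAndLiveness info) {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_[node_id] = std::move(info);
  }

  // Returns a copy of one entry; std::nullopt if the node is unknown.
  std::optional<NodeAddressAndLiveness> GetNodeAddressAndLiveness(
      const std::string &node_id) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(node_id);
    if (it == cache_.end()) {
      return std::nullopt;
    }
    return it->second;  // copied out while the lock is held
  }

  // Returns a copy of the whole cache (see the FreeObjects discussion below).
  std::unordered_map<std::string, NodeAddressAndLiveness>
  GetAllNodeAddressAndLiveness() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_;
  }

  // Liveness check that never copies an entry out of the cache. The real
  // NodeInfoAccessor::IsNodeDead may treat an unknown node differently.
  bool IsNodeDead(const std::string &node_id) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(node_id);
    return it == cache_.end() || it->second.is_dead;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, NodeAddressAndLiveness> cache_;
};
```

Per-node lookups copy a single small struct, which is cheap; the full-map getter copies every entry, which is what the review comments below are about.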

@Sparks0219 Sparks0219 requested a review from a team as a code owner November 24, 2025 20:35
absl::flat_hash_map<NodeID, rpc::GcsNodeAddressAndLiveness>
NodeInfoAccessor::GetAllNodeAddressAndLiveness() const {
  // Returns the whole cache by value; the copy is made while the lock is held.
  std::lock_guard<std::mutex> lock(node_cache_address_and_liveness_mutex_);
  return node_cache_address_and_liveness_;
}
Contributor Author

This is bad, but it's currently only used for FreeObjects

const auto &node_info_map = gcs_client_.Nodes().GetAllNodeAddressAndLiveness();
which makes it somewhat less bad. Should be removed in the FreeObjects refactor which will eventually happen...

Contributor

@dayshah dayshah Nov 25, 2025

This is quite a beefy copy on a big cluster per primary free... but ya we should prioritize fixing free objects
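To make the copy concrete, here is a rough sketch of a FreeObjects-style call site against the simplified accessor from the sketch above. It is not the actual Ray FreeObjects code, and FreeObjectsSketch is a hypothetical name; binding the result to const auto & only extends the lifetime of the returned temporary, it does not avoid the copy.

```cpp
// Hypothetical caller in the spirit of the FreeObjects path quoted above.
void FreeObjectsSketch(const NodeInfoAccessorSketch &accessor) {
  // The accessor copies every cached entry under the lock before returning;
  // `const auto &` just keeps that temporary alive for the loop below.
  const auto &node_info_map = accessor.GetAllNodeAddressAndLiveness();
  for (const auto &[node_id, info] : node_info_map) {
    // ... use info.ip_address / info.port to contact the node ...
    (void)node_id;
    (void)info;
  }
}
```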

@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Nov 24, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential thread-safety issue in the GCS node info accessor by introducing a mutex to protect the node_cache_address_and_liveness_ cache. The approach taken is sound and follows best practices for concurrent programming. The public API has been updated to return data by value (std::optional or a copy of the map) instead of by pointer or reference, which is a good design choice to encapsulate the locking mechanism and prevent unsafe access to the underlying cache. All related call sites, including mocks and tests, have been consistently updated to reflect these API changes. The implementation is clean, and the trade-off of performance for thread safety is well-justified by the author. Overall, this is a solid improvement to the codebase's robustness.

Contributor

@dayshah dayshah left a comment

It's pretty weird to have the diet one be thread-safe while the big one isn't. Where is the big cache still used? Can we get rid of it or have two more obviously different APIs?

@edoakes
Collaborator

edoakes commented Nov 24, 2025

> It's pretty weird to have the diet one be thread-safe while the big one isn't. Where is the big cache still used? Can we get rid of it or have two more obviously different APIs?

I believe it's only used in the dashboard

Signed-off-by: joshlee <[email protected]>
…ssor-node-address-and-liveliness-cache-thread-safe
@Sparks0219 Sparks0219 requested a review from dayshah November 24, 2025 21:00
@dayshah
Contributor

dayshah commented Nov 24, 2025

> It's pretty weird to have the diet one be thread-safe while the big one isn't. Where is the big cache still used? Can we get rid of it or have two more obviously different APIs?

> I believe it's only used in the dashboard

The dashboard builds its own cache lol. I looked for it, I think it's unused, it can go 🥳

@Sparks0219
Contributor Author

> It's pretty weird to have the diet one be thread-safe while the big one isn't. Where is the big cache still used? Can we get rid of it or have two more obviously different APIs?

> I believe it's only used in the dashboard

> The dashboard builds its own cache lol. I looked for it, I think it's unused, it can go 🥳

@ZacAttack beat me to it and already has a PR up #58951 🐎

Signed-off-by: joshlee <[email protected]>
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Nov 25, 2025
Contributor

@dayshah dayshah left a comment

The copy of all the node infos on every FreeObjects call is a little concerning, but ya... we should fix it in general

ray::NodeID::FromBinary(id.Binary())) != nullptr;
return gcs_client->Nodes()
.GetNodeAddressAndLiveness(ray::NodeID::FromBinary(id.Binary()))
.has_value();
Contributor

this should just use IsNodeDead, one less copy that way

Contributor Author

Good point, there are a couple of other places that use this exact same pattern, so I just changed them all to !IsNodeDead(...)
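For clarity, a rough before/after of that pattern using the simplified sketch accessor from earlier. The real call sites go through gcs_client->Nodes() and ray::NodeID, the exact IsNodeDead signature may differ, and IsNodeAliveSketch is a hypothetical name.

```cpp
// Hypothetical liveness check at a call site.
bool IsNodeAliveSketch(const NodeInfoAccessorSketch &accessor,
                       const std::string &node_id) {
  // Before: copy the whole entry out of the cache just to test for presence.
  //   return accessor.GetNodeAddressAndLiveness(node_id).has_value();

  // After: ask the accessor directly; nothing is copied out of the cache.
  return !accessor.IsNodeDead(node_id);
}
```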

Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 requested a review from dayshah November 25, 2025 07:57
Signed-off-by: joshlee <[email protected]>
@edoakes edoakes enabled auto-merge (squash) November 25, 2025 23:26
@edoakes edoakes merged commit 9b217e9 into ray-project:master Nov 26, 2025
7 checks passed
KaisennHu pushed a commit to KaisennHu/ray that referenced this pull request Nov 26, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025