Conversation

@Sparks0219
Contributor

@Sparks0219 Sparks0219 commented Nov 24, 2025

Briefly describe what this PR accomplishes and why it's needed.

This PR was motivated by #58018, where methods of the GCS node info accessor can be called from the user's Python cancel thread, which can cause thread-safety issues. I went with the trivial solution of adding a mutex that guards the node_cache_address_and_liveness_ cache. The one downside is that instead of returning pointers to the GcsNodeAddressAndLiveness objects in the cache, the accessor now returns them by value. I didn't want to expose the mutex that guards the cache outside of the accessor, since I think that sets a bad precedent and will create a mess.
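For readers skimming the diff, here is a minimal self-contained sketch of the locking pattern the description refers to. It is not the actual Ray accessor: the key type, the struct fields, std::unordered_map (the real code uses absl::flat_hash_map with NodeID keys), and the method bodies are simplified stand-ins.

```cpp
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Simplified stand-in for rpc::GcsNodeAddressAndLiveness.
struct NodeAddressAndLiveness {
  std::string ip_address;
  int port = 0;
  bool is_dead = false;
};

// Sketch of a node-info accessor whose cache is guarded by a mutex and whose
// getters return copies, so the lock never has to be exposed to callers.
class NodeInfoAccessorSketch {
 public:
  void UpsertNode(const std::string &node_id, NodeAddressAndLiveness info) {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_[node_id] = std::move(info);
  }

  // Returns a copy of one entry; std::nullopt if the node is unknown.
  std::optional<NodeAddressAndLiveness> GetNodeAddressAndLiveness(
      const std::string &node_id) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(node_id);
    if (it == cache_.end()) {
      return std::nullopt;
    }
    return it->second;  // copied out while the lock is held
  }

  // Returns a copy of the whole cache (see the FreeObjects discussion below).
  std::unordered_map<std::string, NodeAddressAndLiveness>
  GetAllNodeAddressAndLiveness() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_;
  }

  // Liveness check that never copies an entry out of the cache. The real
  // NodeInfoAccessor::IsNodeDead may treat an unknown node differently.
  bool IsNodeDead(const std::string &node_id) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(node_id);
    return it == cache_.end() || it->second.is_dead;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, NodeAddressAndLiveness> cache_;
};
```

Per-node lookups copy a single small struct, which is cheap; the full-map getter copies every entry, which is what the review comments below are about.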

@Sparks0219 Sparks0219 requested a review from a team as a code owner November 24, 2025 20:35
absl::flat_hash_map<NodeID, rpc::GcsNodeAddressAndLiveness>
NodeInfoAccessor::GetAllNodeAddressAndLiveness() const {
  // Returns the whole cache by value; the copy is made while the lock is held.
  std::lock_guard<std::mutex> lock(node_cache_address_and_liveness_mutex_);
  return node_cache_address_and_liveness_;
}
Contributor Author

This is bad, but it's currently only used for FreeObjects

const auto &node_info_map = gcs_client_.Nodes().GetAllNodeAddressAndLiveness();
which makes it somewhat less bad. Should be removed in the FreeObjects refactor which will eventually happen...

Contributor

@dayshah dayshah Nov 25, 2025

This is quite a beefy copy on a big cluster per primary free... but ya we should prioritize fixing free objects
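To make the copy concrete, here is a rough sketch of a FreeObjects-style call site against the simplified accessor from the sketch above. It is not the actual Ray FreeObjects code, and FreeObjectsSketch is a hypothetical name; binding the result to const auto & only extends the lifetime of the returned temporary, it does not avoid the copy.

```cpp
// Hypothetical caller in the spirit of the FreeObjects path quoted above.
void FreeObjectsSketch(const NodeInfoAccessorSketch &accessor) {
  // The accessor copies every cached entry under the lock before returning;
  // `const auto &` just keeps that temporary alive for the loop below.
  const auto &node_info_map = accessor.GetAllNodeAddressAndLiveness();
  for (const auto &[node_id, info] : node_info_map) {
    // ... use info.ip_address / info.port to contact the node ...
    (void)node_id;
    (void)info;
  }
}
```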

@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Nov 24, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential thread-safety issue in the GCS node info accessor by introducing a mutex to protect the node_cache_address_and_liveness_ cache. The approach taken is sound and follows best practices for concurrent programming. The public API has been updated to return data by value (std::optional or a copy of the map) instead of by pointer or reference, which is a good design choice to encapsulate the locking mechanism and prevent unsafe access to the underlying cache. All related call sites, including mocks and tests, have been consistently updated to reflect these API changes. The implementation is clean, and the trade-off of performance for thread safety is well-justified by the author. Overall, this is a solid improvement to the codebase's robustness.

Contributor

@dayshah dayshah left a comment

It's pretty weird to have the diet one be thread-safe while the big one isn't. Where is the big cache still used? Can we get rid of it or have two more obviously different APIs?

@edoakes
Collaborator

edoakes commented Nov 24, 2025

> It's pretty weird to have the diet one be thread-safe while the big one isn't. Where is the big cache still used? Can we get rid of it or have two more obviously different APIs?

I believe it's only used in the dashboard

Signed-off-by: joshlee <[email protected]>
…ssor-node-address-and-liveliness-cache-thread-safe
@Sparks0219 Sparks0219 requested a review from dayshah November 24, 2025 21:00
@dayshah
Contributor

dayshah commented Nov 24, 2025

> It's pretty weird to have the diet one be thread-safe while the big one isn't. Where is the big cache still used? Can we get rid of it or have two more obviously different APIs?

> I believe it's only used in the dashboard

The dashboard builds its own cache lol. I looked for it, I think it's unused, it can go 🥳

@Sparks0219
Contributor Author

> It's pretty weird to have the diet one be thread-safe while the big one isn't. Where is the big cache still used? Can we get rid of it or have two more obviously different APIs?

> I believe it's only used in the dashboard

> The dashboard builds its own cache lol. I looked for it, I think it's unused, it can go 🥳

@ZacAttack beat me to it and already has a PR up #58951 🐎

Signed-off-by: joshlee <[email protected]>
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Nov 25, 2025
Contributor

@dayshah dayshah left a comment

The copy of all the node infos on every FreeObjects call is a little concerning, but ya... we should fix it in general

ray::NodeID::FromBinary(id.Binary())) != nullptr;
return gcs_client->Nodes()
.GetNodeAddressAndLiveness(ray::NodeID::FromBinary(id.Binary()))
.has_value();
Contributor

this should just use IsNodeDead, one less copy that way

Contributor Author

Good point, there are a couple of other places that use this exact same pattern, so I just changed them all to !IsNodeDead(...)
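For clarity, a rough before/after of that pattern using the simplified sketch accessor from earlier. The real call sites go through gcs_client->Nodes() and ray::NodeID, the exact IsNodeDead signature may differ, and IsNodeAliveSketch is a hypothetical name.

```cpp
// Hypothetical liveness check at a call site.
bool IsNodeAliveSketch(const NodeInfoAccessorSketch &accessor,
                       const std::string &node_id) {
  // Before: copy the whole entry out of the cache just to test for presence.
  //   return accessor.GetNodeAddressAndLiveness(node_id).has_value();

  // After: ask the accessor directly; nothing is copied out of the cache.
  return !accessor.IsNodeDead(node_id);
}
```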

Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 requested a review from dayshah November 25, 2025 07:57
Signed-off-by: joshlee <[email protected]>
@edoakes edoakes enabled auto-merge (squash) November 25, 2025 23:26
@edoakes edoakes merged commit 9b217e9 into ray-project:master Nov 26, 2025
7 checks passed
KaisennHu pushed a commit to KaisennHu/ray that referenced this pull request Nov 26, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025