[core] Make Accessor Node Address and Liveliness Cache Thread Safe #58947
Conversation
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
```cpp
absl::flat_hash_map<NodeID, rpc::GcsNodeAddressAndLiveness>
NodeInfoAccessor::GetAllNodeAddressAndLiveness() const {
  std::lock_guard<std::mutex> lock(node_cache_address_and_liveness_mutex_);
  return node_cache_address_and_liveness_;
}
```
This is bad, but it's currently only used for FreeObjects
ray/src/ray/object_manager/object_manager.cc, line 648 (at cd0881e):

```cpp
const auto &node_info_map = gcs_client_.Nodes().GetAllNodeAddressAndLiveness();
```
This is quite a beefy copy on a big cluster per primary free... but yeah, we should prioritize fixing free objects
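For context, here is a minimal sketch of the pattern this PR adopts, assuming Ray's NodeID and rpc::GcsNodeAddressAndLiveness types are in scope from the usual Ray headers; the real NodeInfoAccessor has more state and methods than shown:

```cpp
#include <mutex>
#include <optional>

#include "absl/container/flat_hash_map.h"

// Sketch only: NodeID and rpc::GcsNodeAddressAndLiveness are Ray types,
// assumed available from Ray's common/proto headers.
class NodeInfoAccessorSketch {
 public:
  // Look up one node; the entry is copied while the lock is held, so the
  // mutex never escapes the accessor and the caller gets a stable snapshot.
  std::optional<rpc::GcsNodeAddressAndLiveness> GetNodeAddressAndLiveness(
      const NodeID &node_id) const {
    std::lock_guard<std::mutex> lock(node_cache_address_and_liveness_mutex_);
    auto it = node_cache_address_and_liveness_.find(node_id);
    if (it == node_cache_address_and_liveness_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

  // Return the whole cache by value; this is the "beefy copy" flagged above,
  // paid on every call (e.g. per primary free in FreeObjects).
  absl::flat_hash_map<NodeID, rpc::GcsNodeAddressAndLiveness>
  GetAllNodeAddressAndLiveness() const {
    std::lock_guard<std::mutex> lock(node_cache_address_and_liveness_mutex_);
    return node_cache_address_and_liveness_;
  }

 private:
  // mutable so the const getters above can take the lock.
  mutable std::mutex node_cache_address_and_liveness_mutex_;
  absl::flat_hash_map<NodeID, rpc::GcsNodeAddressAndLiveness>
      node_cache_address_and_liveness_;
};
```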
Code Review
This pull request addresses a potential thread-safety issue in the GCS node info accessor by introducing a mutex to protect the node_cache_address_and_liveness_ cache. The approach taken is sound and follows best practices for concurrent programming. The public API has been updated to return data by value (std::optional or a copy of the map) instead of by pointer or reference, which is a good design choice to encapsulate the locking mechanism and prevent unsafe access to the underlying cache. All related call sites, including mocks and tests, have been consistently updated to reflect these API changes. The implementation is clean, and the trade-off of performance for thread safety is well-justified by the author. Overall, this is a solid improvement to the codebase's robustness.
dayshah left a comment
It's pretty weird to have the diet one be thread safe while the big one isn't. Where is the big cache still used? Can we get rid of it, or have two more obviously different APIs?
I believe it's only used in the dashboard
Signed-off-by: joshlee <[email protected]>
…ssor-node-address-and-liveliness-cache-thread-safe
the dashboard builds its own cache lol. I looked for it, I think it's unused, it can go 🥳
@ZacAttack beat me to it and already has a PR up #58951 🐎
Signed-off-by: joshlee <[email protected]>
dayshah left a comment
the copy of all the node infos on every free-objects call is a little concerning, but yeah... we should fix it in general
src/ray/raylet/main.cc (outdated)

```diff
-    ... ray::NodeID::FromBinary(id.Binary())) != nullptr;
+    return gcs_client->Nodes()
+        .GetNodeAddressAndLiveness(ray::NodeID::FromBinary(id.Binary()))
+        .has_value();
```
this should just use IsNodeDead, one less copy that way
Good point, there are a couple of other places that use this exact same pattern, so I just changed them all to !IsNodeDead(...)
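A sketch of the before/after at one such call site (the variable names and surrounding context are illustrative, not from the diff):

```cpp
// Before: materializes a copy of the cached proto just to test presence.
bool is_alive =
    gcs_client->Nodes()
        .GetNodeAddressAndLiveness(ray::NodeID::FromBinary(id.Binary()))
        .has_value();

// After: IsNodeDead answers the liveness question directly, so no
// GcsNodeAddressAndLiveness copy is made.
bool is_alive_no_copy =
    !gcs_client->Nodes().IsNodeDead(ray::NodeID::FromBinary(id.Binary()));
```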
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
This PR was motivated by #58018, where methods of the GCS node info accessor may be called from the user's Python cancel thread, potentially causing thread-safety issues. I went with the trivial solution of adding a mutex around the node_cache_address_and_liveness_ cache. The one downside is that instead of returning pointers to the GcsNodeAddressAndLiveness objects in the cache, I now return them by value. I didn't want to allow access to the mutex that guards the cache outside of the accessor, since I think that sets a bad precedent and will create a mess.
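To illustrate why returning by value encapsulates the locking, here is a contrast under stated assumptions; the pointer-returning getter below is a hypothetical name standing in for the pre-PR API shape:

```cpp
// Hypothetical pre-PR shape: the returned pointer aliases the cache, so a
// caller on another thread can dereference it while the cache is being
// mutated, which is a data race.
const rpc::GcsNodeAddressAndLiveness *entry =
    accessor.GetNodeAddressAndLivenessPtr(node_id);  // hypothetical name

// Post-PR shape: the copy is taken while the accessor holds its internal
// mutex, so the caller's snapshot stays valid no matter what the cache
// does afterwards.
std::optional<rpc::GcsNodeAddressAndLiveness> snapshot =
    accessor.GetNodeAddressAndLiveness(node_id);
```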