-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Fix the spilling back failure in case of node missing #19564
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general. Just a couple nits. How do you plan to verify this works btw? Will you just run the nightly test?
@@ -462,8 +462,8 @@ class NodeInfoAccessor { | |||
/// \param filter_dead_nodes Whether or not if this method will filter dead nodes. | |||
/// \return The item returned by GCS. If the item to read doesn't exist or the node is | |||
/// dead, this optional object is empty. | |||
virtual absl::optional<rpc::GcsNodeInfo> Get(const NodeID &node_id, | |||
bool filter_dead_nodes = true) const = 0; | |||
virtual const rpc::GcsNodeInfo *Get(const NodeID &node_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why we move back to the raw pointer? I feel like raw pointers are really easy to make a mistake because it is common to forget not to handle the return nullptr in the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for optional it's the same. For example, optional can also be nullopt right? So in either case you need to handle nullptr (or nullopt). I feel if we'd like to model return by value data, we should use optional, otherwise, pointer should also work since there is no difference.
The reason I make this ptr is that I'd like to avoid the copy of GcsNodeInfo. I check every node during scheduler, and seeing the copy here makes me feel a little bit bad. So I updated this one. optionalrcp::GcsNodeInfo& is not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one of the cons is that if we use the nullptr, the program will crash, but not for nullopt? (am I correct?)
But if the performance is the concern is makes sense. I remember the syntax to pass references to the optional was too complex and ugly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://en.cppreference.com/w/cpp/utility/optional/value From the doc it says, it'll throw an exception and we are not catching it, so it'll crash as well.
I change this not purely because of the performance, but just feel
- it's the same (both will crash).
- it might potentially have a performance issue (not tested).
It looks like only gain, so I just update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, makes sense. So technically having optional doesn't really have any meaning within Ray... good to know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me think if we should have an exception handler in the outermost file (main.cc) (and at least print it although we don't handle them). I assume there are some exceptions that happen but not printed because we don't catch them.
@@ -12,6 +12,7 @@ | |||
// See the License for the specific language governing permissions and | |||
// limitations under the License. | |||
|
|||
// clang-format off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because I need mock/ray/gcs/gcs_client.h
to be the last one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does include order matter here, is it because mock/ray/gcs/gcs_client.h
has incomplete include?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's mock type, so it assumes the context has all the necessary structure.
@@ -12,6 +12,7 @@ | |||
// See the License for the specific language governing permissions and | |||
// limitations under the License. | |||
|
|||
// clang-format off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does include order matter here, is it because mock/ray/gcs/gcs_client.h
has incomplete include?
best_node = iter->first; | ||
break; | ||
} | ||
++iter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be sure, dead nodes are pretty rare, right? Otherwise with always searching forward when seeing a dead node, the distribution will no longer be uniform which may or may not matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's rare. Otherwise, it should be a bug I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is a really good point. I think we will start doing chaos testing, so we should handle this better. If distribution is bad, it can actually cause memory issues easily I think. If it is not an easy fix, can we add a TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking about how to fix this and I can't come out with an easy, elegant algorithm for this :( any ideas? otherwise, I'll just add TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me first add todo for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about a way besides scanning all nodes once. Leaving a TODO seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might still prefer optional return, but if it is due to the perf reason, I have no strong preference (maybe we can follow up in the thread we are talking about it).
Why are these changes needed?
When ray spill back, it'll check whether the node exists or not through gcs, so there is a race condition and sometimes raylet crashes due to this.
This PR filter out the node that's not available when select the node.
Related issue number
#19438
Checks
scripts/format.sh
to lint the changes in this PR.