-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] Make NotifyGCSRestart RPC Fault Tolerant #57945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Make NotifyGCSRestart RPC Fault Tolerant #57945
Conversation
Signed-off-by: joshlee <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively makes the NotifyGCSRestart RPC fault-tolerant by making it retryable, a key enhancement for GCS reliability. The changes also include a valuable refactoring of GCS subscriber and accessor methods to return void instead of a misleading Status, which clarifies their asynchronous nature. The PR further improves the codebase by removing dead code and adding a comprehensive Python integration test to validate the new fault-tolerant behavior. The implementation is solid and the changes significantly improve the robustness and clarity of the code.
Signed-off-by: joshlee <[email protected]>
| }); | ||
| } | ||
|
|
||
| void NodeResourceInfoAccessor::AsyncResubscribe() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was dead code, we never set these callbacks ever since the subscribe was removed a couple years ago:
https://github.com/ray-project/ray/pull/24857/files
Signed-off-by: joshlee <[email protected]>
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Signed-off-by: joshlee <[email protected]>
| if (!status.ok()) { | ||
| RAY_LOG(ERROR) << "NotifyGCSRestart failed: " << status; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be a RAY_CHECK? IIUC, this gets internally retried so non-OK would only come from raylet reponse (which we don't ever return)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, there's this section of code in the retryable grpc:
ray/src/ray/rpc/retryable_grpc_client.cc
Line 32 in 50ffca4
| request->Fail(Status::Disconnected("GRPC client is shut down.")); |
where all pending requests are flushed with a Status::Disconnected when we receive a node death notification and trigger the dtor for the raylet client. Hence this RAY_CHECK could be easily triggered as long as a NotifyGCSRestart RPC is in flight/pending and we then receive the node death notification for the target node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ This is true for all RPCs for the retryable grpc client, we shouldn't use RAY_CHECK in their callbacks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's great context. This log message should be improved then:
| if (!status.ok()) { | |
| RAY_LOG(ERROR) << "NotifyGCSRestart failed: " << status; | |
| } | |
| if (!status.ok()) { | |
| RAY_LOG(WARNING) << "NotifyGCSRestart failed. This is expected if the target node has died. Status: " << status; | |
| } |
Specifically, note that it should not be an ERROR-level log if it's expected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhh good point didn't realize that distinction between error/warning logs thanks, done
Signed-off-by: joshlee <[email protected]>
## Description > Briefly describe what this PR accomplishes and why it's needed. Making NotifyGCSRestart RPC Fault Tolerant and Idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber making idempotency harder to understand and there was dead code for one of the resubscribes, so did a minor clean up. Added a python integration test to verify retry behavior, left out the cpp test since on the raylet side there's nothing to test since its just making a gcs_client rpc call --------- Signed-off-by: joshlee <[email protected]>
## Description > Briefly describe what this PR accomplishes and why it's needed. Making NotifyGCSRestart RPC Fault Tolerant and Idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber making idempotency harder to understand and there was dead code for one of the resubscribes, so did a minor clean up. Added a python integration test to verify retry behavior, left out the cpp test since on the raylet side there's nothing to test since its just making a gcs_client rpc call --------- Signed-off-by: joshlee <[email protected]>
## Description > Briefly describe what this PR accomplishes and why it's needed. Making NotifyGCSRestart RPC Fault Tolerant and Idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber making idempotency harder to understand and there was dead code for one of the resubscribes, so did a minor clean up. Added a python integration test to verify retry behavior, left out the cpp test since on the raylet side there's nothing to test since its just making a gcs_client rpc call --------- Signed-off-by: joshlee <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
## Description > Briefly describe what this PR accomplishes and why it's needed. Making NotifyGCSRestart RPC Fault Tolerant and Idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber making idempotency harder to understand and there was dead code for one of the resubscribes, so did a minor clean up. Added a python integration test to verify retry behavior, left out the cpp test since on the raylet side there's nothing to test since its just making a gcs_client rpc call --------- Signed-off-by: joshlee <[email protected]> Signed-off-by: YK <[email protected]>
## Description > Briefly describe what this PR accomplishes and why it's needed. Making NotifyGCSRestart RPC Fault Tolerant and Idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber making idempotency harder to understand and there was dead code for one of the resubscribes, so did a minor clean up. Added a python integration test to verify retry behavior, left out the cpp test since on the raylet side there's nothing to test since its just making a gcs_client rpc call --------- Signed-off-by: joshlee <[email protected]>
Description
Making NotifyGCSRestart RPC Fault Tolerant and Idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber making idempotency harder to understand and there was dead code for one of the resubscribes, so did a minor clean up. Added a python integration test to verify retry behavior, left out the cpp test since on the raylet side there's nothing to test since its just making a gcs_client rpc call