@Sparks0219 Sparks0219 commented Oct 21, 2025

Description

Make the NotifyGCSRestart RPC fault tolerant and idempotent. Several places in gcs_subscriber always returned Status::OK(), which made idempotency harder to reason about, and one of the resubscribe paths was dead code, so this PR also does a minor cleanup. Added a Python integration test to verify the retry behavior. Left out a C++ test because there is nothing to test on the raylet side: it just makes a gcs_client RPC call.
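As a rough illustration of why idempotency matters once the RPC can be retried, here is a minimal sketch, not Ray's actual code. `ToySubscriber`, `ToyRestartHandler`, and the channel strings are hypothetical names chosen for the example. It assumes a retried notification simply re-runs the handler, and that subscribe calls return void because they have no synchronous failure to report.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for a subscriber; NOT Ray's GcsSubscriber.
class ToySubscriber {
 public:
  // Returns void: the call only records intent, so there is no
  // synchronous failure for a Status return value to describe.
  void Subscribe(const std::string &channel) { channels_.insert(channel); }

  std::size_t NumSubscriptions() const { return channels_.size(); }

 private:
  // A set makes a repeated Subscribe() for the same channel a no-op.
  std::set<std::string> channels_;
};

// Hypothetical handler for a "GCS restarted" notification.
class ToyRestartHandler {
 public:
  // Safe to call more than once: a retried RPC re-runs this handler,
  // but the resulting subscription state is the same as after one call.
  void HandleNotifyRestart() {
    ++times_handled_;
    subscriber_.Subscribe("node_info");       // assumed channel name
    subscriber_.Subscribe("worker_failure");  // assumed channel name
  }

  int TimesHandled() const { return times_handled_; }
  std::size_t NumSubscriptions() const {
    return subscriber_.NumSubscriptions();
  }

 private:
  ToySubscriber subscriber_;
  int times_handled_ = 0;
};
```

Calling `HandleNotifyRestart()` twice on the same `ToyRestartHandler` leaves two subscriptions, the same as calling it once, which is the property a retryable RPC handler needs.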

@Sparks0219 Sparks0219 requested review from dayshah and edoakes October 21, 2025 07:25
@Sparks0219 Sparks0219 requested a review from a team as a code owner October 21, 2025 07:25
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively makes the NotifyGCSRestart RPC fault-tolerant by making it retryable, a key enhancement for GCS reliability. The changes also include a valuable refactoring of GCS subscriber and accessor methods to return void instead of a misleading Status, which clarifies their asynchronous nature. The PR further improves the codebase by removing dead code and adding a comprehensive Python integration test to validate the new fault-tolerant behavior. The implementation is solid and the changes significantly improve the robustness and clarity of the code.

Signed-off-by: joshlee <[email protected]>
});
}

void NodeResourceInfoAccessor::AsyncResubscribe() {

Contributor Author

This was dead code: these callbacks have never been set since the subscribe was removed a couple of years ago:
https://github.com/ray-project/ray/pull/24857/files

Signed-off-by: joshlee <[email protected]>
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 21, 2025
@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Oct 21, 2025
github-actions bot commented Nov 5, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 5, 2025
Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 7, 2025
Comment on lines 695 to 697
if (!status.ok()) {
  RAY_LOG(ERROR) << "NotifyGCSRestart failed: " << status;
}

Collaborator

Should this be a RAY_CHECK? IIUC, this gets retried internally, so a non-OK status would only come from the raylet response (which we never return).

Contributor Author

I don't think so. There's this section of code in the retryable gRPC client:

request->Fail(Status::Disconnected("GRPC client is shut down."));

where all pending requests are flushed with a Status::Disconnected when we receive a node death notification and trigger the destructor of the raylet client. So this RAY_CHECK could easily fire whenever a NotifyGCSRestart RPC is in flight or pending and we then receive the node death notification for the target node.

Contributor Author

@Sparks0219 Sparks0219 Nov 7, 2025

^ This is true for all RPCs on the retryable gRPC client, so we shouldn't use RAY_CHECK in their callbacks.
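To make the failure mode concrete, here is a minimal sketch of a client that fails every pending request with a disconnected status when it is destroyed. It is a toy model under stated assumptions, not Ray's retryable gRPC client: `ToyStatus`, `ToyStatusCode`, `ToyRetryableClient`, `Send`, and `CompleteAll` are all hypothetical names, and only the message string is taken from the snippet quoted above.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical status type; NOT ray::Status.
enum class ToyStatusCode { kOk, kDisconnected };

struct ToyStatus {
  ToyStatusCode code;
  std::string message;
  bool ok() const { return code == ToyStatusCode::kOk; }
};

// Hypothetical client that queues callbacks for in-flight requests.
class ToyRetryableClient {
 public:
  using Callback = std::function<void(const ToyStatus &)>;

  // Registers a request whose reply has not arrived yet.
  void Send(Callback callback) { pending_.push_back(std::move(callback)); }

  // Simulates successful replies for everything currently pending.
  void CompleteAll() { Drain(ToyStatus{ToyStatusCode::kOk, ""}); }

  // On destruction (e.g. after a node death notification), every request
  // still pending is failed rather than silently dropped.
  ~ToyRetryableClient() {
    Drain(ToyStatus{ToyStatusCode::kDisconnected,
                    "GRPC client is shut down."});
  }

 private:
  void Drain(const ToyStatus &status) {
    // Move the queue out first so callbacks cannot observe a half-drained list.
    std::vector<Callback> callbacks = std::move(pending_);
    pending_.clear();
    for (auto &callback : callbacks) {
      callback(status);
    }
  }

  std::vector<Callback> pending_;
};
```

A callback that asserted `status.ok()` would abort the process whenever a `ToyRetryableClient` goes out of scope with a request still pending, which is why a callback in this position should handle the non-OK case instead of checking against it.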

Collaborator

Oh that's great context. This log message should be improved then:

Suggested change
- if (!status.ok()) {
-   RAY_LOG(ERROR) << "NotifyGCSRestart failed: " << status;
- }
+ if (!status.ok()) {
+   RAY_LOG(WARNING) << "NotifyGCSRestart failed. This is expected if the target node has died. Status: " << status;
+ }

Specifically, note that it should not be an ERROR-level log if it's expected behavior.

Contributor Author

@Sparks0219 Sparks0219 Nov 7, 2025

ahhh good point didn't realize that distinction between error/warning logs thanks, done

@edoakes edoakes enabled auto-merge (squash) November 7, 2025 20:05
@edoakes edoakes disabled auto-merge November 7, 2025 20:45
Signed-off-by: joshlee <[email protected]>
@edoakes edoakes enabled auto-merge (squash) November 7, 2025 21:25
@edoakes edoakes merged commit 7eca669 into ray-project:master Nov 7, 2025
7 checks passed
@Sparks0219 Sparks0219 deleted the joshlee/make-notify-gcs-restart-fault-tolerant branch November 7, 2025 22:47
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <[email protected]>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
Signed-off-by: YK <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025