@israbbani
Contributor

@israbbani israbbani commented Oct 14, 2025

This is 1/n in a series of PRs to fix #54007.

CoreWorker::Get has two parts

  • CoreWorkerMemoryStore::Get (to wait for objects to become ready anywhere in the cluster)
  • PlasmaStoreProvider::Get (to fetch objects into local plasma and return a ptr to shm).
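
The two phases can be pictured with a deliberately tiny, non-blocking model. This is only a sketch: `ToyMemoryStore`, `ToyPlasmaProvider`, and `ToyGet` are hypothetical names invented for illustration and do not mirror the real Ray classes or signatures.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for phase 1: knows which objects are ready somewhere.
struct ToyMemoryStore {
  std::unordered_set<std::string> ready;
  bool IsReady(const std::string &id) const { return ready.count(id) > 0; }
};

// Hypothetical stand-in for phase 2: objects already pulled into local "shm".
struct ToyPlasmaProvider {
  std::unordered_map<std::string, std::string> local;
  const std::string *Fetch(const std::string &id) const {
    auto it = local.find(id);
    return it == local.end() ? nullptr : &it->second;
  }
};

// Phase 1 checks readiness, phase 2 returns a pointer to the local copy.
// Returns nullptr if either phase cannot complete.
const std::string *ToyGet(const ToyMemoryStore &mem,
                          const ToyPlasmaProvider &plasma,
                          const std::string &id) {
  if (!mem.IsReady(id)) return nullptr;
  return plasma.Fetch(id);
}
```

The real phases block and can time out; the toy version only shows the ordering.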

The CoreWorker tries to implement cooperative scheduling by yielding CPU back to the raylet if it's blocked. The only time it does this in practice is when it calls CoreWorkerMemoryStore::Get. The rationale (as discussed in #12912) is that a blocked worker is not using any resources.

PlasmaStoreProvider::Get never notifies the raylet that it's blocked, so it never yields CPU, yet it still calls NotifyWorkerUnblocked. This is a bug. The call exists only to clean up an inflight or completed "Get" request from the worker.
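
To make the blocked/unblocked pairing concrete, here is a minimal sketch of a raylet-like resource tracker. `ToyRaylet` and its methods are hypothetical and not Ray's actual interface; the point is only that an "unblocked" notification with no preceding "blocked" has nothing to undo.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical toy raylet that tracks CPUs held by each worker.
class ToyRaylet {
 public:
  explicit ToyRaylet(int total_cpus) : free_cpus_(total_cpus) {}

  // Lease one CPU to a worker; fails if none are free.
  bool Lease(const std::string &worker) {
    if (free_cpus_ <= 0) return false;
    --free_cpus_;
    ++held_[worker];
    return true;
  }

  // The worker is blocked in a get: hand its CPU back so other work can run.
  void NotifyBlocked(const std::string &worker) {
    if (held_[worker] > 0 && !blocked_[worker]) {
      --held_[worker];
      ++free_cpus_;
      blocked_[worker] = true;
    }
  }

  // The worker resumed: take the CPU back. Returns false when no matching
  // NotifyBlocked was seen, i.e. the call had nothing to undo.
  // This toy lets free_cpus_ go negative to model temporary oversubscription.
  bool NotifyUnblocked(const std::string &worker) {
    if (!blocked_[worker]) return false;
    blocked_[worker] = false;
    --free_cpus_;
    ++held_[worker];
    return true;
  }

  int free_cpus() const { return free_cpus_; }

 private:
  int free_cpus_;
  std::unordered_map<std::string, int> held_;
  std::unordered_map<std::string, bool> blocked_;
};
```

In this model, a code path that never calls `NotifyBlocked` gains nothing from `NotifyUnblocked`; any cleanup it needs has to come from a different, purpose-built call.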

In this PR, I clean up PlasmaStoreProvider::Get so that it

  • No longer calls NotifyWorkerUnblocked conditionally (behind a convoluted check for whether we're executing a NORMAL_TASK on the main thread or an ACTOR_CREATION_TASK).
  • Instead calls CancelGetRequest on (almost) all exits from the function, because even a successful PlasmaStoreProvider::Get still needs to clean up the "Get" request on the raylet.
  • No longer takes unnecessary parameters.
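
One generic way to get "clean up on every exit" behavior in C++ is a scope guard. The sketch below is an illustration of that idea under assumed names (`ScopeGuard`, `ToyRayletClient`, `ToyPlasmaGet`, `ToyStatus` are all hypothetical); it is not how this PR is written, and it covers every exit rather than "(almost) all" of them.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Runs a callback on destruction, so cleanup happens on every return path.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeGuard() {
    if (fn_) fn_();
  }
  ScopeGuard(const ScopeGuard &) = delete;
  ScopeGuard &operator=(const ScopeGuard &) = delete;

 private:
  std::function<void()> fn_;
};

// Hypothetical stand-in for the raylet client; it only counts cancel calls.
struct ToyRayletClient {
  int cancel_calls = 0;
  void CancelGetRequest() { ++cancel_calls; }
};

enum class ToyStatus { kOk, kTimedOut, kObjectError };

// Toy get: whatever the outcome, the guard cancels the request on the way out.
ToyStatus ToyPlasmaGet(ToyRayletClient &client, bool object_ready,
                       bool got_exception) {
  ScopeGuard cleanup([&client] { client.CancelGetRequest(); });
  if (got_exception) return ToyStatus::kObjectError;
  if (!object_ready) return ToyStatus::kTimedOut;
  return ToyStatus::kOk;
}
```

The trade-off is that a guard hides the cleanup call, whereas explicit calls at each return site keep it visible in review and allow deliberate exceptions.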

…ecause it

never calls NotifyWorkerBlocked. Therefore, it doesn't have to release
resources and only needs to cancel inflight Pull requests.

Signed-off-by: irabbani <[email protected]>
@israbbani israbbani added the core (Issues that should be addressed in Ray Core) and go (add ONLY when ready to merge, run all tests) labels Oct 14, 2025
@israbbani israbbani changed the title from "[core] Removing NotifyWorkerUnblocked from PlasmaStoreProvider::Get" to "[core] (1/n ray.get) Removing NotifyWorkerUnblocked from PlasmaStoreProvider::Get" Oct 16, 2025
@israbbani israbbani marked this pull request as ready for review October 16, 2025 06:48
@israbbani israbbani requested a review from a team as a code owner October 16, 2025 06:48
Signed-off-by: irabbani <[email protected]>
@edoakes
Collaborator

edoakes commented Oct 16, 2025

> Instead calls CancelGetRequest on (almost) all exits from the function. This is because even if PlasmaStoreProvider::Get is successful, it still needs to clean up the "Get" request on the raylet.

Were you able to figure out why this is necessary?

// objects contain an exception, clean up the Get request in the raylet
// and early exit.
if (remaining.empty() || got_exception) {
  return raylet_ipc_client_->CancelGetRequest();
Collaborator

Note that there is a behavior difference here. This will now call CancelGetRequest from background threads in normal tasks where previously NotifyWorkerUnblocked was not called. This can exacerbate the "accidental cancelation" problem where one thread cancels outstanding get requests from others. I'm OK with this as an intermediate state since the next PR will presumably implement the get request ID logic.
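
The "accidental cancelation" concern can be illustrated with a toy tracker that supports both a worker-wide cancel and an ID-scoped cancel. Everything here is hypothetical (`ToyGetTracker`, its methods, and the idea of keying by worker and request ID are assumptions for illustration); the actual get request ID design is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <set>
#include <utility>

// Hypothetical model of a raylet's table of outstanding get requests.
class ToyGetTracker {
 public:
  typedef std::pair<int, std::int64_t> Key;  // (worker ID, request ID)

  // Register a get for a worker and return a fresh per-request ID.
  std::int64_t StartGet(int worker_id) {
    const std::int64_t request_id = next_id_++;
    outstanding_.insert(Key(worker_id, request_id));
    return request_id;
  }

  // Worker-wide cancel: drops every outstanding get from this worker,
  // including ones started by other threads of the same worker.
  void CancelAllForWorker(int worker_id) {
    outstanding_.erase(
        outstanding_.lower_bound(
            Key(worker_id, std::numeric_limits<std::int64_t>::min())),
        outstanding_.upper_bound(
            Key(worker_id, std::numeric_limits<std::int64_t>::max())));
  }

  // ID-scoped cancel: only the caller's own request is removed.
  void CancelOne(int worker_id, std::int64_t request_id) {
    outstanding_.erase(Key(worker_id, request_id));
  }

  // Number of gets still outstanding for a worker.
  std::size_t Outstanding(int worker_id) const {
    std::size_t count = 0;
    for (const Key &entry : outstanding_) {
      if (entry.first == worker_id) ++count;
    }
    return count;
  }

 private:
  std::set<Key> outstanding_;
  std::int64_t next_id_ = 0;
};
```

With only the worker-wide cancel, a background thread finishing its get would also drop the main thread's pending request; scoping the cancel to an ID avoids that in this model.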

int64_t timeout_ms,
const WorkerContext &ctx,
absl::flat_hash_map<ObjectID, std::shared_ptr<RayObject>> *results,
bool *got_exception);
Collaborator

got_exception was unused at the callsite? 🤔

return store_client_->GetExperimentalMutableObject(object_id, mutable_object);
}

Status UnblockIfNeeded(
Collaborator

can't describe how happy I am to see this in the red

Comment on lines +283 to +284
// objects contain an exception, clean up the Get request in the raylet
// and early exit.
Collaborator

Suggested change
- // objects contain an exception, clean up the Get request in the raylet
- // and early exit.
+ // objects contain an exception, cancel the Get request and exit early.

@edoakes edoakes merged commit 6539477 into master Oct 16, 2025
6 checks passed
@edoakes edoakes deleted the irabbani/ray-get-1 branch October 16, 2025 14:34
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
israbbani added a commit that referenced this pull request Oct 26, 2025
This reverts commit 3af4650.

Revert "[core] (1/n ray.get) Removing NotifyWorkerUnblocked from PlasmaStoreProvider::Get (#57691)"

This reverts commit 6539477.

Signed-off-by: irabbani <[email protected]>
israbbani added a commit that referenced this pull request Oct 26, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Labels

core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] Multi-threaded ray.get can hang in certain situations.

3 participants