-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] (1/n ray.get) Removing NotifyWorkerUnblocked from PlasmaStoreProvider::Get #57691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ecause it never calls NotifyWorkerBlocked. Therefore, it doesn't have to release resources and only cancel inflight Pull requests. Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Were you able to figure out why this is necessary? |
| // objects contain an exception, clean up the Get request in the raylet | ||
| // and early exit. | ||
| if (remaining.empty() || got_exception) { | ||
| return raylet_ipc_client_->CancelGetRequest(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that there is a behavior difference here. This will now call CancelGetRequest from background threads in normal tasks where previously NotifyWorkerUnblocked was not called. This can exacerbate the "accidental cancelation" problem where one thread cancels outstanding get requests from others. I'm OK with this as an intermediate state since the next PR will presumably implement the get request ID logic.
| int64_t timeout_ms, | ||
| const WorkerContext &ctx, | ||
| absl::flat_hash_map<ObjectID, std::shared_ptr<RayObject>> *results, | ||
| bool *got_exception); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got_exception was unused at the callsite? 🤔
| return store_client_->GetExperimentalMutableObject(object_id, mutable_object); | ||
| } | ||
|
|
||
| Status UnblockIfNeeded( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't describe how happy I am to see this in the red
| // objects contain an exception, clean up the Get request in the raylet | ||
| // and early exit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // objects contain an exception, clean up the Get request in the raylet | |
| // and early exit. | |
| // objects contain an exception, cancel the Get request and exit early. |
…rovider::Get (ray-project#57691) This is 1/n in a series of PRs to fix ray-project#54007. CoreWorker::Get has two parts * CoreWorkerMemoryStore::Get (to wait for objects to become ready _anywhere_ in the cluster) * PlasmaStoreProvider::Get (to fetch objects into local plasma and return a ptr to shm). The CoreWorker tries to implement cooperative scheduling by yielding CPU back to the raylet if it's blocked. The only time does this in practice is when it called CoreWorkerMemoryStore::Get. The rational (as discussed in ray-project#12912) is that the worker is not using any resources. PlasmaStoreProvider::Get does not yield CPU by notifying the raylet that it's blocked, but instead calls NotifyWorkerUnblocked. This is a bug. It does this to clean up an inflight or completed "Get" request from the worker. In this PR, I clean up PlasmaStoreProvider::Get so it * No longer calls NotifyWorkerUnblocked sometimes (with some convoluted checking to see if we're executing a NORMAL_TASK on the main thread or an ACTOR_CREATION_TASK). * Instead calls CancelGetRequest on (almost) all exits from the function. This is because even if PlasmaStoreProvider::Get is successful, it still needs to clean up the "Get" request on the raylet. * Removes unnecessary parameters. --------- Signed-off-by: irabbani <[email protected]>
…rovider::Get (ray-project#57691) This is 1/n in a series of PRs to fix ray-project#54007. CoreWorker::Get has two parts * CoreWorkerMemoryStore::Get (to wait for objects to become ready _anywhere_ in the cluster) * PlasmaStoreProvider::Get (to fetch objects into local plasma and return a ptr to shm). The CoreWorker tries to implement cooperative scheduling by yielding CPU back to the raylet if it's blocked. The only time does this in practice is when it called CoreWorkerMemoryStore::Get. The rational (as discussed in ray-project#12912) is that the worker is not using any resources. PlasmaStoreProvider::Get does not yield CPU by notifying the raylet that it's blocked, but instead calls NotifyWorkerUnblocked. This is a bug. It does this to clean up an inflight or completed "Get" request from the worker. In this PR, I clean up PlasmaStoreProvider::Get so it * No longer calls NotifyWorkerUnblocked sometimes (with some convoluted checking to see if we're executing a NORMAL_TASK on the main thread or an ACTOR_CREATION_TASK). * Instead calls CancelGetRequest on (almost) all exits from the function. This is because even if PlasmaStoreProvider::Get is successful, it still needs to clean up the "Get" request on the raylet. * Removes unnecessary parameters. --------- Signed-off-by: irabbani <[email protected]> Signed-off-by: xgui <[email protected]>
…rovider::Get (#57691) This is 1/n in a series of PRs to fix #54007. CoreWorker::Get has two parts * CoreWorkerMemoryStore::Get (to wait for objects to become ready _anywhere_ in the cluster) * PlasmaStoreProvider::Get (to fetch objects into local plasma and return a ptr to shm). The CoreWorker tries to implement cooperative scheduling by yielding CPU back to the raylet if it's blocked. The only time does this in practice is when it called CoreWorkerMemoryStore::Get. The rational (as discussed in #12912) is that the worker is not using any resources. PlasmaStoreProvider::Get does not yield CPU by notifying the raylet that it's blocked, but instead calls NotifyWorkerUnblocked. This is a bug. It does this to clean up an inflight or completed "Get" request from the worker. In this PR, I clean up PlasmaStoreProvider::Get so it * No longer calls NotifyWorkerUnblocked sometimes (with some convoluted checking to see if we're executing a NORMAL_TASK on the main thread or an ACTOR_CREATION_TASK). * Instead calls CancelGetRequest on (almost) all exits from the function. This is because even if PlasmaStoreProvider::Get is successful, it still needs to clean up the "Get" request on the raylet. * Removes unnecessary parameters. --------- Signed-off-by: irabbani <[email protected]> Signed-off-by: elliot-barn <[email protected]>
This reverts commit 3af4650. Revert "[core] (1/n ray.get) Removing NotifyWorkerUnblocked from PlasmaStoreProvider::Get (#57691)" This reverts commit 6539477. Signed-off-by: irabbani <[email protected]>
This reverts commit 3af4650. Revert "[core] (1/n ray.get) Removing NotifyWorkerUnblocked from PlasmaStoreProvider::Get (#57691)" This reverts commit 6539477. Signed-off-by: irabbani <[email protected]>
…rovider::Get (ray-project#57691) This is 1/n in a series of PRs to fix ray-project#54007. CoreWorker::Get has two parts * CoreWorkerMemoryStore::Get (to wait for objects to become ready _anywhere_ in the cluster) * PlasmaStoreProvider::Get (to fetch objects into local plasma and return a ptr to shm). The CoreWorker tries to implement cooperative scheduling by yielding CPU back to the raylet if it's blocked. The only time does this in practice is when it called CoreWorkerMemoryStore::Get. The rational (as discussed in ray-project#12912) is that the worker is not using any resources. PlasmaStoreProvider::Get does not yield CPU by notifying the raylet that it's blocked, but instead calls NotifyWorkerUnblocked. This is a bug. It does this to clean up an inflight or completed "Get" request from the worker. In this PR, I clean up PlasmaStoreProvider::Get so it * No longer calls NotifyWorkerUnblocked sometimes (with some convoluted checking to see if we're executing a NORMAL_TASK on the main thread or an ACTOR_CREATION_TASK). * Instead calls CancelGetRequest on (almost) all exits from the function. This is because even if PlasmaStoreProvider::Get is successful, it still needs to clean up the "Get" request on the raylet. * Removes unnecessary parameters. --------- Signed-off-by: irabbani <[email protected]>
…rovider::Get (ray-project#57691) This is 1/n in a series of PRs to fix ray-project#54007. CoreWorker::Get has two parts * CoreWorkerMemoryStore::Get (to wait for objects to become ready _anywhere_ in the cluster) * PlasmaStoreProvider::Get (to fetch objects into local plasma and return a ptr to shm). The CoreWorker tries to implement cooperative scheduling by yielding CPU back to the raylet if it's blocked. The only time does this in practice is when it called CoreWorkerMemoryStore::Get. The rational (as discussed in ray-project#12912) is that the worker is not using any resources. PlasmaStoreProvider::Get does not yield CPU by notifying the raylet that it's blocked, but instead calls NotifyWorkerUnblocked. This is a bug. It does this to clean up an inflight or completed "Get" request from the worker. In this PR, I clean up PlasmaStoreProvider::Get so it * No longer calls NotifyWorkerUnblocked sometimes (with some convoluted checking to see if we're executing a NORMAL_TASK on the main thread or an ACTOR_CREATION_TASK). * Instead calls CancelGetRequest on (almost) all exits from the function. This is because even if PlasmaStoreProvider::Get is successful, it still needs to clean up the "Get" request on the raylet. * Removes unnecessary parameters. --------- Signed-off-by: irabbani <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
This is 1/n in a series of PRs to fix #54007.
CoreWorker::Get has two parts
The CoreWorker tries to implement cooperative scheduling by yielding CPU back to the raylet if it's blocked. The only time does this in practice is when it called CoreWorkerMemoryStore::Get. The rational (as discussed in #12912) is that the worker is not using any resources.
PlasmaStoreProvider::Get does not yield CPU by notifying the raylet that it's blocked, but instead calls NotifyWorkerUnblocked. This is a bug. It does this to clean up an inflight or completed "Get" request from the worker.
In this PR, I clean up PlasmaStoreProvider::Get so it