[core] (2/n) [Removing GCS Centralized Scheduling] Removing ClusterLeaseManager from GCS. #60008
base: irabbani/remove-centralized-actor-scheduling-1
Conversation
Signed-off-by: irabbani <[email protected]>
Code Review
This pull request is part of a larger effort to remove the dependency on ClusterLeaseManager from GCS components. The changes primarily involve removing ClusterLeaseManager from GcsActorScheduler, GcsResourceManager, and GcsServer, and updating their dependencies and logic accordingly. While most of the changes look correct and consistent with the refactoring goal, I've identified a few potential issues related to resource reporting for autoscaling, missing defensive checks, and logic for triggering global garbage collection. These might be regressions or temporary states in a work-in-progress PR, but are worth pointing out.
I am having trouble creating individual review comments, so my feedback is inlined below.
src/ray/gcs/gcs_resource_manager.cc (179-185)
The removal of this block means that pending actor resource demands are no longer included in the resource usage report. This could prevent the autoscaler from scaling up correctly for pending actors. Was this intentional? If this functionality is still needed, the pending actor information could be retrieved from GcsActorManager.
src/ray/gcs/gcs_server.cc (396)
This RAY_CHECK is removed. While the check for cluster_lease_manager_ is no longer needed, it would be safer to retain the check for cluster_resource_scheduler_ to ensure it's initialized before use.
RAY_CHECK(cluster_resource_scheduler_);
src/ray/gcs/gcs_server.cc (518)
This RAY_CHECK is removed. While the check for cluster_lease_manager_ is no longer needed, it would be safer to retain a check for gcs_resource_manager_.
RAY_CHECK(gcs_resource_manager_);
src/ray/gcs/gcs_server.cc (987-990)
This check for pending tasks is removed, which means TryGlobalGC might attempt to trigger a global GC even when there are no pending tasks. This seems inefficient. The check should be updated to use the new way of tracking pending actors, for example by checking gcs_actor_manager_->GetPendingActorsCount().
if (gcs_actor_manager_->GetPendingActorsCount() == 0) {
task_pending_schedule_detected_ = 0;
return;
}
Reviewing Gemini's comments
This is fine. There's nothing to include if you're not using the GCS as a global scheduler.
Fair enough. I reintroduced this for now. We have too many RAY_CHECKs with no indication of why they're important.
Nope. I think the original check was wrong because we don't use cluster_resource_manager_ inside InitGCSActorManager at all. It was probably a typo in the original code.
I'm not convinced. The original wasn't calling the GcsActorScheduler, but the ClusterLeaseManager directly. This means that the current code wasn't checking from the perspective of the GCS being the owner for Actors, but rather the scheduler.
…abbani/remove-centralized-actor-scheduling-2
Signed-off-by: irabbani <[email protected]>
void GcsServer::TryGlobalGC() {
  if (cluster_lease_manager_->GetPendingQueueSize() == 0) {
Global GC counter never resets, causing unnecessary GC triggers
Medium Severity
The TryGlobalGC function previously checked if cluster_lease_manager_->GetPendingQueueSize() == 0 and reset task_pending_schedule_detected_ to 0 when there were no pending tasks. With the ClusterLeaseManager removed, this check and reset are gone. The counter now increments on every call (every ~10 seconds by default) and never resets. After the first two calls, task_pending_schedule_detected_++ > 0 will always be true, causing global GC to be triggered whenever the throttler allows, regardless of whether there are actually pending tasks. This results in unnecessary global GC broadcasts across the cluster. The comment "To avoid spurious triggers" is now stale since spurious triggers can occur.
🔬 Verification Test
Why verification test was not possible: This bug involves runtime behavior in a distributed system (Ray GCS server). Testing would require spinning up a Ray cluster and observing GC trigger patterns over time, which is not feasible in a unit test context. The bug can be verified by code inspection: the removed lines (958-961 in the original) contained the counter reset logic task_pending_schedule_detected_ = 0; return; that is no longer present, while the increment task_pending_schedule_detected_++ at line 972 continues to execute on every call without any reset mechanism.
This is real. I'm deleting GlobalGC from the GCS completely in #60019. Keeping the PRs separate to make them easy to review.
+1 on combining
Dead code is a maintenance burden. Removing unused mocks. This came up as I was working on removing the ClusterLeaseManager from the GCS in #60008. Signed-off-by: irabbani <[email protected]>
"//src/ray/protobuf:gcs_service_cc_grpc",
"//src/ray/pubsub:gcs_publisher",
"//src/ray/pubsub:publisher",
# TODO(irabbani): Refactor a subset of scheduling into a shared targer
super nit: typo
size_t GcsActorManager::GetPendingActorsCount() const {
  return gcs_actor_scheduler_->GetPendingActorsCount() + pending_actors_.size();
  return pending_actors_.size();
  // return gcs_actor_scheduler_->GetPendingActorsCount() + pending_actors_.size();
remove comment
#include "ray/gcs/grpc_service_interfaces.h"
#include "ray/ray_syncer/ray_syncer.h"
#include "ray/raylet/scheduling/cluster_lease_manager.h"
// #include "ray/raylet/scheduling/cluster_lease_manager.h"
remove
Sparks0219 left a comment:
LGTM, thanks for doing this!
This PR stacks on #59979.
This is 2/N in a series of PRs to remove Centralized Actor Scheduling by the GCS (introduced in #15943). The feature is off by default and no longer in use or supported.
In this PR, I remove use of the ClusterLeaseManager (from the Raylet's scheduler) in the GCS.