
Conversation

@israbbani (Contributor) commented Jan 9, 2026

This PR stacks on #59979.

This is 2/N in a series of PRs to remove Centralized Actor Scheduling by the GCS (introduced in #15943). The feature is off by default and no longer in use or supported.

In this PR, I remove the GCS's use of the ClusterLeaseManager (a component of the Raylet's scheduler).

@gemini-code-assist (bot) left a comment

Code Review

This pull request is part of a larger effort to remove the dependency on ClusterLeaseManager from GCS components. The changes primarily involve removing ClusterLeaseManager from GcsActorScheduler, GcsResourceManager, and GcsServer, and updating their dependencies and logic accordingly. While most of the changes look correct and consistent with the refactoring goal, I've identified a few potential issues related to resource reporting for autoscaling, missing defensive checks, and logic for triggering global garbage collection. These might be regressions or temporary states in a work-in-progress PR, but are worth pointing out.

I am having trouble creating individual review comments, so my feedback is listed below.

src/ray/gcs/gcs_resource_manager.cc (179-185) [severity: high]

The removal of this block means that pending actor resource demands are no longer included in the resource usage report. This could prevent the autoscaler from scaling up correctly for pending actors. Was this intentional? If this functionality is still needed, the pending actor information could be retrieved from GcsActorManager.
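For illustration, a minimal sketch of the alternative suggested above: deriving pending-actor demand from the actor manager's own bookkeeping rather than from the removed scheduler. All types and names below are hypothetical stand-ins, not Ray's real classes.

  #include <map>
  #include <string>
  #include <vector>

  // Hypothetical stand-ins; the real GcsActorManager and report types differ.
  using ResourceSet = std::map<std::string, double>;

  struct PendingActor {
    ResourceSet required_resources;  // resources the actor needs to be placed
  };

  struct UsageReport {
    std::vector<ResourceSet> resource_load;  // demand read by the autoscaler
  };

  // Fold each pending actor's requirements into the usage report so the
  // autoscaler still sees demand for actors that are awaiting placement.
  void AppendPendingActorDemand(const std::vector<PendingActor> &pending_actors,
                                UsageReport *report) {
    for (const auto &actor : pending_actors) {
      report->resource_load.push_back(actor.required_resources);
    }
  }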

src/ray/gcs/gcs_server.cc (396) [severity: medium]

This RAY_CHECK is removed. While the check for cluster_lease_manager_ is no longer needed, it would be safer to retain the check for cluster_resource_scheduler_ to ensure it's initialized before use.

  RAY_CHECK(cluster_resource_scheduler_);

src/ray/gcs/gcs_server.cc (518) [severity: medium]

This RAY_CHECK is removed. While the check for cluster_lease_manager_ is no longer needed, it would be safer to retain a check for gcs_resource_manager_.

  RAY_CHECK(gcs_resource_manager_);

src/ray/gcs/gcs_server.cc (987-990) [severity: medium]

This check for pending tasks is removed, which means TryGlobalGC might attempt to trigger a global GC even when there are no pending tasks. This seems inefficient. The check should be updated to use the new way of tracking pending actors, for example by checking gcs_actor_manager_->GetPendingActorsCount().

  if (gcs_actor_manager_->GetPendingActorsCount() == 0) {
    task_pending_schedule_detected_ = 0;
    return;
  }

@israbbani (Contributor, Author) commented Jan 9, 2026

Reviewing Gemini's comments:

src/ray/gcs/gcs_resource_manager.cc (179-185)

This is fine. There's nothing to include if you're not using the GCS as a global scheduler.

src/ray/gcs/gcs_server.cc (396)

Fair enough. I reintroduced this for now. We have too many RAY_CHECKs with no indication of why they're important.

src/ray/gcs/gcs_server.cc (518)

Nope. I think the original check was wrong because we don't use cluster_resource_manager_ inside InitGCSActorManager at all. It was probably a typo in the original code.

src/ray/gcs/gcs_server.cc (987-990)

I'm not convinced. The original code wasn't calling the GcsActorScheduler but the ClusterLeaseManager directly. This means the current code wasn't checking for pending work from the perspective of the GCS as the owner of actors, but from the perspective of the scheduler.

@israbbani closed this Jan 9, 2026
@israbbani reopened this Jan 9, 2026
…abbani/remove-centralized-actor-scheduling-2
@israbbani changed the title from "[core] (2/n) Removing GCS's dependency on ClusterLeaseManager." to "[core] (2/n) [Removing GCS Scheduling] Removing ClusterLeaseManager from GCS." Jan 9, 2026
@israbbani marked this pull request as ready for review January 9, 2026 21:57
@israbbani requested a review from a team as a code owner January 9, 2026 21:57
@israbbani added the core (Issues that should be addressed in Ray Core) and go (add ONLY when ready to merge, run all tests) labels Jan 9, 2026
@israbbani changed the title from "[core] (2/n) [Removing GCS Scheduling] Removing ClusterLeaseManager from GCS." to "[core] (2/n) [Removing GCS Centralized Scheduling] Removing ClusterLeaseManager from GCS." Jan 9, 2026
}

void GcsServer::TryGlobalGC() {
  if (cluster_lease_manager_->GetPendingQueueSize() == 0) {

Global GC counter never resets, causing unnecessary GC triggers [severity: medium]

The TryGlobalGC function previously checked if cluster_lease_manager_->GetPendingQueueSize() == 0 and reset task_pending_schedule_detected_ to 0 when there were no pending tasks. With the ClusterLeaseManager removed, this check and reset are gone. The counter now increments on every call (every ~10 seconds by default) and never resets. After the first two calls, task_pending_schedule_detected_++ > 0 will always be true, causing global GC to be triggered whenever the throttler allows, regardless of whether there are actually pending tasks. This results in unnecessary global GC broadcasts across the cluster. The comment "To avoid spurious triggers" is now stale since spurious triggers can occur.

🔬 Verification Test

Why verification test was not possible: This bug involves runtime behavior in a distributed system (Ray GCS server). Testing would require spinning up a Ray cluster and observing GC trigger patterns over time, which is not feasible in a unit test context. The bug can be verified by code inspection: the removed lines (958-961 in the original) contained the counter reset logic task_pending_schedule_detected_ = 0; return; that is no longer present, while the increment task_pending_schedule_detected_++ at line 972 continues to execute on every call without any reset mechanism.
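To make the failure mode concrete, here is a minimal standalone model of the post-removal control flow. The member names mirror the thread, but this is a simplified sketch, not the actual GcsServer code; the throttler and call cadence are assumed.

  #include <cstdint>
  #include <iostream>

  // Simplified model only; not the real GcsServer.
  struct GcsServerModel {
    uint64_t task_pending_schedule_detected_ = 0;

    bool ThrottlerAllows() const { return true; }  // assume the throttler permits

    bool TryGlobalGC() {
      // With the pending-queue check (and its counter reset) removed,
      // nothing ever zeroes the counter, so from the second call onward
      // the condition below is always true.
      if (task_pending_schedule_detected_++ > 0 && ThrottlerAllows()) {
        return true;  // a global GC broadcast would fire here
      }
      return false;
    }
  };

  int main() {
    GcsServerModel gcs;
    for (int i = 0; i < 5; ++i) {
      std::cout << "call " << i << ": triggered=" << gcs.TryGlobalGC() << "\n";
    }
    // Prints triggered=0 once, then triggered=1 on every later call: a
    // spurious global GC each ~10s cycle, whether or not work is pending.
  }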


@israbbani (Contributor, Author) replied:

This is real. I'm deleting GlobalGC from the GCS completely in #60019. Keeping the PRs separate to make them easy to review.

@israbbani (Contributor, Author) replied:

@edoakes on second thought, it's a bad idea to separate these because GC will now run more frequently until #60019 is merged. Can we merge both together? I can combine them too.

A contributor replied:

+1 on combining

edoakes pushed a commit that referenced this pull request Jan 9, 2026
Dead code is a maintenance burden. Removing unused mocks. 

This came up as I was working on removing the ClusterLeaseManager from
the GCS in #60008.

Signed-off-by: irabbani <[email protected]>
"//src/ray/protobuf:gcs_service_cc_grpc",
"//src/ray/pubsub:gcs_publisher",
"//src/ray/pubsub:publisher",
# TODO(irabbani): Refactor a subset of scheduling into a shared targer
A contributor commented:

super nit: typo

size_t GcsActorManager::GetPendingActorsCount() const {
-  return gcs_actor_scheduler_->GetPendingActorsCount() + pending_actors_.size();
+  return pending_actors_.size();
+  // return gcs_actor_scheduler_->GetPendingActorsCount() + pending_actors_.size();
A contributor commented:

remove comment

#include "ray/gcs/grpc_service_interfaces.h"
#include "ray/ray_syncer/ray_syncer.h"
#include "ray/raylet/scheduling/cluster_lease_manager.h"
// #include "ray/raylet/scheduling/cluster_lease_manager.h"
A contributor commented:

remove

@Sparks0219 (Contributor) left a comment:

LGTM, thanks for doing this!
