
[Autoscaler][V2] Updating max replicas while Pods are pending causes v2 autoscaler to hang #50868

Open
ryanaoleary opened this issue Feb 24, 2025 · 0 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), triage (Needs triage, e.g. priority, bug/not-bug, and owning component)


What happened + What you expected to happen

When updating the replicas or maxReplicas count of a RayCluster running with the V2 autoscaler, the autoscaler can currently error during reconciliation and fail to scale down with Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED. This happens when the Kubernetes Pods are stuck in a Pending state, so the corresponding Ray instances are ALLOCATED but not yet RAY_INSTALLING or RAY_RUNNING, and the v2 autoscaler tries to enforce the new per-type max worker node limit by requesting a Ray stop on them. I believe the ALLOCATED instances will eventually be deleted anyway once they are detected to have been stuck in that state for too long, but it would make sense for the autoscaler to support deleting pending instances that violate max-node constraints directly, to avoid erroneously scaling up unneeded cloud resources.

Related PR (the added test case passes for the V1 autoscaler but fails for V2 due to the above behavior): ray-project/kuberay#2279
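
For illustration, here is a minimal, standalone sketch of the instance status state machine implied by the error. The status names are taken from the logs in the reproduction below, but the transition table and the set_status helper are simplified assumptions made for this example, not the actual ray.autoscaler.v2 implementation:

import enum


class Status(enum.Enum):
    QUEUED = "QUEUED"
    ALLOCATED = "ALLOCATED"            # Pod created but Ray not started (Pod may still be Pending)
    RAY_INSTALLING = "RAY_INSTALLING"
    RAY_RUNNING = "RAY_RUNNING"
    RAY_STOP_REQUESTED = "RAY_STOP_REQUESTED"
    TERMINATING = "TERMINATING"


# Simplified, assumed transition table. RAY_STOP_REQUESTED is only reachable from
# RAY_RUNNING, so a still-pending (ALLOCATED) instance cannot be drained that way.
VALID_TRANSITIONS = {
    Status.QUEUED: {Status.ALLOCATED},
    # ALLOCATED -> TERMINATING is the kind of direct deletion this issue proposes
    # for pending instances that violate the new maxReplicas constraint.
    Status.ALLOCATED: {Status.RAY_INSTALLING, Status.RAY_RUNNING, Status.TERMINATING},
    Status.RAY_INSTALLING: {Status.RAY_RUNNING, Status.TERMINATING},
    Status.RAY_RUNNING: {Status.RAY_STOP_REQUESTED, Status.TERMINATING},
    Status.RAY_STOP_REQUESTED: {Status.TERMINATING},
    Status.TERMINATING: set(),
}


def set_status(current: Status, new: Status) -> None:
    # Mirrors the assertion shown in the traceback below.
    assert new in VALID_TRANSITIONS[current], (
        f"Invalid status transition from {current.value} to {new.value}"
    )


set_status(Status.RAY_RUNNING, Status.RAY_STOP_REQUESTED)  # fine: a running node can be asked to stop
set_status(Status.ALLOCATED, Status.TERMINATING)           # fine: delete a pending instance directly
try:
    set_status(Status.ALLOCATED, Status.RAY_STOP_REQUESTED)  # what the reconciler attempts when maxReplicas shrinks
except AssertionError as e:
    print(e)  # Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED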

Versions / Dependencies

  • KubeRay v1.3.0
  • Ray nightly image

Reproduction script

  1. Create a RayCluster with the v2 autoscaler enabled and a worker group that will fail to schedule (I added TPU nodeSelectors):
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  # Use the Ray nightly or Ray version >= 2.10.0 and KubeRay 1.1.0 or later for autoscaler v2.
  rayVersion: '2.41.0'
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    # Optionally specify the Autoscaler container's securityContext.
    securityContext: {}
    env: []
    envFrom: []
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  # Ray head pod template
  headGroupSpec:
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
    # Pod template
    template:
      spec:
        containers:
        # The Ray head container
        - name: ray-head
          image: rayproject/ray:2.41.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "1"
              memory: "2G"
          env:
            - name: RAY_enable_autoscaler_v2 # Pass env var for the autoscaler v2.
              value: "1"
          volumeMounts:
            - mountPath: /home/ray/samples
              name: ray-example-configmap
        volumes:
          - name: ray-example-configmap
            configMap:
              name: ray-example
              defaultMode: 0777
              items:
                - key: detached_actor.py
                  path: detached_actor.py
                - key: terminate_detached_actor.py
                  path: terminate_detached_actor.py
        restartPolicy: Never # No restart to avoid reuse of pod for different ray nodes.
  workerGroupSpecs:
  # The number of Pod replicas in this worker group
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    groupName: small-group
    rayStartParams: {}
    # Pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.41.0
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            requests:
              cpu: "1"
              memory: "1G"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
          cloud.google.com/gke-tpu-topology: 2x2x1
        restartPolicy: Never # Never restart a pod to avoid pod reuse
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray
    import sys

    @ray.remote(num_cpus=1, resources={"TPU": 4})
    class Actor:
      pass

    ray.init(namespace="default_namespace")
    Actor.options(name=sys.argv[1], lifetime="detached").remote()

  terminate_detached_actor.py: |
    import ray
    import sys

    ray.init(namespace="default_namespace")
    detached_actor = ray.get_actor(sys.argv[1])
    ray.kill(detached_actor)
  2. Scale up two worker replicas by creating detached actors that request TPU resources:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)

kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor_1
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor_2
  3. Both worker Pods are Pending:
kubectl get pods
NAME                                             READY   STATUS    RESTARTS   AGE
kuberay-operator-5c7f84f8bc-9dqfc                1/1     Running   0          3d21h
raycluster-autoscaler-head-llpsx                 2/2     Running   0          2m43s
raycluster-autoscaler-small-group-worker-4z65n   0/1     Pending   0          1s
raycluster-autoscaler-small-group-worker-x7zkh   0/1     Pending   0          6s
  4. Set maxReplicas to 1 for the worker group and re-apply the RayCluster configuration.
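One way to apply the change in place (assuming the RayCluster from the manifest above; editing and re-applying the YAML works just as well):
kubectl patch raycluster raycluster-autoscaler --type=json \
  -p='[{"op": "replace", "path": "/spec/workerGroupSpecs/0/maxReplicas", "value": 1}]'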

  5. The autoscaler fails to scale down the pending instances, which could lead to the cloud provider autoscaler attempting to provision unneeded resources:

AssertionError: Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
2025-02-24 14:51:45,428 ERROR autoscaler.py:200 -- Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 185, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 119, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 274, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1168, in _scale_cluster
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 615, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 94, in update_instance_manager_state
    instance = self._update_instance(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 263, in _update_instance
    assert InstanceUtil.set_status(instance, update.new_instance_status), (
AssertionError: Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
2025-02-24 14:51:45,429 - WARNING - No autoscaling state to report.
2025-02-24 14:51:45,429 WARNING monitor.py:173 -- No autoscaling state to report.

Both Pods are still pending:

kubectl get pods
NAME                                             READY   STATUS    RESTARTS   AGE
kuberay-operator-5c7f84f8bc-9dqfc                1/1     Running   0          3d21h
raycluster-autoscaler-head-llpsx                 2/2     Running   0          4m3s
raycluster-autoscaler-small-group-worker-4z65n   0/1     Pending   0          81s
raycluster-autoscaler-small-group-worker-x7zkh   0/1     Pending   0          86s

This same failure is seen in the test case PR here: ray-project/kuberay#2279

Issue Severity

None

ryanaoleary added the bug and triage labels on Feb 24, 2025
jcotant1 added the core label on Feb 25, 2025