ryanaoleary opened this issue on Feb 24, 2025 · 0 comments
Labels: bug (Something that is supposed to be working, but isn't), core (Issues that should be addressed in Ray Core), triage (Needs triage, e.g. priority, bug/not-bug, and owning component)
What happened + What you expected to happen
When the `replicas` or `maxReplicas` count of a RayCluster running with the V2 autoscaler is updated, it's currently possible for the autoscaler to error during reconciliation and fail to scale down due to `Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED`. This occurs when the k8s Pods are stuck in a Pending state, so the Ray nodes are `ALLOCATED` but not yet `RAY_INSTALLING` or `RAY_RUNNING`, and the V2 autoscaler attempts to enforce the new limit (`max number of worker nodes per type reached`). I believe the `ALLOCATED` Pods will eventually be deleted anyway once they are detected as stuck in that state for too long, but it would make sense for the autoscaler to support deleting pending instances that violate the max-nodes constraint, in order to avoid erroneously scaling up unneeded cloud resources.
Related PR (the added test case passes for the V1 autoscaler but fails for V2 due to the above behavior): ray-project/kuberay#2279
Versions / Dependencies
KubeRay v1.3.0
Ray nightly image
Reproduction script
Create a RayCluster with the v2 autoscaler enabled and a worker group that will fail to schedule (I added TPU nodeSelectors):
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  # Use the Ray nightly or Ray version >= 2.10.0 and KubeRay 1.1.0 or later for autoscaler v2.
  rayVersion: '2.41.0'
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    # Optionally specify the Autoscaler container's securityContext.
    securityContext: {}
    env: []
    envFrom: []
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  # Ray head pod template
  headGroupSpec:
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
    # Pod template
    template:
      spec:
        containers:
        # The Ray head container
        - name: ray-head
          image: rayproject/ray:2.41.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "1"
              memory: "2G"
          env:
          - name: RAY_enable_autoscaler_v2 # Pass env var for the autoscaler v2.
            value: "1"
          volumeMounts:
          - mountPath: /home/ray/samples
            name: ray-example-configmap
        volumes:
        - name: ray-example-configmap
          configMap:
            name: ray-example
            defaultMode: 0777
            items:
            - key: detached_actor.py
              path: detached_actor.py
            - key: terminate_detached_actor.py
              path: terminate_detached_actor.py
        restartPolicy: Never # No restart to avoid reuse of pod for different ray nodes.
  workerGroupSpecs:
  # the Pod replicas in this group typed worker
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    groupName: small-group
    rayStartParams: {}
    # Pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.41.0
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            requests:
              cpu: "1"
              memory: "1G"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
          cloud.google.com/gke-tpu-topology: 2x2x1
        restartPolicy: Never # Never restart a pod to avoid pod reuse
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray
    import sys
    @ray.remote(num_cpus=1, resources={"TPU": 4})
    class Actor:
      pass
    ray.init(namespace="default_namespace")
    Actor.options(name=sys.argv[1], lifetime="detached").remote()
  terminate_detached_actor.py: |
    import ray
    import sys
    ray.init(namespace="default_namespace")
    detached_actor = ray.get_actor(sys.argv[1])
    ray.kill(detached_actor)
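For completeness, the manifest above can be applied with the usual command (a sketch assuming it is saved locally as raycluster-autoscaler.yaml; the filename is just an assumption):

```bash
# Create the RayCluster and the example ConfigMap defined above.
kubectl apply -f raycluster-autoscaler.yaml
```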
kubectl get pods
NAME READY STATUS RESTARTS AGE
kuberay-operator-5c7f84f8bc-9dqfc 1/1 Running 0 3d21h
raycluster-autoscaler-head-llpsx 2/2 Running 0 2m43s
raycluster-autoscaler-small-group-worker-4z65n 0/1 Pending 0 1s
raycluster-autoscaler-small-group-worker-x7zkh 0/1 Pending 0 6s
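To confirm why the workers are stuck, the scheduling events can be inspected; the Pod name below is taken from the listing above:

```bash
# The Events section should show FailedScheduling because no node matches the TPU nodeSelector.
kubectl describe pod raycluster-autoscaler-small-group-worker-4z65n
```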
Set `maxReplicas` to 1 for the worker group and re-apply the RayCluster configuration.
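For example, instead of editing and re-applying the full manifest, the same change can be made with a JSON patch (a sketch assuming the cluster is in the current namespace and that small-group is the first entry in spec.workerGroupSpecs, as in the manifest above):

```bash
# Lower the worker group's maxReplicas from 10 to 1 in place.
kubectl patch raycluster raycluster-autoscaler --type json \
  -p '[{"op": "replace", "path": "/spec/workerGroupSpecs/0/maxReplicas", "value": 1}]'
```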
The autoscaler then fails to scale down the pending instances (which could lead to the cloud provider autoscaler attempting to provision unneeded resources) and instead errors with `AssertionError: Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED`:
2025-02-24 14:51:45,428 ERROR autoscaler.py:200 -- Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 185, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 119, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 274, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1168, in _scale_cluster
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 615, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 94, in update_instance_manager_state
    instance = self._update_instance(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 263, in _update_instance
    assert InstanceUtil.set_status(instance, update.new_instance_status), (
AssertionError: Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
2025-02-24 14:51:45,429 - WARNING - No autoscaling state to report.
2025-02-24 14:51:45,429 WARNING monitor.py:173 -- No autoscaling state to report.
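(The errors above are from the autoscaler sidecar on the Ray head Pod; they can be retrieved with something like the following, substituting the actual head Pod name. The sidecar container is typically named autoscaler, though that is an assumption about the default KubeRay setup.)

```bash
# Tail the v2 autoscaler logs from the sidecar container on the head Pod.
kubectl logs raycluster-autoscaler-head-llpsx -c autoscaler --tail=50
```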
Both Pods are still pending:
kubectl get pods
NAME READY STATUS RESTARTS AGE
kuberay-operator-5c7f84f8bc-9dqfc 1/1 Running 0 3d21h
raycluster-autoscaler-head-llpsx 2/2 Running 0 4m3s
raycluster-autoscaler-small-group-worker-4z65n 0/1 Pending 0 81s
raycluster-autoscaler-small-group-worker-x7zkh 0/1 Pending 0 86s
This same failure is seen in the test case PR here: ray-project/kuberay#2279
Issue Severity
None