Add Fake TPU e2e Autoscaling Test Cases #2279

Open
wants to merge 11 commits into base: master
Conversation


@ryanaoleary ryanaoleary commented Jul 31, 2024

Why are these changes needed?

This PR adds a fake TPU test case, similar to the existing fake GPU autoscaling test case, that uses detached actors to verify that single-host autoscaling behaves as expected. The behaviors tested include:

  • (1) Creating a detached actor that requests resources: {"TPU": 4} will scale up a Ray TPU worker (see the sketch below)
  • (2) For a worker group, the number of workers created should equal replicas * numOfHosts
  • (3) Terminating the detached actors scheduled on a TPU worker group replica will cause the entire replica to be scaled down

Edit: Removed the test behavior for idle nodes being scaled down, since the KubeRay TPU Pods are unschedulable due to their required nodeSelectors.
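
For reference, a minimal sketch of the detached-actor pattern these tests exercise, assuming a cluster whose worker group advertises fake TPU resources; the actor class, actor name, and namespace below are illustrative, not taken from the test code:

```python
import ray

# Connect to the running RayCluster; address and namespace are illustrative.
ray.init(address="auto", namespace="autoscaling-test")

# Requesting {"TPU": 4} creates resource demand the autoscaler can only
# satisfy by scaling up a TPU worker.
@ray.remote(resources={"TPU": 4})
class FakeTPUActor:
    def ready(self) -> bool:
        return True

# A detached actor outlives the driver, so the demand persists until the
# actor is explicitly killed. For a worker group, the expected Pod count
# after scale-up is replicas * numOfHosts.
actor = FakeTPUActor.options(name="fake-tpu-actor", lifetime="detached").remote()
ray.get(actor.ready.remote())

# Killing the detached actor removes the demand, which should cause the
# autoscaler to scale the worker group replica back down.
ray.kill(actor)
```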

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@ryanaoleary
Contributor Author

cc: @kevin85421

@kevin85421 kevin85421 self-assigned this Aug 12, 2024
@kevin85421
Member

I plan to include this PR in v1.3.0 instead.

Fix tests for updated format

Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary
Contributor Author

Related Issue:
#2561

@ryanaoleary
Contributor Author

I'm moving the multi-host test case to a separate PR based on offline discussion. For the V2 autoscaler only, I'm seeing issues with maxReplicas being enforced (the V1 test case passes, but V2 never deletes the Pod). I'll open an issue with reproduction steps to see if I can root-cause it.

--- FAIL: TestRayClusterAutoscalerWithFakeSingleHostTPU (344.39s)
[2025-01-23T18:58:30Z]     --- PASS: TestRayClusterAutoscalerWithFakeSingleHostTPU/Create_a_RayCluster_with_autoscaling_enabled (26.63s)
[2025-01-23T18:58:30Z]     --- FAIL: TestRayClusterAutoscalerWithFakeSingleHostTPU/Create_a_RayCluster_with_autoscaler_v2_enabled (317.64s)
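
For context, the expectation being checked is that the worker group's Pod count stays within its minReplicas/maxReplicas bounds; a rough illustration of that arithmetic (the helper name and example values are hypothetical, not from the test code):

```python
def expected_worker_pods(demanded_replicas: int,
                         min_replicas: int,
                         max_replicas: int,
                         num_of_hosts: int) -> int:
    """Replicas are clamped to [minReplicas, maxReplicas]; each replica
    brings up numOfHosts worker Pods."""
    clamped = min(max(demanded_replicas, min_replicas), max_replicas)
    return clamped * num_of_hosts

# e.g. with maxReplicas=1 and numOfHosts=1 the group should never exceed
# one Pod, and should fall back to minReplicas Pods after scale-down.
assert expected_worker_pods(3, min_replicas=0, max_replicas=1, num_of_hosts=1) == 1
assert expected_worker_pods(0, min_replicas=0, max_replicas=1, num_of_hosts=1) == 0
```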

@ryanaoleary
Contributor Author

The V2 autoscaler is failing repeatedly with error:

2025-02-14 17:56:02,207 INFO instance_manager.py:262 -- Update instance ALLOCATED->RAY_STOP_REQUESTED (id=b7e7fbf6-19c9-486a-aa54-0daee9f0ce5d, type=tpu-group, cloud_instance_id=ray-cluster-tpu-group-worker-89dnn, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=0
2025-02-14 17:56:02,207 - ERROR - Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 185, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 119, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 274, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1168, in _scale_cluster
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 615, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 94, in update_instance_manager_state
    instance = self._update_instance(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 263, in _update_instance
    assert InstanceUtil.set_status(instance, update.new_instance_status), (
AssertionError: Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
2025-02-14 17:56:02,208 WARNING monitor.py:173 -- No autoscaling state to report.

This causes the autoscaler to hang when attempting to scale down the allocated node, and the test case times out. The TPU worker Pod is never scheduled (and thus Ray is not running), so it's a bug that the autoscaler attempts to transition the instance to RAY_STOP_REQUESTED rather than Instance.TERMINATING. The autoscaler should check here whether the current state of the to_terminate nodes is ALLOCATED or RAY_RUNNING before transitioning the state. I'll put out a PR with a fix; once that lands, the V2 autoscaler test case should pass and this PR can be merged. cc: @kevin85421
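
A rough sketch of the guard described above; apart from the autoscaler v2 status names, everything here is hypothetical, and the actual fix would live in the Reconciler's scale-down path in reconciler.py:

```python
# Hypothetical helper: choose the next status for an instance selected for
# termination based on whether Ray ever started on it.
def next_status_for_scale_down(current_status: str) -> str:
    if current_status == "RAY_RUNNING":
        # Ray is running on the node, so it must be drained first.
        return "RAY_STOP_REQUESTED"
    if current_status == "ALLOCATED":
        # The Pod was never scheduled and Ray never started, so skip the
        # drain step and terminate the cloud instance directly.
        return "TERMINATING"
    # Leave other states to the existing state machine.
    return current_status
```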
