Add Fake TPU e2e Autoscaling Test Cases #2279

Open
wants to merge 11 commits into base: master
Conversation


@ryanaoleary ryanaoleary commented Jul 31, 2024

Why are these changes needed?

This PR adds a fake TPU test case, similar to the existing fake GPU autoscaling test case, that uses detached actors to verify that single-host autoscaling behaves as expected. The behaviors tested include:

  • (1) Creating a detached actor that requests resources: {"TPU": 4} will scale up a Ray TPU worker (see the sketch below)
  • (2) For a worker group, the number of workers created should equal replicas * numOfHosts
  • (3) Terminating the detached actors scheduled on a TPU worker group replica will cause the entire replica to be scaled down

Edit: Removed the test behavior for idle nodes being scaled down, since the KubeRay TPU Pods are unschedulable due to their required nodeSelectors.
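
For reference, a minimal sketch of the detached-actor pattern these tests exercise, assuming a cluster whose worker group advertises fake TPU resources; the actor class, actor name, and namespace below are illustrative, not taken from the test code:

```python
import ray

# Connect to the running RayCluster; address and namespace are illustrative.
ray.init(address="auto", namespace="autoscaling-test")

# Requesting {"TPU": 4} creates resource demand the autoscaler can only
# satisfy by scaling up a TPU worker.
@ray.remote(resources={"TPU": 4})
class FakeTPUActor:
    def ready(self) -> bool:
        return True

# A detached actor outlives the driver, so the demand persists until the
# actor is explicitly killed. For a worker group, the expected Pod count
# after scale-up is replicas * numOfHosts.
actor = FakeTPUActor.options(name="fake-tpu-actor", lifetime="detached").remote()
ray.get(actor.ready.remote())

# Killing the detached actor removes the demand, which should cause the
# autoscaler to scale the worker group replica back down.
ray.kill(actor)
```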

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@ryanaoleary
Contributor Author

cc: @kevin85421

@kevin85421 kevin85421 self-assigned this Aug 12, 2024
@kevin85421
Member

I plan to include this PR in v1.3.0 instead.

Fix tests for updated format

Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary
Contributor Author

Related Issue:
#2561

@ryanaoleary
Contributor Author

I'm moving the multi-host test case to a separate PR based on offline discussion. For the V2 autoscaler only, I'm seeing issues with maxReplicas being enforced (the V1 test case passes, but V2 never deletes the Pod). I'll open an issue with reproduction steps to see if I can root-cause it.

--- FAIL: TestRayClusterAutoscalerWithFakeSingleHostTPU (344.39s)
[2025-01-23T18:58:30Z]     --- PASS: TestRayClusterAutoscalerWithFakeSingleHostTPU/Create_a_RayCluster_with_autoscaling_enabled (26.63s)
[2025-01-23T18:58:30Z]     --- FAIL: TestRayClusterAutoscalerWithFakeSingleHostTPU/Create_a_RayCluster_with_autoscaler_v2_enabled (317.64s)
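
For context, the expectation being checked is that the worker group's Pod count stays within its minReplicas/maxReplicas bounds; a rough illustration of that arithmetic (the helper name and example values are hypothetical, not from the test code):

```python
def expected_worker_pods(demanded_replicas: int,
                         min_replicas: int,
                         max_replicas: int,
                         num_of_hosts: int) -> int:
    """Replicas are clamped to [minReplicas, maxReplicas]; each replica
    brings up numOfHosts worker Pods."""
    clamped = min(max(demanded_replicas, min_replicas), max_replicas)
    return clamped * num_of_hosts

# e.g. with maxReplicas=1 and numOfHosts=1 the group should never exceed
# one Pod, and should fall back to minReplicas Pods after scale-down.
assert expected_worker_pods(3, min_replicas=0, max_replicas=1, num_of_hosts=1) == 1
assert expected_worker_pods(0, min_replicas=0, max_replicas=1, num_of_hosts=1) == 0
```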

@ryanaoleary
Contributor Author

The V2 autoscaler is failing repeatedly with error:

2025-02-14 17:56:02,207 INFO instance_manager.py:262 -- Update instance ALLOCATED->RAY_STOP_REQUESTED (id=b7e7fbf6-19c9-486a-aa54-0daee9f0ce5d, type=tpu-group, cloud_instance_id=ray-cluster-tpu-group-worker-89dnn, ray_id=): draining ray: Terminating node due to MAX_NUM_NODE_PER_TYPE: max_num_nodes=None, max_num_nodes_per_type=0
2025-02-14 17:56:02,207 - ERROR - Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 185, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 119, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 274, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1168, in _scale_cluster
    Reconciler._update_instance_manager(instance_manager, version, updates)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 615, in _update_instance_manager
    reply = instance_manager.update_instance_manager_state(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 94, in update_instance_manager_state
    instance = self._update_instance(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/instance_manager.py", line 263, in _update_instance
    assert InstanceUtil.set_status(instance, update.new_instance_status), (
AssertionError: Invalid status transition from ALLOCATED to RAY_STOP_REQUESTED
2025-02-14 17:56:02,208 WARNING monitor.py:173 -- No autoscaling state to report.

This causes the autoscaler to hang when attempting to scale down the allocated node, and the test case times out. The TPU worker Pod is never scheduled (and thus Ray is not running), so it's a bug that the autoscaler attempts to transition the instance to RAY_STOP_REQUESTED rather than Instance.TERMINATING. The autoscaler should check here whether the current state of the to_terminate nodes is ALLOCATED or RAY_RUNNING before transitioning the state. I'll put out a PR with a fix; once that lands, the V2 autoscaler test case should pass and this PR can be merged. cc: @kevin85421
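
A rough sketch of the guard described above; apart from the autoscaler v2 status names, everything here is hypothetical, and the actual fix would live in the Reconciler's scale-down path in reconciler.py:

```python
# Hypothetical helper: choose the next status for an instance selected for
# termination based on whether Ray ever started on it.
def next_status_for_scale_down(current_status: str) -> str:
    if current_status == "RAY_RUNNING":
        # Ray is running on the node, so it must be drained first.
        return "RAY_STOP_REQUESTED"
    if current_status == "ALLOCATED":
        # The Pod was never scheduled and Ray never started, so skip the
        # drain step and terminate the cloud instance directly.
        return "TERMINATING"
    # Leave other states to the existing state machine.
    return current_status
```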
