Add Fake TPU e2e Autoscaling Test Cases #2279
base: master
Conversation
cc: @kevin85421
I plan to include this PR in v1.3.0 instead.
Force-pushed the branch from e95f608 to fd552da ("Fix tests for updated format").
Related Issue:
I'm moving the multi-host test case to a separate PR based on offline discussion. For the V2 autoscaler only, I'm seeing issues with maxReplicas being enforced (the V1 test case passes, but V2 fails to ever delete the Pod). I'll open an issue with reproduction steps to see if I can root-cause it.
The V2 autoscaler is failing repeatedly with an error that causes it to hang when attempting to scale down the allocated node, which times out the test case. The TPU worker Pod is never scheduled (and thus Ray is never running), so it's a bug that the autoscaler attempts that transition.
Why are these changes needed?
This PR adds a fake TPU test case, similar to the existing fake GPU test case for autoscaling, that uses detached actors to verify that single-host autoscaling behaves as expected. The behaviors tested include:
- Creating a detached actor that requests `resources: {"TPU": 4}` will scale up a Ray TPU worker (see the sketch after this list).
- The number of TPU worker Pods scales to `replicas * numOfHosts`.
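For context, here is a minimal sketch of how a detached actor can trigger the fake TPU scale-up, using the standard Ray Python API; the actor class and actor name below are hypothetical and not taken from this PR:

```python
# Minimal sketch: request fake TPU resources via a detached actor so the
# autoscaler provisions a TPU worker. Actor class/name are illustrative only.
import ray

ray.init(address="auto")


@ray.remote(resources={"TPU": 4})
class FakeTPUActor:
    """Placeholder workload that exists only to hold the TPU resources."""

    def ready(self) -> bool:
        return True


# A detached actor outlives the driver, so the fake TPU worker must stay
# scaled up until the actor is explicitly killed (e.g. ray.kill(actor)).
actor = FakeTPUActor.options(name="fake_tpu_actor", lifetime="detached").remote()
assert ray.get(actor.ready.remote())
```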
Edit: Removed the test behavior for idle nodes being scaled down, since KubeRay TPU Pods are unschedulable in this setup due to their required nodeSelectors.
Related issue number
Checks