[AWS] Stop Round Robining AZs #19051
Conversation
Hey @pdames, I was wondering if you had any thoughts about this PR?
Force-pushed 44af611 to ae24a4c
DmitriGekhtman
left a comment
Looks good.
Confirming these things to make sure I understand:
(1) The code change is to not use a random offset when we're picking a subnet.
(2) We still round-robin on launch failure.
(3) Prioritizing launching into a particular subnet implies prioritizing launching into a particular AZ, which saves cost.
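The selection behavior confirmed above can be sketched as follows. This is a hypothetical illustration, not the actual autoscaler code; `pick_subnet` and its parameters are invented names for the idea of "start with the first configured subnet, rotate only on launch failure":

```python
def pick_subnet(subnet_ids, failure_count=0):
    # No random offset: the first attempt always targets the first
    # (highest-priority) subnet. We round-robin to the next subnet
    # only after a launch failure increments failure_count.
    return subnet_ids[failure_count % len(subnet_ids)]
```

Since the first subnet maps to a fixed AZ, repeated launches land in the same AZ until a failure forces rotation, which is what saves cross-AZ cost.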
pdames
left a comment
Thanks for making these changes! I think your reasoning for getting rid of AZ round-robining for spot instances makes sense. I would also add that it makes node AZ selection logic easier to understand for users and easier to maintain for developers since the behavior doesn't change based on the instance type being launched.
I think we'll need to make a minor change to ensure that we pack instances into availability zones in the order specified in the autoscaler config, and we may want to ensure that we try launching an instance in all AZs before giving up, but otherwise this looks good to me.
I'd also like to circle back to implementing proper resilience for spot instances in Ray clusters in a subsequent PR. A better strategy for spot instances would arguably be to launch a different EC2 instance type that provides equivalent resources (if available in the autoscaler config) after a particular spot instance type is lost. Ideally, we would also make the autoscaler aware of early Rebalance Recommendations (e.g. by monitoring http://169.254.169.254/latest/meta-data/events/recommendations/rebalance) and early Termination Notices (by monitoring http://169.254.169.254/latest/meta-data/spot/termination-time), and then (1) try to switch node types if either of the above events is received, or (2) try to switch AZs if no other node types are suitable.
I know this is a bit more work, but I think this would ultimately present much better spot instance resilience with Ray. I’d be happy to file an issue for the same and follow up with a PR if that would help.
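A minimal sketch of the monitoring idea above, polling the two IMDS endpoints named in the comment. The `_fetch`/`interruption_pending` helpers are hypothetical names; IMDSv1 returns 404 from these paths when no event is pending, which this sketch assumes:

```python
import urllib.error
import urllib.request

# IMDS endpoints cited in the review comment above.
REBALANCE_URL = ("http://169.254.169.254/latest/meta-data/"
                 "events/recommendations/rebalance")
TERMINATION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/termination-time")


def _fetch(url, timeout=1.0):
    # Return the response body, or None if the endpoint 404s
    # (i.e. no interruption event has been issued yet).
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def interruption_pending(fetch=_fetch):
    # True if either a Rebalance Recommendation or a Spot
    # termination notice has been issued for this instance.
    return (fetch(REBALANCE_URL) is not None
            or fetch(TERMINATION_URL) is not None)
```

The autoscaler could poll `interruption_pending()` periodically and, on a positive result, attempt the node-type switch first and the AZ switch as a fallback.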
Since BOTO_CREATE_MAX_RETRIES is 5 by default, a corner case is that we may give up before attempting to launch an instance in each AZ (e.g. the us-east-1 region has up to 8 local zones, and if the requested instance type is only available in the last 3 AZs then it will never launch).
The probability of hitting this corner case also increases if multiple subnets are defined for one or more AZs.
To ensure that we try to launch the instance in each subnet (and therefore each AZ) at least once, we could replace BOTO_CREATE_MAX_RETRIES with max_attempts = max(BOTO_CREATE_MAX_RETRIES, len(subnet_ids)).
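The suggested fix is a one-liner; `compute_max_attempts` is an illustrative name for where this expression would live:

```python
BOTO_CREATE_MAX_RETRIES = 5  # default retry budget discussed above


def compute_max_attempts(subnet_ids):
    # Guarantee at least one launch attempt per configured subnet,
    # even when there are more subnets than the default retry count.
    return max(BOTO_CREATE_MAX_RETRIES, len(subnet_ids))
```

With 8 subnets, every subnet (and therefore every AZ) gets at least one attempt; with fewer subnets, the default retry budget still applies.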
For up-to-date information on spot instance interruption frequency by region and instance type, see: https://aws.amazon.com/ec2/spot/instance-advisor/.
doc/source/cluster/config.rst
Outdated
For this statement to be true, I think we also need to change https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py#L443
From:
subnets = [s for s in subnets if s.availability_zone in azs]
To:
subnets = [s for az in azs for s in subnets if s.availability_zone == az]
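The behavioral difference between the two comprehensions is worth spelling out; here is a self-contained sketch (the `Subnet` namedtuple stands in for the boto3 subnet objects):

```python
from collections import namedtuple

Subnet = namedtuple("Subnet", ["id", "availability_zone"])


def filter_subnets_original(subnets, azs):
    # Original: keeps the subnets' own ordering and ignores
    # the order in which AZs were listed in the config.
    return [s for s in subnets if s.availability_zone in azs]


def filter_subnets_az_ordered(subnets, azs):
    # Suggested: emits subnets grouped by AZ, in config order,
    # so nodes are packed into AZs in the order the user specified.
    return [s for az in azs for s in subnets if s.availability_zone == az]
```

For example, with subnets in us-west-2a and us-west-2b but the config listing `["us-west-2b", "us-west-2a"]`, only the second version puts the us-west-2b subnet first.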
Wow, great catch! I changed this and added a test for it!
Nice, thanks for adding this test! AZ ordering is an important feature to protect from regressions now that we've promised that we'll pack nodes into AZs in the order given in config.
python/ray/tests/aws/utils/stubs.py
Outdated
Minor/Typo: DIFFENT -> DIFFERENT
Force-pushed 0cd191d to 60ef04d
Failures are on Windows (no autoscaler) and thus unrelated to this PR.
Why are these changes needed?
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.