
[AWS] Stop Round Robining AZs #19051

Merged
ijrsvt merged 9 commits into ray-project:master from ijrsvt:stop-expensive-azs on Oct 21, 2021

Conversation

ijrsvt (Contributor) commented Oct 1, 2021

Why are these changes needed?

  • Having Ray clusters span multiple AZs leads to increased costs (from cross-AZ traffic) and increased latency.
  • This PR prioritizes launching machines into the first AZ when multiple are selected. If a node fails to launch, it will be retried in a different AZ (see the sketch after this list)!
  • Spot instances are no longer round-robined across AZs, changing the behavior introduced in #2254 (Support multiple availability zones in AWS, fix #2177). This is fine because:
    1. It is unlikely that only one AZ in a region has a substantially different chance of spot instances being evicted.
    2. AWS tends not to evict a large number of machines from a single user at once.
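A minimal before/after sketch of the subnet selection change described above (hypothetical subnet IDs; the real logic lives in the autoscaler's AWS node provider):

```python
import random

# Hypothetical subnet IDs, listed in the order given in the cluster config.
subnet_ids = ["subnet-in-us-west-2a", "subnet-in-us-west-2b"]

# Before this PR: a random offset spread launches across AZs (round robin).
offset = random.randint(0, len(subnet_ids) - 1)
round_robin_choice = subnet_ids[offset]

# After this PR: always prefer the first configured subnet (and hence AZ);
# later subnets are used only as fallbacks when a launch fails.
preferred_choice = subnet_ids[0]
```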

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ijrsvt changed the title from [AWS] Stop Round Robin'ing AZs to [AWS] Stop Round Robining AZs on Oct 4, 2021
ijrsvt (Contributor, Author) commented Oct 4, 2021

Hey @pdames, I was wondering if you had any thoughts about this PR?

DmitriGekhtman (Contributor) left a comment

Looks good.

Confirming these things to make sure I understand:
(1) The code change is to not use a random offset when we're picking a subnet.
(2) We still round-robin on launch failure.
(3) Prioritizing launching into a particular subnet implies prioritizing launching into a particular AZ, which saves cost.

pdames (Member) left a comment

Thanks for making these changes! I think your reasoning for getting rid of AZ round-robining for spot instances makes sense. I would also add that it makes node AZ selection logic easier to understand for users and easier to maintain for developers since the behavior doesn't change based on the instance type being launched.

I think we'll need to make a minor change to ensure that we pack instances into availability zones in the order specified in the autoscaler config, and we may want to ensure that we try launching an instance in all AZs before giving up, but otherwise this looks good to me.

I'd also like to circle back to implementing proper resilience for spot instances in Ray clusters in a subsequent PR. A better strategy for spot instances would arguably be to launch different EC2 instance types that provide equivalent resources (if available in the autoscaler config) after a particular spot instance type is lost. Ideally, we would also make the autoscaler aware of early Rebalance Recommendations (e.g. by monitoring http://169.254.169.254/latest/meta-data/events/recommendations/rebalance) and early Termination Notices (by monitoring http://169.254.169.254/latest/meta-data/spot/termination-time), and then (1) try to switch node types if either of the above events is received, or (2) try to switch AZs if no other node types are suitable.

I know this is a bit more work, but I think this would ultimately present much better spot instance resilience with Ray. I’d be happy to file an issue for the same and follow up with a PR if that would help.
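
A rough sketch of the metadata-endpoint monitoring described above (the two IMDS paths are from the comment; the watcher itself is hypothetical and assumes IMDSv1-style unauthenticated access):

```python
import time
import urllib.request

# IMDS endpoints named above; a 200 response means the event is scheduled.
REBALANCE_URL = ("http://169.254.169.254/latest/meta-data/"
                 "events/recommendations/rebalance")
TERMINATION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/termination-time")

def _event_pending(url):
    # IMDS returns 404 until the event is scheduled for this instance.
    try:
        with urllib.request.urlopen(url, timeout=1):
            return True
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False

def watch_spot_interruptions(on_interruption, poll_seconds=5):
    # Hypothetical watcher: fire a callback (e.g. switch node types, else
    # switch AZs) when a rebalance recommendation or termination notice
    # appears.
    while True:
        if _event_pending(REBALANCE_URL) or _event_pending(TERMINATION_URL):
            on_interruption()
            return
        time.sleep(poll_seconds)
```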

pdames (Member) left a comment

Since BOTO_CREATE_MAX_RETRIES is 5 by default, a corner case is that we may give up before attempting to launch an instance in each AZ (e.g. the us-east-1 region has up to 8 local zones, and if the requested instance type is only available in the last 3 AZs, then it will never launch).

The probability of hitting this corner case also increases if multiple subnets are defined for 1 or more AZs.

To ensure that we try to launch the instance in each subnet (and therefore each AZ) at least once, we could replace BOTO_CREATE_MAX_RETRIES with max_attempts = max(BOTO_CREATE_MAX_RETRIES, len(subnet_ids)).
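
For illustration, a sketch of how that bound might be applied (the max_attempts expression is from the comment; the surrounding loop and names are hypothetical):

```python
BOTO_CREATE_MAX_RETRIES = 5  # the default mentioned above

def create_node_with_subnet_fallback(subnet_ids, try_launch):
    # Guarantee at least one launch attempt per subnet (and therefore per
    # AZ), even when there are more subnets than the default retry budget.
    max_attempts = max(BOTO_CREATE_MAX_RETRIES, len(subnet_ids))
    for attempt in range(max_attempts):
        subnet_id = subnet_ids[attempt % len(subnet_ids)]
        if try_launch(subnet_id):
            return subnet_id
    raise RuntimeError("Node failed to launch in every subnet/AZ")
```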

ijrsvt (Contributor, Author) replied

Modified!

pdames (Member) left a comment

For up-to-date information on spot instance interruption frequency by region and instance type, see: https://aws.amazon.com/ec2/spot/instance-advisor/.

ijrsvt (Contributor, Author) replied

Added!

pdames (Member) left a comment

For this statement to be true, I think we also need to change https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py#L443
From:
subnets = [s for s in subnets if s.availability_zone in azs]
To:
subnets = [s for az in azs for s in subnets if s.availability_zone == az]
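
A quick illustration of the difference (made-up subnet data, not from the PR):

```python
from types import SimpleNamespace

# Made-up subnets, deliberately listed out of the user's preferred order.
subnets = [
    SimpleNamespace(id="subnet-b", availability_zone="us-west-2b"),
    SimpleNamespace(id="subnet-a", availability_zone="us-west-2a"),
]
azs = ["us-west-2a", "us-west-2b"]  # order from the autoscaler config

# Original filter: keeps boto3's result order, ignoring the config order.
before = [s.id for s in subnets if s.availability_zone in azs]
# -> ['subnet-b', 'subnet-a']

# Fixed filter: iterates AZs first, so the config order wins.
after = [s.id for az in azs for s in subnets if s.availability_zone == az]
# -> ['subnet-a', 'subnet-b']
```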

ijrsvt (Contributor, Author) replied

Wow, great catch! I changed this and added a test for it!

ijrsvt requested a review from pdames on October 20, 2021 18:19
pdames (Member) left a comment

LGTM!

pdames (Member) left a comment

Nice, thanks for adding this test! AZ ordering is an important feature to protect against regressions now that we've promised to pack nodes into AZs in the order given in the config.

pdames (Member) left a comment

Minor/Typo: DIFFENT -> DIFFERENT

ijrsvt force-pushed the stop-expensive-azs branch from 0cd191d to 60ef04d on October 21, 2021 16:19
ijrsvt (Contributor, Author) commented Oct 21, 2021

The failures are on Windows (which has no autoscaler) and are thus unrelated to this PR.

ijrsvt merged commit 0cdf4ae into ray-project:master on Oct 21, 2021
ijrsvt deleted the stop-expensive-azs branch on October 21, 2021 19:08