[AWS] Stop Round Robining AZs #19051
Conversation
Hey @pdames, I was wondering if you had any thoughts about this PR?
Force-pushed 44af611 to ae24a4c
DmitriGekhtman
left a comment
Looks good.
Confirming these things to make sure I understand:
(1) The code change is to not use a random offset when we're picking a subnet.
(2) We still round-robin on launch failure.
(3) Prioritizing launching into a particular subnet implies prioritizing launching into a particular AZ, which saves cost.
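The selection behavior confirmed above can be sketched as follows. This is a hypothetical illustration, not the actual autoscaler code; `pick_subnet` and its parameters are invented names for the idea of "start with the first configured subnet, rotate only on launch failure":

```python
def pick_subnet(subnet_ids, failure_count=0):
    # No random offset: the first attempt always targets the first
    # (highest-priority) subnet. We round-robin to the next subnet
    # only after a launch failure increments failure_count.
    return subnet_ids[failure_count % len(subnet_ids)]
```

Since the first subnet maps to a fixed AZ, repeated launches land in the same AZ until a failure forces rotation, which is what saves cross-AZ cost.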
pdames
left a comment
Thanks for making these changes! I think your reasoning for getting rid of AZ round-robining for spot instances makes sense. I would also add that it makes node AZ selection logic easier to understand for users and easier to maintain for developers since the behavior doesn't change based on the instance type being launched.
I think we'll need to make a minor change to ensure that we pack instances into availability zones in the order specified in the autoscaler config, and we may want to ensure that we try launching an instance in all AZs before giving up, but otherwise this looks good to me.
I'd also like to circle back to implementing proper resilience for spot instances in Ray clusters in a subsequent PR. A better strategy for spot instances would arguably be to launch a different EC2 instance type that provides equivalent resources (if available in the autoscaler config) after a particular spot instance type is lost. Ideally, we would also make the autoscaler aware of early Rebalance Recommendations (e.g. by monitoring http://169.254.169.254/latest/meta-data/events/recommendations/rebalance) and early Termination Notices (by monitoring http://169.254.169.254/latest/meta-data/spot/termination-time), and then (1) try to switch node types if either of the above events is received, or (2) try to switch AZs if no other node types are suitable.
I know this is a bit more work, but I think this would ultimately present much better spot instance resilience with Ray. I’d be happy to file an issue for the same and follow up with a PR if that would help.
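A minimal sketch of the monitoring idea above, polling the two IMDS endpoints named in the comment. The `_fetch`/`interruption_pending` helpers are hypothetical names; IMDSv1 returns 404 from these paths when no event is pending, which this sketch assumes:

```python
import urllib.error
import urllib.request

# IMDS endpoints cited in the review comment above.
REBALANCE_URL = ("http://169.254.169.254/latest/meta-data/"
                 "events/recommendations/rebalance")
TERMINATION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/termination-time")


def _fetch(url, timeout=1.0):
    # Return the response body, or None if the endpoint 404s
    # (i.e. no interruption event has been issued yet).
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def interruption_pending(fetch=_fetch):
    # True if either a Rebalance Recommendation or a Spot
    # termination notice has been issued for this instance.
    return (fetch(REBALANCE_URL) is not None
            or fetch(TERMINATION_URL) is not None)
```

The autoscaler could poll `interruption_pending()` periodically and, on a positive result, attempt the node-type switch first and the AZ switch as a fallback.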
Since BOTO_CREATE_MAX_RETRIES is 5 by default, a corner case is that we may give up before attempting to launch an instance in each AZ (e.g. the us-east-1 region has up to 8 local zones, and if the requested instance type is only available in the last 3 AZs then it will never launch).
The probability of hitting this corner case also increases if multiple subnets are defined for one or more AZs.
To ensure that we try to launch the instance in each subnet (and therefore each AZ) at least once, we could replace BOTO_CREATE_MAX_RETRIES with max_attempts = max(BOTO_CREATE_MAX_RETRIES, len(subnet_ids)).
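The suggested fix is a one-liner; `compute_max_attempts` is an illustrative name for where this expression would live:

```python
BOTO_CREATE_MAX_RETRIES = 5  # default retry budget discussed above


def compute_max_attempts(subnet_ids):
    # Guarantee at least one launch attempt per configured subnet,
    # even when there are more subnets than the default retry count.
    return max(BOTO_CREATE_MAX_RETRIES, len(subnet_ids))
```

With 8 subnets, every subnet (and therefore every AZ) gets at least one attempt; with fewer subnets, the default retry budget still applies.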
For up-to-date information on spot instance interruption frequency by region and instance type, see: https://aws.amazon.com/ec2/spot/instance-advisor/.
doc/source/cluster/config.rst
Outdated
For this statement to be true, I think we also need to change https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py#L443
From:
subnets = [s for s in subnets if s.availability_zone in azs]
To:
subnets = [s for az in azs for s in subnets if s.availability_zone == az]
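The behavioral difference between the two comprehensions is worth spelling out; here is a self-contained sketch (the `Subnet` namedtuple stands in for the boto3 subnet objects):

```python
from collections import namedtuple

Subnet = namedtuple("Subnet", ["id", "availability_zone"])


def filter_subnets_original(subnets, azs):
    # Original: keeps the subnets' own ordering and ignores
    # the order in which AZs were listed in the config.
    return [s for s in subnets if s.availability_zone in azs]


def filter_subnets_az_ordered(subnets, azs):
    # Suggested: emits subnets grouped by AZ, in config order,
    # so nodes are packed into AZs in the order the user specified.
    return [s for az in azs for s in subnets if s.availability_zone == az]
```

For example, with subnets in us-west-2a and us-west-2b but the config listing `["us-west-2b", "us-west-2a"]`, only the second version puts the us-west-2b subnet first.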
Wow, great catch! I changed this and added a test for it!
Nice, thanks for adding this test! AZ ordering is an important feature to protect from regressions now that we've promised that we'll pack nodes into AZs in the order given in config.
python/ray/tests/aws/utils/stubs.py
Outdated
Minor/Typo: DIFFENT -> DIFFERENT
Force-pushed 0cd191d to 60ef04d
Failures are on Windows (no autoscaler) and thus unrelated to this PR.
Why are these changes needed?
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.