Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[autoscaler] SubnetId, a valid AWS field, is being ignored in cluster yaml #14551

Open
2 tasks
xcharleslin opened this issue Mar 9, 2021 · 10 comments
Open
2 tasks
Assignees
Labels
bug Something that is supposed to be working; but isn't infra autoscaler, ray client, kuberay, related issues P2 Important issue, but not time-critical
Milestone

Comments

@xcharleslin
Copy link
Contributor

xcharleslin commented Mar 9, 2021

What is the problem?

Cluster config docs say that you can put any valid AWS field in NodeConfig. But, putting SubnetId is silently ignored.

This issue has already been raised before; see #12530 . This new issue is not to find a workaround, but to fix the documentation/implementation mismatch.

My suggested fix (only to start discussion, not volunteering 😅 ) is the following:

  1. Move any fields that are Ray-specific concepts out of the NodeConfig dict and into the Provider dict. (This is already true today for most fields; I believe SubnetIds is the sole violation of this.)
  2. No hidden behaviour or silent failures in the NodeConfig dict. Either pass through all AWS fields to the underlying create_instance call, or explicitly raise an error for the ones we don't want the user passing in.

Reproduction (REQUIRED)

(From #12530.)

# ~/Desktop/ray_cluster.yaml
cluster_name: default

min_workers: 0
max_workers: 2

docker:
    image: "rayproject/ray:latest-gpu"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []

provider:
    type: aws
    region: eu-west-1
    availability_zone: eu-west-1a,eu-west-1b
    cache_stopped_nodes: False # If not present, the default is True.

auth:
    ssh_user: ubuntu

head_node:
    # VpcId:
    SubnetId: subnet-***********qwe
    # CidrBlock: 10.19.0.0/16
    InstanceType: m5a.large
    ImageId: ami-006ff58f5247c50eb # Deep Learning AMI (Ubuntu) Version 30

    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100


worker_nodes:
    SubnetId: subnet-***********qwe
    InstanceType: m5a.large
    ImageId: ami-006ff58f5247c50eb # Deep Learning AMI (Ubuntu) Version 30
ray up -y ~/Desktop/ray_cluster.yaml
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@xcharleslin xcharleslin added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 9, 2021
@xcharleslin
Copy link
Contributor Author

xcharleslin commented Mar 9, 2021

cc @wuisawesome , @richardliaw

@wuisawesome
Copy link
Contributor

"support all AWS fields" is vague.

provider seems like a logical place to put the SubnetIds (plural) field, but for the sake of backwards compatibility, we should consider leaving it where it is, and just not clobber a user provided SubnetId

cc @AmeerHajAli

@xcharleslin
Copy link
Contributor Author

xcharleslin commented Mar 9, 2021

"support all AWS fields" is vague.

@wuisawesome good point, rephrased for precision :)

@xcharleslin
Copy link
Contributor Author

@pcmoritz please triage so this doesn't get forgotten ;)

@richardliaw richardliaw added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 21, 2021
@richardliaw richardliaw added this to the Serverless Autoscaling milestone Apr 21, 2021
@AmeerHajAli AmeerHajAli added the infra autoscaler, ray client, kuberay, related issues label Mar 26, 2022
@xieus
Copy link

xieus commented May 22, 2023

@architkulkarni I am hitting the same issue on cloud launcher. If users have more than one subnets, it is not easy to deploy a Ray cluster to the target subnet, all by chance. Could we please prioritize the fix?

@architkulkarni architkulkarni self-assigned this May 22, 2023
@architkulkarni
Copy link
Contributor

Thanks @xieus , I'll see if we can prioritize it.

@architkulkarni
Copy link
Contributor

@xieus does this unblock you in the meantime? #12530 (comment)

@wuisawesome
Copy link
Contributor

@xieus the short term mitigation here is to provide a SubnetIds: ["<your subnet id>"] field in the node_config.

@xieus
Copy link

xieus commented May 23, 2023

Right. I tried that but only providing SubnetIds doesn't work. We have to provide SecurityGroupIds, and then either set use_internal_ips as true, or make the target subnet's mapPublicIpOnLaunch property true (after reading the cloud launcher source codes). All of the desirable parameters are hidden from users.

I would recommend updating Ray OSS documents or at least this thread with an example, which shows the following setup:
(1) Two VPCs, two subnets per VPC;
(2) Each VPC has its own SG, not the default SG;
(3) A sample yaml file that specifies one subnet (can be any random subnet)..

This setup could give user a good example if they want to expand to more complicated setup. @architkulkarni @wuisawesome

@xieus
Copy link

xieus commented May 23, 2023

I followed this yaml file: https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html#full-configuration, which works only in the scenario of unspecified SubnetId, and cloud launch picks one for user although the criteria is random from user perspective. We currently don't have a working example that allows users to specify a subnet and just work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't infra autoscaler, ray client, kuberay, related issues P2 Important issue, but not time-critical
Projects
None yet
Development

No branches or pull requests

6 participants