[Bug] Hydra-Ray plugin can only launch jobs on the head node #1583

Closed · samuelstanton opened this issue Apr 28, 2021 · 6 comments
Labels: bug (Something isn't working)

@samuelstanton (Contributor) commented Apr 28, 2021

🐛 Bug

Description

Multi-node support for the Hydra-Ray plugin is currently broken. Although the Ray client is able to launch worker nodes, the worker nodes are unable to connect to the head node's Ray runtime because the head node's IP address is never communicated to them. The plugin config schema is based on the Ray examples, such as this one:

https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full-legacy.yaml

The problem is in worker_start_ray_commands. Currently these commands are set as follows (just like the Ray example):

    worker_start_ray_commands: [
        "ray stop",
        "ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076",
    ]

The issue is that RAY_HEAD_IP is never actually set as an environment variable on the worker node. In the Ray codebase it looks like they prepend an export RAY_HEAD_IP=x.x.x.x prefix whenever the worker startup commands are invoked:

https://github.com/ray-project/ray/blob/42565d5bbe56412402e0126cf41fb92f6a411810/python/ray/autoscaler/_private/autoscaler.py#L604
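
In other words, on the worker node the effect is roughly the following (a sketch; the actual value of x.x.x.x is filled in by the autoscaler at runtime, and the exact command string may differ):

    # prefix added by the Ray autoscaler before running worker_start_ray_commands
    export RAY_HEAD_IP=x.x.x.x
    ray stop
    ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076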

It seems like that autoscaler code path is not getting exercised when the cluster is launched with the plugin, so the worker node can't identify the head node at all. I've verified that I can manually connect the worker node to the head node (see the sketch below).
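
Concretely, the manual workaround looks something like this (a sketch; the head node address is a placeholder, and Ray prints the exact command to run when the head node starts, as in the log further down):

    # run on the worker node, substituting the head node's private IP
    ray stop
    ray start --address=<HEAD_NODE_IP>:6379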

Checklist

  • I checked on the latest version of Hydra
  • I created a minimal repro (See this for tips).

To reproduce

I haven't made a stripped-down version of the program yet, and the failure is silent: all the jobs run sequentially on the head node. The worker node is there, but it isn't being used. You can see this happening if you connect to the Ray dashboard on the head node.

Expected Behavior

The worker nodes need some way of identifying the head node's IP address so they can connect to the head node's Ray runtime. Once that happens, the jobs should be distributed to the worker nodes instead of all being executed by the head node.

System information

  • Hydra Version : hydra-core==1.1.0.dev5, hydra-ray-launcher==0.1.3
  • Python version : 3.8.8
  • Virtual environment type and version : conda 4.9.2
  • Operating system : Ubuntu 18.04

Additional context

I'm willing to make a PR to fix the issue. The main problem I'm having is that I can't come up with a good way to determine the head node address and update the config before calling ray up. There's something of a chicken-and-egg problem in the way it's currently structured: you can't call ray up until you know the head node IP address and have updated worker_start_ray_commands, but you won't know the head node IP address until after you've called ray up (unless you are statically assigning the IP address). Within ray up itself this obviously isn't an issue, but the API doesn't seem to expose that kind of hook (as far as I know).

samuelstanton added the bug label on Apr 28, 2021
@jieru-hu (Contributor) commented:

Thanks @samuelstanton

Could you paste your launcher config here so I can try to repro with a minimal example? You can run something like the following:

python my_app.py hydra/launcher=ray_aws --cfg hydra -p hydra.launcher

@samuelstanton (Contributor, Author) commented Apr 28, 2021

OK, this is a stripped-down version of the config, leaving out most of the unchanged defaults. I'm using Docker as well, but it'd be good to start with something simple. Depending on how you have the AWS CLI configured, you may not need to specify the head and worker subnets. Either way, make sure the head and worker nodes end up in the same subnet, and that the subnet is in the availability zone you specified in the config. Changing stop_cluster to False can make debugging easier.

_target_: hydra_plugins.hydra_ray_launcher.ray_aws_launcher.RayAWSLauncher
ray:
  init:
    address: auto
  cluster:
    min_workers: 1
    max_workers: 1
    initial_workers: 1
    provider:
      type: aws
      region: us-west-2
      availability_zone: us-west-2b
    auth:
      ssh_user: ubuntu
    head_node:
      SubnetIds:
        - subnet-04907b6ca5184e259
    worker_nodes:
      SubnetIds:
        - subnet-04907b6ca5184e259
# stop_cluster: False

@jieru-hu (Contributor) commented:

Thank you. I will take a look.

@samuelstanton (Contributor, Author) commented Apr 28, 2021

So the minimal example from my PR seems to be working... might be a Docker thing :(

python my_parallel_app.py -m hydra/launcher=ray_aws task=0,1,2,3,4,5,6,7,8,9

[Screenshot: Screen Shot 2021-04-28 at 7.52.12 PM]

...
Did not find any active Ray processes.
bash: ulimit: open files: cannot modify limit: Operation not permitted
Local node IP: 10.0.7.99
2021-04-28 23:46:13,891	INFO services.py:1172 -- View the Ray dashboard at http://localhost:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='10.0.7.99:6379' --redis-password='5241590000000000'
  
  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')
  
  If connection fails, check your firewall settings and network configuration.
  
  To terminate the Ray runtime, run
    ray stop
Cluster: default

Checking AWS environment settings
AWS config
  IAM Profile: ray-autoscaler-v1 [default]
  EC2 Key pair (head & workers): hydra-stantsa_key-2 [default]
  VPC Subnets (head & workers): subnet-04907b6ca5184e259
  EC2 Security groups (head & workers): sg-0fc7fc13f4a6f028c [default]
  EC2 AMI (head & workers): ami-008d8ed4bd7dc2485

Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y [automatic, due to --yes]

Acquiring an up-to-date head node
  Currently running head node is out-of-date with cluster configuration
  hash is c5462da86467af8cfdb2f2705262c9e0c82a15c6, expected e75095c628a5987982f6ba08f254644973de3657
  Relaunching it. Confirm [y/N]: y [automatic, due to --yes]
  Terminated head node i-076d37b9e5fd3b699
  Launched 1 nodes [subnet_id=subnet-04907b6ca5184e259]
    Launched instance i-0009792f05afa4fa0 [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node
  
<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 34.215.241.148
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Success.
  Updating cluster configuration. [hash=02ec6f92c4566a3e843a364e25b72e42218ab22d]
  New status: syncing-files
  [2/7] Processing file mounts
  [3/7] No worker file mounts to sync
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initalizing command runner
  [6/7] Running setup commands
    (0/8) conda create -n hydra_3.8.8 py...
    (1/8) echo 'export PATH="$HOME/anaco...
    (2/8) pip install omegaconf==2.1.0de...
    (3/8) pip install hydra_core==1.1.0d...
    (4/8) pip install ray==1.2.0
    (5/8) pip install cloudpickle==1.6.0
    (6/8) pip install pickle5==0.0.11
    (7/8) pip install hydra_ray_launcher...
  [7/7] Starting the Ray runtime
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml
  Get a remote shell to the cluster manually:
    ssh -o IdentitiesOnly=yes -i /Users/stantsa/.ssh/hydra-stantsa_key-2.pem [email protected] 
 Error: ssh: connect to host 34.215.241.148 port 22: Operation timed out
ssh: connect to host 34.215.241.148 port 22: Operation timed out
ssh: connect to host 34.215.241.148 port 22: Operation timed out
ssh: connect to host 34.215.241.148 port 22: Operation timed out
Warning: Permanently added '34.215.241.148' (ECDSA) to the list of known hosts.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
Shared connection to 34.215.241.148 closed.
[2021-04-28 19:46:16,982][HYDRA] Running command: ray exec --run-env=auto /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml echo $(mktemp -d)
[2021-04-28 19:46:18,654][HYDRA] Output: /tmp/tmp.lx0deZjUjV
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.215.241.148 
 Error: Shared connection to 34.215.241.148 closed.
[2021-04-28 19:46:18,654][HYDRA] Created temp path on remote server /tmp/tmp.lx0deZjUjV
[2021-04-28 19:46:18,657][HYDRA] Running command: ray rsync-up /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmprlwtjmd7/ /tmp/tmp.lx0deZjUjV
[2021-04-28 19:46:20,454][HYDRA] Output: Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.215.241.148 
 Error: 
[2021-04-28 19:46:20,457][HYDRA] Running command: ray rsync-up /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml /Users/stantsa/code/hydra/plugins/hydra_ray_launcher/hydra_plugins/hydra_ray_launcher/_remote_invoke.py /tmp/tmp.lx0deZjUjV/_remote_invoke.py
[2021-04-28 19:46:22,259][HYDRA] Output: Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.215.241.148 
 Error: 
[2021-04-28 19:46:22,262][HYDRA] Running command: ray exec --run-env=auto /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml python /tmp/tmp.lx0deZjUjV/_remote_invoke.py /tmp/tmp.lx0deZjUjV
[2021-04-28 19:53:28,459][HYDRA] Output: 2021-04-28 23:46:24,202	INFO worker.py:654 -- Connecting to existing Ray cluster at address: 10.0.7.99:6379
(pid=3803) [2021-04-28 23:46:26,032][__main__][INFO] - Executing task 0 on node with IP 10.0.7.99
(pid=3803) [2021-04-28 23:47:26,195][__main__][INFO] - Executing task 1 on node with IP 10.0.7.99
(pid=3803) [2021-04-28 23:48:26,364][__main__][INFO] - Executing task 2 on node with IP 10.0.7.99
(pid=3803) [2021-04-28 23:49:26,535][__main__][INFO] - Executing task 3 on node with IP 10.0.7.99
(autoscaler +3m6s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +3m6s) Resized to 4 CPUs.
(pid=3588, ip=10.0.13.62) [2021-04-28 23:49:31,241][__main__][INFO] - Executing task 4 on node with IP 10.0.13.62
(pid=3803) [2021-04-28 23:50:26,857][__main__][INFO] - Executing task 5 on node with IP 10.0.7.99
(pid=3588, ip=10.0.13.62) [2021-04-28 23:50:31,454][__main__][INFO] - Executing task 6 on node with IP 10.0.13.62
(pid=3803) [2021-04-28 23:51:27,021][__main__][INFO] - Executing task 7 on node with IP 10.0.7.99
(pid=3588, ip=10.0.13.62) [2021-04-28 23:51:31,711][__main__][INFO] - Executing task 8 on node with IP 10.0.13.62
(pid=3803) [2021-04-28 23:52:27,232][__main__][INFO] - Executing task 9 on node with IP 10.0.7.99
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.215.241.148 
 Error: Shared connection to 34.215.241.148 closed.
[2021-04-28 19:53:28,463][HYDRA] Running command: ray rsync-down /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml /tmp/tmp.lx0deZjUjV/returns.pkl /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmpxu_42f47
[2021-04-28 19:53:31,211][HYDRA] Output: Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.215.241.148 
 Error: 
[2021-04-28 19:53:31,214][HYDRA] Running command: ray exec --run-env=auto /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml rm -rf /tmp/tmp.lx0deZjUjV
[2021-04-28 19:53:32,685][HYDRA] Output: Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.215.241.148 
 Error: Shared connection to 34.215.241.148 closed.
[2021-04-28 19:53:32,685][HYDRA] Stopping cluster now. (stop_cluster=true)
[2021-04-28 19:53:32,685][HYDRA] Deleted the cluster (provider.cache_stopped_nodes=false)
[2021-04-28 19:53:32,688][HYDRA] Running command: ray down -y /var/folders/4k/ptbs25l170389035l24w1x7c0000gr/T/tmp0ebvjt4e.yaml
[2021-04-28 19:53:42,763][HYDRA] Output: Stopped all 13 Ray processes.
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
Fetched IP: 34.215.241.148
Requested 2 nodes to shut down. [interval=1s]
0 nodes remaining after 5 second(s).
No nodes remaining. 
 Error: Shared connection to 34.215.241.148 closed.

@samuelstanton (Contributor, Author) commented:

Not sure what's up with the idle ray processes though...

@samuelstanton (Contributor, Author) commented:

After some more debugging, I got it working with Docker! Apparently I was mistaken about the nature of the problem. I'm closing the issue, but I'll mention some of the important things to get right in case others run into similar problems. Below I've given the package versions I'm using, a stripped-down version of my config, and a discussion of some of the important aspects of configuration on the AWS side. Some of this information is redundant with the plugin docs, but it's included for completeness.

Dependency versions

omegaconf==2.1.0.dev26
hydra-core==1.1.0.dev6
ray==1.2.0
cloudpickle==1.6.0
pickle5==0.0.11
hydra-ray-launcher==1.1.0.dev1
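
For reference, installing these locally amounts to something like the following (a sketch, assuming all of the listed versions are available from your package index):

    pip install omegaconf==2.1.0.dev26 hydra-core==1.1.0.dev6 ray==1.2.0 \
        cloudpickle==1.6.0 pickle5==0.0.11 hydra-ray-launcher==1.1.0.dev1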

Plugin config

_target_: hydra_plugins.hydra_ray_launcher.ray_aws_launcher.RayAWSLauncher
ray:
  init:
    address: auto
  remote:
    num_cpus: 4
    num_gpus: 1.0
  cluster:
    cluster_name: multi-node-cluster
    min_workers: 0
    max_workers: 1
    initial_workers: 0
    idle_timeout_minutes: 5
    docker:
      image: rayproject/ray-ml:latest-gpu
      container_name: ray-container
      pull_before_run: true
      run_options: ["--gpus='all' --shm-size=8g"]
    provider:
      type: aws
      region: us-west-2
      availability_zone: us-west-2b
    auth:
      ssh_user: ubuntu
    head_node:
      InstanceType: p3.2xlarge
      ImageId: ami-084f81625fbc98fa4
      SubnetIds:
        - subnet-04907b6ca5184e259
      IamInstanceProfile:
        Arn: arn:aws:iam::627702899116:instance-profile/ray-head-v1
    worker_nodes:
      InstanceType: p3.2xlarge
      ImageId: ami-084f81625fbc98fa4
      SubnetIds:
        - subnet-04907b6ca5184e259
      IamInstanceProfile:
        Arn: arn:aws:iam::627702899116:instance-profile/ray-worker-v1

AWS configuration

There are three big things you have to configure correctly on the AWS side or things won't work:

  1. Service quotas (how many vCPUs you can request in the specified region)
  2. The cluster subnet
  3. IAM policies/roles for the head and worker nodes

Service quotas

Requesting an increase in your service quota is pretty straightforward.
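
If you prefer the CLI over the console, you can do this through the Service Quotas API; a rough sketch (the quota code is a placeholder you would look up first for the instance family you're requesting):

    # find the quota code for the relevant EC2 instance family
    aws service-quotas list-service-quotas --service-code ec2
    # request the increase (quota code and desired value are placeholders)
    aws service-quotas request-service-quota-increase --service-code ec2 \
        --quota-code <QUOTA_CODE> --desired-value 32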

VPC subnet

When you create a subnet for your cluster (or use an existing one), you need to make sure it is in the right availability zone (your cluster cannot be spread across multiple zones), and if your application needs internet access, you'll need to add a route to an internet gateway in the subnet's route table.
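
For example, with the AWS CLI the internet-gateway part looks roughly like this (a sketch; all of the resource IDs are placeholders for your own VPC, subnet, and route table):

    # create an internet gateway and attach it to the cluster's VPC
    aws ec2 create-internet-gateway
    aws ec2 attach-internet-gateway --internet-gateway-id igw-xxxxxxxx --vpc-id vpc-xxxxxxxx
    # add a default route so instances in the subnet can reach the internet
    aws ec2 create-route --route-table-id rtb-xxxxxxxx \
        --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxxxxxx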

IAM policies and roles

This one can cause confusion for two reasons. The first is that the default plugin configuration works fine without this step: by default the head node is assigned an automatically created role called ray-autoscaler-v1, which has an attached policy authorizing it to launch EC2 instances, and no IAM role is assigned to the worker nodes. If you then create just a ray-worker-v1 role (e.g. a role with an attached policy that grants S3 access) and leave the head node role set to ray-autoscaler-v1, the worker nodes will fail to launch, because the policies automatically attached to ray-autoscaler-v1 do not grant the iam:PassRole action, which the head node needs in order to assign worker IAM roles. The solution is to create a new policy, ray-ec2-launcher, with permissions to both launch EC2 instances and pass IAM roles, and either attach that new policy to the ray-autoscaler-v1 role or attach it to a new role called ray-head-v1, as shown in the config above; a rough sketch is given below. This procedure is described in more detail in this issue in the Ray repo.
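
A minimal sketch of what the ray-ec2-launcher policy needs to allow (the account ID is a placeholder, and the actual policy in the linked Ray issue is more fine-grained than the broad ec2:* used here):

    aws iam create-policy --policy-name ray-ec2-launcher --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {"Effect": "Allow", "Action": "ec2:*", "Resource": "*"},
        {"Effect": "Allow", "Action": "iam:PassRole",
         "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/ray-worker-v1"}
      ]
    }'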

The second source of confusion is the following: if you create the ray-ec2-launcher policy by starting in the JSON editor, pasting the raw policy description from the link above, and editing the AWS account number and region, you aren't quite done. You then need to switch over to the visual editor and resolve the warnings before going on to the next step.

[Screenshot: IAM visual editor showing warnings to resolve]

If you don't do this, then you'll still have permissions issues when you try to launch worker nodes.

Sorry for the spurious issue!
