[Bug] Hydra-Ray plugin can only launch jobs on the head node #1583
Comments
Thanks @samuelstanton, could you paste your launcher config here so I can try to repro with a minimal example? You can run something like the following.
Ok, this is a stripped-down version of the config, leaving out most of the unchanged defaults. I'm using Docker as well, but it'd be good to start with something simple to begin with. Depending on how you have the AWS CLI configured, you may not need to specify the head and worker subnets. If you don't, make sure that the head and worker nodes are both in the same subnet, and that the subnet is in the same availability zone you specified in the config.
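As a rough illustration only, a stripped-down launcher config along these lines might look something like the sketch below. The key paths mirror the Ray autoscaler YAML that the plugin wraps and may differ between plugin versions; the cluster name, region, instance types, and subnet IDs are all placeholders, not the actual values from this setup:

```yaml
# Illustrative sketch only; key paths follow the Ray autoscaler YAML schema
# and may differ slightly between plugin versions. IDs and regions are placeholders.
defaults:
  - override hydra/launcher: ray_aws

hydra:
  launcher:
    ray:
      cluster:
        cluster_name: hydra-ray-demo          # hypothetical name
        min_workers: 1
        max_workers: 1
        provider:
          type: aws
          region: us-east-1                   # placeholder region
          availability_zone: us-east-1a       # must match the subnet's AZ
        head_node:
          InstanceType: m5.large
          SubnetIds: [subnet-0123456789abcdef0]   # optional if your AWS CLI defaults suffice
        worker_nodes:
          InstanceType: m5.large
          SubnetIds: [subnet-0123456789abcdef0]   # same subnet (and AZ) as the head node
```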
Thank you. I will take a look.
So the minimal example from my PR seems to be working... might be a Docker thing :(
Not sure what's up with the idle Ray processes though...
After some more debugging, I got it working with Docker! Apparently I was mistaken about the nature of the problem. Closing the issue, but I'll mention some of the important things to get right in case others run into similar problems. Below I've given the package versions I'm using, a stripped-down version of my config, and a discussion of some of the important aspects of configuration on the AWS side. Some of this information is redundant with the plugin docs, but I've included it for completeness.
Dependency versions
Plugin config
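As a rough sketch, the Docker-related portion of a Ray cluster config typically looks something like the following; the image and container name below are placeholders, not the ones used here, and the exact nesting under the Hydra launcher config may differ by plugin version:

```yaml
# Sketch of the docker section of a Ray cluster config (placeholders only).
docker:
  image: rayproject/ray:latest        # placeholder; use an image with your app's dependencies
  container_name: ray_container
  pull_before_run: true
  run_options: []                     # extra flags passed to `docker run`, if any
```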
AWS configuration
There are three big things you have to have configured right on the AWS side or things won't work correctly.
Service quotas
Requesting an increase in your service quota is pretty straightforward.
VPC subnet
When you create a subnet for your cluster (or use an existing one), you need to make sure it is in the right availability zone (your cluster cannot be spread across multiple zones), and if your application needs internet access you'll need to add a gateway to the subnet routing table.
IAM policies and roles
This one can cause confusion for two reasons. The first is that the default plugin configuration works fine without this step, because by default the head node is assigned an automatically created role. The second is that if you create the role yourself, it also needs enough permissions to launch the worker instances; if you don't set that up, you'll still have permissions issues when you try to launch worker nodes. Sorry for the spurious issue!
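For illustration, attaching a role you created yourself (rather than relying on the auto-created one) is done through the head node's config; the instance type and ARN below are placeholders:

```yaml
# Sketch: pointing the head node at an explicitly created instance profile.
# The attached role needs enough permissions to launch the worker instances.
head_node:
  InstanceType: m5.large
  IamInstanceProfile:
    Arn: arn:aws:iam::123456789012:instance-profile/my-ray-head-profile   # placeholder ARN
```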
🐛 Bug
Description
Multi-node support for the Hydra-Ray plugin is currently broken. Although the Ray client is able to launch worker nodes, the worker nodes are unable to connect to the head node's Ray runtime because the head IP address is not communicated to the worker nodes. The plugin config schema is based on the Ray examples, such as this one:
https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full-legacy.yaml
The problem is in `worker_start_ray_commands`. Currently these commands are set as follows (just like the Ray example):
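For reference, in the linked example those commands look roughly like this (exact flags may differ slightly between Ray versions):

```yaml
worker_start_ray_commands:
  - ray stop
  # $RAY_HEAD_IP is expected to be set in the environment, but the plugin never sets it
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
```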
The issue is that `RAY_HEAD_IP` is never actually set as an environment variable on the worker node. In the Ray codebase, it looks like they use an `export RAY_HEAD_IP=x.x.x.x` prefix whenever they call the worker startup commands:
https://github.com/ray-project/ray/blob/42565d5bbe56412402e0126cf41fb92f6a411810/python/ray/autoscaler/_private/autoscaler.py#L604
It seems like that code path is no longer getting called when the cluster is launched with the plugin, so the worker node can't identify the head node at all. I've verified that I can manually connect the worker node to the head node.
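In effect, the worker commands only work if they run with the head IP already exported, as in the sketch below. The IP shown is a placeholder; hard-coding it in the config ahead of time is exactly the chicken-and-egg problem described under Additional context:

```yaml
worker_start_ray_commands:
  - ray stop
  # placeholder head IP; this is what the autoscaler's export prefix effectively provides
  - export RAY_HEAD_IP=10.0.0.42; ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
```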
Checklist
To reproduce
I haven't made a stripped-down version of the program yet, and the failure is silent: all the jobs run sequentially on the head node. The worker node is there, but it isn't being used. You can see this happening if you connect to the Ray dashboard on the head node.
Expected Behavior
The worker nodes need some way of identifying the head node IP address so they can connect to the head node runtime. Once that happens, the jobs should be distributed to the worker nodes instead of all being executed by the head node.
System information
Additional context
I'm willing to make a PR to fix the issue. The main problem I'm having is that I can't come up with a good way to determine the head node address and update the config before calling `ray up`. There's something of a chicken-and-egg problem in the way it's currently structured: you can't call `ray up` until you know the head node IP address and have updated `worker_start_ray_commands`, but you won't know the head node IP address until after you've called `ray up` (unless you are statically assigning the IP address). Obviously within `ray up` itself this isn't an issue, but it doesn't seem like the API exposes that kind of capability (as far as I know).