[ray local cluster] nodes marked as uninitialized #39565
Comments
cc @rickyyx can you follow up with the investigation?
Can you tell us what this is exactly for? |
Hey @jmakov - will you be able to get any |
Didn't see anything exciting happening there, only
|
+1, same issue for me. Even with systems on the cloud (a 3rd-party cloud, not AWS/GCS/Azure). Opened all ports; sometimes it gets connected, sometimes it shows uninitialized. |
cc @gvspraveen could someone from the cluster team help take a look? I believe this is more relevant to the cluster launcher as of now rather than the actual autoscaling logic, since "running everything manually works". |
@rickyyx not to mention that manually starting ray doesn't work and the cluster launcher doesn't work either. Wondering how ray works at all for anybody. As someone who has used ray for more than a year, every other release breaks a core part. |
cc @anyscalesam can you triage this issue with @gvspraveen? |
@jmakov do you happen to remember if this was working for you on a previous version of Ray, and if so which one? |
So the cluster launcher worked for me for the past 2+ years using a local cluster (without Docker, just a conda env). I think it was 2.6.0 before I made the mistake of upgrading, if I remember correctly. Think I'll just start writing my own tests and run them before every upgrade. |
|
The above log is for |
This issue is still present in ray 2.7.1 |
Let us know if any other details are required |
Actually, when I reproduced the issue earlier, I had forgotten to open all the ports. After opening all ports, I wasn't able to reproduce the issue. @jmakov or @ajaichemmanam if you're able to reproduce the issue and you have time, it would potentially be very helpful if you could amend your YAML file as follows:
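The exact snippet suggested here isn't preserved in this thread. A minimal sketch of the kind of amendment being asked for (the output path matches the file checked later in the thread; everything else is an assumption, not the original suggestion):

```yaml
# Hypothetical sketch only; the exact amendment is not preserved in this thread.
# The idea would be to capture the worker-side output of `ray start` in a file
# that can then be shared for debugging.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 > /tmp/ray_worker_output.txt 2>&1
```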
And share the |
@architkulkarni I've added it; ls /tmp/ray_worker_output.txt gives:
ls: cannot access '/tmp/ray_worker_output.txt': No such file or directory |
Worker nodes are almost always stuck as launching/uninitialized, or there is no cluster status at all. |
Hi @MatteoCorvi, Glad to hear you were able to get this working, but I'm a little confused about your solution. How is this different from simply installing ray version 2.22? |
Hi @jacksonjacobs1, |
Interesting, thanks. It would be fantastic if a Ray dev from the cluster team could comment on why newer versions of ray seem to break on-prem cluster launching & cleanup. @anyscalesam What would be your recommendation for resolving this issue? |
I'm running ray on AWS EC2 instances with the same issue. |
Hi @Tipmethewink, are you using existing EC2 instances (equivalent to an on-prem cluster) or using ray cluster launcher to provision new EC2 instances? |
I'm using the cluster launcher: ray up cluster.yaml.
|
I ran into this same painful issue today; after retrieving the code and logs from the ray dashboard, I finally got my worker node started.
Hope these findings help you |
Hey folks, I ran into a similar issue when trying to set up an "On Prem" 1-click cluster via Lambda Labs. I could start the cluster successfully when not using a docker image. But as soon as I switched to the docker image, I ran into the uninitialized issue. I would get something like
Here's the config.yaml I was using:

```yaml
cluster_name: test-cluster
upscaling_speed: 1.0
docker:
    container_name: basic-ray-ml-image
    image: rayproject/ray-ml:latest-gpu
    pull_before_run: true
provider:
    type: local
    head_ip: scrubbed_ip
    worker_ips:
        - scrubbed_ip
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/keypair
min_workers: 1
max_workers: 1
setup_commands:
    - pip install ray[default]
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
```

I managed to fix this by
Maybe this helps some people! I feel like this stems from poor logging / error reporting from the other nodes. |
Additionally, I don't see any logging or log file. Even though the instruction from
I don't see such a file on any of the nodes
|
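If that file isn't being written, Ray's default session log directory may still be worth checking. A minimal sketch, assuming the default temp directory hasn't been overridden with --temp-dir:

```bash
# Generic check of Ray's default log location on a node; assumes the default
# temp dir (/tmp/ray) is in use. monitor.log is written on the head node.
ls /tmp/ray/session_latest/logs/
tail -n 50 /tmp/ray/session_latest/logs/monitor.log
```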
I want to flag that the worker node always gets stuck in the Pending: uninitialized state when trying to run
A few things to note:
Here's an example of my cluster.yaml
If I try to force-start the worker node I get this
ray monitor my_cluster.yaml looks like
|
Anything to help debug things would be very useful! |
Fixed! I've managed to |
Thanks @olly-writes-code . Were you able to successfully tear down and re-initialize your cluster? I ask because your ray version incompatibility issue was definitely not the case for me - I pulled the pre-built rayproject/ray docker image onto all nodes. On the first attempt, the cluster spun up without any issues. It was only after running |
Interesting. It seems that running |
Local node provider is not actively maintained. |
Ahh, damn, okay. Is it correct to say that Ray is not recommended for running training on a cluster of GPU machines provided by someone other than AWS, Azure, or GCP? |
You can use Ray in any on-prem or cloud environment, but I'd recommend figuring out another way to orchestrate the process of pulling images and running Ray start. |
I see. Thanks for the clarification @DmitriGekhtman :) For future users, it would be great if the deprecation of local clusters could be made clear in the docs. |
This doc is clearly false now https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/on-premises.html
|
Ah, I think I might have spoken too soon on another thread about this functionality being officially deprecated. However, based on my experience with the Ray project, this functionality is not very well maintained. You will likely get better results by "setting up manually", i.e. running ray start on each of the machines in the cluster. If you have ssh access to each of the machines, you can write a for loop to do this. |
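For concreteness, a minimal sketch of that "for loop" approach; the IPs and SSH user below are placeholders, and the ray start flags simply mirror the commands used in the cluster YAML earlier in this thread:

```bash
# Minimal sketch of the manual "ray start on each machine" approach described above.
# Assumes Ray is already installed on every node and passwordless SSH is set up;
# HEAD_IP, WORKER_IPS, and the ssh user are placeholders for your own environment.
HEAD_IP=10.0.0.1
WORKER_IPS=(10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5)

# Start the head node.
ssh ubuntu@"$HEAD_IP" "ray stop; ray start --head --port=6379 --dashboard-host=0.0.0.0"

# Point each worker at the head node.
for ip in "${WORKER_IPS[@]}"; do
  ssh ubuntu@"$ip" "ray stop; ray start --address=$HEAD_IP:6379"
done
```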
Yes, that is my experience, having worked with the code. The docs are badly maintained and the code is very flaky. We have burned 3 days battling this code, only to find out it is poorly supported. Due to this experience and the confusion over the state of deprecation, etc., we will not use Ray (or Anyscale) anytime soon. |
cc @anyscalesam on the poor UX here. My recommendation to the maintainers would be to officially deprecate the local node provider. The best-maintained and most popular method for using Ray in the OSS ecosystem is to run Ray on Kubernetes using KubeRay. The Anyscale product is also quite stable and reliable (as it's used by paying customers of Anyscale). |
Right, but the sense is that due to the incentives at play, support for Ray OSS will weaken to encourage people to pay for Anyscale, precisely as described above. This leaves a very bitter taste, which means we won't even try Ray and, therefore, would never consider Anyscale. |
Yes, it is all very confusing, since local bare metal is the most "obvious" way of running ray to people from a supercomputing background (like openMPI). I've written a suite of ansible scripts to start up ray in an openMPI-like fashion, and to figure out sub-GPU chunks and custom resources as well (https://github.com/flyingfalling/pyraygputils). However, it is VERY hackish (I am polishing it in parallel with some other related projects). It would be very unfortunate if bare-metal ray were deprecated... |
Same issue with "worker nodes uninitialized". It changes the initialization of each worker from parallel to serial. The last line in the log file monitor.log shows: At https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/autoscaler.py#L717-L719, PR-5903 adds back the multi-threaded processing. |
What happened + What you expected to happen
Running ray up ray.yaml, I'd expect that all of the 4 nodes would be set up and join the cluster, as I've set min_workers: 4. ray monitor ray.yaml is showing the nodes as uninitialized, though.
Versions / Dependencies
ray 2.6.4
python 3.9.18
manjaro
Reproduction script
ray.yaml
Issue Severity
High: It blocks me from completing my task.