[Core] Failed to register worker. Slurm - srun #30012
Comments
I am monitoring this issue, so let me know if I need to convey more information or run or save any log files. Thanks, Ray team!
I encountered a similar issue. In my case, the _temp_dir was not writable from the cluster job, so I pointed it at a writable path, e.g. /home/username.
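For reference, a minimal sketch of that workaround, assuming the job has a writable directory such as /home/username/ray_tmp (the exact path here is illustrative, not taken from the comment above):

```python
import ray

# _temp_dir redirects Ray's session directory (sockets, logs, plasma store)
# to a location the Slurm job can actually write to.
ray.init(num_cpus=10, _temp_dir="/home/username/ray_tmp")
```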
This has not solved it for me. It is still dependent on the number of CPUs initiated. Again, I allocate 112 CPUs with Slurm.

######### THIS WORKS - 10 CPUS #########

######### THIS DOES NOT WORK - 20 CPUS #########
2022-11-06 18:21:12,545 INFO worker.py:1518 -- Started a local Ray instance.
RayContext(dashboard_url=None, python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.22', 'raylet_ip_address': '172.20.6.22', 'redis_address': None, 'object_store_address': '/scratch/session_2022-11-06_18-21-09_979887_243927/sockets/plasma_store', 'raylet_socket_name': '/scratch/session_2022-11-06_18-21-09_979887_243927/sockets/raylet', 'webui_url': None, 'session_dir': '/scratch/session_2022-11-06_18-21-09_979887_243927', 'metrics_export_port': 62846, 'gcs_address': '172.20.6.22:54835', 'address': '172.20.6.22:54835', 'dashboard_agent_listen_port': 52365, 'node_id': '1673a663ec1b58a2c8924abaf65438d32990ce10ee323cf216260d47'})
(raylet) [2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See

######### AFTER FAILURE: CONTENT OF /scratch/session_2022-11-06_18-21-09_979887_243927/logs/raylet.err #########
[2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See

######### AFTER FAILURE: LAST LINES OF /scratch/session_2022-11-06_18-21-09_979887_243927/logs/raylet.out #########
[2022-11-06 18:21:12,456 I 243968 243968] (raylet) accessor.cc:608: Received notification for node id = 1673a663ec1b58a2c8924abaf65438d32990ce10ee323cf216260d47, IsAlive = 1
I also hit this issue:
This is still open. If anyone from the Ray team tells me how to proceed, I can help troubleshoot.
Pinging this issue to keep it open.
Interesting, I keep running into variations of this as well on our slurm cluster. @marianogabitto Are you saying that for you this happens even if you have the node allocated exclusively, i.e. no other slurm jobs running on that physical machine? I have so far noticed two things that contribute to crashes for me:
This is purely speculation of course, but maybe at least my anecdotal data can be helpful in figuring this out.
Hi @mgerstgrasser,
Could someone with the problem zip the entire directory of log files and upload it? The error is, I think, a red herring: the worker cannot register because (I think) something is wrong with the main node. Either the connect messages are not being delivered, or the raylet process is crashing, or it is sharing fate with the dashboard agent, which is crashing. Perhaps some of the other log files have some hints.
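For anyone wanting to do that, a small sketch, assuming the default session directory under /tmp/ray (if you pass a custom _temp_dir, use the session_dir that ray.init() prints instead):

```python
import shutil

# session_latest is a symlink Ray maintains to the most recent session directory.
log_dir = "/tmp/ray/session_latest/logs"

# Creates ray_logs.zip in the current working directory, ready to attach to the issue.
shutil.make_archive("ray_logs", "zip", log_dir)
```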
Matt,
Here it is, with the right file format.
Could you also show what is in your environment (conda list / pip list)?
Sure, here are "conda list" and "pip list" from the same conda environment, called ray. It is a fresh installation.
Just in case an additional data point is helpful, I previously also uploaded logs once: #21479 (comment)
Note that
include_dashboard=False ???? |
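If the fragment above is suggesting to try Ray without the dashboard, a sketch of what that would look like (my reading of the exchange, not a confirmed fix):

```python
import ray

# include_dashboard=False starts Ray without the dashboard process, one way to
# test whether the dashboard side is involved in the crash discussed above.
ray.init(num_cpus=20, include_dashboard=False)
```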
I see you have conda's |
Matt, thanks!
Installed a fresh environment with ray-all, same problem. I have not solved it.
conda config --env --add channels conda-forge
conda create -n rayserve python=3.10
:( |
I just tried to reproduce your exact script on my cluster, and for me there was no error. (But plenty of occurrences of seemingly the same error otherwise, at random.) I ran it, and I'm attaching my logs.
@mgerstgrasser Quick question: how long does it take from the moment you run ray.init() until it finishes? Mine depends on the number of CPUs. Is that the case for you too?
If I run it without num_cpus set:
10 cpus:
20 cpus:
48 cpus:
256 cpus (on a 48 core machine):
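For anyone wanting to report the same data point, a trivial way to time the call (nothing here is specific to this issue):

```python
import time
import ray

start = time.perf_counter()
ray.init(num_cpus=20)  # vary num_cpus to compare startup times
print(f"ray.init() took {time.perf_counter() - start:.1f} s")
```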
This issue has collected a number of different reports; I think I saw these:
When commenting "same issue", please be more specific: what exactly did you try, on what hardware, and what happened. |
Would it make sense and be possible to have Ray emit a more detailed error message here? One thing that makes it hard for me to report the problem in more detail is that the main log only shows the "Failed to register worker" and "IOError: [RayletClient] Unable to register worker with raylet. No such file or directory" messages. And it's impossible for me to figure out what other logs or information could be relevant. At the very least, could Ray log which file the "no such file or directory" message refers to?
I usually am able to grep for the "No such file or directory" message in the log directory.
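A rough Python equivalent of that grep, assuming the default log location /tmp/ray/session_latest/logs (substitute the session_dir printed by ray.init() if yours differs):

```python
from pathlib import Path

log_dir = Path("/tmp/ray/session_latest/logs")  # adjust for a custom _temp_dir

# Print every log line mentioning the failure, prefixed with the file it came from.
for log_file in sorted(log_dir.glob("*")):
    if not log_file.is_file():
        continue
    for line in log_file.read_text(errors="replace").splitlines():
        if "No such file or directory" in line:
            print(f"{log_file.name}: {line}")
```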
Thanks, it works for me after modifying it to ray.init(num_cpus=1).
Hey people, I had the same error as mentioned earlier; however, after I did "pip uninstall grpcio" and then reinstalled it with conda ("conda install grpcio"), the error was gone, and it's working fine for me now! Peace.
the |
Can confirm that downgrading to Ray 1.13 fixes the issue.
I also have this issue, working from a Singularity container. Is there anything I should try to copy or print to help move the debugging forward?
Also, for me it seems to be a problem related to some persistent files? My installation worked for a while, but when I tried to better utilize the node I was working on (adding more GPUs and CPUs), I started getting the error. Now, reverting to the old code, the error persists. I am now only using 1 GPU and 10 CPUs, so I doubt it's related to the number of OpenBLAS threads.
Hi mattip, my version of Ray is 2.4.0, and it was installed using pip. I do not have grpcio installed.
@kaare-mikkelsen there should be a grpcio package in the output of pip list; Ray depends on it.
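A quick way to see which grpcio the environment actually resolves (just a diagnostic sketch, not something asked for above):

```python
import grpc  # the grpcio distribution installs the "grpc" module

print(grpc.__version__)  # e.g. 1.51.3
print(grpc.__file__)     # shows whether it comes from a pip or conda site-packages
```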
Hi again
In case it makes any difference, my version of the error message is: [2023-08-07 18:07:41,009 E 55830 55830] core_worker.cc:191: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
update: changing the yml file to this:
seems to make the problem go away? I can try adding the extra libraries back and see when it breaks again, though it will have to wait until tomorrow. Oh, and for the record: pip list is still reporting ray as version 2.4.0 and grpcio as 1.51.3 (and running the old container still gives the same error, so it doesn't seem like there's anything outside the container causing the problem).
new status: This conda environment works:
This one doesn't:
So now it looks like neptune is partly to blame? My version of that is 1.4.1. Edit: |
Just in case this is helpful to anyone else running into this: for me, it seems I've been able to work around the problem by putting a try-except block around ray.init(), as sketched below. Since I've started doing this I've not seen any failed slurm jobs. But I did see in my logs that the except-block was triggered on occasion, so I think the underlying issue still occurs sometimes.
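A minimal sketch of that kind of guard, assuming the intent is to retry ray.init() after a failed start (the retry loop and its parameters are my assumption; the comment above doesn't spell them out):

```python
import time
import ray

def init_ray_with_retry(max_attempts=3, wait_seconds=30, **ray_init_kwargs):
    """Call ray.init(), retrying a few times since registration fails only sporadically."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ray.init(**ray_init_kwargs)
        except Exception as exc:  # e.g. the RayletClient registration IOError
            print(f"ray.init() attempt {attempt} failed: {exc}")
            ray.shutdown()  # clean up any half-started local instance
            if attempt == max_attempts:
                raise
            time.sleep(wait_seconds)

context = init_ray_with_retry(num_cpus=20)
```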
The top solution on Stack Overflow solved this issue for me:
(though I don't think it solves the OP's problem as they have already set |
this solves it for me. |
I have |
@mgerstgrasser Had high hopes for your solution, but for me I get an unhandled segfault:
Any ideas? :(
@jtlz2 The only other thing I remember that helped for me, and that wasn't mentioned recently, was making sure I allocate at least 2 cores for the slurm job (even if I set a lower num_cpus in ray.init).
Version 1.2.0 is too low for me; I updated to 2.0.0, and there is no error.
@Pkulyte The former - I don't recall if it was |
Facing a similar error on my Mac M1 Pro, but it happens only when running x86 in Rosetta mode. On QEMU it works fine!
After many attempts, I managed to resolve this issue by increasing the limit for open threads and files using the commands |
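The exact commands aren't shown above; on Linux that is typically done with ulimit in the shell (my assumption). For anyone who wants to inspect or raise the limits from inside the job script instead, a rough Python-side sketch:

```python
import resource

# Current soft/hard limits for open files and user processes/threads.
nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
nproc_soft, nproc_hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"open files: {nofile_soft}/{nofile_hard}, processes: {nproc_soft}/{nproc_hard}")

# Raise the soft limits up to their hard caps before calling ray.init().
resource.setrlimit(resource.RLIMIT_NOFILE, (nofile_hard, nofile_hard))
resource.setrlimit(resource.RLIMIT_NPROC, (nproc_hard, nproc_hard))
```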
This is a very weird bug indeed. Reasons for such a setup: my jobs are very short-lived and a single Ray instance is not able to utilize all cores, but when I run a cluster and run all workers on the cluster, the head Ray instance hangs very soon after start, blocking the whole pipeline. Running multiple instances per host never caused any problems until today, when I upgraded the RAM on one of the machines from 64 to 128 GB; suddenly, when I launch two instances on this upgraded machine, they both fail with the same error. Initially, I blamed faulty RAM (given it's a memory-related error), but replacing it did not solve the problem. It is very puzzling where this error comes from and why it happens only on the single machine that worked totally fine prior to the RAM upgrade, while other machines with the same setup never had issues with any RAM size. My env:
What happened + What you expected to happen
I can't start ray.
I allocate a node in a Slurm cluster using:
srun -n 1 --exclusive -G 1 --pty bash
This allocates a node with 112 CPUs and 4 GPUs.
Then, within python:
import ray
ray.init(num_cpus=20)
2022-11-03 21:17:31,752 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[2022-11-03 21:18:32,436 E 251378 251378] core_worker.cc:149: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
On a different test:
import ray
ray.init(ignore_reinit_error=True, num_cpus=10)
2022-11-03 21:19:01,734 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
RayContext(dashboard_url='127.0.0.1:8265', python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.24', 'raylet_ip_address': '172.20.6.24', 'redis_address': None, 'object_store_address': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/plasma_store', 'raylet_socket_name': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630', 'metrics_export_port': 62537, 'gcs_address': '172.20.6.24:49967', 'address': '172.20.6.24:49967', 'dashboard_agent_listen_port': 52365, 'node_id': '0debcceedbef73619ccc8347450f5086693743e005ba9e907ae98c78'})
Versions / Dependencies
DEPENDENCIES:
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50)
[GCC 10.3.0] on linux
RAY VERSION: 2.0.1
INSTALLATION: pip install -U "ray[default]"
grpcio: 1.43.0
Reproduction script
import ray
ray.init(num_cpus=20)
Issue Severity
High: It blocks me from completing my task.