[Core] Failed to register worker. Slurm - srun · #30012

Open
marianogabitto opened this issue Nov 4, 2022 · 65 comments
Labels
bug (Something that is supposed to be working; but isn't) · core (Issues that should be addressed in Ray Core) · P2 (Important issue, but not time-critical)

Comments

@marianogabitto

What happened + What you expected to happen

I can't start ray.

I allocate a node on a Slurm cluster using:

srun -n 1 --exclusive -G 1 --pty bash

This allocates a node with 112 cpus and 4 gpus.

Then, within python:

import ray
ray.init(num_cpus=20)
2022-11-03 21:17:31,752 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[2022-11-03 21:18:32,436 E 251378 251378] core_worker.cc:149: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

In a different test:
import ray
ray.init(ignore_reinit_error=True, num_cpus=10)
2022-11-03 21:19:01,734 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
RayContext(dashboard_url='127.0.0.1:8265', python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.24', 'raylet_ip_address': '172.20.6.24', 'redis_address': None, 'object_store_address': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/plasma_store', 'raylet_socket_name': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630', 'metrics_export_port': 62537, 'gcs_address': '172.20.6.24:49967', 'address': '172.20.6.24:49967', 'dashboard_agent_listen_port': 52365, 'node_id': '0debcceedbef73619ccc8347450f5086693743e005ba9e907ae98c78'})

(raylet) [2022-11-03 21:19:31,639 E 252725 252765] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.
2022-11-03 21:20:00,798 WARNING worker.py:1829 -- The node with node id: 0debcceedbef73619ccc8347450f5086693743e005ba9e907ae98c78 and address: 172.20.6.24 and node name: 172.20.6.24 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.

Versions / Dependencies

DEPENDENCIES:
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50)
[GCC 10.3.0] on linux

RAY VERSION: 2.0.1
INSTALLATION: pip install -U "ray[default]"
grpcio: 1.43.0

Reproduction script

import ray
ray.init(num_cpus=20)

Issue Severity

High: It blocks me from completing my task.

@marianogabitto marianogabitto added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 4, 2022
@clarng clarng added core Issues that should be addressed in Ray Core P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 4, 2022
@marianogabitto
Author

I am monitoring this issue, so let me know if I need to provide more information or run and save any log files.

Thanks, Ray team!

@michaelfeil

I encountered a similar issue. In my case, the _temp_dir was not writable from the cluster job. I pointed it at a writable path instead, e.g. under /home/username:
ray.init(num_cpus=num_cpus, num_gpus=num_gpus, _temp_dir=f"/home/mfeil/tmp", include_dashboard=False, ignore_reinit_error=True)
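A small sketch of how one might verify that the chosen directory is writable before handing it to ray.init (the helper and candidate paths below are hypothetical, not from this comment):

import os
import tempfile

import ray

def first_writable(candidates):
    # Return the first candidate directory we can actually create files in.
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
            with tempfile.NamedTemporaryFile(dir=path):
                pass  # creating (and auto-deleting) a file proves write access
            return path
        except OSError:
            continue
    raise RuntimeError("no writable temp dir found for Ray")

temp_dir = first_writable(["/scratch/ray_tmp", os.path.expanduser("~/ray_tmp")])
ray.init(num_cpus=10, _temp_dir=temp_dir, include_dashboard=False, ignore_reinit_error=True)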

@marianogabitto
Author

This has not solved it for me. It still depends on the number of CPUs I request.

Again, I allocate with slurm 112 cpus.

#########. THIS WORKS - 10 CPUS #########
import ray
ray.init(include_dashboard=False, num_cpus=10, num_gpus=4, _temp_dir=f"/scratch/", ignore_reinit_error=True)

#########. THIS DOES NOT WORK - 20 CPUS #########
import ray
ray.init(include_dashboard=False, num_cpus=20, num_gpus=4, _temp_dir=f"/scratch/", ignore_reinit_error=True)

2022-11-06 18:21:12,545 INFO worker.py:1518 -- Started a local Ray instance.

RayContext(dashboard_url=None, python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.22', 'raylet_ip_address': '172.20.6.22', 'redis_address': None, 'object_store_address': '/scratch/session_2022-11-06_18-21-09_979887_243927/sockets/plasma_store', 'raylet_socket_name': '/scratch/session_2022-11-06_18-21-09_979887_243927/sockets/raylet', 'webui_url': None, 'session_dir': '/scratch/session_2022-11-06_18-21-09_979887_243927', 'metrics_export_port': 62846, 'gcs_address': '172.20.6.22:54835', 'address': '172.20.6.22:54835', 'dashboard_agent_listen_port': 52365, 'node_id': '1673a663ec1b58a2c8924abaf65438d32990ce10ee323cf216260d47'})

(raylet) [2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

#########. AFTER FAILURE: CONTENT OF /SCRATCH/session_2022-11-06_18-21-09_979887_243927/logs/raylet.err

[2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

#########. AFTER FAILURE: LAST LINES OF /SCRATCH/session_2022-11-06_18-21-09_979887_243927/logs/raylet.out

[2022-11-06 18:21:12,456 I 243968 243968] (raylet) accessor.cc:608: Received notification for node id = 1673a663ec1b58a2c8924abaf65438d32990ce10ee323cf216260d47, IsAlive = 1
[2022-11-06 18:21:12,555 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244037, the token is 0
[2022-11-06 18:21:12,556 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244038, the token is 1
[2022-11-06 18:21:12,557 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244039, the token is 2
[2022-11-06 18:21:12,559 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244040, the token is 3
[2022-11-06 18:21:12,560 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244041, the token is 4
[2022-11-06 18:21:12,563 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244042, the token is 5
[2022-11-06 18:21:12,565 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244043, the token is 6
[2022-11-06 18:21:12,566 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244044, the token is 7
[2022-11-06 18:21:12,567 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244045, the token is 8
[2022-11-06 18:21:12,569 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244046, the token is 9
[2022-11-06 18:21:12,571 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244047, the token is 10
[2022-11-06 18:21:12,573 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244048, the token is 11
[2022-11-06 18:21:12,579 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244049, the token is 12
[2022-11-06 18:21:12,587 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244050, the token is 13
[2022-11-06 18:21:12,593 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244051, the token is 14
[2022-11-06 18:21:12,595 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244052, the token is 15
[2022-11-06 18:21:12,596 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244053, the token is 16
[2022-11-06 18:21:12,598 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244054, the token is 17
[2022-11-06 18:21:12,599 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244055, the token is 18
[2022-11-06 18:21:12,600 I 243968 243968] (raylet) worker_pool.cc:447: Started worker process with pid 244056, the token is 19
[2022-11-06 18:21:21,446 W 243968 243986] (raylet) metric_exporter.cc:207: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2022-11-06 18:21:41,680 I 243968 243992] (raylet) object_store.cc:35: Object store current usage 8e-09 / 157.089 GB.
[2022-11-06 18:21:42,293 I 243968 243968] (raylet) node_manager.cc:599: New job has started. Job id 01000000 Driver pid 243927 is dead: 0 driver address: 172.20.6.22
[2022-11-06 18:21:42,293 I 243968 243968] (raylet) worker_pool.cc:636: Job 01000000 already started in worker pool.
[2022-11-06 18:21:42,447 W 243968 243968] (raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip , id 0
[2022-11-06 18:21:42,458 W 243968 244019] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip . id 0
[2022-11-06 18:21:42,458 E 243968 244019] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

@Qinghao-Hu
Contributor

I also hit this issue:

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: BaseTrainer.as_trainable.<locals>.TrainTrainable
	actor_id: 4a0c820541ef7de7e95f887801000000
	namespace: cddfd55e-aa4a-4208-baa0-ccf6483f1ec5
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.140.24.68 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.

@marianogabitto
Author

This is still open. If anyone from the Ray team tells me how to proceed, I can help troubleshoot.

@marianogabitto
Author

Pinging this issue to keep it open.

@mgerstgrasser
Contributor

Interesting, I keep running into variations of this as well on our slurm cluster.

@marianogabitto Are you saying that for you this happens even if you have the node allocated exclusively, i.e. no other slurm jobs running on that physical machine?

I have so far noticed two things that contribute to crashes for me:

  1. Having two slurm jobs on the same physical machine each starting their own completely separate ray instance. - But it sounds like this can be ruled out in your case!
  2. Something to do with how slurm pins processes to CPU cores. While I've never been able to reproduce crashes deterministically, I have noticed that they happen much, much more often with fewer CPU cores allocated. E.g. if I run a job with sbatch -c1 in slurm (i.e. just a single CPU core), it will fail 50% of the time even if I do ray.init(num_cpus=1). On the other hand, sbatch -c2 will make this work most of the time, and it seems to me that sbatch -c2 and ray.init(num_cpus=1) crashes less frequently than sbatch -c2 and ray.init(num_cpus=2), i.e. leaving a little bit of "buffer" in the number of CPU cores helps. Is it possible that the way slurm pins processes to CPU cores interferes with how Ray likes to manage CPU cores? E.g. Ray tries to start two worker processes on different physical cores, but then Slurm puts them on the same core, and that causes problems? (A diagnostic sketch for this follows below.)

This is purely speculation of course, but maybe at least my anecdotal data can be helpful in figuring this out.
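A hedged diagnostic sketch for point 2 above (assuming Linux, where os.sched_getaffinity reports the cores a process is pinned to): compare the CPU set Slurm actually gave the job with what gets passed to ray.init, and leave a spare core as buffer.

import os

import ray

allowed = os.sched_getaffinity(0)  # set of CPU ids this process may run on (Linux-only call)
print(f"Slurm pinned this job to {len(allowed)} cores: {sorted(allowed)}")

# Ask Ray for fewer cores than the affinity set, keeping one spare as a buffer.
ray.init(num_cpus=max(1, len(allowed) - 1), include_dashboard=False)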

@marianogabitto
Author

marianogabitto commented Nov 13, 2022

Hi @mgerstgrasser ,
thanks for reaching out.
I am allocating the node exclusively to myself. I run "srun -N 1 --exclusive -G 1 --pty bash". I allocate 112 CPUs, 512 GB RAM and 1 A100 GPU. I also allocate running time, but it is not relevant here. The good thing is that I can reproduce the issue deterministically.
1. This is not the case and can be ruled out.
2. I always try to start ray with ~20 cpus.
Thanks,
M

@mattip
Contributor

mattip commented Nov 13, 2022

Could someone with the problem zip the entire directory of log files and upload it? The error is, I think, a red herring: the worker cannot register because something is wrong with the main node. Either the connect messages are not being delivered, the raylet process is crashing, or it is sharing fate with the dashboard agent, which is crashing. Perhaps some of the other log files have some hints.

@marianogabitto
Author

marianogabitto commented Nov 13, 2022

Matt,
here it is. I am running:

(ray) [mg@n246 ~]$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(include_dashboard=False, num_cpus=30, _temp_dir=f"/scratch/")

@marianogabitto
Author

Here it is with the right file format

session.zip

@mattip
Contributor

mattip commented Nov 13, 2022

Could you also show what is in conda list?

@marianogabitto
Author

marianogabitto commented Nov 13, 2022

Sure, here are "conda list" and "pip list" from the same conda environment, called ray. It is a fresh installation.

conda_list.txt
pip_list.txt

@mgerstgrasser
Contributor

Just in case an additional data point is helpful, I previously also uploaded logs once: #21479 (comment)
I later thought I had fixed the issue by setting num_cpu, but it turned out it merely made it less frequent.

@mattip
Contributor

mattip commented Nov 13, 2022

Note that raylet.out stops before the startup cycle really finishes. The dashboard_agent.log is missing: for some reason it apparently was not created.

(raylet) metric_exporter.cc:207: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
(raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip , id 0
(raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip . id 0
(raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.

@marianogabitto
Author

include_dashboard=False ????

@mattip
Contributor

mattip commented Nov 13, 2022

I see you have conda's ray-core==2.0.1, which was uploaded only last week. Could you try using ray-serve? Perhaps ray-core is missing needed functionality.

@marianogabitto
Author

marianogabitto commented Nov 13, 2022

Matt,
I have just re-run it with the dashboard, so you now have the log file. Trying ray-serve.

session2.zip

Thanks !

@mattip
Contributor

mattip commented Nov 13, 2022

conda install ray-serve

@marianogabitto
Author

marianogabitto commented Nov 13, 2022

Installed a fresh environment with ray-all; same problem. It is not solved.

conda config --env --add channels conda-forge
conda config --set channel_priority strict

conda create -n rayserve python=3.10
conda install ray-all

@marianogabitto
Author

:(

@mgerstgrasser
Contributor

I just tried to reproduce your exact script on my cluster, and for me there was no error. (But plenty of occurrences of seemingly the same error otherwise, at random.)

I ran salloc -n1 --exclusive -t 1:00:00 -p test and then more or less your reproduction script, with a sleep(600) after the ray.init(num_cpus=20) just in case. No error, even if I increase num_cpus, even beyond the number of physical cores in the machine... No GPU on that machine, and only 48 CPU cores though.
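Roughly, the test script looked like this (a sketch reconstructed from the description above; the exact file isn't attached):

import time

import ray

ray.init(num_cpus=20)  # also tried larger values, beyond the physical core count
time.sleep(600)        # keep the driver alive in case the raylet dies late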

I'm attaching my conda list output just in case there's a package difference that's causing the problem for you. If I can help in any other way let me know. I'm keen on getting to the bottom of this too.

conda.txt

@marianogabitto
Author

@mgerstgrasser Quick question: how long does it take from the moment you run ray.init() until it finishes? Mine depends on the number of CPUs. Is that your case too?

@mgerstgrasser
Contributor

If I run it without the sleep() after the ray.init(), it takes 15-30 seconds, but that doesn't seem to correlate with the number of CPUs. If anything it got faster the more often I tried in a row, but independently of the number of CPU cores I put into ray.init(). See below for the first few measurements; I did a few more after that with no big difference, it mostly took around 15-20s.

10 cpus:

$ time python test.py
2022-11-13 18:02:07,586 INFO worker.py:1518 -- Started a local Ray instance.

real    0m31.615s
user    0m3.551s
sys     0m0.890s

20 cpus:

$ time python test.py
2022-11-13 18:02:56,960 INFO worker.py:1518 -- Started a local Ray instance.

real    0m19.792s
user    0m3.610s
sys     0m0.891s

48 cpus:

$ time python test.py
2022-11-13 18:03:34,155 INFO worker.py:1518 -- Started a local Ray instance.

real    0m20.639s
user    0m4.028s
sys     0m1.370s

256 cpus (on a 48 core machine)

$ time python test.py
2022-11-13 18:04:13,104 INFO worker.py:1518 -- Started a local Ray instance.

real    0m17.411s
user    0m3.984s
sys     0m1.310s

@mattip
Contributor

mattip commented Apr 4, 2023

This issue has collected a number of different reports, I think I saw these:

  • the head node dies
  • worker nodes fail to register with the proper head node when more than one is running
  • worker nodes die when starting up

All these can apparently lead to the log message "Failed to register worker".

When commenting "same issue", please be more specific: what exactly did you try, on what hardware, and what happened.

@mgerstgrasser
Contributor

This issue has collected a number of different reports, I think I saw these:

* the head node dies

* worker nodes fail to register with the proper head node when more than one is running

* worker nodes die when starting up

All these can apparently lead to the log message "Failed to register worker"

When commenting "same issue", please be more specific: what exactly did you try, on what hardware, and what happened.

Would it make sense and be possible to have Ray emit a more detailed error message here? One thing that makes it hard for me to report the problem in more detail is that the main log only shows the "Failed to register worker" and "IOError: [RayletClient] Unable to register worker with raylet. No such file or directory" messages. And it's impossible for me to figure out what other logs or information could be relevant.

At the very least, could Ray log which file the "no such file or directory" message refers to?

@mattip
Contributor

mattip commented Apr 4, 2023

I am usually able to grep for the "No such file or directory" message in the log directory.

@ZixuWang

Following up here, after hacking in the logs, it looks like at least in my case, this might be related to too many OpenBLAS threads. (This makes sense, because it was happening on a machine with a large number of CPUs, 96 to be exact.)

In my case, I could see messages in the log files similar to the ones described here. I'm guessing that when running ray jobs that use many CPUs, there is some kind of issue with too many threads being used that prevents the workers from registering properly.

The solution in my case was to increase the number of pending signals, by running

ulimit -u 127590

This resolved the error and at least allowed running the full Ray pipeline. I can't say whether this is a good idea at a system level -- maybe someone else can advise about the pros and cons of this approach -- but it worked for me, ymmv.

Thanks, it works for me after modifying it to ray.init(num_cpus=1).
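For the OpenBLAS-thread theory quoted above, another knob sometimes used (a hedged sketch, not something verified in this thread) is to cap the BLAS/OpenMP thread pools through environment variables before numpy or ray are imported:

import os

# Standard OpenBLAS/OpenMP environment variables; set them before the heavy
# imports so that any worker processes spawned later inherit the limits.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")

import ray

ray.init(num_cpus=1)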

@David2265

Hey people,

I had the same error as mentioned earlier; however, I did "pip uninstall grpcio" and then reinstalled it using conda ("conda install grpcio").

The error is gone, and it's working fine for me now! Peace.

@dlwh
Contributor

dlwh commented Jul 14, 2023

The pip uninstall grpcio / conda install grpcio trick didn't work for me. I'm also having issues under Slurm (outside of Slurm it seems to be OK).

@dlwh
Contributor

dlwh commented Jul 14, 2023

Can confirm that downgrading to Ray 1.13 fixes the issue.

@kaare-mikkelsen

I also have this issue, working from a singularity container. Is there anything I should try to copy / print to help move the debugging forward?

@kaare-mikkelsen

Also, for me it seems to be a problem related to some persistent files? My installation worked for a while, but when I tried to better utilize the node I was working on (adding more GPUs and CPUs), I started getting the error. Now, reverting to the old code, the error persists. I am only using 1 GPU and 10 CPUs now, so I doubt it's related to the number of OpenBLAS threads.

@mattip
Contributor

mattip commented Aug 7, 2023

  1. Post what version of ray and how you installed it.
  2. Post what versions of grpcio you are using and how you installed it.

@kaare-mikkelsen

Hi Mattip

My version of ray is 2.4.0, and it was installed using pip. I do not have grpcio installed.

@mattip
Contributor

mattip commented Aug 7, 2023

@kaare-mikkelsen there should be a grpcio package in the output of pip list, it is a requirement for ray 2.4.0 (hopefully it will be gone for ray 2.7). Also: did something change in your container configuration or in the networking where that container runs? Perhaps firewall/address/port changes?

@kaare-mikkelsen

kaare-mikkelsen commented Aug 7, 2023

Hi again
You are right, I do have grpcio - I thought I could check just by importing. According to pip list, my version is 1.51.3.
I don't recall any changes, other than possibly adding pandas to the mix (installed with conda). Unfortunately I have not been using proper version control in that project, so I can't do a proper version roll-back.
The conda environment in my container was made with this yml file (using docker://rocm/dev-ubuntu-22.04:5.3.2-complete as base image):


channels: 
  - conda-forge
dependencies:
  - mne
  - mne-bids
  - neptune-client
  - python=3.9
  - pytorch-lightning
  - matplotlib
  - numpy
  - scikit-learn
  - optuna
  - tabulate
  - pandas
  - pip
  - pip:
    - ray
    - --extra-index-url https://download.pytorch.org/whl/rocm5.2
    - charset-normalizer==3.0.1
    - numpy==1.24.1
    - requests==2.28.2
    - torch==1.13.1+rocm5.2
    - torchaudio==0.13.1+rocm5.2
    - torchvision==0.14.1+rocm5.2

In case it makes any difference, my version of the error message is

[2023-08-07 18:07:41,009 E 55830 55830] core_worker.cc:191: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

@kaare-mikkelsen

kaare-mikkelsen commented Aug 7, 2023

update:

changing the yml file to this:


channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - scikit-learn
  - optuna
  - tabulate
  - pandas
  - pip
  - pip:
    - ray
    - --extra-index-url https://download.pytorch.org/whl/rocm5.2
    - charset-normalizer==3.0.1
    - numpy==1.24.1
    - requests==2.28.2
    - torch==1.13.1+rocm5.2
    - torchaudio==0.13.1+rocm5.2
    - torchvision==0.14.1+rocm5.2

seems to make the problem go away? I can try adding the extra libraries back and see when it breaks again, though it will have to wait until tomorrow.

Oh, and for the record: pip list is still reporting ray as version 2.4.0 and grpcio as 1.51.3

(and running the old container still gives the same error, so it doesn't seem like there's anything outside the container causing the problem).

@kaare-mikkelsen

kaare-mikkelsen commented Aug 8, 2023

new status:

This conda environment works:

channels:
  - conda-forge
dependencies:
  - pytorch-lightning
  - python=3.9
  - scikit-learn
  - optuna
  - tabulate
  - pandas
  - pip
  - pip:
    - ray
    - --extra-index-url https://download.pytorch.org/whl/rocm5.2
    - charset-normalizer==3.0.1
    - numpy==1.24.1
    - requests==2.28.2
    - torch==1.13.1+rocm5.2
    - torchaudio==0.13.1+rocm5.2
    - torchvision==0.14.1+rocm5.2

This one doesn't:

channels:
  - conda-forge
dependencies:
  - pytorch-lightning
  - neptune-client
  - python=3.9
  - scikit-learn
  - optuna
  - tabulate
  - pandas
  - pip
  - pip:
    - ray
    - --extra-index-url https://download.pytorch.org/whl/rocm5.2
    - charset-normalizer==3.0.1
    - numpy==1.24.1
    - requests==2.28.2
    - torch==1.13.1+rocm5.2
    - torchaudio==0.13.1+rocm5.2
    - torchvision==0.14.1+rocm5.2

So now it looks like neptune is partly to blame? My version of that is 1.4.1.
Separately, it seems that I have separate versions of grpcio in conda and pip. It's the same version for both containers, 1.56.2 in conda and 1.51.3 in pip.

Edit:
I've tried installing ray from conda-forge instead (so, adding ray-default to the conda-forge list). Then I get a ray version of 2.6.2 and a grpcio version of 1.48.1 from conda list, and an error, without neptune-client involved.

@mgerstgrasser
Contributor

Just in case this is helpful to anyone else running into this, for me it seems I've been able to work around this problem by putting a try-except block around ray.init() and re-trying to start Ray multiple times if it fails, with exponential backoff. So something like the following, and call that instead of ray.init() directly. (Exponential backoff because it still seems to me that this might be related to two instances starting at the same time on the same physical machine, although I've never been able to figure that out with certainty.)

Since I've started doing this I've not seen any failed slurm jobs. But I did see in my logs that the except-block was triggered on occasion, so I think the underlying issue still occurs sometimes.

from time import sleep

import numpy as np
import ray


def try_start_ray(num_cpus, local_mode):
    depth = 0
    while True:
        try:
            print("Trying to start ray.")
            ray.init(num_cpus=num_cpus, local_mode=local_mode, include_dashboard=False)
            break
        except Exception:
            # Random exponential backoff: the upper bound doubles with every failed attempt.
            waittime = np.random.randint(1, 10 * 2**depth)
            print(f"Failed to start ray on attempt {depth+1}. Retrying in {waittime} seconds...")
            sleep(waittime)
            depth += 1

@MonliH

MonliH commented Sep 4, 2023

The top solution on stack overflow solved this issue for me:

Limit the number of CPUs

Ray will launch as many worker processes as your execution node has CPUs (or CPU cores). If that's more than you reserved, slurm will start killing processes.

You can limit the number of worker processes as such:

import ray
ray.init(ignore_reinit_error=True, num_cpus=4)
print("success")

(though I don't think it solves the OP's problem, as they have already set num_cpus)

@kaare-mikkelsen

this solves it for me.

@jtlz2

jtlz2 commented Nov 28, 2023

I have num_cpus=1 already - I don't see how it could be reduced any further...

@jtlz2

jtlz2 commented Nov 28, 2023

@mgerstgrasser I had high hopes for your solution, but for me it still ends in an unhandled segfault:

>>> try_start_ray(1, True)
Trying to start ray.
2023-11-28 13:38:17,224 INFO worker.py:1673 -- Started a local Ray instance.
[2023-11-28 13:38:17,333 E 109 109] core_worker.cc:205: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

Any ideas? :(

@mgerstgrasser
Contributor

@jtlz2 The only other thing I remember that helped for me and that wasn't mentioned recently was making sure I allocate at least 2 cores for the slurm job (even if I set num_cpus=1).

@SMY19999

SMY19999 commented Dec 15, 2023

Version 1.2.0 was too old for me; I updated to 2.0.0 and no longer see the error.
python=3.7, tensorflow=2.11.0

@mgerstgrasser
Contributor

@mgerstgrasser when you say " at least 2 cores for the slurm job" do you refer to "#SBATCH --cpus-per-task=2" or decorator @ray.remote(num_cpus=2) for the task inside the code itself? Thank you!

@Pkulyte The former - I don't recall if it was --cpus-per-task or one of the equivalent slurm options, but it shouldn't make a difference. Note that it still wasn't 100% reliable for me; it just greatly reduced the frequency of failures.
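A minimal sketch of that "buffer" idea (assuming --cpus-per-task is set, in which case Slurm exports SLURM_CPUS_PER_TASK; the fallback value is just an example):

import os

import ray

# Hand Ray one core fewer than the Slurm allocation, with a floor of one.
slurm_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "2"))
ray.init(num_cpus=max(1, slurm_cpus - 1), include_dashboard=False)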

@hasan4791

Facing a similar error on my Mac M1 Pro, but it happens only when running x86 in Rosetta mode. Under QEMU it works fine!


2024-04-08 09:48:30,198	INFO worker.py:1642 -- Started a local Ray instance.
[2024-04-08 09:48:30,434 E 1 28] core_worker.cc:203: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

@BrunoBelucci

After many attempts, I managed to resolve this issue by increasing the limits on open files and threads, using ulimit -n 65535 (as recommended here: https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html) and ulimit -u 65535 (not necessarily the same value; I think the exact value may vary depending on your resource and cluster demands, and you might need to adjust it even higher).

As many others have noted (#36936), Ray creates a large number of threads (~5000 in my case), but only a small portion (~40) are active at any given time. I am not sure if this impacts performance, but it certainly affects the scalability of our cluster. Additionally, I limit the number of CPUs and GPUs to match my actual request by using ray.init(num_cpus=num_cpus, num_gpus=num_gpus). I hope this helps!
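A hedged sketch of applying the same limits from Python with the standard resource module (Linux-only limits; the 65535 target simply mirrors the ulimit values above, and the ray.init arguments are placeholders):

import resource

import ray

def raise_soft_limit(limit, target):
    # Unprivileged processes may only raise the soft limit up to the hard limit.
    soft, hard = resource.getrlimit(limit)
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        resource.setrlimit(limit, (new_soft, hard))

raise_soft_limit(resource.RLIMIT_NOFILE, 65535)  # open files, like `ulimit -n`
raise_soft_limit(resource.RLIMIT_NPROC, 65535)   # processes/threads, like `ulimit -u`

ray.init(num_cpus=8, num_gpus=1)  # match the resources actually requested from the scheduler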

@jjyao jjyao added P2 Important issue, but not time-critical and removed P1.5 Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared labels Nov 12, 2024
@movy

movy commented Dec 14, 2024

This is a very weird bug indeed.
On a dozen machines I have been running multiple Ray instances per host for many months using a simple ray.init(), and it has worked totally flawlessly (only today, after extensive digging, did I learn that such a setup is not recommended, but I personally never had problems with it).

Reason for this setup: my jobs are very short-lived and a single Ray instance is not able to utilize all cores, but when I run a cluster and run all workers on it, the head Ray instance hangs very soon after start, blocking the whole pipeline.

Running multiple instances per host never caused any problems till today, when I upgraded RAM on one of the machines from 64 to 128GB, and suddenly, when I launch two instances on this upgraded machine, they both fail with core_worker.cc:149: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff or, if one of them manages to start, they eventually fail with (raylet) worker_pool.cc:643: Failed to start worker with return value system:12: Cannot allocate memory.

Initially, I blamed faulty RAM (given it's a memory-related error), but replacing it did not solve the problem.

It is very puzzling where this error comes from and why it occurs only on this single machine, which worked totally fine prior to the RAM upgrade, while other machines with the same setup never had issues with any RAM size.

My env:

python --version
Python 3.12.5 (Intel Python)

ray --version
ray, version 3.0.0.dev0
