[Core] [Bug] core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error #21479

Closed
cxy990729 opened this issue Jan 8, 2022 · 31 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), QS (Quantsight triage label), triage (Needs triage (eg: priority, bug/not-bug, and owning component)), windows

Comments

@cxy990729

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

When I run it, I get core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. I can't figure it out.

Versions / Dependencies

Windows 10

Reproduction script

。。。

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@cxy990729 cxy990729 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 8, 2022
@clarkzinzow clarkzinzow changed the title core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error [Core] [Bug] core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error Jan 8, 2022
@clarkzinzow
Contributor

Could you add a small script that reproduces the issue?

@pcmoritz pcmoritz added the QS Quantsight triage label label Jan 25, 2022
@czgdp1807 czgdp1807 self-assigned this Jan 25, 2022
@czgdp1807
Contributor

Hey @cxy990729, is there a reproduction script that triggers this? Please share it in the issue description. Thank you.

@mattip
Contributor

mattip commented Mar 6, 2022

It has been over a month with no further information on this problem. Without a reproducer (including how the reporter installed Ray and which versions of Python, Ray, and Windows were used), we cannot fix the problem.

@pcmoritz pcmoritz closed this as completed Mar 8, 2022
@czgdp1807 czgdp1807 reopened this Mar 11, 2022
@czgdp1807
Contributor

I am re-opening this issue as I have reproduced it several times (though it happens randomly and takes a fair amount of luck and patience).

Some assumptions (which I believe should be true):

  1. The worker pool in worker_pool.cc launches default_worker.py as the worker process, at least for Python scripts.
  2. default_worker.py sends a registration request via an RPC.
  3. So, starting default_worker.py and then sending the registration request takes some time, which can vary between runs (say, registration happens within 10ms in one run of ray.init and takes 100ms in another).
  4. The maximum allowed time is RayConfig::worker_register_timeout_seconds (=30).

Now consider the situation where default_worker.py takes more than 30s to start up and send the registration request to the worker pool. In this case, the worker pool thinks default_worker.py is hanging, while default_worker.py itself quick-exits because it failed to register within 30s.

I noticed that when the error message in the OP appears, default_worker.py hangs at the following line, possibly because it quick-exited.

worker.core_worker = ray._raylet.CoreWorker(

I am trying to confirm this hypothesis.
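
If this hypothesis holds, one rough way to test it on an affected machine is to raise the registration timeout before Ray starts. This is just a sketch; it assumes RayConfig entries such as worker_register_timeout_seconds can be overridden through RAY_-prefixed environment variables set before ray.init():

import os

# Assumption: RayConfig values can be overridden via RAY_-prefixed environment
# variables. Set it before ray.init() so the spawned raylet and workers inherit it.
os.environ["RAY_worker_register_timeout_seconds"] = "120"

import ray

ray.init()

If the error stops appearing with the larger timeout, that would support the slow-startup explanation.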

@rkooo567
Contributor

@iycheng I think this must be related to what we observed (a very slow RPC to the internal KV).

@czgdp1807
Contributor

In addition, after thinking about it for a while, IMO in the case of default_worker.py we should exit from Python instead of doing QuickExit in the line below.

@czgdp1807
Contributor

diff --git a/python/ray/_raylet.pyx b/python/ray/_raylet.pyx
index bd0ac678c..c52f5ef1e 100644
--- a/python/ray/_raylet.pyx
+++ b/python/ray/_raylet.pyx
@@ -1175,6 +1175,9 @@ cdef class CoreWorker:
         return WorkerID(
             CCoreWorkerProcess.GetCoreWorker().GetWorkerID().Binary())
 
+    def get_registered_status(self):
+        return CCoreWorkerProcess.GetCoreWorker().GetRegisteredStatus()
+
     def should_capture_child_tasks_in_placement_group(self):
         return CCoreWorkerProcess.GetCoreWorker(
             ).ShouldCaptureChildTasksInPlacementGroup()
diff --git a/python/ray/includes/libcoreworker.pxd b/python/ray/includes/libcoreworker.pxd
index dddccb4fb..fec0152ed 100644
--- a/python/ray/includes/libcoreworker.pxd
+++ b/python/ray/includes/libcoreworker.pxd
@@ -155,6 +155,7 @@ cdef extern from "ray/core_worker/core_worker.h" nogil:
         CJobID GetCurrentJobId()
         CTaskID GetCurrentTaskId()
         CNodeID GetCurrentNodeId()
+        c_bool GetRegisteredStatus()
         c_bool GetCurrentTaskRetryExceptions()
         CPlacementGroupID GetCurrentPlacementGroupId()
         CWorkerID GetWorkerID()
diff --git a/python/ray/worker.py b/python/ray/worker.py
index 95ec0c4d9..c05bec7f9 100644
--- a/python/ray/worker.py
+++ b/python/ray/worker.py
@@ -1677,6 +1677,10 @@ def connect(
         startup_token,
     )
 
+    if not worker.core_worker.get_registered_status():
+        logger.warning("CoreWorkerProcess {} wasn't registered successfully.".format(os.getpid()))
+        sys.exit(1)
+
     # Notify raylet that the core worker is ready.
     worker.core_worker.notify_raylet()
 
diff --git a/src/ray/core_worker/core_worker.cc b/src/ray/core_worker/core_worker.cc
index d301f2865..46db31a49 100644
--- a/src/ray/core_worker/core_worker.cc
+++ b/src/ray/core_worker/core_worker.cc
@@ -120,9 +120,11 @@ CoreWorker::CoreWorker(const CoreWorkerOptions &options, const WorkerID &worker_
     RAY_LOG(ERROR) << "Failed to register worker " << worker_id << " to Raylet. "
                    << raylet_client_status;
     // Quit the process immediately.
-    QuickExit();
+    registered_ = false;
+    return ;
   }
 
+  registered_ = true;
   connected_ = true;
 
   RAY_CHECK(assigned_port >= 0);
diff --git a/src/ray/core_worker/core_worker.h b/src/ray/core_worker/core_worker.h
index d12a41b89..15ae67dd5 100644
--- a/src/ray/core_worker/core_worker.h
+++ b/src/ray/core_worker/core_worker.h
@@ -127,6 +127,8 @@ class CoreWorker : public rpc::CoreWorkerServiceHandler {
 
   NodeID GetCurrentNodeId() const { return NodeID::FromBinary(rpc_address_.raylet_id()); }
 
+  const bool GetRegisteredStatus() const { return registered_;  } 
+
   const PlacementGroupID &GetCurrentPlacementGroupId() const {
     return worker_context_.GetCurrentPlacementGroupId();
   }
@@ -1037,6 +1039,8 @@ class CoreWorker : public rpc::CoreWorkerServiceHandler {
   /// Whether or not this worker is connected to the raylet and GCS.
   bool connected_ = false;
 
+  bool registered_ = false;
+
   // Client to the GCS shared by core worker interfaces.
   std::shared_ptr<gcs::GcsClient> gcs_client_;
 
diff --git a/src/ray/raylet/worker_pool.cc b/src/ray/raylet/worker_pool.cc
index 669351028..586258954 100644
--- a/src/ray/raylet/worker_pool.cc
+++ b/src/ray/raylet/worker_pool.cc
@@ -490,10 +490,6 @@ void WorkerPool::MonitorStartingWorkerProcess(const Process &proc,
                   ? "The process is still alive, probably it's hanging during start."
                   : "The process is dead, probably it crashed during start.");
 
-      if (proc.IsAlive()) {
-        proc.Kill();
-      }
-
       PopWorkerStatus status = PopWorkerStatus::WorkerPendingRegistration;
       process_failed_pending_registration_++;
       bool found;

I applied the above diff and it seems that ray.init hangs because default_worker.py is unable to register itself within RayConfig::worker_register_timeout_seconds seconds. So one of the following, or a combination of them, is slow during some calls of ray.init (a rough check of point 1 is sketched below):

  1. Starting python default_worker.py takes longer than usual.
  2. The registration RPC from default_worker.py to worker_pool.cc is slow.
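
The check is just a sketch, not part of Ray: time how long a cold Python process takes to import ray, which is most of what default_worker.py has to do before it can register.

import subprocess
import sys
import time

# Time a fresh Python process importing ray; if this alone approaches
# worker_register_timeout_seconds, slow worker startup is the likely culprit.
start = time.time()
subprocess.run([sys.executable, "-c", "import ray"], check=True)
print(f"Cold start + 'import ray' took {time.time() - start:.1f}s")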

@OopsYouDiedE

Running python muzero.py gives the same error.

@marianogabitto

I am having the same issue when I am in a node instantiated by the sun command.

I want to add that this does not happen if I install ray 0.6.3.

@rkooo567
Contributor

This must have started happening in a fairly recent version (I believe 1.10 or 1.11). But I don't believe I have seen a report from the latest version (1.12). @marianogabitto what version of Ray are you using?

@marianogabitto

0.6.3

@mattip
Contributor

mattip commented May 17, 2022

@marianogabitto two questions:

  1. which version of ray are you running when you see the problem, before downgrading to 0.6.3?
  2. Can you provide a reproducer script? I don't know what you mean by "when I am in a node instantiated by the sun command"

@marianogabitto

  1. 1.20
  2. Sorry, that was a typo. The script is quite simple
    conda activate env1
    python

within python:
import ray
ray.init()

@CedricVandelaer

CedricVandelaer commented May 25, 2022

I am having the same issue on a single node. Using Windows 11, Ray version 1.12.1

import ray
ray.init(include_dashboard=False)

Gives me:
core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error

@mattip
Contributor

mattip commented May 25, 2022

@CedricVandelaer could you run the reproducer, then upload the log files in %TEMP%\ray\session*?

These include the path to your python and ray, so please look through them before uploading. You may need to first clear out the %TEMP%\ray\session* directories so it is clear which files to upload. Each run of ray creates a new directory there.
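
If it helps, here is a small sketch for collecting those logs (it assumes the default layout of one session_* directory per run under %TEMP%\ray):

import glob
import os
import shutil
import tempfile

# Find the most recently modified Ray session directory under %TEMP%\ray.
sessions = glob.glob(os.path.join(tempfile.gettempdir(), "ray", "session_*"))
latest = max(sessions, key=os.path.getmtime)

# Bundle it into a zip for attaching to the issue; please review the contents
# (they can include local paths) before uploading.
archive = shutil.make_archive("ray_session_logs", "zip", latest)
print("Created", archive)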

@peytondmurray
Contributor

@mattip I ran into this issue this weekend with

import ray
ray.init()

Here are my logs: session_latest.tar.gz

@mattip
Contributor

mattip commented Jun 6, 2022

@peytondmurray your issue is different. You did not include the dashboard_agent.log. The raylet.out log explains that the dashboard agent shares fate with the raylet: if the dashboard agent crashes it will bring down the raylet process. Since the dashboard_agent.log is missing, I can only assume the agent crashed on startup before writing a log file, which killed the raylet process. The issue here is that a worker process (not the raylet itself) failed to register.

@phseidl

phseidl commented Jun 7, 2022

For me, the issue was solved by changing get_num_cpus() in ray._private.utils to return fewer than all available CPUs,
e.g. cpu_count = max(1, multiprocessing.cpu_count() - 5) at line 499.
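
A less invasive way to get the same effect (assuming the problem really is Ray starting one worker per detected CPU) is to cap the CPU count through ray.init() instead of patching the installed package:

import multiprocessing

import ray

# Same idea as patching get_num_cpus(), but without editing site-packages:
# start Ray with fewer worker slots than the machine reports.
ray.init(num_cpus=max(1, multiprocessing.cpu_count() - 5))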

@tuln128

tuln128 commented Jun 9, 2022

ray.init(include_dashboard=False)

Hello CedricVandelaer,
Have you been able to fix the problem? I have run into the same issue while running Ray on a Linux machine, and it returned exactly the same error message as the one you posted. It would be greatly appreciated if you could share how you troubleshot the problem.
Many thanks,

@mgerstgrasser
Contributor

Just to comment here that I've been seeing what I think is the same issue recently, i.e. the "Failed to register worker" error message.

[2022-08-13 12:10:24,314 E 244345 244345] core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

It happens randomly in between 1% and 10% of runs I've had recently. Initially I had a more detailed error message, which was as follows:

2022-08-12 20:41:34,418 ERROR services.py:1488 -- Failed to start the dashboard: Failed to start the dashboard
Failed to read dashboard log: [Errno 2] No such file or directory: '/tmp/ray/session_2022-08-12_20-41-13_082466_153671/logs/dashboard.log'
2022-08-12 20:41:34,419 ERROR services.py:1489 -- Failed to start the dashboard
Failed to read dashboard log: [Errno 2] No such file or directory: '/tmp/ray/session_2022-08-12_20-41-13_082466_153671/logs/dashboard.log'
Traceback (most recent call last):
  File "/n/home04/mgerstgrasser/.conda/envs/super-main/lib/python3.10/site-packages/ray/_private/services.py", line 1451, in start_dashboard
    with open(dashboard_log, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2022-08-12_20-41-13_082466_153671/logs/dashboard.log'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/n/home04/mgerstgrasser/.conda/envs/super-main/lib/python3.10/site-packages/ray/_private/services.py", line 1462, in start_dashboard
    raise Exception(err_msg + f"\nFailed to read dashboard log: {e}")
Exception: Failed to start the dashboard
Failed to read dashboard log: [Errno 2] No such file or directory: '/tmp/ray/session_2022-08-12_20-41-13_082466_153671/logs/dashboard.log'
[2022-08-12 20:42:36,375 E 153671 153671] core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

I then added include_dashboard=False to my ray.init() call hoping that would fix the issue, but it merely got rid of the first few lines of the error message; the crashes are still happening. This is all on Ray 1.13.0, on Linux.

@mattip
Contributor

mattip commented Aug 14, 2022

For everyone reporting here: saying "me too" is insufficient to help solve the problem. Please describe in as much detail as you can:

  • your machine and operating system
  • are you running ray on a local or remote file system
  • what version of ray
  • what script are you running

@mgerstgrasser
Contributor

@mattip Sorry about that, that was all the information I had at the time, and I figured the additional error message might be helpful, since I hadn't seen it in the thread before. I am running this on a cluster where I can't just access /tmp on the machines the jobs run on to grab the log files, so there wasn't much else I could provide. I have in the meantime modified my slurm scripts to copy the contents of /tmp/ray back to a shared filesystem at the end of the jobs, so I have some log files now. See attached.

This is from ray 1.13.0 on a Linux cluster. I think it's CentOS based but fairly customized, and the filesystem that ray was supposed to write logs to is Lustre, but /tmp is local, I believe. The script is really just some boilerplate for parsing command line arguments and writing them into a config dict, and then calling ray.init() followed by tune.run() on an rllib trainer.

I hope this helps! Let me know if I can help in any other way.

logs.zip

@mgerstgrasser
Contributor

Just to comment again to share that this seems to be fixed for me by setting num_cpus in ray.init() to whatever value I requested from the slurm scheduler (see the sketch after the links below). See also the discussion and Stack Overflow post linked below. Not sure if this fixes the issue for everyone, given that not everyone seems to be running into this in connection with slurm, but it may be worth a try.

https://discuss.ray.io/t/ray-on-slurm-problems-with-initialization/6361/4
https://stackoverflow.com/questions/72464756/ray-on-slurm-problems-with-initialization/72492737#72492737?newreg=b351ecf69c804824ac3c578f38e513ad
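
A minimal sketch of that fix for other slurm users; the SLURM_CPUS_PER_TASK lookup below is just one way to read the allocated CPU count and assumes you request CPUs via --cpus-per-task, so adjust it to your setup.

import os

import ray

# Tell Ray how many CPUs the slurm allocation actually grants, instead of
# letting it autodetect every core on the node.
ray.init(num_cpus=int(os.environ.get("SLURM_CPUS_PER_TASK", "1")))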

@richardliaw richardliaw added the core Issues that should be addressed in Ray Core label Oct 29, 2022
@h-vetinari

I see this happening in conda-forge/ray-packages-feedstock#78 on Linux when calling python -c 'import ray; ray.init()'. This is obviously running on a local filesystem. The image is ubuntu-latest. I haven't run that PR many times yet, but it looks reproducible; I'm happy to test patches on the feedstock to see if they resolve the issue.

More context:

+ python -c 'import ray; ray.init()'
2022-11-09 03:15:44,821	ERROR services.py:1403 -- Failed to start the dashboard: Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-09_03-15-42_976642_37224/logs/dashboard.log:
2022-11-09 03:15:44,821	ERROR services.py:1404 -- Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-09_03-15-42_976642_37224/logs/dashboard.log:
Traceback (most recent call last):
  File "/home/conda/feedstock_root/build_artifacts/ray-packages_1667960091933/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.9/site-packages/ray/_private/services.py", line 1389, in start_api_server
    raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-09_03-15-42_976642_37224/logs/dashboard.log:
2022-11-09 03:15:44,826	WARNING services.py:1922 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67104768 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=1.29gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2022-11-09 03:15:44,980	INFO worker.py:1528 -- Started a local Ray instance.
[2022-11-09 03:15:55,017 E 37224 37224] core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

@mattip
Contributor

mattip commented Nov 13, 2022

I am going to close this, since it seems that this error message is too generic to point to a specific problem. If you arrive here searching for issues with this message, please note that the "Failed to register worker" message probably means that the main raylet process has crashed, and you should carefully examine the other log files (dashboard*.*, raylet*.*) for hints about what actually went wrong.
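
For example, on Linux something along these lines (assuming the default /tmp/ray layout and the session_latest symlink) prints the tail of the logs that usually contain the real cause:

import glob
import os

# Print the last lines of the raylet and dashboard logs from the most recent
# Ray session; these usually explain why the raylet actually went down.
log_dir = "/tmp/ray/session_latest/logs"
for path in sorted(glob.glob(os.path.join(log_dir, "raylet*")) +
                   glob.glob(os.path.join(log_dir, "dashboard*"))):
    with open(path, errors="replace") as f:
        lines = f.readlines()
    print(f"----- {path} (last 10 lines) -----")
    print("".join(lines[-10:]))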

@mattip mattip closed this as completed Nov 13, 2022
@marianogabitto

Hi Matt, I have opened another issue and I am willing to troubleshoot. Would it be possible for you or someone from the Ray team to help me troubleshoot it? I pasted the contents of my err and log files there.

#30012

@marianogabitto

I am a heavy user of ray and I need to be able to work with it on the cluster. Help would be appreciated.

@jpgard

jpgard commented Dec 25, 2022

See #30012 (comment)

@pd2871

pd2871 commented Jan 21, 2023

For me, the reason for this error was that I was using two different notebooks for Ray initialization. As soon as Ray was initialized in one notebook, trying to initialize Ray in the other notebook caused the error.

So, to solve the error, I shut down one of the notebooks.
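
If both notebooks actually need Ray at the same time, another option (a sketch, not specific to this bug) is to start one shared instance from a terminal with ray start --head and have each notebook attach to it rather than starting its own:

import ray

# Each notebook attaches to the instance started by `ray start --head`
# instead of trying to launch a second local cluster.
ray.init(address="auto")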

@tbukic
Contributor

tbukic commented Jan 23, 2023

I'm getting this with the newest version of Ray (run locally with Ubuntu 22.04.1 on WSL): https://s3-us-west-2.amazonaws.com/ray-wheels/master/25d3d529f5985b43ec44ab4d82c31780048ce457/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl

raylet.err:

[2023-01-23 11:06:05,497 E 10060 10123] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).

pip freeze | grep grpcio
grpcio==1.51.1

From poetry.lock I get the following grpcio requirements for Ray:

grpcio = [
    {version = ">=1.42.0", markers = "python_version >= \"3.10\" and sys_platform != \"darwin\""},
    {version = ">=1.42.0,<=1.49.1", markers = "python_version >= \"3.10\" and sys_platform == \"darwin\""},
]

Even when I downgrade grpcio to 1.49.1 or 1.42.0 the problem persists.

@mattip
Contributor

mattip commented Jan 23, 2023

@tbukic please open a new issue. This one is closed. When you do so please be sure to fill in the required information.

I am going to lock this thread. If you get here via search for similar issues, please open a new issue instead of trying to comment here.

@ray-project ray-project locked as resolved and limited conversation to collaborators Jan 23, 2023