[Core] [Bug] core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error #21479

Closed
cxy990729 opened this issue Jan 8, 2022 · 31 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), QS (Quantsight triage label), triage (Needs triage (eg: priority, bug/not-bug, and owning component)), windows

Comments

@cxy990729

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

When I run it, I get core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. I can't figure it out.

Versions / Dependencies

Windows 10

Reproduction script

。。。

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@cxy990729 cxy990729 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 8, 2022
@clarkzinzow clarkzinzow changed the title core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error [Core] [Bug] core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error Jan 8, 2022
@clarkzinzow
Contributor

Could you add a small script that reproduces the issue?

@pcmoritz pcmoritz added the QS Quantsight triage label label Jan 25, 2022
@czgdp1807 czgdp1807 self-assigned this Jan 25, 2022
@czgdp1807
Contributor

Hey @cxy990729, is there a reproduction script that triggers this? Please share it in the issue description. Thank you.

@mattip
Contributor

mattip commented Mar 6, 2022

It has been over a month with no further information on this problem. Without a reproducer (including how the reporter installed Ray and which versions of Python, Ray, and Windows were used), we cannot fix the problem.

@pcmoritz pcmoritz closed this as completed Mar 8, 2022
@czgdp1807 czgdp1807 reopened this Mar 11, 2022
@czgdp1807
Contributor

I am re-opening this issue as I have reproduced it several times (though it happens randomly and takes a fair amount of luck and patience).

Some assumptions (which I believe should be true):

  1. The worker pool in worker_pool.cc launches default_worker.py as the worker process, at least for Python scripts.
  2. default_worker.py sends a registration request via an RPC.
  3. So, starting default_worker.py and then sending the registration request takes some time, which can vary between runs (say, registration happens within 10ms in one run of ray.init and takes 100ms in another).
  4. The maximum allowed time is RayConfig::worker_register_timeout_seconds (=30).

Now consider the situation where default_worker.py takes more than 30s to start up and send the registration request to the worker pool. In this case, the worker pool thinks default_worker.py is hanging, while default_worker.py itself quick-exits because it failed to register within 30s.

I noticed that when the error message in the OP appears, default_worker.py hangs at the following line, possibly because it quick-exited.

worker.core_worker = ray._raylet.CoreWorker(

I am trying to confirm this hypothesis.
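
If this hypothesis holds, one rough way to test it on an affected machine is to raise the registration timeout before Ray starts. This is just a sketch; it assumes RayConfig entries such as worker_register_timeout_seconds can be overridden through RAY_-prefixed environment variables set before ray.init():

import os

# Assumption: RayConfig values can be overridden via RAY_-prefixed environment
# variables. Set it before ray.init() so the spawned raylet and workers inherit it.
os.environ["RAY_worker_register_timeout_seconds"] = "120"

import ray

ray.init()

If the error stops appearing with the larger timeout, that would support the slow-startup explanation.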

@rkooo567
Contributor

@iycheng I think this must be related to what we observed (a very slow RPC to the internal KV).

@czgdp1807
Contributor

In addition, after thinking about it for a while, IMO in the case of default_worker.py we should exit from Python instead of doing QuickExit in the line below.

@czgdp1807
Contributor

diff --git a/python/ray/_raylet.pyx b/python/ray/_raylet.pyx
index bd0ac678c..c52f5ef1e 100644
--- a/python/ray/_raylet.pyx
+++ b/python/ray/_raylet.pyx
@@ -1175,6 +1175,9 @@ cdef class CoreWorker:
         return WorkerID(
             CCoreWorkerProcess.GetCoreWorker().GetWorkerID().Binary())
 
+    def get_registered_status(self):
+        return CCoreWorkerProcess.GetCoreWorker().GetRegisteredStatus()
+
     def should_capture_child_tasks_in_placement_group(self):
         return CCoreWorkerProcess.GetCoreWorker(
             ).ShouldCaptureChildTasksInPlacementGroup()
diff --git a/python/ray/includes/libcoreworker.pxd b/python/ray/includes/libcoreworker.pxd
index dddccb4fb..fec0152ed 100644
--- a/python/ray/includes/libcoreworker.pxd
+++ b/python/ray/includes/libcoreworker.pxd
@@ -155,6 +155,7 @@ cdef extern from "ray/core_worker/core_worker.h" nogil:
         CJobID GetCurrentJobId()
         CTaskID GetCurrentTaskId()
         CNodeID GetCurrentNodeId()
+        c_bool GetRegisteredStatus()
         c_bool GetCurrentTaskRetryExceptions()
         CPlacementGroupID GetCurrentPlacementGroupId()
         CWorkerID GetWorkerID()
diff --git a/python/ray/worker.py b/python/ray/worker.py
index 95ec0c4d9..c05bec7f9 100644
--- a/python/ray/worker.py
+++ b/python/ray/worker.py
@@ -1677,6 +1677,10 @@ def connect(
         startup_token,
     )
 
+    if not worker.core_worker.get_registered_status():
+        logger.warning("CoreWorkerProcess {} wasn't registered successfully.".format(os.getpid()))
+        sys.exit(1)
+
     # Notify raylet that the core worker is ready.
     worker.core_worker.notify_raylet()
 
diff --git a/src/ray/core_worker/core_worker.cc b/src/ray/core_worker/core_worker.cc
index d301f2865..46db31a49 100644
--- a/src/ray/core_worker/core_worker.cc
+++ b/src/ray/core_worker/core_worker.cc
@@ -120,9 +120,11 @@ CoreWorker::CoreWorker(const CoreWorkerOptions &options, const WorkerID &worker_
     RAY_LOG(ERROR) << "Failed to register worker " << worker_id << " to Raylet. "
                    << raylet_client_status;
     // Quit the process immediately.
-    QuickExit();
+    registered_ = false;
+    return ;
   }
 
+  registered_ = true;
   connected_ = true;
 
   RAY_CHECK(assigned_port >= 0);
diff --git a/src/ray/core_worker/core_worker.h b/src/ray/core_worker/core_worker.h
index d12a41b89..15ae67dd5 100644
--- a/src/ray/core_worker/core_worker.h
+++ b/src/ray/core_worker/core_worker.h
@@ -127,6 +127,8 @@ class CoreWorker : public rpc::CoreWorkerServiceHandler {
 
   NodeID GetCurrentNodeId() const { return NodeID::FromBinary(rpc_address_.raylet_id()); }
 
+  const bool GetRegisteredStatus() const { return registered_;  } 
+
   const PlacementGroupID &GetCurrentPlacementGroupId() const {
     return worker_context_.GetCurrentPlacementGroupId();
   }
@@ -1037,6 +1039,8 @@ class CoreWorker : public rpc::CoreWorkerServiceHandler {
   /// Whether or not this worker is connected to the raylet and GCS.
   bool connected_ = false;
 
+  bool registered_ = false;
+
   // Client to the GCS shared by core worker interfaces.
   std::shared_ptr<gcs::GcsClient> gcs_client_;
 
diff --git a/src/ray/raylet/worker_pool.cc b/src/ray/raylet/worker_pool.cc
index 669351028..586258954 100644
--- a/src/ray/raylet/worker_pool.cc
+++ b/src/ray/raylet/worker_pool.cc
@@ -490,10 +490,6 @@ void WorkerPool::MonitorStartingWorkerProcess(const Process &proc,
                   ? "The process is still alive, probably it's hanging during start."
                   : "The process is dead, probably it crashed during start.");
 
-      if (proc.IsAlive()) {
-        proc.Kill();
-      }
-
       PopWorkerStatus status = PopWorkerStatus::WorkerPendingRegistration;
       process_failed_pending_registration_++;
       bool found;

I applied the above diff and it seems that ray.init hangs because default_worker.py is unable to register itself within RayConfig::worker_register_timeout_seconds seconds. So one of the following, or a combination of them, is slow during some calls of ray.init (a rough check of point 1 is sketched below):

  1. Starting python default_worker.py takes longer than usual.
  2. The registration RPC from default_worker.py to worker_pool.cc is slow.
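
The check is just a sketch, not part of Ray: time how long a cold Python process takes to import ray, which is most of what default_worker.py has to do before it can register.

import subprocess
import sys
import time

# Time a fresh Python process importing ray; if this alone approaches
# worker_register_timeout_seconds, slow worker startup is the likely culprit.
start = time.time()
subprocess.run([sys.executable, "-c", "import ray"], check=True)
print(f"Cold start + 'import ray' took {time.time() - start:.1f}s")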

@OopsYouDiedE

Running python muzero.py gives the same error.

@marianogabitto

I am having the same issue when I am in a node instantiated by the sun command.

I want to add that this does not happen if I install ray 0.6.3.

@rkooo567
Contributor

This must have started happening in a fairly recent version (I believe 1.10 or 1.11). But I don't believe I have seen a report from the latest version (1.12). @marianogabitto what version of Ray are you using?

@marianogabitto

0.6.3

@mattip
Contributor

mattip commented May 17, 2022

@marianogabitto two questions:

  1. which version of ray are you running when you see the problem, before downgrading to 0.6.3?
  2. Can you provide a reproducer script? I don't know what you mean by "when I am in a node instantiated by the sun command"

@marianogabitto

  1. 1.20
  2. Sorry, that was a typo. The script is quite simple
    conda activate env1
    python

within python:
import ray
ray.init()

@CedricVandelaer

CedricVandelaer commented May 25, 2022

I am having the same issue on a single node. Using Windows 11, Ray version 1.12.1

import ray
ray.init(include_dashboard=False)

Gives me:
core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error

@mattip
Contributor

mattip commented May 25, 2022

@CedricVandelaer could you run the reproducer, then upload the log files in %TEMP%\ray\session*?

These include the path to your python and ray, so please look through them before uploading. You may need to first clear out the %TEMP%\ray\session* directories so it is clear which files to upload. Each run of ray creates a new directory there.
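
If it helps, here is a small sketch for collecting those logs (it assumes the default layout of one session_* directory per run under %TEMP%\ray):

import glob
import os
import shutil
import tempfile

# Find the most recently modified Ray session directory under %TEMP%\ray.
sessions = glob.glob(os.path.join(tempfile.gettempdir(), "ray", "session_*"))
latest = max(sessions, key=os.path.getmtime)

# Bundle it into a zip for attaching to the issue; please review the contents
# (they can include local paths) before uploading.
archive = shutil.make_archive("ray_session_logs", "zip", latest)
print("Created", archive)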

@peytondmurray
Contributor

@mattip I ran into this issue this weekend with

import ray
ray.init()

Here are my logs: session_latest.tar.gz

@mattip
Contributor

mattip commented Jun 6, 2022

@peytondmurray your issue is different. You did not include the dashboard_agent.log. The raylet.out log explains that the dashboard agent shares fate with the raylet: if the dashboard agent crashes it will bring down the raylet process. Since the dashboard_agent.log is missing, I can only assume the agent crashed on startup before writing a log file, which killed the raylet process. The issue here is that a worker process (not the raylet itself) failed to register.

@phseidl

phseidl commented Jun 7, 2022

For me, the issue was solved by changing get_num_cpus() in ray._private.utils to return fewer than all available CPUs,
e.g. cpu_count = max(1, multiprocessing.cpu_count() - 5) at line 499.
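
A less invasive way to get the same effect (assuming the problem really is Ray starting one worker per detected CPU) is to cap the CPU count through ray.init() instead of patching the installed package:

import multiprocessing

import ray

# Same idea as patching get_num_cpus(), but without editing site-packages:
# start Ray with fewer worker slots than the machine reports.
ray.init(num_cpus=max(1, multiprocessing.cpu_count() - 5))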

@tuln128

tuln128 commented Jun 9, 2022

ray.init(include_dashboard=False)

Hello CedricVandelaer,
Have you been able to fix the problem? I have run into the same issue while running Ray on a Linux machine, and it returned exactly the same error message as the one you posted. It would be greatly appreciated if you could share how you troubleshot the problem.
Many thanks,

@mgerstgrasser
Contributor

Just to comment here that I've been seeing what I think is the same issue recently, i.e. the "Failed to register worker" error message.

[2022-08-13 12:10:24,314 E 244345 244345] core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

It happens randomly in between 1% and 10% of runs I've had recently. Initially I had a more detailed error message, which was as follows:

2022-08-12 20:41:34,418 ERROR services.py:1488 -- Failed to start the dashboard: Failed to start the dashboard
Failed to read dashboard log: [Errno 2] No such file or directory: '/tmp/ray/session_2022-08-12_20-41-13_082466_153671/logs/dashboard.log'
2022-08-12 20:41:34,419 ERROR services.py:1489 -- Failed to start the dashboard
Failed to read dashboard log: [Errno 2] No such file or directory: '/tmp/ray/session_2022-08-12_20-41-13_082466_153671/logs/dashboard.log'
Traceback (most recent call last):
  File "/n/home04/mgerstgrasser/.conda/envs/super-main/lib/python3.10/site-packages/ray/_private/services.py", line 1451, in start_dashboard
    with open(dashboard_log, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2022-08-12_20-41-13_082466_153671/logs/dashboard.log'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/n/home04/mgerstgrasser/.conda/envs/super-main/lib/python3.10/site-packages/ray/_private/services.py", line 1462, in start_dashboard
    raise Exception(err_msg + f"\nFailed to read dashboard log: {e}")
Exception: Failed to start the dashboard
Failed to read dashboard log: [Errno 2] No such file or directory: '/tmp/ray/session_2022-08-12_20-41-13_082466_153671/logs/dashboard.log'
[2022-08-12 20:42:36,375 E 153671 153671] core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

I then added include_dashboard=False to my ray.init() call hoping that would fix the issue, but it merely got rid of the first few lines of the error message; the crashes are still happening. This is all on Ray 1.13.0, on Linux.

@mattip
Contributor

mattip commented Aug 14, 2022

For everyone reporting here: saying "me too" is insufficient to help solve the problem. Please describe in as much detail as you can:

  • your machine and operating system
  • are you running ray on a local or remote file system
  • what version of ray
  • what script are you running

@mgerstgrasser
Contributor

@mattip Sorry about that, that was all the information I had at the time, and I figured the additional error message might be helpful, since I hadn't seen it in the thread before. I am running this on a cluster where I can't just access /tmp on the machines the jobs run on to grab the log files, so there wasn't much else I could provide. I have in the meantime modified my slurm scripts to copy the contents of /tmp/ray back to a shared filesystem at the end of the jobs, so I have some log files now. See attached.

This is from ray 1.13.0 on a Linux cluster. I think it's CentOS based but fairly customized, and the filesystem that ray was supposed to write logs to is Lustre, but /tmp is local, I believe. The script is really just some boilerplate for parsing command line arguments and writing them into a config dict, and then calling ray.init() followed by tune.run() on an rllib trainer.

I hope this helps! Let me know if I can help in any other way.

logs.zip

@mgerstgrasser
Contributor

Just to comment again to share that this seems to be fixed for me by setting num_cpus in ray.init() to whatever value I requested from the slurm scheduler (see the sketch after the links below). See also the discussion and Stack Overflow post linked below. Not sure if this fixes the issue for everyone, given that not everyone seems to be running into this in connection with slurm, but it may be worth a try.

https://discuss.ray.io/t/ray-on-slurm-problems-with-initialization/6361/4
https://stackoverflow.com/questions/72464756/ray-on-slurm-problems-with-initialization/72492737#72492737?newreg=b351ecf69c804824ac3c578f38e513ad
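
A minimal sketch of that fix for other slurm users; the SLURM_CPUS_PER_TASK lookup below is just one way to read the allocated CPU count and assumes you request CPUs via --cpus-per-task, so adjust it to your setup.

import os

import ray

# Tell Ray how many CPUs the slurm allocation actually grants, instead of
# letting it autodetect every core on the node.
ray.init(num_cpus=int(os.environ.get("SLURM_CPUS_PER_TASK", "1")))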

@richardliaw richardliaw added the core Issues that should be addressed in Ray Core label Oct 29, 2022
@h-vetinari

I see this happening in conda-forge/ray-packages-feedstock#78 on Linux when calling python -c 'import ray; ray.init()'. This is obviously running on a local filesystem. The image is ubuntu-latest. I haven't run that PR many times yet, but it looks reproducible; I'm happy to test patches on the feedstock to see if they resolve the issue.

More context:

+ python -c 'import ray; ray.init()'
2022-11-09 03:15:44,821	ERROR services.py:1403 -- Failed to start the dashboard: Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-09_03-15-42_976642_37224/logs/dashboard.log:
2022-11-09 03:15:44,821	ERROR services.py:1404 -- Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-09_03-15-42_976642_37224/logs/dashboard.log:
Traceback (most recent call last):
  File "/home/conda/feedstock_root/build_artifacts/ray-packages_1667960091933/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.9/site-packages/ray/_private/services.py", line 1389, in start_api_server
    raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-09_03-15-42_976642_37224/logs/dashboard.log:
2022-11-09 03:15:44,826	WARNING services.py:1922 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67104768 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=1.29gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2022-11-09 03:15:44,980	INFO worker.py:1528 -- Started a local Ray instance.
[2022-11-09 03:15:55,017 E 37224 37224] core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

@mattip
Contributor

mattip commented Nov 13, 2022

I am going to close this, since it seems that this error message is too generic to point to a specific problem. If you arrive here searching for issues with this message, please note that the "Failed to register worker" message probably means that the main raylet process has crashed, and you should carefully examine the other log files (dashboard*.*, raylet*.*) for hints about what actually went wrong.
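
For example, on Linux something along these lines (assuming the default /tmp/ray layout and the session_latest symlink) prints the tail of the logs that usually contain the real cause:

import glob
import os

# Print the last lines of the raylet and dashboard logs from the most recent
# Ray session; these usually explain why the raylet actually went down.
log_dir = "/tmp/ray/session_latest/logs"
for path in sorted(glob.glob(os.path.join(log_dir, "raylet*")) +
                   glob.glob(os.path.join(log_dir, "dashboard*"))):
    with open(path, errors="replace") as f:
        lines = f.readlines()
    print(f"----- {path} (last 10 lines) -----")
    print("".join(lines[-10:]))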

@mattip mattip closed this as completed Nov 13, 2022
@marianogabitto

Hi Matt, I have opened another issue and I am willing to troubleshoot. Would it be possible for you or someone from the Ray team to help me troubleshoot it? I pasted the contents of my err and log files there.

#30012

@marianogabitto

I am a heavy user of ray and I need to be able to work with it on the cluster. Help would be appreciated.

@jpgard

jpgard commented Dec 25, 2022

See #30012 (comment)

@pd2871

pd2871 commented Jan 21, 2023

For me, the reason for this error was that I was using two different notebooks for Ray initialization. As soon as Ray was initialized in one notebook, trying to initialize Ray in the other notebook caused the error.

So, to solve the error, I shut down one of the notebooks.
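
If both notebooks actually need Ray at the same time, another option (a sketch, not specific to this bug) is to start one shared instance from a terminal with ray start --head and have each notebook attach to it rather than starting its own:

import ray

# Each notebook attaches to the instance started by `ray start --head`
# instead of trying to launch a second local cluster.
ray.init(address="auto")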

@tbukic
Contributor

tbukic commented Jan 23, 2023

I'm getting this with the newest version of Ray (run locally with Ubuntu 22.04.1 on WSL): https://s3-us-west-2.amazonaws.com/ray-wheels/master/25d3d529f5985b43ec44ab4d82c31780048ce457/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl

raylet.err:

[2023-01-23 11:06:05,497 E 10060 10123] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).

pip freeze | grep grpcio
grpcio==1.51.1

From poetry.lock I get the following grpcio requirements for Ray:

grpcio = [
    {version = ">=1.42.0", markers = "python_version >= \"3.10\" and sys_platform != \"darwin\""},
    {version = ">=1.42.0,<=1.49.1", markers = "python_version >= \"3.10\" and sys_platform == \"darwin\""},
]

Even when I downgrade grpcio to 1.49.1 or 1.42.0 the problem persists.

@mattip
Contributor

mattip commented Jan 23, 2023

@tbukic please open a new issue. This one is closed. When you do so please be sure to fill in the required information.

I am going to lock this thread. If you get here via search for similar issues, please open a new issue instead of trying to comment here.

@ray-project ray-project locked as resolved and limited conversation to collaborators Jan 23, 2023