[Core] [Bug] core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error #21479
Comments
Could you add a small script that reproduces the issue?
Hey @cxy990729, is there a reproduction script that triggers this? Please share it in your issue description. Thank you.
It has been over a month with no further information on this problem. Without a reproducer (including how the reporter installed ray, and which versions of python, ray, and windows were used) we cannot fix the problem.
I am re-opening this issue as I have reproduced this several times (though it happens randomly and needs a fair amount of luck and patience). Some assumptions, which I believe should be true:
Now, consider this situation, which I noticed in the case of the error message in the OP (line 1684 at commit 5b2d586):
ray/src/ray/core_worker/core_worker.cc, line 123 at commit 5b2d586
I am trying to confirm this hypothesis.
@iycheng I think this must be related to what we observed (very slow RPC to the internal KV).
In addition, after thinking for a while, IMO, in the case of ray/src/ray/core_worker/core_worker.cc, line 123 at commit 5b2d586:
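If the slow-registration hypothesis above is right, the failure should become less likely when the raylet is given a longer window for workers to register. Here is a sketch of that experiment; the config name worker_register_timeout_seconds and the RAY_ environment-variable override are assumptions to verify against your Ray version:

```python
import os

# Give starting workers more time to register with the raylet before the
# raylet treats them as failed. Set this before any Ray process starts so
# the spawned raylet inherits it.
os.environ["RAY_worker_register_timeout_seconds"] = "60"

import ray

ray.init()
```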
diff --git a/python/ray/_raylet.pyx b/python/ray/_raylet.pyx
index bd0ac678c..c52f5ef1e 100644
--- a/python/ray/_raylet.pyx
+++ b/python/ray/_raylet.pyx
@@ -1175,6 +1175,9 @@ cdef class CoreWorker:
return WorkerID(
CCoreWorkerProcess.GetCoreWorker().GetWorkerID().Binary())
+ def get_registered_status(self):
+ return CCoreWorkerProcess.GetCoreWorker().GetRegisteredStatus()
+
def should_capture_child_tasks_in_placement_group(self):
return CCoreWorkerProcess.GetCoreWorker(
).ShouldCaptureChildTasksInPlacementGroup()
diff --git a/python/ray/includes/libcoreworker.pxd b/python/ray/includes/libcoreworker.pxd
index dddccb4fb..fec0152ed 100644
--- a/python/ray/includes/libcoreworker.pxd
+++ b/python/ray/includes/libcoreworker.pxd
@@ -155,6 +155,7 @@ cdef extern from "ray/core_worker/core_worker.h" nogil:
CJobID GetCurrentJobId()
CTaskID GetCurrentTaskId()
CNodeID GetCurrentNodeId()
+ c_bool GetRegisteredStatus()
c_bool GetCurrentTaskRetryExceptions()
CPlacementGroupID GetCurrentPlacementGroupId()
CWorkerID GetWorkerID()
diff --git a/python/ray/worker.py b/python/ray/worker.py
index 95ec0c4d9..c05bec7f9 100644
--- a/python/ray/worker.py
+++ b/python/ray/worker.py
@@ -1677,6 +1677,10 @@ def connect(
startup_token,
)
+ if not worker.core_worker.get_registered_status():
+ logger.warning("CoreWorkerProcess {} wasn't registered successfully.".format(os.getpid()))
+ sys.exit(1)
+
# Notify raylet that the core worker is ready.
worker.core_worker.notify_raylet()
diff --git a/src/ray/core_worker/core_worker.cc b/src/ray/core_worker/core_worker.cc
index d301f2865..46db31a49 100644
--- a/src/ray/core_worker/core_worker.cc
+++ b/src/ray/core_worker/core_worker.cc
@@ -120,9 +120,11 @@ CoreWorker::CoreWorker(const CoreWorkerOptions &options, const WorkerID &worker_
RAY_LOG(ERROR) << "Failed to register worker " << worker_id << " to Raylet. "
<< raylet_client_status;
// Quit the process immediately.
- QuickExit();
+ registered_ = false;
+ return;
}
+ registered_ = true;
connected_ = true;
RAY_CHECK(assigned_port >= 0);
diff --git a/src/ray/core_worker/core_worker.h b/src/ray/core_worker/core_worker.h
index d12a41b89..15ae67dd5 100644
--- a/src/ray/core_worker/core_worker.h
+++ b/src/ray/core_worker/core_worker.h
@@ -127,6 +127,8 @@ class CoreWorker : public rpc::CoreWorkerServiceHandler {
NodeID GetCurrentNodeId() const { return NodeID::FromBinary(rpc_address_.raylet_id()); }
+ const bool GetRegisteredStatus() const { return registered_; }
+
const PlacementGroupID &GetCurrentPlacementGroupId() const {
return worker_context_.GetCurrentPlacementGroupId();
}
@@ -1037,6 +1039,8 @@ class CoreWorker : public rpc::CoreWorkerServiceHandler {
/// Whether or not this worker is connected to the raylet and GCS.
bool connected_ = false;
+ bool registered_ = false;
+
// Client to the GCS shared by core worker interfaces.
std::shared_ptr<gcs::GcsClient> gcs_client_;
diff --git a/src/ray/raylet/worker_pool.cc b/src/ray/raylet/worker_pool.cc
index 669351028..586258954 100644
--- a/src/ray/raylet/worker_pool.cc
+++ b/src/ray/raylet/worker_pool.cc
@@ -490,10 +490,6 @@ void WorkerPool::MonitorStartingWorkerProcess(const Process &proc,
? "The process is still alive, probably it's hanging during start."
: "The process is dead, probably it crashed during start.");
- if (proc.IsAlive()) {
- proc.Kill();
- }
-
PopWorkerStatus status = PopWorkerStatus::WorkerPendingRegistration;
process_failed_pending_registration_++;
bool found;

I applied the above diff and it seems like
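For anyone testing a patched build, one way to check whether the raylet is timing workers out during registration is to scan the current session's raylet logs for the diagnostics the diff above touches. A minimal sketch, assuming Ray's default temp directory /tmp/ray:

```python
import glob

# Look for the raylet's "pending registration" diagnostics in the latest
# session's logs; /tmp/ray is Ray's default temp directory.
for path in glob.glob("/tmp/ray/session_latest/logs/raylet*"):
    with open(path, errors="ignore") as f:
        for line in f:
            if "hanging during start" in line or "crashed during start" in line:
                print(f"{path}: {line.rstrip()}")
```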
Running python muzero.py gives the same error.
I am having the same issue when I am on a node instantiated by the sun command. I want to add that this does not happen if I install ray 0.6.3.
This must be happening from a fairly recent version (1.10 or 1.11, I believe), but I haven't seen a report from the latest version (1.12). @marianogabitto what version of Ray are you using?
0.6.3 |
@marianogabitto two questions:
Within python:
I am having the same issue on a single node. Using Windows 11, Ray version 1.12.1
Gives me:
@CedricVandelaer could you run the reproducer, then upload the log files from the session log directory? These include the path to your python and ray, so please look through them before uploading. You may need to clear out the log directory first.
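In case it helps anyone asked to share logs, here is a minimal sketch for bundling the current session's logs before uploading, assuming Ray's default temp directory /tmp/ray (adjust if you passed --temp-dir or _temp_dir when starting Ray):

```python
import shutil

# Produce session_latest_logs.tar.gz from the latest session's log directory.
shutil.make_archive("session_latest_logs", "gztar", "/tmp/ray/session_latest/logs")
```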
@mattip I ran into this issue this weekend with nothing more than import ray followed by ray.init(). Here are my logs: session_latest.tar.gz
@peytondmurray your issue is different. You did not include the
For me the issue was solved by changing get_num_cpus() in ray/_private/utils.py.
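If CPU detection is the culprit, a less invasive workaround than editing ray/_private/utils.py is to pass the CPU count to Ray explicitly. A sketch (the value 4 is just an example, not a recommendation):

```python
import ray

# Skip automatic CPU detection and tell Ray how many CPUs to schedule on.
ray.init(num_cpus=4)
```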
Hello CedricVandelaer,
Just to comment here that I've been seeing what I think is the same issue recently, i.e. the "Failed to register worker" error message.
It happens randomly, in roughly 1% to 10% of my recent runs. Initially I had a more detailed error message, which was as follows:
I then added
For everyone reporting here: saying "me too" is insufficient to help solve the problem. Please describe in as much detail as you can:
@mattip Sorry about that, that was all the information I had at the time, and I figured the additional error message might be helpful since I hadn't seen it in the thread before. I am running this on a cluster where I can't just access /tmp on the machine the jobs are running on to grab the log files, so there wasn't much else I could provide.

I have in the meantime modified my slurm scripts to copy back the contents of /tmp/ray to a shared filesystem at the end of the jobs, so I have some log files now. See attached. This is from ray 1.13.0 on a Linux cluster. I think it's CentOS based but fairly customized, and the filesystem that ray was supposed to write logs to is Lustre, but /tmp is local, I believe. The script is really just some boilerplate for parsing command line arguments and writing them into a config dict, and then calling into ray.

I hope this helps! Let me know if I can help in any other way.
Just to comment again to share that this seems to be fixed for me by the setting described here: https://discuss.ray.io/t/ray-on-slurm-problems-with-initialization/6361/4
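For SLURM users hitting this, one commonly suggested pattern (an assumption here, not necessarily the exact setting from the linked thread) is to pin Ray's CPU count to the allocation and keep its temp directory on node-local storage rather than relying on auto-detection:

```python
import os

import ray

# Both values are illustrative: match num_cpus to the SLURM allocation and
# keep Ray's session files on local disk instead of a shared filesystem.
ray.init(
    num_cpus=int(os.environ.get("SLURM_CPUS_PER_TASK", "1")),
    _temp_dir="/tmp/ray",
)
```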
I see this happening in conda-forge/ray-packages-feedstock#78 on linux when calling into ray. More context:
I am going to close this, since it seems that this error message is too generic to point to a specific problem. If you arrive here after searching for issues with this message, please note that the "Failed to register worker" message probably means that the main raylet process died or was never reachable, so the real cause is usually elsewhere in the logs.
Hi Matt, I have opened another issue and I am willing to troubleshoot. Would it be possible for you or someone from Ray to help me troubleshoot it? I pasted the contents of my err and log files there.
I am a heavy user of ray and I need to be able to work with it on the cluster. Help would be appreciated.
See #30012 (comment) |
For me, the reason for this error was that I was using two different notebooks for ray initialization. Once ray was initialized in one notebook, trying to initialize it in another notebook caused the error. To solve it, I shut down one of the notebooks.
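If you do need two notebooks to share one Ray instance, the usual pattern is to start Ray once outside the notebooks and attach to it from each of them, rather than letting each notebook start its own instance. A sketch:

```python
# First, start a long-lived Ray instance from a shell:
#   ray start --head
# Then, in each notebook, attach to that running instance:
import ray

ray.init(address="auto")
```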
I'm getting this with the newest version of Ray (run locally on Ubuntu 22.04.1 under WSL): https://s3-us-west-2.amazonaws.com/ray-wheels/master/25d3d529f5985b43ec44ab4d82c31780048ce457/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl
From
Even when I downgrade grpcio to |
@tbukic please open a new issue. This one is closed. When you do so please be sure to fill in the required information. I am going to lock this thread. If you get here via search for similar issues, please open a new issue instead of trying to comment here. |
Search before asking
Ray Component
Ray Core
What happened + What you expected to happen
When I run it, I get: core_worker.cc:451: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. I can't figure it out.
Versions / Dependencies
Windows 10
Reproduction script
...
Anything else
No response
Are you willing to submit a PR?