[Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task #44438
Comments
Even with the Actor referenced in the above log completely removed, I still get:
And the worker never exits, so the Actor bit may be a red herring.
A little more information from looking at different logs: it appears that the worker that never exits is the one for the very top-level task in our pipeline - a task that just submits a bunch of other tasks and then waits for their results.
After compiling Ray from source and enabling debug logging in the core, I found a repro (added to the initial comment).
Looks like maybe it doesn't need to be returned to the Driver. This seems to reproduce as well:

```python
import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:
    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _create_actor():
    return Actor.remote()


@ray.remote
def _toplevel():
    actor = ray.get(_create_actor.remote())
    tasks = [actor.do_work.remote() for _ in range(0, 100)]
    ray.get(tasks)


ray.get(_toplevel.remote())
```
@jfaust what is the workaround? This is causing a lot of issues on our cluster.
@bug-catcher in my case it was easy to create the Actor, pass its handle around, and never return it from a task. That won't be a workaround in many cases, but it worked for us.
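A minimal sketch of that workaround pattern, assuming the detached-actor setup from the repro above; the names (SharedActor, use_actor) are hypothetical. The key point is that the detached actor handle is passed into tasks as an argument rather than returned from one:

```python
import ray

ray.init(namespace="test")


@ray.remote(name="shared_actor", get_if_exists=True, lifetime="detached")
class SharedActor:
    def do_work(self):
        return 42


# Create (or fetch) the detached actor in the driver and hand the handle
# down into tasks, instead of creating it inside a task and returning it.
actor = SharedActor.remote()


@ray.remote
def use_actor(actor_handle):
    # The handle is only used here; it is never returned up the call chain.
    return ray.get(actor_handle.do_work.remote())


ray.get(use_actor.remote(actor))
```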
Are there any updates on this? I'm running into a similar issue. In our case, a detached Ray Actor is used to pre-process a Ray Dataset. Whether the job fails, stops, or succeeds, several IDLE and IDLE_Spill workers remain, holding the majority of the memory.
@vipese-idoven, if you are using a detached actor, then after the job finishes the detached actor will still be running, and the workers it created will not be force-killed.
Thanks for the quick response! Just to confirm: when a detached Actor triggers the creation of workers such as ray::IDLE and ray::IDLE_Spill, these will not be force-killed. From what I've tried, killing the detached Actor does not kill them either. Is there a way to programmatically kill those workers that are apparently unused by other Tasks/Actors?
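For reference, the detached actor itself can be terminated by name; this is a sketch assuming the actor was registered as name="actor" in namespace "test", as in the repro above. As noted in this thread, though, killing the actor has been observed not to reclaim the leaked IDLE workers:

```python
import ray

ray.init(namespace="test")

# Look up the detached actor by the name it was registered under and
# forcefully terminate it with ray.kill().
handle = ray.get_actor("actor", namespace="test")
ray.kill(handle)
```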
What happened + What you expected to happen
EDIT: I've substantially changed this description based on having found a repro.
We have a pipeline with a detached actor that is shared among a number of jobs. Somewhat coincidentally, that actor's handle was being returned from a task to the Driver script.
When you return a detached Actor that way, Ray leaks an IDLE worker each time you do.
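For reference, the pattern described above (a task returning the detached actor handle to the driver) looks roughly like this. This is a hypothetical sketch with made-up names (SharedActor, setup), distinct from the reproduction script further below:

```python
import ray

ray.init(namespace="test")


@ray.remote(name="shared_actor", get_if_exists=True, lifetime="detached")
class SharedActor:
    def ping(self):
        return "pong"


@ray.remote
def setup():
    # The task creates (or fetches) the detached actor and returns the
    # handle to the driver; this return is what appears to leak an IDLE
    # worker on each run.
    return SharedActor.remote()


actor = ray.get(setup.remote())
ray.get(actor.ping.remote())
```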
If I run the repro below 3 times in a row, I end up with two of these:

Note that the worker has no ID (which I believe means it's about to exit), but it never exits.
Looking at the python-core-worker log for any of those workers, you'll see:
The worker never exits, even if the Actor is killed.
Versions / Dependencies
Reproduction script
Issue Severity
Medium: I have a workaround