
[Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task #44438

Open · jfaust opened this issue Apr 2, 2024 · 9 comments

Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical)

jfaust commented Apr 2, 2024

What happened + What you expected to happen

EDIT: I've substantially changed this description based on having found a repro.

We have a pipeline with a detached actor that is shared among a number of jobs. Somewhat coincidentally, that actor's handle was being returned from a task to the Driver script.

When you return a detached Actor handle from a task like that, Ray leaks an IDLE worker each time.

If I run the repro below 3 times in a row, I end up with two of these:
[Screenshot: Ray dashboard showing the leaked IDLE workers]

Note that the worker has no ID (which I believe means it's about to exit), but it never exits.
Looking at the python-core-worker log for any of those workers, you'll see:

reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.

The worker never exits, even if the Actor is killed.

Versions / Dependencies

  • Tried Ray 2.9.2, Ray 2.10.0
  • Python 3.8
  • Ubuntu 20.04

Reproduction script

import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:

    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _toplevel():
    actor = Actor.remote()
    tasks = [actor.do_work.remote() for _ in range(0, 100)]
    ray.get(tasks)
    # Returning the detached actor handle from the task is what leaks an IDLE worker.
    return actor


actor = ray.get(_toplevel.remote())

Issue Severity

Medium: I have a workaround

@jfaust jfaust added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 2, 2024

jfaust commented Apr 3, 2024

Even with the Actor referenced in the above log completely removed, I still get:

[2024-04-02 17:41:53,187 I 3724 3858] core_worker.cc:4197: Force exiting worker that owns object. This may cause other workers that depends on the object to lose it. Own objects: 1 # Pins in flight: 0
[2024-04-02 17:41:53,187 I 3724 3858] core_worker.cc:835: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_SYSTEM_EXIT, detail=Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time.
[2024-04-02 17:41:53,187 W 3724 3724] reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.

And the worker never exits, so the Actor bit may be a red herring.

jfaust commented Apr 3, 2024

A little more information from looking at different logs: it appears that the worker that never exits belongs to the very top-level task in our pipeline - a task that just submits a bunch of other tasks and then waits for their results.

jfaust commented Apr 3, 2024

After compiling Ray from source and enabling debug logging in the core, I found a repro (added to the initial comment).

@jfaust jfaust changed the title [Core] Ray appears to be leaking IDLE workers, even after their jobs have finished [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor is returned from a Task to the Driver Apr 3, 2024
@jfaust jfaust changed the title [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor is returned from a Task to the Driver [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task to the Driver Apr 3, 2024
@jfaust jfaust changed the title [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task to the Driver [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task Apr 4, 2024

jfaust commented Apr 4, 2024

Looks like maybe it doesn't need to be returned to the Driver. This seems to reproduce as well:

import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:

    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _create_actor():
    # Returning the detached actor handle to another task (not the driver)
    # still reproduces the leak.
    return Actor.remote()


@ray.remote
def _toplevel():
    actor = ray.get(_create_actor.remote())
    tasks = [actor.do_work.remote() for _ in range(0, 100)]
    ray.get(tasks)


ray.get(_toplevel.remote())

@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label May 13, 2024
@jjyao jjyao self-assigned this May 13, 2024
@rynewang rynewang added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 20, 2024
bug-catcher commented

@jfaust what is the workaround? This is causing a lot of issues on our cluster.

jfaust commented Jul 26, 2024

> @jfaust what is the workaround? This is causing a lot of issues on our cluster.

@bug-catcher in my case it was easy to create the Actor up front and pass the handle around, never returning it from a task. That's not a viable workaround in many cases, but it worked for us.
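
A minimal sketch of that pattern, adapted from the repro in the issue description (not the exact code from the pipeline): the detached actor is created in the driver and its handle is passed into the task as an argument, never returned from a task.

import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:

    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _toplevel(actor):
    # The handle arrives as a task argument and is never returned,
    # so no worker ends up holding a reference it cannot release.
    ray.get([actor.do_work.remote() for _ in range(100)])


# Create (or fetch) the named detached actor in the driver and pass the handle down.
actor = Actor.remote()
ray.get(_toplevel.remote(actor))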

vipese-idoven commented

Are there any updates on this?

I'm running into a similar issue. In our case, a detached Ray Actor is used to pre-process a Ray Dataset. Whether the job fails, is stopped, or succeeds, several IDLE and IDLE_Spill workers remain, holding most of the memory.

jjyao commented Sep 11, 2024

> I'm running into a similar issue. In our case, a detached Ray Actor is used to pre-process a Ray Dataset. Whether the job fails, is stopped, or succeeds, several IDLE and IDLE_Spill workers remain, holding most of the memory.

@vipese-idoven, if you are using a detached actor, then after the job finishes the detached actor will still be running, and the workers it created will not be force-killed.
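
For completeness, a minimal sketch of tearing down the detached actor explicitly once it is no longer needed (the name and namespace are the ones from the repro above); note that, per the reports earlier in this thread, this may not reclaim IDLE workers that have already leaked.

import ray

ray.init(namespace="test")

# Detached actors outlive the job that created them, so they have to be
# terminated explicitly. "actor" / "test" are the name and namespace used
# in the repro above.
handle = ray.get_actor("actor", namespace="test")
ray.kill(handle)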

vipese-idoven commented

> I'm running into a similar issue. In our case, a detached Ray Actor is used to pre-process a Ray Dataset. Whether the job fails, is stopped, or succeeds, several IDLE and IDLE_Spill workers remain, holding most of the memory.
>
> @vipese-idoven, if you are using a detached actor, then after the job finishes the detached actor will still be running, and the workers it created will not be force-killed.

Thanks for the quick response! Just to confirm: when a detached Actor triggers the creation of workers such as ray::IDLE and ray::IDLE_Spill, those workers will not be force-killed. From what I've tried, killing the detached Actor does not kill them either. Is there a way to programmatically kill workers that are apparently no longer used by other Tasks/Actors?

@jjyao jjyao added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Oct 30, 2024