
[Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task #44438

Open · jfaust opened this issue Apr 2, 2024 · 9 comments

Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical)

jfaust commented Apr 2, 2024

What happened + What you expected to happen

EDIT: I've substantially changed this description based on having found a repro.

We have a pipeline with a detached actor that is shared among a number of jobs. Somewhat coincidentally, that actor's handle was being returned from a task to the Driver script.

When you return a detached Actor handle from a task like that, Ray leaks an IDLE worker each time.

If I run the repro below 3 times in a row, I end up with two of these:
[Screenshot: Ray dashboard showing the leaked IDLE workers]

Note that the worker has no ID (which I believe means it's about to exit), but it never exits.
Looking at the python-core-worker log for any of those workers, you'll see:

reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.

The worker never exits, even if the Actor is killed.

Versions / Dependencies

  • Tried Ray 2.9.2, Ray 2.10.0
  • Python 3.8
  • Ubuntu 20.04

Reproduction script

import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:

    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _toplevel():
    actor = Actor.remote()
    tasks = [actor.do_work.remote() for _ in range(0, 100)]
    ray.get(tasks)
    # Returning the detached actor handle from the task is what leaks an IDLE worker.
    return actor


actor = ray.get(_toplevel.remote())

Issue Severity

Medium: I have a workaround

@jfaust jfaust added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 2, 2024

jfaust commented Apr 3, 2024

Even with the Actor referenced in the above log completely removed, I still get:

[2024-04-02 17:41:53,187 I 3724 3858] core_worker.cc:4197: Force exiting worker that owns object. This may cause other workers that depends on the object to lose it. Own objects: 1 # Pins in flight: 0
[2024-04-02 17:41:53,187 I 3724 3858] core_worker.cc:835: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_SYSTEM_EXIT, detail=Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time.
[2024-04-02 17:41:53,187 W 3724 3724] reference_count.cc:54: This worker is still managing 1 objects, waiting for them to go out of scope before shutting down.

And the worker never exits, so the Actor bit may be a red herring.

jfaust commented Apr 3, 2024

A little more information from looking at different logs: it appears that the worker that never exits belongs to the very top-level task in our pipeline - a task that just submits a bunch of other tasks and then waits for their results.

jfaust commented Apr 3, 2024

After compiling Ray from source and enabling debug logging in the core, I found a repro (added to the initial comment).

@jfaust jfaust changed the title [Core] Ray appears to be leaking IDLE workers, even after their jobs have finished [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor is returned from a Task to the Driver Apr 3, 2024
@jfaust jfaust changed the title [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor is returned from a Task to the Driver [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task to the Driver Apr 3, 2024
@jfaust jfaust changed the title [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task to the Driver [Core] Ray leaks IDLE workers, even after their jobs have finished, if an Actor it did not start is returned from a Task Apr 4, 2024

jfaust commented Apr 4, 2024

Looks like maybe it doesn't need to be returned to the Driver. This seems to reproduce as well:

import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:

    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _create_actor():
    # Returning the detached actor handle to another task (not the driver)
    # still reproduces the leak.
    return Actor.remote()


@ray.remote
def _toplevel():
    actor = ray.get(_create_actor.remote())
    tasks = [actor.do_work.remote() for _ in range(0, 100)]
    ray.get(tasks)


ray.get(_toplevel.remote())

@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label May 13, 2024
@jjyao jjyao self-assigned this May 13, 2024
@rynewang rynewang added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 20, 2024
bug-catcher commented

@jfaust what is the workaround? This is causing a lot of issues on our cluster.

jfaust commented Jul 26, 2024

> @jfaust what is the workaround? This is causing a lot of issues on our cluster.

@bug-catcher in my case it was easy to create the Actor up front and pass the handle around, never returning it from a task. That's not a viable workaround in many cases, but it worked for us.
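
A minimal sketch of that pattern, adapted from the repro in the issue description (not the exact code from the pipeline): the detached actor is created in the driver and its handle is passed into the task as an argument, never returned from a task.

import ray
import numpy as np

ray.init(namespace="test")


@ray.remote(name="actor", get_if_exists=True, lifetime="detached")
class Actor:

    def do_work(self):
        arr = np.zeros((10000, 1000))
        np.multiply(arr, arr)


@ray.remote
def _toplevel(actor):
    # The handle arrives as a task argument and is never returned,
    # so no worker ends up holding a reference it cannot release.
    ray.get([actor.do_work.remote() for _ in range(100)])


# Create (or fetch) the named detached actor in the driver and pass the handle down.
actor = Actor.remote()
ray.get(_toplevel.remote(actor))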

vipese-idoven commented

Are there any updates on this?

I'm running into a similar issue. In our case, a detached Ray Actor is used to pre-process a Ray Dataset. Whether the job fails, is stopped, or succeeds, several IDLE and IDLE_Spill workers remain, holding most of the memory.

jjyao commented Sep 11, 2024

> I'm running into a similar issue. In our case, a detached Ray Actor is used to pre-process a Ray Dataset. Whether the job fails, is stopped, or succeeds, several IDLE and IDLE_Spill workers remain, holding most of the memory.

@vipese-idoven, if you are using a detached actor, then after the job finishes the detached actor will still be running, and the workers it created will not be force-killed.
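
For completeness, a minimal sketch of tearing down the detached actor explicitly once it is no longer needed (the name and namespace are the ones from the repro above); note that, per the reports earlier in this thread, this may not reclaim IDLE workers that have already leaked.

import ray

ray.init(namespace="test")

# Detached actors outlive the job that created them, so they have to be
# terminated explicitly. "actor" / "test" are the name and namespace used
# in the repro above.
handle = ray.get_actor("actor", namespace="test")
ray.kill(handle)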

vipese-idoven commented

> I'm running into a similar issue. In our case, a detached Ray Actor is used to pre-process a Ray Dataset. Whether the job fails, is stopped, or succeeds, several IDLE and IDLE_Spill workers remain, holding most of the memory.
>
> @vipese-idoven, if you are using a detached actor, then after the job finishes the detached actor will still be running, and the workers it created will not be force-killed.

Thanks for the quick response! Just to confirm: when a detached Actor triggers the creation of workers such as ray::IDLE and ray::IDLE_Spill, those workers will not be force-killed. From what I've tried, killing the detached Actor does not kill them either. Is there a way to programmatically kill workers that are apparently no longer used by other Tasks/Actors?

@jjyao jjyao added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Oct 30, 2024