
[Core] Thread leak when submitting actor tasks to actors, eventually reaching the system thread limit #33957

Closed
rkooo567 opened this issue Mar 31, 2023 · 3 comments · Fixed by #37949
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-correctness (Leak, crash, hang), core-worker, P1 (Issue that should be fixed within a few weeks), Ray-2.7, size-large

Comments

@rkooo567
Contributor

What happened + What you expected to happen

I have an actor that keeps track of file locations. It has exactly one instance running and is an async actor. However, as training runs and more and more remote calls are made to the actor, the thread count used by the actor process keeps increasing until it eventually hits the system limit. I can increase the system limit, but I would like to understand what is happening.

paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        1176
paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        2785
paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        4949

This was over a period of about 3 minutes. Does each .remote to an actor method create a new thread?
# %%
import ray
ray.init()

# %%
import time

try:
    # Clean up any previous instance of the named actor.
    a = ray.get_actor("testactor")
    ray.kill(a)
except ValueError:
    pass

# %%
@ray.remote(resources={"ishead": 1})
class TestActor:
    def __init__(self):
        self.counter = 0
    
    async def increment(self):
        self.counter += 1
        return self.counter
    
    async def get_count(self):
        return self.counter
    
@ray.remote(num_cpus=0.1)
def call_actor(a):
    a.increment.remote()
    
@ray.remote(num_cpus=0.1)
def spawn_task(a):
    for _ in range(10000000):
        call_actor.options(scheduling_strategy="SPREAD").remote(a)
        time.sleep(0.005)
    print("done")

ta = TestActor.options(name="testactor").remote()
for _ in range(30):
    spawn_task.options(scheduling_strategy="SPREAD").remote(ta)

# %%
ray.get(ta.get_count.remote())
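
For reference, a small helper (not part of the original repro) can confirm the growth programmatically instead of grepping /proc by hand; this sketch assumes psutil is installed and is run on the same node as the actor process:

import time
import psutil

def watch_threads(pid, interval=5.0, samples=10):
    # Print the OS thread count of the given process every `interval` seconds.
    proc = psutil.Process(pid)
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), proc.num_threads())
        time.sleep(interval)

# e.g. watch_threads(4163053) for the actor process shown above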

Versions / Dependencies

master

Reproduction script

n/a

Issue Severity

High: It blocks me from completing my task.

@rkooo567 added the bug, P1, core, core-correctness, and Ray-2.5 labels on Mar 31, 2023
@Wordyka

Wordyka commented Apr 2, 2023

It's possible that the issue you're seeing is related to the way Ray handles the lifecycle of actors and the management of actor state.

When a Ray actor is created, Ray spawns a new worker process to run the actor's methods. This process stays alive for the lifetime of the actor, even if the actor is idle and not currently processing any requests. Each actor process has a pool of threads for handling incoming requests, and if the actor receives more concurrent requests than its existing threads can handle, it spawns additional threads to absorb the load.

However, if the actor receives a large number of requests over time, and if those requests are slow to complete, it's possible that the actor's thread count can grow to a very large number, leading to the issue you're seeing. One possible cause of slow requests could be if the actor's state is very large, and if each remote method call needs to retrieve or modify a significant portion of that state.

To diagnose the issue further, you may want to use Ray's state CLI (for example, ray list actors or ray memory) to inspect the actor and its worker process. You can also use the ray timeline command to inspect the timing of the actor's method calls, to see if there are any slow calls that may be contributing to the thread-count growth.

To address the issue, you may want to consider breaking up the actor's state into smaller chunks, or using a different storage mechanism such as a distributed database or a message queue. You could also consider batching remote calls to the actor to reduce the overall number of requests it receives. Finally, you could consider implementing some kind of timeout or rate-limiting mechanism on the remote method calls to prevent them from overwhelming the actor's thread pool.
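
As a rough illustration of the batching suggestion above: instead of one actor call per increment as in the repro, something like the following sends one call per batch. BatchedTestActor, increment_by, and spawn_batched are hypothetical names used only for this sketch:

import time
import ray

@ray.remote
class BatchedTestActor:
    def __init__(self):
        self.counter = 0

    async def increment_by(self, n):
        # Apply a whole batch of increments in a single actor call.
        self.counter += n
        return self.counter

@ray.remote(num_cpus=0.1)
def spawn_batched(actor, batches=1000, batch_size=100):
    for _ in range(batches):
        # One actor call per batch instead of one per increment.
        actor.increment_by.remote(batch_size)
        time.sleep(0.5)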

@rynewang
Contributor

According to https://docs.ray.io/en/latest/ray-core/actors/concurrency_group_api.html#defining-concurrency-groups, by default each actor can't spawn more than 1000 threads? It should be a bug if we end up with 4000+ threads in a single actor process.
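
For reference, those limits are configured on the actor itself; a minimal sketch of the documented API (the actor and method names here are illustrative):

import ray

# Cap the total concurrency of an async actor (the documented default is 1000).
@ray.remote(max_concurrency=100)
class CappedActor:
    async def f(self):
        return "ok"

# Or split methods into named concurrency groups, each with its own limit.
@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class GroupedActor:
    @ray.method(concurrency_group="io")
    async def do_io(self):
        return "io"

    async def do_compute(self):
        # Runs in the default concurrency group.
        return "compute"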

@rynewang
Contributor

The cause is like this:

When many workers send actor task requests to the actor process, the receiver maintains a per-sender (keyed by worker id) task queue, and each queue spawns a thread. These queues are never GC'd, so we end up with an excessive number of queues and threads.

https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/direct_actor_transport.cc#L230
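
In other words, the receiving core worker keeps roughly this shape of state; the following is illustrative Python pseudocode of the pattern, not the actual C++ implementation:

import queue
import threading

class ActorTaskReceiver:
    def __init__(self):
        # One inbound task queue (and one draining thread) per sender worker id.
        self.queues = {}

    def handle_request(self, sender_worker_id, task):
        if sender_worker_id not in self.queues:
            q = queue.Queue()
            threading.Thread(target=self._drain, args=(q,), daemon=True).start()
            self.queues[sender_worker_id] = q
        # Bug: entries are never removed when a sender worker exits, so with many
        # short-lived senders the number of queues and threads grows without bound.
        self.queues[sender_worker_id].put(task)

    def _drain(self, q):
        while True:
            q.get()()  # pop the next task callable and run it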
