
[Core] Thread leak when submitting actor tasks to actors, eventually reaching the system thread limit #33957

Closed
rkooo567 opened this issue Mar 31, 2023 · 3 comments · Fixed by #37949
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-correctness (Leak, crash, hang), core-worker, P1 (Issue that should be fixed within a few weeks), Ray-2.7, size-large

Comments

@rkooo567
Contributor

What happened + What you expected to happen

I have an actor that keeps track of file locations. It has exactly one instance running and is an async actor. However, as training runs and more and more remote calls are made to the actor, the thread count used by the actor process keeps increasing until it eventually hits the system limit. I can increase the system limit, but I would like to understand what is happening.

paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        1176
paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        2785
paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        4949

This was over a period of about 3 minutes. Does each .remote to an actor method create a new thread?
# %%
import ray
ray.init()

# %%
import time

try:
    # Clean up any previous instance of the named actor.
    a = ray.get_actor("testactor")
    ray.kill(a)
except ValueError:
    pass

# %%
@ray.remote(resources={"ishead": 1})
class TestActor:
    def __init__(self):
        self.counter = 0
    
    async def increment(self):
        self.counter += 1
        return self.counter
    
    async def get_count(self):
        return self.counter
    
@ray.remote(num_cpus=0.1)
def call_actor(a):
    a.increment.remote()
    
@ray.remote(num_cpus=0.1)
def spawn_task(a):
    for _ in range(10000000):
        call_actor.options(scheduling_strategy="SPREAD").remote(a)
        time.sleep(0.005)
    print("done")

ta = TestActor.options(name="testactor").remote()
for _ in range(30):
    spawn_task.options(scheduling_strategy="SPREAD").remote(ta)

# %%
ray.get(ta.get_count.remote())
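
For reference, a small helper (not part of the original repro) can confirm the growth programmatically instead of grepping /proc by hand; this sketch assumes psutil is installed and is run on the same node as the actor process:

import time
import psutil

def watch_threads(pid, interval=5.0, samples=10):
    # Print the OS thread count of the given process every `interval` seconds.
    proc = psutil.Process(pid)
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), proc.num_threads())
        time.sleep(interval)

# e.g. watch_threads(4163053) for the actor process shown above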

Versions / Dependencies

master

Reproduction script

n/a

Issue Severity

High: It blocks me from completing my task.

@rkooo567 added the bug, P1, core, core-correctness, and Ray-2.5 labels on Mar 31, 2023
@Wordyka

Wordyka commented Apr 2, 2023

It's possible that the issue you're seeing is related to the way Ray handles the lifecycle of actors and the management of actor state.

When a Ray actor is created, Ray spawns a new worker process to run the actor's methods. This process stays alive for the lifetime of the actor, even if the actor is idle and not currently processing any requests. Each actor process has a pool of threads for handling incoming requests, and if the actor receives more concurrent requests than its existing threads can handle, it spawns additional threads to absorb the load.

However, if the actor receives a large number of requests over time, and if those requests are slow to complete, it's possible that the actor's thread count can grow to a very large number, leading to the issue you're seeing. One possible cause of slow requests could be if the actor's state is very large, and if each remote method call needs to retrieve or modify a significant portion of that state.

To diagnose the issue further, you may want to use Ray's state CLI (for example, ray list actors or ray memory) to inspect the actor and its worker process. You can also use the ray timeline command to inspect the timing of the actor's method calls, to see if there are any slow calls that may be contributing to the thread-count growth.

To address the issue, you may want to consider breaking up the actor's state into smaller chunks, or using a different storage mechanism such as a distributed database or a message queue. You could also consider batching remote calls to the actor to reduce the overall number of requests it receives. Finally, you could consider implementing some kind of timeout or rate-limiting mechanism on the remote method calls to prevent them from overwhelming the actor's thread pool.
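
As a rough illustration of the batching suggestion above: instead of one actor call per increment as in the repro, something like the following sends one call per batch. BatchedTestActor, increment_by, and spawn_batched are hypothetical names used only for this sketch:

import time
import ray

@ray.remote
class BatchedTestActor:
    def __init__(self):
        self.counter = 0

    async def increment_by(self, n):
        # Apply a whole batch of increments in a single actor call.
        self.counter += n
        return self.counter

@ray.remote(num_cpus=0.1)
def spawn_batched(actor, batches=1000, batch_size=100):
    for _ in range(batches):
        # One actor call per batch instead of one per increment.
        actor.increment_by.remote(batch_size)
        time.sleep(0.5)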

@rynewang
Contributor

According to https://docs.ray.io/en/latest/ray-core/actors/concurrency_group_api.html#defining-concurrency-groups, by default each actor can't spawn more than 1000 threads? It should be a bug if we end up with 4000+ threads in a single actor process.
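
For reference, those limits are configured on the actor itself; a minimal sketch of the documented API (the actor and method names here are illustrative):

import ray

# Cap the total concurrency of an async actor (the documented default is 1000).
@ray.remote(max_concurrency=100)
class CappedActor:
    async def f(self):
        return "ok"

# Or split methods into named concurrency groups, each with its own limit.
@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class GroupedActor:
    @ray.method(concurrency_group="io")
    async def do_io(self):
        return "io"

    async def do_compute(self):
        # Runs in the default concurrency group.
        return "compute"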

@rynewang
Contributor

The cause is like this:

When many workers send actor task requests to the actor process, the receiver maintains a per-sender (keyed by worker id) task queue, and each queue spawns a thread. These queues are never GC'd, so we end up with an excessive number of queues and threads.

https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/direct_actor_transport.cc#L230
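
In other words, the receiving core worker keeps roughly this shape of state; the following is illustrative Python pseudocode of the pattern, not the actual C++ implementation:

import queue
import threading

class ActorTaskReceiver:
    def __init__(self):
        # One inbound task queue (and one draining thread) per sender worker id.
        self.queues = {}

    def handle_request(self, sender_worker_id, task):
        if sender_worker_id not in self.queues:
            q = queue.Queue()
            threading.Thread(target=self._drain, args=(q,), daemon=True).start()
            self.queues[sender_worker_id] = q
        # Bug: entries are never removed when a sender worker exits, so with many
        # short-lived senders the number of queues and threads grows without bound.
        self.queues[sender_worker_id].put(task)

    def _drain(self, q):
        while True:
            q.get()()  # pop the next task callable and run it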
