[Core][Serve] Memory leak in Ray components and modules #35714
Comments
Great analysis. cc: @scv119 @iycheng
There is perhaps another important point related to idle workers:
We also observed in Ray 2.3 an increase in raylet memory without changing …
I have a couple more questions. Also, we don't need an exact repro. If you tell us your setup in detail, we can build a mock script and test memory usage growth over time. For example, you can tell us something like:
Also, a couple of additional questions:
This is pretty weird. If there are the same number of actors, workers are not supposed to be created. Is this issue still happening even without the autoscaler? Also, what happens if you don't set kill_idle_workers_interval_ms and just turn off the autoscaler? The main leak seems to be from the Serve controller. cc @sihanwang41, have you seen a similar memory leak issue? Seems important to fix.
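For anyone who wants to experiment with that setting, here's a minimal sketch (my assumption on usage, not the reporter's actual config): `kill_idle_workers_interval_ms` is an internal system config, and `_system_config` is only honored by the node that starts the cluster.

```python
import ray

# Minimal sketch: tweak idle-worker reaping to test whether worker churn
# contributes to the leak. _system_config is internal/unstable, so treat
# this as an experiment, not a production setting.
ray.init(
    _system_config={
        # Setting this to 0 should disable periodic idle-worker killing.
        "kill_idle_workers_interval_ms": 0,
    }
)
```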
Marked the issue with the `serve` label since the main memory leak is from Serve components. However, there's a slight memory usage increase in the raylet (~300 MB/day). I think it could be due to accumulating system metadata, but once the workload is given, we can add long-running tests to verify it.
- 2 Serve applications, 10 Serve deployment modules with 10 replicas each; a total of 200 actors.
- One node with two Docker containers (head and worker).
- 10 per application, sent constantly to Serve HTTP (FastAPI); when one of the requests is completed, a new one is added.
- A minimum of 48-72 hours.
- From 30 seconds to 3 minutes.
- Only …
- Two Serve deployment modules (a dashboard with the amount of memory for all their replicas).
- We will check closer to the end of next week, as we will make another launch for a longer period.
Workers are created not for actors, but for the tasks that we call from the actors. In total, 20-40 remote functions are called per request, and the default …
I also updated …
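Based on the setup described above, a mock repro could look something like the sketch below. All names and sizes are illustrative (scaled down from the real 10 deployments x 10 replicas), not the reporter's actual code.

```python
import asyncio
import concurrent.futures

import requests
import ray
from ray import serve


@ray.remote
def small_task(x: int) -> int:
    # Stand-in for the 20-40 remote functions called per request.
    return x * 2


@serve.deployment(num_replicas=2)  # real setup: 10 deployments x 10 replicas
class Module:
    async def __call__(self, request) -> str:
        # Fan remote tasks out from the replica, as described above.
        results = await asyncio.gather(*[small_task.remote(i) for i in range(20)])
        return str(sum(results))


serve.run(Module.bind())


def one_request() -> int:
    return requests.get("http://127.0.0.1:8000/").status_code


# Keep ~10 requests in flight; replace each one as it completes,
# matching the load pattern described above.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(one_request) for _ in range(10)}
    while True:
        done, futures = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        for _ in done:
            futures.add(pool.submit(one_request))
```

Left running for a couple of days alongside per-process memory monitoring, this should show whether component memory grows under a constant request rate.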
@Lorien2027 a few more questions:
For context, it's strange that this memory keeps growing. If the application is somehow creating a lot of ServeHandles and never freeing them, that could cause this increase.
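To illustrate the ServeHandle point, here's a hypothetical sketch using the Ray 2.x-era `serve.get_deployment(...).get_handle()` API (the deployment name `"Module"` is an assumption):

```python
import ray
from ray import serve

# Leak-prone pattern: a fresh handle per call. Each ServeHandle carries its
# own routing/long-poll state, so creating one per request can grow memory
# if the handles are never released.
def call_leaky(payload):
    handle = serve.get_deployment("Module").get_handle()
    return ray.get(handle.remote(payload))

# Safer pattern: create the handle once and reuse it across requests.
HANDLE = serve.get_deployment("Module").get_handle()

def call_reused(payload):
    return ray.get(HANDLE.remote(payload))
```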
We checked another cluster that was active for more than 5 days: it processed requests for less than 30 minutes at the very beginning, then no requests were received. Raylet memory has stabilized, while the memory of the remaining components and all Serve deployment modules increases over time. This is quite strange in the absence of requests. The actors were alive the whole time according to the dashboard, so they did not crash and did not restart.
Thus, the Serve API is not used. An interesting point is that the memory of this global actor also increases over time, even though no requests are received and all objects from previous requests have been removed from plasma.
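One way to dig into where a single actor's memory is going (my own suggestion, not something the reporter used) is to snapshot `tracemalloc` inside the actor and diff it later. This only catches Python-level allocations, not native/plasma memory, but it can localize leaks like the one described above.

```python
import tracemalloc

import ray


@ray.remote
class GlobalActor:
    def __init__(self):
        # Take a baseline snapshot at startup to diff against later.
        tracemalloc.start()
        self._baseline = tracemalloc.take_snapshot()

    def top_growth(self, limit: int = 10) -> list:
        # Report the allocation sites that grew the most since startup.
        snapshot = tracemalloc.take_snapshot()
        stats = snapshot.compare_to(self._baseline, "lineno")
        return [str(s) for s in stats[:limit]]


# Usage: periodically call actor.top_growth.remote() and log the result.
```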
@shrekris-anyscale yes, my co-worker sent you an email. |
Thanks, I sent a reply. I'll delete the comment with my email for now, so I don't get any spam. |
Hello! I also faced this problem and here is what I dug up on it: |
Hi! We're seeing the same issue on our workload with Ray 2.6. We only have one application, with 6 deployments and a total of 8 replicas. We do not use FastAPI. We use KubeRay to deploy the service, and the memory usage drops every time we deploy a new image. The raylet and ServeController usage increases by ~150 MB/day, and HTTPProxyActor usage increases by ~130 MB/day. Memory usage for each deployment grows much less, anywhere between ~10 MB and ~30 MB/day.
Thanks for posting, @smit-kiri! I have some follow-up questions:
Thanks for the follow-up. Just to clarify:
I was wondering how many bytes are in the body of each request that gets sent to Serve. E.g., if you're including some text in each request, maybe it's just a few bytes; if you're including images, maybe it's a few megabytes. Could you clarify?
The majority (~90%) of requests have an average text length of ~100 characters. The others have larger text of ~10K characters.
@smit-kiri, can you share which uvicorn version you're using?
Hi @smit-kiri / @Lorien2027, we fixed most of the memory leaks in 2.7. Could you try upgrading and let us know if the issues still persist?
The head node memory utilization now looks like this: the utilization increases for some time after we deploy a new image, then levels off. (Graphs omitted: one shows the … metric; the other compares the two metrics.) But in general, it looks like the memory leak is fixed!
Thanks for confirming!
What happened + What you expected to happen
We have a Serve application that we are load testing. It consists of 2 nodes (head and worker); approximately 50 thousand requests are processed per day, and about 40 remote tasks are called from the actors for each request. We observe memory growth over time both in Ray components (raylet, Serve Controller, Serve Controller:listen_for_change) and in the modules. At about 12:00, the flow of requests was stopped: the memory of the Ray components continued to grow, and the increase in the modules slowed down but did not stop (about 30 MB in each from 12:00 to 15:00). The number of requests is constant over time, and the autoscaler is turned off. Examples are given only for 2 modules (memory increases for all).
Memory usage graphs (images omitted) are attached for:
- Module 1
- Module 2
- HTTP Proxy Actor
- Raylet
- Serve Controller
- Serve Controller:listen_for_change
- Head Node
- Worker Node
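For reference, per-process curves like these can be collected with a small poller. This is a sketch, not the reporter's monitoring setup; the substrings matched against process command lines are assumptions based on how Ray typically labels its worker processes.

```python
import time

import psutil

# Substrings to look for in process command lines; adjust as needed.
WATCH = ("raylet", "ServeController", "HTTPProxyActor")

while True:
    for proc in psutil.process_iter(attrs=["cmdline", "memory_info"], ad_value=None):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if any(key in cmdline for key in WATCH) and proc.info["memory_info"]:
            # Print a timestamped RSS sample for each matching process.
            rss_mb = proc.info["memory_info"].rss / 1e6
            print(f"{time.time():.0f} {cmdline[:60]} rss={rss_mb:.1f} MB")
    time.sleep(60)
```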
Versions / Dependencies
ray 2.4.0
Reproduction script
It will take an additional 1-2 weeks to create a minimal script, since we cannot provide the source code.
Issue Severity
High: It blocks me from completing my task.