
[Core][Serve] Memory leak in Ray components and modules #35714

Closed
Lorien2027 opened this issue May 24, 2023 · 25 comments
Assignees
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks Ray-2.6 serve Ray Serve Related Issue

Comments

@Lorien2027

What happened + What you expected to happen

We have a Serve application that we are load testing. It consists of 2 nodes (head and worker); approximately 50,000 requests are processed per day, and each request triggers about 40 remote tasks called from the actors. We observe memory growth over time both in Ray components (Raylet, Serve Controller, Serve Controller:listen_for_change) and in our modules. At about 12:00 the flow of requests was stopped; the memory of the Ray components continued to grow, and the growth in the modules slowed but did not stop (about 30 MB in each from 12:00 to 15:00). The number of requests is constant over time, and the autoscaler is turned off. Examples are given for only 2 modules (memory increases in all of them).
Module 1: [memory graph: deployment_1]
Module 2: [memory graph: deployment_2]
HTTP Proxy Actor: [memory graph: http_proxy_actor]
Raylet: [memory graph: raylet]
Serve Controller: [memory graph: serve_controller]
Serve Controller:listen_for_change: [memory graph: serve_controller_listen_for_change]
Head Node: [memory graph: head_node]
Worker Node: [memory graph: worker_node]

Versions / Dependencies

ray 2.4.0

Reproduction script

It will take an additional 1-2 weeks to create a minimal script, since we cannot provide the source code.

Issue Severity

High: It blocks me from completing my task.

@Lorien2027 Lorien2027 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 24, 2023
@larrylian
Contributor

@Lorien2027

  1. Can you take a core dump of the process's memory and analyze which memory blocks are growing?
  2. Do you use placement groups?
  3. Do you have a way to monitor the memory size of the Ray object store?
  4. Are your actors always resident, i.e., you don't frequently create new actors?

@xieus

xieus commented May 25, 2023

Great analysis.

cc: @scv119 @iycheng

@Lorien2027
Author

Lorien2027 commented May 25, 2023

@larrylian

  1. Yes, what is the most convenient way to do this, and for which processes?
  2. No, only those that Ray creates automatically internally.
  3. Yes, and it does not grow over time.
  4. Yes, they are always resident. The number of replicas is fixed and equal to the number of simultaneous requests being processed.

There is perhaps another important point related to idle workers:

  1. Initially we also had a problem with the autoscaler turned on: after a while the processing time increased sharply, stayed that way for several hours, and then dropped back within 10 seconds, though not to the original values. Actors were not added or removed.
  2. We disabled the autoscaler and also set kill_idle_workers_interval_ms = 0, since we noticed that more than 10,000 workers were created per day (see the config sketch after this list).
  3. After these two changes, the processing time became constant and even decreased.
  4. As a result, idle workers are never deleted. The sudden increase in processing time was probably caused by the autoscaler, but that needs separate tests and a separate issue.
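For reference, a minimal sketch of how that setting could be applied when starting the cluster (this uses the internal _system_config argument of ray.init on the head node, which is an unstable knob; the original report does not say exactly how the value was set):

import ray

# Disable periodic killing of idle workers (internal/unstable raylet setting).
# System config can only be set when the cluster is started, i.e., on the head node.
ray.init(_system_config={"kill_idle_workers_interval_ms": 0})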

@Lorien2027
Author

About 2 months ago, we also observed an increase in raylet memory on Ray 2.3 without changing kill_idle_workers_interval_ms. I don't remember the same happening for the modules at that time (maybe I just didn't notice).

@rkooo567
Contributor

rkooo567 commented Jun 1, 2023

I have a couple more questions. Also, we don't need an exact repro. If you tell us your setup in detail, we can build a mock script and test memory usage growth over time. For example, you could tell us something like:

  • X number of actors. X number of nodes.
  • X number of requests to X actors
  • Run for X hours
  • Each request takes X time.
  • Any private config changes you made

Also, a couple of additional questions:

  1. What do you mean by module 1 & module 2?
  2. Does raylet memory usage grow forever, or does it stop at some point?

We disabled the autoscaler and also set kill_idle_workers_interval_ms = 0, since we noticed that more than 10,000 workers were created per day.

This is pretty weird. If the number of actors stays the same, new workers are not supposed to be created. Is this issue still happening even without the autoscaler? Also, what happens if you don't set kill_idle_workers_interval_ms and just turn off the autoscaler?

Also, the main leak seems to be from the Serve controller. cc @sihanwang41, have you seen a similar memory leak issue? Seems important to fix.

@rkooo567 rkooo567 added the serve Ray Serve Related Issue label Jun 1, 2023
@rkooo567
Contributor

rkooo567 commented Jun 1, 2023

Marked the issue with the serve label since the main memory leak is from Serve components. However, there is also a slight memory usage increase in the raylet (~300 MB/day). I think it could be due to accumulating system metadata, but once the workload is provided, we can add long-running tests to verify it.

@scv119 scv119 added Ray-2.6 P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 1, 2023
@Lorien2027
Author

Lorien2027 commented Jun 2, 2023

@rkooo567

X number of actors

2 serve applications, 10 serve deployment modules with 10 replicas each. A total of 200 actors.

X number of nodes

One node with two docker containers (head and worker).

X number of requests to X actors

10 per application, sent constantly to Serve over HTTP (FastAPI). When one of the requests completes, a new one is added.

Run for X hours

Minimum 48-72 hours.

Each request takes X time.

From 30 seconds to 3 minutes.

Any private config changes you made

Only kill_idle_workers_interval_ms = 0.

What do you mean by module 1 & module 2?

Two Serve deployment modules (the dashboard graph shows the total memory across all of their replicas).

Does raylet memory usage grow forever, or does it stop at some point?

We will check closer to the end of next week, as we will make another launch for a longer period.

This is pretty weird. If there are the same number of actors, workers are not supposed to be created. Is this issue still happening even without autoscaler? Also, what's happening if you don't set kill_idle_workers_interval_ms & just turn off the autoscaler?

Workers are created not for the actors but for the tasks that we call from the actors. In total, 20-40 remote functions are called per request, and the default kill_idle_workers_interval_ms is 200 ms, which is very small. The exit status of the workers showed that they had been idle for too long. A rough mock of the workload is sketched below.
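To make this easier to reproduce, a minimal mock of this kind of workload might look like the following sketch (all names here are hypothetical and only illustrate the shape of the setup described above, i.e., Serve deployments whose replicas fan out 20-40 short Ray tasks per request):

import asyncio
import time

import ray
from ray import serve


@ray.remote
def worker_task(i):
    # Stand-in for one of the 20-40 remote functions called per request.
    time.sleep(1)
    return i


@serve.deployment(num_replicas=10)
class MockModule:
    async def __call__(self, request):
        # Each request fans out to ~40 short Ray tasks, mimicking the described load.
        refs = [worker_task.remote(i) for i in range(40)]
        results = await asyncio.gather(*refs)
        return {"completed": len(results)}


app = MockModule.bind()
# serve.run(app)  # then drive ~10 concurrent HTTP requests per application for 48-72 hours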

@Lorien2027
Author

I also updated DEFAULT_LATENCY_BUCKET_MS in ray/serve/_private/constants.py as the default buckets are too small for our processing time.
DEFAULT_LATENCY_BUCKET_MS = [
    1, 2, 5, 10, 20, 50, 100, 200, 500,
    1000, 2000, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000,
    50000, 55000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000,
    140000, 150000, 160000, 170000, 180000, 195000, 210000, 225000, 240000, 255000,
    270000, 285000, 300000, 315000, 330000, 345000, 360000, 420000, 480000, 540000,
    600000, 900000, 1800000, 3600000,
]
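For anyone making the same change, a quick way to locate the installed module and inspect the current buckets is something like this (a small sketch; ray.serve._private.constants is an internal module, so its contents may change between releases):

import ray.serve._private.constants as serve_constants

print(serve_constants.__file__)  # path of the installed constants.py that was edited
print(serve_constants.DEFAULT_LATENCY_BUCKET_MS[:5])  # first few default latency buckets, in ms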

@shrekris-anyscale
Contributor

@Lorien2027 a few more questions:

  1. Do the Ray tasks that get launched on each request contain handles to the Serve deployments?
  2. Do the Ray tasks terminate completely once they're finished? Do they release all memory?
  3. Do the Serve deployments crash and get restarted often?
  4. What's the output of serve status? Are all the deployments healthy?

@shrekris-anyscale
Contributor

shrekris-anyscale commented Jun 6, 2023

For context, it's strange that ServeController:listen_for_change is growing in memory. Here's its code. It's only called whenever a LongPollClient is initialized or reset. LongPollClients are initialized in only two places: the HTTPProxy and the ServeHandle (which contains a Router).

If the application is somehow creating a lot of ServeHandles and never freeing them, that could cause this increase.
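As an illustration of the kind of pattern that could cause this, here is a hypothetical sketch (not taken from the reported application, and assuming the pre-2.7 serve.get_deployment(...).get_handle() API):

from ray import serve

def call_downstream(payload):
    # Anti-pattern: a new ServeHandle (and with it a new LongPollClient registered
    # with the Serve controller) is created on every call and never reused.
    handle = serve.get_deployment("Downstream").get_handle()
    return handle.remote(payload)

# Preferred: create the handle once at startup and reuse it across requests.
downstream_handle = serve.get_deployment("Downstream").get_handle()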

@Lorien2027
Author

Lorien2027 commented Jun 6, 2023

  1. No. The only difference is that some functions use a global detached actor as the _owner parameter of the ray.put call. We use it so we don't lose references to objects, since we transfer large numpy arrays between different modules (see [core] Support detached/GCS owned objects #12635). Plasma memory is released automatically at the end of processing.
  2. Yes
  3. No
  4. They are healthy all the time according to the Ray dashboard. serve status displays the error from [Serve] The API routes for ServeAgent are not available #34882.

We checked another cluster that was active for more than 5 days; it processed requests for less than 30 minutes at the very beginning, then received no further requests. Raylet memory stabilized, while the memory of the remaining components and of all Serve deployment modules kept increasing over time, which is quite strange in the absence of requests. The actors were alive the whole time according to the dashboard, so they did not crash or restart.

Raylet: [memory graph: raylet]
listen_for_change: [memory graph: ray_listen_for_change]
Ray IDLE: [memory graph: ray_idle]
Agent: [memory graph: agent]
One of the Serve deployment modules: [memory graph: module]

@Lorien2027
Author

@shrekris-anyscale

  1. A small clarification: we only use the Core API for the global detached actor, for example:
import ray

owner = GlobalOwner.options(name="GlobalOwner", lifetime="detached", get_if_exists=True).remote()
ray.get(owner.alive.remote())  # wait for the detached actor to be created
.....
owner = ray.get_actor("GlobalOwner")
data_ref = ray.put(data, _owner=owner)  # the detached actor owns the object, so it outlives the creating worker

Thus, the Serve API is not used. An interesting point is that the memory of this global actor also increases over time, although no requests are received, and all objects from previous requests have been removed from plasma.
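For completeness, the GlobalOwner actor referenced above might look roughly like this (a hypothetical minimal definition; the original class is not shown in this thread):

import ray

@ray.remote
class GlobalOwner:
    # A minimal detached actor whose only purpose is to own objects created via
    # ray.put(..., _owner=...), so that they outlive the worker that created them.
    def alive(self):
        return True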

@Lorien2027
Author

@shrekris-anyscale yes, my co-worker sent you an email.

@shrekris-anyscale
Contributor

Thanks, I sent a reply. I'll delete the comment with my email for now, so I don't get any spam.

@hyrmec

hyrmec commented Jul 3, 2023

Hello! I also faced this problem, and here is what I dug up on it:
I think the problem lies in uvicorn. According to the comments in the issue below (the problem has gone unfixed for a very long time), it is recommended to replace uvicorn with hypercorn.
fastapi/fastapi#9082

@smit-kiri

Hi! We're seeing the same issue on our workload with Ray 2.6.
Our graphs for each component look the same as those in the description.

We only have one application with 6 deployments and a total of 8 replicas. We do not use FastAPI.
Each deployment is independent and they receive different numbers of requests, but in total they receive anywhere between 3 and 30 requests / minute at different times of the day.

We use KubeRay to deploy the service and the memory usage drops every time we deploy a new image.

Raylet and ServeController usage increases by ~150 MB / day, and HTTPProxyActor usage increases by ~130 MB / day. Memory usage for each deployment is much less, anywhere between ~10 MB and ~30 MB / day.

@shrekris-anyscale
Contributor

shrekris-anyscale commented Jul 31, 2023

Thanks for posting, @smit-kiri! I have some follow-up questions:

  1. Does your Serve app start any other Ray tasks or actors?

  2. Roughly how many bytes of data are you sending in your requests?

Memory usage for each deployment is much less, anywhere between ~10 MB and ~30 MB / day.

  1. Is the deployment memory usage growing by 10-30 MB each day, or is it roughly constant at that amount?

Each deployment is independent and they receive different numbers of requests, but in total they receive anywhere between 3 and 30 requests / minute at different times of the day.

  1. Do the requests usually finish as expected?
  2. Do clients often disconnect before a request is finished?
  3. Do requests often get queued up on the proxy?

@smit-kiri

Thanks for posting, @smit-kiri! I have some follow-up questions:

  1. Does your Serve app start any other Ray tasks or actors?

It does not.

  1. Roughly how many bytes of data are you sending in your requests?

Memory usage for each deployment is much less, anywhere between ~10 MB and ~30 MB / day.

  1. Is the deployment memory usage growing by 10-30 MB each day, or is it roughly constant at that amount?

It flattens out; for example, one deployment's usage increased by ~60 MB the first day, ~30 MB the second day, and ~20 MB the third day.
We're updating the image pretty frequently right now, so we don't have data points for a longer period of time.

  1. Do the requests usually finish as expected?

Yes, all requests were successful.

  1. Do clients often disconnect before a request is finished?

Nope.

  1. Do requests often get queued up on the proxy?

Nope.

@shrekris-anyscale
Contributor

Thanks for the follow-up. Just to clarify:

Roughly how many bytes of data are you sending in your requests?

I was wondering how many bytes are in the body of each request that gets sent to Serve. E.g., if you're including some text in each request, maybe it's just a few bytes; if you're including images, maybe it's a few megabytes. Could you clarify?

@smit-kiri

Thanks for the follow-up. Just to clarify:

Roughly how many bytes of data are you sending in your requests?

I was wondering how many bytes are in the body of each request that gets sent to Serve. E.g., if you're including some text in each request, maybe it's just a few bytes; if you're including images, maybe it's a few megabytes. Could you clarify?

The majority (~90%) of requests have an average text length of ~100 characters. The others have larger text of ~10K characters.

@alexeykudinkin
Contributor

@smit-kiri can you share what uvicorn version you're using?

@smit-kiri

@smit-kiri can you share what uvicorn version you're using?

0.22.0

@akshay-anyscale
Contributor

Hi @smit-kiri / @Lorien2027, we fixed most of the memory leaks in 2.7. Could you try upgrading and let us know if the issues still persist?

@smit-kiri

The head node memory utilization now looks like this: it increases for some time after we deploy a new RayService and then levels off.

[Screenshot 2023-10-30 at 9:30:08 AM: head node memory utilization]

^Here I'm using the ray_node_mem_used metric; I'm not sure if that is correct.

[Screenshot 2023-10-30 at 9:48:20 AM: comparison of the two metrics]

^This graph compares the two metrics sum(ray_node_mem_used{pod="head-pod-name"}) by (pod) and sum(ray_component_rss_mb{pod="head-pod-name"} * 1e6) by (pod).

But in general, it looks like the memory leak is fixed!

@akshay-anyscale
Contributor

thanks for confirming
