
[Core][Serve] Memory leak in Ray components and modules #35714

Closed
Lorien2027 opened this issue May 24, 2023 · 25 comments
Assignees
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks Ray-2.6 serve Ray Serve Related Issue

Comments

@Lorien2027

What happened + What you expected to happen

We have a Serve application that we are load testing. It consists of 2 nodes (head and worker); approximately 50,000 requests are processed per day, and each request triggers about 40 remote tasks called from the actors. We observe memory growth over time both in Ray components (Raylet, Serve Controller, Serve Controller:listen_for_change) and in our modules. At about 12:00 the flow of requests was stopped; the memory of the Ray components continued to grow, and the growth in the modules slowed but did not stop (about 30 MB in each from 12:00 to 15:00). The number of requests is constant over time, and the autoscaler is turned off. Examples are given for only 2 modules (memory increases in all of them).
Module 1: [memory graph: deployment_1]
Module 2: [memory graph: deployment_2]
HTTP Proxy Actor: [memory graph: http_proxy_actor]
Raylet: [memory graph: raylet]
Serve Controller: [memory graph: serve_controller]
Serve Controller:listen_for_change: [memory graph: serve_controller_listen_for_change]
Head Node: [memory graph: head_node]
Worker Node: [memory graph: worker_node]

Versions / Dependencies

ray 2.4.0

Reproduction script

It will take an additional 1-2 weeks to create a minimal script, since we cannot provide the source code.

Issue Severity

High: It blocks me from completing my task.

@Lorien2027 Lorien2027 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 24, 2023
@larrylian
Contributor

@Lorien2027

  1. Can you take a core dump of the process's memory and analyze which memory blocks are growing?
  2. Do you use placement groups?
  3. Do you have a way to monitor the memory size of the Ray object store?
  4. Are your actors always resident, i.e., you don't frequently create new actors?

@xieus

xieus commented May 25, 2023

Great analysis.

cc: @scv119 @iycheng

@Lorien2027
Author

Lorien2027 commented May 25, 2023

@larrylian

  1. Yes, what is the most convenient way to do this, and for which processes?
  2. No, only those that Ray creates automatically internally.
  3. Yes, and it does not grow over time.
  4. Yes, they are always resident. The number of replicas is fixed and equal to the number of simultaneous requests being processed.

There is perhaps another important point related to idle workers:

  1. Initially we also had a problem with the autoscaler turned on: after a while the processing time increased sharply, stayed that way for several hours, and then dropped back within 10 seconds, though not to the original values. Actors were not added or removed.
  2. We disabled the autoscaler and also set kill_idle_workers_interval_ms = 0, since we noticed that more than 10,000 workers were created per day (see the config sketch after this list).
  3. After these two changes, the processing time became constant and even decreased.
  4. As a result, idle workers are never deleted. The sudden increase in processing time was probably caused by the autoscaler, but that needs separate tests and a separate issue.
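For reference, a minimal sketch of how that setting could be applied when starting the cluster (this uses the internal _system_config argument of ray.init on the head node, which is an unstable knob; the original report does not say exactly how the value was set):

import ray

# Disable periodic killing of idle workers (internal/unstable raylet setting).
# System config can only be set when the cluster is started, i.e., on the head node.
ray.init(_system_config={"kill_idle_workers_interval_ms": 0})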

@Lorien2027
Author

About 2 months ago, we also observed an increase in raylet memory on Ray 2.3 without changing kill_idle_workers_interval_ms. I don't remember the same happening for the modules at that time (maybe I just didn't notice).

@rkooo567
Contributor

rkooo567 commented Jun 1, 2023

I have a couple more questions. Also, we don't need an exact repro. If you tell us your setup in detail, we can build a mock script and test memory usage growth over time. For example, you could tell us something like:

  • X number of actors. X number of nodes.
  • X number of requests to X actors
  • Run for X hours
  • Each request takes X time.
  • Any private config changes you made

Also, a couple of additional questions:

  1. What do you mean by module 1 & module 2?
  2. Does raylet memory usage grow forever, or does it stop at some point?

We disabled the autoscaler and also set kill_idle_workers_interval_ms = 0, since we noticed that more than 10,000 workers were created per day.

This is pretty weird. If the number of actors stays the same, new workers are not supposed to be created. Is this issue still happening even without the autoscaler? Also, what happens if you don't set kill_idle_workers_interval_ms and just turn off the autoscaler?

Also, the main leak seems to be from the Serve controller. cc @sihanwang41, have you seen a similar memory leak issue? Seems important to fix.

@rkooo567 rkooo567 added the serve Ray Serve Related Issue label Jun 1, 2023
@rkooo567
Contributor

rkooo567 commented Jun 1, 2023

Marked the issue with the serve label since the main memory leak is from Serve components. However, there is also a slight memory usage increase in the raylet (~300 MB/day). I think it could be due to accumulating system metadata, but once the workload is provided, we can add long-running tests to verify it.

@scv119 scv119 added Ray-2.6 P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 1, 2023
@Lorien2027
Author

Lorien2027 commented Jun 2, 2023

@rkooo567

X number of actors

2 serve applications, 10 serve deployment modules with 10 replicas each. A total of 200 actors.

X number of nodes

One node with two docker containers (head and worker).

X number of requests to X actors

10 per application, sent constantly to Serve over HTTP (FastAPI). When one of the requests completes, a new one is added.

Run for X hours

Minimum 48-72 hours.

Each request takes X time.

From 30 seconds to 3 minutes.

Any private config changes you made

Only kill_idle_workers_interval_ms = 0.

What do you mean by module 1 & module 2?

Two Serve deployment modules (the dashboard graph shows the total memory across all of their replicas).

Does raylet memory usage grow forever, or does it stop at some point?

We will check closer to the end of next week, as we will make another launch for a longer period.

This is pretty weird. If there are the same number of actors, workers are not supposed to be created. Is this issue still happening even without autoscaler? Also, what's happening if you don't set kill_idle_workers_interval_ms & just turn off the autoscaler?

Workers are created not for the actors but for the tasks that we call from the actors. In total, 20-40 remote functions are called per request, and the default kill_idle_workers_interval_ms is 200 ms, which is very small. The exit status of the workers showed that they had been idle for too long. A rough mock of the workload is sketched below.
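To make this easier to reproduce, a minimal mock of this kind of workload might look like the following sketch (all names here are hypothetical and only illustrate the shape of the setup described above, i.e., Serve deployments whose replicas fan out 20-40 short Ray tasks per request):

import asyncio
import time

import ray
from ray import serve


@ray.remote
def worker_task(i):
    # Stand-in for one of the 20-40 remote functions called per request.
    time.sleep(1)
    return i


@serve.deployment(num_replicas=10)
class MockModule:
    async def __call__(self, request):
        # Each request fans out to ~40 short Ray tasks, mimicking the described load.
        refs = [worker_task.remote(i) for i in range(40)]
        results = await asyncio.gather(*refs)
        return {"completed": len(results)}


app = MockModule.bind()
# serve.run(app)  # then drive ~10 concurrent HTTP requests per application for 48-72 hours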

@Lorien2027
Author

I also updated DEFAULT_LATENCY_BUCKET_MS in ray/serve/_private/constants.py as the default buckets are too small for our processing time.
DEFAULT_LATENCY_BUCKET_MS = [
    1, 2, 5, 10, 20, 50, 100, 200, 500,
    1000, 2000, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000,
    50000, 55000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000,
    140000, 150000, 160000, 170000, 180000, 195000, 210000, 225000, 240000, 255000,
    270000, 285000, 300000, 315000, 330000, 345000, 360000, 420000, 480000, 540000,
    600000, 900000, 1800000, 3600000,
]
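For anyone making the same change, a quick way to locate the installed module and inspect the current buckets is something like this (a small sketch; ray.serve._private.constants is an internal module, so its contents may change between releases):

import ray.serve._private.constants as serve_constants

print(serve_constants.__file__)  # path of the installed constants.py that was edited
print(serve_constants.DEFAULT_LATENCY_BUCKET_MS[:5])  # first few default latency buckets, in ms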

@shrekris-anyscale
Contributor

@Lorien2027 a few more questions:

  1. Do the Ray tasks that get launched on each request contain handles to the Serve deployments?
  2. Do the Ray tasks terminate completely once they're finished? Do they release all memory?
  3. Do the Serve deployments crash and get restarted often?
  4. What's the output of serve status? Are all the deployments healthy?

@shrekris-anyscale
Contributor

shrekris-anyscale commented Jun 6, 2023

For context, it's strange that ServeController:listen_for_change is growing in memory. Here's its code. It's only called whenever a LongPollClient is initialized or reset. LongPollClients are initialized in only two places: the HTTPProxy and the ServeHandle (which contains a Router).

If the application is somehow creating a lot of ServeHandles and never freeing them, that could cause this increase.
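As an illustration of the kind of pattern that could cause this, here is a hypothetical sketch (not taken from the reported application, and assuming the pre-2.7 serve.get_deployment(...).get_handle() API):

from ray import serve

def call_downstream(payload):
    # Anti-pattern: a new ServeHandle (and with it a new LongPollClient registered
    # with the Serve controller) is created on every call and never reused.
    handle = serve.get_deployment("Downstream").get_handle()
    return handle.remote(payload)

# Preferred: create the handle once at startup and reuse it across requests.
downstream_handle = serve.get_deployment("Downstream").get_handle()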

@Lorien2027
Author

Lorien2027 commented Jun 6, 2023

  1. No. The only difference is that some functions use a global detached actor as the _owner parameter of the ray.put call. We use it so we don't lose references to objects, since we transfer large numpy arrays between different modules (see [core] Support detached/GCS owned objects #12635). Plasma memory is released automatically at the end of processing.
  2. Yes
  3. No
  4. They are healthy all the time according to the Ray dashboard. serve status displays the error from [Serve] The API routes for ServeAgent are not available #34882.

We checked another cluster that was active for more than 5 days; it processed requests for less than 30 minutes at the very beginning, then received no further requests. Raylet memory stabilized, while the memory of the remaining components and of all Serve deployment modules kept increasing over time, which is quite strange in the absence of requests. The actors were alive the whole time according to the dashboard, so they did not crash or restart.

Raylet: [memory graph: raylet]
listen_for_change: [memory graph: ray_listen_for_change]
Ray IDLE: [memory graph: ray_idle]
Agent: [memory graph: agent]
One of the Serve deployment modules: [memory graph: module]

@Lorien2027
Author

@shrekris-anyscale

  1. A small clarification: we only use the Core API for the global detached actor, for example:
import ray

owner = GlobalOwner.options(name="GlobalOwner", lifetime="detached", get_if_exists=True).remote()
ray.get(owner.alive.remote())  # wait for the detached actor to be created
.....
owner = ray.get_actor("GlobalOwner")
data_ref = ray.put(data, _owner=owner)  # the detached actor owns the object, so it outlives the creating worker

Thus, the Serve API is not used. An interesting point is that the memory of this global actor also increases over time, although no requests are received, and all objects from previous requests have been removed from plasma.
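For completeness, the GlobalOwner actor referenced above might look roughly like this (a hypothetical minimal definition; the original class is not shown in this thread):

import ray

@ray.remote
class GlobalOwner:
    # A minimal detached actor whose only purpose is to own objects created via
    # ray.put(..., _owner=...), so that they outlive the worker that created them.
    def alive(self):
        return True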

@Lorien2027
Author

@shrekris-anyscale yes, my co-worker sent you an email.

@shrekris-anyscale
Contributor

Thanks, I sent a reply. I'll delete the comment with my email for now, so I don't get any spam.

@hyrmec

hyrmec commented Jul 3, 2023

Hello! I also faced this problem, and here is what I dug up on it:
I think the problem lies in uvicorn. According to the comments in the issue below (the problem has gone unfixed for a very long time), it is recommended to replace uvicorn with hypercorn.
fastapi/fastapi#9082

@smit-kiri

Hi! We're seeing the same issue on our workload with Ray 2.6.
Our graphs for each component look the same as those in the description.

We only have one application with 6 deployments and a total of 8 replicas. We do not use FastAPI.
Each deployment is independent and they receive different numbers of requests, but in total they receive anywhere between 3 and 30 requests / minute at different times of the day.

We use KubeRay to deploy the service and the memory usage drops every time we deploy a new image.

Raylet and ServeController usage increases by ~150 MB / day, and HTTPProxyActor usage increases by ~130 MB / day. Memory usage for each deployment is much less, anywhere between ~10 MB and ~30 MB / day.

@shrekris-anyscale
Contributor

shrekris-anyscale commented Jul 31, 2023

Thanks for posting, @smit-kiri! I have some follow-up questions:

  1. Does your Serve app start any other Ray tasks or actors?

  2. Roughly how many bytes of data are you sending in your requests?

Memory usage for each deployment is much less, anywhere between ~10 MB and ~30 MB / day.

  1. Is the deployment memory usage growing by 10-30 MB each day, or is it roughly constant at that amount?

Each deployment is independent and they receive different numbers of requests, but in total they receive anywhere between 3 and 30 requests / minute at different times of the day.

  1. Do the requests usually finish as expected?
  2. Do clients often disconnect before a request is finished?
  3. Do requests often get queued up on the proxy?

@smit-kiri

Thanks for posting, @smit-kiri! I have some follow-up questions:

  1. Does your Serve app start any other Ray tasks or actors?

It does not.

  1. Roughly how many bytes of data are you sending in your requests?

Memory usage for each deployment is much less, anywhere between ~10 MB and ~30 MB / day.

  1. Is the deployment memory usage growing by 10-30 MB each day, or is it roughly constant at that amount?

It flattens out; for example, one deployment's usage increased by ~60 MB the first day, ~30 MB the second day, and ~20 MB the third day.
We're updating the image pretty frequently right now, so we don't have data points for a longer period of time.

  1. Do the requests usually finish as expected?

Yes, all requests were successful.

  1. Do clients often disconnect before a request is finished?

Nope.

  1. Do requests often get queued up on the proxy?

Nope.

@shrekris-anyscale
Contributor

Thanks for the follow-up. Just to clarify:

Roughly how many bytes of data are you sending in your requests?

I was wondering how many bytes are in the body of each request that gets sent to Serve. E.g., if you're including some text in each request, maybe it's just a few bytes; if you're including images, maybe it's a few megabytes. Could you clarify?

@smit-kiri

Thanks for the follow-up. Just to clarify:

Roughly how many bytes of data are you sending in your requests?

I was wondering how many bytes are in the body of each request that gets sent to Serve. E.g., if you're including some text in each request, maybe it's just a few bytes; if you're including images, maybe it's a few megabytes. Could you clarify?

The majority (~90%) of requests have an average text length of ~100 characters. The others have larger text of ~10K characters.

@alexeykudinkin
Contributor

@smit-kiri can you share what uvicorn version you're using?

@smit-kiri

@smit-kiri can you share what uvicorn version you're using?

0.22.0

@akshay-anyscale
Contributor

Hi @smit-kiri / @Lorien2027, we fixed most of the memory leaks in 2.7. Could you try upgrading and let us know if the issues still persist?

@smit-kiri

The head node memory utilization now looks like this: it increases for some time after we deploy a new RayService and then levels off.

[Screenshot 2023-10-30 at 9:30:08 AM: head node memory utilization]

^Here I'm using the ray_node_mem_used metric; I'm not sure if that is correct.

[Screenshot 2023-10-30 at 9:48:20 AM: comparison of the two metrics]

^This graph compares the two metrics sum(ray_node_mem_used{pod="head-pod-name"}) by (pod) and sum(ray_component_rss_mb{pod="head-pod-name"} * 1e6) by (pod).

But in general, it looks like the memory leak is fixed!

@akshay-anyscale
Contributor

thanks for confirming
