[Bug] Excess memory usage when scheduling tasks in parallel? #20618
Comments
Update: If I define 1.) success = finishing the i-loop without worker failure then
Am I right in inferring that this could be either 1.) a memory leak issue, or 2.) a garbage collection efficiency issue?
My issue seems similar to #7653, but in my case I'm able to relieve the leak via a longer sleep time, which feels garbage-collection-related.
If you add system.gc() in the loop / remote functions, does it prevent the leak?
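A minimal sketch of what that check might look like in Python, assuming `gc.collect()` is the intended equivalent of `system.gc()`; the `concat` task body, loop counts, and array sizes are illustrative rather than taken from the original script:

```python
import gc
import time

import numpy as np
import ray

ray.init(num_cpus=4)

@ray.remote
def concat(*arrays):
    ar = np.concatenate(arrays)
    del ar
    gc.collect()  # explicit collection inside the worker process

for i in range(20):
    array_futures = [ray.put(np.ones(1_000_000)) for _ in range(4)]
    concat.remote(*array_futures)
    gc.collect()  # explicit collection in the driver loop
    time.sleep(1)
```

If RES stops growing with these calls in place, the growth is more likely delayed garbage collection than a true leak.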
Hi @ericl, here's the new code I tried (just added a few
And here's the output:
Attached are some screenshots. Thank you very much for looking into this!
Hmm, I'll take another look. Another thing to try is running "ray memory" to see what references are still in scope. That should tell you whether it's a reference counting issue or not, and also the type of reference causing the leak.
The following seems to work for me (it ran to completion). I think the issue was that the loop wasn't blocking on the remote task to finish (no ray.get), hence the number of active tasks / memory usage grows without bound.
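Eric's snippet did not survive the copy; below is a hedged reconstruction of the pattern he describes: block on each remote `concat` with `ray.get` before scheduling the next one, so only one batch of arrays is in flight at a time. Names and sizes are illustrative, not his exact code:

```python
import numpy as np
import ray

ray.init(num_cpus=4)

@ray.remote
def concat(*arrays):
    return np.concatenate(arrays).nbytes  # return something small

for i in range(20):
    array_futures = [ray.put(np.ones(1_000_000)) for _ in range(4)]
    result = ray.get(concat.remote(*array_futures))  # blocks until the task finishes
    del array_futures  # drop the refs so the object store can reclaim the arrays
```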
Hi @ericl, thanks for the quick reply. Yes, your code ran to completion for me too. And it's interesting that the worker memory footprint on the dashboard is only about 1.8GB now, compared to ~3.2GB when I ray.get() in the remote. I thought whatever the remote gets would just be zero-copy-mapped to the object store, so the total memory footprint should stay ~1.6GB. Or perhaps there's some double-counting on the dashboard I'm looking at? Also, does your example suggest the rule of thumb that I should always ray.get() on the driver, concat, and then move on to the next list of arrays? That seems to be a rather limiting constraint on Ray usage, and it wouldn't be too different from a completely serial implementation.
Yeah, it is pretty easy to shoot yourself in the foot when using the tasks API directly. It's a question of optimal scheduling order: in this case, you minimize footprint by scheduling serially (or maybe you can allow a limited number of tasks to execute). You can run all of them in parallel, but that seems to cause excessive thrashing in this hardware configuration. (cc @stephanie-wang for further thoughts on memory management here)
Yeah, Eric's approach of inserting ray.get() or ray.wait() statements to throttle how many tasks are in the system is the recommended approach for now. It might also help if you use varargs to pass array_futures. Also, when you say it's failing, do you mean that the driver gets OOM-killed?
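A hedged sketch of the throttling pattern mentioned above, using `ray.wait` to cap how many `concat` tasks are in the system at once; the cap of 2 and the task body are illustrative choices, not values from this thread:

```python
import numpy as np
import ray

ray.init(num_cpus=4)

@ray.remote
def concat(*arrays):
    return np.concatenate(arrays).shape

MAX_IN_FLIGHT = 2  # illustrative limit on concurrently scheduled tasks
pending = []
for i in range(20):
    array_futures = [ray.put(np.ones(1_000_000)) for _ in range(4)]
    pending.append(concat.remote(*array_futures))
    if len(pending) >= MAX_IN_FLIGHT:
        # Block until at least one outstanding task finishes before submitting more.
        done, pending = ray.wait(pending, num_returns=1)
        ray.get(done)

ray.get(pending)  # drain whatever is still running
```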
Hi @stephanie-wang and @ericl, thank you very much for your assistance. As a recap, this is more or less the code I tested, given a
If the sleep time is <= 11s, the above code would often "fail" like this:
sometimes followed by
, with a dashboard shot like this.
I think my struggles stem from not clearly understanding how scheduling works, leading to a lot of confusion about what the dashboard says:
1.) The screenshot suggests I have 4
2.) If I ray.init() with num_cpus=4, why would there be more than 4 PIDs running at the same time, spill or not?
3.) If I ray.init() with num_cpus=2, I see the
4.) Which numbers should I sum up to determine my total memory footprint? Do I sum up the nodal RAM + Plasma (25.4 + 37.3 ~ 62.7GB), or do I sum up all the non-idle processes below the nodal RAM count (>~74GB)?
Thank you very much again for attending to this issue. It took me some time to put my thoughts together on this. Chun
1.) Most likely what is happening is that the
2.) This can happen because a task can get stuck in
3.) Not sure about this one, but the warning is probably because workers are getting OOM-killed.
4.) Plasma memory should be a subset of the total RAM per node.
No problem! These concepts are definitely not intuitive...
Hi @stephanie-wang, I just tried what you suggested regarding changing the list arg into direct args, with the following changes to the original code:
For me, there doesn't seem to be any practical difference between the two methods. If I provide enough memory for one to work, the other one works as well; same thing with failure. I haven't been able to find a situation where one method succeeds and the other fails. However, I am quite curious about their different behaviors on the dashboard. Here's the direct-args screenshot: versus the list-arg case:
As you kindly explained, passing the list ref to the remote tricks the worker into thinking that all elements are readily available and hence begins scheduling, only to block on the
That leads to a few questions:
1.) What would be the advantage in the list-arg case? I would think that puts extra burden on the scheduler, and potentially confuses those users who are scheduler noobs (like yours truly).
2.) You mentioned that the number of worker processes can certainly go above
3.) Are spill workers additional threads? I'd guess that spilling slows a task down, which could lead to a bottleneck in the backlog?
Once again, thank you so much for attending to this.
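For readers without the screenshots, here is a hedged sketch of the two calling conventions being compared; the function names are illustrative. The relevant difference is that ObjectRefs nested inside a list arrive in the task as refs, while refs passed as top-level (vararg) arguments are resolved to arrays before the task starts:

```python
import numpy as np
import ray

ray.init(num_cpus=4)

array_futures = [ray.put(np.ones(1_000_000)) for _ in range(4)]

@ray.remote
def concat_list(refs):
    # refs arrives as a list of ObjectRefs; the worker must fetch them itself.
    return np.concatenate(ray.get(refs)).shape

@ray.remote
def concat_varargs(*arrays):
    # arrays arrive as already-materialized numpy arrays.
    return np.concatenate(arrays).shape

print(ray.get(concat_list.remote(array_futures)))      # list-arg style
print(ray.get(concat_varargs.remote(*array_futures)))  # direct-args style
```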
@stephanie-wang do you think we should do it in 2.1?
Search before asking
Ray Component
Ray Core
What happened + What you expected to happen
Running this snippet and monitoring via top, one would see the RES of the workers (specifically the one that runs concat) increasing with each loop iteration, and then something like this happens:
until the eventual crash.
On the other hand, if one comments out the line assigning `ar` (and of course the `del ar`), then the RES of each worker stays below 1G. I've looked at a couple of other issues related to remote-task memory leaks from numpy, but they don't seem to pertain to this problem. And because `np.concatenate` (and `xarray.concat` by extension) is rather irreplaceable to me, there's currently no good way for me to get around this problem.
Versions / Dependencies
OS: Linux 3.10.0-1062.12.1.el7.x86_64
Python: 3.7.9 or 3.7.10
Ray: 1.4.1 or 1.8.0
System resources: 4 threads, 60G memory (part of a node as provided via Sun Grid Engine)
Libraries used: numpy 1.21.1 or 1.19.5
Reproduction script
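The script itself did not survive the copy; the following is a hedged reconstruction assembled from the discussion above (a remote task that `ray.get`s a list of array futures, the `ar = np.concatenate(...)` / `del ar` lines, no `ray.get` in the driver loop, and a sleep between iterations). Array sizes, loop counts, and the sleep time are guesses, not the original values:

```python
import time

import numpy as np
import ray

ray.init(num_cpus=4)

@ray.remote
def concat(array_futures):
    arrays = ray.get(array_futures)  # fetch the arrays inside the worker
    ar = np.concatenate(arrays)      # commenting this line out (and the del below)
    del ar                           # reportedly keeps worker RES below 1G

for i in range(1000):
    # ~400 MB per array here; the real sizes are unknown.
    array_futures = [ray.put(np.ones(50_000_000)) for _ in range(4)]
    concat.remote(array_futures)     # no ray.get in the driver, so tasks pile up
    time.sleep(11)                   # sleep times around 11s are discussed above
```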
Anything else
No response
Are you willing to submit a PR?