Memory leak when reading many different objects #349
How did you determine there was a memory leak? Need more info to try to reproduce. I copied and pasted your first code block into a REPL (v0.5) and watched the memory usage in Activity Monitor (Mac OS) for about 1 minute. It varied from about 320 MB to 370 MB, going both up and down in the time I observed. |
OK -- I tried running this code again. I am on Linux -- one Ubuntu system and then a cluster of three Gentoo systems. I pasted the first block of code into a 0.5 REPL on one of the cluster machines and watched memory usage with top. It rose precipitously until it reached ~97% and then held steady there for several minutes. Top's count of memory allocated to cached files held steady and low. But swap usage just climbs and climbs. Similar results on my Ubuntu laptop -- I started it, and it froze the whole laptop with everything but Julia having to run from swap space. I killed it forthwith so that I could get on and write this reply. I wonder: Is there something weird going on with the HDF5 caching files/datasets internally to speed up repeat access? |
I can try it on a linux machine at work tomorrow. |
I can't seem to reproduce this on windows |
I can reproduce this on Ubuntu 16, Julia 0.5.1, HDF5 library version 1.8.16. I observed the memory usage in top grow continuously for about 1 minute of execution while I watched. It would grow more than 0.1% per top refresh, and I think that machine has 16 GB of memory. I put the following code in a file called memleak.jl and ran it:
#causes memory leak
using HDF5
dim1, dim2 = (55, 30000)
function blah()
array1 = h5read("big.h5", "big1.h5")
array2 = h5read("big.h5", "big2.h5")
result1 = array1 .* array2
return findmax(result1)[1]
end
for i=1:200
i%10==0 && println("$i")
rm("big.h5")
h5write("big.h5", "big1.h5", rand(dim1, dim2))
h5write("big.h5", "big2.h5", rand(dim1, dim2))
x = blah()
end
The following lines had numbers larger than 10^6 in front of them: 660005920 for line 676. I ran the same code on Mac OS and found similar numbers. Those are so similar that it doesn't seem like it will be helpful. |
Suppose the memory allocated in these lines is freed in Mac OS but not Linux for some reason? |
That sounds right. I don't know how to debug that unfortunately. |
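For anyone trying to quantify this instead of watching top, here is a minimal sketch (assuming a Julia version where Sys.maxrss() is available) that repeats the reads from the reproducer above and logs the process's peak resident set size; the file name big_mon.h5 is just a placeholder:
# Hedged sketch: log peak RSS numerically while repeating the reads.
using HDF5
dim1, dim2 = (55, 30000)
if !isfile("big_mon.h5")
    h5write("big_mon.h5", "big1.h5", rand(dim1, dim2))
    h5write("big_mon.h5", "big2.h5", rand(dim1, dim2))
end
for i in 1:200
    a = h5read("big_mon.h5", "big1.h5")
    b = h5read("big_mon.h5", "big2.h5")
    i % 10 == 0 && println("iteration $i, maxrss = $(Sys.maxrss() ÷ 2^20) MiB")
end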
Is there any progress on this issue? I am facing the same issue on CentOS release 6.5. |
I encountered the same issue on Ubuntu 14.04 using Julia 0.5.2. |
@musm Do you think you could ping someone who knows about the gc? Since this is platform dependent, it seems like it may represent an actual bug in the gc? |
@yuyichao could this be a problem with the GC? |
That's strong evidence that it's not related to the GC, since the GC is not platform dependent. It suggests that the issue is in, well, some platform-dependent code. I can reproduce this locally and from the change in
The |
As for the slow leak, according to |
Different versions of the code I used for testing, for reference:
using HDF5
dim1, dim2 = (55, 30000)
a = rand(dim1, dim2)
b = rand(dim1, dim2)
function blah()
# close(h5open("big.h5"))
close(open("big.h5"))
# array1 = h5open("big.h5") do fd
# # read(fd, "big1.h5")
# end
# array2 = h5open("big.h5") do fd
# read(fd, "big2.h5")
# end
# finalizer(array1, x->Core.println("1"))
# finalizer(array2, x->Core.println("2"))
# array1= h5read("big.h5", "big1.h5")
# array2 = h5read("big.h5", "big2.h5")
# ccall(:jl_breakpoint, Void, (Any,), (array1, array2))
return
end
try
rm("big.h5")
end
h5write("big.h5", "big1.h5", rand(dim1, dim2))
h5write("big.h5", "big2.h5", rand(dim1, dim2))
while true
blah()
# @show Sys.maxrss()
ccall(:malloc_stats, Void, ())
gc()
end |
Thanks for the input. Maybe it would be worthwhile to port the offending script to python with h5py and see if the same behavior occurs? |
On further examination of the small leak, that seems to be coming from Line 624 in 284f139
Setting it to true fixes the minor leak (and in general the toclose handling seems fragile). This unfreed allocation also seems to be what is triggering the glibc bug, so the glibc problem is probably due to some bad heuristic for defragmentation.
|
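Until that toclose handling is fixed, one way to avoid relying on finalizers at all is to manage the file handle explicitly. A minimal sketch under that assumption (the do-block form closes the file when the block exits; the dataset names just mirror the reproducer):
using HDF5
# Hedged sketch: deterministic close via the do-block form instead of
# waiting for a finalizer to release the HDF5 handle.
function read_pair(path)
    h5open(path, "r") do fd
        (read(fd, "big1.h5"), read(fd, "big2.h5"))
    end
end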
Glibc bug report https://sourceware.org/bugzilla/show_bug.cgi?id=21731 for the huge memory consumption part. |
Thanks for the glibc bug report. Please note that memory allocation analysis is a difficult task, involving a deep understanding of program semantics and allocator behaviour. No allocator is perfect (it lacks a priori knowledge of allocation patterns). I always strongly suggest that users graph malloc API call results (exactly how much memory you requested) against VmRSS (actual memory usage) and VmSize (size of the virtual address space). These kinds of graphs are hugely useful when trying to determine whether it's (a) internal/external fragmentation, (b) increased program usage, or (c) a leak. Thanks again. |
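In that spirit, a minimal Linux-only sketch for sampling VmRSS and VmSize from /proc/self/status so they can be logged alongside whatever the program thinks it allocated (the helper name vm_stats is made up for illustration):
# Hedged sketch: read VmRSS and VmSize from /proc/self/status (Linux only).
function vm_stats()
    stats = Dict{String,String}()
    for line in eachline("/proc/self/status")
        for key in ("VmRSS", "VmSize")
            startswith(line, key) && (stats[key] = strip(split(line, ':')[2]))
        end
    end
    return stats
end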
It does not occur in Julia v0.4 |
Let me make a small correction. It still happens in 0.4, but much more slowly than in 0.5. |
I think #428 helps in fixing this. Would anyone like to test and report back? |
Somehow, I cannot reproduce the test; I always get zeros from malloc_stats in Julia:
julia> ccall(:malloc_stats, Void, ())
Arena 0:
system bytes = 0
in use bytes = 0
Total (incl. mmap):
system bytes = 0
in use bytes = 0
max mmap regions = 0
max mmap bytes = 0 |
If malloc_stats() returns all zeros then have you confirmed you're using glibc's malloc and not an interposed malloc like jemalloc/tcmalloc via LD_PRELOAD? What does /proc/self/maps show is loaded? The use of malloc_stats() should certainly return some values. |
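A quick Linux-only sketch for checking that from inside the running Julia session (recent Julia; the regex just looks for common interposed allocator names):
# Hedged sketch: list any jemalloc/tcmalloc mappings in the current process.
for line in eachline("/proc/self/maps")
    occursin(r"jemalloc|tcmalloc", line) && println(line)
end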
Hi, is this memory leak still present in Julia v1.0.2 with "HDF5" v0.10.2?
It eats up 1.6 GB of memory, but nothing happens if I comment out either the HDF5 lines or the randomly generated array. |
I can confirm the results by @fremling on Ubuntu 18.04 with hdf5-tools 1.10, julia 1.0.3 and HDF5.jl at master. Running the following code:
It has this output, and the following is read from
My impression from playing around with this is that whatever is in memory when HDF5 is called is being prevented from being garbage collected. I'm working on an application that writes large datasets to disk every few seconds. Currently, it eats through all available memory pretty quickly and needs to be restarted regularly, and I haven't figured out how to get around that. |
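As a stop-gap for that kind of periodic-writer application (this is not the fix discussed later in the thread, just a workaround sketch with invented names), one can close the file deterministically and nudge the GC after every snapshot:
using HDF5
# Hedged workaround sketch: explicit close plus a GC pass after each write.
function save_snapshot(path, data, i)
    h5open(path, "w") do fid
        write(fid, "snapshot_$i", data)
    end
    GC.gc()   # encourage Julia to run finalizers promptly (Julia 1.x)
end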
If we think this is a glibc malloc issue, please don't hesitate to point me at a minimal reproducer. At the end of the day you really need to do two traces: raw calls to the memory allocation subsystem, and in-use RSS (assuming your working set fits in RAM). Then graph both. If at any point the sum of the raw calls (allocations and deallocations) is not in line with the used RSS, then something is caching and you need to drill down to find what is holding onto the objects. In many cases I've been able to trace raw malloc calls vs. RSS and show application authors that what they are really seeing is just application demand, and so malloc allocates what is requested. In other cases it might be tiny objects with high overhead, or lots of fastbins (try setting mallopt(M_MXFAST, 1); it doesn't help in this case). I did a session of debugging with julia and the test program, and I also see ~3 GiB of RSS used when the program runs. Oddly, I see a lot of this: What appears to cause a lot of RSS inflation is this loop:
It sometimes allocates 0.5 MiB blocks, and that adds up quickly.
This might just be the way the HDF5 library operates; in fact they have an H5garbage_collect() interface to clean up memory after peak usage. I'm not going to debug this further, but it certainly looks like someone should experiment with wiring up H5garbage_collect() to see whether it is needed in this case of looping construction and use of the HDF5 library caches. |
In case my previous post wasn't clear enough: I don't consider this a glibc malloc issue; it looks like a cache in the HDF5 library that users will want to clear with H5garbage_collect(). If that doesn't solve the problem then we need to reach out to the HDF5 library authors to ask for their input. |
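For anyone who wants to experiment with that suggestion, a hedged sketch of calling the C-level H5garbage_collect() from Julia; how the libhdf5 handle is exposed depends on the HDF5.jl version, so HDF5.API.libhdf5 below is an assumption (some versions wrap this as an h5_garbage_collect helper instead):
using HDF5
# Hedged sketch: ask the HDF5 C library to release memory held in its
# internal free lists. HDF5.API.libhdf5 is assumed; older HDF5.jl versions
# expose the library handle under a different name.
ccall((:H5garbage_collect, HDF5.API.libhdf5), Cint, ())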
Thank you @codonell, you clearly have a deeper understanding of what might be going on here. I can unfortunately do no more than report on the problem. I did one experiment yesterday, calling To confound things further, I cannot replicate the problem today. I am now on a machine using Ubuntu 16.04, and have tried both the system-installed HDF5 version 1.8.16 and version 1.10.4 installed using Conda (still Julia 1.0.3 and the master branch of HDF5.jl). On this system I cannot see the leak. |
AFAICT there are two issues. The bug in HDF5.jl is in #349 (comment). There may or may not be a glibc issue, since I couldn't reproduce the effect while replaying the malloc/free calls. |
@yuyichao Agreed. I use a glibc replay simulator for this: we take a live application trace of all malloc API calls and then replay them in a simulator (with ordering preserved) and try to duplicate the issue (there is an LPC2016 presentation on the tooling). I haven't gone to that extent yet for this workload, but I might, just to dust off the tooling from the last time we tried this and refresh it to the master glibc branch. That way I can provide some hard data on the application demands based on the API calls. @ludvigak You cursed me! It has stopped reproducing for me on my F29 system also, and it was just reproducing last night. I see the system process use only ~767 MiB of RSS peak during processing. The last time something like this happened to me it was because prelink ran on a cron job and altered the objects I was working on, but we don't have that anymore. Anything else that might have run to alter julia and the cache state? I erased the
Here you can see a fresh run after cleaning |
I just hit this on Arch Linux with a recent kernel, Julia 1.0.3, and HDF5 v0.11.0. I'm pretty quickly and consistently filling 16 GB of RAM loading large raster arrays, and the memory is never garbage collected. Let me know if there is anything I can help with; I would prefer not to chunk our workflow into 20 reloads of Julia to load the whole data set. |
I can reproduce this on a Linux machine with Julia 1.0 and a recent HDF5 version. Is there any progress on how to solve it? |
I think I'm having this issue with Julia 1.2.0-rc2.0 running on Ubuntu 19.04. For instance, running the following code:
using HDF5
function testleak()
for i in 1:10000
fid = h5open("test_$i.h5", "w")
close(fid)
end
end
@time testleak()
has this output. And if I keep re-running it:
Is there any progress on this? For my actual use case I need to periodically save data from simulations that can run for days, and this bug will force me to stop and restart the runs so as not to run out of memory...
The fix on the hdf5 side is here #349 (comment). |
`yuyichao` suggested this two years ago in #349 (comment)
Closed by #629, thanks to @MarkusSchildhauer!
There seems to be a memory leak when reading large arrays from a file. For example:
The leak isn't in writing the file: otherwise, this would leak memory too
Oddly enough, this doesn't leak memory. I can't figure out why that would be, since it's doing the same number of reads.