Memory leak when reading many different objects #349

Closed
LewisHein opened this issue Dec 24, 2016 · 36 comments

@LewisHein

There seems to be a memory leak when reading large arrays from a file. For example:

#causes memory leak
using HDF5

dim1, dim2 = (55, 30000)

function blah()
    array1= h5read("big.h5", "big1.h5")
    array2 = h5read("big.h5", "big2.h5")

    result1 = array1 .* array2
    return findmax(result1)[1]
end

while true
    try
        rm("big.h5")
    catch
    end
    h5write("big.h5", "big1.h5", rand(dim1, dim2))
    h5write("big.h5", "big2.h5", rand(dim1, dim2))

    x = blah()
end

The leak isn't in writing the file; otherwise, this would leak memory too:

#Doesn't leak memory
using HDF5 

dim1, dim2 = (55, 30000)

function blah()
    array1= h5read("big.h5", "big1.h5")
    array2 = h5read("big.h5", "big2.h5") 

    result1 = array1 .* array2
    return findmax(result1)[1]
end

while true
    try
        rm("big.h5")
    catch
    end
    h5write("big.h5", "big1.h5", rand(dim1, dim2))
    h5write("big.h5", "big2.h5", rand(dim1, dim2))

    #x = blah() ##NOTE: blah() isn't called, but everything else is the same.
end

Oddly enough, the following doesn't leak memory either. I can't figure out why that would be, since it's doing the same number of reads:

# Doesn't leak memory
using HDF5

dim1, dim2 = (55, 30000)
try
    rm("big.h5")
catch
end
h5write("big.h5", "big1.h5", rand(dim1, dim2))
h5write("big.h5", "big2.h5", rand(dim1, dim2))


function blah()
    array1= h5read("big.h5", "big1.h5")
    array2 = h5read("big.h5", "big2.h5")

    result1 = array1 .* array2
    return findmax(result1)[1]
end

while true
    x = blah()
end
@ggggggggg
Contributor

How did you determine there was a memory leak? I need more info to try to reproduce it.

I copied and pasted your first code block into a REPL (v0.5) and watched the memory usage in Activity Monitor (Mac OS) for about 1 minute. It varied from about 320 MB to 370 MB, going both up and down in the time I observed.
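
For anyone who wants a number rather than eyeballing a system monitor, here is a minimal sketch (not from the original report) that prints the process's peak resident set size every few iterations of the reproducer; the scratch file name and iteration count are arbitrary choices.

using HDF5

dim1, dim2 = (55, 30000)
fname = tempname() * ".h5"          # hypothetical scratch file
h5write(fname, "big1", rand(dim1, dim2))
h5write(fname, "big2", rand(dim1, dim2))

for i in 1:200
    a = h5read(fname, "big1")
    b = h5read(fname, "big2")
    findmax(a .* b)
    # Sys.maxrss() reports the peak resident set size of the process so far
    i % 20 == 0 && println("iter $i: maxrss = ", Sys.maxrss() ÷ 2^20, " MiB")
end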

@LewisHein
Author

LewisHein commented May 8, 2017

OK -- I tried running this code again. I am on Linux -- one Ubuntu system and then a cluster of three Gentoo systems. I pasted the first block of code into a 0.5 REPL on one of the cluster machines and watched memory usage with top. It rose precipitously until it reached ~97% and then held steady there for several minutes. Top's count of memory allocated to cached files held steady and low. But swap usage just kept climbing and climbing.

Similar results on my Ubuntu laptop -- I started it, and it froze the whole laptop with everything but Julia having to run from swap space. I killed it forthwith so that I could get on and write this reply.

I wonder: is there something weird going on with HDF5 caching files/datasets internally to speed up repeat access?

@ggggggggg
Contributor

I can try it on a linux machine at work tomorrow.

@musm
Member

musm commented May 8, 2017

I can't seem to reproduce this on Windows.

@ggggggggg
Contributor

ggggggggg commented May 8, 2017

I can reproduce this on Ubuntu 16, Julia 0.5.1, HDF5 library version 1.8.16. I observed the memory usage in top grow continuously for about 1 minute of execution while I watched. It would grow by more than 0.1% per top refresh, and I think that machine has 16 GB of memory.

I put the following code in a file called memleak.jl and ran julia --track-allocations=all memleak.jl.

#causes memory leak
using HDF5

dim1, dim2 = (55, 30000)

function blah()
    array1= h5read("big.h5", "big1.h5")
    array2 = h5read("big.h5", "big2.h5")

    result1 = array1 .* array2
    return findmax(result1)[1]
end

for i=1:200
    i%10==0 && println("$i")
    rm("big.h5")
    h5write("big.h5", "big1.h5", rand(dim1, dim2))
    h5write("big.h5", "big2.h5", rand(dim1, dim2))

    x = blah()
end

The following lines had numbers larger than 10^6 in front of them:

660005920 Line 676
1333208128 Line 693
3946823920 Line 715
2653219088 Line 1209

I ran the same code on Mac OS and found:
792008640 Line 676
1333211360 Line 693
3946833488 Line 715
2653225008 Line 1209

Those are so similar that it doesn't seem like this will be helpful.
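
Those raw numbers come from the .mem files that --track-allocations writes next to each source file. A small, self-contained way to map them back to file names and line numbers is sketched below; the HDF5.jl package path is only an example and should point wherever the .mem files actually landed.

# Scan a directory tree for .mem files and report lines that allocated more
# than `threshold` bytes.  The directory used at the bottom is an assumption.
function hot_mem_lines(dir; threshold = 10^6)
    hits = Tuple{Int,String,Int}[]                     # (bytes, file, line)
    for (root, _, files) in walkdir(dir), file in files
        endswith(file, ".mem") || continue
        for (lineno, line) in enumerate(eachline(joinpath(root, file)))
            m = match(r"^\s*(\d+)", line)
            m === nothing && continue
            bytes = parse(Int, m.captures[1])
            bytes >= threshold && push!(hits, (bytes, joinpath(root, file), lineno))
        end
    end
    return sort!(hits; rev = true)
end

foreach(println, hot_mem_lines(joinpath(homedir(), ".julia", "packages", "HDF5")))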

@LewisHein
Author

Could it be that the memory allocated in these lines is freed on Mac OS but not on Linux for some reason?

@ggggggggg
Contributor

That sounds right. Unfortunately, I don't know how to debug that.

@ereday

ereday commented Jun 10, 2017

Is there any progress on this issue? I'm facing the same issue on CentOS release 6.5.

@xiuliren

xiuliren commented Jul 6, 2017

I encountered the same issue on Ubuntu 14.04 using Julia 0.5.2.
I run a script for a long time that keeps reading different HDF5 files. The memory usage keeps going up and blows up after a few hours.

@ggggggggg
Contributor

@musm Do you think you could ping someone who knows about the GC? Since this is platform dependent, it seems like it may represent an actual bug in the GC?

@xiuliren

xiuliren commented Jul 6, 2017

@yuyichao is it possible that this is a GC problem?

@yuyichao
Contributor

yuyichao commented Jul 6, 2017

Since this is platform dependent,

That's strong evidence that it's not related to the GC, since the GC is not platform dependent. It suggests that it's an issue in, well, some platform-dependent code. I can reproduce this locally, and from the change in /proc/self/maps the main difference is the heap, so no Julia object is leaking (those are not allocated on the system heap). Calling malloc_stats shows results that look like:

Arena 0:
system bytes     = 3692122112
in use bytes     =   43379472
Total (incl. mmap):
system bytes     = 3742892032
in use bytes     =   94149392
max mmap regions =         20
max mmap bytes   =   77172736

The system bytes are increasing at a rate that roughly agrees with the increase in memory consumption, while the in use bytes seem to be increasing at a much slower rate. I'm not sure why the in use bytes are increasing at all, though the fast increase in system bytes seems to suggest that the glibc allocator is not actually freeing or reusing memory. With LD_PRELOAD="/usr/lib/libjemalloc.so" the memory consumption does not increase nearly as quickly, though there could still be a real leak that is much slower (hundreds of times slower).

@yuyichao
Contributor

yuyichao commented Jul 6, 2017

As for the slow leak, according to malloc_stats it's reproducible with just close(h5open("big.h5")) but not close(open("big.h5")), so I strongly suspect it's a slow leak in libhdf5 itself.

@yuyichao
Contributor

yuyichao commented Jul 6, 2017

Different versions of the code I used for testing, for reference:

using HDF5

dim1, dim2 = (55, 30000)

a = rand(dim1, dim2)
b = rand(dim1, dim2)

function blah()
    # close(h5open("big.h5"))
    close(open("big.h5"))
    # array1 = h5open("big.h5") do fd
    #     # read(fd, "big1.h5")
    # end
    # array2 = h5open("big.h5") do fd
    #     read(fd, "big2.h5")
    # end
    # finalizer(array1, x->Core.println("1"))
    # finalizer(array2, x->Core.println("2"))
    # array1= h5read("big.h5", "big1.h5")
    # array2 = h5read("big.h5", "big2.h5")
    # ccall(:jl_breakpoint, Void, (Any,), (array1, array2))
    return
end

try
    rm("big.h5")
end
h5write("big.h5", "big1.h5", rand(dim1, dim2))
h5write("big.h5", "big2.h5", rand(dim1, dim2))
while true
    blah()
    # @show Sys.maxrss()
    ccall(:malloc_stats, Void, ())
    gc()
end

@ggggggggg
Contributor

ggggggggg commented Jul 6, 2017

Thanks for the input. Maybe it would be worthwhile to port the offending script to python with h5py and see if the same behavior occurs?

@yuyichao
Contributor

yuyichao commented Jul 7, 2017

On further examination of the small leak, it seems to be coming from HDF5.jl. In particular,

apl = p_create(H5P_FILE_ACCESS) # access property list

seems to allocate a property list that is never freed, either manually or automatically; adding true fixes the minor leak (and in general the toclose handling seems fragile). This unfreed allocation also seems to be what triggers the glibc bug, so the glibc problem is probably due to some bad heuristic for defragmentation.
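
For readers unfamiliar with the internals, the pattern being described is roughly the following. This is a hypothetical illustration against HDF5.jl's internal helpers of that era, not the actual patch: every property list handed back by p_create wraps an H5P id that must eventually be released (close calls h5p_close under the hood), otherwise one id leaks per h5open call.

# Sketch only; the names p_create / H5P_FILE_ACCESS / close follow the HDF5.jl
# internals discussed above and may differ between versions.
apl = HDF5.p_create(HDF5.H5P_FILE_ACCESS)   # access property list (an H5P id)
try
    # ... configure apl and open the file with it ...
finally
    close(apl)                               # release the id so it cannot leak
end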

@yuyichao
Contributor

yuyichao commented Jul 7, 2017

Glibc bug report https://sourceware.org/bugzilla/show_bug.cgi?id=21731 for the huge memory consumption part.

@codonell

Thanks for the glibc bug report. Please note that memory allocation analysis is a difficult task, involving a deep understanding of program semantics and allocator behaviour. No allocator is perfect (it lacks a priori knowledge of allocation patterns). I always strongly suggest users graph the malloc API call results (exactly how much memory you requested) against VmRSS (actual memory usage) and VmSize (size of the virtual address space). These kinds of graphs are hugely useful when trying to determine whether it's (a) internal/external fragmentation, (b) increased program usage, or (c) a leak. Thanks again.
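
On Linux, the two numbers mentioned above can be sampled directly from /proc. The helper below is a sketch (not part of the thread) that reads VmRSS and VmSize so they can be logged alongside the allocation counts:

# Read VmRSS and VmSize (both in kB) from /proc/self/status; Linux-only.
function vm_stats()
    rss = vsize = 0
    for line in eachline("/proc/self/status")
        if startswith(line, "VmRSS:")
            rss = parse(Int, split(line)[2])
        elseif startswith(line, "VmSize:")
            vsize = parse(Int, split(line)[2])
        end
    end
    return (VmRSS_kB = rss, VmSize_kB = vsize)
end

@show vm_stats()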

@ilkerkesen

It does not occur in Julia v0.4

@ilkerkesen

Let me make a small correction: it still happens in 0.4, but much more slowly compared to 0.5.

@musm
Member

musm commented Jul 22, 2017

I think #428 helps in fixing this

Would anyone like to test and report back?

@xiuliren

Somehow, I cannot reproduce the test. I always get zeros from malloc_stats in Julia.

julia> ccall(:malloc_stats, Void, ())  
Arena 0:                               
system bytes     =          0          
in use bytes     =          0          
Total (incl. mmap):                    
system bytes     =          0          
in use bytes     =          0          
max mmap regions =          0          
max mmap bytes   =          0          

@codonell

If malloc_stats() returns all zeros then have you confirmed you're using glibc's malloc and not an interposed malloc like jemalloc/tcmalloc via LD_PRELOAD? What does /proc/self/maps show is loaded? The use of malloc_stats() should certainly return some values.
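
A quick way to check that from the running Julia session (Linux-only, and just a sketch) is to look for an interposed allocator in the loaded mappings:

# Print any jemalloc/tcmalloc mappings that would shadow glibc's malloc.
for line in eachline("/proc/self/maps")
    (occursin("jemalloc", line) || occursin("tcmalloc", line)) && println(line)
end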

@xiuliren

@codonell thanks, you are right. I was using jemalloc. After switching to glibc, I have reproduced the problem with the latest code. @musm the problem is still there with master.
Currently, the commit is:
13f7ad6

@fremling

Hi, is this memory leak still present in Julia v1.0.2 with HDF5 v0.10.2?
For instance

using HDF5
using LinearAlgebra
function HDF5MemoryLeak(Num)
    NyMarray=randn(100,100)
    fid=h5open("testfile.hdf5","w")
    close(fid)
end
N=2*10^4
for j in 1:N
    println("$j of $N")
    HDF5MemoryLeak(j)
end

eats up 1.6 GB of memory, but nothing happens if I comment out either the HDF5 lines or the randomly generated array.
See this question for some more details https://stackoverflow.com/questions/53335610/is-there-a-memory-leak-in-hdf5-for-julia.

@ludvigak

I can confirm the results by @fremling on Ubuntu 18.04 with hdf5-tools 1.10, julia 1.0.3 and HDF5.jl at master. Running the following code:

using HDF5

function leak(N, iters)
    for i=1:iters
        A = zeros(N,N)
        close(h5open("foo.h5", "w"))
    end
    sleep(10)
end

@time leak(2000, 100)

Has output
10.885362 seconds (483.96 k allocations: 3.004 GiB, 0.51% gc time)

And the following read from top while it sleeps:

  PID     USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+     COMMAND 
18518 ludvigak  20   0  3.966g 3.004g 0.087g S   0.0  9.6   0:01.82     julia-1 

My impression from playing around with this is that whatever is in memory when HDF5 is called is being prevented from being garbage collected.

I'm working on an application that writes large datasets to disk every few seconds. Currently, it eats through all available memory pretty quickly and needs to be restarted regularly, and I haven't figured out how to get around that.
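
One way to test that impression, along the lines of the finalizer trick yuyichao used earlier in this thread (a sketch, not code from this comment), is to attach a finalizer to the big array and watch whether it ever fires while the open/close loop keeps running:

using HDF5

function finalizer_probe(N, iters)
    for i in 1:iters
        A = zeros(N, N)
        finalizer(A) do _
            Core.println("array from iteration $i was collected")  # async-safe print
        end
        close(h5open("foo.h5", "w"))
        GC.gc()   # if nothing is ever printed, something is keeping A alive
    end
end

finalizer_probe(2000, 10)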

@codonell

If we think this is a glibc malloc issue, please don't hesitate to point me at a minimal reproducer. At the end of the day you really need to do two traces: raw calls to the memory allocation subsystem, and in-use RSS (assuming your working set fits in RAM). Then graph both. If at any point the sum of the raw calls (allocations or deallocations) is not in line with the used RSS, then something is caching and you need to drill down to find what is holding onto the objects. In many cases I've been able to trace raw malloc calls vs RSS and show application authors that what they are really seeing is just application demand, and so malloc allocates what is requested. In other cases it might be tiny objects with high overhead, or lots of fastbins (try setting mallopt(M_MXFAST, 1); it doesn't help in this case).
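
For reference, the mallopt experiment mentioned above can be driven from Julia with a one-line ccall (glibc-only; the value 1 for M_MXFAST is taken from glibc's malloc.h and is stated here as an assumption):

const M_MXFAST = 1                                      # glibc's malloc.h value
ok = ccall(:mallopt, Cint, (Cint, Cint), M_MXFAST, 1)   # returns 1 on success
ok == 1 || @warn "mallopt(M_MXFAST, 1) was rejected"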

I did a session of debugging with Julia and the test program, and I also see ~3 GiB of RSS used when the program runs. Oddly, I see a lot of this: Thread 1 "julia" hit Breakpoint 5, __GI___libc_free (mem=0x0) at malloc.c:3078, which isn't freeing anything. Since it comes from llvm::LiveRangeEdit::calculateRegClassAndHint it might be an algorithmic thing that just iterates NULL pointers on a list and frees them all anyway, but I found it odd; you'd normally just avoid the call into the library free routine for this, particularly if you see it this many times. However, this doesn't cause much RSS inflation.

What appears to cause a lot of RSS inflation is this loop:

#0  __GI___libc_malloc (bytes=56) at malloc.c:3020
#1  0x00007fffcc83a55e in H5FL_malloc () from /usr/bin/../lib64/libhdf5.so
#2  0x00007fffcc83a6bd in H5FL_reg_malloc () from /usr/bin/../lib64/libhdf5.so
#3  0x00007fffcc8f41ae in H5SL_new_node () from /usr/bin/../lib64/libhdf5.so
#4  0x00007fffcc8f5ee8 in H5SL_insert_common () from /usr/bin/../lib64/libhdf5.so
#5  0x00007fffcc8f663d in H5SL_insert () from /usr/bin/../lib64/libhdf5.so
#6  0x00007fffcc8d1e78 in H5P_create_id () from /usr/bin/../lib64/libhdf5.so
#7  0x00007fffcc8bed4b in H5Pcreate () from /usr/bin/../lib64/libhdf5.so
#8  0x00007fffcd00c822 in h5p_create () at /home/carlos/.julia/packages/HDF5/H0XJB/src/HDF5.jl:2248
#9  p_create () at /home/carlos/.julia/packages/HDF5/H0XJB/src/HDF5.jl:980
#10 julia_#h5open#4_35274 () at /home/carlos/.julia/packages/HDF5/H0XJB/src/HDF5.jl:668
#11 0x00007fffcd00cd74 in jfptr_#h5open#4_35275 ()
#12 0x00007fffcd00c57a in h5open () at /home/carlos/.julia/packages/HDF5/H0XJB/src/HDF5.jl:664
#13 julia_leak_35272 () at /home/carlos/support/julia-issue-349/test.jl:6
#14 0x00007fffcd00c6bb in jfptr_leak_35273 ()
#15 0x00007ffff7ae9dd7 in jl_fptr_trampoline () from /usr/bin/../lib64/libjulia.so.1
#16 0x00007ffff7aeae9c in jl_apply_generic () from /usr/bin/../lib64/libjulia.so.1
...

This call path sometimes allocates 0.5 MiB blocks, and that adds up quickly.
Then, to my surprise, you see allocations to the same node list on h5f_close()!

#0  __GI___libc_malloc (bytes=56) at malloc.c:3020
#1  0x00007fffcc83a55e in H5FL_malloc () from /usr/bin/../lib64/libhdf5.so
#2  0x00007fffcc83a6bd in H5FL_reg_malloc () from /usr/bin/../lib64/libhdf5.so
#3  0x00007fffcc8f41ae in H5SL_new_node () from /usr/bin/../lib64/libhdf5.so
#4  0x00007fffcc8f5ee8 in H5SL_insert_common () from /usr/bin/../lib64/libhdf5.so
#5  0x00007fffcc8f663d in H5SL_insert () from /usr/bin/../lib64/libhdf5.so
#6  0x00007fffcc8d1b81 in H5P_close () from /usr/bin/../lib64/libhdf5.so
#7  0x00007fffcc883296 in H5I_dec_ref () from /usr/bin/../lib64/libhdf5.so
#8  0x00007fffcc81f232 in H5F_dest () from /usr/bin/../lib64/libhdf5.so
#9  0x00007fffcc8201d5 in H5F_try_close () from /usr/bin/../lib64/libhdf5.so
#10 0x00007fffcc8204d8 in H5F_close () from /usr/bin/../lib64/libhdf5.so
#11 0x00007fffcc883296 in H5I_dec_ref () from /usr/bin/../lib64/libhdf5.so
#12 0x00007fffcc883366 in H5I_dec_app_ref () from /usr/bin/../lib64/libhdf5.so
#13 0x00007fffcc81ad3a in H5Fclose () from /usr/bin/../lib64/libhdf5.so
#14 0x00007fffcd00c596 in h5f_close () at /home/carlos/.julia/packages/HDF5/H0XJB/src/HDF5.jl:2148
#15 close () at /home/carlos/.julia/packages/HDF5/H0XJB/src/HDF5.jl:785
#16 julia_leak_35272 () at /home/carlos/support/julia-issue-349/test.jl:6
#17 0x00007fffcd00c6bb in jfptr_leak_35273 ()
#18 0x00007ffff7ae9dd7 in jl_fptr_trampoline () from /usr/bin/../lib64/libjulia.so.1
#19 0x00007ffff7aeae9c in jl_apply_generic () from /usr/bin/../lib64/libjulia.so.1

This might just be the way the HDF5 library operates; in fact it has an H5garbage_collect() interface to clean up memory usage after peak usage.

I'm not going to debug this further, but it certainly looks like someone should experiment with wiring up H5garbage_collect() to see whether it helps in this case of looping construction and use of the HDF5 library caches.
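
A minimal sketch of that experiment follows. It assumes libhdf5 can be resolved by name from the loader path; inside HDF5.jl one would use the package's own library handle instead of the bare string.

using HDF5   # ensures libhdf5 is loaded

function leak_with_h5gc(N, iters)
    for i in 1:iters
        A = zeros(N, N)
        close(h5open("foo.h5", "w"))
        # Ask libhdf5 to flush its internal free lists; H5garbage_collect
        # returns a negative herr_t on failure.
        status = ccall((:H5garbage_collect, "libhdf5"), Cint, ())
        status < 0 && error("H5garbage_collect failed")
    end
end

leak_with_h5gc(2000, 100)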

@codonell

In case my previous post wasn't clear enough: I don't consider this a glibc malloc issue. It looks like a cache in the HDF5 library that users will want to clear with H5garbage_collect(). If that doesn't solve the problem then we need to reach out to the HDF5 library authors and ask for their input.

@ludvigak

Thank you @codonell , you clearly have a deeper understanding of what might be going on here. I can unfortunately do no more than report on the problem.

I did one experiment yesterday, calling H5garbage_collect inside the loop of the example program, and could not see any difference. Perhaps it would be different if done from within the package.

To confound things further, I cannot replicate the problem today. I am now on a machine running Ubuntu 16.04, and have tried both the system-installed HDF5 version 1.8.16 and version 1.10.4 installed using Conda (still Julia 1.0.3 and the master branch of HDF5.jl). On this system I cannot see the leak.

@yuyichao
Contributor

AFAICT there are two issues. The bug in HDF5.jl is in #349 (comment). There may or may not be a glibc issue, since I couldn't reproduce the effect while replaying the malloc/free calls.

@codonell

@yuyichao Agreed. I use a glibc replay simulator for this: we take a live application trace of all malloc APIs, then replay them in a simulator (with ordering preserved) and try to duplicate the issue (see the LPC2016 presentation on the tooling). I haven't gone to that extent yet for this workload, but I might, just to dust off the tooling from the last time we tried this and refresh it against the master glibc branch. That way I can provide some hard data on the application demands based on the API calls.

@ludvigak You cursed me! It has stopped reproducing for me on my F29 system also, and it was just reproducing last night. I see the system process use only ~767 MiB of RSS peak during processing. The last time something like this happened to me it was because prelink ran on a cron job and altered the objects I was working on, but we don't have that anymore. Anything else that might have run to alter Julia and the cache state? I erased the ~/.julia cache state and started again from scratch, and it reproduces again! But only once; then it went back to ~767 MiB RSS peak per run. With this knowledge I was able to debug the situation once more:

carlos   29737  4.9  5.2 2223468 1686780 pts/8 Sl+  16:11   0:01 /usr/bin/julia ./test.jl

Here you can see a fresh run after cleaning ~/.julia, and we're back to 2.2 GiB used, which is 3x the normal memory usage when it doesn't trigger. Something is certainly related to caching by Julia, and that might mean LLVM too. Sometimes this state persists, other times it goes away. Quite odd. We need some Julia experts to weigh in on this.

@rafaqz
Contributor

rafaqz commented Feb 22, 2019

I just hit this on Arch Linux with a recent kernel, Julia 1.0.3 and HDF5 v0.11.0.

I'm pretty quickly and consistently filling 16 GB of RAM loading large raster arrays. The memory is never garbage collected.

Let me know if there is anything I can help with; I would prefer not to chunk our workflow into 20 restarts of Julia to load the whole data set.

@dominikkiese

I can reproduce this on a Linux machine with Julia 1.0 and a recent HDF5 version. Is there any progress on how to solve it?

@mzilhao

mzilhao commented Jul 20, 2019

I think I'm having this issue with Julia 1.2.0-rc2.0 running on Ubuntu 19.04. For instance, running the following code

using HDF5

function testleak()
    for i in 1:10000
        fid=h5open("test_$i.h5","w")
        close(fid)
    end
end

@time testleak()

Has output
1.411972 seconds (169.80 k allocations: 7.696 MiB)

And if I keep re-running testleak() from the REPL I can see the memory (as reported by top and pmap) steadily increasing.

Is there any progress on this? For my actual use case I need to periodically save data from simulations that can run for days, and this bug will force me to stop and restart the runs so as to not run out of memory...

@yuyichao
Contributor

The fix on the HDF5.jl side is in #349 (comment).
I'm not in a position to submit a patch anytime soon.

musm pushed a commit that referenced this issue Oct 16, 2019
`yuyichao` suggested this two years ago in #349 (comment)
@musm
Member

musm commented Aug 4, 2020

Closed by #629, thanks to @MarkusSchildhauer!

@musm musm closed this as completed Aug 4, 2020