Julia slows down when reading thousands of JLD files #86

Open
gasagna opened this issue Jul 23, 2016 · 19 comments

@gasagna

gasagna commented Jul 23, 2016

This issue has been reported initially at JuliaLang/julia#17554. Discussion can continue here.

@gasagna
Author

gasagna commented Aug 1, 2016

(Friendly bump) Is there any chance this issue can be easily addressed or worked around? I have several tens of thousands of .jld files, and reading data from all of them takes far longer than it should.

I might consider temporarily switching to a different storage format until this bug/feature is fixed. Any ideas on alternatives? Each file is under 1 MB and contains fewer than 5 arrays.

@eschnett
Contributor

eschnett commented Aug 1, 2016

I don't think we know yet what causes this. I hope there is a straightforward way to change the way we use the HDF5 library that removes this problem (explicitly calling the HDF5 garbage collector? using the "strong closing" flag when opening the file? updating to HDF5 1.10?), but these are just guesses.
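
For what it's worth, the garbage-collector idea could be tried with a one-line ccall into libhdf5. This is an untested sketch, and it assumes HDF5.jl exposes the shared-library path as HDF5.libhdf5:

using HDF5

# Untested sketch: H5garbage_collect() asks libhdf5 to free its internal
# free lists.  `HDF5.libhdf5` (the path to the shared library) is assumed to
# be defined by HDF5.jl.
const libhdf5 = HDF5.libhdf5

function hdf5_garbage_collect()
    status = ccall((:H5garbage_collect, libhdf5), Cint, ())
    status < 0 && error("H5garbage_collect() failed")
    return status
end

Calling this periodically between batches of files would at least show whether library-internal free lists are involved in the slowdown.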

What other format are you contemplating? CSV?

@gasagna
Author

gasagna commented Aug 1, 2016

Thanks for chiming in.

Unfortunately I am out of ideas for alternative formats... Maybe numpy .npy via PyCall, but I have not tested it yet. It has to be binary, as I have a ~100 GB dataset that would be too large as ASCII files.

@eschnett
Contributor

eschnett commented Aug 1, 2016

The problem seems to be that there are so many files. Can you get by with fewer files? I would consider putting all 100 GByte into the same HDF5 file.

If your data have a simple format (e.g. matrix, array), then you can write them directly into an HDF5 file, circumventing JLD.
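
For plain numeric arrays that could look something like the following (sketch using HDF5.jl's h5write/h5read convenience functions; the file and dataset names are placeholders):

using HDF5

# Write a Float64 matrix straight into an HDF5 file, bypassing JLD's extra
# metadata, and read it back.  "data.h5" and "X" are placeholder names.
X = randn(5, 5)
h5write("data.h5", "X", X)   # creates the file and the dataset "X"
Y = h5read("data.h5", "X")   # reads the dataset back as a Matrix{Float64}
@assert X == Y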

@timholy
Member

timholy commented Aug 1, 2016

The problem seems to be a libhdf5 problem, however.

@gasagna, you might have to do a bit of legwork. What happens if you comment out the problematic block we discussed at JuliaLang/julia#17554?

Also, upgrading to a new libhdf5 seems likely to help---they have apparently ironed out a bunch of problems.

@gasagna
Author

gasagna commented Aug 2, 2016

@eschnett I could store everything in one large file, but I do not want to do that for several reasons, e.g. avoiding massive data corruption, being able to delete/move/reorganise files easily, ... (Does HDF5.jl allow deleting a dataset in a file? And is the associated storage removed from the file, i.e. does the file size decrease?)

@timholy Will try your suggestions asap.

@eschnett
Contributor

eschnett commented Aug 2, 2016

@gasagna Yes to both, with HDF5 1.10.
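
An untested sketch of what that would look like (`o_delete` is the HDF5.jl name for unlinking an object in releases from around this time, newer versions call it `delete_object`, and `h5repack` is the external HDF5 command-line tool that rewrites a file without its dead space; file and dataset names are placeholders):

using HDF5

# Remove a dataset from an existing file.
h5open("data.h5", "r+") do f
    o_delete(f, "X")   # newer HDF5.jl: delete_object(f, "X")
end

# Deleting only removes the link to the data.  Without HDF5 1.10's free-space
# tracking the file does not shrink until it is rewritten, e.g. with h5repack:
run(`h5repack data.h5 data-compact.h5`)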

@eschnett
Contributor

eschnett commented Aug 2, 2016

@garrison There are also efficient tools to convert from/to older HDF5 file formats (if you encounter an old tool that needs to use the data), as well as tools to reorganize the data, e.g. transparently adding compression, chunking, different internal representations, etc.

@gasagna
Author

gasagna commented Aug 2, 2016

Isn't 1.10 backward compatible?

@eschnett
Contributor

eschnett commented Aug 2, 2016

1.10 can use a new file format that is more efficient. 1.10 can of course read the 1.8 format, but if you are using the new features, then the 1.8 library can't read the respective parts of the file. There's a conversion tool that downgrades the file (in place, apparently just rewriting some metadata, very fast). When you create a file with the 1.10 library, you can decide whether to avoid any new 1.10 features (by setting a flag) or not.

In other words: the 1.8 format has certain limitations that are corrected by the 1.10 library. You have to choose between backward compatibility and the new features.
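
(Presumably the conversion tool referred to is h5format_convert, which ships with HDF5 1.10 and rewrites dataset metadata in place so that 1.8-era readers can open the file; a minimal invocation from Julia, with a placeholder file name, would be:)

# Presumed tool name: h5format_convert (part of the HDF5 1.10 distribution).
run(`h5format_convert data.h5`)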

@gasagna
Author

gasagna commented Aug 2, 2016

Cool, thanks!

@gasagna
Author

gasagna commented Aug 19, 2016

OK, back from vacation...

Here is the script I used for timing and profiling.

using JLD

# create `Nfiles`  JLD files in `dir`, each containing some small amount of data
function bar(Nfiles, dir)
    !ispath(dir) && mkdir(dir)
    for i = 1:Nfiles
        f = @sprintf "%05d.jld" i
        JLD.save(joinpath(dir, f), "X", randn(5, 5))
        print("\r ", i)
    end
end

# Read `Nfiles` in `Nbatch` batches and evaluate the time required to load each batch. 
# Return a vector of times
function foo(Nfiles, Nbatch, dir)
   M = div(Nfiles, Nbatch)
   s = 0.0
   times = Float64[]
   for j = 1:Nbatch
       I = (j-1)*M + 1
       t = @elapsed for i = I:(I+M-1)
           f = @sprintf "%s/%05d.jld" dir i
           X = JLD.load(f, "X")::Matrix{Float64}
           s += sum(X)
       end
       push!(times, t)
       print("\r $j $t")
   end
   println()
   times
end

# ~ Make the files in `jld-tmp` first, then comment this call out
# bar(10000, "jld-tmp")

# ~ Now read files in batches
# precompile
foo(1, 1, "jld-tmp");

# ~ Time/Profile code, comment as necessary...
# @time foo(10000, 50, "jld-tmp");
@profile foo(10000, 50, "jld-tmp");
Profile.print()

Here are the results of timing the code five times, each time from scratch, where "patched" stands for JLD.jl with lines 146-148 commented out, as recommended by Tim.

# Times over five runs
original = [4.75, 4.61, 4.36, 4.45, 4.42]
patched  = [4.62, 4.37, 4.45, 4.42, 4.42]

As you can see, no big difference...

Now look at the relevant parts of the profiling sessions:

# Profiling output

# original
  3322 ./client.jl; _start; line: 378 
  ... 
  stuff
  ...
   2086 ...0.4/JLD/src/JLD.jl; jldopen; line: 267
   1    ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2208
   1    ....4/JLD/src/JLD.jl; close; line: 136
   3    ....4/JLD/src/JLD.jl; close; line: 138
    1 dict.jl; skip_deleted; line: 793
   4    ....4/JLD/src/JLD.jl; close; line: 143
   1533 ....4/JLD/src/JLD.jl; close; line: 146
    1528 ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2201
    5    ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2205
   4    ....4/JLD/src/JLD.jl; close; line: 147
    4 .../HDF5/src/plain.jl; h5o_close; line: 2006
   537  ....4/JLD/src/JLD.jl; close; line: 151
    536 ...HDF5/src/plain.jl; h5f_close; line: 2006
    1    ....4/JLD/src/JLD.jl; close; line: 156

# patched
  3521 ./client.jl; _start; line: 378
  ...
  stuff
  ...
    2272 ...0.4/JLD/src/JLD.jl; jldopen; line: 267
     2    ...0.4/JLD/src/JLD.jl; close; line: 143
     2269 ...0.4/JLD/src/JLD.jl; close; line: 151
      2266 ...HDF5/src/plain.jl; h5f_close; line: 2006

There is something strange going on here. Over a total of about 3300-3500 samples collected in the two cases, roughly the same number, ~2100-2200, goes to jldopen at line 267 in both, as you would expect from the similar timings. However, for the patched version all of that time is spent at line 151, whereas for the original version only about a fourth of the samples go to that line.

Either profiling the C code does something nasty, or there is a real shift in where time is spent when those three lines are commented out.

Edit: added version info

Julia Version 0.4.7-pre+1
Commit 57d0834* (2016-06-19 17:17 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@gasagna
Author

gasagna commented Aug 30, 2016

I have tested the code from the original issue on a Linux machine with HDF5 1.10. The same behaviour occurs.

@gasagna
Author

gasagna commented Sep 11, 2016

I have tried using Python's h5py to read the same files, to see whether the issue is in the HDF5 library itself or in HDF5.jl.

The Julia code to write and read the files is:

using JLD
using Iterators

function make_files(Nfiles, dir)
    !ispath(dir) && mkdir(dir)
    for i = 1:Nfiles
        f = @sprintf "%05d.jld" i
        JLD.save(joinpath(dir, f), "X", randn(5, 5))
        print("\r ", i)
    end
    println()
end

function read_files(dir, N)
   files = readdir(dir)
   s = 0.0
   for chunk in partition(files, N)
       t = @elapsed for file in chunk
           X = JLD.load(joinpath(dir, file), "X")::Matrix{Float64}
           s += sum(X)
       end
       @printf "%.5f\n" t
   end
   s
end

# make_files(10000, "jld-tmp")
println(read_files("jld-tmp", 500))

The Python code is:

import numpy as np
import os
import h5py
import time

def chunks(list, N):
    for i in xrange(0, len(list), N):
        yield list[i:i+N]

def read_files(N, dir):
    files = os.listdir(dir)
    s = 0.0
    for chunk in chunks(files, N):
        tic = time.time()
        for file in chunk:
            fh = h5py.File(os.path.join(dir, file), "r")
            X = fh["X"]
            s += np.sum(X)
        toc = time.time()
        print "%.5f" % (toc-tic)

    return s

print read_files(500, "jld-tmp")

Timings are as follows:

Julia

0.57577
0.13256
0.15173
0.17690
0.18536
0.18588
0.17711
0.19230
0.19699
0.19837
0.21121
0.21436
0.22501
0.22496
0.23954
0.24651
0.27409
0.26905
0.28692
0.30366

so the time to read each batch of files increases with the number of files that have already been read.

Python

0.15729
0.15792
0.18449
0.19421
0.16328
0.15744
0.15732
0.15698
0.15875
0.15828
0.16598
0.16010
0.15951
0.15864
0.15954
0.16478
0.15948
0.16920
0.16462
0.16515

so the time stays constant, and is a bit higher than Julia's initial batches.

Both Python and Julia use the same version of HDF5, 1.8.16.

@gasagna
Author

gasagna commented Sep 12, 2016

(Friendly bump) It seems that h5py handles thousands of files quite well. This suggests that the problem might be in the way we use the HDF5 library. Unfortunately, I do not have enough breadth to go into the details of HDF5.jl and attempt to fix this bug myself, but if there is anything you think I can try, please let me know.

@StefanKarpinski
Member

This makes me wonder if there isn't some state that we're inappropriately sharing across HDF5 files.

@eschnett
Contributor

I haven't seen any code where we explicitly share state.

However, the HDF5 library maintains caches. And in particular, we use the "delayed close" flag when opening files, since this seems like a good idea in the presence of garbage collection, as it allows performing certain clean-up actions automatically during garbage collection (RAII in C++ terms). This makes HDF5 much easier to use -- otherwise, one has to manually "close" each HDF5 object (dataset, dataspace, datatype, attribute, group, ...), which is a non-starter.

So maybe all we're missing is a heavy dose of garbage collection, ensuring that all HDF5-referencing objects are actually collected and finalized?

A better implementation might make use of a (future) Julia feature to finalize objects more eagerly. Another approach would be to manually keep back-pointers from files to objects that live in that file, and manually close all these when the file is closed.
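
A minimal sketch of that back-pointer idea is below; every name in it is hypothetical, nothing like this exists in JLD.jl or HDF5.jl yet. The file wrapper records each object opened through it and closes them all explicitly when the file itself is closed, instead of waiting for finalizers.

# Hypothetical sketch of the "back-pointer" approach described above.
type TrackedFile                  # 0.4/0.5-era syntax; `mutable struct` on 0.6+
    handle                        # the underlying JLD/HDF5 file object
    children::Vector{Any}         # datasets, groups, attributes, ... opened from this file
end

TrackedFile(handle) = TrackedFile(handle, Any[])

# Remember a newly opened child object and return it unchanged.
track!(f::TrackedFile, obj) = (push!(f.children, obj); obj)

# Close every child first, then the file, without relying on the GC/finalizers.
function Base.close(f::TrackedFile)
    for obj in f.children
        close(obj)
    end
    empty!(f.children)
    close(f.handle)
end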

@garrison
Contributor

So maybe all we're missing is a heavy dose of garbage collection, ensuring that all HDF5-referencing objects are actually collected and finalized?

This seems doubtful to me, as calling gc() explicitly does not seem to speed anything up.

Here's my code:

import JLD

function main(fn; enable_gc=false)
    @show enable_gc
    JLD.save(fn, "data", rand(256))
    for i in 1:10
        enable_gc && gc()
        @show i
        @time for j in 1:2000
            JLD.load(fn, "data")
        end
    end
end

main("/tmp/a.jld", enable_gc=true)

And my output (for both values of enable_gc):

enable_gc = false
i = 1
  1.028825 seconds (445.49 k allocations: 26.563 MB, 5.23% gc time)
i = 2
  1.047420 seconds (366.00 k allocations: 23.163 MB, 0.90% gc time)
i = 3
  1.989134 seconds (366.00 k allocations: 23.163 MB, 0.55% gc time)
i = 4
  3.006954 seconds (366.00 k allocations: 23.163 MB, 0.39% gc time)
i = 5
  3.987207 seconds (366.00 k allocations: 23.163 MB, 0.28% gc time)
i = 6
  4.786223 seconds (366.00 k allocations: 23.163 MB, 0.25% gc time)
i = 7
  5.564047 seconds (366.00 k allocations: 23.163 MB, 0.21% gc time)
i = 8
  6.341545 seconds (366.00 k allocations: 23.163 MB, 0.18% gc time)
i = 9
  7.257423 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 10
  8.006855 seconds (366.00 k allocations: 23.163 MB, 0.15% gc time)

enable_gc = true
i = 1
  1.019273 seconds (445.49 k allocations: 26.563 MB, 0.73% gc time)
i = 2
  1.168291 seconds (366.00 k allocations: 23.163 MB, 0.75% gc time)
i = 3
  2.006945 seconds (366.00 k allocations: 23.163 MB, 0.44% gc time)
i = 4
  3.047479 seconds (366.00 k allocations: 23.163 MB, 0.32% gc time)
i = 5
  3.787296 seconds (366.00 k allocations: 23.163 MB, 0.27% gc time)
i = 6
  4.606675 seconds (366.00 k allocations: 23.163 MB, 0.22% gc time)
i = 7
  5.297919 seconds (366.00 k allocations: 23.163 MB, 0.19% gc time)
i = 8
  6.101852 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 9
  6.809118 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 10
  7.578225 seconds (366.00 k allocations: 23.163 MB, 0.12% gc time)

Since I'm opening the same file repeatedly and still see a slowdown (at least in this test case), it seems that HDF5 "caches" are providing no benefit even in a use case where they might be expected to. I think it would be better if JLD.load fully closed the HDF5 file.

@garrison
Contributor

See also: JuliaIO/HDF5.jl#349
