Julia slows down when reading thousands of JLD files #86

Open
gasagna opened this issue Jul 23, 2016 · 19 comments

@gasagna

gasagna commented Jul 23, 2016

This issue has been reported initially at JuliaLang/julia#17554. Discussion can continue here.

@gasagna
Author

gasagna commented Aug 1, 2016

(Friendly bump) Is there any chance this issue can be easily addressed or worked around? I have several tens of thousands of .jld files, and reading data from all of them takes far longer than it should.

I might consider temporarily switching to a different storage format until this bug/feature is fixed. Any ideas on alternatives? Each file is under 1 MB and contains fewer than 5 arrays.

@eschnett
Contributor

eschnett commented Aug 1, 2016

I don't think we know yet what causes this. I hope there is a straightforward way to change the way we use the HDF5 library that removes this problem (explicitly calling the HDF5 garbage collector? using the "strong closing" flag when opening the file? updating to HDF5 1.10?), but these are just guesses.
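
For what it's worth, the garbage-collector idea could be tried with a one-line ccall into libhdf5. This is an untested sketch, and it assumes HDF5.jl exposes the shared-library path as HDF5.libhdf5:

using HDF5

# Untested sketch: H5garbage_collect() asks libhdf5 to free its internal
# free lists.  `HDF5.libhdf5` (the path to the shared library) is assumed to
# be defined by HDF5.jl.
const libhdf5 = HDF5.libhdf5

function hdf5_garbage_collect()
    status = ccall((:H5garbage_collect, libhdf5), Cint, ())
    status < 0 && error("H5garbage_collect() failed")
    return status
end

Calling this periodically between batches of files would at least show whether library-internal free lists are involved in the slowdown.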

What other format are you contemplating? CSV?

@gasagna
Author

gasagna commented Aug 1, 2016

Thanks for chiming in.

Unfortunately I am out of ideas for alternative formats... Maybe numpy .npy via PyCall, but I have not tested it yet. It has to be binary, as I have a ~100 GB dataset that would be too large as ASCII files.

@eschnett
Contributor

eschnett commented Aug 1, 2016

The problem seems to be that there are so many files. Can you get by with fewer files? I would consider putting all 100 GByte into the same HDF5 file.

If your data have a simple format (e.g. matrix, array), then you can write them directly into an HDF5 file, circumventing JLD.
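
For plain numeric arrays that could look something like the following (sketch using HDF5.jl's h5write/h5read convenience functions; the file and dataset names are placeholders):

using HDF5

# Write a Float64 matrix straight into an HDF5 file, bypassing JLD's extra
# metadata, and read it back.  "data.h5" and "X" are placeholder names.
X = randn(5, 5)
h5write("data.h5", "X", X)   # creates the file and the dataset "X"
Y = h5read("data.h5", "X")   # reads the dataset back as a Matrix{Float64}
@assert X == Y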

@timholy
Member

timholy commented Aug 1, 2016

The problem seems to be a libhdf5 problem, however.

@gasagna, you might have to do a bit of legwork. What happens if you comment out the problematic block we discussed at JuliaLang/julia#17554?

Also, upgrading to a new libhdf5 seems likely to help---they have apparently ironed out a bunch of problems.

@gasagna
Author

gasagna commented Aug 2, 2016

@eschnett I could store everything in one large file, but I do not want to do that for several reasons, e.g. avoiding massive data corruption, being able to delete/move/reorganise files easily, ... (Does HDF5.jl allow deleting a dataset in a file? And is the associated storage removed from the file, i.e. does the file size decrease?)

@timholy Will try your suggestions asap.

@eschnett
Contributor

eschnett commented Aug 2, 2016

@gasagna Yes to both, with HDF5 1.10.
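
An untested sketch of what that would look like (`o_delete` is the HDF5.jl name for unlinking an object in releases from around this time, newer versions call it `delete_object`, and `h5repack` is the external HDF5 command-line tool that rewrites a file without its dead space; file and dataset names are placeholders):

using HDF5

# Remove a dataset from an existing file.
h5open("data.h5", "r+") do f
    o_delete(f, "X")   # newer HDF5.jl: delete_object(f, "X")
end

# Deleting only removes the link to the data.  Without HDF5 1.10's free-space
# tracking the file does not shrink until it is rewritten, e.g. with h5repack:
run(`h5repack data.h5 data-compact.h5`)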

@eschnett
Contributor

eschnett commented Aug 2, 2016

@garrison There are also efficient tools to convert from/to older HDF5 file formats (if you encounter an old tool that needs to use the data), as well as tools to reorganize the data, e.g. transparently adding compression, chunking, different internal representations, etc.

@gasagna
Author

gasagna commented Aug 2, 2016

Isn't 1.10 backward compatible?

@eschnett
Contributor

eschnett commented Aug 2, 2016

1.10 can use a new file format that is more efficient. 1.10 can of course read the 1.8 format, but if you are using the new features, then the 1.8 library can't read the respective parts of the file. There's a conversion tool that downgrades the file (in place, apparently just rewriting some metadata, very fast). When you create a file with the 1.10 library, you can decide whether to avoid any new 1.10 features (by setting a flag) or not.

In other words: the 1.8 format has certain limitations that are corrected by the 1.10 library. You have to choose between backward compatibility and the new features.
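
(Presumably the conversion tool referred to is h5format_convert, which ships with HDF5 1.10 and rewrites dataset metadata in place so that 1.8-era readers can open the file; a minimal invocation from Julia, with a placeholder file name, would be:)

# Presumed tool name: h5format_convert (part of the HDF5 1.10 distribution).
run(`h5format_convert data.h5`)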

@gasagna
Author

gasagna commented Aug 2, 2016

Cool, thanks!

@gasagna
Author

gasagna commented Aug 19, 2016

OK, back from vacation...

Here is the script I used for timing and profiling.

using JLD

# create `Nfiles`  JLD files in `dir`, each containing some small amount of data
function bar(Nfiles, dir)
    !ispath(dir) && mkdir(dir)
    for i = 1:Nfiles
        f = @sprintf "%05d.jld" i
        JLD.save(joinpath(dir, f), "X", randn(5, 5))
        print("\r ", i)
    end
end

# Read `Nfiles` in `Nbatch` batches and evaluate the time required to load each batch. 
# Return a vector of times
function foo(Nfiles, Nbatch, dir)
   M = div(Nfiles, Nbatch)
   s = 0.0
   times = Float64[]
   for j = 1:Nbatch
       I = (j-1)*M + 1
       t = @elapsed for i = I:(I+M-1)
           f = @sprintf "%s/%05d.jld" dir i
           X = JLD.load(f, "X")::Matrix{Float64}
           s += sum(X)
       end
       push!(times, t)
       print("\r $j $t")
   end
   println()
   times
end

# ~ Make the files in `jld-tmp` first, then comment this call out
# bar(10000, "jld-tmp")

# ~ Now read files in batches
# precompile
foo(1, 1, "jld-tmp");

# ~ Time/Profile code, comment as necessary...
# @time foo(10000, 50, "jld-tmp");
@profile foo(10000, 50, "jld-tmp");
Profile.print()

Here are the results of timing the code five times, each time from scratch, where "patched" stands for JLD.jl with lines 146-148 commented out, as recommended by Tim.

# Times over five runs
original = [4.75, 4.61, 4.36, 4.45, 4.42]
patched  = [4.62, 4.37, 4.45, 4.42, 4.42]

As you can see, no big difference...

Now look at the relevant parts of the profiling sessions:

# Profiling output

# original
  3322 ./client.jl; _start; line: 378 
  ... 
  stuff
  ...
   2086 ...0.4/JLD/src/JLD.jl; jldopen; line: 267
   1    ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2208
   1    ....4/JLD/src/JLD.jl; close; line: 136
   3    ....4/JLD/src/JLD.jl; close; line: 138
    1 dict.jl; skip_deleted; line: 793
   4    ....4/JLD/src/JLD.jl; close; line: 143
   1533 ....4/JLD/src/JLD.jl; close; line: 146
    1528 ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2201
    5    ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2205
   4    ....4/JLD/src/JLD.jl; close; line: 147
    4 .../HDF5/src/plain.jl; h5o_close; line: 2006
   537  ....4/JLD/src/JLD.jl; close; line: 151
    536 ...HDF5/src/plain.jl; h5f_close; line: 2006
    1    ....4/JLD/src/JLD.jl; close; line: 156

# patched
  3521 ./client.jl; _start; line: 378
  ...
  stuff
  ...
    2272 ...0.4/JLD/src/JLD.jl; jldopen; line: 267
     2    ...0.4/JLD/src/JLD.jl; close; line: 143
     2269 ...0.4/JLD/src/JLD.jl; close; line: 151
      2266 ...HDF5/src/plain.jl; h5f_close; line: 2006

There is something strange going on here. Over a total of about 3300-3500 samples collected in the two cases, roughly the same number, ~2100-2200, goes to jldopen at line 267 in both, as you would expect from the similar timings. However, for the patched version all of that time is spent at line 151, whereas for the original version only about a fourth of the samples go to that line.

Either profiling the C code does something nasty, or there is a real shift in where time is spent when those three lines are commented out.

Edit: added version info

Julia Version 0.4.7-pre+1
Commit 57d0834* (2016-06-19 17:17 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@gasagna
Author

gasagna commented Aug 30, 2016

I have tested the code from the original issue on a Linux machine with HDF5 1.10. The same behaviour occurs.

@gasagna
Author

gasagna commented Sep 11, 2016

I have tried using Python's h5py to read the same files, to see whether the issue is in the HDF5 library itself or in HDF5.jl.

The Julia code to write and read the files is:

using JLD
using Iterators

function make_files(Nfiles, dir)
    !ispath(dir) && mkdir(dir)
    for i = 1:Nfiles
        f = @sprintf "%05d.jld" i
        JLD.save(joinpath(dir, f), "X", randn(5, 5))
        print("\r ", i)
    end
    println()
end

function read_files(dir, N)
   files = readdir(dir)
   s = 0.0
   for chunk in partition(files, N)
       t = @elapsed for file in chunk
           X = JLD.load(joinpath(dir, file), "X")::Matrix{Float64}
           s += sum(X)
       end
       @printf "%.5f\n" t
   end
   s
end

# make_files(10000, "jld-tmp")
println(read_files("jld-tmp", 500))

The Python code is:

import numpy as np
import os
import h5py
import time

def chunks(list, N):
    for i in xrange(0, len(list), N):
        yield list[i:i+N]

def read_files(N, dir):
    files = os.listdir(dir)
    s = 0.0
    for chunk in chunks(files, N):
        tic = time.time()
        for file in chunk:
            fh = h5py.File(os.path.join(dir, file), "r")
            X = fh["X"]
            s += np.sum(X)
        toc = time.time()
        print "%.5f" % (toc-tic)

    return s

print read_files(500, "jld-tmp")

Timings are as follows:

Julia

0.57577
0.13256
0.15173
0.17690
0.18536
0.18588
0.17711
0.19230
0.19699
0.19837
0.21121
0.21436
0.22501
0.22496
0.23954
0.24651
0.27409
0.26905
0.28692
0.30366

so the time to read each batch of files increases with the number of files that have already been read.

Python

0.15729
0.15792
0.18449
0.19421
0.16328
0.15744
0.15732
0.15698
0.15875
0.15828
0.16598
0.16010
0.15951
0.15864
0.15954
0.16478
0.15948
0.16920
0.16462
0.16515

so the time stays constant, and is a bit higher than Julia's initial batches.

Both Python and Julia use the same version of HDF5, 1.8.16.

@gasagna
Author

gasagna commented Sep 12, 2016

(Friendly bump) It seems that h5py handles thousands of files quite well. This suggests that the problem might be in the way we use the HDF5 library. Unfortunately, I do not have enough breadth to go into the details of HDF5.jl and attempt to fix this bug myself, but if there is anything you think I can try, please let me know.

@StefanKarpinski
Member

This makes me wonder if there isn't some state that we're inappropriately sharing across HDF5 files.

@eschnett
Contributor

I haven't seen any code where we explicitly share state.

However, the HDF5 library maintains caches. And in particular, we use the "delayed close" flag when opening files, since this seems like a good idea in the presence of garbage collection, as it allows performing certain clean-up actions automatically during garbage collection (RAII in C++ terms). This makes HDF5 much easier to use -- otherwise, one has to manually "close" each HDF5 object (dataset, dataspace, datatype, attribute, group, ...), which is a non-starter.

So maybe all we're missing is a heavy dose of garbage collection, ensuring that all HDF5-referencing objects are actually collected and finalized?

A better implementation might make use of a (future) Julia feature to finalize objects more eagerly. Another approach would be to manually keep back-pointers from files to objects that live in that file, and manually close all these when the file is closed.
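
A minimal sketch of that back-pointer idea is below; every name in it is hypothetical, nothing like this exists in JLD.jl or HDF5.jl yet. The file wrapper records each object opened through it and closes them all explicitly when the file itself is closed, instead of waiting for finalizers.

# Hypothetical sketch of the "back-pointer" approach described above.
type TrackedFile                  # 0.4/0.5-era syntax; `mutable struct` on 0.6+
    handle                        # the underlying JLD/HDF5 file object
    children::Vector{Any}         # datasets, groups, attributes, ... opened from this file
end

TrackedFile(handle) = TrackedFile(handle, Any[])

# Remember a newly opened child object and return it unchanged.
track!(f::TrackedFile, obj) = (push!(f.children, obj); obj)

# Close every child first, then the file, without relying on the GC/finalizers.
function Base.close(f::TrackedFile)
    for obj in f.children
        close(obj)
    end
    empty!(f.children)
    close(f.handle)
end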

@garrison
Contributor

So maybe all we're missing is a heavy dose of garbage collection, ensuring that all HDF5-referencing objects are actually collected and finalized?

This seems doubtful to me, as calling gc() explicitly does not seem to speed anything up.

Here's my code:

import JLD

function main(fn; enable_gc=false)
    @show enable_gc
    JLD.save(fn, "data", rand(256))
    for i in 1:10
        enable_gc && gc()
        @show i
        @time for j in 1:2000
            JLD.load(fn, "data")
        end
    end
end

main("/tmp/a.jld", enable_gc=true)

And my output (for both values of enable_gc):

enable_gc = false
i = 1
  1.028825 seconds (445.49 k allocations: 26.563 MB, 5.23% gc time)
i = 2
  1.047420 seconds (366.00 k allocations: 23.163 MB, 0.90% gc time)
i = 3
  1.989134 seconds (366.00 k allocations: 23.163 MB, 0.55% gc time)
i = 4
  3.006954 seconds (366.00 k allocations: 23.163 MB, 0.39% gc time)
i = 5
  3.987207 seconds (366.00 k allocations: 23.163 MB, 0.28% gc time)
i = 6
  4.786223 seconds (366.00 k allocations: 23.163 MB, 0.25% gc time)
i = 7
  5.564047 seconds (366.00 k allocations: 23.163 MB, 0.21% gc time)
i = 8
  6.341545 seconds (366.00 k allocations: 23.163 MB, 0.18% gc time)
i = 9
  7.257423 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 10
  8.006855 seconds (366.00 k allocations: 23.163 MB, 0.15% gc time)

enable_gc = true
i = 1
  1.019273 seconds (445.49 k allocations: 26.563 MB, 0.73% gc time)
i = 2
  1.168291 seconds (366.00 k allocations: 23.163 MB, 0.75% gc time)
i = 3
  2.006945 seconds (366.00 k allocations: 23.163 MB, 0.44% gc time)
i = 4
  3.047479 seconds (366.00 k allocations: 23.163 MB, 0.32% gc time)
i = 5
  3.787296 seconds (366.00 k allocations: 23.163 MB, 0.27% gc time)
i = 6
  4.606675 seconds (366.00 k allocations: 23.163 MB, 0.22% gc time)
i = 7
  5.297919 seconds (366.00 k allocations: 23.163 MB, 0.19% gc time)
i = 8
  6.101852 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 9
  6.809118 seconds (366.00 k allocations: 23.163 MB, 0.16% gc time)
i = 10
  7.578225 seconds (366.00 k allocations: 23.163 MB, 0.12% gc time)

Since I'm opening the same file repeatedly and still see a slowdown (at least in this test case), it seems that HDF5 "caches" are providing no benefit even in a use case where they might be expected to. I think it would be better if JLD.load fully closed the HDF5 file.

@garrison
Contributor

See also: JuliaIO/HDF5.jl#349
