Julia slows down when reading thousands of JLD files #86
(Friendly bump) Is there any chance this issue can be easily addressed or worked around? I have several tens of thousands of .jld files, and reading data from all of them takes ages, unnecessarily. I might consider temporarily switching to a different storage format until this bug/feature is fixed. Any ideas on alternatives? Files are under 1 MB and each contains fewer than 5 arrays.
I don't think we know yet what causes this. I hope there is a straightforward way to change the way we use the HDF5 library that removes this problem (explicitly calling the HDF5 garbage collector? using the "strong closing" flag when opening the file? updating to HDF5 1.10?), but these are just guesses. What other format are you contemplating? CSV?
Thanks for chiming in. Unfortunately I am out of ideas for alternative formats... Maybe numpy .npy via PyCall, but I have not tested it yet. It has to be binary, as I have a ~100 GB dataset that would be too large in ASCII files.
The problem seems to be that there are so many files. Can you get by with fewer files? I would consider putting all 100 GB into the same HDF5 file. If your data have a simple format (e.g. matrix, array), then you can write them directly into an HDF5 file, circumventing JLD.
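The single-container approach can be sketched with the standard library alone. This is purely illustrative (pickle standing in for HDF5; all names, sizes, and paths are made up): instead of tens of thousands of small files, every named array lives in one binary file.

```python
import os
import pickle
import random
import tempfile

def save_all(path, arrays):
    """arrays: dict mapping name -> 5x5 matrix (list of lists)."""
    with open(path, "wb") as fh:
        pickle.dump(arrays, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_one(path, name):
    # A real container format (HDF5) would seek directly to the dataset;
    # this sketch simply loads the whole dict for simplicity.
    with open(path, "rb") as fh:
        return pickle.load(fh)[name]

# Hypothetical data mirroring the thread's setup: many small 5x5 matrices.
arrays = {"%05d" % i: [[random.random() for _ in range(5)] for _ in range(5)]
          for i in range(100)}
path = os.path.join(tempfile.mkdtemp(), "all.bin")
save_all(path, arrays)
X = load_one(path, "00042")
```

The point of the design is that opening one file once amortizes any per-open library overhead across all datasets.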
@gasagna, you might have to do a bit of legwork. What happens if you comment out the problematic block we discussed at JuliaLang/julia#17554? Also, upgrading to a newer libhdf5 seems likely to help---they have apparently ironed out a bunch of problems.
@eschnett I could store everything in one large file, but I do not want to do that for several reasons, e.g. avoiding massive data corruption, and being able to delete/move/reorganise files easily. (Does HDF5.jl allow deleting a dataset in a file? Is the associated storage removed from the file, i.e. does the file size decrease?) @timholy Will try your suggestions asap.
@gasagna Yes to both, with HDF5 1.10.
@garrison There are also efficient tools to convert from/to older HDF5 file formats (if you encounter an old tool that needs to use the data), as well as tools to reorganize the data, e.g. transparently adding compression, chunking, different internal representations, etc.
Isn't 1.10 backward compatible? |
1.10 can use a new file format that is more efficient. 1.10 can of course read the 1.8 format, but if you are using the new features, then the 1.8 library can't read the respective parts of the file. There's a conversion tool that downgrades the file (in place, apparently just rewriting some metadata, very fast). When you create a file with the 1.10 version, you can decide (by setting a flag) whether to avoid any new 1.10 features. In other words: the 1.8 format has certain limitations that are corrected by the 1.10 library, and you have to choose between backward compatibility and the new features.
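As a side note, you can tell which on-disk format variant a file uses by inspecting its first bytes. This is a rough sketch based on the published HDF5 file format specification, not anything in HDF5.jl; the version-to-library mapping in the comments is approximate, and real files can (rarely) place the superblock at a nonzero offset.

```python
# HDF5 files begin with an 8-byte signature followed, in superblock
# versions 0 through 3, by a one-byte version number. 1.8-era files
# use superblock versions 0-2; some 1.10 features require version 3.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def superblock_version(header):
    """Return the superblock version byte from the start of a file."""
    if not header.startswith(HDF5_SIGNATURE):
        raise ValueError("no HDF5 signature at offset 0")
    return header[len(HDF5_SIGNATURE)]

# Synthetic header bytes for illustration (version byte = 0):
fake_header = HDF5_SIGNATURE + bytes([0]) + b"\x00" * 7
version = superblock_version(fake_header)
```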
Cool, thanks!
Ok, back from vacation... Here is the script I used for timing and profiling.

using JLD
# create `Nfiles` JLD files in `dir`, each containing some small amount of data
function bar(Nfiles, dir)
!ispath(dir) && mkdir(dir)
for i = 1:Nfiles
f = @sprintf "%05d.jld" i
JLD.save(joinpath(dir, f), "X", randn(5, 5))
print("\r ", i)
end
end
# Read `Nfiles` in `Nbatch` batches and evaluate the time required to load each batch.
# Return a vector of times
function foo(Nfiles, Nbatch, dir)
M = div(Nfiles, Nbatch)
s = 0.0
times = Float64[]
for j = 1:Nbatch
I = (j-1)*M + 1
t = @elapsed for i = I:(I+M-1)
f = @sprintf "%s/%05d.jld" dir i
X = JLD.load(f, "X")::Matrix{Float64}
s += sum(X)
end
push!(times, t)
print("\r $j $t")
end
println()
times
end
# ~ Make files in `jld-tmp`, then comment this line out
# bar(10000, "jld-tmp")
# ~ Now read files in batches
# precompile
foo(1, 1, "jld-tmp");
# ~ Time/Profile code, comment as necessary...
# @time foo(10000, 50, "jld-tmp");
@profile foo(10000, 50, "jld-tmp");
Profile.print()

Here are the results of timing the code five times, each time from scratch:

# Times over five runs
original = [4.75, 4.61, 4.36, 4.45, 4.42]
patched = [4.62, 4.37, 4.45, 4.42, 4.42]

As you can see, no big difference... Now look at the relevant parts of the profiling sessions:

# Profiling output
# original
3322 ./client.jl; _start; line: 378
...
stuff
...
2086 ...0.4/JLD/src/JLD.jl; jldopen; line: 267
1 ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2208
1 ....4/JLD/src/JLD.jl; close; line: 136
3 ....4/JLD/src/JLD.jl; close; line: 138
1 dict.jl; skip_deleted; line: 793
4 ....4/JLD/src/JLD.jl; close; line: 143
1533 ....4/JLD/src/JLD.jl; close; line: 146
1528 ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2201
5 ...HDF5/src/plain.jl; h5f_get_obj_ids; line: 2205
4 ....4/JLD/src/JLD.jl; close; line: 147
4 .../HDF5/src/plain.jl; h5o_close; line: 2006
537 ....4/JLD/src/JLD.jl; close; line: 151
536 ...HDF5/src/plain.jl; h5f_close; line: 2006
1 ....4/JLD/src/JLD.jl; close; line: 156
# patched
3521 ./client.jl; _start; line: 378
...
stuff
...
2272 ...0.4/JLD/src/JLD.jl; jldopen; line: 267
2 ...0.4/JLD/src/JLD.jl; close; line: 143
2269 ...0.4/JLD/src/JLD.jl; close; line: 151
2266 ...HDF5/src/plain.jl; h5f_close; line: 2006

There is something strange going on here. Over a total of about 3300-3500 samples collected in the two cases, about the same amount, ~2100-2200, goes to jldopen. Either profiling the C code does something nasty, or there is a real shift in where time is spent when those three lines are commented out.

Edit: versioninfo added:

Julia Version 0.4.7-pre+1
Commit 57d0834* (2016-06-19 17:17 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin15.5.0)
CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
WORD_SIZE: 64
BLAS: libgfortblas
LAPACK: liblapack
LIBM: libopenlibm
LLVM: libLLVM-3.3
I have tested the code from the original issue on a Linux machine with HDF5 1.10. The same behaviour occurs.
I have tried using Python h5py to read the same files, to see whether the issue is in HDF5 itself or in HDF5.jl. The Julia code to write and read the files is:

using JLD
using Iterators
function make_files(Nfiles, dir)
!ispath(dir) && mkdir(dir)
for i = 1:Nfiles
f = @sprintf "%05d.jld" i
JLD.save(joinpath(dir, f), "X", randn(5, 5))
print("\r ", i)
end
println()
end
function read_files(dir, N)
files = readdir(dir)
s = 0.0
for chunk in partition(files, N)
t = @elapsed for file in chunk
X = JLD.load(joinpath(dir, file), "X")::Matrix{Float64}
s += sum(X)
end
@printf "%.5f\n" t
end
s
end
# make_files(10000, "jld-tmp")
println(read_files("jld-tmp", 500))

The Python code is:

import numpy as np
import os
import h5py
import time
def chunks(list, N):
for i in xrange(0, len(list), N):
yield list[i:i+N]
def read_files(N, dir):
files = os.listdir(dir)
s = 0.0
for chunk in chunks(files, N):
tic = time.time()
for file in chunk:
fh = h5py.File(os.path.join(dir, file), "r")
X = fh["X"]
s += np.sum(X)
toc = time.time()
print "%.5f" % (toc-tic)
return s
print read_files(500, "jld-tmp")

Timings are as follows.

Julia:

0.57577
0.13256
0.15173
0.17690
0.18536
0.18588
0.17711
0.19230
0.19699
0.19837
0.21121
0.21436
0.22501
0.22496
0.23954
0.24651
0.27409
0.26905
0.28692
0.30366

so the time to read each batch of files increases with the number of files that have already been read.

Python:

0.15729
0.15792
0.18449
0.19421
0.16328
0.15744
0.15732
0.15698
0.15875
0.15828
0.16598
0.16010
0.15951
0.15864
0.15954
0.16478
0.15948
0.16920
0.16462
0.16515

so the time stays constant, and a bit higher than for Julia. Both Python and Julia use the same version of HDF5, 1.8.16.
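To quantify the difference, here is a quick least-squares fit over the batch times reported above (a stdlib-free sketch; Julia's first batch of 0.57577 s is excluded since it includes compilation):

```python
def slope(ys):
    """Least-squares slope of ys against the batch index 0..n-1."""
    n = len(ys)
    xbar = (n - 1) / 2
    ybar = sum(ys) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(ys))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den

# Batch times copied from the comment above.
julia_times = [0.13256, 0.15173, 0.17690, 0.18536, 0.18588, 0.17711,
               0.19230, 0.19699, 0.19837, 0.21121, 0.21436, 0.22501,
               0.22496, 0.23954, 0.24651, 0.27409, 0.26905, 0.28692,
               0.30366]
python_times = [0.15729, 0.15792, 0.18449, 0.19421, 0.16328, 0.15744,
                0.15732, 0.15698, 0.15875, 0.15828, 0.16598, 0.16010,
                0.15951, 0.15864, 0.15954, 0.16478, 0.15948, 0.16920,
                0.16462, 0.16515]

julia_slope = slope(julia_times)    # ~ +8 ms per batch
python_slope = slope(python_times)  # essentially flat
```

The fit makes the contrast stark: each additional batch of 500 files costs Julia/JLD roughly 8 ms more than the previous one, while the Python/h5py slope is indistinguishable from zero.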
(Friendly bump) It seems that h5py handles thousands of files quite well. This would suggest that the problem might be in the way we use the HDF5 library. Unfortunately, I do not have enough breadth to go into the details of HDF5.jl and attempt to fix this bug myself, but if there is anything you think I can try, please let me know.
This makes me wonder if there isn't some state that we're inappropriately sharing across HDF5 files.
I haven't seen any code where we explicitly share state. However, the HDF5 library maintains caches. And in particular, we use the "delayed close" flag when opening files, since this seems like a good idea in the presence of garbage collection, as it allows performing certain clean-up actions automatically during garbage collection (RAII in C++ terms). This makes HDF5 much easier to use -- otherwise, one has to manually "close" each HDF5 object (dataset, dataspace, datatype, attribute, group, ...), which is a non-starter.

So maybe all we're missing is a heavy dose of garbage collection, ensuring that all HDF5-referencing objects are actually collected and finalized? A better implementation might make use of a (future) Julia feature to finalize objects more eagerly. Another approach would be to manually keep back-pointers from files to objects that live in that file, and manually close all of these when the file is closed.
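The back-pointer idea could look roughly like this. The `File`/`Dataset` classes and method names below are hypothetical, not HDF5.jl's actual API; the sketch only shows the ownership structure.

```python
import weakref

class File:
    """Remembers its live child objects and closes them eagerly when the
    file itself is closed, instead of waiting for the garbage collector
    to finalize each one individually."""
    def __init__(self, name):
        self.name = name
        self.closed = False
        # Weak back-pointers: do not keep children alive artificially.
        self._children = weakref.WeakSet()

    def open_dataset(self, dsname):
        ds = Dataset(self, dsname)
        self._children.add(ds)
        return ds

    def close(self):
        # Close every still-live child first, so no handles linger in
        # library-level caches after the file is gone.
        for child in list(self._children):
            child.close()
        self.closed = True

class Dataset:
    def __init__(self, file, name):
        self.file = file
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True

f = File("00001.jld")
d = f.open_dataset("X")
f.close()  # closes d as well
```

Using a weak set means the file never prevents its children from being garbage-collected, while still guaranteeing that `close()` on the file finalizes everything that remains open.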
This seems doubtful to me, as calling gc() explicitly makes no difference. Here's my code:

import JLD
function main(fn; enable_gc=false)
@show enable_gc
JLD.save(fn, "data", rand(256))
for i in 1:10
enable_gc && gc()
@show i
@time for j in 1:2000
JLD.load(fn, "data")
end
end
end
main("/tmp/a.jld", enable_gc=true)

And my output, for both values of enable_gc, shows the same slowdown.
Since I'm opening the same file repeatedly and still see a slowdown (at least in this test case), it seems that HDF5 "caches" provide no benefit even in a use case where they might be expected to. I think it would be better if …
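For comparison, here is the structure of that benchmark ported to Python, with pickle standing in for JLD (the file name, payload size, and iteration counts are arbitrary); it repeatedly loads the same file and times each round, optionally forcing a collection first:

```python
import gc
import os
import pickle
import tempfile
import time

def main(fn, enable_gc=False, rounds=3, loads_per_round=200):
    # Write a small payload once, mirroring JLD.save of a 256-element vector.
    with open(fn, "wb") as fh:
        pickle.dump([0.0] * 256, fh)
    times = []
    for _ in range(rounds):
        if enable_gc:
            gc.collect()  # analogous to calling gc() in the Julia version
        tic = time.perf_counter()
        for _ in range(loads_per_round):
            with open(fn, "rb") as fh:
                pickle.load(fh)
        times.append(time.perf_counter() - tic)
    return times

fn = os.path.join(tempfile.mkdtemp(), "a.bin")
times = main(fn, enable_gc=True)
```

With pickle the per-round times stay flat, which is the behaviour one would expect from the JLD version too if no state accumulated across opens.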
See also: JuliaIO/HDF5.jl#349
This issue was initially reported at JuliaLang/julia#17554. Discussion can continue here.