
slow writing #170

Closed
abieler opened this issue Oct 29, 2014 · 31 comments

abieler commented Oct 29, 2014

I have a problem when saving an array of composite types to disk, in my case an array with elements of type Cell (see below):

using HDF5, JLD

type Cell
  origin::Array{Float64,1}     # (3 x 1 array)
  halfSize::Array{Float64,1}   # (3 x 1 array)
  nodes::Array{Float64,2}      # (8 x 3 array)
  volume::Float64
  densities::Array{Float64,1}  # (8 x 1 array)
end

nCells = 10^6
allCells = Array(Cell, nCells)

for i = 1:nCells
  # dummy data; in the real code each Cell comes from the simulation
  allCells[i] = Cell(rand(3), rand(3), rand(8, 3), rand(), rand(8))
end

@save "allCells.jld" allCells

This writes the first ~60 MB to the file almost instantly, but then it gradually slows down to as little as 0.3 MB/s while there are still a couple of hundred MB left to write.
Am I missing something?

Andre


timholy commented Nov 2, 2014

I can't replicate this on my local machine. How are you timing the write? I just ran a 2nd julia process and put filesize in a loop, then plotted the results. It was perfectly linear for me.

What's your platform? (Mine's Kubuntu 14.04.)
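
(A minimal sketch of that kind of monitoring loop, run from a second Julia session; the file name, sampling interval, and duration here are assumptions:)

# Poll the file size once per second from a separate Julia process,
# recording (elapsed seconds, bytes written) pairs for plotting afterwards.
times = Float64[]
sizes = Int[]
t0 = time()
for k = 1:600                          # sample for up to 10 minutes
    push!(times, time() - t0)
    push!(sizes, filesize("allCells.jld"))
    sleep(1)
end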


abieler commented Nov 3, 2014

I did not actually time the script; I just visually monitored the file size in my file browser. After one hour (the file size was some 100 MB at that point) the file was still not completely written to disk, so I killed the process.
I am using Linux Mint 17 Xfce (running on an iMac).
I'll time it with a filesize loop tomorrow when I am back at my box at work.


timholy commented Nov 3, 2014

For me it was ~10 seconds (I didn't time it, that's just based on impression). I was a little dismayed it took that long. One hour does seem excessive 😄.

Since I can't reproduce the problem, I'm not sure how much I can help debug this. At a minimum you should try profiling it.


abieler commented Nov 3, 2014

I'll do that. Here is the link to download my original code; maybe you can find something that is off:

https://dl.dropboxusercontent.com/u/1910400/los.zip

newLOS.jl is the main script that should be executed.
Below is a graph of the first ~800 seconds of writing the file to disk (before I aborted). It looks linear for the first couple of seconds, then there is a pause for a few seconds; after that the slope is much flatter, with another pause at about 400 s, after which the slope is even flatter.

[Figure: file size written vs. time over the first ~800 s, showing pauses followed by progressively flatter slopes]


timholy commented Nov 14, 2014

Is this still a problem since you upgraded?


abieler commented Nov 14, 2014

Yes, that was the first thing I tried after the upgrade.. :)
Can you still not reproduce this behavior with the files from the download link?



timholy commented Dec 6, 2014

My laptop turns out not to be beefy enough to test the full dataset, but I did get linear behavior up to the first 10% of your dataset.

But I did look at your code. One challenge is that your data structures require a lot of references. Are all of your arrays of fixed size? At least for the purposes of I/O, consider converting into structures like

immutable CellIO
    origin_1::Float64
    origin_2::Float64
    origin_3::Float64
    halfsize_1::Float64
    ...
end

Those should speed writing by more than one order of magnitude, possibly two orders.
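
(To make the flattening concrete, here is a minimal sketch with only two of the Cell fields spelled out; the remaining fields would be expanded the same way. The CellIO layout and conversion function are illustrative, not part of JLD:)

# A reference-free version of Cell: only plain Float64 fields, so an
# Array{CellIO} can be written as one contiguous block. Only origin and
# volume are shown; halfSize, nodes, and densities would become scalar
# fields in the same way.
immutable CellIO
    origin_1::Float64
    origin_2::Float64
    origin_3::Float64
    volume::Float64
end

CellIO(c::Cell) = CellIO(c.origin[1], c.origin[2], c.origin[3], c.volume)

allCellsIO = [CellIO(c) for c in allCells]
@save "allCellsIO.jld" allCellsIO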

rleegates commented:

Hi Tim,

I am having a similar issue on OS X 10.10.1 and Julia 0.3.4. Saving a chunk of memory (~200 MB) to disk takes about 45 s. When I save a larger chunk (~700 MB), it takes longer than I am willing to wait (I aborted at ~20 min). The data is a single instance of a composite type holding arrays of various other composite-type instances (both immutable and mutable); basically, one could visualize it as a tree of type instances about 4-5 levels deep. I am not quite sure how to interpret your latest comment: from my understanding, arrays are of fixed size if they have been initialized with something akin to Array(MyType, 20).

Thanks,
Robert


timholy commented Jan 4, 2015

I certainly don't understand the nonlinearity, although it seems likely to be a consequence of garbage-collection.

Yes, I meant that they have consistent size. If they have consistent size, you could pack them before writing:

n = length(vector_of_arrays)
len = length(vector_of_arrays[1])
if all(x -> length(x) == len, vector_of_arrays)
    matrix_of_arrays = hcat(vector_of_arrays...)
    @write fid matrix_of_arrays  # will be much faster
else
    error("oops, not all the same size")
end
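
(The read side then just splits the columns again; a short sketch, assuming the stored matrix was built with the hcat packing above:)

# Reverse the packing: each column of the stored matrix becomes one vector.
@read fid matrix_of_arrays
vector_of_arrays = [matrix_of_arrays[:, i] for i = 1:size(matrix_of_arrays, 2)]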

rleegates commented:

Hi Tim,

Thanks, however I don't think packing would be sensible in my case. I would like to be able to restore the saved data without too much hassle, and packing would complicate things considerably.

Well, we've all run into the GC... However, I wonder why creating the data (without much preallocation) is so much faster than saving it to disk.

On another note: any chance we'll see some manual memory management in Julia? The GC has gotten in the way a lot for me, necessitating all kinds of in-place operations and preallocation in places where it really obstructs code readability.

Best,
Robert

simonster commented:

I would guess the non-linearity has something to do with how HDF5 selects the heap size. There is a way to set libhdf5 to write in a format with a fractal heap (only supported by libhdf5 1.8+). However, for data that doesn't run into heap size issues it was slower in my testing. I'll look into that tonight.


timholy commented Jan 4, 2015

I guess I'm proposing at least testing to find out whether packing helps. If it does, then we're facing a limitation of libhdf5, not Julia, and it makes sense to try to work around it. @simonster's suggestion is probably worth exploring first. If it is a limitation of libhdf5, one can make the packing happen behind the scenes; if this proves to be the case, I should revise #191 and then write up some docs.

The GC is slated to get a lot better: JuliaLang/julia#8699

simonster commented:

See if uncommenting this line makes a difference. That should tell libhdf5 to use the newer format.


abieler commented Feb 5, 2015

So, uncommenting the above line did not fix my problem. I then tried lower-level functions for writing/loading the data, with the code shown below. This works, but it is much slower than generating cellList from an ASCII file read (it takes about 10 minutes to read and write).
Any hints on improving performance with this approach?

# writing data to disk
file = h5open("cellList.hdf5", "w")
i = 1
for cell in cellList
  groupName = "Cell_" * string(i)
  grp = g_create(file, groupName)
  write(grp, "origin", cell.origin)
  write(grp, "halfSize", cell.halfSize)
  write(grp, "nodes", cell.nodes)
  write(grp, "volume", cell.volume)
  write(grp, "densities", cell.densities)
  if i % 10000 == 0
    println(i)
  end
  i += 1
end
close(file)

# reading data from disk
file = h5open("cellList.hdf5", "r")
cellList = Array(Cell, nCells)
for i in 1:nCells
  cellName = "Cell_$i"
  grp = file[cellName]
  origin = read(grp, "origin")
  halfSize = read(grp, "halfSize")
  nodes = read(grp, "nodes")
  volume = read(grp, "volume")
  densities = read(grp, "densities")
  cellList[i] = Cell(origin, halfSize, nodes, volume, densities)
  if i % 10000 == 0
    println(i)
  end
end
close(file)


abieler commented Feb 5, 2015

OK, saving with the following method brings the file size down from 3.1 GiB to 350 MiB.

file = h5open("cellList2.hdf5", "w")
grp = g_create(file, "CellList")
origin = d_create(grp, "origin", datatype(Float64), dataspace(3,nCells))
halfSize = d_create(grp, "halfSize", datatype(Float64), dataspace(3,nCells))
nodes = d_create(grp, "nodes", datatype(Float64), dataspace(8,3,nCells))
volume = d_create(grp, "volume", datatype(Float64), dataspace(1,nCells))
densities = d_create(grp, "densities", datatype(Float64), dataspace(8,nCells))
i = 1
for cell in cellList
  origin[:,i] = cell.origin
  halfSize[:,i] = cell.halfSize
  nodes[:,:,i] = cell.nodes
  volume[1,i] = cell.volume
  densities[:,i] = cell.densities
  if i%10000 == 0
    println(i)
  end
  i += 1
end
close(file)

But creating the array of Cells still takes a lot of time:

file = h5open("cellList2.hdf5", "r")
origin = file["CellList/origin"]
halfSize = file["CellList/halfSize"]
nodes = file["CellList/nodes"]
volume = file["CellList/volume"]
densities = file["CellList/densities"]

for i = 1:nCells
  println(size(volume[:,i]))
  #cell = Cell(vec(origin[:,i]), vec(halfSize[:,i]), reshape(nodes[:,:,i], (8,3)), volume[1,i], vec(densities[:,i]))
  cell = Cell(vec(origin[:,i]), vec(halfSize[:,i]), reshape(nodes[:,:,i], (8,3)), 0.0, vec(densities[:,i]))
  cellList[i] = cell
end
close(file)

I don't know if there is a way to speed this up. But one thing that puzzles me is the volume variable. This works:
volume = d_create(grp, "volume", datatype(Float64), dataspace(1,nCells))
but why does this not work
volume = d_create(grp, "volume", datatype(Float64), dataspace(nCells))
or this
volume = d_create(grp, "volume", datatype(Float64), dataspace(nCells,))
for a 1-d array?

Further, with the current dataspace(1,nCells) version, if I read a value back like
volume = file["CellList/volume"]
then volume[1,1] has shape (1,1), but I expect simply a Float64.

Sorry for the flurry of questions/confusion...
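
(One likely reason the rebuild above is slow is that every origin[:,i], nodes[:,:,i], etc. triggers a separate partial read from the file. A minimal sketch, assuming the cellList2.hdf5 layout created above and the Cell type and nCells defined earlier, that instead reads each dataset into memory once and then slices the in-memory arrays:)

# Read each dataset fully into memory with one call per dataset,
# then build the Cell objects from the in-memory arrays.
file = h5open("cellList2.hdf5", "r")
origin    = read(file, "CellList/origin")     # 3 x nCells
halfSize  = read(file, "CellList/halfSize")   # 3 x nCells
nodes     = read(file, "CellList/nodes")      # 8 x 3 x nCells
volume    = read(file, "CellList/volume")     # 1 x nCells
densities = read(file, "CellList/densities")  # 8 x nCells
close(file)

cellList = Array(Cell, nCells)
for i = 1:nCells
  cellList[i] = Cell(origin[:,i], halfSize[:,i], nodes[:,:,i], volume[1,i], densities[:,i])
end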


abieler commented Feb 6, 2015

I could finally solve my problem, though it is really more of a workaround. The solution is, instead of storing an array of Cell types in HDF5, to save just the Cell data as plain arrays in HDF5, load these, and rebuild the Cell array from that data; so, something along the lines of what I tried above.


timholy commented Feb 6, 2015

#191 aims at making such workarounds more convenient (but you still have to figure out the serialization yourself).


timbitz commented May 13, 2015

Hi Tim,
I am also having major speed problems that increase as a function of size. I have a large Dict{K,Dict{K,V}} that I generate, which can take up to 30 GB in memory and which I need to save to file and be able to re-use. I was hoping to use HDF5, which on my test runs with smaller input files/structures (< 1 GB) worked well and was fast enough not to notice a problem. However, at full size it has been writing the structure to disk for over 24 hours... Any ideas? -Tim


rened commented May 13, 2015

@timbitz apart from the overall size, how many keys does the Dict have, i.e. how many entries do you have in an HDF5 group? I noticed that up to roughly 1000 entries everything is snappy, but performance degrades quickly after that. A friend had similar 24-hour problems with 100,000 entries; a workaround was to save them in different subgroups, i.e. in /1/000 to /1/999, and so on up to /100/000.

@timholy I don't know if this is an inherent property of HDF5 or whether there is an additional operation in HDF5.jl that might make it accidentally O(N^2) or something like that? (I am just brainstorming here.)
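
(A minimal sketch of that subgrouping workaround, with illustrative file/group names and dummy Float64 data; the point is just to keep each HDF5 group below roughly 1000 entries:)

using HDF5

# Dummy data: 100000 (key, vector) pairs standing in for the real entries.
data = [("key_$(i)", rand(10)) for i = 1:100000]

h5open("chunked.h5", "w") do file
    nchunks = div(length(data) - 1, 1000) + 1
    groups = [g_create(file, "chunk_$(c)") for c = 1:nchunks]
    for (i, (key, value)) in enumerate(data)
        # entry i goes into group div(i-1, 1000) + 1, so no group exceeds 1000 entries
        write(groups[div(i - 1, 1000) + 1], key, value)
    end
end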


timholy commented May 13, 2015

This is an interesting case: for the problems above, I suggested a custom serializer. But we already use a custom serializer for Dicts.

Here's a guess: could you have any cyclic references? See #85.


timbitz commented May 13, 2015

@timholy It's only Dict{ASCIIString,Dict{ASCIIString,Char}}.

@rened Yeah, it has tons of keys, probably upwards of 100M. After 4 GB written, the save function is proceeding at roughly 30 KB/s, with loads of unused resources on our compute cluster...

I can probably just print/load as gzipped text. I just did not expect that to be a faster or more efficient route.


timholy commented May 13, 2015

Wow, that's terrible performance. Definitely something wrong. I would be curious about your libhdf5 version, mine's

julia> HDF5._libversion
(0x00000001,0x00000008,0x0000000b)

(i.e., 1.8.11). We've had some reports about problems with 1.8.13, if memory serves.

Also, if you're willing to dig further, profiling could be informative; it would be interesting to know whether the time is spent in Julia code or in libhdf5.
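
(A minimal sketch of that profiling step, assuming the ProfileView package is installed; the dummy dictionary and file name are placeholders:)

using JLD, ProfileView

# Dummy data standing in for the structure that is slow to save.
some_large_dict = Dict{ASCIIString,Vector{Float64}}()
for i = 1:10000
    some_large_dict[string(i)] = rand(100)
end

Profile.clear()
@profile save("profile_test.jld", "data", some_large_dict)
ProfileView.view(C=true)   # C=true includes time spent in C libraries such as libhdf5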


timbitz commented May 13, 2015

Hi Tim, Looks like I'm only on 1.8.5
I will try to take a deeper look.
Cheers, -t


adham commented Apr 13, 2016

I also have the same problem. After saving 1.5 GB, the JLD.save speed is down to 1 KB/s. What I am saving to disk is

Dict{Int, Vector{Vector{Vector{Int}}}}

and my libhdf5 version is 1.8.14

Any way to solve it?


timholy commented Apr 13, 2016

Above I suggested profiling to see where the bottleneck is (you can visualize the result with ProfileView.view(C=true) to see how much time is spent in the libhdf5 C library), and until someone follows up on that suggestion there is no hope of fixing this. (When I tried, I couldn't replicate these problems on my machine.)

As far as a workaround, this should be easily amenable to a custom serializer. All of these problems seem to crop up once you have a large number of "objects", and suggest that the problem lies in libhdf5 itself. (Profiling might confirm or deny that.)

If you go this route, the trick is to pack that Vector{Vector{Vector{Int}}} into a single VecVecVec{Int} object (so libhdf5 has far fewer objects to deal with), and to automatically unpack it when you read it back out again. Assuming the vectors aren't all the same length, you can still stuff the whole thing into a single Vector{Int} by storing the lengths along with the values: the first entry is n = length(vvv), entries 2:n+1 are [length(vvv[i]) for i = 1:n], then come the lengths of the next level down, and so on, finally followed by all the values of the innermost vectors.
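
(A minimal sketch of one way to do that packing; for simplicity this variant stores each vector's length immediately before its contents rather than grouping all the lengths up front, and the pack/unpack names are illustrative:)

# Pack a Vector{Vector{Vector{Int}}} into one flat Vector{Int}, so libhdf5
# sees a single large dataset instead of millions of small objects.
# Layout: each (sub)vector is stored as its length followed by its contents.
function pack(vvv::Vector{Vector{Vector{Int}}})
    buf = Int[]
    push!(buf, length(vvv))
    for vv in vvv
        push!(buf, length(vv))
        for v in vv
            push!(buf, length(v))
            append!(buf, v)
        end
    end
    return buf
end

# Reverse the packing, reading lengths and contents back in the same order.
function unpack(buf::Vector{Int})
    pos = 1
    n = buf[pos]; pos += 1
    vvv = Array(Vector{Vector{Int}}, n)
    for i = 1:n
        m = buf[pos]; pos += 1
        vvv[i] = Array(Vector{Int}, m)
        for j = 1:m
            len = buf[pos]; pos += 1
            vvv[i][j] = buf[pos:pos+len-1]
            pos += len
        end
    end
    return vvv
end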

eschnett commented:

The HDF5 file format may (or may not) also be a culprit here. If you generate HDF5 files with groups that have many (10,000?) entries, then HDF5 1.6 will generate an internal file structure that is slow. If you are using HDF5 1.8 (as you do), then you need to tell it to forgo backward compatibility and use a faster file layout. You do this when you open the file:

h5open(filename, "w", "libver_bounds",
       (HDF5.H5F_LIBVER_LATEST, HDF5.H5F_LIBVER_LATEST)) do fid
    # ... write to the file here ...
end

Obviously there are converters that convert to older file formats if necessary for some applications.



timholy commented Jul 7, 2016

Evidence in #65 that newer versions of libhdf5 likely fix this problem.


abieler commented Jul 7, 2016

Is there a specific version of libhdf5 or HDF5.jl that I can check out and run my tests from above against, to see whether it helps in my case?


timholy commented Jul 7, 2016

If you have at least 1.8, try the "libver_bounds" trick a couple of posts up.


timholy commented Jul 8, 2016

Looks like HDF5 1.10 also makes this the default. https://hdfgroup.org/wp/2015/05/whats-coming-in-the-hdf5-1-10-0-release/


musm commented Dec 7, 2020

I haven't read this issue in detail, but it's 4 years old. If it's still a problem, please feel free to open a new issue, but for now I think it makes sense to close this one.

musm closed this as completed Dec 7, 2020