WIP: Write Julia types as HDF5 compound types #132

Merged · 30 commits merged into master from sjk/compound_types on Aug 25, 2014

Conversation

simonster
Member

This is a major change to the serialization format of JLD. Rather than writing Julia types as arrays of references, we translate Julia types to HDF5 compound types and write those. Following the discussion in #27, everything Julia would store inline is presently stored inline in the compound type, and everything Julia would store as a pointer to another object is stored as an HDF5 reference to another object. This PR also incorporates the changes from #102.
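For illustration, here is a hedged sketch (not code from this PR) of how the scheme described above would treat a simple pair of types; the type names are made up for the example:

immutable Point
    x::Float64          # isbits field: Julia stores this inline, so it becomes an
    y::Float64          # inline Float64 member of the HDF5 compound type
end

type Measurement
    p::Point            # isbits immutable field: stored inline as a nested compound member
    label::ASCIIString  # stored as a pointer by Julia, so JLD writes an HDF5 reference
end                     # to a separately stored dataset

Writing such an object with jldopen/write (as in the benchmarks later in this thread) then produces a single compound dataset plus one referenced dataset for the string.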

This presently passes all read and write tests, but there is plenty more work to be done:

  • We cannot yet read JLD files created with earlier versions of JLD or convert them to the new format.
  • There is no validation to ensure that HDF5 types match Julia types. Thus, if a Julia type changes, JLD will probably segfault on readout.
  • There is no ability to read data if the type is not available in the current workspace. The structure of the compound type is sufficient to reconstruct the Julia type hierarchy (except for the types of reference fields) but this is not yet implemented.
  • The rootmodule keyword argument to write is not yet supported. This is a bit painful to deal with if the user decides to write the same type with and without the rootmodule keyword, since we would need to create two distinct compound types. @timholy, can you explain the use case for this so I can think about whether there's an easier design that could accomplish the same ends?
  • This code has a lot more cases to handle than the old code and it is unlikely that the existing tests cover all of them.
  • Adapt getindex/setindex! methods. These are missing tests.
  • Reimplement dump.
  • Delete datasets when no more references to them remain. (Maybe by some means analogous to mark and sweep?)
  • In principle, for pointerfree Julia types, HDF5 compound types could be generated with the same padding so that they exactly match the Julia type. This would avoid a copy when reading/writing and also allow mmapping arrays directly from the file, at the cost of saving the padding to the file.
  • In some cases where Julia stores fields as pointers to objects, we may want to store the contents inline in JLD. Two such cases are fields that are non-pointerfree immutables and fields that are leaf type tuples. The main complication to storing such fields inline in JLD is that they could be undefined.
  • It's possible that we could register our conversion functions with libhdf5 instead of calling them ourselves. This might result in less memory usage, although I'm not sure how it would interact with garbage collection.
  • I have to fix some things for Julia 0.2 if we want to maintain compatibility.

Implementation notes:

  • Compound types are committed with numeric indices in the _types group, with a julia type attribute that indicates the corresponding Julia type. Because libhdf5 loses track of compound type hierarchy between closing and reopening a file, numeric suffixes indicate the corresponding compound type index for a field that is itself a compound type.
  • Code to convert between a given Julia type and the corresponding HDF5 compound type (and vice versa) is dynamically generated; a sketch of what this looks like follows this list. This should be very efficient after warmup. The code that does this needs to be inspected very carefully, since it does some somewhat unclean things, especially for non-pointerfree immutables.
  • Empty types/immutables and arbitrary bitstypes are stored as HDF5 opaque types. A special case throws an error when a pointer is encountered. As a result, most write methods now throw an error when they encounter a pointer rather than displaying a warning. The variant of @save that saves all top-level variables catches the error and displays a warning instead.
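As a hedged illustration of the generated conversion code mentioned above (h5convert! is the name that appears in later commits on this branch; the offsets and the write_ref helper are assumptions made for this example, using the hypothetical Measurement type from the example above):

function h5convert!(out::Ptr{Uint8}, file::JldFile, x::Measurement)
    # inline isbits field: copy directly into the output buffer at its offset
    unsafe_store!(convert(Ptr{Point}, out), x.p)
    # pointer field: write the string as its own dataset and store a reference to it
    ref = write_ref(file, x.label)                                # hypothetical helper
    unsafe_store!(convert(Ptr{HDF5ReferenceObj}, out + 16), ref)  # assumed offset of 16 bytes
end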

@simonster changed the title from "Write Julia types as HDF5 compound types" to "WIP: Write Julia types as HDF5 compound types" on Aug 13, 2014
@timholy
Member

timholy commented Aug 13, 2014

Oh, my!

We cannot yet read JLD files created with earlier versions of JLD or convert them to the new format.

If plain.jl is at least backwards-compatible, then could we just duplicate the whole old jld.jl file and call it JLDOld? (not exported, naturally)

There is no validation to ensure that HDF5 types match Julia types. Thus, if a Julia type changes, JLD will probably segfault on readout.

I could have sworn that once I had (or was working on) a facility to go through and check the types when the file was first opened. That way you'd only have to pay the price once, rather than upon reading each object. I've checked every branch I can find, and no trace of it. So maybe I imagined it.

I don't think this issue should block merging this feature, but certainly it's got to go on the TODO list.

There is no ability to read data if the type is not available in the current workspace. The structure of the compound type is sufficient to reconstruct the Julia type hierarchy (except for the types of reference fields) but this is not yet implemented.

I'm a bit leery of constructing the types because it seems that then you could get into a situation where you have no method(::MyType, ...) and yet you have such a method for a type that shares the same name, but is not really the same type object. But maybe this would be safe. Anyway, again I'm fine with it like this. That's basically what addrequire is about.

The rootmodule keyword argument to write is not yet supported. This is a bit painful to deal with if the user decides to write the same type with and without the rootmodule keyword, since we would need to create two distinct compound types. @timholy, can you explain the use case for this so I can think about whether there's an easier design that could accomplish the same ends?

If memory serves, its whole purpose is basically to allow me to write this test. If there's another good way to do that, so much the better. I assume you saw this documentation as well.

In principle, for pointerfree Julia types, HDF5 compound types could be generated with the same padding so that they exactly match the Julia type. This would avoid a copy when reading/writing and also allow mmapping arrays directly from the file, at the cost of saving the padding to the file.

That would be so awesome, I wouldn't know what to say. But would that be portable across machines? If we had to, I guess we could test it upon file opening? However, I don't know how to get the alignment of fields in a Julia type.

In some cases where Julia stores fields as pointers to objects, we may want to store the contents inline in JLD. Two such cases are fields that are non-pointerfree immutables and fields that are leaf type tuples. The main complication to storing such fields inline in JLD is that they could be undefined.

That would just be showing off 😄.

I have to fix some things for Julia 0.2 if we want to maintain compatibility.

I'm fine with just "freezing" 0.2 at a particular commit. Since PkgEvaluator will shortly not even be testing 0.2, I think that's the safest procedure anyway.

The implementation sounds awesome. I'll look over the code next.

@simonster
Member Author

We cannot yet read JLD files created with earlier versions of JLD or convert them to the new format.

If plain.jl is at least backwards-compatible, then could we just duplicate the whole old jld.jl file and call it JLDOld? (not exported, naturally)

I think this is a good idea. Maybe we can share the parts of the code not related to reading/writing, but everything else is basically a rewrite anyway.

There is no validation to ensure that HDF5 types match Julia types. Thus, if a Julia type changes, JLD will probably segfault on readout.

I could have sworn that once I had (or was working on) a facility to go through and check the types when the file was first opened. That way you'd only have to pay the price once, rather than upon reading each object. I've checked every branch I can find, and no trace of it. So maybe I imagined it.

We already only create the conversion functions once for each type, so this code can just go where that happens (in jldatatype). The simplest thing to do would just be to do the mapping from Julia to HDF5 type (without committing the HDF5 type) and then check that the two match.
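A minimal sketch of what that check could look like in jldatatype (the helper names here are assumptions, not this branch's API):

function check_compatible(f::JldFile, T::DataType, dtype_in_file)
    # Rebuild the HDF5 compound type from the current Julia definition of T,
    # without committing it to the file...
    expected = h5type_uncommitted(f, T)          # assumed builder
    # ...and compare it with the committed type read back from the file,
    # e.g. via libhdf5's H5Tequal.
    if !h5types_equal(expected, dtype_in_file)   # assumed comparison wrapper
        error("stored HDF5 datatype for $T does not match its current Julia definition")
    end
end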

I don't think this issue should block merging this feature, but certainly it's got to go on the TODO list.

I think we should have this before merging. Segfaulting is bad to begin with, but it's also possible that this can be exploited for code execution.

There is no ability to read data if the type is not available in the current workspace. The structure of the compound type is sufficient to reconstruct the Julia type hierarchy (except for the types of reference fields) but this is not yet implemented.

I'm a bit leery of constructing the types because it seems that then you could get into a situation where you have no method(::MyType, ...) and yet you have such a method for a type that shares the same name, but is not really the same type object. But maybe this would be safe. Anyway, again I'm fine with it like this. That's basically what addrequire is about.

I'd like to implement some way to read data without the types before merging, to avoid catastrophe in the case where you have a data file but not the code and so you can no longer get the data out. (I don't currently have anything like readsafely in this branch.) Dispatch isn't going to work no matter what we return, because the methods aren't there. It's probably comparably difficult to do either the type-based approach or the Dict approach. I don't have a strong opinion on this.

The rootmodule keyword argument to write is not yet supported. This is a bit painful to deal with if the user decides to write the same type with and without the rootmodule keyword, since we would need to create two distinct compound types. @timholy, can you explain the use case for this so I can think about whether there's an easier design that could accomplish the same ends?

If memory serves, its whole purpose is basically to allow me to write this test. If there's another good way to do that, so much the better. I assume you saw this documentation as well.

It would be reasonably easy to add a facility to register a specific type so that its module path (and the module paths of the types it contains) is truncated on write. Would that suffice?
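One hedged shape such a facility could take (everything here is hypothetical, not this branch's API; full_typename is the helper mentioned later in this thread that builds the type name, shown here without type parameters):

const truncate_module_path = Set{DataType}()
register_path_truncation(T::DataType) = push!(truncate_module_path, T)

function full_typename(io::IO, T::DataType)
    if !(T in truncate_module_path)
        print_module_prefix(io, T.name.module)   # hypothetical: walk up the module chain to Main
    end
    print(io, T.name.name)                       # type parameters omitted in this sketch
end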

In principle, for pointerfree Julia types, HDF5 compound types could be generated with the same padding so that they exactly match the Julia type. This would avoid a copy when reading/writing and also allow mmapping arrays directly from the file, at the cost of saving the padding to the file.

That would be so awesome, I wouldn't know what to say. But would that be portable across machines? If we had to, I guess we could test it upon file opening? However, I don't know how to get the alignment of fields in a Julia type.

fieldoffsets (which I implemented in Base when I first attempted #27) does this. I think alignment is the same for all Julia-supported types on all 64-bit x86 systems, but maybe not for doubles on 32-bit. We could implement conversion or libhdf5 might be able to do it for us, although we can obviously only mmap if the layout is the same. Eventually we may also need endianness conversion, but that's probably not worth worrying about until Julia runs on a big endian architecture.
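For reference, a small example of what fieldoffsets reports (offsets shown are for x86-64 and are illustrative, not guaranteed):

immutable Pixel
    x::Int32
    v::Float64
end

fieldoffsets(Pixel)   # => [0, 8]: v is aligned to an 8-byte boundary
sizeof(Pixel)         # => 16, i.e. 4 bytes of padding after x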

In some cases where Julia stores fields as pointers to objects, we may want to store the contents inline in JLD. Two such cases are fields that are non-pointerfree immutables and fields that are leaf type tuples. The main complication to storing such fields inline in JLD is that they could be undefined.

That would just be showing off 😄.

Thinking more about this, we may also have a problem if there are null string fields or undefs in a string array for a similar reason.

@timholy
Member

timholy commented Aug 13, 2014

The simplest thing to do would just be to do the mapping from Julia to HDF5 type (without committing the HDF5 type) and then check that the two match.

Works for me.

I think we should have this before merging. Segfaulting is bad to begin with, but it's also possible that this can be exploited for code execution.

OK, definitely sounds important.

It's probably comparably difficult to do either the type-based approach or the Dict approach. I don't have a strong opinion on this.

If you prefer the type-based approach, that's fine. I agree that it won't work without user help no matter what we choose.

It would be reasonably easy to add a facility to register a specific type so that its module path (and the module paths of the types it contains) is truncated on write. Would that suffice?

Yes

fieldoffsets (which I implemented in Base when I first attempted #27) does this.

Oh, awesome! Since libhdf5 will implement the conversion for us when we can't mmap, this sounds ideal.

Also convert full_typename to use IOBuffer, to avoid some allocations
DataArrays should probably not be trying to perform a conversion if
the type is the same, but we can trivially avoid calling convert in the
first place.
@mbauman mentioned this pull request Aug 14, 2014
Avoid boxing due to pointer instability. This branch is now beating
Base.serialize by >2x on the test case from
JuliaData/DataFrames.jl#667
We will almost certainly need a different strategy here if we want to
perform within one order of magnitude of serialize
@simonster
Member Author

So, I did some benchmarking of this branch. The good news is that we appear to be faster than serialize/deserialize and allocate less memory for arrays of strings and numeric data, sometimes by a significant margin, and not much slower for immutables. (We can probably make up the difference if I implement saving with the struct layout.) The files we generate are also generally not enormous. The bad news is that we are >30x slower than serialize at saving arrays of small arrays, and a decent proportion of that time is spent in libhdf5. I may need to resurrect the array of arrays optimization from 3a95493. But first I'll work on items 2 and 3 on the list above.

cc @jiahao and @jakebolewski, in case you're interested. (Sometime I should get you to give me a julia.mit.edu account so I can test on your datasets.)

@timholy
Member

timholy commented Aug 15, 2014

Just checking in to say I'm sorry this is taking me so long to get to. I am pretty swamped with coding that my lab needs done ASAP, and this PR is a big body of work. But I'll try to give this a serious review over the weekend.

@simonster
Member Author

No worries. It's also fine with me if you wait until I finish items 2 and 3 before reviewing; I don't yet know how big those changes will be.

@simonster
Member Author

And now for a stupid libhdf5 performance issue: reading out an array of objects is >200x slower after the file is closed and re-opened again. Before closing the file, I can read 1000 items in 10 ms. After closing the file, it takes 2 ms to read each item. It seems that getting the name of the datatype for each object takes immeasurably (at least thousands of times) longer. It may be time to take a trip to the libhdf5 source.

@timholy
Member

timholy commented Aug 15, 2014

Oh, wow. That's frightening.

@jakebolewski

Are you purging the Linux buffer cache when benchmarking? Something has to be caching the writes.


@jakebolewski

This seems to be a huge improvement!

For one of the examples mentioned in the serialization thread, these are the times I'm getting:

julia> df = @time begin
       fh = open("labevents.jls")
       deserialize(fh)
       end;
elapsed time: 22.488775592 seconds (1326551288 bytes allocated, 7.86% gc time)

julia> @time begin
       fh = open("test.jls", "w")
       serialize(fh, df)
       end;
elapsed time: 28.523595344 seconds (748371396 bytes allocated, 35.53% gc time)

# This pull request
julia> @time jldopen("test.jld", "w") do io
       write(io, "test", df)
       end
elapsed time: 18.036020153 seconds (324413684 bytes allocated, 42.59% gc time)

julia> @time jldopen("test.jld", "r") do io
       read(io, "test")
       end;
elapsed time: 15.344303129 seconds (1521609372 bytes allocated, 48.61% gc time)

julia> map(n -> typeof(df[n]), names(df))
9-element Array{DataType,1}:
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{DateTime,1}   
 DataArray{UTF16String,1}
 DataArray{Float64,1}    
 DataArray{UTF16String,1}
 DataArray{UTF16String,1}

julia> nrow(df)
3740682

# Master 
# ...this takes so long it is not really worth comparing (10+ minutes)

Another example:

julia> using HDF5, JLD

julia> df = @time begin
       fh = open("ioevents.jls", "r")
       deserialize(fh)
       end;
elapsed time: 20.578164919 seconds (1214042696 bytes allocated, 6.45% gc time)

julia> @time begin
       fh = open("test.jls", "w")
       serialize(fh, df)
       end;
elapsed time: 15.59771594 seconds (662881924 bytes allocated)

julia> @time jldopen("test.jld", "w") do io
       write(io, "test", df)
       end
elapsed time: 11.546232497 seconds (308564552 bytes allocated)

julia> df = @time jldopen("test.jld", "r") do io
       read(io, "test")
       end;
elapsed time: 12.905426358 seconds (1470065908 bytes allocated, 43.48% gc time)

julia> map(n -> typeof(df[n]), names(df))
16-element Array{DataType,1}:
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{DateTime,1}   
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{DateTime,1}   
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{Float64,1}    
 DataArray{UTF16String,1}
 DataArray{Float64,1}    
 DataArray{UTF16String,1}
 DataArray{Float64,1}    
 DataArray{UTF16String,1}
 DataArray{UTF16String,1}

julia> nrow(df)
2471191

Use H5Oget_info to get type address instead of getting type name (which
apparently requires a search through all objects in the file). Also use
global buffers in some places instead of allocating on every call.
When we read from a file, we don't want to have to define h5convert!
for the type we're reading, since we won't use it, but it's possible
that we will read a type from a file and then write that type. In that
case, an entry would exist in the jlh5type dict for the type, but we
can't know to define h5convert!. So instead, we define h5convert! on
write.
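A minimal sketch of that define-on-write strategy, reusing the jlh5type dict mentioned above (the other names are assumptions for illustration):

const jlh5type = Dict{DataType,Any}()        # Julia type => committed JLD datatype
const has_h5convert = Set{DataType}()        # types whose h5convert! has been generated

function h5type_for_write(f, T::DataType)
    dtype = get!(jlh5type, T) do
        commit_compound_type(f, T)           # hypothetical: build and commit the HDF5 type
    end
    if !(T in has_h5convert)
        gen_h5convert(T)                     # hypothetical: @eval the generated h5convert! method
        push!(has_h5convert, T)
    end
    dtype
end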
I decided to do this rather than read the data as a Dict because 1) it
is a bit easier and 2) reading the data as a Dict could be extremely
slow if there is a lot of data to be read.

@mbauman
Member

mbauman commented Aug 21, 2014

Works in my testing now. Wonderful! 👍

(To be clear, I think treating ByteStrings like they're immutable is just fine. I was just trying a bunch of different code paths. So don't worry about that case.)

- Use H5Pset_create_intermediate_group to create intermediate groups,
  instead of doing this ourselves.
- Store small datasets in compact format.
- Avoid looking up created datasets to create references to them.

Also implement a few more libhdf5 functions, most of which don't seem
to make a difference to performance.

@simonster
Member Author

Any last comments before I pull the trigger? 😄

@timholy
Member

timholy commented Aug 25, 2014

Only one: go for it!

This is a really awesome advance. I feel like all the packages I use heavily need/are getting massive facelifts.

When we tag it, I think we should bump the minor version. I know I did that recently, but this is a pretty big change.

simonster added a commit that referenced this pull request Aug 25, 2014
WIP: Write Julia types as HDF5 compound types
@simonster merged commit 4468fa4 into master Aug 25, 2014
@simonster deleted the sjk/compound_types branch on August 25, 2014 16:20
@timholy
Member

timholy commented Aug 25, 2014

Yay! 🍰

@timholy
Member

timholy commented Aug 25, 2014

Should we encourage folks to test on master for a little bit, then tag a new version? I'd be happy to send an email to julia-users advertising your work! (And of course you should feel free to do so yourself, it's just that it's easier to brag on someone else's behalf 😄.)

@simonster
Member Author

Sounds good. I'm always happy to have someone else brag for me :bowtie:

@jakebolewski

👍 awesome work!

@tkelman
Contributor

tkelman commented Aug 25, 2014

You asked for testers: Pkg.test("HDF5") causes a segfault and crashes Julia on Win32 (https://ci.appveyor.com/project/tkelman/hdf5-jl/build/job/06jf23fjv074321i) and Win64 (https://ci.appveyor.com/project/tkelman/hdf5-jl/build/job/jj3u0vnnygn55n69).

Note that there's an easy-to-fix 32-bit bug here: tkelman@c5ddb77

@timholy
Member

timholy commented Aug 25, 2014

Thanks, @tkelman, this is exactly the kind of feedback we need.

First, do you know whether older versions passed?

Second, with AppVeyor, can you get a remote login? With Travis, this works well (but you have to ask via email). To me it looks like one of the ccalls is barfing, and that would be easier to debug interactively. If that's not possible, then perhaps inserting printlns in places like this would be the best approach.

How do you submit a job to AppVeyor, anyway?

@tkelman
Contributor

tkelman commented Aug 25, 2014

Older versions had trouble with mmap'ing on Windows (#89), but IIRC would pass everything else.

I get crashes locally too; the AppVeyor logs are just easier to share than copying a big gist from my terminal. I haven't tried asking if we can get a remote login, but it might be possible. I can open a PR with the appveyor.yml configuration file and instructions for how to turn it on. It runs on a webhook, and you can manually trigger builds through their UI as well.

@timholy
Member

timholy commented Aug 25, 2014

That would be great; I look forward to the PR. I can probably contact them by Wednesday if the fix isn't relatively straightforward.

@simonster
Member Author

The segfault is in gc and only happens on Windows, but I can reproduce it in a VM. My testing indicates that the segfault happens when Julia tries to free an object that was malloc'd by libhdf5. There is some discussion of this problem here that indicates that one solution is to make both Julia and libhdf5 use the same C runtime, and another is to free the memory with H5free_memory, which we could do but would require either copying or attaching a finalizer to every variable-length object coming out of libhdf5. So I guess my questions are 1) is the libhdf5 library we provide for Windows linked against a different runtime than Julia and 2) if so, can we link them against the same runtime? Also cc @ihnorton
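For reference, a hedged sketch of the second workaround, not code from this PR: wrap each buffer that libhdf5 malloc'd and free it through H5free_memory instead of the C runtime Julia links against. This assumes libhdf5 1.8.13 or newer (where H5free_memory was introduced); the wrapper type and library handle are placeholders:

h5_free(p::Ptr{Void}) = ccall((:H5free_memory, "libhdf5"), Cint, (Ptr{Void},), p)

type H5OwnedBuffer               # hypothetical wrapper for a buffer allocated by libhdf5
    ptr::Ptr{Void}
    function H5OwnedBuffer(p::Ptr{Void})
        buf = new(p)
        finalizer(buf, b -> h5_free(b.ptr))   # freed with libhdf5's allocator, not Julia's
        buf
    end
end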

@tkelman
Contributor

tkelman commented Aug 25, 2014

@timholy PR is #135
@simonster Excellent questions. It looks like the HDF5 library we are downloading is an MSVC one (note the lines "Extracting usr\lib\x86\msvcp100.dll" and "Extracting usr\lib\x86\msvcr100.dll"), so yes, we are using a different runtime than Julia. We can try finding or building our own HDF5 library using MinGW. I don't see one available on WinRPM, but this seems like a library that should be there.
