WIP: Write Julia types as HDF5 compound types #132
Conversation
Oh, my!
I could have sworn that once I had (or was working on) a facility to go through and check the types when the file was first opened. That way you'd only have to pay the price once, rather than upon reading each object. I've checked every branch I can find, and no trace of it. So maybe I imagined it. I don't think this issue should block merging this feature, but it certainly has to go on the TODO list.
I'm a bit leery of constructing the types because it seems that then you could get in a situation where you have
If memory serves, its whole purpose is basically to allow me to write this test. If there's another good way to do that, so much the better. I assume you saw this documentation as well.
That would be so awesome, I wouldn't know what to say. But would that be portable across machines? If we had to, I guess we could test it upon file opening? However, I don't know how to get the alignment of fields in a julia type.
That would just be showing off 😄.
I'm fine with just "freezing" 0.2 at a particular commit. Since PkgEvaluator will shortly not even be testing 0.2, I think that's the safest procedure anyway. The implementation sounds awesome. I'll look over the code next.
I think this is a good idea. Maybe we can share the parts of the code not related to reading/writing, but everything else is basically a rewrite anyway.
We already only create the conversion functions once for each type, so this code can just go where that happens (in
I think we should have this before merging. Segfaulting is bad to begin with, but it's also possible that this can be exploited for code execution.
I'd like to implement some way to read data without the types before merging, to avoid catastrophe in the case where you have a data file but not the code and so you can no longer get the data out. (I don't currently have anything like
It would be reasonably easy to add a facility to register a specific type so that its module path (and the module paths of the types it contains) is truncated on write. Would that suffice?
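A minimal sketch of how such a registration facility could look. None of these names exist in JLD; `register_truncated!` and `type_name_for_write` are hypothetical, and this uses modern Julia reflection (`parentmodule`, `nameof`) rather than the 2014-era API:

```julia
# Hypothetical registration facility: modules registered here would have
# their prefix dropped from the type name stored in the file on write.
const truncated_modules = Set{Module}()

register_truncated!(m::Module) = push!(truncated_modules, m)

function type_name_for_write(T::DataType)
    m = parentmodule(T)
    # Registered module: store the bare type name; otherwise keep the
    # fully qualified module path.
    m in truncated_modules ? string(nameof(T)) : string(m, ".", nameof(T))
end
```

The real facility would also need to recurse into field types, as the comment above notes.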
Thinking more about this, we may also have a problem if there are null string fields or undefs in a string array for a similar reason.
Works for me.
OK, definitely sounds important.
If you prefer the type-based approach, that's fine. I agree that it won't work without user help no matter what we choose.
Yes
Oh, awesome! Since libhdf5 will implement the conversion for us when we can't mmap, this sounds ideal.
Also convert full_typename to use IOBuffer, to avoid some allocations
DataArrays should probably not be trying to perform a conversion if the type is the same, but we can trivially avoid calling convert in the first place.
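One way to trivially avoid the call is with an extra dispatch-based pass-through method; this is an illustrative sketch (the helper name `maybe_convert` is hypothetical, not DataArrays' or JLD's actual code):

```julia
# When source and target types already match, return the value untouched;
# only fall back to `convert` when the types actually differ.
maybe_convert(::Type{T}, x::T) where {T} = x            # same type: no-op
maybe_convert(::Type{T}, x) where {T} = convert(T, x)   # otherwise convert
```

Here `maybe_convert(Vector{Int}, v)` returns `v` itself, while `maybe_convert(Float64, 1)` still performs the conversion.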
Avoid boxing due to pointer instability. This branch is now beating Base.serialize by >2x on the test case from JuliaData/DataFrames.jl#667
We will almost certainly need a different strategy here if we want to perform within one order of magnitude of serialize
So, I did some benchmarking of this branch. The good news is that we appear to be faster than cc @jiahao and @jakebolewski, in case you're interested. (Sometime I should get you to give me a julia.mit.edu account so I can test on your datasets.)
Just checking in to say I'm sorry this is taking me so long to get to. I am pretty swamped with coding that my lab needs done ASAP, and this PR is a big body of work. But I'll try to give this a serious review over the weekend.
No worries. It's also fine with me if you wait until I finish items 2 and 3 before reviewing; I don't yet know how big those changes will be.
And now for a stupid libhdf5 performance issue: reading out an array of objects is >200x slower after the file is closed and re-opened again. Before closing the file, I can read 1000 items in 10 ms. After closing the file, it takes 2 ms to read each item. It seems that getting the name of the datatype for each object takes immeasurably (at least thousands of times) longer. It may be time to take a trip to the libhdf5 source.
Oh, wow. That's frightening.
Are you purging the Linux buffer cache when benchmarking?
This seems to be a huge improvement! For one of the examples mentioned in the serialization thread, these are the times I'm getting:

```julia
julia> df = @time begin
           fh = open("labevents.jls")
           deserialize(fh)
       end;
elapsed time: 22.488775592 seconds (1326551288 bytes allocated, 7.86% gc time)

julia> @time begin
           fh = open("test.jls", "w")
           serialize(fh, df)
       end;
elapsed time: 28.523595344 seconds (748371396 bytes allocated, 35.53% gc time)

# This pull request
julia> @time jldopen("test.jld", "w") do io
           write(io, "test", df)
       end
elapsed time: 18.036020153 seconds (324413684 bytes allocated, 42.59% gc time)

julia> @time jldopen("test.jld", "r") do io
           read(io, "test")
       end;
elapsed time: 15.344303129 seconds (1521609372 bytes allocated, 48.61% gc time)

julia> map(n -> typeof(df[n]), names(df))
9-element Array{DataType,1}:
 DataArray{Int32,1}
 DataArray{Int32,1}
 DataArray{Int32,1}
 DataArray{Int32,1}
 DataArray{DateTime,1}
 DataArray{UTF16String,1}
 DataArray{Float64,1}
 DataArray{UTF16String,1}
 DataArray{UTF16String,1}

julia> nrow(df)
3740682

# Master
# ...this takes so long it is not really worth comparing (10+ minutes)
```

Another example:

```julia
julia> using HDF5, JLD

julia> df = @time begin
           fh = open("ioevents.jls", "r")
           deserialize(fh)
       end;
elapsed time: 20.578164919 seconds (1214042696 bytes allocated, 6.45% gc time)

julia> @time begin
           fh = open("test.jls", "w")
           serialize(fh, df)
       end;
elapsed time: 15.59771594 seconds (662881924 bytes allocated)

julia> @time jldopen("test.jld", "w") do io
           write(io, "test", df)
       end
elapsed time: 11.546232497 seconds (308564552 bytes allocated)

julia> df = @time jldopen("test.jld", "r") do io
           read(io, "test")
       end;
elapsed time: 12.905426358 seconds (1470065908 bytes allocated, 43.48% gc time)

julia> map(n -> typeof(df[n]), names(df))
16-element Array{DataType,1}:
 DataArray{Int32,1}
 DataArray{Int32,1}
 DataArray{Int32,1}
 DataArray{DateTime,1}
 DataArray{Int32,1}
 DataArray{Int32,1}
 DataArray{DateTime,1}
 DataArray{Int32,1}
 DataArray{Int32,1}
 DataArray{Float64,1}
 DataArray{UTF16String,1}
 DataArray{Float64,1}
 DataArray{UTF16String,1}
 DataArray{Float64,1}
 DataArray{UTF16String,1}
 DataArray{UTF16String,1}

julia> nrow(df)
2471191
```
Use H5Oget_info to get type address instead of getting type name (which apparently requires a search through all objects in the file). Also use global buffers in some places instead of allocating on every call.
When we read from a file, we don't want to have to define h5convert! for the type we're reading, since we won't use it, but it's possible that we will read a type from a file and then write that type. In that case, an entry would exist in the jlh5type dict for the type, but we can't know to define h5convert!. So instead, we define h5convert! on write.
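The "define the converter lazily, on first write" scheme described above can be sketched as follows. This is illustrative only, under assumed names (`writers`, `get_writer`); it is not JLD's actual code:

```julia
# Types read from a file get an entry in the type dict without a converter;
# the write-side converter is created only the first time that type is
# actually written, since a type that is only ever read never needs one.
const writers = IdDict{DataType,Function}()

function get_writer(::Type{T}) where {T}
    get!(writers, T) do
        # In the real code this is where h5convert! would be defined for T;
        # here a stand-in identity converter is returned instead.
        x -> x
    end
end
```

Because `get!` only runs its body on a cache miss, repeated writes of the same type reuse the converter created on the first write.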
I decided to do this rather than read the data as a Dict because 1) it is a bit easier and 2) reading the data as a Dict could be extremely slow if there is a lot of data to be read.
Works in my testing now. Wonderful! 👍 (To be clear, I think treating ByteStrings like they're immutable is just fine. I was just trying a bunch of different code paths. So don't worry about that case.)
- Use H5Pset_create_intermediate_group to create intermediate groups, instead of doing this ourselves.
- Store small datasets in compact format.
- Avoid looking up created datasets to create references to them.

Also implement a few more libhdf5 functions, most of which don't seem to make a difference to performance.
Any last comments before I pull the trigger? 😄
Only one: go for it! This is a really awesome advance. I feel like all the packages I use heavily need/are getting massive facelifts. When we tag it, I think we should bump the minor version. I know I did that recently, but this is a pretty big change. |
Yay! 🍰
Should we encourage folks to test on master for a little bit, then tag a new version? I'd be happy to send an email to julia-users advertising your work! (And of course you should feel free to do so yourself, it's just that it's easier to brag on someone else's behalf 😄.)
Sounds good. I'm always happy to have someone else brag for me.
👍 awesome work!
You asked for testers. Note that there's an easy-to-fix 32-bit bug here: tkelman@c5ddb77
Thanks, @tkelman, this is exactly the kind of feedback we need. First, do you know whether older versions passed? Second, with AppVeyor, can you get a remote login? With Travis, this works well (but you have to ask via email). To me it looks like one of the

How do you submit a job to AppVeyor, anyway?
Older versions had trouble with mmap'ing on Windows (#89), but IIRC would pass everything else. I get crashes locally too, the AppVeyor logs are just easier to share than copying a big gist from my terminal. I haven't tried asking if we can get a remote login, but it might be possible. I can open a PR with the appveyor.yml configuration file and instructions for how to turn it on. It runs on a webhook, and you can manually trigger builds through their UI as well.
That would be great; I look forward to the PR. I can probably contact them by Wednesday if the fix isn't relatively straightforward.
The segfault is in gc and only happens on Windows, but I can reproduce it in a VM. My testing indicates that the segfault happens when Julia tries to free an object that was malloc'd by libhdf5. There is some discussion of this problem here that indicates that one solution is to make both Julia and libhdf5 use the same C runtime, and another is to free the memory with
@timholy PR is #135 |
This is a major change to the serialization format of JLD. Rather than writing Julia types as arrays of references, we translate Julia types to HDF5 compound types and write those. Following the discussion in #27, everything Julia would store inline is presently stored inline in the compound type, and everything Julia would store as a pointer to another object is stored as an HDF5 reference to another object. This PR also incorporates the changes from #102.
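The inline-vs-reference rule described above can be illustrated with a small predicate. This is not JLD's actual implementation (which predates the `isbitstype` function, introduced in Julia 1.0); it just mirrors the decision the text describes:

```julia
# Data Julia stores inline (isbits values, including structs whose fields
# are all isbits) maps to inline members of the HDF5 compound type;
# everything else (heap-allocated objects) is written as a separate object
# and stored as an HDF5 reference.
struct Point
    x::Float64
    y::Float64
end

stored_inline(T::Type) = isbitstype(T)
```

So a `Point` field lands inline in the compound type, while a `Vector{Int}` field becomes a reference to a separately stored dataset.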
This presently passes all read and write tests, but there is plenty more work to be done:
- The `rootmodule` keyword argument to `write` is not yet supported. This is a bit painful to deal with if the user decides to write the same type with and without the `rootmodule` keyword, since we would need to create two distinct compound types. @timholy, can you explain the use case for this so I can think about whether there's an easier design that could accomplish the same ends?
- `getindex`/`setindex!` methods. These are missing tests.
- `dump`.
- I have to fix some things for Julia 0.2 if we want to maintain compatibility.

Implementation notes:

- Types are stored in a `_types` group, with a `julia type` attribute that indicates the corresponding Julia type. Because libhdf5 loses track of compound type hierarchy between closing and reopening a file, numeric suffixes indicate the corresponding compound type index for a field that is itself a compound type.
- The `@save` form that saves all top-level variables catches the error and displays a warning instead.