HDF5 support #64

Closed
tshort opened this issue Oct 1, 2012 · 17 comments

@tshort
Contributor

tshort commented Oct 1, 2012

This may be a nice way to exchange data with R and pandas.

See Tim Holy's work based on code by Konrad Hinsen:

https://github.com/timholy/julia_hdf5

An on-disk Hdf5DataFrame might be nice, too.

@tshort
Contributor Author

tshort commented Oct 4, 2012

I checked in preliminary support for a new AbstractDataFrame stored in an HDF5 file. It's in the hdf5 branch:

https://github.com/HarlanH/JuliaData/blob/hdf5/src/hdf5.jl

HDF5 is an interesting format. It has tons of features and potential. It's also huge, overwhelmingly so. It could be a good way to offer on-disk storage (issue #25) with chunking, compression, and indexing.

In the process of adding this, I also tried to make more of the DataFrames into AbstractDataFrames in dataframe.jl. That increased the warning count quite a bit. That is an annoying problem.

@HarlanH
Contributor

HarlanH commented Oct 4, 2012

Hm, interesting. I don't have much to say about this, as I've never had a need to use HDF5.

If it's a vector-of-structs format, as you mention in the code, it doesn't sound like a viable option for memory-mapped storage at all. Just a static interchange format. Definitely something we should support importing/exporting, though.


@tshort
Contributor Author

tshort commented Oct 4, 2012

Vector-of-structs is just one option. The main option I'm trying uses separate columns, and each column can be compressed and chunked. Indexing is not built in, but it looks like fast indexing is on the horizon. PyTables has some interesting features for memory-mapped storage, and it uses HDF5:

http://www.pytables.org/moin

How to support DataVecs and PooledDataVecs is another issue. It can be done, it's just a question of how hard it is to implement.
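As a concrete illustration of that column-per-dataset idea, here is a minimal sketch assuming a current HDF5.jl and DataFrames; the file name, group name, and chunk/compression settings are arbitrary choices, and the chunk/deflate keywords are HDF5.jl dataset-creation options as I understand them, not anything from the hdf5 branch:

using HDF5, DataFrames

df = DataFrame(id = collect(1:1_000_000), score = rand(1_000_000))  # made-up example frame

h5open("frame.h5", "w") do file
    g = create_group(file, "df")
    for name in names(df)
        col = df[!, name]
        # one chunked, deflate-compressed dataset per column
        dset = create_dataset(g, String(name), eltype(col), (length(col),);
                              chunk = (100_000,), deflate = 3)
        write(dset, col)
    end
end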

@dmbates
Contributor

dmbates commented Jan 9, 2013

I think this is well worth examining. The HDF5Compound type is very similar to an AbstractDataFrame. The "rhdf5" package for R provides the h5save() function, which might be a good way to pass data frames back and forth. Right now the HDF5 package for Julia is a bit mystified by HDF5Compound types, I think (@timholy, is that correct?), but it is a natural structure in that it has names and offsets into a table. It should be possible to index into the gargantuan array of characters and convert to the desired types. The example of importance to me is an R data frame with a couple of million records.

$ h5ls -r -v wkce1.h5 
Opened "wkce1.h5" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/wkce1                   Dataset {2606541/2606541}
    Location:  1:800
    Links:     1
    Storage:   260654100 logical bytes, 260654100 allocated bytes, 100.00% utilization
    Type:      struct {
                   "stuid"            +0    native int
                   "year"             +4    native int
                   "gr"               +8    native int
                   "age"              +12   native int
                   "sex"              +16   native int
                   "race"             +20   native int
                   "econ"             +24   native int
                   "disab"            +28   native int
                   "ell"              +32   native int
                   "school"           +36   native int
                   "dist"             +40   native int
                   "distFay"          +44   native int
                   "schFay"           +48   native int
                   "readRS"           +52   native double
                   "readSS"           +60   native double
                   "mathRS"           +68   native double
                   "mathSS"           +76   native double
                   "Rproflvl"         +84   native int
                   "Mproflvl"         +88   native int
                   "Rprof"            +92   native int
                   "Mprof"            +96   native int
               } 100 bytes
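For what it's worth, with a recent HDF5.jl a compound dataset like the one above usually reads back as a vector of named tuples, which the DataFrame constructor can consume directly. A hedged sketch, taking the file and dataset names from the h5ls listing and assuming that read behavior:

using HDF5, DataFrames

df = h5open("wkce1.h5", "r") do file
    records = read(file["wkce1"])   # assumed: Vector of NamedTuples, one per record
    DataFrame(records)              # build columns from the field names and values
end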

@timholy
Contributor

timholy commented Jan 9, 2013

Yes, Julia's HDF5 support for compound types is limited. HDF5 compound data types are basically the equivalent of C structs. My plan was to largely avoid them until Julia's support for C structs is better. I was forced to add a bit of it to get support for Matlab's complex numbers (something that clearly needed to happen sooner rather than later), but otherwise it's very very rough.

JuliaLang/julia#1831 suggests it might not be long now.

But if you want to store columns directly, then I think everything is in place. The array support is good, and should be completely general.

Presumably you can just save a DataFrame directly with JLD? I can't test now because of the strpack breakage, but hopefully soon.
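A minimal sketch of the JLD round trip mentioned here, assuming JLD's exported save/load functions (the same calls work with the later JLD2 package); the file name is arbitrary:

using JLD, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])
save("frame.jld", "df", df)     # store the DataFrame under the key "df"
df2 = load("frame.jld", "df")   # read it back
@assert df2 == df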

@tshort
Contributor Author

tshort commented Jan 10, 2013

Last I checked, you cannot save a DataFrame directly with JLD. The write method for Associative types doesn't work because DataFrames differ from standard Associative types in that they iterate over rows by default instead of columns. It shouldn't be hard to write a specific method for DataFrames. I've been hoping to get to this, but I don't know when I'll get some time.

I'm also interested in an on-disk HDF5 that is an AbstractDataFrame. Tim's package is well suited for this, and HDF5 matches well with an AbstractDataFrame structure as Doug pointed out. Storing columns directly and accessing them like a SubDataFrame should be possible. A tricky part may be adding NA support by mimicking a DataArray.
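On the on-disk direction: HDF5.jl dataset handles can be indexed without reading the whole array, which is roughly the access pattern an HDF5-backed AbstractDataFrame would need. A hedged sketch, reusing the hypothetical column-per-dataset layout from the earlier example:

using HDF5

h5open("frame.h5", "r") do file
    scores = file["df/score"]       # dataset handle; nothing read yet
    block  = scores[1:1_000]        # hyperslab read of just these rows
    dims   = size(scores)           # size comes from metadata, not from loading data
    @show dims block[1]
end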

@dmbates
Contributor

dmbates commented Jan 10, 2013

Tom may have already looked at this but two of the issues for storing DataFrames and their components are allowing for NA's and storing the levels of a PooledDataVector. At the very least the hierarchical nature of HDF5 allows for a PooledDataVector to be an HDF5 group containing both the indices and the levels. If you want to include the NA BitVector you could use a hierarchical representation of DataVector's too.

I am trying to remember why NA's aren't stored as a particular NaN value for Float64 and Float32 vectors, as in R (well, R doesn't have a native Float32, but it does use a NaN pattern for NA's in numeric vectors). R also uses a special pattern for integers and for logical vectors (and, I think, for character strings too), but those are implemented by having the low-level code check for them. Would it be reasonable to try to leverage the 32-bit float NaN's for NA's in integer formats? For PooledDataVector's an index outside the allowable range (e.g. 0) could be used to signal NA.

I imagine suggestions like this have been considered and rejected so if you can just point me to a discussion I can find out why they won't work easily. :-)
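A hedged sketch of the hierarchical layout Doug describes for a pooled column: a group holding an integer reference vector plus the levels, with 0 (outside the valid range) standing in for NA. The group name and level values are made up:

using HDF5

levels = ["low", "medium", "high"]
refs   = UInt32[1, 3, 0, 2, 3]      # 0 is outside 1:length(levels), used as the NA code

h5open("pooled.h5", "w") do file
    g = create_group(file, "econ")  # one group per pooled column
    g["refs"]   = refs              # integer codes, one per row
    g["levels"] = levels            # the pool, stored once
end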

@tshort
Contributor Author

tshort commented Jan 10, 2013

Issues #22 and #45 have background and discussion on NA representation. I've been an advocate of allowing standard Arrays in DataFrames and also of using bit patterns to indicate NA's. It's less of an issue now that John has stepped up and made DataVectors usable, but I think this is still a viable alternative that would be useful in some instances.

As far as an HDF5 AbstractDataFrame, NA's could be handled in several ways:

  • Mimicking the DataArray type by implementing a BitVector mask alongside an Array (see the sketch after this list). These could be stored in a hierarchical fashion, or they could be stored with HDF5 references (that's how complex types like composite types or associative types are stored in JLD files).
  • The NA indicator could be paired with the array data using HDF5 compound data types. An advantage of this, from a disk-storage standpoint, is that the NA indicator would always be stored "close" to the data.
  • A bit pattern could be used for NA's.
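A hedged sketch of the first option, pairing the values with a separate NA mask inside one group; the column name is made up, and the mask is widened to bytes only because that is a simple, portable thing to store:

using HDF5

values = [1.5, 2.0, 0.0, 4.25]
isna   = BitVector([false, false, true, false])   # mirrors a DataArray's NA vector

h5open("column.h5", "w") do file
    g = create_group(file, "readSS")
    g["values"] = values
    g["isna"]   = UInt8.(isna)      # store the mask as bytes
end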

@johnmyleswhite
Contributor

Once we have a more stable and feature-complete implementation of DataArrays and DataFrames, I'd be open to reconsidering this issue. My gut feeling continues to be that it adds a lot of complexity to the system. After we have enough tests to keep us in check while we experiment with alternative backends, I'd be open to trying to see what is needed to make DataArrays faster.

@ViralBShah
Contributor

Is there a quick and dirty solution here right now for DataFrames that contain only simple data types such as strings and integers? Is it possible to convert a dataframe into Array{Any} and save that with HDF5?

@tshort
Contributor Author

tshort commented Jun 14, 2013

Saving and loading of DataFrames should work with HDF5 as is. Tim added support a few months ago, but I haven't tried it for a while. Longer term, options for better support of HDF5 include:

  • Loading DataFrames from foreign HDF5 formats, including data subsets
  • Using HDF5 objects as AbstractDataFrames without pulling the data all into memory; possibly add indexing to make this faster

@ViralBShah
Contributor

Thanks @tshort. I will try this out, since I am already getting tired of doing readcsv every time I restart julia, even though I have a variant that is much faster now.

@timholy
Contributor

timholy commented Aug 7, 2014

I think this can be closed; it's been possible to save DataFrames using HDF5/JLD for quite some time.

nalimilan pushed a commit that referenced this issue Jul 8, 2017
Consolidating the constructors minimized the number of places where auto promotion could take place. The new constructor recycles scalars such that if DataTable is created with a mix of scalars and vectors the scalars will be recycled to the same length as the vectors. Fixes an outstanding bug where scalar recycling only worked if the scalar assignments came after the vector assignments of the desired length, see #882. Tests that used to assume NullableArray promotion now explicitly use NullableArrays and new constructor tests have been added to test changes.
rofinn pushed a commit that referenced this issue Aug 17, 2017
nalimilan pushed a commit that referenced this issue Aug 25, 2017
quinnj pushed a commit that referenced this issue Sep 2, 2017
quinnj added a commit that referenced this issue Sep 2, 2017
@damiendr

damiendr commented Jun 8, 2018

I have some code to read HDF5 tables into Julia efficiently (using HDF5's built-in memory layout conversion), and I'm willing to contribute some more code to turn that into a DataFrame. In which package should this go?

@damiendr

damiendr commented Jun 8, 2018

To clarify -- The existing support for compound types in HDF5.jl is inefficient (type instability, redundant storage of field types and names with each row, and unnecessary data copying/conversion), making it unsuitable for large datasets. But I'm a bit wary of modifying HDF5.jl to return a DataFrame, because this would break existing code, and there may be some tricky cases with tables that contain arrays and strings (everything needs to be isbits). So maybe the best compromise would be a direct route into DataFrames.
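One hedged workaround for the type-instability point, until a dedicated path exists: do the generic row-wise read once and immediately pull each field out into a concretely typed column, so downstream code never touches the heterogeneous records. This assumes, as in the earlier sketch, that the compound dataset reads back as a vector of named tuples; "table.h5" and "table" are placeholders:

using HDF5, DataFrames

df = h5open("table.h5", "r") do file
    records = read(file["table"])
    DataFrame([name => getfield.(records, name) for name in fieldnames(eltype(records))])
end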

@nalimilan
Member

I think HDF5-DataFrames interop code should live either in HDF5.jl or in a special package. If it lives in HDF5.jl, it doesn't necessarily mean that a DataFrame should be returned by default. I guess the best design would be to implement a DataStreams Source/Sink, which will allow loading the file into a DataFrame, but also streaming it to any kind of structure or file format, as well as doing operations with Query.jl.

@damiendr

damiendr commented Jun 9, 2018

Thanks for the pointers. I've made a quick & dirty package here: https://github.com/damiendr/HDFTables.jl
It needs more testing, but it should work just fine for tables of scalars. I'll look into the Source/Sink interface and see how I can work with that.

nalimilan pushed a commit that referenced this issue May 26, 2022
* drop Julia 0.7, add Julia 1.2 and 1.3 to CI
* require DataFrames 0.19