HDF5 support #64

Closed
tshort opened this issue Oct 1, 2012 · 17 comments

@tshort
Contributor

tshort commented Oct 1, 2012

This may be a nice way to exchange data with R and pandas.

See Tim Holy's work based on code by Konrad Hinsen:

https://github.com/timholy/julia_hdf5

An on-disk Hdf5DataFrame might be nice, too.

@tshort
Contributor Author

tshort commented Oct 4, 2012

I checked in preliminary support for a new AbstractDataFrame stored in an HDF5 file. It's in the hdf5 branch:

https://github.com/HarlanH/JuliaData/blob/hdf5/src/hdf5.jl

HDF5 is an interesting format. It has tons of features and potential. It's also huge, overwhelmingly so. It could be a good way to offer on-disk storage (issue #25) with chunking, compression, and indexing.

In the process of adding this, I also tried to make more of the DataFrames into AbstractDataFrames in dataframe.jl. That increased the warning count quite a bit. That is an annoying problem.

@HarlanH
Contributor

HarlanH commented Oct 4, 2012

Hm, interesting. I don't have much to say about this, as I've never had a need to use HDF5.

If it's a vector-of-structs format, as you mention in the code, it doesn't sound like a viable option for memory-mapped storage at all. Just a static interchange format. Definitely something we should support importing/exporting, though.


@tshort
Contributor Author

tshort commented Oct 4, 2012

Vector-of-structs is just one option. The main option I'm trying uses separate columns, and each column can be compressed and chunked. Indexing is not built in, but it looks like fast indexing is on the horizon. PyTables has some interesting features for memory-mapped storage, and it uses HDF5:

http://www.pytables.org/moin

How to support DataVecs and PooledDataVecs is another issue. It can be done, it's just a question of how hard it is to implement.
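As a concrete illustration of that column-per-dataset idea, here is a minimal sketch assuming a current HDF5.jl and DataFrames; the file name, group name, and chunk/compression settings are arbitrary choices, and the chunk/deflate keywords are HDF5.jl dataset-creation options as I understand them, not anything from the hdf5 branch:

using HDF5, DataFrames

df = DataFrame(id = collect(1:1_000_000), score = rand(1_000_000))  # made-up example frame

h5open("frame.h5", "w") do file
    g = create_group(file, "df")
    for name in names(df)
        col = df[!, name]
        # one chunked, deflate-compressed dataset per column
        dset = create_dataset(g, String(name), eltype(col), (length(col),);
                              chunk = (100_000,), deflate = 3)
        write(dset, col)
    end
end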

@dmbates
Contributor

dmbates commented Jan 9, 2013

I think this is well worth examining. The HDF5Compound type is very similar to an AbstractDataFrame. The "rhdf5" package for R provides the h5save() function, which might be a good way to pass data frames back and forth. Right now the HDF5 package for Julia is a bit mystified by HDF5Compound types, I think (@timholy, is that correct?), but it is a natural structure in that it has names and offsets into a table. It should be possible to index into the gargantuan array of characters and convert to the desired types. The example of importance to me is an R data frame with a couple of million records.

$ h5ls -r -v wkce1.h5 
Opened "wkce1.h5" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/wkce1                   Dataset {2606541/2606541}
    Location:  1:800
    Links:     1
    Storage:   260654100 logical bytes, 260654100 allocated bytes, 100.00% utilization
    Type:      struct {
                   "stuid"            +0    native int
                   "year"             +4    native int
                   "gr"               +8    native int
                   "age"              +12   native int
                   "sex"              +16   native int
                   "race"             +20   native int
                   "econ"             +24   native int
                   "disab"            +28   native int
                   "ell"              +32   native int
                   "school"           +36   native int
                   "dist"             +40   native int
                   "distFay"          +44   native int
                   "schFay"           +48   native int
                   "readRS"           +52   native double
                   "readSS"           +60   native double
                   "mathRS"           +68   native double
                   "mathSS"           +76   native double
                   "Rproflvl"         +84   native int
                   "Mproflvl"         +88   native int
                   "Rprof"            +92   native int
                   "Mprof"            +96   native int
               } 100 bytes
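For what it's worth, with a recent HDF5.jl a compound dataset like the one above usually reads back as a vector of named tuples, which the DataFrame constructor can consume directly. A hedged sketch, taking the file and dataset names from the h5ls listing and assuming that read behavior:

using HDF5, DataFrames

df = h5open("wkce1.h5", "r") do file
    records = read(file["wkce1"])   # assumed: Vector of NamedTuples, one per record
    DataFrame(records)              # build columns from the field names and values
end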

@timholy
Contributor

timholy commented Jan 9, 2013

Yes, Julia's HDF5 support for compound types is limited. HDF5 compound data types are basically the equivalent of C structs. My plan was to largely avoid them until Julia's support for C structs is better. I was forced to add a bit of it to get support for Matlab's complex numbers (something that clearly needed to happen sooner rather than later), but otherwise it's very very rough.

JuliaLang/julia#1831 suggests it might not be long now.

But if you want to store columns directly, then I think everything is in place. The array support is good, and should be completely general.

Presumably you can just save a DataFrame directly with JLD? I can't test now because of the strpack breakage, but hopefully soon.
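A minimal sketch of the JLD round trip mentioned here, assuming JLD's exported save/load functions (the same calls work with the later JLD2 package); the file name is arbitrary:

using JLD, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])
save("frame.jld", "df", df)     # store the DataFrame under the key "df"
df2 = load("frame.jld", "df")   # read it back
@assert df2 == df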

@tshort
Contributor Author

tshort commented Jan 10, 2013

Last I checked, you cannot save a DataFrame directly with JLD. The write method for Associative types doesn't work because DataFrames differ from standard Associative types in that they iterate over rows by default instead of columns. It shouldn't be hard to write a specific method for DataFrames. I've been hoping to get to this, but I don't know when I'll get some time.

I'm also interested in an on-disk HDF5 that is an AbstractDataFrame. Tim's package is well suited for this, and HDF5 matches well with an AbstractDataFrame structure as Doug pointed out. Storing columns directly and accessing them like a SubDataFrame should be possible. A tricky part may be adding NA support by mimicking a DataArray.
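On the on-disk direction: HDF5.jl dataset handles can be indexed without reading the whole array, which is roughly the access pattern an HDF5-backed AbstractDataFrame would need. A hedged sketch, reusing the hypothetical column-per-dataset layout from the earlier example:

using HDF5

h5open("frame.h5", "r") do file
    scores = file["df/score"]       # dataset handle; nothing read yet
    block  = scores[1:1_000]        # hyperslab read of just these rows
    dims   = size(scores)           # size comes from metadata, not from loading data
    @show dims block[1]
end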

@dmbates
Contributor

dmbates commented Jan 10, 2013

Tom may have already looked at this but two of the issues for storing DataFrames and their components are allowing for NA's and storing the levels of a PooledDataVector. At the very least the hierarchical nature of HDF5 allows for a PooledDataVector to be an HDF5 group containing both the indices and the levels. If you want to include the NA BitVector you could use a hierarchical representation of DataVector's too.

I am trying to remember why NA's aren't stored as a particular NaN value for Float64 and Float32 vectors, as in R (well, R doesn't have a native Float32, but it does use a NaN pattern for NA's in numeric vectors). R also uses a special pattern for integers and for logical vectors (and, I think, for character strings too), but those are implemented by having the low-level code check for them. Would it be reasonable to try to leverage the 32-bit float NaN's for NA's in integer formats? For PooledDataVector's an index outside the allowable range (e.g. 0) could be used to signal NA.

I imagine suggestions like this have been considered and rejected so if you can just point me to a discussion I can find out why they won't work easily. :-)
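A hedged sketch of the hierarchical layout Doug describes for a pooled column: a group holding an integer reference vector plus the levels, with 0 (outside the valid range) standing in for NA. The group name and level values are made up:

using HDF5

levels = ["low", "medium", "high"]
refs   = UInt32[1, 3, 0, 2, 3]      # 0 is outside 1:length(levels), used as the NA code

h5open("pooled.h5", "w") do file
    g = create_group(file, "econ")  # one group per pooled column
    g["refs"]   = refs              # integer codes, one per row
    g["levels"] = levels            # the pool, stored once
end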

@tshort
Contributor Author

tshort commented Jan 10, 2013

Issues #22 and #45 have background and discussion on NA representation. I've been an advocate of allowing standard Arrays in DataFrames and also of using bit patterns to indicate NA's. It's less of an issue now that John has stepped up and made DataVectors usable, but I think this is still a viable alternative that would be useful in some instances.

As far as an HDF5 AbstractDataFrame, NA's could be handled in several ways:

  • Mimicking the DataArray type by implementing a BitVector mask alongside an Array (see the sketch after this list). These could be stored in a hierarchical fashion, or they could be stored with HDF5 references (that's how complex types like composite types or associative types are stored in JLD files).
  • The NA indicator could be paired with the array data using HDF5 compound data types. An advantage of this, from a disk-storage standpoint, is that the NA indicator would always be stored "close" to the data.
  • A bit pattern could be used for NA's.
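A hedged sketch of the first option, pairing the values with a separate NA mask inside one group; the column name is made up, and the mask is widened to bytes only because that is a simple, portable thing to store:

using HDF5

values = [1.5, 2.0, 0.0, 4.25]
isna   = BitVector([false, false, true, false])   # mirrors a DataArray's NA vector

h5open("column.h5", "w") do file
    g = create_group(file, "readSS")
    g["values"] = values
    g["isna"]   = UInt8.(isna)      # store the mask as bytes
end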

@johnmyleswhite
Contributor

Once we have a more stable and feature-complete implementation of DataArrays and DataFrames, I'd be open to reconsidering this issue. My gut feeling continues to be that it adds a lot of complexity to the system. After we have enough tests to keep us in check while we experiment with alternative backends, I'd be open to trying to see what is needed to make DataArrays faster.

@ViralBShah
Contributor

Is there a quick and dirty solution here right now for DataFrames that contain only simple data types such as strings and integers? Is it possible to convert a dataframe into Array{Any} and save that with HDF5?

@tshort
Contributor Author

tshort commented Jun 14, 2013

Saving and loading of DataFrames should work with HDF5 as is. Tim added support a few months ago, but I haven't tried it for a while. Longer term, options for better support of HDF5 include:

  • Loading DataFrames from foreign HDF5 formats, including data subsets
  • Using HDF5 objects as AbstractDataFrames without pulling the data all into memory; possibly add indexing to make this faster

@ViralBShah
Contributor

Thanks @tshort. I will try this out, since I am already getting tired of doing readcsv every time I restart julia, even though I have a variant that is much faster now.

@timholy
Contributor

timholy commented Aug 7, 2014

I think this can be closed; it's been possible to save DataFrames using HDF5/JLD for quite some time.

nalimilan pushed a commit that referenced this issue Jul 8, 2017
Consolidating the constructors minimized the number of places where auto promotion could take place. The new constructor recycles scalars such that if DataTable is created with a mix of scalars and vectors the scalars will be recycled to the same length as the vectors. Fixes an outstanding bug where scalar recycling only worked if the scalar assignments came after the vector assignments of the desired length, see #882. Tests that used to assume NullableArray promotion now explicitly use NullableArrays and new constructor tests have been added to test changes.
rofinn pushed a commit that referenced this issue Aug 17, 2017
nalimilan pushed a commit that referenced this issue Aug 25, 2017
quinnj pushed a commit that referenced this issue Sep 2, 2017
quinnj added a commit that referenced this issue Sep 2, 2017
@damiendr

damiendr commented Jun 8, 2018

I have some code to read HDF5 tables into Julia efficiently (using HDF5's built-in memory layout conversion), and I'm willing to contribute some more code to turn that into a DataFrame. In which package should this go?

@damiendr

damiendr commented Jun 8, 2018

To clarify -- The existing support for compound types in HDF5.jl is inefficient (type instability, redundant storage of field types and names with each row, and unnecessary data copying/conversion), making it unsuitable for large datasets. But I'm a bit wary of modifying HDF5.jl to return a DataFrame, because this would break existing code, and there may be some tricky cases with tables that contain arrays and strings (everything needs to be isbits). So maybe the best compromise would be a direct route into DataFrames.
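One hedged workaround for the type-instability point, until a dedicated path exists: do the generic row-wise read once and immediately pull each field out into a concretely typed column, so downstream code never touches the heterogeneous records. This assumes, as in the earlier sketch, that the compound dataset reads back as a vector of named tuples; "table.h5" and "table" are placeholders:

using HDF5, DataFrames

df = h5open("table.h5", "r") do file
    records = read(file["table"])
    DataFrame([name => getfield.(records, name) for name in fieldnames(eltype(records))])
end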

@nalimilan
Member

I think HDF5-DataFrames interop code should live either in HDF5.jl or in a special package. If it lives in HDF5.jl, it doesn't necessarily mean that a DataFrame should be returned by default. I guess the best design would be to implement a DataStreams Source/Sink, which will allow loading the file into a DataFrame, but also streaming it to any kind of structure or file format, as well as doing operations with Query.jl.

@damiendr

damiendr commented Jun 9, 2018

Thanks for the pointers. I've made a quick & dirty package here: https://github.com/damiendr/HDFTables.jl
It needs more testing, but it should work just fine for tables of scalars. I'll look into the Source/Sink interface and see how I can work with that.

nalimilan pushed a commit that referenced this issue May 26, 2022
* drop Julia 0.7, add Julia 1.2 and 1.3 to CI
* require DataFrames 0.19