Files created with blosc compression can't be read by Python and R #254

Open
scheidan opened this issue Aug 19, 2015 · 16 comments

@scheidan
Contributor

This is likely off-topic, but it may still be helpful for someone to have it documented.

I created a file with HDF5.jl using the default compression (blosc) at level 3. However, I was unable to read this file with Python (using the h5py library) or with R (using the rhdf5 package, which produced a slightly more informative error message). I had the same result on Ubuntu and Manjaro Linux.

Using deflate compression solved the issue. Given these compatibility problems, I wonder whether blosc is a good default choice.
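For reference, a minimal sketch of the two cases. The `"blosc"`/`"deflate"` keyword names are assumptions about the HDF5.jl dataset-property syntax of the time (check the README for the exact spelling); note that HDF5 compression filters also require chunked datasets:

```julia
using HDF5

A = rand(1000, 1000)

h5open("data_blosc.h5", "w") do f
    # Blosc at level 3: fast, but readable elsewhere only if the
    # reader's HDF5 has the Blosc filter available.
    f["A", "chunk", (100, 100), "blosc", 3] = A
end

h5open("data_deflate.h5", "w") do f
    # Deflate (zlib) at level 3: slower, but readable by h5py,
    # rhdf5, and stock HDF5 tools.
    f["A", "chunk", (100, 100), "deflate", 3] = A
end
```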

@timholy
Member

timholy commented Aug 19, 2015

Not off-topic at all. I know little about this myself (CCing @stevengj). My understanding is that blosc is quite a lot faster than deflate, and that this matters a lot for large datasets. My experience with Matlab (which turns on compression by default) is that storing large datasets is painful, so much so that I tried to fix it, http://www.mathworks.com/matlabcentral/fileexchange/39721-save-mat-files-more-quickly.

If we change it, rather than default to deflate I'd rather go back to no compression at all. But I suppose the alternative is to contact h5py and rhdf5 and try to get them to support blosc?

@timholy
Member

timholy commented Aug 19, 2015

CC @andrewcollette. Not quite sure who to CC on the rhdf5 side.

@stevengj
Member

As described in #174, blosc compression incurs a 2× slowdown or less (and is often even faster than uncompressed HDF5 for highly compressible data). In contrast, deflate incurs slowdowns from 10× to 1000×.

Unfortunately, Blosc is not yet bundled with HDF5 by default, so Blosc-compressed HDF5 files are not readable by other programs unless they also link to the Blosc library and enable the Blosc HDF5 plugin from https://github.com/Blosc/hdf5. But we don't blosc-compress files unless you explicitly request it (although it might be reasonable to enable it by default for JLD files: #178). You can always specify deflate if you want.

@FrancescAlted, have you made any progress on getting Blosc incorporated into h5py or other HDF5 wrappers, or better yet into HDF5 itself?

@stevengj
Member

@timholy, the default should be no compression; has this changed? I think @scheidan was only asking about the default when you specify compress.

@timholy
Member

timholy commented Aug 19, 2015

Gotcha

@andrewcollette

I'm not that familiar with Blosc, but there is some good news on the HDF5 front... recent versions of HDF5 include a "dynamically loaded filter" capability. So if the Blosc HDF5 filter is compiled into a shared library and put in the appropriate directory, it can be loaded by HDF5 automatically. Since this happens at the C level, it would be compatible with Python/R/whatever.

Here's the original RFC:

https://www.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf

I've been meaning to do this with the LZF filter but haven't had the time.
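If I understand the mechanism correctly, HDF5 searches a default plugin directory and anything listed in the `HDF5_PLUGIN_PATH` environment variable. From Julia that would look roughly like this (the path is an assumption; it must be set before the HDF5 C library is initialized):

```julia
# Point HDF5's dynamic-filter loader at a directory containing the
# compiled Blosc plugin shared library, *before* loading HDF5.
ENV["HDF5_PLUGIN_PATH"] = "/usr/local/hdf5/lib/plugin"
using HDF5

# Any program going through the HDF5 C library (Python, R, ...) with
# the same setup can then decompress Blosc-filtered datasets
# transparently, with no changes to the reading code.
```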

@timholy
Member

timholy commented Aug 19, 2015

@stevengj
Member

@timholy, we aren't using that feature. We are loading our filter manually. What @andrewcollette is referring to would remove the need for blosc_filter.jl entirely, because when HDF5 is initialized it would load some blosc_filter.so file automatically. (It is still convenient to have the pure-Julia version in our case, however, because it eliminates a lot of headaches with building and installation.)

(One thing we should be careful of is to avoid having blosc_filter.jl conflict with an auto-loaded filter; I haven't looked into this.)

@stevengj
Member

It looks like https://github.com/Blosc/hdf5 already implements the required API functions. So, @scheidan just needs to compile it into a shared library and install it into /usr/local/hdf5/lib/plugin (assuming you have libblosc.so or its equivalent installed appropriately).
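For anyone trying this, a hypothetical build sketch, driven from Julia. The source file names, compiler flags, and output name are guesses; consult the Blosc/hdf5 repo's README for the real build instructions:

```julia
# Sketch: compile the filter sources from https://github.com/Blosc/hdf5
# into a shared library and drop it into HDF5's default plugin
# directory (assumes gcc, libblosc, and the HDF5 headers are installed).
run(`gcc -shared -fPIC -O2 src/blosc_filter.c src/blosc_plugin.c -lblosc -lhdf5 -o libH5Zblosc.so`)
run(`sudo mkdir -p /usr/local/hdf5/lib/plugin`)
run(`sudo cp libH5Zblosc.so /usr/local/hdf5/lib/plugin/`)
```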

@stevengj
Member

(Okay, I just double-checked the HDF5 source code, and it looks like blosc_filter.jl will correctly take precedence over any blosc_filter.so file in the search path — it only searches for a shared-library plugin in cases where the desired filter is not already registered.)

@FrancescAlted

Yes, I think Andrew's suggestion of using the plugin method would work best for your needs. The Blosc HDF5 repo should already support this; see:

https://github.com/Blosc/hdf5/blob/master/src/blosc_plugin.c

Hope this helps.

@scheidan
Contributor Author

Thanks everyone for looking into that!
For my case, using deflate is the simplest option (speed is not critical, but the file should be readable on several different machines).

Would it make sense to add a link to this issue in the docs?

@timholy
Member

timholy commented Aug 21, 2015

Better than a link, please just edit the README to note that deflate is more portable. https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#improving-documentation

@eschnett
Contributor

eschnett commented Sep 8, 2015

"More portable" is an understatement. zlib compression is supported in basically every HDF5 installation (except where it has been explicitly turned off), while Blosc seems to be almost unknown in, e.g., the HPC community (which often uses HDF5 as its preferred storage format). If I write an HDF5 dataset with "compression", I have a very specific idea of what that means -- namely, the default compression mechanism in HDF5, which is transparently decompressed on every other system supporting HDF5.

I'm not speaking about JLD; this is a rather Julia-specific format that can and probably should push the envelope. But HDF5 is marketed for its portability and archivability, and people have come to expect this. I think there's about one new "interesting" compression library every five years: Will Blosc still be supported on all platforms in five years, at least in a way to make Blosc-compressed files readable? HDF5 users will expect this.

I would make "compress" use the built-in HDF5 compression by default. If this is slow, then I'd work with the HDF5 developers to improve this, and provide an option to work around this. An incompatible default makes me uncomfortable.

@FrancescAlted

Yes, Erik made very good points here. If what you want is archivability and compatibility with the standard HDF5 library, then you have to use what it supports by default (zlib and szip). If you want performance, then Blosc (or any other filter, for that matter) could make more sense.

Regarding whether Blosc will be supported in 5 years, well, that's a question of general open-source maintainability. Blosc as it is now (aka Blosc1) is mainly in maintenance mode, while new improvements are being moved to Blosc2. Also, Blosc closely follows C89, so I would say that maintaining it should be cheap enough for the years to come.


@eschnett
Contributor

To install HDF5, Blosc, as well as the HDF5 Blosc plugin, you can use e.g. Spack (https://github.com/LLNL/spack). The command `spack install hdf5-blosc` should install all three of these.
