
Enable Blosc compression of JLD files by default? #178

Closed
stevengj opened this issue Nov 15, 2014 · 27 comments
@stevengj
Member

As discussed in #173 and #174, it is worth considering enabling compress=true by default when saving JLD files. In all the benchmarks I've done so far, even with single-threaded Blosc this has a performance within a factor of two of uncompressed HDF5 (with the worst case being random data that does not benefit from compression), and is faster than uncompressed HDF5 for data that compresses well.

(This is dramatically better than zip/deflate compression, which can be orders of magnitude slower than uncompressed HDF5.)

Any other stress tests would be appreciated (just pass compress=true to save or jldopen). Note that compression is not used for small datasets (< 8k).
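For concreteness, here is a minimal sketch of the opt-in usage described above, assuming the JLD API of this era (`compress` as a keyword to `save` and `jldopen`, per the issue text):

```julia
using JLD

A = rand(1000, 1000)    # random data: compresses poorly (worst case)
B = zeros(1000, 1000)   # highly regular data: compresses very well

# Per-call opt-in:
save("data.jld", "A", A, "B", B, compress=true)

# Or per-file via jldopen:
jldopen("data2.jld", "w", compress=true) do file
    write(file, "B", B)
end
```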

cc: @jakebolewski, @dmbates, @kmsquire, @StefanKarpinski

@timholy
Member

timholy commented Nov 15, 2014

I'm certainly willing to entertain this. I'll note that one should run some benchmarks with mmap turned on, which might change the performance picture somewhat.

@rened
Contributor

rened commented Nov 15, 2014

Just for my understanding, when one mmaps an array in a Blosc compressed JLD file, does one see the compressed/garbled data or the uncompressed/raw array? I would believe the former, so mmapping is orthogonal to compression?
Is compression per-file or with a finer granularity?

@timholy
Member

timholy commented Nov 15, 2014

Yes, the two are orthogonal options. One can get fine granularity with both, but both also have defaults that can be set on a per-file basis.

@stevengj
Member Author

Since the default is mmaparrays=false, probably mmap is irrelevant to whether compression is turned on by default. But I agree that the comparison would be interesting.

@simonster
Member

ismmappable should probably return false for compressed data.
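A hedged sketch of the guard being suggested; `iscompressed` here is a hypothetical helper, not an actual JLD/HDF5.jl function:

```julia
# Compressed chunks are not the raw array bytes on disk, so mmapping
# them would expose garbage; refuse up front.
function ismmappable(dset)
    iscompressed(dset) && return false   # hypothetical compression check
    return true                          # ... plus the existing layout checks
end
```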

@jakebolewski

If you set the compression level to zero, Blosc does a multithreaded memcpy. Blosc could be enabled by default with little overhead in this case (it may even be faster; I haven't tested thoroughly). Opting into a higher compression level seems like the right way to do this.
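The level-0-vs-higher-levels trade-off can be seen directly with the Blosc.jl API (a sketch; `set_num_threads`, `compress`, and `decompress` are the package's documented entry points, but the exact timings will vary by machine):

```julia
using Blosc

Blosc.set_num_threads(4)          # Blosc parallelizes across blocks

x  = rand(Float64, 10^6)
c0 = Blosc.compress(x; level=0)   # no compression: essentially a framed memcpy
c5 = Blosc.compress(x; level=5)   # actual compression: trades CPU for size

y = Blosc.decompress(Float64, c5)
@assert y == x                    # round-trip is lossless
```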

@stevengj
Member Author

I would prefer to default to compression > 0. If the performance penalty is minor, which seems the case even for single-threaded Blosc, the benefit of compression seems worth it.

@jakebolewski

Here are some example timings comparing the three approaches when saving example DataFrames; each entry lists write time, file size, and read time:

https://gist.github.com/bc5b9308749649de8187

Julia v0.3.2 (DataFrames, HDF5, Blosc all latest master)

I'm also getting errors when jldopen(compress=true), it looks like the chunk size is not getting set correctly in all cases. Here is an example backtrace:

HDF5-DIAG: Error detected in HDF5 (1.8.13) thread 0:
  #000: H5D.c line 194 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 444 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1638 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1882 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: H5L.c line 1685 in H5L_link_cb(): unable to create object
    major: Object header
    minor: Unable to initialize object
  #007: H5O.c line 3015 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #008: H5Doh.c line 293 in H5O__dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: H5Dint.c line 1049 in H5D__create(): unable to construct layout information
    major: Dataset
    minor: Unable to initialize object
  #010: H5Dchunk.c line 546 in H5D__chunk_construct(): chunk size must be <= maximum dimension size for fixed-sized dimensions
    major: Dataset
    minor: Unable to initialize object
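The final line of the trace points at the likely cause: HDF5 requires every chunk dimension to be no larger than the corresponding fixed dataset dimension. A clamp along these lines (the names and default chunk shape are hypothetical, purely for illustration) avoids the error:

```julia
# HDF5 rejects any chunk dimension larger than the dataset's fixed
# dimension, so clamp the default chunk shape element-wise.
dataset_dims  = (16, 512, 24000)
default_chunk = (100, 100, 100)
chunk = map(min, default_chunk, dataset_dims)  # → (16, 100, 100)
```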

@stevengj
Member Author

Where can I get one of the .jld files that is failing?

@timholy
Member

timholy commented Nov 19, 2014

Nice work, @jakebolewski!

I'm also interested in the additives case, since default *.jld seems particularly bad there.

@davidssmith

I'm getting this same error when writing a ~1 GB complex array. @stevengj Do you still need test data?

HDF5 0.4.7
JLD 0.1.0

Julia Version 0.3.3
Commit b24213b* (2014-11-23 20:19 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.0.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@davidssmith

Looks like I can reproduce it with a large random Complex128 array with dimensions (16, 512, 24000).

@timholy
Member

timholy commented Dec 17, 2014

Just to double-check, @davidssmith you see that only when enabling compression?

@davidssmith

Yep.

@stevengj
Member Author

@davidssmith, have you checked whether it is fixed by #179?

PR #179 is currently held up because Blosc is crashing on 32-bit Windows (we're not sure why: JuliaIO/Blosc.jl#2) and that is causing test failures. Maybe we should just merge it anyway?

@davidssmith

Is there an easy way to apply #179 to my fork?

@tkelman
Contributor

tkelman commented Dec 18, 2014

Maybe we should just merge it anyway?

I'd be against breaking HDF5 on win32. If you can disable the tests for non-default functionality that is known to be broken I'd be fine with it.

@jakebolewski

Bump: the Windows issues seem to be fixed now on Julia v0.3.4 (latest master).

@stevengj
Member Author

Updated the patch; let's see if it passes.

@stevengj
Member Author

stevengj commented Apr 2, 2015

Note that Blosc has been working on Windows for a while.

However, I don't see a clamor of need for compressed HDF5/JLD files, so maybe we should just leave it as an option.

@davidssmith

clamors

Doesn't Blosc improve read performance on large files? If so, it seems worth enabling.

@stevengj
Member Author

stevengj commented Apr 2, 2015

A downside of enabling it by default is that it will be harder for other HDF5 applications to extract data from the file (assuming they don't link to Blosc too).

Blosc improves performance if the data compresses well (i.e. not random numbers).

@davidssmith

Does it make any sense to enable by default on JLD but not for HDF5?

@stevengj
Member Author

stevengj commented Apr 2, 2015

@davidssmith, definitely this would only be for JLD, and it is true that most people saving files for interchange with other programs would probably use the raw HDF5 interface.

@timholy
Member

timholy commented Apr 2, 2015

I'm fine with turning it on if (1) it doesn't substantially hurt performance for data that compress poorly, and (2) if people besides myself take responsibility for fixing any breakages :-).

@stevengj
Member Author

stevengj commented Apr 2, 2015

@jakebolewski, could you re-run your benchmarks now that things should be working?

@kleinhenz
Contributor

Closing, since it looks like the consensus is that we would only do this for JLD, not raw HDF5.
