
Enable Blosc compression of JLD files by default? #178

Closed
stevengj opened this issue Nov 15, 2014 · 27 comments
@stevengj
Member

As discussed in #173 and #174, it is worth considering enabling compress=true by default when saving JLD files. In all the benchmarks I've done so far, even with single-threaded Blosc this has a performance within a factor of two of uncompressed HDF5 (with the worst case being random data that does not benefit from compression), and is faster than uncompressed HDF5 for data that compresses well.

(This is dramatically better than zip/deflate compression, which can be orders of magnitude slower than uncompressed HDF5.)

Any other stress tests would be appreciated (just pass compress=true to save or jldopen). Note that compression is not used for small datasets (< 8k).
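For concreteness, here is a minimal sketch of the opt-in usage described above, assuming the JLD API of this era (`compress` as a keyword to `save` and `jldopen`, per the issue text):

```julia
using JLD

A = rand(1000, 1000)    # random data: compresses poorly (worst case)
B = zeros(1000, 1000)   # highly regular data: compresses very well

# Per-call opt-in:
save("data.jld", "A", A, "B", B, compress=true)

# Or per-file via jldopen:
jldopen("data2.jld", "w", compress=true) do file
    write(file, "B", B)
end
```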

cc: @jakebolewski, @dmbates, @kmsquire, @StefanKarpinski

@timholy
Member

timholy commented Nov 15, 2014

I'm certainly willing to entertain this. I'll note that one should run some benchmarks with mmap turned on, which might change the performance picture somewhat.

@rened
Contributor

rened commented Nov 15, 2014

Just for my understanding, when one mmaps an array in a Blosc compressed JLD file, does one see the compressed/garbled data or the uncompressed/raw array? I would believe the former, so mmapping is orthogonal to compression?
Is compression per-file or with a finer granularity?

@timholy
Member

timholy commented Nov 15, 2014

Yes, the two are orthogonal options. One can get fine granularity with both, but both also have defaults that can be set on a per-file basis.

@stevengj
Member Author

Since the default is mmaparrays=false, probably mmap is irrelevant to whether compression is turned on by default. But I agree that the comparison would be interesting.

@simonster
Member

ismmappable should probably return false for compressed data.
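A hedged sketch of the guard being suggested; `iscompressed` here is a hypothetical helper, not an actual JLD/HDF5.jl function:

```julia
# Compressed chunks are not the raw array bytes on disk, so mmapping
# them would expose garbage; refuse up front.
function ismmappable(dset)
    iscompressed(dset) && return false   # hypothetical compression check
    return true                          # ... plus the existing layout checks
end
```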

@jakebolewski

If you set the compression level to zero, Blosc does a multithreaded memcpy. Blosc could be enabled by default with little overhead in this case (it may even be faster; I haven't tested thoroughly). Opting into a higher compression level seems like the right way to do this.
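The level-0-vs-higher-levels trade-off can be seen directly with the Blosc.jl API (a sketch; `set_num_threads`, `compress`, and `decompress` are the package's documented entry points, but the exact timings will vary by machine):

```julia
using Blosc

Blosc.set_num_threads(4)          # Blosc parallelizes across blocks

x  = rand(Float64, 10^6)
c0 = Blosc.compress(x; level=0)   # no compression: essentially a framed memcpy
c5 = Blosc.compress(x; level=5)   # actual compression: trades CPU for size

y = Blosc.decompress(Float64, c5)
@assert y == x                    # round-trip is lossless
```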

@stevengj
Member Author

I would prefer to default to compression > 0. If the performance penalty is minor, which seems the case even for single-threaded Blosc, the benefit of compression seems worth it.

@jakebolewski

Here are some example timings comparing the three approaches when saving example DataFrames; each entry lists write time, file size, and read time:

https://gist.github.com/bc5b9308749649de8187

Julia v0.3.2 (DataFrames, HDF5, Blosc all latest master)

I'm also getting errors when jldopen(compress=true), it looks like the chunk size is not getting set correctly in all cases. Here is an example backtrace:

HDF5-DIAG: Error detected in HDF5 (1.8.13) thread 0:
  #000: H5D.c line 194 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 444 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1638 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1882 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: H5L.c line 1685 in H5L_link_cb(): unable to create object
    major: Object header
    minor: Unable to initialize object
  #007: H5O.c line 3015 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #008: H5Doh.c line 293 in H5O__dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: H5Dint.c line 1049 in H5D__create(): unable to construct layout information
    major: Dataset
    minor: Unable to initialize object
  #010: H5Dchunk.c line 546 in H5D__chunk_construct(): chunk size must be <= maximum dimension size for fixed-sized dimensions
    major: Dataset
    minor: Unable to initialize object
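The final line of the trace points at the likely cause: HDF5 requires every chunk dimension to be no larger than the corresponding fixed dataset dimension. A clamp along these lines (the names and default chunk shape are hypothetical, purely for illustration) avoids the error:

```julia
# HDF5 rejects any chunk dimension larger than the dataset's fixed
# dimension, so clamp the default chunk shape element-wise.
dataset_dims  = (16, 512, 24000)
default_chunk = (100, 100, 100)
chunk = map(min, default_chunk, dataset_dims)  # → (16, 100, 100)
```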

@stevengj
Member Author

Where can I get one of the .jld files that is failing?

@timholy
Member

timholy commented Nov 19, 2014

Nice work, @jakebolewski!

I'm also interested in the additives case, since default *.jld seems particularly bad there.

@davidssmith

I'm getting this same error when writing a ~1 GB complex array. @stevengj Do you still need test data?

HDF5 0.4.7
JLD 0.1.0

Julia Version 0.3.3
Commit b24213b* (2014-11-23 20:19 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.0.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@davidssmith

Looks like I can reproduce it with a large random Complex128 array with dimensions (16, 512, 24000).

@timholy
Member

timholy commented Dec 17, 2014

Just to double-check, @davidssmith you see that only when enabling compression?

@davidssmith

Yep.

@stevengj
Member Author

@davidssmith, have you checked whether it is fixed by #179?

PR #179 is currently held up because Blosc is crashing on 32-bit Windows (we're not sure why: JuliaIO/Blosc.jl#2) and that is causing test failures. Maybe we should just merge it anyway?

@davidssmith

Is there an easy way to apply #179 to my fork?

@tkelman
Contributor

tkelman commented Dec 18, 2014

Maybe we should just merge it anyway?

I'd be against breaking HDF5 on win32. If you can disable the tests for non-default functionality that is known to be broken I'd be fine with it.

@jakebolewski

Bump: the Windows issues seem to be fixed now on Julia v0.3.4 (latest master).

@stevengj
Member Author

Updated the patch; let's see if it passes.

@stevengj
Member Author

stevengj commented Apr 2, 2015

Note that Blosc has been working on Windows for a while.

However, I don't see a clamor of need for compressed HDF5/JLD files, so maybe we should just leave it as an option.

@davidssmith

clamors

Doesn't Blosc improve read performance on large files? If so, it seems worth enabling.

@stevengj
Member Author

stevengj commented Apr 2, 2015

A downside of enabling it by default is that it will be harder for other HDF5 applications to extract data from the file (assuming they don't link to Blosc too).

Blosc improves performance if the data compresses well (i.e. not random numbers).

@davidssmith

Does it make any sense to enable by default on JLD but not for HDF5?

@stevengj
Member Author

stevengj commented Apr 2, 2015

@davidssmith, definitely this would only be for JLD, and it is true that most people saving files for interchange with other programs would probably use the raw HDF5 interface.

@timholy
Member

timholy commented Apr 2, 2015

I'm fine with turning it on if (1) it doesn't substantially hurt performance for data that compress poorly, and (2) if people besides myself take responsibility for fixing any breakages :-).

@stevengj
Member Author

stevengj commented Apr 2, 2015

@jakebolewski, could you re-run your benchmarks now that things should be working?

@kleinhenz
Contributor

Closing, since it looks like the consensus is that we would only do this for JLD, not raw HDF5.
