All speeds and compressed sizes reported below are available in the results CSV. The CSV also includes some codecs not visualized here.
Real-world datasets are the best indicator of usefulness. We compared against 3 datasets, all of which are publicly available and reasonably small to download:
- Devin Smith's air quality data download (15MB)
- NYC taxi data (2023-04 high volume for hire) (469MB)
- Reddit r/place 2022 data
dataset | uncompressed size | numeric data types |
---|---|---|
air quality | 59.7MB | i32, i64 |
taxi | 2.14GB | f64, i32, i64 |
r/place | 4.19GB | i32, i64 |
These benchmarks were again run on a single M3 performance core. Only numerical columns were used. For Blosc, the SHUFFLE filter and Zstd at its default level 3 were used. For Parquet, Zstd at the Parquet default level 1 was used.
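To make the Blosc configuration above more concrete, here is a minimal sketch of what a byte-shuffle filter does before the general-purpose codec runs. It is not Blosc's implementation: Blosc shuffles per block and uses optimized routines, and this sketch uses `zlib` from the Python standard library as a stand-in for Zstd. The function names `byte_shuffle` and `byte_unshuffle` and the sample column are hypothetical, chosen only for illustration.

```python
import struct
import zlib


def byte_shuffle(data: bytes, typesize: int) -> bytes:
    """Group the k-th byte of every element together (a byte transposition)."""
    if len(data) % typesize != 0:
        raise ValueError("data length must be a multiple of typesize")
    # data[k::typesize] selects byte k of each fixed-size element.
    return b"".join(data[k::typesize] for k in range(typesize))


def byte_unshuffle(shuffled: bytes, typesize: int) -> bytes:
    """Invert byte_shuffle, restoring the original element layout."""
    n = len(shuffled) // typesize
    out = bytearray(len(shuffled))
    for k in range(typesize):
        out[k::typesize] = shuffled[k * n:(k + 1) * n]
    return bytes(out)


# Hypothetical sample column: slowly increasing i64 values (little-endian).
values = [1_000_000 + 3 * i for i in range(10_000)]
raw = struct.pack(f"<{len(values)}q", *values)

shuffled = byte_shuffle(raw, typesize=8)

# zlib is only a stand-in here; the benchmarks above used Zstd.
plain_size = len(zlib.compress(raw, 6))
shuffled_size = len(zlib.compress(shuffled, 6))
print(f"raw: {len(raw)} B, zlib: {plain_size} B, shuffle+zlib: {shuffled_size} B")
```

On numeric columns whose high-order bytes rarely change, shuffling tends to produce long runs of similar bytes, which is why it is commonly paired with a general-purpose codec. The sizes printed by this sketch say nothing about the benchmark results reported here.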
You can also run a wide variety of synthetic benchmarks yourself using the CLI.