ORC-1577: Use ZSTD as the default compression #1733

Closed
wants to merge 2 commits into from
2 changes: 1 addition & 1 deletion c++/src/Writer.cc
@@ -51,7 +51,7 @@ namespace orc {
stripeSize = 64 * 1024 * 1024; // 64M
compressionBlockSize = 64 * 1024; // 64K
rowIndexStride = 10000;
- compression = CompressionKind_ZLIB;
+ compression = CompressionKind_ZSTD;
compressionStrategy = CompressionStrategy_SPEED;
memoryPool = getDefaultPool();
paddingTolerance = 0.0;
2 changes: 1 addition & 1 deletion java/core/src/java/org/apache/orc/OrcConf.java
@@ -52,7 +52,7 @@ public enum OrcConf {
BLOCK_PADDING("orc.block.padding", "hive.exec.orc.default.block.padding",
true,
"Define whether stripes should be padded to the HDFS block boundaries."),
- COMPRESS("orc.compress", "hive.exec.orc.default.compress", "ZLIB",
+ COMPRESS("orc.compress", "hive.exec.orc.default.compress", "ZSTD",
"Define the default compression codec for ORC file"),
WRITE_FORMAT("orc.write.format", "hive.exec.orc.write.format", "0.12",
"Define the version of the file to write. Possible values are 0.11 and\n"+
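
The new default only matters when no codec is configured. The sketch below is a simplified, hypothetical model of that fallback, using `java.util.Properties` in place of ORC's real configuration classes. The class name `CompressDefaultSketch` and the method `resolveCodec` are invented for illustration, and the lookup order (ORC key, then Hive key, then built-in default) is an assumption inferred from the two key names in the `COMPRESS` entry, not a copy of `OrcConf`.

```java
import java.util.Properties;

/** Simplified, hypothetical model of a two-key option lookup; not ORC's actual OrcConf code. */
final class CompressDefaultSketch {
    static final String ORC_KEY = "orc.compress";
    static final String HIVE_KEY = "hive.exec.orc.default.compress";
    // Built-in default after this change; it was "ZLIB" before.
    static final String DEFAULT_CODEC = "ZSTD";

    /** Returns the codec name: ORC key first, then the Hive key, then the built-in default. */
    static String resolveCodec(Properties conf) {
        String value = conf.getProperty(ORC_KEY);
        if (value == null) {
            // Assumed fallback to the Hive-style key when the ORC key is absent.
            value = conf.getProperty(HIVE_KEY);
        }
        return value != null ? value : DEFAULT_CODEC;
    }

    public static void main(String[] args) {
        Properties unset = new Properties();

        Properties pinned = new Properties();
        pinned.setProperty(ORC_KEY, "ZLIB");

        // Nothing configured: the built-in default is used.
        System.out.println(resolveCodec(unset));
        // Explicit setting: the default is ignored.
        System.out.println(resolveCodec(pinned));
    }
}
```

Under this model, a job that sets `orc.compress` explicitly keeps its codec, and only callers that rely on the default move from ZLIB to ZSTD.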
8 changes: 4 additions & 4 deletions java/core/src/test/org/apache/orc/TestVectorOrcFile.java
@@ -538,7 +538,7 @@ public void testStringAndBinaryStatistics(Version fileFormat) throws Exception {

assertEquals(3, stats[1].getNumberOfValues());
assertEquals(15, ((BinaryColumnStatistics) stats[1]).getSum());
- assertEquals("count: 3 hasNull: true bytesOnDisk: 28 sum: 15", stats[1].toString());
+ assertEquals("count: 3 hasNull: true bytesOnDisk: 30 sum: 15", stats[1].toString());
Contributor

It seems that after enabling ZSTD by default, the size becomes larger, right?

Member Author

It's just noise because the test file is too tiny, @deshanxiao.

Member Author

For general data like the following, the zstd output is smaller than the gzip output (101M vs. 102M).

$ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 1000000
$ ls -alh data/generated/sales
total 721968
drwxr-xr-x  5 dongjoon  staff   160B Jan  8 17:27 .
drwxr-xr-x  3 dongjoon  staff    96B Jan  8 17:27 ..
-rw-r--r--  1 dongjoon  staff   102M Jan  8 17:27 orc.gz
-rw-r--r--  1 dongjoon  staff   115M Jan  8 17:27 orc.snappy
-rw-r--r--  1 dongjoon  staff   101M Jan  8 17:27 orc.zstd
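
The replies above attribute the larger `bytesOnDisk` values to the tiny test file. As a rough illustration of why very small inputs are a poor basis for comparing codecs, the sketch below compresses a 3-byte and a 100,000-byte input with the JDK's built-in zlib `Deflater`. This is only a stand-in: the JDK ships no ZSTD codec, and this is not ORC's writer path, so it shows just the general effect that fixed framing overhead can exceed the savings on tiny data. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

/** Hypothetical illustration of compression overhead on tiny inputs (JDK zlib, not ZSTD). */
final class TinyInputOverheadSketch {

    /** Compresses data with the JDK's zlib Deflater and returns the output size in bytes. */
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        // Drain the compressor until it reports that all output has been produced.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // A value comparable in size to the strings used in the unit test.
        byte[] tiny = "bar".getBytes(StandardCharsets.UTF_8);

        // A larger, highly repetitive input where compression clearly pays off.
        byte[] large = new byte[100_000];
        Arrays.fill(large, (byte) 'a');

        System.out.println("tiny:  " + tiny.length + " -> " + deflatedSize(tiny) + " bytes");
        System.out.println("large: " + large.length + " -> " + deflatedSize(large) + " bytes");
    }
}
```

On the tiny input the compressed form is larger than the input, while the large input shrinks substantially, which is why a shift of a few bytes in a unit test says little about real files.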


assertEquals(3, stats[2].getNumberOfValues());
assertEquals("bar", ((StringColumnStatistics) stats[2]).getMinimum());
@@ -1255,7 +1255,7 @@ public void test1(Version fileFormat) throws Exception {
assertEquals(-15.0, ((DoubleColumnStatistics) stats[7]).getMinimum(), 0.0001);
assertEquals(-5.0, ((DoubleColumnStatistics) stats[7]).getMaximum(), 0.0001);
assertEquals(-20.0, ((DoubleColumnStatistics) stats[7]).getSum(), 0.00001);
- assertEquals("count: 2 hasNull: false bytesOnDisk: 15 min: -15.0 max: -5.0 sum: -20.0",
+ assertEquals("count: 2 hasNull: false bytesOnDisk: 19 min: -15.0 max: -5.0 sum: -20.0",
stats[7].toString());

assertEquals("count: 2 hasNull: false bytesOnDisk: " +
@@ -3961,7 +3961,7 @@ public void testEncryptMerge(Version fileFormat) throws Exception {
// test reading with no keys
Reader reader = OrcFile.createReader(merge1, OrcFile.readerOptions(conf));
assertEquals(9 * 1024, reader.getNumberOfRows());
- assertEquals(CompressionKind.ZLIB, reader.getCompressionKind());
+ assertEquals(CompressionKind.ZSTD, reader.getCompressionKind());
assertEquals(1000, reader.getRowIndexStride());
assertEquals(0xc00, reader.getCompressionSize());
assertEquals(fileFormat, reader.getFileVersion());
@@ -4107,7 +4107,7 @@ public void testEncryptMerge(Version fileFormat) throws Exception {

reader = OrcFile.createReader(merge2, OrcFile.readerOptions(conf));
assertEquals(2 * 3 * 1024, reader.getNumberOfRows());
- assertEquals(CompressionKind.ZLIB, reader.getCompressionKind());
+ assertEquals(CompressionKind.ZSTD, reader.getCompressionKind());
assertEquals(0x800, reader.getCompressionSize());
assertEquals(1000, reader.getRowIndexStride());
assertEquals(fileFormat, reader.getFileVersion());
1 change: 1 addition & 0 deletions java/tools/src/test/org/apache/orc/tools/TestFileDump.java
@@ -588,6 +588,7 @@ public void testHasNull() throws Exception {
Writer writer = OrcFile.createWriter(testFilePath,
OrcFile.writerOptions(conf)
.setSchema(schema)
+ .compress(CompressionKind.ZLIB)
.rowIndexStride(1000)
.stripeSize(10000)
.bufferSize(10000));
4 changes: 2 additions & 2 deletions site/_docs/core-java-config.md
@@ -69,7 +69,7 @@ permalink: /docs/core-java-config.html
</tr>
<tr>
<td><code>orc.compress</code></td>
- <td>ZLIB</td>
+ <td>ZSTD</td>
<td>
Define the default compression codec for ORC file
</td>
@@ -396,4 +396,4 @@ permalink: /docs/core-java-config.html
The maximum number of child elements to buffer before the ORC row writer writes the batch to the file.
</td>
</tr>
- </table>
+ </table>
2 changes: 1 addition & 1 deletion site/_docs/hive-config.md
@@ -12,7 +12,7 @@ with the same options.

Key | Default | Notes
:----------------------- | :---------- | :------------------------
- orc.compress | ZLIB | high level compression = {NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD}
+ orc.compress | ZSTD | high level compression = {NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD}
orc.compress.size | 262,144 | compression chunk size
orc.stripe.size | 67,108,864 | memory buffer in bytes for writing
orc.row.index.stride | 10,000 | number of rows between index entries
2 changes: 1 addition & 1 deletion site/_docs/spark-config.md
@@ -12,7 +12,7 @@ with the same options.

Key | Default | Notes
:----------------------- | :---------- | :------------------------
- orc.compress | ZLIB | high level compression = {NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD}
+ orc.compress | ZSTD | high level compression = {NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD}
orc.compress.size | 262,144 | compression chunk size
orc.stripe.size | 67,108,864 | memory buffer in bytes for writing
orc.row.index.stride | 10,000 | number of rows between index entries