
ORC-1577: Use ZSTD as the default compression #1733

Closed

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jan 9, 2024

What changes were proposed in this pull request?

This PR aims to use ZSTD as the default compression from Apache ORC 2.0.0.

Why are the changes needed?

Apache ORC has supported ZStandard since 1.6.0.

ZStandard is known to outperform Gzip in both size and speed.

  • The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro

How was this patch tested?

Pass the CIs.
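The default only matters when nothing is configured. As a rough, self-contained sketch of what this change means for users (the class and method names below are illustrative, not ORC's real API), the codec lookup behaves like:

```java
// Toy sketch, not ORC's actual code: an OrcConf.COMPRESS-style lookup whose
// fallback default moves from ZLIB to ZSTD in ORC 2.0.0. Writers that never
// set "orc.compress" pick up the new default; explicit settings still win.
import java.util.HashMap;
import java.util.Map;

public class DefaultCompressionDemo {
    public enum CompressionKind { NONE, ZLIB, SNAPPY, LZ4, ZSTD }

    // Resolve the codec: an explicit "orc.compress" entry wins, otherwise
    // fall back to the build's default (ZSTD from ORC 2.0.0, ZLIB before).
    public static CompressionKind resolve(Map<String, String> conf, CompressionKind dflt) {
        String v = conf.get("orc.compress");
        return (v == null) ? dflt : CompressionKind.valueOf(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(resolve(conf, CompressionKind.ZSTD)); // unset -> ZSTD
        conf.put("orc.compress", "SNAPPY");
        System.out.println(resolve(conf, CompressionKind.ZSTD)); // explicit -> SNAPPY
    }
}
```

In real code, an explicit `orc.compress` setting (or writer option) continues to override the default, so only jobs relying on the implicit default see a behavior change.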

@dongjoon-hyun dongjoon-hyun added this to the 2.0.0 milestone Jan 9, 2024
@dongjoon-hyun
Member Author

@dongjoon-hyun dongjoon-hyun changed the title ORC-1577: Use ZSTD as the default compression ORC-1577: [Java] Use ZSTD as the default compression Jan 9, 2024
@dongjoon-hyun dongjoon-hyun changed the title ORC-1577: [Java] Use ZSTD as the default compression ORC-1577: Use ZSTD as the default compression Jan 9, 2024
@github-actions github-actions bot added the CPP label Jan 9, 2024
@@ -538,7 +538,7 @@ public void testStringAndBinaryStatistics(Version fileFormat) throws Exception {

  assertEquals(3, stats[1].getNumberOfValues());
  assertEquals(15, ((BinaryColumnStatistics) stats[1]).getSum());
- assertEquals("count: 3 hasNull: true bytesOnDisk: 28 sum: 15", stats[1].toString());
+ assertEquals("count: 3 hasNull: true bytesOnDisk: 30 sum: 15", stats[1].toString());
Contributor

It seems that after enabling ZSTD by default, the size becomes larger, right?

Member Author

It's just noise because the test file is too tiny, @deshanxiao .
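To see why a tiny column can grow by a couple of bytes, here is a minimal stdlib sketch using `java.util.zip.Deflater` as a stand-in codec (the point is generic, not ZSTD-specific): fixed header and checksum overhead dominates when the payload is only a few bytes.

```java
import java.util.zip.Deflater;

public class TinyInputOverhead {
    // Compress with zlib and return the compressed length in bytes.
    public static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] out = new byte[data.length + 64]; // room for framing overhead
        int n = deflater.deflate(out);
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        byte[] tiny = "abc".getBytes();
        // zlib adds a 2-byte header and a 4-byte checksum, so a 3-byte
        // input "compresses" to more than 3 bytes.
        System.out.println("input=" + tiny.length + " compressed=" + compressedSize(tiny));
    }
}
```

On realistically sized data the per-stream overhead is amortized away, which is why the benchmark numbers below are the meaningful comparison.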

Member Author

For general data like the following, zstd is smaller than gzip.

```
$ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 1000000
$ ls -alh data/generated/sales
total 721968
drwxr-xr-x  5 dongjoon  staff   160B Jan  8 17:27 .
drwxr-xr-x  3 dongjoon  staff    96B Jan  8 17:27 ..
-rw-r--r--  1 dongjoon  staff   102M Jan  8 17:27 orc.gz
-rw-r--r--  1 dongjoon  staff   115M Jan  8 17:27 orc.snappy
-rw-r--r--  1 dongjoon  staff   101M Jan  8 17:27 orc.zstd
```

@dongjoon-hyun
Member Author

BTW, did you choose your Apache ID, @deshanxiao ? 😄

dongjoon-hyun added a commit that referenced this pull request Jan 9, 2024
### What changes were proposed in this pull request?

This PR aims to use `ZSTD` as the default compression from Apache ORC 2.0.0.

### Why are the changes needed?

Apache ORC has supported ZStandard since 1.6.0.

ZStandard is known to outperform Gzip in both size and speed.

- _The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro_
    - [Slides](https://www.slideshare.net/databricks/the-rise-of-zstandard-apache-sparkparquetorcavro)
    - [Youtube](https://youtu.be/dTGxhHwjONY)

### How was this patch tested?

Pass the CIs.

Closes #1733 from dongjoon-hyun/ORC-1577.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit baf4c23)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the ORC-1577 branch January 9, 2024 01:31
@deshanxiao
Contributor

BTW, did you choose your Apache ID, @deshanxiao ? 😄

Yes, and I have also replied to that email; my Apache ID is deshanxiao. Thank you @dongjoon-hyun

@dongjoon-hyun
Member Author

Got it!

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 9, 2024

It seems that the ID is not created yet.

Please ping me when your ID is created, @deshanxiao . I can help you with the community-side setup.

@deshanxiao
Contributor

Sure, thank you @dongjoon-hyun

@dongjoon-hyun
Member Author

To @deshanxiao , note that you need to include Craig L. Russell, who requested the ID creation.
Could you double-check that your last reply email includes him (or the secretary)?

@deshanxiao
Contributor

Thanks for the reminder. I checked and he was not included, so I sent another email. @dongjoon-hyun

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
dongjoon-hyun added a commit to apache/spark that referenced this pull request Jan 18, 2024
…benchmarks

### What changes were proposed in this pull request?

This PR aims to use the default ORC compression in data source benchmarks.

### Why are the changes needed?

Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.
- apache/orc#1733
- #44654

`OrcReadBenchmark` was switched to use ZStandard for comparison.
- #44761

This PR also changes the remaining three data source benchmarks.
```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:      .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
dongjoon-hyun added a commit to apache/spark that referenced this pull request Mar 8, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 2.0.0 for Apache Spark 4.0.0.

The Apache ORC community has a 3-year support policy, which is longer than Apache Spark's. The versions are aligned as follows.
- Apache ORC 2.0.x <-> Apache Spark 4.0.x
- Apache ORC 1.9.x <-> Apache Spark 3.5.x
- Apache ORC 1.8.x <-> Apache Spark 3.4.x
- Apache ORC 1.7.x (Supported) <-> Apache Spark 3.3.x (End-Of-Support)

### Why are the changes needed?

**Release Note**
- https://github.com/apache/orc/releases/tag/v2.0.0

**Milestone**
- https://github.com/apache/orc/milestone/20?closed=1
  - apache/orc#1728
  - apache/orc#1801
  - apache/orc#1498
  - apache/orc#1627
  - apache/orc#1497
  - apache/orc#1509
  - apache/orc#1554
  - apache/orc#1708
  - apache/orc#1733
  - apache/orc#1760
  - apache/orc#1743

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45443 from dongjoon-hyun/SPARK-44115.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024