# ORC-1577: Use ZSTD as the default compression (#1733)
Conversation
cc @williamhyun, @wgtmac, @guiyanakuang, @deshanxiao
```diff
@@ -538,7 +538,7 @@ public void testStringAndBinaryStatistics(Version fileFormat) throws Exception {

   assertEquals(3, stats[1].getNumberOfValues());
   assertEquals(15, ((BinaryColumnStatistics) stats[1]).getSum());
-  assertEquals("count: 3 hasNull: true bytesOnDisk: 28 sum: 15", stats[1].toString());
+  assertEquals("count: 3 hasNull: true bytesOnDisk: 30 sum: 15", stats[1].toString());
```
It seems that after enabling ZSTD by default, the size becomes larger, right?
It's noise because the test file is too tiny, @deshanxiao.
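The effect is easy to reproduce with any codec: for a tiny payload, the codec's framing and header overhead dominates the output size, so "compressed" bytes can exceed the input. A minimal sketch using `zlib` as a stand-in codec (chosen only because it is in the Python standard library; the same overhead argument applies to ZSTD):

```python
import zlib

# A tiny payload grows after compression: the stream header and
# checksum overhead outweigh any savings on 3 bytes of data. This is
# the same kind of noise as the 28-vs-30-byte change in the test
# expectation above; it says nothing about codec quality.
payload = b"abc"
compressed = zlib.compress(payload)
print(len(payload), len(compressed))
assert len(compressed) > len(payload)
```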
For general data like the following, `zstd` is smaller than `gzip`.
```
$ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 1000000
$ ls -alh data/generated/sales
total 721968
drwxr-xr-x  5 dongjoon  staff   160B Jan  8 17:27 .
drwxr-xr-x  3 dongjoon  staff    96B Jan  8 17:27 ..
-rw-r--r--  1 dongjoon  staff   102M Jan  8 17:27 orc.gz
-rw-r--r--  1 dongjoon  staff   115M Jan  8 17:27 orc.snappy
-rw-r--r--  1 dongjoon  staff   101M Jan  8 17:27 orc.zstd
```
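As a quick sanity check on those numbers (sizes transcribed from the `ls -alh` output above; a rough back-of-the-envelope sketch, not a benchmark):

```python
# File sizes reported above, in MB (1 MB granularity from `ls -alh`).
sizes_mb = {"gzip": 102, "snappy": 115, "zstd": 101}

# Relative size savings of zstd against each alternative:
# roughly 1% smaller than gzip and 12% smaller than snappy here.
for codec in ("gzip", "snappy"):
    saving = 100 * (sizes_mb[codec] - sizes_mb["zstd"]) / sizes_mb[codec]
    print(f"zstd vs {codec}: {saving:.1f}% smaller")
```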
BTW, did you choose your Apache ID, @deshanxiao? 😄
### What changes were proposed in this pull request?

This PR aims to use `ZSTD` as the default compression from Apache ORC 2.0.0.

### Why are the changes needed?

Apache ORC has been supporting ZStandard since 1.6.0. ZStandard is known to be better than Gzip in terms of both size and speed.

- _The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro_
  - [Slides](https://www.slideshare.net/databricks/the-rise-of-zstandard-apache-sparkparquetorcavro)
  - [Youtube](https://youtu.be/dTGxhHwjONY)

### How was this patch tested?

Pass the CIs.

Closes #1733 from dongjoon-hyun/ORC-1577.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit baf4c23)
Signed-off-by: Dongjoon Hyun <[email protected]>
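Users who prefer the previous behavior can still pin the codec explicitly after upgrading. A minimal configuration sketch, assuming ORC's long-standing `orc.compress` writer key (Spark exposes an analogous `orc.compression` option; verify the exact key against the configuration docs for your version):

```
orc.compress=ZLIB
```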
Yes, and I have also replied to that email; my Apache ID is deshanxiao. Thank you, @dongjoon-hyun.
Got it!
It seems that the ID is not created yet.
Please ping me when your ID is created, @deshanxiao. I can help you with the community-side setup.
Sure, thank you @dongjoon-hyun
To @deshanxiao, note that you need to include
Thanks for the reminder. I checked and it was not included, so I sent another email. @dongjoon-hyun
…benchmarks

### What changes were proposed in this pull request?

This PR aims to use the default ORC compression in data source benchmarks.

### Why are the changes needed?

Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.

- apache/orc#1733
- #44654

`OrcReadBenchmark` was switched to use ZStandard for comparison.

- #44761

And, this PR aims to change the remaining three data source benchmarks.

```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:        OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:    .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 2.0.0 for Apache Spark 4.0.0. The Apache ORC community has a 3-year support policy, which is longer than Apache Spark's. The versions are aligned like the following.

- Apache ORC 2.0.x <-> Apache Spark 4.0.x
- Apache ORC 1.9.x <-> Apache Spark 3.5.x
- Apache ORC 1.8.x <-> Apache Spark 3.4.x
- Apache ORC 1.7.x (Supported) <-> Apache Spark 3.3.x (End-Of-Support)

### Why are the changes needed?

**Release Note**
- https://github.com/apache/orc/releases/tag/v2.0.0

**Milestone**
- https://github.com/apache/orc/milestone/20?closed=1
  - apache/orc#1728
  - apache/orc#1801
  - apache/orc#1498
  - apache/orc#1627
  - apache/orc#1497
  - apache/orc#1509
  - apache/orc#1554
  - apache/orc#1708
  - apache/orc#1733
  - apache/orc#1760
  - apache/orc#1743

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45443 from dongjoon-hyun/SPARK-44115.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>