[SPARK-43273][SQL] Support lz4raw compression codec for Parquet
#41507
Conversation
dongjoon-hyun
left a comment
Thank you, @wangyum. Could you add this new codec here?
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCodecSuite.scala
Lines 54 to 64 in 2adb8e1
```scala
class ParquetCodecSuite extends FileSourceCodecSuite {
  override def format: String = "parquet"
  override val codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
  // Exclude "lzo" because it is GPL-licenced so not included in Hadoop.
  // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
  // on Maven Central.
  override protected def availableCodecs: Seq[String] = {
    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4")
  }
}
```
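For reference, a minimal sketch of what the suite might look like with the new codec appended to the list, assuming the `lz4raw` short name introduced by this PR (the PR's actual test change is not shown on this page):

```scala
import org.apache.spark.sql.internal.SQLConf

// Hypothetical sketch of the updated suite: the new codec is appended so the
// read/write round-trip test also covers it. "lzo" and "brotli" stay excluded
// for the licensing/dependency reasons noted in the comments above.
class ParquetCodecSuite extends FileSourceCodecSuite {
  override def format: String = "parquet"
  override val codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
  override protected def availableCodecs: Seq[String] = {
    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4raw")
  }
}
```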
| "snappy" -> CompressionCodecName.SNAPPY, | ||
| "gzip" -> CompressionCodecName.GZIP, | ||
| "lzo" -> CompressionCodecName.LZO, | ||
| "lz4" -> CompressionCodecName.LZ4, |
May I ask why we need to move this line?
In order to keep the order consistent:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 986 to 987 in 4e78ff2
| "`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " + | |
| "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.") |
Got it. It makes sense.
dongjoon-hyun
left a comment
+1, LGTM (Pending CIs)
LuciferYang
left a comment
+1, LGTM
The first commit passed all tests already (except pyspark-pandas-slow-connect). And I verified the second commit manually. Merged to master.
### What changes were proposed in this pull request?

Parquet 1.13.0 supports `LZ4_RAW` codec. Please see https://issues.apache.org/jira/browse/PARQUET-2196.

This PR adds `lz4raw` to the supported list of `spark.sql.parquet.compression.codec`.

### Why are the changes needed?

Support writing Parquet files with `lz4raw` compression codec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual testing:

```scala
spark.sql("set spark.sql.parquet.compression.codec=lz4raw")
spark.range(10).write.parquet("/tmp/spark/lz4raw")
spark.read.parquet("/tmp/spark/lz4raw").show(false)
```

```
yumwangLM-SHC-16508156 lz4raw % ll /tmp/spark/lz4raw
total 16
-rw-r--r-- 1 yumwang wheel 0 Jun 8 12:10 _SUCCESS
-rw-r--r-- 1 yumwang wheel 487 Jun 8 12:10 part-00000-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
-rw-r--r-- 1 yumwang wheel 489 Jun 8 12:10 part-00001-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
```

Closes apache#41507 from wangyum/SPARK-43273.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
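As a usage note (not part of this PR's diff), the codec can also be chosen per write through the Parquet data source's `compression` option instead of the session configuration; a small sketch with a placeholder output path:

```scala
// Sketch: set the codec for a single write via the data source option.
// The output path is a placeholder.
spark.range(10)
  .write
  .option("compression", "lz4raw")
  .parquet("/tmp/spark/lz4raw-option")
```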
…n codec lz4raw

### What changes were proposed in this pull request?

#41507 supported the new parquet compression codec `lz4raw`. But `lz4raw` is not a correct parquet compression codec name, and this mistake causes an error. Please refer to https://github.com/apache/spark/pull/43310/files#r1352405312

The root cause is that parquet uses `lz4_raw` as its name and stores it in the metadata of the parquet file. Please refer to https://github.com/apache/spark/blob/6373f19f537f69c6460b2e4097f19903c01a608f/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecPrecedenceSuite.scala#L65

We should use `lz4_raw` as its name.

### Why are the changes needed?

Fix the bug that uses the incorrect parquet compression codec name `lz4raw`.

### Does this PR introduce _any_ user-facing change?

'Yes'. Fix a bug.

### How was this patch tested?

New test cases.

### Was this patch authored or co-authored using generative AI tooling?

'No'.

Closes #43310 from beliefer/SPARK-45484.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
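To see why the name matters, here is a hedged sketch (not taken from either PR) that inspects the codec Parquet records in the file footer using the parquet-mr API; the input path is a placeholder:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder path pointing at one of the files written in the manual test above.
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/spark/lz4raw/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  // Parquet reports the codec per column chunk; files written with this feature
  // report LZ4_RAW, which is why the Spark-facing name was later aligned to lz4_raw.
  val codec = reader.getFooter.getBlocks.get(0).getColumns.get(0).getCodec
  println(codec) // e.g. LZ4_RAW
} finally {
  reader.close()
}
```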
What changes were proposed in this pull request?
Parquet 1.13.0 supports `LZ4_RAW` codec. Please see https://issues.apache.org/jira/browse/PARQUET-2196. This PR adds `lz4raw` to the supported list of `spark.sql.parquet.compression.codec`.

Why are the changes needed?
Support writing Parquet files with `lz4raw` compression codec.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Unit test and manual testing:
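The manual verification commands, reproduced here from the merge commit message above since the rendered description is cut off at this point:

```scala
spark.sql("set spark.sql.parquet.compression.codec=lz4raw")
spark.range(10).write.parquet("/tmp/spark/lz4raw")
spark.read.parquet("/tmp/spark/lz4raw").show(false)
```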