
Conversation


@wangyum wangyum commented Jun 8, 2023

### What changes were proposed in this pull request?

Parquet 1.13.0 supports the `LZ4_RAW` codec. Please see https://issues.apache.org/jira/browse/PARQUET-2196.

This PR adds `lz4raw` to the supported list of `spark.sql.parquet.compression.codec`.

### Why are the changes needed?

Support writing Parquet files with the `lz4raw` compression codec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual testing:

```scala
spark.sql("set spark.sql.parquet.compression.codec=lz4raw")
spark.range(10).write.parquet("/tmp/spark/lz4raw")
spark.read.parquet("/tmp/spark/lz4raw").show(false)
```

```
yumwang@LM-SHC-16508156 lz4raw % ll /tmp/spark/lz4raw
total 16
-rw-r--r--@ 1 yumwang  wheel    0 Jun  8 12:10 _SUCCESS
-rw-r--r--@ 1 yumwang  wheel  487 Jun  8 12:10 part-00000-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
-rw-r--r--@ 1 yumwang  wheel  489 Jun  8 12:10 part-00001-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
```

@github-actions github-actions bot added the SQL label Jun 8, 2023

@dongjoon-hyun dongjoon-hyun left a comment


Thank you, @wangyum. Could you add this new codec here?

```scala
class ParquetCodecSuite extends FileSourceCodecSuite {
  override def format: String = "parquet"
  override val codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
  // Exclude "lzo" because it is GPL-licensed so not included in Hadoop.
  // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
  // on Maven Central.
  override protected def availableCodecs: Seq[String] = {
    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4")
  }
}
```
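
The requested change amounts to appending the new name to that list. A minimal sketch with plain strings (the real suite extends `FileSourceCodecSuite`, which is not reproduced here):

```scala
// Sketch only: the codec list as plain strings, with the new "lz4raw"
// entry appended. In the real ParquetCodecSuite this Seq is the
// availableCodecs override shown above.
val availableCodecs: Seq[String] =
  Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4raw")

println(availableCodecs.mkString(", "))
```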

```scala
"snappy" -> CompressionCodecName.SNAPPY,
"gzip" -> CompressionCodecName.GZIP,
"lzo" -> CompressionCodecName.LZO,
"lz4" -> CompressionCodecName.LZ4,
```
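
For context, the new entry pairs the Spark-facing name with Parquet's `LZ4_RAW` constant. A hedged sketch using plain strings in place of the real `org.apache.parquet.hadoop.metadata.CompressionCodecName` enum (assumption: `CompressionCodecName.LZ4_RAW` is available as of parquet-mr 1.13.0):

```scala
// Sketch: Spark conf values mapped to Parquet codec enum names, modeled
// as strings so this snippet stays self-contained.
val shortParquetCompressionCodecNames: Map[String, String] = Map(
  "snappy" -> "SNAPPY",
  "gzip"   -> "GZIP",
  "lzo"    -> "LZO",
  "brotli" -> "BROTLI",
  "lz4"    -> "LZ4",
  "lz4raw" -> "LZ4_RAW", // the entry this PR adds
  "zstd"   -> "ZSTD"
)
```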
Member


May I ask why we need to move this line?

Member Author


In order to keep the order consistent:

```scala
"`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
"snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
```

Member


Got it. It makes sense.


@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (Pending CIs)


@LuciferYang LuciferYang left a comment


+1, LGTM

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-43273][SQL] Support writing Parquet files with lz4raw compression codec [SPARK-43273][SQL] Support lz4raw compression codec for Parquet Jun 8, 2023
@dongjoon-hyun

The first commit already passed all tests (except pyspark-pandas-slow-connect), and I verified the second commit manually. Merged to master.

```
[info] ParquetCodecSuite:
01:42:55.154 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[info] - write and read - file source parquet - codec: none (10 seconds, 419 milliseconds)
[info] - write and read - file source parquet - codec: uncompressed (1 second, 513 milliseconds)
[info] - write and read - file source parquet - codec: snappy (1 second, 457 milliseconds)
[info] - write and read - file source parquet - codec: gzip (1 second, 201 milliseconds)
[info] - write and read - file source parquet - codec: zstd (1 second, 435 milliseconds)
[info] - write and read - file source parquet - codec: lz4 (1 second, 163 milliseconds)
[info] - write and read - file source parquet - codec: lz4raw (1 second, 282 milliseconds)
01:43:14.972 WARN org.apache.spark.sql.execution.datasources.ParquetCodecSuite:

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.ParquetCodecSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), QueryStageCreator-0 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), QueryStageCreator-5 (daemon=true), QueryStageCreator-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), QueryStageCreator-1 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), QueryStageCreator-6 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), QueryStageCreator-2 (daemon=true), QueryStageCreator-3 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true) =====
[info] Run completed in 22 seconds, 883 milliseconds.
[info] Total number of tests run: 7
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 487 s (08:07), completed Jun 8, 2023 1:43:15 AM
```

@wangyum wangyum deleted the SPARK-43273 branch June 8, 2023 09:17
czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
### What changes were proposed in this pull request?

Parquet 1.13.0 supports `LZ4_RAW` codec. Please see https://issues.apache.org/jira/browse/PARQUET-2196.

This PR adds `lz4raw` to the supported list of `spark.sql.parquet.compression.codec`.

### Why are the changes needed?

Support writing Parquet files with `lz4raw` compression codec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual testing:
```scala
spark.sql("set spark.sql.parquet.compression.codec=lz4raw")
spark.range(10).write.parquet("/tmp/spark/lz4raw")
spark.read.parquet("/tmp/spark/lz4raw").show(false)
```

```
yumwang@LM-SHC-16508156 lz4raw % ll /tmp/spark/lz4raw
total 16
-rw-r--r--@ 1 yumwang  wheel    0 Jun  8 12:10 _SUCCESS
-rw-r--r--@ 1 yumwang  wheel  487 Jun  8 12:10 part-00000-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
-rw-r--r--@ 1 yumwang  wheel  489 Jun  8 12:10 part-00001-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
```

Closes apache#41507 from wangyum/SPARK-43273.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
beliefer added a commit that referenced this pull request Oct 20, 2023
…n codec lz4raw

### What changes were proposed in this pull request?
#41507 supported the new Parquet compression codec `lz4raw`. But `lz4raw` is not a correct Parquet compression codec name.

This mistake causes errors. Please refer to https://github.com/apache/spark/pull/43310/files#r1352405312

The root cause is that Parquet uses `lz4_raw` as its name and stores it into the metadata of the Parquet file. Please refer to https://github.com/apache/spark/blob/6373f19f537f69c6460b2e4097f19903c01a608f/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecPrecedenceSuite.scala#L65

We should use `lz4_raw` as its name.
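
The direction of the fix can be sketched as a rename plus a backward-compatible alias. This is a hypothetical helper for illustration, not the actual patch in #43310:

```scala
// Hypothetical normalization: accept the old misspelled "lz4raw" for
// backward compatibility, but resolve to Parquet's real enum name LZ4_RAW.
def toParquetCodecName(confValue: String): String =
  confValue.trim.toLowerCase match {
    case "lz4raw" | "lz4_raw"    => "LZ4_RAW" // Parquet's actual name
    case "none" | "uncompressed" => "UNCOMPRESSED"
    case other                   => other.toUpperCase // snappy -> SNAPPY, ...
  }
```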

### Why are the changes needed?
Fix the bug that uses the incorrect Parquet compression codec name `lz4raw`.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Fix a bug.

### How was this patch tested?
New test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43310 from beliefer/SPARK-45484.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>