
Conversation


@wangyum wangyum commented Jun 8, 2023

### What changes were proposed in this pull request?

Parquet 1.13.0 supports the `LZ4_RAW` codec. Please see https://issues.apache.org/jira/browse/PARQUET-2196.

This PR adds `lz4raw` to the supported list of `spark.sql.parquet.compression.codec`.

### Why are the changes needed?

Support writing Parquet files with the `lz4raw` compression codec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual testing:

```scala
spark.sql("set spark.sql.parquet.compression.codec=lz4raw")
spark.range(10).write.parquet("/tmp/spark/lz4raw")
spark.read.parquet("/tmp/spark/lz4raw").show(false)
```

```
yumwang@LM-SHC-16508156 lz4raw % ll /tmp/spark/lz4raw
total 16
-rw-r--r--@ 1 yumwang  wheel    0 Jun  8 12:10 _SUCCESS
-rw-r--r--@ 1 yumwang  wheel  487 Jun  8 12:10 part-00000-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
-rw-r--r--@ 1 yumwang  wheel  489 Jun  8 12:10 part-00001-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
```

@github-actions github-actions bot added the SQL label Jun 8, 2023

@dongjoon-hyun dongjoon-hyun left a comment


Thank you, @wangyum. Could you add this new codec here?

```scala
class ParquetCodecSuite extends FileSourceCodecSuite {
  override def format: String = "parquet"
  override val codecConfigName: String = SQLConf.PARQUET_COMPRESSION.key
  // Exclude "lzo" because it is GPL-licensed so not included in Hadoop.
  // Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
  // on Maven Central.
  override protected def availableCodecs: Seq[String] = {
    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4")
  }
}
```
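
The requested change amounts to appending the new name to that list. A minimal sketch with plain strings (the real suite extends `FileSourceCodecSuite`, which is not reproduced here):

```scala
// Sketch only: the codec list as plain strings, with the new "lz4raw"
// entry appended. In the real ParquetCodecSuite this Seq is the
// availableCodecs override shown above.
val availableCodecs: Seq[String] =
  Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4raw")

println(availableCodecs.mkString(", "))
```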

```scala
"snappy" -> CompressionCodecName.SNAPPY,
"gzip" -> CompressionCodecName.GZIP,
"lzo" -> CompressionCodecName.LZO,
"lz4" -> CompressionCodecName.LZ4,
```
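
For context, the new entry pairs the Spark-facing name with Parquet's `LZ4_RAW` constant. A hedged sketch using plain strings in place of the real `org.apache.parquet.hadoop.metadata.CompressionCodecName` enum (assumption: `CompressionCodecName.LZ4_RAW` is available as of parquet-mr 1.13.0):

```scala
// Sketch: Spark conf values mapped to Parquet codec enum names, modeled
// as strings so this snippet stays self-contained.
val shortParquetCompressionCodecNames: Map[String, String] = Map(
  "snappy" -> "SNAPPY",
  "gzip"   -> "GZIP",
  "lzo"    -> "LZO",
  "brotli" -> "BROTLI",
  "lz4"    -> "LZ4",
  "lz4raw" -> "LZ4_RAW", // the entry this PR adds
  "zstd"   -> "ZSTD"
)
```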
Member


May I ask why we need to move this line?

Member Author


In order to keep the order consistent:

```scala
"`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
"snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
```

Member


Got it. It makes sense.


@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (Pending CIs)


@LuciferYang LuciferYang left a comment


+1, LGTM

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-43273][SQL] Support writing Parquet files with lz4raw compression codec [SPARK-43273][SQL] Support lz4raw compression codec for Parquet Jun 8, 2023
@dongjoon-hyun

The first commit already passed all tests (except pyspark-pandas-slow-connect), and I verified the second commit manually. Merged to master.

```
[info] ParquetCodecSuite:
01:42:55.154 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[info] - write and read - file source parquet - codec: none (10 seconds, 419 milliseconds)
[info] - write and read - file source parquet - codec: uncompressed (1 second, 513 milliseconds)
[info] - write and read - file source parquet - codec: snappy (1 second, 457 milliseconds)
[info] - write and read - file source parquet - codec: gzip (1 second, 201 milliseconds)
[info] - write and read - file source parquet - codec: zstd (1 second, 435 milliseconds)
[info] - write and read - file source parquet - codec: lz4 (1 second, 163 milliseconds)
[info] - write and read - file source parquet - codec: lz4raw (1 second, 282 milliseconds)
01:43:14.972 WARN org.apache.spark.sql.execution.datasources.ParquetCodecSuite:

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.ParquetCodecSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), QueryStageCreator-0 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), QueryStageCreator-5 (daemon=true), QueryStageCreator-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), QueryStageCreator-1 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), QueryStageCreator-6 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), QueryStageCreator-2 (daemon=true), QueryStageCreator-3 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true) =====
[info] Run completed in 22 seconds, 883 milliseconds.
[info] Total number of tests run: 7
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 487 s (08:07), completed Jun 8, 2023 1:43:15 AM
```

@wangyum wangyum deleted the SPARK-43273 branch June 8, 2023 09:17
czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
### What changes were proposed in this pull request?

Parquet 1.13.0 supports `LZ4_RAW` codec. Please see https://issues.apache.org/jira/browse/PARQUET-2196.

This PR adds `lz4raw` to the supported list of `spark.sql.parquet.compression.codec`.

### Why are the changes needed?

Support writing Parquet files with `lz4raw` compression codec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual testing:
```scala
spark.sql("set spark.sql.parquet.compression.codec=lz4raw")
spark.range(10).write.parquet("/tmp/spark/lz4raw")
spark.read.parquet("/tmp/spark/lz4raw").show(false)
```

```
yumwang@LM-SHC-16508156 lz4raw % ll /tmp/spark/lz4raw
total 16
-rw-r--r--@ 1 yumwang  wheel    0 Jun  8 12:10 _SUCCESS
-rw-r--r--@ 1 yumwang  wheel  487 Jun  8 12:10 part-00000-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
-rw-r--r--@ 1 yumwang  wheel  489 Jun  8 12:10 part-00001-c6786f4d-b5a6-406d-96a1-37bf0ceeeac7-c000.lz4raw.parquet
```

Closes apache#41507 from wangyum/SPARK-43273.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
beliefer added a commit that referenced this pull request Oct 20, 2023
…n codec lz4raw

### What changes were proposed in this pull request?
#41507 supported the new Parquet compression codec `lz4raw`. But `lz4raw` is not a correct Parquet compression codec name.

This mistake causes errors. Please refer to https://github.com/apache/spark/pull/43310/files#r1352405312

The root cause is that Parquet uses `lz4_raw` as its name and stores it into the metadata of the Parquet file. Please refer to https://github.com/apache/spark/blob/6373f19f537f69c6460b2e4097f19903c01a608f/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecPrecedenceSuite.scala#L65

We should use `lz4_raw` as its name.
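
The direction of the fix can be sketched as a rename plus a backward-compatible alias. This is a hypothetical helper for illustration, not the actual patch in #43310:

```scala
// Hypothetical normalization: accept the old misspelled "lz4raw" for
// backward compatibility, but resolve to Parquet's real enum name LZ4_RAW.
def toParquetCodecName(confValue: String): String =
  confValue.trim.toLowerCase match {
    case "lz4raw" | "lz4_raw"    => "LZ4_RAW" // Parquet's actual name
    case "none" | "uncompressed" => "UNCOMPRESSED"
    case other                   => other.toUpperCase // snappy -> SNAPPY, ...
  }
```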

### Why are the changes needed?
Fix the bug that uses the incorrect Parquet compression codec name `lz4raw`.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Fix a bug.

### How was this patch tested?
New test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43310 from beliefer/SPARK-45484.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>