
Conversation


@dongjoon-hyun dongjoon-hyun commented Feb 23, 2021

What changes were proposed in this pull request?

Apache Spark 3.0 introduced the spark.eventLog.compression.codec configuration.
For Apache Spark 3.2, this PR aims to set zstd as the default value of spark.eventLog.compression.codec.
This only affects newly created log files.
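For reference, a minimal sketch of how a user opts in to event log compression. The property names are real Spark configurations; the commented-out line shows how to pin the previous codec explicitly:

```properties
# spark-defaults.conf (illustrative)
spark.eventLog.enabled             true
spark.eventLog.compress            true
# With this PR the codec defaults to zstd; uncomment to keep the
# previous lz4 behavior:
# spark.eventLog.compression.codec lz4
```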

Why are the changes needed?

The main purpose of event logs is archiving. Many logs are generated and occupy storage, but most of them are never accessed by users.

1. Save storage resources (and money)

In general, ZSTD output is much smaller than LZ4 output.
For example, for a TPCDS (Scale 200) log, ZSTD generates log files about 3 times smaller than LZ4 does.

CODEC SIZE (bytes)
LZ4 184001434
ZSTD 64522396

And the uncompressed file is 17.6 times bigger than the ZSTD one.

-rw-r--r--    1 dongjoon  staff  1135464691 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679
-rw-r--r--    1 dongjoon  staff    64522396 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679.zstd
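The quoted ratios can be checked with a quick back-of-envelope calculation, using the byte counts copied from the tables above:

```python
# Sizes in bytes, copied from the measurements in this PR description.
plain = 1_135_464_691  # uncompressed event log
lz4   = 184_001_434    # LZ4-compressed event log
zstd  = 64_522_396     # ZSTD-compressed event log

# ZSTD output is ~2.85x smaller than LZ4 ("about 3 times smaller"),
# and the plain file is ~17.6x bigger than the ZSTD one.
print(f"LZ4  / ZSTD: {lz4 / zstd:.2f}x")
print(f"plain / ZSTD: {plain / zstd:.1f}x")
```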

2. Better Usability

We cannot decompress Spark-generated LZ4 event log files via the CLI, while we can for ZSTD event log files. Spark's LZ4 event log files are inconvenient for users who want to uncompress and inspect them.

$ lz4 -d spark-d3deba027bd34435ba849e14fc2c42ef.lz4
Decoding file spark-d3deba027bd34435ba849e14fc2c42ef
Error 44 : Unrecognized header : file cannot be decoded
$ zstd -d spark-a1843ead29834f46b1125a03eca32679.zstd
spark-a1843ead29834f46b1125a03eca32679.zstd: 1135464691 bytes

3. Speed
The following results were collected by running lzbench on the above Spark event log. Note that

  • This is not a direct comparison of Spark's compression/decompression codecs.
  • lzbench is an in-memory benchmark, so it doesn't show the benefit of reduced network traffic due to ZSTD's smaller size.

Here,

  • To get the ZSTD 1.4.8-1 result, the lzbench master branch was used because Spark uses ZSTD 1.4.8.
  • To get the LZ4 1.7.5 result, the lzbench v1.7 branch was used because Spark uses LZ4 1.7.1.

Compressor name      Compress. Decompress. Compr. size  Ratio Filename
memcpy               7393 MB/s  7166 MB/s  1135464691 100.00 spark-a1843ead29834f46b1125a03eca32679
zstd 1.4.8 -1        1344 MB/s  3351 MB/s    56665767   4.99 spark-a1843ead29834f46b1125a03eca32679
lz4 1.7.5            1385 MB/s  4782 MB/s   127662168  11.24 spark-a1843ead29834f46b1125a03eca32679
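The "Ratio" column in lzbench output is the compressed size expressed as a percentage of the original; the table's figures can be reproduced from its own size columns:

```python
# Sizes in bytes, copied from the lzbench table above.
original  = 1_135_464_691  # memcpy row, i.e. the uncompressed log
zstd_size = 56_665_767     # zstd 1.4.8 -1
lz4_size  = 127_662_168    # lz4 1.7.5

# Ratio = compressed size / original size, as a percentage.
print(f"zstd 1.4.8 -1: {100 * zstd_size / original:.2f}")  # matches 4.99
print(f"lz4 1.7.5:     {100 * lz4_size / original:.2f}")   # matches 11.24
```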

Does this PR introduce any user-facing change?

  • No for apps that don't use spark.eventLog.compress, because spark.eventLog.compress is disabled by default.
  • No for apps that set spark.eventLog.compression.codec explicitly, because this PR only changes the default value.
  • Yes for apps that enable spark.eventLog.compress without setting spark.eventLog.compression.codec. In this case, the value of spark.io.compression.codec, whose default is lz4, was previously used.
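The three cases above can be summarized as a tiny resolution function. This is a simplified sketch of the default-value logic, not Spark's actual code; the function name is illustrative:

```python
from typing import Optional

def effective_event_log_codec(conf: dict) -> Optional[str]:
    """Sketch of how the event-log write codec is resolved after this PR."""
    if conf.get("spark.eventLog.compress", "false") != "true":
        return None  # compression is still disabled by default
    # After this PR, spark.eventLog.compression.codec defaults to "zstd".
    # Before, it fell back to spark.io.compression.codec (default "lz4").
    return conf.get("spark.eventLog.compression.codec", "zstd")

print(effective_event_log_codec({}))                                       # None
print(effective_event_log_codec({"spark.eventLog.compress": "true"}))      # zstd
print(effective_event_log_codec({"spark.eventLog.compress": "true",
                                 "spark.eventLog.compression.codec": "lz4"}))  # lz4
```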

So this JIRA issue, SPARK-34503, is labeled with releasenotes.

How was this patch tested?

Pass the updated UT.


SparkQA commented Feb 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39945/


SparkQA commented Feb 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39946/

Member

@viirya viirya left a comment


Will this affect reading old event log in a case like upgrading Spark?

@dongjoon-hyun
Member Author

No~ This only decides the write codec for new logs.


SparkQA commented Feb 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39946/

@dongjoon-hyun
Member Author

I updated the indirect benchmark result by using lzbench.


SparkQA commented Feb 23, 2021

Test build #135365 has finished for PR 31618 at commit 6ceea1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment


As shown in the description, using zstd for event log compression has an obvious benefit. It also only affects the writer, and EventLogFileReader will detect the codec itself, so there seems to be no compatibility issue.
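The reader side is compatible because the codec is inferred from the log file's extension, which is why old logs remain readable after an upgrade. A rough sketch of that idea (illustrative, not Spark's actual EventLogFileReader code):

```python
from typing import Optional

def codec_from_log_path(path: str) -> Optional[str]:
    """Infer the event-log codec from the file-name extension; None means uncompressed."""
    name = path.rsplit("/", 1)[-1]
    if "." not in name:
        return None  # e.g. "spark-<appId>" written without compression
    return name.rsplit(".", 1)[-1]

print(codec_from_log_path("spark-a1843ead29834f46b1125a03eca32679.zstd"))  # zstd
print(codec_from_log_path("spark-a1843ead29834f46b1125a03eca32679"))       # None
```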


SparkQA commented Feb 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39945/

@dongjoon-hyun
Member Author

Thank you, @viirya !

@HyukjinKwon
Member

I think it's not an obvious win, though. Zstd looks more suited for archiving purposes, trading throughput for a higher compression ratio, whereas lz4 favors throughput over compression.

The main purpose of event logs is archiving. Many logs are generated and occupy the storage, but most of them are never accessed by users.

But I tend to agree with this. cc @HeartSaVioR or @tgravescs too FYI in case you guys have different thoughts on this.


SparkQA commented Feb 23, 2021

Test build #135366 has finished for PR 31618 at commit 54f14cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


dongjoon-hyun commented Feb 23, 2021

Hi, @HyukjinKwon . Why do you think so?

I think it's not an obvious win though .. Zstd looks more for archiving purpose with less throughput with high compression ratio vs lz4 is for more throughput with less compression.

According to the benchmark,

  • LZ4 1.7.5 compression time is not a clear winner. If you factor in the upload time to remote storage, ZSTD can be the winner.
  • LZ4 1.7.5 decompression time might be your reason. However, this is an event log.
    • When you download a log from the Spark History Server, a ZSTD log file will download 2~3x faster.
    • Also, when you view the log via the Spark History Server, the server itself needs to download it from remote storage such as S3 and decompress it. The 2~3x faster download compensates for the slower decompression.

In addition, for storage cost savings, ZSTD is a clear winner.
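The download-vs-decompression trade-off can be put into a rough back-of-envelope calculation. It mixes the Spark-generated file sizes with the lzbench decompression speeds quoted above, and the 100 MB/s network bandwidth is purely an assumption, so the result is only indicative:

```python
MB = 1_000_000
plain   = 1_135_464_691                                 # uncompressed log size, bytes
sizes   = {"lz4": 184_001_434, "zstd": 64_522_396}      # Spark-generated compressed sizes
decomp  = {"lz4": 4782 * MB, "zstd": 3351 * MB}         # lzbench decompression speed, bytes/s
network = 100 * MB                                      # assumed download bandwidth, bytes/s

# End-to-end time = download the compressed file + decompress it in full.
totals = {c: sizes[c] / network + plain / decomp[c] for c in sizes}
for codec, t in totals.items():
    print(f"{codec}: download + decompress ~ {t:.2f}s")
# lz4 ~ 2.08s, zstd ~ 0.98s under these assumptions: the smaller download
# more than compensates for ZSTD's slower decompression.
```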

@HeartSaVioR
Contributor

I agree that the event log directory is most likely placed on remote storage in practice, and in that case reducing the size would reduce the overall latency. It would be appreciated if we could see a direct benchmark (compress and send to S3, then receive from S3 and decompress), probably run 10~100 times each with the median taken, but that's optional; I tend to agree that the small difference in compression/decompression speed can be offset by the reduced network cost.

Btw,

$ lz4 -d spark-d3deba027bd34435ba849e14fc2c42ef.lz4
Decoding file spark-d3deba027bd34435ba849e14fc2c42ef
Error 44 : Unrecognized header : file cannot be decoded

makes me feel Spark does something wrong with lz4, or lz4 has variants that aren't compatible. Does anyone know why this doesn't work?


SparkQA commented Feb 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39970/


SparkQA commented Feb 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39970/

@HyukjinKwon
Member

Okie. I'm good with this change.

@dongjoon-hyun
Member Author

Thank you, @HeartSaVioR and @HyukjinKwon .

BTW, for lz4, it looks like a header issue. The Apache Parquet community has also been having some interesting discussions about this over the last two weeks.

@dongjoon-hyun
Member Author

Merged to master for Apache Spark 3.2.0!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-34503 branch February 24, 2021 00:39

SparkQA commented Feb 24, 2021

Test build #135390 has finished for PR 31618 at commit 0e88652.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<td>
The codec to compress logged events. If this is not given,
<code>spark.io.compression.codec</code> will be used.
The codec to compress logged events.
Contributor

@tgravescs left a comment

Sorry for coming in late; I was out last week. We may want to reference what other codecs can be used here. @dongjoon-hyun, thoughts?

Member Author

Thank you for review, @tgravescs . Sure, I'll make a documentation follow-up.

dongjoon-hyun added a commit that referenced this pull request Mar 1, 2021
… compression

### What changes were proposed in this pull request?

This PR is a follow-up of #31618 to document the available codecs for event log compression.

### Why are the changes needed?

Documentation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #31695 from dongjoon-hyun/SPARK-34503-DOC.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>