
Conversation


@dongjoon-hyun dongjoon-hyun commented Feb 23, 2021

What changes were proposed in this pull request?

Apache Spark 3.0 introduced the spark.eventLog.compression.codec configuration.
For Apache Spark 3.2, this PR aims to set zstd as the default value of spark.eventLog.compression.codec.
This only affects newly created log files.
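For reference, a minimal sketch of how a user opts in to event log compression. The property names are real Spark configurations; the commented-out line shows how to pin the previous codec explicitly:

```properties
# spark-defaults.conf (illustrative)
spark.eventLog.enabled             true
spark.eventLog.compress            true
# With this PR the codec defaults to zstd; uncomment to keep the
# previous lz4 behavior:
# spark.eventLog.compression.codec lz4
```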

Why are the changes needed?

The main purpose of event logs is archiving. Many logs are generated and occupy storage, but most of them are never accessed by users.

1. Save storage resources (and money)

In general, ZSTD output is much smaller than LZ4 output.
For example, for a TPCDS (Scale 200) log, ZSTD generates log files about 3 times smaller than LZ4 does.

CODEC SIZE (bytes)
LZ4 184001434
ZSTD 64522396

And the uncompressed file is 17.6 times bigger than the ZSTD one.

-rw-r--r--    1 dongjoon  staff  1135464691 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679
-rw-r--r--    1 dongjoon  staff    64522396 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679.zstd
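The quoted ratios can be checked with a quick back-of-envelope calculation, using the byte counts copied from the tables above:

```python
# Sizes in bytes, copied from the measurements in this PR description.
plain = 1_135_464_691  # uncompressed event log
lz4   = 184_001_434    # LZ4-compressed event log
zstd  = 64_522_396     # ZSTD-compressed event log

# ZSTD output is ~2.85x smaller than LZ4 ("about 3 times smaller"),
# and the plain file is ~17.6x bigger than the ZSTD one.
print(f"LZ4  / ZSTD: {lz4 / zstd:.2f}x")
print(f"plain / ZSTD: {plain / zstd:.1f}x")
```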

2. Better Usability

We cannot decompress Spark-generated LZ4 event log files via the CLI, while we can for ZSTD event log files. Spark's LZ4 event log files are inconvenient for users who want to uncompress and inspect them.

$ lz4 -d spark-d3deba027bd34435ba849e14fc2c42ef.lz4
Decoding file spark-d3deba027bd34435ba849e14fc2c42ef
Error 44 : Unrecognized header : file cannot be decoded
$ zstd -d spark-a1843ead29834f46b1125a03eca32679.zstd
spark-a1843ead29834f46b1125a03eca32679.zstd: 1135464691 bytes

3. Speed
The following results were collected by running lzbench on the above Spark event log. Note that

  • This is not a direct comparison of Spark's compression/decompression codecs.
  • lzbench is an in-memory benchmark, so it doesn't show the benefit of reduced network traffic due to ZSTD's smaller size.

Here,

  • To get the ZSTD 1.4.8-1 result, the lzbench master branch was used because Spark uses ZSTD 1.4.8.
  • To get the LZ4 1.7.5 result, the lzbench v1.7 branch was used because Spark uses LZ4 1.7.1.

Compressor name      Compress. Decompress. Compr. size  Ratio Filename
memcpy               7393 MB/s  7166 MB/s  1135464691 100.00 spark-a1843ead29834f46b1125a03eca32679
zstd 1.4.8 -1        1344 MB/s  3351 MB/s    56665767   4.99 spark-a1843ead29834f46b1125a03eca32679
lz4 1.7.5            1385 MB/s  4782 MB/s   127662168  11.24 spark-a1843ead29834f46b1125a03eca32679
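The "Ratio" column in lzbench output is the compressed size expressed as a percentage of the original; the table's figures can be reproduced from its own size columns:

```python
# Sizes in bytes, copied from the lzbench table above.
original  = 1_135_464_691  # memcpy row, i.e. the uncompressed log
zstd_size = 56_665_767     # zstd 1.4.8 -1
lz4_size  = 127_662_168    # lz4 1.7.5

# Ratio = compressed size / original size, as a percentage.
print(f"zstd 1.4.8 -1: {100 * zstd_size / original:.2f}")  # matches 4.99
print(f"lz4 1.7.5:     {100 * lz4_size / original:.2f}")   # matches 11.24
```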

Does this PR introduce any user-facing change?

  • No for apps that don't use spark.eventLog.compress, because spark.eventLog.compress is disabled by default.
  • No for apps that set spark.eventLog.compression.codec explicitly, because this PR only changes the default value.
  • Yes for apps that enable spark.eventLog.compress without setting spark.eventLog.compression.codec. In this case, the value of spark.io.compression.codec, whose default is lz4, was previously used.
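The three cases above can be summarized as a tiny resolution function. This is a simplified sketch of the default-value logic, not Spark's actual code; the function name is illustrative:

```python
from typing import Optional

def effective_event_log_codec(conf: dict) -> Optional[str]:
    """Sketch of how the event-log write codec is resolved after this PR."""
    if conf.get("spark.eventLog.compress", "false") != "true":
        return None  # compression is still disabled by default
    # After this PR, spark.eventLog.compression.codec defaults to "zstd".
    # Before, it fell back to spark.io.compression.codec (default "lz4").
    return conf.get("spark.eventLog.compression.codec", "zstd")

print(effective_event_log_codec({}))                                       # None
print(effective_event_log_codec({"spark.eventLog.compress": "true"}))      # zstd
print(effective_event_log_codec({"spark.eventLog.compress": "true",
                                 "spark.eventLog.compression.codec": "lz4"}))  # lz4
```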

So this JIRA issue, SPARK-34503, is labeled with releasenotes.

How was this patch tested?

Pass the updated UT.


SparkQA commented Feb 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39945/


SparkQA commented Feb 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39946/

Member

@viirya viirya left a comment


Will this affect reading old event log in a case like upgrading Spark?

@dongjoon-hyun
Member Author

No~ This only decides the write codec for new logs.


SparkQA commented Feb 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39946/

@dongjoon-hyun
Member Author

I updated the indirect benchmark result by using lzbench.


SparkQA commented Feb 23, 2021

Test build #135365 has finished for PR 31618 at commit 6ceea1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment


As shown in the description, using zstd for event log compression has an obvious benefit. It also only affects the writer, and EventLogFileReader will detect the codec itself, so there seems to be no compatibility issue.
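The reader side is compatible because the codec is inferred from the log file's extension, which is why old logs remain readable after an upgrade. A rough sketch of that idea (illustrative, not Spark's actual EventLogFileReader code):

```python
from typing import Optional

def codec_from_log_path(path: str) -> Optional[str]:
    """Infer the event-log codec from the file-name extension; None means uncompressed."""
    name = path.rsplit("/", 1)[-1]
    if "." not in name:
        return None  # e.g. "spark-<appId>" written without compression
    return name.rsplit(".", 1)[-1]

print(codec_from_log_path("spark-a1843ead29834f46b1125a03eca32679.zstd"))  # zstd
print(codec_from_log_path("spark-a1843ead29834f46b1125a03eca32679"))       # None
```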


SparkQA commented Feb 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39945/

@dongjoon-hyun
Member Author

Thank you, @viirya !

@HyukjinKwon
Member

I think it's not an obvious win, though. Zstd looks more suited for archiving purposes, trading throughput for a higher compression ratio, whereas lz4 favors throughput over compression.

The main purpose of event logs is archiving. Many logs are generated and occupy the storage, but most of them are never accessed by users.

But I tend to agree with this. cc @HeartSaVioR or @tgravescs too FYI in case you guys have different thoughts on this.


SparkQA commented Feb 23, 2021

Test build #135366 has finished for PR 31618 at commit 54f14cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


dongjoon-hyun commented Feb 23, 2021

Hi, @HyukjinKwon . Why do you think so?

I think it's not an obvious win though .. Zstd looks more for archiving purpose with less throughput with high compression ratio vs lz4 is for more throughput with less compression.

According to the benchmark,

  • LZ4 1.7.5 compression time is not a clear winner. If you factor in the upload time to remote storage, ZSTD can be the winner.
  • LZ4 1.7.5 decompression time might be your reason. However, this is an event log.
    • When you download a log from the Spark History Server, a ZSTD log file will download 2~3x faster.
    • Also, when you view the log via the Spark History Server, the server itself needs to download it from remote storage such as S3 and decompress it. The 2~3x faster download compensates for the slower decompression.

In addition, for storage cost savings, ZSTD is a clear winner.
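The download-vs-decompression trade-off can be put into a rough back-of-envelope calculation. It mixes the Spark-generated file sizes with the lzbench decompression speeds quoted above, and the 100 MB/s network bandwidth is purely an assumption, so the result is only indicative:

```python
MB = 1_000_000
plain   = 1_135_464_691                                 # uncompressed log size, bytes
sizes   = {"lz4": 184_001_434, "zstd": 64_522_396}      # Spark-generated compressed sizes
decomp  = {"lz4": 4782 * MB, "zstd": 3351 * MB}         # lzbench decompression speed, bytes/s
network = 100 * MB                                      # assumed download bandwidth, bytes/s

# End-to-end time = download the compressed file + decompress it in full.
totals = {c: sizes[c] / network + plain / decomp[c] for c in sizes}
for codec, t in totals.items():
    print(f"{codec}: download + decompress ~ {t:.2f}s")
# lz4 ~ 2.08s, zstd ~ 0.98s under these assumptions: the smaller download
# more than compensates for ZSTD's slower decompression.
```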

@HeartSaVioR
Contributor

I agree that the event log directory is most likely placed on remote storage in practice, and in that case reducing the size would reduce the overall latency. It would be appreciated if we could see a direct benchmark (compress and send to S3, then receive from S3 and decompress), probably run 10~100 times each with the median taken, but that's optional; I tend to agree that the small difference in compression/decompression speed can be offset by the reduced network cost.

Btw,

$ lz4 -d spark-d3deba027bd34435ba849e14fc2c42ef.lz4
Decoding file spark-d3deba027bd34435ba849e14fc2c42ef
Error 44 : Unrecognized header : file cannot be decoded

makes me feel Spark does something wrong with lz4, or lz4 has variants that aren't compatible. Does anyone know why this doesn't work?


SparkQA commented Feb 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39970/


SparkQA commented Feb 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39970/

@HyukjinKwon
Member

Okie. I'm good with this change.

@dongjoon-hyun
Member Author

Thank you, @HeartSaVioR and @HyukjinKwon .

BTW, for lz4, it looks like a header issue. The Apache Parquet community has also been having some interesting discussions about this over the last two weeks.

@dongjoon-hyun
Member Author

Merged to master for Apache Spark 3.2.0!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-34503 branch February 24, 2021 00:39

SparkQA commented Feb 24, 2021

Test build #135390 has finished for PR 31618 at commit 0e88652.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<td>
The codec to compress logged events. If this is not given,
<code>spark.io.compression.codec</code> will be used.
The codec to compress logged events.
Contributor

@tgravescs left a comment

Sorry for coming in late; I was out last week. We may want to reference what other codecs can be used here. @dongjoon-hyun, thoughts?

Member Author

Thank you for review, @tgravescs . Sure, I'll make a documentation follow-up.

dongjoon-hyun added a commit that referenced this pull request Mar 1, 2021
… compression

### What changes were proposed in this pull request?

This PR is a follow-up of #31618 to document the available codecs for event log compression.

### Why are the changes needed?

Documentation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #31695 from dongjoon-hyun/SPARK-34503-DOC.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>