[SPARK-34503][CORE] Use zstd for spark.eventLog.compression.codec by default #31618
Conversation
Kubernetes integration test starting
viirya left a comment
Will this affect reading old event log in a case like upgrading Spark?
No~ This only decides the write codec for new logs.
Kubernetes integration test status success
I updated the indirect benchmark result by using
Test build #135365 has finished for PR 31618 at commit
viirya left a comment
There is an obvious benefit to using zstd for event log compression, as shown in the description. It also only affects the writer, and EventLogFileReader checks the codec itself, so there seems to be no compatibility issue.
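The compatibility point can be illustrated with a small sketch: compressed event log files carry the codec as a file extension, so a reader can choose the decompression codec per file, independent of the write-side default. The function below is a simplified, hypothetical stand-in, not Spark's actual EventLogFileReader code:

```shell
#!/bin/sh
# Hypothetical sketch: pick the decompression codec from an event log's
# file extension. Spark's EventLogFileReader does its own codec lookup;
# this mapping is illustrative only.
codec_for_log() {
  case "$1" in
    *.zstd)   echo "zstd" ;;
    *.lz4)    echo "lz4" ;;
    *.snappy) echo "snappy" ;;
    *)        echo "none" ;;   # uncompressed log
  esac
}

codec_for_log "eventlog_app-123.lz4"    # an old log written before the default change
codec_for_log "eventlog_app-456.zstd"   # a new log written with the zstd default
```

Because the codec is recovered per file like this, old lz4 logs remain readable after upgrading.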
Kubernetes integration test status failure
Thank you, @viirya!
I think it's not an obvious win, though. Zstd is geared more toward archiving, trading throughput for a higher compression ratio, whereas lz4 favors throughput with less compression.
But I tend to agree with this. cc @HeartSaVioR and @tgravescs too, FYI, in case you have different thoughts on this.
Test build #135366 has finished for PR 31618 at commit
Hi, @HyukjinKwon. Why do you think so?
According to the benchmark,
In addition, for storage cost savings, ZSTD is a clear winner.
I agree with the statement that the event log directory is most likely placed in remote storage in practice, and in that case reducing the size would affect the overall latency. It would be really appreciated if we could see a direct benchmark (compress and send to S3, then receive from S3 and decompress), probably run 10~100 times for each and taking the median; but that's optional, and I tend to agree that the small difference in compression/decompression could be offset by the reduced network cost. Btw, this makes me feel Spark does something wrong with lz4, or lz4 has a variant which isn't compatible. Does anyone know why this doesn't work?
Kubernetes integration test starting
Kubernetes integration test status failure
Okie. I'm good with this change.
Thank you, @HeartSaVioR and @HyukjinKwon. BTW, for lz4, it looks like some header issue.
Merged to master for Apache Spark 3.2.0!
Test build #135390 has finished for PR 31618 at commit
  <td>
-    The codec to compress logged events. If this is not given,
-    <code>spark.io.compression.codec</code> will be used.
+    The codec to compress logged events.
Sorry for coming in late, was out last week. We may want to reference what other codecs can be used here. @dongjoon-hyun thoughts?
Thank you for the review, @tgravescs. Sure, I'll make a documentation follow-up.
… compression

What changes were proposed in this pull request?
This PR is a follow-up of #31618 to document the available codecs for event log compression.

Why are the changes needed?
Documentation.

Does this PR introduce _any_ user-facing change?
No.

How was this patch tested?
Manual.

Closes #31695 from dongjoon-hyun/SPARK-34503-DOC.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Apache Spark 3.0 introduced the spark.eventLog.compression.codec configuration. For Apache Spark 3.2, this PR aims to set zstd as the default value for the spark.eventLog.compression.codec configuration. This only affects creating a new log file.
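For reference, a user who prefers the pre-3.2 behavior after upgrading can pin the codec explicitly. A minimal spark-defaults.conf sketch (the property names appear in this PR; the chosen values are illustrative):

```
spark.eventLog.enabled              true
spark.eventLog.compress             true
# Pin the old default explicitly to keep writing lz4 event logs.
spark.eventLog.compression.codec    lz4
```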
Why are the changes needed?
The main purpose of event logs is archiving. Many logs are generated and occupy storage, but most of them are never accessed by users.
1. Save storage resources (and money)
In general, ZSTD output is much smaller than LZ4 output.
For example, in the case of the TPCDS (Scale 200) log, ZSTD generates log files about 3 times smaller than LZ4.
And the uncompressed file is 17.6 times bigger.
2. Better Usability
We cannot decompress Spark-generated LZ4 event log files via the CLI, while we can for ZSTD event log files. Spark's LZ4 event log files are inconvenient for users who want to uncompress and access them.
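As a concrete illustration of the usability point, a ZSTD-compressed file can be round-tripped with the stock zstd command-line tool (file names and contents here are made up, and this assumes the zstd CLI is installed):

```shell
# Create a sample file, then round-trip it with the zstd CLI.
printf 'sample event log line\n' > eventlog.txt
zstd -q -f eventlog.txt -o eventlog.txt.zstd      # compress
zstd -d -q -f eventlog.txt.zstd -o decoded.txt    # decompress
cmp eventlog.txt decoded.txt && echo "round trip OK"
```

The same `zstd -d` invocation works on a Spark-written `.zstd` event log, which is exactly what is not possible with Spark's LZ4 output and the stock lz4 CLI.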
3. Speed
The following results are collected by running lzbench on the above Spark event log. Note that lzbench is an in-memory benchmark, so it doesn't show the benefit of the reduced network traffic due to the small size of ZSTD. Here, the lzbench master branch is used because Spark is using ZSTD 1.4.8, and the lzbench v1.7 branch is used because Spark is using LZ4 1.7.1.
Does this PR introduce any user-facing change?
- No change for users who are not using spark.eventLog.compress, because spark.eventLog.compress is disabled by default.
- No change for users who set spark.eventLog.compression.codec explicitly, because this is a change of the default value.
- Yes, for users who use spark.eventLog.compress without setting spark.eventLog.compression.codec. In this case, previously the spark.io.compression.codec value was used, whose default is lz4.
So this JIRA issue, SPARK-34503, is labeled with releasenotes.
How was this patch tested?
Pass the updated UT.