Commit 2e31e2c
committed
[SPARK-34503][CORE] Use zstd for spark.eventLog.compression.codec by default
### What changes were proposed in this pull request?
Apache Spark 3.0 introduced `spark.eventLog.compression.codec` configuration.
For Apache Spark 3.2, this PR aims to set `zstd` as the default value for `spark.eventLog.compression.codec` configuration.
This only affects creating a new log file.
### Why are the changes needed?
The main purpose of event logs is archiving. Many logs are generated and occupy the storage, but most of them are never accessed by users.
**1. Save storage resources (and money)**
In general, ZSTD is much smaller than LZ4.
For example, in case of TPCDS (Scale 200) log, ZSTD generates about 3 times smaller log files than LZ4.
| CODEC | SIZE (bytes) |
|---------|-------------|
| LZ4 | 184001434|
| ZSTD | 64522396|
And, the plain file is 17.6 times bigger.
```
-rw-r--r-- 1 dongjoon staff 1135464691 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679
-rw-r--r-- 1 dongjoon staff 64522396 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679.zstd
```
**2. Better Usability**
We cannot decompress Spark-generated LZ4 event log files via CLI while we can for ZSTD event log files. Spark's LZ4 event log files are inconvenient to some users who want to uncompress and access them.
```
$ lz4 -d spark-d3deba027bd34435ba849e14fc2c42ef.lz4
Decoding file spark-d3deba027bd34435ba849e14fc2c42ef
Error 44 : Unrecognized header : file cannot be decoded
```
```
$ zstd -d spark-a1843ead29834f46b1125a03eca32679.zstd
spark-a1843ead29834f46b1125a03eca32679.zstd: 1135464691 bytes
```
**3. Speed**
The following results are collected by running [lzbench](https://github.com/inikep/lzbench) on the above Spark event log. Note that
- This is not a direct comparison of Spark compression/decompression codec.
- `lzbench` is an in-memory benchmark. So, it doesn't show the benefit of the reduced network traffic due to the small size of ZSTD.
Here,
- To get ZSTD 1.4.8-1 result, `lzbench` `master` branch is used because Spark is using ZSTD 1.4.8.
- To get LZ4 1.7.5 result, `lzbench` `v1.7` branch is used because Spark is using LZ4 1.7.1.
```
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 7393 MB/s 7166 MB/s 1135464691 100.00 spark-a1843ead29834f46b1125a03eca32679
zstd 1.4.8 -1 1344 MB/s 3351 MB/s 56665767 4.99 spark-a1843ead29834f46b1125a03eca32679
lz4 1.7.5 1385 MB/s 4782 MB/s 127662168 11.24 spark-a1843ead29834f46b1125a03eca32679
```
### Does this PR introduce _any_ user-facing change?
- No for the apps which doesn't use `spark.eventLog.compress` because `spark.eventLog.compress` is disabled by default.
- No for the apps using `spark.eventLog.compression.codec` explicitly because this is a change of the default value.
- Yes for the apps using `spark.eventLog.compress` without setting `spark.eventLog.compression.codec`. In this case, previously `spark.io.compression.codec` value was used whose default is `lz4`.
So this JIRA issue, SPARK-34503, is labeled with `releasenotes`.
### How was this patch tested?
Pass the updated UT.
Closes #31618 from dongjoon-hyun/SPARK-34503.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>1 parent 7f27d33 commit 2e31e2c
File tree
4 files changed
+9
-13
lines changed- core/src
- main/scala/org/apache/spark/internal/config
- test/scala/org/apache/spark/deploy/history
- docs
4 files changed
+9
-13
lines changedLines changed: 3 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1726 | 1726 | | |
1727 | 1727 | | |
1728 | 1728 | | |
1729 | | - | |
| 1729 | + | |
1730 | 1730 | | |
1731 | | - | |
| 1731 | + | |
| 1732 | + | |
1732 | 1733 | | |
1733 | 1734 | | |
1734 | 1735 | | |
| |||
Lines changed: 2 additions & 8 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
99 | 99 | | |
100 | 100 | | |
101 | 101 | | |
102 | | - | |
| 102 | + | |
103 | 103 | | |
104 | 104 | | |
105 | 105 | | |
106 | 106 | | |
107 | 107 | | |
108 | 108 | | |
109 | 109 | | |
110 | | - | |
111 | 110 | | |
112 | | - | |
113 | | - | |
114 | | - | |
115 | | - | |
116 | | - | |
117 | | - | |
| 111 | + | |
118 | 112 | | |
119 | 113 | | |
120 | 114 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1040 | 1040 | | |
1041 | 1041 | | |
1042 | 1042 | | |
1043 | | - | |
| 1043 | + | |
1044 | 1044 | | |
1045 | | - | |
1046 | | - | |
| 1045 | + | |
1047 | 1046 | | |
1048 | 1047 | | |
1049 | 1048 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
| 27 | + | |
| 28 | + | |
27 | 29 | | |
28 | 30 | | |
29 | 31 | | |
| |||
0 commit comments