Skip to content

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Sep 26, 2018

What changes were proposed in this pull request?

This PR upgrade lz4-java to 1.5.0 get speed improvement.

General speed improvements

LZ4 decompression speed has always been a strong point. In v1.8.2, this gets even better, as it improves decompression speed by about 10%, thanks in a large part to suggestion from @svpv .

For example, on a Mac OS-X laptop with an Intel Core i7-5557U CPU @ 3.10GHz,
running lz4 -bsilesia.tar compiled with default compiler llvm v9.1.0:

Version v1.8.1 v1.8.2 Improvement
Decompression speed 2490 MB/s 2770 MB/s +11%

Compression speeds also receive a welcomed boost, though improvement is not evenly distributed, with higher levels benefiting quite a lot more.

Version v1.8.1 v1.8.2 Improvement
lz4 -1 504 MB/s 516 MB/s +2%
lz4 -9 23.2 MB/s 25.6 MB/s +10%
lz4 -12 3.5 Mb/s 9.5 MB/s +170%

More details:
https://github.com/lz4/lz4/releases/tag/v1.8.3

Below is my benchmark result
set spark.sql.parquet.compression.codec to lz4 and disable orc benchmark, then run FilterPushdownBenchmark.
lz4-java 1.5.0:

[success] Total time: 5585 s, completed Sep 26, 2018 5:22:16 PM

lz4-java 1.4.0:

[success] Total time: 5591 s, completed Sep 26, 2018 5:22:24 PM

Some benchmark result:

lz4-java 1.5.0 Select 1 row with 500 filters:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            1953 / 1980          0.0  1952502908.0       1.0X
Parquet Vectorized (Pushdown)                 2541 / 2585          0.0  2541019869.0       0.8X


lz4-java 1.4.0 Select 1 row with 500 filters:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            1979 / 2103          0.0  1979328144.0       1.0X
Parquet Vectorized (Pushdown)                 2596 / 2909          0.0  2596222118.0       0.8X

Complete benchmark result:
https://issues.apache.org/jira/secure/attachment/12941360/FilterPushdownBenchmark-lz4-java-140-results.txt
https://issues.apache.org/jira/secure/attachment/12941361/FilterPushdownBenchmark-lz4-java-150-results.txt

How was this patch tested?

manual tests

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/lz4/lz4/releases/tag/v1.8.3 is already out due to data corruption bug in v1.8.2.
Could you elaborate how to avoid using v1.8.2 in the PR description?

Also, why do you use FilterPushdownBenchmark in this PR? It's mainly based on snappy instead of lz4. So, did you change it to use LZ4 and get the above benchmark?

@SparkQA
Copy link

SparkQA commented Sep 26, 2018

Test build #96602 has finished for PR 22551 at commit 331298b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member Author

wangyum commented Sep 26, 2018

Thanks @dongjoon-hyun I updated the PR description.

@HyukjinKwon
Copy link
Member

retest this please

@dongjoon-hyun
Copy link
Member

Thank you for update, but could you update one more, @wangyum ?

  • Please mention that 1.8.2 data corruption issue clearly and write that 1.8.3 is recommended.
  • Is the copied General speed improvements still valid in 1.8.3? I believe that, but if we or (LZ4 community) cannot reevaluate it, could you put the link to 1.8.2 instead?

@SparkQA
Copy link

SparkQA commented Sep 26, 2018

Test build #96632 has finished for PR 22551 at commit 331298b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Oct 4, 2018

IMO, since lz4-java v1.5.0 is based on lz4 v1.8.3, you'd better to use the version in the benchmark of the PR description instead of v1.8.2.
btw, which lz4 version does the current lz4-java v1.4.0 depend on?

@srowen
Copy link
Member

srowen commented Oct 6, 2018

WDYT @wangyum ? I think this can go in 3.0

@wangyum
Copy link
Member Author

wangyum commented Oct 6, 2018

@maropu lz4 v1.8.3 is maintenance release. So the official does not provide benchmark: https://github.com/lz4/lz4/releases/tag/v1.8.3
@srowen Yes, I think this can go in 3.0. There is a little improvement overall.

@srowen
Copy link
Member

srowen commented Oct 6, 2018

@maropu as far as I can tell, 1.4 was based on lz4 'r128', a couple years old.
lz4/lz4-java#74 (comment)

@srowen
Copy link
Member

srowen commented Oct 7, 2018

Merged to master

@asfgit asfgit closed this in fba722e Oct 7, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This PR upgrade `lz4-java` to 1.5.0 get speed improvement.

**General speed improvements**

LZ4 decompression speed has always been a strong point. In v1.8.2, this gets even better, as it improves decompression speed by about 10%, thanks in a large part to suggestion from svpv .

For example, on a Mac OS-X laptop with an Intel Core i7-5557U CPU  3.10GHz,
running lz4 -bsilesia.tar compiled with default compiler llvm v9.1.0:

Version | v1.8.1 | v1.8.2 | Improvement
-- | -- | -- | --
Decompression speed | 2490 MB/s | 2770 MB/s | +11%

Compression speeds also receive a welcomed boost, though improvement is not evenly distributed, with higher levels benefiting quite a lot more.

Version | v1.8.1 | v1.8.2 | Improvement
-- | -- | -- | --
lz4 -1 | 504 MB/s | 516 MB/s | +2%
lz4 -9 | 23.2 MB/s | 25.6 MB/s | +10%
lz4 -12 | 3.5 Mb/s | 9.5 MB/s | +170%

More details:
https://github.com/lz4/lz4/releases/tag/v1.8.3

**Below is my benchmark result**
set `spark.sql.parquet.compression.codec` to `lz4` and disable orc benchmark, then run `FilterPushdownBenchmark`.
lz4-java 1.5.0:
```
[success] Total time: 5585 s, completed Sep 26, 2018 5:22:16 PM
```
lz4-java 1.4.0:
```
[success] Total time: 5591 s, completed Sep 26, 2018 5:22:24 PM
```
Some benchmark result:
```
lz4-java 1.5.0 Select 1 row with 500 filters:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            1953 / 1980          0.0  1952502908.0       1.0X
Parquet Vectorized (Pushdown)                 2541 / 2585          0.0  2541019869.0       0.8X

lz4-java 1.4.0 Select 1 row with 500 filters:           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            1979 / 2103          0.0  1979328144.0       1.0X
Parquet Vectorized (Pushdown)                 2596 / 2909          0.0  2596222118.0       0.8X
```
Complete benchmark result:
https://issues.apache.org/jira/secure/attachment/12941360/FilterPushdownBenchmark-lz4-java-140-results.txt
https://issues.apache.org/jira/secure/attachment/12941361/FilterPushdownBenchmark-lz4-java-150-results.txt

## How was this patch tested?

manual tests

Closes apache#22551 from wangyum/SPARK-25539.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants