
Conversation

@dongjinleekr (Contributor) commented Mar 15, 2017

What changes were proposed in this pull request?

Hadoop and HBase have started to support ZStandard compression in their recent releases. This update enables saving files to HDFS with the ZStandard codec by implementing ZStandardCodec. It also adds a new configuration for the default compression level, e.g. 'spark.io.compression.zstandard.level'.
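
For context, here is a minimal sketch of what such a codec could look like, assuming the zstd-jni library (com.github.luben.zstd) provides the underlying streams; the class and configuration names mirror the ones proposed in this PR, but the wiring below is illustrative rather than the PR's exact code:

    import java.io.{InputStream, OutputStream}

    import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

    import org.apache.spark.SparkConf
    import org.apache.spark.io.CompressionCodec

    // Illustrative sketch only; the actual change lives in Spark's own
    // org.apache.spark.io package and follows its conventions.
    class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {

      // Level 3 is zstd's default; lower levels trade ratio for speed.
      private val level = conf.getInt("spark.io.compression.zstandard.level", 3)

      override def compressedOutputStream(s: OutputStream): OutputStream =
        new ZstdOutputStream(s, level)

      override def compressedInputStream(s: InputStream): InputStream =
        new ZstdInputStream(s)
    }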

How was this patch tested?

3 additional unit tests in CompressionCodecSuite.scala.

@AmplabJenkins

Can one of the admins verify this patch?

@srowen (Member) commented Mar 15, 2017

Same questions as on the last PR -- can this be something the user includes if needed, or is there value in integrating it into Spark? Where would it come into play, and with what versions of Hadoop et al.?

@tgravescs (Contributor)

This should not be needed just to write to HDFS. The regular Hadoop input/output formats already support it if you are using the right version (I think Hadoop 2.8).

This seems to be adding zstd support to spark.io.compression.codec for internal compression. From what I've heard, zstd is better than the other codecs because it gives gzip-level compression with LZ4-level CPU usage. So if you have a job with a ton of intermediate data, or one that is causing network issues, you may want to use zstd to get gzip-level compression without much CPU penalty.

@dongjinleekr It doesn't look like you ran any manual tests on a real cluster? It would be nice to have some basic performance/compression numbers to show it actually working. Are you planning on actually using zstd in your Spark deployment?
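
For reference, once a change like this is merged, the codec would be picked up through the existing spark.io.compression.codec setting. A minimal sketch, assuming the PR registers a "zstd" short name and the level key described above (neither exists in Spark releases that predate this change):

    import org.apache.spark.SparkConf

    // Sketch only: the "zstd" short name and the level key are what this PR
    // proposes, not settings available in earlier Spark releases.
    val conf = new SparkConf()
      .setAppName("zstd-shuffle-example")
      .set("spark.io.compression.codec", "zstd")            // internal compression (shuffle, broadcast, spills)
      .set("spark.io.compression.zstandard.level", "3")     // proposed default level

As noted above, this setting covers Spark's internal data only; files written through Hadoop output formats get zstd support from Hadoop itself (2.8+).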

@rxin (Contributor) commented Mar 15, 2017

Yes, it'd be nice to have some benchmarks on this.

@maropu (Member) commented Apr 28, 2017

I did quick benchmarks using a TPC-DS query (Q4) (I just followed the previous work in #10342).
Based on the results, it seems a bit early to implement this?

scaleFactor: 4
AWS instance: c4.4xlarge	

-- zstd
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 53.315878375s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 53.468174668s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 57.282403146s 

-- lz4
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 20.779643053s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.520911319s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.897124967s

-- snappy
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 21.132412036999998s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 15.908867743999998s                                             
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.789648712s

-- lzf
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 21.339518781s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.881225328s                                                   
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.813455479s

@srowen (Member) commented May 6, 2017

OK, seems like we should close this.

@Cyan4973 commented on this code on May 8, 2017:

    class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {

      override def compressedOutputStream(s: OutputStream): OutputStream = {
        val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "3").toInt

Use cases which favor speed over size should prefer using level 1.
Compression speed difference can be fairly large.
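
To see how large that difference can be, here is a rough, self-contained comparison (not from the PR) that compresses a sample payload at levels 1 and 3 with zstd-jni and prints compressed size, ratio, and time:

    import com.github.luben.zstd.Zstd

    // Rough illustration only: real shuffle data will behave differently,
    // but this shows the level 1 vs. level 3 speed/ratio trade-off.
    val sample: Array[Byte] = ("some repetitive shuffle-like payload " * 100000).getBytes("UTF-8")
    for (level <- Seq(1, 3)) {
      val start = System.nanoTime()
      val compressed = Zstd.compress(sample, level)
      val millis = (System.nanoTime() - start) / 1e6
      val ratio = sample.length.toDouble / compressed.length
      println(f"level=$level compressed=${compressed.length} bytes ratio=$ratio%.1f time=$millis%.1f ms")
    }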

@maropu (Member) commented May 9, 2017

@Cyan4973 I quickly checked again:

scaleFactor: 4
AWS instance: c4.4xlarge	

// In this benchmark, I used `local-cluster` (`local` was used in the benchmark above)
./bin/spark-shell --master local-cluster[4,4,7500] \
  --conf spark.driver.memory=1g \
  --conf spark.executor.memory=7g \
  --conf spark.io.compression.codec=xxx

--- zstd (level=3)
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 36.517211838s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 25.026869575s                                                   
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 24.370711575s                                                   

--- zstd (level=1)
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 29.654705815s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 20.638918335s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 19.928730758999997s

--- lz4
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.422360631s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 17.38519278s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.779084563s

--- snappy
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.476569521000002s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.438640631s                                                   
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 14.949329456s

--- lzf
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.853010073s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 17.431232532000003s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.916569896999999s

zstd was still worse than the others. Not sure though; there might be cases where zstd beats the others on larger data sets.

@Cyan4973 commented May 9, 2017

@maropu: What about compression ratios?

@srowen mentioned this pull request May 17, 2017
@asfgit closed this in 5d2750a May 18, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close PRs ...

  - inactive to the review comments more than a month
  - WIP and inactive more than a month
  - with Jenkins build failure but inactive more than a month
  - suggested to be closed and no comment against that
  - obviously looking inappropriate (e.g., Branch 0.5)

To make sure, I left a comment on each PR about a week ago, and I did not get a response back from the authors of the PRs below:

Closes apache#11129
Closes apache#12085
Closes apache#12162
Closes apache#12419
Closes apache#12420
Closes apache#12491
Closes apache#13762
Closes apache#13837
Closes apache#13851
Closes apache#13881
Closes apache#13891
Closes apache#13959
Closes apache#14091
Closes apache#14481
Closes apache#14547
Closes apache#14557
Closes apache#14686
Closes apache#15594
Closes apache#15652
Closes apache#15850
Closes apache#15914
Closes apache#15918
Closes apache#16285
Closes apache#16389
Closes apache#16652
Closes apache#16743
Closes apache#16893
Closes apache#16975
Closes apache#17001
Closes apache#17088
Closes apache#17119
Closes apache#17272
Closes apache#17971

Added:
Closes apache#17778
Closes apache#17303
Closes apache#17872

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18017 from HyukjinKwon/close-inactive-prs.