PARQUET-1973: Support ZSTD JNI BufferPool #865

dongjoon-hyun · 2021-02-04T19:23:03Z

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-1973
- This PR adds the `parquet.compression.codec.zstd.bufferPool.enabled` configuration to support ZSTD JNI's `BufferPool` feature.

| Version | Description | Commit |
| --- | --- | --- |
| v1.4.5-7 | `BufferPool` was added and used by default | luben/zstd-jni@`4f55c89` |
| v1.4.5-8 | `RecyclingBufferPool` was added and `BufferPool` became an interface to allow custom buffer pool implementations | luben/zstd-jni@`dd2588e` |
| v1.4.7+ | `NoPool` is used by default and users must specify a buffer pool explicitly | luben/zstd-jni@`f7c8279` |
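
To make the difference between the two behaviors concrete, here is a minimal sketch using only the JDK. The names `SketchBufferPool`, `NoPoolSketch`, and `RecyclingPoolSketch` are hypothetical stand-ins for illustration and are not the zstd-jni classes; the real `RecyclingBufferPool` may differ in how it stores and evicts buffers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical pool contract for illustration; not the zstd-jni interface. */
interface SketchBufferPool {
  ByteBuffer get(int capacity);

  void release(ByteBuffer buffer);
}

/** "No pool" behavior: every request allocates, released buffers are not kept. */
final class NoPoolSketch implements SketchBufferPool {
  @Override
  public ByteBuffer get(int capacity) {
    return ByteBuffer.allocate(capacity);
  }

  @Override
  public void release(ByteBuffer buffer) {
    // Nothing is retained; the garbage collector reclaims the buffer.
  }
}

/** Recycling behavior: released buffers are kept and handed out again. */
final class RecyclingPoolSketch implements SketchBufferPool {
  private final Deque<ByteBuffer> free = new ArrayDeque<>();

  @Override
  public synchronized ByteBuffer get(int capacity) {
    ByteBuffer candidate = free.pollFirst();
    if (candidate != null && candidate.capacity() >= capacity) {
      candidate.clear(); // reset position and limit before reuse
      return candidate;
    }
    // No buffer available, or the candidate was too small and is dropped.
    return ByteBuffer.allocate(capacity);
  }

  @Override
  public synchronized void release(ByteBuffer buffer) {
    free.addFirst(buffer);
  }
}

final class PoolSketchDemo {
  public static void main(String[] args) {
    SketchBufferPool recycling = new RecyclingPoolSketch();
    ByteBuffer first = recycling.get(1024);
    recycling.release(first);
    System.out.println(recycling.get(512) == first); // true: same instance reused

    SketchBufferPool none = new NoPoolSketch();
    ByteBuffer second = none.get(1024);
    none.release(second);
    System.out.println(none.get(1024) == second); // false: fresh allocation
  }
}
```

The sketch holds strong references for simplicity, so it deliberately sidesteps the soft/weak reference trade-off that motivated the `NoPool` default.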

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

gszadovszky

What would be the pros/cons for a user to use or to not use the buffer pool in parquet-mr? I understand that you want to be on the safe side with this new configuration property but I am currently not sure if it is necessary. I would not extend the number of the existing configuration options if it is not required.

Anyway, if you think the configuration option is necessary, please, update the documentation in the README.md.

dongjoon-hyun · 2021-02-05T17:32:44Z

Hi, @gszadovszky. We observed the following kind of issue in a Java 11 environment:

Detected Native memory leak or slow reclamation of native memory luben/zstd-jni#156

As I wrote in the PR description, `NoPool` is the default in ZSTD JNI because using soft/weak references could be harmful on some workloads (luben/zstd-jni@f7c8279).

dongjoon-hyun · 2021-02-06T22:58:50Z

Besides addressing those issues, the buffer pool also brings an improvement. FYI, Apache Spark uses ZSTD JNI directly for shuffle IO, and the following are the corresponding update and benchmark PRs.

- [SPARK-34340][CORE] Support ZSTD JNI BufferPool spark#31453
- [SPARK-34387][CORE][TESTS] Add ZStandardBenchmark spark#31498

gszadovszky

Thanks for explaining. I've found a formatting issue in the readme. Otherwise it looks good to me.

gszadovszky · 2021-02-08T10:14:53Z

parquet-hadoop/README.md

+**Property:** `parquet.compression.codec.zstd.bufferPool.enabled`
+**Description:** If it is true, [RecyclingBufferPool](https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/RecyclingBufferPool.java) is used.
+**Default value:** `false`
+
+---
+


nit: please end each of these lines with two spaces to force a line break. You can verify the formatting by checking the page rendered by GitHub: https://github.com/dongjoon-hyun/parquet-mr/tree/PARQUET-1973/parquet-hadoop#class-zstandardcodec

Oh. Got it. Thanks!

dongjoon-hyun · 2021-02-08T17:13:13Z

Thank you, @gszadovszky . I added two ending spaces and checked the README.

dongjoon-hyun · 2021-02-09T04:55:16Z

Thank you for approval, @gszadovszky .

dongjoon-hyun · 2021-02-09T16:52:17Z

Thank you for merging, @gszadovszky !

shangxinli · 2021-03-11T19:54:42Z

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/ZstandardCodec.java

  @Override
  public CompressionOutputStream createOutputStream(OutputStream stream) throws IOException {
-    return new ZstdCompressorStream(stream, conf.getInt(PARQUET_COMPRESS_ZSTD_LEVEL, DEFAULT_PARQUET_COMPRESS_ZSTD_LEVEL),
+    BufferPool pool;


Thanks Dongjoon for working on this!

I know this is a bit late. Just a minor comment: if you wrap the pool-selection code in a helper method and call it from both the method that returns a `CompressionInputStream` and the one that returns a `CompressionOutputStream`, it would avoid the duplication. Not a big deal though.
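
A rough sketch of that refactoring, using only the JDK so it compiles without Hadoop or zstd-jni on the classpath. Everything here other than the property name and its `false` default (both taken from the README change in this PR) is a hypothetical stand-in: `ZstdCodecSketch`, `PoolChoice`, the constant names, and the `Map`-based configuration are assumptions for illustration, and the stream factories return strings instead of real streams.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the two pool implementations. */
enum PoolChoice { NO_POOL, RECYCLING }

/** Hypothetical codec sketch; a Map stands in for the Hadoop configuration. */
final class ZstdCodecSketch {
  static final String BUFFER_POOL_ENABLED =
      "parquet.compression.codec.zstd.bufferPool.enabled";
  static final boolean DEFAULT_BUFFER_POOL_ENABLED = false;

  private final Map<String, String> conf;

  ZstdCodecSketch(Map<String, String> conf) {
    this.conf = conf;
  }

  /** The single place that turns the configuration flag into a pool choice. */
  private PoolChoice selectPool() {
    String raw = conf.get(BUFFER_POOL_ENABLED);
    boolean enabled =
        raw == null ? DEFAULT_BUFFER_POOL_ENABLED : Boolean.parseBoolean(raw.trim());
    return enabled ? PoolChoice.RECYCLING : PoolChoice.NO_POOL;
  }

  /** Stand-in for the input-stream factory; shares the helper. */
  String createInputStream() {
    return "input stream using " + selectPool();
  }

  /** Stand-in for the output-stream factory; shares the helper. */
  String createOutputStream() {
    return "output stream using " + selectPool();
  }
}

final class ZstdCodecSketchDemo {
  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(new ZstdCodecSketch(conf).createOutputStream());

    conf.put(ZstdCodecSketch.BUFFER_POOL_ENABLED, "true");
    System.out.println(new ZstdCodecSketch(conf).createInputStream());
  }
}
```

With the selection logic in one method, a later change to the property name or default touches a single place rather than both factories.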

shangxinli · 2021-04-22T15:48:54Z

cc @vectorijk

PARQUET-1973: Support ZSTD JNI BufferPool

b0850ef

gszadovszky requested changes Feb 5, 2021

Update README.md

9d1f7c8

gszadovszky requested changes Feb 8, 2021

Add two ending spaces

a2f444a

gszadovszky approved these changes Feb 8, 2021

gszadovszky merged commit 279255d into apache:master Feb 9, 2021

dongjoon-hyun deleted the PARQUET-1973 branch February 9, 2021 16:52

dongjoon-hyun mentioned this pull request Mar 1, 2021

AVRO-3060: Support ZSTD level and BufferPool options apache/avro#1114

Merged

4 tasks

shangxinli reviewed Mar 11, 2021
