Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Feb 4, 2021

Jira

Version Description Commit
v1.4.5-7 BufferPool was added and used it by default luben/zstd-jni@4f55c89
v1.4.5-8 RecyclingBufferPool was added and BufferPool became an interface to allow custom BufferPool implementation luben/zstd-jni@dd2588e
v1.4.7+ NoPool is used by default and user should specify buffer pool explicitly luben/zstd-jni@f7c8279

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the pros/cons for a user to use or to not use the buffer pool in parquet-mr? I understand that you want to be on the safe side with this new configuration property but I am currently not sure if it is necessary. I would not extend the number of the existing configuration options if it is not required.

Anyway, if you think the configuration option is necessary, please, update the documentation in the README.md.

@dongjoon-hyun
Copy link
Member Author

Hi, @gszadovszky . We observed this kind of issue on Java 11 environment.

As I wrote in the PR description, NoPool is the default of ZSTD JNI because Using soft/weak references could be harmful on some workloads. (luben/zstd-jni@f7c8279)

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Feb 6, 2021

Not only for the issues, but also there exists improvement. FYI, Apache Spark is using ZSTD JNI directly for shuffle IO and the followings are the corresponding update and benchmark PRs.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. I've found a formatting issue in the readme. Otherwise it looks good to me.

Comment on lines 341 to 346
**Property:** `parquet.compression.codec.zstd.bufferPool.enabled`
**Description:** If it is true, [RecyclingBufferPool](https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/RecyclingBufferPool.java) is used.
**Default value:** `false`

---

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please use double spaces at the end of the lines to enforce new-line. You can verify formatting by checking the page rendered by github: https://github.com/dongjoon-hyun/parquet-mr/tree/PARQUET-1973/parquet-hadoop#class-zstandardcodec

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. Got it. Thanks!

@dongjoon-hyun
Copy link
Member Author

Thank you, @gszadovszky . I added two ending spaces and checked the README.
Screen Shot 2021-02-08 at 9 12 27 AM

@dongjoon-hyun
Copy link
Member Author

Thank you for approval, @gszadovszky .

@gszadovszky gszadovszky merged commit 279255d into apache:master Feb 9, 2021
@dongjoon-hyun
Copy link
Member Author

Thank you for merging, @gszadovszky !

@Override
public CompressionOutputStream createOutputStream(OutputStream stream) throws IOException {
return new ZstdCompressorStream(stream, conf.getInt(PARQUET_COMPRESS_ZSTD_LEVEL, DEFAULT_PARQUET_COMPRESS_ZSTD_LEVEL),
BufferPool pool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Dongjoon for working on this!

It is kind of late. Just a minor comment: if you can wrap the code into a method and call it in both CompressionInputStream() and CompressionOutputStream, it would avoid duplicating. Not a big deal though.

@shangxinli
Copy link
Contributor

cc @vectorijk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants