PARQUET-2160: Close ZstdInputStream to free off-heap memory in time. #982
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -37,6 +37,7 @@ | |
| import org.apache.parquet.bytes.ByteBufferAllocator; | ||
| import org.apache.parquet.bytes.BytesInput; | ||
| import org.apache.parquet.compression.CompressionCodecFactory; | ||
| import org.apache.parquet.hadoop.codec.ZstandardCodec; | ||
| import org.apache.parquet.hadoop.metadata.CompressionCodecName; | ||
|
|
||
| public class CodecFactory implements CompressionCodecFactory { | ||
|
|
@@ -109,7 +110,17 @@ public BytesInput decompress(BytesInput bytes, int uncompressedSize) throws IOEx | |
| decompressor.reset(); | ||
| } | ||
| InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor); | ||
| decompressed = BytesInput.from(is, uncompressedSize); | ||
|
|
||
| // We need to explicitly close the ZstdDecompressorStream here to release the resources it holds to avoid | ||
| // off-heap memory fragmentation issue, see https://issues.apache.org/jira/browse/PARQUET-2160. | ||
| // This change will load the decompressor stream into heap a little earlier, since the problem it solves | ||
| // only happens in the ZSTD codec, so this modification is only made for ZSTD streams. | ||
| if (codec instanceof ZstandardCodec) { | ||
| decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize)); | ||
|
Contributor
I understand from the discussion in the Jira that BytesInput.copy() just loads the data into the heap earlier and does not add extra memory overall. Can we have a benchmark on heap/GC (heap size, GC time, etc.)? I just want to make sure we don't fix one problem while introducing another. Other than that, treating ZSTD specially might be OK since we have pretty decent comments.
Contributor
Author
Sure, will do a benchmark.
Contributor
Author
Hi @shangxinli, very sorry about the long delay; I was a little busy last week. The benchmark results and detailed data have been posted in the PR description. Also cc @sunchao.
Contributor
Thanks for working on it! |
||
| is.close(); | ||
| } else { | ||
| decompressed = BytesInput.from(is, uncompressedSize); | ||
| } | ||
| } else { | ||
| decompressed = bytes; | ||
| } | ||
|
|
||
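The ZSTD branch in the diff above reads the whole decompressed page into memory and then closes the stream right away. A minimal, self-contained sketch of that "materialize, then close" pattern follows. It is not the parquet-mr code: `readFullyAndClose` and `compress` are hypothetical helpers, and `java.util.zip.InflaterInputStream` stands in for the ZSTD stream only because it ships with the JDK and, when built with its default `Inflater`, also releases native state on `close()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class EagerDecompressSketch {

  // Hypothetical helper (not a parquet-mr API): reads exactly `size` bytes onto
  // the heap and closes the stream before returning, even if reading fails.
  public static byte[] readFullyAndClose(InputStream in, int size) throws IOException {
    try (InputStream is = in) {
      byte[] out = new byte[size];
      int off = 0;
      while (off < size) {
        int n = is.read(out, off, size - off);
        if (n < 0) {
          throw new IOException("Stream ended after " + off + " of " + size + " bytes");
        }
        off += n;
      }
      return out;
    }
  }

  // Hypothetical helper: produces zlib-compressed input for the demo.
  public static byte[] compress(byte[] raw) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DeflaterOutputStream dos = new DeflaterOutputStream(buf)) {
      dos.write(raw);
    }
    return buf.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] raw = "parquet page payload".getBytes(StandardCharsets.UTF_8);
    // Stand-in for codec.createInputStream(...); an assumption for illustration.
    InputStream is = new InflaterInputStream(new ByteArrayInputStream(compress(raw)));
    byte[] page = readFullyAndClose(is, raw.length);
    System.out.println(new String(page, StandardCharsets.UTF_8));
  }
}
```

The trade-off is the one discussed in the review: the decompressed bytes reach the heap earlier than with a lazily read stream, in exchange for releasing the stream's native resources at a predictable point.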
This looks a little weird, but doing so loads the decompressor stream into the heap in advance, and only ZSTD has this problem currently, so I made this modification only for ZSTD streams.
Maybe we can consider closing the decompressed stream after it has been read:
https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java#L283-L288
But I'm not sure if there is a situation where the decompressed stream is read more than once.
The change looks OK to me. We should probably add some comments explaining why ZSTD deserves the special treatment here.
The change on `BytesInput` looks more intrusive since it is used not only for decompression but other places like compression. For instance, `BytesInput.copy` calls `toByteArray` underneath, and after the call the original object should still be valid.
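The concern raised in this thread, that closing the stream inside the generic read path breaks any caller that reads the same object again, can be illustrated with a small sketch. `StreamBackedBytes` and `TrackingStream` are hypothetical stand-ins written for this example; they are not the parquet-mr `BytesInput` classes and only mimic the "close after read" idea.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseAfterReadSketch {

  // Hypothetical stream that refuses reads once closed, like most codec streams.
  public static final class TrackingStream extends FilterInputStream {
    public boolean closed;

    public TrackingStream(InputStream in) {
      super(in);
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      if (closed) {
        throw new IOException("Stream closed");
      }
      return super.read(b, off, len);
    }

    @Override
    public void close() throws IOException {
      closed = true;
      super.close();
    }
  }

  // Hypothetical stand-in for a stream-backed bytes holder that closes its
  // stream as soon as it has been read once.
  public static final class StreamBackedBytes {
    private final InputStream in;
    private final int size;

    public StreamBackedBytes(InputStream in, int size) {
      this.in = in;
      this.size = size;
    }

    public byte[] toByteArray() throws IOException {
      try (InputStream is = in) { // closes the stream after this read
        byte[] out = new byte[size];
        int off = 0;
        while (off < size) {
          int n = is.read(out, off, size - off);
          if (n < 0) {
            throw new IOException("Unexpected end of stream");
          }
          off += n;
        }
        return out;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    StreamBackedBytes bytes = new StreamBackedBytes(
        new TrackingStream(new ByteArrayInputStream(new byte[] {1, 2, 3})), 3);
    bytes.toByteArray(); // first read succeeds and closes the stream
    try {
      bytes.toByteArray(); // second read hits the closed stream
    } catch (IOException e) {
      System.out.println("second read failed: " + e.getMessage());
    }
  }
}
```

Under these assumptions the first read works and the second throws, which is why keeping the close in the ZSTD-specific branch is the less intrusive option.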
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
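The review above also asks for heap and GC numbers (heap size, GC time). As a rough illustration of how such figures can be sampled from inside the JVM, here is a sketch based on the standard `java.lang.management` beans. It is not the benchmark posted in the PR description: the class name, helper names, and the allocation loop are placeholders, and a real comparison would run the actual Parquet read path, ideally under a proper benchmark harness.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class HeapGcProbeSketch {

  // Sums accumulated collection time over all collectors; a collector that
  // does not report a time returns -1, which is skipped here.
  public static long totalGcTimeMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime();
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  // Currently used heap, in bytes, as reported by the memory MXBean.
  public static long usedHeapBytes() {
    return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
  }

  public static void main(String[] args) {
    long gcBefore = totalGcTimeMillis();
    long heapBefore = usedHeapBytes();

    // Placeholder workload: allocates short-lived 64 KiB buffers. Replace with
    // the code path under test (an assumption; not the PR's benchmark).
    long checksum = 0;
    for (int i = 0; i < 10_000; i++) {
      byte[] page = new byte[64 * 1024];
      page[0] = (byte) i;
      checksum += page[0];
    }

    System.out.println("checksum (keeps the loop live): " + checksum);
    System.out.println("GC time delta (ms): " + (totalGcTimeMillis() - gcBefore));
    System.out.println("used heap delta (bytes): " + (usedHeapBytes() - heapBefore));
  }
}
```

Running the same probe with and without the eager copy gives a first impression of the heap and GC impact; single runs are noisy, so several iterations after a warm-up are advisable.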