PARQUET-2160: Close ZstdInputStream to free off-heap memory in time. #982
Conversation
}
InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
decompressed = BytesInput.from(is, uncompressedSize);
if (codec instanceof ZstandardCodec) {
This looks a little weird, but since doing so loads the decompressed data into the heap in advance, and only ZSTD currently has this problem, I made this modification only for the ZSTD stream.
Maybe we can consider closing the decompressed stream after it has been read:
https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java#L283-L288
But I'm not sure if there is a situation where the decompressed stream is read more than once.
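For readers skimming the thread, here is a minimal sketch of the "close after it has been read" idea, written as a standalone helper rather than the actual BytesInput code linked above (the helper name and buffer size are made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class CloseAfterReadExample {
  // Drain a decompressor stream fully, then close it so any native (off-heap)
  // buffers it holds are released immediately instead of waiting for GC.
  static byte[] readAllAndClose(InputStream in, int expectedSize) throws IOException {
    try (InputStream stream = in) {
      ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(expectedSize, 32));
      byte[] buf = new byte[8192];
      int n;
      while ((n = stream.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }
}
```

If a BytesInput were ever read more than once, closing the source inside toByteArray() would break the second read, which is exactly the concern raised in this thread.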
The change looks OK to me; we should probably add a comment explaining why ZSTD deserves the special treatment here.
The change on BytesInput looks more intrusive since it is used not only for decompression but also in other places, like compression. For instance, BytesInput.copy calls toByteArray underneath, and after the call the original object should still be valid.
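A short illustration of that reuse expectation (the scenario is made up; BytesInput.from, BytesInput.copy, and size() are real parquet-common methods):

```java
import org.apache.parquet.bytes.BytesInput;

public class BytesInputCopySemantics {
  public static void main(String[] args) throws Exception {
    // On the write/compression side, BytesInput values are routinely reused.
    BytesInput pageBytes = BytesInput.from(new byte[] {1, 2, 3, 4});

    // copy() materializes the bytes via toByteArray() under the hood...
    BytesInput snapshot = BytesInput.copy(pageBytes);

    // ...but the original is expected to stay usable afterwards, which is why
    // closing/invalidating the source inside toByteArray() would be intrusive.
    System.out.println(pageBytes.size() + " bytes, snapshot " + snapshot.size());
  }
}
```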
Added comment.
sunchao
left a comment
cc @shangxinli @ggershinsky too
// This change loads the decompressed data onto the heap a little earlier; since the problem it solves
// only happens with the ZSTD codec, the modification is made only for ZSTD streams.
if (codec instanceof ZstandardCodec) {
  decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
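Putting the two hunks together, here is a hedged sketch of the decompression path with this change applied; the field names and the ZSTD branch come from the diff above, while the class shape, imports, and the explicit close are a reconstruction rather than a verbatim copy of CodecFactory:

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.hadoop.codec.ZstandardCodec;

class DecompressSketch {
  private final CompressionCodec codec;
  private final Decompressor decompressor;

  DecompressSketch(CompressionCodec codec, Decompressor decompressor) {
    this.codec = codec;
    this.decompressor = decompressor;
  }

  BytesInput decompress(BytesInput bytes, int uncompressedSize) throws IOException {
    InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
    BytesInput decompressed;
    if (codec instanceof ZstandardCodec) {
      // Eagerly copy the decompressed bytes onto the heap so the ZSTD stream,
      // which holds off-heap buffers, can be closed right away instead of
      // waiting for GC/finalization.
      decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
      is.close();
    } else {
      // Other codecs keep the existing lazy behavior.
      decompressed = BytesInput.from(is, uncompressedSize);
    }
    return decompressed;
  }
}
```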
I understand we had the discussion in the Jira that BytesInput.copy() just loads the data into the heap in advance and does not add extra overhead overall. Can we have a benchmark on the heap/GC (heap size, GC time, etc.)? I just want to make sure we fix one problem without introducing another.
Other than that, treating ZSTD specially is probably OK since we have pretty decent comments.
> Can we have a benchmark on the heap/GC (heap size, GC time, etc.)?
Sure, will do a benchmark.
Hi @shangxinli, very sorry about the long delay, I was a little busy last week. The benchmark results and detailed data have been posted in the PR description block, also cc @sunchao.
Thanks for working on it!
LGTM
sunchao
left a comment
LGTM too, thanks @zhongyujiang !
Wondering why this PR has not been merged. We have faced this problem and are looking for a solution (Spark 3.2.2 + zstd-parquet).
@gszadovszky @ggershinsky @shangxinli Folks, do we have an approximate timeline for the next patch release that will include this patch? This is a severe problem that affects our ability to use Parquet w/ Zstd, and I'm aware of at least a handful of similar occurrences and issues hitting others. Corresponding issue in Spark: https://issues.apache.org/jira/browse/SPARK-41952
@alexeykudinkin We might release a new patch in the next 2 or 3 months. Can you elaborate why "this is a severe problem that does affect our ability to use Parquet w/ Zstd"? I understand it is an issue but we have been running Parquet w/ Zstd for 2+ years in production.
Totally, @shangxinli. We have Spark clusters running in production ingesting from 100s of Apache Hudi tables (using Parquet and Zstd) and writing into other ones. We switched from gzip to zstd slightly over a month ago and started to have OOM issues almost immediately. It took us a bit of triaging to zero in on zstd, but now we're confident that it's not a mis-calibration of our configs but a slow-bleeding leak of native memory. The crux of the problem is a very particular type of job -- one that reads a lot of Zstd-compressed Parquet (therefore triggering the affected path). Other jobs not reading Parquet are not affected.
Thanks @alexeykudinkin for the explanation.
Same problem as @alexeykudinkin. Currently we replace the parquet jar with a patched one in our image, waiting for a release.
Same here, we have certain jobs that can't function without a patched jar. It seems to get worse the more columns you read. Our worst offender (a table with thousands of columns) can easily hit 90+ GiB of resident memory on a 25 GiB heap executor in an hour or so (running on YARN without memory limit enforcement).
I also encountered this memory issue when migrating data from parquet/snappy to parquet/zstd: Spark executors always occupy an unreasonable amount of off-heap memory and have a high risk of being killed by the NM. This patch solves the problem. It's a critical issue in my case; I wish upstream would publish a patched version so that we don't need to maintain an internal build.
…und for PARQUET-2160

### What changes were proposed in this pull request?
SPARK-41952 was raised a while ago, but unfortunately the Parquet community has not published the patched version yet; as a workaround, we can fix the issue on the Spark side first.

We encountered this memory issue when migrating data from parquet/snappy to parquet/zstd: Spark executors always occupy an unreasonable amount of off-heap memory and have a high risk of being killed by the NM. See more discussion at apache/parquet-java#982 and apache/iceberg#5681.

### Why are the changes needed?
The issue is fixed in the Parquet community ([PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160)), but the patched version is not available yet.

### Does this PR introduce _any_ user-facing change?
Yes, it's a bug fix.

### How was this patch tested?
The existing UTs should cover the correctness check; I also verified this patch by scanning a large parquet/zstd table.

```
spark-shell --executor-cores 4 --executor-memory 6g --conf spark.executor.memoryOverhead=2g
```
```
spark.sql("select sum(hash(*)) from parquet_zstd_table ").show(false)
```

- before this patch: All executors get killed by the NM quickly.

```
ERROR YarnScheduler: Lost executor 1 on hadoop-xxxx.****.org: Container killed by YARN for exceeding physical memory limits. 8.2 GB of 8 GB physical memory used. Consider boosting spark.executor.memoryOverhead.
```

<img width="1872" alt="image" src="https://user-images.githubusercontent.com/26535726/220031678-e9060244-5586-4f0c-8fe7-55bb4e20a580.png">

- after this patch: Query runs well, no executor gets killed.

<img width="1881" alt="image" src="https://user-images.githubusercontent.com/26535726/220031917-4fe38c07-b38f-49c6-a982-2091a6c2a8ed.png">

Closes #40091 from pan3793/SPARK-41952.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
@zhongyujiang
@wangzzu I guess it's because ZSTD allocates and uses off-heap memory very frequently, and before this PR the release of those native resources was subject to Java's GC, so I tried to relieve off-heap memory pressure by releasing them promptly.
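To make that reasoning concrete, here is a generic (non-parquet) illustration of deterministic close versus relying on GC; NativeBufferStream is a made-up stand-in for a stream that owns off-heap buffers, like the ZSTD input stream:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Made-up stand-in for a decompressor stream that owns off-heap buffers.
class NativeBufferStream extends FilterInputStream {
  NativeBufferStream(InputStream in) { super(in); }

  @Override
  public void close() throws IOException {
    // In the real ZSTD stream, this is where the native buffers are freed.
    super.close();
  }
}

public class CloseVsGcExample {
  public static void main(String[] args) throws IOException {
    byte[] compressed = new byte[] {1, 2, 3};

    // Before the fix: the stream is left for GC/finalization to clean up,
    // so native memory can accumulate far beyond what the Java heap shows.
    InputStream lingering = new NativeBufferStream(new ByteArrayInputStream(compressed));
    lingering.read();
    // (no close: native buffers linger until the object is collected)

    // With the fix: read what is needed, then close deterministically so the
    // off-heap buffers are returned right away.
    try (InputStream eager = new NativeBufferStream(new ByteArrayInputStream(compressed))) {
      eager.read();
    }
  }
}
```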
Benchmark:
The benchmark data consists of 500 string columns * 1,000,000 rows (10.85 GB under ZSTD compression), and the benchmark simply reads all the data and passes it to the JMH black hole.
JMH Configuration (a minimal sketch of the benchmark follows this list):
3 rounds of warmup iterations
5 rounds of measurement iterations
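A minimal JMH sketch of such a read benchmark, assuming a pre-generated file; the class name, file path, and fork count are assumptions, while the warmup/measurement counts and the black-hole read loop follow the description above:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

// Hypothetical reconstruction of the read benchmark described above:
// scan a ZSTD-compressed file with 500 string columns x 1,000,000 rows
// and feed every record to the JMH black hole.
@State(Scope.Benchmark)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class ZstdReadBenchmark {

  // Path to the pre-generated benchmark file (assumed; not from the PR).
  private final Path file = new Path("/tmp/zstd_500cols_1m_rows.parquet");

  @Benchmark
  public void readAll(Blackhole bh) throws Exception {
    try (ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), file).build()) {
      Group record;
      while ((record = reader.read()) != null) {
        bh.consume(record);
      }
    }
  }
}
```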
Notes:
According to the benchmark results below, closing the stream does not add extra heap overhead, but improves GC instead. From the comparison of the first two results below, we can see that closing the stream improves read performance and reduces GC load.
I also measured the impact of heap size (GC) on read performance. After reducing heap memory to 3GB, GC events increased, but read performance did not decrease; however, after increasing heap memory to 5GB, GC events decreased by about 30%, GC time increased by about 2x, and read performance was significantly reduced, which may support the earlier guess in the Jira: untimely GC may exacerbate the off-heap memory fragmentation issue. (Of course, this may require more in-depth testing to corroborate.)
Out of curiosity, I also tested the effect of closing the stream on Snappy and Gzip; the benchmark results show that in such a simple read scenario, closing the stream also has some positive effect on both.
For more details, please refer to the benchmark screenshots (two per configuration, omitted here):
- ZSTD + Not close + 4GB
- ZSTD + Close + 4GB
- ZSTD + Not close + 3GB
- ZSTD + Not close + 5GB
- Snappy + Not close + 4GB
- Snappy + Close + 4GB
- Gzip + Not close + 4GB
- Gzip + Close + 4GB