
Conversation

@zhongyujiang
Contributor

@zhongyujiang commented Aug 8, 2022

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

Benchmark

The benchmark data set consists of 500 string columns × 1,000,000 rows (10.85 GB under ZSTD compression), and the benchmark simply reads all of the data and passes it to a JMH black hole.

JMH Configuration:
3 rounds of warmup iterations
5 rounds of measurement iterations
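
For reference, a minimal JMH harness along these lines might look like the sketch below. The file path, reader setup, and class name are hypothetical stand-ins, not the benchmark code actually used to produce the numbers in this PR.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the benchmark described above: read every record of a
// wide string table and feed it to the JMH black hole. The file path and reader
// setup are placeholders, not the actual benchmark source.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class ZstdReadBenchmark {

  @Param({"/tmp/bench/strings_500cols_zstd.parquet"})  // placeholder path
  String file;

  @Benchmark
  public void readAll(Blackhole bh) throws IOException {
    try (ParquetReader<Group> reader =
             ParquetReader.builder(new GroupReadSupport(), new Path(file)).build()) {
      Group record;
      while ((record = reader.read()) != null) {
        bh.consume(record);  // pass every record to the black hole
      }
    }
  }
}
```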

Notes:

  • Closing the input stream means eagerly reading the decompressed data onto the heap and then closing the stream, which is the approach proposed in this PR. Not closing the stream means leaving its cleanup to the GC.
  • The JMH score in the tables below is the average over the measurement iterations, while GC Events and GC Time are totals over the whole run (3 warmup + 5 measurement iterations).

According to the benchmark results below, closing the stream does not add heap overhead; instead it reduces GC pressure. Comparing the first two rows of the table below, we can see that closing the stream improves read performance and lowers the GC load.

I also measured the impact of heap size (and thus GC behavior) on read performance. After reducing the heap to 3 GB, GC events increased but read performance did not degrade. However, after increasing the heap to 5 GB, GC events decreased by about 30%, yet total GC time roughly tripled and read performance dropped significantly. This may support the earlier guess in the Jira issue that untimely GC can exacerbate off-heap memory fragmentation (of course, this would require more in-depth testing to corroborate).

| Compression | Close InputStream | Heap Size (GB) | JMH Score ± Error (s/op) | GC Events | GC Time (s) |
|-------------|-------------------|----------------|--------------------------|-----------|-------------|
| ZSTD | No | 4 | 53.374 ± 3.303 | 1591 | 40.069 |
| ZSTD | Yes | 4 | 49.613 ± 2.564 | 928 | 17.194 |
| ZSTD | No | 3 | 53.303 ± 1.431 | 2174 | 46.121 |
| ZSTD | No | 5 | 75.993 ± 10.919 | 950 | 116.184 |

Out of curiosity, I also tested the effect of closing the stream with Snappy and Gzip; the benchmark results show that in this simple read scenario closing the stream also has some positive effect for both codecs.

| Compression | Close InputStream | Heap Size (GB) | JMH Score ± Error (s/op) | GC Events | GC Time (s) |
|-------------|-------------------|----------------|--------------------------|-----------|-------------|
| Snappy | No | 4 | 38.835 ± 7.505 | 1618 | 40.193 |
| Snappy | Yes | 4 | 33.796 ± 2.871 | 872 | 12.449 |
| Gzip | No | 4 | 94.902 ± 5.833 | 1539 | 39.630 |
| Gzip | Yes | 4 | 86.160 ± 2.329 | 843 | 12.153 |

For more details, please refer to the JMH output and GC screenshots attached for each configuration:

  • ZSTD + Not close + 4GB
  • ZSTD + Close + 4GB
  • ZSTD + Not close + 3GB
  • ZSTD + Not close + 5GB
  • Snappy + Not close + 4GB
  • Snappy + Close + 4GB
  • Gzip + Not close + 4GB
  • Gzip + Close + 4GB

}
InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
decompressed = BytesInput.from(is, uncompressedSize);
if (codec instanceof ZstandardCodec) {
Contributor Author

This looks a little weird, but doing it this way loads the decompressed data onto the heap in advance; since only ZSTD currently has this problem, I made this modification only for ZSTD streams.
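
Roughly, the approach being described amounts to the sketch below. It is a hedged illustration based on the diff context shown in this review; the helper wrapper and the exact placement of is.close() are assumptions, not the committed parquet-mr code.

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.hadoop.codec.ZstandardCodec;

// Hedged sketch of the approach described above; the class name and the exact
// placement of is.close() are assumptions, not the actual committed change.
class ZstdEagerDecompressSketch {
  static BytesInput decompress(CompressionCodec codec, Decompressor decompressor,
                               BytesInput bytes, int uncompressedSize) throws IOException {
    InputStream is = codec.createInputStream(bytes.toInputStream(), decompressor);
    if (codec instanceof ZstandardCodec) {
      // Load the decompressed bytes onto the heap right away so the native
      // ZSTD stream can be closed immediately instead of waiting for GC.
      BytesInput decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
      is.close();
      return decompressed;
    }
    // Other codecs keep the existing lazy behavior.
    return BytesInput.from(is, uncompressedSize);
  }
}
```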

Contributor Author


Maybe we can consider closing the decompressed stream after it has been read:
https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java#L283-L288
But I'm not sure if there is a situation where the decompressed stream is read more than once.
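
For what it's worth, that alternative would look roughly like the hypothetical helper below, which closes the underlying stream as soon as the expected number of bytes has been consumed. It is only a sketch; it is not the actual BytesInput implementation linked above.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of "close the decompressed stream after it has
// been read"; not the actual BytesInput internals.
class ReadFullyAndClose {
  static byte[] readFullyAndClose(InputStream in, int uncompressedSize) throws IOException {
    try (InputStream is = in) {  // the stream is closed as soon as the read completes
      byte[] buf = new byte[uncompressedSize];
      int off = 0;
      while (off < uncompressedSize) {
        int n = is.read(buf, off, uncompressedSize - off);
        if (n < 0) {
          throw new EOFException("Stream ended after " + off + " of " + uncompressedSize + " bytes");
        }
        off += n;
      }
      return buf;
    }
    // Caveat from the comment above: if callers ever read the same BytesInput
    // more than once, closing here would break the second read.
  }
}
```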

@sunchao
Member

The change looks OK to me; we should probably add some comments explaining why ZSTD deserves the special treatment here.

The change on BytesInput looks more intrusive since it is used not only for decompression but also in other places like compression. For instance, BytesInput.copy calls toByteArray underneath, and after the call the original object should still be valid.
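
To illustrate the invariant being pointed out -- purely a simplified sketch, not the real org.apache.parquet.bytes.BytesInput source -- copy() materializes the bytes via toByteArray(), and callers may keep using the original afterwards, so toByteArray() cannot be allowed to close or invalidate the underlying input:

```java
import java.io.IOException;

// Simplified sketch (not the real BytesInput) of why closing the underlying
// stream inside toByteArray() would be intrusive.
abstract class BytesInputSketch {
  abstract byte[] toByteArray() throws IOException;

  static BytesInputSketch copy(BytesInputSketch original) throws IOException {
    final byte[] data = original.toByteArray();  // materialize onto the heap
    // The caller may still hold `original` and read it again later (e.g. on the
    // compression path), so toByteArray() must leave the original object valid
    // rather than closing its source.
    return new BytesInputSketch() {
      @Override
      byte[] toByteArray() {
        return data;
      }
    };
  }
}
```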

Contributor Author


Added comment.


// This change will load the decompressor stream into heap a little earlier, since the problem it solves
// only happens in the ZSTD codec, so this modification is only made for ZSTD streams.
if (codec instanceof ZstandardCodec) {
decompressed = BytesInput.copy(BytesInput.from(is, uncompressedSize));
Contributor
@shangxinli commented Aug 21, 2022

I understand from the discussion in the Jira that BytesInput.copy() just loads the data onto the heap in advance without adding extra overhead overall. Can we have a benchmark on the heap/GC impact (heap size, GC time, etc.)? I just want to make sure we don't fix one problem while introducing another.

Other than that, treating ZSTD specially might be OK since we have pretty decent comments.

Contributor Author


> Can we have a benchmark on the heap/GC impact (heap size, GC time, etc.)?

Sure, will do a benchmark.

Contributor Author


Hi @shangxinli, very sorry about the big delay, I was a little busy last week. The benchmark results and detailed data have been posted in the PR description, also cc @sunchao.

Contributor


Thanks for working on it!

@shangxinli
Contributor

LGTM

Member
@sunchao left a comment

LGTM too, thanks @zhongyujiang !

@camper42

Wondering why this PR has not been merged.

We have faced this problem and are looking for a solution (Spark 3.2.2 + zstd Parquet).

@sunchao
Member

sunchao commented Sep 21, 2022

@shangxinli merged commit 53f65a8 into apache:master on Sep 22, 2022
@zhongyujiang deleted the close-zstd-decompress-stream branch on September 23, 2022 05:49
@alexeykudinkin
Contributor

alexeykudinkin commented Jan 9, 2023

@gszadovszky @ggershinsky @shangxinli

Folks, do we have an approximate timeline for the next patch release that will be including this patch?

This is a severe problem that affects our ability to use Parquet w/ Zstd, and I'm aware of at least a handful of similar occurrences and issues affecting others.

Corresponding issue in Spark: https://issues.apache.org/jira/browse/SPARK-41952

@shangxinli
Contributor

@alexeykudinkin We might release a new patch in the next 2 or 3 months.

Can you elaborate why "this is a severe problem that does affect our ability to use Parquet w/ Zstd"? I understand it is an issue but we have been running Parquet w/ Zstd for 2+ years in production.

@alexeykudinkin
Contributor

Totally @shangxinli

We have Spark clusters running in production ingesting from hundreds of Apache Hudi tables (using Parquet and Zstd) and writing into other ones. We switched from gzip to zstd slightly over a month ago and started having OOM issues almost immediately. It took us a bit of triaging to zero in on zstd, but now we're confident that it's not a mis-calibration of our configs but a slow-bleeding leak of native memory.

The crux of the problem is a very particular type of job -- one that reads a lot of Zstd-compressed Parquet (and therefore triggers the affected path). Jobs not reading Parquet are not affected.

@shangxinli
Contributor

Thanks @alexeykudinkin for the explanation.

@camper42

Same problem as @alexeykudinkin.

Currently we replace the parquet jar with a patched one in our image while waiting for a release.

@Kimahriman

Same here; we have certain jobs that can't function without a patched jar. It seems to get worse the more columns you read. Our worst offender (a table with thousands of columns) can easily hit 90+ GiB of resident memory on a 25 GiB heap executor within an hour or so (running on YARN without memory limit enforcement).

@pan3793
Member

pan3793 commented Feb 19, 2023

I also encountered this memory issue when migrating data from parquet/snappy to parquet/zstd: Spark executors always occupy an unreasonable amount of off-heap memory and run a high risk of being killed by the NM. This patch solves the problem. It's a critical issue in my case; I wish upstream would publish a patched version so that we don't need to maintain an internal one.

sunchao pushed a commit to apache/spark that referenced this pull request Feb 20, 2023
…und for PARQUET-2160

### What changes were proposed in this pull request?

SPARK-41952 was raised a while ago, but unfortunately the Parquet community has not published a patched version yet; as a workaround, we can fix the issue on the Spark side first.

We encountered this memory issue when migrating data from parquet/snappy to parquet/zstd: Spark executors always occupy an unreasonable amount of off-heap memory and have a high risk of being killed by the NM.

See more discussions at apache/parquet-java#982 and apache/iceberg#5681

### Why are the changes needed?

The issue is fixed in the parquet community [PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160), but the patched version is not available yet.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

The existing UTs should cover the correctness check; I also verified this patch by scanning a large parquet/zstd table.

```
spark-shell --executor-cores 4 --executor-memory 6g --conf spark.executor.memoryOverhead=2g
```

```
spark.sql("select sum(hash(*)) from parquet_zstd_table ").show(false)
```

- before this patch

All executors get killed by NM quickly.
```
ERROR YarnScheduler: Lost executor 1 on hadoop-xxxx.****.org: Container killed by YARN for exceeding physical memory limits. 8.2 GB of 8 GB physical memory used. Consider boosting spark.executor.memoryOverhead.
```
<img width="1872" alt="image" src="https://user-images.githubusercontent.com/26535726/220031678-e9060244-5586-4f0c-8fe7-55bb4e20a580.png">

- after this patch

Query runs well, no executor gets killed.

<img width="1881" alt="image" src="https://user-images.githubusercontent.com/26535726/220031917-4fe38c07-b38f-49c6-a982-2091a6c2a8ed.png">

Closes #40091 from pan3793/SPARK-41952.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…und for PARQUET-2160

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…und for PARQUET-2160

@wangzzu

wangzzu commented Jun 29, 2023

@zhongyujiang
We also encountered the same problem. I have another question: why does this problem only occur with parquet/zstd reads?

@zhongyujiang
Contributor Author

@wangzzu I guess it's because ZSTD allocates and uses off-heap memory very frequently, and before this PR the release of those native resources was subject to Java's GC, so I tried to relieve the off-heap memory pressure by releasing them promptly.
I didn't dig any deeper as my knowledge of these codecs is very limited.
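
As a rough illustration of that point (a hypothetical class, not the actual ZSTD codec code): a stream that owns native memory only frees it in close(), so if close() is left to GC-driven cleanup, the off-heap allocation can stay alive long after the stream is no longer needed.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not the real ZSTD codec: native memory owned by a
// decompressor stream is only released when close() runs.
class NativeBackedStreamSketch extends FilterInputStream {
  private long nativeHandle;  // stands in for off-heap buffers owned by the codec

  NativeBackedStreamSketch(InputStream in) {
    super(in);
    this.nativeHandle = allocateNative();  // pretend off-heap allocation
  }

  @Override
  public void close() throws IOException {
    if (nativeHandle != 0) {
      freeNative(nativeHandle);  // freed only when close() is actually called
      nativeHandle = 0;
    }
    super.close();
  }

  private static long allocateNative() { return 1L; }
  private static void freeNative(long handle) { /* no-op in this sketch */ }

  public static void main(String[] args) throws IOException {
    // With try-with-resources the native memory is released deterministically;
    // relying on GC instead leaves it allocated until (if ever) cleanup runs.
    try (InputStream s = new NativeBackedStreamSketch(new ByteArrayInputStream(new byte[16]))) {
      while (s.read() != -1) { /* drain */ }
    }
  }
}
```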
