ORC-1593: Set orc.compression.zstd.level to 3 by default #1760

Closed
wants to merge 1 commit into from

Conversation

dongjoon-hyun
Member

What changes were proposed in this pull request?

This PR aims to set orc.compression.zstd.level to 3 by default.

Why are the changes needed?

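To prevent a file-size regression from ORC 1.9.x. In the taxi listings below, ORC 2.0 at level 3 matches the 299M file written by ORC 1.9, while level 1 produces 334M.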

ORC 1.9

data/generated//taxi:
total 2196176
drwxr-xr-x  5 dongjoon  staff   160B Jan 17 08:02 .
drwxr-xr-x  5 dongjoon  staff   160B Jan 17 08:07 ..
-rw-r--r--  1 dongjoon  staff   299M Jan 17 08:03 orc.zstd

ORC 2.0

-rw-r--r--  1 dongjoon  staff   334M Jan 17 07:56 orc.zstd (level 1)
-rw-r--r--  1 dongjoon  staff   299M Jan 17 08:16 orc.zstd (level 3)
-rw-r--r--  1 dongjoon  staff   302M Jan 17 08:21 orc.zstd (level 4)
-rw-r--r--  1 dongjoon  staff   300M Jan 17 08:27 orc.zstd (level 5)
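
To make the comparison concrete, here is a small sketch that computes how much larger each ORC 2.0 file is than the ORC 1.9 baseline. The sizes are copied from the listings above (already rounded by `ls`, so the percentages are approximate). The helper name `overhead_percent` and the variable names are illustrative only and are not part of ORC.

```python
# Sizes in "M" exactly as printed by `ls` in the listings above.
# They are rounded by ls, so the derived percentages are approximate.
BASELINE_ORC_1_9 = 299

# zstd level -> ORC 2.0 file size for the same taxi data
orc_2_0_sizes = {1: 334, 3: 299, 4: 302, 5: 300}


def overhead_percent(size, baseline=BASELINE_ORC_1_9):
    """Return how much larger `size` is than `baseline`, in percent."""
    return (size - baseline) / baseline * 100.0


for level, size in sorted(orc_2_0_sizes.items()):
    print(f"level {level}: {size}M ({overhead_percent(size):+.1f}% vs ORC 1.9)")

# Among the measured levels, pick the one with the smallest file.
smallest_level = min(orc_2_0_sizes, key=orc_2_0_sizes.get)
print(f"smallest file among measured levels: level {smallest_level}")
```

On these numbers, level 1 is roughly 12% larger than the ORC 1.9 file, while levels 3 to 5 are all within about 1% of it, with level 3 the smallest of the levels measured.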

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun added this to the 2.0.0 milestone Jan 17, 2024
@github-actions github-actions bot added the JAVA label Jan 17, 2024
@dongjoon-hyun dongjoon-hyun changed the title ORC-1593: Set orc.compression.zstd.level to 3 by default ORC-1593: Set orc.compression.zstd.level to 3 by default Jan 17, 2024
dongjoon-hyun added a commit that referenced this pull request Jan 17, 2024
Closes #1760 from dongjoon-hyun/ORC-1593.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 3f2a0b3)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the ORC-1593 branch January 17, 2024 17:25
@cxzl25
Contributor

cxzl25 commented Jan 18, 2024

Thanks @dongjoon-hyun .

Both aircompressor (used by ORC) and zstd-jni (used by Parquet) default to compression level 3.
I verified in our online environment that zstd-jni at level 3 is no worse than aircompressor at level 3.

https://github.com/airlift/aircompressor/blob/ca561c8214100b1e646a395c2683212419719dc8/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L26

https://github.com/apache/parquet-mr/blob/c82d5b471a558124b03e67759038661a046f5938/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/ZstandardCodec.java#L52

@dongjoon-hyun
Member Author

Ya, thank you for checking.

dongjoon-hyun added a commit to apache/spark that referenced this pull request Mar 8, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 2.0.0 for Apache Spark 4.0.0.

The Apache ORC community has a 3-year support policy, which is longer than Apache Spark's. The versions align as follows.
- Apache ORC 2.0.x <-> Apache Spark 4.0.x
- Apache ORC 1.9.x <-> Apache Spark 3.5.x
- Apache ORC 1.8.x <-> Apache Spark 3.4.x
- Apache ORC 1.7.x (Supported) <-> Apache Spark 3.3.x (End-Of-Support)

### Why are the changes needed?

**Release Note**
- https://github.com/apache/orc/releases/tag/v2.0.0

**Milestone**
- https://github.com/apache/orc/milestone/20?closed=1
  - apache/orc#1728
  - apache/orc#1801
  - apache/orc#1498
  - apache/orc#1627
  - apache/orc#1497
  - apache/orc#1509
  - apache/orc#1554
  - apache/orc#1708
  - apache/orc#1733
  - apache/orc#1760
  - apache/orc#1743

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45443 from dongjoon-hyun/SPARK-44115.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024