ORC-1251: Use Hadoop Vectored IO #1708

Closed
williamhyun wants to merge 5 commits

@williamhyun williamhyun commented Dec 26, 2023

What changes were proposed in this pull request?

This PR aims to always use Hadoop Vectored IO in Apache ORC 2.0.0.

Why are the changes needed?

Apache ORC 2.0.0 is ready to use this new Hadoop feature.

How was this patch tested?

Pass the CIs.

@github-actions github-actions bot added the JAVA label Dec 26, 2023
@williamhyun williamhyun changed the title from "implement hadoop vectorized io" to "ORC-1251: Use Hadoop Vectored IO" Dec 26, 2023
@dongjoon-hyun dongjoon-hyun added this to the 2.0.0 milestone Dec 26, 2023
@dongjoon-hyun

I added Milestone v2.0.0.

@@ -103,8 +106,7 @@ public OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws I
   public BufferChunkList readFileData(BufferChunkList range,
                                       boolean doForceDirect
                                       ) throws IOException {
-    RecordReaderUtils.readDiskRanges(file, zcr, range, doForceDirect, minSeekSize,
-      minSeekSizeTolerance);
+    RecordReaderUtils.readDiskRangesVectored(file, range, doForceDirect);

Got it. I believe it's okay to use Hadoop Vectored IO only from Apache ORC 2.0.0.
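
For readers unfamiliar with the idea behind the change above: a vectored read hands the whole list of byte ranges to the file system at once and gets one future per range back, instead of seeking and reading each range in turn. In Hadoop this is exposed as `FSDataInputStream.readVectored(List<? extends FileRange>, IntFunction<ByteBuffer>)`, where each `FileRange` carries a `CompletableFuture<ByteBuffer>`. The sketch below is only a JDK-only analogy of that pattern, not the body of `readDiskRangesVectored` (which is not shown in this conversation) and not Hadoop's implementation. `VectoredReadSketch`, `Range`, `readVectored`, and `demo` are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Illustrative stand-in for a vectored read; NOT Hadoop's or ORC's code. */
class VectoredReadSketch {

  /** Hypothetical range descriptor: a byte offset and a length. */
  static final class Range {
    final long offset;
    final int length;

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /**
   * Submits every range up front and returns one future per range,
   * so the reads may proceed concurrently rather than one after another.
   */
  static List<CompletableFuture<ByteBuffer>> readVectored(FileChannel channel,
                                                          List<Range> ranges) {
    List<CompletableFuture<ByteBuffer>> results = new ArrayList<>();
    for (Range range : ranges) {
      results.add(CompletableFuture.supplyAsync(() -> readFully(channel, range)));
    }
    return results;
  }

  /** Positional read: it does not move the channel's own position. */
  private static ByteBuffer readFully(FileChannel channel, Range range) {
    ByteBuffer buffer = ByteBuffer.allocate(range.length);
    try {
      long position = range.offset;
      while (buffer.hasRemaining()) {
        int n = channel.read(buffer, position);
        if (n < 0) {
          throw new IOException("Unexpected end of file at " + position);
        }
        position += n;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    buffer.flip();
    return buffer;
  }

  /** Writes a small temp file, reads two non-adjacent ranges, and decodes them. */
  static List<String> demo() {
    try {
      Path file = Files.createTempFile("vectored-read", ".bin");
      try {
        Files.write(file, "hello, vectored world".getBytes(StandardCharsets.UTF_8));
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
          List<CompletableFuture<ByteBuffer>> futures =
              readVectored(channel, List.of(new Range(0, 5), new Range(16, 5)));
          List<String> decoded = new ArrayList<>();
          for (CompletableFuture<ByteBuffer> future : futures) {
            // join() blocks until that particular range has been read.
            decoded.add(StandardCharsets.UTF_8.decode(future.join()).toString());
          }
          return decoded;
        }
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The caller only waits when it actually needs a given buffer, which is presumably why this style pays off most on high-latency storage such as object stores.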

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for making a PR. Could you make the CI happy?

@dongjoon-hyun

I fixed the checkstyle issue.

@dongjoon-hyun dongjoon-hyun left a comment

I finished my second-round review and simplified the logic a little, @williamhyun . I believe this PR is ready for further integration testing. Thank you so much!

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM from my side.

@dongjoon-hyun

Given that the patch size is small, we can test more after merging. Feel free to merge, @williamhyun .

@dongjoon-hyun

Let me merge this with the following authorship.

Lead-authored-by: William Hyun <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: HarshitGupta11 <[email protected]>

@mukund-thakur

Thanks, everyone, for finishing this up.

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
### What changes were proposed in this pull request?

This PR aims to use `Hadoop Vectored IO` always in Apache ORC 2.0.0.

### Why are the changes needed?

Apache ORC 2.0.0 is ready to use this new Hadoop feature.
  - apache#1509
  - apache#1554
  - [Hadoop Vectored IO Presentation](https://docs.google.com/presentation/d/1U5QRN4etbM7gkbnGO3OW4sCfUZx9LqJN/)
    > Works great everywhere; radical benefit in object stores

### How was this patch tested?

Pass the CIs.

Closes apache#1708 from williamhyun/hadoopvectorized.

Lead-authored-by: William Hyun <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: HarshitGupta11 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit to apache/spark that referenced this pull request Mar 8, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 2.0.0 for Apache Spark 4.0.0.

The Apache ORC community has a 3-year support policy, which is longer than Apache Spark's. The release lines are aligned as follows.
- Apache ORC 2.0.x <-> Apache Spark 4.0.x
- Apache ORC 1.9.x <-> Apache Spark 3.5.x
- Apache ORC 1.8.x <-> Apache Spark 3.4.x
- Apache ORC 1.7.x (Supported) <-> Apache Spark 3.3.x (End-Of-Support)

### Why are the changes needed?

**Release Note**
- https://github.com/apache/orc/releases/tag/v2.0.0

**Milestone**
- https://github.com/apache/orc/milestone/20?closed=1
  - apache/orc#1728
  - apache/orc#1801
  - apache/orc#1498
  - apache/orc#1627
  - apache/orc#1497
  - apache/orc#1509
  - apache/orc#1554
  - apache/orc#1708
  - apache/orc#1733
  - apache/orc#1760
  - apache/orc#1743

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45443 from dongjoon-hyun/SPARK-44115.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024