ORC-1251: Use Hadoop Vectored IO #1708

williamhyun · 2023-12-26T01:02:38Z

What changes were proposed in this pull request?

This PR aims to always use Hadoop Vectored IO in Apache ORC 2.0.0.

Why are the changes needed?

Apache ORC 2.0.0 is ready to use this new Hadoop feature.

- ORC-1430: Use Hadoop 3.3.5 shaded clients #1509
- ORC-1456: Update hadoop.version to 3.3.6 #1554
- Hadoop Vectored IO Presentation

> Works great everywhere; radical benefit in object stores

How was this patch tested?

Pass the CIs.

williamhyun · 2023-12-26T01:15:15Z

cc: @wgtmac @dongjoon-hyun @HarshitGupta11 @mukund-thakur @steveloughran @jerqi

dongjoon-hyun · 2023-12-26T08:04:52Z

I added Milestone v2.0.0.

dongjoon-hyun · 2023-12-26T08:09:17Z

java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java

@@ -103,8 +106,7 @@ public OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws I
    public BufferChunkList readFileData(BufferChunkList range,
                                        boolean doForceDirect
                                        ) throws IOException {
-      RecordReaderUtils.readDiskRanges(file, zcr, range, doForceDirect, minSeekSize,
-                                       minSeekSizeTolerance);
+      RecordReaderUtils.readDiskRangesVectored(file, range, doForceDirect);


Got it. I believe it's okay to use Hadoop Vectored IO only from Apache ORC 2.0.0.
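
The diff above swaps the seek-based `readDiskRanges` call for `readDiskRangesVectored`. As a rough illustration of the idea behind a vectored read (all byte ranges are submitted in one call, and each range is completed through its own future), here is a JDK-only sketch. It uses neither Hadoop nor ORC: `VectoredReadSketch`, `Range`, `readVectored`, and `readStrings` are hypothetical names chosen for this example. It is only loosely modeled on the shape of Hadoop's vectored read API (a list of file ranges plus a buffer allocator), and the actual ORC implementation may differ.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

// JDK-only analogue of a vectored read; every name here is hypothetical.
class VectoredReadSketch {

  /** One requested byte range plus a future that will hold its data. */
  static final class Range {
    final long offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Submit all ranges at once; each range's future completes independently. */
  static void readVectored(FileChannel channel, List<Range> ranges,
                           IntFunction<ByteBuffer> allocate) {
    for (Range r : ranges) {
      CompletableFuture.runAsync(() -> {
        try {
          ByteBuffer buf = allocate.apply(r.length);
          long pos = r.offset;
          while (buf.hasRemaining()) {
            // Positional read: does not move the channel's own position.
            int n = channel.read(buf, pos);
            if (n < 0) {
              throw new IOException("EOF before range was filled");
            }
            pos += n;
          }
          buf.flip();
          r.data.complete(buf);
        } catch (Throwable e) {
          r.data.completeExceptionally(e);
        }
      });
    }
  }

  /** Demo helper: read (offset, length) pairs and decode each range as UTF-8. */
  static List<String> readStrings(Path file, long[][] offsetsAndLengths) throws IOException {
    List<Range> ranges = new ArrayList<>();
    for (long[] ol : offsetsAndLengths) {
      ranges.add(new Range(ol[0], (int) ol[1]));
    }
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      readVectored(ch, ranges, ByteBuffer::allocate);
      List<String> out = new ArrayList<>();
      for (Range r : ranges) {
        // Wait for each range before the channel is closed.
        out.add(StandardCharsets.UTF_8.decode(r.data.join()).toString());
      }
      return out;
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("vectored", ".bin");
    Files.write(tmp, "0123456789abcdefghij".getBytes(StandardCharsets.UTF_8));
    System.out.println(readStrings(tmp, new long[][] {{0, 4}, {10, 3}, {16, 4}}));
    Files.delete(tmp);
  }
}
```

The caller hands over the full list of ranges up front, which is what lets an underlying store schedule, merge, or parallelize the reads. In this sketch the "store" is just a local `FileChannel` and the common fork-join pool.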

dongjoon-hyun

Thank you for making a PR. Could you make the CI happy?

dongjoon-hyun · 2023-12-27T00:44:10Z

I fixed the checkstyle issue.

dongjoon-hyun

I finished my second-round review and simplified the logic a little, @williamhyun . I believe this PR is ready for further integration testing. Thank you so much!

dongjoon-hyun

+1, LGTM from my side.

dongjoon-hyun · 2023-12-27T06:20:09Z

Given that the patch size is small, we can test more after merging. Feel free to merge, @williamhyun .

dongjoon-hyun · 2023-12-27T19:12:12Z

Let me merge this with the following authorship.

Lead-authored-by: William Hyun <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: HarshitGupta11 <[email protected]>

mukund-thakur · 2024-01-02T23:23:34Z

Thanks, everyone for finishing this up.

### What changes were proposed in this pull request?

This PR aims to use `Hadoop Vectored IO` always in Apache ORC 2.0.0.

### Why are the changes needed?

Apache ORC 2.0.0 is ready to use this new Hadoop feature.

- apache#1509
- apache#1554
- [Hadoop Vectored IO Presentation](https://docs.google.com/presentation/d/1U5QRN4etbM7gkbnGO3OW4sCfUZx9LqJN/)

> Works great everywhere; radical benefit in object stores

### How was this patch tested?

Pass the CIs.

Closes apache#1708 from williamhyun/hadoopvectorized.

Lead-authored-by: William Hyun <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: HarshitGupta11 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

### What changes were proposed in this pull request?

This PR aims to Upgrade Apache ORC to 2.0.0 for Apache Spark 4.0.0.

Apache ORC community has 3-year support policy which is longer than Apache Spark. It's aligned like the following.

- Apache ORC 2.0.x <-> Apache Spark 4.0.x
- Apache ORC 1.9.x <-> Apache Spark 3.5.x
- Apache ORC 1.8.x <-> Apache Spark 3.4.x
- Apache ORC 1.7.x (Supported) <-> Apache Spark 3.3.x (End-Of-Support)

### Why are the changes needed?

**Release Note**
- https://github.com/apache/orc/releases/tag/v2.0.0

**Milestone**
- https://github.com/apache/orc/milestone/20?closed=1
- apache/orc#1728
- apache/orc#1801
- apache/orc#1498
- apache/orc#1627
- apache/orc#1497
- apache/orc#1509
- apache/orc#1554
- apache/orc#1708
- apache/orc#1733
- apache/orc#1760
- apache/orc#1743

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45443 from dongjoon-hyun/SPARK-44115.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

implement hadoop vectorized io

ff3af60

github-actions bot added the JAVA label Dec 26, 2023

williamhyun changed the title from "implement hadoop vectorized io" to "ORC-1251: Use Hadoop Vectored IO" Dec 26, 2023

williamhyun mentioned this pull request Dec 26, 2023

ORC-1251:Support Vectored IO in ORC [Draft] #1276 (Closed)

Checkstyle

22ed44e

dongjoon-hyun added this to the 2.0.0 milestone Dec 26, 2023

dongjoon-hyun reviewed Dec 26, 2023

java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java (outdated, resolved)

dongjoon-hyun reviewed Dec 26, 2023

java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java (outdated, resolved)

dongjoon-hyun reviewed Dec 26, 2023

dongjoon-hyun assigned williamhyun Dec 26, 2023

fix checkstyle

500ec10

dongjoon-hyun reviewed Dec 27, 2023

java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java (resolved)

Simplify

35d77b3

dongjoon-hyun reviewed Dec 27, 2023

dongjoon-hyun approved these changes Dec 27, 2023

Add co-authorship

2bcf748

dongjoon-hyun closed this in bc046ed Dec 27, 2023

dongjoon-hyun mentioned this pull request Mar 8, 2024

[SPARK-44115][BUILD] Upgrade Apache ORC to 2.0.0 apache/spark#45443 (Closed)
