ORC-1251: Use Hadoop Vectored IO #1708
I added |
@@ -103,8 +106,7 @@ public OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws I
   public BufferChunkList readFileData(BufferChunkList range,
                                       boolean doForceDirect
                                       ) throws IOException {
-    RecordReaderUtils.readDiskRanges(file, zcr, range, doForceDirect, minSeekSize,
-        minSeekSizeTolerance);
+    RecordReaderUtils.readDiskRangesVectored(file, range, doForceDirect);
Got it. I believe it's okay to use Hadoop Vectored IO only from Apache ORC 2.0.0.
Thank you for making a PR. Could you make the CI happy?
I fixed the checkstyle issue.
I finished my second-round review and simplified the logic a little, @williamhyun . I believe this PR is ready for further integration testing. Thank you so much!
+1, LGTM from my side.
Given that the patch size is small, we can test more after merging. Feel free to merge, @williamhyun.
Let me merge this with the following authorship.
Thanks, everyone, for finishing this up.
### What changes were proposed in this pull request?

This PR aims to use `Hadoop Vectored IO` always in Apache ORC 2.0.0.

### Why are the changes needed?

Apache ORC 2.0.0 is ready to use this new Hadoop feature.

- apache#1509
- apache#1554
- [Hadoop Vectored IO Presentation](https://docs.google.com/presentation/d/1U5QRN4etbM7gkbnGO3OW4sCfUZx9LqJN/)

> Works great everywhere; radical benefit in object stores

### How was this patch tested?

Pass the CIs.

Closes apache#1708 from williamhyun/hadoopvectorized.

Lead-authored-by: William Hyun <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: HarshitGupta11 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 2.0.0 for Apache Spark 4.0.0.

The Apache ORC community has a 3-year support policy, which is longer than Apache Spark's. The versions are aligned as follows.

- Apache ORC 2.0.x <-> Apache Spark 4.0.x
- Apache ORC 1.9.x <-> Apache Spark 3.5.x
- Apache ORC 1.8.x <-> Apache Spark 3.4.x
- Apache ORC 1.7.x (Supported) <-> Apache Spark 3.3.x (End-Of-Support)

### Why are the changes needed?

**Release Note**

- https://github.com/apache/orc/releases/tag/v2.0.0

**Milestone**

- https://github.com/apache/orc/milestone/20?closed=1
- apache/orc#1728
- apache/orc#1801
- apache/orc#1498
- apache/orc#1627
- apache/orc#1497
- apache/orc#1509
- apache/orc#1554
- apache/orc#1708
- apache/orc#1733
- apache/orc#1760
- apache/orc#1743

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45443 from dongjoon-hyun/SPARK-44115.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to use `Hadoop Vectored IO` always in Apache ORC 2.0.0.

### Why are the changes needed?

Apache ORC 2.0.0 is ready to use this new Hadoop feature.

### How was this patch tested?

Pass the CIs.
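
For readers unfamiliar with the idea behind vectored reads, here is a minimal sketch using only the JDK. It is not the code in this PR and not Hadoop's API (Hadoop exposes this through `FSDataInputStream.readVectored`, which takes a list of `FileRange` objects and a buffer allocator). The class `VectoredReadSketch`, the `Range` type, and the `readRanges` method are hypothetical names chosen for illustration. The point is only that the caller hands over all ranges at once and gets one future per range, so the reads can proceed in parallel instead of one seek-and-read at a time.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class VectoredReadSketch {

  /** Hypothetical range type: a byte offset and a length within a file. */
  static final class Range {
    final long offset;
    final int length;

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Submits every range read up front and returns one future per range, in order. */
  static List<CompletableFuture<ByteBuffer>> readRanges(FileChannel ch, List<Range> ranges) {
    List<CompletableFuture<ByteBuffer>> futures = new ArrayList<>();
    for (Range r : ranges) {
      futures.add(CompletableFuture.supplyAsync(() -> readFully(ch, r)));
    }
    return futures;
  }

  /** Positional reads do not move the channel position, so they are safe to run concurrently. */
  private static ByteBuffer readFully(FileChannel ch, Range r) {
    ByteBuffer buf = ByteBuffer.allocate(r.length);
    long pos = r.offset;
    try {
      while (buf.hasRemaining()) {
        int n = ch.read(buf, pos);
        if (n < 0) {
          throw new EOFException("range extends past end of file");
        }
        pos += n;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    buf.flip();
    return buf;
  }

  public static void main(String[] args) throws Exception {
    Path p = Files.createTempFile("vectored-sketch", ".bin");
    try {
      Files.write(p, "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII));
      try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
        List<Range> ranges = List.of(new Range(2, 3), new Range(10, 4));
        for (CompletableFuture<ByteBuffer> f : readRanges(ch, ranges)) {
          // join() waits for that range only; the other reads keep running
          System.out.println(StandardCharsets.US_ASCII.decode(f.join()));
        }
      }
    } finally {
      Files.deleteIfExists(p);
    }
  }
}
```

Running `main` prints `234` and then `abcd`. In an object store, where each request carries high latency, issuing the ranges together in this way is where the presentation linked in the commit message reports the largest benefit.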