Spark, Arrow, Parquet: Add vectorized read support for parquet v2 encodings #14800
Open · jbewing wants to merge 53 commits into apache:main from jbewing:add-support-for-parquet-v2-encodings (+802 −114)
Conversation
What
This PR adds vectorized read support to Iceberg for the Apache Parquet v2 specification (see #7162), building on the existing support for reading DELTA_BINARY_PACKED implemented by @eric-maynard in #13391 and extending it to the remaining Parquet v2 encodings.
Background
This PR solves a longstanding issue: with default settings (e.g. spark.sql.iceberg.vectorization.enabled=true), the reference Apache Iceberg Spark implementation cannot read Iceberg tables written by other compute engines that use the (no longer new) Apache Parquet v2 writer specification. The widely known workaround is to disable the vectorized reader in Spark when interoperating with other compute engines, or to configure all compute engines to use the Apache Parquet v1 writer specification when writing Parquet files. With vectorization disabled, clients take a performance hit that we've anecdotally measured to be quite large for some cases/workloads. With all writers of an Iceberg table forced to Apache Parquet v1, clients incur additional performance and storage penalties: files written with the v2 spec tend to be smaller than those written with the v1 spec, since the newer encodings save space and are often faster to read and write. So the current setup is a lose-lose for performance and data size, with the additional papercut that Apache Iceberg isn't very portable across engines in its default configuration. This PR seeks to solve that by finishing the swing on vectorized Parquet read support for the v2 format. In the future, we may also consider allowing clients to write Apache Parquet v2 files natively from Apache Iceberg, gated behind a setting; even further down that road, we might consider making that the default.
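For context, here is a minimal sketch of the workaround described above, assuming a Spark session with an Iceberg catalog; the catalog/table names are hypothetical, and the read.parquet.vectorization.enabled table property name should be double-checked against your Iceberg version's TableProperties:

```java
import org.apache.spark.sql.SparkSession;

public class DisableVectorizationWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("parquet-v2-read-workaround")
        // Session-wide switch from the description above: fall back to the
        // non-vectorized Parquet reader so v2-encoded files can be read.
        .config("spark.sql.iceberg.vectorization.enabled", "false")
        .getOrCreate();

    // Alternatively, disable vectorization per table via a table property
    // (assumed property name; verify against your Iceberg version).
    spark.sql("ALTER TABLE demo.db.events "
        + "SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')");

    // Hypothetical table; reads now go through the row-based reader.
    spark.table("demo.db.events").show();
  }
}
```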
Previous Work / Thanks
This PR is a revival and extension of the work @eric-maynard was doing in #13709. That PR had been inactive for a little while, so I picked up exactly where Eric left off. Thank you for the great work here @eric-maynard; you made implementing the rest of the changes required for vectorized read support much easier!
Note to Reviewers
I debated splitting the implementation up by encoding type, but since the combined diff isn't unreasonably large, I decided it was better to keep everything together, especially since internal APIs added for one encoding were sometimes built on by the implementations of subsequent encodings. I can split this into multiple PRs if we think that would be substantially easier to review, or if reviewing it as a single atomic unit proves difficult.
Testing
I've tested this on a fork of Spark 3.5 and Iceberg 1.10.0 and verified that a Spark job can read a table written with the Parquet v2 writer without issues. The PR also adds golden-file tests for booleans encoded in RLE format.
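For anyone who wants to reproduce a v2-encoded file locally, here is a minimal sketch (my own illustration, not this PR's test harness) using parquet-mr's example writer; the schema, output path, and row contents are made up:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteV2Sample {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema: one integer column and one string column.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message sample { required int64 id; required binary name (UTF8); }");

    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/sample-v2.parquet"))
            // Emit v2 data pages and prefer the v2 encodings.
            .withWriterVersion(WriterVersion.PARQUET_2_0)
            .withType(schema)
            .build()) {
      SimpleGroupFactory factory = new SimpleGroupFactory(schema);
      for (long i = 0; i < 1000; i++) {
        writer.write(factory.newGroup().append("id", i).append("name", "row-" + i));
      }
    }
  }
}
```

With WriterVersion.PARQUET_2_0, parquet-mr prefers the newer encodings (e.g. DELTA_BINARY_PACKED for integer columns), which is the shape of file this PR teaches the vectorized reader to handle.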
Successor to: #13290, #13391, #13709
Issue: #7162