Add golden file tests for vectorized Parquet reads #13450
Conversation
Thanks @eric-maynard for adding this. Should we maybe keep these files in the parquet-testing repository? That would also avoid storing binary files in this repository :)
Hey @Fokko -- firstly, this is just a draft, so apologies if it's not quite review-ready. Secondly, this is for testing the Iceberg Parquet readers, so I'm not sure parquet-testing is the right place for it.
If storing even small binary files in the repository is a blocking concern, though, I can revisit options for generating data with specific encodings at test time. When I tried that before, it proved very difficult, and it wasn't possible through the Iceberg Parquet abstractions.
@eric-maynard Got it, sorry for jumping right on it. Thanks for adding these, and yes, I've bumped into the same issues earlier: #13324
No, thanks for taking a look @Fokko! I think it should now be more or less ready, but I was planning to hold it until the tests become relevant in the PRs where I'm adding the new encodings. It looks like you hit a very similar issue in #13324 and were able to fix it on the Spark side, which is great.
Now that #13391 is merged, I've added the new encoding type and re-opened this PR. PTAL @Fokko / @huaxingao / others!
Thanks @eric-maynard for the PR! I think it’s reasonable to include the binary Parquet files in the Iceberg repo for now, especially since they’re small and targeted. I’ll go ahead and approve the PR.
Merged. Thanks @eric-maynard
Hey @eric-maynard, could you backport the Spark 4.0 changes to Spark 3.5? We want to keep the two Spark versions aligned in the upcoming 1.10 release. Here's some more context: https://lists.apache.org/thread/8xzbg1wqft2grv8v1f13vb86vd8f7rjd I'm happy to help with the backport too.
```java
try (CloseableIterable<InternalRow> actualReader =
    Parquet.read(Files.localInput(actual))
        .project(schema)
        .createReaderFunc(t -> SparkParquetReaders.buildReader(schema, t, ID_TO_CONSTANT))
```
@eric-maynard Looks like we overlooked this: it should build the vectorized reader instead. Could you please open a follow-up PR to fix this? Thanks!
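A minimal sketch of the suggested fix, assuming Iceberg's batched read path (`Parquet.ReadBuilder#createBatchedReaderFunc`) and `VectorizedSparkParquetReaders#buildReader`; exact signatures differ across Iceberg versions, so treat this as illustrative rather than the actual follow-up change:

```java
import org.apache.iceberg.Files;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.parquet.Parquet;
import org.apache.iceberg.spark.data.vectorized.VectorizedSparkParquetReaders;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Sketch: switch the golden-file assertion to the vectorized (batched) read path.
// `schema`, `actual`, and ID_TO_CONSTANT are the same test fixtures used in the
// row-based snippet above.
try (CloseableIterable<ColumnarBatch> batches =
    Parquet.read(Files.localInput(actual))
        .project(schema)
        .createBatchedReaderFunc(
            fileSchema ->
                VectorizedSparkParquetReaders.buildReader(schema, fileSchema, ID_TO_CONSTANT))
        .build()) {
  for (ColumnarBatch batch : batches) {
    // compare each batch's rows against the expected values, mirroring the
    // row-based comparison in the existing test
  }
}
```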
While implementing new Parquet encodings (e.g. #13391), I've noticed that we rely on generating Parquet data at test time. For some encodings, such as DELTA_BYTE_ARRAY, that is complicated by the fact that there is no reliable way to tell the writer to use a particular encoding for a particular field.
To address this gap, this PR introduces a new test, `testGoldenFiles`, along with several pre-generated Parquet files written using various encodings. I intend to add more files/encodings here as support for new encodings is introduced. I generated these files using this small util and manually validated the encodings with `parquet-tools` (e.g. via `parquet-tools meta`, which reports the encodings used for each column chunk).
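The util itself isn't included in this excerpt. As a rough illustration only (the file path, schema, and row data below are hypothetical, and the real util may work differently), a generator along these lines, built on parquet-mr's example writer, can produce a golden file that uses the v2 DELTA_BYTE_ARRAY encoding:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GoldenFileGenerator {
  public static void main(String[] args) throws Exception {
    // Hypothetical single-column schema; real golden files would cover more types.
    MessageType schema =
        MessageTypeParser.parseMessageType("message t { required binary s (UTF8); }");
    SimpleGroupFactory groups = new SimpleGroupFactory(schema);

    // With the v2 writer version and dictionary encoding disabled, parquet-mr
    // picks DELTA_BYTE_ARRAY for binary columns -- the encoding choice belongs
    // to the writer, which is exactly why pre-generated files help these tests.
    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/delta_byte_array.parquet"))
            .withType(schema)
            .withDictionaryEncoding(false)
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .build()) {
      for (int i = 0; i < 1000; i++) {
        writer.write(groups.newGroup().append("s", "row-" + i));
      }
    }
  }
}
```

Running `parquet-tools meta` on the output should then show DELTA_BYTE_ARRAY in the ENC column for `s`, which is how the encodings of the checked-in files can be verified.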