Conversation

@eric-maynard (Contributor) commented Jul 2, 2025

During the implementation of new Parquet encodings (e.g. #13391), I've noticed that we rely on generating Parquet data at test time. For some encodings, such as DELTA_BYTE_ARRAY, this is complicated by the fact that there is no reliable way to tell the writer to use a particular encoding for a particular field.

To address this gap, this PR introduces a new test testGoldenFiles along with several pre-generated Parquet files written using various encodings. I intend to add more files/encodings here as support for new encodings is introduced.

I generated these files using this small util and manually validated the encodings with parquet-tools, e.g.:

$ parquet-tools inspect --detail ~/iceberg/spark/v4.0/spark/src/test/resources/encodings/PLAIN/int32.parquet
FileMetaData
. . .
                        encodings = list
                            3
                            0
. . .

$ parquet-tools inspect --detail ~/iceberg/spark/v4.0/spark/src/test/resources/encodings/RLE/boolean.parquet
FileMetaData
                                encoding = 3
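
For reference, a minimal sketch of one way such files can be generated with parquet-java's example writer (this is not the actual util linked above; the path, schema, and row count are illustrative, and which encoding the writer ultimately picks depends on its defaults, which is exactly the unreliability described above):

import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteEncodedFile {
  public static void main(String[] args) throws Exception {
    // Illustrative schema: a single required int32 column
    MessageType schema =
        MessageTypeParser.parseMessageType("message test { required int32 id; }");
    // There is no per-field encoding knob; the writer's defaults decide.
    // With the v2 writer and dictionary encoding disabled, parquet-java's
    // default values-writer factory tends to pick delta encodings for int32.
    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/int32.parquet"))
            .withType(schema)
            .withWriterVersion(WriterVersion.PARQUET_2_0)
            .withDictionaryEncoding(false)
            .build()) {
      SimpleGroupFactory groups = new SimpleGroupFactory(schema);
      for (int i = 0; i < 1000; i++) {
        writer.write(groups.newGroup().append("id", i));
      }
    }
  }
}

Since the encoding actually written is writer-dependent, running parquet-tools inspect over the output, as shown above, is the way to confirm what ended up in the file.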

@github-actions bot added the spark label Jul 2, 2025
@Fokko (Contributor) commented Jul 2, 2025

Thanks @eric-maynard for adding this. Should we maybe keep these files in the parquet-testing repository? That would also avoid storing binary files in this repository :)

@eric-maynard (Contributor, Author)
Hey @Fokko -- firstly this is just a draft so apologies if it's not quite review-ready. Secondly, this is for testing the Iceberg Parquet readers, so I'm not sure parquet-testing is the right place for it.

@eric-maynard (Contributor, Author)
If storing even small binary files in the repository is a blocking concern, though, I can revisit options for generating the data with specific encodings at test time. When I tried that before, it proved quite difficult, and it was not possible through the Iceberg Parquet abstractions.

@Fokko (Contributor) commented Jul 3, 2025

@eric-maynard Got it, sorry for jumping right on it. Thanks for adding these, and yes, I've bumped into the same issues earlier: #13324

@eric-maynard (Contributor, Author)
No worries -- thanks for taking a look @Fokko! I think it should now be more or less ready, but I was planning to hold it until the tests become relevant in the PRs where I'm adding the new encodings. It looks like you hit a very similar issue in #13324 and were able to fix it on the Spark side, which is great.

@eric-maynard marked this pull request as ready for review July 30, 2025 17:20
@eric-maynard (Contributor, Author)
Since #13391 is merged, I've added the new encoding type and re-opened this PR. PTAL @Fokko / @huaxingao / others!

@eric-maynard requested a review from huaxingao August 5, 2025 18:58
@eric-maynard requested a review from huaxingao August 8, 2025 18:11
@huaxingao (Contributor)
Thanks @eric-maynard for the PR! I think it’s reasonable to include the binary Parquet files in the Iceberg repo for now, especially since they’re small and targeted. I’ll go ahead and approve the PR.

@huaxingao merged commit e667670 into apache:main Aug 12, 2025
27 checks passed
@huaxingao (Contributor)
Merged. Thanks @eric-maynard

@kevinjqliu (Contributor)
Hey @eric-maynard, could you backport the Spark 4.0 changes to Spark 3.5? We want to keep the two Spark versions aligned in the upcoming 1.10 release. Here's some more context: https://lists.apache.org/thread/8xzbg1wqft2grv8v1f13vb86vd8f7rjd

I'm happy to help with the backport too.

A review comment was left on the following lines of the new test:

try (CloseableIterable<InternalRow> actualReader =
Parquet.read(Files.localInput(actual))
.project(schema)
.createReaderFunc(t -> SparkParquetReaders.buildReader(schema, t, ID_TO_CONSTANT))

@eric-maynard Looks like we overlooked something here: it should build the vectorized reader instead. Could you please open a follow-up PR to fix this? Thanks!
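
For reference, a sketch of what the vectorized variant might look like, assuming Iceberg's batched read API (createBatchedReaderFunc and VectorizedSparkParquetReaders; the exact buildReader signature may differ by version, and the surrounding names are taken from the test above):

try (CloseableIterable<ColumnarBatch> actualReader =
    Parquet.read(Files.localInput(actual))
        .project(schema)
        // build the vectorized (batched) reader instead of the row-by-row
        // InternalRow reader used in the snippet above
        .createBatchedReaderFunc(
            fileSchema ->
                VectorizedSparkParquetReaders.buildReader(schema, fileSchema, ID_TO_CONSTANT))
        .build()) {
  // iterate ColumnarBatch instances and compare against the expected rows
}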
