Conversation

@eric-maynard (Contributor) commented Jul 2, 2025

During the implementation of new Parquet encodings (e.g. #13391), I've noticed that we rely on generating Parquet data at test time. For some encodings, such as DELTA_BYTE_ARRAY, this is complicated by the fact that there is no reliable way to tell the writer to use a particular encoding for a particular field.

To address this gap, this PR introduces a new test testGoldenFiles along with several pre-generated Parquet files written using various encodings. I intend to add more files/encodings here as support for new encodings is introduced.

I generated these files using this small util and manually validated the encodings with parquet-tools, e.g.:

$ parquet-tools inspect --detail ~/iceberg/spark/v4.0/spark/src/test/resources/encodings/PLAIN/int32.parquet
FileMetaData
. . .
                        encodings = list
                            3
                            0
. . .

$ parquet-tools inspect --detail ~/iceberg/spark/v4.0/spark/src/test/resources/encodings/RLE/boolean.parquet
FileMetaData
                                encoding = 3
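
For reference, a minimal sketch of one way such files can be generated with parquet-java's example writer (this is not the actual util linked above; the path, schema, and row count are illustrative, and which encoding the writer ultimately picks depends on its defaults, which is exactly the unreliability described above):

import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteEncodedFile {
  public static void main(String[] args) throws Exception {
    // Illustrative schema: a single required int32 column
    MessageType schema =
        MessageTypeParser.parseMessageType("message test { required int32 id; }");
    // There is no per-field encoding knob; the writer's defaults decide.
    // With the v2 writer and dictionary encoding disabled, parquet-java's
    // default values-writer factory tends to pick delta encodings for int32.
    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/int32.parquet"))
            .withType(schema)
            .withWriterVersion(WriterVersion.PARQUET_2_0)
            .withDictionaryEncoding(false)
            .build()) {
      SimpleGroupFactory groups = new SimpleGroupFactory(schema);
      for (int i = 0; i < 1000; i++) {
        writer.write(groups.newGroup().append("id", i));
      }
    }
  }
}

Since the encoding actually written is writer-dependent, running parquet-tools inspect over the output, as shown above, is the way to confirm what ended up in the file.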

@github-actions bot added the spark label Jul 2, 2025
@Fokko (Contributor) commented Jul 2, 2025

Thanks @eric-maynard for adding this. Should we maybe keep these files in the parquet-testing repository? That would also avoid storing binary files in this repository :)

@eric-maynard (Contributor, Author)
Hey @Fokko -- firstly this is just a draft so apologies if it's not quite review-ready. Secondly, this is for testing the Iceberg Parquet readers, so I'm not sure parquet-testing is the right place for it.

@eric-maynard (Contributor, Author)
If storing even small binary files in the repository is a blocking concern, though, I can revisit options for generating the data with specific encodings at test time. When I tried that before, it proved quite difficult, and it was not possible through the Iceberg Parquet abstractions.

@Fokko (Contributor) commented Jul 3, 2025

@eric-maynard Got it, sorry for jumping right on it. Thanks for adding these, and yes, I've bumped into the same issues earlier: #13324

@eric-maynard (Contributor, Author)
No worries -- thanks for taking a look @Fokko! I think it should now be more or less ready, but I was planning to hold it until the tests become relevant in the PRs where I'm adding the new encodings. It looks like you hit a very similar issue in #13324 and were able to fix it on the Spark side, which is great.

@eric-maynard marked this pull request as ready for review July 30, 2025 17:20
@eric-maynard (Contributor, Author)
Since #13391 is merged, I've added the new encoding type and re-opened this PR. PTAL @Fokko / @huaxingao / others!

@eric-maynard requested a review from huaxingao August 5, 2025 18:58
@eric-maynard requested a review from huaxingao August 8, 2025 18:11
@huaxingao (Contributor)
Thanks @eric-maynard for the PR! I think it’s reasonable to include the binary Parquet files in the Iceberg repo for now, especially since they’re small and targeted. I’ll go ahead and approve the PR.

@huaxingao merged commit e667670 into apache:main Aug 12, 2025
27 checks passed
@huaxingao (Contributor)
Merged. Thanks @eric-maynard

@kevinjqliu (Contributor)
Hey @eric-maynard, could you backport the Spark 4.0 changes to Spark 3.5? We want to keep the two Spark versions aligned in the upcoming 1.10 release. Here's some more context: https://lists.apache.org/thread/8xzbg1wqft2grv8v1f13vb86vd8f7rjd

I'm happy to help with the backport too.

A review comment was left on the following lines of the new test:

try (CloseableIterable<InternalRow> actualReader =
Parquet.read(Files.localInput(actual))
.project(schema)
.createReaderFunc(t -> SparkParquetReaders.buildReader(schema, t, ID_TO_CONSTANT))

@eric-maynard Looks like we overlooked something here: it should build the vectorized reader instead. Could you please open a follow-up PR to fix this? Thanks!
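
For reference, a sketch of what the vectorized variant might look like, assuming Iceberg's batched read API (createBatchedReaderFunc and VectorizedSparkParquetReaders; the exact buildReader signature may differ by version, and the surrounding names are taken from the test above):

try (CloseableIterable<ColumnarBatch> actualReader =
    Parquet.read(Files.localInput(actual))
        .project(schema)
        // build the vectorized (batched) reader instead of the row-by-row
        // InternalRow reader used in the snippet above
        .createBatchedReaderFunc(
            fileSchema ->
                VectorizedSparkParquetReaders.buildReader(schema, fileSchema, ID_TO_CONSTANT))
        .build()) {
  // iterate ColumnarBatch instances and compare against the expected rows
}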
