
Conversation

@TheR1sing3un
Member

Currently, snapshot-read performance on file slices that contain only a base file is significantly worse than read_optimized read performance.
The main reason is that read_optimized has a chance to enable vectorized Parquet reading, while snapshot reads never use the vectorized reader. Referring to Spark's code (apache/spark#38397), this behavior seems a little too strict, because whether Parquet is read with the vectorized reader can be decoupled from whether batches are returned.
So I modified the code: even during a snapshot read, if the file slice to read contains only a base file, the vectorized reader is used for Parquet. However, for snapshot reads the batch result is always set to false, because we cannot be sure there is no file slice that needs merging at read time; merging is row-based, so batch results cannot be returned.
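
A minimal sketch of the decoupled decision described above (the names `ReadPlan` and `planParquetRead` are illustrative only, not actual Hudi or Spark APIs):

```scala
// Hypothetical sketch: decouple "use the vectorized Parquet reader" from
// "return columnar batches to the caller". Names are illustrative.
case class ReadPlan(useVectorizedReader: Boolean, returnBatch: Boolean)

def planParquetRead(vectorizedEnabled: Boolean,
                    isSnapshotQuery: Boolean,
                    sliceHasLogFiles: Boolean): ReadPlan = {
  // The vectorized reader can be used whenever the slice is base-file-only,
  // even under a snapshot query.
  val vectorized = vectorizedEnabled && !sliceHasLogFiles
  // Batches can only be returned when no row-based merge may be needed,
  // i.e. not under a snapshot query.
  val batch = vectorized && !isSnapshotQuery
  ReadPlan(vectorized, batch)
}
```

With this split, `planParquetRead(vectorizedEnabled = true, isSnapshotQuery = true, sliceHasLogFiles = false)` still reads Parquet with the vectorized reader but returns rows, which is the behavior this PR enables.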

Our test case

  1. all file slices contain only a base file
  2. 3 GB per partition

Read with operation: read_optimized

image

Before optimization: snapshot_read

image

After optimization: snapshot_read

image

Change Logs

  1. enable vectorized reading for file slices without log files

Impact

Improve snapshot read performance when some file slices are base-file-only.

Risk level (write none, low, medium or high below)

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Apr 10, 2025
@TheR1sing3un
Member Author

Do not merge yet; I will continue to optimize this PR.

@TheR1sing3un TheR1sing3un force-pushed the feat_dynamic_vector_read branch from e9ce64b to ff7761b Compare April 21, 2025 07:16
@TheR1sing3un
Member Author

Ready for review after #13188 is merged.

@TheR1sing3un TheR1sing3un force-pushed the feat_dynamic_vector_read branch from ff7761b to 422dc00 Compare April 22, 2025 11:26
@TheR1sing3un
Member Author

ready for review now!

@danny0405
Contributor

@TheR1sing3un Hi, can you resolve the conflicts~

1. enable vectorized reading for file slice without log file

Signed-off-by: TheR1sing3un <[email protected]>
1. fix the test

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un force-pushed the feat_dynamic_vector_read branch from 422dc00 to c6a06c8 Compare May 9, 2025 14:08
@TheR1sing3un
Member Author

@TheR1sing3un Hi, can you resolve the conflicts~

done~

@TheR1sing3un
Member Author

@hudi-bot run azure

1 similar comment
@TheR1sing3un
Member Author

@hudi-bot run azure

@TheR1sing3un
Member Author

@TheR1sing3un Hi, can you resolve the conflicts~

All conflicts have been resolved and all checks have passed. Could we move forward?


// Should always be set by FileSourceScanExec creating this.
// Check conf before checking option, to allow working around an issue by changing conf.
val returningBatch = sqlConf.parquetVectorizedReaderEnabled &&
Contributor


should we fix the other version of parquet readers?

Member Author


should we fix the other version of parquet readers?

Fixing only Spark 3.3 is fine. The relevant Spark changes were integrated into versions after 3.3, so only 3.3 needs the compatibility fix; refer to apache/spark#38397.

supportBatchResult = !isMOR && !isIncremental && !isBootstrap && super.supportBatch(sparkSession, schema)
}
val superSupportBatch = super.supportBatch(sparkSession, schema)
supportBatchCalled = !isIncremental && !isBootstrap && superSupportBatch
Contributor


Not sure the impact for fg readers of Spark, @jonvex can you help here?

Contributor


Yeah, the reason I added supportBatchCalled was that I saw supportBatch() sometimes switch between true and false, after which the caller would attempt an illegal cast. This was to prevent that from happening. It seems like now we should rename the variable to something else?

Contributor


It's okay to change the variable name; my concern is whether this change introduces side effects on other read paths, such as a regression or extra overhead. Can you help confirm that?

Contributor

@danny0405 danny0405 left a comment


+1, reading the base file with the vectorized reader makes sense to me.

@danny0405
Contributor

@TheR1sing3un The problem also exists on 0.x releases right? Not a regression of 1.x.

@TheR1sing3un
Member Author

@TheR1sing3un The problem also exists on 0.x releases right? Not a regression of 1.x.

Indeed, but the related code path is different from 1.x.

@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405
Contributor

@danny0405 danny0405 merged commit d9822b8 into apache:master May 24, 2025
57 of 58 checks passed
alexr17 pushed a commit to alexr17/hudi that referenced this pull request Aug 25, 2025
…apache#13127)

* enable vectorized reading when only base file is in the file group.

---------

Signed-off-by: TheR1sing3un <[email protected]>
Co-authored-by: danny0405 <[email protected]>
