
Conversation

@TheR1sing3un
Member

Currently, snapshot-read performance on file slices that contain only a base file is significantly worse than read_optimized read performance.
The main reason is that read_optimized has a chance to enable vectorized Parquet reading, while snapshot reads never use the vectorized reader. Referring to Spark's code (apache/spark#38397), this behavior seems a little too strict, because whether Parquet is read with the vectorized reader can be decoupled from whether batches are returned.
So I modified the code: even during a snapshot read, if the file slice to read contains only a base file, the vectorized reader is used for Parquet. However, for snapshot reads the batch result is always set to false, because we cannot be sure there is no file slice that needs merging at read time; merging is row-based, so batch results cannot be returned.
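
A minimal sketch of the decoupled decision described above (the names `ReadPlan` and `planParquetRead` are illustrative only, not actual Hudi or Spark APIs):

```scala
// Hypothetical sketch: decouple "use the vectorized Parquet reader" from
// "return columnar batches to the caller". Names are illustrative.
case class ReadPlan(useVectorizedReader: Boolean, returnBatch: Boolean)

def planParquetRead(vectorizedEnabled: Boolean,
                    isSnapshotQuery: Boolean,
                    sliceHasLogFiles: Boolean): ReadPlan = {
  // The vectorized reader can be used whenever the slice is base-file-only,
  // even under a snapshot query.
  val vectorized = vectorizedEnabled && !sliceHasLogFiles
  // Batches can only be returned when no row-based merge may be needed,
  // i.e. not under a snapshot query.
  val batch = vectorized && !isSnapshotQuery
  ReadPlan(vectorized, batch)
}
```

With this split, `planParquetRead(vectorizedEnabled = true, isSnapshotQuery = true, sliceHasLogFiles = false)` still reads Parquet with the vectorized reader but returns rows, which is the behavior this PR enables.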

Our test case

  1. all file slices contain only a base file
  2. 3 GB per partition

Read with operation: read_optimized

image

Before optimization: snapshot_read

image

After optimization: snapshot_read

image

Change Logs

  1. enable vectorized reading for file slices without log files

Impact

Improve snapshot read performance when some file slices are base-file-only.

Risk level (write none, low, medium or high below)

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Apr 10, 2025
@TheR1sing3un
Member Author

Do not merge yet; I will continue to optimize this PR.

@TheR1sing3un TheR1sing3un force-pushed the feat_dynamic_vector_read branch from e9ce64b to ff7761b Compare April 21, 2025 07:16
@TheR1sing3un
Member Author

Ready for review after #13188 is merged.

@TheR1sing3un TheR1sing3un force-pushed the feat_dynamic_vector_read branch from ff7761b to 422dc00 Compare April 22, 2025 11:26
@TheR1sing3un
Member Author

ready for review now!

@danny0405
Contributor

@TheR1sing3un Hi, can you resolve the conflicts~

1. enable vectorized reading for file slice without log file

Signed-off-by: TheR1sing3un <[email protected]>
1. fix the test

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un force-pushed the feat_dynamic_vector_read branch from 422dc00 to c6a06c8 Compare May 9, 2025 14:08
@TheR1sing3un
Member Author

@TheR1sing3un Hi, can you resolve the conflicts~

done~

@TheR1sing3un
Member Author

@hudi-bot run azure

1 similar comment
@TheR1sing3un
Member Author

@hudi-bot run azure

@TheR1sing3un
Member Author

@TheR1sing3un Hi, can you resolve the conflicts~

All conflicts have been resolved and all checks have passed. Could we move forward?


// Should always be set by FileSourceScanExec creating this.
// Check conf before checking option, to allow working around an issue by changing conf.
val returningBatch = sqlConf.parquetVectorizedReaderEnabled &&
Contributor


should we fix the other version of parquet readers?

Member Author


should we fix the other version of parquet readers?

Fixing only Spark 3.3 is fine. The relevant Spark changes were integrated into versions after 3.3, so only 3.3 needs the compatibility fix; refer to apache/spark#38397.

supportBatchResult = !isMOR && !isIncremental && !isBootstrap && super.supportBatch(sparkSession, schema)
}
val superSupportBatch = super.supportBatch(sparkSession, schema)
supportBatchCalled = !isIncremental && !isBootstrap && superSupportBatch
Contributor


Not sure the impact for fg readers of Spark, @jonvex can you help here?

Contributor


Yeah, the reason I added supportBatchCalled was that I saw supportBatch() sometimes switch between true and false, after which the caller would attempt an illegal cast. This was to prevent that from happening. It seems like now we should rename the variable to something else?

Contributor


It's okay to change the variable name; my concern is whether this change introduces side effects on other read paths, such as a regression or extra overhead. Can you help confirm that?

Contributor

@danny0405 danny0405 left a comment


+1, reading the base file with the vectorized reader makes sense to me.

@danny0405
Contributor

@TheR1sing3un The problem also exists on 0.x releases right? Not a regression of 1.x.

@TheR1sing3un
Member Author

@TheR1sing3un The problem also exists on 0.x releases right? Not a regression of 1.x.

Indeed, but the related code path is different from 1.x.

@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405
Contributor

@danny0405 danny0405 merged commit d9822b8 into apache:master May 24, 2025
57 of 58 checks passed
alexr17 pushed a commit to alexr17/hudi that referenced this pull request Aug 25, 2025
…apache#13127)

* enable vectorized reading when only base file is in the file group.

---------

Signed-off-by: TheR1sing3un <[email protected]>
Co-authored-by: danny0405 <[email protected]>
