[HUDI-9302] Enable vectorized reading for file slice without log file #13127
Conversation
No merge for now; I will continue to optimize this PR.
(force-pushed from e9ce64b to ff7761b)
Ready for review after merging #13188.
(force-pushed from ff7761b to 422dc00)
Ready for review now!
@TheR1sing3un Hi, can you resolve the conflicts?
1. Enable vectorized reading for file slice without log file (Signed-off-by: TheR1sing3un <[email protected]>)
2. Fix the test (Signed-off-by: TheR1sing3un <[email protected]>)
(force-pushed from 422dc00 to c6a06c8)
Done.
@hudi-bot run azure

@hudi-bot run azure
All conflicts have been resolved and all checks have passed. Could we move forward?
```scala
// Should always be set by FileSourceScanExec creating this.
// Check conf before checking option, to allow working around an issue by changing conf.
val returningBatch = sqlConf.parquetVectorizedReaderEnabled &&
```
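For context, here is a minimal, self-contained sketch of the decision being discussed: decoupling "use the vectorized parquet reader" from "return columnar batches". All names are illustrative, not the actual Spark or Hudi identifiers.

```scala
// Sketch only: decouple "read parquet vectorized" from "return ColumnarBatch".
object VectorizedReadDecision {
  final case class ReadContext(
      vectorizedReaderEnabled: Boolean, // mirrors spark.sql.parquet.enableVectorizedReader
      schemaSupportsBatch: Boolean,     // all columns are batch-friendly atomic types
      hasLogFiles: Boolean)             // the file slice carries log files that need a merge

  // The parquet reader may run vectorized whenever the slice is base-file-only.
  def useVectorizedReader(ctx: ReadContext): Boolean =
    ctx.vectorizedReaderEnabled && ctx.schemaSupportsBatch && !ctx.hasLogFiles

  // Batches are returned only when no row-based merge can happen downstream,
  // so snapshot queries keep returning rows even when reading vectorized.
  def returningBatch(ctx: ReadContext, isSnapshotQuery: Boolean): Boolean =
    useVectorizedReader(ctx) && !isSnapshotQuery

  def main(args: Array[String]): Unit = {
    val ctx = ReadContext(vectorizedReaderEnabled = true, schemaSupportsBatch = true, hasLogFiles = false)
    println(useVectorizedReader(ctx))                    // true: read the base file vectorized
    println(returningBatch(ctx, isSnapshotQuery = true)) // false: still hand rows to the caller
  }
}
```

The key point is that a snapshot query can still scan the base file with the vectorized reader while continuing to hand rows, not batches, to the merge path.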
Should we fix the other versions of the parquet readers?
> Should we fix the other versions of the parquet readers?
Fixing only Spark 3.3 is fine. The relevant Spark changes were integrated in versions after 3.3, so only 3.3 needs this compatibility fix; refer to apache/spark#38397.
```scala
  supportBatchResult = !isMOR && !isIncremental && !isBootstrap && super.supportBatch(sparkSession, schema)
}
val superSupportBatch = super.supportBatch(sparkSession, schema)
supportBatchCalled = !isIncremental && !isBootstrap && superSupportBatch
```
Not sure about the impact on the Spark file-group readers; @jonvex can you help here?
Yeah, the reason I had supportBatchCalled was that I was seeing supportBatch() sometimes switch between true and false, and then the caller would try to do an illegal cast. So this was to prevent that from happening. It seems like now we should change the variable name to something else?
It's okay to change the variable name; my concern is whether the change introduces side effects on other read paths, like a regression or overhead. Can you help confirm that?
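For illustration, here is a minimal sketch of the guard described above: cache the first supportBatch() answer so callers never observe it flipping between true and false mid-query (the illegal-cast scenario). The class and names are hypothetical, not Hudi's actual code.

```scala
// Sketch: compute the batch-support decision once and reuse it, so every
// caller sees a stable answer for the lifetime of the scan.
final class StableBatchSupport(compute: () => Boolean) {
  @volatile private var cached: Option[Boolean] = None

  def supportBatch: Boolean = cached match {
    case Some(decided) => decided // stable after the first call
    case None =>
      val decided = compute()
      cached = Some(decided)
      decided
  }
}

// Hypothetical usage: the unstable check runs once; later calls see the same answer.
// val support = new StableBatchSupport(() => !isIncremental && !isBootstrap && superSupportBatch)
```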
danny0405 left a comment:
+1, reading the base file with the vectorized reader makes sense to me.
@TheR1sing3un The problem also exists on 0.x releases, right? So it's not a regression in 1.x.
Indeed, but the related code path is different from 1.x.
[HUDI-9302] Enable vectorized reading for file slice without log file (apache#13127)
* enable vectorized reading when only base file is in the file group.
Signed-off-by: TheR1sing3un <[email protected]>
Co-authored-by: danny0405 <[email protected]>
Change Logs

Currently, the `snapshot` read performance of a file slice with only a base file lags far behind the `read_optimized` read performance. The main reason is that a `read_optimized` query has a chance to turn on vectorized parquet reading, while `snapshot` reads never read vectorized. Referring to Spark's code (apache/spark#38397), this behavior seems a little too strict, because whether parquet is read vectorized can be separated from whether a columnar batch is returned.

So I modified the code: even for a `snapshot` read, if the file slice to read contains only the base file, the vectorized parquet reader is used. However, for `snapshot` reads the batch result is always set to false, because we cannot be sure at read time that no file slice needs a merge (which is row-based), so a batch result cannot be returned.
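For illustration, here is a hypothetical way to compare the two read paths described above, using the standard Hudi DataSource query-type option; the table path and app name are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object CompareHudiReads {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-read-compare") // placeholder app name
      .master("local[*]")
      .getOrCreate()

    // snapshot read: before this PR, never vectorized, even for base-file-only slices
    val snapshotDf = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "snapshot")
      .load("/tmp/hudi_table") // placeholder table path

    // read_optimized read: base files only, so the vectorized reader can kick in
    val roDf = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "read_optimized")
      .load("/tmp/hudi_table")

    println(s"snapshot rows: ${snapshotDf.count()}, read_optimized rows: ${roDf.count()}")
    spark.stop()
  }
}
```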
Impact
Improves snapshot read performance when some file slices are base-file-only.
Risk level (write none, low, medium or high below)
low
Documentation Update
none
Contributor's checklist