Add support for Hudi MOR queries #14795
Force-pushed: 6ad553a to 511726c
Is this CLA authorization failing because my GitHub account isn't using my Amazon email? I know we have signed the corporate CLA; any help resolving this would be great.
@arhimondr Would you be able to help review this PR? It enables Presto to query Hudi merge-on-read tables, which store data in both Parquet and Avro formats, so that queries can serve fresher data.
@bhasudha I'm pretty busy until the end of next week. I'm asking in the chat whether there are any volunteers to do the first pass. Otherwise I will be able to get to it by the end of next week.
Thanks @arhimondr
arhimondr left a comment:
Generally looks good. Some small comments
nit: this.customSplitInfo = ImmutableMap.copyOf(requireNonNull(customSplitInfo, "customSplitInfo is null"));
We usually neither pass nor return null. Pass ImmutableMap.of() instead. Same for other similar call sites.
Don't check for null. The general assumption is that the method parameters are never null.
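The null-handling conventions in the comments above (null-check plus defensive copy in the constructor, an empty map instead of null at call sites, no null checks on ordinary method parameters) can be sketched together. This is only an illustration under assumptions: `SplitInfoHolder` is a hypothetical class, not one from this PR, and the JDK's `Map.copyOf` and `Map.of` stand in for Guava's `ImmutableMap.copyOf` and `ImmutableMap.of` so the snippet compiles without extra dependencies.

```java
import java.util.Map;

import static java.util.Objects.requireNonNull;

// Hypothetical holder class; it only illustrates the conventions and is not code from this PR.
final class SplitInfoHolder
{
    private final Map<String, String> customSplitInfo;

    SplitInfoHolder(Map<String, String> customSplitInfo)
    {
        // Fail fast on null, then take an immutable defensive copy.
        // Map.copyOf is a JDK stand-in for Guava's ImmutableMap.copyOf.
        this.customSplitInfo = Map.copyOf(requireNonNull(customSplitInfo, "customSplitInfo is null"));
    }

    Map<String, String> getCustomSplitInfo()
    {
        // Never null, so callers do not need their own null check.
        return customSplitInfo;
    }

    // Callers with nothing to pass use an empty map instead of null.
    // Map.of() is a JDK stand-in for Guava's ImmutableMap.of().
    static SplitInfoHolder withoutCustomInfo()
    {
        return new SplitInfoHolder(Map.of());
    }
}
```

With this shape, the only null check lives in the constructor, and later mutations of the caller's map do not leak into the holder.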
I would recommend checking customSplitInfo.get(HUDI_BASEPATH_KEY) and customSplitInfo.get(HUDI_MAX_COMMIT_TIME_KEY) for null.
Also, let's go with the one-parameter-per-line style:
    return Optional.of(new HoodieRealtimeFileSplit(
            split,
            requireNonNull(customSplitInfo.get(HUDI_BASEPATH_KEY), "HUDI_BASEPATH_KEY is missing"),
            deltaLogPaths,
            requireNonNull(customSplitInfo.get(HUDI_MAX_COMMIT_TIME_KEY), "HUDI_MAX_COMMIT_TIME_KEY is missing")));
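To show what the suggested `requireNonNull` wrapping buys, here is a small self-contained sketch. Everything in it is an assumption made for illustration: `RealtimeSplit` is a hypothetical stand-in for `HoodieRealtimeFileSplit`, the two key strings are placeholders (the real constant values are not shown in this thread), and the `split` and `deltaLogPaths` arguments are left out.

```java
import java.util.Map;
import java.util.Optional;

import static java.util.Objects.requireNonNull;

final class RealtimeSplitSketch
{
    // Placeholder values; the actual constants in the PR may differ.
    static final String HUDI_BASEPATH_KEY = "example.basepath";
    static final String HUDI_MAX_COMMIT_TIME_KEY = "example.max_commit_time";

    // Hypothetical stand-in for HoodieRealtimeFileSplit, reduced to two fields.
    static final class RealtimeSplit
    {
        final String basePath;
        final String maxCommitTime;

        RealtimeSplit(String basePath, String maxCommitTime)
        {
            this.basePath = basePath;
            this.maxCommitTime = maxCommitTime;
        }
    }

    static Optional<RealtimeSplit> recreate(Map<String, String> customSplitInfo)
    {
        // Map.get returns null for an absent key; requireNonNull turns that
        // into an immediate failure that names the missing key.
        return Optional.of(new RealtimeSplit(
                requireNonNull(customSplitInfo.get(HUDI_BASEPATH_KEY), "HUDI_BASEPATH_KEY is missing"),
                requireNonNull(customSplitInfo.get(HUDI_MAX_COMMIT_TIME_KEY), "HUDI_MAX_COMMIT_TIME_KEY is missing")));
    }
}
```

A map missing either key fails at split re-creation with a descriptive message, rather than later inside the record reader with a bare NullPointerException.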
Maybe recreateFileSplitFromCustomInfo?
Let's merge these into one test method. They verify essentially the same thing, just with a different return value.
Let's initialize it in @BeforeClass. Also, please add an @AfterClass(alwaysRun=true) method and set the field to null there.
It feels like this branch is too generic. Let's add this special extra logic only if the split is a HoodieRealtimeFileSplit.
Thanks for the comments! I just updated the branch to address them.
Why is the HoodieRealtimeFileSplit created twice?
nit: prefer ImmutableMap.of (here and in other similar places)
Please have a look at the test failures; they might be related.
Thanks for the update. I made the changes and everything looks to be passing again.
@bschell Good to go. Could you please rebase and squash the commits?
Force-pushed: 0609ad5 to 936b513
@arhimondr Done!
Force-pushed: 936b513 to b5be227
Allows presto-hive to support the use of custom input formats with custom file splits and record readers. This allows support of Hudi merge-on-read table input format.
Force-pushed: b5be227 to 20d0830
@arhimondr I think everything should be good now! Anything left for me to do?
The CLA check passed. Thanks for your contribution, @bschell!
@bschell @bhasudha I am trying to use Presto to query a Hudi MOR table. Right now I have two tables, hive_mor_table_ro and hive_mor_table_rt. I made some changes to the original table, so there is a log file. From the Hive CLI, I get different results for these two tables because hive_mor_table_rt shows the latest result. But from the Presto CLI, I see the same results for both, so I assume it is not working. Do you have any idea why I would encounter this problem? Do I have to set some session properties to trigger the merging of log files and Parquet files? I would appreciate any ideas. Thanks
Hello, have you noticed that querying a MOR table in snapshot mode is much slower than in read_optimized mode? I have tested many tables; in the worst case, on the same table, snapshot is ten times slower than read_optimized.

Allows presto-hive to support the use of custom input formats with custom
file splits and record readers. Tested using Hudi merge-on-read table
input format.
This is a rebase of the Hudi Presto realtime query patch onto Presto mainline. Tests are passing, but I need to double-check for any new issues for Hudi introduced by the rebase. I just wanted to get this opened for feedback ASAP.