Skip to content

Add support for Hudi MOR queries#14795

Merged
arhimondr merged 1 commit intoprestodb:masterfrom
bschell:hudiprestomainline
Aug 6, 2020
Merged

Add support for Hudi MOR queries#14795
arhimondr merged 1 commit intoprestodb:masterfrom
bschell:hudiprestomainline

Conversation

@bschell
Copy link
Contributor

@bschell bschell commented Jul 7, 2020

Allows presto-hive to support the use of custom input formats with custom
file splits and record readers. Tested using Hudi merge-on-read table
input format.

This is a rebase of the hudi presto realtime query patch on presto mainline. Tests are passing but need to double check for any possible new issues for Hudi given the rebase. Just wanted to get this opened for feedback ASAP.

Please make sure your submission complies with our Development, Formatting, and Commit Message guidelines.

Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.

== RELEASE NOTES ==

Hive Changes
* Allows presto-hive to use custom parquet input formats 
* Add support for Hudi realtime input format for hudi realtime queries

@bschell bschell force-pushed the hudiprestomainline branch from 6ad553a to 511726c Compare July 8, 2020 18:55
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jul 8, 2020

CLA Check

@bschell
Copy link
Contributor Author

bschell commented Jul 14, 2020

Is this CLA authorization failing because my github account isn't using my Amazon email? I know we have signed the corporate CLA, any help to resolve would be great.

@bhasudha
Copy link

@arhimondr Would you be able to help review this PR? This enables Presto to query merge-on-read tables in Hudi which is both parquet and avro formatted to serve more fresh data.

@arhimondr
Copy link
Member

@bhasudha I'm pretty busy till the end of next week. Trying to ask in the chat if there are any volunteers to do the first pass. Otherwise i will be able to get to that by the end of next week.

@bhasudha
Copy link

@bhasudha I'm pretty busy till the end of next week. Trying to ask in the chat if there are any volunteers to do the first pass. Otherwise i will be able to get to that by the end of next week.

Thanks @arhimondr

Copy link
Member

@arhimondr arhimondr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good. Some small comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this.customSplitInfo = ImmutableMap.copyOf(requireNonNull(customSplitInfo, "customSplitInfo is null"));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually don't pass null neither return null. Pass ImmutableMap.of(). Same for other similar call sites.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImmutableMap.of()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't check for null. The general assumption is that the method parameters are never null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend checking the customSplitInfo.get(HUDI_BASEPATH_KEY), customSplitInfo.get(HUDI_MAX_COMMIT_TIME_KEY) for null

Also let's go with the one parameter at a line style

return Optional.of(new HoodieRealtimeFileSplit(
  split, 
  requireNonNull(customSplitInfo.get(HUDI_BASEPATH_KEY), "HUDI_BASEPATH_KEY is missing"), 
  deltaLogPaths, 
  requireNonNull(customSplitInfo.get(HUDI_MAX_COMMIT_TIME_KEY), "HUDI_MAX_COMMIT_TIME_KEY is missing")))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe recreateFileSplitFromCustomInfo?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge it into a one test case method. It generally verifies the same thing, but just different return

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's initialize it in @BeforeClass. Also please add the @AfterClass(alwaysRun=true) method and nullify the field there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this branch is too generic. Let's add this special extra logic only if the split is the HoodieRealtimeFileSplit

@bschell
Copy link
Contributor Author

bschell commented Jul 25, 2020

Generally looks good. Some small comments

Thanks for the comments! Just updated the branch addressing them

Copy link
Member

@arhimondr arhimondr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % nits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer ImmutableMap.of

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the HoodieRealtimeFileSplit is created twice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer ImmutableMap.of (here and in other similar places)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImmutableMap.of()

@arhimondr
Copy link
Member

Please have a look at the test failures, those might be related

@bschell
Copy link
Contributor Author

bschell commented Jul 27, 2020

Thanks for the update. Made the changes and everything look to be passing again.

@arhimondr
Copy link
Member

@bschell Good to go. Could you please rebase and squash the commits?

@bschell bschell force-pushed the hudiprestomainline branch from 0609ad5 to 936b513 Compare July 28, 2020 17:18
@bschell
Copy link
Contributor Author

bschell commented Jul 28, 2020

@arhimondr Done!

@bschell bschell force-pushed the hudiprestomainline branch from 936b513 to b5be227 Compare August 5, 2020 23:14
Allows presto-hive to support the use of custom input formats with custom
file splits and record readers. This allows support of Hudi merge-on-read table
input format.
@bschell bschell force-pushed the hudiprestomainline branch from b5be227 to 20d0830 Compare August 5, 2020 23:21
@bschell
Copy link
Contributor Author

bschell commented Aug 5, 2020

@arhimondr I think everything should be good now! Anything left for me to do?

@arhimondr arhimondr merged commit cfb2e7a into prestodb:master Aug 6, 2020
@arhimondr
Copy link
Member

The CLA check passed.

Thanks for your contribution @bschell !

@XiaoqiMa
Copy link

@bschell @bhasudha I am trying to use Presto to query hudi MOR table. Right now, I have two tables, which are hive_mor_table_ro and hive_mor_table_rt. I made some changes on the original taable, so there is a log file. From Hive cli, I would obtain two different results for these two tables because the "hive_mor_table_rt" shows the latest result. But from Presto cli, I would see the same results, I assume it is not working. Do you have any ideas why I would encounter this problem? Do I have to set some session properties to trigger the action of merging log files and parquet files? I would appreciate any ideas. Thanks

@duanyongvictory
Copy link

Allows presto-hive to support the use of custom input formats with custom file splits and record readers. Tested using Hudi merge-on-read table input format.

This is a rebase of the hudi presto realtime query patch on presto mainline. Tests are passing but need to double check for any possible new issues for Hudi given the rebase. Just wanted to get this opened for feedback ASAP.

Please make sure your submission complies with our Development, Formatting, and Commit Message guidelines.

Fill in the release notes towards the bottom of the PR description. See Release Notes Guidelines for details.

== RELEASE NOTES ==

Hive Changes
* Allows presto-hive to use custom parquet input formats 
* Add support for Hudi realtime input format for hudi realtime queries

hello, do you notice that query mor table on snapshot mode is much more slower than the read_optimized mode?
since for every split in every query, this class HoodieMergedLogRecordScanner has to use ExternalSpillableMap to read the record in log and merge with parquet,not to mention the ParquetRecordReaderWrapper is the row-level reader。
thanks.

I hava tested many tables, for the worst case in same table, snapshot is ten times lower than read_optimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants