Skip to content

Support hudi mor table#8240

Closed
PennyAndWang wants to merge 1 commit intotrinodb:masterfrom
PennyAndWang:support_hudi_mor_table
Closed

Support hudi mor table#8240
PennyAndWang wants to merge 1 commit intotrinodb:masterfrom
PennyAndWang:support_hudi_mor_table

Conversation

@PennyAndWang
Copy link
Contributor

According to this issue:#5132, I submit this PR based on PrestoDB's two commit: prestodb/presto#13818 and prestodb/presto#14795 .
I'm not sure if this is OK for trino community.
My test results shows that this PR can let trino read Hudi's MOR tables.
Please review .THX

@cla-bot cla-bot bot added the cla-signed label Jun 9, 2021
@afilipchik
Copy link

I would love to see it is part of Trino. We use HUDI extensively and inability to use MOR is limiting what we can do with Trino

@PennyAndWang PennyAndWang force-pushed the support_hudi_mor_table branch 2 times, most recently from 6923bd4 to 57e20e3 Compare June 10, 2021 09:14
@PennyAndWang PennyAndWang force-pushed the support_hudi_mor_table branch from 57e20e3 to b699ad9 Compare June 10, 2021 11:54
@caneGuy
Copy link
Contributor

caneGuy commented Jun 10, 2021

Great job @PennyAndWang

@PennyAndWang PennyAndWang force-pushed the support_hudi_mor_table branch 4 times, most recently from 154f310 to 52ca459 Compare June 11, 2021 15:36
@martint martint requested a review from electrum June 11, 2021 18:11
@PennyAndWang PennyAndWang force-pushed the support_hudi_mor_table branch 3 times, most recently from 52ca459 to 1c69459 Compare June 17, 2021 10:15
@PennyAndWang PennyAndWang force-pushed the support_hudi_mor_table branch from ddb3f39 to 52ca459 Compare July 9, 2021 02:46
@PennyAndWang
Copy link
Contributor Author

rebase @PennyAndWang ?

@tooptoop4 done!

@PennyAndWang PennyAndWang force-pushed the support_hudi_mor_table branch from 52ca459 to 2d8135c Compare July 9, 2021 06:30
@PennyAndWang PennyAndWang force-pushed the support_hudi_mor_table branch from 2d8135c to 28cd2f8 Compare July 9, 2021 09:19
<dep.docker-java.version>3.2.8</dep.docker-java.version>
<dep.coral.version>1.0.60</dep.coral.version>
<dep.confluent.version>5.5.2</dep.confluent.version>
<dep.hudi.version>0.7.0</dep.hudi.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we upgrade to 0.8.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test the code based on Hudi 0.8.0 before, but there is an error showing that one API does not support. I will try it again.

@py2711
Copy link

py2711 commented Sep 20, 2021

@PennyAndWang
When can we expect this feature to be merged in master

@PennyAndWang
Copy link
Contributor Author

@PennyAndWang
When can we expect this feature to be merged in master

I'm waiting for review .

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without any integration tests it's difficult to accept this patch since any future change can break the Hudi integration.

@afilipchik
Copy link

Can anyone at least look at the code and comment on the approach. Having HUDI MOR integration is a pretty big deal as it pushes Trino in to close to real-time category. Plus Presto already has it.

We use HUDI and would love to have this finished.

@hashhar
Copy link
Member

hashhar commented Sep 23, 2021

Re: The approach has some drawbacks. It very strongly couples the Hudi integration to lots of parts of the Hive connector (split enumeration for example) and would prevent evolution of the Hive connector independently of the Hudi integration.

I think this PR also changes the Hive connector behaviour for Parquet tables that have a custom input format/reader set in table properties.

cc: @electrum for a better approach since he's more familiar with the Hive connector than I am.

@afilipchik
Copy link

Thank you for taking look, it makes sense. What would be the proper approach? Create a separate HUDI connector (similar to iceberg)?

@hashhar
Copy link
Member

hashhar commented Sep 27, 2021

@afilipchik That makes the most sense to me.

The Hudi impl would need it's own split enumeration logic, reader instantiation and it's own implementation of a Split so that we can avoiding piggybacking on top of HiveSplit and instead model the split in a way that makes most sense for Hudi. e.g. should the PathFilter be resolved on a worker (i.e. part of split) or should the co-ordinator evaluate the filter before creating the splits and other similar decisions.
Common code from Hive connector can still be used (and should be where it makes sense) similar to how Iceberg connector does.

It might take more time and effort but the end result would be a much better first class Hudi integration.

I'd suggest to create a PR with a new barebones Hudi connector with minimum required functionality and integration tests so that interested parties can start contributing in parallel on different parts (e.g. realtime Hudi tables?).

@vinothchandar
Copy link

The Hudi impl would need it's own split enumeration logic, reader instantiation and it's own implementation of a Split so that we can avoiding piggybacking on top of HiveSplit

@hashhar Thanks for the comments! For context, Hudi supports two kinds of tables, Copy-On-Write/CoW (which has only parquet data for e.g) and Merge-on-Read/MoR (which has a mix of columnar and row oriented data). While we see the benefits of a new connector for MoR - where we need our own record readers, CoW is just some filtering on the existing split generation in the Hive connector. Doing it within the hive connector, has the benefit of being able to take advantage of new features added to hive connector e.g file status caching? Since we are picking and choosing and extending Hive connector within our Hudi connector, I am wondering how we maintain it over time.

In fact, @electrum suggested the original approach we took in prestoDB :).

@py2711
Copy link

py2711 commented Oct 26, 2021

I have tested the above code and Its doesn't work in case where log file is present in one partition and not present in another partition.
This is a very basic case.

@hashhar hashhar mentioned this pull request Oct 26, 2021
@colebow
Copy link
Member

colebow commented Oct 27, 2022

👋 @PennyAndWang - this PR is inactive and doesn't seem to be under development, and it might already be implemented. If you'd like to continue work on this at any point in the future, feel free to re-open.

@colebow colebow closed this Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

9 participants