|
I would love to see this become part of Trino. We use Hudi extensively, and the inability to query MOR tables limits what we can do with Trino. |
|
Great job @PennyAndWang |
@tooptoop4 done! |
Add support for Hudi MOR queries
    <dep.docker-java.version>3.2.8</dep.docker-java.version>
    <dep.coral.version>1.0.60</dep.coral.version>
    <dep.confluent.version>5.5.2</dep.confluent.version>
    <dep.hudi.version>0.7.0</dep.hudi.version>
Could we upgrade to 0.8.0?
I tested the code against Hudi 0.8.0 earlier, but it failed with an error indicating that one API is not supported. I will try again.
|
@PennyAndWang |
I'm waiting for review. |
hashhar
left a comment
Without any integration tests it's difficult to accept this patch since any future change can break the Hudi integration.
|
Can anyone at least look at the code and comment on the approach? Having Hudi MOR integration is a pretty big deal, as it pushes Trino into the near-real-time category. Plus, Presto already has it. We use Hudi and would love to have this finished. |
|
Re: The approach has some drawbacks. It very strongly couples the Hudi integration to lots of parts of the Hive connector (split enumeration for example) and would prevent evolution of the Hive connector independently of the Hudi integration. I think this PR also changes the Hive connector behaviour for Parquet tables that have a custom input format/reader set in table properties. cc: @electrum for a better approach since he's more familiar with the Hive connector than I am. |
|
Thank you for taking a look, it makes sense. What would be the proper approach? Create a separate Hudi connector (similar to Iceberg)? |
|
@afilipchik That makes the most sense to me. The Hudi implementation would need its own split enumeration logic, its own reader instantiation, and its own implementation of a Split, so that we can avoid piggybacking on top of HiveSplit and instead model the split in a way that makes the most sense for Hudi. For example, should the PathFilter be resolved on a worker (i.e. as part of the split), or should the coordinator evaluate the filter before creating the splits? There are other similar decisions to make. It might take more time and effort, but the end result would be a much better, first-class Hudi integration. I'd suggest creating a PR with a new barebones Hudi connector with the minimum required functionality and integration tests, so that interested parties can start contributing in parallel on different parts (e.g. realtime Hudi tables?). |
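To make the PathFilter question above concrete, here is a minimal sketch of the two options. Everything in it is hypothetical: `HudiSplitSketch`, `HudiSplit`, `filterOnCoordinator`, and `acceptOnWorker` are illustrative names, not Trino or Hudi APIs, and a plain `Predicate<String>` stands in for the PathFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: these types are illustrative and are not Trino or Hudi APIs.
final class HudiSplitSketch {
    // A split shaped for Hudi: one base file plus the log files to merge with it.
    // logFiles may be empty when a base file has no accompanying log files.
    static final class HudiSplit {
        final String partition;
        final String baseFile;
        final List<String> logFiles;

        HudiSplit(String partition, String baseFile, List<String> logFiles) {
            this.partition = partition;
            this.baseFile = baseFile;
            this.logFiles = List.copyOf(logFiles);
        }
    }

    // Option A: the coordinator evaluates the filter during split enumeration,
    // so rejected files never become splits.
    static List<HudiSplit> filterOnCoordinator(List<HudiSplit> candidates, Predicate<String> pathFilter) {
        List<HudiSplit> accepted = new ArrayList<>();
        for (HudiSplit candidate : candidates) {
            if (pathFilter.test(candidate.baseFile)) {
                accepted.add(candidate);
            }
        }
        return accepted;
    }

    // Option B: every candidate is shipped as a split, and the worker applies
    // the filter before opening the file.
    static boolean acceptOnWorker(HudiSplit split, Predicate<String> pathFilter) {
        return pathFilter.test(split.baseFile);
    }
}
```

With option A the filtering cost sits on the coordinator and fewer splits are scheduled; with option B the coordinator stays cheap, but the filter (or whatever state it needs) has to be available on every worker.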
@hashhar Thanks for the comments! For context, Hudi supports two kinds of tables: Copy-On-Write/CoW (which has only Parquet data, for example) and Merge-on-Read/MoR (which has a mix of columnar and row-oriented data). While we see the benefits of a new connector for MoR, where we need our own record readers, CoW is just some filtering on top of the existing split generation in the Hive connector. Doing it within the Hive connector has the benefit of being able to take advantage of new features added to the Hive connector, e.g. file status caching. Since we would be picking, choosing, and extending parts of the Hive connector within our Hudi connector, I am wondering how we would maintain it over time. In fact, @electrum suggested the original approach we took in PrestoDB :). |
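The difference between the two table types can be illustrated with a simplified sketch. It assumes that each base file carries a file group id and a fixed-width commit timestamp string (so string comparison orders commits), and that records are plain key/value pairs. `HudiTableTypesSketch`, `BaseFile`, `latestBaseFiles`, and `mergeOnRead` are hypothetical names, not Hudi or Trino APIs, and real MoR merging also has to handle deletes and payload-specific merge logic.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: simplified model of CoW filtering and MoR merging.
final class HudiTableTypesSketch {
    static final class BaseFile {
        final String fileGroupId;
        final String commitTime; // assumed fixed-width, so string order is commit order
        final String path;

        BaseFile(String fileGroupId, String commitTime, String path) {
            this.fileGroupId = fileGroupId;
            this.commitTime = commitTime;
            this.path = path;
        }
    }

    // CoW-style read: keep only the newest base file of each file group.
    static Map<String, BaseFile> latestBaseFiles(List<BaseFile> files) {
        Map<String, BaseFile> latest = new LinkedHashMap<>();
        for (BaseFile file : files) {
            BaseFile current = latest.get(file.fileGroupId);
            if (current == null || file.commitTime.compareTo(current.commitTime) > 0) {
                latest.put(file.fileGroupId, file);
            }
        }
        return latest;
    }

    // MoR-style read: start from the base file's records and overlay the log
    // records in order, so later log entries win for the same key.
    static Map<String, String> mergeOnRead(Map<String, String> baseRecords,
                                           List<Map.Entry<String, String>> logRecords) {
        Map<String, String> merged = new LinkedHashMap<>(baseRecords);
        for (Map.Entry<String, String> entry : logRecords) {
            merged.put(entry.getKey(), entry.getValue());
        }
        return merged;
    }
}
```

The first method is pure file selection, which is why it fits into existing split generation; the second needs a reader that understands both file formats. Calling `mergeOnRead` with an empty log list returns the base records unchanged.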
|
I have tested the above code, and it doesn't work when a log file is present in one partition but not in another. |
|
👋 @PennyAndWang - this PR is inactive and doesn't seem to be under development, and it might already be implemented. If you'd like to continue work on this at any point in the future, feel free to re-open. |
Following issue #5132, I am submitting this PR based on two PrestoDB changes: prestodb/presto#13818 and prestodb/presto#14795.
I'm not sure whether this is OK for the Trino community.
My test results show that this PR lets Trino read Hudi MOR tables.
Please review. Thanks!