Adding non-recursive directory listing support in trino filesystem and using the non-rec api in hudi metadata dir listing#20686

Closed
ryadav-uptycs wants to merge 1 commit into trinodb:master from ryadav-uptycs:nonrec-dirlist

@ryadav-uptycs ryadav-uptycs commented Feb 13, 2024

Details are captured in #20253.

Hudi connector shows incorrect table count for some of the tables where we run clustering

After digging deeper into the code, we found that in newer versions of Trino the .hoodie metapath listing is done recursively (https://github.com/trinodb/trino/blob/430/plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/table/HudiTableMetaClient.java#L166), which causes the oldest timestamp file to be calculated incorrectly. Because the listing is recursive, it also reads commit files such as .hoodie/.bucket_index/consistent_hashing_metadata//20231101000038468.commit, which was written to update the clustering state. As a result, some of the Parquet files are skipped while listing the directory, and the data count is incorrect. These clustering commit files were introduced in Hudi as part of this PR (apache/hudi#8503).
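To illustrate the failure mode described above, here is a minimal in-memory sketch. It is not Trino or Hudi code: the class `MetaListingSketch`, its methods, the partition name `p0`, and the two top-level commit timestamps are all hypothetical, and "oldest instant" is simplified to the smallest timestamp among `*.commit` file names.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class MetaListingSketch {
    // Hypothetical metadata layout; "p0" and the first two timestamps are made up.
    static final String META = "table/.hoodie/";
    static final List<String> FILES = List.of(
            META + "20231105000000000.commit",
            META + "20231106000000000.commit",
            META + ".bucket_index/consistent_hashing_metadata/p0/20231101000038468.commit");

    // Recursive listing: every file under the directory, at any depth.
    static List<String> listRecursive(String dir) {
        return FILES.stream()
                .filter(f -> f.startsWith(dir))
                .collect(Collectors.toList());
    }

    // Non-recursive listing: only files that are direct children of the directory.
    static List<String> listDirect(String dir) {
        return FILES.stream()
                .filter(f -> f.startsWith(dir) && f.indexOf('/', dir.length()) < 0)
                .collect(Collectors.toList());
    }

    // Simplified "oldest instant": smallest timestamp among *.commit file names.
    static Optional<String> oldestInstant(List<String> files) {
        return files.stream()
                .filter(f -> f.endsWith(".commit"))
                .map(f -> f.substring(f.lastIndexOf('/') + 1, f.length() - ".commit".length()))
                .min(String::compareTo);
    }

    public static void main(String[] args) {
        // The nested clustering commit wins when the listing is recursive.
        System.out.println("recursive: " + oldestInstant(listRecursive(META)).orElseThrow());
        System.out.println("direct:    " + oldestInstant(listDirect(META)).orElseThrow());
    }
}
```

In this sketch the recursive listing picks up the nested clustering commit as the oldest instant, while the direct listing only considers the timeline files at the top level of `.hoodie`.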

If we add a filter after getting the listing result, the query spends a long time in the planning phase; with the non-recursive implementation, planning finishes quickly.

PS: This is the same PR as #20255, after resolving all conflicts.

@cla-bot cla-bot bot added the cla-signed label Feb 13, 2024
@mosabua mosabua requested review from yihua and removed request for mosabua February 13, 2024 17:29
@github-actions github-actions bot added tests:hive hudi Hudi connector hive Hive connector labels Feb 13, 2024
@findepi findepi requested a review from electrum February 13, 2024 21:10
Contributor

codope commented Feb 14, 2024

@ryadav-uptycs Thanks for the contribution! I think 7da7882 missed filtering a lot of files under .hoodie, and we are running into this bug because of that. That commit was meant to get rid of Hadoop dependencies. We are working on newer abstractions in the Hudi code so that there is no need to inline Hudi classes in Trino code, and these new APIs will be Hadoop independent. Until then, we need to land this fix. I will review the PR this week.

…e api in hudi-connector to avoid incorrect query resul

extra check
@ryadav-uptycs
Author

@electrum Could you please review it?

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Member

@dain dain left a comment

This doesn't seem like the right approach to me. Instead, can't Hudi just filter the results to remove the entries it doesn't want?

@ryadav-uptycs
Author

This doesn't seem like the right approach to me. Instead, can't Hudi just filter the results to remove the entries it doesn't want?

@dain Hudi can, but listing will be faster if we don't list the nested directories in the first place. A filter would produce the desired result, but it would add a performance cost.
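The trade-off being discussed can be sketched with a toy in-memory directory tree. This is only an illustration under stated assumptions: `ListingCostSketch`, its `TREE` contents, and the `visited` counter are hypothetical and do not reflect the Trino filesystem API or real object-store listing costs. Both approaches return the same files, but filtering after a recursive listing still enumerates every nested entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ListingCostSketch {
    // Hypothetical tree: directory -> child names; a trailing "/" marks a subdirectory.
    static final Map<String, List<String>> TREE = Map.of(
            ".hoodie/", List.of("a.commit", "b.commit", ".bucket_index/"),
            ".hoodie/.bucket_index/", List.of("m1.commit", "m2.commit", "m3.commit"));

    // Number of entries enumerated by the most recent call to an option method.
    static int visited;

    private static List<String> listRecursive(String dir) {
        List<String> out = new ArrayList<>();
        for (String child : TREE.getOrDefault(dir, List.of())) {
            visited++;
            if (child.endsWith("/")) {
                out.addAll(listRecursive(dir + child)); // descend into the subdirectory
            } else {
                out.add(dir + child);
            }
        }
        return out;
    }

    // Option A: list recursively, then drop everything below the top level.
    static List<String> recursiveThenFilter(String dir) {
        visited = 0;
        List<String> all = listRecursive(dir);
        all.removeIf(f -> f.indexOf('/', dir.length()) >= 0);
        return all;
    }

    // Option B: enumerate only the direct children and never descend.
    static List<String> directOnly(String dir) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (String child : TREE.getOrDefault(dir, List.of())) {
            visited++;
            if (!child.endsWith("/")) {
                out.add(dir + child);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(recursiveThenFilter(".hoodie/") + " visited=" + visited);
        System.out.println(directOnly(".hoodie/") + " visited=" + visited);
    }
}
```

How large the gap is in practice depends on how many files sit under the nested directories and on how the underlying storage implements listing, which this sketch does not model.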

@github-actions github-actions bot removed the stale label Mar 20, 2024
@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label May 10, 2024

github-actions bot commented Jun 3, 2024

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jun 3, 2024
Member

mosabua commented Jun 3, 2024

Will leave this closed, since the upcoming Hudi release and the work from @yihua will affect how (and whether) to proceed with this work.
