Skip to content

Nonrecursive dir listing in trino FileSystem#20255

Closed
ryadav-uptycs wants to merge 14 commits intotrinodb:masterfrom
ryadav-uptycs:nonrecursive-dir-listing
Closed

Nonrecursive dir listing in trino FileSystem#20255
ryadav-uptycs wants to merge 14 commits intotrinodb:masterfrom
ryadav-uptycs:nonrecursive-dir-listing

Conversation

@ryadav-uptycs
Copy link
Copy Markdown

@ryadav-uptycs ryadav-uptycs commented Jan 2, 2024

Details are Captured in
#20253

Hudi connector shows incorrect table count for some of the tables where we run clustering

After digging deeper into the code we found that in newer version of trino .hudi metapath listing is done recursively (https://github.com/trinodb/trino/blob/430/plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/table/HudiTableMetaClient.java#L166) which is causing to incorrect oldest timestamp file calculation. Since it is doing recursively it is also reading commit files from .hoodie/.bucket_index/consistent_hashing_metadata//20231101000038468.commit which was placed to update the state of clustering. it is causing to skip some of the parquet files while listing the directory. hence incorrectly data count. This clustering commit files in hudi was done as part of this PR (apache/hudi#8503)

if we add filter after getting the listing result the query takes time in planning phase , but after adding non-recursive implementation planning is finishing quickly

As suggested by @findepi
For this PR adding changes for trinofilesystem only

1. exception handling in splitLoader
2. non-recursive iteration on hudi metadata dir
1. exception handling in splitLoader
2. non-recursive iteration on hudi metadata dir
1. exception handling in splitLoader
2. non-recursive iteration on hudi metadata dir
as Hudi metadata listing requires non-recursive listing
@cla-bot cla-bot bot added the cla-signed label Jan 2, 2024
@github-actions github-actions bot added tests:hive hive Hive connector labels Jan 2, 2024
@ryadav-uptycs ryadav-uptycs requested a review from findepi January 2, 2024 10:54
@ryadav-uptycs
Copy link
Copy Markdown
Author

Hi @findepi I have raised this separate PR for trino file system changes . Could you please review and also add the other developer of trinofilesystem lib

@ebyhr ebyhr requested a review from electrum January 2, 2024 11:50
as Hudi metadata listing requires non-recursive listing
@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 3, 2024

@ryadav-uptycs thank you for your PR!

can you check the PR's commit list https://github.com/trinodb/trino/pull/20255/commits and make sure it has only the stuff you wanted to include?

@ryadav-uptycs
Copy link
Copy Markdown
Author

@findepi yes , it has only changes related to make directory listing non-recursive in all type of FS

@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 3, 2024

I was confused by commit titled Fixing hudi related issue : x3 and Adding non-recursive dir listing property x2 commits. Did you mean to squash them?

@electrum
Copy link
Copy Markdown
Member

electrum commented Jan 5, 2024

It's not clear to me why we need this. Can you add a commit to the Hudi connector that uses this, along with a test case that would fail without this change?

Is there any Hudi documentation that shows the expected file layout for this feature?

@ryadav-uptycs
Copy link
Copy Markdown
Author

I was confused by commit titled Fixing hudi related issue : x3 and Adding non-recursive dir listing property x2 commits. Did you mean to squash them?

@findepi I mentioned hudi issue because to fix the hudi connector issue we need to have non-recursive dir listing . That is may driving cause for this change . Let me know if it is creating confusion . I will remove it from the title

@ryadav-uptycs
Copy link
Copy Markdown
Author

ryadav-uptycs commented Jan 5, 2024

It's not clear to me why we need this. Can you add a commit to the Hudi connector that uses this, along with a test case that would fail without this change?

Is there any Hudi documentation that shows the expected file layout for this feature?

@electrum , in Hudi connector we list hoodie metadata dir to get the timeline . Right now that listing piece is recursive listing since it is using trinosystem.listFiles () which is giving incorrect result . There is no error , it is just that query count is incorrect .
I thought to open a separate PR for hudi change once we have non-recursive dir listing support in trinofilesystem

@ryadav-uptycs
Copy link
Copy Markdown
Author

@electrum

This is Dir structure of hudi metadata
hoodiemeta

And below the recursive files inside the parent .hoodie directly . Because of recursive nature we reading below files also . which results into incorrect timeline calculation and getting incorrect count .
consistentHashing

@electrum
Copy link
Copy Markdown
Member

electrum commented Jan 5, 2024

We can add a test that verifies that the row count is correct for the query.

@ryadav-uptycs
Copy link
Copy Markdown
Author

We can add a test that verifies that the row count is correct for the query.

@electrum ok I will add that

@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 8, 2024

@ryadav-uptycs can you update this PR to use the new API to resolve #20130 ?

@ryadav-uptycs
Copy link
Copy Markdown
Author

ryadav-uptycs commented Jan 8, 2024

@ryadav-uptycs can you update this PR to use the new API to resolve #20130 ?

@findepi I am sorry I have not understood , which new api ? . In #20130 I could not see what new api is
added

@github-actions github-actions bot added the hudi Hudi connector label Jan 23, 2024
@ryadav-uptycs
Copy link
Copy Markdown
Author

@electrum updated PR with test cases

@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 31, 2024

@ryadav-uptycs please squash commits
https://github.com/trinodb/trino/pull/20255/commits should generally include just one commit

ryadav-uptycs and others added 4 commits February 5, 2024 10:22
…fix the incorrect table count result.

for that.
1. Adding non-recursive list method in trino filesystem
2. calling that method in hudi connector
@ryadav-uptycs
Copy link
Copy Markdown
Author

@findepi done

@mosabua
Copy link
Copy Markdown
Member

mosabua commented Feb 5, 2024

@ryadav-uptycs please squash the commits in this PR. That means that the count of commit above in the user interface should be 1 (currently 13).

You can do that will interactive rebasing in git and then a forced push to your branch.

something like

git rebase -i HEAD~15

then flag them for squash and update the commit message

and then

git push -f

Ping me on slack if you need help.

Also https://www.git-tower.com/learn/git/faq/git-squash

@mosabua
Copy link
Copy Markdown
Member

mosabua commented Feb 5, 2024

Also maybe @codope can help here?

@ryadav-uptycs
Copy link
Copy Markdown
Author

ryadav-uptycs commented Feb 13, 2024

squashing causing lots of conflict , so avoid any error I am closing this PR and raising another from different branch with same changes. raised new PR after resolving conflicts -> #20253

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed hive Hive connector hudi Hudi connector

Development

Successfully merging this pull request may close these issues.

4 participants