Nonrecursive dir listing in trino FileSystem #20255
ryadav-uptycs wants to merge 14 commits into trinodb:master
1. exception handling in splitLoader 2. non-recursive iteration on hudi metadata dir
as Hudi metadata listing requires non-recursive listing
Hi @findepi, I have raised this separate PR for the Trino file system changes. Could you please review it and also add the other developers of the Trino filesystem library?
@ryadav-uptycs thank you for your PR! Can you check the PR's commit list https://github.com/trinodb/trino/pull/20255/commits and make sure it has only the stuff you wanted to include?
@findepi yes, it has only changes related to making directory listing non-recursive in all types of FS.
I was confused by commit titled
It's not clear to me why we need this. Can you add a commit to the Hudi connector that uses this, along with a test case that would fail without this change? Is there any Hudi documentation that shows the expected file layout for this feature?
@findepi I mentioned the Hudi issue because fixing the Hudi connector issue requires non-recursive dir listing. That is the main driving cause for this change. Let me know if it is creating confusion; I will remove it from the title.
@electrum, in the Hudi connector we list the hoodie metadata dir to get the timeline. Right now that listing is recursive, since it uses TrinoFileSystem.listFiles(), which gives an incorrect result. There is no error; it is just that the query count is incorrect.
We can add a test that verifies that the row count is correct for the query.
@electrum ok, I will add that.
@ryadav-uptycs can you update this PR to use the new API to resolve #20130?
@findepi I am sorry, I have not understood. Which new API? In #20130 I could not see what the new API is.
@electrum updated the PR with test cases.
@ryadav-uptycs please squash commits
…fix the incorrect table count result. for that. 1. Adding non-recursive list method in trino filesystem 2. calling that method in hudi connector
…onrecursive-dir-listing
@findepi done
@ryadav-uptycs please squash the commits in this PR. That means that the count of commits above in the user interface should be 1 (currently 13). You can do that with interactive rebasing in git (flag the commits for squash and update the commit message) and then a forced push to your branch. Ping me on Slack if you need help.
Also maybe @codope can help here?
Squashing is causing lots of conflicts, so to avoid any errors I am closing this PR and raising another from a different branch with the same changes. Raised new PR after resolving conflicts -> #20253


Details are captured in #20253.
The Hudi connector shows an incorrect table count for some of the tables on which we run clustering.
After digging deeper into the code, we found that in newer versions of Trino the Hudi meta path listing is done recursively (https://github.com/trinodb/trino/blob/430/plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/table/HudiTableMetaClient.java#L166), which causes an incorrect calculation of the oldest timestamp file. Because the listing is recursive, it also reads commit files such as .hoodie/.bucket_index/consistent_hashing_metadata//20231101000038468.commit, which was placed there to update the state of clustering. This causes some of the parquet files to be skipped while listing the directory, hence the incorrect data count. These clustering commit files were introduced in Hudi as part of apache/hudi#8503.
If we add a filter after getting the listing result, the query takes a long time in the planning phase; with the non-recursive implementation, planning finishes quickly.
As suggested by @findepi, this PR adds changes for the Trino filesystem only.
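The difference between recursive and non-recursive listing described above can be illustrated with a small, self-contained sketch. It is not the Trino or Hudi implementation: it uses plain `java.nio.file` on a temporary directory as a stand-in for the Trino file system, and the class name `ListingSketch`, the helper methods, the nested directory name `part`, and the top-level instant `20231102000000000` are all hypothetical. Only the nested commit file name is taken from the description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListingSketch {
    // Recursive listing: every regular file under dir, at any depth.
    public static List<String> listRecursive(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile)
                    .map(p -> dir.relativize(p).toString().replace('\\', '/'))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Non-recursive listing: only regular files directly inside dir.
    public static List<String> listNonRecursive(Path dir) throws IOException {
        try (Stream<Path> paths = Files.list(dir)) {
            return paths.filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Oldest instant = smallest timestamp prefix among the listed *.commit files.
    public static String oldestInstant(List<String> files) {
        return files.stream()
                .map(f -> f.substring(f.lastIndexOf('/') + 1))
                .filter(f -> f.endsWith(".commit"))
                .map(f -> f.substring(0, f.indexOf('.')))
                .min(String::compareTo)
                .orElseThrow(IllegalStateException::new);
    }

    // Hypothetical layout: one timeline commit at the top level of .hoodie and
    // one clustering-state commit nested under .bucket_index.
    public static Path buildSampleLayout() throws IOException {
        Path root = Files.createTempDirectory("hoodie-sketch");
        Path meta = Files.createDirectories(root.resolve(".hoodie"));
        Files.createFile(meta.resolve("20231102000000000.commit"));
        Path nested = Files.createDirectories(
                meta.resolve(".bucket_index/consistent_hashing_metadata/part"));
        Files.createFile(nested.resolve("20231101000038468.commit"));
        return meta;
    }

    public static void main(String[] args) throws IOException {
        Path meta = buildSampleLayout();
        System.out.println("recursive oldest:     " + oldestInstant(listRecursive(meta)));
        System.out.println("non-recursive oldest: " + oldestInstant(listNonRecursive(meta)));
    }
}
```

In this sketch the recursive listing picks up the nested clustering commit and reports it as the oldest instant, while the non-recursive listing only sees the top-level timeline file. Listing a single level also avoids walking the nested directories at all, which is consistent with the planning-time observation above, although the sketch does not measure it.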