Adding non-recursive directory listing support in Trino filesystem and using the non-recursive API in Hudi metadata dir listing #20686
Conversation
@ryadav-uptycs Thanks for the contribution! I think 7da7882 missed filtering a lot of files under
@electrum Could you please review it?
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua
dain left a comment
This doesn't seem like the right approach to me. Instead, can't Hudi just filter the results to remove the entries it doesn't want?
@dain Hudi can, but listing will be faster if we don't list the dir in the first place. A filter will help get the desired result, but it will add a performance impact.
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.
Will leave this closed, since the upcoming Hudi release and work from @yihua will affect how (and whether) to proceed on this work.
Details are captured in #20253
Hudi connector shows incorrect table count for some of the tables where we run clustering
After digging deeper into the code, we found that in newer versions of Trino the .hoodie meta path listing is done recursively (https://github.com/trinodb/trino/blob/430/plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/table/HudiTableMetaClient.java#L166), which causes an incorrect oldest-timestamp file calculation. Because the listing is recursive, it also reads commit files from .hoodie/.bucket_index/consistent_hashing_metadata//20231101000038468.commit, which were placed there to record the state of clustering. This causes some of the parquet files to be skipped while listing the directory, and hence the incorrect data count. These clustering commit files were introduced in Hudi as part of apache/hudi#8503.
If we add a filter after getting the listing result, the query spends extra time in the planning phase; with the non-recursive implementation, planning finishes quickly.
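To illustrate the problem described above, here is a minimal, self-contained sketch. It uses plain java.nio rather than the Trino filesystem API, and the directory layout, class name, and file names are made up for the example: a recursive walk of a `.hoodie`-style directory also picks up a clustering commit nested under `.bucket_index`, while a flat (non-recursive) listing never touches that subdirectory at all.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListingDemo {
    // Build a toy layout mimicking a Hudi .hoodie directory:
    // one timeline commit directly under .hoodie, plus a clustering
    // metadata commit nested under .bucket_index (names are illustrative).
    static Path buildDemoLayout() throws IOException {
        Path hoodie = Files.createDirectories(
                Files.createTempDirectory("hoodie-demo").resolve(".hoodie"));
        Files.createFile(hoodie.resolve("20231101000001.commit"));
        Path nested = Files.createDirectories(
                hoodie.resolve(".bucket_index").resolve("consistent_hashing_metadata"));
        Files.createFile(nested.resolve("20231101000038468.commit"));
        return hoodie;
    }

    // List *.commit files either recursively (Files.walk) or only among
    // the direct children of the directory (Files.list).
    static List<Path> listCommits(Path dir, boolean recursive) throws IOException {
        try (Stream<Path> s = recursive ? Files.walk(dir) : Files.list(dir)) {
            return s.filter(p -> p.toString().endsWith(".commit"))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path hoodie = buildDemoLayout();
        // The recursive walk also returns the nested clustering commit,
        // which would skew the oldest-timestamp calculation.
        System.out.println("recursive: " + listCommits(hoodie, true).size());
        // The flat listing sees only the timeline commit.
        System.out.println("flat: " + listCommits(hoodie, false).size());
    }
}
```

Post-filtering the recursive result gives the same final list, but only after every nested entry has already been fetched; the non-recursive listing avoids that work entirely, which is why planning finishes faster.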
PS: This is the same PR as #20255, after resolving all conflicts.