Nonrecursive dir listing in trino FileSystem by ryadav-uptycs · Pull Request #20255 · trinodb/trino

ryadav-uptycs · 2024-01-02T10:51:36Z

Details are captured in #20253.

The Hudi connector returns an incorrect row count for some of the tables on which we run clustering.

After digging deeper into the code, we found that in newer versions of Trino the Hudi metadata path (.hoodie) is listed recursively (https://github.com/trinodb/trino/blob/430/plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/table/HudiTableMetaClient.java#L166), which leads to an incorrect calculation of the oldest-timestamp file. Because the listing is recursive, it also reads commit files such as .hoodie/.bucket_index/consistent_hashing_metadata//20231101000038468.commit, which was placed there to update the clustering state. As a result, some Parquet files are skipped while listing the directory, and the data count comes out wrong. These clustering commit files were introduced in Hudi by apache/hudi#8503.

If we instead filter the listing result after fetching it, the query spends a long time in the planning phase; with the non-recursive implementation, planning finishes quickly.
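To make the difference concrete, here is a minimal sketch using plain java.nio.file rather than Trino's file system API. It builds a temporary directory that loosely mimics the layout described above and compares three approaches: a recursive listing, a recursive listing filtered afterwards, and a direct (non-recursive) listing. The class name, the top-level commit file names, and the `example_partition` directory are hypothetical; only the nested commit file name is taken from the description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ListingSketch
{
    // Recursive listing: every regular file under root, at any depth.
    static List<String> listRecursive(Path root)
    {
        try (Stream<Path> paths = Files.walk(root)) {
            return relativeFileNames(root, paths);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Recursive listing, then keep direct children only.
    // The whole tree is still walked before the filter discards nested entries.
    static List<String> listRecursiveThenFilter(Path root)
    {
        try (Stream<Path> paths = Files.walk(root)) {
            return relativeFileNames(root, paths.filter(p -> root.equals(p.getParent())));
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Non-recursive listing: only direct children are visited.
    static List<String> listDirect(Path root)
    {
        try (Stream<Path> paths = Files.list(root)) {
            return relativeFileNames(root, paths);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Keeps regular files and returns their paths relative to root, sorted.
    private static List<String> relativeFileNames(Path root, Stream<Path> paths)
    {
        return paths.filter(Files::isRegularFile)
                .map(p -> root.relativize(p).toString().replace('\\', '/'))
                .sorted()
                .collect(Collectors.toList());
    }

    // Creates a temporary stand-in for a .hoodie directory (hypothetical contents).
    static Path createSampleLayout()
    {
        try {
            Path metaDir = Files.createTempDirectory("hoodie-sketch");
            Files.createFile(metaDir.resolve("20231105000000000.commit")); // hypothetical
            Files.createFile(metaDir.resolve("20231106000000000.commit")); // hypothetical
            Path nested = metaDir.resolve(".bucket_index")
                    .resolve("consistent_hashing_metadata")
                    .resolve("example_partition"); // hypothetical directory name
            Files.createDirectories(nested);
            Files.createFile(nested.resolve("20231101000038468.commit"));
            return metaDir;
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        Path metaDir = createSampleLayout();
        System.out.println("recursive:          " + listRecursive(metaDir));
        System.out.println("recursive + filter: " + listRecursiveThenFilter(metaDir));
        System.out.println("direct:             " + listDirect(metaDir));
    }
}
```

The filtered and direct variants return the same files, but the filtered one still has to enumerate everything under the directory first. On a large metadata directory that extra enumeration is a plausible source of the slower planning mentioned above.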

As suggested by @findepi, this PR contains only the TrinoFileSystem changes.

1. exception handling in splitLoader 2. non-recursive iteration on hudi metadata dir

as Hudi metadata listing requires non-recursive listing

ryadav-uptycs · 2024-01-02T10:55:36Z

Hi @findepi, I have raised this separate PR for the Trino file system changes. Could you please review it and also add the other developers of the TrinoFileSystem library as reviewers?


findepi · 2024-01-03T08:18:35Z

@ryadav-uptycs thank you for your PR!

can you check the PR's commit list https://github.com/trinodb/trino/pull/20255/commits and make sure it has only the stuff you wanted to include?

ryadav-uptycs · 2024-01-03T10:57:47Z

@findepi yes, it has only the changes related to making directory listing non-recursive in all types of FS.

findepi · 2024-01-03T11:11:27Z

I was confused by the commits titled "Fixing hudi related issue :" (x3) and "Adding non-recursive dir listing property" (x2). Did you mean to squash them?

electrum · 2024-01-05T13:23:50Z

It's not clear to me why we need this. Can you add a commit to the Hudi connector that uses this, along with a test case that would fail without this change?

Is there any Hudi documentation that shows the expected file layout for this feature?

ryadav-uptycs · 2024-01-05T13:31:55Z

> I was confused by the commits titled "Fixing hudi related issue :" (x3) and "Adding non-recursive dir listing property" (x2). Did you mean to squash them?

@findepi I mentioned the Hudi issue because fixing the Hudi connector issue requires non-recursive directory listing. That is the main driving cause for this change. Let me know if it is creating confusion and I will remove it from the title.

ryadav-uptycs · 2024-01-05T13:35:45Z

> It's not clear to me why we need this. Can you add a commit to the Hudi connector that uses this, along with a test case that would fail without this change?
>
> Is there any Hudi documentation that shows the expected file layout for this feature?

@electrum, in the Hudi connector we list the .hoodie metadata directory to get the timeline. Right now that listing is recursive because it uses TrinoFileSystem.listFiles(), which gives an incorrect result. There is no error; the query just returns an incorrect count.
I planned to open a separate PR for the Hudi change once non-recursive directory listing is supported in TrinoFileSystem.

ryadav-uptycs · 2024-01-05T14:04:56Z

@electrum

This is the directory structure of the Hudi metadata.

And below are the files nested inside the parent .hoodie directory. Because the listing is recursive, we are reading these files as well, which results in an incorrect timeline calculation and an incorrect count.
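As a rough illustration of how a nested commit file can skew the timeline, the sketch below derives the oldest instant from a list of file paths relative to .hoodie. It is a simplification and not the connector's actual logic: it assumes the instant timestamp is the part of the file name before the first dot and that timestamps are fixed-width, so they order correctly as strings. The class, the helper names, the top-level file names, and `example_partition` are all hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class TimelineSketch
{
    // Assumption: the instant timestamp is the file name up to the first dot.
    static String instantOf(String relativePath)
    {
        String fileName = relativePath.substring(relativePath.lastIndexOf('/') + 1);
        int dot = fileName.indexOf('.');
        return dot < 0 ? fileName : fileName.substring(0, dot);
    }

    // A path relative to the metadata directory is a direct child if it has no separator.
    static boolean isDirectChild(String relativePath)
    {
        return relativePath.indexOf('/') < 0;
    }

    // Oldest instant among .commit files, optionally ignoring nested files.
    static Optional<String> oldestInstant(List<String> relativePaths, boolean directChildrenOnly)
    {
        return relativePaths.stream()
                .filter(p -> p.endsWith(".commit"))
                .filter(p -> !directChildrenOnly || isDirectChild(p))
                .map(TimelineSketch::instantOf)
                .min(Comparator.naturalOrder());
    }

    public static void main(String[] args)
    {
        List<String> paths = List.of(
                "20231105000000000.commit", // hypothetical timeline commit
                "20231106000000000.commit", // hypothetical timeline commit
                ".bucket_index/consistent_hashing_metadata/example_partition/20231101000038468.commit");
        System.out.println("recursive listing:     " + oldestInstant(paths, false));
        System.out.println("non-recursive listing: " + oldestInstant(paths, true));
    }
}
```

With the recursive listing, the nested clustering commit is picked up as the oldest instant; with direct children only, the oldest instant comes from the timeline files themselves.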

electrum · 2024-01-05T14:21:55Z

We can add a test that verifies that the row count is correct for the query.

ryadav-uptycs · 2024-01-05T14:28:29Z

> We can add a test that verifies that the row count is correct for the query.

@electrum OK, I will add that.

findepi · 2024-01-08T15:24:21Z

@ryadav-uptycs can you update this PR to use the new API to resolve #20130 ?

ryadav-uptycs · 2024-01-08T16:01:58Z

> @ryadav-uptycs can you update this PR to use the new API to resolve #20130 ?

@findepi I am sorry, I have not understood: which new API? In #20130 I could not see what new API was added.

ryadav-uptycs · 2024-01-23T06:46:31Z

@electrum I have updated the PR with test cases.

findepi · 2024-01-31T17:01:35Z

@ryadav-uptycs please squash commits
https://github.com/trinodb/trino/pull/20255/commits should generally include just one commit


ryadav-uptycs · 2024-02-05T05:34:15Z

@findepi done

mosabua · 2024-02-05T17:58:08Z

@ryadav-uptycs please squash the commits in this PR. That means that the commit count shown above in the user interface should be 1 (currently 13).

You can do that with interactive rebasing in git and then a forced push to your branch.

something like

git rebase -i HEAD~15

then flag them for squash and update the commit message

and then

git push -f

Ping me on slack if you need help.

Also https://www.git-tower.com/learn/git/faq/git-squash

mosabua · 2024-02-05T18:09:24Z

Also maybe @codope can help here?

ryadav-uptycs · 2024-02-13T17:05:47Z

Squashing is causing a lot of conflicts, so to avoid any errors I am closing this PR and raising another one from a different branch with the same changes. Raised a new PR after resolving the conflicts -> #20253

ryadav-uptycs added 4 commits December 17, 2023 18:53

Fixing hudi related issue :

9a36d95

1. exception handling in splitLoader 2. non-recursive iteration on hudi metadata dir

Fixing hudi related issue :

f95786d

1. exception handling in splitLoader 2. non-recursive iteration on hudi metadata dir

Fixing hudi related issue :

5f2e34b

1. exception handling in splitLoader 2. non-recursive iteration on hudi metadata dir

Adding non-recursive dir listing property

30dd43c

as Hudi metadata listing requires non-recursive listing

cla-bot bot added the cla-signed label Jan 2, 2024

github-actions bot added tests:hive hive Hive connector labels Jan 2, 2024

ryadav-uptycs requested a review from findepi January 2, 2024 10:54

ebyhr requested a review from electrum January 2, 2024 11:50

Adding non-recursive dir listing property

c6909d2

as Hudi metadata listing requires non-recursive listing

adding testcase for nonrecursive hoodie metadata listing test

8077a9c

github-actions bot added the hudi Hudi connector label Jan 23, 2024

ryadav-uptycs and others added 4 commits February 5, 2024 10:22

adding non-recursive hudi metadata listing functionality in order to …

ed37590

…fix the incorrect table count result. for that. 1. Adding non-recursive list method in trino filesystem 2. calling that method in hudi connector

Merge remote-tracking branch 'origin/nonrecursive-dir-listing' into n…

6c88467

…onrecursive-dir-listing

resolving conflict

ba05ad6

Merge branch 'master' into nonrecursive-dir-listing

fdcb3f6

resolving conflict

34ef3b5

ryadav-uptycs added 2 commits February 5, 2024 14:10

adding non-recursive call to CacheFileSystem

a80bc3a

fixing style check fixes

aae8200

Adding test comment

33f00d0

ryadav-uptycs closed this Feb 13, 2024

This was referenced Feb 13, 2024

Adding feature to list dir non-recursively in trinofilesystem to support hudi metadir list in non-recursive fashion #20253

Open

Adding non-recursive directory listing support in trino filesystem and using the non-rec api in hudi metadata dir listing #20686

Closed
