Skip to content

Implement PrestoS3FileSystem#listFiles for direct recursive listings#4825

Merged
electrum merged 1 commit intotrinodb:masterfrom
pettyjamesm:support-s3-recursive-listings
Aug 20, 2020
Merged

Implement PrestoS3FileSystem#listFiles for direct recursive listings#4825
electrum merged 1 commit intotrinodb:masterfrom
pettyjamesm:support-s3-recursive-listings

Conversation

@pettyjamesm
Copy link
Copy Markdown
Member

Cross contribution of prestodb/presto#15024

Implements FileSystem#listFiles(Path, boolean recursive) for PrestoS3FileSystem which in theory adds support for directly listing S3 files underneath a prefix without recursive calls through each "directory". This direct traversal requires much fewer requests when dealing with nested directories but may violate some PathFilter implementation's expectation of being called at each directory level, and may perform worse when a large number objects are contained within hidden paths (since filtering would be performed after the fact). I'm open to suggestions about how to balance this trade-off and integrate this with the DirectoryLister.

Incidentally, a straightforward improvement to getFileStatus fell out of the implementation allowing the isDir check for a path prefix (with no object present) to be done by limiting the listing result size to 1 instead of listing the full default 1000.

@cla-bot cla-bot bot added the cla-signed label Aug 13, 2020
@pettyjamesm pettyjamesm requested a review from electrum August 13, 2020 20:47
@raunaqmorarka
Copy link
Copy Markdown
Member

Related PR #3221

@pettyjamesm pettyjamesm force-pushed the support-s3-recursive-listings branch from 45beeb6 to 1f32305 Compare August 19, 2020 13:45
Implements FileSystem#listFiles(Path, boolean recursive) for
PrestoS3FileSystem which in theory adds support for directly listing
S3 files underneath a prefix without recursive calls through each
"directory". This direct traversal requires much fewer requests
 when dealing with nested directories but may violate some PathFilter
implementation's expectation of being called at each directory level,
and may perform worse when a large number objects are contained within
hidden paths (since filtering would be performed after the fact). I'm
open to suggestions about how to balance this trade-off and integrate
this with the DirectoryLister.

Incidentally, a straightforward improvement to getFileStatus fell out
of the implementation allowing the isDir check for a path prefix (with
no object present) to be done by limiting the listing result size to 1
instead of listing the full default 1000.
@pettyjamesm pettyjamesm force-pushed the support-s3-recursive-listings branch from 1f32305 to da2bf3d Compare August 20, 2020 11:03
@electrum electrum merged commit 4713d09 into trinodb:master Aug 20, 2020
@electrum
Copy link
Copy Markdown
Member

Thanks!

@pettyjamesm pettyjamesm deleted the support-s3-recursive-listings branch August 20, 2020 17:14
@findepi
Copy link
Copy Markdown
Member

findepi commented Aug 21, 2020

Without a change in BackgroundHiveSplitLoader are we benefitting from this already?

@findepi findepi added this to the 341 milestone Aug 21, 2020
@pettyjamesm
Copy link
Copy Markdown
Member Author

Without a change in BackgroundHiveSplitLoader are we benefitting from this already?

At the moment, no. I didn’t currently have the bandwidth to adapt the code we have to work in a world with caches and / or hudi integration so I contributed just this piece for now. I’ve talked with @electrum about the basic shape of what you would have to do in order to leverage this from split loading so it’s possible someone could pick that up in the mean time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants