Skip to content

Implement PrestoS3FileSystem#listFiles for direct recursive listings#15024

Merged
zhenxiao merged 1 commit intoprestodb:masterfrom
pettyjamesm:support-s3-recursive-listings
Aug 21, 2020
Merged

Implement PrestoS3FileSystem#listFiles for direct recursive listings#15024
zhenxiao merged 1 commit intoprestodb:masterfrom
pettyjamesm:support-s3-recursive-listings

Conversation

@pettyjamesm
Copy link
Copy Markdown
Contributor

Implements FileSystem#listFiles(Path, boolean recursive) for PrestoS3FileSystem which in theory adds support for directly listing S3 files underneath a prefix without recursive calls through each "directory". This direct traversal requires much fewer requests when dealing with nested directories but may violate some PathFilter implementation's expectation of being called at each directory level, and may perform worse when a large number objects are contained within hidden paths (since filtering would be performed after the fact). I'm open to suggestions about how to balance this trade-off and integrate this with the DirectoryLister.

Incidentally, a straightforward improvement to getFileStatus fell out of the implementation allowing the isDir check for a path prefix (with no object present) to be done by limiting the listing result size to 1 instead of listing the full default 1000.

== RELEASE NOTES ==
Hive Changes
* Add support for direct recursive file listings in PrestoS3FileSystem

Copy link
Copy Markdown
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking at it, @pettyjamesm
left some comments :)

@pettyjamesm pettyjamesm force-pushed the support-s3-recursive-listings branch 2 times, most recently from 30eef37 to cb4be32 Compare August 20, 2020 10:53
Copy link
Copy Markdown
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, @pettyjamesm
LGTM, only a few minor issues

Implements FileSystem#listFiles(Path, boolean recursive) for
PrestoS3FileSystem which in theory adds support for directly listing
S3 files underneath a prefix without recursive calls through each
"directory". This direct traversal requires much fewer requests
when dealing with nested directories but may violate some PathFilter
implementation's expectation of being called at each directory level,
and may perform worse when a large number objects are contained within
hidden paths (since filtering would be performed after the fact). I'm
open to suggestions about how to balance this trade-off and integrate
this with the DirectoryLister.

Incidentally, a straightforward improvement to getFileStatus fell out
of the implementation allowing the isDir check for a path prefix (with
no object present) to be done by limiting the listing result size to 1
instead of listing the full default 1000.
@pettyjamesm pettyjamesm force-pushed the support-s3-recursive-listings branch from cb4be32 to c108a13 Compare August 21, 2020 20:48
@zhenxiao zhenxiao merged commit 95725f6 into prestodb:master Aug 21, 2020
@caithagoras caithagoras mentioned this pull request Sep 9, 2020
3 tasks
@pettyjamesm pettyjamesm deleted the support-s3-recursive-listings branch July 20, 2021 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants