Support Direct Recursive Hive File Listings #12443
Conversation
@findinpath @alexjo2144 PTAL
nit: Is this correct when we want to list the whole tree under the specified path?
Correct, we want to list all files underneath the specified path (not including the intermediate "directory" entries). This is ultimately the goal of a "recursive listing" when generating splits in Hive, but it comes with a caveat: we can no longer identify and skip sub-listing "hidden" directories, so we have to filter out the files listed underneath those hidden paths after the fact.
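The after-the-fact filtering described here could look roughly like the following. This is a hypothetical standalone sketch, not the actual Trino code; the "." / "_" hidden-name convention follows common Hive behavior, and all names are illustrative.

```java
// Hypothetical sketch: after a recursive listing, drop any file whose path
// contains a hidden ("." or "_" prefixed) component below the listing root.
public class HiddenPathFilter
{
    public static boolean isHiddenOrWithinHiddenParent(String listingRoot, String filePath)
    {
        // Only inspect path components below the listing root, so a hidden
        // component in the root itself does not exclude everything
        String relative = filePath.startsWith(listingRoot)
                ? filePath.substring(listingRoot.length())
                : filePath;
        for (String part : relative.split("/")) {
            if (part.startsWith(".") || part.startsWith("_")) {
                return true;
            }
        }
        return false;
    }
}
```

A file such as `/warehouse/t1/.staging/part-0` would be filtered out, while `/warehouse/t1/day=1/part-0` would be kept.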
Whoops, fixed.
@pettyjamesm please set up a test. We need to add these tests in order to guard Trino against eventual future regressions.
@findinpath - I might need some help figuring out how to put one of those together; the only existing tests I can see bootstrapping MinIO components are in the Delta Lake connector. Is it fine to assume certain bucket names and instance-local credentials for the purposes of integrating with S3 in tests?
@pettyjamesm ideally the class would act as a common building block with which we could build integration tests for all the Lakehouse connectors.
Yes, it should be fine. Please do take into account the HDFS-backed integration tests as well - ideally we'd have a common base class containing the tests for the directory walker functionality, on which we could eventually also add Azure/GCS tests later (not in scope of this PR).
alexjo2144 left a comment:
Mostly nit-picks.
One thing we might want to think about is that there may be a lot of duplicates in this cache if a directory is asked for in both recursive and non-recursive listings
Can we move this ignoreAbsentPartitions && !fileSystem.exists(path) to the top, before constructing the FileStatusIterator? That way we don't have to rely on exception handling to resolve the condition.
The idea behind this change is to avoid that extra check for fileSystem.exists(path) if the listing succeeds
Suggested change: `catch (Exception ee) {` → `catch (RuntimeException ee) {`
We want to handle any Exception type here, since we're promoting it to a TrinoException with a specific error code and message regardless of the underlying cause.
Could we just do something like:
stream(pathSubstring.split(String.valueOf(SEPARATOR_CHAR)))
.anyMatch(name -> ...));
You could, but you'd end up generating a lot of extra allocations in a performance-sensitive path, and it wouldn't allow isHiddenFileOrDirectory and isHiddenOrWithinHiddenParentDirectory to both share the same logic.
Is this just here to precompute the hash code? I'd usually just compute it as needed.
Also, mostly we just do Objects.hash(path, recursiveFilesOnly).
Precomputing and storing the hash code avoids an extra pointer indirection through the path reference, which is significant in terms of latency when, e.g., rehashing internal Cache structures, which is typically memory-access-latency bound.
Could you please use Objects.hash(path, recursiveFilesOnly)?
The result is slightly different, because Boolean.hashCode returns 1231/1237 instead of 0/1, but it is easier to grasp for the maintainers of the code.
Please also add a comment in the code explaining why the hash code is precomputed.
Suggested addition: `&& hashCode == other.hashCode`
Sounds redundant with isHiddenOrWithinHiddenParentDirectory?
isHiddenFileOrDirectory only looks at the "name", i.e. the last path part, and is used when doing shallow listings that don't recurse. When we're doing a recursive listing, we have to check for intermediate hidden folders between the current listing root and the final file name part, but not for shallow listings.
Maybe includesRecursiveFiles or includesRecursiveListing?
I see now where the naming you chose comes from: io.trino.plugin.hive.s3.TrinoS3FileSystem.ListingMode#RECURSIVE_FILES_ONLY.
Why not always call fs.listFiles(cacheKey.getPath(), cacheKey.recursiveFilesOnly())?
fs.listFiles(Path, boolean recursive) only lists files, not directories (and does so either recursively, or only at the current level), while fs.listLocatedStatus(Path) returns both.
While that's true in theory, I don't think it will actually occur in practice - at least in terms of how the code works today. The only time a directory will be listed recursively is inside of the
As far as I understand, both list operations are listing ONLY files (and not sub-directories).
Please take into consideration changing the method name, e.g. listRecursively.
That's incorrect; the list method will list all contents at the requested path (shallowly), including directories, which is why the HiveFileIterator previously handled recursing through child directories (or failing, or ignoring them, per whatever nested directory policy was set). The new method listFilesRecursively will list only files, at any depth below the requested path.
I fail to see a scenario in which directories would be returned by the list method with the current NestedDirectoryPolicy values: IGNORED, RECURSE, FAIL.
It would probably make sense to put up a separate (preparatory) PR containing the newly added hive test, to make sure that there is no regression introduced by your latest changes. Just to be on the safe side.
The nested directory policy previously only affected how the HiveFileIterator handled encountered directories, but had no bearing on the results of the DirectoryLister. Now, when the mode is set to recurse, we call the appropriate DirectoryLister method which calls the corresponding FileSystem#listFiles(Path, boolean recursive=true) method to handle recursing and returning only files (not directories) for us. Effectively, the directory lister now has two listing modes:
- shallow, files + directories
- recursive, files only
The only change here is that we’re now letting the FileSystem handle “recursing” into subdirectories to enumerate the leaf “files” for us, which allows some implementations (like TrinoS3FileSystem, which already has an optimized listFiles overriding implementation from a prior PR but wasn’t previously being used by the DirectoryLister) to provide much more efficient implementations.
Specifically, most blob store APIs (like S3) provide a mechanism to directly list all keys starting with a given prefix, regardless of how many / characters are between the prefix and the end of the key, which greatly reduces the number of unnecessary list calls compared to the behavior before that would issue another batch of listings to traverse into each “sub-directory”.
For most other FileSystem implementations, the behavior should be effectively unchanged, because the FileSystem parent class implementation does essentially the same thing that HiveFileIterator was doing before this change: it builds a Deque of RemoteIterator instances and repeatedly calls FileSystem#listLocatedStatus(Path) to recurse into subdirectories.
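The default strategy described in the previous paragraph can be modeled with a toy sketch like the one below. Types and names are illustrative only: Hadoop's FileSystem works with RemoteIterator<LocatedFileStatus>, not plain strings, and this stands in for its behavior rather than reproducing it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy model of the default recursive listing strategy: keep a deque of
// per-directory iterators and push a new shallow listing whenever a
// directory entry is encountered. Each push stands in for another
// listLocatedStatus(Path) call against the file system.
public class RecursiveListingSketch
{
    public static List<String> listFilesRecursively(Map<String, List<String>> directories, String root)
    {
        List<String> files = new ArrayList<>();
        Deque<Iterator<String>> pending = new ArrayDeque<>();
        pending.push(directories.getOrDefault(root, List.of()).iterator());
        while (!pending.isEmpty()) {
            Iterator<String> current = pending.peek();
            if (!current.hasNext()) {
                pending.pop();
                continue;
            }
            String entry = current.next();
            if (directories.containsKey(entry)) {
                // entry is a sub-directory: issue another shallow listing
                pending.push(directories.get(entry).iterator());
            }
            else {
                files.add(entry);
            }
        }
        return files;
    }
}
```

This makes the S3 contrast concrete: the deque approach issues one listing call per "directory", whereas a prefix listing can enumerate every leaf key in a handful of calls regardless of nesting depth.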
Got it. I was previously fixated on the HiveFileIterator, but noticed now that the discussion is about the DirectoryLister and its concrete implementations.
This change is specifically targeted at
@pettyjamesm do you see any timeout-related issues with the recursive listing of a table containing a huge number of files?
This greatly depends on the level of nesting and how many files are at the "leaves" at each level, but anecdotally, if you have a directory layout like the one I mentioned and linked in the PR description for ELB logs (link), the difference can be enormous: on the order of queries going from taking 30+ minutes (traversing through layers and layers of listing calls and producing no splits for most of the query duration) to ~30 seconds (splits generated immediately and all files enumerated using orders of magnitude fewer list API calls).
No timeouts, although the current behavior without this change does increase the risk of throttling because of the extremely high call volume. Queries just take much, much longer when recursive listing is performed against S3 because of inefficient use of the API. As a worked example, assuming the ELB log layout above (
As discussed on Slack, I would find it beneficial to add a new test class in order to ensure the accuracy of the new implementation.
@findepi / @findinpath - can we now consider this ready to merge?
findinpath left a comment:
Code-wise this PR is great.
Functionality-wise it seems to be handling an exotic, not-so-generic use case that, without proper documentation, will not be spotted by Trino users.
Quite interesting.
Without hive.recursive-directories set to true, reading from such a table would not work at all.
@findepi, @alexjo2144 - is this use case too exotic for trino-hive?
The partitioned table + drop / recreate table unpartitioned is just a convenience for this test. The FileHiveMetastore implementation doesn't allow duplicate tables with the same name, and table names must correspond to their file system location, so I'm using the partitioned table to insert new records into sub-paths, dropping it (without deleting the data) and then creating a new table at the same path location using the same name.
This isn't a typical usage pattern, but it does work to exercise the recursive listing behaviors in combination with caching.
Instead of doing the low-level table dropping and creation, could we create two tables with the same external_location:
- for writing, use the partitioned configuration: partitioned_by = ARRAY['day', 'country']
- for reading, use the plain configuration without partitioning, on which we can use the recursive file listing superpower
This seems to me (although it screams like an anti-pattern) the way in which such a scenario could be implemented in the real world.
Please take this note with a grain of salt.
I tried to do something along those lines, but I couldn't get it to work very easily with the way the tests are set up, so I just went with this approach for the purposes of this test. It's not intended to represent a realistic usage pattern outside of this test scenario.
Adds support for directly listing all files recursively in directory listings and associated caches.
Removes a pre-emptive FileSystem#exists(Path) check before attempting to list path contents when absent partition errors are configured to be ignored. Instead, ignoring absent partitions can be handled by a check performed only in the case where partition listing actually fails. For file systems like S3, checking the existence of a "directory" already incurs an S3 listing call; these calls are relatively expensive in terms of API rate limits and latency.
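The pattern this commit describes can be sketched as follows. The Lister interface and all names here are hypothetical, and unchecked exceptions are used for brevity; the real code deals with Hadoop's FileSystem API and IOException.

```java
import java.util.List;

// Minimal sketch of the lazy existence check: attempt the listing directly
// instead of calling exists(path) up front (which on S3 itself costs a
// listing call), and only classify the failure afterwards.
public class LazyExistsCheck
{
    public interface Lister
    {
        List<String> list(String path); // throws an unchecked exception on failure

        boolean exists(String path);
    }

    public static List<String> listIgnoringAbsent(Lister fs, String path, boolean ignoreAbsentPartitions)
    {
        try {
            // Happy path: listing succeeds and no exists() call is ever made
            return fs.list(path);
        }
        catch (RuntimeException e) {
            // Only pay for the existence check after a failure, and only
            // when absent partitions are configured to be ignored
            if (ignoreAbsentPartitions && !fs.exists(path)) {
                return List.of();
            }
            throw e;
        }
    }
}
```

When the listing succeeds, the existence check is skipped entirely, which is exactly the saving the commit message describes.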
@findepi - this should be ready for final approval / merge
@arhimondr / @findepi this change is worthy of having release notes
Proposed release notes: Hive Connector:
In the future, could you please edit the original PR message to include release notes, so it's easier for me to find? I still got here, but it took me a few minutes.
Sure thing, sorry about that.
Description
This change includes three changes that are designed to improve the efficiency of the hive plugin DirectoryLister when used in combination with a FileSystem that supports more efficient recursive listing via FileSystem#listFiles(Path, boolean recursive), like the TrinoS3FileSystem.
- Extends the DirectoryLister interface with an additional method that requests a full recursive file listing instead of a shallow, "files and directories" listing. This necessarily required altering CachingDirectoryLister and TransactionScopeCachingDirectoryLister to be able to store both kinds of listings at the same path, which was done by adding DirectoryListingCacheKey, which distinguishes the two kinds of keys via the sign bit in their precomputed hash code.
- Changes HiveFileIterator to conditionally call DirectoryLister.listFilesRecursively when recursive behaviors are requested, instead of DirectoryLister.list. This greatly simplifies the implementation, since recursion and traversal down into sub-paths are now handled by the file system itself, but comes with the cost of identifying "hidden sub-paths" between the top-level path and child paths in a slightly more complex fashion.
- Changes HiveFileIterator to avoid eagerly checking FileSystem.exists when ignoreAbsentPartitions is true, instead waiting for the listing to throw an exception and then checking whether the exception was caused by the path not existing. This is especially useful for the TrinoS3FileSystem, since the implementation of FileSystem.exists itself performs an S3 listing operation.

Overall, this greatly reduces the number of S3 listing operations that will be performed when there is a large number of nested "directories", since we can instead fully enumerate all of the leaf S3 object paths when that's the desired behavior. For instance, the ELB access log S3 path structure described here generates a separate "directory" per region, per day - which can quickly result in many hundreds of S3 listing calls to recurse through the hierarchy before reaching the actual data files without this improvement.
This change bridges the gap between the BackgroundHiveSplitLoader and the optimized TrinoS3FileSystem#listFiles(Path, boolean recursive) implementation originally contributed in #4825.

Documentation
( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: