Introduce PathFilter support in DirectoryLister#13590
Introduce PathFilter support in DirectoryLister#13590bhasudha wants to merge 1 commit intoprestodb:masterfrom
Conversation
|
Hi bhasudha! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks! |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks! |
shixuan-fan
left a comment
There was a problem hiding this comment.
Looks good in general, some minor comments
presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
nit: extra line between methods.
I think we could just use the same interface and default other use cases to be Optional.empty(), but I also realized that this might break other private connectors. @wenleix , @arhimondr do you have a suggestion here?
presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
nit: How about making this Optional? Or even, can we create PathFilter here?
There was a problem hiding this comment.
We want to create a PathFilter instance per query. Thats why I thought creating inside BackgroundHiveSplitLoader would be apt. I made this Optional as you suggested. Let me know what you think.
presto-hive/src/main/java/com/facebook/presto/hive/util/HiveFileIterator.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/TestBackgroundHiveSplitLoader.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/TestBackgroundHiveSplitLoader.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveClientConfig.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/util/TestHiveFileIterator.java
Outdated
Show resolved
Hide resolved
|
Thanks for the detailed review @shixuan-fan . I have responded inline. |
06ea92e to
facc233
Compare
facc233 to
3a4d1c0
Compare
shixuan-fan
left a comment
There was a problem hiding this comment.
Looks good to me. I'm not a huge fan of reflection but I don't have a better suggestion here given we want to have different instances. I'll ask @arhimondr / @wenleix to take another pass. Thanks for the good work!
presto-hive/src/main/java/com/facebook/presto/hive/util/HiveFileIterator.java
Show resolved
Hide resolved
| private int fileCountForIgnoredPolicyPathFilterOff; | ||
| private int fileCountForRecursePolicyPathFilterOff; | ||
| private Configuration hadoopConf; | ||
| private Random random; |
There was a problem hiding this comment.
Use ThreadLocalRandom instead
| private Random random; | ||
| private static final String RANDOM_FILE_NAME_SALT_STRING = "abcdefghijklmnopqrstuvwxyz"; | ||
|
|
||
| @AfterClass |
There was a problem hiding this comment.
nit: put it after @BeforeClass
| private Random random; | ||
| private static final String RANDOM_FILE_NAME_SALT_STRING = "abcdefghijklmnopqrstuvwxyz"; | ||
|
|
||
| @AfterClass |
There was a problem hiding this comment.
Also please nullify the fields to avoid memory leaks
| } | ||
| } | ||
|
|
||
| private void delete(File file) |
There was a problem hiding this comment.
nit: put this method after it's usage
|
|
||
| public class TestHiveFileIterator | ||
| { | ||
| private NamenodeStats namenodeStats; |
There was a problem hiding this comment.
Are all this variables have to be on the class level? Can some of them be local?
| @Test | ||
| public void testPathFilterWithRecursion() | ||
| { | ||
| hiveFileIterator = new HiveFileIterator(rootPath, listDirectoryOperation, namenodeStats, RECURSE, Optional.of(pathFilter)); |
| { | ||
| hiveFileIterator = new HiveFileIterator(rootPath, listDirectoryOperation, namenodeStats, IGNORED, Optional.empty()); | ||
| int actualCount = getFileCount(hiveFileIterator); | ||
| assertEquals(actualCount, fileCountForIgnoredPolicyPathFilterOff); |
There was a problem hiding this comment.
Why fileCountForIgnoredPolicyPathFilterOff (and other similar) cannot be simply hardcoded?
There was a problem hiding this comment.
By introducing something like
hive.input-path-filter-classwe are explicitly encouraging jar drop-ins. This is something that we were historically trying to avoid (however it is still possible,e.g.: with storage formats). The "presto way" would be to extend thepresto-hivemodule, and have an interface inpresto-hivethat allows to extend this behaviour.@bhasudha What is your usecase? Do you use
presto-hiveconnector as is? Or do you have some wrapper on top of it?
@arhimondr I understand. We are using presto-hive-hadoop2 with drop-ins for custom storage format. The pathfilter class in this case will also be coming from Hudi jars- HoodieROTablePathFilter.
Like @shixuan-fan suggested let me try to introduce a PathFilterProvider interface.
There was a problem hiding this comment.
@bhasudha Did you consider having a presto-hive-hoodie connector wrapper in your proprietary codebase? This way you can verify jar dependency compatibility during the build time. Also you can inject whatever extensions you need without introducing runtime class resolutions that can be very error prone.
We use this very approach for the Facebook proprietary extension. You can have a look at DirectoryLister interface. When it doesn't make sense in the scope of presto-hive connector, it allows us to plug in our extensions.
There was a problem hiding this comment.
@arhimondr Wrapper connector is a good idea. I can try that. Just wanted to clarify that this is not proprietary. Presto users in Hudi open source community also face this issue today. Since we have already solved this at Uber by patching Presto, thought generalizing that would be useful for everyone.
I can refactor this change into a separate module called 'presto-hudi' connector and update the PR. Let me know if you are okay with that.
| List<File> dirs = createDirs(basePath, 2); | ||
|
|
||
| // create files in each subdir along with couple files and one nested directory | ||
| for (int i = 0; i < dirs.size(); i++) { |
There was a problem hiding this comment.
suggestion: This might be somehow hard to follow for someone who is not familiar with this test. What do you think about having 4 directories that are created in a straightforward way for each test case? And then hardcode the expected count. Also we can create them inside the test (per test), and not in BeforeClass
There was a problem hiding this comment.
sure. i ll refactor the test this way.
| } | ||
| try { | ||
| return fileStatusIterator.hasNext(); | ||
| while (fileStatusIterator.hasNext()) { |
There was a problem hiding this comment.
How about wrapping the FileStatusIterator in Iterators#filter from guava? So we don't have to implement filtering logic here?
There was a problem hiding this comment.
Hmmm. I tried doing this. It looks like this accepts only Iterator as the input parameter - https://github.com/google/guava/blob/49f5a6332a63737dff70cf77472f9867bc7ca6eb/guava/src/com/google/common/collect/Iterators.java#L629 However the ListDirectoryOperation.list(path) return RemoteIterator.
| { | ||
| Iterator<HiveFileInfo> list(FileSystem fileSystem, Path path, NamenodeStats namenodeStats, NestedDirectoryPolicy nestedDirectoryPolicy); | ||
|
|
||
| Iterator<HiveFileInfo> list(FileSystem fileSystem, Path path, NamenodeStats namenodeStats, NestedDirectoryPolicy nestedDirectoryPolicy, Optional<PathFilter> pathFilter); |
There was a problem hiding this comment.
Let's leave only a single method in the interface (remove the old one)
|
By introducing something like @bhasudha What is your usecase? Do you use |
|
Actually as I was thinking, is it possible to have a |
|
Closing this as redundant. PR - #13818 covers these changes. |
This PR fixes the issue described here - Support for PathFilter in DirectoryLister
If release note is NOT required, use: