Introduce PathFilter support in DirectoryLister #13590

Closed
bhasudha wants to merge 1 commit into prestodb:master from bhasudha:presto-hudi-optimizations

@bhasudha

This PR fixes the issue described here - Support for PathFilter in DirectoryLister

== RELEASE NOTES ==

General Changes
* ...
* ...

Hive Changes
* ...
* ...

If release note is NOT required, use:

== NO RELEASE NOTE ==

@facebook-github-bot
Collaborator

Hi bhasudha! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot
Collaborator

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

Contributor

@shixuan-fan left a comment

Looks good in general, some minor comments

Contributor

nit: extra line between methods.

I think we could just use the same interface and default other use cases to Optional.empty(), but I also realized that this might break other private connectors. @wenleix, @arhimondr, do you have a suggestion here?

Contributor

nit: How about making this Optional? Or even, can we create PathFilter here?

Author

We want to create a PathFilter instance per query. That's why I thought creating it inside BackgroundHiveSplitLoader would be apt. I made this Optional as you suggested. Let me know what you think.

@bhasudha
Author

Thanks for the detailed review, @shixuan-fan. I have responded inline.

@bhasudha force-pushed the presto-hudi-optimizations branch from 06ea92e to facc233 on October 29, 2019 17:58
@bhasudha force-pushed the presto-hudi-optimizations branch from facc233 to 3a4d1c0 on October 29, 2019 18:38
Contributor

@shixuan-fan left a comment

Looks good to me. I'm not a huge fan of reflection but I don't have a better suggestion here given we want to have different instances. I'll ask @arhimondr / @wenleix to take another pass. Thanks for the good work!

Member

@arhimondr left a comment

Some comments

private int fileCountForIgnoredPolicyPathFilterOff;
private int fileCountForRecursePolicyPathFilterOff;
private Configuration hadoopConf;
private Random random;
Member

Use ThreadLocalRandom instead

Author

sure
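
For illustration, the ThreadLocalRandom suggestion could look roughly like the sketch below. The class and method names (`RandomFileNames`, `randomFileName`) are hypothetical and not taken from the PR; only the salt alphabet mirrors the `RANDOM_FILE_NAME_SALT_STRING` constant quoted in the review. Using `ThreadLocalRandom.current()` at the call site removes the need for a `Random` field on the test class.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper class, not part of the PR.
final class RandomFileNames
{
    // Same alphabet as RANDOM_FILE_NAME_SALT_STRING in the test under review.
    private static final String SALT_CHARS = "abcdefghijklmnopqrstuvwxyz";

    private RandomFileNames() {}

    // Builds a random lowercase name of the requested length.
    static String randomFileName(int length)
    {
        // ThreadLocalRandom is obtained per call, so no Random field is needed.
        ThreadLocalRandom random = ThreadLocalRandom.current();
        StringBuilder name = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            name.append(SALT_CHARS.charAt(random.nextInt(SALT_CHARS.length())));
        }
        return name.toString();
    }
}
```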

private Random random;
private static final String RANDOM_FILE_NAME_SALT_STRING = "abcdefghijklmnopqrstuvwxyz";

@AfterClass
Member

nit: put it after @BeforeClass

Author

okay

private Random random;
private static final String RANDOM_FILE_NAME_SALT_STRING = "abcdefghijklmnopqrstuvwxyz";

@AfterClass
Member

Also please nullify the fields to avoid memory leaks

Author

will do.

}
}

private void delete(File file)
Member

nit: put this method after its usage

Author

sure


public class TestHiveFileIterator
{
private NamenodeStats namenodeStats;
Member

Do all these variables have to be at the class level? Can some of them be local?

Author

Yes, I'll refactor them.

@Test
public void testPathFilterWithRecursion()
{
hiveFileIterator = new HiveFileIterator(rootPath, listDirectoryOperation, namenodeStats, RECURSE, Optional.of(pathFilter));
Member

ditto

{
hiveFileIterator = new HiveFileIterator(rootPath, listDirectoryOperation, namenodeStats, IGNORED, Optional.empty());
int actualCount = getFileCount(hiveFileIterator);
assertEquals(actualCount, fileCountForIgnoredPolicyPathFilterOff);
Member

Why can't fileCountForIgnoredPolicyPathFilterOff (and the other similar fields) simply be hardcoded?

Author

By introducing something like hive.input-path-filter-class we are explicitly encouraging jar drop-ins. This is something that we have historically tried to avoid (although it is still possible, e.g. with storage formats). The "presto way" would be to extend the presto-hive module and have an interface in presto-hive that allows this behaviour to be extended.

@bhasudha What is your use case? Do you use the presto-hive connector as is? Or do you have some wrapper on top of it?

@arhimondr I understand. We are using presto-hive-hadoop2 with drop-ins for a custom storage format. The PathFilter class in this case will also come from the Hudi jars: HoodieROTablePathFilter.

As @shixuan-fan suggested, let me try to introduce a PathFilterProvider interface.

Member

@arhimondr Nov 1, 2019

@bhasudha Did you consider having a presto-hive-hoodie connector wrapper in your proprietary codebase? This way you can verify jar dependency compatibility at build time. You can also inject whatever extensions you need without introducing runtime class resolution, which can be very error-prone.

We use this very approach for the Facebook proprietary extension. You can have a look at the DirectoryLister interface. When something doesn't make sense within the scope of the presto-hive connector, it allows us to plug in our extensions.

Author

@arhimondr A wrapper connector is a good idea. I can try that. I just wanted to clarify that this is not proprietary. Presto users in the Hudi open source community also face this issue today. Since we have already solved this at Uber by patching Presto, I thought generalizing that would be useful for everyone.
I can refactor this change into a separate 'presto-hudi' connector module and update the PR. Let me know if you are okay with that.

Member

Sounds good

List<File> dirs = createDirs(basePath, 2);

// create files in each subdir along with couple files and one nested directory
for (int i = 0; i < dirs.size(); i++) {
Member

suggestion: This might be somewhat hard to follow for someone who is not familiar with this test. What do you think about having 4 directories that are created in a straightforward way, one for each test case, and then hardcoding the expected count? We could also create them inside the test (per test) rather than in BeforeClass.

Author

Sure, I'll refactor the test this way.
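
A minimal sketch of the per-test fixture idea, assuming plain `java.nio.file` as a stand-in for the Hadoop listing used by HiveFileIterator. The class `DirectoryFixtures` and its methods are hypothetical. The point is that a small, explicit layout lets the expected counts be written as literals.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical fixture helper, not part of the PR.
final class DirectoryFixtures
{
    private DirectoryFixtures() {}

    // Creates a fixed layout: root/a.txt, root/b.txt, root/nested/c.txt
    static Path createFixture()
    {
        try {
            Path root = Files.createTempDirectory("path-filter-test");
            Files.createFile(root.resolve("a.txt"));
            Files.createFile(root.resolve("b.txt"));
            Path nested = Files.createDirectory(root.resolve("nested"));
            Files.createFile(nested.resolve("c.txt"));
            return root;
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts regular files directly under root, or in the whole tree when recurse is true.
    static long countFiles(Path root, boolean recurse)
    {
        try (Stream<Path> paths = recurse ? Files.walk(root) : Files.list(root)) {
            return paths.filter(Files::isRegularFile).count();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout a test can assert 2 files for a non-recursive listing and 3 for a recursive one, without computing the counts during setup.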

}
try {
return fileStatusIterator.hasNext();
while (fileStatusIterator.hasNext()) {
Member

How about wrapping the FileStatusIterator in Iterators#filter from guava? So we don't have to implement filtering logic here?

Author

Hmm, I tried doing this. It looks like Iterators#filter accepts only an Iterator as its input parameter (see https://github.com/google/guava/blob/49f5a6332a63737dff70cf77472f9867bc7ca6eb/guava/src/com/google/common/collect/Iterators.java#L629). However, ListDirectoryOperation.list(path) returns a RemoteIterator.
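
One way around the type mismatch described above might be a small filtering wrapper that itself implements RemoteIterator. The sketch below is an assumption, not code from the PR: `RemoteIterator` is redeclared locally as a stand-in for `org.apache.hadoop.fs.RemoteIterator` so the example compiles on its own, and `FilteringRemoteIterator`, `fromIterator`, and `drain` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Local stand-in for org.apache.hadoop.fs.RemoteIterator (assumption for self-containment).
interface RemoteIterator<T>
{
    boolean hasNext() throws IOException;

    T next() throws IOException;
}

// Hypothetical wrapper: keeps one look-ahead element that already passed the filter.
final class FilteringRemoteIterator<T>
        implements RemoteIterator<T>
{
    private final RemoteIterator<T> delegate;
    private final Predicate<T> filter;
    private T buffered;
    private boolean hasBuffered;

    FilteringRemoteIterator(RemoteIterator<T> delegate, Predicate<T> filter)
    {
        this.delegate = delegate;
        this.filter = filter;
    }

    @Override
    public boolean hasNext() throws IOException
    {
        // Advance the delegate until an element passes the filter or it is exhausted.
        while (!hasBuffered && delegate.hasNext()) {
            T candidate = delegate.next();
            if (filter.test(candidate)) {
                buffered = candidate;
                hasBuffered = true;
            }
        }
        return hasBuffered;
    }

    @Override
    public T next() throws IOException
    {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = buffered;
        buffered = null;
        hasBuffered = false;
        return result;
    }

    // Adapts a plain Iterator, useful for trying the wrapper without a file system.
    static <T> RemoteIterator<T> fromIterator(Iterator<T> iterator)
    {
        return new RemoteIterator<T>()
        {
            @Override
            public boolean hasNext()
            {
                return iterator.hasNext();
            }

            @Override
            public T next()
            {
                return iterator.next();
            }
        };
    }

    // Collects all remaining elements, wrapping IOException for convenience.
    static <T> List<T> drain(RemoteIterator<T> iterator)
    {
        List<T> result = new ArrayList<>();
        try {
            while (iterator.hasNext()) {
                result.add(iterator.next());
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

The look-ahead buffer keeps `hasNext()` safe to call repeatedly, which is the part that is easy to get wrong when filtering is written inline.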

{
Iterator<HiveFileInfo> list(FileSystem fileSystem, Path path, NamenodeStats namenodeStats, NestedDirectoryPolicy nestedDirectoryPolicy);

Iterator<HiveFileInfo> list(FileSystem fileSystem, Path path, NamenodeStats namenodeStats, NestedDirectoryPolicy nestedDirectoryPolicy, Optional<PathFilter> pathFilter);
Member

Let's leave only a single method in the interface (remove the old one)

Author

sure

@arhimondr
Member

By introducing something like hive.input-path-filter-class we are explicitly encouraging jar drop-ins. This is something that we have historically tried to avoid (although it is still possible, e.g. with storage formats). The "presto way" would be to extend the presto-hive module and have an interface in presto-hive that allows this behaviour to be extended.

@bhasudha What is your use case? Do you use the presto-hive connector as is? Or do you have some wrapper on top of it?

@shixuan-fan
Contributor

shixuan-fan commented Oct 30, 2019

Actually, as I was thinking about it: is it possible to have a PathFilterProvider interface so that the internal use case could provide this, with the default behavior being to provide Optional.empty()?
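
A rough sketch of what such a provider could look like, under stated assumptions. Only the name PathFilterProvider comes from the suggestion above; the method `getPathFilter`, the default implementation `NoOpPathFilterProvider`, and the helper `PathFilterProviders.shouldList` are hypothetical. `PathFilter` is redeclared locally with a String argument as a stand-in for `org.apache.hadoop.fs.PathFilter`, which accepts a Hadoop `Path`.

```java
import java.util.Optional;

// Local stand-in for org.apache.hadoop.fs.PathFilter (assumption; the real one takes a Path).
interface PathFilter
{
    boolean accept(String path);
}

// Hypothetical shape of the suggested interface.
interface PathFilterProvider
{
    // Called once per query so each query can get its own filter instance.
    Optional<PathFilter> getPathFilter();
}

// Default behavior: no filter, so existing connectors see no change.
final class NoOpPathFilterProvider
        implements PathFilterProvider
{
    @Override
    public Optional<PathFilter> getPathFilter()
    {
        return Optional.empty();
    }
}

final class PathFilterProviders
{
    private PathFilterProviders() {}

    // A path is listed when no filter is configured or when the filter accepts it.
    static boolean shouldList(PathFilterProvider provider, String path)
    {
        return provider.getPathFilter()
                .map(filter -> filter.accept(path))
                .orElse(true);
    }
}
```

Binding the provider through the connector's own wiring, rather than resolving a class name from configuration at runtime, would also avoid the reflection that was questioned earlier in the review.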

@bhasudha
Author

Closing this as redundant. PR #13818 covers these changes.

@bhasudha closed this Feb 11, 2020