Support listing of Hudi files through its metadata #1
vinothchandar
left a comment
One high-level concern. Seems simple enough.
Maybe just hive.use.hudi.metadata.to.list.files?
I don't think the community will accept this, as the convention they follow is separation by -. In addition, I used the word "prefer" so that it is along the lines of https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java#L1530
Same as above. This is not the naming convention the Presto community follows.
Can we add a bit more guidance on when this should be turned on and off?
Added more details in the description.
So this will do one additional RPC per base file? If so, wouldn't this actually be bad for listing performance?
@umehrot2 what I meant was if we can simply do something like
LocatedFileStatus hoodieFileStatus = new LocatedFileStatus(fileStatus,
new BlockLocation[] {new BlockLocation(name, host, 0, file.getLen())});
If so, would this work for HDFS?
@umehrot2 do you have an easy test environment for HDFS?
Uber and Facebook both use HDFS a lot, so this will come up in the upstream review for sure.
Yeah, I will make that change and do some testing on an EMR cluster. EMR clusters all have HDFS, so I should be able to test.
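To make the trade-off discussed in this thread concrete, here is a minimal, self-contained sketch comparing a per-file location lookup with attaching one synthetic block that spans the whole file. Every type in it (FileInfo, Block, LocatedFile, CountingLocator) is a hypothetical stand-in and not the real Hadoop FileStatus, BlockLocation, or LocatedFileStatus; the host names are made up, and the lookup counter only simulates RPCs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT the real Hadoop classes; they carry just
// enough state to compare the two approaches from the review thread.
final class SyntheticBlockLocationSketch {

    static final class FileInfo {
        final String path;
        final long length;
        FileInfo(String path, long length) { this.path = path; this.length = length; }
    }

    static final class Block {
        final String host;
        final long offset;
        final long length;
        Block(String host, long offset, long length) {
            this.host = host;
            this.offset = offset;
            this.length = length;
        }
    }

    static final class LocatedFile {
        final FileInfo file;
        final List<Block> blocks;
        LocatedFile(FileInfo file, List<Block> blocks) { this.file = file; this.blocks = blocks; }
    }

    // Simulated storage client that counts one round trip per location lookup.
    static final class CountingLocator {
        int rpcCalls = 0;
        List<Block> locate(FileInfo file) {
            rpcCalls++;
            List<Block> blocks = new ArrayList<>();
            blocks.add(new Block("datanode-1", 0, file.length));
            return blocks;
        }
    }

    // Approach A: one location lookup (one simulated RPC) per base file.
    static List<LocatedFile> locatePerFile(List<FileInfo> files, CountingLocator locator) {
        List<LocatedFile> out = new ArrayList<>();
        for (FileInfo f : files) {
            out.add(new LocatedFile(f, locator.locate(f)));
        }
        return out;
    }

    // Approach B: attach a single synthetic block covering [0, length), with no lookups.
    static List<LocatedFile> locateSynthetic(List<FileInfo> files, String host) {
        List<LocatedFile> out = new ArrayList<>();
        for (FileInfo f : files) {
            List<Block> blocks = new ArrayList<>();
            blocks.add(new Block(host, 0, f.length));
            out.add(new LocatedFile(f, blocks));
        }
        return out;
    }

    public static void main(String[] args) {
        List<FileInfo> files = new ArrayList<>();
        files.add(new FileInfo("part=1/a.parquet", 128L));
        files.add(new FileInfo("part=1/b.parquet", 256L));

        CountingLocator locator = new CountingLocator();
        locatePerFile(files, locator);
        System.out.println("per-file lookups: " + locator.rpcCalls);

        List<LocatedFile> synthetic = locateSynthetic(files, "localhost");
        System.out.println("files located without lookups: " + synthetic.size());
    }
}
```

In this sketch the per-file approach costs one lookup for each base file, while the synthetic approach costs none. The synthetic block carries no real placement information, which is likely why the thread asks for the change to be verified on HDFS.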
Scale Testing
The patch has been tested on a 1.5 TB Hudi table. In general, the patch offers much better performance, for two reasons:
Based on my investigate Even from I did the same testing with HDFS too.
Concerns regarding caching
In addition, the concern we had about caching the file system view is not really a concern in the case of Presto. Unlike in Spark, where the input format is used directly to do the file listing, here we use the file system view directly to do the listing for us. The directory lister is instantiated only once per thread that loads the split. By default this concurrency is
This implements changes to support fetching the list of Hudi files through the metadata table maintained inside Hudi. This is part of the feature described in https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements where we are adding support for maintaining file listing metadata within Hudi tables for faster listings, especially when working with S3.
This is based on apache/hudi#2326, where I introduced a new createInMemoryFileSystemView method inside FileSystemViewManager, which returns a view according to the user's configuration, depending on whether they want to list using the metadata or not.
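
The configuration-driven choice of view described above can be illustrated roughly as follows. This is only a sketch under stated assumptions: ListingView, MetadataBackedView, DirectListingView, and createView are hypothetical names, the maps stand in for the metadata table and the underlying storage, and none of this reproduces the actual signature of createInMemoryFileSystemView or any other Hudi API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is not Hudi's FileSystemViewManager API.
final class ViewSelectionSketch {

    interface ListingView {
        List<String> listFiles(String partition);
        String source();
    }

    // Serves listings from an in-memory index, standing in for the metadata table.
    static final class MetadataBackedView implements ListingView {
        private final Map<String, List<String>> index;
        MetadataBackedView(Map<String, List<String>> index) { this.index = index; }
        public List<String> listFiles(String partition) {
            List<String> files = index.get(partition);
            return files == null ? new ArrayList<String>() : files;
        }
        public String source() { return "metadata"; }
    }

    // Stands in for listing the underlying storage (e.g. S3 or HDFS) directly.
    static final class DirectListingView implements ListingView {
        private final Map<String, List<String>> storage;
        DirectListingView(Map<String, List<String>> storage) { this.storage = storage; }
        public List<String> listFiles(String partition) {
            List<String> files = storage.get(partition);
            return files == null ? new ArrayList<String>() : files;
        }
        public String source() { return "filesystem"; }
    }

    // The caller's configuration flag decides which view backs the listing.
    static ListingView createView(boolean useMetadataTable,
                                  Map<String, List<String>> metadataIndex,
                                  Map<String, List<String>> storage) {
        if (useMetadataTable) {
            return new MetadataBackedView(metadataIndex);
        }
        return new DirectListingView(storage);
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new HashMap<>();
        files.put("2020/12/01", Arrays.asList("a.parquet", "b.parquet"));

        ListingView view = createView(true, files, files);
        System.out.println(view.source() + " -> " + view.listFiles("2020/12/01"));
    }
}
```

Callers depend only on the view interface, so switching between metadata-based and direct listing stays a configuration decision rather than a code change at each call site.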