[HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests #3719
I am a little curious about why the requests in |
I think the metadata table feature lets Hudi query an internal Hudi MOR table to obtain the file listing instead of listing the files in each partition, which improves listing performance. But as we know, querying a Hudi table takes more requests than a single list action. Maybe this can explain
As for the performance improvement from the Hudi metadata table, I got the following results (based on an S3 Hudi table with 240 partitions and 2400 data files).
By the way, thanks for your attention @leesf :)
fsView = FileSystemViewManager.createInMemoryFileSystemView(engineContext,
    metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf()));
fsView = hoodieTableFileSystemViewCache.get(baseDir.toString());
Here the fsView is never updated once it is put into the cache. Even when the writer commits a new file, that file will not appear in the fsView. I think this may lead to wrong results in Flink streaming read, right? cc @danny0405
Flink streaming reader does not use this code path.
It is in the Flink incremental read code path. Is it still a problem?
Thanks for your review.
As we can see, HoodieROTablePathFilter already creates and caches a HoodieTableMetaClient at the baseDir level, and it also calls setLoadActiveTimelineOnLoad(true), which creates a single active timeline that is then reused.
So IMO, whether or not we cache the fsView, newly created files will not appear in the current HoodieROTablePathFilter.
Now we cache the fsView built from that cached meta client and its cached active timeline. This should have no bad effect, and it avoids unnecessary initialization.
Feel free to correct me if I am wrong :)
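The argument above can be illustrated with a minimal sketch. It does not use Hudi's classes: `Storage`, `MetaClient`, and `FsView` are hypothetical stand-ins, and the assumption is that a view only exposes files whose commit appears in the timeline snapshot held by the cached meta client. Under that assumption, a view built fresh after a new commit sees the same files as a view cached before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StaleTimelineSketch {
    /** Hypothetical stand-in for table storage: each file is tagged with the commit that wrote it. */
    static final class Storage {
        final List<String[]> files = new ArrayList<>(); // each entry is {fileName, commitTime}
        final Set<String> completedCommits = new HashSet<>();

        void commit(String commitTime, String fileName) {
            files.add(new String[] {fileName, commitTime});
            completedCommits.add(commitTime);
        }
    }

    /** Hypothetical stand-in for a cached meta client: snapshots the completed commits once. */
    static final class MetaClient {
        final Set<String> timelineSnapshot;

        MetaClient(Storage storage) {
            this.timelineSnapshot = new HashSet<>(storage.completedCommits);
        }
    }

    /** Hypothetical stand-in for a file system view: keeps only files from commits in the snapshot. */
    static final class FsView {
        final List<String> visibleFiles = new ArrayList<>();

        FsView(Storage storage, MetaClient metaClient) {
            for (String[] file : storage.files) {
                if (metaClient.timelineSnapshot.contains(file[1])) {
                    visibleFiles.add(file[0]);
                }
            }
        }
    }

    /** Returns the files seen by a view cached before a new commit and by a view built after it. */
    static List<List<String>> demo() {
        Storage storage = new Storage();
        storage.commit("001", "a.parquet");

        MetaClient cachedMetaClient = new MetaClient(storage);      // timeline loaded once
        FsView cachedView = new FsView(storage, cachedMetaClient);  // view cached before the new commit

        storage.commit("002", "b.parquet");                         // writer commits a new file

        FsView freshView = new FsView(storage, cachedMetaClient);   // rebuilt, but on the stale timeline
        return Arrays.asList(cachedView.visibleFiles, freshView.visibleFiles);
    }

    public static void main(String[] args) {
        List<List<String>> result = demo();
        System.out.println("cached view: " + result.get(0));
        System.out.println("fresh view:  " + result.get(1));
    }
}
```

In this model both views list only `a.parquet`, so caching the view does not make the filter any staler than the cached meta client already does. Whether that holds for every code path in Hudi is exactly what the reviewers are asking.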
fsView = hoodieTableFileSystemViewCache.get(baseDir.toString());
if (null == fsView) {
  fsView = FileSystemViewManager.createInMemoryFileSystemView(engineContext, metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf()));
  hoodieTableFileSystemViewCache.put(baseDir.toString(), fsView);
Use computeIfAbsent to simplify the logic.
Sure thing. Changed.
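For reference, the get / null-check / put sequence collapses into a single `Map.computeIfAbsent` call. The sketch below is not the PR's code: `FsView` is a hypothetical stand-in for the real view, the use of `ConcurrentHashMap` is an assumption, and the `viewsCreated` counter exists only to show that the factory runs once per base directory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class FsViewCacheSketch {
    /** Hypothetical stand-in for an expensive-to-build file system view. */
    static final class FsView {
        final String baseDir;

        FsView(String baseDir) {
            this.baseDir = baseDir;
        }
    }

    // One view per base directory; ConcurrentHashMap is an assumption made for this sketch.
    private final Map<String, FsView> viewCache = new ConcurrentHashMap<>();

    // Counts how many times the factory actually ran (illustration only).
    final AtomicInteger viewsCreated = new AtomicInteger();

    /** Returns the cached view for baseDir, building it only on the first request. */
    FsView getView(String baseDir) {
        return viewCache.computeIfAbsent(baseDir, dir -> {
            viewsCreated.incrementAndGet();
            return new FsView(dir);
        });
    }

    public static void main(String[] args) {
        FsViewCacheSketch sketch = new FsViewCacheSketch();
        sketch.getView("/tables/t1");
        sketch.getView("/tables/t1"); // served from the cache
        sketch.getView("/tables/t2");
        System.out.println("views created: " + sketch.viewsCreated.get());
    }
}
```

With a `ConcurrentHashMap`, `computeIfAbsent` applies the mapping function at most once per key, which also removes the check-then-act race in the explicit get / put version.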
} finally {
  if (fsView != null) {
    fsView.close();
  }
How should the cached views be closed?
fsView.close() does the following:

    @Override
    public void close() {
      closed = true;
      super.reset();
      partitionToFileGroupsMap = null;
      fgIdToPendingCompaction = null;
      fgIdToBootstrapBaseFile = null;
      fgIdToReplaceInstants = null;
    }

Because we are now reusing the fsView, we can't close it after each call. That said, we only create one fsView per baseDir, so this probably will not cause a memory leak.
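A small sketch of why a shared view cannot be closed per call. `FsView` here is a hypothetical stand-in whose `close()` nulls its internal map, loosely mirroring the `close()` body quoted above; how the real view behaves when used after `close()` is not shown in this thread, so the failure mode below is an assumption of the model.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ClosedViewSketch {
    /** Hypothetical stand-in for a view whose close() drops its internal state. */
    static final class FsView {
        private Map<String, List<String>> partitionToFiles = new HashMap<>();

        FsView() {
            partitionToFiles.put("2021/09/01", Arrays.asList("a.parquet"));
        }

        List<String> filesIn(String partition) {
            return partitionToFiles.get(partition);
        }

        void close() {
            partitionToFiles = null; // state is gone, but the cache still holds this object
        }
    }

    /** Returns true if reusing the cached view after a per-call close() fails. */
    static boolean reuseAfterCloseFails() {
        Map<String, FsView> cache = new HashMap<>();

        FsView view = cache.computeIfAbsent("/table", dir -> new FsView());
        view.filesIn("2021/09/01"); // first call works
        view.close();               // what a per-call finally block would do

        try {
            cache.get("/table").filesIn("2021/09/01"); // second call hits the closed view
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("reuse after close fails: " + reuseAfterCloseFails());
    }
}
```

The trade-off raised in the review follows from this: once views are shared through the cache, the per-call `close()` has to go, and each cached view lives as long as the cache does.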
Hi @vinothchandar, I noticed that you are the author of […]. I would really appreciate it if you could give me a hand :)
leesf
left a comment
@vinothchandar do you have time to take a final pass here?
@zhangyue19921010: regarding your comment on the perf difference (enabling vs. disabling metadata), was it a read query that you benchmarked? If yes, did you also enable metadata when you queried? CC @xushiyan
@nsivabalan Thanks a lot for your review. Yep, I used a query with metadata enabled on the query side for the benchmark.
Actually, this patch is very important for legacy MapReduce. Without it, HoodieROTablePathFilter will be thousands of times slower.
…SystemView, aiming to reduce unnecessary list/get requests (apache#3719)
What is the purpose of the pull request
Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests.
Cache HoodieTableFileSystemView at the baseDir level, the same way HoodieTableMetaClient is already cached.
Here is the test result, based on an S3 Hudi table with 240 partitions and 2400 data files.
I also verified the query results, and they are correct.