Skip to content

Conversation

@zhangyue19921010
Copy link
Contributor

@zhangyue19921010 zhangyue19921010 commented Sep 26, 2021

What is the purpose of the pull request

Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests.

Cache HoodieTableFileSystemView at baseDir level

The same as HoodieTableMetaClient

Here is the test result based on S3 hudi table
240 partitions and 2400 data files.

Condition All Requests Get Requests List Requests 4xx errors
Disable metadata table 17280 14144 1322 1568
Disable metadata table + this patch tuning 11495 9839 796 620
Condition All Requests Get Requests List Requests 4xx errors
Enable metadata table 21024 13980 3198 3438
Enable metadata table + this patch tuning 13399 10485 1093 1298

I also verify the query result, and it works fine.

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot
Copy link
Collaborator

hudi-bot commented Sep 26, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@leesf
Copy link
Contributor

leesf commented Sep 26, 2021

I am a little curious about why the requests in Enable metadata table is larger than Disable metadata table, and also what the query response time difference between enable/disable metadata table?

@zhangyue19921010
Copy link
Contributor Author

zhangyue19921010 commented Sep 27, 2021

I am a little curious about why the requests in Enable metadata table is larger than Disable metadata table, and also what the query response time difference between enable/disable metadata table?

I think metadata table feature let hudi query internal hudi MOR table to do file list instead of listing the files in partitions so that can improve list performance. But as we know a query-hudi-table-action need more requests compared a single list action. Maybe this can explain why the requests in Enable metadata table is larger than Disable metadata table

@zhangyue19921010
Copy link
Contributor Author

As for performance improved because of hudi metadata table, I got the follow answers:

Condition Processing Time
Disable metadata table 143s
Enable metadata table 120s

based on S3 hudi table 240 partitions and 2400 data files.

@zhangyue19921010
Copy link
Contributor Author

By the way, thanks for your attention @leesf :)


fsView = FileSystemViewManager.createInMemoryFileSystemView(engineContext,
metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf()));
fsView = hoodieTableFileSystemViewCache.get(baseDir.toString());
Copy link
Contributor

@leesf leesf Sep 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here the fsView would never get updated once put into cache, even the writer commits new file, the new file will not appears in the fsView, I think it may lead to wrong results in flink streaming read, right? cc @danny0405

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flink streaming reader does not use this code path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is in flink incremental read code path. still a problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review.
As we can see HoodieROTablePathFilter already create and cache HoodieTableMetaClient at baseDir level, also setLoadActiveTimelineOnLoad(true) which will create an active timeline in singleton mode.
So that IMO no matter we cache the fsView or not, any new created files will not appear in current hoodieROTablePathFilter.

Now we cached the fsView using above cached meta client and cached active timeline. Maybe can have no bad effect but can reduce unnecessary init action.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to correct me if I am wrong :)

fsView = hoodieTableFileSystemViewCache.get(baseDir.toString());
if (null == fsView) {
fsView = FileSystemViewManager.createInMemoryFileSystemView(engineContext, metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf()));
hoodieTableFileSystemViewCache.put(baseDir.toString(), fsView);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use computeIfAbsent to simplify the logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. Changed.

} finally {
if (fsView != null) {
fsView.close();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should the cached views been closed ?

Copy link
Contributor Author

@zhangyue19921010 zhangyue19921010 Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsView.close() will do

  @Override
  public void close() {
    closed = true;
    super.reset();
    partitionToFileGroupsMap = null;
    fgIdToPendingCompaction = null;
    fgIdToBootstrapBaseFile = null;
    fgIdToReplaceInstants = null;
  }

Because of we recycling fsView, we can't close it. Although we only create one fsView for each baseDir. And will cause no memory leak maybe.

@zhangyue19921010
Copy link
Contributor Author

Hi @vinothchandar I noticed that you are the author of HoodieROTableFilter.java. Would you mind take a look at your convince?

Very appreciate it if you could give me a hand :)

Copy link
Contributor

@leesf leesf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinothchandar do you have time to take a final pass here.

@nsivabalan
Copy link
Contributor

@zhangyue19921010 : wrt your comment on perf difference (enabling and disabling metadata), was it a read query that you benchmarked? If yes, did you also enable metadata when you queried? CC @xushiyan

@nsivabalan nsivabalan self-assigned this Oct 12, 2021
@zhangyue19921010
Copy link
Contributor Author

@zhangyue19921010 : wrt your comment on perf difference (enabling and disabling metadata), was it a read query that you benchmarked? If yes, did you also enable metadata when you queried? CC @xushiyan

@nsivabalan Thanks a lot for your review. Yep, I use a query with meta-data enabled at query side to do the benchmark.

@zhuoluoy
Copy link

Actually, for legacy MapReduce. This patch is very important. Without this patch, HoodiROTablePathFilter will be thousands times slower.

southernriver pushed a commit to southernriver/hudi that referenced this pull request Mar 31, 2023
…SystemView, aiming to reduce unnecessary list/get requests (apache#3719)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants