@minihippo
Contributor

What is the purpose of the pull request

Optimizes bootstrap when using the Flink bucket index. Instead of loading all partitions up front, load and cache the file group info of a partition only when records belonging to that partition are processed.
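A minimal sketch of the lazy, per-partition bootstrap described above, not the actual `BucketStreamWriteFunction` code. The class `LazyBucketIndex`, the `partitionLoader` function, and `getLoadCount` are hypothetical names introduced for illustration; only `bootstrapIndexIfNeed` and the `bucketIndex` map shape mirror identifiers visible in this PR's diff.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: loads bucket -> fileId mappings for a partition on first use. */
class LazyBucketIndex {
    // partition path -> (bucket id -> file group id)
    private final Map<String, Map<Integer, String>> bucketIndex = new HashMap<>();
    // Stand-in for scanning the file groups of one partition (assumed, not a Hudi API).
    private final Function<String, Map<Integer, String>> partitionLoader;
    // Counts how many partitions were actually loaded, for demonstration only.
    private int loadCount = 0;

    LazyBucketIndex(Function<String, Map<Integer, String>> partitionLoader) {
        this.partitionLoader = partitionLoader;
    }

    /** Bootstraps the partition only if it has not been loaded before. */
    private void bootstrapIndexIfNeed(String partition) {
        if (bucketIndex.containsKey(partition)) {
            return; // already cached, nothing to load
        }
        loadCount++;
        bucketIndex.put(partition, new HashMap<>(partitionLoader.apply(partition)));
    }

    /** Returns the cached file id for the bucket, or null if the bucket has no file group yet. */
    String lookupFileId(String partition, int bucketId) {
        bootstrapIndexIfNeed(partition);
        return bucketIndex.get(partition).get(bucketId);
    }

    int getLoadCount() {
        return loadCount;
    }
}
```

With this shape, a partition that never receives records is never scanned, and repeated lookups in the same partition reuse the cached map.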

Brief change log

  • BucketStreamWriteFunction

Verify this pull request

This pull request is already covered by existing tests.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@minihippo
Contributor Author

@garyli1019 An improvement for HUDI-3315, please take a look.

@garyli1019 garyli1019 self-assigned this Mar 22, 2022
Member

@garyli1019 garyli1019 left a comment

LGTM, left a minor comment.


if (bucketIndex.containsKey(partitionBucketId)) {
location = new HoodieRecordLocation("U", bucketIndex.get(partitionBucketId));
if (incBucketIndex.contains(partitionBucketId)) {
Member

nice catch, a bug fixed here

@garyli1019
Member

@hudi-bot run azure

@garyli1019
Member

@hudi-bot run azure

@garyli1019
Member

@minihippo could you resolve the merge conflict?

@garyli1019
Member

@hudi-bot run azure


bootstrapIndexIfNeed(partition);
Map<Integer, String> bucketToFileIdMap = bucketIndex.get(partition);
final int bucketNum = BucketIdentifier.getBucketId(hoodieKey, indexKeyFields, this.bucketNum);
Contributor

.get(partition) -> computeIfAbsent(partition, p -> new HashMap<>())
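To illustrate why this suggestion matters, here is a small sketch assuming the index is a nested `Map<String, Map<Integer, String>>` as in the diff context. The class `BucketMapLookup` and its method names are hypothetical. A plain `get` returns `null` for a partition with no entry yet, while `computeIfAbsent` installs and returns an empty map, so later bucket lookups and inserts cannot hit a `NullPointerException`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder used only to contrast get() with computeIfAbsent(). */
class BucketMapLookup {
    // partition path -> (bucket id -> file group id)
    private final Map<String, Map<Integer, String>> bucketIndex = new HashMap<>();

    /** Plain lookup: returns null when the partition has no entry yet. */
    Map<Integer, String> rawGet(String partition) {
        return bucketIndex.get(partition);
    }

    /** Null-safe lookup: creates and stores an empty map on first access. */
    Map<Integer, String> bucketsOf(String partition) {
        return bucketIndex.computeIfAbsent(partition, p -> new HashMap<>());
    }
}
```

The map returned by `bucketsOf` is the stored instance, so entries added through it are visible on the next lookup for the same partition.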

@danny0405
Contributor

Thanks for the fix. I have prepared a minor fix patch; could you apply it? Thanks.
3539_fix.patch.zip

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@garyli1019 garyli1019 merged commit 2e2d08c into apache:master Mar 28, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…he#5093)

* [HUDI-3539] Flink bucket index bucketID bootstrap optimization.

Co-authored-by: gengxiaoyu <[email protected]>
4 participants