[HUDI-4777] Fix flink gen bucket index of mor table not consistent with spark lead to duplicate bucket issue #6593

TJX2014 · 2022-09-05T07:34:46Z

Change Logs

Make hudi-flink generate a CreateHandle for a MOR table when the base file for the bucket does not exist yet.
Enable the deduplicate function for MOR tables.

Impact

The duplicate-bucket issue comes from hudi-flink MOR tables: Flink appends to log files first, and until compaction runs the bucket number does not appear in any base file.
When Spark calls loadPartitionBucketIdFileIdMapping of org.apache.hudi.index.bucket.HoodieSimpleBucketIndex, it does not find the bucket number written by hudi-flink, so it generates a new file ID for that bucket, which is not consistent with the one hudi-flink wrote.
After this change, when hudi-flink writes a MOR table with the bucket index, it first writes a base parquet file after deduplication; if the base file already exists, it switches to writing log files. Following the Spark approach seems more stable for MOR tables.
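
To illustrate the mismatch described above, here is a minimal, self-contained sketch. It is not Hudi's implementation: the `Slice` class, `loadMapping`, `resolveFileId`, and the `includeLogOnly` flag are hypothetical names used only for illustration, and the sketch assumes that a file ID begins with a zero-padded 8-digit bucket number. It shows how a bucket-to-file-ID mapping built only from slices that have a base file can miss a log-only file group, so a second file ID gets generated for the same bucket.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class BucketMappingSketch {

    // Hypothetical stand-in for a file slice; this is not Hudi's FileSlice class.
    static final class Slice {
        final String fileId;
        final boolean hasBaseFile;

        Slice(String fileId, boolean hasBaseFile) {
            this.fileId = fileId;
            this.hasBaseFile = hasBaseFile;
        }
    }

    // Assumption: the first 8 characters of a file ID are the zero-padded bucket number.
    static int bucketIdOf(String fileId) {
        return Integer.parseInt(fileId.substring(0, 8));
    }

    // Builds bucket -> file ID. With includeLogOnly=false, slices without a
    // base file are skipped, which models the lookup that misses Flink's writes.
    static Map<Integer, String> loadMapping(List<Slice> slices, boolean includeLogOnly) {
        Map<Integer, String> mapping = new HashMap<>();
        for (Slice slice : slices) {
            if (slice.hasBaseFile || includeLogOnly) {
                mapping.put(bucketIdOf(slice.fileId), slice.fileId);
            }
        }
        return mapping;
    }

    // Reuses the known file ID for a bucket, or generates a fresh one if none is known.
    static String resolveFileId(Map<Integer, String> mapping, int bucketId) {
        String existing = mapping.get(bucketId);
        if (existing != null) {
            return existing;
        }
        return String.format("%08d", bucketId) + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        List<Slice> slices = Arrays.asList(
            new Slice("00000003-written-by-flink", false), // log files only, not compacted yet
            new Slice("00000005-written-by-spark", true)); // has a base file

        Map<Integer, String> baseOnly = loadMapping(slices, false);
        Map<Integer, String> withLogs = loadMapping(slices, true);

        // Bucket 3 is invisible to the base-only lookup, so a second file ID appears.
        System.out.println("base-only lookup, bucket 3: " + resolveFileId(baseOnly, 3));
        System.out.println("log-aware lookup, bucket 3: " + resolveFileId(withLogs, 3));
    }
}
```

In this sketch, writing a base file first (the approach taken in this PR) corresponds to every slice having `hasBaseFile` set to true, so the base-only lookup already sees every bucket.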

Risk level: none | low | medium | high
none

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

minihippo · 2022-09-05T07:58:14Z

If Spark also included file slices that contain only log files when loading the latest file slices, wouldn't that solve the problem as well?

TJX2014 · 2022-09-05T08:06:36Z

If Spark also included file slices that contain only log files when loading the latest file slices, wouldn't that solve the problem as well?

It seems Spark does not need to include the log files, since they are merged into the base file.

TJX2014 · 2022-09-05T08:16:02Z

If Spark also included file slices that contain only log files when loading the latest file slices, wouldn't that solve the problem as well?

Sorry, I closed this by mistake. Please see #6595.

minihippo · 2022-09-05T09:08:45Z

The PR can be reopened :)

t

3fa2713

TJX2014 closed this Sep 5, 2022

hudi-bot mentioned this pull request Dec 9, 2025

Flink gen bucket index of mor table not consistent with spark lead to duplicate bucket issue #14582

Open
