
Conversation

TJX2014 (Contributor) commented Sep 5, 2022

Change Logs

Make hudi-flink for MOR tables also generate a CreateHandle when the base file of a bucket does not exist.
Enable the deduplicate function for MOR tables.

Impact

The duplicate issue comes from the hudi-flink MOR table, which appends to log files first without compacting right away, so the bucket number is not present in any base file.
When Spark uses loadPartitionBucketIdFileIdMapping of org.apache.hudi.index.bucket.HoodieSimpleBucketIndex, it cannot find the bucket number written by hudi-flink, so it generates a new one that is inconsistent with hudi-flink's.
After this change, when hudi-flink writes a MOR table using the bucket index, it first tries to write a base parquet file after deduplication; if the base file already exists, it switches to writing a log file. Following the Spark way seems more stable for MOR tables.
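To make the mismatch concrete, here is a minimal, self-contained Java sketch (the class, method names, and file-id format are illustrative stand-ins, not Hudi's actual API): loading the bucketId-fileId mapping from base files alone misses buckets that so far exist only as log files, which is exactly the Flink log-first case described above.

```java
import java.util.*;

// Simplified sketch of the bucketId -> fileId mapping issue. The file-id
// format (zero-padded bucket id prefix) is an illustrative stand-in for
// what the bucket index encodes, not Hudi's exact layout.
public class BucketMappingSketch {

    // Parse the bucket id from a file id like "00000003-some-uuid".
    public static int bucketIdFromFileId(String fileId) {
        return Integer.parseInt(fileId.substring(0, 8));
    }

    // Behavior described above: build the mapping from base files only,
    // so log-only buckets written by Flink are invisible to Spark.
    public static Map<Integer, String> loadFromBaseFilesOnly(List<String> baseFileIds) {
        Map<Integer, String> mapping = new HashMap<>();
        for (String f : baseFileIds) {
            mapping.put(bucketIdFromFileId(f), f);
        }
        return mapping;
    }

    // The alternative discussed in this thread: also consider log-only
    // file slices, so both writers agree on the bucket -> fileId mapping.
    public static Map<Integer, String> loadFromAllFileSlices(List<String> baseFileIds,
                                                             List<String> logFileIds) {
        Map<Integer, String> mapping = loadFromBaseFilesOnly(baseFileIds);
        for (String f : logFileIds) {
            mapping.putIfAbsent(bucketIdFromFileId(f), f);
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String> base = Arrays.asList("00000000-aaaa");
        List<String> logs = Arrays.asList("00000001-bbbb"); // Flink wrote the log first
        System.out.println(loadFromBaseFilesOnly(base).containsKey(1));       // false
        System.out.println(loadFromAllFileSlices(base, logs).containsKey(1)); // true
    }
}
```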

Risk level: none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

minihippo (Contributor) commented Sep 5, 2022

As discussed in the closed PR, I think considering log-only file slices when loading the bucketId-fileId mapping is better, because:

  1. for Flink writing, log-first is more intuitive.
  2. the bucket index can theoretically support writing to logs first (although it is not supported right now), for both Spark and Flink. So the problem would also be hit when writing with Spark.

TJX2014 (Contributor, Author) commented Sep 5, 2022

There are two considerations that make me think following Spark is more graceful:

  1. The log file of a MOR table is a temporary state. Both MOR and COW have base files, but only MOR has log files, so log-first is not the common case; we should consider the common base file first, right?
  2. Logs are not stable and should be merged into base files. If we consider log files first, it will make our Hudi system less stable. Our target is to make Hudi stable, right?

minihippo (Contributor)

There is a property named canIndexLogFile in HoodieIndex, which means the index can write to logs first. It is true by default for HBaseIndex and ConsistentBucketIndex, and it could also be true by default for HoodieSimpleBucketIndex. So a log-file-only state is common when writing to a new partition/table until compaction, whether using Spark or Flink. Fixing the way Spark loads the mapping is a once-and-for-all solution.
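The two positions in this thread can be sketched as a single decision point. The enum and method below are hypothetical illustrations of the logic, not Hudi's actual API:

```java
// Hypothetical sketch of the write-handle choice discussed here; the names
// are illustrative stand-ins, not Hudi's actual classes.
public class HandleChoiceSketch {

    public enum HandleType { CREATE_BASE_FILE, APPEND_LOG_FILE }

    // If the bucket's file slice already exists, a MOR write appends a log
    // file. For a brand-new bucket, canIndexLogFile decides whether the
    // index tolerates a log-only slice (log-first) or requires a base file.
    public static HandleType chooseHandle(boolean bucketExists, boolean canIndexLogFile) {
        if (bucketExists) {
            return HandleType.APPEND_LOG_FILE;
        }
        return canIndexLogFile ? HandleType.APPEND_LOG_FILE : HandleType.CREATE_BASE_FILE;
    }

    public static void main(String[] args) {
        // New bucket, index can read logs: log-first (minihippo's preference).
        System.out.println(chooseHandle(false, true));
        // New bucket, index needs base files: create base first (this PR's approach).
        System.out.println(chooseHandle(false, false));
    }
}
```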

TJX2014 (Contributor, Author) commented Sep 5, 2022

canIndexLogFile

Thanks for your suggestion. I will give another PR fix on the Spark side too: when loading the index, consider both log and base files, taking canIndexLogFile into account.

hudi-bot (Collaborator) commented Sep 5, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

yihua added the labels priority:high (Significant impact; potential bugs), engine:flink (Flink integration), and index on Sep 5, 2022
danny0405 (Contributor)

When Spark uses loadPartitionBucketIdFileIdMapping of org.apache.hudi.index.bucket.HoodieSimpleBucketIndex, it cannot find the bucket number written by hudi-flink

Seems we should fix the code on the Spark side, right?

TJX2014 (Contributor, Author) commented Sep 8, 2022

I will give a PR fix on the Spark side too, but on the Flink side I think deduplication should also be enabled as the default option for MOR tables. When duplicates are written to log files, it is very hard for compaction to read them, and it also makes the MOR table less stable because duplicate records are read into memory twice.

minihippo (Contributor)

but on the Flink side I think deduplication should also be enabled as the default option for MOR tables. When duplicates are written to log files, it is very hard for compaction to read them, and it also makes the MOR table less stable because duplicate records are read into memory twice.

Do you mean that there are two clients writing to the same partition at the same time?

danny0405 (Contributor)

I will give a PR fix on the Spark side too, but on the Flink side I think deduplication should also be enabled as the default option for MOR tables. When duplicates are written to log files, it is very hard for compaction to read them, and it also makes the MOR table less stable because duplicate records are read into memory twice.

The initial idea is to keep the details of the log records, such as in the CDC change log feed.

TJX2014 (Contributor, Author) commented Sep 9, 2022

but on the Flink side I think deduplication should also be enabled as the default option for MOR tables. When duplicates are written to log files, it is very hard for compaction to read them, and it also makes the MOR table less stable because duplicate records are read into memory twice.

Do you mean that there are two clients writing to the same partition at the same time?

Not exactly. If we deduplicate the records in memory and then write to the log, it is elegant for MOR because the result is the same. As @danny0405 says, in the CDC situation we need to retain the original records rather than pre-combine them in memory, which is acceptable.
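The in-memory deduplication being discussed can be sketched as a pre-combine step: keep only the record with the largest ordering value per key before appending to the log. The Record class and method names below are hypothetical simplifications, not Hudi's payload classes; in the CDC case this step would simply be skipped so all records survive.

```java
import java.util.*;

// Hypothetical pre-combine sketch: deduplicate records in memory by key,
// keeping the one with the largest ordering value, before writing the log
// file. Record is a simplified stand-in for Hudi's record/payload classes.
public class DedupSketch {

    public static class Record {
        public final String key;
        public final long orderingVal;

        public Record(String key, long orderingVal) {
            this.key = key;
            this.orderingVal = orderingVal;
        }
    }

    // Keep the record with the largest ordering value for each key; on a
    // tie, the earlier record wins.
    public static Collection<Record> deduplicate(List<Record> records) {
        Map<String, Record> latest = new HashMap<>();
        for (Record r : records) {
            latest.merge(r.key, r, (a, b) -> a.orderingVal >= b.orderingVal ? a : b);
        }
        return latest.values();
    }

    public static void main(String[] args) {
        List<Record> input = Arrays.asList(
            new Record("k1", 1), new Record("k1", 2), new Record("k2", 1));
        // Two records survive: k1 with orderingVal 2, and k2.
        System.out.println(deduplicate(input).size()); // 2
    }
}
```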

danny0405 (Contributor)

Guess we can close this PR now; feel free to reopen it if you still have questions.



Projects

Status: Done


5 participants