Skip to content

Conversation

@beyond1920
Copy link
Contributor

Change Logs

Snapshot query result is wrong after apply insert overwrite to an existed table with simple bucket index.
see detailed in HUDI-5857.
The root cause of the bug is the write handler reuses existed bucket file id for insert overwrite. Besides it use replace commit for insert overwrite operation and mark all the existed bucket file id as replaced.
So the snapshot query result is wrong.
The pr aims to fix this bug by generating new file id for bucket if insert overwrite into bucket index table.

Impact

NA

Risk level (write none, low medium or high below)

NA

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@XuQianJin-Stars
Copy link
Contributor

Thanks @beyond1920 overall looks good.

Copy link
Contributor

@KnightChess KnightChess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about remove tagLocation when use insert overwrite op, like #8073 , I think this can also solve this quesion too. And the scenario in insert overwrite I think no need to tagLocation, right?

protected Partitioner getPartitioner(WorkloadProfile profile) {
return table.getStorageLayout().layoutPartitionerClass()
.map(c -> getLayoutPartitioner(profile, c))
.map(c -> c.equals(HoodieLayoutConfig.SIMPLE_BUCKET_LAYOUT_PARTITIONER_CLASS_NAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does consistentBucketIndex will not cause the same problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, consistentBucketIndex works correctly, it would generate different file ids.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KnightChess Thanks for your advice.
Remove tagLocation could also fixed this problem. However I prefer to fix this problem by generate new file ids because:

  1. Remove tag location would change stats, for example, miss updated count
  2. It's better to keep same behavior for all index types instead of only remove tag location in insert overwrite for bucket index table.
    But remove tag location is a good improvement to speed up insert overwrite. I would created a new JIRA to track this issue. Maybe using bulk insert to do insert overwrite for all index typed. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beyond1920 I read consistentBucketIndex implementation, found it must tag incomming records to allocation fgId, so #8073 will cause some quesion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, consistentBucketIndex works correctly, it would generate different file ids.

consistentBucketIndex can not work correctly, change the ut case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm, sorry for the hurry response before.
Thank you for point it out.
I need to spend more time to get familiar with ConsistentBucketIndex. I would response ASAP.

}

test("Test Insert Overwrite") {
test("Test Insert Overwrite for bucket ") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add test for consistentBucketIndex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConsistentBucketIndex works correctly, it would generate different file ids.
However, I add the test cases for consitentBucketIndex too.

@beyond1920 beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from 9a4747e to 5fa9c60 Compare February 28, 2023 10:19
// Insert overwrite static partition
spark.sql(
s"""
| insert overwrite table $tableName partition(dt = '2021-01-05')
Copy link
Contributor

@KnightChess KnightChess Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will create a new parquet file with the same prefix against log file, but something diff in fgId suffix. just like the picture, create new parquet file will add -0 after fgId(xxx-0-0_xxx), so it can be read if only insert overwrite onece, but if insert overwrite again, will use the same fgId(xxx-0-0), result nothing.
image

@beyond1920 beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from 5fa9c60 to d4e9858 Compare March 1, 2023 06:31
Copy link
Contributor

@XuQianJin-Stars XuQianJin-Stars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@beyond1920 beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from d4e9858 to d0099b9 Compare March 1, 2023 11:06
@danny0405 danny0405 changed the title [HUDI-5857] Index overwrite into bucket table would generate new file group id [HUDI-5857] Insert overwrite into bucket table would generate new file group id Mar 2, 2023
case CONSISTENT_HASHING:
return new SparkInsertOverwriteConsistentBucketIndexPartitioner(profile, context, table, config);
default:
throw new HoodieNotSupportedException("Unknown bucket index engine type: " + config.getBucketIndexEngineType());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we inline all the different handling for getBucketInfo into SparkInsertOverwritePartitioner ? Let's make the code cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I move part of them which related to ConsistentBucketIndex to SparkInsertOverwritePartitioner.
And I left other part which related to SimpleBucketIndex in SparkBucketIndexInsertOverwritePartitioner.
Because SimpleBucketIndex and ConsistentBucketIndex are different when creates new BucketInfo.

return handleInsert(binfo.fileIdPrefix, recordItr);
} else if (btype.equals(BucketType.UPDATE)) {
throw new HoodieInsertOverwriteException(
"Insert overwrite should always use INSERT bucketType, please correct the logical of " + partitioner.getClass().getName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case we can hit the code path for BucketType.UPDATE ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a protected code to prevent hit this bug again when introduce new partitioner class in the future.

@beyond1920 beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from d0099b9 to a76dc55 Compare March 3, 2023 08:13

@Override
public BucketInfo getBucketInfo(int bucketNumber) {
String partitionPath = partitionPaths.get(bucketNumber / numBuckets);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In HoodieWriteConfig, we can fetch the operation then decides whether it is INSERT_OVERWRITE, then the logic can be moved into SparkBucketIndexPartitioner.

@beyond1920 beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch 2 times, most recently from 2015eb0 to 7267340 Compare March 9, 2023 09:38
Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we are good to land once the CI is green

@beyond1920 beyond1920 force-pushed the fixInsertOverwriteIntoBucketTable branch from 7267340 to f550344 Compare March 9, 2023 09:48
@beyond1920
Copy link
Contributor Author

@hudi-bot run azure

@hudi-bot
Copy link
Collaborator

hudi-bot commented Mar 9, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@XuQianJin-Stars XuQianJin-Stars merged commit 51d0351 into apache:master Mar 10, 2023
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 18, 2023
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
southernriver pushed a commit to southernriver/hudi that referenced this pull request Mar 31, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
KnightChess pushed a commit to KnightChess/hudi that referenced this pull request Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants