Skip to content

Conversation

@kazdy
Copy link
Contributor

@kazdy kazdy commented Feb 20, 2023

Change Logs

Fix shouldCombine, take into account the situation where the write operation is UPSERT but COMBINE_BEFORE_UPSERT is false.
Currently, Hudi always combines records on UPSERT, and option COMBINE_BEFORE_UPSERT is not honored.

Impact

Fixes user-facing option COMBINE_BEFORE_UPSERT

Risk level (write none, low medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@kazdy
Copy link
Contributor Author

kazdy commented Feb 21, 2023

@hudi-bot run azure

@kazdy kazdy changed the title Do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is disabled [HUDI-5824] Do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false Feb 21, 2023
@kazdy kazdy changed the title [HUDI-5824] Do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false Feb 21, 2023
@kazdy
Copy link
Contributor Author

kazdy commented Feb 21, 2023

GH actions tests failed but I don't see why, it passed before.
Azure failed on TestHoodieDeltaStreamerWithMultiWriter, rather unrelated and it also passed it on the first run. Looks like some flaky tests causing issues.

@kazdy kazdy force-pushed the fix-combine-before-upsert-not-applied branch from de68cc5 to 13fafcd Compare February 21, 2023 17:09
@kazdy kazdy marked this pull request as ready for review February 21, 2023 20:09
@kazdy kazdy marked this pull request as draft February 22, 2023 23:34
@kazdy
Copy link
Contributor Author

kazdy commented Feb 23, 2023

Found this because of my test failures, reported in HUDI-5839

There seem to be a bug with non-strict insert mode
when using spark datasource it can insert duplicates only in overwrite mode or append mode when data is inserted to the table for the first time, but if I want to insert in append mode for the second time it deduplicates the dataset as if it was working in upsert mode.


opt_insert = {
    'hoodie.table.name': 'huditbl',
    'hoodie.datasource.write.recordkey.field': 'keyid',
    'hoodie.datasource.write.table.name': 'huditbl',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.sql.insert.mode': 'non-strict',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.combine.before.upsert': 'false',
    'hoodie.combine.before.insert': 'false',
    'hoodie.datasource.write.insert.drop.duplicates': 'false'
}

df = spark.range(0, 10).toDF("keyid") \
  .withColumn("age", expr("keyid + 1000"))

df.write.format("hudi"). \
options(**opt_insert). \
mode("overwrite"). \
save(path)

spark.read.format("hudi").load(path).count() # returns 10

df = df.union(df) # creates duplicates
df.write.format("hudi"). \
options(**opt_insert). \
mode("append"). \
save(path)

spark.read.format("hudi").load(path).count() # returns 10 but should return 20

@kazdy kazdy marked this pull request as ready for review February 23, 2023 19:10
@kazdy
Copy link
Contributor Author

kazdy commented Feb 23, 2023

Hi Hudi devs, I would appreciate a review, thanks!

@danny0405 danny0405 self-assigned this Feb 27, 2023
@danny0405 danny0405 added engine:spark Spark integration writer-core labels Feb 27, 2023
@kazdy kazdy requested a review from bvaradar April 12, 2023 06:35
@kazdy kazdy force-pushed the fix-combine-before-upsert-not-applied branch from b572d73 to a95196c Compare May 10, 2023 21:01
@kazdy kazdy requested review from bvaradar and danny0405 May 10, 2023 21:15
@kazdy kazdy marked this pull request as draft May 12, 2023 20:50
@kazdy kazdy force-pushed the fix-combine-before-upsert-not-applied branch 2 times, most recently from c8fd80e to dfdd333 Compare May 12, 2023 23:11
@kazdy
Copy link
Contributor Author

kazdy commented May 12, 2023

@hudi-bot run azure

Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kazdy : One comment about testcase. Otherwise looks good.

@bvaradar
Copy link
Contributor

@kazdy : Can you remove the draft status if you think it is ready.

@kazdy kazdy force-pushed the fix-combine-before-upsert-not-applied branch from dfdd333 to 93db1f0 Compare May 14, 2023 18:50
@kazdy kazdy marked this pull request as ready for review May 14, 2023 21:56
@kazdy kazdy requested a review from bvaradar May 14, 2023 21:56
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@bvaradar bvaradar merged commit 2e2459e into apache:master May 15, 2023
@kazdy kazdy mentioned this pull request May 15, 2023
4 tasks
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
yihua pushed a commit to yihua/hudi that referenced this pull request May 17, 2023
@kazdy kazdy deleted the fix-combine-before-upsert-not-applied branch May 23, 2023 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine:spark Spark integration

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants