
[HUDI-5317] Fix insert overwrite table for partitioned table #7365

Merged
leesf merged 3 commits into apache:master from stream2000:fix_insert_overwrite_table
Jan 12, 2023

@stream2000 stream2000 commented Dec 2, 2022

Change Logs

For SQL like insert overwrite table $table select xxx, we expect all existing data in the table to be dropped before the selected data is inserted. However, the 'insert overwrite table' semantics currently work only for non-partitioned tables. For a partitioned table, the current implementation drops only the partitions involved in the select sub-query; the other partitions, which should also be dropped, are left in place.

This PR fixes the problem so that insert overwrite table first drops all partitions and then inserts the new data.
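The difference between the two behaviors can be illustrated with a minimal sketch. It models a partitioned table as a plain dictionary from partition value to rows; the function names and the dt partition field are hypothetical and chosen for illustration only, and this is not how Hudi implements either operation.

```python
from typing import Dict, List

# Toy model (an assumption for illustration): partition value -> rows in that partition.
Table = Dict[str, List[dict]]


def group_by_partition(rows: List[dict], partition_field: str) -> Table:
    """Group incoming rows by the value of their partition field."""
    grouped: Table = {}
    for row in rows:
        grouped.setdefault(row[partition_field], []).append(row)
    return grouped


def overwrite_matching_partitions(table: Table, new_rows: List[dict],
                                  partition_field: str) -> Table:
    """Behavior before the fix: only partitions present in new_rows are replaced."""
    result = {partition: list(rows) for partition, rows in table.items()}
    result.update(group_by_partition(new_rows, partition_field))
    return result


def overwrite_table(table: Table, new_rows: List[dict],
                    partition_field: str) -> Table:
    """Expected behavior: every existing partition is dropped, then new_rows are written."""
    return group_by_partition(new_rows, partition_field)


existing = {
    "2022-12-01": [{"id": 1, "dt": "2022-12-01"}],
    "2022-12-02": [{"id": 2, "dt": "2022-12-02"}],
}
incoming = [{"id": 3, "dt": "2022-12-02"}]

before_fix = overwrite_matching_partitions(existing, incoming, "dt")
after_fix = overwrite_table(existing, incoming, "dt")
```

In this toy model, before_fix still contains the untouched 2022-12-01 partition, while after_fix contains only the partition written by the new data.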

Impact

Insert overwrite table will now drop all partitions first and then insert the new data.

Risk level (write none, low medium or high below)

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@stream2000
Contributor Author

@leesf Could you please help to review this PR?

@leesf leesf self-assigned this Dec 2, 2022
@leesf leesf closed this Dec 3, 2022
@leesf leesf reopened this Dec 3, 2022

leesf commented Dec 3, 2022

@stream2000 would you please check CI failure?


leesf commented Dec 3, 2022

@hudi-bot run azure

@stream2000
Contributor Author

> @stream2000 would you please check CI failure?

It seems some unit tests failed. I will fix them.

@nsivabalan
Contributor

We have two operations related to insert overwrite:
1: insert_overwrite_table
2: insert_overwrite

Spark datasource writes support both operations.
insert_overwrite_table overwrites the entire table, while insert_overwrite overwrites only the matching partitions.

I guess in spark-sql we supported only insert_overwrite. I'm not sure whether we can revert that behavior. Maybe we should consider adding a new write operation in spark-sql for this.
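For reference, a minimal sketch of how the two operations might be selected on the Spark datasource path. The helper name hudi_write_options is hypothetical; the option keys hoodie.table.name and hoodie.datasource.write.operation, and the two operation values, are the ones discussed here. The commented-out write call assumes an existing DataFrame df and a base path, neither of which is defined in this sketch.

```python
from typing import Dict


def hudi_write_options(table_name: str, overwrite_whole_table: bool) -> Dict[str, str]:
    """Build datasource write options (hypothetical helper, for illustration only).

    insert_overwrite_table replaces the entire table;
    insert_overwrite replaces only the partitions present in the incoming data.
    """
    operation = "insert_overwrite_table" if overwrite_whole_table else "insert_overwrite"
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
    }


# Assumed usage with an existing Spark DataFrame `df` and a table base path:
# df.write.format("hudi") \
#     .options(**hudi_write_options("my_table", overwrite_whole_table=True)) \
#     .mode("append") \
#     .save(base_path)
```

A real write would need further options (record key, partition path field, and so on) that are omitted here.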

@nsivabalan nsivabalan added the priority:blocker (Production down; release blocker) and release-0.12.2 (Patches targeted for 0.12.2) labels Dec 5, 2022
@codope codope added the priority:critical (Production degraded; pipelines stalled) and area:sql (SQL interfaces) labels and removed the priority:blocker and release-0.12.2 labels Dec 7, 2022

leesf commented Dec 9, 2022

> we have two operations relating to insert_overwrite. 1: insert_overwrite_table 2: insert_overwrite.
>
> spark-ds writes supports both operations. insert_overwrite_table will override entire table. while insert_overwrite will overwrite only matching partitions.
>
> guess in spark-sql, we supported only insert_overwrite. not sure if we can revert the behavior. May be we should consider adding a new write operation in spark-sql for this.

@nsivabalan hi, here are my two cents: insert overwrite xxx values(xx,xxx) has very clear semantics: it overwrites the entire table. insert overwrite xx partition(xx) values(xx,xxx) overwrites only the specified partitions. However, Hudi currently applies partition-level overwrite to the table-level statement, which is a definite bug, and I do not think we need to introduce a new operation for it.
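The mapping described above can be sketched as a small classifier. This is only an illustration of the intended semantics, assuming that the presence of a partition(...) clause is what distinguishes the two forms; the function expected_operation and its regex are hypothetical and are not Hudi's SQL parsing logic.

```python
import re

# Naive detection of a "partition (...)" clause. An assumption for illustration;
# a real implementation would inspect the parsed plan, not the SQL text.
_PARTITION_CLAUSE = re.compile(r"\bpartition\s*\(", re.IGNORECASE)


def expected_operation(sql: str) -> str:
    """Return the write operation the statement is expected to map to."""
    normalized = " ".join(sql.split())
    if not normalized.lower().startswith("insert overwrite"):
        raise ValueError("not an INSERT OVERWRITE statement")
    if _PARTITION_CLAUSE.search(normalized):
        # Explicit partition spec: overwrite only those partitions.
        return "insert_overwrite"
    # No partition spec: overwrite the whole table.
    return "insert_overwrite_table"
```

For example, a statement without a partition clause would map to insert_overwrite_table under this reading, and one with partition(dt='2022-12-01') would map to insert_overwrite.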

@stream2000 stream2000 force-pushed the fix_insert_overwrite_table branch from 424e8f0 to e58d4db Compare December 14, 2022 06:10
@stream2000
Contributor Author

@hudi-bot run azure

@stream2000 stream2000 force-pushed the fix_insert_overwrite_table branch 5 times, most recently from f10a71c to d3ab1e3 Compare December 29, 2022 02:31
@stream2000 stream2000 force-pushed the fix_insert_overwrite_table branch from d3ab1e3 to b879347 Compare January 9, 2023 07:56

@leesf leesf left a comment

LGTM

@stream2000
Contributor Author

@hudi-bot run azure


leesf commented Jan 11, 2023

@hudi-bot run azure

@stream2000 stream2000 closed this Jan 11, 2023
@stream2000 stream2000 reopened this Jan 11, 2023
@stream2000 stream2000 force-pushed the fix_insert_overwrite_table branch from b4a9d1a to c963634 Compare January 12, 2023 02:34
@stream2000 stream2000 force-pushed the fix_insert_overwrite_table branch from c963634 to 9f01927 Compare January 12, 2023 02:41
@stream2000 stream2000 force-pushed the fix_insert_overwrite_table branch from 9f01927 to 9e1f64f Compare January 12, 2023 02:43
@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Labels

  • area:sql (SQL interfaces)
  • priority:critical (Production degraded; pipelines stalled)
