[HUDI-5684] Fix CTAS and Insert Into to avoid combine-on-insert by default #7813

Merged
codope merged 6 commits into apache:master from onehouseinc:ak/ctas-dedup-fix on Feb 2, 2023

@alexeykudinkin alexeykudinkin commented Feb 1, 2023

Change Logs

Currently, InsertIntoHoodieTable sets the COMBINE_BEFORE_INSERT config by default whenever a pre-combine field is specified, and it does so in a way that does not allow the user to override it.

To address this, all Spark SQL feature-specific configs are now split into two categories:

  • Default: settings serving as a default (or preferred) value for the feature (can be overridden by the user)
  • Overriding: settings serving as required values for the feature (can NOT be overridden by the user)
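
The precedence implied by this split can be illustrated with a minimal sketch. It is not the code from this PR: the function names and the `example.feature.required` key are hypothetical and exist only for illustration, and the assumption that the feature default enables combining only when a pre-combine field is set is taken from the description above. `hoodie.combine.before.insert` is the key behind COMBINE_BEFORE_INSERT.

```python
# Illustrative sketch only; this is not Hudi's actual implementation.
COMBINE_BEFORE_INSERT = "hoodie.combine.before.insert"
REQUIRED_KEY = "example.feature.required"  # hypothetical key, for illustration


def combine_options(defaults, user, overriding):
    """Merge option layers so that later layers win: defaults < user < overriding."""
    merged = dict(defaults)
    merged.update(user)        # user-provided options replace feature defaults
    merged.update(overriding)  # required options replace everything else
    return merged


def insert_into_options(user_opts, has_precombine_field):
    """Build the effective options for a hypothetical Insert Into feature."""
    # Assumed default: combine before insert only when a pre-combine field exists.
    defaults = {COMBINE_BEFORE_INSERT: str(has_precombine_field).lower()}
    # Assumed required setting that the user cannot change.
    overriding = {REQUIRED_KEY: "required"}
    return combine_options(defaults, user_opts, overriding)
```

With this ordering, a user who passes `{COMBINE_BEFORE_INSERT: "false"}` turns off the default deduplication, while any attempt to set `REQUIRED_KEY` is ignored.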

Impact

Avoids combining on insertion for Insert Into and CTAS statements in Spark SQL

Risk level (write none, low, medium or high below)

Low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@alexeykudinkin alexeykudinkin requested a review from yihua February 1, 2023 03:43
@alexeykudinkin alexeykudinkin added the priority:blocker Production down; release blocker label Feb 1, 2023
@alexeykudinkin alexeykudinkin changed the title [MINOR] Fix CTAS and Insert Into to avoid combine-on-insert by default [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default Feb 2, 2023
@alexeykudinkin alexeykudinkin force-pushed the ak/ctas-dedup-fix branch 2 times, most recently from 22bd2b7 to cab2849 Compare February 2, 2023 02:35
Alexey Kudinkin added 6 commits February 1, 2023 21:35
@alexeykudinkin alexeykudinkin changed the title [MINOR][Stacked on 7821] Fix CTAS and Insert Into to avoid combine-on-insert by default [HUDI-5684] Fix CTAS and Insert Into to avoid combine-on-insert by default Feb 2, 2023

hudi-bot commented Feb 2, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • `@hudi-bot run azure`: re-run the last Azure build

@codope codope merged commit 1459edd into apache:master Feb 2, 2023
yihua pushed a commit that referenced this pull request Feb 2, 2023
…fault (#7813)

* Remove `COMBINE_BEFORE_INSERT` config being overridden for insert operations

* Revisited Spark SQL feature configuration to allow dichotomy of having:
  - (Feature-)specific "default" configuration (that could be overridden by the user)
  - "Overriding" configuration (that could NOT be overridden by the user)

* Restoring existing behavior for Insert Into to deduplicate by default (if pre-combine is specified)

* Fixing compilation

* Fixing compilation (one more time)

* Fixing options combination ordering
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…fault (apache#7813)

