@nsivabalan
Contributor

@nsivabalan nsivabalan commented Feb 6, 2023

Change Logs

This patch introduces a configurable reconcile strategy and adds a dynamic schema strategy. The existing reconcile behavior is now deemed "legacy".

Legacy reconcile strategy:
If the newer incoming batch has more columns than the table schema, the incoming schema is chosen as the new table schema.
If the newer incoming batch has fewer columns than the table schema, the table schema remains as is.
No other flows are supported.
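
The legacy rules above can be sketched on plain column-name lists. This is a minimal illustration, not Hudi's implementation: `legacy_reconcile` is a hypothetical helper, "more columns" is read as "the incoming columns are a superset of the table columns", and raising an error for the unsupported case is an assumption (the description only says other flows are not supported).

```python
def legacy_reconcile(table_cols, incoming_cols):
    """Hypothetical sketch of the legacy strategy on column-name lists."""
    table, incoming = set(table_cols), set(incoming_cols)
    if table <= incoming:
        # Incoming batch adds columns (or is identical): incoming schema wins.
        return list(incoming_cols)
    if incoming <= table:
        # Incoming batch only drops columns: table schema stays as is.
        return list(table_cols)
    # Assumption: a batch that both drops and adds columns is rejected.
    raise ValueError("legacy strategy does not support drop + add in one batch")


print(legacy_reconcile(["col1", "col2", "col3"], ["col1", "col2", "col3", "col4"]))
print(legacy_reconcile(["col1", "col2", "col3", "col4"], ["col1", "col2", "col4"]))
```

The last branch is the case the dynamic strategy is meant to cover.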

Dynamic schema reconcile strategy:
This is a superset of legacy. Compared to the table schema, a newer incoming batch can drop some columns and add new ones at the same time. The new table schema is the last known table schema plus the new columns in the incoming batch (if the batch dropped some columns, Hudi auto-fills them with nulls).

Example:
commit1 with col1, col2, col3 -> tableSchema: col1, col2, col3
commit2 with col1, col2, col3, col4 -> tableSchema: col1, col2, col3, col4
// dropping a col (col3)
commit3 with col1, col2, col4 -> tableSchema: col1, col2, col3, col4
// dropping cols (col2, col4) and adding new cols (col5, col6)
commit4 with col1, col3, col5, col6 -> tableSchema: col1, col2, col3, col4, col5, col6
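
The illustration above can be replayed with a small sketch of the dynamic strategy. The helpers `dynamic_reconcile` and `fill_missing` are hypothetical and operate on plain column names and dict records rather than on Hudi or Avro schemas; appending new columns at the end is an assumption that matches the order shown in the example.

```python
def dynamic_reconcile(table_cols, incoming_cols):
    """Hypothetical sketch: last known table schema + new incoming columns."""
    merged = list(table_cols)
    for col in incoming_cols:
        if col not in merged:
            merged.append(col)  # assumption: new columns go at the end
    return merged


def fill_missing(record, schema_cols):
    """Fill columns absent from the record with None (null)."""
    return {col: record.get(col) for col in schema_cols}


commits = [
    ["col1", "col2", "col3"],
    ["col1", "col2", "col3", "col4"],
    ["col1", "col2", "col4"],          # drops col3
    ["col1", "col3", "col5", "col6"],  # drops col2, col4; adds col5, col6
]

table_schema = []
for batch_cols in commits:
    table_schema = dynamic_reconcile(table_schema, batch_cols)
    print(batch_cols, "->", table_schema)

# A commit4 record written against the reconciled schema.
row = fill_missing({"col1": 1, "col3": 3, "col5": 5, "col6": 6}, table_schema)
print(row)
```

Columns never leave the table schema in this sketch; a batch that omits a column simply gets nulls for it.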

Impact

More flexibility in evolving schemas with Hudi.

Risk level (write none, low medium or high below)

low.

Documentation Update

This patch introduces a new config named hoodie.datasource.write.reconcile.schema.strategy. The default value is legacy_reconcile_strategy; to leverage dynamic schema, set it to dynamic_schema_reconcile_strategy.
Users also have to set hoodie.datasource.write.reconcile.schema to true for the strategy to take effect.
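
As a rough sketch of how these two settings might be passed to a write, assuming the config key and values land exactly as described in this PR: the helper `reconcile_options` is hypothetical, and the commented-out writer call uses a placeholder DataFrame `df` and `base_path`.

```python
def reconcile_options(strategy="legacy_reconcile_strategy"):
    """Hypothetical helper building the write options described in this PR."""
    return {
        # Reconciliation must be enabled for the strategy to apply.
        "hoodie.datasource.write.reconcile.schema": "true",
        # Config proposed in this PR; default is the legacy strategy.
        "hoodie.datasource.write.reconcile.schema.strategy": strategy,
    }


hudi_options = reconcile_options("dynamic_schema_reconcile_strategy")
for key, value in hudi_options.items():
    print(f"{key}={value}")

# With a Spark DataFrame `df` and a table path `base_path` (both placeholders):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```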

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added the priority:high Significant impact; potential bugs label Feb 6, 2023
@codope codope added the area:schema Schema evolution and data types label Feb 6, 2023
@codope codope self-assigned this Feb 6, 2023
@kazdy
Contributor

kazdy commented Feb 6, 2023

Hi @nsivabalan, can I ask something?
Assume I enabled full schema evolution. If I add a column in the middle, will dynamic schema reconciliation handle it?
Or is this only meant to support "out of the box" schema evolution?
There was a PR #6017 implementing similar behaviour when both full schema evolution and reconciliation were enabled.
I'm interested in preserving this behavior if possible since some of my jobs rely on it.

@hudi-bot
Collaborator

hudi-bot commented Feb 6, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2024

4 participants