[HUDI-2175] Adding support for dynamic schemas #7859
Change Logs
This patch introduces a configurable schema reconcile strategy and adds a dynamic schema strategy. The existing reconcile behavior is now referred to as the "legacy" strategy.
Legacy reconcile strategy:
If the newer incoming batch has more columns than the table schema, the incoming schema is chosen as the new table schema.
If the newer incoming batch has fewer columns than the table schema, the table schema remains as is.
No other flows are supported.
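The legacy rules above can be sketched as follows. This is only an illustration over plain column-name lists, not Hudi's actual implementation: the function name `legacy_reconcile` is hypothetical, and reading "more columns" as "a superset of the table's columns" is an assumption made here.

```python
def legacy_reconcile(table_schema, incoming_schema):
    """Sketch of the legacy strategy over lists of column names.

    Hypothetical helper; assumes "more/fewer columns" means the incoming
    columns are a superset/subset of the table columns.
    """
    table_cols, incoming_cols = set(table_schema), set(incoming_schema)
    if table_cols <= incoming_cols:
        # Incoming has every table column (possibly plus new ones): adopt it.
        return list(incoming_schema)
    if incoming_cols < table_cols:
        # Incoming only dropped columns: keep the table schema as is.
        return list(table_schema)
    # Mixed add-and-drop is one of the flows legacy does not support.
    raise ValueError("unsupported schema change for legacy strategy")
```

Under these assumptions, a batch that both drops and adds columns is rejected, which is the gap the dynamic strategy is meant to cover.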
Dynamic schema reconcile strategy:
This is a superset of legacy. With this strategy, the newer incoming batch can drop some columns and can also add new columns compared to the table schema. The new table schema is the last known table schema plus the new columns in the new batch (even if the new batch dropped some columns, Hudi auto-fills nulls for them).
Example illustration:
commit1 with col1, col2, col3 -> tableSchema: col1, col2, col3
commit2 with col1, col2, col3, col4 -> tableSchema: col1, col2, col3, col4
// dropping cols (col3)
commit3 with col1, col2, col4 -> tableSchema: col1, col2, col3, col4
// dropping cols (col2, col4) and adding new cols (col5, col6)
commit4 with col1, col3, col5, col6 -> tableSchema: col1, col2, col3, col4, col5, col6
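The commit sequence above can be replayed with a small sketch. It models schemas as ordered lists of column names and records as dicts; the helper names `dynamic_reconcile`, `fill_missing`, and `replay` are hypothetical and this is not Hudi's actual code, only an illustration of the described behavior.

```python
def dynamic_reconcile(table_schema, incoming_schema):
    """Last known table schema plus any columns that are new in the batch."""
    new_cols = [c for c in incoming_schema if c not in table_schema]
    return list(table_schema) + new_cols


def fill_missing(record, table_schema):
    """Project a record onto the table schema, using None for dropped columns."""
    return {col: record.get(col) for col in table_schema}


def replay(commits):
    """Apply each commit's column list in order; return the schema after each."""
    table_schema, history = [], []
    for incoming in commits:
        table_schema = dynamic_reconcile(table_schema, incoming)
        history.append(table_schema)
    return history


commits = [
    ["col1", "col2", "col3"],
    ["col1", "col2", "col3", "col4"],
    ["col1", "col2", "col4"],          # drops col3
    ["col1", "col3", "col5", "col6"],  # drops col2, col4; adds col5, col6
]
for schema in replay(commits):
    print(schema)
```

In this sketch a commit4 record such as `{"col1": 1, "col3": 3, "col5": 5, "col6": 6}` passed through `fill_missing` with the final schema gets `None` for `col2` and `col4`, mirroring the null auto-fill described above.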
Impact
More flexibility in evolving schemas with Hudi.
Risk level (write none, low medium or high below)
low.
Documentation Update
Introducing a new config named hoodie.datasource.write.reconcile.schema.strategy. The default value is legacy_reconcile_strategy; to leverage dynamic schema, set the value to dynamic_schema_reconcile_strategy. Users also have to set hoodie.datasource.write.reconcile.schema to true to leverage this.
Contributor's checklist