[HUDI-5646] Guard dropping columns by a config, do not allow by default #7787

Merged
codope merged 13 commits into apache:master from codope:schema-validation
Feb 1, 2023

Conversation

Member

@codope codope commented Jan 29, 2023

Change Logs

Schema reconciliation is turned off by default, so we should not allow dropping columns unless it is explicitly enabled. This PR adds a config and a schema compatibility check to that effect.

Impact

The default behavior is:

  • Adding columns is allowed.
  • Type promotion is allowed.
  • Dropping columns is not allowed.
  • Renaming columns is not allowed.
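These defaults can be pictured with a minimal sketch. This is an illustration only, not Hudi's actual implementation: a compatibility check over column-name sets in which added columns pass, and dropped columns fail unless explicitly allowed by config.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (class and method names are hypothetical): compatible
// when every existing table column is still present in the incoming writer
// schema, or when column drops are explicitly allowed.
public class SchemaGuard {

  public static boolean isCompatible(Set<String> tableColumns,
                                     Set<String> writerColumns,
                                     boolean allowDroppedColumns) {
    return allowDroppedColumns || writerColumns.containsAll(tableColumns);
  }

  public static void main(String[] args) {
    Set<String> table = new HashSet<>(Arrays.asList("id", "name", "ts"));
    Set<String> withNewCol = new HashSet<>(Arrays.asList("id", "name", "ts", "city"));
    Set<String> withDrop = new HashSet<>(Arrays.asList("id", "ts"));

    System.out.println(isCompatible(table, withNewCol, false)); // true: additions allowed
    System.out.println(isCompatible(table, withDrop, false));   // false: drop rejected by default
    System.out.println(isCompatible(table, withDrop, true));    // true: drop allowed when opted in
  }
}
```

Note that this only models the column-drop rule; the real check in the PR also covers type promotion and operates on full Avro schemas rather than name sets.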

Risk level (write none, low, medium, or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope codope added the priority:blocker Production down; release blocker label Jan 29, 2023
.defaultValue("true")
.withDocumentation("Validate the schema used for the write against the latest schema, for backwards compatibility.");

public static final ConfigProperty<String> SCHEMA_ALLOW_DROP_COLUMNS = ConfigProperty
Contributor


Let's define it as a string and not as a config property; we don't want to expose this on our configurations page.
Alternatively, we need to come up with a way to tag internal configs so that we can fix our config docs generation.
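The suggestion above could look like the following minimal sketch (the key name and helper are hypothetical, not Hudi's actual API): keep the internal flag as a bare string constant read straight from Properties, so that, assuming the docs generator only scans ConfigProperty definitions, it never appears on the configurations page.

```java
import java.util.Properties;

// Hypothetical sketch of the reviewer's suggestion: a plain string key instead
// of a ConfigProperty. The key name here is illustrative only.
public class InternalSchemaConfigs {

  public static final String SCHEMA_ALLOW_DROP_COLUMNS_KEY =
      "hoodie.internal.schema.allow.drop.columns";

  // Defaults to false: dropping columns stays disallowed unless opted in.
  public static boolean allowDropColumns(Properties props) {
    return Boolean.parseBoolean(props.getProperty(SCHEMA_ALLOW_DROP_COLUMNS_KEY, "false"));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(allowDropColumns(props)); // false by default
    props.setProperty(SCHEMA_ALLOW_DROP_COLUMNS_KEY, "true");
    System.out.println(allowDropColumns(props)); // true once opted in
  }
}
```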

try {
writeBatch(client, "005", "004", Option.empty(), "003", numRecords,
(String s, Integer a) -> failedRecords, SparkRDDWriteClient::insert, false, numRecords, 2 * numRecords, 5, false);
} catch (HoodieInsertException e) {
Contributor


After L209 in the diff, and before the catch block, we should also do:

assertTrue(shouldAllowDroppedColumns); 

try {
updateBatch(hoodieWriteConfig, client, "009", "008", Option.empty(),
initCommitTime, numUpdateRecords, SparkRDDWriteClient::upsert, false, false, numUpdateRecords, 4 * numRecords, 9);
} catch (HoodieUpsertException e) {
Contributor


similar comment as above

writeBatch(client, "004", "003", Option.empty(), "003", numRecords,
(String s, Integer a) -> failedRecords, SparkRDDWriteClient::insert, true, numRecords, numRecords * 2, 1, false);
} catch (HoodieInsertException e) {
assertFalse(shouldAllowDroppedColumns);
Contributor


ditto

initCommitTime, numUpdateRecords, SparkRDDWriteClient::upsert, false, true,
numUpdateRecords, 3 * numRecords, 8);
} catch (HoodieUpsertException e) {
assertFalse(shouldAllowDroppedColumns);
Contributor


ditto

// We have to register w/ Kryo all of the Avro schemas that might potentially be used to decode
// records into Avro format. Otherwise, Kryo wouldn't be able to apply an optimization allowing
// it to avoid the need to ser/de the whole schema along _every_ Avro record
val targetAvroSchemas = sourceSchema +: writerSchema +: latestTableSchemaOpt.toSeq
Contributor


This comment is actually misleading: unfortunately, after the Spark session is started it's impossible to register Kryo schemas with it (therefore this code is removed to unblock writer-schema handling).

// NOTE: Target writer's schema is deduced based on
// - Source's schema
// - Existing table's schema (including its Hudi's [[InternalSchema]] representation)
val writerSchema = deduceWriterSchema(sourceSchema, latestTableSchemaOpt, internalSchemaOpt, parameters)
Contributor


This code has moved to avoid running it for operations like delete/delete_partition
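That move can be pictured with a small control-flow sketch (names are hypothetical, and it is in Java rather than the Scala of HoodieSparkSqlWriter): schema deduction runs only for operations that actually write data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the control flow described above: delete-style
// operations skip writer-schema deduction entirely; otherwise the writer
// schema comes from the source schema, falling back to the table's latest.
public class WriterSchemaFlow {

  private static final Set<String> SCHEMA_FREE_OPS =
      new HashSet<>(Arrays.asList("delete", "delete_partition"));

  public static Optional<String> deduceWriterSchema(String operation,
                                                    Optional<String> sourceSchema,
                                                    Optional<String> latestTableSchema) {
    if (SCHEMA_FREE_OPS.contains(operation)) {
      return Optional.empty(); // deletes carry no new data, so no schema handling is needed
    }
    return sourceSchema.isPresent() ? sourceSchema : latestTableSchema;
  }
}
```

In the actual PR the deduction additionally consults Hudi's InternalSchema representation and the write parameters; this sketch only shows why delete operations bypass it.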

@nsivabalan
Contributor

Addressed my comments and pushed an update

@hudi-bot
Collaborator

hudi-bot commented Feb 1, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit 5e616ab into apache:master Feb 1, 2023
yihua pushed a commit that referenced this pull request Feb 2, 2023
…lt (#7787)

* [HUDI-5646] Guard dropping columns by a config, do not allow by default

* Replaced superfluous `isSchemaCompatible` override by explicitly specifying whether column drop should be allowed;

* Revisited `HoodieSparkSqlWriter` to avoid (unnecessary) schema handling for delete operations

* Remove meta-fields from latest table schema during analysis

* Disable schema validation when partition columns are dropped

---------

Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…lt (apache#7787)


Labels

priority:blocker Production down; release blocker

4 participants