[HUDI-5646] Guard dropping columns by a config, do not allow by default #7787

Merged
codope merged 13 commits into apache:master from codope:schema-validation
Feb 1, 2023

Conversation

Member

@codope codope commented Jan 29, 2023

Change Logs

Schema reconciliation is turned off by default, so we should not allow dropping columns unless it is explicitly enabled. This PR adds a config and a schema compatibility check to that effect.

Impact

The default behavior is:

  • Adding columns is allowed.
  • Type promotion is allowed.
  • Dropping columns is not allowed.
  • Renaming columns is not allowed.
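These defaults can be pictured with a minimal sketch. This is an illustration only, not Hudi's actual implementation: a compatibility check over column-name sets in which added columns pass, and dropped columns fail unless explicitly allowed by config.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (class and method names are hypothetical): compatible
// when every existing table column is still present in the incoming writer
// schema, or when column drops are explicitly allowed.
public class SchemaGuard {

  public static boolean isCompatible(Set<String> tableColumns,
                                     Set<String> writerColumns,
                                     boolean allowDroppedColumns) {
    return allowDroppedColumns || writerColumns.containsAll(tableColumns);
  }

  public static void main(String[] args) {
    Set<String> table = new HashSet<>(Arrays.asList("id", "name", "ts"));
    Set<String> withNewCol = new HashSet<>(Arrays.asList("id", "name", "ts", "city"));
    Set<String> withDrop = new HashSet<>(Arrays.asList("id", "ts"));

    System.out.println(isCompatible(table, withNewCol, false)); // true: additions allowed
    System.out.println(isCompatible(table, withDrop, false));   // false: drop rejected by default
    System.out.println(isCompatible(table, withDrop, true));    // true: drop allowed when opted in
  }
}
```

Note that this only models the column-drop rule; the real check in the PR also covers type promotion and operates on full Avro schemas rather than name sets.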

Risk level (write none, low, medium, or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope codope added the priority:blocker Production down; release blocker label Jan 29, 2023
.defaultValue("true")
.withDocumentation("Validate the schema used for the write against the latest schema, for backwards compatibility.");

public static final ConfigProperty<String> SCHEMA_ALLOW_DROP_COLUMNS = ConfigProperty
Contributor


Let's define it as a string and not as a config property; we don't want to expose this on our configurations page.
Alternatively, we need to come up with a way to tag internal configs so that we can fix our config docs generation.
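The suggestion above could look like the following minimal sketch (the key name and helper are hypothetical, not Hudi's actual API): keep the internal flag as a bare string constant read straight from Properties, so that, assuming the docs generator only scans ConfigProperty definitions, it never appears on the configurations page.

```java
import java.util.Properties;

// Hypothetical sketch of the reviewer's suggestion: a plain string key instead
// of a ConfigProperty. The key name here is illustrative only.
public class InternalSchemaConfigs {

  public static final String SCHEMA_ALLOW_DROP_COLUMNS_KEY =
      "hoodie.internal.schema.allow.drop.columns";

  // Defaults to false: dropping columns stays disallowed unless opted in.
  public static boolean allowDropColumns(Properties props) {
    return Boolean.parseBoolean(props.getProperty(SCHEMA_ALLOW_DROP_COLUMNS_KEY, "false"));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(allowDropColumns(props)); // false by default
    props.setProperty(SCHEMA_ALLOW_DROP_COLUMNS_KEY, "true");
    System.out.println(allowDropColumns(props)); // true once opted in
  }
}
```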

try {
writeBatch(client, "005", "004", Option.empty(), "003", numRecords,
(String s, Integer a) -> failedRecords, SparkRDDWriteClient::insert, false, numRecords, 2 * numRecords, 5, false);
} catch (HoodieInsertException e) {
Contributor


After L209 in the diff, and before the catch block, we should also do:

assertTrue(shouldAllowDroppedColumns); 

try {
updateBatch(hoodieWriteConfig, client, "009", "008", Option.empty(),
initCommitTime, numUpdateRecords, SparkRDDWriteClient::upsert, false, false, numUpdateRecords, 4 * numRecords, 9);
} catch (HoodieUpsertException e) {
Contributor


similar comment as above

writeBatch(client, "004", "003", Option.empty(), "003", numRecords,
(String s, Integer a) -> failedRecords, SparkRDDWriteClient::insert, true, numRecords, numRecords * 2, 1, false);
} catch (HoodieInsertException e) {
assertFalse(shouldAllowDroppedColumns);
Contributor


ditto

initCommitTime, numUpdateRecords, SparkRDDWriteClient::upsert, false, true,
numUpdateRecords, 3 * numRecords, 8);
} catch (HoodieUpsertException e) {
assertFalse(shouldAllowDroppedColumns);
Contributor


ditto

// We have to register w/ Kryo all of the Avro schemas that might potentially be used to decode
// records into Avro format. Otherwise, Kryo wouldn't be able to apply an optimization allowing
// it to avoid the need to ser/de the whole schema along _every_ Avro record
val targetAvroSchemas = sourceSchema +: writerSchema +: latestTableSchemaOpt.toSeq
Contributor


This comment is actually misleading: unfortunately, after the Spark session is started it's impossible to register Kryo schemas with it (therefore this code is removed to unblock writer-schema handling).

// NOTE: Target writer's schema is deduced based on
// - Source's schema
// - Existing table's schema (including its Hudi's [[InternalSchema]] representation)
val writerSchema = deduceWriterSchema(sourceSchema, latestTableSchemaOpt, internalSchemaOpt, parameters)
Contributor


This code has moved to avoid running it for operations like delete/delete_partition
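That move can be pictured with a small control-flow sketch (names are hypothetical, and it is in Java rather than the Scala of HoodieSparkSqlWriter): schema deduction runs only for operations that actually write data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the control flow described above: delete-style
// operations skip writer-schema deduction entirely; otherwise the writer
// schema comes from the source schema, falling back to the table's latest.
public class WriterSchemaFlow {

  private static final Set<String> SCHEMA_FREE_OPS =
      new HashSet<>(Arrays.asList("delete", "delete_partition"));

  public static Optional<String> deduceWriterSchema(String operation,
                                                    Optional<String> sourceSchema,
                                                    Optional<String> latestTableSchema) {
    if (SCHEMA_FREE_OPS.contains(operation)) {
      return Optional.empty(); // deletes carry no new data, so no schema handling is needed
    }
    return sourceSchema.isPresent() ? sourceSchema : latestTableSchema;
  }
}
```

In the actual PR the deduction additionally consults Hudi's InternalSchema representation and the write parameters; this sketch only shows why delete operations bypass it.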

@nsivabalan
Contributor

Addressed my comments and pushed an update

@hudi-bot
Collaborator

hudi-bot commented Feb 1, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit 5e616ab into apache:master Feb 1, 2023
yihua pushed a commit that referenced this pull request Feb 2, 2023
…lt (#7787)

* [HUDI-5646] Guard dropping columns by a config, do not allow by default

* Replaced superfluous `isSchemaCompatible` override by explicitly specifying whether column drop should be allowed;

* Revisited `HoodieSparkSqlWriter` to avoid (unnecessary) schema handling for delete operations

* Remove meta-fields from latest table schema during analysis

* Disable schema validation when partition columns are dropped

---------

Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…lt (apache#7787)


Labels

priority:blocker Production down; release blocker

4 participants