[HUDI-5646] Guard dropping columns by a config, do not allow by default #7787
codope merged 13 commits into apache:master
Conversation
```java
    .defaultValue("true")
    .withDocumentation("Validate the schema used for the write against the latest schema, for backwards compatibility.");

public static final ConfigProperty<String> SCHEMA_ALLOW_DROP_COLUMNS = ConfigProperty
```
Let's define it as a plain string and not as a `ConfigProperty`; we don't want to expose this on our configurations page. Alternatively, we need to come up with a way to tag internal configs so that we can fix our config docs generation.
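One way the "tag internal configs" idea could look, as a minimal sketch (the names and the boolean flag here are illustrative, not Hudi's actual `ConfigProperty` API): configs carry an `internal` marker that the docs generator filters on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of tagging internal configs so a docs generator can
// skip them; names are illustrative, not Hudi's actual ConfigProperty API.
class ConfigSketch {
  static final class Prop {
    final String key;
    final boolean internal; // internal props are omitted from the configurations page
    Prop(String key, boolean internal) {
      this.key = key;
      this.internal = internal;
    }
  }

  // Returns only the keys that should appear on the public configurations page.
  static List<String> documentedKeys(List<Prop> props) {
    List<String> keys = new ArrayList<>();
    for (Prop p : props) {
      if (!p.internal) {
        keys.add(p.key);
      }
    }
    return keys;
  }
}
```

With this shape, the docs generator would simply skip any property whose `internal` flag is set, instead of requiring the property to be hidden as a bare string constant.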
```java
try {
  writeBatch(client, "005", "004", Option.empty(), "003", numRecords,
      (String s, Integer a) -> failedRecords, SparkRDDWriteClient::insert, false, numRecords, 2 * numRecords, 5, false);
} catch (HoodieInsertException e) {
```
After L209, before the catch block, we should also add `assertTrue(shouldAllowDroppedColumns);` so the test fails if the expected exception is never thrown.
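The reviewer's pattern, in a self-contained sketch (the write call and exception type below are stand-ins, not the actual Hudi client API): the line after the write call is only reachable when the write succeeded, so asserting there makes the test fail when the expected exception does not fire.

```java
// Self-contained sketch of the suggested test pattern; the "write" and its
// exception are stand-ins, not the actual Hudi SparkRDDWriteClient API.
class GuardedWriteSketch {
  static void writeWithDroppedColumns(boolean allowDroppedColumns) {
    if (!allowDroppedColumns) {
      throw new IllegalStateException("dropping columns is not allowed");
    }
  }

  static boolean exerciseGuard(boolean shouldAllowDroppedColumns) {
    try {
      writeWithDroppedColumns(shouldAllowDroppedColumns);
      // Reached only when the write succeeded, i.e. drops must have been allowed.
      return shouldAllowDroppedColumns;
    } catch (IllegalStateException e) {
      // Reached only when the write failed, i.e. drops must have been disallowed.
      return !shouldAllowDroppedColumns;
    }
  }
}
```

Without the assertion inside the try block, a write that silently succeeds when drops are disallowed would slip through as a passing test.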
```java
try {
  updateBatch(hoodieWriteConfig, client, "009", "008", Option.empty(),
      initCommitTime, numUpdateRecords, SparkRDDWriteClient::upsert, false, false, numUpdateRecords, 4 * numRecords, 9);
} catch (HoodieUpsertException e) {
```
similar comment as above
```java
  writeBatch(client, "004", "003", Option.empty(), "003", numRecords,
      (String s, Integer a) -> failedRecords, SparkRDDWriteClient::insert, true, numRecords, numRecords * 2, 1, false);
} catch (HoodieInsertException e) {
  assertFalse(shouldAllowDroppedColumns);
```
```java
      initCommitTime, numUpdateRecords, SparkRDDWriteClient::upsert, false, true,
      numUpdateRecords, 3 * numRecords, 8);
} catch (HoodieUpsertException e) {
  assertFalse(shouldAllowDroppedColumns);
```
```scala
// We have to register w/ Kryo all of the Avro schemas that might potentially be used to decode
// records into Avro format. Otherwise, Kryo wouldn't be able to apply an optimization allowing
// it to avoid the need to ser/de the whole schema along _every_ Avro record
val targetAvroSchemas = sourceSchema +: writerSchema +: latestTableSchemaOpt.toSeq
```
This comment is actually a misnomer: unfortunately, after the Spark session is started it's impossible to register Kryo schemas with it (therefore this code is removed, to unblock writer-schema handling).
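The underlying constraint can be sketched as an analogy (no Spark or Kryo involved; all names below are illustrative): a session snapshots its configuration at startup, so registrations added afterwards are never seen by it.

```java
import java.util.HashSet;
import java.util.Set;

// Analogy only, not Spark's API: a "session" copies its config at startup,
// so anything registered on the config afterwards has no effect on it.
class SessionSketch {
  static final class Conf {
    final Set<String> registeredSchemas = new HashSet<>();
    Conf register(String schema) {
      registeredSchemas.add(schema);
      return this;
    }
  }

  static final class Session {
    private final Set<String> snapshot;
    Session(Conf conf) {
      // Snapshot taken once, at session startup.
      this.snapshot = new HashSet<>(conf.registeredSchemas);
    }
    boolean sees(String schema) {
      return snapshot.contains(schema);
    }
  }
}
```

This is why registering writer schemas after the session exists is a no-op, and why the registration code above was removed rather than moved later.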
```scala
// NOTE: Target writer's schema is deduced based on
//         - Source's schema
//         - Existing table's schema (including its Hudi's [[InternalSchema]] representation)
val writerSchema = deduceWriterSchema(sourceSchema, latestTableSchemaOpt, internalSchemaOpt, parameters)
```
This code has been moved so that it does not run for operations like delete/delete_partition.
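That gating can be sketched as follows (hypothetical names, not Hudi's actual operation handling): writer-schema deduction only runs for operations that write new records.

```java
// Hypothetical sketch of gating writer-schema deduction by operation type;
// the enum mirrors the operations mentioned above, the rest is illustrative.
class OperationGateSketch {
  enum Operation { INSERT, UPSERT, DELETE, DELETE_PARTITION }

  static boolean needsWriterSchema(Operation op) {
    switch (op) {
      case DELETE:
      case DELETE_PARTITION:
        // No new records are written, so there is no writer schema to deduce.
        return false;
      default:
        return true;
    }
  }
}
```

Skipping deduction for delete-style operations avoids unnecessary schema handling on paths that never serialize new records.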
Addressed my comments and pushed an update.
[HUDI-5646] Guard dropping columns by a config, do not allow by default (#7787)

* Replaced superfluous `isSchemaCompatible` override by explicitly specifying whether column drop should be allowed
* Revisited `HoodieSparkSqlWriter` to avoid (unnecessary) schema handling for delete operations
* Remove meta-fields from latest table schema during analysis
* Disable schema validation when partition columns are dropped

Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
Change Logs
Schema reconciliation is turned off by default. We should not allow dropping columns by default unless schema reconciliation is on. This PR adds a config and schema compatibility check to that effect.
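The compatibility check described above can be illustrated with a minimal sketch over column-name sets (this is an illustration of the guard, not Hudi's actual Avro-based `isSchemaCompatible` check):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the guard this PR describes: reject a writer schema that
// omits existing table columns unless column drops are explicitly allowed.
// Operates on column names for illustration, not on real Avro schemas.
class DropGuardSketch {
  static void validate(Set<String> tableColumns, Set<String> writerColumns, boolean allowDropColumns) {
    Set<String> dropped = new HashSet<>(tableColumns);
    dropped.removeAll(writerColumns);
    if (!dropped.isEmpty() && !allowDropColumns) {
      throw new IllegalArgumentException(
          "Incoming schema drops columns " + dropped + " and column drops are not allowed");
    }
  }
}
```

With the config off (the default), a write whose schema omits existing columns fails fast; enabling the config restores the old permissive behavior.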
Impact
The default behavior is: dropping columns from the table schema is not allowed, and writes whose schema omits existing columns fail unless the new config is enabled.
Risk level: medium
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. For user-facing changes, attach the ticket number here and follow the instructions to make changes to the website.
Contributor's checklist