[HUDI-3982] Comprehensive schema evolution in flink when read/batch/cow/snapshot#5443
trushev wants to merge 4 commits into apache:master
Conversation
@danny0405 @bvaradar If you have free time, could you please help review this PR? Thanks very much.
What do you mean when you say
This PR covers the following case:
  <artifactId>spark-hive_${scala.binary.version}</artifactId>
  <scope>test</scope>
</dependency>
<dependency>
Why introduce the Spark dependency in the Flink pom?
To prepare test data. Currently, only the Spark engine provides a way to change the schema and write new data afterwards.
I think once full support of schema evolution is implemented, we can remove this dependency by rewriting the test in pure Flink.
}

private SchemaEvoContext getSchemaEvoContext() {
  if (!conf.getBoolean(FlinkOptions.SCHEMA_EVOLUTION_ENABLED)) {
Returns Option<SchemaEvoContext> instead.
Fixed. Removed the enabled field; isPresent now means enabled.
LogicalTypeRoot to = toType.getTypeRoot();
switch (to) {
  case BIGINT: {
    // Integer => Long
What is the philosophy of these mappings?
Assume the schema evolution DDL
alter table t1 alter column val type bigint
which changes the type of val from int to bigint.
We want to be able to read old data. To do that we need to cast val from int to long,
otherwise an exception is thrown:
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    at org.apache.flink.table.data.GenericRowData.getLong(GenericRowData.java:154)
This class is an analogue of org.apache.hudi.client.utils.SparkInternalSchemaConverter#convertColumnVectorType, which converts Spark's types.
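The int-to-long mapping discussed above can be illustrated with a small, self-contained sketch. This is plain Java, not the PR's actual CastMap class; the class name, the string type keys, and the method names here are all hypothetical stand-ins for Flink's LogicalType machinery:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch: maps a (fromType -> toType) pair to a conversion
// function, mirroring how old int values must be widened to long after
// `alter table t1 alter column val type bigint`.
public class CastMapSketch {
  private final Map<String, Function<Object, Object>> casts = new HashMap<>();

  public void add(String from, String to, Function<Object, Object> fn) {
    casts.put(from + "->" + to, fn);
  }

  public Object castIfNeeded(String from, String to, Object value) {
    Function<Object, Object> fn = casts.get(from + "->" + to);
    return fn == null ? value : fn.apply(value); // no entry = no cast needed
  }

  public static void main(String[] args) {
    CastMapSketch map = new CastMapSketch();
    // Integer => Long, as in the BIGINT branch shown in the diff above
    map.add("INT", "BIGINT", v -> ((Integer) v).longValue());
    Object widened = map.castIfNeeded("INT", "BIGINT", 42);
    System.out.println(widened.getClass().getSimpleName() + " " + widened); // prints "Long 42"
  }
}
```

Without such a cast, a reader that calls getLong on a row still holding an Integer fails with exactly the ClassCastException quoted above.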
 * Data class to pass schema evolution info from table source to input format.
 */
public final class SchemaEvoContext implements Serializable {
  private final boolean enabled;
Is this clazz necessary? The enabled flag can be replaced by a non-empty Option<querySchema> instead.
Is this clazz necessary?
I think yes. The schema evolution methods presented here moved from CopyOnWriteInputFormat to SchemaEvoContext so they can be reused in MergeOnReadInputFormat.
The enabled flag can be replaced by Option<querySchema> non empty instead.
Fixed by Option<SchemaEvoContext>.
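A minimal sketch of the agreed pattern (class and method names are illustrative, not the PR's actual code): the context is simply absent when schema evolution is disabled, so Optional.isPresent() plays the role of the removed enabled flag.

```java
import java.util.Optional;

public class SchemaEvoContextSketch {
  // Hypothetical stand-in for the PR's SchemaEvoContext data class.
  static final class SchemaEvoContext {
    final String querySchema; // placeholder for the real internal schema
    SchemaEvoContext(String querySchema) {
      this.querySchema = querySchema;
    }
  }

  // Empty Optional = evolution disabled; no boolean field needed.
  static Optional<SchemaEvoContext> getSchemaEvoContext(boolean evolutionEnabled) {
    if (!evolutionEnabled) {
      return Optional.empty();
    }
    return Optional.of(new SchemaEvoContext("query-schema"));
  }

  public static void main(String[] args) {
    System.out.println(getSchemaEvoContext(true).isPresent());  // true
    System.out.println(getSchemaEvoContext(false).isPresent()); // false
  }
}
```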
}

private static final class ActualFields {
  private final String[] names;
Personally I don't like the style where we introduce too many intermediate POJOs.
@@ -61,7 +68,10 @@ public static RowDataProjection instance(LogicalType[] types, int[] positions) {
public RowData project(RowData rowData) {
  GenericRowData genericRowData = new GenericRowData(this.fieldGetters.length);
  for (int i = 0; i < this.fieldGetters.length; i++) {
Can we avoid affecting the normal code path for non-evolution reads? Something like
public RowData project(RowData rowData, CastMap castMap)
Fair enough. Fixed. I extended RowDataProjection instead of adding project(RowData rowData, CastMap castMap), because it is convenient to keep the CastMap inside the projection.
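A rough sketch of that design choice, with plain Java arrays standing in for Flink's RowData and all names hypothetical: the base projection keeps the non-evolution path untouched, while a subclass owns the cast map and applies per-field casts during projection.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

public class ProjectionSketch {
  // Normal path: copies fields untouched, no evolution logic involved.
  static class RowProjection {
    Object[] project(Object[] row) {
      Object[] out = new Object[row.length];
      for (int i = 0; i < row.length; i++) {
        out[i] = convert(i, row[i]); // hook; a no-op in the base class
      }
      return out;
    }

    Object convert(int pos, Object value) {
      return value;
    }
  }

  // Used only when schema evolution is enabled; the casts live inside
  // the projection, as in the comment above.
  static class CastingProjection extends RowProjection {
    private final UnaryOperator<Object>[] casts; // null entry = no cast at that position

    CastingProjection(UnaryOperator<Object>[] casts) {
      this.casts = casts;
    }

    @Override
    Object convert(int pos, Object value) {
      return casts[pos] == null ? value : casts[pos].apply(value);
    }
  }

  public static void main(String[] args) {
    @SuppressWarnings("unchecked")
    UnaryOperator<Object>[] casts =
        new UnaryOperator[] {null, v -> ((Integer) v).longValue()};
    RowProjection projection = new CastingProjection(casts);
    Object[] projected = projection.project(new Object[] {"id-1", 42});
    System.out.println(Arrays.toString(projected)); // second field is now a Long
  }
}
```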
@hudi-bot run azure
I merged all supported modes into one patch and reworked the pull request.
What is the purpose of the pull request
This PR adds support for reading by Flink when comprehensive schema evolution (RFC-33) is enabled and operations such as add column, rename column, change column type, and drop column have been applied.
Supported mode: batch/cow/snapshot
Brief change log
Modified CopyOnWriteInputFormat. Now, during opening, it calculates the schema of the file; if this schema differs from the queried schema, it creates a cast map. After reading the file, type conversion is performed according to the constructed map.

Verify this pull request
This change added tests and can be verified as follows:
TestCastMap to verify that type conversion is correct.
ITTestReadWithSchemaEvo to verify that a table with added, renamed, casted, and dropped columns is read as expected. This test uses TestSpark3DDL to prepare data, so it works only with -P scala-2.12,spark3.2, since TestSpark3DDL works only with it.

Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.