[HUDI-3982] Comprehensive schema evolution in flink when read/batch/cow/snapshot#5443
trushev wants to merge 4 commits into apache:master
Conversation
@danny0405 @bvaradar If you have free time, could you please help review this PR? Thanks very much.
What do you mean when you say
This PR covers the following case:
  <artifactId>spark-hive_${scala.binary.version}</artifactId>
  <scope>test</scope>
</dependency>
<dependency>
Why introduce the Spark dependency in the Flink pom?
To prepare test data. Currently, only the Spark engine provides a way to change the schema and write new data afterwards.
I think once full support of schema evolution is implemented, we can remove this dependency by rewriting the test in pure Flink.
}

private SchemaEvoContext getSchemaEvoContext() {
  if (!conf.getBoolean(FlinkOptions.SCHEMA_EVOLUTION_ENABLED)) {
Returns Option<SchemaEvoContext> instead.
Fixed. Removed the enabled field; isPresent now means enabled.
LogicalTypeRoot to = toType.getTypeRoot();
switch (to) {
  case BIGINT: {
    // Integer => Long
What is the philosophy of these mappings?
Assume the schema evolution DDL
alter table t1 alter column val type bigint
which changes the type of val from int to bigint.
We want to be able to read old data. To do that we need to cast val from int to long,
otherwise an exception is thrown:
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    at org.apache.flink.table.data.GenericRowData.getLong(GenericRowData.java:154)
This class is an analogue of org.apache.hudi.client.utils.SparkInternalSchemaConverter#convertColumnVectorType, which converts Spark's types.
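The int-to-long mapping discussed above can be illustrated with a small, self-contained sketch. This is plain Java, not the PR's actual CastMap class; the class name, the string type keys, and the method names here are all hypothetical stand-ins for Flink's LogicalType machinery:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch: maps a (fromType -> toType) pair to a conversion
// function, mirroring how old int values must be widened to long after
// `alter table t1 alter column val type bigint`.
public class CastMapSketch {
  private final Map<String, Function<Object, Object>> casts = new HashMap<>();

  public void add(String from, String to, Function<Object, Object> fn) {
    casts.put(from + "->" + to, fn);
  }

  public Object castIfNeeded(String from, String to, Object value) {
    Function<Object, Object> fn = casts.get(from + "->" + to);
    return fn == null ? value : fn.apply(value); // no entry = no cast needed
  }

  public static void main(String[] args) {
    CastMapSketch map = new CastMapSketch();
    // Integer => Long, as in the BIGINT branch shown in the diff above
    map.add("INT", "BIGINT", v -> ((Integer) v).longValue());
    Object widened = map.castIfNeeded("INT", "BIGINT", 42);
    System.out.println(widened.getClass().getSimpleName() + " " + widened); // prints "Long 42"
  }
}
```

Without such a cast, a reader that calls getLong on a row still holding an Integer fails with exactly the ClassCastException quoted above.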
 * Data class to pass schema evolution info from table source to input format.
 */
public final class SchemaEvoContext implements Serializable {
  private final boolean enabled;
Is this clazz necessary? The enabled flag can be replaced by a non-empty Option<querySchema> instead.
Is this clazz necessary?
I think yes. The schema evolution methods presented here moved from CopyOnWriteInputFormat to SchemaEvoContext so they can be reused in MergeOnReadInputFormat.
The enabled flag can be replaced by Option<querySchema> non empty instead.
Fixed by Option<SchemaEvoContext>.
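A minimal sketch of the agreed pattern (class and method names are illustrative, not the PR's actual code): the context is simply absent when schema evolution is disabled, so Optional.isPresent() plays the role of the removed enabled flag.

```java
import java.util.Optional;

public class SchemaEvoContextSketch {
  // Hypothetical stand-in for the PR's SchemaEvoContext data class.
  static final class SchemaEvoContext {
    final String querySchema; // placeholder for the real internal schema
    SchemaEvoContext(String querySchema) {
      this.querySchema = querySchema;
    }
  }

  // Empty Optional = evolution disabled; no boolean field needed.
  static Optional<SchemaEvoContext> getSchemaEvoContext(boolean evolutionEnabled) {
    if (!evolutionEnabled) {
      return Optional.empty();
    }
    return Optional.of(new SchemaEvoContext("query-schema"));
  }

  public static void main(String[] args) {
    System.out.println(getSchemaEvoContext(true).isPresent());  // true
    System.out.println(getSchemaEvoContext(false).isPresent()); // false
  }
}
```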
}

private static final class ActualFields {
  private final String[] names;
Personally I don't like the style where we introduce too many intermediate POJOs.
@@ -61,7 +68,10 @@ public static RowDataProjection instance(LogicalType[] types, int[] positions) {
public RowData project(RowData rowData) {
  GenericRowData genericRowData = new GenericRowData(this.fieldGetters.length);
  for (int i = 0; i < this.fieldGetters.length; i++) {
Can we avoid affecting the normal code path for non-evolution reads? Something like
public RowData project(RowData rowData, CastMap castMap)
Fair enough. Fixed. I extended RowDataProjection instead of adding project(RowData rowData, CastMap castMap), because it is convenient to keep the CastMap inside the projection.
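A rough sketch of that design choice, with plain Java arrays standing in for Flink's RowData and all names hypothetical: the base projection keeps the non-evolution path untouched, while a subclass owns the cast map and applies per-field casts during projection.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

public class ProjectionSketch {
  // Normal path: copies fields untouched, no evolution logic involved.
  static class RowProjection {
    Object[] project(Object[] row) {
      Object[] out = new Object[row.length];
      for (int i = 0; i < row.length; i++) {
        out[i] = convert(i, row[i]); // hook; a no-op in the base class
      }
      return out;
    }

    Object convert(int pos, Object value) {
      return value;
    }
  }

  // Used only when schema evolution is enabled; the casts live inside
  // the projection, as in the comment above.
  static class CastingProjection extends RowProjection {
    private final UnaryOperator<Object>[] casts; // null entry = no cast at that position

    CastingProjection(UnaryOperator<Object>[] casts) {
      this.casts = casts;
    }

    @Override
    Object convert(int pos, Object value) {
      return casts[pos] == null ? value : casts[pos].apply(value);
    }
  }

  public static void main(String[] args) {
    @SuppressWarnings("unchecked")
    UnaryOperator<Object>[] casts =
        new UnaryOperator[] {null, v -> ((Integer) v).longValue()};
    RowProjection projection = new CastingProjection(casts);
    Object[] projected = projection.project(new Object[] {"id-1", 42});
    System.out.println(Arrays.toString(projected)); // second field is now a Long
  }
}
```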
@hudi-bot run azure
I merged all supported modes into one patch and reworked the pull request.
What is the purpose of the pull request
This PR adds support for reading by Flink when comprehensive schema evolution (RFC-33) is enabled and operations such as add column, rename column, change column type, and drop column have been applied.
Supported mode: batch/cow/snapshot
Brief change log
Modified CopyOnWriteInputFormat. Now, during opening, it calculates the schema of the file; if this schema differs from the queried schema, it creates a cast map. After reading the file, type conversion is performed according to the constructed map.

Verify this pull request
This change added tests and can be verified as follows:
TestCastMap to verify that type conversion is correct.
ITTestReadWithSchemaEvo to verify that a table with added, renamed, casted, and dropped columns is read as expected. This test uses TestSpark3DDL to prepare data, so it works only with -P scala-2.12,spark3.2, since TestSpark3DDL works only with it.

Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.