
Conversation

@xiarixiaoyao (Contributor) commented Sep 15, 2021


What is the purpose of the pull request

Support full schema evolution for Hudi:

1) Support Spark 3 DDL, including:
alter statement:
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
The following type promotions are supported:

int => long/float/double/string
long => float/double/string
float => double/string
double => string/decimal
decimal => decimal/string
string => date/decimal
date => string

ALTER TABLE table1 ALTER COLUMN a.b.c SET NOT NULL
ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
add statement:
ALTER TABLE table1 ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
rename:
ALTER TABLE table1 RENAME COLUMN a.b.c TO x
drop:
ALTER TABLE table1 DROP COLUMN a.b.c
ALTER TABLE table1 DROP COLUMNS a.b.c, x, y
set/unset Properties:
ALTER TABLE table SET TBLPROPERTIES ('table_property' = 'property_value');
ALTER TABLE table UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key');

2) Support MOR (incremental/realtime/read-optimized) read/write
3) Support COW (incremental/realtime) read/write
4) Support MOR compaction

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@xiarixiaoyao (Contributor, Author)

Kindly pinging @bvaradar @leesf: could you help me review this code? Thanks.

@yanghua (Contributor) commented Sep 15, 2021

Is this PR about RFC-33? Have we reached an agreement on the design?

Contributor

What is this dependency used for?

Contributor Author

It is used to cache the history schemas. Caffeine performs better than the Guava cache.
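For context, a minimal sketch of how a Caffeine cache for history schemas could look (class and method names here are illustrative assumptions, not the PR's actual code):

  import com.github.benmanes.caffeine.cache.Cache;
  import com.github.benmanes.caffeine.cache.Caffeine;

  // Illustrative sketch only: names are hypothetical, not the PR's classes.
  public class HistorySchemaCache {
    // Bounded cache keyed by commit instant time, holding the serialized schema for that instant.
    private final Cache<String, String> cache = Caffeine.newBuilder()
        .maximumSize(1_000)
        .build();

    public String getOrLoad(String instantTime) {
      // Loads from storage only on a cache miss; later lookups are served from memory.
      return cache.get(instantTime, this::loadFromStorage);
    }

    private String loadFromStorage(String instantTime) {
      // Placeholder: read the persisted historical schema for this instant.
      throw new UnsupportedOperationException("storage lookup not shown in this sketch");
    }
  }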

@leesf (Contributor) Sep 15, 2021

Can PositionType be an enum?

Contributor Author

OK, will change it.
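For reference, a minimal sketch of such an enum (the constant names are illustrative and may differ from the PR's actual values):

  // Illustrative only: the constants mirror the positions the ALTER COLUMN DDL can express.
  public enum PositionType {
    FIRST,        // ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
    AFTER,        // ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
    NO_OPERATION  // position left unchanged
  }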

@xiarixiaoyao (Contributor, Author)

@yanghua yes, we have reached a preliminary consensus.

@vinothchandar changed the title from "[HUDI-2429][WIP] Full schema evolution" to "[RFC-33] [HUDI-2429][WIP] Full schema evolution" on Sep 15, 2021
@vinothchandar (Member)

cc @codope @bvaradar let's use this as a basis, evolve our design, and move forward.

@yanghua we are close, I think. A few open items remain that can hopefully be resolved soon.
You can follow along the discussion in the comments here:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution

@codope also has a bunch of these working on Presto and Trino already. Flink would be a good thing to tackle as well. Does Flink use the Hive record readers? cc @danny0405 as well.

Contributor

What does parentName mean here?

Contributor Author

Suppose we have a nested column: the parentName of a.b.c is a.b, the parentName of a.b is a, and the parentName of a is the empty string "".
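A tiny helper capturing that convention (hypothetical illustration, not the PR's code):

  // "a.b.c" -> "a.b", "a.b" -> "a", "a" -> ""
  static String parentName(String fullName) {
    int lastDot = fullName.lastIndexOf('.');
    return lastDot < 0 ? "" : fullName.substring(0, lastDot);
  }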

@danny0405 (Contributor)

> Does Flink use the Hive record readers? cc @danny0405 as well.

Flink does not use the Hive record readers now ~

@xiarixiaoyao (Contributor, Author)

@danny0405 I hope you can help us complete full schema evolution for Flink, thanks.

@xiarixiaoyao force-pushed the schema branch 2 times, most recently from ba407cc to 1a0b7d8, on September 16, 2021
@codope (Member) left a comment

@xiarixiaoyao This is great! Let's work on this together. I've taken one pass but have yet to look at the tests. A few high-level comments:

  1. Let's add docs for all the public classes and APIs.
  2. Does the merge schema action handle evolution of non-leaf fields in nested fields? For example, if a.b.c is renamed to a.d.c.
  3. IIUC, the patch has not yet handled old or existing schema compatibility as mentioned in the RFC, right?
  4. Since schema history is being changed not only at write time but also at read time, we need to think about both writer and reader concurrency.

Member

What about thread-safety? If there are concurrent readers, should construction of the TreeMap be synchronized?
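For illustration, one way to avoid exposing a partially constructed map to concurrent readers (a sketch with assumed types, not the PR's code) is to build the TreeMap fully and then publish an immutable view through a volatile field:

  import java.util.Collections;
  import java.util.Map;
  import java.util.TreeMap;

  class VersionedSchemas {
    // Readers only ever see a fully built, unmodifiable map; the volatile write publishes it safely.
    private volatile Map<Long, String> schemasByVersion = Collections.emptyMap();

    void reload(Map<Long, String> loaded) {
      schemasByVersion = Collections.unmodifiableMap(new TreeMap<>(loaded));
    }

    Map<Long, String> current() {
      return schemasByVersion;
    }
  }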

Member

This else block can be simplified to return fileSchema.findfullName(nameId), as we're doing the same thing in both cases, right?

Contributor Author

Yes, it can be done, but then the algorithm becomes hard to read. I chose not to simplify this code.

Member

We need to call this for all table and query types, not just the MOR snapshot relation. So, basically at the following places:

DefaultSource#getBaseFileOnlyView
IncrementalRelation
MergeOnReadIncrementalRelation

@xiarixiaoyao (Contributor, Author) Sep 17, 2021

Yes, we need to do that. But I think this PR is more like a prototype of the id-schema, and I hope we can reach a basic agreement on it here. Full Spark adaptation will come in another PR.

Contributor

@xiarixiaoyao: It would be helpful for review if you could list the gaps remaining in the current PR before it reaches full Spark support.

Contributor Author

OK, the COW incremental/snapshot views and the MOR incremental/read-optimized views will be added.

@xiarixiaoyao (Contributor, Author) commented Sep 17, 2021

@codope thanks for your review. Let me answer some of your questions first; please forgive me for being busy today, I need to modify something for the Z-order PR.

> Let's add docs for all the public classes and APIs.

OK, I will add them.

> Does the merge schema action handle evolution of non-leaf fields in nested fields? For example, if a.b.c is renamed to a.d.c.

That is an unusual requirement: if you want to change a.b.c to a.d.c, why not use spark.sql(s"alter table ${tableName} rename column a.b to d")? We do support handling all the non-leaf fields.

> IIUC, the patch has not yet handled old or existing schema compatibility as mentioned in the RFC, right?

This situation is already handled. Please see TestSpark3DDL: we first create a table and insert some data into it, so no id-schema is produced yet; then we perform a schema change and the id-schema is produced. The test result confirms this.

> Since schema history is being changed not only at write time but also at read time, we need to think about both writer and reader concurrency.

I am a little doubtful here: Hudi is snapshot isolated, so maybe we only need to deal with concurrency between writers. If I am wrong, please correct me, thanks.

Now for the most important question: I currently tend to use the metadata table to store historical schemas. As we know, the metadata table uses HFile to store data, and HFile has very good point-query performance. What do you suggest?

@xiarixiaoyao (Contributor, Author)

@leesf @codope @bvaradar I have updated the code, added docs to all public methods, and resolved most comments. Please help review the code again.

@bvaradar (Contributor) left a comment

@xiarixiaoyao: Thanks for opening the PR. Great start!

I have left comments. Regarding COW, I did not see changes in HoodieMergeHandle for when Hudi reads the old parquet file, applies the merge, and writes a new version of the file. I guess this is still WIP.

We can try to make this PR functionally complete.

Contributor

instead of timeline, use metadata.getActiveTimeline()

Contributor

Similar logic also needs to be present in BaseCommitExecutor.commit.

Contributor

We need to introduce a config to enable/disable schema evolution. Only if the config is enabled should we let any side effects of schema evolution take effect.

Contributor Author

agree

Contributor Author

> We need to introduce a config to enable/disable schema evolution. Only if the config is enabled should we let any side effects of schema evolution take effect.

Maybe there is no need to introduce a config: schema evolution only takes effect when the user explicitly executes an ALTER SQL statement or the alter API; otherwise, everything goes through the original process.

Contributor

I think for implicit schema changes we would want to control whether to keep the current behavior or introduce the new one. The same is true for the reader side. It is better to simply have a gatekeeper config to control the entire feature; when the feature matures, we can turn it on by default and deprecate the config later.

Contributor Author

Agree, will deal with it. Thanks.
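A sketch of what such a gatekeeper config could look like, using the hoodie.schema.evolution.enable key that appears later in this thread (the ConfigProperty placement and wording are assumptions, not the PR's final code):

  import org.apache.hudi.common.config.ConfigProperty;

  // Assumed sketch: when false, both writer- and reader-side schema evolution paths are bypassed.
  public static final ConfigProperty<Boolean> SCHEMA_EVOLUTION_ENABLE = ConfigProperty
      .key("hoodie.schema.evolution.enable")
      .defaultValue(false)
      .withDocumentation("Gatekeeper flag for the full schema evolution feature; "
          + "disabled by default until the feature matures.");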

Contributor

When will this case happen? SCHEMA_KEY is passed but not LATESTSCHEMA?

Contributor Author

It happens when we have not modified the table schema explicitly. LATESTSCHEMA is only produced when we run an ALTER SQL statement or the alter API.

Contributor

High level: instead of storing historical schemas in the commit file, it is better to store them as separate files under a new directory under .hoodie. Something like .hoodie/schema/...

Contributor Author

OK, but first we need to reach agreement on this issue; vinothchandar suggested we save that information into the metadata table.
@vinothchandar what is your opinion on this?

@bvaradar (Contributor) Sep 28, 2021

@xiarixiaoyao: The synchronous metadata table support is still in the works and is getting ready. It's better to decouple the two projects and, at the same time, avoid adding historical schemas to commit files. As an interim measure, let's write them as separate files under .hoodie/schema, and then we can do a follow-up PR whenever the synchronous metadata table is done. cc @vinothchandar @codope

Contributor Author

ok, will do it. thanks

Contributor

For the many conversions we are doing (Avro, InternalSchema, Spark, Parquet, ...), instead of implementing these methods in helper/utils classes, let's rename the classes as "xxxxConverter" and make sure only the conversion functions are defined there.

Contributor Author

ok

Contributor

Add docs to all public APIs

Contributor

Looks like this class performs both serialization and deserialization. If that is the case, let's move those methods to a SerDeHelper class.

@xiarixiaoyao (Contributor, Author) Sep 28, 2021

Done: docs added, and InternalSchemaParser renamed to SerDeHelper.

Contributor

Consider using the Visitor design pattern with classes instead of static methods.

Contributor Author

ok
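For illustration, a minimal class-based visitor over a toy schema tree (all names are hypothetical, not the PR's API); it assigns the kind of name-to-id mapping previously produced by static helper methods:

  import java.util.ArrayList;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;

  // Toy field tree: a field has a name and optional nested children.
  class Field {
    final String name;
    final List<Field> children = new ArrayList<>();
    Field(String name) { this.name = name; }
  }

  // Visitor contract: called once per field with its fully qualified name.
  interface SchemaVisitor {
    void visit(String fullName, Field field);
  }

  // One concrete visitor: hands out ids in traversal order.
  class NameToIdVisitor implements SchemaVisitor {
    final Map<String, Integer> nameToId = new LinkedHashMap<>();
    @Override
    public void visit(String fullName, Field field) {
      nameToId.put(fullName, nameToId.size());
    }
  }

  // Traversal is kept separate from the work performed at each node.
  final class SchemaWalker {
    static void walk(Field field, String prefix, SchemaVisitor visitor) {
      String fullName = prefix.isEmpty() ? field.name : prefix + "." + field.name;
      visitor.visit(fullName, field);
      for (Field child : field.children) {
        walk(child, fullName, visitor);
      }
    }
  }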

Contributor

Same comment as the one in HoodieUnMergedLogRecordScanner

Contributor Author

Spark MOR/COW does not use HoodieUnMergedLogRecordScanner; only Hive/Presto use it. This PR only covers the Spark adaptation.

Contributor

nit: You can simplify with something like
schemaUtil.getTableAvroSchema, internalSchemaOpt.orElse(null)

Contributor Author

done

@xiarixiaoyao (Contributor, Author)

@bvaradar thanks for your review. I will try to solve these problems. One small question: do we need to add all the Spark engine adaptations in this PR? If needed, full adaptation of the Spark engine (both MOR and COW) will be added.

@bvaradar (Contributor)

Thanks @xiarixiaoyao. Yes, it makes sense to add all the Spark adaptation related changes to this PR.

@xiarixiaoyao force-pushed the schema branch 2 times, most recently from b2d8b3f to b5e0b22, on September 27, 2021
@bvaradar (Contributor)

@xiarixiaoyao: Can you add commits to this PR instead of squashing? It makes it easier for us to find the delta changes. We can do a final squash before landing the PR.

@xiarixiaoyao (Contributor, Author)

@bvaradar @codope @leesf could you please help review this PR again? Thanks.
Code changes:

  1. Support MOR (incremental/realtime/read-optimized) read/write.
  2. Support COW (incremental/realtime) read/write.
  3. Support Spark 3 DDL, including:
    alter statement:
    • ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
    • ALTER TABLE table1 ALTER COLUMN a.b.c SET NOT NULL
    • ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
    • ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
    • ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
    • ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
      add statement:
    • ALTER TABLE table1 ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
      rename:
    • ALTER TABLE table1 RENAME COLUMN a.b.c TO x
      drop:
    • ALTER TABLE table1 DROP COLUMN a.b.c
    • ALTER TABLE table1 DROP COLUMNS a.b.c, x, y
      set/unset Properties:
    • ALTER TABLE table SET TBLPROPERTIES ('table_property' = 'property_value');
    • ALTER TABLE table UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key');
  4. Support Spark 2 DDL.
  5. Add FileBaseInternalSchemasManager to manage history schemas and save the historySchemas under ".hoodie/.schema"; we no longer need to save historySchemas into the commit file.
  6. Add a segment lock to TableInternalSchemaUtils to support concurrent reads and writes of the cache (a sketch follows this list).
  7. Rename mergeSchemaAction to SchemaMerger; move the helper methods out of TableChanges into a helper class, so TableChanges is now clean; use the visitor pattern to produce nameToId for InternalSchema; plus other small fixes.
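As referenced in item 6, an illustrative segment-lock sketch (assumed names, not the actual TableInternalSchemaUtils code): the cache is split into N segments, each guarded by its own read/write lock, so lookups for different tables do not all contend on one global lock.

  import java.util.concurrent.locks.ReentrantReadWriteLock;

  class SegmentLocks {
    private final ReentrantReadWriteLock[] locks;

    SegmentLocks(int segments) {
      locks = new ReentrantReadWriteLock[segments];
      for (int i = 0; i < segments; i++) {
        locks[i] = new ReentrantReadWriteLock();
      }
    }

    // Pick the segment by hashing the table base path.
    ReentrantReadWriteLock lockFor(String tablePath) {
      return locks[Math.floorMod(tablePath.hashCode(), locks.length)];
    }
  }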

Remaining problems: add more UTs for this PR, and add support for bootstrap tables.

@bvaradar forgive me, this change was too large, so I still squashed it. Subsequent modifications will be added as separate commits.

@xiarixiaoyao force-pushed the schema branch 2 times, most recently from 0fe3038 to a423a63, on September 29, 2021
@xiarixiaoyao (Contributor, Author)

@hudi-bot run azure

@xiarixiaoyao (Contributor, Author)

@hudi-bot run azure

@xiarixiaoyao (Contributor, Author)

@bvaradar I have rebased the code and added a config to control schema evolution. The test failure is not related to this PR.

@codope (Member) commented Oct 11, 2021

@xiarixiaoyao It would really help if you could share a gist showing the schema evolution steps. For example, earlier I tried this and add/drop/reorder was working but rename was not. Now, after the latest push, when I try the same but with the spark-sql command below, I don't see the schema directory being created.

./bin/spark-sql --jars ${JARS} \
  --packages org.apache.spark:spark-avro_2.11:2.4.7,com.github.ben-manes.caffeine:caffeine:2.9.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.kryoserializer.buffer.max=1024m' --conf "spark.memory.storageFraction=0.8" \
  --conf "hoodie.schema.evolution.enable=true" \
  --conf spark.rdd.compress=true --driver-memory 2g --conf "spark.memory.fraction=0.8"

@xiarixiaoyao (Contributor, Author)

@codope thanks for trying this PR. Note that we should not pass --conf "hoodie.schema.evolution.enable=true" on the command line: this conf does not start with spark/hadoop/hive, so Spark will ignore it. See the test case; we can execute a set command to enable schema evolution, i.e. sql("set hoodie.schema.evolution.enable=true"). Another thing: it is very strange that rename does not work, since TestSpark3DDL already contains a rename example, and Spark 2 has no equivalent test case. Maybe we can discuss this problem on Slack, thanks.

@rubenssoto

This feature is amazing, long-awaited functionality.

@vinothchandar added the status:in-progress (Work in progress) label on Oct 26, 2021
@hudi-bot (Collaborator) commented Nov 5, 2021

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xiarixiaoyao changed the title from "[RFC-33] [HUDI-2429][WIP] Full schema evolution" to "[RFC-33] [HUDI-2429][WIP] Support full Schema evolution for Spark/Hive" on Feb 7, 2022
@xiarixiaoyao (Contributor, Author)

Closing this PR, as it is too old and we now have a new PR for this feature.
