[RFC-33] [HUDI-2429][WIP] Support full Schema evolution for Spark/Hive #3668
Conversation
Is this PR about RFC-33? Have we reached an agreement on the design?
hudi-common/pom.xml
What is this dependency used for?
It is used to cache the history schema; Caffeine is better than the Guava cache.
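For illustration, a minimal sketch of how such a Caffeine cache might be wired up; the class, key type, and loader below are hypothetical and not taken from this PR:

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class HistorySchemaCacheSketch {

  // Hypothetical cache keyed by commit instant time; values hold a serialized history schema.
  private static final Cache<String, String> HISTORY_SCHEMA_CACHE = Caffeine.newBuilder()
      .maximumSize(1_000)                        // bound the number of cached schemas
      .expireAfterAccess(10, TimeUnit.MINUTES)   // evict entries that have not been read recently
      .build();

  public static String getHistorySchema(String instantTime) {
    // Compute on a cache miss only; the loader is a stand-in for reading the persisted schema.
    return HISTORY_SCHEMA_CACHE.get(instantTime, key -> loadHistorySchema(key));
  }

  private static String loadHistorySchema(String instantTime) {
    return "{}"; // placeholder
  }
}
```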
Can PositionType be an enum?
OK, will change it.
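For reference, a minimal sketch of what the enum form could look like; the constant names below are assumptions loosely based on the position-change DDL this PR supports, not the actual implementation:

```java
// Hypothetical enum describing where a column should be positioned; real values may differ.
public enum PositionType {
  FIRST,         // move the column to the first position
  AFTER,         // place the column after a named sibling column
  BEFORE,        // place the column before a named sibling column
  NO_OPERATION   // leave the position unchanged
}
```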
@yanghua Yes, we reached a preliminary consensus.
cc @codope @bvaradar Let's use this as a basis, evolve our design, and move forward. @yanghua we are close, I think; there are a few open items that can hopefully be resolved soon. @codope also has a bunch of these working on Presto and Trino already. Flink would be a good thing to tackle as well. Does Flink use the Hive record readers? cc @danny0405 as well.
What does parentName mean here?
Suppose we have a nested column. The parentName of a.b.c is a.b; the parentName of a.b is a; the parentName of a is "".
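To make the convention concrete, here is a small illustrative helper (not part of the PR) that derives the parentName from a dotted column path:

```java
public final class ParentNameExample {

  // Returns everything before the last '.', or "" for a top-level column.
  static String parentName(String fullName) {
    int lastDot = fullName.lastIndexOf('.');
    return lastDot < 0 ? "" : fullName.substring(0, lastDot);
  }

  public static void main(String[] args) {
    System.out.println(parentName("a.b.c")); // a.b
    System.out.println(parentName("a.b"));   // a
    System.out.println(parentName("a"));     // (empty string)
  }
}
```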
Flink does not use the Hive record readers now.
@danny0405 Hope you can help us complete the full schema evolution for Flink, thanks.
codope left a comment:
@xiarixiaoyao This is great! Let's work on this together. I've taken one pass, but I have yet to look at the tests. A few high-level comments:
- Let's add docs for all the public classes and APIs.
- Does the merge schema action handle evolution of non-leaf fields in nested fields? For example, if a.b.c is renamed to a.d.c.
- IIUC, the patch has not yet handled old or existing schema compatibility as mentioned in the RFC, right?
- Since schema history is being changed not only at write time but also at read time, we need to think about both writer and reader concurrency.
What about thread safety? If there are concurrent readers, should the construction of the TreeMap be synchronized?
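Purely as an illustration of the concern (not the PR's code), one option is a concurrent sorted map, so concurrent readers never observe a partially built structure and the floor lookup by commit time needs no extra locking:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class SchemaVersionIndexSketch {

  // Sorted, thread-safe map from commit time to serialized schema; the field name is hypothetical.
  private final ConcurrentSkipListMap<Long, String> schemasByCommitTime = new ConcurrentSkipListMap<>();

  public void register(long commitTime, String schemaJson) {
    schemasByCommitTime.put(commitTime, schemaJson);
  }

  // Return the latest schema whose commit time is <= the queried time.
  public String findSchemaAsOf(long commitTime) {
    Map.Entry<Long, String> entry = schemasByCommitTime.floorEntry(commitTime);
    return entry == null ? null : entry.getValue();
  }
}
```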
This else block can be simplified to return fileSchema.findfullName(nameId), as we're doing the same thing in both cases, right?
Yes, it can be done, but it would make the algorithm harder to read, so I chose not to simplify that code.
We need to call this for all table and query types, not just the MOR snapshot relation. So, basically, at the following places:
- DefaultSource#getBaseFileOnlyView
- IncrementalRelation
- MergeOnReadIncrementalRelation
Yes, we need to do that. But I think this PR is more like a prototype of the id-schema; I hope we can reach a basic agreement on this PR. Full Spark adaptation will be another PR.
@xiarixiaoyao: It would help the review if you could list the gaps remaining in the current PR before full Spark support is reached.
OK, COW incremental/snapshot view and MOR incremental/optimize view will be added.
@codope Thanks for your review. Let me answer some of your questions first; please forgive me for being busy today, as I need to modify something for the Z-order PR.
"IIUC, the patch has not yet handled old or existing schema compatibility as mentioned in the RFC, right?"
"Since schema history is being changed not only at write time but also at read time, we need to think about both writer and reader concurrency."
Now there is the most important question: I now tend to use the metadata table to store the historical schema. As we know, the metadata table uses HFile to store data, and HFile has very good point-query performance. What do you suggest?
bvaradar left a comment:
@xiarixiaoyao: Thanks for opening the PR. Great start!
I have left comments. Regarding COW, I did not see the changes in HoodieMergeHandle for when Hudi reads the old parquet file, applies the merge, and writes a new version of the file. I guess this is still WIP.
We can try to make this PR functionally complete.
instead of timeline, use metadata.getActiveTimeline()
Similar logic also needs to be present in BaseCommitExecutor.commit.
We need to introduce a config to enable/disable schema evolution. Only if the config is enabled should we let any side effects of schema evolution take effect.
agree
"We need to introduce a config to enable/disable schema evolution. Only if the config is enabled should we let any side effects of schema evolution take effect."
Maybe there is no need to introduce a config: schema evolution will take effect only when the user executes an alter SQL statement or the alter API explicitly; otherwise, everything goes through the original process.
I think for implicit schema changes we would want to control whether to keep the current behavior or introduce the new one. The same is true for the reader side. It is better to simply have a gatekeeper config to control the entire feature; when the feature matures, we can turn it on by default and deprecate the config later.
Agree, will deal with it. Thanks.
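To illustrate the gatekeeper idea, here is a sketch of how such a flag might be declared. It assumes Hudi's ConfigProperty builder, and the key name is the one mentioned later in this thread; the declaration itself is not the PR's actual code:

```java
import org.apache.hudi.common.config.ConfigProperty;

public class HoodieSchemaEvolutionConfigSketch {

  // Gatekeeper flag for the whole feature; only a sketch, not the config added by the PR.
  public static final ConfigProperty<Boolean> SCHEMA_EVOLUTION_ENABLE = ConfigProperty
      .key("hoodie.schema.evolution.enable")
      .defaultValue(false)
      .withDocumentation("When disabled, both writer and reader keep the original code path; "
          + "only when enabled do the full schema evolution changes take effect.");
}
```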
When will this case happen? SCHEMA_KEY is passed but not the LATESTSCHEMA?
This happens if we do not modify the table schema explicitly; LATESTSCHEMA is only produced when we execute an alter SQL statement or the alter API.
High level: instead of storing the historical schema in the commit file, it's better to store it as separate files under a new directory under .hoodie, something like .hoodie/schema/...
OK, but now we need to reach agreement on this issue. vinothchandar suggested we save that information in the metadata table.
@vinothchandar What is your opinion on this?
@xiarixiaoyao: The synchronous metadata table support is still in the works and getting ready. It's better to decouple the two projects and, at the same time, avoid adding the historical schema to commit files. As an interim measure, let's write separate files under .hoodie/schema, and then we can do a follow-up PR whenever the synchronous metadata table is done. cc @vinothchandar @codope
ok, will do it. thanks
For the many conversions we are doing (Avro, InternalSchema, Spark, Parquet, ...), instead of implementing these methods in helper/utils classes, let's rename the classes to "xxxxConverter" and make sure only the conversion functions are defined there.
ok
Add docs to all public APIs
Looks like this class performs both serialization and deserialization. If that is the case, let's move those methods to a SerDeHelper class.
Done: docs added, and InternalSchemaParser renamed to SerDeHelper.
Consider using the Visitor design pattern with classes instead of static methods.
ok
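As a reference for the suggestion above, a minimal sketch of the visitor shape; the type and method names are illustrative and do not match the PR's InternalSchema classes:

```java
// Hypothetical schema node hierarchy and visitor; the real schema types differ.
interface SchemaType {
  <T> T accept(SchemaVisitor<T> visitor);
}

class PrimitiveType implements SchemaType {
  final String name;
  PrimitiveType(String name) { this.name = name; }
  public <T> T accept(SchemaVisitor<T> visitor) { return visitor.visitPrimitive(this); }
}

class RecordType implements SchemaType {
  final java.util.List<SchemaType> fields;
  RecordType(java.util.List<SchemaType> fields) { this.fields = fields; }
  public <T> T accept(SchemaVisitor<T> visitor) { return visitor.visitRecord(this); }
}

// Each conversion (to Avro, Spark, Parquet, ...) becomes a visitor class rather than a static method.
interface SchemaVisitor<T> {
  T visitPrimitive(PrimitiveType type);
  T visitRecord(RecordType type);
}
```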
Same comment as the one in HoodieUnMergedLogRecordScanner
Spark MOR/COW does not use HoodieUnmergedLogRecordScanner; only Hive/Presto use it. This PR is only about the Spark adaptation.
nit: You can simplify with something like
schemaUtil.getTableAvroSchema, internalSchemaOpt.orElse(null)
done
@bvaradar Thanks for your review. I will try to solve these problems. One small question: do we need to add all the Spark engine adaptations in this PR? If needed, full adaptation of the Spark engine (both MOR and COW) will be added.
Thanks @xiarixiaoyao. Yes, it makes sense to add all the Spark adaptation related changes to this PR.
@xiarixiaoyao: Can you add commits to this PR instead of squashing? It makes it easy for us to find the delta changes. We can do a final squash before landing the PR.
@bvaradar @codope @leesf Could you please help me review this PR again? Thanks.
Remaining work: add more UTs for this PR and add support for bootstrap tables. @bvaradar forgive me, this change was too large, so I still squashed. Subsequent modifications will be in the form of added commits.
@hudi-bot run azure
@hudi-bot run azure
@bvaradar Rebased the code and added a config to control schema evolution. The test failure is not related to this PR.
@xiarixiaoyao It would really help if you could share a gist showing the schema evolution steps. For example, earlier I tried this and add/drop/reorder was working but rename was not. Now, after the latest push, when I try the same but with the below spark-sql command, I don't see the schema directory being created.
@codope Thanks for trying this PR. Note that we should not use --conf "hoodie.schema.evolution.enable=true": since this conf does not start with spark/hadoop/hive, Spark will ignore it. See the test case; we can execute a set command to enable schema evolution: sql("set hoodie.schema.evolution.enable=true"). Another thing: it's very strange that rename does not work for you; TestSpark3DDL already contains a rename example, but Spark 2 has no equivalent test case. Maybe we can discuss this problem on Slack, thanks.
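A hedged sketch of that workflow from a Spark session; only the set command and the ALTER syntax come from this thread, while the table and column names are invented for illustration:

```java
import org.apache.spark.sql.SparkSession;

public class SchemaEvolutionSetExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("schema-evolution-example")
        .master("local[2]")
        .getOrCreate();

    // Enable the feature through SQL; a --conf flag with this key would be ignored by Spark
    // because it does not start with spark/hadoop/hive.
    spark.sql("set hoodie.schema.evolution.enable=true");

    // Example DDL against a hypothetical Hudi table named hudi_tbl.
    spark.sql("ALTER TABLE hudi_tbl RENAME COLUMN old_name TO new_name");

    spark.stop();
  }
}
```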
This feature is amazing, long-awaited functionality.
Closing this PR, as it is too old, and we also have a new PR for this feature.
What is the purpose of the pull request
Support full schema evolution for hoodie:
1) Support Spark 3 DDL, including the following (a usage sketch follows this list):
alter statement:
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
supported type promotions:
int => long/float/double/string
long => float/double/string
float => double/string
double => string/decimal
decimal => decimal/string
string => date/decimal
date => string
ALTER TABLE table1 ALTER COLUMN a.b.c SET NOT NULL
ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
add statement:
ALTER TABLE table1 ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
rename:
ALTER TABLE table1 RENAME COLUMN a.b.c TO x
drop:
ALTER TABLE table1 DROP COLUMN a.b.c
ALTER TABLE table1 DROP COLUMNS a.b.c, x, y
set/unset Properties:
ALTER TABLE table SET TBLPROPERTIES ('table_property' = 'property_value');
ALTER TABLE table UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key');
2) Support MOR (incremental/realtime/optimize) read/write
3) Support COW (incremental/realtime) read/write
4) Support MOR compaction
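To tie the list above together, a hedged end-to-end sketch of a few of these statements issued from Spark; the statements follow the forms listed above, while the session setup and the enable flag (taken from the review discussion) are assumptions, and table1's columns are invented for illustration:

```java
import org.apache.spark.sql.SparkSession;

public class FullSchemaEvolutionDdlExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("full-schema-evolution-ddl")
        .master("local[2]")
        .getOrCreate();

    // Assumed gatekeeper flag discussed in the review thread.
    spark.sql("set hoodie.schema.evolution.enable=true");

    // Widen a nested column's type: int => long is one of the supported promotions.
    spark.sql("ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint");

    // Add a new column with a comment, then move it to the first position.
    spark.sql("ALTER TABLE table1 ADD COLUMNS (new_col string COMMENT 'added by evolution')");
    spark.sql("ALTER TABLE table1 ALTER COLUMN new_col FIRST");

    // Drop a nested column.
    spark.sql("ALTER TABLE table1 DROP COLUMN a.b.c");

    spark.stop();
  }
}
```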
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.