[HUDI-1441] - HoodieAvroUtils - rewrite() is not handling evolution o… #2309
Conversation
Codecov Report
@@            Coverage Diff             @@
##             master    #2309      +/- ##
============================================
- Coverage     52.43%   52.21%   -0.22%
- Complexity     2653     2669      +16
============================================
  Files           332      335       +3
  Lines         14892    15014     +122
  Branches       1496     1512      +16
============================================
+ Hits           7808     7839      +31
- Misses         6458     6548      +90
- Partials        626      627       +1
danny0405
left a comment
Thanks for the contribution @nbalajee, I have left some comments.
Based on the Avro documentation, I believe this is the right direction for the fix: http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
Should we check schema equivalence first, since the old schema may also be a UNION?
Can we also add test cases there, for two cases:
- the nested record has fewer fields than expected
- the nested record has more fields than expected
Added two test cases.
UNION is predominantly used for an optional record - the [null, {record}] pattern. In the next step of the recursion, the record performs the schema equivalence check. Hence, I thought we wouldn't need the equivalence check here. Please let me know if I missed something.
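As a sketch of that pattern (plain java.util stand-ins rather than Avro's Schema/GenericRecord classes; rewriteNullableRecord and rewriteRecord are hypothetical names, not the Hudi API): for a [null, record] union, the union level only dispatches - a null datum stays null, and the per-field equivalence handling happens in the record branch it recurses into.

```java
import java.util.*;

public class Main {

  // Record-level rewrite against the evolved field list: known fields are
  // copied, newly added fields default to null (absent map keys read as null).
  static Map<String, Object> rewriteRecord(Map<String, Object> datum, List<String> newFields) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String f : newFields) {
      out.put(f, datum.get(f));
    }
    return out;
  }

  // Union case for the common optional-record pattern ["null", record]:
  // no equivalence check of its own, just dispatch to the matching branch.
  @SuppressWarnings("unchecked")
  static Object rewriteNullableRecord(Object datum, List<String> recordFields) {
    if (datum == null) {
      return null; // null branch of the union
    }
    // Non-null branch: recurse into the record, where the field checks happen.
    return rewriteRecord((Map<String, Object>) datum, recordFields);
  }

  public static void main(String[] args) {
    List<String> evolved = Arrays.asList("id", "name", "addedField");
    Map<String, Object> old = new LinkedHashMap<>();
    old.put("id", 1);
    old.put("name", "a");

    System.out.println(rewriteNullableRecord(null, evolved)); // null
    System.out.println(rewriteNullableRecord(old, evolved));  // {id=1, name=a, addedField=null}
  }
}
```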
Yeah, for UNION I think the check in the nested recursion is valid.
@nbalajee : let me know if this is feasible. Is NULL mandatory in any UNION schema? I mean, can there be a UNION schema without NULL in it? If yes, this would fail, in my understanding.
@danny0405 awesome! Thanks for jumping in :)
@n3nash - Correct. When reading the parquet files, Hudi uses the writer schema (the evolved schema with added fields) so that optional fields are automatically populated with null (native schema evolution). For rewrite(), Hudi use cases always pass the writerSchema, so we don't run into this issue. An added advantage of fixing this the correct way is that Hudi will be able to support "external schema evolution" (read the parquet using the reader schema, then rewrite the records using the evolved schema).
The forced type coercion (List) datum does not have any protection logic; thus, the method rewriteEvolvedFields(Object datum, Schema newSchema) works on the assumption that the datum's schema is compatible with newSchema. This implicit contract should be kept by the invoker, so I would suggest adding a note to the Javadoc to make it clearer. The same applies to the Map type.
If the datum contained inside the Array/Map is of a primitive type, then no additional schema compatibility check is required. If the datum is a Record, then we are already checking whether newSchema matches the record schema by comparing the hash values.
In other words, similar to the UNION, the nested recursion takes care of the datum contained in the ARRAY or MAP as well.
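The container-level behavior described here can be sketched in plain Java (a simplified stand-in using java.util lists and maps rather than Avro classes; rewriteElement, rewriteArray, and rewriteMap are hypothetical names, not the Hudi API): ARRAY and MAP only recurse into each element or value, and any record-level compatibility handling happens inside that nested call.

```java
import java.util.*;

public class Main {

  // Element-level rewrite: records (modeled here as maps) get the per-field
  // treatment against the evolved field list; primitives pass through
  // untouched, so the container level needs no extra check for them.
  @SuppressWarnings("unchecked")
  static Object rewriteElement(Object datum, List<String> newFields) {
    if (datum instanceof Map) { // stand-in for a nested record
      Map<String, Object> out = new LinkedHashMap<>();
      for (String f : newFields) {
        out.put(f, ((Map<String, Object>) datum).get(f)); // absent -> null
      }
      return out;
    }
    return datum; // primitive: nothing to rewrite
  }

  // ARRAY: recurse per element.
  static List<Object> rewriteArray(List<?> datum, List<String> newFields) {
    List<Object> copy = new ArrayList<>();
    for (Object e : datum) {
      copy.add(rewriteElement(e, newFields));
    }
    return copy;
  }

  // MAP: recurse per value, keeping the keys.
  static Map<String, Object> rewriteMap(Map<String, ?> datum, List<String> newFields) {
    Map<String, Object> copy = new LinkedHashMap<>();
    for (Map.Entry<String, ?> entry : datum.entrySet()) {
      copy.put(entry.getKey(), rewriteElement(entry.getValue(), newFields));
    }
    return copy;
  }

  public static void main(String[] args) {
    System.out.println(rewriteArray(Arrays.asList(1, 2, 3), Collections.emptyList())); // [1, 2, 3]
    Map<String, Object> rec = new LinkedHashMap<>();
    rec.put("a", 1);
    System.out.println(rewriteMap(Collections.singletonMap("k", rec), Arrays.asList("a", "b")));
    // {k={a=1, b=null}}
  }
}
```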
The test is good; although it misses the ARRAY and MAP type check, I think it is okay.
+1 - Since the differences come from the RECORD, and ARRAY/MAP are containers, testing against the RECORD is sufficient, is my thinking as well.
Thanks for the update @nbalajee, I have left some comments.
@nbalajee : do you think we can get this landed by the upcoming release?
The first operand of assertEquals should be the expected value; to avoid confusion, I would suggest using assertThat(variable, is(expected)) instead.
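To illustrate why the argument order matters (plain Java here, since neither JUnit nor Hamcrest is assumed on the classpath; failureMessage is a hypothetical helper that mimics the shape of JUnit's failure message): assertEquals builds its message from the positions, so swapping the operands makes a failing test report the wrong "expected" value, which assertThat(variable, is(expected)) avoids by construction.

```java
public class Main {

  // Mimics the shape of JUnit's assertEquals failure message, which is
  // generated purely from argument position.
  static String failureMessage(Object expected, Object actual) {
    return "expected: <" + expected + "> but was: <" + actual + ">";
  }

  public static void main(String[] args) {
    int actual = 5;   // value produced by the code under test
    int expected = 7; // value the test author intended

    // Correct order: the message blames the code under test.
    System.out.println(failureMessage(expected, actual)); // expected: <7> but was: <5>

    // Swapped order: the message reads backwards and misleads the reader.
    System.out.println(failureMessage(actual, expected)); // expected: <5> but was: <7>
  }
}
```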
Swapped the parameters to fix assertEquals.
I can wait for the update from @nbalajee ~
@nbalajee You may need to rebase your branch first in order to avoid unnecessary commits.
FYI, we have another patch that fixes schema evolution in Hudi. Not sure if there are any overlaps, though, as I haven't looked into this patch.
@nsivabalan - #2334 is using the HoodieAvroUtils.rewrite() functionality to rewrite the generic records. There is no overlap between the issues being corrected.
@nbalajee @nsivabalan can you please summarize the status of this PR? Is it ready to go after rebasing, or should we spend more time in review?
@n3nash @nbalajee @prashantwason @nsivabalan this PR sounds important, but can someone please summarize its state? Also, this needs a rebase with only the necessary changes.
The changes overall look good from my side, but this PR needs a rebase because it pulls in many conflicting commits from the master branch.
@nbalajee : can you rebase and update the PR?
Left some comments for clarification.
Also, a general question on Avro schema evolution.
- I know we can evolve a field from int to long. But can we evolve a field of array[int] to array[long]?
- Can a primitive field be evolved to a union with null in it?
In general, depending on the answers to these, does this patch handle whatever evolution is compatible?
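For reference, the Avro specification's schema-resolution rules answer both questions: a writer's primitive type may be promoted to the reader's (int to long/float/double, long to float/double, float to double, and string/bytes interchangeably); arrays are resolved recursively on their item schemas, so array[int] can be read as array[long]; and a non-union writer schema resolves against a union reader schema if it matches one of the branches, so a primitive can evolve to [null, primitive]. A minimal sketch of the primitive promotion table from the spec (isPromotable is a hypothetical helper, not an Avro API):

```java
import java.util.*;

public class Main {

  // Primitive promotion table from the Avro schema-resolution rules:
  // each writer type maps to the reader types it may be promoted to.
  static final Map<String, Set<String>> PROMOTIONS = new HashMap<>();
  static {
    PROMOTIONS.put("int", new HashSet<>(Arrays.asList("long", "float", "double")));
    PROMOTIONS.put("long", new HashSet<>(Arrays.asList("float", "double")));
    PROMOTIONS.put("float", new HashSet<>(Collections.singletonList("double")));
    PROMOTIONS.put("string", new HashSet<>(Collections.singletonList("bytes")));
    PROMOTIONS.put("bytes", new HashSet<>(Collections.singletonList("string")));
  }

  // True if a writer value of type `writer` can be read as type `reader`.
  static boolean isPromotable(String writer, String reader) {
    return writer.equals(reader)
        || PROMOTIONS.getOrDefault(writer, Collections.emptySet()).contains(reader);
  }

  public static void main(String[] args) {
    System.out.println(isPromotable("int", "long"));  // true
    System.out.println(isPromotable("long", "int"));  // false (narrowing is not allowed)
  }
}
```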
    mapCopy.put(entry.getKey(), rewriteEvolvedFields(entry.getValue(), newSchema.getValueType()));
  }
  return mapCopy;
default:
I am slowly gaining knowledge in schema evolution, so this might be a noob question. Apart from the RECORD datatype, how else could other datatypes evolve? For example, a field of array datatype in the old schema has to be an array in the new schema, right, and it can never evolve to anything else (in a compatible manner)? In the case of RECORD, I understand there could be more fields, and hence we need a deep copy. What I am trying to ask is: for the UNION, ARRAY, and MAP data types, can we just fetch the old value and add it to the new record rather than doing a deep copy? Can you help clarify?
@Test
public void testRewriteToShorterRecord() throws Exception {
I thought from the Javadocs of HoodieAvroUtils.rewriteRecord(GenericRecord oldRecord, Schema newSchema) that the rewrite can happen from the old schema to the new schema and not the other way round. Can you help me understand why we allow a backwards-incompatible rewrite here?
@nbalajee : I see that you have a lot of extra commits in this patch. Can you fix it and rebase? In the interest of testing it out, I pulled in your changes locally and have put up a draft PR #2982 with some minor fixes in addition to your patch. In case you want to create a clean PR, it might be useful to you.
I verified some of the unknowns.
nsivabalan
left a comment
Made some minor optimizations in the new PR; do check it out: #2982
What is the purpose of the pull request
If the schema contains nested records, the HoodieAvroUtils rewrite() function copies the record fields as-is from the oldRecord to the newRecord. If the fields of the nested record have evolved, this results in a SchemaCompatibilityException or ArrayIndexOutOfBoundsException.
Brief change log
Modify HoodieAvroUtils rewrite() to rewrite the evolved fields, with new/evolved fields initialized to null.
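The intended behavior can be sketched with a simplified model (plain java.util maps standing in for GenericRecord, and a nested field-name map standing in for the Schema; this rewrite is a hypothetical illustration, not the actual HoodieAvroUtils code): nested record fields are recursed into against the evolved schema, so newly added fields come out as null and dropped fields are skipped instead of throwing.

```java
import java.util.*;

public class Main {

  // newSchema maps each field name to either null (a leaf field) or another
  // schema map (a nested record field). Fields missing from the datum come
  // out as null; fields missing from the schema are simply not copied.
  @SuppressWarnings("unchecked")
  static Map<String, Object> rewrite(Map<String, Object> datum, Map<String, Object> newSchema) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> field : newSchema.entrySet()) {
      Object value = datum.get(field.getKey());
      if (field.getValue() instanceof Map && value instanceof Map) {
        // Nested record: recurse with the evolved nested schema instead of
        // copying the old nested record as-is.
        out.put(field.getKey(),
            rewrite((Map<String, Object>) value, (Map<String, Object>) field.getValue()));
      } else {
        out.put(field.getKey(), value); // leaf field; absent -> null
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> nestedSchema = new LinkedHashMap<>();
    nestedSchema.put("street", null);
    nestedSchema.put("zip", null);      // field added by evolution
    Map<String, Object> schema = new LinkedHashMap<>();
    schema.put("id", null);
    schema.put("address", nestedSchema);

    Map<String, Object> oldAddress = new LinkedHashMap<>();
    oldAddress.put("street", "main st");
    oldAddress.put("country", "US");    // field removed by evolution
    Map<String, Object> oldRecord = new LinkedHashMap<>();
    oldRecord.put("id", 1);
    oldRecord.put("address", oldAddress);

    System.out.println(rewrite(oldRecord, schema));
    // {id=1, address={street=main st, zip=null}}
  }
}
```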
Verify this pull request
This pull request is already covered by existing tests, such as TestHoodieAvroUtils.
Added testRewriteToEvolvedNestedRecord() and testRewriteToShorterRecord()
Committer checklist
[x] Has a corresponding JIRA in PR title & commit
[x] Commit message is descriptive of the change
[x] CI is green
[ ] Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.