
Conversation

@nbalajee (Contributor) commented Dec 8, 2020

…f a nested record field.

What is the purpose of the pull request

If the schema contains nested records, the HoodieAvroUtils rewrite() function copies nested record fields as-is from the oldRecord to the newRecord. If fields of the nested record have evolved, this results in a SchemaCompatibilityException or ArrayIndexOutOfBoundsException.

Brief change log

Modify HoodieAvroUtils rewrite() to rewrite the evolved fields, with new/evolved fields initialized to null.
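
For illustration, a minimal sketch of the approach, assuming a recursive helper shaped like the rewriteEvolvedFields(Object datum, Schema newSchema) discussed in the review below; this is a sketch, not the PR's exact code:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RewriteSketch {
  // Recurse into nested RECORD fields instead of copying them as-is; fields that
  // exist only in the new schema are left null.
  public static Object rewriteEvolvedFields(Object datum, Schema newSchema) {
    if (newSchema.getType() == Schema.Type.RECORD) {
      GenericRecord oldRecord = (GenericRecord) datum;
      GenericRecord newRecord = new GenericData.Record(newSchema);
      for (Schema.Field field : newSchema.getFields()) {
        // Copy only fields present in the old record; new/evolved fields stay null.
        if (oldRecord.getSchema().getField(field.name()) != null) {
          newRecord.put(field.name(),
              rewriteEvolvedFields(oldRecord.get(field.name()), field.schema()));
        }
      }
      return newRecord;
    }
    return datum; // primitives (and, in the real patch, UNION/ARRAY/MAP) handled elsewhere
  }
}
```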

Verify this pull request

This pull request is already covered by existing tests, such as TestHoodieAvroUtils.
Added testRewriteToEvolvedNestedRecord() and testRewriteToShorterRecord(); a minimal standalone reproduction of the scenario is sketched below.
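
Not the PR's exact tests, but a minimal reproduction of the evolved-nested-record case (the schemas and class name here are made up):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.avro.HoodieAvroUtils;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertNull;

public class TestNestedRewriteSketch {
  private static final String OLD = "{\"type\":\"record\",\"name\":\"Outer\",\"fields\":["
      + "{\"name\":\"inner\",\"type\":{\"type\":\"record\",\"name\":\"Inner\",\"fields\":["
      + "{\"name\":\"a\",\"type\":\"string\"}]}}]}";
  private static final String NEW = "{\"type\":\"record\",\"name\":\"Outer\",\"fields\":["
      + "{\"name\":\"inner\",\"type\":{\"type\":\"record\",\"name\":\"Inner\",\"fields\":["
      + "{\"name\":\"a\",\"type\":\"string\"},"
      + "{\"name\":\"b\",\"type\":[\"null\",\"long\"],\"default\":null}]}}]}";

  @Test
  public void rewritesEvolvedNestedField() {
    Schema oldSchema = new Schema.Parser().parse(OLD);
    Schema newSchema = new Schema.Parser().parse(NEW);
    GenericRecord inner = new GenericData.Record(oldSchema.getField("inner").schema());
    inner.put("a", "value");
    GenericRecord outer = new GenericData.Record(oldSchema);
    outer.put("inner", inner);
    // Before this patch, rewriting a record whose nested record gained a field
    // could throw; with the fix, the new nested field defaults to null.
    GenericRecord rewritten = HoodieAvroUtils.rewriteRecord(outer, newSchema);
    assertNull(((GenericRecord) rewritten.get("inner")).get("b"));
  }
}
```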

Committer checklist

  • [x] Has a corresponding JIRA in PR title & commit

  • [x] Commit message is descriptive of the change

  • [x] CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-io commented Dec 8, 2020

Codecov Report

Merging #2309 (9abc305) into master (0c821fe) will decrease coverage by 0.21%.
The diff coverage is 22.83%.

Impacted file tree graph

```diff
@@             Coverage Diff              @@
##             master    #2309      +/-   ##
============================================
- Coverage     52.43%   52.21%   -0.22%     
- Complexity     2653     2669      +16     
============================================
  Files           332      335       +3     
  Lines         14892    15014     +122     
  Branches       1496     1512      +16     
============================================
+ Hits           7808     7839      +31     
- Misses         6458     6548      +90     
- Partials        626      627       +1     
```
| Flag | Coverage Δ | Complexity Δ |
|------|------------|--------------|
| hudicli | 38.83% <ø> (ø) | 0.00 <ø> (ø) |
| hudiclient | 100.00% <ø> (ø) | 0.00 <ø> (ø) |
| hudicommon | 54.72% <22.22%> (-0.45%) | 0.00 <13.00> (ø) |
| hudihadoopmr | 33.52% <100.00%> (+0.23%) | 0.00 <0.00> (ø) |
| huditimelineservice | 65.30% <ø> (ø) | 0.00 <ø> (ø) |
| hudiutilities | 69.65% <ø> (ø) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
|----------------|------------|--------------|
| .../apache/hudi/common/model/ClusteringOperation.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...e/hudi/common/table/log/HoodieFileSliceReader.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...di/common/table/timeline/HoodieActiveTimeline.java | 69.62% <0.00%> (-4.51%) | 41.00 <0.00> (ø) |
| ...che/hudi/common/table/timeline/HoodieTimeline.java | 89.13% <0.00%> (-4.06%) | 43.00 <0.00> (ø) |
| ...a/org/apache/hudi/common/util/ClusteringUtils.java | 89.06% <0.00%> (-1.42%) | 18.00 <0.00> (ø) |
| .../java/org/apache/hudi/common/util/StringUtils.java | 66.66% <ø> (ø) | 14.00 <0.00> (ø) |
| ...hudi/utilities/schema/FilebasedSchemaProvider.java | 82.35% <ø> (ø) | 5.00 <0.00> (ø) |
| ...g/apache/hudi/common/model/WriteOperationType.java | 53.12% <33.33%> (-2.05%) | 2.00 <0.00> (ø) |
| ...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java | 52.34% <45.71%> (-1.59%) | 45.00 <7.00> (+7.00) ⬇️ |
| .../apache/hudi/common/config/SerializableSchema.java | 57.89% <57.89%> (ø) | 6.00 <6.00> (?) |

... and 6 more

@n3nash (Contributor) commented Dec 9, 2020

@nbalajee Can you please explain why we need this? If the latest schema is passed (which is the case for Hudi now), is this still a problem?
@bvaradar can you please take a look at this one?

@danny0405 (Contributor) left a comment

Thanks for the contribution @nbalajee, I have left some comments.

Contributor

Based on the Avro documentation, I believe this is the right direction for the fix: http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
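
For illustration: per those resolution rules, a field added to a nested record needs a nullable type and a null default so that records written without it still resolve (the field names below are hypothetical):

```java
import org.apache.avro.Schema;

public class SchemaResolutionExample {
  public static void main(String[] args) {
    // Evolved nested record: "added" is new, optional, and defaults to null,
    // which is what Avro schema resolution requires for reader-added fields.
    String evolved = "{\"type\":\"record\",\"name\":\"Inner\",\"fields\":["
        + "{\"name\":\"existing\",\"type\":\"string\"},"
        + "{\"name\":\"added\",\"type\":[\"null\",\"long\"],\"default\":null}]}";
    Schema schema = new Schema.Parser().parse(evolved);
    System.out.println(schema.getField("added").schema()); // ["null","long"]
  }
}
```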

Contributor

Should we check the schema equivalence first, because the old schema may also be a UNION?

Contributor

Can we also add test cases there, for two cases:

  • the nested record has fewer fields than expected
  • the nested record has more fields than expected

Contributor Author

Added two test cases.

UNION is predominantly used for the optional-record pattern, [null, {record}]. In the next step of the recursion, the record performs the schema equivalence check, so I thought we wouldn't need the equivalence check here. Please let me know if I missed something.
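
A sketch of that recursion step, assuming the optional-field [null, X] pattern described above and the rewriteEvolvedFields() sketch from the PR description; a union with several non-null branches would need extra branch selection, which is the concern raised later in this thread:

```java
import org.apache.avro.Schema;

// Assumes the [null, X] optional pattern: pick the non-null branch and let the
// RECORD case of rewriteEvolvedFields() do the schema equivalence check.
public static Object rewriteUnion(Object datum, Schema unionSchema) {
  if (datum == null) {
    return null;
  }
  for (Schema branch : unionSchema.getTypes()) {
    if (branch.getType() != Schema.Type.NULL) {
      return RewriteSketch.rewriteEvolvedFields(datum, branch);
    }
  }
  return datum;
}
```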

Contributor

Yeah, for UNION I think the check in the nested recursion is valid.

Contributor

@nbalajee: let me know if this is feasible. Is NULL mandatory in any UNION schema? I mean, can there be a UNION schema without NULL in it? If yes, this would fail, in my understanding.

@vinothchandar (Member)

@danny0405 awesome! Thanks for jumping in :)

@nbalajee (Contributor Author)

> @nbalajee Can you please explain why we need this? If the latest schema is passed (which is the case for Hudi now), is this still a problem?
> @bvaradar can you please take a look at this one?

@n3nash - Correct. When reading parquet files, Hudi uses the writer schema (the evolved schema with added fields) so that optional fields are automatically populated with null (native schema evolution). For rewrite(), Hudi use cases always pass the writerSchema, so we don't run into this issue.

An added advantage of fixing this the correct way is that Hudi will be able to support "external schema evolution": read parquet using the reader schema, then rewrite the records using the evolved schema.
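
A hedged sketch of that flow, reusing the rewriteRecord() API mentioned later in this review (the helper name is hypothetical):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.avro.HoodieAvroUtils;

// Hypothetical helper illustrating "external schema evolution": records read from
// parquet with the (older) reader schema are rewritten into the evolved schema.
// With this fix, newly added nested fields come out null instead of triggering
// SchemaCompatibilityException or ArrayIndexOutOfBoundsException.
public static GenericRecord evolveExternally(GenericRecord readWithReaderSchema, Schema evolvedSchema) {
  return HoodieAvroUtils.rewriteRecord(readWithReaderSchema, evolvedSchema);
}
```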

@vinothchandar vinothchandar added the area:schema Schema evolution and data types label Dec 15, 2020

Contributor

The forced type cast (List) datum has no protection logic, so the method rewriteEvolvedFields(Object datum, Schema newSchema) works assuming that the datum's schema is compatible with newSchema. This implicit contract must be kept by the invoker; I would suggest adding a note to the Javadoc to make it clear. Same with the Map type.

Contributor Author

If the datum contained inside the Array/Map is of a primitive type, then no additional schema compatibility check is required. If the datum is a Record, then we are already checking whether newSchema matches the record schema, by comparing the hash values.

In other words, similar to the UNION case, the nested recursion takes care of the datum contained in the ARRAY or MAP as well.
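
A sketch of that container handling, matching the shape of the diff context quoted later in the thread; the unchecked casts rely on the caller-side compatibility contract noted in the previous comment, and it again assumes the rewriteEvolvedFields() helper sketched in the description:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.avro.Schema;

@SuppressWarnings("unchecked")
public static Object rewriteContainer(Object datum, Schema newSchema) {
  switch (newSchema.getType()) {
    case ARRAY:
      List<Object> listCopy = new ArrayList<>();
      for (Object element : (List<Object>) datum) {
        // Recurse per element: primitives pass through, records get rewritten.
        listCopy.add(RewriteSketch.rewriteEvolvedFields(element, newSchema.getElementType()));
      }
      return listCopy;
    case MAP:
      Map<Object, Object> mapCopy = new HashMap<>();
      for (Map.Entry<Object, Object> entry : ((Map<Object, Object>) datum).entrySet()) {
        mapCopy.put(entry.getKey(), RewriteSketch.rewriteEvolvedFields(entry.getValue(), newSchema.getValueType()));
      }
      return mapCopy;
    default:
      return datum;
  }
}
```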

Contributor

The test is good; although it misses the ARRAY and MAP type check, I think it is okay.

Contributor Author

+1. Since the differences come from the RECORD, and ARRAY/MAP are just containers, my thinking as well is that testing against the RECORD is sufficient.

@danny0405 (Contributor)

Thanks for the update @nbalajee, I have left some comments.

@nsivabalan (Contributor)

@nbalajee: do you think we can get this landed by the upcoming release?
@danny0405: Can you please take care of seeing this through once Balaji addresses your comments? We are looking to get this in by the upcoming release (we will be cutting a release in a week's time).

Contributor

The first operand of assertEquals should be the expected value; to avoid confusion, I would suggest using assertThat(variable, is(expected)) instead.
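
A small illustration of the suggestion (the values here are placeholders):

```java
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

public class AssertStyleExample {
  @Test
  public void comparesRecords() {
    String expected = "value";
    String actual = "value";
    // assertEquals silently accepts swapped operands and then produces a
    // confusing failure message; assertThat makes the roles explicit.
    assertEquals(expected, actual);   // expected must come first
    assertThat(actual, is(expected)); // actual first, then the matcher
  }
}
```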

Contributor Author

Swapped the parameters to fix assertEquals.

@danny0405 (Contributor)

> @nbalajee: do you think we can get this landed by the upcoming release?
> @danny0405: Can you please take care of seeing this through once Balaji addresses your comments? We are looking to get this in by the upcoming release (we will be cutting a release in a week's time).

I can; waiting for the update from @nbalajee ~

Satish Kotha and others added 11 commits December 28, 2020 07:24
Co-authored-by: Wenning Ding <[email protected]>

- Added support for bulk insert v2 with datasource v2 api in Spark 3.0.0.
…ons that have no incremental changes (apache#2371)

* Incremental Query should work even when there are partitions that have no incremental changes

Co-authored-by: Sivabalan Narayanan <[email protected]>
…ning tests in test suite framework (apache#2168)

* trigger rebuild

* [HUDI-1156] Remove unused dependencies from HoodieDeltaStreamerWrapper Class (apache#1927)

* Adding support for validating records and long running tests in test suite framework

* Adding partial validate node

* Fixing spark session initiation in Validate nodes

* Fixing validation

* Adding hive table validation to ValidateDatasetNode

* Rebasing with latest commits from master

* Addressing feedback

* Addressing comments

Co-authored-by: lamber-ken <[email protected]>
Co-authored-by: linshan-ma <[email protected]>
…pache#2275)

* [HUDI-1354] Block updates and replace on file groups in clustering

* [HUDI-1354]  Block updates and replace on file groups in clustering
@danny0405 (Contributor)

@nbalajee You may need to rebase your branch first in order to avoid unnecessary commits.

lw309637554 and others added 3 commits December 29, 2020 11:32
* [HUDI-1350] Support Partition level delete API in HUDI

* [HUDI-1350] Support Partition level delete API in HUDI base InsertOverwriteCommitAction

* [HUDI-1350] Support Partition level delete API in HUDI base InsertOverwriteCommitAction
@nsivabalan (Contributor)

FYI, we have another patch that fixes schema evolution in Hudi. Not sure if there are any overlaps, though, as I haven't looked into this patch.
#2334

@nbalajee (Contributor Author) commented Jan 4, 2021

> FYI, we have another patch that fixes schema evolution in Hudi. Not sure if there are any overlaps, though, as I haven't looked into this patch.
> #2334

@nsivabalan - #2334 is using the HoodieAvroUtils.rewrite() functionality to rewrite the generic records. There is no overlap between the issues being corrected.

@vinothchandar vinothchandar self-assigned this Jan 29, 2021
@vinothchandar (Member)

@nbalajee @nsivabalan can you please summarize the status of this PR? Is it ready to go after rebasing, or should we spend more time on the review?

@vinothchandar vinothchandar added the priority:critical Production degraded; pipelines stalled label Feb 11, 2021
@vinothchandar vinothchandar assigned n3nash and unassigned bvaradar Mar 14, 2021
@vinothchandar (Member)

@n3nash @nbalajee @prashantwason @nsivabalan this PR sounds important, but can someone please summarize its state? Also, this needs a rebase with only the necessary changes.

@danny0405 (Contributor)

> @n3nash @nbalajee @prashantwason @nsivabalan this PR sounds important, but can someone please summarize its state? Also, this needs a rebase with only the necessary changes.

The changes overall look good from my side, but this PR needs a rebase because it pulls in many conflicting commits from the master branch.

@vinothchandar vinothchandar removed the priority:critical Production degraded; pipelines stalled label Mar 16, 2021
@nsivabalan (Contributor)

@nbalajee: can you rebase and update the PR?

@nsivabalan (Contributor) left a comment

Left some comments for clarification.
Also, a general question on Avro schema evolution:

  • I know we can evolve a field from int to long. But can we evolve a field of array[int] to array[long]?
  • Can a primitive field be evolved to a union with null in it?

In general, depending on the answers to these, does this patch handle every compatible evolution?


```java
mapCopy.put(entry.getKey(), rewriteEvolvedFields(entry.getValue(), newSchema.getValueType()));
}
return mapCopy;
default:
```
@nsivabalan (Contributor) commented May 23, 2021

I am slowly gaining knowledge in schema evolution, so this might be a noob question. Apart from the RECORD datatype, how else could other datatypes evolve? For example, a field of array datatype in the old schema has to be an array in the new schema, right? It can never evolve to anything else (in a compatible manner). In the case of RECORD, I understand there could be more fields, and hence we need a deep copy. What I am trying to ask is: for the union, array, and map data types, can we just fetch the old value and add it to the new record rather than doing a deep copy? Can you help clarify?

```java
}

@Test
public void testRewriteToShorterRecord() throws Exception {
```
Contributor

I thought from the Javadocs of HoodieAvroUtils.rewriteRecord(GenericRecord oldRecord, Schema newSchema) that the rewrite can happen from the old schema to the new schema and not the other way round. Can you help me understand why we allow a backwards-incompatible rewrite here?

@nsivabalan (Contributor)

@nbalajee: I see that you have a lot of extra commits in this patch. Can you fix it and rebase? In the interest of testing it out, I pulled in your changes locally and have put up a draft PR #2982 with some minor fixes in addition to your patch. In case you want to create a clean PR, it might be useful to you.

@nsivabalan (Contributor)

I verified some of the unknowns (a quick way to check these is sketched below):

  • a primitive can be evolved to a union with a null default
  • array[int] can be evolved to array[long]
  • map[int] can be evolved to map[long]
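
One way to verify such promotions (an assumed harness, not part of this PR) is Avro's built-in SchemaCompatibility checker:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class PromotionCheck {
  public static void main(String[] args) {
    Schema writer = new Schema.Parser().parse("{\"type\":\"array\",\"items\":\"int\"}");
    Schema reader = new Schema.Parser().parse("{\"type\":\"array\",\"items\":\"long\"}");
    // int -> long is a legal promotion, so reading array[int] data with an
    // array[long] reader schema is compatible.
    System.out.println(
        SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType());
  }
}
```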

@nsivabalan (Contributor) left a comment

Made some minor optimizations in the new PR, #2982; do check it out.

@hudi-bot (Collaborator) commented Nov 5, 2021

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan nsivabalan closed this Jan 20, 2022

Labels

area:schema Schema evolution and data types
