[HUDI-2815] add partial overwrite payload to support partial overwrit… #4724
Changes to `OverwriteWithLatestAvroPayload.java` (`@@ -40,8 +40,18 @@`), adding a nullable `schema` field and a new three-argument constructor:

```java
public class OverwriteWithLatestAvroPayload extends BaseAvroPayload
    implements HoodieRecordPayload<OverwriteWithLatestAvroPayload> {

  /**
   * The schema of the generic record.
   */
  public final String schema;

  public OverwriteWithLatestAvroPayload(GenericRecord record, Comparable orderingVal) {
    this(record, orderingVal, null);
  }

  public OverwriteWithLatestAvroPayload(GenericRecord record, Comparable orderingVal, String schema) {
    super(record, orderingVal);
    this.schema = schema;
  }

  public OverwriteWithLatestAvroPayload(Option<GenericRecord> record) {
    // ... (rest of hunk unchanged)
```
New file `PartialOverwriteWithLatestAvroPayload.java` (+141 lines):

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.common.model;

import org.apache.hudi.common.util.Option;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;

import java.io.IOException;
import java.util.List;
import java.util.Objects;

import static org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro;

/**
 * The only difference from {@link OverwriteNonDefaultsWithLatestAvroPayload} is that it merges
 * the latest non-null partial fields into the old record instead of replacing the whole record,
 * and merges non-null fields when pre-combining multiple records with the same record key
 * instead of choosing the latest record based on the ordering field.
 *
 * <p>Regarding #combineAndGetUpdateValue, assume a {@link GenericRecord} with row schema (f0 int, f1 int, f2 int).
 * The first record value is (1, 2, 3); the second record value is (4, 5, null), with the field f2 value null.
 * Calling #combineAndGetUpdateValue on the two records returns the record (4, 5, 3).
 * Note that the field f2 value is ignored because it is null.</p>
 *
 * <p>Regarding #preCombine, assume a {@link GenericRecord} with row schema (f0 int, f1 int, f2 int, o1 int)
 * and two initial {@link PartialOverwriteWithLatestAvroPayload}s with different ordering values.
 * The first record value is (1, null, 1, 1), with the field f1 value null; the second is (2, 2, null, 2),
 * with the field f2 value null. Calling #preCombine on the two records returns the record (2, 2, 1, 2).
 * Note:
 * <ol>
 *   <li>the field f0 value is 2 because the ordering value of the second record is bigger;</li>
 *   <li>the field f1 value is 2 because the f1 value of the first record is null;</li>
 *   <li>the field f2 value is 1 because the f2 value of the second record is null;</li>
 *   <li>the field o1 value is 2 because the ordering value of the second record is bigger.</li>
 * </ol>
 * </p>
 */
public class PartialOverwriteWithLatestAvroPayload extends OverwriteWithLatestAvroPayload {

  public PartialOverwriteWithLatestAvroPayload(GenericRecord record, Comparable orderingVal) {
    this(record, orderingVal, null);
  }

  public PartialOverwriteWithLatestAvroPayload(GenericRecord record, Comparable orderingVal, String schema) {
    super(record, orderingVal, schema);
  }

  public PartialOverwriteWithLatestAvroPayload(Option<GenericRecord> record) {
    super(record); // natural order
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException {
    if (recordBytes.length == 0) {
      return Option.empty();
    }

    GenericRecord incomingRecord = bytesToAvro(recordBytes, schema);
    if (isDeleteRecord(incomingRecord)) {
      return Option.empty();
    }

    GenericRecord currentRecord = (GenericRecord) currentValue;
    List<Schema.Field> fields = schema.getFields();
    fields.forEach(field -> {
      Object value = incomingRecord.get(field.name());
      // only non-null incoming fields overwrite the current record
      if (Objects.nonNull(value)) {
        currentRecord.put(field.name(), value);
      }
    });

    return Option.of(currentRecord);
  }

  @Override
  public int compareTo(OverwriteWithLatestAvroPayload oldValue) {
    return orderingVal.compareTo(oldValue.orderingVal);
  }

  @Override
  public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue) {
    if (null == this.schema || null == oldValue.schema) {
      return super.preCombine(oldValue);
    }

    try {
      Schema schema = new Schema.Parser().parse(this.schema);

      Option<IndexedRecord> incomingOption = getInsertValue(schema);
      Option<IndexedRecord> insertRecordOption = oldValue.getInsertValue(new Schema.Parser().parse(oldValue.schema));

      if (incomingOption.isPresent() && insertRecordOption.isPresent()) {
        GenericRecord currentRecord = (GenericRecord) incomingOption.get();
        GenericRecord insertRecord = (GenericRecord) insertRecordOption.get();
        boolean chooseCurrent = this.orderingVal.compareTo(oldValue.orderingVal) > 0;

        if (!isDeleteRecord(insertRecord) && !isDeleteRecord(currentRecord)) {
          schema.getFields().forEach(field -> {
            Object insertValue = insertRecord.get(field.name());
            Object currentValue = currentRecord.get(field.name());
            currentRecord.put(field.name(), mergeValue(currentValue, insertValue, chooseCurrent));
          });
          return new PartialOverwriteWithLatestAvroPayload(currentRecord,
              chooseCurrent ? this.orderingVal : oldValue.orderingVal, this.schema);
        } else {
          return isDeleteRecord(insertRecord) ? this : oldValue;
        }
      } else {
        return insertRecordOption.isPresent() ? oldValue : this;
      }
    } catch (IOException e) {
      return super.preCombine(oldValue);
    }
  }

  private Object mergeValue(Object left, Object right, Boolean chooseLeft) {
    if (null != left && null != right) {
      return chooseLeft ? left : right;
    } else {
      return null == left ? right : left;
    }
  }
}
```
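The null-skipping merge in `combineAndGetUpdateValue` can be modeled with plain maps. The following is a simplified sketch, not the payload itself: the class name `PartialMergeSketch` is hypothetical, `LinkedHashMap`s stand in for Avro records, and delete-record handling is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartialMergeSketch {

    // Copy the current record, then overlay every non-null field of the
    // incoming record; null incoming fields keep the current value.
    static Map<String, Object> combine(Map<String, Object> current, Map<String, Object> incoming) {
        Map<String, Object> merged = new LinkedHashMap<>(current);
        incoming.forEach((field, value) -> {
            if (value != null) {
                merged.put(field, value);
            }
        });
        return merged;
    }

    public static void main(String[] args) {
        // Mirrors the javadoc example: (1, 2, 3) combined with (4, 5, null).
        Map<String, Object> current = new LinkedHashMap<>();
        current.put("f0", 1);
        current.put("f1", 2);
        current.put("f2", 3);

        Map<String, Object> incoming = new LinkedHashMap<>();
        incoming.put("f0", 4);
        incoming.put("f1", 5);
        incoming.put("f2", null);

        System.out.println(combine(current, incoming)); // {f0=4, f1=5, f2=3}
    }
}
```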
New test file `PartialOverwriteWithLatestAvroPayloadTest.java` (+175 lines):

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.common.model;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;

import static org.junit.jupiter.api.Assertions.assertArrayEquals;
import static org.junit.jupiter.api.Assertions.assertEquals;

class PartialOverwriteWithLatestAvroPayloadTest {
  private Schema schema;

  @BeforeEach
  public void setUp() throws Exception {
    schema = Schema.createRecord("record", null, null, false, Arrays.asList(
        new Schema.Field("id", Schema.create(Schema.Type.STRING), "", null),
        new Schema.Field("partition", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", ""),
        new Schema.Field("ts", Schema.create(Schema.Type.LONG), "", null),
        new Schema.Field("_hoodie_is_deleted", Schema.create(Schema.Type.BOOLEAN), "", false),
        new Schema.Field("city", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", null),
        new Schema.Field("child", Schema.createArray(Schema.create(Schema.Type.STRING)), "", Collections.emptyList())
    ));
  }

  @Test
  public void testActiveRecordsWithoutSchema() throws IOException {
    GenericRecord record1 = new GenericData.Record(schema);
    record1.put("id", "1");
    record1.put("partition", "partition1");
    record1.put("ts", 0L);
    record1.put("_hoodie_is_deleted", false);
    record1.put("city", "NY0");
    record1.put("child", Arrays.asList("A"));

    GenericRecord record2 = new GenericData.Record(schema);
    record2.put("id", "2");
    record2.put("partition", "");
    record2.put("ts", 1L);
    record2.put("_hoodie_is_deleted", false);
    record2.put("city", "NY");
    record2.put("child", Collections.emptyList());

    GenericRecord record3 = new GenericData.Record(schema);
    record3.put("id", "2");
    record3.put("partition", "");
    record3.put("ts", 1L);
    record3.put("_hoodie_is_deleted", false);
    record3.put("city", "NY");
    record3.put("child", Arrays.asList("A"));

    PartialOverwriteWithLatestAvroPayload payload1 = new PartialOverwriteWithLatestAvroPayload(record1, 1);
    PartialOverwriteWithLatestAvroPayload payload2 = new PartialOverwriteWithLatestAvroPayload(record2, 2);
    assertEquals(payload1.preCombine(payload2), payload2);
    assertEquals(payload2.preCombine(payload1), payload2);

    assertEquals(record1, payload1.getInsertValue(schema).get());
    assertEquals(record2, payload2.getInsertValue(schema).get());

    assertEquals(payload1.combineAndGetUpdateValue(record2, schema).get(), record1);
    assertEquals(payload2.combineAndGetUpdateValue(record1, schema).get(), record3);
  }

  @Test
  public void testCompareFunction() {
    GenericRecord record = new GenericData.Record(schema);
    record.put("id", "1");
    record.put("partition", "partition1");
    record.put("ts", 0L);
    record.put("_hoodie_is_deleted", false);
    record.put("city", "NY0");
    record.put("child", Arrays.asList("A"));

    PartialOverwriteWithLatestAvroPayload payload1 = new PartialOverwriteWithLatestAvroPayload(record, 1);
    PartialOverwriteWithLatestAvroPayload payload2 = new PartialOverwriteWithLatestAvroPayload(record, 2);

    assertEquals(payload1.compareTo(payload2), -1);
    assertEquals(payload2.compareTo(payload1), 1);
    assertEquals(payload1.compareTo(payload1), 0);
  }

  @Test
  public void testActiveRecordsWithSchema() throws IOException {
    GenericRecord record1 = new GenericData.Record(schema);
    record1.put("id", "1");
    record1.put("partition", "partition1");
    record1.put("ts", 0L);
    record1.put("_hoodie_is_deleted", false);
    record1.put("city", null);
    record1.put("child", Arrays.asList("A"));

    GenericRecord record2 = new GenericData.Record(schema);
    record2.put("id", "2");
    record2.put("partition", null);
    record2.put("ts", 1L);
    record2.put("_hoodie_is_deleted", false);
    record2.put("city", "NY");
    record2.put("child", Collections.emptyList());

    GenericRecord expectedRecord = new GenericData.Record(schema);
    expectedRecord.put("id", "2");
    expectedRecord.put("partition", "partition1");
    expectedRecord.put("ts", 1L);
    expectedRecord.put("_hoodie_is_deleted", false);
    expectedRecord.put("city", "NY");
    expectedRecord.put("child", Collections.emptyList());

    PartialOverwriteWithLatestAvroPayload payload1 = new PartialOverwriteWithLatestAvroPayload(record1, 1, schema.toString());
    PartialOverwriteWithLatestAvroPayload payload2 = new PartialOverwriteWithLatestAvroPayload(record2, 2, schema.toString());
    PartialOverwriteWithLatestAvroPayload expectedPayload = new PartialOverwriteWithLatestAvroPayload(expectedRecord, 2, schema.toString());
    assertArrayEquals(payload1.preCombine(payload2).recordBytes, expectedPayload.recordBytes);
    assertArrayEquals(payload2.preCombine(payload1).recordBytes, expectedPayload.recordBytes);
    assertEquals(payload1.preCombine(payload2).orderingVal, expectedPayload.orderingVal);
    assertEquals(payload2.preCombine(payload1).orderingVal, expectedPayload.orderingVal);
  }

  @Test
  public void testDeletedRecord() throws IOException {
    GenericRecord record1 = new GenericData.Record(schema);
    record1.put("id", "1");
    record1.put("partition", "partition0");
    record1.put("ts", 0L);
    record1.put("_hoodie_is_deleted", false);
    record1.put("city", "NY0");
    record1.put("child", Collections.emptyList());

    GenericRecord delRecord1 = new GenericData.Record(schema);
    delRecord1.put("id", "2");
    delRecord1.put("partition", "partition1");
    delRecord1.put("ts", 1L);
    delRecord1.put("_hoodie_is_deleted", true);
    delRecord1.put("city", "NY0");
    delRecord1.put("child", Collections.emptyList());

    GenericRecord record2 = new GenericData.Record(schema);
    record2.put("id", "1");
    record2.put("partition", "partition0");
    record2.put("ts", 0L);
    record2.put("_hoodie_is_deleted", true);
    record2.put("city", "NY0");
    record2.put("child", Collections.emptyList());

    PartialOverwriteWithLatestAvroPayload payload1 = new PartialOverwriteWithLatestAvroPayload(record1, 1, schema.toString());
    PartialOverwriteWithLatestAvroPayload payload2 = new PartialOverwriteWithLatestAvroPayload(delRecord1, 2, schema.toString());

    assertEquals(payload1.preCombine(payload2), payload1);
    assertEquals(payload2.preCombine(payload1), payload1);
  }
}
```
Changes to `FlinkOptions.java` (`@@ -268,6 +268,12 @@ private FlinkOptions()`), adding the new config option:

```java
      .withDescription("Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting.\n"
          + "This will render any value set for the option in-effective");

  public static final ConfigOption<Boolean> PARTIAL_OVERWRITE_ENABLED = ConfigOptions
      .key("partial.overwrite.enabled")
      .booleanType()
      .defaultValue(false)
      .withDescription("Whether to enable the partial overwrite payload; when true, write.payload.class should be "
          + "org.apache.hudi.common.model.PartialOverwriteWithLatestAvroPayload");

  /**
   * Flag to indicate whether to drop duplicates before insert/upsert.
   * By default false to gain extra performance.
   */
```
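For illustration, a hypothetical Flink SQL table definition pairing the new flag with the payload class named in its description. The table name, columns, path, and the remaining connector options are placeholders, not taken from the PR:

```sql
CREATE TABLE hudi_sink (
  id STRING PRIMARY KEY NOT ENFORCED,
  city STRING,
  ts BIGINT
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_sink',  -- placeholder path
  'partial.overwrite.enabled' = 'true',
  'write.payload.class' = 'org.apache.hudi.common.model.PartialOverwriteWithLatestAvroPayload'
);
```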
Review discussion:

Reviewer: Why do we need a `compareTo` here?

Author: The previous logic of `data2.preCombine(data1)` is to return one of `data1` or `data2`, ordered by their `orderingVal`. But if we merge/combine `data1` and `data2` into a new payload (`reduceData`), then `data1.equals(reduceData)` is always false. To get the `HoodieKey` and `HoodieOperation` for the new `HoodieRecord` carrying `reduceData`, we need to take the latest `HoodieKey` and `HoodieOperation` from `data1` and `data2`; `compareTo` is used in place of `#preCombine` to compare their `orderingVal`.

Reviewer: Actually, `rec1` and `rec2` should have the same `HoodieKey` here, right? But the `HoodieOperation` might differ.
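The `preCombine` field-wise merge discussed above can also be sketched with plain maps. This is a simplified model, not the payload: the class name `PreCombineSketch` is hypothetical, maps stand in for Avro records, and delete handling is omitted. It reproduces the javadoc example of merging (1, null, 1, 1) at ordering value 1 with (2, 2, null, 2) at ordering value 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PreCombineSketch {

    // Keep the non-null side; if both sides are non-null, take the side with
    // the higher ordering value (this mirrors mergeValue in the payload).
    static Object mergeValue(Object left, Object right, boolean chooseLeft) {
        if (left != null && right != null) {
            return chooseLeft ? left : right;
        }
        return left == null ? right : left;
    }

    // Field-wise merge of two records with the same field set.
    static Map<String, Object> preCombine(Map<String, Object> a, int ordA,
                                          Map<String, Object> b, int ordB) {
        boolean chooseA = ordA > ordB;
        Map<String, Object> merged = new LinkedHashMap<>();
        a.keySet().forEach(f -> merged.put(f, mergeValue(a.get(f), b.get(f), chooseA)));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> r1 = new LinkedHashMap<>();
        r1.put("f0", 1);
        r1.put("f1", null);
        r1.put("f2", 1);
        r1.put("o1", 1);

        Map<String, Object> r2 = new LinkedHashMap<>();
        r2.put("f0", 2);
        r2.put("f1", 2);
        r2.put("f2", null);
        r2.put("o1", 2);

        System.out.println(preCombine(r1, 1, r2, 2)); // {f0=2, f1=2, f2=1, o1=2}
    }
}
```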