[HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload #9892
@danny0405 I also changed the payload creation logic for Flink. Could you also review the relevant changes?
nsivabalan
left a comment
LGTM. Just one minor comment.
    private final Comparable<?> orderingVal;

    public HoodieAvroPayload(GenericRecord record, Comparable<?> orderingVal) {
      this(record, orderingVal, EMPTY_PROPS);
Shouldn't we mark these as deprecated?
Based on my understanding, this is only used internally and the props are not used, so I did not mark it as deprecated.
        Constructor<?> constructor,
        @Nullable String preCombineField) {
      this.shouldCombine = shouldCombine;
      this.shouldUsePropsForPayload = shouldUsePropsForPayload;
Shouldn't shouldUsePropsForPayload always be true?
No. A record payload class implemented by a user outside the Hudi repo may only have a constructor with the old signature, i.e., {GenericRecord.class, Comparable.class} or {Option.class}. To stay compatible with that, there is a fallback mechanism that creates the payload without properties (which is OK if the payload does not leverage any props). See PayloadCreation#instance.
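The fallback described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual PayloadCreation#instance code: the classes PayloadFactorySketch, LegacyPayload, and PropsAwarePayload are hypothetical, a plain Map stands in for Avro's GenericRecord so the example needs only the JDK, and only the {record, ordering value} legacy signature is covered.

```java
import java.lang.reflect.Constructor;
import java.util.Map;
import java.util.Properties;

// Hypothetical user payload written against the old two-argument signature.
class LegacyPayload {
  final Map<String, Object> record;
  final Properties props; // stays null: the old signature cannot receive props

  public LegacyPayload(Map<String, Object> record, Comparable<?> orderingVal) {
    this.record = record;
    this.props = null;
  }
}

// Hypothetical payload that accepts properties in its constructor.
class PropsAwarePayload {
  final Map<String, Object> record;
  final Properties props;

  public PropsAwarePayload(Map<String, Object> record, Comparable<?> orderingVal, Properties props) {
    this.record = record;
    this.props = props;
  }
}

// Resolves the constructor once, then reuses it for every record.
class PayloadFactorySketch {
  final Constructor<?> constructor;
  final boolean shouldUsePropsForPayload;

  PayloadFactorySketch(Class<?> payloadClass) {
    Constructor<?> ctor;
    boolean usesProps;
    try {
      // Preferred: the constructor that also takes Properties.
      ctor = payloadClass.getConstructor(Map.class, Comparable.class, Properties.class);
      usesProps = true;
    } catch (NoSuchMethodException e) {
      try {
        // Fallback: the old signature without Properties.
        ctor = payloadClass.getConstructor(Map.class, Comparable.class);
        usesProps = false;
      } catch (NoSuchMethodException e2) {
        throw new IllegalArgumentException("No supported constructor on " + payloadClass, e2);
      }
    }
    this.constructor = ctor;
    this.shouldUsePropsForPayload = usesProps;
  }

  Object create(Map<String, Object> record, Comparable<?> orderingVal, Properties props) {
    try {
      return shouldUsePropsForPayload
          ? constructor.newInstance(record, orderingVal, props)
          : constructor.newInstance(record, orderingVal);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Unable to instantiate payload", e);
    }
  }
}
```

Resolving the constructor up front means the failed lookup for a legacy class happens once per factory rather than once per record, which is the cost the flag is meant to avoid.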
    }

    public static Properties extractPropsFromConfiguration(Configuration config) {
      Properties props = new Properties();
If all we want is payload properties, you can use StreamerUtil.getPayloadConfig.
We need hoodie.payload.delete.field and hoodie.payload.delete.marker here, which are not included in StreamerUtil.getPayloadConfig.
Just set it up correctly in the code. BTW, I think Flink has never supported these two options.
      super(record, orderingVal);
    public DefaultHoodieRecordPayload(GenericRecord record, Comparable orderingVal, Properties props) {
      super(record, orderingVal, props);
    }
The source of the props seems chaotic. I have already seen several ways in which they are produced:
- config.getPayloadConfig().getProps() in HoodieMergeHandle;
- payloadProps.setProperty(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, preCombineField); in HoodieFileSliceReader;
- config.getProps() in HoodieIndexUtils.
Yeah, this should be cleaned up separately. Regardless, the payload-related configs should always be included so there's no correctness issue. And we're not storing any props with the record so record processing is consistent.
    final String deleteKey = props.getProperty(DELETE_KEY);
    if (StringUtils.isNullOrEmpty(deleteKey)) {
      return isDeleteRecord(genericRecord);
      return super.isDeleteRecord(record, props);
Is this line the actual fix? I don't see the props being used by the super method, so do we still need to pass all the props around here?
The implementation of this method is kept the same. However, before my fix, this method isDeleteRecord(GenericRecord genericRecord, Properties properties) was a protected method that did not override any method in the superclass, and it was not called in the constructor to determine whether a record is a delete.
After the fix, the method isDeleteRecord(GenericRecord genericRecord, Properties properties) overrides the same method in the superclass, and is now called in the constructor (see the superclass BaseAvroPayload constructor, where the following is called):
    this.isDeletedRecord = record == null || isDeleteRecord(record, props);
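The dispatch described in this comment can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the Hudi classes themselves: BasePayloadSketch and CustomDeletePayloadSketch are hypothetical names, and a Map replaces Avro's GenericRecord. Only the config keys hoodie.payload.delete.field and hoodie.payload.delete.marker and the _hoodie_is_deleted field come from the discussion.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for the base payload; a Map replaces GenericRecord.
class BasePayloadSketch {
  protected final boolean isDeletedRecord;

  BasePayloadSketch(Map<String, Object> record, Properties props) {
    // Virtual call: a subclass override decides, mirroring the quoted constructor line.
    this.isDeletedRecord = record == null || isDeleteRecord(record, props);
  }

  // Default behaviour: only the _hoodie_is_deleted flag marks a delete.
  protected boolean isDeleteRecord(Map<String, Object> record, Properties props) {
    return Boolean.TRUE.equals(record.get("_hoodie_is_deleted"));
  }
}

// Hypothetical stand-in for a payload that honours a custom delete key and marker.
class CustomDeletePayloadSketch extends BasePayloadSketch {
  CustomDeletePayloadSketch(Map<String, Object> record, Properties props) {
    super(record, props);
  }

  @Override
  protected boolean isDeleteRecord(Map<String, Object> record, Properties props) {
    String deleteKey = props.getProperty("hoodie.payload.delete.field");
    if (deleteKey == null || deleteKey.isEmpty()) {
      // No custom key configured: fall back to the default flag check.
      return super.isDeleteRecord(record, props);
    }
    String deleteMarker = props.getProperty("hoodie.payload.delete.marker");
    Object value = record.get(deleteKey);
    return value != null && value.toString().equals(deleteMarker);
  }
}
```

The override reads only its arguments, never subclass fields, which keeps it safe to call from the superclass constructor before the subclass is fully initialized.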
But BaseAvroPayload.isDeleteRecord does not really use the passed-in props, so do we still need to change all the constructors just to pass the props? It seems that an additional invocation of BaseAvroPayload.isDeleteRecord could fix the problem, and this method does not actually need any props.
Maybe we can address the props issue in a separate task.
BaseAvroPayload.isDeleted(Schema schema, Properties props) already passes the properties around, so can we just override it for DefaultHoodieRecordPayload, so that there is no need to change all the constructors?
BaseAvroPayload.isDeleteRecord(GenericRecord genericRecord, Properties props) uses the props, and it is called by this.isDeletedRecord = record == null || isDeleteRecord(record, props); in the constructor. BaseAvroPayload.isDeleteRecord(GenericRecord genericRecord) is only for backwards compatibility, called from user-implemented record payload implementation.
I'm talking about the master code. In master, BaseAvroPayload.isDeleteRecord only handles the _hoodie_is_deleted field, which looks reasonable. If BaseAvroPayload.isDeleted(Schema schema, Properties props) is the culprit that causes the issue, then just fix it: we can override it in DefaultHoodieRecordPayload and utilize those specific options there.
The intention is that isDeleted(Schema schema, Properties props) should not deserialize the bytes, for performance reasons, so we cannot implement the custom delete marker there for DefaultHoodieRecordPayload. The approach in this fix is to check the delete key and marker in the constructor before the Avro record is serialized; that's why props needs to be passed in to the constructor.
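To make the performance argument concrete, here is a minimal sketch of a payload that decides its delete status before serializing and then answers isDeleted() from the cached flag. Everything in it is illustrative: SerializedPayloadSketch is a hypothetical class, the record is a Map of strings, and a trivial key=value text encoding (assuming keys contain no '=' and no text contains newlines) stands in for Avro serialization.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical payload that keeps only serialized bytes after construction.
class SerializedPayloadSketch {
  private final byte[] recordBytes;
  private final boolean isDeletedRecord;
  private int deserializeCalls = 0;

  SerializedPayloadSketch(Map<String, String> record, Properties props) {
    // Decide delete status while the record is still available in object form.
    String deleteKey = props.getProperty("hoodie.payload.delete.field");
    String deleteMarker = props.getProperty("hoodie.payload.delete.marker");
    this.isDeletedRecord = deleteKey != null
        && deleteMarker != null
        && deleteMarker.equals(record.get(deleteKey));
    this.recordBytes = serialize(record);
  }

  // Answers from the cached flag; the bytes are never touched here.
  boolean isDeleted() {
    return isDeletedRecord;
  }

  // Deserializes on demand and counts how often that happens.
  Map<String, String> getInsertValue() {
    deserializeCalls++;
    return deserialize(recordBytes);
  }

  int deserializeCalls() {
    return deserializeCalls;
  }

  // Toy encoding: one "key=value" pair per line.
  private static byte[] serialize(Map<String, String> record) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : record.entrySet()) {
      sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  private static Map<String, String> deserialize(byte[] bytes) {
    Map<String, String> out = new LinkedHashMap<>();
    for (String line : new String(bytes, StandardCharsets.UTF_8).split("\n")) {
      if (line.isEmpty()) {
        continue;
      }
      int idx = line.indexOf('=');
      out.put(line.substring(0, idx), line.substring(idx + 1));
    }
    return out;
  }
}
```

Because the delete key and marker are only available through the properties, the constructor is the last point at which they can be applied without paying for a deserialization on every isDeleted() call.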
hey @yihua @danny0405: can you folks sync up and resolve this soon? We want to get this landed for 0.14.1.
Attempting a different approach here: #10150
Change Logs
This PR fixes the DefaultHoodieRecordPayload to allow records with a custom delete key (hoodie.payload.delete.field) and delete marker (hoodie.payload.delete.marker) to be properly ingested. Before this fix, the write fails with an exception. The bug was introduced by the RFC-46 implementation and the refactoring of payload merging.
To fix the issue, the following changes are made:
- … DefaultHoodieRecordPayload, this means that the configuration of the delete key (hoodie.payload.delete.field) and delete marker (hoodie.payload.delete.marker) must be passed to the constructor. As the record payload (HoodieRecordPayload) creation is realized through Java reflection, the constructor signature of all existing record payload implementations intended to be used by the user has to be changed.
- … DefaultHoodieRecordPayload, only the constructors with props are kept. We change all payload classes so that the new constructor is called once in reflection, instead of twice with the first attempt failing when the class does not include the new constructor with the properties, to avoid unnecessary fallbacks and performance hits on each record.
- … DataSourceUtils to HoodieRecordUtils and modified them according to the constructor changes.
- … HoodieRecordPayload to properly pass the configurations in Properties.
- BaseAvroPayload#isDeleteRecord now takes the configuration in Properties and is called by the new constructors of the payload classes. The old method is still kept for backwards compatibility.
- DefaultHoodieRecordPayload overrides the implementation of isDeleteRecord to use the custom delete key and marker to determine deletes, which was the behavior before RFC-46.
- … DefaultHoodieRecordPayload and deletes with custom delete key and marker work as expected.
Impact
Fixes the DefaultHoodieRecordPayload to properly handle the custom delete key (hoodie.payload.delete.field) and delete marker (hoodie.payload.delete.marker). Deletes using the default _hoodie_is_deleted field are not affected, i.e., they already worked before this fix.
Risk level
medium
Documentation Update
HUDI-6966 for updating docs on custom payload implementation.
Contributor's checklist