Skip to content

Conversation

@shenh062326
Copy link
Contributor

What is the purpose of the pull request

users can specify any boolean field for delete marker and _hoodie_is_deleted remains as default.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Copy link
Member

@GuoPhilipse GuoPhilipse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    @Parameter(names = {"--source-delete-field"}, description = "Field within source record to decide"
            + " is this record is delete record. Default: " + OverwriteWithLatestAvroPayload.DEFAULT_DELETE_FIELD)
    public String sourceDeleteField = OverwriteWithLatestAvroPayload.DEFAULT_DELETE_FIELD;

Could we make it the description more clear?

@vinothchandar
Copy link
Member

@nsivabalan can you please take a pass once CI is passing

Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shenh062326 for taking a stab at this. I had a different proposal to configuring this property. Let me know if it sounds good to you. @nsivabalan Please also review the suggested approach. Thank you.

@nsivabalan
Copy link
Contributor

Looks like there are 2 options here.
Option1: Change interface for combineAndGetUpdateValue and getInsertValue to take in delete field.
check #1792 on why we need to fix getInsertValue as well.
Option2: Approach taken in the patch, where in expose a setDeleteField method in OverwriteWithLatestAvroPayload and this will be set while generating HoodieRecords from GenericRecord.

Since making changes in getInsertValue will touch lot of files, guess we can go with Option2 with some changes.
Instead of exposing a new method, why can't we add an overloaded constructor. When HoodieSparkSqlWriter calls into

DataSourceUtils.createHoodieRecord(gr,
                orderingVal, keyGenerator.getKey(gr), parameters(PAYLOAD_CLASS_OPT_KEY))

we could also pass in a delete field optionally which inturn will be passed into constructor of OverwriteWithLatestAvroPayload. If delete field is set by user, then we take that else we resort to "_hoodie_is_deleted".

Let me know your thoughts.

@xushiyan
Copy link
Member

@nsivabalan actually what i commented here is the 3rd option

@shenh062326 Maybe we shouldn't do the check in the payload class itself. Maybe org.apache.hudi.io.HoodieMergeHandle#write is better for this job. After

Option<IndexedRecord> combinedAvroRecord =
hoodieRecord.getData().combineAndGetUpdateValue(oldRecord, useWriterSchema ? writerSchema : originalSchema);

we check combinedAvroRecord against the delete field defined in the configs and convert it to Option.empty() if appropriate. As this feature is config-related, whoever owns the configs should do it. Need more inputs from @nsivabalan

Basically it is about shifting the responsibility of converting records to Option.empty() to HoodieMergeHandle. WDYT?

@shenh062326
Copy link
Contributor Author

@xushiyan It seems good for OverwriteWithLatestAvroPayload. But for AWSDmsAvroPayload, users need to define not only the deletion column, but also the processing method of deleting the column. Where is the new method should be? If it continue to stay in combineAndGetUpdateValue, then combineAndGetUpdateValue has the processing delete logic, and the calling method also has the delete logic, will it be repeated?

@nsivabalan
Copy link
Contributor

@xushiyan : nope. As I have mentioned above, we need it in getInsertValue() as well which is called from lot of classes. Hence I suggested to add it as part of constructor.

@xushiyan
Copy link
Member

@shenh062326 @nsivabalan got it. yup, making it through the constructor looks good. thanks for clarifying.

@shenh062326 shenh062326 force-pushed the HUDI-1058 branch 3 times, most recently from d3f6648 to dcbf29b Compare July 15, 2020 11:28
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One high level comment.

@shenh062326 shenh062326 force-pushed the HUDI-1058 branch 2 times, most recently from bae0d5b to a888c89 Compare July 16, 2020 03:23
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

@shenh062326 shenh062326 force-pushed the HUDI-1058 branch 2 times, most recently from 3a3e9df to 9a748bb Compare July 17, 2020 02:07
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any tests being added as part of the patch. Would be nice to have some tests covering the new code that was added at all levels.

  • WriteClient
  • Datasource if there is an existing suite of tests for other write operations
  • Deltastreamer

@shenh062326 shenh062326 force-pushed the HUDI-1058 branch 6 times, most recently from 1caaef3 to 78d1bbc Compare July 25, 2020 10:59
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more minor comments. We are almost there. Also, do click on "Resolve conversation" for my comments if you have resoled them. If you can't resolve them, do leave a comment so that I know which ones to review again instead of reviewing the entire patch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this also be false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, it does not matter whether deleteField is true or false here, only defaultDeleteField will affect decide the result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

@shenh062326 shenh062326 force-pushed the HUDI-1058 branch 3 times, most recently from 57a2d7a to 242c979 Compare August 1, 2020 12:56
Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed the code changes for parallelizing deletion. Lets take it in a different PR.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the perseverance.

@nsivabalan
Copy link
Contributor

@xushiyan : do you plan to review? Or can I go ahead and merge this.

@xushiyan
Copy link
Member

xushiyan commented Aug 3, 2020

@nsivabalan no, please feel free to merge. there has been thorough reviews :) thanks

@nsivabalan nsivabalan merged commit 433d7d2 into apache:master Aug 3, 2020
@vinothchandar
Copy link
Member

Folks, I see a performance/efficiency problem with the approach here. We added a string field to every payload object, which will get increase the shuffle size in the write path. We would be just sending the same string with each and every object.

Can we rework this so that we introduce a new PayloadConfig object which we can hand to the payload methods at runtime, instead of adding this to constructor? We need to introduce new methods without breaking existing Custom payload implementations.

cc @nsivabalan @shenh062326 @xushiyan

@nsivabalan
Copy link
Contributor

makes sense. sorry about the oversight.
I will take up the rework by using a payloadConfig class.

@shenh062326
Copy link
Contributor Author

shenh062326 commented Aug 4, 2020

makes sense. sorry about the oversight.

I can rework it.
Let me confirm how to do it first, adding PayloadConfig, and then calling PayloadConfig in OverwriteWithLatestAvroPayload.isDeleteRecord to get deleteMarkerField, is that right?

nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Aug 4, 2020
@nsivabalan
Copy link
Contributor

@shenh062326 : appreciate your help. We are looking to have a release by this weekend and so I am reverting this patch for now. I will work with you with the right fix for configurable delete marker. if we can get it in by this weekend well and good, even if not, we can land it later once we cut the release for 0.6.0.

@nsivabalan
Copy link
Contributor

@vinothchandar : if I am not wrong, we added an additional overloaded constructor to OverwriteWithLatestAvroPayload. so shouldn't have broken any existing implementations. Only if someone wants to leverage user defined col, they might have to use the new constructor. Anyways, your perf reasoning is convincing, hence going ahead w/ reverting.

@vinothchandar
Copy link
Member

@nsivabalan my concern is more on the overhead of the extra field on the payload serialization. No matter what constructor is called. there is a string field that will be serialized, right?

@nsivabalan
Copy link
Contributor

I see it now. got it.

vinothchandar pushed a commit that referenced this pull request Aug 4, 2020
@shenh062326
Copy link
Contributor Author

Let me confirm the new implementation again, adding a field containing transient to OverwriteWithLatestAvroPayload, like below:

  private transient String deleteMarkerField = null;

  public void setDeleteMarkerField(String deleteMarkerField) {
    this.deleteMarkerField = deleteMarkerField;
  }

And set the deleteMarkerField before HoodieMergeHandle.write call OverwriteWithLatestAvroPayload.combineAndGetUpdateValue, right?

@nsivabalan
Copy link
Contributor

nope. we need to create a new class called PayloadConfig which will hold the deleteMarkerField col name. Similar to how we pass in Schema to OverwriteWithLatestAvroPayload#combineAndGetUpdateValue and OverwriteWithLatestAvroPayload#getInsertValue, we have to pass this payloadConfig as well. Hope you get the gist.

@vinothchandar : can you confirm if this is the right approach.

@shenh062326
Copy link
Contributor Author

It seems there are two implementations.
(a) If we add new methods like below to HoodieRecordPayload, since HoodieRecordPayload is an interface, it will incompatible will custom payload implementations.

public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, PayloadConfig payloadConfig) 
public Option<IndexedRecord>  getInsertValue(Schema schema, PayloadConfig payloadConfig)

(b) If we add this method to OverwriteWithLatestAvroPayload, when we call HoodieRecordPayload.combineAndGetUpdateValue and HoodieRecordPayload.getInsertValue, we need to check whether it's OverwriteWithLatestAvroPayload or not, if it's OverwriteWithLatestAvroPayload, we need to call getInsertValue(Schema schema, PayloadConfig payloadConfig); if it's not OverwriteWithLatestAvroPayload, we need to call getInsertValue(Schema schema). But there are many places will call HoodieRecordPayload.getInsertValue.

@shenh062326
Copy link
Contributor Author

@vinothchandar @nsivabalan Can you take a look at this PR?

@nsivabalan
Copy link
Contributor

@shenh062326 : sorry for late reply. Can you check #1704 patch on how to add new apis. Infact, we need to coordinate both these patches. I mean, land one and rebase other one and re-use the new apis added as part of first patch. @bhasudha : wrt your patch, What is your take on adding a PayloadConfig instead of a Map<>. I understand map is very flexible, but for payload class, hudi controls what goes in.

@shenh062326
Copy link
Contributor Author

@shenh062326 : sorry for late reply. Can you check #1704 patch on how to add new apis. Infact, we need to coordinate both these patches. I mean, land one and rebase other one and re-use the new apis added as part of first patch. @bhasudha : wrt your patch, What is your take on adding a PayloadConfig instead of a Map<>. I understand map is very flexible, but for payload class, hudi controls what goes in.

Sure, I will check it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants