[HUDI-1058] Make delete marker configurable #1819
GuoPhilipse
left a comment
@Parameter(names = {"--source-delete-field"}, description = "Field within source record to decide"
+ " is this record is delete record. Default: " + OverwriteWithLatestAvroPayload.DEFAULT_DELETE_FIELD)
public String sourceDeleteField = OverwriteWithLatestAvroPayload.DEFAULT_DELETE_FIELD;
Could we make the description clearer?
@nsivabalan can you please take a pass once CI is passing?
Thanks @shenh062326 for taking a stab at this. I had a different proposal for configuring this property. Let me know if it sounds good to you. @nsivabalan Please also review the suggested approach. Thank you.
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Looks like there are 2 options here. Since making changes in getInsertValue will touch a lot of files, I guess we can go with Option 2 with some changes. We could also optionally pass in a delete field, which in turn will be passed into the constructor of OverwriteWithLatestAvroPayload. If the delete field is set by the user, we use that; otherwise we fall back to "_hoodie_is_deleted". Let me know your thoughts.
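The constructor-based idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `DeleteAwarePayload` is a hypothetical stand-in, not Hudi's `OverwriteWithLatestAvroPayload`, and a plain `Map` replaces the Avro record so the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a payload class; not Hudi's actual implementation.
final class DeleteAwarePayload {
    // Default marker name taken from the discussion above.
    static final String DEFAULT_DELETE_FIELD = "_hoodie_is_deleted";

    private final Map<String, Object> record; // assumption: a Map stands in for the Avro record
    private final String deleteField;

    // Existing-style constructor: callers that pass no field get the default marker.
    DeleteAwarePayload(Map<String, Object> record) {
        this(record, null);
    }

    // Overloaded constructor: a user-supplied field wins, otherwise fall back to the default.
    DeleteAwarePayload(Map<String, Object> record, String deleteField) {
        this.record = new HashMap<>(record);
        this.deleteField = (deleteField == null || deleteField.isEmpty())
            ? DEFAULT_DELETE_FIELD
            : deleteField;
    }

    // A record counts as a delete only if the marker field holds boolean true.
    boolean isDeleteRecord() {
        Object marker = record.get(deleteField);
        return marker instanceof Boolean && (Boolean) marker;
    }
}
```

With this shape, existing callers keep using the one-argument constructor, and only callers that want a custom marker column pass a second argument.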
@nsivabalan actually, what I commented here is the 3rd option.
Basically it is about shifting the responsibility of converting records to
@xushiyan It seems good for OverwriteWithLatestAvroPayload. But for AWSDmsAvroPayload, users need to define not only the deletion column but also how that column is processed. Where should the new method live? If it stays in combineAndGetUpdateValue, then both combineAndGetUpdateValue and the calling method contain delete logic. Won't that be duplicated?
@xushiyan: nope. As I mentioned above, we need it in getInsertValue() as well, which is called from a lot of classes. Hence I suggested adding it as part of the constructor.
@shenh062326 @nsivabalan got it. Yup, passing it through the constructor looks good. Thanks for clarifying.
d3f6648 to dcbf29b
nsivabalan
left a comment
One high-level comment.
bae0d5b to a888c89
nsivabalan
left a comment
Some more comments.
hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
3a3e9df to 9a748bb
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
nsivabalan
left a comment
I don't see any tests being added as part of the patch. Would be nice to have some tests covering the new code that was added at all levels.
- WriteClient
- Datasource if there is an existing suite of tests for other write operations
- Deltastreamer
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
1caaef3 to 78d1bbc
nsivabalan
left a comment
A few more minor comments. We are almost there. Also, do click on "Resolve conversation" for my comments if you have resolved them. If you can't resolve them, do leave a comment so that I know which ones to review again instead of reviewing the entire patch.
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
hudi-common/src/test/java/org/apache/hudi/common/model/TestOverwriteWithLatestAvroPayload.java
Shouldn't this also be false?
In fact, it does not matter whether deleteField is true or false here; only defaultDeleteField affects the result.
same here.
The same as above.
hudi-common/src/test/java/org/apache/hudi/common/model/TestOverwriteWithLatestAvroPayload.java
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
...va/org/apache/hudi/utilities/functional/TestDeltaStreamerWithOverwriteLatestAvroPayload.java
...va/org/apache/hudi/utilities/functional/TestDeltaStreamerWithOverwriteLatestAvroPayload.java
hudi-common/src/test/java/org/apache/hudi/common/model/TestOverwriteWithLatestAvroPayload.java
57a2d7a to 242c979
nsivabalan
left a comment
I haven't reviewed the code changes for parallelizing deletion. Let's take that up in a different PR.
hudi-client/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java
hudi-common/src/test/java/org/apache/hudi/common/model/TestOverwriteWithLatestAvroPayload.java
nsivabalan
left a comment
LGTM. Thanks for the perseverance.
@xushiyan: do you plan to review, or can I go ahead and merge this?
@nsivabalan no, please feel free to merge. There have been thorough reviews :) Thanks.
Folks, I see a performance/efficiency problem with the approach here. We added a string field to every payload object, which will increase the shuffle size in the write path. We would just be sending the same string with each and every object. Can we rework this so that we introduce a new
Makes sense. Sorry about the oversight.
I can rework it.
This reverts commit 433d7d2.
@shenh062326: appreciate your help. We are looking to have a release by this weekend, so I am reverting this patch for now. I will work with you on the right fix for the configurable delete marker. If we can get it in by this weekend, well and good; if not, we can land it later, once we cut the release for 0.6.0.
@vinothchandar: if I am not wrong, we added an additional overloaded constructor to OverwriteWithLatestAvroPayload, so it shouldn't have broken any existing implementations. Only someone who wants to use a user-defined column would have to use the new constructor. Anyway, your perf reasoning is convincing, hence going ahead with the revert.
@nsivabalan my concern is more about the overhead of the extra field on payload serialization. No matter which constructor is called, there is a string field that will be serialized, right?
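The serialization concern above can be illustrated with a small measurement. This is a sketch under assumptions: `SketchPayload` and `SerializedSize` are hypothetical names, and plain Java serialization is used only because it needs no dependencies. The actual shuffle cost in a Spark job depends on the configured serializer, so the sketch shows the direction of the effect, not its real magnitude.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical payload: record bytes plus an optional per-object marker field name.
final class SketchPayload implements Serializable {
    private static final long serialVersionUID = 1L;

    private final byte[] recordBytes;
    private final String deleteField; // null models "no field name carried on the object"

    SketchPayload(byte[] recordBytes, String deleteField) {
        this.recordBytes = recordBytes;
        this.deleteField = deleteField;
    }
}

final class SerializedSize {
    private SerializedSize() {
    }

    // Serializes one object into its own stream and returns the byte count.
    static int of(Serializable payload) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }
}
```

Comparing `SerializedSize.of(new SketchPayload(bytes, "_hoodie_is_deleted"))` with `SerializedSize.of(new SketchPayload(bytes, null))` shows the extra bytes paid per object when the same string travels with every payload, regardless of which constructor populated it.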
I see it now. Got it.
Let me confirm the new implementation again: add a transient field to OverwriteWithLatestAvroPayload, like below: And set deleteMarkerField before HoodieMergeHandle.write calls OverwriteWithLatestAvroPayload.combineAndGetUpdateValue, right?
Nope. We need to create a new class called PayloadConfig, which will hold the deleteMarkerField column name. Similar to how we pass Schema to OverwriteWithLatestAvroPayload#combineAndGetUpdateValue and OverwriteWithLatestAvroPayload#getInsertValue, we have to pass this payloadConfig as well. Hope you get the gist. @vinothchandar: can you confirm this is the right approach?
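A rough sketch of the PayloadConfig idea described above, under stated assumptions: the class shapes and method signatures here are hypothetical, the Schema argument is omitted to avoid an Avro dependency, and a `Map` stands in for the record. The point is that the marker name lives in one config object handed to the call, not in every payload instance.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical config holder; created once and passed to payload calls.
final class PayloadConfig {
    static final String DEFAULT_DELETE_MARKER_FIELD = "_hoodie_is_deleted";

    private final String deleteMarkerField;

    PayloadConfig(String deleteMarkerField) {
        this.deleteMarkerField = (deleteMarkerField == null || deleteMarkerField.isEmpty())
            ? DEFAULT_DELETE_MARKER_FIELD
            : deleteMarkerField;
    }

    String getDeleteMarkerField() {
        return deleteMarkerField;
    }
}

// Hypothetical payload that carries only the record, not the marker name.
final class ConfigDrivenPayload {
    private final Map<String, Object> record; // assumption: a Map stands in for the Avro record

    ConfigDrivenPayload(Map<String, Object> record) {
        this.record = new HashMap<>(record);
    }

    // Returns empty for a delete record, otherwise the record to insert.
    Optional<Map<String, Object>> getInsertValue(PayloadConfig config) {
        Object marker = record.get(config.getDeleteMarkerField());
        boolean deleted = marker instanceof Boolean && (Boolean) marker;
        return deleted ? Optional.empty() : Optional.of(record);
    }
}
```

Because the config is an argument, the payload's serialized form does not grow with the marker name.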
It seems there are two implementations. (b) If we add this method to OverwriteWithLatestAvroPayload, then wherever we call HoodieRecordPayload.combineAndGetUpdateValue and HoodieRecordPayload.getInsertValue, we need to check whether the payload is an OverwriteWithLatestAvroPayload. If it is, we need to call getInsertValue(Schema schema, PayloadConfig payloadConfig); if it is not, we need to call getInsertValue(Schema schema). But there are many places that call HoodieRecordPayload.getInsertValue.
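The type-check problem described in (b) can be avoided in plain Java with a default interface method. The sketch below is one possible shape, not something the thread settled on: every name in it (`SketchRecordPayload`, `MarkerConfig`, `PlainPayload`, `MarkerAwarePayload`) is hypothetical, a `String` stands in for Schema, and a `Map` stands in for the record.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical config carrying the marker column name.
final class MarkerConfig {
    final String deleteMarkerField;

    MarkerConfig(String deleteMarkerField) {
        this.deleteMarkerField = deleteMarkerField;
    }
}

// Hypothetical payload interface; the String schema is a stand-in for an Avro Schema.
interface SketchRecordPayload {
    Optional<Map<String, Object>> getInsertValue(String schema);

    // Default overload: payloads that ignore the config inherit the old behaviour,
    // so callers can always use the two-argument form without an instanceof check.
    default Optional<Map<String, Object>> getInsertValue(String schema, MarkerConfig config) {
        return getInsertValue(schema);
    }
}

// A payload that never looks at the config.
final class PlainPayload implements SketchRecordPayload {
    private final Map<String, Object> record;

    PlainPayload(Map<String, Object> record) {
        this.record = new HashMap<>(record);
    }

    @Override
    public Optional<Map<String, Object>> getInsertValue(String schema) {
        return Optional.of(record);
    }
}

// A payload that overrides the overload to honour the configured marker.
final class MarkerAwarePayload implements SketchRecordPayload {
    private final Map<String, Object> record;

    MarkerAwarePayload(Map<String, Object> record) {
        this.record = new HashMap<>(record);
    }

    @Override
    public Optional<Map<String, Object>> getInsertValue(String schema) {
        return getInsertValue(schema, new MarkerConfig("_hoodie_is_deleted"));
    }

    @Override
    public Optional<Map<String, Object>> getInsertValue(String schema, MarkerConfig config) {
        Object marker = record.get(config.deleteMarkerField);
        boolean deleted = marker instanceof Boolean && (Boolean) marker;
        return deleted ? Optional.empty() : Optional.of(record);
    }
}
```

Call sites would then invoke the two-argument method on any payload, and only implementations that care about the marker override it.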
@vinothchandar @nsivabalan Can you take a look at this PR?
@shenh062326: sorry for the late reply. Can you check the #1704 patch for how to add new APIs? In fact, we need to coordinate both these patches: land one, rebase the other, and reuse the new APIs added as part of the first patch. @bhasudha: regarding your patch, what is your take on adding a PayloadConfig instead of a Map<>? I understand a map is very flexible, but for the payload class, Hudi controls what goes in.
Sure, I will check it. |
What is the purpose of the pull request
Users can specify any boolean field as the delete marker; _hoodie_is_deleted remains the default.