[HUDI-6702][RFC-46] Support customized logic #9809
181594a to 9611675
@beyond1920, please check if this would work for your workflow.
codope
left a comment
One comment, the rest looks good. Thanks for simplifying.
 *
 * <p> This interface is experimental and might be evolved in the future.
 **/
default Pair<Boolean, Boolean> shouldFlush(HoodieRecord record, Schema schema, TypedProperties props) throws IOException {
How's this going to evolve over time? What if users want to implement this method and Pair<Boolean, Boolean> is not sufficient? Then we add another method?
Also, what's the thinking behind the arguments of this method? If it's going to be simple enough to just decide whether to flush or not, do we need record, schema and props?
> If it's going to be simple enough to just decide whether to flush or not
I kind of agree; we can simplify the returned value to a true/false. But maybe @linliu-code has some other considerations here. @linliu-code, can you clarify?
> Then we add another method?
I think we can change the method signature directly. By marking this method as experimental, we do not guarantee any compatibility in future versions, and because we have a default impl, changing it should be feasible.
If we're only going to have 3 cases, can we use an enum instead? I think it would be clearer to the reader and to future devs implementing this method. The overhead of a new Pair being created vs. the enum singletons can add up as well when processing tens of thousands of records.
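A minimal sketch of the enum idea, assuming the three cases are: write the merged record, skip it and keep the old record, or skip it and drop the old record. The `FlushDecision` name, its constants, and the `of` helper are hypothetical and not part of Hudi.

```java
class FlushDecisionDemo {
    // Hypothetical enum; name and constants are illustrative only, not Hudi API.
    enum FlushDecision {
        FLUSH,          // write the merged record
        SKIP_KEEP_OLD,  // do not write the merged record; keep the base-file record
        SKIP_DROP_OLD;  // do not write the merged record; drop the base-file record

        // Maps a (shouldFlush, keepOld) pair of booleans onto the three constants.
        static FlushDecision of(boolean shouldFlush, boolean keepOld) {
            if (shouldFlush) {
                return FLUSH; // keepOld is irrelevant once the merged record is written
            }
            return keepOld ? SKIP_KEEP_OLD : SKIP_DROP_OLD;
        }
    }

    public static void main(String[] args) {
        // The constants are singletons, so no per-record allocation is needed.
        System.out.println(FlushDecision.of(true, true));
        System.out.println(FlushDecision.of(false, true));
        System.out.println(FlushDecision.of(false, false));
    }
}
```

Because the fourth boolean combination (flush = true, keepOld = false) collapses into `FLUSH`, the enum makes the "only 3 cases" point explicit in the type.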
> I kind of agree, we can simplify the returned value as a true/false. But maybe @linliu-code has some other considerations here, @linliu-code can you clarify.
@danny0405, @codope, the reason we need a pair of boolean values is this: if a merger decides not to flush the combined record, it faces the question of whether the old record (the record in the base file) should be kept. This question could be very critical, so we should not guess the answer for the developer who implements a custom merger.
> if we're only going to have 3 cases, can we use an enum instead? I think it can be more clear to the reader and future devs when implementing this method. The overhead of a new Pair being created vs the enum singletons can add up as well when processing 10s of thousands of records.
@the-other-tim-brown, sounds like a good point.
> This question could be very critical,
I haven't seen such a request from any user. Even the contributor from Kuaishou just wants to either keep the merged record or drop it entirely. Let's not introduce new semantics if there is no real use case to back them up.
We can evolve the returned value into a Pair or Enum if there is more feedback. At this point, the behavior for keeping the old record is not clear to me.
> > This question could be very critical,
> I didn't see such request from any user, even for the contributor from Kuaishou, they just want to keep the merged record or drop it totally. Let's not introduce new semantics if there is no real use case as back-up.
> We can evolve the returned value as a Pair or Enum if there are more feedbacks, at this time point, the behavior for keeping the old record seems not clear to me.
Even in the current implementation of HoodieMergeHandle, we still face this problem: when the shouldFlush function returns false, should the writeRecord function return true or false? Returning true means skipping the old record; returning false means keeping it. No matter which one we choose in advance, the question remains: what if a user wants it the other way?
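To make the ambiguity concrete, here is a toy sketch rather than the real HoodieMergeHandle. It assumes only what is stated above: a true return from writeRecord means the old record is skipped, and false means it is kept. The `MergeHandleSketch` class, the `Predicate` standing in for shouldFlush, and the `dropOldWhenNotFlushed` flag are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class MergeHandleSketch {
    // Records that were actually written; strings stand in for real records.
    final List<String> written = new ArrayList<>();
    private final Predicate<String> shouldFlush;   // stand-in for the merger's shouldFlush
    private final boolean dropOldWhenNotFlushed;   // hypothetical knob; the handle must pick one value

    MergeHandleSketch(Predicate<String> shouldFlush, boolean dropOldWhenNotFlushed) {
        this.shouldFlush = shouldFlush;
        this.dropOldWhenNotFlushed = dropOldWhenNotFlushed;
    }

    // Returns true when the caller should skip the old record, false when it should keep it.
    boolean writeRecord(String mergedRecord) {
        if (shouldFlush.test(mergedRecord)) {
            written.add(mergedRecord);
            return true; // the merged record replaces the old one
        }
        // Not flushed: a single boolean from shouldFlush cannot say what happens to the old record.
        return dropOldWhenNotFlushed;
    }

    public static void main(String[] args) {
        Predicate<String> neverFlush = r -> false;
        System.out.println(new MergeHandleSketch(neverFlush, true).writeRecord("merged"));  // old record skipped
        System.out.println(new MergeHandleSketch(neverFlush, false).writeRecord("merged")); // old record kept
    }
}
```

With a boolean return type, the value of `dropOldWhenNotFlushed` has to be fixed inside the handle; a pair or enum return type would move that choice to the merger author.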
@danny0405, I saw your comment on Slack. We can use the simple signature for now, since we can evolve it when needed in the future.
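A rough sketch of what the simple signature could look like, assuming a boolean return with a default implementation that always flushes. The `RecordMergerSketch` interface, the `Map` standing in for a record, and the `is_deleted` field are hypothetical; the real method takes a HoodieRecord, a Schema, and TypedProperties.

```java
import java.util.Map;

class ShouldFlushSketch {
    // Hypothetical stand-in for the merger interface; not the Hudi type.
    interface RecordMergerSketch {
        // Default keeps existing mergers unchanged: every merged record is flushed.
        default boolean shouldFlush(Map<String, Object> record) {
            return true;
        }
    }

    // A merger that relies on the default behavior.
    static class DefaultMerger implements RecordMergerSketch {
    }

    // A merger with custom delete logic: records flagged as deleted are not flushed.
    static class DeleteAwareMerger implements RecordMergerSketch {
        @Override
        public boolean shouldFlush(Map<String, Object> record) {
            // "is_deleted" is an assumed field name used only for this example.
            return !Boolean.TRUE.equals(record.get("is_deleted"));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> live = Map.of("id", 1, "is_deleted", false);
        Map<String, Object> deleted = Map.of("id", 2, "is_deleted", true);
        RecordMergerSketch merger = new DeleteAwareMerger();
        System.out.println(merger.shouldFlush(live));
        System.out.println(merger.shouldFlush(deleted));
        System.out.println(new DefaultMerger().shouldFlush(deleted));
    }
}
```

Because the method has a default body, changing its return type later only affects implementers that override it, which is the compatibility argument made earlier in the thread.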
}

@Test
public void testShouldFlush1() throws Exception {
Can we give these methods specific names instead of the 1, 2, 3... suffixes?
3516d7c to a417df8
danny0405
left a comment
Blocked for the clarification of the return type.
a417df8 to a150f8a
Test failures seem unrelated. I ran the failing tests locally, and they passed.
@hudi-bot run azure
Landing this PR. The integ test failure is due to a flaky test which will be fixed by HUDI-6917.
@linliu-code @danny0405 Sorry for the late response. I just got back from vacation.
Change Logs
For the Spark workflow, the current HoodieRecordMerger interface does not support custom delete logic the way the getInsertValue method does in the HoodieRecordPayload interface.
This change fills that gap in the merger interface.
Functional tests are added to test the behavior of the new method.
Impact
No impact on existing workflows.
Risk level (write none, low medium or high below)
Low.
Documentation Update
No need to update any public documentation.
Contributor's checklist