Skip to content

Conversation

@linliu-code
Copy link
Collaborator

@linliu-code linliu-code commented Sep 29, 2023

Change Logs

For Spark workflow, current HoodieRecordMerger interface does not support custom delete logic like the getInsertValue method does in HoodieRecordPayload interface.

This change fills the hole for the merger interface.
Functional tests are added to test the behavior the new method.

Impact

No impact on existing workflows.

Risk level (write none, low medium or high below)

Low.

Documentation Update

No need to update any public documenation.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@linliu-code linliu-code force-pushed the HUDI-6702_support_custom_delete_logic_through_merger branch 2 times, most recently from 181594a to 9611675 Compare October 2, 2023 21:06
@linliu-code
Copy link
Collaborator Author

@beyond1920, please check if this would work for your workflow.

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, the rest looks good. Thanks for simplifying.

*
* <p> This interface is experimental and might be evolved in the future.
**/
default Pair<Boolean, Boolean> shouldFlush(HoodieRecord record, Schema schema, TypedProperties props) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How's this going to evolve over time? What if users want to implement this method and Pair<Boolean, Boolean> is not sufficient? Then we add another method?
Also, what's the thinking behind the arguments of this method? If it's going to be simple enough to just decide whether to flush or not, do we need record, schema and props?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's going to be simple enough to just decide whether to flush or not

I kind of agree, we can simplify the returned value as a true/false. But maybe @linliu-code has some other considerations here, @linliu-code can you clarify.

Then we add another method?

I think we may change the method signature directly, by marking this method as expremental, we do not guarantee any compatibility in future versions. And because we have a default impl, so it should be feasible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're only going to have 3 cases, can we use an enum instead? I think it can be more clear to the reader and future devs when implementing this method. The overhead of a new Pair being created vs the enum singletons can add up as well when processing 10s of thousands of records.

Copy link
Collaborator Author

@linliu-code linliu-code Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of agree, we can simplify the returned value as a true/false. But maybe @linliu-code has some other considerations here, @linliu-code can you clarify.

@danny0405, @codope, the reason that we need a pair of boolean variables is that: if a merger decides not to flush the combined record, it faces the question if the old record (the record in the base file) should be kept or not. This question could be very critical, so we should not guess it for the developer who implements their custom merger.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're only going to have 3 cases, can we use an enum instead? I think it can be more clear to the reader and future devs when implementing this method. The overhead of a new Pair being created vs the enum singletons can add up as well when processing 10s of thousands of records.

@the-other-tim-brown, sounds a good point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This question could be very critical,

I didn't see such request from any user, even for the contributor from Kuaishou, they just want to keep the merged record or drop it totally. Let's not introduce new semantics if there is no real use case as back-up.

We can evolve the returned value as a Pair or Enum if there are more feedbacks, at this time point, the behavior for keeping the old record seems not clear to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This question could be very critical,

I didn't see such request from any user, even for the contributor from Kuaishou, they just want to keep the merged record or drop it totally. Let's not introduce new semantics if there is no real use case as back-up.

We can evolve the returned value as a Pair or Enum if there are more feedbacks, at this time point, the behavior for keeping the old record seems not clear to me.

Even in current implementation of HoodieMergeHandle, we are still facing this problem: when the shouldFlush function returns false, should we return true or false in writeRecord function? Returning true means skipping the old record, false means keeping the old record. No matter which one we choose in advance, we still face the possible situation: what if a user wants the other way?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 , saw your comment in the slack, we can use the simple signature right now since we can evolve it when we need in the future.

}

@Test
public void testShouldFlush1() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give these method specific names instead of 1, 2, 3... suffix.

@linliu-code linliu-code force-pushed the HUDI-6702_support_custom_delete_logic_through_merger branch 2 times, most recently from 3516d7c to a417df8 Compare October 3, 2023 07:23
Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocked for the clarification of the return type.

For Spark workflow, current HoodieRecordMerger interface does not support
custom delete logic like the getInsertValue method does in HoodieRecordPayload interface.

This change fills the hole for the merger interface.
@linliu-code linliu-code force-pushed the HUDI-6702_support_custom_delete_logic_through_merger branch from a417df8 to a150f8a Compare October 3, 2023 16:54
@linliu-code
Copy link
Collaborator Author

Test failures seems not related. Ran failing tests in local, and they passed.

@codope
Copy link
Member

codope commented Oct 4, 2023

@hudi-bot run azure

@codope
Copy link
Member

codope commented Oct 5, 2023

Landing this PR. The integ test failure is due to a flaky test which will be fixed by HUDI-6917.

@codope codope merged commit 65cb387 into apache:master Oct 5, 2023
@beyond1920
Copy link
Contributor

beyond1920 commented Oct 7, 2023

@linliu-code @danny0405 Sorry for late response. I was just back from vacation.
The return type is true/false is enough.
But something is missed in this PR. For example, when read snapshot of MOR table, we should also check whether should return the records or not for those records which keys only existed in log files but does not existed in base file.
For example some methods in org.apache.hudi.LogFileIterator and org.apache.hudi.RecordMergingFileIterator.
It may lead to different query results for a MOR HUDI table before and after compaction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants