-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-1936] Introduce a optional property for conditional upsert #3035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Merge pull request apache#1 from fanaticjo/customupsertlogic
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3035 +/- ##
============================================
+ Coverage 55.01% 55.04% +0.02%
- Complexity 3850 3865 +15
============================================
Files 485 491 +6
Lines 23467 23640 +173
Branches 2497 2535 +38
============================================
+ Hits 12911 13012 +101
- Misses 9405 9466 +61
- Partials 1151 1162 +11
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution. This will be beneficial for the community.
Couple of high level comments.
- Did you think through if this works for both COW and MOR ? If not, let's try to think how we can make it work for MOR.
- Can user set different values for "hoodie.update.keys" for different batch of writes. From my understanding, COW should be fine. but not sure about MOR.
| if (!recordOption.isPresent()) { | ||
| return Option.empty(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might have to check for delete record as well.
something like
isDeleteRecord((GenericRecord) indexedRecord)
Please do check OverwriteWithLatestAvroPayload impl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes user can set different values for different batches for cow it working , mor will test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class should only be considered only for upserts , so why delete is required ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see one thing as a blocker is if a new column is introduced it makes it as null , any idea how to tackle this ? one idea i have is check what schema is mismatch and add that in the properties only in that new column will get values or is there any hudi way for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan any updates ?
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithCustomAvroPayload.java
Show resolved
Hide resolved
|
cc @vingov do you mind taking a review at this, given its a python benefiting change |
|
In some sense, with the Spark SQL support now, python users can do custom merges? does that satisfy your requirements? |
|
@fanaticjo : We landed a partial payload support via #4676. |
|
Since there is no update on this PR for a while and Hudi already supports partial updates with a more general approach than the payload this PR proposes, closing this PR now. Feel free to reopen if needed. |
Tips
What is the purpose of the pull request
If anyone wants to use custom upsert logic then they have to override the Latest avro payload class which is only possible in java or scala .
Python developers have no such option .
Will be introducing a new payload class and a new key which can work in java , scala and python
This class will be responsible for custom upsert logic and a new key hoodie.update.key which will accept the columns which only need to be updated
"hoodie.update.keys": "admission_date,name", #comma seperated key
"hoodie.datasource.write.payload.class": "com.hudiUpsert.hudiCustomUpsert" #custom upsert key
so this will only update the column admission_date and name in the target table
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR