[HUDI-1284] preCombine all HoodieRecords and update all fields according to orderingVal #2106
Conversation
… foo handles client requests one by one so we just use gevent for concurrency. Jira case id: sss
@Karl-WangSK Hi, sorry for the late reply. I don't fully get the point of these changes; would you please describe in more detail which cases need them?
For example, when a batch contains two records with the same id, we need to retain all properties, such as name and age in my Brief change log. By default, Hudi just chooses the record with the biggest orderingVal.
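To make the difference concrete, here is a minimal sketch rather than the PR's actual code. `SketchRecord`, `latestWins`, and `mergeFields` are hypothetical names, a plain field map stands in for an Avro record, and `null` stands in for a field's default value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a record payload: a field map plus an ordering value.
final class SketchRecord {
    final Map<String, Object> fields;
    final long orderingVal;

    SketchRecord(Map<String, Object> fields, long orderingVal) {
        this.fields = fields;
        this.orderingVal = orderingVal;
    }

    // Default behaviour described above: keep the whole record with the larger orderingVal.
    static SketchRecord latestWins(SketchRecord a, SketchRecord b) {
        return b.orderingVal >= a.orderingVal ? b : a;
    }

    // Behaviour asked for in this PR: start from the older record and overlay
    // every field of the newer record that is set (non-null in this sketch).
    static SketchRecord mergeFields(SketchRecord a, SketchRecord b) {
        SketchRecord older = a.orderingVal <= b.orderingVal ? a : b;
        SketchRecord newer = (older == a) ? b : a;
        Map<String, Object> merged = new LinkedHashMap<>(older.fields);
        for (Map.Entry<String, Object> e : newer.fields.entrySet()) {
            if (e.getValue() != null) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return new SketchRecord(merged, newer.orderingVal);
    }
}
```

With an older record that sets only name and a newer record that sets only age, `latestWins` returns the newer record with name unset, while `mergeFields` returns a record carrying both values.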
Oops! In PR #2116, a period is missing at line 60 in
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
vinothchandar
left a comment
@Karl-WangSK Shouldn't this kind of merging belong in a separate payload class? I am not sure overloading the existing payload is the right way to go. It has a specific purpose: ingesting full change log records and picking the latest record based on orderingVal. What I am trying to say is that there may be users who just want null, 18 (i.e. the latest values for name, age) instead of the merged values.
Happy to take this contribution as a separate payload class.
In |
@Karl-WangSK Can you fix the conflicts? Thanks.
- T reducedData = (T) rec1.getData().preCombine(rec2.getData());
+ T reducedData;
+ //To prevent every records from parsing schema
+ if (rec2.getData() instanceof UpdatePrecombineAvroPayload) {
Can this prevent old payload implementations from calling preCombine(_, schema)?
@Karl-WangSK I think we need some more thinking to resolve the pending issues.
I am thinking about how we can safely make this the only API the Hudi code calls. TOL, existing payload classes will have just
Now we can think about how to address the perf issue. We can parse it once on the driver and then send it across, but the issue is
Are you able to attempt this? (Otherwise I'll try, though it might take a bit of time. Please let me know.)
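One way to read "parse it once on the driver and then send it across" is a serializable holder that ships only the schema text and parses lazily, at most once per deserialized copy. The sketch below is an assumption about the idea, not code from this PR or from Hudi: `SchemaHolderSketch` is a hypothetical name, and a comma-separated field list stands in for a real Avro schema and its parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder: only the schema text is serialized; the parsed form is
// transient and rebuilt on first use after deserialization.
final class SchemaHolderSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // Demonstration-only counter (not thread-safe) to show how often parsing happens.
    static int parseCount = 0;

    private final String schemaText;
    private transient List<String> parsed;

    SchemaHolderSketch(String schemaText) {
        this.schemaText = schemaText;
    }

    // Parse on first access, then reuse the cached result for every later record.
    List<String> get() {
        if (parsed == null) {
            parsed = parse(schemaText);
        }
        return parsed;
    }

    // Stand-in for a real schema parser: splits a comma-separated field list.
    private static List<String> parse(String text) {
        parseCount++;
        return Arrays.asList(text.split(","));
    }

    // Simulates sending the holder from the driver to an executor.
    static SchemaHolderSketch roundTrip(SchemaHolderSketch holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SchemaHolderSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

The point of the design is that per-record calls hit the cached parsed form, so the parsing cost is paid once per copy rather than once per record.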
I will try.
@Karl-WangSK Any updates on this? Happy to help with any open-ended issues here.
I ran into some trouble with SerializableSchema.
@Karl-WangSK An Avro upgrade is a non-trivial task. We need to ensure parquet-avro etc. work the same. @n3nash wdyt?
@Karl-WangSK: we have introduced new APIs for all methods in HoodieRecordPayload that take a Properties arg to assist with special needs like this. Do you think we can leverage that instead of adding a schema arg to preCombine?
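As a rough illustration of that suggestion, the sketch below shows how a Properties-taking overload can default to the old method so existing payloads keep working, while a merging payload reads extra context from the properties. This is a simplified, hypothetical contract (`PayloadSketch`, `LatestWinsSketch`, `MergingSketch`, and the property key are all made up for illustration), not Hudi's actual interface.

```java
import java.util.Properties;

// Hypothetical, simplified payload contract.
interface PayloadSketch<T> {
    // Original single-argument merge hook.
    T preCombine(T other);

    // Overload with extra context. It falls back to the old hook, so existing
    // payload classes need no change.
    default T preCombine(T other, Properties props) {
        return preCombine(other);
    }
}

// An "old" payload: implements only the single-argument hook.
final class LatestWinsSketch implements PayloadSketch<LatestWinsSketch> {
    final String name;
    final long orderingVal;

    LatestWinsSketch(String name, long orderingVal) {
        this.name = name;
        this.orderingVal = orderingVal;
    }

    @Override
    public LatestWinsSketch preCombine(LatestWinsSketch other) {
        return other.orderingVal > orderingVal ? other : this;
    }
}

// A merging payload: uses a property to learn which value counts as "default".
final class MergingSketch implements PayloadSketch<MergingSketch> {
    // Hypothetical property key, not a real Hudi config.
    static final String DEFAULT_NAME_KEY = "sketch.payload.default.name";

    final String name;
    final long orderingVal;

    MergingSketch(String name, long orderingVal) {
        this.name = name;
        this.orderingVal = orderingVal;
    }

    @Override
    public MergingSketch preCombine(MergingSketch other) {
        return other.orderingVal > orderingVal ? other : this;
    }

    @Override
    public MergingSketch preCombine(MergingSketch other, Properties props) {
        String defaultName = props.getProperty(DEFAULT_NAME_KEY, "");
        MergingSketch newer = other.orderingVal > orderingVal ? other : this;
        MergingSketch older = (newer == this) ? other : this;
        // Keep the older name when the newer record only carries the default.
        String mergedName = newer.name.equals(defaultName) ? older.name : newer.name;
        return new MergingSketch(mergedName, newer.orderingVal);
    }
}
```

Callers can then always invoke the two-argument form: old payloads behave exactly as before, and only payloads that opt in look at the properties.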
@vinothchandar: do you think we need to make this a release blocker?
@Karl-WangSK @vinothchandar: since it's been open for some time, I wanted to see how we can get it to closure. Let me summarize the state of the PR. Requirement:
So, the initial proposal was to change the signature of preCombine to take in a schema, but feedback was given to try out a serializable version so that we don't need to change the API and incur a perf impact. @Karl-WangSK: feel free to clarify my questions and give us the latest update on this PR.
FYI: we have another PR open to support a similar feature for combineAndUpdate.
@Karl-WangSK: we have another PR being reviewed right now for partial updates support: #2666
Tips
What is the purpose of the pull request
preCombine all HoodieRecords and update all fields (those that are not DefaultValue) according to orderingVal
Brief change log
When more than one HoodieRecord has the same HoodieKey, this function combines all fields (those that are not DefaultValue)
before attempting to insert/upsert (if combining is turned on in HoodieClientConfig).
eg: 1)
Verify this pull request
Added one test in TestOverwriteWithLatestAvroPayload to verify the change
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.