[HUDI-1383] Modify hive partition synchronization #2262
# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
…tion synchronization
Codecov Report
@@             Coverage Diff              @@
##             master    #2262      +/-   ##
============================================
- Coverage     53.51%   52.01%   -1.50%
- Complexity     2769     2952     +183
============================================
  Files           348      395      +47
  Lines         16107    17650    +1543
  Branches       1643     1809     +166
============================================
+ Hits           8619     9181     +562
- Misses         6789     7710     +921
- Partials        699      759      +60
n3nash
left a comment
@linshan-ma Can you please describe the need for this change ?
@n3nash The Hive/Hudi table contains partitions.
This is described in more detail in #2234.
@n3nash would you please review this again when you are free?
@liujinhui1994 Can you help review this PR? @linshan-ma It would be better to make the title of the issue more descriptive.
@n3nash do you see any reason to keep the sort?
@bvaradar @n3nash: I'm really surprised that we have had this bug for so long. How come none of our customers reported it until now, and we didn't catch it at Uber either? Any interplay between the month value and the day value will run into issues.
Here is the gist of the bug. After getting the partition values, we were sorting them. For example, "2020,12,25" is sorted (lexicographically) as "12,2020,25". In other words, both "2020,01,02" and "2020,02,01" are sorted as "01,02,2020".
I did some debugging and couldn't quite understand why we added the sorting in the first place. I am sure I am missing something here.
I am trying to add a test to cover this scenario and am having some issues on that end. In the meantime, do respond with your comments, if any.
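To make the collision concrete, here is a minimal standalone sketch. It is not the actual Hudi sync code: `PartitionSortDemo`, `sortedKey`, and `orderedKey` are hypothetical names made up for illustration, and the only assumption is that the partition values are joined into a single comma-separated key after (or without) a lexicographic sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PartitionSortDemo {

    // Hypothetical helper: mimics the buggy behaviour by sorting the
    // partition values lexicographically before joining them into a key.
    static String sortedKey(List<String> partitionValues) {
        List<String> copy = new ArrayList<>(partitionValues);
        Collections.sort(copy);
        return String.join(",", copy);
    }

    // Hypothetical helper: keeps the values in partition-column order,
    // which is what removing the sort amounts to.
    static String orderedKey(List<String> partitionValues) {
        return String.join(",", partitionValues);
    }

    public static void main(String[] args) {
        List<String> jan2 = Arrays.asList("2020", "01", "02");
        List<String> feb1 = Arrays.asList("2020", "02", "01");

        // With the sort, two different partitions map to the same key.
        System.out.println(sortedKey(jan2));  // 01,02,2020
        System.out.println(sortedKey(feb1));  // 01,02,2020

        // Without the sort, the keys stay distinct.
        System.out.println(orderedKey(jan2)); // 2020,01,02
        System.out.println(orderedKey(feb1)); // 2020,02,01
    }
}
```

Under that assumption, any two partitions whose values are permutations of each other become indistinguishable once sorted, which matches the month/day interplay described above.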
@linshan-ma: I see a lot of commits (from a rebase) in the patch even though the actual fix is just one commit. I guess it wasn't cleanly applied. Can you try fixing it?
Tips
What is the purpose of the pull request
Modify hive partition synchronization
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.