[HUDI-1887] Setting default value to false for enabling schema post processor #2911
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #2911 +/- ##
=========================================
Coverage 52.90% 52.90%
+ Complexity 3748 3746 -2
=========================================
Files 488 488
Lines 23574 23574
Branches 2510 2510
=========================================
Hits 12472 12472
- Misses 9998 9999 +1
+ Partials 1104 1103 -1
|
n3nash
left a comment
LGTM. Will wait for @bvaradar to confirm later tonight/tomorrow before merging this.
|
@nsivabalan aren't there any tests affected by this change? Also, do we even need this post processor feature anymore? Should we can the entire feature instead of just disabling it by default? |
|
Yeah, I am trying to understand the implications of removing the post processor. We may not have a namespace in delta streamer's target schema if we remove the post processor. So what we need to understand is what happens in case the schema has a namespace in hudi data files and delta streamer then produces records w/ a schema that has no namespace. Marking as WIP for now. |
|
Yes, the only reason we still wanted to have it is just in case some user wishes to migrate from a path of having a namespace to a path of not having one. But if we can totally remove it, I am all for it. |
|
@n3nash: could you think of any reason why we need to keep this instead of removing it altogether? |
|
@nsivabalan It's the same reason you mentioned. The post processor adds the namespace. So if users have written log files with Avro namespacing and we disable the post processor for them, it will break. But this only happens in a specific path of DeltaSync. Can you point out which one? From my interactions in the community, I can suggest whether that path is frequently used or not. If it's not, let's remove this complexity. |
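To make the namespace concern above concrete, here is a minimal sketch of how an Avro record's full name is formed from its namespace and name. It uses no Avro dependency, and the namespace and record name are hypothetical placeholders, not values taken from Hudi. It only shows that the two full names differ; whether a given read path actually fails on that difference depends on how the reader resolves schemas.

```java
final class AvroFullNameSketch {
    // An Avro full name is "namespace.name", or just "name" when no namespace is set.
    static String fullName(String namespace, String name) {
        return (namespace == null || namespace.isEmpty()) ? name : namespace + "." + name;
    }

    public static void main(String[] args) {
        // Hypothetical names, for illustration only.
        String written = fullName("example.namespace", "example_record"); // files written with a namespace
        String incoming = fullName(null, "example_record");               // new records without one

        System.out.println("written:  " + written);
        System.out.println("incoming: " + incoming);
        System.out.println("same full name? " + written.equals(incoming)); // prints "same full name? false"
    }
}
```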
|
@n3nash: when there is no target schema set, so hudi falls back to RowBasedSchemaProvider for the target schema. |
|
I am marking this as a non-blocker. |
|
@vinothchandar: If you are good with the patch, I can rebase and land it. |
|
We have added more |
What is the purpose of the pull request
https://issues.apache.org/jira/browse/HUDI-1343 was added so that defaults and nulls are handled properly. That fix applies only to the deltastreamer flow. Later, a fix in the spark ds layer addressed defaults and null values, so the schema post processor is not really required anymore. Hence this change sets its default value to disabled.
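As a rough illustration of what "default value disabled" means for users, the sketch below reads a boolean flag from `java.util.Properties` and falls back to `false` when the key is absent. The property key shown is an assumption made for illustration and should be checked against the Hudi configuration reference; the class and method names are hypothetical and are not part of Hudi.

```java
import java.util.Properties;

final class PostProcessorFlagSketch {
    // Assumed key name, for illustration only; verify against the Hudi configuration docs.
    static final String ENABLE_KEY =
        "hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable";

    // Returns false unless the user explicitly sets the flag to "true".
    static boolean isPostProcessorEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(ENABLE_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        System.out.println(isPostProcessorEnabled(defaults)); // prints false

        Properties optIn = new Properties();
        optIn.setProperty(ENABLE_KEY, "true");
        System.out.println(isPostProcessorEnabled(optIn)); // prints true
    }
}
```

Under this reading, users who still rely on the post processor, such as those on the namespace migration path discussed in the conversation, would need to opt in explicitly.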
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.