HUDI-1283 Fill missing columns with default value when spark dataframe save to hudi table #2091
Conversation
Fill missing normal column (except key) with schema default value when data write is upsert and payload is OverwriteNonDefaultsWithLatestAvroPayload
# Conflicts: # hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala
Thank you for creating this PR. At this point, I am not fully convinced that we really need this logic. A missing column in the DataFrame could also mean that the column has been dropped. Hudi schema evolution does not really support dropping fields at this point in time, but if we plan to support something like that in the future, this would contradict it.
While the logic LGTM, let's get a second opinion. @vinothchandar @bvaradar thoughts on this use-case?
Thank you for your review. I think a missing column does not always mean it was dropped; it may simply be omitted when updating only specific columns. Like an UPDATE in SQL, sometimes I only want to update one column and keep the others unchanged.
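To illustrate the partial-update intent described above, here is a minimal sketch of an upsert whose incoming DataFrame carries only the key fields and the single column being changed. The table name, base path, and column names (user_table, id, ts, dt, price) are assumptions for illustration; today such a write may fail on a schema mismatch, and the proposal in this PR is to fill the absent columns with their schema defaults before writing.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("partial-upsert-sketch").getOrCreate()
import spark.implicits._

// Only the key fields (id, ts, dt) and the one column being updated are present;
// 'name' and any other columns are intentionally omitted from the incoming batch.
val updates = Seq((1, 1000L, "2020-09-01", 9.99)).toDF("id", "ts", "dt", "price")

updates.write.format("hudi")
  .option("hoodie.table.name", "user_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/user_table")
```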
# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala # hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala
Codecov Report
@@              Coverage Diff              @@
##              master    #2091      +/-   ##
=============================================
+ Coverage      53.45%   53.52%   +0.07%
- Complexity      2781     2782       +1
=============================================
  Files            354      354
  Lines          16158    16180      +22
  Branches        1648     1653       +5
=============================================
+ Hits            8637     8661      +24
+ Misses          6821     6817       -4
- Partials         700      702       +2
Flags with carried forward coverage won't be shown.
# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@vinothchandar @bvaradar your thoughts here?
# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala # hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala
Maybe we can guard it with a config, so only interested users can leverage it and the default behavior stays untouched.
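One way such a guard could look is sketched below. This is only an illustration: the option key and the fillWithDefaults function are hypothetical names, not part of Hudi's actual configuration or of this PR.

```scala
import org.apache.spark.sql.DataFrame

// Sketch of guarding the fill logic behind a config so the default write path stays untouched.
// "hoodie.datasource.write.fill.missing.columns.with.default" is a hypothetical key.
def maybeFillMissingColumns(df: DataFrame,
                            parameters: Map[String, String],
                            fillWithDefaults: DataFrame => DataFrame): DataFrame = {
  // Defaults to "false" so existing behavior is unchanged unless explicitly enabled.
  val enabled = parameters
    .getOrElse("hoodie.datasource.write.fill.missing.columns.with.default", "false")
    .toBoolean
  if (enabled) fillWithDefaults(df) else df
}
```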
@vinothchandar @n3nash: any input appreciated. This is just waiting on your pointers.
@ivorzhou: can you please clarify whether the requirement is to fill every missing column with its default value, or to use the values from the previous version of the record when a column is not present in the incoming record?
#2927 puts in a fix where, if the incoming data has a subset of the columns compared to the table schema, it will populate default values and proceed. Closing this PR in favor of that one. Please feel free to reopen if the requirement is different, or create a new one.
Fill missing normal columns (except key columns) with the schema default value when the write operation is upsert and the payload is OverwriteNonDefaultsWithLatestAvroPayload.
Given a table with (id: record key, ts: precombine key, dt: partition key).
Upsert with the following data (name column value missing).
Result
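To make the described scenario concrete, here is a rough sketch of checking the result after such an upsert. The table path and expectations are assumptions for illustration; the key point is that with OverwriteNonDefaultsWithLatestAvroPayload, schema defaults filled into the missing name column do not overwrite existing non-default values.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("verify-default-fill").getOrCreate()

// Read the table back after upserting a batch that was missing the 'name' column.
val result = spark.read.format("hudi").load("/tmp/hudi/user_table")

// Expectation under OverwriteNonDefaultsWithLatestAvroPayload:
//  - rows that already existed keep their prior 'name' value, because the filled-in
//    schema default does not overwrite a non-default stored value;
//  - brand-new rows get the schema default for 'name'.
result.select("id", "ts", "dt", "name").show()
```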