Skip to content

Conversation

@ivorzhou
Copy link

@ivorzhou ivorzhou commented Sep 15, 2020

Fill missing normal column (except key) with schema default value when data write is upsert and
payload is OverwriteNonDefaultsWithLatestAvroPayload

Given (id PK ts precombied KEY dt partition KEY)

id  ,    name,    age,     ts,         dt
1    ,  Jack    ,   10   ,     1  ,          20191212
2    ,  Tom    ,    11   ,     1   ,         20191213
3    ,    Bill    ,    12    ,    1    ,        20191212

Upsert with following data (name column value missing)

id   ,    age,     ts,         dt
1     ,    22  ,      2   ,        20191212

Result

id  ,    name,    age,     ts,         dt
1   ,    Jack  ,     22 ,       2,            20191212
2   ,    Tom  ,      11 ,       1 ,           20191213
3   ,     Bill   ,      12 ,       1 ,           20191212

ivor added 2 commits September 15, 2020 17:16
with schema default value when data write is upsert and
payload is OverwriteNonDefaultsWithLatestAvroPayload
ivor added 4 commits September 16, 2020 12:15
� Conflicts:
�	hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala
Copy link
Contributor

@umehrot2 umehrot2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for creating this PR. At this point, I am not fully convinced if we really need this logic. A missing column in the DataFrame could also mean that column has been dropped, although Hudi schema evolution does not really support dropping of fields at this point of time. But if in future, if we are planning to support something like that then this would contradict with it.

While the logic LGTM, lets get a second opinion. @vinothchandar @bvaradar thoughts on this use-case ?

@ivorzhou
Copy link
Author

ivorzhou commented Sep 23, 2020

Thank you for creating this PR. At this point, I am not fully convinced if we really need this logic. A missing column in the DataFrame could also mean that column has been dropped, although Hudi schema evolution does not really support dropping of fields at this point of time. But if in future, if we are planning to support something like that then this would contradict with it.

While the logic LGTM, lets get a second opinion. @vinothchandar @bvaradar thoughts on this use-case ?

Thank you for your reviewing, I think sometimes a missing column does not mean dropped ,but ignored when update specified columns. Like update SQL, sometimes I only want to update one column , keep others unchanged.

@vinothchandar vinothchandar added the area:schema Schema evolution and data types label Oct 9, 2020
# Conflicts:
#	hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
#	hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala
@codecov-io
Copy link

codecov-io commented Oct 11, 2020

Codecov Report

Merging #2091 (1ba5f9c) into master (78fd122) will increase coverage by 0.07%.
The diff coverage is 81.48%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2091      +/-   ##
============================================
+ Coverage     53.45%   53.52%   +0.07%     
- Complexity     2781     2782       +1     
============================================
  Files           354      354              
  Lines         16158    16180      +22     
  Branches       1648     1653       +5     
============================================
+ Hits           8637     8661      +24     
+ Misses         6821     6817       -4     
- Partials        700      702       +2     
Flag Coverage Δ Complexity Δ
hudicli 38.50% <ø> (ø) 0.00 <ø> (ø)
hudiclient 100.00% <ø> (ø) 0.00 <ø> (ø)
hudicommon 55.13% <ø> (+0.05%) 0.00 <ø> (ø)
hudihadoopmr 32.94% <ø> (ø) 0.00 <ø> (ø)
hudispark 66.04% <81.48%> (+0.31%) 0.00 <0.00> (ø)
huditimelineservice 65.30% <ø> (ø) 0.00 <ø> (ø)
hudiutilities 70.06% <ø> (ø) 0.00 <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 53.97% <81.48%> (+2.66%) 0.00 <0.00> (ø)
...e/hudi/common/table/log/HoodieLogFormatWriter.java 78.81% <0.00%> (+1.69%) 23.00% <0.00%> (ø%)
...ache/hudi/common/fs/inline/InMemoryFileSystem.java 89.65% <0.00%> (+10.34%) 16.00% <0.00%> (+1.00%)

ivor added 3 commits October 29, 2020 10:05
# Conflicts:
#	hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@umehrot2
Copy link
Contributor

Thank you for creating this PR. At this point, I am not fully convinced if we really need this logic. A missing column in the DataFrame could also mean that column has been dropped, although Hudi schema evolution does not really support dropping of fields at this point of time. But if in future, if we are planning to support something like that then this would contradict with it.
While the logic LGTM, lets get a second opinion. @vinothchandar @bvaradar thoughts on this use-case ?

Thank you for your reviewing, I think sometimes a missing column does not mean dropped ,but ignored when update specified columns. Like update SQL, sometimes I only want to update one column , keep others unchanged.

@vinothchandar @bvaradar your thoughts here ?

ivor added 2 commits December 2, 2020 10:54
# Conflicts:
#	hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
#	hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala
@nsivabalan
Copy link
Contributor

nsivabalan commented Dec 26, 2020

@ivorzhou : is the requirement to set default value or value from previous version of the record? if previous version of the record, then guess we already have another PR for this #2106

@ivorzhou
Copy link
Author

ivorzhou commented Jan 6, 2021

@ivorzhou : is the requirement to set default value or value from previous version of the record? if previous version of the record, then guess we already have another PR for this #2106

I want to update record with specified field like SQL update.
UPDATE col='value' where id=1

@nsivabalan
Copy link
Contributor

May be we can guard it by a config. So only interested users can leverage it and default behavior is untouched.
@umehrot2 @vinothchandar @bvaradar

@vinothchandar vinothchandar added the priority:critical Production degraded; pipelines stalled label Feb 11, 2021
@nsivabalan
Copy link
Contributor

nsivabalan commented Feb 17, 2021

@vinothchandar @n3nash : any inputs appreciated. this is just waiting on your pointers.
btw @ivorzhou : In the mean time, a high level comment. not sure if you have handled this case. what if some of the missing columns don't have any default value set?

@nsivabalan
Copy link
Contributor

@ivorzhou : can you please clarify if the requirement is to fill w/ default values for every column or use values from previous version if not present in incoming record?

@vinothchandar vinothchandar removed the priority:critical Production degraded; pipelines stalled label Mar 14, 2021
@nsivabalan
Copy link
Contributor

@ivorzhou : if your requirement is to fetch value from previous version of the record and inject into incoming record, we already have another PR open #2666. Feel to clarify and close this PR out if the requirement is same.

@nsivabalan
Copy link
Contributor

#2927 puts in a fix, where in if you incoming data has subset of columns compared to table schema, it will populate default vals and proceed. Closing this PR in favor of the other. Please feel free to reopen if requirement is different or create a new one.

@nsivabalan nsivabalan closed this May 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:schema Schema evolution and data types

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants