HUDI-1283 Fill missing columns with default value when spark dataframe save to hudi table #2091

ivorzhou · 2020-09-15T09:20:00Z

Fill missing normal column (except key) with schema default value when data write is upsert and
payload is OverwriteNonDefaultsWithLatestAvroPayload

Given (id PK ts precombied KEY dt partition KEY)

id  ,    name,    age,     ts,         dt
1    ,  Jack    ,   10   ,     1  ,          20191212
2    ,  Tom    ,    11   ,     1   ,         20191213
3    ,    Bill    ,    12    ,    1    ,        20191212

Upsert with following data (name column value missing)

id   ,    age,     ts,         dt
1     ,    22  ,      2   ,        20191212

Result

id  ,    name,    age,     ts,         dt
1   ,    Jack  ,     22 ,       2,            20191212
2   ,    Tom  ,      11 ,       1 ,           20191213
3   ,     Bill   ,      12 ,       1 ,           20191212

with schema default value when data write is upsert and payload is OverwriteNonDefaultsWithLatestAvroPayload

� Conflicts: � hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala

umehrot2

Thank you for creating this PR. At this point, I am not fully convinced if we really need this logic. A missing column in the DataFrame could also mean that column has been dropped, although Hudi schema evolution does not really support dropping of fields at this point of time. But if in future, if we are planning to support something like that then this would contradict with it.

While the logic LGTM, lets get a second opinion. @vinothchandar @bvaradar thoughts on this use-case ?

ivorzhou · 2020-09-23T00:54:41Z

Thank you for creating this PR. At this point, I am not fully convinced if we really need this logic. A missing column in the DataFrame could also mean that column has been dropped, although Hudi schema evolution does not really support dropping of fields at this point of time. But if in future, if we are planning to support something like that then this would contradict with it.

While the logic LGTM, lets get a second opinion. @vinothchandar @bvaradar thoughts on this use-case ?

Thank you for your reviewing, I think sometimes a missing column does not mean dropped ,but ignored when update specified columns. Like update SQL, sometimes I only want to update one column , keep others unchanged.

# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala # hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala

codecov-io · 2020-10-11T00:10:26Z

Codecov Report

Merging #2091 (1ba5f9c) into master (78fd122) will increase coverage by 0.07%.
The diff coverage is 81.48%.

@@             Coverage Diff              @@
##             master    #2091      +/-   ##
============================================
+ Coverage     53.45%   53.52%   +0.07%     
- Complexity     2781     2782       +1     
============================================
  Files           354      354              
  Lines         16158    16180      +22     
  Branches       1648     1653       +5     
============================================
+ Hits           8637     8661      +24     
+ Misses         6821     6817       -4     
- Partials        700      702       +2

Flag	Coverage Δ	Complexity Δ
hudicli	`38.50% <ø> (ø)`	`0.00 <ø> (ø)`
hudiclient	`100.00% <ø> (ø)`	`0.00 <ø> (ø)`
hudicommon	`55.13% <ø> (+0.05%)`	`0.00 <ø> (ø)`
hudihadoopmr	`32.94% <ø> (ø)`	`0.00 <ø> (ø)`
hudispark	`66.04% <81.48%> (+0.31%)`	`0.00 <0.00> (ø)`
huditimelineservice	`65.30% <ø> (ø)`	`0.00 <ø> (ø)`
hudiutilities	`70.06% <ø> (ø)`	`0.00 <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ	Complexity Δ
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala	`53.97% <81.48%> (+2.66%)`	`0.00 <0.00> (ø)`
...e/hudi/common/table/log/HoodieLogFormatWriter.java	`78.81% <0.00%> (+1.69%)`	`23.00% <0.00%> (ø%)`
...ache/hudi/common/fs/inline/InMemoryFileSystem.java	`89.65% <0.00%> (+10.34%)`	`16.00% <0.00%> (+1.00%)`

# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala

umehrot2 · 2020-10-29T04:02:32Z

Thank you for creating this PR. At this point, I am not fully convinced if we really need this logic. A missing column in the DataFrame could also mean that column has been dropped, although Hudi schema evolution does not really support dropping of fields at this point of time. But if in future, if we are planning to support something like that then this would contradict with it.
While the logic LGTM, lets get a second opinion. @vinothchandar @bvaradar thoughts on this use-case ?

Thank you for your reviewing, I think sometimes a missing column does not mean dropped ,but ignored when update specified columns. Like update SQL, sometimes I only want to update one column , keep others unchanged.

@vinothchandar @bvaradar your thoughts here ?

# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala # hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala

nsivabalan · 2020-12-26T05:06:33Z

@ivorzhou : is the requirement to set default value or value from previous version of the record? if previous version of the record, then guess we already have another PR for this #2106

ivorzhou · 2021-01-06T06:56:20Z

@ivorzhou : is the requirement to set default value or value from previous version of the record? if previous version of the record, then guess we already have another PR for this #2106

I want to update record with specified field like SQL update.
UPDATE col='value' where id=1

nsivabalan · 2021-01-28T02:16:21Z

May be we can guard it by a config. So only interested users can leverage it and default behavior is untouched.
@umehrot2 @vinothchandar @bvaradar

nsivabalan · 2021-02-17T04:15:32Z

@vinothchandar @n3nash : any inputs appreciated. this is just waiting on your pointers.
btw @ivorzhou : In the mean time, a high level comment. not sure if you have handled this case. what if some of the missing columns don't have any default value set?

nsivabalan · 2021-03-03T15:31:46Z

@ivorzhou : can you please clarify if the requirement is to fill w/ default values for every column or use values from previous version if not present in incoming record?

nsivabalan · 2021-04-01T14:04:15Z

@ivorzhou : if your requirement is to fetch value from previous version of the record and inject into incoming record, we already have another PR open #2666. Feel to clarify and close this PR out if the requirement is same.

nsivabalan · 2021-05-26T11:17:13Z

#2927 puts in a fix, where in if you incoming data has subset of columns compared to table schema, it will populate default vals and proceed. Closing this PR in favor of the other. Please feel free to reopen if requirement is different or create a new one.

ivor added 2 commits September 15, 2020 17:16

HUDI-1283 Fill missing normal column (except key)

625b68a

with schema default value when data write is upsert and payload is OverwriteNonDefaultsWithLatestAvroPayload

HUDI-1283 UPDATE

f5be5ea

vinothchandar assigned umehrot2 Sep 16, 2020

ivor added 4 commits September 16, 2020 12:15

HUDI-1283 UPDATE

9dd0488

Merge branch 'master' of https://github.com/apache/hudi into HUDI-1283

d505ae5

� Conflicts: � hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala

HUDI-1283 UPDATE

509bca6

HUDI-1283 UPDATE

f205686

umehrot2 reviewed Sep 22, 2020

View reviewed changes

vinothchandar added the area:schema Schema evolution and data types label Oct 9, 2020

Merge branch 'gitmaster' into HUDI-1283

51f7d63

# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala # hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala

ivor added 3 commits October 29, 2020 10:05

Merge branch 'gitmaster' into HUDI-1283

7b7781a

# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala

HUDI-1283 UPDATE

e38ea45

HUDI-1283 UPDATE

e0f168d

ivor added 2 commits December 2, 2020 10:54

Merge branch 'gitmaster' into HUDI-1283

1ba5f9c

# Conflicts: # hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala # hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala

HUDI-1283 UPDATE

5dd10d5

vinothchandar assigned nsivabalan Jan 29, 2021

vinothchandar added the priority:critical Production degraded; pipelines stalled label Feb 11, 2021

vinothchandar removed the priority:critical Production degraded; pipelines stalled label Mar 14, 2021

nsivabalan closed this May 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HUDI-1283 Fill missing columns with default value when spark dataframe save to hudi table #2091

HUDI-1283 Fill missing columns with default value when spark dataframe save to hudi table #2091

Uh oh!

ivorzhou commented Sep 15, 2020 •

edited by vinothchandar

Loading

Uh oh!

umehrot2 left a comment •

edited

Loading

Uh oh!

ivorzhou commented Sep 23, 2020 •

edited

Loading

Uh oh!

codecov-io commented Oct 11, 2020 •

edited

Loading

Uh oh!

umehrot2 commented Oct 29, 2020

Uh oh!

nsivabalan commented Dec 26, 2020 •

edited

Loading

Uh oh!

ivorzhou commented Jan 6, 2021

Uh oh!

nsivabalan commented Jan 28, 2021

Uh oh!

nsivabalan commented Feb 17, 2021 •

edited

Loading

Uh oh!

nsivabalan commented Mar 3, 2021

Uh oh!

nsivabalan commented Apr 1, 2021

Uh oh!

nsivabalan commented May 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

HUDI-1283 Fill missing columns with default value when spark dataframe save to hudi table #2091

HUDI-1283 Fill missing columns with default value when spark dataframe save to hudi table #2091

Uh oh!

Conversation

ivorzhou commented Sep 15, 2020 • edited by vinothchandar Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

umehrot2 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ivorzhou commented Sep 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-io commented Oct 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

umehrot2 commented Oct 29, 2020

Uh oh!

nsivabalan commented Dec 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ivorzhou commented Jan 6, 2021

Uh oh!

nsivabalan commented Jan 28, 2021

Uh oh!

nsivabalan commented Feb 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nsivabalan commented Mar 3, 2021

Uh oh!

nsivabalan commented Apr 1, 2021

Uh oh!

nsivabalan commented May 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ivorzhou commented Sep 15, 2020 •

edited by vinothchandar

Loading

umehrot2 left a comment •

edited

Loading

ivorzhou commented Sep 23, 2020 •

edited

Loading

codecov-io commented Oct 11, 2020 •

edited

Loading

nsivabalan commented Dec 26, 2020 •

edited

Loading

nsivabalan commented Feb 17, 2021 •

edited

Loading