[HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows #2049
Conversation
Force-pushed from d0f7b62 to 01fba00
Force-pushed from 01fba00 to 11512f9
Force-pushed from 11512f9 to 9b68e0e
Codecov Report
@@             Coverage Diff              @@
##             master    #2049      +/-   ##
============================================
+ Coverage     53.62%   53.68%   +0.06%
- Complexity     2848     2850       +2
============================================
  Files           359      359
  Lines         16553    16574      +21
  Branches       1780     1784       +4
============================================
+ Hits           8876     8898      +22
+ Misses         6918     6917       -1
  Partials        759      759
Flags with carried forward coverage won't be shown.
Looks like the coalesce action is not based on the sorting result.
Here is an example: I set outputSparkPartitions to 2, and the partition column is event_type:
// Assumes a SparkSession named `spark` (e.g., in spark-shell).
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.spark_partition_id
import spark.implicits._

val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
(Here I added a new column partitionID for better understanding.) Based on the current logic, after sorting and coalescing, the df becomes:
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id()).show(false)
+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name |event_ts |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100 |event_name_16 |2015-01-01T13:51:39.340396Z|type1 |0 |
|108 |event_name_18 |2015-01-01T11:51:33.340396Z|type1 |0 |
|105 |event_name_678|2015-01-01T13:51:42.248818Z|type2 |0 |
|110 |event_name_20 |2014-02-01T11:51:33.340396Z|type3 |0 |
|104 |event_name_123|2015-01-01T12:15:00.512679Z|type1 |1 |
|101 |event_name_546|2015-01-01T12:14:58.597216Z|type2 |1 |
|109 |event_name_19 |2014-01-01T11:51:33.340396Z|type3 |1 |
+--------+--------------+---------------------------+----------+-----------+
You can see the coalescing result does not actually depend on the sorting result: each Spark partition contains rows from all 3 Hudi partitions (event_type values).
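For comparison, here is a minimal sketch of a repartition-based alternative (not what this PR implements): shuffling on event_type before sorting within partitions keeps each Spark partition limited to the event_type values that hash to it, at the cost of the shuffle that coalesce avoids.

// Sketch only: trade the shuffle that coalesce avoids for partition locality.
// Rows with the same event_type hash to the same Spark partition, and each
// partition is then sorted locally by event_type and event_id.
val df3 = df
  .repartition(2, functions.col("event_type"))
  .sortWithinPartitions(functions.col("event_type"), functions.col("event_id"))
df3.withColumn("partitionID", spark_partition_id()).show(false)

With hash partitioning, each event_type lands entirely in one Spark partition, though two types can still share a partition when their hashes collide modulo the partition count.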
I see, thanks for bringing it up. I wanted to avoid the shuffle and hence thought we would rely on coalesce. Let me see if there is something we could do.
I have not introduced a new config for this, but am reusing the one used for BulkInsertPartitioner.
Open to adding a new config if required.
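For illustration, a minimal sketch of reusing that existing key from the datasource path; the class name is hypothetical, and the option keys are assumed to be Hudi's standard write/bulk-insert configs rather than anything introduced by this PR.

// Sketch only: reuse the existing user-defined partitioner config for the
// row-writer bulk insert path. com.example.MyRowPartitioner is hypothetical.
df.write.format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.partitionpath.field", "event_type")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.row.writer.enable", "true")
  .option("hoodie.bulkinsert.user.defined.partitioner.class",
    "com.example.MyRowPartitioner")
  .mode("overwrite")
  .save("/tmp/hudi/events")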
Force-pushed from a40300a to 3692d9b
nsivabalan left a comment:
left a note.
Codecov Report
@@              Coverage Diff              @@
##             master    #2049       +/-   ##
=============================================
- Coverage     45.79%   28.30%   -17.49%
+ Complexity     5270     1233     -4037
=============================================
  Files           909      372      -537
  Lines         39390    13988    -25402
  Branches       4244     1426     -2818
=============================================
- Hits          18039     3960    -14079
+ Misses        19508     9741     -9767
+ Partials       1843      287     -1556
Flags with carried forward coverage won't be shown.
Force-pushed from 3692d9b to 7f80674
Changes in the last commit are not yet ready for review. Pushed it to access it from a different place.
Closing this in favor of #3149
What is the purpose of the pull request
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
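As a rough manual check (a sketch only, not from the PR itself: the table path and field names are illustrative, and the option keys are assumed to be Hudi's standard datasource and bulk-insert configs), a row-writer bulk insert with an explicit sort mode could be issued against an input DataFrame df like this:

// Sketch only: bulk insert through the row-writer path with a chosen sort mode.
df.write.format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.partitionpath.field", "event_type")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.row.writer.enable", "true")
  .option("hoodie.bulkinsert.sort.mode", "PARTITION_SORT")
  .mode("overwrite")
  .save("/tmp/hudi/events")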
Committer checklist
[x] Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.