Skip to content

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Aug 28, 2020

What is the purpose of the pull request

  • Adding support for user defined bulk insert partitioner for bulk insert with Rows.
  • Added 3 out of the box sorting options with BulkInsertSort modes. Similar to BulkInsert with JavaRDD.
    • No sorting
    • Global sorting
    • Partition sorting.

Brief change log

  • Added user defined partitioner in BulkInsert with Rows.
  • Added 3 out of the box sorting options. NonSortPartitionerWithRows, GlobalSortPartitionerWithRows and RDDPartitionSortPartitionerWithRows.

Verify this pull request

This change added tests and can be verified as follows:

  • Added TestBulkInsertInternalPartitionerForRows to verify the change.

Committer checklist

  • [ x] Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan marked this pull request as draft August 28, 2020 00:21
@vinothchandar vinothchandar self-assigned this Sep 9, 2020
@vinothchandar vinothchandar added the status:in-progress Work in progress label Oct 4, 2020
@nsivabalan nsivabalan changed the title [HUDI-1104] [WIP] Bulk insert dedup [HUDI-1104] Adding support for UserDefinedPartitioners and BulkSortModes to BulkInsert with Rows Oct 17, 2020
@nsivabalan nsivabalan removed the status:in-progress Work in progress label Oct 17, 2020
@vinothchandar vinothchandar added the status:in-progress Work in progress label Oct 25, 2020
@codecov-io
Copy link

codecov-io commented Oct 25, 2020

Codecov Report

Merging #2049 into master will increase coverage by 0.06%.
The diff coverage is 64.28%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2049      +/-   ##
============================================
+ Coverage     53.62%   53.68%   +0.06%     
- Complexity     2848     2850       +2     
============================================
  Files           359      359              
  Lines         16553    16574      +21     
  Branches       1780     1784       +4     
============================================
+ Hits           8876     8898      +22     
+ Misses         6918     6917       -1     
  Partials        759      759              
Flag Coverage Δ Complexity Δ
#hudicli 38.37% <ø> (ø) 193.00 <ø> (ø)
#hudiclient 100.00% <ø> (ø) 0.00 <ø> (ø)
#hudicommon 54.70% <ø> (-0.03%) 1794.00 <ø> (-1.00)
#hudihadoopmr 33.05% <ø> (ø) 181.00 <ø> (ø)
#hudispark 65.90% <64.28%> (+0.48%) 305.00 <1.00> (+1.00)
#huditimelineservice 62.29% <ø> (ø) 50.00 <ø> (ø)
#hudiutilities 70.09% <ø> (+0.01%) 327.00 <ø> (+2.00)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 50.56% <0.00%> (-0.39%) 0.00 <0.00> (ø)
...org/apache/hudi/HoodieDatasetBulkInsertHelper.java 94.11% <75.00%> (-2.86%) 3.00 <0.00> (ø)
...src/main/java/org/apache/hudi/DataSourceUtils.java 46.84% <85.71%> (+2.08%) 23.00 <1.00> (+1.00)
...ache/hudi/common/fs/inline/InMemoryFileSystem.java 79.31% <0.00%> (-10.35%) 15.00% <0.00%> (-1.00%)
...in/java/org/apache/hudi/utilities/UtilHelpers.java 64.19% <0.00%> (-0.40%) 30.00% <0.00%> (ø%)
...i/utilities/deltastreamer/HoodieDeltaStreamer.java 69.23% <0.00%> (-0.32%) 18.00% <0.00%> (ø%)
...java/org/apache/hudi/common/fs/StorageSchemes.java 100.00% <0.00%> (ø) 10.00% <0.00%> (ø%)
...ava/org/apache/hudi/keygen/CustomKeyGenerator.java 82.60% <0.00%> (ø) 16.00% <0.00%> (ø%)
...g/apache/hudi/utilities/sources/AvroDFSSource.java 0.00% <0.00%> (ø) 0.00% <0.00%> (ø%)
...udi/utilities/sources/helpers/DFSPathSelector.java 84.44% <0.00%> (+2.39%) 14.00% <0.00%> (+2.00%)
... and 1 more

@nsivabalan nsivabalan marked this pull request as ready for review October 25, 2020 16:46
@nsivabalan nsivabalan requested a review from bvaradar October 25, 2020 16:46
@nsivabalan nsivabalan changed the title [HUDI-1104] Adding support for UserDefinedPartitioners and BulkSortModes to BulkInsert with Rows [HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows Oct 25, 2020
Copy link
Contributor

@zhedoubushishi zhedoubushishi Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like coalesce action is not based on the sorting result.

Here is an example, if I set outputSparkPartitions to 2, the partition column is event_type:

val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
  ).toDF("event_id", "event_name", "event_ts", "event_type")

(Here I added a new column partitionID for better understanding) Based on the current logic, after sorting and coalesce, the df would become:

val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+

You can see the coalescing result actually does not depend on the sorting result. Each spark partition id contains 3 types of Hudi partitions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for bringing it up. wanted to avoid the shuffle and hence thought will rely on coalesce. let me see if there is something we could do.

@nsivabalan nsivabalan added priority:high Significant impact; potential bugs and removed status:in-progress Work in progress labels May 24, 2021
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not introduced a new config for this, but reusing the same used for BulkInsertPartitioner.
Open to adding a new config if required.

@nsivabalan nsivabalan force-pushed the bulk_insert_dedup branch from a40300a to 3692d9b Compare June 7, 2021 14:29
Copy link
Contributor Author

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a note.

@codecov-commenter
Copy link

codecov-commenter commented Jun 7, 2021

Codecov Report

Merging #2049 (7f80674) into master (b9e28e5) will decrease coverage by 17.48%.
The diff coverage is 81.25%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #2049       +/-   ##
=============================================
- Coverage     45.79%   28.30%   -17.49%     
+ Complexity     5270     1233     -4037     
=============================================
  Files           909      372      -537     
  Lines         39390    13988    -25402     
  Branches       4244     1426     -2818     
=============================================
- Hits          18039     3960    -14079     
+ Misses        19508     9741     -9767     
+ Partials       1843      287     -1556     
Flag Coverage Δ
hudicli ?
hudiclient 22.45% <81.25%> (-7.94%) ⬇️
hudicommon ?
hudiflink ?
hudihadoopmr ?
hudisparkdatasource ?
hudisync 6.72% <ø> (-45.01%) ⬇️
huditimelineservice ?
hudiutilities 56.63% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../BulkInsertInternalPartitionerWithRowsFactory.java 50.00% <50.00%> (ø)
...ecution/bulkinsert/NonSortPartitionerWithRows.java 80.00% <80.00%> (ø)
...tion/bulkinsert/GlobalSortPartitionerWithRows.java 83.33% <83.33%> (ø)
...ulkinsert/RDDPartitionSortPartitionerWithRows.java 93.33% <93.33%> (ø)
...main/java/org/apache/hudi/metrics/HoodieGauge.java 0.00% <0.00%> (-100.00%) ⬇️
.../org/apache/hudi/hive/NonPartitionedExtractor.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/apache/hudi/metrics/MetricsReporter.java 0.00% <0.00%> (-100.00%) ⬇️
...a/org/apache/hudi/metrics/MetricsReporterType.java 0.00% <0.00%> (-100.00%) ⬇️
...rg/apache/hudi/client/bootstrap/BootstrapMode.java 0.00% <0.00%> (-100.00%) ⬇️
...he/hudi/hive/HiveStylePartitionValueExtractor.java 0.00% <0.00%> (-100.00%) ⬇️
... and 609 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9e28e5...7f80674. Read the comment docs.

@hudi-bot
Copy link
Collaborator

hudi-bot commented Jun 18, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan added the status:in-progress Work in progress label Jun 20, 2021
@nsivabalan
Copy link
Contributor Author

Changes in last commit is not yet ready for review. Pushed it to access it from a diff place.

@nsivabalan
Copy link
Contributor Author

Closing this in favor of #3149

@nsivabalan nsivabalan closed this Jun 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:high Significant impact; potential bugs status:in-progress Work in progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants