[HUDI-5685] Fixing deduplication in Bulk Insert row-writing path #7825

Closed
alexeykudinkin wants to merge 11 commits into apache:master from onehouseinc:ak/blk-ins-ddup-fix

Conversation

@alexeykudinkin
Contributor

@alexeykudinkin alexeykudinkin commented Feb 2, 2023

Change Logs

Currently, when hoodie.combine.before.insert is set to true and hoodie.bulkinsert.sort.mode is set to NONE, Bulk Insert row-writing performance degrades considerably due to the following circumstances:

  • During de-duplication (within dedupRows), records in the incoming RDD are reshuffled (by Spark's default HashPartitioner) based on (partition-path, record-key) into N partitions
  • When BulkInsertSortMode.NONE is used as the partitioner, no re-partitioning is performed, and therefore each Spark task might be writing into M table partitions
  • This in turn causes an explosion in the number of (small) files created, hurting both performance and the table's layout

This PR addresses the performance gap by introducing TablePartitioningAwarePartitioner, which partitions records during de-duplication in the following way:

  • For a partitioned table, we hash-partition (into Spark partitions) based on the value of the partition-path
  • For a non-partitioned table, we hash-partition (into Spark partitions) based on the value of the record-key
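The scheme above can be sketched without Spark as follows (a minimal illustration; the object and method names here are hypothetical, and the real TablePartitioningAwarePartitioner extends org.apache.spark.Partitioner):

```scala
// Spark-free sketch of the partitioning scheme described above.
// bucket() mirrors Spark HashPartitioner's non-negative-modulo bucketing.
object DedupPartitioningSketch {
  def bucket(key: Any, numPartitions: Int): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h
  }

  // Partitioned table: co-locate rows by partition-path so every row bound
  // for the same table partition lands in the same Spark partition.
  // Non-partitioned table (empty partition-path): hash the record-key instead.
  def partitionFor(partitionPath: String, recordKey: String, numPartitions: Int): Int =
    if (partitionPath.nonEmpty) bucket(partitionPath, numPartitions)
    else bucket(recordKey, numPartitions)
}
```

Because all rows sharing a partition-path hash to the same bucket, each Spark task writes into far fewer table partitions than with an arbitrary (partition-path, record-key) shuffle.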

Impact

This considerably improves write performance for the Bulk Insert row-writing path with de-duplication enabled

Risk level (write none, low medium or high below)

Low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@alexeykudinkin alexeykudinkin added the priority:blocker (Production down; release blocker) label Feb 2, 2023
@alexeykudinkin alexeykudinkin force-pushed the ak/blk-ins-ddup-fix branch 2 times, most recently from 96d711d to 247062c Compare February 2, 2023 07:06
@alexeykudinkin alexeykudinkin changed the title [DNM] Fixing deduplication in Bulk Insert row-writing path [HUDI-5685] Fixing deduplication in Bulk Insert row-writing path Feb 2, 2023
PRECOMBINE_FIELD.key -> preCombineField,
PARTITIONPATH_FIELD.key -> partitionFieldsStr,
PAYLOAD_CLASS_NAME.key -> payloadClassName,
HoodieWriteConfig.COMBINE_BEFORE_INSERT.key -> String.valueOf(hasPrecombineColumn),
Member

I guess the intent here was to automatically infer the COMBINE_BEFORE_INSERT config. With this change, it is not enough for the user to just configure the precombine field; they also need to enable COMBINE_BEFORE_INSERT if they want to deduplicate. Isn't it?

Is there validation in the code which checks that if COMBINE_BEFORE_INSERT is enabled then a precombine field is also configured? If not, it would be better to add one as part of the configs improvement story.
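A minimal sketch of what such a validation could look like (names like WriteConfigSketch are illustrative placeholders, not actual Hudi code):

```scala
// Hypothetical fail-fast check: de-duplication is requested but no
// precombine field exists to break ties between duplicate records.
case class WriteConfigSketch(combineBeforeInsert: Boolean, preCombineField: Option[String])

def validateCombineBeforeInsert(cfg: WriteConfigSketch): Unit =
  if (cfg.combineBeforeInsert)
    require(cfg.preCombineField.exists(_.trim.nonEmpty),
      "hoodie.combine.before.insert is enabled but no precombine field is configured")
```

Running the check at config-build time surfaces the misconfiguration before any data is shuffled.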

Member

Can you clarify why this is being removed in this PR, though?

Contributor Author

@codope this was just a back-stop to keep things running during testing; this is addressed properly in #7813, so this change is reverted

import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{AnalysisException, Column, DataFrame, SparkSession}
import org.apache.spark.sql.{AnalysisException, Column, DataFrame, HoodieDataTypeUtils, HoodieInternalRowUtils, SparkSession}
Member

nit: optimize imports (HoodieInternalRowUtils is not used).

*
* For more details check out HUDI-5685
*/
private case class TablePartitioningAwarePartitioner(override val numPartitions: Int,
Member

I understand the benefit but have we tested it?

Contributor Author

Yes, this has been tested in our benchmarking run

def hasMetaFields(structType: StructType): Boolean =
structType.getFieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD).isDefined

// TODO scala-doc
Member

nit: remove todo?

val partitionPath = if (isPartitioned) row.getUTF8String(partitionPathMetaFieldOrd) else UTF8String.EMPTY_UTF8
val recordKey = row.getUTF8String(recordKeyMetaFieldOrd)

((partitionPath, recordKey), row)
Member

not copying the row here?

Contributor Author

@alexeykudinkin alexeykudinkin Feb 2, 2023


Not needed anymore (we're doing subsequent shuffling which is sparing us a need to copy)

* For more details check out HUDI-5685
*/
private case class TablePartitioningAwarePartitioner(override val numPartitions: Int,
val isPartitioned: Boolean) extends Partitioner {
Member

we don't need an additional flag to tell whether the table is partitioned or not. Can we just check nonEmpty(partitionPath)?

Contributor Author

Good point

def hasMetaFields(structType: StructType): Boolean =
structType.getFieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD).isDefined

// TODO scala-doc
Member

resolve TODO

structType.getFieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD).isDefined

// TODO scala-doc
def addMetaFields(schema: StructType): StructType = {
Member

this is more like ensuring that meta fields are placed first in the schema, so the name could be more accurate.

Contributor Author

This is a relocated method; keeping the name for compatibility

Member

@codope codope left a comment

Thanks for addressing the comments.

val prependedRdd: RDD[InternalRow] =
df.queryExecution.toRdd.mapPartitions { iter =>
val sourceRdd = df.queryExecution.toRdd
val populatedRdd: RDD[InternalRow] = if (hasMetaFields(schema)) {
Contributor

is this for the clustering row-writer code path?

override def getPartition(key: Any): Int = {
key match {
case null => 0
case (partitionPath, recordKey) =>
Contributor

won't this result in data skew? If one of the Hudi partitions has a lot of data, the corresponding Spark partition will skew the total time for de-dup, right?

Contributor

this was one of the reasons why we did not go with this approach: to avoid data skew.
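The skew concern can be illustrated on made-up data: hashing by partition-path funnels every row of a hot table partition into the same bucket (and thus the same Spark task), no matter how many cold partitions exist:

```scala
// Illustration of the reviewer's skew concern (synthetic data, not Hudi code).
val numPartitions = 8
def bucket(key: String): Int = {
  val h = key.hashCode % numPartitions
  if (h < 0) h + numPartitions else h
}
// 1000 rows in one "hot" table partition vs 100 single-row "cold" ones.
val rows = Seq.fill(1000)("hot-partition") ++ (1 to 100).map(i => s"cold-$i")
// Per-bucket row counts after hash-partitioning by partition-path.
val counts = rows.groupBy(bucket).map { case (b, rs) => b -> rs.size }
// All 1000 "hot" rows share one bucket, so one task dominates de-dup time.
```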

* however assuming that meta-fields should either be omitted or specified in full
*/
def hasMetaFields(structType: StructType): Boolean =
structType.getFieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD).isDefined
Contributor

minor. should we check for partition path as well ?

@hudi-bot
Collaborator

hudi-bot commented Feb 2, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@alexeykudinkin alexeykudinkin removed the priority:blocker (Production down; release blocker) label Feb 2, 2023
@nsivabalan
Contributor

@alexeykudinkin : can you close if this is not valid anymore ?

@nsivabalan nsivabalan added the priority:high (Significant impact; potential bugs) and writer-core labels Feb 8, 2023
@alexeykudinkin
Contributor Author

This is still a valid scenario when someone uses NONE as the partitioner

@alexeykudinkin alexeykudinkin added the status:in-progress (Work in progress) label Feb 8, 2023
@bvaradar bvaradar added the type:bug (Bug reports and fixes) label Oct 4, 2023
@github-actions github-actions bot added the size:M (PR with lines of changes in (100, 300]) label Feb 26, 2024
Contributor

@yihua yihua left a comment

Closing this PR as the bulk insert behavior is intended this way, and the changes may be suboptimal for particular cases where there is data skew.


Labels

  • priority:high: Significant impact; potential bugs
  • size:M: PR with lines of changes in (100, 300]
  • status:in-progress: Work in progress
  • type:bug: Bug reports and fixes

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

7 participants