[HUDI-5685] Fixing deduplication in Bulk Insert row-writing path #7825

Closed
alexeykudinkin wants to merge 11 commits into apache:master from onehouseinc:ak/blk-ins-ddup-fix

Conversation

@alexeykudinkin
Contributor

@alexeykudinkin alexeykudinkin commented Feb 2, 2023

Change Logs

Currently, when hoodie.combine.before.insert is set to true and hoodie.bulkinsert.sort.mode is set to NONE, Bulk Insert row-writing performance degrades considerably due to the following circumstances:

  • During de-duplication (within dedupRows), records in the incoming RDD are reshuffled (by Spark's default HashPartitioner) based on (partition-path, record-key) into N partitions
  • When BulkInsertSortMode.NONE is used as the partitioner, no re-partitioning is performed, and therefore each Spark task might be writing into M table partitions
  • This in turn causes an explosion in the number of (small) files created, hurting both performance and the table's layout

This PR addresses the performance gap by introducing TablePartitioningAwarePartitioner, which partitions records during de-duplication in the following way:

  • For a partitioned table, we hash-partition (into Spark partitions) based on the value of the partition-path
  • For a non-partitioned table, we hash-partition (into Spark partitions) based on the value of the record-key
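The scheme above can be sketched without Spark as follows (a minimal illustration; the object and method names here are hypothetical, and the real TablePartitioningAwarePartitioner extends org.apache.spark.Partitioner):

```scala
// Spark-free sketch of the partitioning scheme described above.
// bucket() mirrors Spark HashPartitioner's non-negative-modulo bucketing.
object DedupPartitioningSketch {
  def bucket(key: Any, numPartitions: Int): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h
  }

  // Partitioned table: co-locate rows by partition-path so every row bound
  // for the same table partition lands in the same Spark partition.
  // Non-partitioned table (empty partition-path): hash the record-key instead.
  def partitionFor(partitionPath: String, recordKey: String, numPartitions: Int): Int =
    if (partitionPath.nonEmpty) bucket(partitionPath, numPartitions)
    else bucket(recordKey, numPartitions)
}
```

Because all rows sharing a partition-path hash to the same bucket, each Spark task writes into far fewer table partitions than with an arbitrary (partition-path, record-key) shuffle.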

Impact

This considerably improves write performance for the Bulk Insert row-writing path with de-duplication enabled

Risk level (write none, low medium or high below)

Low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@alexeykudinkin alexeykudinkin added the priority:blocker (Production down; release blocker) label Feb 2, 2023
@alexeykudinkin alexeykudinkin force-pushed the ak/blk-ins-ddup-fix branch 2 times, most recently from 96d711d to 247062c Compare February 2, 2023 07:06
@alexeykudinkin alexeykudinkin changed the title [DNM] Fixing deduplication in Bulk Insert row-writing path [HUDI-5685] Fixing deduplication in Bulk Insert row-writing path Feb 2, 2023
PRECOMBINE_FIELD.key -> preCombineField,
PARTITIONPATH_FIELD.key -> partitionFieldsStr,
PAYLOAD_CLASS_NAME.key -> payloadClassName,
HoodieWriteConfig.COMBINE_BEFORE_INSERT.key -> String.valueOf(hasPrecombineColumn),
Member

I guess the intent here was to automatically infer the COMBINE_BEFORE_INSERT config. With this change, it is not enough for the user to just configure the precombine field; they also need to enable COMBINE_BEFORE_INSERT if they want to deduplicate. Isn't it?

Is there validation in the code which checks that if COMBINE_BEFORE_INSERT is enabled then a precombine field is also configured? If not, it would be better to add one as part of the configs improvement story.
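A minimal sketch of what such a validation could look like (names like WriteConfigSketch are illustrative placeholders, not actual Hudi code):

```scala
// Hypothetical fail-fast check: de-duplication is requested but no
// precombine field exists to break ties between duplicate records.
case class WriteConfigSketch(combineBeforeInsert: Boolean, preCombineField: Option[String])

def validateCombineBeforeInsert(cfg: WriteConfigSketch): Unit =
  if (cfg.combineBeforeInsert)
    require(cfg.preCombineField.exists(_.trim.nonEmpty),
      "hoodie.combine.before.insert is enabled but no precombine field is configured")
```

Running the check at config-build time surfaces the misconfiguration before any data is shuffled.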

Member

Can you clarify why this is being removed in this PR, though?

Contributor Author

@codope this was just a back-stop to keep things running during testing; this is addressed properly in #7813, so this change is reverted

import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{AnalysisException, Column, DataFrame, SparkSession}
import org.apache.spark.sql.{AnalysisException, Column, DataFrame, HoodieDataTypeUtils, HoodieInternalRowUtils, SparkSession}
Member

nit: optimize imports (HoodieInternalRowUtils is not used).

*
* For more details check out HUDI-5685
*/
private case class TablePartitioningAwarePartitioner(override val numPartitions: Int,
Member

I understand the benefit but have we tested it?

Contributor Author

Yes, this has been tested in our benchmarking run

def hasMetaFields(structType: StructType): Boolean =
structType.getFieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD).isDefined

// TODO scala-doc
Member

nit: remove todo?

val partitionPath = if (isPartitioned) row.getUTF8String(partitionPathMetaFieldOrd) else UTF8String.EMPTY_UTF8
val recordKey = row.getUTF8String(recordKeyMetaFieldOrd)

((partitionPath, recordKey), row)
Member

not copying the row here?

Contributor Author

@alexeykudinkin alexeykudinkin Feb 2, 2023


Not needed anymore (we're doing subsequent shuffling which is sparing us a need to copy)

* For more details check out HUDI-5685
*/
private case class TablePartitioningAwarePartitioner(override val numPartitions: Int,
val isPartitioned: Boolean) extends Partitioner {
Member

we don't need an additional flag to tell whether the table is partitioned or not. Can we just check nonEmpty(partitionPath)?

Contributor Author

Good point

def hasMetaFields(structType: StructType): Boolean =
structType.getFieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD).isDefined

// TODO scala-doc
Member

resolve TODO

structType.getFieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD).isDefined

// TODO scala-doc
def addMetaFields(schema: StructType): StructType = {
Member

this is more like ensuring that meta fields are placed first in the schema, so the name could be more accurate.

Contributor Author

This is a relocated method; keeping the name for compatibility

Member

@codope codope left a comment

Thanks for addressing the comments.

val prependedRdd: RDD[InternalRow] =
df.queryExecution.toRdd.mapPartitions { iter =>
val sourceRdd = df.queryExecution.toRdd
val populatedRdd: RDD[InternalRow] = if (hasMetaFields(schema)) {
Contributor

is this for the clustering row-writer code path?

override def getPartition(key: Any): Int = {
key match {
case null => 0
case (partitionPath, recordKey) =>
Contributor

won't this result in data skew? If one of the Hudi partitions has a lot of data, the corresponding Spark partition will skew the total time for de-dup, right?

Contributor

this was one of the reasons why we did not go with this approach: to avoid data skew.
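The skew concern can be illustrated on made-up data: hashing by partition-path funnels every row of a hot table partition into the same bucket (and thus the same Spark task), no matter how many cold partitions exist:

```scala
// Illustration of the reviewer's skew concern (synthetic data, not Hudi code).
val numPartitions = 8
def bucket(key: String): Int = {
  val h = key.hashCode % numPartitions
  if (h < 0) h + numPartitions else h
}
// 1000 rows in one "hot" table partition vs 100 single-row "cold" ones.
val rows = Seq.fill(1000)("hot-partition") ++ (1 to 100).map(i => s"cold-$i")
// Per-bucket row counts after hash-partitioning by partition-path.
val counts = rows.groupBy(bucket).map { case (b, rs) => b -> rs.size }
// All 1000 "hot" rows share one bucket, so one task dominates de-dup time.
```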

* however assuming that meta-fields should either be omitted or specified in full
*/
def hasMetaFields(structType: StructType): Boolean =
structType.getFieldIndex(HoodieRecord.RECORD_KEY_METADATA_FIELD).isDefined
Contributor

minor. should we check for partition path as well ?

@hudi-bot
Collaborator

hudi-bot commented Feb 2, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@alexeykudinkin alexeykudinkin removed the priority:blocker (Production down; release blocker) label Feb 2, 2023
@nsivabalan
Contributor

@alexeykudinkin : can you close if this is not valid anymore ?

@nsivabalan nsivabalan added the priority:high (Significant impact; potential bugs) and writer-core labels Feb 8, 2023
@alexeykudinkin
Contributor Author

This is still a valid scenario when someone uses NONE as the partitioner

@alexeykudinkin alexeykudinkin added the status:in-progress (Work in progress) label Feb 8, 2023
@bvaradar bvaradar added the type:bug (Bug reports and fixes) label Oct 4, 2023
@github-actions github-actions bot added the size:M (PR with lines of changes in (100, 300]) label Feb 26, 2024
Contributor

@yihua yihua left a comment

Closing this PR as the bulk insert behavior is intended this way, and the changes may be suboptimal for particular cases where there is data skew.


Labels

  • priority:high: Significant impact; potential bugs
  • size:M: PR with lines of changes in (100, 300]
  • status:in-progress: Work in progress
  • type:bug: Bug reports and fixes

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

7 participants