Conversation

@alexeykudinkin (Contributor) commented Jan 28, 2023

Change Logs

This change addresses a few performance regressions in HoodieSparkRecord identified during our recent benchmarking:

  1. HoodieSparkRecord rewrites records using rewriteRecord and rewriteRecordWithNewSchema, which perform schema traversals for every record. Instead, we should traverse the schema only once and produce a transformer that directly creates the new record from the old one (see the sketch after this list).

  2. HoodieRecords could currently be rewritten multiple times, even in cases where only meta-fields need to be mixed into the schema (in that case, HoodieSparkRecord simply wraps the source InternalRow into a HoodieInternalRow holding the meta-fields). This is problematic because a) UnsafeProjection re-uses a mutable row (as a buffer) to avoid allocating small objects, which leads to b) recursive overwriting of the same row.

  3. Records are currently copied for every Executor, even the Simple one, which does not buffer any records and therefore doesn't require them to be copied.
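
To illustrate point 1, here is a minimal, hypothetical sketch of the "traverse once, apply per record" idea (this is not the actual genUnsafeRowWriter implementation and handles flat schemas only): all schema lookups happen when the writer is generated, and the returned closure does a plain positional copy per row.

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
  import org.apache.spark.sql.types.StructType

  def genRowWriter(source: StructType, target: StructType): InternalRow => InternalRow = {
    // Schema traversal happens exactly once, at writer-generation time
    val ordinals  = target.fields.map(f => source.fieldIndex(f.name))
    val dataTypes = ordinals.map(i => source.fields(i).dataType)
    row => {
      // Per-record application: no schema lookups, just positional copying
      val values = new Array[Any](ordinals.length)
      var i = 0
      while (i < ordinals.length) {
        values(i) = row.get(ordinals(i), dataTypes(i))
        i += 1
      }
      new GenericInternalRow(values)
    }
  }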

To address the aforementioned gaps, the following changes have been implemented:

  1. The row-writing utilities have been revisited to decouple RowWriter generation from its actual application to the source row (that way, the per-row application is much more efficient). Additionally, a considerable number of row-writing utilities have been eliminated as purely duplicative.

  2. The HoodieRecord.rewriteRecord API is renamed to prependMetaFields to clearly disambiguate it from rewriteRecordWithNewSchema.

  3. The WriteHandle and HoodieMergeHelper implementations are substantially simplified and streamlined by being rebased onto prependMetaFields.

Impact

TBA

Risk level (write none, low, medium or high below)

Low

Documentation Update

N/A

  • The config description must be updated if new configs are added or the default value of the configs is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@alexeykudinkin alexeykudinkin changed the title [MINOR] Fixing performance regression in HoodieSparkRecord [HUDI-5633] Fixing performance regression in HoodieSparkRecord Jan 28, 2023
@alexeykudinkin alexeykudinkin added priority:blocker Production down; release blocker area:performance Performance optimizations labels Jan 28, 2023
@xushiyan xushiyan self-assigned this Jan 28, 2023
val recordKey = sparkKeyGenerator.getRecordKey(internalRow, sourceStructType)
val partitionPath = sparkKeyGenerator.getPartitionPath(internalRow, sourceStructType)
df.queryExecution.toRdd.mapPartitions { it =>
// TODO elaborate
Contributor:

please elaborate.

df.queryExecution.toRdd.mapPartitions { it =>
// TODO elaborate
val (unsafeProjection, transformer) = if (shouldDropPartitionColumns) {
(generateUnsafeProjection(dataFileStructType, dataFileStructType), genUnsafeRowWriter(sourceStructType, dataFileStructType))
Contributor:

nice optimization

val rowWriter = HoodieInternalRowUtils.genUnsafeRowWriter(schema1, schemaMerge)
val newRow = rowWriter(oldRow)

val serDe = sparkAdapter.createSparkRowSerDe(schemaMerge)
Contributor:

let's test all data types as much as possible (all primitives, arrays, maps, etc.).
also, let's test some null values for some of the fields.
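
A hedged sketch of the kind of coverage being asked for here (the schema and values are made up for illustration; only genUnsafeRowWriter is an identifier from this PR, and the target schema is left as an assumption):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
  import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData}
  import org.apache.spark.sql.types._
  import org.apache.spark.unsafe.types.UTF8String

  val sourceSchema = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType, nullable = true),      // exercises null values
    StructField("scores", ArrayType(DoubleType)),           // exercises arrays
    StructField("attrs", MapType(StringType, IntegerType))  // exercises maps
  ))

  val row: InternalRow = new GenericInternalRow(Array[Any](
    1L,
    null,                                                    // null field value
    new GenericArrayData(Array[Any](1.0, 2.5)),
    new ArrayBasedMapData(
      new GenericArrayData(Array[Any](UTF8String.fromString("k"))),
      new GenericArrayData(Array[Any](3)))
  ))

  // e.g.: val writer = HoodieInternalRowUtils.genUnsafeRowWriter(sourceSchema, targetSchema)
  //       writer(row) should then be compared against the expected UnsafeRow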

// NOTE: Record have to be cloned here to make sure if it holds low-level engine-specific
// payload pointing into a shared, mutable (underlying) buffer we get a clean copy of
// it since these records will be put into queue of QueueBasedExecutorFactory.
return isBufferingRecords ? newRecord.copy() : newRecord;
Contributor:

👍

val rowUpdater = new RowUpdater(newRow)

(fieldUpdater, ordinal, value) => {
// TODO elaborate
Contributor:

What does elaborate mean? Need more comments?

Contributor (Author):

Note to self to explain this particular piece.

HoodieAvroUtils.removeFields(schema, partitionColumns.toSet.asJava)
}

def generateSparkSchemaWithoutPartitionColumns(partitionParam: String, schema: StructType): StructType = {
Contributor (Author):

Dead code

extraPreCommitFn: Option[BiConsumer[HoodieTableMetaClient, HoodieCommitMetadata]] = Option.empty)
: (Boolean, common.util.Option[String], common.util.Option[String], common.util.Option[String],
SparkRDDWriteClient[HoodieRecordPayload[Nothing]], HoodieTableConfig) = {
hoodieWriteClient: Option[SparkRDDWriteClient[_]] = Option.empty,
Contributor (Author):

This was just search-and-replace removing invalid type references

InternalRow rewriteRecord = HoodieInternalRowUtils.rewriteRecordWithNewSchema(this.data, structType, newStructType, renameCols);
UnsafeRow unsafeRow = HoodieInternalRowUtils.getCachedUnsafeProjection(newStructType, newStructType).apply(rewriteRecord);
Function1<InternalRow, UnsafeRow> unsafeRowWriter =
HoodieInternalRowUtils.getCachedUnsafeRowWriter(structType, newStructType, Collections.emptyMap());
Contributor:

renameCols is missed here

Contributor (Author):

Good catch!

InternalRow rewriteRecord = HoodieInternalRowUtils.rewriteRecord(this.data, structType, targetStructType);
UnsafeRow unsafeRow = HoodieInternalRowUtils.getCachedUnsafeProjection(targetStructType, targetStructType).apply(rewriteRecord);
Function1<InternalRow, UnsafeRow> unsafeRowWriter =
HoodieInternalRowUtils.getCachedUnsafeRowWriter(structType, targetStructType, Collections.emptyMap());
Contributor:

rewriteRecord is used for compatibility with Avro schema evolution, while rewriteRecordWithNewSchema is used for Hudi schema evolution. They have different logic for type changes: for example, we can not change IntegerType -> DecimalType in rewriteRecord, but we can in rewriteRecordWithNewSchema.

Should we keep this?

Contributor (Author):

The key point here is that we don't actually need the rewriteRecord operation as such: historically it has been used to expand the (Avro) schema of the record to accommodate the meta-fields, which is now handled differently in HoodieSparkRecord.

Contributor:

We should support avro type promotion in this function in HoodieSparkRecord. We have discussed it before in #7003.

Contributor (Author):

Not sure I follow your train of thought -- there's no rewriteRecord method anymore; instead, it's being replaced w/ prependMetaFields.

HoodieRecord rewrittenRecord = schemaOnReadEnabled ? finalRecord.get().rewriteRecordWithNewSchema(schema, recordProperties, writeSchemaWithMetaFields)
: finalRecord.get().rewriteRecord(schema, recordProperties, writeSchemaWithMetaFields);

// Prepend meta-fields into the record
Contributor (Author):

This is the primary change in this file: instead of the sequence:

  • rewriteRecord/rewriteRecordWithNewSchema (rewriting the record into a schema bearing the meta-fields)
  • updateMetadataValues

we now call the prependMetaFields API directly (expanding the record's schema w/ the meta-fields and setting them at the same time)


HoodieOperation operation = withOperationField
? HoodieOperation.fromName(getNullableValAsString(structType, record.data, HoodieRecord.OPERATION_METADATA_FIELD)) : null;
? HoodieOperation.fromName(record.data.getString(structType.fieldIndex(HoodieRecord.OPERATION_METADATA_FIELD)))
Contributor:

getNullableValAsString handles the case where the field does not exist, but structType.fieldIndex does not.

Contributor (Author):

Good point! Let me revisit

Contributor (Author):

Actually, I looked at it again: in that case withOperationField is true, so this field has to be present in the schema.


MetadataValues metadataValues = new MetadataValues().setFileName(path.getName());
rewriteRecord = rewriteRecord.updateMetadataValues(writeSchemaWithMetaFields, config.getProps(), metadataValues);
HoodieRecord populatedRecord =
Contributor (Author):

Change similar to #7769 (comment)

// file holding this record even in cases when overall metadata is preserved
MetadataValues metadataValues = new MetadataValues().setFileName(newFilePath.getName());
rewriteRecord = rewriteRecord.updateMetadataValues(writeSchemaWithMetaFields, prop, metadataValues);
HoodieRecord populatedRecord =
Contributor (Author):

Change similar to #7769 (comment)


wrapper = ExecutorFactory.create(writeConfig, recordIterator, new UpdateHandler(mergeHandle), record -> {
HoodieRecord newRecord;
if (schemaEvolutionTransformerOpt.isPresent()) {
Contributor (Author):

The schema evolution transformer is now applied inside the record transformer, as opposed to via a MappingIterator previously.
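
A tiny sketch (names hypothetical) of that composition: an optional schema-evolution step is folded into the same per-record function that is handed to the executor, rather than being applied via a separate MappingIterator.

  // Fold any number of optional per-record steps into a single transformer
  def composeTransformer[A](steps: Option[A => A]*): A => A =
    steps.flatten.foldLeft(identity[A] _)((acc, step) => acc.andThen(step))

  // e.g. composeTransformer(schemaEvolutionStepOpt, Some(rewriteStep)) yields one
  // record => record function (both step names here are placeholders)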

// persisted records to adhere to an evolved schema
Option<Pair<Function<Schema, Function<HoodieRecord, HoodieRecord>>, Schema>> schemaEvolutionTransformerOpt =
composeSchemaEvolutionTransformer(writerSchema, baseFile, writeConfig, table.getMetaClient());
Option<Function<HoodieRecord, HoodieRecord>> schemaEvolutionTransformerOpt =
Contributor (Author):

Simplifying the implementation by supplying the reader schema into the method.

StructType structType = HoodieInternalRowUtils.getCachedSchema(recordSchema);
return keyGeneratorOpt.isPresent() ? ((SparkKeyGeneratorInterface) keyGeneratorOpt.get())
.getRecordKey(data, structType).toString() : data.getString(HoodieMetadataField.RECORD_KEY_METADATA_FIELD.ordinal());
return keyGeneratorOpt.isPresent()
Contributor (Author):

This code is unchanged (there was a change but it got reverted)

Function1<InternalRow, UnsafeRow> unsafeRowWriter =
HoodieInternalRowUtils.getCachedUnsafeRowWriter(structType, newStructType, renameCols);

boolean containMetaFields = hasMetaFields(newStructType);
Contributor (Author):

Wrapping into HoodieInternalRow has been removed and abstracted to only occur in prependMetaFields API (considerably simplifying the impl here)


private boolean useWriterSchema;

public SparkLazyInsertIterable(Iterator<HoodieRecord<T>> recordItr,
Contributor (Author):

Dead code

prevFieldName
}

private def createArrayData(elementType: DataType, length: Int): ArrayData = elementType match {
Contributor (Author):

These utility classes are borrowed from Spark's AvroSerializer

Member:

are these the same across different spark versions?

Contributor (Author):

Yes, they're 1:1 w/ the Spark type system.
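
For reference, a sketch of what such a utility typically looks like on the Spark side (reconstructed from memory, so treat the exact set of cases as an assumption): primitive element types get pre-allocated UnsafeArrayData buffers, everything else falls back to GenericArrayData.

  import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
  import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
  import org.apache.spark.sql.types._

  def createArrayData(elementType: DataType, length: Int): ArrayData = elementType match {
    case BooleanType => UnsafeArrayData.fromPrimitiveArray(new Array[Boolean](length))
    case ByteType    => UnsafeArrayData.fromPrimitiveArray(new Array[Byte](length))
    case ShortType   => UnsafeArrayData.fromPrimitiveArray(new Array[Short](length))
    case IntegerType => UnsafeArrayData.fromPrimitiveArray(new Array[Int](length))
    case LongType    => UnsafeArrayData.fromPrimitiveArray(new Array[Long](length))
    case FloatType   => UnsafeArrayData.fromPrimitiveArray(new Array[Float](length))
    case DoubleType  => UnsafeArrayData.fromPrimitiveArray(new Array[Double](length))
    case _           => new GenericArrayData(new Array[Any](length))
  }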

public void consume(HoodieRecord record) {
try {
Thread.currentThread().wait();
synchronized (this) {
Contributor (Author):

That's fixing the flaky test
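
For context, my reading of the diff above (not a statement from the thread): Object.wait() is only legal while the calling thread holds the object's monitor, otherwise it throws IllegalMonitorStateException, so the consumer has to wait inside a synchronized block. A minimal sketch of the pattern, with a hypothetical class name:

  // wait()/notifyAll() must both be called while holding the monitor of `this`
  class BlockingConsumer {
    def consume(record: AnyRef): Unit = this.synchronized {
      wait()        // blocks until another thread calls notifyAll() on this instance
    }
    def release(): Unit = this.synchronized {
      notifyAll()   // wakes the blocked consumer up
    }
  }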

df.queryExecution.toRdd.mapPartitions { it =>
val targetStructType = if (shouldDropPartitionColumns) dataFileStructType else writerStructType
// NOTE: To make sure we properly transform records
val targetStructTypeRowWriter = getCachedUnsafeRowWriter(sourceStructType, targetStructType)
Contributor (Author):

This replaces the old way of calling rewriteRecord and then applying an unsafeProjection w/ just applying the new UnsafeRowWriter.

import org.apache.spark.sql.{HoodieInternalRowUtils, Row, SparkSession}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}

class TestHoodieInternalRowUtils extends FunSuite with Matchers with BeforeAndAfterAll {
Contributor (Author):

These are merged into another test suite

// do change type operation
val updateChange = TableChanges.ColumnUpdateChange.get(internalSchema)
updateChange.updateColumnType("id", Types.LongType.get).updateColumnType("comb", Types.FloatType.get).updateColumnType("com1", Types.DoubleType.get).updateColumnType("col0", Types.StringType.get).updateColumnType("col1", Types.FloatType.get).updateColumnType("col11", Types.DoubleType.get).updateColumnType("col12", Types.StringType.get).updateColumnType("col2", Types.DoubleType.get).updateColumnType("col21", Types.StringType.get).updateColumnType("col3", Types.StringType.get).updateColumnType("col31", Types.DecimalType.get(18, 9)).updateColumnType("col4", Types.DecimalType.get(18, 9)).updateColumnType("col41", Types.StringType.get).updateColumnType("col5", Types.DateType.get).updateColumnType("col51", Types.DecimalType.get(18, 9)).updateColumnType("col6", Types.StringType.get)
updateChange.updateColumnType("id", Types.LongType.get)
Contributor (Author):

No changes, just breaking down an unreadably long line.

@xushiyan xushiyan (Member) left a comment:

overall lgtm. did not review HoodieInternalRowUtils line by line, which should have more UT coverage

* Checks whether configured {@link HoodieExecutor} buffer records (for ex, by holding them
* in the queue)
*/
public static boolean isBufferingRecords(HoodieWriteConfig config) {
Member:

why not make this a property of ExecutorType, so we don't need this extra helper?

Contributor (Author):

Good call. Will address in a follow-up (to avoid re-running CI)

Contributor (Author):

I actually realized that this is unfortunately not possible: we're copying in the transformers which we pass as args to the ctor of the respective Executor, therefore we can't just call a method on it.
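
For readers outside this thread, a self-contained sketch of what the helper boils down to (the executor-type names are assumptions, not quoted from the PR): only queue-backed executors buffer records and therefore need the defensive copy shown earlier.

  // The real ExecutorType lives in Hudi; this stand-in enum just mirrors the idea
  object ExecutorType extends Enumeration {
    val SIMPLE, BOUNDED_IN_MEMORY, DISRUPTOR = Value
  }

  def isBufferingRecords(executorType: ExecutorType.Value): Boolean = executorType match {
    case ExecutorType.BOUNDED_IN_MEMORY | ExecutorType.DISRUPTOR => true   // queue-backed, buffers records
    case ExecutorType.SIMPLE                                     => false  // pass-through, no buffering
  }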

return kryo.readObjectOrNull(input, GenericRecord.class, avroSerializer);
}

static void updateMetadataValuesInternal(GenericRecord avroRecord, MetadataValues metadataValues) {
Member:

this looks like a helper method that fits in some avro utils

Contributor (Author):

It's very specific in its purpose though -- it overwrites meta-fields that shouldn't occur outside of HoodieRecord

@codope codope force-pushed the ak/rfc46-perf-fix branch from 1aa1f0c to ca3b5fa Compare January 30, 2023 12:55
@alexeykudinkin alexeykudinkin force-pushed the ak/rfc46-perf-fix branch 2 times, most recently from 9bfa20f to 62f1095 Compare January 31, 2023 00:36
.map(HoodieAvroUtils::getRootLevelFieldName)
.collect(Collectors.toList());
Schema recordKeySchema = HoodieAvroUtils.generateProjectionSchema(avroSchema, recordKeyColumns);
LOG.info("Schema to be used for reading record Keys :" + recordKeySchema);
Contributor (Author):

These are now properly set by the actual FileReaders.

try {
Function<HoodieRecord, HoodieRecord> transformer = record -> {
String recordKey = record.getRecordKey(schema, Option.of(keyGenerator));
return createNewMetadataBootstrapRecord(recordKey, partitionPath, recordMerger.getRecordType())
Contributor (Author):

Creating createNewMetadataBootstrapRecord is the crux of the change here:

  • The metadata bootstrap record is now properly initialized with a schema that includes all of the meta-fields, and not one truncated to just the record-key (HoodieSparkRecord is not able to handle such a truncated meta-fields schema)
  • The Avro path is restored to what it was before RFC-46

…t propagated as data columns;

Cleaning up dead-code
}

/*
* log.info("Partition Fields are : (" + partitionFields + "). Initial Source Schema :" + source.schema());
Contributor (Author):

Cleaning up dead commented code (not updated since 2018)

final Dataset<Row> src = source.drop(colsToDrop);
// log.info("Final Schema from Source is :" + src.schema());
// Remove Hoodie meta columns
final Dataset<Row> src = source.drop(HoodieRecord.HOODIE_META_COLUMNS.stream().toArray(String[]::new));
Contributor (Author):

The change here is to avoid keeping the partition-path, as this will make HoodieSparkSqlWriter treat it as a data column, which is not compatible w/ SparkRecordMerger.

Contributor (Author):

_hoodie_partition_path isn't used in either the source or the DS; according to the commented-out code it was used previously, but it is not used anymore.

#7132 recently added a config that forces all of the meta-fields to be cleaned up, but it's false by default.
