
Conversation

@wzx140
Contributor

@wzx140 wzx140 commented Oct 19, 2022

Change Logs

  1. Add HoodieSparkValidateDuplicateKeyRecordMerger, behaving the same as ValidateDuplicateKeyPayload. It should be used with the config "hoodie.sql.insert.mode=strict" (a hedged config sketch follows this list).
  2. Fix the nested-field existence check in HoodieCatalystExpressionUtils
  3. Fix rewrite in HoodieInternalRowUtils to support the same type promotions as Avro
  4. Fall back to Avro when using "merge into" SQL
  5. Fix some schema handling issues
  6. Support DeltaStreamer
  7. Convert the Parquet schema to a Spark schema and then to an Avro schema (in org.apache.hudi.io.storage.HoodieSparkParquetReader#getSchema). Some types in Avro are not compatible with Parquet. For example, a decimal stored as int32/int64 in Parquet converts to plain int/long in Avro, because Avro has no decimal backed by int/long; we would lose the logical type info if we converted directly to an Avro schema.
  8. Support schema evolution in Parquet blocks
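For context, a minimal hedged sketch of how item 1 might be wired through write options. Only HoodieWriteConfig.MERGER_IMPLS and "hoodie.sql.insert.mode" appear in this PR; the merger's package is not shown in the thread, so the fully qualified name below is a placeholder.

// Hedged sketch: pairing strict insert mode with the new merger (item 1).
// The merger's package is a placeholder; only the config keys are from this PR.
import org.apache.hudi.config.HoodieWriteConfig

val strictInsertOpts: Map[String, String] = Map(
  "hoodie.sql.insert.mode" -> "strict",
  HoodieWriteConfig.MERGER_IMPLS.key ->
    "<package>.HoodieSparkValidateDuplicateKeyRecordMerger"
)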

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@wzx140 wzx140 changed the title from [minor] add more test to [minor] add more test for rfc46 on Oct 19, 2022
@wzx140 wzx140 force-pushed the add-test branch 3 times, most recently from 4a4d0e7 to 8a5e66b on October 19, 2022 17:43
@alexeykudinkin
Contributor

alexeykudinkin commented Oct 19, 2022

LGTM!

Let's add coverage for DeltaStreamer tests as well (w/in TestHoodieDeltaStreamer, we don't need to parameterize all of them, but let's pick a representative set)

@wzx140 wzx140 force-pushed the add-test branch 3 times, most recently from 01c4965 to 6efb6b4 on October 25, 2022 17:24
@xushiyan xushiyan added the priority:blocker (Production down; release blocker) label Oct 31, 2022
@wzx140 wzx140 force-pushed the add-test branch 5 times, most recently from 277ad5d to f91a0b4 on November 13, 2022 17:28
@minihippo minihippo force-pushed the add-test branch 9 times, most recently from 1c1d6e2 to f09dab9 on November 15, 2022 17:29
@wzx140
Contributor Author

wzx140 commented Nov 16, 2022

@hudi-bot run azure

}
}

protected def withRecordType(f: => Unit, recordConfig: Map[HoodieRecordType, Map[String, String]]=Map.empty) {
Contributor

Why not structure it the same way as withSQLConf? Like below:

def withRecordType(recordConfig)(f: => Unit)

Contributor Author

@wzx140 wzx140 Nov 17, 2022

In most test cases, we do not need to pass different SQL configs per recordType, so I used default parameters in the function.

In "Test Insert Into None Partitioned Table", the Spark merger should be "HoodieSparkValidateDuplicateKeyRecordMerger" with "hoodie.sql.insert.mode=strict".

We should not replace SparkRecordMerger with HoodieSparkValidateDuplicateKeyRecordMerger when the user sets "hoodie.sql.insert.mode=strict", because the SQL config has higher priority.
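For reference, a hedged sketch (not the PR's final shape) of the curried form suggested above, mirroring withSQLConf from the enclosing test suite. The record types iterated and the per-type config lookup are assumptions.

import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType

protected def withRecordType(recordConfig: Map[HoodieRecordType, Map[String, String]] = Map.empty)(f: => Unit): Unit = {
  // Run the body once per record type, applying that type's SQL overrides (if any).
  Seq(HoodieRecordType.AVRO, HoodieRecordType.SPARK).foreach { recordType =>
    val sqlConf = recordConfig.getOrElse(recordType, Map.empty[String, String])
    withSQLConf(sqlConf.toSeq: _*) { f }
  }
}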

}

protected def withRecordType(f: => Unit, recordConfig: Map[HoodieRecordType, Map[String, String]]=Map.empty) {
Seq(HoodieRecordType.SPARK).foreach { recordType =>
Contributor

Why are we testing just the SPARK record type?

Contributor Author

fix

HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet"
)

def getOpts(recordType: HoodieRecordType, opt: Map[String, String]): (Map[String, String], Map[String, String]) = {
Contributor

nit: getWriterReaderOpts

Contributor Author

fixed

commonOpts ++ sparkOpts
} else {
commonOpts
def getOpts(recordType: HoodieRecordType, opt: Map[String, String] = commonOpts): (Map[String, String], Map[String, String]) = {
Contributor

Same as below, please ditto everywhere

Contributor Author

fixed
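For reference, a hedged reconstruction of the renamed helper discussed above. commonOpts and sparkOpts come from the surrounding test class shown in the snippets; the exact contents of the returned pair (writer options first, reader options second) are an assumption.

import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType

def getWriterReaderOpts(recordType: HoodieRecordType,
                        opt: Map[String, String] = commonOpts): (Map[String, String], Map[String, String]) = {
  recordType match {
    // Spark-native records additionally need sparkOpts (merger impls, parquet log block format, ...).
    case HoodieRecordType.SPARK => (opt ++ sparkOpts, sparkOpts)
    case _                      => (opt, Map.empty)
  }
}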

// Create the write parameters
val parameters = buildMergeIntoConfig(hoodieCatalogTable)
// TODO Remove it when we implement ExpressionPayload for SparkRecord
val parametersWithAvroRecordMerger = parameters ++ Map(HoodieWriteConfig.MERGER_IMPLS.key -> classOf[HoodieAvroRecordMerger].getName)
Contributor

Let's just append to parameters (no need for new var)

Contributor Author

parameters is an immutable map
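A small hedged illustration of the point under discussion: Scala's immutable Map is never modified by ++, which returns a new map, so the appended entry still needs a new binding (which is what the parametersWithAvroRecordMerger val above provides). The import paths and the table-name key are assumptions for illustration only.

import org.apache.hudi.common.model.HoodieAvroRecordMerger
import org.apache.hudi.config.HoodieWriteConfig

val parameters = Map("hoodie.table.name" -> "t1")   // immutable Map; ++ does not mutate it
val parametersWithAvroRecordMerger = parameters ++
  Map(HoodieWriteConfig.MERGER_IMPLS.key -> classOf[HoodieAvroRecordMerger].getName)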

newRecord.update(HoodieMetadataField.FILENAME_METADATA_FIELD.ordinal, CatalystTypeConverters.convertToCatalyst(fileName))
def compareSchema(left: StructType, right: StructType): Boolean = {
val schemaPair = (left, right)
if (!schemaCompareMap.contains(schemaPair)) {
Contributor

Left this comment before, but this still seems to be bubbling up: this doesn't make sense -- keys in the map will always be checked for equality, which defeats the purpose of this mapping

Contributor Author

OK. rewriteRecord and rewriteRecordWithNewSchema are fixed
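A short hedged illustration of the reviewer's point: a cache keyed by the (left, right) pair still has to hash and structurally compare both StructTypes on every lookup, which is exactly the work the cache was meant to avoid, so the direct comparison is equivalent.

import org.apache.spark.sql.types.StructType

// Equivalent to looking up schemaCompareMap((left, right)), without the redundant cache.
def compareSchema(left: StructType, right: StructType): Boolean = left == right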

case _: FloatType => newRow.update(pos, oldValue.asInstanceOf[Float].toDouble)
case _ => throw new IllegalArgumentException(s"$oldSchema and $newSchema are incompatible")
}
case _: BinaryType if oldType.isInstanceOf[StringType] => newRow.update(pos, oldValue.asInstanceOf[Array[Byte]].map(_.toChar).mkString)
Contributor

These two seem to be inverted

Contributor Author

fix
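A hedged sketch of the corrected promotion cases being discussed (function and variable names are illustrative): when the target is StringType and the source is BinaryType, the old value is a byte array to decode; the other way around, it is a Catalyst UTF8String to encode.

import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

def promoteValue(oldType: DataType, newType: DataType, oldValue: Any): Any =
  (newType, oldType) match {
    case (_: DoubleType, _: FloatType)  => oldValue.asInstanceOf[Float].toDouble
    case (_: StringType, _: BinaryType) => UTF8String.fromBytes(oldValue.asInstanceOf[Array[Byte]])
    case (_: BinaryType, _: StringType) => oldValue.asInstanceOf[UTF8String].getBytes
    case _ => throw new IllegalArgumentException(s"$oldType and $newType are incompatible")
  }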

@tailrec
def existField(structType: StructType, name: String): Boolean = {
structType.getFieldIndex(name).isDefined
if (name.contains(".")) {
Contributor

Let's move this method to HoodieInternalRowUtils, it has nothing to do w/ expressions

Contributor

Would be better to rewrite it as a single loop, instead of tail-rec (to avoid calling split N times)

Contributor Author

OK. I just wanted to call structType.getFieldIndex, so I put this function in HoodieCatalystExpressionUtils. I will move HoodieInternalRowUtils to the spark.sql package.

Contributor Author

Fixed
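A hedged sketch of the single-pass shape suggested above: split the dotted path once and walk nested StructTypes iteratively instead of re-splitting on every tail-recursive call.

import org.apache.spark.sql.types.{DataType, StructType}

def existField(structType: StructType, name: String): Boolean = {
  var current: DataType = structType
  for (part <- name.split('.')) {
    current match {
      case st: StructType if st.getFieldIndex(part).isDefined =>
        current = st(part).dataType   // descend into the nested struct (or stop at a leaf type)
      case _ => return false
    }
  }
  true
}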

private final boolean enablePointLookups;

protected final Schema readerSchema;
protected Schema readerSchema;
Contributor

We should not be relaxing mutability constraints unless we absolutely have to

Contributor

@wzx140 we'd need to revert these changes after we rebase on the latest master (these issues should have already been addressed on master)

InternalRow finalRow = new HoodieInternalRow(metaFields, data, containMetaFields);
// Rewrite if schema is not same
InternalRow finalRow = this.data;
StructType rawTargetSchema = HoodieInternalRowUtils.removeMetaField(targetStructType);
Contributor

Why do we need to remove meta fields?

StructType rawTargetSchema = HoodieInternalRowUtils.removeMetaField(targetStructType);
if (!HoodieInternalRowUtils.compareSchema(structType, rawTargetSchema)) {
InternalRow rewriteRecord = HoodieInternalRowUtils.rewriteRecord(this.data, structType, rawTargetSchema);
this.data = null;
Contributor

We should not modify the existing record

HoodieInternalRow finalRow = new HoodieInternalRow(metaFields, rewrittenRow, containMetaFields);
// Rewrite if schema is not same
InternalRow finalRow = this.data;
StructType rawNewSchema = HoodieInternalRowUtils.removeMetaField(newStructType);
Contributor

Same comments as above


public HoodieSparkParquetReader(Configuration conf, Path path) {
this.path = path;
this.conf = conf;
Contributor

We need to make sure we make a copy of the config first, before modifying it

Contributor Author

Fixed
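A hedged illustration of the review point (the config key set below is hypothetical): copy the shared Hadoop Configuration before mutating it inside the reader, so callers' configs are not changed as a side effect.

import org.apache.hadoop.conf.Configuration

def isolatedConf(shared: Configuration): Configuration = {
  val local = new Configuration(shared)             // copy constructor: mutations stay local
  local.set("example.reader.only.setting", "true")  // hypothetical key, for illustration only
  local
}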


new HoodieSparkRecord(key, processedRow, structType, false)
val processedRow = if (reconcileSchema) {
HoodieInternalRowUtils.getCachedUnsafeProjection(processedStructType, processedStructType)
Contributor

We'd need to reconcile this with #6358

* Validates duplicate keys for insert statements without enabling the INSERT_DROP_DUPS_OPT
* config.
*/
class HoodieSparkValidateDuplicateKeyRecordMerger extends HoodieSparkRecordMerger {
Contributor

I see now. Please update the java-doc to reflect that

return new HoodieAvroRecord<>(keyGenerator.getKey(record), payload);
});
} else if (recordType == HoodieRecordType.SPARK) {
// TODO we should remove it if we can read InternalRow from source.
Contributor

We actually already have a Jira for it, let me find it

SerializableSchema avroSchema = new SerializableSchema(schemaProvider.getTargetSchema());
SerializableSchema processedAvroSchema = new SerializableSchema(isDropPartitionColumns() ? HoodieAvroUtils.removeMetadataFields(avroSchema.get()) : avroSchema.get());
if (recordType == HoodieRecordType.AVRO) {
records = avroRDD.map(record -> {
Contributor

Code seems unchanged, but want to double-check with you: this section didn't change, right?

Contributor Author

Yes. We did not change the Avro section.

Contributor

@alexeykudinkin alexeykudinkin left a comment

@wzx140 LGTM! Looking for some minor clarifications

}

test("Test Insert Exception") {
val tableName = generateTableName
Contributor

Why do these tests change?

Contributor Author

We just added withRecordType() { here

Contributor

@wzx140 we didn't modify the test itself, did we?

GH UI shows it as if the whole test had been rewritten, which is very hard to decipher

Contributor Author

@alexeykudinkin We just added withRecordType() and put the SQL config in withSQLConf. We may have also changed the indentation, which makes the GH UI show the whole test as rewritten.

UTF8String[] metaFields = tryExtractMetaFields(data, structType);
InternalRow rewriteRecord = HoodieInternalRowUtils.rewriteRecord(this.data, structType, targetStructType);
UnsafeRow unsafeRow = HoodieInternalRowUtils.getCachedUnsafeProjection(targetStructType, targetStructType).apply(rewriteRecord);
InternalRow finalRow = copy ? unsafeRow.copy() : unsafeRow;
Contributor

I appreciate the intent, but we should leave all copying decisions to the user of HoodieSparkRecord -- we should only copy where it's absolutely necessary. Let's move this copying to a place where it's actually required.

Contributor Author

I think operations on a record should not change its copy state. When the record is copied, it means the record needs to be copied. So we should copy the rewritten record too. Does this make sense?

UTF8String[] metaFields = tryExtractMetaFields(data, structType);
InternalRow rewriteRecord = HoodieInternalRowUtils.rewriteRecordWithNewSchema(this.data, structType, newStructType, renameCols);
UnsafeRow unsafeRow = HoodieInternalRowUtils.getCachedUnsafeProjection(newStructType, newStructType).apply(rewriteRecord);
InternalRow finalRow = copy ? unsafeRow.copy() : unsafeRow;
Contributor

Same comment as above

private final boolean enablePointLookups;

protected final Schema readerSchema;
protected Schema readerSchema;
Contributor

@apache apache deleted a comment from hudi-bot Nov 28, 2022
@alexeykudinkin
Contributor

@wzx140 LGTM. Let's resolve the copying discussion and we should be good to go.

@alexeykudinkin
Contributor

@wzx140 let's disable TestCleaner in this PR and re-run the CI to make sure no other tests are failing

@wzx140
Contributor Author

wzx140 commented Nov 29, 2022

@hudi-bot run azure

@wzx140
Contributor Author

wzx140 commented Nov 29, 2022

@alexeykudinkin "UT FT other modules" was cancel because of timeout. We add some UTs in delta streamer. https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=13314&view=results

This test was run before "fix copy". It seems that there is no failure in "UT FT other modules". https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=13283&view=results

@wzx140
Contributor Author

wzx140 commented Nov 29, 2022

@hudi-bot run azure

HoodieRecord record = schemaOption.isPresent() ? currentRecord.rewriteRecordWithNewSchema(dataBlock.getSchema(), new Properties(), schemaOption.get()) : currentRecord;
HoodieRecord record = schemaOption.isPresent()
? currentRecord
.rewriteRecordWithNewSchema(dataBlock.getSchema(), new Properties(), schemaOption.get())
Contributor

Please add a comment elaborating why the copy is necessary (please do the same everywhere we make a copy)

private final boolean enablePointLookups;

protected final Schema readerSchema;
protected Schema readerSchema;
Contributor

@wzx140 we'd need to revert these changes after we rebase on the latest master (these issues should have already been addressed on master)

blockContentLoc.getBlockSize());

Schema writerSchema = new Schema.Parser().parse(this.getLogBlockHeader().get(HeaderMetadataType.SCHEMA));
if (!internalSchema.isEmptySchema()) {
Contributor

HoodieRecord recordCopy = record.copy();
if (!externalSchemaTransformation) {
return recordCopy;
return record.copy();
Contributor

Let's add a comment elaborating why we're making a copy here

String recKey = record.getRecordKey(reader.getSchema(), Option.of(keyGenerator));
HoodieRecord hoodieRecord = record
.rewriteRecord(reader.getSchema(), config.getProps(), HoodieAvroUtils.RECORD_KEY_SCHEMA)
.copy();
Contributor

Same comment as above

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@wzx140
Contributor Author

wzx140 commented Nov 30, 2022

@alexeykudinkin alexeykudinkin merged commit 9c691f0 into apache:release-feature-rfc46 Nov 30, 2022
wzx140 added a commit to wzx140/hudi that referenced this pull request Nov 30, 2022
[minor] add more test for rfc46 (apache#7003)

## Change Logs

 - Add HoodieSparkValidateDuplicateKeyRecordMerger, behaving the same as ValidateDuplicateKeyPayload. It should be
 used with the config "hoodie.sql.insert.mode=strict".
 - Fix the nested-field existence check in HoodieCatalystExpressionUtils
 - Fix rewrite in HoodieInternalRowUtils to support the same type promotions as Avro
 - Fall back to Avro when using "merge into" SQL
 - Fix some schema handling issues
 - Support DeltaStreamer
 - Convert the Parquet schema to a Spark schema and then to an Avro schema (in
 org.apache.hudi.io.storage.HoodieSparkParquetReader#getSchema). Some types in Avro are not compatible with
 Parquet. For example, a decimal stored as int32/int64 in Parquet converts to plain int/long in Avro, because Avro has no
 decimal backed by int/long. We would lose the logical type info if we converted directly to an Avro schema.
 - Support schema evolution in Parquet blocks

[Minor] fix multi deser avro payload (apache#7021)

In HoodieAvroRecord, we call isDelete and shouldIgnore before writing the record to the file, and each method deserializes the HoodiePayload. So we added a deserialization method to HoodieRecord and call it once before calling isDelete or shouldIgnore.

Co-authored-by: wangzixuan.wzxuan <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>

[MINOR] Properly registering target classes w/ Kryo (apache#7026)

* Added `HoodieKryoRegistrar` registering necessary Hudi's classes w/ Kryo to make their serialization more efficient (by serializing just the class id, in lieu of the fully qualified class name)

* Redirected Kryo registration to `HoodieKryoRegistrar`

* Registered additional classes likely to be serialized by Kryo

* Updated tests

* Fixed serialization of Avro's `Utf8` to serialize just the bytes

* Added tests

* Added custom `AvroUtf8Serializer`;
Tidying up

* Extracted `HoodieCommonKryoRegistrar` to leverage in `SerializationUtils`

* `HoodieKryoRegistrar` > `HoodieSparkKryoRegistrar`;
Rebased `HoodieSparkKryoRegistrar` onto `HoodieCommonKryoRegistrar`

* `lint`

* Fixing compilation for Spark 2.x

* Disabling flaky test

[MINOR] Make sure all `HoodieRecord`s are appropriately serializable by Kryo (apache#6977)

* Make sure `HoodieRecord`, `HoodieKey`, `HoodieRecordLocation` are all `KryoSerializable`

* Revisited `HoodieRecord` serialization hooks to make sure they a) could not be overridden, b) provide for hooks to properly
serialize record's payload;
Implemented serialization hooks for `HoodieAvroIndexedRecord`;
Implemented serialization hooks for `HoodieEmptyRecord`;

* Revisited `HoodieRecord` serialization hooks to make sure they a) could not be overridden, b) provide for hooks to properly
serialize record's payload;
Implemented serialization hooks for `HoodieAvroIndexedRecord`;
Implemented serialization hooks for `HoodieEmptyRecord`;
Implemented serialization hooks for `HoodieAvroRecord`;

* Revisited `HoodieSparkRecord` to transiently hold on to the schema so that it could project row

* Implemented serialization hooks for `HoodieSparkRecord`

* Added `TestHoodieSparkRecord`

* Added tests for Avro-based records

* Added test for `HoodieEmptyRecord`

* Fixed sealing/unsealing for `HoodieRecord` in `HoodieBackedTableMetadataWriter`

* Properly handle deflated records

* Fixing `Row`s encoding

* Fixed `HoodieRecord` to be properly sealed/unsealed

* Fixed serialization of the `HoodieRecordGlobalLocation`

[MINOR] Additional fixes for apache#6745 (apache#6947)

* Tidying up

* Tidying up more

* Cleaning up duplication

* Tidying up

* Revisited legacy operating mode configuration

* Tidying up

* Cleaned up `projectUnsafe` API

* Fixing compilation

* Cleaning up `HoodieSparkRecord` ctors;
Revisited mandatory unsafe-projection

* Fixing compilation

* Cleaned up `ParquetReader` initialization

* Revisited `HoodieSparkRecord` to accept either `UnsafeRow` or `HoodieInternalRow`, and avoid unnecessary copying after unsafe-projection

* Cleaning up redundant exception spec

* Make sure `updateMetadataFields` properly wraps `InternalRow` into `HoodieInternalRow` if necessary;
Cleaned up `MetadataValues`

* Fixed meta-fields extraction and `HoodieInternalRow` composition w/in `HoodieSparkRecord`

* De-duplicate `HoodieSparkRecord` ctors;
Make sure either only `UnsafeRow` or `HoodieInternalRow` are permitted inside `HoodieSparkRecord`

* Removed unnecessary copying

* Cleaned up projection for `HoodieSparkRecord` (dropping partition columns);
Removed unnecessary copying

* Fixing compilation

* Fixing compilation (for Flink)

* Cleaned up File Readers' interfaces:
  - Extracted `HoodieSeekingFileReader` interface (for key-ranged reads)
  - Pushed down concrete implementation methods into `HoodieAvroFileReaderBase` from the interfaces

* Cleaned up File Readers impls (in line with the new interfaces)

* Rebased `HoodieBackedTableMetadata` onto new `HoodieSeekingFileReader`

* Tidying up

* Missing licenses

* Re-instate custom override for `HoodieAvroParquetReader`;
Tidying up

* Fixed missing cloning w/in `HoodieLazyInsertIterable`

* Fixed missing cloning in deduplication flow

* Allow `HoodieSparkRecord` to hold `ColumnarBatchRow`

* Missing licenses

* Fixing compilation

* Missing changes

* Fixed Spark 2.x validation whether the row was read as a batch

Fix comment in RFC46 (apache#6745)

* rename

* add MetadataValues in updateMetadataValues

* remove singleton in fileFactory

* add truncateRecordKey

* remove hoodieRecord#setData

* rename HoodieAvroRecord

* fix code style

* fix HoodieSparkRecordSerializer

* fix benchmark

* fix SparkRecordUtils

* instantiate HoodieWriteConfig on the fly

* add test

* fix HoodieSparkRecordSerializer. Replace Java's object serialization with kryo

* add broadcast

* fix comment

* remove unnecessary broadcast

* add unsafe check in spark record

* fix getRecordColumnValues

* remove spark.sql.parquet.writeLegacyFormat

* fix unsafe projection

* fix

* pass external schema

* update doc

* rename back to HoodieAvroRecord

* fix

* remove comparable wrapper

* fix comment

* fix comment

* fix comment

* fix comment

* simplify row copy

* fix ParquetReaderIterator

Co-authored-by: Shawy Geng <[email protected]>
Co-authored-by: wangzixuan.wzxuan <[email protected]>

[RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback (apache#6132)

* Update the RFC-46 doc to fix comments feedback

* fix

Co-authored-by: wangzixuan.wzxuan <[email protected]>

[HUDI-4301] [HUDI-3384][HUDI-3385] Spark specific file reader/writer.(apache#5629)

* [HUDI-4301] [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

* add schema finger print

* add benchmark

* a new way to config the merger

* fix

Co-authored-by: wangzixuan.wzxuan <[email protected]>
Co-authored-by: gengxiaoyu <[email protected]>

[HUDI-3350][HUDI-3351] Support HoodieMerge API and Spark engine-specific  HoodieRecord (apache#5627)

Co-authored-by: wangzixuan.wzxuan <[email protected]>

[HUDI-4344] fix usage of HoodieDataBlock#getRecordIterator (apache#6005)

Co-authored-by: wangzixuan.wzxuan <[email protected]>

[HUDI-4292][RFC-46] Update doc to align with the Record Merge API changes (apache#5927)

[MINOR] Fix type casting in TestHoodieHFileReaderWriter

[HUDI-3378][HUDI-3379][HUDI-3381] Migrate usage of HoodieRecordPayload and raw Avro payload to HoodieRecord (apache#5522)

Co-authored-by: Alexey Kudinkin <[email protected]>
Co-authored-by: wangzixuan.wzxuan <[email protected]>
wzx140 added a commit to wzx140/hudi that referenced this pull request Dec 1, 2022
[minor] add more test for rfc46 (apache#7003)
wzx140 added a commit to wzx140/hudi that referenced this pull request Dec 2, 2022
[minor] add more test for rfc46 (apache#7003)
wzx140 added a commit to wzx140/hudi that referenced this pull request Dec 3, 2022
[minor] add more test for rfc46 (apache#7003)
wzx140 added a commit to wzx140/hudi that referenced this pull request Dec 9, 2022
[minor] add more test for rfc46 (apache#7003)
wzx140 added a commit to wzx140/hudi that referenced this pull request Dec 13, 2022
[minor] add more test for rfc46 (apache#7003)

## Change Logs

 - Add HoodieSparkValidateDuplicateKeyRecordMerger, which behaves the same as ValidateDuplicateKeyPayload. It should
 be used with the config "hoodie.sql.insert.mode=strict" (see the config sketch after this list).
 - Fix nested-field existence check in HoodieCatalystExpressionUtils
 - Fix rewrite in HoodieInternalRowUtils to support type promotion as in Avro
 - Fall back to Avro when using "merge into" SQL
 - Fix some schema handling issues
 - Support DeltaStreamer
 - Convert the Parquet schema to a Spark schema and then to an Avro schema (in
 org.apache.hudi.io.storage.HoodieSparkParquetReader#getSchema). Some Parquet types are not compatible with Avro:
 for example, a decimal stored as int32/int64 in Parquet converts to plain int/long in Avro, because Avro does not
 support decimal backed by int/long, so the logical type information is lost if we convert directly to an Avro schema.
 - Support schema evolution in the Parquet block
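
The first change-log item pairs the new Spark-native merger with strict insert mode. A minimal sketch of wiring the two together in Spark SQL follows: the `hoodie.sql.insert.mode` key and the merger class name come from the change log above, while the merger config key (`hoodie.datasource.write.record.merger.impls`), the class's package, and the table `hudi_tbl` are assumptions for illustration, not the PR's exact test setup.

```scala
import org.apache.spark.sql.SparkSession

object StrictInsertModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("strict-insert-mode-sketch")
      .master("local[2]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .getOrCreate()

    // Reject duplicate record keys on INSERT instead of merging them (from the change log).
    spark.sql("set hoodie.sql.insert.mode=strict")

    // Assumed config key and package: point the write path at the Spark-native merger
    // that mirrors ValidateDuplicateKeyPayload.
    spark.sql("set hoodie.datasource.write.record.merger.impls=" +
      "org.apache.hudi.HoodieSparkValidateDuplicateKeyRecordMerger")

    // `hudi_tbl` is a hypothetical, already-created Hudi table; inserting a row whose
    // record key already exists should now fail the write instead of updating it.
    spark.sql("insert into hudi_tbl values (2, 'a2', 20, 1500)")

    spark.stop()
  }
}
```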

[Minor] fix multi deser avro payload (apache#7021)

In HoodieAvroRecord, we call isDelete and shouldIgnore before writing the record to the file, and each of these methods deserializes the HoodiePayload. So we add a deserialization method to HoodieRecord and call it once before calling isDelete or shouldIgnore.

Co-authored-by: wangzixuan.wzxuan <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
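
A minimal sketch of the idea above, using made-up types rather than Hudi's actual classes: the payload is decoded once via a lazy field and then shared by `isDelete` and `shouldIgnore`, instead of being deserialized inside each check.

```scala
// Hypothetical record wrapper, not Hudi's API: `decode` stands in for the Avro
// deserialization that previously ran inside each check.
final class CachedPayloadRecord(payloadBytes: Array[Byte],
                                decode: Array[Byte] => Map[String, Any]) {

  // Deserialized at most once, then reused by every later check.
  private lazy val data: Map[String, Any] = decode(payloadBytes)

  def isDelete: Boolean = data.get("_hoodie_is_deleted").contains(true)

  def shouldIgnore: Boolean = data.isEmpty
}
```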

[MINOR] Properly registering target classes w/ Kryo (apache#7026)

* Added `HoodieKryoRegistrar` registering the necessary Hudi classes w/ Kryo to make their serialization more efficient (by serializing just the class id in lieu of the fully qualified class name); see the registration sketch after this commit's notes

* Redirected Kryo registration to `HoodieKryoRegistrar`

* Registered additional classes likely to be serialized by Kryo

* Updated tests

* Fixed serialization of Avro's `Utf8` to serialize just the bytes

* Added tests

* Added custom `AvroUtf8Serializer`;
Tidying up

* Extracted `HoodieCommonKryoRegistrar` to leverage in `SerializationUtils`

* `HoodieKryoRegistrar` > `HoodieSparkKryoRegistrar`;
Rebased `HoodieSparkKryoRegistrar` onto `HoodieCommonKryoRegistrar`

* `lint`

* Fixing compilation for Spark 2.x

* Disabling flaky test
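
The registrar commit relies on Kryo's class registration: a registered class is encoded as a small integer id instead of its fully qualified name. Here is a minimal sketch of the pattern using Spark's `KryoRegistrator` hook; the classes registered below are placeholders, not the set `HoodieSparkKryoRegistrar` actually registers.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: a registrar in the spirit of HoodieSparkKryoRegistrar.
class ExampleKryoRegistrar extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Registered classes are serialized by id, avoiding the class-name string.
    kryo.register(classOf[org.apache.avro.util.Utf8])
    kryo.register(classOf[Array[Byte]])
  }
}
```

Such a registrar would typically be enabled through the standard `spark.kryo.registrator` Spark config.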

[MINOR] Make sure all `HoodieRecord`s are appropriately serializable by Kryo (apache#6977)

* Make sure `HoodieRecord`, `HoodieKey`, `HoodieRecordLocation` are all `KryoSerializable`

* Revisited `HoodieRecord` serialization hooks to make sure they a) could not be overridden, b) provide hooks to properly
serialize the record's payload (see the hook-pattern sketch after this commit's notes);
Implemented serialization hooks for `HoodieAvroIndexedRecord`;
Implemented serialization hooks for `HoodieEmptyRecord`;

* Revisited `HoodieRecord` serialization hooks to make sure they a) could not be overridden, b) provide hooks to properly
serialize the record's payload;
Implemented serialization hooks for `HoodieAvroIndexedRecord`;
Implemented serialization hooks for `HoodieEmptyRecord`;
Implemented serialization hooks for `HoodieAvroRecord`;

* Revisited `HoodieSparkRecord` to transiently hold on to the schema so that it could project row

* Implemented serialization hooks for `HoodieSparkRecord`

* Added `TestHoodieSparkRecord`

* Added tests for Avro-based records

* Added test for `HoodieEmptyRecord`

* Fixed sealing/unsealing for `HoodieRecord` in `HoodieBackedTableMetadataWriter`

* Properly handle deflated records

* Fixing `Row`s encoding

* Fixed `HoodieRecord` to be properly sealed/unsealed

* Fixed serialization of the `HoodieRecordGlobalLocation`
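
A sketch of the hook pattern those bullets describe, with invented field names and nothing taken from Hudi's actual record layout: the base class keeps `write`/`read` final so subclasses cannot bypass the shared key handling, and exposes protected hooks for the engine-specific payload.

```scala
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

// Hypothetical base record, not HoodieRecord itself.
abstract class RecordBase(protected var key: String) extends KryoSerializable {

  // Final: the common part of the wire format cannot be overridden.
  final override def write(kryo: Kryo, output: Output): Unit = {
    output.writeString(key)
    writePayload(kryo, output) // engine-specific part supplied by the subclass
  }

  final override def read(kryo: Kryo, input: Input): Unit = {
    key = input.readString()
    readPayload(kryo, input)
  }

  // Hooks each concrete record type implements for its own payload.
  protected def writePayload(kryo: Kryo, output: Output): Unit
  protected def readPayload(kryo: Kryo, input: Input): Unit
}
```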

[MINOR] Additional fixes for apache#6745 (apache#6947)

* Tidying up

* Tidying up more

* Cleaning up duplication

* Tidying up

* Revisited legacy operating mode configuration

* Tidying up

* Cleaned up `projectUnsafe` API

* Fixing compilation

* Cleaning up `HoodieSparkRecord` ctors;
Revisited mandatory unsafe-projection

* Fixing compilation

* Cleaned up `ParquetReader` initialization

* Revisited `HoodieSparkRecord` to accept either `UnsafeRow` or `HoodieInternalRow`, and avoid unnecessary copying after unsafe-projection

* Cleaning up redundant exception spec

* Make sure `updateMetadataFields` properly wraps `InternalRow` into `HoodieInternalRow` if necessary;
Cleaned up `MetadataValues`

* Fixed meta-fields extraction and `HoodieInternalRow` composition w/in `HoodieSparkRecord`

* De-duplicate `HoodieSparkRecord` ctors;
Make sure either only `UnsafeRow` or `HoodieInternalRow` are permitted inside `HoodieSparkRecord`

* Removed unnecessary copying

* Cleaned up projection for `HoodieSparkRecord` (dropping partition columns);
Removed unnecessary copying

* Fixing compilation

* Fixing compilation (for Flink)

* Cleaned up File Readers' interfaces:
  - Extracted `HoodieSeekingFileReader` interface (for key-ranged reads)
  - Pushed down concrete implementation methods into `HoodieAvroFileReaderBase` from the interfaces

* Cleaned up File Reader impls (in line with the new interfaces)

* Rebased `HoodieBackedTableMetadata` onto the new `HoodieSeekingFileReader`

* Tidying up

* Missing licenses

* Re-instate custom override for `HoodieAvroParquetReader`;
Tidying up

* Fixed missing cloning w/in `HoodieLazyInsertIterable`

* Fixed missing cloning in deduplication flow

* Allow `HoodieSparkRecord` to hold `ColumnarBatchRow`

* Missing licenses

* Fixing compilation

* Missing changes

* Fixed Spark 2.x validation of whether the row was read as a batch

Fix comment in RFC46 (apache#6745)

* rename

* add MetadataValues in updateMetadataValues

* remove singleton in fileFactory

* add truncateRecordKey

* remove hoodieRecord#setData

* rename HoodieAvroRecord

* fix code style

* fix HoodieSparkRecordSerializer

* fix benchmark

* fix SparkRecordUtils

* instantiate HoodieWriteConfig on the fly

* add test

* fix HoodieSparkRecordSerializer: replace Java's object serialization with Kryo

* add broadcast

* fix comment

* remove unnecessary broadcast

* add unsafe check in spark record

* fix getRecordColumnValues

* remove spark.sql.parquet.writeLegacyFormat

* fix unsafe projection

* fix

* pass external schema

* update doc

* rename back to HoodieAvroRecord

* fix

* remove comparable wrapper

* fix comment

* fix comment

* fix comment

* fix comment

* simplify row copy

* fix ParquetReaderIterator

Co-authored-by: Shawy Geng <[email protected]>
Co-authored-by: wangzixuan.wzxuan <[email protected]>

[RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback (apache#6132)

* Update the RFC-46 doc to fix comments feedback

* fix

Co-authored-by: wangzixuan.wzxuan <[email protected]>

[HUDI-4301] [HUDI-3384][HUDI-3385] Spark specific file reader/writer.(apache#5629)

* [HUDI-4301] [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

* add schema finger print

* add benchmark

* a new way to config the merger

* fix

Co-authored-by: wangzixuan.wzxuan <[email protected]>
Co-authored-by: gengxiaoyu <[email protected]>

[HUDI-3350][HUDI-3351] Support HoodieMerge API and Spark engine-specific  HoodieRecord (apache#5627)

Co-authored-by: wangzixuan.wzxuan <[email protected]>

[HUDI-4344] fix usage of HoodieDataBlock#getRecordIterator (apache#6005)

Co-authored-by: wangzixuan.wzxuan <[email protected]>

[HUDI-4292][RFC-46] Update doc to align with the Record Merge API changes (apache#5927)

[MINOR] Fix type casting in TestHoodieHFileReaderWriter

[HUDI-3378][HUDI-3379][HUDI-3381] Migrate usage of HoodieRecordPayload and raw Avro payload to HoodieRecord (apache#5522)

Co-authored-by: Alexey Kudinkin <[email protected]>
Co-authored-by: wangzixuan.wzxuan <[email protected]>
alexeykudinkin pushed a commit to wzx140/hudi that referenced this pull request Dec 13, 2022
[minor] add more test for rfc46 (apache#7003)

wzx140 added a commit to wzx140/hudi that referenced this pull request Dec 13, 2022
[minor] add more test for rfc46 (apache#7003)

alexeykudinkin pushed a commit to wzx140/hudi that referenced this pull request Dec 13, 2022
[minor] add more test for rfc46 (apache#7003)

alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
[minor] add more test for rfc46 (apache#7003)

wzx140 added a commit to wzx140/hudi that referenced this pull request Dec 14, 2022
[minor] add more test for rfc46 (apache#7003)
