Conversation

@nsivabalan
Contributor

@nsivabalan nsivabalan commented Jul 15, 2020

What is the purpose of the pull request

  • Adds support for "bulk_insert_dataset", which has better performance than the existing "bulk_insert" (see the hedged usage sketch after this list).
  • This implementation uses the Spark Datasource API for writing to storage.
  • Adds support for key generators to operate on Rows (rather than HoodieRecords, as in the existing "bulk_insert").
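For orientation, a minimal usage sketch of how this path would be driven through the Spark datasource. The operation value comes from this PR description; the hoodie.datasource.write.* option keys are the standard ones, and the table name, column names and paths below are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class BulkInsertDatasetExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bulk-insert-dataset-example")
        .getOrCreate();

    // Hypothetical source data; any Dataset<Row> with the key/partition/precombine columns works.
    Dataset<Row> df = spark.read().parquet("/tmp/source_trips");

    df.write().format("org.apache.hudi")
        .option("hoodie.table.name", "trips")                                // hypothetical table name
        .option("hoodie.datasource.write.operation", "bulk_insert_dataset")  // new operation from this PR
        .option("hoodie.datasource.write.recordkey.field", "uuid")           // hypothetical columns
        .option("hoodie.datasource.write.partitionpath.field", "partition")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .mode(SaveMode.Append)
        .save("/tmp/hudi_trips");                                            // hypothetical base path
  }
}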

Brief change log

  • Added support for "bulk_insert_dataset", which uses the Datasource API for writing.
  • This path introduces a new datasource, "org.apache.hudi.internal", and all the supporting classes for it: DefaultSource, DataSourceWriter (HoodieDataSourceInternalWriter), DataWriterFactory (HoodieBulkInsertDataInternalWriterFactory), DataWriter (HoodieBulkInsertDataInternalWriter), etc.
  • This patch also introduces HoodieRowCreateHandle, HoodieInternalRowFileWriter, HoodieInternalRowParquetWriter, etc. (and their respective factory classes) to assist in writing InternalRows to parquet.
  • Added HoodieInternalRow, which wraps InternalRow and exposes the meta columns.
  • Added HoodieInternalWriteStatus to hold write status (instead of the WriteStatus used in the HoodieRecord write paths), since it holds Rows rather than HoodieRecords.
  • This patch adds changes to KeyGenerator so that getRecordKey and getPartitionPath are supported with Row for "bulk_insert_dataset". New APIs are added to KeyGenerator with default implementations, so there are no breaking changes; all key generator implementations have been updated accordingly (a hedged sketch of the API shape follows this list).
  • Added HoodieDatasetBulkInsertHelper to assist in preparing the dataset before calling into the datasource write. For the commit, HoodieWriteClient is leveraged.
  • Some additional functionality on top of bulk insert is not covered in this patch, namely UserDefinedCustomPartitioner, dedup and drop duplicates. Created HUDI-1014, HUDI-1105 and HUDI-1106 to track these.
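As referenced in the KeyGenerator item above, here is a hedged sketch of the API shape. The two Row-based method names match the commit notes later in this thread; the default bodies shown are illustrative only (the patch's defaults may instead convert the Row to Avro and delegate to the existing record-based path):

import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.spark.sql.Row;

// Sketch only; the real KeyGenerator lives in hudi-spark and is configured via properties.
public abstract class KeyGeneratorSketch {

  // Existing record-based API stays untouched, so current implementations keep compiling.
  public abstract HoodieKey getKey(GenericRecord record);

  // New Row-based APIs with default implementations, so only key generators that
  // participate in the "bulk_insert_dataset" path need to override them.
  public String getRecordKey(Row row) {
    throw new UnsupportedOperationException("Row-based getRecordKey is not implemented");
  }

  public String getPartitionPath(Row row) {
    throw new UnsupportedOperationException("Row-based getPartitionPath is not implemented");
  }
}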

Verify this pull request

This change added tests and can be verified as follows:

  • Added tests in HoodieSparkSqlWriterSuite to test "bulk_insert_dataset" end to end
  • Added tests for HoodieInternalWriteStatus, HoodieInternalRow, HoodieRowCreateHandle, HoodieInternalRowParquetWriter, HoodieDatasetBulkInsertHelper and HoodieBulkInsertDataInternalWriter
  • Added tests for HoodieDataSourceInternalWriter covering commit, abort, large writes and multiple writes
  • Added tests covering the new Row-based APIs for all key generators

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan changed the title from "[WIP][HUDI-1013] Adding Bulk Insert V2 implementation" to "[HUDI-1013] Adding Bulk Insert V2 implementation" on Jul 17, 2020
@nsivabalan nsivabalan force-pushed the blk_insert_dataset branch 2 times, most recently from d03df7b to 8366e83 on July 19, 2020 at 18:00
@nsivabalan nsivabalan force-pushed the blk_insert_dataset branch 2 times, most recently from a5f608c to e5d4939 on July 22, 2020 at 11:20
@vinothchandar vinothchandar self-assigned this Jul 22, 2020
Member

@vinothchandar vinothchandar left a comment

@nsivabalan I made a high-level pass. Mostly LGTM. Will make some changes, test and land.

@nsivabalan nsivabalan force-pushed the blk_insert_dataset branch 4 times, most recently from af27762 to 1a11d1c on July 24, 2020 at 06:19
@nsivabalan nsivabalan force-pushed the blk_insert_dataset branch 3 times, most recently from c957008 to dacd635 on August 6, 2020 at 11:59
Contributor Author

@nsivabalan nsivabalan left a comment

Leaving some notes for reviewer :)

Contributor Author

Note to reviewer: I have yet to add tests for these new methods; they came in as part of the rebase. I also noticed a few other test classes for each key generator after rebasing. Will add tests to those new test classes by tomorrow.

Member

@nsivabalan is this done?

Member

@vinothchandar vinothchandar left a comment

Took a high-level pass. Great work @bvaradar, @nsivabalan. Did some testing as well.

@nsivabalan
Contributor Author

nsivabalan commented Aug 11, 2020

#1834 (comment): because this is for Rows, whereas the existing WriteStatus is for HoodieRecords. Guess we should have templatized this too.
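A minimal sketch of the templatization floated in this comment, purely illustrative (not what the patch does; the patch keeps two concrete classes, WriteStatus for HoodieRecord writes and HoodieInternalWriteStatus for Row writes):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a record-type-agnostic write status.
public class GenericWriteStatusSketch<T> implements Serializable {
  private final List<T> failedRecords = new ArrayList<>();
  private long totalRecords;
  private long totalErrorRecords;

  public void markSuccess(T record) {
    totalRecords++;
  }

  public void markFailure(T record, Throwable cause) {
    failedRecords.add(record);
    totalRecords++;
    totalErrorRecords++;
  }

  public List<T> getFailedRecords() {
    return failedRecords;
  }

  public long getTotalRecords() {
    return totalRecords;
  }

  public long getTotalErrorRecords() {
    return totalErrorRecords;
  }
}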

Contributor Author

@nsivabalan nsivabalan left a comment

one nit on HoodieSparkSqlWriter.

Member

@vinothchandar vinothchandar left a comment

@bvaradar @nsivabalan this is a more detailed review with notes to self. Please take a look and chime in. I think we need to be careful about how we future-proof the KeyGenerator API.

@Override
public String getPartitionPath(Row row) {
  Object fieldVal = null;
  Object partitionPathFieldVal = RowKeyGeneratorHelper.getNestedFieldVal(row,
      getPartitionPathPositions().get(getPartitionPathFields().get(0)));
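For readers without the diff open, a hedged sketch of how a Row-based getPartitionPath along these lines typically finishes; the fallback constant and the hive-style flag below are assumptions for illustration, not necessarily the exact code under review:

  // ... continuing the method quoted above (assumed, not verbatim from the diff):
  String partitionPath = (partitionPathFieldVal == null || partitionPathFieldVal.toString().isEmpty())
      ? "default"                                  // assumed fallback partition path
      : partitionPathFieldVal.toString();
  if (hiveStylePartitioning) {                     // assumed flag carried by the key generator
    partitionPath = getPartitionPathFields().get(0) + "=" + partitionPath;
  }
  return partitionPath;
}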

Contributor Author

@nsivabalan nsivabalan left a comment

some notes to reviewer.

Contributor Author

@nsivabalan nsivabalan left a comment

some more comments.

 - Clean up KeyGenerator classes and fix test failures
public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, BloomFilter bloomFilter) {
  super();
  Configuration hadoopConf = new Configuration(conf);
  hadoopConf.set("spark.sql.parquet.writeLegacyFormat", "false");
Member

Yes, why are we hardcoding this? Any ideas, @bvaradar?
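One way to avoid the hardcoding being questioned here, sketched under the assumption that the incoming Configuration already carries the session's Spark SQL settings (a suggestion, not what the patch does):

  // Respect any value already present on the passed-in conf, defaulting to "false"
  // (the non-legacy Parquet format) only when nothing is set.
  hadoopConf.set("spark.sql.parquet.writeLegacyFormat",
      conf.get("spark.sql.parquet.writeLegacyFormat", "false"));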

@vinothchandar vinothchandar force-pushed the blk_insert_dataset branch 2 times, most recently from c8745f8 to 5dc8182 on August 13, 2020 at 04:38
 - Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark
 - Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row)
 - Fixed all built-in key generators with new APIs
 - Made the field position map lazily created upon the first call to row based apis
 - Implemented native row based key generators for CustomKeyGenerator
 - Fixed all the tests, with these new APIs
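A minimal sketch of the lazy position-map idea from the commit notes above, assuming a single-level field lookup to keep it self-contained (the patch itself uses RowKeyGeneratorHelper.getNestedFieldIndices to support nested fields):

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.types.StructType;

// Illustrative: positions of the partition path fields in the Row schema are resolved once,
// on the first Row-based call, and reused for every subsequent Row.
class LazyFieldPositionsSketch {
  private final List<String> partitionPathFields;
  private Map<String, List<Integer>> partitionPathPositions;   // null until first use

  LazyFieldPositionsSketch(List<String> partitionPathFields) {
    this.partitionPathFields = partitionPathFields;
  }

  Map<String, List<Integer>> positionsFor(StructType structType) {
    if (partitionPathPositions == null) {
      partitionPathPositions = new HashMap<>();
      for (String field : partitionPathFields) {
        partitionPathPositions.put(field,
            Collections.singletonList(structType.fieldIndex(field)));
      }
    }
    return partitionPathPositions;
  }
}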
@vinothchandar
Member

@nsivabalan this is ready. I am going ahead and merging. I also re-ran the benchmark, and it seems to clock the same ~30 mins as spark.write.parquet.

Please carefully go over the changes I have made in the last commits here and see if anything needs follow-on fixing. Our timelines are tight; we need to do it tomorrow, if at all.

@vinothchandar vinothchandar merged commit 379cf07 into apache:master Aug 13, 2020
      .forEach(f -> partitionPathPositions.put(f,
          RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, false)));
  }
  this.structType = structType;
Contributor Author

@nsivabalan nsivabalan Aug 13, 2020

May I know where the structType is being used? AvroConversionHelper.createConverterToAvro uses row.schema(), so we may not need it. Probably we should rename this to a boolean positionMapInitialized.

Contributor Author

@nsivabalan nsivabalan left a comment

few minor comments

// insert/bulk-insert combining to be true, if filtering for duplicates
boolean combineInserts = Boolean.parseBoolean(parameters.get(DataSourceWriteOptions.INSERT_DROP_DUPS_OPT_KEY()));
HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder()
    .withPath(basePath).withAutoCommit(false).combineInput(combineInserts, true);
Contributor

Hi @nsivabalan, do you have any idea why we disable autoCommit by default when creating the HoodieWriteConfig?
