[HUDI-1951] Add bucket hash index, compatible with the Hive bucket (#3173)
vinothchandar merged 13 commits into apache:master
Conversation
Codecov Report

@@             Coverage Diff              @@
##             master    #3173      +/-   ##
============================================
- Coverage     44.10%    2.77%   -41.34%
+ Complexity     5157       85     -5072
============================================
  Files           936      286      -650
  Lines         41629    12074    -29555
  Branches       4189     1010     -3179
============================================
- Hits          18362      335    -18027
+ Misses        21638    11713     -9925
+ Partials       1629       26     -1603

Flags with carried forward coverage won't be shown.
@minihippo can you please rebase this PR again?

@vinothchandar done

@minihippo been behind on this. is there anyone else who wants to take a first pass?
Resolved review threads (now outdated):
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/SimpleAvroKeyGenerator.java
- ...-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketInfo.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveBucketUtils.java (3 threads)
- hudi-client/hudi-client-common/src/test/resources/hive_bucket_id_check.csv
Hi @leesf, I think this patch is too large. Should I divide it into 2 PRs for easier review?

After removing the csv file, the PR became smaller, and I think it is fine to keep the changes in one PR.
Resolved review threads (now outdated):
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/BaseKeyGenerator.java
- ...lient-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveHasher.java (2 threads)
- hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/TestHiveBucketUtils.java
- ...lient/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/SparkHiveBucketIndex.java (3 threads)
- ...ark-client/src/main/java/org/apache/hudi/table/action/commit/SparkHiveBucketPartitioner.java
The main changes are:
vinothchandar left a comment:
LGTM, nearing landing. I will push some cleanups/changes in a day.
Resolved review threads:
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java (outdated)
- ...lient-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java (outdated)
- ...ient/hudi-client-common/src/main/java/org/apache/hudi/table/storage/HoodieStorageLayout.java (outdated)
- hudi-client/hudi-client-common/src/test/resources/hive_bucket_id_check.csv (outdated)
- ...ent/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/HoodieBucketIndexSuite.java (outdated)
Quoted code:
    .noDefaultValue()
    .withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.")

    // bucketSpec: CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS

need to think if there are better ways of exposing this config to the user
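For context, the bucketSpec string quoted above follows Hive DDL syntax. A minimal sketch of pulling the bucket column and bucket count out of such a spec — the `BucketSpecParser` class and its regex are hypothetical illustrations, not code from this PR:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the bucketing column and bucket count from a
// Hive-style spec like "CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS".
public class BucketSpecParser {
  private static final Pattern SPEC =
      Pattern.compile("CLUSTERED BY \\((\\w+)\\).*INTO (\\d+) BUCKETS");

  public static String parseBucketColumn(String spec) {
    Matcher m = SPEC.matcher(spec);
    if (!m.find()) {
      throw new IllegalArgumentException("Not a valid bucket spec: " + spec);
    }
    return m.group(1); // first capture group: the CLUSTERED BY column
  }

  public static int parseBucketCount(String spec) {
    Matcher m = SPEC.matcher(spec);
    if (!m.find()) {
      throw new IllegalArgumentException("Not a valid bucket spec: " + spec);
    }
    return Integer.parseInt(m.group(2)); // second capture group: the bucket count
  }

  public static void main(String[] args) {
    String spec = "CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS";
    System.out.println(parseBucketColumn(spec) + " / " + parseBucketCount(spec));
  }
}
```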
Quoted code:
    class TestDataSourceForBucketIndex extends HoodieClientTestBase {

Could we parameterize an existing test?

The test cases like testCount can be parameterized, but testDoubleInsert is more of a bucket-index-specific test case.

I was referring to taking an existing test and running it across bucket and non-bucket cases. We can revisit this again. Could we add a test JIRA under the umbrella?

https://issues.apache.org/jira/browse/HUDI-3121, linked to the bucket index JIRA https://issues.apache.org/jira/projects/HUDI/issues/HUDI-3039
Quoted code:
    assignUpdates(profile);
    }

    private void assignUpdates(WorkloadProfile profile) {

see if we can rewrite these functional style?

Can you explain that? I don't understand.

using Java streams instead of for loops. It's a minor comment.
Resolved review thread:
- ...rk-client/src/main/java/org/apache/hudi/table/action/commit/SparkBucketIndexPartitioner.java
vinothchandar left a comment:
@minihippo Thanks for the push. I also cleaned up some docs and naming; please take a look at the last commit. Once CI passes and you are happy with it, I'll land. I have pointed out some follow-ups in this review. If you could add subtasks for them and track them to completion, that would be awesome.
Quoted code:
    public HoodieBucketLayout() {
      super();
      partitionClass = "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner";

@minihippo I understand you did this to avoid a reverse dep from common to spark-client. Wondering if we need a LayoutConfig introduced so we can pass the partitioner class name along from the client side.
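The reviewer's idea — carry the partitioner class name as a config property rather than hardcoding it in hudi-client-common — could look roughly like the sketch below. The key name and `LayoutConfigSketch` class are assumptions for illustration, not the names the PR settled on:

```java
import java.util.Properties;

// Hypothetical LayoutConfig sketch: the engine-specific partitioner class name
// is carried as a property, so the Spark (or Flink) client can supply its own
// implementation without hudi-client-common depending on it at compile time.
public class LayoutConfigSketch {
  // Illustrative key/default; the real config names may differ.
  public static final String LAYOUT_PARTITIONER_KEY = "hoodie.storage.layout.partitioner";
  public static final String DEFAULT_PARTITIONER =
      "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner";

  private final Properties props;

  public LayoutConfigSketch(Properties props) {
    this.props = props;
  }

  // Falls back to the Spark partitioner when the client sets nothing.
  public String getPartitionerClass() {
    return props.getProperty(LAYOUT_PARTITIONER_KEY, DEFAULT_PARTITIONER);
  }

  public static void main(String[] args) {
    System.out.println(new LayoutConfigSketch(new Properties()).getPartitionerClass());
  }
}
```

The common module would then instantiate the class reflectively (e.g. via `Class.forName`), keeping the dependency direction one-way.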
Quoted code:
    import java.io.Serializable;

    public class HoodieStorageLayout implements Serializable {

We should probably pull out a separate class HoodieDefaultLayout, make this one abstract, and neatly move all layout-specific optimizations there.
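A sketch of the refactor being suggested: make the layout class abstract and push default behavior into a subclass. The class and method names below are illustrative stand-ins, not the ones the PR finally used:

```java
import java.io.Serializable;

// Sketch: an abstract storage layout with a default and a bucket variant.
// "determinesNumFileGroups" is a hypothetical hook for layout-specific behavior.
public abstract class StorageLayoutSketch implements Serializable {

  // Whether the layout pins records to a fixed set of file groups (true for bucketing).
  public abstract boolean determinesNumFileGroups();

  public static class DefaultLayout extends StorageLayoutSketch {
    @Override
    public boolean determinesNumFileGroups() {
      return false; // default layout lets the write path size file groups freely
    }
  }

  public static class BucketLayout extends StorageLayoutSketch {
    @Override
    public boolean determinesNumFileGroups() {
      return true; // bucket layout fixes one file group per bucket
    }
  }

  public static void main(String[] args) {
    System.out.println(new DefaultLayout().determinesNumFileGroups());
    System.out.println(new BucketLayout().determinesNumFileGroups());
  }
}
```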
Quoted code:
    .noDefaultValue()
    .withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.")

    val HIVE_SYNC_BUCKET_SYNC: ConfigProperty[Boolean] = ConfigProperty

do we also fix the deltastreamer path?
There are some test failures. I don't think these are related to my pushes. @minihippo Could you please check?

@vinothchandar I addressed all comments, and the failing UT is not related to this PR. Can we land this?

@hudi-bot run azure

Looking into the failures.

@minihippo I was thinking we can name all parameters

This keeps failing. Could you rebase again with latest master? I want to try running the tests again.
…untime support double insert for bucket index
I didn't modify the
vinothchandar left a comment:
@minihippo Couple more follow-ups. Landing.
Quoted code:
    * Storage layout related config.
    */
    @Immutable
    @ConfigClassProperty(name = "Layout Configs",

We probably have to make this layout a TableConfig, i.e. add it to HoodieTableConfig and persist it in hoodie.properties, handling cases where there will be tables with and without the property.
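The backward-compatibility concern here — older tables have no layout entry in hoodie.properties — can be sketched as a read path that falls back to a default when the key is absent. The key name `hoodie.table.storage.layout` below is illustrative, not necessarily the name Hudi uses:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Sketch: treat the layout as a table config persisted in hoodie.properties.
// Tables created before this change won't have the key, so reads must default.
public class TableLayoutConfigSketch {
  public static final String LAYOUT_KEY = "hoodie.table.storage.layout"; // illustrative name
  public static final String DEFAULT_LAYOUT = "DEFAULT";

  public static String readLayout(Reader hoodieProperties) {
    Properties props = new Properties();
    try {
      props.load(hoodieProperties);
    } catch (IOException e) {
      throw new RuntimeException("Failed to read hoodie.properties", e);
    }
    // Older tables have no layout entry; treat them as DEFAULT.
    return props.getProperty(LAYOUT_KEY, DEFAULT_LAYOUT);
  }

  public static void main(String[] args) {
    System.out.println(readLayout(new StringReader("hoodie.table.name=t1")));
    System.out.println(readLayout(new StringReader(LAYOUT_KEY + "=BUCKET")));
  }
}
```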
…pache#3173)
* [HUDI-2154] Add index key field to HoodieKey
* [HUDI-2157] Add the bucket index and its read/write implementation for the Spark engine
* Revert HUDI-2154: add index key field to HoodieKey
* Fix all comments and introduce a new tricky way to get the index key at runtime; support double insert for bucket index
* Revert Spark read optimizer based on bucket index
* Add the storage layout
* Index tag, hash function, and add UT
* Fix UT
* Address partial comments
* Code review feedback
* Add layout config and docs
* Fix UT
* Rename hoodie.layout and rebase master

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
What is the purpose of the pull request
Index pattern 1 in RFC-29: Hash Index (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index)
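For readers unfamiliar with Hive bucketing, the property this index relies on is deterministic bucket assignment: Hive maps a key's hash to a bucket with `(hash & Integer.MAX_VALUE) % numBuckets`, and the Hive hash of an int is the int itself. The sketch below shows only that final step; Hive's string hashing is more involved and omitted here:

```java
// Minimal sketch of Hive-style bucket assignment from a key's hash:
// bucket id = (hash & Integer.MAX_VALUE) % numBuckets.
public class HiveBucketIdSketch {
  public static int bucketId(int keyHash, int numBuckets) {
    // Mask off the sign bit so negative hashes still map into [0, numBuckets).
    return (keyHash & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    System.out.println(bucketId(5, 4));   // hash 5 of an int key maps to bucket 1
    System.out.println(bucketId(-7, 4));  // negative hashes are masked first, then reduced
  }
}
```

Because both the writer and any Hive reader compute the same bucket id, each record lands in a fixed file group, which is what makes the hash index lookup-free.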
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.