[HUDI-1951] Add bucket hash index, compatible with the Hive bucket (#3173)
vinothchandar merged 13 commits into apache:master
Conversation
Codecov Report

@@             Coverage Diff              @@
##             master    #3173      +/-   ##
============================================
- Coverage     44.10%    2.77%   -41.34%
+ Complexity     5157       85     -5072
============================================
  Files           936      286      -650
  Lines         41629    12074    -29555
  Branches       4189     1010     -3179
============================================
- Hits          18362      335    -18027
+ Misses        21638    11713     -9925
+ Partials       1629       26     -1603

Flags with carried forward coverage won't be shown.
@minihippo can you please rebase this PR again?

@vinothchandar done

@minihippo been behind on this. is there anyone else who wants to take a first pass?
Resolved review threads (now outdated):
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/SimpleAvroKeyGenerator.java
- ...-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BucketInfo.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveBucketUtils.java (3 threads)
- hudi-client/hudi-client-common/src/test/resources/hive_bucket_id_check.csv
Hi @leesf, I think this patch is too large. Should I divide it into 2 PRs for easier review?

After removing the csv file, the PR became smaller, and I think it is fine to keep the changes in one PR.
Resolved review threads (now outdated):
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/BaseKeyGenerator.java
- ...lient-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/utils/HiveHasher.java (2 threads)
- hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/TestHiveBucketUtils.java
- ...lient/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/SparkHiveBucketIndex.java (3 threads)
- ...ark-client/src/main/java/org/apache/hudi/table/action/commit/SparkHiveBucketPartitioner.java
The main changes are:
vinothchandar left a comment:
LGTM, nearing landing. I will push some cleanups/changes in a day.
Resolved review threads:
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java (outdated)
- ...lient-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java (outdated)
- ...ient/hudi-client-common/src/main/java/org/apache/hudi/table/storage/HoodieStorageLayout.java (outdated)
- hudi-client/hudi-client-common/src/test/resources/hive_bucket_id_check.csv (outdated)
- ...ent/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/HoodieBucketIndexSuite.java (outdated)
Quoted code:
    .noDefaultValue()
    .withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.")

    // bucketSpec: CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS

need to think if there are better ways of exposing this config to the user
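For context, the bucketSpec string quoted above follows Hive DDL syntax. A minimal sketch of pulling the bucket column and bucket count out of such a spec — the `BucketSpecParser` class and its regex are hypothetical illustrations, not code from this PR:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the bucketing column and bucket count from a
// Hive-style spec like "CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS".
public class BucketSpecParser {
  private static final Pattern SPEC =
      Pattern.compile("CLUSTERED BY \\((\\w+)\\).*INTO (\\d+) BUCKETS");

  public static String parseBucketColumn(String spec) {
    Matcher m = SPEC.matcher(spec);
    if (!m.find()) {
      throw new IllegalArgumentException("Not a valid bucket spec: " + spec);
    }
    return m.group(1); // first capture group: the CLUSTERED BY column
  }

  public static int parseBucketCount(String spec) {
    Matcher m = SPEC.matcher(spec);
    if (!m.find()) {
      throw new IllegalArgumentException("Not a valid bucket spec: " + spec);
    }
    return Integer.parseInt(m.group(2)); // second capture group: the bucket count
  }

  public static void main(String[] args) {
    String spec = "CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS";
    System.out.println(parseBucketColumn(spec) + " / " + parseBucketCount(spec));
  }
}
```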
Quoted code:
    class TestDataSourceForBucketIndex extends HoodieClientTestBase {

Could we parameterize an existing test?

The test cases like testCount can be parameterized, but testDoubleInsert is more of a bucket-index-specific test case.

I was referring to taking an existing test and running it across bucket and non-bucket cases. We can revisit this again. Could we add a test JIRA under the umbrella?

https://issues.apache.org/jira/browse/HUDI-3121, linked to the bucket index JIRA https://issues.apache.org/jira/projects/HUDI/issues/HUDI-3039
Quoted code:
    assignUpdates(profile);
    }

    private void assignUpdates(WorkloadProfile profile) {

see if we can rewrite these functional style?

Can you explain that? I don't understand.

using Java streams instead of for loops. It's a minor comment.
Resolved review thread:
- ...rk-client/src/main/java/org/apache/hudi/table/action/commit/SparkBucketIndexPartitioner.java
vinothchandar left a comment:
@minihippo Thanks for the push. I also cleaned up some docs and naming; please take a look at the last commit. Once CI passes and you are happy with it, I'll land. I have pointed out some follow-ups in this review. If you could add subtasks for them and track them to completion, that would be awesome.
Quoted code:
    public HoodieBucketLayout() {
      super();
      partitionClass = "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner";

@minihippo I understand you did this to avoid a reverse dep from common to spark-client. Wondering if we need a LayoutConfig introduced so we can pass the partitioner class name along from the client side.
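The reviewer's idea — carry the partitioner class name as a config property rather than hardcoding it in hudi-client-common — could look roughly like the sketch below. The key name and `LayoutConfigSketch` class are assumptions for illustration, not the names the PR settled on:

```java
import java.util.Properties;

// Hypothetical LayoutConfig sketch: the engine-specific partitioner class name
// is carried as a property, so the Spark (or Flink) client can supply its own
// implementation without hudi-client-common depending on it at compile time.
public class LayoutConfigSketch {
  // Illustrative key/default; the real config names may differ.
  public static final String LAYOUT_PARTITIONER_KEY = "hoodie.storage.layout.partitioner";
  public static final String DEFAULT_PARTITIONER =
      "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner";

  private final Properties props;

  public LayoutConfigSketch(Properties props) {
    this.props = props;
  }

  // Falls back to the Spark partitioner when the client sets nothing.
  public String getPartitionerClass() {
    return props.getProperty(LAYOUT_PARTITIONER_KEY, DEFAULT_PARTITIONER);
  }

  public static void main(String[] args) {
    System.out.println(new LayoutConfigSketch(new Properties()).getPartitionerClass());
  }
}
```

The common module would then instantiate the class reflectively (e.g. via `Class.forName`), keeping the dependency direction one-way.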
Quoted code:
    import java.io.Serializable;

    public class HoodieStorageLayout implements Serializable {

We should probably pull out a separate class HoodieDefaultLayout, make this one abstract, and neatly move all layout-specific optimizations there.
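A sketch of the refactor being suggested: make the layout class abstract and push default behavior into a subclass. The class and method names below are illustrative stand-ins, not the ones the PR finally used:

```java
import java.io.Serializable;

// Sketch: an abstract storage layout with a default and a bucket variant.
// "determinesNumFileGroups" is a hypothetical hook for layout-specific behavior.
public abstract class StorageLayoutSketch implements Serializable {

  // Whether the layout pins records to a fixed set of file groups (true for bucketing).
  public abstract boolean determinesNumFileGroups();

  public static class DefaultLayout extends StorageLayoutSketch {
    @Override
    public boolean determinesNumFileGroups() {
      return false; // default layout lets the write path size file groups freely
    }
  }

  public static class BucketLayout extends StorageLayoutSketch {
    @Override
    public boolean determinesNumFileGroups() {
      return true; // bucket layout fixes one file group per bucket
    }
  }

  public static void main(String[] args) {
    System.out.println(new DefaultLayout().determinesNumFileGroups());
    System.out.println(new BucketLayout().determinesNumFileGroups());
  }
}
```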
Quoted code:
    .noDefaultValue()
    .withDocumentation("Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.")

    val HIVE_SYNC_BUCKET_SYNC: ConfigProperty[Boolean] = ConfigProperty

do we also fix the deltastreamer path?
There are some test failures. I don't think these are related to my pushes. @minihippo Could you please check?

@vinothchandar I addressed all comments, and the failing UT is not related to this PR. Can we land this?

@hudi-bot run azure

Looking into the failures.

@minihippo I was thinking we can name all parameters

This keeps failing. Could you rebase again with latest master? I want to try running the tests again.
…untime support double insert for bucket index
I didn't modify the
vinothchandar left a comment:
@minihippo Couple more follow-ups. Landing.
Quoted code:
    * Storage layout related config.
    */
    @Immutable
    @ConfigClassProperty(name = "Layout Configs",

We probably have to make this layout a TableConfig, i.e. add it to HoodieTableConfig and persist it in hoodie.properties, handling cases where there will be tables with and without the property.
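The backward-compatibility concern here — older tables have no layout entry in hoodie.properties — can be sketched as a read path that falls back to a default when the key is absent. The key name `hoodie.table.storage.layout` below is illustrative, not necessarily the name Hudi uses:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Sketch: treat the layout as a table config persisted in hoodie.properties.
// Tables created before this change won't have the key, so reads must default.
public class TableLayoutConfigSketch {
  public static final String LAYOUT_KEY = "hoodie.table.storage.layout"; // illustrative name
  public static final String DEFAULT_LAYOUT = "DEFAULT";

  public static String readLayout(Reader hoodieProperties) {
    Properties props = new Properties();
    try {
      props.load(hoodieProperties);
    } catch (IOException e) {
      throw new RuntimeException("Failed to read hoodie.properties", e);
    }
    // Older tables have no layout entry; treat them as DEFAULT.
    return props.getProperty(LAYOUT_KEY, DEFAULT_LAYOUT);
  }

  public static void main(String[] args) {
    System.out.println(readLayout(new StringReader("hoodie.table.name=t1")));
    System.out.println(readLayout(new StringReader(LAYOUT_KEY + "=BUCKET")));
  }
}
```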
…pache#3173)
* [HUDI-2154] Add index key field to HoodieKey
* [HUDI-2157] Add the bucket index and its read/write implementation for the Spark engine
* Revert HUDI-2154: add index key field to HoodieKey
* Fix all comments and introduce a new tricky way to get the index key at runtime; support double insert for bucket index
* Revert Spark read optimizer based on bucket index
* Add the storage layout
* Index tag, hash function, and add UT
* Fix UT
* Address partial comments
* Code review feedback
* Add layout config and docs
* Fix UT
* Rename hoodie.layout and rebase master

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
What is the purpose of the pull request
Index pattern 1 in RFC-29: Hash Index (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index)
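For readers unfamiliar with Hive bucketing, the property this index relies on is deterministic bucket assignment: Hive maps a key's hash to a bucket with `(hash & Integer.MAX_VALUE) % numBuckets`, and the Hive hash of an int is the int itself. The sketch below shows only that final step; Hive's string hashing is more involved and omitted here:

```java
// Minimal sketch of Hive-style bucket assignment from a key's hash:
// bucket id = (hash & Integer.MAX_VALUE) % numBuckets.
public class HiveBucketIdSketch {
  public static int bucketId(int keyHash, int numBuckets) {
    // Mask off the sign bit so negative hashes still map into [0, numBuckets).
    return (keyHash & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    System.out.println(bucketId(5, 4));   // hash 5 of an int key maps to bucket 1
    System.out.println(bucketId(-7, 4));  // negative hashes are masked first, then reduced
  }
}
```

Because both the writer and any Hive reader compute the same bucket id, each record lands in a fixed file group, which is what makes the hash index lookup-free.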
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.