-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-1013] Adding Bulk Insert V2 implementation #1834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d03df7b to
8366e83
Compare
a5f608c to
e5d4939
Compare
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan I made a high level pass. Mostly lgtm. Will make some changes, test and land.
hudi-client/src/main/java/org/apache/hudi/client/HoodieInternalWriteStatus.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
Outdated
Show resolved
Hide resolved
af27762 to
1a11d1c
Compare
c957008 to
dacd635
Compare
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving some notes for reviewer :)
hudi-client/src/main/java/org/apache/hudi/client/HoodieInternalWriteStatus.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/io/HoodieRowCreateHandle.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/io/HoodieRowCreateHandle.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/keygen/KeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/test/java/org/apache/hudi/internal/TestHoodieDataSourceInternalWriter.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewer: I am yet to add tests to these new methods. Got these as part of rebase. Also, I notice few other test classes for each key generators after rebasing. Will add tests by tmrw to those new test classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan this is done?
hudi-spark/src/test/scala/org/apache/hudi/TestDataSourceDefaults.scala
Outdated
Show resolved
Hide resolved
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a high level pass. Great work @bvaradar , @nsivabalan . Did some testing as well.
hudi-client/src/main/java/org/apache/hudi/client/HoodieInternalWriteStatus.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
dacd635 to
06e9693
Compare
06e9693 to
d491a32
Compare
|
#1834 (comment) |
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one nit on HoodieSparkSqlWriter.
866ee72 to
5e77ae3
Compare
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bvaradar @nsivabalan this is a more detailed review with notes to self. Please take a look and chime in. I think we need to be careful about how we future proof the KeyGenerator API.
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
Outdated
Show resolved
Hide resolved
| @Override | ||
| public String getPartitionPath(Row row) { | ||
| Object fieldVal = null; | ||
| Object partitionPathFieldVal = RowKeyGeneratorHelper.getNestedFieldVal(row, getPartitionPathPositions().get(getPartitionPathFields().get(0))); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
hudi-spark/src/test/scala/org/apache/hudi/TestDataSourceDefaults.scala
Outdated
Show resolved
Hide resolved
5e77ae3 to
95a71fe
Compare
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some notes to reviewer.
hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
Outdated
Show resolved
Hide resolved
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some more comments.
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
Outdated
Show resolved
Hide resolved
- Clean up KeyGenerator classes and fix test failures
95a71fe to
06c1370
Compare
hudi-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
Outdated
Show resolved
Hide resolved
| public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, BloomFilter bloomFilter) { | ||
| super(); | ||
| Configuration hadoopConf = new Configuration(conf); | ||
| hadoopConf.set("spark.sql.parquet.writeLegacyFormat", "false"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. why we are hardcoding this. any ideas @bvaradar ?
hudi-common/src/main/java/org/apache/hudi/common/model/WriteOperationType.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
Outdated
Show resolved
Hide resolved
c8745f8 to
5dc8182
Compare
- Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark - Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row) - Fixed all built-in key generators with new APIs - Made the field position map lazily created upon the first call to row based apis - Implemented native row based key generators for CustomKeyGenerator - Fixed all the tests, with these new APIs
5dc8182 to
865d8d6
Compare
|
@nsivabalan this is ready. I am going ahead and merging. I also re-ran the benchmark again . Seems to clock the same 30 mins against spark.write.parquet. Please carefully go over the changes I have made in the last commits here.. and see if anything needs follow on fixing. Our timelines are tight. we need to do it tomorrow, if at all |
| .forEach(f -> partitionPathPositions.put(f, | ||
| RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, false))); | ||
| } | ||
| this.structType = structType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may I know where is the structType being used ? AvroConversionHelper.createConverterToAvro used row.Schema() and so we may not need it. probably we should rename this to boolean positionMapInitialized.
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few minor comments
| // insert/bulk-insert combining to be true, if filtering for duplicates | ||
| boolean combineInserts = Boolean.parseBoolean(parameters.get(DataSourceWriteOptions.INSERT_DROP_DUPS_OPT_KEY())); | ||
| HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder() | ||
| .withPath(basePath).withAutoCommit(false).combineInput(combineInserts, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @nsivabalan Do you got any idea why we disable AutoCommit by default when creating HoodieWriteConfig?
What is the purpose of the pull request
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.