@nsivabalan
Contributor

@nsivabalan nsivabalan commented Oct 25, 2020

What is the purpose of the pull request

Adding dedup support for Bulk Insert w/ Rows

Brief change log

  • Adding dedup support for Bulk Insert w/ Rows
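As a rough illustration of what row-level deduplication before a bulk insert can look like, here is a minimal sketch. It assumes that "dedup" means keeping one row per record key and choosing the row with the largest value of an ordering field. The function name `dedup_rows` and the field names `id` and `ts` are hypothetical, and this is not the code from this pull request.

```python
from typing import Any, Dict, Iterable, List


def dedup_rows(
    rows: Iterable[Dict[str, Any]],
    key_field: str,
    ordering_field: str,
) -> List[Dict[str, Any]]:
    """Keep one row per key, preferring the largest ordering value.

    Hypothetical helper for illustration only; it is not part of Hudi.
    On ties, the row seen first is kept.
    """
    latest: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        key = row[key_field]
        current = latest.get(key)
        # Replace the stored row only when the new one is strictly newer.
        if current is None or row[ordering_field] > current[ordering_field]:
            latest[key] = row
    # Dict order follows the first appearance of each key.
    return list(latest.values())


rows = [
    {"id": "a", "ts": 1, "value": "old"},
    {"id": "b", "ts": 5, "value": "only"},
    {"id": "a", "ts": 3, "value": "new"},
]
deduped = dedup_rows(rows, key_field="id", ordering_field="ts")
```

In this sketch the two rows with key `a` collapse into the one with `ts` equal to 3, while the single row with key `b` passes through unchanged.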

Verify this pull request

This change added tests and can be verified as follows:

  • Added TestHoodieDatasetBulkInsertHelper to verify the change.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan requested a review from bvaradar October 25, 2020 17:09
@vinothchandar vinothchandar changed the title Adding dedup support for Bulk Insert w/ Rows [WIP] Adding dedup support for Bulk Insert w/ Rows Oct 28, 2020
@vinothchandar vinothchandar self-assigned this Oct 28, 2020
@vinothchandar vinothchandar added the status:in-progress Work in progress label Oct 28, 2020
@nsivabalan nsivabalan removed the status:in-progress Work in progress label Dec 1, 2020
@nsivabalan nsivabalan force-pushed the bulkInsert_DeDupOct17 branch from 86ffb25 to 15d0db5 Compare December 1, 2020 17:07
@nsivabalan nsivabalan marked this pull request as ready for review December 1, 2020 17:07
@nsivabalan
Contributor Author

@bvaradar @vinothchandar: the patch is ready for review.

@vinothchandar vinothchandar added the status:in-progress Work in progress label Apr 15, 2021
@nsivabalan nsivabalan force-pushed the bulkInsert_DeDupOct17 branch from 15d0db5 to 01e47cc Compare May 23, 2021 04:28
@codecov-commenter

codecov-commenter commented May 23, 2021

Codecov Report

Merging #2206 (51ccc2d) into master (ea9e5d0) will decrease coverage by 20.20%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #2206       +/-   ##
=============================================
- Coverage     47.61%   27.41%   -20.21%     
+ Complexity     5495     1285     -4210     
=============================================
  Files           929      381      -548     
  Lines         41240    15106    -26134     
  Branches       4135     1303     -2832     
=============================================
- Hits          19637     4141    -15496     
+ Misses        19859    10667     -9192     
+ Partials       1744      298     -1446     
Flag Coverage Δ
hudicli ?
hudiclient 21.05% <0.00%> (-13.56%) ⬇️
hudicommon ?
hudiflink ?
hudihadoopmr ?
hudisparkdatasource ?
hudisync 5.28% <ø> (-49.20%) ⬇️
huditimelineservice ?
hudiutilities 58.62% <ø> (ø)

Impacted Files Coverage Δ
...n/java/org/apache/hudi/index/SparkHoodieIndex.java 56.52% <0.00%> (-30.15%) ⬇️
...main/java/org/apache/hudi/metrics/HoodieGauge.java 0.00% <0.00%> (-100.00%) ⬇️
.../org/apache/hudi/hive/NonPartitionedExtractor.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/apache/hudi/metrics/MetricsReporter.java 0.00% <0.00%> (-100.00%) ⬇️
...a/org/apache/hudi/metrics/MetricsReporterType.java 0.00% <0.00%> (-100.00%) ⬇️
...rg/apache/hudi/client/bootstrap/BootstrapMode.java 0.00% <0.00%> (-100.00%) ⬇️
...he/hudi/hive/HiveStylePartitionValueExtractor.java 0.00% <0.00%> (-100.00%) ⬇️
...pache/hudi/client/utils/ConcatenatingIterator.java 0.00% <0.00%> (-100.00%) ⬇️
...che/hudi/config/HoodieMetricsPrometheusConfig.java 0.00% <0.00%> (-100.00%) ⬇️
.../hudi/execution/bulkinsert/BulkInsertSortMode.java 0.00% <0.00%> (-100.00%) ⬇️
... and 615 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ea9e5d0...51ccc2d.

@nsivabalan nsivabalan changed the title [WIP] Adding dedup support for Bulk Insert w/ Rows Adding dedup support for Bulk Insert w/ Rows May 24, 2021
@nsivabalan nsivabalan removed the status:in-progress Work in progress label May 24, 2021
@nsivabalan nsivabalan force-pushed the bulkInsert_DeDupOct17 branch 2 times, most recently from 63ca76b to 39b3315 Compare May 24, 2021 06:33
@nsivabalan nsivabalan added priority:critical Production degraded; pipelines stalled priority:high Significant impact; potential bugs and removed priority:critical Production degraded; pipelines stalled labels May 24, 2021
@nsivabalan nsivabalan force-pushed the bulkInsert_DeDupOct17 branch 2 times, most recently from 10043f2 to 031b8fa Compare June 8, 2021 02:31
@nsivabalan nsivabalan changed the title Adding dedup support for Bulk Insert w/ Rows [HUDI-1105] Adding dedup support for Bulk Insert w/ Rows Jun 15, 2021
@hudi-bot
Collaborator

hudi-bot commented Jun 15, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@vinothchandar
Member

Let's give this a shot.

@hudi-bot run azure

@nsivabalan nsivabalan added priority:critical Production degraded; pipelines stalled and removed priority:high Significant impact; potential bugs labels Jun 25, 2021
@vinothchandar vinothchandar added the priority:blocker Production down; release blocker label Jul 2, 2021
Member

@vinothchandar vinothchandar left a comment

Left a few comments. Can we see if we can avoid the new config?

@nsivabalan nsivabalan force-pushed the bulkInsert_DeDupOct17 branch 2 times, most recently from 80ae344 to 0b5cdce Compare July 6, 2021 21:12
@nsivabalan nsivabalan force-pushed the bulkInsert_DeDupOct17 branch 2 times, most recently from ad1d2d1 to 1fa675d Compare July 7, 2021 04:00
@nsivabalan nsivabalan force-pushed the bulkInsert_DeDupOct17 branch from 1fa675d to 51ccc2d Compare July 7, 2021 15:55
@nsivabalan
Copy link
Contributor Author

@hudi-bot run azure

@nsivabalan nsivabalan merged commit 16e90d3 into apache:master Jul 7, 2021
Samrat002 pushed a commit to Samrat002/hudi that referenced this pull request Jul 15, 2021
change the insert overwrite return type

[HUDI-1860] Test wrapper for insert_overwrite and insert_overwrite_table

[HUDI-2084] Resend the uncommitted write metadata when start up (apache#3168)

Co-authored-by: 喻兆靖 <[email protected]>

[HUDI-2081] Move schema util tests out from TestHiveSyncTool (apache#3166)

[HUDI-2094] Supports hive style partitioning for flink writer (apache#3178)

[HUDI-2097] Fix Flink unable to read commit metadata error (apache#3180)

[HUDI-2085] Support specify compaction paralleism and compaction target io for flink batch compaction (apache#3169)

[HUDI-2092] Fix NPE caused by FlinkStreamerConfig#writePartitionUrlEncode null value (apache#3176)

[HUDI-2006] Adding more yaml templates to test suite (apache#3073)

[HUDI-2103] Add rebalance before index bootstrap (apache#3185)

Co-authored-by: 喻兆靖 <[email protected]>

[HUDI-1944] Support Hudi to read from committed offset (apache#3175)

* [HUDI-1944] Support Hudi to read from committed offset

* [HUDI-1944] Adding group option to KafkaResetOffsetStrategies

* [HUDI-1944] Update Exception msg

[HUDI-2052] Support load logFile in BootstrapFunction (apache#3134)

Co-authored-by: 喻兆靖 <[email protected]>

[HUDI-89] Add configOption & refactor all configs based on that (apache#2833)

Co-authored-by: Wenning Ding <[email protected]>

[MINOR] Update .asf.yaml to codify notification settings, turn on jira comments, gh discussions (apache#3164)

- Turn on comment for jira, so we can track PR activity better
- Create a notification settings that match https://gitbox.apache.org/schemes.cgi?hudi
- Try and turn on "discussions" on Github, to experiment

[MINOR] Fix broken build due to FlinkOptions (apache#3198)

[HUDI-2088] Missing Partition Fields And PreCombineField In Hoodie Properties For Table Written By Flink (apache#3171)

[MINOR] Add Documentation to KEYGENERATOR_TYPE_PROP (apache#3196)

[HUDI-2105] Compaction Failed For MergeInto MOR Table (apache#3190)

[HUDI-2051] Enable Hive Sync When Spark Enable Hive Meta  For Spark Sql (apache#3126)

[HUDI-2112] Support reading pure logs file group for flink batch reader after compaction (apache#3202)

[HUDI-2114] Spark Query MOR Table Written By Flink Return Incorrect Timestamp Value (apache#3208)

[HUDI-2121] Add operator uid for flink stateful operators (apache#3212)

[HUDI-2123]  Exception When Merge With Null-Value Field (apache#3214)

[HUDI-2124] A Grafana dashboard for HUDI. (apache#3216)

[HUDI-2057]  CTAS Generate An External Table When Create Managed Table (apache#3146)

[HUDI-1930] Bootstrap support configure KeyGenerator by type (apache#3170)

* [HUDI-1930] Bootstrap support configure KeyGenerator by type

[HUDI-2116] Support batch synchronization of partition datas to  hive metastore to avoid oom problem (apache#3209)

[HUDI-2126] The coordinator send events to write function when there are no data for the checkpoint (apache#3219)

[HUDI-2127] Initialize the maxMemorySizeInBytes in log scanner (apache#3220)

[HUDI-2058]support incremental query for insert_overwrite_table/insert_overwrite operation on cow table (apache#3139)

[HUDI-2129] StreamerUtil.medianInstantTime should return a valid date time string (apache#3221)

[HUDI-2131] Exception Throw Out When MergeInto With Decimal Type Field (apache#3224)

[HUDI-2122] Improvement in packaging insert into smallfiles (apache#3213)

[HUDI-2132] Make coordinator events as POJO for efficient serialization (apache#3223)

[HUDI-2106] Fix flink batch compaction bug while user don't set compaction tasks (apache#3192)

[HUDI-2133] Support hive1 metadata sync for flink writer (apache#3225)

[HUDI-2089]fix the bug that metatable cannot support non_partition table (apache#3182)

[HUDI-2028] Implement RockDbBasedMap as an alternate to DiskBasedMap in ExternalSpillableMap (apache#3194)

Co-authored-by: Rajesh Mahindra <[email protected]>

[HUDI-2135] Add compaction schedule option for flink (apache#3226)

[HUDI-2055] Added deltastreamer metric for time of lastSync (apache#3129)

[HUDI-2046] Loaded too many classes like sun/reflect/GeneratedSerializationConstructorAccessor in JVM metaspace (apache#3121)

Loaded too many classes when use kryo of spark to hudi

Co-authored-by: weiwei.duan <[email protected]>

[HUDI-1996] Adding functionality to allow the providing of basic auth creds for confluent cloud schema registry (apache#3097)

* adding support for basic auth with confluent cloud schema registry

[HUDI-2093] Fix empty avro schema path caused by duplicate parameters (apache#3177)

* [HUDI-2093] Fix empty avro schema path caused by duplicate parameters

* rename shcmea option key

* fix doc

* rename var name

[HUDI-2113] Fix integration testing failure caused by sql results out of order (apache#3204)

[HUDI-2016] Fixed bootstrap of Metadata Table when some actions are in progress. (apache#3083)

Metadata Table cannot be bootstrapped when any action is in progress. This is detected by the presence of inflight or requested instants. The bootstrapping is initiated in preWrite and postWrite of each commit. So bootstrapping will be retried again until it succeeds.
Also added metrics for when the bootstrapping fails or a table is re-bootstrapped. This will help detect tables which are not getting bootstrapped.

[HUDI-2140] Fixed the unit test TestHoodieBackedMetadata.testOnlyValidPartitionsAdded. (apache#3234)

[HUDI-2115] FileSlices in the filegroup is not descending by timestamp (apache#3206)

[HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows (apache#3149)

[HUDI-2069] Refactored String constants (apache#3172)

[HUDI-1105] Adding dedup support for Bulk Insert w/ Rows (apache#2206)

[HUDI-2134]Add generics to avoif forced conversion in BaseSparkCommitActionExecutor#partition (apache#3232)

[HUDI-2009] Fixing extra commit metadata in row writer path (apache#3075)

[HUDI-2099]hive lock which state is WATING should be released, otherwise this hive lock will be locked forever (apache#3186)

[MINOR] Fix build broken from apache#3186 (apache#3245)

[HUDI-2136] Fix conflict when flink-sql-connector-hive and hudi-flink-bundle are both in flink lib (apache#3227)

[HUDI-2087] Support Append only in Flink stream (apache#3174)

Co-authored-by: 喻兆靖 <[email protected]>

UnitTest for deltaSync

Removing cosmetic changes and reuse function for insert_overwrite_table

unit test

initial unit test for insert_overwrite and insert_overwrite_table

Adding failed test code for insert_overwrite

Revert "[HUDI-2087] Support Append only in Flink stream (apache#3174)" (apache#3251)

This reverts commit 3715267.

[HUDI-2147] Remove unused class AvroConvertor in hudi-flink (apache#3243)

[MINOR] Fix some wrong assert reasons (apache#3248)

[HUDI-2087] Support Append only in Flink stream (apache#3252)

Co-authored-by: 喻兆靖 <[email protected]>

[HUDI-2143] Tweak the default compaction target IO to 500GB when flink async compaction is off (apache#3238)

[HUDI-2142] Support setting bucket assign parallelism for flink write task (apache#3239)

[HUDI-1483] Support async clustering for deltastreamer and Spark streaming (apache#3142)

- Integrate async clustering service with HoodieDeltaStreamer and HoodieStreamingSink
- Added methods in HoodieAsyncService to reuse code

[HUDI-2107] Support Read Log Only MOR Table For Spark (apache#3193)

[HUDI-2144]Bug-Fix:Offline clustering(HoodieClusteringJob) will cause insert action losing data (apache#3240)

* fixed

* add testUpsertPartitionerWithSmallFileHandlingAndClusteringPlan ut

* fix CheckStyle

Co-authored-by: yuezhang <[email protected]>

[MINOR] Fix EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION config (apache#3250)

[HUDI-2168] Fix for AccessControlException for anonymous user (apache#3264)

[HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer

test with insert-overwrite and insert-overwrite-table

removing hardcoded action to pass the unit test

[HUDI-1969] Support reading logs for MOR Hive rt table (apache#3033)

[HUDI-2171] Add parallelism conf for bootstrap operator

using delta-commit for insert_overwrite
ghost pushed a commit to shivagowda/hudi that referenced this pull request Jul 15, 2021
ghost pushed a commit to shivagowda/hudi that referenced this pull request Aug 1, 2021

Labels

priority:blocker Production down; release blocker
priority:critical Production degraded; pipelines stalled

6 participants