[HUDI-1632] Supports merge on read write mode for Flink writer #2593

danny0405 · 2021-02-23T08:55:00Z

What is the purpose of the pull request

This PR adds MOR (merge-on-read) write mode to the Flink writer. Because the writer currently uses the state-based index, it can always write log files.

It also supports async compaction with pluggable strategies.

Brief change log

Supports MOR write
Supports async compaction

Verify this pull request

Added UTs and ITs.

XuQianJin-Stars · 2021-02-23T09:14:40Z

@danny0405 This feature is awesome. 👍

danny0405 · 2021-02-23T09:21:24Z

Thanks

lamberken · 2021-02-23T10:12:29Z

Hi @danny0405, thanks for the patch. There are several checkstyle errors; you can check them with the command mvn checkstyle:check

Change JUnit 4 to JUnit 5:

import org.junit.Ignore;  -->  import org.junit.jupiter.api.Disabled;

Output:

[ERROR] src/main/java/org/apache/hudi/operator/compact/CompactEvent.java:[35,23] (whitespace) MethodParamPad: '(' is preceded with whitespace.
[ERROR] src/main/java/org/apache/hudi/operator/FlinkOptions.java:[172,67] (whitespace) WhitespaceAround: WhitespaceAround: '=' is not preceded with whitespace.
[ERROR] src/main/java/org/apache/hudi/util/StreamerUtil.java:[40,8] (imports) UnusedImports: Unused import - org.apache.hudi.table.action.compact.strategy.CompactionStrategy.
[ERROR] src/test/java/org/apache/hudi/operator/MergeOnReadCompactTest.java:[31,1] (imports) IllegalImport: Illegal import - org.junit.Ignore.
[ERROR] src/test/java/org/apache/hudi/operator/MergeOnReadWriteTest.java:[37,1] (imports) IllegalImport: Illegal import - org.junit.Ignore.

codecov-io · 2021-02-23T13:57:30Z

Codecov Report

Merging #2593 (c22ac8d) into master (06dc7c7) will increase coverage by 18.37%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##             master    #2593       +/-   ##
=============================================
+ Coverage     51.22%   69.59%   +18.37%     
+ Complexity     3230      364     -2866     
=============================================
  Files           438       53      -385     
  Lines         20093     1944    -18149     
  Branches       2069      235     -1834     
=============================================
- Hits          10292     1353     -8939     
+ Misses         8954      458     -8496     
+ Partials        847      133      -714

Flag	Coverage Δ	Complexity Δ
hudicli	`?`	`?`
hudiclient	`?`	`?`
hudicommon	`?`	`?`
hudiflink	`?`	`?`
hudihadoopmr	`?`	`?`
hudisparkdatasource	`?`	`?`
hudisync	`?`	`?`
huditimelineservice	`?`	`?`
hudiutilities	`69.59% <ø> (+0.08%)`	`0.00 <ø> (ø)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ	Complexity Δ
...hudi/utilities/sources/helpers/KafkaOffsetGen.java	`85.84% <0.00%> (-2.94%)`	`20.00% <0.00%> (+4.00%)`	⬇️
...he/hudi/common/table/log/block/HoodieLogBlock.java
...che/hudi/common/table/timeline/dto/LogFileDTO.java
.../hudi/common/table/view/FileSystemViewManager.java
...able/timeline/versioning/AbstractMigratorBase.java
.../org/apache/hudi/io/storage/HoodieHFileReader.java
...org/apache/hudi/common/bloom/BloomFilterUtils.java
...org/apache/hudi/common/config/TypedProperties.java
...ava/org/apache/hudi/cli/commands/TableCommand.java
...rg/apache/hudi/schema/FilebasedSchemaProvider.java
... and 371 more

simonsssu · 2021-02-24T04:26:13Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

It seems this only contains commitCompaction and planScheduler. Where is the real compaction logic? I see there are a CompactFunction and a CompactCommitSink, but I can't find where they are called.

You can take StreamWriteITCase#testMergeOnReadWriteWithCompaction as an example.

I understand that the plan scheduler generates the plan, and that commitCompaction then reads the plan to see whether it should commit, but I'm unsure how these connect. When is the real compaction triggered? And if we use CompactFunction, in what sense is it async?

StreamWriteOperatorCoordinator.checkpointComplete is the entry point for scheduling a compaction. Because the strategy there is pluggable, a scheduling attempt may generate a null compaction plan (i.e. no compaction); that is why we say the compaction is async.
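To make the flow above easier to picture, here is a minimal, self-contained sketch. It is not the code in this PR: CompactionStrategySketch, CoordinatorSketch, everyNCommits and the "plan" string are all hypothetical stand-ins, and the "every N checkpoints" rule is just one assumed strategy. It only illustrates that each completed checkpoint is a chance to schedule a plan, and that the strategy may decline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical pluggable strategy: may or may not produce a plan. */
interface CompactionStrategySketch {
    Optional<String> schedule(int deltaCommitsSinceLastPlan);
}

/** Hypothetical stand-in for the coordinator that schedules compaction. */
class CoordinatorSketch {
    private final CompactionStrategySketch strategy;
    private final List<String> pendingPlans = new ArrayList<>();
    private int deltaCommits = 0;

    CoordinatorSketch(CompactionStrategySketch strategy) {
        this.strategy = strategy;
    }

    /** Assumed example strategy: produce a plan once n delta commits have piled up. */
    static CoordinatorSketch everyNCommits(int n) {
        return new CoordinatorSketch(commits ->
            commits >= n ? Optional.of("plan") : Optional.<String>empty());
    }

    /** Called once per completed checkpoint; only sometimes yields a plan. */
    void checkpointComplete(long checkpointId) {
        deltaCommits++;
        Optional<String> plan = strategy.schedule(deltaCommits);
        if (plan.isPresent()) {
            // The plan is only recorded here; running it is left to a later step.
            pendingPlans.add(plan.get() + "@" + checkpointId);
            deltaCommits = 0;
        }
    }

    List<String> pendingPlans() {
        return pendingPlans;
    }
}
```

In this sketch the write path never waits for a compaction to run: a checkpoint either records a plan for later execution or does nothing, which is the sense of "async" described above.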

Thanks for the explanation.

lamberken · 2021-02-24T05:46:57Z

hudi-flink/src/test/java/org/apache/hudi/operator/StreamWriteITCase.java

We can add the following statements to CompactionPlanOperator#open:

ValidationUtils.checkArgument(getRuntimeContext().getNumberOfParallelSubtasks() == 1);
ValidationUtils.checkArgument(getRuntimeContext().getMaxNumberOfParallelSubtasks() == 1);

Nice idea ~

lamberken · 2021-02-24T06:01:23Z

hudi-flink/src/test/java/org/apache/hudi/operator/MergeOnReadCompactTest.java

Unit tests and test cases should be independent of each other, but here:

MergeOnReadWriteTest extends CopyOnWriteTest
MergeOnReadCompactTest extends CopyOnWriteTest

For the shared content, shall we use a base test class? For example:

AbstractFlinkWriteTest
CopyOnWriteTest extends AbstractFlinkWriteTest
MergeOnReadWriteTest extends AbstractFlinkWriteTest
MergeOnReadCompactTest extends AbstractFlinkWriteTest

Sorry, I think there is no need for another base class here, because the base class should contain the full set of test cases for the COPY_ON_WRITE table type.

OK. CopyOnWriteTest, MergeOnReadWriteTest and MergeOnReadCompactTest share the same tests; can we use @ParameterizedTest to test both table types (MERGE_ON_READ, COPY_ON_WRITE)?

That is not very convenient, because we have some custom strategies for each type, which means the code is not 100% reusable. IMHO, ParameterizedTest is better suited to enumerating values.
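A rough illustration of the inheritance approach being defended here, written in plain Java without JUnit so it stays self-contained. The class names CowWriteCheckSketch and MorWriteCheckSketch, the hook methods and the "base"/"log" labels are all hypothetical; the real test classes are not shown in this thread. The idea is that the copy-on-write class holds the shared checks and the merge-on-read class overrides only the type-specific hooks.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the copy-on-write test class: holds the shared checks. */
class CowWriteCheckSketch {
    /** Hook: the table type under test. */
    String tableType() {
        return "COPY_ON_WRITE";
    }

    /** Hook: file kinds expected after a write (illustrative labels only). */
    List<String> expectedFileKinds() {
        return Arrays.asList("base");
    }

    /** Shared logic, reused unchanged by every subclass. */
    String describeWrite() {
        return tableType() + " -> " + String.join(",", expectedFileKinds());
    }
}

/** Hypothetical merge-on-read variant: overrides only what differs. */
class MorWriteCheckSketch extends CowWriteCheckSketch {
    @Override
    String tableType() {
        return "MERGE_ON_READ";
    }

    @Override
    List<String> expectedFileKinds() {
        return Arrays.asList("log");
    }
}
```

With a parameterized test, the per-type differences would have to be passed in as arguments or branched on inside each test, which is presumably the inconvenience mentioned above.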

lamberken · 2021-02-24T09:47:08Z

hudi-flink/src/main/java/org/apache/hudi/operator/compact/CompactionCommitSink.java

Will this parameter be used in the future? If not, we can remove it. :)

Yes, we can remove it in the current code base.

yanghua

Took a quick look and left some comments. @danny0405

yanghua · 2021-02-23T10:02:27Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkAppendHandle.java

Can we wrap this try-catch block around the for-loop?
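For readers without the diff at hand, the suggestion looks roughly like the sketch below. AppendLoopSketch, writeRecord and writeAll are hypothetical names, not the actual FlinkAppendHandle code, and wrapping the failure in UncheckedIOException is an assumption made only to keep the example runnable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

class AppendLoopSketch {
    /** Hypothetical per-record write that may fail with a checked exception. */
    static void writeRecord(String record, StringBuilder out) throws IOException {
        if (record == null) {
            throw new IOException("null record");
        }
        out.append(record).append('\n');
    }

    /** A single try-catch wraps the whole loop instead of sitting inside each iteration. */
    static String writeAll(List<String> records) {
        StringBuilder out = new StringBuilder();
        try {
            for (String record : records) {
                writeRecord(record, out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to append records", e);
        }
        return out.toString();
    }
}
```

Note that moving the try-catch outside the loop also means the first failure stops the iteration, so this is only equivalent when the catch block rethrows anyway.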

yanghua · 2021-02-25T02:37:48Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

This line is too long.

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

yanghua · 2021-02-25T03:37:47Z

...udi-flink-client/src/main/java/org/apache/hudi/table/action/compact/FlinkCompactHelpers.java

yanghua · 2021-02-25T03:38:29Z

...udi-flink-client/src/main/java/org/apache/hudi/table/action/compact/FlinkCompactHelpers.java

yanghua · 2021-02-25T05:39:28Z

...src/main/java/org/apache/hudi/table/action/compact/HoodieFlinkMergeOnReadTableCompactor.java

Use a consistent indentation style?

yanghua · 2021-02-25T05:39:59Z

...src/main/java/org/apache/hudi/table/action/compact/HoodieFlinkMergeOnReadTableCompactor.java

yanghua · 2021-02-25T05:40:16Z

...src/main/java/org/apache/hudi/table/action/compact/HoodieFlinkMergeOnReadTableCompactor.java

yanghua

@danny0405 Left some comments.

yanghua · 2021-02-26T01:11:38Z

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java

In previous PRs, you used the instantTime of HoodieRecordLocation to determine whether an operation is an update, right? So let's keep it the same?

This is the default logic for Hoodie and we should keep it; FlinkAppendHandle already overrides the behavior.

Yes, it's in the common package. My mistake.

yanghua · 2021-02-26T01:14:45Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

Can you extract table.getMetaClient().getTableType().equals(HoodieTableType.MERGE_ON_READ) into a variable?
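As a small sketch of what this asks for (not the actual HoodieFlinkWriteClient code): the enum, the class name TableTypeCheckSketch and the describe method below are hypothetical, and the two uses of the flag are invented for illustration. The point is only that the check is evaluated once and given a descriptive name.

```java
class TableTypeCheckSketch {
    /** Local stand-in for the table type enum, so the example is self-contained. */
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

    /** Hypothetical method that needs the same check in more than one place. */
    static String describe(TableType tableType) {
        // Evaluate the condition once and name it, instead of repeating the expression.
        final boolean isMergeOnRead = tableType == TableType.MERGE_ON_READ;
        String action = isMergeOnRead ? "deltacommit" : "commit";
        String compaction = isMergeOnRead ? "compaction may be scheduled" : "no compaction";
        return action + ": " + compaction;
    }
}
```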

yanghua · 2021-02-26T01:15:21Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

yanghua · 2021-02-26T06:35:05Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

Can the concatenated strings and variable form a complete sentence?

yanghua · 2021-02-26T08:10:36Z

.../main/java/org/apache/hudi/table/action/commit/delta/BaseFlinkDeltaCommitActionExecutor.java

Can we avoid hard-coding FlinkAppendHandle?

yanghua · 2021-02-26T08:20:13Z

hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssigners.java

Throw an unsupported exception?

This is an internal error that should never happen. I prefer a simple AssertionError, which signals a bug within HUDI rather than an unsupported feature.
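A minimal sketch of the pattern being described, assuming a switch over the table type. BucketAssignerChoiceSketch and assignerFor are hypothetical, and the returned strings are only labels for illustration, not the real factory in BucketAssigners.

```java
class BucketAssignerChoiceSketch {
    /** Local stand-in for the table type enum, so the example is self-contained. */
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

    /** Returns a label for the assigner that would be chosen for the table type. */
    static String assignerFor(TableType tableType) {
        switch (tableType) {
            case COPY_ON_WRITE:
                return "BucketAssigner";
            case MERGE_ON_READ:
                return "DeltaBucketAssigner";
            default:
                // Reachable only if a new enum constant is added without updating
                // this switch, which is a bug in the code rather than a missing feature.
                throw new AssertionError("Unexpected table type: " + tableType);
        }
    }
}
```

An UnsupportedOperationException would instead suggest to callers that the input was valid but not handled yet, which is the distinction drawn above.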

yanghua · 2021-02-26T08:22:27Z

hudi-flink/src/test/java/org/apache/hudi/operator/CopyOnWriteTest.java

Let's start the class name with Test and give it a more meaningful name?

Renamed to TestWriteCopyOnWrite.

yanghua · 2021-02-26T08:26:56Z

hudi-flink/src/test/java/org/apache/hudi/operator/CopyOnWriteTest.java

Replace it with HoodieTableType.COPY_ON_WRITE.name()?

yanghua · 2021-02-26T08:27:30Z

hudi-flink/src/test/java/org/apache/hudi/operator/MergeOnReadCompactTest.java

yanghua · 2021-02-26T08:28:01Z

hudi-flink/src/test/java/org/apache/hudi/operator/MergeOnReadWriteTest.java

garyli1019

Thanks @danny0405 for the awesome work. Hard to catch up on the review since you are making progress too fast :)
Can't go into detail about this large PR too much until I get a chance to run this myself. Left some high-level comments.
One concern is about the test cases. I feel like Flink writer is not as well tested as Spark, so the reliability is a bit concerning for me when we officially release this feature. Any plan to add more test cases?

garyli1019 · 2021-02-28T12:51:48Z

hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

Sounds like ENABLE_ASYNC_COMPACTION?

Renamed to COMPACTION_ASYNC_ENABLED.

garyli1019 · 2021-02-28T12:55:04Z

hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

num_commits is too simple IMO, can we use a more informative name?

No, the enumeration comes from what HUDI core defines, see CompactionTriggerStrategy.

garyli1019 · 2021-02-28T12:56:12Z

hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

Can we reuse HoodieCompactionConfig for all compaction related config? To be consistent with Spark.

No, the option keys of HoodieCompactionConfig are too long and not very friendly to use as SQL options.

garyli1019 · 2021-02-28T13:15:08Z

hudi-flink/src/main/java/org/apache/hudi/operator/compact/CompactCommitEvent.java

How are we gonna use this TaskID?

It is not used in the current code, but I would rather keep it for future use.

garyli1019 · 2021-02-28T13:17:07Z

hudi-flink/src/main/java/org/apache/hudi/operator/compact/CompactEvent.java

Maybe CompactionPlanEvent makes more sense?

On the concern about test cases: yes, we can add more test cases as more features are introduced for Flink, such as SQL connectors, INSERT OVERWRITE, and more kinds of key generators.

garyli1019 · 2021-02-28T13:17:36Z

hudi-flink/src/main/java/org/apache/hudi/operator/compact/CompactCommitEvent.java

CompactionCommitEvent?

yanghua

LGTM. @danny0405 Thanks for your contribution.

…e#2593) Also supports async compaction with pluggable strategies.

…OSS master Summary: [HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (apache#2424) [MINOR] Bumping snapshot version to 0.7.0 (apache#2435) [HUDI-1533] Make SerializableSchema work for large schemas and add ability to sortBy numeric values (apache#2453) [HUDI-1529] Add block size to the FileStatus objects returned from metadata table to avoid too many file splits (apache#2451) [HUDI-1532] Fixed suboptimal implementation of a magic sequence search (apache#2440) [HUDI-1535] Fix 0.7.0 snapshot (apache#2456) [MINOR] Fixing setting defaults for index config (apache#2457) [HUDI-1540] Fixing commons codec shading in spark bundle (apache#2460) [HUDI 1308] Harden RFC-15 Implementation based on production testing (apache#2441) [MINOR] Remove redundant judgments (apache#2466) [MINOR] Fix dataSource cannot use hoodie.datasource.hive_sync.auto_create_database (apache#2444) [MINOR] Disabling problematic tests temporarily to stabilize CI (apache#2468) [MINOR] Make a separate travis CI job for hudi-utilities (apache#2469) [HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (apache#2412) [HUDI-1511] InstantGenerateOperator support multiple parallelism (apache#2434) [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client (apache#2375) [HUDI] Add bloom index for hudi-flink-client [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFlinkStreamer and update docs (apache#2471) [MINOR] Improve code readability,remove the continue keyword (apache#2459) [HOTFIX] Revert upgrade flink verison to 1.12.0 (apache#2473) [HUDI-1453] Fix NPE using HoodieFlinkStreamer to etl data from kafka to hudi (apache#2474) [MINOR] Use skipTests flag for skip.hudi-spark2.unit.tests property (apache#2477) [HUDI-1476] Introduce unit test infra for java client (apache#2478) [MINOR] Update doap with 0.7.0 release (apache#2491) [MINOR]Fix NPE when using HoodieFlinkStreamer with multi 
parallelism (apache#2492) [HUDI-1234] Insert new records to data files without merging for "Insert" operation. (apache#2111) [MINOR] Add Jira URL and Mailing List (apache#2404) [HUDI-1522] Add a new pipeline for Flink writer (apache#2430) [HUDI-1522] Add a new pipeline for Flink writer [HUDI-623] Remove UpgradePayloadFromUberToApache (apache#2455) [HUDI-1555] Remove isEmpty to improve clustering execution performance (apache#2502) [HUDI-1266] Add unit test for validating replacecommit rollback (apache#2418) [MINOR] Quickstart.generateUpdates method add check (apache#2505) [HUDI-1519] Improve minKey/maxKey computation in HoodieHFileWriter (apache#2427) [HUDI-1550] Honor ordering field for MOR Spark datasource reader (apache#2497) [MINOR] Fix method comment typo (apache#2518) [MINOR] Rename FileSystemViewHandler to RequestHandler and corrected the class comment (apache#2458) [HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client (apache#2271) [HUDI-1523] Call mkdir(partition) only if not exists (apache#2501) [HUDI-1538] Try to init class trying different signatures instead of checking its name (apache#2476) [HUDI-1538] Try to init class trying different signatures instead of checking its name. 
[HUDI-1547] CI intermittent failure: TestJsonStringToHoodieRecordMapF… (apache#2521) [MINOR] Fixing the default value for source ordering field for payload config (apache#2516) [HUDI-1420] HoodieTableMetaClient.getMarkerFolderPath works incorrectly on windows client with hdfs server for wrong file seperator (apache#2526) [HUDI-1571] Adding commit_show_records_info to display record sizes for commit (apache#2514) [HUDI-1589] Fix Rollback Metadata AVRO backwards incompatiblity (apache#2543) [MINOR] Fix wrong logic for checking state condition (apache#2524) [HUDI-1557] Make Flink write pipeline write task scalable (apache#2506) [HUDI-1545] Add test cases for INSERT_OVERWRITE Operation (apache#2483) [HUDI-1603] fix DefaultHoodieRecordPayload serialization failure (apache#2556) [MINOR] Fix the wrong comment for HoodieJavaWriteClientExample (apache#2559) [HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431) [HUDI-1612] Fix write test flakiness in StreamWriteITCase (apache#2567) [HUDI-1612] Fix write test flakiness in StreamWriteITCase [MINOR] Default to empty list for unset datadog tags property (apache#2574) [MINOR] Add clustering to feature list (apache#2568) [HUDI-1598] Write as minor batches during one checkpoint interval for the new writer (apache#2553) [HUDI-1109] Support Spark Structured Streaming read from Hudi table (apache#2485) [HUDI-1621] Gets the parallelism from context when init StreamWriteOperatorCoordinator (apache#2579) [HUDI-1381] Schedule compaction based on time elapsed (apache#2260) [HUDI-1582] Throw an exception when syncHoodieTable() fails, with RuntimeException (apache#2536) [HUDI-1539] Fix bug in HoodieCombineRealtimeRecordReader with reading empty iterators (apache#2583) [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (apache#2534) [HUDI-1486] Remove inline inflight rollback in hoodie writer (apache#2359) [HUDI-1586] [Common Core] [Flink Integration] Reduce 
the coupling of hadoop. (apache#2540) [HUDI-1624] The state based index should bootstrap from existing base files (apache#2581) [HUDI-1477] Support copyOnWriteTable in java client (apache#2382) [MINOR] Ensure directory exists before listing all marker files. (apache#2594) [MINOR] hive sync checks for table after creating db if auto create is true (apache#2591) [HUDI-1620] Add azure pipelines configs (apache#2582) [HUDI-1347] Fix Hbase index to make rollback synchronous (via config) (apache#2188) [HUDI-1637] Avoid to rename for bucket update when there is only one flush action during a checkpoint (apache#2599) [HUDI-1638] Some improvements to BucketAssignFunction (apache#2600) [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (apache#2227) [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (apache#2443) [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (apache#2565) [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read. (apache#2584) [HUDI-1632] Supports merge on read write mode for Flink writer (apache#2593) [HUDI-1540] Fixing commons codec dependency in bundle jars (apache#2562) [HUDI-1644] Do not delete older rollback instants as part of rollback. Archival can take care of removing old instants cleanly (apache#2610) [HUDI-1634] Re-bootstrap metadata table when un-synced instants have been archived. (apache#2595) [HUDI-1584] Modify maker file path, which should start with the target base path. (apache#2539) [MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (apache#2617) [HUDI-1553] Configuration and metrics for the TimelineService. 
(apache#2495) [HUDI-1587] Add latency and freshness support (apache#2541) [HUDI-1647] Supports snapshot read for Flink (apache#2613) [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (apache#2611) [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (apache#2621) [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (apache#2596) [HUDI-1660] Excluding compaction and clustering instants from inflight rollback (apache#2631) [HUDI-1661] Exclude clustering commits from getExtraMetadataFromLatest API (apache#2632) [MINOR] Fix import in StreamerUtil.java (apache#2638) [HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (apache#2577) [HUDI-1662] Fix hive date type conversion for mor table (apache#2634) [HUDI-1673] Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex (apache#2642) [MINOR] HoodieClientTestHarness close resources in AfterAll phase (apache#2646) [HUDI-1635] Improvements to Hudi Test Suite (apache#2628) [HUDI-1651] Fix archival of requested replacecommit (apache#2622) [HUDI-1663] Streaming read for Flink MOR table (apache#2640) [HUDI-1678] Row level delete for Flink sink (apache#2659) [HUDI-1664] Avro schema inference for Flink SQL table (apache#2658) [HUDI-1681] Support object storage for Flink writer (apache#2662) [HUDI-1685] keep updating current date for every batch (apache#2671) [HUDI-1496] Fixing input stream detection of GCS FileSystem (apache#2500) [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pacakges for hudi-flink module (apache#2669) [HUDI-1692] Bounded source for stream writer (apache#2674) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. (apache#2494) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. 
[HUDI-1695] Fixed the error messaging (apache#2679) [HUDI 1615] Fixing null schema in bulk_insert row writer path (apache#2653) [HUDI-845] Added locking capability to allow multiple writers (apache#2374) [HUDI-1701] Implement HoodieTableSource.explainSource for all kinds of pushing down (apache#2690) [HUDI-1704] Use PRIMARY KEY syntax to define record keys for Flink Hudi table (apache#2694) [HUDI-1688]hudi write should uncache rdd， when the write operation is finnished (apache#2673) [MINOR] Remove unused var in AbstractHoodieWriteClient (apache#2693) [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator (apache#2627) [HUDI-1705] Flush as per data bucket for mini-batch write (apache#2695) [1568] Fixing spark3 bundles (apache#2625) [HUDI-1650] Custom avro kafka deserializer. (apache#2619) [HUDI-1667]: Fix a null value related bug for spark vectorized reader. (apache#2636) [HUDI-1709] Improving config names and adding hive metastore uri config (apache#2699) [MINOR][DOCUMENT] Update README doc for integ test (apache#2703) [HUDI-1710] Read optimized query type for Flink batch reader (apache#2702) [HUDI-1712] Rename & standardize config to match other configs (apache#2708) [hotfix] Log the error message for creating table source first (apache#2711) [HUDI-1495] Bump Flink version to 1.12.2 (apache#2718) [HUDI-1728] Fix MethodNotFound for HiveMetastore Locks (apache#2731) [HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (apache#2608) [HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink writer (apache#2732) [HOTFIX] close spark session in functional test suite and disable spark3 test for spark2 (apache#2727) [HOTFIX] Disable ITs for Spark3 and scala2.12 (apache#2733) [HOTFIX] fix deploy staging jars script [MINOR] Add Missing Apache License to test files (apache#2736) [UBER] Fixed creation of HoodieMetadataClient which now uses a Builder pattern instead of a constructor. 
Reviewers: balajee, O955 Project Hoodie Project Reviewer: Add blocking reviewers!, PHID-PROJ-pxfpotkfgkanblb3detq! JIRA Issues: HUDI-593 Differential Revision: https://code.uberinternal.com/D5867129

codecov-commenter · 2025-06-11T03:29:10Z

Codecov Report

Attention: Patch coverage is 81.06796% with 39 lines in your changes missing coverage. Please review.

Project coverage is 51.57%. Comparing base (06dc7c7) to head (c22ac8d).
Report is 5056 commits behind head on master.

Files with missing lines	Patch %	Lines
...he/hudi/operator/compact/CompactionCommitSink.java	70.21%	11 Missing and 3 partials ⚠️
...perator/partitioner/delta/DeltaBucketAssigner.java	68.42%	7 Missing and 5 partials ⚠️
.../hudi/operator/compact/CompactionPlanOperator.java	79.06%	5 Missing and 4 partials ⚠️
...che/hudi/operator/partitioner/BucketAssigners.java	50.00%	1 Missing and 1 partial ⚠️
.../hudi/operator/StreamWriteOperatorCoordinator.java	85.71%	0 Missing and 1 partial ⚠️
...e/hudi/operator/compact/CompactionCommitEvent.java	87.50%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #2593      +/-   ##
============================================
+ Coverage     51.22%   51.57%   +0.35%     
- Complexity     3230     3287      +57     
============================================
  Files           438      445       +7     
  Lines         20093    20328     +235     
  Branches       2069     2102      +33     
============================================
+ Hits          10292    10484     +192     
- Misses         8954     8978      +24     
- Partials        847      866      +19

Flag	Coverage Δ
hudicli	`36.87% <ø> (ø)`
hudiclient	`∅ <ø> (∅)`
hudicommon	`51.33% <ø> (-0.03%)`	⬇️
hudiflink	`51.39% <81.06%> (+4.53%)`	⬆️
hudihadoopmr	`33.16% <ø> (ø)`
hudisparkdatasource	`69.71% <ø> (-0.05%)`	⬇️
hudisync	`49.62% <ø> (+1.00%)`	⬆️
huditimelineservice	`66.49% <ø> (ø)`
hudiutilities	`69.59% <ø> (+0.08%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...in/java/org/apache/hudi/operator/FlinkOptions.java	`78.57% <100.00%> (+3.57%)`	⬆️
.../org/apache/hudi/operator/StreamWriteFunction.java	`86.59% <100.00%> (+0.28%)`	⬆️
.../apache/hudi/operator/compact/CompactFunction.java	`100.00% <100.00%> (ø)`
...che/hudi/operator/compact/CompactionPlanEvent.java	`100.00% <100.00%> (ø)`
...udi/operator/partitioner/BucketAssignFunction.java	`84.94% <100.00%> (+0.16%)`	⬆️
...ache/hudi/operator/partitioner/BucketAssigner.java	`80.17% <ø> (ø)`
...c/main/java/org/apache/hudi/util/StreamerUtil.java	`42.59% <100.00%> (+2.20%)`	⬆️
.../hudi/operator/StreamWriteOperatorCoordinator.java	`70.18% <85.71%> (+0.70%)`	⬆️
...e/hudi/operator/compact/CompactionCommitEvent.java	`87.50% <87.50%> (ø)`
...che/hudi/operator/partitioner/BucketAssigners.java	`50.00% <50.00%> (ø)`
... and 3 more

... and 8 files with indirect coverage changes

yanghua requested a review from garyli1019 February 23, 2021 09:44

yanghua self-assigned this Feb 23, 2021

danny0405 force-pushed the HUDI-1632 branch from c605c9a to 2017461 Compare February 23, 2021 12:19

simonsssu reviewed Feb 24, 2021

lamberken reviewed Feb 24, 2021

yanghua reviewed Feb 25, 2021

danny0405 force-pushed the HUDI-1632 branch from 2017461 to e07fcc5 Compare February 25, 2021 07:59

yanghua reviewed Feb 26, 2021

danny0405 force-pushed the HUDI-1632 branch from e07fcc5 to e2c6307 Compare February 26, 2021 11:53

garyli1019 reviewed Feb 28, 2021

[HUDI-1632] Supports merge on read write mode for Flink writer

c22ac8d

Also supports async compaction with pluggable strategies.

danny0405 force-pushed the HUDI-1632 branch from e2c6307 to c22ac8d Compare March 1, 2021 02:41

yanghua approved these changes Mar 1, 2021

yanghua merged commit 7a11de1 into apache:master Mar 1, 2021

prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021

[HUDI-1632] Supports merge on read write mode for Flink writer (apache#2593)

f3617dd

Also supports async compaction with pluggable strategies.
