[HUDI-1598] Write as minor batches during one checkpoint interval for the new writer #2553

danny0405 · 2021-02-08T07:47:41Z

What is the purpose of the pull request

This is step 3 of RFC-24:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal

Buffering data for a whole checkpoint interval and then flushing the
buffer all at once is not resource friendly for streaming writes. A better
way is to cut the batches based on the actual in-memory size of the data
buffer (say, 128 MB): the writer always flushes the buffer when its size
reaches the configured threshold.
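
A minimal sketch of the size-based flushing idea described above. `SizeBoundedBuffer` and its `flusher` callback are hypothetical names used for illustration only, not classes from this PR; records are plain strings and their UTF-8 length stands in for the real memory estimate.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer, not a Hudi class: flushes once the buffered bytes reach a threshold. */
class SizeBoundedBuffer {
  private final long maxBytes;                   // configured threshold, e.g. 128 MB
  private final Consumer<List<String>> flusher;  // stands in for the real file write
  private final List<String> records = new ArrayList<>();
  private long bufferedBytes = 0;

  SizeBoundedBuffer(long maxBytes, Consumer<List<String>> flusher) {
    this.maxBytes = maxBytes;
    this.flusher = flusher;
  }

  /** Buffers one record and flushes a mini-batch if the threshold is reached. */
  void add(String record) {
    records.add(record);
    bufferedBytes += record.getBytes(StandardCharsets.UTF_8).length;
    if (bufferedBytes >= maxBytes) {
      flush();
    }
  }

  /** Writes out whatever is buffered; would also be called when a checkpoint starts. */
  void flush() {
    if (records.isEmpty()) {
      return;
    }
    flusher.accept(new ArrayList<>(records));
    records.clear();
    bufferedBytes = 0;
  }

  public static void main(String[] args) {
    SizeBoundedBuffer buffer = new SizeBoundedBuffer(
        10, batch -> System.out.println("flushed " + batch.size() + " records"));
    for (int i = 0; i < 5; i++) {
      buffer.add("abcd"); // 4 bytes each, so the third record crosses the 10-byte threshold
    }
    buffer.flush(); // checkpoint: flush the remaining records
  }
}
```

With this shape, memory use is bounded by the threshold rather than by how much data arrives between two checkpoints.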

Thus, after this change, one instant may span one (if every checkpoint
succeeds) or more (if there are checkpoint failures) checkpoints. The
instant only commits when there is a successful checkpoint.

Brief change log

- Modify the BaseFlinkCommitActionExecutor to keep the write handles
  explicitly, so that the write task can decide when and how to roll over
  the underlying file handles.
- The StreamWriteOperatorCoordinator now commits the old instant and
  creates a new one only if the checkpoint succeeds; thus, failed
  checkpoints can simply be skipped.
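
The commit rule for instants can be sketched as a small state machine. This is an illustrative model only: `InstantCoordinatorSketch` and its methods are hypothetical and do not mirror the real StreamWriteOperatorCoordinator API. It only shows that a failed checkpoint leaves the instant open, so one instant may span several checkpoints.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model, not the real StreamWriteOperatorCoordinator. */
class InstantCoordinatorSketch {
  private int nextInstantId = 1;
  private String currentInstant = startInstant();
  private int checkpointsSpanned = 0;
  private final List<String> committed = new ArrayList<>();

  private String startInstant() {
    return "instant-" + nextInstantId++;
  }

  /** Called once per checkpoint with its outcome. */
  void onCheckpoint(boolean succeeded) {
    checkpointsSpanned++;
    if (!succeeded) {
      return; // failed checkpoint: keep the current instant open
    }
    // Successful checkpoint: commit the old instant, then start a new one.
    committed.add(currentInstant + ":" + checkpointsSpanned);
    currentInstant = startInstant();
    checkpointsSpanned = 0;
  }

  String getCurrentInstant() {
    return currentInstant;
  }

  /** Entries look like "instant-2:2", i.e. instant name and number of checkpoints it spanned. */
  List<String> getCommitted() {
    return committed;
  }

  public static void main(String[] args) {
    InstantCoordinatorSketch coordinator = new InstantCoordinatorSketch();
    coordinator.onCheckpoint(true);
    coordinator.onCheckpoint(false);
    coordinator.onCheckpoint(true);
    System.out.println(coordinator.getCommitted());
  }
}
```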

Verify this pull request

Added test cases.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

codecov-io · 2021-02-08T08:52:27Z

Codecov Report

Merging #2553 (a7d7ee6) into master (c30481f) will decrease coverage by 41.21%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master   #2553       +/-   ##
============================================
- Coverage     50.90%   9.69%   -41.22%     
+ Complexity     3167      48     -3119     
============================================
  Files           433      53      -380     
  Lines         19806    1929    -17877     
  Branches       2032     229     -1803     
============================================
- Hits          10083     187     -9896     
+ Misses         8904    1729     -7175     
+ Partials        819      13      -806

Flag	Coverage Δ	Complexity Δ
hudicli	`?`	`?`
hudiclient	`?`	`?`
hudicommon	`?`	`?`
hudiflink	`?`	`?`
hudihadoopmr	`?`	`?`
hudisparkdatasource	`?`	`?`
hudisync	`?`	`?`
huditimelineservice	`?`	`?`
hudiutilities	`9.69% <ø> (-59.78%)`	`0.00 <ø> (ø)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ	Complexity Δ
...va/org/apache/hudi/utilities/IdentitySplitter.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-2.00%)`
...va/org/apache/hudi/utilities/schema/SchemaSet.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-3.00%)`
...a/org/apache/hudi/utilities/sources/RowSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-4.00%)`
.../org/apache/hudi/utilities/sources/AvroSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-1.00%)`
.../org/apache/hudi/utilities/sources/JsonSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-1.00%)`
...rg/apache/hudi/utilities/sources/CsvDFSSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-10.00%)`
...g/apache/hudi/utilities/sources/JsonDFSSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-4.00%)`
...apache/hudi/utilities/sources/JsonKafkaSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-6.00%)`
...pache/hudi/utilities/sources/ParquetDFSSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-5.00%)`
...lities/schema/SchemaProviderWithPostProcessor.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-4.00%)`
... and 403 more

yanghua

@danny0405 After a quick look, left some comments.

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/MiniBatchHandle.java

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/table/ExplicitWriteHandleTable.java

hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteFunction.java

hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteOperatorCoordinator.java

garyli1019 · 2021-02-08T11:43:02Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

Is this final?

Yes, once created for a bucket, it never changes.

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkCreateHandle.java

garyli1019 · 2021-02-08T12:02:03Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkCreateHandle.java

Does this checkpoint bucket represent the BucketWriter?

It represents a write bucket within a checkpoint. A BucketWriter may hold several bucket write handles.

garyli1019 · 2021-02-08T12:16:05Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java

Sorry, I am a bit confused by mini-batch, checkpoint, file id, etc.
We did a shuffle by file id before the bucketWriter, so each bucketWriter should handle one file id in each commit, right? Do you mean we will write multiple files for one commit (checkpoint) in a single bucketWriter?

The BucketWriter holds several write handles, one per bucket. During one checkpoint, we may write to a bucket multiple times (as mini-batches), i.e. append to the bucket's handle several times, for both the create handle and the merge handle, depending on how many mini-batches there are in one checkpoint for the BucketWriter.
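
A rough sketch of that relationship, under the assumption that a handle can simply be appended to: one writer keeps one handle per bucket (file id), and every mini-batch flushed within the checkpoint goes to the same handle. `BucketHandleSketch` and `BucketWriterSketch` are hypothetical names, not the classes touched by this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical handle for one bucket; the real handles write to files. */
class BucketHandleSketch {
  final String fileId;
  final List<String> written = new ArrayList<>();
  int miniBatches = 0;

  BucketHandleSketch(String fileId) {
    this.fileId = fileId;
  }

  void append(List<String> batch) {
    written.addAll(batch);
    miniBatches++;
  }
}

/** Hypothetical writer that keeps one handle per bucket for the duration of a checkpoint. */
class BucketWriterSketch {
  private final Map<String, BucketHandleSketch> handles = new LinkedHashMap<>();

  /** Each mini-batch for a bucket is appended to that bucket's existing handle. */
  void flushMiniBatch(String fileId, List<String> batch) {
    handles.computeIfAbsent(fileId, BucketHandleSketch::new).append(batch);
  }

  /** At checkpoint time the handles are handed over and the writer starts fresh. */
  Map<String, BucketHandleSketch> finishCheckpoint() {
    Map<String, BucketHandleSketch> closed = new LinkedHashMap<>(handles);
    handles.clear();
    return closed;
  }
}
```

For example, flushing two mini-batches for file id `f1` and one for `f2` before `finishCheckpoint()` yields two handles, with the `f1` handle having received both of its batches.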

IMO, can we ignore the concept of the mini-batch in the implementation (interface), making it the default mechanism?

I mean can we remove the MiniBatchHandle interface?

No, a new interface makes the handle clearer, because it is different from the Spark-style handle.

I see your point. But from another angle, is this the only way for Flink, or will you implement another way?
If we only provide this one way, we may not need to define it. It increases the cost of understanding, as Gary's question shows.

Flink's handle and Spark's handle are in different classes, which is already an obvious difference, IMO.

garyli1019 · 2021-02-08T12:23:45Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java

Rename is not friendly to cloud object stores; can we avoid it?

Is there a better way to switch the file to a new name? Looking forward to your suggestions ~

You may try to create a new file instead of renaming the current file? Refer to TimelineLayoutV1?

Sorry, I'm confused by the suggestion.

I mean the difference between TimelineLayoutV0 and TimelineLayoutV1 is that the second chooses to create a new instant file instead of renaming a file.

Creating a new file would duplicate the data written by the last batch: it copies and rewrites all the bucket data. Worse, when there is only one mini-batch, we still duplicate the write. I'm hoping for a way that avoids the duplicate write and is also friendly to object stores.

We can discuss it later, IMO. It's not a blocking issue.

garyli1019 · 2021-02-08T12:40:10Z

hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteFunction.java

No exactly-once anymore?

We no longer block when a checkpoint starts and triggers a data write. That means that while the checkpoint is flushing the data buffer, new data may come in and trigger a write, so a checkpoint may contain more data than it should.

When there are checkpoint failures, the job may roll back to a checkpoint that has more data written, and that buffered data is written again as duplicates.

But we still have eventually consistent semantics, based on the fact that every Hoodie record has a record key.

The old pipeline does not have exactly-once semantics either. We just found that there is no need to keep that for the new pipeline, which improves the throughput.
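
The record-key argument can be illustrated with a toy upsert: if writes are keyed, replaying a batch after a checkpoint rollback overwrites the same keys instead of adding rows. This is only a sketch of the idea; `KeyedUpsertSketch` is a hypothetical class, and Hudi's actual merge logic is more involved than a map put.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical keyed table: each record is a {recordKey, value} pair. */
class KeyedUpsertSketch {
  private final Map<String, String> rowsByKey = new TreeMap<>();

  /** Upserts a batch; a record with an existing key replaces the old row. */
  void upsert(List<String[]> batch) {
    for (String[] record : batch) {
      rowsByKey.put(record[0], record[1]);
    }
  }

  Map<String, String> snapshot() {
    return new TreeMap<>(rowsByKey);
  }

  public static void main(String[] args) {
    KeyedUpsertSketch table = new KeyedUpsertSketch();
    List<String[]> batch = Arrays.asList(
        new String[] {"k1", "v1"},
        new String[] {"k2", "v2"});
    table.upsert(batch);
    table.upsert(batch); // replay after a rollback: same keys, no extra rows
    System.out.println(table.snapshot());
  }
}
```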

hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteOperatorCoordinator.java

yanghua

@danny0405 Left some comments.

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/ExplicitCreateHandleFactory.java

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkCreateHandle.java

yanghua · 2021-02-09T06:17:49Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java

IMO, can we ignore the concept of the mini-batch in the implementation (interface), making it the default mechanism?

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java

yanghua · 2021-02-09T07:05:44Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/FlinkMergeHandle.java

You may try to create a new file instead of renaming the current file? Refer to TimelineLayoutV1?

...lient/hudi-flink-client/src/main/java/org/apache/hudi/table/HoodieFlinkCopyOnWriteTable.java

hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java


yanghua

LGTM

hk-lrzy · 2021-02-23T10:04:14Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/HoodieFlinkWriteClient.java

    table.validateUpsertSchema();
    preWrite(instantTime, WriteOperationType.UPSERT);
-    HoodieWriteMetadata<List<WriteStatus>> result = table.upsert(context, instantTime, records);
+    final HoodieRecord<T> record = records.get(0);


Why use the first record's bucket type? If the first record is I and the second one is U, the code will use FlinkCreateHandle to process all records.

Yes, that is the logic we want, because we try to merge records into small buckets.

So we also need to filter duplicate records on the query side?


yanghua self-assigned this Feb 8, 2021

yanghua requested review from garyli1019, wangxianghu and yanghua February 8, 2021 07:48

danny0405 force-pushed the step3 branch from 5c4bf48 to d75e7e7 Compare February 8, 2021 08:09

yanghua reviewed Feb 8, 2021

garyli1019 reviewed Feb 8, 2021

yanghua reviewed Feb 9, 2021

hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteOperatorCoordinator.java (outdated, resolved)

danny0405 force-pushed the step3 branch 2 times, most recently from c858c95 to 266492a Compare February 9, 2021 05:16

yanghua reviewed Feb 9, 2021

danny0405 force-pushed the step3 branch from 266492a to 0e74819 Compare February 9, 2021 11:22

danny0405 force-pushed the step3 branch from 0e74819 to a7d7ee6 Compare February 9, 2021 12:00

yanghua approved these changes Feb 9, 2021

yanghua merged commit 5d2491d into apache:master Feb 17, 2021

hk-lrzy reviewed Feb 23, 2021

prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021

[HUDI-1598] Write as minor batches during one checkpoint interval for the new writer (apache#2553)

7ea6acb
