[HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client #2375

Nieal-Yang · 2020-12-24T07:46:38Z

What is the purpose of the pull request

Add a bloom index for hudi-flink-client.

wangxianghu · 2020-12-24T08:28:27Z

@Nieal-Yang Thanks for doing this. Please assign this ticket (https://issues.apache.org/jira/browse/HUDI-1332) to yourself :)

Nieal-Yang · 2020-12-24T08:56:38Z

OK, but I am not very familiar with Jira. Can you help assign it to me?

garyli1019 · 2020-12-24T09:12:32Z

@Nieal-Yang Thanks for submitting this PR! Please apply for contributor access for Hudi's Jira so I can assign the ticket to you.
Guide

garyli1019

Hi @Nieal-Yang, thanks for your contribution and welcome to the community!
I just took a quick pass on this PR. This implementation is similar to Spark's, which runs in a "batch mode": we run a full cycle of the workload for each commit and restart for the next one. Please let me know if I misunderstood.
I think we definitely need the ability to run a Flink batch job, but we should probably distinguish batch from the current streaming job. WDYT?

Nieal-Yang · 2020-12-25T04:00:29Z

Yeah, you are right. I got your point.

Nieal-Yang · 2020-12-29T04:02:19Z

Hi @garyli1019. I think the current implementation is OK, because even in a streaming job we need to accumulate a batch of records in memory during the checkpoint cycle and upsert the data into the Hudi table when the checkpoint triggers. WDYT?
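
To make the buffering argument concrete, here is a minimal plain-Java sketch of the pattern described above: records accumulate in memory and are upserted as one batch when a checkpoint fires. It does not use the Flink or Hudi APIs; `CheckpointBufferedWriter`, `processElement`, and `onCheckpoint` are hypothetical names chosen for illustration, and the in-memory map only stands in for a Hudi table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: buffer records between checkpoints, upsert them when a checkpoint triggers. */
public class CheckpointBufferedWriter {
  // Records accumulated since the last checkpoint, as {key, value} pairs.
  private final List<String[]> buffer = new ArrayList<>();
  // Stand-in for the Hudi table (assumption): record key -> latest value.
  private final Map<String, String> table = new LinkedHashMap<>();

  /** Called once per incoming record; the record is only buffered, not written. */
  public void processElement(String key, String value) {
    buffer.add(new String[] {key, value});
  }

  /** Called when a checkpoint triggers: upserts the buffered batch and returns how many records were flushed. */
  public int onCheckpoint() {
    int flushed = buffer.size();
    for (String[] record : buffer) {
      table.put(record[0], record[1]); // insert a new key or overwrite an existing one
    }
    buffer.clear();
    return flushed;
  }

  /** Number of records waiting for the next checkpoint. */
  public int pending() {
    return buffer.size();
  }

  /** Copy of the current table contents. */
  public Map<String, String> snapshot() {
    return new LinkedHashMap<>(table);
  }

  public static void main(String[] args) {
    CheckpointBufferedWriter writer = new CheckpointBufferedWriter();
    writer.processElement("k1", "v1");
    writer.processElement("k2", "v2");
    writer.processElement("k1", "v1-updated");
    System.out.println("flushed=" + writer.onCheckpoint() + " table=" + writer.snapshot());
  }
}
```

Each checkpoint interval therefore behaves like one small batch, which is why a batch-style index lookup can still be applied inside a streaming job.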

codecov-io · 2020-12-29T09:32:44Z

Codecov Report

Merging #2375 (8424033) into master (17df517) will increase coverage by 0.54%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master    #2375      +/-   ##
============================================
+ Coverage     50.19%   50.74%   +0.54%     
- Complexity     2990     3064      +74     
============================================
  Files           415      419       +4     
  Lines         18439    18797     +358     
  Branches       1885     1922      +37     
============================================
+ Hits           9255     9538     +283     
- Misses         8427     8484      +57     
- Partials        757      775      +18

Flag	Coverage Δ	Complexity Δ
hudicli	`37.26% <ø> (-0.02%)`	`0.00 <ø> (ø)`
hudiclient	`100.00% <ø> (ø)`	`0.00 <ø> (ø)`
hudicommon	`52.04% <ø> (+0.30%)`	`0.00 <ø> (ø)`
hudiflink	`10.20% <ø> (ø)`	`0.00 <ø> (ø)`
hudihadoopmr	`33.06% <ø> (-0.29%)`	`0.00 <ø> (ø)`
hudisparkdatasource	`65.90% <ø> (+3.08%)`	`0.00 <ø> (ø)`
hudisync	`48.61% <ø> (ø)`	`0.00 <ø> (ø)`
huditimelineservice	`66.84% <ø> (+1.54%)`	`0.00 <ø> (ø)`
hudiutilities	`69.43% <ø> (-0.23%)`	`0.00 <ø> (ø)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ	Complexity Δ
...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java	`44.56% <0.00%> (-4.53%)`	`26.00% <0.00%> (ø%)`
...common/table/view/PriorityBasedFileSystemView.java	`94.36% <0.00%> (-2.74%)`	`33.00% <0.00%> (ø%)`
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala	`48.75% <0.00%> (-0.71%)`	`0.00% <0.00%> (ø%)`
...i/common/table/timeline/TimelineMetadataUtils.java	`72.72% <0.00%> (-0.35%)`	`17.00% <0.00%> (ø%)`
...in/java/org/apache/hudi/utilities/UtilHelpers.java	`64.16% <0.00%> (-0.30%)`	`33.00% <0.00%> (+1.00%)`	⬇️
...apache/hudi/utilities/deltastreamer/DeltaSync.java	`70.50% <0.00%> (-0.26%)`	`50.00% <0.00%> (+1.00%)`	⬇️
.../org/apache/hudi/cli/commands/MetadataCommand.java	`1.11% <0.00%> (-0.02%)`	`1.00% <0.00%> (ø%)`
...va/org/apache/hudi/metadata/BaseTableMetadata.java	`0.00% <0.00%> (ø)`	`0.00% <0.00%> (ø%)`
.../org/apache/hudi/metadata/HoodieTableMetadata.java	`0.00% <0.00%> (ø)`	`0.00% <0.00%> (ø%)`
.../apache/hudi/common/table/TableSchemaResolver.java	`0.00% <0.00%> (ø)`	`0.00% <0.00%> (ø%)`
... and 28 more

garyli1019 · 2021-01-04T06:56:49Z

@Nieal-Yang sorry about the late reply. IMO the batch mode will work but does not take full advantage of Flink. But we can optimize this step by step. To ensure this PR works, would you add some unit tests to this PR?

Nieal-Yang · 2021-01-05T03:07:02Z

Okay. This index has been used in production at my company. We will optimize it gradually.

yanghua · 2021-01-06T03:37:43Z

Hi @Nieal-Yang, would you please tell me which Flink version you are using?

Nieal-Yang · 2021-01-06T06:09:49Z

@yanghua Yeah. The version we are using is flink-1.11.2.

garyli1019

@Nieal-Yang thanks for adding the tests. Left some comments about the coding style. Will review another round once fixed.

...t/hudi-flink-client/src/test/java/org/apache/hudi/index/bloom/TestFlinkHoodieBloomIndex.java

.../hudi-flink-client/src/test/java/org/apache/hudi/testutils/FlinkHoodieClientTestHarness.java

...hudi-flink-client/src/test/java/org/apache/hudi/testutils/HoodieFlinkWriteableTestTable.java

garyli1019

@Nieal-Yang Thanks for your contribution! Left some comments. We are very close to landing this.
@wangxianghu could you take a pass as well? Thanks

...hudi-flink-client/src/test/java/org/apache/hudi/testutils/HoodieFlinkWriteableTestTable.java

...t/hudi-flink-client/src/test/java/org/apache/hudi/index/bloom/TestFlinkHoodieBloomIndex.java

...lient/hudi-flink-client/src/main/java/org/apache/hudi/index/bloom/FlinkHoodieBloomIndex.java

...ink-client/src/main/java/org/apache/hudi/index/bloom/HoodieFlinkBloomIndexCheckFunction.java

garyli1019 · 2021-01-10T15:15:10Z

.../hudi-flink-client/src/test/java/org/apache/hudi/testutils/HoodieFlinkClientTestHarness.java

    }
  }

+  @org.junit.jupiter.api.BeforeEach


Let's remove this; we don't have to call this every time.

.../hudi-flink-client/src/test/java/org/apache/hudi/testutils/HoodieFlinkClientTestHarness.java

wangxianghu · 2021-01-12T08:46:30Z

Ack, will review soon.

wangxianghu

@Nieal-Yang sorry for the long delay. I left some comments for you to consider.
This PR will be OK with me once the comments are addressed, thanks.
cc @garyli1019

...lient/hudi-flink-client/src/main/java/org/apache/hudi/index/bloom/FlinkHoodieBloomIndex.java

wangxianghu · 2021-01-19T10:35:34Z

...ink-client/src/main/java/org/apache/hudi/index/bloom/HoodieFlinkBloomIndexCheckFunction.java

+    return null;
+  }
+
+  class LazyKeyCheckIterator extends LazyIterableIterator<Tuple2<String, HoodieKey>, List<KeyLookupResult>> {


Is it possible to make this LazyKeyCheckIterator class an independent one, for code reuse?

Maybe we can move HoodieFlinkBloomIndexCheckFunction into hudi-client-common later, so Spark can reuse it.

Yes, that could be another PR.
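
As a rough illustration of what a standalone, reusable lazy iterator could look like, here is a plain-Java sketch. It is not the PR's `LazyKeyCheckIterator` and does not extend Hudi's `LazyIterableIterator`; `LazyLookupIterator` and the `keysInFile` set are hypothetical, and the set lookup only stands in for a real key check against a file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical standalone lazy iterator: each input is looked up only when next() is called. */
public class LazyLookupIterator<I, O> implements Iterator<O> {
  private final Iterator<I> input;
  private final Function<I, O> lookup;

  public LazyLookupIterator(Iterator<I> input, Function<I, O> lookup) {
    this.input = input;
    this.lookup = lookup;
  }

  @Override
  public boolean hasNext() {
    return input.hasNext();
  }

  @Override
  public O next() {
    if (!input.hasNext()) {
      throw new NoSuchElementException();
    }
    // The lookup runs here, on demand, not when the iterator is constructed.
    return lookup.apply(input.next());
  }

  public static void main(String[] args) {
    // Assumption: a set of keys stands in for the keys stored in one data file.
    Set<String> keysInFile = new HashSet<>(Arrays.asList("k1", "k3"));
    Iterator<Boolean> results = new LazyLookupIterator<String, Boolean>(
        Arrays.asList("k1", "k2", "k3").iterator(), keysInFile::contains);
    while (results.hasNext()) {
      System.out.println(results.next());
    }
  }
}
```

Because the class depends only on an input iterator and a lookup function, nothing in it is engine-specific, which is the property that would let both the Flink and Spark clients share it.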

garyli1019 · 2021-01-19T10:43:34Z

.../hudi-flink-client/src/test/java/org/apache/hudi/testutils/HoodieFlinkClientTestHarness.java

    }
  }

+  @org.junit.jupiter.api.BeforeEach


Sorry if I misled you with this comment. I was trying to say we should remove this BeforeEach here, since we don't have to run this setUp before every test case.

...hudi-flink-client/src/test/java/org/apache/hudi/testutils/HoodieFlinkWriteableTestTable.java

garyli1019

LGTM, thanks for your patience @Nieal-Yang. We can merge if @wangxianghu approves.

wangxianghu

@Nieal-Yang thanks for addressing my concern, LGTM now
cc @garyli1019

…che#2375) * [HUDI] Add bloom index for hudi-flink-client Co-authored-by: yangxiang

…OSS master Summary: [HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (apache#2424) [MINOR] Bumping snapshot version to 0.7.0 (apache#2435) [HUDI-1533] Make SerializableSchema work for large schemas and add ability to sortBy numeric values (apache#2453) [HUDI-1529] Add block size to the FileStatus objects returned from metadata table to avoid too many file splits (apache#2451) [HUDI-1532] Fixed suboptimal implementation of a magic sequence search (apache#2440) [HUDI-1535] Fix 0.7.0 snapshot (apache#2456) [MINOR] Fixing setting defaults for index config (apache#2457) [HUDI-1540] Fixing commons codec shading in spark bundle (apache#2460) [HUDI 1308] Harden RFC-15 Implementation based on production testing (apache#2441) [MINOR] Remove redundant judgments (apache#2466) [MINOR] Fix dataSource cannot use hoodie.datasource.hive_sync.auto_create_database (apache#2444) [MINOR] Disabling problematic tests temporarily to stabilize CI (apache#2468) [MINOR] Make a separate travis CI job for hudi-utilities (apache#2469) [HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (apache#2412) [HUDI-1511] InstantGenerateOperator support multiple parallelism (apache#2434) [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client (apache#2375) [HUDI] Add bloom index for hudi-flink-client [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFlinkStreamer and update docs (apache#2471) [MINOR] Improve code readability,remove the continue keyword (apache#2459) [HOTFIX] Revert upgrade flink verison to 1.12.0 (apache#2473) [HUDI-1453] Fix NPE using HoodieFlinkStreamer to etl data from kafka to hudi (apache#2474) [MINOR] Use skipTests flag for skip.hudi-spark2.unit.tests property (apache#2477) [HUDI-1476] Introduce unit test infra for java client (apache#2478) [MINOR] Update doap with 0.7.0 release (apache#2491) [MINOR]Fix NPE when using HoodieFlinkStreamer with multi 
parallelism (apache#2492) [HUDI-1234] Insert new records to data files without merging for "Insert" operation. (apache#2111) [MINOR] Add Jira URL and Mailing List (apache#2404) [HUDI-1522] Add a new pipeline for Flink writer (apache#2430) [HUDI-1522] Add a new pipeline for Flink writer [HUDI-623] Remove UpgradePayloadFromUberToApache (apache#2455) [HUDI-1555] Remove isEmpty to improve clustering execution performance (apache#2502) [HUDI-1266] Add unit test for validating replacecommit rollback (apache#2418) [MINOR] Quickstart.generateUpdates method add check (apache#2505) [HUDI-1519] Improve minKey/maxKey computation in HoodieHFileWriter (apache#2427) [HUDI-1550] Honor ordering field for MOR Spark datasource reader (apache#2497) [MINOR] Fix method comment typo (apache#2518) [MINOR] Rename FileSystemViewHandler to RequestHandler and corrected the class comment (apache#2458) [HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client (apache#2271) [HUDI-1523] Call mkdir(partition) only if not exists (apache#2501) [HUDI-1538] Try to init class trying different signatures instead of checking its name (apache#2476) [HUDI-1538] Try to init class trying different signatures instead of checking its name. 
[HUDI-1547] CI intermittent failure: TestJsonStringToHoodieRecordMapF… (apache#2521) [MINOR] Fixing the default value for source ordering field for payload config (apache#2516) [HUDI-1420] HoodieTableMetaClient.getMarkerFolderPath works incorrectly on windows client with hdfs server for wrong file seperator (apache#2526) [HUDI-1571] Adding commit_show_records_info to display record sizes for commit (apache#2514) [HUDI-1589] Fix Rollback Metadata AVRO backwards incompatiblity (apache#2543) [MINOR] Fix wrong logic for checking state condition (apache#2524) [HUDI-1557] Make Flink write pipeline write task scalable (apache#2506) [HUDI-1545] Add test cases for INSERT_OVERWRITE Operation (apache#2483) [HUDI-1603] fix DefaultHoodieRecordPayload serialization failure (apache#2556) [MINOR] Fix the wrong comment for HoodieJavaWriteClientExample (apache#2559) [HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431) [HUDI-1612] Fix write test flakiness in StreamWriteITCase (apache#2567) [HUDI-1612] Fix write test flakiness in StreamWriteITCase [MINOR] Default to empty list for unset datadog tags property (apache#2574) [MINOR] Add clustering to feature list (apache#2568) [HUDI-1598] Write as minor batches during one checkpoint interval for the new writer (apache#2553) [HUDI-1109] Support Spark Structured Streaming read from Hudi table (apache#2485) [HUDI-1621] Gets the parallelism from context when init StreamWriteOperatorCoordinator (apache#2579) [HUDI-1381] Schedule compaction based on time elapsed (apache#2260) [HUDI-1582] Throw an exception when syncHoodieTable() fails, with RuntimeException (apache#2536) [HUDI-1539] Fix bug in HoodieCombineRealtimeRecordReader with reading empty iterators (apache#2583) [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (apache#2534) [HUDI-1486] Remove inline inflight rollback in hoodie writer (apache#2359) [HUDI-1586] [Common Core] [Flink Integration] Reduce 
the coupling of hadoop. (apache#2540) [HUDI-1624] The state based index should bootstrap from existing base files (apache#2581) [HUDI-1477] Support copyOnWriteTable in java client (apache#2382) [MINOR] Ensure directory exists before listing all marker files. (apache#2594) [MINOR] hive sync checks for table after creating db if auto create is true (apache#2591) [HUDI-1620] Add azure pipelines configs (apache#2582) [HUDI-1347] Fix Hbase index to make rollback synchronous (via config) (apache#2188) [HUDI-1637] Avoid to rename for bucket update when there is only one flush action during a checkpoint (apache#2599) [HUDI-1638] Some improvements to BucketAssignFunction (apache#2600) [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (apache#2227) [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (apache#2443) [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (apache#2565) [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read. (apache#2584) [HUDI-1632] Supports merge on read write mode for Flink writer (apache#2593) [HUDI-1540] Fixing commons codec dependency in bundle jars (apache#2562) [HUDI-1644] Do not delete older rollback instants as part of rollback. Archival can take care of removing old instants cleanly (apache#2610) [HUDI-1634] Re-bootstrap metadata table when un-synced instants have been archived. (apache#2595) [HUDI-1584] Modify maker file path, which should start with the target base path. (apache#2539) [MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (apache#2617) [HUDI-1553] Configuration and metrics for the TimelineService. 
(apache#2495) [HUDI-1587] Add latency and freshness support (apache#2541) [HUDI-1647] Supports snapshot read for Flink (apache#2613) [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (apache#2611) [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (apache#2621) [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (apache#2596) [HUDI-1660] Excluding compaction and clustering instants from inflight rollback (apache#2631) [HUDI-1661] Exclude clustering commits from getExtraMetadataFromLatest API (apache#2632) [MINOR] Fix import in StreamerUtil.java (apache#2638) [HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (apache#2577) [HUDI-1662] Fix hive date type conversion for mor table (apache#2634) [HUDI-1673] Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex (apache#2642) [MINOR] HoodieClientTestHarness close resources in AfterAll phase (apache#2646) [HUDI-1635] Improvements to Hudi Test Suite (apache#2628) [HUDI-1651] Fix archival of requested replacecommit (apache#2622) [HUDI-1663] Streaming read for Flink MOR table (apache#2640) [HUDI-1678] Row level delete for Flink sink (apache#2659) [HUDI-1664] Avro schema inference for Flink SQL table (apache#2658) [HUDI-1681] Support object storage for Flink writer (apache#2662) [HUDI-1685] keep updating current date for every batch (apache#2671) [HUDI-1496] Fixing input stream detection of GCS FileSystem (apache#2500) [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pacakges for hudi-flink module (apache#2669) [HUDI-1692] Bounded source for stream writer (apache#2674) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. (apache#2494) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. 
[HUDI-1695] Fixed the error messaging (apache#2679) [HUDI 1615] Fixing null schema in bulk_insert row writer path (apache#2653) [HUDI-845] Added locking capability to allow multiple writers (apache#2374) [HUDI-1701] Implement HoodieTableSource.explainSource for all kinds of pushing down (apache#2690) [HUDI-1704] Use PRIMARY KEY syntax to define record keys for Flink Hudi table (apache#2694) [HUDI-1688]hudi write should uncache rdd， when the write operation is finnished (apache#2673) [MINOR] Remove unused var in AbstractHoodieWriteClient (apache#2693) [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator (apache#2627) [HUDI-1705] Flush as per data bucket for mini-batch write (apache#2695) [1568] Fixing spark3 bundles (apache#2625) [HUDI-1650] Custom avro kafka deserializer. (apache#2619) [HUDI-1667]: Fix a null value related bug for spark vectorized reader. (apache#2636) [HUDI-1709] Improving config names and adding hive metastore uri config (apache#2699) [MINOR][DOCUMENT] Update README doc for integ test (apache#2703) [HUDI-1710] Read optimized query type for Flink batch reader (apache#2702) [HUDI-1712] Rename & standardize config to match other configs (apache#2708) [hotfix] Log the error message for creating table source first (apache#2711) [HUDI-1495] Bump Flink version to 1.12.2 (apache#2718) [HUDI-1728] Fix MethodNotFound for HiveMetastore Locks (apache#2731) [HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (apache#2608) [HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink writer (apache#2732) [HOTFIX] close spark session in functional test suite and disable spark3 test for spark2 (apache#2727) [HOTFIX] Disable ITs for Spark3 and scala2.12 (apache#2733) [HOTFIX] fix deploy staging jars script [MINOR] Add Missing Apache License to test files (apache#2736) [UBER] Fixed creation of HoodieMetadataClient which now uses a Builder pattern instead of a constructor. 
Reviewers: balajee, O955 Project Hoodie Project Reviewer: Add blocking reviewers!, PHID-PROJ-pxfpotkfgkanblb3detq! JIRA Issues: HUDI-593 Differential Revision: https://code.uberinternal.com/D5867129

…OSS master Summary: [HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (apache#2424) [MINOR] Bumping snapshot version to 0.7.0 (apache#2435) [HUDI-1533] Make SerializableSchema work for large schemas and add ability to sortBy numeric values (apache#2453) [HUDI-1529] Add block size to the FileStatus objects returned from metadata table to avoid too many file splits (apache#2451) [HUDI-1532] Fixed suboptimal implementation of a magic sequence search (apache#2440) [HUDI-1535] Fix 0.7.0 snapshot (apache#2456) [MINOR] Fixing setting defaults for index config (apache#2457) [HUDI-1540] Fixing commons codec shading in spark bundle (apache#2460) [HUDI 1308] Harden RFC-15 Implementation based on production testing (apache#2441) [MINOR] Remove redundant judgments (apache#2466) [MINOR] Fix dataSource cannot use hoodie.datasource.hive_sync.auto_create_database (apache#2444) [MINOR] Disabling problematic tests temporarily to stabilize CI (apache#2468) [MINOR] Make a separate travis CI job for hudi-utilities (apache#2469) [HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (apache#2412) [HUDI-1511] InstantGenerateOperator support multiple parallelism (apache#2434) [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client (apache#2375) [HUDI] Add bloom index for hudi-flink-client [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFlinkStreamer and update docs (apache#2471) [MINOR] Improve code readability,remove the continue keyword (apache#2459) [HOTFIX] Revert upgrade flink verison to 1.12.0 (apache#2473) [HUDI-1453] Fix NPE using HoodieFlinkStreamer to etl data from kafka to hudi (apache#2474) [MINOR] Use skipTests flag for skip.hudi-spark2.unit.tests property (apache#2477) [HUDI-1476] Introduce unit test infra for java client (apache#2478) [MINOR] Update doap with 0.7.0 release (apache#2491) [MINOR]Fix NPE when using HoodieFlinkStreamer with multi 
parallelism (apache#2492) [HUDI-1234] Insert new records to data files without merging for "Insert" operation. (apache#2111) [MINOR] Add Jira URL and Mailing List (apache#2404) [HUDI-1522] Add a new pipeline for Flink writer (apache#2430) [HUDI-1522] Add a new pipeline for Flink writer [HUDI-623] Remove UpgradePayloadFromUberToApache (apache#2455) [HUDI-1555] Remove isEmpty to improve clustering execution performance (apache#2502) [HUDI-1266] Add unit test for validating replacecommit rollback (apache#2418) [MINOR] Quickstart.generateUpdates method add check (apache#2505) [HUDI-1519] Improve minKey/maxKey computation in HoodieHFileWriter (apache#2427) [HUDI-1550] Honor ordering field for MOR Spark datasource reader (apache#2497) [MINOR] Fix method comment typo (apache#2518) [MINOR] Rename FileSystemViewHandler to RequestHandler and corrected the class comment (apache#2458) [HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client (apache#2271) [HUDI-1523] Call mkdir(partition) only if not exists (apache#2501) [HUDI-1538] Try to init class trying different signatures instead of checking its name (apache#2476) [HUDI-1538] Try to init class trying different signatures instead of checking its name. 
[HUDI-1547] CI intermittent failure: TestJsonStringToHoodieRecordMapF… (apache#2521) [MINOR] Fixing the default value for source ordering field for payload config (apache#2516) [HUDI-1420] HoodieTableMetaClient.getMarkerFolderPath works incorrectly on windows client with hdfs server for wrong file seperator (apache#2526) [HUDI-1571] Adding commit_show_records_info to display record sizes for commit (apache#2514) [HUDI-1589] Fix Rollback Metadata AVRO backwards incompatiblity (apache#2543) [MINOR] Fix wrong logic for checking state condition (apache#2524) [HUDI-1557] Make Flink write pipeline write task scalable (apache#2506) [HUDI-1545] Add test cases for INSERT_OVERWRITE Operation (apache#2483) [HUDI-1603] fix DefaultHoodieRecordPayload serialization failure (apache#2556) [MINOR] Fix the wrong comment for HoodieJavaWriteClientExample (apache#2559) [HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431) [HUDI-1612] Fix write test flakiness in StreamWriteITCase (apache#2567) [HUDI-1612] Fix write test flakiness in StreamWriteITCase [MINOR] Default to empty list for unset datadog tags property (apache#2574) [MINOR] Add clustering to feature list (apache#2568) [HUDI-1598] Write as minor batches during one checkpoint interval for the new writer (apache#2553) [HUDI-1109] Support Spark Structured Streaming read from Hudi table (apache#2485) [HUDI-1621] Gets the parallelism from context when init StreamWriteOperatorCoordinator (apache#2579) [HUDI-1381] Schedule compaction based on time elapsed (apache#2260) [HUDI-1582] Throw an exception when syncHoodieTable() fails, with RuntimeException (apache#2536) [HUDI-1539] Fix bug in HoodieCombineRealtimeRecordReader with reading empty iterators (apache#2583) [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (apache#2534) [HUDI-1486] Remove inline inflight rollback in hoodie writer (apache#2359) [HUDI-1586] [Common Core] [Flink Integration] Reduce 
the coupling of hadoop. (apache#2540) [HUDI-1624] The state based index should bootstrap from existing base files (apache#2581) [HUDI-1477] Support copyOnWriteTable in java client (apache#2382) [MINOR] Ensure directory exists before listing all marker files. (apache#2594) [MINOR] hive sync checks for table after creating db if auto create is true (apache#2591) [HUDI-1620] Add azure pipelines configs (apache#2582) [HUDI-1347] Fix Hbase index to make rollback synchronous (via config) (apache#2188) [HUDI-1637] Avoid to rename for bucket update when there is only one flush action during a checkpoint (apache#2599) [HUDI-1638] Some improvements to BucketAssignFunction (apache#2600) [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (apache#2227) [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (apache#2443) [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (apache#2565) [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read. (apache#2584) [HUDI-1632] Supports merge on read write mode for Flink writer (apache#2593) [HUDI-1540] Fixing commons codec dependency in bundle jars (apache#2562) [HUDI-1644] Do not delete older rollback instants as part of rollback. Archival can take care of removing old instants cleanly (apache#2610) [HUDI-1634] Re-bootstrap metadata table when un-synced instants have been archived. (apache#2595) [HUDI-1584] Modify maker file path, which should start with the target base path. (apache#2539) [MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (apache#2617) [HUDI-1553] Configuration and metrics for the TimelineService. 
(apache#2495) [HUDI-1587] Add latency and freshness support (apache#2541) [HUDI-1647] Supports snapshot read for Flink (apache#2613) [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (apache#2611) [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (apache#2621) [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (apache#2596) [HUDI-1660] Excluding compaction and clustering instants from inflight rollback (apache#2631) [HUDI-1661] Exclude clustering commits from getExtraMetadataFromLatest API (apache#2632) [MINOR] Fix import in StreamerUtil.java (apache#2638) [HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (apache#2577) [HUDI-1662] Fix hive date type conversion for mor table (apache#2634) [HUDI-1673] Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex (apache#2642) [MINOR] HoodieClientTestHarness close resources in AfterAll phase (apache#2646) [HUDI-1635] Improvements to Hudi Test Suite (apache#2628) [HUDI-1651] Fix archival of requested replacecommit (apache#2622) [HUDI-1663] Streaming read for Flink MOR table (apache#2640) [HUDI-1678] Row level delete for Flink sink (apache#2659) [HUDI-1664] Avro schema inference for Flink SQL table (apache#2658) [HUDI-1681] Support object storage for Flink writer (apache#2662) [HUDI-1685] keep updating current date for every batch (apache#2671) [HUDI-1496] Fixing input stream detection of GCS FileSystem (apache#2500) [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pacakges for hudi-flink module (apache#2669) [HUDI-1692] Bounded source for stream writer (apache#2674) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. (apache#2494) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. 
[HUDI-1695] Fixed the error messaging (apache#2679) [HUDI 1615] Fixing null schema in bulk_insert row writer path (apache#2653) [HUDI-845] Added locking capability to allow multiple writers (apache#2374) [HUDI-1701] Implement HoodieTableSource.explainSource for all kinds of pushing down (apache#2690) [HUDI-1704] Use PRIMARY KEY syntax to define record keys for Flink Hudi table (apache#2694) [HUDI-1688]hudi write should uncache rdd， when the write operation is finnished (apache#2673) [MINOR] Remove unused var in AbstractHoodieWriteClient (apache#2693) [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator (apache#2627) [HUDI-1705] Flush as per data bucket for mini-batch write (apache#2695) [1568] Fixing spark3 bundles (apache#2625) [HUDI-1650] Custom avro kafka deserializer. (apache#2619) [HUDI-1667]: Fix a null value related bug for spark vectorized reader. (apache#2636) [HUDI-1709] Improving config names and adding hive metastore uri config (apache#2699) [MINOR][DOCUMENT] Update README doc for integ test (apache#2703) [HUDI-1710] Read optimized query type for Flink batch reader (apache#2702) [HUDI-1712] Rename & standardize config to match other configs (apache#2708) [hotfix] Log the error message for creating table source first (apache#2711) [HUDI-1495] Bump Flink version to 1.12.2 (apache#2718) [HUDI-1728] Fix MethodNotFound for HiveMetastore Locks (apache#2731) [HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (apache#2608) [HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink writer (apache#2732) [HOTFIX] close spark session in functional test suite and disable spark3 test for spark2 (apache#2727) [HOTFIX] Disable ITs for Spark3 and scala2.12 (apache#2733) [HOTFIX] fix deploy staging jars script [MINOR] Add Missing Apache License to test files (apache#2736) [UBER] Fixed creation of HoodieMetadataClient which now uses a Builder pattern instead of a constructor. 
Reviewers: balajee
Reviewed By: balajee
JIRA Issues: HUDI-593
Differential Revision: https://code.uberinternal.com/D5867129

[HUDI] Add bloom index for hudi-flink-client

34558ae

Nieal-Yang changed the title ~~[HUDI] Add bloom index for hudi-flink-client~~ [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client Dec 24, 2020

garyli1019 self-assigned this Dec 24, 2020

garyli1019 self-requested a review December 24, 2020 10:42

garyli1019 reviewed Dec 24, 2020

garyli1019 added the status:in-progress Work in progress label Dec 25, 2020

bug fix

9071025

yangxiang added 4 commits December 29, 2020 14:18

modify for check style

4907100

[hudi-1332] 1、modify for check-style

b72f0a1

[hudi-1332] 1、modify for check-style

df6163d

[hudi-1332] 1、modify for check-style

bf1c47f

yangxiang added 2 commits January 6, 2021 16:35

[hudi-1332] 1、add unit tests

1d7e9f6

[hudi-1332] 1、add unit tests

96d0e06

garyli1019 requested changes Jan 6, 2021

yangxiang added 6 commits January 7, 2021 10:25

Merge branch 'master' into flink-bloom-index

6832177

opt unit tests

500b6c7

[hudi-1332] 1、modify for check-style

7956226

[hudi-1332] 1、modify for check-style

36b92db

[hudi-1332] 1、add unit tests

3ffbc66

Merge branch 'master' into flink-bloom-index

6aa3418

[opt] sync the latest code from master

095fd3d

garyli1019 removed the status:in-progress Work in progress label Jan 10, 2021

garyli1019 requested changes Jan 10, 2021

[hudi-1] 1、modify for check-style 2、add todo comments, etc.

8424033

wangxianghu requested changes Jan 19, 2021

garyli1019 reviewed Jan 19, 2021

yangxiang added 2 commits January 20, 2021 17:08

[opt] 1、opt

4f7a683

[opt] 1、opt doc of HoodieFlinkBloomIndexCheckFunction

1ec2b66

garyli1019 approved these changes Jan 21, 2021

wangxianghu approved these changes Jan 22, 2021

garyli1019 merged commit 641abe8 into apache:master Jan 22, 2021

Nieal-Yang deleted the flink-bloom-index branch March 11, 2021 10:10
