[HUDI-1647] Supports snapshot read for Flink #2613

danny0405 · 2021-03-01T08:10:53Z

What is the purpose of the pull request

COW: the parquet files for the latest file group slices
MOR: the parquet base file + log files for the latest file group slices
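As a rough illustration of the two cases above (a sketch only, not the reader code in this PR): a COW snapshot can be served from the base file alone, while a MOR snapshot layers the log records over the base file by record key. The class and method names below are hypothetical, records are modeled as plain key/value strings, and "later log record wins" is an assumed merge rule.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and the merge rule are assumptions, not Hudi APIs.
final class SnapshotReadSketch {

  /** COW: the latest base (parquet) file already holds the full snapshot. */
  static Map<String, String> readCow(Map<String, String> baseRecords) {
    return new LinkedHashMap<>(baseRecords);
  }

  /** MOR: start from the base file, then apply log records in write order. */
  static Map<String, String> readMor(Map<String, String> baseRecords,
                                     List<Map.Entry<String, String>> logRecords) {
    Map<String, String> merged = new LinkedHashMap<>(baseRecords);
    for (Map.Entry<String, String> logRecord : logRecords) {
      // Assumed rule: a later log record replaces the value for its key.
      merged.put(logRecord.getKey(), logRecord.getValue());
    }
    return merged;
  }
}
```

The real readers work on parquet and Avro log blocks and delegate merging to the configured payload class; the map here only stands in for "latest value per record key".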

Also implements the SQL connectors for both sink and source.

Brief change log

Add input formats for both COW (parquet) and MOR (parquet + log)
Add table factory and table source/sink for Flink

Verify this pull request

Added UTs and ITs.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

codecov-io · 2021-03-01T10:06:54Z

Codecov Report

Merging #2613 (0d77db7) into master (7a11de1) will decrease coverage by 0.15%.
The diff coverage is 47.23%.

@@             Coverage Diff              @@
##             master    #2613      +/-   ##
============================================
- Coverage     51.56%   51.41%   -0.16%     
- Complexity     3286     3480     +194     
============================================
  Files           445      461      +16     
  Lines         20328    21678    +1350     
  Branches       2102     2299     +197     
============================================
+ Hits          10483    11146     +663     
- Misses         8978     9574     +596     
- Partials        867      958      +91

Flag	Coverage Δ	Complexity Δ
hudicli	`36.87% <ø> (ø)`	`0.00 <ø> (ø)`
hudiclient	`100.00% <ø> (ø)`	`0.00 <ø> (ø)`
hudicommon	`51.32% <ø> (-0.01%)`	`0.00 <ø> (ø)`
hudiflink	`50.31% <47.23%> (-1.09%)`	`0.00 <188.00> (ø)`
hudihadoopmr	`33.16% <ø> (ø)`	`0.00 <ø> (ø)`
hudisparkdatasource	`69.71% <ø> (ø)`	`0.00 <ø> (ø)`
hudisync	`49.62% <ø> (ø)`	`0.00 <ø> (ø)`
huditimelineservice	`66.49% <ø> (ø)`	`0.00 <ø> (ø)`
hudiutilities	`69.59% <ø> (+0.05%)`	`0.00 <ø> (ø)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ	Complexity Δ
...va/org/apache/hudi/factory/HoodieTableFactory.java	`0.00% <0.00%> (ø)`	`0.00 <0.00> (?)`
.../org/apache/hudi/operator/StreamWriteFunction.java	`84.00% <0.00%> (-2.60%)`	`22.00 <0.00> (ø)`
.../org/apache/hudi/operator/StreamWriteOperator.java	`0.00% <0.00%> (ø)`	`0.00 <0.00> (ø)`
...ache/hudi/operator/StreamWriteOperatorFactory.java	`0.00% <0.00%> (ø)`	`0.00 <0.00> (ø)`
...udi/operator/partitioner/BucketAssignFunction.java	`85.86% <ø> (+0.92%)`	`22.00 <0.00> (+1.00)`
...ain/java/org/apache/hudi/sink/HoodieTableSink.java	`0.00% <0.00%> (ø)`	`0.00 <0.00> (?)`
...e/hudi/source/format/cow/ParquetDecimalVector.java	`0.00% <0.00%> (ø)`	`0.00 <0.00> (?)`
.../org/apache/hudi/util/AvroToRowDataConverters.java	`16.94% <16.94%> (ø)`	`12.00 <12.00> (?)`
...hudi/source/format/cow/ParquetSplitReaderUtil.java	`20.00% <20.00%> (ø)`	`13.00 <13.00> (?)`
...java/org/apache/hudi/util/AvroSchemaConverter.java	`23.74% <25.97%> (+23.74%)`	`13.00 <7.00> (+13.00)`
... and 39 more

yanghua

Busy now, @danny0405 I left some comments after I viewed some files yesterday.

yanghua · 2021-03-01T09:07:46Z

hudi-flink/src/main/java/org/apache/hudi/factory/HoodieTableFactory.java

This class provides the functions for creating both the source and the sink. Shall we change this comment to "Hoodie table factory"?

yanghua · 2021-03-01T09:12:44Z

hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

"define the payload class" seems to make the semantics unclear. What about replacing it with "define the merge type"?

yanghua · 2021-03-01T09:49:05Z

hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

Can we ignore - in the config key?

yanghua · 2021-03-01T11:37:58Z

hudi-flink/src/main/java/org/apache/hudi/sink/HoodieTableSink.java

duplicated comment?

yanghua · 2021-03-01T11:39:48Z

hudi-flink/src/main/java/org/apache/hudi/source/HoodieTableSource.java

IMO, it's not necessary to break the line here.

yanghua · 2021-03-01T11:40:56Z

hudi-flink/src/main/java/org/apache/hudi/source/HoodieTableSource.java

Let us extract the -1 to be a more readable constant.
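A minimal sketch of the suggestion, assuming the -1 acts as a sentinel. The thread does not say what the value stands for, so the name `NO_LIMIT` and the helper method are hypothetical.

```java
// Hypothetical sketch: replace a bare -1 with a named sentinel constant.
final class SentinelConstantSketch {

  // Assumed meaning; the review thread does not state what the -1 represents.
  static final long NO_LIMIT = -1L;

  /** Returns true when a real limit was supplied instead of the sentinel. */
  static boolean hasLimit(long limit) {
    return limit != NO_LIMIT;
  }
}
```

Call sites then read as `hasLimit(limit)` or `limit == NO_LIMIT` rather than comparing against a magic number.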

yanghua · 2021-03-01T11:49:42Z

hudi-flink/src/main/java/org/apache/hudi/source/HoodieTableSource.java

Why not use the enum?

I found that a constant in FlinkOptions is more friendly to use.

yanghua · 2021-03-01T11:53:03Z

hudi-flink/src/main/java/org/apache/hudi/source/format/FilePathUtils.java

You mean org.apache.flink.table.utils.PartitionPathUtils? If yes, let us add the package name?

yanghua · 2021-03-01T11:57:33Z

hudi-flink/src/main/java/org/apache/hudi/source/format/FilePathUtils.java

Does this pattern match the Hive partition style? If yes, it would be better to rename it to HIVE_STYLE_PARTITION_PATTERN.
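For context, Hive-style partition paths encode each partition column as a `key=value` directory segment. The sketch below shows one way such a pattern could be applied; the regex and the `parse` helper are illustrative assumptions and are not the pattern defined in FilePathUtils.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the regex is an assumption, not the one in FilePathUtils.
final class HivePartitionSketch {

  // Matches a single path segment of the form key=value.
  static final Pattern HIVE_STYLE_PARTITION_PATTERN = Pattern.compile("([^/=]+)=([^/]+)");

  /** Collects key=value segments of a path in order; other segments are ignored. */
  static Map<String, String> parse(String path) {
    Map<String, String> spec = new LinkedHashMap<>();
    for (String segment : path.split("/")) {
      Matcher matcher = HIVE_STYLE_PARTITION_PATTERN.matcher(segment);
      if (matcher.matches()) {
        spec.put(matcher.group(1), matcher.group(2));
      }
    }
    return spec;
  }
}
```

A path such as `/tmp/table/dt=2021-03-01/hr=08` yields the two entries `dt` and `hr`, while a non-Hive-style path such as `/tmp/table/2021/03` yields none.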

yanghua · 2021-03-02T10:09:53Z

hudi-flink/src/main/java/org/apache/hudi/source/format/cow/AbstractColumnReader.java

Can you refactor all the changes to follow the unified code style about the method parameters?

No, I would rather keep it as is for now because it is copied code.

But compare it with the readBatch above this method: why do different methods follow different styles in the same class file? If we keep the code in Hudi's codebase for a long time, we should not use different code styles.

I think it's fine and easier for future upgrades.

garyli1019 · 2021-03-02T02:24:32Z

hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

can we link the String to HoodieTableType?

No, I think a constant here is easier to use.
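The two positions in this exchange can be put side by side in a small sketch. The nested `TableType` enum is a stand-in for Hudi's `HoodieTableType` (which has the values COPY_ON_WRITE and MERGE_ON_READ) so that the example compiles on its own; the constant names and the `parse` helper are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch contrasting plain string constants with enum-derived ones.
final class TableTypeOptionSketch {

  // Stand-in for org.apache.hudi.common.model.HoodieTableType.
  enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

  // Plain constant: simple to reference from option definitions.
  static final String TABLE_TYPE_COPY_ON_WRITE = "COPY_ON_WRITE";

  // Linked to the enum: cannot drift if the enum value is renamed.
  static final String TABLE_TYPE_MERGE_ON_READ = TableType.MERGE_ON_READ.name();

  /** Converts a user-supplied option value back to the enum. */
  static TableType parse(String value) {
    return TableType.valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}
```

Either way the option value stays a String in the config; the difference is only whether the literal is written by hand or derived from the enum.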

garyli1019 · 2021-03-02T02:32:40Z

hudi-flink/src/main/java/org/apache/hudi/source/HoodieTableSource.java

Does the user have to define the schema when using this? In Spark, we use the TableSchemaResolver to get the latest schema from the FileSystem.

HUDI flink also uses TableSchemaResolver, see HoodieTableSource.getInputFormat.

garyli1019 · 2021-03-02T11:56:43Z

hudi-flink/src/main/java/org/apache/hudi/sink/HoodieTableSink.java

Is this for the batch job?

For both batch and streaming.

garyli1019 · 2021-03-02T11:59:40Z

hudi-flink/src/main/java/org/apache/hudi/source/HoodieTableSource.java

how is this possible?

Flink HUDI can index log files (say it's global index), thus, the Flink writer can write all log files for MOR table.

garyli1019 · 2021-03-02T12:05:19Z

hudi-flink/src/main/java/org/apache/hudi/source/format/FilePathUtils.java

return hadoop FileStatus?

garyli1019 · 2021-03-02T12:05:48Z

hudi-flink/src/main/java/org/apache/hudi/source/format/FilePathUtils.java

how about getHadoopFileStatusRecursively

garyli1019 · 2021-03-02T12:09:11Z

hudi-flink/src/main/java/org/apache/hudi/source/format/FormatUtils.java

duplicate with spark code, maybe we can move this to a shared place?

Yes, we can promote it in the future; I'm not planning to do it in this PR.

garyli1019 · 2021-03-02T12:11:22Z

hudi-flink/src/main/java/org/apache/hudi/source/format/cow/AbstractColumnReader.java

Why do we need a separate reader? Does Flink have its own one for parquet?

Flink has its own.

garyli1019 · 2021-03-02T12:12:28Z

hudi-flink/src/main/java/org/apache/hudi/source/format/cow/CopyOnWriteInputFormat.java

Why not extend ParquetInputFormat?

In order to support the TIMESTAMP INT64 type, which HUDI uses by default. I have explained this in the doc.
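To make the INT64 timestamp point concrete, here is a minimal sketch of decoding a 64-bit epoch value into a `java.time.Instant`. Whether the stored unit is milliseconds or microseconds depends on the writer and is not stated in this thread, so both variants are shown; the class and method names are hypothetical and are not part of the copied reader classes.

```java
import java.time.Instant;

// Hypothetical sketch: decoding an INT64-encoded timestamp. The unit is an assumption.
final class Int64TimestampSketch {

  /** Interprets the value as milliseconds since the epoch. */
  static Instant fromEpochMillis(long millis) {
    return Instant.ofEpochMilli(millis);
  }

  /** Interprets the value as microseconds since the epoch. */
  static Instant fromEpochMicros(long micros) {
    // floorDiv/floorMod keep the nanosecond part non-negative for pre-epoch values.
    long seconds = Math.floorDiv(micros, 1_000_000L);
    long nanos = Math.floorMod(micros, 1_000_000L) * 1_000L;
    return Instant.ofEpochSecond(seconds, nanos);
  }
}
```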

garyli1019 · 2021-03-02T12:16:16Z

hudi-flink/src/main/java/org/apache/hudi/source/format/cow/ParquetColumnarRowSplitReader.java

IMO we put too much dependency on the File Reader. It will be hard to maintain over time. I am more inclined to reuse the reader from the compute engine side. WDYT?

Copying the code is not ideal, but it is still better than adding the flink-parquet dependency and shading all of its classes, which would cause conflicts and confuse users (you cannot prevent people from having the original flink-parquet jar on the classpath).

By copying several classes (basically unchanged over time), we avoid that confusion.

wanna bring @vinothchandar into this discussion, we had this discussion before on the Spark side. Whether we should add the Engine's FileInputFormat and FileReader into Hudi's codebase. Both sides have pros and cons. @vinothchandar WDYT?

yanghua · 2021-03-03T09:36:36Z

@danny0405 Can you give a checklist about which files are copied from the other projects?

danny0405 · 2021-03-03T09:46:10Z

Sure

AbstractColumnReader
CopyOnWriteInputFormat
ParquetColumnarRowSplitReader
ParquetDecimalVector
ParquetSplitReaderUtil
RunLengthDecoder

garyli1019

Hi @danny0405, left some minor comments. Regarding whether we should use the FileFormat from the engine side, we can wait for comments from others, but I don't wanna block your progress. Changing it later should be fine.
Can we add Flink to https://github.com/apache/hudi/blob/master/NOTICE about the copied code?

garyli1019 · 2021-03-04T08:38:06Z

hudi-flink/src/main/resources/META-INF/services/org.apache.flink.table.factories.TableFactory

+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+org.apache.hudi.factory.HoodieTableFactory


is this file necessary?

Required for the Java SPI service.
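For readers unfamiliar with the Java SPI mechanism: `java.util.ServiceLoader` discovers implementations by reading files under `META-INF/services/` named after the service interface, which is why the registration file is needed. The sketch below uses a made-up `DemoFactory` interface rather than Flink's `TableFactory` so that it compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical sketch of SPI discovery; DemoFactory is not a Flink or Hudi type.
final class SpiLookupSketch {

  /** A made-up service interface standing in for a table factory. */
  public interface DemoFactory {
    String identifier();
  }

  /** Lists the identifiers of every implementation registered on the classpath. */
  static List<String> discover() {
    List<String> identifiers = new ArrayList<>();
    // ServiceLoader reads META-INF/services/<binary name of DemoFactory>.
    for (DemoFactory factory : ServiceLoader.load(DemoFactory.class)) {
      identifiers.add(factory.identifier());
    }
    return identifiers;
  }
}
```

Without a matching `META-INF/services` entry the loop finds nothing, which mirrors what would happen to HoodieTableFactory if the registration file were dropped.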

garyli1019 · 2021-03-04T08:38:21Z

hudi-flink/src/test/resources/META-INF/services/org.apache.flink.table.factories.TableFactory

+# limitations under the License.
+
+org.apache.hudi.factory.HoodieTableFactory
+org.apache.hudi.utils.factory.ContinuousFileSourceFactory


Required for the Java SPI service.

garyli1019 · 2021-03-04T08:41:20Z

hudi-flink/src/test/java/org/apache/hudi/source/HoodieDataSourceITCase.java

+  File tempFile;
+
+  @Test
+  void testStreamWriteBatchRead() {


Is it possible to add more test cases to cover COW/MOR (without compaction)/MOR (with compaction) and query?

yanghua

LGTM, let's do a careful review after all the work is done.

…OSS master Summary: [HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (apache#2424) [MINOR] Bumping snapshot version to 0.7.0 (apache#2435) [HUDI-1533] Make SerializableSchema work for large schemas and add ability to sortBy numeric values (apache#2453) [HUDI-1529] Add block size to the FileStatus objects returned from metadata table to avoid too many file splits (apache#2451) [HUDI-1532] Fixed suboptimal implementation of a magic sequence search (apache#2440) [HUDI-1535] Fix 0.7.0 snapshot (apache#2456) [MINOR] Fixing setting defaults for index config (apache#2457) [HUDI-1540] Fixing commons codec shading in spark bundle (apache#2460) [HUDI 1308] Harden RFC-15 Implementation based on production testing (apache#2441) [MINOR] Remove redundant judgments (apache#2466) [MINOR] Fix dataSource cannot use hoodie.datasource.hive_sync.auto_create_database (apache#2444) [MINOR] Disabling problematic tests temporarily to stabilize CI (apache#2468) [MINOR] Make a separate travis CI job for hudi-utilities (apache#2469) [HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (apache#2412) [HUDI-1511] InstantGenerateOperator support multiple parallelism (apache#2434) [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client (apache#2375) [HUDI] Add bloom index for hudi-flink-client [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFlinkStreamer and update docs (apache#2471) [MINOR] Improve code readability,remove the continue keyword (apache#2459) [HOTFIX] Revert upgrade flink verison to 1.12.0 (apache#2473) [HUDI-1453] Fix NPE using HoodieFlinkStreamer to etl data from kafka to hudi (apache#2474) [MINOR] Use skipTests flag for skip.hudi-spark2.unit.tests property (apache#2477) [HUDI-1476] Introduce unit test infra for java client (apache#2478) [MINOR] Update doap with 0.7.0 release (apache#2491) [MINOR]Fix NPE when using HoodieFlinkStreamer with multi 
parallelism (apache#2492) [HUDI-1234] Insert new records to data files without merging for "Insert" operation. (apache#2111) [MINOR] Add Jira URL and Mailing List (apache#2404) [HUDI-1522] Add a new pipeline for Flink writer (apache#2430) [HUDI-1522] Add a new pipeline for Flink writer [HUDI-623] Remove UpgradePayloadFromUberToApache (apache#2455) [HUDI-1555] Remove isEmpty to improve clustering execution performance (apache#2502) [HUDI-1266] Add unit test for validating replacecommit rollback (apache#2418) [MINOR] Quickstart.generateUpdates method add check (apache#2505) [HUDI-1519] Improve minKey/maxKey computation in HoodieHFileWriter (apache#2427) [HUDI-1550] Honor ordering field for MOR Spark datasource reader (apache#2497) [MINOR] Fix method comment typo (apache#2518) [MINOR] Rename FileSystemViewHandler to RequestHandler and corrected the class comment (apache#2458) [HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client (apache#2271) [HUDI-1523] Call mkdir(partition) only if not exists (apache#2501) [HUDI-1538] Try to init class trying different signatures instead of checking its name (apache#2476) [HUDI-1538] Try to init class trying different signatures instead of checking its name. 
[HUDI-1547] CI intermittent failure: TestJsonStringToHoodieRecordMapF… (apache#2521) [MINOR] Fixing the default value for source ordering field for payload config (apache#2516) [HUDI-1420] HoodieTableMetaClient.getMarkerFolderPath works incorrectly on windows client with hdfs server for wrong file seperator (apache#2526) [HUDI-1571] Adding commit_show_records_info to display record sizes for commit (apache#2514) [HUDI-1589] Fix Rollback Metadata AVRO backwards incompatiblity (apache#2543) [MINOR] Fix wrong logic for checking state condition (apache#2524) [HUDI-1557] Make Flink write pipeline write task scalable (apache#2506) [HUDI-1545] Add test cases for INSERT_OVERWRITE Operation (apache#2483) [HUDI-1603] fix DefaultHoodieRecordPayload serialization failure (apache#2556) [MINOR] Fix the wrong comment for HoodieJavaWriteClientExample (apache#2559) [HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431) [HUDI-1612] Fix write test flakiness in StreamWriteITCase (apache#2567) [HUDI-1612] Fix write test flakiness in StreamWriteITCase [MINOR] Default to empty list for unset datadog tags property (apache#2574) [MINOR] Add clustering to feature list (apache#2568) [HUDI-1598] Write as minor batches during one checkpoint interval for the new writer (apache#2553) [HUDI-1109] Support Spark Structured Streaming read from Hudi table (apache#2485) [HUDI-1621] Gets the parallelism from context when init StreamWriteOperatorCoordinator (apache#2579) [HUDI-1381] Schedule compaction based on time elapsed (apache#2260) [HUDI-1582] Throw an exception when syncHoodieTable() fails, with RuntimeException (apache#2536) [HUDI-1539] Fix bug in HoodieCombineRealtimeRecordReader with reading empty iterators (apache#2583) [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (apache#2534) [HUDI-1486] Remove inline inflight rollback in hoodie writer (apache#2359) [HUDI-1586] [Common Core] [Flink Integration] Reduce 
the coupling of hadoop. (apache#2540) [HUDI-1624] The state based index should bootstrap from existing base files (apache#2581) [HUDI-1477] Support copyOnWriteTable in java client (apache#2382) [MINOR] Ensure directory exists before listing all marker files. (apache#2594) [MINOR] hive sync checks for table after creating db if auto create is true (apache#2591) [HUDI-1620] Add azure pipelines configs (apache#2582) [HUDI-1347] Fix Hbase index to make rollback synchronous (via config) (apache#2188) [HUDI-1637] Avoid to rename for bucket update when there is only one flush action during a checkpoint (apache#2599) [HUDI-1638] Some improvements to BucketAssignFunction (apache#2600) [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (apache#2227) [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (apache#2443) [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (apache#2565) [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read. (apache#2584) [HUDI-1632] Supports merge on read write mode for Flink writer (apache#2593) [HUDI-1540] Fixing commons codec dependency in bundle jars (apache#2562) [HUDI-1644] Do not delete older rollback instants as part of rollback. Archival can take care of removing old instants cleanly (apache#2610) [HUDI-1634] Re-bootstrap metadata table when un-synced instants have been archived. (apache#2595) [HUDI-1584] Modify maker file path, which should start with the target base path. (apache#2539) [MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (apache#2617) [HUDI-1553] Configuration and metrics for the TimelineService. 
(apache#2495) [HUDI-1587] Add latency and freshness support (apache#2541) [HUDI-1647] Supports snapshot read for Flink (apache#2613) [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (apache#2611) [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (apache#2621) [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (apache#2596) [HUDI-1660] Excluding compaction and clustering instants from inflight rollback (apache#2631) [HUDI-1661] Exclude clustering commits from getExtraMetadataFromLatest API (apache#2632) [MINOR] Fix import in StreamerUtil.java (apache#2638) [HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (apache#2577) [HUDI-1662] Fix hive date type conversion for mor table (apache#2634) [HUDI-1673] Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex (apache#2642) [MINOR] HoodieClientTestHarness close resources in AfterAll phase (apache#2646) [HUDI-1635] Improvements to Hudi Test Suite (apache#2628) [HUDI-1651] Fix archival of requested replacecommit (apache#2622) [HUDI-1663] Streaming read for Flink MOR table (apache#2640) [HUDI-1678] Row level delete for Flink sink (apache#2659) [HUDI-1664] Avro schema inference for Flink SQL table (apache#2658) [HUDI-1681] Support object storage for Flink writer (apache#2662) [HUDI-1685] keep updating current date for every batch (apache#2671) [HUDI-1496] Fixing input stream detection of GCS FileSystem (apache#2500) [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pacakges for hudi-flink module (apache#2669) [HUDI-1692] Bounded source for stream writer (apache#2674) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. (apache#2494) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. 
[HUDI-1695] Fixed the error messaging (apache#2679) [HUDI 1615] Fixing null schema in bulk_insert row writer path (apache#2653) [HUDI-845] Added locking capability to allow multiple writers (apache#2374) [HUDI-1701] Implement HoodieTableSource.explainSource for all kinds of pushing down (apache#2690) [HUDI-1704] Use PRIMARY KEY syntax to define record keys for Flink Hudi table (apache#2694) [HUDI-1688]hudi write should uncache rdd， when the write operation is finnished (apache#2673) [MINOR] Remove unused var in AbstractHoodieWriteClient (apache#2693) [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator (apache#2627) [HUDI-1705] Flush as per data bucket for mini-batch write (apache#2695) [1568] Fixing spark3 bundles (apache#2625) [HUDI-1650] Custom avro kafka deserializer. (apache#2619) [HUDI-1667]: Fix a null value related bug for spark vectorized reader. (apache#2636) [HUDI-1709] Improving config names and adding hive metastore uri config (apache#2699) [MINOR][DOCUMENT] Update README doc for integ test (apache#2703) [HUDI-1710] Read optimized query type for Flink batch reader (apache#2702) [HUDI-1712] Rename & standardize config to match other configs (apache#2708) [hotfix] Log the error message for creating table source first (apache#2711) [HUDI-1495] Bump Flink version to 1.12.2 (apache#2718) [HUDI-1728] Fix MethodNotFound for HiveMetastore Locks (apache#2731) [HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (apache#2608) [HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink writer (apache#2732) [HOTFIX] close spark session in functional test suite and disable spark3 test for spark2 (apache#2727) [HOTFIX] Disable ITs for Spark3 and scala2.12 (apache#2733) [HOTFIX] fix deploy staging jars script [MINOR] Add Missing Apache License to test files (apache#2736) [UBER] Fixed creation of HoodieMetadataClient which now uses a Builder pattern instead of a constructor. 
Reviewers: balajee, O955 Project Hoodie Project Reviewer: Add blocking reviewers!, PHID-PROJ-pxfpotkfgkanblb3detq! JIRA Issues: HUDI-593 Differential Revision: https://code.uberinternal.com/D5867129

danny0405 force-pushed the HUDI-1647 branch from 1ccb37d to 72370ed Compare March 1, 2021 08:20

yanghua self-assigned this Mar 1, 2021

danny0405 force-pushed the HUDI-1647 branch from 72370ed to 99a6c1f Compare March 1, 2021 09:26

danny0405 force-pushed the HUDI-1647 branch from 99a6c1f to 54025e3 Compare March 1, 2021 12:05

yanghua reviewed Mar 2, 2021

danny0405 force-pushed the HUDI-1647 branch from 54025e3 to c0a3083 Compare March 2, 2021 03:21

yanghua reviewed Mar 2, 2021

garyli1019 reviewed Mar 2, 2021

[HUDI-1647] Supports snapshot read for Flink

0d77db7

* COW: the parquet files for the latest file group slices
* MOR: the parquet base file + log files for the latest file group slices

Also implements the SQL connectors for both sink and source.

danny0405 force-pushed the HUDI-1647 branch from c0a3083 to 0d77db7 Compare March 3, 2021 04:35

garyli1019 approved these changes Mar 4, 2021

yanghua approved these changes Mar 5, 2021

yanghua merged commit 89003bc into apache:master Mar 5, 2021

prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021

[HUDI-1647] Supports snapshot read for Flink (apache#2613)

22128a6
