[HUDI-1663] Streaming read for Flink MOR table #2640

danny0405 · 2021-03-06T06:46:41Z

What is the purpose of the pull request

Support streaming read for HUDI Flink MOR table.

codecov-io · 2021-03-06T07:08:33Z

Codecov Report

Merging #2640 (fcf7175) into master (0207323) will decrease coverage by 42.01%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master   #2640       +/-   ##
============================================
- Coverage     51.53%   9.52%   -42.02%     
+ Complexity     3491      48     -3443     
============================================
  Files           462      53      -409     
  Lines         21881    1963    -19918     
  Branches       2327     235     -2092     
============================================
- Hits          11277     187    -11090     
+ Misses         9624    1763     -7861     
+ Partials        980      13      -967

Flag	Coverage Δ	Complexity Δ
hudicli	`?`	`?`
hudiclient	`?`	`?`
hudicommon	`?`	`?`
hudiflink	`?`	`?`
hudihadoopmr	`?`	`?`
hudisparkdatasource	`?`	`?`
hudisync	`?`	`?`
huditimelineservice	`?`	`?`
hudiutilities	`9.52% <ø> (-59.96%)`	`0.00 <ø> (ø)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ	Complexity Δ
...va/org/apache/hudi/utilities/IdentitySplitter.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-2.00%)`
...va/org/apache/hudi/utilities/schema/SchemaSet.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-3.00%)`
...a/org/apache/hudi/utilities/sources/RowSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-4.00%)`
.../org/apache/hudi/utilities/sources/AvroSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-1.00%)`
.../org/apache/hudi/utilities/sources/JsonSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-1.00%)`
...rg/apache/hudi/utilities/sources/CsvDFSSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-10.00%)`
...g/apache/hudi/utilities/sources/JsonDFSSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-4.00%)`
...apache/hudi/utilities/sources/JsonKafkaSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-6.00%)`
...pache/hudi/utilities/sources/ParquetDFSSource.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-5.00%)`
...lities/schema/SchemaProviderWithPostProcessor.java	`0.00% <0.00%> (-100.00%)`	`0.00% <0.00%> (-4.00%)`
... and 429 more

yanghua

@danny0405 Left some comments.

hudi-flink/src/test/java/org/apache/hudi/utils/TestUtils.java

hudi-flink/src/test/java/org/apache/hudi/source/TestStreamReadOperator.java

yanghua · 2021-03-08T14:50:23Z

hudi-flink/src/test/java/org/apache/hudi/source/TestStreamReadOperator.java

IMO, a for-loop can implement it.

yanghua · 2021-03-08T14:58:35Z

hudi-flink/src/test/java/org/apache/hudi/source/TestStreamReadMonitoringFunction.java

Do we have the same method in TestStreamReadOperator?

Move to TestUtils.

yanghua · 2021-03-09T02:36:58Z

hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java

I still think the old one is better. We should have a single source of truth. Defining a new one adds maintenance cost.

Not true, because HoodieTableType should never change.

I am not sure whether the definition will ever change. But if we have a single source of truth, it is enough to reference it wherever we need it. IMO, there is no need to define it more than once. Otherwise, will others define it again in the Java or Spark modules, since we allowed this in the Flink module? That adds complexity and makes the code harder to understand.

Spark did have a definition, see DataSourceOptions.COW_TABLE_TYPE_OPT_VAL and DataSourceOptions.MOR_TABLE_TYPE_OPT_VAL.

IMO, it's different. Spark references HoodieTableType.COPY_ON_WRITE.name, but you used a string literal. By "define" I mean using a literal directly. If we change the value of HoodieTableType in the future, we would not need to search for string literals in DataSourceOptions, and we could control all the change points.

HoodieTableType is an enumeration, but in many cases we need a string constant rather than a method call like HoodieTableType.name(), for example in a JUnit 5 parameterized test.

BTW, can we not block this PR on this, even though these two options are introduced in this PR? I think we can discuss it in a separate issue; it is not critical.

We are not blocking the PR over this divergence. But why don't we follow the approach in DataSourceOptions?

We can do that.
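
For readers following the thread, the trade-off above can be sketched in plain Java. This is a minimal illustration, not code from this PR: `TableType` is a hypothetical stand-in for `HoodieTableType`, and `TableTypeOptions`, `COW`, `MOR`, and `parse` are made-up names. In Java, `SomeEnum.X.name()` is not a compile-time constant, so it cannot be used where the language requires one (annotation values, for instance), which is why string literals are sometimes wanted. A static guard is one possible way to keep such literals tied to the enum so they cannot silently drift.

```java
import java.util.Locale;

// Hypothetical stand-in for HoodieTableType; names are for illustration only.
enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

// Hypothetical options holder, not the FlinkOptions class from the PR.
final class TableTypeOptions {
    // String literals are compile-time constants, so they can be used in
    // annotation values; TableType.COPY_ON_WRITE.name() cannot.
    static final String COW = "COPY_ON_WRITE";
    static final String MOR = "MERGE_ON_READ";

    // Guard: valueOf throws IllegalArgumentException if a literal no longer
    // matches an enum constant, so this class fails to load after a rename
    // instead of carrying a stale string.
    static {
        TableType.valueOf(COW);
        TableType.valueOf(MOR);
    }

    private TableTypeOptions() {
    }

    // Parses a user-supplied option value back into the enum.
    static TableType parse(String value) {
        return TableType.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

With this shape the enum remains the single source of truth, while tests and option defaults can still refer to `TableTypeOptions.COW` and `TableTypeOptions.MOR` as constants.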

yanghua · 2021-03-09T02:43:47Z

hudi-flink/src/main/java/org/apache/hudi/source/format/mor/InstantRange.java

LESSER_THAN_OR_EQUALS ?

yanghua · 2021-03-09T02:43:58Z

hudi-flink/src/main/java/org/apache/hudi/source/format/mor/InstantRange.java

yanghua

Left some comments.

yanghua · 2021-03-09T08:41:53Z

hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java

It would be better to add a READ_ prefix.

yanghua · 2021-03-09T08:42:02Z

hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java