
Conversation

@stevenzwu (Contributor) commented Jan 18, 2021

Scope of the first version:

  • simple split assigner (no ordering or locality awareness)
  • support both batch and streaming reads

This is the uber PR, kept for reference to the complete context. Smaller PRs will be submitted for code review:

  1. [MERGED] Refactor Flink tests so that the new source implementation can reuse them (Flink: Refactor flink source tests for FLIP-27 unified source. #2047)
  2. [MERGED] Upgrade Flink version to 1.12.1 (Flink: Upgrade version from 1.11.0 to 1.12.1 #1956)
  3. [PENDING] FLIP-27 Iceberg source split (Flink: FLIP-27 Iceberg source split #3501)
  4. SimpleSplitAssigner (TBD). Note that other assigners will be added after this work is completed.
  5. Split enumerator (TBD)
  6. IcebergSource, where everything is put together (TBD)

The new IcebergSource will be marked as @Experimental while the FLIP-27 source matures and we make it production ready.

Here is the design doc that my colleague (@sundargates) and I created, as mentioned in #1626.

@openinx requested review from openinx and removed the review request for openinx (January 19, 2021 06:41)
@stevenzwu force-pushed the flip27IcebergSource branch from 569f46c to 5c485c9 (January 25, 2021 19:22)
return inputFiles.get(location);
}

public void seek(CheckpointedPosition checkpointedPosition) {
Member:

Currently, we put those two levels of iteration inside a single DataIterator, which makes the code a bit complex to read and understand. I'd prefer to split this into two different iterators (see the sketch after this list):

  1. FileRecordIterator, which will seek to the provided row offset and then continue reading the following records.
  2. CombinedTaskRecordIterator, which will hold multiple FileRecordIterators; it will locate the currently open FileRecordIterator and seek to the given row offset to read the following records.
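
A minimal sketch of that two-level split, assuming hypothetical class names and signatures (the actual classes in the Iceberg Flink module may differ):

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical: reads records from a single file and supports seeking to a row offset.
interface FileRecordIterator<T> extends Iterator<T> {
  // Skip forward so the next call to next() returns the record at rowOffset.
  void seek(long rowOffset);
}

// Hypothetical: chains the per-file iterators of one combined scan task.
class CombinedTaskRecordIterator<T> implements Iterator<T> {
  private final List<FileRecordIterator<T>> files;
  private int fileOffset = 0;

  CombinedTaskRecordIterator(List<FileRecordIterator<T>> files) {
    this.files = files;
  }

  // Restore a checkpointed position: locate the file that was being read
  // and seek within it to the recorded row offset.
  void seek(int restoredFileOffset, long rowOffset) {
    this.fileOffset = restoredFileOffset;
    files.get(restoredFileOffset).seek(rowOffset);
  }

  @Override
  public boolean hasNext() {
    while (fileOffset < files.size()) {
      if (files.get(fileOffset).hasNext()) {
        return true;
      }
      fileOffset += 1; // current file is exhausted, move on to the next one
    }
    return false;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return files.get(fileOffset).next();
  }
}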

Contributor Author:

That makes sense to me.

I have a question about the Map<String, InputFile> inputFiles. Right now, it is constructed per CombinedScanTask. Would it be OK to construct it per individual FileScanTask instead? I tried the change and the delete tests pass, but I am not sure whether I am missing anything, since I am not familiar with the merge-on-read handling of deleted rows.

Contributor Author:

@openinx can you take a look at my question in the comment above?

Member:

The inputFiles map in DataIterator is an in-memory cache that returns the decrypted InputFile for a given file location. We maintain those <location, decryptedInputFile> entries in a map because we try to fetch all the decrypted inputs at once (some EncryptionManager implementations use this to request them in a batch RPC call). It has no relationship to row-level deletes in format v2; we could fetch the <location, decryptedInputFile> pairs one by one, but that could produce many RPCs to a key server.
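
For illustration, here is a simplified sketch of how such a batch-decryption cache can be built; it is loosely modeled on the DataIterator, and the helper name decryptedInputFiles is made up:

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.encryption.EncryptedFiles;
import org.apache.iceberg.encryption.EncryptedInputFile;
import org.apache.iceberg.encryption.EncryptionManager;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;

static Map<String, InputFile> decryptedInputFiles(
    CombinedScanTask task, FileIO io, EncryptionManager encryption) {
  // collect the key metadata of every data file and delete file in the task
  Map<String, ByteBuffer> keyMetadata = new HashMap<>();
  task.files().stream()
      .flatMap(fileTask -> Stream.concat(Stream.of(fileTask.file()), fileTask.deletes().stream()))
      .forEach(file -> keyMetadata.put(file.path().toString(), file.keyMetadata()));

  Stream<EncryptedInputFile> encrypted = keyMetadata.entrySet().stream()
      .map(entry -> EncryptedFiles.encryptedInput(io.newInputFile(entry.getKey()), entry.getValue()));

  // a single batch decrypt() call; some EncryptionManager implementations
  // turn this into one RPC to the key server instead of one call per file
  Iterable<InputFile> decrypted = encryption.decrypt(encrypted::iterator);

  Map<String, InputFile> inputFiles = new HashMap<>();
  decrypted.forEach(inputFile -> inputFiles.put(inputFile.location(), inputFile));
  return inputFiles;
}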

Member:

I think it's worth following the original suggestion, because it clearly decouples the file and offset iterator code.

Contributor Author:

I made the change in the child PR #2305. In particular, this is the commit: cec66f9.

I haven't merged it into this uber PR yet, since I am hoping to get it reviewed first in the child PR #2305.

build.gradle (outdated)
compile project(':iceberg-parquet')
compile project(':iceberg-hive-metastore')

compileOnly "org.apache.flink:flink-connector-base"


Why compileOnly? Does this assume that flink-connector-base will be supplied somehow? If so, what's the recommendation for users of the library, given that flink-dist doesn't bundle flink-connector-base?

Contributor Author:

All the Flink deps are defined as compileOnly in Iceberg. Yeah, it assumes the Flink jars are provided at runtime.


This particular jar file won't be provided by the Flink dist. It should be a transitive dependency of the connector.

Contributor Author:

Right now, the iceberg-flink-runtime shadow jar doesn't bring in any Flink deps. If we include flink-connector-base as compile, it will be bundled in the iceberg-flink-runtime shadow jar. If a Flink app then pulls in flink-connector-base transitively via other deps (like the Flink Kafka connector), we can get duplicate classes across jars.

@openinx maybe you can shed some light on how users get the Flink jars when using the Flink Iceberg connector.

Contributor Author:

I am also wondering whether flink-dist should actually include flink-connector-base.


https://issues.apache.org/jira/browse/FLINK-20098

It is not desirable to place such dependencies into flink-dist.

Regarding the transitive dependency: it would be surprising for the user to find that they have to add a flink-connector-base dependency to their project for the iceberg connector to work.

@tweise (Feb 9, 2021):

Regarding the dup classes: the user still has control over the transitive dependency if there is a version mismatch (which is why it should be a transitive dependency and not included via shadow).

Contributor Author:

@tweise thx a lot for the context, it all makes sense to me now. I also hadn't noticed that iceberg-flink-runtime actually excludes all Flink jars. Updated to a compile dep.

In the future, if Flink decides to move flink-connector-base and flink-connector-files into flink-dist (as hinted in FLINK-20472), we can revisit the compile dep status.
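
For reference, the resolved state in build.gradle looks roughly like this (a sketch; per the discussion above, flink-connector-base is excluded from the iceberg-flink-runtime shadow jar, so it reaches users as an ordinary transitive dependency):

// declared as a regular compile dependency rather than compileOnly, so that
// users of the connector pick up flink-connector-base transitively
compile "org.apache.flink:flink-connector-base"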


@Override
public Boundedness getBoundedness() {
return enumeratorConfig.splitDiscoveryInterval() == null ?


Shouldn't boundedness be based on whether the data being read has finite bounds, i.e., whether there's an end timestamp at which the source has to stop reading? You can have finite bounds but still have continuous discovery enabled if the end timestamp is sometime in the future.

Contributor Author:

Here is the relevant passage from the Flink Javadoc. I think the scenario you described also falls under this CONTINUOUS_UNBOUNDED case; I know it is not totally intuitive.

A CONTINUOUS_UNBOUNDED stream may also eventually stop at some point. But before that happens, Flink always assumes the sources are going to run forever.

Contributor:

Yeah, agree with Steven that it's not always intuitive, but it does fall in line with their definition.
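
Concretely, the boundedness decision shown in the snippet above reads roughly like this (assuming the truncated line completes as a ternary on the discovery interval):

@Override
public Boundedness getBoundedness() {
  // no discovery interval configured => one-shot batch read (BOUNDED);
  // otherwise the source keeps discovering new snapshots and must report
  // CONTINUOUS_UNBOUNDED, even if it eventually stops at some point
  return enumeratorConfig.splitDiscoveryInterval() == null
      ? Boundedness.BOUNDED
      : Boundedness.CONTINUOUS_UNBOUNDED;
}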

final Table table = loadTable(tableLoader);
if (enumeratorConfig.splitDiscoveryInterval() == null) {
final List<IcebergSourceSplit> splits = FlinkSplitGenerator.planIcebergSourceSplits(table, scanContext);
assigner.onDiscoveredSplits(splits);


Nit: can we move this into the StaticIcebergEnumerator, so that we keep the interactions between the enumerator and assigner consistent?

Contributor Author:

agree. will change

Contributor Author (@stevenzwu, Feb 6, 2021):

Actually, this is done intentionally: if split planning fails, we fail fast during job initialization. If we instead did the one-time planning in the start method, it would fail at task start in the taskmanager. At the least, we should probably add a comment explaining this.

* A {@link SourceEvent} representing the request for a split, typically sent from the
* {@link SourceReader} to the {@link SplitEnumerator}.
*
* TODO: push change to Flink to carry the finished splitIds.

Is there a JIRA for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I forgot to follow up on the Flink side. Created the JIRA and attached a PR to it.
https://issues.apache.org/jira/browse/FLINK-21364
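
For context, a sketch of the event shape being proposed (the field layout is an assumption; requesterHostname matches the accessor used in the enumerator snippet below):

import java.util.Collection;
import org.apache.flink.api.connector.source.SourceEvent;

// Hypothetical sketch: a split request that also carries the IDs of splits
// the reader has finished, so the enumerator can track completion.
public class SplitRequestEvent implements SourceEvent {
  private final Collection<String> finishedSplitIds;
  private final String requesterHostname;

  public SplitRequestEvent(Collection<String> finishedSplitIds, String requesterHostname) {
    this.finishedSplitIds = finishedSplitIds;
    this.requesterHostname = requesterHostname;
  }

  public Collection<String> finishedSplitIds() {
    return finishedSplitIds;
  }

  public String requesterHostname() {
    return requesterHostname;
  }
}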

readersAwaitingSplit.put(subtaskId, splitRequestEvent.requesterHostname());
assignSplits();
} else {
LOG.error("Received unrecognized event from subtask {}: {}", subtaskId, sourceEvent);

Should this throw an exception?

Contributor Author:

Good question. Throwing an exception will cause the job to fail and restart; explicit failure is probably better than only logging.
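
A sketch of the fail-fast variant under discussion (the surrounding handler structure is assumed from the snippet above):

@Override
public void handleSourceEvent(int subtaskId, SourceEvent sourceEvent) {
  if (sourceEvent instanceof SplitRequestEvent) {
    SplitRequestEvent splitRequestEvent = (SplitRequestEvent) sourceEvent;
    readersAwaitingSplit.put(subtaskId, splitRequestEvent.requesterHostname());
    assignSplits();
  } else {
    // fail fast on a protocol violation instead of only logging it
    throw new IllegalArgumentException(String.format(
        "Received unrecognized event from subtask %d: %s", subtaskId, sourceEvent));
  }
}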

import org.apache.iceberg.flink.source.split.IcebergSourceSplit;

/**
* Enumerator should call the assigner APIs from the coordinator thread.

Maybe expand the javadoc a little to explain why this is a separate component (from the design doc)?

Contributor Author:

good suggestion. will add

* If enumerator wasn't able to assign the split (e.g., reader disconnected),
* enumerator should call {@link SplitAssigner#onUnassignedSplits} to return the split.
*/
GetSplitResult getNext(@Nullable String hostname);

Also pass the subtask index, so that an implementation can assign splits to subtasks in a particular order? Multiple subtasks can share a host.

Contributor Author:

Originally, subtaskIndex was there. We removed it because we couldn't think of any use case needing it. I am definitely open to adding it back if there is a concrete use case. Can you elaborate a little?
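
For context, a minimal sketch of the assigner contract under discussion (simplified; the collection types and exact signatures are assumptions, though the three method names appear elsewhere in this PR):

import java.util.Collection;
import javax.annotation.Nullable;

// Simplified sketch of the SplitAssigner contract; the enumerator is expected
// to invoke all of these methods from the coordinator thread.
public interface SplitAssigner {
  // Request a split for a reader on the given host (null if unknown).
  GetSplitResult getNext(@Nullable String hostname);

  // Splits found by the enumerator's planning (initial or periodic discovery).
  void onDiscoveredSplits(Collection<IcebergSourceSplit> splits);

  // Splits handed back because they could not be assigned, e.g. the
  // requesting reader disconnected before the assignment reached it.
  void onUnassignedSplits(Collection<IcebergSourceSplit> splits);
}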

* under the License.
*/

package org.apache.iceberg.flink.source.assigner;

What's the purpose of this class? How will metrics from the enumerator/assigner be reported to Flink?

Contributor Author:

Good catch. Originally I was planning to have the enumerator poll the assigner for stats; this class was for that purpose. @sundargates and I discussed it and think it is probably better to have the assigner publish metrics directly, so that we don't have to force a single value class like this on all assigners.

I cleaned up the assigner/enumerator code to avoid using this, but forgot to remove the class. Will delete it.

// for batch jobs, discover splits eagerly during job initialization.
// As FLINK-16866 supports non-blocking job submission since 1.12,
// heavy job initialization won't lead to request timeout for job submission.
assigner.onDiscoveredSplits(FlinkSplitGenerator.planIcebergSourceSplits(table, scanContext));

Maybe it would be better to rearrange this for clarity: when the assigner was created with enumState.pendingSplits(), we shouldn't perform eager split discovery here?

Contributor Author:

Great catch. This is actually a bug; let me fix it and add a unit test.
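
A sketch of the fix being described (assumed structure; enumState here is the restored enumerator checkpoint):

if (enumState == null) {
  // fresh start: for batch jobs, discover splits eagerly during job
  // initialization so that planning failures fail the submission fast
  assigner.onDiscoveredSplits(FlinkSplitGenerator.planIcebergSourceSplits(table, scanContext));
}
// on restore, the assigner was already created with enumState.pendingSplits(),
// so re-planning here would enqueue duplicate splits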

@stevenzwu force-pushed the flip27IcebergSource branch 2 times, most recently from fc88932 to bc087ef (February 16, 2021 04:36)
@stevenzwu force-pushed the flip27IcebergSource branch from bc087ef to 0491316 (March 8, 2021 04:30)
@openinx (Member) commented Mar 8, 2021

@stevenzwu would you mind breaking this big PR into several small PRs for review purposes?

@stevenzwu (Contributor Author):

@openinx yes, that is the plan, as outlined in the description. I am actually preparing the next PR, for the split reader.

@stevenzwu force-pushed the flip27IcebergSource branch 4 times, most recently from 4b03a13 to 49342b1 (March 4, 2022 17:47)
@stevenzwu force-pushed the flip27IcebergSource branch 8 times, most recently from 1879167 to 3125378 (March 16, 2022 23:21)
@stevenzwu force-pushed the flip27IcebergSource branch 2 times, most recently from 117143f to e2d2f38 (March 25, 2022 18:39)
@klam-shop:

👋 Is work still ongoing for the FLIP-27 Iceberg Flink source?

@stevenzwu (Contributor Author):

@klam-shop yes. The uber draft PR is meant to provide the full context for how things work together. It is being broken down into smaller PRs for easier code review.

You can check the project board for progress: https://github.com/apache/iceberg/projects/23. We are about 60% merged.

@klam-shop:

Thanks for the quick response @stevenzwu! Do you have an idea of when the FLIP-27 source will be completed?

@stevenzwu (Contributor Author):

@klam-shop it should be done before the end of Q2. Right now, the main challenge is committers' review bandwidth.

@stevenzwu force-pushed the flip27IcebergSource branch from 78fdbce to 228e655 (June 6, 2022 23:28)
@zoucao (Contributor) commented Jun 13, 2022

> @klam-shop should be done before end of Q2. right now, main challenge is committers' review bandwidth.

Hi @stevenzwu, is this PR nearly finished? I see that all the smaller PRs in project 23 are merged, and all the classes are ready except IcebergSource, so I think only one PR, implementing IcebergSource, is left, right? Correct me if I have made some mistakes; we're looking forward to this feature.

@stevenzwu (Contributor Author):

@zoucao for the MVP version, there are two sub-PRs left: (1) PR #4986, part 2 of the enumerator, and (2) the IcebergSource PR that puts everything together. Based on the pace, I think we are probably talking about another 1.5-2 months.

@stevenzwu (Contributor Author):

Closing this draft PR, as we are getting close to merging the MVP version of the FLIP-27 source.

@stevenzwu closed this Jun 17, 2022