
Conversation

@stevenzwu (Contributor) commented Jan 18, 2021

Scope of the first version:

  • simple split assigner (no ordering or locality awareness)
  • support both batch and streaming reads

This is the uber PR, kept for reference to the complete context. Smaller PRs will be submitted for code review:

  1. [MERGED] Refactor Flink tests so that the new source implementation can reuse them (Flink: Refactor flink source tests for FLIP-27 unified source. #2047)
  2. [MERGED] Upgrade Flink version to 1.12.1 (Flink: Upgrade version from 1.11.0 to 1.12.1 #1956)
  3. [PENDING] FLIP-27 Iceberg source split (Flink: FLIP-27 Iceberg source split #3501)
  4. SimpleSplitAssigner (TBD). Note that other assigners will be added after this work is completed.
  5. Split enumerator (TBD)
  6. IcebergSource, where everything is put together (TBD)

The new IcebergSource will be marked as @Experimental while the FLIP-27 source matures and we make it production ready.

Here is the design doc that my colleague (@sundargates) and I created, as mentioned in #1626.

@openinx requested review from openinx and removed the review request for openinx (January 19, 2021 06:41)
@stevenzwu force-pushed the flip27IcebergSource branch from 569f46c to 5c485c9 (January 25, 2021 19:22)
return inputFiles.get(location);
}

public void seek(CheckpointedPosition checkpointedPosition) {
Member:

Currently, we put those two levels of iteration inside a single DataIterator, which makes the code a bit complex to read and understand. I'd prefer to split this into two different iterators (see the sketch after this list):

  1. FileRecordIterator, which will seek to the provided row offset and then continue reading the following records.
  2. CombinedTaskRecordIterator, which will hold multiple FileRecordIterators; it will locate the currently open FileRecordIterator and seek to the given row offset to read the following records.
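
A minimal sketch of that two-level split, assuming hypothetical class names and signatures (the actual classes in the Iceberg Flink module may differ):

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical: reads records from a single file and supports seeking to a row offset.
interface FileRecordIterator<T> extends Iterator<T> {
  // Skip forward so the next call to next() returns the record at rowOffset.
  void seek(long rowOffset);
}

// Hypothetical: chains the per-file iterators of one combined scan task.
class CombinedTaskRecordIterator<T> implements Iterator<T> {
  private final List<FileRecordIterator<T>> files;
  private int fileOffset = 0;

  CombinedTaskRecordIterator(List<FileRecordIterator<T>> files) {
    this.files = files;
  }

  // Restore a checkpointed position: locate the file that was being read
  // and seek within it to the recorded row offset.
  void seek(int restoredFileOffset, long rowOffset) {
    this.fileOffset = restoredFileOffset;
    files.get(restoredFileOffset).seek(rowOffset);
  }

  @Override
  public boolean hasNext() {
    while (fileOffset < files.size()) {
      if (files.get(fileOffset).hasNext()) {
        return true;
      }
      fileOffset += 1; // current file is exhausted, move on to the next one
    }
    return false;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return files.get(fileOffset).next();
  }
}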

Contributor Author:

That makes sense to me.

I have a question about the Map<String, InputFile> inputFiles. Right now, it is constructed per CombinedScanTask. Would it be OK to construct it per individual FileScanTask instead? I tried the change and the delete tests pass, but I am not sure whether I am missing anything, since I am not familiar with the merge-on-read handling of deleted rows.

Contributor Author:

@openinx can you take a look at my question in the comment above?

Member:

The inputFiles map in DataIterator is an in-memory cache that returns the decrypted InputFile for a given file location. We maintain those <location, decryptedInputFile> entries in a map because we try to fetch all the decrypted inputs at once (some EncryptionManager implementations use this to request them in a batch RPC call). It has no relationship to row-level deletes in format v2; we could fetch the <location, decryptedInputFile> pairs one by one, but that could produce many RPCs to a key server.
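
For illustration, here is a simplified sketch of how such a batch-decryption cache can be built; it is loosely modeled on the DataIterator, and the helper name decryptedInputFiles is made up:

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.encryption.EncryptedFiles;
import org.apache.iceberg.encryption.EncryptedInputFile;
import org.apache.iceberg.encryption.EncryptionManager;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;

static Map<String, InputFile> decryptedInputFiles(
    CombinedScanTask task, FileIO io, EncryptionManager encryption) {
  // collect the key metadata of every data file and delete file in the task
  Map<String, ByteBuffer> keyMetadata = new HashMap<>();
  task.files().stream()
      .flatMap(fileTask -> Stream.concat(Stream.of(fileTask.file()), fileTask.deletes().stream()))
      .forEach(file -> keyMetadata.put(file.path().toString(), file.keyMetadata()));

  Stream<EncryptedInputFile> encrypted = keyMetadata.entrySet().stream()
      .map(entry -> EncryptedFiles.encryptedInput(io.newInputFile(entry.getKey()), entry.getValue()));

  // a single batch decrypt() call; some EncryptionManager implementations
  // turn this into one RPC to the key server instead of one call per file
  Iterable<InputFile> decrypted = encryption.decrypt(encrypted::iterator);

  Map<String, InputFile> inputFiles = new HashMap<>();
  decrypted.forEach(inputFile -> inputFiles.put(inputFile.location(), inputFile));
  return inputFiles;
}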

Member:

I think it's worth following the original suggestion, because it clearly decouples the file and offset iterator code.

Contributor Author:

I made the change in the child PR #2305. In particular, this is the commit: cec66f9.

I haven't merged it into this uber PR yet, since I am hoping to get it reviewed first in the child PR #2305.

build.gradle (outdated)
compile project(':iceberg-parquet')
compile project(':iceberg-hive-metastore')

compileOnly "org.apache.flink:flink-connector-base"


Why compileOnly? Does this assume that flink-connector-base will be supplied somehow? If so, what's the recommendation for users of the library, given that flink-dist doesn't bundle flink-connector-base?

Contributor Author:

All the Flink deps are defined as compileOnly in Iceberg. Yeah, it assumes the Flink jars are provided at runtime.


This particular jar file won't be provided by the Flink dist. It should be a transitive dependency of the connector.

Contributor Author:

Right now, the iceberg-flink-runtime shadow jar doesn't bring in any Flink deps. If we include flink-connector-base as compile, it will be bundled in the iceberg-flink-runtime shadow jar. If a Flink app then pulls in flink-connector-base transitively via other deps (like the Flink Kafka connector), we can get duplicate classes across jars.

@openinx maybe you can shed some light on how users get the Flink jars when using the Flink Iceberg connector.

Contributor Author:

I am also wondering whether flink-dist should actually include flink-connector-base.


https://issues.apache.org/jira/browse/FLINK-20098

It is not desirable to place such dependencies into flink-dist.

Regarding the transitive dependency: it would be surprising for the user to find that they have to add a flink-connector-base dependency to their project for the iceberg connector to work.

@tweise (Feb 9, 2021):

Regarding the dup classes: the user still has control over the transitive dependency if there is a version mismatch (which is why it should be a transitive dependency and not included via shadow).

Contributor Author:

@tweise thx a lot for the context, it all makes sense to me now. I also hadn't noticed that iceberg-flink-runtime actually excludes all Flink jars. Updated to a compile dep.

In the future, if Flink decides to move flink-connector-base and flink-connector-files into flink-dist (as hinted in FLINK-20472), we can revisit the compile dep status.
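
For reference, the resolved state in build.gradle looks roughly like this (a sketch; per the discussion above, flink-connector-base is excluded from the iceberg-flink-runtime shadow jar, so it reaches users as an ordinary transitive dependency):

// declared as a regular compile dependency rather than compileOnly, so that
// users of the connector pick up flink-connector-base transitively
compile "org.apache.flink:flink-connector-base"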


@Override
public Boundedness getBoundedness() {
return enumeratorConfig.splitDiscoveryInterval() == null ?


Shouldn't boundedness be based on whether the data being read has finite bounds, i.e., whether there's an end timestamp at which the source has to stop reading? You can have finite bounds but still have continuous discovery enabled if the end timestamp is sometime in the future.

Contributor Author:

Here is the relevant passage from the Flink Javadoc. I think the scenario you described also falls under this CONTINUOUS_UNBOUNDED case; I know it is not totally intuitive.

A CONTINUOUS_UNBOUNDED stream may also eventually stop at some point. But before that happens, Flink always assumes the sources are going to run forever.

Contributor:

Yeah, agree with Steven that it's not always intuitive, but it does fall in line with their definition.
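
Concretely, the boundedness decision shown in the snippet above reads roughly like this (assuming the truncated line completes as a ternary on the discovery interval):

@Override
public Boundedness getBoundedness() {
  // no discovery interval configured => one-shot batch read (BOUNDED);
  // otherwise the source keeps discovering new snapshots and must report
  // CONTINUOUS_UNBOUNDED, even if it eventually stops at some point
  return enumeratorConfig.splitDiscoveryInterval() == null
      ? Boundedness.BOUNDED
      : Boundedness.CONTINUOUS_UNBOUNDED;
}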

final Table table = loadTable(tableLoader);
if (enumeratorConfig.splitDiscoveryInterval() == null) {
final List<IcebergSourceSplit> splits = FlinkSplitGenerator.planIcebergSourceSplits(table, scanContext);
assigner.onDiscoveredSplits(splits);


Nit: can we move this into the StaticIcebergEnumerator, so that we keep the interactions between the enumerator and assigner consistent?

Contributor Author:

agree. will change

Contributor Author (@stevenzwu, Feb 6, 2021):

Actually, this is done intentionally: if split planning fails, we fail fast during job initialization. If we instead did the one-time planning in the start method, it would fail at task start in the taskmanager. At the least, we should probably add a comment explaining this.

* A {@link SourceEvent} representing the request for a split, typically sent from the
* {@link SourceReader} to the {@link SplitEnumerator}.
*
* TODO: push change to Flink to carry the finished splitIds.

Is there a JIRA for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I forgot to follow up on the Flink side. Created the JIRA and attached a PR to it.
https://issues.apache.org/jira/browse/FLINK-21364
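
For context, a sketch of the event shape being proposed (the field layout is an assumption; requesterHostname matches the accessor used in the enumerator snippet below):

import java.util.Collection;
import org.apache.flink.api.connector.source.SourceEvent;

// Hypothetical sketch: a split request that also carries the IDs of splits
// the reader has finished, so the enumerator can track completion.
public class SplitRequestEvent implements SourceEvent {
  private final Collection<String> finishedSplitIds;
  private final String requesterHostname;

  public SplitRequestEvent(Collection<String> finishedSplitIds, String requesterHostname) {
    this.finishedSplitIds = finishedSplitIds;
    this.requesterHostname = requesterHostname;
  }

  public Collection<String> finishedSplitIds() {
    return finishedSplitIds;
  }

  public String requesterHostname() {
    return requesterHostname;
  }
}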

readersAwaitingSplit.put(subtaskId, splitRequestEvent.requesterHostname());
assignSplits();
} else {
LOG.error("Received unrecognized event from subtask {}: {}", subtaskId, sourceEvent);

Should this throw an exception?

Contributor Author:

Good question. Throwing an exception will cause the job to fail and restart; explicit failure is probably better than only logging.
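
A sketch of the fail-fast variant under discussion (the surrounding handler structure is assumed from the snippet above):

@Override
public void handleSourceEvent(int subtaskId, SourceEvent sourceEvent) {
  if (sourceEvent instanceof SplitRequestEvent) {
    SplitRequestEvent splitRequestEvent = (SplitRequestEvent) sourceEvent;
    readersAwaitingSplit.put(subtaskId, splitRequestEvent.requesterHostname());
    assignSplits();
  } else {
    // fail fast on a protocol violation instead of only logging it
    throw new IllegalArgumentException(String.format(
        "Received unrecognized event from subtask %d: %s", subtaskId, sourceEvent));
  }
}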

import org.apache.iceberg.flink.source.split.IcebergSourceSplit;

/**
* Enumerator should call the assigner APIs from the coordinator thread.

Maybe expand the javadoc a little to explain why this is a separate component (from the design doc)?

Contributor Author:

good suggestion. will add

* If enumerator wasn't able to assign the split (e.g., reader disconnected),
* enumerator should call {@link SplitAssigner#onUnassignedSplits} to return the split.
*/
GetSplitResult getNext(@Nullable String hostname);

Also pass the subtask index, so that an implementation can assign splits to subtasks in a particular order? Multiple subtasks can share a host.

Contributor Author:

Originally, subtaskIndex was there. We removed it because we couldn't think of any use case needing it. I am definitely open to adding it back if there is a concrete use case. Can you elaborate a little?
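
For context, a minimal sketch of the assigner contract under discussion (simplified; the collection types and exact signatures are assumptions, though the three method names appear elsewhere in this PR):

import java.util.Collection;
import javax.annotation.Nullable;

// Simplified sketch of the SplitAssigner contract; the enumerator is expected
// to invoke all of these methods from the coordinator thread.
public interface SplitAssigner {
  // Request a split for a reader on the given host (null if unknown).
  GetSplitResult getNext(@Nullable String hostname);

  // Splits found by the enumerator's planning (initial or periodic discovery).
  void onDiscoveredSplits(Collection<IcebergSourceSplit> splits);

  // Splits handed back because they could not be assigned, e.g. the
  // requesting reader disconnected before the assignment reached it.
  void onUnassignedSplits(Collection<IcebergSourceSplit> splits);
}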

* under the License.
*/

package org.apache.iceberg.flink.source.assigner;

What's the purpose of this class? How will metrics from the enumerator/assigner be reported to Flink?

Contributor Author:

Good catch. Originally I was planning to have the enumerator poll the assigner for stats; this class was for that purpose. @sundargates and I discussed it and think it is probably better to have the assigner publish metrics directly, so that we don't have to force a single value class like this on all assigners.

I cleaned up the assigner/enumerator code to avoid using this, but forgot to remove the class. Will delete it.

// for batch jobs, discover splits eagerly during job initialization.
// As FLINK-16866 supports non-blocking job submission since 1.12,
// heavy job initialization won't lead to request timeout for job submission.
assigner.onDiscoveredSplits(FlinkSplitGenerator.planIcebergSourceSplits(table, scanContext));

Maybe it would be better to rearrange this for clarity: when the assigner was created with enumState.pendingSplits(), we shouldn't perform eager split discovery here?

Contributor Author:

Great catch. This is actually a bug; let me fix it and add a unit test.
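
A sketch of the fix being described (assumed structure; enumState here is the restored enumerator checkpoint):

if (enumState == null) {
  // fresh start: for batch jobs, discover splits eagerly during job
  // initialization so that planning failures fail the submission fast
  assigner.onDiscoveredSplits(FlinkSplitGenerator.planIcebergSourceSplits(table, scanContext));
}
// on restore, the assigner was already created with enumState.pendingSplits(),
// so re-planning here would enqueue duplicate splits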

@stevenzwu force-pushed the flip27IcebergSource branch 2 times, most recently from fc88932 to bc087ef (February 16, 2021 04:36)
@stevenzwu force-pushed the flip27IcebergSource branch from bc087ef to 0491316 (March 8, 2021 04:30)
@openinx (Member) commented Mar 8, 2021

@stevenzwu would you mind breaking this big PR into several small PRs for review purposes?

@stevenzwu (Contributor Author):

@openinx yes, that is the plan, as outlined in the description. I am actually preparing the next PR, for the split reader.

@stevenzwu force-pushed the flip27IcebergSource branch 4 times, most recently from 4b03a13 to 49342b1 (March 4, 2022 17:47)
@stevenzwu force-pushed the flip27IcebergSource branch 8 times, most recently from 1879167 to 3125378 (March 16, 2022 23:21)
@stevenzwu force-pushed the flip27IcebergSource branch 2 times, most recently from 117143f to e2d2f38 (March 25, 2022 18:39)
@klam-shop:

👋 Is work still ongoing for the FLIP-27 Iceberg Flink source?

@stevenzwu (Contributor Author):

@klam-shop yes. The uber draft PR is meant to provide the full context for how things work together. It is being broken down into smaller PRs for easier code review.

You can check the project board for progress: https://github.com/apache/iceberg/projects/23. We are about 60% merged.

@klam-shop:

Thanks for the quick response @stevenzwu! Do you have an idea of when the FLIP-27 source will be completed?

@stevenzwu (Contributor Author):

@klam-shop it should be done before the end of Q2. Right now, the main challenge is committers' review bandwidth.

@stevenzwu force-pushed the flip27IcebergSource branch from 78fdbce to 228e655 (June 6, 2022 23:28)
@zoucao (Contributor) commented Jun 13, 2022

> @klam-shop should be done before end of Q2. right now, main challenge is committers' review bandwidth.

Hi @stevenzwu, is this PR nearly finished? I see that all the smaller PRs in project 23 are merged, and all the classes are ready except IcebergSource, so I think only one PR, implementing IcebergSource, is left, right? Correct me if I have made some mistakes; we're looking forward to this feature.

@stevenzwu (Contributor Author):

@zoucao for the MVP version, there are two sub-PRs left: (1) PR #4986, part 2 of the enumerator, and (2) the IcebergSource PR that puts everything together. Based on the pace, I think we are probably talking about another 1.5-2 months.

@stevenzwu (Contributor Author):

Closing this draft PR, as we are getting close to merging the MVP version of the FLIP-27 source.

@stevenzwu closed this Jun 17, 2022