Flink: FLIP-27 Iceberg source split #3501
Conversation
@Override
public int hashCode() {
  return Objects.hashCode(splitId());
Using the value of toString() to calculate the hashCode is not good practice, because if some FileScanTask implementation does not implement toString() correctly, it will use the instance's hash code as its toString() value. That makes comparing two IcebergSourceSplit instances meaningless.
As we already have the cached serializedFormCache, why not use a lazy approach to get the serialized bytes and calculate the hash code from that byte array?
As mentioned in the other comment, serializedFormCache can't use the lazy approach because the serialization is done by a separate serializer class; IcebergSourceSplit doesn't know how to serialize itself. Hence we can't calculate the byte array's hash code.
We also aren't using the toString from FileScanTask. We are only using fileScanTask.file().path().toString().
I don't think that we have a guarantee that serializedFormCache will be set unless the split has in fact been serialized via IcebergSourceSplitSerializer.
@kbendick you are correct. I had a typo in my earlier comment; it should be "can't use the lazy approach because ...".
Then why not assign a unique integer number to the IcebergSourceSplit as splitId when planning the tasks in FlinkSplitPlanner#planIcebergSourceSplits? I don't think keeping the toString approach as the identifier answers the question I raised in the first comment.
I am not sure we want to assign a unique integer number as splitId (especially for long-running streaming jobs).
We would need to checkpoint the splitId counter, and what if we can't restore from the checkpoint (e.g. due to corrupted checkpoint state)? It is probably better to compute the splitId from the intrinsic properties of IcebergSourceSplit (like the path, start, and length of the data files).
I thought your first comment was about depending on FileScanTask#toString, which I explained is not the case: we don't call FileScanTask#toString. Instead, we depend on fileScanTask.file().path().toString(). Maybe I misunderstood your first comment. Can you elaborate a little more on your concern?
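To illustrate that idea concretely, here is a hypothetical sketch of deriving the splitId from intrinsic properties; the helper name and id format are made up for illustration and are not the PR's actual implementation:

private String splitId(CombinedScanTask task) {
  // concatenating each file's path, start, and length gives an id that is
  // stable across job restarts and needs no checkpointed counter
  return task.files().stream()
      .map(fileScanTask -> fileScanTask.file().path().toString() +
          "_" + fileScanTask.start() + "_" + fileScanTask.length())
      .collect(Collectors.joining(",")); // java.util.stream.Collectors
}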
I still don't think it is valid to build the hash code on top of FileScanTask#toString, because of this comment.
@openinx not sure where the miscommunication is, but we don't call FileScanTask#toString here. Instead, we only call FileScanTask.file().path().toString(). Please see the code snippet below.
private String toString(Collection<FileScanTask> files) {
  return Iterables.toString(files.stream().map(fileScanTask ->
      MoreObjects.toStringHelper(fileScanTask)
          .add("file", fileScanTask.file() != null ?
              fileScanTask.file().path().toString() :
              "NoDataFile")
          .add("start", fileScanTask.start())
          .add("length", fileScanTask.length())
          .toString()).collect(Collectors.toList()));
}
void serializedFormCache(byte[] cachedBytes) {
  this.serializedFormCache = cachedBytes;
}
Why not use a lazy approach to serialize the split, similar to Schema?
private Map<Integer, NestedField> lazyIdToField() {
Setting the cachedBytes from outside looks a bit strange to me.
Serialization is done by another class, IcebergSourceSplitSerializer, so the bytes can't be computed and cached internally. Hence this setter provides a way for IcebergSourceSplitSerializer to cache the serialized bytes.
lazyIdToField can work because everything is encapsulated within the Schema class.
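For context, the serializer side of this getter/setter protocol checks the cache before doing any work. This sketch is reconstructed from the serialize method excerpted later in this thread, using the post-rename accessor name (serializedBytesCache):

@Override
public byte[] serialize(IcebergSourceSplit split) throws IOException {
  // serialize only once; later calls reuse the bytes cached on the split
  if (split.serializedBytesCache() == null) {
    byte[] result = serializeV1(split);
    split.serializedBytesCache(result);
  }
  return split.serializedBytesCache();
}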
Correct me if I'm wrong, but it seems like it is lazy and only set via the first call to IcebergSourceSplitSerializer.
Maybe the name is what looks a bit odd to you? I had to read it a few times myself, but I'm not as quick at Flink stuff as you two (or in general).
To me, the lazy pattern is encapsulated within the class. E.g., the Schema class knows how to compute the id-to-field mapping in its lazyIdToField method. Here, we have more of a getter/setter protocol between two classes for caching the serialized bytes.
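For contrast, a paraphrased sketch of the lazy pattern in Schema (the exact body in Iceberg may differ slightly):

private Map<Integer, NestedField> lazyIdToField() {
  if (idToField == null) {
    // the class computes its own cache from state it already owns
    this.idToField = TypeUtil.indexById(struct);
  }
  return idToField;
}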
| .add("file", fileScanTask.file() != null ? | ||
| fileScanTask.file().path().toString() : | ||
| "NoFile") |
I think FileScanTask#file() is guaranteed to provide a non-null value; otherwise start() and length() wouldn't have any meaning.
will remove the null check here
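With the null check removed, the helper shown earlier would reduce to roughly:

private String toString(Collection<FileScanTask> files) {
  return Iterables.toString(files.stream().map(fileScanTask ->
      MoreObjects.toStringHelper(fileScanTask)
          // file() is guaranteed non-null, so the "NoDataFile" fallback goes away
          .add("file", fileScanTask.file().path().toString())
          .add("start", fileScanTask.start())
          .add("length", fileScanTask.length())
          .toString()).collect(Collectors.toList()));
}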
package org.apache.iceberg.flink.source.split;

public enum IcebergSourceSplitStatus {
Is this class related to this source split?
This is not used by this PR. will remove
kbendick left a comment:
Left some comments.
Overall this looks really good to me. Would there be any benefit to using the FileSourceSplit interface instead? It seems like that's a bit too general for our use case, and the SerDe is versioned anyway.
I'm going to approve this, as it seems ready to me, give or take a few nits that I'll leave up to you and some questions for my own understanding. I'll still come back to follow up on comments, etc., but I think this is a good direction to build off of.
Appreciate all of the work you've put in on the FLIP-27 front. 👍
| * <a href="https://github.com/apache/iceberg/issues/1698">issue-1698</a>. | ||
| */ | ||
| @Internal | ||
| public class IcebergSourceSplitSerializer implements SimpleVersionedSerializer<IcebergSourceSplit> { |
Will this get used to serialize/deserialize the splits across task boundaries, or just for checkpoints?
The comment on the serializedFormCache field in IcebergSourceSplit mentions checkpoints, but I'm curious about crossing task boundaries, etc.
Question for my own understanding after looking through the SimpleVersionedSerializer docs:
It seems like SimpleVersionedSerializer can only handle one version at a time (hence the getVersion() function).
Is that correct? Are there any best practices to consider when working with SimpleVersionedSerializer? When we evolve the format, will we have two classes, or handle all known versions in one instance?
This class is used as the checkpoint state serializer; cross-process (JM -> TM) communication goes through Java serialization. Currently, we are using Java serialization inside this class too for a simpler start. This is not ideal, since Java serialization does not handle schema evolution well, and schema evolution is important for long-running streaming jobs (not so much for batch jobs).
In the class Javadoc, we linked to an issue for future improvement. Note that this is already an issue for the current FlinkSource in streaming mode:
#1698
SimpleVersionedSerializer always serializes with one (the latest) version. But during deserialization, it should handle multiple versions to support evolution (e.g. when we switch from Java serialization to some Avro serialization for FileScanTask):
@Override
public IcebergSourceSplit deserialize(int version, byte[] serialized) throws IOException {
  switch (version) {
    case 1:
      return deserializeV1(serialized);
    default:
      throw new IOException("Unknown version: " + version);
  }
}
case 1:
  return deserializeV1(serialized);
default:
  throw new IOException("Unknown version: " + version);
Nit: It might help to mention the highest known version (assuming that we'll use monotonically increasing versioning). Or at the least, since most users won't be aware of SimpleVersionedSerializer, mention what an unknown version means (in theory this would happen with data written by a newer library version, or maybe with corrupted data / some kind of bug).
Something like: Failed to deserialize IcebergSourceSplit. Encountered unknown version $version. The maximum version that can be handled is ${currentVersion}.
"maximum version" is probably also not accurate. it implies all versions below this number are supported. We may drop support of deserializing older versions. I will change the error msg to sth like
Failed to deserialize IcebergSourceSplit. Encountered unsupported version: $version. Supported version are [1]"
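In code, that default branch might end up looking like this (a sketch based on the message above; the exact wording in the PR may differ):

default:
  throw new IOException(String.format(
      "Failed to deserialize IcebergSourceSplit. " +
      "Encountered unsupported version: %d. Supported versions are [1]", version));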
 * Caching the byte representation makes repeated serialization cheap.
 */
@Nullable
private transient byte[] serializedFormCache;
Nit: This is a great idea.
However, the name is a little confusing to me. Maybe lazySerializedBytes or something (sort of like @openinx's comment)? Just a nit, as I don't want to bike-shed on the name. Clever idea overall.
will change it to serializedBytesCache. This is not a "lazy" pattern, which is typically encapsulated within the class; this is a getter/setter interaction between IcebergSourceSplit and IcebergSourceSplitSerializer.
@kbendick the initial implementation does have

When we are ready to support vectorized readers in Flink, we need to make sure they support the delete filter properly. There is one open PR for ORC: #2566
Force-pushed from 84a662f to c03459f.
kbendick left a comment:
Looked at the updates and this looks good to me 👍 .
I agree on not extending FileSourceSplit. Seems like it's more complicated than we need it to be and we also need the delete filter support.
@openinx can you take another look?
Since we've enabled the checkstyle check for all engine versions, let's rerun the Travis CI again! #3550
if (context.includeColumnStats()) {
  scan = scan.includeColumnStats();
}
Why add this switch in this PR?
I'd recommend keeping this PR, which adds the FLIP-27 source split, as focused as possible, so I'd also recommend removing the unrelated changes.
column stats are needed for the event-time/watermark-aligned assigner. You are correct that it is not directly used by this PR. Right now, I am splitting sub-PRs at minimally connected files for easier creation of the sub-PRs. If you think it is important to avoid unrelated changes inside a file, I can revert this piece of the change.
@stevenzwu how do you feel about this comment?
I think this is a reasonable addition. I think the motivation is to change this file only in this PR and not in the other PRs that are part of FLIP-27.
/**
 * This returns splits for the FLIP-27 source
 */
public static List<IcebergSourceSplit> planIcebergSourceSplits(
Nit: I think we don't need to switch to a new line.
will fix
if (o == null || getClass() != o.getClass()) {
  return false;
}
final Position that = (Position) o;
Nit: we usually don't use final for a local variable in Iceberg.
yes. will remove
private static final long serialVersionUID = 1L;

private final CombinedScanTask task;
/**
Nit: could you please leave a separate empty line between the two different blocks?
There are two private final variables, hence I thought it was one block; it is just that the second variable has a Javadoc. If it is Iceberg's style convention to always have an empty line before a Javadoc comment, I am very happy to conform.
I think this minor comment needs to be addressed.
changed to // comments
public void updatePosition(int newFileOffset, long newRecordOffset) {
  position.update(newFileOffset, newRecordOffset);
}
No usage for this method?
It is not used here, but it is used by IcebergSourceRecordEmitter.
Again, I am creating the PR with a minimal set of connected files (not minimally connected code within files).
@Override
public byte[] serialize(IcebergSourceSplit split) throws IOException {
  if (split.serializedBytesCache() == null) {
    final byte[] result = serializeV1(split);
Nit: remove the unnecessary final modifier.
will fix
public static List<IcebergSourceSplit> createFileSplits(
    TemporaryFolder temporaryFolder, int fileCount, int filesPerSplit) throws Exception {
  final File warehouseFile = temporaryFolder.newFolder();
Nit: I'd recommend removing all those unnecessary final modifiers to keep consistent with Iceberg's coding style.
Thanks. Will do a complete pass to find and remove the unnecessary final modifiers.
.map(files -> new BaseCombinedScanTask(files))
.map(combinedScanTask -> IcebergSourceSplit.fromCombinedScanTask(combinedScanTask));
It would be good to use method references here.
will do
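Assuming the upstream stream elements match the constructor and factory signatures, the two map calls become:

.map(BaseCombinedScanTask::new)
.map(IcebergSourceSplit::fromCombinedScanTask);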
    })
    .collect(Collectors.toList());
} finally {
  catalog.dropTable(TestFixtures.TABLE_IDENTIFIER);
Since we drop the table at the end, removing the data and metadata files, can we still read the IcebergSourceSplit in the outer method?
Good question. I see why this can be confusing.
Right now, this method is used only by TestIcebergSourceSplitSerializer to generate some realistic splits with actual paths (unlike createMockedSplits). I will add some comments both at the method level and here. Let me know if I should move this method into TestIcebergSourceSplitSerializer; I kept it here so that it might be useful for other tests.
The method has been renamed, and hopefully it won't be confusing anymore. Also updated the Javadoc.
final IcebergSourceSplit deserialized2 = serializer.deserialize(serializer.getVersion(), cachedResult);
Assert.assertEquals(split, deserialized2);
Why not just use deserialized?
We already asserted deserialized in line 48. This is to make sure deserialized2, from the second serializer.deserialize call (on the cached bytes), still yields the same split.
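A reconstruction of the full round-trip assertion (the first three lines are inferred from the test excerpt above and this explanation):

byte[] result = serializer.serialize(split);
IcebergSourceSplit deserialized = serializer.deserialize(serializer.getVersion(), result);
Assert.assertEquals(split, deserialized);

// the second serialize call should return the bytes cached on the split
byte[] cachedResult = serializer.serialize(split);
IcebergSourceSplit deserialized2 = serializer.deserialize(serializer.getVersion(), cachedResult);
Assert.assertEquals(split, deserialized2);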
try (CloseableIterable<CombinedScanTask> tasksIterable = planTasks(table, context)) {
  List<IcebergSourceSplit> splits = Lists.newArrayList();
  tasksIterable.forEach(task -> splits.add(IcebergSourceSplit.fromCombinedScanTask(task)));
  return splits;
Lists.transform is another option here instead of separate lines.
Sure, will change to this one-liner:
return Lists.newArrayList(CloseableIterable.transform(tasksIterable,
    task -> IcebergSourceSplit.fromCombinedScanTask(task)));
  this.recordOffset += 1L;
}

public void update(int newFileOffset, long newRecordOffset) {
This seems more like a set method to me, since it directly sets the internal counters rather than advancing them by an amount.
Sure, will rename the method to set.
}
Position that = (Position) o;
return Objects.equals(fileOffset, that.fileOffset) &&
    Objects.equals(recordOffset, that.recordOffset);
These are both primitives, so you can use == instead of Objects.equals.
I tend to prefer Objects.equals, as it does an == check first anyway, and then I don't have to think about it in the future.
If the project style prefers == where possible, then by all means go with that.
Objects.equals code per OpenJDK 8:

public static boolean equals(Object a, Object b) {
  return (a == b) || (a != null && a.equals(b));
}
These are primitives, so calling equals(Object, Object) will box both values just to do a more expensive check.
Oh, good call, I hadn't considered the boxing.
will change to ==
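The resulting equals then compares the primitive fields directly (a sketch assuming fileOffset is an int and recordOffset a long, per the update signature above):

@Override
public boolean equals(Object o) {
  if (this == o) {
    return true;
  }
  if (o == null || getClass() != o.getClass()) {
    return false;
  }
  Position that = (Position) o;
  // primitive comparison avoids the autoboxing Objects.equals would incur
  return fileOffset == that.fileOffset && recordOffset == that.recordOffset;
}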
 * </ul>
 */
@Internal
public class Position implements Serializable {
SplitPosition?
will rename the class to SplitPosition
@stevenzwu does this work with Flink 1.14?

This should work for 1.14. The convention in the repo is to do large PRs against one version and then backport in another PR (usually we apply it to the latest supported version, but this PR has existed for a while, so the latest at the time was 1.13). Unless you're aware of a specific reason this would not work with 1.14, I'm fairly certain it will, given that we're using the newer interfaces. And on top of that, we're using one of the higher-level split interfaces.
Force-pushed from 2ea704a to 1ad29e8.
…ceSplit split` arg
…expose fileOffset and recordOffset directly.
…the same package as ScanContext
Looks good now. Thanks, @stevenzwu!
This reverts commit d2c26a0.
This PR mainly implements IcebergSourceSplit and its serializer. Other classes connected to the source split (like FlinkSplitPlanner and ScanContext) are also included. As we only tried to minimize the set of connected files in this PR (not the lines of code), there could be some changes within files that aren't directly related to the main purpose of this change (although they are needed by the uber PR #2105).
This is against v1.13 only. Will port to v1.14 when it is ready. We will skip v1.12 for the FLIP-27 source.