
Conversation

@XuQianJin-Stars
Contributor

An implementation of a Spark Structured Streaming offset that tracks the currently processed files of an Iceberg table. This PR is split out of PR-796, Structured Streaming read for Iceberg.

@github-actions github-actions bot added the spark label Jan 14, 2021
int version = JsonUtil.getInt(VERSION, node);
if (version > CURR_VERSION) {
throw new IOException(String.format("Cannot deserialize a JSON offset from version %d. %d is not compatible " +
"with the version of Iceberg %d and cannot be used. Please use a compatible version of Iceberg " +
Member

This is a bit confusing because it's not the Iceberg version, but the streaming offset version. Maybe just phrase it instead as "This version of Iceberg only supports version $curversion".

* snapshot.
* snapshot_fully_processed: Denote whether the current snapshot is fully processed, to avoid revisiting the processed
* snapshot.
*/
Contributor

Somewhat of a nit: Can you please make this a proper javadoc comment, such as using @param before the constructor parameters, listing out all of the constructor parameters in order, as well as formatting the constructor parameters the way that they are in the code (i.e. using camel case and not snake case)? I would say also that the line This StreamingOffset consists of: will be unnecessary if you do that.
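The Javadoc style being asked for could look roughly like this. This is a hypothetical sketch, not the PR's actual code: the class name `StreamingOffsetDoc` and the getter are illustrative, while the parameter names mirror the constructor under review.

```java
public class StreamingOffsetDoc {
  private final long snapshotId;
  private final int index;
  private final boolean scanAllFiles;
  private final boolean snapshotFullyProcessed;

  /**
   * An offset that tracks streaming-read progress through an Iceberg table.
   *
   * @param snapshotId the ID of the snapshot currently being processed
   * @param index the index of the last scanned file in the snapshot
   * @param scanAllFiles whether to scan all files in the snapshot
   * @param snapshotFullyProcessed whether the snapshot has been fully processed
   */
  StreamingOffsetDoc(long snapshotId, int index, boolean scanAllFiles, boolean snapshotFullyProcessed) {
    this.snapshotId = snapshotId;
    this.index = index;
    this.scanAllFiles = scanAllFiles;
    this.snapshotFullyProcessed = snapshotFullyProcessed;
  }

  long snapshotId() {
    return snapshotId;
  }

  public static void main(String[] args) {
    StreamingOffsetDoc offset = new StreamingOffsetDoc(1L, 0, true, false);
    System.out.println(offset.snapshotId());
  }
}
```

Each `@param` tag follows the constructor's declaration order and uses the camel-case Java names rather than the snake-case JSON field names, which is what the reviewer is requesting.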

@XuQianJin-Stars
Contributor Author

Thanks @RussellSpitzer and @kbendick for reviewing this PR.

@XuQianJin-Stars
Contributor Author

Hi @aokolnychyi @rdblue, please take a look at this PR at your convenience.

}

static StreamingOffset fromJson(String json) {
Preconditions.checkNotNull(json, "The input JSON string is null");
Contributor

Nit: it might be best to have a more explanatory message in the Preconditions check. Something like The input JSON string representation of a StreamingOffset cannot be null. I'll leave that up to you / the others to decide as it's a minor nit.

// The version of StreamingOffset. The offset was created with a version number
// that is used for validation when deserializing from a JSON string.
int version = JsonUtil.getInt(VERSION, node);
if (version > CURR_VERSION) {
Contributor

Do we not plan to support version 2 snapshot files yet? I suppose that makes sense given the need to process deletes etc with the version 2 snapshots, but looking to understand what other reasons might exist for not supporting them yet.

Contributor

Right now, I think we want to focus on v1 data. But I think this version is actually referring to the version of the offset JSON, in case we need to change it later.

private static final String SNAPSHOT_ID = "snapshot_id";
private static final String INDEX = "index";
private static final String SCAN_ALL_FILES = "scan_all_files";
private static final String SNAPSHOT_FULLY_PROCESSED = "snapshot_fully_processed";
Contributor

Now that data and delete files can report the row position within a manifest, I don't think that we need to use SNAPSHOT_FULLY_PROCESSED anymore. Even if manifests are filtered, the position will be reliable. So by using position for the index field in this offset, we can check whether a given manifest has been completely processed by comparing the index with the number of entries in the manifest.

I think that getting rid of this field and using a simpler offset is a good improvement.
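The simplification described above could be sketched as follows. This is a hypothetical illustration: `snapshotFullyProcessed` and `entryCount` are stand-in names, with `entryCount` representing the number of entries in the manifest.

```java
// Hypothetical sketch: with reliable row positions, "fully processed" can be
// derived from the position rather than stored as a separate offset field.
public class OffsetCheck {
  // index: row position just past the last processed entry;
  // entryCount: total number of entries in the manifest
  static boolean snapshotFullyProcessed(long index, long entryCount) {
    return index >= entryCount;
  }

  public static void main(String[] args) {
    System.out.println(snapshotFullyProcessed(3, 10));   // still in progress
    System.out.println(snapshotFullyProcessed(10, 10));  // fully processed
  }
}
```

Dropping the `snapshot_fully_processed` field this way shrinks the serialized offset and removes a piece of state that could otherwise drift out of sync with the index.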

* @param snapshotId The current processed snapshot id.
* @param index The index of last scanned file in snapshot.
* @param scanAllFiles Denote whether to scan all files in a snapshot, currently we only
* scan all files in the starting snapshot.
Contributor

Docs should not refer to currently. If you want to give context for why this might be used, then use "for example". "Whether to scan all files in a snapshot; for example, to read all data when starting a stream"

* revisiting the processed snapshot.
*/
StreamingOffset(long snapshotId, int index, boolean scanAllFiles,
boolean snapshotFullyProcessed) {
Contributor

Nit: looks like this newline is unnecessary.

}

static StreamingOffset fromJson(String json) {
Preconditions.checkNotNull(json, "The input JSON string representation of a StreamingOffset cannot be null");
Contributor

Error messages typically follow the pattern "Invalid ..." or "Cannot (action): (reason) [suggestion]". Here, there is no suggestion for fixing it, so it would be something like "Cannot parse offset JSON: null".


return new StreamingOffset(snapshotId, index, shouldScanAllFiles, snapshotFullyProcessed);
} catch (IOException e) {
throw new IllegalStateException(String.format("Failed to parse StreamingOffset from JSON string %s", json), e);
Contributor

I think this should be IllegalArgumentException instead. There is no state in this static method.
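The suggested change could look like the following. This is a hypothetical, self-contained sketch: `parse` is a stand-in for the real `fromJson`, which would wrap the Jackson parsing call instead of the toy check used here.

```java
public class ParseExample {
  // A parse failure in a static factory is caused by the caller's input, so
  // it surfaces as IllegalArgumentException rather than IllegalStateException.
  static int parse(String json) {
    try {
      if (json.isEmpty()) {
        throw new java.io.IOException("empty input");
      }
      return json.length();
    } catch (java.io.IOException e) {
      throw new IllegalArgumentException(
          String.format("Failed to parse StreamingOffset from JSON string %s", json), e);
    }
  }

  public static void main(String[] args) {
    try {
      parse("");
    } catch (IllegalArgumentException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

Keeping the original `IOException` as the cause preserves the underlying parse error for debugging while presenting the caller-facing exception type.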


import static org.apache.iceberg.types.Types.NestedField.optional;

public abstract class TestStructuredStreamingRead {
Contributor

This class has no tests? Can you remove it?

// used to validate when deserializing from json string.
int version = JsonUtil.getInt(VERSION, node);
if (version > CURR_VERSION) {
throw new IOException(String.format("This version of iceberg only supports version %s", CURR_VERSION));
Contributor

This should not throw IOException, I think it would be IllegalArgumentException, which means you could use a Precondition. Also, I'd prefer to have the error message follow conventions:

    Preconditions.checkArgument(version == CURR_VERSION,
        "Cannot parse offset JSON: offset version %s is not supported", version);

@XuQianJin-Stars
Contributor Author

Thanks again to @rdblue and @kbendick for reviewing this PR. I have addressed the comments.

public ExpectedException exceptionRule = ExpectedException.none();

@BeforeClass
public static void startSpark() throws IOException {
Contributor

This PR introduces a simple class that can serialize/deserialize itself using JSON.

I don't see any value in having a test that creates a Spark session and does dataframe operations. I think what needs to be tested is that you can create an offset, get the expected values from the getter methods, serialize, and deserialize correctly. Those don't involve Spark, so I don't see a reason to slow down tests overall by adding Spark tests here.

Can you remove the Spark code and do some basic tests for serialization?
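A dependency-free sketch of the kind of round-trip check being asked for. This is hypothetical: a real test would exercise the PR's own serialization methods, while the tiny `toJson` stand-in here just keeps the example self-contained.

```java
public class OffsetRoundTrip {
  // Hypothetical minimal serializer mirroring the offset's JSON field names.
  static String toJson(long snapshotId, int index, boolean scanAllFiles) {
    return String.format(
        "{\"version\":1,\"snapshot_id\":%d,\"index\":%d,\"scan_all_files\":%b}",
        snapshotId, index, scanAllFiles);
  }

  public static void main(String[] args) {
    String json = toJson(42L, 7, true);
    // A serialization test verifies each field appears with the expected value.
    if (!json.contains("\"snapshot_id\":42")) throw new AssertionError(json);
    if (!json.contains("\"index\":7")) throw new AssertionError(json);
    if (!json.contains("\"scan_all_files\":true")) throw new AssertionError(json);
    System.out.println("ok");
  }
}
```

Tests of this shape run in milliseconds because they construct offsets directly, with no Spark session or dataframe operations involved.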

@rdblue
Contributor

rdblue commented Jan 29, 2021

Thanks for updating this, @XuQianJin-Stars. I think the implementation looks good now. We just need to fix up the tests. Thanks!

@XuQianJin-Stars
Contributor Author

Hi @rdblue, thanks again. I have updated it.

@XuQianJin-Stars
Contributor Author

Hi @rdblue, would you please take another look at this PR at your convenience?

@rdblue
Contributor

rdblue commented Feb 19, 2021

Will do. Thanks for pinging me.

@rdblue
Contributor

rdblue commented Feb 21, 2021

Looks good now. I'll merge this. Thanks @XuQianJin-Stars!
