[SPARK-19876][SS][WIP] OneTime Trigger Executor #17219

ghost · 2017-03-09T00:25:17Z

What changes were proposed in this pull request?

An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.

In addition, this patch requires an optimization to StreamExecution: it now logs a commit record after each batch is processed successfully. After a restart, this new commit log, rather than the offset log alone, determines which batch (offsets) to process next. Relying on the offset log alone would always re-process the last logged batch, which would rule out a OneTime trigger.
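The difference between the two recovery strategies can be sketched in a few lines. This is a standalone illustration, not Spark code: the logs are modeled as plain dicts keyed by batch id, and the function names `latest`, `next_batch_offset_log_only`, and `next_batch_with_commit_log` are hypothetical.

```python
from typing import Dict, Optional


def latest(log: Dict[int, object]) -> Optional[int]:
    """Return the highest batch id recorded in a log, or None if it is empty."""
    return max(log) if log else None


def next_batch_offset_log_only(offset_log: Dict[int, object]) -> int:
    # Without a commit log we cannot tell whether the last logged batch
    # finished, so it is always run again (batch 0 if nothing was logged).
    last_planned = latest(offset_log)
    return 0 if last_planned is None else last_planned


def next_batch_with_commit_log(offset_log: Dict[int, object],
                               commit_log: Dict[int, object]) -> int:
    last_planned = latest(offset_log)
    if last_planned is None:
        return 0  # fresh query, nothing was ever planned
    if latest(commit_log) == last_planned:
        return last_planned + 1  # last planned batch completed, move on
    return last_planned  # planned but not committed: fall back and re-run it
```

With a OneTime trigger, the first strategy would re-run the same batch on every restart, while the second advances to new data once the previous batch has been committed.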

How was this patch tested?

A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.

In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:

- The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
- The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
- A OneTime trigger execution that results in an exception being thrown.
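As a rough illustration of the "execute a single batch, then stop" semantics, here is a simplified sketch of two trigger executors. The class names echo the ones in this thread, but the code is a stand-in written for illustration: the `execute(run_batch)` shape and the fixed-sleep loop are assumptions, not the actual Spark implementation.

```python
import time
from typing import Callable


class OneTimeExecutor:
    """Runs the batch function exactly once and returns."""

    def execute(self, run_batch: Callable[[], bool]) -> None:
        # The return value is ignored: there is never a second batch.
        run_batch()


class ProcessingTimeExecutor:
    """Keeps running batches until the batch function returns False."""

    def __init__(self, interval_seconds: float) -> None:
        self.interval_seconds = interval_seconds

    def execute(self, run_batch: Callable[[], bool]) -> None:
        # Simplified: sleeps a fixed interval between batches.
        while run_batch():
            time.sleep(self.interval_seconds)
```

An exception raised inside `run_batch` propagates out of `execute` in both cases, which is the situation the third test case above exercises.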

@marmbrus @tdas @zsxwing

Please review http://spark.apache.org/contributing.html before opening a pull request.

marmbrus

This is great! Thanks for working on it.

marmbrus · 2017-03-09T00:35:35Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetCommitLog.scala

+    }
+    val version = lines.next()
+    if (version != OffsetCommitLog.VERSION) {
+      throw new IllegalStateException(s"Unknown log version: ${version}")


We should make sure the error here is consistent with the work being done in #17070

marmbrus · 2017-03-09T00:36:07Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetCommitLog.scala

+
+object OffsetCommitLog {
+  private val VERSION = "v1"
+  private val SERIALIZED_VOID = "-"


Why not make this an empty json object? {}
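For illustration, a commit-log entry with a version header and an empty JSON object as payload could look like the sketch below. It is standalone Python, not the Scala class under review; the helper names are hypothetical, and `ValueError` stands in for the `IllegalStateException` used in the patch.

```python
import io
import json

VERSION = "v1"


def serialize_commit(out) -> None:
    """Write the version header followed by an empty JSON object."""
    out.write(VERSION + "\n")
    out.write(json.dumps({}))


def deserialize_commit(inp) -> dict:
    """Parse an entry, rejecting empty files and unknown versions."""
    lines = inp.read().splitlines()
    if not lines:
        raise ValueError("Incomplete log file in the offset commit log")
    if lines[0] != VERSION:
        raise ValueError(f"Unknown log version: {lines[0]}")
    if len(lines) < 2:
        # Header present but payload missing: also treated as incomplete.
        raise ValueError("Incomplete log file in the offset commit log")
    return json.loads(lines[1])
```

An empty object leaves room to add fields later without changing how existing entries are parsed.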

marmbrus · 2017-03-09T00:41:13Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/Trigger.scala

+ *   df.write.trigger(OneTime)
+ * }}}
+ *
+ * Java Example:


I don't think this works?

Yes, this doesn't. Please fix them.

marmbrus · 2017-03-09T00:41:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetCommitLog.scala

+
+import org.apache.spark.sql.SparkSession
+
+class OffsetCommitLog(sparkSession: SparkSession, path: String)


Scala doc please

marmbrus · 2017-03-09T00:42:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

              finishTrigger(dataAvailable)
              if (dataAvailable) {
                // We'll increase currentBatchId after we complete processing current batch's data
+                commitLog.add(currentBatchId, None)


I wonder if we should make this async?

marmbrus · 2017-03-09T00:45:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

-          case lastOffsets =>
-            committedOffsets = lastOffsets.toStreamProgress(sources)
-            logDebug(s"Resuming with committed offsets: $committedOffsets")
+        currentBatchId = commitLog.getLatest() match {


Mind adding a few more comments here? This logic is getting very dense. I think it's doing something like the following:

- finding the max committed batch
- checking to see if there is a started but uncommitted batch
- otherwise constructing a new batch

SparkQA · 2017-03-09T01:43:57Z

Test build #74231 has finished for PR 17219 at commit 682eb1a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class OneTimeExecutor() extends TriggerExecutor

SparkQA · 2017-03-09T22:47:51Z

Test build #74282 has finished for PR 17219 at commit a129dd5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class OneTime() extends Trigger

SparkQA · 2017-03-10T19:17:14Z

Test build #74328 has finished for PR 17219 at commit b4ef029.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-10T21:14:40Z

Test build #74333 has finished for PR 17219 at commit 7cb43b7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-10T23:51:17Z

Test build #74347 has finished for PR 17219 at commit 98812cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-11T04:38:02Z

Test build #74365 has finished for PR 17219 at commit 8c5b84f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-13T21:56:18Z

Test build #74465 has finished for PR 17219 at commit 573ec98.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2017-03-17T20:22:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala


+  /**
+   * A log that records the committed batch ids. This is used to check if a batch was committed
+   * on restart, instead of (possibly) re-running the previous batch.


nit: "if a batch was committed on restart" sounds like batches are supposed to get committed only on restart. :)

Also, keep the comment generic so that it does not imply that only the previous batch will be re-run. In the future we could rerun multiple batches.

tdas · 2017-03-17T20:25:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetCommitLog.scala

+    // called inside a try-finally where the underlying stream is closed in the caller
+    val lines = IOSource.fromInputStream(in, UTF_8.name()).getLines()
+    if (!lines.hasNext) {
+      throw new IllegalStateException("Incomplete log file")


can you say "incomplete log file in the offset commit log"

tdas · 2017-03-17T20:32:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetCommitLog.scala

+}
+
+object OffsetCommitLog {
+  private val VERSION = 1


Let's be consistent with other logs in writing "v1" for version and not "1"

tdas · 2017-03-17T20:56:05Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

+   *  The basic structure of this method is as follows:
+   *
+   *  Identify (from the offset log) the offsets used to run the last batch
+   *  IF a last batch exists THEN


"a last batch" is grammatically weird .. isnt it?

tdas · 2017-03-17T20:56:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

+   *
+   *  Identify (from the offset log) the offsets used to run the last batch
+   *  IF a last batch exists THEN
+   *    Set the next batch to that last batch


Maybe "set the next batch to be executed as the last recovered batch"

tdas · 2017-03-17T20:57:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

+   *    Set the next batch to that last batch
+   *    Check the commit log to see which batch was committed last
+   *    IF the last batch was committed THEN
+   *      Call getBatch using the last batch start and end offsets


Add the reason regarding why we do this.

tdas · 2017-03-17T20:59:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

-
-        offsetLog.get(batchId - 1).foreach {
-          case lastOffsets =>
+        if (batchId > 0) {


why are we introducing this condition?

tdas · 2017-03-17T21:01:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

   */
  private def populateStartOffsets(): Unit = {
    offsetLog.getLatest() match {
      case Some((batchId, nextOffsets)) =>


can you rename batchId to something more descriptive so that we can semantically differentiate it from the currentBatchId?

tdas · 2017-03-17T21:02:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

+        if (batchId < currentBatchId) {
+          /* The last batch was successfully committed, so we can safely process a
+           * new next batch but first:
+           * Make a call a call to getBatch using the offsets from previous batch.


"a call" is present twice.

also when referring to getBatch, use source.getBatch to be more clear.

tdas · 2017-03-17T21:13:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

+            if batchId == committedBatchId => committedBatchId + 1
+          case _ => batchId
+        }
+        if (batchId < currentBatchId) {


The above match-case, and this if statement essentially are the same semantic conditions - both will be true or both will be false. So might as well merge these two into a single if condition.

val completedBatchId = completedLog.getLatest()
if (completedBatchId.isDefined && completedBatchId.get == batchId) {
  // call source.getBatch
  currentBatchId = completedBatchId.get + 1
} else {
  // warn if completedBatchId.get < batchId - 1
  currentBatchId = batchId
}

tdas · 2017-03-17T21:54:05Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/Trigger.scala

+ *
+ * @since 2.2.0
+ */
+@Experimental


You need python APIs as well.

tdas · 2017-03-22T10:26:41Z

python/pyspark/sql/streaming.py

        self._jwrite = self._jwrite.queryName(queryName)
        return self

-    @keyword_only


Removed keyword_only, otherwise it's weird if we have to write
writeStream.trigger(trigger=OneTime())

tdas · 2017-03-22T10:28:15Z

python/pyspark/sql/streaming.py

-    @keyword_only
    @since(2.0)
-    def trigger(self, processingTime=None):
+    def trigger(self, trigger=None, processingTime=None):


Added this before processingTime so that we can use positional param and write trigger(OneTime())
Does not break existing APIs. See doctest examples.
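To make the compatibility argument concrete, here is a standalone sketch of how a positional `trigger` parameter placed before `processingTime` keeps old keyword-style calls working. It is not the PySpark implementation: `Trigger`, `OneTime`, and `ProcessingTime` are minimal stand-ins defined here, and the validation rules are assumptions.

```python
class Trigger:
    """Stand-in base class for triggers (hypothetical, for illustration)."""


class OneTime(Trigger):
    """Stand-in for a trigger that runs a single batch."""


class ProcessingTime(Trigger):
    """Stand-in for an interval-based trigger."""

    def __init__(self, interval: str) -> None:
        self.interval = interval


def trigger(trigger=None, processingTime=None) -> Trigger:
    # Exactly one of the two arguments may be supplied.
    if trigger is not None and processingTime is not None:
        raise ValueError("Specify either trigger or processingTime, not both")
    if trigger is None and processingTime is None:
        raise ValueError("No trigger provided")
    if processingTime is not None:
        # Old call style: trigger(processingTime="5 seconds")
        if not isinstance(processingTime, str) or not processingTime.strip():
            raise ValueError("processingTime must be a non-empty string")
        return ProcessingTime(processingTime.strip())
    # New call style: trigger(OneTime())
    if not isinstance(trigger, Trigger):
        raise TypeError("trigger must be a Trigger instance")
    return trigger
```

Because `processingTime` was previously keyword-only, existing callers always pass it by name, so inserting a new first positional parameter does not change what their calls mean.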

SparkQA · 2017-03-22T12:03:39Z

Test build #75045 has finished for PR 17219 at commit ae92ec6.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class OneTime(Trigger):

SparkQA · 2017-03-23T01:03:33Z

Test build #75074 has finished for PR 17219 at commit 64cd233.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-23T01:50:16Z

Test build #75077 has finished for PR 17219 at commit f928ade.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-23T04:28:10Z

Test build #75080 has finished for PR 17219 at commit 0925965.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-23T06:37:33Z

Test build #75086 has started for PR 17219 at commit db5ae3f.

tdas · 2017-03-23T08:58:18Z

Jenkins test this please

SparkQA · 2017-03-23T11:15:45Z

Test build #3607 has finished for PR 17219 at commit db5ae3f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Trigger

SparkQA · 2017-03-23T11:16:01Z

Test build #75091 has finished for PR 17219 at commit db5ae3f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Trigger

SparkQA · 2017-03-23T21:18:44Z

Test build #75108 has finished for PR 17219 at commit 0c3e20c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sameeragarwal · 2017-03-23T21:55:44Z

Seems like this broke the 2.10 builds:
spark-master-compile-maven-scala-2.10 #3992
spark-master-compile-sbt-scala-2.10 #4077

Author: Tyson Condie
Author: Tathagata Das
Closes apache#17219 from tcondie/stream-commit.
(cherry picked from commit 746a558)

[SPARK-19876][BUILD] Move Trigger.java to java source hierarchy

## What changes were proposed in this pull request?

Simply moves `Trigger.java` to `src/main/java` from `src/main/scala`. See apache#17219

## How was this patch tested?

Existing tests.

Author: Sean Owen
Closes apache#17921 from srowen/SPARK-19876.2.
(cherry picked from commit 25ee816)
Signed-off-by: Herman van Hovell
(cherry picked from commit c7bd909)

Tyson Condie added 4 commits March 6, 2017 11:05

update

08a36f8

Merge branch 'master' of https://github.com/apache/spark into stream-commit

0e21d0e

update existing tests

9b8abb4

add onetime trigger test

682eb1a

marmbrus suggested changes Mar 9, 2017

address most comments from @marmbrus

a129dd5

Tyson Condie added 2 commits March 10, 2017 09:47

update deals with initialization of currentPartitionOffsets (KafkaSource) after restart

3e666b1

update

b4ef029

add comment for current batch initialization after restart hack i.e., an empty call to getBatch on each source using the last committed offset as the end offset

7cb43b7

update to use commit method instead of getBatch

98812cb

revise logic for intializing next batch to process after restart @tdas

8c5b84f

update

573ec98

tdas reviewed Mar 17, 2017

Added python params and tests for OneTime()

ae92ec6

tdas reviewed Mar 22, 2017

tdas added 3 commits March 22, 2017 18:04

Refactored Trigger APIs

64cd233

Fixed python bug

8b50da3

Fixed RAT

f928ade

Fixed python lint

0925965

Fixed mima

db5ae3f

Fix python bug

0c3e20c

asfgit closed this in 746a558 Mar 23, 2017

srowen mentioned this pull request May 9, 2017

[SPARK-19876][BUILD] Move Trigger.java to java source hierarchy #17921

Closed

HeartSaVioR mentioned this pull request Jul 4, 2019

[SPARK-28199][SS] Move Trigger implementations to Triggers.scala and avoid exposing these to the end users #24996

Closed

