[SPARK-22733] Split StreamExecution into MicroBatchExecution and StreamExecution. #19926

jose-torres · 2017-12-07T23:48:57Z

What changes were proposed in this pull request?

StreamExecution is now an abstract base class, which MicroBatchExecution (the current StreamExecution) inherits. When continuous processing is implemented, we'll have a new ContinuousExecution implementation of StreamExecution.
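As a rough illustration of the split described above, here is a minimal sketch of an abstract base class with one micro-batch subclass. The member names (`runActivatedStream`, `processed`, `queryName`) and the toy `Seq[Seq[Int]]` input are assumptions made for this example and are not the real Spark signatures.

```scala
// Simplified sketch only; not the actual Spark classes.
object ExecutionSketch {
  // Lifecycle shared by every execution mode lives in the base class.
  abstract class StreamExecution(val name: String) {
    var started: Boolean = false

    // Each execution mode supplies its own main loop (hypothetical hook).
    protected def runActivatedStream(): Unit

    final def start(): Unit = {
      started = true
      runActivatedStream()
    }
  }

  // Micro-batch mode: run one batch at a time until the input is exhausted.
  class MicroBatchExecution(queryName: String, batches: Seq[Seq[Int]])
      extends StreamExecution(queryName) {
    var processed: Vector[Int] = Vector.empty

    override protected def runActivatedStream(): Unit =
      batches.foreach(batch => processed ++= batch)
  }
}
```

A future continuous implementation would, under the same assumption, be another subclass that overrides the same hook with a long-running loop.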

A few fields are also renamed to make them less microbatch-specific.

How was this patch tested?

Refactoring only.

SparkQA · 2017-12-08T01:53:03Z

Test build #84625 has finished for PR 19926 at commit 22d93b7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class MicroBatchExecution(
abstract class StreamExecution(
abstract class QueryExecutionThread(name: String) extends UninterruptibleThread(name)

SparkQA · 2017-12-08T05:16:41Z

Test build #84635 has finished for PR 19926 at commit 9f4daf6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jose-torres · 2017-12-11T20:50:10Z

/cc @brkyvz @zsxwing

CodingCat · 2017-12-12T18:18:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

Do we really need explicit typing here?

CodingCat · 2017-12-12T18:19:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

I think the reader can easily infer from the code that this is a String-typed variable.

CodingCat · 2017-12-12T18:40:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

Will the noNewData flag still be useful for continuous processing?

Yes. The flag is really just a test harness; it's only used in processAllAvailable, so tests can block until there's a batch (or now epoch) that doesn't contain any data.
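To make the role of the flag concrete, here is a minimal sketch of how a test could block until a batch (or epoch) without data has been seen. `ToyExecution`, `finishBatch`, and `sawEmptyBatch` are hypothetical names for this example, and the locking is a simplification rather than the actual StreamExecution code.

```scala
// Simplified sketch only; assumes a plain monitor instead of Spark's internals.
object NoNewDataSketch {
  class ToyExecution {
    private val lock = new Object
    private var noNewData = false

    // Called by the engine after each batch (or epoch) finishes.
    def finishBatch(hadData: Boolean): Unit = lock.synchronized {
      noNewData = !hadData
      lock.notifyAll()
    }

    // Blocks the caller until a batch with no data is reported.
    def processAllAvailable(): Unit = lock.synchronized {
      noNewData = false
      while (!noNewData) lock.wait(100)
    }

    def sawEmptyBatch: Boolean = lock.synchronized(noNewData)
  }
}
```

Nothing in the sketch is micro-batch specific, which is consistent with the reply above: an epoch that carries no data can set the flag in the same way.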

CodingCat · 2017-12-12T18:41:57Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala

Shall we make the class name configurable?

Sorry, I'm not sure what you have in mind here.

I mean, how do we switch between ContinuousExecution and MicroBatchExecution?

My current thinking is to have it be a new trigger type. It can't really be a config, because continuous processing (at least in the initial implementation) won't support all operators.
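A minimal sketch of what selecting the execution mode from the trigger type could look like. The `Continuous` trigger, its `checkpointIntervalMs` parameter, and `chooseExecution` are assumptions for illustration; the thread does not specify the eventual API.

```scala
// Simplified sketch only; trigger and mode names are hypothetical.
object TriggerDispatchSketch {
  sealed trait Trigger
  case class ProcessingTime(intervalMs: Long) extends Trigger
  // Hypothetical new trigger type that opts a query into continuous mode.
  case class Continuous(checkpointIntervalMs: Long) extends Trigger

  sealed trait ExecutionMode
  case object MicroBatchMode extends ExecutionMode
  case object ContinuousMode extends ExecutionMode

  // The query author picks the mode per query by choosing a trigger,
  // rather than through a global configuration switch.
  def chooseExecution(trigger: Trigger): ExecutionMode = trigger match {
    case _: Continuous => ContinuousMode
    case _             => MicroBatchMode
  }
}
```

Making the choice per query fits the concern raised above: a global config could silently route a query with unsupported operators to the continuous engine.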

CodingCat · 2017-12-12T18:44:41Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala

We may want to add a blank line above this.

CodingCat · 2017-12-12T18:48:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

Since this is a base class for both microbatch and continuous processing, is it right to put this variable here?

We may want to tweak the variable name, but continuous processing will still need to know how long it should retain commit and offset log entries. Unfortunately we're stuck with the config name, and I don't think it makes sense to introduce a second parallel one doing the same thing.

Yes, tweaking the variable names may make it look better.

CodingCat · 2017-12-12T18:55:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

While this part of the code is removed, offsetLog is still in the base class, and the same goes for batchCommitLog.

offsetLog may be needed as a WAL, but shouldn't batchCommitLog be moved to MicroBatchExecution?

The offset log right now has a strict schema that commit information wouldn't fit in. I was planning to keep both logs in the continuous implementation.

So, shall we also make them null here and let child classes override them?

Sure, we could do that.

SparkQA · 2017-12-12T20:58:56Z

Test build #84787 has finished for PR 19926 at commit 2fc73c6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jose-torres · 2017-12-12T22:26:56Z

retest this please

SparkQA · 2017-12-13T00:19:14Z

Test build #84810 has finished for PR 19926 at commit 0996023.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-13T01:05:52Z

Test build #84801 has finished for PR 19926 at commit 2fc73c6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-13T03:10:52Z

Test build #84811 has finished for PR 19926 at commit 696ed5f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-14T18:08:05Z

Test build #84923 has finished for PR 19926 at commit baaa933.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

zsxwing

LGTM except for one minor comment.

zsxwing · 2017-12-14T19:38:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

   * fully processed, and its output was committed to the sink, hence no need to process it again.
   * This is used (for instance) during restart, to help identify which batch to run next.
   */
-  val batchCommitLog = new BatchCommitLog(sparkSession, checkpointFile("commits"))


Let's keep batchCommitLog and offsetLog in the base class, since both subclasses need to initialize them. We can also rename batchCommitLog to commitLog to make it more general.
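A minimal sketch of keeping both logs in the base class, and of how the commit log can help identify which batch to run next on restart, as the doc comment in the diff mentions. `ToyLog` and `batchToRun` are hypothetical stand-ins; the real logs persist to the checkpoint directory and the real recovery logic is more involved.

```scala
// Simplified sketch only; in-memory logs stand in for the checkpoint files.
object LogSketch {
  class ToyLog {
    private var ids = Set.empty[Long]
    def add(batchId: Long): Unit = ids += batchId
    def latest: Option[Long] = if (ids.isEmpty) None else Some(ids.max)
  }

  abstract class StreamExecution {
    // Both logs live in the base class so every subclass shares them.
    val offsetLog = new ToyLog // batches that were planned
    val commitLog = new ToyLog // batches whose output reached the sink

    // Assumed recovery rule for this sketch: rerun a planned but
    // uncommitted batch, otherwise move on to the next batch id.
    def batchToRun: Long = (offsetLog.latest, commitLog.latest) match {
      case (Some(planned), Some(committed)) if planned == committed => committed + 1
      case (Some(planned), _)                                       => planned
      case (None, _)                                                => 0L
    }
  }

  class MicroBatchExecution extends StreamExecution
}
```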

brkyvz

LGTM! Really excited to see this move forward

brkyvz · 2017-12-14T19:51:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+    val triggerLogicalPlan = withNewSources transformAllExpressions {
+      case a: Attribute if replacementMap.contains(a) =>
+        replacementMap(a).withMetadata(a.metadata)
+      case ct: CurrentTimestamp =>


Have we thought about how these will work with continuous processing? Will they be set at the start of each epoch?

That's a major candidate solution, but we're planning to just not support CurrentTimestamp for the initial implementation. It would require significant changes, since control flow won't return here between epochs.
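For readers unfamiliar with the rewrite in the diff above, here is a toy sketch of the idea: before each micro-batch runs, every current-timestamp expression is replaced by a literal, so the whole batch sees one consistent time. The `Expr` tree and `pinTimestamp` are assumptions for this example and are not Catalyst classes.

```scala
// Simplified sketch only; a toy expression tree, not Catalyst.
object TimestampSketch {
  sealed trait Expr
  case object CurrentTimestamp extends Expr
  case class Literal(value: Long) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Replace every CurrentTimestamp node with the time chosen for this batch.
  def pinTimestamp(e: Expr, batchTimeMs: Long): Expr = e match {
    case CurrentTimestamp => Literal(batchTimeMs)
    case Add(l, r)        => Add(pinTimestamp(l, batchTimeMs), pinTimestamp(r, batchTimeMs))
    case lit: Literal     => lit
  }
}
```

The rewrite depends on control returning to the driver before each batch, which is the step the reply above says a continuous query would not have between epochs.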

SparkQA · 2017-12-14T21:49:45Z

Test build #84924 has finished for PR 19926 at commit 6ccc6ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-14T22:28:02Z

Test build #84925 has finished for PR 19926 at commit 52d5f21.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-14T22:30:30Z

Test build #84926 has finished for PR 19926 at commit 8c75047.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2017-12-14T22:30:53Z

LGTM. Merging to master!

CodingCat reviewed Dec 12, 2017

zsxwing mentioned this pull request Dec 14, 2017

[SPARK-22732] Add Structured Streaming APIs to DataSourceV2 #19925

Closed

jose-torres added 6 commits December 14, 2017 10:00

Refactor StreamExecution into a parent class so continuous processing…

6db9856

… can extend it

replace invokePrivate calls with the existing utility method obsoleti…

40ac3f2

…ng them

address fmt

8f418af

slight changes

d609c07

rm spurious space

486f184

fix compile

baaa933

jose-torres force-pushed the continuous-refactor branch from 696ed5f to baaa933 Compare December 14, 2017 18:00

fix rebase failure

6ccc6ec

zsxwing requested changes Dec 14, 2017

jose-torres added 2 commits December 14, 2017 11:40

refactor batchCommitLog -> commitLog

52d5f21

include new file

8c75047

brkyvz approved these changes Dec 14, 2017

asfgit closed this in 59daf91 Dec 14, 2017
