@jose-torres
Contributor

What changes were proposed in this pull request?

StreamExecution is now an abstract base class, which MicroBatchExecution (the current StreamExecution) inherits. When continuous processing is implemented, we'll have a new ContinuousExecution implementation of StreamExecution.

A few fields are also renamed to make them less microbatch-specific.
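As a rough illustration of the refactoring described above, the following sketch shows an abstract base class that owns the shared lifecycle while a micro-batch subclass supplies the processing loop. All names here (`StreamExecutionSketch`, `MicroBatchExecutionSketch`, `runLoop`) are hypothetical, simplified stand-ins and not the real Spark signatures.

```scala
// Hypothetical, heavily simplified stand-ins for the classes named in this PR.
abstract class StreamExecutionSketch(val name: String) {
  // The shared lifecycle lives in the base class.
  final def run(): Seq[String] =
    Seq(s"[$name] starting") ++ runLoop() ++ Seq(s"[$name] stopped")

  // Each execution mode supplies its own processing loop (hypothetical hook).
  protected def runLoop(): Seq[String]
}

class MicroBatchExecutionSketch(queryName: String, batches: Int)
    extends StreamExecutionSketch(queryName) {
  // Micro-batch mode: one log line per batch that was run.
  override protected def runLoop(): Seq[String] =
    (0 until batches).map(id => s"[$name] ran batch $id")
}

object HierarchySketchDemo {
  def main(args: Array[String]): Unit =
    new MicroBatchExecutionSketch("query-1", batches = 2).run().foreach(println)
}
```

Under this assumed shape, a future continuous implementation would be another subclass overriding only the loop.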

How was this patch tested?

refactoring only

@SparkQA

SparkQA commented Dec 8, 2017

Test build #84625 has finished for PR 19926 at commit 22d93b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class MicroBatchExecution(
      • abstract class StreamExecution(
      • abstract class QueryExecutionThread(name: String) extends UninterruptibleThread(name)

@SparkQA

SparkQA commented Dec 8, 2017

Test build #84635 has finished for PR 19926 at commit 9f4daf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

/cc @brkyvz @zsxwing

Contributor

Do we really need explicit typing here?

Contributor

I think a reader can easily infer from the code that this is a String-typed variable.

Contributor

Will the noNewData flag still be useful for continuous processing?

Contributor Author

Yes. The flag is really just a test harness; it's only used in processAllAvailable, so tests can block until there's a batch (or now epoch) that doesn't contain any data.
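A minimal sketch of how such a flag can let a test block until a data-free batch (or epoch) is seen, assuming a lock and condition variable. The class and method names (`NoNewDataSignal`, `reportBatch`, `awaitNoNewData`) are hypothetical, and the timeout is added here only to keep the example deterministic; this is not the actual processAllAvailable implementation.

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

// Hypothetical helper: the execution thread reports each batch/epoch,
// and a test thread waits until one of them contained no data.
class NoNewDataSignal {
  private val lock = new ReentrantLock()
  private val changed = lock.newCondition()
  private var noNewData = false

  // Called by the execution thread after every batch or epoch.
  def reportBatch(hadData: Boolean): Unit = {
    lock.lock()
    try {
      noNewData = !hadData
      changed.signalAll()
    } finally lock.unlock()
  }

  // Called by a test: returns true once the most recent batch had no data,
  // or false if the timeout elapses first.
  def awaitNoNewData(timeoutMs: Long): Boolean = {
    lock.lock()
    try {
      var remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMs)
      while (!noNewData && remaining > 0) {
        remaining = changed.awaitNanos(remaining)
      }
      noNewData
    } finally lock.unlock()
  }
}
```

Nothing in this sketch is batch-specific, which is consistent with the point that the flag carries over to epochs.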

Contributor

Shall we make the class name configurable?

Contributor Author

Sorry, I'm not sure what you have in mind here.

Contributor

I mean, how do we switch between ContinuousExecution and MicroBatchExecution?

Contributor Author

My current thinking is to have it be a new trigger type. It can't really be a config, because continuous processing (at least in the initial implementation) won't support all operators.
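One way to picture the trigger-based switch is the sketch below, which selects an execution mode from the trigger and rejects a continuous query that uses an unsupported operator. The trigger and mode types, and in particular the set of supported operators, are invented for illustration; the thread does not say which operators the initial implementation would support.

```scala
// Hypothetical trigger and execution-mode types, for illustration only.
sealed trait TriggerSketch
case class ProcessingTimeSketch(intervalMs: Long) extends TriggerSketch
case class ContinuousSketch(checkpointIntervalMs: Long) extends TriggerSketch

sealed trait ExecutionModeSketch
case object MicroBatchMode extends ExecutionModeSketch
case object ContinuousMode extends ExecutionModeSketch

object ExecutionSelectorSketch {
  // Assumption: an arbitrary example set, not the real list of supported operators.
  private val continuousOperators = Set("map", "filter", "project")

  // The trigger picks the mode; a continuous trigger is refused
  // when the query contains operators outside the supported set.
  def select(
      trigger: TriggerSketch,
      operators: Seq[String]): Either[String, ExecutionModeSketch] = trigger match {
    case ProcessingTimeSketch(_) => Right(MicroBatchMode)
    case ContinuousSketch(_) =>
      val unsupported = operators.filterNot(continuousOperators.contains)
      if (unsupported.isEmpty) Right(ContinuousMode)
      else Left(s"continuous processing does not support: ${unsupported.mkString(", ")}")
  }
}
```

A global config would silently break queries with unsupported operators, whereas a per-query trigger makes the choice explicit and lets it fail early.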

Contributor

We may want to add a new line above this.

Contributor

Since this is a base class for both microbatch and continuous processing, is it right to put this variable here?

Contributor Author

We may want to tweak the variable name, but continuous processing will still need to know how long it should retain commit and offset log entries. Unfortunately we're stuck with the config name, and I don't think it makes sense to introduce a second parallel one doing the same thing.
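To illustrate why the retention setting is not inherently batch-specific, here is a small sketch that keeps only the newest entries of a log, whether the ids are batch ids or epoch ids. The object, method, and parameter names are hypothetical, and the real purge logic may differ.

```scala
object LogRetentionSketch {
  // Returns the entry ids that survive after purging everything older than
  // the newest `minEntriesToRetain` entries (hypothetical parameter name).
  def retained(entryIds: Seq[Long], minEntriesToRetain: Int): Seq[Long] = {
    require(minEntriesToRetain > 0, "must retain at least one entry")
    entryIds.sorted.takeRight(minEntriesToRetain)
  }
}
```

The same threshold can be applied to both the offset log and the commit log, so a second, parallel setting would add nothing.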

Contributor

Yes, tweaking the variable names may make it look better.

Contributor

Although this part of the code is removed, offsetLog is still in the base class, and so is batchCommitLog.

offsetLog may be needed as a write-ahead log (WAL), but should batchCommitLog be moved to MicroBatchStreamExecution?

Contributor Author

The offset log right now has a strict schema that commit information wouldn't fit in. I was planning to keep both logs in the continuous implementation.

Contributor

So, shall we also make them null here and let child classes override them?

Contributor Author

Sure, we could do that.

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84787 has finished for PR 19926 at commit 2fc73c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84810 has finished for PR 19926 at commit 0996023.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84801 has finished for PR 19926 at commit 2fc73c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84811 has finished for PR 19926 at commit 696ed5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 14, 2017

Test build #84923 has finished for PR 19926 at commit baaa933.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@zsxwing zsxwing left a comment

LGTM except one minor comment

   * fully processed, and its output was committed to the sink, hence no need to process it again.
   * This is used (for instance) during restart, to help identify which batch to run next.
   */
  val batchCommitLog = new BatchCommitLog(sparkSession, checkpointFile("commits"))
Member

Let's keep batchCommitLog and offsetLog in the base class, since both subclasses need to initialize them. We can also rename batchCommitLog to commitLog to make it more general.
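The suggestion can be sketched as follows: both logs are initialized once in the base class, so every subclass inherits them. The in-memory `MetadataLogSketch`, the `"offsets"` directory name, and the restart rule in `idToRunOnRestart` are simplifying assumptions made for this example; only `checkpointFile("commits")` appears in the diff quoted above.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a checkpointed metadata log.
class MetadataLogSketch(val path: String) {
  private val ids = mutable.SortedSet.empty[Long]
  def add(id: Long): Unit = { ids += id }
  def latestId: Option[Long] = ids.lastOption
}

abstract class LoggedExecutionSketch(checkpointRoot: String) {
  protected def checkpointFile(name: String): String = s"$checkpointRoot/$name"

  // Initialized in the base class, so subclasses do not repeat this.
  val offsetLog = new MetadataLogSketch(checkpointFile("offsets"))
  val commitLog = new MetadataLogSketch(checkpointFile("commits"))

  // Simplified restart rule (an assumption): rerun the latest planned id
  // unless the commit log shows it already finished.
  def idToRunOnRestart: Long = (offsetLog.latestId, commitLog.latestId) match {
    case (Some(planned), Some(committed)) if committed == planned => planned + 1
    case (Some(planned), _) => planned
    case (None, _) => 0L
  }
}

class MicroBatchLogsSketch(root: String) extends LoggedExecutionSketch(root)
class ContinuousLogsSketch(root: String) extends LoggedExecutionSketch(root)
```

With the neutral name commitLog, the id recorded can be read as a batch id in one subclass and an epoch id in the other.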

Contributor

@brkyvz brkyvz left a comment

LGTM! Really excited to see this move forward

val triggerLogicalPlan = withNewSources transformAllExpressions {
  case a: Attribute if replacementMap.contains(a) =>
    replacementMap(a).withMetadata(a.metadata)
  case ct: CurrentTimestamp =>
Contributor

Have we thought about how these will work with ContinuousProcessing? Will they be set at each start of the epoch?

Contributor Author

That's a major candidate solution, but we're planning to just not support CurrentTimestamp for the initial implementation. It would require significant changes, since control flow won't return here between epochs.
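The difficulty can be seen in a toy version of the per-batch rewrite: in micro-batch mode the plan is rewritten before each batch, so every current-timestamp expression is replaced by one fixed value for that batch. The expression types below are hypothetical stand-ins, not Catalyst classes.

```scala
// Hypothetical miniature expression tree, for illustration only.
sealed trait ExprSketch
case object CurrentTimestampSketch extends ExprSketch
case class LiteralSketch(value: Long) extends ExprSketch
case class AddSketch(left: ExprSketch, right: ExprSketch) extends ExprSketch

object TimestampPinningSketch {
  // Replace every current-timestamp node with the batch's fixed time,
  // so all rows in one batch observe the same value.
  def pin(expr: ExprSketch, batchTimeMs: Long): ExprSketch = expr match {
    case CurrentTimestampSketch => LiteralSketch(batchTimeMs)
    case AddSketch(l, r) => AddSketch(pin(l, batchTimeMs), pin(r, batchTimeMs))
    case lit: LiteralSketch => lit
  }
}
```

In the micro-batch loop this rewrite would run once per batch; a continuous query never comes back to this point between epochs, so there is no natural place to call it again.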

@SparkQA

SparkQA commented Dec 14, 2017

Test build #84924 has finished for PR 19926 at commit 6ccc6ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 14, 2017

Test build #84925 has finished for PR 19926 at commit 52d5f21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 14, 2017

Test build #84926 has finished for PR 19926 at commit 8c75047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Dec 14, 2017

LGTM. Merging to master!

@asfgit asfgit closed this in 59daf91 Dec 14, 2017