@rxin (Owner) commented Sep 24, 2015

This pull request proposes three alternatives for the StreamFrame API.

  • A1: A single StreamFrame abstraction for both non-windowed and windowed streams. Blocking operations also work on non-windowed streams: they just return a new stream with a new tuple emitted for every update (i.e. they can be used to query partial results).
  • A2: A single StreamFrame abstraction for both non-windowed and windowed streams. Blocking operations throw runtime exceptions on non-windowed streams. This is similar to the PCollection idea in Google's Cloud Dataflow.
  • B: Two abstractions: StreamFrame and WindowedStreamFrame. Blocking operations are only available on WindowedStreamFrame.

At this point, we are mostly looking for feedback on the high-level alternatives (e.g. not the detailed function names, window specs, etc.); a rough sketch of the split follows below.
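To make the comparison concrete, here is a minimal, hypothetical sketch of alternative B; all names and signatures are illustrative only, not a proposed API:

```scala
import org.apache.spark.sql.Column

// Placeholder for whatever window specification the final API would use.
trait WindowSpec

// Alternative B: two abstractions; blocking operations live only on the
// windowed type, so misuse is a compile-time error rather than a runtime one.
trait StreamFrame {
  def filter(condition: Column): StreamFrame         // cheap, per-tuple transformation
  def select(cols: Column*): StreamFrame
  def window(spec: WindowSpec): WindowedStreamFrame  // windowing yields the blocking-capable type
}

trait WindowedStreamFrame {
  def groupBy(cols: Column*): WindowedStreamFrame
  def agg(exprs: Column*): WindowedStreamFrame       // blocking: emits one result per window
}

// Under A1 there would be a single StreamFrame with agg() legal everywhere:
// on a non-windowed stream it would emit an updated partial result for every
// incoming tuple. Under A2 the same call would throw at runtime instead.
```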

@rxin (Owner Author) commented:
Please ignore this file for now.

A reviewer commented:

What does "emitting" a tuple mean? In other words, how does this API tie in with event time and stuff like that?

Generally, the way I understood event time is that you want to define a query based on the event time in the data, and then evaluate it at possibly multiple points in "real" (processing) time to get a different answer for the same event time window.

If you mean that this kind of stream is the same as one with an infinite window, that does make sense.
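To illustrate what I mean by evaluating at multiple points in processing time (a made-up API, purely for illustration; none of these names exist):

```scala
// A count over a 5-minute window defined on the event time embedded in the
// data. "Evaluating" the query at different points in processing time can
// give different answers for the same event-time window.
val counts = stream
  .window(EventTimeWindow($"timestamp", size = "5 minutes"))
  .count()

// Evaluated at processing time 12:06 -> window [12:00, 12:05) = 41 events
// Evaluated at processing time 12:10 -> window [12:00, 12:05) = 45 events,
// because late data for that event-time window arrived in between.
```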

@mateiz commented Sep 28, 2015

Hey, so looking at this quickly, it seems to me that some applications will want infinite windows, and in that case, we can make the StreamFrame-without-a-window be the same thing as a windowed StreamFrame with an infinite window. So I'd go for A1 with these semantics.

@rxin (Owner Author) commented Sep 28, 2015

@mateiz do you mean an infinite window, or a landmark window, i.e. one that is from the beginning of time till now?

@mateiz commented Sep 28, 2015

I mean from the beginning of time until now (in event time). Isn't that what Dataflow also has?

In any case though, the tricky thing seems to be how it will combine with grouping, since people sometimes want to define a window per group. I think you also have to add that to the API before we can decide whether it makes sense.
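For instance, the question is whether something like this would be expressible (hypothetical syntax, invented just to make the per-group case concrete):

```scala
// Sessionization is the classic per-group window: every user gets their own
// session windows, delimited by a gap in that user's activity. SessionWindow
// and gap are invented names for this example.
stream
  .groupBy($"userId")
  .window(SessionWindow(gap = "30 minutes")) // window boundaries computed per group
  .agg(count($"click"))
```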

@jkbradley commented:

I'd definitely rule out A2 since it seems like a hacky version of option B. Between A1 and B, I'd prefer B.

@mateiz Unifying the concepts makes technical sense, but I feel like it's more confusing for people not experienced with streaming. It seems simpler to separate the APIs for expensive operations (blocking) vs. cheap transformations.

Also, with A1, the natural default behavior is to use an infinite window starting from the beginning of time. Unless we think that's the right thing for most use cases, I'd prefer to force users to set a window explicitly.

@mateiz commented Oct 1, 2015

Sure, but just look at the Dataflow paper / API for their motivation and see whether you agree with that. Part of their motivation was to let you use the "same" program for both streaming and batch analysis, which is something that people have complained they can't do with Spark. So we should see how that's addressed (common superclass, easy way to get a snapshot of a stream as a DataFrame, or whatever). That's really the main reason I was thinking about this.
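For instance, a minimal sketch of the "same program" idea (the `DataLike` trait and `snapshot()` below are invented purely for illustration, not something Dataflow or this proposal defines):

```scala
import org.apache.spark.sql.Column

// A common superclass would let the same logic run over batch or streaming
// data. Everything here is hypothetical.
trait DataLike[Self <: DataLike[Self]] {
  def filter(condition: Column): Self
  def select(cols: Column*): Self
}

// Written once, against the abstraction...
def errorsOnly[T <: DataLike[T]](data: T): T =
  data.filter(new Column("level") === "ERROR")

// ...and usable in both modes:
//   errorsOnly(batchDataFrame)          // batch analysis
//   errorsOnly(streamFrame)             // streaming analysis
//   errorsOnly(streamFrame.snapshot())  // snapshot of the stream as a DataFrame
```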

@jkbradley commented:

I see, that's a good argument. I'll take a look at the paper.

@rbkim commented Oct 2, 2015

@mateiz Google Dataflow uses "event time" (determined by the timestamp on the data element itself). I think Spark Streaming also needs to support it, for sessionization.

@rxin (Owner Author) commented Oct 2, 2015

Thanks for commenting, @rbkim. Yes indeed. Those are included here too.

@rbkim commented Oct 3, 2015

@rxin Yes, but we should also consider the timestamp in the DStream, because it is not the "event time".

@rxin (Owner Author) commented Oct 3, 2015

@rbkim There is nothing about DStream in this API, is there?

@rbkim commented Oct 3, 2015

@rxin You're right, there is nothing about DStream in here. I thought you might have a plan to add some DStream-related APIs.

@rxin closed this Jan 19, 2016
rxin pushed a commit that referenced this pull request Apr 23, 2016
…onfig option.

## What changes were proposed in this pull request?

Currently, the `OptimizeIn` optimizer rule replaces an `In` expression with an `InSet` expression if the size of the value set is greater than a constant, 10.
This PR adds a configuration, `spark.sql.optimizer.inSetConversionThreshold`, for that threshold.

After this PR, `OptimizeIn` is configurable:
```scala
scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [a#7 IN (1,2,3) AS (a IN (1, 2, 3))#8]
:     +- INPUT
+- Generate explode([1,2]), false, false, [a#7]
   +- Scan OneRowRelation[]

scala> sqlContext.setConf("spark.sql.optimizer.inSetConversionThreshold", "2")

scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [a#16 INSET (1,2,3) AS (a IN (1, 2, 3))#17]
:     +- INPUT
+- Generate explode([1,2]), false, false, [a#16]
   +- Scan OneRowRelation[]
```
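For reference, the shape of the rule with the new threshold check is roughly as follows (a simplified sketch, not the exact Catalyst source):

```scala
import scala.collection.immutable.HashSet
import org.apache.spark.sql.catalyst.CatalystConf
import org.apache.spark.sql.catalyst.expressions.{In, InSet, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Replace In(v, list) with a hash-set lookup once the list of literals
// exceeds the threshold, which is now read from the config rather than
// being hard-coded to 10.
case class OptimizeIn(conf: CatalystConf) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case In(v, list)
        if list.forall(_.isInstanceOf[Literal]) &&
           list.size > conf.optimizerInSetConversionThreshold =>
      InSet(v, HashSet() ++ list.map(_.asInstanceOf[Literal].value))
  }
}
```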

## How was this patch tested?

Passed the Jenkins tests (with a new test case).

Author: Dongjoon Hyun <[email protected]>

Closes apache#12562 from dongjoon-hyun/SPARK-14796.
rxin pushed a commit that referenced this pull request Sep 16, 2016
…aggregations

## What changes were proposed in this pull request?
Partial aggregations are generated in `EnsureRequirements`, but the planner fails to check whether a partial aggregation satisfies the sort requirements.
For the following query:
```scala
(0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2")
spark.sql("select max(b) from t2 group by a").explain(true)
```
Currently, no Sort operator is inserted before the partial aggregation, and this breaks sort-based partial aggregation:
```
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
      +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
         +- LocalTableScan [a#5, b#6]
```
The correct plan is:
```
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
      +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
         +- *Sort [a#5 ASC], false, 0
            +- LocalTableScan [a#5, b#6]
```

## How was this patch tested?
Added tests in `PlannerSuite`.

Author: Takeshi YAMAMURO <[email protected]>

Closes apache#14865 from maropu/SPARK-17289.