-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14255] [SQL] Streaming Aggregation #12048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Conflicts: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
|
Test build #54469 has finished for PR 12048 at commit
|
|
Test build #54470 has finished for PR 12048 at commit
|
|
Test build #54485 has finished for PR 12048 at commit
|
|
Test this please |
|
test this please |
|
Test build #54543 has finished for PR 12048 at commit
|
|
Test build #2710 has finished for PR 12048 at commit
|
|
@tdas the maintenance test failed twice on this PR, but not the third time. |
|
Aggregation part looks good to me. |
| } catch { | ||
| case NonFatal(e) => | ||
| failTest(message, e) | ||
| if (!condition) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i had written this in this way so that if there are any errors in the lazy eval of condition that gets caught and message printed correctly. Happened to me a few times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All uses of verify are just doing equality checks with variables (and thus can't throw), except for those that were modified specifically such that that were going to throw exceptions upon failure. So I think really the problem is overloading what was a simple assert to be an error handler.
The issue this this construction is it now pollutes the output with the obvious:
condition was false
[info] org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
[info] org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info] org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
[info] org.apache.spark.sql.StreamTest$class.verify$1(StreamTest.scala:228)
[info] org.apache.spark.sql.StreamTest$$anonfun$testStream$1.apply(StreamTest.scala:355)
[info] org.apache.spark.sql.StreamTest$$anonfun$testStream$1.apply(StreamTest.scala:271)
[info] scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info] scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info] org.apache.spark.sql.StreamTest$class.testStream(StreamTest.scala:271)
[info] org.apache.spark.sql.streaming.StreamSuite.testStream(StreamSuite.scala:24)
|
Thanks, I'm going to merge to master and will address further comments in follow-ups. |
This PR adds the ability to perform aggregations inside of a
ContinuousQuery. In order to implement this feature, the planning of aggregation has augmented with a newStatefulAggregationStrategy. Unlike batch aggregation, stateful-aggregation uses theStateStore(introduced in #11645) to persist the results of partial aggregation across different invocations. The resulting physical plan performs the aggregation using the following progression:The following refactoring was also performed to allow us to plug into existing code:
PhysicalAggregationAttributeReferenceused to identify the result of anAggregateFunctionas been moved into theAggregateExpressioncontainer. This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around aMap[(AggregateFunction, Boolean), Attribute]. Further clean up (using a different aggregation container for logical/physical plans) is deferred to a followup.SessionStateinto theQueryExecutionto make it easier to override in the streaming case.StreamTestthat checks only the output of the last batch has been added to simulate the future addition of output modes.