[SPARK-29348][SQL] Add observable Metrics for Streaming queries #26127
Conversation
Observable metrics are named arbitrary aggregate functions that can be defined on a query (Dataframe). As soon as the execution of a Dataframe reaches a completion point (e.g. finishes a batch query or reaches a streaming epoch), a named event is emitted that contains the metrics for the data processed since the last completion point.

A user can observe these metrics by attaching a listener to the Spark session; which listener to attach depends on the execution mode (see the batch sketch below):

- Batch: `QueryExecutionListener`. This will be called when the query completes. A user can access the metrics by using the `QueryExecution.observedMetrics` map.
- Streaming: `StreamingQueryListener`. This will be called when the streaming query completes an epoch. A user can access the metrics by using the `StreamingQueryProgress.observedMetrics` map.

Tests:

- Added unit tests for the `CollectMetrics` logical node to the `AnalysisSuite`.
- Added unit tests for `StreamingProgress` JSON serialization to the `StreamingQueryStatusAndProgressSuite`.
- Added integration tests for streaming to the `StreamingQueryListenerSuite`.
- Added integration tests for batch to the `DataFrameCallbackSuite`.
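A minimal usage sketch of the batch path described above (the metric name `my_metrics` and the column aliases are illustrative; `spark` is an existing `SparkSession`):

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.functions._
import org.apache.spark.sql.util.QueryExecutionListener

// Print the named metrics of every successfully completed batch query.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    qe.observedMetrics.get("my_metrics").foreach { row =>
      println(s"rows=${row.getAs[Long]("rows")}, sum=${row.getAs[Long]("sum")}")
    }
  }
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
})

// Attach two named aggregates; the event fires once the Dataset reaches a
// completion point (here: collect()), covering the data processed so far.
spark.range(100)
  .observe("my_metrics", count(lit(1)).as("rows"), sum(col("id")).as("sum"))
  .collect()
```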
Test build #112110 has finished for PR 26127 at commit
I like the idea in general. Just for the sake of understanding: is this meant to be an alternative to #21721, or do we still have #21721 unresolved even after this? And this approach would have the same issue pointed out in #21721: it won't work for continuous processing. I would like to see some discussion/decision on whether we can ignore continuous processing when designing an API for streaming queries.
This seems to have nothing to do with DS v2, but is a new API in ... From the PR description, this should support both micro-batch and continuous streaming, as epoch is a general concept. I haven't looked into the code yet, but do we support continuous streaming in this PR?
Test build #112158 has finished for PR 26127 at commit
@cloud-fan @HeartSaVioR the current approach uses ... I will update the description to underline that this only works with (micro) batch.
Test build #112224 has finished for PR 26127 at commit
If it's justified to have ...
```scala
  /**
   * Validate that collected metrics names are unique. The same name cannot be used for metrics
   * with different results. However, multiple instances of metrics with the same result and name
```
will we eliminate the duplicated metrics (same name and result)?
I will do that in a follow-up.
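For reference, a hedged sketch of the uniqueness rule from the hunk above (`df` and `other` are assumed to be union-compatible DataFrames; names are illustrative, not tested code):

```scala
import org.apache.spark.sql.functions._

// Allowed: the same observed subtree appears twice, so name and result match.
val observed = df.observe("metric", count(lit(1)).as("rows"))
observed.union(observed)

// Rejected at analysis time: the same name bound to a different result.
df.observe("metric", count(lit(1)).as("rows"))
  .union(other.observe("metric", sum(col("x")).as("sum")))
```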
```scala
   * - Has only attributes that are nested inside an aggregate function.
   *
   * @param e expression to check.
   * @param seenAggregate `true` iff one of the parents of the expression is an aggregate function.
```
shall we check this in CheckAnalysis? We have a similar check there for Aggregate.
Done.
```scala
   * completion point.
   * Please note that continuous execution is currently not supported.
   *
   * The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or
```
Does it follow the same rules as `Aggregate#aggregateExpressions`?
It follows the rules for a global aggregate.
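Concretely, following the rule quoted in the hunk above (illustrative only; `df` is any DataFrame with a `ts` column):

```scala
import org.apache.spark.sql.functions._

// Valid: a literal, plus attributes that appear only inside aggregates.
df.observe("stats", lit(42).as("answer"), max(col("ts")).as("latest_ts"))

// Invalid: a bare attribute outside any aggregate function fails analysis,
// just as it would in a global aggregate without a grouping key.
df.observe("stats", col("ts"))
```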
```diff
   val sources: Array[SourceProgress],
-  val sink: SinkProgress) extends Serializable {
+  val sink: SinkProgress,
+  @JsonDeserialize(contentAs = classOf[GenericRowWithSchema])
```
what does it mean?
It means that Jackson can assume that the rows are instances of `GenericRowWithSchema`.
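In other words, the annotation pins the concrete type Jackson instantiates for the map values. A simplified sketch (the field name matches the PR; the surrounding class is made up for illustration):

```scala
import java.{util => ju}
import com.fasterxml.jackson.databind.annotation.JsonDeserialize
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

class ProgressLike(
    // Row is abstract, so Jackson cannot instantiate it on its own; contentAs
    // tells it to materialize each map value as a GenericRowWithSchema.
    @JsonDeserialize(contentAs = classOf[GenericRowWithSchema])
    val observedMetrics: ju.Map[String, Row])
```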
Alright, let me update this one.

@cloud-fan I have updated, PTAL
Test build #114590 has finished for PR 26127 at commit
retest this please
Test build #114636 has finished for PR 26127 at commit
```scala
class DataTypeJsonSerializer extends JsonSerializer[DataType] {
  private val delegate = new JValueSerializer
  override def serialize(
    value: DataType,
```
nit: 4 space indentation
```scala
    StructType.fromAttributes(metricExpressions.map(_.toAttribute))
  }

  private lazy val toRowConverter: InternalRow => Row = {
```
Let's add a comment that it's better to use the interpreted version here, as it's not called very frequently.
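For context, a sketch of an interpreted converter in that spirit (assuming `schema` is the `StructType` built above; the PR's actual helper may differ):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}

// Interpreted (non-codegen) conversion is acceptable here: metrics are
// converted once per completion point, not once per input row.
private lazy val toRowConverter: InternalRow => Row = {
  val converter = CatalystTypeConverters.createToScalaConverter(schema)
  row => converter(row).asInstanceOf[Row]
}
```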
```scala
  override def outputPartitioning: Partitioning = child.outputPartitioning

  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
```
shall we override canonicalize?
I don't think we should. If we do that, we might eliminate a CollectMetrics operator when reusing exchanges.
cloud-fan left a comment:
LGTM except a few minor comments
Test build #114738 has finished for PR 26127 at commit

Test build #114739 has finished for PR 26127 at commit

retest this please

Test build #114758 has finished for PR 26127 at commit

Merging this. Thanks for the reviews!
### What changes were proposed in this pull request?

Observable metrics are named arbitrary aggregate functions that can be defined on a query (Dataframe). As soon as the execution of a Dataframe reaches a completion point (e.g. finishes a batch query or reaches a streaming epoch), a named event is emitted that contains the metrics for the data processed since the last completion point.

A user can observe these metrics by attaching a listener to the Spark session; which listener to attach depends on the execution mode:

- Batch: `QueryExecutionListener`. This will be called when the query completes. A user can access the metrics by using the `QueryExecution.observedMetrics` map.
- (Micro-batch) Streaming: `StreamingQueryListener`. This will be called when the streaming query completes an epoch. A user can access the metrics by using the `StreamingQueryProgress.observedMetrics` map. Please note that we currently do not support continuous execution streaming.

### Why are the changes needed?

This enables observable metrics.

### Does this PR introduce any user-facing change?

Yes. It adds the `observe` method to `Dataset`.

### How was this patch tested?

- Added unit tests for the `CollectMetrics` logical node to the `AnalysisSuite`.
- Added unit tests for `StreamingProgress` JSON serialization to the `StreamingQueryStatusAndProgressSuite`.
- Added integration tests for streaming to the `StreamingQueryListenerSuite`.
- Added integration tests for batch to the `DataFrameCallbackSuite`.

Closes apache#26127 from hvanhovell/SPARK-29348.

Authored-by: herman <[email protected]>
Signed-off-by: herman <[email protected]>
```scala
      executedPlan.execute(), sparkSession.sessionState.conf)

  /** Get the metrics observed during the execution of the query plan. */
  def observedMetrics: Map[String, Row] = CollectMetricsExec.collect(executedPlan)
```
Based on this PR description, observedMetrics sounds like a public API for end users to use.
However, it conflicts with what we commented above [in toRDD]. QueryExecution is not a public class and we discourage users from calling these internal functions.
I think `queryExecution` is exposed in `Dataset` as an unstable API. I think the comments in the class imply that as well:
```scala
 * The primary workflow for executing relational queries using Spark. Designed to allow easy
 * access to the intermediate phases of query execution for developers.
 *
 * While this is not a public class, we should avoid changing the function names for the sake of
 * changing them, because a lot of developers use the feature for debugging.
```
I agree that using methods in this class here is discouraged, though. Maybe we have to mark `Dataset.observe` as an unstable API or developer API too for now, if it is difficult to avoid adding and using an API here.
Technically, QueryExecution should be a public class, as two methods in QueryExecutionListener are marked as @DeveloperApi and the signatures of those methods contain QueryExecution. If we worry about restrictions on modifying QueryExecution due to it being a public class, we may need another class used only for reporting to QueryExecutionListener.
@DeveloperApi implies:

> A lower-level, unstable API intended for developers.

I agree that it might have to be considered a public, but unstable, class. I am sure most of the methods there are discouraged even for developers to use without knowing exactly what they do.
Ah, my bad. Thanks for pointing it out.

Though I still feel the listener stuff is more like a public API, as it can be used for more than debugging. QueryExecutionListener and StreamingQueryListener take different approaches: QueryExecutionListener directly exposes a low-level internal instance and lets end users handle it with caution, whereas StreamingQueryListener has dedicated classes that contain information for reporting only, without exposing internals.

IMHO, StreamingQueryListener is the right way to go, but QueryExecutionListener has been around for a long time, so no strong opinion.
Yeah, StreamingQueryProgress.observedMetrics with StreamingQueryListener seems fine. The problem here looks to be only QueryExecution.observedMetrics with QueryExecutionListener, which seems to have a contradiction about its stability. It seems it has to be fixed either by avoiding adding it to QueryExecution, or by explicitly marking Dataset.observe as an unstable API (or developer API).
+1. We should make the batch listener not rely on QueryExecution.
I am not sure what the issue is.

By definition anything in QueryExecution is internal, unstable API. The reason that I added it here is that it is a good narrow waist to add this to: you want to collect the metrics for the entire query. It is public because most other methods in this class are public; this allows for some clever integrations (for the more adventurous developer) and makes debugging easier.

The batch listener is marked as experimental. A developer should be warned when they use this API for anything (including using it to collect observable metrics). I am not sure how realistic stabilizing the batch listener API is if you include QueryExecution (or a stabilized version of it). We could expose a stable callback, e.g. onObservedMetrics(...), for observed metrics. On the other hand, it is kind of annoying that we can't expose this through the Dataframe itself, and that is because when we execute the Dataframe we often use a different one under the hood.
The issue is not about this PR itself, but about the batch query listener relying on QueryExecution, which is not going to be public. I think the streaming query listener is better. It's still unstable now, but we are going to make it stable in the future.
The batch listener is no longer marked as experimental in 3.0, as there have been some slight modifications (enough for @Evolving), but it hasn't changed majorly during its 4 years of life - see #25558. If we feel concerned it could be rolled back to Experimental/Unstable, though I know there are a couple of projects in the Spark ecosystem already leveraging it, which shows that its use is not restricted to debugging.

The major feature of the batch listener is that (unlike the streaming query listener, which summarizes information and stores it into a new data structure) it exposes the various logical/physical plans, and I don't imagine these plans will be used for execution purposes. Mostly read-only. It may not be unrealistic to provide these plans as read-only (cloned, not executable), but yeah, it may not be as easy as it seems (I'm not familiar enough with it).
```scala
   *
   * A user can observe these metrics by either adding
   * [[org.apache.spark.sql.streaming.StreamingQueryListener]] or a
   * [[org.apache.spark.sql.util.QueryExecutionListener]] to the spark session.
```
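A sketch of the streaming side referenced in this doc (the callbacks are the standard `StreamingQueryListener` methods; the metric name is a placeholder):

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // observedMetrics is a java.util.Map[String, Row], keyed by metric name.
    Option(event.progress.observedMetrics.get("my_metrics")).foreach { row =>
      println(s"observed: $row")
    }
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```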
Can we have an example with QueryExecutionListener too?
I'll raise a PR for this, as I found a slight bug in the streaming example as well.
Raised #27046. Please note that the patch relies on the current status.
So you want to encourage users to use an unstable API?
I would like to show the documented usage.
And mark this as unstable too, given that it relies on unstable APIs.
Let's sort it out a bit.

Was the batch use case intentionally excluded due to the use of @DeveloperApi? If that's the intention, then yes, #27046 can just fix the slight bug in the streaming example.

If the concern is focused on having QueryExecution as a private API while exposing it via @DeveloperApi, I think it's time to fix that, as no one would complain about the change in a major release.
We shouldn't document unstable features unless we commit to stabilizing them in the short term. Please don't add that example. A savvy developer will know how to work with this, and can weigh the risks involved themselves.

The same goes for unstable. Please don't mark it as unstable; the function itself is stable. It is stable for streaming. The batch case relies on QueryExecutionListener and isn't, but that API is marked as such, so there shouldn't be any confusion there.

@HeartSaVioR it would be great if we can fix the streaming example.
OK, thanks for the clarification. Totally makes sense. Will update the PR there.
Okay, I discussed with @hvanhovell offline. I am okay with this.