[SPARK-30306][CORE][PYTHON][WIP] Instrument Python UDF execution time and throughput metrics using Spark Metrics system #26953
Conversation
HyukjinKwon left a comment:
I took a cursory look. My impression is that the functionality is slightly overlapped with Python profiler feature and the current implementation is too verbose. Let me take a closer look later.
Thanks @HyukjinKwon for taking the time for this. I'd like to add some additional context on how we intend to use this.
As you mentioned, in the current PR I have implemented several detailed metrics, which I guess can be useful when troubleshooting, but which for general use can be simplified as you propose, in particular in the number of, and detail exposed by, the user-visible metrics in the PythonMetrics source.
This has now been streamlined down to the following metrics (under
ok to test
Have a couple of questions.
- Is it possible to merge it with SQLMetric? It would be nicer if the UI showed it as well.
- Is it possible to integrate with the existing Python profiler? The current read time isn't purely Python execution time; it includes socket I/O time, which can potentially be large.
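To illustrate the point about read time: the sketch below simulates, with a plain socketpair and a thread (not Spark's actual JVM-worker protocol; all names here are illustrative), why a wall-clock measurement taken around a blocking read from the worker necessarily bundles the worker's compute time together with socket I/O.

```python
import socket
import threading
import time

def worker(conn, compute_seconds):
    """Simulates a Python worker: 'executes a UDF' (a sleep), then sends the result."""
    time.sleep(compute_seconds)      # stands in for Python UDF execution
    conn.sendall(b"result")
    conn.close()

def timed_fetch(compute_seconds=0.2):
    """Measures what a JVM-side reader would see: a single wall-clock interval
    covering worker compute time plus socket I/O."""
    parent, child = socket.socketpair()
    t = threading.Thread(target=worker, args=(child, compute_seconds))
    t.start()
    start = time.perf_counter()
    data = parent.recv(1024)         # blocks until the worker has both computed AND sent
    elapsed = time.perf_counter() - start
    t.join()
    parent.close()
    return data, elapsed

data, elapsed = timed_fetch()
# The measured 'fetch' time cannot be smaller than the simulated UDF time.
assert data == b"result"
assert elapsed >= 0.15
```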
- I like the idea of adding SQLMetrics for Python UDF instrumentation and using them in the Web UI. However, I think that work would fit better in a separate JIRA/PR. The implementation details and the overhead of SQLMetrics are different from Dropwizard-based metrics, so we would probably want only a limited number of SQLMetrics instrumenting task activities in this area. Also, implementing SQLMetrics for PythonUDF execution may require some important changes to the current plan evaluation code.
- It is indeed the case that the "read time from worker", which is exposed to users via the Dropwizard library as "FetchResultsTimeFromWorkers", contains both socket I/O + deserialization time and Python UDF execution time. Measuring on the Python side would allow separating the two time components; however, currently I don't see how to make a lightweight implementation of that. The Python profiler can measure on the Python side as you mentioned, but I see its usage more for debugging, while the proposed instrumentation is lightweight and intended for production use cases too. Maybe future work can address this case if there is need?
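As a rough sketch of what "measuring on the Python side" could look like (a hypothetical worker-side helper, not code from this PR), one could time pure UDF execution per batch and report it back, so the JVM side could subtract it from the fetch time:

```python
import time

def run_udf_batch(udf, batch):
    """Hypothetical worker-side helper: times pure UDF execution for one batch,
    so it could be reported separately from the socket I/O + deserialization
    component of the JVM-side fetch time."""
    start = time.perf_counter()
    results = [udf(row) for row in batch]
    udf_seconds = time.perf_counter() - start
    return results, udf_seconds

# With a per-batch udf_seconds reported back, the JVM side could estimate
# io_seconds = fetch_seconds - udf_seconds for each batch.
results, udf_seconds = run_udf_batch(lambda x: x * 2, list(range(1000)))
assert results[:3] == [0, 2, 4]
assert udf_seconds >= 0.0
```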
@HyukjinKwon I have finally managed to work on your suggestion to instrument Python execution using SQL Metrics, so that users can see the metrics via the WebUI. See [SPARK-34265]. I imagine that I could later refactor the work on this PR based on that.
PR link: #31367
Test build #116893 has finished for PR 26953 at commit
Test build #117475 has finished for PR 26953 at commit
Test build #118782 has finished for PR 26953 at commit
retest this please
retest this please
Test build #118929 has finished for PR 26953 at commit
Test build #119372 has finished for PR 26953 at commit
This looks like a case of a flaky test. Can we test this again, please?
I would not worry very much about the performance impact of this additional instrumentation, as it hooks into something that is already not very fast: the JVM-Python serialization/deserialization. Moreover, the instrumentation mostly just takes timing values, and does so per batch of serialized rows rather than per row, so the impact on total throughput is expected to be further reduced. So far I have only tested this manually and did not observe any particular impact. If we have a Python UDF benchmark, I could test further with that.
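The point about per-batch timing can be made concrete with a small sketch (illustrative Python, not the PR's Scala code): instrumenting per batch issues orders of magnitude fewer clock calls than instrumenting per row, for the same work.

```python
import time

def time_per_row(rows, fn):
    """Per-row instrumentation: two clock reads for every row."""
    out, udf_seconds, clock_reads = [], 0.0, 0
    for r in rows:
        t0 = time.perf_counter()
        out.append(fn(r))
        udf_seconds += time.perf_counter() - t0
        clock_reads += 2
    return out, udf_seconds, clock_reads

def time_per_batch(rows, fn, batch_size=100):
    """Per-batch instrumentation: two clock reads per batch of serialized rows,
    which is roughly the granularity this PR hooks into."""
    out, udf_seconds, clock_reads = [], 0.0, 0
    for i in range(0, len(rows), batch_size):
        t0 = time.perf_counter()
        out.extend(fn(r) for r in rows[i:i + batch_size])
        udf_seconds += time.perf_counter() - t0
        clock_reads += 2
    return out, udf_seconds, clock_reads

rows = list(range(10_000))
_, _, per_row_reads = time_per_row(rows, lambda x: x + 1)
_, _, per_batch_reads = time_per_batch(rows, lambda x: x + 1)
assert per_row_reads == 20_000
assert per_batch_reads == 200  # 100x fewer instrumentation calls for the same work
```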
Test build #123188 has finished for PR 26953 at commit
Test build #126228 has finished for PR 26953 at commit
Test build #128116 has finished for PR 26953 at commit
Test build #128117 has finished for PR 26953 at commit
Test build #130035 has finished for PR 26953 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #130045 has finished for PR 26953 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Test build #130043 has finished for PR 26953 at commit
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #130046 has finished for PR 26953 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131055 has finished for PR 26953 at commit
I am closing this, as the implementation should be refactored: a better way to implement this looks to be to first work on the corresponding SQL metrics, as in SPARK-34265, and then revisit the instrumentation for the Spark Metrics System.
What changes were proposed in this pull request?
This proposes to extend Spark instrumentation with metrics aimed at drilling down into the performance of Python code called by Spark: via UDF, Pandas UDF, or mapPartitions. Relevant performance counters, notably execution time, are exposed using the Spark Metrics System (based on the Dropwizard library).
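Since the metrics go through the Spark Metrics System, they would be consumed like any other Dropwizard-based source, via a sink configured in metrics.properties. A minimal sketch using Spark's built-in console sink (the sink classes below are Spark's existing ones; the exact PythonMetrics metric names are those discussed above and may differ in the final version):

```properties
# metrics.properties (sketch): enable a console sink so executor-side sources,
# including the proposed PythonMetrics source, are printed periodically.
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```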
Why are the changes needed?
This makes it easy to consume the metrics produced by executors, for example using a performance dashboard (this builds on previous work, as discussed in https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark ).
See also the screenshot that compares the existing state (no Python UDF time instrumentation) with the proposed new functionality.
Does this PR introduce any user-facing change?
This PR adds the PythonMetrics source to the Spark Metrics system. The list of the implemented metrics has been added to the Monitoring documentation.
How was this patch tested?
Added relevant tests