Conversation

@LucaCanali
Contributor

What changes were proposed in this pull request?

This proposes to extend Spark instrumentation with metrics aimed at drilling down into the performance of Python code called by Spark via UDF, Pandas UDF, or mapPartitions. The relevant performance counters, notably execution time, are exposed using the Spark Metrics System (based on the Dropwizard library).
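As a rough illustration of the kind of per-batch accounting involved, here is a minimal Python sketch of Dropwizard-style counters accumulated around batch writes to a worker. The class and field names are hypothetical, chosen to mirror the metric names discussed below; this is not Spark's actual implementation.

```python
import time

class PythonUDFMetrics:
    """Hypothetical sketch of Dropwizard-style counters for batch writes."""
    def __init__(self):
        self.bytes_sent_to_workers = 0
        self.num_batches_to_workers = 0
        self.write_time_to_workers_ns = 0

    def record_batch_write(self, payload: bytes, write_fn):
        """Time one serialized batch being written to a Python worker."""
        start = time.monotonic_ns()
        write_fn(payload)
        self.write_time_to_workers_ns += time.monotonic_ns() - start
        self.bytes_sent_to_workers += len(payload)
        self.num_batches_to_workers += 1

metrics = PythonUDFMetrics()
sink = []  # stands in for the worker's socket
metrics.record_batch_write(b"row-batch-1", sink.append)
metrics.record_batch_write(b"row-batch-22", sink.append)
print(metrics.num_batches_to_workers, metrics.bytes_sent_to_workers)  # → 2 23
```

The key point is that the counters are bumped once per batch of serialized rows, not once per row.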

Why are the changes needed?

This makes it easy to consume the metrics produced by the executors, for example in a performance dashboard (this builds on previous work discussed in https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark ).
See also the screenshot comparing the existing state (no Python UDF time instrumentation) with the proposed new functionality.

Does this PR introduce any user-facing change?

This PR adds the PythonMetrics source to the Spark Metrics system. The list of the implemented metrics has been added to the Monitoring documentation.
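For reference, metrics exposed through the Spark Metrics System are routed to sinks configured in conf/metrics.properties. A minimal sketch using the built-in ConsoleSink (period values illustrative) could look like:

```properties
# conf/metrics.properties — dump executor metrics, including a
# PythonMetrics source like the one proposed here, every 10 seconds
executor.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
executor.sink.console.period=10
executor.sink.console.unit=seconds
```

In practice a Graphite or Prometheus-style sink would feed the dashboard mentioned above.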

How was this patch tested?

  • Added relevant tests.

  • Manually tested end-to-end on a YARN cluster, using an existing Spark dashboard extended with the metrics proposed here.

@maropu
Member

maropu commented Dec 20, 2019

cc: @HyukjinKwon @viirya @ueshin

Member

@HyukjinKwon HyukjinKwon left a comment


I took a cursory look. My impression is that the functionality overlaps slightly with the Python profiler feature and that the current implementation is too verbose. Let me take a closer look later.

@LucaCanali
Contributor Author

LucaCanali commented Dec 20, 2019

Thanks @HyukjinKwon for taking the time for this.
The functionality proposed in this PR is different from the Python profiler, or at least its intended use is. It is also intended to be lightweight, so that it can be used to measure Spark workloads in production as part of a performance dashboard fed by the Spark metrics system.

I'd like to add some additional context on how we intend to use this.

As you mentioned, the current PR implements several details that can be useful when troubleshooting but that, for general use, can be simplified as you propose, in particular in the number of user-visible metrics exposed in the PythonMetrics source and the level of detail they carry.
A proposal for a simplified set of metrics to expose in the PythonMetrics source is:

  • BatchCountFromWorker
  • BatchCountToWorker
  • BytesSentToWorker
  • RunAndReadTimeFromWorker
  • WriteTimeToWorker
  • PandasUDFSentRowCount
  • PandasUDFReceivedRowCount

@LucaCanali
Contributor Author

This has now been streamlined to the following metrics (under namespace=PythonMetrics):

  • BytesReceivedFromWorkers
  • BytesSentToWorkers
  • FetchResultsTimeFromWorkers
  • NumBatchesFromWorkers
  • NumBatchesToWorkers
  • PandasUDFReceivedNumRows
  • PandasUDFSentNumRows
  • WriteTimeToWorkers
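Once registered, such counters appear alongside the other Dropwizard metrics, e.g. in the JSON exposed by the metrics servlet. A small sketch of filtering them out of such a payload follows; the metric names match the list above, but the payload shape, prefixes, and values are illustrative, not captured output.

```python
import json

# Illustrative payload shaped like a Dropwizard /metrics/json dump
# (names and values are made up for the example).
sample = json.loads("""
{"counters": {
  "app-1.0.executor.PythonMetrics.NumBatchesToWorkers": {"count": 42},
  "app-1.0.executor.PythonMetrics.BytesSentToWorkers": {"count": 1048576},
  "app-1.0.executor.SomeOtherSource.value": {"count": 7}
}}
""")

# Keep only the PythonMetrics counters, keyed by their short names.
python_metrics = {name.split(".")[-1]: entry["count"]
                  for name, entry in sample["counters"].items()
                  if ".PythonMetrics." in name}
print(python_metrics)
```

A dashboard would poll such an endpoint periodically and plot the deltas.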

@LucaCanali LucaCanali changed the title [SPARK-30306][CORE][PYTHON] Instrument Python UDF execution time and metrics using Spark Metrics system [SPARK-30306][CORE][PYTHON] Instrument Python UDF execution time and throughput metrics using Spark Metrics system Jan 9, 2020
@HyukjinKwon
Member

ok to test

Member


I have a couple of questions.

  • Is it possible to merge it with SQLMetric? It would be nicer if the UI showed it as well.
  • Is it possible to integrate with the existing Python profiler? The current read time isn't purely Python execution time; it includes socket I/O time, which can potentially be large.

Contributor Author


  • I like the idea of adding SQLMetrics for Python UDF instrumentation and using them in the Web UI. However, I think that work fits better in a separate JIRA/PR. The implementation details and the overhead of SQLMetrics are different from Dropwizard-based metrics, so we would probably want only a limited number of SQLMetrics instrumenting task activities in this area. Also, implementing SQLMetrics for PythonUDF execution may require significant changes to the current plan-evaluation code.

  • It is indeed the case that the “read time from worker”, exposed to users via the Dropwizard library as “FetchResultsTimeFromWorkers”, contains both socket I/O + deserialization time and Python UDF execution time. Measuring on the Python side would allow separating the two time components; however, I currently don’t see how to make a lightweight implementation of that. The Python profiler can measure on the Python side, as you mentioned, but I see its usage more for debugging, while the proposed instrumentation is lightweight and intended for production use cases too. Maybe future work can address this case if there is a need?
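A toy sketch (not Spark's actual reader) of why a read-side timer conflates the two components: the measured fetch time includes the worker's compute time as well as the socket transfer itself.

```python
import socket
import threading
import time

def worker(conn):
    # Simulate Python-side UDF work happening before results are written back.
    time.sleep(0.05)
    conn.sendall(b"result-batch")
    conn.close()

parent, child = socket.socketpair()
t = threading.Thread(target=worker, args=(child,))
t.start()

start = time.monotonic()
data = parent.recv(1024)          # JVM-side "fetch results" read
fetch_time = time.monotonic() - start
t.join()
parent.close()

# The timer around recv() necessarily includes the worker's 0.05 s of
# "compute", not just the socket transfer.
print(data.decode(), fetch_time >= 0.05)
```

Separating the components would require a second timestamp taken on the Python side, which is exactly the extra machinery the comment above argues against for a lightweight production metric.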

Contributor Author


@HyukjinKwon I have finally managed to work on your suggestion to instrument Python execution using SQL Metrics, so that users can see the metrics via the Web UI. See [SPARK-34265]. I imagine I could later refactor the work in this PR based on that.

Member


PR link: #31367

@SparkQA

SparkQA commented Jan 17, 2020

Test build #116893 has finished for PR 26953 at commit 2d70042.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from 2d70042 to af34f49 Compare January 28, 2020 08:39
@SparkQA

SparkQA commented Jan 28, 2020

Test build #117475 has finished for PR 26953 at commit af34f49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from af34f49 to fd76071 Compare February 21, 2020 10:18
@SparkQA

SparkQA commented Feb 21, 2020

Test build #118782 has finished for PR 26953 at commit fd76071.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali
Contributor Author

retest this please

1 similar comment
@BryanCutler
Member

retest this please

@SparkQA

SparkQA commented Feb 25, 2020

Test build #118929 has finished for PR 26953 at commit fd76071.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from fd76071 to 989dba1 Compare March 5, 2020 07:52
@SparkQA

SparkQA commented Mar 5, 2020

Test build #119372 has finished for PR 26953 at commit 989dba1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali
Contributor Author

It looks like a case of a flaky test. Can we test this again, please?

@LucaCanali
Contributor Author

I would not worry much about the performance impact of this additional instrumentation, as it hooks into something that is already not very fast: the JVM-Python serialization/deserialization. Moreover, the instrumentation mostly just takes timing values, and it does so per batch of serialized rows, so the impact on total throughput is expected to be small. So far I have only tested this manually and did not observe any particular impact. If we have a Python UDF benchmark, I could test further with that.
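The amortization argument can be illustrated with a rough sketch: the instrumentation cost is a single pair of clock reads per batch, which is small compared with serializing the batch itself. Here pickle stands in for the JVM-Python row serialization, and the batch contents are made up for the example.

```python
import pickle
import time

rows = [(i, float(i)) for i in range(10_000)]

# Per-batch instrumentation: two clock reads per batch of rows,
# not two per row, so the overhead is amortized over the batch.
start = time.monotonic_ns()
payload = pickle.dumps(rows)      # stand-in for JVM-Python serialization
serialize_ns = time.monotonic_ns() - start

# Cost of the instrumentation itself: one pair of clock reads.
t0 = time.monotonic_ns()
t1 = time.monotonic_ns()
clock_overhead_ns = t1 - t0

# Serializing a 10k-row batch dwarfs the cost of reading the clock twice.
print(clock_overhead_ns < serialize_ns)
```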

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from 812ea5e to a39ad63 Compare May 27, 2020 13:03
@SparkQA

SparkQA commented May 27, 2020

Test build #123188 has finished for PR 26953 at commit a39ad63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from a39ad63 to 277245c Compare July 21, 2020 06:51
@SparkQA

SparkQA commented Jul 21, 2020

Test build #126228 has finished for PR 26953 at commit 277245c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from 277245c to 2fd3f1b Compare August 31, 2020 19:45
@SparkQA

SparkQA commented Aug 31, 2020

Test build #128116 has finished for PR 26953 at commit 2fd3f1b.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 31, 2020

Test build #128117 has finished for PR 26953 at commit c391be0.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from c391be0 to 337b67c Compare October 20, 2020 07:37
@SparkQA

SparkQA commented Oct 20, 2020

Test build #130035 has finished for PR 26953 at commit 337b67c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34642/

@SparkQA

SparkQA commented Oct 20, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34642/

@SparkQA

SparkQA commented Oct 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34650/

@SparkQA

SparkQA commented Oct 20, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34650/

@SparkQA

SparkQA commented Oct 20, 2020

Test build #130045 has finished for PR 26953 at commit 4b88804.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from 4b88804 to 02a0ca0 Compare October 20, 2020 11:43
@SparkQA

SparkQA commented Oct 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34654/

@SparkQA

SparkQA commented Oct 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34653/

@SparkQA

SparkQA commented Oct 20, 2020

Test build #130043 has finished for PR 26953 at commit c7570bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 20, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34653/

@SparkQA

SparkQA commented Oct 20, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34654/

@SparkQA

SparkQA commented Oct 20, 2020

Test build #130046 has finished for PR 26953 at commit 02a0ca0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali force-pushed the PythonUDFInstrumentation branch from 02a0ca0 to 750b973 Compare November 13, 2020 09:19
@SparkQA

SparkQA commented Nov 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35660/

@SparkQA

SparkQA commented Nov 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35660/

@SparkQA

SparkQA commented Nov 13, 2020

Test build #131055 has finished for PR 26953 at commit 750b973.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali LucaCanali changed the title [SPARK-30306][CORE][PYTHON] Instrument Python UDF execution time and throughput metrics using Spark Metrics system [SPARK-30306][CORE][PYTHON][WIP] Instrument Python UDF execution time and throughput metrics using Spark Metrics system Jan 18, 2021
@LucaCanali
Contributor Author

I am closing this, as the implementation should be refactored. A better way to implement this looks to be to first work on the corresponding SQL metrics, as in SPARK-34265, and then revisit the instrumentation for the Spark Metrics System.

@LucaCanali LucaCanali closed this Apr 28, 2021