[SPARK-30306][CORE][PYTHON][WIP] Instrument Python UDF execution time and throughput metrics using Spark Metrics system #26953
Conversation
HyukjinKwon left a comment:
I took a cursory look. My impression is that the functionality is slightly overlapped with Python profiler feature and the current implementation is too verbose. Let me take a closer look later.
Thanks @HyukjinKwon for taking the time for this. I'd like to add some additional context on how we intend to use this.
As you mentioned, in the current PR I have implemented several detailed metrics, which I guess can be useful when troubleshooting, but which for general use can be simplified as you propose, in particular in the number of, and detail exposed by, the user-visible metrics in the PythonMetrics source.
This has now been streamlined down to the following metrics (under
ok to test
Have a couple of questions.
- Is it possible to merge it with SQLMetric? It would be nicer if the UI showed it as well.
- Is it possible to integrate with the existing Python profiler? The current read time isn't purely Python execution time; it includes socket I/O time, which can potentially be large.
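To illustrate the point about read time: the sketch below simulates, with a plain socketpair and a thread (not Spark's actual JVM-worker protocol; all names here are illustrative), why a wall-clock measurement taken around a blocking read from the worker necessarily bundles the worker's compute time together with socket I/O.

```python
import socket
import threading
import time

def worker(conn, compute_seconds):
    """Simulates a Python worker: 'executes a UDF' (a sleep), then sends the result."""
    time.sleep(compute_seconds)      # stands in for Python UDF execution
    conn.sendall(b"result")
    conn.close()

def timed_fetch(compute_seconds=0.2):
    """Measures what a JVM-side reader would see: a single wall-clock interval
    covering worker compute time plus socket I/O."""
    parent, child = socket.socketpair()
    t = threading.Thread(target=worker, args=(child, compute_seconds))
    t.start()
    start = time.perf_counter()
    data = parent.recv(1024)         # blocks until the worker has both computed AND sent
    elapsed = time.perf_counter() - start
    t.join()
    parent.close()
    return data, elapsed

data, elapsed = timed_fetch()
# The measured 'fetch' time cannot be smaller than the simulated UDF time.
assert data == b"result"
assert elapsed >= 0.15
```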
- I like the idea of adding SQLMetrics for Python UDF instrumentation and using them in the Web UI. However, I think that work would fit better in a separate JIRA/PR. The implementation details and the overhead of SQLMetrics are different from Dropwizard-based metrics, so we would probably want only a limited number of SQLMetrics instrumenting task activities in this area. Also, implementing SQLMetrics for PythonUDF execution may require some important changes to the current plan evaluation code.
- It is indeed the case that the "read time from worker", which is exposed to users via the Dropwizard library as "FetchResultsTimeFromWorkers", contains both socket I/O + deserialization time and Python UDF execution time. Measuring on the Python side would allow separating the two time components; however, currently I don't see how to make a lightweight implementation of that. The Python profiler can measure on the Python side as you mentioned, but I see its usage more for debugging, while the proposed instrumentation is lightweight and intended for production use cases too. Maybe future work can address this case if there is need?
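As a rough sketch of what "measuring on the Python side" could look like (a hypothetical worker-side helper, not code from this PR), one could time pure UDF execution per batch and report it back, so the JVM side could subtract it from the fetch time:

```python
import time

def run_udf_batch(udf, batch):
    """Hypothetical worker-side helper: times pure UDF execution for one batch,
    so it could be reported separately from the socket I/O + deserialization
    component of the JVM-side fetch time."""
    start = time.perf_counter()
    results = [udf(row) for row in batch]
    udf_seconds = time.perf_counter() - start
    return results, udf_seconds

# With a per-batch udf_seconds reported back, the JVM side could estimate
# io_seconds = fetch_seconds - udf_seconds for each batch.
results, udf_seconds = run_udf_batch(lambda x: x * 2, list(range(1000)))
assert results[:3] == [0, 2, 4]
assert udf_seconds >= 0.0
```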
@HyukjinKwon I have finally managed to work on your suggestion to instrument Python execution using SQL Metrics, so that users can see the metrics via the WebUI. See [SPARK-34265]. I imagine that I could later refactor the work on this PR based on that.
PR link: #31367
Test build #116893 has finished for PR 26953 at commit
Test build #117475 has finished for PR 26953 at commit
Test build #118782 has finished for PR 26953 at commit
retest this please
retest this please
Test build #118929 has finished for PR 26953 at commit
Test build #119372 has finished for PR 26953 at commit
This looks like a case of a flaky test. Can we test this again, please?
I would not worry very much about the performance impact of this additional instrumentation, as it hooks into something that is already not very fast: the JVM-Python serialization/deserialization. Moreover, the instrumentation mostly just takes timing values, and does so per batch of serialized rows rather than per row, so the impact on total throughput is expected to be further reduced. So far I have only tested this manually and did not observe any particular impact. If we have a Python UDF benchmark, I could test further with that.
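The point about per-batch timing can be made concrete with a small sketch (illustrative Python, not the PR's Scala code): instrumenting per batch issues orders of magnitude fewer clock calls than instrumenting per row, for the same work.

```python
import time

def time_per_row(rows, fn):
    """Per-row instrumentation: two clock reads for every row."""
    out, udf_seconds, clock_reads = [], 0.0, 0
    for r in rows:
        t0 = time.perf_counter()
        out.append(fn(r))
        udf_seconds += time.perf_counter() - t0
        clock_reads += 2
    return out, udf_seconds, clock_reads

def time_per_batch(rows, fn, batch_size=100):
    """Per-batch instrumentation: two clock reads per batch of serialized rows,
    which is roughly the granularity this PR hooks into."""
    out, udf_seconds, clock_reads = [], 0.0, 0
    for i in range(0, len(rows), batch_size):
        t0 = time.perf_counter()
        out.extend(fn(r) for r in rows[i:i + batch_size])
        udf_seconds += time.perf_counter() - t0
        clock_reads += 2
    return out, udf_seconds, clock_reads

rows = list(range(10_000))
_, _, per_row_reads = time_per_row(rows, lambda x: x + 1)
_, _, per_batch_reads = time_per_batch(rows, lambda x: x + 1)
assert per_row_reads == 20_000
assert per_batch_reads == 200  # 100x fewer instrumentation calls for the same work
```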
Test build #123188 has finished for PR 26953 at commit
Test build #126228 has finished for PR 26953 at commit
Test build #128116 has finished for PR 26953 at commit
Test build #128117 has finished for PR 26953 at commit
Test build #130035 has finished for PR 26953 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #130045 has finished for PR 26953 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Test build #130043 has finished for PR 26953 at commit
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #130046 has finished for PR 26953 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131055 has finished for PR 26953 at commit
I am closing this, as the implementation should be refactored: a better way to implement this looks to be to first work on the corresponding SQL metrics, as in SPARK-34265, and then revisit the instrumentation for the Spark Metrics System.
What changes were proposed in this pull request?
This proposes to extend Spark instrumentation with metrics aimed at drilling down into the performance of Python code called by Spark: via UDF, Pandas UDF, or mapPartitions. Relevant performance counters, notably execution time, are exposed using the Spark Metrics System (based on the Dropwizard library).
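Since the metrics go through the Spark Metrics System, they would be consumed like any other Dropwizard-based source, via a sink configured in metrics.properties. A minimal sketch using Spark's built-in console sink (the sink classes below are Spark's existing ones; the exact PythonMetrics metric names are those discussed above and may differ in the final version):

```properties
# metrics.properties (sketch): enable a console sink so executor-side sources,
# including the proposed PythonMetrics source, are printed periodically.
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```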
Why are the changes needed?
This makes it easy to consume the metrics produced by executors, for example using a performance dashboard (this builds on previous work, as discussed in https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark ).
See also the screenshot that compares the existing state (no Python UDF time instrumentation) with the proposed new functionality.
Does this PR introduce any user-facing change?
This PR adds the PythonMetrics source to the Spark Metrics system. The list of the implemented metrics has been added to the Monitoring documentation.
How was this patch tested?
Added relevant tests