[SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI. #28730
Conversation
gengliangwang
left a comment
LGTM
Test build #123547 has finished for PR 28730 at commit
HyukjinKwon
left a comment
LGTM
[SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI

### What changes were proposed in this pull request?

In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

### Why are the changes needed?

When calling `toPandas`, the Query UI usually shows each plan node's metrics and the corresponding Stage ID and Task ID:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

[image](https://user-images.githubusercontent.com/506656/83815735-bec22380-a6e1-11ea-8ecc-bf2954731f35.png)

but if Arrow execution is enabled, it shows only the plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

[image](https://user-images.githubusercontent.com/506656/83815804-de594c00-a6e1-11ea-933a-d0ffc0f534dd.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

[image](https://user-images.githubusercontent.com/506656/83816265-d77f0900-a6e2-11ea-84b8-2a8d80428bc6.png)

Closes #28730 from ueshin/issues/SPARK-31903/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 632b5bc)
Signed-off-by: HyukjinKwon <[email protected]>
Merged to master and branch-3.0. I don't mind porting it back if anyone needs it. I didn't do it here just because there's a conflict, and it's just a matter of monitoring. I will leave it to you @ueshin :D.
BryanCutler
left a comment
Nice catch @ueshin , LGTM thanks!
…how metrics in Query UI

### What changes were proposed in this pull request?

This is a backport of #28730.

In `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

### Why are the changes needed?

When calling `toPandas`, the Query UI usually shows each plan node's metrics:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

[image](https://user-images.githubusercontent.com/506656/83914110-6f3b3080-a73f-11ea-8e68-defde25c1b53.png)

but if Arrow execution is enabled, it shows only the plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

[image](https://user-images.githubusercontent.com/506656/83914127-78c49880-a73f-11ea-90ac-e328e9f697dc.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

[image](https://user-images.githubusercontent.com/506656/83914142-7c581f80-a73f-11ea-9b0a-e0838b060dfd.png)

Closes #28740 from ueshin/issues/SPARK-31903/2.4/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Just a suggestion: in the PR description, we need to list all the external APIs whose UI is affected. `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython` only name the internal APIs.
What changes were proposed in this pull request?
In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

The affected functions are:

- `collect()` in SparkR
- `DataFrame.toPandas()` in PySpark

Why are the changes needed?
When calling `toPandas`, the Query UI usually shows each plan node's metrics and the corresponding Stage ID and Task ID:

but if Arrow execution is enabled, it shows only the plan nodes and the duration is not correct:
Does this PR introduce any user-facing change?
Yes, the Query UI will show the plan with the correct metrics.
How was this patch tested?
I checked it manually in my local environment.
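The timing issue fixed here can be illustrated with a minimal, self-contained Python threading analogue. This is a hedged sketch, not Spark code: `with_action`, `action`, and the `serve_to_stream_*` functions are hypothetical stand-ins for Spark's internal `withAction` and `serveToStream`, simplified to show why the ordering matters.

```python
import threading
import time

def with_action(body):
    """Analogue of withAction: run body() and record its duration as the
    collected 'metric' (a simplification of what Spark's Query UI tracks)."""
    start = time.perf_counter()
    body()
    return time.perf_counter() - start

def action():
    time.sleep(0.2)  # stands in for the actual query execution

# Buggy ordering (before the fix): serving only *starts* a thread, so the
# wrapping with_action returns almost immediately and the metric misses
# the real work.
threads = []
def serve_to_stream_buggy():
    t = threading.Thread(target=action)
    t.start()
    threads.append(t)

buggy_metric = with_action(serve_to_stream_buggy)
threads[0].join()

# Fixed ordering (after the fix): the serving thread is started first, and
# with_action wraps the action *inside* that thread, so the recorded metric
# covers the actual execution.
fixed = {}
def serve_to_stream_fixed():
    def run():
        fixed["metric"] = with_action(action)
    t = threading.Thread(target=run)
    t.start()
    return t

serve_to_stream_fixed().join()

print(f"buggy metric: {buggy_metric:.3f}s (misses the 0.2s action)")
print(f"fixed metric: {fixed['metric']:.3f}s (covers the 0.2s action)")
```

The sketch shows the same structural move as the patch: instead of wrapping the thread launch in the measuring block, the measuring block is moved inside the thread so it encloses the work it is supposed to measure.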