[SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI. #28730
Conversation
gengliangwang
left a comment
LGTM
Test build #123547 has finished for PR 28730 at commit
HyukjinKwon
left a comment
LGTM
[SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI

### What changes were proposed in this pull request?

In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

### Why are the changes needed?

When calling `toPandas`, the Query UI usually shows each plan node's metrics and the corresponding Stage ID and Task ID:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

[image](https://user-images.githubusercontent.com/506656/83815735-bec22380-a6e1-11ea-8ecc-bf2954731f35.png)

but if Arrow execution is enabled, it shows only the plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

[image](https://user-images.githubusercontent.com/506656/83815804-de594c00-a6e1-11ea-933a-d0ffc0f534dd.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

[image](https://user-images.githubusercontent.com/506656/83816265-d77f0900-a6e2-11ea-84b8-2a8d80428bc6.png)

Closes #28730 from ueshin/issues/SPARK-31903/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 632b5bc)
Signed-off-by: HyukjinKwon <[email protected]>
Merged to master and branch-3.0. I don't mind porting it back if anyone needs it. I didn't do it here just because there's a conflict, and it's just a matter of monitoring. I will leave it to you @ueshin :D.
BryanCutler
left a comment
Nice catch @ueshin , LGTM thanks!
…how metrics in Query UI

### What changes were proposed in this pull request?

This is a backport of #28730.

In `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

### Why are the changes needed?

When calling `toPandas`, the Query UI usually shows each plan node's metrics:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

[image](https://user-images.githubusercontent.com/506656/83914110-6f3b3080-a73f-11ea-8e68-defde25c1b53.png)

but if Arrow execution is enabled, it shows only the plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

[image](https://user-images.githubusercontent.com/506656/83914127-78c49880-a73f-11ea-90ac-e328e9f697dc.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

[image](https://user-images.githubusercontent.com/506656/83914142-7c581f80-a73f-11ea-9b0a-e0838b060dfd.png)

Closes #28740 from ueshin/issues/SPARK-31903/2.4/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Just a suggestion: in the PR description, we need to list all the external APIs whose UI is affected. `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython` only name the internal APIs.
What changes were proposed in this pull request?
In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

The affected functions are:

- `collect()` in SparkR
- `DataFrame.toPandas()` in PySpark

Why are the changes needed?
When calling `toPandas`, the Query UI usually shows each plan node's metrics and the corresponding Stage ID and Task ID:

but if Arrow execution is enabled, it shows only the plan nodes and the duration is not correct:
Does this PR introduce any user-facing change?
Yes, the Query UI will show the plan with the correct metrics.
How was this patch tested?
I checked it manually in my local environment.
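The timing issue fixed here can be illustrated with a minimal, self-contained Python threading analogue. This is a hedged sketch, not Spark code: `with_action`, `action`, and the `serve_to_stream_*` functions are hypothetical stand-ins for Spark's internal `withAction` and `serveToStream`, simplified to show why the ordering matters.

```python
import threading
import time

def with_action(body):
    """Analogue of withAction: run body() and record its duration as the
    collected 'metric' (a simplification of what Spark's Query UI tracks)."""
    start = time.perf_counter()
    body()
    return time.perf_counter() - start

def action():
    time.sleep(0.2)  # stands in for the actual query execution

# Buggy ordering (before the fix): serving only *starts* a thread, so the
# wrapping with_action returns almost immediately and the metric misses
# the real work.
threads = []
def serve_to_stream_buggy():
    t = threading.Thread(target=action)
    t.start()
    threads.append(t)

buggy_metric = with_action(serve_to_stream_buggy)
threads[0].join()

# Fixed ordering (after the fix): the serving thread is started first, and
# with_action wraps the action *inside* that thread, so the recorded metric
# covers the actual execution.
fixed = {}
def serve_to_stream_fixed():
    def run():
        fixed["metric"] = with_action(action)
    t = threading.Thread(target=run)
    t.start()
    return t

serve_to_stream_fixed().join()

print(f"buggy metric: {buggy_metric:.3f}s (misses the 0.2s action)")
print(f"fixed metric: {fixed['metric']:.3f}s (covers the 0.2s action)")
```

The sketch shows the same structural move as the patch: instead of wrapping the thread launch in the measuring block, the measuring block is moved inside the thread so it encloses the work it is supposed to measure.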