
Conversation

@ueshin (Member) commented Jun 4, 2020

### What changes were proposed in this pull request?

In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, the code block passed to `serveToStream` is run in a separate thread, so `withAction` finishes as soon as it starts that thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics.

We should call `serveToStream` first, then `withAction` inside it (see the sketch after the list below).

The affected functions are:

  • `collect()` in SparkR
  • `DataFrame.toPandas()` in PySpark
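
To make the ordering problem concrete, here is a minimal, runnable Python sketch. `with_action` and `serve_to_stream` are simplified, hypothetical stand-ins for the Scala-side `withAction` and `serveToStream` (the real change is in `Dataset.scala`); only the nesting order matters here, not the names or signatures.

```py
import threading
import time

def with_action(name, body):
    # Simplified stand-in: run `body`, then record how long the "action" took,
    # the way the real wrapper collects metrics for the Query UI.
    start = time.time()
    result = body()
    print(f"{name}: metrics cover {time.time() - start:.2f}s of work")
    return result

def serve_to_stream(write_func):
    # Simplified stand-in: hand the writing off to a separate thread and
    # return immediately, like the real serving code does.
    thread = threading.Thread(target=write_func)
    thread.start()
    return thread

def collect_arrow_batches():
    time.sleep(1)  # stands in for the actual collection job

# Before the fix: with_action wraps serve_to_stream, so it returns as soon as
# the serving thread is started and records ~0s, missing the real job's metrics.
with_action("before", lambda: serve_to_stream(collect_arrow_batches)).join()

# After the fix: the serving thread runs with_action around the actual
# collection, so the recorded metrics cover the real work (~1s).
serve_to_stream(lambda: with_action("after", collect_arrow_batches)).join()
```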

### Why are the changes needed?

When calling `toPandas`, the Query UI usually shows each plan node's metrics and the corresponding Stage IDs and Task IDs:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-03 at 4 47 07 PM](https://user-images.githubusercontent.com/506656/83815735-bec22380-a675-11ea-8ecc-bf2954731f35.png)

but if Arrow execution is enabled, it shows only plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-03 at 4 47 27 PM](https://user-images.githubusercontent.com/506656/83815804-de594c00-a675-11ea-933a-d0ffc0f534dd.png)
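
For comparison, switching the conf back restores the non-Arrow path from the first screenshot. This is just the inverse of the `conf.set` call above, not something described in the PR:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
>>> df.toPandas()  # non-Arrow path again; per-node metrics appear in the Query UI
```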

### Does this PR introduce any user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

![Screen Shot 2020-06-04 at 3 19 41 PM](https://user-images.githubusercontent.com/506656/83816265-d77f0900-a676-11ea-84b8-2a8d80428bc6.png)

@ueshin ueshin requested review from BryanCutler and HyukjinKwon June 4, 2020 22:27
@ueshin ueshin changed the title [SPARK-31903][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI. [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI. Jun 4, 2020
@gengliangwang (Member) left a comment

LGTM

@SparkQA commented Jun 5, 2020

Test build #123547 has finished for PR 28730 at commit 5705e15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment

LGTM

HyukjinKwon pushed a commit that referenced this pull request Jun 5, 2020: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI

Closes #28730 from ueshin/issues/SPARK-31903/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 632b5bc)
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon (Member) commented Jun 5, 2020

Merged to master and branch-3.0. I don't mind porting it back if anyone needs it. I didn't do it here just because there's a conflict, and it's just a matter of monitoring.

I will leave it to you @ueshin :D.

@BryanCutler (Member) left a comment

Nice catch @ueshin, LGTM thanks!

HyukjinKwon pushed a commit that referenced this pull request Jun 6, 2020: …how metrics in Query UI

### What changes were proposed in this pull request?

This is a backport of #28730.

In `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics.

We should call `serveToStream` first, then `withAction` in it.

### Why are the changes needed?

When calling `toPandas`, the Query UI usually shows each plan node's metrics:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-05 at 10 58 30 AM](https://user-images.githubusercontent.com/506656/83914110-6f3b3080-a725-11ea-88c0-de83a833b05c.png)

but if Arrow execution is enabled, it shows only plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-05 at 10 58 42 AM](https://user-images.githubusercontent.com/506656/83914127-782c0200-a725-11ea-84e4-74d861d5c20a.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

![Screen Shot 2020-06-05 at 11 29 48 AM](https://user-images.githubusercontent.com/506656/83914142-7e21e300-a725-11ea-8925-edc22df16388.png)

Closes #28740 from ueshin/issues/SPARK-31903/2.4/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@gatorsmile (Member) commented

Just a suggestion. In the PR description, we need to list all the external APIs whose UI is affected.

`Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython` only describe the internal APIs.
