
Conversation

@HyukjinKwon HyukjinKwon (Member) commented Jul 14, 2020

What changes were proposed in this pull request?

This PR proposes to simply by-pass the case where the computed array size is negative when collecting data from a Spark DataFrame with no partitions, for `toPandas` with Arrow optimization enabled.

```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```

In master and branch-3.0, this was fixed as part of ecaa495, but that commit was legitimately not ported back.

Why are the changes needed?

To make an empty Spark DataFrame convertible to a pandas DataFrame.

Does this PR introduce any user-facing change?

Yes,

```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```

Before:

```
...
Caused by: java.lang.NegativeArraySizeException
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
...
```

After:

```
Empty DataFrame
Columns: [col1]
Index: []
```

How was this patch tested?

Manually tested, and a unit test was added.
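The shape of the by-pass can be illustrated with a small, self-contained sketch (the function and batch names below are hypothetical; the actual fix lives in `collectAsArrowToPython` in `Dataset.scala` on the JVM side). The idea is that sizing a result array from the partition count can come out negative when the DataFrame has no partitions, so the fix treats that case as an empty result instead of allocating:

```python
def collect_arrow_batches(num_partitions: int) -> list:
    """Hypothetical mirror of the branch-2.4 fix: an array sized from
    ``num_partitions - 1`` goes negative when there are no partitions
    (the JVM raises NegativeArraySizeException on such an allocation),
    so by-pass that case and return an empty result instead."""
    last_index = num_partitions - 1
    if last_index < 0:
        # No partitions: nothing to collect, return an empty batch list.
        return []
    # Placeholder for the real per-partition Arrow batch collection.
    return [f"batch-{i}" for i in range(num_partitions)]
```

With the guard in place, `collect_arrow_batches(0)` returns an empty list, which corresponds to `toPandas` producing an empty pandas DataFrame rather than throwing.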

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125815 has finished for PR 29098 at commit c3a7f7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125818 has finished for PR 29098 at commit 070ea46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125816 has finished for PR 29098 at commit 8074075.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler (Member) left a comment

LGTM

BryanCutler pushed a commit that referenced this pull request Jul 14, 2020
…e with no partitions

### What changes were proposed in this pull request?

This PR proposes to simply by-pass the case where the computed array size is negative when collecting data from a Spark DataFrame with no partitions, for `toPandas` with Arrow optimization enabled.

```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```

In master and branch-3.0, this was fixed as part of ecaa495, but that commit was legitimately not ported back.

### Why are the changes needed?

To make an empty Spark DataFrame convertible to a pandas DataFrame.

### Does this PR introduce _any_ user-facing change?

Yes,

```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```

**Before:**

```
...
Caused by: java.lang.NegativeArraySizeException
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
...
```

**After:**

```
Empty DataFrame
Columns: [col1]
Index: []
```

### How was this patch tested?

Manually tested, and a unit test was added.

Closes #29098 from HyukjinKwon/SPARK-32300.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
@BryanCutler (Member)

merged to branch-2.4, thanks @HyukjinKwon !

@dongjoon-hyun (Member)

+1, late LGTM. Thanks all!

HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
…h empty partitioned Spark DataFrame

### What changes were proposed in this pull request?

This PR proposes to port the test case from #29098 to branch-3.0 and master. In master and branch-3.0, this was fixed as part of ecaa495, but the no-partition case is not being tested.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

The unit test was forward-ported.

Closes #29099 from HyukjinKwon/SPARK-32300-1.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
…h empty partitioned Spark DataFrame

### What changes were proposed in this pull request?

This PR proposes to port the test case from #29098 to branch-3.0 and master. In master and branch-3.0, this was fixed as part of ecaa495, but the no-partition case is not being tested.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

The unit test was forward-ported.

Closes #29099 from HyukjinKwon/SPARK-32300-1.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 676d92e)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
…oPython

### What changes were proposed in this pull request?

This PR proposes to port #29098 forward to `collectAsArrowToR`. `collectAsArrowToR` follows `collectAsArrowToPython` in branch-2.4 due to the limitation of ARROW-4512: SparkR vectorization currently cannot use the Arrow streaming format.

### Why are the changes needed?

For simplicity and consistency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The same code is being tested in `collectAsArrowToPython` of branch-2.4.

Closes #29100 from HyukjinKwon/minor-parts.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
…oPython

### What changes were proposed in this pull request?

This PR proposes to port #29098 forward to `collectAsArrowToR`. `collectAsArrowToR` follows `collectAsArrowToPython` in branch-2.4 due to the limitation of ARROW-4512: SparkR vectorization currently cannot use the Arrow streaming format.

### Why are the changes needed?

For simplicity and consistency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The same code is being tested in `collectAsArrowToPython` of branch-2.4.

Closes #29100 from HyukjinKwon/minor-parts.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 03b5707)
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-32300 branch July 27, 2020 07:43