
Conversation

@HyukjinKwon HyukjinKwon (Member) commented Jul 14, 2020

What changes were proposed in this pull request?

This PR proposes to simply by-pass the case where the computed array size is negative when collecting data from a Spark DataFrame with no partitions, for `toPandas` with Arrow optimization enabled.

```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```

In master and branch-3.0, this was fixed as part of ecaa495, but that commit was legitimately not ported back.

Why are the changes needed?

To make an empty Spark DataFrame convertible to a pandas DataFrame.

Does this PR introduce any user-facing change?

Yes,

```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```

Before:

```
...
Caused by: java.lang.NegativeArraySizeException
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
...
```

After:

```
Empty DataFrame
Columns: [col1]
Index: []
```

How was this patch tested?

Manually tested, and a unit test was added.
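The shape of the by-pass can be illustrated with a small, self-contained sketch (the function and batch names below are hypothetical; the actual fix lives in `collectAsArrowToPython` in `Dataset.scala` on the JVM side). The idea is that sizing a result array from the partition count can come out negative when the DataFrame has no partitions, so the fix treats that case as an empty result instead of allocating:

```python
def collect_arrow_batches(num_partitions: int) -> list:
    """Hypothetical mirror of the branch-2.4 fix: an array sized from
    ``num_partitions - 1`` goes negative when there are no partitions
    (the JVM raises NegativeArraySizeException on such an allocation),
    so by-pass that case and return an empty result instead."""
    last_index = num_partitions - 1
    if last_index < 0:
        # No partitions: nothing to collect, return an empty batch list.
        return []
    # Placeholder for the real per-partition Arrow batch collection.
    return [f"batch-{i}" for i in range(num_partitions)]
```

With the guard in place, `collect_arrow_batches(0)` returns an empty list, which corresponds to `toPandas` producing an empty pandas DataFrame rather than throwing.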

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125815 has finished for PR 29098 at commit c3a7f7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125818 has finished for PR 29098 at commit 070ea46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2020

Test build #125816 has finished for PR 29098 at commit 8074075.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler (Member) left a comment

LGTM

BryanCutler pushed a commit that referenced this pull request Jul 14, 2020
…e with no partitions

### What changes were proposed in this pull request?

This PR proposes to simply by-pass the case where the computed array size is negative when collecting data from a Spark DataFrame with no partitions, for `toPandas` with Arrow optimization enabled.

```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```

In master and branch-3.0, this was fixed as part of ecaa495, but that commit was legitimately not ported back.

### Why are the changes needed?

To make an empty Spark DataFrame convertible to a pandas DataFrame.

### Does this PR introduce _any_ user-facing change?

Yes,

```python
spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
```

**Before:**

```
...
Caused by: java.lang.NegativeArraySizeException
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
...
```

**After:**

```
Empty DataFrame
Columns: [col1]
Index: []
```

### How was this patch tested?

Manually tested, and a unit test was added.

Closes #29098 from HyukjinKwon/SPARK-32300.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
@BryanCutler (Member)

merged to branch-2.4, thanks @HyukjinKwon !

@dongjoon-hyun (Member)

+1, late LGTM. Thanks all!

HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
…h empty partitioned Spark DataFrame

### What changes were proposed in this pull request?

This PR proposes to port the test case from #29098 to branch-3.0 and master. In master and branch-3.0, this was fixed as part of ecaa495, but the no-partition case is not being tested.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

The unit test was forward-ported.

Closes #29099 from HyukjinKwon/SPARK-32300-1.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
…h empty partitioned Spark DataFrame

### What changes were proposed in this pull request?

This PR proposes to port the test case from #29098 to branch-3.0 and master. In master and branch-3.0, this was fixed as part of ecaa495, but the no-partition case is not being tested.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

The unit test was forward-ported.

Closes #29099 from HyukjinKwon/SPARK-32300-1.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 676d92e)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
…oPython

### What changes were proposed in this pull request?

This PR proposes to port #29098 forward to `collectAsArrowToR`. `collectAsArrowToR` follows `collectAsArrowToPython` in branch-2.4 due to the limitation of ARROW-4512: SparkR vectorization currently cannot use the Arrow streaming format.

### Why are the changes needed?

For simplicity and consistency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The same code is being tested in `collectAsArrowToPython` of branch-2.4.

Closes #29100 from HyukjinKwon/minor-parts.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jul 14, 2020
…oPython

### What changes were proposed in this pull request?

This PR proposes to port #29098 forward to `collectAsArrowToR`. `collectAsArrowToR` follows `collectAsArrowToPython` in branch-2.4 due to the limitation of ARROW-4512: SparkR vectorization currently cannot use the Arrow streaming format.

### Why are the changes needed?

For simplicity and consistency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The same code is being tested in `collectAsArrowToPython` of branch-2.4.

Closes #29100 from HyukjinKwon/minor-parts.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 03b5707)
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-32300 branch July 27, 2020 07:43