Skip to content

Conversation

@xinrong-meng
Copy link
Member

@xinrong-meng xinrong-meng commented Mar 9, 2023

What changes were proposed in this pull request?

Implement DataFrame.mapInArrow.

Why are the changes needed?

Parity with vanilla PySpark.

Does this PR introduce any user-facing change?

Yes. DataFrame.mapInArrow is supported as shown below.

>>> import pyarrow
>>> df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
>>> def filter_func(iterator):
...   for batch in iterator:
...     pdf = batch.to_pandas()
...     yield pyarrow.RecordBatch.from_pandas(pdf[pdf.id == 1])
... 
>>> df.mapInArrow(filter_func, df.schema).show()
+---+---+                                                                       
| id|age|
+---+---+
|  1| 21|
+---+---+

How was this patch tested?

Unit tests.

SPARK-41661

@xinrong-meng xinrong-meng changed the title [SPARK-42710][CONNECT][PYTHON] Implement DataFrame.mapInArrow [SPARK-42726][CONNECT][PYTHON] Implement DataFrame.mapInArrow Mar 9, 2023
@HyukjinKwon
Copy link
Member

The test failure seems unrelated.

Merged to master and branch-3.4.

HyukjinKwon pushed a commit that referenced this pull request Mar 10, 2023
### What changes were proposed in this pull request?
Implement `DataFrame.mapInArrow`.

### Why are the changes needed?
Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. `DataFrame.mapInArrow` is supported as shown below.

```
>>> import pyarrow
>>> df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
>>> def filter_func(iterator):
...   for batch in iterator:
...     pdf = batch.to_pandas()
...     yield pyarrow.RecordBatch.from_pandas(pdf[pdf.id == 1])
...
>>> df.mapInArrow(filter_func, df.schema).show()
+---+---+
| id|age|
+---+---+
|  1| 21|
+---+---+
```

### How was this patch tested?
Unit tests.

Closes #40350 from xinrong-meng/mapInArrowImpl.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit f35c2cb)
Signed-off-by: Hyukjin Kwon <[email protected]>
@xinrong-meng
Copy link
Member Author

Thanks @HyukjinKwon !

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
### What changes were proposed in this pull request?
Implement `DataFrame.mapInArrow`.

### Why are the changes needed?
Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. `DataFrame.mapInArrow` is supported as shown below.

```
>>> import pyarrow
>>> df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
>>> def filter_func(iterator):
...   for batch in iterator:
...     pdf = batch.to_pandas()
...     yield pyarrow.RecordBatch.from_pandas(pdf[pdf.id == 1])
...
>>> df.mapInArrow(filter_func, df.schema).show()
+---+---+
| id|age|
+---+---+
|  1| 21|
+---+---+
```

### How was this patch tested?
Unit tests.

Closes apache#40350 from xinrong-meng/mapInArrowImpl.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit f35c2cb)
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants