@pull pull bot commented Nov 16, 2022

See Commits and Changes for more details.


Created by pull[bot]

amaliujia and others added 3 commits November 16, 2022 16:48
…messages

### What changes were proposed in this pull request?

This PR adds developer-facing documentation on how to add proto messages to the Connect proto. More specifically, it describes how to add a proto field while taking `required`, `optional`, and default values into consideration.

### Why are the changes needed?

Improve documentation for developers who want to update Connect proto.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

N/A

Closes #38605 from amaliujia/protobuf_design_doc.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…aframe comparison

### What changes were proposed in this pull request?
Use `assert_eq` in `PandasOnSparkTestCase` to compare DataFrames.

### Why are the changes needed?
Show a detailed error message when a DataFrame comparison fails.

Before:
```
======================================================================
ERROR [0.667s]: test_fill_na (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/sql/tests/connect/test_connect_basic.py", line 244, in test_fill_na
    self.assertTrue(
AssertionError: False is not true
----------------------------------------------------------------------
```

After:
```
AssertionError: DataFrame.iloc[:, 0] (column name="id") are different

DataFrame.iloc[:, 0] (column name="id") values are different (100.0 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[left]:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[right]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Left:
   id
id    int64
dtype: object

Right:
   id
id    int64
dtype: object
```
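
The difference between the two failure messages above can be reproduced in miniature with the standard library alone. The following is only a sketch of the general idea, not the `assert_eq` helper this change uses: it assumes plain Python lists instead of DataFrames, and `failure_message` is a hypothetical helper introduced for illustration.

```python
import unittest


def failure_message(check):
    """Run an assertion callable and return the AssertionError text, or None.

    Hypothetical helper for this illustration only.
    """
    try:
        check()
    except AssertionError as exc:
        return str(exc)
    return None


# A bare TestCase instance gives access to the assert* methods.
case = unittest.TestCase()
left = [0, 1, 2, 3]
right = [1, 2, 3, 4]

# The comparison collapses to a bool before unittest sees it,
# so the failure cannot mention either operand.
opaque = failure_message(lambda: case.assertTrue(left == right))

# unittest receives both operands and can report where they differ.
detailed = failure_message(lambda: case.assertEqual(left, right))

if __name__ == "__main__":
    print(opaque)
    print(detailed)
```

Passing both operands to the assertion, rather than a precomputed boolean, is what lets the test framework include the differing values in the report.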

### Does this PR introduce _any_ user-facing change?
No, this is a test-only change.

### How was this patch tested?
Existing unit tests.

Closes #38670 from zhengruifeng/connect_test_df_equal.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
This extends the implementation of column aliases in Spark Connect to support lists of column names and provides the corresponding implementation on the Python side.

### Why are the changes needed?
Compatibility

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests in Python and Scala.

Closes #38631 from grundprinzip/SPARK-40809-f.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Martin Grund <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@pull pull bot merged commit 0f7eaee into huangxiaopingRD:master Nov 16, 2022
huangxiaopingRD pushed a commit that referenced this pull request Jun 26, 2023
…onnect

### What changes were proposed in this pull request?
Implement Arrow-optimized Python UDFs in Spark Connect.

Please see apache#39384 for the motivation and performance improvements of Arrow-optimized Python UDFs.

### Why are the changes needed?
Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. In Spark Connect Python Client, users can:

1. Set the `useArrow` parameter to `True` to enable Arrow optimization for a specific Python UDF.

```python
>>> df = spark.range(2)
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#18 AS <lambda>(id)#16]
+- ArrowEvalPython [<lambda>(id#14L)#15], [pythonUDF0#18], 200
   +- *(1) Range (0, 2, step=1, splits=1)
```

2. Enable the `spark.sql.execution.pythonUDF.arrow.enabled` Spark conf to make all Python UDFs Arrow-optimized.

```python
>>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
>>> df.select(udf(lambda x : x + 1)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#30 AS <lambda>(id)#28]
+- ArrowEvalPython [<lambda>(id#26L)#27], [pythonUDF0#30], 200
   +- *(1) Range (0, 2, step=1, splits=1)

```

### How was this patch tested?
Parity unit tests.

Closes apache#40725 from xinrong-meng/connect_arrow_py_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>