[pull] master from apache:master #28

pull · 2022-11-16T15:27:03Z

See Commits and Changes for more details.


…messages

### What changes were proposed in this pull request?

This PR adds developer-facing documentation on how to add proto messages to Connect proto. More specifically, it describes how to add a proto field, taking `required`, `optional`, and default values into consideration.

### Why are the changes needed?

Improve documentation for developers who want to update Connect proto.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #38605 from amaliujia/protobuf_design_doc.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…aframe comparison

### What changes were proposed in this pull request?

Use `assert_eq` in `PandasOnSparkTestCase` to compare dataframes.

### Why are the changes needed?

Show a detailed error message.

Before:

```
======================================================================
ERROR [0.667s]: test_fill_na (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/sql/tests/connect/test_connect_basic.py", line 244, in test_fill_na
    self.assertTrue(
AssertionError: False is not true
----------------------------------------------------------------------
```

After:

```
AssertionError: DataFrame.iloc[:, 0] (column name="id") are different

DataFrame.iloc[:, 0] (column name="id") values are different (100.0 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[left]:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[right]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Left:
id
id    int64
dtype: object

Right:
id
id    int64
dtype: object
```

### Does this PR introduce _any_ user-facing change?

No, test only

### How was this patch tested?

Existing UT

Closes #38670 from zhengruifeng/connect_test_df_equal.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
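The difference between the two failure messages above comes from the kind of assertion used: a boolean check can only report that a condition was false, while a value-aware check can report what differed. The sketch below illustrates that general principle with the standard library only. It is an analogy, not the `assert_eq` helper from `PandasOnSparkTestCase`; the names `failure_message`, `left`, and `right` are hypothetical, and plain lists stand in for the `id` column of the two dataframes.

```python
import unittest


def failure_message(check):
    """Run an assertion callable and return the AssertionError text it raises."""
    try:
        check()
    except AssertionError as exc:
        return str(exc)
    return None  # the assertion passed


# Hypothetical stand-ins for the "id" column of the two dataframes.
left = list(range(10))
right = list(range(1, 11))

case = unittest.TestCase()

# Boolean-style check: the failure only says that the condition was false.
opaque = failure_message(lambda: case.assertTrue(left == right))

# Value-aware check: the failure describes which elements differ.
detailed = failure_message(lambda: case.assertEqual(left, right))

print(opaque)
print(detailed)
```

The first message carries no information about the data, which is the situation in the "before" traceback; the second names the first differing element, which is closer in spirit to the "after" output.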

### What changes were proposed in this pull request?

This extends the implementation of column aliases in Spark Connect to support lists of column names and provides the corresponding implementation on the Python side.

### Why are the changes needed?

Compatibility

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT in Python and Scala

Closes #38631 from grundprinzip/SPARK-40809-f.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Martin Grund <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

…onnect

### What changes were proposed in this pull request?

Implement Arrow-optimized Python UDFs in Spark Connect.

Please see apache#39384 for the motivation and performance improvements of Arrow-optimized Python UDFs.

### Why are the changes needed?

Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?

Yes. In the Spark Connect Python Client, users can:

1. Set the `useArrow` parameter to `True` to enable Arrow optimization for a specific Python UDF.

```sh
>>> df = spark.range(2)
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#18 AS <lambda>(id)#16]
+- ArrowEvalPython [<lambda>(id#14L)#15], [pythonUDF0#18], 200
   +- *(1) Range (0, 2, step=1, splits=1)
```

2. Enable the `spark.sql.execution.pythonUDF.arrow.enabled` Spark conf to make all Python UDFs Arrow-optimized.

```sh
>>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
>>> df.select(udf(lambda x : x + 1)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#30 AS <lambda>(id)#28]
+- ArrowEvalPython [<lambda>(id#26L)#27], [pythonUDF0#30], 200
   +- *(1) Range (0, 2, step=1, splits=1)
```

### How was this patch tested?

Parity unit tests.

Closes apache#40725 from xinrong-meng/connect_arrow_py_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

amaliujia and others added 3 commits November 16, 2022 16:48

github-actions bot added CONNECT CORE DOCS PYTHON SQL labels Nov 16, 2022

pull bot added ⤵️ pull and removed CORE SQL DOCS PYTHON CONNECT labels Nov 16, 2022

pull bot merged commit 0f7eaee into huangxiaopingRD:master Nov 16, 2022
