@zhengruifeng (Contributor) commented on Sep 1, 2023

### What changes were proposed in this pull request?

Make `DataFrame.groupBy` accept integer ordinals that refer to columns by their 1-based position.

### Why are the changes needed?

For feature parity with SQL, which already supports ordinals in `GROUP BY` and `ORDER BY` clauses:

```
select target_country, ua_date, sum(spending_usd)
from df
group by 2, 1
order by 2, 3 desc
```

This PR focuses on the `groupBy` method.

### Does this PR introduce any user-facing change?

Yes, this adds a new feature:

```
In [2]: from pyspark.sql import functions as sf

In [3]: df = spark.createDataFrame([(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)], ["a", "b"])

In [4]: df.select("a", sf.lit(1), "b").groupBy("a", 2).agg(sf.sum("b")).show()
+---+---+------+
|  a|  1|sum(b)|
+---+---+------+
|  1|  1|     3|
|  2|  1|     3|
|  3|  1|     3|
+---+---+------+
```
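For illustration, here is a minimal sketch using an ordinal only, assuming the behavior described above (ordinals are 1-based and resolve against the columns of the DataFrame being grouped, and an active `spark` session exists); this snippet is not taken from the PR itself:

```python
from pyspark.sql import functions as sf

df = spark.createDataFrame(
    [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)], ["a", "b"]
)

# Group by the first column ("a") via its 1-based ordinal instead of its name.
df.groupBy(1).agg(sf.sum("b")).show()

# This should be equivalent to grouping by the column name:
df.groupBy("a").agg(sf.sum("b")).show()
```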

### How was this patch tested?

Added unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng (author) commented:
cc @HyukjinKwon. If this fix is fine, I will support other APIs in follow-up PRs.

@dongjoon-hyun (Member) left a comment:
+1, LGTM. Thank you.
Merged to master for Apache Spark 4.0.

@zhengruifeng commented:
@HyukjinKwon @dongjoon-hyun thanks for the review.

@zhengruifeng deleted the py_groupby_index branch on September 4, 2023 at 23:59.
dongjoon-hyun added a commit that referenced this pull request on Jan 16, 2024 ("…ip Pandas/PyArrow tests if not available"):

### What changes were proposed in this pull request?

This PR aims to skip `Pandas`-related or `PyArrow`-related tests in `pyspark.sql.tests.test_group` if the corresponding packages are not installed.

This regression was introduced by
- #44322
- #42767

### Why are the changes needed?

Since `Pandas` and `PyArrow` are optional, we need to skip the tests instead of failing.
- https://github.com/apache/spark/actions/runs/7543495430/job/20534809039

```
======================================================================
ERROR: test_agg_func (pyspark.sql.tests.test_group.GroupTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/sql/pandas/utils.py", line 28, in require_minimum_pandas_version
    import pandas
ModuleNotFoundError: No module named 'pandas'
```

```
======================================================================
ERROR: test_agg_func (pyspark.sql.tests.test_group.GroupTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/pandas/utils.py", line 61, in require_minimum_pyarrow_version
    import pyarrow
ModuleNotFoundError: No module named 'pyarrow'
```
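
For context, the usual way to guard such tests is to probe for the optional packages at import time and mark the dependent tests as skipped. Below is a minimal, self-contained sketch of that pattern; the class name `GroupTestsExample` and the flags `have_pandas`/`have_pyarrow` are illustrative, not the actual `test_group.py` code:

```python
import unittest

# Detect optional dependencies up front; tests are skipped rather than failed
# when a package is missing.
try:
    import pandas  # noqa: F401
    have_pandas = True
except ImportError:
    have_pandas = False

try:
    import pyarrow  # noqa: F401
    have_pyarrow = True
except ImportError:
    have_pyarrow = False


class GroupTestsExample(unittest.TestCase):
    @unittest.skipIf(not have_pandas, "Pandas is not installed")
    @unittest.skipIf(not have_pyarrow, "PyArrow is not installed")
    def test_agg_func(self):
        # The real test body would exercise Pandas/PyArrow-backed aggregation.
        pass


if __name__ == "__main__":
    unittest.main()
```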

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Manually with the Python installation without Pandas.
```
$ python/run-tests.py --testnames pyspark.sql.tests.test_group
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.9', 'pypy3']
Will test the following Python tests: ['pyspark.sql.tests.test_group']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.18
pypy3 python_implementation is PyPy
pypy3 version is: Python 3.10.13 (f1607341da97ff5a1e93430b6e8c4af0ad1aa019, Sep 28 2023, 20:47:55)
[PyPy 7.3.13 with GCC Apple LLVM 13.1.6 (clang-1316.0.21.2.5)]
Starting test(python3.9): pyspark.sql.tests.test_group (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/ac9269b6-f0df-4d06-88b8-e5e710202b60/python3.9__pyspark.sql.tests.test_group__9zjp5i4z.log)
Starting test(pypy3): pyspark.sql.tests.test_group (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/cab6ebed-e49f-4d86-80db-0dc3928079e3/pypy3__pyspark.sql.tests.test_group__thw6hily.log)
Finished test(pypy3): pyspark.sql.tests.test_group (6s) ... 3 tests were skipped
Finished test(python3.9): pyspark.sql.tests.test_group (7s) ... 3 tests were skipped
Tests passed in 7 seconds

Skipped tests in pyspark.sql.tests.test_group with pypy3:
    test_agg_func (pyspark.sql.tests.test_group.GroupTests) ... skipped '[PACKAGE_NOT_INSTALLED] Pandas >= 1.4.4 must be installed; however, it was not found.'
    test_group_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... skipped '[PACKAGE_NOT_INSTALLED] Pandas >= 1.4.4 must be installed; however, it was not found.'
    test_order_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... skipped '[PACKAGE_NOT_INSTALLED] Pandas >= 1.4.4 must be installed; however, it was not found.'

Skipped tests in pyspark.sql.tests.test_group with python3.9:
      test_agg_func (pyspark.sql.tests.test_group.GroupTests) ... SKIP (0.000s)
      test_group_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... SKIP (0.000s)
      test_order_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... SKIP (0.000s)
```

- Manually with the Python installation without PyArrow.
```
$ python/run-tests.py --testnames pyspark.sql.tests.test_group
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.9', 'pypy3']
Will test the following Python tests: ['pyspark.sql.tests.test_group']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.18
pypy3 python_implementation is PyPy
pypy3 version is: Python 3.10.13 (f1607341da97ff5a1e93430b6e8c4af0ad1aa019, Sep 28 2023, 20:47:55)
[PyPy 7.3.13 with GCC Apple LLVM 13.1.6 (clang-1316.0.21.2.5)]
Starting test(pypy3): pyspark.sql.tests.test_group (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/7f1a665e-a679-467c-8ab4-a4532e0b2300/pypy3__pyspark.sql.tests.test_group__i67erhb4.log)
Starting test(python3.9): pyspark.sql.tests.test_group (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/47b90765-8ad7-4da0-aa7b-c12cd266847e/python3.9__pyspark.sql.tests.test_group__190hx0tm.log)
Finished test(python3.9): pyspark.sql.tests.test_group (6s) ... 3 tests were skipped
Finished test(pypy3): pyspark.sql.tests.test_group (7s) ... 3 tests were skipped
Tests passed in 7 seconds

Skipped tests in pyspark.sql.tests.test_group with pypy3:
    test_agg_func (pyspark.sql.tests.test_group.GroupTests) ... skipped '[PACKAGE_NOT_INSTALLED] PyArrow >= 4.0.0 must be installed; however, it was not found.'
    test_group_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... skipped '[PACKAGE_NOT_INSTALLED] PyArrow >= 4.0.0 must be installed; however, it was not found.'
    test_order_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... skipped '[PACKAGE_NOT_INSTALLED] PyArrow >= 4.0.0 must be installed; however, it was not found.'

Skipped tests in pyspark.sql.tests.test_group with python3.9:
      test_agg_func (pyspark.sql.tests.test_group.GroupTests) ... SKIP (0.000s)
      test_group_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... SKIP (0.000s)
      test_order_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... SKIP (0.000s)
```
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44759 from dongjoon-hyun/SPARK-46735.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>