Conversation

@zhengruifeng zhengruifeng commented Sep 5, 2022

What changes were proposed in this pull request?

Implement GroupBy.nth

Why are the changes needed?

For API coverage.

Does this PR introduce any user-facing change?

Yes, a new API:

In [4]: import pyspark.pandas as ps

In [5]: import numpy as np

In [6]: df = ps.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [np.nan, 2, 3, 4, 5], 'C': ['a', 'b', 'c', 'd', 'e']}, columns=['A', 'B', 'C'])

In [7]: df.groupby('A').nth(0)
Out[7]: 
     B  C
A        
1  NaN  a
2  3.0  c

In [8]: df.groupby('A').nth(2)
Out[8]: 
     B  C
A        
1  4.0  d

In [9]: df.C.groupby(df.A).nth(-1)
Out[9]: 
A
1    d
2    e
Name: C, dtype: object

In [10]: df.C.groupby(df.A).nth(-2)
Out[10]: 
A
1    b
2    c
Name: C, dtype: object
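For reference, the positional semantics shown in the session above can be modeled in plain Python (a simplified illustrative sketch, not the actual implementation): nth(n) takes the value at position n within each group, keeping NaN values, with negative n counting from the group's end, and groups lacking an n-th element dropped.

```python
def group_nth(keys, values, n):
    """Return {group key: n-th value within that group}.

    Groups without an n-th element are dropped; negative n counts
    from the end of the group, mirroring the examples above.
    """
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    result = {}
    for k, vs in groups.items():
        idx = n if n >= 0 else len(vs) + n
        if 0 <= idx < len(vs):
            result[k] = vs[idx]
    return result

# mirrors df.C.groupby(df.A).nth(-1) from the session above
print(group_nth([1, 1, 2, 1, 2], ["a", "b", "c", "d", "e"], -1))
# -> {1: 'd', 2: 'e'}
```

Note how `group_nth(..., 2)` returns only group 1, matching `In [8]` above, since group 2 has no third element.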

How was this patch tested?

Added unit tests.


Member:
verify_temp_column_name

Member:
verify_temp_column_name

Member:

Add a test to cover this? I'm a little fuzzy about it.

Contributor Author:

There seems to be a bug in pandas' GroupBy.nth: its returned index varies with n:

In [23]: pdf
Out[23]: 
   A    B  C      D
0  1  3.1  a   True
1  2  4.1  b  False
2  1  4.1  b  False
3  2  3.1  a   True

In [24]: pdf.groupby(["A", "B", "C", "D"]).nth(0)
Out[24]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]

In [25]: pdf.groupby(["A", "B", "C", "D"]).nth(0).index
Out[25]: 
MultiIndex([(1, 3.1, 'a',  True),
            (1, 4.1, 'b', False),
            (2, 3.1, 'a',  True),
            (2, 4.1, 'b', False)],
           names=['A', 'B', 'C', 'D'])

In [26]: pdf.groupby(["A", "B", "C", "D"]).nth(1)
Out[26]: 
Empty DataFrame
Columns: []
Index: []

In [27]: pdf.groupby(["A", "B", "C", "D"]).nth(1).index
Out[27]: MultiIndex([], names=['A', 'B', 'C', 'D'])

In [28]: pdf.groupby(["A", "B", "C", "D"]).nth(-1)
Out[28]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]

In [29]: pdf.groupby(["A", "B", "C", "D"]).nth(-1).index
Out[29]: 
MultiIndex([(1, 3.1, 'a',  True),
            (1, 4.1, 'b', False),
            (2, 3.1, 'a',  True),
            (2, 4.1, 'b', False)],
           names=['A', 'B', 'C', 'D'])

In [30]: pdf.groupby(["A", "B", "C", "D"]).nth(-2)
Out[30]: 
Empty DataFrame
Columns: []
Index: []

In [31]: pdf.groupby(["A", "B", "C", "D"]).nth(-2).index
Out[31]: MultiIndex([], names=['A', 'B', 'C', 'D'])

while other functions behave like this in pandas and pandas API on Spark:

In [17]: pdf
Out[17]: 
   A    B  C      D
0  1  3.1  a   True
1  2  4.1  b  False
2  1  4.1  b  False
3  2  3.1  a   True

In [18]: pdf.groupby(["A", "B", "C", "D"]).max()
Out[18]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]

In [19]: pdf.groupby(["A", "B", "C", "D"]).mad()
Out[19]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]

In [20]: psdf.groupby(["A", "B", "C", "D"]).max()
Out[20]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]

In [21]: psdf.groupby(["A", "B", "C", "D"]).mad()
Out[21]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]

In [22]: 

In [22]: psdf.groupby(["A", "B", "C", "D"]).nth(0)
Out[22]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]

Contributor Author:

So I think we cannot add a test for it for now.

Contributor:

If there is a bug in pandas, maybe we should add a test by manually creating the expected result rather than just skipping the test?

e.g.

if LooseVersion("1.1.1") <= LooseVersion(pd.__version__) < LooseVersion("1.1.4"):
    # a pandas bug: https://github.com/databricks/koalas/pull/1818#issuecomment-703961980
    self.assert_eq(psser.astype(str).tolist(), ["hi", "hi ", " ", " \t", "", "None"])
else:
    self.assert_eq(psser.astype(str), pser.astype(str))

itholic (Contributor) commented Sep 7, 2022:

Oh... I just noticed that we're following the pandas behavior even though there is a bug in pandas.

When there is a bug in pandas, we usually do something like this:

  • We don't follow the buggy behavior of pandas; we just assume it works properly and implement it that way.

  • Comment a link to the related pandas issue from the pandas repository (https://github.com/pandas-dev/pandas/issues/...) in the test, as below:

        if LooseVersion(pd.__version__) < LooseVersion("1.1.3"):
            # pandas < 1.1.0: object dtype is returned after negation
            # pandas 1.1.1 and 1.1.2:
            # a TypeError "bad operand type for unary -: 'IntegerArray'" is raised
            # Please refer to https://github.com/pandas-dev/pandas/issues/36063.
            self.check_extension(pd.Series([-1, -2, -3, None], dtype=pser.dtype), -psser)
        else:
            self.check_extension(-pser, -psser)

        if LooseVersion(pd.__version__) >= LooseVersion("1.1.0"):
            # Limit pandas version due to
            # https://github.com/pandas-dev/pandas/issues/31204
            self.check_extension(pser.astype(dtype), psser.astype(dtype))
        else:
            self.check_extension(pser.astype(dtype), psser.astype(dtype))

  • If it's not clear whether it's a bug (i.e., it hasn't been officially discussed as a bug in the pandas community), we can just follow the pandas behavior.

itholic (Contributor) commented Sep 7, 2022:

Or we can post a question to the pandas community asking whether it's a bug or intended behavior, and comment the question link here if they reply that it's a bug.

Member:

Validate n with a friendly exception?

>>> g.nth('C')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/yikun/venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 2304, in nth
    raise TypeError("n needs to be an int or a list/set/tuple of ints")
TypeError: n needs to be an int or a list/set/tuple of ints

Member:

def nth_value(col: "ColumnOrName", offset: int, ignoreNulls: Optional[bool] = False) -> Column:

Since 3.1, there is a nth_value function in Spark, but considering negative indices and that we are going to support list and slice in the future, I think using row_number is right here. Just FYI in case you have other ideas.

Contributor Author:

I guess we cannot apply nth_value for this purpose: it returns the n-th row within a window partition for each input row, so it cannot be used to filter out the unneeded rows.
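A rough plain-Python model of the row_number approach (an illustrative sketch, not the PR's actual Spark code): number the rows within each group, track group sizes, then keep only the rows whose position matches n (resolving negative n against the group size) — the filtering step that nth_value alone cannot provide, since it only projects a value onto every row.

```python
def nth_rows(rows, key, n):
    """rows: list of dicts. Keep the n-th row per group of rows[key],
    preserving input order, as a window row_number + count filter would."""
    sizes = {}
    numbered = []
    for row in rows:
        k = row[key]
        numbered.append((row, k, sizes.get(k, 0)))  # 0-based position in group
        sizes[k] = sizes.get(k, 0) + 1
    # resolve negative n against the group's size, like Python indexing
    target = lambda k: n if n >= 0 else sizes[k] + n
    return [row for row, k, pos in numbered if pos == target(k)]

data = [
    {"A": 1, "B": None}, {"A": 1, "B": 2.0}, {"A": 2, "B": 3.0},
    {"A": 1, "B": 4.0}, {"A": 2, "B": 5.0},
]
print(nth_rows(data, "A", 0))
# -> [{'A': 1, 'B': None}, {'A': 2, 'B': 3.0}]
```

The two passes mirror two window computations over the grouping columns: a row_number per group and a group count, followed by a filter.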

Member:
Suggested change
Returns
-------

Contributor:

Maybe we should create a ticket as a sub-task of SPARK-40327?

Contributor Author:

Let's add it when we start to implement the parameters.

Comment on lines 940 to 948
Contributor:

nit: Maybe raise NotImplementedError instead, since we should support the other types in the future?

Contributor Author:

pandas raises a TypeError for invalid n; see #37801 (comment)

itholic (Contributor) commented Sep 6, 2022:
Got it.

Btw, seems like the latest pandas (1.4.4) raises TypeError as below:

>>> g.nth("C")
Traceback (most recent call last):
...
TypeError: Invalid index <class 'str'>. Must be integer, list-like, slice or a tuple of integers and slices

Can we follow the TypeError and its message from pandas, for more informative errors to users?

Member:
FYI, upgrade pandas to 1.4.4 #37810

Contributor Author:

> nit: Maybe raise NotImplementedError instead, since we should support the other types in the future?

Let me take back my words. I think it should raise NotImplementedError for the types that pandas already supports but we cannot support right now.

itholic (Contributor) commented Sep 8, 2022:

Yes, so we can:

  • Raise TypeError for types that are unsupported in pandas as well.
  • Raise NotImplementedError for types that are not implemented in pandas API on Spark but exist in pandas.

Contributor:

If there is a bug in pandas, maybe we should add a test by manually creating the expected result rather than just skipping the test?

e.g.

if LooseVersion("1.1.1") <= LooseVersion(pd.__version__) < LooseVersion("1.1.4"):
    # a pandas bug: https://github.com/databricks/koalas/pull/1818#issuecomment-703961980
    self.assert_eq(psser.astype(str).tolist(), ["hi", "hi ", " ", " \t", "", "None"])
else:
    self.assert_eq(psser.astype(str), pser.astype(str))

Contributor:

qq: So, do we need an upper bound (1.5.0) here, since we're only going to support pandas 1.4.x for Apache Spark 3.4.0?

Contributor Author:

> < LooseVersion("1.5.0")

I think only 1.4.x will test this case.

itholic (Contributor) left a comment:
Looks pretty good to me

zhengruifeng (Contributor Author) commented:

I just reported the index-related issue here: pandas-dev/pandas#48434

As to the issue here, I personally think it's not a big deal. What about just mentioning in the docstring that when there are no aggregation columns, the returned empty DataFrame may have a different index than pandas?

@itholic @Yikun

Yikun (Member) commented Sep 7, 2022:

@zhengruifeng Yep, I'm fine with it!

itholic (Contributor) commented Sep 7, 2022:

@zhengruifeng I'm fine with it, too.

It would be great to add a Notes section and describe it there.

@zhengruifeng zhengruifeng force-pushed the ps_groupby_nth branch 2 times, most recently from 277a62f to c8c70bd Compare September 8, 2022 02:06
zhengruifeng (Contributor Author) commented:

Merged to master. Thanks @itholic and @Yikun for the reviews!

@zhengruifeng zhengruifeng deleted the ps_groupby_nth branch September 8, 2022 08:05
HyukjinKwon (Member) left a comment:
LGTM2
