Conversation

@zhengruifeng zhengruifeng commented Sep 5, 2022

What changes were proposed in this pull request?

Implement GroupBy.nth

Why are the changes needed?

For API coverage.

Does this PR introduce any user-facing change?

Yes, a new API:

In [4]: import pyspark.pandas as ps

In [5]: import numpy as np

In [6]: df = ps.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [np.nan, 2, 3, 4, 5], 'C': ['a', 'b', 'c', 'd', 'e']}, columns=['A', 'B', 'C'])

In [7]: df.groupby('A').nth(0)
Out[7]: 
     B  C
A        
1  NaN  a
2  3.0  c

In [8]: df.groupby('A').nth(2)
Out[8]: 
     B  C
A        
1  4.0  d

In [9]: df.C.groupby(df.A).nth(-1)
Out[9]: 
A
1    d
2    e
Name: C, dtype: object

In [10]: df.C.groupby(df.A).nth(-2)
Out[10]: 
A
1    b
2    c
Name: C, dtype: object
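For reference, the positional semantics shown in the session above can be modeled in plain Python (a simplified illustrative sketch, not the actual implementation): nth(n) takes the value at position n within each group, keeping NaN values, with negative n counting from the group's end, and groups lacking an n-th element dropped.

```python
def group_nth(keys, values, n):
    """Return {group key: n-th value within that group}.

    Groups without an n-th element are dropped; negative n counts
    from the end of the group, mirroring the examples above.
    """
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    result = {}
    for k, vs in groups.items():
        idx = n if n >= 0 else len(vs) + n
        if 0 <= idx < len(vs):
            result[k] = vs[idx]
    return result

# mirrors df.C.groupby(df.A).nth(-1) from the session above
print(group_nth([1, 1, 2, 1, 2], ["a", "b", "c", "d", "e"], -1))
# -> {1: 'd', 2: 'e'}
```

Note how `group_nth(..., 2)` returns only group 1, matching `In [8]` above, since group 2 has no third element.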

How was this patch tested?

Added unit tests.


Member:
verify_temp_column_name

Member:
verify_temp_column_name

Member:

Add a test to cover this? I'm a little fuzzy about it.

Contributor Author:

There seems to be a bug in pandas' GroupBy.nth: its returned index varies with n:

In [23]: pdf
Out[23]: 
   A    B  C      D
0  1  3.1  a   True
1  2  4.1  b  False
2  1  4.1  b  False
3  2  3.1  a   True

In [24]: pdf.groupby(["A", "B", "C", "D"]).nth(0)
Out[24]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]

In [25]: pdf.groupby(["A", "B", "C", "D"]).nth(0).index
Out[25]: 
MultiIndex([(1, 3.1, 'a',  True),
            (1, 4.1, 'b', False),
            (2, 3.1, 'a',  True),
            (2, 4.1, 'b', False)],
           names=['A', 'B', 'C', 'D'])

In [26]: pdf.groupby(["A", "B", "C", "D"]).nth(1)
Out[26]: 
Empty DataFrame
Columns: []
Index: []

In [27]: pdf.groupby(["A", "B", "C", "D"]).nth(1).index
Out[27]: MultiIndex([], names=['A', 'B', 'C', 'D'])

In [28]: pdf.groupby(["A", "B", "C", "D"]).nth(-1)
Out[28]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]

In [29]: pdf.groupby(["A", "B", "C", "D"]).nth(-1).index
Out[29]: 
MultiIndex([(1, 3.1, 'a',  True),
            (1, 4.1, 'b', False),
            (2, 3.1, 'a',  True),
            (2, 4.1, 'b', False)],
           names=['A', 'B', 'C', 'D'])

In [30]: pdf.groupby(["A", "B", "C", "D"]).nth(-2)
Out[30]: 
Empty DataFrame
Columns: []
Index: []

In [31]: pdf.groupby(["A", "B", "C", "D"]).nth(-2).index
Out[31]: MultiIndex([], names=['A', 'B', 'C', 'D'])

while other functions behave like this in pandas and pandas API on Spark:

In [17]: pdf
Out[17]: 
   A    B  C      D
0  1  3.1  a   True
1  2  4.1  b  False
2  1  4.1  b  False
3  2  3.1  a   True

In [18]: pdf.groupby(["A", "B", "C", "D"]).max()
Out[18]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]

In [19]: pdf.groupby(["A", "B", "C", "D"]).mad()
Out[19]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]

In [20]: psdf.groupby(["A", "B", "C", "D"]).max()
Out[20]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]

In [21]: psdf.groupby(["A", "B", "C", "D"]).mad()
Out[21]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]

In [22]: 

In [22]: psdf.groupby(["A", "B", "C", "D"]).nth(0)
Out[22]: 
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]

Contributor Author:

So I think we cannot add a test for it for now.

Contributor:

If there is a bug in pandas, maybe we should add a test by manually creating the expected result rather than just skipping the test?

e.g.

if LooseVersion("1.1.1") <= LooseVersion(pd.__version__) < LooseVersion("1.1.4"):
    # a pandas bug: https://github.com/databricks/koalas/pull/1818#issuecomment-703961980
    self.assert_eq(psser.astype(str).tolist(), ["hi", "hi ", " ", " \t", "", "None"])
else:
    self.assert_eq(psser.astype(str), pser.astype(str))

itholic (Contributor) commented Sep 7, 2022:

Oh... I just noticed that we're following the pandas behavior even though there is a bug in pandas.

When there is a bug in pandas, we usually do something like this:

  • We don't follow the buggy behavior of pandas; we just assume it works properly and implement it that way.

  • Comment a link to the related pandas issue from the pandas repository (https://github.com/pandas-dev/pandas/issues/...) in the test, as below:

        if LooseVersion(pd.__version__) < LooseVersion("1.1.3"):
            # pandas < 1.1.0: object dtype is returned after negation
            # pandas 1.1.1 and 1.1.2:
            # a TypeError "bad operand type for unary -: 'IntegerArray'" is raised
            # Please refer to https://github.com/pandas-dev/pandas/issues/36063.
            self.check_extension(pd.Series([-1, -2, -3, None], dtype=pser.dtype), -psser)
        else:
            self.check_extension(-pser, -psser)

        if LooseVersion(pd.__version__) >= LooseVersion("1.1.0"):
            # Limit pandas version due to
            # https://github.com/pandas-dev/pandas/issues/31204
            self.check_extension(pser.astype(dtype), psser.astype(dtype))
        else:
            self.check_extension(pser.astype(dtype), psser.astype(dtype))

  • If it's not clear whether it's a bug (i.e., it hasn't been officially discussed as a bug in the pandas community), we can just follow the pandas behavior.

itholic (Contributor) commented Sep 7, 2022:

Or we can post a question to the pandas community asking whether it's a bug or intended behavior, and comment the question link here if they reply that it's a bug.

Member:

Validate n with a friendly exception?

>>> g.nth('C')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/yikun/venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 2304, in nth
    raise TypeError("n needs to be an int or a list/set/tuple of ints")
TypeError: n needs to be an int or a list/set/tuple of ints

Member:

def nth_value(col: "ColumnOrName", offset: int, ignoreNulls: Optional[bool] = False) -> Column:

Since 3.1, there is a nth_value function in Spark, but considering negative indices and that we are going to support list and slice in the future, I think using row_number is right here. Just FYI in case you have other ideas.

Contributor Author:

I guess we cannot apply nth_value for this purpose: it returns the n-th row within a window partition for each input row, so it cannot be used to filter out the unneeded rows.
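A rough plain-Python model of the row_number approach (an illustrative sketch, not the PR's actual Spark code): number the rows within each group, track group sizes, then keep only the rows whose position matches n (resolving negative n against the group size) — the filtering step that nth_value alone cannot provide, since it only projects a value onto every row.

```python
def nth_rows(rows, key, n):
    """rows: list of dicts. Keep the n-th row per group of rows[key],
    preserving input order, as a window row_number + count filter would."""
    sizes = {}
    numbered = []
    for row in rows:
        k = row[key]
        numbered.append((row, k, sizes.get(k, 0)))  # 0-based position in group
        sizes[k] = sizes.get(k, 0) + 1
    # resolve negative n against the group's size, like Python indexing
    target = lambda k: n if n >= 0 else sizes[k] + n
    return [row for row, k, pos in numbered if pos == target(k)]

data = [
    {"A": 1, "B": None}, {"A": 1, "B": 2.0}, {"A": 2, "B": 3.0},
    {"A": 1, "B": 4.0}, {"A": 2, "B": 5.0},
]
print(nth_rows(data, "A", 0))
# -> [{'A': 1, 'B': None}, {'A': 2, 'B': 3.0}]
```

The two passes mirror two window computations over the grouping columns: a row_number per group and a group count, followed by a filter.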

Member:
Suggested change
Returns
-------

Contributor:

Maybe we should create a ticket as a sub-task of SPARK-40327?

Contributor Author:

Let's add it when we start to implement the parameters.

Comment on lines 940 to 948
Contributor:

nit: Maybe raise NotImplementedError instead, since we should support the other types in the future?

Contributor Author:

pandas raises a TypeError for invalid n; see #37801 (comment)

itholic (Contributor) commented Sep 6, 2022:
Got it.

Btw, seems like the latest pandas (1.4.4) raises TypeError as below:

>>> g.nth("C")
Traceback (most recent call last):
...
TypeError: Invalid index <class 'str'>. Must be integer, list-like, slice or a tuple of integers and slices

Can we follow the TypeError and its message from pandas, for more informative errors to users?

Member:
FYI, upgrade pandas to 1.4.4 #37810

Contributor Author:

> nit: Maybe raise NotImplementedError instead, since we should support the other types in the future?

Let me take back my words. I think it should raise NotImplementedError for the types that pandas already supports but we cannot support right now.

itholic (Contributor) commented Sep 8, 2022:

Yes, so we can:

  • Raise TypeError for types that are unsupported in pandas as well.
  • Raise NotImplementedError for types that are not implemented in pandas API on Spark but exist in pandas.

Contributor:

If there is a bug in pandas, maybe we should add a test by manually creating the expected result rather than just skipping the test?

e.g.

if LooseVersion("1.1.1") <= LooseVersion(pd.__version__) < LooseVersion("1.1.4"):
    # a pandas bug: https://github.com/databricks/koalas/pull/1818#issuecomment-703961980
    self.assert_eq(psser.astype(str).tolist(), ["hi", "hi ", " ", " \t", "", "None"])
else:
    self.assert_eq(psser.astype(str), pser.astype(str))

Contributor:

qq: So, do we need an upper bound (1.5.0) here, since we're only going to support pandas 1.4.x for Apache Spark 3.4.0?

Contributor Author:

> < LooseVersion("1.5.0")

I think only 1.4.x will test this case.

itholic (Contributor) left a comment:
Looks pretty good to me

zhengruifeng (Contributor Author) commented:

I just reported the index-related issue here: pandas-dev/pandas#48434

As to the issue here, I personally think it's not a big deal. What about just mentioning in the docstring that when there are no aggregation columns, the returned empty DataFrame may have a different index than pandas?

@itholic @Yikun

Yikun (Member) commented Sep 7, 2022:

@zhengruifeng Yep, I'm fine with it!

itholic (Contributor) commented Sep 7, 2022:

@zhengruifeng I'm fine with it, too.

It would be great to add a Notes section and describe it there.

@zhengruifeng zhengruifeng force-pushed the ps_groupby_nth branch 2 times, most recently from 277a62f to c8c70bd Compare September 8, 2022 02:06
zhengruifeng (Contributor Author) commented:

Merged to master. Thanks @itholic and @Yikun for the reviews!

@zhengruifeng zhengruifeng deleted the ps_groupby_nth branch September 8, 2022 08:05
HyukjinKwon (Member) left a comment:
LGTM2
