[SPARK-40333][PS] Implement GroupBy.nth
#37801
Conversation
python/pyspark/pandas/groupby.py
Use verify_temp_column_name here.
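For context, a minimal sketch of how that helper is typically used (the temp column name here is illustrative, not the PR's actual name):

    from pyspark.pandas.utils import verify_temp_column_name

    # Raises if the name collides with an existing column and enforces the
    # double-underscore convention for internal temporary columns.
    tmp_col = verify_temp_column_name(sdf, "__tmp_nth_col__")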
python/pyspark/pandas/groupby.py
Use verify_temp_column_name here as well.
python/pyspark/pandas/groupby.py
Add a test to cover this? I'm a little fuzzy about this
There seems to be a bug in pandas' GroupBy.nth: the returned index varies with n:
In [23]: pdf
Out[23]:
A B C D
0 1 3.1 a True
1 2 4.1 b False
2 1 4.1 b False
3 2 3.1 a True
In [24]: pdf.groupby(["A", "B", "C", "D"]).nth(0)
Out[24]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
In [25]: pdf.groupby(["A", "B", "C", "D"]).nth(0).index
Out[25]:
MultiIndex([(1, 3.1, 'a', True),
(1, 4.1, 'b', False),
(2, 3.1, 'a', True),
(2, 4.1, 'b', False)],
names=['A', 'B', 'C', 'D'])
In [26]: pdf.groupby(["A", "B", "C", "D"]).nth(1)
Out[26]:
Empty DataFrame
Columns: []
Index: []
In [27]: pdf.groupby(["A", "B", "C", "D"]).nth(1).index
Out[27]: MultiIndex([], names=['A', 'B', 'C', 'D'])
In [28]: pdf.groupby(["A", "B", "C", "D"]).nth(-1)
Out[28]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
In [29]: pdf.groupby(["A", "B", "C", "D"]).nth(-1).index
Out[29]:
MultiIndex([(1, 3.1, 'a', True),
(1, 4.1, 'b', False),
(2, 3.1, 'a', True),
(2, 4.1, 'b', False)],
names=['A', 'B', 'C', 'D'])
In [30]: pdf.groupby(["A", "B", "C", "D"]).nth(-2)
Out[30]:
Empty DataFrame
Columns: []
Index: []
In [31]: pdf.groupby(["A", "B", "C", "D"]).nth(-2).index
Out[31]: MultiIndex([], names=['A', 'B', 'C', 'D'])
while other functions behave like this in both pandas and pandas API on Spark:
In [17]: pdf
Out[17]:
A B C D
0 1 3.1 a True
1 2 4.1 b False
2 1 4.1 b False
3 2 3.1 a True
In [18]: pdf.groupby(["A", "B", "C", "D"]).max()
Out[18]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
In [19]: pdf.groupby(["A", "B", "C", "D"]).mad()
Out[19]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
In [20]: psdf.groupby(["A", "B", "C", "D"]).max()
Out[20]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
In [21]: psdf.groupby(["A", "B", "C", "D"]).mad()
Out[21]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
In [22]:
In [22]: psdf.groupby(["A", "B", "C", "D"]).nth(0)
Out[22]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
So I think we cannot add a test for it for now.
If there is a bug in pandas, maybe we should add a test by manually creating the expected result, rather than just skipping the test?
e.g.
spark/python/pyspark/pandas/tests/test_series.py
Lines 1654 to 1658 in 6d2ce12
if LooseVersion("1.1.1") <= LooseVersion(pd.__version__) < LooseVersion("1.1.4"):
    # a pandas bug: https://github.com/databricks/koalas/pull/1818#issuecomment-703961980
    self.assert_eq(psser.astype(str).tolist(), ["hi", "hi ", " ", " \t", "", "None"])
else:
    self.assert_eq(psser.astype(str), pser.astype(str))
Oh... I just noticed that we're following the pandas behavior even though there is a bug in pandas.
When there is a bug in pandas, we usually do something like this:
- We don't follow the buggy behavior of pandas; we just assume it works properly and implement it that way.
- We comment a link to the related pandas issue (https://github.com/pandas-dev/pandas/issues/...) in the test, as below:

spark/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
Lines 510 to 517 in 0830575

if LooseVersion(pd.__version__) < LooseVersion("1.1.3"):
    # pandas < 1.1.0: object dtype is returned after negation
    # pandas 1.1.1 and 1.1.2:
    # a TypeError "bad operand type for unary -: 'IntegerArray'" is raised
    # Please refer to https://github.com/pandas-dev/pandas/issues/36063.
    self.check_extension(pd.Series([-1, -2, -3, None], dtype=pser.dtype), -psser)
else:
    self.check_extension(-pser, -psser)

spark/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
Lines 478 to 483 in 0830575

if LooseVersion(pd.__version__) >= LooseVersion("1.1.0"):
    # Limit pandas version due to
    # https://github.com/pandas-dev/pandas/issues/31204
    self.check_extension(pser.astype(dtype), psser.astype(dtype))
else:
    self.check_extension(pser.astype(dtype), psser.astype(dtype))

- If it's not clear that it's a bug (unless it's officially discussed as a bug in the pandas community), we can just follow the pandas behavior.
Or we can post a question to the pandas community asking whether it's a bug or intended behavior, and comment the question link in the test if they reply "yes, it's a bug".
python/pyspark/pandas/groupby.py
validate n with a friendly exception?
>>> g.nth('C')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/yikun/venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 2304, in nth
raise TypeError("n needs to be an int or a list/set/tuple of ints")
TypeError: n needs to be an int or a list/set/tuple of ints
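A minimal sketch of such validation (an assumption about the approach, not the PR's exact code):

    if not isinstance(n, int):
        # mirror pandas' friendly error message instead of failing later
        raise TypeError("n needs to be an int or a list/set/tuple of ints")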
python/pyspark/pandas/groupby.py
spark/python/pyspark/sql/functions.py
Line 3622 in 5a03f70

def nth_value(col: "ColumnOrName", offset: int, ignoreNulls: Optional[bool] = False) -> Column:
Since 3.1 there has been an nth_value function in Spark, but considering negative indices, and that we are going to support list and slice in the future, I think using row_number is right here; just FYI in case you have another idea.
I guess we cannot apply nth_value for this purpose: it returns the n-th row within the partition for each input row, so it cannot be used to filter out unnecessary rows.
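For reference, a minimal sketch of the row_number approach (the variable and column names here are hypothetical, not the PR's internals):

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Number the rows within each group in their natural order, then keep
    # only the n-th row of every group.
    window = Window.partitionBy(*groupkey_cols).orderBy("__natural_order__")
    sdf = sdf.withColumn("__row_number__", F.row_number().over(window))
    if n >= 0:
        sdf = sdf.where(F.col("__row_number__") == n + 1)  # row_number is 1-based
    else:
        # For negative n, compute the group size and index from the end.
        count_window = Window.partitionBy(*groupkey_cols)
        sdf = sdf.withColumn("__group_count__", F.count(F.lit(0)).over(count_window))
        sdf = sdf.where(F.col("__row_number__") == F.col("__group_count__") + 1 + n)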
python/pyspark/pandas/groupby.py
Returns
-------
python/pyspark/pandas/groupby.py
Maybe we should create a ticket as a sub-task of SPARK-40327?
Let's add it when we start to implement the parameters.
python/pyspark/pandas/groupby.py
nit: Maybe raise NotImplementedError instead, since we should support the other types in the future?
pandas raises a TypeError for an invalid n; see #37801 (comment).
Got it.
Btw, seems like the latest pandas (1.4.4) raises TypeError as below:
>>> g.nth("C")
Traceback (most recent call last):
...
TypeError: Invalid index <class 'str'>. Must be integer, list-like, slice or a tuple of integers and slices

Can we follow this TypeError and its message from pandas, to give users more information?
FYI, upgrade pandas to 1.4.4 #37810
> nit: Maybe raise NotImplementedError instead, since we should support the other types in the future?

Let me take back my words: I think it should raise NotImplementedError for the types that pandas already supports but we cannot support right now.
Yes, so we can:
- Raise TypeError for types that are unsupported in pandas as well.
- Raise NotImplementedError for types that exist in pandas but are not yet implemented in pandas API on Spark (see the sketch below).
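A minimal sketch of that split (the messages and the exact checks are assumptions, not the PR's final code):

    if isinstance(n, (list, set, tuple, slice)):
        # supported by pandas, but not yet by pandas API on Spark
        raise NotImplementedError("n as a list/set/tuple/slice is currently not supported")
    elif not isinstance(n, int):
        # rejected by pandas as well
        raise TypeError(
            "Invalid index %s. Must be integer, list-like, slice "
            "or a tuple of integers and slices" % type(n)
        )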
python/pyspark/pandas/groupby.py
> If there is a bug in pandas, maybe we should add a test by manually creating the expected result, rather than just skipping the test?
qq: So, do we need an upper bound (1.5.0) here, since we're only going to support pandas 1.4.x for Apache Spark 3.4.0?
> < LooseVersion("1.5.0")

I think only 1.4.x will test this case.
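For example, the version gate might look like this (a sketch; the lower bound is an assumption):

    if LooseVersion("1.4.0") <= LooseVersion(pd.__version__) < LooseVersion("1.5.0"):
        # only pandas 1.4.x exercises the manually-created expected result
        ...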
itholic
left a comment
Looks pretty good to me
I just reported the index-related issue here: pandas-dev/pandas#48434. As to the issue here, I personally think it's not a big deal. What about just mentioning it in the docstring that ...

@zhengruifeng Yep, I'm fine with it!

@zhengruifeng I'm fine with it, too. It would be great to have the ...
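A hypothetical sketch of such a docstring note (the wording is an assumption, not the PR's actual text):

    Notes
    -----
    There is a behavior difference from pandas: when all columns are used as
    group keys, the index returned by nth may differ due to a pandas issue;
    see https://github.com/pandas-dev/pandas/issues/48434.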
HyukjinKwon
left a comment
LGTM2
What changes were proposed in this pull request?
Implement GroupBy.nth.

Why are the changes needed?
for API coverage
Does this PR introduce any user-facing change?
yes, new API
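For illustration, a hypothetical usage sketch (data and output shown here are illustrative, not from the PR):

    >>> import pyspark.pandas as ps
    >>> psdf = ps.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})
    >>> psdf.groupby("A").nth(1)  # second row of each group
       B
    A
    1  2
    2  4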
How was this patch tested?
added UT