[SPARK-40333][PS] Implement GroupBy.nth
#37801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
@@ -895,6 +895,104 @@ def sem(col: Column) -> Column:
    bool_to_numeric=True,
)

# TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
def nth(self, n: int) -> FrameLike:
def nth_value(col: "ColumnOrName", offset: int, ignoreNulls: Optional[bool] = False) -> Column:
Since 3.1 there is a nth_value function in Spark, but considering negative indices and that we are going to support lists and slices in the future, I think using row_number is the right approach here; just FYI in case you have another idea.
I guess we cannot apply nth_value for this purpose: it returns the n-th row within a window partition for every input row, so it cannot be used to filter out the unnecessary rows.
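As a rough illustration of the row_number-then-filter idea, here is a minimal sketch emulated in plain pandas (the helper name and sample data are mine, not from the PR; pandas' cumcount plays the role of Spark's row_number over a per-group Window):

```python
import pandas as pd

def nth_via_row_number(df: pd.DataFrame, by: str, n: int) -> pd.DataFrame:
    # 0-based row number within each group (ascending order of appearance)
    rn = df.groupby(by).cumcount()
    # 0 for the last row of each group, counting backwards
    rn_rev = df.groupby(by).cumcount(ascending=False)
    # Non-negative n counts from the front; negative n counts from the end
    mask = (rn == n) if n >= 0 else (rn_rev == -n - 1)
    return df[mask].set_index(by)

df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [3.1, 4.1, 4.1, 3.1]})
```

With this data, `nth_via_row_number(df, "A", 0)` keeps rows 0 and 1 (the first row of each group), and `n=-1` keeps rows 2 and 3, which is the same filtering a Window plus row_number would perform per group in Spark.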
Returns
-------
Validate n with a friendly exception? For example, pandas raises:
>>> g.nth('C')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/yikun/venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 2304, in nth
raise TypeError("n needs to be an int or a list/set/tuple of ints")
TypeError: n needs to be an int or a list/set/tuple of ints
Add a test to cover this? I'm a little fuzzy about it.
There seems to be a bug in pandas' GroupBy.nth: its returned index varies with n:
In [23]: pdf
Out[23]:
A B C D
0 1 3.1 a True
1 2 4.1 b False
2 1 4.1 b False
3 2 3.1 a True
In [24]: pdf.groupby(["A", "B", "C", "D"]).nth(0)
Out[24]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
In [25]: pdf.groupby(["A", "B", "C", "D"]).nth(0).index
Out[25]:
MultiIndex([(1, 3.1, 'a', True),
(1, 4.1, 'b', False),
(2, 3.1, 'a', True),
(2, 4.1, 'b', False)],
names=['A', 'B', 'C', 'D'])
In [26]: pdf.groupby(["A", "B", "C", "D"]).nth(1)
Out[26]:
Empty DataFrame
Columns: []
Index: []
In [27]: pdf.groupby(["A", "B", "C", "D"]).nth(1).index
Out[27]: MultiIndex([], names=['A', 'B', 'C', 'D'])
In [28]: pdf.groupby(["A", "B", "C", "D"]).nth(-1)
Out[28]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
In [29]: pdf.groupby(["A", "B", "C", "D"]).nth(-1).index
Out[29]:
MultiIndex([(1, 3.1, 'a', True),
(1, 4.1, 'b', False),
(2, 3.1, 'a', True),
(2, 4.1, 'b', False)],
names=['A', 'B', 'C', 'D'])
In [30]: pdf.groupby(["A", "B", "C", "D"]).nth(-2)
Out[30]:
Empty DataFrame
Columns: []
Index: []
In [31]: pdf.groupby(["A", "B", "C", "D"]).nth(-2).index
Out[31]: MultiIndex([], names=['A', 'B', 'C', 'D'])
while other functions in pandas and pandas-on-Spark behave like this:
In [17]: pdf
Out[17]:
A B C D
0 1 3.1 a True
1 2 4.1 b False
2 1 4.1 b False
3 2 3.1 a True
In [18]: pdf.groupby(["A", "B", "C", "D"]).max()
Out[18]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
In [19]: pdf.groupby(["A", "B", "C", "D"]).mad()
Out[19]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
In [20]: psdf.groupby(["A", "B", "C", "D"]).max()
Out[20]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
In [21]: psdf.groupby(["A", "B", "C", "D"]).mad()
Out[21]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
In [22]:
In [22]: psdf.groupby(["A", "B", "C", "D"]).nth(0)
Out[22]:
Empty DataFrame
Columns: []
Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
So I think we cannot add a test for it for now.
If there is a bug in pandas, maybe we should add a test by manually creating the expected result rather than just skipping the test?
e.g.

spark/python/pyspark/pandas/tests/test_series.py
Lines 1654 to 1658 in 6d2ce12

if LooseVersion("1.1.1") <= LooseVersion(pd.__version__) < LooseVersion("1.1.4"):
    # a pandas bug: https://github.com/databricks/koalas/pull/1818#issuecomment-703961980
    self.assert_eq(psser.astype(str).tolist(), ["hi", "hi ", " ", " \t", "", "None"])
else:
    self.assert_eq(psser.astype(str), pser.astype(str))
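Following that pattern, an nth test could gate its manually constructed expected result on the installed pandas version. A minimal, self-contained sketch of such a gate (the helper names and the buggy version range are assumptions for illustration, not the actual fix boundary):

```python
def parse_version(v: str) -> tuple:
    # Tiny stand-in for LooseVersion: compare numeric dot-components only,
    # silently dropping non-numeric parts like "dev" or "rc1".
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def pick_expected(pd_version: str, manual_expected, pandas_expected):
    # Use the hand-built expected result while the installed pandas falls
    # in the (assumed, illustrative) buggy range; otherwise compare
    # directly against pandas' own output.
    if parse_version(pd_version) < (1, 5, 0):
        return manual_expected
    return pandas_expected
```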
Oh... I just noticed that we're following the pandas behavior even though there is a bug in pandas.
When there is a bug in pandas, we usually do something like this:
- We don't follow the behavior of pandas; we just assume it works properly and implement it.
- Comment a link to the related pandas issue in the test, from the pandas repository (https://github.com/pandas-dev/pandas/issues/...), as below:

spark/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
Lines 510 to 517 in 0830575

if LooseVersion(pd.__version__) < LooseVersion("1.1.3"):
    # pandas < 1.1.0: object dtype is returned after negation
    # pandas 1.1.1 and 1.1.2:
    # a TypeError "bad operand type for unary -: 'IntegerArray'" is raised
    # Please refer to https://github.com/pandas-dev/pandas/issues/36063.
    self.check_extension(pd.Series([-1, -2, -3, None], dtype=pser.dtype), -psser)
else:
    self.check_extension(-pser, -psser)

spark/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
Lines 478 to 483 in 0830575

if LooseVersion(pd.__version__) >= LooseVersion("1.1.0"):
    # Limit pandas version due to
    # https://github.com/pandas-dev/pandas/issues/31204
    self.check_extension(pser.astype(dtype), psser.astype(dtype))
else:
    self.check_extension(pser.astype(dtype), psser.astype(dtype))
- If it's not clear whether it's a bug (i.e. it hasn't been officially discussed as a bug in the pandas community), we can just follow the pandas behavior.
Or we can ask the pandas community whether it's a bug or intended behavior, and link the question in a comment if they confirm it's a bug.
Maybe we should create a ticket as a sub-task of SPARK-40327?
Let's add it when we start implementing those parameters.