Implement DataFrame.groupby.cumcount #1702
So far this only works if there is actually a column left after the group by, e.g.
```python
>>> df = ks.DataFrame(
...     [['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
...     columns=list('A'))
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount(ascending=False).sort_index()
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: 0, dtype: float64
>>> df["B"] = df["A"]
>>> df.groupby('A').cumcount().sort_index()
0    0
1    1
2    2
3    0
4    1
5    3
Name: 0, dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0    3
1    2
2    1
3    1
4    0
5    0
Name: 0, dtype: int64
```
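For readers unfamiliar with the operation, the numbering that `cumcount` produces can be sketched in plain Python. This is only an illustration of the semantics shown in the output above, not the Koalas or pandas implementation; the helper name `cumcount` and its signature are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def cumcount(keys: Sequence[Hashable], ascending: bool = True) -> List[int]:
    """Number each item within its group, in order of appearance (illustrative helper)."""
    seen = defaultdict(int)  # how many items of each group were seen so far
    forward = []
    for key in keys:
        forward.append(seen[key])
        seen[key] += 1
    if ascending:
        return forward
    # Descending: count down from (group size - 1) to 0 within each group.
    return [seen[key] - 1 - pos for key, pos in zip(keys, forward)]


keys = ["a", "a", "a", "b", "b", "a"]
print(cumcount(keys))                   # [0, 1, 2, 0, 1, 3]
print(cumcount(keys, ascending=False))  # [3, 2, 1, 1, 0, 0]
```

The two printed lists match the `int64` results for column `A` in the description above.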
@HyukjinKwon Sure, I'm gladly going to review this!
I just submitted a question to the pandas repo (pandas-dev/pandas#35682)
Would you mind adding this to
Anyway, I think we'd better return the result as int64, like pandas, rather than float64:
>>> pdf
   A     B  C
0  1   NaN  4
1  1   0.1  3
2  1  20.0  2
3  4  10.0  1
>>> pdf.groupby("A").cumcount()
0    0
1    1
2    2
3    0
dtype: int64
>>> kdf.groupby("A").cumcount()
0    0.0
1    1.0
2    2.0
3    0.0
Name: 0, dtype: float64
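The dtype difference can be reproduced on the pandas side alone. The sketch below rebuilds `pdf` from the output above and shows that a float64 result without missing values can be cast back to int64 losslessly. The `float_result` series is only a stand-in for a backend result that came back as float64; it is an assumption for illustration and says nothing about how Koalas computes the values.

```python
import pandas as pd

pdf = pd.DataFrame(
    {"A": [1, 1, 1, 4], "B": [None, 0.1, 20.0, 10.0], "C": [4, 3, 2, 1]}
)

# pandas returns the per-group running count as int64.
expected = pdf.groupby("A").cumcount()

# Stand-in for a result that came back as float64 (assumption for this sketch).
float_result = expected.astype("float64")

# There are no missing values, so the cast back to int64 loses nothing.
assert not float_result.isna().any()
restored = float_result.astype("int64")

print(expected.dtype, float_result.dtype, restored.dtype)
```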
The above changes are now all implemented. I don't quite like the explicit cast to
Unfortunately, the order is not preserved, as can be seen in this test (which I wanted to add to the test suite now):
Not sure where this could be ensured in the new
@itholic For pandas-dev/pandas#35682,
Also could you fix the format?
Thanks, @ueshin. I'll close that QST with proper comments.
Otherwise, LGTM pending tests.
# TODO: Enable the following test when ks.groupby(dropna=True)
# is implemented (see #1007)
# self.assert_eq(
#     kdf.groupby(["a", "b"]).cumcount().sort_index(),
#     pdf.groupby(["a", "b"]).cumcount().sort_index(),
# )
Shall we use the same test dataset as the other `cumxxx` tests for now? Then we can enable this and one more below. Let's see #1007 for this case.
@tomspur Could you update the tests?
I just changed the columns above from `["a", "b"]` to `["a", "c"]`, which does not contain `None`s, so that there is at least one other test, and kept the comments as a reminder to enable testing with `None`s, at least partially.
I used the same dataset as the other tests for now. Do you want to add a `None` to them once #1007 is implemented?
Yes, we should modify or add tests which include `None`s in group keys at #1007.
@itholic Could you take another look? Thanks!
Note that two tests with multiple groupby keys currently do not work when the elements contain null values. They are commented out for now, until groupby also drops null values to match the pandas behavior of groupby(dropna=True) in 1.1. See also databricks#1007.
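To make the `dropna=True` limitation concrete, here is a plain-Python sketch of what dropping null group keys means for a multi-column groupby: any row whose key contains a null is left out of every group. The function `group_sizes` and the sample rows are hypothetical and only illustrate the semantics; this is not how pandas or Koalas implements it.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence, Tuple

Key = Tuple[Optional[Hashable], ...]


def group_sizes(keys: Sequence[Key], dropna: bool = True) -> Dict[Key, int]:
    """Count rows per composite group key (illustrative helper)."""
    sizes: Counter = Counter()
    for key in keys:
        # With dropna=True, a row is skipped if any part of its key is null.
        if dropna and any(part is None for part in key):
            continue
        sizes[key] += 1
    return dict(sizes)


rows = [(1, "x"), (1, None), (2, "y"), (1, "x")]
print(group_sizes(rows))                # {(1, 'x'): 2, (2, 'y'): 1}
print(group_sizes(rows, dropna=False))  # {(1, 'x'): 2, (1, None): 1, (2, 'y'): 1}
```

A backend that keeps the `(1, None)` group produces an extra group compared with one that drops it, so results for data with null keys can only match once both sides apply the same rule.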
@ueshin My pleasure. Let me check this weekend.
I changed the description and also mentioned the current limitation for the
@@ -852,6 +852,62 @@ def test_rank(self):
            pdf.groupby([("x", "a"), ("x", "b")]).rank().sort_index(),
        )

    def test_cumcount(self):
Could we also have tests for `ascending=False`?
You could just use the code below:
def test_cumcount(self):
    pdf = pd.DataFrame(
        {
            "a": [1, 2, 3, 4, 5, 6] * 3,
            "b": [1, 1, 2, 3, 5, 8] * 3,
            "c": [1, 4, 9, 16, 25, 36] * 3,
        },
        index=np.random.rand(6 * 3),
    )
    kdf = ks.from_pandas(pdf)
    ascendings = [True, False, 0, 1, None]

    self.assert_eq(
        kdf.groupby("b").cumcount().sort_index(), pdf.groupby("b").cumcount().sort_index()
    )
    self.assert_eq(
        kdf.groupby(["a", "b"]).cumcount().sort_index(),
        pdf.groupby(["a", "b"]).cumcount().sort_index(),
    )
    self.assert_eq(
        kdf.groupby(["b"])["a"].cumcount().sort_index(),
        pdf.groupby(["b"])["a"].cumcount().sort_index(),
        almost=True,
    )
    self.assert_eq(
        kdf.groupby(["b"])[["a", "c"]].cumcount().sort_index(),
        pdf.groupby(["b"])[["a", "c"]].cumcount().sort_index(),
        almost=True,
    )
    self.assert_eq(
        kdf.groupby(kdf.b // 5).cumcount().sort_index(),
        pdf.groupby(pdf.b // 5).cumcount().sort_index(),
        almost=True,
    )
    self.assert_eq(
        kdf.groupby(kdf.b // 5)["a"].cumcount().sort_index(),
        pdf.groupby(pdf.b // 5)["a"].cumcount().sort_index(),
        almost=True,
    )
    self.assert_eq(
        kdf.groupby("b").cumcount().sum(), pdf.groupby("b").cumcount().sum(),
    )

    # specify `ascending`
    for ascending in ascendings:
        self.assert_eq(
            kdf.groupby("b").cumcount(ascending=ascending).sort_index(),
            pdf.groupby("b").cumcount(ascending=ascending).sort_index(),
        )
        self.assert_eq(
            kdf.groupby(["a", "b"]).cumcount(ascending=ascending).sort_index(),
            pdf.groupby(["a", "b"]).cumcount(ascending=ascending).sort_index(),
        )
        self.assert_eq(
            kdf.groupby(["b"])["a"].cumcount(ascending=ascending).sort_index(),
            pdf.groupby(["b"])["a"].cumcount(ascending=ascending).sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(["b"])[["a", "c"]].cumcount(ascending=ascending).sort_index(),
            pdf.groupby(["b"])[["a", "c"]].cumcount(ascending=ascending).sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(kdf.b // 5).cumcount(ascending=ascending).sort_index(),
            pdf.groupby(pdf.b // 5).cumcount(ascending=ascending).sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(kdf.b // 5)["a"].cumcount(ascending=ascending).sort_index(),
            pdf.groupby(pdf.b // 5)["a"].cumcount(ascending=ascending).sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby("b").cumcount(ascending=ascending).sum(),
            pdf.groupby("b").cumcount(ascending=ascending).sum(),
        )

    # multi-index columns
    columns = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b"), ("y", "c")])
    pdf.columns = columns
    kdf.columns = columns

    self.assert_eq(
        kdf.groupby(("x", "b")).cumcount().sort_index(),
        pdf.groupby(("x", "b")).cumcount().sort_index(),
    )
    self.assert_eq(
        kdf.groupby([("x", "a"), ("x", "b")]).cumcount().sort_index(),
        pdf.groupby([("x", "a"), ("x", "b")]).cumcount().sort_index(),
    )

    # specify `ascending`
    for ascending in ascendings:
        self.assert_eq(
            kdf.groupby(("x", "b")).cumcount(ascending=ascending).sort_index(),
            pdf.groupby(("x", "b")).cumcount(ascending=ascending).sort_index(),
        )
        self.assert_eq(
            kdf.groupby([("x", "a"), ("x", "b")]).cumcount(ascending=ascending).sort_index(),
            pdf.groupby([("x", "a"), ("x", "b")]).cumcount(ascending=ascending).sort_index(),
        )
Sounds good, but shall we use a loop for `ascending`?
for ascending in [True, False]:
    ...
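One way to structure such a loop so that a failure reports which `ascending` value broke is `unittest`'s `subTest`. The sketch below is self-contained and checks a plain-Python stand-in rather than Koalas; the names `cumcount` and `CumcountLoopTest` are hypothetical and not part of this PR.

```python
import unittest
from collections import defaultdict


def cumcount(keys, ascending=True):
    """Plain-Python stand-in for a per-group running count (illustrative only)."""
    seen = defaultdict(int)
    forward = []
    for key in keys:
        forward.append(seen[key])
        seen[key] += 1
    if ascending:
        return forward
    return [seen[key] - 1 - pos for key, pos in zip(keys, forward)]


class CumcountLoopTest(unittest.TestCase):
    def test_cumcount(self):
        keys = [1, 1, 2, 3, 5, 8] * 3
        for ascending in [True, False]:
            # subTest labels any failure with the parameter that caused it.
            with self.subTest(ascending=ascending):
                result = cumcount(keys, ascending=ascending)
                per_group = defaultdict(list)
                for key, value in zip(keys, result):
                    per_group[key].append(value)
                for values in per_group.values():
                    expected = list(range(len(values)))
                    if not ascending:
                        expected.reverse()
                    self.assertEqual(values, expected)
```

Saved as a module, this can be run with `python -m unittest`; both parameter values are exercised even if the first one fails.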
Looks better 👍
And maybe we could test for `[True, False, 0, 1, None]`, though I'm not sure; maybe it's too much.
I fixed the example code in the above comment.
Otherwise, LGTM
I'd merge this now and will submit a follow-up PR to address the comment.
Modifies `test_cumcount` to address comments #1702 (comment), and also added some more tests in `OpsOnDiffFramesGroupByTest`.
Thank you for the thorough review and for merging this!
Modifies `test_cumcount` to address comments databricks/koalas#1702 (comment), and also added some more tests in `OpsOnDiffFramesGroupByTest`.
Here is an example of the implementation, analogous to the one from the pandas docs:
The tests are the same as for the other `cumxxx` tests for now. Nevertheless, it does not yet work in the case of nulls and multiple columns within the groupby (thank you for finding this, @itholic):
This will also work once groupby supports a default `dropna=True` argument in #1007.