Implement DataFrame.groupby.cumcount #1702

tomspur · 2020-08-10T10:52:35Z

Here is an example of the implementation, analogous to the one in the pandas docs:

>>> df = ks.DataFrame(
...     [['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
...     columns=list('A'))
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount(ascending=False).sort_index()
0    3
1    2
2    1
3    1
4    0
5    0
Name: A, dtype: int64
>>> df.groupby('A').cumcount().sort_index()
0    0
1    1
2    2
3    0
4    1
5    3
Name: A, dtype: int64
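
For reference, the relationship between the two directions can be checked in plain pandas. This is a minimal sketch and not the Koalas code: it assumes only pandas is installed, and the names `forward`, `backward`, and `group_sizes` are made up for the illustration. Within each group, the ascending and descending counts of a row add up to the group size minus one.

```python
import pandas as pd

# Same data as the example above, but as a plain pandas DataFrame.
df = pd.DataFrame({"A": ["a", "a", "a", "b", "b", "a"]})

forward = df.groupby("A").cumcount()                  # 0-based position from the group's start
backward = df.groupby("A").cumcount(ascending=False)  # 0-based position from the group's end

# Size of the group each row belongs to (illustrative helper, not part of the PR).
group_sizes = df["A"].map(df["A"].value_counts())

# For every row: forward + backward == group size - 1.
assert ((forward + backward) == (group_sizes - 1)).all()

print(forward.tolist())   # [0, 1, 2, 0, 1, 3]
print(backward.tolist())  # [3, 2, 1, 1, 0, 0]
```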

The tests are the same as the other cumxxx tests for now.

However, it does not yet work for nulls combined with multiple columns in the groupby (thank you for finding this, @itholic):

>>> df = ks.DataFrame({"a": [1, 1, 1, 4], "b": [None, 0.1, 20.0, None], "c": [4, 3, 2, 1]})
>>> df
   a     b  c
0  1   NaN  4
1  1   0.1  3
2  1  20.0  2
3  4   NaN  1
>>> df.groupby(["a", "b"]).cumcount().sort_index()
0    0
1    0
2    0
3    0
Name: a, dtype: int64
>>> df.to_pandas().groupby(["a", "b"]).cumcount().sort_index()
0    0
1    0
2    0
3    1
dtype: int64

This will also work once groupby supports a dropna argument defaulting to True (#1007).
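
To make the `dropna` behavior concrete, here is a small pandas-only sketch. It assumes pandas 1.1 or later, where `groupby` accepts `dropna`, and the variable names are illustrative. It shows only how many groups survive with and without dropping null keys, and says nothing about how Koalas will implement #1007.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(
    {"a": [1, 1, 1, 4], "b": [np.nan, 0.1, 20.0, np.nan], "c": [4, 3, 2, 1]}
)

# Default (dropna=True): groups whose key contains a null are dropped.
kept = pdf.groupby(["a", "b"]).size()

# dropna=False: null keys form their own groups.
all_groups = pdf.groupby(["a", "b"], dropna=False).size()

print(len(kept))        # 2
print(len(all_groups))  # 4
```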

So far this only works if there is actually a column left after the group by, e.g.

```python
>>> df = ks.DataFrame(
...     [['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
...     columns=list('A'))
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount(ascending=False).sort_index()
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: 0, dtype: float64
>>> df["B"] = df["A"]
>>> df.groupby('A').cumcount().sort_index()
0    0
1    1
2    2
3    0
4    1
5    3
Name: 0, dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0    3
1    2
2    1
3    1
4    0
5    0
Name: 0, dtype: int64
```

HyukjinKwon · 2020-08-11T02:58:38Z

Thanks @tomspur for working on this. @itholic can you review this?

itholic · 2020-08-11T23:21:19Z

@HyukjinKwon Sure, I'll gladly review this!

databricks/koalas/groupby.py

itholic · 2020-08-12T04:03:33Z

I just submitted a question to the pandas repo (pandas-dev/pandas#35682).
Let's keep this as is for now until they respond to it.

itholic · 2020-08-12T04:08:43Z

Would you mind adding this to docs/source/reference/groupby.rst?

itholic · 2020-08-12T04:10:47Z

Anyway, I think we'd better return the result as int64, like pandas, rather than float64.

>>> pdf
   A     B  C
0  1   NaN  4
1  1   0.1  3
2  1  20.0  2
3  4  10.0  1

>>> pdf.groupby("A").cumcount()
0    0
1    1
2    2
3    0
dtype: int64

>>> kdf.groupby("A").cumcount()
0    0.0
1    1.0
2    2.0
3    0.0
Name: 0, dtype: float64

tomspur · 2020-08-12T17:22:56Z

The above changes are now all implemented. I don't quite like the explicit cast to int64 in the code, but I couldn't find another way to ensure that the NaNs are properly skipped during the calculation.
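
As background on why a cast can come up at all, the following pandas-only sketch shows that whole-number floats cast cleanly to int64, while a float column containing NaN cannot be cast. It is an illustration under the assumption that only pandas and NumPy are available, and it makes no claim about how the Koalas code performs its cast.

```python
import numpy as np
import pandas as pd

# Counts that came back as floats can be cast when every value is finite.
counts = pd.Series([0.0, 1.0, 2.0, 0.0])
as_int = counts.astype("int64")
print(as_int.dtype)  # int64

# A float column that still contains NaN cannot be cast to int64.
with_gap = pd.Series([0.0, np.nan])
try:
    with_gap.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)  # True
```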

tomspur · 2020-08-12T18:05:27Z

Unfortunately, the order is not preserved as can be seen in this test (which I wanted to add to the testsuite now):


In [10]: pdf = pd.DataFrame([[1, None, 4], [1, 0.1, 3], [1, 20.0, 2], [4, None, 1]], columns=list("ABC"), index=np.random.rand(4))

In [11]: kdf.groupby(column).cumcount(ascending=False)                                                                                                                                 
Out[11]: 
0.612123    0
0.440823    1
0.718036    2
0.883475    0
Name: 0, dtype: int64

In [12]: pdf.groupby(column).cumcount(ascending=False)                                                                                                                                 
Out[12]: 
0.874863    2
0.809940    1
0.528329    0
0.655884    0
dtype: int64

I'm not sure where this could be ensured in the new ascending=False part in databricks.koalas.series.
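
The semantics pandas shows above can be modeled in pure Python: the descending count is the ascending count taken while walking the rows in reverse natural order. The function below is a hypothetical reference model for reasoning about the expected output, not the Koalas or pandas implementation.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def cumcount(keys: Sequence[Hashable], ascending: bool = True) -> List[int]:
    """Number each row within its group, following the rows' natural order.

    Hypothetical reference model: with ascending=False the rows are visited
    from last to first, so the last row of each group receives 0.
    """
    seen = defaultdict(int)  # rows visited so far, per group key
    result = [0] * len(keys)
    if ascending:
        positions = range(len(keys))
    else:
        positions = range(len(keys) - 1, -1, -1)
    for pos in positions:
        result[pos] = seen[keys[pos]]
        seen[keys[pos]] += 1
    return result


keys = [1, 1, 1, 4]  # column "A" of the frame above
print(cumcount(keys))                   # [0, 1, 2, 0]
print(cumcount(keys, ascending=False))  # [2, 1, 0, 0]
```

The second result matches the pandas output in `Out[12]`, which suggests the descending count has to be ordered by the reversed natural row order within each group.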

ueshin · 2020-08-12T18:06:42Z

@itholic For pandas-dev/pandas#35682, cumcount doesn't need to care about its type, so it's natural to support with any type.

ueshin

Also could you fix the format?

databricks/koalas/groupby.py

databricks/koalas/tests/test_dataframe.py

databricks/koalas/missing/groupby.py

itholic · 2020-08-13T07:55:09Z

> @itholic For pandas-dev/pandas#35682, cumcount doesn't need to care about its type, so it's natural to support with any type.

Thanks, @ueshin. I'll close that QST with proper comments.

ueshin

Otherwise, LGTM pending tests.

ueshin · 2020-08-13T20:35:19Z

databricks/koalas/tests/test_groupby.py

+        # TODO: Enable the following test when ks.groupby(dropna=True)
+        # is implemented (see #1007)
+        # self.assert_eq(
+        #    kdf.groupby(["a", "b"]).cumcount().sort_index(),
+        #    pdf.groupby(["a", "b"]).cumcount().sort_index(),
+        # )


Shall we use the same test dataset as the other cumxxx tests for now? Then we can enable this and one more below. Let's see #1007 for this case.

@tomspur Could you update the tests?

I just changed the columns above from ["a", "b"] to ["a", "c"], which contain no Nones, so that multi-column grouping is at least partially tested, and kept the comments as a reminder to enable the tests with Nones.

I used the same dataset as the other tests for now. Do you want to add a None to them once #1007 is implemented?

Yes, we should modify or add tests which include Nones in groupkeys at #1007.

ueshin · 2020-08-13T20:38:44Z

@itholic Could you take another look? Thanks!

Note that two tests with multiple groupby keys currently do not work when the elements contain null values. They are commented out for now, until groupby also drops null values to match the pandas 1.1 behavior of groupby(dropna=True). See also databricks#1007.

itholic · 2020-08-14T07:49:59Z

@ueshin My pleasure. Let me check this weekend.

ueshin · 2020-08-14T18:05:52Z

@tomspur This is now good to go, pending another look from @itholic.
Could you update the PR description to describe the changes here?

tomspur · 2020-08-14T20:59:34Z

I changed the description and also mentioned the current limitation for groupby with nulls in the keys.

itholic · 2020-08-16T07:49:45Z

databricks/koalas/tests/test_groupby.py

@@ -852,6 +852,62 @@ def test_rank(self):
            pdf.groupby([("x", "a"), ("x", "b")]).rank().sort_index(),
        )

+    def test_cumcount(self):


Could we also have tests for ascending=False?

You could just use the following:

def test_cumcount(self):
    pdf = pd.DataFrame(
        {
            "a": [1, 2, 3, 4, 5, 6] * 3,
            "b": [1, 1, 2, 3, 5, 8] * 3,
            "c": [1, 4, 9, 16, 25, 36] * 3,
        },
        index=np.random.rand(6 * 3),
    )
    kdf = ks.from_pandas(pdf)
    ascendings = [True, False, 0, 1, None]

    self.assert_eq(
        kdf.groupby("b").cumcount().sort_index(), pdf.groupby("b").cumcount().sort_index()
    )
    self.assert_eq(
        kdf.groupby(["a", "b"]).cumcount().sort_index(),
        pdf.groupby(["a", "b"]).cumcount().sort_index(),
    )
    self.assert_eq(
        kdf.groupby(["b"])["a"].cumcount().sort_index(),
        pdf.groupby(["b"])["a"].cumcount().sort_index(),
        almost=True,
    )
    self.assert_eq(
        kdf.groupby(["b"])[["a", "c"]].cumcount().sort_index(),
        pdf.groupby(["b"])[["a", "c"]].cumcount().sort_index(),
        almost=True,
    )
    self.assert_eq(
        kdf.groupby(kdf.b // 5).cumcount().sort_index(),
        pdf.groupby(pdf.b // 5).cumcount().sort_index(),
        almost=True,
    )
    self.assert_eq(
        kdf.groupby(kdf.b // 5)["a"].cumcount().sort_index(),
        pdf.groupby(pdf.b // 5)["a"].cumcount().sort_index(),
        almost=True,
    )
    self.assert_eq(
        kdf.groupby("b").cumcount().sum(), pdf.groupby("b").cumcount().sum(),
    )

    # specify `ascending`
    for ascending in ascendings:
        self.assert_eq(
            kdf.groupby("b").cumcount(ascending=ascending).sort_index(),
            pdf.groupby("b").cumcount(ascending=ascending).sort_index(),
        )
        self.assert_eq(
            kdf.groupby(["a", "b"]).cumcount(ascending=ascending).sort_index(),
            pdf.groupby(["a", "b"]).cumcount(ascending=ascending).sort_index(),
        )
        self.assert_eq(
            kdf.groupby(["b"])["a"].cumcount(ascending=ascending).sort_index(),
            pdf.groupby(["b"])["a"].cumcount(ascending=ascending).sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(["b"])[["a", "c"]].cumcount(ascending=ascending).sort_index(),
            pdf.groupby(["b"])[["a", "c"]].cumcount(ascending=ascending).sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(kdf.b // 5).cumcount(ascending=ascending).sort_index(),
            pdf.groupby(pdf.b // 5).cumcount(ascending=ascending).sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(kdf.b // 5)["a"].cumcount(ascending=ascending).sort_index(),
            pdf.groupby(pdf.b // 5)["a"].cumcount(ascending=ascending).sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby("b").cumcount(ascending=ascending).sum(),
            pdf.groupby("b").cumcount(ascending=ascending).sum(),
        )

    # multi-index columns
    columns = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b"), ("y", "c")])
    pdf.columns = columns
    kdf.columns = columns

    self.assert_eq(
        kdf.groupby(("x", "b")).cumcount().sort_index(),
        pdf.groupby(("x", "b")).cumcount().sort_index(),
    )
    self.assert_eq(
        kdf.groupby([("x", "a"), ("x", "b")]).cumcount().sort_index(),
        pdf.groupby([("x", "a"), ("x", "b")]).cumcount().sort_index(),
    )

    # specify `ascending`
    for ascending in ascendings:
        self.assert_eq(
            kdf.groupby(("x", "b")).cumcount(ascending=ascending).sort_index(),
            pdf.groupby(("x", "b")).cumcount(ascending=ascending).sort_index(),
        )
        self.assert_eq(
            kdf.groupby([("x", "a"), ("x", "b")]).cumcount(ascending=ascending).sort_index(),
            pdf.groupby([("x", "a"), ("x", "b")]).cumcount(ascending=ascending).sort_index(),
        )

Sounds good, but shall we use a loop for ascending?

for ascending in [True, False]: ...

Looks better 👍

And maybe we could test for [True, False, 0, 1, None]. I'm not sure; maybe that's too much, though.
I fixed the example code in the above comment.
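
For readers without a Spark session, the loop idea can be tried against pandas alone. The sketch below is an assumption-laden stand-in, not the test that was merged: the class name is hypothetical, `subTest` is used so each `ascending` value is reported separately, and the expected descending result is derived from the ascending one instead of being compared with Koalas.

```python
import unittest

import numpy as np
import pandas as pd


class CumcountAscendingSketch(unittest.TestCase):  # hypothetical name
    def test_cumcount_ascending(self):
        pdf = pd.DataFrame(
            {"a": [1, 2, 3, 4, 5, 6] * 3, "b": [1, 1, 2, 3, 5, 8] * 3},
            index=np.random.rand(6 * 3),
        )
        # Size of the group each row belongs to, aligned with pdf's index.
        group_sizes = pdf["b"].map(pdf["b"].value_counts())
        forward = pdf.groupby("b").cumcount()

        for ascending in [True, False]:
            with self.subTest(ascending=ascending):
                actual = pdf.groupby("b").cumcount(ascending=ascending)
                # Descending count = group size - 1 - ascending count.
                expected = forward if ascending else group_sizes - 1 - forward
                pd.testing.assert_series_equal(
                    actual.sort_index(), expected.sort_index(), check_names=False
                )


if __name__ == "__main__":
    unittest.main()
```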

itholic · 2020-08-16T07:49:57Z

Otherwise, LGTM

ueshin · 2020-08-18T17:49:34Z

I'd merge this now and will submit a follow-up PR to address the comment.
@tomspur Thanks for working on this!

Modifies `test_cumcount` to address comments #1702 (comment), and also adds some more tests in `OpsOnDiffFramesGroupByTest`.

tomspur · 2020-08-19T13:25:29Z

Thank you for the thorough review and for merging this!

tomspur force-pushed the cumcount branch from ad27d88 to b6c1d4a on August 10, 2020 21:36

itholic reviewed Aug 12, 2020

databricks/koalas/groupby.py

tomspur added 4 commits August 12, 2020 18:33

Run dev/reformat

f17349e

Add cumcount to the documentation

9334ee9

Document cumcount similar to pandas.core.groupby.GroupBy.cumcount

d50ad85

Cast result of cumcount to int64 to match pandas

1d4ad47

tomspur force-pushed the cumcount branch from 2a34786 to 1d4ad47 on August 12, 2020 17:16

tomspur force-pushed the cumcount branch 2 times, most recently from 6e5a1bd to 426d361 on August 12, 2020 18:22

ueshin reviewed Aug 12, 2020

databricks/koalas/groupby.py

databricks/koalas/tests/test_dataframe.py

tomspur added 2 commits August 12, 2020 22:28

Skip one part of the doctest

0917aae

Add more strict test for cumcount

c9cfd0c

tomspur force-pushed the cumcount branch from 426d361 to 50323ff on August 12, 2020 22:00

ueshin reviewed Aug 12, 2020

databricks/koalas/missing/groupby.py

tomspur force-pushed the cumcount branch from 50323ff to 15d303d on August 13, 2020 20:10

ueshin approved these changes Aug 13, 2020

tomspur added 3 commits August 13, 2020 22:40

groupby.cumcount: Use F.count to avoid _apply_series_op

96a9db1

Remove SeriesGroupBy.cumcount from missing functions

66e53d8

tomspur force-pushed the cumcount branch from 15d303d to 66e53d8 on August 13, 2020 20:40

Consistently use same test data like other cumxxx tests

cd7e166

itholic reviewed Aug 16, 2020

ueshin merged commit 29e0441 into databricks:master Aug 18, 2020

ueshin mentioned this pull request Aug 18, 2020

Modify test_cumcount to address comments. #1714

Merged

ueshin added a commit that referenced this pull request Aug 18, 2020

Modify test_cumcount to address comments. (#1714)

4ffb22b

tomspur deleted the cumcount branch August 19, 2020 13:25
