Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataFrame.groupby.cumcount #1702

Merged
merged 11 commits into from
Aug 18, 2020
Merged

Conversation

tomspur
Copy link
Contributor

@tomspur tomspur commented Aug 10, 2020

Here is an example of the implementation in analogy of the one from the pandas docs:

>>> df = ks.DataFrame(
...     [['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
...     columns=list('A'))
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount(ascending=False).sort_index()
0    3
1    2
2    1
3    1
4    0
5    0
Name: A, dtype: int64
>>> df.groupby('A').cumcount().sort_index()
0    0
1    1
2    2
3    0
4    1
5    3
Name: A, dtype: int64

The tests are the same like for the other cumxxx tests for now.

Nevertheless, it so far does not work the case of nulls and multiple columns within the groupby (thank you for finding this @itholic):

>>> df = ks.DataFrame({"a": [1, 1, 1, 4], "b": [None, 0.1, 20.0, None], "c": [4, 3, 2, 1]})
>>> df
   a     b  c
0  1   NaN  4
1  1   0.1  3
2  1  20.0  2
3  4   NaN  1
>>> df.groupby(["a", "b"]).cumcount().sort_index()
0    0
1    0
2    0
3    0
Name: a, dtype: int64
>>> df.to_pandas().groupby(["a", "b"]).cumcount().sort_index()
0    0
1    0
2    0
3    1
dtype: int64

This will also work, when groupby supports a default dropna=True argument in #1007

So far this only works if there is actually a column left after the
group by, e.g.

```python
>>> df = ks.DataFrame(
...     [['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
...     columns=list('A'))
>>> df
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount(ascending=False).sort_index()
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: 0, dtype: float64
>>> df["B"] = df["A"]
>>> df.groupby('A').cumcount().sort_index()
0    0
1    1
2    2
3    0
4    1
5    3
Name: 0, dtype: int64
>>> df.groupby('A').cumcount(ascending=False)
0    3
1    2
2    1
3    1
4    0
5    0
Name: 0, dtype: int64
```
@HyukjinKwon
Copy link
Member

Thanks @tomspur for working on this. @itholic can you review this?

@itholic
Copy link
Contributor

itholic commented Aug 11, 2020

@HyukjinKwon Sure I'm going to gladly review this !

@itholic
Copy link
Contributor

itholic commented Aug 12, 2020

I just submitted a question to pandas repo (pandas-dev/pandas#35682)
Let's just keep this for now until they response to it.

@itholic
Copy link
Contributor

itholic commented Aug 12, 2020

Would you mind adding this to docs/source/reference/groupby.rst ??

@itholic
Copy link
Contributor

itholic commented Aug 12, 2020

Anyway, I'd say think we better return the result as a int64 like pandas rather than float64. ?

>>> pdf
   A     B  C
0  1   NaN  4
1  1   0.1  3
2  1  20.0  2
3  4  10.0  1

>>> pdf.groupby("A").cumcount()
0    0
1    1
2    2
3    0
dtype: int64

>>> kdf.groupby("A").cumcount()
0    0.0
1    1.0
2    2.0
3    0.0
Name: 0, dtype: float64

@tomspur
Copy link
Contributor Author

tomspur commented Aug 12, 2020

The above changes are now all implemented. I don't quite like the explicit cast to int64 in the code, but also couldn't find another way to ensure that the NaNs are properly skipped during the calculation.

@tomspur
Copy link
Contributor Author

tomspur commented Aug 12, 2020

Unfortunately, the order is not preserved as can be seen in this test (which I wanted to add to the testsuite now):


In [10]: pdf = pd.DataFrame([[1, None, 4], [1, 0.1, 3], [1, 20.0, 2], [4, None, 1]],columns=list("ABC"),index=np.random.rand(4))                                                       

In [11]: kdf.groupby(column).cumcount(ascending=False)                                                                                                                                 
Out[11]: 
0.612123    0
0.440823    1
0.718036    2
0.883475    0
Name: 0, dtype: int64

In [12]: pdf.groupby(column).cumcount(ascending=False)                                                                                                                                 
Out[12]: 
0.874863    2
0.809940    1
0.528329    0
0.655884    0
dtype: int64

Not sure where this could be ensured in the new ascending=False part in databricks.koalas.series ..

@ueshin
Copy link
Collaborator

ueshin commented Aug 12, 2020

@itholic For pandas-dev/pandas#35682, cumcount doesn't need to care about its type, so it's natural to support with any type.

@tomspur tomspur force-pushed the cumcount branch 2 times, most recently from 6e5a1bd to 426d361 Compare August 12, 2020 18:22
Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could you fix the format?

databricks/koalas/groupby.py Outdated Show resolved Hide resolved
databricks/koalas/groupby.py Outdated Show resolved Hide resolved
databricks/koalas/tests/test_dataframe.py Outdated Show resolved Hide resolved
@itholic
Copy link
Contributor

itholic commented Aug 13, 2020

@itholic For pandas-dev/pandas#35682, cumcount doesn't need to care about its type, so it's natural to support with any type.

Thanks, @ueshin . I'll close that QST with proper comments.

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM pending tests.

Comment on lines 865 to 870
# TODO: Enable the following test when ks.groupby(dropna=True)
# is implemented (see #1007)
# self.assert_eq(
# kdf.groupby(["a", "b"]).cumcount().sort_index(),
# pdf.groupby(["a", "b"]).cumcount().sort_index(),
# )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use the same test dataset as the other cumxxx tests for now? Then we can enable this and one more below. Let's see #1007 for this case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomspur Could you update the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just changed above the columns from ["a", "b"] to ["a", "c"], which does not contain Nones to have some other and keep the comments to remember to enable testing Nones at least partially.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used the same dataset like the other tests for now. Do you want to add a None to them once #1007 is implemented?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should modify or add tests which include Nones in groupkeys at #1007.

@ueshin
Copy link
Collaborator

ueshin commented Aug 13, 2020

@itholic Could you take another look? Thanks!

Note that currently two tests with multiple groupby does not work when
the elements contain null values. These are commented out for now until
the groupby is also dropping null values to match pandas functionality
of groupby(dropna=True) in 1.1. See also databricks#1007.
@itholic
Copy link
Contributor

itholic commented Aug 14, 2020

@ueshin My pleasure. Let me check this weekend.

@ueshin
Copy link
Collaborator

ueshin commented Aug 14, 2020

@tomspur Now that this is good to go, pending @itholic's another look.
Could you update the PR description to describe the changes here?

@tomspur
Copy link
Contributor Author

tomspur commented Aug 14, 2020

I changed the description and also mentioned the current limitation for the groupby with nulls in it

@@ -852,6 +852,62 @@ def test_rank(self):
pdf.groupby([("x", "a"), ("x", "b")]).rank().sort_index(),
)

def test_cumcount(self):
Copy link
Contributor

@itholic itholic Aug 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also have a tests for ascending=False ?

You could just use the below

    def test_cumcount(self):
        pdf = pd.DataFrame(
            {
                "a": [1, 2, 3, 4, 5, 6] * 3,
                "b": [1, 1, 2, 3, 5, 8] * 3,
                "c": [1, 4, 9, 16, 25, 36] * 3,
            },
            index=np.random.rand(6 * 3),
        )
        kdf = ks.from_pandas(pdf)
        ascendings = [True, False, 0, 1, None]

        self.assert_eq(
            kdf.groupby("b").cumcount().sort_index(), pdf.groupby("b").cumcount().sort_index()
        )
        self.assert_eq(
            kdf.groupby(["a", "b"]).cumcount().sort_index(),
            pdf.groupby(["a", "b"]).cumcount().sort_index(),
        )
        self.assert_eq(
            kdf.groupby(["b"])["a"].cumcount().sort_index(),
            pdf.groupby(["b"])["a"].cumcount().sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(["b"])[["a", "c"]].cumcount().sort_index(),
            pdf.groupby(["b"])[["a", "c"]].cumcount().sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(kdf.b // 5).cumcount().sort_index(),
            pdf.groupby(pdf.b // 5).cumcount().sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby(kdf.b // 5)["a"].cumcount().sort_index(),
            pdf.groupby(pdf.b // 5)["a"].cumcount().sort_index(),
            almost=True,
        )
        self.assert_eq(
            kdf.groupby("b").cumcount().sum(), pdf.groupby("b").cumcount().sum(),
        )

        # specify `ascending`
        for ascending in ascendings:
            self.assert_eq(
                kdf.groupby("b").cumcount(ascending=ascending).sort_index(), pdf.groupby("b").cumcount(ascending=ascending).sort_index()
            )
            self.assert_eq(
                kdf.groupby(["a", "b"]).cumcount(ascending=ascending).sort_index(),
                pdf.groupby(["a", "b"]).cumcount(ascending=ascending).sort_index(),
            )
            self.assert_eq(
                kdf.groupby(["b"])["a"].cumcount(ascending=ascending).sort_index(),
                pdf.groupby(["b"])["a"].cumcount(ascending=ascending).sort_index(),
                almost=True,
            )
            self.assert_eq(
                kdf.groupby(["b"])[["a", "c"]].cumcount(ascending=ascending).sort_index(),
                pdf.groupby(["b"])[["a", "c"]].cumcount(ascending=ascending).sort_index(),
                almost=True,
            )
            self.assert_eq(
                kdf.groupby(kdf.b // 5).cumcount(ascending=ascending).sort_index(),
                pdf.groupby(pdf.b // 5).cumcount(ascending=ascending).sort_index(),
                almost=True,
            )
            self.assert_eq(
                kdf.groupby(kdf.b // 5)["a"].cumcount(ascending=ascending).sort_index(),
                pdf.groupby(pdf.b // 5)["a"].cumcount(ascending=ascending).sort_index(),
                almost=True,
            )
            self.assert_eq(
                kdf.groupby("b").cumcount(ascending=ascending).sum(), pdf.groupby("b").cumcount(ascending=ascending).sum(),
            )

        # multi-index columns
        columns = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b"), ("y", "c")])
        pdf.columns = columns
        kdf.columns = columns

        self.assert_eq(
            kdf.groupby(("x", "b")).cumcount().sort_index(),
            pdf.groupby(("x", "b")).cumcount().sort_index(),
        )
        self.assert_eq(
            kdf.groupby([("x", "a"), ("x", "b")]).cumcount().sort_index(),
            pdf.groupby([("x", "a"), ("x", "b")]).cumcount().sort_index(),
        )

        # specify `ascending`
        for ascending in ascendings:
            self.assert_eq(
                kdf.groupby(("x", "b")).cumcount(ascending=ascending).sort_index(),
                pdf.groupby(("x", "b")).cumcount(ascending=ascending).sort_index(),
            )
            self.assert_eq(
                kdf.groupby([("x", "a"), ("x", "b")]).cumcount(ascending=ascending).sort_index(),
                pdf.groupby([("x", "a"), ("x", "b")]).cumcount(ascending=ascending).sort_index(),
            )

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, but shall we use a loop for ascending?

for ascending in [True, False]:
    ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we could test for [True, False, 0, 1, None] - Not sure maybe It's too much though -
I fixed the example code in the above comment.

@itholic
Copy link
Contributor

itholic commented Aug 16, 2020

Otherwise, LGTM

@ueshin
Copy link
Collaborator

ueshin commented Aug 18, 2020

I'd merge this now and will submit a follow-up PR to address the comment.
@tomspur Thanks for working on this!

@ueshin ueshin merged commit 29e0441 into databricks:master Aug 18, 2020
ueshin added a commit that referenced this pull request Aug 18, 2020
Modifies `test_cumcount` to address comments #1702 (comment), and also added some more tests in `OpsOnDiffFramesGroupByTest`.
@tomspur
Copy link
Contributor Author

tomspur commented Aug 19, 2020

Thank you for the throughout review and merging this!

@tomspur tomspur deleted the cumcount branch August 19, 2020 13:25
rising-star92 added a commit to rising-star92/databricks-koalas that referenced this pull request Jan 27, 2023
Modifies `test_cumcount` to address comments databricks/koalas#1702 (comment), and also added some more tests in `OpsOnDiffFramesGroupByTest`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants