Select the series correctly in SeriesGroupBy APIs #1224

HyukjinKwon · 2020-01-24T12:01:13Z

This can be tested with the code below:

import numpy as np
import pandas as pd
import databricks.koalas as ks

kdf = ks.DataFrame({'cust_id':['a', 'a', 'a', 'b', 'b'],
                   'sales': [100, 200, 300, 400, 500],
                   'days':[12.1,13.1,14.1,78.1,87.2]})

kdf.groupby('cust_id')['days'].apply(lambda x: x.ewm(alpha=0.5, adjust=False).mean())


pdf = kdf.to_pandas()
pdf.groupby('cust_id')['days'].apply(lambda x: x.ewm(alpha=0.5, adjust=False).mean())

0    12.10
1    12.60
2    13.35
3    78.10
4    82.65
Name: days, dtype: float64
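The expected values can be checked by hand. With adjust=False, pandas computes the exponentially weighted mean recursively: the first value is kept as-is, and each later value is (1 - alpha) times the previous result plus alpha times the new observation. The sketch below is plain Python for illustration only; `ewm_mean_unadjusted` is a hypothetical helper, not part of pandas or Koalas.

```python
def ewm_mean_unadjusted(values, alpha):
    """Recursive EWM mean: y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]."""
    smoothed = []
    for x in values:
        if not smoothed:
            # The first observation seeds the recursion unchanged.
            smoothed.append(x)
        else:
            smoothed.append((1 - alpha) * smoothed[-1] + alpha * x)
    return smoothed


# The 'days' column from the example, split by 'cust_id'.
days_by_customer = {"a": [12.1, 13.1, 14.1], "b": [78.1, 87.2]}
smoothed_by_customer = {
    cust_id: ewm_mean_unadjusted(days, alpha=0.5)
    for cust_id, days in days_by_customer.items()
}
print(smoothed_by_customer)
```

Up to floating-point rounding, group a gives 12.10, 12.60, 13.35 and group b gives 78.10, 82.65, which matches the pandas output above.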

The basic idea is to create a DataFrame from the Series so that existing implementations can be reused, and to resolve the columns by name.
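A rough pandas analogy of that idea, not the Koalas implementation: build a frame from the selected series plus the grouping series, then refer to the columns by name. `frame_from_series` is a hypothetical helper, and the filter on the name mirrors the `s.name != self._kser.name` condition in the diff.

```python
import pandas as pd


def frame_from_series(value, keys):
    """Hypothetical helper: combine the selected series with its group keys.

    A key that shares the selected series' name is dropped so that every
    column name in the resulting frame is unique.
    """
    columns = [value] + [key for key in keys if key.name != value.name]
    return pd.concat(columns, axis=1)


pdf = pd.DataFrame({
    "cust_id": ["a", "a", "a", "b", "b"],
    "sales": [100, 200, 300, 400, 500],
    "days": [12.1, 13.1, 14.1, 78.1, 87.2],
})

# Only the selected series and the group key end up in the frame.
frame = frame_from_series(pdf["days"], [pdf["cust_id"]])

# Columns are then resolved purely by name.
max_days = frame.groupby("cust_id")["days"].max()
print(list(frame.columns))
print(max_days)
```

The unrelated `sales` column never enters the frame, so by-name resolution can only pick up the selected series and its keys.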

HyukjinKwon · 2020-01-24T12:01:28Z

databricks/koalas/groupby.py


    @property
    def _kdf(self) -> DataFrame:
-        return self._kser._kdf


The fix is here.

HyukjinKwon · 2020-01-24T12:01:31Z

databricks/koalas/groupby.py

@@ -2016,6 +2012,15 @@ class SeriesGroupBy(GroupBy):
    def __init__(self, kser: Series, by: List[Series], as_index: bool = True):
        self._kser = kser
        self._groupkeys = by


codecov-io · 2020-01-24T12:43:18Z

Codecov Report

Merging #1224 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #1224      +/-   ##
=========================================
+ Coverage   95.19%   95.2%   +<.01%     
=========================================
  Files          35      35              
  Lines        7263    7271       +8     
=========================================
+ Hits         6914    6922       +8     
  Misses        349     349

Impacted Files	Coverage Δ
databricks/koalas/groupby.py	`91.76% <100%> (-0.12%)`	⬇️
databricks/koalas/frame.py	`96.96% <100%> (ø)`	⬆️
databricks/koalas/indexes.py	`95.9% <0%> (-0.11%)`	⬇️
databricks/koalas/series.py	`96.46% <0%> (ø)`	⬆️
databricks/koalas/indexing.py	`96.32% <0%> (+0.01%)`	⬆️
databricks/koalas/internal.py	`95.57% <0%> (+0.36%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ff5ec10...0c4f087.

ueshin

I agree that we should refactor these classes since the base infrastructure is rather old. We should revisit and refine later.

ueshin · 2020-01-24T20:41:11Z

databricks/koalas/groupby.py

+        self._groupkeys_scols = [F.col(name_like_string(s.name)) for s in self._groupkeys]
+        self._agg_columns_scols = [F.col(name_like_string(s.name)) for s in self._agg_columns]


F.col(s._internal.data_columns[0]) instead of F.col(name_like_string(s.name))?

I switched but realised that some tests fail with that change. It seems there are some cases where the internal column names and the series names are different:

>>> (ks.range(10).id + 1)._internal.data_columns
['(id + 1)']
>>> (ks.range(10).id + 1).name
'id'

We use:

@property
def _kdf(self) -> DataFrame:
    series = [self._kser] + [s for s in self._groupkeys if not s._equals(self._kser)]
    return DataFrame(self._kser._kdf._internal.with_new_columns(series))

which always aliases the columns using column_index in with_new_columns. So the internal Spark column names seem guaranteed to be the same as s.name. I think using the name here for this workaround is correct.
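For comparison, the pandas side of this behaviour can be checked without Spark: arithmetic between a named Series and a scalar keeps the Series name. The snippet below shows only that pandas behaviour; the Spark-side name '(id + 1)' is taken from the output quoted above and is not reproduced here.

```python
import pandas as pd

# A named series, analogous to ks.range(10).id in the discussion.
original = pd.Series(range(10), name="id")

# Arithmetic with a scalar keeps the series name in pandas.
shifted = original + 1
print(shifted.name)
```

This is why the series name stays 'id' while the underlying column expression is something else, and why the two cannot be used interchangeably unless the column is aliased.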

ah, I see. hm, maybe renaming the columns in with_new_columns was not a good idea. I'll fix it later.

ueshin · 2020-01-24T20:47:39Z

databricks/koalas/groupby.py

-        return self._kser._kdf
+        # TODO: if names from _kser and _groupkeys are the same, grouping key is just ignored.
+        #    it can be a problem when both series have the same name but different operations.
+        series = [self._kser] + [s for s in self._groupkeys if s.name != self._kser.name]


How about using s._equals(self._kser) which compares the ._scol._jcs?

In that case, it seems to fail because the same names exist in the same DataFrame: AnalysisException: Reference 'b' is ambiguous, could be: b, b.; ... but yeah, I think it's better to fail fast for now rather than return a wrong result.
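The fail-fast trade-off can be illustrated in plain pandas, where duplicate column names are allowed and selecting by name quietly returns both columns. The guard below is a hypothetical sketch of "raise instead of guessing"; it is not how Spark or Koalas detects the ambiguity.

```python
import pandas as pd


def require_unique_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical guard: raise rather than guess which duplicate column is meant."""
    if not frame.columns.is_unique:
        duplicated = sorted(set(frame.columns[frame.columns.duplicated()]))
        raise ValueError(f"Ambiguous column reference(s): {duplicated}")
    return frame


# Two different columns that happen to share the name 'b'.
ambiguous = pd.DataFrame([[1, 10], [2, 20]], columns=["b", "b"])

try:
    require_unique_columns(ambiguous)
    failed_fast = False
except ValueError as error:
    failed_fast = True
    message = str(error)
    print(message)
```

Raising here is the analogue of the AnalysisException: the caller learns about the name clash immediately instead of getting a result computed from the wrong column.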

HyukjinKwon · 2020-01-26T04:11:12Z

databricks/koalas/tests/test_groupby.py

-        self.assert_eq(kdf.groupby(['b'])['a'].apply(lambda x: x).sort_index(),
-                       pdf.groupby(['b'])['a'].apply(lambda x: x).sort_index())
+        self.assert_eq(kdf.groupby(['b'])['b'].apply(lambda x: x).sort_index(),
+                       pdf.groupby(['b'])['b'].apply(lambda x: x).sort_index())


So, we will not support this case for now, which I think is fine (?).

ueshin

LGTM.

ueshin · 2020-01-27T20:57:04Z

databricks/koalas/groupby.py

+        self._groupkeys_scols = [F.col(name_like_string(s.name)) for s in self._groupkeys]
+        self._agg_columns_scols = [F.col(name_like_string(s.name)) for s in self._agg_columns]


ah, I see. hm, maybe renaming the columns in with_new_columns was not a good idea. I'll fix it later.

ueshin · 2020-01-27T21:04:31Z

Thanks! merging.

sushmit86 · 2020-02-05T19:01:16Z

Is this issue fixed?

HyukjinKwon · 2020-02-06T00:07:38Z

Yup, this will be available in next week's release.

HyukjinKwon requested a review from ueshin January 24, 2020 12:01

HyukjinKwon mentioned this pull request Jan 24, 2020

Faster calculation of moving averages in Koalas. #1213

Closed

Select the series correctly in SeriesGroupBy APIs

721b171

HyukjinKwon force-pushed the groupby-series branch from 13c0af0 to 721b171 Compare January 24, 2020 12:07

ueshin reviewed Jan 24, 2020

Address commnets

dac54d5

HyukjinKwon commented Jan 26, 2020

Fix the tests back

0c4f087

ueshin approved these changes Jan 27, 2020

ueshin merged commit 1c973a5 into databricks:master Jan 27, 2020

This was referenced Jan 28, 2020

Fix _InternalFrame.with_new_columns not to rename columns. #1229

Merged

Add a test case when series.groupby series references the same column but with different values #1233

Merged

ueshin pushed a commit that referenced this pull request Jan 28, 2020

Add a test case when groupby series referes the same column but with …

cce8eb8

…different values (#1233) A small followup of #1224 and #1229 Now, we can cover this case

itholic mentioned this pull request Jan 28, 2020

Rename Internal.data_columns after arithmetic operations for IndexOpsMixin #1236

Closed

HyukjinKwon deleted the groupby-series branch September 11, 2020 07:52
