disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy' #1097

itholic · 2019-12-03T03:19:03Z

Resolve #1095

>>> kser = ks.Series([1, 2, 3, 4, 5], name='x')
>>> kser.groupby('x').head(2)
Traceback (most recent call last):
...
KeyError: ('x',)

>>> pdf = pd.DataFrame({'a': [1, 2, 6, 4, 4, 6, 4, 3, 7],
...                     'b': [4, 2, 7, 3, 3, 1, 1, 1, 2],
...                     'c': [4, 2, 7, 3, None, 1, 1, 1, 2],
...                     'd': list('abcdefght')},
...                    index=[0, 1, 3, 5, 6, 8, 9, 9, 9])
>>> pdf.groupby(pdf)
Traceback (most recent call last):
...
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

>>> pdf.a.groupby(pdf)
Traceback (most recent call last):
...
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

codecov-io · 2019-12-03T03:53:26Z

Codecov Report

Merging #1097 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #1097      +/-   ##
=========================================
- Coverage    95.2%   95.2%   -0.01%     
=========================================
  Files          34      34              
  Lines        6889    6950      +61     
=========================================
+ Hits         6559    6617      +58     
- Misses        330     333       +3
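As a quick sanity check of the table above, the coverage figures can be recomputed from the hit and line counts. This is only a sketch; `coverage_percent` is a hypothetical helper, not part of Codecov.

```python
def coverage_percent(hits, lines):
    """Return line coverage as a percentage."""
    return 100.0 * hits / lines


# Counts copied from the coverage diff: master first, then #1097.
before = coverage_percent(6559, 6889)
after = coverage_percent(6617, 6950)
delta = after - before

# Misses are simply lines minus hits.
misses_before = 6889 - 6559
misses_after = 6950 - 6617

print("before: {:.2f}%  after: {:.2f}%  delta: {:+.4f}".format(before, after, delta))
```

Both values round to 95.2%, and the change is a small negative number well under 0.01 percentage points, which is consistent with the "decrease coverage by <.01%" summary.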

Impacted Files	Coverage Δ
databricks/koalas/generic.py	`96.18% <100%> (+0.25%)`	⬆️
databricks/koalas/base.py	`94.88% <0%> (-1.02%)`	⬇️
databricks/koalas/utils.py	`98.15% <0%> (-0.02%)`	⬇️
databricks/koalas/missing/frame.py	`100% <0%> (ø)`	⬆️
databricks/koalas/missing/series.py	`100% <0%> (ø)`	⬆️
databricks/koalas/frame.py	`96.76% <0%> (+0.02%)`	⬆️
databricks/koalas/series.py	`96.5% <0%> (+0.08%)`	⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a7a640...bbf956b.

HyukjinKwon · 2019-12-03T08:39:56Z

databricks/koalas/generic.py

@@ -1279,6 +1279,8 @@ def groupby(self, by, as_index: bool = True):
            col_by = [_resolve_col(df, col_or_s) for col_or_s in by]
            return DataFrameGroupBy(df_or_s, col_by, as_index=as_index)
        if isinstance(df_or_s, Series):
+            if not isinstance(by[0], Series):


I think you should remove `if isinstance(by, str):`.
We should also fix the error message in `raise ValueError('Not a valid index: TODO')` to match pandas'.

I think it may be hard to remove `if isinstance(by, str):`, because `str` is a valid type for `DataFrameGroupBy`, so don't we still need it?

>>> pdf
      regiment   company experience      name  preTestScore  postTestScore
0   Nighthawks  infantry    veteran    Miller             4             25
1   Nighthawks  infantry     rookie  Jacobson            24             94
2   Nighthawks   cavalry    veteran       Ali            31             57
3   Nighthawks   cavalry     rookie    Milner             2             62
4     Dragoons  infantry    veteran     Cooze             3             70
5     Dragoons  infantry     rookie     Jacon             4             25
6     Dragoons   cavalry    veteran    Ryaner            24             94
7     Dragoons   cavalry     rookie      Sone            31             57
8       Scouts  infantry    veteran     Sloan             2             62
9       Scouts  infantry     rookie     Piger             3             70
10      Scouts   cavalry    veteran     Riani             2             62
11      Scouts   cavalry     rookie       Ali             3             70
>>> pdf.groupby('company')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x124f39780>
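To illustrate the point in this thread, here is a minimal sketch of how a string key can stay valid for a DataFrame while being rejected for a Series. The `Series` and `DataFrame` classes and the `check_groupby_key` helper are hypothetical stand-ins, not the Koalas implementation; only the `KeyError: ('x',)` shape is taken from the PR description.

```python
class Series:
    """Hypothetical stand-in for a Koalas Series (illustration only)."""

    def __init__(self, name):
        self.name = name


class DataFrame:
    """Hypothetical stand-in for a Koalas DataFrame (illustration only)."""

    def __init__(self, columns):
        self.columns = list(columns)


def check_groupby_key(obj, by):
    """Describe the resulting group-by, or raise for an unsupported string key."""
    if isinstance(by, str):
        if isinstance(obj, Series):
            # A Series has no columns to resolve the name against.
            raise KeyError((by,))
        if by not in obj.columns:
            raise KeyError(by)
        return "DataFrameGroupBy({!r})".format(by)
    if isinstance(obj, Series):
        return "SeriesGroupBy"
    return "DataFrameGroupBy"
```

Under these assumptions, `check_groupby_key(DataFrame(['company']), 'company')` succeeds, while `check_groupby_key(Series('x'), 'x')` raises `KeyError`, so the `str` branch is kept but behaves differently per caller type.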

HyukjinKwon · 2019-12-03T08:40:27Z

databricks/koalas/generic.py

@@ -1279,6 +1279,8 @@ def groupby(self, by, as_index: bool = True):
            col_by = [_resolve_col(df, col_or_s) for col_or_s in by]
            return DataFrameGroupBy(df_or_s, col_by, as_index=as_index)
        if isinstance(df_or_s, Series):
+            if not isinstance(by[0], Series):


You can also remove

elif isinstance(by, tuple):
    by = [by]

since a tuple is already an Iterable.

I think they serve slightly different purposes.

If the given `by` is ('hello', 'koalas'), the `elif isinstance(by, tuple)` branch handles it like this:

elif isinstance(by, tuple):
    by = [by]  # [('hello', 'koalas')]

whereas the `isinstance(by, Iterable)` branch handles it like this:

elif isinstance(by, Iterable):
    by = [key if isinstance(key, (tuple, Series)) else (key,) for key in by]  # [('hello',), ('koalas',)]
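The difference between the two branches can be checked with a small standalone sketch. `Series` here is a hypothetical placeholder class and `normalize_by` is an illustrative helper, not the actual Koalas function; the `keep_tuple_branch` flag exists only to compare both behaviours.

```python
from collections.abc import Iterable


class Series:
    """Hypothetical placeholder for a Koalas Series (illustration only)."""


def normalize_by(by, keep_tuple_branch=True):
    """Normalize `by` into a list of keys, with or without the tuple branch."""
    if keep_tuple_branch and isinstance(by, tuple):
        # The whole tuple is one multi-level key.
        return [by]
    if isinstance(by, Iterable):
        # Each element becomes its own key; bare labels are wrapped in 1-tuples.
        return [key if isinstance(key, (tuple, Series)) else (key,) for key in by]
    raise ValueError("Not a valid index: TODO")


print(normalize_by(("hello", "koalas")))
print(normalize_by(("hello", "koalas"), keep_tuple_branch=False))
```

With the tuple branch, `('hello', 'koalas')` stays a single key; without it, the same input is split into two separate keys, which is why removing the branch would change behaviour.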

itholic · 2019-12-03T18:14:57Z

We should not allow a DataFrame as the `by` parameter for GroupBy.

I added this to #1095 and to this PR's description.
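A guard of this kind could look roughly like the sketch below. The `DataFrame` class and `validate_by` helper are hypothetical stand-ins for illustration; only the error message format is modelled on the pandas traceback quoted in the PR description.

```python
class DataFrame:
    """Hypothetical stand-in for a Koalas DataFrame (illustration only)."""


def validate_by(by):
    """Reject a DataFrame used as a grouping key; pass anything else through."""
    if isinstance(by, DataFrame):
        # Message format mirrors the pandas error shown in the description.
        raise ValueError("Grouper for '{}' not 1-dimensional".format(type(by)))
    return by
```

Placing such a check before any other normalization of `by` would make both `pdf.groupby(pdf)` and `pdf.a.groupby(pdf)` style calls fail the same way, assuming the check is shared by the DataFrame and Series paths.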

softagram-bot · 2019-12-04T03:45:52Z

Softagram Impact Report for pull/1097 (head commit: `bbf956b`)

⚠️ Copy paste found

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 697, 721:

        kdf = ks.from_pandas(pdf)
        self.assert_eq(kdf.groupby("b").transform(lambda x: x + 1).sort_index(),
   ...(truncated 421 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 704, 730:


        # multi-index columns
        columns = pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b'), ('y', 'c')])
   ...(truncated 485 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 495, 508:

        pdf = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                            'b': [1, 2, 2, 2, 3, 3, 3, 4, 4],
                            'c': [1, 2, 2, 2, ...(truncated 223 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 547, 575:

        pdf = pd.DataFrame({'A': [1, 1, 2, 2],
                            'B': [2, 4, None, 3],
                            'C': [None, None, None, 1],
                     ...(truncated 231 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 33, 98:

        pdf = pd.DataFrame({'a': [1, 2, 6, 4, 4, 6, 4, 3, 7],
                            'b': [4, 2, 7, 3, 3, 1, 1, 1, 2],
                            'c': [4, 2, 7, 3, No...(truncated 165 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 664, 681:

        kdf = ks.from_pandas(pdf)

        self.assert_eq(
            kdf.groupby('car_id').apply(lambda _: pd.DataFrame({"column": [0.0]})).sort_index(),
            pdf.groupby('car_id')...(truncated 216 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 521, 547, 575:

        pdf = pd.DataFrame({'A': [1, 1, 2, 2],
                            'B': [2, 4, None, 3],
                            'C': [None, None, None, 1],
                    ...(truncated 111 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 351, 375, 399, 423, 447, 471, 635, 694, 744:

        pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                            'b': [1, 1, 2, 3, 5, 8],
                            'c': [1, 4, 9, 16, 25, 36]}, columns=['a'...(truncated 107 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 772, 796:

        pdf = pd.DataFrame({'a': [1, 1, 2, 2, 3],
                            'b': [1, 2, 3, 4, 5],
                            'c': [5, 4, 3, 2, 1]}, columns=['a', 'b', 'c'...(truncated 89 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 561, 588:


        # multi-index columns
        columns = pd.MultiIndex.from_tuples([('X', 'A'), ('X', 'B'), ('Y', 'C'), ('Z', 'D')])
        pdf.col...(truncated 172 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.

⭐ Change Overview

(Open in Softagram Desktop for full details)

💡 Insights

Co-change Alert: You modified test_groupby.py. Often groupby.py (databricks/koalas) is modified at the same time.

📄 Full report

Permalink: Full report for pull/1097

Impact Report explained. Give feedback on this report to [email protected]

Block generating SeriesGroupBy by its name.

7b7813e

HyukjinKwon reviewed Dec 3, 2019

itholic changed the title ~~Block generating SeriesGroupBy by its name.~~ disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy' Dec 3, 2019

fix and add

e5f198f

add test for coverage

bbf956b

HyukjinKwon approved these changes Dec 10, 2019

HyukjinKwon merged commit 00d824a into databricks:master Dec 10, 2019

itholic deleted the s_groupby_block_col_name branch December 10, 2019 15:19
