disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy' #1097

Merged 3 commits on Dec 10, 2019

Contributor

@itholic itholic commented Dec 3, 2019

Resolve #1095

>>> kser = ks.Series([1, 2, 3, 4, 5], name='x')
>>> kser.groupby('x').head(2)
Traceback (most recent call last):
...
KeyError: ('x',)
>>> pdf = pd.DataFrame({'a': [1, 2, 6, 4, 4, 6, 4, 3, 7],
...                     'b': [4, 2, 7, 3, 3, 1, 1, 1, 2],
...                     'c': [4, 2, 7, 3, None, 1, 1, 1, 2],
...                     'd': list('abcdefght')},
...                    index=[0, 1, 3, 5, 6, 8, 9, 9, 9])
>>> pdf.groupby(pdf)
Traceback (most recent call last):
...
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

>>> pdf.a.groupby(pdf)
Traceback (most recent call last):
...
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

codecov-io commented Dec 3, 2019

Codecov Report

Merging #1097 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #1097      +/-   ##
=========================================
- Coverage    95.2%   95.2%   -0.01%     
=========================================
  Files          34      34              
  Lines        6889    6950      +61     
=========================================
+ Hits         6559    6617      +58     
- Misses        330     333       +3

| Impacted Files | Coverage Δ | |
|---|---|---|
| databricks/koalas/generic.py | 96.18% <100%> (+0.25%) | ⬆️ |
| databricks/koalas/base.py | 94.88% <0%> (-1.02%) | ⬇️ |
| databricks/koalas/utils.py | 98.15% <0%> (-0.02%) | ⬇️ |
| databricks/koalas/missing/frame.py | 100% <0%> (ø) | ⬆️ |
| databricks/koalas/missing/series.py | 100% <0%> (ø) | ⬆️ |
| databricks/koalas/frame.py | 96.76% <0%> (+0.02%) | ⬆️ |
| databricks/koalas/series.py | 96.5% <0%> (+0.08%) | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -1279,6 +1279,8 @@ def groupby(self, by, as_index: bool = True):
col_by = [_resolve_col(df, col_or_s) for col_or_s in by]
return DataFrameGroupBy(df_or_s, col_by, as_index=as_index)
if isinstance(df_or_s, Series):
if not isinstance(by[0], Series):
Member

I think you should remove the if isinstance(by, str): branch.
We should also fix the error message in raise ValueError('Not a valid index: TODO') so that it matches the one pandas raises.

Contributor Author

@itholic itholic Dec 3, 2019

I think it would be hard to remove the if isinstance(by, str): branch, because a str is a valid key for DataFrameGroupBy. Don't we still need it?

>>> pdf
      regiment   company experience      name  preTestScore  postTestScore
0   Nighthawks  infantry    veteran    Miller             4             25
1   Nighthawks  infantry     rookie  Jacobson            24             94
2   Nighthawks   cavalry    veteran       Ali            31             57
3   Nighthawks   cavalry     rookie    Milner             2             62
4     Dragoons  infantry    veteran     Cooze             3             70
5     Dragoons  infantry     rookie     Jacon             4             25
6     Dragoons   cavalry    veteran    Ryaner            24             94
7     Dragoons   cavalry     rookie      Sone            31             57
8       Scouts  infantry    veteran     Sloan             2             62
9       Scouts  infantry     rookie     Piger             3             70
10      Scouts   cavalry    veteran     Riani             2             62
11      Scouts   cavalry     rookie       Ali             3             70
>>> pdf.groupby('company')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x124f39780>
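
The asymmetry discussed here can be sketched with pandas alone. This is a minimal illustration, not code from the PR: the shortened frame and the helper name series_accepts_label are made up for the example. It shows that a column label is accepted as a key by a DataFrame groupby, while the same label passed to a Series groupby raises KeyError, which is the behavior this PR mirrors in Koalas.

```python
import pandas as pd

# Hypothetical, shortened version of the frame shown above.
pdf = pd.DataFrame({
    "company": ["infantry", "infantry", "cavalry", "cavalry"],
    "preTestScore": [4, 24, 31, 2],
})

# A column label is a valid key for a DataFrame groupby.
frame_sums = pdf.groupby("company")["preTestScore"].sum()


def series_accepts_label(ser, label):
    """Return True if ser.groupby(label) succeeds, False if it raises KeyError."""
    try:
        ser.groupby(label)
    except KeyError:
        return False
    return True


# The same label is not a valid key for a Series groupby: a Series has no
# columns to resolve the label against, so pandas raises KeyError.
series_ok = series_accepts_label(pdf["company"], "company")
```

So the str check cannot simply be dropped for every caller; it has to stay for the DataFrame path and be rejected only on the Series path.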

@@ -1279,6 +1279,8 @@ def groupby(self, by, as_index: bool = True):
col_by = [_resolve_col(df, col_or_s) for col_or_s in by]
return DataFrameGroupBy(df_or_s, col_by, as_index=as_index)
if isinstance(df_or_s, Series):
if not isinstance(by[0], Series):
Member

You can also remove this branch:

        elif isinstance(by, tuple):
            by = [by]

since a tuple is already an Iterable.

Contributor Author

@itholic itholic Dec 3, 2019

I think the two branches serve slightly different purposes.

Suppose the given by is ('hello', 'koalas').

The elif isinstance(by, tuple) branch keeps the tuple as a single key:

elif isinstance(by, tuple):
    by = [by]  # [('hello', 'koalas')]

whereas the isinstance(by, Iterable) branch would split it into one key per element:

elif isinstance(by, Iterable):
    by = [key if isinstance(key, (tuple, Series)) else (key,) for key in by]  # [('hello',), ('koalas',)]
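
The point about branch order can be checked with a small dependency-free sketch. The function name normalize_by and the final error branch are assumptions made for illustration; the real Koalas code also treats Series keys specially, which is left out here so the example runs without pandas or Spark.

```python
from collections.abc import Iterable


def normalize_by(by):
    """Turn `by` into a list of tuple labels (illustrative only, not the Koalas code)."""
    if isinstance(by, str):
        # Checked first: a str is itself Iterable and would otherwise be
        # split into single characters by the last branch.
        return [(by,)]
    if isinstance(by, tuple):
        # A tuple is one multi-level label, so it is wrapped whole.
        return [by]
    if isinstance(by, Iterable):
        # Any other iterable is a collection of separate keys.
        return [key if isinstance(key, tuple) else (key,) for key in by]
    # Hypothetical fallback for this sketch.
    raise ValueError("Not a valid grouping key: {!r}".format(by))


as_tuple = normalize_by(("hello", "koalas"))  # one two-level key
as_list = normalize_by(["hello", "koalas"])   # two one-level keys

# What the Iterable branch alone would do to the same tuple.
without_tuple_branch = [
    key if isinstance(key, tuple) else (key,) for key in ("hello", "koalas")
]
```

With the tuple branch in place, ('hello', 'koalas') stays a single key; without it, the tuple is treated like a list and becomes two separate keys.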

@itholic itholic changed the title from "Block generating SeriesGroupBy by its name." to "disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy'" Dec 3, 2019
Contributor Author

itholic commented Dec 3, 2019

We should also not allow a DataFrame as the by parameter for GroupBy.

I added this to #1095 and to this PR's description.
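
For reference, the pandas behavior being matched can be reproduced with a short sketch. The small frame and the helper groupby_error are made up for the example; only DataFrame.groupby and Series.groupby are real pandas calls.

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 6], "b": [4, 2, 7]})


def groupby_error(obj, by):
    """Return the exception raised by obj.groupby(by), or None if it succeeds."""
    try:
        obj.groupby(by)
    except Exception as exc:  # broad on purpose: we only inspect the error
        return exc
    return None


frame_err = groupby_error(pdf, pdf)        # DataFrame grouped by a DataFrame
series_err = groupby_error(pdf["a"], pdf)  # Series grouped by a DataFrame
```

In both cases pandas rejects the key with a ValueError because a DataFrame is not 1-dimensional, as shown in the tracebacks in the PR description.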

@softagram-bot

Softagram Impact Report for pull/1097 (head commit: bbf956b)

⚠️ Copy paste found

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 697, 721:

        kdf = ks.from_pandas(pdf)
        self.assert_eq(kdf.groupby("b").transform(lambda x: x + 1).sort_index(),
   ...(truncated 421 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 704, 730:


        # multi-index columns
        columns = pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b'), ('y', 'c')])
   ...(truncated 485 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 495, 508:

        pdf = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                            'b': [1, 2, 2, 2, 3, 3, 3, 4, 4],
                            'c': [1, 2, 2, 2, ...(truncated 223 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 547, 575:

        pdf = pd.DataFrame({'A': [1, 1, 2, 2],
                            'B': [2, 4, None, 3],
                            'C': [None, None, None, 1],
                     ...(truncated 231 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 33, 98:

        pdf = pd.DataFrame({'a': [1, 2, 6, 4, 4, 6, 4, 3, 7],
                            'b': [4, 2, 7, 3, 3, 1, 1, 1, 2],
                            'c': [4, 2, 7, 3, No...(truncated 165 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 664, 681:

        kdf = ks.from_pandas(pdf)

        self.assert_eq(
            kdf.groupby('car_id').apply(lambda _: pd.DataFrame({"column": [0.0]})).sort_index(),
            pdf.groupby('car_id')...(truncated 216 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 521, 547, 575:

        pdf = pd.DataFrame({'A': [1, 1, 2, 2],
                            'B': [2, 4, None, 3],
                            'C': [None, None, None, 1],
                    ...(truncated 111 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 351, 375, 399, 423, 447, 471, 635, 694, 744:

        pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                            'b': [1, 1, 2, 3, 5, 8],
                            'c': [1, 4, 9, 16, 25, 36]}, columns=['a'...(truncated 107 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 772, 796:

        pdf = pd.DataFrame({'a': [1, 1, 2, 2, 3],
                            'b': [1, 2, 3, 4, 5],
                            'c': [5, 4, 3, 2, 1]}, columns=['a', 'b', 'c'...(truncated 89 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 561, 588:


        # multi-index columns
        columns = pd.MultiIndex.from_tuples([('X', 'A'), ('X', 'B'), ('Y', 'C'), ('Z', 'D')])
        pdf.col...(truncated 172 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.

💡 Insights

  • Co-change Alert: You modified test_groupby.py. Often groupby.py (databricks/koalas) is modified at the same time.

@HyukjinKwon HyukjinKwon merged commit 00d824a into databricks:master Dec 10, 2019
@itholic itholic deleted the s_groupby_block_col_name branch December 10, 2019 15:19