Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add as_index check logic to groupby parameter #1253

Merged
merged 10 commits into from
Feb 4, 2020
Merged

Add as_index check logic to groupby parameter #1253

merged 10 commits into from
Feb 4, 2020

Conversation

beobest2
Copy link
Contributor

@beobest2 beobest2 commented Feb 2, 2020

Resolve #1252

By doing type validation on item as_index in groupby,
I think it will be fine if other parameters are added later.
Now, when checking grouping of multiple columns, koalas can find mistakes.

>>> pdf = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...                   'foo', 'bar', 'foo', 'foo'],
...                    'B': ['one', 'one', 'two', 'three',
...                    'two', 'two', 'one', 'three'],
...                    'C': np.random.randn(8),
...                    'D': np.random.randn(8)})
>>> kdf = ks.from_pandas(pdf)
>>> kdf.groupby('B', 'A').sum().sort_index()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hwpark/Desktop/dev/git_koalas/koalas/databricks/koalas/generic.py", line 1286, in groupby
    'got [%s]' % type(as_index))
TypeError: as_index must be an boolean; however, got [<class 'str'>]
>>> pdf.groupby('B', 'A').sum()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hwpark/Desktop/dev/git_koalas/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 7883, in groupby
    axis = self._get_axis_number(axis)
  File "/Users/hwpark/Desktop/dev/git_koalas/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 411, in _get_axis_number
    raise ValueError("No axis named {0} for object type {1}".format(axis, cls))
ValueError: No axis named A for object type <class 'pandas.core.frame.DataFrame'>

@codecov-io
Copy link

codecov-io commented Feb 2, 2020

Codecov Report

Merging #1253 into master will decrease coverage by 0.07%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1253      +/-   ##
==========================================
- Coverage    95.1%   95.02%   -0.08%     
==========================================
  Files          35       35              
  Lines        7152     7160       +8     
==========================================
+ Hits         6802     6804       +2     
- Misses        350      356       +6
Impacted Files Coverage Δ
databricks/koalas/generic.py 96.52% <100%> (-0.4%) ⬇️
databricks/koalas/__init__.py 82.97% <0%> (-2.13%) ⬇️
databricks/conftest.py 94.33% <0%> (-1.89%) ⬇️
databricks/koalas/indexes.py 95.68% <0%> (-0.23%) ⬇️
databricks/koalas/groupby.py 91.22% <0%> (-0.22%) ⬇️
databricks/koalas/frame.py 96.78% <0%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9eedb2...cd516dd. Read the comment docs.

@@ -1281,6 +1281,9 @@ def groupby(self, by, as_index: bool = True):
raise ValueError("Grouper for '{}' not 1-dimensional".format(type(by)))
if not len(by):
raise ValueError('No group keys passed!')
if not isinstance(as_index, bool):
raise TypeError('as_index must be an boolean; however, '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: as_index must be an boolean -> as_index must be a boolean

Copy link
Contributor

@itholic itholic Feb 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, maybe this way looks fine for now, but i think we better handle by parameter directly rather than as_index since now behavior looks some hacky (e.g. if other parameters will be added to second positional parameter, it will not work properly - maybe axis or level like the below pandas are doing-)

    @Appender(_shared_docs["groupby"] % _shared_doc_kwargs)
    def groupby(
        self,
        by=None,
        axis=0,
        level=None,
        as_index: bool = True,
        sort: bool = True,
        group_keys: bool = True,
        squeeze: bool = False,
        observed: bool = False,
    ) -> "groupby_generic.DataFrameGroupBy":

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oppose~ I made a mistake. I'll fix this.

Copy link
Contributor

@itholic itholic Feb 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can handle axis parameter like pandas.

ValueError: No axis named A for object type <class 'pandas.core.frame.DataFrame'>

for example, you can add axis parameter with default 0, and raise NotImplementedError for when axis=1 for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, maybe this way looks fine for now, but i think we better handle by parameter directly rather than as_index since now behavior looks some hacky (e.g. if other parameters will be added to second positional parameter, it will not work properly - maybe axis or level like the below pandas are doing-)

    @Appender(_shared_docs["groupby"] % _shared_doc_kwargs)
    def groupby(
        self,
        by=None,
        axis=0,
        level=None,
        as_index: bool = True,
        sort: bool = True,
        group_keys: bool = True,
        squeeze: bool = False,
        observed: bool = False,
    ) -> "groupby_generic.DataFrameGroupBy":

Pandas recognized the value of the second argument as an "axis". Koalas does not yet support "axis" parameters, so it is recognized as "as_index". It is expected that "as_index" will be safely verified no matter what function is added in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can handle axis parameter like pandas.

ValueError: No axis named A for object type <class 'pandas.core.frame.DataFrame'>

for example, you can add axis parameter with default 0, and raise NotImplementedError for when axis=1 for now.

Let me proceed that way. Thank you :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, go for it!

FYI: although i think your approach is not bad, but the point is that we always try to mimic pandas as possible as we can 😃

Copy link
Contributor

@itholic itholic Feb 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And pandas doesn't raise exception when as_index is not boolean.
(They treat 0 as False otherwise True)

  • True for string or number which is not 0.
>>> pdf.groupby('A', as_index='koalas').sum()
            C         D
A
bar -0.998532  1.623860
foo  3.844849  1.563355

>>> pdf.groupby('A', as_index=100).sum()
            C         D
A
bar -0.998532  1.623860
foo  3.844849  1.563355
  • False for 0
>>> pdf.groupby('A', as_index=0).sum()
     A         C         D
0  bar -0.998532  1.623860
1  foo  3.844849  1.563355

but we're not.

>>> kdf.groupby('A', as_index='koalas').sum()
Traceback (most recent call last):
...
TypeError: as_index must be an boolean; however, got [<class 'str'>]

>>> kdf.groupby('A', as_index=100).sum()
Traceback (most recent call last):
...
TypeError: as_index must be an boolean; however, got [<class 'int'>]

i think this is a short example of why it's better to solve an issue like pandas does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And pandas doesn't raise exception when as_index is not boolean.
(They treat 0 as False otherwise True)

  • True for string or number which is not 0.
>>> pdf.groupby('A', as_index='koalas').sum()
            C         D
A
bar -0.998532  1.623860
foo  3.844849  1.563355

>>> pdf.groupby('A', as_index=100).sum()
            C         D
A
bar -0.998532  1.623860
foo  3.844849  1.563355
  • False for 0
>>> pdf.groupby('A', as_index=0).sum()
     A         C         D
0  bar -0.998532  1.623860
1  foo  3.844849  1.563355

but we're not.

>>> kdf.groupby('A', as_index='koalas').sum()
Traceback (most recent call last):
...
TypeError: as_index must be an boolean; however, got [<class 'str'>]

I got it! I will remove the "as_index" validation to work with the example shown above.

@beobest2
Copy link
Contributor Author

beobest2 commented Feb 2, 2020

@itholic
"as_index" validation is removed

>>> kdf.groupby('A', as_index='koalas').sum()
            C         D
A
bar  2.124855 -3.710326
foo -0.271959 -0.680334

also I add axis parameter with default 0, and raise error for when axis=1 for now.

>>> kdf.groupby('B', 'A').sum().sort_index()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/git_koalas/koalas/databricks/koalas/generic.py", line 1285, in groupby
    raise ValueError('axis should be either 0 or "index" currently.')
ValueError: axis sould be either 0 or "index" currently.

@itholic
Copy link
Contributor

itholic commented Feb 2, 2020

LGTM if tests are passed.

@@ -1193,7 +1193,7 @@ def abs(self):

# TODO: by argument only support the grouping name and as_index only for now. Documentation
# should be updated when it's supported.
def groupby(self, by, as_index: bool = True):
def groupby(self, by, axis=0, as_index: bool = True):
"""
Group DataFrame or Series using a Series of columns.

Copy link
Member

@HyukjinKwon HyukjinKwon Feb 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter in the docstring should be fixed too. Actually, why don't you try to implement the other axis? It wouldn't be impossible to do if we use pandas UDF from a cursory look. We have enough time before the next release currently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon I'll be happy to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon For now, I modified it to only supports axis = 0. (also docstring is fix)
I've looked into Pandas, but I don't have confidence in Koalas yet,
so the implementation of other axis seems to take some time.
Hmm.. Is it better I implement another axis as the next step after this PR?
or do you want me to keep developing in this PR?

Copy link
Member

@HyukjinKwon HyukjinKwon Feb 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it should be fine to do it separately.

@@ -1281,6 +1283,9 @@ def groupby(self, by, as_index: bool = True):
raise ValueError("Grouper for '{}' not 1-dimensional".format(type(by)))
if not len(by):
raise ValueError('No group keys passed!')
axis = validate_axis(axis)
if axis != 0:
raise ValueError('axis should be either 0 or "index" currently.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's raise NotImplementedError since #1256 (comment)

@@ -80,6 +80,10 @@ def test_groupby(self):

self.assertRaises(TypeError, lambda: kdf.a.groupby(kdf.b, as_index=False))

self.assertRaises(ValueError, lambda: kdf.groupby('a', axis=1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so should be fixed here also

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests to specify axis=0 and 'index'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itholic @ueshin okay let me fix it. thank you!

@HyukjinKwon HyukjinKwon merged commit 29fc70a into databricks:master Feb 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check parameters when grouping in multiple columns
5 participants