Add as_index check logic to groupby parameter #1253

beobest2 · 2020-02-02T09:12:10Z

Resolve #1252

Validating the type of as_index in groupby should keep the check valid even if other parameters are added later.
With this change, Koalas reports the mistake when several column names are passed as separate positional arguments, as in the example below.

>>> pdf = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...                           'foo', 'bar', 'foo', 'foo'],
...                     'B': ['one', 'one', 'two', 'three',
...                           'two', 'two', 'one', 'three'],
...                     'C': np.random.randn(8),
...                     'D': np.random.randn(8)})
>>> kdf = ks.from_pandas(pdf)
>>> kdf.groupby('B', 'A').sum().sort_index()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hwpark/Desktop/dev/git_koalas/koalas/databricks/koalas/generic.py", line 1286, in groupby
    'got [%s]' % type(as_index))
TypeError: as_index must be an boolean; however, got [<class 'str'>]
>>> pdf.groupby('B', 'A').sum()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hwpark/Desktop/dev/git_koalas/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 7883, in groupby
    axis = self._get_axis_number(axis)
  File "/Users/hwpark/Desktop/dev/git_koalas/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 411, in _get_axis_number
    raise ValueError("No axis named {0} for object type {1}".format(axis, cls))
ValueError: No axis named A for object type <class 'pandas.core.frame.DataFrame'>

codecov-io · 2020-02-02T09:32:39Z

Codecov Report

Merging #1253 into master will decrease coverage by 0.07%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1253      +/-   ##
==========================================
- Coverage    95.1%   95.02%   -0.08%     
==========================================
  Files          35       35              
  Lines        7152     7160       +8     
==========================================
+ Hits         6802     6804       +2     
- Misses        350      356       +6

Impacted Files	Coverage Δ
databricks/koalas/generic.py	`96.52% <100%> (-0.4%)`	⬇️
databricks/koalas/__init__.py	`82.97% <0%> (-2.13%)`	⬇️
databricks/conftest.py	`94.33% <0%> (-1.89%)`	⬇️
databricks/koalas/indexes.py	`95.68% <0%> (-0.23%)`	⬇️
databricks/koalas/groupby.py	`91.22% <0%> (-0.22%)`	⬇️
databricks/koalas/frame.py	`96.78% <0%> (-0.05%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c9eedb2...cd516dd.

itholic · 2020-02-02T12:35:40Z

databricks/koalas/generic.py

@@ -1281,6 +1281,9 @@ def groupby(self, by, as_index: bool = True):
            raise ValueError("Grouper for '{}' not 1-dimensional".format(type(by)))
        if not len(by):
            raise ValueError('No group keys passed!')
+        if not isinstance(as_index, bool):
+            raise TypeError('as_index must be an boolean; however, '


nit: as_index must be an boolean -> as_index must be a boolean

hmm, this looks fine for now, but I think we'd better handle the by parameter directly rather than validating as_index, since the current behavior looks a bit hacky (e.g. if another parameter such as axis or level is later added as the second positional parameter, as in the pandas signature below, this check will no longer work properly).

@Appender(_shared_docs["groupby"] % _shared_doc_kwargs)
def groupby(
    self,
    by=None,
    axis=0,
    level=None,
    as_index: bool = True,
    sort: bool = True,
    group_keys: bool = True,
    squeeze: bool = False,
    observed: bool = False,
) -> "groupby_generic.DataFrameGroupBy":

Oops, I made a mistake. I'll fix this.

I think we can handle the axis parameter the way pandas does:

ValueError: No axis named A for object type <class 'pandas.core.frame.DataFrame'>

For example, you can add an axis parameter with default 0 and raise NotImplementedError when axis=1 for now.
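A minimal sketch of that suggestion, not the Koalas implementation. `normalize_axis` and `groupby_stub` are hypothetical names, the set of accepted axis spellings is an assumption, and the stub only returns the normalized arguments instead of building a GroupBy object.

```python
def normalize_axis(axis=0):
    """Map an axis spelling to 0 or 1 (hypothetical helper, assumed spellings)."""
    names = {0: 0, "index": 0, 1: 1, "columns": 1}
    try:
        return names[axis]
    except (KeyError, TypeError):
        # Unknown or unhashable values are rejected, similar to the pandas error above.
        raise ValueError("No axis named {0}".format(axis)) from None


def groupby_stub(by, axis=0, as_index=True):
    """Stand-in for groupby that only checks its arguments."""
    axis = normalize_axis(axis)
    if axis != 0:
        # Grouping along columns is recognized but not supported yet.
        raise NotImplementedError('axis should be either 0 or "index" currently.')
    return {"by": by, "axis": axis, "as_index": as_index}


# A second positional argument is now treated as an axis, so 'A' is rejected.
try:
    groupby_stub("B", "A")
except ValueError as exc:
    print(exc)
```

With this shape, `groupby_stub('B', 'A')` fails with a ValueError about the axis name, while `groupby_stub('A', axis=1)` fails with NotImplementedError.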

pandas interprets the second positional argument as "axis". Koalas does not support an "axis" parameter yet, so the same argument is bound to "as_index". I expected the "as_index" check to stay valid no matter which parameters are added in the future.
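The binding difference can be illustrated without either library. In the sketch below, `koalas_groupby_before` and `pandas_like_groupby` are hypothetical stand-ins that only copy the leading parameters of the two signatures quoted in this thread, and `bound_arguments` is a helper made up for the illustration.

```python
import inspect


def koalas_groupby_before(by, as_index=True):
    """Stand-in with the Koalas parameters before this PR."""


def pandas_like_groupby(by=None, axis=0, level=None, as_index=True):
    """Stand-in with the leading parameters of the pandas signature."""


def bound_arguments(func, *args, **kwargs):
    """Return a dict showing which parameter each argument binds to."""
    return dict(inspect.signature(func).bind(*args, **kwargs).arguments)


print(bound_arguments(koalas_groupby_before, "B", "A"))
# {'by': 'B', 'as_index': 'A'}
print(bound_arguments(pandas_like_groupby, "B", "A"))
# {'by': 'B', 'axis': 'A'}
```

The same call therefore ends up as an `as_index` problem in one signature and an `axis` problem in the other.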

Let me proceed that way. Thank you :)

yes, go for it!

FYI: your approach is not bad, but the point is that we always try to mimic pandas as closely as we can 😃

Also, pandas doesn't raise an exception when as_index is not a boolean.
(It treats 0 as False and other values as True.)

True for a string or a non-zero number:

>>> pdf.groupby('A', as_index='koalas').sum()
            C         D
A
bar -0.998532  1.623860
foo  3.844849  1.563355
>>> pdf.groupby('A', as_index=100).sum()
            C         D
A
bar -0.998532  1.623860
foo  3.844849  1.563355

False for 0:

>>> pdf.groupby('A', as_index=0).sum()
     A         C         D
0  bar -0.998532  1.623860
1  foo  3.844849  1.563355

But Koalas currently raises instead:

>>> kdf.groupby('A', as_index='koalas').sum()
Traceback (most recent call last):
  ...
TypeError: as_index must be an boolean; however, got [<class 'str'>]
>>> kdf.groupby('A', as_index=100).sum()
Traceback (most recent call last):
  ...
TypeError: as_index must be an boolean; however, got [<class 'int'>]

I think this is a short example of why it's better to solve an issue the way pandas does.
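The pandas results above are consistent with as_index being used as a plain truth value. The sketch below contrasts that with a strict type check. It does not call pandas or Koalas, and both function names are hypothetical.

```python
def as_index_by_truthiness(as_index=True):
    """Interpret as_index the lenient way: any truthy value counts as True."""
    return bool(as_index)


def as_index_strict(as_index=True):
    """Interpret as_index the strict way: reject anything that is not a bool."""
    if not isinstance(as_index, bool):
        raise TypeError("as_index must be a boolean; however, got [%s]" % type(as_index))
    return as_index


for value in ("koalas", 100, 0):
    # The lenient version never raises for these inputs.
    print(repr(value), "->", as_index_by_truthiness(value))
    try:
        as_index_strict(value)
    except TypeError as exc:
        print("strict check:", exc)
```

The strict variant rejects all three inputs, which is the behavior difference from pandas pointed out in this comment.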

I got it! I will remove the "as_index" validation to work with the example shown above.

beobest2 · 2020-02-02T14:36:22Z

@itholic
The "as_index" validation has been removed.

>>> kdf.groupby('A', as_index='koalas').sum()
            C         D
A
bar  2.124855 -3.710326
foo -0.271959 -0.680334

I also added an axis parameter with default 0; for now, an error is raised when axis=1.

>>> kdf.groupby('B', 'A').sum().sort_index()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/git_koalas/koalas/databricks/koalas/generic.py", line 1285, in groupby
    raise ValueError('axis should be either 0 or "index" currently.')
ValueError: axis sould be either 0 or "index" currently.

itholic · 2020-02-02T14:42:03Z

LGTM if the tests pass.

HyukjinKwon · 2020-02-02T16:13:22Z

databricks/koalas/generic.py

@@ -1193,7 +1193,7 @@ def abs(self):

    # TODO: by argument only support the grouping name and as_index only for now. Documentation
    # should be updated when it's supported.
-    def groupby(self, by, as_index: bool = True):
+    def groupby(self, by, axis=0, as_index: bool = True):
        """
        Group DataFrame or Series using a Series of columns.



The parameter description in the docstring should be fixed too. Actually, why don't you try to implement the other axis? From a cursory look, it should be possible with a pandas UDF. We have enough time before the next release.

@HyukjinKwon I'll be happy to do that.

@HyukjinKwon For now, I changed it to support only axis=0 (the docstring is fixed as well).
I've looked into how pandas does it, but I'm not confident with the Koalas internals yet,
so implementing the other axis will take some time.
Would it be better to implement the other axis as a follow-up after this PR,
or do you want me to keep working on it in this PR?

Sure, it should be fine to do it separately.

itholic · 2020-02-03T22:28:29Z

databricks/koalas/generic.py

@@ -1281,6 +1283,9 @@ def groupby(self, by, as_index: bool = True):
            raise ValueError("Grouper for '{}' not 1-dimensional".format(type(by)))
        if not len(by):
            raise ValueError('No group keys passed!')
+        axis = validate_axis(axis)
+        if axis != 0:
+            raise ValueError('axis should be either 0 or "index" currently.')


Let's raise NotImplementedError instead, per #1256 (comment)

itholic · 2020-02-03T22:30:31Z

databricks/koalas/tests/test_groupby.py

@@ -80,6 +80,10 @@ def test_groupby(self):

        self.assertRaises(TypeError, lambda: kdf.a.groupby(kdf.b, as_index=False))

+        self.assertRaises(ValueError, lambda: kdf.groupby('a', axis=1))


So the expected exception should be fixed here too.

Could you add tests to specify axis=0 and 'index'?

@itholic @ueshin okay let me fix it. thank you!
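A rough sketch of what the requested test cases could look like. `groupby_stub` is a hypothetical stand-in so the example runs without Koalas; in the real test the calls would go through kdf.groupby, and the stub simplifies validation by treating every value other than 0 or 'index' as unsupported.

```python
import unittest


def groupby_stub(by, axis=0, as_index=True):
    """Hypothetical stand-in for the axis handling discussed in this review."""
    if axis not in (0, "index"):
        raise NotImplementedError('axis should be either 0 or "index" currently.')
    return (by, 0, bool(as_index))


class GroupByAxisTest(unittest.TestCase):
    def test_supported_axis(self):
        # Both spellings of the row axis should be accepted and normalized to 0.
        for axis in (0, "index"):
            self.assertEqual(groupby_stub("a", axis=axis), ("a", 0, True))

    def test_unsupported_axis(self):
        # The column axis is recognized as not implemented yet.
        for axis in (1, "columns"):
            self.assertRaises(NotImplementedError, lambda: groupby_stub("a", axis=axis))


# Run the two cases explicitly and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GroupByAxisTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```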

Add as_index check logic to groupby parameter

d60701a

beobest2 mentioned this pull request Feb 2, 2020

Check parameters when grouping in multiple columns #1252

Closed

itholic reviewed Feb 2, 2020

beobest2 added 5 commits February 2, 2020 21:58

fix typo an -> a

26d773f

Add axis parameter check

38780e9

Del as_index validation

b1ea537

Fix typo

34f34b1

fix typo

1d14a17

HyukjinKwon reviewed Feb 2, 2020

beobest2 added 3 commits February 3, 2020 23:54

Fix docstring & add validate axis param

d5c7a35

Fix test code

08e7519

Add test case

f5d36f2

itholic reviewed Feb 3, 2020

change error into NotImplementedError & add axis=1 test case

cd516dd

HyukjinKwon approved these changes Feb 4, 2020

HyukjinKwon merged commit 29fc70a into databricks:master Feb 4, 2020
