Add as_index check logic to groupby parameter #1253
Codecov Report
@@ Coverage Diff @@
## master #1253 +/- ##
==========================================
- Coverage 95.1% 95.02% -0.08%
==========================================
Files 35 35
Lines 7152 7160 +8
==========================================
+ Hits 6802 6804 +2
- Misses 350 356 +6
databricks/koalas/generic.py
Outdated
@@ -1281,6 +1281,9 @@ def groupby(self, by, as_index: bool = True):
                 raise ValueError("Grouper for '{}' not 1-dimensional".format(type(by)))
         if not len(by):
             raise ValueError('No group keys passed!')
+        if not isinstance(as_index, bool):
+            raise TypeError('as_index must be an boolean; however, '
nit: as_index must be an boolean
-> as_index must be a boolean
Hmm, maybe this way looks fine for now, but I think we had better handle the by parameter directly rather than as_index, since the current behavior looks somewhat hacky (e.g. if another parameter, maybe axis or level, is later added as the second positional parameter, it will not work properly). This is the signature pandas uses:
@Appender(_shared_docs["groupby"] % _shared_doc_kwargs)
def groupby(
self,
by=None,
axis=0,
level=None,
as_index: bool = True,
sort: bool = True,
group_keys: bool = True,
squeeze: bool = False,
observed: bool = False,
) -> "groupby_generic.DataFrameGroupBy":
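To make the concern about the second positional parameter concrete, here is a minimal sketch of how the same call binds under the two parameter orders. The functions groupby_current and groupby_like_pandas are hypothetical stand-ins that only mirror the parameter order; they are not Koalas or pandas code.

```python
import inspect


def groupby_current(by, as_index=True):
    """Hypothetical stand-in mirroring the parameter order in this PR's base."""


def groupby_like_pandas(by=None, axis=0, level=None, as_index=True):
    """Hypothetical stand-in mirroring the leading pandas parameters."""


def bound_arguments(func, *args, **kwargs):
    """Return a dict showing which parameter each argument binds to."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    return dict(bound.arguments)


# The same call binds 'A' to a different parameter under each signature.
current = bound_arguments(groupby_current, 'B', 'A')
pandas_like = bound_arguments(groupby_like_pandas, 'B', 'A')
print(current)
print(pandas_like)
```

With the first signature 'A' lands in as_index, and with the pandas-like one it lands in axis, which is why a type check on as_index alone stops being reliable once the parameter order changes.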
Oops, I made a mistake. I'll fix this.
I think we can handle the axis parameter the way pandas does. pandas raises:
ValueError: No axis named A for object type <class 'pandas.core.frame.DataFrame'>
For example, you can add an axis parameter with default 0, and raise NotImplementedError when axis=1 for now.
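A minimal sketch of that suggestion, assuming a hypothetical helper resolve_axis and a hypothetical stand-in groupby_sketch (neither is the Koalas implementation, and the error messages are only modelled on the ones quoted in this thread):

```python
def resolve_axis(axis=0):
    """Hypothetical helper: map an axis number or name to 0 or 1."""
    names = {0: 0, 'index': 0, 1: 1, 'columns': 1}
    if axis not in names:
        # Modelled on the pandas message quoted above.
        raise ValueError("No axis named {}".format(axis))
    return names[axis]


def groupby_sketch(by, axis=0, as_index=True):
    """Hypothetical stand-in: accept axis, but only support axis=0 for now."""
    axis = resolve_axis(axis)
    if axis != 0:
        raise NotImplementedError('axis should be either 0 or "index" currently.')
    return by, axis, as_index
```

With this shape, groupby_sketch('B', 'A') fails with a ValueError about the axis name, while axis=1 or axis='columns' fails with NotImplementedError.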
pandas treats the second positional argument as axis. Koalas does not support the axis parameter yet, so the second positional argument is treated as as_index. With this check, as_index should be validated safely no matter what parameters are added in the future.
Let me proceed that way. Thank you :)
yes, go for it!
FYI: I think your approach is not bad, but the point is that we always try to mimic pandas as closely as we can 😃
And pandas doesn't raise an exception when as_index is not a boolean.
(It treats 0 as False, and anything else as True.)
True for a string or a number which is not 0:
>>> pdf.groupby('A', as_index='koalas').sum()
C D
A
bar -0.998532 1.623860
foo 3.844849 1.563355
>>> pdf.groupby('A', as_index=100).sum()
C D
A
bar -0.998532 1.623860
foo 3.844849 1.563355
False for 0:
>>> pdf.groupby('A', as_index=0).sum()
A C D
0 bar -0.998532 1.623860
1 foo 3.844849 1.563355
But we don't:
>>> kdf.groupby('A', as_index='koalas').sum()
Traceback (most recent call last):
...
TypeError: as_index must be an boolean; however, got [<class 'str'>]
>>> kdf.groupby('A', as_index=100).sum()
Traceback (most recent call last):
...
TypeError: as_index must be an boolean; however, got [<class 'int'>]
I think this is a short example of why it's better to solve an issue the way pandas does.
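The pandas outputs above are consistent with plain Python truthiness. A minimal sketch, assuming as_index is simply interpreted by truthiness (this is inferred from the outputs shown here, not checked against the pandas source), with coerce_as_index as a hypothetical helper name:

```python
def coerce_as_index(as_index=True):
    """Hypothetical helper: interpret as_index by truthiness instead of type-checking."""
    return bool(as_index)


# A non-empty string and a non-zero number are truthy; 0 is falsy.
for value in ('koalas', 100, 0, True, False):
    print(repr(value), '->', coerce_as_index(value))
```

Under that assumption, dropping the isinstance check and relying on truthiness would make the Koalas calls above behave like the pandas ones.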
I got it! I will remove the "as_index" validation to work with the example shown above.
@itholic
>>> kdf.groupby('A', as_index='koalas').sum()
C D
A
bar 2.124855 -3.710326
foo -0.271959 -0.680334
I also added an axis parameter with default 0, and it raises an error when axis=1 for now.
>>> kdf.groupby('B', 'A').sum().sort_index()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/git_koalas/koalas/databricks/koalas/generic.py", line 1285, in groupby
raise ValueError('axis should be either 0 or "index" currently.')
ValueError: axis should be either 0 or "index" currently.
LGTM if tests pass.
@@ -1193,7 +1193,7 @@ def abs(self):

     # TODO: by argument only support the grouping name and as_index only for now. Documentation
     # should be updated when it's supported.
-    def groupby(self, by, as_index: bool = True):
+    def groupby(self, by, axis=0, as_index: bool = True):
         """
         Group DataFrame or Series using a Series of columns.
The parameter in the docstring should be fixed too. Actually, why don't you try to implement the other axis? It wouldn't be impossible to do if we use pandas UDF from a cursory look. We have enough time before the next release currently.
@HyukjinKwon I'll be happy to do that.
@HyukjinKwon For now, I modified it to support only axis=0 (the docstring is fixed as well).
I've looked into pandas, but I'm not confident with Koalas yet, so implementing the other axis seems likely to take some time.
Hmm, would it be better to implement the other axis as the next step after this PR, or do you want me to keep developing it in this PR?
Sure, it should be fine to do it separately.
databricks/koalas/generic.py
Outdated
@@ -1281,6 +1283,9 @@ def groupby(self, by, as_index: bool = True):
                 raise ValueError("Grouper for '{}' not 1-dimensional".format(type(by)))
         if not len(by):
             raise ValueError('No group keys passed!')
+        axis = validate_axis(axis)
+        if axis != 0:
+            raise ValueError('axis should be either 0 or "index" currently.')
Let's raise NotImplementedError since #1256 (comment)
@@ -80,6 +80,10 @@ def test_groupby(self):

         self.assertRaises(TypeError, lambda: kdf.a.groupby(kdf.b, as_index=False))

+        self.assertRaises(ValueError, lambda: kdf.groupby('a', axis=1))
So this should be fixed here as well.
Could you add tests to specify axis=0 and 'index'?
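A rough sketch of what such tests could look like. groupby_stub is a hypothetical stand-in so the example runs on its own; the real tests would call kdf.groupby instead and compare against pandas.

```python
import unittest


def groupby_stub(by, axis=0, as_index=True):
    """Hypothetical stand-in for the method under test; not Koalas code."""
    if axis not in (0, 'index'):
        raise NotImplementedError('axis should be either 0 or "index" currently.')
    return by


class GroupByAxisTest(unittest.TestCase):
    def test_supported_axis(self):
        # Both spellings of the supported axis should be accepted.
        for axis in (0, 'index'):
            self.assertEqual(groupby_stub('a', axis=axis), 'a')

    def test_unsupported_axis(self):
        # axis=1 is rejected until the other axis is implemented.
        self.assertRaises(NotImplementedError, lambda: groupby_stub('a', axis=1))
```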
Resolves #1252
By validating the type of as_index in groupby, I think it will be fine even if other parameters are added later.
Now Koalas can catch mistakes when grouping by multiple columns.