-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update DataFrame.pivot_table() #635
Conversation
Codecov Report
@@ Coverage Diff @@
## master #635 +/- ##
=========================================
- Coverage 93.52% 93.5% -0.03%
=========================================
Files 32 32
Lines 5455 5478 +23
=========================================
+ Hits 5102 5122 +20
- Misses 353 356 +3
Continue to review full report at Codecov.
|
Seems fine otherwise. cc @ueshin |
@HyukjinKwon thanks for the review. I'll address your comments after merging #637. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, now that we can add column index names, can we add them for other cases? E.g., from doctests:
>>> pdf.pivot_table(values='D', index=['A', 'B'], columns='C', aggfunc='sum')
C large small
A B
bar one 4.0 5.0
two 7.0 6.0
foo one 4.0 1.0
two NaN 6.0
>>> pdf.pivot_table(values='D', index=['A', 'B'], columns='C', aggfunc='sum', fill_value=0)
C large small
A B
bar one 4 5
two 7 6
foo one 4 1
two 0 6
>>> pdf.pivot_table(values = ['D'], index =['C'], columns="A", aggfunc={'D':'mean'})
D
A bar foo
C
large 5.5 2.000000
small 5.5 2.333333
We can address in the separate PRs, though. Up to you, @garawalid.
Thanks!
databricks/koalas/frame.py
Outdated
The next example aggregates on multiple values. | ||
|
||
>>> table = df.pivot_table(index=['C'], columns="A", values=['B', 'E'], | ||
... aggfunc={'B': 'mean', 'E': 'sum'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use D
instead of B
since B
is a string column? it won't calculate anything.
BTW, pandas raises an error for the case:
>>> pdf.pivot_table(index=['C'], columns="A", values=['B', 'E'], aggfunc={'B': 'mean', 'E': 'sum'})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/frame.py", line 6067, in pivot_table
observed=observed,
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 96, in pivot_table
agged = grouped.agg(aggfunc)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 1455, in aggregate
return super().aggregate(arg, *args, **kwargs)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 229, in aggregate
result, how = self._aggregate(func, _level=_level, *args, **kwargs)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 506, in _aggregate
result = _agg(arg, _agg_1dim)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 456, in _agg
result[fname] = func(fname, agg_how)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 440, in _agg_1dim
return colg.aggregate(how, _level=(_level or 0) + 1)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 845, in aggregate
return getattr(self, func_or_funcs)(*args, **kwargs)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1205, in mean
"mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 888, in _cython_agg_general
raise DataError("No numeric types to aggregate")
pandas.core.base.DataError: No numeric types to aggregate
whereas:
>>> pdf.pivot_table(index=['C'], columns="A", values=['D', 'E'], aggfunc={'D': 'mean', 'E': 'sum'})
D E
A bar foo bar foo
C
large 5.5 2.000000 15 9
small 5.5 2.333333 17 13
Should we follow the behavior and raise an error? cc @HyukjinKwon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, +1 to match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Nice catch!
I agree we should raise the same error!
5c7784a
to
c13ee18
Compare
@ueshin Sure! |
To me, it seems fine in general but @ueshin has worked on indexing stuff more than I do.. so I will leave it to him. |
The reason why the doctests fail is you don't set the column index names. |
9d89391
to
7372652
Compare
Softagram Impact Report for pull/635 (head commit: ec7dcae)⭐ Change Overview
📄 Full report
Give feedback on this report to [email protected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks! merging. |
Resolves #511.
In the test, the
kdf
is converted to Pandas DataFrame in order to usesort_index()
. I'll update the test once #634 resolved.