Update DataFrame.pivot_table() #635

garawalid · 2019-08-11T11:39:05Z

Resolves #511.
In the test, the kdf is converted to a pandas DataFrame in order to use sort_index(). I'll update the test once #634 is resolved.
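
For reference, a minimal pandas-only sketch of why the test sorts by index before comparing. The frames below are made-up stand-ins, not the PR's test data; in the real test the `actual` side would be the Koalas result converted to pandas.

```python
import pandas as pd

# Hypothetical stand-in data: same rows, different row order.
expected = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
actual = expected.loc[["c", "a", "b"]]

# Sorting both sides by index makes the comparison independent of row order.
pd.testing.assert_frame_equal(actual.sort_index(), expected.sort_index())
rows_match = actual.sort_index().equals(expected.sort_index())
order_matches = actual.equals(expected)
```

Without the sort, the two frames compare as unequal even though they hold the same rows.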

codecov-io · 2019-08-11T11:57:39Z

Codecov Report

Merging #635 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #635      +/-   ##
=========================================
- Coverage   93.52%   93.5%   -0.03%     
=========================================
  Files          32      32              
  Lines        5455    5478      +23     
=========================================
+ Hits         5102    5122      +20     
- Misses        353     356       +3

Impacted Files	Coverage Δ
databricks/koalas/frame.py	`94.74% <100%> (+0.01%)`	⬆️
databricks/koalas/__init__.py	`82.05% <0%> (-2.57%)`	⬇️
databricks/conftest.py	`95.34% <0%> (-2.33%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 08b653e...ec7dcae.

databricks/koalas/frame.py

HyukjinKwon · 2019-08-12T02:02:04Z

Seems fine otherwise. cc @ueshin

garawalid · 2019-08-12T22:53:50Z

@HyukjinKwon thanks for the review. I'll address your comments after merging #637.

ueshin

Btw, now that we can add column index names, can we add them for other cases? E.g., from doctests:

>>> pdf.pivot_table(values='D', index=['A', 'B'], columns='C', aggfunc='sum')
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

>>> pdf.pivot_table(values='D', index=['A', 'B'], columns='C', aggfunc='sum', fill_value=0)
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6

>>> pdf.pivot_table(values = ['D'], index =['C'], columns="A", aggfunc={'D':'mean'})
         D
A      bar       foo
C
large  5.5  2.000000
small  5.5  2.333333

We can address that in separate PRs, though. Up to you, @garawalid.

Thanks!

ueshin · 2019-08-12T23:10:26Z

databricks/koalas/frame.py

+        The next example aggregates on multiple values.
+
+        >>> table = df.pivot_table(index=['C'], columns="A", values=['B', 'E'],
+        ...                         aggfunc={'B': 'mean', 'E': 'sum'})


Should we use D instead of B, since B is a string column? It won't calculate anything.

BTW, pandas raises an error for the case:

>>> pdf.pivot_table(index=['C'], columns="A", values=['B', 'E'], aggfunc={'B': 'mean', 'E': 'sum'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/frame.py", line 6067, in pivot_table
    observed=observed,
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 96, in pivot_table
    agged = grouped.agg(aggfunc)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 1455, in aggregate
    return super().aggregate(arg, *args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 229, in aggregate
    result, how = self._aggregate(func, _level=_level, *args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 506, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 456, in _agg
    result[fname] = func(fname, agg_how)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 440, in _agg_1dim
    return colg.aggregate(how, _level=(_level or 0) + 1)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 845, in aggregate
    return getattr(self, func_or_funcs)(*args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1205, in mean
    "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 888, in _cython_agg_general
    raise DataError("No numeric types to aggregate")
pandas.core.base.DataError: No numeric types to aggregate

whereas:

>>> pdf.pivot_table(index=['C'], columns="A", values=['D', 'E'], aggfunc={'D': 'mean', 'E': 'sum'})
         D             E
A      bar       foo bar foo
C
large  5.5  2.000000  15   9
small  5.5  2.333333  17  13

Should we follow the behavior and raise an error? cc @HyukjinKwon

Yea, +1 to match.

@ueshin Nice catch!
I agree we should raise the same error!
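
One way the check could look, as a rough sketch rather than what this PR implements. It assumes the column types are available as (name, type name) pairs, the shape PySpark's `DataFrame.dtypes` returns. The helper name `check_numeric_values`, the list of numeric type names, the example dtypes, and the use of `TypeError` in place of pandas' `DataError` are all assumptions made for illustration.

```python
# Spark type names treated as numeric in this sketch (an assumption).
NUMERIC_SPARK_TYPES = ("tinyint", "smallint", "int", "bigint", "float", "double")


def check_numeric_values(dtypes, values):
    """Raise if any column listed in `values` is not numeric.

    `dtypes` is a list of (column name, type name) pairs.
    """
    type_of = dict(dtypes)
    bad = [
        v for v in values
        if not (type_of[v] in NUMERIC_SPARK_TYPES or type_of[v].startswith("decimal"))
    ]
    if bad:
        # TypeError is a stand-in; pandas 0.25 raises DataError here.
        raise TypeError("No numeric types to aggregate: {}".format(bad))


# Illustrative dtypes: B is a string column, D and E are numeric.
dtypes = [("A", "string"), ("B", "string"), ("C", "string"),
          ("D", "bigint"), ("E", "bigint")]

check_numeric_values(dtypes, ["D", "E"])  # passes silently

try:
    check_numeric_values(dtypes, ["B", "E"])
except TypeError as exc:
    message = str(exc)
```

Checking the types up front would fail before any Spark job is launched, instead of silently returning a result for the string column.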

garawalid · 2019-08-18T21:57:24Z

@ueshin Sure!
I added the column index names and skipped the doctest because it fails.

HyukjinKwon · 2019-08-19T04:49:52Z

To me, it seems fine in general, but @ueshin has worked on the indexing stuff more than I have, so I will leave it to him.

ueshin · 2019-08-19T17:52:25Z

The doctests fail because you don't set the column index names.
Now that we always use column_index, even for single index columns, we can set the column index names for single index columns as well.
Could you try to set the column index names, or revert the doctest changes so they don't fail and address this in a separate PR?
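
For comparison, a small pandas sketch of the names the doctests would need to reproduce. The data is illustrative only, not the frame from the docstring.

```python
import pandas as pd

# Illustrative data only.
pdf = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar"],
    "C": ["small", "large", "small", "large"],
    "D": [1, 2, 3, 4],
})

# Single-level columns: the column index is named after the `columns` argument.
single = pdf.pivot_table(values="D", index="A", columns="C", aggfunc="sum")
single_names = list(single.columns.names)

# List-valued `values`: two levels, where the top level (the value column) is unnamed.
multi = pdf.pivot_table(values=["D"], index="C", columns="A", aggfunc={"D": "mean"})
multi_names = list(multi.columns.names)
```

Here `single_names` is `['C']` and `multi_names` is `[None, 'A']`, which matches the `C` and `A` header rows in the pandas output quoted earlier in this thread.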

softagram-bot · 2019-08-23T22:10:18Z

Softagram Impact Report for pull/635 (head commit: `ec7dcae`)


garawalid · 2019-08-23T22:41:28Z

@ueshin
Now pivot_table supports column index names (#636). Also, I updated the doctest of pivot.
Would you mind reviewing the PR?

ueshin

LGTM.

ueshin · 2019-08-26T19:01:34Z

Thanks! merging.

garawalid changed the title ~~Pivot table multiindex~~ Update DataFrame.pivot_table() Aug 11, 2019

garawalid mentioned this pull request Aug 11, 2019

Support column names in pivot_table #636

Closed