Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update DataFrame.pivot_table() #635

Merged
merged 7 commits into from
Aug 26, 2019

Conversation

garawalid
Copy link
Contributor

Resolves #511.
In the test, the kdf is converted to Pandas DataFrame in order to use sort_index(). I'll update the test once #634 resolved.

@garawalid garawalid changed the title Pivot table multiindex Update DataFrame.pivot_table() Aug 11, 2019
@codecov-io
Copy link

codecov-io commented Aug 11, 2019

Codecov Report

Merging #635 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #635      +/-   ##
=========================================
- Coverage   93.52%   93.5%   -0.03%     
=========================================
  Files          32      32              
  Lines        5455    5478      +23     
=========================================
+ Hits         5102    5122      +20     
- Misses        353     356       +3
Impacted Files Coverage Δ
databricks/koalas/frame.py 94.74% <100%> (+0.01%) ⬆️
databricks/koalas/__init__.py 82.05% <0%> (-2.57%) ⬇️
databricks/conftest.py 95.34% <0%> (-2.33%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08b653e...ec7dcae. Read the comment docs.

@HyukjinKwon
Copy link
Member

Seems fine otherwise. cc @ueshin

@garawalid
Copy link
Contributor Author

@HyukjinKwon thanks for the review. I'll address your comments after merging #637.

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, now that we can add column index names, can we add them for other cases? E.g., from doctests:

>>> pdf.pivot_table(values='D', index=['A', 'B'], columns='C', aggfunc='sum')
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

>>> pdf.pivot_table(values='D', index=['A', 'B'], columns='C', aggfunc='sum', fill_value=0)
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6

>>> pdf.pivot_table(values = ['D'], index =['C'], columns="A", aggfunc={'D':'mean'})
         D
A      bar       foo
C
large  5.5  2.000000
small  5.5  2.333333

We can address in the separate PRs, though. Up to you, @garawalid.

Thanks!

The next example aggregates on multiple values.

>>> table = df.pivot_table(index=['C'], columns="A", values=['B', 'E'],
... aggfunc={'B': 'mean', 'E': 'sum'})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use D instead of B since B is a string column? it won't calculate anything.

BTW, pandas raises an error for the case:

>>> pdf.pivot_table(index=['C'], columns="A", values=['B', 'E'], aggfunc={'B': 'mean', 'E': 'sum'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/frame.py", line 6067, in pivot_table
    observed=observed,
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 96, in pivot_table
    agged = grouped.agg(aggfunc)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 1455, in aggregate
    return super().aggregate(arg, *args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 229, in aggregate
    result, how = self._aggregate(func, _level=_level, *args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 506, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 456, in _agg
    result[fname] = func(fname, agg_how)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/base.py", line 440, in _agg_1dim
    return colg.aggregate(how, _level=(_level or 0) + 1)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 845, in aggregate
    return getattr(self, func_or_funcs)(*args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1205, in mean
    "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6_pd0.25/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 888, in _cython_agg_general
    raise DataError("No numeric types to aggregate")
pandas.core.base.DataError: No numeric types to aggregate

whereas:

>>> pdf.pivot_table(index=['C'], columns="A", values=['D', 'E'], aggfunc={'D': 'mean', 'E': 'sum'})
         D             E
A      bar       foo bar foo
C
large  5.5  2.000000  15   9
small  5.5  2.333333  17  13

Should we follow the behavior and raise an error? cc @HyukjinKwon

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, +1 to match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin Nice catch!
I agree we should raise the same error!

@garawalid garawalid force-pushed the pivot_table_multiindex branch from 5c7784a to c13ee18 Compare August 18, 2019 21:56
@garawalid
Copy link
Contributor Author

@ueshin Sure!
I added column name index and I skipped the docstring test because it fails.

@HyukjinKwon
Copy link
Member

To me, it seems fine in general but @ueshin has worked on indexing stuff more than I do.. so I will leave it to him.

@ueshin
Copy link
Collaborator

ueshin commented Aug 19, 2019

The reason why the doctests fail is you don't set the column index names.
Now that we always use column_index for even single index columns, and we can set the column index names for single index columns as well.
Could you try to set the column index names, or revert the changes for doctests not to fail and address it in a separate PR?

@garawalid garawalid force-pushed the pivot_table_multiindex branch from 9d89391 to 7372652 Compare August 23, 2019 20:28
@softagram-bot
Copy link

Softagram Impact Report for pull/635 (head commit: ec7dcae)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Give feedback on this report to [email protected]

@garawalid
Copy link
Contributor Author

@ueshin
Now pivot_table supports column index names (#636). Also, I updated the doctest of pivot.
Would you mind reviewing the PR?

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ueshin
Copy link
Collaborator

ueshin commented Aug 26, 2019

Thanks! merging.

@ueshin ueshin merged commit b7fc773 into databricks:master Aug 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multi-level columns in pivot_table
5 participants