Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow str and list in aggfunc in DataFrameGroupby.agg #828

Merged
merged 13 commits into from
Oct 1, 2019

Conversation

charlesdong1991
Copy link
Contributor

right now, when I look at Groupby, it does not accept str or list, but in pandas, it's allowed. So before implementing named aggregation, i think this is a better thing to deal first.
e.g. in pandas we could have:

Screen Shot 2019-09-24 at 10 58 52 PM

now koalas can also accept this:

Screen Shot 2019-09-24 at 11 00 13 PM

@charlesdong1991
Copy link
Contributor Author

I am not sure if this will pass or not, since group key order is changed as you could see from the picture i attach. Any suggestions on a fix is welcome.

@charlesdong1991
Copy link
Contributor Author

charlesdong1991 commented Sep 27, 2019

emm, even with current setup, animals.groupby('kind').agg({'height': ['min', 'max']}) will have different order in index compared to pandas. Is it mandatory to have the exactly same output with pandas, the result looks correct except the order of rows changes. Probably should reorder the rows then before outputting the result. @ueshin @HyukjinKwon any thoughts?

@ueshin
Copy link
Collaborator

ueshin commented Sep 27, 2019

@charlesdong1991 We don't guarantee the row order without a special reason, e.g., Series.value_counts has a sort argument.

@charlesdong1991
Copy link
Contributor Author

thanks for your comment, i slightly changed the test a bit to fix this order issue that failed tests, and added some docstrings in agg. Feel free to take a look. @ueshin

@charlesdong1991
Copy link
Contributor Author

any follow-up review will be appreciated a lot ^^ @ueshin @HyukjinKwon


else:
group_keyname = [key.name for key in self._groupkeys]
agg_cols = [key for key in self._kdf.columns if key not in group_keyname]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess self._agg_columns should work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one! thanks! @ueshin

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending tests.

@ueshin
Copy link
Collaborator

ueshin commented Oct 1, 2019

@charlesdong1991 oh, one more thing I'd like to ask.
Could you add tests for multi-index columns as well, just in case?

@charlesdong1991
Copy link
Contributor Author

@ueshin sure! Added and it passes tests locally, let's see if it's okay on CI.

@ueshin
Copy link
Collaborator

ueshin commented Oct 1, 2019

@charlesdong1991 Thanks!
LGTM again, pending tests.


for aggfunc in agg_funcs:
sorted_agg_kdf = kdf.groupby('kind').agg(aggfunc)
sorted_agg_pdf = pdf.groupby('kind').agg(aggfunc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need .sort_index()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm, seems it has the same index as pandas when it's numeric. but could add one to ensure

@softagram-bot
Copy link

Softagram Impact Report for pull/828 (head commit: f1e3c6b)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Impact Report explained. Give feedback on this report to [email protected]

@ueshin
Copy link
Collaborator

ueshin commented Oct 1, 2019

The latest failure seems not related to this PR.
Let me retrigger the tests.

@ueshin
Copy link
Collaborator

ueshin commented Oct 1, 2019

Thanks! merging.

@ueshin ueshin merged commit 1462291 into databricks:master Oct 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants