Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #3141 add cumsum to DatasetGroupBy #3417

Closed
wants to merge 1 commit into from

Conversation

VladSkripniuk
Copy link
Contributor

expected = xr.Dataset(
{
"foo": (("x",), [7, 10, 1, 2, 1, 2, 3]),
"group_id": (("x",), [0, 0, 1, 1, 2, 2, 2]),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why group_id becomes a data variable rather than a coord?

Here's the equivalent with sum:

In [6]: ds.groupby("group_id").sum()
Out[6]:
<xarray.Dataset>
Dimensions:   (group_id: 3)
Coordinates:
  * group_id  (group_id) int64 0 1 2
Data variables:
    foo       (group_id) int64 10 2 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group_id is removed from coords in this line

coords.discard(name)

If we comment this line and run our example again, group_id appears in coords after cumsum. I need some help to understand what this line does.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could look more.

Does anyone who have more experience with this code know what's happening?

@pgierz
Copy link

pgierz commented Apr 26, 2022

Any progress on this one? I would like to use this functionality. I have a workaround for right now, but it would be nice to do it "correctly"

@max-sixty
Copy link
Collaborator

I'm very sorry this didn't get attention — it was a good PR and it's bad of us to drop something from a first-time contributor.

It was surprising that it promoted the group_id variable to a data_var from a coord — I think that's probably a mistake. So if we could resolve that — or show that it would have other adverse impacts — we could merge this.

The good news is that we simplified how methods are supplied to the GroupBy objects — the bad news is that it requires a small rewrite of the code now; though not the tests, which are the majority of this code.

@Illviljan
Copy link
Contributor

Aren't we soon going to let flox handle this once #5734 is merged? numpy_groupies supports cumsum at least so it might be possible to support it in flox as well.

@max-sixty
Copy link
Collaborator

Aren't we soon going to let flox handle this once #5734 is merged? numpy_groupies supports cumsum at least so it might be possible to support it in flox as well.

Ah, great, if flox supports cumsum then that is indeed ideal.

@dcherian
Copy link
Contributor

It doesn't at the moment. Writing the dask version will be fun :).

We can always add groupby().cumsum and use the "old" codepath like we do for median.

@dcherian
Copy link
Contributor

dcherian commented Apr 27, 2022

I opened #6525 preserving commit authorship. Does that look right to everyone? I could force-push here but was afraid of losing the author info.

That PR doesn't actually work, it's not clear to me why.

EDIT: OK it works now, but we have the same error about losing an index

@max-sixty
Copy link
Collaborator

Thanks a lot @dcherian !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

calculating cumsums on a groupby object
5 participants