-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate passing pd.MultiIndex implicitly #8140
base: main
Are you sure you want to change the base?
Conversation
So that it is caught in more cases.
I've been out of the loop of discussions recently (and less recently...). To the extent this isn't firmly decided — is this necessary? Is there a downside to having a good default when pandas objects are passed in? Is there significant ambiguity on what the result should be? What do we recommend for converting from pandas-object-with-multiindex to dataset/dataarray? |
No worries! There's a more context in #6293 (comment) and in #6392 (comment).
The main source of ambiguity is the extraction of each multi-index level as a coordinate and the possible conflict with the other coordinates. More generally, maintaining the special cases for pandas multi-index has been a big hassle ever since support for it was added in Xarray. I share a lot of responsibility since I mainly contributed to adding that support :-). There has been numerous subtle bugs and it really makes the internal logic more complicated than it should in many places of the Xarray code base. Removing all those special cases will be a big relief! I think that a good default behavior is to treat the pandas objects passed as data or coordinate variables like any other duck array. If we want a more specific behavior leveraging the index contained in those objects, the recommended way is to convert them using the explicit conversion methods provided by Xarray, e.g.,
(note: for the two latter we might want to add an option to skip expanding the multi-index so that we don't need to re-stack the dimensions) |
Add suggestions for the cases where the pandas multi-index is passed via a pandas dataframe or series.
Can you describe what change you'd like to see? |
@dcherian ideally The goal is to avoid calling Lines 1573 to 1587 in e2b6f34
There are actually more general issues:
|
Thanks @benbovy !
I totally agree with not having native MultiIndex support within a DataArray / Dataset. I'm wondering whether we can still do something reasonable when a MultiIndex is passed in, since that's quite common IME, and it's common with folks who want to do something quickly, possibly are less experienced xarray users — and so the costs of explicit conversions might have the largest impact.
OK great, I'm less familiar with what this would be like — would
To the extent (FYI my guess is that often we don't want to |
Hmm even with the most reasonable option, extracting one or more level coordinates from a MultiIndex passed as a single variable feels too magical and is hardly predictable, IMHO. That's not the kind of a behavior one usually expects for generic mapping types. What if the MultiIndex is wrapped in another object, e.g., a pandas.Series, xarray.Variable, xarray.DataArray? What would be the most reasonable behavior for those cases? Here are a few examples: midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
# extracts the multi-index levels as coordinates with dimension "x"
xr.Dataset({"x": midx})
xr.Dataset(coords={"x": midx})
xr.Dataset(coords={"x": xr.Variable("x", midx)})
xr.Dataset({"x": xr.DataArray(midx, dims="x")})
# creates only one dimension coordinate "x" with tuple values
xr.Dataset({"x": xr.DataArray(xr.Variable("x", midx))})
# creates one dimension coordinate "x" with tuple values
# and two indexed coordinates "one", "two" sharing the same index
xr.Dataset({"x": xr.DataArray(xr.IndexVariable("x", midx))})
# extracts the multi-index levels as coordinates with dimension "dim_0"
xr.Dataset({"x": pd.Series(range(4), index=midx)})
# creates a dimension coordinate "x" with values [0, 1, 2, 3]
xr.Dataset(coords={"x": pd.Series(range(4), index=midx)})
xr.Dataset({"x": ("x", pd.Series(range(4), index=midx))}) I doubt that all these results would have been accurately predicted by even experienced xarray users (the nested DataArray / IndexVariable example is certainly a bug). Another question: how common using pandas MultiIndex will it be compared to other Xarray indexes that will be available in the future? To which point is it justified treating PandasMultiIndex so differently than any other Xarray multi-coordinate index?
I'm afraid it is more complicated than that. |
Those are great examples!
OK. FWIW, extracting coords is what I was thinking... 😁
My mental model of this user is that they don't so much care about the I do worry that if we say "oh you want to pass in a dataframe with a multiindex, now you have to make a bunch of choices on how that should happen", that it won't be friendly. (I'm by no means claiming this is every user; I'm loading on my own experience working with folks who use both pandas & xarray) For example, this is very sufficient — pass in a df = pd.DataFrame(dict(a=range(7,11)), index=midx)
df
Out[32]:
a
one two
a 0 7
1 8
b 0 9
1 10 ...and then we can use xr.Dataset(df).sel(one='a', two=0)
Out[37]:
<xarray.Dataset>
Dimensions: ()
Coordinates:
dim_0 object ('a', 0)
one <U1 'a'
two int64 0
Data variables:
a int64 7 It's not perfect — we have this You know this 10x better than I do, so I really don't mean to do a drive-by and slow anything down. I do wonder whether there's some synthesis of the two approaches — we make things robust once they're in the xarray data model, while remaining generous about accepting inputs. |
Yes I guess more generally it all depends on whether we see an Xarray Dataset as a kind of multi-dimensional dataframe or as a mapping of n-dimensional arrays with labels. While both point of views are valid, they are hard to reconcile through the same API. Trying to accommodate it too generously (or even with the barest amount of generosity) may reach a point where it is more harmful than beneficial for the two dataframe vs. array point of views (actually, I think we've already reached this point). After working on the index refactor, my point of view shifted more towards n-d arrays (so I'm biased!). Unlike a dataframe, the concept of an array rarely encapsulates an index. Now that indexes are 1st class members of the Xarray data model, it makes better sense IMO to handle them (and dataframe objects) through an explicit API rather than trying to continue mixing them with arrays in the same API function or method arguments. That said, I totally agree that we should never make Xarray unfriendly for you and other users using both Pandas & Xarray! We should continue to offer Premium™ builtin support, notably by keeping default PandasIndex objects for dimension coordinates and via API like If we require to pass (pandas) index, series or dataframe objects via explicit conversion methods, we should indeed try to minimize the friction as much as possible. But I think that we are not far from that goal. Taking your example xr.Dataset(df).sel(one='a', two=0) Doing instead xr.Dataset.from_dataframe(df).sel(one='a', two=0) doesn't look like adding a lot of friction to me (note: the latter dataset doesn't have any
I also agree with this. So if we choose to deprecate the current default behavior, we should consider a long deprecation cycle and make it clear what is the alternative to get the desired behavior. |
Thank you for the very thoughtful responses. I actually think we're quite close in how we're thinking about it. I like your distinction of "Xarray Dataset as a kind of multi-dimensional dataframe or as a mapping of n-dimensional arrays with labels.", and I tend towards the latter too, even if it's nice to occasionally orient around the former.
For me the main issues here are:
One note: I'm hesitant to push too hard here given how much work and thought has gone into it, and how absent I've been in the past year. So please forgive the continued questions if they feel like an imposition. I'm persevering because I do think it's important, and I do think there are a large number of users who may be more casual and so less represented here. I found xarray back in 2016 because of pandas dissatisfaction, so I'm keen to keep that immigration channel open for folks... |
This may take some time. I opened #8162 to track it |
@max-sixty your questions and thoughts are very much appreciated, please continue to do it! While there seems to me that there is a broad agreement about deprecating special multi-index behavior in general, there hasn't been much discussion about it especially about all the possible impact that this would have.
Do you think it would be a reasonable option adding a
I think that if users set
Not yet, but once it is clarified we should document it somewhere! I actually haven't thought much about dataframe objects passed directly to xr.Dataset({k: np.asarray(v) for k, v in df.items()})
# <xarray.Dataset>
# Dimensions: (a: 4)
# Coordinates:
# * a (a) int64 7 8 9 10
# Data variables:
# *empty* Now, that's not super nice to have as many dimensions as they are columns. Alternatively, we could have some special case for a dataframe but not trying to do too much (i.e., not trying to extract and convert the index). For example: xr.Dataset({k: ("dim_0", np.asarray(v)) for k, v in df.reset_index().items()})
# <xarray.Dataset>
# Dimensions: (dim_0: 4)
# Dimensions without coordinates: dim_0
# Data variables:
# one (dim_0) object 'a' 'a' 'b' 'b'
# two (dim_0) int64 0 1 0 1
# a (dim_0)) int64 7 8 9 10 What do you think? |
The current behaviour of To me, it seems sensible that |
Agreed. I guess that's because it has been there before any multi-index support in Xarray? I'm +1 for changing this behavior. A smooth transition could be using the |
Can we get away with a |
Yes we certainly can! We can also have both and keep |
Excellent, this is sounding good!
This will betray how long I've been out for, but was there any progress on allowing If that is possible, then this proposal would be ideal — basically a much better MultiIndex. If that's not, then it's awkward, because it's no longer possible to +1 to not unstacking automatically The (thanks for suggestions on the default |
Yes it is now supported since v2022.06.0.
Technically a |
To confirm the question (sorry if I'm being unclear), If we do this:
...then the result of: midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
df = pd.DataFrame(dict(a=range(7,11)), index=midx)
ds = xr.Dataset(df) # (or `.from_dataframe` with a dim arg)
ds ...would change to something like: <xarray.Dataset>
Dimensions: (dim_0: 4)
Coordinates:
- * dim_0 (dim_0) object MultiIndex
* one (dim_0) object 'a' 'a' 'b' 'b'
* two (dim_0) int64 0 1 0 1
Data variables:
a (dim_0) int64 7 8 9 10 ...but then we'd still be have some way of calling I know we can currently do Or does 'the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"' mean that we would still have a MultiIndex, and the change is smaller than I was envisaging — instead it's just that it needs to be specified with a |
Yes exactly (sorry that was a bit confusing). What I wanted to say is: Those 1-d coordinates currently include both the dimension coordinate "dim_0" and the level coordinates "a", "b". If we consider #8143, eventually they will only include the level coordinates. In both cases, the level coordinates have a PandasMultiIndex so |
I see — great — I was conflating this & #8143 a bit, then. One note as I'm looking at some of my existing code which uses xarray — the current behavior of [ins] In [25]: df.index.name = 'foo'
[ins] In [26]: df
Out[26]:
a
one two
a 0 7
1 8
b 0 9
1 10
[ins] In [27]: xr.Dataset(df)
Out[27]:
<xarray.Dataset>
Dimensions: (foo: 4)
Coordinates:
* foo (foo) object MultiIndex
* one (foo) object 'a' 'a' 'b' 'b'
* two (foo) int64 0 1 0 1
Data variables:
a (foo) int64 7 8 9 10 ...so no unstacking. But it does rely on renaming the dim after creation (or, as in this case, using So I think we're nearing consensus. Let me write a few things down as a starter — I imagine this is 80% right so please correct me:
Thank you very much for the discussion @benbovy |
Either that (warning without a dim arg and when the passed df has a MultiIndex) or via another, temporary |
I opened #8140 to continue the discussion about Dataset.from_dataframe. |
Sorry I dropped this a while ago — I was just ramping up and lost it in my inbox. I think we were quite close to consensus, with the The one request I'd have is to be able to call |
Coming back to this a while later — this seems very reasonable indeed. ...and seems consistent with (my) suggestion:
I think we're in broad consensus on the goals. Is that right? :) |
Yes I think so. Well, almost :). Using |
😅
Not my most crucial point, but to the extent that (and maybe To recenter — my big point is that we hopefully don't get this:
...otherwise things seem great (and I ofc don't want to slow down progress with refinements) Thank you very much @benbovy . |
whats-new.rst
This PR should normally raise a warning each time when indexed coordinates are created implicitly from a
pd.MultiIndex
object.I updated the tests to create coordinates explicitly using
Coordinates.from_pandas_multiindex()
.I also refactored some parts where a
pd.MultiIndex
could still be passed and promoted internally, with the exception of:swap_dims()
: it should raise a warning! Right now the warning message is a bit confusing for this case, but instead of adding a special case we should probably deprecate the whole method? As it is suggested as a TODO comment... This method was to circumvent the limitations of dimension coordinates, which isn't needed anymore (rename_dims
and/orset_xindex
is equivalent and less confusing).xr.DataArray(pandas_obj_with_multiindex, dims=...)
: I guess it should raise a warning too?da.stack(z=...).groupby("z")
: it shoudn't raise a warning, but this requires a (heavy?) refactoring of groupby. During building the "grouper" objects,grouper.group1d
orgrouper.unique_coord
may still be built by extracting only the multi-index dimension coordinate. I'd greatly appreciate if anyone familiar with the groupby implementation could help me with this! @dcherian ?