Deprecate passing pd.MultiIndex implicitly #8140

benbovy · 2023-09-03T14:01:18Z

Follow-up Refactor update coordinates to better handle multi-coordinate indexes #8094
Closes refactor broadcast for flexible indexes #6481
User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR should normally raise a warning whenever indexed coordinates are created implicitly from a pd.MultiIndex object.
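For readers unfamiliar with the pattern, here is a minimal sketch of such a deprecation check. It is not the code from this PR: the helper name `check_implicit_multiindex`, the message text and the choice of `FutureWarning` are all assumptions made for illustration.

```python
import warnings

import pandas as pd


def check_implicit_multiindex(name, obj):
    """Hypothetical helper: warn when a bare pd.MultiIndex is passed."""
    if isinstance(obj, pd.MultiIndex):
        warnings.warn(
            f"the pandas.MultiIndex passed as {name!r} will no longer be "
            "promoted implicitly to indexed coordinates; wrap it explicitly, "
            "e.g. with xarray.Coordinates.from_pandas_multiindex()",
            FutureWarning,  # assumed warning category
            stacklevel=2,   # point at the caller rather than at this helper
        )
    return obj


midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_implicit_multiindex("x", midx)          # a multi-index: warns
    check_implicit_multiindex("y", [0, 1, 2, 3])  # a plain list: silent

n_warnings = len(caught)
```

Placing the check deep in the call stack, as the first commit of this PR does, means every public entry point that eventually promotes a multi-index goes through it.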

I updated the tests to create coordinates explicitly using Coordinates.from_pandas_multiindex().

I also refactored some parts where a pd.MultiIndex could still be passed and promoted internally, with the exception of:

swap_dims(): it should raise a warning! Right now the warning message is a bit confusing for this case, but instead of adding a special case we should probably deprecate the whole method? As it is suggested as a TODO comment... This method was to circumvent the limitations of dimension coordinates, which isn't needed anymore (rename_dims and/or set_xindex is equivalent and less confusing).
xr.DataArray(pandas_obj_with_multiindex, dims=...): I guess it should raise a warning too?
da.stack(z=...).groupby("z"): it shouldn't raise a warning, but this requires a (heavy?) refactoring of groupby. During building the "grouper" objects, grouper.group1d or grouper.unique_coord may still be built by extracting only the multi-index dimension coordinate. I'd greatly appreciate it if anyone familiar with the groupby implementation could help me with this! @dcherian ?

max-sixty · 2023-09-07T02:22:08Z

xr.DataArray(pandas_obj_with_multiindex, dims=...): I guess it should raise a warning too?

I've been out of the loop of discussions recently (and less recently...). To the extent this isn't firmly decided — is this necessary? Is there a downside to having a good default when pandas objects are passed in? Is there significant ambiguity on what the result should be? What do we recommend for converting from pandas-object-with-multiindex to dataset/dataarray?

benbovy · 2023-09-07T07:54:40Z

I've been out of the loop of discussions recently (and less recently...)

No worries! There's more context in #6293 (comment) and in #6392 (comment).

Is there a downside to having a good default when pandas objects are passed in? Is there significant ambiguity on what the result should be? What do we recommend for converting from pandas-object-with-multiindex to dataset/dataarray?

The main source of ambiguity is the extraction of each multi-index level as a coordinate and the possible conflict with the other coordinates.

More generally, maintaining the special cases for pandas multi-index has been a big hassle ever since support for it was added in Xarray. I share a lot of responsibility since I mainly contributed to adding that support :-). There have been numerous subtle bugs and it really makes the internal logic more complicated than it should be in many places of the Xarray code base. Removing all those special cases will be a big relief!

I think that a good default behavior is to treat the pandas objects passed as data or coordinate variables like any other duck array. If we want a more specific behavior leveraging the index contained in those objects, the recommended way is to convert them using the explicit conversion methods provided by Xarray, e.g.,

For a pd.MultiIndex, use xr.Coordinates.from_pandas_multiindex(...)
For a pd.Series with a multi-index, use xr.DataArray.from_series(...).stack(...)
For a pd.DataFrame with a multi-index, use xr.Dataset.from_dataframe(...).stack(...)

(note: for the two latter we might want to add an option to skip expanding the multi-index so that we don't need to re-stack the dimensions)
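As a rough illustration of the first bullet, here is a minimal sketch of the explicit conversion. It assumes xarray v2023.08.0 or later, where `xr.Coordinates.from_pandas_multiindex` is available; the data variable `v` and the dimension name `"x"` are made up for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))

# Explicitly wrap the pandas multi-index into Xarray coordinates along
# dimension "x" instead of relying on implicit promotion.
midx_coords = xr.Coordinates.from_pandas_multiindex(midx, "x")

ds = xr.Dataset({"v": ("x", np.arange(4))}, coords=midx_coords)

# Label-based selection on the individual levels keeps working.
selected = ds["v"].sel(one="b", two=0)
```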

dcherian · 2023-09-07T08:24:38Z

During building the "grouper" objects, grouper.group1d or grouper.unique_coord may still be built by extracting only the multi-index dimension coordinate.

Can you describe what change you'd like to see?

benbovy · 2023-09-07T09:07:19Z

@dcherian ideally GroupBy._infer_concat_args() would return an xr.Coordinates object that contains both the coordinate(s) and their (multi-)index to assign to the result (combined) object.

The goal is to avoid calling create_default_index_implicit(coord) below where coord is a pd.MultiIndex or a single IndexVariable wrapping a multi-index. If coord is a Coordinates object, we could do combined = combined.assign_coords(coord) instead.

xarray/xarray/core/groupby.py

Lines 1573 to 1587 in e2b6f34

    
    def _combine(self, applied):
        """Recombine the applied objects like the original."""
        applied_example, applied = peek_at(applied)
        coord, dim, positions = self._infer_concat_args(applied_example)
        combined = concat(applied, dim)
        (grouper,) = self.groupers
        combined = _maybe_reorder(combined, dim, positions, N=grouper.group.size)
        # assign coord when the applied function does not return that coord
        if coord is not None and dim not in applied_example.dims:
            index, index_vars = create_default_index_implicit(coord)
            indexes = {k: index for k in index_vars}
            combined = combined._overwrite_indexes(indexes, index_vars)
        combined = self._maybe_restore_empty_groups(combined)
        combined = self._maybe_unstack(combined)
        return combined
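The assign_coords alternative mentioned before the snippet can be sketched outside of groupby as follows. This is only an illustration using public API, not a patch for `_combine`: the `combined` dataset is a stand-in for the concatenated result, and the data variable `v` is invented for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))

# Stand-in for the concatenated result, which has no coordinates along "z".
combined = xr.Dataset({"v": ("z", np.arange(4))})

# A Coordinates object bundles the level coordinates with their shared index.
coord = xr.Coordinates.from_pandas_multiindex(midx, "z")

# Coordinates and index are assigned together, without implicit promotion.
combined = combined.assign_coords(coord)
```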

There are actually more general issues:

The group parameter of Dataset.groupby being a single variable or variable name, it won't be possible to do groupby on a full pandas multi-index once we drop its dimension coordinate (Deprecate the multi-index dimension coordinate #8143). How can we still support it? Maybe by passing a dimension name to group and checking that there's only one index for that dimension?
How can we support custom, multi-coordinate indexes with groupby? I don't have any practical example in mind, but in theory just passing a single coordinate name as group will invalidate the index. Should we drop the index in the result? Or, as suggested above, pass a dimension name as group and check the index?

max-sixty · 2023-09-07T16:03:52Z

Thanks @benbovy !

More generally, maintaining the special cases for pandas multi-index has been a big hassle ever since support for it was added in Xarray. I share a lot of responsibility since I mainly contributed to adding that support :-). There have been numerous subtle bugs and it really makes the internal logic more complicated than it should be in many places of the Xarray code base. Removing all those special cases will be a big relief!

I totally agree with not having native MultiIndex support within a DataArray / Dataset. I'm wondering whether we can still do something reasonable when a MultiIndex is passed in, since that's quite common IME, and it's common with folks who want to do something quickly, possibly are less experienced xarray users — and so the costs of explicit conversions might have the largest impact.

I think that a good default behavior is to treat the pandas objects passed as data or coordinate variables like any other duck array.

OK great, I'm less familiar with what this would be like — would .sel still work? (Or feel free to point me to issues, thank you for your patience in advance...)

For a pd.MultiIndex, use xr.Coordinates.from_pandas_multiindex(...)

For a pd.Series with a multi-index, use xr.DataArray.from_series(...).stack(...)

For a pd.DataFrame with a multi-index, use xr.Dataset.from_dataframe(...).stack(...)

To the extent xr.Coordinates.from_pandas_multiindex(...) is what's required to get reasonable behavior, we could do that implicitly, and then for something more specific, folks can be explicit.

(FYI my guess is that often we don't want to .stack, since the indexes can be quite sparse)

benbovy · 2023-09-07T20:25:05Z

I'm wondering whether we can still do something reasonable when a MultiIndex is passed in, since that's quite common IME, and it's common with folks who want to do something quickly, possibly are less experienced xarray users — and so the costs of explicit conversions might have the largest impact.

Hmm even with the most reasonable option, extracting one or more level coordinates from a MultiIndex passed as a single variable feels too magical and is hardly predictable, IMHO. That's not the kind of behavior one usually expects for generic mapping types.

What if the MultiIndex is wrapped in another object, e.g., a pandas.Series, xarray.Variable, xarray.DataArray? What would be the most reasonable behavior for those cases? Here are a few examples:

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))

# extracts the multi-index levels as coordinates with dimension "x"
xr.Dataset({"x": midx})
xr.Dataset(coords={"x": midx})
xr.Dataset(coords={"x": xr.Variable("x", midx)})
xr.Dataset({"x": xr.DataArray(midx, dims="x")})

# creates only one dimension coordinate "x" with tuple values
xr.Dataset({"x": xr.DataArray(xr.Variable("x", midx))})

# creates one dimension coordinate "x" with tuple values
# and two indexed coordinates "one", "two" sharing the same index
xr.Dataset({"x": xr.DataArray(xr.IndexVariable("x", midx))})

# extracts the multi-index levels as coordinates with dimension "dim_0"
xr.Dataset({"x": pd.Series(range(4), index=midx)})

# creates a dimension coordinate "x" with values [0, 1, 2, 3] 
xr.Dataset(coords={"x": pd.Series(range(4), index=midx)})
xr.Dataset({"x": ("x", pd.Series(range(4), index=midx))})

I doubt that all these results would have been accurately predicted by even experienced xarray users (the nested DataArray / IndexVariable example is certainly a bug).

Another question: how common will the pandas MultiIndex be compared to the other Xarray indexes that will be available in the future? To what extent is it justified to treat PandasMultiIndex so differently from any other Xarray multi-coordinate index?

To the extent xr.Coordinates.from_pandas_multiindex(...) is what's required to get reasonable behavior

I'm afraid it is more complicated than that.

max-sixty · 2023-09-07T21:26:57Z

Those are great examples!

Hmm even with the most reasonable option, extracting one or more level coordinates from a MultiIndex passed as a single variable feels too magical and is hardly predictable, IMHO. That's not the kind of behavior one usually expects for generic mapping types.

OK. FWIW, extracting coords is what I was thinking... 😁

Another question: how common will the pandas MultiIndex be compared to the other Xarray indexes that will be available in the future? To what extent is it justified to treat PandasMultiIndex so differently from any other Xarray multi-coordinate index?

My mental model of this user is that they don't so much care about the MultiIndex object per se, but MultiIndexes are common in pandas, and they expect some reasonable-looking xarray object when implicitly converting from a pandas object. It remaining a literal MultiIndex within the da isn't important to them.

I do worry that if we say "oh you want to pass in a dataframe with a multiindex, now you have to make a bunch of choices on how that should happen", that it won't be friendly.

(I'm by no means claiming this is every user; I'm loading on my own experience working with folks who use both pandas & xarray)

For example, this is quite sufficient: pass in a DataFrame with a multiindex...

df = pd.DataFrame(dict(a=range(7,11)), index=midx)

df

Out[32]:
          a
one two
a   0     7
    1     8
b   0     9
    1    10

...and then we can use .sel on each of the levels:

xr.Dataset(df).sel(one='a', two=0)

Out[37]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    dim_0    object ('a', 0)
    one      <U1 'a'
    two      int64 0
Data variables:
    a        int64 7

It's not perfect — we have this dim_0 which has tuples since we didn't name the coord, but it does work pretty well.

You know this 10x better than I do, so I really don't mean to do a drive-by and slow anything down. I do wonder whether there's some synthesis of the two approaches — we make things robust once they're in the xarray data model, while remaining generous about accepting inputs.

benbovy · 2023-09-08T08:44:14Z

Yes I guess more generally it all depends on whether we see an Xarray Dataset as a kind of multi-dimensional dataframe or as a mapping of n-dimensional arrays with labels.

While both points of view are valid, they are hard to reconcile through the same API. Trying to accommodate both too generously (or even with the barest amount of generosity) may reach a point where it is more harmful than beneficial for both the dataframe and the array points of view (actually, I think we've already reached this point).

After working on the index refactor, my point of view shifted more towards n-d arrays (so I'm biased!). Unlike a dataframe, the concept of an array rarely encapsulates an index. Now that indexes are 1st class members of the Xarray data model, it makes better sense IMO to handle them (and dataframe objects) through an explicit API rather than trying to continue mixing them with arrays in the same API function or method arguments.

That said, I totally agree that we should never make Xarray unfriendly for you and other users using both Pandas & Xarray! We should continue to offer Premium™ builtin support, notably by keeping default PandasIndex objects for dimension coordinates and via API like .from_dataframe, .from_series, .from_pandas_multiindex, etc.

If we require passing (pandas) index, series or dataframe objects via explicit conversion methods, we should indeed try to minimize the friction as much as possible. But I think that we are not far from that goal. Taking your example

xr.Dataset(df).sel(one='a', two=0)

Doing instead

xr.Dataset.from_dataframe(df).sel(one='a', two=0)

doesn't look like adding a lot of friction to me (note: the latter dataset doesn't have any dim_0 added).
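For completeness, a small self-contained version of this comparison, reusing the midx and df definitions from earlier in the thread. Note that with the behavior discussed here, from_dataframe unstacks the multi-index into separate dimensions; `df.to_xarray()` is the equivalent pandas-side spelling.

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
df = pd.DataFrame(dict(a=range(7, 11)), index=midx)

# Explicit conversion; equivalently: df.to_xarray()
ds = xr.Dataset.from_dataframe(df)

# Selection by level labels works and no "dim_0" dimension is created.
value = ds["a"].sel(one="a", two=0).item()
```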

I do worry that if we say "oh you want to pass in a dataframe with a multiindex, now you have to make a bunch of choices on how that should happen", that it won't be friendly.

I also agree with this. So if we choose to deprecate the current default behavior, we should consider a long deprecation cycle and make it clear what the alternative is to get the desired behavior.

max-sixty · 2023-09-08T20:44:50Z

Thank you for the very thoughtful responses. I actually think we're quite close in how we're thinking about it. I like your distinction of "Xarray Dataset as a kind of multi-dimensional dataframe or as a mapping of n-dimensional arrays with labels.", and I tend towards the latter too, even if it's nice to occasionally orient around the former.

If we require passing (pandas) index, series or dataframe objects via explicit conversion methods, we should indeed try to minimize the friction as much as possible. But I think that we are not far from that goal. Taking your example
xr.Dataset(df).sel(one='a', two=0)
Doing instead
xr.Dataset.from_dataframe(df).sel(one='a', two=0)
doesn't look like adding a lot of friction to me (note: the latter dataset doesn't have any dim_0 added).

For me the main issues here are:

How would someone use .from_pandas_multiindex to convert a df to a ds? I tried a couple of things but couldn't get the correct indexes, and couldn't immediately see an example in the tests (sorry if this is basic / covered elsewhere — please feel very free to say "read X")

xr.Dataset(df.reset_index(drop=True), coords=xr.Coordinates.from_pandas_multiindex(df.index, dim='foo'))
Out[23]:
<xarray.Dataset>
Dimensions:  (dim_0: 4, foo: 4)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
  * foo      (foo) object MultiIndex
  * one      (foo) object 'a' 'a' 'b' 'b'
  * two      (foo) int64 0 1 0 1
Data variables:
    a        (dim_0) int64 7 8 9 10

Is there somewhere I can read about the end-state? I agree that supporting all of pandas' warts is something we can ideally avoid. What would "treat the pandas objects passed as data or coordinate variables like any other duck array" look like? What would the object from the previous bullet be, assuming this were correctly converted? Is there some state that the Dataset should be in, which allows for some notion of sparse data, which we can try and automagically move the dataset closer towards? While we want to generally be robust and explicit, we do have prior art of auto-magic (e.g. combine_by_coords).
Using .from_dataframe unstacks the array, which would obv be quite bad for sparse indexes. For example if a multiindex was used to label a date dimension with a n_days_counter (which we would use a coord for in xarray), then it would expand to an n x n array, with data only on the diagonals.
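The sparse case in the last point can be reproduced with a small sketch. The column names `date` and `n_days` and the values are invented for illustration; the example relies on from_dataframe expanding the multi-index, which is the behavior under discussion here.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Four rows where each date is paired with exactly one counter value.
dates = pd.date_range("2023-01-01", periods=4)
midx = pd.MultiIndex.from_arrays([dates, np.arange(4)], names=("date", "n_days"))
df = pd.DataFrame({"a": [7.0, 8.0, 9.0, 10.0]}, index=midx)

# Unstacking expands the 4 rows to the full date x n_days product.
ds = xr.Dataset.from_dataframe(df)

shape = ds["a"].shape
n_valid = int(ds["a"].notnull().sum())  # only the diagonal holds data
```

For n rows the unstacked array has n x n elements of which only n are populated, so the memory cost grows quadratically with the length of the index.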

One note: I'm hesitant to push too hard here given how much work and thought has gone into it, and how absent I've been in the past year. So please forgive the continued questions if they feel like an imposition. I'm persevering because I do think it's important, and I do think there are a large number of users who may be more casual and so less represented here. I found xarray back in 2016 because of pandas dissatisfaction, so I'm keen to keep that immigration channel open for folks...

dcherian · 2023-09-09T04:51:27Z

GroupBy._infer_concat_args() would return an xr.Coordinates object that contains both the coordinate(s) and their (multi-)index to assign to the result (combined) object.

This may take some time. I opened #8162 to track it

benbovy · 2023-09-09T08:13:19Z

@max-sixty your questions and thoughts are very much appreciated, please keep them coming! While it seems to me that there is broad agreement about deprecating special multi-index behavior in general, there hasn't been much discussion about it, especially about all the possible impacts that this would have.

Using .from_dataframe unstacks the array, which would obv be quite bad for sparse indexes.

Do you think it would be a reasonable option to add a dim=None argument to Dataset.from_dataframe (and DataArray.from_series)?

dim=None (default) corresponds to the current behavior
- single index: a dimension coordinate is created and is named like the index name (or "dim_0" if the index has no name)
- MultiIndex: the dataframe is unstacked and each multi-index level is extracted as a dimension coordinate
dim="x":
- single index: if it has no name a dimension coordinate "x" is created, otherwise an indexed (non-dimension) coordinate is created, is named like the index and has dimension "x"
- MultiIndex: the dataframe is not unstacked and the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"

I think that if users set dim="x" explicitly, it is pretty clear that they want to keep the Dataset as 1-dimensional (so no expansion of a MultiIndex into a tensor product).
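To make the MultiIndex case of the dim="x" proposal concrete, here is a sketch built only from API that exists today (xarray v2023.08.0 or later). The helper `dataset_from_dataframe_stacked` is hypothetical; it is not part of Xarray, and the proposed dim argument itself does not exist at the time of this discussion.

```python
import pandas as pd
import xarray as xr


def dataset_from_dataframe_stacked(df, dim):
    """Hypothetical helper: convert `df` without unstacking its MultiIndex."""
    if not isinstance(df.index, pd.MultiIndex):
        raise TypeError("this sketch only handles a pd.MultiIndex")
    # Level coordinates plus their shared index, all along dimension `dim`.
    coords = xr.Coordinates.from_pandas_multiindex(df.index, dim)
    # Each column becomes a 1-d data variable along the same dimension.
    data_vars = {name: (dim, col.to_numpy()) for name, col in df.items()}
    return xr.Dataset(data_vars, coords=coords)


midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
df = pd.DataFrame(dict(a=range(7, 11)), index=midx)

ds = dataset_from_dataframe_stacked(df, "x")
value = ds["a"].sel(one="b", two=0).item()
```

The result stays 1-dimensional, so a sparse multi-index is not expanded into a tensor product. The single-index branch of the proposal is left out of this sketch.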

Is there somewhere I can read about the end-state?

Not yet, but once it is clarified we should document it somewhere! I actually haven't thought much about dataframe objects passed directly to Dataset.__init__. If we don't try anymore to extract any index, so no special case anymore for pandas.DataFrame, we could naively consider it like any other input passed to Dataset, i.e., as a mapping of arrays. This could look like:

xr.Dataset({k: np.asarray(v) for k, v in df.items()})
# <xarray.Dataset>
# Dimensions:  (a: 4)
# Coordinates:
#   * a        (a) int64 7 8 9 10
# Data variables:
#    *empty*

Now, that's not super nice to have as many dimensions as there are columns.

Alternatively, we could have some special case for a dataframe but not trying to do too much (i.e., not trying to extract and convert the index). For example:

xr.Dataset({k: ("dim_0", np.asarray(v)) for k, v in df.reset_index().items()})
# <xarray.Dataset>
# Dimensions:  (dim_0: 4)
# Dimensions without coordinates: dim_0
# Data variables:
#     one      (dim_0) object 'a' 'a' 'b' 'b'
#     two      (dim_0) int64 0 1 0 1
#     a        (dim_0) int64 7 8 9 10

What do you think?

dcherian · 2023-09-09T11:07:16Z

The current behaviour of Dataset.from_dataframe where it always unstacks feels wrong to me.

To me, it seems sensible that Dataset.from_dataframe(df) automatically creates a Dataset with PandasMultiIndex if df has a MultiIndex. The user can then use that or quite easily unstack to a dense or sparse array.

benbovy · 2023-09-09T11:17:44Z

The current behaviour of Dataset.from_dataframe where it always unstacks feels wrong to me.

Agreed. I guess that's because it has been there before any multi-index support in Xarray? I'm +1 for changing this behavior.

A smooth transition could be using the dim argument as proposed above to turn on the new behavior. Eventually dim=None won't unstack anymore.

dcherian · 2023-09-09T13:08:33Z

Can we get away with an unstack: bool kwarg instead (that is eventually removed) and have the user manually rename as an extra step?

benbovy · 2023-09-09T13:55:28Z

Yes we certainly can!

We can also have both and keep dim afterwards, assuming that a MultiIndex rarely has its .name set (that's why I added a dim argument in Coordinates.from_pandas_multiindex).

max-sixty · 2023-09-09T20:44:54Z

Excellent, this is sounding good!

MultiIndex: the dataframe is not unstacked and the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"

This will betray how long I've been out for, but was there any progress on allowing .sel to work with coords? IIRC there were some plans to allow indexes on coords beyond those named the same as a dimension.

If that is possible, then this proposal would be ideal — basically a much better MultiIndex.

If that's not, then it's awkward, because it's no longer possible to .sel from that dimension, which seems quite important.

+1 to not unstacking automatically

The dim="x" (rather than unstack=False) I think might be required, because IIUC a MultiIndex doesn't have a .name, only a .names (referring to the level names), so a bool doesn't give the information for which dimension it should be on.

(thanks for suggestions on the default __init__ behavior, let me think; possibly it somewhat depends on whether we can still have "multi-level" indexes that can be accessed with .sel)

benbovy · 2023-09-09T21:40:45Z

was there any progress on allowing .sel to work with coords? IIRC there were some plans to allow indexes on coords beyond those named the same as a dimension.

Yes it is now supported since v2022.06.0.

a MultiIndex doesn't have a .name, only a .names

Technically a pd.MultiIndex has a .name property (inherited from pd.Index) but in practice it is mostly ignored I think. In xarray.core.indexes.PandasMultiIndex we keep it in sync with the dimension name of the level coordinates, but I doubt that this is really useful (it might become useful for round-trip conversion between xarray.Dataset and pandas.DataFrame if we don't unstack anymore).

max-sixty · 2023-09-09T22:28:33Z

was there any progress on allowing .sel to work with coords? IIRC there were some plans to allow indexes on coords beyond those named the same as a dimension.

Yes it is now supported since v2022.06.0.

To confirm the question (sorry if I'm being unclear), If we do this:

MultiIndex: the dataframe is not unstacked and the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"

...then the result of:

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("one", "two"))
df = pd.DataFrame(dict(a=range(7,11)), index=midx)
ds = xr.Dataset(df)  # (or `.from_dataframe` with a dim arg)
ds

...would change to something like:

<xarray.Dataset>
Dimensions:  (dim_0: 4)
Coordinates:
-  * dim_0    (dim_0) object MultiIndex
  * one      (dim_0) object 'a' 'a' 'b' 'b'
  * two      (dim_0) int64 0 1 0 1
Data variables:
    a        (dim_0) int64 7 8 9 10

...but then we'd still have some way of calling ds.sel(one='a').

I know we can currently do ds.sel(one='a') — but IIUC that's only because the MultiIndex is there.

Or does 'the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"' mean that we would still have a MultiIndex, and the change is smaller than I was envisaging — instead it's just that it needs to be specified with a dim when it's passed?

benbovy · 2023-09-09T23:02:12Z

Or does 'the MultiIndex is added to the Dataset with all its levels as 1D coordinates of dimension "x"' mean that we would still have a MultiIndex, and the change is smaller than I was envisaging

Yes exactly (sorry that was a bit confusing). What I wanted to say is: xarray.Dataset.from_dataframe(df) with no unstack would preserve the MultiIndex of df, i.e., wrap it in a xarray.core.indexes.PandasMultiIndex, create 1-d coordinates from it and then put everything in the newly created Dataset.

Those 1-d coordinates currently include both the dimension coordinate "dim_0" and the level coordinates "one", "two". If we consider #8143, eventually they will only include the level coordinates. In both cases, the level coordinates have a PandasMultiIndex so ds.sel(one='a') is supported. The latter case is possible because Xarray now allows setting an index for any set of arbitrary coordinate(s).
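The last point, an index on a coordinate that is not a dimension coordinate, can be illustrated with a single-coordinate sketch. It assumes a version of xarray that provides `Dataset.set_xindex`; the coordinate name `label` and the values are invented.

```python
import xarray as xr

ds = xr.Dataset(
    {"v": ("x", [10, 20, 30])},
    coords={"label": ("x", ["a", "b", "c"])},
)

# "label" is a non-dimension coordinate, so it has no index by default.
has_index_before = "label" in ds.xindexes

# Attach a default (pandas-backed) index to it explicitly.
indexed = ds.set_xindex("label")

# Label-based selection now works although "label" is not a dimension name.
value = indexed["v"].sel(label="b").item()
```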

max-sixty · 2023-09-10T00:49:25Z

I see — great — I was conflating this & #8143 a bit, then.

One note as I'm looking at some of my existing code which uses xarray — the current behavior of xr.Dataset(df) is fairly sane; it's what I & folks I work with use a lot:

[ins] In [25]: df.index.name = 'foo'

[ins] In [26]: df
Out[26]:
          a
one two
a   0     7
    1     8
b   0     9
    1    10

[ins] In [27]: xr.Dataset(df)
Out[27]:
<xarray.Dataset>
Dimensions:  (foo: 4)
Coordinates:
  * foo      (foo) object MultiIndex
  * one      (foo) object 'a' 'a' 'b' 'b'
  * two      (foo) int64 0 1 0 1
Data variables:
    a        (foo) int64 7 8 9 10

...so no unstacking. But it does rely on renaming the dim after creation (or, as in this case, using the .name property of a multiindex, which I hadn't even known was a thing, thanks for the pointer above).

So I think we're nearing consensus. Let me write a few things down as a starter — I imagine this is 80% right so please correct me:

We'll try to move away from .unstack-ing in .from_dataframe
We'll have a deprecation warning for .from_dataframe without a dim arg
The dim arg will be used as the name for the "index" dimension (the columns are data vars)
The dim arg will cause it to not unstack?
And then the direction of Deprecate the multi-index dimension coordinate #8143 can mean we can get the level coords without the parent name

Thank you very much for the discussion @benbovy

benbovy · 2023-09-10T07:02:59Z

We'll have a deprecation warning for .from_dataframe without a dim arg
The dim arg will cause it to not unstack?

Either that (warning without a dim arg and when the passed df has a MultiIndex) or via another, temporary unstack argument as @dcherian suggests. The latter is clearer but the advantage of temporarily controlling unstack via dim is that we won't need to later introduce any breaking change in the API.

benbovy · 2023-09-10T15:55:55Z

I opened #8166 to continue the discussion about Dataset.from_dataframe.

max-sixty · 2023-11-15T20:15:00Z

Sorry I dropped this a while ago — I was just ramping up and lost it in my inbox.

I think we were quite close to consensus, with the unstack kwarg. Was there even anything else to cover, or was this just waiting on me to test it out?

The one request I'd have is to be able to call xr.Dataset(df), where df has a multiindex, and have that work as it always has. That has had very reasonable behavior — it doesn't unstack. Recent Xarray code prints a deprecation warning — I think it would be quite unfriendly to force folks to instead take apart the dataframe, extract the multiindex, run through xr.Coordinates.from_pandas_multiindex, and then pass it all into the constructor....

max-sixty · 2024-10-19T19:13:18Z

xarray.Dataset.from_dataframe(df) with no unstack would preserve the MultiIndex of df, i.e., wrap it in a xarray.core.indexes.PandasMultiIndex, create 1-d coordinates from it and then put everything in the newly created Dataset.

Those 1-d coordinates currently include both the dimension coordinate "dim_0" and the level coordinates "one", "two". If we consider #8143, eventually they will only include the level coordinates. In both cases, the level coordinates have a PandasMultiIndex so ds.sel(one='a') is supported. The latter case is possible because Xarray now allows setting an index for any set of arbitrary coordinate(s).

Coming back to this a while later — this seems very reasonable indeed.

...and seems consistent with (my) suggestion:

be able to call xr.Dataset(df), where df has a multiindex, and ~~have that work as it always has~~ [edit: have that work without unstacking, even if the exact behavior of the multiindex changes to multiple indexes]. The existing behavior is very reasonable — it doesn't unstack. Recent Xarray code prints a deprecation warning — I think it would be quite unfriendly to force folks to instead take apart the dataframe, extract the multiindex, run through xr.Coordinates.from_pandas_multiindex, and then pass it all into the constructor....

I think we're in broad consensus on the goals. Is that right? :)

benbovy · 2024-10-23T10:09:06Z

I think we're in broad consensus on the goals. Is that right? :)

Yes I think so.

Well, almost :). Using xr.Dataset(df) will become more difficult to support, especially if we deprecate positional arguments (#8959 #8979). This may be another reason to encourage explicit construction via xr.Dataset.from_dataframe(df) (choosing between unstacking and preserving the multi-index, #8170) or conversion via df.to_xarray().

max-sixty · 2024-10-23T18:29:31Z

Well, almost :)

😅

This may be another reason to encourage explicit construction xr.Dataset.from_dataframe(df)

Not my most crucial point, but to the extent that xr.Dataset.from_dataframe(df) evaluates, it seems reasonable to me that xr.Dataset(df) does the same thing. Regarding positional arguments: if we make the first positional arg vars, that seems consistent.

(and maybe .from_dataframe has some options which xr.Dataset(df) doesn't, that seems fine too)

To recenter — my big point is that we hopefully don't get this:

I think it would be quite unfriendly to force folks to instead take apart the dataframe, extract the multiindex, run through xr.Coordinates.from_pandas_multiindex, and then pass it all into the constructor....

...otherwise things seem great (and I ofc don't want to slow down progress with refinements)

Thank you very much @benbovy .

benbovy added 9 commits September 1, 2023 17:44

move implicit mindex warning deeper in the stack

69c51a8

So that it is caught in more cases.

wip: 1st pass silencing warnings

7889af9

refactor broadcast (full multi-index support)

60bbbe3

refactor polyfit (full multi-index support)

00ecc27

refactor concat

36b8531

wip: 2nd pass silencing warnings

99b2ada

wip: 3rd pass silencing warnings

bf4a184

4th pass silencing warnings

514e1dd

remove temp commented raise

3786f55

github-actions bot added the topic-indexing label Sep 3, 2023

benbovy mentioned this pull request Sep 4, 2023

Deprecate the multi-index dimension coordinate #8143

Open

2 tasks

benbovy added 2 commits September 7, 2023 10:22

Merge branch 'main' into deprecate-implicit-mindex

3e334b8

improve warning message

72ad345

Add suggestions for the cases where the pandas multi-index is passed via a pandas dataframe or series.

update what's new

ef7dae0

dcherian mentioned this pull request Sep 9, 2023

Update group by multi index #8162

Open

benbovy mentioned this pull request Sep 10, 2023

Dataset.from_dataframe: deprecate expanding the multi-index #8166

Open

This was referenced Mar 29, 2024

to_base_variable: coerce multiindex data to numpy array #8888

Open

Pass variable name to encode_zarr_variable #8809

Closed

rename_vars followed by swap_dims and merge causes swapped dim to reappear #8646

Open

benbovy mentioned this pull request Apr 5, 2024

Refactor swap dims #8911

Draft

4 tasks
