Error when adding a DataArray to an existing Dataset with a MultiIndex #7921

Closed
dalonsoa opened this issue Jun 15, 2023 · 10 comments · Fixed by #8094


@dalonsoa

What is your issue?

This is a mixture of question, (potential) bug and general issue, so feel free to label it accordingly.

Here is my question: what is the recommended approach to add an xr.DataArray to an existing xr.Dataset with a MultiIndex?

To give some more context, I have an xarray.Dataset called market with several variables and coordinates, one of which, timeslice, is a MultiIndex. This is what it looks like:

<xarray.Dataset>
Dimensions:       (region: 1, commodity: 6, timeslice: 6, year: 8)
Coordinates:
  * region        (region) object 'R1'
  * commodity     (commodity) object 'electricity' 'gas' ... 'CO2f' 'wind'
    units_prices  (commodity) object 'MUS$2010/GWh' ... 'MUS$2010/kt'
  * timeslice     (timeslice) object MultiIndex
  * month         (timeslice) object 'all-year' 'all-year' ... 'all-year'
  * day           (timeslice) object 'all-week' 'all-week' ... 'all-week'
  * hour          (timeslice) object 'night' 'morning' ... 'late-peak' 'evening'
  * year          (year) int64 2020 2025 2030 2035 2040 2045 2050 2055
Data variables:
    prices        (commodity, region, year, timeslice) float64 0.0702 ... 0.0
    exports       (commodity, region, year, timeslice) float64 0.0 0.0 ... 0.0
    imports       (timeslice, commodity, region, year) float64 0.0 0.0 ... 0.0
    static_trade  (timeslice, commodity, region, year) float64 0.0 0.0 ... 0.0

Now, I want to add another variable, called supply, identical to exports but filled with zeros. In code that worked with xarray==2022.3.0 and pandas==1.4.4, I was simply doing:

market["supply"] = xr.zeros_like(market.exports)

And it worked totally fine. With the newest versions, xarray==2023.5.0 and pandas==2.0.2 under Python 3.10, this fails with:

*** DeprecationWarning: Deleting a single level of a MultiIndex is deprecated. Previously, this deleted all levels of a MultiIndex. Please also drop the following variables: {'timeslice'} to avoid an error in the future.

I've tried variants like:

market["supply"] = market.exports * 0
market = market.assign(supply=xr.zeros_like(market.exports))

both failing with the same message.

I totally fail to see how this process is deleting a level of the MultiIndex - or modifying the indexes in any form. Probably it is because I don't understand the inner workings of xarray indexes.

The following works totally fine, but having to create a brand-new Dataset from scratch manually is rather convoluted, and it is problematic if you really want to modify the Dataset in place (assign has the same problem).

vars = dict(market.data_vars)
vars["supply"] = xr.zeros_like(market.exports)
market = xr.Dataset(vars)

Resulting in:

<xarray.Dataset>
Dimensions:       (region: 1, commodity: 6, timeslice: 6, year: 8)
Coordinates:
  * region        (region) object 'R1'
  * commodity     (commodity) object 'electricity' 'gas' ... 'CO2f' 'wind'
    units_prices  (commodity) object 'MUS$2010/GWh' ... 'MUS$2010/kt'
  * timeslice     (timeslice) object MultiIndex
  * month         (timeslice) object 'all-year' 'all-year' ... 'all-year'
  * day           (timeslice) object 'all-week' 'all-week' ... 'all-week'
  * hour          (timeslice) object 'night' 'morning' ... 'late-peak' 'evening'
  * year          (year) int64 2020 2025 2030 2035 2040 2045 2050 2055
Data variables:
    prices        (commodity, region, year, timeslice) float64 0.0702 ... 0.0
    exports       (commodity, region, year, timeslice) float64 0.0 0.0 ... 0.0
    imports       (timeslice, commodity, region, year) float64 0.0 0.0 ... 0.0
    static_trade  (timeslice, commodity, region, year) float64 0.0 0.0 ... 0.0
    supply        (commodity, region, year, timeslice) float64 0.0 0.0 ... 0.0
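
A minimal sketch of the workaround above wrapped in a helper, in case it is useful while direct assignment fails. The helper name `with_zeros_like` and the small `market` stand-in are hypothetical and not part of the original code. One caveat of rebuilding through `xr.Dataset(...)`: dataset-level attributes are not carried over unless they are passed explicitly (done here with `attrs=`), and coordinates that are not attached to any data variable would be dropped.

```python
import numpy as np
import pandas as pd
import xarray as xr


def with_zeros_like(ds, template, name):
    """Return a new Dataset with `name` added as zeros shaped like `ds[template]`.

    Hypothetical helper: it rebuilds the Dataset from its data variables,
    as in the workaround above, rather than assigning in place.
    """
    variables = dict(ds.data_vars)
    variables[name] = xr.zeros_like(ds[template])
    # Dataset-level attrs are not copied by the constructor, so pass them on.
    return xr.Dataset(variables, attrs=ds.attrs)


# Small stand-in for `market`: a stacked "timeslice" MultiIndex plus "year".
exports = xr.DataArray(
    np.ones((1, 1, 3, 2)),
    coords=[
        ("month", ["all-year"]),
        ("day", ["all-week"]),
        ("hour", ["night", "morning", "evening"]),
        ("year", [2020, 2025]),
    ],
).stack(timeslice=("month", "day", "hour"))

market = xr.Dataset({"exports": exports}, attrs={"source": "example"})
market = with_zeros_like(market, "exports", "supply")
print(market)
# Check that the "timeslice" index is still a pandas MultiIndex.
print(isinstance(market.indexes["timeslice"], pd.MultiIndex))
```

Because the helper returns a new object, any other reference to the old Dataset will not see the added variable.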

Many thanks for your support!

@dalonsoa dalonsoa added the needs triage Issue that has not been reviewed by xarray team member label Jun 15, 2023
@dcherian dcherian removed the needs triage Issue that has not been reviewed by xarray team member label Jun 15, 2023
@dcherian
Contributor

Thanks for taking the time to file a bug report!

> I totally fail to see how this process is deleting a level of the MultiIndex - or modifying the indexes in any form. Probably it is because I don't understand the inner workings of xarray indexes.

I agree this is confusing and seems like it should work.

@kmuehlbauer
Contributor

@dalonsoa It would be great if you could provide an MCVE here. It makes it much easier to debug for interested parties.

@dalonsoa
Author

Hi @kmuehlbauer, many thanks for asking for an MCVE because, to be honest, I'm not able to reproduce the error with the following code, which I think represents the situation we have at hand. It runs from beginning to end without any problem, using the same versions of xarray and pandas:

import xarray as xr
import numpy as np

da1 = xr.DataArray(
    np.arange(48).reshape(2, 2, 3, 4),
    coords=[
        ("v", [10, 20]),
        ("x", ["a", "b"]),
        ("y", [0, 1, 2]),
        ("z", ["alpha", "beta", "gamma", "delta"]),
    ],
)

da1 = da1.stack(w=("x", "z", "v"))
da2 = xr.zeros_like(da1.transpose("w", "y"))
da3 = xr.zeros_like(da1)

ds = xr.Dataset({"one": da1, "two": da2, "three": da3})
ds["four"] = xr.zeros_like(ds.one)
print(ds)

I'll investigate why my code is failing and this one is not. Might it be the way the MultiIndex is being created... 🤔?

If anyone is interested, this is the line of the code I'm refactoring that is causing me trouble: https://github.com/SGIModel/MUSE_OS/blob/9fb62bc0c3b7adeb9ce89dce9cad4856e1082925/src/muse/examples.py#L193

@kmuehlbauer
Contributor

@dalonsoa Thanks for coming back this fast. I've also no real clue where the problem lies. It might be how the MultiIndex is created, as you are suggesting.

I've had a look at the tests over at your place to get an impression of how things are supposed to work. But there are too many fixtures to quickly adapt an MCVE from that, at least for someone who is not familiar with the code base. Would you be able to distill an MCVE from your test code?

@dalonsoa
Author

Mmm... the code is rather convoluted - trying to simplify it - but I'll try to put something simple together that uses parts of the original code and reproduces the error. Bear with me while I do that.

@dalonsoa
Author

I have not forgotten about this. I've tracked where and how the timeslice MultiIndex is created and written another example that closely matches that (see below), but that one also works...

The problem I have is that the process looks like:

1. timeslice MultiIndex is created using pd.MultiIndex.from_tuples.
2. A lot of stuff happens now, but timeslice remains the same... in principle.
3. The program finally fails when doing market["supply"] = xr.zeros_like(market.exports) as described above.

So I'll keep investigating what's going on in step 2 that makes things break down the line.
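
One way to narrow down step 2 might be to record what kind of index backs timeslice after each intermediate operation and look for the point where the report changes. The sketch below is only an illustration: `describe_index` is a hypothetical helper, and the two arrays are stand-ins rather than data from the real code.

```python
import numpy as np
import pandas as pd
import xarray as xr


def describe_index(obj, dim):
    """Report what kind of index backs `dim` (hypothetical helper)."""
    index = obj.indexes[dim] if dim in obj.indexes else None
    is_multi = isinstance(index, pd.MultiIndex)
    return {
        "has_index": index is not None,
        "is_multiindex": is_multi,
        "level_names": list(index.names) if is_multi else [],
    }


# Stand-in data: one array with a stacked MultiIndex, one with a plain index.
stacked = xr.DataArray(
    np.zeros((1, 1, 3)),
    coords=[
        ("month", ["all-year"]),
        ("day", ["all-week"]),
        ("hour", ["night", "morning", "evening"]),
    ],
).stack(timeslice=("month", "day", "hour"))
plain = xr.DataArray(np.zeros(3), coords=[("timeslice", [0, 1, 2])])

print(describe_index(stacked, "timeslice"))
print(describe_index(plain, "timeslice"))
```

Calling the helper on the Dataset or DataArray after each transformation would show where timeslice stops being reported as a MultiIndex, or loses a level name.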

import pandas as pd
import xarray as xr
import numpy as np

timeslices = {
    "all-year.all-week.night": 1460,
    "all-year.all-week.morning": 1460,
    "all-year.all-week.afternoon": 1460,
    "all-year.all-week.early-peak": 1460,
    "all-year.all-week.late-peak": 1460,
    "all-year.all-week.evening": 1460,
}
level_names = ["month", "day", "hour"]

levels = [tuple(k.split(".")) for k in timeslices.keys()]
values = list(timeslices.values())

indices = pd.MultiIndex.from_tuples(levels, names=level_names)
timeslice = xr.DataArray(values, coords={"timeslice": indices}, dims="timeslice")

da1 = xr.DataArray(
    np.arange(36).reshape(2, 3, 6),
    coords=[
        ("x", ["a", "b"]),
        ("y", [0, 1, 2]),
        timeslice.timeslice,
    ],
)

da2 = xr.zeros_like(da1.transpose("y", "x", ...))
da3 = xr.zeros_like(da1)

ds = xr.Dataset({"one": da1, "two": da2, "three": da3})
ds["four"] = xr.zeros_like(ds.one)
print(ds)

@dalonsoa
Author

For reference, I've narrowed down the problem to this function. The manipulations going on there result in a DataArray with a MultiIndex coordinate that misbehaves. The docstring of that function is quite thorough in case anyone is curious about what it is doing.

@dalonsoa
Author

dalonsoa commented Jul 4, 2023

Ok, while trying to figure out what's wrong with my code above, I'm finding examples that behave oddly or that fail, but for a different reason.

Let's take the last example, but add the MultiIndex by expanding the dimensions of da1 rather than at creation time.

# as above until here
# ...

da1 = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords=[
        ("x", ["a", "b"]),
        ("y", [0, 1, 2]),
    ],
)
da1 = da1.expand_dims(dim={"timeslice": timeslice.timeslice})
print(da1)

I was expecting this to add the MultiIndex coordinate, giving an array similar to the one above. Instead, it produces the following odd-looking coordinate, whose values are plain tuples:

<xarray.DataArray (timeslice: 6, x: 2, y: 3)>
array([[[0, 1, 2],
        [3, 4, 5]],

       ...

       [[0, 1, 2],
        [3, 4, 5]]])
Coordinates:
  * timeslice  (timeslice) object ('all-year', 'all-week', 'night') ... ('all...
  * x          (x) <U1 'a' 'b'
  * y          (y) int64 0 1 2

To get the proper MultiIndex coordinate, I need to assign it explicitly:

# applied to the original 2-D da1, replacing the expand_dims call above
da1 = da1.expand_dims(dim={"timeslice": timeslice.timeslice}).assign_coords(timeslice=timeslice.timeslice)
print(da1)

Resulting in:

<xarray.DataArray (timeslice: 6, x: 2, y: 3)>
array([[[0, 1, 2],
        [3, 4, 5]],

       ...

       [[0, 1, 2],
        [3, 4, 5]]])
Coordinates:
  * timeslice  (timeslice) object MultiIndex
  * x          (x) <U1 'a' 'b'
  * y          (y) int64 0 1 2
  * month      (timeslice) object 'all-year' 'all-year' ... 'all-year'
  * day        (timeslice) object 'all-week' 'all-week' ... 'all-week'
  * hour       (timeslice) object 'night' 'morning' ... 'late-peak' 'evening'

One would think this should be a perfectly fine DataArray, but when I do either of these things:

ds = xr.Dataset({"one": da1})
ds["four"] = xr.zeros_like(ds.one)

or

ds = xr.Dataset({"one": da1, "two": xr.zeros_like(da1)})

Things fail with:

ValueError: cannot re-index or align objects with conflicting indexes found for the following dimensions: 'timeslice' (2 conflicting indexes)
Conflicting indexes may occur when
- they relate to different sets of coordinate and/or dimension names
- they don't have the same type
- they may be used to reindex data along common dimensions

This is not the error I was originally reporting, but it is along the same lines: a perfectly fine-looking array with a MultiIndex coordinate that misbehaves.

I will keep trying to reproduce the original error, but any suggestion as to why this might be happening with an otherwise fine-looking array would be helpful.
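
As a possible alternative to expand_dims followed by assign_coords, one could let xarray's arithmetic broadcasting add the dimension. This is only a sketch under assumptions: the stand-in `timeslice` array is built with `stack` rather than `pd.MultiIndex.from_tuples`, and it has not been checked against the xarray and pandas versions in this report, nor against the Dataset construction that fails above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for `timeslice`: a 1-D array indexed by a stacked MultiIndex.
timeslice = xr.DataArray(
    np.full((1, 1, 3), 1460),
    coords=[
        ("month", ["all-year"]),
        ("day", ["all-week"]),
        ("hour", ["night", "morning", "evening"]),
    ],
).stack(timeslice=("month", "day", "hour"))

da1 = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords=[("x", ["a", "b"]), ("y", [0, 1, 2])],
)

# Multiplying by ones broadcasts da1 along "timeslice" and brings the
# coordinates of `timeslice` along, without changing the values of da1.
expanded = (da1 * xr.ones_like(timeslice)).transpose("timeslice", ...)

print(expanded.dims)
print(isinstance(expanded.indexes["timeslice"], pd.MultiIndex))
```

The transpose only restores the dimension order that expand_dims would give; it can be dropped if the order does not matter.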

@benbovy
Member

benbovy commented Aug 22, 2023

@dalonsoa the examples in your last comment are working now with #8094, i.e.,

ds = xr.Dataset({"one": da1})
ds["four"] = xr.zeros_like(ds.one)

and

ds = xr.Dataset({"one": da1, "two": xr.zeros_like(da1)})

Could you confirm that #8094 also solves your original issue?

@dalonsoa
Author

dalonsoa commented Sep 6, 2023

@benbovy , many thanks for the fix. I was on holiday. I'll check if the original issue was also fixed by this as soon as possible, but it is great that, if nothing else, at least part of it is sorted. I'll keep you posted in case it has not been fixed.
