Multi-index levels as coordinates #947

benbovy · 2016-08-05T11:34:49Z

Implements 2, 4 and 5 in #719.

Demo:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import xarray as xr

In [4]: index = pd.MultiIndex.from_product((list('ab'), range(2)),
   ...:                                    names= ('level_1', 'level_2'))

In [5]: da = xr.DataArray(np.random.rand(4, 4), coords={'x': index},
   ...:                   dims=('x', 'y'), name='test')

In [6]: da
Out[6]: 
<xarray.DataArray 'test' (x: 4, y: 4)>
array([[ 0.15036153,  0.68974802,  0.40082234,  0.94451318],
       [ 0.26732938,  0.49598123,  0.8679231 ,  0.6149102 ],
       [ 0.3313594 ,  0.93857424,  0.73023367,  0.44069622],
       [ 0.81304837,  0.81244159,  0.37274953,  0.86405196]])
Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 0 1 0 1
  * y        (y) int64 0 1 2 3

In [7]: da['level_1']
Out[7]: 
<xarray.DataArray 'level_1' (x: 4)>
array(['a', 'a', 'b', 'b'], dtype=object)
Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 0 1 0 1

In [8]: da.sel(x='a', level_2=1)
Out[8]: 
<xarray.DataArray 'test' (y: 4)>
array([ 0.26732938,  0.49598123,  0.8679231 ,  0.6149102 ])
Coordinates:
    x        object ('a', 1)
  * y        (y) int64 0 1 2 3

In [9]: da.sel(level_2=1)
Out[9]: 
<xarray.DataArray 'test' (level_1: 2, y: 4)>
array([[ 0.26732938,  0.49598123,  0.8679231 ,  0.6149102 ],
       [ 0.81304837,  0.81244159,  0.37274953,  0.86405196]])
Coordinates:
  * level_1  (level_1) object 'a' 'b'
  * y        (y) int64 0 1 2 3

Some notes about the implementation:

I slightly modified Coordinate so that it allows setting different values for the names of the coordinate and its dimension. There is no breaking change.
I also added a Coordinate.get_level_coords method to get independent, single-index coordinates objects from a MultiIndex coordinate.

Remaining issues:

Coordinate.get_level_coords calls pandas.MultiIndex.get_level_values for each level and is itself called each time when indexing and for repr. This can be very costly!! It would be nice to return some kind of lazy index object instead of computing the actual level values.
repr replaces a MultiIndex coordinate with its level coordinates. That can be confusing in some cases (see below). Maybe we can use a different marker than * for level coordinates.

In [6]: [name for name in da.coords]
Out[6]: ['x', 'y']

In [7]: da.coords.keys()
Out[7]: 
KeysView(Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 0 1 0 1
  * y        (y) int64 0 1 2 3)

DataArray.level_1 doesn't return another DataArray object:

In [10]: da.level_1
Out[10]: 
<xarray.Coordinate 'level_1' (x: 4)>
array(['a', 'a', 'b', 'b'], dtype=object)

Maybe we need to test the uniqueness of level names at DataArray or Dataset creation.

Of course still needs proper tests and docs...

jhamman · 2016-08-05T22:07:21Z

xarray/core/dataset.py

+            if name != var.dims[0]:
+                continue
+            level_coords.update(var.to_coord().get_level_coords())
+        return level_coords


Am I missing something here? Wouldn't this also work without the continue statement?

    for name in self._coord_names:
        var = self.variables[name]
        if name == var.dims[0]:
            level_coords.update(var.to_coord().get_level_coords())
    return level_coords

benbovy · 2016-08-06T01:15:02Z

In the example above, DataArray.level_1 now returns a DataArray object, although I haven't found another way than creating a new coordinates.DataArrayLevelCoordinates class for this.

I wasn't happy with the initial implementation of using levels in .sel, it really added too much complexity. I re-implemented it and now I find it much cleaner. However, it is not possible anymore to mix level indexers and non-dict dimension indexers, e.g., da.sel(x='a', level_2=1) doesn't work but da.sel(x={'level_1': 'a'}, level_2=1) does, which is already quite flexible though.

shoyer · 2016-08-06T21:35:24Z

This is very exciting to see!

A few thoughts on implementation:

Instead of always creating a dictionary of level coordinates, I would add an attribute level_names to Coordinate, which would default to None and be set to a tuple of MultiIndex.names if a Coordinate is created from a MultiIndex. This would make checking for multi-index levels very cheap, even if we do need to iterate through every coordinate to find them.

It's much cheaper to call .get_level_values() after slicing a MultiIndex than before, e.g.,

In [12]: idx = pd.MultiIndex.from_product([np.linspace(0, 1, num=500), np.arange(1000)])

In [13]: %timeit idx.get_level_values(0)[:10]
1000 loops, best of 3: 1.28 ms per loop

In [14]: %timeit idx[:10].get_level_values(0)
10000 loops, best of 3: 101 µs per loop

It's even more extreme for larger indexes. If possible, we should use something closer to this approach when formatting coordinates.
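One way to see why slicing first is cheaper: a MultiIndex keeps the unique labels of each level plus one array of integer codes per level, so materializing level values costs work proportional to the number of rows considered. Below is a minimal pure-Python sketch of that layout. The `level_values` helper and the toy data are illustrative assumptions, not pandas internals.

```python
# Toy model of a MultiIndex: unique labels per level + integer codes per row.
levels = [["a", "b"], [0, 1]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]


def level_values(level, rows=slice(None)):
    """Materialize the labels of one level, only for the requested rows."""
    return [levels[level][code] for code in codes[level][rows]]


# Materialize every row, then slice: work proportional to the full length.
full_then_slice = level_values(0)[:2]
# Slice the codes first, then materialize: work proportional to the slice.
slice_then_full = level_values(0, slice(0, 2))

assert full_then_slice == slice_then_full == ["a", "a"]
```

Both orders give the same labels; only the amount of work differs, which is what the timings above reflect.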

However, it is not possible anymore to mix level indexers and non-dict dimension indexers, e.g., da.sel(x='a', level_2=1) doesn't work but da.sel(x={'level_1': 'a'}, level_2=1) does, which is already quite flexible though.

I would actually be happy to disallow both, which might even be easier. It seems like a fine rule to say that you cannot call .sel on both a level and the dimension name at the same time. Actually, if we check uniqueness of level names at Dataset/DataArray creation (which is a good idea!), there is not much need for level indexing with a dictionary at all.
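The rule could be sketched roughly as follows: level indexers are grouped into one dict-valued indexer per dimension, and mixing them with an indexer on the same dimension is rejected. The helper name `group_level_indexers` and the `level_to_dim` mapping are hypothetical and only illustrate the idea; this is not the code in the PR.

```python
def group_level_indexers(indexers, level_to_dim):
    """Turn level indexers into one dict-valued indexer per dimension.

    Hypothetical helper: `level_to_dim` maps each multi-index level name
    to the dimension that carries that multi-index.
    """
    dim_indexers = {}
    level_indexers = {}
    for key, label in indexers.items():
        if key in level_to_dim:
            # Collect all level indexers that belong to the same dimension.
            level_indexers.setdefault(level_to_dim[key], {})[key] = label
        else:
            dim_indexers[key] = label
    for dim, by_level in level_indexers.items():
        if dim in dim_indexers:
            raise ValueError(
                "cannot combine multi-index level indexers with an "
                "indexer for dimension %r" % dim)
        dim_indexers[dim] = by_level
    return dim_indexers


level_to_dim = {"level_1": "x", "level_2": "x"}
grouped = group_level_indexers({"level_1": "a", "level_2": 1, "y": 0}, level_to_dim)
assert grouped == {"y": 0, "x": {"level_1": "a", "level_2": 1}}
```

With this shape, da.sel(level_1='a', level_2=1) reduces to a single dictionary indexer on x, while da.sel(x='a', level_2=1) raises.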

shoyer · 2016-08-06T23:00:17Z

I would suggest putting the logic to create new variables for levels in the private _get_virtual_variable in dataset.py. We already call this function for creating variables on demand in operations like ds['time.month'], so it's already called in all the right places (even in ds.coords, and so on). We could simply extend it to also check for MultiIndex levels and build those variables on demand, too, but only when necessary. If possible, it would be nice if ds['time.day'] works even if time is a multi-index level.

This could get us most of the way there, but there are still a few things to watch out for:

What happens when you write ds.coords['level_1'] = ...? With the current implementation, I think this would create a new variable level_1. In an ideal world, maybe this would replace the MultiIndex level? For now, it is probably better to raise an error and note that the MultiIndex should be modified instead.
Should levels appear in ds.keys() or ds.coords.keys()? If we're printing them in the repr as peers to dimension coordinates, then maybe the answer should be yes? It could be confusing to have both the redundant levels and the multi-index coordinate in there, though. So maybe it's simpler to avoid changes here.
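To make these two points concrete, here is a toy sketch of a coordinates mapping that builds level values only when they are requested, leaves them out of keys(), and refuses assignment to a level name. The class `LevelAwareCoords` and its arguments are assumptions made for illustration and do not mirror xarray's actual classes.

```python
class LevelAwareCoords:
    """Toy coordinates mapping with on-demand multi-index levels."""

    def __init__(self, coords, level_getters):
        self._coords = dict(coords)                # real coordinates
        self._level_getters = dict(level_getters)  # level name -> callable

    def keys(self):
        # Levels are "virtual": they are not listed among the keys.
        return list(self._coords)

    def __getitem__(self, name):
        if name in self._coords:
            return self._coords[name]
        if name in self._level_getters:
            # Level values are computed only when actually requested.
            return self._level_getters[name]()
        raise KeyError(name)

    def __setitem__(self, name, value):
        if name in self._level_getters:
            raise ValueError(
                "%r is a MultiIndex level; modify the MultiIndex instead" % name)
        self._coords[name] = value


x = [("a", 0), ("a", 1), ("b", 0), ("b", 1)]
coords = LevelAwareCoords(
    {"x": x, "y": [0, 1, 2, 3]},
    {"level_1": lambda: [t[0] for t in x],
     "level_2": lambda: [t[1] for t in x]},
)
assert coords["level_1"] == ["a", "a", "b", "b"]
assert coords.keys() == ["x", "y"]
```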

shoyer · 2016-08-06T23:07:25Z

I'm conflicted about how to handle the repr. On the one hand, I like how * indicates indexable variables. On the other hand, it should indeed be clear that these are MultiIndex levels, not dimensions in their own right (especially if they don't appear in ds.coords.keys() and the like). So maybe something closer to what we had before would be better.

Let me try to sketch out some concrete proposals to encourage the peanut gallery to speak up:

Option 1: no special indicator for the MultiIndex:

Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 0 1 0 1
  * y        (y) int64 0 1 2 3

Option 2: both MultiIndex and levels in repr:

Coordinates:
  * x        (x) MultiIndex
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 0 1 0 1
  * y        (y) int64 0 1 2 3

Option 3: both MultiIndex and levels in repr, different symbol for levels:

Coordinates:
  * x        (x) MultiIndex
  - level_1  (x) object 'a' 'a' 'b' 'b'
  - level_2  (x) int64 0 1 0 1
  * y        (y) int64 0 1 2 3

Option 4: both MultiIndex and levels in repr, different symbol for levels, with indentation:

Coordinates:
  * x          (x) MultiIndex
    - level_1  (x) object 'a' 'a' 'b' 'b'
    - level_2  (x) int64 0 1 0 1
  * y          (y) int64 0 1 2 3

A separate question (if we pick one of options 2-4) is how to represent the MultiIndex dtype and values (everything after the dimension name):

Option A: MultiIndex (as shown above)
Option B: MultiIndex[level_0, level_1]
Option C: object MultiIndex
Option D: object MultiIndex[level_0, level_1]
Option E: MultiIndex ('a', 0) ('a', 1) ('b', 0) ('b', 1)
Option F: object ('a', 0) ('a', 1) ('b', 0) ('b', 1) (current repr)

The tradeoffs here are whether or not we include the exact dtype information (object), and how explicitly/redundantly we display the values.

I'm currently leaning toward Option 3A, but I don't have a strong opinion.
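For reference, Option 3A could be produced by a formatter along these lines. This is only a sketch: `format_coords` and its input structures are hypothetical, and the value summaries are passed in as ready-made strings.

```python
def format_coords(coords, levels_by_coord):
    """Render coordinates in the style of Option 3A.

    coords: list of (name, dim, summary) tuples.
    levels_by_coord: coordinate name -> list of (level_name, summary).
    """
    names = [name for name, _, _ in coords]
    for level_rows in levels_by_coord.values():
        names.extend(level_name for level_name, _ in level_rows)
    width = max(len(name) for name in names)

    lines = ["Coordinates:"]
    for name, dim, summary in coords:
        level_rows = levels_by_coord.get(name)
        if level_rows is None:
            lines.append("  * %-*s  (%s) %s" % (width, name, dim, summary))
        else:
            # The multi-index itself keeps the "*" marker ...
            lines.append("  * %-*s  (%s) MultiIndex" % (width, name, dim))
            # ... and each of its levels gets the "-" marker.
            for level_name, level_summary in level_rows:
                lines.append("  - %-*s  (%s) %s"
                             % (width, level_name, dim, level_summary))
    return "\n".join(lines)


text = format_coords(
    [("x", "x", ""), ("y", "y", "int64 0 1 2 3")],
    {"x": [("level_1", "object 'a' 'a' 'b' 'b'"),
           ("level_2", "int64 0 1 0 1")]},
)
print(text)
```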

shoyer · 2016-08-06T23:17:47Z

xarray/core/variable.py


-    def __init__(self, name, data, attrs=None, encoding=None, fastpath=False):
-        super(Coordinate, self).__init__(name, data, attrs, encoding, fastpath)
+    def __init__(self, name, data, attrs=None, encoding=None,


I would slightly rather change the signature from name -> dims in the first argument (to match Variable), and add name=None as an optional parameter at the end (before fastpath). This could possibly break use of Coordinate(name=..., data=...) but I expect that usage is relatively rare -- this part of the API is usually not touched by users.

Possibly it would be a good idea to rename Coordinate -> IndexVariable, and change its intended usage to cover anytime someone wants to represent a pandas.Index in xarray. This could be useful for anyone who needs support for custom pandas types like Period or Categorical, even if they won't actually use the array for indexing.

benbovy · 2016-08-09T13:16:50Z

It seems like a fine rule to say that you cannot call .sel on both a level and the dimension name at the same time

Agreed. I always tend to want to make things as flexible as possible, but this is definitely not needed here.

I'm currently leaning toward Option 3A, but I don't have a strong opinion.

I'm also +1 for Option 3A. Maybe one little argument in favor of Options 3E or 3F is that they still show a consistent repr when a scalar is returned from the multi-index (see below), even though I don't like how they would display duplicate information.

>>> da.sel(level_1='a', level_2=0)
<xarray.DataArray 'test' (y: 4)>
array([ 0.66068869,  0.20374398,  0.43967166,  0.09447679])
Coordinates:
    x        object ('a', 0)
  * y        (y) int64 0 1 2 3

Possibly it would be a good idea to rename Coordinate -> IndexVariable

As a recent xarray user, I indeed remember that I initially found it confusing to have Dataset or DataArray "coordinates" that can be either Coordinate or Variable objects. I find that the name IndexVariable is more representative of the object.

shoyer · 2016-08-09T22:14:49Z

As a recent xarray user, I indeed remember that I initially found it confusing to have Dataset or DataArray "coordinates" that can be either Coordinate or Variable objects. I find that the name IndexVariable is more representative of the object.

Sounds good, I will do this in a separate PR.

benbovy · 2016-08-11T13:23:52Z

I just made some updates.

I would add an attribute level_names to Coordinate

Done.

It's much cheaper to call .get_level_values() after slicing a MultiIndex than before

Done, although it currently slices an arbitrary number (30) of first elements rather than calculating the number of elements needed for display.

It seems like a fine rule to say that you cannot call .sel on both a level and the dimension name at the same time

Done.

check uniqueness of level names at Dataset/DataArray creation (which is a good idea!)

I tried but it broke some existing tests. It actually triggered data loading for Coordinate objects (via calls to to_index()). I need to further investigate this.

there is not much need for level indexing with a dictionary at all

Right, but in the current implementation this is still used internally.

I would suggest putting the logic to create new variables for levels in the private _get_virtual_variable in dataset.py. [...] If possible, it would be nice if ds['time.day'] works even if time is a multi-index level.

Done. It should also work with multi-index levels although not tested yet.

I'm currently leaning toward Option 3A...

I've chosen option 3A for the repr, but I can change it depending on others' opinions.

What happens when you write ds.coords['level_1'] = ...? [...] probably better to raise an error and note that the MultiIndex should be modified instead.

Done.

Should levels appear in ds.keys() or ds.coords.keys()?

They don't appear in there. If we keep Option 3A for the repr, I also think that we can avoid changes here.

shoyer · 2016-08-11T15:59:17Z

xarray/core/formatting.py

+
+
+def _summarize_coord_levels(coord, col_width, marker):
+    # TODO: maybe slicing based on calculated number of displayed values


It's almost certainly overkill, but you could write:

    max_width = OPTIONS['display_width']
    max_possibly_relevant = max(int(np.ceil(max_width / 2.0)), 1)
    relevant_coord = coord[:max_possibly_relevant]

Either way, you probably do want to slice the coordinate outside the loop.
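The bound presumably works because every displayed value takes at least one character plus a separating space, so at most half the display width (rounded up) can ever be shown. A standalone sketch using only the standard library follows; the width of 80 and the helper name `max_values_that_fit` are assumptions for illustration.

```python
import math


def max_values_that_fit(display_width):
    # Each printed value needs at least one character plus a separator,
    # so no more than ceil(width / 2) values can possibly be displayed.
    return max(int(math.ceil(display_width / 2.0)), 1)


coord = list(range(1000))  # stand-in for a long coordinate
display_width = 80         # assumed display width

# Slice once, before looping over the levels, rather than once per level.
relevant_coord = coord[:max_values_that_fit(display_width)]
assert len(relevant_coord) == 40
```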

shoyer · 2016-08-30T20:49:38Z

xarray/core/dataset.py

            coords = {}
        if data_vars is not None or coords is not None:
            self._set_init_vars_and_dims(data_vars, coords, compat)
+            self._check_multiindex_level_names()


Rather than putting this check in the DataArray and Dataset constructors, let's call this from the Variable constructor instead. Probably the cleanest place to put this check is in PandasIndexAdapter, which is already used to wrap the data from all pandas Indexes:
https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L434

Hmm, I'm not sure I understand, as this checks for unique multi-index level names among all the (multi-index) coordinates of a new DataArray or Dataset object. I guess we need to iterate over those coordinates... I can rename it if it's not clear.

Ohhh... that makes perfect sense. I was confused.

In that case, the logic should go into merge.py, to ensure these checks will be performed every time variables are modified, too (which doesn't necessarily use the constructor). We should be doing this check on the result of merge_variables() in merge_coords_without_align, merge_coords and merge_core.

OK I see, Dataset._set_init_vars_and_dims indeed depends on merge.py but I actually didn't go more into that module. I'll take a look.
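As a rough illustration of what such a check has to do, the sketch below collects the level names of every coordinate and raises if the same name is claimed by more than one multi-index. The function name and the plain-dict input are assumptions; in xarray the level names would come from the index variables themselves.

```python
from collections import defaultdict


def assert_unique_level_names(level_names_by_coord):
    """Raise if a level name is used by more than one multi-index.

    level_names_by_coord: coordinate name -> tuple of level names,
    or None for coordinates that are not multi-indexes.
    """
    owners = defaultdict(list)
    for coord_name, level_names in level_names_by_coord.items():
        for level_name in level_names or ():
            owners[level_name].append(coord_name)
    duplicates = {name: coords for name, coords in owners.items()
                  if len(coords) > 1}
    if duplicates:
        raise ValueError(
            "conflicting MultiIndex level name(s): %r" % duplicates)


# Passes: each level name belongs to exactly one multi-index.
assert_unique_level_names({"x": ("level_1", "level_2"), "y": None})
```

Running the same check on the merged result, rather than in the constructors, covers every code path that modifies variables.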

benbovy · 2016-09-01T14:00:02Z

Not sure how to write the tests for this PR, as there are quite a few small changes spread across the API (e.g., repr, data object and coordinate properties, etc.). Should I write new tests specific to multi-index or should I modify existing tests (e.g., TestDataset.test_modify_inplace, TestDataset.tes_coord_properties, TestDataset.test_coords_modify, etc.) to include the multi-index case?

shoyer · 2016-09-01T16:55:12Z

Should I write new tests specific to multi-index or should I modify existing tests (e.g., TestDataset.test_modify_inplace, TestDataset.tes_coord_properties, TestDataset.test_coords_modify, etc.) to include the multi-index case?

I would usually lean towards dedicated tests (e.g., test_modify_multiindex_inplace) but if it's easier to modify the original tests in a small way (e.g., for repr) feel free to take that approach

shoyer · 2016-09-01T16:57:51Z

xarray/core/coordinates.py

    def _update_coords(self, coords):
        from .dataset import calculate_dimensions

+        for key in coords:


Is this not already checked by all the callers, e.g., in merge_coords_without_align?

No, currently it is not, though this should be indeed checked there!

benbovy · 2016-09-02T23:07:14Z

@shoyer this is ready for another round of review. I don't see any remaining issue, I added some tests and I updated the docs.

shoyer · 2016-09-02T23:13:25Z

xarray/core/coordinates.py

    def to_dataset(self):
        return self._to_dataset()

+    def __setitem__(self, key, value):


Just define this method once on AbstractCoordinates instead of repeating it twice

Actually, I think these should also be caught by the checks in merge.py

Yep! So I can remove this.

shoyer · 2016-09-02T23:57:26Z

Rather than adding an independent name attribute to IndexVariable, let's deprecate/remove IndexVariable.name instead (I can do this in a follow-up PR). Clearly it's useful to be able to use IndexVariable objects even for objects that do not correspond to ticks along a dimension, but for such objects, name is misleading. IndexVariable.name was never more than a convenient shortcut, and it's outlived its usefulness.

shoyer · 2016-09-03T00:03:02Z

xarray/core/dataarray.py

+            if var.ndim == 1:
+                level_names = var.to_index_variable().level_names
+                if level_names is not None:
+                    dim = var.dims[0]


Use tuple unpacking instead: dim, = var.dims
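The difference from var.dims[0] is that unpacking also asserts there is exactly one dimension. A small standalone sketch (the helper `only_dim` is hypothetical):

```python
def only_dim(dims):
    # Unpacking into a one-element target raises ValueError unless
    # `dims` contains exactly one item; dims[0] would silently accept more.
    dim, = dims
    return dim


assert only_dim(("x",)) == "x"

try:
    only_dim(("x", "y"))
except ValueError as err:
    print("unpacking failed:", err)
```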

benbovy · 2016-09-03T00:21:06Z

Rather than adding an independent name attribute to IndexVariable, let's deprecate/remove IndexVariable.name instead (I can do this in a follow-up PR). Clearly it's useful to be able to use IndexVariable objects even for objects that do not correspond to ticks along a dimension, but for such objects, name is misleading. IndexVariable.name was never more than a convenient shortcut, and it's outlived its usefulness.

I agree that name is misleading here. If we remove IndexVariable.name, then we can modify IndexVariable.get_level_variable such that it accepts one or more level names and returns an OrderedDict of {level_name: IndexVariable} items. That would be even better than the current implementation. Maybe we should then rename get_level_variable to get_levels_as_variables?

shoyer · 2016-09-03T01:14:48Z

If we remove IndexVariable.name, then we can modify IndexVariable.get_level_variable such that it accepts one or more level names and returns an OrderedDict of {level_name: IndexVariable} items. That would be even better than the current implementation. Maybe we should then rename get_level_variable to get_levels_as_variables?

I think what get_level_variable returns is an independent matter? I don't think it would be a good idea to return a dict here because computing level values can be expensive and the levels aren't always necessary (they're stored internally as an Index for each level and integer codes).

BTW, I will be away over the holiday weekend (in the US), but I expect we will probably be able to merge this shortly after I get back.

benbovy · 2016-09-03T10:44:58Z

I've made changes according to your comments.

I don't think it would be a good idea to return a dict here because computing level values can be expensive and the levels aren't always necessary (they're stored internally as an Index for each level and integer codes).

I just thought that returning a dict could be a little more convenient (i.e., using a single call) if we need to get either one particular or several or all level(s) as IndexVariable object(s). However, I admit that this is certainly overkill and that it is actually not related to removing the name attr.

BTW, I will be away over the holiday weekend (in the US), but I expect we will probably be able to merge this shortly after I get back.

Happy holidays! (I'll be on holiday next week too).

shoyer · 2016-09-13T03:08:37Z

I think the main (only?) thing left to do here is to remove the name argument you added to IndexVariable. For now, we can live with some IndexVariable objects with an inconsistent name (I'll remove the attribute shortly).

benbovy · 2016-09-13T13:21:53Z

I think the main (only?) thing left to do here is to remove the name argument you added to IndexVariable

Just removed it.

shoyer · 2016-09-14T03:35:04Z

OK, in it goes. Big thanks to @benbovy !

jhamman reviewed Aug 5, 2016

shoyer reviewed Aug 6, 2016

shoyer reviewed Aug 11, 2016

This was referenced Aug 18, 2016

Indexing with alignment and broadcasting #974

Closed

API design for pointwise indexing #475

Closed

shoyer mentioned this pull request Aug 30, 2016

Coordinate -> IndexVariable and other deprecations #993

Merged

shoyer reviewed Aug 30, 2016

benbovy force-pushed the multi-index_coord branch from 4672448 to 9dc2c16 on August 31, 2016 21:23

benbovy mentioned this pull request Aug 31, 2016

Multi-index repr #879

Closed

shoyer reviewed Sep 1, 2016

Benoit Bovy added 5 commits September 2, 2016 10:20

make multi-index levels visible as coordinates

f31a278

make levels also visible for Dataset

5e8a677

fix unnamed levels

19ec381

allow providing multi-index levels in .sel

1566938

refactored _get_valid_indexers to get_dim_indexers

9f4e4e3

Benoit Bovy added 2 commits September 2, 2016 10:23

cosmetic changes

936ec55

fix Coordinate -> IndexVariable

1d6a96f

benbovy force-pushed the multi-index_coord branch from 3d9dc31 to 1d6a96f on September 2, 2016 09:17

Benoit Bovy added 7 commits September 2, 2016 13:47

fix col width when formatting multi-index levels

ec67bbd

add tests for IndexVariable new methods and indexing

f80d7a8

fix bug in assert_unique_multiindex_level_names

861c78b

add tests for Dataset

37a0796

fix appveyor tests

fdbf4aa

add tests for DataArray

d237022

add docs

949fb46

shoyer reviewed Sep 2, 2016

shoyer reviewed Sep 3, 2016

review changes

bdaad9b

remove name argument of IndexVariable

a447767

shoyer mentioned this pull request Sep 14, 2016

Remove IndexVariable.name #1004

Open

shoyer merged commit 41654ef into pydata:master Sep 14, 2016

benbovy mentioned this pull request Sep 14, 2016

MultiIndex and data selection #767

Closed

benbovy deleted the multi-index_coord branch September 14, 2016 15:25

shoyer mentioned this pull request Sep 17, 2016

MultiIndex level coordinates as Dataset attributes #1006

Merged


