-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Multi-index levels as coordinates #947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if name != var.dims[0]: | ||
| continue | ||
| level_coords.update(var.to_coord().get_level_coords()) | ||
| return level_coords |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing something here? Wouldn't this also work without the continue statement?
for name in self._coord_names:
var = self.variables[name]
if name == var.dims[0]:
level_coords.update(var.to_coord().get_level_coords())
return level_coords|
In the example above, I wasn't happy with the initial implementation of using levels in |
|
This is very exciting to see! A few thoughts on implementation: Instead of always creating a dictionary of level coordinates, I would add an attribute It's much cheaper to call It's even more extreme for larger indexes. If possible, we should use something closer to this approach when formatting coordinates.
I would actually be happy to disallow both, which might be even easy. It seems like a fine rule to say that you cannot call |
|
I would suggest putting the logic to create new variables for levels in the private This could get us most of the way there, but there are still a few things to watch out for:
|
|
I'm conflicted about how to handle the repr. On the one hand, I like how Let me try to sketch out some concrete proposals to encourage the peanut gallery to speak up: Option 1: no special indicator for the MultiIndex: Option 2: both MultiIndex and levels in repr: Option 3: both MultiIndex and levels in repr, different symbol for levels: Option 4: both MultiIndex and levels in repr, different symbol for levels, with indentation: A separate question (if we pick one of options 2-4) is how to represent the Option A: The tradeoffs here are whether or not we include the exact dtype information ( I'm currently leaning toward Option 3A, but I don't have a strong opinion. |
xarray/core/variable.py
Outdated
|
|
||
| def __init__(self, name, data, attrs=None, encoding=None, fastpath=False): | ||
| super(Coordinate, self).__init__(name, data, attrs, encoding, fastpath) | ||
| def __init__(self, name, data, attrs=None, encoding=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would slightly rather change the signature from name -> dims in the first argument (to match Variable), and add name=None as an optional parameter at the end (before fastpath). This could possibly break use of Coordinate(name=..., data=...) but I expect that usage is relatively rare -- this part of the API is usually not touched by users.
Possibly it would be a good idea to rename Coordinate -> IndexVariable, and change it's intended usage to cover anytime someone wants to represent a pandas.Index in xarray. This could be useful for anyone who needs support for custom pandas types like Period or Categorical, even if they won't actually use the array for indexing.
Agreed. I have always the tendency to want to make things as flexible as possible, but this is definitely not needed here.
I'm also +1 for Option 3A. Maybe one little argument in favor of Options 3E or 3F is that they still show a consistent repr when a scalar is returned from the multi-index (see below), even though I don't like how they would display duplicate information.
As a recent xarray user, I indeed remember that I initially found confusing to have Dataset or DataArray "coordinates" that can be either |
Sounds good, I will do this in a separate PR. |
|
I just made some updates.
Done.
Done, although it currently slices an arbitrary number (30) of first elements rather than calculating the number of elements needed for display.
Done.
I tried but it broke some existing tests. It actually triggered data loading for Coordinate objects (via calls to
Right, but in the current implementation this is still used internally.
Done. It should also work with multi-index levels although not tested yet.
I've chosen option 3A for the repr, but I can change it depending on others' opinions.
Done.
They don't appear in there. If we keep Option 3A for the repr, I also think that we can avoid changes here. |
xarray/core/formatting.py
Outdated
|
|
||
|
|
||
| def _summarize_coord_levels(coord, col_width, marker): | ||
| # TODO: maybe slicing based on calculated number of displayed values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's almost certainly overkill, but you could write:
max_width = OPTIONS['display_width']
max_possibly_relevant = max(int(np.ceil(max_width / 2.0)), 1)
relevant_coord = coord[:max_possibly_relevant]
Either way, you probably do want to slice the coordinate outside the look.
xarray/core/dataset.py
Outdated
| coords = {} | ||
| if data_vars is not None or coords is not None: | ||
| self._set_init_vars_and_dims(data_vars, coords, compat) | ||
| self._check_multiindex_level_names() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than putting this check in the DataArray and Dataset constructors, let's calling this from the Variable constructor instead. Probably the cleanest place to put this check in PandasIndexAdapter, which is already used to wrap the data from all pandas Indexes:
https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L434
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmh not sure to understand as this checks for unique multi-index level names among all the (multi-index) coordinates of a new DataArray or a Dataset object. I guess we need to iterate over those coordinates... I can rename it if it's not clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh... that makes perfect sense. I was confused.
In that case, the logic should go into merge.py, to ensure these checks will be performed every time variables are modified, too (which doesn't necessarily use the constructor). We should be doing this check on the result of merge_variables() in merge_coords_without_align, merge_coords and merge_core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I see, Dataset._set_init_vars_and_dims indeed depends on merge.py but I actually didn't go more into that module. I'll take a look.
4672448 to
9dc2c16
Compare
|
Not sure how to write the tests for this PR, as there are quite many small changes spread in the API (e.g., repr, data object and coordinate properties, etc.). Should I write new tests specific to multi-index or should I modify existing tests (e.g., |
I would usually lean towards dedicated tests (e.g., |
xarray/core/coordinates.py
Outdated
| def _update_coords(self, coords): | ||
| from .dataset import calculate_dimensions | ||
|
|
||
| for key in coords: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not already checked by all the callers, e.g., in merge_coords_without_align?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, currently it is not, though this should be indeed checked there!
3d9dc31 to
1d6a96f
Compare
|
@shoyer this is ready for another round of review. I don't see any remaining issue, I added some tests and I updated the docs. |
xarray/core/coordinates.py
Outdated
| def to_dataset(self): | ||
| return self._to_dataset() | ||
|
|
||
| def __setitem__(self, key, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just define this method once on AbstractCoordinates instead of repeating it twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think these should also be caught by the checks in merge.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! So I can remove this.
|
Rather than adding an independent |
xarray/core/dataarray.py
Outdated
| if var.ndim == 1: | ||
| level_names = var.to_index_variable().level_names | ||
| if level_names is not None: | ||
| dim = var.dims[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use tuple unpacking instead: dim, = var.dims
I agree that |
I think what BTW, I will be away over the holiday weekend (in the US), but I expect we will probably be able to merge this shortly after I get back. |
|
I've made changes according to your comments.
I just thought that returning a dict could be a little more convenient (i.e., using a single call) if we need to get either one particular or several or all level(s) as
Happy holidays! (I'll be on holiday next week too). |
|
I think the main (only?) thing left to do here is to remove the |
Just removed it. |
|
OK, in it goes. Big thanks to @benbovy ! |
Implements 2, 4 and 5 in #719.
Demo:
Some notes about the implementation:
Coordinateso that it allows setting different values for the names of the coordinate and its dimension. There is no breaking change.Coordinate.get_level_coordsmethod to get independent, single-index coordinates objects from a MultiIndex coordinate.Remaining issues:
Coordinate.get_level_coordscallspandas.MultiIndex.get_level_valuesfor each level and is itself called each time when indexing and for repr. This can be very costly!! It would be nice to return some kind of lazy index object instead of computing the actual level values.*for level coordinates.DataArray.level_1doesn't return anotherDataArrayobject:DataArrayorDatasetcreation.Of course still needs proper tests and docs...