-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rank Methods #1733
Rank Methods #1733
Conversation
a00b098
to
c8997c7
Compare
Thanks for putting this together! Rather than using the injection stuff in Also take a look at the use of |
xarray/core/duck_array_ops.py
Outdated
@@ -23,6 +23,8 @@ | |||
try: | |||
import bottleneck as bn | |||
has_bottleneck = True | |||
# monkeypatch numpy with rankdata | |||
np.rankdata = bn.rankdata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't monkeypatch -- this would affect NumPy for all xarray users.
@shoyer Thanks for the comment and the tips! I'll make the changes asap, hopefully this week-end |
Dropped support for Dataset. I don't think there's much use for it anyway. |
f9c0bc5
to
94ac0ca
Compare
xarray/core/dataarray.py
Outdated
array([ 1., 2., 3.]) | ||
Dimensions without coordinates: x | ||
""" | ||
import bottleneck as bn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the exception is explicit enough
xarray/core/dataarray.py
Outdated
ranks = apply_ufunc(func, self, | ||
dask='parallelized', | ||
keep_attrs=True, | ||
output_dtypes=[np.float_], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on the docs, it looks like the return type is always np.float64
.
Ranks begin at 1, not 0. If pct is True, computes percentage ranks. | ||
|
||
NaNs in the input array are returned as NaNs. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a note here that this requires bottleneck.
# str | ||
y = DataArray(['c', 'b', 'a']) | ||
self.assertDataArrayEqual(y.rank('dim_0'), x) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to add test coverage for the pct=True
option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great -- can you take a look at implementing this for Dataset, too? For consistency, it's nice if we have the same methods for both Dataset/DataArray if possible. I think this should be a pretty straightforward use of apply
:
def rank(self, dim, pct=False):
return ds.apply(lambda array: array.rank(dim, pct=pct))
@shoyer when the ranking dimension is missing from an array in the dataset, should we do nothing, or put 1s everywhere ? None of these looks appealing or natural. |
Good point, maybe it needs to be slightly more complicated than apply. Two other options are raise an error (safest) or to drop all such variables from the result. The later is similar to what we do in aggregations like |
Should i move the implementation on Variable and use temp datasets like |
I would be happy either way here. In some ways, yes, that would be the cleanest solution. |
xarray/core/dataarray.py
Outdated
kwargs=dict(axis=axis)).transpose(*self.dims) | ||
if not pct: | ||
return ranks | ||
return (ranks - 1) / (self.count(dim) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer with pct=True
this is not the same implementation as pandas which returns rank / count
. I think this one makes more sense but users might be suprised
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks for calling this out. The pandas definition makes a little more sense to me, as "fraction of the data with this rank or higher". I don't know how I would describe your version in a simple sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normalized ? It lies in [0, 1] not [1/Nsamples, 1]
My main issue is that I would expect the pct rank to average to 0.5 but with pandas we have
In [2]: pd.Series([1,2,3]).rank(pct=True)
Out[2]:
0 0.333333
1 0.666667
2 1.000000
On the other hand, with my implem, pct rank is not defined for a dim of length 1
For the record, Bottleneck move_rank returns a polar rank in [-1, 1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main issue is that I would expect the pct rank to average to 0.5
I agree that this would be a nice property, but I'm not sure it's worth the loss in interpretability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a str arg for specifying the normalization scheme ?
'pct', 'norm' and 'polar'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would polar
mean?
All things being equal, we do try to lean towards doing what pandas does in xarray because consistency makes it easier to use both packages together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
polar would be [-1, 1]
I get your point of view, i'll make the change to mimick pandas
xarray/core/dataarray.py
Outdated
---------- | ||
dim : str | ||
pct : bool, optional | ||
keep_attrs : bool, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last comment, it would be great if you could expand the parameters section of the docstrings to include a 1 line description of each. For example:
Parameters
----------
dim : str
Dimension(s) over which to compute rank.
pct : bool, optional
If True, compute percentage ranks, otherwise compute integer ranks.
keep_attrs : bool, optional
If True, the dataset's attributes (`attrs`) will be copied from
the original object to the new one. If False (default), the new
object will be returned without attributes.
This looks great @0x0L thanks & congrats on your first contribution! |
# int | ||
v = Variable(['x'], [3,2,1]) | ||
expect = bn.rankdata(v.data, axis=0) | ||
np.testing.assert_allclose(v.rank('x').values, expect) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI for the future, xarray.test
has these natively, and assert_equal
rather than the older self.
methods
(tbc, no need to change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A shameless copy/paste from from the test above :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor points.
xarray/core/dataset.py
Outdated
for name, var in iteritems(self.variables): | ||
if name in self.data_vars and dim in var.dims: | ||
variables[name] = var.rank(dim, pct=pct) | ||
variables.update({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a little complicated to me, and also will drop non-dimension coordinates. As a simpler rule, how about simply keeping all existing coordinates?
This could look like:
if name in self.data_vars:
if dim in var.dims:
variables[name] = var.rank(dim, pct=pct)
else:
variables[name] = var
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be good at add test cases verifying:
- that coordinates stick around
- that invalid data variables are dropped
import bottleneck as bn | ||
|
||
if isinstance(self.data, dask_array_type): | ||
raise TypeError("rank does not work for arrays stored as dask " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a test that this error is raised, e.g., using pytest.raises
or raises_regex
Variables that do not depend on `dim` are dropped. | ||
""" | ||
if dim not in self.dims: | ||
raise ValueError('Dataset does not contain the dimension: %s' % dim) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add test coverage for this condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good to me! I'll merge this in a day or two unless anyone has other comments/suggestions.
xarray/core/dataset.py
Outdated
else: | ||
variables[name] = var | ||
|
||
coord_names = set(k for k in self.coords if k in variables) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be simplified as just coord_names = set(self.coords)
now
Thank you! |
git diff upstream/master **/*py | flake8 --diff
whats-new.rst
for all changes andapi.rst
for new API