-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Refactor nanops #2236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor nanops #2236
Conversation
xarray/core/nanops.py
Outdated
| min_count = kwds.get('min_count', 1) | ||
|
|
||
| if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK | ||
| and not isinstance(axis, tuple) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/core/nanops.py
Outdated
|
|
||
| if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK | ||
| and not isinstance(axis, tuple) | ||
| and values.dtype.kind in 'uifc' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/core/nanops.py
Outdated
| if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK | ||
| and not isinstance(axis, tuple) | ||
| and values.dtype.kind in 'uifc' | ||
| and values.dtype.isnative |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/core/nanops.py
Outdated
| and not isinstance(axis, tuple) | ||
| and values.dtype.kind in 'uifc' | ||
| and values.dtype.isnative | ||
| and (dtype is None or np.dtype(dtype) == values.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/core/nanops.py
Outdated
| and values.dtype.kind in 'uifc' | ||
| and values.dtype.isnative | ||
| and (dtype is None or np.dtype(dtype) == values.dtype) | ||
| and min_count != 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/tests/test_variable.py
Outdated
| with pytest.raises(NotImplementedError): | ||
| v.argmax(skipna=True) | ||
| v.argmax(skipna=True) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W293 blank line contains whitespace
|
|
||
| v = Variable('t', pd.date_range('2000-01-01', periods=3)) | ||
| with pytest.raises(NotImplementedError): | ||
| v.argmax(skipna=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anyone remember why we did not implement nanargmax for pd.date_range? It looks np.nanargmax takes care of this dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, no idea! :)
|
Very nice!
I think this is correct -- bottleneck does not speed up non-NaN skipping functions. |
xarray/tests/test_duck_array_ops.py
Outdated
| have a sentinel missing value (int) or skipna=True has not been | ||
| implemented (object, datetime64 or timedelta64). | ||
| min_count : int, default None | ||
| The required number of valid values to perform the operation. If fewer than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E501 line too long (87 > 79 characters)
xarray/tests/test_duck_array_ops.py
Outdated
| # without min_count | ||
| actual = DataArray.mean.__doc__ | ||
| expected = dedent("""\ | ||
| Reduce this DataArray's data by applying `mean` along some dimension(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E501 line too long (80 > 79 characters)
shoyer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to restructure this a little bit to have two well defined layers:
- A module of bottleneck/numpy functions that act on numpy arrays only.
- A module of functions that act on numpy or dask arrays (or these could be moved into
duck_array_ops).
xarray/core/duck_array_ops.py
Outdated
| if dtype is None: | ||
| func = _dask_or_eager_func(name) | ||
| else: | ||
| func = getattr(np_module, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a little dangerous -- it could silently convert a dask array into a NumPy array. Can we raise an error instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks a kind of holdover.
Removed.
xarray/core/nanops.py
Outdated
|
|
||
| if mask is not None: | ||
| if isinstance(a, dask_array_type): | ||
| return dask_array.where(mask, val, a), mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little surprising to me. I would rather use duck_array_ops.where() here than use explicit type checks.
Likewise, instead of logic above for mask, use duck_array_ops.isnull()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Could you explain more detail about this idea? |
OK, let me try:
We could implement this with:
|
|
Hello from me hopefully contributing some needfull things. At first, I would like to comment that I checked out your code. I ran the following code example using a datafile uploaded under the following link: I have recognized that the min_count option at recent state technically only works for DataArrays and not for DataSets. However, more interesting is the fact that the dimensions are destroyed: no longitude and latitude survives your operation. If I would use the the sum-operator on the full dataset (where maybe the code was not modified?) I got |
|
!!!Correction!!!. The resample-example above gives also a missing lon-lat dimensions in case of unmodified model code. The resulting numbers are the same. Now I am a little bit puzzled. But ... !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!IT SEEMS TO BE A BUG!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! If I do the resample process using the old nomenclature: |
|
Okay. Using the old resample-nomenclature I tried to compare the results with your modified code (min_count = 2) and with the tagged version 0.10.7 (no min_count argument). But I am not sure, if this also works. Do you examine the keywords given from the old-resample method?. However, in the comparison I did not saw the nan's I expected over water. |
|
@shoyer , thanks for the details. I think I understood your idea. |
|
Thanks, @rpnaut . I will add the test also. Thanks again for the testing :) |
|
To wrap it up. Your implementation works for timeseries - data. There is something strange with time-space data, which should be fixed. If this is fixed, it is worth to test in my evaluation environment. And maybe, it would be interesting to have in the future the min_count argument also available for the old syntax and not only the new. The reason: The dimension name is not flexible anymore - it cannot be a variable like dim=${dim} |
|
Thanks for testing.
I think it is not true. |
xarray/tests/test_dataset.py
Outdated
|
|
||
| actual = ds.resample(time='1D').sum(min_count=1) | ||
| expected = xr.concat([ | ||
| ds.isel(time=slice(i*4, (i+1)*4)).sum('time', min_count=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E226 missing whitespace around arithmetic operator
|
I noticed |
|
Can anyone give further review? |
shoyer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor comments -- generally this looks like a nice improvement, though!
xarray/core/nanops.py
Outdated
| type """ | ||
| valid_count = count(value, axis=axis) | ||
| value = fillna(value, fill_value) | ||
| data = getattr(np, func)(value, axis=axis, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is doing eager evaluation by calling the numpy function -- can we call the dask function instead on dask arrays?
Alternatively, we could make this function error on dask arrays? I just don't want to silently compute them.
xarray/core/nanops.py
Outdated
| valid_count = count(value, axis=axis) | ||
| value = fillna(value, fill_value) | ||
| data = getattr(np, func)(value, axis=axis, **kwargs) | ||
| # dask seems return non-integer type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to file an upstream bug report with dask about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a misunderstanding or already fixed in upstream.
Removed this line.
| if isinstance(value, dask_array_type): | ||
| data = data.astype(int) | ||
|
|
||
| if (valid_count == 0).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would also evaluate dask arrays, which can be costly. I don't know if there's much to do about this but it should be noted.
xarray/core/nanops.py
Outdated
|
|
||
| if isinstance(a, dask_array_type): | ||
| return dask_array.nanmax(a, axis=axis) | ||
| return nputils.nanmax(a, axis=axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be more cleanly written as:
module = nptuils if isinstance(a, dask_array_type) else dask_array
return module.nanmin(a, axis=axis)
xarray/core/nanops.py
Outdated
|
|
||
| if mask is not None: | ||
| mask = np.all(mask, axis=axis) | ||
| if np.any(mask): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more dask friendly to use the.all() and .any() methods -- that way at least most of the check can be done in parallel, and only a single value (the boolean scalar) gets evaluates.
xarray/tests/test_duck_array_ops.py
Outdated
| # also check ddof!=0 case | ||
| actual = getattr(da, func)(skipna=skipna, dim=aggdim, ddof=5) | ||
| if dask: | ||
| isinstance(da.data, dask_array_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs assert?
xarray/tests/test_duck_array_ops.py
Outdated
| da = construct_dataarray(dim_num, dtype, contains_nan=False, dask=dask) | ||
| actual = getattr(da, func)(skipna=skipna) | ||
| if dask: | ||
| isinstance(da.data, dask_array_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also needs assert
|
|
||
| v = Variable('t', pd.date_range('2000-01-01', periods=3)) | ||
| with pytest.raises(NotImplementedError): | ||
| v.argmax(skipna=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, no idea! :)
|
Thanks, @shoyer. |
shoyer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, thanks!
doc/whats-new.rst
Outdated
| - Follow up the renamings in dask; from dask.ghost to dask.overlap | ||
| By `Keisuke Fujii <https://github.com/fujiisoup>`_. | ||
|
|
||
| By `Tony Tung <https://github.com/ttung>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should reference the line above, not your change here
|
Thanks for the review. Merging. |
|
Hi, Edit: |
|
Thanks, @st-bender, for the bug report. |


whats-new.rstfor all changes andapi.rstfor new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)In #2230, the addition of
min_countkeywords for our reduction methods was discussed, butour
duck_array_opsmodule is becoming messy (mainly due to nan-aggregation methods for dask, bottleneck and numpy) and it looks a little hard to maintain them.I tried to refactor them by moving nan-aggregation methods to
nanopsmodule.I think I still need to take care of more edge cases, but I appreciate any comment for the current implementation.
Note:
In my implementation, bottleneck is not used when
skipna=False.bottleneck would be advantageous when
skipna=Trueas numpy needs to copy the entire array once, but I think numpy's method is still OK ifskipna=False.