Refactor nanops #2236

Merged

fujiisoup merged 29 commits into pydata:master from fujiisoup:refactor_nanops

Aug 16, 2018

Member

fujiisoup commented Jun 18, 2018

Closes Inconsistency between Sum of NA's and Mean of NA's: resampling gives 0 or 'NA' #2230
Tests added
Tests passed
Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

In #2230, adding a min_count keyword to our reduction methods was discussed, but
our duck_array_ops module is becoming messy (mainly due to the nan-aggregation methods for dask, bottleneck and numpy) and looks a little hard to maintain.

I tried to refactor them by moving the nan-aggregation methods to a nanops module.

I think I still need to take care of more edge cases, but I would appreciate any comments on the current implementation.

Note:
In my implementation, bottleneck is not used when skipna=False.
bottleneck would be advantageous when skipna=True as numpy needs to copy the entire array once, but I think numpy's method is still OK if skipna=False.
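
For context, the inconsistency behind #2230 can be reproduced with plain NumPy. The sketch below is only an illustration of the min_count idea for a 1-D array; the helper name `nansum_min_count` is hypothetical and is not the code added in this PR.

```python
import warnings

import numpy as np

all_nan = np.array([np.nan, np.nan])

# NaN-skipping sum of an all-NaN array is 0.0 ...
total = np.nansum(all_nan)

# ... while the NaN-skipping mean of the same array is NaN
# (NumPy also emits a RuntimeWarning, silenced here).
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    mean = np.nanmean(all_nan)


def nansum_min_count(values, min_count=1):
    """Hypothetical helper: NaN-skipping sum that returns NaN when
    fewer than `min_count` valid (non-NaN) values are present."""
    n_valid = np.count_nonzero(~np.isnan(values))
    return np.nansum(values) if n_valid >= min_count else np.nan
```

With `min_count=0` the helper keeps the current behaviour (an all-NaN input sums to 0.0); with `min_count=1` the sum becomes NaN, which matches what the mean returns.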

fujiisoup added 3 commits

June 17, 2018 20:02


          Inhouse nanops

f36f8f2


          Cleanup nanops

76218b2


          remove NAT_TYPES

943e2b1

stickler-ci reviewed

xarray/core/nanops.py Outdated

    
                          min_count = kwds.get('min_count', 1)

                          if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK

                                  and not isinstance(axis, tuple)

Contributor

stickler-ci Jun 18, 2018

W503 line break before binary operator

xarray/core/nanops.py Outdated

    
                          if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK

                                  and not isinstance(axis, tuple)

                                  and values.dtype.kind in 'uifc'

Contributor

stickler-ci Jun 18, 2018

W503 line break before binary operator

xarray/core/nanops.py Outdated

    
                          if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK

                                  and not isinstance(axis, tuple)

                                  and values.dtype.kind in 'uifc'

                                  and values.dtype.isnative

Contributor

stickler-ci Jun 18, 2018

W503 line break before binary operator

xarray/core/nanops.py Outdated

    
                                  and not isinstance(axis, tuple)

                                  and values.dtype.kind in 'uifc'

                                  and values.dtype.isnative

                                  and (dtype is None or np.dtype(dtype) == values.dtype)

Contributor

stickler-ci Jun 18, 2018

W503 line break before binary operator

xarray/core/nanops.py Outdated

    
                                  and values.dtype.kind in 'uifc'

                                  and values.dtype.isnative

                                  and (dtype is None or np.dtype(dtype) == values.dtype)

                                  and min_count != 1):

Contributor

stickler-ci Jun 18, 2018

W503 line break before binary operator


          flake8.

84fc69e

stickler-ci reviewed

xarray/tests/test_variable.py Outdated

    
                      with pytest.raises(NotImplementedError):

                          v.argmax(skipna=True)

                      v.argmax(skipna=True)

Contributor

stickler-ci Jun 18, 2018

W293 blank line contains whitespace

fujiisoup commented

xarray/tests/test_variable.py

    
                      v = Variable('t', pd.date_range('2000-01-01', periods=3))

                      with pytest.raises(NotImplementedError):

                          v.argmax(skipna=True)

Member Author

fujiisoup Jun 18, 2018

Does anyone remember why we did not implement nanargmax for pd.date_range? It looks like np.nanargmax takes care of this dtype.

Member

shoyer Aug 16, 2018

nope, no idea! :)


          another flake8

11d735f

fujiisoup mentioned this pull request

Inconsistency between Sum of NA's and Mean of NA's: resampling gives 0 or 'NA' #2230

Closed

Member

shoyer commented Jun 18, 2018

Very nice!

In my implementation, bottleneck is not used when skipna=False.
bottleneck would be advantageous when skipna=True as numpy needs to copy the entire array once, but I think numpy's method is still OK if skipna=False.

I think this is correct -- bottleneck does not speed up non-NaN skipping functions.

fujiisoup added 5 commits

June 19, 2018 07:07


          recover nat types

7a079f6


          remove keep_dims option from nanops (to make them compatible with numpy==1.11).

441be59


          Test aggregation over multiple dimensions

f95054b


          Remove print.

9211b64


          Docs. More cleanup.

491ce2f

stickler-ci reviewed

xarray/tests/test_duck_array_ops.py Outdated

    
                          have a sentinel missing value (int) or skipna=True has not been

                          implemented (object, datetime64 or timedelta64).

                      min_count : int, default None

                          The required number of valid values to perform the operation. If fewer than

Contributor

stickler-ci Jun 20, 2018

E501 line too long (87 > 79 characters)

xarray/tests/test_duck_array_ops.py Outdated

    
                  # without min_count

                  actual = DataArray.mean.__doc__

                  expected = dedent("""\

                      Reduce this DataArray's data by applying `mean` along some dimension(s).

Contributor

stickler-ci Jun 20, 2018

E501 line too long (80 > 79 characters)

fujiisoup added 2 commits

June 20, 2018 21:26


          Merge branch 'master' into refactor_nanops

f0fc8bf


          flake8

5dda535

fujiisoup changed the title ~~[WIP] Refactor nanops~~ Refactor nanops


          Bug fix. Better test coverage.

5ddc4eb

shoyer reviewed

Member

shoyer left a comment

I think it would make sense to restructure this a little bit to have two well defined layers:

A module of bottleneck/numpy functions that act on numpy arrays only.
A module of functions that act on numpy or dask arrays (or these could be moved into duck_array_ops).

xarray/core/duck_array_ops.py Outdated

    
                          if dtype is None:

                              func = _dask_or_eager_func(name)

                          else:

                              func = getattr(np_module, name)

Member

shoyer Jun 20, 2018

This looks a little dangerous -- it could silently convert a dask array into a NumPy array. Can we raise an error instead?

Member Author

fujiisoup Jun 20, 2018

It looks like a holdover.
Removed.
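
To illustrate the concern in this thread: calling a NumPy function on a lazy array coerces it to an in-memory ndarray without any warning. The sketch below uses a hypothetical `LazyArray` class as a stand-in for a dask array, and a hypothetical `numpy_only_reduce` helper that raises instead of converting silently. Neither name comes from xarray.

```python
import numpy as np


class LazyArray:
    """Hypothetical stand-in for a lazy (dask-like) array."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self.was_computed = False

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this whenever it coerces the object to an ndarray.
        self.was_computed = True
        return self._data if dtype is None else self._data.astype(dtype)


def numpy_only_reduce(name, values, axis=None, dtype=None):
    """Apply the NumPy reduction `name`, refusing non-ndarray inputs
    so that a lazy array is never computed by accident."""
    if not isinstance(values, np.ndarray):
        raise TypeError(
            f"{name} with an explicit dtype is only supported for "
            f"numpy arrays, got {type(values).__name__}"
        )
    return getattr(np, name)(values, axis=axis, dtype=dtype)


# Coercion happens silently as soon as NumPy touches the object.
coerced = LazyArray([1, 2, 3])
np.asarray(coerced)

# The guarded helper leaves the lazy array untouched and raises instead.
guarded = LazyArray([1, 2, 3])
try:
    numpy_only_reduce("sum", guarded, dtype=float)
except TypeError:
    pass
```

After this runs, `coerced.was_computed` is True while `guarded.was_computed` is still False.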

xarray/core/nanops.py Outdated

    
                  if mask is not None:

                      if isinstance(a, dask_array_type):

                          return dask_array.where(mask, val, a), mask

Member

shoyer Jun 20, 2018

This is a little surprising to me. I would rather use duck_array_ops.where() here than use explicit type checks.

Likewise, instead of logic above for mask, use duck_array_ops.isnull()

Member Author

fujiisoup Jun 20, 2018

Done.


          using isnull, where_method. Remove unnecessary conditional branching.

c37de0e

Member Author

fujiisoup commented Jun 20, 2018

I think it would make sense to restructure this a little bit to have two well defined layers:

A module of bottleneck/numpy functions that act on numpy arrays only.
A module of functions that act on numpy or dask arrays (or these could be moved into duck_array_ops).

Could you explain this idea in more detail?

Member

shoyer commented Jun 20, 2018

A module of bottleneck/numpy functions that act on numpy arrays only.
A module of functions that act on numpy or dask arrays (or these could be moved into duck_array_ops).

Could you explain this idea in more detail?

OK, let me try:

On numpy arrays, we use bottleneck equivalents of numpy functions when possible because bottleneck is faster than numpy
On dask arrays, we use dask equivalents of numpy functions.
We also want to add some extra features on top of what numpy/dask/bottleneck provide, e.g., handling of min_count

We could implement this with:

nputils.nansum() is equivalent to numpy.nansum() but uses bottleneck.nansum() internally instead when possible.
duck_array_ops.nansum() uses nputils.nansum() or dask.array.nansum(), based upon the type of the inputs.
duck_array_ops.sum() uses numpy.sum() or dask.array.sum(), based upon the type of the inputs.
duck_array_ops.sum_with_mincount() adds mincount and skipna support and is used in the Dataset.sum() implementation. It is written using duck_array_ops.nansum(), duck_array_ops.sum(), duck_array_ops.where() and duck_array_ops.isnull().
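
The layering described above could be sketched roughly as follows. This is only an illustration under stated assumptions: all function names (`np_nansum`, `duck_nansum`, `duck_isnull`, `duck_where`, `sum_with_min_count`) are hypothetical, bottleneck and dask are treated as optional imports, and the bottleneck fast path is restricted to native float arrays for simplicity.

```python
import numpy as np

try:  # optional accelerator for NumPy arrays
    import bottleneck as bn
except ImportError:
    bn = None

try:  # optional lazy-array backend
    import dask.array as dask_array
    dask_array_type = (dask_array.Array,)
except ImportError:
    dask_array = None
    dask_array_type = ()


def np_nansum(values, axis=None):
    """Layer 1: NumPy arrays only, using bottleneck when it applies."""
    if (bn is not None and not isinstance(axis, tuple)
            and values.dtype.kind == "f" and values.dtype.isnative):
        return bn.nansum(values, axis=axis)
    return np.nansum(values, axis=axis)


def _module_for(values):
    """Pick dask.array for dask inputs and numpy otherwise."""
    return dask_array if isinstance(values, dask_array_type) else np


def duck_nansum(values, axis=None):
    """Layer 2: dispatch on the array type, never computing dask arrays."""
    if isinstance(values, dask_array_type):
        return dask_array.nansum(values, axis=axis)
    return np_nansum(np.asarray(values), axis=axis)


def duck_isnull(values):
    return _module_for(values).isnan(values)


def duck_where(cond, x, y):
    return _module_for(cond).where(cond, x, y)


def sum_with_min_count(values, axis=None, skipna=True, min_count=None):
    """Layer 3: extra features written only in terms of layer 2."""
    total = duck_nansum(values, axis=axis) if skipna else values.sum(axis=axis)
    if min_count is None:
        return total
    n_valid = (~duck_isnull(values)).sum(axis=axis)
    return duck_where(n_valid >= min_count, total, np.nan)
```

The point of the split is that only layer 1 needs to know about bottleneck, only layer 2 needs to know about dask, and the min_count logic in layer 3 contains no explicit type checks.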

rpnaut commented Jun 21, 2018

Hello, I hope I can contribute something useful here.

First, I checked out your code.

I ran the following code example using a datafile uploaded under the following link:
https://swiftbrowser.dkrz.de/public/dkrz_c0725fe8741c474b97f291aac57f268f/GregorMoeller/

import xarray
import matplotlib.pyplot as plt

data = xarray.open_dataset("eObs_gridded_0.22deg_rot_v14.0.TOT_PREC.1950-2016.nc_CutParamTimeUnitCor_FinalEvalGrid")
data
<xarray.Dataset>
Dimensions:       (rlat: 136, rlon: 144, time: 153)
Coordinates:
  * rlon          (rlon) float32 -22.6 -22.38 -22.16 -21.94 -21.72 -21.5 ...
  * rlat          (rlat) float32 -12.54 -12.32 -12.1 -11.88 -11.66 -11.44 ...
  * time          (time) datetime64[ns] 2006-05-01T12:00:00 ...
Data variables:
    rotated_pole  int32 ...
    TOT_PREC      (time, rlat, rlon) float32 ...
Attributes:
    CDI:                       Climate Data Interface version 1.8.0 (http://m...
    Conventions:               CF-1.6
    history:                   Thu Jun 14 12:34:59 2018: cdo -O -s -P 4 remap...
    CDO:                       Climate Data Operators version 1.8.0 (http://m...
    cdo_openmp_thread_number:  4

data_aggreg = data["TOT_PREC"].resample(time="M").sum(min_count=0)
data_aggreg2 = data["TOT_PREC"].resample(time="M").sum(min_count=1)

I noticed that, in its current state, the min_count option only works for DataArrays and not for Datasets. More interesting, however, is that the dimensions are lost:

data_aggreg
<xarray.DataArray 'TOT_PREC' (time: 5)>
array([ 551833.25   ,  465640.09375,  328445.90625,  836892.1875 ,  503601.5    ], dtype=float32)
Coordinates:
  * time     (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...

Neither longitude nor latitude survives the operation. If I use the sum operator on the full dataset (where maybe the code was not modified?), I get

>>> data_aggreg = data.resample(time="M").sum()
>>> data_aggreg
<xarray.Dataset>
Dimensions:       (rlat: 136, rlon: 144, time: 5)
Coordinates:
  * time          (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
  * rlon          (rlon) float32 -22.6 -22.38 -22.16 -21.94 -21.72 -21.5 ...
  * rlat          (rlat) float32 -12.54 -12.32 -12.1 -11.88 -11.66 -11.44 ...
Data variables:
    rotated_pole  (time) int64 1 1 1 1 1
    TOT_PREC      (time, rlat, rlon) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

rpnaut commented Jun 21, 2018

Correction: the resample example above also loses the lon-lat dimensions with the unmodified code. The resulting numbers are the same.

data_aggreg = data["TOT_PREC"].resample(time="M").sum()
data_aggreg
<xarray.DataArray 'TOT_PREC' (time: 5)>
array([ 551833.25   ,  465640.09375,  328445.90625,  836892.1875 ,  503601.5    ], dtype=float32)
Coordinates:
  * time     (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...

Now I am a little bit puzzled. But ...

It seems to be a bug!

If I do the resampling using the old syntax:
data_aggreg = data["TOT_PREC"].resample(dim="time",how="sum",freq="M")
it works.
Do we have a bug in xarray?

rpnaut commented Jun 21, 2018

Okay. Using the old resample syntax, I tried to compare the results of your modified code (min_count = 2) with those of the tagged version 0.10.7 (no min_count argument), but I am not sure whether this works. Does your code handle the keywords passed through the old resample method?

However, in the comparison I did not see the NaNs I expected over water.

Member Author

fujiisoup commented Jun 21, 2018

@shoyer, thanks for the details. I think I understand your idea.
This sounds like a cleaner solution.
I will update the code again, but it will take a few more days (or a week).

Member Author

fujiisoup commented Jun 21, 2018

Thanks, @rpnaut.
Actually, I am also changing the code around sum, so it looks like my change caused the bug you reported.
I think we do not have good test coverage for Dataset reductions.

I will add tests for this as well. Thanks again for testing :)


          More refactoring based on the comments

7aedd02

rpnaut commented Aug 9, 2018

To wrap up: your implementation works for time-series data. There is something strange with time-space data, which should be fixed. Once this is fixed, it would be worth testing in my evaluation environment.
Do you have any idea why the new syntax behaves so strangely? Shall we put the bug on the issue list?

It would also be interesting to have the min_count argument available for the old syntax in the future, not only for the new one. The reason: with the new syntax the dimension name is no longer flexible, since it cannot be a variable like dim=${dim}

rpnaut commented Aug 10, 2018

OK. The strange behaviour with the spatial dimensions arises because the new syntax requires the user to state exactly which dimension the resampling reduction (such as sum) should be applied along. The syntax is now data.resample(time="M").sum(dim="time",min_count=1).

That is odd: the dimension has to be given twice. However, with this syntax xarray no longer sums all values it finds in the DataArray for, e.g., the first month, but only the values along the time dimension.
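
The difference between the two behaviours can be illustrated with plain NumPy. This is only an analogy, not xarray's resample implementation; the array layout (time, rlat, rlon) mirrors the dataset above, and the variable names and the tiny array sizes are made up for the example.

```python
import numpy as np

# Toy data with layout (time, rlat, rlon): 4 time steps on a 2 x 3 grid.
data = np.ones((4, 2, 3))

# Hypothetical grouping: the first two time steps belong to month 0,
# the last two to month 1.
month_of_step = np.array([0, 0, 1, 1])
months = (0, 1)

# Summing over every axis of each monthly group collapses the grid,
# leaving a single number per month (what the reports above show).
per_month_all_axes = np.array(
    [data[month_of_step == m].sum() for m in months]
)

# Summing only along the time axis keeps the spatial dimensions.
per_month_time_only = np.stack(
    [data[month_of_step == m].sum(axis=0) for m in months]
)
```

Here `per_month_all_axes` has shape `(2,)`, while `per_month_time_only` has shape `(2, 2, 3)`, which corresponds to passing the time dimension explicitly to the reduction.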

And now the good news: I got the following picture with your 'min_count=0' and 'min_count=1':

fujiisoup added 2 commits

August 11, 2018 10:49


          Merge branch 'master' into refactor_nanops

a65c579


          Add tests for dataset

5c82628

Member Author

fujiisoup commented Aug 11, 2018

Thanks for testing.

Your min_count argument is not allowed for type 'dataset' but only for type 'dataarray'.
Starting with the dataset located here:

I do not think that is true.
It works on Dataset, but not on resampled objects. I will raise an issue for this later.


          Add tests with resample.

06319ac

stickler-ci reviewed

xarray/tests/test_dataset.py Outdated

    
                      actual = ds.resample(time='1D').sum(min_count=1)

                      expected = xr.concat([

                          ds.isel(time=slice(i*4, (i+1)*4)).sum('time', min_count=1)

Contributor

stickler-ci Aug 11, 2018

E226 missing whitespace around arithmetic operator


          lint

737118e

Member Author

fujiisoup commented Aug 11, 2018

I noticed that min_count also works on resampled objects.
Your issue might be with the resample API itself.


          updated whatsnew

85b5650

Member Author

fujiisoup commented Aug 11, 2018

Can anyone give further review?

fujiisoup mentioned this pull request

Inconsistent results when calculating sums on float32 arrays w/ bottleneck installed #2370

Closed

shoyer reviewed

Member

shoyer left a comment

some minor comments -- generally this looks like a nice improvement, though!

xarray/core/nanops.py Outdated

    
                  type """

                  valid_count = count(value, axis=axis)

                  value = fillna(value, fill_value)

                  data = getattr(np, func)(value, axis=axis, **kwargs)

Member

shoyer Aug 16, 2018

This function is doing eager evaluation by calling the numpy function -- can we call the dask function instead on dask arrays?

Alternatively, we could make this function error on dask arrays? I just don't want to silently compute them.

xarray/core/nanops.py Outdated

    
                  valid_count = count(value, axis=axis)

                  value = fillna(value, fill_value)

                  data = getattr(np, func)(value, axis=axis, **kwargs)

                  # dask seems return non-integer type

Member

shoyer Aug 16, 2018

It would be good to file an upstream bug report with dask about this

Member Author

fujiisoup Aug 16, 2018

This looks like a misunderstanding, or it has already been fixed upstream.
Removed this line.

xarray/core/nanops.py

    
                  if isinstance(value, dask_array_type):

                      data = data.astype(int)

                  if (valid_count == 0).any():

Member

shoyer Aug 16, 2018

This would also evaluate dask arrays, which can be costly. I don't know if there's much to do about this but it should be noted.

xarray/core/nanops.py Outdated

    
                  if isinstance(a, dask_array_type):

                      return dask_array.nanmax(a, axis=axis)

                  return nputils.nanmax(a, axis=axis)

Member

shoyer Aug 16, 2018

Maybe this could be more cleanly written as:

    module = dask_array if isinstance(a, dask_array_type) else nputils
    return module.nanmax(a, axis=axis)

xarray/core/nanops.py Outdated

    
                  if mask is not None:

                      mask = np.all(mask, axis=axis)

                      if np.any(mask):

Member

shoyer Aug 16, 2018

It's more dask friendly to use the .all() and .any() methods -- that way at least most of the check can be done in parallel, and only a single value (the boolean scalar) gets evaluated.
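
A minimal sketch of the method-based check, shown here with a NumPy array. The helper name `has_all_nan_slice` is hypothetical. The assumption is that the same code would stay lazy on a dask array until the final `bool()` call, because `.all()` and `.any()` are methods of the array itself rather than NumPy functions applied to it.

```python
import numpy as np


def has_all_nan_slice(values, axis):
    """Return True if any slice along `axis` contains only NaN."""
    mask = np.isnan(values)
    # Use the array's own .all()/.any() methods instead of np.all/np.any,
    # so that a lazy array type can build the reduction itself and only
    # the final scalar has to be evaluated by bool().
    return bool(mask.all(axis=axis).any())


values = np.array([[1.0, np.nan],
                   [2.0, np.nan]])
column_check = has_all_nan_slice(values, axis=0)  # second column is all NaN
row_check = has_all_nan_slice(values, axis=1)     # every row has a valid value
```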

xarray/tests/test_duck_array_ops.py Outdated

    
                          # also check ddof!=0 case

                          actual = getattr(da, func)(skipna=skipna, dim=aggdim, ddof=5)

                          if dask:

                              isinstance(da.data, dask_array_type)

Member

shoyer Aug 16, 2018

Needs assert?

xarray/tests/test_duck_array_ops.py Outdated

    
                      da = construct_dataarray(dim_num, dtype, contains_nan=False, dask=dask)

                      actual = getattr(da, func)(skipna=skipna)

                      if dask:

                          isinstance(da.data, dask_array_type)

Member

shoyer Aug 16, 2018

also needs assert
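
The reason both spots need an assert: a bare `isinstance(...)` expression evaluates to True or False, but the result is discarded, so the test can never fail on it. A small self-contained illustration, with hypothetical function names:

```python
def check_without_assert(obj, expected_type):
    # The result of isinstance() is computed and then thrown away,
    # so this line can never make a test fail.
    isinstance(obj, expected_type)
    return "passed"


def check_with_assert(obj, expected_type):
    # With assert, a wrong type raises AssertionError and the test fails.
    assert isinstance(obj, expected_type)
    return "passed"


# An int is not a str, yet the first check still "passes".
silent_result = check_without_assert(1, str)

try:
    check_with_assert(1, str)
    caught = False
except AssertionError:
    caught = True
```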

fujiisoup added 5 commits

August 16, 2018 12:03


          Merge branch 'master' into refactor_nanops

623016b


          Revise from comments.

015e85c


          Use .any and .all method instead of np.any / np.all

01a1419


          Avoid using numpy methods

a5b18fc


          Avoid casting to int for dask array

e4e1d1e

Member Author

fujiisoup commented Aug 16, 2018

Thanks, @shoyer.
All done.

shoyer approved these changes

Member

shoyer left a comment

looks good to me, thanks!

doc/whats-new.rst Outdated

    
              - Follow up the renamings in dask; from dask.ghost to dask.overlap

                By `Keisuke Fujii <https://github.com/fujiisoup>`_.

                By `Tony Tung <https://github.com/ttung>`_.

Member

shoyer Aug 16, 2018

this should reference the line above, not your change here


          Update whatsnew

b72a1c8

fujiisoup merged commit 0b9ab2d into pydata:master

Member Author

fujiisoup commented Aug 16, 2018

Thanks for the review. Merging.

fujiisoup deleted the refactor_nanops branch

August 16, 2018 06:59

Contributor

st-bender commented Sep 26, 2018

Hi,
just to let you know that .std() no longer accepts the ddof keyword (it worked in 0.10.8).
Should I open a new bug report?

Edit:
It fails with:

~/Work/miniconda3/envs/stats/lib/python3.6/site-packages/xarray/core/duck_array_ops.py in f(values, axis, skipna, **kwargs)
    234 
    235         try:
--> 236             return func(values, axis=axis, **kwargs)
    237         except AttributeError:
    238             if isinstance(values, dask_array_type):

TypeError: nanstd() got an unexpected keyword argument 'ddof'
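
The traceback is consistent with a `nanstd` wrapper whose signature does not accept `ddof`. The sketch below reproduces that kind of TypeError with two hypothetical wrappers around `np.nanstd`. It is not the actual xarray code, only an illustration of how a fixed signature drops a keyword that a forwarding signature would pass through.

```python
import numpy as np


def nanstd_fixed_signature(values, axis=None):
    """Hypothetical wrapper that forgets to accept extra keywords."""
    return np.nanstd(values, axis=axis)


def nanstd_forwarding(values, axis=None, **kwargs):
    """Hypothetical wrapper that forwards keywords such as ddof."""
    return np.nanstd(values, axis=axis, **kwargs)


values = np.array([1.0, 2.0, 3.0, np.nan])

try:
    nanstd_fixed_signature(values, ddof=1)
    error_message = None
except TypeError as err:
    error_message = str(err)

# With forwarding, ddof reaches np.nanstd and the sample standard
# deviation of the three valid values is returned.
sample_std = nanstd_forwarding(values, ddof=1)
```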

fujiisoup mentioned this pull request

ddof does not working with 0.10.9 #2440

Closed

Member Author

fujiisoup commented Sep 26, 2018

Thanks, @st-bender, for the bug report.
I copied your comment to #2440.

fujiisoup mentioned this pull request

restore ddof support in std #2447

Merged

4 tasks

DimitriPapadopoulos mentioned this pull request

Apply assorted ruff preview rules #10465

Merged

4 tasks

keewis mentioned this pull request

Support taking the mean over xarray(pint(numpy(uncertainties))) #10904

Open

3 tasks
