Dask masked-data stats #2743
Conversation
lib/iris/analysis/__init__.py
Outdated
    da.core.elemwise(ma.getmaskarray, array),
    axis=axis)
masked_point_fractions = (point_mask_counts + 0.0) / point_counts
# Note: the +0.0 forces a floating-point divide.
I know it's not your implementation, but would you mind just adding a from __future__ import division and getting rid of the + 0.0 & the comment?
It was my implementation, just not on this iteration!
I'd rather retain some explicit indication of "using floating division on ints here", for code clarity.
The __future__ route makes the resulting code "tidier", but I don't think it's as clear.
The + 0.0 is a pretty old-fashioned approach, though, so maybe I should use 'np.float' instead, which is a more explicit statement of what is required.
Would you prefer that?
/ is the explicit indication of floating division; // is the explicit indication of integer division. Every file should already have the __future__ import and be using Python 3 semantics. This should not be surprising.
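For reference, a minimal sketch of the two spellings under discussion; the variable names here are illustrative, not the actual Iris code:

from __future__ import division  # Python-3 '/' semantics on Python 2

point_mask_counts = 3  # hypothetical integer counts
point_counts = 4

# Old-fashioned spelling: '+ 0.0' coerces to float before dividing.
fraction_old = (point_mask_counts + 0.0) / point_counts

# With the __future__ import, '/' is already true division on ints,
# and '//' is the explicit spelling of integer division.
fraction_new = point_mask_counts / point_counts

assert fraction_old == fraction_new == 0.75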
lib/iris/analysis/__init__.py
Outdated
point_mask_counts = da.sum(da.isnan(array), axis=axis)
# Build a lazy computation to compare the fraction of missing
# input points at each output point to the 'mdtol' threshold.
point_counts = da.sum(da.ones(array.shape, chunks=array.chunks),
Why isn't this just np.prod(array.shape)?
Because the summation may be over one or multiple dimensions (axes).
E.g. if you have shape=(2, 3, 4), some alternatives might be:
>>> a = np.ones((2,3,4))
>>> a
array([[[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]],
[[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]]])
>>> np.sum(a, axis=1)
array([[ 3., 3., 3., 3.],
[ 3., 3., 3., 3.]])
>>> np.sum(a, axis=-1)
array([[ 4., 4., 4.],
[ 4., 4., 4.]])
>>> np.sum(a)
24.0
>>> np.sum(a, axis=(1,2))
array([ 12., 12.])
There is almost certainly a better way of doing this, but I thought it would be okay; at least it ensures that we treat the 'axis' parameter in the prescribed numpy manner, matching how it is applied in the 'main' statistical operation.
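For what it's worth, a sketch of one such alternative: counting the collapsed points directly from the shape, normalising 'axis' in the numpy manner. The helper name here is hypothetical, not code from this PR, though a later commit in this thread takes a similar route:

import numpy as np

def collapsed_point_count(shape, axis):
    # Interpret 'axis' as numpy reductions do:
    # None means all axes; otherwise an int or a tuple of ints.
    if axis is None:
        axes = range(len(shape))
    elif np.isscalar(axis):
        axes = [axis]
    else:
        axes = axis
    # Multiply the collapsed dimension sizes together.
    return int(np.prod([shape[i] for i in axes]))

assert collapsed_point_count((2, 3, 4), axis=1) == 3
assert collapsed_point_count((2, 3, 4), axis=-1) == 4
assert collapsed_point_count((2, 3, 4), axis=(1, 2)) == 12
assert collapsed_point_count((2, 3, 4), axis=None) == 24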
lib/iris/analysis/__init__.py
Outdated
-dask_result = dask_nanfunction(array, axis=axis, **kwargs)
+# Call the statistic to get the basic result (missing-data tolerant).
+dask_result = dask_stats_function(array, axis=axis, **kwargs)
 if mdtol is None:
Or mdtol == 1?
It does always bother me to do equality testing on floats, though I suppose this is only a shortcut.
So maybe this should say mdtol is None or mdtol >= 1.0,
making any mdtol > 1 equivalent to mdtol == 1.
That makes practical sense, meaning "valid if there is even one good point".
(As internally in the code, we have boolean_mask = masked_point_fractions > mdtol.)
However, it's slightly inconsistent with the effect of mdtol < 0.0:
whereas "mdtol=0" means "no missing points allowed", "mdtol<0" will mask everything, regardless.
though I suppose this is only a shortcut
Yep.
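To make those edge cases concrete, a small sketch; the values and the apply_mdtol helper are made up, and only the 'fractions > mdtol' comparison comes from the code under review:

import numpy as np
import numpy.ma as ma

# Fractions of missing input points behind three output points.
fractions = np.array([0.0, 0.5, 1.0])
results = np.array([1.0, 2.0, 3.0])

def apply_mdtol(results, fractions, mdtol):
    # Mask wherever the missing-data fraction exceeds the tolerance.
    return ma.masked_array(results, fractions > mdtol)

print(apply_mdtol(results, fractions, 0.0))   # [1.0 -- --]  no missing data allowed
print(apply_mdtol(results, fractions, 1.0))   # [1.0 2.0 3.0]  nothing masked
print(apply_mdtol(results, fractions, -0.1))  # [-- -- --]  everything masked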
Latest changes in response to dask/dask#2301 (comment).
Now done. Is this now good to go @pelson?
lib/iris/analysis/__init__.py
Outdated
# Note: the +0.0 forces a floating-point divide.
# Build a lazy computation to compare the fraction of missing
# input points at each output point to the 'mdtol' threshold.
point_counts = da.sum(da.ones(array.shape, chunks=array.chunks),
Does this need to be lazy?
No; in fact, it used not to be.
But I think it's arguably better if it is: especially if this were quite a large calculation, this way it could be chunked.
As explained in my previous reply to @pelson above, there is doubtless a better way of doing this than constructing another large array and collapsing it, but making it a collapse does have the advantage of guaranteeing the same interpretation of the 'axis' control as for the main statistic (assuming they do all work the same way, that is).
This shouldn't be filling the array at all. This is the only real blocker to this being merged IMO.
How about you get the consistency without the cost of the ones...
>>> shape = [1201230123, 123123, 123123, 1231, 3123123]
>>> a = da.empty(shape, chunks=shape)
>>>
>>> a.shape
(1201230123, 123123, 123123, 1231, 3123123)
>>> r = da.sum(a, axis=[1, 2])
>>> r.shape
(1201230123, 1231, 3123123)
>>> np.prod(r.shape)
4618206582709412799
lib/iris/analysis/__init__.py
Outdated
 The returned value is a new function operating on dask arrays.
-It has the call signature "stat(data, axis=-1, mdtol=None, *kwargs)".
+It has the call signature "stat(data, axis=-1, mdtol=None, **kwargs)".
In general, I encourage you to use back-ticks:
`stat(data, axis=-1, mdtol=None, **kwargs)`
vs
stat(data, axis=-1, mdtol=None, **kwargs)
None of this is public, so you never get to see a rendered version anyway!
Oh, I get it! Now fixed that, and added testing for all axis possibilities: None / single / multiple.
lib/iris/analysis/__init__.py
Outdated
# Multiply the sizes of the collapsed dimensions, to get
# the total number of input points at each output point.
point_counts = np.prod([array.shape[axis_index]
                        for axis_index in axis_indices])
How about:
point_counts = np.prod(array.shape) / np.prod(dask_result.shape)
Dead right!! 👍
But I get even more: since "np.prod(a.shape)" == "a.size", we can just use that.
See the following commit.
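A sketch of that final form; the array here is illustrative, and it assumes the statistic interprets 'axis' the same way as the size calculation:

import dask.array as da

array = da.ones((4, 5, 6), chunks=(4, 5, 6))
dask_result = da.sum(array, axis=(0, 2))
# Since np.prod(array.shape) == array.size, the number of input
# points per output point is just the ratio of the two sizes.
point_counts = array.size // dask_result.size
assert point_counts == 4 * 6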
(Force-pushed ee592c2 to 0423c0c.)
Fully rebased to get tests re-checking.
One tiny fix. Tests now passing; can you please check this out @djkirkham?
There is still a test skip here:
Other than that it looks good.
Good spot!
Fixed!!
Contributes to #2717