Dask masked-data stats #2743
Conversation
lib/iris/analysis/__init__.py
Outdated
    da.core.elemwise(ma.getmaskarray, array),
    axis=axis)
masked_point_fractions = (point_mask_counts + 0.0) / point_counts
# Note: the +0.0 forces a floating-point divide.
I know it's not your implementation, but would you mind just adding a from __future__ import division and getting rid of the + 0.0 & the comment?
It was my implementation, just not on this iteration!
I'd rather retain some explicit indication of "using floating division on ints here", for code clarity.
The __future__ route makes the resulting code "tidier", but I don't think it's as clear.
The + 0.0 is a pretty old-fashioned approach, though, so maybe I should use 'np.float' instead, which is a more explicit statement of what is required.
Would you prefer that?
/ is the explicit indication of floating division; // is the explicit indication of integer division. Every file should already have the __future__ import and be using Python 3 semantics. This should not be surprising.
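For reference, a minimal sketch of the two spellings under discussion; the variable names here are illustrative, not the actual Iris code:

from __future__ import division  # Python-3 '/' semantics on Python 2

point_mask_counts = 3  # hypothetical integer counts
point_counts = 4

# Old-fashioned spelling: '+ 0.0' coerces to float before dividing.
fraction_old = (point_mask_counts + 0.0) / point_counts

# With the __future__ import, '/' is already true division on ints,
# and '//' is the explicit spelling of integer division.
fraction_new = point_mask_counts / point_counts

assert fraction_old == fraction_new == 0.75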
lib/iris/analysis/__init__.py
Outdated
point_mask_counts = da.sum(da.isnan(array), axis=axis)
# Build a lazy computation to compare the fraction of missing
# input points at each output point to the 'mdtol' threshold.
point_counts = da.sum(da.ones(array.shape, chunks=array.chunks),
Why isn't this just np.prod(array.shape)?
Because the summation may be over one or multiple dimensions (axes).
E.g. if you have shape=(2, 3, 4), some alternatives might be:
>>> a = np.ones((2,3,4))
>>> a
array([[[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]],
[[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]]])
>>> np.sum(a, axis=1)
array([[ 3., 3., 3., 3.],
[ 3., 3., 3., 3.]])
>>> np.sum(a, axis=-1)
array([[ 4., 4., 4.],
[ 4., 4., 4.]])
>>> np.sum(a)
24.0
>>> np.sum(a, axis=(1,2))
array([ 12., 12.])
There is almost certainly a better way of doing this, but I thought it would be okay; at least it ensures that we treat the 'axis' parameter in the prescribed numpy manner, matching how it is applied in the 'main' statistical operation.
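For what it's worth, a sketch of one such alternative: counting the collapsed points directly from the shape, normalising 'axis' in the numpy manner. The helper name here is hypothetical, not code from this PR, though a later commit in this thread takes a similar route:

import numpy as np

def collapsed_point_count(shape, axis):
    # Interpret 'axis' as numpy reductions do:
    # None means all axes; otherwise an int or a tuple of ints.
    if axis is None:
        axes = range(len(shape))
    elif np.isscalar(axis):
        axes = [axis]
    else:
        axes = axis
    # Multiply the collapsed dimension sizes together.
    return int(np.prod([shape[i] for i in axes]))

assert collapsed_point_count((2, 3, 4), axis=1) == 3
assert collapsed_point_count((2, 3, 4), axis=-1) == 4
assert collapsed_point_count((2, 3, 4), axis=(1, 2)) == 12
assert collapsed_point_count((2, 3, 4), axis=None) == 24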
lib/iris/analysis/__init__.py
Outdated
-dask_result = dask_nanfunction(array, axis=axis, **kwargs)
+# Call the statistic to get the basic result (missing-data tolerant).
+dask_result = dask_stats_function(array, axis=axis, **kwargs)
 if mdtol is None:
Or mdtol == 1?
It does always bother me to do equality testing on floats, though I suppose this is only a shortcut.
So maybe this should say mdtol is None or mdtol >= 1.0,
making any mdtol > 1 equivalent to mdtol == 1.
That makes practical sense, meaning "valid if there is even one good point".
(As internally in the code, we have boolean_mask = masked_point_fractions > mdtol.)
However, it's slightly inconsistent with the effect of mdtol < 0.0:
whereas "mdtol=0" means "no missing points allowed", "mdtol<0" will mask everything, regardless.
though I suppose this is only a shortcut
Yep.
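To make those edge cases concrete, a small sketch; the values and the apply_mdtol helper are made up, and only the 'fractions > mdtol' comparison comes from the code under review:

import numpy as np
import numpy.ma as ma

# Fractions of missing input points behind three output points.
fractions = np.array([0.0, 0.5, 1.0])
results = np.array([1.0, 2.0, 3.0])

def apply_mdtol(results, fractions, mdtol):
    # Mask wherever the missing-data fraction exceeds the tolerance.
    return ma.masked_array(results, fractions > mdtol)

print(apply_mdtol(results, fractions, 0.0))   # [1.0 -- --]  no missing data allowed
print(apply_mdtol(results, fractions, 1.0))   # [1.0 2.0 3.0]  nothing masked
print(apply_mdtol(results, fractions, -0.1))  # [-- -- --]  everything masked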
Latest changes in response to dask/dask#2301 (comment).
Now done. Is this now good to go @pelson?
lib/iris/analysis/__init__.py
Outdated
# Note: the +0.0 forces a floating-point divide.
# Build a lazy computation to compare the fraction of missing
# input points at each output point to the 'mdtol' threshold.
point_counts = da.sum(da.ones(array.shape, chunks=array.chunks),
Does this need to be lazy?
No; in fact, it used not to be.
But I think it's arguably better if it is: especially if this were quite a large calculation, this way it could be chunked.
As explained in my previous reply to @pelson above, there is doubtless a better way of doing this than constructing another large array and collapsing it, but making it a collapse does have the advantage of guaranteeing the same interpretation of the 'axis' control as for the main statistic (assuming they do all work the same way, that is).
This shouldn't be filling the array at all. This is the only real blocker to this being merged IMO.
How about you get the consistency without the cost of the ones...
>>> shape = [1201230123, 123123, 123123, 1231, 3123123]
>>> a = da.empty(shape, chunks=shape)
>>>
>>> a.shape
(1201230123, 123123, 123123, 1231, 3123123)
>>> r = da.sum(a, axis=[1, 2])
>>> r.shape
(1201230123, 1231, 3123123)
>>> np.prod(r.shape)
4618206582709412799
lib/iris/analysis/__init__.py
Outdated
 The returned value is a new function operating on dask arrays.
-It has the call signature "stat(data, axis=-1, mdtol=None, *kwargs)".
+It has the call signature "stat(data, axis=-1, mdtol=None, **kwargs)".
In general, I encourage you to use back-ticks:
`stat(data, axis=-1, mdtol=None, **kwargs)`
vs
stat(data, axis=-1, mdtol=None, **kwargs)
None of this is public, so you never get to see a rendered version anyway!
Oh, I get it! Now fixed that, and added testing for all axis possibilities: None / single / multiple.
lib/iris/analysis/__init__.py
Outdated
# Multiply the sizes of the collapsed dimensions, to get
# the total number of input points at each output point.
point_counts = np.prod([array.shape[axis_index]
                        for axis_index in axis_indices])
How about:
point_counts = np.prod(array.shape) / np.prod(dask_result.shape)
Dead right!! 👍
But I get even more: since "np.prod(a.shape)" == "a.size", we can just use that.
See the following commit.
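A sketch of that final form; the array here is illustrative, and it assumes the statistic interprets 'axis' the same way as the size calculation:

import dask.array as da

array = da.ones((4, 5, 6), chunks=(4, 5, 6))
dask_result = da.sum(array, axis=(0, 2))
# Since np.prod(array.shape) == array.size, the number of input
# points per output point is just the ratio of the two sizes.
point_counts = array.size // dask_result.size
assert point_counts == 4 * 6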
(Force-pushed ee592c2 to 0423c0c.)
Fully rebased to get tests re-checking.
One tiny fix. Tests now passing; can you please check this out @djkirkham?
There is still a test skip here:
Other than that it looks good.
Good spot!
Fixed!!
Contributes to #2717