Conversation

@pp-mo (Member) commented Aug 18, 2017

Contributes to #2717

@pp-mo pp-mo changed the title Dask odd mask fixes Dask masked-data stats Aug 18, 2017
@pp-mo pp-mo mentioned this pull request Aug 18, 2017
@pp-mo pp-mo requested a review from djkirkham August 18, 2017 14:26
@pp-mo pp-mo mentioned this pull request Aug 18, 2017
da.core.elemwise(ma.getmaskarray, array),
axis=axis)
masked_point_fractions = (point_mask_counts + 0.0) / point_counts
# Note: the +0.0 forces a floating-point divide.
Member:
I know it's not your implementation, but would you mind just adding a `from __future__ import division` and getting rid of the `+ 0.0` and the comment?

@pp-mo (Member, Author) Aug 21, 2017:

It was my implementation, just not on this iteration!

I'd rather retain some explicit indication of "using floating division on ints here", for code clarity.
The `__future__` route makes the resulting code "tidier", but I don't think it's as clear.

The `+ 0.0` is a pretty old-fashioned approach, though, so maybe I should use `np.float` instead, which is a more explicit statement of what is required.
Would you prefer that?

Member:
`/` is the explicit indication of floating division; `//` is the explicit indication of integer division. Every file should already have the `__future__` import and be using Python 3 semantics. This should not be surprising.
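
For illustration, a minimal sketch contrasting the three styles under discussion; the arrays and values here are invented, not taken from the PR:

```python
from __future__ import division  # Python-3 semantics: '/' is true division

import numpy as np

point_mask_counts = np.array([0, 1, 3])  # invented integer counts
point_counts = 4                         # invented integer total

# Old style: '+ 0.0' promotes the integers to float before dividing.
fractions_old = (point_mask_counts + 0.0) / point_counts

# Explicit cast: states the required dtype directly.
fractions_cast = point_mask_counts.astype(np.float64) / point_counts

# __future__ style: '/' is already true division, so no coercion needed.
fractions_new = point_mask_counts / point_counts

assert np.allclose(fractions_old, fractions_cast)
assert np.allclose(fractions_old, fractions_new)
```

All three give the same result; the disagreement is purely about which reads most clearly.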

point_mask_counts = da.sum(da.isnan(array), axis=axis)
# Build a lazy computation to compare the fraction of missing
# input points at each output point to the 'mdtol' threshold.
point_counts = da.sum(da.ones(array.shape, chunks=array.chunks),
Member:
Why isn't this just `np.prod(array.shape)`?

@pp-mo (Member, Author) Aug 21, 2017:
Because the summation may be over one or multiple dimensions (axes).
E.g. if you have shape=(2, 3, 4), some alternatives might be...

>>> a = np.ones((2,3,4))
>>> a
array([[[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]],

       [[ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.]]])
>>> np.sum(a, axis=1)
array([[ 3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.]])
>>> np.sum(a, axis=-1)
array([[ 4.,  4.,  4.],
       [ 4.,  4.,  4.]])
>>> np.sum(a)
24.0
>>> np.sum(a, axis=(1,2))
array([ 12.,  12.])
>>> 

There is almost certainly a better way of doing this, but I thought it would be okay, and at least it ensures that we treat the 'axis' parameter in the prescribed numpy manner, so as to match how it is applied in the 'main' statistical operation.
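
A rough dask sketch of the same idea, mirroring the line under review; the shape and chunks here are invented:

```python
import dask.array as da

# Invented shape and chunks, purely for illustration.
array = da.ones((2, 3, 4), chunks=(1, 3, 4))

# Summing an all-ones array over the collapsed axes yields, at each
# output point, the number of input points feeding into it -- and the
# 'axis' argument is interpreted exactly as the main statistic
# interprets it.
for axis in (None, 1, -1, (1, 2)):
    point_counts = da.sum(da.ones(array.shape, chunks=array.chunks),
                          axis=axis)
    print(axis, point_counts.compute())
```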

dask_result = dask_nanfunction(array, axis=axis, **kwargs)
# Call the statistic to get the basic result (missing-data tolerant).
dask_result = dask_stats_function(array, axis=axis, **kwargs)
if mdtol is None:
Member:
Or mdtol == 1?

pp-mo (Member, Author):
It does always bother me to do equality testing on floats, though I suppose this is only a shortcut.

So maybe this should say `mdtol is None or mdtol >= 1.0`, making any mdtol > 1 equivalent to mdtol == 1.
That makes practical sense, meaning "valid if there is even one good point".
(Internally, the code has `boolean_mask = masked_point_fractions > mdtol`.)

However, it's slightly inconsistent with the effect of mdtol < 0.0: whereas "mdtol=0" means "no missing points allowed", "mdtol<0" will mask everything, regardless.

Member:
> though I suppose this is only a shortcut

Yep.

@pp-mo (Member, Author) commented Aug 21, 2017

Improved use of dask-mask code.

Latest changes in response to dask/dask#2301 (comment)
It will also be nicer to use `da.ma.masked_array`, when he has added it.

@pp-mo (Member, Author) commented Aug 22, 2017

> use `da.ma.masked_array`, when he has added it.

Now done.
Also ditched the explicit cast for floating divides, as requested.

Is this now good to go, @pelson?
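
For reference, a hedged sketch of the `da.ma.masked_array` interface mentioned above; whether it is available depends on the dask version in use, and the data here is invented:

```python
import numpy as np
import dask.array as da

data = da.from_array(np.arange(6.0).reshape(2, 3), chunks=(1, 3))
mask = da.from_array(np.array([[False, True, False],
                               [False, False, True]]), chunks=(1, 3))

# Wrap data and mask together as a single lazy masked array.
lazy_masked = da.ma.masked_array(data, mask=mask)
print(lazy_masked.compute())  # computes to a numpy masked array
```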

# Note: the +0.0 forces a floating-point divide.
# Build a lazy computation to compare the fraction of missing
# input points at each output point to the 'mdtol' threshold.
point_counts = da.sum(da.ones(array.shape, chunks=array.chunks),
Contributor:
Does this need to be lazy?

@pp-mo (Member, Author) Aug 22, 2017:

No; in fact, it used not to be.
But I think it's arguably better if it is: if this were quite a large calculation, this way it can be chunked.

As explained in my previous reply to @pelson above, there is doubtless a better way of doing this than constructing another large array and collapsing it, but making it a collapse does have the advantage of guaranteeing the same interpretation of the 'axis' control as for the main statistic (assuming they all work the same way, that is).

Member:
This shouldn't be filling the array at all. This is the only real blocker to this being merged IMO.

@pelson (Member) Aug 25, 2017:
How about you get the consistency without the cost of the ones...

>>> shape = [1201230123, 123123, 123123, 1231, 3123123]
>>> a = da.empty(shape, chunks=shape)
>>> 
>>> a.shape
(1201230123, 123123, 123123, 1231, 3123123)
>>> r = da.sum(a, axis=[1, 2])
>>> r.shape
(1201230123, 1231, 3123123)
>>> np.prod(r.shape)
4618206582709412799

The returned value is a new function operating on dask arrays.
It has the call signature "stat(data, axis=-1, mdtol=None, *kwargs)".
It has the call signature "stat(data, axis=-1, mdtol=None, **kwargs)".
Member:
In general, I encourage you to use back-ticks:

"stat(data, axis=-1, mdtol=None, **kwargs)"

vs

stat(data, axis=-1, mdtol=None, **kwargs)

pp-mo (Member, Author):

None of this is public, so you never get to see a rendered version anyway!

@pp-mo (Member, Author) commented Aug 25, 2017

> How about you get the consistency without the cost of the ones...

Oh, I get it!
Stupidly, I had not spotted that this is always "just a number"
-- so, as I had it previously, an array filled with a single repeated value.

Now fixed, and added testing for all axis possibilities: None / single / multiple.

# Multiply the sizes of the collapsed dimensions, to get
# the total number of input points at each output point.
point_counts = np.prod([array.shape[axis_index]
                        for axis_index in axis_indices])
Contributor:
How about:

point_counts = np.prod(array.shape) / np.prod(dask_result.shape)

pp-mo (Member, Author):

Dead right!! 👍
But I get even more: since `np.prod(a.shape)` is just `a.size`, we can use that instead.
See following commit.
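
Putting the two observations together, a hedged sketch of the resulting count calculation; the shape here is invented:

```python
import dask.array as da

# Invented shape, following the da.empty trick above: nothing is filled.
array = da.empty((2, 3, 4), chunks=(2, 3, 4))
dask_result = da.sum(array, axis=(1, 2))

# Every output point collapses the same number of input points, so the
# count is just the ratio of the two sizes.
point_counts = array.size // dask_result.size  # == 3 * 4 == 12
print(point_counts)
```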

@pp-mo pp-mo closed this Oct 11, 2017
@pp-mo pp-mo reopened this Oct 11, 2017
@pp-mo pp-mo force-pushed the dask_odd_mask_fixes branch from ee592c2 to 0423c0c October 11, 2017 15:04
@pp-mo (Member, Author) commented Oct 11, 2017

Fully rebased to get tests re-checking.

@pp-mo (Member, Author) commented Oct 11, 2017

One tiny fix.
I checked all the comments above, and I think all outstanding work is done.

Tests now passing; can you please check this out, @djkirkham?

@djkirkham (Contributor) commented

There is still a test skip here:

Other than that it looks good.

@pp-mo (Member, Author) commented Oct 12, 2017

> still a test skip here:

Good spot!

@pp-mo (Member, Author) commented Oct 12, 2017

Fixed!!
Please re-review, @djkirkham.

@djkirkham djkirkham merged commit 49a930e into SciTools:dask_mask_array Oct 12, 2017
@pp-mo pp-mo deleted the dask_odd_mask_fixes branch October 12, 2017 16:05