Skip to content

Conversation

@DPeterK
Copy link
Member

@DPeterK DPeterK commented Aug 3, 2017

Enables lazy masked fill_value retrieval for NetCDF save.

Dask exposes no mechanism for retrieving the fill value of a lazy masked array (indeed, arguably this is currently not a soluble problem). This PR proposes a workaround for this problem, exposes the workaround as a function in iris._lazy_data and integrates the function into the NetCDF saver.

Specifically, the workaround realises the smallest possible slice of the lazy masked data array that, when realised, contains the required fill value. Turns out the smallest possible slice is actually an empty slice:

>>> m = np.ma.masked_array([0, 1, 2], mask=[1, 0, 1], fill_value=9999)
>>> dm = da.from_array(m, chunks=(3,), asarray=False)
>>> dm[:0].compute().fill_value
9999

Fixes #2703.

@DPeterK DPeterK requested review from bjlittle and lbdreyer August 3, 2017 13:41
@DPeterK
Copy link
Member Author

DPeterK commented Aug 3, 2017

Note: no tests yet! I thought I'd push the changes so that people could get eyes on the proposed functionality.

@DPeterK
Copy link
Member Author

DPeterK commented Aug 3, 2017

Now with unit tests! Still got a bunch of existing test failures to deal with, though.

if is_lazy_data(data):
inds = tuple([0] * (data.ndim-1))
smallest_slice = data[inds][:0]
data = as_concrete_data(smallest_slice)
Copy link
Member

@bjlittle bjlittle Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dkillick You might want coverage for 0-dim arrays and MaskedConstants ... I've seen those being passed around on my travels

Copy link
Member Author

@DPeterK DPeterK Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should work; see these tests... although I'm not convinced that they cover the MaskedConstant case. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would one produce a MaskedConstant to test on? They seem to be things you can only make when you don't want to (i.e. I now want to and I can't manage to make one!) 😒

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following works:

>>> a = ma.masked_array([4], mask=True)
>>> masked_constant = a[0]
>>> print type(masked_constant)
<class 'numpy.ma.core.MaskedConstant'>

from iris._lazy_data import lazy_masked_fill_value, _MAX_CHUNK_SIZE


class Test_as_lazy_data(tests.IrisTest):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs renaming

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot! Thought I'd got away with copying an existing file...

return data


def lazy_masked_fill_value(data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you but into a rebrand, say "lazy_fill_value" instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're going to have to try harder than that to convince me 😉

I guess that...

  • I only care about fill values if I have masked data,
  • so 'masked' could be omitted in that sense, but
  • clarity is a good thing – it's in the masked data case that I want to retrieve the fill value.

So I'm not currently buying into a rebrand! What do you see as the benefit of doing so?

self.fill_value = 9999
self.m = ma.masked_array(data, mask=mask, fill_value=self.fill_value)
self.dm = da.from_array(self.m, asarray=False,
chunks=_MAX_CHUNK_SIZE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you not just call iris._lazy_data.as_lazy_data

Copy link
Member Author

@DPeterK DPeterK Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could, but I don't see how that benefits me / improves the existing approach... So if you have a good benefit / improvement for changing this I'd love to hear it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that's just the argument we initially had of whether to include as_lazy_data.
We chose to put all our dealing with dask in a single module (iris._lazy_data) rather than having it dotted throughout the iris codebase

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine... apart from the fact that it doesn't hold in the tests, which make widespread use of da.from_array (including the test module I duplicated to make this one).

Either way, this intermediate step has now been banked, so do feel free to change this in a follow-up PR 😄

if is_lazy_data(data):
inds = tuple([0] * (data.ndim-1))
smallest_slice = data[inds][:0]
data = as_concrete_data(smallest_slice)
Copy link
Member

@lbdreyer lbdreyer Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why calculate inds, why not just use [:0]

i.e.

    if is_lazy_data(data):
        smallest_slice = data[:0]
        data = as_concrete_data(smallest_slice)

Copy link
Member Author

@DPeterK DPeterK Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting idea! I was going to say "Because then it won't be the smallest slice!", which turns out to not quite be accurate (look at the shape of the resultant array 😱):

z = np.arange(24).reshape(2,3,4)
z[:0]
array([], shape=(0, 3, 4), dtype=int64)

Apparently (according to an offline conversation) doing the suggested also causes the 0D case to not work properly. So, while I'm 👍 for reducing SLOC and code complexity, I think the extra line in this case adds some worthwhile resilience.

@QuLogic QuLogic added this to the dask-mask milestone Aug 3, 2017
@bjlittle
Copy link
Member

bjlittle commented Aug 4, 2017

@dkillick Awesome thanks 👍

I'm going to bank this PR and follow-up with any/all rework in another PR, given your unavailability.

@bjlittle bjlittle merged commit 9ed20b7 into SciTools:dask_mask_array Aug 4, 2017
@bjlittle bjlittle self-assigned this Aug 4, 2017
@DPeterK DPeterK deleted the nc_lazy_fillvalue branch August 4, 2017 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants