Integrate dask masked array support (just code changes, no tests fix) #2699

lbdreyer · 2017-07-31T10:19:23Z

This is pointing at the dask_mask_array feature branch.

This contains just the code changes that I think are required to integrate the dask masked array support into Iris.

This will obviously break a bunch of tests but we can address that later.

lbdreyer · 2017-07-31T10:37:58Z

I'm pretty happy with the handling of fill value:

>>> x = ma.masked_array([1,2,3,4], mask=[0,1,0,0], fill_value=10)
>>> dx = da.from_array(x, chunks=(2,), asarray=False)
>>> cdx = dx.compute()
>>> print cdx.fill_value
10

Combining two daskified masked arrays was interesting though:

>>> x = ma.masked_array([1,2,3,4], mask=[0,1,0,0], fill_value=15)
>>> y = ma.masked_array([1,2,3,4], mask=[0,1,0,0], fill_value=32, dtype=np.float64)
>>> dx = da.from_array(x, chunks=(2,), asarray=False)
>>> dy = da.from_array(y, chunks=(2,), asarray=False)
>>> dxy = dx * dy
>>> cdxy = dxy.compute()
>>> print cdxy.fill_value
15.0

bjlittle · 2017-08-01T10:56:44Z

lib/iris/_lazy_data.py

-            data = array_masked_to_nans(data)
-        data = da.from_array(data, chunks=chunks)
+            asarray = False
+        data = da.from_array(data, chunks=chunks, asarray=asarray)


@lbdreyer This change could be simply reduced to:

asarray = not ma.isMaskedArray(data)
data = da.from_array(data, chunks=chunks, asarray=asarray)

bjlittle · 2017-08-01T11:33:49Z

bjlittle · 2017-08-01T13:26:01Z

bjlittle · 2017-08-01T13:40:21Z

lib/iris/_merge.py

-                              fill_value, cube._cell_measures_and_dims)
+
+        return _CubeSignature(cube.metadata, cube.shape,
+                              cube.lazy_data().dtype,


@lbdreyer This should just be cube.dtype right? ...

Addressed in the most recent commit

bjlittle · 2017-08-01T13:44:10Z

lib/iris/coords.py

        if is_lazy_data(data) and data.dtype.kind in 'biu':
-            # Disallow lazy integral data, as it will cause problems with dask
-            # if it turns out to contain any masked points.
            # Non-floating cell measures are not valid up to CF v1.7 anyway,


@lbdreyer Minor, but you can remove the anyway, from the end of this comment.

Addressed in the most recent commit

bjlittle · 2017-08-01T13:49:39Z

lib/iris/cube.py

-    @fill_value.setter
-    def fill_value(self, fill_value):
-        self._data_manager.fill_value = fill_value
+        return self._data_manager.core_data().dtype


@lbdreyer Isn't this just self._data_manager.dtype still? Why the need to go through the core_data function?

Addressed in the most recent commit

bjlittle · 2017-08-01T13:54:07Z

lib/iris/cube.py

-
-    @fill_value.setter
-    def fill_value(self, fill_value):
-        self._data_manager.fill_value = fill_value


@lbdreyer Yup, nice.

There's no point in even having a fill_value getter convenience, as it's not possible to get a lazy fill_value or indeed track whether the lazy array is an ndarray or masked without realising it (please tell me otherwise).

bjlittle · 2017-08-01T13:55:49Z

lib/iris/cube.py

            cube_xml_element.setAttribute('var_name', self.var_name)
        cube_xml_element.setAttribute('units', str(self.units))
-        if self.fill_value is not None:
-            cube_xml_element.setAttribute('fill_value', str(self.fill_value))


Oh boy! CML carnage 😱

bjlittle · 2017-08-02T06:34:55Z

@lbdreyer Just missed one use case of replace on line 194 in _data_manager.py for the _deepcopy method.

bjlittle · 2017-08-02T06:50:19Z

lib/iris/cube.py

-        data_xml_element.setAttribute('dtype', dtype.name)
+            else:
+                dtype = self.lazy_data().dtype
+            data_xml_element.setAttribute('dtype', dtype.name)


@lbdreyer I think you need to back this tab change out.

Referring back to v1.13 the else on line 2904 aligns with the if on line 2879, and not the if on line 2901.

The previous logic makes sense, in that the dtype xml attribute is always set, and for non-lazy data extra xml metadata is set for the concrete data payload.

This change is most likely adding to some CML differences ...

bjlittle · 2017-08-02T07:05:18Z

lib/iris/etc/pp_save_rules.txt

 #MDI
 IF
-    cm.fill_value is not None
+    isinstance(cm.data, ma.core.MaskedArray)


@lbdreyer Replace this with ma.isMaskedArray(cm.data) ? But this forces the loading of the data.

Slightly confusing is that for both of these PP save rules, in v1.13, there should be no change here. I don't quite understand that (at the moment) as there is no fill_value attribute on a cm (cube) ... so how did these save rules work in v1.13 ...

It looks like @pp-mo added a cube.fill_value in #2452

So this change will undo the benefit we got from that PR

@lbdreyer I believe that PR pretty much got removed as a result of the merge back of the original dask feature branch into master

See issue #2704, raised to address this

bjlittle · 2017-08-02T07:05:41Z

lib/iris/etc/pp_save_rules.txt

+    not isinstance(cm.data, ma.core.MaskedArray)
 THEN
-    pp.bmdi = -1e30
+pp.bmdi = -1e30


You've lost the spacing indent ...

bjlittle · 2017-08-02T07:07:37Z

lib/iris/fileformats/grib/__init__.py

                                  offset - message_length)
            self._data = as_lazy_data(proxy)
        else:
            values_array = _message_values(grib_message, shape)


@lbdreyer Nothing is assigned back to self.data ?

bjlittle · 2017-08-02T07:21:33Z

lib/iris/fileformats/grib/__init__.py

+    # Handle missing values in a sensible way.
+    mask = np.isnan(data)
+    if mask.any():
+        data = ma.array(data, mask=mask, fill_value=np.nan)


@lbdreyer Just confirming my understanding ... so this change is making up for the removal of convert_nans_array by converting an array containing nans to a masked array, got it ... but you're explicitly setting the fill_value to be nan, which we didn't ever do before. Why so?

Also, this function _message_values is called from the GribWrapper.__init__, but also the GribDataProxy.__getitem__ ... which is interesting as the GribDataProxy didn't ever then apply convert_nans_array ...
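For readers following along, the NaN-to-masked conversion under discussion can be sketched with NumPy alone. The function name `nans_to_masked` is hypothetical (it is not an Iris function); the body simply mirrors the four added lines of the diff.

```python
import numpy as np
import numpy.ma as ma


def nans_to_masked(data):
    """Return `data` masked at its NaN points, or unchanged if it has none."""
    mask = np.isnan(data)
    if mask.any():
        # fill_value=np.nan mirrors the diff; it only works for floating dtypes.
        data = ma.array(data, mask=mask, fill_value=np.nan)
    return data


with_gaps = nans_to_masked(np.array([1.0, np.nan, 3.0]))
no_gaps = nans_to_masked(np.array([1.0, 2.0, 3.0]))
```

Note that the return type depends on the content of the data: `with_gaps` is a masked array, while `no_gaps` stays a plain ndarray.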

bjlittle · 2017-08-02T07:32:41Z

lib/iris/fileformats/grib/_save_rules.py

    if ma.isMaskedArray(cube.data):
-        fill_value = cube.fill_value
-        if fill_value is None or np.isnan(cube.fill_value):
+        if not np.isnan(cube.data.fill_value):


@lbdreyer Okay ... I think I see the connective tissue now. So is this the reason for setting the fill_value to nan in _message_values when a GRIB message has nan values in its message data payload, in order to control auto-selecting an appropriate fill_value?

bjlittle · 2017-08-02T07:45:34Z

lib/iris/fileformats/grib/message.py

    def __repr__(self):
        msg = '<{self.__class__.__name__} shape={self.shape} ' \
-              'dtype={self.dtype!r} recreate_raw={self.recreate_raw!r} '
+              'dtype={self.dtype!r} fill_value={self.fill_value!r} ' \


@lbdreyer There is no fill_value ...

bjlittle · 2017-08-02T07:53:31Z

lib/iris/fileformats/grib/message.py

-                data = _data
+                # `ma.masked_array` masks where input = 1, the opposite of
+                # the behaviour specified by the GRIB spec.
+                data = ma.masked_array(_data, mask=np.logical_not(bitmap))


@lbdreyer So the deal is that _DataProxy will return a data payload that is masked, right? Rather than an array that is nan filled ... that might have testing fall-out, I guess.

That aside, shouldn't this be ...

data = ma.masked_array(_data, mask=np.logical_not(bitmap.astype(bool)), fill_value=np.nan)

This would align with the strategy for auto-selecting the fill_value on save, right?

... Only problem is that using fill_value=np.nan means that the dtype of all GRIB data must be float, otherwise it's not possible to set np.nan as a fill_value for non-float data ...
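A small NumPy-only sketch of the two points above, using made-up values: the bitmap has to be inverted before it can serve as a mask, and a NaN fill_value is rejected for integer data. The `values` and `bitmap` arrays are hypothetical and stand in for a decoded GRIB message.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical payload and bitmap; a bitmap value of 1 means "data present".
values = np.array([10.0, 20.0, 30.0, 40.0])
bitmap = np.array([1, 0, 1, 1])

# numpy.ma masks where the mask is True, so the bitmap is inverted.
data = ma.masked_array(values, mask=np.logical_not(bitmap.astype(bool)))

# NaN cannot be represented in an integer dtype, so NumPy refuses it as
# a fill_value (the exception type has varied between NumPy versions).
try:
    ma.masked_array(np.array([1, 2, 3]), mask=[0, 1, 0], fill_value=np.nan)
    int_nan_ok = True
except (TypeError, ValueError):
    int_nan_ok = False
```

Here only the second point, where the bitmap is 0, ends up masked.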

bjlittle · 2017-08-02T09:57:26Z

lib/iris/fileformats/netcdf.py


-    dtype = nan_array_type(dummy_data.dtype)
-    proxy = NetCDFDataProxy(cf_var.shape, dtype,
+    proxy = NetCDFDataProxy(cf_var.shape, dummy_data.dtype,


@lbdreyer Calculation of the fill_value on line 503 has the default value of None.

In v1.12 and v1.13 it is netCDF4.default_fillvals[cf_var.dtype.str[1:]] ... I think that this default behaviour should be reinstated.

Agreed, a couple of tests were complaining about this

Hoo-rah for test coverage. Win!
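As an illustration of the v1.12/v1.13 default lookup mentioned above, the sketch below shows how `dtype.str[1:]` turns a dtype into a key. `DEFAULT_FILLVALS` is a stand-in holding a few entries assumed to match `netCDF4.default_fillvals`, and `default_fill_value` is a hypothetical helper, not Iris code.

```python
import numpy as np

# Stand-in for a few entries of netCDF4.default_fillvals (an assumption:
# check the values against the installed netCDF4 before relying on them).
DEFAULT_FILLVALS = {
    'i2': -32767,
    'i4': -2147483647,
    'f4': 9.969209968386869e+36,
    'f8': 9.969209968386869e+36,
}


def default_fill_value(dtype):
    """Look up the default fill value for a NumPy dtype."""
    # dtype.str looks like '<f4' or '>i2'; dropping the leading
    # byte-order character leaves the lookup key, e.g. 'f4'.
    return DEFAULT_FILLVALS[np.dtype(dtype).str[1:]]


big_endian_float = default_fill_value('>f4')
native_short = default_fill_value(np.int16)
```

Stripping the first character makes the lookup independent of the byte order of the variable on disk.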

lbdreyer · 2017-08-02T10:21:32Z

We have agreed to keep this PR to just code changes so I am going to back out all the test fixes.

I am hoping to keep this PR to just code changes, and then when it gets merged, squash it to a single commit, so in the commit history there will be a single commit with the code changes and then subsequent commits with fixes to the tests that fall out from the code changes.

bjlittle · 2017-08-02T10:21:56Z

lib/iris/fileformats/netcdf.py

+                    fill_value = cube.lazy_data().fill_value
+                else:
+                    fill_value = None
+


@lbdreyer We can't get the fill_value if the data is lazy, so that changes the logic here somewhat ...

For lines 1951-1957, I propose the following:

if packing is None:
    fill_value = None
    if not cube.has_lazy_data() and ma.isMaskedArray(cube.data):
        fill_value = cube.data.fill_value
    dtype = cube.dtype.newbyteorder('=')

Iris 1.13 has

if packing is None:
    fill_value = cube.lazy_data().fill_value
    dtype = cube.lazy_data().dtype.newbyteorder('=')

Does that mean we could get the fill_value from cube.lazy_data() when we were still using biggus?

I am concerned that in the example where we have lazy masked data, we aren't setting the fill_value as it would just be None. I am not sure what the fall out of this would be

Yeah, that's behaviour that we just can't avoid, and it's concerning. If we're saving lazy masked data, then the fill-value is lost. Not good.

We could do a work around though ... we know that we can only get the fill-value from concrete or non-lazy masked data, so in the lazy masked case, we could slice the lazy data down to 1 data item, then compute it, then get the fill-value. That works, and it's similar to a trick we've used in netcdf to calculate the derived dtype from data that requires a scale + offset calculation.

As you mentioned, we could also apply this approach to PP to ensure that we don't lose the goodness of @pp-mo's #2452.

@lbdreyer How does that sound?

@lbdreyer Actually, we need to be slightly cautious how we do this ...

>>> m = ma.masked_array([0,1,2,3], mask=[0,0,0,1], fill_value=123)
>>> dm = da.from_array(m, chunks=(1,), asarray=False)
>>> dm[0].compute()
0
>>> dm[-1].compute()
masked
>>> dm[:1].compute()
masked_array(data = [0], mask = [False], fill_value = 123)

The last example of slicing with [:1] gives us back a masked array with the fill-value, which is what we need. The first and second examples differ based on whether the underlying data value is masked or not (that's bad) and they return a numpy.int64 and a MaskedConstant (evil), both of which don't have a fill_value.

See issues raised #2703 and #2704
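The slicing workaround described above could be sketched roughly as follows. The helper name `first_item_fill_value` is hypothetical, and the `compute()` branch assumes a dask-like lazy array; the sketch is exercised here on concrete NumPy data only and assumes at least one dimension.

```python
import numpy as np
import numpy.ma as ma


def first_item_fill_value(data):
    """Return the fill_value of `data`, or None if it is not masked."""
    # Slice (rather than index) every dimension down to one item, so the
    # result is still an array and keeps its fill_value.
    window = data[tuple(slice(0, 1) for _ in range(data.ndim))]
    # Assumption: lazy (dask-like) arrays expose compute(); concrete
    # arrays are used as they are.
    if hasattr(window, 'compute'):
        window = window.compute()
    return window.fill_value if ma.isMaskedArray(window) else None


m = ma.masked_array([0, 1, 2, 3], mask=[0, 0, 0, 1], fill_value=123)
masked_result = first_item_fill_value(m)
plain_result = first_item_fill_value(np.arange(4))
```

Because only one element is realised, the cost stays small even for large lazy arrays, and the result does not depend on whether that element happens to be masked.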

bjlittle · 2017-08-02T10:36:32Z

lib/iris/fileformats/netcdf.py

-                                 nans_replacement=cube.fill_value,
-                                 result_dtype=cube.dtype)
+            data = cube.lazy_data()
+


@lbdreyer We could tidy this now into a simple one liner (if you agree):

da.store([cube.lazy_data()], [cf_var])

bjlittle

@lbdreyer Okay, awesome work!

I've finished going through the code changes and I've added various comments that need further discussion/action.

lbdreyer · 2017-08-02T12:08:45Z

lib/iris/etc/pp_save_rules.txt

 #MDI
 IF
-    isinstance(cm.data, ma.core.MaskedArray)
+    ma.isMasked(cm.data)


Argh this should be isMaskedArray !
Hang on putting up a new commit

lbdreyer · 2017-08-02T12:10:58Z

@bjlittle I have now addressed all your comments (except for the fill_value at saving to pp and netcdf time which we have #2703 and #2704 to address)

bjlittle · 2017-08-02T12:12:55Z

@lbdreyer Okay, I'll take a final look through the updates ...

lbdreyer assigned bjlittle Jul 31, 2017

lbdreyer added the dask-mask label Jul 31, 2017

bjlittle reviewed Aug 1, 2017

lbdreyer force-pushed the dask_code_changes branch from bbfa2fa to 7e074e4 Compare August 1, 2017 13:26

bjlittle reviewed Aug 1, 2017

lbdreyer mentioned this pull request Aug 1, 2017

Dask test api fixes + cml changes #2702

Closed

bjlittle reviewed Aug 2, 2017

Integrate with dask masked array support (just code changes, no tests fixes)

13b8b6f

lbdreyer force-pushed the dask_code_changes branch from 1ff6f00 to 13b8b6f Compare August 2, 2017 10:28

bjlittle self-requested a review August 2, 2017 10:33

bjlittle reviewed Aug 2, 2017

bjlittle requested changes Aug 2, 2017

lbdreyer mentioned this pull request Aug 2, 2017

Fix fill_value when saving to netCDF #2703

Closed

lbdreyer commented Aug 2, 2017

Review actions

86a5d6b

lbdreyer force-pushed the dask_code_changes branch from 4a16352 to 86a5d6b Compare August 2, 2017 12:10

This was referenced Aug 2, 2017

Remove cube.replace() or data_manager.replace() from tests #2705

Closed

Remove usages of convert_nans_array, array_masked_to_nans and nans_array_type #2706

Closed

bjlittle approved these changes Aug 2, 2017

View reviewed changes

bjlittle merged commit 9c6782b into SciTools:dask_mask_array Aug 2, 2017

QuLogic added this to the dask-mask milestone Aug 2, 2017

lbdreyer deleted the dask_code_changes branch July 23, 2018 10:47
