Skip to content

Conversation

@lbdreyer
Copy link
Member

This is pointing at the dask_mask_array feature branch.

This contains just the code changes that I think require to be done to integrate the dask masked array support into Iris.

This will obviously break a bunch of tests but we can address that later.

@lbdreyer
Copy link
Member Author

I'm pretty happy with the handling of fill value:

>>> x = ma.masked_array([1,2,3,4], mask=[0,1,0,0], fill_value=10)
>>> dx = da.from_array(x, chunks=(2,), asarray=False)
>>> cdx = dx.compute()
>>> print cdx.fill_value
10

Combining two daskified masked arrays was interesting though:

>>> x = ma.masked_array([1,2,3,4], mask=[0,1,0,0], fill_value=15)
>>> y = ma.masked_array([1,2,3,4], mask=[0,1,0,0], fill_value=32, dtype=np.float64)
>>> dx = da.from_array(x, chunks=(2,), asarray=False)
>>> dy = da.from_array(y, chunks=(2,), asarray=False)
>>> dxy = dx * dy
>>> cdxy = dxy.compute()
>>> print cdxy.fill_value
15.0

data = array_masked_to_nans(data)
data = da.from_array(data, chunks=chunks)
asarray = False
data = da.from_array(data, chunks=chunks, asarray=asarray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer This change could be simply reduced to:

asarray = not ma.isMaskedArray(data)
data = da.from_array(data, chunks=chunks, asarray=asarray)

@bjlittle
Copy link
Member

bjlittle commented Aug 1, 2017

@lbdreyer Just noticed that we should be purging the use of the following _lazy_data functions:

  • nan_array_type
    • fileformats/netcdf.py - imported but not used
    • _lazy_data.py - definition
  • array_masked_to_nans
    • fileformats/grib/message.py - imported but not used
    • fileformats/netcdf.py - I don't think that we should be using this here
    • fileformats/pp.py - imported but not used
    • tests/unit/lazy_data/test_array_masked_to_nans.py - we can nuke these tests
    • _lazy_data.py - definition
  • convert_nans_array
    • fileformats/grib/__init__.py - revert usage back to 1.13 ?
    • fileformats/netcdf.py - imported by not used
    • tests/unit/lazy_data/test_as_concrete_data.py - we can nuke these tests
    • tests/unit/lazy_data/test_convert_nans_array.py - we can remove all these tests
    • _lazy_data.py - definition

@lbdreyer See #2719 and #2706

@bjlittle
Copy link
Member

bjlittle commented Aug 1, 2017

@lbdreyer We should remove the DataManager.replace method and associated Cube.replace method, and retrospectively use datamanager.data = data and cube.data = data instead ...

  • analysis/__init__.py, line 566
  • cube.py, lines 2912-2922
  • tests/test_analysis.py, lines 132, 138, 147
  • tests/test_basic_maths.py, line 42
  • tests/test_cdm.py, line 1033, line 1039
  • tests/test_concatenate.py, line 521, line 542
  • tests/test_interpolation.py, line 46
  • tests/unit/analysis/regrid/test_RectilinearRegridder.py, line 1266
  • tests/unit/cube/test_Cube.py, assorted lines
  • tests/unit/fileformats/pp/test_as_pairs.py, line 54
  • tests/unit/data_manager/test_DataManager.py, assorted lines
  • _data_manager.py, line 193, line 374

@lbdreyer lbdreyer force-pushed the dask_code_changes branch from bbfa2fa to 7e074e4 Compare August 1, 2017 13:26
fill_value, cube._cell_measures_and_dims)

return _CubeSignature(cube.metadata, cube.shape,
cube.lazy_data().dtype,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer This should just be cube.dtype right? ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the most recent commit

if is_lazy_data(data) and data.dtype.kind in 'biu':
# Disallow lazy integral data, as it will cause problems with dask
# if it turns out to contain any masked points.
# Non-floating cell measures are not valid up to CF v1.7 anyway,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Minor, but you can remove the anyway, from the end of this comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the most recent commit

lib/iris/cube.py Outdated
@fill_value.setter
def fill_value(self, fill_value):
self._data_manager.fill_value = fill_value
return self._data_manager.core_data().dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Isn't this just self._data_manager.dtype still? Why the need to go through the core_data function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the most recent commit


@fill_value.setter
def fill_value(self, fill_value):
self._data_manager.fill_value = fill_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Yup, nice.

There's no point in even having a fill_value getter convenience, as it's not possible to get a lazy fill_value or indeed track whether the lazy array is an ndarray or masked without realising it (please tell me otherwise).

cube_xml_element.setAttribute('var_name', self.var_name)
cube_xml_element.setAttribute('units', str(self.units))
if self.fill_value is not None:
cube_xml_element.setAttribute('fill_value', str(self.fill_value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy! CML carnage 😱

@bjlittle
Copy link
Member

bjlittle commented Aug 2, 2017

@lbdreyer Just missed one use case of replace on line 194 in _data_manager.py for the _deepcopy method.

lib/iris/cube.py Outdated
data_xml_element.setAttribute('dtype', dtype.name)
else:
dtype = self.lazy_data().dtype
data_xml_element.setAttribute('dtype', dtype.name)
Copy link
Member

@bjlittle bjlittle Aug 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer I think you need to back this tab change out.

Referring back to v1.13 the else on line 2904 aligns with the if on line 2879, and not the if on line 2901.

The previous logic makes sense, in that the dtype xml attribute is always set, and for non-lazy data extra xml metadata is set for the concrete data payload.

This change is most likely adding to some CML differences ...

#MDI
IF
cm.fill_value is not None
isinstance(cm.data, ma.core.MaskedArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Replace this with ma.isMaskedArray(cm.data) ? But this forces the loading of the data.

Slightly confusing is that for both of these PP save rule, in v1.13, there should be no change here. I don't quite understand that (at the moment) as there is no fill_value attribute on a cm (cube) ... so how did these save rule work in v1.13 ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like @pp-mo added a cube.fill_value in #2452

So this change will undo the benefit we got from that PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer I believe that PR pretty much got removed as a result of the merge back of the original dask feature branch into master

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See issues raise #2704 to address this

not isinstance(cm.data, ma.core.MaskedArray)
THEN
pp.bmdi = -1e30
pp.bmdi = -1e30
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've lost the spacing indent ...

offset - message_length)
self._data = as_lazy_data(proxy)
else:
values_array = _message_values(grib_message, shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Nothing is assigned back to self.data ?

# Handle missing values in a sensible way.
mask = np.isnan(data)
if mask.any():
data = ma.array(data, mask=mask, fill_value=np.nan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Just confirming my understanding ... so this is change is making up for the removal of convert_nans_array to convert an array containing nans to a masked array, got it ... but you're explicitly setting the fill_value to be nan, which we didn't ever do before. Why so?

Also, this function _message_values is called from the GribWrapper.__init__, but also the GribDataProxy.__getitem__ ... which is interesting as the GribDataProxy didn't ever then apply convert_nans_array ...

if ma.isMaskedArray(cube.data):
fill_value = cube.fill_value
if fill_value is None or np.isnan(cube.fill_value):
if not np.isnan(cube.data.fill_value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Okay ... I think I see the connective tissue now. So is this the reason for setting the fill_value to nan in _message_values when a GRIB message has nan values in it's message data payload in order to control auto-selecting an appropriate fill_value?

def __repr__(self):
msg = '<{self.__class__.__name__} shape={self.shape} ' \
'dtype={self.dtype!r} recreate_raw={self.recreate_raw!r} '
'dtype={self.dtype!r} fill_value={self.fill_value!r} ' \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer There is no fill_value ...

data = _data
# `ma.masked_array` masks where input = 1, the opposite of
# the behaviour specified by the GRIB spec.
data = ma.masked_array(_data, mask=np.logical_not(bitmap))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer So the deal is that _DataProxy will return a data payload that is masked, right? Rather than an array that is nan filled ... that might have testing fall-out, I guess.

That aside, shouldn't this be ...

data = ma.masked_array(_data, mask=np.logical_not(bitmap.astype(bool)), fill_value=np.nan)

This would align with the strategy for auto-selecting the fill_value on save, right?

... Only problem is that using fill_value=np.nan means that the dtype of all GRIB data must be float, otherwise it's not possible to set np.nan as a fill_value for non-float data ...


dtype = nan_array_type(dummy_data.dtype)
proxy = NetCDFDataProxy(cf_var.shape, dtype,
proxy = NetCDFDataProxy(cf_var.shape, dummy_data.dtype,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Calculation of the fill_value on line 503 has the default value of None.

In v1.12 and v1.13 it is netCDF4.default_fillvals[cf_var.dtype.str[1:]] ... I think that this default behaviour should be reinstated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, a couple of tests were complaining about this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoo-rah for test coverage. Win!

@lbdreyer
Copy link
Member Author

lbdreyer commented Aug 2, 2017

We have agreed to keep this PR to just code changes so I am going to back out all the test fixes.

I am hoping to keep this PR to just code changes, and then when it gets merged, squash it to a single commit, so in the commit history there will be a single commit with the code changes and then subsequent commits with fixes to the tests that fall out from the code changes.

fill_value = cube.lazy_data().fill_value
else:
fill_value = None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer We can't get the fill_value if the data is lazy, so that changes the logic here some what ...

For lines 1951-1957, I propose the following:

if packing is None:
    fill_value = None
    if not cube.has_lazy_data() and ma.isMaskedArray(cube.data):
        fill_value = cube.data.fill_value
    dtype = cube.dtype.newbyteorder('=')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iris 1.13 has

            if packing is None:
                fill_value = cube.lazy_data().fill_value
                dtype = cube.lazy_data().dtype.newbyteorder('=')

Does that mean we could get the fill_value from cube.lazy_data() when we were still using biggus?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned that in the example where we have lazy masked data, we aren't setting the fill_value as it would just be None. I am not sure what the fall out of this would be

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's behaviour that we just can't avoid, and it's concerning. If we're saving lazy masked data, then the fill-value is lost. Not good.

We could do a work around though ... we know that we can only get the fill-value from concrete or non-lazy masked data, so in the lazy masked case, we could slice the lazy data down to 1 data item, then compute it, then get the fill-value. That works, and it's similar to a trick we've used in netcdf to calculate the derived dtype from data that requires a scale + offset calculation.

As you mentioned, we could also apply this approach to PP to ensure that we don't lose the goodness of @pp-mo's #2452.

@lbdreyer How does that sound?

Copy link
Member

@bjlittle bjlittle Aug 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Actually, we need to be slightly cautious how we do this ...

>>> m = ma.masked_array([0,1,2,3], mask=[0,0,0,1], fill_value=123)
>>> dm = da.from_array(m, chunks=(1,), asarray=False)
>>> dm[0].compute()
0
>>> dm[-1].compute()
masked
>>> dm[:1].compute()
masked_array(data = [0],
             mask = [False],
       fill_value = 123)

The last example of slicing with [:1] gives us back a masked array with the fill-value, which is what we need. The first and second examples differ based on whether the underlying data value is masked or not (that's bad) and they return a numpy.int64 and a MaskedConstant (evil), both of which don't have a fill_value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See issues raised #2703 and #2704

@lbdreyer lbdreyer force-pushed the dask_code_changes branch from 1ff6f00 to 13b8b6f Compare August 2, 2017 10:28
@bjlittle bjlittle self-requested a review August 2, 2017 10:33
nans_replacement=cube.fill_value,
result_dtype=cube.dtype)
data = cube.lazy_data()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer We could tidy this now into a simple one liner (if you agree):

da.store([cube.lazy_data()], [cf_var])

Copy link
Member

@bjlittle bjlittle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer Okay, awesome work!

I've finish going through the code changes and I've added various comments that need further discussion/action.

#MDI
IF
isinstance(cm.data, ma.core.MaskedArray)
ma.isMasked(cm.data)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh this should be isMaskedArray !
Hang on putting up a new commit

@lbdreyer lbdreyer force-pushed the dask_code_changes branch from 4a16352 to 86a5d6b Compare August 2, 2017 12:10
@lbdreyer
Copy link
Member Author

lbdreyer commented Aug 2, 2017

@bjlittle I have now addressed all your comments (except for the fill_value at saving to pp and netcdf time which we have #2703 and #2704 to address)

@bjlittle
Copy link
Member

bjlittle commented Aug 2, 2017

@lbdreyer Okay, I'll take a final look through the updates ...

@bjlittle bjlittle merged commit 9c6782b into SciTools:dask_mask_array Aug 2, 2017
@QuLogic QuLogic added this to the dask-mask milestone Aug 2, 2017
@lbdreyer lbdreyer deleted the dask_code_changes branch July 23, 2018 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants