-
Notifications
You must be signed in to change notification settings - Fork 300
Integrate dask masked array support (just code changes, no tests fix) #2699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm pretty happy with the handling of fill value: Combining two daskified masked arrays was interesting though: |
| data = array_masked_to_nans(data) | ||
| data = da.from_array(data, chunks=chunks) | ||
| asarray = False | ||
| data = da.from_array(data, chunks=chunks, asarray=asarray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer This change could be simply reduced to:
asarray = not ma.isMaskedArray(data)
data = da.from_array(data, chunks=chunks, asarray=asarray)|
@lbdreyer Just noticed that we should be purging the use of the following
|
|
@lbdreyer We should remove the
|
bbfa2fa to
7e074e4
Compare
lib/iris/_merge.py
Outdated
| fill_value, cube._cell_measures_and_dims) | ||
|
|
||
| return _CubeSignature(cube.metadata, cube.shape, | ||
| cube.lazy_data().dtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer This should just be cube.dtype right? ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the most recent commit
lib/iris/coords.py
Outdated
| if is_lazy_data(data) and data.dtype.kind in 'biu': | ||
| # Disallow lazy integral data, as it will cause problems with dask | ||
| # if it turns out to contain any masked points. | ||
| # Non-floating cell measures are not valid up to CF v1.7 anyway, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Minor, but you can remove the anyway, from the end of this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the most recent commit
lib/iris/cube.py
Outdated
| @fill_value.setter | ||
| def fill_value(self, fill_value): | ||
| self._data_manager.fill_value = fill_value | ||
| return self._data_manager.core_data().dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Isn't this just self._data_manager.dtype still? Why the need to go through the core_data function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the most recent commit
|
|
||
| @fill_value.setter | ||
| def fill_value(self, fill_value): | ||
| self._data_manager.fill_value = fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Yup, nice.
There's no point in even having a fill_value getter convenience, as it's not possible to get a lazy fill_value or indeed track whether the lazy array is an ndarray or masked without realising it (please tell me otherwise).
| cube_xml_element.setAttribute('var_name', self.var_name) | ||
| cube_xml_element.setAttribute('units', str(self.units)) | ||
| if self.fill_value is not None: | ||
| cube_xml_element.setAttribute('fill_value', str(self.fill_value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh boy! CML carnage 😱
|
@lbdreyer Just missed one use case of |
lib/iris/cube.py
Outdated
| data_xml_element.setAttribute('dtype', dtype.name) | ||
| else: | ||
| dtype = self.lazy_data().dtype | ||
| data_xml_element.setAttribute('dtype', dtype.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer I think you need to back this tab change out.
Referring back to v1.13 the else on line 2904 aligns with the if on line 2879, and not the if on line 2901.
The previous logic makes sense, in that the dtype xml attribute is always set, and for non-lazy data extra xml metadata is set for the concrete data payload.
This change is most likely adding to some CML differences ...
lib/iris/etc/pp_save_rules.txt
Outdated
| #MDI | ||
| IF | ||
| cm.fill_value is not None | ||
| isinstance(cm.data, ma.core.MaskedArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Replace this with ma.isMaskedArray(cm.data) ? But this forces the loading of the data.
Slightly confusing is that for both of these PP save rule, in v1.13, there should be no change here. I don't quite understand that (at the moment) as there is no fill_value attribute on a cm (cube) ... so how did these save rule work in v1.13 ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer I believe that PR pretty much got removed as a result of the merge back of the original dask feature branch into master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See issues raise #2704 to address this
lib/iris/etc/pp_save_rules.txt
Outdated
| not isinstance(cm.data, ma.core.MaskedArray) | ||
| THEN | ||
| pp.bmdi = -1e30 | ||
| pp.bmdi = -1e30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've lost the spacing indent ...
| offset - message_length) | ||
| self._data = as_lazy_data(proxy) | ||
| else: | ||
| values_array = _message_values(grib_message, shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Nothing is assigned back to self.data ?
| # Handle missing values in a sensible way. | ||
| mask = np.isnan(data) | ||
| if mask.any(): | ||
| data = ma.array(data, mask=mask, fill_value=np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Just confirming my understanding ... so this is change is making up for the removal of convert_nans_array to convert an array containing nans to a masked array, got it ... but you're explicitly setting the fill_value to be nan, which we didn't ever do before. Why so?
Also, this function _message_values is called from the GribWrapper.__init__, but also the GribDataProxy.__getitem__ ... which is interesting as the GribDataProxy didn't ever then apply convert_nans_array ...
| if ma.isMaskedArray(cube.data): | ||
| fill_value = cube.fill_value | ||
| if fill_value is None or np.isnan(cube.fill_value): | ||
| if not np.isnan(cube.data.fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Okay ... I think I see the connective tissue now. So is this the reason for setting the fill_value to nan in _message_values when a GRIB message has nan values in it's message data payload in order to control auto-selecting an appropriate fill_value?
lib/iris/fileformats/grib/message.py
Outdated
| def __repr__(self): | ||
| msg = '<{self.__class__.__name__} shape={self.shape} ' \ | ||
| 'dtype={self.dtype!r} recreate_raw={self.recreate_raw!r} ' | ||
| 'dtype={self.dtype!r} fill_value={self.fill_value!r} ' \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer There is no fill_value ...
lib/iris/fileformats/grib/message.py
Outdated
| data = _data | ||
| # `ma.masked_array` masks where input = 1, the opposite of | ||
| # the behaviour specified by the GRIB spec. | ||
| data = ma.masked_array(_data, mask=np.logical_not(bitmap)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer So the deal is that _DataProxy will return a data payload that is masked, right? Rather than an array that is nan filled ... that might have testing fall-out, I guess.
That aside, shouldn't this be ...
data = ma.masked_array(_data, mask=np.logical_not(bitmap.astype(bool)), fill_value=np.nan)This would align with the strategy for auto-selecting the fill_value on save, right?
... Only problem is that using fill_value=np.nan means that the dtype of all GRIB data must be float, otherwise it's not possible to set np.nan as a fill_value for non-float data ...
|
|
||
| dtype = nan_array_type(dummy_data.dtype) | ||
| proxy = NetCDFDataProxy(cf_var.shape, dtype, | ||
| proxy = NetCDFDataProxy(cf_var.shape, dummy_data.dtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Calculation of the fill_value on line 503 has the default value of None.
In v1.12 and v1.13 it is netCDF4.default_fillvals[cf_var.dtype.str[1:]] ... I think that this default behaviour should be reinstated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, a couple of tests were complaining about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hoo-rah for test coverage. Win!
|
We have agreed to keep this PR to just code changes so I am going to back out all the test fixes. I am hoping to keep this PR to just code changes, and then when it gets merged, squash it to a single commit, so in the commit history there will be a single commit with the code changes and then subsequent commits with fixes to the tests that fall out from the code changes. |
lib/iris/fileformats/netcdf.py
Outdated
| fill_value = cube.lazy_data().fill_value | ||
| else: | ||
| fill_value = None | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer We can't get the fill_value if the data is lazy, so that changes the logic here some what ...
For lines 1951-1957, I propose the following:
if packing is None:
fill_value = None
if not cube.has_lazy_data() and ma.isMaskedArray(cube.data):
fill_value = cube.data.fill_value
dtype = cube.dtype.newbyteorder('=')There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iris 1.13 has
if packing is None:
fill_value = cube.lazy_data().fill_value
dtype = cube.lazy_data().dtype.newbyteorder('=')
Does that mean we could get the fill_value from cube.lazy_data() when we were still using biggus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned that in the example where we have lazy masked data, we aren't setting the fill_value as it would just be None. I am not sure what the fall out of this would be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's behaviour that we just can't avoid, and it's concerning. If we're saving lazy masked data, then the fill-value is lost. Not good.
We could do a work around though ... we know that we can only get the fill-value from concrete or non-lazy masked data, so in the lazy masked case, we could slice the lazy data down to 1 data item, then compute it, then get the fill-value. That works, and it's similar to a trick we've used in netcdf to calculate the derived dtype from data that requires a scale + offset calculation.
As you mentioned, we could also apply this approach to PP to ensure that we don't lose the goodness of @pp-mo's #2452.
@lbdreyer How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Actually, we need to be slightly cautious how we do this ...
>>> m = ma.masked_array([0,1,2,3], mask=[0,0,0,1], fill_value=123)
>>> dm = da.from_array(m, chunks=(1,), asarray=False)
>>> dm[0].compute()
0
>>> dm[-1].compute()
masked
>>> dm[:1].compute()
masked_array(data = [0],
mask = [False],
fill_value = 123)The last example of slicing with [:1] gives us back a masked array with the fill-value, which is what we need. The first and second examples differ based on whether the underlying data value is masked or not (that's bad) and they return a numpy.int64 and a MaskedConstant (evil), both of which don't have a fill_value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1ff6f00 to
13b8b6f
Compare
lib/iris/fileformats/netcdf.py
Outdated
| nans_replacement=cube.fill_value, | ||
| result_dtype=cube.dtype) | ||
| data = cube.lazy_data() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer We could tidy this now into a simple one liner (if you agree):
da.store([cube.lazy_data()], [cf_var])
bjlittle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbdreyer Okay, awesome work!
I've finish going through the code changes and I've added various comments that need further discussion/action.
lib/iris/etc/pp_save_rules.txt
Outdated
| #MDI | ||
| IF | ||
| isinstance(cm.data, ma.core.MaskedArray) | ||
| ma.isMasked(cm.data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh this should be isMaskedArray !
Hang on putting up a new commit
4a16352 to
86a5d6b
Compare
|
@lbdreyer Okay, I'll take a final look through the updates ... |
This is pointing at the
dask_mask_arrayfeature branch.This contains just the code changes that I think require to be done to integrate the dask masked array support into Iris.
This will obviously break a bunch of tests but we can address that later.