Skip to content

Conversation

@bjlittle
Copy link
Member

@bjlittle bjlittle commented Mar 13, 2017

This PR addresses the issue of dealing with the dtype and fill_value of the cube.

We need to deal with fact that dask does not support masked arrays, and so we need to take care to correctly handle the intended dtype and fill_value of the data payload in the cube.

The dtype and fill_value have now been promoted to be cube metadata attributes, because we care about them and define what a cube is. In the case of a dask lazy, masked data payload, we need to preserve the case where the intended dtype is integral.

We also have a shift in how fill_value is handled, in that previously we handed that off to biggus and numpy.ma in the hope that it would do the right thing. Now we deal with this directly within the cube through cube.fill_value, which is used as the intended fill_value of any masked data payload.

@bjlittle bjlittle force-pushed the fill-value-and-dtype branch from a1e5743 to 338f8d9 Compare March 13, 2017 14:27
@dtype.setter
def dtype(self, dtype):
self._dtype = dtype
if dtype != self.dtype:
Copy link
Member

@pp-mo pp-mo Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not always allow cube.dtype = None, to reset the dtype to the underlying data ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see the case that you might want to unset the dtype for lazy data to None, which you can't at the moment given the current implementation ... I've no argument against that.

@pp-mo
Copy link
Member

pp-mo commented Mar 14, 2017

Hi @bjlittle highly impressed you have this passing now !!
Will review later...

In meantime, here's an idea to ponder on...
I analysed usecases as I was suggesting, especially for the "minimal" requirements.
The only important outcome was this : I suspect we don't really need dtype assignment, we can allow dtype (that is, the 'alternative real dtype') to be specified only when setting lazy data.

I think we will need to retain real-data reassignment "cube.data =", but we can probably do without lazy-data reassignment (I need to check out existing code). Even if not, we can define a particular function for this.
This would mean that 'real dtype' can only be supplied in association with new lazy data.
I think removing the dtype assignment makes the whole API cleaner and simpler (including simpler to explain).

Think on, will discuss later.

@bjlittle
Copy link
Member Author

bjlittle commented Mar 14, 2017

Ran 4006 tests in 104.584s

FAILED (SKIP=310, failures=38)

Test case failures by category - tick the box to own the failures, investigate and fix, push changes as a PR on this branch and strike out the category when done ...

@lbdreyer
Copy link
Member

Sorry @bjlittle! I accidentally pushed (still getting the hang of pycharm) the change
bjlittle@2b77106
directly to this branch.

You can always revert it if you diagree with it

lib/iris/cube.py Outdated
# as the fill_value is checked against self.dtype.
self._dtype = None
if dtype is not None:
self.dtype = dtype
Copy link
Member Author

@bjlittle bjlittle Mar 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to have this check here, we just need to initialise _dtype = None and then set self.dtype = dtype

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

dtype = np.dtype(dtype)
if dtype.kind != 'i':
emsg = ('Can only cast lazy data to integral dtype, '
'got {!r}.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member Author

@bjlittle bjlittle Mar 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be strict here. We're only allowing the dtype to be of integral type to support the masked case. If we didn't do this, then users would abuse this to perform casting of their data.

If we do want to support casting of data, for the sake of it, then we should support that properly, say with cube.astype(...) or whatever. Hence, that's why we nail this dtype change down to only being applied to the lazy, integral (masked) data case. HTH.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: if you're visiting this bit, I've found that this test should also be allowing 'u' type data, i.e. unsigned ints. E.G. if dtype.kind not in ('i', 'u'): ?
If not it can wait -- I think it could do with a more general code search anyway, as I think we may have missed this possibility elsewhere in the code too.

Copy link
Member Author

@bjlittle bjlittle Mar 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm good point @pp-mo ... See this concatenate change and here also ... is there a generic numpy kind to catch all integral kinds instead of i and u ?

Copy link
Member Author

@bjlittle bjlittle Mar 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This numpy scalars link was handy ... also see Array-protocol type strings section of numpy docs ...

And this pattern seems to work:

>>> integral = [np.arange(10, dtype=np.int8),
... np.arange(10, dtype=np.int16),
... np.arange(10, dtype=np.int32),
... np.arange(10, dtype=np.int64),
... np.arange(10, dtype=np.uint8),
... np.arange(10, dtype=np.uint16),
... np.arange(10, dtype=np.uint32),
... np.arange(10, dtype=np.uint64)]
>>> for i in integral:
...     print(isinstance(i[0], np.integer))
... 
True
True
True
True
True
True
True
True
>>> f = np.arange(10, dtype=np.float)
>>> isinstance(f[0], np.integer)
False

Don't know if it's preferable over value.dtype.kind in ('i', 'u') though ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're only allowing the dtype to be of integral type to support the masked case

It might be worth putting a comment in the code about this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only mentioned the kind=u/i thing for completeness..
Let's not get hung up on this, please.
It will be much simpler to fix it later as I'm sure it needs doing in unrelated bits of the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do prefer np.integer.
It will also need to be fixed in iris/_concatenate.py e.g. https://github.com/SciTools/iris/pull/2433/files#diff-5fa4649482766af96ae11aead017e334R340

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pp-mo can you make an issue for this please

@bjlittle bjlittle changed the title With test failures to investigate. fill_value and dtype Mar 15, 2017
if kwargs['dtype'] is None and self.data_type.kind == 'i':
kwargs['dtype'] = self.data_type
defn = iris.cube.CubeMetadata(**kwargs)
return defn
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have some unit tests for this really ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not going to do that here as this might all change given conversations with @pp-mo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created the following issue #2439 for this

def fill_value(self, fill_value):
if fill_value is not None:
# Convert the given value to the dtype of the cube.
fill_value = np.asarray([fill_value])[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

fill_value, = np.asarray([fill_value])

Copy link
Member Author

@bjlittle bjlittle Mar 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tending to steer away from the fill_value, = np.asarray([fill_value]) pattern to unpack a scalar value, as it's quite subtle i.e. it's pretty easy for the eye to miss.

However, we could opt for [fill_value] = np.asarray([fill_value]) instead ... ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make this change ...

@pp-mo pp-mo mentioned this pull request Mar 15, 2017
result = concatenate(cubes)
self.assertEqual(len(result), 2)

def test_masked_fill_value(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder whether "test_masked_diff_fill_value" would be better

@bjlittle
Copy link
Member Author

@pp-mo @lbdreyer Let's get this merged asap ....

Copy link
Member

@pp-mo pp-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just make this obscure stuff a bit clearer ?

I'm otherwise happy with the code.
I must just whizz through the test changes, too, but I'm not expecting to find anything much to complain of..

return result

def promote_defn(self):
defn = self.defn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is worthy of explanation !
Suggest:

  • call it "lazy_version_defn"
  • docstring like ....
Produce the defn (i.e. cube.metadata) which a lazy version of the cube would have.
A cube with lazy data representing masked ints must have its `metadata.dtype` set appropriately.

Even if we think this may all change again later ...

# in :meth:`iris.cube.CubeList.concatenate_cube()`.
msg = 'Cube metadata differs for phenomenon: {}'
msgs.append(msg.format(self.defn.name()))
promoted = self.promote_defn()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain here too. Something like "Check whether lazy versions of the defns would match."

@lbdreyer
Copy link
Member

The tests are failing due to pep8 @bjlittle

@lbdreyer
Copy link
Member

Once this tests are passing I'll merge this in

@lbdreyer
Copy link
Member

Are you happy for me to squash and merge (once the tests have passed)?

@bjlittle
Copy link
Member Author

Ping @lbdreyer 👍

@bjlittle
Copy link
Member Author

Ping @pp-mo @dkillick

@lbdreyer lbdreyer merged commit 62cdd00 into SciTools:dask Mar 15, 2017
@lbdreyer
Copy link
Member

Sorry about the wait. @pp-mo wanted to do some final checks

@bjlittle bjlittle deleted the fill-value-and-dtype branch March 15, 2017 15:04
bjlittle added a commit to bjlittle/iris that referenced this pull request May 31, 2017
* Testing ideas for cube data/dtype/fill_value interaction (not passing -- need new API).

* Add checks for realisation; fix dtype tests; use 'lazy' keyword instead of 'real'.

* With test failures to investigate.

* Fixed tests for new code from SciTools#2433.

* Fix tests

* Pep8 fix.

* Fixed data_dtype_fillvalue test fails (other things now failing).

* Fix bug in clearing _dask_array when assigning real data.

* Keep fill value of cube when resetting the cube's data

* Fix concatenate tests.

* Keep fill values during stats operations; cast cml fill values to float64.

* Keep fill values during regrid test.

* Fix basic maths tests.

* Tidy cube.__init__ for dtype setting.

* Review actions.

* Rename concatenate unit tests.

* Review actions.

* pep8
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants