-
Notifications
You must be signed in to change notification settings - Fork 300
Lazy masked fill_value retrieval for NC save #2723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note: no tests yet! I thought I'd push the changes so that people could get eyes on the proposed functionality. |
|
Now with unit tests! Still got a bunch of existing test failures to deal with, though. |
| if is_lazy_data(data): | ||
| inds = tuple([0] * (data.ndim-1)) | ||
| smallest_slice = data[inds][:0] | ||
| data = as_concrete_data(smallest_slice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dkillick You might want coverage for 0-dim arrays and MaskedConstants ... I've seen those being passed around on my travels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should work; see these tests... although I'm not convinced that they cover the MaskedConstant case. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would one produce a MaskedConstant to test on? They seem to be things you can only make when you don't want to (i.e. I now want to and I can't manage to make one!) 😒
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following works:
>>> a = ma.masked_array([4], mask=True)
>>> masked_constant = a[0]
>>> print type(masked_constant)
<class 'numpy.ma.core.MaskedConstant'>
| from iris._lazy_data import lazy_masked_fill_value, _MAX_CHUNK_SIZE | ||
|
|
||
|
|
||
| class Test_as_lazy_data(tests.IrisTest): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs renaming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot! Thought I'd got away with copying an existing file...
| return data | ||
|
|
||
|
|
||
| def lazy_masked_fill_value(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you but into a rebrand, say "lazy_fill_value" instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're going to have to try harder than that to convince me 😉
I guess that...
- I only care about fill values if I have masked data,
- so 'masked' could be omitted in that sense, but
- clarity is a good thing – it's in the masked data case that I want to retrieve the fill value.
So I'm not currently buying into a rebrand! What do you see as the benefit of doing so?
| self.fill_value = 9999 | ||
| self.m = ma.masked_array(data, mask=mask, fill_value=self.fill_value) | ||
| self.dm = da.from_array(self.m, asarray=False, | ||
| chunks=_MAX_CHUNK_SIZE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you not just call iris._lazy_data.as_lazy_data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could, but I don't see how that benefits me / improves the existing approach... So if you have a good benefit / improvement for changing this I'd love to hear it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well that's just the argument we initially had of whether to include as_lazy_data.
We chose to put all our dealing with dask in a single module (iris._lazy_data) rather than having it dotted throughout the iris codebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine... apart from the fact that it doesn't hold in the tests, which make widespread use of da.from_array (including the test module I duplicated to make this one).
Either way, this intermediate step has now been banked, so do feel free to change this in a follow-up PR 😄
| if is_lazy_data(data): | ||
| inds = tuple([0] * (data.ndim-1)) | ||
| smallest_slice = data[inds][:0] | ||
| data = as_concrete_data(smallest_slice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why calculate inds, why not just use [:0]
i.e.
if is_lazy_data(data):
smallest_slice = data[:0]
data = as_concrete_data(smallest_slice)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interesting idea! I was going to say "Because then it won't be the smallest slice!", which turns out to not quite be accurate (look at the shape of the resultant array 😱):
z = np.arange(24).reshape(2,3,4)
z[:0]
array([], shape=(0, 3, 4), dtype=int64)Apparently (according to an offline conversation) doing the suggested also causes the 0D case to not work properly. So, while I'm 👍 for reducing SLOC and code complexity, I think the extra line in this case adds some worthwhile resilience.
|
@dkillick Awesome thanks 👍 I'm going to bank this PR and follow-up with any/all rework in another PR, given your unavailability. |
Enables lazy masked
fill_valueretrieval for NetCDF save.Dask exposes no mechanism for retrieving the fill value of a lazy masked array (indeed, arguably this is currently not a soluble problem). This PR proposes a workaround for this problem, exposes the workaround as a function in
iris._lazy_dataand integrates the function into the NetCDF saver.Specifically, the workaround realises the smallest possible slice of the lazy masked data array that, when realised, contains the required fill value. Turns out the smallest possible slice is actually an empty slice:
Fixes #2703.