Netcdf packing support v2 #2193
Conversation
lib/iris/fileformats/netcdf.py
Outdated
with Saver(filename, netcdf_format) as sman:
    # Iterate through the cubelist.
    for cube in cubes:
        if isinstance(packing, list):
@jswanljung The user might just use a tuple rather than a list, and that'll cause some carnage from this point on ... we could be graceful and deal with that.
Good point.
I could actually allow this to be an iterable, rather than a list. I made it a list because testing for iterables is tricky. But I can test for strings and types instead and assume an iterable otherwise. If I do some checking before entering the Saver context (as you suggest below), this should fail gracefully.
That would be great 😄
You can test for an iterable with:

from collections import Iterable
if isinstance(some_thing, Iterable):
    ...

This will return True for a string also, so you'll need to check that it is iterable but not a string.
This will also return True for a dict ...
Alright, never mind then!
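The pitfall in this thread is easy to demonstrate. This sketch (not from the PR itself) shows that an `Iterable` check alone cannot distinguish a single packing spec from a list of specs:

```python
# Sketch of the pitfall discussed above: an Iterable check matches
# strings and dicts too, so it cannot by itself tell one packing
# spec apart from a list of specs.
try:
    from collections.abc import Iterable  # Python 3.3+
except ImportError:
    from collections import Iterable

print(isinstance([1, 2], Iterable))           # True
print(isinstance('i2', Iterable))             # True: strings iterate too
print(isinstance({'dtype': 'i2'}, Iterable))  # True: so do dicts
```

This is why the discussion settles on testing for strings and dicts explicitly rather than testing for "iterable".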
lib/iris/fileformats/netcdf.py
Outdated
           'cube.')
    raise ValueError(msg)
else:
    packspec = packing
@jswanljung For this whole section, if we have a list (or a tuple) we're popping as we go along ... but it can be the case that the user doesn't have the same number of cubes as packspecs, and you've dealt with that. Great!
However, we only raise an exception in this case as we stumble across it. From a user perspective, I think I'd rather see that we fail early rather than later.
What I'm suggesting is that before entering the Saver context manager, we detect this exception case beforehand, as we already have the cubes and the packing, so we can easily check that they align in terms of length (associated pairs). If the lengths differ, then raise the exception. This means that when we enter the Saver context manager we have a stronger saving contract i.e. we should now save all of the cubes in the CubeList.
Yeah, I agree with this too. I would have done this except that you can't just do len() on an iterable, you have to iterate through it. I guess we can safely assume that's not going to be a heavy operation.
Just wrote some pre-Saver context validation code, so it should fail before writing anything if packing is invalid in some way.
def is_valid_packspec(p):
    """ Only checks that the datatype is valid. """
    if isinstance(p, dict):
        if 'dtype' in p:
            return is_valid_packspec(p['dtype'])
        else:
            msg = "The argument to packing must contain the key 'dtype'."
            raise ValueError(msg)
    elif (isinstance(p, unicode) or isinstance(p, type) or
          isinstance(p, str)):
        pdtype = np.dtype(p)  # Does nothing if it's already a numpy dtype
        if pdtype.kind != 'i' and pdtype.kind != 'u':
            msg = "The packing datatype must be a numpy integer type."
            raise ValueError(msg)
        return True
    elif p is None:
        return True
    else:
        return False

if is_valid_packspec(packing):
    packspecs = repeat(packing)
else:
    # Assume an iterable; make sure packing is the same length as cubes.
    for cube, packspec in zip_longest(cubes, packing, fillvalue=-1):
        if cube == -1 or packspec == -1:
            msg = ('If packing is a list, it must have the '
                   'same number of elements as cubes.')
            raise ValueError(msg)
        if not is_valid_packspec(packspec):
            msg = 'Invalid packing argument: {}.'.format(packspec)
            raise ValueError(msg)
    packspecs = packing
repeat and zip_longest are from itertools. This is right before entering the saver context. Then it zips cubes with packspecs.
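As a rough standalone sketch of the strategy just described (hypothetical helper name, plain strings standing in for cubes): a single spec is repeated for every cube with `itertools.repeat`, while a list of per-cube specs must match the cube count, detected with a `zip_longest` sentinel before any saving begins.

```python
# Hypothetical sketch of the pairing strategy described above, not the
# PR's actual code.
from itertools import repeat, zip_longest

def pair_specs(cubes, packing):
    # A single spec (dtype string, dict, type, or None) applies to all cubes.
    if packing is None or isinstance(packing, (str, dict, type)):
        return list(zip(cubes, repeat(packing)))
    # Otherwise assume an iterable of per-cube specs; lengths must agree.
    sentinel = object()
    pairs = list(zip_longest(cubes, packing, fillvalue=sentinel))
    if any(sentinel in pair for pair in pairs):
        raise ValueError('packing must have one element per cube')
    return pairs
```

For example, `pair_specs(['cube_a', 'cube_b'], 'i2')` pairs the same spec with every cube, while mismatched lengths fail early, before entering the saver context.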
lib/iris/fileformats/netcdf.py
Outdated
scale_factor = None
add_offset = None
fill_value = None
masked = isinstance(cube.data, np.ma.core.MaskedArray)
@jswanljung You could use np.ma.isMaskedArray(cube.data) instead ...
Can you use ma. instead of np.ma. please. There is already an import numpy.ma as ma in the imports section.
Yup.
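The two spellings are equivalent; a quick sketch of the suggested change:

```python
# ma.isMaskedArray is the idiomatic spelling of the isinstance check
# on np.ma.core.MaskedArray, as suggested in the review above.
import numpy as np
import numpy.ma as ma

data = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(ma.isMaskedArray(data))                    # True
print(isinstance(data, np.ma.core.MaskedArray))  # True
print(ma.isMaskedArray(np.array([1.0, 2.0])))    # False for plain arrays
```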
lib/iris/fileformats/netcdf.py
Outdated
    add_offset = (cmax + cmin) / 2
else:
    add_offset = cmin + 2 ** (n - 1) * scale_factor
if masked and not fill_value:
@jswanljung Careful ... are you sure you don't want if masked and fill_value is not None here ... just in case someone wanted a fill_value=0 (why? ... just because they can I suppose)
Agreed.
Cool, thanks!
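The reviewer's point is the standard falsy-zero pitfall; a minimal illustration:

```python
# A user-supplied fill_value of 0 is perfectly legal but falsy, so
# `not fill_value` would wrongly treat it as absent; the
# `fill_value is not None` test is the safe guard.
fill_value = 0
print(not fill_value)       # True -> the buggy check fires for 0
print(fill_value is None)   # False -> the correct check does not
```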
lib/iris/fileformats/netcdf.py
Outdated
masked = isinstance(cube.data, np.ma.core.MaskedArray)
if isinstance(packing, dict):
    if 'dtype' not in packing:
        msg = ("The dtype attribute is required for packing. ")
Minor - trailing white space within string, and you can drop the brackets
Whoops. I think it was longer before...
lib/iris/fileformats/netcdf.py
Outdated
if not packing:
    # Determine whether there is a cube MDI value.
    fill_value = None
    if isinstance(cube.data, ma.core.MaskedArray):
@jswanljung You've already cached this in the masked local variable, calculated earlier.
True. Wasn't my code except the condition.
Actually the masked local was only set if packing was in kwargs, but I might as well move it outside the conditional.
This got complicated. I can't test for masking without loading lazy data, which I only want to do if the user uses the automatic calculation of parameters (which triggers loading anyway). But if data remains lazy, we assume we need a fill value whether the data is masked or not (this is the current behaviour). Anyway, it was easier just to do isMaskedArray again than to try to cache the result.
lib/iris/fileformats/netcdf.py
Outdated
if isinstance(cube.data, ma.core.MaskedArray):
    fill_value = cube.data.fill_value

if not packing:
if packing is not None
lib/iris/fileformats/netcdf.py
Outdated
# Create the cube CF-netCDF data variable.
# Explicitly assign the fill_value, which will be the type default
# in the case of an unmasked array.
if not packing:
if packing is not None
if isinstance(cube.data, np.ma.core.MaskedArray):
    masked = True
else:
    masked = False
@jswanljung Yeah, you know it now 😉 ... masked = ma.isMaskedArray(cube.data)
file_in = tests.get_data_path(
    ('PP', 'cf_processing',
     '000003000000.03.236.000128.1990.12.01.00.00.b.pp'))
cube = iris.load_cube(file_in)
@jswanljung L642:645 common to all test_single_packed_... type tests, so you could put that in a def setUp(self).
# Read PP input file.
file_in = tests.get_data_path(('PP', 'cf_processing',
                               'abcza_pa19591997_daily_29.b.pp'))
cubes = iris.load(file_in)
@jswanljung L714:716 is common to both test_multi_packed_... type tests ...
In fact, you could split this testing class into two: Test_single_packed_data and Test_multi_packed_data.
Each could have its own common code to load its respective cube test data in its setUp method.
There was a lot more redundancy than just the setup. I removed the redundancy, but without splitting the class.
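A minimal sketch of the suggested setUp refactor (hypothetical class and data; the real tests load PP cubes through iris):

```python
# Hypothetical sketch of the refactor suggested above: shared setup
# moves into setUp so each test_* method starts from the same state.
import unittest

class Test_single_packed_data(unittest.TestCase):
    def setUp(self):
        # In the real tests this would be iris.load_cube(file_in)
        # on the common PP test file.
        self.cube_data = [1.0, 2.0, 3.0]

    def test_packed_signed(self):
        self.assertEqual(len(self.cube_data), 3)

    def test_packed_unsigned(self):
        self.assertTrue(all(x > 0 for x in self.cube_data))
```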
    decimal=decimal)
else:
    self.assertArrayEqual(cube.data, packedcube.data)
@jswanljung Awesome tests, thanks!
However, I'd consider them more as integration tests rather than unit tests ... @ajdawson and @marqh would you agree?
If so, then move them to iris/tests/integration/test_netcdf.py instead ...
I started with the tests in tests/test_netcdf.py, then moved them to integration and then finally (after reading this: http://scitools.org.uk/iris/docs/latest/developers_guide/tests.html) I moved them to unit tests. But I'm happy to put them wherever you feel they belong.
@tests.skip_data
class Test_packed_data(tests.IrisTest):
    def _get_scale_factor_add_offset(self, cube, datatype):
@jswanljung If you agree to split this test class (as mentioned below) then this method is really a convenience function that would live at the module level (note no usage of self within the method) ... then each of the test cases can still call it as a module level function.
@jswanljung Okay I think I'm done. Thanks for this PR, it's going to make a valuable contribution to the code base 😄 I'm going to be distracted elsewhere for a wee while, but I'll get back on this PR asap (hope you don't mind). I'd love to see it merged 👍

@jswanljung You'll need to rebase from upstream/master to resolve conflicts, otherwise travis-ci won't kick in ...
Force-pushed c52ad10 to 679924f.
lib/iris/fileformats/netcdf.py
Outdated
import iris.fileformats._pyke_rules
import iris.io
import iris.util
from itertools import repeat
@jswanljung We've adopted an import ordering within iris, which is basically builtins, third-party, then iris specific modules.
This said, from itertools is a built-in import and belongs up after the import collections line.
Okay.
lib/iris/fileformats/netcdf.py
Outdated
import iris.io
import iris.util
from itertools import repeat
from six.moves import zip_longest
For six you can add this to the import on line-28:
from six.moves import (filter, input, map, range, zip, zip_longest)  # noqa
Unfortunately there's a coding standards test that makes sure nobody tampers with that line!
iris.tests.test_coding_standards.TestFutureImports
But I can put it right after that line.
I've got all tests passing now except an image test that is failing under Python 3. I think it's this problem: #2195. I'll try rerunning the tests.
Force-pushed 83d3cfc to c9bb112.
Actually I don't seem to be able to rerun these tests without committing something new.

@jswanljung I've restarted your failed travis test for you. Unfortunately, the failure is our fault and is related to #2195. Apologies.
# importing anything else.
import iris.tests as tests

from itertools import repeat
@jswanljung Sorry, but within each import block, the imports should be in alphabetical order, so itertools comes after contextlib and before os.path 😄
Ah! Another chance to fail the tests! :-)
Yeah, I'm working on a fix for #2195 😱
But I'll keep an eye on your travis jobs and keep restarting them manually. We'll get there ...
I have no doubt.
Maybe it's too obvious to need pointing out in the docstring, but iterable arguments to cubes and packing must both be ordered. No sets, for instance!
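A tiny illustration (hypothetical names) of why ordering matters: the cube-to-spec pairing is positional.

```python
# Pairing cubes with packing specs is positional, so an unordered
# container such as a set would give an arbitrary cube-to-spec mapping.
cubes = ['air_temperature', 'specific_humidity']
packing_list = ['i2', 'u2']   # list: position 0 pairs with cube 0
packing_set = {'i2', 'u2'}    # set: iteration order is arbitrary
pairs = list(zip(cubes, packing_list))
print(pairs)
```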
This PR is looking great from my point of view. Many thanks @jswanljung for the contributions. I fully support the approach and the implementation looks just fine to me. Merge when ready, I say 👍
@jswanljung I really wouldn't worry about explicitly talking about sets in your docstrings. Thanks for the nudge @marqh. I'll take another pass through this PR. Let's get this merged today!

Great! Just added a what's new entry dated today (no other changes). Hopefully the tests will pass again.

Okay, tests passed, but I guess I still need to do a rebase against upstream/master again?
Force-pushed 9e56a69 to aba9510.
Rebased against latest upstream/master.

Passing after rebase except for a random image test.

Passing again! So back to where we were this morning except now the code actually does something useful.

@jswanljung It appears that one of your computers does not have git configured correctly. All but three of the commits are attributed to

@QuLogic Thanks, I'll have a look at that.

@jswanljung Thanks for an awesome contribution and your sustained effort on this PR. It's very much appreciated! Grattis 🍻

And thank you and the rest of the SciTools people! This was a real learning experience and it's gratifying to see it merged. Also something I'll use all the time myself. 🍻
NetCDF has support for variable packing: by specifying the attributes scale_factor and add_offset, data can be packed into smaller data types. 16-bit short integers provide more than enough quantization for suitably scaled temperature data, for instance. See:
http://unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
Iris already supports loading of packed netCDF because it is automatically handled by the netCDF4 module. However, there is currently no way to save data with packing.
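As background, the packing arithmetic from the best-practices page can be sketched as follows. This is a simplified illustration, not the PR's actual implementation, though the add_offset formula mirrors the signed-type branch quoted in the review above:

```python
# Simplified sketch of NetCDF packing: packed values are
# round((x - add_offset) / scale_factor), with the parameters chosen
# so the data range spans n bits (one value reserved for fill).
import numpy as np

def packing_params(data, dtype='i2'):
    n = np.dtype(dtype).itemsize * 8
    cmin, cmax = float(data.min()), float(data.max())
    scale_factor = (cmax - cmin) / (2 ** n - 2)
    add_offset = (cmax + cmin) / 2  # centring for a signed type
    return scale_factor, add_offset

data = np.array([250.0, 300.0])  # e.g. temperatures in kelvin
scale_factor, add_offset = packing_params(data)
packed = np.round((data - add_offset) / scale_factor).astype('i2')
unpacked = packed * scale_factor + add_offset  # what readers reconstruct
```

Readers of the packed file recover the data to within one quantization step, which is the trade-off packing makes for halving (or better) the storage.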
This PR implements packing by adding an optional `packing` argument to the netcdf.save function and netcdf.Saver.write method. As described in the docstring:

    A numpy integer datatype (signed or unsigned) or a string that
    describes a numpy integer dtype (i.e. 'i2', 'short', 'u4') or a
    dict of packing parameters as described below or a list of such
    types, strings, or dicts. This provides support for netCDF data
    packing as described in
    http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
    If this argument is a type (or type string), appropriate values of
    scale_factor and add_offset will be automatically calculated based
    on `cube.data` and possible masking. For masked data, `fill_value`
    is taken from netCDF4.default_fillvals. For more control, pass a
    dict with one or more of the following keys: `dtype` (required),
    `scale_factor`, `add_offset`, and `fill_value`. To save multiple
    cubes with different packing parameters, pass a list of types,
    strings, dicts, or `None`, one for each cube. Note that automatic
    calculation of packing parameters will trigger loading of lazy
    data; set them manually using a dict to avoid this. The default is
    `None`, in which case the datatype is determined from the cube and
    no packing will occur.
So for example:
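The example itself did not survive in this transcript; based on the docstring above, usage would look something like this (hypothetical file and spec names, reconstructed rather than quoted):

```python
# Hypothetical usage sketch reconstructed from the docstring above;
# the original example was lost from this transcript.
spec_simple = 'i2'  # pack to 16-bit, automatic scale_factor/add_offset
spec_manual = {'dtype': 'i2', 'scale_factor': 0.01, 'add_offset': 273.15}
specs_per_cube = ['i2', None, spec_manual]  # one entry per cube in a list

# With iris available these would be passed straight through, e.g.:
#   iris.save(cubes, 'packed.nc', packing=spec_manual)
print(spec_manual['dtype'])
```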
This PR supersedes a previous attempt (#2152), which I withdrew because it set packing attributes in the cube. In this version packing is completely decoupled from the cube (see the previous PR for discussion). It also uses the new pattern for setting CF attributes (change requested by @marqh).
In the discussion in the previous PR, @ajdawson objected (and I agreed) that the method implemented here for providing different packing specifications for multiple cubes is suboptimal. Ideally we would identify cubes in the list by key, but we were unable to come up with keys that could identify cubes uniquely in a list. There is no requirement that cube names be unique, for instance. Suggestions are welcome.