Netcdf packing support v2 #2193
Conversation
lib/iris/fileformats/netcdf.py
Outdated
with Saver(filename, netcdf_format) as sman:
    # Iterate through the cubelist.
    for cube in cubes:
        if isinstance(packing, list):
@jswanljung The user might just use a tuple rather than a list, and that'll cause some carnage from this point on ... we could be graceful and deal with that.
Good point.
I could actually allow this to be an iterable, rather than a list. I made it a list because testing for iterables is tricky. But I can test for strings and types instead and assume an iterable otherwise. If I do some checking before entering the Saver context (as you suggest below), this should fail gracefully.
That would be great 😄
You can test for an iterable with:

from collections import Iterable
if isinstance(some_thing, Iterable):
    ...

This will return True for a string also, so you'll need to check that it is iterable but not a string.
This will also return True for a dict ...
Alright, never mind then!
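The pitfall in this thread is easy to demonstrate. This sketch (not from the PR itself) shows that an `Iterable` check alone cannot distinguish a single packing spec from a list of specs:

```python
# Sketch of the pitfall discussed above: an Iterable check matches
# strings and dicts too, so it cannot by itself tell one packing
# spec apart from a list of specs.
try:
    from collections.abc import Iterable  # Python 3.3+
except ImportError:
    from collections import Iterable

print(isinstance([1, 2], Iterable))           # True
print(isinstance('i2', Iterable))             # True: strings iterate too
print(isinstance({'dtype': 'i2'}, Iterable))  # True: so do dicts
```

This is why the discussion settles on testing for strings and dicts explicitly rather than testing for "iterable".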
lib/iris/fileformats/netcdf.py
Outdated
           'cube.')
    raise ValueError(msg)
else:
    packspec = packing
@jswanljung For this whole section, if we have a list (or a tuple) we're popping as we go along ... but it can be the case that the user doesn't have the same number of cubes as packspecs, and you've dealt with that. Great!
However, we only raise an exception in this case as we stumble across it. From a user perspective, I think I'd rather see that we fail early rather than later.
What I'm suggesting is that before entering the Saver context manager, we detect this exception case beforehand, as we already have the cubes and the packing, so we can easily check that they align in terms of length (associated pairs). If the lengths differ, then raise the exception. This means that when we enter the Saver context manager we have a stronger saving contract i.e. we should now save all of the cubes in the CubeList.
Yeah, I agree with this too. I would have done this except that you can't just do len() on an iterable, you have to iterate through it. I guess we can safely assume that's not going to be a heavy operation.
Just wrote some pre-Saver context validation code, so it should fail before writing anything if packing is invalid in some way.
def is_valid_packspec(p):
    """ Only checks that the datatype is valid. """
    if isinstance(p, dict):
        if 'dtype' in p:
            return is_valid_packspec(p['dtype'])
        else:
            msg = "The argument to packing must contain the key 'dtype'."
            raise ValueError(msg)
    elif (isinstance(p, unicode) or isinstance(p, type) or
          isinstance(p, str)):
        pdtype = np.dtype(p)  # Does nothing if it's already a numpy dtype
        if pdtype.kind != 'i' and pdtype.kind != 'u':
            msg = "The packing datatype must be a numpy integer type."
            raise ValueError(msg)
        return True
    elif p is None:
        return True
    else:
        return False

if is_valid_packspec(packing):
    packspecs = repeat(packing)
else:
    # Assume an iterable; make sure packing is the same length as cubes.
    for cube, packspec in zip_longest(cubes, packing, fillvalue=-1):
        if cube == -1 or packspec == -1:
            msg = ('If packing is a list, it must have the '
                   'same number of elements as cubes.')
            raise ValueError(msg)
        if not is_valid_packspec(packspec):
            msg = 'Invalid packing argument: {}.'.format(packspec)
            raise ValueError(msg)
    packspecs = packing
repeat and zip_longest are from itertools. This is right before entering the saver context. Then it zips cubes with packspecs.
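As a rough standalone sketch of the strategy just described (hypothetical helper name, plain strings standing in for cubes): a single spec is repeated for every cube with `itertools.repeat`, while a list of per-cube specs must match the cube count, detected with a `zip_longest` sentinel before any saving begins.

```python
# Hypothetical sketch of the pairing strategy described above, not the
# PR's actual code.
from itertools import repeat, zip_longest

def pair_specs(cubes, packing):
    # A single spec (dtype string, dict, type, or None) applies to all cubes.
    if packing is None or isinstance(packing, (str, dict, type)):
        return list(zip(cubes, repeat(packing)))
    # Otherwise assume an iterable of per-cube specs; lengths must agree.
    sentinel = object()
    pairs = list(zip_longest(cubes, packing, fillvalue=sentinel))
    if any(sentinel in pair for pair in pairs):
        raise ValueError('packing must have one element per cube')
    return pairs
```

For example, `pair_specs(['cube_a', 'cube_b'], 'i2')` pairs the same spec with every cube, while mismatched lengths fail early, before entering the saver context.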
lib/iris/fileformats/netcdf.py
Outdated
scale_factor = None
add_offset = None
fill_value = None
masked = isinstance(cube.data, np.ma.core.MaskedArray)
@jswanljung You could use np.ma.isMaskedArray(cube.data) instead ...
Can you use ma. instead of np.ma. please. There is already an import numpy.ma as ma in the imports section.
Yup.
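The two spellings are equivalent; a quick sketch of the suggested change:

```python
# ma.isMaskedArray is the idiomatic spelling of the isinstance check
# on np.ma.core.MaskedArray, as suggested in the review above.
import numpy as np
import numpy.ma as ma

data = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(ma.isMaskedArray(data))                    # True
print(isinstance(data, np.ma.core.MaskedArray))  # True
print(ma.isMaskedArray(np.array([1.0, 2.0])))    # False for plain arrays
```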
lib/iris/fileformats/netcdf.py
Outdated
    add_offset = (cmax + cmin) / 2
else:
    add_offset = cmin + 2 ** (n - 1) * scale_factor
if masked and not fill_value:
@jswanljung Careful ... are you sure you don't want if masked and fill_value is not None here ... just in case someone wanted a fill_value=0 (why? ... just because they can I suppose)
Agreed.
Cool, thanks!
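The reviewer's point is the standard falsy-zero pitfall; a minimal illustration:

```python
# A user-supplied fill_value of 0 is perfectly legal but falsy, so
# `not fill_value` would wrongly treat it as absent; the
# `fill_value is not None` test is the safe guard.
fill_value = 0
print(not fill_value)       # True -> the buggy check fires for 0
print(fill_value is None)   # False -> the correct check does not
```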
lib/iris/fileformats/netcdf.py
Outdated
masked = isinstance(cube.data, np.ma.core.MaskedArray)
if isinstance(packing, dict):
    if 'dtype' not in packing:
        msg = ("The dtype attribute is required for packing. ")
Minor - trailing white space within string, and you can drop the brackets
Whoops. I think it was longer before...
lib/iris/fileformats/netcdf.py
Outdated
if not packing:
    # Determine whether there is a cube MDI value.
    fill_value = None
    if isinstance(cube.data, ma.core.MaskedArray):
@jswanljung You've already cached this in the masked local variable, calculated earlier.
True. Wasn't my code except the condition.
Actually the masked local was only set if packing was in kwargs, but I might as well move it outside the conditional.
This got complicated. I can't test for masking without loading lazy data, which I only want to do if the user uses the automatic calculation of parameters (which triggers loading anyway). But if data remains lazy, we assume we need a fill value whether the data is masked or not (this is the current behaviour). Anyway, it was easier just to do isMaskedArray again than to try to cache the result.
lib/iris/fileformats/netcdf.py
Outdated
if isinstance(cube.data, ma.core.MaskedArray):
    fill_value = cube.data.fill_value

if not packing:
if packing is not None
lib/iris/fileformats/netcdf.py
Outdated
# Create the cube CF-netCDF data variable.
# Explicitly assign the fill_value, which will be the type default
# in the case of an unmasked array.
if not packing:
if packing is not None
if isinstance(cube.data, np.ma.core.MaskedArray):
    masked = True
else:
    masked = False
@jswanljung Yeah, you know it now 😉 ... masked = ma.isMaskedArray(cube.data)
file_in = tests.get_data_path(
    ('PP', 'cf_processing',
     '000003000000.03.236.000128.1990.12.01.00.00.b.pp'))
cube = iris.load_cube(file_in)
@jswanljung L642:645 common to all test_single_packed_... type tests, so you could put that in a def setUp(self).
# Read PP input file.
file_in = tests.get_data_path(('PP', 'cf_processing',
                               'abcza_pa19591997_daily_29.b.pp'))
cubes = iris.load(file_in)
@jswanljung L714:716 is common to both test_multi_packed_... type tests ...
In fact, you could split this testing class into two: Test_single_packed_data and Test_multi_packed_data.
Each could have its own common code to load its respective cube test data in its setUp method.
There was a lot more redundancy than just the setup. I removed the redundancy, but without splitting the class.
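A minimal sketch of the suggested setUp refactor (hypothetical class and data; the real tests load PP cubes through iris):

```python
# Hypothetical sketch of the refactor suggested above: shared setup
# moves into setUp so each test_* method starts from the same state.
import unittest

class Test_single_packed_data(unittest.TestCase):
    def setUp(self):
        # In the real tests this would be iris.load_cube(file_in)
        # on the common PP test file.
        self.cube_data = [1.0, 2.0, 3.0]

    def test_packed_signed(self):
        self.assertEqual(len(self.cube_data), 3)

    def test_packed_unsigned(self):
        self.assertTrue(all(x > 0 for x in self.cube_data))
```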
    decimal=decimal)
else:
    self.assertArrayEqual(cube.data, packedcube.data)
@jswanljung Awesome tests, thanks!
However, I'd consider them more as integration tests rather than unit tests ... @ajdawson and @marqh would you agree?
If so, then move them to iris/tests/integration/test_netcdf.py instead ...
I started with the tests in tests/test_netcdf.py, then moved them to integration and then finally (after reading this: http://scitools.org.uk/iris/docs/latest/developers_guide/tests.html) I moved them to unit tests. But I'm happy to put them wherever you feel they belong.
@tests.skip_data
class Test_packed_data(tests.IrisTest):
    def _get_scale_factor_add_offset(self, cube, datatype):
@jswanljung If you agree to split this test class (as mentioned below) then this method is really a convenience function that would live at the module level (note no usage of self within the method) ... then each of the test cases can still call it as a module level function.
@jswanljung Okay I think I'm done. Thanks for this PR, it's going to make a valuable contribution to the code base 😄 I'm going to be distracted elsewhere for a wee while, but I'll get back on this PR asap (hope you don't mind). I'd love to see it merged 👍

@jswanljung You'll need to rebase from upstream/master to resolve conflicts, otherwise travis-ci won't kick in ...
Force-pushed c52ad10 to 679924f.
lib/iris/fileformats/netcdf.py
Outdated
import iris.fileformats._pyke_rules
import iris.io
import iris.util
from itertools import repeat
@jswanljung We've adopted an import ordering within iris, which is basically builtins, third-party, then iris specific modules.
This said, from itertools is a built-in import and belongs up after the import collections line.
Okay.
lib/iris/fileformats/netcdf.py
Outdated
import iris.io
import iris.util
from itertools import repeat
from six.moves import zip_longest
For six you can add this to the import on line-28:
from six.moves import (filter, input, map, range, zip, zip_longest)  # noqa
Unfortunately there's a coding standards test that makes sure nobody tampers with that line!
iris.tests.test_coding_standards.TestFutureImports
But I can put it right after that line.
I've got all tests passing now except an image test that is failing under Python 3. I think it's this problem: #2195. I'll try rerunning the tests.
Force-pushed 83d3cfc to c9bb112.
Actually I don't seem to be able to rerun these tests without committing something new.

@jswanljung I've restarted your failed travis test for you. Unfortunately, the failure is our fault and is related to #2195. Apologies.
# importing anything else.
import iris.tests as tests

from itertools import repeat
@jswanljung Sorry, but within each import block, the imports should be in alphabetical order, so itertools comes after contextlib and before os.path 😄
Ah! Another chance to fail the tests! :-)
Yeah, I'm working on a fix for #2195 😱
But I'll keep an eye on your travis jobs and keep restarting them manually. We'll get there ...
I have no doubt.
Maybe it's too obvious to need pointing out in the docstring, but iterable arguments to cubes and packing must both be ordered. No sets, for instance!
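A tiny illustration (hypothetical names) of why ordering matters: the cube-to-spec pairing is positional.

```python
# Pairing cubes with packing specs is positional, so an unordered
# container such as a set would give an arbitrary cube-to-spec mapping.
cubes = ['air_temperature', 'specific_humidity']
packing_list = ['i2', 'u2']   # list: position 0 pairs with cube 0
packing_set = {'i2', 'u2'}    # set: iteration order is arbitrary
pairs = list(zip(cubes, packing_list))
print(pairs)
```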
This PR is looking great from my point of view. Many thanks @jswanljung for the contributions. I fully support the approach and the implementation looks just fine to me. Merge when ready, I say 👍
@jswanljung I really wouldn't worry about explicitly talking about sets in your docstrings. Thanks for the nudge @marqh. I'll take another pass through this PR. Let's get this merged today!

Great! Just added a what's new entry dated today (no other changes). Hopefully the tests will pass again.

Okay, tests passed, but I guess I still need to do a rebase against upstream/master again?
Force-pushed 9e56a69 to aba9510.
Rebased against latest upstream/master.

Passing after rebase except for a random image test.

Passing again! So back to where we were this morning except now the code actually does something useful.

@jswanljung It appears that one of your computers does not have git configured correctly. All but three of the commits are attributed to

@QuLogic Thanks, I'll have a look at that.

@jswanljung Thanks for an awesome contribution and your sustained effort on this PR. It's very much appreciated! Grattis 🍻

And thank you and the rest of the SciTools people! This was a real learning experience and it's gratifying to see it merged. Also something I'll use all the time myself. 🍻
NetCDF has support for variable packing: by specifying the attributes scale_factor and add_offset, data can be packed into smaller data types. 16-bit short integers provide more than enough quantization for suitably scaled temperature data, for instance. See:
http://unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
Iris already supports loading of packed netCDF because it is automatically handled by the netCDF4 module. However, there is currently no way to save data with packing.
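As background, the packing arithmetic from the best-practices page can be sketched as follows. This is a simplified illustration, not the PR's actual implementation, though the add_offset formula mirrors the signed-type branch quoted in the review above:

```python
# Simplified sketch of NetCDF packing: packed values are
# round((x - add_offset) / scale_factor), with the parameters chosen
# so the data range spans n bits (one value reserved for fill).
import numpy as np

def packing_params(data, dtype='i2'):
    n = np.dtype(dtype).itemsize * 8
    cmin, cmax = float(data.min()), float(data.max())
    scale_factor = (cmax - cmin) / (2 ** n - 2)
    add_offset = (cmax + cmin) / 2  # centring for a signed type
    return scale_factor, add_offset

data = np.array([250.0, 300.0])  # e.g. temperatures in kelvin
scale_factor, add_offset = packing_params(data)
packed = np.round((data - add_offset) / scale_factor).astype('i2')
unpacked = packed * scale_factor + add_offset  # what readers reconstruct
```

Readers of the packed file recover the data to within one quantization step, which is the trade-off packing makes for halving (or better) the storage.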
This PR implements packing by adding an optional `packing` argument to the netcdf.save function and netcdf.Saver.write method. As described in the docstring:

    A numpy integer datatype (signed or unsigned) or a string that
    describes a numpy integer dtype (i.e. 'i2', 'short', 'u4') or a
    dict of packing parameters as described below or a list of such
    types, strings, or dicts. This provides support for netCDF data
    packing as described in
    http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
    If this argument is a type (or type string), appropriate values of
    scale_factor and add_offset will be automatically calculated based
    on `cube.data` and possible masking. For masked data, `fill_value`
    is taken from netCDF4.default_fillvals. For more control, pass a
    dict with one or more of the following keys: `dtype` (required),
    `scale_factor`, `add_offset`, and `fill_value`. To save multiple
    cubes with different packing parameters, pass a list of types,
    strings, dicts, or `None`, one for each cube. Note that automatic
    calculation of packing parameters will trigger loading of lazy
    data; set them manually using a dict to avoid this. The default is
    `None`, in which case the datatype is determined from the cube and
    no packing will occur.
So for example:
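The example itself did not survive in this transcript; based on the docstring above, usage would look something like this (hypothetical file and spec names, reconstructed rather than quoted):

```python
# Hypothetical usage sketch reconstructed from the docstring above;
# the original example was lost from this transcript.
spec_simple = 'i2'  # pack to 16-bit, automatic scale_factor/add_offset
spec_manual = {'dtype': 'i2', 'scale_factor': 0.01, 'add_offset': 273.15}
specs_per_cube = ['i2', None, spec_manual]  # one entry per cube in a list

# With iris available these would be passed straight through, e.g.:
#   iris.save(cubes, 'packed.nc', packing=spec_manual)
print(spec_manual['dtype'])
```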
This PR supersedes a previous attempt (#2152), which I withdrew because it set packing attributes in the cube. In this version packing is completely decoupled from the cube (see the previous PR for discussion). It also uses the new pattern for setting CF attributes (change requested by @marqh).
In the discussion in the previous PR, @ajdawson objected (and I agreed) that the method implemented here for providing different packing specifications for multiple cubes is suboptimal. Ideally we would identify cubes in the list by key, but we were unable to come up with keys that could identify cubes uniquely in a list. There is no requirement that cube names be unique, for instance. Suggestions are welcome.