Conversation

@jswanljung (Contributor) commented Oct 13, 2016:

NetCDF has support for variable packing: by specifying the attributes scale_factor and add_offset, data can be packed into smaller data types. 16-bit short integers provide more than enough quantization for suitably scaled temperature data, for instance. See:
http://unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
Iris already supports loading packed netCDF data, because the netCDF4 module handles it automatically. However, there is currently no way to save data with packing.

This PR implements packing by adding an optional packing argument to the netcdf.save function and netcdf.Saver.write method. As described in the docstring:

  • packing (type or string or dict or list): A numpy integer datatype
    (signed or unsigned) or a string that describes a numpy integer dtype
    (e.g. 'i2', 'short', 'u4') or a dict of packing parameters as described
    below or a list of such types, strings, or dicts. This provides support
    for netCDF data packing as described in
    http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
    If this argument is a type (or type string), appropriate values of
    scale_factor and add_offset will be automatically calculated based on
    cube.data and possible masking. For masked data, fill_value is
    taken from netCDF4.default_fillvals. For more control, pass a dict with
    one or more of the following keys: dtype (required), scale_factor,
    add_offset, and fill_value. To save multiple cubes with different
    packing parameters, pass a list of types, strings, dicts, or None,
    one for each cube. Note that automatic calculation of packing
    parameters will trigger loading of lazy data; set them manually using a
    dict to avoid this. The default is None, in which case the datatype
    is determined from the cube and no packing will occur.

So for example:

# Pack into np.int16 with automatic calculation of scale_factor and
# add_offset, and the default fill value if applicable.
iris.save(cube, 'outfile.nc', packing='int16')

# Pack into np.int16 with manually set scale_factor and add_offset.
iris.save(cube, 'outfile.nc',
          packing=dict(dtype=np.short, scale_factor=0.01, add_offset=250))

# Pack the first cube with automatic scaling, the second manually, and
# the third not at all (the cubelist has three cubes).
iris.save(cubelist, 'outfile.nc',
          packing=['int16', dict(dtype='int16', scale_factor=0.01), None])
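
For background, a netCDF reader unpacks values as unpacked = packed * scale_factor + add_offset, so saving just has to choose a scale and offset that map the data range onto the target integer range. Below is a rough sketch of the kind of calculation the automatic mode performs; the helper name is illustrative and the formulas are inferred from fragments quoted later in this thread, so treat it as an approximation rather than the merged implementation.

import numpy as np
import numpy.ma as ma


def packing_params(data, dtype):
    # Map the data range onto the n-bit integer range, reserving one
    # integer value for the fill value when the data is masked.
    dtype = np.dtype(dtype)
    cmin = float(np.min(data))
    cmax = float(np.max(data))
    n = dtype.itemsize * 8
    if ma.isMaskedArray(data):
        scale_factor = (cmax - cmin) / (2.0 ** n - 2)
        add_offset = (cmax + cmin) / 2.0
    else:
        scale_factor = (cmax - cmin) / (2.0 ** n - 1)
        add_offset = cmin + 2.0 ** (n - 1) * scale_factor
    return scale_factor, add_offset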

This PR supersedes a previous attempt (#2152), which I withdrew because it set packing attributes on the cube. In this version packing is completely decoupled from the cube (see the previous PR for discussion). It also uses the new pattern for setting CF attributes (a change requested by @marqh).

In the discussion in the previous PR, @ajdawson objected (and I agreed) that the method implemented here for providing different packing specifications for multiple cubes is suboptimal. Ideally we would identify cubes in the list by key, but we were unable to come up with keys that could identify cubes uniquely in a list. There is no requirement that cube names be unique, for instance. Suggestions are welcome.

with Saver(filename, netcdf_format) as sman:
    # Iterate through the cubelist.
    for cube in cubes:
        if isinstance(packing, list):
@bjlittle (Member) commented Oct 14, 2016:

@jswanljung The user might just use a tuple rather than a list, and that'll cause some carnage from this point on ... we could be graceful and deal with that.

@jswanljung (Contributor Author):

Good point.

@jswanljung (Contributor Author):

I could actually allow this to be an iterable, rather than a list. I made it a list because testing for iterables is tricky. But I can test for strings and types instead and assume an iterable otherwise. If I do some checking before entering the Saver context (as you suggest below), this should fail gracefully.

Member:

That would be great 😄

Member:

You can test for an iterable with

from collections import Iterable

if isinstance(some_thing, Iterable):
    ...

This will return True for strings too, so you'll need to check that something is iterable but not a string.

Member:

This will also return True for a dict ...

Member:

Alright, never mind then!
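
For reference, a sketch of the check that falls out of this exchange: treat strings, types, dicts, and None explicitly as single packspecs, and only then fall back to assuming an iterable of per-cube packspecs. The function names here are illustrative (and note that Iterable lives in collections.abc on Python 3.3+).

from collections import Iterable

import six


def is_single_packspec(packing):
    # Strings and dicts are themselves iterable, so recognise them
    # (along with types and None) before assuming a per-cube sequence.
    return (packing is None or
            isinstance(packing, (six.string_types, type, dict)))


def is_multi_packspec(packing):
    return (not is_single_packspec(packing) and
            isinstance(packing, Iterable))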

                   'cube.')
            raise ValueError(msg)
        else:
            packspec = packing
@bjlittle (Member) commented Oct 14, 2016:

@jswanljung For this whole section, if we have a list (or a tuple) we're popping as we go along ... but it can be the case that the user doesn't have the same number of cubes as packspecs, and you've dealt with that. Great!

However, we only raise an exception in this case as we stumble across it. From a user perspective, I think I'd rather see us fail early than late.

What I'm suggesting is that before entering the Saver context manager, we detect this exception case beforehand, as we already have the cubes and the packing, so we can easily check that they align in terms of length (associated pairs). If the lengths differ, then raise the exception. This means that when we enter the Saver context manager we have a stronger saving contract i.e. we should now save all of the cubes in the CubeList.

@jswanljung (Contributor Author):

Yeah, I agree with this too. I would have done this, except that you can't just call len() on an arbitrary iterable; you have to iterate through it. I guess we can safely assume that's not going to be a heavy operation.

@jswanljung (Contributor Author):

Just wrote some pre-Saver context validation code, so it should fail before writing anything if packing is invalid in some way.

@jswanljung (Contributor Author):

    def is_valid_packspec(p):
        """Only checks that the datatype is valid."""
        if isinstance(p, dict):
            if 'dtype' in p:
                return is_valid_packspec(p['dtype'])
            else:
                msg = "The argument to packing must contain the key 'dtype'."
                raise ValueError(msg)
        elif isinstance(p, (six.string_types, type)):
            # np.dtype does nothing if it's already a numpy dtype.
            pdtype = np.dtype(p)
            if pdtype.kind != 'i' and pdtype.kind != 'u':
                msg = "The packing datatype must be a numpy integer type."
                raise ValueError(msg)
            return True
        elif p is None:
            return True
        else:
            return False

    if is_valid_packspec(packing):
        packspecs = repeat(packing)
    else:
        # Assume an iterable; make sure packing has one entry per cube.
        for cube, packspec in zip_longest(cubes, packing, fillvalue=-1):
            if cube == -1 or packspec == -1:
                msg = ('If packing is a list, it must have the '
                       'same number of elements as the argument to '
                       'cube.')
                raise ValueError(msg)
            if not is_valid_packspec(packspec):
                msg = 'Invalid packing argument: {}.'.format(packspec)
                raise ValueError(msg)
        packspecs = packing

repeat and zip_longest are from itertools. This runs right before entering the Saver context. Then it zips cubes with packspecs.
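
In other words, once packing has been validated and normalised into packspecs (a repeat iterator for a single spec, or the user's own sequence), the save loop can simply pair them up. A rough sketch, with the other write arguments omitted:

with Saver(filename, netcdf_format) as sman:
    # packspecs is either repeat(packing) or a per-cube sequence,
    # so zip pairs each cube with its own packspec.
    for cube, packspec in zip(cubes, packspecs):
        sman.write(cube, packing=packspec)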

scale_factor = None
add_offset = None
fill_value = None
masked = isinstance(cube.data, np.ma.core.MaskedArray)
Member:

@jswanljung You could use np.ma.isMaskedArray(cube.data) instead ...

Member:

Can you use ma. instead of np.ma., please? There is already an import numpy.ma as ma in the imports section.

@jswanljung (Contributor Author):

Yup.

    add_offset = (cmax + cmin) / 2
else:
    add_offset = cmin + 2 ** (n - 1) * scale_factor
if masked and not fill_value:
Member:

@jswanljung Careful ... are you sure you don't want if masked and fill_value is not None here ... just in case someone wanted a fill_value=0 (why? ... just because they can I suppose)

@jswanljung (Contributor Author):

Agreed.

Member:

Cool, thanks!
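
The difference in miniature (masked here is a stand-in for the local computed earlier):

masked = True
fill_value = 0  # a deliberate, valid fill value

print(masked and not fill_value)      # True: 0 is wrongly treated as unset
print(masked and fill_value is None)  # False: only None means unset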

masked = isinstance(cube.data, np.ma.core.MaskedArray)
if isinstance(packing, dict):
    if 'dtype' not in packing:
        msg = ("The dtype attribute is required for packing. ")
Member:

Minor - trailing white space within string, and you can drop the brackets

@jswanljung (Contributor Author):

Whoops. I think it was longer before...

if not packing:
    # Determine whether there is a cube MDI value.
    fill_value = None
    if isinstance(cube.data, ma.core.MaskedArray):
Member:

@jswanljung You've already cached this in the masked local variable, calculated earlier.

@jswanljung (Contributor Author):

True. That wasn't my code, apart from the condition.

@jswanljung (Contributor Author):

Actually the masked local was only set if packing was in kwargs, but I might as well move it outside the conditional.

@jswanljung (Contributor Author):

This got complicated. I can't test for masking without loading lazy data, which I only want to do if the user asks for automatic calculation of parameters (which triggers loading anyway). But if the data remains lazy, we assume we need a fill value whether the data is masked or not (this is the current behaviour). Anyway, it was easier just to call isMaskedArray again than to try to cache the result.

if isinstance(cube.data, ma.core.MaskedArray):
    fill_value = cube.data.fill_value

if not packing:
Member:

if packing is not None

# Create the cube CF-netCDF data variable.
# Explicitly assign the fill_value, which will be the type default
# in the case of an unmasked array.
if not packing:
Member:

if packing is not None

if isinstance(cube.data, np.ma.core.MaskedArray):
    masked = True
else:
    masked = False
Member:

@jswanljung Yeah, you know it now 😉 ... masked = ma.isMaskedArray(cube.data)

file_in = tests.get_data_path(
    ('PP', 'cf_processing',
     '000003000000.03.236.000128.1990.12.01.00.00.b.pp'))
cube = iris.load_cube(file_in)
Member:

@jswanljung L642:645 is common to all test_single_packed_... type tests, so you could put that in a def setUp(self).

# Read PP input file.
file_in = tests.get_data_path(('PP', 'cf_processing',
                               'abcza_pa19591997_daily_29.b.pp'))
cubes = iris.load(file_in)
Member:

@jswanljung L714:716 is common to both test_multi_packed_... type tests ...

In fact, you could split this testing class into two: Test_single_packed_data and Test_multi_packed_data.

Each could have their own common code to load their respective cube test data in their setUp method.
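
A sketch of the suggested restructuring; the class and method names follow the suggestion above, and the file paths are the ones quoted from the diff:

import iris
import iris.tests as tests


@tests.skip_data
class Test_single_packed_data(tests.IrisTest):
    def setUp(self):
        # Common to all test_single_packed_... cases.
        file_in = tests.get_data_path(
            ('PP', 'cf_processing',
             '000003000000.03.236.000128.1990.12.01.00.00.b.pp'))
        self.cube = iris.load_cube(file_in)


@tests.skip_data
class Test_multi_packed_data(tests.IrisTest):
    def setUp(self):
        # Common to both test_multi_packed_... cases.
        file_in = tests.get_data_path(
            ('PP', 'cf_processing', 'abcza_pa19591997_daily_29.b.pp'))
        self.cubes = iris.load(file_in)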

@jswanljung (Contributor Author):

There was a lot more redundancy than just the setup. I removed the redundancy, but without splitting the class.

                              decimal=decimal)
else:
    self.assertArrayEqual(cube.data, packedcube.data)

Member:

@jswanljung Awesome tests, thanks!

However, I'd consider them more as integration tests rather than unit tests ... @ajdawson and @marqh would you agree?

If so, then move them to iris/tests/integration/test_netcdf.py instead ...

@jswanljung (Contributor Author) commented Oct 14, 2016:

I started with the tests in tests/test_netcdf.py, then moved them to integration and then finally (after reading this: http://scitools.org.uk/iris/docs/latest/developers_guide/tests.html) I moved them to unit tests. But I'm happy to put them wherever you feel they belong.


@tests.skip_data
class Test_packed_data(tests.IrisTest):
    def _get_scale_factor_add_offset(self, cube, datatype):
Member:

@jswanljung If you agree to split this test class (as mentioned below) then this method is really a convenience function that would live at the module level (note no usage of self within the method) ... then each of the test cases can still call it as a module level function.

@bjlittle bjlittle self-assigned this Oct 14, 2016
@bjlittle (Member):

@jswanljung Okay I think I'm done.

Thanks for this PR, it's going to make a valuable contribution to the code base 😄

I'm going to be distracted elsewhere for a wee while, but I'll get back on this PR asap (hope you don't mind). I'd love to see it merged 👍

@bjlittle (Member):

@jswanljung You'll need to rebase from upstream/master to resolve conflicts, otherwise travis-ci won't kick in ...

@jswanljung force-pushed the netcdf_packing_support_v2 branch from c52ad10 to 679924f on October 14, 2016 at 13:00
import iris.fileformats._pyke_rules
import iris.io
import iris.util
from itertools import repeat
Member:

@jswanljung We've adopted an import ordering within iris, which is basically builtins, third-party, then iris specific modules.

This said, from itertools is a built-in import and belongs up after the import collections line.

@jswanljung (Contributor Author):

Okay.

import iris.io
import iris.util
from itertools import repeat
from six.moves import zip_longest
Member:

For six you can add this to the import on line 28

from six.moves import (filter, input, map, range, zip, zip_longest) # noqa

@jswanljung (Contributor Author):

Unfortunately there's a coding standards test that makes sure nobody tampers with that line!

@jswanljung (Contributor Author):

iris.tests.test_coding_standards.TestFutureImports

@jswanljung (Contributor Author):

But I can put it right after that line.

@marqh (Member) commented Oct 14, 2016:

> However, I'd consider them more as integration tests rather than unit tests ... @ajdawson and @marqh would you agree?

I agree, i think these should live under tests.integration

@jswanljung (Contributor Author):

I've got all tests passing now except an image test that is failing under Python 3. I think it's this problem: #2195. I'll try rerunning the tests.

@jswanljung force-pushed the netcdf_packing_support_v2 branch from 83d3cfc to c9bb112 on October 14, 2016 at 19:25
@jswanljung (Contributor Author):

Actually I don't seem to be able to rerun these tests without committing something new.

@bjlittle (Member) commented Oct 14, 2016:

@jswanljung I've restarted your failed travis test for you. Unfortunately, the failure is our fault and is related to #2195. Apologies.

# importing anything else.
import iris.tests as tests

from itertools import repeat
Member:

@jswanljung Sorry, but within each import block, the imports should be in alphabetical order, so itertools comes after contextlib and before os.path 😄

@jswanljung (Contributor Author):

Ah! Another chance to fail the tests! :-)

Member:

Yeah, I'm working on a fix for #2195 😱

But I'll keep an eye on your travis jobs and keep restarting them manually. We'll get there ...

@jswanljung (Contributor Author):

I have no doubt.

@jswanljung (Contributor Author):

Maybe it's too obvious to need pointing out in the docstring, but iterable arguments to cubes and packing must both be ordered. No sets, for instance!

@marqh (Member) commented Oct 19, 2016:

This PR is looking great from my point of view. Many thanks @jswanljung for the contributions.

I fully support the approach and the implementation looks just fine to me

merge when ready, I say 👍

@bjlittle (Member) commented Oct 19, 2016:

@jswanljung I really wouldn't worry about explicitly talking about sets in your docstrings.

Thanks for the nudge @marqh. I'll take another pass through this PR. Let's get this merged today!

@jswanljung (Contributor Author):

Great! Just added a what's new entry dated today (no other changes). Hopefully the tests will pass again.

@jswanljung (Contributor Author):

Okay, tests passed, but I guess I still need to do a rebase against upstream/master again?

@jswanljung force-pushed the netcdf_packing_support_v2 branch from 9e56a69 to aba9510 on October 19, 2016 at 11:50
@jswanljung (Contributor Author):

Rebased against latest upstream/master.

@jswanljung (Contributor Author):

Passing after rebase except for a random image test.

@jswanljung (Contributor Author):

Passing again! So back to where we were this morning except now the code actually does something useful.

@QuLogic (Member) commented Oct 19, 2016:

@jswanljung It appears that one of your computers does not have git configured correctly. All but three of the commits are attributed to Johan Swanljung <[email protected]> which is presumably not what you want.

@jswanljung (Contributor Author):

@QuLogic Thanks, I'll have a look at that.

@bjlittle (Member):

@jswanljung Thanks for an awesome contribution and your sustained effort on this PR.

It's very much appreciated! Grattis 🍻

@bjlittle merged commit bd441c1 into SciTools:master on Oct 19, 2016
@jswanljung (Contributor Author):

And thank you and the rest of the SciTools people! This was a real learning experience and it's gratifying to see it merged. Also something I'll use all the time myself. 🍻

@QuLogic added this to the v1.11 milestone on Oct 19, 2016
@jswanljung deleted the netcdf_packing_support_v2 branch on October 20, 2016 at 07:44