
Conversation

@jswanljung
Contributor

@jswanljung jswanljung commented Sep 23, 2016

NetCDF has support for variable packing; by specifying the attributes scale_factor and add_offset, data can be packed into smaller data types. 16-bit short integers provide more than enough quantization for suitably scaled temperature data, for instance. See: http://unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
Iris already supports loading of packed netCDF data because it is handled automatically by the netCDF4 module. However, there is currently no way to save data with packing.
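
For reference, here is a minimal standalone sketch of the packing arithmetic, using one common recipe that reserves one integer for the fill value (not Iris code; names and sample values are purely illustrative):

import numpy as np

data = np.linspace(250.0, 320.0, 1000)  # e.g. temperatures in kelvin
nbits = 16                              # packing into a 16-bit signed integer

# Spread the data range over the integer range, keeping one value free
# for a fill value.
scale_factor = (data.max() - data.min()) / (2 ** nbits - 2)
add_offset = (data.max() + data.min()) / 2.0

packed = np.round((data - add_offset) / scale_factor).astype(np.int16)
unpacked = packed * scale_factor + add_offset  # what readers reconstruct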

#2148

This PR adds an optional pack_dtype argument to the netcdf save and Saver.write methods. From the docstring:

  • pack_dtype (type or string or list):
    A numpy integer datatype (signed or unsigned) or a string that
    describes a numpy integer dtype (e.g. 'i2', 'short', 'u4'). This
    provides support for netCDF data packing as described in
    http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
    If either scale_factor or add_offset are set in cube.attributes,
    those values are used. If neither is set, appropriate values of
    scale_factor and add_offset are calculated based on cube.data,
    pack_dtype and possible masking (see the link for details). Note
    that automatic calculation of scale_factor and add_offset will
    trigger loading of lazy_data; set scale_factor and add_offset
    manually in cube.attributes if you wish to avoid this. For
    masked data, fill_values are taken from netCDF4.default_fillvals.
    If the cube argument is an iterable and different datatypes are
    desired for each cube, pack_dtype can also be a list with the same
    number of elements as the iterable. The default is None, in which
    case the datatype is determined from the cube and no packing will
    occur.
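
To illustrate the proposed interface, a minimal usage sketch (assuming the keyword is forwarded through iris.save to the netCDF saver; the file names are placeholders):

import iris

cube = iris.load_cube('infile.nc')

# Pack into 16-bit signed integers; scale_factor and add_offset are computed
# automatically from cube.data unless they are already set in cube.attributes.
iris.save(cube, 'packed.nc', pack_dtype='i2')

# Alternatively, set the packing attributes manually to avoid realising
# lazy data.
cube.attributes['scale_factor'] = 0.01
cube.attributes['add_offset'] = 273.15
iris.save(cube, 'packed.nc', pack_dtype='i2')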

One of the first things that sold me on Iris was how wonderfully convenient netCDF input and output was, but I couldn't use it when I needed packing (which was fairly often). This fixes that, and the automatic calculation of packing attributes adds even more convenience.

Although you can set the packing attributes (scale_factor and add_offset) manually, there is no way to set a custom fill value for masked data. I doubt this will be missed, but I could be wrong.

There is an error in the tests, but it is due to changes in cf_units, as has been noted previously. All tests associated with this change pass.

@jswanljung
Contributor Author

The way xarray handles this is to pass a dictionary argument to its to_netcdf method. The dictionary allows per-variable values of dtype, scale_factor and add_offset to be set upon save without actually putting them in a cube. That can easily accommodate custom fill values as well.
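
For comparison, a sketch of that xarray pattern (variable name and values are illustrative):

import xarray as xr

ds = xr.open_dataset('infile.nc')
# Encoding details are supplied at save time, per variable, and are never
# stored on the Dataset itself.
ds.to_netcdf('packed.nc', encoding={
    'air_temperature': {'dtype': 'int16',
                        'scale_factor': 0.01,
                        'add_offset': 273.15,
                        '_FillValue': -32767},
})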

@jswanljung jswanljung force-pushed the netcdf_packing_support branch from a6d660b to 7ba9374 Compare October 3, 2016 06:34
@pelson
Member

pelson commented Oct 4, 2016

Firstly, thank you for this extremely well put together change @jswanljung. I'm completely in support of adding the ability to write packed variable data and just wanted to let you know that I will take a deeper look at this change over the next few days. In the meantime, could I ask that you sign and send over a CLA (http://scitools.org.uk/documents/cla_form.pdf)?

Thanks,

@jswanljung
Contributor Author

jswanljung commented Oct 4, 2016

Thanks, I sent over a signed CLA. As an argument against the way I implemented this, have a look at the way xarray does it (http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html), in particular the encoding argument. The nice thing about that is that no information about the details of the file encoding ever goes into the Dataset or DataArray. In my implementation, you set scale_factor and add_offset in Cube.attributes, which goes against the general philosophy of keeping the file format details separate from the cube abstraction. We can't lift the xarray method directly since cubes in iris don't have unique names, but we could pass a dict or list of dicts with dtype, scale_factor and add_offset as keys to save and Saver.write instead of requiring new attributes to be set in cubes. That would also make custom fill values possible.
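
A rough sketch of what that alternative might look like (a hypothetical keyword and keys, not what this PR implements; the keyword name anticipates the proposal further down):

import iris
import iris.cube
import numpy as np

temperature = iris.cube.Cube(np.random.uniform(250, 320, (10, 10)),
                             standard_name='air_temperature', units='K')
pressure = iris.cube.Cube(np.random.uniform(950, 1050, (10, 10)),
                          standard_name='air_pressure', units='hPa')

# One entry per cube, in save order; the encoding details stay out of the cubes.
packing = [dict(dtype='i2', scale_factor=0.01, add_offset=285.0),
           None]  # second cube saved without packing
iris.save([temperature, pressure], 'outfile.nc', packing=packing)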

I'm willing to make that change if we agree that it's better; now that I know how this stuff works I don't think it would take that long.

@marqh
Member

marqh commented Oct 6, 2016

CLA signed and docs updated
SciTools/scitools.org.uk@f242e4a

thank you @jswanljung

Member

@marqh marqh left a comment


the recent changes to the iris.fileformats.netcdf module have made this PR unsuitable to commit due to conflicts.

The _setncattr helper is now preferred, but it was not yet in place when you raised your change

please may you rebase and address these issues and I will re-review

I have not found any fundamental concerns at this stage

many thanks for all the hard work, I hope you can stay with it whilst we get these changes in place

to the netCDF Dataset.
"""
if cube.standard_name:
cf_var.standard_name = cube.standard_name
Member


we have run into compatibility issues with Python 2 / Python 3 and netCDF4-python 1.2 around the setting of attributes on netCDF variables, which has caused some concern.
we have recently adopted a different pattern for setting attributes, to ensure that typing is preserved across the Python / netCDF4-python testing matrix
see
#2158 #2179 #2183
for the somewhat disjointed changes for this
(with apologies from me for the lack of cohesion in change, I missed a couple of key issues along the way)

with this in mind, please may you use the

_setncattr(cfvar, 'standard_name', cube.standard_name)

pattern for setting attributes?
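
For context, a rough sketch of the two patterns side by side (a fragment: cf_var and cube are as in the surrounding diff, and the real helper in iris.fileformats.netcdf may differ in detail):

# Pattern used in this PR: plain Python attribute assignment.
cf_var.standard_name = cube.standard_name

# Preferred pattern: route through the netCDF4 setncattr call so that
# attribute types survive the Python 2/3 and netCDF4-python version matrix.
# The helper is presumably a thin wrapper along these lines.
def _setncattr(variable, name, value):
    variable.setncattr(name, value)

_setncattr(cf_var, 'standard_name', cube.standard_name)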

cf_var.standard_name = cube.standard_name

if cube.long_name:
cf_var.long_name = cube.long_name
Member


_setncattr

cf_var.long_name = cube.long_name

if cube.units != 'unknown':
cf_var.units = str(cube.units)
Member


_setncattr

'global attribute.'.format(attr_name=attr_name)
warnings.warn(msg)

cf_var.setncattr(attr_name, value)
Member


_setncattr

@jswanljung
Contributor Author

jswanljung commented Oct 12, 2016

I don't mind going forward with this, but in the absence of other opinions, I have convinced myself that setting packing attributes in the cube is the wrong way to go (see my argument in the comments above). Unless someone strenuously objects, I shall withdraw this pull request and submit a new one in which the pack_dtype argument to netcdf.save and Saver.write in this PR is replaced by an argument called packing with a docstring as follows:

  • packing (type or string or dict or iterable):
    A numpy integer datatype (signed or unsigned) or a string that
    describes a numpy integer dtype (e.g. 'i2', 'short', 'u4'), or a
    dict of packing parameters as described below, or an iterable of
    such types, strings, or dicts. This provides support for netCDF
    data packing as described in
    http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
    If this argument is a type (or type string), appropriate values of
    scale_factor and add_offset will be automatically calculated based
    on cube.data and possible masking. For masked data, fill_values
    are taken from netCDF4.default_fillvals. For more control, pass a
    dict with one or more of the following keys: dtype (required),
    scale_factor, add_offset, and fill_values. To save multiple cubes
    with different packing parameters, pass a list (or other iterable)
    of types, strings, or dicts, one for each cube. Note that
    automatic calculation of scale_factor and add_offset will trigger
    loading of lazy_data; set them manually to avoid this. The default
    is None, in which case the datatype is determined from the cube
    and no packing will occur.

So for example:

import iris
import numpy as np

# Pack into np.int16 with automatic calculation of scale_factor and
# add_offset, and the default fill value if applicable.
iris.save(cube, 'outfile.nc', packing='int16')

# Pack into np.int16 with manually set scale_factor and add_offset.
iris.save(cube, 'outfile.nc', packing=dict(dtype=np.short, scale_factor=0.01, add_offset=250))

# Pack one cube with automatic scaling, the second manually, and the third
# not at all (cubelist has 3 cubes).
iris.save(cubelist, 'outfile.nc', packing=['int16', dict(dtype='int16', scale_factor=0.01), None])

Doing it this way will also require fewer structural changes to the code than in this PR.

@ajdawson
Member

@jswanljung - I'm in favour of de-coupling saver options from cube attributes, so I'd support your suggested new approach. I'm not sure about the finer details but in general I think controlling this solely at the saver interface is the right thing to do.

@jswanljung
Contributor Author

@ajdawson If you have any misgivings about the finer details, I'm very open to revising them. Especially before I've implemented them.

@ajdawson
Member

I don't have misgivings as such; the approach sounds reasonable, I'm just not clear on how some of the details should work.

Can you explain how you will determine which cube gets which packing type? I had imagined a dictionary that somehow maps cubes to packing specifiers (perhaps by name), which would allow you to specify packing for a subset of cubes and not pack the rest. I think this would be more flexible than an iterable the same length as the number of cubes.
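
For illustration, such a mapping might look like this (hypothetical, never implemented; only the named cube is packed, any cube not in the mapping is saved unpacked):

packing = {'air_temperature': 'int16'}
iris.save(cubelist, 'outfile.nc', packing=packing)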

@jswanljung
Contributor Author

jswanljung commented Oct 12, 2016

My suggestion was to do it by order in the list. The problem with using names as keys is that iris doesn't require them to be unique. In fact, I think the test data for saving multiple cubes has more than one cube called 'air_temperature'. There doesn't seem to be any unique identifier for a cube except the cube itself. It would of course be possible to use the cubes as dictionary keys, assuming None for any missing cubes.

Edit: xarray uses names as keys, exactly as you suggest, but unlike iris it has a Dataset concept which makes it easy to enforce uniqueness of names.

@ajdawson
Member

I appreciate the confusion around using names; it doesn't work all that well. But I don't like the idea of having an iterable that requires a matching length. For example, if I want to apply packing to 1 cube in a list of 10, I'd have to construct a length-10 iterable with the correct position set with my packing options. It would be nicer to be able to just provide the packing details for the single cube without having to construct something larger, which is where a dict would prove useful.

Perhaps we can think of something better than name for a key to identify each cube.

@jswanljung
Contributor Author

jswanljung commented Oct 12, 2016

I understand your concern and agree that it would be nice to be able to set options only for individual cubes. What about using cubes themselves as keys? It's a bit weird, but I'm having trouble coming up with anything else that uniquely identifies cubes from the iterable passed to iris.save.

@jswanljung
Contributor Author

Answering myself, I see several drawbacks to using cubes as keys, but mainly that it precludes creating reusable dicts of packing parameters. But I really can't think of anything else that would work unambiguously as a key.

One possibility would be to keep my suggestion using ordered lists, but also allow packing to be a callable that returns a packing parameter dict given a cube as an argument. That would give the user unlimited flexibility to devise reusable packing schemes as well as responsibility for distinguishing among cubes.
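
A sketch of that callable interface, purely illustrative and never implemented (cubelist as in the earlier examples):

def choose_packing(cube):
    # Return a packing parameter dict for this cube, or None for no packing.
    if cube.name() == 'air_temperature':
        return dict(dtype='int16', scale_factor=0.01, add_offset=250)
    return None

iris.save(cubelist, 'outfile.nc', packing=choose_packing)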

But this is probably a pretty marginal use case. I may be wrong, but I think storing multiple variables in a single netCDF file is still pretty unusual because it wasn't possible before netCDF 4 and it complicates lots of workflows, for instance those using command line tools. For datasets small enough to comfortably fit in memory, just using 'int16' as a packing argument will be very convenient even for multiple variables. And for larger datasets it is even more unusual to store multiple variables per file.

@ajdawson
Member

Well, in the absence of a better alternative I'd be happy to see a PR that simply implements the sequence method you initially described.

I may be wrong, but I think storing multiple variables in a single netCDF file is still pretty unusual because it wasn't possible before netCDF 4

I'm going to have to strongly disagree with that statement! Storing multiple variables in a netcdf file has always been possible, and is an extremely common use case. Don't be under the illusion that this is an edge case.

@jswanljung
Contributor Author

I stand corrected! In my admittedly limited experience, I've rarely seen it in the wild and I thought it was an HDF5 feature.

But I'll get to work on the implementation.

@jswanljung jswanljung closed this Oct 13, 2016
@jswanljung jswanljung deleted the netcdf_packing_support branch October 20, 2016 07:32