
Conversation

@jswanljung
Contributor

@jswanljung jswanljung commented Sep 23, 2016

NetCDF has support for variable packing; by specifying the attributes scale_factor and add_offset, data can be packed into smaller data types. 16-bit short integers provide more than enough quantization for suitably scaled temperature data, for instance. See: http://unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
Iris already supports loading of packed netCDF data because it is handled automatically by the netCDF4 module. However, there is currently no way to save data with packing.
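
For reference, here is a minimal standalone sketch of the packing arithmetic, using one common recipe that reserves one integer for the fill value (not Iris code; names and sample values are purely illustrative):

import numpy as np

data = np.linspace(250.0, 320.0, 1000)  # e.g. temperatures in kelvin
nbits = 16                              # packing into a 16-bit signed integer

# Spread the data range over the integer range, keeping one value free
# for a fill value.
scale_factor = (data.max() - data.min()) / (2 ** nbits - 2)
add_offset = (data.max() + data.min()) / 2.0

packed = np.round((data - add_offset) / scale_factor).astype(np.int16)
unpacked = packed * scale_factor + add_offset  # what readers reconstruct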

#2148

This PR adds an optional pack_dtype argument to the netcdf save and Saver.write methods. From the docstring:

  • pack_dtype (type or string or list):
    A numpy integer datatype (signed or unsigned) or a string that
    describes a numpy integer dtype (e.g. 'i2', 'short', 'u4'). This
    provides support for netCDF data packing as described in
    http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
    If either scale_factor or add_offset are set in cube.attributes,
    those values are used. If neither is set, appropriate values of
    scale_factor and add_offset are calculated based on cube.data,
    pack_dtype and possible masking (see the link for details). Note
    that automatic calculation of scale_factor and add_offset will
    trigger loading of lazy_data; set scale_factor and add_offset
    manually in cube.attributes if you wish to avoid this. For
    masked data, fill_values are taken from netCDF4.default_fillvals.
    If the cube argument is an iterable and different datatypes are
    desired for each cube, pack_dtype can also be a list with the same
    number of elements as the iterable. The default is None, in which
    case the datatype is determined from the cube and no packing will
    occur.
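
To illustrate the proposed interface, a minimal usage sketch (assuming the keyword is forwarded through iris.save to the netCDF saver; the file names are placeholders):

import iris

cube = iris.load_cube('infile.nc')

# Pack into 16-bit signed integers; scale_factor and add_offset are computed
# automatically from cube.data unless they are already set in cube.attributes.
iris.save(cube, 'packed.nc', pack_dtype='i2')

# Alternatively, set the packing attributes manually to avoid realising
# lazy data.
cube.attributes['scale_factor'] = 0.01
cube.attributes['add_offset'] = 273.15
iris.save(cube, 'packed.nc', pack_dtype='i2')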

One of the first things that sold me on Iris was how wonderfully convenient netCDF input and output was, but I couldn't use it when I needed packing (which was fairly often). This fixes that, and the automatic calculation of packing attributes adds even more convenience.

Although you can set the packing attributes (scale_factor and add_offset) manually, there is no way to set a custom fill value for masked data. I doubt this will be missed, but I could be wrong.

There is an error in the tests, but it is due to changes in cf_units, as has been noted previously. All tests associated with this change pass.

@jswanljung
Contributor Author

The way xarray handles this is to pass a dictionary argument to its to_netcdf method. The dictionary allows per-variable values of dtype, scale_factor and add_offset to be set upon save without actually putting them in a cube. That can easily accommodate custom fill values as well.
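
For comparison, a sketch of that xarray pattern (variable name and values are illustrative):

import xarray as xr

ds = xr.open_dataset('infile.nc')
# Encoding details are supplied at save time, per variable, and are never
# stored on the Dataset itself.
ds.to_netcdf('packed.nc', encoding={
    'air_temperature': {'dtype': 'int16',
                        'scale_factor': 0.01,
                        'add_offset': 273.15,
                        '_FillValue': -32767},
})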

@jswanljung jswanljung force-pushed the netcdf_packing_support branch from a6d660b to 7ba9374 Compare October 3, 2016 06:34
@pelson
Member

pelson commented Oct 4, 2016

Firstly, thank you for this extremely well put together change @jswanljung. I'm completely in support of adding the ability to write packed variable data and just wanted to let you know that I will take a deeper look at this change over the next few days. In the meantime, could I ask that you sign and send over a CLA (http://scitools.org.uk/documents/cla_form.pdf)?

Thanks,

@jswanljung
Contributor Author

jswanljung commented Oct 4, 2016

Thanks, I sent over a signed CLA. As an argument against the way I implemented this, have a look at the way xarray does it (http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html), in particular the encoding argument. The nice thing about that is that no information about the details of the file encoding ever goes into the Dataset or DataArray. In my implementation, you set scale_factor and add_offset in Cube.attributes, which goes against the general philosophy of keeping the file format details separate from the cube abstraction. We can't lift the xarray method directly since cubes in iris don't have unique names, but we could pass a dict or list of dicts with dtype, scale_factor and add_offset as keys to save and Saver.write instead of requiring new attributes to be set in cubes. That would also make custom fill values possible.
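
A rough sketch of what that alternative might look like (a hypothetical keyword and keys, not what this PR implements; the keyword name anticipates the proposal further down):

import iris
import iris.cube
import numpy as np

temperature = iris.cube.Cube(np.random.uniform(250, 320, (10, 10)),
                             standard_name='air_temperature', units='K')
pressure = iris.cube.Cube(np.random.uniform(950, 1050, (10, 10)),
                          standard_name='air_pressure', units='hPa')

# One entry per cube, in save order; the encoding details stay out of the cubes.
packing = [dict(dtype='i2', scale_factor=0.01, add_offset=285.0),
           None]  # second cube saved without packing
iris.save([temperature, pressure], 'outfile.nc', packing=packing)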

I'm willing to make that change if we agree that it's better; now that I know how this stuff works I don't think it would take that long.

@marqh
Member

marqh commented Oct 6, 2016

CLA signed and docs updated
SciTools/scitools.org.uk@f242e4a

thank you @jswanljung

Member

@marqh marqh left a comment


the recent changes to the iris.fileformats.netcdf module have made this PR unsuitable to commit due to conflicts.

The _setncattr helper is now preferred, but it was not yet in place when you raised your change

please may you rebase and address these issues and I will re-review

I have not found any fundamental concerns at this stage

many thanks for all the hard work, I hope you can stay with it whilst we get these changes in place

to the netCDF Dataset.
"""
if cube.standard_name:
cf_var.standard_name = cube.standard_name
Member


we have run into compatibility issues with Python 2 / Python 3 and netCDF4-python 1.2 around the setting of attributes on netCDF variables, which has caused some concern.
we have recently adopted a different pattern for setting attributes, to ensure that typing is preserved across the Python / netCDF4-python testing matrix
see
#2158 #2179 #2183
for the somewhat disjointed changes for this
(with apologies from me for the lack of cohesion in change, I missed a couple of key issues along the way)

with this in mind, please may you use the

_setncattr(cfvar, 'standard_name', cube.standard_name)

pattern for setting attributes?
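
For context, a rough sketch of the two patterns side by side (a fragment: cf_var and cube are as in the surrounding diff, and the real helper in iris.fileformats.netcdf may differ in detail):

# Pattern used in this PR: plain Python attribute assignment.
cf_var.standard_name = cube.standard_name

# Preferred pattern: route through the netCDF4 setncattr call so that
# attribute types survive the Python 2/3 and netCDF4-python version matrix.
# The helper is presumably a thin wrapper along these lines.
def _setncattr(variable, name, value):
    variable.setncattr(name, value)

_setncattr(cf_var, 'standard_name', cube.standard_name)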

cf_var.standard_name = cube.standard_name

if cube.long_name:
cf_var.long_name = cube.long_name
Member


_setncattr

cf_var.long_name = cube.long_name

if cube.units != 'unknown':
cf_var.units = str(cube.units)
Member


_setncattr

'global attribute.'.format(attr_name=attr_name)
warnings.warn(msg)

cf_var.setncattr(attr_name, value)
Member


_setncattr

@jswanljung
Contributor Author

jswanljung commented Oct 12, 2016

I don't mind going forward with this, but in the absence of other opinions, I have convinced myself that setting packing attributes in the cube is the wrong way to go (see my argument in the comments above). Unless someone strenuously objects, I shall withdraw this pull request and submit a new one in which the pack_dtype argument to netcdf.save and Saver.write in this PR is replaced by an argument called packing with a docstring as follows:

  • packing (type or string or dict or iterable):
    A numpy integer datatype (signed or unsigned) or a string that
    describes a numpy integer dtype (e.g. 'i2', 'short', 'u4'), or a
    dict of packing parameters as described below, or an iterable of
    such types, strings, or dicts. This provides support for netCDF
    data packing as described in
    http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
    If this argument is a type (or type string), appropriate values of
    scale_factor and add_offset will be automatically calculated based
    on cube.data and possible masking. For masked data, fill_values
    are taken from netCDF4.default_fillvals. For more control, pass a
    dict with one or more of the following keys: dtype (required),
    scale_factor, add_offset, and fill_values. To save multiple cubes
    with different packing parameters, pass a list (or other iterable)
    of types, strings, or dicts, one for each cube. Note that
    automatic calculation of scale_factor and add_offset will trigger
    loading of lazy_data; set them manually to avoid this. The default
    is None, in which case the datatype is determined from the cube
    and no packing will occur.

So for example:

import iris
import numpy as np

# Pack into np.int16 with automatic calculation of scale_factor and
# add_offset, and the default fill value if applicable.
iris.save(cube, 'outfile.nc', packing='int16')

# Pack into np.int16 with manually set scale_factor and add_offset.
iris.save(cube, 'outfile.nc', packing=dict(dtype=np.short, scale_factor=0.01, add_offset=250))

# Pack one cube with automatic scaling, the second manually, and the third
# not at all (cubelist has 3 cubes).
iris.save(cubelist, 'outfile.nc', packing=['int16', dict(dtype='int16', scale_factor=0.01), None])

Doing it this way will also require fewer structural changes to the code than in this PR.

@ajdawson
Member

@jswanljung - I'm in favour of de-coupling saver options from cube attributes, so I'd support your suggested new approach. I'm not sure about the finer details but in general I think controlling this solely at the saver interface is the right thing to do.

@jswanljung
Contributor Author

@ajdawson If you have any misgivings about the finer details, I'm very open to revising them. Especially before I've implemented them.

@ajdawson
Member

I don't have misgivings as such; the approach sounds reasonable, I'm just not clear on how some of the details should work.

Can you explain how you will determine which cube gets which packing type? I had imagined a dictionary that somehow maps cubes to packing specifiers (perhaps by name), which would allow you to specify packing for a subset of cubes and not pack the rest. I think this would be more flexible than an iterable the same length as the number of cubes.
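
For illustration, such a mapping might look like this (hypothetical, never implemented; only the named cube is packed, any cube not in the mapping is saved unpacked):

packing = {'air_temperature': 'int16'}
iris.save(cubelist, 'outfile.nc', packing=packing)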

@jswanljung
Contributor Author

jswanljung commented Oct 12, 2016

My suggestion was to do it by order in the list. The problem with using names as keys is that iris doesn't require them to be unique. In fact, I think the test data for saving multiple cubes has more than one cube called 'air_temperature'. There doesn't seem to be any unique identifier for a cube except the cube itself. It would of course be possible to use the cubes as dictionary keys, assuming None for any missing cubes.

Edit: xarray uses names as keys, exactly as you suggest, but unlike iris it has a Dataset concept which makes it easy to enforce uniqueness of names.

@ajdawson
Member

I appreciate the confusion around using names; it doesn't work all that well. But I don't like the idea of having an iterable that requires a matching length. For example, if I want to apply packing to 1 cube in a list of 10, I'd have to construct a length-10 iterable with the correct position set with my packing options. It would be nicer to be able to just provide the packing details for the single cube without having to construct something larger, which is where a dict would prove useful.

Perhaps we can think of something better than name for a key to identify each cube.

@jswanljung
Contributor Author

jswanljung commented Oct 12, 2016

I understand your concern and agree that it would be nice to be able to set options only for individual cubes. What about using cubes themselves as keys? It's a bit weird, but I'm having trouble coming up with anything else that uniquely identifies cubes from the iterable passed to iris.save.

@jswanljung
Contributor Author

Answering myself, I see several drawbacks to using cubes as keys, but mainly that it precludes creating reusable dicts of packing parameters. But I really can't think of anything else that would work unambiguously as a key.

One possibility would be to keep my suggestion using ordered lists, but also allow packing to be a callable that returns a packing parameter dict given a cube as an argument. That would give the user unlimited flexibility to devise reusable packing schemes as well as responsibility for distinguishing among cubes.
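
A sketch of that callable interface, purely illustrative and never implemented (cubelist as in the earlier examples):

def choose_packing(cube):
    # Return a packing parameter dict for this cube, or None for no packing.
    if cube.name() == 'air_temperature':
        return dict(dtype='int16', scale_factor=0.01, add_offset=250)
    return None

iris.save(cubelist, 'outfile.nc', packing=choose_packing)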

But this is probably a pretty marginal use case. I may be wrong, but I think storing multiple variables in a single netCDF file is still pretty unusual because it wasn't possible before netCDF 4 and it complicates lots of workflows, for instance those using command line tools. For datasets small enough to comfortably fit in memory, just using 'int16' as a packing argument will be very convenient even for multiple variables. And for larger datasets it is even more unusual to store multiple variables per file.

@ajdawson
Member

Well, in the absence of a better alternative I'd be happy to see a PR that simply implements the sequence method you initially described.

I may be wrong, but I think storing multiple variables in a single netCDF file is still pretty unusual because it wasn't possible before netCDF 4

I'm going to have to strongly disagree with that statement! Storing multiple variables in a netcdf file has always been possible, and is an extremely common use case. Don't be under the illusion that this is an edge case.

@jswanljung
Contributor Author

I stand corrected! In my admittedly limited experience, I've rarely seen it in the wild and I thought it was an HDF5 feature.

But I'll get to work on the implementation.

@jswanljung jswanljung closed this Oct 13, 2016
@jswanljung jswanljung deleted the netcdf_packing_support branch October 20, 2016 07:32