Added support for netCDF data packing. #2152
Conversation
The way xarray handles this is to pass a dictionary argument to its to_netcdf method. The dictionary allows per-variable values of dtype, scale_factor and add_offset to be set upon save without actually putting them in a cube. That can easily accommodate custom fill values as well.
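For comparison, a minimal sketch of that xarray pattern (the dataset contents and filename here are made up):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({'t2m': (('time', 'lat'),
                             20.0 + 15.0 * np.random.rand(4, 3))})
    # dtype, scale_factor, add_offset (and _FillValue) are applied at save
    # time only; nothing about the on-disk encoding is stored on the Dataset.
    ds.to_netcdf('packed.nc',
                 encoding={'t2m': {'dtype': 'int16',
                                   'scale_factor': 0.001,
                                   'add_offset': 27.5,
                                   '_FillValue': -32768}})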
Force-pushed from a6d660b to 7ba9374.
Firstly, thank you for this extremely well put together change @jswanljung. I'm completely in support of adding the ability to write packed variable data and just wanted to let you know that I will take a deeper look at this change over the next few days. In the meantime, could I ask that you sign and send over a CLA (http://scitools.org.uk/documents/cla_form.pdf)? Thanks.
Thanks, I sent over a signed CLA. As an argument against the way I implemented this, have a look at the way xarray does it (http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html), in particular the encoding argument. The nice thing about that is that no information about the details of the file encoding ever goes into the Dataset or DataArray. In my implementation, you set scale_factor and add_offset in Cube.attributes, which goes against the general philosophy of keeping the file format details separate from the cube abstraction. We can't lift the xarray method directly since cubes in iris don't have unique names, but we could pass a dict or list of dicts with dtype, scale_factor and add_offset as keys to save and Saver.write instead of requiring new attributes to be set in cubes. That would also make custom fill values possible. I'm willing to make that change if we agree that it's better; now that I know how this stuff works I don't think it would take that long.
CLA signed and docs updated, thank you @jswanljung
marqh left a comment
The recent changes to the iris.fileformats.netcdf module have made this PR unsuitable to commit due to conflicts. The _setncattr pattern is now preferred, but was not known about when you raised your change. Please may you rebase and address these issues and I will re-review. I have not found any fundamental concerns at this stage. Many thanks for all the hard work, I hope you can stay with it whilst we get these changes in place.
    to the netCDF Dataset.
    """
    if cube.standard_name:
        cf_var.standard_name = cube.standard_name
We have run into compatibility issues with Python 2, Python 3 and netCDF4-Python 1.2 around the setting of attributes on netCDF variables, which has caused some concern. We have recently adopted a different pattern for setting attributes, to ensure that typing is preserved through the Python/netCDF4 testing matrix.

See #2158 #2179 #2183 for the somewhat disjointed changes for this (with apologies from me for the lack of cohesion in change, I missed a couple of key issues along the way).

With this in mind, please may you use the

    _setncattr(cfvar, 'standard_name', cube.standard_name)

pattern for setting attributes?
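A minimal sketch of what such a helper amounts to (the real Iris helper may differ in detail; this only illustrates the pattern the reviewer is asking for):

    def _setncattr(variable, name, attribute):
        # Route attribute setting through netCDF4's setncattr rather than
        # plain Python attribute assignment, so that attribute types are
        # preserved across the Python 2/3 and netCDF4-Python version matrix.
        variable.setncattr(name, attribute)

The terse "_setncattr" replies on the snippets below are asking for the same substitution, e.g. _setncattr(cf_var, 'long_name', cube.long_name).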
    if cube.long_name:
        cf_var.long_name = cube.long_name
_setncattr
    if cube.units != 'unknown':
        cf_var.units = str(cube.units)
_setncattr
        'global attribute.'.format(attr_name=attr_name)
    warnings.warn(msg)

    cf_var.setncattr(attr_name, value)
_setncattr
I don't mind going forward with this, but in the absence of other opinions, I have convinced myself that setting packing attributes in the cube is the wrong way to go (see my argument in the comments above). Unless someone strenuously objects, I shall withdraw this pull request and submit a new one in which the pack_dtype argument to netcdf.save and Saver.write in this PR is replaced by an argument called …

So for example: … Doing it this way will also require fewer structural changes to the code than in this PR.
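The elided example can't be recovered from this page, but judging from the surrounding discussion it was presumably along these lines (the packing keyword and its option keys here are hypothetical):

    import iris
    from iris.fileformats import netcdf

    cubes = iris.load('input.nc')  # assumed input file
    # One packing specification per cube, matched by position.
    netcdf.save(cubes, 'packed.nc',
                packing=[{'dtype': 'i2', 'scale_factor': 0.01,
                          'add_offset': 283.0},
                         {'dtype': 'u1'}])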
@jswanljung - I'm in favour of de-coupling saver options from cube attributes, so I'd support your suggested new approach. I'm not sure about the finer details but in general I think controlling this solely at the saver interface is the right thing to do.
@ajdawson If you have any misgivings about the finer details, I'm very open to revising them. Especially before I've implemented them.
I don't have misgivings as such, the approach sounds reasonable, I'm just not clear on how some of the details should work. Can you explain how you will determine which cube gets which packing type? I had imagined a dictionary that somehow maps cubes to packing specifiers (perhaps by name), which would allow you to specify packing for a subset of cubes and not pack the rest. I think this would be more flexible than an iterable the same length as the number of cubes.
My suggestion was to do it by order in the list. The problem with using names as keys is that iris doesn't require them to be unique. In fact, I think the test data for saving multiple cubes has more than one cube called 'air_temperature'. There doesn't seem to be any unique identifier for a cube except the cube itself. It would of course be possible to use the cubes as dictionary keys, assuming None for any missing cubes.

Edit: xarray uses names as keys, exactly as you suggest, but unlike iris it has a Dataset concept which makes it easy to enforce uniqueness of names.
I appreciate the confusion of using names, it doesn't work all that well, but I don't like the idea of having an iterable that requires a matching length. For example, if I want to apply packing to 1 cube in a list of 10, I'd have to construct a length-10 iterable with the correct position set with my packing options. It would be nicer to be able to just provide the packing details for the single cube without having to construct something larger, which is where a dict would prove useful. Perhaps we can think of something better than name for a key to identify each cube.
I understand your concern and agree that it would be nice to be able to set options only for individual cubes. What about using cubes themselves as keys? It's a bit weird, but I'm having trouble coming up with anything else that uniquely identifies cubes from the iterable passed to …
Answering myself, I see several drawbacks to using cubes as keys, but mainly that it precludes creating reusable dicts of packing parameters. But I really can't think of anything else that would work unambiguously as a key. One possibility would be to keep my suggestion using ordered lists, but also allow …

But this is probably a pretty marginal use case. I may be wrong, but I think storing multiple variables in a single netCDF file is still pretty unusual because it wasn't possible before netCDF 4 and it complicates lots of workflows, for instance those using command line tools. For datasets small enough to comfortably fit in memory, just using 'int16' as a packing argument will be very convenient even for multiple variables. And for larger datasets it is even more unusual to store multiple variables per file.
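To make the trade-off concrete, here is a sketch of the two association schemes under discussion (the option keys are illustrative; whether Cube objects can serve as dict keys is exactly the open question above):

    import iris

    cubes = iris.load('input.nc')  # suppose this yields ten cubes

    # Scheme 1: positional list, one entry per cube (None = don't pack).
    # Packing a single cube means building a mostly-None list of ten.
    packing = [None] * 10
    packing[3] = {'dtype': 'i2'}

    # Scheme 2: dict keyed by the cubes themselves; only the cubes to be
    # packed need an entry, but such a dict cannot be reused across files.
    packing = {cubes[3]: {'dtype': 'i2'}}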
Well, in the absence of a better alternative I'd be happy to see a PR that simply implements the sequence method you initially described.

I'm going to have to strongly disagree with that statement! Storing multiple variables in a netCDF file has always been possible, and is an extremely common use case. Don't be under the illusion that this is an edge case.
I stand corrected! In my admittedly limited experience, I've rarely seen it in the wild and I thought it was an HDF5 feature. But I'll get to work on the implementation.
NetCDF has support for variable packing; by specifying attributes scale_factor and add_offset, data can be packed into smaller data types. 16-bit short integers provide more than enough quantization for suitably scaled temperature data, for instance. See: http://unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
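As a worked example of the recipe from that link, here is how scale_factor and add_offset fall out for a signed 16-bit target (the data values are made up):

    import numpy as np

    data = np.array([250.0, 275.5, 310.2])   # e.g. temperatures in K
    n = 16                                   # bits in the packed type
    scale_factor = (data.max() - data.min()) / (2 ** n - 1)
    add_offset = data.min() + 2 ** (n - 1) * scale_factor
    # The packed values span the full int16 range [-32768, 32767].
    packed = np.round((data - add_offset) / scale_factor).astype('i2')
    unpacked = packed * scale_factor + add_offset  # applied on load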
Iris already supports loading of packed netCDF because it is automatically handled by the netCDF4 module. However, there is currently no way to save data with packing.
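That automatic handling looks like this with netCDF4-Python (filename and variable name assumed for illustration):

    import netCDF4

    ds = netCDF4.Dataset('packed.nc')
    var = ds.variables['t2m']
    data = var[:]   # scale_factor/add_offset applied automatically on read
    var.set_auto_maskandscale(False)
    raw = var[:]    # the stored packed integers, untouched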
#2148
This PR adds an optional pack_dtype argument to the netcdf save and Saver.write methods. From the docstring:
    A numpy integer datatype (signed or unsigned) or a string that
    describes a numpy integer dtype (i.e. 'i2', 'short', 'u4'). This
    provides support for netCDF data packing as described in
    http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Packed-Data-Values
    If either scale_factor or add_offset are set in cube.attributes,
    those values are used. If neither is set, appropriate values of
    scale_factor and add_offset are calculated based on cube.data,
    pack_dtype and possible masking (see the link for details). Note
    that automatic calculation of scale_factor and add_offset will
    trigger loading of lazy data; set scale_factor and add_offset
    manually in cube.attributes if you wish to avoid this. For masked
    data, fill_values are taken from netCDF4.default_fillvals. If the
    argument to cube is an iterable and different datatypes are desired
    for each cube, pack_dtype can also be a list with the same number
    of elements as the iterable. The default is None, in which case
    the datatype is determined from the cube and no packing will occur.
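Based on that docstring, usage would look something like this (filenames and attribute values are illustrative):

    import iris
    from iris.fileformats import netcdf

    cube = iris.load_cube('air_temperature.nc')

    # Let the saver compute scale_factor and add_offset from cube.data
    # (note: this realises any lazy data).
    netcdf.save(cube, 'packed_auto.nc', pack_dtype='i2')

    # Or preset them in cube.attributes to keep the data lazy.
    cube.attributes['scale_factor'] = 0.01
    cube.attributes['add_offset'] = 283.0
    netcdf.save(cube, 'packed_preset.nc', pack_dtype='i2')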
One of the first things that sold me on Iris was how wonderfully convenient netCDF input and output was, but I couldn't use it when I needed packing (which was fairly often). This fixes that, and with automatic calculation of packing attributes adds even more convenience.
Although you can set the packing attributes (scale_factor and add_offset) manually, there is no way to set a custom fill value for masked data. I doubt this will be missed, but I could be wrong.
There is an error in the tests, but it is due to changes in cf_units, as has been noted previously. All tests associated with this change pass.