Skip to content

Conversation

@jswhit
Copy link
Collaborator

@jswhit jswhit commented Mar 21, 2018

Issue #785 points out that by default, either a numpy.ma.core.MaskedArray or a numpy.ndarray can be returned, depending on whether the slice contains any missing values. This pull request ensures that a numpy.ma.core.MaskedArray is always returned, unless set_auto_mask(False) is invoked and then a numpy.ndarray is always returned.

@jswhit jswhit requested a review from dopplershift March 21, 2018 18:01
@jswhit
Copy link
Collaborator Author

jswhit commented Mar 22, 2018

@shoyer, would this impact xarray?

@shoyer
Copy link
Contributor

shoyer commented Mar 22, 2018

Xarray always sets set_auto_mask(False), so this would not impact us.

@MeraX
Copy link

MeraX commented Mar 26, 2018

@akrherz I would like to point out my issues with this pull-request in this thread.
I try to give a comprehensible example.

Below is a dummy code, but it is potentially similar to code I used to process raw radar measurements. I start with some raw_data that is related to the voltage recorded by the radar receiver. Since there is only an entry, when the device is measuring, there is no missing data in my netcdf file. This means, that I know there is no missing data and since I might be lazy ;), I do not check for that.

This raw data also contains some noise, which has to be estimated and subtracted. Afterwards I convert everything to dB. Because I want to get rid of data that is below the noise level (including -inf and nan), I create the boolean filter array. Using this, I reduce my data to measurements above the noise level.

The code might look like this:

import numpy as np
def dummy_radar_processing(raw_signal):
    noise = np.percentile(raw_signal, 10) # simplified noise estimation
    log_signal = 10 * np.log10(raw_signal - noise) # subtract nose and to dB
    
    filter = np.isfinite(log_signal) & (log_signal > 10*np.log10(noise))
    above_noise = log_signal[filter]
    
    print('log_signal:', log_signal)
    print('above_noise:', above_noise)
    print('I have %d values above noise level.' % above_noise.size)

(Later, I would use the filter variable to filter other variable that share the same dimension with raw_signal.)

Now, let's apply this code to some test data. I use np.array and np.ma.masked_less as replacement for nc.variables['raw_data'][:].

Up to netCDF4 1.3.1 it would be like:

raw_signal = np.array((1.,5,6,8,3,4,2,3,7,0.5,1), np.float32) # read from netcdf, no value is missing
dummy_radar_processing(raw_signal)
log_signal: [     -inf 6.0206003 6.9897003 8.45098   3.0103002 4.7712126 0.        3.0103002 7.7815127       nan      -inf]
above_noise: [6.0206003 6.9897003 8.45098   3.0103002 4.7712126 3.0103002 7.7815127]
I have 7 values above noise level.

But with a MaskedArray (PR #787) :

import netCDF4
raw_signal = np.ma.masked_equal(np.array((1.,5,6,8,3,4,2,3,7,0.5,1), np.float32), netCDF4.default_fillvals['f4']) # read from netcdf
dummy_radar_processing(raw_signal)
log_signal: [-- 6.020600318908691 6.9897003173828125 8.450980186462402 3.0103001594543457 4.771212577819824 0.0
 3.0103001594543457 7.78151273727417 -- --]
above_noise: [-- 6.020600318908691 6.9897003173828125 8.450980186462402 3.0103001594543457 4.771212577819824 3.0103001594543457
 7.78151273727417 -- --]
I have 10 values above noise level.

Now, it is obvious, that there are only 7 valid values, but since I treated masked arrays with my code that is written for normal array, I somehow get another result.
A little further imagination is needed to make the impact of having a MaskedArray instead of a np.array. more prominent. E.g. take another variable and apply the boolean filter index on it and apply mean() to get the mean of this variable for all valid radar measurements. (Using the normal array, [filter] returns 7 values => mean of 7 values; Using MaskedArray [filter] returns 10 values => mean of 10 values)

I hope this example makes it clear, how code might break with the change of the default.

EDIT: sorry, for mixing up @jswhit and @akrherz

@jswhit
Copy link
Collaborator Author

jswhit commented Mar 26, 2018

If you use

if np.ma.isMA(filter):
    above_noise = log_signal[filter].compressed()
else:
    above_noise = log_signal[filter]

then the result is the same for ndarrays and MaskedArrays.

@MeraX
Copy link

MeraX commented Mar 26, 2018

Yes, thats true. There are many ways to save me from this pitfall. (But to do so, I have to be aware of this change)
But my point is not actually about my code, but about the code of my colleagues: Imagine, dummy_radar_processing() is ancient legacy code that worked well in the past.
Now, I run it on another computer which happens to have a more recent netCDF4 installed that includes this change.
I.e., I will get different results.
=> The old code is broken not in a way that it raises a big Exception, but it provides results with a subtle difference.

@jswhit
Copy link
Collaborator Author

jswhit commented Mar 26, 2018

Agreed - this change can break existing code, and I regret that. Sometimes that can't be avoided though, and all in all I think all and all it is a step in the right direction.

@shoyer
Copy link
Contributor

shoyer commented Mar 26, 2018

One good way to handle backwards compatibility breaks is to increment major version numbers.

@flaviobar
Copy link

Hi, just my 2 cents,
this behavior does not describe correctly the content of a netcdf file. In principle a variable could be twice masked, since it could have both the _FillValue and the missing_value attribute, with 2 different meanings. FillValue is to be used for filling the variable before writing, so values not filtered by this attribute are the values that should be there, while missing_value is to be used to mark values that would had to be inside the variable but, as example, are corrupted.
According to me tha aim of a library that read a general file format, as netcdf is, should be to easily have a copy of the file inside of the memory and providing to the users some tools to smart and quickly process the objects retrieved from the file. But the user should choice to apply this tools.
Netcdf4 module implicitly filters the variables of the file, without any notification to the user.
So, according to me, setting the default to set_auto_mask(False) and set_auto_scale(False), letting the user choice whether or not using the functions provided by the module, would be a more clever implementation.
Anyway it's a very good thing that now the module has a unique output type.

I had to raise a little issue: when the 2 different attributes are present, both _FillValue and missing_value, the reported value in the masked array is wrong, since in fill_value it has the value of the missing_value attribute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants