always return a masked array be default #787

jswhit · 2018-03-21T16:57:03Z

Issue #785 points out that by default, either a numpy.ma.core.MaskedArray or a numpy.ndarray can be returned, depending on whether the slice contains any missing values. This pull request ensures that a numpy.ma.core.MaskedArray is always returned, unless set_auto_mask(False) is invoked and then a numpy.ndarray is always returned.

jswhit · 2018-03-22T01:36:19Z

@shoyer, would this impact xarray?

shoyer · 2018-03-22T02:03:32Z

Xarray always sets set_auto_mask(False), so this would not impact us.

MeraX · 2018-03-26T15:46:17Z

@akrherz I would like to point out my issues with this pull-request in this thread.
I try to give a comprehensible example.

Below is a dummy code, but it is potentially similar to code I used to process raw radar measurements. I start with some raw_data that is related to the voltage recorded by the radar receiver. Since there is only an entry, when the device is measuring, there is no missing data in my netcdf file. This means, that I know there is no missing data and since I might be lazy ;), I do not check for that.

This raw data also contains some noise, which has to be estimated and subtracted. Afterwards I convert everything to dB. Because I want to get rid of data that is below the noise level (including -inf and nan), I create the boolean filter array. Using this, I reduce my data to measurements above the noise level.

The code might look like this:

import numpy as np
def dummy_radar_processing(raw_signal):
    noise = np.percentile(raw_signal, 10) # simplified noise estimation
    log_signal = 10 * np.log10(raw_signal - noise) # subtract nose and to dB
    
    filter = np.isfinite(log_signal) & (log_signal > 10*np.log10(noise))
    above_noise = log_signal[filter]
    
    print('log_signal:', log_signal)
    print('above_noise:', above_noise)
    print('I have %d values above noise level.' % above_noise.size)

(Later, I would use the filter variable to filter other variable that share the same dimension with raw_signal.)

Now, let's apply this code to some test data. I use np.array and np.ma.masked_less as replacement for nc.variables['raw_data'][:].

Up to netCDF4 1.3.1 it would be like:

raw_signal = np.array((1.,5,6,8,3,4,2,3,7,0.5,1), np.float32) # read from netcdf, no value is missing
dummy_radar_processing(raw_signal)

log_signal: [     -inf 6.0206003 6.9897003 8.45098   3.0103002 4.7712126 0.        3.0103002 7.7815127       nan      -inf]
above_noise: [6.0206003 6.9897003 8.45098   3.0103002 4.7712126 3.0103002 7.7815127]
I have 7 values above noise level.

But with a MaskedArray (PR #787) :

import netCDF4
raw_signal = np.ma.masked_equal(np.array((1.,5,6,8,3,4,2,3,7,0.5,1), np.float32), netCDF4.default_fillvals['f4']) # read from netcdf
dummy_radar_processing(raw_signal)

log_signal: [-- 6.020600318908691 6.9897003173828125 8.450980186462402 3.0103001594543457 4.771212577819824 0.0
 3.0103001594543457 7.78151273727417 -- --]
above_noise: [-- 6.020600318908691 6.9897003173828125 8.450980186462402 3.0103001594543457 4.771212577819824 3.0103001594543457
 7.78151273727417 -- --]
I have 10 values above noise level.

Now, it is obvious, that there are only 7 valid values, but since I treated masked arrays with my code that is written for normal array, I somehow get another result.
A little further imagination is needed to make the impact of having a MaskedArray instead of a np.array. more prominent. E.g. take another variable and apply the boolean filter index on it and apply mean() to get the mean of this variable for all valid radar measurements. (Using the normal array, [filter] returns 7 values => mean of 7 values; Using MaskedArray [filter] returns 10 values => mean of 10 values)

I hope this example makes it clear, how code might break with the change of the default.

EDIT: sorry, for mixing up @jswhit and @akrherz

jswhit · 2018-03-26T16:13:33Z

If you use

if np.ma.isMA(filter):
    above_noise = log_signal[filter].compressed()
else:
    above_noise = log_signal[filter]

then the result is the same for ndarrays and MaskedArrays.

MeraX · 2018-03-26T16:36:25Z

Yes, thats true. There are many ways to save me from this pitfall. (But to do so, I have to be aware of this change)
But my point is not actually about my code, but about the code of my colleagues: Imagine, dummy_radar_processing() is ancient legacy code that worked well in the past.
Now, I run it on another computer which happens to have a more recent netCDF4 installed that includes this change.
I.e., I will get different results.
=> The old code is broken not in a way that it raises a big Exception, but it provides results with a subtle difference.

jswhit · 2018-03-26T16:53:41Z

Agreed - this change can break existing code, and I regret that. Sometimes that can't be avoided though, and all in all I think all and all it is a step in the right direction.

shoyer · 2018-03-26T17:07:06Z

One good way to handle backwards compatibility breaks is to increment major version numbers.

flaviobar · 2018-05-16T19:37:02Z

Hi, just my 2 cents,
this behavior does not describe correctly the content of a netcdf file. In principle a variable could be twice masked, since it could have both the _FillValue and the missing_value attribute, with 2 different meanings. FillValue is to be used for filling the variable before writing, so values not filtered by this attribute are the values that should be there, while missing_value is to be used to mark values that would had to be inside the variable but, as example, are corrupted.
According to me tha aim of a library that read a general file format, as netcdf is, should be to easily have a copy of the file inside of the memory and providing to the users some tools to smart and quickly process the objects retrieved from the file. But the user should choice to apply this tools.
Netcdf4 module implicitly filters the variables of the file, without any notification to the user.
So, according to me, setting the default to set_auto_mask(False) and set_auto_scale(False), letting the user choice whether or not using the functions provided by the module, would be a more clever implementation.
Anyway it's a very good thing that now the module has a unique output type.

I had to raise a little issue: when the 2 different attributes are present, both _FillValue and missing_value, the reported value in the masked array is wrong, since in fill_value it has the value of the missing_value attribute.

As documented in a couple spots: Unidata/netcdf4-python#785 Unidata/netcdf4-python#787 Unidata/netcdf4-python#809 Unidata/netcdf4-python#816

Jeff Whitaker added 3 commits March 21, 2018 10:44

always return masked array by default (even if slice has no mask)

49588be

update

3025306

update

43f255c

jswhit requested a review from dopplershift March 21, 2018 18:01

jswhit mentioned this pull request Mar 22, 2018

Supprising type of nc.variables[foo][:]. (ndarray or MaskedArray depending on its data) #785

Open

jswhit merged commit 3355c0c into master Mar 22, 2018

jswhit deleted the issue785 branch March 22, 2018 16:45

ocefpaf mentioned this pull request May 22, 2018

Use cftime instead of netcdftime SciTools/iris#3016

Merged

dwhswenson mentioned this pull request May 28, 2018

Fix problems in to_mdtraj and notebook tests openpathsampling/openpathsampling#776

Merged

3 tasks

fmaussion mentioned this pull request Jun 22, 2018

Use of masked arrays comes with huge performance drop #809

Closed

pp-mo mentioned this pull request Oct 2, 2019

Should we have our own account of missing data in netcdf ? SciTools/iris#3438

Open

ijpulidos mentioned this pull request Jun 1, 2023

manually casting objects to np arrays when reading checkpoint choderalab/openmmtools#701

Merged

5 tasks

amstokely mentioned this pull request Jun 2, 2025

Add Global Dimension Variables for IODA v3 Compliance NCAR/obs2ioda#82

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

always return a masked array be default #787

always return a masked array be default #787

Uh oh!

jswhit commented Mar 21, 2018 •

edited

Loading

Uh oh!

jswhit commented Mar 22, 2018

Uh oh!

shoyer commented Mar 22, 2018

Uh oh!

MeraX commented Mar 26, 2018 •

edited

Loading

Uh oh!

jswhit commented Mar 26, 2018

Uh oh!

MeraX commented Mar 26, 2018 •

edited

Loading

Uh oh!

jswhit commented Mar 26, 2018

Uh oh!

shoyer commented Mar 26, 2018

Uh oh!

flaviobar commented May 16, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

always return a masked array be default #787

always return a masked array be default #787

Uh oh!

Conversation

jswhit commented Mar 21, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jswhit commented Mar 22, 2018

Uh oh!

shoyer commented Mar 22, 2018

Uh oh!

MeraX commented Mar 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jswhit commented Mar 26, 2018

Uh oh!

MeraX commented Mar 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jswhit commented Mar 26, 2018

Uh oh!

shoyer commented Mar 26, 2018

Uh oh!

flaviobar commented May 16, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jswhit commented Mar 21, 2018 •

edited

Loading

MeraX commented Mar 26, 2018 •

edited

Loading

MeraX commented Mar 26, 2018 •

edited

Loading