@bayliffe
Contributor

@bayliffe bayliffe commented Jul 17, 2017

Replacement for PR #2682 (np.percentile method as an alternative to scipy.mstats).
Now for Iris V2.

Introduces a numpy.percentile method to the percentile aggregator as a fast alternative (approximately 50 times faster) to the existing scipy.stats.mstats.mquantiles method.

  • Accessed with kwarg percentile_method="numpy_percentile" passed to cube.collapsed method.
  • Existing unit tests are duplicated with calls to the fast method; in those tests, masked data results in an error.
  • Ran test cases and found fractional differences of order 1E-16 between the two methods.
  • scipy.stats.mstats.mquantiles method remains important for dealing with masked data.
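
As a rough illustration of the comparison described in the list above, the sketch below computes the same percentiles with both functions on unmasked data, then shows mquantiles skipping a masked value. The sample values are made up, and the alphap=1, betap=1 settings are an assumption, chosen so that mquantiles uses the same linear interpolation as np.percentile's default; they are not taken from this PR.

```python
import numpy as np
import numpy.ma as ma
from scipy.stats.mstats import mquantiles

# Made-up sample data: the values 1.0 to 10.0.
data = np.arange(1.0, 11.0)
percents = [25, 50, 75]

# Fast path: plain NumPy, which takes percentages in the range 0-100.
fast = np.percentile(data, percents)

# Mask-aware path: mquantiles takes probabilities in the range 0-1.
# alphap=1, betap=1 is an assumed setting giving linear interpolation.
slow = mquantiles(data, prob=[p / 100.0 for p in percents],
                  alphap=1, betap=1)

# Largest absolute difference between the two results.
max_diff = float(np.max(np.abs(fast - np.asarray(slow))))

# mquantiles ignores masked points, so the 99.0 below plays no part.
masked = ma.masked_array([1.0, 2.0, 3.0, 4.0, 99.0], mask=[0, 0, 0, 0, 1])
masked_median = float(mquantiles(masked, prob=[0.5], alphap=1, betap=1)[0])
```

On unmasked data the two results should agree to within floating point rounding, which is consistent with the small differences reported above.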

I have rewritten the unit tests across the section of interest. I have (possibly unwisely) removed the contentious assertCML tests (see comments on #2682). Instead I have added a simple additional coordinate check on the resulting cubes as something approaching a replacement, though obviously not as rigorous and it could be expanded. I have moved the assert statements into functions and separated out the cases where the first and third percentiles were being checked in a single test as requested.

@bayliffe
Contributor Author

@pp-mo Any chance I can reinvigorate some interest in this PR? I did factor out some stuff in the unit tests as suggested to act as a sweetener :-)

@DPeterK DPeterK self-assigned this Aug 25, 2017
@pelson pelson added the "Blocked: CLA needed" label (See https://scitools.org.uk. Submit the form at: https://scitools.org.uk/cla/v4/form) Aug 25, 2017
Member

@DPeterK DPeterK left a comment

Hi @bayliffe - many thanks for reimplementing this for Iris v2. I've made a number of comments that will need to be addressed before this can be merged, but for the most part the changes I'm after are reasonably minor and to make this functionality easier to use from the user's perspective.

    raise ValueError(msg.format(percentile_method))
if not ma.isMaskedArray(data) and not ma.is_masked(result):
    result = np.asarray(result)
if percentile_method == 'numpy_percentile':
Member

It seems odd to revisit this logical test. Could this not be done after L1052? Or does this really only need to fire in the case where the result is not masked?

Contributor Author

✅ Agreed, this can be lumped into the first if statement.

Kwargs:
* percentile_method (string) :
Member

I don't like the specifics of this implementation. In particular, this implementation relies heavily on specifying one of two strings that are obvious when you have the code in front of you but are otherwise quite esoteric. Add to that the fact that both strings are long and thus ripe for typos and it makes this implementation not very user-friendly.

I wonder if the following might work better:

  • kwarg renamed to "fast_percentile", default value False (for backward compatibility)
  • document what values of True and False will mean for the percentile calculation (including mentioning that the fast option is not compatible with masked arrays)
  • simplify the if-elif-else block to just an if-else block (i.e. if fast_percentile...).
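
A minimal sketch of how the boolean keyword suggested above might look is given below. The function name percentile_sketch, the TypeError for masked input, and the alphap=1, betap=1 settings are all assumptions made for illustration; this is a standalone helper for 1-D data, not the Iris implementation.

```python
import numpy as np
import numpy.ma as ma
from scipy.stats.mstats import mquantiles


def percentile_sketch(data, percent, fast_percentile=False):
    """Hypothetical helper: percentiles of 1-D data, fast or mask-aware.

    fast_percentile=False (the default) uses the mask-aware
    scipy.stats.mstats.mquantiles; fast_percentile=True uses
    np.percentile, which is not compatible with masked data.
    """
    percents = np.atleast_1d(percent)
    if fast_percentile:
        # Assumed behaviour: refuse masked input rather than silently
        # including the masked values in the calculation.
        if ma.is_masked(data):
            raise TypeError('fast_percentile=True cannot be used '
                            'with masked data')
        result = np.percentile(data, percents)
    else:
        # mquantiles expects probabilities (0-1), not percentages (0-100).
        result = mquantiles(data, prob=percents / 100.0,
                            alphap=1, betap=1)
    return result
```

With a boolean there are only two branches, so no separate error path is needed for an unrecognised method name.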

Contributor Author


def _check_percentile(self, data, axis, percents, expected_result,
                      coord_check=False, **kwargs):

Member

We probably don't need these two blank lines in this method 😉

Contributor Author

✅ I do love a bit of empty space, but I'll happily make everything a bit more cosy.

def _check_collapsed_percentile(self, cube, percents, collapse_coord,
                                expected_result, coord_check=False,
                                **kwargs):

Member

Can we drop this blank line too please.

Contributor Author

def test_percentile_1d(self):

def _check_coord_properties(self, cube, collapse_coord, unit, points):
    if not isinstance(collapse_coord, list):
Member

This is not good practice! ⚠️ Consider:

>>> t = (5,)  # A tuple: not a list, but still an iterable
>>> if not isinstance(t, list):
...     t = [t]
...
>>> print(t)
[(5,)]

Much better would be:

if isinstance(collapse_coord, six.string_types):
    collapse_coord = [collapse_coord]
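
For reference, the same check can be sketched as a small standalone helper. The name as_name_list is hypothetical, and the sketch assumes Python 3 only, where str plays the role of six.string_types; unlike the snippet above, it also copies any other iterable into a list.

```python
def as_name_list(collapse_coord):
    """Hypothetical helper: return coordinate names as a list of strings."""
    if isinstance(collapse_coord, str):
        # A lone name is wrapped, so a later '_'.join() does not split
        # it into individual characters.
        return [collapse_coord]
    # Tuples, lists and other iterables of names are copied to a list
    # instead of being wrapped in another list.
    return list(collapse_coord)
```

Testing for the string case directly means tuples and other iterables pass through without being nested.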

Contributor Author

Good point, I've changed it as you suggest.

I did not know about the six module for Python 2/3 compatibility. That's quite cool.

def _check_coord_properties(self, cube, collapse_coord, unit, points):
    if not isinstance(collapse_coord, list):
        collapse_coord = [collapse_coord]
    name = 'percentile_over_' + '_'.join(collapse_coord)
Member

Can we avoid string concatenations by using the following please:

name = 'percentile_over_{}'.format('_'.join(collapse_coord))
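
A quick check that the two spellings build the same name, using example coordinate names that are not taken from the PR:

```python
# Example coordinate names, chosen only for illustration.
collapse_coord = ['latitude', 'longitude']

# Original style: string concatenation.
concatenated = 'percentile_over_' + '_'.join(collapse_coord)

# Suggested style: str.format with the joined names as the argument.
formatted = 'percentile_over_{}'.format('_'.join(collapse_coord))
```

The result is identical, so the change is purely one of style.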

Contributor Author

✅ We can.

if not isinstance(collapse_coord, list):
    collapse_coord = [collapse_coord]
name = 'percentile_over_' + '_'.join(collapse_coord)
self.assertTrue(cube.coord(name))
Member

What is this test checking?

Contributor Author

@bayliffe bayliffe Aug 29, 2017

It checks that the collapsed coordinate over which percentiles have been calculated has the expected name. The cube and coordinate are provided separately, and the percentile method should result in a coordinate with the name defined on line 358. This simply checks that this is so.

A previous reviewer expressed a distaste for the CML checks, so this whole function is a crude attempt at checking the cube's form in an alternative manner.


np.testing.assert_array_almost_equal(result, expected_result)

def test_percentile_invalid_percentile_method(self):
Member

Implementing my suggested change above would remove the need for this test method too: there can only be two options and one of them will always be selected.

Contributor Author

percent=25)
np.testing.assert_array_almost_equal(first_quartile.data,
                                     np.array([2.5], dtype=np.float32))
self.assertCML(first_quartile, ('analysis',
Member

I see you've dropped all the CML checks from these tests. Can you re-implement them please? They're important for checking that the resultant cubes, as whole entities, are as expected following a collapse operation. You will need to add new CML results for the new tests you've added as well.

I wouldn't expect the CML to change for the existing tests (though you've renamed the tests). If the CML is changing after accounting for the renamed tests that may well be cause for concern.

Contributor Author

I've put these back in and added new results files for the fast_percentile_method tests.

@pelson pelson added the "Blocked: CLA needed" label (See https://scitools.org.uk. Submit the form at: https://scitools.org.uk/cla/v4/form) Aug 29, 2017
Member

@DPeterK DPeterK left a comment

@bayliffe there's just one more thing for you to contemplate, which will require another commit either way - and this is good as it might trigger the CLA blocked check to go away now that I've added you to the contributors list (congrats on that!)

if isinstance(collapse_coord, six.string_types):
    collapse_coord = [collapse_coord]
name = 'percentile_over_{}'.format('_'.join(collapse_coord))
self.assertTrue(cube.coord(name))
Member

Hm, we've lost the previous comments thread here, apparently... To recall some of that thread:

"A previous reviewer expressed a distaste for the CML checks"

And with good reason, admittedly – they are too sensitive to small and unimportant changes; though they are good at checking a cube wholesale! Given that you've now introduced CML for the tests here can I talk you into dropping this check method? All the checks this makes are included in the CML...

For reference, I'm still not a fan of this particular assertion. I think that if the named coord you're looking for does not exist then cube.coord(name) will raise an exception, which will cause the test that called this method to error, which is undesirable. It might be better to do something like the following (if you don't replace this check method with CML checking):

coord_names = [c.name() for c in cube.coords()]
self.assertIn(name, coord_names)

This won't error if the named coord doesn't exist, but that assertion will fail, which is a better way to be. It's also clearer to the test reader what the assertion is doing.
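
The difference between an error and a failure can be seen in a small self-contained sketch. FakeCube below is a hypothetical stand-in written for this illustration, with a made-up coord_names method and a LookupError in place of whatever exception cube.coord actually raises; it is not the Iris API.

```python
import unittest


class FakeCube:
    """Hypothetical stand-in for a cube; not the Iris API."""

    def __init__(self, names):
        self._names = list(names)

    def coord_names(self):
        return list(self._names)

    def coord(self, name):
        # Mimics a lookup that raises when the name is absent.
        if name not in self._names:
            raise LookupError('no coordinate named {!r}'.format(name))
        return name


def run_checks():
    """Run both assertion styles and return (n_errors, n_failures)."""
    cube = FakeCube(['percentile_over_time'])
    missing = 'percentile_over_height'

    class CoordNameChecks(unittest.TestCase):
        def test_lookup_style(self):
            # The lookup raises before assertTrue runs: an ERROR.
            self.assertTrue(cube.coord(missing))

        def test_membership_style(self):
            # The assertion itself is false: a FAILURE.
            self.assertIn(missing, cube.coord_names())

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CoordNameChecks)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.errors), len(result.failures)
```

Calling run_checks() reports one error (the lookup style) and one failure (the membership style) for the same missing name.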

Contributor Author

You can definitely convince me of this. It was only there as the beginnings of a replacement for the CML checks, but with them back in place it serves no useful purpose. I have removed the _check_coord_properties function along with the associated keyword. Now, if a CML_filename is provided it will conduct the CML test; otherwise it will just compare the expected values, as was previously the case.

@pelson pelson removed the "Blocked: CLA needed" label (See https://scitools.org.uk. Submit the form at: https://scitools.org.uk/cla/v4/form) Aug 31, 2017
@DPeterK
Member

DPeterK commented Aug 31, 2017

@bayliffe great stuff! Thanks for bearing with the review process and ploughing through all the changes we requested. I think this is good to go now!

@DPeterK DPeterK merged commit c9c40c5 into SciTools:master Aug 31, 2017
@DPeterK
Member

DPeterK commented Aug 31, 2017

🎉

@QuLogic QuLogic added this to the v2.0 milestone Aug 31, 2017
@bayliffe
Contributor Author

Thanks for your help Pete.
