Fast percentile method in Iris V2. #2687

bayliffe · 2017-07-17T15:28:38Z

Replacement for PR #2682 (np.percentile method as an alternative to scipy.mstats).
Now for Iris V2.

Introduces a numpy.percentile method in the percentile aggregator as a fast alternative (approx 50 times faster).

Accessed by passing the kwarg percentile_method="numpy_percentile" to the cube.collapsed method.
Existing unit tests are duplicated with calls to the fast method, where masked data will result in an error.
Ran test cases and found fractional differences of order 1E-16 between the two methods.
The scipy.stats.mstats.mquantiles method remains important for dealing with masked data.
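
For readers who want to see the comparison outside Iris, here is a minimal sketch that calls the two underlying routines directly on unmasked data. The array shape, the seed and the choice of alphap=1, betap=1 (linear interpolation, which corresponds to numpy's default) are assumptions made for illustration and are not taken from the PR diff.

```python
import numpy as np
from scipy.stats.mstats import mquantiles

# Illustrative unmasked data: 50 samples for each of 4 points.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))
percents = np.array([25.0, 50.0, 75.0])

# Fast route: plain numpy, which takes percentages in [0, 100].
fast = np.percentile(data, percents, axis=0)

# Masked-aware route: scipy takes probabilities in [0, 1].
# alphap=1, betap=1 selects linear interpolation (an assumption here).
slow = mquantiles(data, prob=percents / 100.0, alphap=1, betap=1, axis=0)

# Largest absolute difference between the two results.
max_abs_diff = np.max(np.abs(fast - np.asarray(slow)))
print(max_abs_diff)
```

Any difference should be at the level of floating-point rounding; the exact value depends on the data.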

I have rewritten the unit tests across the section of interest. I have (possibly unwisely) removed the contentious assertCML tests (see comments on #2682). Instead I have added a simple additional coordinate check on the resulting cubes as something approaching a replacement, though obviously not as rigorous and it could be expanded. I have moved the assert statements into functions and separated out the cases where the first and third percentiles were being checked in a single test as requested.

bayliffe · 2017-07-24T08:59:01Z

@pp-mo Any chance I can reinvigorate some interest in this PR? I did factor out some stuff in the unit tests as suggested to act as a sweetener :-)

DPeterK

Hi @bayliffe - many thanks for reimplementing this for Iris v2. I've made a number of comments that will need to be addressed before this can be merged, but for the most part the changes I'm after are reasonably minor and are meant to make this functionality easier to use from the user's perspective.

DPeterK · 2017-08-25T11:37:18Z

lib/iris/analysis/__init__.py

+        raise ValueError(msg.format(percentile_method))
    if not ma.isMaskedArray(data) and not ma.is_masked(result):
        result = np.asarray(result)
+        if percentile_method == 'numpy_percentile':


It seems odd to revisit this logical test. Could this not be done after L1052? Or does this really only need to fire in the case where the result is not masked?

✅ Agreed, this can be lumped into the first if statement.

DPeterK · 2017-08-25T12:25:13Z

lib/iris/analysis/__init__.py


+    Kwargs:
+
+    * percentile_method (string) :


I don't like the specifics of this implementation. In particular, this implementation relies heavily on specifying one of two strings that are obvious when you have the code in front of you but are otherwise quite esoteric. Add to that the fact that both strings are long and thus ripe for typos and it makes this implementation not very user-friendly.

I wonder if the following might work better:

kwarg renamed to "fast_percentile", default value False (for backward compatibility)

document what values of True and False will mean for the percentile calculation (including mentioning that the fast option is not compatible with masked arrays)

simplify the if-elif-else block to just an if-else block (i.e. if fast_percentile...).
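
The shape of this suggestion can be sketched as a standalone function. This is not the Iris implementation: the function name percentile_sketch, the TypeError raised for masked input and the alphap/betap values are all assumptions made for illustration.

```python
import numpy as np
import numpy.ma as ma


def percentile_sketch(data, axis, percent, fast_percentile=False):
    """Hypothetical stand-in for the aggregator's percentile calculation."""
    percent = np.atleast_1d(percent)
    if fast_percentile:
        # Assumed behaviour: refuse masked input outright, since the
        # fast option is not compatible with masked arrays.
        if ma.isMaskedArray(data):
            raise TypeError('fast_percentile=True cannot be used with '
                            'masked arrays.')
        return np.percentile(data, percent, axis=axis)
    else:
        # Imported here so the fast path has no SciPy dependency.
        from scipy.stats.mstats import mquantiles
        # alphap=1, betap=1 (linear interpolation) is an assumption.
        return mquantiles(data, prob=percent / 100.0,
                          alphap=1, betap=1, axis=axis)


# Usage: the boolean leaves only two branches, so no "invalid method" case.
quartiles = percentile_sketch(np.arange(10.0), 0, [25, 75],
                              fast_percentile=True)
```

With a boolean there is no string to mistype, and the default of False keeps existing calls on the masked-aware route.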

DPeterK · 2017-08-25T12:30:40Z

lib/iris/tests/test_analysis.py

+
+    def _check_percentile(self, data, axis, percents, expected_result,
+                          coord_check=False, **kwargs):
+


We probably don't need these two blank lines in this method 😉

✅ I do love a bit of empty space, but I'll happily make everything a bit more cosy.

DPeterK · 2017-08-25T12:31:17Z

lib/iris/tests/test_analysis.py

+    def _check_collapsed_percentile(self, cube, percents, collapse_coord,
+                                    expected_result, coord_check=False,
+                                    **kwargs):
+


Can we drop this blank line too please.

DPeterK · 2017-08-25T12:44:07Z

lib/iris/tests/test_analysis.py

-    def test_percentile_1d(self):
+
+    def _check_coord_properties(self, cube, collapse_coord, unit, points):
+        if not isinstance(collapse_coord, list):


This is not good practice! ⚠️ Consider:

    >>> t = (5,)  # A tuple: not a list, but still an iterable
    >>> if not isinstance(t, list):
    ...     t = [t]
    ...
    >>> print(t)
    [(5,)]

Much better would be:

    if isinstance(collapse_coord, six.string_types):
        collapse_coord = [collapse_coord]

Good point, I've changed it as you suggest.

I did not know about the six module for python 2/3 compatibility. That's quite cool.

DPeterK · 2017-08-25T12:57:49Z

lib/iris/tests/test_analysis.py

+    def _check_coord_properties(self, cube, collapse_coord, unit, points):
+        if not isinstance(collapse_coord, list):
+            collapse_coord = [collapse_coord]
+        name = 'percentile_over_' + '_'.join(collapse_coord)


Can we avoid string concatenations by using the following please:

name = 'percentile_over_{}'.format('_'.join(collapse_coord))

✅ We can.
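
Taken together, the two suggestions for this helper (wrap only a bare string, and build the name with str.format) might look like the sketch below. The helper names as_name_list and percentile_coord_name are hypothetical, and plain str is used in place of six.string_types on the assumption of Python 3 only.

```python
def as_name_list(collapse_coord):
    """Hypothetical helper: return the coordinate name(s) as a list."""
    # Only a bare string needs wrapping; tuples and other iterables of
    # names are converted without being nested inside another list.
    if isinstance(collapse_coord, str):
        return [collapse_coord]
    return list(collapse_coord)


def percentile_coord_name(collapse_coord):
    """Hypothetical helper: build the expected coordinate name."""
    names = as_name_list(collapse_coord)
    return 'percentile_over_{}'.format('_'.join(names))


single = percentile_coord_name('time')
several = percentile_coord_name(('latitude', 'longitude'))
```

Testing for the string case directly avoids the tuple pitfall shown earlier, because every other iterable passes through unchanged.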

DPeterK · 2017-08-25T12:59:02Z

lib/iris/tests/test_analysis.py

+        if not isinstance(collapse_coord, list):
+            collapse_coord = [collapse_coord]
+        name = 'percentile_over_' + '_'.join(collapse_coord)
+        self.assertTrue(cube.coord(name))


What is this test checking?

That the collapsed coordinate over which percentiles have been calculated has the name that is expected. The cube and coordinate are provided separately, the percentile method should result in a coordinate with the name defined on line 358. This simply checks this is so.

A previous reviewer expressed a distaste for the CML checks, so this whole function is a crude attempt at checking the cube's form in an alternative manner.

DPeterK · 2017-08-25T13:01:15Z

lib/iris/tests/test_analysis.py

+
+        np.testing.assert_array_almost_equal(result, expected_result)
+
+    def test_percentile_invalid_percentile_method(self):


Implementing my suggested change above would remove the need for this test method too: there can only be two options and one of them will always be selected.

DPeterK · 2017-08-25T13:08:50Z

lib/iris/tests/test_analysis.py

-                                        percent=25)
-        np.testing.assert_array_almost_equal(first_quartile.data,
-                                             np.array([2.5], dtype=np.float32))
-        self.assertCML(first_quartile, ('analysis',


I see you've dropped all the CML checks from these tests. Can you re-implement them please, as they're important for checking that the resultant cubes, taken as a whole, are as expected following a collapse operation. You will need to add new CML results for the new tests you've added as well.

I wouldn't expect the CML to change for the existing tests (though you've renamed the tests). If the CML is changing after accounting for the renamed tests that may well be cause for concern.

I've put these back in and added new results files for the fast_percentile_method tests.

DPeterK

@bayliffe there's just one more thing for you to contemplate, which will require another commit either way - and this is good as it might trigger the CLA blocked check to go away now that I've added you to the contributors list (congrats on that!)

DPeterK · 2017-08-29T15:35:15Z

lib/iris/tests/test_analysis.py

+        if isinstance(collapse_coord, six.string_types):
+            collapse_coord = [collapse_coord]
+        name = 'percentile_over_{}'.format('_'.join(collapse_coord))
+        self.assertTrue(cube.coord(name))


Hm, we've lost the previous comments thread here, apparently... To recall some of that thread:

A previous reviewer expressed a distaste for the CML checks

And with good reason, admittedly – they are too sensitive to small and unimportant changes; though they are good at checking a cube wholesale! Given that you've now introduced CML for the tests here can I talk you into dropping this check method? All the checks this makes are included in the CML...

For reference, I'm still not a fan of this particular assertion. I think that if the named coord you're looking for does not exist then cube.coord(name) will raise an exception, which will cause the test that called this method to error, which is undesirable. It might be better to do something like the following (if you don't replace this check method with CML checking):

    coord_names = [c.name() for c in cube.coords()]
    self.assertIn(name, coord_names)

This won't error if the named coord doesn't exist, but that assertion will fail, which is a better way to be. It's also clearer to the test reader what the assertion is doing.

You can definitely convince me of this, it was only there as the beginnings of some replacement to the CML checks, but with them back in place, it serves no useful purpose. I have removed the _check_coord_properties function along with the associated keyword. Now, if a CML_filename is provided it will conduct the CML test, otherwise it will just compare the expected values, as was previously the case.
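
The distinction between a test that errors and a test that fails can be shown with a small, self-contained sketch. FakeCube and FakeCoord are stand-ins written for this example, not Iris classes; they mimic only the coord() and coords() calls discussed above, and the KeyError raised on a missing name is an assumption about how a failed lookup behaves.

```python
import unittest


class FakeCoord:
    """Stand-in for a coordinate; only provides name()."""
    def __init__(self, name):
        self._name = name

    def name(self):
        return self._name


class FakeCube:
    """Stand-in for a cube; only provides coord() and coords()."""
    def __init__(self, names):
        self._coords = [FakeCoord(n) for n in names]

    def coords(self):
        return list(self._coords)

    def coord(self, name):
        for coord in self._coords:
            if coord.name() == name:
                return coord
        # Assumed behaviour: a missing coordinate raises an exception.
        raise KeyError(name)


def run_checks():
    """Run both styles of check; return (number of errors, failures)."""
    cube = FakeCube(['percentile_over_foo'])

    class CoordNameChecks(unittest.TestCase):
        def test_lookup(self):
            # The lookup raises before assertTrue runs: reported as an error.
            self.assertTrue(cube.coord('percentile_over_time'))

        def test_assert_in(self):
            # The assertion itself is evaluated: reported as a failure.
            names = [c.name() for c in cube.coords()]
            self.assertIn('percentile_over_time', names)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CoordNameChecks)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.errors), len(result.failures)


print(run_checks())  # (1, 1): one error, one failure
```

The failure report from assertIn also lists the names that were present, which makes the cause easier to read than a traceback from the lookup.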

DPeterK · 2017-08-31T10:34:40Z

@bayliffe great stuff! Thanks for bearing with the review process and ploughing through all the changes we requested. I think this is good to go now!

DPeterK · 2017-08-31T10:34:53Z

🎉

bayliffe · 2017-08-31T12:22:55Z

Thanks for your help Pete.

This was referenced Jul 17, 2017

np.percentile method as an alternative to scipy.mstats #2682

Closed

improver percentile: use custom numpy.percentile for speed metoppv/improver#113

Closed

DPeterK self-assigned this Aug 25, 2017

pelson added the Blocked: CLA needed See https://scitools.org.uk. Submit the form at: https://scitools.org.uk/cla/v4/form label Aug 25, 2017

DPeterK requested changes Aug 25, 2017

bayliffe added 2 commits August 29, 2017 13:38

Moving percentile method changes into Iris V2.

26a1200

Changes requested by Pete. Restoration of CML testing in unit tests, …

4996c45

…with new KGO for new method.

bayliffe force-pushed the PercentileV2 branch from 1072240 to 4996c45 Compare August 29, 2017 12:43

pelson added the Blocked: CLA needed See https://scitools.org.uk. Submit the form at: https://scitools.org.uk/cla/v4/form label Aug 29, 2017

DPeterK reviewed Aug 29, 2017

Removed _check_coord_properties function now that CML tests have been…

b410b9a

… reimplemented.

pelson removed the Blocked: CLA needed See https://scitools.org.uk. Submit the form at: https://scitools.org.uk/cla/v4/form label Aug 31, 2017

DPeterK approved these changes Aug 31, 2017

DPeterK merged commit c9c40c5 into SciTools:master Aug 31, 2017

QuLogic added this to the v2.0 milestone Aug 31, 2017

Peter9192 mentioned this pull request Mar 11, 2021

Lazy implementation of multi_model_statistics and ensemble_statistics preprocessors ESMValGroup/ESMValCore#968

Merged

9 tasks

