np.percentile method as an alternative to scipy.mstats #2682

bayliffe · 2017-07-12T14:22:51Z

Introduction of a numpy.percentile method to the percentile aggregator for the purposes of providing a fast alternative (approx 50 times faster).

Accessed with kwarg percentile_method="fast" passed to cube.collapsed method.
Existing unit tests duplicated with call to fast method, where masked data will result in an error.
Ran test cases and found fractional differences of order 1E-16 between two methods.
scipy.stats.mstats.mquantiles method remains important for dealing with masked data.

added unit tests.

…r. Accessed with kwarg percentile_method="fast". Existing unit tests duplicated with call to fast method, where masked data will result in an error.

pp-mo · 2017-07-13T10:45:35Z

@bayliffe
Some of those test failures seem decidedly odd + I doubt they are "your fault".
Possibly something has changed in the Travis environment, like a new version of a package becoming available.
This happens from time to time -- I'll take a look...

marqh

Hello @bayliffe

thank you for the contribution, this looks like a helpful performance benefit in principal

there are some implementation details I have commented on, which could do with a little more thought

please could you let me know your organisation affiliation, so I can check if you are already covered by an organisation Contributor License Agreement?

There are some test failures, which seem unrelated to your change. please may you 'rebase' to 'master' and push, to see if that fixes these oddities?

git fetch upstream
git rebase upstream/master
(then, assuming this has worked with no errors)
git push -f origin FastPercentileMethod

if this isn't clear, please let me know, and I can follow up; don't do this if it doesn't make sense to you

many thanks
mark

marqh · 2017-07-13T10:36:11Z

lib/iris/analysis/__init__.py


+    Kwargs:
+
+    * percentile_method (string) :


by providing a string input, you are enabling a further enhancement opportunity, specifying another implementation; which is no bad thing, this is just an observation.

Given the string input, I would like to see an enhanced docstring, describing the choice of inputs and their effect. for example

Switch to choose between methods for calculating percentiles. Specify: `percentile_method='fast'` to use the fast `numpy.percentiles` method. Do not specify or specify `percentile_method=None` to use the scipy.mstats.mquantiles method.

Afternoon. My affiliation is Met Office.
I have based the code on v1.13 as that is the version we are currently using. It is up-to-date with that branch, so a rebase does nothing. I can move it to v2 if that is desirable?

I have changed the comment and moved the keyword argument into the interface as suggested.

marqh · 2017-07-13T10:38:21Z

lib/iris/analysis/__init__.py

+        calculating percentiles, and the much faster np.percentiles method.
+
    """
+    percentile_method = kwargs.get('percentile_method', None)


please may you consider specifying this in the function definition, i.e.:

def _percentile(data, axis, percent, percentile_method=None, **kwargs):

👍
Actually yes, that is a better way.

marqh · 2017-07-13T10:39:55Z

lib/iris/analysis/__init__.py

+        msg = 'Cannot use fast np.percentile method with masked array.'
+        if ma.isMaskedArray(data):
+            raise TypeError(msg)
+        result = np.percentile(data, percent, axis=-1)


does np.percentile take the same **kwargs as scipy.stats.mstats.mquantiles?

if it is, then I think the kwargs should be passed down

if not, then some thought about how to handle kwargs differently for different implementations should be considered

I think the kwargs should be passed down

👍
That makes good sense + from the docs you'd expect it to do that in both cases.

if [they don't take the same kwargs] then some thought about how to handle kwargs differently for different implementations should be considered

👎
I disagree : I don't think that's at all necessary or helpful unless your code is specifically using extra args by testing them, adjusting them or automatically generating them.
In my head, it is clear that percentile_method switches between two underlying methods, and that is an explicit user choice. Then any extra args are supplied by the same user code, appropriate to the chosen method, so all the necessary information and responsibility is in the user's hands.
This routine knows nothing about those extra args, and doesn't need to, and certainly should not reproduce information relating to the underlying methods -- which can subsequently change anyway, without breaking this code.

The np.percentile method does not except the same kwargs, hence their omission. I will leave this as is for now, pending any further suggestions.

marqh · 2017-07-13T10:44:40Z

lib/iris/tests/test_analysis.py

+                                        percent=25, percentile_method='fast')
+        np.testing.assert_array_almost_equal(first_quartile.data,
+                                             np.array([2.5], dtype=np.float32))
+        self.assertCML(first_quartile, ('analysis',


here, and in other places, I would rather not run self.assertCML in these cases, it is an expensive and sometimes unstable test pattern

I would also prefer to see one assert per test case. T=in this case, that would mean having 2 test functions, one for first_quartile and one for third_quartile

This is asking @bayliffe to improve on the existing code.
Which I'm not actually against, BTW 😉

Ideally, I'd prefer to see a common utility routine instead of all this duplicated code...
I wouldn't insist on it, but it would make these improvements simpler + more obvious.

P.S.
> improve on the existing code

Which actually is very old + unfit -- see below...

common utility routine

By which I mean, something like self._check_percentile(cube, percents, expected_result)

Also, I just noticed that nothing here is testing for multiple values passed to the percentile arg.

Actually TBH these are very old tests, as are all those not under iris/tests/unit or iris/tests/integration.
I don't have time just now to research a more definite suggestion, but I would definitely be thinking of adding new tests into iris.tests.unit.analysis.test_PERCENTILE or iris.tests.unit.analysis.test_PercentileAggregator.
There is testing there for multiple values, too.

Sorry to have been so slow seeing the obvious here !

I have rewritten the unit tests across the section of interest. I have (possibly unwisely) removed the contentious assertCML tests. Instead I have added a simple additional coordinate check on the resulting cubes as something approaching a replacement, though obviously not as rigorous and it could be expanded. I have moved the assert statements into functions and separated out the cases where the first and third percentiles were being checked in a single test.

This may not be what you were after, but it is a bit tidier. Feel free to suggest further changes, or the return of the CML statements as you prefer.

pp-mo · 2017-07-13T13:14:40Z

lib/iris/analysis/__init__.py

+    else:
+        quantiles = np.array(percent) / 100.
+        result = scipy.stats.mstats.mquantiles(data, quantiles, axis=-1,
+                                               **kwargs)


As **kwargs is passed here for extended args, you should really remove any 'percentile_method' key as it could otherwise cause errors.
BTW this is also a usage mode you don't test : cube.collapsed(..., percent=x, percentile_method=None) :
As stating percentile_method=None is not the same as not providing one, this could cause problems with other routines calling it.
The usual way I've seen elsewhere is to use kwargs.pop in place of kwargs.get.

In fact, for choice, I'd prefer distinct allowed terms here, like "numpy_percentile" and "scipy_mquantiles", and not just "None".
I guess it would be neater if it was just a method object, but as they need handling differently, I'm happy that a string modal switch makes better sense.

Changed to accept numpy_percentile or scipy_mquantiles, with the latter set as the keyword default. A ValueError is raised if an invalid method is set. I have added a unit test for this eventuality.

ajdawson · 2017-07-17T14:42:44Z

@bayliffe (cc @marqh) The merge target here is v1.13.x, which is a bugfix-only branch closed for new features. This contribution should be made to the master branch instead.

… percentiles. Changed switch strings for selecting percentile_method. Added additional unit test for invalid percentile_method.

bayliffe · 2017-07-17T15:03:00Z

Thanks @ajdawson, I will change this to Iris V2.

bayliffe · 2017-07-17T15:29:11Z

Replaced with PR #2687 for implementation in Iris V2.

bayliffe added 2 commits July 12, 2017 15:09

Introduction of a numpy.percentile method to the percentile aggregato…

9567b90

…r. Accessed with kwarg percentile_method="fast". Existing unit tests duplicated with call to fast method, where masked data will result in an error.

Removed errant print statement.

867cf34

bayliffe mentioned this pull request Jul 12, 2017

improver percentile: use custom numpy.percentile for speed metoppv/improver#113

Closed

Removed whitespace causing pep8 issues.

a0f5f7d

marqh reviewed Jul 13, 2017

View reviewed changes

pp-mo reviewed Jul 13, 2017

View reviewed changes

Changes following first review of code. Functionalised unit tests for…

d049dd3

… percentiles. Changed switch strings for selecting percentile_method. Added additional unit test for invalid percentile_method.

bayliffe mentioned this pull request Jul 17, 2017

Fast percentile method in Iris V2. #2687

Merged

bayliffe closed this Jul 17, 2017

np.percentile method as an alternative to scipy.mstats #2682

np.percentile method as an alternative to scipy.mstats #2682

Uh oh!

Conversation

bayliffe commented Jul 12, 2017

Uh oh!

pp-mo commented Jul 13, 2017

Uh oh!

marqh left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pp-mo Jul 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pp-mo Jul 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pp-mo Jul 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ajdawson commented Jul 17, 2017

Uh oh!

bayliffe commented Jul 17, 2017

Uh oh!

bayliffe commented Jul 17, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pp-mo Jul 13, 2017 •

edited

Loading

pp-mo Jul 13, 2017 •

edited

Loading

pp-mo Jul 13, 2017 •

edited

Loading