-
Notifications
You must be signed in to change notification settings - Fork 300
np.percentile method as an alternative to scipy.mstats #2682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…r. Accessed with kwarg percentile_method="fast". Existing unit tests duplicated with call to fast method, where masked data will result in an error.
|
@bayliffe |
marqh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @bayliffe
thank you for the contribution, this looks like a helpful performance benefit in principal
there are some implementation details I have commented on, which could do with a little more thought
please could you let me know your organisation affiliation, so I can check if you are already covered by an organisation Contributor License Agreement?
There are some test failures, which seem unrelated to your change. please may you 'rebase' to 'master' and push, to see if that fixes these oddities?
git fetch upstream
git rebase upstream/master
(then, assuming this has worked with no errors)
git push -f origin FastPercentileMethod
if this isn't clear, please let me know, and I can follow up; don't do this if it doesn't make sense to you
many thanks
mark
| Kwargs: | ||
| * percentile_method (string) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by providing a string input, you are enabling a further enhancement opportunity, specifying another implementation; which is no bad thing, this is just an observation.
Given the string input, I would like to see an enhanced docstring, describing the choice of inputs and their effect. for example
Switch to choose between methods for calculating percentiles.
Specify: `percentile_method='fast'` to use the fast `numpy.percentiles` method.
Do not specify or specify `percentile_method=None` to use the scipy.mstats.mquantiles method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afternoon. My affiliation is Met Office.
I have based the code on v1.13 as that is the version we are currently using. It is up-to-date with that branch, so a rebase does nothing. I can move it to v2 if that is desirable?
I have changed the comment and moved the keyword argument into the interface as suggested.
lib/iris/analysis/__init__.py
Outdated
| calculating percentiles, and the much faster np.percentiles method. | ||
| """ | ||
| percentile_method = kwargs.get('percentile_method', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please may you consider specifying this in the function definition, i.e.:
def _percentile(data, axis, percent, percentile_method=None, **kwargs):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Actually yes, that is a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| msg = 'Cannot use fast np.percentile method with masked array.' | ||
| if ma.isMaskedArray(data): | ||
| raise TypeError(msg) | ||
| result = np.percentile(data, percent, axis=-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does np.percentile take the same **kwargs as scipy.stats.mstats.mquantiles?
if it is, then I think the kwargs should be passed down
if not, then some thought about how to handle kwargs differently for different implementations should be considered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the kwargs should be passed down
👍
That makes good sense + from the docs you'd expect it to do that in both cases.
if [they don't take the same kwargs] then some thought about how to handle kwargs differently for different implementations should be considered
👎
I disagree : I don't think that's at all necessary or helpful unless your code is specifically using extra args by testing them, adjusting them or automatically generating them.
In my head, it is clear that percentile_method switches between two underlying methods, and that is an explicit user choice. Then any extra args are supplied by the same user code, appropriate to the chosen method, so all the necessary information and responsibility is in the user's hands.
This routine knows nothing about those extra args, and doesn't need to, and certainly should not reproduce information relating to the underlying methods -- which can subsequently change anyway, without breaking this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The np.percentile method does not except the same kwargs, hence their omission. I will leave this as is for now, pending any further suggestions.
lib/iris/tests/test_analysis.py
Outdated
| percent=25, percentile_method='fast') | ||
| np.testing.assert_array_almost_equal(first_quartile.data, | ||
| np.array([2.5], dtype=np.float32)) | ||
| self.assertCML(first_quartile, ('analysis', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here, and in other places, I would rather not run self.assertCML in these cases, it is an expensive and sometimes unstable test pattern
I would also prefer to see one assert per test case. T=in this case, that would mean having 2 test functions, one for first_quartile and one for third_quartile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is asking @bayliffe to improve on the existing code.
Which I'm not actually against, BTW 😉
Ideally, I'd prefer to see a common utility routine instead of all this duplicated code...
I wouldn't insist on it, but it would make these improvements simpler + more obvious.
P.S.
> improve on the existing code
Which actually is very old + unfit -- see below...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
common utility routine
By which I mean, something like self._check_percentile(cube, percents, expected_result)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I just noticed that nothing here is testing for multiple values passed to the percentile arg.
Actually TBH these are very old tests, as are all those not under iris/tests/unit or iris/tests/integration.
I don't have time just now to research a more definite suggestion, but I would definitely be thinking of adding new tests into iris.tests.unit.analysis.test_PERCENTILE or iris.tests.unit.analysis.test_PercentileAggregator.
There is testing there for multiple values, too.
Sorry to have been so slow seeing the obvious here !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have rewritten the unit tests across the section of interest. I have (possibly unwisely) removed the contentious assertCML tests. Instead I have added a simple additional coordinate check on the resulting cubes as something approaching a replacement, though obviously not as rigorous and it could be expanded. I have moved the assert statements into functions and separated out the cases where the first and third percentiles were being checked in a single test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be what you were after, but it is a bit tidier. Feel free to suggest further changes, or the return of the CML statements as you prefer.
| else: | ||
| quantiles = np.array(percent) / 100. | ||
| result = scipy.stats.mstats.mquantiles(data, quantiles, axis=-1, | ||
| **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As **kwargs is passed here for extended args, you should really remove any 'percentile_method' key as it could otherwise cause errors.
BTW this is also a usage mode you don't test : cube.collapsed(..., percent=x, percentile_method=None) :
As stating percentile_method=None is not the same as not providing one, this could cause problems with other routines calling it.
The usual way I've seen elsewhere is to use kwargs.pop in place of kwargs.get.
In fact, for choice, I'd prefer distinct allowed terms here, like "numpy_percentile" and "scipy_mquantiles", and not just "None".
I guess it would be neater if it was just a method object, but as they need handling differently, I'm happy that a string modal switch makes better sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to accept numpy_percentile or scipy_mquantiles, with the latter set as the keyword default. A ValueError is raised if an invalid method is set. I have added a unit test for this eventuality.
… percentiles. Changed switch strings for selecting percentile_method. Added additional unit test for invalid percentile_method.
|
Thanks @ajdawson, I will change this to Iris V2. |
|
Replaced with PR #2687 for implementation in Iris V2. |
Introduction of a numpy.percentile method to the percentile aggregator for the purposes of providing a fast alternative (approx 50 times faster).