-
Notifications
You must be signed in to change notification settings - Fork 300
Fast percentile method in Iris V2. #2687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@pp-mo Any chance I can reinvigorate some interest in this PR? I did factor out some stuff in the unit tests as suggested to act as a sweetener :-) |
DPeterK
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bayliffe - many thanks for reimplementing this for Iris v2. I've made a number of comments that will need to be addressed before this can be merged, but for the most part the changes I'm after are reasonably minor and to make this functionality easier to use from the user's perspective.
lib/iris/analysis/__init__.py
Outdated
| raise ValueError(msg.format(percentile_method)) | ||
| if not ma.isMaskedArray(data) and not ma.is_masked(result): | ||
| result = np.asarray(result) | ||
| if percentile_method == 'numpy_percentile': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems odd to revisit this logical test. Could this not be done after L1052? Or does this really only need to fire in the case where the result is not masked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Agreed, this can be lumped into the first if statement.
lib/iris/analysis/__init__.py
Outdated
| Kwargs: | ||
| * percentile_method (string) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the specifics of this implementation. In particular, this implementation relies heavily on specifying one of two strings that are obvious when you have the code in front of you but are otherwise quite esoteric. Add to that the fact that both strings are long and thus ripe for typos and it makes this implementation not very user-friendly.
I wonder if the following might work better:
kwargrenamed to "fast_percentile", default valueFalse(for backward compatibility)- document what values of
TrueandFalsewill mean for the percentile calculation (including mentioning that the fast option is not compatible with masked arrays) - simplify the if-elif-else block to just an if-else block (i.e.
if fast_percentile...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅
lib/iris/tests/test_analysis.py
Outdated
|
|
||
| def _check_percentile(self, data, axis, percents, expected_result, | ||
| coord_check=False, **kwargs): | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't need these two blank lines in this method 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ I do love a bit of empty space, but I'll happily make everything a bit more cosy.
lib/iris/tests/test_analysis.py
Outdated
| def _check_collapsed_percentile(self, cube, percents, collapse_coord, | ||
| expected_result, coord_check=False, | ||
| **kwargs): | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we drop this blank line too please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅
lib/iris/tests/test_analysis.py
Outdated
| def test_percentile_1d(self): | ||
|
|
||
| def _check_coord_properties(self, cube, collapse_coord, unit, points): | ||
| if not isinstance(collapse_coord, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not good practice!
>>> t = (5,) # A tuple: not a list, but still an iterable
>>> if not isinstance(t, list):
... t = [t]
...
>>> print t
[(t)]Much better would be:
if isinstance(collapse_coord, six.string_types):
collapse_coord = [collapse_coord]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I've changed it as you suggest.
I did not know about the six module for python 2/3 compatibility. That's quite cool.
lib/iris/tests/test_analysis.py
Outdated
| def _check_coord_properties(self, cube, collapse_coord, unit, points): | ||
| if not isinstance(collapse_coord, list): | ||
| collapse_coord = [collapse_coord] | ||
| name = 'percentile_over_' + '_'.join(collapse_coord) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid string concatenations by using the following please:
name = 'percentile_over_{}'.format('_'.join(collapse_coord))There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ We can.
lib/iris/tests/test_analysis.py
Outdated
| if not isinstance(collapse_coord, list): | ||
| collapse_coord = [collapse_coord] | ||
| name = 'percentile_over_' + '_'.join(collapse_coord) | ||
| self.assertTrue(cube.coord(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this test checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That the collapsed coordinate over which percentiles have been calculated has the name that is expected. The cube and coordinate are provided separately, the percentile method should result in a coordinate with the name defined on line 358. This simply checks this is so.
A previous reviewer expressed a distaste for the CML checks, so this whole function is a crude attempt at checking the cube's form in an alternative manner.
lib/iris/tests/test_analysis.py
Outdated
|
|
||
| np.testing.assert_array_almost_equal(result, expected_result) | ||
|
|
||
| def test_percentile_invalid_percentile_method(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementing my suggested change above would remove the need for this test method too: there can only be two options and one of them will always be selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅
| percent=25) | ||
| np.testing.assert_array_almost_equal(first_quartile.data, | ||
| np.array([2.5], dtype=np.float32)) | ||
| self.assertCML(first_quartile, ('analysis', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you've dropped all the CML checks from these tests. Can you re-implement them please, as they're important for checking the resultant cubes as a whole entity are as expected following a collapse operation. You will need to add new CML results for the new tests you've added as well.
I wouldn't expect the CML to change for the existing tests (though you've renamed the tests). If the CML is changing after accounting for the renamed tests that may well be cause for concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put these back in and added new results files for the fast_percentile_method tests.
…with new KGO for new method.
DPeterK
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bayliffe there's just one more thing for you to contemplate, which will require another commit either way - and this is good as it might trigger the CLA blocked check to go away now that I've added you to the contributors list (congrats on that!)
lib/iris/tests/test_analysis.py
Outdated
| if isinstance(collapse_coord, six.string_types): | ||
| collapse_coord = [collapse_coord] | ||
| name = 'percentile_over_{}'.format('_'.join(collapse_coord)) | ||
| self.assertTrue(cube.coord(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, we've lost the previous comments thread here, apparently... To recall some of that thread:
A previous reviewer expressed a distaste for the CML checks
And with good reason, admittedly – they are too sensitive to small and unimportant changes; though they are good at checking a cube wholesale! Given that you've now introduced CML for the tests here can I talk you into dropping this check method? All the checks this makes are included in the CML...
For reference, I'm still not a fan of this particular assertion. I think that if the named coord you're looking for does not exist then cube.coord(name) will raise an exception, which will cause the test that called this method to error, which is undesirable. It might be better to do something like the following (if you don't replace this check method with CML checking):
coord_names = [c.name() for c in cube.coords()]
self.assertIn(name, coord_names)This won't error if the named coord doesn't exist, but that assertion will fail, which is a better way to be. It's also clearer to the test reader what the assertion is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can definitely convince me of this, it was only there as the beginnings of some replacement to the CML checks, but with them back in place, it serves no useful purpose. I have removed the _check_coord_properties function along with the associated keyword. Now, if a CML_filename is provided it will conduct the CML test, otherwise it will just compare the expected values, as was previously the case.
|
@bayliffe great stuff! Thanks for bearing with the review process and ploughing through all the changes we requested. I think this is good to go now! |
|
🎉 |
|
Thanks for your help Pete. |
Replacement for PR #2682 (np.percentile method as an alternative to scipy.mstats).
Now for Iris V2.
Introduction of a numpy.percentile method to the percentile aggregator for the purposes of providing a fast alternative (approx 50 times faster).
I have rewritten the unit tests across the section of interest. I have (possibly unwisely) removed the contentious assertCML tests (see comments on #2682). Instead I have added a simple additional coordinate check on the resulting cubes as something approaching a replacement, though obviously not as rigorous and it could be expanded. I have moved the assert statements into functions and separated out the cases where the first and third percentiles were being checked in a single test as requested.