Reuse multidim_daskstack in merge + fast um loading. #2423
Conversation
lib/iris/_lazy_data.py (outdated)
Args:
* data:
Your input argument is called "stack", not "data".
self._data_cache = [da.stack(self._data_cache[i:i+size]) for i
                    in range(0, len(self._data_cache), size)]
self._data_cache, = self._data_cache
data_arrays = np.array([f._data for f in self.fields],
I think you need to convert each f._data into a dask array here. This isn't done by multidim_daskstack, unless calling da.stack converts the inputs into dask arrays and then stacks them? Either way, the conversion to arrays is not being done by the common API from #2421, which it probably should be.
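As a minimal sketch of the explicit conversion being suggested (the FakeField stand-in, the example shapes, and the per-field chunking choice are illustrative assumptions, not the PR's actual code):

import dask.array as da
import numpy as np

class FakeField(object):
    # Illustrative stand-in for a loaded field carrying raw data.
    def __init__(self, data):
        self._data = data

fields = [FakeField(i * np.ones((4, 5))) for i in range(3)]

# Wrap each field's data as a dask array explicitly, rather than
# relying on da.stack to coerce plain NumPy inputs.
stack = np.empty(len(fields), dtype=object)
for i, field in enumerate(fields):
    stack[i] = da.from_array(field._data, chunks=field._data.shape)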
class Test_multidim_daskstack(tests.IrisTest):
    def test_0d(self):
        value = 4
        data = np.array(da.from_array(np.array(value), chunks=1), dtype=object)
These from_array calls will need to be replaced after #2421 has been merged...
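For illustration, the replacement might look something like this once #2421 lands — assuming its as_lazy_data helper, which later revisions in this thread do use:

# Before: wrap the raw value with dask directly.
data = np.array(da.from_array(np.array(value), chunks=1), dtype=object)
# After #2421: go through the common conversion helper instead.
data = np.array(as_lazy_data(np.array(value)), dtype=object)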
vals = [4, 8, 11]
data = np.array([da.from_array(val*np.ones((2, 2)), chunks=2) for val
                 in vals], dtype=object)
data = data.reshape(3, 1, 2, 2)
I'd argue this data is no longer 2D! Perhaps you could rename this test to test_nd?
# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
-"""Test :meth:`iris._lazy_data.array_masked_to_nans` method."""
+"""Test function :func:`iris._lazy_data.array_masked_to_nans`."""
👍
# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
-"""Test :meth:`iris._lazy_data.is_lazy_data` method."""
+"""Test function :func:`iris._lazy_data.is_lazy_data`."""
👍
self.assertArrayEqual(result[:, :, 0, 0], np.array(vals).reshape(3, 1))


if __name__ == '__main__':
Given my comment above about whether you have NumPy arrays or dask arrays at the end of this process, you could consider adding a test to check what happens when you pass in NumPy arrays versus dask arrays.
Actually, I don't think we need to worry about what happens if you pass the routine numpy arrays (i.e. a numpy object array containing numpy arrays).
In intended use the argument is as stated "an ndarray of dask arrays", so we should just test that.
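A sketch of such a contract test — the test name is hypothetical, and multidim_lazy_stack is assumed importable from this PR's changes:

import dask.array as da
import numpy as np

def test_result_is_lazy():
    # Per the stated contract, build a 1-D object ndarray of dask
    # arrays and confirm the stacked result is itself a dask array.
    stack = np.empty((3,), dtype=object)
    for i in range(3):
        stack[i] = da.from_array(i * np.ones((2, 2)), chunks=(2, 2))
    result = multidim_lazy_stack(stack)
    assert isinstance(result, da.Array)
    assert result.shape == (3, 2, 2)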
123e252 to 975105b
048c521 to 717e7cf
@dkillick @pp-mo the changes based on your reviews are now up!
DPeterK left a comment
Just a couple more comments from me but it's looking good! 🌴 👍
def _check(self, stack_shape):
    vals = np.arange(np.prod(stack_shape)).reshape(stack_shape)
    stack = np.empty(stack_shape, 'object')
    stack_element_shape = (4, 5)
These magic numbers could probably do with a little explanation...
They're not magic numbers; I just chose them as they were small enough that I could produce a 4D result (for test_2d_lazy_stack, of shape (3, 2, 4, 5)) where no dimension length is repeated.
It's equivalent to pp field data of shape (4, 5).
Technically they are, as they're numbers that appear without much introduction! A comment on the line above saying "equivalent to pp field data of shape (4, 5)" would nicely sort this and improve the code's understandability, I think.
It's possibly better to think of the numbers as "test input", similar to when you create an array to test an array operation: that array is thought of as "test input data".
In these tests, I am creating a stack of val*np.ones((4, 5)) arrays, where the shape of each element in the stack is (4, 5). So there's nothing special about the numbers; I just need something to create an array, and then I reuse those numbers to check that I'm getting the right output shape.
My worry with the comment "equivalent to a pp field data of shape (4,5)" is that it describes just one application of multidim_lazy_stack.
I would consider renaming the variable, though. Would stack_element_dask_array_shape be clearer?
shape = (2,)
result = self._check(shape)

def test_2d_lazy_stack(self):
Is there any chance of a >2D test?
It seems a bit unnecessary? The 2D test is already testing the recursion.
Indeed, however part of testing is checking edge cases, and I don't know what would happen if I passed a 7D array to it. Evidently we don't need to test all possible dimensionalities (!), but it would be good to test something that's outside of the boundary of logical intent for the functionality being tested. This is just in case an unsafe assumption has been made about how the function will be used; "of course, no-one would ever use this for more than a 2D input".
From a mockist / white-box-y point of view, the existing tests do already cover all 3 code branches.
The implementation relies on iteration over the passed stack to deconstruct the input one dimension at a time, but 2d is already checking that.
One more would do no harm, I suppose, but I don't really expect it to go wrong!
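If a >2D case were added, it could reuse the _check helper from the existing tests; this test name and shape are hypothetical:

def test_3d_lazy_stack(self):
    # A 3-D stack of (4, 5) elements should yield a 5-D result of
    # shape (6, 3, 2, 4, 5); no dimension length repeats, so a
    # mis-ordered stacking would show up in the shape check.
    shape = (6, 3, 2)
    result = self._check(shape)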
self.vector_dims_shape + data_arrays.shape[1:])
self._data_cache = multidim_daskstack(data_arrays)
stack = np.empty(np.prod(self.vector_dims_shape), 'object')
for index, f in enumerate(self.fields):
Haha, I so would have gone for "i, field" here...
Good point. I used f as that was what was used in the previous code. I will change it to
for index, field in ...
self._data_cache = [da.stack(self._data_cache[i:i+size]) for i
                    in range(0, len(self._data_cache), size)]
self._data_cache, = self._data_cache
stack = np.empty(np.prod(self.vector_dims_shape), 'object')
I think it would be neater + clearer to iterate with a multidimensional index here, thus avoiding the np.prod(shape) and the subsequent stack.reshape.
Your magic friend for that is np.ndindex(shape), which gives you indices to all elements of a multidimensional array, as used here for example.
I think that gets you something like...
stack = np.empty(self.vector_dims_shape, 'object')
for field, nd_index in zip(fields, np.ndindex(self.vector_dims_shape)):
    stack[nd_index] = as_lazy_data(field._data, chunks=field._data.shape)
self._data_cache = multidim_lazy_stack(stack)
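As a quick standalone illustration of what np.ndindex yields (nothing here is Iris-specific):

import numpy as np

# np.ndindex(shape) yields every index tuple of that shape in C order,
# so a multidimensional object array can be filled directly, with no
# flat indexing and no final reshape.
shape = (2, 3)
stack = np.empty(shape, dtype=object)
for nd_index in np.ndindex(shape):
    stack[nd_index] = nd_index
print(stack)
# [[(0, 0) (0, 1) (0, 2)]
#  [(1, 0) (1, 1) (1, 2)]]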
for index, val in np.ndenumerate(vals):
    stack[index] = as_lazy_data(val*np.ones(stack_element_shape))
result = multidim_lazy_stack(stack)
self.assertEqual(result.shape, stack_shape + stack_element_shape)
I wonder if we could also check the actual values, to ensure that the ordering of elements is definitely correct?
I think all that is needed is something like...
expected = np.empty(list(stack_shape) + list(stack_element_shape), dtype=int)
for index, val in np.ndenumerate(vals):
    stack[index] = as_lazy_data(val*np.ones(stack_element_shape))
    expected[index] = val
...
self.assertArrayAllClose(result.compute(), expected)
> ensure that the ordering of elements is definitely correct
Surely that's already achieved by using dimensions of different lengths?
Not that I'm against adding extra assurance.
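Pulled together, the suggested value check might look like the following self-contained helper — the function name is hypothetical, da.from_array stands in for the as_lazy_data helper, and multidim_lazy_stack is the PR's function:

import dask.array as da
import numpy as np

def check_stack_values(stack_shape, stack_element_shape=(4, 5)):
    # Build the lazy stack and, in the same loop, the plain NumPy
    # array that the stacked result should compute to.
    vals = np.arange(np.prod(stack_shape)).reshape(stack_shape)
    stack = np.empty(stack_shape, dtype=object)
    expected = np.empty(stack_shape + stack_element_shape, dtype=int)
    for index, val in np.ndenumerate(vals):
        stack[index] = da.from_array(val * np.ones(stack_element_shape),
                                     chunks=stack_element_shape)
        expected[index] = val
    result = multidim_lazy_stack(stack)
    # Check the shape and the element values, so that the ordering of
    # the stacked elements is verified too.
    assert result.shape == stack_shape + stack_element_shape
    np.testing.assert_allclose(result.compute(), expected)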
815fe26 to 34f810d
@dkillick @pp-mo I have made the final changes based on your reviews, and tests pass... Merge? (Please squash and merge)
Use multidimensional dask stacks in merge and fast UM loading.
This reuses the code that recursively builds a multidimensional stacked dask array in two places: cube merging and fast UM loading.
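For readers without the diff to hand, the recursive stacking idea is roughly as follows — a sketch consistent with the three code branches mentioned in review, not the PR's verbatim implementation:

import dask.array as da
import numpy as np

def multidim_lazy_stack(stack):
    # Recursively combine an ndarray of dask arrays into a single dask
    # array whose shape is the stack's shape followed by each
    # element's shape.
    if stack.ndim == 0:
        # 0-d object array: unwrap the single dask array.
        result = stack.item()
    elif stack.ndim == 1:
        # Base case: stack along one new leading dimension.
        result = da.stack(list(stack))
    else:
        # Recurse over the leading dimension, collapsing the trailing
        # stack dimensions of each sub-array first.
        result = da.stack([multidim_lazy_stack(subarray)
                           for subarray in stack])
    return result

With this shape of function, a (3, 2) stack of (4, 5)-shaped dask arrays stacks lazily to a single dask array of shape (3, 2, 4, 5), as exercised by the tests discussed above.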