Reuse multidim_daskstack in merge + fast um loading. #2423
Conversation
lib/iris/_lazy_data.py (outdated)
Args:
* data:
Your input argument is called "stack", not "data".
self._data_cache = [da.stack(self._data_cache[i:i+size]) for i
                    in range(0, len(self._data_cache), size)]
self._data_cache, = self._data_cache
data_arrays = np.array([f._data for f in self.fields],
I think you need to convert each f._data into a dask array here. This isn't done by multidim_daskstack, unless calling da.stack converts the inputs into dask arrays and then stacks them? Either way, the conversion to arrays is not being done by the common API from #2421, which it probably should be.
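As a minimal sketch of the explicit conversion being suggested (the FakeField stand-in, the example shapes, and the per-field chunking choice are illustrative assumptions, not the PR's actual code):

import dask.array as da
import numpy as np

class FakeField(object):
    # Illustrative stand-in for a loaded field carrying raw data.
    def __init__(self, data):
        self._data = data

fields = [FakeField(i * np.ones((4, 5))) for i in range(3)]

# Wrap each field's data as a dask array explicitly, rather than
# relying on da.stack to coerce plain NumPy inputs.
stack = np.empty(len(fields), dtype=object)
for i, field in enumerate(fields):
    stack[i] = da.from_array(field._data, chunks=field._data.shape)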
class Test_multidim_daskstack(tests.IrisTest):
    def test_0d(self):
        value = 4
        data = np.array(da.from_array(np.array(value), chunks=1), dtype=object)
These from_array calls will need to be replaced after #2421 has been merged...
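For illustration, the replacement might look something like this once #2421 lands — assuming its as_lazy_data helper, which later revisions in this thread do use:

# Before: wrap the raw value with dask directly.
data = np.array(da.from_array(np.array(value), chunks=1), dtype=object)
# After #2421: go through the common conversion helper instead.
data = np.array(as_lazy_data(np.array(value)), dtype=object)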
vals = [4, 8, 11]
data = np.array([da.from_array(val*np.ones((2, 2)), chunks=2) for val
                 in vals], dtype=object)
data = data.reshape(3, 1, 2, 2)
I'd argue this data is no longer 2D! Perhaps you could rename this test to test_nd?
# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
-"""Test :meth:`iris._lazy_data.array_masked_to_nans` method."""
+"""Test function :func:`iris._lazy_data.array_masked_to_nans`."""
👍
# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
-"""Test :meth:`iris._lazy_data.is_lazy_data` method."""
+"""Test function :func:`iris._lazy_data.is_lazy_data`."""
👍
self.assertArrayEqual(result[:, :, 0, 0], np.array(vals).reshape(3, 1))


if __name__ == '__main__':
Given my comment above about whether you have NumPy arrays or dask arrays at the end of this process, you could consider adding a test to check what happens when you pass in NumPy arrays versus dask arrays.
Actually, I don't think we need to worry about what happens if you pass the routine numpy arrays (i.e. a numpy object array containing numpy arrays).
In intended use the argument is as stated "an ndarray of dask arrays", so we should just test that.
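A sketch of such a contract test — the test name is hypothetical, and multidim_lazy_stack is assumed importable from this PR's changes:

import dask.array as da
import numpy as np

def test_result_is_lazy():
    # Per the stated contract, build a 1-D object ndarray of dask
    # arrays and confirm the stacked result is itself a dask array.
    stack = np.empty((3,), dtype=object)
    for i in range(3):
        stack[i] = da.from_array(i * np.ones((2, 2)), chunks=(2, 2))
    result = multidim_lazy_stack(stack)
    assert isinstance(result, da.Array)
    assert result.shape == (3, 2, 2)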
123e252 to 975105b
048c521 to 717e7cf
@dkillick @pp-mo the changes based on your reviews are now up!
DPeterK left a comment
Just a couple more comments from me but it's looking good! 🌴 👍
def _check(self, stack_shape):
    vals = np.arange(np.prod(stack_shape)).reshape(stack_shape)
    stack = np.empty(stack_shape, 'object')
    stack_element_shape = (4, 5)
These magic numbers could probably do with a little explanation...
They're not magic numbers; I just chose them as they were small enough that I could produce a 4D result (for test_2d_lazy_stack, of shape (3, 2, 4, 5)) where no dimension length is repeated.
It's equivalent to pp field data of shape (4, 5).
Technically they are, as they're numbers that appear without much introduction! A comment on the line above saying "equivalent to pp field data of shape (4, 5)" would nicely sort this and improve the code's understandability, I think.
It's possibly better to think of the numbers as "test input", similar to when you create an array to test an array operation: that array is thought of as "test input data".
In these tests, I am creating a stack of val*np.ones((4, 5)) arrays, where the shape of each element in the stack is (4, 5). So there's nothing special about the numbers; I just need something to create an array, and then I reuse those numbers to check that I'm getting the right output shape.
My worry with the comment "equivalent to a pp field data of shape (4,5)" is that it describes just one application of multidim_lazy_stack.
I would consider renaming the variable, though. Would stack_element_dask_array_shape be clearer?
shape = (2,)
result = self._check(shape)

def test_2d_lazy_stack(self):
Is there any chance of a >2D test?
It seems a bit unnecessary? The 2D test is already testing the recursion.
Indeed, however part of testing is checking edge cases, and I don't know what would happen if I passed a 7D array to it. Evidently we don't need to test all possible dimensionalities (!), but it would be good to test something that's outside of the boundary of logical intent for the functionality being tested. This is just in case an unsafe assumption has been made about how the function will be used; "of course, no-one would ever use this for more than a 2D input".
From a mockist / white-box-y point of view, the existing tests do already cover all 3 code branches.
The implementation relies on iteration over the passed stack to deconstruct the input one dimension at a time, but 2d is already checking that.
One more would do no harm, I suppose, but I don't really expect it to go wrong!
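If a >2D case were added, it could reuse the _check helper from the existing tests; this test name and shape are hypothetical:

def test_3d_lazy_stack(self):
    # A 3-D stack of (4, 5) elements should yield a 5-D result of
    # shape (6, 3, 2, 4, 5); no dimension length repeats, so a
    # mis-ordered stacking would show up in the shape check.
    shape = (6, 3, 2)
    result = self._check(shape)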
self.vector_dims_shape + data_arrays.shape[1:])
self._data_cache = multidim_daskstack(data_arrays)
stack = np.empty(np.prod(self.vector_dims_shape), 'object')
for index, f in enumerate(self.fields):
Haha, I so would have gone for "i, field" here...
Good point. I used f as that was what was used in the previous code. I will change it to
for index, field in ...
self._data_cache = [da.stack(self._data_cache[i:i+size]) for i
                    in range(0, len(self._data_cache), size)]
self._data_cache, = self._data_cache
stack = np.empty(np.prod(self.vector_dims_shape), 'object')
I think it would be neater + clearer to iterate with a multidimensional index here, thus avoiding the np.prod(shape) and the subsequent stack.reshape.
Your magic friend for that is np.ndindex(shape), which gives you indices to all elements of a multidimensional array, as used here for example.
I think that gets you something like...
stack = np.empty(self.vector_dims_shape, 'object')
for field, nd_index in zip(fields, np.ndindex(self.vector_dims_shape)):
    stack[nd_index] = as_lazy_data(field._data, chunks=field._data.shape)
self._data_cache = multidim_lazy_stack(stack)
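As a quick standalone illustration of what np.ndindex yields (nothing here is Iris-specific):

import numpy as np

# np.ndindex(shape) yields every index tuple of that shape in C order,
# so a multidimensional object array can be filled directly, with no
# flat indexing and no final reshape.
shape = (2, 3)
stack = np.empty(shape, dtype=object)
for nd_index in np.ndindex(shape):
    stack[nd_index] = nd_index
print(stack)
# [[(0, 0) (0, 1) (0, 2)]
#  [(1, 0) (1, 1) (1, 2)]]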
for index, val in np.ndenumerate(vals):
    stack[index] = as_lazy_data(val*np.ones(stack_element_shape))
result = multidim_lazy_stack(stack)
self.assertEqual(result.shape, stack_shape + stack_element_shape)
I wonder if we could also check the actual values, to ensure that the ordering of elements is definitely correct?
I think all that is needed is something like...
expected = np.empty(list(stack_shape) + list(stack_element_shape), dtype=int)
for index, val in np.ndenumerate(vals):
    stack[index] = as_lazy_data(val*np.ones(stack_element_shape))
    expected[index] = val
...
self.assertArrayAllClose(result.compute(), expected)
> ensure that the ordering of elements is definitely correct
Surely that's already achieved by using dimensions of different lengths?
Not that I'm against adding extra assurance.
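Pulled together, the suggested value check might look like the following self-contained helper — the function name is hypothetical, da.from_array stands in for the as_lazy_data helper, and multidim_lazy_stack is the PR's function:

import dask.array as da
import numpy as np

def check_stack_values(stack_shape, stack_element_shape=(4, 5)):
    # Build the lazy stack and, in the same loop, the plain NumPy
    # array that the stacked result should compute to.
    vals = np.arange(np.prod(stack_shape)).reshape(stack_shape)
    stack = np.empty(stack_shape, dtype=object)
    expected = np.empty(stack_shape + stack_element_shape, dtype=int)
    for index, val in np.ndenumerate(vals):
        stack[index] = da.from_array(val * np.ones(stack_element_shape),
                                     chunks=stack_element_shape)
        expected[index] = val
    result = multidim_lazy_stack(stack)
    # Check the shape and the element values, so that the ordering of
    # the stacked elements is verified too.
    assert result.shape == stack_shape + stack_element_shape
    np.testing.assert_allclose(result.compute(), expected)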
815fe26 to 34f810d
@dkillick @pp-mo I have made the final changes based on your reviews, and tests pass... Merge? (Please squash and merge)
Use multidimensional dask stacks in merge and fast UM loading.
This reuses the code that recursively builds a multidimensional stacked dask array in two places: cube merging and fast UM loading.
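For readers without the diff to hand, the recursive stacking idea is roughly as follows — a sketch consistent with the three code branches mentioned in review, not the PR's verbatim implementation:

import dask.array as da
import numpy as np

def multidim_lazy_stack(stack):
    # Recursively combine an ndarray of dask arrays into a single dask
    # array whose shape is the stack's shape followed by each
    # element's shape.
    if stack.ndim == 0:
        # 0-d object array: unwrap the single dask array.
        result = stack.item()
    elif stack.ndim == 1:
        # Base case: stack along one new leading dimension.
        result = da.stack(list(stack))
    else:
        # Recurse over the leading dimension, collapsing the trailing
        # stack dimensions of each sub-array first.
        result = da.stack([multidim_lazy_stack(subarray)
                           for subarray in stack])
    return result

With this shape of function, a (3, 2) stack of (4, 5)-shaped dask arrays stacks lazily to a single dask array of shape (3, 2, 4, 5), as exercised by the tests discussed above.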