Conversation

@DPeterK (Member) commented Mar 8, 2017

The return of the as_lazy_data function, which acts as Iris' interface to da.from_array. I've added tests for the function and changed the Iris code and tests so that it always uses the new function and does not call da.from_array directly.
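
For orientation, a minimal sketch of the kind of wrapper involved (the real implementation lives in iris._lazy_data; the helper bodies below are illustrative assumptions based on this thread, not the exact Iris code):

import dask.array as da
import numpy as np
import numpy.ma as ma

# A magic value, borrowed from biggus.
_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2


def is_lazy_data(data):
    # Sketch: treat any dask array as already lazy.
    return isinstance(data, da.Array)


def array_masked_to_nans(array):
    # Sketch: fill masked points with NaN (assumes a floating-point dtype).
    return array.filled(np.nan)


def as_lazy_data(data, chunks=_MAX_CHUNK_SIZE):
    # Single entry point for wrapping real data as a dask array:
    # masks are always converted to NaNs, and a consistent default
    # chunking is applied, so callers never touch da.from_array directly.
    if not is_lazy_data(data):
        if isinstance(data, ma.MaskedArray):
            data = array_masked_to_nans(data)
        data = da.from_array(data, chunks=chunks)
    return data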

@DPeterK DPeterK force-pushed the reinstate-as_lazy_data branch from 340c696 to a0ddf14 Compare March 8, 2017 14:38
@DPeterK (Member, Author) commented Mar 8, 2017

ping @bjlittle @lbdreyer @pp-mo

@DPeterK DPeterK added this to the dask milestone Mar 8, 2017
@DPeterK DPeterK added the dask label Mar 8, 2017
import iris.exceptions
import iris.std_names
import iris.util
from iris._lazy_data import as_lazy_data
@lbdreyer (Member) Mar 8, 2017

Does it matter what order these are in? I just noticed that in some of the other files the private iris modules are imported first.

@DPeterK (Member, Author)

Ah, I put this here because I'd seen examples where the private modules were imported last 😉

lazy_values = np.arange(30).reshape((2, 5, 3))
lazy_array = da.from_array(lazy_values, 1e6)
values = np.arange(30).reshape((2, 5, 3))
lazy_array = da.from_array(values, _MAX_CHUNK_SIZE)
Member

Minor, but it might be worth using the kwarg i.e.
lazy_array = da.from_array(values, chunks=_MAX_CHUNK_SIZE)
to be consistent with the other times da.from_array is called.

@DPeterK (Member, Author)

Agreed.

import numpy.ma as ma

from iris._lazy_data import is_lazy_data, array_masked_to_nans
from iris._lazy_data import array_masked_to_nans, as_lazy_data, is_lazy_data
Member

I think pep8 is complaining about this trailing whitespace

data = array_masked_to_nans(data)
data = data.data
data = da.from_array(data, chunks=data.shape)
data = as_lazy_data(data, chunks=data.shape)
@lbdreyer (Member) Mar 8, 2017

The lines doing the mask->NaNs conversion can now be removed, as it's being done in as_lazy_data.

@DPeterK (Member, Author)

Good spot!

cube = istk.simple_4d_with_hybrid_height()
# Put a biggus array on the cube so we can test deferred loading.
cube.data = da.from_array(cube.data, chunks=cube.data.shape)
cube.data = as_lazy_data(cube.data)
Member

I'm interested in why you have removed the chunking here, and in other tests.

Is there a reason to use the 'magic chunking number' in these cases?

@DPeterK (Member, Author)

Primarily this has been done for consistency, but there are also performance improvements to be had from setting the chunk size to the magic chunk size.

@marqh (Member) commented Mar 8, 2017

I'm not against this change. I would like to understand a little about its motivation. It appears at first glance that calls to
da.from_array()
are replaced by
iris._lazy_data.as_lazy_data()
calls with a very similar call signature

the only difference I can see is that
iris._lazy_data.as_lazy_data()
always calls
iris._lazy_data.array_masked_to_nans()
whereas previously this had to be called explicitly if required

is this the benefit which is being targeted?
Is there extra benefit which is being targeted which I am not observing?

Additionally, it seems a shame to reintroduce

# A magic value, borrowed from biggus
MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2

To me, I think a better chunking approach than this is required.

None of this is a barrier to adoption; I am an interested watcher in this case, keen to comprehend.

@DPeterK (Member, Author) commented Mar 8, 2017

@marqh we have identified the following benefits of reintroducing as_lazy_data:

  • factoring out repeated code: though there's not a lot of code being factored, if we wanted to change the way that Iris interfaces with dask, this means we need only change one piece of code
  • similarly we get to define a single common interface for what to do when constructing a lazy array: we always want to ensure there are no masks present when setting up a lazy array and we want to have a consistent approach to chunking; these are both most efficiently done by having this common interface.
  • disentangling Iris and dask code: I think it would be good to have as little direct Iris interaction with dask as possible, not least for reasons of factoring common code
  • reintroducing the biggus magic number is an improvement to chunking. The arbitrary scattergun approach to chunking that existed before was naive at best and only likely to negatively impact dask performance. This magic number was actually carefully selected to maximise the performance improvement of chunking while minimising the overhead of setting up chunking, so reintroducing it is a very positive step. Certainly further improvements to how chunks are set up are necessary, but at the moment there is no driver to implement anything here (although having a single function to set up lazy arrays will greatly ease the introduction of it).

#
# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
"""Test :meth:`iris._lazy data.as_lazy_data` method."""
Member

"""Test function :func:`iris._lazy data.as_lazy_data`."""

I also just ran into this in the tests I'm writing. It looks the other two unit tests for iris._lazy_data also have this mistake.

@DPeterK (Member, Author)

As in, it should have a :func: decorator?

Member

Yep, as it's a function not a method.

@marqh (Member) commented Mar 9, 2017

so

always calls
iris._lazy_data.array_masked_to_nans()
whereas previously this had to be called explicitly if required
seems a fair assessment of benefit. I can see that this is deemed useful.

disentangling Iris and dask code: I think it would be good to have as little direct Iris interaction with dask as possible, not least for reasons of factoring common code

I don't see the benefit of this in principle. I think Iris using dask is beneficial and Iris should be clear to make calls to dask where useful; I don't think that, in principle, as little direct interaction as possible is a good thing.
In this case it can be seen as a practical step, to handle the workaround for missing data, but this isn't an approach that I would advocate across the board.

@marqh (Member) commented Mar 9, 2017

reintroducing the biggus magic number is an improvement to chunking. The arbitrary scattergun approach to chunking that existed before was naive at best and only likely to negatively impact dask performance. This magic number was actually carefully selected to maximise the performance improvement of chunking while minimising the overhead of setting up chunking, so reintroducing it is a very positive step. Certainly further improvements to how chunks are set up are necessary, but at the moment there is no driver to implement anything here (although having a single function to set up lazy arrays will greatly ease the introduction of it).

This is far from clear to me.
How is this magic number an improvement to chunking, please?
Is this a performance benefit? A memory overhead benefit? Have we measured these somewhere that I have not seen?
How is 16777216 carefully selected to deliver performance?

Dask does not provide a magic chunk size; it requests that one be provided based on context. I think it is worth considering whether Iris should respect Dask's API in this case and not provide a 'default value'.

It looks in this PR like the only place this is used is in testing, and that all of the functional code supplies its own chunk size. If this is not the case and I'm missing something, I'd be interested to see it.

With this in mind, I wonder whether not including the magic number in this PR, and looking into it as a follow-on activity, might be better than introducing it here as part of a non-functional code refactor for the purposes discussed above.

@DPeterK (Member, Author) commented Mar 9, 2017

How is 16777216 carefully selected to deliver performance?

I spoke to @rhattersley about why he chose this number for biggus. What it comes down to is that chunking the data (and in dask, setting up the graph) commands a not insignificant memory/time overhead. The performance tests for biggus showed that setting the chunks too small, as was being done here, ruins performance as any improvement gained through chunking is destroyed through setting up the chunking. Conversely, setting the chunking too large also caused a (less significant) performance impact due to the chunks becoming larger and slower to process.

The value selected is set to be just to the right of peak chunking performance in biggus and I am asserting that this is true in dask too, not least because the dask docs say (see FAQ 3) that chunks are best set to be between 10MB and 100MB in size, which this number handily is.
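
(For the arithmetic: 8 * 1024 * 1024 * 2 = 16777216, which read as bytes is exactly 16 MiB - assuming a bytes interpretation, as the 16MB figure below suggests - and so sits inside the 10MB-100MB range quoted from the dask FAQ.)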

I'm going to do some performance (timing) testing on this though, as I'm interested to see how performance varies by chunk size.

One more thing to note is that you can rechunk dask arrays. The dask docs suggest that a lot of the performance improvements in chunking come from setting the correct chunk size for the operation. I think that at a later date we should look into rechunking lazy data cubes based upon the operation we're about to perform on them. It is still the case, though, that setting a default max chunk size of 16MB will make a noticeable performance improvement over the naive and inconsistent chunk sizes that I have replaced.
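
For reference, rechunking is already available on dask arrays; a quick sketch (the array and chunk sizes here are arbitrary, for illustration only):

import dask.array as da
import numpy as np

# Build a lazy array with small chunks, then rechunk it to suit a
# different access pattern (e.g. whole-row operations).
arr = da.from_array(np.arange(10000).reshape(100, 100), chunks=(10, 10))
rechunked = arr.rechunk((100, 25))

print(arr.chunks)        # ((10, 10, ..., 10), (10, 10, ..., 10))
print(rechunked.chunks)  # ((100,), (25, 25, 25, 25))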

@DPeterK (Member, Author) commented Mar 9, 2017

the only places this is used is in testing

Not true:

  • almost all operations now use the default chunk size as this will be better in the general case.
  • the fileformats interface specifies a chunk size that is the shape of the 2D field coming in.
  • in general usage, especially when we have users supplying large datasets, having a default chunk size set to a sensible value is a better approach.

@DPeterK (Member, Author) commented Mar 9, 2017

looking into it as a follow on activity

This will be much easier to do if we have followed the good software design pattern reintroduced here of having a single, consistent point of entry for creating dask arrays for Iris cubes, rather than having duplicated code scattered all across the Iris codebase.

@marqh (Member) commented Mar 9, 2017

almost all operations now use the default chunk size as this will be better in the general case.

Please may you point me to where this occurs? All of the calls to iris._lazy_data.as_lazy_data() that I can find on this branch explicitly pass in a chunk size.

@DPeterK (Member, Author) commented Mar 9, 2017

Well, there are all the tests - they now use the default chunk size. Cube also uses the default chunk size. The other non-test calls to as_lazy_data are all made by code that's about to stack up multiple 2D fields.
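
For context, the stacking pattern referred to here looks roughly like this (a sketch; the shapes and variable names are illustrative, not the actual Iris loader code):

import dask.array as da
import numpy as np

# Each incoming 2D field becomes one lazy array, chunked as a whole field...
fields = [np.random.rand(180, 360) for _ in range(12)]
lazy_fields = [da.from_array(field, chunks=field.shape) for field in fields]

# ...and the fields are then stacked into a single 3D lazy array.
lazy_data = da.stack(lazy_fields, axis=0)
print(lazy_data.shape)   # (12, 180, 360)
print(lazy_data.chunks)  # ((1, 1, ..., 1), (180,), (360,))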

* chunks:
Describes how the created dask array should be split up. Defaults to a
value first defined in biggus (being `8 * 1024 * 1024 * 2`).
Member

I do wonder whether the references to biggus here and on line 43 are necessary. I want to be able to read this code without needing to also know the history of Iris lazy data handling.
Not that this is a blocker to this PR.

@DPeterK (Member, Author)

Yeah, I guess this is a temporary note for us to describe the background to choosing this number in particular...

"""
if not is_lazy_data(data):
if isinstance(data, np.ma.MaskedArray):
Member

Is it worth importing np.ma rather than just np?

i.e.

import numpy as np
import numpy.ma as ma

@DPeterK (Member, Author)

Probably!
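
For what it's worth, the suggested import style in action (a self-contained sketch of the conversion under discussion; assumes floating-point data, since integer arrays cannot hold NaN):

import numpy as np
import numpy.ma as ma

data = ma.masked_array([1.0, 2.0, 4.0], mask=[False, True, False])

# With the ma alias, the type check reads cleanly:
if isinstance(data, ma.MaskedArray):
    data = data.filled(np.nan)  # masked points become NaN

print(data)  # [ 1. nan  4.]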

@DPeterK DPeterK force-pushed the reinstate-as_lazy_data branch from 40baf68 to d5d374a Compare March 9, 2017 15:55
@lbdreyer (Member) commented Mar 9, 2017

I'm gonna merge this in as I agree with the principle.

(@bjlittle @pp-mo I would still encourage you to have a look over this.)

We will have to look into optimising chunking at a later stage.

@lbdreyer lbdreyer merged commit 8435ae2 into SciTools:dask Mar 9, 2017
@DPeterK (Member, Author) commented Mar 9, 2017

Thanks @lbdreyer!

@DPeterK DPeterK deleted the reinstate-as_lazy_data branch March 9, 2017 16:09
@DPeterK (Member, Author) commented Mar 9, 2017

We will have to look into optimising chunking at a later stage.

Agreed. Solid API first, performance improvements thereafter.

bjlittle pushed a commit to bjlittle/iris that referenced this pull request May 31, 2017
* Reinstate func and uses in Iris code

* Tests use new func, tests for new func

* lazy data func handles masked arrays

* Review actions

* Remove missed spurious chunk sizes

* Review action: reference a function as a function
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017