Conversation

@DPeterK (Member) commented Mar 8, 2017

The return of the as_lazy_data function, which acts as Iris' interface to da.from_array. I've added tests for the function and changed the Iris code and tests so that it always uses the new function and does not call da.from_array directly.
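
For orientation, a minimal sketch of the kind of wrapper involved (the real implementation lives in iris._lazy_data; the helper bodies below are illustrative assumptions based on this thread, not the exact Iris code):

import dask.array as da
import numpy as np
import numpy.ma as ma

# A magic value, borrowed from biggus.
_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2


def is_lazy_data(data):
    # Sketch: treat any dask array as already lazy.
    return isinstance(data, da.Array)


def array_masked_to_nans(array):
    # Sketch: fill masked points with NaN (assumes a floating-point dtype).
    return array.filled(np.nan)


def as_lazy_data(data, chunks=_MAX_CHUNK_SIZE):
    # Single entry point for wrapping real data as a dask array:
    # masks are always converted to NaNs, and a consistent default
    # chunking is applied, so callers never touch da.from_array directly.
    if not is_lazy_data(data):
        if isinstance(data, ma.MaskedArray):
            data = array_masked_to_nans(data)
        data = da.from_array(data, chunks=chunks)
    return data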

@DPeterK DPeterK force-pushed the reinstate-as_lazy_data branch from 340c696 to a0ddf14 Compare March 8, 2017 14:38
@DPeterK (Member, Author) commented Mar 8, 2017

ping @bjlittle @lbdreyer @pp-mo

@DPeterK DPeterK added this to the dask milestone Mar 8, 2017
@DPeterK DPeterK added the dask label Mar 8, 2017
import iris.exceptions
import iris.std_names
import iris.util
from iris._lazy_data import as_lazy_data
@lbdreyer (Member) Mar 8, 2017

Does it matter what order these are in? I just noticed that in some of the other files the private iris modules are imported first.

@DPeterK (Member, Author)

Ah, I put this here because I'd seen examples where the private modules were imported last 😉

lazy_values = np.arange(30).reshape((2, 5, 3))
lazy_array = da.from_array(lazy_values, 1e6)
values = np.arange(30).reshape((2, 5, 3))
lazy_array = da.from_array(values, _MAX_CHUNK_SIZE)
Member

Minor, but it might be worth using the kwarg i.e.
lazy_array = da.from_array(values, chunks=_MAX_CHUNK_SIZE)
to be consistent with the other times da.from_array is called.

@DPeterK (Member, Author)

Agreed.

import numpy.ma as ma

from iris._lazy_data import is_lazy_data, array_masked_to_nans
from iris._lazy_data import array_masked_to_nans, as_lazy_data, is_lazy_data
Member

I think pep8 is complaining about this trailing whitespace

data = array_masked_to_nans(data)
data = data.data
data = da.from_array(data, chunks=data.shape)
data = as_lazy_data(data, chunks=data.shape)
@lbdreyer (Member) Mar 8, 2017

The lines doing the mask->NaNs conversion can now be removed, as it's being done in as_lazy_data.

@DPeterK (Member, Author)

Good spot!

cube = istk.simple_4d_with_hybrid_height()
# Put a biggus array on the cube so we can test deferred loading.
cube.data = da.from_array(cube.data, chunks=cube.data.shape)
cube.data = as_lazy_data(cube.data)
Member

I'm interested in why you have removed the chunking here, and in other tests.

Is there a reason to use the 'magic chunking number' in these cases?

@DPeterK (Member, Author)

Primarily this has been done for consistency, but there are also performance improvements to be had from setting the chunk size to the magic chunk size.

@marqh (Member) commented Mar 8, 2017

I'm not against this change. I would like to understand a little about its motivation. It appears at first glance that calls to
da.from_array()
are replaced by
iris._lazy_data.as_lazy_data()
calls with a very similar call signature

the only difference I can see is that
iris._lazy_data.as_lazy_data()
always calls
iris._lazy_data.array_masked_to_nans()
whereas previously this had to be called explicitly if required

is this the benefit which is being targeted?
Is there extra benefit which is being targeted which I am not observing?

Additionally, it seems a shame to reintroduce

# A magic value, borrowed from biggus
MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2

To me, I think a better chunking approach than this is required.

None of this is a barrier to adoption; I am an interested watcher in this case, keen to comprehend.

@DPeterK (Member, Author) commented Mar 8, 2017

@marqh we have identified the following benefits of reintroducing as_lazy_data:

  • factoring out repeated code: though there's not a lot of code being factored, if we wanted to change the way that Iris interfaces with dask, this means we need only change one piece of code
  • similarly we get to define a single common interface for what to do when constructing a lazy array: we always want to ensure there are no masks present when setting up a lazy array and we want to have a consistent approach to chunking; these are both most efficiently done by having this common interface.
  • disentangling Iris and dask code: I think it would be good to have as little direct Iris interaction with dask as possible, not least for reasons of factoring common code
  • reintroducing the biggus magic number is an improvement to chunking. The arbitrary scattergun approach to chunking that existed before was naive at best and only likely to negatively impact dask performance. This magic number was actually carefully selected to maximise the performance improvement of chunking while minimising the overhead of setting up chunking, so reintroducing it is a very positive step. Certainly further improvements to how chunks are set up are necessary, but at the moment there is no driver to implement anything here (although having a single function to set up lazy arrays will greatly ease the introduction of it).

#
# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
"""Test :meth:`iris._lazy data.as_lazy_data` method."""
Member

"""Test function :func:`iris._lazy data.as_lazy_data`."""

I also just ran into this in the tests I'm writing. It looks the other two unit tests for iris._lazy_data also have this mistake.

@DPeterK (Member, Author)

As in, it should have a :func: decorator?

Member

Yep, as it's a function not a method.

@marqh (Member) commented Mar 9, 2017

so

always calls
iris._lazy_data.array_masked_to_nans()
whereas previously this had to be called explicitly if required
seems a fair assessment of benefit. I can see that this is deemed useful.

disentangling Iris and dask code: I think it would be good to have as little direct Iris interaction with dask as possible, not least for reasons of factoring common code

I don't see the benefit of this in principle. I think Iris using dask is beneficial and Iris should be clear to make calls to dask where useful; I don't think that, in principle, as little direct interaction as possible is a good thing.
In this case it can be seen as a practical step, to handle the workaround for missing data, but this isn't an approach that I would advocate across the board.

@marqh (Member) commented Mar 9, 2017

reintroducing the biggus magic number is an improvement to chunking. The arbitrary scattergun approach to chunking that existed before was naive at best and only likely to negatively impact dask performance. This magic number was actually carefully selected to maximise the performance improvement of chunking while minimising the overhead of setting up chunking, so reintroducing it is a very positive step. Certainly further improvements to how chunks are set up are necessary, but at the moment there is no driver to implement anything here (although having a single function to set up lazy arrays will greatly ease the introduction of it).

This is far from clear to me.
How is this magic number an improvement to chunking, please?
Is this a performance benefit? A memory overhead benefit? Have we measured these somewhere that I have not seen?
How is 16777216 carefully selected to deliver performance?

Dask does not provide a magic chunk size; it requests that one be provided based on context. I think it is worth considering whether Iris should respect Dask's API in this case and not provide a 'default value'.

It looks in this PR like the only place this is used is in testing, and that all of the functional code supplies its own chunk size. If this is not the case and I'm missing something, I'd be interested to see it.

With this in mind, I wonder whether not including the magic number in this PR, and looking into it as a follow-on activity, might be better than introducing it here as part of a non-functional code refactor for the purposes discussed above.

@DPeterK (Member, Author) commented Mar 9, 2017

How is 16777216 carefully selected to deliver performance?

I spoke to @rhattersley about why he chose this number for biggus. What it comes down to is that chunking the data (and in dask, setting up the graph) commands a not insignificant memory/time overhead. The performance tests for biggus showed that setting the chunks too small, as was being done here, ruins performance as any improvement gained through chunking is destroyed through setting up the chunking. Conversely, setting the chunking too large also caused a (less significant) performance impact due to the chunks becoming larger and slower to process.

The value selected is set to be just to the right of peak chunking performance in biggus and I am asserting that this is true in dask too, not least because the dask docs say (see FAQ 3) that chunks are best set to be between 10MB and 100MB in size, which this number handily is.
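
(For the arithmetic: 8 * 1024 * 1024 * 2 = 16777216, which read as bytes is exactly 16 MiB - assuming a bytes interpretation, as the 16MB figure below suggests - and so sits inside the 10MB-100MB range quoted from the dask FAQ.)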

I'm going to do some performance (timing) testing on this though, as I'm interested to see how performance varies by chunk size.

One more thing to note is that you can rechunk dask arrays. The dask docs suggest that a lot of the performance improvements in chunking come from setting the correct chunk size for the operation. I think that at a later date we should look into rechunking lazy data cubes based upon the operation we're about to perform on them. It is still the case, though, that setting a default max chunk size of 16MB will make a noticeable performance improvement over the naive and inconsistent chunk sizes that I have replaced.
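
For reference, rechunking is already available on dask arrays; a quick sketch (the array and chunk sizes here are arbitrary, for illustration only):

import dask.array as da
import numpy as np

# Build a lazy array with small chunks, then rechunk it to suit a
# different access pattern (e.g. whole-row operations).
arr = da.from_array(np.arange(10000).reshape(100, 100), chunks=(10, 10))
rechunked = arr.rechunk((100, 25))

print(arr.chunks)        # ((10, 10, ..., 10), (10, 10, ..., 10))
print(rechunked.chunks)  # ((100,), (25, 25, 25, 25))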

@DPeterK (Member, Author) commented Mar 9, 2017

the only places this is used is in testing

Not true:

  • almost all operations now use the default chunk size as this will be better in the general case.
  • the fileformats interface specifies a chunk size that is the shape of the 2D field coming in.
  • in general usage, especially when we have users supplying large datasets, having a default chunk size set to a sensible value is a better approach.

@DPeterK (Member, Author) commented Mar 9, 2017

looking into it as a follow on activity

This will be much easier to do if we have followed the good software design pattern reintroduced here of having a single, consistent point of entry for creating dask arrays for Iris cubes, rather than having duplicated code scattered all across the Iris codebase.

@marqh (Member) commented Mar 9, 2017

almost all operations now use the default chunk size as this will be better in the general case.

Please may you point me to where this occurs? All of the calls to iris._lazy_data.as_lazy_data() that I can find on this branch explicitly pass in a chunk size.

@DPeterK (Member, Author) commented Mar 9, 2017

Well, there are all the tests - they now use the default chunk size. Cube also uses the default chunk size. The other non-test calls to as_lazy_data are all made by code that's about to stack up multiple 2D fields.
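
For context, the stacking pattern referred to here looks roughly like this (a sketch; the shapes and variable names are illustrative, not the actual Iris loader code):

import dask.array as da
import numpy as np

# Each incoming 2D field becomes one lazy array, chunked as a whole field...
fields = [np.random.rand(180, 360) for _ in range(12)]
lazy_fields = [da.from_array(field, chunks=field.shape) for field in fields]

# ...and the fields are then stacked into a single 3D lazy array.
lazy_data = da.stack(lazy_fields, axis=0)
print(lazy_data.shape)   # (12, 180, 360)
print(lazy_data.chunks)  # ((1, 1, ..., 1), (180,), (360,))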

* chunks:
Describes how the created dask array should be split up. Defaults to a
value first defined in biggus (being `8 * 1024 * 1024 * 2`).
Member

I do wonder whether the references to biggus here and on line 43 are necessary. I want to be able to read this code without needing to also know the history of Iris lazy data handling.
Not that this is a blocker to this PR.

@DPeterK (Member, Author)

Yeah, I guess this is a temporary note for us to describe the background to choosing this number in particular...

"""
if not is_lazy_data(data):
if isinstance(data, np.ma.MaskedArray):
Member

Is it worth importing np.ma rather than just np?

i.e.

import numpy as np
import numpy.ma as ma

@DPeterK (Member, Author)

Probably!
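
For what it's worth, the suggested import style in action (a self-contained sketch of the conversion under discussion; assumes floating-point data, since integer arrays cannot hold NaN):

import numpy as np
import numpy.ma as ma

data = ma.masked_array([1.0, 2.0, 4.0], mask=[False, True, False])

# With the ma alias, the type check reads cleanly:
if isinstance(data, ma.MaskedArray):
    data = data.filled(np.nan)  # masked points become NaN

print(data)  # [ 1. nan  4.]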

@DPeterK DPeterK force-pushed the reinstate-as_lazy_data branch from 40baf68 to d5d374a Compare March 9, 2017 15:55
@lbdreyer (Member) commented Mar 9, 2017

I'm gonna merge this in as I agree with the principle.

(@bjlittle @pp-mo I would still encourage you to have a look over this.)

We will have to look into optimising chunking at a later stage.

@lbdreyer lbdreyer merged commit 8435ae2 into SciTools:dask Mar 9, 2017
@DPeterK (Member, Author) commented Mar 9, 2017

Thanks @lbdreyer!

@DPeterK DPeterK deleted the reinstate-as_lazy_data branch March 9, 2017 16:09
@DPeterK (Member, Author) commented Mar 9, 2017

We will have to look into optimising chunking at a later stage.

Agreed. Solid API first, performance improvements thereafter.

bjlittle pushed a commit to bjlittle/iris that referenced this pull request May 31, 2017
* Reinstate func and uses in Iris code

* Tests use new func, tests for new func

* lazy data func handles masked arrays

* Review actions

* Remove missed spurious chunk sizes

* Review action: reference a function as a function
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017