Reinstate as_lazy_data
#2421
Conversation
Force-pushed from 340c696 to a0ddf14
import iris.exceptions
import iris.std_names
import iris.util
from iris._lazy_data import as_lazy_data
Does it matter what order these are in? I just noticed that in some of the other files the private iris modules are imported first.
Ah, I put this here because I'd seen examples where the private modules were imported last 😉
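For reference, a minimal sketch of one conventional ordering, assuming PEP 8's grouping of standard-library, then third-party, then local imports (ordering within the local group is a project style choice, not a PEP 8 rule):

import dask.array as da
import numpy as np

import iris.exceptions
import iris.std_names
import iris.util
from iris._lazy_data import as_lazy_data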
-lazy_values = np.arange(30).reshape((2, 5, 3))
-lazy_array = da.from_array(lazy_values, 1e6)
+values = np.arange(30).reshape((2, 5, 3))
+lazy_array = da.from_array(values, _MAX_CHUNK_SIZE)
Minor, but it might be worth using the kwarg, i.e.
lazy_array = da.from_array(values, chunks=_MAX_CHUNK_SIZE)
to be consistent with the other places da.from_array is called.
Agreed.
lib/iris/_merge.py
Outdated
import numpy.ma as ma

-from iris._lazy_data import is_lazy_data, array_masked_to_nans
+from iris._lazy_data import array_masked_to_nans, as_lazy_data, is_lazy_data
I think PEP 8 is complaining about the trailing whitespace here.
-data = array_masked_to_nans(data)
-data = data.data
-data = da.from_array(data, chunks=data.shape)
+data = as_lazy_data(data, chunks=data.shape)
The lines doing the mask-to-NaNs conversion can now be removed, as it's done inside as_lazy_data.
Good spot!
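A minimal sketch of what the mask-to-NaNs step might look like, assuming array_masked_to_nans simply fills masked points with NaN (the real Iris helper may differ in detail):

import numpy as np
import numpy.ma as ma

def array_masked_to_nans(array):
    # Fill masked points with NaN so dask can wrap a plain ndarray.
    # Integer data must be promoted to float first, since NaN is a float.
    if ma.isMaskedArray(array):
        return array.astype(float).filled(np.nan)
    return array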
cube = istk.simple_4d_with_hybrid_height()
# Put a biggus array on the cube so we can test deferred loading.
-cube.data = da.from_array(cube.data, chunks=cube.data.shape)
+cube.data = as_lazy_data(cube.data)
I'm interested in why you have removed the chunking here, and in other tests.
Is there a reason to use the 'magic chunking number' in these cases?
Primarily this has been done for consistency, but there are also performance improvements to be had from setting the chunk size to the magic chunk size.
I'm not against this change, but I would like to understand a little about its motivation. It appears at first glance that calls to as_lazy_data simply wrap da.from_array; the only difference I can see is the default chunk size. Is this the benefit which is being targeted? Additionally, it seems a shame to me to reintroduce the magic chunk size; I think a better chunking approach is required than this. None of this is a barrier to adoption; I am an interested watcher in this case, keen to comprehend.
@marqh we have identified a number of benefits with reintroducing as_lazy_data; see the discussion below.
#
# You should have received a copy of the GNU Lesser General Public License
# along with Iris. If not, see <http://www.gnu.org/licenses/>.
"""Test :meth:`iris._lazy_data.as_lazy_data` method."""
"""Test function :func:`iris._lazy data.as_lazy_data`."""
I also just ran into this in the tests I'm writing. It looks the other two unit tests for iris._lazy_data also have this mistake.
As in, it should have a :func: decorator?
Yep, as it's a function not a method.
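To be precise, :func: and :meth: are Sphinx cross-reference roles rather than decorators; for illustration (the Cube method path is shown only as an example):

"""Test function :func:`iris._lazy_data.as_lazy_data`."""  # a module-level function
"""Test :meth:`iris.cube.Cube.lazy_data` method."""  # a method on a class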
So, I don't see the benefit of this in principle. I think Iris using dask is beneficial, and Iris should be clear to make calls to dask where useful; I don't think that, in principle, as little direct interaction as possible is a good thing.
This is far from clear to me. Dask does not provide a magic chunk size; it requests that one is provided based on context. I think it is worth considering whether Iris should respect dask's API in this case and not provide a 'default value'. It looks in this PR like the only places this is used are in testing, and that all of the functional code supplies its own chunk size. If this is not the case and I'm missing something, I'd be interested to see it. With this in mind, I wonder whether not including the magic number in this PR, and looking into it as a follow-on activity, might be a better fit than introducing it here as part of a non-functional code refactor, for the purposes discussed above.
I spoke to @rhattersley about why he chose this number for biggus. What it comes down to is that chunking the data (and, in dask, setting up the graph) commands a not-insignificant memory/time overhead. The performance tests for biggus showed that setting the chunks too small, as was being done here, ruins performance, as any improvement gained through chunking is destroyed by the cost of setting up the chunking. Conversely, setting the chunking too large also caused a (less significant) performance impact, due to the chunks becoming larger and slower to process. The value selected sits just to the right of peak chunking performance in biggus, and I am asserting that this is true in dask too, not least because the dask docs say (see FAQ 3) that chunks are best set to be between 10MB and 100MB in size, which this number handily is. I'm going to do some performance (timing) tests on this though, as I'm interested to see how performance varies by chunk size. One more thing to note is that you can always override the default by passing your own chunks value.
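As a back-of-envelope sketch of the arithmetic (my own illustration, not from the PR):

import dask.array as da
import numpy as np

# The biggus-derived default counts *elements*, not bytes.
_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2  # 16,777,216 elements

# One full chunk is 16,777,216 * 4 bytes = 64 MiB for float32,
# or 128 MiB for float64 - around dask's suggested 10-100 MB sweet spot.
data = np.zeros(2 * _MAX_CHUNK_SIZE, dtype=np.float32)  # 128 MiB array
lazy = da.from_array(data, chunks=_MAX_CHUNK_SIZE)
print(lazy.chunks)  # ((16777216, 16777216),): two 64 MiB chunks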
Not true:
This will be much easier to do if we follow the good software design pattern reinstated here of having a single, consistent point of entry for creating dask arrays for Iris cubes, rather than having duplicated code scattered all across the Iris codebase.
Please may you point me to where this occurs?
Well, there's all the tests - they now use the default chunk size. Cube also uses the default chunk size. The other non-test calls to as_lazy_data supply their own chunk size.
* chunks:
    Describes how the created dask array should be split up. Defaults to a
    value first defined in biggus (being `8 * 1024 * 1024 * 2`).
I do wonder whether the references to biggus here and on line 43 are necessary. I want to be able to read this code without needing to also know the history of Iris lazy data handling.
Not that this is a blocker to this PR.
Yeah, I guess this is a temporary note for us to describe the background to choosing this number in particular...
lib/iris/_lazy_data.py
Outdated
| """ | ||
| if not is_lazy_data(data): | ||
| if isinstance(data, np.ma.MaskedArray): |
Is it worth importing np.ma as well as np? I.e.
import numpy as np
import numpy.ma as ma
Probably!
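Pulling together the pieces visible in this review, a self-contained sketch of the entry point under discussion (my reconstruction from the review comments, not the merged Iris code):

import dask.array as da
import numpy as np
import numpy.ma as ma

# Default chunk size carried over from biggus, as documented above.
_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2

def is_lazy_data(data):
    # Assumption: 'lazy' here simply means 'already a dask array'.
    return isinstance(data, da.Array)

def as_lazy_data(data, chunks=_MAX_CHUNK_SIZE):
    # Single point of entry to da.from_array for Iris, per the discussion.
    if not is_lazy_data(data):
        if isinstance(data, ma.MaskedArray):
            # Masked points become NaNs, as noted earlier in the review.
            data = data.astype(float).filled(np.nan)
        data = da.from_array(data, chunks=chunks)
    return data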
Force-pushed from 40baf68 to d5d374a
Thanks @lbdreyer!
Agreed. Solid API first, performance improvements thereafter.
* Reinstate func and uses in Iris code
* Tests use new func, tests for new func
* lazy data func handles masked arrays
* Review actions
* Remove missed spurious chunk sizes
* Review action: reference a function as a function
The return of the as_lazy_data function, which acts as Iris' interface to da.from_array. I've added tests for the function and changed the Iris code and tests so that it always uses the new function and does not call da.from_array directly.
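A hypothetical usage example, assuming the import path shown in the diffs above and the mask-to-NaNs behaviour discussed in the review:

import numpy as np
import numpy.ma as ma
from iris._lazy_data import as_lazy_data

masked = ma.masked_array(np.arange(6.0).reshape(2, 3),
                         mask=[[0, 1, 0], [1, 0, 0]])
lazy = as_lazy_data(masked)       # default 'magic' chunk size applies
print(lazy.chunks)
print(np.isnan(lazy.compute()))   # masked points realised as NaNs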