
Conversation

@trexfeathers (Contributor) commented Oct 31, 2022

🚀 Pull Request

Description

Replaces #5027.

Now updated from the original text below. Following helpful discussion (in the comments below) with @pp-mo, @rcomer and @DPeterK, we have decided to adopt Dask's new copy behaviour by adjusting our testing and announcing the change in a What's New entry. We expect minimal user disruption, since the affected behaviour is fairly niche.

Original text:

Possible solution to #5027. I've confirmed that this fixes the failing test - iris.tests.unit.cube.test_Cube.Test_copy.test__lazy():

def test__lazy(self):
    cube = Cube(as_lazy_data(np.array([1, 0])))
    self._check_copy(cube, cube.copy())

What is less certain is whether this is correct. Dask's philosophy is clearly that Dask gets to make the decisions about shared/different memory blocks, and that downstream code should not presume to control such things (see dask/dask#9555). Having a workaround is at best a deviation from the 'thin wrapper' philosophy, and at worst defers a change to Iris behaviour that we might have to adopt later anyway if future changes are made to the Dask internals I have used.

Discuss!


Consult Iris pull request check list

@pp-mo (Member) commented Oct 31, 2022

I think my view of this is that it is basically the Iris test that is wrong -- it asks too much.

I find that in older versions of dask, if ...

import numpy as np
import dask.array as da

m = np.array([1, 2])
da1 = da.from_array(m, chunks=-1)
da2 = da.from_array(m, chunks=-1)

then da1 is da2 is False (of course),
but da1.compute() is da2.compute() is True,
which makes sense to me -- they are different lazy arrays, but they both compute to the same original numpy object.

So, that is still true with latest dask.
What has changed is what deepcopy of a lazy array gives you.
So, in older dask, deepcopy(da1).compute() is da1.compute() gives False -- i.e. it interprets making a deepcopy of the lazy array as requiring a copy of the underlying content, even if that means copying a huge actual in-memory array.
Whereas in latest dask, deepcopy(da1).compute() is da1.compute() is True.
That actually makes perfect sense to me, since a "lazy" array is really more like a pointer to the original than a reference -- so why would copying the pointer "copy" the thing pointed to?
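For reference, here are those observations collected into one runnable sketch. The commented results are the ones reported in this comment for the dask versions under discussion; other versions may behave differently:

from copy import deepcopy

import dask.array as da
import numpy as np

m = np.array([1, 2])
da1 = da.from_array(m, chunks=-1)
da2 = da.from_array(m, chunks=-1)

# Different lazy arrays, but both compute back to the same numpy object.
print(da1 is da2)                      # False, in old and new dask alike
print(da1.compute() is da2.compute())  # True, in old and new dask alike

# The part that changed: whether deepcopy of the lazy array also copies the
# underlying in-memory content.
print(deepcopy(da1).compute() is da1.compute())
# older dask: False -- the deepcopy copied the underlying numpy array
# latest dask: True -- copying the "pointer" does not copy the thing pointed to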

So in my view, where in '_check_copy', Iris is currently checking that cube.copy().data is not cube.data, that is actually a requirement too far. It should instead be checking that cube.copy().core_data() is not cube.core_data() -- i.e. the lazy array is different, but the computed results can be the same.
That seems perfectly reasonable to me, given the above case where different lazy arrays da1 and da2 do compute to the identical original array, and that would also fail this test.
I.E. we already have this behaviour ...

>>> from iris.cube import Cube
>>> c1 = Cube(da1)
>>> c2 = Cube(da2)
>>> c1.core_data() is c2.core_data()
False
>>> c1.data is c2.data
True

And if that makes sense, why should Iris really expect or require Cube(deepcopy(da1)) to behave differently?

So, as you hinted above, I think these distinctions are a prerogative of Dask, and Iris shouldn't be insisting on the detail.

Makes sense to me. Any more takers ?!?

@rcomer (Member) commented Nov 1, 2022

I've been racking my brains to think how the failing test could be a problem for the user. In general, if we modify the data in a copy of a cube, the original cube's data is unaffected. I think this was a deliberate design choice.

import numpy as np
import iris.cube

cube1 = iris.cube.Cube(np.arange(5))
cube2 = cube1.copy()
cube2.data[3] = 42
print(cube1.data)

gives

[0 1 2 3 4]

However, with latest dask:

import dask.array as da

cube1 = iris.cube.Cube(da.from_array(np.arange(5)))
cube2 = cube1.copy()
cube2.data[3] = 42
print(cube1.data)

gives

[ 0  1  2 42  4]

I'm not at all sure that this second example represents a realistic user workflow though. The vast majority of the time, if a user has lazy data it's because they loaded it from a file.

@rcomer (Member) commented Nov 1, 2022

Also, ping @DPeterK as he has clearly given this some thought recently.

@pp-mo (Member) commented Nov 1, 2022

> In general if we modify the data in a copy of a cube, the original cube's data is unaffected. I think this was a deliberate design choice.

That is definitely (historically) correct! Though it has also caused some trouble.
But we don't absolutely enforce it. For instance,

>>> import numpy as np
>>> from iris.cube import Cube
>>> arr = np.array([1, 3, 2, 4])
>>> c1 = Cube(arr)
>>> c2 = Cube(arr)
>>> c1.data is c2.data
True

So, I think there is considerable room for a "pragmatic" approach.

@DPeterK (Member) commented Nov 2, 2022

Thanks @rcomer for the ping - and excellent spot on the original cause of this issue! I tried to ping the scitools/devs group on the dask PR when it was brought to my attention, so that it would reach everyone collectively... but couldn't, apparently due to a GitHub foible?

That aside, the reason that I'm implicated here is that I implemented the change to dask that has now been backed out again in dask#9555. The rationale was that Iris clearly wanted a copied cube to have a separate data array, so Matt suggested we add that functionality to dask. Now it's been reassessed and considered (reasonably!) un-daskonic, I think the onus is back on Iris to work out how to handle this - and what Iris behaviour for copied cubes we want to enforce.

I'm in agreement with both @pp-mo and @rcomer on their thoughts to date, in that (a) we should let dask decide how to copy arrays, (b) we should be pragmatic in Iris about how we handle this, but nevertheless (c) this particular behaviour was clearly implemented for a reason (from a vague memory, either to support some cube math operations, or perhaps to preserve metadata across copy operations?).

I think we have a few options from which we could choose:

  1. maintain existing Iris behaviour via a shim (one hypothetical form is sketched after this list), meaning the test will pass but we'll diverge from dask behaviour, which might cause other issues / extra maintenance overhead
  2. follow dask's lead on this and either delete or rewrite this failing test accordingly, meaning we follow dask but might discover some edge cases in user code that suddenly don't work as expected?
  3. (is this a breaking change? If so, do we need to selectively enable/disable this via a future switch or something?)
  4. implement new functionality that means both old and new behaviours are preserved, perhaps via deepcopy
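For illustration only, one hypothetical form of the shim in option (1) -- not what was merged here -- would be for the copy path to force a block-wise copy of the lazy data, so that realising the copy can never hand back memory shared with the original source. copy_lazy_independently is an invented name, and note that this defers the copy to compute time rather than copying eagerly as older dask did:

import dask.array as da
import numpy as np

def copy_lazy_independently(lazy_array):
    # Hypothetical shim: copy every block at compute time, so the copied
    # array's realised data never shares memory with the original source.
    return lazy_array.map_blocks(np.copy)

original = da.from_array(np.arange(5), chunks=-1)
copied = copy_lazy_independently(original)
assert not np.shares_memory(copied.compute(), original.compute())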

Like @pp-mo, I think we need to be pragmatic about this. If Iris really doesn't need to care about making a new data array on cube copy, then I think we just go with (2) and let dask fully decide how to manage lazy data. If we do care about the old behaviour, we should perhaps promote (4) via a note in the documentation about the changed behaviour.

My preference, though, is to just go for (2) and be done with it.
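For concreteness, a minimal sketch of what the rewritten check under option (2) could look like, reusing the _check_copy name from the test quoted in the PR description and assuming a unittest.TestCase-style class (the actual change merged here may differ):

def _check_copy(self, cube, cube_copy):
    # The copy must wrap a distinct lazy array...
    self.assertIsNot(cube_copy.core_data(), cube.core_data())
    # ...but, following dask's new behaviour, no assertion is made that
    # realising the two cubes yields different in-memory arrays.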

@trexfeathers changed the title from "Ensure copied lazy data computes to a new array, rather than the same array." to "Accept new copy behaviour from dask/dask#9555." on Nov 2, 2022
@pp-mo (Member) left a comment

LGTM

I think the other reviewers @rcomer and @DPeterK both said they are OK with this approach, so that works for me.

@pp-mo merged commit 421e193 into SciTools:main on Nov 4, 2022
@trexfeathers (Contributor, Author) commented
Thanks @pp-mo! That's a full set of ✔ on main.

@trexfeathers deleted the lazy_array_copy branch on November 29, 2022 12:21