
Conversation

@trexfeathers (Contributor) commented Oct 31, 2022

🚀 Pull Request

Description

Replaces #5027.

Now updated from the original text below. Following helpful discussion (in the comments below) with @pp-mo, @rcomer and @DPeterK, we have decided to adopt Dask's new copy behaviour by adjusting our testing and announcing the change in a What's New entry. We expect minimal user disruption, since the affected behaviour is fairly niche.

Original text:

Possible solution to #5027. I've confirmed that this fixes the failing test - iris.tests.unit.cube.test_Cube.Test_copy.test__lazy():

def test__lazy(self):
    cube = Cube(as_lazy_data(np.array([1, 0])))
    self._check_copy(cube, cube.copy())

What is less certain is whether this is correct. Dask's philosophy is clearly that Dask gets to make the decisions about shared/different memory blocks, and that downstream code should not presume to control such things (see dask/dask#9555). Having a workaround is at best a deviation from the 'thin wrapper' philosophy, and at worst defers a change to Iris behaviour that we might have to adopt later anyway if future changes are made to the Dask internals I have used.

Discuss!


Consult Iris pull request check list

@pp-mo (Member) commented Oct 31, 2022

I think my view of this is that it is basically the Iris test that is wrong -- it asks too much.

I find that in older versions of dask, if ...

import numpy as np
import dask.array as da

m = np.array([1, 2])
da1 = da.from_array(m, chunks=-1)
da2 = da.from_array(m, chunks=-1)

then da1 is da2 is False (of course),
but da1.compute() is da2.compute() is True,
which makes sense to me -- they are different lazy arrays, but they both compute to the same original numpy object.

So, that is still true with latest dask.
What has changed is what deepcopy of a lazy array gives you.
So, in older dask, deepcopy(da1).compute() is da1.compute() gives False -- i.e. it interprets making a deepcopy of the lazy array as requiring a copy of the underlying content, even if that means copying a huge actual in-memory array.
Whereas in latest dask, deepcopy(da1).compute() is da1.compute() is True.
That actually makes perfect sense to me, since a "lazy" array is really more like a pointer to the original than a reference -- so why would copying the pointer "copy" the thing pointed to?
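For reference, here are those observations collected into one runnable sketch. The commented results are the ones reported in this comment for the dask versions under discussion; other versions may behave differently:

from copy import deepcopy

import dask.array as da
import numpy as np

m = np.array([1, 2])
da1 = da.from_array(m, chunks=-1)
da2 = da.from_array(m, chunks=-1)

# Different lazy arrays, but both compute back to the same numpy object.
print(da1 is da2)                      # False, in old and new dask alike
print(da1.compute() is da2.compute())  # True, in old and new dask alike

# The part that changed: whether deepcopy of the lazy array also copies the
# underlying in-memory content.
print(deepcopy(da1).compute() is da1.compute())
# older dask: False -- the deepcopy copied the underlying numpy array
# latest dask: True -- copying the "pointer" does not copy the thing pointed to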

So in my view, where in '_check_copy', Iris is currently checking that cube.copy().data is not cube.data, that is actually a requirement too far. It should instead be checking that cube.copy().core_data() is not cube.core_data() -- i.e. the lazy array is different, but the computed results can be the same.
That seems perfectly reasonable to me, given the above case where different lazy arrays da1 and da2 do compute to the identical original array, and that would also fail this test.
I.E. we already have this behaviour ...

>>> from iris.cube import Cube
>>> c1 = Cube(da1)
>>> c2 = Cube(da2)
>>> c1.core_data() is c2.core_data()
False
>>> c1.data is c2.data
True

And if that makes sense, why should Iris really expect or require Cube(deepcopy(da1)) to behave differently?

So, as you hinted above, I think these distinctions are a prerogative of Dask, and Iris shouldn't be insisting on the detail.

Makes sense to me. Any more takers ?!?

@rcomer (Member) commented Nov 1, 2022

I've been racking my brains to think how the failing test could be a problem for the user. In general, if we modify the data in a copy of a cube, the original cube's data is unaffected. I think this was a deliberate design choice.

import numpy as np
import iris.cube

cube1 = iris.cube.Cube(np.arange(5))
cube2 = cube1.copy()
cube2.data[3] = 42
print(cube1.data)

gives

[0 1 2 3 4]

However, with latest dask:

import dask.array as da

cube1 = iris.cube.Cube(da.from_array(np.arange(5)))
cube2 = cube1.copy()
cube2.data[3] = 42
print(cube1.data)

gives

[ 0  1  2 42  4]

I'm not at all sure that this second example represents a realistic user workflow though. The vast majority of the time, if a user has lazy data it's because they loaded it from a file.

@rcomer (Member) commented Nov 1, 2022

Also, ping @DPeterK as he has clearly given this some thought recently.

@pp-mo (Member) commented Nov 1, 2022

> In general if we modify the data in a copy of a cube, the original cube's data is unaffected. I think this was a deliberate design choice.

That is definitely (historically) correct! Though it has also caused some trouble.
But we don't absolutely enforce it. For instance,

>>> import numpy as np
>>> from iris.cube import Cube
>>> arr = np.array([1, 3, 2, 4])
>>> c1 = Cube(arr)
>>> c2 = Cube(arr)
>>> c1.data is c2.data
True

So, I think there is considerable room for a "pragmatic" approach.

@DPeterK (Member) commented Nov 2, 2022

Thanks @rcomer for the ping - and excellent spot on the original cause of this issue! I tried to ping the scitools/devs group on the dask PR when it was brought to my attention, so that it would reach everyone collectively... but couldn't, apparently due to a GitHub foible?

That aside, the reason that I'm implicated here is that I implemented the change to dask that has now been backed out again in dask#9555. The rationale was that Iris clearly wanted a copied cube to have a separate data array, so Matt suggested we add that functionality to dask. Now it's been reassessed and considered (reasonably!) un-daskonic, I think the onus is back on Iris to work out how to handle this - and what Iris behaviour for copied cubes we want to enforce.

I'm in agreement with both @pp-mo and @rcomer on their thoughts to date, in that (a) we should let dask decide how to copy arrays, (b) we should be pragmatic in Iris about how we handle this, but nevertheless (c) this particular behaviour was clearly implemented for a reason (from a vague memory, either to support some cube math operations, or perhaps to preserve metadata across copy operations?).

I think we have a few options from which we could choose:

  1. maintain existing Iris behaviour via a shim (one hypothetical form is sketched after this list), meaning the test will pass but we'll diverge from dask behaviour, which might cause other issues / extra maintenance overhead
  2. follow dask's lead on this and either delete or rewrite this failing test accordingly, meaning we follow dask but might discover some edge cases in user code that suddenly don't work as expected?
  3. (is this a breaking change? If so, do we need to selectively enable/disable this via a future switch or something?)
  4. implement new functionality that means both old and new behaviours are preserved, perhaps via deepcopy
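For illustration only, one hypothetical form of the shim in option (1) -- not what was merged here -- would be for the copy path to force a block-wise copy of the lazy data, so that realising the copy can never hand back memory shared with the original source. copy_lazy_independently is an invented name, and note that this defers the copy to compute time rather than copying eagerly as older dask did:

import dask.array as da
import numpy as np

def copy_lazy_independently(lazy_array):
    # Hypothetical shim: copy every block at compute time, so the copied
    # array's realised data never shares memory with the original source.
    return lazy_array.map_blocks(np.copy)

original = da.from_array(np.arange(5), chunks=-1)
copied = copy_lazy_independently(original)
assert not np.shares_memory(copied.compute(), original.compute())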

Like @pp-mo, I think we need to be pragmatic about this. If Iris really doesn't need to care about making a new data array on cube copy, then I think we just go with (2) and let dask fully decide how to manage lazy data. If we do care about the old behaviour, we should perhaps promote (4) via a note in the documentation about the changed behaviour.

My preference, though, is to just go for (2) and be done with it.
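For concreteness, a minimal sketch of what the rewritten check under option (2) could look like, reusing the _check_copy name from the test quoted in the PR description and assuming a unittest.TestCase-style class (the actual change merged here may differ):

def _check_copy(self, cube, cube_copy):
    # The copy must wrap a distinct lazy array...
    self.assertIsNot(cube_copy.core_data(), cube.core_data())
    # ...but, following dask's new behaviour, no assertion is made that
    # realising the two cubes yields different in-memory arrays.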

@trexfeathers changed the title from "Ensure copied lazy data computes to a new array, rather than the same array." to "Accept new copy behaviour from dask/dask#9555." on Nov 2, 2022
@pp-mo (Member) left a comment

LGTM

I think the other reviewers @rcomer and @DPeterK both said they are OK with this approach, so that works for me.

@pp-mo merged commit 421e193 into SciTools:main on Nov 4, 2022
@trexfeathers (Contributor, Author) commented
Thanks @pp-mo! That's a full set of ✔ on main.

@trexfeathers deleted the lazy_array_copy branch on November 29, 2022 12:21