
Conversation

@pp-mo pp-mo commented Mar 22, 2017

Preserve lazy data in PP save-pairs generation, so you can work with a list of save cubes and/or fields without excessive memory requirements (as intended for saving).

The existing PP save code does not preserve lazy data when making a PPField from a 2-D cube slice.
Previously there was no point in doing so, because the PP save rules fetch the data anyway, essentially just to obtain the data fill value (if there is one).

This fixes that, so that the save rules, and save field generation, both preserve the lazy aspect.
Changes here:

  • preserve lazy data in generating PP save pairs
  • add a cube.fill_value to generated saved cubes (only)
    • this approach anticipates a probable future API change for cube to have a standard cube.fill_value property -- as currently under discussion for the dask migration work.
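The change can be sketched roughly as follows. This is a minimal illustration, not the actual Iris implementation: `save_pairs` and the array shapes here are invented for the example. The idea is that, when slicing into 2-D fields for saving, each slice stays a lazy (dask) array, so nothing is realised until it is actually written.

```python
# Minimal sketch: yield 2-D slices of a (t, y, x) data payload without
# realising them, so memory use stays flat however many fields there are.
import dask.array as da
import numpy as np

def save_pairs(cube_data):
    """Yield lazy 2-D slices over the leading dimension."""
    lazy = da.asarray(cube_data)   # wrap (or pass through) as a lazy array
    for i in range(lazy.shape[0]):
        yield lazy[i]              # still lazy: nothing computed yet

data = np.arange(24.0).reshape(4, 3, 2)
slices = list(save_pairs(data))
assert all(isinstance(s, da.Array) for s in slices)  # all still lazy
assert np.array_equal(slices[1].compute(), data[1])  # realised on demand
```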

@pp-mo pp-mo force-pushed the cube_fill_value branch from fc247e9 to 5965b89 Compare March 23, 2017 11:53

pp-mo commented Mar 23, 2017

@bjlittle can you comment whether this now looks sensible, in light of ongoing Dask development ?


cpelley commented Mar 28, 2017

Happy to review this when it's ready. We have external collaborators (CMIP6) whom this affects, with limited hardware available to work around this problem.

Cheers

@pp-mo pp-mo changed the title Cube fill value Lazy data in PP save pairs Apr 18, 2017
@cpelley cpelley self-requested a review April 19, 2017 20:43

cpelley commented Apr 20, 2017

this approach anticipates a probable future API change for cube to have a standard cube.fill_value property -- as currently under discussion for the dask migration work.

Description a little out of date?? #2433


marqh commented Apr 20, 2017

Description a little out of date?? #2433

Hi @cpelley

Those changes are on a feature branch; no Dask code is on master yet, to my knowledge.

So I don't think this is out of date.

# Attach an extra cube "fill_value" property, allowing the save rules
# to deduce MDI more easily without realising the data.
# NOTE: it is done this way because this property may exist in future.
slice2D.fill_value = _data_fill_value(slice2D)
A reviewer commented:
Adding a fill_value attribute at run-time seems rather messy to me, especially as it presumes that the dask branch ends up merged onto master as-is. Perhaps it would be better to point this PR at the dask branch instead?
That said, I'm not going to push this; if you're happy then I'm happy.

A contributor commented:

Perhaps this could be avoided by making the code in _data_fill_value() into a set of save rules to determine pp.bmdi.

pp-mo (Member, Author) replied:

This is a purely temporary measure, based on the idea that the forthcoming dask change will definitely introduce a generic "cube.fill_value" property.
We will remove this hack when the dask branch is merged back.
Obviously, the method actually used here is 🤢, but it is only here so you guys can check that it fixes the current memory-burn problem.
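The idea under discussion can be sketched like this. The names `deduce_bmdi` and the default constant below are illustrative, not the actual Iris code: a save rule could read the optional attribute and fall back to a default MDI, never touching the data payload.

```python
# Illustrative only: pick a PP field's BMDI from an optional cube
# attribute, falling back to a default, without realising any data.
DEFAULT_BMDI = -1e30  # placeholder default; not necessarily Iris's value

def deduce_bmdi(cube):
    """Return the cube's fill value if set, else a default MDI."""
    fill_value = getattr(cube, "fill_value", None)
    return DEFAULT_BMDI if fill_value is None else fill_value
```

Because it uses `getattr` with a default, this works whether or not the run-time attribute was attached.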

if slice2D.has_lazy_data():
    slice_core_data = slice2D.lazy_data()
else:
    slice_core_data = slice2D.data
A reviewer commented:
Is this really necessary? Can't you avoid the if-else and instead always set it to the lazy array? I doubt there would be much overhead in re-instantiating a lazy array when it has been realised.

i.e. pp_field._data = slice2D.lazy_data()

pp-mo (Member, Author) replied:

It's best like this because creating a simple wrapper around real data is substantially more costly in dask than it was with biggus: the earliest dask branch retained existing cube property code to get cube.shape and cube.dtype from cube.lazy_data(), and the tests then took nearly twice as long!
Having said that, this code should be removed when we transition to dask anyway, as mentioned above.
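The trade-off being discussed can be sketched with a toy stand-in (assuming dask; `FakeSlice` and `core_data` are invented for illustration): hand back the lazy array when one exists, otherwise the realised ndarray untouched, so real data is never re-wrapped in a fresh dask graph.

```python
import dask.array as da
import numpy as np

class FakeSlice:
    """Toy stand-in for a 2-D cube slice with an optionally lazy payload."""
    def __init__(self, data, lazy):
        self._data = da.asarray(data) if lazy else np.asarray(data)

    def has_lazy_data(self):
        return isinstance(self._data, da.Array)

    def lazy_data(self):
        return self._data

    @property
    def data(self):
        return self._data

def core_data(slice2d):
    # Mirrors the if/else under review: no re-wrapping of realised data.
    if slice2d.has_lazy_data():
        return slice2d.lazy_data()
    return slice2d.data

arr = np.ones((2, 2))
assert isinstance(core_data(FakeSlice(arr, lazy=True)), da.Array)
assert isinstance(core_data(FakeSlice(arr, lazy=False)), np.ndarray)
```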

@cpelley cpelley merged commit 089ebe8 into SciTools:master Apr 25, 2017
@pp-mo pp-mo deleted the cube_fill_value branch April 25, 2017 11:01

cpelley commented Apr 25, 2017

Thanks @pp-mo, ping @hdyson


hdyson commented Apr 25, 2017

Thanks all, it's very much appreciated. I'll keep you informed once we've done a test with the data that revealed this issue (this may take a few days, it was an external collaborator that found it).


hdyson commented May 11, 2017

Just confirming that this does work for the use case that our collaborators found - good stuff. We're seeing ~5x improvement in memory usage, and this is more than sufficient for our needs.

Thanks again!

@QuLogic QuLogic added this to the v2.0 milestone May 11, 2017
@lbdreyer lbdreyer mentioned this pull request May 17, 2017
@QuLogic QuLogic modified the milestones: v1.13.0, v2.0 May 17, 2017
bjlittle added a commit to bjlittle/iris that referenced this pull request Jun 5, 2017
@bjlittle bjlittle mentioned this pull request Aug 4, 2017