Lazy data in PP save pairs #2452
Conversation
@bjlittle can you comment on whether this now looks sensible, in light of ongoing Dask development?
Happy to review this when it's ready. We have external collaborators (CMIP6) whom this affects, and we have limited availability of hardware to work around this problem. Cheers
Is the description a little out of date? See #2433.
```python
# Attach an extra cube "fill_value" property, allowing the save rules
# to deduce MDI more easily without realising the data.
# NOTE: it is done this way because this property may exist in future.
slice2D.fill_value = _data_fill_value(slice2D)
```
Adding a `fill_value` attribute at run-time seems rather messy to me, especially as it presumes that the dask branch ends up merged onto master as it stands. Perhaps it would be better to point this PR at the dask branch instead?
That said, I'm not going to push this: if you're happy then I'm happy.
Perhaps this could be avoided by making the code in _data_fill_value() into a set of save rules to determine pp.bmdi.
This is a purely temporary measure, based on the idea that the forthcoming dask change will definitely be introducing the generic "cube.fill_value" property.
We will remove this hack when the dask branch is merged back.
Obviously, the method actually used here is 🤢, but it is only here so that you can check that it fixes the current memory-burn problem.
```python
if slice2D.has_lazy_data():
    slice_core_data = slice2D.lazy_data()
else:
    slice_core_data = slice2D.data
```
Is this really necessary? Can't you avoid the if/else and simply always set it to the lazy array? I doubt there would be much overhead in re-instantiating a lazy array when the data has already been realised, i.e. `pp_field._data = slice2D.lazy_data()`.
It's best like this because creating a simple wrapper around real data is substantially more costly in dask than it was with biggus: the earliest dask branch retained the existing cube property code, getting `cube.shape` and `cube.dtype` from `cube.lazy_data()`, and the tests then took nearly twice as long!
Having said that, this code should be removed when we transition to dask anyway, as mentioned above.
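To make the trade-off concrete, here is a standalone sketch of the branch under discussion, using a tiny stub object in place of a real Iris cube slice (the class and helper are purely illustrative, not the Iris API):

```python
import numpy as np

class FakeSlice:
    # Minimal stand-in for a 2d cube slice -- NOT the real Iris API.
    def __init__(self, data, lazy=None):
        self.data = data
        self._lazy = lazy

    def has_lazy_data(self):
        return self._lazy is not None

    def lazy_data(self):
        return self._lazy

def core_data(slice2d):
    # Prefer an existing lazy array, but pass realised data through
    # untouched: re-wrapping real data in a lazy container was
    # measurably slower under dask than it had been under biggus.
    if slice2d.has_lazy_data():
        return slice2d.lazy_data()
    return slice2d.data
```

The design choice is simply that the realised array is returned as-is, so no wrapper object is constructed on the already-real path.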
Thanks all, it's very much appreciated. I'll keep you informed once we've done a test with the data that revealed this issue (this may take a few days; it was an external collaborator that found it).
Just confirming that this does work for the use case that our collaborators found — good stuff. We're seeing a ~5x improvement in memory usage, which is more than sufficient for our needs. Thanks again!
Preserve lazy data in PP save-pairs generation, so that a list of save cubes and/or fields can be worked with without excessive memory requirements (as intended for saving).
The existing PP save code does not preserve lazy data when making a PPField from a 2d cube slice.
Previously there was no point in doing so, as the PP save rules fetched the data anyway, basically just to get the data fill-value (if there is one).
This change fixes that, so that both the save rules and the save-field generation preserve the lazy aspect.
Changes here:

- attach a `cube.fill_value` to generated saved cubes (only)
- this anticipates a generic `cube.fill_value` property — as currently under discussion for the dask migration work
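As a rough illustration of the save-pairs idea described above (a hypothetical generator over a plain array, not the actual Iris save-pairs API): each 2d field is yielded one at a time, so with lazy input arrays nothing would be realised until a field is actually written.

```python
import numpy as np

def iter_save_pairs(data_3d):
    # Hypothetical sketch of save-pairs generation: yield one
    # (index, 2d-slice) pair per field. Slicing a NumPy array gives
    # a view (no copy); slicing a lazy array would defer any compute
    # until the field's data is actually needed for writing.
    for i in range(data_3d.shape[0]):
        yield i, data_3d[i]
```

Because this is a generator, only one field's worth of data ever needs to be in flight at a time, which is the memory behaviour the PR aims to restore.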