
out-of-memory warnings/solutions #237

Open
CeresBarros opened this issue Apr 23, 2024 · 7 comments
Assignees
Labels
v1.0.0 complete before release of version 1.0.0

Comments

@CeresBarros
Member

If the user asks for many GCMs/runs/periods/scenarios/years across many (many) locations, they can easily run out of memory when extracting climate values and building the large data.table of climate values to be downscaled.

I've seen this happen when asking for all GCMs/scenarios/periods x 3 runs for 2 700 000 point coordinates on a 32 GB machine. The error is of the type std::bad_alloc, which could easily be made more intuitive for the user with some messaging.

We could also foresee having climr actually deal with this problem by, e.g.:

  1. extracting/downscaling by subsets of points or combinations of climate model parameters
  2. saving each subset to a csv file in append mode, e.g. with data.table::fwrite(..., append = TRUE) (base write.csv() ignores append) -- see the sketch at the end of this comment

See https://stackoverflow.com/questions/78170318/error-stdbad-alloc-using-terraextract-on-large-stack-and-many-points -- extracting only the unique raster locations is not a good solution because 1) results are different and 2) the user will still run out of memory when expanding back to the full set of points.
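A minimal sketch of points 1-2 above, outside the climr API: `clim_stack` (a many-layer SpatRaster), `pts` (a data.frame with lon/lat columns) and the lon/lat CRS are hypothetical placeholders. Each chunk of points is extracted and appended to disk, so the full table never has to sit in RAM.

```r
library(terra)
library(data.table)

## hypothetical inputs: clim_stack (SpatRaster with many layers),
## pts (data.frame with lon/lat columns); CRS assumed to be WGS84 lon/lat
chunk_size <- 100000L
chunks <- split(seq_len(nrow(pts)), ceiling(seq_len(nrow(pts)) / chunk_size))

out_csv <- "climate_values.csv"
for (i in seq_along(chunks)) {
  ## extract climate values for this chunk of points only
  pts_chunk <- terra::vect(pts[chunks[[i]], ], geom = c("lon", "lat"), crs = "EPSG:4326")
  vals <- terra::extract(clim_stack, pts_chunk)
  ## append to disk; fwrite() writes the header only for the first chunk
  data.table::fwrite(vals, out_csv, append = (i > 1))
}
```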

@CeresBarros CeresBarros added the v1.0.0 complete before release of version 1.0.0 label Apr 23, 2024
@CeresBarros
Member Author

see bcgov/BGCprojections#5

@kdaust
Collaborator

kdaust commented Apr 23, 2024

Look at you, getting SO replies from Hijmans himself ;)

Once I have more time this summer I'd be more than happy to take on dealing with this issue directly.

@CeresBarros
Member Author

I know, I feel so special :P

@achubaty
Contributor

achubaty commented May 24, 2024

I haven't yet had a chance to dive too far into using this package (thank you btw!) or its inner workings, but it occurred to me that, to help deal with memory limitations on the data.table side of things, using on-disk formats such as disk.frame or arrow may also be useful. Both are quite fast, but do trade low memory use for higher disk storage requirements. These may be a great fit for data that need to persist e.g. across R sessions, but perhaps the overhead of creating these is too much for more 'transient' use.

  • disk.frame supports data.table syntax, but is soft deprecated in favour of arrow;
  • arrow is built for dplyr syntax, and has become very popular.
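In case it helps, a minimal sketch of the arrow route (assumed object and column names, not climr code): a large table is spilled to a partitioned Parquet dataset on disk and then queried lazily with dplyr, so only the collected result is held in RAM.

```r
library(arrow)
library(dplyr)

## hypothetical: climate_dt is a large table of downscaled values with GCM and MAT columns
arrow::write_dataset(climate_dt, "downscaled_parquet", partitioning = "GCM")

ds <- arrow::open_dataset("downscaled_parquet")  # lazy: nothing is read into RAM yet
ds |>
  filter(GCM == "ACCESS-ESM1-5") |>              # pushed down to the Parquet files
  summarise(MAT = mean(MAT, na.rm = TRUE)) |>
  collect()                                      # only this small result is materialised
```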

@CeresBarros
Member Author

CeresBarros commented May 24, 2024

Thanks for the suggestion @achubaty.
I think that could help with RAM limitation, yes. For the "persisting between sessions" or even calls to downscale we might need to embed some caching mechanism that detects that the downscaled on-disk table has already been produced.

@kdaust, thoughts?

@kdaust
Collaborator

kdaust commented May 25, 2024

Thanks @achubaty! I haven't used arrow before, but took a quick look and it seems promising. However, as far as I recall, the memory issues arose when we were extracting points from the terra raster. @CeresBarros is that correct? If so, I'm not sure using arrow would fix that problem. I do know the dcast operation at the final stage of downscale_core takes a lot of RAM, so arrow could definitely help with that.
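For what it's worth, the dcast step could also be chunked with plain data.table rather than arrow. A rough sketch, assuming a hypothetical long table `long_dt` with columns id, var and value (the long table itself still has to fit in memory here; only the wide result is kept off RAM):

```r
library(data.table)

ids <- unique(long_dt$id)
id_chunks <- split(ids, ceiling(seq_along(ids) / 100000L))

out_csv <- "downscaled_wide.csv"
for (i in seq_along(id_chunks)) {
  ## reshape only this chunk of locations to wide format
  wide <- dcast(long_dt[id %in% id_chunks[[i]]], id ~ var, value.var = "value")
  ## append to disk instead of accumulating one huge wide table in RAM
  fwrite(wide, out_csv, append = (i > 1))
}
```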

@CeresBarros
Member Author

CeresBarros commented May 27, 2024

The only time I've tried arrow I didn't manage to get it to work, not sure why. But I've been successful with dataLaF from the LaF package, at least for reading big data in chunks.

That is correct, but the issue, as far as I remember, is that the table of point data created is too large (because there were >2000 raster layers). So, it may help to:

  1. extract the point data by e.g. layer and dump it to disk (see the sketch at the end of this comment)
  2. do the downscaling on the disk-based table (I think this requires processing it in chunks, but I would have to refresh my memory of arrow)
  3. educate the user (through documentation and messaging) about the output/downscaled table being larger than memory and existing on disk only (so not being output by downscale). We could even provide examples of how to deal with it later.

This could be a great enhancement that takes the onus of sequentially processing climate scenarios/models/runs/etc. off the user, who may not understand why climr is failing. The user will of course still have to figure out how they want to deal with the big downscaled table, but we can say "that's out of our hands, we've already done the downscaling bit".
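A rough sketch of steps 1-2 (assumed object names, not climr code): extract one raster layer at a time, dump each result to a Parquet file, then treat the directory as an on-disk arrow dataset that the downscaling step can process chunk by chunk.

```r
library(terra)
library(arrow)
library(dplyr)

## hypothetical inputs: clim_stack (SpatRaster), pts (data.frame with lon/lat columns);
## CRS assumed to be WGS84 lon/lat
pts_v <- terra::vect(pts, geom = c("lon", "lat"), crs = "EPSG:4326")
dir.create("extracted_parquet", showWarnings = FALSE)

for (lyr in names(clim_stack)) {
  vals <- terra::extract(clim_stack[[lyr]], pts_v)   # data.frame: ID + layer value
  names(vals) <- c("ID", "value")                    # uniform schema across layers
  vals$layer <- lyr
  arrow::write_parquet(vals, file.path("extracted_parquet", paste0(lyr, ".parquet")))
}

## the downscaling step would then read this dataset lazily, chunk by chunk
ds <- arrow::open_dataset("extracted_parquet")
first_layer <- names(clim_stack)[1]
ds |> filter(layer == first_layer) |> collect() |> head()
```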
