Description
On behalf of Scitools user(s):
I have an Iris/Dask/NetCDF question about chunking data and collapsing dimensions, in this case to compute ensemble statistics (ensemble mean/median/standard deviation fields).
The NetCDF files that I'm using have chunking that is optimized for this operation. Rather than using the NetCDF file's chunking, however, Iris/Dask is putting each ensemble member into its own chunk, which I think is causing my code to try to read all of the data into memory at once. Is there a simple way to make Iris use the NetCDF file's chunking when it loads NetCDF data into a Dask array? (A sketch of one possible workaround follows below.)
Also, the memory requirements for this and other operations seem to have gone up noticeably when moving from Iris 2.2 to Iris 2.4, and I've been busy doubling my memory allocations to allow for it. Running on Iris 2.2, this task took a bit over an hour to compute with 16 GB of RAM allocated; with Iris 2.4 it no longer fits into that memory, or within a 2-hour job.
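For illustration, a minimal sketch of one possible workaround: inspect the file's own chunking with netCDF4, then rechunk the loaded cube's Dask array to match. This is an untested idea rather than a confirmed fix; the filename, variable name, and chunk shape below are placeholders.

import iris
import netCDF4

filename = 'example.nc'  # placeholder path

# Read the chunking stored in the NetCDF file itself; chunking()
# returns a list of chunk lengths per dimension, or 'contiguous'.
with netCDF4.Dataset(filename) as ds:
    print(ds.variables['tos'].chunking())  # hypothetical variable name

# Load lazily, then force the Dask chunks to a shape that suits the
# collapse (here one member-year of monthly fields per chunk, as a guess).
cube = iris.load_cube(filename)
rechunked = cube.lazy_data().rechunk((1, 12, 36, 72))
cube = cube.copy(data=rechunked)
print(cube.lazy_data().chunksize)

Note that rechunking after load only rewrites the task graph; whether it avoids the extra memory use seen here would need testing.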
To recreate the issue:
First, I created two conda environments, one with Iris v2.2 and one with Iris v2.4:
conda create --name iris-v2.2 -c conda-forge python=3.7 iris=2.2 memory_profiler
conda create --name iris-v2.4 -c conda-forge python=3.7 iris=2.4 memory_profiler
Then, after activating each environment, I ran (Python code below):
python -m mprof run dask_issue.py
mprof plot
For Iris v2.2:
Loading the data:
Has lazy data: True
Chunksize: (1, 1, 36, 72)
Extracting a single year:
Has lazy data: True
Chunksize: (1, 1, 36, 72)
Realising the data of the single year:
Has lazy data: False
Chunksize: (9, 12, 36, 72)
For Iris v2.4:
Loading the data:
Has lazy data: True
Chunksize: (1, 2028, 36, 72)
Extracting a single year:
Has lazy data: True
Chunksize: (1, 12, 36, 72)
Realising the data of the single year:
Has lazy data: False
Chunksize: (9, 12, 36, 72)
The plots produced by memory_profiler show that the memory usage when realising the extracted data under Iris v2.4 is over 5 times the usage under Iris v2.2. The time taken to realise the data also increases. The cube prior to extraction contains data from 1850 to 2018. Is it possible that all the data (rather than just the year that had been extracted) are being realised when using Iris v2.4?
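One way to probe this would be to count the Dask tasks that actually execute when the extracted year is realised, using Dask's built-in profiler. A minimal sketch (the filename is a placeholder):

import iris
from dask.diagnostics import Profiler

cube = iris.load_cube('example.nc')  # placeholder path
year_cube = cube.extract(
    iris.Constraint(time=lambda cell: cell.point.year == 1999))

with Profiler() as prof:
    year_cube.data  # realise just the extracted year

# Each entry in prof.results is one executed task; a count that scales
# with the full 1850-2018 record would suggest everything is computed.
print(len(prof.results))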
The dask_issue.py code referenced above:
#!/usr/bin/env python
import iris


def main():
    print('Loading the data:')
    cube = load_cube()
    print_cube_info(cube)

    print('Extracting a single year:')
    cube = extract_year(cube)
    print_cube_info(cube)

    print('Realising the data of the single year:')
    realise_data(cube)
    print_cube_info(cube)


@profile  # injected into builtins by memory_profiler when run via mprof
def load_cube():
    filename = ('[path]/HadSST4/analysis/HadCRUT.5.0.0.0.SST.analysis.anomalies.?.nc')
    cube = iris.load_cube(filename)
    return cube


@profile
def extract_year(cube):
    year = 1999
    time_constraint = iris.Constraint(
        time=lambda cell: cell.point.year == year)
    cube = cube.extract(time_constraint)
    return cube


@profile
def realise_data(cube):
    # Touching cube.data forces the lazy array to be computed.
    cube.data


def print_cube_info(cube):
    tab = ' ' * 4
    print(f'{tab}Has lazy data: {cube.has_lazy_data()}')
    print(f'{tab}Chunksize: {cube.lazy_data().chunksize}')


if __name__ == '__main__':
    main()
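Note that the @profile decorator is supplied by memory_profiler when the script is run under mprof run (or python -m memory_profiler); running the script directly will raise a NameError.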