Description
On behalf of Scitools user(s):
I have an Iris/Dask/NetCDF question about chunking data and collapsing dimensions, in this case to compute ensemble statistics (ensemble mean/median/standard deviation fields).
The NetCDF files that I'm using have chunking that is optimized for this operation. Rather than using the NetCDF file's chunking, however, Iris/Dask is putting each ensemble member into its own chunk, which I think is causing my code to try to read all of the data into memory at once. Is there a simple way to make Iris use the NetCDF file's chunking when it loads NetCDF data into a Dask array? (A sketch of one possible workaround follows below.)
Also, the memory requirements for this and other operations seem to have gone up noticeably when moving from Iris 2.2 to Iris 2.4, and I've been busy doubling my memory allocations to allow for it. Running on Iris 2.2, this task took a bit over an hour to compute with 16 GB of RAM allocated; with Iris 2.4 it no longer fits into that memory, or within a 2-hour job.
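For illustration, a minimal sketch of one possible workaround: inspect the file's own chunking with netCDF4, then rechunk the loaded cube's Dask array to match. This is an untested idea rather than a confirmed fix; the filename, variable name, and chunk shape below are placeholders.

import iris
import netCDF4

filename = 'example.nc'  # placeholder path

# Read the chunking stored in the NetCDF file itself; chunking()
# returns a list of chunk lengths per dimension, or 'contiguous'.
with netCDF4.Dataset(filename) as ds:
    print(ds.variables['tos'].chunking())  # hypothetical variable name

# Load lazily, then force the Dask chunks to a shape that suits the
# collapse (here one member-year of monthly fields per chunk, as a guess).
cube = iris.load_cube(filename)
rechunked = cube.lazy_data().rechunk((1, 12, 36, 72))
cube = cube.copy(data=rechunked)
print(cube.lazy_data().chunksize)

Note that rechunking after load only rewrites the task graph; whether it avoids the extra memory use seen here would need testing.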
To recreate the issue:
First, I created two conda environments, one with Iris v2.2 and one with Iris v2.4:
conda create --name iris-v2.2 -c conda-forge python=3.7 iris=2.2 memory_profiler
conda create --name iris-v2.4 -c conda-forge python=3.7 iris=2.4 memory_profiler
Then, after activating each environment, I ran (Python code below):
python -m mprof run dask_issue.py
mprof plot
For Iris v2.2:
Loading the data:
Has lazy data: True
Chunksize: (1, 1, 36, 72)
Extracting a single year:
Has lazy data: True
Chunksize: (1, 1, 36, 72)
Realising the data of the single year:
Has lazy data: False
Chunksize: (9, 12, 36, 72)
For Iris v2.4:
Loading the data:
Has lazy data: True
Chunksize: (1, 2028, 36, 72)
Extracting a single year:
Has lazy data: True
Chunksize: (1, 12, 36, 72)
Realising the data of the single year:
Has lazy data: False
Chunksize: (9, 12, 36, 72)
The plots produced by memory_profiler show that the memory usage when realising the extracted data under Iris v2.4 is over 5 times the usage under Iris v2.2. The time taken to realise the data also increases. The cube prior to extraction contains data from 1850 to 2018. Is it possible that all the data (rather than just the year that had been extracted) are being realised when using Iris v2.4?
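One way to probe this would be to count the Dask tasks that actually execute when the extracted year is realised, using Dask's built-in profiler. A minimal sketch (the filename is a placeholder):

import iris
from dask.diagnostics import Profiler

cube = iris.load_cube('example.nc')  # placeholder path
year_cube = cube.extract(
    iris.Constraint(time=lambda cell: cell.point.year == 1999))

with Profiler() as prof:
    year_cube.data  # realise just the extracted year

# Each entry in prof.results is one executed task; a count that scales
# with the full 1850-2018 record would suggest everything is computed.
print(len(prof.results))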
The dask_issue.py code referenced above:
#!/usr/bin/env python
import iris


def main():
    print('Loading the data:')
    cube = load_cube()
    print_cube_info(cube)

    print('Extracting a single year:')
    cube = extract_year(cube)
    print_cube_info(cube)

    print('Realising the data of the single year:')
    realise_data(cube)
    print_cube_info(cube)


@profile  # injected into builtins by memory_profiler when run via mprof
def load_cube():
    filename = ('[path]/HadSST4/analysis/HadCRUT.5.0.0.0.SST.analysis.anomalies.?.nc')
    cube = iris.load_cube(filename)
    return cube


@profile
def extract_year(cube):
    year = 1999
    time_constraint = iris.Constraint(
        time=lambda cell: cell.point.year == year)
    cube = cube.extract(time_constraint)
    return cube


@profile
def realise_data(cube):
    # Touching cube.data forces the lazy array to be computed.
    cube.data


def print_cube_info(cube):
    tab = ' ' * 4
    print(f'{tab}Has lazy data: {cube.has_lazy_data()}')
    print(f'{tab}Chunksize: {cube.lazy_data().chunksize}')


if __name__ == '__main__':
    main()
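Note that the @profile decorator is supplied by memory_profiler when the script is run under mprof run (or python -m memory_profiler); running the script directly will raise a NameError.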