xarray.DataArray.stack load data into memory #4113

Paul-Aime · 2020-05-30T13:45:38Z

Stacking is loading the data into memory, which is unexpected, or at least undocumented, afaik.

MCVE Code Sample

import os
import psutil
import numpy as np
import xarray as xr

def main():

    xr.DataArray(
        np.random.randn(1024, 1024, 100),
        dims=("x", "y", "z"),
    ).to_netcdf("da.nc")

    da = xr.open_dataarray("da.nc")
    print(f" da: {mb(da.nbytes)} MB")
    print_ram_state()

    mda = da.stack(px=("x", "y"))
    print_ram_state()

def print_ram_state():
    # https://stackoverflow.com/a/21632554
    process = psutil.Process(os.getpid())
    ram_state = process.memory_info().rss
    print(f"RAM: {mb(ram_state) :.2f} MB")

def mb(nbytes):
    return nbytes / (1024 * 1024)

if __name__ == "__main__":
    main()

Problem Description

Using xarray.DataArray.stack method is loading the data into memory, which is unexpected behavior, or at least undocumented afaik.

Versions

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: None

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.16.0
distributed: 2.16.0
matplotlib: 3.2.1
cartopy: None
seaborn: 0.10.1
numbagg: None
setuptools: 46.4.0.post20200518
pip: 20.1.1
conda: 4.8.3
pytest: 5.4.2
IPython: 7.14.0
sphinx: 3.0.4

The text was updated successfully, but these errors were encountered:

fujiisoup · 2020-05-31T04:21:29Z

Thank you for raising an issue.
I confirmed this problem is reproduced.

Since our Lazyarray does not support the reshaping, it loads the data automatically.
This automatic loading happens in many other operations.

For example, if you multiply your array by a scalar,

mda = da *2

It also loads the data into memory.
Maybe we should improve the documentation.

FYI, using dask arrays may solve this problem.
To open the file with dask, you could add chunks keywords,

da = xr.open_dataarray("da.nc", chunks={'x': 16, 'y': 16})

Then, the reshape will be a lazy operation too.

Paul-Aime · 2020-05-31T16:04:39Z

Thanks for the answer.

I tried some experiments with chunked reading with dask, but I have observations I don't fully get :

1) Still loading memory

Reading with chunks load the memory more than reading without chunks, but not loading an amount of memory equals to the size of the array (300MB for a 800MB array in the example below). And by the way, also loading up the memory a bit more when stacking.

But I think this may be normal, because of something like loading the dask machinery in the memory, and that I will see the full benefits when working on bigger data, am I right?

2) Stacking is breaking the chunks

When stacking a chunked array, only chunks alongside the first stacking dimension are conserved, and chunks along the second stacking dimension seem to be merged.

I think this has something to do with the very nature of indexes, but not sure.

3) Rechunking load the memory

A workaround to 2) could have been to re-chunk as wanted after stacking, but then it is fully loading the data.

Example

(Considering the following to replace the main() function of the script in the original post.)

def main():

    fname = "da.nc"
    shape = 512, 2048, 100  # 800 MB

    xr.DataArray(
        np.random.randn(*shape),
        dims=("x", "y", "z"),
    ).to_netcdf(fname)
    print_ram_state()

    da = xr.open_dataarray(fname, chunks=dict(x=1, y=1))
    print(f" da: {mb(da.nbytes)} MB")
    print_ram_state()

    mda = da.stack(px=("x", "y"))
    print_ram_state()

    mda = mda.chunk(dict(px=1))
    print_ram_state()

which outputs something like:

RAM: 94.52 MB
 da: 800.0 MB
RAM: 398.83 MB
RAM: 589.05 MB
RAM: 1409.11 MB

Chunks displayed thanks to the jupyter notebook visualization:

Before stacking:

After stacking:

A workaround could have been to save the data already stacked, but "MultiIndex cannot yet be serialized to netCDF".

Maybe there is another workaround?

(Sorry for the long post)

fujiisoup · 2020-06-01T05:24:35Z

Reading with chunks load the memory more than reading without chunks, but not loading an amount of memory equals to the size of the array (300MB for a 800MB array in the example below). And by the way, also loading up the memory a bit more when stacking.

I think it depends on the chunk size.
If I use the chunks chunks=dict(x=128, y=128), the memory usage is

RAM: 118.14 MB
 da: 800.0 MB
RAM: 119.14 MB
RAM: 125.59 MB
RAM: 943.79 MB

When stacking a chunked array, only chunks alongside the first stacking dimension are conserved, and chunks along the second stacking dimension seem to be merged.

I am not sure where 512 comes from in your example (maybe dask does something). If I work with chunks=dict(x=128, y=128), the chunksize after the stacking was (100, 16384), which is reasonable (z=100, px=(128, 128)).

A workaround could have been to save the data already stacked, but "MultiIndex cannot yet be serialized to netCDF".

You can do reset_index before saving it into the netCDF, but it requires another computation when creating the MultiIndex after loading.

Paul-Aime · 2020-06-01T13:06:02Z

I think it depends on the chunk size.

Yes, I'm not very familiar with chunks, it seems that it's not good to have too many of them.

I am not sure where 512 comes from in your example (maybe dask does something).

Sorry it should have been (100, 2048), it comes from the second dimension of stacking (explained below). My screenshot was for .stack(px=("y", "x")), my bad.

If I work with chunks=dict(x=128, y=128), the chunksize after the stacking was (100, 16384), which is reasonable (z=100, px=(128, 128)).

Yes, after some more experiments I found out that the second chunksize after stacking is (100, X) where X is a multiple of the size of the second stacking dimension (here "y"), hence why it is working in your case (128 * 128 == 2048 * 8).

The formula for X is something like:

shape[1] * ( (x_chunk * y_chunk) // shape[1] + bool((x_chunk * y_chunk) % shape[1]) )

So, minimum value for X is shape[1] (size of "y" dim, hence my case with small values for x_chunk and y_chunk).

That's why I was saying that "chunks along the second stacking dimension seem to be merged". This might be normal, just unexpected, and still quite obscure for me.

And it must be happening on dask side anyway. Thanks a lot for your insights.

You can do reset_index before saving it into the netCDF, but it requires another computation when creating the MultiIndex after loading.

Ah yes, thanks! I thought reset_index was similar to unstack for indexes created with stack.

stale · 2022-04-19T10:43:48Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

max-sixty · 2022-04-19T16:10:25Z

Closing as mostly resolved, I think? Please reopen if not.

LukeWeidenwalker mentioned this issue Apr 13, 2022

Remove .stack() from aggregate spatial Open-EO/openeo-processes-python#156

Merged

stale bot added the stale label Apr 19, 2022

dcherian added topic-lazy array and removed stale labels Apr 19, 2022

dcherian mentioned this issue Apr 19, 2022

Lazy indexing arrays as a stand-alone package #5081

Open

max-sixty closed this as completed Apr 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xarray.DataArray.stack load data into memory #4113

xarray.DataArray.stack load data into memory #4113

Paul-Aime commented May 30, 2020

INSTALLED VERSIONS

fujiisoup commented May 31, 2020

Paul-Aime commented May 31, 2020

fujiisoup commented Jun 1, 2020

Paul-Aime commented Jun 1, 2020

stale bot commented Apr 19, 2022

max-sixty commented Apr 19, 2022

xarray.DataArray.stack load data into memory #4113

xarray.DataArray.stack load data into memory #4113

Comments

Paul-Aime commented May 30, 2020

MCVE Code Sample

Problem Description

Versions

INSTALLED VERSIONS

fujiisoup commented May 31, 2020

Paul-Aime commented May 31, 2020

1) Still loading memory

2) Stacking is breaking the chunks

3) Rechunking load the memory

Example

fujiisoup commented Jun 1, 2020

Paul-Aime commented Jun 1, 2020

stale bot commented Apr 19, 2022

max-sixty commented Apr 19, 2022