xarray.DataArray.stack loads data into memory #4113
Comments
Thank you for raising an issue. Since our lazy array backend does not support reshaping, it loads the data automatically. For example, multiplying your array by a scalar, `mda = da * 2`, also loads the data into memory. FYI, using dask arrays may solve this problem: `da = xr.open_dataarray("da.nc", chunks={'x': 16, 'y': 16})`. Then the reshape will be a lazy operation too.
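For illustration, a minimal sketch of that suggestion (assuming the da.nc file from this issue, with dims x, y, z):

```python
import xarray as xr

# Opening with dask chunks gives a dask-backed (fully lazy) array
da = xr.open_dataarray("da.nc", chunks={"x": 16, "y": 16})

# stack now builds a lazy reshape in the dask graph instead of
# loading the values into memory; nothing is computed yet.
mda = da.stack(px=("x", "y"))
print(mda.chunks)
```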
Thanks for the answer. I tried some experiments with chunked reading with dask, but I have observations I don't fully get:

1) Still loading memory

Reading with chunks loads more memory than reading without chunks, though not an amount equal to the size of the array (300 MB for an 800 MB array in the example below). And, by the way, stacking also loads a bit more memory. I think this may be normal, something like the dask machinery itself being loaded into memory, and that I will see the full benefits when working on bigger data. Am I right?

2) Stacking is breaking the chunks

When stacking a chunked array, only the chunks along the first stacking dimension are conserved; the chunks along the second stacking dimension seem to be merged. I think this has something to do with the very nature of indexes, but I'm not sure.

3) Rechunking loads the memory

A workaround to 2) could have been to re-chunk as wanted after stacking, but that fully loads the data.

Example

(Consider the following replacing the `main` of the original example:)

```python
def main():
    fname = "da.nc"
    shape = 512, 2048, 100  # 800 MB
    xr.DataArray(
        np.random.randn(*shape),
        dims=("x", "y", "z"),
    ).to_netcdf(fname)
    print_ram_state()

    da = xr.open_dataarray(fname, chunks=dict(x=1, y=1))
    print(f" da: {mb(da.nbytes)} MB")
    print_ram_state()

    mda = da.stack(px=("x", "y"))
    print_ram_state()

    mda = mda.chunk(dict(px=1))
    print_ram_state()
```

The RAM states it prints show roughly 300 MB loaded for the 800 MB array.

[Chunk visualizations from the Jupyter notebook, before and after stacking, were attached here.]

A workaround could have been to save the data already stacked, but "MultiIndex cannot yet be serialized to netCDF". Maybe there is another workaround?

(Sorry for the long post)
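The `print_ram_state` and `mb` helpers are not shown in the thread; hypothetical definitions consistent with their use above, assuming `psutil` for reading process memory, could be:

```python
import os

import psutil  # assumed: used to read this process's resident memory


def mb(nbytes):
    """Convert a number of bytes to megabytes."""
    return nbytes / 1024 ** 2


def print_ram_state():
    """Print the resident set size (RSS) of the current process in MB."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"RAM state: {mb(rss):.0f} MB")
```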
I think it depends on the chunk size.

I am not sure where […].

You can do […].
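The exact snippet in this reply was lost in extraction; as a hypothetical illustration of the chunk-size point, fewer and larger chunks keep the dask graph small while still bounding memory:

```python
import xarray as xr

# Hypothetical chunking: ~13 MB chunks instead of single-element chunks.
# chunks=dict(x=1, y=1) above creates 512 * 2048 tiny tasks, which is
# itself expensive; larger chunks keep the graph manageable.
da = xr.open_dataarray("da.nc", chunks={"x": 64, "y": 256})
mda = da.stack(px=("x", "y"))  # still lazy with a dask-backed array
```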
Yes, I'm not very familiar with chunks; it seems that it's not good to have too many of them.

Sorry, it should have been […].
Yes, after some more experiments I found out that the second chunk size after stacking is […]. The formula for X is something like:

[…]

So, the minimum value for X is […]. That's why I was saying that "chunks along the second stacking dimension seem to be merged". This might be normal, just unexpected, and still quite obscure to me. And it must be happening on the dask side anyway. Thanks a lot for your insights.
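A small toy sketch (not from the thread) to observe this merging directly:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((4, 6)), dims=("x", "y")).chunk({"x": 1, "y": 2})
print(da.chunks)   # ((1, 1, 1, 1), (2, 2, 2))

mda = da.stack(px=("x", "y"))
# On the versions discussed in this thread, dask's reshape keeps the
# chunking along "x" but merges the chunks along "y", so each resulting
# chunk along "px" covers a whole row.
print(mda.chunks)
```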
Ah yes, thanks! I thought […].
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the stale label.
Closing as mostly resolved, I think? Please reopen if not.
Stacking is loading the data into memory, which is unexpected, or at least undocumented, afaik.
MCVE Code Sample
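The original snippet was not captured here; a minimal sketch consistent with the description, reusing the file name and shape from the comments above, would be:

```python
import numpy as np
import xarray as xr

# Write a ~800 MB array to disk
shape = 512, 2048, 100
xr.DataArray(np.random.randn(*shape), dims=("x", "y", "z")).to_netcdf("da.nc")

# Lazy open: almost no memory is used at this point
da = xr.open_dataarray("da.nc")

# Stacking two dims triggers a reshape, which loads the full array
mda = da.stack(px=("x", "y"))
```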
Problem Description
Using the xarray.DataArray.stack method loads the data into memory, which is unexpected behavior, or at least undocumented afaik.

Versions
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: None
xarray: 0.15.1
pandas: 1.0.3
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.16.0
distributed: 2.16.0
matplotlib: 3.2.1
cartopy: None
seaborn: 0.10.1
numbagg: None
setuptools: 46.4.0.post20200518
pip: 20.1.1
conda: 4.8.3
pytest: 5.4.2
IPython: 7.14.0
sphinx: 3.0.4