What happened?

Trying to save a lazy xr.DataArray of datetime objects as a zarr forces a dask.compute operation and retrieves the data to the local notebook. This is generally not a problem for indices of datetime objects, as those are already stored locally and are generally small in size.

However, if the whole underlying array holds datetime objects, this can be a serious problem. In my case it simply crashed the scheduler upon attempting to retrieve the data persisted on the workers.

I managed to isolate the problem to the call stack below. The issue is in the encode_cf_datetime function.
What did you expect to happen?
Storing the data in zarr format should be performed directly by the dask workers, bypassing the scheduler/Client, if compute=True, and should remain a completely lazy operation if compute=False.
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr
import dask.array as da

test = xr.DataArray(
    data=da.full((20000, 20000), np.datetime64('2005-02-25T03:30', 'ns')),
    coords={'x': range(20000), 'y': range(20000)}
).to_dataset(name='test')

print(test.test.dtype)
# dtype('<M8[ns]')

test.to_zarr('test.zarr', compute=False)
# this will take a while and trigger the computation of the array.
# No data will actually be saved though.
MVCE confirmation
Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
File /env/lib/python3.8/site-packages/xarray/core/dataset.py:2036, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   2033 if encoding is None:
   2034     encoding = {}
-> 2036 return to_zarr(
   2037     self,
   2038     store=store,
   2039     chunk_store=chunk_store,
   2040     storage_options=storage_options,
   2041     mode=mode,
   2042     synchronizer=synchronizer,
   2043     group=group,
   2044     encoding=encoding,
   2045     compute=compute,
   2046     consolidated=consolidated,
   2047     append_dim=append_dim,
   2048     region=region,
   2049     safe_chunks=safe_chunks,
   2050 )

File /env/lib/python3.8/site-packages/xarray/backends/api.py:1431, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   1429 writer = ArrayWriter()
   1430 # TODO: figure out how to properly handle unlimited_dims
-> 1431 dump_to_store(dataset, zstore, writer, encoding=encoding)
   1432 writes = writer.sync(compute=compute)
   1434 if compute:

File /env/lib/python3.8/site-packages/xarray/backends/api.py:1119, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1116 if encoder:
   1117     variables, attrs = encoder(variables, attrs)
-> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:500, in ZarrStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    498 new_variables = set(variables) - existing_variable_names
    499 variables_without_encoding = {vn: variables[vn] for vn in new_variables}
--> 500 variables_encoded, attributes = self.encode(
    501     variables_without_encoding, attributes
    502 )
    504 if existing_variable_names:
    505     # Decode variables directly, without going via xarray.Dataset to
    506     # avoid needing to load index variables into memory.
    507     # TODO: consider making loading indexes lazy again?
    508     existing_vars, _, _ = conventions.decode_cf_variables(
    509         self.get_variables(), self.get_attrs()
    510     )

File /env/lib/python3.8/site-packages/xarray/backends/common.py:200, in AbstractWritableDataStore.encode(self, variables, attributes)
    183 def encode(self, variables, attributes):
    184     """
    185     Encode the variables and attributes in this store
    (...)
    199     """
--> 200     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    201     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}
    202     return variables, attributes

File /env/lib/python3.8/site-packages/xarray/backends/common.py:200, in <dictcomp>(.0)
    183 def encode(self, variables, attributes):
    184     """
    185     Encode the variables and attributes in this store
    (...)
    199     """
--> 200     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    201     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}
    202     return variables, attributes

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:459, in ZarrStore.encode_variable(self, variable)
    458 def encode_variable(self, variable):
--> 459     variable = encode_zarr_variable(variable)
    460     return variable

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:258, in encode_zarr_variable(var, needs_copy, name)
    237 def encode_zarr_variable(var, needs_copy=True, name=None):
    238     """
    239     Converts an Variable into an Variable which follows some
    240     of the CF conventions:
    (...)
    255     A variable which has been encoded as described above.
    256     """
--> 258     var = conventions.encode_cf_variable(var, name=name)
    260     # zarr allows unicode, but not variable-length strings, so it's both
    261     # simpler and more compact to always encode as UTF-8 explicitly.
    262     # TODO: allow toggling this explicitly via dtype in encoding.
    263     coder = coding.strings.EncodedStringCoder(allows_unicode=True)

File /env/lib/python3.8/site-packages/xarray/conventions.py:273, in encode_cf_variable(var, needs_copy, name)
    264 ensure_not_multiindex(var, name=name)
    266 for coder in [
    267     times.CFDatetimeCoder(),
    268     times.CFTimedeltaCoder(),
    (...)
    271     variables.UnsignedIntegerCoder(),
    272 ]:
--> 273     var = coder.encode(var, name=name)
    275 # TODO(shoyer): convert all of these to use coders, too:
    276 var = maybe_encode_nonstring_dtype(var, name=name)

File /env/lib/python3.8/site-packages/xarray/coding/times.py:659, in CFDatetimeCoder.encode(self, variable, name)
    655 dims, data, attrs, encoding = unpack_for_encoding(variable)
    656 if np.issubdtype(data.dtype, np.datetime64) or contains_cftime_datetimes(
    657     variable
    658 ):
--> 659     (data, units, calendar) = encode_cf_datetime(
    660         data, encoding.pop("units", None), encoding.pop("calendar", None)
    661     )
    662     safe_setitem(attrs, "units", units, name=name)
    663     safe_setitem(attrs, "calendar", calendar, name=name)

File /env/lib/python3.8/site-packages/xarray/coding/times.py:592, in encode_cf_datetime(dates, units, calendar)
    582 def encode_cf_datetime(dates, units=None, calendar=None):
    583     """Given an array of datetime objects, returns the tuple `(num, units,
    584     calendar)` suitable for a CF compliant time variable.
    (...)
    590     cftime.date2num
    591     """
--> 592     dates = np.asarray(dates)
    594     if units is None:
    595         units = infer_datetime_units(dates)
Anything else we need to know?
Our system uses dask_gateway on AWS infrastructure (S3 for storage).
This is correct -- CFDatetimeCoder.encode is not lazy, even if the inputs are Dask arrays.
We would welcome contributions to fix this. This would entail making the encode method look similar to the decode method (i.e., using lazy_elemwise_func).
We would also need a fall-back method for determining appropriate time units without looking at the array values; something like seconds since 1900-01-01T00:00:00 would probably be a reasonable choice.
@dcherian Thanks; I agree that this seems to be the same as #7028. Just as a note, I have had 3 people reach out to me (1 from UNH, 2 from across the globe) thanking me for the workaround in my message. So this does seem to be a commonly encountered issue.
Environment
xarray: 2022.3.0
pandas: 1.5.0
numpy: 1.22.4
scipy: 1.9.1
netCDF4: 1.6.1
pydap: installed
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.13.2
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.2
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.9.2
distributed: 2022.9.2
matplotlib: 3.6.0
cartopy: 0.20.2
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: 0.13.0
setuptools: 65.4.1
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 8.5.0
sphinx: None