Add some warnings about rechunking to the docs (#6569)
* Dask doc changes
* small change
* More edits
* Update doc/user-guide/dask.rst
* Update doc/user-guide/dask.rst
* Back to one liners
Co-authored-by: Maximilian Roos <[email protected]>
@@ -95,13 +95,21 @@ use :py:func:`~xarray.open_mfdataset`::

 This function will automatically concatenate and merge datasets into one in
 the simple cases that it understands (see :py:func:`~xarray.combine_by_coords`
-for the full disclaimer). By default, :py:meth:`~xarray.open_mfdataset` will chunk each
+for the full disclaimer). By default, :py:func:`~xarray.open_mfdataset` will chunk each
 netCDF file into a single Dask array; again, supply the ``chunks`` argument to
 control the size of the resulting Dask arrays. In more complex cases, you can
-open each file individually using :py:meth:`~xarray.open_dataset` and merge the result, as
-described in :ref:`combining data`. Passing the keyword argument ``parallel=True`` to :py:meth:`~xarray.open_mfdataset` will speed up the reading of large multi-file datasets by
+open each file individually using :py:func:`~xarray.open_dataset` and merge the result, as
+described in :ref:`combining data`. Passing the keyword argument ``parallel=True`` to
+:py:func:`~xarray.open_mfdataset` will speed up the reading of large multi-file datasets by
 executing those read tasks in parallel using ``dask.delayed``.

+.. warning::
+
+    :py:func:`~xarray.open_mfdataset` called without the ``chunks`` argument will return
+    Dask arrays with chunk sizes equal to the individual files. Re-chunking
+    the dataset after creation with ``ds.chunk()`` will lead to an ineffective use of
+    memory and is not recommended.
+
 You'll notice that printing a dataset still shows a preview of array values,
 even if they are actually Dask arrays. We can do this quickly with Dask because
 we only need to compute the first few values (typically from the first block).
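
For context, a minimal sketch of the opening pattern this hunk documents, assuming a hypothetical set of netCDF files matching ``data/*.nc`` with a ``time`` dimension; the chunk size is illustrative only::

    import xarray as xr

    # Chunk at open time rather than rechunking afterwards with ds.chunk()
    ds = xr.open_mfdataset(
        "data/*.nc",           # hypothetical glob of netCDF files
        chunks={"time": 100},  # hypothetical chunk size along "time"
        parallel=True,         # read the files in parallel using dask.delayed
    )
    print(ds)  # lazy preview; values are not computed yet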
@@ -224,6 +232,7 @@ disk.
 available memory.

 .. note::
+
     For more on the differences between :py:meth:`~xarray.Dataset.persist` and
     :py:meth:`~xarray.Dataset.compute` see this `Stack Overflow answer <https://stackoverflow.com/questions/41806850/dask-difference-between-client-persist-and-client-compute>`_ and the `Dask documentation <https://distributed.dask.org/en/latest/manage-computation.html#dask-collections-to-futures>`_.
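
As a hedged illustration of the ``persist``/``compute`` distinction in the note above, assuming ``ds`` is a Dask-backed dataset and that it has a datetime ``time`` coordinate (the grouping is hypothetical)::

    # A lazy, Dask-backed intermediate result
    climatology = ds.groupby("time.month").mean()

    # compute() evaluates the graph and returns a NumPy-backed copy in local memory
    result = climatology.compute()

    # persist() also evaluates the graph, but keeps the result as Dask arrays
    # held in (worker) memory, so later operations reuse it without recomputing
    climatology = climatology.persist()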
@@ -236,6 +245,11 @@ sizes of Dask arrays is done with the :py:meth:`~xarray.Dataset.chunk` method:

 <https://en.wikipedia.org/wiki/Embarrassingly_parallel>`__ "map" type operations
 where a function written for processing NumPy arrays should be repeatedly
 applied to xarray objects containing Dask arrays. It works similarly to
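
One such "map" type operation could look like the following sketch, applying a plain NumPy function block-wise with :py:func:`~xarray.apply_ufunc`; the function, variable name, and dtype are hypothetical::

    import numpy as np
    import xarray as xr

    def scale(block):
        # a plain NumPy function, applied to each block of the Dask-backed array
        return block * 2.0

    scaled = xr.apply_ufunc(
        scale,
        ds["temperature"],     # hypothetical variable name
        dask="parallelized",   # apply lazily, block by block
        output_dtypes=[np.float64],
    )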
@@ -542,18 +555,20 @@ larger chunksizes.
 Optimization Tips
 -----------------

-With analysis pipelines involving both spatial subsetting and temporal resampling, Dask performance can become very slow in certain cases. Here are some optimization tips we have found through experience:
+With analysis pipelines involving both spatial subsetting and temporal resampling, Dask performance
+can become very slow or memory hungry in certain cases. Here are some optimization tips we have found
+through experience:

-1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early in the pipeline, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling triggers some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet. (See `Dask issue #746 <https://github.com/dask/dask/issues/746>`_).
+1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early in the pipeline, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling triggers some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet. (See `Dask issue #746 <https://github.com/dask/dask/issues/746>`_). More generally, ``groupby()`` is a costly operation and does not (yet) perform well on datasets split across multiple files (see :pull:`5734` and linked discussions there).

 2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_)

-3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load subsets of data which span multiple chunks. On individual files, prefer to subset before chunking (suggestion 1).
+
+4. Chunk as early as possible, and avoid rechunking as much as possible. Always pass the ``chunks={}`` argument to :py:func:`~xarray.open_mfdataset` to avoid redundant file reads.

-4. Using the h5netcdf package by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset`
-   can be quicker than the default ``engine='netcdf4'`` that uses the netCDF4 package.
+5. Using the h5netcdf package by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset` can be quicker than the default ``engine='netcdf4'`` that uses the netCDF4 package.

-5. Some dask-specific tips may be found `here <https://docs.dask.org/en/latest/array-best-practices.html>`_.
+6. Some dask-specific tips may be found `here <https://docs.dask.org/en/latest/array-best-practices.html>`_.

-6. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be
-   useful in identifying performance bottlenecks.
+7. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be useful in identifying performance bottlenecks.
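
Taken together, a hedged sketch of a pipeline following tips 1-5 above; the file glob, variable and dimension names, chunk sizes, and selection ranges are all hypothetical::

    import xarray as xr

    # Tips 3-5: chunk at open time, keep spatial chunks small, try the h5netcdf engine
    ds = xr.open_mfdataset(
        "data/*.nc",
        chunks={"latitude": 10, "longitude": 10},
        engine="h5netcdf",
        parallel=True,
    )

    # Tip 1: do spatial/temporal indexing before resampling or grouping
    subset = ds.sel(latitude=slice(30, 60), longitude=slice(-10, 40))
    monthly = subset.resample(time="1MS").mean()

    # Tip 2: save the intermediate result to disk, then reload it for further work
    monthly.to_netcdf("monthly_mean.nc")
    monthly = xr.open_dataset("monthly_mean.nc", chunks={})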