diff --git a/docs/docs/icechunk-python/xarray.md b/docs/docs/icechunk-python/xarray.md
index a2551db6a..5f6469150 100644
--- a/docs/docs/icechunk-python/xarray.md
+++ b/docs/docs/icechunk-python/xarray.md
@@ -12,15 +12,21 @@ and `icechunk.xarray.to_icechunk` methods.
 pip install "xarray>=2025.1.1"
 ```
 
-!!! note "`to_icechunk` vs `to_zarr`"
+!!!note "`to_icechunk` vs `to_zarr`"
 
     [`xarray.Dataset.to_zarr`](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html#xarray.Dataset.to_zarr)
-    and [`to_icechunk`](./reference.md#icechunk.xarray.to_icechunk) are nearly functionally identical. In a a distributed context, e.g.
+    and [`to_icechunk`](./reference.md#icechunk.xarray.to_icechunk) are nearly functionally identical.
+
+    In a distributed context, e.g.
     writes orchestrated with `multiprocesssing` or a `dask.distributed.Client` and `dask.array`, you *must* use `to_icechunk`.
     This will ensure that you can execute a commit that successfully records all remote writes.
     See [these docs on orchestrating parallel writes](./parallel.md) and [these docs on dask.array with distributed](./dask.md#icechunk-dask-xarray)
     for more.
 
+    If using `to_zarr`, remember to set `zarr_format=3, consolidated=False`. Consolidated metadata
+    is unnecessary (and unsupported) in Icechunk. Icechunk already organizes the dataset metadata
+    in a way that makes it very fast to fetch from storage.
+
 In this example, we'll explain how to create a new Icechunk repo, write some sample data
 to it, and append data a second block of data using Icechunk's version control features.
 
@@ -82,19 +88,13 @@ Create a new writable session on the `main` branch to get the `IcechunkStore`:
 session = repo.writable_session("main")
 ```
 
-Writing Xarray data to Icechunk is as easy as calling `Dataset.to_zarr`:
+Writing Xarray data to Icechunk is as easy as calling `to_icechunk`:
 
 ```python
-ds1.to_zarr(session.store, zarr_format=3, consolidated=False)
-```
+from icechunk.xarray import to_icechunk
 
-!!! note
-
-    1. [Consolidated metadata](https://docs.xarray.dev/en/latest/user-guide/io.html#consolidated-metadata)
-       is unnecessary (and unsupported) in Icechunk.
-       Icechunk already organizes the dataset metadata in a way that makes it very
-       fast to fetch from storage.
-    2. `zarr_format=3` is required until the default Zarr format changes in Xarray.
+to_icechunk(ds, session)
+```
 
 After writing, we commit the changes using the session:
 
@@ -111,7 +111,7 @@ this reason. Again, we'll use `Dataset.to_zarr`, this time with `append_dim='tim
 ```python
 # we have to get a new session after committing
 session = repo.writable_session("main")
-ds2.to_zarr(session.store, append_dim='time')
+to_icechunk(ds2, session, append_dim='time')
 ```
 
 And then we'll commit the changes:
diff --git a/icechunk-python/python/icechunk/xarray.py b/icechunk-python/python/icechunk/xarray.py
index 601906eee..5f8ebf315 100644
--- a/icechunk-python/python/icechunk/xarray.py
+++ b/icechunk-python/python/icechunk/xarray.py
@@ -282,7 +282,7 @@ def to_icechunk(
 
     - If ``region`` is set, _all_ variables in a dataset must have at
       least one dimension in common with the region. Other variables
-      should be written in a separate single call to ``to_zarr()``.
+      should be written in a separate single call to ``to_icechunk()``.
     - Dimensions cannot be included in both ``region`` and
       ``append_dim`` at the same time. To create empty arrays to fill
       in with ``region``, use the `XarrayDatasetWriter` directly.