1 change: 1 addition & 0 deletions changes/3534.feature.md
@@ -0,0 +1 @@
Adds support for `RectilinearChunkGrid`, enabling arrays with variable chunk sizes along each dimension in Zarr v3. Users can now specify irregular chunking patterns using nested sequences: `chunks=[[10, 20, 30], [25, 25, 25, 25]]` creates an array with 3 chunks of sizes 10, 20, and 30 along the first dimension, and 4 chunks of size 25 along the second dimension. This feature is useful for data with non-uniform structure or when aligning chunks with existing data partitions. Note that `RectilinearChunkGrid` is only supported in Zarr format 3 and cannot be used with sharding or when creating arrays from existing data via `from_array()`.
118 changes: 118 additions & 0 deletions docs/user-guide/arrays.md
@@ -566,6 +566,124 @@ In this example a shard shape of (1000, 1000) and a chunk shape of (100, 100) is
This means that `10*10` chunks are stored in each shard, and there are `10*10` shards in total.
Without the `shards` argument, there would be 10,000 chunks stored as individual files.

## Variable Chunking (Zarr v3)

In addition to regular chunking where all chunks have the same size, Zarr v3 supports
**variable chunking** (also called rectilinear chunking), where chunks can have different
sizes along each dimension. This is useful when your data has non-uniform structure or
when you need to align chunks with existing data partitions.

### Basic usage

To create an array with variable chunking, provide a nested sequence to the `chunks`
parameter instead of a regular tuple:

```python exec="true" session="arrays" source="above" result="ansi"
# Create an array with variable chunk sizes
z = zarr.create_array(
    store='data/example-21.zarr',
    shape=(60, 100),
    chunks=[[10, 20, 30], [25, 25, 25, 25]],  # Variable chunks
    dtype='float32',
    zarr_format=3
)
print(z)
print(f"Chunk grid type: {type(z.metadata.chunk_grid).__name__}")
```

In this example, the first dimension is divided into 3 chunks with sizes 10, 20, and 30
(totaling 60), and the second dimension is divided into 4 chunks of size 25 (totaling 100).
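
The arithmetic above can be checked before the array is created. The following is a minimal sketch in plain Python: `chunk_boundaries` is a hypothetical helper, not part of the Zarr API, and it assumes that the chunk sizes along a dimension must sum exactly to that dimension's length, as they do in this example.

```python
from itertools import accumulate


def chunk_boundaries(chunk_sizes, dim_length):
    """Return the (start, stop) offsets of each chunk along one dimension."""
    if not chunk_sizes or any(size <= 0 for size in chunk_sizes):
        raise ValueError("chunk sizes must be a non-empty list of positive ints")
    stops = list(accumulate(chunk_sizes))  # exclusive end offset of each chunk
    # Assumption: the sizes must cover the dimension exactly.
    if stops[-1] != dim_length:
        raise ValueError(f"chunk sizes sum to {stops[-1]}, expected {dim_length}")
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


shape = (60, 100)
chunks = [[10, 20, 30], [25, 25, 25, 25]]
for axis, (sizes, length) in enumerate(zip(chunks, shape)):
    print(f"axis {axis}: {chunk_boundaries(sizes, length)}")
# axis 0: [(0, 10), (10, 30), (30, 60)]
# axis 1: [(0, 25), (25, 50), (50, 75), (75, 100)]
```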

### Reading and writing

Arrays with variable chunking support the same read/write operations as regular arrays:

```python exec="true" session="arrays" source="above" result="ansi"
# Write data
data = np.arange(60 * 100, dtype='float32').reshape(60, 100)
z[:] = data

# Read data back
result = z[:]
print(f"Data matches: {np.all(result == data)}")
print(f"Slice [10:30, 50:75]: {z[10:30, 50:75].shape}")
```
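
To see why the slice in this example is cheap, it helps to work out which chunks it overlaps. The sketch below does that with a binary search over the cumulative chunk offsets. `chunks_touched` is a hypothetical helper written for illustration; it describes the geometry only and makes no claim about how Zarr resolves selections internally.

```python
from bisect import bisect_right
from itertools import accumulate


def chunks_touched(chunk_sizes, start, stop):
    """Indices of the chunks along one dimension that overlap [start, stop)."""
    stops = list(accumulate(chunk_sizes))  # exclusive end offset of each chunk
    if not 0 <= start < stop <= stops[-1]:
        raise ValueError("selection must be a non-empty range inside the dimension")
    first = bisect_right(stops, start)     # chunk holding element `start`
    last = bisect_right(stops, stop - 1)   # chunk holding the last selected element
    return list(range(first, last + 1))


chunks = [[10, 20, 30], [25, 25, 25, 25]]
selection = [(10, 30), (50, 75)]  # the slice z[10:30, 50:75]
for axis, (sizes, (start, stop)) in enumerate(zip(chunks, selection)):
    print(f"axis {axis}: chunks {chunks_touched(sizes, start, stop)}")
# axis 0: chunks [1]
# axis 1: chunks [2]
```

The slice `[10:30, 50:75]` lines up with chunk boundaries on both axes, so it overlaps exactly one chunk, at grid position `(1, 2)`.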

### Accessing chunk information

With variable chunking, the standard `.chunks` property is not available since chunks
have different sizes. Instead, access chunk information through the chunk grid:
Comment on lines +615 to +616
Contributor
I think it would be better if .chunks just had a different type (tuple of tuples of ints)


```python exec="true" session="arrays" source="above" result="ansi"
from zarr.core.chunk_grids import RectilinearChunkGrid

# Access the chunk grid
chunk_grid = z.metadata.chunk_grid
print(f"Chunk grid type: {type(chunk_grid).__name__}")

# Get chunk shapes for each dimension
if isinstance(chunk_grid, RectilinearChunkGrid):
    print(f"Dimension 0 chunk sizes: {chunk_grid.chunk_shapes[0]}")
    print(f"Dimension 1 chunk sizes: {chunk_grid.chunk_shapes[1]}")
    print(f"Total number of chunks: {chunk_grid.get_nchunks((60, 100))}")
```
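
The chunk count and the shape of any individual chunk follow directly from the per-dimension sizes. The sketch below works on a plain tuple of tuples laid out like the `chunk_shapes` values printed above (one sequence of sizes per dimension); that layout is inferred from this example, and `chunk_shape_at` is a hypothetical helper rather than a Zarr function.

```python
from itertools import product
from math import prod

# Same layout as the example above: one sequence of chunk sizes per dimension.
chunk_shapes = ((10, 20, 30), (25, 25, 25, 25))

# The total number of chunks is the product of the per-dimension chunk counts.
nchunks = prod(len(sizes) for sizes in chunk_shapes)
print(f"Total number of chunks: {nchunks}")  # 3 * 4 = 12


def chunk_shape_at(chunk_shapes, index):
    """Shape of the chunk at a given position in the chunk grid."""
    return tuple(sizes[i] for sizes, i in zip(chunk_shapes, index))


# Walk the chunk grid and print the shape of every chunk.
for index in product(*(range(len(sizes)) for sizes in chunk_shapes)):
    print(index, chunk_shape_at(chunk_shapes, index))
```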

### Use cases

Variable chunking is particularly useful for:

1. **Irregular time series**: When your data has non-uniform time intervals, you can
create chunks that align with your sampling periods.

2. **Aligning with partitions**: When you need to match chunk boundaries with existing
data partitions or structural boundaries in your data.

3. **Optimizing access patterns**: When certain regions of your array are accessed more
frequently, you can use smaller chunks there for finer-grained access.
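
As an illustration of the third use case, the sketch below builds a list of chunk sizes that is fine-grained inside a frequently accessed region and coarse elsewhere. `split_evenly` and `sizes_with_hot_region` are hypothetical helpers, and the region bounds and chunk sizes are made-up numbers; the resulting list is the kind of value that would be passed as one entry of the nested `chunks` argument.

```python
def split_evenly(length, size):
    """Split `length` into chunks of `size`, with a smaller final chunk if needed."""
    full, rest = divmod(length, size)
    return [size] * full + ([rest] if rest else [])


def sizes_with_hot_region(dim_length, hot_start, hot_stop, coarse, fine):
    """Chunk sizes using `fine` inside [hot_start, hot_stop) and `coarse` elsewhere."""
    if coarse <= 0 or fine <= 0:
        raise ValueError("chunk sizes must be positive")
    if not 0 <= hot_start <= hot_stop <= dim_length:
        raise ValueError("hot region must lie inside the dimension")
    return (
        split_evenly(hot_start, coarse)
        + split_evenly(hot_stop - hot_start, fine)
        + split_evenly(dim_length - hot_stop, coarse)
    )


sizes = sizes_with_hot_region(1000, hot_start=400, hot_stop=500, coarse=200, fine=25)
print(sizes)       # [200, 200, 25, 25, 25, 25, 200, 200, 100]
print(sum(sizes))  # 1000
```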

### Example: Daily time series chunked by calendar month

```python exec="true" session="arrays" source="above" result="ansi"
# Daily measurements for one (non-leap) year, chunked by calendar month
# Each chunk along the first dimension covers one month (28 to 31 days)
z_timeseries = zarr.create_array(
    store='data/example-22.zarr',
    shape=(365, 100),  # 365 days, 100 measurements per day
    chunks=[[31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], [100]],  # Days per month
    dtype='float64',
    zarr_format=3
)
print(f"Created array with shape {z_timeseries.shape}")
print(f"Chunk shapes: {z_timeseries.metadata.chunk_grid.chunk_shapes}")
print(f"Number of chunks along the time axis: {len(z_timeseries.metadata.chunk_grid.chunk_shapes[0])} (one per month)")
```
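
The hard-coded month lengths above are only correct for a non-leap year. One way to avoid that, sketched here with the standard-library `calendar` module, is to compute them for a given year; `days_per_month` is a hypothetical helper name, and the result would be used as the first entry of the nested `chunks` argument together with a matching first dimension in `shape`.

```python
import calendar


def days_per_month(year):
    """Number of days in each month of `year`, January through December."""
    # calendar.monthrange returns (weekday of the first day, number of days).
    return [calendar.monthrange(year, month)[1] for month in range(1, 13)]


for year in (2023, 2024):
    sizes = days_per_month(year)
    print(year, sizes, "total:", sum(sizes))
# 2023 has 28 days in February (365 in total); 2024 has 29 (366 in total).
```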

### Limitations

Variable chunking has some important limitations:

1. **Zarr v3 only**: This feature is only available when using `zarr_format=3`.
Attempting to use variable chunks with `zarr_format=2` will raise an error.

2. **Not compatible with sharding**: You cannot use variable chunking together with
the sharding feature. Arrays must use either variable chunking or sharding, but not both.
Comment on lines +669 to +670
Contributor
I hope this is a temporary limitation! There's a natural extension of rectilinear chunk grids to rectilinear shard grids.

Member Author

3. **Not compatible with `from_array()`**: Variable chunking cannot be used when creating
arrays from existing data using [`zarr.from_array`][]. This is because the function needs
to partition the input data, which requires regular chunk sizes.

4. **No `.chunks` property**: For arrays with variable chunking, accessing the `.chunks`
property will raise a `NotImplementedError`. Use `.metadata.chunk_grid.chunk_shapes`
instead.

```python exec="true" session="arrays" source="above" result="ansi"
# This will raise an error
try:
    _ = z.chunks
except NotImplementedError as e:
    print(f"Error: {e}")
```
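
Since sharding and `from_array()` both require regular chunks, it can be useful to check whether a nested chunk specification is regular in practice and, if so, to fall back to an ordinary chunk tuple. The sketch below is one way to do that. `as_regular_chunks` is a hypothetical helper, not part of Zarr, and it assumes that a smaller final chunk along a dimension is equivalent to the partial edge chunk of a regular grid.

```python
def as_regular_chunks(nested_chunks):
    """Return a uniform chunk shape if every dimension is regularly chunked, else None."""
    shape = []
    for sizes in nested_chunks:
        if not sizes:
            raise ValueError("each dimension needs at least one chunk")
        first, body, last = sizes[0], sizes[:-1], sizes[-1]
        # Assumption: all chunks but the last must be equal, and the last
        # chunk may be smaller (a partial edge chunk) but never larger.
        if any(size != first for size in body) or last > first:
            return None
        shape.append(first)
    return tuple(shape)


print(as_regular_chunks([[25, 25, 25, 25], [10, 10, 5]]))     # (25, 10)
print(as_regular_chunks([[10, 20, 30], [25, 25, 25, 25]]))    # None
```

A result of `None` means the layout is genuinely variable, so the limitations listed above apply.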

## Missing features in 3.0

The following features have not been ported to 3.0 yet.
4 changes: 3 additions & 1 deletion docs/user-guide/extending.md
@@ -85,4 +85,6 @@ classes by implementing the interface defined in [`zarr.abc.buffer.BufferPrototy

## Other extensions

-In the future, Zarr will support writing custom custom data types and chunk grids.
+Zarr now includes built-in support for `RectilinearChunkGrid` (variable chunking), which allows arrays to have different chunk sizes along each dimension. See the [Variable Chunking](arrays.md#variable-chunking-zarr-v3) section in the Arrays guide for more information.
+
+In the future, Zarr will support writing fully custom chunk grids and custom data types.
20 changes: 15 additions & 5 deletions src/zarr/api/synchronous.py
@@ -13,7 +13,7 @@
from zarr.errors import ZarrDeprecationWarning

if TYPE_CHECKING:
-from collections.abc import Iterable
+from collections.abc import Iterable, Sequence

import numpy as np
import numpy.typing as npt
@@ -29,6 +29,7 @@
)
from zarr.core.array_spec import ArrayConfigLike
from zarr.core.buffer import NDArrayLike, NDArrayLikeOrScalar
+from zarr.core.chunk_grids import ChunkGrid
from zarr.core.chunk_key_encodings import ChunkKeyEncoding, ChunkKeyEncodingLike
from zarr.core.common import (
JSON,
@@ -821,7 +822,7 @@ def create_array(
shape: ShapeLike | None = None,
dtype: ZDTypeLike | None = None,
data: np.ndarray[Any, np.dtype[Any]] | None = None,
-chunks: tuple[int, ...] | Literal["auto"] = "auto",
+chunks: tuple[int, ...] | Sequence[Sequence[int]] | ChunkGrid | Literal["auto"] = "auto",
shards: ShardsLike | None = None,
filters: FiltersLike = "auto",
compressors: CompressorsLike = "auto",
@@ -857,9 +858,14 @@ def create_array(
data : np.ndarray, optional
Array-like data to use for initializing the array. If this parameter is provided, the
``shape`` and ``dtype`` parameters must be ``None``.
-chunks : tuple[int, ...] | Literal["auto"], default="auto"
-Chunk shape of the array.
-If chunks is "auto", a chunk shape is guessed based on the shape of the array and the dtype.
+chunks : tuple[int, ...] | Sequence[Sequence[int]] | ChunkGrid | Literal["auto"], default="auto"
+Chunk shape of the array. Several formats are supported:
+
+- tuple of ints: Creates a RegularChunkGrid with uniform chunks, e.g., ``(10, 10)``
+- nested sequence: Creates a RectilinearChunkGrid with variable-sized chunks (Zarr format 3 only),
+e.g., ``[[10, 20, 30], [5, 5]]`` creates variable chunks along each dimension
+- ChunkGrid instance: Uses the provided chunk grid directly (Zarr format 3 only)
+- "auto": Automatically determines chunk shape based on array shape and dtype
shards : tuple[int, ...], optional
Shard shape of the array. The default value of ``None`` results in no sharding at all.
filters : Iterable[Codec] | Literal["auto"], optional
@@ -1033,6 +1039,10 @@ def from_array(
- tuple[int, ...]: A tuple of integers representing the chunk shape.

If not specified, defaults to "keep" if data is a zarr Array, otherwise "auto".
+
+.. note::
+Variable chunking (RectilinearChunkGrid) is not supported when creating arrays from
+existing data. Use regular chunking (uniform chunk sizes) instead.
shards : tuple[int, ...], optional
Shard shape of the array.
Following values are supported: