Fix for Dataset.to_zarr with both `consolidated` and `write_empty_chunks` #8326

Metamess · 2023-10-17T20:38:25Z

Fixed an issue where Dataset.to_zarr with both consolidated and write_empty_chunks failed with a ReadOnlyError

Closes to_zarr with region and write_empty_chunks keywords raises ReadOnlyError if using consolidated metadata #8323
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

max-sixty · 2023-10-18T00:11:45Z

This looks good @Metamess . Thank you!

I'll merge it soon unless anyone who knows it better has any feedback (I don't know this code that well, though the tests look good)

jhamman

Thanks @Metamess for the fix here. I'd like to understand a bit more of the "why" behind the bug here before shipping this fix. A specific question below...

jhamman · 2023-10-18T23:00:55Z

xarray/backends/zarr.py

@@ -676,7 +676,7 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
                # and append_dim.
                if self._write_empty is not None:
                    zarr_array = zarr.open(
-                        store=self.zarr_group.store,
+                        store=self.zarr_group.chunk_store,


Do you understand why the store interface is read-only and the chunk_store is not? I'm curious if this happens for all stores or only the directory store?

Metamess · 2023-10-19

Hey @jhamman, thanks for the reply! This is just my understanding from going through the code, but I feel I understand it fairly well, so here's my explanation of the why:

Even though to_zarr is a write operation, the zarr store is first opened to create a representative ZarrStore object. For this purpose, only the metadata has to be read.

By default, a zarr stores separate metadata files per variable (`.zarray` for the array's shape, chunking and encoding information, and `.zattrs` for the attrs of a Dataset/DataArray), resulting in many small files. Since these are inefficient to open, especially over a network, the concept of 'consolidated metadata' was created as an experimental feature. This is effectively another metadata file, `.zmetadata`, living at the root of the zarr store and containing a combined copy of the metadata stored in the various per-variable files. This means you only need to open 1 file, which speeds up reads significantly.
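To make the idea concrete, here is a minimal stdlib-only sketch of what consolidation does. It is a toy model, not zarr's implementation: the store is a plain dict, the `consolidate` helper is hypothetical, and the JSON envelope is simplified. The key names follow the Zarr v2 layout, where per-array metadata lives under `.zarray` and `.zattrs` and the consolidated copy lives under `.zmetadata`.

```python
import json

# Toy stand-in for a Zarr v2 store: a flat mapping from key to bytes.
store = {
    ".zgroup": json.dumps({"zarr_format": 2}).encode(),
    "temperature/.zarray": json.dumps({"shape": [4], "chunks": [2]}).encode(),
    "temperature/.zattrs": json.dumps({"units": "K"}).encode(),
    "temperature/0": b"\x00" * 16,  # chunk data, not metadata
}

METADATA_NAMES = (".zgroup", ".zarray", ".zattrs")


def consolidate(store):
    """Copy every metadata document into a single root-level key.

    Hypothetical helper for illustration only.
    """
    metadata = {
        key: json.loads(value)
        for key, value in store.items()
        if key.rsplit("/", 1)[-1] in METADATA_NAMES
    }
    # Simplified envelope; the point is that one key now holds all metadata.
    store[".zmetadata"] = json.dumps({"metadata": metadata}).encode()
    return metadata


consolidated = consolidate(store)
```

A reader then only needs to fetch `.zmetadata` to learn about every variable, but the copy goes stale as soon as any of the original metadata keys change.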

Since the consolidated metadata file is a mere copy, changes to the data that would trigger an edit to the zarr's metadata would cause the consolidated metadata to become 'out of sync'. To prevent this from happening, the choice was made to implement a ConsolidatedMetadataStore subclass of Zarr's Store class, which is used when the zarr is opened using consolidated metadata. This class has implemented the write-operations __setitem__ and __delitem__ to raise a ReadOnlyError instead of performing the operation.

The ConsolidatedMetadataStore is the store attribute of a zarr Group object, which in turn is the zarr_group attribute of the xarray ZarrStore object. To support using different Store implementations for writing metadata and writing chunk data, a Group can be given a separate Store object via the chunk_store parameter (and attribute), which will default to store if not provided. Our situation is exactly the use case this split is meant to support.
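The split between the two stores can be sketched without zarr at all. The classes below are simplified stand-ins that I made up for illustration (`ReadOnlyMetadataStore`, `ToyGroup`, and a local `ReadOnlyError`); they only mimic the behaviour described above and are not zarr's actual classes.

```python
class ReadOnlyError(Exception):
    """Local stand-in for the error raised on a forbidden write."""


class ReadOnlyMetadataStore(dict):
    """Toy model of a metadata store that rejects writes and deletes."""

    def __setitem__(self, key, value):
        raise ReadOnlyError(f"cannot write {key!r}: store is read-only")

    def __delitem__(self, key):
        raise ReadOnlyError(f"cannot delete {key!r}: store is read-only")


class ToyGroup:
    """Toy model of a group holding a metadata store and a chunk store."""

    def __init__(self, store, chunk_store=None):
        self.store = store
        # chunk_store falls back to store when it is not provided.
        self.chunk_store = store if chunk_store is None else chunk_store


metadata = ReadOnlyMetadataStore({".zmetadata": b"{}"})
chunks = {}
group = ToyGroup(store=metadata, chunk_store=chunks)

# Writing chunk data through chunk_store succeeds.
group.chunk_store["temperature/0"] = b"\x00" * 16

# Writing the same key through store hits the read-only guard.
try:
    group.store["temperature/0"] = b"\x00" * 16
    write_via_store_failed = False
except ReadOnlyError:
    write_via_store_failed = True
```

This mirrors the one-line change in the diff: the write has to go through `chunk_store`, because `store` is the read-only view of the consolidated metadata.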

The open_consolidated() function in zarr.convenience.py enforces a mode of r or r+ (and to_zarr with region provided enforces a mode of r+). This function makes sure the resulting Group has a store of type ConsolidatedMetadataStore and a 'normal' Store subtype for chunk_store. The exact type depends on whether a local path or a URL of some sort was used, but the point is that it's not a read-only ConsolidatedMetadataStore.

Finally, the question remains: "Is writing chunk data actually a safe operation?" The answer is yes, because no metadata would be changed by to_zarr with the region parameter:

Because the write mode is enforced to be r+, no new variables can be added to the store (this is also checked and enforced in xarray.backends.api.py::to_zarr()). Existing variables already have their attrs included in the consolidated metadata file.

The size of dimensions cannot be expanded; that would require a call using append_dim, which is mutually exclusive with region.
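The two conditions above can be phrased as explicit checks. This is a hypothetical stand-alone sketch, not xarray's actual validation code: the function `validate_region_write` and its parameters are invented here purely to illustrate why a region write leaves the metadata untouched.

```python
def validate_region_write(existing_variables, variables_to_write,
                          region=None, append_dim=None):
    """Reject any region write that would have to change store metadata.

    Hypothetical helper for illustration only.
    """
    # Growing a dimension needs append_dim, which cannot be combined with region.
    if region is not None and append_dim is not None:
        raise ValueError("region and append_dim are mutually exclusive")

    # A region write may only touch variables that already exist in the store.
    if region is not None:
        new_variables = set(variables_to_write) - set(existing_variables)
        if new_variables:
            raise ValueError(
                f"cannot add new variables in a region write: {sorted(new_variables)}"
            )


# A write into an existing variable passes both checks.
validate_region_write(
    existing_variables=["temperature"],
    variables_to_write=["temperature"],
    region={"time": slice(0, 2)},
)
```

If both checks pass, only chunk keys are written, so the consolidated copy of the metadata stays in sync.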

rabernat · 2023-10-19

Thanks for this excellent explanation.

I believe we created a bit of a mess with consolidated metadata. Given the way things work in Zarr, I think this PR is an acceptable solution.

My one suggestion would be to put ☝️ that super useful explanation as an inline comment right in the code. It will be extremely helpful for future maintainers.

jhamman

Pending some inline documentation in backends/zarr.py, as requested by @rabernat, this looks great.

🙌 Thanks @Metamess for the excellent summary of the problem 🙌

dcherian · 2023-11-02T21:49:48Z

Excellent work @Metamess . Thank you!

welcome · 2023-11-02T23:20:10Z

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

Fixed an issue where Dataset.to_zarr with both consolidated and write_empty_chunks failed with a ReadOnlyError

9e69aaa

github-actions bot added the topic-backends, topic-zarr, and io labels Oct 17, 2023

max-sixty added the plan to merge label Oct 18, 2023

dcherian requested a review from rabernat October 18, 2023 20:47

jhamman reviewed Oct 18, 2023

rabernat approved these changes Oct 19, 2023

jhamman approved these changes Oct 20, 2023

Add comment

ca30960

Merge branch 'main' into issue-8323-to-zarr-readonlyerror

0f77100

dcherian enabled auto-merge (squash) November 2, 2023 21:50

dcherian disabled auto-merge November 2, 2023 21:50

dcherian enabled auto-merge (squash) November 2, 2023 21:50

dcherian disabled auto-merge November 2, 2023 21:51

dcherian changed the title ~~Fixed an issue where Dataset.to_zarr raised a ReadOnlyError~~ Fix for Dataset.to_zarr with both consolidated and write_empty_chunks Nov 2, 2023

dcherian enabled auto-merge (squash) November 2, 2023 21:51

dcherian merged commit 10f2aa1 into pydata:main Nov 2, 2023
