Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for Dataset.to_zarr with both consolidated and write_empty_chunks #8326

Merged
merged 3 commits into from
Nov 2, 2023

Conversation

Metamess
Copy link
Contributor

@Metamess Metamess commented Oct 17, 2023

Fixed an issue where Dataset.to_zarr with both consolidated and write_empty_chunks failed with a ReadOnlyError

@github-actions github-actions bot added topic-backends topic-zarr Related to zarr storage library io labels Oct 17, 2023
@max-sixty
Copy link
Collaborator

max-sixty commented Oct 18, 2023

This looks good @Metamess . Thank you!

I'll merge it soon unless anyone who knows it better has any feedback (I don't know this code that well, though the tests look good)

@max-sixty max-sixty added the plan to merge Final call for comments label Oct 18, 2023
@dcherian dcherian requested a review from rabernat October 18, 2023 20:47
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Metamess for the fix here. I'd like to understand a bit more of the "why" behind the bug here before shipping this fix. A specific question below...

@@ -676,7 +676,7 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
# and append_dim.
if self._write_empty is not None:
zarr_array = zarr.open(
store=self.zarr_group.store,
store=self.zarr_group.chunk_store,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you understand why the store interface is read-only and the chunk_store is not? I'm curious if this happens for all stores or only the directory store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jhamman , thanks for the reply! This is just my understanding from going through the code, but I feel like I understand fairly well, so here's my explanation as to the why:

  • Even though to_zarr is a write operation, the zarr store is first opened to create a representative ZarrStore object. For this purpose, only the metadata has to be read.
  • A zarr by default will have a .zmetadata file stored per variable (this file stores encoding information and the attrs of a Dataset/DataArray), resulting in many small files. Since this is inefficient to open, especially over a network, the concept of 'consolidated metadata' was created as an experimental feature. This is effectively another metadata file living at the root of the zarr store, containing a combined copy of the metadata that is stored in the various .zmetadata files. This means you only need to open 1 file, and it speeds up reads significantly.
  • Since the consolidated metadata file is a mere copy, changes to the data that would trigger an edit to the zarr's metadata would cause the consolidated metadata to become 'out of sync'. To prevent this from happening, the choice was made to implement a ConsolidatedMetadataStore subclass of Zarr's Store class, which is used when the zarr is opened using consolidated metadata. This class has implemented the write-operations __setitem__ and __delitem__ to raise a ReadOnlyError instead of performing the operation.
  • The ConsolidatedMetadataStore is the store attribute of a zarr Group object, which in turn is the zarr_group attribute of the xarray ZarrStore object. To support using different Store implementations for writing metadata and writing chunk data, a Group can be given a separate Store object via the chunk_store parameter (and attribute), which will default to store if not provided. Our situation is exactly the use case this split is meant to support.
  • The open_consolidated() function in zarr.convenience.py, enforces a mode of r or r+ (and to_zarr with region provided enforces a read mode of r+), and this function makes sure the resulting Group has a store of type ConsolidatedMetadataStore and a 'normal Store subtype for chunk_store. The exact type depends on if a local path was used, or a URL of some sort, but the point is that it's not a read-only ConsolidatedMetadataStore.
  • Finally, the question remains: "Is writing chunk data actually safe for operation?". The answer is yes, because no metadata would be changed by to_zarr with the region parameter:
    • Because the write mode is enforced to be r+, no new variables can be added to the store (this is also checked and enforced in xarray.backends.api.py::to_zarr()). Existing variables already have their attrs included in the consolidated metadata file.
    • The size of dimensions can not be expanded, that would require a call using append_dim which is mutually exclusive with region

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this excellent explanation.

I believe we created a bit of a mess with consolidated metadata. Given the way things work in Zarr, I think this PR is an acceptable solution.

My one suggestion would be to put ☝️ that super useful explanation as an inline comment right in the code. It will be extremely helpful for future maintainers.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending some inline documentation in the backends/zarr.py requested by @rabernat, this look great.

🙌 Thanks @Metamess for the excellent summary of the problem 🙌

@dcherian
Copy link
Contributor

dcherian commented Nov 2, 2023

Excellent work @Metamess . Thank you!

@dcherian dcherian enabled auto-merge (squash) November 2, 2023 21:50
@dcherian dcherian disabled auto-merge November 2, 2023 21:50
@dcherian dcherian enabled auto-merge (squash) November 2, 2023 21:50
@dcherian dcherian disabled auto-merge November 2, 2023 21:51
@dcherian dcherian changed the title Fixed an issue where Dataset.to_zarr raised a ReadOnlyError Fix for Dataset.to_zarr with both consolidated and write_empty_chunks Nov 2, 2023
@dcherian dcherian enabled auto-merge (squash) November 2, 2023 21:51
@dcherian dcherian merged commit 10f2aa1 into pydata:main Nov 2, 2023
Copy link

welcome bot commented Nov 2, 2023

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io plan to merge Final call for comments topic-backends topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_zarr with region and write_empty_chunks keywords raises ReadOnlyError if using consolidated metadata
5 participants