Fix for Dataset.to_zarr with both consolidated and write_empty_chunks #8326
Conversation
Fixed an issue where `Dataset.to_zarr` with both `consolidated` and `write_empty_chunks` failed with a `ReadOnlyError`.
This looks good @Metamess. Thank you! I'll merge it soon unless anyone who knows this code better has any feedback (I don't know this code that well, though the tests look good).
Thanks @Metamess for the fix here. I'd like to understand a bit more of the "why" behind the bug here before shipping this fix. A specific question below...
```diff
@@ -676,7 +676,7 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
             # and append_dim.
             if self._write_empty is not None:
                 zarr_array = zarr.open(
-                    store=self.zarr_group.store,
+                    store=self.zarr_group.chunk_store,
```
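The behavior behind this one-line change can be sketched without zarr itself. Below is a minimal, hypothetical model — plain dicts stand in for zarr stores, and the toy `ConsolidatedMetadataStore` and `Group` classes only mimic zarr's — showing why writing chunk data through `store` raises while `chunk_store` accepts it:

```python
class ReadOnlyError(Exception):
    """Stand-in for zarr.errors.ReadOnlyError (toy model)."""

class ConsolidatedMetadataStore:
    """Toy model of zarr's ConsolidatedMetadataStore: reads pass through,
    writes are rejected so the consolidated copy can't go out of sync."""
    def __init__(self, store):
        self._store = store
    def __getitem__(self, key):
        return self._store[key]
    def __setitem__(self, key, value):
        raise ReadOnlyError("object is read-only")
    def __delitem__(self, key):
        raise ReadOnlyError("object is read-only")

class Group:
    """Toy zarr Group: chunk_store defaults to store when not given."""
    def __init__(self, store, chunk_store=None):
        self.store = store
        self.chunk_store = chunk_store if chunk_store is not None else store

backing = {"var/.zarray": "{...}"}  # the writable, dict-backed store
group = Group(ConsolidatedMetadataStore(backing), chunk_store=backing)

# Before the fix: writing through the read-only metadata wrapper fails.
try:
    group.store["var/0.0"] = b"\x00"
except ReadOnlyError as e:
    print("store:", e)

# After the fix: writing through the underlying writable store succeeds.
group.chunk_store["var/0.0"] = b"\x00"
print("var/0.0" in backing)  # -> True
```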
Do you understand why the `store` interface is read-only and the `chunk_store` is not? I'm curious if this happens for all stores or only the directory store?
Hey @jhamman, thanks for the reply! This is just my understanding from going through the code, but I feel I understand it fairly well, so here's my explanation as to the why:

- Even though `to_zarr` is a write operation, the zarr store is first opened to create a representative `ZarrStore` object. For this purpose, only the metadata has to be read.
- A zarr store by default has metadata files stored per variable (these store encoding information and the `attrs` of a Dataset/DataArray), resulting in many small files. Since this is inefficient to open, especially over a network, the concept of 'consolidated metadata' was created as an experimental feature. This is effectively another metadata file (`.zmetadata`) living at the root of the zarr store, containing a combined copy of the metadata stored in the various per-variable files. This means you only need to open 1 file, which speeds up reads significantly.
- Since the consolidated metadata file is a mere copy, changes to the data that would trigger an edit to the zarr's metadata would cause the consolidated metadata to become 'out of sync'. To prevent this from happening, the choice was made to implement a `ConsolidatedMetadataStore` subclass of zarr's `Store` class, which is used when the zarr is opened using consolidated metadata. This class implements the write operations `__setitem__` and `__delitem__` to raise a `ReadOnlyError` instead of performing the operation.
- The `ConsolidatedMetadataStore` is the `store` attribute of a zarr `Group` object, which in turn is the `zarr_group` attribute of the xarray `ZarrStore` object. To support using different `Store` implementations for writing metadata and writing chunk data, a `Group` can be given a separate `Store` object via the `chunk_store` parameter (and attribute), which defaults to `store` if not provided. Our situation is exactly the use case this split is meant to support.
- The `open_consolidated()` function in `zarr.convenience` enforces a mode of `r` or `r+` (and `to_zarr` with `region` provided enforces a read mode of `r+`), and this function makes sure the resulting `Group` has a `store` of type `ConsolidatedMetadataStore` and a 'normal' `Store` subtype for `chunk_store`. The exact type depends on whether a local path or a URL of some sort was used, but the point is that it's not a read-only `ConsolidatedMetadataStore`.
- Finally, the question remains: "Is writing chunk data actually safe for this operation?". The answer is yes, because no metadata would be changed by `to_zarr` with the `region` parameter:
  - Because the write mode is enforced to be `r+`, no new variables can be added to the store (this is also checked and enforced in `xarray.backends.api.py::to_zarr()`). Existing variables already have their `attrs` included in the consolidated metadata file.
  - The size of dimensions can not be expanded; that would require a call using `append_dim`, which is mutually exclusive with `region`.
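The consolidation mechanism described above can be illustrated with a small sketch (plain dicts and zarr-v2-style key names are assumptions for illustration, not zarr's actual implementation): the per-variable metadata keys are copied into a single root document, so chunk writes leave the copy in sync while a metadata edit would leave it stale.

```python
import json

# Toy zarr-v2-style store: per-variable metadata keys plus chunk data keys.
store = {
    ".zgroup": '{"zarr_format": 2}',
    "temp/.zarray": '{"shape": [4], "chunks": [2]}',
    "temp/.zattrs": '{"units": "K"}',
}

def consolidate(store):
    # Copy every metadata key into one combined root document (.zmetadata).
    meta = {k: json.loads(store[k])
            for k in list(store)
            if k.endswith((".zgroup", ".zarray", ".zattrs"))}
    store[".zmetadata"] = json.dumps({"metadata": meta})

consolidate(store)

# Readers now only need to open one key:
print(sorted(json.loads(store[".zmetadata"])["metadata"]))
# -> ['.zgroup', 'temp/.zarray', 'temp/.zattrs']

# Writing chunk data touches no metadata key, so the copy stays in sync.
store["temp/0"] = b"\x00" * 16

# But editing metadata (e.g. resizing a shape for append_dim) leaves the
# copy stale, which is why the consolidated store is made read-only.
store["temp/.zarray"] = '{"shape": [8], "chunks": [2]}'
stale = json.loads(store[".zmetadata"])["metadata"]["temp/.zarray"]
print(stale["shape"])  # -> [4], not the new [8]
```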
Thanks for this excellent explanation.
I believe we created a bit of a mess with consolidated metadata. Given the way things work in Zarr, I think this PR is an acceptable solution.
My one suggestion would be to put ☝️ that super useful explanation as an inline comment right in the code. It will be extremely helpful for future maintainers.
Excellent work @Metamess. Thank you!
Fixed an issue where `Dataset.to_zarr` with both `consolidated` and `write_empty_chunks` failed with a `ReadOnlyError`.
whats-new.rst
api.rst