-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zarr: Optimize appending #8998
Zarr: Optimize appending #8998
Conversation
return region | ||
|
||
|
||
def _validate_datatypes_for_zarr_append(zstore, dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved to backends/zarr.py
existing dtype. | ||
""" | ||
|
||
existing_vars = zstore.get_variables() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR removes this get_variables
call by instead running the check in ZarrStore.store
where we are already requesting existing variables, and doing checks.
if self._mode in ["a", "a-", "r+"]: | ||
_validate_datatypes_for_zarr_append( | ||
vn, existing_vars[vn], variables[vn] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the check runs here where we have both the variables in the store, and the new variables to be written.
6fcd873
to
b21f7f4
Compare
@@ -1721,28 +1685,12 @@ def to_zarr( | |||
) | |||
|
|||
if mode in ["a", "a-", "r+"]: | |||
_validate_datatypes_for_zarr_append(zstore, dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to ZarrStore.store
raise ValueError( | ||
f"variable {var_name!r} already exists, but encoding was provided" | ||
) | ||
if mode == "r+": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to ZarrStore.store
@@ -612,26 +640,58 @@ def store( | |||
import zarr | |||
|
|||
existing_keys = tuple(self.zarr_group.array_keys()) | |||
|
|||
if self._mode == "r+": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checks moved from backends/api.py
@@ -696,7 +756,6 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No | |||
|
|||
for vn, v in variables.items(): | |||
name = _encode_variable_name(vn) | |||
check = vn in check_encoding_set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reducing some indirection.
"``mode='r+'``. To allow writing new variables, set ``mode='a'``." | ||
) | ||
|
||
if self._append_dim is not None and self._append_dim not in existing_keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._append_dim not in existing_keys
is a new addition. We needn't parse all arrays in the store for any append dimensions that we know to exist in the store.
f"append_dim={append_dim!r} does not match any existing " | ||
f"dataset dimensions {existing_dims}" | ||
) | ||
if encoding and mode in ["a", "a-", "r+"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can now skip this request for existing array names if encoding
was not provided.
@dcherian -- this looks great! The logic makes sense and things seem to work. Could you share some diagnostics on what this will mean in terms of traffic between Xarray and the Zarr store? Also, appreciating that this is just a refactor, do you think a new test would help avoid "uneccessary" traffic in the future? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! (Though I don't have that much context down at this level...)
942e1dd
to
dc33846
Compare
modified.to_zarr(store, mode="a", append_dim="x") | ||
# v2024.03.0: {'iter': 6, 'contains': 2, 'setitem': 5, 'getitem': 10, 'listdir': 6, 'list_prefix': 0} | ||
# 6057128b: {'iter': 5, 'contains': 2, 'setitem': 5, 'getitem': 10, "listdir": 5, "list_prefix": 0} | ||
expected = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ds.to_zarr(store, region={"x": slice(None)}) | ||
# v2024.03.0: {'iter': 5, 'contains': 2, 'setitem': 1, 'getitem': 6, 'listdir': 5, 'list_prefix': 0} | ||
# 6057128b: {'iter': 4, 'contains': 2, 'setitem': 1, 'getitem': 5, 'listdir': 4, 'list_prefix': 0} | ||
expected = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhamman 👁️
dc33846
to
f2d4ff4
Compare
f2d4ff4
to
f76e725
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff @dcherian!
Builds on #8997Continues moving checks one level down to
ZarrStore.store
where we already have a bunch of checks, and a loop over existing variables in the store.