Cache pre-existing Zarr arrays in Zarr backend #9861
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you switch the equality to xarray/xarray/tests/test_backends.py Line 3325 in eac5105
@dcherian I have implemented caching both array keys and arrays, but for that reason this test is failing: xarray/xarray/tests/test_backends.py Lines 409 to 416 in eac5105
This test assumes that the
oh wow, I did not even know that there was a This code is from 10 years ago and I don't think anyone is using it like this. Today this test would be
with self.roundtrip(expected) as actual:
    assert_identical(actual, expected)
cc @shoyer
Co-authored-by: Deepak Cherian <[email protected]>
I altered the test as you recommended; let me know if that change should be in a separate PR.
xarray/tests/test_backends.py (outdated)
yield group

def test_write_store(self) -> None:
does this still need to be overridden if the default is to turn the cache off?
I think so, because we need to explicitly ensure that the cache is off, otherwise tests will fail. If we rely on the default behavior (which is defined very far away from this test), and if that default value changes, then tests will fail here and it will not be clear why.
another way to remove this code would be to change the tests so that they no longer rely on the old ZarrStore behavior; maybe this could be a separate PR.
I'll push a commit. The create_store on this class seems pointless.
Very nice! Clearly it's pretty effective. I left some minor suggestions.
Looks great @d-v-b. Left a comment on the default value.
Also, please add a what's new note in the docs.
xarray/backends/zarr.py (outdated)
@@ -633,6 +635,7 @@ def open_store(
        zarr_format=None,
        use_zarr_fill_value_as_mask=None,
        write_empty: bool | None = None,
        cache_members: bool = False,
What do you think about changing the default to True? I'd like to know if we think there is a downside (risk, performance, etc.) to defaulting to this behavior?
There are some weird tests that use open_store and dump_to_store that can't handle that default.
For nearly all users, the effective default is the value set in open_zarr and to_zarr (which is True, or should be after that one suggestion is merged).
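To make the "effective default" point concrete, here is a minimal sketch of a low-level entry point that defaults to no caching while the user-facing wrapper opts in. The function names `open_store_sketch` and `open_zarr_sketch` are hypothetical stand-ins, not xarray's actual signatures, and the returned dict only records which value was chosen.

```python
def open_store_sketch(path, cache_members=False):
    """Hypothetical low-level opener: caching is off unless requested."""
    return {"path": path, "cache_members": cache_members}


def open_zarr_sketch(path, cache_members=True):
    """Hypothetical user-facing opener: caching is on by default.

    It always forwards an explicit value, so the low-level default
    only matters for callers that bypass this wrapper.
    """
    return open_store_sketch(path, cache_members=cache_members)


low_level = open_store_sketch("example.zarr")
user_facing = open_zarr_sketch("example.zarr")
```

Under this arrangement, tests that call the low-level opener directly keep the uncached behavior, while typical users get caching through the wrapper.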
i'm up for it! the main risk is whether anyone is actively using ZarrStore instances explicitly, with the assumption that they dynamically track the state of the underlying zarr Group. if nobody is doing that, then there's no risk. ofc I have no way of knowing if this use case is common, but my impression was that ZarrStore is internal enough that users are not expected to be using it directly.
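The risk described here can be illustrated with a small, self-contained sketch. `MutableGroup` and `CachingView` are hypothetical stand-ins (not zarr or xarray classes); the point is only that a cached view stops tracking a group that changes after the first read.

```python
class MutableGroup:
    """Hypothetical stand-in for a group whose members can change."""

    def __init__(self, names):
        self.names = list(names)

    def array_keys(self):
        return tuple(self.names)


class CachingView:
    """Hypothetical wrapper that optionally caches the member names."""

    def __init__(self, group, cache_members):
        self.group = group
        self.cache_members = cache_members
        self._keys = None  # filled on first read when caching is on

    def array_keys(self):
        if not self.cache_members:
            return self.group.array_keys()
        if self._keys is None:
            self._keys = self.group.array_keys()
        return self._keys


group = MutableGroup(["temperature"])
cached = CachingView(group, cache_members=True)
live = CachingView(group, cache_members=False)

cached.array_keys()  # populates the cache
group.names.append("pressure")  # the underlying group changes afterwards

stale_keys = cached.array_keys()  # still reflects the first read
live_keys = live.array_keys()  # reflects the current state
```

Code that relies on the wrapper following the group dynamically would see the stale result once caching is on by default.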
CI is failing on mypy due to the following:
I don't think that's from this PR?
"what's new" has been added and the default value is now
ZarrStore calls array_keys on the zarr.Group it wraps multiple times. Each of these array_keys calls performs IO, but if we model the names of the arrays as static, then they can be cached, and we can do less IO.

Accordingly this PR adds a new method to ZarrStore (get_array_keys), a new keyword argument for ZarrStore.__init__ (cache_array_keys), and a new attribute to ZarrStore (_cache).

get_array_keys gets the array keys from the .zarr_group instance contained in the ZarrStore, either by reading the keys from the cache (if caching is enabled and the cache is populated), or by directly invoking ZarrStore.zarr_group.array_keys() (if caching is disabled, or the cache is empty).

Caching is controlled by the cache_array_keys keyword argument in ZarrStore.__init__.

If ZarrStore.__init__ is called with cache_array_keys=True, the _cache attribute of the ZarrStore is set to {"array_keys": None}. On the first invocation of get_array_keys, this empty-but-active cache is filled with a call to zarr_group.array_keys, and that value is returned. On subsequent invocations of get_array_keys, the value from the cache is retrieved and returned.

If ZarrStore.__init__ is called with cache_array_keys=False, then the _cache attribute never gets an array_keys key inserted, and this indicates to get_array_keys that the cache is not being used. Calls to get_array_keys will return the result of zarr_group.array_keys.

Let me know if the _cache thing is a bit baroque -- my initial effort added a _cache_array_keys attribute as well as the cache, but I figured the state of the cache itself can convey whether the cache is meant to be used, which removes the need for a "use the cache or not" attribute on ZarrStore.

And I would like some feedback on the test -- is it in the right place?
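For readers skimming the description, the cache_array_keys behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `FakeGroup` and `SketchStore` are hypothetical names, and `FakeGroup` counts calls to stand in for the IO that a real zarr.Group would perform.

```python
class FakeGroup:
    """Hypothetical stand-in for a zarr.Group; counts array_keys() calls."""

    def __init__(self, names):
        self._names = list(names)
        self.calls = 0

    def array_keys(self):
        self.calls += 1  # each call models one round of IO
        return tuple(self._names)


class SketchStore:
    """Illustrative only; mirrors the description, not the real ZarrStore."""

    def __init__(self, zarr_group, cache_array_keys=False):
        self.zarr_group = zarr_group
        self._cache = {}
        if cache_array_keys:
            # Empty-but-active cache: the key exists, the value is unset.
            self._cache["array_keys"] = None

    def get_array_keys(self):
        if "array_keys" not in self._cache:
            # Caching disabled: always ask the group.
            return self.zarr_group.array_keys()
        if self._cache["array_keys"] is None:
            # First call with caching enabled: fill the cache.
            self._cache["array_keys"] = self.zarr_group.array_keys()
        return self._cache["array_keys"]


cached_group = FakeGroup(["a", "b"])
cached_store = SketchStore(cached_group, cache_array_keys=True)
for _ in range(3):
    cached_store.get_array_keys()

uncached_group = FakeGroup(["a", "b"])
uncached_store = SketchStore(uncached_group, cache_array_keys=False)
for _ in range(3):
    uncached_store.get_array_keys()
```

In this sketch the cached store reaches the group once across three reads, while the uncached store reaches it on every read; the presence of the "array_keys" key is what signals that caching is in use.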
closes #9853