Reducing IO in ZarrStore #9853

Closed
d-v-b opened this issue Dec 4, 2024 · 2 comments · Fixed by #9861
Labels
topic-performance, topic-zarr (Related to zarr storage library)

Comments

@d-v-b
Contributor

d-v-b commented Dec 4, 2024

What is your issue?

I'm looking into ways to reduce the latency of reading data from zarr. In ZarrStore, there are 3 invocations of array_keys, each of which triggers IO (e.g., a listPrefix S3 call). If we assume that the ZarrStore instance can observe all expected changes to the zarr group, then we could call array_keys once when constructing ZarrStore, store the result, and keep it synchronized with any modifications of the zarr group. That would remove the need to call array_keys again later.

This would improve performance, especially on high-latency backends, but it also results in data duplication: we would maintain a model of the array keys on the ZarrStore object that might not match the real array keys if the zarr group is modified by some third party. Perhaps we could guard against this with a cache_array_keys option to ZarrStore, with a default of True (use a cache); if cache_array_keys is False, the array keys would be re-read from the zarr group each time they are needed.
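A minimal sketch of the proposed caching, under stated assumptions: `CountingGroup` and `CachingStore` are hypothetical stand-ins written for illustration, not zarr's group class or xarray's ZarrStore, and only the `array_keys` name and the `cache_array_keys` option come from the proposal above. It shows the saving in listing calls and the staleness risk when a third party modifies the group.

```python
class CountingGroup:
    """Hypothetical stand-in for a zarr group that counts listing calls."""

    def __init__(self, names):
        self._names = set(names)
        self.list_calls = 0

    def array_keys(self):
        # Each call models one round trip to the backend (e.g. an S3 list request).
        self.list_calls += 1
        return sorted(self._names)

    def create_array(self, name):
        self._names.add(name)


class CachingStore:
    """Hypothetical wrapper illustrating the idea; not xarray's ZarrStore."""

    def __init__(self, group, cache_array_keys=True):
        self._group = group
        self._cache_array_keys = cache_array_keys
        # With caching enabled, list the group once, at construction time.
        self._array_keys = set(group.array_keys()) if cache_array_keys else None

    def array_keys(self):
        if self._cache_array_keys:
            return sorted(self._array_keys)
        # Without caching, every access goes back to the group.
        return self._group.array_keys()

    def create_array(self, name):
        # Writes made through the store keep the cached keys synchronized.
        self._group.create_array(name)
        if self._cache_array_keys:
            self._array_keys.add(name)


# Cached: three lookups cost a single listing call.
group = CountingGroup(["temperature", "pressure"])
store = CachingStore(group, cache_array_keys=True)
for _ in range(3):
    store.array_keys()
cached_calls = group.list_calls

# A write through the store is reflected in the cache.
store.create_array("humidity")
keys_after_write = store.array_keys()

# A third-party write that bypasses the store is not seen by the cache.
group.create_array("wind")
stale_keys = store.array_keys()

# Uncached: three lookups cost three listing calls.
uncached_group = CountingGroup(["temperature", "pressure"])
uncached_store = CachingStore(uncached_group, cache_array_keys=False)
for _ in range(3):
    uncached_store.array_keys()
uncached_calls = uncached_group.list_calls
```

In this toy model the cache trades freshness for fewer round trips, which is the trade-off the `cache_array_keys` option is meant to expose.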

Thoughts? I can whip up a PR that demonstrates this idea pretty quickly, but it would be good to know that people think it's worthwhile.

d-v-b added the "needs triage" label (Issue that has not been reviewed by xarray team member) Dec 4, 2024

welcome bot commented Dec 4, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@dcherian
Contributor

dcherian commented Dec 4, 2024

For this workload:

ds = xr.open_zarr(store, ...)
# do something to ds
ds.to_zarr(store, ...)

open_zarr and to_zarr each create a new ZarrStore, so any third-party modifications made before to_zarr will be visible.

I don't think anyone should rely on the ability to see third-party modifications that occur in parallel with to_zarr (the fact that it requests things multiple times is purely an implementation detail).

I think we can cache it on ZarrStore, though perhaps there is similar functionality that could be of use to the broader Zarr community.
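The point about each call constructing its own store can be illustrated with a toy model. This is only a sketch: `SnapshotStore` is a hypothetical class that lists its keys once at construction, and a plain dict stands in for the storage backend. Neither is part of xarray or zarr.

```python
class SnapshotStore:
    """Hypothetical store that lists array keys once, when it is constructed."""

    def __init__(self, backend):
        # Take a snapshot of the keys; later backend changes are not tracked.
        self._array_keys = frozenset(backend)

    def array_keys(self):
        return sorted(self._array_keys)


# A dict plays the role of the shared storage backend.
backend = {"temperature": b"..."}

# Analogous to the store created for reading.
read_store = SnapshotStore(backend)

# A third party adds an array between the read and the write.
backend["pressure"] = b"..."

# Analogous to the new store created for writing: it takes a fresh snapshot.
write_store = SnapshotStore(backend)

read_keys = read_store.array_keys()
write_keys = write_store.array_keys()
```

Here `read_keys` does not include the third-party array while `write_keys` does, so a per-instance cache only hides changes made during the lifetime of a single store.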

dcherian added the topic-performance and topic-zarr (Related to zarr storage library) labels and removed the "needs triage" label (Issue that has not been reviewed by xarray team member) Dec 4, 2024
2 participants