I'm looking into ways to reduce the latency of reading data from zarr. In ZarrStore, there are 3 invocations of array_keys, each of which triggers IO (e.g., a listPrefix call on S3). If we assume that the ZarrStore instance can observe all expected changes to the zarr group, then we could call array_keys once when constructing ZarrStore, store the result, and keep that result synchronized with any modifications of the zarr group, which would remove the need to call array_keys again later.
This would improve performance, especially on high-latency backends, but it also introduces data duplication: we would maintain a model of the array keys on the ZarrStore object that might not match the real array keys if the zarr group is modified by some third party. Perhaps we could guard against this with a cache_array_keys option to ZarrStore, defaulting to True (use a cache); if cache_array_keys is False, the array keys would be checked eagerly on every access.
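Roughly, the caching could look something like the sketch below. These are hypothetical names, not xarray's actual ZarrStore internals; it assumes the store holds the underlying zarr group on a zarr_group attribute and that all writes go through the store:

import zarr

class ArrayKeyCachingStore:
    # Hypothetical sketch of the proposed cache_array_keys behaviour,
    # not the real xarray ZarrStore implementation.

    def __init__(self, zarr_group: zarr.Group, cache_array_keys: bool = True):
        self.zarr_group = zarr_group
        self._cache_array_keys = cache_array_keys
        self._array_keys = None  # filled lazily, kept in sync by this object

    def array_keys(self):
        if not self._cache_array_keys:
            # eager path: always hits the backend (e.g. a listPrefix call on S3)
            return list(self.zarr_group.array_keys())
        if self._array_keys is None:
            self._array_keys = list(self.zarr_group.array_keys())
        return self._array_keys

    def _note_array_written(self, name):
        # keep the cached model in sync with writes made through this store
        if self._array_keys is not None and name not in self._array_keys:
            self._array_keys.append(name)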
Thoughts? I can whip up a PR demonstrating this idea pretty quickly, but it would be good to know whether people think it's worthwhile.
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!
import xarray as xr

ds = xr.open_zarr(store, ...)
# do something to ds
ds.to_zarr(store, ...)
open_zarr and to_zarr each create a new ZarrStore, so any third-party modifications made before the to_zarr call will still be visible.
I don't think anyone should rely on the ability to see third-party modifications that occur in parallel with to_zarr (the fact that it requests things multiple times is purely an implementation detail).
I think we can cache it on ZarrStore, though perhaps there is similar functionality that could be of use to the broader Zarr community.
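Since every open_zarr / to_zarr call constructs a fresh ZarrStore, a per-instance cache already has the right lifetime. A minimal sketch of that idea (again assuming, hypothetically, that the store keeps the group on self.zarr_group):

from functools import cached_property

class ZarrStoreSketch:
    def __init__(self, zarr_group):
        self.zarr_group = zarr_group

    @cached_property
    def _array_keys(self):
        # computed once per store instance; a freshly constructed store
        # (e.g. from a new open_zarr call) re-lists the group
        return tuple(self.zarr_group.array_keys())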