Listing every format that could be represented as virtual zarr #218

Open · 4 of 16 tasks
TomNicholas opened this issue Aug 8, 2024 · 15 comments
Labels: help wanted (Extra attention is needed) · Kerchunk (Relating to the kerchunk library / specification itself) · references generation (Reading byte ranges from archival files) · usage example (Real world use case examples)

Comments

@TomNicholas (Member) commented Aug 8, 2024:

Let's list all the file formats that could potentially be represented efficiently as "virtual zarr" - i.e. zarr + chunk manifests.

The important criterion here is that the format must store data in a small number of contiguous chunks, such that access via HTTP range requests to object storage is efficient. This rules out some formats; for example, I don't think we can efficiently access the format that @kmuehlbauer mentioned over in openradar/xradar#187 (comment):

> file formats where variables are written interleaved within one chunk of data (e.g. 100 bytes v1, 100 bytes v2, 100 bytes v3, 100 bytes v1, 100 bytes v2, 100 bytes v3, ...)? Is there something like strides available?

If we start thinking of Zarr as a "SuperFormat" (super as in superset, not as in super-duper), then this is the list of existing formats comprising that set - i.e. everything that can be referenced using chunk manifests (see zarr-developers/zarr-specs#287). A minimal sketch of a chunk manifest is given just after the list below.


Definitely can support:

Probably can support:

Maybe can support?

Probably can't support:

(The checkboxes indicate whether or not a working implementation already exists - either going through kerchunk's in-memory format as an intermediate or creating a ManifestArray directly.)
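
To make the idea concrete, here is a minimal sketch of a chunk manifest (all paths, offsets, and lengths invented for illustration): each zarr chunk key maps to a contiguous byte range inside an existing archival file, so a reader can fetch exactly that range with one HTTP range request.

```python
# Minimal sketch of a chunk manifest - illustrative values only.
# Chunk "0.0" of the virtual array lives at bytes 100-199 of file1.nc,
# so it can be served by a single HTTP range request, with no data copied.
manifest = {
    "0.0": {"path": "s3://bucket/file1.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/file1.nc", "offset": 200, "length": 100},
    "1.0": {"path": "s3://bucket/file2.nc", "offset": 100, "length": 100},
}
```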

cc @jhamman @d-v-b

TomNicholas added the help wanted, references generation, and usage example labels on Aug 8, 2024
TomNicholas changed the title from "Listing every format that might conceivably be represented as virtual zarr" to "Listing every format that could be represented as virtual zarr" on Aug 8, 2024
@maxrjones (Member) commented:

Unfortunately, based on https://gdal.org/user/virtual_file_systems.html#jpeg2000, JPEG2000 is likely in the 'probably can't support' category. I would've liked for these datasets to be virtualizable, but they're all JPEG2000, optimized for the download-to-disk model :(

Another way to phrase this question, which may help the search: which of the formats supported by GDAL's raster drivers can be virtualized?

@martindurant (Member) commented:

I like this issue! It's worth saying that anything kerchunk can chunk can be v-zarred, right? In that repo there are suggestions of other worthwhile formats - DICOM and NIfTI (medical imaging) spring to mind. The latter is nice but often whole-file-gzipped; the former is evil in the way that other 90s standards are evil, but extremely widespread.

@norlandrhagen (Collaborator) commented:

> ... the former is evil in the way that other 90s standards are evil, but extremely widespread.

❤️

@TomNicholas (Member, Author) commented:

> anything kerchunk can chunk can be v-zarred, right?

Yes, that's the idea. This function converts kerchunk refs -> virtual dataset, and this function converts virtual dataset -> kerchunk refs. Any additional kerchunk file readers can be called from another if/else branch in here.
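
As a rough sketch of that round trip (hedged: these names reflect the API at the time of writing - check the docs for the current signatures):

```python
import virtualizarr

# archival file -> virtual dataset: an xarray.Dataset whose variables wrap
# ManifestArrays (byte-range references) rather than loading any array data
vds = virtualizarr.open_virtual_dataset("data/file.nc")

# virtual dataset -> kerchunk refs: serialize the references back out so that
# kerchunk-aware readers (e.g. fsspec's reference filesystem) can use them
vds.virtualize.to_kerchunk("refs.json", format="json")
```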

TomNicholas added the Kerchunk label on Aug 21, 2024
TomNicholas pinned this issue on Nov 15, 2024
@TomNicholas (Member, Author) commented:

Hugging Face safetensors is an interesting example - it's uncompressed, so basically just like reading netCDF3, having no internal chunking. But it also puts all the metadata at the start of the file, making it a bit like cloud-optimized HDF5. See also huggingface/safetensors#527 (comment)
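
The layout makes the byte ranges trivial to recover: the first 8 bytes are a little-endian u64 giving the length of a JSON header, which records each tensor's dtype, shape, and data offsets relative to the end of the header. A rough sketch of extracting virtualizable byte ranges (not the actual virtualizarr reader, just an illustration of the format):

```python
import json
import struct

def safetensors_byte_ranges(path):
    """Map tensor names to absolute (offset, length) byte ranges in the file.

    Per the safetensors spec: an 8-byte little-endian u64 header length,
    then a JSON header whose "data_offsets" are relative to the end of the
    header, then the raw (uncompressed) tensor data.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    data_start = 8 + header_len
    ranges = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]
        ranges[name] = (data_start + begin, end - begin)
    return ranges
```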

@martindurant (Member) commented:

If the format is simple and common, I say it should be included immediately, especially when there is a straightforward way to check correctness.

> having no internal chunking

but you can assign internal chunking. Is partial reading available upstream at all yet?

@TomNicholas (Member, Author) commented Jan 3, 2025:

> If the format is simple and common, I say it should be included immediately, especially when there is a straightforward way to check correctness.

I raised #367 to track adding it.

> but you can assign internal chunking. Is partial reading available upstream at all yet?

This issue seems to suggest it is: zarr-developers/zarr-python#1106. But I think taking advantage of this in virtualizarr would require #199 to be merged.

@martindurant (Member) commented:

No, zarr's PR #1106 only implemented it for blosc compression, something I've been arguing about for a very, very long time!

If you can dynamically re-imagine the chunking at runtime (which is what I think #119 does), then that would be good enough for most practical uses - but it's still annoying. Zarr should just do this! I.e., the chunk IO function shouldn't just be passed "I need chunk X", but "I need section (:, s:t, i:j) of chunk X", along with a way to characterise what the decompression pipeline looks like (this is fine for uncompressed data, some blosc, maybe zstd..., but not zlib). This was the idea behind my suggestion of passing Contexts around in zarr v2.
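
For uncompressed data the arithmetic really is trivial, which is what makes this so tempting; a sketch of the mapping involved (assuming a C-ordered, uncompressed 2D chunk, where only spans contiguous along the last axis collapse to a single range request):

```python
import numpy as np

def row_byte_range(chunk_offset, chunk_shape, dtype, row):
    """Byte range covering one row of an uncompressed, C-ordered 2D chunk.

    A whole row is contiguous on disk, so it maps to a single range request;
    a general section like (:, s:t, i:j) decomposes into many such requests,
    and any non-trivial compression breaks this mapping entirely.
    """
    itemsize = np.dtype(dtype).itemsize
    row_nbytes = chunk_shape[1] * itemsize
    return chunk_offset + row * row_nbytes, row_nbytes

# e.g. row 7 of a 100x100 float64 chunk stored at byte offset 5000
offset, length = row_byte_range(5000, (100, 100), "float64", 7)  # (10600, 800)
```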

@TomNicholas (Member, Author) commented:

I don't disagree, but if we want to discuss this further we should do it in a new issue (on this repo or upstream on zarr).

@itsgifnotjiff commented:

Environment and Climate Change Canada here 😊. We have something called RPN/FST files, which are binary files containing metadata, grids, and Fortran arrays.

They are incredibly efficient in terms of both disk space and (optional) compression.

A C/Fortran library to work with them
A Python library that opens them into xarray Datasets, with attrs and all

We would love to use something like VirtualiZarr to create per-model data cubes that can be easily accessed by ML libraries like JAX/torch/Dask, but also by web APIs such as pygeoapi for OGC APIs.

Some of the difficulties are that we have multiple grids and vertical dimensions per file, and that not all files of a model contain the same variables (accumulation variables only start after the 0 forecast hour).

Indexing our data lake is doable, but up until recent advancements in both Zarr indexing and grouping it did not really fit our plans.

Tools like SHTools, XPublish, XCube, MetPy, Holoviz, etc. all have an xarray entry point, but some particularities still make the workflow experimental and hacky.

If there were a way to create a massive Zarr/Icechunk/TileDB/Delta Lake on top of a GPFS/Ceph filesystem mounted on an HPC (where access to the data could perhaps be accelerated through extra CPUs/GPUs), it would allow me and my colleagues to confidently build tooling for some proper Science 😊.

@martindurant (Member) commented:

@itsgifnotjiff, that sounds like it might be the perfect opportunity for me to build a vzarr reader rather than a kerchunk indexer (unless @TomNicholas, you don't think it's worth making a distinction between the two). I am in Toronto, so if you are too, I'd be happy to meet in person to go over the format and requirements.

@itsgifnotjiff commented:

I am actually in Montreal 😶. I am, however, available and motivated. Let me know if you have time to meet virtually, if nothing else just for me and my boss to say thank you for your wonderful work.

My email for work

@TomNicholas (Member, Author) commented:

> binary files containing metadata, grids, and Fortran arrays

Yet another self-describing binary file format! Sounds like something that could be virtualized.

> We would love to use something like VirtualiZarr to create per-model data cubes that can be easily accessed by ML libraries like JAX/torch/Dask

Yes, it should be possible to provide access via the zarr-python API.

> but also by web APIs such as pygeoapi for OGC APIs

I'm less familiar with this, but there might be a way to do that.

> Some of the difficulties are that we have multiple grids and vertical dimensions per file, and that not all files of a model contain the same variables (accumulation variables only start after the 0 forecast hour).

The (current) set of restrictions for the virtual zarr approach is documented here. As your data apparently can already be mapped to the xarray/netCDF data model, as long as you don't hit any of those listed issues, I would expect this to be possible.

> data lake

A cloud data lake?

> If there were a way to create a massive Zarr/Icechunk/TileDB/Delta Lake on top of a GPFS/Ceph filesystem mounted on an HPC

This is technically possible using VirtualiZarr and Icechunk together, though currently the support within Icechunk for Ceph and HPC is underdeveloped compared to the typical AWS use case.


> perfect opportunity for me to build a vzarr reader rather than a kerchunk indexer

@martindurant it would be great for you to have a go at building a virtualizarr reader (bearing in mind we're just finishing up a refactor that will simplify the definition of what a virtualizarr reader actually is), but @itsgifnotjiff, as the owner of a bespoke file format, you should also be the owner of any code designed to read that format. VirtualiZarr is designed to be extensible in that respect, so you can write a dedicated virtualizarr reader for RPN/FST files that lives outside of this main virtualizarr repository (see the sketch below).
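
To sketch what such a reader might look like (heavily hedged: parse_fst_index is a hypothetical placeholder for the actual FST record parsing, and the ChunkManifest/ManifestArray API may shift with the refactor mentioned above):

```python
import xarray as xr
from virtualizarr.manifests import ChunkManifest, ManifestArray

def open_virtual_fst(path: str) -> xr.Dataset:
    # Hypothetical helper: walk the FST record index to get, per variable,
    # its dims, zarr-style array metadata, and each record's byte range.
    records = parse_fst_index(path)  # not a real function - illustration only

    variables = {}
    for name, rec in records.items():
        # One manifest entry per chunk: where its bytes live inside the file
        manifest = ChunkManifest(entries={
            chunk_key: {"path": path, "offset": offset, "length": length}
            for chunk_key, (offset, length) in rec.chunks.items()
        })
        array = ManifestArray(zarray=rec.zarray, chunkmanifest=manifest)
        variables[name] = xr.Variable(dims=rec.dims, data=array)
    return xr.Dataset(variables)
```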

@itsgifnotjiff commented:

> A cloud data lake?

Internal cloud ... ish 🙂 (I can get into details if needed)

> VirtualiZarr is designed to be extensible in that respect, so you can write a dedicated virtualizarr reader for RPN/FST files that lives outside of this main virtualizarr repository.

That's great to hear! We have published some tools that allow for advanced workflows, such as:

  1. Opening an RPN file as a Polars or Pandas dataframe, where each row holds the data for a single record, with columns for all of its metadata and a final column holding the actual data (a numpy array at the moment).
  2. Filtering, adding, removing, etc. in memory, as well as performing post-processing tasks (important for the atmospheric sciences). A new library called GeoRef is also coming soon that should be able to reproject (horizontally and vertically) any RPN file, but also any xarray Dataset, in memory, from any grid/vertical coordinate to any other (think the completion of the xesmf project, by ECMWF).
  3. Opening any RPN file(s) directly into an xarray Dataset by using engine="fstd" or through fstd2nc (sketched below).
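
For anyone else reading along, that last entry point would presumably look something like this (hedged: the filename is made up, and the two call styles are as described in the fstd2nc docs):

```python
import xarray as xr
import fstd2nc

# via the xarray backend entry point mentioned above (filename illustrative)
ds = xr.open_dataset("2024081012_024.fst", engine="fstd")

# or explicitly through fstd2nc
ds = fstd2nc.Buffer("2024081012_024.fst").to_xarray()
```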

@TomNicholas (Member, Author) commented:

Cool! Let us know how you get on with a virtualizarr reader, and please raise new issues if you have any problems.
