Refactor codec specs into a single doc #102
joshmoore merged 6 commits into zarr-developers:core-protocol-v3.0-dev
   https://tools.ietf.org/html/rfc1952

.. [BLOSC] F. Alted. Blosc Chunk Format. URL:
   https://github.com/Blosc/c-blosc/blob/master/README_CHUNK_FORMAT.rst
This document describes the "Blosc Chunk Format". Does it mean a Zarr chunk consists of one or more Blosc chunks? If so, is the Zarr chunk a simple concatenation of the Blosc chunks?
The "Blosc Chunk Format" describes how blosc encodes an input buffer. So one zarr chunk becomes one blosc chunk once encoded.
docs/codecs.rst
{
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codecs/gzip",
The spec version isn't part of the URL anymore? We previously had:
- "codec": "https://purl.org/zarr/spec/codecs/gzip",
+ "codec": "https://purl.org/zarr/spec/codecs/gzip/1.0",
Yes. On reflection, I thought the version number in the codec URI was potentially confusing and unnecessary. In the case of gzip (and blosc) we do not expect the encoding format to change, so we don't need a version number.
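For anyone following along, here is a rough sketch of how an implementation might resolve the unversioned URI to a codec and round-trip one chunk. It uses Python's standard `gzip` module. The `configuration` and `level` keys and the `encode_chunk` / `decode_chunk` helpers are assumptions for illustration, not taken from the spec text quoted in this thread.

```python
import gzip
import json

GZIP_URI = "https://purl.org/zarr/spec/codecs/gzip"

# Metadata fragment; "configuration" and "level" are assumed key names.
metadata = json.loads("""
{
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codecs/gzip",
        "configuration": {"level": 1}
    }
}
""")


def encode_chunk(raw: bytes, compressor: dict) -> bytes:
    """Compress one chunk according to the compressor metadata."""
    if compressor["codec"] != GZIP_URI:
        raise ValueError(f"unsupported codec: {compressor['codec']}")
    level = compressor.get("configuration", {}).get("level", 9)
    return gzip.compress(raw, compresslevel=level)


def decode_chunk(encoded: bytes, compressor: dict) -> bytes:
    """Decompress one chunk; the URI alone identifies the format."""
    if compressor["codec"] != GZIP_URI:
        raise ValueError(f"unsupported codec: {compressor['codec']}")
    return gzip.decompress(encoded)


raw = bytes(range(256)) * 16
encoded = encode_chunk(raw, metadata["compressor"])
decoded = decode_chunk(encoded, metadata["compressor"])
```

The URI is compared as an exact string, so nothing in the lookup depends on a version suffix.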
docs/codecs.rst
{
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codecs/blosc",
The spec version isn't part of the URL anymore? We previously had:
- "codec": "https://purl.org/zarr/spec/codecs/blosc",
+ "codec": "https://purl.org/zarr/spec/codecs/blosc/1.0",
Yes, see above.
Note that Blosc 2 would be treated as an entirely new codec, and we would probably give it a URI like https://purl.org/zarr/spec/codecs/blosc2.
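A small sketch of what "entirely new codec" could mean in practice: if codecs are looked up by exact URI, a Blosc 2 URI would simply be a different key with no relationship to the Blosc entry. The `registry` dict, the `lookup` helper, and the string values are hypothetical, and the Blosc 2 URI is only the possibility mentioned above, not a registered codec.

```python
from typing import Dict

BLOSC_URI = "https://purl.org/zarr/spec/codecs/blosc"
# Only a possible future URI, per the comment above; nothing is registered for it.
BLOSC2_URI = "https://purl.org/zarr/spec/codecs/blosc2"

# Hypothetical registry: each URI is an opaque key naming one encoding format.
registry: Dict[str, str] = {
    BLOSC_URI: "blosc",
}


def lookup(uri: str) -> str:
    """Return the codec name registered under exactly this URI."""
    try:
        return registry[uri]
    except KeyError:
        raise LookupError(f"no codec registered for {uri}") from None


found = lookup(BLOSC_URI)

try:
    lookup(BLOSC2_URI)
    blosc2_known = True
except LookupError:
    blosc2_known = False
```

With exact-match keys there is no version parsing or prefix matching, so an incompatible format cannot be mistaken for a newer revision of an existing codec.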
Thanks @alimanfoo, it is much better than what we had previously with one document per codec, and I like the way parameters are described, similar to what we are used to in a Python docstring.
Co-authored-by: David Brochart <david.brochart@gmail.com>
Thanks @davidbrochart, I've added your suggestions. Regarding the codec URIs, are you happy that we remove the version numbers, or would you like to discuss this some more?
I'm happy with that. I was indeed confused about whether it was the version of the spec or the version of e.g. the GZip library.
I've attempted an update of this PR and intend to merge it into core-protocol-v3.0-dev in case anyone would like to re-review. The change in 2314192 matches the work from #116, #117 and #119, which the community is already depending on.
On reflection, I thought it might be better to refactor the codec specifications into a single document, because this would reduce the overhead of adding new codecs and maintaining the documentation. For each codec we generally cite existing documentation elsewhere for the definition of the encoding algorithm and format, so we don't actually need to provide much information per codec. Having a separate document for each codec therefore seems like overkill.
So this PR brings the existing gzip codec spec and the proposed blosc codec spec (#95) into a single "Codec registry" specification.
Comments welcome.