Embed hash of raw dictionary in compressed resource (optionally) #4023

pmeenan · 2024-04-09T14:24:17Z

It would be useful if, when using a raw dictionary for compression, the compressor could embed a hash of the dictionary that was used during compression. Then, at decompression time, if the hash is present and doesn't match the provided dictionary the decompression would fail. It could also be used to identify which dictionary in a set of dictionaries was used at decompression time.
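As a rough illustration of the "identify which dictionary was used" case, here is a minimal sketch that indexes candidate dictionaries by hash and looks one up. It assumes SHA-256 as the hash, which is a choice this issue does not settle, and the function names `index_by_hash` and `select_dictionary` are hypothetical, not part of libzstd.

```python
import hashlib


def index_by_hash(dictionaries):
    """Map SHA-256 digest -> dictionary bytes (SHA-256 is an assumption)."""
    return {hashlib.sha256(d).digest(): d for d in dictionaries}


def select_dictionary(index, embedded_hash):
    """Return the dictionary whose digest equals the embedded hash.

    Raises ValueError when no candidate matches, which mirrors the
    "decompression would fail" behavior requested above.
    """
    try:
        return index[embedded_hash]
    except KeyError:
        raise ValueError("no available dictionary matches the embedded hash")
```

A decoder holding several dictionaries would build the index once and call `select_dictionary` with whatever hash it recovered from the payload.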

For the HTTP content-encoding we are currently passing the dictionary hash as an additional response header but that depends on the header and the resource always being together. It would be cleaner if the payload itself could contain the dictionary information.

See httpwg/http-extensions#2770

Cyan4973 · 2024-04-09T23:39:48Z

Currently, Zstandard supports 2 modes for dictionaries:

- Without identifier: it can be any content
- With an identifier (up to 32-bit): it must be a Zstandard-formatted dictionary (with its specified header format).
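
To make the two modes concrete, the following sketch tells them apart by inspecting the first 8 bytes of a dictionary. It relies on the dictionary header layout in RFC 8878 (a 4-byte little-endian magic number `0xEC30A437` followed by a 4-byte little-endian `Dictionary_ID`); the helper name `dictionary_id` is hypothetical.

```python
import struct

ZSTD_DICT_MAGIC = 0xEC30A437  # dictionary magic number defined in RFC 8878


def dictionary_id(dict_bytes):
    """Return the Dictionary_ID of a Zstandard-formatted dictionary.

    Returns None when the buffer does not start with the dictionary
    magic number, i.e. when it would be treated as raw content.
    """
    if len(dict_bytes) < 8:
        return None
    magic, dict_id = struct.unpack_from("<II", dict_bytes, 0)
    if magic != ZSTD_DICT_MAGIC:
        return None
    return dict_id
```

Only the formatted mode carries an identifier, which is why a raw dictionary currently has nothing in the frame that names it.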

The current format of Zstandard has been frozen in RFC 8878, so if we want to remain within the boundaries of what has been specified, these are pretty much the only options.

Now, introducing format-breaking novelties is not impossible, but it will come at a cost: existing (already deployed) Zstandard decoders will be incompatible with these changes. So this is an option we want to be careful about, and trigger only for a very good reason.

Regarding the described request to transmit a hash of the dictionary to compare against, there is an existing workaround that might help here: the skippable frame.
These frames can be appended or prepended in a flow of regular Zstandard frames, and the decoder will skip them.
Which means, their content can be anything that an external application defines.
This is frequently used for watermarking for example, allowing fleet-scale investigations, and could be used here to store the wanted hash.

The advantage is that the application is fully in charge, so it can make the choices it wants, and change them, without having to coordinate with libzstd. For example, what's the format of the hash? Is that SHA256, or something else? Will it evolve tomorrow? I presume it means the hash is controlled, hence the reference scanned with the desired algorithm? Or maybe it was already scanned, and the value is already cached somewhere?
All these decisions could be made, and updated, at application level.

A skippable frame is fairly lightweight: it introduces a cost of only 8 bytes, 4 for the magic number and 4 for the content size.
The main cost is actually logic complexity at application level.
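
For reference, building such a frame at application level could look roughly like the sketch below. It follows the skippable frame layout in RFC 8878 (a little-endian magic number in the range `0x184D2A50` to `0x184D2A5F`, a 4-byte little-endian size, then user data). The use of SHA-256 and the helper names are assumptions for illustration, not something libzstd defines.

```python
import hashlib
import struct

SKIPPABLE_MAGIC_BASE = 0x184D2A50  # valid magics: 0x184D2A50..0x184D2A5F


def make_skippable_frame(payload, variant=0):
    """Wrap payload in a skippable frame: magic (4) + size (4) + data."""
    if not 0 <= variant <= 15:
        raise ValueError("variant must be in 0..15")
    header = struct.pack("<II", SKIPPABLE_MAGIC_BASE + variant, len(payload))
    return header + payload


def dictionary_hash_frame(dictionary):
    """Skippable frame carrying the SHA-256 of a dictionary (assumed hash)."""
    return make_skippable_frame(hashlib.sha256(dictionary).digest())
```

The resulting bytes would be prepended to the regular compressed frames; a standard decoder skips them, while an application that knows the convention can read the hash first.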

On the other hand, if we were willing to push that logic inside libzstd, it would add a few more topics to consider:

First, since it's incompatible with the existing zstd format, the format would need an evolution, breaking compatibility with existing decoders.
Second, the question of the "type of hash" is not neutral, and needs to be decided upfront. It may impact the dependency surface of the libzstd library (which is currently very small, which is preferable to support a broad range of applications). Finally, updating this choice later on can become quite tricky.

So, with these trade-offs in mind, a method based on skippable frames to transport the information feels like a reasonable option to consider.
There are probably other ways to send this information too, but I'm not familiar enough with the domain to correctly list the pros and cons.

pmeenan · 2024-04-10T12:54:56Z

Thanks. Without a tagging mechanism for the skippable frames (and a registry for IDs of some kind) I don't think we want to be adding them to all of the dictionary-compressed streams served on the web.

A web-specific container (header) in front of the zstd file format might work for transport but the raw resources wouldn't be usable by the cli tools.

Sounds like an out-of-band negotiation is the best we can hope for for now. I'd just ask that you keep it in mind for any future revisions to the file format (if there end up being any).

felixhandte · 2024-04-10T14:52:38Z

Yeah. Although note that the skippable frame magic has a range of 16 values. If we were going to pursue this, we could probably reserve one of those values for this purpose.

pmeenan · 2024-04-10T15:36:51Z

If we went that route, we would probably need a combination of reserving one of the magics as well as a signature header on the hash itself in case the same magic was also used by someone else for watermarking, etc.
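
A sketch of what that combination might look like on the reading side follows. Nothing here is reserved or specified anywhere: the magic variant `0xD`, the 4-byte signature `b"DCZH"`, and the choice of SHA-256 are all hypothetical placeholders, and the function names are made up for this example.

```python
import hashlib
import struct

SKIPPABLE_MAGIC_BASE = 0x184D2A50
HASH_FRAME_VARIANT = 0xD      # hypothetical: no value is actually reserved
HASH_SIGNATURE = b"DCZH"      # hypothetical signature header
DIGEST_LEN = 32               # SHA-256, an assumed choice


def make_hash_frame(dictionary):
    """Skippable frame: reserved magic + signature + dictionary digest."""
    payload = HASH_SIGNATURE + hashlib.sha256(dictionary).digest()
    magic = SKIPPABLE_MAGIC_BASE + HASH_FRAME_VARIANT
    return struct.pack("<II", magic, len(payload)) + payload


def find_embedded_hash(stream):
    """Scan leading skippable frames; return the digest or None."""
    pos = 0
    while pos + 8 <= len(stream):
        magic, size = struct.unpack_from("<II", stream, pos)
        if (magic & 0xFFFFFFF0) != SKIPPABLE_MAGIC_BASE:
            return None  # reached a regular frame without finding a hash
        payload = stream[pos + 8:pos + 8 + size]
        if len(payload) < size:
            return None  # truncated input
        if (magic == SKIPPABLE_MAGIC_BASE + HASH_FRAME_VARIANT
                and payload.startswith(HASH_SIGNATURE)
                and len(payload) == len(HASH_SIGNATURE) + DIGEST_LEN):
            return payload[len(HASH_SIGNATURE):]
        pos += 8 + size  # someone else's frame (e.g. a watermark): skip it
    return None


def dictionary_matches(stream, dictionary):
    """True if no hash is embedded, or if the embedded hash matches."""
    embedded = find_embedded_hash(stream)
    return embedded is None or embedded == hashlib.sha256(dictionary).digest()
```

The signature check is what lets a watermark frame that happens to use the same magic pass through untouched instead of being misread as a hash.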

Cyan4973 self-assigned this Apr 9, 2024

pmeenan closed this as completed Apr 10, 2024
