Embed hash of raw dictionary in compressed resource (optionally) #4023

Closed
pmeenan opened this issue Apr 9, 2024 · 4 comments

@pmeenan

pmeenan commented Apr 9, 2024

It would be useful if, when using a raw dictionary for compression, the compressor could embed a hash of the dictionary that was used. Then, at decompression time, if the hash is present and doesn't match the provided dictionary, decompression would fail. The hash could also be used at decompression time to identify which dictionary in a set of dictionaries was used.
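
To make the second use case concrete, here is a minimal sketch of picking a dictionary out of a set once an embedded hash is known. It assumes SHA-256 as the hash (the issue does not fix an algorithm), and `index_by_hash` and `select_dictionary` are hypothetical helper names, not part of libzstd.

```python
import hashlib


def index_by_hash(dictionaries):
    """Map the SHA-256 digest of each raw dictionary to its content.

    SHA-256 is an assumption made for this sketch.
    """
    return {hashlib.sha256(d).digest(): d for d in dictionaries}


def select_dictionary(index, embedded_hash):
    """Return the dictionary whose digest equals embedded_hash.

    Raises ValueError when no dictionary matches, which mirrors the
    "decompression would fail" behavior requested above.
    """
    try:
        return index[embedded_hash]
    except KeyError:
        raise ValueError("no available dictionary matches the embedded hash") from None
```

The lookup only identifies the dictionary; the actual decompression call is left to whatever Zstandard binding the application uses.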

For the HTTP content-encoding we are currently passing the dictionary hash as an additional response header but that depends on the header and the resource always being together. It would be cleaner if the payload itself could contain the dictionary information.

See httpwg/http-extensions#2770

@Cyan4973 Cyan4973 self-assigned this Apr 9, 2024
@Cyan4973
Contributor

Cyan4973 commented Apr 9, 2024

Currently, Zstandard supports two modes for dictionaries:

  • Without an identifier: it can be any content.
  • With an identifier (up to 32-bit): it must be a Zstandard-formatted dictionary (with its specified header format).
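
The two modes can be told apart by looking at the first bytes of the dictionary. The sketch below assumes the dictionary layout from RFC 8878 (a 4-byte little-endian magic number `0xEC30A437` followed by a 4-byte little-endian `Dictionary_ID`); the function name `dictionary_id` is hypothetical and not a libzstd API.

```python
import struct

# Magic number that starts a Zstandard-formatted dictionary (RFC 8878).
ZSTD_DICT_MAGIC = 0xEC30A437


def dictionary_id(dict_bytes: bytes):
    """Return the Dictionary_ID of a formatted dictionary, or None for raw content.

    Hypothetical helper for illustration only.
    """
    if len(dict_bytes) < 8:
        return None  # too short to carry the magic number and the ID
    magic, dict_id = struct.unpack_from("<II", dict_bytes, 0)
    if magic != ZSTD_DICT_MAGIC:
        return None  # no magic number: treated as a raw-content dictionary
    return dict_id
```

A raw dictionary yields `None`, which is exactly the case this issue is about: there is no identifier in the frame to check against.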

The current format of Zstandard has been frozen in RFC 8878, so if we want to remain within the boundaries of what has been specified, these are pretty much the only options.

Now, introducing format-breaking novelties is not impossible, but it will come at a cost: existing (already deployed) Zstandard decoders will be incompatible with these changes. So this is an option we want to be careful about, and trigger only for a very good reason.

Regarding the described request to transmit a hash of the dictionary to compare against, there is an existing workaround that might help here: the skippable frame.
These frames can be appended or prepended in a flow of regular Zstandard frames, and the decoder will skip them.
This means their content can be anything that an external application defines.
This is frequently used for watermarking, for example, allowing fleet-scale investigations, and could be used here to store the wanted hash.

The advantage is that the application is fully in charge, so it can make the choices it wants, and change them, without having to coordinate with libzstd. For example, what's the format of the hash? Is it SHA256, or something else? Will it evolve tomorrow? I presume the hash gets checked, so is the reference dictionary scanned with the desired algorithm at that point? Or was it already scanned, with the value cached somewhere?
All these decisions could be made, and updated, at the application level.

A skippable frame is fairly lightweight: it introduces a cost of only 8 bytes, for the magic number and the content size.
The main cost is actually logic complexity at application level.
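
As an illustration of that 8-byte overhead, here is a minimal sketch of an application-level skippable frame carrying a dictionary hash. The frame layout (4-byte little-endian magic in the range starting at `0x184D2A50`, then a 4-byte little-endian size, then user data) follows RFC 8878. The choice of SHA-256 and the function names are assumptions made for this example.

```python
import hashlib
import struct

# First value of the skippable frame magic range (RFC 8878).
SKIPPABLE_MAGIC_BASE = 0x184D2A50


def make_hash_frame(dictionary: bytes) -> bytes:
    """Build a skippable frame whose user data is the dictionary's digest.

    SHA-256 is an assumption; the application is free to pick another hash.
    """
    digest = hashlib.sha256(dictionary).digest()
    # 8-byte header: magic number, then the size of the user data.
    header = struct.pack("<II", SKIPPABLE_MAGIC_BASE, len(digest))
    return header + digest


def prepend_hash_frame(dictionary: bytes, compressed: bytes) -> bytes:
    """Place the hash frame in front of already-compressed Zstandard frames."""
    return make_hash_frame(dictionary) + compressed
```

With SHA-256 the total addition is 40 bytes per stream (8 bytes of header plus a 32-byte digest), and a conforming decoder skips it without needing to understand it.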

On the other hand, if we were willing to push that logic inside libzstd, it would add a few more topics to consider :

  • First, since it's incompatible with the existing zstd format, the format would need an evolution, breaking compatibility with existing decoders.
  • Second, the question of the "type of hash" is not neutral, and needs to be decided upfront. It may impact the dependency surface of the libzstd library (currently very small, which is preferable for supporting a broad range of applications). Finally, updating this choice later on can become quite tricky.

So, with these trade-offs in mind, a method based on skippable frames to transport the information feels like a reasonable option to consider.
There are probably other ways to send this information too, but I'm not familiar enough with the domain to correctly list the pros and cons.

@pmeenan
Author

pmeenan commented Apr 10, 2024

Thanks. Without a tagging mechanism for the skippable frames (and a registry for IDs of some kind) I don't think we want to be adding them to all of the dictionary-compressed streams served on the web.

A web-specific container (header) in front of the zstd file format might work for transport but the raw resources wouldn't be usable by the cli tools.

Sounds like an out-of-band negotiation is the best we can hope for right now. I'd just ask that you keep this in mind for any future revisions to the file format (if there end up being any).

@pmeenan pmeenan closed this as completed Apr 10, 2024
@felixhandte
Contributor

Yeah. Although note that the skippable frame magic has a range of 16 values. If we were going to pursue this, we could probably reserve one of those values for this purpose.
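
For reference, the 16 values are `0x184D2A50` through `0x184D2A5F` (RFC 8878), so the low nibble of the magic number is the only part that varies. A minimal sketch of reading that nibble follows; the function name is hypothetical, and no value is actually reserved for dictionary hashes today.

```python
import struct

SKIPPABLE_MAGIC_BASE = 0x184D2A50  # first of the 16 skippable magic values
SKIPPABLE_MAGIC_MASK = 0xFFFFFFF0  # the low 4 bits select one of the 16 values


def skippable_variant(data: bytes):
    """Return 0..15 if data starts with a skippable frame magic, else None."""
    if len(data) < 4:
        return None
    (magic,) = struct.unpack_from("<I", data, 0)
    if (magic & SKIPPABLE_MAGIC_MASK) != SKIPPABLE_MAGIC_BASE:
        return None  # a regular Zstandard frame or unrelated data
    return magic & 0xF
```

An application could dispatch on the returned variant, treating one agreed-upon value as "dictionary hash" and ignoring the rest.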

@pmeenan
Author

pmeenan commented Apr 10, 2024

If we went that route, we would probably need a combination of reserving one of the magic values as well as a signature header on the hash itself, in case the same magic value was also used by someone else for watermarking, etc.
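
A rough sketch of what that combination could look like on the reading side: walk the leading skippable frames, and accept only a frame that has both the reserved magic value and the signature prefix. Everything specific here is hypothetical: the magic value `0x184D2A5D`, the `DHSH` signature, and the use of SHA-256 are placeholders, since nothing has been reserved or specified.

```python
import hashlib
import struct

RESERVED_MAGIC = 0x184D2A5D  # hypothetical: no skippable magic is reserved for this
SIGNATURE = b"DHSH"          # hypothetical signature in front of the hash


def embedded_dictionary_hash(stream: bytes):
    """Return the hash from the first matching leading skippable frame, or None."""
    offset = 0
    while offset + 8 <= len(stream):
        magic, size = struct.unpack_from("<II", stream, offset)
        if (magic & 0xFFFFFFF0) != 0x184D2A50:
            return None  # reached a non-skippable frame: stop looking
        payload = stream[offset + 8 : offset + 8 + size]
        if len(payload) < size:
            return None  # truncated frame
        # Both checks are required: the magic alone may collide with a watermark.
        if magic == RESERVED_MAGIC and payload.startswith(SIGNATURE):
            return payload[len(SIGNATURE):]
        offset += 8 + size
    return None


def dictionary_matches(stream: bytes, dictionary: bytes) -> bool:
    """True only if an embedded hash exists and equals the dictionary's SHA-256."""
    expected = embedded_dictionary_hash(stream)
    return expected is not None and expected == hashlib.sha256(dictionary).digest()
```

A watermark frame that happens to reuse the same magic value is skipped because its payload lacks the signature, which is the collision case raised above.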
