Add the dictionary hash to the HTTP response #2680

pmeenan · 2023-11-10T16:17:32Z

Adds a new "Content-Dictionary" with the hash of the dictionary used when encoding the HTTP response.

Adds a new "Content-Dictionary" with the hash of the dictionary used when encoding the HTTP response. For #httpwg#2636

pmeenan · 2023-11-10T16:20:43Z

@martinthomson this adds the dictionary hash to the HTTP response when used. Right now it is a SHOULD. Let me know if you think it should be a MUST instead.

martinthomson

If this field carries binary data, why aren't you using structured fields?

draft-ietf-httpbis-compression-dictionary.md

pmeenan · 2023-11-13T14:32:29Z

If this field carries binary data, why aren't you using structured fields?

Mostly for developer ergonomics to make it easier to deploy. In the case of static resources using previous versions as dictionaries (v1 to v2 of main.js), one method for enabling that is to compress the new version using the old version on the cli as part of the push to production and store the dictionary-compressed artifact as a file on the server next to the full file. Then at serving time, compare the value of the Available-Dictionary request header with the file names on the server for the given URL to see if a delta-compressed artifact is available.

Structured fields encode the binary data as base-64 encoded strings which aren't filesystem-friendly so there would need to be some steps at serving time to decode the header, convert it to ascii and then check. It's not insurmountable but it's a bit more processing and a straight string match could be handled directly with templating in server config.

Lowercase hex was picked since that's what all of the cli tools for generating hashes emit.

pmeenan · 2023-11-13T14:43:31Z

Is Content-Dictionary a reasonable name for the response header? I'd actually prefer Dictionary but it felt like that might be over-stepping since there could be some other generic need for associating a dictionary with a HTTP-response but maybe I'm being overly-cautious.

martinthomson · 2023-11-14T05:35:14Z

I don't think that authoring convenience outweighs the advantages of having a unified grammar here. It's not like base64 is some esoteric function. If you are talking about sha256sum, here's a toy: https://github.com/martinthomson/shabase64

reschke · 2023-11-14T08:27:38Z

I agree with @martinthomson - don't. To make the sf value filesystem friendly, you just need to strip a few characters and remap "/".

davidben · 2023-11-14T13:30:26Z

Lowercase hex was picked since that's what all of the cli tools for generating hashes emit.

Keep in mind that hashes are not hex strings. A SHA-256 hash is 32 bytes of binary data. The hex thing is just what some tools happen to output. But anything consuming this programmatically to, e.g., compare the hash against something would be best served by the underlying bytes.

Using base64 is also slightly more compact on the wire. And if binary structured fields gets off the ground, it can be even more compact.

pmeenan · 2023-11-14T17:29:44Z

My main concern is trading off a unified grammer of using sf-binary in the headers vs consistency with tooling and how the dictionaries are represented once it gets out of the protocol.

In the dynamic case, this could be something like the web server checking to see if the requested dictionary is available (on disk) in response to a request. I don't know how each web server will choose to build that but having files named with the hash is an easy common starting point. Otherwise each chooses what the ascii representation of the 32-byte hash should be for their case.

On the static case, I'm not so much worried about not having a tool that can create base64-encoded strings of sha256 hashes (it's trivial as you pointed out). I'm more concerned about the transformations that need to happen at serving time to decode the hash, convert it into an ascii representation and then check if that variant of the encoded file is available. It's certainly all doable, but it requires base64decode and binhex (or some equivalents) to be available in the layer where the checks are being done - or to rely on application code.

Specifically, I was thinking that it would be easy enough in VCL or an existing Nginx or Apache config to do something like:

If Available-Dictionary request header is present and validates against a [0-9a-f]+ regex
and 'br-d' is available in the Accept-Encoding request header
try returning <path> + '.' + request['Accept-Encoding'] + '.br-d' (and add relevant response headers)
otherwise return <path>

Doing similar processing with base64-encoded hashes has a step to convert it into an unspecified format that is platform-specific and may not be doable without either running application code or updating the server with base64decode + binhex support.

It likely doesn't matter for the case of applications and servers that are dictionary-aware and can operate on the raw hashes (and enjoy the ~20 byte request header savings) but it is a tradeoff that pushes complexity further up the stack so I just want to make sure that is the tradeoff that we want to make.

davidben · 2023-11-14T18:30:55Z

Having a finally consistent grammar for HTTP is the precursor to get consistent tooling. It may be more convenient to do something else in the short term, but in the long term we'd just be adding to the mess of inconsistency in HTTP. In the long term, I would hope that these config language DSLs would be able to process structured fields natively. (Although, it may be on this WG to demonstrate this being possible, since the data model for structured fields is slightly interesting...)

I also don't think we should be encouraging, much less designing for, folks trying to process HTTP headers with regexps. There's a well-defined grammar and inventing other parsers is a great way to not quite get things right. We already have a pretty hard time with HTTP extensibility due to folks parsing it wrong.

The path-unfriendliness is a little unfortunate. Possibly we should have picked the URL-safe version of base64, but too late for that. But since byte strings and text strings aren't the same type, I imagine any sf-binary-aware config language DSL would also have hex and base64 functions available. Indeed taking attacker-supplied input and mapping it to a path component, without running it through some very restrictive encoding, seems a pretty bad idea!

At the end of the day, SHA-256 hashes are byte strings, and sf-binary is the way HTTP decided to say byte string.

We could say that, although SHA-256 hashes are byte strings, dictionary names are ASCII strings computed by hex(sha256). Then these headers should be sf-string (with quotes). But that would be less efficient on the wire, and our mandate on the web is to put user needs (bandwidth, in this case) first, so I think this is the wrong tradeoff.

That said, I see Available-Dictionary also uses this format. It wouldn't make sense for Available-Dictionary and Content-Dictionary to use different formats, so perhaps switching away from the less consistent and less efficient format should probably be handled separately from this PR?

pmeenan · 2023-11-14T19:13:09Z

I switched Available-Dictionary and Content-Dictionary both to sf-binary. The plan is to update the ID and then send out a summary of changes for discussion on the listserv and since this PR is about the dictionary in the headers anyway it should be fine to update both of them at the same time. Happy to split it out if it helps with history tracking.

The latest spec discussed in IETF HTTPWG is using `Available-Dictionary` request header name instead of `Sec-Available-Dictionary`. Also in the value is using a Structured Field sf-binary sha256 hash of the dictionary after [1]. So after this CL, when V2 backend is enabled, Chrome will set `Available-Dictionary` header with a Structured Field sf-binary sha256 hash of the dictionary. [1]: httpwg/http-extensions#2680 Bug: 1413922 Change-Id: Ibb716ec27bb8f5d58a5266ad6ac3ec75ad643130 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5224886 Reviewed-by: Kenichi Ishibashi <[email protected]> Commit-Queue: Tsuyoshi Horo <[email protected]> Cr-Commit-Position: refs/heads/main@{#1251259}

A new "Content-Dictionary" response header was introduced by this spec change [1]. When V2 backend is enabled, if the "Content-Encoding" response header indicates that a dictionary is used, Chrome will check the "Content-Dictionary" response header. And if there is no such header or it doesn't match with the value of "Available-Dictionary" request header, Chrome will treat the response as an error. [1]: httpwg/http-extensions#2680 Bug: 1413922 Change-Id: I37d09fad1201f4fb4b059d65c4aaee58cec788ac Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5226641 Reviewed-by: Kenichi Ishibashi <[email protected]> Commit-Queue: Tsuyoshi Horo <[email protected]> Cr-Commit-Position: refs/heads/main@{#1251341}

yoavweiss · 2024-02-20T17:21:35Z

I think this change adds significant complexity, and I'm not sure regarding its benefits.

I very recently played around with prototyping a compression dictionaries deployment and was very happy with the simplicity of it:

You generate diff files at build time with some naming convention
Then at serving time, you configure your routing logic (similarly to how @pmeenan pointed out) to those diffs

That can be done today in almost any deployed routing layer.

Once this change lands in implementations, a similar deployment would require me to add custom logic to transform that header value to something that is file system friendly.
It doesn't matter if that transformation is to base64 decode the binary data and then to hex encode it, or to 'just.. strip a few characters and remap "/"'. It is an operation that is not necessarily supported in many layers that developers currently operate in.

To take just one example - if I were to implement this as a Cloudflare transform rule, I'm not even sure it's feasible to do that. (although it might be, with a creative regex_replace).

Can you elaborate on the advantages of moving to base64, beyond theoretical purity?

Add the dictionary hash to the HTTP response

c65e8ee

Adds a new "Content-Dictionary" with the hash of the dictionary used when encoding the HTTP response. For #httpwg#2636

pmeenan requested a review from martinthomson November 10, 2023 16:19

martinthomson reviewed Nov 12, 2023

View reviewed changes

draft-ietf-httpbis-compression-dictionary.md Outdated Show resolved Hide resolved

draft-ietf-httpbis-compression-dictionary.md Outdated Show resolved Hide resolved

draft-ietf-httpbis-compression-dictionary.md Outdated Show resolved Hide resolved

pmeenan added 2 commits November 13, 2023 09:34

Updated abnf

dd67532

Changed Content-Dictionary to MUST

04938b1

Changed the dictionary mismatch to cannot

b48393c

Switched to sf-binary for hash

a337819

pmeenan merged commit f676b6f into httpwg:main Nov 20, 2023
1 check passed

pmeenan deleted the content-dictionary branch November 20, 2023 16:47

pmeenan added the compression-dictionary label Dec 11, 2023

pmeenan mentioned this pull request Apr 18, 2024

Hash format in HTTP Headers #2781

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add the dictionary hash to the HTTP response #2680

Add the dictionary hash to the HTTP response #2680

pmeenan commented Nov 10, 2023

pmeenan commented Nov 10, 2023

martinthomson left a comment

pmeenan commented Nov 13, 2023

pmeenan commented Nov 13, 2023

martinthomson commented Nov 14, 2023

reschke commented Nov 14, 2023

davidben commented Nov 14, 2023

pmeenan commented Nov 14, 2023 •

edited

Loading

davidben commented Nov 14, 2023 •

edited

Loading

pmeenan commented Nov 14, 2023

yoavweiss commented Feb 20, 2024 •

edited

Loading

Add the dictionary hash to the HTTP response #2680

Add the dictionary hash to the HTTP response #2680

Conversation

pmeenan commented Nov 10, 2023

pmeenan commented Nov 10, 2023

martinthomson left a comment

Choose a reason for hiding this comment

pmeenan commented Nov 13, 2023

pmeenan commented Nov 13, 2023

martinthomson commented Nov 14, 2023

reschke commented Nov 14, 2023

davidben commented Nov 14, 2023

pmeenan commented Nov 14, 2023 • edited Loading

davidben commented Nov 14, 2023 • edited Loading

pmeenan commented Nov 14, 2023

yoavweiss commented Feb 20, 2024 • edited Loading

pmeenan commented Nov 14, 2023 •

edited

Loading

davidben commented Nov 14, 2023 •

edited

Loading

yoavweiss commented Feb 20, 2024 •

edited

Loading