Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the dictionary hash to the HTTP response #2680

Merged
merged 5 commits into from
Nov 20, 2023

Conversation

pmeenan
Copy link
Contributor

@pmeenan pmeenan commented Nov 10, 2023

Adds a new "Content-Dictionary" with the hash of the dictionary used when encoding the HTTP response.

For #2636

Adds a new "Content-Dictionary" with the hash of the dictionary used when encoding the HTTP response.

For #httpwg#2636
@pmeenan
Copy link
Contributor Author

pmeenan commented Nov 10, 2023

@martinthomson this adds the dictionary hash to the HTTP response when used. Right now it is a SHOULD. Let me know if you think it should be a MUST instead.

Copy link
Contributor

@martinthomson martinthomson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this field carries binary data, why aren't you using structured fields?

draft-ietf-httpbis-compression-dictionary.md Outdated Show resolved Hide resolved
draft-ietf-httpbis-compression-dictionary.md Outdated Show resolved Hide resolved
draft-ietf-httpbis-compression-dictionary.md Outdated Show resolved Hide resolved
@pmeenan
Copy link
Contributor Author

pmeenan commented Nov 13, 2023

If this field carries binary data, why aren't you using structured fields?

Mostly for developer ergonomics to make it easier to deploy. In the case of static resources using previous versions as dictionaries (v1 to v2 of main.js), one method for enabling that is to compress the new version using the old version on the cli as part of the push to production and store the dictionary-compressed artifact as a file on the server next to the full file. Then at serving time, compare the value of the Available-Dictionary request header with the file names on the server for the given URL to see if a delta-compressed artifact is available.

Structured fields encode the binary data as base-64 encoded strings which aren't filesystem-friendly so there would need to be some steps at serving time to decode the header, convert it to ascii and then check. It's not insurmountable but it's a bit more processing and a straight string match could be handled directly with templating in server config.

Lowercase hex was picked since that's what all of the cli tools for generating hashes emit.

@pmeenan
Copy link
Contributor Author

pmeenan commented Nov 13, 2023

Is Content-Dictionary a reasonable name for the response header? I'd actually prefer Dictionary but it felt like that might be over-stepping since there could be some other generic need for associating a dictionary with a HTTP-response but maybe I'm being overly-cautious.

@martinthomson
Copy link
Contributor

I don't think that authoring convenience outweighs the advantages of having a unified grammar here. It's not like base64 is some esoteric function. If you are talking about sha256sum, here's a toy: https://github.com/martinthomson/shabase64

@reschke
Copy link
Contributor

reschke commented Nov 14, 2023

I agree with @martinthomson - don't. To make the sf value filesystem friendly, you just need to strip a few characters and remap "/".

@davidben
Copy link
Contributor

Lowercase hex was picked since that's what all of the cli tools for generating hashes emit.

Keep in mind that hashes are not hex strings. A SHA-256 hash is 32 bytes of binary data. The hex thing is just what some tools happen to output. But anything consuming this programmatically to, e.g., compare the hash against something would be best served by the underlying bytes.

Using base64 is also slightly more compact on the wire. And if binary structured fields gets off the ground, it can be even more compact.

@pmeenan
Copy link
Contributor Author

pmeenan commented Nov 14, 2023

My main concern is trading off a unified grammer of using sf-binary in the headers vs consistency with tooling and how the dictionaries are represented once it gets out of the protocol.

In the dynamic case, this could be something like the web server checking to see if the requested dictionary is available (on disk) in response to a request. I don't know how each web server will choose to build that but having files named with the hash is an easy common starting point. Otherwise each chooses what the ascii representation of the 32-byte hash should be for their case.

On the static case, I'm not so much worried about not having a tool that can create base64-encoded strings of sha256 hashes (it's trivial as you pointed out). I'm more concerned about the transformations that need to happen at serving time to decode the hash, convert it into an ascii representation and then check if that variant of the encoded file is available. It's certainly all doable, but it requires base64decode and binhex (or some equivalents) to be available in the layer where the checks are being done - or to rely on application code.

Specifically, I was thinking that it would be easy enough in VCL or an existing Nginx or Apache config to do something like:

  • If Available-Dictionary request header is present and validates against a [0-9a-f]+ regex
  • and 'br-d' is available in the Accept-Encoding request header
  • try returning <path> + '.' + request['Accept-Encoding'] + '.br-d' (and add relevant response headers)
  • otherwise return <path>

Doing similar processing with base64-encoded hashes has a step to convert it into an unspecified format that is platform-specific and may not be doable without either running application code or updating the server with base64decode + binhex support.

It likely doesn't matter for the case of applications and servers that are dictionary-aware and can operate on the raw hashes (and enjoy the ~20 byte request header savings) but it is a tradeoff that pushes complexity further up the stack so I just want to make sure that is the tradeoff that we want to make.

@davidben
Copy link
Contributor

davidben commented Nov 14, 2023

Having a finally consistent grammar for HTTP is the precursor to get consistent tooling. It may be more convenient to do something else in the short term, but in the long term we'd just be adding to the mess of inconsistency in HTTP. In the long term, I would hope that these config language DSLs would be able to process structured fields natively. (Although, it may be on this WG to demonstrate this being possible, since the data model for structured fields is slightly interesting...)

I also don't think we should be encouraging, much less designing for, folks trying to process HTTP headers with regexps. There's a well-defined grammar and inventing other parsers is a great way to not quite get things right. We already have a pretty hard time with HTTP extensibility due to folks parsing it wrong.

The path-unfriendliness is a little unfortunate. Possibly we should have picked the URL-safe version of base64, but too late for that. But since byte strings and text strings aren't the same type, I imagine any sf-binary-aware config language DSL would also have hex and base64 functions available. Indeed taking attacker-supplied input and mapping it to a path component, without running it through some very restrictive encoding, seems a pretty bad idea!

At the end of the day, SHA-256 hashes are byte strings, and sf-binary is the way HTTP decided to say byte string.

We could say that, although SHA-256 hashes are byte strings, dictionary names are ASCII strings computed by hex(sha256). Then these headers should be sf-string (with quotes). But that would be less efficient on the wire, and our mandate on the web is to put user needs (bandwidth, in this case) first, so I think this is the wrong tradeoff.

That said, I see Available-Dictionary also uses this format. It wouldn't make sense for Available-Dictionary and Content-Dictionary to use different formats, so perhaps switching away from the less consistent and less efficient format should probably be handled separately from this PR?

@pmeenan
Copy link
Contributor Author

pmeenan commented Nov 14, 2023

I switched Available-Dictionary and Content-Dictionary both to sf-binary. The plan is to update the ID and then send out a summary of changes for discussion on the listserv and since this PR is about the dictionary in the headers anyway it should be fine to update both of them at the same time. Happy to split it out if it helps with history tracking.

@pmeenan pmeenan merged commit f676b6f into httpwg:main Nov 20, 2023
1 check passed
@pmeenan pmeenan deleted the content-dictionary branch November 20, 2023 16:47
aarongable pushed a commit to chromium/chromium that referenced this pull request Jan 24, 2024
The latest spec discussed in IETF HTTPWG is using `Available-Dictionary`
request header name instead of `Sec-Available-Dictionary`.
Also in the value is using a Structured Field sf-binary sha256 hash
of the dictionary after [1].

So after this CL, when V2 backend is enabled, Chrome will set
`Available-Dictionary` header with a Structured Field sf-binary sha256 hash of the dictionary.

[1]: httpwg/http-extensions#2680

Bug: 1413922
Change-Id: Ibb716ec27bb8f5d58a5266ad6ac3ec75ad643130
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5224886
Reviewed-by: Kenichi Ishibashi <[email protected]>
Commit-Queue: Tsuyoshi Horo <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1251259}
aarongable pushed a commit to chromium/chromium that referenced this pull request Jan 24, 2024
A new "Content-Dictionary" response header was introduced by this spec
change [1].

When V2 backend is enabled, if the "Content-Encoding" response header
indicates that a dictionary is used, Chrome will check the
"Content-Dictionary" response header. And if there is no such header or
it doesn't match with the value of "Available-Dictionary" request
header, Chrome will treat the response as an error.

[1]: httpwg/http-extensions#2680

Bug: 1413922
Change-Id: I37d09fad1201f4fb4b059d65c4aaee58cec788ac
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5226641
Reviewed-by: Kenichi Ishibashi <[email protected]>
Commit-Queue: Tsuyoshi Horo <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1251341}
@yoavweiss
Copy link
Contributor

yoavweiss commented Feb 20, 2024

I think this change adds significant complexity, and I'm not sure regarding its benefits.

I very recently played around with prototyping a compression dictionaries deployment and was very happy with the simplicity of it:

  • You generate diff files at build time with some naming convention
  • Then at serving time, you configure your routing logic (similarly to how @pmeenan pointed out) to those diffs

That can be done today in almost any deployed routing layer.

Once this change lands in implementations, a similar deployment would require me to add custom logic to transform that header value to something that is file system friendly.
It doesn't matter if that transformation is to base64 decode the binary data and then to hex encode it, or to 'just.. strip a few characters and remap "/"'. It is an operation that is not necessarily supported in many layers that developers currently operate in.

To take just one example - if I were to implement this as a Cloudflare transform rule, I'm not even sure it's feasible to do that. (although it might be, with a creative regex_replace).

Can you elaborate on the advantages of moving to base64, beyond theoretical purity?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants