Add Last-Modified and ETag headers on HTML responses #1560

Open
jsha opened this issue Nov 26, 2021 · 10 comments
Labels
A-backend Area: Webserver backend

Comments

@jsha
Contributor

jsha commented Nov 26, 2021

Right now, HTML pages on docs.rs get no caching headers at all:

$ curl -iL docs.rs/regex | less
...
HTTP/2 200
content-type: text/html; charset=utf-8
content-length: 136585
server: nginx/1.14.0 (Ubuntu)
date: Fri, 26 Nov 2021 19:34:01 GMT
vary: Accept-Encoding
x-cache: Miss from cloudfront
via: 1.1 eece508272520f70691e4eebdc5a6dea.cloudfront.net (CloudFront)
x-amz-cf-pop: HIO50-C1
x-amz-cf-id: sDMQ3tIHBbZPLMSUynsQRoBTkxdeWH6pc_QqwGzxHSvGQe71aaccEw==

That's in part because they could be updated at any time. However, there's no need for the user's browser to load the whole thing every time. If we set Last-Modified and/or ETag, the browser can send a request with the If-None-Match and/or If-Modified-Since headers. In the common case when the document hasn't been updated, the server can reply with 304 and the browser will use what it has stored locally, saving a lot of bytes downloaded.
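
The revalidation flow described above can be sketched roughly as follows. This is only an illustration in Rust, not the docs.rs handler; the function names `strip_weak` and `is_not_modified` are hypothetical. It uses the weak comparison that applies to If-None-Match (a `W/` prefix is ignored) and treats `*` as matching any current representation.

```rust
/// Strip the optional weak-validator prefix so `W/"abc"` and `"abc"` compare equal.
fn strip_weak(tag: &str) -> &str {
    tag.strip_prefix("W/").unwrap_or(tag)
}

/// Returns true when the server may answer `304 Not Modified`.
/// `if_none_match` is the raw request header value, if the client sent one.
/// `current_etag` is the quoted ETag of the current page.
/// Simplification: assumes entity tags never contain a comma.
fn is_not_modified(if_none_match: Option<&str>, current_etag: &str) -> bool {
    let header = match if_none_match {
        Some(h) => h.trim(),
        None => return false, // unconditional request: send the full body
    };
    if header == "*" {
        return true;
    }
    let current = strip_weak(current_etag);
    header
        .split(',')
        .map(|candidate| strip_weak(candidate.trim()))
        .any(|candidate| candidate == current)
}
```

A handler would call this before sending the body and reply with 304 and no body when it returns true.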

Note that doc.rust-lang.org already serves both ETag and Last-Modified, thanks to S3:

$ curl -i https://doc.rust-lang.org/nightly/std/string/struct.String.html | less
HTTP/2 200 
content-type: text/html
content-length: 614892
date: Fri, 26 Nov 2021 19:58:13 GMT
last-modified: Fri, 26 Nov 2021 00:50:39 GMT
x-amz-version-id: 5q36d4E6SPVJsJeZ4l4N32_CxCLiRRwj
etag: "0a1f4a4cb8c158e2a3e92972a5c86673"
server: AmazonS3
vary: Accept-Encoding
x-cache: Miss from cloudfront
via: 1.1 87cff53a3b3c669d865b820d148e2d63.cloudfront.net (CloudFront)
x-amz-cf-pop: HIO50-C2
x-amz-cf-id: ThqtuTVHKIHpQxwJNpkzEuDXD4pFvshCbivyVGE55y2t9sp9mIyqkQ==

@jyn514 added the A-backend (Area: Webserver backend) and E-easy (Effort: Should be easy to implement and would make a good first PR) labels Nov 26, 2021
@syphar
Member

syphar commented Nov 26, 2021

The only easy implementation here would be doing that for static assets.

Implementing a last-modified date for rustdoc pages is harder, since every new release changes the content for the old releases. Also, an easy E-tag calculation (based on the MD5 for example) might have a performance impact on the webserver for bigger rustdoc files.

The performance advantage would likely only be perceivable for US users, since for other users most of the time is spent in the roundtrip, and ETag/Last-Modified caching still does the roundtrip.

@jyn514 removed the E-easy (Effort: Should be easy to implement and would make a good first PR) label Nov 26, 2021
@jsha
Contributor Author

jsha commented Nov 26, 2021

> Also, an easy E-tag calculation (based on the MD5 for example) might have a performance impact on the webserver for bigger rustdoc files.

We could use the crate version + rustdoc version as the ETag.

> Implementing a last-modified date for rustdoc pages is harder, since every new release changes the content for the old releases.

This suggests perhaps an ETag of crate version + rustdoc version + latest crate version.

All that said, it seems like Last-Modified would also work pretty well and would be simple to calculate.

> The performance advantage would likely only be perceivable for US users, since for other users most of the time is spent in the roundtrip, and ETag/Last-Modified caching still does the roundtrip.

That's only true if you assume a full download happens in a single roundtrip. If (at a guess) a typical page is 150 kB and the starting receive window size is 16 kB (the Windows default), there will be at least a few roundtrips beyond the first. We could measure this with Wireshark!
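
As a back-of-the-envelope check of that estimate, here is a deliberately crude model in Rust. It assumes at most one window of data is delivered per roundtrip and that the window either stays fixed or grows by a constant factor each roundtrip; real TCP behavior is more complicated, and the function name `round_trips` is hypothetical.

```rust
/// Count the roundtrips needed to deliver `total_bytes` when at most one
/// window of data is sent per roundtrip. `growth` multiplies the window
/// after each roundtrip (1 = fixed window, 2 = doubling).
fn round_trips(total_bytes: u64, initial_window: u64, growth: u64) -> u32 {
    assert!(initial_window > 0 && growth >= 1);
    let mut window = initial_window;
    let mut sent = 0u64;
    let mut trips = 0u32;
    while sent < total_bytes {
        sent = sent.saturating_add(window);
        window = window.saturating_mul(growth);
        trips += 1;
    }
    trips
}
```

Under these assumptions, `round_trips(150_000, 16_000, 2)` gives 4 and `round_trips(150_000, 16_000, 1)` gives 10, which is in line with "at least a few roundtrips beyond the first".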

Another potential benefit, besides end-user speed, is that this could reduce bandwidth costs.

@jyn514
Member

jyn514 commented Nov 26, 2021

> We could use the crate version + rustdoc version as the ETag.

This needs to include the docs.rs version too, in case the header has changed.

Actually that reminds me, we also need to include the latest version in the ETag, so the drop-down gets updated with all newer versions.

@jsha
Contributor Author

jsha commented Nov 26, 2021

Here's a webpagetest result for https://docs.rs/serde_json/1.0.72/serde_json/struct.Deserializer.html, from Milan, on Chrome, with Cable internet speed:

https://www.webpagetest.org/result/211126_AiDcNW_f655e755ab2f15dbb0c7886b76f7b16f/1/details/#waterfall_view_step1

You can click on the waterfall for a detailed view, but I'll copy the details of the first request here for convenience:

URL: https://docs.rs/serde_json/1.0.72/serde_json/struct.Deserializer.html
Loaded By: https://docs.rs/serde_json/1.0.72/crates-20211124-1.58.0-nightly-b426445c6.js:
Document: https://docs.rs/serde_json/1.0.72/serde_json/struct.Deserializer.html
Host: docs.rs
IP: 18.66.196.123
Error/Status Code: 200
Priority: Highest
Protocol: HTTP/2
HTTP/2 Stream: 1, weight 256, depends on 0, EXCLUSIVE
Request ID: B306C410314E1DEE8995BA169B82D0C5
Discovered: 0.012 s
Request Start: 0.124 s
DNS Lookup: 35 ms
Initial Connection: 29 ms
SSL Negotiation: 47 ms
Time to First Byte: 233 ms
Content Download: 446 ms
Bytes In (downloaded): 16.5 KB
Uncompressed Size: 149.0 KB
Bytes Out (uploaded): 1.8 KB
CPU Time: 2 ms

The most relevant bit here is "Content Download: 446ms", which is 56% of the total time for that request (790ms). The total page load was ~1900ms. That suggests to me that we could save a good amount of time with this technique.
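
For reference, the 790 ms total is the sum of the request phases listed above, and the 56% figure follows from it. A small Rust check, with the timings copied from the WebPageTest details and hypothetical helper names:

```rust
/// Phase timings in milliseconds, copied from the WebPageTest details above.
const PHASES_MS: [(&str, u32); 5] = [
    ("DNS Lookup", 35),
    ("Initial Connection", 29),
    ("SSL Negotiation", 47),
    ("Time to First Byte", 233),
    ("Content Download", 446),
];

/// Sum of all phases in milliseconds.
fn total_ms() -> u32 {
    PHASES_MS.iter().map(|(_, ms)| ms).sum()
}

/// Share of the total spent in the named phase, as a rounded percentage.
/// Returns `None` when the phase name is not in the table.
fn share_percent(phase: &str) -> Option<u32> {
    let ms = PHASES_MS.iter().find(|(name, _)| *name == phase)?.1;
    Some((f64::from(ms) * 100.0 / f64::from(total_ms())).round() as u32)
}
```

Here `total_ms()` is 790 and `share_percent("Content Download")` is `Some(56)`.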

If you click over to the Response tab for that request you can see x-amz-cf-pop: MXP63-P1, which AFAICT is in Milan. That's presumably why the Initial Connection and SSL Negotiation numbers are so good.

The Time to First Byte: 233 ms probably represents a combination of roundtrip from Milan to the US, and internal processing in the docs.rs webserver. I think that's the amount that would be reduced by #1552. Of course, #1552 would improve performance for first load and subsequent loads alike, while this would only improve performance for repeat loads.

@syphar
Member

syphar commented Nov 27, 2021

> > We could use the crate version + rustdoc version as the ETag.
>
> This needs to include the docs.rs version too, in case the header has changed.
>
> Actually that reminds me, we also need to include the latest version in the ETag, so the drop-down gets updated with all newer versions.

btw, if we can calculate the etag and last-updated without generating the page, that would also save processing time on the server because the CDN can directly return the cached page.
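
A minimal sketch of that idea in Rust: the validator is compared first, and the expensive page generation only runs when the client's copy is stale. The `Response` enum and the `respond` function are made-up stand-ins, not docs.rs or framework types, and the comparison is simplified to a single exact tag.

```rust
/// Minimal stand-in for an HTTP response; not a real framework type.
#[derive(Debug, PartialEq)]
enum Response {
    NotModified { etag: String },
    Ok { etag: String, body: String },
}

/// Compare the cheap, precomputed `etag` first and only call `render`
/// (the expensive page generation) when the client's copy is stale.
fn respond<F>(if_none_match: Option<&str>, etag: &str, render: F) -> Response
where
    F: FnOnce() -> String,
{
    if if_none_match.map(str::trim) == Some(etag) {
        return Response::NotModified { etag: etag.to_string() };
    }
    Response::Ok { etag: etag.to_string(), body: render() }
}
```

Because `render` is a closure, the page is never generated on the 304 path.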

Both E-tag and last-modified would have to change when any of these change:

  • rustdoc version
  • requested crate version
  • latest crate version
  • docs.rs version
  • yanked versions
  • rebuilt versions (though we could assume that, most of the time, the rustdoc version changes too when we rebuild)
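
One possible way to derive an ETag from the inputs listed above without touching the page content is sketched below in Rust. The struct and field names are illustrative assumptions, not docs.rs types. It uses the standard library's `DefaultHasher`, whose algorithm is not guaranteed to stay the same across Rust releases; for a validator that would only cause extra cache misses after a toolchain upgrade, but a real implementation might prefer a fixed hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Inputs that should invalidate a cached page. Names are illustrative only.
#[derive(Hash)]
struct EtagInputs<'a> {
    rustdoc_version: &'a str,
    crate_version: &'a str,
    latest_version: &'a str,
    docs_rs_version: &'a str,
    /// Must be in a stable order (e.g. sorted) so equal sets hash equally.
    yanked_versions: Vec<&'a str>,
}

/// Build a quoted, fixed-length ETag from the inputs alone.
fn etag(inputs: &EtagInputs<'_>) -> String {
    let mut hasher = DefaultHasher::new();
    inputs.hash(&mut hasher);
    format!("\"{:016x}\"", hasher.finish())
}
```

Any change to one of the fields, such as a new latest version, yields a different tag, while the page itself never has to be rendered to compute it.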

@jyn514
Member

jyn514 commented Nov 27, 2021

> rebuilt versions (though we could assume that, most of the time, the rustdoc version changes too when we rebuild)

Yeah, I don't think this needs to be tracked explicitly. Even if we do a rebuild, it shouldn't change the page unless the docs.rs or rustdoc version has changed.

@syphar
Member

syphar commented Nov 27, 2021

Only implementing a valid E-tag is probably easier than trying to make up a usable last-modified timestamp based on these inputs. And for caching in the browser it shouldn't matter.

> The most relevant bit here is "Content Download: 446ms", which is 56% of the total time for that request (790ms). The total page load was ~1900ms. That suggests to me that we could save a good amount of time with this technique.
>
> If you click over to the Response tab for that request you can see x-amz-cf-pop: MXP63-P1, which AFAICT is in Milan. That's presumably why the Initial Connection and SSL Negotiation numbers are so good.

Valid points. This could be a real improvement if we can keep calculating the ETag simple, and if we check whether CloudFront caches these too.

> The Time to First Byte: 233 ms probably represents a combination of roundtrip from Milan to the US, and internal processing in the docs.rs webserver. I think that's the amount that would be reduced by #1552. Of course, #1552 would improve performance for first load and subsequent loads alike, while this would only improve performance for repeat loads.

I'm not sure why I didn't think about this earlier, but when we factor CloudFront into the ETag handling, then theoretically subsequent requests from any other browser to the CDN should also be reduced to just the roundtrip to check the latest ETag.

@syphar
Member

syphar commented Jan 25, 2022

We now have control over the cache and we could continue on this story.

@jyn514
Member

jyn514 commented Jul 16, 2022

> This suggests perhaps an ETag of crate version + rustdoc version + latest crate version.

This needs all versions btw; the latest semver version is not necessarily the most recently published one.
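
To illustrate how the highest version number and the most recent publication can differ, here is a small Rust sketch with hypothetical helper names. It only handles plain `major.minor.patch` strings and ignores pre-release and build metadata.

```rust
/// Parse a plain `major.minor.patch` version. Pre-release and build
/// metadata are deliberately not handled in this sketch.
fn parse_version(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.split('.').map(|p| p.parse::<u64>().ok());
    let version = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some(version)
}

/// Highest version by numeric ordering, regardless of publication order.
/// Entries that fail to parse are skipped.
fn latest_by_semver<'a>(published_in_order: &[&'a str]) -> Option<&'a str> {
    published_in_order
        .iter()
        .copied()
        .filter_map(|v| parse_version(v).map(|key| (key, v)))
        .max_by_key(|(key, _)| *key)
        .map(|(_, v)| v)
}
```

For a crate published in the order `1.5.0`, `2.0.0`, `1.5.1` (a backport after the major release), the most recently published version is `1.5.1` while `latest_by_semver` returns `2.0.0`.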

@syphar
Member

syphar commented Oct 13, 2022

two points I want to add:

  • as @jsha described, setting last-modified can affect how browsers cache when there are no other cache-control headers. Since we use "no cache-control headers" as a marker for CloudFront to apply the default TTL (see #1856, "new cache-policy & cache middleware structure to support full page caching"), we probably shouldn't set it.
  • the "simple" ETag implementation I often see just generates an MD5 hash of the content and returns 304 if the new hash matches. While this doesn't help with server response times, it might help save data transfer between the server, CloudFront, and the users. And the transfer, especially from the US, takes a big chunk of the time.
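
The content-hash approach from the second point could look roughly like this in Rust. To stay dependency-free, this sketch swaps in a 64-bit FNV-1a hash where the comment mentions MD5; the function names are hypothetical, and the page still has to be generated before the hash can be computed, so only the transfer is saved.

```rust
/// 64-bit FNV-1a hash, used here only as a dependency-free stand-in for MD5.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Quoted ETag derived from the already rendered response body.
fn content_etag(body: &[u8]) -> String {
    format!("\"{:016x}\"", fnv1a64(body))
}

/// Returns `None` when a 304 without a body is enough, or `Some(body)`
/// when the full response has to be sent.
fn body_to_send<'a>(if_none_match: Option<&str>, body: &'a [u8]) -> Option<&'a [u8]> {
    if if_none_match == Some(content_etag(body).as_str()) {
        None
    } else {
        Some(body)
    }
}
```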
