full page caching for rustdoc pages in the CDN #1552

Closed
syphar opened this issue Nov 13, 2021 · 10 comments · Fixed by #1856
Labels
A-backend Area: Webserver backend C-enhancement Category: This is a new feature E-medium Effort: This requires a fair amount of work

Comments

@syphar
Member

syphar commented Nov 13, 2021

We can use a CDN to improve worldwide speed for serving documentation. Even if we improved server-side response times (for example by caching S3 requests locally on the webserver), we would still be looking at the global latency between EU/US and elsewhere (at least 100 ms). If we still need to optimize server-side response times after CDN caching, we can do that then.

Documentation is mostly static and can only change with a build, so it's a nearly perfect candidate for CDN caching. It's only nearly perfect because we have rebuilds and because we add a header & footer.

I think we can leverage CDN caching for most parts of the site. By actively invalidating caches and using a good CDN we would still always serve up-to-date content.

page-types and invalidation events

cached forever, no invalidation needed:

  • static assets with hashed filenames

can only change after a release for one specific crate:

  • rustdoc pages (the header contains all versions of the crate)
  • latest-version redirects
  • release-internal redirects

can only change when we release new code:

  • documentation pages etc
  • styles

not really cacheable:

  • search-results
  • release-lists (only cacheable if we accept them being outdated for a certain amount of time)
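
To illustrate the categories and invalidation events above, here is a minimal Rust sketch of how the mapping from events to purge paths could look. The type names and the path patterns are hypothetical, not the actual docs.rs implementation:

```rust
/// Hypothetical event types, mirroring the categories above.
enum InvalidationEvent<'a> {
    /// A build finished for one specific crate.
    CrateRelease { name: &'a str },
    /// docs.rs itself was deployed with template / style changes.
    Deploy,
}

/// Which CDN path patterns would need to be purged for each event.
/// The patterns are illustrative only.
fn paths_to_purge(event: &InvalidationEvent) -> Vec<String> {
    match event {
        // rustdoc pages, latest-version redirects and release-internal
        // redirects all live under the crate's path prefix.
        InvalidationEvent::CrateRelease { name } => vec![format!("/{name}/*")],
        // non-rustdoc pages and styles only change with a deploy.
        InvalidationEvent::Deploy => vec!["/*".to_string()],
    }
}
```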

requirements for the CDN

we need

  • fast invalidation (tag-based if possible; path/pattern-based works for simple cases too)
  • CDN-specific caching headers that are stripped at the CDN level (so we keep control over the cache at all times)
  • logic at the edge to add CSP nonces

Nice to have would be:

  • serving stale content while updating the cache in the background.
  • soft purge: serves stale content while updating the cache. Prevents thundering herd problem when clearing the cache.

CloudFront

  • invalidations are probably too expensive to execute on every release
  • path/pattern invalidations are possible, tags are not
  • invalidations take minutes, sometimes up to 15
  • secret headers: I would need to research these, I don't have a definitive answer yet. Perhaps solvable with Lambda@Edge or CloudFront configuration.
  • but we already have it
  • Lambda@Edge could probably solve the CSP issue; I didn't dig deeper into programming language support there

Fastly

CloudFlare

I haven't dug deeper into the feature set here yet.

browser caches

Since we want to actively invalidate certain caches, we won't cache these pages in the browser, and will limit browser caching to static assets with hashed filenames, as we do currently.

@syphar syphar added E-medium Effort: This requires a fair amount of work C-enhancement Category: This is a new feature A-backend Area: Webserver backend labels Nov 13, 2021
@syphar
Member Author

syphar commented Nov 13, 2021

@jyn514 to start with I just wrote down my thoughts. I'll probably refine the text over time and add details.

The quickest solution with the least amount of work would be:

  • use our current CloudFront setup
  • cache pages and redirects below the /crate_name/ path, invalidate this whole pattern on every release (or yank) for that crate
  • release-lists and static pages could be cached for a short time (30 minutes or shorter, or not at all)

A full cache purge is risky depending on the amount of requests the server gets, because all cache locations would request new content at the same time. Costs need to be checked: AWS charges $0.005 per path invalidated after the first 1,000 invalidations (standard price, I don't know our discounts / credits). With 600 releases per day, that sums up to $90, which feels ok-ish, but it's not up to me :)
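
For illustration, here is a minimal Rust sketch of the per-release invalidation described in the second bullet above. To stay self-contained it shells out to the AWS CLI rather than using the AWS Rust SDK; the helper name, distribution ID parameter and path pattern are hypothetical:

```rust
use std::process::Command;

/// Hypothetical helper: after a release (or yank) of `crate_name`, invalidate
/// every cached page and redirect under /{crate_name}/ in CloudFront.
fn invalidate_crate(distribution_id: &str, crate_name: &str) -> std::io::Result<()> {
    let path = format!("/{crate_name}/*");
    let status = Command::new("aws")
        .args([
            "cloudfront",
            "create-invalidation",
            "--distribution-id",
            distribution_id,
            "--paths",
            path.as_str(),
        ])
        .status()?;
    if !status.success() {
        // CloudFront invalidations can take minutes to complete; a real
        // implementation would retry or surface this failure instead of logging.
        eprintln!("invalidation request for {path} failed");
    }
    Ok(())
}
```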

The best solution would probably be going with Fastly (alternatively Cloudflare, if the features match).
From a backend perspective the effort is similar (and I have already done some setups like this). The additional work is of course the contract / infrastructure part.

@syphar
Member Author

syphar commented Nov 14, 2021

To add: for CloudFront there would be additional costs for Lambda@Edge.

@jsha
Contributor

jsha commented Nov 26, 2021

Thanks for writing this up! Could you describe the current status quo? CloudFront, but with no caching? Or CloudFront, but without lots of PoPs all over the world?

It would be great to have a summary of how bad the current situation is, and how much it would improve under a better CDN approach. Are you interested in collecting a bunch of samples from https://webpagetest.org/? It seems like in particular Time To First Byte (TTFB) would be the most important measure that would be improved by a better CDN.

Also cross-linking some caching/performance related issues:

Another thing to consider: With Cloudflare Workers / Fastly Compute@Edge, we could do the unpacking of storage blobs inside the CDN. That would have the advantage that when someone requests 1 page of a crate's docs, their local PoP would have the whole blob of that crate's docs, so subsequent navigations would be very fast.

AWS charges $0.005 per path invalidated after the first 1000 invalidations (standard price, I don't know our discounts / credits). When I take 600 releases per day, that sums up to $90, which feels ok-ish, but not up to me :)

I'm getting 0.005 * 600 = $3. Presumably there is some other multiplier here that I'm missing?

@syphar
Member Author

syphar commented Nov 26, 2021

Thanks for writing this up! Could you describe what is the current status quo? CloudFront, but with no caching? Or CloudFront, but without lots of PoPs all over the world?

Valid question, thanks for asking.

The whole of docs.rs is behind CloudFront. All static assets, from rustdoc or docs.rs, are cached in the browser and the CDN. All other pages are uncached and just routed through the CDN; our webserver answers all of them. For most crates, server-side response times are totally fine. From my perception the bottleneck is the request from Europe (for me) to the AWS datacenter in the US where docs.rs is hosted.

Right now the pages are regenerated for every request, including fetching the original files from S3. While we could of course start caching files locally on our webserver, we could also skip that step and cache directly at the edge, helping not only US users but users worldwide :)

It would be great to have a summary of how bad the current situation is, and how much it would improve under a better CDN approach. Are you interested in collecting a bunch of samples from https://webpagetest.org/? It seems like in particular Time To First Byte (TTFB) would be the most important measure that would be improved by a better CDN.

Any modern CDN would be fine regarding performance and PoPs. The biggest differentiator is how we can selectively invalidate parts of the site, since every new (or re-)release also changes cached content for old releases, and we want the docs to be up-to-date.
Edge logic is (for the start) only needed to combine CSP with caching.

TTFB measurements could of course feed into a better CDN selection; we could start with CloudFront and already have a good solution without too much infrastructure effort.

Another thing to consider: With Cloudflare Workers / Fastly Compute@Edge, we could do the unpacking of storage blobs inside the CDN. That would have the advantage that when someone requests 1 page of a crate's docs, their local PoP would have the whole blob of that crate's docs, so subsequent navigations would be very fast.

While that could be a next optimization step, it needs more design, since the documentation blobs are sometimes multiple gigabytes and contain millions of files.

AWS charges $0.005 per path invalidated after the first 1000 invalidations (standard price, I don't know our discounts / credits). When I take 600 releases per day, that sums up to $90, which feels ok-ish, but not up to me :)

I'm getting 0.005 * 600 = $3. Presumably there is some other multiplier here that I'm missing?

Sorry, I wasn't clear enough here; I was thinking of $90 monthly.

@jsha
Contributor

jsha commented Sep 19, 2022

I think this sounds exactly right. I was actually wondering if something like this (CDN-only caching for the HTML pages) was possible with CloudFront.

@syphar
Member Author

syphar commented Sep 20, 2022

@jsha I actually removed the long writeup again because I had forgotten the main reason why full page caching is more work here: CSP 😄.

We could handle this part with Lambda@Edge, but I would have to dig into that first.

@jsha
Contributor

jsha commented Sep 20, 2022

I think it would be useful to bring back the long writeup, even if you caveat it with "I think this won't work because of CSP script-nonce." However, I think it will work with CSP script-nonce (and also we should move away from our plans to use CSP in this way, see #1853).

Here's a Server Fault thread to back me up: https://serverfault.com/a/1064775/361298 (and I talk about this here: #1569 (comment)).

The short version is: CSP script-nonce is not a nonce in the cryptographic sense of "if this is ever used twice everything will explode horribly." Instead, it's just a random value that needs to (a) be unpredictable before a given page is generated, and (b) match the nonce= attribute for the scripts on a generated page.

When a page with CSP script-nonce: xyz is generated, docs.rs adds nonce=xyz to all script tags. If that page gets cached, the CSP header gets cached along with the body, so loading that page later will work just fine: the header still matches the contents. Sometime down the road, if the browser gets a fresh copy of the page, it will get fresh headers that match the fresh body, and things will work fine.

Does caching help an attacker trying to defeat CSP script-nonce? Nope. Once the page is cached, its contents are unchanging, so there's no possibility of an attacker crafting an XSS with a known nonce.
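
To make that concrete, here is a minimal, hypothetical Rust sketch of per-response nonce handling (the function and the `{{nonce}}` template placeholder are illustrative, not docs.rs code): the nonce is generated once per rendered page, and the CSP header and the script tags are produced together, so a cached copy stays internally consistent.

```rust
use rand::Rng;

/// Hypothetical page renderer: returns the CSP header value and the HTML body
/// for one response. `{{nonce}}` is an assumed template placeholder.
fn render_page(body_template: &str) -> (String, String) {
    // 128 bits of randomness, hex-encoded, generated once per rendered page.
    // It only needs to be unpredictable before the page is generated.
    let nonce = format!("{:032x}", rand::thread_rng().gen::<u128>());

    // Header and body are produced together; if a CDN caches the response,
    // the cached header still matches the cached `nonce=` attributes.
    let csp_header = format!("script-src 'nonce-{nonce}'");
    let body = body_template.replace("{{nonce}}", &nonce);

    (csp_header, body)
}
```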

@syphar
Member Author

syphar commented Sep 21, 2022

I think it would be useful to bring back the long writeup, even if you caveat it with "I think this won't work because of CSP script-nonce."

You're right, I'll re-add it from memory below. (If someone still has an email notification, I'll take it ;) )

However, I think it will work with CSP script-nonce (and also we should move away from our plans to use CSP in this way, see #1853).

related comment by @jsha : #1569 (comment)

IMO this is the biggest question to be answered around the caching topic. I would love some more input from @rust-lang/docs-rs.

the new full page caching idea

(from memory, in keywords)

CloudFront is somewhat limited compared to Fastly or Cloudflare. For our caching we would need something that controls caching in the CDN without affecting caching in the browser (like max-age does) or other intermediary caches (like s-maxage does). Fastly has the Surrogate-Control header for this, Cloudflare has CDN-Cache-Control.

With this control we could let the CDN cache for a long time, while just actively invalidating the content we want to invalidate after we build a crate version.

But there could be a workaround: looking at CloudFront's documentation on cache-control headers, we could use the default TTL for this.

  • rustdoc pages, redirects etc. don't get any max-age, so CloudFront internally applies the long default TTL. While I could imagine a short max-age + stale-while-revalidate having similar behaviour, I believe that especially for /latest/ URLs we want more control.
  • static assets could still get an explicit "forever" TTL
  • pages we always want up-to-date (release-lists, builds etc) would need an explicit max-age=0 or no-cache

Invalidation would be:

  • /krate/* after a build
  • the whole site after HTML / style changes (only if we start caching non rustdoc / redirect pages)

The only annoying part of this approach is that we have to explicitly set no-cache on all pages we don't want to be cached, which I could imagine implementing as a middleware.
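
As a hedged illustration of that middleware idea, here is a minimal sketch built on the http crate only; the CachePolicy type, its variants and the exact header values are assumptions, not the docs.rs implementation:

```rust
use http::{header, HeaderValue, Response};

/// Hypothetical caching policies a handler can opt into.
enum CachePolicy {
    /// No Cache-Control header at all: browsers don't cache the page,
    /// but CloudFront applies its long default TTL (the CDN-only case).
    CdnOnly,
    /// Always fetched from the origin (search results, build lists, ...).
    /// This would be the default applied by the middleware.
    NoCaching,
    /// Hashed static assets: cacheable everywhere, effectively forever.
    Forever,
}

/// Would run as the last step of a response middleware, so that `no-cache`
/// is the default and CDN caching is always an explicit opt-in.
fn apply_cache_policy<B>(policy: CachePolicy, response: &mut Response<B>) {
    let value = match policy {
        CachePolicy::CdnOnly => return, // deliberately no Cache-Control header
        CachePolicy::NoCaching => "max-age=0, no-cache",
        CachePolicy::Forever => "public, max-age=31536000, immutable",
    };
    response
        .headers_mut()
        .insert(header::CACHE_CONTROL, HeaderValue::from_static(value));
}
```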

@syphar
Member Author

syphar commented Sep 21, 2022

I will do some reading & testing around this

@jsha
Contributor

jsha commented Sep 22, 2022

I had a copy in my email notifications. You reproduced it remarkably well from memory!

original proposal

I would love feedback here before continuing.

While working on #1825 and thinking about new settings / improvements for #1569, I had a new idea.

For other sites that have static content and are updated via a publish / build process, there is a quite performant setup possible:

  • cache everything for a long time in the CDN
  • purge affected content after building a release
  • when we deploy changes that affect cached pages we might need a manual purge of the whole site
  • serve stale content for a short time while revalidating
  • forbid any browser or intermediary caching (excluding things with hashed filenames of course)
  • Since CloudFront doesn't have Surrogate-Control (like Fastly) or CDN-Cache-Control (like Cloudflare), I thought this would probably involve some Lambda@Edge JS/Python logic, which I would love to avoid.

Now after reading the CloudFront docs in more detail I had an idea:

  • let's set a long TTL as the default TTL in CloudFront. It will be applied when we don't provide any max-age in our responses. We can still set stale-while-revalidate so responses are fast after purges. Browsers and other caches won't see this TTL.
  • static assets will still return a longer TTL and be cached by everyone.
  • pages that should be always up-to-date would need to get Cache-Control: no-cache, no-store

While the last point might sound risky, since we could forget to add this, it could be the default and for example be added via a middleware, with an optional longer TTL.

Any thoughts about this?
