
implement CDN invalidation queue & background worker #1961

Merged: 1 commit merged into rust-lang:master on Jan 14, 2023

Conversation


@syphar syphar commented Dec 17, 2022

To work around the CloudFront limit (max. 15 active wildcard invalidations), this implements a queue & batching for CDN invalidations.

So when we get a bunch of changes at once (many small builds, many failed builds, ...), these are queued and executed over time.

I was also thinking about adding invalidation-related metrics, but decided to postpone that until this code has been running for some time.

Once the invalidations work reliably, we can finally start to slowly introduce full page caching for our docs.
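To illustrate the queueing side, here is a minimal sketch (not the PR's actual code) using the `postgres` crate. The table name matches the one discussed in the review below; the `path_pattern` column and the URL patterns are assumptions for illustration only:

```rust
use postgres::{Client, Error};

/// Queue wildcard invalidations for a crate instead of sending them to
/// CloudFront directly, so a background worker can batch them later.
fn queue_crate_invalidation(conn: &mut Client, krate: &str) -> Result<(), Error> {
    // Hypothetical path patterns; the real ones depend on the docs.rs URL layout.
    for pattern in [format!("/{krate}*"), format!("/crate/{krate}*")] {
        conn.execute(
            "INSERT INTO cdn_invalidation_queue (path_pattern, queued)
             VALUES ($1, CURRENT_TIMESTAMP)",
            &[&pattern],
        )?;
    }
    Ok(())
}
```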


syphar commented Dec 17, 2022

r? @jsha

@github-actions github-actions bot added the S-waiting-on-review (Status: This pull request has been implemented and needs to be reviewed) label on Dec 17, 2022
@syphar syphar force-pushed the cdn-invalidation-batches branch 2 times, most recently from 21404b0 to 51f7367 on December 17, 2022 09:38

syphar commented Dec 17, 2022

(I'm debugging the failing test, which doesn't fail locally. Review is still possible IMO.)

Edit: solved the test failure.

@syphar syphar force-pushed the cdn-invalidation-batches branch 2 times, most recently from 0f99d9d to 3713623 on December 17, 2022 10:16
@syphar syphar force-pushed the cdn-invalidation-batches branch from 3713623 to 2ba21db on January 7, 2023 12:24

syphar commented Jan 7, 2023

cc @rust-lang/docs-rs, could anyone help with reviewing this?

I would love to finish this up & activate full page caching.

@jsha jsha left a comment (Contributor)

To make sure I understand the design correctly:

There is a new table cdn_invalidation_queue. After a crate build, we append to that table. Periodically, a cron task fetches up to N items from the table, creates a CDN invalidation from them, and sets those items to created_in_cdn = CURRENT_TIMESTAMP.

As part of the same cron task, we fetch the list of active invalidations from the CDN, and delete from the database all items that are not considered active invalidations by the CDN, but which do have a non-null created_in_cdn. This should effectively mean deleting all completed items. So the database should in normal operation always have a smallish number of rows.
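To make it concrete, here is a sketch of what I understand those two steps to be, assuming the `postgres` crate; `created_in_cdn` and `queued` are from the description above, while `id` and `path_pattern` are guessed column names:

```rust
use postgres::{Client, Error};

/// Step 1: pick up to `batch_size` items that have no invalidation yet and
/// mark them as sent to the CDN, returning the path patterns to invalidate.
fn pick_up_queued_invalidations(conn: &mut Client, batch_size: i64) -> Result<Vec<String>, Error> {
    let rows = conn.query(
        "UPDATE cdn_invalidation_queue
         SET created_in_cdn = CURRENT_TIMESTAMP
         WHERE id IN (
             SELECT id
             FROM cdn_invalidation_queue
             WHERE created_in_cdn IS NULL
             ORDER BY queued
             LIMIT $1
             FOR UPDATE SKIP LOCKED
         )
         RETURNING path_pattern",
        &[&batch_size],
    )?;
    Ok(rows.iter().map(|row| row.get::<_, String>(0)).collect())
}

/// Step 2: delete rows that were sent to the CDN but no longer show up in
/// the CDN's list of active invalidations, i.e. completed ones.
/// `active_paths` would come from a CloudFront list-invalidations call.
fn delete_completed(conn: &mut Client, active_paths: &[String]) -> Result<u64, Error> {
    conn.execute(
        "DELETE FROM cdn_invalidation_queue
         WHERE created_in_cdn IS NOT NULL
           AND NOT (path_pattern = ANY($1))",
        &[&active_paths],
    )
}
```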

What happens if we fall behind? It seems like there should be an additional query that deletes all rows from the database where queued is more than X minutes ago, regardless of whether we were able to create an invalidation for them. At some point they are old enough that they've become irrelevant.
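Something like this hypothetical query, run as part of the same cron task, with X minutes as a parameter and `queued` as in the sketch above:

```rust
use postgres::{Client, Error};

/// Hypothetical safety valve for the fall-behind case: drop entries queued
/// more than `max_age_minutes` ago, whether or not an invalidation was ever
/// created for them.
fn delete_stale_entries(conn: &mut Client, max_age_minutes: i32) -> Result<u64, Error> {
    conn.execute(
        "DELETE FROM cdn_invalidation_queue
         WHERE queued < CURRENT_TIMESTAMP - make_interval(mins => $1)",
        &[&max_age_minutes],
    )
}
```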

It's worth noting that databases-as-queues are usually not very efficient, but so long as we keep the table size and write volume small, this will probably be okay for now.


syphar commented Jan 11, 2023

> To make sure I understand the design correctly:
>
> There is a new table cdn_invalidation_queue. After a crate build, we append to that table. Periodically, a cron task fetches up to N items from the table, creates a CDN invalidation from them, and sets those items to created_in_cdn = CURRENT_TIMESTAMP.
>
> As part of the same cron task, we fetch the list of active invalidations from the CDN, and delete from the database all items that are not considered active invalidations by the CDN, but which do have a non-null created_in_cdn. This should effectively mean deleting all completed items. So the database should in normal operation always have a smallish number of rows.

Yep, that's it.

> What happens if we fall behind? It seems like there should be an additional query that deletes all rows from the database where queued is more than X minutes ago, regardless of whether we were able to create an invalidation for them. At some point they are old enough that they've become irrelevant.

If we deleted the queued invalidation, we would leave outdated content visible to users. IMO an invalidation only becomes irrelevant once it's older than the TTL on the CDN, and in the best case that TTL is forever.

> It's worth noting that databases-as-queues are usually not very efficient, but so long as we keep the table size and write volume small, this will probably be okay for now.

Yeah, I'm aware of that and the possible alternatives.

Since we have the build-queue in the DB too, I would postpone that change altogether to a later point.

Some more general notes:

If we go live with this, the plan would be to keep the TTL short for some time, to see how long the invalidations take and whether they fail. We could also think about adding some metrics once we know which ones would help us.

I believe for the near future we would be fine with the 15-invalidation limit, and the invalidations would be fast enough for the number of releases / yanks we get.
But of course this is a hard limit that won't move long-term; at some point we would have to either merge / escalate invalidations, move to a different CDN, or give up caching again.

I definitely didn't expect this much complexity working around CloudFront's limitations, and the approach still has downsides.

I'm still excited & motivated to move forward; web-speed-me would definitely enjoy browsing docs far more when it's done.

My long-term wish / preference would be moving to another CDN like Fastly, which would make all of this queuing logic obsolete. But since we don't have that yet, I lean towards adding this queue until then.

@jsha jsha left a comment (Contributor)

I think there are some unhandled cases where CloudFront's list of invalidations gets too big, or the table of queued invalidations gets too big. But I guess so long as our cache time is low, those aren't a big deal: the cron task will just get stuck somewhere, while builds continue to happen and the cache expires after N minutes. Let's try it!

@syphar syphar merged commit b50e43a into rust-lang:master Jan 14, 2023
@syphar syphar deleted the cdn-invalidation-batches branch January 14, 2023 08:14
@github-actions github-actions bot added the S-waiting-on-deploy (This PR is ready to be merged, but is waiting for an admin to have time to deploy it) label and removed the S-waiting-on-review (Status: This pull request has been implemented and needs to be reviewed) label on Jan 14, 2023
@syphar syphar removed the S-waiting-on-deploy (This PR is ready to be merged, but is waiting for an admin to have time to deploy it) label on Jan 19, 2023