
Incremental static regeneration #1028

Merged · 61 commits · May 19, 2021

Conversation

@kirkness (Collaborator) commented Apr 28, 2021

Adds Incremental Static Regeneration!

Related issues #804 #995

  • Add unit tests for regeneration logic
  • Test assumptions around how rollup.js bundles dynamic imports as per @dphang's comment
  • Add e2e tests for ISR fallbacks
  • Setup deployment of infra via serverless
  • Add e2e tests
  • Make code for ISR possible to be agnostic to the provider as per @dphang's comments
  • Only pass what's needed to the queue as per comment from @jvarho
  • Find a workaround for max 300 SQS calls per second (using lambda async invoke)
  • Setup deployment of infra via CDK
  • Create a demo site for reviewers - https://d3kjdl6gtxbc5l.cloudfront.net/about
  • Update cache logic to handle changes to the revalidate property, as per @jvarho's comment
  • Load test the SQS queue to more than 300 requests in one second to test this assumption from @evangow - assumption was correct, it's max 300 API calls per second, will find a workaround
  • Tidy up code by moving duplicate code (for example the page generation code) into functions/separate files

The implementation looks like:

  • At "origin response" we check whether the response page is an ISR page (has an Expires header, or the original build page had a revalidate property)
  • We check whether the Expires header is in the future and if so apply a cache-control header such that the object is stored at the edge until it's ready for revalidation.
  • If the Expires header is in the past then we trigger the page to be regenerated, whilst still returning the existing page with no-cache
  • To trigger a page rebuild we send a message to a FIFO SQS Queue, deduplicated using the identifier of the S3 object we are replacing so that we trigger at most one rebuild per page. FIFO SQS queues do come with a 300/second rate limit on all API requests, which might be hit on very active sites, in which case we cache the page for an extra second to back off requests (see the sketch below).
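A rough sketch of that origin-response decision, assuming the revalidation time is carried in the page's Expires header (the function and field names below are illustrative, not the actual handler code):

// Illustrative only: decide edge caching and regeneration from the Expires header.
const staticRegenerationResponse = (expiresHeader?: string) => {
  if (!expiresHeader) {
    return undefined; // not an ISR page, leave the response alone
  }
  const secondsLeft = Math.ceil(
    (new Date(expiresHeader).getTime() - Date.now()) / 1000
  );
  if (secondsLeft > 0) {
    // Fresh: let CloudFront cache it until it is due for revalidation.
    return {
      cacheControl: `public, max-age=0, s-maxage=${secondsLeft}, must-revalidate`,
      triggerRegeneration: false
    };
  }
  // Expired: serve the stale copy uncached and queue a rebuild.
  return {
    cacheControl: "public, max-age=0, s-maxage=0, must-revalidate",
    triggerRegeneration: true
  };
};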

You can check out a demo of this feature here; check the headers and the response times to get your head around its behaviour. The code for this page looks like:

export async function getStaticProps() {
  return {
    revalidate: 60,
    props: {
      date: new Date().toLocaleString()
    }
  };
}

const AboutPage = (props: {date: string}) => (
    // ...
    <p>This is the about page, and the time is about {props.date}</p>
    // ...
)

export default AboutPage


retryStrategy: await buildS3RetryStrategy()
});

const { req, res } = lambdaAtEdgeCompat(
@dphang (Collaborator) commented on the diff above:

This looks to be just regular Lambda, so this should not be needed, right? You can just create a Node req, res in that case?

@kirkness (Collaborator, Author):

Correct! Good catch.

@kirkness (Collaborator, Author):

On second thought, the reason this is here is that the original CloudFront event is actually passed through the queue to this lambda (the cloudFrontEventRequest variable), and therefore, although this is a standard lambda, we are still serving the original CloudFront event. The ideal change would be to standardize the request object throughout, perhaps using some of the principles in the serverless-http package. However, that might sit slightly outside the scope of this PR. Keen to know your thoughts @dphang!
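To make that concrete, here is a sketch of the kind of message the queue carries; the field names are assumptions rather than the exact schema used here (only cloudFrontEventRequest is taken from the code above):

// Illustrative shape of a regeneration message (not the exact schema in this PR).
interface RegenerationEvent {
  region: string;                // region of the bucket holding the prerendered pages
  bucketName: string;            // S3 bucket to write the regenerated page back to
  pagePath: string;              // which page to re-render
  cloudFrontEventRequest: any;   // the original CloudFront request, forwarded as-is
}

// Because the original CloudFront request travels with the message, the
// regeneration handler still adapts it into Node-style req/res objects via the
// compat layer, even though it runs as a plain (non-edge) Lambda.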

@dphang (Collaborator):

Yea, makes sense for now. I think it would be good to have a generic event since in the future this can be used for regular Lambda traffic (and maybe other providers)

@evangow commented Apr 28, 2021

I didn't see any handler for rate limit errors on the SQS queue in your code.

Thought about ISR quite a bit before. See my comment here: #804 (comment)

After I posted the comment above, I spent a while thinking about the edge cases I listed and the SQS queue.

The downside of using SQS to deduplicate requests is that you have to use a FIFO queue, which only supports 300 messages per second (300 send, receive, or delete operations per second) without batching.

The example I had in my head when thinking about this issue was Reddit. They could benefit a great deal from ISR on /r/popular (as an example), because the content can be stale for 60 seconds and it wouldn't really matter.

The second that the page expires though, there would be 10,000s (100Ks? 1Ms?) of requests to regenerate /r/popular.

The SQS queue would throw a rate limit error on any of those requests past the first 300 and all of the other pages requesting revalidation would also run into a rate limit error.

To handle the rate limit, you could have all initial requests go to a non-FIFO SQS queue, then have that queue send the messages to the FIFO queue.

You would also be able to batch the requests from the initial queue and send them over to the FIFO SQS queue. By batching the requests, you can process 3,000 requests / second (300 message limit * 10 messages / batch).

If the FIFO queue returns a rate limit error, then you can just leave those in the initial queue and retry later.

In other words:
Queue 1) Non-FIFO. Receives all regeneration requests. Batches them 10 at a time to send to Queue 2. Retries if Queue 2 returns a rate limit error.
Queue 2) FIFO. Deduplicates messages received from Queue 1 and calls the page regeneration handler.
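A rough sketch of the forwarding step in that design, assuming the AWS SDK v2 SQS client (the queue URL, message shape, and deduplication key are placeholders):

import { SQS } from "aws-sdk";

const sqs = new SQS();

// Forward up to 10 buffered regeneration requests from the plain queue to the
// FIFO queue in one batched call, deduplicating on the S3 key being replaced.
const forwardToFifo = async (fifoQueueUrl: string, s3Keys: string[]) => {
  const result = await sqs
    .sendMessageBatch({
      QueueUrl: fifoQueueUrl,
      Entries: s3Keys.slice(0, 10).map((key, i) => ({
        Id: `${i}`,
        MessageBody: JSON.stringify({ s3Key: key }),
        MessageGroupId: key,          // required for FIFO queues
        MessageDeduplicationId: key   // at most one rebuild per page per window
      }))
    })
    .promise();

  if (result.Failed.length > 0) {
    // Throttled/failed entries stay on the first queue and get retried later.
    throw new Error(`${result.Failed.length} regeneration requests not forwarded`);
  }
};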

Ultimately, I think adding a FIFO queue makes things too complicated for deduplicating requests due to the rate limit.

I would just regenerate the page via an async lambda, and if it gets regenerated a whole bunch of times because a page is popular, you're just spending some additional money to handle those requests.

Those pages are probably using SSR right now anyway, where every request already regenerates the page, so ISR is already going to be a money saver.

If you just use an async lambda, you'll save a little bit less money but you can avoid the complexity that SQS would involve to do it The Right Way.

I could probably be convinced either way is good, but I think just 1 FIFO queue might be an issue.
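For comparison, the async-invoke alternative is roughly this (AWS SDK v2, with a placeholder function name):

import { Lambda } from "aws-sdk";

const lambda = new Lambda();

// Fire-and-forget regeneration: InvocationType "Event" returns immediately,
// so the stale page can still be served while the rebuild runs in the background.
const triggerRegeneration = async (pagePath: string) => {
  await lambda
    .invoke({
      FunctionName: "page-regeneration-handler", // placeholder name
      InvocationType: "Event",
      Payload: JSON.stringify({ pagePath })
    })
    .promise();
};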

Unrelated: s-maxage = initialRevalidateSeconds - createdAgo. Isn't s-maxage supposed to be how long the page should be cached before revalidation? In other words: s-maxage = Date.now() + initialRevalidateSeconds or something? I'm not super familiar with the various expiration headers, so ignore me if I'm just wrong here.

@kirkness (Collaborator, Author)

Hey @evangow, cheers for the message. I've not investigated this in depth yet; however, I have implemented the queue with FIFO + message deduplication so that the receiving lambda will be invoked at most once for every time a page needs to be regenerated. What I don't know is whether the queue's rate limit is depleted even when the message is discarded as a duplicate?

@evangow commented Apr 28, 2021

I didn't see anything in the docs about that.

If it's the case that only the 1st instance counts against the API usage limits, then that would be awesome.

I assumed the docs to mean that any sqs.send() call would count against your usage even if the message is deduplicated.

@jvarho (Collaborator) commented Apr 29, 2021

Nice 👍

I also started working on this yesterday, but your efforts are clearly much further along.

However, I don't think your cache-control logic in default-handler will work. It isn't sufficient to just check the prerender manifest for revalidate times, since 1) dynamic pages won't be there and 2) revalidate is returned from getStaticProps and can differ from one render to the next. Instead, I think you need to store and check when the page will actually expire.

Here's my series up to the point where actual revalidation logic would be needed: master...jvarho:isr-wip
Feel free to use (or not) as you like. (Edit: fixed an issue in the topmost commit's tests.)

@kirkness (Collaborator, Author)

> Nice 👍
>
> I also started working on this yesterday, but your efforts are clearly much further along.
>
> However, I don't think your cache-control logic in default-handler will work. It isn't sufficient to just check the prerender manifest for revalidate times, since 1) dynamic pages won't be there and 2) revalidate is returned from getStaticProps and can differ from one render to the next. Instead, I think you need to store and check when the page will actually expire.
>
> Here's my series up to the point where actual revalidation logic would be needed: master...jvarho:isr-wip
> Feel free to use (or not) as you like. (Edit: fixed an issue in the topmost commit's tests.)

Thanks, @jvarho! I hadn't considered these cases and I think your approach of using the Expires header (based on what I can see in the AWS docs) would work a treat! I'll give it a go implementing and see if I can push a demo of it working (will add tests eventually as well).

@jvarho (Collaborator) commented Apr 29, 2021

> Thanks, @jvarho! I hadn't considered these cases and I think your approach of using the Expires header (based on what I can see in the AWS docs) would work a treat! I'll give it a go implementing and see if I can push a demo of it working (will add tests eventually as well).

The way I used it, the Expires header is only used to store the time in S3. It is the Cache-Control I set here that actually matters for Cloudfront: master...jvarho:isr-wip#diff-f4709477772e5badfaaf543f6104de4b4ab3bf8300a056c088f84c184d800288R828

The Expires header could – and probably should – be removed at that point.

(The reason to use Cache-Control is of course the distinction between Cloudfront and browser caching.)
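In other words, the write side looks roughly like this (AWS SDK v2; the bucket/key plumbing is illustrative):

import { S3 } from "aws-sdk";

const s3 = new S3();

// When a regenerated page is written back to S3, record when it should next be
// revalidated in the object's Expires metadata. The origin-response handler later
// turns that into an s-maxage value for CloudFront and can drop the Expires header.
const uploadRegeneratedPage = async (
  bucket: string,
  key: string,
  html: string,
  revalidateSeconds: number
) => {
  await s3
    .putObject({
      Bucket: bucket,
      Key: key,
      Body: html,
      ContentType: "text/html",
      Expires: new Date(Date.now() + revalidateSeconds * 1000)
    })
    .promise();
};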

@kirkness (Collaborator, Author)

> Thanks, @jvarho! I hadn't considered these cases and I think your approach of using the Expires header (based on what I can see in the AWS docs) would work a treat! I'll give it a go implementing and see if I can push a demo of it working (will add tests eventually as well).
>
> The way I used it, the Expires header is only used to store the time in S3. It is the Cache-Control I set here that actually matters for Cloudfront: master...jvarho:isr-wip#diff-f4709477772e5badfaaf543f6104de4b4ab3bf8300a056c088f84c184d800288R828
>
> The Expires header could – and probably should – be removed at that point.
>
> (The reason to use Cache-Control is of course the distinction between Cloudfront and browser caching.)

Ok, so you set the Expires header when creating the file in S3, and then use that to define the s-maxage header, at which point in the Origin Response the Expires header can be removed?

@jvarho (Collaborator) commented Apr 29, 2021

> Ok, so you set the Expires header when creating the file in S3, and then use that to define the s-maxage header, at which point in the Origin Response the Expires header can be removed?

Exactly 👍

@kirkness (Collaborator, Author)

> Ok, so you set the Expires header when creating the file in S3, and then use that to define the s-maxage header, at which point in the Origin Response the Expires header can be removed?
>
> Exactly 👍

Added this in 0ee2f55 and fc8bd38.

@adamdottv

(reaction GIF)

@jvarho (Collaborator) commented Apr 29, 2021

> Ok, so you set the Expires header when creating the file in S3, and then use that to define the s-maxage header, at which point in the Origin Response the Expires header can be removed?
>
> Exactly 👍
>
> Added this in 0ee2f55 and fc8bd38.

That looks correct for the regeneration path. However, you still need something like my changes in jvarho@bd988a4 and jvarho@4423a47 to get prerendered and fallback-rendered pages to expire, do you not? (I think you could just cherry-pick those two commits.)

@kirkness (Collaborator, Author)

> Ok, so you set the Expires header when creating the file in S3, and then use that to define the s-maxage header, at which point in the Origin Response the Expires header can be removed?
>
> Exactly 👍
>
> Added this in 0ee2f55 and fc8bd38.
>
> That looks correct for the regeneration path. However, you still need something like my changes in jvarho@bd988a4 and jvarho@4423a47 to get prerendered and fallback-rendered pages to expire, do you not? (I think you could just cherry-pick those two commits.)

You are probably right, I've not looked at those behaviours yet, but if I can just cherry-pick then that's awesome!

@kirkness (Collaborator, Author)

> Like I wrote elsewhere, dynamic pages do not have initialRevalidateSeconds in the manifest. The only way to know that they should be revalidated is to store that information in S3 – i.e. the Expires header you now use.

Update: 8a0d5ff

Frustratingly there is no simple way of setting the Expires header per-file when uploading the initial build via CDK (obviously possible via the serverless component). With the CDK construct, we can only set headers for an entire directory. Because of this, I've updated the default-handler (logic moved into getStaticRegenerationResponse) to either check for an Expires header or for an initialRevalidateSeconds. This should result in new pages being regenerated when needed and initial pages being regenerated; however, the narrow case you've just mentioned, where you might convert an ISR page to a static page, will not be handled. Unless you can think of another edge-case that may not be fulfilled, just this one feels acceptable until we've figured out a method of setting Expires per-file in CDK.
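A rough sketch of that fallback, with illustrative parameter names (getStaticRegenerationResponse is the helper mentioned above, but this is not its exact signature):

// Derive the time until revalidation from either the S3 object's Expires header
// (regenerated/fallback pages) or the prerender manifest's initialRevalidateSeconds
// (initial-build pages, where CDK cannot set a per-file Expires header).
const getStaticRegenerationResponse = (options: {
  expiresHeader?: string;             // from the S3 origin response, if present
  lastModifiedHeader?: string;        // when the object was written
  initialRevalidateSeconds?: number;  // from the prerender manifest, if present
}) => {
  const { expiresHeader, lastModifiedHeader, initialRevalidateSeconds } = options;

  if (expiresHeader) {
    const secondsLeft = (new Date(expiresHeader).getTime() - Date.now()) / 1000;
    return { secondsRemainingUntilRevalidation: Math.max(0, Math.ceil(secondsLeft)) };
  }

  if (initialRevalidateSeconds && lastModifiedHeader) {
    const age = (Date.now() - new Date(lastModifiedHeader).getTime()) / 1000;
    return {
      secondsRemainingUntilRevalidation: Math.max(
        0,
        Math.ceil(initialRevalidateSeconds - age)
      )
    };
  }

  return undefined; // not an ISR page
};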

@kirkness (Collaborator, Author)

> @kirkness updated the policy, I think you need to add sqs.SendMessage permission to the default policy here?
>
> Also 1/3 of the new ISR tests passed, not sure if they are flaky but AFAIK there are no more permissions issues

Sweet, thanks for checking, odd that it worked locally! I'll add this today along with making sure these tests aren't too flaky!

@kirkness (Collaborator, Author)

> @kirkness updated the policy, I think you need to add sqs.SendMessage permission to the default policy here?
>
> Also 1/3 of the new ISR tests passed, not sure if they are flaky but AFAIK there are no more permissions issues

That permission appears to already exist on the default lambda's policy:

{
  Effect: "Allow",
  Resource: queue.arn,
  Action: ["sqs:SendMessage"]
}

Perhaps the policy isn't getting updated?

@dphang (Collaborator) commented May 15, 2021

> @kirkness updated the policy, I think you need to add sqs.SendMessage permission to the default policy here?
>
> Also 1/3 of the new ISR tests passed, not sure if they are flaky but AFAIK there are no more permissions issues
>
> That permission appears to already exist on the default lambda's policy:
>
> {
>   Effect: "Allow",
>   Resource: queue.arn,
>   Action: ["sqs:SendMessage"]
> }
>
> Perhaps the policy isn't getting updated?

Yup, thanks! I think the policy might not be getting updated after it's already deployed... but I think it should be OK to ask users to update it.

@kirkness (Collaborator, Author) commented May 18, 2021

@dphang - the e2e tests are killing me slowly 😂... It looks like the issue is caused by the shared libs not reliably being built due to yarn failing to install (which I understand is a known issue). I've gone through each of the tests locally and each app/test suite passes.

I'd be fairly keen to work on this issue in a follow-up PR. My suspicion is that the issue is related to the postinstall script running yarn install in parallel (perhaps trying to create the same file/dir at the same time). So perhaps shifting to using workspaces would be beneficial? Has this been looked at before?

Edit: Looks like @jvarho's PR will help!

@dphang (Collaborator) commented May 18, 2021

> @dphang - the e2e tests are killing me slowly 😂... It looks like the issue is caused by the shared libs not reliably being built due to yarn failing to install (which I understand is a known issue). I've gone through each of the tests locally and each app/test suite passes.
>
> I'd be fairly keen to work on this issue in a follow-up PR. My suspicion is that the issue is related to the postinstall script running yarn install in parallel (perhaps trying to create the same file/dir at the same time). So perhaps shifting to using workspaces would be beneficial? Has this been looked at before?
>
> Edit: Looks like @jvarho's PR will help!

Yep hopefully it helps.

I do see that the code storage limit is reached (75 GB), so I'll add a script to clean up old Lambdas which can run in the finalize step. Hopefully we can get this checked in today; just wanting to make sure that the e2e tests are relatively stable at this point.
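For reference, such a cleanup could look roughly like this (AWS SDK v2; pagination and alias handling are left out of the sketch):

import { Lambda } from "aws-sdk";

const lambda = new Lambda();

// Delete all published versions of a function except $LATEST to free up the
// account's code storage. A real script would paginate and skip versions that
// are still referenced by aliases or CloudFront (Lambda@Edge) distributions.
const deleteOldVersions = async (functionName: string) => {
  const { Versions = [] } = await lambda
    .listVersionsByFunction({ FunctionName: functionName })
    .promise();

  for (const version of Versions) {
    if (version.Version && version.Version !== "$LATEST") {
      await lambda
        .deleteFunction({ FunctionName: functionName, Qualifier: version.Version })
        .promise();
    }
  }
};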

@raffij (Contributor) commented May 19, 2021

Looking at this, I learnt about versionFunctions: false in serverless.yml, which is useful for our deploys, but possibly not for this instance.

"lambda:CreateEventSourceMapping",
"iam:UpdateAssumeRolePolicy",
"iam:DeleteRolePolicy",
"sqs:*"
Collaborator, on the diff above:

For better security, we should update this to just the SQS permissions needed, but it's not blocking for this PR.
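For instance, something along these lines might be enough for the deploy role, though the exact action list is a guess at what creating and wiring the regeneration queue requires:

{
  Effect: "Allow",
  Resource: queue.arn,
  Action: [
    "sqs:CreateQueue",
    "sqs:DeleteQueue",
    "sqs:GetQueueAttributes",
    "sqs:SetQueueAttributes",
    "sqs:TagQueue"
  ]
}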

@dphang (Collaborator) commented May 19, 2021

I think so far it looks good to me; really great work on this! If you have any improvements, like updating the README on how to use this feature (e.g. all the SQS permissions needed), improving tests, cleaning up more of the code, etc., please feel free to add them in future PRs.

@dphang merged commit d5bbdc6 into master on May 19, 2021.
The delete-merged-branch bot deleted the feature/incremental-static-regeneration branch on May 19, 2021 at 03:41.
@amman20 commented Jun 20, 2021

Awesome work, guys! When are we expecting this to be available in the latest stable version?

@leerob commented Jun 21, 2021

Nice work, everyone! 👏
