Is there anything preventing wp2static being run externally? #760

petewilcock · 2021-01-27T15:42:07Z

I've been thinking about scalability for the wp2static jobs of crawling, processing, and deploying.

For an increasingly large site this takes a long time. It also has to run on the host machine of the WordPress installation, where it seems susceptible to OOM errors or general resource contention.

I'm interested in investigating how to externalise these processes (e.g. in AWS Lambda, or possibly AWS Batch) to fan out resources rapidly to get a site examined, crawled, and processed. What do you think, @leonstafford?

I'm an eat/sleep AWS dude so I'm happy to invest some time into looking at this. My current WordPress staging set-up uses ECS Fargate and serverless MySQL for maximum cheapness when pushing static updates, so it'd be great to iterate on it further.

leonstafford · 2021-01-28T04:23:39Z

Great thinking, @petewilcock! (thanks for the recent donation, btw, a big help!)

If it could be done as a standalone add-on, that would be awesome. The wp2static-addon-boilerplate is a bit out of date; it's in the list of add-ons I'm working to get up to date next. For the time being, the S3 and Advanced Crawling Add-ons are quite up to date (the Advanced Crawling version is limited to the previous release of WP2Static until the next refactor) and should give an idea of how to hook in to the right places.

@john-shaffer may have some thoughts on this, as he's integrated WP2Static for https://staticweb.io, which is all AWS stack and could help their use case, too...

I've got an Issue open to revisit async parallel requests while crawling, which should speed things up (but use more resources on host!).

I haven't tried Batch yet, but I'm quite familiar with Lambda, so I should be able to help test things.

The only short-term changes expected in core and all the add-ons are around code quality tooling and code improvements, so they shouldn't break any interfaces, besides getting everything using Guzzle, as core is now doing in the latest builds.

SiteSauce is an external service which uses a WP plugin to get inside info; not sure if that may give any ideas.

I'd assume a WP dev site on AWS in the same region is the ideal setup to work with this? Maybe we can eventually have a 1-click deployable infra for the ideal AWS setup: a non-public dev WP site, with the external crawling and deployment through to S3/CloudFront, and some Lambdas thrown in for forms/search handling?

Can you ELI5 for me the Fargate and serverless MySQL setup you use?

Exciting stuff!

petewilcock · 2021-01-28T18:35:27Z

So, I've written a custom Terraform module where I provide a couple of attributes and it deploys a CloudFront-enabled, S3-backed website, along with the ECS cluster/service config for a dockerised WordPress. On launch the container automatically checks for and installs wp2static and populates config variables into wp_options (fed in from Terraform on deployment) to preconfigure the plugin with the correct values.

I toggle the scale-up of the WordPress container, which runs serverless on ECS and connects to a serverless RDS database. Start-up time is < 1 minute and it costs essentially $0.00 when not in use. EFS backs the WordPress files and is attached to the container on launch. So I do my modifications, fire off wp2static, and then shut it down again. This means I don't have to host the WordPress install locally, and all the target files are already in AWS. There's a lot more going on in the Terraform module, which sets up the AWS environment nicely, but that's basically it. Currently I have to run the plugin a couple of times before it deploys without something erroring, and it's probably a resource issue somewhere (PHP errors don't emit into container logs - but I need to fix that).

My main question for you is: Does the crawl or process URLs need any inside info about Wordpress to work? At least, anything that can't be fired off to an external process to start the jobs.

I'm imagining a process like this:

1. wp2static gets the auto-detected list of URLs to crawl, and an add-on fires off the list to a crawl job in AWS Batch (or Lambda consuming from a queue) to crawl them all at a highly parallelised rate. EFS can be attached to Lambda to gather all the site data together in the same place.
2. The next process triggers to post-process the URLs (I'm guessing a lot of URL rewriting and the like happens here). This is again an ideal case for a fast fan-out to blast through the list of files and write clean files back to the filesystem, ready to deploy.
3. Finally, a fast sync from EFS to S3 with the correct permissions does a speedy deploy.

Whole process lives in AWS end-to-end and should be super fast. Any failures could be logged to a fail-queue for optional retries of just the failed parts of the run.
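
A minimal sketch of the first fan-out step, assuming the detected URL list is pushed to an SQS queue with one URL per message. The helper names (`chunked`, `to_sqs_entries`, `plan_fan_out`) and the message shape are hypothetical and not part of WP2Static; the only AWS fact relied on is that an SQS batch send accepts at most 10 entries per call.

```python
import json
from typing import Iterable, Iterator, List

SQS_MAX_BATCH = 10  # SQS SendMessageBatch accepts at most 10 entries per call


def chunked(urls: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` URLs."""
    batch: List[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_sqs_entries(batch: List[str], offset: int = 0) -> List[dict]:
    """Build the Entries list for one batch send (hypothetical message shape)."""
    return [
        {"Id": str(offset + i), "MessageBody": json.dumps({"url": url})}
        for i, url in enumerate(batch)
    ]


def plan_fan_out(urls: Iterable[str]) -> List[List[dict]]:
    """Turn a detected URL list into batches that are ready to send."""
    batches: List[List[dict]] = []
    sent = 0
    for batch in chunked(urls, SQS_MAX_BATCH):
        batches.append(to_sqs_entries(batch, offset=sent))
        sent += len(batch)
    return batches
```

Each batch could then be handed to boto3's `send_message_batch(QueueUrl=..., Entries=batch)`, with a dead-letter queue on the crawl queue playing the role of the fail-queue described above.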

I know a lot about AWS but virtually nothing about creating WordPress plugins, so if you can give me a heads up either with the updated boilerplate or a few pointers I'm sure I can figure out the rest :)


leonstafford · 2021-01-28T23:04:11Z

Super cool setup!

> PHP errors don't emit into container logs - but I need to fix that

Wonder if adding a hook within WP2Static's logger class may be of use there... i.e. post a copy of the log to the CloudWatch API?

> My main question for you is: Does the crawl or process URLs need any inside info about Wordpress to work? At least, anything that can't be fired off to an external process to start the jobs.

WP Site and destination URLs, mainly.
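
As a rough illustration of why those two URLs are the main inputs, here is a minimal sketch of a rewrite pass that swaps the WP site URL for the destination URL in a crawled page. It is an assumption about what a standalone processor might do, not WP2Static's actual processing code; `rewrite_site_urls` is a hypothetical name.

```python
def rewrite_site_urls(html: str, site_url: str, destination_url: str) -> str:
    """Replace the WP site URL with the destination URL in one document."""
    site = site_url.rstrip("/")
    dest = destination_url.rstrip("/")

    # Plain occurrences, e.g. href="http://wp.internal:8080/about/"
    html = html.replace(site, dest)

    # JSON-escaped occurrences, e.g. "http:\/\/wp.internal:8080\/api"
    html = html.replace(site.replace("/", "\\/"), dest.replace("/", "\\/"))
    return html
```

A real processor would also need to handle protocol-relative and URL-encoded forms, which this sketch leaves out.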

For your setup, my gut feel is to advise not to use WP2Static and just do an outside-in crawl, like I do with Appi.sh. I think in the repo's issues, I've listed some better CLI tools than wget for doing multithreaded/parallel crawling. Then you just have to code up the rewriting/processing logic, which you can do in your language of choice.

If really wanting to go the WP2Static route, I'd encourage you to do my work for me and rewrite the crawling engine to do parallel/async requests :D That should achieve close to the same goal without bending it too far from what it's set up to do currently.

Thinking out loud, if I had to do it with WP2Static like that, one way would be to use the WP-CLI commands: have WP2Static do just the wp wp2static detect part, then use the modify_crawl_list filter with your own function to send that list off to your crawling function, save the crawled files back into the usual ./wp-content/uploads/wp2static-generated-site dir, and call wp wp2static process && wp wp2static deploy.
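
One piece an external crawler would need for that hand-off is a mapping from each crawled URL to a file path under the generated-site directory. A minimal sketch follows; the mapping rules (directory-style URLs become index.html, query strings are dropped) are assumptions rather than WP2Static's documented behaviour, and `static_path_for` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Directory named in the thread; everything else here is illustrative.
DEFAULT_OUTPUT_DIR = "wp-content/uploads/wp2static-generated-site"


def static_path_for(url: str, output_dir: str = DEFAULT_OUTPUT_DIR) -> str:
    """Map a crawled URL to the file it should be saved as."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        # /author/admin/ -> /author/admin/index.html
        path += "index.html"
    elif "." not in PurePosixPath(path).name:
        # /about (no extension) -> /about/index.html
        path += "/index.html"
    return str(PurePosixPath(output_dir) / path.lstrip("/"))
```

The sketch does not normalise ".." segments, so a real crawler should validate paths before writing to disk.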

Without automated test coverage (which I'm working on), I can't easily remember the crawl mechanism differences between WP2Static and Static HTML Output, but there's some logic to crawl and parse URLs from CSS, XML files, etc and feed those back into the crawl. @john-shaffer's been doing some great work with the https://github.com/leonstafford/wp2static-addon-advanced-crawling add-on, which helps for WP2Static.
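
To give a feel for the "parse URLs from CSS and feed them back into the crawl" step, here is a minimal sketch that pulls `url(...)` references out of a stylesheet and resolves them against the stylesheet's own URL. It is an illustration only, not the logic used by WP2Static or the Advanced Crawling add-on, and `urls_in_css` is a hypothetical name.

```python
import re
from typing import List
from urllib.parse import urljoin

# Matches url(foo), url('foo') and url("foo"), capturing the reference.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def urls_in_css(css: str, base_url: str) -> List[str]:
    """Return absolute URLs referenced by a stylesheet, in first-seen order."""
    found: List[str] = []
    for _quote, ref in CSS_URL_RE.findall(css):
        ref = ref.strip()
        # Inline data URIs and fragment-only references are not crawlable.
        if not ref or ref.startswith(("data:", "#")):
            continue
        absolute = urljoin(base_url, ref)
        if absolute not in found:
            found.append(absolute)
    return found
```

Anything returned that has not been crawled yet would be appended to the crawl list, which is how discovery keeps feeding the crawl.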

Back to the non-WP2Static version, using WP2Static's initial crawl could still be an advantage, but if you were doing your own discovery of new URLs while crawling, the time saving may not be that impressive.

Cost of scraping with Lambda could be high.

Outside-the-box idea is to really make WP a static site generator (now thinking WP2Static isn't quite one, as it's crawling mostly from outside-in). So, using output buffering and calling the same WP functions which generate pages, but not making the requests via a web server at all, just driving it all from PHP. The requests are the slowest part of the whole thing, as you're seeing.

Anyway, plenty of options and I think you've done the hardest parts already, with a reproducible, optimized end to end environment, that's really cool!

With your spin-up/down of environments, are you wearing cost for the at-rest containers/custom images? That was something I couldn't solve years back with AWS. Azure looked promising for a bit, but still accrued costs.

If there's a way to offer solid end to end remote dev site + deployment site hosting like you're doing (for free or a few $ a month), I'd love to throw some support behind it!

petewilcock · 2021-01-29T00:24:07Z

Oh damn, now you've really got me thinking....

If all permutations of possible WordPress URLs are calculable (and the advanced crawler is mostly seeking out linked assets embedded within pages) and we can make WP simply export the generated index.html code of all the pages, then post-processing and crawling could become completely unnecessary?

i.e. could we override the value of wp_home/site used when WP's page-generation function is called? No need for post-crawl rewrites in this scenario? I guess we lose the cleanup of comments and other bloat, but park that for the moment.

And then, thinking in purely AWS terms, crawling can be avoided because all of the image/css/font assets you'd want to grab already exist on the EFS volume and simply need to be synced over to the S3 target. Sync being the key word there, as the aws s3 sync command has an optional --delete flag that removes anything in the target that's not in the source, which keeps the bucket nice and clean.
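
The effect of a sync with --delete can be pictured as a set difference between the two listings. The sketch below is a simplified model only: it compares by key and size, whereas the real command also considers modification times. `plan_sync` is a hypothetical helper, not an AWS API.

```python
from typing import Dict, List, Tuple


def plan_sync(source: Dict[str, int], target: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Given {relative_key: size} listings, return (to_upload, to_delete)."""
    # Upload anything missing from the target or whose size differs.
    to_upload = sorted(k for k, size in source.items() if target.get(k) != size)
    # With --delete semantics, remove anything the source no longer has.
    to_delete = sorted(k for k in target if k not in source)
    return to_upload, to_delete
```

For example, a page that was unpublished in WordPress would show up only in the target listing and so land in `to_delete`.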

I might have oversimplified and missed something in there, not fully appreciating how each stage of the current process works, but feels possible?

Now, if I was sticking with the wp2static route, I'd probably avoid curling the public URLs from my crawler function. If executed entirely within a VPC, none of the traffic would need to go outbound to the public internet via the internet gateway - as well as being much faster, you'd save on transfer costs. It'd just be a slight adjustment of the curling logic, and ECS tasks can use service discovery to provide a static local endpoint for WordPress. Think of it like crawling your localhost: much faster, and the internet doesn't need to get involved.
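
A minimal sketch of that "slight adjustment": keep the path and query of each public URL but point the request at an internal endpoint, passing the public hostname in the Host header. The endpoint name in the usage note is a made-up service-discovery name, and whether WordPress needs the original Host header to avoid redirects depends on its configuration, so treat that part as an assumption.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit, urlunsplit


def internal_request(public_url: str, internal_base: str) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for fetching a public URL via an internal endpoint."""
    public = urlsplit(public_url)
    internal = urlsplit(internal_base)
    # Swap scheme and host, keep path and query, drop any fragment.
    url = urlunsplit(
        (internal.scheme, internal.netloc, public.path or "/", public.query, "")
    )
    # Present the public hostname so the site sees the host it expects.
    headers = {"Host": public.netloc}
    return url, headers
```

Calling `internal_request("https://dev.example.com/author/admin/", "http://wordpress.local:8080")` yields a URL on the internal endpoint plus the headers to send with it.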

Sorry I'm rambling away here at the possibilities, but to answer your other question: the only costs at rest are EFS storage costs for the WordPress files (a maximum of $0.30 per GB-month if I'm not using an infrequent access storage class) and extremely minimal ECR storage costs for the container image, so it's there when I want to deploy it (barely a couple of cents a month). So... completely negligible. The main thing to bear in mind with my set-up is the running cost whilst you're editing pages: ECS is really cheap for running WordPress during my post-editing time, but my choice of Aurora Serverless is comparatively expensive if I were to keep it running for a long time (kind of like Lambda itself). Now, if I wanted to shave down costs further (albeit losing the lovely benefits of RDS with its auto-backups), I could just run MySQL as a secondary container in ECS and have WordPress talk to that. Again, EFS could host the underlying files of the database, so we're really talking barely a few cents.

As I mentioned, there's more to my module than just the containers - it also sets up a CodeBuild pipeline to pull the source image of WordPress from Docker Hub, rebake it with my customisations, and push it into ECR ready for my deployment. I published a tiny piece of that recently in this module, but I should really tidy up and package the whole thing for general use, as it sounds like some people might get a lot out of it.


leonstafford · 2021-01-29T01:09:52Z

Awesome sauce!

So, the WP pure SSG option hasn't been trialed yet, so there could be some unknowns there... We should be able to mock the Site URL and such. Another option could be to configure the internal DNS in that setup to resolve the production domain to that localhost, so the site actually uses the prod domain, and figure out how to transform that when working on the dev site...

The only real hesitation I had about attacking that before was conflict with other plugins/themes which may be using output buffering, as I believed theirs could take precedence over our output buffering, but I say that with no confidence in my knowledge :D

Big question now: who are the users? Making something for others creates more value than making it just for yourself, so I guess some scenarios:

- I want to run 10 sites
- How to control/auto-shutoff instances to avoid costs (see Strattic, Shifter, StaticWeb, HardyPress as all tackling a similar problem)
- free or paid
- turn it into a massive SaaS thing or allow people to self-manage it within their own AWS accounts

How long does the current setup take, say from a new AWS account, and how many steps are required for a beginner to get up and running?

How to upgrade the base image (less important with non-public instances)? Import/export tools abound, so manually exporting and importing into a newer image should be acceptable.

I'm excited to support this either way. You'll see more excitement from me if it's targeted at either/both:

- something for noobs that makes it dead simple (as simple as dealing with AWS for the first time can be!). A video walkthrough should counter that difficulty, anyway.
- something for the AWS crowd: less ease of use, but more geeky controls and the ability to tie it into their own AWS stuff

Very excited :D

leonstafford · 2021-01-29T03:13:27Z

Started testing that non-webserver option and it seems promising, i.e. REQUEST_URI="/author/admin/" php index.php generates the right output.
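
A minimal sketch of driving that same invocation over a list of paths from a script. It assumes php is on the PATH and that it is run from the WordPress root; `render_command` and `render` are hypothetical helpers, and the sketch covers nothing beyond setting REQUEST_URI as in the command above.

```python
import os
import subprocess
from typing import Dict, List, Optional, Tuple


def render_command(
    request_uri: str, base_env: Optional[Dict[str, str]] = None
) -> Tuple[List[str], Dict[str, str]]:
    """Build the command and environment for rendering one path without a web server."""
    env = dict(os.environ if base_env is None else base_env)
    env["REQUEST_URI"] = request_uri
    return ["php", "index.php"], env


def render(request_uri: str, wp_root: str) -> str:
    """Run PHP in the WordPress root and return the generated HTML."""
    cmd, env = render_command(request_uri)
    result = subprocess.run(
        cmd, cwd=wp_root, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout
```

Looping `render` over the detected URL list and writing each result to disk would replace the HTTP crawl for pages, leaving static assets to be copied directly.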

Will stick notes in https://github.com/leonstafford/real-wp-ssg-test

petewilcock · 2021-01-29T16:50:18Z

Ok let's have a chat! Hit me up on email :)

leonstafford · 2021-01-30T23:48:22Z

I'm a bit allergic to meetings, so let's see how that goes or email back n forth for a bit.

Some nice tips from @szepeviktor here: https://github.com/leonstafford/real-wp-ssg-test/issues/1

szepeviktor · 2021-01-31T00:28:17Z

Both of you must have a WP.org account, then you can chat on Slack! https://make.wordpress.org/chat/
