This repository provides an AWS Lambda layer that runs Puppeteer in headless Chromium, a utility library for crawling pages, and a sample Lambda with an AWS CloudFormation template.
- The AWS CLI
- The AWS Serverless Application Model (SAM) CLI
- Node.js 12.x
To build and deploy the layer and the sample crawler, run:

```shell
sam build --template-file cloudformation-template.yml
sam deploy --template-file .aws-sam/build/template.yaml --stack-name lambda-crawler3 --resolve-s3 --capabilities "CAPABILITY_NAMED_IAM"
```
```javascript
const crawler = require("crawler");

exports.handler = async (event, context, callback) => {
  try {
    await crawler(process.env.URLS.split(","), (data) => {
      // <Your code here>
    });
    return;
  } catch (err) {
    console.log(err);
    throw err;
  }
};
```
The `crawler` function takes two parameters:

- an array of URLs to crawl; the pages can be either regular web pages or PDF files
- a callback function with one parameter, `data`

After each page is loaded, `data` contains five properties:
- `id` - a short hash of the URL
- `url` - the URL of the page that was crawled
- `title` - the page title for HTML or the name of the PDF
- `content` - the content of the page
- `page` - the Puppeteer page object
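As a sketch of what a callback might do with these properties, the snippet below builds a one-line log entry from a crawl result. The `summarize` function and the `sample` object are hypothetical stand-ins for illustration (the real `data` would also carry the live `page` object, omitted here):

```javascript
// Hypothetical helper: format a crawl result as a single log line.
// Neither summarize nor sample is part of the library; they only
// illustrate the shape of the data object described above.
function summarize(data) {
  return `${data.id} ${data.url} "${data.title}" (${data.content.length} chars)`;
}

// Stand-in for the object the crawler passes to your callback.
const sample = {
  id: "a1b2c3d4",
  url: "https://example.com",
  title: "Example Domain",
  content: "This domain is for use in illustrative examples.",
};

console.log(summarize(sample));
```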
The CrawlPage sample Lambda crawls the pages specified by the `URLS` environment variable on a schedule defined by a CloudWatch Events rule and saves their content to the local filesystem.
You can also use Puppeteer directly by adding a require statement at the beginning of your code:

```javascript
const puppeteer = require("puppeteer");
```
See the official Puppeteer documentation for usage.
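For instance, your callback could use the `data.page` property with standard Puppeteer `Page` methods such as `page.title()` and `page.$eval()`. The sketch below is hypothetical: `extractHeading` is illustrative, and a minimal stub stands in for the live `Page` object so no browser is needed to run it:

```javascript
// Hypothetical helper: read the title and first <h1> from a page.
// page.title() and page.$eval() are standard Puppeteer Page methods;
// in a real handler you would pass data.page here.
async function extractHeading(page) {
  const title = await page.title();
  const h1 = await page.$eval("h1", (el) => el.textContent);
  return `${title}: ${h1}`;
}

// Minimal stub mimicking the two Page methods used above.
const stubPage = {
  title: async () => "Example Domain",
  $eval: async (selector, fn) => fn({ textContent: "Hello" }),
};

extractHeading(stubPage).then((s) => console.log(s));
```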