
aws-samples/aws-lambda-layer-node-puppeteer-headless-chromium

Puppeteer/Node Headless Chromium Lambda Layer

Purpose

This repository provides an AWS Lambda layer for running Puppeteer with headless Chromium, a small utility library, and a sample Lambda function with an AWS CloudFormation template.

Requirements

The build and deployment steps below use the AWS SAM CLI.

Building

To build and deploy the layer and the sample crawler, run the following:

sam build --template-file cloudformation-template.yml  
sam deploy --template-file .aws-sam/build/template.yaml --stack-name lambda-crawler3 --resolve-s3 --capabilities "CAPABILITY_NAMED_IAM"

Usage

Using the included Crawler library

const crawler = require("crawler"); // provided by the layer

exports.handler = async (event, context) => {
  try {
    await crawler(process.env.URLS.split(","), (data) => {
      // <Your code here>
    });
    return;
  } catch (err) {
    console.log(err);
    throw err;
  }
};

The crawler function takes two parameters:

  • an array of URLs to crawl. The pages can be either regular web pages or PDF files.
  • a callback function with a single parameter, data

data will contain five properties after the page is loaded:

  • id - a short hash of the URL
  • url - the URL of the page that was crawled
  • title - the page title for HTML pages or the name of the PDF
  • content - the content of the page
  • page - the Puppeteer page object
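As an illustration of the properties above, the hypothetical helper below turns a data object into a summary record; the sample object is hand-built, not produced by the library (data.page, the Puppeteer page object, is omitted because it is not needed here):

```javascript
// Hypothetical helper: turn the crawler's data object into a summary.
// The property names match the list above; summarize() itself is not
// part of the library.
function summarize(data) {
  // id is a short hash of the URL, so it makes a stable filename.
  return {
    file: `${data.id}.txt`,
    heading: `${data.title} (${data.url})`,
    bytes: Buffer.byteLength(data.content, "utf8"),
  };
}

// Stand-in data object for illustration:
const sample = {
  id: "a1b2c3",
  url: "https://example.com",
  title: "Example Domain",
  content: "Hello",
};
console.log(summarize(sample));
// → { file: 'a1b2c3.txt', heading: 'Example Domain (https://example.com)', bytes: 5 }
```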

CrawlPage sample Lambda

The CrawlPage sample Lambda crawls the pages listed in the URLS environment variable and saves their content to the local filesystem, triggered on a schedule by CloudWatch Events.

Using the Puppeteer module directly

You can use Puppeteer directly by adding a require statement at the beginning of your code:

const puppeteer = require("puppeteer")

See the official Puppeteer documentation for usage.
