docs(academy-puppeteer): clarify disadvantages of browsers and unified cheerio parsing #1442

Merged
merged 4 commits into from
Feb 5, 2025
@@ -19,6 +19,12 @@ Now that we know how to execute scripts on a page, we're ready to learn a bit ab
1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`.
2. In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio).

:::tip Crawlee and parsing with Cheerio

If you are using Crawlee, we highly recommend the [parseWithCheerio](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#parseWithCheerio) function, which gives you a unified data extraction syntax. This way, switching between browser and plain HTTP scraping is a breeze.

:::
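
As a rough illustration of the tip above, the sketch below keeps the extraction logic in one function that only depends on a Cheerio-style `$`, and calls it from a `PlaywrightCrawler` handler through `parseWithCheerio()`. It assumes the `crawlee` and `playwright` packages are installed; the helper names `extractHeading` and `runBrowserCrawler`, the `h1` selector, and the example URL are hypothetical and not part of the lesson.

```javascript
// Sketch only: assumes `crawlee` and `playwright` are installed.
// `extractHeading`, `runBrowserCrawler` and the `h1` selector are
// illustrative names, not part of the original lesson.

// Shared extraction logic: works with any Cheerio-style `$` function,
// regardless of whether the HTML came from a browser or a plain HTTP request.
function extractHeading($) {
    return $('h1').first().text().trim();
}

async function runBrowserCrawler(startUrl) {
    // Loaded lazily so the helper above can be reused without a browser.
    const { PlaywrightCrawler } = await import('crawlee');

    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, parseWithCheerio, log }) {
            // Parse the HTML of the rendered page into a Cheerio object.
            const $ = await parseWithCheerio();
            log.info(`${request.url}: ${extractHeading($)}`);
        },
    });

    await crawler.run([startUrl]);
}

// Example call (placeholder URL):
// runBrowserCrawler('https://example.com');
```

Because `extractHeading` never touches the `page` object, the same function could be reused in a `CheerioCrawler` handler, which exposes `$` directly, if the site turns out not to need a browser.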

## Setup

Here is the base setup for our code, which we'll build on throughout this lesson:
8 changes: 7 additions & 1 deletion sources/academy/webscraping/puppeteer_playwright/index.md
@@ -25,7 +25,13 @@ Both packages were developed by the same team and are very similar, which is why

When automating a headless browser, you can do a whole lot more in comparison to making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc.

Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped).
Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or data from the dynamic content can be scraped). Turn on [headful mode](https://playwright.dev/docs/api/class-testoptions#test-options-headless) (`headless: false`) to see exactly what the browser is doing.
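
To make the `headless: false` option concrete, here is a minimal sketch using Playwright's `chromium.launch()`. It assumes the `playwright` package and its browsers are installed; the helpers `buildLaunchOptions` and `openPage`, the `debug` flag, and the 250 ms `slowMo` delay are hypothetical choices made only for illustration.

```javascript
// Sketch only: assumes `playwright` and its browser binaries are installed.
// `buildLaunchOptions`, `openPage` and the `debug` flag are illustrative names.

// Returns launch options: a visible, slowed-down browser while debugging,
// and a headless one otherwise.
function buildLaunchOptions({ debug = false } = {}) {
    return debug
        ? { headless: false, slowMo: 250 } // slowMo delays each operation (ms)
        : { headless: true };
}

async function openPage(url, { debug = false } = {}) {
    const { chromium } = await import('playwright');

    const browser = await chromium.launch(buildLaunchOptions({ debug }));
    const page = await browser.newPage();
    await page.goto(url);
    console.log(await page.title());
    await browser.close();
}

// Example call (placeholder URL), opening a visible browser window:
// openPage('https://example.com', { debug: true });
```

Keeping the options in a small function makes it easy to switch back to headless mode for production runs, where a visible window is unnecessary overhead.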

Browsers can also be effective for [overcoming anti-scraping measures](../anti_scraping/index.md), especially if the website is running [JavaScript browser challenges](../anti_scraping/techniques/browser_challenges.md).

## Disadvantages of headless browsers

Browsers are slow and expensive to run. In the follow-up courses, the Apify Academy will show you how to scrape websites without a browser. Every website can potentially be reverse-engineered into a series of quick and cheap HTTP calls, but it might require significant effort and specialized knowledge.

## Setup {#setup}
