fix: handle all broken links
metalwarrior665 committed Feb 1, 2025
1 parent e787093 commit 009f6fd
Showing 5 changed files with 7 additions and 7 deletions.
@@ -57,7 +57,7 @@ For most sitemaps, you can make a simple HTTP request and parse the downloaded X

## How to parse URLs from sitemaps

-The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.
+The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps) provides code examples for parsing sitemaps.
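
For illustration, here is a minimal sketch of that approach (not taken from the linked article; the sitemap URL and the filtered-out paths are placeholders you would adapt to the target site):

```js
import { load } from 'cheerio';

// Download the sitemap; for most sites a plain HTTP request is enough.
const response = await fetch('https://example.com/sitemap.xml');
const xml = await response.text();

// Parse as XML so Cheerio keeps the <loc> elements intact.
const $ = load(xml, { xmlMode: true });

const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get()
    // Skip URLs we don't want to crawl, e.g. /about or /contact pages.
    .filter((url) => !/\/(about|contact)(\/|$)/.test(url));

console.log(`Found ${urls.length} sitemap URLs to crawl`);
```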

## Using Crawlee

@@ -5,9 +5,9 @@ sidebar_position:: 3
slug: /advanced-web-scraping/crawling/crawling-with-search
---

-# Scraping websites with search
+# Scraping websites with search

-In this lesson, we will start with a simpler example of scraping HTML based websites with limited pagination.
+In this lesson, we will start with a simpler example of scraping HTML based websites with limited pagination.

Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.
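
To make the idea concrete, here is a rough sketch of one such workaround (the `minPrice`/`maxPrice` URL parameters, the result-count parsing, and the cap of 200 pages of 24 products are all illustrative assumptions, not the lesson's exact code): recursively split a filter range until every slice fits under the pagination limit, then paginate each slice. The resulting list is what a call like `await crawler.addRequests(requestsToEnqueue)` further down would consume.

```js
// Hypothetical cap: 200 pages of 24 products per page.
const MAX_RESULTS_PER_FILTER = 200 * 24;

// Placeholder helper: a real crawler would fetch the filtered page and read its result count.
async function getResultCount(url) {
    const response = await fetch(url);
    const html = await response.text();
    const match = html.match(/([\d,]+)\s+results/i);
    return match ? Number(match[1].replace(/,/g, '')) : 0;
}

async function splitRange(min, max, requests = []) {
    const url = `https://example.com/shoes?minPrice=${min}&maxPrice=${max}`;
    const count = await getResultCount(url);

    if (count <= MAX_RESULTS_PER_FILTER || max - min <= 1) {
        // Narrow enough (or cannot be split further), so this slice can be paginated normally.
        requests.push({ url });
    } else {
        // Still over the cap, so split the price range in half and recurse into both halves.
        const mid = Math.floor((min + max) / 2);
        await splitRange(min, mid, requests);
        await splitRange(mid + 1, max, requests);
    }
    return requests;
}

const requestsToEnqueue = await splitRange(0, 10_000);
```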

@@ -281,6 +281,6 @@ await crawler.addRequests(requestsToEnqueue);

## Summary {#summary}

-And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](/academy/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had.
+And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](../../../platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had.
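
As a rough sketch of what that could look like (assuming Crawlee's `crawler.useState()` helper and a placeholder `.product` selector; the linked article covers the full approach):

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler }) {
        // useState() keeps one shared object that is persisted with the run's storage.
        const stats = await crawler.useState({ filters: {} });

        const filter = request.userData.filter ?? 'unknown'; // set when the request is enqueued
        const productCount = $('.product').length; // '.product' is a placeholder selector

        stats.filters[filter] = (stats.filters[filter] ?? 0) + productCount;
    },
});
```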

Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters).
2 changes: 1 addition & 1 deletion sources/academy/webscraping/advanced_web_scraping/index.md
@@ -28,4 +28,4 @@ If you've managed to follow along with all of the courses prior to this one, the

## First up

-First, we will explore [advanced crawling section](academy/webscraping/advanced-web-scraping/advanced-crawling) that will help us to find all pages or products on the website.
+First, we will explore [advanced crawling section](./crawling/sitemaps-vs-search.md) that will help us to find all pages or products on the website.
@@ -198,7 +198,7 @@ Here's what the output of this code looks like:
## Final note {#final-note}

-Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at [this short article](/academy/advanced-web-scraping/scraping-paginated-sites).
+Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at [this short article](/academy/advanced-web-scraping/crawling/crawling-with-search).

## Next up {#next}
@@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem';

If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content.

-![Amazon pagination](/academy/advanced_web_scraping/crawling/images/pagination.png)
+![Amazon pagination](../../advanced_web_scraping/crawling/images/pagination.png)

## Page number-based pagination {#page-number-based-pagination}

