From ad77d26f2d85146b626b176a7d66379b832f2263 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Luk=C3=A1=C5=A1=20K=C5=99ivka?=
Date: Mon, 3 Feb 2025 23:35:48 +0100
Subject: [PATCH] Apply suggestions from code review

Commit all suggestions from Honza

Co-authored-by: Honza Javorek
---
 .../node_js/scraping_from_sitemaps.md    |  2 +-
 .../crawling/crawling-sitemaps.md        | 14 ++++++-------
 .../crawling/crawling-with-search.md     |  2 +-
 .../crawling/sitemaps-vs-search.md       | 20 +++++++++----------
 .../advanced_web_scraping/index.md       |  2 +-
 .../handling_pagination.md               |  2 +-
 6 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md
index 7027ee93b..4222bba2a 100644
--- a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md
+++ b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md
@@ -9,7 +9,7 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js';

 # How to scrape from sitemaps {#scraping-with-sitemaps}

-:::note
+:::tip Processing sitemaps automatically with Crawlee

 Crawlee allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code.

diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md
index c32992e22..040d2fe4c 100644
--- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md
+++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md
@@ -1,7 +1,7 @@
 ---
 title: Crawling sitemaps
 description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
-sidebar_position:: 2
+sidebar_position: 2
 slug: /advanced-web-scraping/crawling/crawling-sitemaps
 ---

@@ -16,7 +16,7 @@ We will look at the following topics:

 ## How to find sitemap URLs

-Sitemaps are commonly restricted to contain a maximum of 50k URLs so usually, there will be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps or the sitemaps might simply be indexed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.
+Sitemaps are commonly restricted to contain a maximum of 50k URLs, so usually there will be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps, or the sitemaps might simply be indexed in `robots.txt` and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.

 ### Google

@@ -24,7 +24,7 @@ You can try your luck on Google by searching for `site:example.com sitemap.xml`

 ### robots.txt {#robots-txt}

-If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under `Sitemap:` directive.
+If the website has a `robots.txt` file, it often contains sitemap URLs. The sitemap URLs are usually listed under the `Sitemap:` directive.

 ### Common URL paths

@@ -49,19 +49,19 @@ Some websites also provide an HTML version, to help indexing bots find new conte
 - /sitemap.html
 - /sitemap_index

-Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), that scans the URL variations automatically for you so that you don't have to check manually.
+Apify provides the [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), an open source actor that scans the URL variations automatically for you so that you don't have to check them manually.

 ## How to set up HTTP requests to download sitemaps

-For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. The code for that is fairly complicated so we recommend just [using Crawlee](#using-crawlee) which handles streamed and compressed sitemaps by default.
+For most sitemaps, you can make a single HTTP request and parse the downloaded XML text. Some sitemaps are compressed and have to be streamed and decompressed. The code can get fairly complicated, but scraping frameworks, such as [Crawlee](#using-crawlee), can do this out of the box.

 ## How to parse URLs from sitemaps

-The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps) provides code examples for parsing sitemaps.
+Use your favorite XML parser to extract the URLs from inside the `<loc>` tags. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. `/about`, `/contact`, or various special category sections). For specific code examples, see [our Node.js guide](/academy/node-js/scraping-from-sitemaps).

 ## Using Crawlee

-Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev) which has rich traversing and parsing support for sitemap. Crawlee can traverse nested sitemaps, download, and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code:
+Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev), a scraping framework, which has rich traversing and parsing support for sitemaps. It can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all the URLs in a few lines of code:

 ```js
 import { RobotsFile } from 'crawlee';
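The changed paragraphs above cover three steps: finding sitemap URLs in `robots.txt`, downloading (possibly gzipped) sitemaps, and pulling page URLs out of `<loc>` tags. For illustration, here is a minimal sketch of those steps without a framework. It assumes Node.js 18+ (for the built-in `fetch`), the `cheerio` package, and a hypothetical `https://example.com` target; it does not handle nested sitemap index files and is not the code from the linked guide.

```js
import { gunzipSync } from 'node:zlib';
import * as cheerio from 'cheerio';

// 1. Find sitemap URLs listed under the `Sitemap:` directive in robots.txt.
async function findSitemapUrls(origin) {
    const robotsTxt = await (await fetch(`${origin}/robots.txt`)).text();
    return robotsTxt
        .split('\n')
        .filter((line) => line.toLowerCase().startsWith('sitemap:'))
        .map((line) => line.slice('sitemap:'.length).trim());
}

// 2. Download a sitemap and decompress it if it is gzipped.
async function downloadSitemap(url) {
    const response = await fetch(url);
    const body = Buffer.from(await response.arrayBuffer());
    return url.endsWith('.gz') ? gunzipSync(body).toString('utf8') : body.toString('utf8');
}

// 3. Parse the XML and extract the page URLs from <loc> tags.
function parseUrls(xml) {
    const $ = cheerio.load(xml, { xmlMode: true });
    return $('url > loc').map((_, el) => $(el).text().trim()).get();
}

const sitemapUrls = await findSitemapUrls('https://example.com');
const pageUrls = (await Promise.all(
    sitemapUrls.map(async (url) => parseUrls(await downloadSitemap(url))),
)).flat();
console.log(`Found ${pageUrls.length} URLs in ${sitemapUrls.length} sitemaps`);
```

In practice, you would still filter the resulting list (for example, keeping only product detail URLs) before feeding it to a crawler, as the paragraphs above note.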
diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md
index ad6157328..fbd471bb2 100644
--- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md
+++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md
@@ -1,7 +1,7 @@
 ---
 title: Crawling with search
 description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
-sidebar_position:: 3
+sidebar_position: 3
 slug: /advanced-web-scraping/crawling/crawling-with-search
 ---

diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md
index 5fab8a50b..90238546f 100644
--- a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md
+++ b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md
@@ -1,13 +1,13 @@
 ---
 title: Sitemaps vs search
 description: Learn how to extract all of a website's listings even if they limit the number of results pages.
-sidebar_position:: 1
+sidebar_position: 1
 slug: /advanced-web-scraping/crawling/sitemaps-vs-search
 ---

 The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course.

-Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
+Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.

 There are two main approaches to solving this problem:

@@ -31,13 +31,13 @@ Sitemap is usually a simple XML file that contains a list of all pages on the we
 - **Does not directly reflect the website** - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs.
 - **Updated in intervals** - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week.
 - **Hard to find or unavailable** - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all.
-- **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code. Fortunately, we will get to this in the next lesson.
+- **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework.

 ## Pros and cons of categories, search, and filters

-This approach means traversing the website like a normal user do by going through categories, setting up different filters, ranges and sorting options. The goal is to traverse it is a way that ensures we covered all categories/ranges where products can be located and for each of those we stayed under the pagination limit.
+This approach means traversing the website like a normal user does by going through categories, setting up different filters, ranges, and sorting options. The goal is to ensure that we cover all categories or ranges where products can be located, and that for each of those we stay under the pagination limit.

-The pros and cons of this approach are pretty much the opposite of the sitemaps approach.
+The pros and cons of this approach are pretty much the opposite of relying on sitemaps.

 ### Pros

@@ -47,16 +47,16 @@ The pros and cons of this approach are pretty much the opposite of the sitemaps

 ### Cons

-- **Complex to set up** - The logic to traverse the website is usually more complex and can take a lot of time to get right. We will get to this in the next lessons.
+- **Complex to set up** - The logic to traverse the website is usually complex and can take a lot of time to get right. We will get to this in the next lessons.
 - **Slow to run** - The traversing can require a lot of requests. Some filters or categories will have products we already found.
-- **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The framework we will build in the next lessons will help us with this.
+- **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The tools we'll build in the following lessons will help us with this.

 ## Do we know how many products there are?

-Fortunately, most websites list a total number of detail pages somewhere. It might be displayed on the home page or search results or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell if our approach to scrape all succeeded or if we still need to refine it.
+Most websites list a total number of detail pages somewhere. It might be displayed on the home page or in search results, or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell if our attempt to scrape them all succeeded or if we still need to refine it.

-Unfortunately, some sites like Amazon do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the next lessons as well.
+Some sites, like Amazon, do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the following lessons as well.

 ## Next up

-First, we will look into the easier approach, the [sitemap crawling](./crawling-sitemaps.md). Then we will go through all the intricacies of the category, search and filter crawling, and build up a generic framework that we can use on any website. At last, we will combine the results of both approaches and set up monitoring and persistence to ensure we can run this regularly without any manual controls.
+Next, we will look into [sitemap crawling](./crawling-sitemaps.md). After that, we will go through all the intricacies of category, search, and filter crawling, and build up tools implementing a generic approach that we can use on any website. Finally, we will combine the results of both and set up monitoring and persistence to ensure we can run this regularly without any manual controls.
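The changed paragraphs above describe covering a catalog by splitting it into categories, filters, and ranges small enough to stay under the pagination limit. As a rough illustration of that idea (not code from the course), the sketch below splits a price range recursively until every slice reports fewer results than the limit. It assumes Node.js 18+ for `fetch`; the `https://example.com` search URL, its `minPrice`/`maxPrice` parameters, the JSON `total` field, and the 10,000-result cap are all hypothetical.

```js
// Hypothetical helper: asks the site how many results a given price range has,
// e.g. by requesting the first results page and reading the reported total.
async function fetchResultCount(minPrice, maxPrice) {
    const searchUrl = `https://example.com/api/search?minPrice=${minPrice}&maxPrice=${maxPrice}`;
    const response = await fetch(searchUrl);
    const { total } = await response.json(); // assumes a JSON API that exposes a total count
    return total;
}

const PAGINATION_LIMIT = 10_000; // assumed maximum number of results the site lets us paginate through

// Recursively split a price range until each slice fits under the pagination limit,
// so that paginating every slice to the end covers all products.
async function splitRange(minPrice, maxPrice, ranges = []) {
    const count = await fetchResultCount(minPrice, maxPrice);
    if (count <= PAGINATION_LIMIT || maxPrice - minPrice <= 1) {
        ranges.push({ minPrice, maxPrice, count });
        return ranges;
    }
    const middle = Math.floor((minPrice + maxPrice) / 2);
    await splitRange(minPrice, middle, ranges);
    await splitRange(middle + 1, maxPrice, ranges);
    return ranges;
}

const ranges = await splitRange(0, 1_000_000);
console.log(`Need to paginate ${ranges.length} price ranges`);
```

Each slice can then be paginated to the end, and products that appear in more than one slice or category deduplicated by their URL or ID.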
diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md
index 1388e4b80..3e41abb0a 100644
--- a/sources/academy/webscraping/advanced_web_scraping/index.md
+++ b/sources/academy/webscraping/advanced_web_scraping/index.md
@@ -1,7 +1,7 @@
 ---
 title: Advanced web scraping
 description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
-sidebar_position:: 6
+sidebar_position: 6
 category: web scraping & automation
 slug: /advanced-web-scraping
 ---
diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md
index 1ecaebbbb..e734e7be4 100644
--- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md
+++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md
@@ -198,7 +198,7 @@ Here's what the output of this code looks like:

 ## Final note {#final-note}

-Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at [this short article](/academy/advanced-web-scraping/crawling/crawling-with-search).
+Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at the [Crawling with search](/academy/advanced-web-scraping/crawling/crawling-with-search) article.

 ## Next up {#next}
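One closing illustration, tying back to the "Do we know how many products there are?" paragraphs edited above: once the site's reported total is known, a simple end-of-run check tells you whether the traversal still needs refining. A minimal sketch, where `collectedUrls` and `expectedTotal` are hypothetical names for values your crawler would gather:

```js
// collectedUrls: a Set of unique product URLs gathered during the crawl.
// expectedTotal: the total number of products the website itself reports.
function checkCompleteness(collectedUrls, expectedTotal) {
    const coverage = collectedUrls.size / expectedTotal;
    console.log(`Collected ${collectedUrls.size} of ~${expectedTotal} products (${(coverage * 100).toFixed(1)}%)`);
    // Reported totals are rarely exact, so allow a small tolerance before
    // deciding that the category/filter traversal needs to be refined.
    return coverage >= 0.95;
}
```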