---
theme: default
background: ./bg.jpeg
class: text-center
highlighter: shiki
lineNumbers: true
info: |
  ## Design a web crawler
  Presentation slides for developers by Son Nguyen
drawings:
css: unocss
---

# Design a web crawler

Presentation slides for developers

Made by Son Nguyen
A web crawler (also known as a spider) is a system for downloading, storing, and analyzing web pages.
Most prominently, crawlers are one of the main components of web search engines: they compile a collection of web pages, index it, and allow users to issue queries against the index to find the web pages that match.
A web crawler bot is like someone whose job is to go through all the books in a disorganized library and organize them in such a way that anyone visiting the library can quickly and easily find the information they need. This process starts by collecting a few web pages and then following links to collect new content.
<style>
h1 {
  background-color: #2B90B6;
  background-image: linear-gradient(45deg, #4EC5D4 10%, #146b8c 20%);
  background-size: 100%;
  -webkit-background-clip: text;
  -moz-background-clip: text;
  -webkit-text-fill-color: transparent;
  -moz-text-fill-color: transparent;
}
div {
  height: 95%;
}
img {
  max-width: none;
  height: 100%;
}
</style>
A crawler is used for many purposes:
- 📝 Search engine indexing - A crawler collects web pages to generate a local index for search engines. For example, Googlebot is the web crawler behind the Google search engine.
- 📥 Web archiving - This is the compilation of web-based information to store data for future use.
- ⛏️ Web mining - The exponential growth of the web creates an incredible opportunity for data mining; web mining helps to discover valuable information from the internet.
- 🤹 Web monitoring - Crawlers help to monitor copyright and trademark violations over the internet. Digimarc, for instance, uses crawlers to discover pirated works and report them.
A simple web crawler should have the following functionalities:
- Given a set of URLs, visit the URLs and store the web pages.
- Now, extract the URLs in these web pages.
- Append the new URLs extracted to the list of URLs to be visited.
- 🔄 Repeat the process (a minimal sketch of this loop follows).
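A minimal single-threaded sketch of this loop using only the Python standard library; the seed URL, page limit, and naive regex-based link extraction are assumptions for illustration, not part of the design.

```python
# Minimal single-threaded crawler sketch (illustrative only).
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

MAX_PAGES = 100                      # assumed safety limit for the sketch
seeds = ["https://example.com/"]     # hypothetical seed URLs

to_visit = deque(seeds)              # URLs to be visited
visited = set()                      # URLs already crawled
storage = {}                         # url -> raw HTML

while to_visit and len(visited) < MAX_PAGES:
    url = to_visit.popleft()
    if url in visited:
        continue
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
    except Exception:
        continue                     # skip unreachable or malformed pages
    visited.add(url)
    storage[url] = html              # store the web page
    # Extract URLs from the page (naive regex; a real parser is preferable).
    for link in re.findall(r'href="([^"#]+)"', html):
        absolute = urljoin(url, link)
        if absolute not in visited:
            to_visit.append(absolute)  # append new URLs to be visited
```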
It is important to take note of the characteristics of a good web crawler. A web crawler service should have the following features:
- ⛹️‍♂️ Scalability - There are billions of web pages to be crawled and analyzed. Therefore, web crawling should be extremely efficient, using parallelization.
- 🤖 Robustness - The internet is full of traps such as bad HTML, unresponsive servers, crashes, malicious links, etc. The crawler must be able to handle all these cases.
- 🙅‍♂️ Politeness - The web crawler should not make too many requests to a website within a short time interval, as it may unintentionally lead to a denial-of-service attack on those websites.
- 🧱 Extensibility - The system should be adaptable and flexible to any changes that we might encounter in the future, like if we want to crawl images or music files.
- ⚙️ Manageability and Reconfigurability - An appropriate interface is needed to monitor the crawl, including the crawler's speed, statistics about hosts and pages, and the sizes of the main data sets.
We should estimate the scale by making assumptions and discussing them with the interviewer.
- Assuming 1 billion web pages are crawled every month, we can estimate the average QPS: QPS (Queries per Second) = 1,000,000,000 / 30 days / 24 hours / 3,600 seconds ≈ 400 pages per second.
- We can assume that the peak number of queries per second is 2 times the average, i.e., 800 pages per second.
- Now, assuming that, on average, a web page is 500 KB in size, we can estimate the storage required per month: 1 billion pages × 500 KB = 500 TB per month.
- Also, assuming data are stored for five years: 500 TB × 12 months × 5 years = 30 PB.
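The same numbers, checked as a quick back-of-the-envelope calculation (rounded the same way as above):

```python
# Back-of-the-envelope estimation (values rounded as in the slides).
pages_per_month = 1_000_000_000
qps = pages_per_month / (30 * 24 * 3600)                      # ~386, rounded to ~400 pages/s
peak_qps = 2 * qps                                            # ~800 pages/s
page_size_kb = 500
storage_per_month_tb = pages_per_month * page_size_kb / 1e9   # 500 TB per month
storage_5_years_pb = storage_per_month_tb * 12 * 5 / 1000     # 30 PB over five years

print(round(qps), round(peak_qps), storage_per_month_tb, storage_5_years_pb)
```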
<style>
h1 {
  background-color: #2B90B6;
  background-image: linear-gradient(45deg, #4EC5D4 10%, #146b8c 20%);
  background-size: 100%;
  -webkit-background-clip: text;
  -moz-background-clip: text;
  -webkit-text-fill-color: transparent;
  -moz-text-fill-color: transparent;
}
div {
  height: 95%;
}
img {
  max-width: none;
  height: 100%;
}
</style>
The crawling process consists of several threads, and each thread performs work cycles repeatedly. The following is a brief overview of a work cycle:
- Add seed URLs to the URL Frontier.
- The HTML Downloader fetches a list of URLs from the URL Frontier.
- The HTML Downloader gets the IP addresses of the URLs from the DNS Resolver and starts downloading.
- The Content Parser parses the HTML pages and checks whether they are malformed.
- After the content is parsed and validated, it is passed to the “Content Seen?” component.
- The “Content Seen?” component checks whether an HTML page is already in the storage.
- If it is in the storage, this means the same content under a different URL has already been processed. In this case, the HTML page is discarded.
- If it is not in the storage, the system has not processed the same content before. The content is passed to the Link Extractor.
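One way the “Content Seen?” check can work: compare a hash of the page content instead of the full document. The in-memory set below is a simplification; a real system would keep these checksums in a shared datastore.

```python
# "Content Seen?" sketch: detect duplicate pages by comparing content hashes.
import hashlib

seen_checksums = set()   # in a real system this lives in a shared datastore

def content_seen(html: str) -> bool:
    """Return True if an identical page body has already been processed."""
    checksum = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if checksum in seen_checksums:
        return True          # same content under a different URL; discard the page
    seen_checksums.add(checksum)
    return False             # new content; pass it on to the Link Extractor
```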
The work cycle continues as follows:
- The Link Extractor extracts links from HTML pages.
- Extracted links are passed to the URL Filter.
- After links are filtered, they are passed to the “URL Seen?” component.
- The “URL Seen?” component checks if a URL is already in the storage; if yes, it has been processed before, and nothing needs to be done.
- If a URL has not been processed before, it is added to the URL Frontier.
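A sketch of this second half of the cycle under the same assumptions: link extraction, a URL filter, the “URL Seen?” check, and the frontier as a simple queue. The allowed() rule is only a placeholder for a real URL filter.

```python
# Link Extractor -> URL Filter -> "URL Seen?" -> URL Frontier (sketch).
import re
from collections import deque
from urllib.parse import urljoin, urldefrag

url_frontier = deque()
urls_seen = set()

def allowed(url: str) -> bool:
    # Placeholder URL filter: keep only http(s) links; a real filter also drops
    # blacklisted sites, unwanted file types, error links, etc.
    return url.startswith(("http://", "https://"))

def process_links(page_url: str, html: str) -> None:
    for href in re.findall(r'href="([^"]+)"', html):          # Link Extractor
        url, _fragment = urldefrag(urljoin(page_url, href))    # normalize the link
        if not allowed(url):                                   # URL Filter
            continue
        if url in urls_seen:                                   # "URL Seen?"
            continue
        urls_seen.add(url)
        url_frontier.append(url)                               # URL Frontier
```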
Up until now, we have discussed the high-level design.
Next, we will discuss the most important building components and techniques in depth.
Design deep dive
The Depth-first search (DFS) algorithm starts at the root of the tree (or some arbitrary node of a graph) and explores as far as possible along each branch before backtracking.
The Breadth–first search (BFS) algorithm also starts at the root of the tree (or some arbitrary node of a graph), but unlike DFS, it explores the neighbor nodes first, before moving to the next-level neighbors. In other words, BFS explores vertices in the order of their distance from the source vertex, where distance is the minimum length of a path from the source vertex to the node.
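Crawlers typically traverse the link graph in BFS order with a FIFO queue, since DFS can descend very deep. A small sketch contrasting the two orders on a made-up link graph:

```python
# BFS vs. DFS over a toy link graph (illustrative data only).
from collections import deque

links = {                      # hypothetical page -> outgoing links
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}

def bfs(start):
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()          # FIFO: nearest pages first
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(start):
    order, stack, seen = [], [start], {start}
    while stack:
        page = stack.pop()              # LIFO: follow one branch as deep as possible
        order.append(page)
        for nxt in reversed(links[page]):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return order

print("BFS:", bfs("A"))   # A, B, C, D, E
print("DFS:", dfs("A"))   # A, B, D, C, E
```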
<style>
h1 {
  background-color: #2B90B6;
  background-image: linear-gradient(45deg, #4EC5D4 10%, #146b8c 20%);
  background-size: 100%;
  -webkit-background-clip: text;
  -moz-background-clip: text;
  -webkit-text-fill-color: transparent;
  -moz-text-fill-color: transparent;
}
div {
  height: 95%;
}
img {
  max-width: none;
  height: 100%;
}
</style>
Design deep dive
A URL frontier is a data structure that stores URLs to be downloaded.
The URL frontier is an important component to ensure:
- Politeness
- Priority
- Freshness
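One common way the frontier enforces politeness is one FIFO queue per host plus a minimum delay between requests to the same host; the one-second delay below is an assumed value.

```python
# Politeness sketch: one queue per host, with a minimum delay per host.
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 1.0                      # assumed politeness interval

per_host_queues = defaultdict(deque)         # host -> FIFO queue of URLs
next_allowed_time = defaultdict(float)       # host -> earliest next fetch time

def add_url(url: str) -> None:
    per_host_queues[urlsplit(url).netloc].append(url)

def next_url():
    """Return a URL whose host is ready to be crawled again, else None."""
    now = time.monotonic()
    for host, queue in per_host_queues.items():
        if queue and now >= next_allowed_time[host]:
            next_allowed_time[host] = now + MIN_DELAY_SECONDS
            return queue.popleft()
    return None
```

A real frontier would persist these queues and map each host to a dedicated worker, so the delay is enforced per downloader thread rather than by a global scan.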
URL frontier
Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep our data set fresh. Recrawling all the URLs is time-consuming and resource-intensive. A few strategies to optimize freshness are listed as follows:
- Recrawl based on web pages' update history.
- Prioritize URLs and recrawl important pages first and more frequently (a small scheduling sketch follows).
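A small sketch of one possible recrawl scheduler that combines both ideas; the scoring heuristic and constants are assumptions for illustration.

```python
# Freshness sketch: recrawl important / frequently-updated pages sooner.
import heapq
import time

recrawl_heap = []   # min-heap of (next_crawl_time, url)

def schedule(url: str, importance: float, days_since_last_change: float) -> None:
    # Assumed heuristic: pages that change often or are important get a shorter
    # recrawl interval; the constants are placeholders, not tuned values.
    interval_days = max(1.0, days_since_last_change) / max(importance, 0.1)
    heapq.heappush(recrawl_heap, (time.time() + interval_days * 86400, url))

def due_urls(now: float):
    """Yield URLs whose scheduled recrawl time has arrived."""
    while recrawl_heap and recrawl_heap[0][0] <= now:
        yield heapq.heappop(recrawl_heap)[1]
```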
Design deep dive
The HTML Downloader downloads web pages from the internet using the HTTP protocol.
- Robots.txt
- Performance optimization
HTML Downloader
Robots.txt, also called the Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies which pages crawlers are allowed to download.
👉 Before attempting to crawl a web site, a crawler should check its corresponding robots.txt first and follow its rules.
User-agent: Googlebot
Disallow: /creatorhub/*
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
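The check itself can be done with Python's built-in urllib.robotparser; the site URL and path below are placeholders, and the user agent mirrors the example above.

```python
# Checking robots.txt before crawling, using the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # hypothetical site
rp.read()                                          # fetch and parse the rules

if rp.can_fetch("Googlebot", "https://www.example.com/creatorhub/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")              # respect the rules and skip
```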
HTML Downloader
Below is a list of performance optimizations for the HTML Downloader (a sketch of the DNS cache and short-timeout ideas follows the list):
- Distributed crawl
- Cache DNS Resolver
- Locality
- Short timeout
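A sketch of two of these optimizations, a cached DNS resolver and a short download timeout; the TTL and timeout values are assumptions.

```python
# Sketch of two optimizations: a cached DNS resolver and a short timeout.
import socket
import time
from urllib.request import urlopen

DNS_TTL_SECONDS = 300          # assumed cache lifetime
_dns_cache = {}                # host -> (ip, expiry_time)

def resolve(host: str) -> str:
    """Resolve a host name, reusing the cached IP until it expires."""
    ip, expires = _dns_cache.get(host, (None, 0.0))
    if ip is None or time.monotonic() > expires:
        ip = socket.gethostbyname(host)
        _dns_cache[host] = (ip, time.monotonic() + DNS_TTL_SECONDS)
    return ip

def download(url: str) -> bytes:
    # Short timeout: give up quickly on slow or unresponsive servers.
    return urlopen(url, timeout=3).read()
```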
Design deep dive
A few approaches to improve the system robustness:
- Consistent hashing: This helps to distribute loads among downloaders. A new downloader server can be added or removed using consistent hashing. Refer to Chapter 5: Design consistent hashing for more details.
- Save crawl states and data: To guard against failures, crawl states and data are written to a storage system. A disrupted crawl can be restarted easily by loading saved states and data.
- Exception handling: Errors are inevitable and common in a large-scale system. The crawler must handle exceptions gracefully without crashing the system.
- Data validation: This is an important measure to prevent system errors.
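A compact consistent-hashing sketch (a hash ring queried with bisect) showing how URLs could be spread across downloader servers; the server names and replica count are illustrative.

```python
# Consistent hashing sketch: map each URL to a downloader on a hash ring.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring multiple times for smoother balance.
        self._ring = sorted((_hash(f"{n}#{i}"), n)
                            for n in nodes for i in range(replicas))
        self._keys = [h for h, _ in self._ring]

    def node_for(self, url: str) -> str:
        # Walk clockwise to the first node at or after the URL's hash.
        idx = bisect.bisect(self._keys, _hash(url)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["downloader-1", "downloader-2", "downloader-3"])  # placeholder names
print(ring.node_for("https://example.com/some/page"))
```

Adding or removing a downloader only remaps the URLs that fell on its portion of the ring, which is why the technique suits elastic scaling of the crawl.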
Design deep dive
As almost every system evolves, one of the design goals is to make the system flexible enough to support new content types. The crawler can be extended by plugging in new modules.
For example:
- A PNG Downloader module is plugged in to download PNG files.
- A Web Monitor module is added to monitor the web and prevent copyright and trademark infringements.
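One way to read “plugging in new modules”: handlers registered per content type behind a common dispatch function. The module names below are illustrative, not part of the original design.

```python
# Extensibility sketch: content-type handlers registered as plug-in modules.
downloaders = {}   # content type -> handler function

def register(content_type):
    def wrap(fn):
        downloaders[content_type] = fn
        return fn
    return wrap

@register("text/html")
def handle_html(url, body):
    print("index HTML page", url)

@register("image/png")          # new module plugged in later
def handle_png(url, body):
    print("store PNG file", url)

def dispatch(url, content_type, body):
    handler = downloaders.get(content_type)
    if handler:
        handler(url, body)      # unknown content types are simply skipped
```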
Design deep dive
- Redundant content: As discussed previously, nearly 30% of web pages are duplicates. Hashes or checksums help to detect duplication [11].
- Spider traps: A spider trap is a web page that causes a crawler to fall into an infinite loop. For instance, an infinitely deep directory structure: www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/... (a simple guard is sketched after this list).
- Data noise: Some content has little or no value, such as advertisements, code snippets, spam URLs, etc. Such content is not useful for crawlers and should be excluded if possible.
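A simple guard against such traps, assuming a maximum URL length and path depth; real crawlers also rely on manual verification and per-site blacklists, since no single rule catches every trap.

```python
# Spider-trap guard sketch: cap URL length and path depth (assumed limits).
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2000
MAX_PATH_DEPTH = 10

def looks_like_trap(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:
        return True
    depth = len([p for p in urlsplit(url).path.split("/") if p])
    return depth > MAX_PATH_DEPTH

print(looks_like_trap("http://www.spidertrapexample.com/" + "foo/bar/" * 20))  # True
```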