---
theme: default
background: ./bg.jpeg
class: text-center
highlighter: shiki
lineNumbers: true
info: |
  ## Design a web crawler
  Presentation slides for developers by Son Nguyen
drawings:
css: unocss
---

# Design a web crawler

Presentation slides for developers

Made by Son Nguyen
A web crawler (also known as a spider) is a system for downloading, storing, and analyzing web pages.
Most prominently, crawlers are one of the main components of web search engines: they compile a collection of web pages, index it, and allow users to issue queries against the index to find the web pages that match.
A web crawler bot is like someone whose job is to go through all the books in a disorganized library and organize them in such a way that anyone visiting the library can quickly and easily find the information they need. This process starts by collecting a few web pages and then following links to collect new content.
<style>
h1 {
  background-color: #2B90B6;
  background-image: linear-gradient(45deg, #4EC5D4 10%, #146b8c 20%);
  background-size: 100%;
  -webkit-background-clip: text;
  -moz-background-clip: text;
  -webkit-text-fill-color: transparent;
  -moz-text-fill-color: transparent;
}
div {
  height: 95%;
}
img {
  max-width: none;
  height: 100%;
}
</style>
A crawler is used for many purposes:
- 📝 Search engine indexing - A crawler collects web pages to generate a local index for search engines. For example, Googlebot is the web crawler behind the Google search engine.
- 📥 Web archiving - This is the compilation of web-based information to store data for future use.
- ⛏️ Web mining - The exponential growth of the web creates an incredible opportunity for data mining; web mining helps to discover valuable information from the internet.
- 🤹 Web monitoring - Crawlers help to monitor copyright and trademark violations over the internet. Digimarc, for instance, uses crawlers to discover pirated works and report them.
A simple web crawler should have the following functionalities:
- Given a set of URLs, visit the URLs and store the web pages.
- Now, extract the URLs in these web pages.
- Append the new URLs extracted to the list of URLs to be visited.
- 🔄 Repeat the process (a minimal sketch of this loop follows).
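A minimal single-threaded sketch of this loop using only the Python standard library; the seed URL, page limit, and naive regex-based link extraction are assumptions for illustration, not part of the design.

```python
# Minimal single-threaded crawler sketch (illustrative only).
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

MAX_PAGES = 100                      # assumed safety limit for the sketch
seeds = ["https://example.com/"]     # hypothetical seed URLs

to_visit = deque(seeds)              # URLs to be visited
visited = set()                      # URLs already crawled
storage = {}                         # url -> raw HTML

while to_visit and len(visited) < MAX_PAGES:
    url = to_visit.popleft()
    if url in visited:
        continue
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
    except Exception:
        continue                     # skip unreachable or malformed pages
    visited.add(url)
    storage[url] = html              # store the web page
    # Extract URLs from the page (naive regex; a real parser is preferable).
    for link in re.findall(r'href="([^"#]+)"', html):
        absolute = urljoin(url, link)
        if absolute not in visited:
            to_visit.append(absolute)  # append new URLs to be visited
```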
It is important to take note of the characteristics of a good web crawler. A web crawler service should have the following features:
- ⛹️‍♂️ Scalability - There are billions of web pages to be crawled and analyzed. Therefore, web crawling should be extremely efficient, using parallelization.
- 🤖 Robustness - The internet is full of traps such as bad HTML, unresponsive servers, crashes, malicious links, etc. The crawler must be able to handle all these cases.
- 🙅‍♂️ Politeness - The web crawler should not make too many requests to a website within a short time interval, as it may unintentionally lead to a denial-of-service attack on those websites.
- 🧱 Extensibility - The system should be adaptable and flexible to any changes that we might encounter in the future, like if we want to crawl images or music files.
- ⚙️ Manageability and Reconfigurability - An appropriate interface is needed to monitor the crawl, including the crawler's speed, statistics about hosts and pages, and the sizes of the main data sets.
We should estimate the scale by making assumptions and discussing them with the interviewer.
- Assuming 1 billion web pages are crawled every month, we can estimate the average QPS: QPS (Queries per Second) = 1,000,000,000 / 30 days / 24 hours / 3,600 seconds ≈ 400 pages per second.
- We can assume that the peak number of queries per second is 2 times the average, i.e., 800 pages per second.
- Now, assuming that, on average, a web page is 500 KB in size, we can estimate the storage required per month: 1 billion pages × 500 KB = 500 TB per month.
- Also, assuming data are stored for five years: 500 TB × 12 months × 5 years = 30 PB.
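The same numbers, checked as a quick back-of-the-envelope calculation (rounded the same way as above):

```python
# Back-of-the-envelope estimation (values rounded as in the slides).
pages_per_month = 1_000_000_000
qps = pages_per_month / (30 * 24 * 3600)                      # ~386, rounded to ~400 pages/s
peak_qps = 2 * qps                                            # ~800 pages/s
page_size_kb = 500
storage_per_month_tb = pages_per_month * page_size_kb / 1e9   # 500 TB per month
storage_5_years_pb = storage_per_month_tb * 12 * 5 / 1000     # 30 PB over five years

print(round(qps), round(peak_qps), storage_per_month_tb, storage_5_years_pb)
```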
<style>
h1 {
  background-color: #2B90B6;
  background-image: linear-gradient(45deg, #4EC5D4 10%, #146b8c 20%);
  background-size: 100%;
  -webkit-background-clip: text;
  -moz-background-clip: text;
  -webkit-text-fill-color: transparent;
  -moz-text-fill-color: transparent;
}
div {
  height: 95%;
}
img {
  max-width: none;
  height: 100%;
}
</style>
The crawling process consists of several threads, and each thread performs work cycles repeatedly. The following is a brief overview of a work cycle:
- Add seed URLs to the URL Frontier.
- The HTML Downloader fetches a list of URLs from the URL Frontier.
- The HTML Downloader gets the IP addresses of the URLs from the DNS Resolver and starts downloading.
- The Content Parser parses the HTML pages and checks whether they are malformed.
- After the content is parsed and validated, it is passed to the “Content Seen?” component.
- The “Content Seen?” component checks whether an HTML page is already in the storage.
- If it is in the storage, this means the same content under a different URL has already been processed. In this case, the HTML page is discarded.
- If it is not in the storage, the system has not processed the same content before. The content is passed to the Link Extractor.
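One way the “Content Seen?” check can work: compare a hash of the page content instead of the full document. The in-memory set below is a simplification; a real system would keep these checksums in a shared datastore.

```python
# "Content Seen?" sketch: detect duplicate pages by comparing content hashes.
import hashlib

seen_checksums = set()   # in a real system this lives in a shared datastore

def content_seen(html: str) -> bool:
    """Return True if an identical page body has already been processed."""
    checksum = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if checksum in seen_checksums:
        return True          # same content under a different URL; discard the page
    seen_checksums.add(checksum)
    return False             # new content; pass it on to the Link Extractor
```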
The work cycle continues as follows:
- The Link Extractor extracts links from HTML pages.
- Extracted links are passed to the URL Filter.
- After links are filtered, they are passed to the “URL Seen?” component.
- The “URL Seen?” component checks if a URL is already in the storage; if yes, it has been processed before, and nothing needs to be done.
- If a URL has not been processed before, it is added to the URL Frontier.
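A sketch of this second half of the cycle under the same assumptions: link extraction, a URL filter, the “URL Seen?” check, and the frontier as a simple queue. The allowed() rule is only a placeholder for a real URL filter.

```python
# Link Extractor -> URL Filter -> "URL Seen?" -> URL Frontier (sketch).
import re
from collections import deque
from urllib.parse import urljoin, urldefrag

url_frontier = deque()
urls_seen = set()

def allowed(url: str) -> bool:
    # Placeholder URL filter: keep only http(s) links; a real filter also drops
    # blacklisted sites, unwanted file types, error links, etc.
    return url.startswith(("http://", "https://"))

def process_links(page_url: str, html: str) -> None:
    for href in re.findall(r'href="([^"]+)"', html):          # Link Extractor
        url, _fragment = urldefrag(urljoin(page_url, href))    # normalize the link
        if not allowed(url):                                   # URL Filter
            continue
        if url in urls_seen:                                   # "URL Seen?"
            continue
        urls_seen.add(url)
        url_frontier.append(url)                               # URL Frontier
```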
Up until now, we have discussed the high-level design.
Next, we will discuss the most important building components and techniques in depth.
Design deep dive
The Depth-first search (DFS) algorithm starts at the root of the tree (or some arbitrary node of a graph) and explores as far as possible along each branch before backtracking.
The Breadth–first search (BFS) algorithm also starts at the root of the tree (or some arbitrary node of a graph), but unlike DFS, it explores the neighbor nodes first, before moving to the next-level neighbors. In other words, BFS explores vertices in the order of their distance from the source vertex, where distance is the minimum length of a path from the source vertex to the node.
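Crawlers typically traverse the link graph in BFS order with a FIFO queue, since DFS can descend very deep. A small sketch contrasting the two orders on a made-up link graph:

```python
# BFS vs. DFS over a toy link graph (illustrative data only).
from collections import deque

links = {                      # hypothetical page -> outgoing links
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}

def bfs(start):
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()          # FIFO: nearest pages first
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(start):
    order, stack, seen = [], [start], {start}
    while stack:
        page = stack.pop()              # LIFO: follow one branch as deep as possible
        order.append(page)
        for nxt in reversed(links[page]):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return order

print("BFS:", bfs("A"))   # A, B, C, D, E
print("DFS:", dfs("A"))   # A, B, D, C, E
```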
<style>
h1 {
  background-color: #2B90B6;
  background-image: linear-gradient(45deg, #4EC5D4 10%, #146b8c 20%);
  background-size: 100%;
  -webkit-background-clip: text;
  -moz-background-clip: text;
  -webkit-text-fill-color: transparent;
  -moz-text-fill-color: transparent;
}
div {
  height: 95%;
}
img {
  max-width: none;
  height: 100%;
}
</style>
Design deep dive
A URL frontier is a data structure that stores URLs to be downloaded.
The URL frontier is an important component to ensure:
- Politeness
- Priority
- Freshness
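One common way the frontier enforces politeness is one FIFO queue per host plus a minimum delay between requests to the same host; the one-second delay below is an assumed value.

```python
# Politeness sketch: one queue per host, with a minimum delay per host.
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 1.0                      # assumed politeness interval

per_host_queues = defaultdict(deque)         # host -> FIFO queue of URLs
next_allowed_time = defaultdict(float)       # host -> earliest next fetch time

def add_url(url: str) -> None:
    per_host_queues[urlsplit(url).netloc].append(url)

def next_url():
    """Return a URL whose host is ready to be crawled again, else None."""
    now = time.monotonic()
    for host, queue in per_host_queues.items():
        if queue and now >= next_allowed_time[host]:
            next_allowed_time[host] = now + MIN_DELAY_SECONDS
            return queue.popleft()
    return None
```

A real frontier would persist these queues and map each host to a dedicated worker, so the delay is enforced per downloader thread rather than by a global scan.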
URL frontier
Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep our data set fresh. Recrawling all the URLs is time-consuming and resource-intensive. A few strategies to optimize freshness are listed as follows:
- Recrawl based on web pages' update history.
- Prioritize URLs and recrawl important pages first and more frequently (a small scheduling sketch follows).
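A small sketch of one possible recrawl scheduler that combines both ideas; the scoring heuristic and constants are assumptions for illustration.

```python
# Freshness sketch: recrawl important / frequently-updated pages sooner.
import heapq
import time

recrawl_heap = []   # min-heap of (next_crawl_time, url)

def schedule(url: str, importance: float, days_since_last_change: float) -> None:
    # Assumed heuristic: pages that change often or are important get a shorter
    # recrawl interval; the constants are placeholders, not tuned values.
    interval_days = max(1.0, days_since_last_change) / max(importance, 0.1)
    heapq.heappush(recrawl_heap, (time.time() + interval_days * 86400, url))

def due_urls(now: float):
    """Yield URLs whose scheduled recrawl time has arrived."""
    while recrawl_heap and recrawl_heap[0][0] <= now:
        yield heapq.heappop(recrawl_heap)[1]
```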
Design deep dive
The HTML Downloader downloads web pages from the internet using the HTTP protocol.
- Robots.txt
- Performance optimization
HTML Downloader
Robots.txt, also called the Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies which pages crawlers are allowed to download.
👉 Before attempting to crawl a web site, a crawler should check its corresponding robots.txt first and follow its rules.
User-agent: Googlebot
Disallow: /creatorhub/*
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
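The check itself can be done with Python's built-in urllib.robotparser; the site URL and path below are placeholders, and the user agent mirrors the example above.

```python
# Checking robots.txt before crawling, using the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # hypothetical site
rp.read()                                          # fetch and parse the rules

if rp.can_fetch("Googlebot", "https://www.example.com/creatorhub/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")              # respect the rules and skip
```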
HTML Downloader
Below is a list of performance optimizations for the HTML Downloader (a sketch of the DNS cache and short-timeout ideas follows the list):
- Distributed crawl
- Cache DNS Resolver
- Locality
- Short timeout
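A sketch of two of these optimizations, a cached DNS resolver and a short download timeout; the TTL and timeout values are assumptions.

```python
# Sketch of two optimizations: a cached DNS resolver and a short timeout.
import socket
import time
from urllib.request import urlopen

DNS_TTL_SECONDS = 300          # assumed cache lifetime
_dns_cache = {}                # host -> (ip, expiry_time)

def resolve(host: str) -> str:
    """Resolve a host name, reusing the cached IP until it expires."""
    ip, expires = _dns_cache.get(host, (None, 0.0))
    if ip is None or time.monotonic() > expires:
        ip = socket.gethostbyname(host)
        _dns_cache[host] = (ip, time.monotonic() + DNS_TTL_SECONDS)
    return ip

def download(url: str) -> bytes:
    # Short timeout: give up quickly on slow or unresponsive servers.
    return urlopen(url, timeout=3).read()
```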
Design deep dive
A few approaches to improve the system robustness:
- Consistent hashing: This helps to distribute loads among downloaders. A new downloader server can be added or removed using consistent hashing. Refer to Chapter 5: Design consistent hashing for more details.
- Save crawl states and data: To guard against failures, crawl states and data are written to a storage system. A disrupted crawl can be restarted easily by loading saved states and data.
- Exception handling: Errors are inevitable and common in a large-scale system. The crawler must handle exceptions gracefully without crashing the system.
- Data validation: This is an important measure to prevent system errors.
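A compact consistent-hashing sketch (a hash ring queried with bisect) showing how URLs could be spread across downloader servers; the server names and replica count are illustrative.

```python
# Consistent hashing sketch: map each URL to a downloader on a hash ring.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring multiple times for smoother balance.
        self._ring = sorted((_hash(f"{n}#{i}"), n)
                            for n in nodes for i in range(replicas))
        self._keys = [h for h, _ in self._ring]

    def node_for(self, url: str) -> str:
        # Walk clockwise to the first node at or after the URL's hash.
        idx = bisect.bisect(self._keys, _hash(url)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["downloader-1", "downloader-2", "downloader-3"])  # placeholder names
print(ring.node_for("https://example.com/some/page"))
```

Adding or removing a downloader only remaps the URLs that fell on its portion of the ring, which is why the technique suits elastic scaling of the crawl.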
Design deep dive
As almost every system evolves, one of the design goals is to make the system flexible enough to support new content types. The crawler can be extended by plugging in new modules.
For example:
- A PNG Downloader module is plugged in to download PNG files.
- A Web Monitor module is added to monitor the web and prevent copyright and trademark infringements.
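One way to read “plugging in new modules”: handlers registered per content type behind a common dispatch function. The module names below are illustrative, not part of the original design.

```python
# Extensibility sketch: content-type handlers registered as plug-in modules.
downloaders = {}   # content type -> handler function

def register(content_type):
    def wrap(fn):
        downloaders[content_type] = fn
        return fn
    return wrap

@register("text/html")
def handle_html(url, body):
    print("index HTML page", url)

@register("image/png")          # new module plugged in later
def handle_png(url, body):
    print("store PNG file", url)

def dispatch(url, content_type, body):
    handler = downloaders.get(content_type)
    if handler:
        handler(url, body)      # unknown content types are simply skipped
```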
Design deep dive
- Redundant content: As discussed previously, nearly 30% of web pages are duplicates. Hashes or checksums help to detect duplication [11].
- Spider traps: A spider trap is a web page that causes a crawler to fall into an infinite loop. For instance, an infinitely deep directory structure: www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/... (a simple guard is sketched after this list).
- Data noise: Some content has little or no value, such as advertisements, code snippets, spam URLs, etc. Such content is not useful for crawlers and should be excluded if possible.
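A simple guard against such traps, assuming a maximum URL length and path depth; real crawlers also rely on manual verification and per-site blacklists, since no single rule catches every trap.

```python
# Spider-trap guard sketch: cap URL length and path depth (assumed limits).
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2000
MAX_PATH_DEPTH = 10

def looks_like_trap(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:
        return True
    depth = len([p for p in urlsplit(url).path.split("/") if p])
    return depth > MAX_PATH_DEPTH

print(looks_like_trap("http://www.spidertrapexample.com/" + "foo/bar/" * 20))  # True
```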