# Minet Crawlers

If you need to quickly implement a performant and resilient web crawler with custom logic, `minet.crawl` provides a handful of building blocks that you can easily repurpose to suit your use case.

`minet` crawlers are multithreaded, can defer computation to a process pool, and can be made persistent both to handle large queues of urls and to be able to resume (a hypothetical configuration sketch is given at the end of the [Examples](#examples) section).

## Summary

- [Examples](#examples)
  - [The most simple crawler](#the-most-simple-crawler)
  - [The most simple crawler, with typings](#the-most-simple-crawler-with-typings)
- [Crawler](#crawler)
- [CrawlTarget](#crawltarget)
- [CrawlJob](#crawljob)
- [CrawlResult](#crawlresult)

## Examples

### The most simple crawler

In `minet`, a crawler is a multithreaded executor that reads from a queue of jobs (the combination of a url to request alongside some additional data & parameters) and performs HTTP requests for you.

Then, a crawler needs to be given one or several "spiders" that will process the result of a given crawl job (typically exposing a completed HTTP response) to extract some data and potentially enqueue the next jobs to perform.

The most simple "spider" is therefore a function taking a [crawl job](#crawljob) and an HTTP [response](./web.md#response) that must return some extracted data and the next urls to enqueue.

Let's create such a spider:

```python
def spider(job, response):

    # We are only interested in extracting stuff from completed HTML responses
    if response.status != 200 or not response.is_html:

        # NOTE: the function can return nothing, which is the same as
        # returning (None, None), meaning no data and no next urls.
        return

    # Scraping the page's title
    title = response.soup().scrape('meta > title')

    # Extracting links
    urls = response.links()

    # We return some extracted data, then the next urls to enqueue
    return title, urls
```

Now we can create a [Crawler](#crawler) using our `spider` function and iterate over [crawl results](#crawlresult) downstream:

```python
from minet.crawl import Crawler

# Always prefer using the context manager, which ensures that the resources
# managed by the crawler will be correctly cleaned up:
with Crawler(spider) as crawler:

    # Enqueuing our start url:
    crawler.enqueue("https://www.lemonde.fr")

    # Iterating over the crawler's results (after spider processing)
    for result in crawler:
        print("Url", result.url)

        if result.error is not None:
            print("Error", result.error)
        else:
            print("Depth", result.depth)
            print("Title", result.data)
```

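For instance, here is a minimal sketch that persists the crawl results to a CSV file as they arrive, reusing only the attributes shown above (the file name and column layout are arbitrary choices for this example, not anything mandated by the library):

```python
import csv

from minet.crawl import Crawler

# A minimal sketch: stream each crawl result to a CSV row.
# "results.csv" and the column layout are our own choices for this example.
with Crawler(spider) as crawler, open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "depth", "title", "error"])

    crawler.enqueue("https://www.lemonde.fr")

    for result in crawler:
        # Mirror the branching shown above: only read depth & data
        # when the job did not error.
        if result.error is not None:
            writer.writerow([result.url, None, None, str(result.error)])
        else:
            writer.writerow([result.url, result.depth, result.data, None])
```
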
### The most simple crawler, with typings

If you want to rely on Python [typings](https://docs.python.org/3/library/typing.html) when writing your spider, know that `minet.crawl` APIs are all completely typed.

Here is how one would write a typed spider function:

```python
from minet.web import Response
from minet.crawl import CrawlJob, SpiderResult

def spider(job: CrawlJob, response: Response) -> SpiderResult[str]:

    # We are only interested in extracting stuff from completed HTML responses
    if response.status != 200 or not response.is_html:

        # NOTE: the function can return nothing, which is the same as
        # returning (None, None), meaning no data and no next urls.
        return

    # Scraping the page's title
    title = response.soup().scrape('meta > title')

    # Extracting links
    urls = response.links()

    # We return some extracted data, then the next urls to enqueue
    return title, urls
```

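As mentioned in the introduction, crawlers can also defer computation to a process pool and can be made persistent and resumable. Below is a hypothetical configuration sketch: every keyword argument name is an assumption made for illustration, not the library's confirmed API, so refer to the [Crawler](#crawler) section for the actual options:

```python
from minet.crawl import Crawler

# CAUTION: all keyword argument names below are assumptions for
# illustration; check the Crawler documentation for the real options.
with Crawler(
    spider,
    max_workers=8,                          # assumed: size of the thread pool
    process_pool_workers=4,                 # assumed: defer spider work to a process pool
    persistent_storage_path="crawl-state",  # assumed: on-disk queue for large crawls
    resume=True,                            # assumed: pick up an interrupted crawl
) as crawler:
    crawler.enqueue("https://www.lemonde.fr")

    for result in crawler:
        print(result.url, result.error)
```
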
<!-- TODO: plural spiders, crawl targets -->

## Crawler

## CrawlTarget

## CrawlJob

## CrawlResult