Drafting crawler docs
Yomguithereal committed Nov 17, 2023
1 parent 91141ed commit 4575fb9
Showing 2 changed files with 111 additions and 1 deletion.
109 changes: 109 additions & 0 deletions docs/crawlers.md
@@ -0,0 +1,109 @@
# Minet Crawlers

If you need to quickly implement a performant and resilient web crawler with custom logic, `minet.crawl` provides a handful of building blocks that you can easily repurpose to suit your use-case.

As such, `minet` crawlers are multithreaded, can defer computations to a process pool, and can be made persistent in order to handle large queues of urls and to resume an interrupted crawl.
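
For instance, here is a rough sketch of what a persistent, resumable crawl could look like. The keyword argument names used below (`persistent_storage_path`, `resume`) are assumptions made for the sake of illustration, not confirmed API, and `some_spider` is a throwaway spider of the kind described in the [examples](#examples) below:

```python
from minet.crawl import Crawler

# A trivial spider following every link it finds, scraping nothing
def some_spider(job, response):
    if response.status != 200 or not response.is_html:
        return

    return None, response.links()

# NOTE: `persistent_storage_path` and `resume` are assumed keyword names,
# used here only to illustrate the persistence/resumption claim above.
with Crawler(
    some_spider,
    persistent_storage_path="./crawl-state",  # assumed: keep the queue on disk
    resume=True,                              # assumed: pick up a previously interrupted crawl
) as crawler:
    crawler.enqueue("https://www.lemonde.fr")

    for result in crawler:
        print(result.url)
```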

## Summary

- [Examples](#examples)
- [The most simple crawler](#the-most-simple-crawler)
- [The most simple crawler, with typings](#the-most-simple-crawler-with-typings)
- [Crawler](#crawler)
- [CrawlTarget](#crawltarget)
- [CrawlJob](#crawljob)
- [CrawlResult](#crawlresult)

## Examples

### The most simple crawler

In `minet`, a crawler is a multithreaded executor that reads from a queue of jobs (the combination of a url to request alongside some additional data & parameters) and performs HTTP requests for you.

Then, a crawler needs to be given one or several "spiders" that will process the result of a given crawl job (typically exposing a completed HTTP response) to extract some data and potentially enqueue the next jobs to perform.

The most simple "spider" is therefore a function taking a [crawl job](#crawljob) and an HTTP [response](./web.md#response) that must return some extracted data and the next urls to enqueue.

Let's create such a spider:

```python
def spider(job, response):

    # We are only interested in extracting stuff from completed HTML responses
    if response.status != 200 or not response.is_html:

        # NOTE: the function can return nothing, which is the same as
        # returning (None, None), meaning no data, no next urls.
        return

    # Scraping the page's title
    title = response.soup().scrape('title')

    # Extracting links
    urls = response.links()

    # We return some extracted data, then the next urls to enqueue
    return title, urls
```
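
Note that the next urls returned by your spider entirely drive what gets crawled afterwards. As a small sketch (not part of the original example, reusing only the calls shown above), here is a variant that keeps the crawl within a single, hardcoded domain by filtering the extracted links with the standard library:

```python
from urllib.parse import urlsplit

def same_domain_spider(job, response):
    # We still only care about completed HTML responses
    if response.status != 200 or not response.is_html:
        return

    # Scraping the page's title
    title = response.soup().scrape('title')

    # Only enqueue links staying on the crawled domain
    # (the domain is hardcoded here for the sake of the example)
    urls = [
        url for url in response.links()
        if (urlsplit(url).hostname or "").endswith("lemonde.fr")
    ]

    return title, urls
```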

Now we can create a [Crawler](#crawler) using our `spider` function and iterate over [crawl results](#crawlresult) downstream:

```python
from minet.crawl import Crawler

# Always prefer using the context manager that ensures the resources
# managed by the crawler will be correctly cleaned up:
with Crawler(spider) as crawler:

    # Enqueuing our start url:
    crawler.enqueue("https://www.lemonde.fr")

    # Iterating over the crawler's results (after spider processing)
    for result in crawler:
        print("Url", result.url)

        if result.error is not None:
            print("Error", result.error)
        else:
            print("Depth", result.depth)
            print("Title", result.data)
```
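
Since iterating over the crawler simply yields plain [crawl results](#crawlresult), collecting the output is regular Python. As a small sketch (not part of the original example), here is how one could dump each crawled url and its scraped title into a CSV file, reusing the `spider` function defined above:

```python
import csv

from minet.crawl import Crawler

with open("titles.csv", "w", newline="", encoding="utf-8") as f, Crawler(spider) as crawler:
    writer = csv.writer(f)
    writer.writerow(["url", "title"])

    crawler.enqueue("https://www.lemonde.fr")

    for result in crawler:
        # Skipping jobs that errored out
        if result.error is not None:
            continue

        writer.writerow([result.url, result.data])
```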

### The most simple crawler, with typings

If you want to rely on Python [typings](https://docs.python.org/3/library/typing.html) when writing your spider, know that `minet.crawl` APIs are all completely typed.

Here is how one would write a typed spider function:

```python
from minet.web import Response
from minet.crawl import CrawlJob, SpiderResult

def spider(job: CrawlJob, response: Response) -> SpiderResult[str]:

    # We are only interested in extracting stuff from completed HTML responses
    if response.status != 200 or not response.is_html:

        # NOTE: the function can return nothing, which is the same as
        # returning (None, None), meaning no data, no next urls.
        return

    # Scraping the page's title
    title = response.soup().scrape('title')

    # Extracting links
    urls = response.links()

    # We return some extracted data, then the next urls to enqueue
    return title, urls
```
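
The type parameter given to `SpiderResult` describes the data returned by your spider. As a hedged sketch (assuming the generic accepts arbitrary data types, which the `SpiderResult[str]` annotation above suggests but the docs do not state explicitly), a spider returning a small record instead of a bare string could be typed like this:

```python
from typing import Dict

from minet.web import Response
from minet.crawl import CrawlJob, SpiderResult

def record_spider(job: CrawlJob, response: Response) -> SpiderResult[Dict[str, object]]:
    if response.status != 200 or not response.is_html:
        return

    soup = response.soup()

    # Gathering several extracted fields into a single record
    record: Dict[str, object] = {
        "title": soup.scrape('title'),
        "status": response.status,
    }

    return record, response.links()
```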

<!-- TODO: plural spiders, crawl targets -->

## Crawler

## CrawlTarget

## CrawlJob

## CrawlResult
3 changes: 2 additions & 1 deletion docs/lib.md
@@ -2,10 +2,11 @@

**Sorry, this section is currently being reworked, hang tight...**

<!-- TODO: `minet.extract`, crawlers -->
<!-- TODO: `minet.extract` -->

## Summary

* [`minet.web`](./web.md): module containing utilities related to one-shot HTTP requests, redirections etc.
* [`minet.executors`](./executors.md): specialized threadpool executors that can be used to download/resolve large numbers of urls efficiently.
* [`minet.crawlers`](./crawlers.md): multithreaded crawlers to navigate the web and collect data.
* [`WonderfulSoup`](./soup.md): enhanced `BeautifulSoup` class.
