# Minet Crawlers

If you need to quickly implement a performant and resilient web crawler with custom logic, `minet.crawl` provides a handful of building blocks that you can easily repurpose to suit your use case.

`minet` crawlers are multithreaded, can defer computation to a process pool, and can be made persistent both to handle large queues of urls and to be able to resume (a hypothetical configuration sketch is given at the end of the [Examples](#examples) section).

## Summary

- [Examples](#examples)
  - [The most simple crawler](#the-most-simple-crawler)
  - [The most simple crawler, with typings](#the-most-simple-crawler-with-typings)
- [Crawler](#crawler)
- [CrawlTarget](#crawltarget)
- [CrawlJob](#crawljob)
- [CrawlResult](#crawlresult)

## Examples

### The most simple crawler

In `minet`, a crawler is a multithreaded executor that reads from a queue of jobs (the combination of a url to request alongside some additional data & parameters) and performs HTTP requests for you.

Then, a crawler needs to be given one or several "spiders" that will process the result of a given crawl job (typically exposing a completed HTTP response) to extract some data and potentially enqueue the next jobs to perform.

The most simple "spider" is therefore a function taking a [crawl job](#crawljob) and an HTTP [response](./web.md#response) that must return some extracted data and the next urls to enqueue.

Let's create such a spider:

```python
def spider(job, response):

    # We are only interested in extracting stuff from completed HTML responses
    if response.status != 200 or not response.is_html:

        # NOTE: the function can return nothing, which is the same as
        # returning (None, None), meaning no data and no next urls.
        return

    # Scraping the page's title
    title = response.soup().scrape('meta > title')

    # Extracting links
    urls = response.links()

    # We return some extracted data, then the next urls to enqueue
    return title, urls
```

Now we can create a [Crawler](#crawler) using our `spider` function and iterate over [crawl results](#crawlresult) downstream:

```python
from minet.crawl import Crawler

# Always prefer using the context manager, which ensures that the resources
# managed by the crawler will be correctly cleaned up:
with Crawler(spider) as crawler:

    # Enqueuing our start url:
    crawler.enqueue("https://www.lemonde.fr")

    # Iterating over the crawler's results (after spider processing)
    for result in crawler:
        print("Url", result.url)

        if result.error is not None:
            print("Error", result.error)
        else:
            print("Depth", result.depth)
            print("Title", result.data)
```

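For instance, here is a minimal sketch that persists the crawl results to a CSV file as they arrive, reusing only the attributes shown above (the file name and column layout are arbitrary choices for this example, not anything mandated by the library):

```python
import csv

from minet.crawl import Crawler

# A minimal sketch: stream each crawl result to a CSV row.
# "results.csv" and the column layout are our own choices for this example.
with Crawler(spider) as crawler, open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "depth", "title", "error"])

    crawler.enqueue("https://www.lemonde.fr")

    for result in crawler:
        # Mirror the branching shown above: only read depth & data
        # when the job did not error.
        if result.error is not None:
            writer.writerow([result.url, None, None, str(result.error)])
        else:
            writer.writerow([result.url, result.depth, result.data, None])
```
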
### The most simple crawler, with typings

If you want to rely on Python [typings](https://docs.python.org/3/library/typing.html) when writing your spider, know that `minet.crawl` APIs are all completely typed.

Here is how one would write a typed spider function:

```python
from minet.web import Response
from minet.crawl import CrawlJob, SpiderResult

def spider(job: CrawlJob, response: Response) -> SpiderResult[str]:

    # We are only interested in extracting stuff from completed HTML responses
    if response.status != 200 or not response.is_html:

        # NOTE: the function can return nothing, which is the same as
        # returning (None, None), meaning no data and no next urls.
        return

    # Scraping the page's title
    title = response.soup().scrape('meta > title')

    # Extracting links
    urls = response.links()

    # We return some extracted data, then the next urls to enqueue
    return title, urls
```

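As mentioned in the introduction, crawlers can also defer computation to a process pool and can be made persistent and resumable. Below is a hypothetical configuration sketch: every keyword argument name is an assumption made for illustration, not the library's confirmed API, so refer to the [Crawler](#crawler) section for the actual options:

```python
from minet.crawl import Crawler

# CAUTION: all keyword argument names below are assumptions for
# illustration; check the Crawler documentation for the real options.
with Crawler(
    spider,
    max_workers=8,                          # assumed: size of the thread pool
    process_pool_workers=4,                 # assumed: defer spider work to a process pool
    persistent_storage_path="crawl-state",  # assumed: on-disk queue for large crawls
    resume=True,                            # assumed: pick up an interrupted crawl
) as crawler:
    crawler.enqueue("https://www.lemonde.fr")

    for result in crawler:
        print(result.url, result.error)
```
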
<!-- TODO: plural spiders, crawl targets -->

## Crawler

## CrawlTarget

## CrawlJob

## CrawlResult