
Commit 2a13030

Documenting Spider
1 parent 767cb0d commit 2a13030

File tree

1 file changed: +178 -2 lines changed


docs/crawlers.md

+178 -2
@@ -9,6 +9,8 @@ As such, `minet` crawlers are multi-threaded, can defer computations to a proces
- [Examples](#examples)
  - [The simplest crawler](#the-simplest-crawler)
  - [The simplest crawler, with typings](#the-simplest-crawler-with-typings)
  - [Using multiple spiders](#using-multiple-spiders)
  - [Implementing a more complex spider](#implementing-a-more-complex-spider)
- [Crawler](#crawler)
  - [\_\_len\_\_](#__len__)
  - [\_\_iter\_\_](#__iter__)
@@ -19,6 +21,17 @@ As such, `minet` crawlers are multi-threaded, can defer computations to a proces
  - [write](#write)
  - [submit](#submit)
- [Spider](#spider)
  - [Implementable class properties](#implementable-class-properties)
    - [START_URL](#start_url)
    - [START_URLS](#start_urls)
    - [START_TARGET](#start_target)
    - [START_TARGETS](#start_targets)
  - [Implementable methods](#implementable-methods)
    - [start](#start-1)
    - [process](#process)
  - [Methods](#methods-1)
    - [write](#write-1)
    - [submit](#submit-1)
- [CrawlTarget](#crawltarget)
- [CrawlJob](#crawljob)
- [CrawlResult](#crawlresult)
@@ -132,7 +145,105 @@ with Crawler(spider) as crawler:
        print("Title", result.data)
```

<!-- TODO: plural spiders, crawl targets, auto join, auto depth, auto spider dispatch, threaded callback -->

### Using multiple spiders

Sometimes you might want to separate the processing logic into multiple functions/spiders.

For instance, one of the spiders might scrape and navigate some pagination, while another might scrape the articles found in said pagination.

In that case, know that a crawler is able to accept multiple spiders, given as a dict mapping names to spider functions or [Spider](#spider) instances.

```python
from minet.crawl import Crawler, CrawlTarget

def pagination_spider(job, response):
    next_link = response.soup().scrape("a.next", "href")

    if next_link is None:
        return

    return None, CrawlTarget(next_link, spider="article")

def article_spider(job, response):
    titles = response.soup().scrape("h2")

    return titles, None

spiders = {
    "pagination": pagination_spider,
    "article": article_spider
}

with Crawler(spiders) as crawler:
    crawler.enqueue("http://someurl.com", spider="pagination")

    for result in crawler:
        print("From spider:", result.spider, "got result:", result)
```

### Implementing a more complex spider

Sometimes a function might not be enough and you might want to design more complex spiders. Indeed, what if you want to specify custom starting logic, custom request arguments for the HTTP calls, or use the crawler's utilities such as its threadsafe file writer or its process pool?

For this, you need to implement your own spider class, which must inherit from the [Spider](#spider) class.

At the very least, a spider class needs to implement a `process` method performing the same job as the spider functions we learnt about earlier.

```python
from minet.crawl import Crawler, Spider

class MySpider(Spider):
    def process(self, job, response):
        if response.status != 200 or not response.is_html:
            return

        return None, response.links()

# NOTE: we are now giving an instance of MySpider to the Crawler, not the class itself.
# This means you can now use your spider class __init__ to parametrize the spider if
# required.
with Crawler(MySpider()) as crawler:
    for result in crawler:
        ...
```

*Declaring starting targets*

```python
from minet.crawl import Spider, CrawlTarget

# Using any of those class attributes:
class MySpider(Spider):
    START_URL = "http://lemonde.fr"
    START_URLS = ["http://lemonde.fr", "http://lefigaro.fr"]
    START_TARGET = CrawlTarget(url="http://lemonde.fr")
    START_TARGETS = [CrawlTarget(url="http://lemonde.fr"), CrawlTarget(url="http://lefigaro.fr")]

# Implementing the start method
class MySpider(Spider):
    def start(self):
        yield "http://lemonde.fr"
        yield CrawlTarget(url="http://lefigaro.fr")
```

*Accessing the crawler's utilities*

```python
from minet.crawl import Crawler, Spider

class MySpider(Spider):
    def process(self, job, response):
        if response.status != 200 or not response.is_html:
            return

        # Submitting a computation to the process pool:
        data = self.submit(heavy_computation_function, response.body)

        # Writing a file to disk in a threadsafe manner:
        self.write(data.path, response.body, compress=True)

        return data, response.links()
```

## Crawler

@@ -280,6 +391,7 @@ Returns the actually written path after resolution and extension mangling.
*Arguments*

- **filename** *str*: path to write. Will be resolved with the crawler's `writer_root_directory` if relative.
- **contents** *str | bytes*: contents, binary or text, to write to disk.
- **relative** *bool* `False`: if `True`, the returned path will be relative instead of absolute.
- **compress** *bool* `False`: whether to gzip the file when writing. Will add `.gz` to the path if necessary.
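
A minimal usage sketch (assuming a `crawler` instance is at hand and that `html` holds contents already fetched; the filename is purely illustrative):

```python
# Write the contents through the crawler's threadsafe writer, gzip the file,
# and get back the path that was actually written after resolution.
path = crawler.write("dumps/page.html", html, compress=True)
```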

@@ -296,7 +408,71 @@ result = crawler.submit(heavy_html_processing, some_html)

## Spider

TODO...

### Implementable class properties

#### START_URL

Class property that you can use to specify a single starting url.

#### START_URLS

Class property that you can use to specify multiple starting urls as a non-lazy iterable (implement the [#.start](#start-1) method for lazy iterables, generators etc.).

#### START_TARGET

Class property that you can use to specify a single starting [CrawlTarget](#crawltarget).

#### START_TARGETS

Class property that you can use to specify multiple starting [CrawlTarget](#crawltarget) objects as a non-lazy iterable (implement the [#.start](#start-1) method for lazy iterables, generators etc.).
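
For instance, a brief sketch mirroring the example from the [Examples](#examples) section above (urls purely illustrative):

```python
from minet.crawl import Spider, CrawlTarget

class MySpider(Spider):
    # The crawler will start from these two targets when this spider is used
    START_TARGETS = [
        CrawlTarget(url="http://lemonde.fr"),
        CrawlTarget(url="http://lefigaro.fr"),
    ]
```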

### Implementable methods

#### start

Method that must return an iterable of crawl targets, either as urls or as [CrawlTarget](#crawltarget) instances.

Note that this method is only called the first time the crawler starts, and will not be called again when resuming.

```python
from minet.crawl import Spider

class MySpider(Spider):
    def start(self):
        yield "http://lemonde.fr"
```

#### process

Method that must be implemented for the spider to be able to process the crawler's completed jobs.

The method takes a [CrawlJob](#crawljob) instance and an HTTP [Response](./web.md#response), and must return either `None` or a 2-tuple containing: 1. some optional & arbitrary data extracted from the response, 2. an iterable of next targets for the crawler to enqueue.

Note that next crawl targets can be relative (they will be resolved with respect to the current job's last redirected url) and that their depth, if not provided, will default to the current job's depth + 1.

Note also that if the crawler is plural (handling multiple spiders), next targets will be dispatched to the same spider by default if a spider name is not provided for the target.

```python
from minet.web import Response
from minet.crawl import Spider, CrawlJob, SpiderResult

class MySpider(Spider):
    def process(self, job: CrawlJob, response: Response) -> SpiderResult:
        if response.status != 200 or not response.is_html:
            return

        return None, response.links()
```
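
To illustrate the two notes above, a hedged sketch (the relative urls and the `"article"` spider name are illustrative assumptions, not part of the library):

```python
from minet.web import Response
from minet.crawl import Spider, CrawlJob, CrawlTarget

class MyPaginationSpider(Spider):
    def process(self, job: CrawlJob, response: Response):
        if response.status != 200 or not response.is_html:
            return

        # Relative urls are allowed: they will be resolved against the
        # current job's last redirected url. Passing spider="article"
        # routes that target to another registered spider; omitting the
        # spider name keeps the target on this same spider.
        return None, [
            CrawlTarget(url="?page=2"),
            CrawlTarget(url="/article/1", spider="article"),
        ]
```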

### Methods

#### write

Same as calling the attached crawler's [#.write](#write) method.

#### submit

Same as calling the attached crawler's [#.submit](#submit) method.

## CrawlTarget
