- [Implementing a more complex spider](#implementing-a-more-complex-spider)
- [Crawler](#crawler)
  - [\_\_len\_\_](#__len__)
  - [\_\_iter\_\_](#__iter__)
  - [write](#write)
  - [submit](#submit)
- [Spider](#spider)
  - [Implementable class properties](#implementable-class-properties)
    - [START_URL](#start_url)
    - [START_URLS](#start_urls)
    - [START_TARGET](#start_target)
    - [START_TARGETS](#start_targets)
  - [Implementable methods](#implementable-methods)
    - [start](#start-1)
    - [process](#process)
  - [Methods](#methods-1)
    - [write](#write-1)
    - [submit](#submit-1)
- [CrawlTarget](#crawltarget)
- [CrawlJob](#crawljob)
- [CrawlResult](#crawlresult)
```python
with Crawler(spider) as crawler:
    for result in crawler:
        print("Title", result.data)
```
### Using multiple spiders
Sometimes you might want to separate the processing logic into multiple functions/spiders.
For instance, one spider might scrape and navigate some pagination, while another one scrapes the articles found in said pagination.
In that case, know that a crawler is able to accept multiple spiders, given as a dict mapping names to spider functions or [Spider](#spider) instances.
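
As a rough, hedged sketch of what this can look like (the spider names, the dispatching logic and the returned data below are illustrative choices, not something mandated by `minet`; it also assumes `CrawlTarget` accepts a url as its first argument along with a `spider` name, and that the job exposes a `url` attribute):

```python
from minet.crawl import Crawler, CrawlTarget

# Hypothetical spider navigating a pagination and dispatching every
# found link to the "article" spider.
def pagination_spider(job, response):
    if response.status != 200 or not response.is_html:
        return

    next_targets = [CrawlTarget(url, spider="article") for url in response.links()]

    return None, next_targets

# Hypothetical spider only extracting some data from article pages.
def article_spider(job, response):
    if response.status != 200 or not response.is_html:
        return

    return {"url": job.url}, None

# A dict mapping spider names to spider functions:
crawler = Crawler({"pagination": pagination_spider, "article": article_spider})
```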
### Implementing a more complex spider

Sometimes a function might not be enough and you may want to design more complex spiders. Indeed, what if you want to specify custom starting logic, custom request arguments for the HTTP calls, or use the crawler's utilities such as its threadsafe file writer or its process pool?
For this, you need to implement your own spider, which must inherit from the [Spider](#spider) class.
At minimum, a spider class needs to implement a `process` method performing the same job as its function counterpart that we learnt about earlier.
```python
from minet.crawl import Crawler, Spider

class MySpider(Spider):
    def process(self, job, response):
        if response.status != 200 or not response.is_html:
            return

        return None, response.links()

# NOTE: we are now giving an instance of MySpider to the Crawler, not the class itself
# This means you can now use your spider class __init__ to parametrize the spider if
# you need to.
crawler = Crawler(MySpider())
```

Returns the actually written path after resolution and extension mangling.

*Arguments*

- **filename** *str*: path to write. Will be resolved with the crawler's `writer_root_directory` if relative.
- **contents** *str | bytes*: binary or text content to write to disk.
- **relative** *bool* `False`: if `True`, the returned path will be relative instead of absolute.
- **compress** *bool* `False`: whether to gzip the file when writing. Will add `.gz` to the path if necessary.
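
As a quick, hedged usage sketch (the filename is a placeholder and we assume `response.body` holds the downloaded bytes):

```python
# Write a page's body through the crawler's threadsafe writer.
# The relative path will be resolved against the crawler's writer_root_directory.
path = crawler.write("dumps/page.html", response.body, compress=True)

# `path` is the actually written path, with ".gz" appended because of compression.
print(path)
```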
## Spider
### Implementable class properties
#### START_URL
Class property that you can use to specify a single starting url.
#### START_URLS
Class property that you can use to specify multiple starting urls as a non-lazy iterable (implement the [#.start](#start-1) method for lazy iterables, generators etc.).
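
For instance, a minimal sketch (the urls are just placeholders and the `process` bodies are elided):

```python
from minet.crawl import Spider

class SingleStartSpider(Spider):
    # A single starting url
    START_URL = "https://www.lemonde.fr"

    def process(self, job, response):
        ...

class MultipleStartsSpider(Spider):
    # Multiple starting urls, given as a non-lazy iterable
    START_URLS = ["https://www.lemonde.fr", "https://www.lefigaro.fr"]

    def process(self, job, response):
        ...
```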
#### START_TARGET
Class property that you can use to specify a single starting [CrawlTarget](#crawltarget).
#### START_TARGETS
Class property that you can use to specify multiple starting [CrawlTarget](#crawltarget) objects as a non-lazy iterable (implement the [#.start](#start-1) method for lazy iterables, generators etc.).
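
Likewise, a hedged sketch (assuming `CrawlTarget` takes the url as its first argument; the urls themselves are placeholders):

```python
from minet.crawl import Spider, CrawlTarget

class SingleTargetSpider(Spider):
    # A single starting target
    START_TARGET = CrawlTarget("https://www.lemonde.fr")

    def process(self, job, response):
        ...

class MultipleTargetsSpider(Spider):
    # Multiple starting targets, given as a non-lazy iterable
    START_TARGETS = [
        CrawlTarget("https://www.lemonde.fr"),
        CrawlTarget("https://www.lefigaro.fr"),
    ]

    def process(self, job, response):
        ...
```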
### Implementable methods
#### start
Method that must return an iterable of crawl targets as urls or [CrawlTarget](#crawltarget) instances.
Note that this method is only called the first time the crawler starts, and will not be called again when resuming.
```python
from minet.crawl import Spider
class MySpider(Spider):
    def start(self):
        yield "http://lemonde.fr"
```
#### process
Method that must be implemented for the spider to be able to process the crawler's completed jobs.
The method takes a [CrawlJob](#crawljob) instance and an HTTP [Response](./web.md#response), and must return either `None` or a 2-tuple containing: 1. some optional & arbitrary data extracted from the response, 2. an iterable of next targets for the crawler to enqueue.
Note that next crawl targets can be relative (they will be resolved with respect to the current job's last redirected url) and that their depth, if not provided, will default to the current job's depth + 1.
Note also that if the crawler is plural (handling multiple spiders), next targets will be dispatched to the same spider by default if a spider name is not provided for the target.
```python
from minet.web import Response
from minet.crawl import Spider, CrawlJob, SpiderResult

class MySpider(Spider):
    def process(self, job: CrawlJob, response: Response) -> SpiderResult:
        if response.status != 200 or not response.is_html:
            return

        return None, response.links()
```