docs/crawlers.md (+5 −5)
@@ -254,7 +254,7 @@ class MySpider(Spider):
 - **persistent_storage_path** *Optional[str]*: path to a folder that will contain persistent on-disk resources for the crawler's queue and url cache. If not given, the crawler will work entirely in-memory, which means memory could be exceeded if the url queue or cache becomes too large, and that you won't be able to resume if your python process crashes.
 - **resume** *bool* `False`: whether to attempt to resume from persistent storage. Will raise if `persistent_storage_path=None`.
 - **visit_urls_only_once** *bool* `False`: whether to guarantee the crawler won't visit the same url twice.
-- **normalized_url_cache** *bool* `False`: whether to use [`ural.normalize_url`](https://github.com/medialab/ural#normalize_url) before adding a url to the crawler's cache. This can be handy to avoid visting a same page having subtly different urls twice. This will do nothing if `visit_urls_only_once=False`.
+- **normalized_url_cache** *bool* `False`: whether to use [`ural.normalize_url`](https://github.com/medialab/ural#normalize_url) before adding a url to the crawler's cache. This can be handy to avoid visiting the same page twice under subtly different urls. This will do nothing if `visit_urls_only_once=False`.
 - **max_depth** *Optional[int]*: global maximum allowed depth for the crawler to accept a job.
 - **writer_root_directory** *Optional[str]*: root directory that will be used to resolve paths written by the crawler's own threadsafe file writer.
 - **sqlar** *bool* `False`: whether the crawler's threadsafe file writer should target a [sqlar](https://www.sqlite.org/sqlar/doc/trunk/README.md) archive instead.
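Taken together, the storage-related options above wire up as follows. A minimal sketch, assuming minet's documented `Crawler` and `Spider` classes are importable from `minet.crawl` (the spider body is a placeholder):

```python
from minet.crawl import Crawler, Spider

class MySpider(Spider):
    def process(self, job, response):
        return None  # see the #process section further down

crawler = Crawler(
    MySpider(),
    persistent_storage_path="./crawl-state",  # on-disk queue + url cache
    resume=False,                             # set to True to pick up a crashed run
    visit_urls_only_once=True,                # never request the same url twice
    normalized_url_cache=True,                # dedupe urls via ural.normalize_url
    max_depth=3,                              # refuse jobs deeper than this
)
```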
@@ -272,7 +272,7 @@ class MySpider(Spider):
 - **retryer_kwargs** *Optional[dict]*: arguments that will be given to [create_request_retryer](./web.md#create_request_retryer) to create the retryer for each of the spawned threads.
 - **request_args** *Optional[Callable[[T], dict]]*: function returning arguments that will be given to the threaded [request](./web.md#request) call for a given item from the iterable.
 - **use_pycurl** *bool* `False`: whether to use [`pycurl`](http://pycurl.io/) instead of [`urllib3`](https://urllib3.readthedocs.io/en/stable/) to perform the request. The `pycurl` library must be installed for this kwarg to work.
-- **compressed** *bool* `False`: whether to automatically specifiy the `Accept` header to ask the server to compress the response's body on the wire.
+- **compressed** *bool* `False`: whether to automatically specify the `Accept-Encoding` header to ask the server to compress the response's body on the wire.
 - **known_encoding** *Optional[str]*: encoding of the body of requested urls. Defaults to `None`, which means the encoding will be inferred from the body itself.
 - **max_redirects** *int* `5`: maximum number of redirections the request will be allowed to follow before raising an error.
 - **stateful_redirects** *bool* `False`: whether to allow the resolver to be stateful and store cookies along the redirection chain. This is useful when dealing with GDPR compliance patterns from websites etc., but can hurt performance a little bit.
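For instance, `request_args` lets each job customize its own HTTP call. A sketch under the same assumptions as the previous one (the header values are arbitrary):

```python
def request_args(job):
    # Per-job kwargs forwarded to the threaded request call
    if "example.com" in job.url:
        return {"headers": {"Accept-Language": "fr"}}
    return {}

crawler = Crawler(
    MySpider(),
    request_args=request_args,
    max_redirects=5,  # give up after 5 redirections
    compressed=True,  # negotiate compressed bodies on the wire
)
```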
@@ -330,7 +330,7 @@ for result, written_path in crawler.crawl(callback=callback):
 
 *Arguments*
 
-- **callback** *Optional[Callable[[Crawler, SuccessfulCrawlResult], T]]*: callback that can be used to perform IO-intensive tasks within the same thread used for peforming the crawler's request and to return additional information. If callback is given, the iterator returned by the method will yield `(result, callback_result)` instead of just `result`. Note that this callback must be threadsafe.
+- **callback** *Optional[Callable[[Crawler, SuccessfulCrawlResult], T]]*: callback that can be used to perform IO-intensive tasks within the same thread used for performing the crawler's request and to return additional information. If a callback is given, the iterator returned by the method will yield `(result, callback_result)` instead of just `result`. Note that this callback must be threadsafe.
 
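A minimal sketch of such a callback, mirroring the `written_path` usage shown in the hunk header above (`result.job.id` and `response.body` are assumed attribute names):

```python
import os

def callback(crawler, result):
    # Runs in the same worker thread that performed the request, so
    # IO-heavy work here does not block the main thread. Must be threadsafe:
    # each job writes to its own file, keyed by the job's id.
    os.makedirs("dump", exist_ok=True)
    path = os.path.join("dump", result.job.id + ".html")
    with open(path, "wb") as f:
        f.write(result.response.body)
    return path  # yielded alongside the result

for result, written_path in crawler.crawl(callback=callback):
    print(result.job.url, "->", written_path)
```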
 #### enqueue
 
@@ -446,7 +446,7 @@ class MySpider(Spider):
 
 Method that must be implemented for the spider to be able to process the crawler's completed jobs.
 
-The method takes a [CrawlJob](#crawljob) instance, a HTTP [Response](./web.md#response) and must return either `None` or a 2-tuple containing: 1. some optional & arbitraty data extracted from the response, 2. an iterable of next targets for the crawler to enqueue.
+The method takes a [CrawlJob](#crawljob) instance and an HTTP [Response](./web.md#response), and must return either `None` or a 2-tuple containing: 1. some optional & arbitrary data extracted from the response, 2. an iterable of next targets for the crawler to enqueue.
 
 Note that next crawl targets can be relative (they will be resolved wrt the current job's last redirected url) and that their depth, if not provided, will default to the current job's depth + 1.
 
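A sketch of a conforming `process` implementation (the `response.status` attribute is an assumption; actual parsing is elided):

```python
from minet.crawl import Spider

class MySpider(Spider):
    def process(self, job, response):
        # Returning None means: no data, nothing new to enqueue
        if response.status != 200:
            return None

        # 1. arbitrary data extracted from the response
        data = {"url": job.url, "status": response.status}

        # 2. next targets; relative urls are resolved against the job's
        #    last redirected url, at depth job.depth + 1 by default
        next_targets = ["./next-page", "https://example.com/other"]

        return data, next_targets
```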
@@ -528,7 +528,7 @@ Those jobs are also provided to spider's processing functions and can be accessed
 - **job** *[CrawlJob](#crawljob)*: job that was completed or errored.
 - **data** *Optional[T]*: data extracted by the spider for the job.
-- **error** *Optional[Exception]*: error that happend when requesting the job's url.
+- **error** *Optional[Exception]*: error that happened when requesting the job's url.
 - **error_code** *Optional[str]*: human-readable error code if an error happened when requesting the job's url.
 - **response** *Optional[[Response](./web.md#response)]*: HTTP response if the job did not error.
 - **degree** *int*: number of new jobs enqueued when processing this job.
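These fields make error handling explicit when consuming results; a sketch using only the attributes listed above:

```python
for result in crawler.crawl():
    if result.error is not None:
        print("failed:", result.job.url, "->", result.error_code)
        continue

    # No error, so a response is guaranteed to be present
    print("ok:", result.job.url,
          "data:", result.data,
          "enqueued", result.degree, "new jobs")
```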
docs/executors.md (+4 −4)
@@ -107,7 +107,7 @@ Download urls as fast as possible. Yields [RequestResult](#requestresult) objects
 - **callback** *Optional[Callable[[T, str, Response], C]]*: callback that can be used to perform IO-intensive tasks within the same thread used for the request and to return additional information. If a callback is given, the iterator returned by the pool will yield `(result, callback_result)` instead of just `result`. Note that this callback must be threadsafe.
 - **request_args** *Optional[Callable[[T], dict]]*: function returning arguments that will be given to the threaded [request](./web.md#request) call for a given item from the iterable.
 - **use_pycurl** *bool* `False`: whether to use [`pycurl`](http://pycurl.io/) instead of [`urllib3`](https://urllib3.readthedocs.io/en/stable/) to perform the request. The `pycurl` library must be installed for this kwarg to work.
-- **compressed** *bool* `False`: whether to automatically specifiy the `Accept` header to ask the server to compress the response's body on the wire.
+- **compressed** *bool* `False`: whether to automatically specify the `Accept-Encoding` header to ask the server to compress the response's body on the wire.
 - **throttle** *float* `0.2`: time to wait, in seconds, between two calls to the same domain.
 - **buffer_size** *int* `1024`: number of items to pull ahead of time from the iterable, in hope of finding some url that can be requested immediately. Decreasing this number will reduce memory usage but can slow down overall performance.
 - **domain_parallelism** *int* `1`: maximum number of concurrent calls allowed on a same domain.
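As a sketch of how these kwargs combine, assuming the executor documented in this file is `HTTPThreadPoolExecutor` from `minet.executors`:

```python
from minet.executors import HTTPThreadPoolExecutor

urls = ["https://example.com", "https://example.org"]

with HTTPThreadPoolExecutor() as executor:
    for result in executor.request(
        urls,
        throttle=0.2,          # seconds between two calls to the same domain
        domain_parallelism=1,  # at most one concurrent call per domain
        compressed=True,       # negotiate compressed bodies on the wire
    ):
        print(result.url)
```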
@@ -127,7 +127,7 @@ Resolve urls as fast as possible. Yields [ResolveResult](#resolveresult) objects
 - **callback** *Optional[Callable[[T, str, Response], C]]*: callback that can be used to perform IO-intensive tasks within the same thread used for the request and to return additional information. If a callback is given, the iterator returned by the pool will yield `(result, callback_result)` instead of just `result`. Note that this callback must be threadsafe.
 - **resolve_args** *Optional[Callable[[T], dict]]*: function returning arguments that will be given to the threaded [resolve](./web.md#resolve) call for a given item from the iterable.
 - **use_pycurl** *bool* `False`: whether to use [`pycurl`](http://pycurl.io/) instead of [`urllib3`](https://urllib3.readthedocs.io/en/stable/) to perform the request. The `pycurl` library must be installed for this kwarg to work.
-- **compressed** *bool* `False`: whether to automatically specifiy the `Accept` header to ask the server to compress the response's body on the wire.
+- **compressed** *bool* `False`: whether to automatically specify the `Accept-Encoding` header to ask the server to compress the response's body on the wire.
 - **throttle** *float* `0.2`: time to wait, in seconds, between two calls to the same domain.
 - **buffer_size** *int* `1024`: number of items to pull ahead of time from the iterable, in hope of finding some url that can be requested immediately. Decreasing this number will reduce memory usage but can slow down overall performance.
 - **domain_parallelism** *int* `1`: maximum number of concurrent calls allowed on a same domain.
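`resolve` follows the same pattern, with `resolve_args` playing the role `request_args` plays for [request](#request); continuing the previous sketch under the same assumptions:

```python
def resolve_args(item):
    # Per-item kwargs forwarded to the threaded resolve call
    return {}

with HTTPThreadPoolExecutor() as executor:
    for result in executor.resolve(urls, resolve_args=resolve_args):
        print(result.url)
```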
@@ -145,7 +145,7 @@ Resolve urls as fast as possible. Yields [ResolveResult](#resolveresult) objects
 
 - **item** *str | T*: item from the iterable given to [request](#request).
 - **url** *Optional[str]*: url for the wrapped item, if any.
-- **error** *Optional[Exception]*: any error that was raised when peforming the HTTP request.
+- **error** *Optional[Exception]*: any error that was raised when performing the HTTP request.
 - **error_code** *Optional[str]*: human-readable error code if any error was raised when performing the HTTP request.
 - **response** *Optional[[Response](./web.md#response)]*: the completed response, if no error was raised.
 
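In practice `error` and `response` are mutually exclusive, as the assertion in the next hunk's context suggests. A sketch reusing the executor from the earlier example (`response.status` is an assumed attribute):

```python
for result in executor.request(urls):
    if result.error is not None:
        print(result.url, "errored:", result.error_code)
    else:
        assert result.response is not None
        print(result.url, "status:", result.response.status)
```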
@@ -183,7 +183,7 @@ assert successful_result.response is not None
 
 - **item** *str | T*: item from the iterable given to [resolve](#resolve).
 - **url** *Optional[str]*: url for the wrapped item, if any.
-- **error** *Optional[Exception]*: any error that was raised when peforming the HTTP request.
+- **error** *Optional[Exception]*: any error that was raised when performing the HTTP request.
 - **error_code** *Optional[str]*: human-readable error code if any error was raised when performing the HTTP request.
 - **stack** *Optional[List[[Redirection](./web.md#redirection)]]*: the redirection stack if no error was raised.
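A sketch of walking the redirection stack, reusing the executor from the earlier example (the `Redirection` attribute names are assumptions based on ./web.md#redirection):

```python
for result in executor.resolve(urls):
    if result.error is not None:
        print(result.url, "errored:", result.error_code)
        continue

    # One Redirection per hop; the last entry is the final landing point
    for redirection in result.stack:
        print(redirection.type, redirection.url)
```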
docs/web.md (+1 −1)
@@ -55,7 +55,7 @@ response = request(
 - **raise_on_statuses** *Optional[Container[int]]*: if given, the request will raise if the response has a status in the given set, instead of returning the response.
 - **stateful** *bool* `False`: whether to allow the resolver to be stateful and store cookies along the redirection chain. This is useful when dealing with GDPR compliance patterns from websites etc., but can hurt performance a little bit.
 - **use_pycurl** *bool* `False`: whether to use [`pycurl`](http://pycurl.io/) instead of [`urllib3`](https://urllib3.readthedocs.io/en/stable/) to perform the request. The `pycurl` library must be installed for this kwarg to work.
-- **compressed** *bool* `False`: whether to automatically specifiy the `Accept` header to ask the server to compress the response's body on the wire.
+- **compressed** *bool* `False`: whether to automatically specify the `Accept-Encoding` header to ask the server to compress the response's body on the wire.
 - **pool_manager** *Optional[urllib3.PoolManager]*: urllib3 pool manager used to perform the request. A sensible default pool manager will be used if not given, so this only matters when you want a custom one. This will not be used if `use_pycurl=True`.
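A sketch combining the kwargs from this hunk (only options documented above are used; the status values are illustrative):

```python
from minet.web import request

response = request(
    "https://example.com",
    raise_on_statuses={500, 502, 503},  # raise instead of returning these
    stateful=True,                      # keep cookies across the redirection chain
    compressed=True,                    # negotiate a compressed body on the wire
)

print(response.status)  # .status assumed from the Response docs
```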
"help": "Whether url parsing should know about some specififc platform such as Facebook, YouTube etc. into account when normalizing urls. Note that this is different than activating --facebook or --youtube.",
229
+
"help": "Whether url parsing should know about some specific platform such as Facebook, YouTube etc. into account when normalizing urls. Note that this is different than activating --facebook or --youtube.",