Commit dd5bfbf

Fixing various typos
1 parent 479be93 commit dd5bfbf

7 files changed: +17 -17 lines changed

docs/cli.md (+4 -4)
@@ -238,7 +238,7 @@ Optional Arguments:
 asking for a compressed response. Usually better
 for bandwidth but at the cost of more CPU work.
 --connect-timeout CONNECT_TIMEOUT
-Maxium socket connection time to host. Defaults
+Maximum socket connection time to host. Defaults
 to `5`.
 --domain-parallelism DOMAIN_PARALLELISM
 Max number of urls per domain to hit at the same
@@ -419,7 +419,7 @@ Optional Arguments:
 asking for a compressed response. Usually better
 for bandwidth but at the cost of more CPU work.
 --connect-timeout CONNECT_TIMEOUT
-Maxium socket connection time to host. Defaults
+Maximum socket connection time to host. Defaults
 to `5`.
 -C, --content-filter CONTENT_FILTER
 Regex used to filter fetched content.
@@ -1622,7 +1622,7 @@ Optional Arguments:
 AMP urls when normalizing url. Defaults to
 `True`.
 --platform-aware Whether url parsing should know about some
-specififc platform such as Facebook, YouTube
+specific platform such as Facebook, YouTube
 etc. into account when normalizing urls. Note
 that this is different than activating
 --facebook or --youtube.
@@ -2581,7 +2581,7 @@ Optional Arguments:
 asking for a compressed response. Usually better
 for bandwidth but at the cost of more CPU work.
 --connect-timeout CONNECT_TIMEOUT
-Maxium socket connection time to host. Defaults
+Maximum socket connection time to host. Defaults
 to `15`.
 --domain-parallelism DOMAIN_PARALLELISM
 Max number of urls per domain to hit at the same

docs/crawlers.md (+5 -5)
@@ -254,7 +254,7 @@ class MySpider(Spider):
 - **persistent_storage_path** *Optional[str]*: path to a folder that will contain persistent on-disk resources for the crawler's queue and url cache. If not given, the crawler will work entirely in-memory, which means memory could be exceeded if the url queue or cache becomes too large and you also won't be able to resume if your python process crashes.
 - **resume** *bool* `False`: whether to attempt to resume from persistent storage. Will raise if `persistent_storage_path=None`.
 - **visit_urls_only_once** *bool* `False`: whether to guarantee the crawler won't visit the same url twice.
-- **normalized_url_cache** *bool* `False`: whether to use [`ural.normalize_url`](https://github.com/medialab/ural#normalize_url) before adding a url to the crawler's cache. This can be handy to avoid visting a same page having subtly different urls twice. This will do nothing if `visit_urls_only_once=False`.
+- **normalized_url_cache** *bool* `False`: whether to use [`ural.normalize_url`](https://github.com/medialab/ural#normalize_url) before adding a url to the crawler's cache. This can be handy to avoid visiting a same page having subtly different urls twice. This will do nothing if `visit_urls_only_once=False`.
 - **max_depth** *Optional[int]*: global maximum allowed depth for the crawler to accept a job.
 - **writer_root_directory** *Optional[str]*: root directory that will be used to resolve path written by the crawler's own threadsafe file writer.
 - **sqlar** *bool* `False`: whether the crawler's threadsafe file writer should target a [sqlar](https://www.sqlite.org/sqlar/doc/trunk/README.md) archive instead.
@@ -272,7 +272,7 @@ class MySpider(Spider):
 - **retryer_kwargs** *Optional[dict]*: arguments that will be given to [create_request_retryer](./web.md#create_request_retryer) to create the retryer for each of the spawned threads.
 - **request_args** *Optional[Callable[[T], dict]]*: function returning arguments that will be given to the threaded [request](./web.md#request) call for a given item from the iterable.
 - **use_pycurl** *bool* `False`: whether to use [`pycurl`](http://pycurl.io/) instead of [`urllib3`](https://urllib3.readthedocs.io/en/stable/) to perform the request. The `pycurl` library must be installed for this kwarg to work.
-- **compressed** *bool* `False`: whether to automatically specifiy the `Accept` header to ask the server to compress the response's body on the wire.
+- **compressed** *bool* `False`: whether to automatically specify the `Accept` header to ask the server to compress the response's body on the wire.
 - **known_encoding** *Optional[str]*: encoding of the body of requested urls. Defaults to `None` which means this encoding will be inferred from the body itself.
 - **max_redirects** *int* `5`: maximum number of redirections the request will be allowed to follow before raising an error.
 - **stateful_redirects** *bool* `False`: whether to allow the resolver to be stateful and store cookies along the redirection chain. This is useful when dealing with GDPR compliance patterns from websites etc. but can hurt performance a little bit.
@@ -330,7 +330,7 @@ for result, written_path in crawler.crawl(callback=callback):

 *Arguments*

-- **callback** *Optional[Callable[[Crawler, SuccessfulCrawlResult], T]]*: callback that can be used to perform IO-intensive tasks within the same thread used for peforming the crawler's request and to return additional information. If callback is given, the iterator returned by the method will yield `(result, callback_result)` instead of just `result`. Note that this callback must be threadsafe.
+- **callback** *Optional[Callable[[Crawler, SuccessfulCrawlResult], T]]*: callback that can be used to perform IO-intensive tasks within the same thread used for performing the crawler's request and to return additional information. If callback is given, the iterator returned by the method will yield `(result, callback_result)` instead of just `result`. Note that this callback must be threadsafe.

 #### enqueue

@@ -446,7 +446,7 @@ class MySpider(Spider):

 Method that must be implemented for the spider to be able to process the crawler's completed jobs.

-The method takes a [CrawlJob](#crawljob) instance, a HTTP [Response](./web.md#response) and must return either `None` or a 2-tuple containing: 1. some optional & arbitraty data extracted from the response, 2. an iterable of next targets for the crawler to enqueue.
+The method takes a [CrawlJob](#crawljob) instance, a HTTP [Response](./web.md#response) and must return either `None` or a 2-tuple containing: 1. some optional & arbitrary data extracted from the response, 2. an iterable of next targets for the crawler to enqueue.

 Note that next crawl targets can be relative (they will be resolved wrt current's job last redirected url) and that their depth, if not provided, will default to the current job's depth + 1.

@@ -528,7 +528,7 @@ Those jobs are also provided to spider's processing functions and can be accesse

 - **job** *[CrawlJob](#crawljob)*: job that was completed or errored.
 - **data** *Optional[T]*: data extracted by the spider for the job.
-- **error** *Optional[Exception]*: error that happend when requesting the job's url.
+- **error** *Optional[Exception]*: error that happened when requesting the job's url.
 - **error_code** *Optional[str]*: human-readable error code if an error happened when requesting the job's url.
 - **response** *Optional[[Response](./web.md#response)]*: HTTP response if the job did not error.
 - **degree** *int*: number of new jobs enqueued when processing this job.
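
For context, the hunks above describe the spider contract piecemeal, so here is a minimal end-to-end sketch. It is hedged: the import paths, the `process` method name and the `START_URL` attribute are assumptions drawn from the surrounding documentation, not confirmed by this commit.

```python
# Hedged sketch of the documented spider contract: the processing method
# receives a CrawlJob and a Response and returns either None or a 2-tuple
# (data, next_targets).
from minet.crawl import Crawler, CrawlJob, Spider
from minet.web import Response


class MySpider(Spider):
    START_URL = "https://example.com"  # hypothetical entry point

    def process(self, job: CrawlJob, response: Response):
        if response.status != 200:
            return None  # nothing extracted, nothing enqueued

        data = {"url": job.url, "depth": job.depth}  # arbitrary extracted data
        # Next targets may be relative (resolved against the job's last
        # redirected url); their depth defaults to job.depth + 1.
        next_urls = []  # e.g. links scraped from the response body
        return data, next_urls


crawler = Crawler(
    MySpider(),
    persistent_storage_path="./crawl-state",  # enables resume=True on restart
    visit_urls_only_once=True,
    normalized_url_cache=True,  # dedupe subtly different urls via ural
)

for result in crawler.crawl():
    if result.error is not None:
        print("error", result.error_code, result.job.url)
    else:
        print("done", result.job.url, "->", result.degree, "new jobs")
```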

docs/executors.md (+4 -4)
@@ -107,7 +107,7 @@ Download urls as fast as possible. Yields [RequestResult](#requestresult) object
 - **callback** *Optional[Callable[[T, str, Response], C]]*: callback that can be used to perform IO-intensive tasks within the same thread used for the request and to return additional information. If callback is given, the iterator returned by the pool will yield `(result, callback_result)` instead of just `result`. Note that this callback must be threadsafe.
 - **request_args** *Optional[Callable[[T], dict]]*: function returning arguments that will be given to the threaded [request](./web.md#request) call for a given item from the iterable.
 - **use_pycurl** *bool* `False`: whether to use [`pycurl`](http://pycurl.io/) instead of [`urllib3`](https://urllib3.readthedocs.io/en/stable/) to perform the request. The `pycurl` library must be installed for this kwarg to work.
-- **compressed** *bool* `False`: whether to automatically specifiy the `Accept` header to ask the server to compress the response's body on the wire.
+- **compressed** *bool* `False`: whether to automatically specify the `Accept` header to ask the server to compress the response's body on the wire.
 - **throttle** *float* `0.2`: time to wait, in seconds, between two calls to the same domain.
 - **buffer_size** *int* `1024`: number of items to pull ahead of time from the iterable in hope of finding some url that can be requested immediately. Decreasing this number will ease up memory usage but can slow down overall performance.
 - **domain_parallelism** *int* `1`: maximum number of concurrent calls allowed on a same domain.
@@ -127,7 +127,7 @@ Resolve urls as fast as possible. Yields [ResolveResult](#resolveresult) objects
 - **callback** *Optional[Callable[[T, str, Response], C]]*: callback that can be used to perform IO-intensive tasks within the same thread used for the request and to return additional information. If callback is given, the iterator returned by the pool will yield `(result, callback_result)` instead of just `result`. Note that this callback must be threadsafe.
 - **resolve_args** *Optional[Callable[[T], dict]]*: function returning arguments that will be given to the threaded [resolve](./web.md#resolve) call for a given item from the iterable.
 - **use_pycurl** *bool* `False`: whether to use [`pycurl`](http://pycurl.io/) instead of [`urllib3`](https://urllib3.readthedocs.io/en/stable/) to perform the request. The `pycurl` library must be installed for this kwarg to work.
-- **compressed** *bool* `False`: whether to automatically specifiy the `Accept` header to ask the server to compress the response's body on the wire.
+- **compressed** *bool* `False`: whether to automatically specify the `Accept` header to ask the server to compress the response's body on the wire.
 - **throttle** *float* `0.2`: time to wait, in seconds, between two calls to the same domain.
 - **buffer_size** *int* `1024`: number of items to pull ahead of time from the iterable in hope of finding some url that can be requested immediately. Decreasing this number will ease up memory usage but can slow down overall performance.
 - **domain_parallelism** *int* `1`: maximum number of concurrent calls allowed on a same domain.
@@ -145,7 +145,7 @@ Resolve urls as fast as possible. Yields [ResolveResult](#resolveresult) objects

 - **item** *str | T*: item from the iterable given to [request](#request).
 - **url** *Optional[str]*: url for the wrapped item, if any.
-- **error** *Optional[Exception]*: any error that was raised when peforming the HTTP request.
+- **error** *Optional[Exception]*: any error that was raised when performing the HTTP request.
 - **error_code** *Optional[str]*: human-readable error code if any error was raised when performing the HTTP request.
 - **response** *Optional[[Response](./web.md#response)]*: the completed response, if no error was raised.

@@ -183,7 +183,7 @@ assert successful_result.response is not None

 - **item** *str | T*: item from the iterable given to [resolve](#resolve).
 - **url** *Optional[str]*: url for the wrapped item, if any.
-- **error** *Optional[Exception]*: any error that was raised when peforming the HTTP request.
+- **error** *Optional[Exception]*: any error that was raised when performing the HTTP request.
 - **error_code** *Optional[str]*: human-readable error code if any error was raised when performing the HTTP request.
 - **stack** *Optional[List[[Redirection](./web.md#redirection)]]*: the redirection stack if no error was raised.
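
To make the result fields above concrete, here is a hedged usage sketch. The executor class name is an assumption based on minet's executors documentation; the result fields (`item`, `url`, `error`, `error_code`, `response`) and the `throttle`/`domain_parallelism` kwargs are the ones listed in the hunks above.

```python
# Hedged sketch: downloading urls through the threaded executor documented
# in docs/executors.md. The class name HTTPThreadPoolExecutor is an
# assumption, not confirmed by this diff.
from minet.executors import HTTPThreadPoolExecutor

urls = ["https://example.com", "https://example.org"]  # hypothetical input

with HTTPThreadPoolExecutor() as executor:
    for result in executor.request(urls, throttle=0.2, domain_parallelism=1):
        if result.error is not None:
            print(result.url, "failed:", result.error_code)
        else:
            print(result.url, "->", result.response.status)
```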

docs/web.md (+1 -1)
@@ -55,7 +55,7 @@ response = request(
 - **raise_on_statuses** *Optional[Container[int]]*: if given, request will raise if the response has a status in the given set, instead of returning the response.
 - **stateful** *bool* `False`: whether to allow the resolver to be stateful and store cookies along the redirection chain. This is useful when dealing with GDPR compliance patterns from websites etc. but can hurt performance a little bit.
 - **use_pycurl** *bool* `False`: whether to use [`pycurl`](http://pycurl.io/) instead of [`urllib3`](https://urllib3.readthedocs.io/en/stable/) to perform the request. The `pycurl` library must be installed for this kwarg to work.
-- **compressed** *bool* `False`: whether to automatically specifiy the `Accept` header to ask the server to compress the response's body on the wire.
+- **compressed** *bool* `False`: whether to automatically specify the `Accept` header to ask the server to compress the response's body on the wire.
 - **pool_manager** *Optional[urllib3.PoolManager]*: urllib3 pool manager to use to perform the request. Will use a default sensible pool manager if not given. This should only be cared about when you want to use a custom pool manager. This will not be used if `pycurl=True`.

 ## resolve
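
The hunk header above already shows the call shape (`response = request(`); a hedged sketch using the kwargs documented in it might look like this (the import path follows docs/web.md and the target url is hypothetical):

```python
from minet.web import request

response = request(
    "https://example.com",
    compressed=True,  # ask the server to compress the body on the wire
    stateful=True,  # keep cookies along the redirection chain (GDPR walls)
    raise_on_statuses={500, 502, 503},  # raise instead of returning these
)

print(response.status)
```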

minet/cli/crawl/__init__.py (+1 -1)
@@ -113,7 +113,7 @@
     },
     "connect_timeout": {
         "flag": "--connect-timeout",
-        "help": "Maxium socket connection time to host.",
+        "help": "Maximum socket connection time to host.",
         "type": float,
         "default": 5,
     },

minet/cli/url_parse/__init__.py (+1 -1)
@@ -226,7 +226,7 @@ def __call__(self, parser, cli_args, values, option_string=None):
         },
         {
             "flag": "--platform-aware",
-            "help": "Whether url parsing should know about some specififc platform such as Facebook, YouTube etc. into account when normalizing urls. Note that this is different than activating --facebook or --youtube.",
+            "help": "Whether url parsing should know about some specific platform such as Facebook, YouTube etc. into account when normalizing urls. Note that this is different than activating --facebook or --youtube.",
             "action": "store_true",
         },
     ],

minet/exceptions.py (+1 -1)
@@ -212,7 +212,7 @@ class PycurlProtocolError(PycurlError):
     pass


-# NOTE: we cannot distinguish connexion error and unknown host errors
+# NOTE: we cannot distinguish connection error and unknown host errors
 # This is the reason why `PycurlHostResolutionError` inherits from
 # `PycurlProtocolError` so we can retry it.
 class PycurlHostResolutionError(PycurlProtocolError):
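
The comment fixed here explains a deliberate design choice: since pycurl cannot tell connection errors apart from unknown-host errors, `PycurlHostResolutionError` subclasses `PycurlProtocolError` so that retry logic catching the latter also retries the former. A quick check (import path taken from this file's location):

```python
from minet.exceptions import PycurlHostResolutionError, PycurlProtocolError

# Host-resolution failures are deliberately retryable like any other
# protocol error, because pycurl cannot distinguish the two cases.
assert issubclass(PycurlHostResolutionError, PycurlProtocolError)
```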
