# py-quick-crawlers

Exposes a set of Python APIs wrapping a collection of generic Scrapy-based crawlers for quick use, hiding all the complex crawling configuration and coding details.

## Installation

This package is not on PyPI yet. For now, install it directly from GitHub.

```sh
sudo pip install git+https://github.com/vdraceil/py-quick-crawlers.git
```

## API

The APIs are exposed via the `Controller` module.

```python
from py_quick_crawlers import Controller

# invoke the APIs on a Controller instance
controller = Controller()
```

See the `examples/` directory for detailed usage instructions.

### Pattern Match Crawl

Crawls a given set of URLs to a given depth and retrieves all items that match the given set of patterns.

#### Signature

```python
pattern_match_crawl(target, pattern_dict, out_file=None, feed_type='JSON')
```
#### Parameters

- `target` - A list of tuples specifying the start URLs and individual crawl depths: `[(<START_URL1>, <DEPTH1>), (<START_URL2>, <DEPTH2>), ...]`
- `pattern_dict` - A dictionary mapping names (used for output formatting) to patterns as compiled regular expressions: `{'<key1>': <REGEX1>, '<key2>': <REGEX2>, ...}`
- `out_file` - Output file; replaced if it exists
- `feed_type` - Output file format: `'JSON'`, `'CSV'`, or `'XML'`

#### Output

A file with the crawled data in the requested feed format.
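
For example, a minimal sketch of a pattern-match crawl (the URL, depth, and regex here are illustrative, not part of the API):

```python
import re

from py_quick_crawlers import Controller

controller = Controller()

# Crawl example.com up to a depth of 2 and collect anything that looks
# like an email address, writing the matches to matches.json.
target = [('https://example.com', 2)]
pattern_dict = {'email': re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')}

controller.pattern_match_crawl(target, pattern_dict,
                               out_file='matches.json', feed_type='JSON')
```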

### Content Download Crawl

Crawls a given set of URLs to a given depth and downloads all files matching the given list of file patterns into an output directory.

#### Signature

```python
content_download_crawl(target, pattern_list, out_dir=None, enable_dir_structure=False)
```
#### Parameters

- `target` - A list of tuples specifying the start URLs and individual crawl depths: `[(<START_URL1>, <DEPTH1>), (<START_URL2>, <DEPTH2>), ...]`
- `pattern_list` - A list of allowed file patterns as compiled regular expressions: `[<REGEX1>, <REGEX2>, ...]`
- `out_dir` - Output directory; created if it does not exist
- `enable_dir_structure` - Boolean (`True` or `False`) determining how the downloaded files are laid out in the output directory. If `True`, the files are organized in a directory structure resembling their URL paths on the web server. If `False`, all files are placed at the top level of the output directory, named by the SHA1 hashes of their URL paths.

#### Output

A directory containing the files downloaded from the target websites that match the given file patterns.
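
For example, a minimal sketch of a content-download crawl (again, the URL, depth, and regex are illustrative):

```python
import re

from py_quick_crawlers import Controller

controller = Controller()

# Crawl example.com up to a depth of 1 and download every PDF found,
# mirroring the server's directory layout under downloads/.
target = [('https://example.com', 1)]
pattern_list = [re.compile(r'\.pdf$', re.IGNORECASE)]

controller.content_download_crawl(target, pattern_list,
                                  out_dir='downloads',
                                  enable_dir_structure=True)
```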