py-quick-crawlers

Exposes a set of Python APIs wrapping a collection of generic Scrapy-based crawlers for quick use, hiding all the complex crawling configuration and coding details.

Installation

This package is not on PyPI yet. For now, install it directly from GitHub.

sudo pip install git+https://github.com/vdraceil/py-quick-crawlers.git

API

APIs are exposed via the Controller module.

from py_quick_crawlers import Controller

# you can invoke the APIs on the Controller instance
controller = Controller()

See examples/ for detailed usage instructions.

Pattern Match Crawl

Crawls a given set of URLs up to a given depth and retrieves all items that match the given set of patterns.

Signature
pattern_match_crawl(target, pattern_dict, out_file=None, feed_type='JSON')
Parameters

target - A list of tuples specifying the start URLs and individual depths [ (<START_URL1>, <DEPTH1>), (<START_URL2>, <DEPTH2>), ... ]
pattern_dict - A dictionary mapping names (used for output formatting) to patterns as compiled regular expressions { '<key1>': <REGEX1>, '<key2>': <REGEX2>, ... }
out_file - Output file; replaced if it already exists
feed_type - Output file format; one of 'JSON', 'CSV' or 'XML'

Output

A file with the crawled data in the requested feed format.
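Example (a minimal sketch based on the signature above; the URLs, regexes and output file name are placeholders, not part of the library):

import re

from py_quick_crawlers import Controller

controller = Controller()

# crawl two sites to different depths, collecting e-mail addresses
# and outbound links; results are written to matches.json
controller.pattern_match_crawl(
    target=[('https://example.com', 2), ('https://example.org', 1)],
    pattern_dict={
        'email': re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
        'link': re.compile(r'https?://\S+'),
    },
    out_file='matches.json',
    feed_type='JSON'
)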

Content Download Crawl

Crawls a given set of URLs up to a given depth and downloads all files matching the given list of file patterns into an output directory.

Signature
content_download_crawl(target, pattern_list, out_dir=None, enable_dir_structure=False)
Parameters

target - A list of tuples specifying the start URLs and individual depths [ (<START_URL1>, <DEPTH1>), (<START_URL2>, <DEPTH2>), ... ]
pattern_list - A list of allowed file patterns as compiled regular expressions [ <REGEX1>, <REGEX2>, ... ]
out_dir - Output directory; created if it does not exist
enable_dir_structure - Boolean <True/False>
Determines how the files are laid out in the output directory.
If True, the downloaded files are organized in a directory structure mirroring their URL paths on the web server.
If False, all downloaded files are placed at the top level of the output directory, named with the SHA1 hashes of their URL paths.

Output

A directory of files downloaded from target websites matching the given file patterns.
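Example (again a sketch; the start URL, file patterns and output directory are illustrative):

import re

from py_quick_crawlers import Controller

controller = Controller()

# download every PDF and image reachable within one hop of the start URL,
# mirroring the server's directory layout under downloads/
controller.content_download_crawl(
    target=[('https://example.com/docs', 1)],
    pattern_list=[
        re.compile(r'.*\.pdf$'),
        re.compile(r'.*\.(jpe?g|png|gif)$'),
    ],
    out_dir='downloads',
    enable_dir_structure=True
)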
