
# scrapy-collector

A collection of website spiders. At this time, we only have one spider (mailcollect).

## Installation / Usage

### For users

### For developers

Familiarize yourself with scrapy.

1. `git clone git@github.com:evait-security/scrapy-collector.git`
2. `cd scrapy-collector/`
3. `pipenv shell` (or your preferred way to initiate a virtualenv)
4. `pipenv install` (pip users: `python -m pip install -r requirements.txt`)
5. `cd scrapy_collector/`
6. `scrapy crawl <spider> <options>` (see examples below)
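
If you prefer to drive the crawl from Python instead of the `scrapy crawl` command, a minimal sketch using scrapy's `CrawlerProcess` could look roughly like this. The spider name `mailcollect` and the `target` argument are taken from this README; treat the exact invocation as an assumption rather than project documentation:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Run this from inside the scrapy project directory so scrapy.cfg is found.
settings = get_project_settings()
settings.set("FEEDS", {"outfile.json": {"format": "json"}})  # same effect as -O outfile.json

process = CrawlerProcess(settings)
# "mailcollect" is the spider name from this README; target mirrors -a target=<target-domain>.
process.crawl("mailcollect", target="example.com")
process.start()  # blocks until the crawl finishes
```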


## Spiders

### mailcollect

Tries to collect email addresses from a given domain. It follows internal links, including links to subdomains. Addresses from other domains are not filtered out; every address that is found ends up in the results. Optionally outputs the crawled paths.

#### Options

- `-a target=<target-domain>`: The domain to be crawled. Subdomains are included automatically (if they are linked within the page).
- `-a show-paths=true`: Optional. Include the crawled paths in the output file.
- `-O outfile.json`: Write the results to `outfile.json` in JSON format. Other formats are available as well (see https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats).
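
As a rough illustration of the approach described above (not the spider shipped in this repository), a minimal scrapy spider that accepts the `target` argument and harvests email-like strings could look like this. The class name, regex, and https start URL are assumptions:

```python
import re

import scrapy
from scrapy.linkextractors import LinkExtractor


class MailCollectSketchSpider(scrapy.Spider):
    # Illustrative sketch only; the real mailcollect spider may differ.
    name = "mailcollect_sketch"

    # Naive pattern for email-like strings.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def __init__(self, target=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a target=<target-domain> arrives here as a keyword argument.
        self.target = target
        self.allowed_domains = [target]           # subdomains match automatically
        self.start_urls = [f"https://{target}/"]  # assumes the site answers on https
        self.link_extractor = LinkExtractor(allow_domains=[target])

    def parse(self, response):
        # Emit every email-like string found in the page body.
        for address in set(self.EMAIL_RE.findall(response.text)):
            yield {"email": address, "found_on": response.url}
        # Follow internal links so the crawl covers the whole domain.
        for link in self.link_extractor.extract_links(response):
            yield response.follow(link, callback=self.parse)
```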

#### Usage examples

```
scrapy runspider mailcollect.py -a target=<target-domain> -O outfile.json

scrapy runspider mailcollect.py -a target=<target-domain> -O outfile.json -a show-paths=true
```
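
Because `-O outfile.json` writes a single JSON array of scraped items, post-processing the results is straightforward. A minimal sketch, assuming each item exposes an `email` field (the actual field names depend on the spider's output):

```python
import json

# Read the feed export written by -O outfile.json (a JSON array of items).
with open("outfile.json", encoding="utf-8") as fh:
    items = json.load(fh)

# Collect and print the unique addresses; the "email" key is an assumption.
addresses = sorted({item["email"] for item in items if "email" in item})
for address in addresses:
    print(address)
```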