developer guide
This guide explains the inner workings of news-please and is aimed at developers and advanced users. The following sections explain the program flow and the architecture of the project.
Here is an overview of the program flow (this diagram is not a UML diagram or similar; it is only meant to clarify how the program works):
After news-please starts, it launches a number of crawlers as subprocesses. The number of processes depends on the input (the number of targets) and is capped by the configuration.
Each subprocess calls single_crawler.py, loading the settings defined for the crawler.
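The start-up step above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual launcher: the function name `build_crawler_commands`, the round-robin distribution of targets, and the idea of passing targets as command-line arguments to single_crawler.py are all hypothetical.

```python
import sys


def build_crawler_commands(targets, max_processes, script="single_crawler.py"):
    """Plan one command line per subprocess (hypothetical helper).

    The number of workers is the number of targets, capped by max_processes.
    """
    worker_count = min(len(targets), max_processes)
    commands = []
    for worker in range(worker_count):
        # Deal the targets round-robin so every worker gets a similar share.
        assigned = targets[worker::worker_count]
        # Assumption: targets are passed as plain arguments; the real script
        # may expect something different.
        commands.append([sys.executable, script, *assigned])
    return commands


if __name__ == "__main__":
    for command in build_crawler_commands(["a.example", "b.example", "c.example"], 2):
        print(command)
```

Each planned command could then be handed to `subprocess.Popen`; the sketch stops short of that so it can be run without the project checked out.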
As mentioned before, this project relies heavily on Scrapy 1.1, an easily modifiable crawling framework.
The crawlers are implemented as Scrapy spiders located in the spider directory (./newscrawler/crawler/spiders/).
Several crawlers are currently implemented. For further information on how each spider works, read here.
We use multiple heuristics to detect whether a site is an article or not. All of these heuristics are located in ./newscrawler/helpers/heuristics.py, and any heuristics you add should be placed there as well.
Heuristics can be enabled and disabled per site, and how they work can also be changed per site.
Heuristics must return a boolean, a string, an int or a float. For each heuristic, a value can be set that the result must match. More background information about the heuristics can be found here.
For further information, read the [Heuristics] section of the Configuration page.
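The matching rule described above can be sketched roughly like this. It is not the code in heuristics.py: the heuristic names, the `page` dictionary, and the `DEFAULT_EXPECTED` and `SITE_OVERRIDES` mappings are hypothetical, and plain equality stands in for whatever comparison news-please really performs.

```python
# Hypothetical heuristics: each returns a bool, str, int or float.
def og_type(page):
    return page.get("og_type", "")


def has_headline(page):
    return bool(page.get("headline"))


def paragraph_count(page):
    return len(page.get("paragraphs", []))


HEURISTICS = {
    "og_type": og_type,
    "has_headline": has_headline,
    "paragraph_count": paragraph_count,
}

# Expected value per heuristic; a heuristic that is not listed is disabled.
DEFAULT_EXPECTED = {"og_type": "article", "has_headline": True}

# Per-site settings (assumed layout): None disables a heuristic for that site.
SITE_OVERRIDES = {"example.com": {"og_type": None, "paragraph_count": 3}}


def expected_for(site):
    """Merge the defaults with the overrides for one site."""
    merged = dict(DEFAULT_EXPECTED)
    merged.update(SITE_OVERRIDES.get(site, {}))
    return {name: value for name, value in merged.items() if value is not None}


def is_article(page, site):
    """A page counts as an article if every enabled heuristic matches its value."""
    return all(
        HEURISTICS[name](page) == expected
        for name, expected in expected_for(site).items()
    )
```

Keeping the expected values in data rather than in the heuristic functions is what makes the per-site enabling, disabling and tuning possible without touching the functions themselves.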
Sites that pass the heuristics (from now on called articles) are passed to pipelines. Which pipelines are enabled or disabled, and the order in which they run, can be set in the [Scrapy] section of newscrawler.cfg.
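As a rough illustration of how such a section could drive the pipeline order, the sketch below reads a [Scrapy] section with Python's standard `configparser`. The sample file content, the pipeline paths, and the assumption that the value is stored as a Python dict literal are all hypothetical; only the convention that Scrapy's `ITEM_PIPELINES` setting maps a class path to an integer, with lower numbers running first, comes from Scrapy itself.

```python
import ast
import configparser

# Hypothetical file content; the real newscrawler.cfg may look different.
SAMPLE_CFG = """
[Scrapy]
ITEM_PIPELINES = {"example.pipelines.SaveToDisk": 300, "example.pipelines.FilterStatus": 100}
"""


def ordered_pipelines(cfg_text):
    """Return the enabled pipeline paths in execution order."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Assumption: the option holds a dict literal, parsed safely here.
    pipelines = ast.literal_eval(parser["Scrapy"]["ITEM_PIPELINES"])
    # Lower numbers run first; a pipeline missing from the dict is disabled.
    return sorted(pipelines, key=pipelines.get)


if __name__ == "__main__":
    print(ordered_pipelines(SAMPLE_CFG))
```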
news-please offers several pipeline modules to filter, edit and save scraped articles. If you're interested in developing your own, make sure to add them to pipelines.py.
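A custom pipeline follows Scrapy's item pipeline interface: a class with a `process_item(self, item, spider)` method that either returns the item or raises `DropItem` to discard it. The sketch below shows that shape without importing Scrapy; the local `DropItem` class is a stand-in for `scrapy.exceptions.DropItem`, and the class names, item fields and `run_pipelines` helper are hypothetical.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class RejectBadStatus:
    """Filter step: discard articles that were not fetched with HTTP 200."""

    def process_item(self, item, spider):
        if item.get("http_status") != 200:
            raise DropItem("unexpected HTTP status: %r" % item.get("http_status"))
        return item


class TrimTitle:
    """Edit step: strip surrounding whitespace from the title."""

    def process_item(self, item, spider):
        item["title"] = item.get("title", "").strip()
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through each pipeline in order, as Scrapy would."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


if __name__ == "__main__":
    steps = [RejectBadStatus(), TrimTitle()]
    print(run_pipelines({"http_status": 200, "title": "  Hello  "}, steps))
```

In a real project the classes would import `DropItem` from Scrapy and be registered in the configuration rather than run through a helper like this.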
The project has a simple file hierarchy. Classes should rely only on classes stored in the same directory or in child directories.
- __init__.py (empty [1])
- .gitignore
- init-db.sql (setup script for the optional MySQL database)
- README.md
- LICENSE.txt
- requirements.txt (simple Python requirements.txt)
- single_crawler.py (a single crawler-manager)
- __main__.py (entry point, manages all crawlers)
- config/
  - sitelist.hjson (the input file containing the crawling URLs)
  - config.cfg (general config file)
- newscrawler/
  - __init__.py (empty [1])
  - config.py (reads and parses the config files (default: sitelist.json and config.cfg))
  - helper.py (helper class containing objects of the classes in helper_classes/, passed to the crawler spiders)
  - crawler/ (mostly basic crawler logic and Scrapy functionality)
    - __init__.py (empty [1])
    - items.py (Scrapy's items functionality)
    - spiders/
      - __init__.py (empty [1])
      - download_crawler.py (a download crawler for testing)
      - recursive_crawler.py (href-following, recursive crawler)
      - recursive_sitemap_crawler.py (crawler using the sitemap as starting point, then going recursive)
      - rss_crawler.py (RSS feed crawler)
      - sitemap_crawler.py (crawler reading the sitemaps via robots.txt)
  - helper_classes/
    - __init__.py (empty [1])
    - heuristics.py (heuristics used to detect articles)
    - parse_crawler.py (helper class for the crawler's parse method)
    - savepath_parser.py (helper class for saving files)
    - url_extractor.py (URL extraction helper)
    - sub_classes/
      - __init__.py (empty [1])
      - heuristics_manager.py (class used in heuristics.py for easier configuration later on)
  - (directory name missing)
    - __init__.py (empty [1])
    - pipelines.py (Scrapy's pipelines functionality, handling database inserts, local storage, wrong HTTP codes ...)
    - extractor/ (additional resources needed for the ArticleMasterExtractor)
[1]: These files are empty but required, because otherwise Python would not recognize these directories as packages.