user guide
This guide helps users learn how to use and configure news-please. This guide describes running news-please in CLI mode (with full crawling and extraction). If you want to programmatically use news-please within your Python project, or if you want to extract articles from commoncrawl.org, please refer to the README.md.
- Basic setup
- First test run
- Inspect results stored in Elasticsearch
- Optional arguments
- Add own URLs
- Advanced Configuration
news-please is a registered PyPI package and can be installed using pip. While news-please runs on both Python 2.7+ and 3.x, we recommend Python 3.5 and explain the setup for this version.
Users of Windows systems may experience problems installing news-please with pip due to missing requirements. Therefore we have to install the required packages manually:
- lxml:
  - Go to Christoph Gohlke's Python page and download the compatible wheel for your system (32bit: "lxml-X.X.X-cp35-cp35m-win32.whl"; 64bit: "lxml-X.X.X-cp35-cp35m-win_amd64.whl").
  - Open the Windows console and navigate to your Python installation:
    C:\Users\USERNAME>cd C:\Python35
  - Install the wheel with the following command:
    C:\Python35> pip install lxml-X.X.X-cp35-cp35m-win32.whl
- pywin32:
  - Download the latest build of pywin32. Make sure you select the correct version (matching your Python version, 32bit/64bit).
  - Execute the installer.
news-please is a registered PyPI package and can be installed via pip:
sudo pip install news-please
Before we can start a simple test run, we have to check the configuration. news-please automatically generates a config directory and files if the directory does not exist. The default location is ~/news-please/config, which can be changed by providing a custom location using the -c parameter.
For our first test run we only look at the [Elasticsearch] section.
This section handles the connection to the Elasticsearch database. If you freshly installed Elasticsearch on your system, you probably won't need to change the configuration. Otherwise, you should review the default settings.
Address of the Elasticsearch database and the used port:
host = localhost
port = 9200
The indices used to store the extracted meta-data:
index_current = 'news-please'
index_archive = 'news-please-archive'
Credentials used for authentication (supports CA certificates):
use_ca_certificates = False  # If True, authentication is performed
ca_cert_path = '/path/to/cacert.pem'
client_cert_path = '/path/to/client_cert.pem'
client_key_path = '/path/to/client_key.pem'
username = 'root'
secret = 'password'
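To double-check the values before a test run, the section can be read with Python's standard configparser. This is only a sketch: the sample text mirrors the settings listed above, the function name read_elasticsearch_settings is hypothetical, and a real config.cfg may contain inline comments or other syntax that the stock parser treats differently from news-please itself.

```python
import configparser

# Sample text mirroring the [Elasticsearch] settings shown in this guide.
SAMPLE_CONFIG = """
[Elasticsearch]
host = localhost
port = 9200
index_current = 'news-please'
index_archive = 'news-please-archive'
use_ca_certificates = False
"""


def read_elasticsearch_settings(text):
    """Return the connection settings from an INI-style config string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Elasticsearch"]
    return {
        "host": section.get("host"),
        "port": section.getint("port"),
        # The index names are quoted in the file, so the quotes are stripped.
        "index_current": section.get("index_current").strip("'\""),
        "index_archive": section.get("index_archive").strip("'\""),
        "use_ca_certificates": section.getboolean("use_ca_certificates"),
    }


settings = read_elasticsearch_settings(SAMPLE_CONFIG)
print(settings)
```

To inspect a real file, pass its contents instead of SAMPLE_CONFIG, keeping the caveat above in mind.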
While not necessary, it's highly recommended to change the user agent. Otherwise, the crawler is likely to be blocked by many sites sooner or later.
USER_AGENT = 'news-please (+http://www.example.com)'
Make sure your Elasticsearch server is running. Open a terminal and enter the following command:
news-please
If you did not install news-please with pip but checked out the source code, you can also go into the source code directory and run python __main__.py.
Let the program run for a minute and terminate it by pressing CTRL+C once. Wait for news-please to terminate gracefully instead of pressing CTRL+C multiple times.
While it is possible to retrieve data stored in Elasticsearch without any specific tools, we recommend ElasticHQ. To use ElasticHQ, follow these steps:
- Ensure the database is not running!
- Open the configuration file elasticsearch.yml, located either at /etc/elasticsearch/ or, if downloaded as an archive, at ./elasticsearch/conf/.
- Add the following lines at the bottom of the file:
  http.cors.enabled : true
  http.cors.allow-origin : "*"
  http.cors.allow-methods : OPTIONS, HEAD, GET, POST, PUT, DELETE
  http.cors.allow-headers : X-Requested-With,X-Auth-Token,Content-Type, Content-Length
- Save the configuration file and start Elasticsearch again.
- Go to ElasticHQ and choose your preferred version of the tool (Cloud/Plugin/Download).
- Enter the address of your database and press Connect. Now you should be able to see the previously defined indices and the number of articles stored within them.
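For a quick check without any extra tool, the number of stored articles can also be read through Elasticsearch's REST interface. The following is a minimal sketch, assuming the default host, port and index name from the configuration above and an instance that does not require authentication. The helper names build_url and count_articles are hypothetical.

```python
import json
import urllib.request


def build_url(host, port, index, endpoint):
    """Compose the REST URL for an index-level Elasticsearch endpoint."""
    return "http://{}:{}/{}/{}".format(host, port, index, endpoint)


def count_articles(host="localhost", port=9200, index="news-please"):
    """Ask the _count endpoint how many documents the index holds.

    Requires a running Elasticsearch instance reachable without credentials.
    """
    url = build_url(host, port, index, "_count")
    with urllib.request.urlopen(url) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["count"]


# Only the URL is built here; calling count_articles() needs a live server.
print(build_url("localhost", 9200, "news-please", "_count"))
```

Calling count_articles() after the test run should return a number greater than zero if articles were extracted and stored.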
news-please supports optional arguments that can be passed when starting the crawler. Start news-please with the -h parameter to see them.
To add your own websites, either edit sitelist.hjson or create a new file and register it within the configuration. Both files are located in the config directory.
If you want to create a new input file, you have to add its path to the [Files] section of config.cfg:
url_input_file_name = sitelist.hjson
The input file consists of one array called base_urls, and each entry represents one website to be crawled:
{
  "base_urls" : [
    {
      "url": "http://www.faz.net/",
      "crawler": "RecursiveCrawler",
      "overwrite_heuristics": {
        "meta_contains_article_keyword": true,
        "og_type": true,
        "linked_headlines": true,
        "self_linked_headlines": false
      },
      "pass_heuristics_condition": "meta_contains_article_keyword or (og_type and linked_headlines)"
    },
    {
      "url": "http://www.nytimes.com/",
      "crawler": "RssCrawler",
      "daemonize": 3600
    },
    ...
  ]
}
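When many sites need to be registered, the input file can be generated instead of written by hand. The sketch below relies on Hjson being a superset of JSON, so plain JSON output is a valid input file. The helpers make_entry and render_sitelist are hypothetical and not part of news-please.

```python
import json


def make_entry(url, crawler=None, daemonize=None):
    """Build one base_urls entry; optional keys are left out when unset."""
    entry = {"url": url}
    if crawler is not None:
        entry["crawler"] = crawler
    if daemonize is not None:
        entry["daemonize"] = daemonize
    return entry


def render_sitelist(entries):
    """Serialize the entries as JSON, which is also valid Hjson."""
    return json.dumps({"base_urls": entries}, indent=2)


entries = [
    make_entry("http://www.faz.net/", crawler="RecursiveCrawler"),
    make_entry("http://www.nytimes.com/", crawler="RssCrawler", daemonize=3600),
]
sitelist_text = render_sitelist(entries)
print(sitelist_text)
```

The printed text can be saved under the file name registered as url_input_file_name.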
news-please also supports direct URL download, i.e., you can define a list of URLs each pointing to an actual article that should just be downloaded and extracted.
# Furthermore this is first of all the actual config file, but as default just filled with examples.
{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      "crawler": "Download",
      "url": [
        # Cubs win Championship ~03.11.2016
        "http://www.dailymail.co.uk/news/article-3899956/Chicago-Cubs-win-World-Series-epic-Game-7-showdown-Cleveland.html",
        "http://www.mirror.co.uk/sport/other-sports/american-sports/chicago-cubs-win-world-series-9185077",
        "https://www.theguardian.com/sport/2016/nov/03/world-series-game-7-chicago-cubs-cleveland-indians-mlb",
        "http://www.telegraph.co.uk/baseball/2016/11/03/chicago-cubs-break-108-year-curse-of-the-billy-goat-winning-worl/",
        "https://www.thesun.co.uk/sport/othersports/2106710/chicago-cubs-win-world-series-hillary-clinton-bill-murray-and-barack-obama-lead-celebrations-as-cubs-end-108-year-curse/",
        "http://www.bbc.com/sport/baseball/37857919"
      ],
      "overwrite_heuristics": {
        "meta_contains_article_keyword": true,
        "og_type": false,
        "linked_headlines": false,
        "self_linked_headlines": false
      }
    }
  ]
}
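The comment in the file above states that the same URL in combination with the same crawler may only appear once. A small pre-flight check along these lines could catch duplicates before starting a crawl. It is a sketch only: find_duplicates is a hypothetical helper, and treating each URL inside a Download list as its own (url, crawler) pair is an assumption.

```python
from collections import Counter


def find_duplicates(base_urls):
    """Return every (url, crawler) pair that occurs more than once."""
    seen = Counter()
    for entry in base_urls:
        urls = entry["url"]
        # A Download entry carries a list of URLs; other entries a single string.
        if not isinstance(urls, list):
            urls = [urls]
        for url in urls:
            # Entries without an explicit crawler are grouped under None.
            seen[(url, entry.get("crawler"))] += 1
    return [pair for pair, count in seen.items() if count > 1]


example = [
    {"url": "http://www.faz.net/", "crawler": "RecursiveCrawler"},
    {"url": "http://www.faz.net/", "crawler": "RssCrawler"},
    {"url": "http://www.faz.net/", "crawler": "RecursiveCrawler"},
    {"url": ["http://www.bbc.com/sport/baseball/37857919"], "crawler": "Download"},
]
duplicates = find_duplicates(example)
print(duplicates)
```

In this example only the RecursiveCrawler entry for faz.net is reported, because the RssCrawler entry for the same URL uses a different crawler.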
The entries within base_urls may have the following parameters, which define the start point, the crawler, and the heuristics used to detect articles:
- url: (string)
  A string defining the root URL to start crawling, e.g. "http://example.com".
Optional Parameters:
- crawler: (string)
  The crawler used to collect the data. For all implemented crawlers see crawlers.
- overwrite_heuristics: (dictionary, containing mixed types)
  This overwrites the default heuristics used to detect sites containing an article. news-please expects a dict containing heuristic names as keys and, as values, the condition an article must meet to pass the heuristic. Depending on the return value of a heuristic, the condition can be a bool, a string, an int or a float.
  - bool:
    Acceptable conditions are True and False, but False will disable the heuristic!
  - string:
    Acceptable conditions are simple strings: "string_heuristic": "matched_value"
  - float/int:
    Acceptable conditions are strings that may contain one comparison operator (<, >, <=, >=, =) and a number, e.g. "linked_headlines": "<=0.65".
    Do not put spaces between the comparison operator and the number!
  For all implemented heuristics and their supported conditions see heuristics.
- pass_heuristics_condition: (string)
  This overwrites the default boolean expression defining the evaluation of the used heuristics. After all heuristics are tested and have returned True or False, this expression will be checked. It may contain any heuristic name (e.g. og_type, linked_headlines), the boolean operators (and, or, not) and parentheses.
  To disable a heuristic, you can either set its condition to False or leave it out of pass_heuristics_condition.
- daemonize: (int)
  If this parameter is set, the crawler will be started as a daemon. The value defines the number of seconds the crawler waits before scraping the target again. This parameter is only supported by the RssCrawler.
- additional_rss_daemonize: (int)
  If this parameter is set, an additional RssCrawler is spawned for the same target. The value defines the number of seconds the crawler waits before scraping the target again. This parameter is not supported by the RssCrawler.
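To make the interplay of conditions and pass_heuristics_condition more concrete, the following sketch mimics the evaluation described above. It is not the implementation used by news-please: the functions passes and evaluate are hypothetical, disabled heuristics (condition False) are not modeled, and "=" is interpreted as equality.

```python
import ast
import operator
import re

# Comparison operators named in this guide, mapped to Python functions.
OPERATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
}
# One operator directly followed by a number, without spaces in between.
NUMERIC_CONDITION = re.compile(r"^(<=|>=|<|>|=)(-?\d+(?:\.\d+)?)$")


def passes(result, condition):
    """Check a single heuristic result against its configured condition."""
    if condition is True:
        return bool(result)
    if isinstance(condition, str):
        match = NUMERIC_CONDITION.match(condition)
        if match and isinstance(result, (int, float)):
            symbol, number = match.groups()
            return OPERATORS[symbol](result, float(number))
        # Plain strings must match the heuristic's return value exactly.
        return result == condition
    raise ValueError("unsupported condition: {!r}".format(condition))


def evaluate(expression, outcomes):
    """Evaluate a boolean expression over heuristic names without eval()."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.BoolOp):
            values = [visit(value) for value in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not visit(node.operand)
        if isinstance(node, ast.Name):
            return bool(outcomes[node.id])
        raise ValueError("unsupported element in expression")

    return visit(ast.parse(expression, mode="eval"))


# Hypothetical heuristic return values for one page.
outcomes = {
    "meta_contains_article_keyword": passes(False, True),
    "og_type": passes(True, True),
    "linked_headlines": passes(0.5, "<=0.65"),
}
is_article = evaluate(
    "meta_contains_article_keyword or (og_type and linked_headlines)", outcomes
)
print(is_article)
```

Walking the parsed expression instead of calling eval() keeps the sketch restricted to heuristic names, and, or, not and parentheses, which are exactly the elements the guide allows in pass_heuristics_condition.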
This guide covers most of the standard use cases. If you are interested in more specialized configurations, visit: