Retrieve recipes from cooking sites and store them in a mongo database.
You must have Python 2.7, pip, and MongoDB 3.4 installed.
- From a terminal in the project directory, enter:
pip install -r requirements.txt
(you may need to use sudo)
This program uses http://schema.org/Recipe to parse the recipes and selectively store the field values.
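For orientation, the sketch below shows the kind of schema.org/Recipe microdata involved and how field values inside the Recipe scope can be picked out. It is an illustration only, using BeautifulSoup; the library choice and parsing code are assumptions, not this project's actual parser.

```python
# Illustrative only: locate a schema.org/Recipe scope and read a few
# itemprop values from it.  The property names ("name", "recipeIngredient")
# come from the schema.org Recipe vocabulary; the parsing approach here is
# an assumption, not this project's implementation.
from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Granola</h1>
  <ul>
    <li itemprop="recipeIngredient">3 cups rolled oats</li>
    <li itemprop="recipeIngredient">1/2 cup honey</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
scope = soup.find(itemtype="http://schema.org/Recipe")  # the Recipe scope
name = scope.find(itemprop="name").get_text(strip=True)
ingredients = [li.get_text(strip=True)
               for li in scope.find_all(itemprop="recipeIngredient")]
print("%s: %s" % (name, ", ".join(ingredients)))
```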
The program also contains a utility for building a profile based on a URL containing a recipe. Unfortunately, you are on your own for a link generation function.
You can, however, override use of a link generation function by providing a file containing a list of urls to check (one url per line).
A profile module must contain a site_profile dictionary of parameters and a link generation function; arguments to this function can be specified on the crawler command line and passed to the function. The crawler script makes the mongo collection available to the profile when it is loaded.
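A hypothetical profile module might look something like the sketch below. Only site_profile comes from the description above; the function name, its signature, and the parameter keys are made-up placeholders, since the real interface is defined by the bundled profiles.

```python
# hypothetical_profile.py -- a sketch only.  site_profile is the dictionary
# the crawler expects; the generator's name, its signature, and the keys used
# in site_profile are assumptions, not this project's actual API.

site_profile = {
    'collection': 'example',                        # hypothetical parameters
    'base_url': 'http://www.example.com/recipes',
}

def generate_links(args):
    """Turn command-line arguments into a list of URLs to crawl.

    Follows the page-range convention described for the saveur and gourmet
    profiles: args[0] is the first page, optional args[1] the last (inclusive).
    """
    first = int(args[0])
    last = int(args[1]) if len(args) > 1 else first
    return ['%s?page=%d' % (site_profile['base_url'], n)
            for n in range(first, last + 1)]
```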
For the bonappetit profile, the arguments are the first and last issue dates to collect recipes from (inclusive). The date format is yyyy-mm-dd. To retrieve a single issue, omit the second argument.
Example:
$ ./crawler.py collect -p bonappetit -a 2017-01-01
Epicurious has an archive of recipes from Gourmet, organized by page. Arguments to the link generator are first and last page (inclusive) and the link depth. There are 1136 pages, with 10 recipes per page.
Example, to fetch recipes from the first two pages:
$ ./crawler.py collect -p gourmet -a 1 2 -d 1
NYT does not seem to have a centralized page, except for the main cooking page. cooking.nytimes.com is always added to the returned list of links, and this can be augmented by specifying the size of a random sample of previously collected articles. You must change the link depth to at least 1 to get any new results.
To seed your database:
$ ./crawler.py collect -p nyt -d 1
After downloading some data, you can crawl these recipes to expand your collection:
$ ./crawler.py collect -p nyt -a 5 -d 1
Saveur organizes their recipes by page. Arguments are first and last page to retrieve (inclusive). The second argument can be omitted to retrieve only one page. At the time of this writing, there are 153 pages.
Example:
$ ./crawler.py collect -p saveur -a 1 3
Use the build option to attempt to discover how a site formats its recipes and what fields it presents by checking a sample URL. Sometimes fields fall outside the Recipe scope. The collector first tries to find a field in the scope and falls back to anywhere in the document, but the profile builder only looks in the scope, so the list of available fields generated by this command might not be exhaustive.
Example:
./crawler.py build http://www.saveur.com/flaky-honey-butter-biscuit-recipe
A simple command-line utility for interacting with the collection is available. It is mostly self-documenting. The only required option is the name of the recipe collection in mongo.
Example:
./search.py -m saveur
In the shell, type help to see available commands.
To start MongoDB shell:
$ mongo
Verify you have the recipes database:
> show dbs
...
recipes 0.000GB
Select the recipes database and view available collections:
> use recipes
switched to db recipes
> show collections
bonappetit
gourmet
nyt
saveur
>
To view a single recipe:
> db.bonappetit.findOne()
You can use text search with find(), but you must first set up a text index:
> db.bonappetit.createIndex({ name: "text", recipeIngredient: "text", instructions: "text"})
Then:
> db.bonappetit.find({ '$text': { '$search': 'granola'}})