GenderedNews is a dashboard of gender biases in French news created with Python, MongoDB and Metabase. Check out the website!
To set up this project, please refer to the initial setup guide.
The simplest way to run the examples:
# Fill the database with fake article data
python3 examples/example_fake_data.py
# Fill the database with yesterday's articles from Le Monde
python3 examples/example_rss_extract_store.py
To see the results, you can set up a Metabase dashboard connected to the database.
Here is how to set up a daily cron job at 01:00 (replace the script at the end of the cron line
with the desired script):
# Open the cron config file
crontab -e
# Add the following line in the config file:
0 1 * * * cd /path/to/genderednews/ && /path/to/genderednews/env/bin/python3 /path/to/genderednews/main_local.py
# See the cron config file
crontab -l
This is based on the following folder structure (non-exhaustive):
~/
└── genderednews/
    ├── current -> versions/2021-XX-XX
    ├── versions/
    │   ├── 2020-XX-XX/
    │   │   └── script.py
    │   └── 2021-XX-XX/
    │       └── script.py
    ├── shared/
    └── logs/
In step 1, there are two methods for scraping article links: via RSS feeds or via Twitter.
# To scrape via RSS feeds
collector = collector(scraping_mode='rss')
# To scrape via Twitter
collector = collector(scraping_mode='twitter')
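To illustrate what the RSS mode might do in step 1, here is a minimal sketch that extracts article links from an RSS 2.0 document. It uses only the standard library (`xml.etree.ElementTree`) on an inline sample feed rather than Feedparser, and the function name `extract_article_links` and the sample URLs are assumptions for illustration, not the project's actual collector code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be downloaded from a newspaper's RSS URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First article</title><link>https://example.org/articles/1</link></item>
    <item><title>Second article</title><link>https://example.org/articles/2</link></item>
    <item><title>No link here</title></item>
  </channel>
</rss>
"""


def extract_article_links(feed_xml: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document, skipping items without one."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


if __name__ == "__main__":
    for url in extract_article_links(SAMPLE_FEED):
        print(url)
```

Items without a link are skipped so that later steps only receive URLs they can actually fetch.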
Step 3 checks whether any articles have missing processing steps. If the parameter 'fix' is set to 'True', all such articles are processed again and updated in the database.
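The check in step 3 can be sketched as follows. This is a minimal illustration on in-memory dictionaries rather than MongoDB; the field names in `REQUIRED_FIELDS`, the `process_article` placeholder and the function `check_missing_process` are all hypothetical and do not come from the project's code.

```python
# Hypothetical names of the fields that a fully processed article would carry.
REQUIRED_FIELDS = ("mentions", "quotes")


def process_article(article: dict) -> dict:
    """Placeholder for the real pipeline: fills every missing field with an empty result."""
    updated = dict(article)
    for field in REQUIRED_FIELDS:
        updated.setdefault(field, [])
    return updated


def check_missing_process(articles: list[dict], fix: bool = False) -> list[str]:
    """Return the ids of articles lacking a required field; reprocess them in place when fix is True."""
    missing_ids = []
    for index, article in enumerate(articles):
        if any(field not in article for field in REQUIRED_FIELDS):
            missing_ids.append(article["id"])
            if fix:
                # In the real project this would be an update in the database.
                articles[index] = process_article(article)
    return missing_ids


if __name__ == "__main__":
    store = [
        {"id": "a1", "mentions": ["x"], "quotes": []},
        {"id": "a2", "mentions": ["y"]},  # "quotes" was never computed
    ]
    print(check_missing_process(store, fix=False))
    print(check_missing_process(store, fix=True))
    print(check_missing_process(store, fix=False))
```

With `fix=False` the function only reports incomplete articles; with `fix=True` it also repairs them, so a subsequent check on the same data reports nothing.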
The main technologies used in the project (see requirements.txt
for the full dependency list):
- Main tools:
- Main libraries:
  - BeautifulSoup v4.9.3 - Parse HTML
  - Dotenv v0.15.0 - Load .env files
  - Faker v8.1.2 - Generate fake data
  - Feedparser v6.0.2 - Parse RSS feeds
  - Newspaper3k v0.2.8 - Parse articles
  - PyMongo v3.11.3 - Database driver for Python
  - Tweepy v3.10.0 - Connect to the Twitter API and parse tweets
- Others:
- The Quotation Extraction model of this project will soon change from a rule-based system to an ML model!
The data was downloaded from the public websites of newspapers, for non-commercial and research purposes only.
List of news sources:
- Aujourd’hui en France (national edition of Le Parisien): https://www.leparisien.fr/
- La Croix: https://www.la-croix.com/
- Le Figaro: https://www.lefigaro.fr/
- Le Monde: https://www.lemonde.fr/
- Libération: https://www.liberation.fr/
- L'Équipe: https://www.lequipe.fr/
- Les Échos: https://www.lesechos.fr/
Mentions/Quotes
The data will make it possible to calculate masculinity rates in mentions and quotes, which will be shown as graphs on our website.
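As a rough illustration, a masculinity rate can be computed as the share of men among all gendered mentions (or quotes). The sketch below assumes that definition and that each mention has already been labelled 'm' or 'f'; the function name `masculinity_rate` and the label values are assumptions, not the project's actual implementation.

```python
from collections import Counter
from typing import Iterable, Optional


def masculinity_rate(genders: Iterable[str]) -> Optional[float]:
    """Share of 'm' labels among 'm' and 'f' labels; None when there is no gendered label."""
    counts = Counter(genders)
    men, women = counts["m"], counts["f"]
    total = men + women
    if total == 0:
        return None  # undefined rather than 0, to avoid implying "no men"
    return men / total


if __name__ == "__main__":
    # Three mentions of men, one of a woman, one with unknown gender (ignored).
    print(masculinity_rate(["m", "m", "f", "m", "unknown"]))
```

Labels other than 'm' and 'f' are ignored, so mentions whose gender could not be determined do not pull the rate in either direction.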
The Canadian project GenderGapTracker (source) has the same goal but for Canadian news.
This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE
file for details.
For more information about the research methodology and for questions regarding collaboration, please contact: [email protected], [email protected] or [email protected]