Project Micromort (micro + mortality), started at the --- lab at NUS, is a first attempt to detect risk signals in social media data. Think of it as the next iteration of sentiment analysis: instead of sentiment, we try to detect a risk pulse. This repo is the scriptpack for the project and contains most of the engineering work we are doing.
------ TODO
- Generate RSS feed
- Get social media shares/likes
- Scrapers
------ Setup
- MySQL
Make sure your MySQL credentials are stored in /etc/mysql/my.cnf.
Sample file:
[client]
user=user
password=password
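The scripts can pick these credentials up without hard-coding them. A minimal sketch using Python's standard configparser (the path and section follow the sample above; `read_mysql_creds` is a hypothetical helper, not something defined in this repo, and a real my.cnf with `!include` directives would need extra handling):

```python
import configparser

def read_mysql_creds(path="/etc/mysql/my.cnf"):
    """Read user/password from the [client] group of a MySQL option file.

    Works for a minimal file like the sample above; this is a sketch,
    not the repo's actual connection code.
    """
    cfg = configparser.ConfigParser()
    cfg.read(path)
    return cfg["client"]["user"], cfg["client"]["password"]
```

Most Python MySQL drivers can also read the option file directly, e.g. pymysql's `connect(read_default_file="/etc/mysql/my.cnf")`.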
- MongoDB running on localhost
- Set up the Python virtual env and install the requirements.
Create the virtual env (a one-time process). The virtualenv directory is gitignored, so you have to create one on your local machine:
virtualenv --no-site-packages virtualenv
Activate it:
source virtualenv/bin/activate
Install the requirements:
pip install -r requirements.txt
- You need a MySQL database named micromort. To create the schema:
mysql -uroot -p micromort < ./resources/DB/mysql_schema.sql
- Add the repo to your PYTHONPATH. Add the following line to your bash profile:
# ~/.bash_profile for mac
# ~/.bashrc for ubuntu
# windows ? what do you mean by windows?
export PYTHONPATH="${PYTHONPATH}:/absolute/path/to/repo/micromort/"
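To check that the export took effect, you can confirm that a directory on PYTHONPATH becomes importable in a fresh interpreter. A self-contained sketch (`micromort_check` is a throwaway stand-in for the real repo checkout):

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the repo checkout: a directory containing one module.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "micromort_check.py"), "w") as f:
    f.write("MARKER = 'ok'\n")

# Launch a fresh interpreter with PYTHONPATH pointing at that directory;
# the child process should be able to import the module.
env = dict(os.environ, PYTHONPATH=tmpdir)
result = subprocess.run(
    [sys.executable, "-c",
     "import micromort_check; print(micromort_check.MARKER)"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())
```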
- Create the unique indexes in Mongo:
use sgtalk
db.posts.createIndex( {"post.post_url" : 1 }, {"unique": true })
db.news_tweets.createIndex( {"id" : 1 }, {"unique": true })
- Crontab entries are in the crons file.
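The crons file is the source of truth; purely as an illustration, an entry for the RSS feeder might look like this (the schedule and paths here are hypothetical, check the crons file for the real ones):

```
# hypothetical: run the RSS feeder hourly from the repo root
0 * * * * cd /absolute/path/to/repo && /usr/bin/python micromort/share_metrics/newsfeedcrawler.py
```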
- Create the systemd unit file for the share getter:
sudo vim /lib/systemd/system/share-getter.service
sudo systemctl start share-getter.service
sudo systemctl status share-getter.service
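The unit file itself is not in the repo; a minimal sketch of what share-getter.service could contain (the paths and exact command are assumptions, adjust them to your machine):

```ini
[Unit]
Description=Micromort share/like count getter (hypothetical sketch)
After=network.target mysql.service

[Service]
# Assumed locations: repo checkout and the virtualenv created above.
WorkingDirectory=/absolute/path/to/repo
Environment=PYTHONPATH=/absolute/path/to/repo/micromort
ExecStart=/absolute/path/to/repo/virtualenv/bin/python micromort/share_metrics/shares_getter.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After editing the unit file, run sudo systemctl daemon-reload before starting the service.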
- SgTalk scraper: starting from the "sgtalk.org" front page, it crawls all threads and posts on SgTalk and stores the data in MongoDB (sgtalk/posts).
cd micromort/scrapers/sgtalk/sgtalk/
scrapy crawl sgtalk
- Run the RSS feeder to get the URLs of the articles:
python micromort/share_metrics/newsfeedcrawler.py
- Get share/like counts:
python micromort/share_metrics/shares_getter.py
Please follow these practices while contributing to the repo:
The logger is defined in ./utils/logger.py. Please DO NOT use any other logger or print.
# Usage:
sys.path.append("./utils/")
from logger import logger
logger.info("Hello world!")
To change the logging level, change the value of level in ./resources/configs/loggerconfig.py.
Connections to data stores like MySQL and Mongo are defined in the ./data_stores dir.
# usage:
sys.path.append("./data_stores")
from mysql import db, cursor
from mongodb import mongo_collection_articles
- Move the code that can be made public out to a different repo.
Scrape the following websites (one time):
Forums:
- ✅ SgTalk
- ✅ HardwareZone
- (sub)reddit Singapore (heads-up: https://www.find-me.co/blog/reddit_creators)
News Websites:
- ✅ Straits Times
- AsiaOne
- channelnewsasia.com
- Today
- Stomp
- Get real-time data for the following websites using RSS feeds
- Move the MySQL database from the local machine to a shared machine
- Set up a daily email report with the number of records fetched each day
This project is licensed under the MIT License.