So you've collected tens of millions of tweets about a given topic and stored them in Google BigQuery. Now it's time to analyze them.
This project builds upon the research of Tauhid Zaman, Nicolas Guenon Des Mesnards, et al., as described in the paper "Detecting Bots and Assessing Their Impact in Social Networks".
NOTE: we used the code in this repo to support the collection and analysis of tweets about the First Trump Impeachment. For subsequent projects, this codebase has been superseded by the Tweet Analysis (2021) repo.
Version 2020:
- Tweet Collection v1
- Friend Collection v1
- PG Pipeline (Local Database Migrations)
- Retweet Graphs v1
- Retweet Graphs v2
- API v0
- API v1
- Toxicity Classification
- Tweet Recollection
- News Sources
- Botometer Sampling
Version 2021:
- Tweet Collection v2
- Continued at: Tweet Analysis 2021
Dependencies:
- Git
- Python 3.8
- PostgreSQL (optional)
Clone this repo onto your local machine and navigate there from the command-line:
cd tweet-analysis-py/
Create and activate a virtual environment, using anaconda for example, if you like that kind of thing:
conda create -n tweet-analyzer-env-38 python=3.8
conda activate tweet-analyzer-env-38
Install package dependencies:
pip install -r requirements.txt
If you want to collect tweets or user friends, obtain credentials which provide read access to the Twitter API. Set the environment variables TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, TWITTER_ACCESS_TOKEN, and TWITTER_ACCESS_TOKEN_SECRET accordingly (see environment variable setup below).
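Before starting a long collection job, it can help to confirm that all four Twitter variables are actually set. The following is a minimal sketch, not code from this repo: the variable names come from the text above, while the helper function is hypothetical.

```python
import os

# Variable names are taken from the setup instructions;
# the helper below is a hypothetical convenience, not part of the app.
TWITTER_ENV_VARS = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]


def missing_twitter_credentials(environ=None):
    """Return the names of the Twitter variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in TWITTER_ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_twitter_credentials()
    if missing:
        print("Missing Twitter credentials:", ", ".join(missing))
    else:
        print("All four Twitter credentials are set.")
```

The function accepts an optional mapping so it can be checked without touching the real environment.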
The massive volume of tweets is stored in a Google BigQuery database, so we'll need BigQuery credentials to access the data. From the Google Cloud console, enable the BigQuery API, then generate and download the corresponding service account credentials. Move them into the root directory of this repo as "credentials.json", and set the GOOGLE_APPLICATION_CREDENTIALS environment variable accordingly (see environment variable setup below).
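A quick sanity check on the downloaded credentials file can catch a wrong path or the wrong kind of key before any BigQuery call is made. This is a sketch using only the standard library; the function name is hypothetical, and the keys it looks for are the ones normally present in a Google service account key file.

```python
import json
import os


def check_service_account_file(path):
    """Return a list of problems found with the credentials file (empty if none)."""
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return [f"no file found at {path}"]
    try:
        with open(path) as f:
            info = json.load(f)
    except json.JSONDecodeError:
        return ["file is not valid JSON"]
    if not isinstance(info, dict):
        return ["file does not contain a JSON object"]

    problems = []
    # Service account key files identify themselves with this "type" value.
    if info.get("type") != "service_account":
        problems.append("'type' is not 'service_account'")
    for key in ("project_id", "private_key", "client_email"):
        if key not in info:
            problems.append(f"missing key: {key}")
    return problems


if __name__ == "__main__":
    found = check_service_account_file(os.getenv("GOOGLE_APPLICATION_CREDENTIALS"))
    print("Credentials file looks OK." if not found else "\n".join(found))
```

This only inspects the file's shape. It does not verify that the key is valid or that the service account has BigQuery access.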
Many Twitter user network graph objects will be generated, and they can be too large to construct on a laptop due to memory constraints, so you may need to run the graph construction scripts on a larger remote server. File storage on a Heroku server is ephemeral, so we save the files to a Google Cloud Storage bucket, where they persist. Create a new bucket or gain access to an existing one, and set the GCS_BUCKET_NAME environment variable accordingly (see environment variable setup below).
FYI: the bucket will also contain some temporary tables used by BigQuery during batch jobs, so we namespace the graph data under "storage/data", which mirrors the local "data" directory.
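The mirroring between the local "data" directory and the "storage/data" namespace in the bucket can be sketched as a simple path mapping. The function name and the example file path below are hypothetical, chosen only for illustration; the app's own storage code may do this differently.

```python
from pathlib import PurePath

# Per the note above: remote "storage/data" mirrors the local "data" directory.
REMOTE_PREFIX = "storage"


def remote_path_for(local_path):
    """Map a repo-relative path under "data" to its mirrored path in the bucket."""
    parts = PurePath(local_path).parts
    if not parts or parts[0] != "data":
        raise ValueError(f"expected a path under 'data', got: {local_path}")
    # Bucket object names always use forward slashes, whatever the local OS.
    return "/".join((REMOTE_PREFIX,) + parts)


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(remote_path_for("data/example_graph/graph.gpickle"))
```

Keeping the remote layout identical to the local one means a file can be located in either place from the same relative path.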
The app will run scripts that take a long time. To have those scripts send emails when they are done, first obtain a SendGrid API Key, then set it as an environment variable (see environment variable setup below).
To optionally download some of the data from BigQuery into a local database, first create a local PostgreSQL database called something like "impeachment_analysis", then set the DATABASE_URL environment variable accordingly (see environment variable setup below).
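To see what a DATABASE_URL value encodes, the standard library can split it into its parts. This is only an illustrative sketch with a hypothetical function name; the URL shown is the placeholder from the example ".env" file, not a real connection string.

```python
from urllib.parse import urlparse


def describe_database_url(url):
    """Split a PostgreSQL connection URL into its components (password omitted)."""
    parsed = urlparse(url)
    return {
        "scheme": parsed.scheme,
        "user": parsed.username,
        "host": parsed.hostname,
        "port": parsed.port,  # None when the URL does not specify a port
        "database": parsed.path.lstrip("/"),
    }


if __name__ == "__main__":
    example = "postgresql://USERNAME:PASSWORD@localhost/impeachment_analysis"
    print(describe_database_url(example))
```

The database name after the final slash should match the name of the local database you created.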
Create a new file in the root directory of this repo called ".env", and set your environment variables there, as necessary:
# example .env file
#
# GOOGLE APIs
#
GOOGLE_APPLICATION_CREDENTIALS="/path/to/tweet-analysis-py/credentials.json"
BIGQUERY_PROJECT_NAME="tweet-collector-py"
BIGQUERY_DATASET_NAME="impeachment_development"
GCS_BUCKET_NAME="impeachment-analysis-2020"
#
# LOCAL PG DATABASE
#
# DATABASE_URL="postgresql://USERNAME:PASSWORD@localhost/impeachment_analysis"
#
# EMAIL
#
# SENDGRID_API_KEY="__________"
# MY_EMAIL_ADDRESS="__________"
#
# NLP
#
# BASILICA_API_KEY="______________"
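To illustrate how a file like the one above is interpreted, here is a minimal standard-library sketch of a ".env" parser. It is an assumption that the app loads this file with a dedicated package such as python-dotenv; the parser below is a simplified stand-in for illustration, not the app's actual loader.

```python
def parse_env_text(text):
    """Parse KEY="value" lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and one layer of quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    sample = (
        "# GOOGLE APIs\n"
        'BIGQUERY_PROJECT_NAME="tweet-collector-py"\n'
        '# SENDGRID_API_KEY="__________"\n'
    )
    print(parse_env_text(sample))
```

Note that the commented-out lines in the example file (the database, email, and NLP settings) are ignored until the leading "#" is removed.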
Testing the Google BigQuery connection:
python -m app.bq_service
Testing the Google Cloud Storage connection, saving some mock files in the specified bucket:
python -m app.gcs_service
Run tests:
APP_ENV="test" pytest
On the CI server, run the tests while skipping web requests:
CI="true" APP_ENV="test" pytest
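One common way to make a CI flag skip web requests is to read the variable once and use it as a skip condition. The sketch below assumes that approach; the helper and the test name in the comment are hypothetical, and the repo's own tests may implement the skip differently.

```python
import os


def running_on_ci(environ=None):
    """Return True when the CI environment variable is set to "true"."""
    environ = os.environ if environ is None else environ
    return environ.get("CI", "false").lower() == "true"


# Hypothetical usage in a test module (requires pytest):
#
#   import pytest
#
#   @pytest.mark.skipif(running_on_ci(), reason="skips web requests on CI")
#   def test_fetches_remote_data():
#       ...

if __name__ == "__main__":
    print("On CI:", running_on_ci())
```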