
Reddit Political Subreddit Comments during the 2016 U.S. Presidential Election

View the presentation for this project here.

Data Acquisition and Cleaning

Acquisition

We downloaded all Reddit comments from October 2016 (RC_2016-10.bz2) and November 2016 (RC_2016-11.bz2) from Pushshift, which hosts comment dumps for every month from December 2005 through February 2018, with file sizes ranging from about 118 KB to 7.75 GB. Each compressed file contains a collection of JSON objects representing Reddit comments. Here is an example of one of these JSON objects:

{  
    "author":"Dethcola",
    "author_flair_css_class":"",
    "author_flair_text":"Clairemont",
    "body":"A quarry",
    "can_gild":true,
    "controversiality":0,
    "created_utc":1506816000,
    "distinguished":null,
    "edited":false,
    "gilded":0,
    "id":"dnqik14",
    "is_submitter":false,
    "link_id":"t3_73ieyz",
    "parent_id":"t3_73ieyz",
    "permalink":"/r/sandiego/comments/73ieyz/best_place_for_granite_counter_tops/dnqik14/",
    "retrieved_on":1509189606,
    "score":3,
    "stickied":false,
    "subreddit":"sandiego",
    "subreddit_id":"t5_2qq2q"
}

The October dataset was 6.45 GB compressed and ~35.2 GB uncompressed, for a total of 54,129,644 JSON objects. The November dataset was 6.45 GB compressed and ~46.71 GB uncompressed, for a total of 71,826,554 JSON objects.

Cleaning

Cleaning the downloaded data sets involved multiple phases due to their size. The first phase of filtering reduces the size of the data set by selecting only comments posted from 2 weeks before election day (October 26) to 2 weeks after election day (November 23). Additionally, we only keep comments that were posted in our list of political subreddits.

Additional cleaning and filtering of the data set is described below:

  • Removed comments if the author was one of the following:
    • [deleted]
    • AutoModerator
    • Any username containing the word bot
  • Removed comments if the author was a known moderator, since the majority of their comments related to their moderator duties. Removing the authors on our list of known moderators dropped 233 comments.
  • Removed comments that were not posted in the list of relevant subreddits.
  • Removed all hardcoded whitespace characters, such as \n, \r, and \t, from the body of the comment text.
  • Kept only the features required for the analysis in this project, producing trimmed objects like the example below (a sketch of this filtering step follows it).
{  
    "author":"Dethcola",
    "body":"A quarry",
    "created_utc":1506816000,
    "score":3,
    "subreddit":"sandiego",
    "subreddit_id":"t5_2qq2q"
}
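
A minimal sketch of this filtering step, assuming placeholder lists for the political subreddits and known moderators (the real lists are maintained by the project):

import json
import re

# Placeholder lists; the project's actual subreddit and moderator lists are larger.
POLITICAL_SUBREDDITS = set(['politics', 'PoliticalDiscussion'])
KNOWN_MODERATORS = set(['example_moderator'])
KEPT_FIELDS = ['author', 'body', 'created_utc', 'score', 'subreddit', 'subreddit_id']
HARDCODED_WHITESPACE = re.compile(r'[\n\r\t]+')

def clean_comment(line):
    # Returns a trimmed comment dict, or None if the comment should be dropped.
    comment = json.loads(line)
    author = comment['author']
    if author == '[deleted]' or author == 'AutoModerator' or 'bot' in author.lower():
        return None
    if author in KNOWN_MODERATORS:
        return None
    if comment['subreddit'] not in POLITICAL_SUBREDDITS:
        return None
    comment['body'] = HARDCODED_WHITESPACE.sub(' ', comment['body'])
    return dict((k, comment[k]) for k in KEPT_FIELDS)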

Partitioning

Comments were partitioned into date ranges around the election, broken up into varying sizes. Note: all UTC timestamps for date boundaries were calculated from the date at 0:00:00 in CT. For example, the range of October 26 to November 1 runs from October 26 at 0:00:00 to November 2 at 0:00:00, so we include all comments posted on November 1. This was initially done with Epoch Converter, but we later wrote our own time conversion functions.
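
A sketch of that conversion, assuming the pytz package for the Central Time zone (the project's own conversion functions may differ):

import calendar
from datetime import datetime

import pytz  # assumed dependency; handles the CST/CDT changeover that falls in this window

CENTRAL = pytz.timezone('America/Chicago')

def ct_midnight_to_utc_epoch(year, month, day):
    # Midnight Central Time on the given date, expressed as a UTC epoch timestamp.
    local = CENTRAL.localize(datetime(year, month, day))
    return calendar.timegm(local.astimezone(pytz.utc).timetuple())

# The partition October 26 - November 1 covers [Oct 26 00:00 CT, Nov 2 00:00 CT).
start = ct_midnight_to_utc_epoch(2016, 10, 26)
end = ct_midnight_to_utc_epoch(2016, 11, 2)

def in_partition(comment):
    return start <= comment['created_utc'] < end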

Sentiment Analysis

What is Sentiment Analysis?

Sentiment analysis, or opinion mining, is an active area of study in the field of natural language processing that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions via the computational treatment of subjectivity in text. It is not our intention to review the entire body of literature concerning sentiment analysis. Indeed, such an endeavor would not be possible within the limited space available (such treatments are available in Liu (2012) and Pang & Lee (2008)). We do provide a brief overview of canonical works and techniques relevant to our study.

What are Sentiment Lexicons?

A sentiment lexicon is a list of lexical features (e.g., words) which are generally labeled according to their semantic orientation as either positive or negative (Liu, 2010).
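
As a toy illustration (the words and values below are invented, not taken from any particular lexicon), a lexicon can be treated as a simple mapping from words to valence scores:

# Invented entries for illustration only; real lexicons contain thousands of scored words.
lexicon = {'great': 3.1, 'good': 1.9, 'bad': -2.5, 'terrible': -3.4}

def naive_score(sentence):
    # Sum the valence of every lexicon word that appears in the sentence.
    return sum(lexicon.get(word, 0.0) for word in sentence.lower().split())

print(naive_score('the debate was terrible'))  # -3.4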

Using NLTK's Vader for Sentiment Analysis

References for NLTK: NLTK page, NLTK Sentiment Github, Vader author Github

First, it is necessary to install nltk via pip. This can be done with the following command.

sudo pip install -U nltk

Note: The integration between nltk and Python 3 was pretty painful. We could not actually install nltk for Python 3, so we had to default to using Python 2. The first two statements below show how the vader lexicon was downloaded, and the third shows how the analyzer was imported in the Python file.

import nltk

nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

What is the compound score?
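
According to the VADER documentation, the compound score sums the valence scores of each lexicon word in the text, adjusts them according to VADER's rules (punctuation, capitalization, degree modifiers, negation), and normalizes the result to a range from -1 (most negative) to +1 (most positive). A minimal sketch of reading it, using a made-up example sentence:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# polarity_scores returns a dict with 'neg', 'neu', 'pos', and 'compound' keys.
scores = sia.polarity_scores('The debate was a complete disaster.')
print(scores['compound'])  # a single normalized value in [-1, 1]; negative for this sentence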

[NOT CHOSEN] TextBlob for Python

[NOT CHOSEN] Sentiment Analysis Tool for Python

Emotion Analysis

We tried to use ParallelDots as an emotion analysis tool, but it has a very strict API limit of 1,000 API hits per day.

Thus, we started using the tidytext package in R instead.

[TODO] Fill the rest of this out ...

Word Count

Based on the highest-frequency words across comments on the day of the election, we built a word cloud masked by the shape of the United States. Here is the Python package used to build the WordCloud.

Note: Similar to Vader, the integration between WordCloud and Python 3 was pretty painful, so we had to default to using Python 2.
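
A minimal sketch of how such a masked word cloud can be built with the wordcloud package (the file names here are placeholders, not the project's actual files):

import numpy as np
from PIL import Image
from wordcloud import WordCloud

# Placeholder file names; white pixels in the mask image are excluded from the cloud.
mask = np.array(Image.open('us_mask.png'))
text = open('election_day_comments.txt').read()

cloud = WordCloud(background_color='white', mask=mask, max_words=2000)
cloud.generate(text)
cloud.to_file('election_day_wordcloud.png')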

Technical Notes

NLTK's Vader (Valence Aware Dictionary for sEntiment Reasoning)

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

We use a combination of qualitative and quantitative methods to produce, and then empirically validate, a gold-standard sentiment lexicon that is especially attuned to microblog-like contexts.

We find that incorporating these heuristics improves the accuracy of the sentiment analysis engine across several domain contexts (social media text, NY Times editorials, movie reviews, and product reviews). Interestingly, the VADER lexicon performs exceptionally well in the social media domain. The correlation coefficient shows that VADER (r = 0.881) performs as well as individual human raters (r = 0.888) at matching ground truth (aggregated group mean from 20 human raters for sentiment intensity of each tweet).

METHODS

Our approach seeks to leverage the advantages of parsimonious rule-based modeling to construct a computational sentiment analysis engine that 1) works well on social media style text, yet readily generalizes to multiple domains, 2) requires no training data, but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon 3) is fast enough to be used online with streaming data, and 4) does not severely suffer from a speed-performance tradeoff.

MACHINE LEARNING

We use the Python-based machine learning algorithms from scikit-learn.org for the NB, Maximum Entropy (ME, which makes no conditional independence assumption between features, and thereby accounts for information entropy (feature weightings)), SVM-Classification (SVM-C) and SVM-Regression (SVM-R) models.

  • Bag of words, TF-IDF, and two important algorithms: NB and SVM

From the commit logs we can easily see that the first version of sentiment_analyzer was a Naive Bayes classifier with unigram features; one week later a maximum entropy classifier was added, and two days after that, support for bigram features was added.
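
For reference, a minimal sketch of the kind of scikit-learn pipeline described above, combining TF-IDF features with NB and SVM classifiers (the tiny training set is made up purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up labeled examples; real labels would come from an annotated sample of comments.
train_texts = ['great candidate, love the policies', 'terrible debate, awful night']
train_labels = ['pos', 'neg']

nb_clf = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                   ('nb', MultinomialNB())])
svm_clf = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                    ('svm', LinearSVC())])

nb_clf.fit(train_texts, train_labels)
svm_clf.fit(train_texts, train_labels)
print(nb_clf.predict(['what a great night']))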

Links

MACHINE LEARNING

N-Grams

The basic point of n-grams is that they capture the language structure from the statistical point of view, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.

They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). For example, for the sentence "The cow jumps over the moon", if N=2 (known as bigrams), then the bigrams would be:

  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon
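
A small sketch of this sliding-window computation (a generic helper, not code from this project):

def ngrams(sentence, n=2):
    # Slide a window of n words across the tokenized sentence, one word at a time.
    words = sentence.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams('The cow jumps over the moon'))
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']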

Links

Using Bigrams to Enhance Text Classification

Difference between Naive Bayes and Multinomial Naive Bayes

Text classification using NB

Stack Exchange Explanation