Find the categories of questions posted to Yahoo! Answers.
A Kaggle-based text classification task for INFO 256, UC Berkeley School of Information.
Classify each question into one of the following categories:
- Business&Finance
- Computers&Internet
- Entertainment&Music
- Family&Relationships
- Education&Reference
- Health
- Science&Mathematics
Each document—a row in the data file representing a single question—is short.
The training data contains 2,698 questions, already labeled with one of the above categories. The test data contains 1,874 questions that are unlabeled.
The data were loaded into pandas DataFrames. We removed HTML-escaped
characters, such as <br>, using regular expressions.
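A minimal sketch of such a cleanup step, using only the standard library. The function name clean_question and the exact patterns are assumptions for illustration, not the code used in the project; the tag pattern only matches a `<` that is directly followed by a letter, so comparisons such as "3 < 5" are left alone.

```python
import html
import re

# Assumed patterns: HTML tags such as <br> or <br/>, and runs of whitespace.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_question(text: str) -> str:
    """Decode HTML entities, replace tags with a space, collapse whitespace."""
    text = html.unescape(text)      # "&lt;br&gt;" becomes "<br>", "&amp;" becomes "&"
    text = TAG_RE.sub(" ", text)    # "<br>" becomes a single space
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_question("How do I invest?&lt;br&gt;I have $500 &amp; no debt."))
```

With the questions in a DataFrame, the same function could be applied to a text column with Series.map, for example df["question"].map(clean_question), where the column name "question" is hypothetical.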
We started with logistic regression and multinomial naive Bayes models.
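The two baseline models can be sketched as scikit-learn pipelines. The TF-IDF features, the max_iter setting, and the toy questions below are assumptions for illustration; the source does not state which features or hyperparameters were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the labeled training questions (not the real data).
train_texts = [
    "How do I open a savings account?",
    "What is a good interest rate for a loan?",
    "How do I reinstall my printer driver?",
    "Why does my laptop keep losing wifi?",
]
train_labels = [
    "Business&Finance",
    "Business&Finance",
    "Computers&Internet",
    "Computers&Internet",
]

# Each pipeline turns raw text into TF-IDF vectors and then fits a classifier.
models = {
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
    "multinomial_nb": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(["Which bank has the lowest loan rate?"])[0]
    print(name, predictions[name])
```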
We then used a document similarity approach, using scikit-learn's
TfidfVectorizer and cosine_similarity function.
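One plausible reading of this approach is a nearest-neighbor classifier: each test question receives the label of the most similar training question. The sketch below makes that assumption, since the source does not say how similarities were turned into labels, and it uses toy data rather than the real questions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the labeled training questions (not the real data).
train_texts = [
    "How do I open a savings account?",
    "What is a good interest rate for a loan?",
    "How do I reinstall my printer driver?",
    "Why does my laptop keep losing wifi?",
]
train_labels = [
    "Business&Finance",
    "Business&Finance",
    "Computers&Internet",
    "Computers&Internet",
]
test_texts = [
    "My laptop will not connect to wifi",
    "What loan has the best interest rate?",
]

# Fit the vocabulary on the training questions only, then reuse it for the test set.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)
test_matrix = vectorizer.transform(test_texts)

# similarities[i, j] is the cosine similarity between test question i
# and training question j.
similarities = cosine_similarity(test_matrix, train_matrix)

# Assumed decision rule: copy the label of the single most similar training question.
nearest = similarities.argmax(axis=1)
predicted = [train_labels[j] for j in nearest]
print(predicted)
```

A variant of the same idea would vote over the k most similar training questions instead of taking only the top one.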
Finally, we experimented with support vector classifiers.
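A support vector classifier fits into the same pipeline shape. The sketch below assumes a linear kernel via LinearSVC with TF-IDF features and default settings; the source does not say which kernel or parameters were tried, and the questions are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the labeled training questions (not the real data).
train_texts = [
    "How do I open a savings account?",
    "What is a good interest rate for a loan?",
    "How do I reinstall my printer driver?",
    "Why does my laptop keep losing wifi?",
]
train_labels = [
    "Business&Finance",
    "Business&Finance",
    "Computers&Internet",
    "Computers&Internet",
]

# Assumed configuration: linear SVM on TF-IDF vectors, default regularization.
svm_model = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_model.fit(train_texts, train_labels)

svm_prediction = svm_model.predict(["Why does my printer keep losing wifi?"])[0]
print(svm_prediction)
```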
All models were evaluated using cross-validation.
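A minimal sketch of cross-validating one of these models with scikit-learn's cross_val_score. The number of folds (3), the accuracy metric, and the toy data are assumptions; the source does not state the settings that were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the labeled training questions (not the real data).
texts = [
    "How do I open a savings account?",
    "What is a good interest rate for a loan?",
    "Should I pay off my loan or invest?",
    "How do I reinstall my printer driver?",
    "Why does my laptop keep losing wifi?",
    "How do I speed up my laptop?",
]
labels = [
    "Business&Finance",
    "Business&Finance",
    "Business&Finance",
    "Computers&Internet",
    "Computers&Internet",
    "Computers&Internet",
]

# Keeping the vectorizer inside the pipeline means it is refit on the
# training portion of every fold, so held-out questions never leak into
# the vocabulary or the IDF weights.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# For classifiers, an integer cv uses stratified folds, so each fold
# keeps the class proportions of the full data.
scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
print(scores.mean())
```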