GitHub - sanjayp/ckptzero: ML predictor for Yelp reviews.

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
data		data
liblinear		liblinear
other_supporting_functions		other_supporting_functions
pre_processing		pre_processing
unused_model		unused_model
README		README
final_submission_readme		final_submission_readme
fsvd		fsvd
init_model.m		init_model.m
make_final_prediction.m		make_final_prediction.m
models.m		models.m
script_test.m		script_test.m
word_statistics		word_statistics

Repository files navigation

Group: CHECKPOINT0
Group Member: Di Mu(dimu), Sanjay Paul, Hangfei Lin(hangfei)
I. To use other three models.
To use other models:
They are located in the unused_model folder. Each model is made up of two .m files: one init_model file, the other one is prediciton_model file. To use the model, run the init_model function, and then run the predition_model function.

init_model	make_final_prediction
init_model_SVD	prediction_SVD
init_model_nb	prediction_nb
init_model_add_feature	prediction_add_feature

II. How we train our models
We train all our models in the model_script.m file(in unused_model subfolder). We implement the following four methods(we also tried other methods, which is included in the additional information):
A. Logistic Regression

The model we decided upon for the purposes of completing the project requirements is trained as follows: choose the 250000 data set(default) and use Logistic Regression (model = train(Y, X, '-s 7 -c 0.06 -e 0.001 -q')) from the liblinear package.

B. Naive Bayes 

For the Naive Bayes model, we used Matlab's Naive Bayes implementation.

C. SVD

The feature space of the training data set is too large, so we first pick the most frequent words by throwing away those that appear less than 100 times. Then we perform a 'fast SVD' (fsvd, included in the folder) and choose the top 500 of the output components.

D. Additional Features

For adding features, we add the punctuations of the review text as additional features(10 features).


III. Additional Information
Other models we tried are trained using different methods(included in final report).
In detail:
0. Load the data.

1. You need to setup your training set first. There are several flags and a switch case help you do this(to facilitate debug).
data_comb flag: you could set your train data and test data. default is 25000 train data, empty test data.

2. You could add some additional features to your training set. This including review length, punctuation info, stemmer info, IDF info, frequency info. The corresponding flags are as follows:
review_length_flag = 0;
add_punc_flag = 0;
freq_flag = 0;  


3. You could also eliminate some features to increase the accuracy or reduce the data amount to boost the training process. It's also control by flags:
porterStemmer_flag = 0;
stemmed_X_flag = 0;
stopwords_process_flag = 0;
stopwords_use_flag = 0;
idf_flag = 0;
SVDs_flag = 1;

4. We also include a rescaling process, but it doesn't work well. 
scale_flag = 0;
standardization_flag = 0;
normalization_flag = 0;

5. This part includes the control flag for the models. You could use several models separately.
SVM_flag = 0;   % too slow, some problem?
SVM_liblinear_flag = 0;
LG_flag = 0;
NB_flag = 0;
KNN_flag = 0;   % to be implemented
discriminant_flag = 0;
kmeans_flag = 0;

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

sanjayp/ckptzero

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages