sanjayp/ckptzero
Group: CHECKPOINT0
Group members: Di Mu (dimu), Sanjay Paul, Hangfei Lin (hangfei)

I. Using the other three models

The other models are located in the unused_model folder. Each model is made up of two .m files: an init_model file and a prediction_model file. To use a model, run its init_model function first, and then run its prediction_model function. The init/prediction function pairs are:

    init_model               make_final_prediction
    init_model_SVD           prediction_SVD
    init_model_nb            prediction_nb
    init_model_add_feature   prediction_add_feature

II. How we train our models

We train all our models in the model_script.m file (in the unused_model subfolder). We implemented the following four methods (we also tried other methods, which are covered under Additional Information):

A. Logistic Regression
The model we chose for the purposes of completing the project requirements is trained as follows: take the 250000 data set (the default) and fit Logistic Regression from the liblinear package with model = train(Y, X, '-s 7 -c 0.06 -e 0.001 -q').

B. Naive Bayes
For the Naive Bayes model, we used Matlab's Naive Bayes implementation.

C. SVD
The feature space of the training data set is too large, so we first keep only the most frequent words by discarding those that appear fewer than 100 times. Then we perform a 'fast SVD' (fsvd, included in the folder) and keep the top 500 output components.

D. Additional Features
We add punctuation information from the review text as additional features (10 features).

III. Additional Information

The other models we tried are trained using different methods (covered in the final report). In detail:

0. Load the data.
1. Set up your training set first. Several flags and a switch-case help with this (to make debugging easier). The data_comb flag selects the training data and test data; the default is 25000 training examples and an empty test set.
2. You can add some additional features to your training set. These include review length, punctuation info, stemmer info, IDF info, and frequency info. The corresponding flags are as follows:
    review_length_flag = 0;
    add_punc_flag = 0;
    freq_flag = 0;
3. You can also eliminate some features to increase accuracy, or reduce the amount of data to speed up training. This is also controlled by flags:
    porterStemmer_flag = 0;
    stemmed_X_flag = 0;
    stopwords_process_flag = 0;
    stopwords_use_flag = 0;
    idf_flag = 0;
    SVDs_flag = 1;
4. We also include a rescaling process, but it does not work well.
    scale_flag = 0;
    standardization_flag = 0;
    normalization_flag = 0;
5. This part contains the control flags for the models. The models can be run separately.
    SVM_flag = 0; % too slow, some problem?
    SVM_liblinear_flag = 0;
    LG_flag = 0;
    NB_flag = 0;
    KNN_flag = 0; % to be implemented
    discriminant_flag = 0;
    kmeans_flag = 0;
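
The SVD step described in section II.C (drop rare words, then keep the leading components) can be sketched roughly as follows. This is a minimal illustration and not the code in model_script.m: it uses MATLAB's built-in svds in place of the bundled fsvd, runs on a synthetic count matrix, and uses a smaller number of components than the 500 mentioned above so that it finishes quickly. The names X, min_count, k, keep, X_freq and X_reduced are assumptions made for this example.

```matlab
% Synthetic stand-in for the word-count matrix (rows: reviews, columns: words).
X = round(5 * sprand(1000, 2000, 0.05));   % sparse matrix of small integer counts

min_count = 100;  % threshold from section II.C: drop words seen fewer than 100 times
k = 50;           % section II.C keeps 500 components; smaller here for the toy data

% 1. Keep only the columns (words) whose total count reaches the threshold.
word_counts = full(sum(X, 1));
keep = word_counts >= min_count;
X_freq = X(:, keep);

% 2. Truncated SVD of the pruned matrix: svds returns the k largest singular triplets.
k = min(k, min(size(X_freq)));
[U, S, V] = svds(X_freq, k);

% 3. Project every review onto the top-k right singular vectors.
X_reduced = X_freq * V;   % dense n-by-k feature matrix

% Test data would have to reuse the same column mask and the same V:
% X_test_reduced = X_test(:, keep) * V;
```

The reduced matrix X_reduced would then take the place of X when training a classifier. Reusing keep and V at prediction time matters, because recomputing either on the test set would produce features that do not line up with the ones the model was trained on.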
About
ML predictor for Yelp reviews.