E4571 Personalisation Theory Class Project -- Fall 2017


Report for Part 2 of the project can be found in Part2/report/final_project_report.pdf.

Team Members:

| Name | GitHub | UNI |
| --- | --- | --- |
| Tejas Dharamsi | https://github.com/Dharamsitejas | td2520 |
| Abhay S Pawar | https://github.com/abhayspawar | asp2197 |
| Janak A Jain | https://github.com/janakajain | jaj2186 |
| Vijayraghavan Balaji | https://github.com/vijaybalaji30 | vb2428 |


Steps to run the code
  • Clone or download the repository.
  • Install the dependencies: pip3 install -r requirements.txt
  • Move to the folder Part2/analysis.

Report for Part 1 of the project can be found in Part1/documents/report_part1.pdf

Note: The main file containing the code for Part 1 is CF-Data.ipynb

File Structure

Top

  • Part2

    • analysis

      • DatasetCreation_Benchmark_ContentBased.ipynb: contains the code for combining the datasets, the naïve baseline model, the item-item collaborative filtering model and the content-based model
      • Hybrid.ipynb: contains the code for the Hybrid model (LSH model + content-based model) and validates serendipity for books recommended by our best model, LSH
      • LSH_Complete.ipynb: contains the code for the LSH model
      • book_features.ipynb: contains the code for generating word2vec features for books
      • feature_extraction_from_api.ipynb: contains the code to get book metadata from the Goodreads API using the book ISBN
      • tree_based_ann.ipynb: contains the code for the Tree-based ANN model
    • created_datasets

      • Combine.csv: contains the combined BX (Book-Crossing) and Amazon datasets
      • book_features.csv: contains the data with features generated using word2vec
      • ibsn_features_new_batch.pickle: contains the data with features extracted from the Goodreads API and enriched using word2vec
    • figures: contains the plots generated by our code.

    • raw-data: contains the Book-Crossing dataset; the Amazon book dataset can be downloaded from here

    • Final_Project_Outline.pdf

  • Part1

    • analysis: CF-Data.ipynb, the main Part 1 file, along with exploratory work.
    • clean-data: contains smaller subsets of the datasets
    • raw-data: contains the raw Book-Crossing datasets.
    • documents: instructions and report
    • figures: contains plots for visualisation
  • License

  • Readme

  • requirements.txt

About the Project

Book Shelf Image
Image Courtesy: WellBuiltStyle.com

The project is part of the course on Personalization Theory and Applications by Prof. Brett Vintch. The aim of this project is to create a recommender system for books that is capable of offering customized recommendations to book readers based on the books they have already read.

Motivation

There is no friend as loyal as a book - Ernest Hemingway

Thanks to Gutenberg and, more recently, the digital boom, we have access to a huge amount of collective intelligence, wisdom and stories. Indeed, humans perish but their voices continue to resonate through human brains and minds long after they are gone - sometimes provoking us to think, sometimes making us part of revolutions and sometimes confiding their secrets in us. Books have the ability to make us laugh, cry, think - think hard - and, most importantly, change our lives in a way that perhaps nobody else can. In this sense, books are truly our loyal friends.

Can the importance of books as loyal friends ever be overestimated? We think not. That is why we think that creating just the 'right' recommendations for readers is a noble objective. Consider it a quieter (Shh... no noise in this library! :)) Facebook or a classier Tinder for those who like to read and listen, patiently.


Part II - Summary of findings

We have implemented four different types of algorithms from scratch and have compared them with a naïve model. These four models are Tree-based Approximate Nearest Neighbor (ANN), Locality Sensitive Hashing (LSH), item-item collaborative filtering (CF) and a content-based model. We also created a hybrid model that combines the LSH and content-based models.

We used five-fold cross-validation for all of our developed models, which helped us in selecting the best model for comparison against the benchmark.
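A five-fold split of a ratings list can be sketched as follows. This is only an illustration and not the code from our notebooks: the `k_fold_splits` helper, the fixed seed and the toy `(user, book, rating)` triples are all assumptions made for the example.

```python
import random

def k_fold_splits(ratings, k=5, seed=0):
    """Yield (train, test) pairs; each rating lands in exactly one test fold."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled ratings into k folds, round-robin.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Toy (user, book, rating) triples, purely hypothetical.
ratings = [(u, b, (u + b) % 5 + 1) for u in range(4) for b in range(5)]
splits = list(k_fold_splits(ratings))

for fold_id, (train, test) in enumerate(splits):
    print(f"fold {fold_id}: {len(train)} train ratings, {len(test)} test ratings")
```

Each model would then be fitted on `train` and scored on `test`, with the per-fold scores averaged.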

We have evaluated each of the developed models on the following evaluation metrics:

  • Training time
  • RMSE
  • MAE
  • Coverage
  • Novelty
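The error metrics in the list above have standard definitions, sketched below. The `coverage` function assumes catalog coverage (the share of books that appear in at least one recommendation list); this README does not spell out the definition we used, so treat it as one plausible reading. Novelty is left out because no formula is given here.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def coverage(recommended_items, catalog):
    """Assumed definition: share of catalog items recommended at least once."""
    return len(set(recommended_items) & set(catalog)) / len(catalog)

# Hypothetical ratings and predictions.
actual = [4, 3, 5, 1]
predicted = [3.5, 3, 4, 2]
print(rmse(actual, predicted), mae(actual, predicted))
print(coverage(["a", "b", "a"], ["a", "b", "c", "d"]))
```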

Results:

Comparison of several models on various comparison metrics

| Model Name | Training Time (hours) | Best K | Average Test MAE | Average Test RMSE | Coverage |
| --- | --- | --- | --- | --- | --- |
| Naïve | N/A | N/A | 0.763 | 0.944 | N/A |
| Item-item CF | 4.1 | 15 | 0.553 | 0.759 | 76.0% |
| Tree based ANN | 1.927 | 20 | 0.55 | 0.76 | |
| LSH | 1.29 | 15 | 0.573 | 0.796 | 65.6% |
| Content-based | 0.6 (approx.) | 25 | 0.593 | 0.8031 | 31.55% |
| Hybrid (LSH + Content) | 1.89 | 15 | 0.5834 | 0.799 | 46.54% |
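Assuming that "Best K" is the number of nearest neighbours used when predicting a rating (this README does not define it), a neighbourhood-based prediction for item-item CF can be sketched as below. The `predict_rating` helper and the precomputed `similarities` dictionary are hypothetical names used only for illustration.

```python
def predict_rating(user_ratings, similarities, target, k=15):
    """Similarity-weighted average over the k most similar items the user rated.

    user_ratings: {item: rating} for one user.
    similarities: {item: {other_item: similarity}}, assumed to be precomputed.
    """
    sims = similarities.get(target, {})
    scored = [(sims.get(item, 0.0), rating)
              for item, rating in user_ratings.items() if item != target]
    # Keep only positively similar items, most similar first, and cut to k.
    neighbours = sorted((s for s in scored if s[0] > 0.0), reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbours)
    if weight == 0.0:
        return None  # no usable neighbours, so no prediction can be made
    return sum(sim * rating for sim, rating in neighbours) / weight

# Hypothetical data: one user and one target book "t".
user_ratings = {"a": 4.0, "b": 2.0, "c": 5.0}
similarities = {"t": {"a": 0.5, "b": 0.5, "c": 0.1}}
print(predict_rating(user_ratings, similarities, "t", k=2))
```

Items for which the function returns `None` are one possible source of coverage below 100%.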

Tweaking the Hybrid model

After developing the Hybrid model from scratch, the next step for us was to evaluate it at different values of its hyper-parameter: the distribution of weights on the two underlying models. Given below is a summary of the MAE and RMSE metrics for the Hybrid model for various combinations of these weights.

Performance of the Hybrid model for various weight combinations of the underlying models

| W_LSH | W_Content | MAE | RMSE |
| --- | --- | --- | --- |
| 0.9 | 0.1 | 0.587 | 0.813 |
| 0.8 | 0.2 | 0.585 | 0.806 |
| 0.7 | 0.3 | 0.583 | 0.799 |
| 0.6 | 0.4 | 0.583 | 0.796 |
| 0.5 | 0.5 | 0.583 | 0.792 |
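A weight sweep of this kind can be sketched as follows, assuming the Hybrid prediction is a linear blend of the two underlying predictions with weights that sum to 1. The `blend` helper and the toy predictions are hypothetical and do not come from Hybrid.ipynb.

```python
def blend(lsh_preds, content_preds, w_lsh):
    """Linear blend of two prediction lists; w_content is taken as 1 - w_lsh."""
    w_content = 1.0 - w_lsh
    return [w_lsh * l + w_content * c for l, c in zip(lsh_preds, content_preds)]

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy ratings and per-model predictions, purely illustrative.
actual = [4.0, 2.0, 5.0]
lsh_preds = [3.0, 2.0, 4.0]
content_preds = [5.0, 3.0, 5.0]

for w_lsh in (0.9, 0.8, 0.7, 0.6, 0.5):
    score = mae(actual, blend(lsh_preds, content_preds, w_lsh))
    print(f"W_LSH={w_lsh:.1f} W_Content={1 - w_lsh:.1f} MAE={score:.3f}")
```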

Interpretations

We selected the Hybrid model with a W_LSH to W_Content weight ratio of 7:3 to obtain a good blend of coverage and serendipity. However, we observed that even at this ratio, the coverage of the Hybrid model (46.54%) was significantly lower than that of the LSH model (65.6%) that we implemented from scratch. Hence, we recommend the LSH model for making recommendations.
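For readers unfamiliar with LSH, the bucketing idea can be sketched with random hyperplanes, one common LSH family for cosine similarity. This README does not say which hash family LSH_Complete.ipynb uses, so the choice of family, the helper names and the toy vectors below are all assumptions.

```python
import random
from collections import defaultdict

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw n_planes random Gaussian normal vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(sum(v * h for v, h in zip(vector, plane)) >= 0.0)
                 for plane in hyperplanes)

def build_buckets(item_vectors, hyperplanes):
    """Group items by signature; only bucket-mates are compared exactly later."""
    buckets = defaultdict(list)
    for item, vec in item_vectors.items():
        buckets[signature(vec, hyperplanes)].append(item)
    return buckets

# Hypothetical mean-centred rating vectors (one entry per user).
items = {
    "book_a": [5.0, 4.0, 0.0, 1.0],
    "book_b": [10.0, 8.0, 0.0, 2.0],    # same direction as book_a
    "book_c": [-5.0, -4.0, 0.0, -1.0],  # opposite direction to book_a
}
planes = make_hyperplanes(n_planes=8, dim=4)
buckets = build_buckets(items, planes)
print(dict(buckets))
```

Because only items sharing a bucket are compared, far fewer similarity computations are needed than in exhaustive item-item CF, which is in line with the shorter training time reported above.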

A special note on Serendipity of the best model

Our best model is LSH, which has MAE and RMSE values comparable to those of the traditional item-based CF model. Moreover, LSH trains in about a third of the time taken to train the item-based CF model (1.29 hours versus 4.1 hours). Another evaluation metric is the serendipity, or novelty, of recommendations.

An example recommendation is shown in Figures 8 and 9 in the report. An interesting recommendation that can be observed in Figure 9 is "Don Quixote". It belongs to a genre that is not currently present in the user's repertoire of genres. What's more, Don Quixote is considered one of the most influential works from the Spanish Golden Age.

Upon closer observation, we find that Don Quixote contains several thematic plots and stylistic elements that are very similar to those of other books the user has read. Such a serendipitous result is therefore also likely to be liked by the user, given the similarity in stylistic and thematic patterns.

Future Scope of Work

In the future, we would like to extend this study to convert our code into a Python package. We invite members of the larger academic community to contribute to this project.


Part I - Summary of findings

We have implemented two different types of algorithms from scratch and have compared them with competitive models available from other packages. These two algorithms are Item-Item Collaborative Filtering and Non-negative Matrix Factorization (NMF).

We implemented our models using two approaches:

  • Collaborative filtering based (Approach 1)
  • Non-negative Matrix Factorization (NMF) based (Approach 2)

We used cross-validation for all of our developed models, which helped us in selecting the best model for comparison against the benchmark.

For both these approaches, we implemented two separate models for this study - one model was developed from scratch, while one was developed using Surprise.

Results:

  • For Approach 1, our model performed better than the Surprise model by a significant margin on average MAE.
  • For Approach 2, our model did not fare well against the Surprise model.

For each approach, the results are described below for each of the similarity measures, viz. Euclidean distance, cosine similarity and Pearson correlation coefficient:

Approach 1: Item-Item Collaborative filtering based

Euclidean distance

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.54 | 0.96 |
| Surprise | 1.58 | 1.13 |

Cosine similarity

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.57 | 1.06 |
| Surprise | 1.64 | 1.22 |

Pearson correlation coefficient

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 1.53 | 1.01 |
| Surprise | 1.61 | 1.20 |
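The three measures compared for Approach 1 have standard textbook definitions, sketched below in plain Python. The function names are ours for illustration, and the sketch ignores missing ratings, which a real item-item implementation would have to handle (for example by restricting each pair of items to their co-rating users).

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length rating vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def pearson_correlation(x, y):
    """Pearson correlation, computed as cosine similarity of mean-centred vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])

# Hypothetical rating vectors for two books over the same three users.
book_1 = [1.0, 2.0, 3.0]
book_2 = [3.0, 5.0, 7.0]
print(euclidean_distance(book_1, book_2))
print(cosine_similarity(book_1, book_2))
print(pearson_correlation(book_1, book_2))
```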

Approach 2: Non-negative Matrix Factorization

NMF

| Model Name | Average RMSE | Average MAE |
| --- | --- | --- |
| Our model | 2.97 | 2 |
| Surprise | 1.53 | 0.98 |
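As a reminder of what NMF computes, here is a minimal sketch using the classic multiplicative update rules on a small, fully observed matrix. This is not our from-scratch model and not Surprise's algorithm: a ratings matrix is sparse and would need updates restricted to observed entries, and the `nmf` helper, its defaults and the toy matrix are assumptions for illustration.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

def frobenius_error(v, w, h):
    """Frobenius norm of v - w h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb)) ** 0.5

# Toy rank-1 matrix (3 users x 2 books), fully observed.
v = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
w, h = nmf(v, rank=1)
print(frobenius_error(v, w, h))
```

The multiplicative form keeps every entry of `w` and `h` non-negative as long as the initial values are positive, which is why no explicit clipping is needed.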

Feedback

We look forward to your feedback and comments on this project. Our email addresses are formed from our UNIs (e.g. 'td2520') and follow this rule: {UNI}@columbia.edu.