Report for Part 2 of the project can be found in Part2/report/final_project_report.pdf.
Team Members:
Name | GitHub | UNI |
---|---|---|
Tejas Dharamsi | https://github.com/Dharamsitejas | td2520 |
Abhay S Pawar | https://github.com/abhayspawar | asp2197 |
Janak A Jain | https://github.com/janakajain | jaj2186 |
Vijayraghavan Balaji | https://github.com/vijaybalaji30 | vb2428 |
Steps to run the code:
- Clone or download the repository.
- Install the dependencies:
  pip3 install -r requirements.txt
- Move to the folder Part2/analysis.
Report for Part 1 of the project can be found in Part1/documents/report_part1.pdf.
Note: The main file containing the code for Part 1 is CF-Data.ipynb.
Part2:
- DatasetCreation_Benchmark_ContentBased.ipynb: contains the code for combining the datasets, the naïve baseline model, the item-item collaborative filtering model and the content-based model
- Hybrid.ipynb: contains the code for the hybrid model (LSH + content-based) and validates serendipity for books recommended by our best model, LSH
- LSH_Complete.ipynb: contains the code for the LSH model
- book_features.ipynb: contains the code for generating word2vec features for books
- feature_extraction_from_api.ipynb: contains the code to get book metadata from the Goodreads API using book ISBNs
- tree_based_ann.ipynb: contains the code for the tree-based ANN model
- Combine.csv: contains the combined BX and Amazon dataset
- book_features.csv: contains the data with features generated using word2vec
- ibsn_features_new_batch.pickle: contains the data with features extracted from the Goodreads API and enriched using word2vec
- figures: Contains plots generated by our code.
- raw-data: Contains the Book Crossing dataset; the Amazon book dataset can be downloaded from here.
Part1:
- analysis: Contains CF-Data.ipynb, the main Part 1 notebook, along with exploratory analysis.
- clean-data: Contains smaller subsets of the datasets.
- raw-data: Contains the Book Crossing raw datasets.
- documents: Contains the instructions and report.
- figures: Contains plots for visualisation.
License
Readme
requirements.txt
Image Courtesy: WellBuiltStyle.com
The project is part of the course on Personalization Theory and Applications by Prof. Brett Vintch. The aim of this project is to create a recommender system for books that is capable of offering customized recommendations to book readers based on the books they have already read.
"There is no friend as loyal as a book." - Ernest Hemingway
Thanks to Gutenberg and, now, the digital boom, we have access to a huge amount of collective intelligence, wisdom and stories. Indeed, humans perish, but their voices continue to resonate through human brains and minds long after they are gone - sometimes provoking us to think, making us part of revolutions, and sometimes confiding their secrets in us. Books have the ability to make us laugh, cry and think - think hard - and, most importantly, to change our lives in a way that perhaps nobody else can. In this sense, books are truly our loyal friends.
Can the importance of books as loyal friends ever be overstated? We think not. That is why we believe that creating just the 'right' recommendations for readers is a noble objective. Consider it a quieter (shh... no noise in this library! :)) Facebook, or a classier Tinder, for those who like to read and listen patiently.
We have implemented four different types of algorithms from scratch and compared them against a naïve baseline model. These four models are tree-based Approximate Nearest Neighbour (ANN), Locality-Sensitive Hashing (LSH), item-item collaborative filtering (CF) and a content-based model. We also created a hybrid model that combines LSH with the content-based model.
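The core idea behind LSH can be illustrated with a minimal random-hyperplane sketch (this is a generic illustration with made-up dimensions, not the implementation in LSH_Complete.ipynb):

```python
import random

def random_hyperplanes(dim, n_planes, seed=0):
    """Draw n_planes random hyperplanes in dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def lsh_signature(vector, planes):
    """One bit per hyperplane: which side of the plane the vector lies on.
    Vectors at small cosine distance tend to share most bits, so hashing
    signatures into buckets restricts neighbour search to likely candidates."""
    return tuple(int(sum(v * p for v, p in zip(vector, plane)) >= 0)
                 for plane in planes)

planes = random_hyperplanes(dim=4, n_planes=8)
sig_a = lsh_signature([1.0, 0.5, 0.0, 2.0], planes)
sig_b = lsh_signature([1.1, 0.4, 0.1, 2.1], planes)  # a near-duplicate vector
```

Rating vectors that land in the same signature bucket are then compared exactly, which is what makes LSH much faster to train than exhaustive item-item CF.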
We used five-fold cross-validation for all of our developed models, which helped us in selecting the best model for comparison against the benchmark.
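The five-fold split can be sketched as follows (a generic illustration of k-fold index generation, not the notebooks' exact split logic):

```python
def kfold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k folds.
    Every index appears in exactly one test fold."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each model is trained k times, once per held-out fold, and the reported MAE/RMSE values are averages over the folds.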
We have evaluated each of the developed models on the following metrics:
- Training time
- RMSE
- MAE
- Coverage
- Novelty
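The error and coverage metrics can be sketched with a few helper functions (illustrative definitions, not the project's exact evaluation code):

```python
import math

def mae(actual, predicted):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; penalises large errors more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def coverage(recommended, catalog):
    """Fraction of the catalogue that the model ever recommends."""
    return len(set(recommended) & set(catalog)) / len(catalog)
```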
Results:
Model Name | Training Time (hours) | Best K | Average Test MAE | Average Test RMSE | Coverage |
---|---|---|---|---|---|
Naïve | N/A | N/A | 0.763 | 0.944 | N/A |
Item-item CF | 4.1 | 15 | 0.553 | 0.759 | 76.0% |
Tree based ANN | 1.927 | 20 | 0.55 | 0.76 | |
LSH | 1.29 | 15 | 0.573 | 0.796 | 65.6% |
Content-based | 0.6 (approx.) | 25 | 0.593 | 0.8031 | 31.55% |
Hybrid (LSH + Content) | 1.89 | 15 | 0.5834 | 0.799 | 46.54% |
After developing the hybrid model from scratch, our next step was to evaluate it for different values of its hyper-parameter: the distribution of weights across the two underlying models. The table below summarises the MAE and RMSE of the hybrid model for various combinations of these weights.
W_LSH | W_Content | MAE | RMSE |
---|---|---|---|
0.9 | 0.1 | 0.587 | 0.813 |
0.8 | 0.2 | 0.585 | 0.806 |
0.7 | 0.3 | 0.583 | 0.799 |
0.6 | 0.4 | 0.583 | 0.796 |
0.5 | 0.5 | 0.583 | 0.792 |
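The hybrid prediction is a weighted blend of the two models' predicted ratings; a minimal sketch (the function and argument names here are ours, not the notebook's):

```python
def hybrid_predict(pred_lsh, pred_content, w_lsh=0.7):
    """Blend the LSH and content-based rating predictions with
    complementary weights; w_lsh = 0.7 corresponds to a 7:3 ratio."""
    return w_lsh * pred_lsh + (1 - w_lsh) * pred_content
```

For example, with w_lsh = 0.7, an LSH prediction of 4.0 and a content-based prediction of 3.0 blend to 3.7.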
We selected the hybrid model with a W_LSH to W_Content weight ratio of 7:3 to strike the right balance between coverage and serendipity. However, even at this setting, the model's coverage was significantly lower than that of the LSH model we implemented from scratch. We therefore recommend using the LSH model for making recommendations.
Our best model is LSH, which achieves MAE and RMSE comparable to the traditional item-based CF model while training in about a third of the time. Another evaluation criterion is the serendipity, or novelty, of the recommendations.
An example recommendation is shown in Figures 8 and 9 of the report. An interesting recommendation observable in Figure 9 is "Don Quixote". It belongs to a genre not currently present in the user's repertoire of genres. What's more, Don Quixote is considered one of the most influential works of the Spanish Golden Age.
Upon closer inspection, we find that Don Quixote contains several thematic plots and stylistic elements that are very similar to other books the user has read. Such a serendipitous result is therefore also likely to be enjoyed by the user, given the higher chance of similarity in stylistic and thematic patterns.
In the future, we would like to extend this study to convert our code into a Python package. We invite members of the larger academic community to contribute to this project.
We have implemented two different types of algorithms from scratch and compared them with competitive models available in other packages. These two algorithms are item-item collaborative filtering and Non-negative Matrix Factorization (NMF).
We implemented our models using two approaches:
- Collaborative filtering based (Approach 1)
- Non-negative Matrix Factorization (NMF)-based (Approach 2)
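The NMF idea can be sketched with the textbook multiplicative-update scheme on a toy matrix (a generic illustration, not our Part 1 implementation, which also has to mask missing ratings):

```python
import numpy as np

def nmf(R, rank=2, iters=300, seed=0):
    """Factor a non-negative matrix R into W @ H (both non-negative)
    using Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    eps = 1e-9
    for _ in range(iters):
        H *= (W.T @ R) / (W.T @ W @ H + eps)  # update item factors
        W *= (R @ H.T) / (W @ H @ H.T + eps)  # update user factors
    return W, H

# Toy user x item ratings matrix; W @ H then gives dense rating estimates.
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
W, H = nmf(R)
```

The learned factors act as latent "taste" dimensions: a user's predicted rating for any book is the dot product of the user's row of W with the book's column of H.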
We used cross-validation for all of our developed models, which helped us in selecting the best model for comparison against the benchmark.
For both approaches, we implemented two separate models for this study: one developed from scratch, and one built using Surprise.
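The from-scratch item-item prediction step can be sketched as follows (an illustrative toy version with our own function names, not the notebook's code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two item rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_rating(user_ratings, item_vectors, target_item, k=2):
    """Predict a user's rating for target_item as a similarity-weighted
    average over the k most similar items the user has already rated."""
    sims = sorted(
        ((cosine(item_vectors[item], item_vectors[target_item]), rating)
         for item, rating in user_ratings.items()),
        reverse=True)[:k]
    denom = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / denom if denom else 0.0
```

Swapping `cosine` for Euclidean distance or Pearson correlation gives the other similarity variants compared in the results below.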
Results:
- For Approach 1, our model performed significantly better than the Surprise model on average MAE.
- For Approach 2, our model did not fare as well as the Surprise model.
For Approach 1, the results are described below for each of the norms, viz. Euclidean distance, cosine distance and Pearson correlation coefficient, followed by the results for Approach 2 (NMF):
Euclidean distance:
Model Name | Average RMSE | Average MAE |
---|---|---|
Our model | 1.54 | 0.96 |
Surprise | 1.58 | 1.13 |
Cosine distance:
Model Name | Average RMSE | Average MAE |
---|---|---|
Our model | 1.57 | 1.06 |
Surprise | 1.64 | 1.22 |
Pearson correlation coefficient:
Model Name | Average RMSE | Average MAE |
---|---|---|
Our model | 1.53 | 1.01 |
Surprise | 1.61 | 1.20 |
Approach 2 (NMF):
Model Name | Average RMSE | Average MAE |
---|---|---|
Our model | 2.97 | 2.00 |
Surprise | 1.53 | 0.98 |
We look forward to your feedback and comments on this project. Our email IDs follow the rule {UNI}@columbia.edu, where UNI is the identifier listed in the team table above (e.g. td2520@columbia.edu).