% tex/merity_cook.tex
\section{Data}
The Netflix Prize was a large-scale recommendation competition held by Netflix.
Its aim was to improve the recommendations provided to its users by allowing third-party researchers to analyze its data.
At the time, the Netflix dataset was the largest real-world dataset available to researchers.
Collected over 7 years, it contained over 100 million ratings for 17,770 movies provided by over 480,000 users.
To compete, participants would send predicted ratings for a specific test set to Netflix.
Netflix would then return the root mean squared error (RMSE) for a portion of this test set.
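For concreteness, with predicted ratings $\hat{r}_i$, true ratings $r_i$, and $N$ ratings in the scored portion of the test set, the returned score is
\[
\textrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{r}_i - r_i\right)^2}.
\]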
By reporting RMSE on only a portion of the test set, Netflix prevented teams from winning by overfitting: a model tuned to the public portion would score substantially worse on the hidden portion.
After the competition concluded, this dataset was released publicly for continued research.
A full description of the rules and dataset can be found at the Netflix Prize website.
%Here, we talk about the Netflix dataset. How we scrubbed it, what it consists of, etc.
The Netflix dataset consists of 17,770 text files.
Each text file represents a distinct movie.
The first line in the text file is the movie's unique ID number, which is an integer from 1 to 17,770.
All other lines have three comma-delimited entries: user ID, rating, and date.
There are 480,189 unique users in the dataset, with their IDs ranging from 1 to 2,649,429, with gaps.
In order to be able to perform SVD, we need a matrix with users on the rows and movies on the columns.
This matrix would contain $480,189\times17,770 \approx 8.5\textrm{ billion}$ entries.
In a regular dense matrix format, this would be too big to hold in memory.
One estimate is that it takes roughly 65 GB of RAM to hold the entire matrix \citep{revoR}, although the actual size would depend on the amount of space allocated for each rating.
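As a rough sanity check, if each entry were stored as an 8-byte double, the full matrix would occupy $8.5\times10^{9}\times 8\textrm{ bytes}\approx 63\textrm{ GiB}$, in line with this estimate.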
Fortunately, the matrix is extremely sparse, containing around 100 million non-zero entries.
To store the data in our project, we use SciPy's \verb!scipy.sparse.lil_matrix! which constructs sparse matrices using row-based linked lists.
We store data from the text files in this sparse matrix as we read them.
After reading in all of the text files, we output the matrix to a Matrix Market format.
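The loading pipeline described above can be sketched as follows. This is an illustrative reconstruction rather than the project's actual code: the helper name \verb!load_movie_file! is hypothetical, and for simplicity it indexes rows by raw user ID instead of remapping the gappy IDs down to the 480,189 active users.

```python
# Illustrative sketch (not the paper's code): load per-movie rating files
# into a SciPy LIL sparse matrix, then dump it in Matrix Market format.
import scipy.io
import scipy.sparse

# Raw user IDs run up to 2,649,429 (with gaps); rows are indexed by raw ID
# here for simplicity rather than remapped to the 480,189 active users.
ratings = scipy.sparse.lil_matrix((2_649_429, 17_770), dtype='int8')

def load_movie_file(path):
    """First line: movie ID (1..17,770); remaining lines: user,rating,date."""
    with open(path) as f:
        movie_id = int(f.readline().strip().rstrip(':'))
        for line in f:
            user_id, rating, _date = line.strip().split(',')
            ratings[int(user_id) - 1, movie_id - 1] = int(rating)

# After loading all 17,770 files, write the Matrix Market file:
# scipy.io.mmwrite('netflix.mtx', ratings)
```

Incremental writes are cheap in the row-based LIL format; for numerical work such as the SVD itself, the matrix would typically be converted to a compressed format first.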
The Matrix Market format starts with a line containing the dimensions of the matrix and the number of non-zero entries.
Then, each line contains $i \enskip j \enskip \textrm{rating}$.
For example, these are the first few lines of a Matrix Market file with a subset of the Netflix data: