Commit 3ab4225

Updating data section of report
1 parent 8e7b00f commit 3ab4225

3 files changed (+29 -19 lines changed)

incremental_svd2.pyc

143 Bytes
Binary file not shown.

svd_reconstruct.py

+13-14
@@ -5,7 +5,6 @@
 from prettyplotlib import plt
 #import matplotlib.pyplot as plt
 from scipy.io import mmread
-from sklearn.metrics import mean_squared_error
 ##
 from incremental_svd2 import incremental_SVD
 
@@ -21,7 +20,7 @@ def check_orthogonality(A):
 
 if __name__ == '__main__':
     train = np.matrix(mmread('subset_train.mtx').todense())
-    train = train[0:2000, 0:100]
+    train = train[0:3000, 0:1000]
     print 'Using matrix of size {}'.format(train.shape)
 
     print 'Testing SVD'
@@ -35,7 +34,8 @@ def check_orthogonality(A):
     for k in xrange(1, 100):
         low_s = [s[i] for i in xrange(k)] + (min(u.shape[0], vT.shape[1]) - k) * [0]
         reconstruct = u.dot(scipy.linalg.diagsvd(low_s, u.shape[0], vT.shape[1]).dot(vT))
-        err = np.sqrt(mean_squared_error(train, reconstruct))
+        #err = np.sqrt(mean_squared_error(train, reconstruct))
+        err = np.linalg.norm(train - reconstruct, 'fro')
         print 'Exact SVD with low-rank approximation {}'.format(k)
         #print err
         #print
@@ -52,14 +52,13 @@ def check_orthogonality(A):
         print '... with block size of {}'.format(num)
         X, Y = [], []
         incr_orthoY = []
-        for k in xrange(1, 101, 1):
-            if k % 25 == 0:
-                print ' ... up to k={}'.format(k)
-            u, s, vT = incremental_SVD(train, k, num)
-            reconstruct = u.dot(s.dot(vT))
-            X.append(k)
-            Y.append(np.sqrt(mean_squared_error(train, reconstruct)))
-            incr_orthoY.append(check_orthogonality(u))
+        uL, sL, vTL = incremental_SVD(train, range(1, 101), num)
+        for i in xrange(len(uL)):
+            reconstruct = uL[i].dot(sL[i].dot(vTL[i]))
+            err = np.linalg.norm(train - reconstruct, 'fro')
+            X.append(i + 1)
+            Y.append(err)
+            incr_orthoY.append(check_orthogonality(uL[i]))
         incr_ortho.append(['iSVD u={}'.format(num), X, incr_orthoY])
         plt.plot(X, Y, label='iSVD u={}'.format(num))
     """
@@ -74,18 +73,18 @@ def check_orthogonality(A):
     ##
     plt.title('SVD reconstruction error on {}x{} matrix'.format(*train.shape))
     plt.xlabel('Low rank approximation (k)')
-    plt.ylabel('Root Mean Squared Error')
+    plt.ylabel('Frobenius norm')
     plt.ylim(0, max(svdY))
     plt.legend(loc='best')
-    plt.savefig('reconstruct_error_{}x{}.pdf'.format(*train.shape))
+    plt.savefig('reconstruct_fro_{}x{}.pdf'.format(*train.shape))
     plt.show(block=True)
     ##
     plt.plot(svdX, svdY, label="SVD", color='black', linewidth='2', linestyle='--')
     for label, X, Y in incr_ortho:
         plt.plot(X, Y, label=label)
     plt.title('SVD orthogonality error on {}x{} matrix'.format(*train.shape))
     plt.xlabel('Low rank approximation (k)')
-    plt.ylabel('Orthogonality error')
+    plt.ylabel('Deviation from orthogonality')
     plt.semilogy()
     #plt.ylim(0, max(orthoY))
     plt.legend(loc='best')
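
Note on the metric change in svd_reconstruct.py: replacing RMSE with the Frobenius norm rescales the y-axis but should not change the shape of the curves, because for an m x n matrix the Frobenius norm of the residual equals sqrt(m*n) times the RMSE. For an exact rank-k truncated SVD, the Frobenius error also equals the square root of the sum of the squared discarded singular values. The following is a minimal sketch of both relations, written for Python 3 with NumPy (the commit itself is Python 2). The helper names and the random stand-in matrix are assumptions for illustration and are not part of the repository.

```python
import numpy as np


def frobenius_error(a, b):
    """Frobenius norm of the residual a - b."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b), 'fro')


def rmse(a, b):
    """Root mean squared error over all entries of a - b."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.sqrt(np.mean(diff ** 2))


def truncated_svd_reconstruct(a, k):
    """Rank-k reconstruction from the exact SVD; also returns all singular values."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Scale the first k left singular vectors by their singular values.
    return (u[:, :k] * s[:k]) @ vt[:k, :], s


# Random stand-in for the Netflix subset (hypothetical data).
rng = np.random.default_rng(0)
train = rng.random((30, 20))
k = 5

reconstruct, s = truncated_svd_reconstruct(train, k)
fro = frobenius_error(train, reconstruct)

# Frobenius norm is RMSE scaled by sqrt(number of entries).
print(np.isclose(fro, np.sqrt(train.size) * rmse(train, reconstruct)))
# Frobenius error of the rank-k truncation comes from the discarded singular values.
print(np.isclose(fro, np.sqrt(np.sum(s[k:] ** 2))))
```

Because the scale factor depends on the matrix size, Frobenius errors from the old 2000x100 slice and the new 3000x1000 slice are not directly comparable without dividing by sqrt(m*n).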

tex/merity_cook.tex

+16-5
@@ -69,11 +69,21 @@ \section{Introduction}
 
 \section{Data}
 
+The Netflix Prize was a large-scale recommendation competition held by Netflix.
+Their aim was to improve the recommendations they provided for their users by allowing third party researchers to analyze their data.
+At the time, the Netflix dataset was the largest real world dataset available to researchers.
+Collected over 7 years, it contained over 100 million ratings for 17,770 movies provided by over 480,000 users.
+To compete, participants would send predicted ratings for a specific test set to Netflix.
+Netflix would then return the root mean squared error (RMSE) for a portion of this test set.
+Because Netflix reported RMSE on only a portion of the test set, teams could not win the competition by overfitting to it, as their accuracy on the hidden portion would fall substantially.
+After the competition concluded, this dataset was released publicly for continued research.
+A full description of the rules and dataset can be found at the Netflix Prize website.
+
+
 %Here, we talk about the Netflix dataset. How we scrubbed it, what it consists of, etc.
-The core of the Netflix dataset consists of 17,770 text files.
+The Netflix dataset consists of 17,770 text files.
 Each text file represents a distinct movie.
-The first line in the text file is the movie's unique ID number, which is
-an integer from 1 to 17,770.
+The first line in the text file is the movie's unique ID number, which is an integer from 1 to 17,770.
 All other lines have three comma-delimited entries: user ID, rating, and date.
 
 There are 480,189 unique users in the dataset, with their IDs ranging from 1 to 2,649,429, with gaps.
@@ -94,13 +104,14 @@ \section{Data}
 
 In order to be able to perform SVD, we need a matrix with users on the rows and movies on the columns.
 This matrix would be $480,189 \times 17,770 = 8.5 \textrm{ billion}$ entries.
-In a regular matrix format, this would too big to hold in memory. One estimate is that it takes roughly 65 GB of RAM to hold the entire matrix \citep{revoR} although the actual size would depend on the amount of space allocated for each rating.
+In a regular matrix format, this would be too big to hold in memory.
+One estimate is that it takes roughly 65 GB of RAM to hold the entire matrix \citep{revoR} although the actual size would depend on the amount of space allocated for each rating.
 Fortunately, the matrix is extremely sparse, containing around 100 million non-zero entries.
 To store the data in our project, we use SciPy's \verb!scipy.sparse.lil_matrix! which constructs sparse matrices using row-based linked lists.
 We store data from the text files in this sparse matrix as we read them.
 After reading in all of the text files, we output the matrix to a Matrix Market format.
 The Matrix Market format starts with a line containing the dimensions of the matrix and the number of non-zero entries.
-Then, each line contains $i \hspace{2ex} j \hspace{2ex} <\textrm{value}>$.
+Then, each line contains $i \enskip j \enskip \textrm{rating}$.
 For example, these are the first few lines of a Matrix Market file with a subset of the Netflix data:
 
 \begin{verbatim}
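
Note on the storage pipeline described in tex/merity_cook.tex: the text fills a `scipy.sparse.lil_matrix` incrementally and then writes it out in Matrix Market format. The following is a minimal sketch of that flow in Python 3, assuming SciPy's `mmwrite` and `mmread`. The toy triples, matrix size, and file name are made up for illustration and are not the project's actual loader. Files written by SciPy begin with a `%%MatrixMarket` banner line, so the size line is the first line that does not start with `%`, and the row and column indices in the file are 1-based.

```python
import os
import tempfile

import numpy as np
from scipy.io import mmread, mmwrite
from scipy.sparse import lil_matrix

# Made-up (user_row, movie_col, rating) triples; not real Netflix data.
triples = [(0, 0, 3), (0, 2, 5), (1, 1, 4), (3, 2, 1)]
n_users, n_movies = 4, 3

# LIL supports cheap incremental assignment while the text files are read.
ratings = lil_matrix((n_users, n_movies), dtype=np.int32)
for user_row, movie_col, rating in triples:
    ratings[user_row, movie_col] = rating

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_ratings.mtx')  # hypothetical file name
    mmwrite(path, ratings.tocoo())
    with open(path) as handle:
        lines = handle.read().splitlines()
    loaded = mmread(path)  # sparse matrix in coordinate (COO) format

# First non-comment line holds: rows, columns, number of stored entries.
size_line = next(line for line in lines if not line.startswith('%'))
print(lines[0])
print(size_line)

# Rough dense-storage estimate, assuming 8 bytes per entry.
n_entries = 480189 * 17770
dense_gb = n_entries * 8 / 1e9
print(n_entries, dense_gb)
```

The 8-bytes-per-entry figure is only one possible choice; a one-byte rating type would shrink the dense estimate by the same factor, which is the dependence the text alludes to.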
