This repository was archived by the owner on Mar 19, 2024. It is now read-only.
In the file `fastText/python/fasttext_module/fasttext/util/util.py`, the function `_reduce_matrix()` implements PCA to reduce the dimension of the word vectors, but the implementation appears to be incorrect. Given the covariance matrix `C`, we need the `dim` eigenvectors corresponding to the top `dim` eigenvalues in order to retain the most variance. However, `np.linalg.eig()` returns eigenvalues in no particular order, so taking the first `dim` columns of `U` does not select the top components.
```python
def _reduce_matrix(X_orig, dim, eigv):
    """
    Reduces the dimension of a (m × n) matrix `X_orig`
    to a (m × dim) matrix `X_reduced`.
    It uses only the first 100000 rows of `X_orig` to do the mapping.
    Matrix types are all `np.float32` in order to avoid unnecessary copies.
    """
    if eigv is None:
        mapping_size = 100000
        X = X_orig[:mapping_size]
        X = X - X.mean(axis=0, dtype=np.float32)
        C = np.divide(np.matmul(X.T, X), X.shape[0] - 1, dtype=np.float32)
        _, U = np.linalg.eig(C)
        eigv = U[:, :dim]  # BUG: eigenvalues are unordered, so these need
                           # not be the top `dim` eigenvectors

    X_reduced = np.matmul(X_orig, eigv)
    return (X_reduced, eigv)
```
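To see the problem concretely, here is a small self-contained sketch (the data is made up for illustration) showing that `np.linalg.eig` gives no ordering guarantee, while `np.linalg.eigh`, which is intended for symmetric matrices, returns eigenvalues sorted in ascending order:

```python
import numpy as np

# Hypothetical data: build a small symmetric covariance-like matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
C = A.T @ A / (A.shape[0] - 1)

w, U = np.linalg.eig(C)
print(w)  # order of eigenvalues is implementation-defined, not sorted

w_asc, U_asc = np.linalg.eigh(C)  # eigh guarantees ascending order
print(w_asc)
```

Sorting the `eig` output recovers the same spectrum as `eigh`; only the ordering differs.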
We can correct the code in this way:
```python
def _reduce_matrix(X_orig, dim, eigv):
    if eigv is None:
        mapping_size = 100000
        X = X_orig[:mapping_size]
        X = X - X.mean(axis=0, dtype=np.float32)
        C = np.divide(np.matmul(X.T, X), X.shape[0] - 1, dtype=np.float32)
        V, U = np.linalg.eig(C)
        # indices of the top `dim` eigenvalues, largest first
        ind = V.argsort()[::-1][:dim]
        eigv = U[:, ind]

    X_reduced = np.matmul(X_orig, eigv)
    return (X_reduced, eigv)
```