This repository was archived by the owner on Mar 19, 2024. It is now read-only.

Function _reduce_matrix seems to be an incorrect implementation of PCA #1199

@LittletreeZou

Description

In the file fastText/python/fasttext_module/fasttext/util/util.py, the function _reduce_matrix() is used as a PCA step to reduce dimensionality, but it appears to be an incorrect implementation of PCA. Given the covariance matrix C, we need the dim eigenvectors that correspond to the dim largest eigenvalues in order to retain as much variance as possible. However, np.linalg.eig() returns eigenvalues that are not necessarily ordered, so taking the first dim columns of U (U[:, :dim]) does not necessarily select the top components.

def _reduce_matrix(X_orig, dim, eigv):
    """
    Reduces the dimension of a (m × n)   matrix `X_orig` to
                          to a (m × dim) matrix `X_reduced`
    It uses only the first 100000 rows of `X_orig` to do the mapping.
    Matrix types are all `np.float32` in order to avoid unncessary copies.
    """
    if eigv is None:
        mapping_size = 100000
        X = X_orig[:mapping_size]
        X = X - X.mean(axis=0, dtype=np.float32)
        C = np.divide(np.matmul(X.T, X), X.shape[0] - 1, dtype=np.float32)
        _, U = np.linalg.eig(C)    
        eigv = U[:, :dim]

    X_reduced = np.matmul(X_orig, eigv)

    return (X_reduced, eigv)

We can correct the code in this way:

def _reduce_matrix(X_orig, dim, eigv):
    if eigv is None:
        mapping_size = 100000
        X = X_orig[:mapping_size]
        X = X - X.mean(axis=0, dtype=np.float32)
        C = np.divide(np.matmul(X.T, X), X.shape[0] - 1, dtype=np.float32)
        V, U = np.linalg.eig(C)
        ind = V.argsort()[::-1][:dim]    # indices of the dim largest eigenvalues, in descending order
        eigv = U[:, ind]

    X_reduced = np.matmul(X_orig, eigv)

    return (X_reduced, eigv)
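
As an alternative sketch (not the library's code): because the covariance matrix C is symmetric, np.linalg.eigh() could be used instead of np.linalg.eig(). It returns real eigenvalues in ascending order, so the selection is explicit and complex-valued output is avoided. The function name reduce_matrix_eigh, the mapping_size parameter, and the synthetic data below are assumptions made for illustration only.

```python
import numpy as np


def reduce_matrix_eigh(X_orig, dim, eigv=None, mapping_size=100000):
    """Hypothetical variant of _reduce_matrix that uses np.linalg.eigh."""
    if eigv is None:
        X = X_orig[:mapping_size]
        X = X - X.mean(axis=0, dtype=np.float32)
        C = np.divide(np.matmul(X.T, X), X.shape[0] - 1, dtype=np.float32)
        # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
        V, U = np.linalg.eigh(C)
        # Sort explicitly anyway and keep the dim largest, largest first.
        order = np.argsort(V)[::-1][:dim]
        eigv = U[:, order]

    X_reduced = np.matmul(X_orig, eigv)
    return X_reduced, eigv


# Synthetic data (illustrative only): column 1 has the largest variance,
# column 2 the second largest, column 0 the smallest.
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 5.0], dtype=np.float32)
X = rng.standard_normal((1000, 3)).astype(np.float32) * scales

X_reduced, eigv = reduce_matrix_eigh(X, dim=2)
print(X_reduced.shape)        # shape of the reduced matrix
print(np.abs(eigv).round(2))  # first column should be dominated by axis 1, second by axis 2
```

With this data the first retained direction should align with the highest-variance column and the second with the next highest, whereas slicing U[:, :dim] from an unsorted decomposition gives no such guarantee.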
