This repository was archived by the owner on Mar 19, 2024. It is now read-only.
In the file `fastText/python/fasttext_module/fasttext/util/util.py`, the function `_reduce_matrix()` implements PCA to reduce the dimension of the word vectors, but the implementation appears to be incorrect. Given the covariance matrix `C`, we need the `dim` eigenvectors corresponding to the top `dim` eigenvalues in order to retain the most variance. However, `np.linalg.eig()` returns eigenvalues in no particular order, so taking the first `dim` columns of `U` does not select the top components.
```python
def _reduce_matrix(X_orig, dim, eigv):
    """
    Reduces the dimension of a (m × n) matrix `X_orig`
    to a (m × dim) matrix `X_reduced`.
    It uses only the first 100000 rows of `X_orig` to do the mapping.
    Matrix types are all `np.float32` in order to avoid unnecessary copies.
    """
    if eigv is None:
        mapping_size = 100000
        X = X_orig[:mapping_size]
        X = X - X.mean(axis=0, dtype=np.float32)
        C = np.divide(np.matmul(X.T, X), X.shape[0] - 1, dtype=np.float32)
        _, U = np.linalg.eig(C)
        eigv = U[:, :dim]  # BUG: eigenvalues are unordered, so these need
                           # not be the top `dim` eigenvectors

    X_reduced = np.matmul(X_orig, eigv)
    return (X_reduced, eigv)
```
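To see the problem concretely, here is a small self-contained sketch (the data is made up for illustration) showing that `np.linalg.eig` gives no ordering guarantee, while `np.linalg.eigh`, which is intended for symmetric matrices, returns eigenvalues sorted in ascending order:

```python
import numpy as np

# Hypothetical data: build a small symmetric covariance-like matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
C = A.T @ A / (A.shape[0] - 1)

w, U = np.linalg.eig(C)
print(w)  # order of eigenvalues is implementation-defined, not sorted

w_asc, U_asc = np.linalg.eigh(C)  # eigh guarantees ascending order
print(w_asc)
```

Sorting the `eig` output recovers the same spectrum as `eigh`; only the ordering differs.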
We can correct the code in this way:
```python
def _reduce_matrix(X_orig, dim, eigv):
    if eigv is None:
        mapping_size = 100000
        X = X_orig[:mapping_size]
        X = X - X.mean(axis=0, dtype=np.float32)
        C = np.divide(np.matmul(X.T, X), X.shape[0] - 1, dtype=np.float32)
        V, U = np.linalg.eig(C)
        # indices of the top `dim` eigenvalues, largest first
        ind = V.argsort()[::-1][:dim]
        eigv = U[:, ind]

    X_reduced = np.matmul(X_orig, eigv)
    return (X_reduced, eigv)
```