Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add calculate_cng_indices #97

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
68 changes: 67 additions & 1 deletion factor_analyzer/factor_analyzer.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,19 +8,21 @@
"""

import warnings
from typing import Tuple

import numpy as np
import pandas as pd
import scipy as sp
from scipy.optimize import minimize
from scipy.stats import chi2, pearsonr
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_models import LinearRegression
from sklearn.utils import check_array
from sklearn.utils.extmath import randomized_svd
from sklearn.utils.validation import check_is_fitted

from .rotator import OBLIQUE_ROTATIONS, POSSIBLE_ROTATIONS, Rotator
from .utils import corr, impute_values, partial_correlations, smc
from .utils import corr, covariance_to_correlation, impute_values, partial_correlations, smc

POSSIBLE_SVDS = ['randomized', 'lapack']

Expand Down Expand Up @@ -114,6 +116,70 @@ def calculate_bartlett_sphericity(x):
return statistic, p_value


def calculate_cng_indices(
data: np.ndarray, model: str = "components"
) -> Tuple[int, pd.DataFrame]:
"""Calculate the Cattel-Nelson-Gorsuch indices, which are used to determine
the appropriate number of factors for a factor analysis.

Direct port of nCng function from nFactors package:
https://rdrr.io/cran/nFactors/man/nCng.html

Parameters
----------
data : array-like
The array of samples x observable for which to calculate CNG indices
model : str
"components" or "factors"

Returns
-------
num_factors : int
The number of components/factors to retain
details : pd.DataFrame
The eigenvalues and CNG indices of the dataset
"""
data = corr(data)
if model == "factors":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're just going to have "factors", do we need the model argument? Does it make sense to exclude this for now?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if model == "components", the adjustment underneath is not needed. In other words, only taking the correlation of the data is sufficient. We might want to add elif model != "components": raise ValueError

data -= np.linalg.pinv(np.diag(np.diag(np.linalg.pinv(data))))
# TODO: Should this line be here?
data = covariance_to_correlation(data)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line comes from the eigenComputes function. When the data passed in are correlations and cor == TRUE, it calls eigen(cov2cor(corFa(x))). However, it doesn't do this if the variables/observations themselves are passed, in which case it calls eigen(corFA(cov(x))). This produces very different results! I think this is only needed if covariances are passed in rather than correlations, but since this implementation calculates the correlations I think this line should be removed. Agreed?

Copy link
Collaborator

@jbiggsets jbiggsets Mar 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, are we assuming the input value here will always be eigenForm == "data"? If so, I think you're right that this should be removed, but I may be misunderstanding.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think you're right. I was getting confused by the case where dataType == correlation && cor == FALSE, but I think in that case the data passed in should actually be a covariance matrix, not a correlation matrix.


values = np.sort(np.linalg.eigvals(data))[::-1]

num_variables = len(data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't len(data) give you the number of observations, rather than the number of variables, assuming data is a numpy array or pandas data frame?

Unfortunately, this doesn't translate exactly from R:

> df <- data.frame(A = c(1, 2, 3), B = c(1, 2, 3))
> length(df) 
[1] 2
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
>>> len(df)
3
>>> len(df.values)
3

Also, should we move the line below to the beginning?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this line of the code data refers to the correlation matrix, which is square. data.shape[0] == data.shape[1]

if num_variables < 6:
raise ValueError("The number of variables must be at least 6")

fit_size = 3
cng = np.diff(
[
[
LinearRegression()
.fit(idx_values[:, np.newaxis], values[idx_values])
.coef_
for idx_values in [
np.arange(idx_fit, idx_fit + fit_size),
np.arange(idx_fit + fit_size, idx_fit + 2 * fit_size),
]
]
for idx_fit in range(num_variables - 2 * fit_size)
],
axis=1,
).squeeze(axis=(1, 2))

num_factors = np.nanargmax(cng) + fit_size

details = pd.DataFrame(
{
"data": values[: len(cng)],
"cng": cng,
}
)

return num_factors, details


class FactorAnalyzer(BaseEstimator, TransformerMixin):
"""
A FactorAnalyzer class, which -
Expand Down