-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add calculate_cng_indices #97
base: main
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,19 +8,21 @@ | |
""" | ||
|
||
import warnings | ||
from typing import Tuple | ||
|
||
import numpy as np | ||
import pandas as pd | ||
import scipy as sp | ||
from scipy.optimize import minimize | ||
from scipy.stats import chi2, pearsonr | ||
from sklearn.base import BaseEstimator, TransformerMixin | ||
from sklearn.linear_models import LinearRegression | ||
from sklearn.utils import check_array | ||
from sklearn.utils.extmath import randomized_svd | ||
from sklearn.utils.validation import check_is_fitted | ||
|
||
from .rotator import OBLIQUE_ROTATIONS, POSSIBLE_ROTATIONS, Rotator | ||
from .utils import corr, impute_values, partial_correlations, smc | ||
from .utils import corr, covariance_to_correlation, impute_values, partial_correlations, smc | ||
|
||
POSSIBLE_SVDS = ['randomized', 'lapack'] | ||
|
||
|
@@ -114,6 +116,70 @@ def calculate_bartlett_sphericity(x): | |
return statistic, p_value | ||
|
||
|
||
def calculate_cng_indices( | ||
data: np.ndarray, model: str = "components" | ||
) -> Tuple[int, pd.DataFrame]: | ||
"""Calculate the Cattel-Nelson-Gorsuch indices, which are used to determine | ||
the appropriate number of factors for a factor analysis. | ||
|
||
Direct port of nCng function from nFactors package: | ||
https://rdrr.io/cran/nFactors/man/nCng.html | ||
|
||
Parameters | ||
---------- | ||
data : array-like | ||
The array of samples x observable for which to calculate CNG indices | ||
model : str | ||
"components" or "factors" | ||
|
||
Returns | ||
------- | ||
num_factors : int | ||
The number of components/factors to retain | ||
details : pd.DataFrame | ||
The eigenvalues and CNG indices of the dataset | ||
""" | ||
data = corr(data) | ||
if model == "factors": | ||
data -= np.linalg.pinv(np.diag(np.diag(np.linalg.pinv(data)))) | ||
# TODO: Should this line be here? | ||
data = covariance_to_correlation(data) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This line comes from the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So, are we assuming the input value here will always be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I think you're right. I was getting confused by the case where |
||
|
||
values = np.sort(np.linalg.eigvals(data))[::-1] | ||
|
||
num_variables = len(data) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Won't Unfortunately, this doesn't translate exactly from R:
Also, should we move the line below to the beginning? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. At this line of the code |
||
if num_variables < 6: | ||
raise ValueError("The number of variables must be at least 6") | ||
|
||
fit_size = 3 | ||
cng = np.diff( | ||
[ | ||
[ | ||
LinearRegression() | ||
.fit(idx_values[:, np.newaxis], values[idx_values]) | ||
.coef_ | ||
for idx_values in [ | ||
np.arange(idx_fit, idx_fit + fit_size), | ||
np.arange(idx_fit + fit_size, idx_fit + 2 * fit_size), | ||
] | ||
] | ||
for idx_fit in range(num_variables - 2 * fit_size) | ||
], | ||
axis=1, | ||
).squeeze(axis=(1, 2)) | ||
|
||
num_factors = np.nanargmax(cng) + fit_size | ||
|
||
details = pd.DataFrame( | ||
{ | ||
"data": values[: len(cng)], | ||
"cng": cng, | ||
} | ||
) | ||
|
||
return num_factors, details | ||
|
||
|
||
class FactorAnalyzer(BaseEstimator, TransformerMixin): | ||
""" | ||
A FactorAnalyzer class, which - | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're just going to have "factors", do we need the
model
argument? Does it make sense to exclude this for now?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if model == "components"
, the adjustment underneath is not needed. In other words, only taking the correlation of the data is sufficient. We might want to addelif model != "components": raise ValueError