Feature request: exclude highly correlated covariates from propensity score calculation #155
Comments
I don't think automatically removing them is a good idea. We need to review whether they reflect actual, relevant differences between target and comparator. If so, we may need to redesign the study, or stop altogether. I do think we should do a better job of recording the high-correlation covariates. Also, the Comparator Selection Tool does a great job of identifying high-correlation covariates beforehand.
I'm still urging you to create a separate function that determines the correlated covariates and returns them to the user. This would help to identify these covariates BEFORE fitting the propensity score models, prevent the analysis from running into a fatal error, and let the user decide what to do with them. Receiving the list would allow us to remove them manually or automatically. I think it would still be helpful to have an option to exclude such covariates automatically if they appear. I need to do this automatically because I'm not studying just two cohorts plus an outcome, but a large set of cohorts, and I'm trying to determine the potential risk factors (cohorts) of the outcome. I'm also using a large set of outcomes. So, I'm trying to use the CohortMethod package for the dirty work of identifying such risk factors/cohorts from a large set of cohorts, and later continue working on these in more detail. Therefore, requiring manual exclusion for each cohort pair in this dirty-work stage is very inconvenient, and I cannot see a reason why the package cannot have an argument to do this automatically (with the necessary warning messages). Once the potential candidate risk factors have been found, one should review the highly correlated covariates one by one, for sure.
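To make the request concrete, here is a rough sketch of what such an opt-in switch could look like. It is written in Python purely for illustration; `plan_exclusions`, `detect`, and `auto_exclude` are hypothetical names and not part of CohortMethod. The idea is that the flagged covariates are always recorded for later review, and exclusion only happens (with a warning) when explicitly requested.

```python
import warnings


def plan_exclusions(cohort_pairs, detect, auto_exclude=False):
    """Build an exclusion plan for many (target, comparator) pairs.

    `detect` is a hypothetical callable returning the set of covariate IDs
    flagged as highly correlated with treatment for a given pair.
    """
    plan = {}
    for pair in cohort_pairs:
        flagged = sorted(detect(pair))
        if flagged and auto_exclude:
            # Exclusion is never silent: every automatic removal is reported.
            warnings.warn(f"{pair}: excluding highly correlated covariates {flagged}")
        plan[pair] = {
            "flagged": flagged,                          # always recorded
            "exclude": flagged if auto_exclude else [],  # applied only on opt-in
            "needs_review": bool(flagged),               # manual review still due
        }
    return plan


# Made-up screening results for two cohort pairs.
screening = {("A", "B"): {1001}, ("A", "C"): set()}
plan = plan_exclusions(screening.keys(), screening.get, auto_exclude=True)
```

With `auto_exclude=False` the same call would only report the flagged covariates and leave the exclusion list empty, which matches the current manual workflow.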
That is what the Automatically removing the covariates may lead to invalid causal estimates, which I don't want to encourage.
Currently, CohortMethod::createPs() checks whether any of the covariates are highly correlated with the treatment. This is a very useful feature: whenever such a correlation is found, I think the propensity scores become highly biased and hardly usable for matching. However, there is no automatic way to exclude such covariates from the propensity score calculation. The only way to exclude them is to add them manually to the exclude list, which is... painful, as it interrupts the automated flow. Furthermore, there is no good way to identify and remove the highly correlated covariates (CohortMethod::createPs() only prints them).
I tried to solve this problem by relying on cohort definitions: I excluded the concept_id-s given there from the CohortMethodData. However, some highly correlated events cannot be removed this way (for instance, when the diagnostic test result is the index event, but the examination/taking of the test, which is not given in the cohort definition, is also highly correlated).
As a second attempt, I reused some code from the CohortMethod::createPs() function to identify the correlated covariates. But there are several problems with it: 1) I have to build studyPopulation twice, which is time-consuming (first to identify the correlations, second to apply the exclude list); 2) some correlated covariates do not have concept_id-s to exclude (e.g. index year may be correlated) and require a more advanced "exclusion mechanism" (turning off a covariateSetting flag); 3) it is an ugly copy-code hack; 4) it does not work with the multi-analysis approach.
So, I'm having trouble finding the best method to automate this step and am asking for advice.
It seems most reasonable to have a feature in the CohortMethod package that automatically excludes the highly correlated covariates from propensity score fitting. I know that doing too much in the background will make us lazy, but when conducting many analyses with dozens of cohorts in a row, this manual step in the middle is very inconvenient.
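For reference, the kind of stand-alone check I have in mind can be sketched as follows. This is a simplified illustration in Python, not the code used inside CohortMethod: the function name, the input layout (a treatment lookup plus sparse rowId/covariateId pairs for binary covariates), and the 0.5 threshold are all assumptions. Because it works on covariate IDs rather than concept IDs, covariates such as index year would be flagged as well.

```python
from collections import defaultdict
from math import sqrt


def find_high_correlation_covariates(treatment, covariate_rows, threshold=0.5):
    """Return {covariate_id: r} for binary covariates whose absolute Pearson
    (phi) correlation with treatment exceeds `threshold`.

    treatment:      dict mapping row_id -> 0/1 treatment indicator
    covariate_rows: iterable of (row_id, covariate_id) pairs; a pair means the
                    covariate is present (value 1) for that row
    """
    n = len(treatment)
    n_treated = sum(treatment.values())
    with_cov = defaultdict(int)   # rows having the covariate
    with_both = defaultdict(int)  # rows having the covariate AND treated
    for row_id, covariate_id in set(covariate_rows):
        with_cov[covariate_id] += 1
        with_both[covariate_id] += treatment[row_id]

    flagged = {}
    for covariate_id, c in with_cov.items():
        denom = sqrt(c * (n - c) * n_treated * (n - n_treated))
        if denom == 0:
            continue  # constant covariate or single-arm data: undefined
        r = (n * with_both[covariate_id] - c * n_treated) / denom
        if abs(r) > threshold:
            flagged[covariate_id] = r
    return flagged


# Toy data: covariate 1001 occurs only in treated rows, 2002 in both arms.
treatment = {1: 1, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0}
covariate_rows = [(1, 1001), (2, 1001), (3, 1001), (1, 2002), (4, 2002)]
flagged = find_high_correlation_covariates(treatment, covariate_rows)
print(flagged)  # -> {1001: 1.0}
```

The returned IDs could then be reviewed, or passed on as an exclusion list before the propensity model is fitted, so the study population would not need to be built twice.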