
Conversation

@henrydavidge (Contributor) commented Dec 10, 2020

What changes are proposed in this pull request?

  • Moves logic shared by the pandas-based linear and logistic regression into a common file
  • Adds scaffolding for a pandas-based logistic regression test, with fallback logic for potentially significant variants
  • Implements a fast multi-pheno, multi-geno score test (see the usage sketch below)
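
For orientation, a minimal usage sketch of the new pandas-based test. The entry point is assumed to be `glow.gwas.logistic_regression` with the parameters quoted in the review below, so names may not match the merged API exactly:

```python
import numpy as np
import pandas as pd
import glow.gwas as gwas  # assumed import path; may differ from the merged module layout

rng = np.random.default_rng(42)
n_samples = 100

# Binary phenotypes and continuous covariates share the same sample (row) index
phenotype_df = pd.DataFrame(rng.integers(0, 2, (n_samples, 2)).astype(float),
                            columns=['pheno_a', 'pheno_b'])
covariate_df = pd.DataFrame(rng.standard_normal((n_samples, 2)),
                            columns=['age', 'pc1'])

# genotype_df is a Spark DataFrame with a numeric array column of per-sample genotypes
# (e.g. from glow's VCF reader); its construction is omitted here
results = gwas.logistic_regression(
    genotype_df,
    phenotype_df,
    covariate_df=covariate_df,
    correction='none',       # approx-firth is slated to become the default per the TODO in the diff
    values_column='values',
)
```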

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

(Details)

codecov bot commented Dec 10, 2020

Codecov Report

Merging #316 (3ea40cd) into master (a63306e) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #316   +/-   ##
=======================================
  Coverage   93.64%   93.64%           
=======================================
  Files          95       95           
  Lines        4814     4814           
  Branches      472      472           
=======================================
  Hits         4508     4508           
  Misses        306      306           


@henrydavidge changed the title from "Logistic regression pandas" to "Pandas based logistic regression" on Dec 12, 2020
@karenfeng requested a review from kianfar77 on December 14, 2020
@karenfeng (Collaborator) left a comment

I left a couple of high-level questions that came up while writing a first cut of the approximate Firth correction.

phenotype_df: pd.DataFrame,
covariate_df: pd.DataFrame = pd.DataFrame({}),
offset_df: pd.DataFrame = pd.DataFrame({}),
# TODO: fallback is probably not the best name
Collaborator

In addition to fallback (I propose correction as an alternative name), we should also expose a pvalue_threshold below which we perform the correction.
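
Something like the following hypothetical signature; both `correction` and `pvalue_threshold` are illustrative names, not the merged API:

```python
import pandas as pd

def logistic_regression(genotype_df,
                        phenotype_df: pd.DataFrame,
                        covariate_df: pd.DataFrame = pd.DataFrame({}),
                        offset_df: pd.DataFrame = pd.DataFrame({}),
                        correction: str = 'approx-firth',   # hypothetical: which correction to apply
                        pvalue_threshold: float = 0.05):    # hypothetical: correct only below this p-value
    """Run the score test for every variant, then re-test variants whose score-test
    p-value falls below ``pvalue_threshold`` using the requested ``correction``."""
    ...
```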

sql_type = gwas_fx._regression_sql_type(dt)
genotype_df = gwas_fx._prepare_genotype_df(genotype_df, values_column, sql_type)
result_fields = [
# TODO: Probably want to put effect size and stderr here for approx-firth
Collaborator

I think we can still calculate effect size and stderr without the corrections, right? As in: https://github.com/rgcgithub/regenie/blob/247483cd5617f048682062553265837c2b95d6ee/src/Data.cpp#L2456

Contributor Author

I saw that they compute bhat, but I don't understand what they actually represent. Based on the regenie paper and other resources, it seems that "effect size" universally corresponds to the maximum likelihood coefficient for the genotype feature. If that's the case, how could you know the effect size without fitting a model?

Collaborator

I believe that effect size here simply refers to the difference between means, which is similar to the t-test stat (without the scaling).
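
For concreteness, a difference-of-means quantity in that sense might look like the following; this is illustrative only, not a claim about exactly what regenie reports:

```python
import numpy as np

def naive_effect_size(g: np.ndarray, y: np.ndarray) -> float:
    """Difference in mean genotype between cases (y == 1) and controls (y == 0),
    i.e. the unscaled numerator of a two-sample t statistic."""
    return float(g[y == 1].mean() - g[y == 0].mean())
```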

@kianfar77 (Collaborator) left a comment

Looks nice! I had some comments.

np.nan_to_num(Y, copy=False)
_residualize_in_place(Y, Q)

if not offset_df.empty:
Collaborator

We probably also need to give an error message when the number of rows in phenotype_df and offset_df does not match, similar to what we do with phenotype_df and covariate_df.

Contributor Author

We check that they have the same row index, which is actually more strict.

Collaborator

My bad. I just saw the columns comparison.
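
For reference, a sketch of the index check being described (an assumed helper, not the exact code in the diff):

```python
import pandas as pd

def _check_offset(phenotype_df: pd.DataFrame, offset_df: pd.DataFrame) -> None:
    """Require offset_df to line up with phenotype_df exactly: same row index (samples)
    and same columns (phenotypes), which is stricter than comparing row counts."""
    if offset_df.empty:
        return
    if not phenotype_df.index.equals(offset_df.index):
        raise ValueError('phenotype_df and offset_df must have the same row index')
    if not phenotype_df.columns.equals(offset_df.columns):
        raise ValueError('phenotype_df and offset_df must have the same columns')
```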

On the driver node, we fit a logistic regression model based on the covariates for each
phenotype. We broadcast the resulting residuals, gamma vectors
(where gamma is defined as y_hat * (1 - y_hat)), and (C.T gamma C)^-1 matrices. In each task,
Collaborator

You probably need to write out the logit linear expression somewhere before this for the notation you use here to make sense.
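
For the record, the notation boils down to the covariate-only null model: for each phenotype y we fit logit(E[y]) = C·beta (+ offset), take y_hat as the fitted probability, and broadcast the residuals y - y_hat, gamma = y_hat * (1 - y_hat), and (C.T gamma C)^-1. A statsmodels-based sketch of that step, not the PR's exact code:

```python
import numpy as np
import statsmodels.api as sm

def fit_null_model(y: np.ndarray, C: np.ndarray, offset: np.ndarray = None):
    """Fit logit(E[y]) = C @ beta + offset using covariates only (no genotype)."""
    fit = sm.GLM(y, C, family=sm.families.Binomial(), offset=offset, missing='none').fit()
    y_hat = fit.fittedvalues                         # fitted probabilities (offset included)
    residuals = y - y_hat                            # broadcast to the Spark tasks
    gamma = y_hat * (1 - y_hat)                      # per-sample variance weights
    CtGammaC_inv = np.linalg.inv(C.T @ (gamma[:, None] * C))
    return residuals, gamma, CtGammaC_inv
```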

genotype_df : Spark DataFrame containing genomic data
phenotype_df : Pandas DataFrame containing phenotypic data
covariate_df : An optional Pandas DataFrame containing covariates
offset_df : An optional Pandas DataFrame containing the phenotype offset. The actual phenotype used
Collaborator

The sentence 'The actual phenotype ...' needs to be adjusted for the logistic regression context.

X[y_mask],
family=sm.families.Binomial(),
offset=offset,
missing='ignore')
Collaborator

In the statsmodels documentation (https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM), I see none, drop, and raise as options for the missing argument. Does 'ignore' work?

Contributor Author

Good catch! It looks like if you specify an invalid option, statsmodels defaults to 'none', which is actually what I want. I will change it to 'none'.
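
For reference, a self-contained sketch of the intended behavior: mask NaN phenotypes ourselves and pass missing='none' so statsmodels does no handling of its own (illustrative data, not the PR's code):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.standard_normal((50, 2)))
y = rng.integers(0, 2, 50).astype(float)
y[:3] = np.nan                        # samples missing this phenotype

y_mask = ~np.isnan(y)                 # we handle missingness explicitly...
fit = sm.GLM(y[y_mask], X[y_mask],
             family=sm.families.Binomial(),
             missing='none').fit()    # ...so statsmodels' own NaN handling stays off
print(fit.params)
```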



@typechecked
def _assemble_log_reg_state(
Collaborator

nit: you use create for linear regression and assemble here. Perhaps better to use the same verb in both.

])
gamma = Y_pred * (1 - Y_pred)
CtGammaC = C.T @ (gamma[:, :, None] * C)
CtGammaC_inv = np.linalg.inv(CtGammaC)
Collaborator

nit: in some places CtGammaC_inv is used and in others inv_CtGammaC; can we settle on one?
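
To show how the broadcast pieces fit together, a sketch of the per-variant score test in its standard form (the PR's vectorized multi-variant implementation will differ in detail):

```python
import numpy as np
from scipy.stats import chi2

def score_test_pvalue(g, residuals, gamma, C, CtGammaC_inv):
    """Logistic score test for one variant against the covariate-only null model.

    g: genotype vector (n,); residuals: y - y_hat; gamma: y_hat * (1 - y_hat);
    C: covariate matrix (n, k); CtGammaC_inv: inverse of C.T @ diag(gamma) @ C.
    """
    score = g @ residuals                                 # U = g^T (y - y_hat)
    CtGammaG = C.T @ (gamma * g)
    variance = g @ (gamma * g) - CtGammaG @ CtGammaC_inv @ CtGammaG
    return chi2.sf(score ** 2 / variance, df=1)           # chi-squared, 1 degree of freedom
```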

`genotype_df` should have a column with this name and a numeric array type. If a column expression
is provided, the expression should return a numeric array type.
dt : The numpy datatype to use in the linear regression test. Must be `np.float32` or `np.float64`.
'''
Collaborator

Can we add Returns descriptions?
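
e.g. a hypothetical Returns block in the same docstring style (the exact result columns depend on the final schema):

```python
'''
Returns
-------
A Spark DataFrame with one row per (variant, phenotype) pair, containing the retained
variant fields plus the test results (for example the test statistic, ``pvalue``, and
the ``phenotype`` name; effect size and standard error once approx-firth lands).
'''
```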

@kianfar77 (Collaborator) left a comment

Looks great! Just a couple of nits.

have one or two levels of indexing. If one level, the index should be the same as the `phenotype_df`.
If two levels, the level 0 index should be the same as the `phenotype_df`, and the level 1 index
offset_df : An optional Pandas DataFrame containing the phenotype offset. This value will be used
as a offset in the covariate only and per variant logistic regression models. The ``offset_df`` may
Collaborator

typo: a offset

offset_df: pd.DataFrame = pd.DataFrame({}),
# TODO: fallback is probably not the best name
fallback: str = 'none', # TODO: Make approx-firth default
correction: str = 'none', # TODO: Make approx-firth default
Collaborator

nit: I think the term correction is misleading for this argument. Something like alternative may make more sense.

Collaborator

Correction is the term used in the regenie paper to refer to Firth/SPA.

@karenfeng (Collaborator) left a comment

As discussed offline, can you set a numpy random seed? Otherwise, we can end up with test sets with perfect separation (which we should probably have as well, but only when we expect it).
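
e.g. a sketch of seeding the simulated fixtures (whether via np.random.seed or a seeded Generator) so that perfect separation only shows up in tests that construct it on purpose; the helper name and shapes are hypothetical:

```python
import numpy as np
import pandas as pd

def _random_test_data(n_samples=100, n_pheno=3, n_covar=2, seed=0):
    """Deterministic fixtures: a fixed seed keeps the simulated binary phenotypes
    from accidentally being perfectly separable by the covariates."""
    rng = np.random.default_rng(seed)
    phenotype_df = pd.DataFrame(rng.integers(0, 2, (n_samples, n_pheno)).astype(float))
    covariate_df = pd.DataFrame(rng.standard_normal((n_samples, n_covar)))
    return phenotype_df, covariate_df
```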

@henrydavidge force-pushed the logistic-regression-pandas branch from e05a351 to 3ea40cd on December 23, 2020
@henrydavidge merged commit e1b52e4 into projectglow:master on Dec 23, 2020
bcajes pushed a commit to bcajes/glow that referenced this pull request Sep 27, 2021
* initial work

Signed-off-by: Henry D <henrydavidge@gmail.com>

* add file

Signed-off-by: Henry D <henrydavidge@gmail.com>

* working score test

Signed-off-by: Henry D <henrydavidge@gmail.com>

* seems to work

Signed-off-by: Henry D <henrydavidge@gmail.com>

* continue

Signed-off-by: Henry D <henrydavidge@gmail.com>

* offset support; more tests

Signed-off-by: Henry D <henrydavidge@gmail.com>

* delete lin_reg.py

Signed-off-by: Henry D <henrydavidge@gmail.com>

* add docs, few more tests

Signed-off-by: Henry D <henrydavidge@gmail.com>

* add test file

Signed-off-by: Henry D <henrydavidge@gmail.com>

* fix last test

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Fix docs, tests

Signed-off-by: Henry D <henrydavidge@gmail.com>

* memory limit

Signed-off-by: Henry D <henrydavidge@gmail.com>

* try explicitly broadcasting

Signed-off-by: Henry D <henrydavidge@gmail.com>

* update environment

Signed-off-by: Henry D <henrydavidge@gmail.com>

* undo explicit broadcast

Signed-off-by: Henry D <henrydavidge@gmail.com>

* fix typo

Signed-off-by: Henry D <henrydavidge@gmail.com>

* formatting; karen's comment

Signed-off-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: brian cajes <brian@empiricotx.com>