
Conversation

@henrydavidge (Contributor) commented Dec 10, 2020

What changes are proposed in this pull request?

  • Moves logic shared by the pandas-based linear and logistic regression into a common file
  • Adds scaffolding for a pandas-based logistic regression test, with fallback logic for potentially significant variants
  • Implements a fast multi-pheno, multi-geno score test (see the usage sketch below)
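
For orientation, a minimal usage sketch of the new pandas-based test. The entry point is assumed to be `glow.gwas.logistic_regression` with the parameters quoted in the review below, so names may not match the merged API exactly:

```python
import numpy as np
import pandas as pd
import glow.gwas as gwas  # assumed import path; may differ from the merged module layout

rng = np.random.default_rng(42)
n_samples = 100

# Binary phenotypes and continuous covariates share the same sample (row) index
phenotype_df = pd.DataFrame(rng.integers(0, 2, (n_samples, 2)).astype(float),
                            columns=['pheno_a', 'pheno_b'])
covariate_df = pd.DataFrame(rng.standard_normal((n_samples, 2)),
                            columns=['age', 'pc1'])

# genotype_df is a Spark DataFrame with a numeric array column of per-sample genotypes
# (e.g. from glow's VCF reader); its construction is omitted here
results = gwas.logistic_regression(
    genotype_df,
    phenotype_df,
    covariate_df=covariate_df,
    correction='none',       # approx-firth is slated to become the default per the TODO in the diff
    values_column='values',
)
```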

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

(Details)

codecov bot commented Dec 10, 2020

Codecov Report

Merging #316 (3ea40cd) into master (a63306e) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #316   +/-   ##
=======================================
  Coverage   93.64%   93.64%           
=======================================
  Files          95       95           
  Lines        4814     4814           
  Branches      472      472           
=======================================
  Hits         4508     4508           
  Misses        306      306           


@henrydavidge changed the title from "Logistic regression pandas" to "Pandas based logistic regression" on Dec 12, 2020
@karenfeng requested a review from kianfar77 on December 14, 2020
@karenfeng (Collaborator) left a comment

I left a couple of high-level questions that came up while writing a first cut of the approximate Firth correction.

phenotype_df: pd.DataFrame,
covariate_df: pd.DataFrame = pd.DataFrame({}),
offset_df: pd.DataFrame = pd.DataFrame({}),
# TODO: fallback is probably not the best name
Collaborator

In addition to fallback (I propose correction as an alternative name), we should also expose a pvalue_threshold below which we perform the correction.
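
Something like the following hypothetical signature; both `correction` and `pvalue_threshold` are illustrative names, not the merged API:

```python
import pandas as pd

def logistic_regression(genotype_df,
                        phenotype_df: pd.DataFrame,
                        covariate_df: pd.DataFrame = pd.DataFrame({}),
                        offset_df: pd.DataFrame = pd.DataFrame({}),
                        correction: str = 'approx-firth',   # hypothetical: which correction to apply
                        pvalue_threshold: float = 0.05):    # hypothetical: correct only below this p-value
    """Run the score test for every variant, then re-test variants whose score-test
    p-value falls below ``pvalue_threshold`` using the requested ``correction``."""
    ...
```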

sql_type = gwas_fx._regression_sql_type(dt)
genotype_df = gwas_fx._prepare_genotype_df(genotype_df, values_column, sql_type)
result_fields = [
# TODO: Probably want to put effect size and stderr here for approx-firth
Collaborator

I think we can still calculate effect size and stderr without the corrections, right? As in: https://github.com/rgcgithub/regenie/blob/247483cd5617f048682062553265837c2b95d6ee/src/Data.cpp#L2456

Contributor Author

I saw that they compute bhat, but I don't understand what they actually represent. Based on the regenie paper and other resources, it seems that "effect size" universally corresponds to the maximum likelihood coefficient for the genotype feature. If that's the case, how could you know the effect size without fitting a model?

Collaborator

I believe that effect size here simply refers to the difference between means, which is similar to the t-test stat (without the scaling).
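
For concreteness, a difference-of-means quantity in that sense might look like the following; this is illustrative only, not a claim about exactly what regenie reports:

```python
import numpy as np

def naive_effect_size(g: np.ndarray, y: np.ndarray) -> float:
    """Difference in mean genotype between cases (y == 1) and controls (y == 0),
    i.e. the unscaled numerator of a two-sample t statistic."""
    return float(g[y == 1].mean() - g[y == 0].mean())
```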

@kianfar77 (Collaborator) left a comment

Looks nice! I had some comments.

np.nan_to_num(Y, copy=False)
_residualize_in_place(Y, Q)

if not offset_df.empty:
Collaborator

We probably also need to give an error message when the number of rows in phenotype_df and offset_df does not match, similar to what we do with phenotype_df and covariate_df.

Contributor Author

We check that they have the same row index, which is actually more strict.

Collaborator

My bad. I just saw the columns comparison.
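
For reference, a sketch of the index check being described (an assumed helper, not the exact code in the diff):

```python
import pandas as pd

def _check_offset(phenotype_df: pd.DataFrame, offset_df: pd.DataFrame) -> None:
    """Require offset_df to line up with phenotype_df exactly: same row index (samples)
    and same columns (phenotypes), which is stricter than comparing row counts."""
    if offset_df.empty:
        return
    if not phenotype_df.index.equals(offset_df.index):
        raise ValueError('phenotype_df and offset_df must have the same row index')
    if not phenotype_df.columns.equals(offset_df.columns):
        raise ValueError('phenotype_df and offset_df must have the same columns')
```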

On the driver node, we fit a logistic regression model based on the covariates for each
phenotype. We broadcast the resulting residuals, gamma vectors
(where gamma is defined as y_hat * (1 - y_hat)), and (C.T gamma C)^-1 matrices. In each task,
Collaborator

You probably need to write out the logit linear expression somewhere before this for the notation you use here to make sense.
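
For the record, the notation boils down to the covariate-only null model: for each phenotype y we fit logit(E[y]) = C·beta (+ offset), take y_hat as the fitted probability, and broadcast the residuals y - y_hat, gamma = y_hat * (1 - y_hat), and (C.T gamma C)^-1. A statsmodels-based sketch of that step, not the PR's exact code:

```python
import numpy as np
import statsmodels.api as sm

def fit_null_model(y: np.ndarray, C: np.ndarray, offset: np.ndarray = None):
    """Fit logit(E[y]) = C @ beta + offset using covariates only (no genotype)."""
    fit = sm.GLM(y, C, family=sm.families.Binomial(), offset=offset, missing='none').fit()
    y_hat = fit.fittedvalues                         # fitted probabilities (offset included)
    residuals = y - y_hat                            # broadcast to the Spark tasks
    gamma = y_hat * (1 - y_hat)                      # per-sample variance weights
    CtGammaC_inv = np.linalg.inv(C.T @ (gamma[:, None] * C))
    return residuals, gamma, CtGammaC_inv
```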

genotype_df : Spark DataFrame containing genomic data
phenotype_df : Pandas DataFrame containing phenotypic data
covariate_df : An optional Pandas DataFrame containing covariates
offset_df : An optional Pandas DataFrame containing the phenotype offset. The actual phenotype used
Collaborator

The sentence 'The actual phenotype ...' needs to be adjusted for the logistic regression context.

X[y_mask],
family=sm.families.Binomial(),
offset=offset,
missing='ignore')
Collaborator

In the statsmodels documentation (https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM), I see none, drop, and raise as options for the missing argument. Does 'ignore' work?

Contributor Author

Good catch! It looks like if you specify an invalid option, statsmodels defaults to 'none', which is actually what I want. I will change it to 'none'.
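
For reference, a self-contained sketch of the intended behavior: mask NaN phenotypes ourselves and pass missing='none' so statsmodels does no handling of its own (illustrative data, not the PR's code):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.standard_normal((50, 2)))
y = rng.integers(0, 2, 50).astype(float)
y[:3] = np.nan                        # samples missing this phenotype

y_mask = ~np.isnan(y)                 # we handle missingness explicitly...
fit = sm.GLM(y[y_mask], X[y_mask],
             family=sm.families.Binomial(),
             missing='none').fit()    # ...so statsmodels' own NaN handling stays off
print(fit.params)
```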



@typechecked
def _assemble_log_reg_state(
Collaborator

nit: you use create for linear regression and assemble here. Perhaps better to use the same verb in both.

])
gamma = Y_pred * (1 - Y_pred)
CtGammaC = C.T @ (gamma[:, :, None] * C)
CtGammaC_inv = np.linalg.inv(CtGammaC)
Collaborator

nit: in some places CtGammaC_inv is used and in others inv_CtGammaC; can we settle on one?
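
To show how the broadcast pieces fit together, a sketch of the per-variant score test in its standard form (the PR's vectorized multi-variant implementation will differ in detail):

```python
import numpy as np
from scipy.stats import chi2

def score_test_pvalue(g, residuals, gamma, C, CtGammaC_inv):
    """Logistic score test for one variant against the covariate-only null model.

    g: genotype vector (n,); residuals: y - y_hat; gamma: y_hat * (1 - y_hat);
    C: covariate matrix (n, k); CtGammaC_inv: inverse of C.T @ diag(gamma) @ C.
    """
    score = g @ residuals                                 # U = g^T (y - y_hat)
    CtGammaG = C.T @ (gamma * g)
    variance = g @ (gamma * g) - CtGammaG @ CtGammaC_inv @ CtGammaG
    return chi2.sf(score ** 2 / variance, df=1)           # chi-squared, 1 degree of freedom
```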

`genotype_df` should have a column with this name and a numeric array type. If a column expression
is provided, the expression should return a numeric array type.
dt : The numpy datatype to use in the linear regression test. Must be `np.float32` or `np.float64`.
'''
Collaborator

Can we add Returns descriptions?
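
e.g. a hypothetical Returns block in the same docstring style (the exact result columns depend on the final schema):

```python
'''
Returns
-------
A Spark DataFrame with one row per (variant, phenotype) pair, containing the retained
variant fields plus the test results (for example the test statistic, ``pvalue``, and
the ``phenotype`` name; effect size and standard error once approx-firth lands).
'''
```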

@kianfar77 (Collaborator) left a comment

Looks great! Just a couple of nits.

have one or two levels of indexing. If one level, the index should be the same as the `phenotype_df`.
If two levels, the level 0 index should be the same as the `phenotype_df`, and the level 1 index
offset_df : An optional Pandas DataFrame containing the phenotype offset. This value will be used
as a offset in the covariate only and per variant logistic regression models. The ``offset_df`` may
Collaborator

typo: a offset

offset_df: pd.DataFrame = pd.DataFrame({}),
# TODO: fallback is probably not the best name
fallback: str = 'none', # TODO: Make approx-firth default
correction: str = 'none', # TODO: Make approx-firth default
Collaborator

nit: I think the term correction is misleading for this argument. Something like alternative may make more sense.

Collaborator

Correction is the term used in the regenie paper to refer to Firth/SPA.

@karenfeng (Collaborator) left a comment

As discussed offline, can you set a numpy random seed? Otherwise, we can end up with test sets with perfect separation (which we should probably have as well, but only when we expect it).
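
e.g. a sketch of seeding the simulated fixtures (whether via np.random.seed or a seeded Generator) so that perfect separation only shows up in tests that construct it on purpose; the helper name and shapes are hypothetical:

```python
import numpy as np
import pandas as pd

def _random_test_data(n_samples=100, n_pheno=3, n_covar=2, seed=0):
    """Deterministic fixtures: a fixed seed keeps the simulated binary phenotypes
    from accidentally being perfectly separable by the covariates."""
    rng = np.random.default_rng(seed)
    phenotype_df = pd.DataFrame(rng.integers(0, 2, (n_samples, n_pheno)).astype(float))
    covariate_df = pd.DataFrame(rng.standard_normal((n_samples, n_covar)))
    return phenotype_df, covariate_df
```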

@henrydavidge force-pushed the logistic-regression-pandas branch from e05a351 to 3ea40cd on December 23, 2020
@henrydavidge merged commit e1b52e4 into projectglow:master on Dec 23, 2020
bcajes pushed a commit to bcajes/glow that referenced this pull request Sep 27, 2021
* initial work

Signed-off-by: Henry D <henrydavidge@gmail.com>

* add file

Signed-off-by: Henry D <henrydavidge@gmail.com>

* working score test

Signed-off-by: Henry D <henrydavidge@gmail.com>

* seems to work

Signed-off-by: Henry D <henrydavidge@gmail.com>

* continue

Signed-off-by: Henry D <henrydavidge@gmail.com>

* offset support; more tests

Signed-off-by: Henry D <henrydavidge@gmail.com>

* delete lin_reg.py

Signed-off-by: Henry D <henrydavidge@gmail.com>

* add docs, few more tests

Signed-off-by: Henry D <henrydavidge@gmail.com>

* add test file

Signed-off-by: Henry D <henrydavidge@gmail.com>

* fix last test

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Fix docs, tests

Signed-off-by: Henry D <henrydavidge@gmail.com>

* memory limit

Signed-off-by: Henry D <henrydavidge@gmail.com>

* try explicitly broadcasting

Signed-off-by: Henry D <henrydavidge@gmail.com>

* update environment

Signed-off-by: Henry D <henrydavidge@gmail.com>

* undo explicit broadcast

Signed-off-by: Henry D <henrydavidge@gmail.com>

* fix typo

Signed-off-by: Henry D <henrydavidge@gmail.com>

* formatting; karen's comment

Signed-off-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: brian cajes <brian@empiricotx.com>