Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref: simphenotype #64

Merged
merged 45 commits into from
Jul 20, 2022
Merged

ref: simphenotype #64

merged 45 commits into from
Jul 20, 2022

Conversation

aryarm
Copy link
Member

@aryarm aryarm commented Jul 2, 2022

Overview

This PR adapts the simphenotype subcommand to work with the new .hap file format.

It adds a new PhenoSimulator class that uses the data.Genotypes, data.Phenotypes, and data.Haplotype classes from the data module to create a PLINK2-compatible .pheno file.

Usage and docs

The simphenotype command docs can be found here.

Usage of the PhenoSimulator class is documented in the API docs here. The class uses a subclass of the data.Haplotype class which is documented here.

I've also made some changes to the other docs. The biggest change is that the file format section is now more similar to PLINK's documentation: each type of input has its own page within the section.

Details

The PhenoSimulator class

The new class is initialized with a data.Genotypes instance containing transformed haplotypes or regular variants. At the time of initialization, it creates an internal data.Phenotypes instance into which it stores any phenotypes that it generates. To create phenotypes, one need only call the data.PhenoSimulator.run() method. Here's its signature.

run(self, effects: list[Haplotype], heritability: float = 1, prevalence: float = None) -> npt.NDArray

The run() method generates phenotypes from a list of sim_phenotype.Haplotype objects, where each haplotype is encoded as an independent causal variable in the linear model.

$$ \vec{y} = \sum_j \beta_j \vec{Z_j} + \vec \epsilon $$

where

$$ \epsilon_i \sim N(0, \sigma^2) $$

and

$$ \sigma^2 = Var[\sum_j \beta_j \vec{Z_j}] * (\frac 1 {h^2} - 1) $$

depends on heritability $h^2$, which is an input to the model (with a default of 1). Users can also specify a prevalence for the disease if it should be modeled as case/control.
The final phenotypes are returned as a numpy array. They are also stored in the internal phenotypes object for safe-keeping.

The data.Genotypes class

The most notable change is that this class now has an index() and subset() method. The subset() method allows for pandas-style subsetting of the entire class by variant and sample IDs. The index() method performs some indexing internally to improve the amortized cost of the subsetting operation. The first time subset() is called, it will index the instance using index(). Subsequent calls to subset() will automatically utilize the stored index.

The data.Haplotypes class

The transform() method of the data.Haplotypes class will now utilize the data.Genotypes.subset() method for improved speed.

The data.Phenotypes and data.Covariates classes

I rewrote much of the data.Phenotypes and data.Covariates class to have them use PLINK2-compatible .pheno and .covar files. Perhaps most notably, the data.Covariates class is a subclass of data.Phenotypes now, and both classes can store more than one phenotype/covariate. There's also a new data.Phenotypes.append() method for adding another phenotype to an existing data.Phenotypes instance.

Testing

I added the following tests to a new TestSimPhenotype class in tests/test_simphenotype.py:

  1. test_one_hap_zero_noise()
    Try the run() method with a single haplotype.
  2. test_one_hap_zero_noise_neg_beta()
    Try the run() method with a single haplotype and a negative effect size.
  3. test_two_haps_zero_noise()
    Try the run() method twice, each with one haplotype.
  4. test_combined_haps_zero_noise()
    Try the run() method with two independent effects from two haplotypes.
  5. test_noise()
    The previous test used a heritability value of 1. This test does the same thing, but with decreasing heritability values.
  6. test_case_control()
    Perform the previous test but generate case/control phenotypes this time.

I also added tests to the existing TestGenotypes class in tests/test_data.py:

  1. test_subset_genotypes
    Test the data.Genotypes.subset() method on various combinations of parameters.

Future work

I still need to verify that the phenotypes we generate look good in a Manhattan plot.

In the future, we don't want to generate transformed haplotypes within the simphenotype command. Instead, we will use the output of the transform command as input to the simphenotype command. And we'll utilize the local ancestry information in the transform command. (Currently, the local ancestry info is just ignored.)

We may want to implement an alternative way to simulate case/control phenotypes in the future. Currently, the user specifies a fraction of samples that should be positive (aka the prevalence parameter), and we just convert the quantitative phenotypes to booleans by thresholding the quantitative phenotypes accordingly. But we might want to consider using a logistic regression model to generate the phenotypes, instead. It's unclear to me how the resulting phenotypes might be different, but one thing to note is that errors in a logistic regression would be modeled via a binomial distribution and currently they are modeled via a normal distribution.

An extension of the Genotypes class could allow us to load STR genotypes using trtools. Phenotype simulation of STRs should come automatically with that change.

There are a lot of improvements I want to make to classes in the data module, as well. See #19 and #49 -- not to mention that I'd like to add full support for PLINK2 files after I merge #16.

Eventually, it might be nice to support interaction terms in the phenotype simulation (see #4).

aryarm added 30 commits May 13, 2022 22:38
and make covariates class a subclass of phenotypes
see #19 for more details
remove @mlamkin7's email so he doesn't get spam from bots that scrape repositories
@aryarm aryarm marked this pull request as ready for review July 8, 2022 19:37
@aryarm aryarm requested a review from gymreklab July 8, 2022 19:37
dunno why I named them like that originally - must not have been thinking
@aryarm
Copy link
Member Author

aryarm commented Jul 8, 2022

@gymreklab, it's ready for you!

I would appreciate any comments you might have, but specifically, it would be helpful if you could look at the sim_phenotype.PhenoSimulator.run() method, specifically the way that I handle situations where we don't want to have to specify a heritability value. I'm worried this doesn't amount to what we discussed originally.

@gymreklab
Copy link
Collaborator

I think it is looking good!

Some quick comments:

  • It says for heritability: "If not provided, this will be estimated from the variability of the genotypes" but it seems by default to be set to 1. Even if we do estimate it, I was thinking we would estimate it from the effect sizes, not the genotypes.

  • It seems we only have the GCTA version for now, right?

  • We should do some basic sanity checks. e.g. if we simulate a trait with a certain h2, and then we go estiamte its h2, we should get a similar number.

@aryarm
Copy link
Member Author

aryarm commented Jul 19, 2022

It says for heritability...

Ok, this has been fixed. It now says "If not provided, it will be computed from the sum of the squared effect sizes"

It seems we only have the GCTA version for now, right?

The other version should be implemented now!

We should do some basic sanity checks. e.g. if we simulate a trait with a certain h2, and then we go estiamte its h2, we should get a similar number

I'm thinking of doing this within the haptools-paper repo instead of as part of our tests here, since I'm not sure how to define "similar" in an automated test. The haptools-paper repo is meant for some of the pipeline/analysis/sanity check work. Would that be ok?

@aryarm
Copy link
Member Author

aryarm commented Jul 19, 2022

@gymreklab, regarding your third point about sanity checks, here's the plot you asked for:
haptools-simphenotype-heritability
It looks like the generated heritabilities consistently overestimate the desired heritability at large values, but otherwise, we seem stick to it.

@aryarm
Copy link
Member Author

aryarm commented Jul 19, 2022

In any case, I think we're ready to merge now!

@aryarm aryarm removed the request for review from gymreklab July 20, 2022 04:55
@aryarm aryarm merged commit c5c1339 into main Jul 20, 2022
@aryarm aryarm deleted the ref/simphenotype branch July 20, 2022 05:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants