ref: simphenotype #64
dunno why I named them like that originally - must not have been thinking
@gymreklab, it's ready for you! I would appreciate any comments you might have, but specifically, it would be helpful if you could look at the
I think it is looking good! Some quick comments:
Ok, this has been fixed. It now says "If not provided, it will be computed from the sum of the squared effect sizes"
The other version should be implemented now!
I'm thinking of doing this within the haptools-paper repo instead of as part of our tests here, since I'm not sure how to define "similar" in an automated test. The haptools-paper repo is meant for some of the pipeline/analysis/sanity check work. Would that be ok?
@gymreklab, regarding your third point about sanity checks, here's the plot you asked for:
In any case, I think we're ready to merge now!
Overview
This PR adapts the `simphenotype` subcommand to work with the new `.hap` file format.
It adds a new `PhenoSimulator` class that uses the `data.Genotypes`, `data.Phenotypes`, and `data.Haplotype` classes from the `data` module to create a PLINK2-compatible `.pheno` file.
Usage and docs
The `simphenotype` command docs can be found here.
Usage of the `PhenoSimulator` class is documented in the API docs here. The class uses a subclass of the `data.Haplotype` class which is documented here.
I've also made some changes to the other docs. The biggest change is that the file format section is now more similar to PLINK's documentation: each type of input has its own page within the section.
Details
The `PhenoSimulator` class
The new class is initialized with a `data.Genotypes` instance containing transformed haplotypes or regular variants. At initialization, it creates an internal `data.Phenotypes` instance in which it stores any phenotypes that it generates. To create phenotypes, one need only call the `data.PhenoSimulator.run()` method. Here's its signature.
The `run()` method generates phenotypes from a list of `sim_phenotype.Haplotype` objects, where each haplotype is encoded as an independent causal variable in the linear model
$$\vec{y} = \sum_j \beta_j \vec{Z_j} + \vec{\epsilon}$$
where $\epsilon_i \sim N(0, \sigma^2)$ and $\sigma^2$ depends on heritability $h^2$, which is an input to the model (with a default of 1). Users can also specify a prevalence for the disease if it should be modeled as case/control.
The final phenotypes are returned as a numpy array. They are also stored in the internal phenotypes object for safe-keeping.
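To make the model above concrete, here is a minimal numpy sketch of a linear phenotype simulation. It is not the code from this PR: the function name `simulate_phenotypes` and its parameters are hypothetical, and the noise variance is set under the assumption that $h^2$ is the fraction of phenotypic variance explained by the genetic component, i.e. $\sigma^2 = \mathrm{Var}(g)\,(1/h^2 - 1)$. The PR may derive $\sigma^2$ differently.

```python
import numpy as np


def simulate_phenotypes(haps, betas, heritability=1.0, seed=None):
    """Simulate y = sum_j beta_j * Z_j + epsilon for each sample (illustrative only)."""
    if not 0 < heritability <= 1:
        raise ValueError("heritability must be in (0, 1]")
    haps = np.asarray(haps, dtype=float)    # shape: (n_samples, n_haplotypes)
    betas = np.asarray(betas, dtype=float)  # shape: (n_haplotypes,)
    # Genetic component: each haplotype contributes an independent additive effect
    genetic = haps @ betas
    # ASSUMPTION: h^2 = Var(genetic) / (Var(genetic) + sigma^2),
    # so a heritability of 1 yields zero noise
    noise_var = np.var(genetic) * (1 / heritability - 1)
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(noise_var), size=genetic.shape)
    return genetic + noise


# Example: 1000 samples, two haplotypes with opposite effects
example_rng = np.random.default_rng(0)
example_haps = example_rng.integers(0, 3, size=(1000, 2))
example_y = simulate_phenotypes(example_haps, betas=[0.5, -0.25], heritability=0.8, seed=1)
```

With the default heritability of 1, the noise term vanishes and the phenotype is an exact linear function of the haplotype dosages, which is the situation the zero-noise tests rely on.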
The `data.Genotypes` class
The most notable change is that this class now has `index()` and `subset()` methods. The `subset()` method allows for pandas-style subsetting of the entire class by variant and sample IDs. The `index()` method performs some indexing internally to improve the amortized cost of the subsetting operation. The first time `subset()` is called, it will index the instance using `index()`. Subsequent calls to `subset()` will automatically utilize the stored index.
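The index-once, subset-many pattern described here can be sketched with a small stand-in class. This is an illustration only: `IndexedGenotypes` and its attributes are hypothetical and do not reproduce the `data.Genotypes` implementation. The idea is that `index()` builds ID-to-position dictionaries once, so every later `subset()` call is a dictionary lookup rather than a linear scan.

```python
import numpy as np


class IndexedGenotypes:
    """Hypothetical stand-in for a genotype matrix with sample and variant IDs."""

    def __init__(self, data, samples, variant_ids):
        self.data = np.asarray(data)  # shape: (n_samples, n_variants)
        self.samples = tuple(samples)
        self.variant_ids = tuple(variant_ids)
        self._samp_idx = None  # built lazily by index()
        self._var_idx = None

    def index(self):
        # One O(n) pass; afterwards each ID lookup is O(1)
        self._samp_idx = {s: i for i, s in enumerate(self.samples)}
        self._var_idx = {v: i for i, v in enumerate(self.variant_ids)}

    def subset(self, samples=None, variants=None):
        # Index on first use, then reuse the stored index on later calls
        if self._samp_idx is None or self._var_idx is None:
            self.index()
        if samples is None:
            rows = list(range(len(self.samples)))
        else:
            rows = [self._samp_idx[s] for s in samples]
        if variants is None:
            cols = list(range(len(self.variant_ids)))
        else:
            cols = [self._var_idx[v] for v in variants]
        return IndexedGenotypes(
            self.data[np.ix_(rows, cols)],
            [self.samples[i] for i in rows],
            [self.variant_ids[j] for j in cols],
        )


gts = IndexedGenotypes(np.arange(12).reshape(3, 4), ["a", "b", "c"], ["v0", "v1", "v2", "v3"])
sub = gts.subset(samples=["c", "a"], variants=["v1", "v3"])
```

As with pandas-style label indexing, the subset follows the order of the requested IDs rather than the order in the original matrix.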
The `data.Haplotypes` class
The `transform()` method of the `data.Haplotypes` class will now utilize the `data.Genotypes.subset()` method for improved speed.
The `data.Phenotypes` and `data.Covariates` classes
I rewrote much of the `data.Phenotypes` and `data.Covariates` classes to have them use PLINK2-compatible `.pheno` and `.covar` files. Perhaps most notably, the `data.Covariates` class is now a subclass of `data.Phenotypes`, and both classes can store more than one phenotype/covariate. There's also a new `data.Phenotypes.append()` method for adding another phenotype to an existing `data.Phenotypes` instance.
Testing
I added the following tests to a new `TestSimPhenotype` class in `tests/test_simphenotype.py`:
- `test_one_hap_zero_noise()`: Try the `run()` method with a single haplotype.
- `test_one_hap_zero_noise_neg_beta()`: Try the `run()` method with a single haplotype and a negative effect size.
- `test_two_haps_zero_noise()`: Try the `run()` method twice, each with one haplotype.
- `test_combined_haps_zero_noise()`: Try the `run()` method with two independent effects from two haplotypes.
- `test_noise()`: The previous test used a heritability value of 1. This test does the same thing, but with decreasing heritability values.
- `test_case_control()`: Perform the previous test but generate case/control phenotypes this time.
I also added tests to the existing `TestGenotypes` class in `tests/test_data.py`:
- `test_subset_genotypes`: Test the `data.Genotypes.subset()` method on various combinations of parameters.
Future work
I still need to verify that the phenotypes we generate look good in a Manhattan plot.
In the future, we don't want to generate transformed haplotypes within the `simphenotype` command. Instead, we will use the output of the `transform` command as input to the `simphenotype` command, and we'll utilize the local ancestry information in the `transform` command. (Currently, the local ancestry info is just ignored.)
We may want to implement an alternative way to simulate case/control phenotypes in the future. Currently, the user specifies the fraction of samples that should be positive (the `prevalence` parameter), and we convert the quantitative phenotypes to booleans by thresholding them accordingly. But we might want to consider using a logistic regression model to generate the phenotypes instead. It's unclear to me how the resulting phenotypes might differ, but one thing to note is that errors in a logistic regression would be modeled via a binomial distribution, whereas currently they are modeled via a normal distribution.
An extension of the `Genotypes` class could allow us to load STR genotypes using `trtools`. Phenotype simulation of STRs should come automatically with that change.
There are a lot of improvements I want to make to classes in the `data` module, as well. See #19 and #49. I'd also like to add full support for PLINK2 files after I merge #16.
Eventually, it might be nice to support interaction terms in the phenotype simulation (see #4).
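For reference, the current thresholding approach to case/control phenotypes mentioned above can be sketched roughly as follows. The helper name `to_case_control` is hypothetical, and the sketch assumes that the samples with the highest quantitative phenotypes become the cases, which this PR description does not state explicitly.

```python
import numpy as np


def to_case_control(phenotypes, prevalence):
    """Label the top `prevalence` fraction of samples as cases (illustrative only)."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be in (0, 1)")
    phenotypes = np.asarray(phenotypes, dtype=float)
    n_samples = phenotypes.shape[0]
    n_cases = int(round(prevalence * n_samples))
    # ASSUMPTION: larger quantitative values correspond to cases
    order = np.argsort(phenotypes)
    cases = np.zeros(n_samples, dtype=bool)
    cases[order[n_samples - n_cases:]] = True
    return cases


labels = to_case_control([0.1, 2.0, -1.0, 0.5, 3.0], prevalence=0.4)
```

Because the labels are a deterministic function of the quantitative phenotype, all of the randomness still comes from the normally distributed noise term, which is the difference from a logistic model noted above.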