ref: simphenotype #64

aryarm · 2022-07-02T22:07:53Z

Overview

This PR adapts the simphenotype subcommand to work with the new .hap file format.

It adds a new PhenoSimulator class that uses the data.Genotypes, data.Phenotypes, and data.Haplotype classes from the data module to create a PLINK2-compatible .pheno file.

Usage and docs

The simphenotype command docs can be found here.

Usage of the PhenoSimulator class is documented in the API docs here. The class uses a subclass of the data.Haplotype class which is documented here.

I've also made some changes to the other docs. The biggest change is that the file format section is now more similar to PLINK's documentation: each type of input has its own page within the section.

Details

The `PhenoSimulator` class

The new class is initialized with a data.Genotypes instance containing transformed haplotypes or regular variants. At initialization, it creates an internal data.Phenotypes instance in which it stores any phenotypes that it generates. To create phenotypes, one need only call the sim_phenotype.PhenoSimulator.run() method. Here's its signature.

run(self, effects: list[Haplotype], heritability: float = 1, prevalence: float = None) -> npt.NDArray

The run() method generates phenotypes from a list of sim_phenotype.Haplotype objects, where each haplotype is encoded as an independent causal variable in the linear model.

$$ \vec{y} = \sum_j \beta_j \vec{Z_j} + \vec \epsilon $$

where

$$ \epsilon_i \sim N(0, \sigma^2) $$

and

$$ \sigma^2 = Var[\sum_j \beta_j \vec{Z_j}] * (\frac 1 {h^2} - 1) $$

The noise variance $\sigma^2$ depends on the heritability $h^2$, which is an input to the model. With the default of $h^2 = 1$, $\sigma^2 = 0$, so no noise is added. Users can also specify a prevalence for the disease if it should be modeled as case/control.
The final phenotypes are returned as a numpy array. They are also stored in the internal phenotypes object for safe-keeping.
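
For reviewers who want to see the model numerically, here is a minimal sketch of the equations above. It is not the actual PhenoSimulator.run() code: the function name `simulate_phenotypes`, its arguments, and the 2D layout of `Z` (samples by haplotypes) are assumptions made for illustration.

```python
import numpy as np


def simulate_phenotypes(Z, betas, heritability=1.0, seed=None):
    """Hypothetical helper: y = sum_j beta_j * Z_j + eps.

    Z is an (n_samples, n_haplotypes) array of haplotype dosages and
    betas holds one effect size per haplotype.
    """
    if not 0 < heritability <= 1:
        raise ValueError("heritability must be in (0, 1]")
    rng = np.random.default_rng(seed)
    Z = np.asarray(Z, dtype=float)
    betas = np.asarray(betas, dtype=float)
    # Genetic component: sum_j beta_j * Z_j
    genetic = Z @ betas
    # sigma^2 = Var[sum_j beta_j * Z_j] * (1 / h^2 - 1)
    noise_var = genetic.var() * (1.0 / heritability - 1.0)
    # With heritability == 1 the scale is 0, so the noise is all zeros
    noise = rng.normal(0.0, np.sqrt(noise_var), size=genetic.shape)
    return genetic + noise
```

With the default heritability of 1, the output equals the genetic component exactly, which is the situation the "zero noise" tests below rely on.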

The `data.Genotypes` class

The most notable change is that this class now has index() and subset() methods. The subset() method allows for pandas-style subsetting of the entire class by variant and sample IDs. The index() method builds an internal index to improve the amortized cost of the subsetting operation. The first time subset() is called, it will index the instance using index(). Subsequent calls to subset() will automatically reuse the stored index.
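
The index-once, subset-many pattern described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `IndexedMatrix` is a hypothetical stand-in with a 2D matrix, and the real data.Genotypes class and its method signatures may differ.

```python
import numpy as np


class IndexedMatrix:
    """Hypothetical stand-in for a genotype matrix with sample and variant IDs."""

    def __init__(self, data, samples, variants):
        self.data = np.asarray(data)
        self.samples = tuple(samples)
        self.variants = tuple(variants)
        self._samp_idx = None
        self._var_idx = None

    def index(self):
        # Build ID -> position lookups once so later subsets avoid linear scans
        self._samp_idx = {s: i for i, s in enumerate(self.samples)}
        self._var_idx = {v: i for i, v in enumerate(self.variants)}

    def subset(self, samples=None, variants=None):
        if self._samp_idx is None:
            self.index()  # only the first call pays the indexing cost
        samples = self.samples if samples is None else samples
        variants = self.variants if variants is None else variants
        rows = [self._samp_idx[s] for s in samples]
        cols = [self._var_idx[v] for v in variants]
        # np.ix_ selects the requested rows and columns in the requested order
        return IndexedMatrix(self.data[np.ix_(rows, cols)], samples, variants)
```

Calling `subset(samples=["s2"], variants=["v3", "v1"])` on such an object returns a new object with the rows and columns in the requested order, and any later call reuses the stored lookups.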

The `data.Haplotypes` class

The transform() method of the data.Haplotypes class will now utilize the data.Genotypes.subset() method for improved speed.

The `data.Phenotypes` and `data.Covariates` classes

I rewrote much of the data.Phenotypes and data.Covariates classes to have them use PLINK2-compatible .pheno and .covar files. Perhaps most notably, the data.Covariates class is now a subclass of data.Phenotypes, and both classes can store more than one phenotype/covariate. There's also a new data.Phenotypes.append() method for adding another phenotype to an existing data.Phenotypes instance.
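
To make the multi-phenotype layout concrete, here is a rough sketch of a container that appends phenotypes as columns and writes a tab-delimited file with a `#IID` header, which is one of the header forms PLINK2 accepts. The class name `PhenoTable` and its methods are hypothetical and do not mirror the actual data.Phenotypes API.

```python
import io

import numpy as np


class PhenoTable:
    """Hypothetical container: one row per sample, one column per phenotype."""

    def __init__(self, samples):
        self.samples = tuple(samples)
        self.names = ()
        self.data = np.empty((len(self.samples), 0))

    def append(self, name, values):
        values = np.asarray(values, dtype=float)
        if values.shape != (len(self.samples),):
            raise ValueError("expected one value per sample")
        if name in self.names:
            raise ValueError(f"duplicate phenotype name: {name}")
        self.names += (name,)
        # Add the new phenotype as an extra column
        self.data = np.column_stack([self.data, values])

    def write(self, handle):
        # Header line: '#IID' followed by the phenotype names
        handle.write("\t".join(("#IID",) + self.names) + "\n")
        for sample, row in zip(self.samples, self.data):
            fields = [sample] + [f"{v:.6g}" for v in row]
            handle.write("\t".join(fields) + "\n")


table = PhenoTable(["s1", "s2"])
table.append("height", [1.5, 2])
table.append("bmi", [0, 1])
buffer = io.StringIO()
table.write(buffer)
```

Rejecting duplicate names in `append()` is a design choice made here so that the written header never contains two columns with the same name.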

Testing

I added the following tests to a new TestSimPhenotype class in tests/test_simphenotype.py:

- test_one_hap_zero_noise(): Try the run() method with a single haplotype.
- test_one_hap_zero_noise_neg_beta(): Try the run() method with a single haplotype and a negative effect size.
- test_two_haps_zero_noise(): Try the run() method twice, each with one haplotype.
- test_combined_haps_zero_noise(): Try the run() method with two independent effects from two haplotypes.
- test_noise(): The previous test used a heritability value of 1. This test does the same thing, but with decreasing heritability values.
- test_case_control(): Perform the previous test but generate case/control phenotypes this time.

I also added tests to the existing TestGenotypes class in tests/test_data.py:

- test_subset_genotypes: Test the data.Genotypes.subset() method on various combinations of parameters.

Future work

I still need to verify that the phenotypes we generate look good in a Manhattan plot.

In the future, we don't want to generate transformed haplotypes within the simphenotype command. Instead, we will use the output of the transform command as input to the simphenotype command. And we'll utilize the local ancestry information in the transform command. (Currently, the local ancestry info is just ignored.)

We may want to implement an alternative way to simulate case/control phenotypes in the future. Currently, the user specifies a fraction of samples that should be positive (aka the prevalence parameter), and we just convert the quantitative phenotypes to booleans by thresholding the quantitative phenotypes accordingly. But we might want to consider using a logistic regression model to generate the phenotypes, instead. It's unclear to me how the resulting phenotypes might be different, but one thing to note is that errors in a logistic regression would be modeled via a binomial distribution and currently they are modeled via a normal distribution.
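
The thresholding approach described above could look roughly like this. It is a sketch under assumptions, not the code in this PR: `to_case_control` is a hypothetical helper, and using `np.quantile` to pick the cutoff is my own choice for illustration.

```python
import numpy as np


def to_case_control(phenotypes, prevalence):
    """Hypothetical helper: mark roughly the top `prevalence` fraction as cases."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be between 0 and 1")
    phenotypes = np.asarray(phenotypes, dtype=float)
    # Samples above the (1 - prevalence) quantile become cases (True)
    threshold = np.quantile(phenotypes, 1 - prevalence)
    return phenotypes > threshold


quantitative = np.arange(10.0)
cases = to_case_control(quantitative, prevalence=0.3)
```

Ties at the threshold or a small sample size can make the realized case fraction differ slightly from the requested prevalence.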

An extension of the Genotypes class could allow us to load STR genotypes using trtools. Phenotype simulation of STRs should come automatically with that change.

There are a lot of improvements I want to make to classes in the data module, as well. See #19 and #49 -- not to mention that I'd like to add full support for PLINK2 files after I merge #16.

Eventually, it might be nice to support interaction terms in the phenotype simulation (see #4).

aryarm · 2022-07-08T20:58:31Z

@gymreklab, it's ready for you!

I would appreciate any comments you might have, but it would be especially helpful if you could look at the sim_phenotype.PhenoSimulator.run() method, particularly the way that I handle situations where we don't want to have to specify a heritability value. I'm worried this doesn't amount to what we discussed originally.

gymreklab · 2022-07-18T16:34:58Z

I think it is looking good!

Some quick comments:

It says for heritability: "If not provided, this will be estimated from the variability of the genotypes" but it seems by default to be set to 1. Even if we do estimate it, I was thinking we would estimate it from the effect sizes, not the genotypes.
It seems we only have the GCTA version for now, right?
We should do some basic sanity checks, e.g. if we simulate a trait with a certain h2 and then go estimate its h2, we should get a similar number.

aryarm · 2022-07-19T06:51:56Z

It says for heritability...

Ok, this has been fixed. It now says "If not provided, it will be computed from the sum of the squared effect sizes"

It seems we only have the GCTA version for now, right?

The other version should be implemented now!

We should do some basic sanity checks, e.g. if we simulate a trait with a certain h2 and then go estimate its h2, we should get a similar number.

I'm thinking of doing this within the haptools-paper repo instead of as part of our tests here, since I'm not sure how to define "similar" in an automated test. The haptools-paper repo is meant for some of the pipeline/analysis/sanity check work. Would that be ok?
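
As a rough idea of what such a sanity check might look like, here is a small sketch. Everything in it is an assumption for illustration: the genotypes are simulated rather than real, `estimate_h2` is a hypothetical helper that uses the R^2 of an ordinary least-squares fit as a crude stand-in for a proper heritability estimator, and the sample size and effect sizes are arbitrary.

```python
import numpy as np


def estimate_h2(Z, y):
    """Hypothetical helper: R^2 of a least-squares fit of y on Z plus an intercept."""
    X = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1.0 - residuals.var() / y.var()


rng = np.random.default_rng(0)
n_samples = 5000
# Simulated dosages (0, 1, or 2) for two causal haplotypes
Z = rng.binomial(2, 0.3, size=(n_samples, 2)).astype(float)
betas = np.array([0.4, -0.2])
genetic = Z @ betas

results = {}
for target_h2 in (0.2, 0.5, 0.8):
    # Same noise model as in the PR description
    sigma2 = genetic.var() * (1 / target_h2 - 1)
    y = genetic + rng.normal(0, np.sqrt(sigma2), size=n_samples)
    results[target_h2] = estimate_h2(Z, y)
```

Comparing each value in `results` against its key gives one possible definition of "similar", with a tolerance that would need to shrink as the sample size grows.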

aryarm · 2022-07-19T18:31:13Z

@gymreklab, regarding your third point about sanity checks, here's the plot you asked for:

It looks like the generated heritabilities consistently overestimate the desired heritability at large values, but otherwise, we seem to stick to it.

aryarm · 2022-07-19T18:39:12Z

In any case, I think we're ready to merge now!

aryarm added 30 commits May 13, 2022 22:38

start on simphenotype refactor

0116bbf

remove original simphenotype code

4fbb293

continue adding support for other file formats

30e5949

refmt with black

06e3806

move transform subcommand to its own module

52d90a0

setup PhenoSimulator.__init__ method

e3c9df9

add Genotypes.subset() method

49341e6

also allow indexing in GenotypesRefAlt class

a0ae5e1

simplify the transform method of the Haplotype and Haplotypes classes

f2a3017

see #19 (comment)

add region and samples params to simphenotype

a13f94a

raise error if any samples or variants are invalid in transform subcommand

fa55ee8

copy transform subroutine to simphenotypes

5542ccb

rename covar and pheno files

def4826

describe phenotypes in a separate page of the docs

bffd27c

support reading PLINK2-style pheno and covar files

afcdda5

and make covariates class a subclass of phenotypes see #19 for more details

clarify transform documentation

2849dca

add Phenotypes.write() method

a76e1a6

refmt with black

80bf943

prelim finish PhenoSimulator.run

d249206

try simulating phenotypes from multiple haplotypes, additively

e00ff2b

merge haplotype module with sim_phenotypes module

05430ee

compute heritability when not provided in simphenotype

0636294

implement case/control

54f48de

improve commenting in __main__

7fba06c

rename sim_phenotypes module to sim_phenotype (singular)

6c9d7ce

create test module for simphenotype

c6f1c5c

try to use new pysam add_samples() method to speed up vcf creation

1d96775

see pysam-developers/pysam#1104

use default heritability of 1 and add append method to Phenotypes class

23a067a

clarify installation and contribution guidelines

30852bc

remove @mlamkin7's email so he doesn't get spam from bots that scrape repositories

seed the simphenotype random number generator

fe63e32

aryarm added 6 commits July 8, 2022 08:53

mark case/control in pheno output

24fef39

add tests for simphenotype

9de40ee

still need to revise case/control test

finish testing case/control simphenotype

9a616fd

ensure written pheno file has unique names

b7e0dc0

refmt with black

f02cb93

add docs for phenosimulator to API section

d2f144c

aryarm marked this pull request as ready for review July 8, 2022 19:37

aryarm requested a review from gymreklab July 8, 2022 19:37

rename phenotype simulation tests

7291380

dunno why I named them like that originally - must not have been thinking

aryarm added 2 commits July 14, 2022 09:00

revise simphenotype command docs after changes to CLI

81e48a3

oops - fix typos

8b5de5c

aryarm added 2 commits July 18, 2022 23:35

add non-gcta model to simphenotype

b237872

switch if else in simphenotype to fix bug

9f9d0ea

aryarm added 2 commits July 19, 2022 09:10

adjust logging of info in PhenoSimulator.run()

6aefc33

warn about case/control encoding in simphenotype docs

38931d7

log expected heritability and expand docs

3f9af3a

aryarm removed the request for review from gymreklab July 20, 2022 04:55

fix conflicts with main branch

0359ced

aryarm merged commit c5c1339 into main Jul 20, 2022

aryarm deleted the ref/simphenotype branch July 20, 2022 05:19
