diff --git a/README.md b/README.md index 678dbeff..f8692eab 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,10 @@ Homepage: https://haptools.readthedocs.io/ ## Installation -UNDER CONSTRUCTION +We have not officially published `haptools` yet, but in the meantime, you can install it directly from our Github repository. +```bash +pip install git+https://github.com/gymrek-lab/haptools.git +``` ## Haptools utilities @@ -21,12 +24,13 @@ Haptools consists of multiple utilities listed below. Click on a utility to see * [`haptools karyogram`](docs/commands/karyogram.md): Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by `haptools simgenome`. +* [`haptools transform`](https://haptools.readthedocs.io/en/latest/commands/transform.html): Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants. + Outputs produced by these utilities are compatible with each other. For example `haptools simgenome` outputs a VCF file with local ancestry information annotated for each variant. The output VCF file can be used as input to `haptools simphenotype` to simulate phenotype information. `haptools simgenome` also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`. ## Contributing -If you are interested in contributing to `haptools`, please get in touch by submitting a Github issue or contacting us at mlamkin@ucsd.edu. - - +We gladly welcome any contributions to `haptools`! +Please read [our contribution guidelines](https://haptools.readthedocs.io/en/latest/project_info/contributing.html) and then submit a [Github issue](https://github.com/gymrek-lab/haptools/issues). diff --git a/docs/api/haptools.rst b/docs/api/haptools.rst index 66d0bab8..ce092e1d 100644 --- a/docs/api/haptools.rst +++ b/docs/api/haptools.rst @@ -52,8 +52,16 @@ haptools.data.haplotypes module :undoc-members: :show-inheritance: +haptools.sim_phenotype module +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. automodule:: haptools.sim_phenotype + :members: + :undoc-members: + :show-inheritance: + haptools.sim_genotype module -~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. automodule:: haptools.sim_genotype :members: diff --git a/docs/commands/karyogram.md b/docs/commands/karyogram.md index 706f2419..9b898e40 100644 --- a/docs/commands/karyogram.md +++ b/docs/commands/karyogram.md @@ -1,4 +1,4 @@ -# Haptools karyogram +# karyogram `haptools karyogram` takes as input a breakpoints file (e.g. as output by `haptools simgenotype`) and a sample name, and plots a karyogram depicting local ancestry tracks. diff --git a/docs/commands/simphenotype.md b/docs/commands/simphenotype.md deleted file mode 100644 index 629a95be..00000000 --- a/docs/commands/simphenotype.md +++ /dev/null @@ -1,76 +0,0 @@ -# simphenotype - -Haptools simphenotype simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. It takes causal effects and genotypes as input and outputs simulated phenotypes. - -Usage is modeled based on the [GCTA GWAS Simulation](https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation) utility. - -## Usage - -Below is a basic `haptools simphenotype` command: - -``` -haptools simphenotype \ - --vcf \ - --hap \ - --out \ - < --simu-qt | --simu-cc > \ - [simulation options] -``` - -Required parameters: - -* `--vcf `: A bgzipped, tabix-indexed, phased VCF file. If you are simulating local-ancestry effects, the VCF file must contain the `FORMAT/LA` tag included in output of `haptools simgenotype`. See [haptools file formats](../../docs/formats/inputs.rst) for more details. - -* `--hap `: A bgzipped, tabix-indexed HAP file, which specifies causal effects. This is a custom format described in more detail in [haptools file formats](../../docs/formats/haplotypes.rst). The HAP format enables flexible specification of a range of effect types including traditional variant-level effects, haplotype-level effects, associations with repeat lengths at short tandem repeats, and interaction of these effects with local ancestry labels. See [Examples](#examples) below for detailed examples of how to specify effects. - -* `--out `: Prefix to name output files. - -* `--simu-qt` or `simu-cc` indicate whether to simulate a quantitative or case control trait. One of these two options must be specified. - -Additional parameters: - -* `--simu-rep `: Number of phenotypes to simulate. Default: 1. - -* `--simu-hsq `: Trait heritability. Default: 0.1. - -* `--simu-k `: Disease prevalence (for case-control). Default: 0.1 - -## Output files - -`haptools simphenotypes` outputs the following files: - -* `.phen`: Based on the phenotype files accepted by [plink](https://www.cog-genomics.org/plink/1.9/input#pheno). Tab-delimited file with one row per sample. The first and second columns give the sample ID. The third column gives the simulated phenotype. If `--simu-rep` was set to greater than 1, additional columns are output for each simulated trait. Example file: - -``` -HG00096 HG00096 0.058008375906919506 -HG00097 HG00097 0.03472768471423458 -HG00099 HG00099 -0.20850127859705808 -HG00100 HG00100 -0.21206803352471154 -HG00101 HG00101 0.3157913763428323 -``` - -* `.par`: summarizes the frequencies and effects of simulated haplotypes. The columns are: haplotype ID (from the HAP file), haplotype frequency, and effect. Example file: - -``` -Haplotype Frequency Effect -H-001 0.6 -0.2 -``` - - -## Examples - -#### Simulate a single haplotype-effect based on a 2 SNP haplotype: - -``` -haptools simphenotype \ - --vcf tests/data/simple.vcf.gz \ - --hap tests/data/simple.hap.gz \ - --out test \ - --simu-qt --simu-hsq 0.8 --simu-rep 1 -``` - -based on this HAP file (available in `tests/data`) - -``` -H-001 1 10114 10116 1:10114:T:C,1:10116:A:G T,G * -0.2 -``` \ No newline at end of file diff --git a/docs/commands/simphenotype.rst b/docs/commands/simphenotype.rst index 5c2afcfe..216d70fa 100644 --- a/docs/commands/simphenotype.rst +++ b/docs/commands/simphenotype.rst @@ -1,4 +1,81 @@ .. _commands-simphenotype: -.. include:: simphenotype.md - :parser: myst_parser.sphinx_ \ No newline at end of file + +simphenotype +============ + +Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal variables to use within the simulation by specifying them in a ``.hap`` file. + +The implementation is based on the `GCTA GWAS Simulation `_ utility. + +Usage +~~~~~ +.. code-block:: bash + + haptools simphenotype \ + --replications INT \ + --heritability FLOAT \ + --prevalence FLOAT \ + --region TEXT \ + --sample SAMPLE \ + --samples-file FILENAME \ + --output PATH \ + --verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \ + GENOTYPES HAPLOTYPES + +Model +~~~~~ +Each normalized haplotype :math:`\vec{Z_j}` is encoded as an independent causal variable in a linear model: + +.. math:: + + \vec{y} = \sum_j \beta_j \vec{Z_j} + \vec \epsilon + +where + +.. math:: + + \epsilon_i \sim N(0, \sigma^2) + +.. math:: + + \sigma^2 = Var[\sum_j \beta_j \vec{Z_j}] * (\frac 1 {h^2} - 1) + +The heritability :math:`h^2` is user-specified, but if it is not provided, then :math:`\sigma^2` will be computed purely from the effect sizes, instead: + +.. math:: + + \sigma^2 = \Biggl \lbrace {1 - \sum \beta_j^2 \quad \quad {\sum \beta_j^2 \le 1} \atop 0 \quad \quad \quad \quad \quad \text{ otherwise }} + +If a prevalence for the disease is specified, the final :math:`\vec{y}` value will be thresholded to produce a binary case/control trait with the desired fraction of diseased individuals. + +Output +~~~~~~ +Phenotypes are output in the PLINK2-style ``.pheno`` file format. If ``--replications`` was set to greater than 1, additional columns are output for each simulated trait. + +Note that case/control phenotypes are encoded as 0 (control) + 1 (case) **not** 1 (control) + 2 (case). The latter is used by PLINK2 unless the ``--1`` flag is used (see `the PLIN2 docs `_). Therefore, you must use ``--1`` when providing our ``.pheno`` files to PLINK. + +Examples +~~~~~~~~ +.. code-block:: bash + + haptools simphenotype -o simulated.pheno tests/data/example.vcf.gz tests/data/simphenotype.hap + +Simulate two replicates of a case/control trait that occurs in 60% of your samples with a heritability of 0.8. Encode all of the haplotypes in ``tests/data/example.hap.gz`` as independent causal variables. + +.. code-block:: bash + + haptools simphenotype \ + --replications 2 \ + --heritability 0.8 \ + --prevalence 0.6 \ + --output simulated.pheno \ + tests/data/example.vcf.gz tests/data/example.hap.gz + +Detailed Usage +~~~~~~~~~~~~~~ + +.. click:: haptools.__main__:main + :prog: haptools + :show-nested: + :commands: simphenotype diff --git a/docs/commands/transform.rst b/docs/commands/transform.rst index dcfb3346..01e55e71 100644 --- a/docs/commands/transform.rst +++ b/docs/commands/transform.rst @@ -24,7 +24,7 @@ Examples ~~~~~~~~ .. code-block:: bash - haptools transform tests/data/example.vcf.gz tests/data/example.hap.gz | less + haptools transform tests/data/example.vcf.gz tests/data/basic.hap.gz | less .. code-block:: bash diff --git a/docs/formats/genotypes.rst b/docs/formats/genotypes.rst new file mode 100644 index 00000000..3c204ff9 --- /dev/null +++ b/docs/formats/genotypes.rst @@ -0,0 +1,7 @@ +.. _formats-genotypes: + + +Genotypes +========= + +Genotype files must be specified as VCF or BCF files. diff --git a/docs/formats/haplotypes.rst b/docs/formats/haplotypes.rst index a6707d1e..73e4d747 100644 --- a/docs/formats/haplotypes.rst +++ b/docs/formats/haplotypes.rst @@ -1,8 +1,8 @@ .. _formats-haplotypes: -.hap -==== +Haplotypes +========== This document describes our custom file format specification for haplotypes: the ``.hap`` file. diff --git a/docs/formats/inputs.rst b/docs/formats/inputs.rst deleted file mode 100644 index 39b639eb..00000000 --- a/docs/formats/inputs.rst +++ /dev/null @@ -1,43 +0,0 @@ -.. _formats-inputs: - - -Inputs -========= - -Genotype file format --------------------- -Genotype files must be specified as VCF or BCF files. - -Phenotype file format ---------------------- -Phenotypes are expected to be in a tab-separated format with two columns: - -1. sample ID, which must match those from the genotype file -2. the phenotype value - -There should be no header line in the file. - -See ``tests/data/simple.tsv`` for an example of a phenotype file. - -Covariate file format ---------------------- -Covariates are expected to be in a tab-separated format with the first column -corresponding to the sample ID. Subsequent columns should contain each of your -covariates. - -The first line of the file will be treated as a header. The column with sample IDs -should be named "sample" and subsequent columns can be labeled however you'd like. - -See ``tests/data/covars.tsv`` for an example of a covariate file. - -1000 Genomes sample_info file format ------------------------------------- -Within the subcommand ``haptools simgenotype`` we use a file to map samples in the -reference to their population listed in the model file. This code is expected to work -out of the box with 1000 genomes data and we have pre-computed this mapping file as -well as given the command to how to compute it if you desire another as well. - -``cut -f 1,4 igsr-1000\ genomes\ on\ grch38.tsv | sed '1d' | sed -e 's/ /\t/g' > 1000genomes_sampleinfo.tsv`` - -See ``example-files/1000genomes_sampleinfo.tsv`` for an example of the 1000genomes -GRCh38 samples mapped to their subpopulations. diff --git a/docs/formats/phenotypes.rst b/docs/formats/phenotypes.rst new file mode 100644 index 00000000..f300076f --- /dev/null +++ b/docs/formats/phenotypes.rst @@ -0,0 +1,28 @@ +.. _formats-phenotypes: + + +Phenotypes and Covariates +========================= + +Phenotype file format +--------------------- +Phenotypes are expected to follow `the PLINK2 .pheno file format `_. This is a +tab-separated format where the first column corresponds to the sample ID, and +subsequent columns contain each of your phenotypes. + +The first line of the file corresponds with the header and must begin with ``#IID``. +The names of each of your phenotypes belong in the subbsequent columns of the header. + +See `tests/data/simple.pheno `_ for an example of a phenotype file: + +.. include:: ../../tests/data/simple.pheno + :literal: + +Covariate file format +--------------------- +Covariates follow the same format as phenotypes. + +See `tests/data/simple.covar `_ for an example of a covariate file: + +.. include:: ../../tests/data/simple.covar + :literal: diff --git a/docs/formats/sample_info.rst b/docs/formats/sample_info.rst new file mode 100644 index 00000000..80e7812f --- /dev/null +++ b/docs/formats/sample_info.rst @@ -0,0 +1,17 @@ +.. _formats-sample_info: + + +Sample Info +=========== + +1000 Genomes sample_info file format +------------------------------------ +Within the subcommand ``haptools simgenotype`` we use a file to map samples in the +reference to their population listed in the model file. This code is expected to work +out of the box with 1000 genomes data and we have pre-computed this mapping file as +well as given the command to how to compute it if you desire another as well. + +``cut -f 1,4 igsr-1000\ genomes\ on\ grch38.tsv | sed '1d' | sed -e 's/ /\t/g' > 1000genomes_sampleinfo.tsv`` + +See ``example-files/1000genomes_sampleinfo.tsv`` for an example of the 1000genomes +GRCh38 samples mapped to their subpopulations. diff --git a/docs/index.rst b/docs/index.rst index 6b3266dc..56076d41 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -9,8 +9,10 @@ :hidden: :maxdepth: 1 - formats/inputs.rst + formats/genotypes.rst formats/haplotypes.rst + formats/phenotypes.rst + formats/sample_info.rst .. toctree:: :caption: Commands @@ -20,6 +22,7 @@ commands/simgenotype.rst commands/simphenotype.rst + commands/karyogram.rst commands/transform.rst .. toctree:: diff --git a/haptools/__main__.py b/haptools/__main__.py old mode 100644 new mode 100755 index 68b3f581..2160bb11 --- a/haptools/__main__.py +++ b/haptools/__main__.py @@ -88,39 +88,169 @@ def simgenotype(invcf, sample_info, model, mapdir, out, popsize, seed, chroms): output_vcf(breakpoints, model, invcf, sample_info, out) ############ Haptools simphenotype ############### -DEFAULT_SIMU_REP = 1 -DEFAULT_SIMU_HSQ = 0.1 -DEFAULT_SIMU_K = 0.1 -################################################## @main.command() -@click.option('--vcf', help='Phased VCF file', type=str, required=True) -@click.option('--hap', help='Haplotype file with effect sizes', \ - type=str, required=True) -@click.option('--out', help='Prefix for output files', \ - type=str, required=True) -@click.option('--simu-qt', help='Simulate a quantitative trait', \ - default=False, is_flag=True) -@click.option('--simu-cc', help='Simulate a case/control trait', \ - default=False, is_flag=True) -@click.option('--simu-rep', help='Number of rounds of simulation to perform', \ - type=int, default=DEFAULT_SIMU_REP) -@click.option('--simu-hsq', help='Trait heritability', \ - type=float, default=DEFAULT_SIMU_HSQ) -@click.option('--simu-k', help='Specify the disease prevalence', \ - type=float, default=DEFAULT_SIMU_K) -def simphenotype(vcf, hap, simu_rep, simu_hsq, simu_k, simu_qt, simu_cc, out): +@click.argument("genotypes", type=click.Path(exists=True, path_type=Path)) +@click.argument("haplotypes", type=click.Path(exists=True, path_type=Path)) +@click.option( + "-r", + "--replications", + type=click.IntRange(min=1), + default=1, + show_default=True, + help="Number of rounds of simulation to perform", +) +@click.option( + "-h", + "--heritability", + type=click.FloatRange(min=0, max=1), + default=None, + show_default=True, + help="Trait heritability", +) +@click.option( + "-p", + "--prevalence", + type=click.FloatRange(min=0, max=1, min_open=False, max_open=True), + show_default="quantitative trait", + help="Disease prevalence if simulating a case-control trait" +) +@click.option( + "--region", + type=str, + default=None, + show_default="all haplotypes", + help=( + "The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7'" + "For this to work, the VCF and .hap file must be indexed and the seqname " + "provided must correspond with one in the files" + ) +) +@click.option( + "-s", + "--sample", + "samples", + type=str, + multiple=True, + show_default="all samples", + help=( + "A list of the samples to subset from the genotypes file (ex: '-s sample1 -s" + " sample2')" + ), +) +@click.option( + "-S", + "--samples-file", + type=click.File("r"), + show_default="all samples", + help=( + "A single column txt file containing a list of the samples (one per line) to" + " subset from the genotypes file" + ), +) +@click.option( + "-o", + "--output", + type=click.Path(path_type=Path), + default=Path("/dev/stdout"), + show_default="stdout", + help="A TSV file containing simulated phenotypes", +) +@click.option( + "-v", + "--verbosity", + type=click.Choice(["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]), + default="ERROR", + show_default="only errors", + help="The level of verbosity desired", +) +def simphenotype( + genotypes: Path, + haplotypes: Path, + replications: int = 1, + heritability: float = None, + prevalence: float = None, + region: str = None, + samples: tuple[str] = tuple(), + samples_file: Path = None, + output: Path = Path("-"), + verbosity: str = 'ERROR', +): """ - Haplotype-aware phenotype simulation + Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a + set of haplotypes. + + GENOTYPES must be formatted as a VCF and HAPLOTYPES must be formatted according + to the .hap format spec + + \f + Examples + -------- + >>> haptools simphenotype tests/data/example.vcf.gz tests/data/example.hap.gz > simulated.pheno + + Parameters + ---------- + genotypes : Path + The path to the genotypes in VCF format + haplotypes : Path + The path to the haplotypes in a .hap file + replications : int, optional + The number of rounds of simulation to perform + heritability : int, optional + The heritability of the simulated trait; must be a float between 0 and 1 + + If not provided, it will be computed from the sum of the squared effect sizes. + prevalence : int, optional + The prevalence of the disease if the trait should be simulated as case/control; + must be a float between 0 and 1 + + If not provided, a quantitative trait will be simulated, instead + region : str, optional + The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7' + + For this to work, the VCF and .hap file must be indexed and the seqname must + match! + + Defaults to loading all haplotypes + sample : tuple[str], optional + A subset of the samples from which to extract genotypes + + Defaults to loading genotypes from all samples + samples_file : Path, optional + A single column txt file containing a list of the samples (one per line) to + subset from the genotypes file + output : Path, optional + The location to which to write the simulated phenotypes + verbosity : str, optional + The level of verbosity desired in messages written to stderr """ - from .sim_phenotypes import simulate_pt - # Basic checks on input - # TODO - check VCF zipped, check only one of simu-qt/simu-cc, - # check values of other inputs - # Only use simu-k for case/control + import logging + + from .sim_phenotype import simulate_pt + + log = logging.getLogger("run") + logging.basicConfig( + format="[%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)", + level=verbosity, + ) + # handle samples + if samples and samples_file: + raise click.UsageError( + "You may only use one of --sample or --samples-file but not both." + ) + if samples_file: + with samples_file as samps_file: + samples = samps_file.read().splitlines() + elif samples: + # needs to be converted from tuple to list + samples = list(samples) + else: + samples = None # Run simulation - simulate_pt(vcf, hap, simu_rep, \ - simu_hsq, simu_k, simu_qt, simu_cc, out) + simulate_pt( + genotypes, haplotypes, replications, heritability, prevalence, region, samples, + output, log + ) @main.command(short_help="Transform a genotypes matrix via a set of haplotypes") @click.argument("genotypes", type=click.Path(exists=True, path_type=Path)) @@ -129,10 +259,12 @@ def simphenotype(vcf, hap, simu_rep, simu_hsq, simu_k, simu_qt, simu_cc, out): "--region", type=str, default=None, - show_default="all genotypes", - help=""" - The region from which to extract genotypes; ex: 'chr1:1234-34566' or 'chr7'\n - For this to work, the VCF must be indexed and the seqname must match!""", + show_default="all haplotypes", + help=( + "The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7'" + "For this to work, the VCF and .hap file must be indexed and the seqname " + "provided must correspond with one in the files" + ) ) @click.option( "-s", @@ -199,9 +331,16 @@ def transform( haplotypes : Path The path to the haplotypes in a .hap file region : str, optional - See documentation for :py:meth:`~.data.Genotypes.read` - sample : Tuple[str], optional - See documentation for :py:meth:`~.data.Genotypes.read` + The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7' + + For this to work, the VCF and .hap file must be indexed and the seqname must + match! + + Defaults to loading all haplotypes + sample : tuple[str], optional + A subset of the samples from which to extract genotypes + + Defaults to loading genotypes from all samples samples_file : Path, optional A single column txt file containing a list of the samples (one per line) to subset from the genotypes file @@ -212,8 +351,7 @@ def transform( """ import logging - from haptools import data - from .haplotype import HaptoolsHaplotype + from .transform import transform_haps log = logging.getLogger("run") logging.basicConfig( @@ -234,23 +372,7 @@ def transform( else: samples = None - log.info("Loading haplotypes") - hp = data.Haplotypes(haplotypes) - hp.read(region=region) - log.info("Extracting variants from haplotypes") - variants = {var.id for hap in hp.data.values() for var in hap.variants} - log.info("Loading genotypes") - gt = data.GenotypesRefAlt(genotypes, log=log) - # gt._prephased = True - gt.read(region=region, samples=samples, variants=variants) - gt.check_missing(discard_also=True) - gt.check_biallelic(discard_also=True) - gt.check_phase() - log.info("Transforming genotypes via haplotypes") - hp_gt = data.GenotypesRefAlt(fname=output, log=log) - hp.transform(gt, hp_gt) - log.info("Writing haplotypes to VCF") - hp_gt.write() + transform_haps(genotypes, haplotypes, region, samples, output, log) if __name__ == "__main__": diff --git a/haptools/data/covariates.py b/haptools/data/covariates.py index 45ad1f30..04960e50 100644 --- a/haptools/data/covariates.py +++ b/haptools/data/covariates.py @@ -8,16 +8,17 @@ import numpy as np from .data import Data +from .phenotypes import Phenotypes -class Covariates(Data): +class Covariates(Phenotypes): """ A class for processing covariates from a file Attributes ---------- data : np.array - The covariates in an n (samples) x 1 (covariate value) array + The covariates in an n (samples) x m (covariates) array fname : Path The path to the read-only file containing the data samples : tuple[str] @@ -33,124 +34,5 @@ class Covariates(Data): """ def __init__(self, fname: Path, log: Logger = None): - super().__init__(fname, log) - self.samples = tuple() - self.names = tuple() - - @classmethod - def load(cls: Covariates, fname: Path, samples: list[str] = None) -> Covariates: - """ - Load covariates from a TSV file - - Read the file contents and standardize the covariates - - Parameters - ---------- - fname - See documentation for :py:attr:`~.Data.fname` - samples : list[str], optional - See documentation for :py:meth:`~.Data.Covariates.read` - - Returns - ------- - covariates - A Covariates object with the data loaded into its properties - """ - covariates = cls(fname) - covariates.read(samples) - return covariates - - def read(self, samples: list[str] = None): - """ - Read covariates from a TSV file into a numpy matrix stored in :py:attr:`~.Covariates.data` - - Parameters - ---------- - samples : list[str], optional - A subset of the samples from which to extract covariates - - Defaults to loading covariates from all samples - - Raises - ------ - AssertionError - If the provided file doesn't follow the expected format - """ - super().read() - # load all info into memory - # use hook_compressed to automatically handle gz files - with hook_compressed(self.fname, mode="rt") as covars: - covar_text = reader(covars, delimiter="\t") - header = next(covar_text) - # there should at least two columns - if len(header) < 2: - raise ValueError("The covariates TSV should have at least two columns.") - # the first column should be called "sample" - if header[0] != "sample": - self.log.warning( - "The first column of the covariates TSV should contain sample IDs" - " and should be named 'sample' in the header line" - ) - # convert to list and subset samples if need be - if samples: - samples = set(samples) - covar_text = [covar for covar in covar_text if covar[0] in samples] - else: - covar_text = list(covar_text) - # the second column should be castable to a float - try: - float(covar_text[0][1]) - except: - self.log.error( - "Every column in the covariates file (besides the sample column) must" - " be numeric." - ) - # fill out the samples and data properties - data = list(zip(*covar_text)) - self.samples = data[0] - self.names = tuple(header[1:]) - # coerce strings to floats - self.data = np.transpose(np.array(data[1:], dtype="float64")) - - def __iter__(self, samples: list[str] = None) -> Iterator[namedtuple]: - """ - Read covariates from a TSV line by line without storing anything - - Parameters - ---------- - samples : list[str], optional - A subset of the samples from which to extract covariates - - Defaults to loading covariates from all samples - - Yields - ------ - Iterator[namedtuple] - An iterator over each line in the file, where each line is encoded as a - namedtuple containing each of the class properties - """ - with hook_compressed(self.fname, mode="rt") as covars: - covar_text = reader(covars, delimiter="\t") - header = next(covar_text) - # there should at least two columns - if len(header) < 2: - raise ValueError("The covariates TSV should have at least two columns.") - # the first column should be called "sample" - if header[0] != "sample": - self.log.warning( - "The first column of the covariates TSV should contain sample IDs" - " and should be named 'sample' in the header line" - ) - header = tuple(header[1:]) - Record = namedtuple("Record", "data samples names") - for covar in covar_text: - if samples is None or covar[0] in samples: - try: - yield Record( - np.array(covar[1:], dtype="float64"), covar[0], header - ) - except: - self.log.error( - "Every column in the covariates file (besides the sample" - " column) must be numeric." - ) + super(Phenotypes, self).__init__(fname, log) + self._ext = "covar" diff --git a/haptools/data/genotypes.py b/haptools/data/genotypes.py index 24e36fd9..98d4cd83 100644 --- a/haptools/data/genotypes.py +++ b/haptools/data/genotypes.py @@ -1,9 +1,11 @@ from __future__ import annotations +import re from csv import reader from pathlib import Path from typing import Iterator from collections import namedtuple from logging import getLogger, Logger +from fileinput import hook_compressed import numpy as np import numpy.typing as npt @@ -36,6 +38,10 @@ class Genotypes(Data): _prephased : bool If True, assume that the genotypes are phased. Otherwise, extract their phase when reading from the VCF. + _samp_idx : dict[str, int] + Sample index; maps samples to indices in self.samples + _var_idx : dict[str, int] + Variant index; maps variant IDs to indices in self.variants Examples -------- @@ -59,6 +65,8 @@ def __init__(self, fname: Path, log: Logger = None): ], ) self._prephased = False + self._samp_idx = None + self._var_idx = None @classmethod def load( @@ -93,8 +101,7 @@ def load( genotypes.read(region, samples, variants) genotypes.check_missing() genotypes.check_biallelic() - if not self._prephased: - genotypes.check_phase() + genotypes.check_phase() # genotypes.to_MAC() return genotypes @@ -280,6 +287,89 @@ def __iter__( # see https://stackoverflow.com/a/36726497 return self._iterate(vcf, region, variants) + def index(self, samples: bool = True, variants: bool = True): + """ + Call this function once to improve the amortized time-complexity of look-ups of + samples and variants by their ID. This is useful if you intend to later subset + by a set of samples or variant IDs. + The time complexity of this function should be roughly O(n+m) if both + parameters are True. Otherwise, it will be either O(n) or O(m). + + Parameters + ---------- + samples: bool, optional + Whether to index the samples for fast loop-up. Adds complexity O(n). + variants: bool, optional + Whether to index the variants for fast look-up. Adds complexity O(m). + """ + if samples and self._samp_idx is None: + self._samp_idx = dict(zip(self.samples, range(len(self.samples)))) + if variants and self._var_idx is None: + self._var_idx = dict(zip(self.variants["id"], range(len(self.variants)))) + + def subset( + self, + samples: tuple[str] = None, + variants: tuple[str] = None, + inplace: bool = False, + ): + """ + Subset these genotypes to a smaller set of samples or a smaller set of variants + + The order of the samples and variants in the subsetted instance will match + the order in the provided tuple parameters. + + Parameters + ---------- + samples: tuple[str] + A subset of samples to keep + variants: tuple[str] + A subset of variant IDs to keep + inplace: bool, optional + If False, return a new Genotypes object; otherwise, alter the current one + + Returns + ------- + A new Genotypes object if inplace is set to False, else returns None + """ + # First, initialize variables + gts = self + if not inplace: + gts = self.__class__(self.fname, self.log) + gts.samples = self.samples + gts.variants = self.variants + gts.data = self.data + # Index the current set of samples and variants so we can have fast look-up + self.index(samples=(samples is not None), variants=(variants is not None)) + # Subset the samples + if samples is not None: + gts.samples = tuple(samp for samp in samples if samp in self._samp_idx) + if len(gts.samples) < len(samples): + diff = len(samples) - len(gts.samples) + self.log.warning( + f"Saw {diff} fewer samples than requested. Proceeding with " + f"{len(gts.samples)} samples." + ) + samp_idx = tuple(self._samp_idx[samp] for samp in gts.samples) + if inplace: + self._samp_idx = None + gts.data = gts.data[samp_idx, :] + # Subset the variants + if variants is not None: + var_idx = [self._var_idx[var] for var in variants if var in self._var_idx] + if len(var_idx) < len(variants): + diff = len(variants) - len(var_idx) + self.log.warning( + f"Saw {diff} fewer variants than requested. Proceeding with " + f"{len(var_idx)} variants." + ) + gts.variants = self.variants[var_idx] + if inplace: + self._var_idx = None + gts.data = gts.data[:, var_idx] + if not inplace: + return gts + def check_missing(self, discard_also=False): """ Check that each sample is properly genotyped @@ -306,6 +396,7 @@ def check_missing(self, discard_also=False): ) self.data = np.delete(self.data, samp_idx, axis=0) self.samples = tuple(np.delete(self.samples, samp_idx)) + self._samp_idx = None else: raise ValueError( "Genotype with ID {} at POS {}:{} is missing for sample {}".format( @@ -347,6 +438,7 @@ def check_biallelic(self, discard_also=False): self.log.info(f"Ignoring {len(variant_idx)} multiallelic variants") self.data = np.delete(self.data, variant_idx, axis=1) self.variants = np.delete(self.variants, variant_idx) + self._var_idx = None else: raise ValueError( "Variant with ID {} at POS {}:{} is multiallelic for sample {}" @@ -462,6 +554,8 @@ def __init__(self, fname: Path, log: Logger = None): ], ) self._prephased = False + self._samp_idx = None + self._var_idx = None def _variant_arr(self, record: Variant): """ @@ -496,10 +590,15 @@ def write(self): ("Description", "Genotype"), ], ) - for sample in self.samples: - # TODO: figure out how to make this work for large datasets - # it gets stuck and takes a long time to exit this loop - vcf.header.add_sample(sample) + try: + vcf.header.add_samples(self.samples) + except AttributeError: + self.log.warning( + "Upgrade to pysam >=0.19.1 to reduce the time required to create " + "VCFs. See https://github.com/pysam-developers/pysam/issues/1104" + ) + for sample in self.samples: + vcf.header.add_sample(sample) self.log.info("Writing VCF records") for var_idx, var in enumerate(self.variants): rec = { @@ -547,8 +646,81 @@ class GenotypesPLINK(GenotypesRefAlt): >>> genotypes = GenotypesPLINK.load('tests/data/simple.pgen') """ + def read_samples(self, samples: list[str] = None): + """ + Read sample IDs from a PSAM file into a list stored in + :py:attr:`~.GenotypesPLINK.samples` + + One of either variants or max_variants MUST be specified! + + Parameters + ---------- + samples : list[str], optional + See documentation for :py:attr:`~.GenotypesRefAlt.read` + + Returns + ------- + npt.NDArray[np.uint32] + The indices of each of the samples within the PSAM file + """ + if len(self.samples) != 0: + self.log.warning("Sample data has already been loaded. Overriding.") + if samples is not None and not isinstance(samples, set): + samples = set(samples) + with hook_compressed(self.fname.with_suffix(".psam"), mode="rt") as psam: + psamples = reader(psam, delimiter="\t") + # find the line that declares the header + for header in psamples: + if header[0].startswith("#FID") or header[0].startswith("#IID"): + break + # we need the header to contain the IID column + if header[0][0] != "#": + raise ValueError("Your PSAM file is missing a header!") + col_idx = 1 + self.log.info( + "Your PVAR file lacks a proper header! Assuming first five columns" + " are [CHROM, POS, ID, REF, ALT] in that order..." + ) + # TODO: add header back in to psamples using itertools.chain? + else: + header[0] = header[0][1:] + try: + col_idx = header.index("IID") + except ValueError: + raise ValueError("Your PSAM file must have an IID column.") + self.samples = { + ct: samp[col_idx] + for ct, samp in enumerate(psamples) + if (samples is None) or (samp[col_idx] in samples) + } + indices = np.array(list(self.samples.keys()), dtype=np.uint32) + self.samples = list(self.samples.values()) + return indices + + def _check_region( + self, pos: tuple, chrom: str, start: int = 0, end: int = float("inf") + ): + """ + Check that pos lies within the provided chrom, start, and end coordinates + + This is a helper function for :py:meth:`~.GenotypesPLINK.read_variants` + + Parameters + ---------- + pos : tuple[str, int] + A tuple of two elements: contig (str) and chromosomal position (int) + chrom: str + The contig of the region to check + start: int, optional + The start position of the region + end: int, optional + The end position of the region + """ + return (pos[0] == chrom) and (start <= pos[1]) and (end >= pos[1]) + def read_variants( self, + region: str = None, variants: set[str] = None, max_variants: int = None, ): @@ -578,53 +750,68 @@ def read_variants( max_variants = len(variants) if max_variants is None: raise ValueError("Provide either the variants or max_variants parameter!") - # TODO: try using pysam.tabix_iterator with the parser param specified for VCF - if region: - pvar = TabixFile(str(self.fname.with_suffix(".pvar"))) - pvariants = pvar.fetch(region) - else: - pvar = hook_compressed(self.fname.with_suffix(".pvar"), mode="rt") + # split the region string so each portion is an element + if region is not None: + region = re.split(":|-", region) + if len(region) > 1: + region[1:] = [int(pos) for pos in region[1:] if pos] + with hook_compressed(self.fname.with_suffix(".pvar"), mode="rt") as pvar: pvariants = reader(pvar, delimiter="\t") - # first, preallocate the array - self.variants = np.empty((max_variants,), dtype=self.variants.dtype) - header = next(pvariants) - # there should be at least five columns - if len(header) < 5: - raise ValueError("Your PVAR file should have at least five columns.") - if header[0][0] == "#": - header[0] = header[0][1:] - else: - raise ValueError("Your PVAR file is missing a header!") - header = ["CHROM", "POS", "ID", "REF", "ALT"] - self.log.info( - "Your PVAR file lacks a proper header! Assuming first five columns" - " are [CHROM, POS, ID, REF, ALT] in that order..." - ) - # TODO: add header back in to pvariants using itertools.chain? - cid = {item: header.index(item.upper()) for item in self.variants.dtype.names} - if variants is not None: + # first, preallocate the array + self.variants = np.empty((max_variants,), dtype=self.variants.dtype) + # find the line that declares the header + for header in pvariants: + if not header[0].startswith("##"): + break + # there should be at least five columns + if len(header) < 5: + raise ValueError("Your PVAR file should have at least five columns.") + if header[0][0] == "#": + header[0] = header[0][1:] + else: + raise ValueError("Your PVAR file is missing a header!") + header = ["CHROM", "POS", "ID", "REF", "ALT"] + self.log.info( + "Your PVAR file lacks a proper header! Assuming first five columns" + " are [CHROM, POS, ID, REF, ALT] in that order..." + ) + # TODO: add header back in to pvariants using itertools.chain? + cid = { + item: header.index(item.upper()) + for item in self.variants.dtype.names + if item != "aaf" + } indices = np.empty((max_variants,), dtype=np.uint32) - num_seen = 0 - for ct, rec in enumerate(pvariants): - if variants is not None: - if rec[cid["id"]] not in variants: + num_seen = 0 + for ct, rec in enumerate(pvariants): + if region and not self._check_region( + (rec[cid["chrom"]], int(rec[cid["pos"]])), *region + ): continue + if variants is not None: + if rec[cid["id"]] not in variants: + continue indices[num_seen] = ct - self.variants[num_seen] = np.array( - ( - rec[cid["id"]], - rec[cid["chrom"]], - rec[cid["pos"]], - 0, - rec[cid["ref"]], - rec[cid["alt"]], - ), - dtype=self.variants.dtype, + self.variants[num_seen] = np.array( + ( + rec[cid["id"]], + rec[cid["chrom"]], + rec[cid["pos"]], + 0, + rec[cid["ref"]], + rec[cid["alt"]], + ), + dtype=self.variants.dtype, + ) + num_seen += 1 + if max_variants > num_seen: + self.log.info( + f"Removing {max_variants-num_seen} unneeded variant records that " + "were preallocated b/c max_variants was specified." ) - num_seen += 1 - pvar.close() - if variants is not None: - return indices + indices = indices[:num_seen] + self.variants = self.variants[:num_seen] + return indices def read( self, @@ -648,34 +835,57 @@ def read( max_variants : int, optional See documentation for :py:attr:`~.GenotypesRefAlt.read` """ - super().read() - # load the pgen-reader file - # note: very little is loaded into memory at this point + super(Genotypes, self).read() # TODO: figure out how to install this package from pgenlib import PgenReader - pgen = PgenReader(bytes(self.fname, "utf8")) + sample_idxs = self.read_samples(samples) + with PgenReader( + bytes(str(self.fname), "utf8"), sample_subset=sample_idxs + ) as pgen: - if variants is not None: - max_variants = len(variants) - if max_variants is None: - # use the pgen file to figure out how many variants there are - max_variants = pgen.get_variant_ct() - indices = self.read_variants(region, variants, max_variants) - - variant_ct_start = 0 - variant_ct_end = max_variants - variant_ct = variant_ct_end - variant_ct_start - - sample_ct = pgen.get_raw_sample_ct() - # the genotypes start out as a simple 2D array with twice the number of samples - # so each column is a different chromosomal strand - self.data = np.empty((variant_ct, sample_ct * 2), dtype=np.int32) - pgen.read_alleles(indices, self.data) - # extract the genotypes to a np matrix of size n x p x 2 - # the last dimension has two items: - # 1) presence of REF in strand one - # 2) presence of REF in strand two - self.data = np.dstack((self.data[:, ::2], self.data[:, 1::2])) - # transpose the GT matrix so that samples are rows and variants are columns - self.data = self.data.transpose((1, 0, 2)) + if variants is not None: + max_variants = len(variants) + if max_variants is None: + # use the pgen file to figure out how many variants there are + max_variants = pgen.get_variant_ct() + indices = self.read_variants(region, variants, max_variants) + + # the genotypes start out as a simple 2D array with twice the number of samples + if not self._prephased: + # raise an error message b/c this is untested code that doesn't really + # seem to work properly + # for ex, the phasing info is not boolean for some reason? + raise ValueError("Not implemented yet!") + # ...so each column is a different chromosomal strand + self.data = np.empty( + (len(indices), len(sample_idxs) * 2), dtype=np.int32 + ) + phasing = np.empty((len(indices), len(sample_idxs)), dtype=np.uint8) + # The haplotype-major mode of read_alleles_and_phasepresent_list has + # not been implemented yet, so we need to read the genotypes in sample- + # major mode and then transpose them + pgen.read_alleles_and_phasepresent_list(indices, self.data, phasing) + # missing alleles will have a value of -9 + # let's make them be -1 to be consistent with cyvcf2 + self.data[self.data == -9] = -1 + self.data = np.dstack((self.data[:, ::2], self.data[:, 1::2])).astype( + np.uint8 + ) + self.data = np.concatenate( + (self.data, phasing[:, :, np.newaxis]), axis=2 + ) + # transpose the GT matrix so that samples are rows and variants are columns + self.data = self.data.transpose((1, 0, 2)) + else: + # ...so each row is a different chromosomal strand + self.data = np.empty( + (len(sample_idxs) * 2, len(indices)), dtype=np.int32 + ) + pgen.read_alleles_list(indices, self.data, hap_maj=True) + # missing alleles will have a value of -9 + # let's make them be -1 to be consistent with cyvcf2 + self.data[self.data == -9] = -1 + self.data = np.dstack((self.data[::2, :], self.data[1::2, :])).astype( + np.uint8 + ) diff --git a/haptools/data/haplotypes.py b/haptools/data/haplotypes.py index f15f5c9b..5adb031a 100644 --- a/haptools/data/haplotypes.py +++ b/haptools/data/haplotypes.py @@ -1,5 +1,4 @@ from __future__ import annotations -import re from pathlib import Path from logging import getLogger, Logger from fileinput import hook_compressed @@ -88,7 +87,7 @@ def fmt_str(self) -> str: """ Convert an Extra into a fmt string - Retruns + Returns ------- str A python format string (ex: "{beta:.3f}") @@ -235,8 +234,8 @@ class Haplotype: The chromosomal end position of the haplotype id: str The haplotype's unique ID - variants: list[Variant] - A list of the variants in this haplotype + variants: tuple[Variant] + The variants in this haplotype _extras: tuple[Extra] Extra fields for the haplotype @@ -278,7 +277,7 @@ def _fmt(self): @property # TODO: use @cached_property in py3.8 def varIDs(self): - return {var.id for var in self.variants} + return tuple(var.id for var in self.variants) @classmethod def from_hap_spec( @@ -335,9 +334,7 @@ def extras_head(cls) -> tuple: """ return tuple(extra.to_hap_spec("H") for extra in cls._extras) - def transform( - self, genotypes: GenotypesRefAlt, samples: list[str] = None - ) -> npt.NDArray[bool]: + def transform(self, genotypes: GenotypesRefAlt) -> npt.NDArray[bool]: """ Transform a genotypes matrix via the current haplotype @@ -351,8 +348,6 @@ def transform( If the genotypes have not been loaded into the Genotypes object yet, this method will call Genotypes.read(), while loading only the needed variants - samples : list[str], optional - See documentation for :py:attr:`~.Genotypes.read` Returns ------- @@ -361,34 +356,27 @@ def transform( denotes the presence of the haplotype in one chromosome of a sample """ var_IDs = self.varIDs - # check: have the genotypes been loaded yet? - # if not, we can load just the variants we need - if genotypes.unset(): - start = min(var.start for var in self.variants) - end = max(var.end for var in self.variants) - region = f"{self.chrom}:{start}-{end}" - genotypes.read(region=region, samples=samples, variants=var_IDs) - genotypes.check_biallelic(discard_also=True) - genotypes.check_phase() - # create a dict where the variants are keyed by ID - var_dict = { - var["id"]: var["ref"] for var in genotypes.variants if var["id"] in var_IDs - } - var_idxs = [ - idx for idx, var in enumerate(genotypes.variants) if var["id"] in var_IDs - ] - missing_IDs = var_IDs - var_dict.keys() - if len(missing_IDs): + gts = genotypes.subset(variants=var_IDs) + # check: were any of the variants absent from the genotypes? + if len(gts.variants) < len(var_IDs): + missing_IDs = set(var_IDs) - set(gts.variants["id"]) raise ValueError( f"Variants {missing_IDs} are present in haplotype '{self.id}' but " "absent in the provided genotypes" ) # create a np array denoting the alleles that we want - alleles = [int(var.allele != var_dict[var.id]) for var in self.variants] - allele_arr = np.array([[[al] for al in alleles]]) # shape: (1, n, 1) + # note: the excessive use of square-brackets gives us shape (1, n, 1) + allele_arr = np.array( + [ + [ + [int(var.allele != gts.variants[i]["ref"])] + for i, var in enumerate(self.variants) + ] + ] + ) # look for the presence of each allele in each chromosomal strand # and then just AND them together - return np.all(allele_arr == genotypes.data[:, var_idxs], axis=1) + return np.all(allele_arr == gts.data, axis=1) class Haplotypes(Data): @@ -505,7 +493,7 @@ def check_header(self, lines: list[str], check_version=False): expected_lines.remove(line) except ValueError: # extract the name of the extra field - name = line.split("\t", maxsplit=1)[1] + name = line.split("\t", maxsplit=2)[1] raise ValueError( f"The extra field '{name}' is declared in the header of the" " .hap file but is not accepted by this tool." @@ -730,12 +718,7 @@ def __repr__(self): def write(self): """ - Write the contents of this Haplotypes object to the file given by fname - - Parameters - ---------- - file: TextIO - A file-like object to which this Haplotypes object should be written. + Write the contents of this Haplotypes object to the file at :py:attr:`~.Haplotypes.fname` Examples -------- @@ -752,9 +735,7 @@ def write(self): def transform( self, genotypes: GenotypesRefAlt, - hap_gts: GenotypesRefAlt, - samples: list[str] = None, - low_memory: bool = False, + hap_gts: GenotypesRefAlt = None, ) -> GenotypesRefAlt: """ Transform a genotypes matrix via the current haplotype @@ -766,42 +747,32 @@ def transform( ---------- genotypes : GenotypesRefAlt The genotypes which to transform using the current haplotype - - If the genotypes have not been loaded into the Genotypes object yet, this - method will call Genotypes.read(), while loading only the needed variants hap_gts: GenotypesRefAlt An empty GenotypesRefAlt object into which the haplotype genotypes should be stored - samples : list[str], optional - See documentation for :py:attr:`~.Genotypes.read` - low_memory : bool, optional - If True, each haplotype's genotypes will be loaded one at a time. Returns ------- GenotypesRefAlt A Genotypes object composed of haplotypes instead of regular variants. """ + # Initialize GenotypesRefAlt return value + if hap_gts is None: + hap_gts = GenotypesRefAlt(fname=None, log=self.log) hap_gts.samples = genotypes.samples hap_gts.variants = np.array( [(hap.id, hap.chrom, hap.start, 0, "A", "T") for hap in self.data.values()], - dtype=[ - ("id", "U50"), - ("chrom", "U10"), - ("pos", np.uint32), - ("aaf", np.float64), - ("ref", "U100"), - ("alt", "U100"), - ], + dtype=hap_gts.variants.dtype, ) + # Obtain and merge the haplotype genotypes self.log.info( f"Transforming a set of genotypes from {len(genotypes.variants)} total " f"variants with a list of {len(self.data)} haplotypes" ) hap_gts.data = np.concatenate( tuple( - hap.transform(genotypes, samples)[:, np.newaxis] - for hap in self.data.values() + hap.transform(genotypes)[:, np.newaxis] for hap in self.data.values() ), axis=1, - ).astype(np.uint8) + ).astype(genotypes.data.dtype) + return hap_gts diff --git a/haptools/data/phenotypes.py b/haptools/data/phenotypes.py index a27c2535..d43222c4 100644 --- a/haptools/data/phenotypes.py +++ b/haptools/data/phenotypes.py @@ -1,11 +1,14 @@ from __future__ import annotations from csv import reader from pathlib import Path -from collections import namedtuple +from io import TextIOBase +from collections.abc import Iterable from logging import getLogger, Logger from fileinput import hook_compressed +from collections import namedtuple, Counter import numpy as np +import numpy.typing as npt from .data import Data @@ -17,11 +20,13 @@ class Phenotypes(Data): Attributes ---------- data : np.array - The phenotypes in an n (samples) x 1 (phenotype value) array + The phenotypes in an n (samples) x m (phenotypes) array fname : Path The path to the read-only file containing the data samples : tuple The names of each of the n samples + names : tuple[str] + The names of the phenotypes log: Logger A logging instance for recording debug statements. @@ -33,6 +38,8 @@ class Phenotypes(Data): def __init__(self, fname: Path, log: Logger = None): super().__init__(fname, log) self.samples = tuple() + self.names = tuple() + self._ext = "pheno" @classmethod def load(cls: Phenotypes, fname: Path, samples: list[str] = None) -> Phenotypes: @@ -75,36 +82,28 @@ def read(self, samples: list[str] = None): If the provided file doesn't follow the expected format """ super().read() - # load all info into memory - # use hook_compressed to automatically handle gz files - with hook_compressed(self.fname, mode="rt") as phens: - phen_text = reader(phens, delimiter="\t") - # convert to list and subset samples if need be - if samples: - samples = set(samples) - phen_text = [phen for phen in phen_text if phen[0] in samples] - else: - phen_text = list(phen_text) - # there should only be two columns - if len(phen_text[0]) != 2: - self.log.warning("The phenotype TSV should only have two columns.") - # the second column should be castable to a float - try: - float(phen_text[0][1]) - except: - self.log.error("The second column of the TSV file must numeric.") + # call self.__iter__() to obtain the data and samples + data, self.samples = zip(*self.__iter__(samples)) + self.log.info(f"Loaded {len(self.samples)} samples from .{self._ext} file") # fill out the samples and data properties - self.samples, self.data = zip(*phen_text) - # coerce strings to floats - self.data = np.array(self.data, dtype="float64") + # collect data in a np array + self.data = np.array(data) - def __iter__(self, samples: list[str] = None) -> Iterator[namedtuple]: + def _iterate( + self, phens: TextIOBase, phen_text: Iterable, samples: set[str] = None + ): """ - Read phenotypes from a TSV line by line without storing anything + A generator over the lines of a TSV + + This is a helper function for :py:meth:`~.Phenotypes.__iter__` Parameters ---------- - samples : list[str], optional + phens: TextIOBase + The file handler for the stream + phen_text: Iterable + The csv.reader object containing the lines of text from the file as lists + samples : set[str], optional A subset of the samples from which to extract phenotypes Defaults to loading phenotypes from all samples @@ -115,17 +114,88 @@ def __iter__(self, samples: list[str] = None) -> Iterator[namedtuple]: An iterator over each line in the file, where each line is encoded as a namedtuple containing each of the class properties """ - with hook_compressed(self.fname, mode="rt") as phens: - phen_text = reader(phens, delimiter="\t") - Record = namedtuple("Record", "data samples") - for phen in phen_text: - if samples is None or phen[0] in samples: - try: - yield Record(float(phen[1]), phen[0]) - except: - self.log.error( - "The second column of the TSV file must numeric." - ) + self.log.info(f"Loading {len(self.names)} columns from .{self._ext} file") + Record = namedtuple("Record", "data samples") + for phen in phen_text: + if samples is None or phen[0] in samples: + try: + yield Record(np.array(phen[1:], dtype="float64"), phen[0]) + except: + self.log.error( + f"Every column in the .{self._ext} file (besides the sample" + " column) must be numeric." + ) + phens.close() + + def __iter__(self, samples: list[str] = None) -> Iterable[namedtuple]: + """ + Read phenotypes from a TSV line by line without storing anything + + Parameters + ---------- + samples : list[str], optional + A subset of the samples from which to extract phenotypes + + Defaults to loading phenotypes from all samples + + Returns + ------ + Iterable[namedtuple] + See documentation for :py:meth:`~.Phenotypes._iterate` + """ + phens = hook_compressed(self.fname, mode="rt") + phen_text = reader(phens, delimiter="\t") + # ignore all of the comment lines + while True: + header = next(phen_text) + if not header[0].startswith("#") or header[0].startswith("#IID"): + break + + # there should be at least two columns + if len(header) < 2: + raise ValueError(f"The .{self._ext} file should have at least two columns.") + # the first column should be called "#IID" + if header[0] != "#IID": + self.log.warning( + f"The first column of the .{self._ext} file should contain sample IDs" + " and should be named '#IID' in the header line" + ) + self.names = tuple(header[1:]) + samples = set(samples) if samples else None + # call another function to force the lines above to be run immediately + # see https://stackoverflow.com/a/36726497 + return self._iterate(phens, phen_text, samples) + + def write(self): + """ + Write the phenotypes in this class to a file at :py:attr:`~.Phenotypes.fname` + + Examples + -------- + To write to a file, you must first initialize a Phenotypes object and then + fill out the names, data, and samples properties: + >>> phenotypes = Phenotypes('tests/data/simple.pheno') + >>> phenotypes.names = ('height',) + >>> phenotypes.data = np.array([1, 1, 2], dtype='float64') + >>> phenotypes.samples = ('HG00096', 'HG00097', 'HG00099') + >>> phenotypes.write() + """ + # make sure the names are unique + uniq_names = Counter() + names = [None] * len(self.names) + for idx, name in enumerate(self.names): + suffix = "" + if uniq_names[name]: + suffix = f"-{uniq_names[name]}" + names[idx] = name + suffix + uniq_names[name] += 1 + # now we can finally write the file + with hook_compressed(self.fname, mode="wt") as haps: + haps.write("#IID\t" + "\t".join(names) + "\n") + formatter = {"float_kind": lambda x: "%.2f" % x} + for samp, phen in zip(self.samples, self.data): + line = np.array2string(phen, separator="\t", formatter=formatter)[1:-1] + haps.write(f"{samp}\t" + line + "\n") def standardize(self): """ @@ -133,4 +203,34 @@ def standardize(self): This function modifies :py:attr:`~.Genotypes.data` in-place """ - self.data = (self.data - np.mean(self.data)) / np.std(self.data) + self.data = (self.data - np.mean(self.data, axis=0)) / np.std(self.data, axis=0) + + def append(self, name: str, data: npt.NDArray): + """ + Append a new set of phenotypes to the current set + + Parameters + ---------- + name: str + The name of the new phenotype + data: npt.NDArray + A 1D np array of the same length as :py:attr:`~.Phenotypes.samples`, + containing the phenotype values for each sample. Must have the same dtype + as :py:attr:`~.Phenotypes.data.` + """ + if len(self.samples): + if len(self.samples) != len(data): + self.log.error( + "The data provided to the add() method is not of the appropriate" + "length" + ) + else: + self.log.warning( + "Set the samples property of the Phenotypes instance before calling " + "the add() method" + ) + if self.unset(): + self.data = data[:, np.newaxis] + else: + self.data = np.concatenate((self.data, data[:, np.newaxis]), axis=1) + self.names = self.names + (name,) diff --git a/haptools/haplotype.py b/haptools/haplotype.py deleted file mode 100644 index b6c0a1d8..00000000 --- a/haptools/haplotype.py +++ /dev/null @@ -1,26 +0,0 @@ -from __future__ import annotations -from dataclasses import dataclass, field - -import numpy as np - -from .data import Extra, Haplotype - - -@dataclass -class HaptoolsHaplotype(Haplotype): - """ - A haplotype with sufficient fields for simphenotype - - Properties and functions are shared with the base Haplotype object - """ - - ancestry: str - beta: float - _extras: tuple = field( - repr=False, - init=False, - default=( - Extra("ancestry", "s", "Local ancestry"), - Extra("beta", ".2f", "Effect size in linear model"), - ), - ) diff --git a/haptools/haplotypes.py b/haptools/haplotypes.py deleted file mode 100644 index 3c9993e4..00000000 --- a/haptools/haplotypes.py +++ /dev/null @@ -1,199 +0,0 @@ -""" -Classes for reading/writing and other functions -on haplotypes - -TODO: -* transform - * deal with LA and STRs - -* writephenotypes -* documentation and type hinting -* tests - -* Compare simulation output to GCTA -* fail gracefully on assertions - we need ERROR utility - -""" -import re -import gzip -from pathlib import Path - -import pysam -import numpy as np - - -class Haplotype: - def __init__(self, hap_id, hap_chr, hap_start, hap_end, \ - hap_varids, hap_varalleles, \ - hap_LA, hap_effect): - self.hap_id = hap_id - self.hap_chr = hap_chr - self.hap_start = hap_start - self.hap_end = hap_end - self.hap_varids = hap_varids - self.hap_varalleles = hap_varalleles - self.hap_LA = hap_LA - self.hap_effect = hap_effect - - ##### Do some basic checks ##### - # hap_varids and hap_varalleles must be the same length - assert(len(self.hap_varids)==len(self.hap_varalleles)) - # Do not repeat variant IDs - assert(len(set(self.hap_varids))==len(self.hap_varids)) - # TODO - STR and LA checks - - def GetVariantID(self, record): - return "%s_%s_%s_%s"%(record.CHROM, record.POS, record.REF, ":".join(record.ALT)) - - def CheckHapMatch(self, vcf_reader): - sample_list = vcf_reader.samples - - # Keep track of whether each chrom - # of each sample matches or not - hap_matches = {} - for sample in sample_list: - hap_matches[sample] = [True, True] - - found_ids = 0 - records = vcf_reader("%s:%s-%s"%(self.hap_chr, self.hap_start, self.hap_end)) - for record in records: - # Check if the record is part of the haplotype - if (record.ID in self.hap_varids): - varind = self.hap_varids.index(record.ID) - elif (GetVariantID(record) in self.varids): - varind = self.hap_varids.index(self.GetVariantID(record)) - else: continue - var_allele = self.hap_varalleles[varind] - var_id = self.hap_varids[varind] - found_ids += 1 - - # Determine the index of the target allele - record_alleles = [record.REF]+record.ALT - assert(var_allele in record_alleles) # TODO fail more gracefully with a message - allele_ind = record_alleles.index(var_allele) - - # Check if each sample matches - # If not, set hap_matches to False - for i in range(len(sample_list)): - sample = sample_list[i] - call = record.genotypes[i] - assert(call[2]) # make sure it is phased - for j in range(2): - if call[j]!=allele_ind: hap_matches[sample][j] = False - - # Check we got through all the IDs - assert(found_ids == len(self.hap_varids)) # TODO fail more gracefully with a message - return hap_matches - - def Transform(self, vcf_reader): - hap_gts = {} - - ##### Case 1 - STR length ##### - pass # TODO - - ##### Case 2 - variant haplotype ##### - hap_matches = self.CheckHapMatch(vcf_reader) - - ##### Case 3 - LA (may be in combination with case 2) ##### - pass # TODO - - # Get haplotype counts - for sample in vcf_reader.samples: - hap_gts[sample] = sum(hap_matches[sample]) - return hap_gts - - def GetFrequency(self, vcf_reader): - hap_gts = self.Transform(vcf_reader) - return np.sum(list(hap_gts.values()))/(2*len(hap_gts.keys())) - - def __str__(self): - return self.hap_id - -class HapReader: - def __init__(self, hapfile: Path, region: str = None): - """ - Open a haplotype file with the given region - - Parameters - ---------- - hapfile : Path - The path to a .haps file containing haplotypes - region : str, optional - A region from which to query specific haplotypes; ex: "chr1:1000-2000" - """ - if region is None: - self.haps = gzip.open(hapfile) - else: - # split the region string so each portion is an element - region = re.split(':|-', region) - if len(region) > 1: - region[1:] = [int(pos) for pos in region[1:] if pos] - # fetch region - self.haps = pysam.TabixFile(hapfile).fetch(*region) - - def GetHaplotype(self, hapline): - hap_id = hapline[0] - hap_chr = hapline[1] - hap_start = int(hapline[2]) - hap_end = int(hapline[3]) - hap_varids = hapline[4].split(",") - hap_varalleles = hapline[5].split(",") - hap_LA = hapline[6].split(",") - hap_effect = float(hapline[7]) - return Haplotype(hap_id, hap_chr, hap_start, hap_end, \ - hap_varids, hap_varalleles, \ - hap_LA, hap_effect) - - def __iter__(self): - return self - - def __next__(self): - # Get the next haplotype - hapline = next(self.haps) - - # If reading from gzip file, convert to list - # so it is in the same format as from tabix - if type(hapline) == bytes: - hapline = hapline.decode('utf-8') - while hapline.startswith('#'): - hapline = next(self.haps).decode('utf-8') - hapline = hapline.strip().split() - - # Convert the line to a haplotype object - return self.GetHaplotype(hapline) - -class PhenoSimulator: - def __init__(self, sample_list): - self.pts = {} - for sample in sample_list: - self.pts[sample] = 0 - - def AddEffect(self, sample_to_hap, effect): - mean = np.mean(list(sample_to_hap.values())) - sd = np.sqrt(np.var(list(sample_to_hap.values()))) - for sample in self.pts.keys(): - assert(sample in sample_to_hap.keys()) - self.pts[sample] += effect*(sample_to_hap[sample]-mean)/sd - - def WritePhenotypes(self, outprefix, simu_rep, simu_hsq, \ - simu_k, simu_qt, simu_cc): - # First compute var(additive) - var_add = np.var(list(self.pts.values())) - - # Prepare output file - outf_sim = open("%s.phen"%outprefix, "w") - - # Simulate and write - for sample in self.pts.keys(): - outinfo = [sample, sample] - genetic_component = self.pts[sample] - for i in range(simu_rep): - resid_component = np.random.normal(0, var_add*(1/simu_hsq-1)) - if simu_qt: - outinfo.append(genetic_component+resid_component) - else: - raise NotImplementedError("Case control not implemented") - outf_sim.write("\t".join([str(item) for item in outinfo])+"\n") - - # Done - outf_sim.close() \ No newline at end of file diff --git a/haptools/sim_phenotype.py b/haptools/sim_phenotype.py new file mode 100644 index 00000000..df1c417c --- /dev/null +++ b/haptools/sim_phenotype.py @@ -0,0 +1,274 @@ +from __future__ import annotations +from pathlib import Path +from itertools import combinations +from logging import getLogger, Logger, DEBUG +from dataclasses import dataclass, field + +import numpy as np +import numpy.typing as npt + +from .data import Haplotype as HaplotypeBase +from .data import GenotypesRefAlt, Phenotypes, Haplotypes, Extra + + +@dataclass +class Haplotype(HaplotypeBase): + """ + A haplotype with sufficient fields for simphenotype + + Properties and functions are shared with the base Haplotype object, "HaplotypeBase" + """ + + ancestry: str + beta: float + _extras: tuple = field( + repr=False, + init=False, + default=( + Extra("ancestry", "s", "Local ancestry"), + Extra("beta", ".2f", "Effect size in linear model"), + ), + ) + + +class PhenoSimulator: + """ + Simulate phenotypes from genotypes + + Attributes + ---------- + gens: Genotypes + Genotypes to simulate + phens: Phenotypes + Simulated phenotypes; filled by :py:meth:`~.PhenoSimular.run` + rng: np.random.Generator, optional + A numpy random number generator + log: Logger + A logging instance for recording debug statements + + Examples + -------- + >>> gens = Genotypes.load("tests/data/example.vcf.gz") + >>> haps = Haplotypes.load("tests/data/basic.hap") + >>> haps_gts = haps.transform(gens) + >>> phenosim = PhenoSimulator(haps_gts) + >>> phenosim.run(haps.data.values()) + >>> phenotypes = phenosim.phens + """ + + def __init__( + self, + genotypes: Genotypes, + output: Path = Path("/dev/stdout"), + seed: int = None, + log: Logger = None, + ): + """ + Initialize a PhenoSimulator object + + Parameters + ---------- + genotypes: Genotypes + Genotypes for each haplotype + output: Path, optional + Path to a '.pheno' file to which the generated phenotypes could be written + + Defaults to stdout if not provided + seed: int, optional + A seed to initialize the random number generator + + This is useful if you want the generated phenotypes to be the same across + multiple PhenoSimulator instances. If not provided, it will be random. + log: Logger, optional + A logging instance for recording debug statements + """ + self.gens = genotypes + self.phens = Phenotypes(fname=output) + self.phens.data = None + self.phens.samples = self.gens.samples + self.rng = np.random.default_rng(seed) + self.log = log or getLogger(self.__class__.__name__) + + def run( + self, + effects: list[Haplotype], + heritability: float = None, + prevalence: float = None, + ) -> npt.NDArray: + """ + Simulate phenotypes for an entry in the Genotypes object + + The generated phenotypes will also be added to + :py:attr:`~.PhenoSimulator.output` + + Parameters + ---------- + effects: list[Haplotype] + A list of Haplotypes to use in an additive fashion within the simulations + heritability: float, optional + The simulated heritability of the trait + + If not provided, this will default to the sum of the squared effect sizes + prevalence: float, optional + How common should the disease be within the population? + + If this value is specified, case/control phenotypes will be generated + instead of quantitative traits. + + Returns + ------- + npt.NDArray + The simulated phenotypes, as a np array of shape num_samples x 1 + """ + # extract the relevant haplotype info from the Haplotype objects + ids = [hap.id for hap in effects] + betas = np.array([hap.beta for hap in effects]) + self.log.debug(f"Extracting haplotype genotypes for haps: {ids}") + self.log.debug(f"Beta values are {betas}") + # extract the haplotype "genotypes" and compute the phenotypes + gts = self.gens.subset(variants=ids).data.sum(axis=2) + self.log.info(f"Computing genetic component w/ {gts.shape[0]} causal effects") + # standardize the genotypes + gts = (gts - gts.mean(axis=0)) / gts.std(axis=0) + # generate the genetic component + pt = (betas * gts).sum(axis=1) + # compute the heritability + if heritability is None: + self.log.debug("Computing heritability as the sum of the squared betas") + heritability = np.power(betas, 2).sum() + if heritability > 1: + heritability = 1 + # compute the environmental effect + noise = 1 - heritability + else: + # compute the environmental effect + noise = np.var(pt) * (np.reciprocal(heritability) - 1) + self.log.info(f"Adding environmental component {noise} for h^2 {heritability}") + # finally, add everything together to get the simulated phenotypes + pt_noise = self.rng.normal(0, noise, size=pt.shape) + if self.log.getEffectiveLevel() == DEBUG: + # if we're in debug mode, compute the pearson correlation and report it + # but don't do this otherwise to keep things fast + corr = np.corrcoef(pt, pt + pt_noise)[1, 0] + self.log.debug(f"Estimated heritability is {corr}") + pt += pt_noise + # now, handle case/control + name_suffix = "" + if prevalence is not None: + self.log.info(f"Converting to case/control with prevalence {prevalence}") + # first, find the number of desired positives + k = int(prevalence * len(pt)) + # choose the top k values and label them positive + bool_pt = np.zeros(len(pt), dtype=np.bool_) + if k == len(pt): + bool_pt = ~bool_pt + else: + max_indices = np.argpartition(-pt, k)[:k] + bool_pt[max_indices] = True + pt = bool_pt + name_suffix = "-cc" + # now, save the archived phenotypes for later + self.phens.append(name="-".join(ids) + name_suffix, data=pt.astype(np.float64)) + return pt + + def write(self): + """ + Write the generated phenotypes to the file specified in + :py:meth:`~.PhenoSimular.__init__` + """ + self.phens.write() + + +def simulate_pt( + genotypes: Path, + haplotypes: Path, + num_replications: int = 1, + heritability: float = None, + prevalence: float = None, + region: str = None, + samples: list[str] = None, + output: Path = Path("-"), + log: Logger = None, +): + """ + Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a + set of haplotypes. + + GENOTYPES must be formatted as a VCF and HAPLOTYPES must be formatted according + to the .hap format spec + + \f + Examples + -------- + >>> haptools simphenotype tests/data/example.vcf.gz tests/data/example.hap.gz > simu_phens.tsv + + Parameters + ---------- + genotypes : Path + The path to the genotypes in VCF format + haplotypes : Path + The path to the haplotypes in a .hap file + replications : int, optional + The number of rounds of simulation to perform + heritability : int, optional + The heritability of the simulated trait; must be a float between 0 and 1 + + If not provided, it will be computed from the sum of the squared effect sizes + prevalence : int, optional + The prevalence of the disease if the trait should be simulated as case/control; + must be a float between 0 and 1 + + If not provided, a quantitative trait will be simulated, instead + region : str, optional + The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7' + + For this to work, the VCF and .hap file must be indexed and the seqname must + match! + + Defaults to loading all haplotypes + sample : tuple[str], optional + A subset of the samples from which to extract genotypes + + Defaults to loading genotypes from all samples + samples_file : Path, optional + A single column txt file containing a list of the samples (one per line) to + subset from the genotypes file + output : Path, optional + The location to which to write the simulated phenotypes + log : Logger, optional + The logging module for this task + """ + if log is None: + log = logging.getLogger("run") + logging.basicConfig( + format="[%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)", + level="ERROR", + ) + + log.info("Loading haplotypes") + hp = Haplotypes(haplotypes, haplotype=Haplotype, log=log) + hp.read(region=region) + + log.info("Extracting variants from haplotypes") + variants = {var.id for hap in hp.data.values() for var in hap.variants} + + log.info("Loading genotypes") + gt = GenotypesRefAlt(genotypes, log=log) + # gt._prephased = True + gt.read(region=region, samples=samples, variants=variants) + log.info("QC-ing genotypes") + gt.check_missing() + gt.check_biallelic() + gt.check_phase() + + log.info("Transforming genotypes via haplotypes") + hp_gt = GenotypesRefAlt(fname=None, log=log) + hp.transform(gt, hp_gt) + + # Initialize phenotype simulator (haptools simphenotype) + log.info("Simulating phenotypes") + pt_sim = PhenoSimulator(hp_gt, output=output, log=log) + for i in range(num_replications): + pt_sim.run(hp.data.values(), heritability, prevalence) + log.info("Writing phenotypes") + pt_sim.write() diff --git a/haptools/sim_phenotypes.py b/haptools/sim_phenotypes.py deleted file mode 100644 index 1701a1b0..00000000 --- a/haptools/sim_phenotypes.py +++ /dev/null @@ -1,40 +0,0 @@ -""" -Example test command: - -haptools simphenotype --vcf tests/data/simple.vcf.gz --hap tests/data/simple.hap.gz \ - --out test --simu-qt --simu-hsq 0.8 --simu-rep 2 -""" - -from cyvcf2 import VCF - -from .haplotypes import * - -def simulate_pt(vcffile, hapfile, simu_rep, \ - simu_hsq, simu_k, simu_qt, simu_cc, outprefix): - - # Initialize readers - vcf_reader = VCF(vcffile, gts012=True) - hap_reader = HapReader(hapfile) - - # Initialize phenotype simulator (haptools simphenotypes) - pt_sim = PhenoSimulator(vcf_reader.samples) - - # Set up file to write summary info - outf_sum = open("%s.par"%outprefix, "w") - outf_sum.write("\t".join(["Haplotype", "Frequency", "Effect"])+"\n") - - # Add effect for each haplotype - for hap in hap_reader: - sample_to_hap = hap.Transform(vcf_reader) - pt_sim.AddEffect(sample_to_hap, hap.hap_effect) - - # Output summary info - hap_freq = hap.GetFrequency(vcf_reader) - outf_sum.write("\t".join([hap.hap_id, str(hap_freq), str(hap.hap_effect)])+"\n") - - # Output simulated phenotypes to a file (haptools simphenotypes) - pt_sim.WritePhenotypes(outprefix, simu_rep, simu_hsq, \ - simu_k, simu_qt, simu_cc) - - # Done - outf_sum.close() \ No newline at end of file diff --git a/haptools/transform.py b/haptools/transform.py new file mode 100644 index 00000000..2de57a2f --- /dev/null +++ b/haptools/transform.py @@ -0,0 +1,64 @@ +from __future__ import annotations +import logging +from pathlib import Path + +from haptools import data + + +def transform_haps( + genotypes: Path, + haplotypes: Path, + region: str = None, + samples: list[str] = None, + output: Path = Path("-"), + log: Logger = None, +): + """ + Creates a VCF composed of haplotypes + + Parameters + ---------- + genotypes : Path + The path to the genotypes in VCF format + haplotypes : Path + The path to the haplotypes in a .hap file + region : str, optional + See documentation for :py:meth:`~.data.Genotypes.read` + and :py:meth:`~.data.Haplotypes.read` + samples : list[str], optional + See documentation for :py:meth:`~.data.Genotypes.read` + output : Path, optional + The location to which to write output + log : Logger, optional + A logging module to which to write messages about progress and any errors + """ + if log is None: + log = logging.getLogger("run") + logging.basicConfig( + format="[%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)", + level="ERROR", + ) + + log.info("Loading haplotypes") + hp = data.Haplotypes(haplotypes, log=log) + hp.read(region=region) + + log.info("Extracting variants from haplotypes") + variants = {var.id for hap in hp.data.values() for var in hap.variants} + + log.info("Loading genotypes") + gt = data.GenotypesRefAlt(genotypes, log=log) + # gt._prephased = True + gt.read(region=region, samples=samples, variants=variants) + gt.check_missing() + gt.check_biallelic() + gt.check_phase() + + log.info("Transforming genotypes via haplotypes") + hp_gt = data.GenotypesRefAlt(fname=output, log=log) + hp.transform(gt, hp_gt) + + log.info("Writing haplotypes to VCF") + hp_gt.write() + + return hp_gt diff --git a/tests/data/covars.tsv b/tests/data/simple.covar similarity index 81% rename from tests/data/covars.tsv rename to tests/data/simple.covar index 9d4ec1d1..c2b96e30 100644 --- a/tests/data/covars.tsv +++ b/tests/data/simple.covar @@ -1,4 +1,4 @@ -sample sex age +#IID sex age HG00096 0 4 HG00097 1 20 HG00099 1 33 diff --git a/tests/data/simple.pheno b/tests/data/simple.pheno new file mode 100644 index 00000000..b567f784 --- /dev/null +++ b/tests/data/simple.pheno @@ -0,0 +1,6 @@ +#IID height bmi +HG00096 1 3 +HG00097 1 6 +HG00099 2 8 +HG00100 2 1 +HG00101 0 4 \ No newline at end of file diff --git a/tests/data/simple.tsv b/tests/data/simple.tsv deleted file mode 100644 index f36f9427..00000000 --- a/tests/data/simple.tsv +++ /dev/null @@ -1,5 +0,0 @@ -HG00096 1 -HG00097 1 -HG00099 2 -HG00100 2 -HG00101 0 diff --git a/tests/test_data.py b/tests/test_data.py index 9f3efc55..6f15873f 100644 --- a/tests/test_data.py +++ b/tests/test_data.py @@ -5,7 +5,7 @@ import numpy as np import numpy.lib.recfunctions as rfn -from haptools.haplotype import HaptoolsHaplotype +from haptools.sim_phenotype import Haplotype as HaptoolsHaplotype from haptools.data import ( Genotypes, GenotypesRefAlt, @@ -217,95 +217,193 @@ def test_load_genotypes_subset(self): np.testing.assert_allclose(gts.data, expected) assert gts.samples == tuple(samples) + def test_subset_genotypes(self): + gts = self._get_fake_genotypes() + + # subset to just the samples we want + expected_data = gts.data[:3] + expected_variants = gts.variants + samples = ("HG00096", "HG00097", "HG00099") + gts_sub = gts.subset(samples=samples) + assert gts_sub.samples == samples + np.testing.assert_allclose(gts_sub.data, expected_data) + assert np.array_equal(gts_sub.variants, expected_variants) + + # subset to just the variants we want + expected_data = gts.data[:, [1, 2]] + expected_variants = gts.variants[[1, 2]] + variants = ("1:10116:A:G", "1:10117:C:A") + gts_sub = gts.subset(variants=variants) + assert gts_sub.samples == gts.samples + np.testing.assert_allclose(gts_sub.data, expected_data) + assert np.array_equal(gts_sub.variants, expected_variants) + + # subset both: samples and variants + expected_data = gts.data[[3, 4], :2] + expected_variants = gts.variants[:2] + samples = ("HG00100", "HG00101") + variants = ("1:10114:T:C", "1:10116:A:G") + gts_sub = gts.subset(samples=samples, variants=variants) + assert gts_sub.samples == samples + np.testing.assert_allclose(gts_sub.data, expected_data) + assert np.array_equal(gts_sub.variants, expected_variants) + + +class TestPhenotypes: + def _get_expected_phenotypes(self): + # create a phen matrix with shape: samples x phenotypes + expected = np.array( + [ + [1, 1, 2, 2, 0], + [3, 6, 8, 1, 4], + ], + dtype="float64", + ).T + return expected -def test_load_phenotypes(caplog): - # create a phenotype vector with shape: num_samples x 1 - expected = np.array([1, 1, 2, 2, 0]) - - # can we load the data from the phenotype file? - phens = Phenotypes(DATADIR.joinpath("simple.tsv")) - phens.read() - np.testing.assert_allclose(phens.data, expected) - assert phens.samples == ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") - - # try loading the data again - it should warn b/c we've already done it - phens.read() - assert len(caplog.records) == 1 and caplog.records[0].levelname == "WARNING" - - expected = (expected - np.mean(expected)) / np.std(expected) - phens.standardize() - np.testing.assert_allclose(phens.data, expected) + def _get_fake_phenotypes(self): + gts = Phenotypes(fname=None) + gts.data = self._get_expected_phenotypes() + gts.samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") + gts.names = ("height", "bmi") + return gts + def test_load_phenotypes(self, caplog): + expected_phen = self._get_fake_phenotypes() + expected = expected_phen.data -def test_load_phenotypes_iterate(): - # create a phenotype vector with shape: num_samples x 1 - expected = np.array([1, 1, 2, 2, 0]) - samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") + # can we load the data from the phenotype file? + phens = Phenotypes(DATADIR.joinpath("simple.pheno")) + phens.read() + np.testing.assert_allclose(phens.data, expected) + assert phens.samples == expected_phen.samples - # can we load the data from the phenotype file? - phens = Phenotypes(DATADIR.joinpath("simple.tsv")) - for idx, line in enumerate(phens): - np.testing.assert_allclose(line.data, expected[idx]) - assert line.samples == samples[idx] + # try loading the data again - it should warn b/c we've already done it + caplog.clear() + phens.read() + assert len(caplog.records) > 0 and caplog.records[0].levelname == "WARNING" + expected = (expected - np.mean(expected, axis=0)) / np.std(expected, axis=0) + phens.standardize() + np.testing.assert_allclose(phens.data, expected) -def test_load_phenotypes_subset(): - # create a phenotype vector with shape: num_samples x 1 - expected = np.array([1, 1, 2, 2, 0]) + def test_load_phenotypes_iterate(self): + expected_phen = self._get_fake_phenotypes() + expected = expected_phen.data + samples = expected_phen.samples - # subset for just the samples we want - expected = expected[[1, 3]] + # can we load the data from the phenotype file? + phens = Phenotypes(DATADIR.joinpath("simple.pheno")) + for idx, line in enumerate(phens): + np.testing.assert_allclose(line.data, expected[idx]) + assert line.samples == samples[idx] - # can we load the data from the phenotype file? - phens = Phenotypes(DATADIR.joinpath("simple.tsv")) - samples = ["HG00097", "HG00100"] - phens.read(samples=samples) - np.testing.assert_allclose(phens.data, expected) - assert phens.samples == tuple(samples) + def test_load_phenotypes_subset(self): + expected_phen = self._get_fake_phenotypes() + expected = expected_phen.data + samples = ["HG00097", "HG00100"] + # subset for just the samples we want + expected = expected[[1, 3]] -def test_load_covariates(caplog): - # create a covariate vector with shape: num_samples x num_covars - expected = np.array([(0, 4), (1, 20), (1, 33), (0, 15), (0, 78)]) + # can we load the data from the phenotype file? + phens = Phenotypes(DATADIR.joinpath("simple.pheno")) + phens.read(samples=samples) + np.testing.assert_allclose(phens.data, expected) + assert phens.samples == tuple(samples) + + def test_write_phenotypes(self): + expected_phen = self._get_fake_phenotypes() + + # first, we write the data + expected_phen.fname = DATADIR.joinpath("test.pheno") + expected_phen.write() + + # now, let's load the data and check that it's what we wrote + result = Phenotypes(expected_phen.fname) + result.read() + np.testing.assert_allclose(expected_phen.data, result.data) + assert expected_phen.names == result.names + assert expected_phen.samples == result.samples + + # let's clean up after ourselves and delete the file + os.remove(str(expected_phen.fname)) + + def test_append_phenotype(self): + expected1 = self._get_fake_phenotypes() + expected2 = self._get_fake_phenotypes() + + # add a phenotype called "test" to the set of phenotypes + new_phen = np.array([5, 2, 8, 0, 3], dtype=expected2.data.dtype) + expected2.append("test", new_phen) + + # did it get added correctly? + assert expected1.data.shape[1] == expected2.data.shape[1] - 1 + assert len(expected1.names) == len(expected2.names) - 1 + assert expected2.names[2] == "test" + assert len(expected1.samples) == len(expected2.samples) + np.testing.assert_allclose(expected1.data, expected2.data[:, :2]) + np.testing.assert_allclose(expected2.data[:, 2], new_phen) + + +class TestCovariates: + def _get_expected_covariates(self): + # create a covar matrix with shape: samples x covariates + expected = np.array([(0, 4), (1, 20), (1, 33), (0, 15), (0, 78)]) + return expected - # can we load the data from the covariates file? - covars = Covariates(DATADIR.joinpath("covars.tsv")) - covars.read() - np.testing.assert_allclose(covars.data, expected) - assert covars.samples == ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") - assert covars.names == ("sex", "age") + def _get_fake_covariates(self): + gts = Phenotypes(fname=None) + gts.data = self._get_expected_covariates() + gts.samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") + gts.names = ("sex", "age") + return gts - # try loading the data again - it should warn b/c we've already done it - covars.read() - assert len(caplog.records) == 1 and caplog.records[0].levelname == "WARNING" + def test_load_covariates(self, caplog): + # create a covariate vector with shape: num_samples x num_covars + expected_covar = self._get_fake_covariates() + expected = expected_covar.data + samples = expected_covar.samples + # can we load the data from the covariates file? + covars = Covariates(DATADIR.joinpath("simple.covar")) + covars.read() + np.testing.assert_allclose(covars.data, expected) + assert covars.samples == samples + assert covars.names == ("sex", "age") -def test_load_covariates_iterate(): - # create a covariate vector with shape: num_samples x num_covars - expected = np.array([(0, 4), (1, 20), (1, 33), (0, 15), (0, 78)]) - samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") + # try loading the data again - it should warn b/c we've already done it + caplog.clear() + covars.read() + assert len(caplog.records) > 0 and caplog.records[0].levelname == "WARNING" - # can we load the data from the covariates file? - covars = Covariates(DATADIR.joinpath("covars.tsv")) - for idx, line in enumerate(covars): - np.testing.assert_allclose(line.data, expected[idx]) - assert line.samples == samples[idx] - assert line.names == ("sex", "age") + def test_load_covariates_iterate(self): + # create a covariate vector with shape: num_samples x num_covars + expected_covar = self._get_fake_covariates() + expected = expected_covar.data + samples = expected_covar.samples + # can we load the data from the covariates file? + covars = Covariates(DATADIR.joinpath("simple.covar")) + for idx, line in enumerate(covars): + np.testing.assert_allclose(line.data, expected[idx]) + assert line.samples == samples[idx] + assert covars.names == ("sex", "age") -def test_load_covariates_subset(): - # create a covariate vector with shape: num_samples x num_covars - expected = np.array([(0, 4), (1, 20), (1, 33), (0, 15), (0, 78)]) + def test_load_covariates_subset(self): + # create a covariate vector with shape: num_samples x num_covars + expected_covar = self._get_fake_covariates() + expected = expected_covar.data + samples = ["HG00097", "HG00100"] - # subset for just the samples we want - expected = expected[[1, 3]] + # subset for just the samples we want + expected = expected[[1, 3]] - # can we load the data from the covariate file? - covars = Covariates(DATADIR.joinpath("covars.tsv")) - samples = ["HG00097", "HG00100"] - covars.read(samples=samples) - np.testing.assert_allclose(covars.data, expected) - assert covars.samples == tuple(samples) + # can we load the data from the covariate file? + covars = Covariates(DATADIR.joinpath("simple.covar")) + covars.read(samples=samples) + np.testing.assert_allclose(covars.data, expected) + assert covars.samples == tuple(samples) class TestHaplotypes: diff --git a/tests/test_simphenotype.py b/tests/test_simphenotype.py new file mode 100644 index 00000000..00339656 --- /dev/null +++ b/tests/test_simphenotype.py @@ -0,0 +1,196 @@ +import os +from pathlib import Path + +import pytest +import numpy as np +import numpy.lib.recfunctions as rfn + +from haptools.sim_phenotype import Haplotype, PhenoSimulator +from haptools.data import ( + Genotypes, + Phenotypes, + Haplotypes, +) + + +DATADIR = Path(__file__).parent.joinpath("data") + + +class TestSimPhenotype: + def _get_fake_gens(self): + gts = Genotypes(fname=None) + gts.data = np.array( + [ + [[1, 0], [0, 1]], + [[0, 1], [0, 1]], + [[0, 0], [1, 1]], + [[1, 1], [1, 1]], + [[0, 0], [0, 0]], + ], + dtype=np.uint8, + ) + gts.variants = np.array( + [ + ("1:10114:T:C", "1", 10114, 0.7), + ("1:10116:A:G", "1", 10116, 0.6), + ], + dtype=[ + ("id", "U50"), + ("chrom", "U10"), + ("pos", np.uint32), + ("aaf", np.float64), + ], + ) + gts.samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") + return gts + + def _get_fake_haps(self): + return [ + Haplotype("1", 10114, 10115, "1:10114:T:C", "CEU", 0.25), + Haplotype("1", 10116, 10117, "1:10116:A:G", "YRI", 0.75), + ] + + def _get_expected_phens(self): + pts = Phenotypes(fname=None) + pts.data = np.array( + [ + [-0.13363062, False], + [-0.13363062, False], + [0.53452248, True], + [1.20267559, True], + [-1.46993683, False], + ], + dtype=np.float64, + ) + pts.samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") + pts.names = ("1:10114:T:C", "1:10116:A:G") + return pts + + def test_one_hap_zero_noise(self): + gts = self._get_fake_gens() + hps = self._get_fake_haps() + expected = self._get_expected_phens() + + pt_sim = PhenoSimulator(gts, seed=42) + data = pt_sim.run([hps[0]], heritability=1) + data = data[:, np.newaxis] + phens = pt_sim.phens + + # check the data and the generated phenotype object + assert phens.data.shape == (5, 1) + np.testing.assert_allclose(phens.data, data) + assert phens.data[0, 0] == phens.data[1, 0] + assert phens.data[2, 0] == phens.data[4, 0] + assert phens.data[3, 0] > phens.data[0, 0] + assert phens.data[2, 0] < phens.data[0, 0] + assert phens.samples == expected.samples + assert phens.names[0] == expected.names[0] + + def test_one_hap_zero_noise_neg_beta(self): + """ + the same test as test_one_phen_zero_noise but with a negative beta this time + """ + gts = self._get_fake_gens() + hps = self._get_fake_haps() + # make the beta value negative + hps[0].beta = -hps[0].beta + expected = self._get_expected_phens() + + pt_sim = PhenoSimulator(gts, seed=42) + data = pt_sim.run([hps[0]], heritability=1) + data = data[:, np.newaxis] + phens = pt_sim.phens + + # check the data and the generated phenotype object + assert phens.data.shape == (5, 1) + np.testing.assert_allclose(phens.data, data) + assert phens.data[0, 0] == phens.data[1, 0] + assert phens.data[2, 0] == phens.data[4, 0] + assert phens.data[3, 0] < phens.data[0, 0] + assert phens.data[2, 0] > phens.data[0, 0] + assert phens.samples == expected.samples + assert phens.names[0] == expected.names[0] + + def test_two_haps_zero_noise(self): + gts = self._get_fake_gens() + hps = self._get_fake_haps() + expected = self._get_expected_phens() + + pt_sim = PhenoSimulator(gts, seed=42) + pt_sim.run([hps[0]], heritability=1) + pt_sim.run([hps[1]], heritability=1) + phens = pt_sim.phens + + # check the data and the generated phenotype object + assert phens.data.shape == (5, 2) + assert phens.data[0, 0] == phens.data[1, 0] + assert phens.data[2, 0] == phens.data[4, 0] + assert phens.data[3, 0] > phens.data[0, 0] + assert phens.data[2, 0] < phens.data[0, 0] + + assert phens.data[0, 1] == phens.data[1, 1] + assert phens.data[2, 1] == phens.data[3, 1] + assert phens.data[3, 1] > phens.data[0, 1] + assert phens.data[4, 1] < phens.data[0, 1] + + assert phens.samples == expected.samples + assert phens.names == expected.names + + def test_combined_haps_zero_noise(self): + gts = self._get_fake_gens() + hps = self._get_fake_haps() + expected = self._get_expected_phens() + + pt_sim = PhenoSimulator(gts, seed=42) + pt_sim.run(hps, heritability=1) + phens = pt_sim.phens + + # check the data and the generated phenotype object + assert phens.data.shape == (5, 1) + np.testing.assert_allclose(phens.data[:, 0], expected.data[:, 0]) + + assert phens.samples == expected.samples + assert phens.names == ("-".join(expected.names),) + + def test_noise(self): + gts = self._get_fake_gens() + hps = self._get_fake_haps() + expected = self._get_expected_phens() + + pt_sim = PhenoSimulator(gts, seed=42) + pt_sim.run(hps, heritability=1) + pt_sim.run(hps, heritability=0.96) + pt_sim.run(hps, heritability=0.5) + pt_sim.run(hps, heritability=0.2) + phens = pt_sim.phens + + # check the data and the generated phenotype object + assert phens.data.shape == (5, 4) + np.testing.assert_allclose(phens.data[:, 0], expected.data[:, 0]) + diff1 = np.abs(phens.data[:, 1] - phens.data[:, 0]).sum() + assert diff1 > 0 + diff2 = np.abs(phens.data[:, 2] - phens.data[:, 0]).sum() + assert diff2 > diff1 + diff3 = np.abs(phens.data[:, 3] - phens.data[:, 0]).sum() + assert diff3 > diff2 + + def test_case_control(self): + gts = self._get_fake_gens() + hps = self._get_fake_haps() + expected = self._get_expected_phens() + all_false = np.zeros(expected.data.shape, dtype=np.float64) + some_true = expected.data[:, 1] + all_true = (~(all_false.astype(np.bool_))).astype(np.float64) + + pt_sim = PhenoSimulator(gts, seed=42) + pt_sim.run(hps, heritability=1, prevalence=0.4) + pt_sim.run(hps, heritability=1, prevalence=1) + pt_sim.run(hps, heritability=1, prevalence=0) + pt_sim.run(hps, heritability=0.5, prevalence=0.4) + phens = pt_sim.phens + assert phens.data.shape == (5, 4) + np.testing.assert_allclose(phens.data[:, 0], some_true) + np.testing.assert_allclose(phens.data[:, 1], all_true[:, 1]) + np.testing.assert_allclose(phens.data[:, 2], all_false[:, 1]) + diff1 = (phens.data[:, 3] == phens.data[:, 0]).sum() + assert diff1 > 0