diff --git a/README.md b/README.md
index 678dbeff..f8692eab 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,10 @@ Homepage: https://haptools.readthedocs.io/
 
 ## Installation
 
-UNDER CONSTRUCTION
+We have not officially published `haptools` yet, but in the meantime, you can install it directly from our Github repository.
+```bash
+pip install git+https://github.com/gymrek-lab/haptools.git
+```
 
 ## Haptools utilities
 
@@ -21,12 +24,13 @@ Haptools consists of multiple utilities listed below. Click on a utility to see
 
 * [`haptools karyogram`](docs/commands/karyogram.md): Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by `haptools simgenome`.
 
+* [`haptools transform`](https://haptools.readthedocs.io/en/latest/commands/transform.html): Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.
+
 Outputs produced by these utilities are compatible with each other. For example
 `haptools simgenome` outputs a VCF file with local ancestry information annotated for each variant. The output VCF file can be used as input to `haptools simphenotype` to simulate phenotype information. `haptools simgenome` also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`. 
 
 ## Contributing
 
-If you are interested in contributing to `haptools`, please get in touch by submitting a Github issue or contacting us at mlamkin@ucsd.edu.
-
-
+We gladly welcome any contributions to `haptools`!
 
+Please read [our contribution guidelines](https://haptools.readthedocs.io/en/latest/project_info/contributing.html) and then submit a [Github issue](https://github.com/gymrek-lab/haptools/issues).
diff --git a/docs/api/haptools.rst b/docs/api/haptools.rst
index 66d0bab8..ce092e1d 100644
--- a/docs/api/haptools.rst
+++ b/docs/api/haptools.rst
@@ -52,8 +52,16 @@ haptools.data.haplotypes module
    :undoc-members:
    :show-inheritance:
 
+haptools.sim_phenotype module
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. automodule:: haptools.sim_phenotype
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
 haptools.sim_genotype module
-~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. automodule:: haptools.sim_genotype
    :members:
diff --git a/docs/commands/karyogram.md b/docs/commands/karyogram.md
index 706f2419..9b898e40 100644
--- a/docs/commands/karyogram.md
+++ b/docs/commands/karyogram.md
@@ -1,4 +1,4 @@
-# Haptools karyogram
+# karyogram
 
 `haptools karyogram` takes as input a breakpoints file (e.g. as output by `haptools simgenotype`) and a sample name, and plots a karyogram depicting local ancestry tracks.
 
diff --git a/docs/commands/simphenotype.md b/docs/commands/simphenotype.md
deleted file mode 100644
index 629a95be..00000000
--- a/docs/commands/simphenotype.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# simphenotype
-
-Haptools simphenotype simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. It takes causal effects and genotypes as input and outputs simulated phenotypes.
-
-Usage is modeled based on the [GCTA GWAS Simulation](https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation) utility.
-
-## Usage
-
-Below is a basic `haptools simphenotype` command:
-
-```
-haptools simphenotype \
-  --vcf <gzipped vcf file> \
-  --hap <gzipped hap file> \
-  --out <outprefix> \
-  < --simu-qt | --simu-cc > \
-  [simulation options]
-```
-
-Required parameters:
-
-* `--vcf <string>`: A bgzipped, tabix-indexed, phased VCF file. If you are simulating local-ancestry effects, the VCF file must contain the `FORMAT/LA` tag included in output of `haptools simgenotype`. See [haptools file formats](../../docs/formats/inputs.rst) for more details.
-
-* `--hap <string>`: A bgzipped, tabix-indexed HAP file, which specifies causal effects. This is a custom format described in more detail in [haptools file formats](../../docs/formats/haplotypes.rst). The HAP format enables flexible specification of a range of effect types including traditional variant-level effects, haplotype-level effects, associations with repeat lengths at short tandem repeats, and interaction of these effects with local ancestry labels. See [Examples](#examples) below for detailed examples of how to specify effects.
-
-* `--out <string>`: Prefix to name output files.
-
-* `--simu-qt` or `simu-cc` indicate whether to simulate a quantitative or case control trait. One of these two options must be specified.
-
-Additional parameters:
-
-* `--simu-rep <int>`: Number of phenotypes to simulate. Default: 1.
-
-* `--simu-hsq <float>`: Trait heritability. Default: 0.1.
-
-* `--simu-k <float>`: Disease prevalence (for case-control). Default: 0.1
-
-## Output files
-
-`haptools simphenotypes` outputs the following files:
-
-* `<outprefix>.phen`: Based on the phenotype files accepted by [plink](https://www.cog-genomics.org/plink/1.9/input#pheno). Tab-delimited file with one row per sample. The first and second columns give the sample ID. The third column gives the simulated phenotype. If `--simu-rep` was set to greater than 1, additional columns are output for each simulated trait. Example file:
-
-```
-HG00096	HG00096	0.058008375906919506
-HG00097	HG00097	0.03472768471423458
-HG00099	HG00099	-0.20850127859705808
-HG00100	HG00100	-0.21206803352471154
-HG00101	HG00101	0.3157913763428323
-```
-
-* `<outprefix>.par`: summarizes the frequencies and effects of simulated haplotypes. The columns are: haplotype ID (from the HAP file), haplotype frequency, and effect. Example file:
-
-```
-Haplotype	Frequency	Effect
-H-001	0.6	-0.2
-```
-
-<a name="examples"></a>
-## Examples
-
-#### Simulate a single haplotype-effect based on a 2 SNP haplotype:
-
-```
-haptools simphenotype \
-  --vcf tests/data/simple.vcf.gz \
-  --hap tests/data/simple.hap.gz \
-  --out test \
-  --simu-qt --simu-hsq 0.8 --simu-rep 1
-```
-
-based on this HAP file (available in `tests/data`)
-
-```
-H-001	1	10114	10116	1:10114:T:C,1:10116:A:G	T,G	*	-0.2
-```
\ No newline at end of file
diff --git a/docs/commands/simphenotype.rst b/docs/commands/simphenotype.rst
index 5c2afcfe..216d70fa 100644
--- a/docs/commands/simphenotype.rst
+++ b/docs/commands/simphenotype.rst
@@ -1,4 +1,81 @@
 .. _commands-simphenotype:
 
-.. include:: simphenotype.md
-   :parser: myst_parser.sphinx_
\ No newline at end of file
+
+simphenotype
+============
+
+Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal variables to use within the simulation by specifying them in a ``.hap`` file.
+
+The implementation is based on the `GCTA GWAS Simulation <https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation>`_ utility.
+
+Usage
+~~~~~
+.. code-block:: bash
+
+   haptools simphenotype \
+   --replications INT \
+   --heritability FLOAT \
+   --prevalence FLOAT \
+   --region TEXT \
+   --sample SAMPLE \
+   --samples-file FILENAME \
+   --output PATH \
+   --verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
+   GENOTYPES HAPLOTYPES
+
+Model
+~~~~~
+Each normalized haplotype :math:`\vec{Z_j}` is encoded as an independent causal variable in a linear model:
+
+.. math::
+
+   \vec{y} = \sum_j \beta_j \vec{Z_j} + \vec \epsilon
+
+where
+
+.. math::
+
+   \epsilon_i \sim N(0, \sigma^2)
+
+.. math::
+
+   \sigma^2 = Var[\sum_j \beta_j \vec{Z_j}] * (\frac 1 {h^2} - 1)
+
+The heritability :math:`h^2` is user-specified, but if it is not provided, then :math:`\sigma^2` will be computed purely from the effect sizes, instead:
+
+.. math::
+
+   \sigma^2 = \Biggl \lbrace {1 - \sum \beta_j^2 \quad \quad {\sum \beta_j^2 \le 1} \atop 0 \quad \quad \quad \quad \quad \text{ otherwise }}
+
+If a prevalence for the disease is specified, the final :math:`\vec{y}` value will be thresholded to produce a binary case/control trait with the desired fraction of diseased individuals.
+
+Output
+~~~~~~
+Phenotypes are output in the PLINK2-style ``.pheno`` file format. If ``--replications`` was set to greater than 1, additional columns are output for each simulated trait.
+
+Note that case/control phenotypes are encoded as 0 (control) + 1 (case) **not** 1 (control) + 2 (case). The latter is used by PLINK2 unless the ``--1`` flag is used (see `the PLIN2 docs <https://www.cog-genomics.org/plink/2.0/input#input_missing_phenotype>`_). Therefore, you must use ``--1`` when providing our ``.pheno`` files to PLINK.
+
+Examples
+~~~~~~~~
+.. code-block:: bash
+
+   haptools simphenotype -o simulated.pheno tests/data/example.vcf.gz tests/data/simphenotype.hap
+
+Simulate two replicates of a case/control trait that occurs in 60% of your samples with a heritability of 0.8. Encode all of the haplotypes in ``tests/data/example.hap.gz`` as independent causal variables.
+
+.. code-block:: bash
+
+   haptools simphenotype \
+   --replications 2 \
+   --heritability 0.8 \
+   --prevalence 0.6 \
+   --output simulated.pheno \
+   tests/data/example.vcf.gz tests/data/example.hap.gz
+
+Detailed Usage
+~~~~~~~~~~~~~~
+
+.. click:: haptools.__main__:main
+   :prog: haptools
+   :show-nested:
+   :commands: simphenotype
diff --git a/docs/commands/transform.rst b/docs/commands/transform.rst
index dcfb3346..01e55e71 100644
--- a/docs/commands/transform.rst
+++ b/docs/commands/transform.rst
@@ -24,7 +24,7 @@ Examples
 ~~~~~~~~
 .. code-block:: bash
 
-	haptools transform tests/data/example.vcf.gz tests/data/example.hap.gz | less
+	haptools transform tests/data/example.vcf.gz tests/data/basic.hap.gz | less
 
 .. code-block:: bash
 
diff --git a/docs/formats/genotypes.rst b/docs/formats/genotypes.rst
new file mode 100644
index 00000000..3c204ff9
--- /dev/null
+++ b/docs/formats/genotypes.rst
@@ -0,0 +1,7 @@
+.. _formats-genotypes:
+
+
+Genotypes
+=========
+
+Genotype files must be specified as VCF or BCF files.
diff --git a/docs/formats/haplotypes.rst b/docs/formats/haplotypes.rst
index a6707d1e..73e4d747 100644
--- a/docs/formats/haplotypes.rst
+++ b/docs/formats/haplotypes.rst
@@ -1,8 +1,8 @@
 .. _formats-haplotypes:
 
 
-.hap
-====
+Haplotypes
+==========
 
 This document describes our custom file format specification for haplotypes: the ``.hap`` file.
 
diff --git a/docs/formats/inputs.rst b/docs/formats/inputs.rst
deleted file mode 100644
index 39b639eb..00000000
--- a/docs/formats/inputs.rst
+++ /dev/null
@@ -1,43 +0,0 @@
-.. _formats-inputs:
-
-
-Inputs
-=========
-
-Genotype file format
---------------------
-Genotype files must be specified as VCF or BCF files.
-
-Phenotype file format
----------------------
-Phenotypes are expected to be in a tab-separated format with two columns:
-
-1. sample ID, which must match those from the genotype file
-2. the phenotype value
-
-There should be no header line in the file.
-
-See ``tests/data/simple.tsv`` for an example of a phenotype file.
-
-Covariate file format
----------------------
-Covariates are expected to be in a tab-separated format with the first column
-corresponding to the sample ID. Subsequent columns should contain each of your
-covariates.
-
-The first line of the file will be treated as a header. The column with sample IDs
-should be named "sample" and subsequent columns can be labeled however you'd like.
-
-See ``tests/data/covars.tsv`` for an example of a covariate file.
-
-1000 Genomes sample_info file format
-------------------------------------
-Within the subcommand ``haptools simgenotype`` we use a file to map samples in the 
-reference to their population listed in the model file. This code is expected to work
-out of the box with 1000 genomes data and we have pre-computed this mapping file as 
-well as given the command to how to compute it if you desire another as well.
-
-``cut -f 1,4 igsr-1000\ genomes\ on\ grch38.tsv | sed '1d' | sed -e 's/ /\t/g' > 1000genomes_sampleinfo.tsv``
-
-See ``example-files/1000genomes_sampleinfo.tsv`` for an example of the 1000genomes 
-GRCh38 samples mapped to their subpopulations.
diff --git a/docs/formats/phenotypes.rst b/docs/formats/phenotypes.rst
new file mode 100644
index 00000000..f300076f
--- /dev/null
+++ b/docs/formats/phenotypes.rst
@@ -0,0 +1,28 @@
+.. _formats-phenotypes:
+
+
+Phenotypes and Covariates
+=========================
+
+Phenotype file format
+---------------------
+Phenotypes are expected to follow `the PLINK2 .pheno file format <https://www.cog-genomics.org/plink/2.0/input#pheno>`_. This is a
+tab-separated format where the first column corresponds to the sample ID, and
+subsequent columns contain each of your phenotypes.
+
+The first line of the file corresponds with the header and must begin with ``#IID``.
+The names of each of your phenotypes belong in the subbsequent columns of the header.
+
+See `tests/data/simple.pheno <https://github.com/gymrek-lab/haptools/blob/main/tests/data/simple.pheno>`_ for an example of a phenotype file:
+
+.. include:: ../../tests/data/simple.pheno
+   :literal:
+
+Covariate file format
+---------------------
+Covariates follow the same format as phenotypes.
+
+See `tests/data/simple.covar <https://github.com/gymrek-lab/haptools/blob/main/tests/data/simple.covar>`_ for an example of a covariate file:
+
+.. include:: ../../tests/data/simple.covar
+   :literal:
diff --git a/docs/formats/sample_info.rst b/docs/formats/sample_info.rst
new file mode 100644
index 00000000..80e7812f
--- /dev/null
+++ b/docs/formats/sample_info.rst
@@ -0,0 +1,17 @@
+.. _formats-sample_info:
+
+
+Sample Info
+===========
+
+1000 Genomes sample_info file format
+------------------------------------
+Within the subcommand ``haptools simgenotype`` we use a file to map samples in the 
+reference to their population listed in the model file. This code is expected to work
+out of the box with 1000 genomes data and we have pre-computed this mapping file as 
+well as given the command to how to compute it if you desire another as well.
+
+``cut -f 1,4 igsr-1000\ genomes\ on\ grch38.tsv | sed '1d' | sed -e 's/ /\t/g' > 1000genomes_sampleinfo.tsv``
+
+See ``example-files/1000genomes_sampleinfo.tsv`` for an example of the 1000genomes 
+GRCh38 samples mapped to their subpopulations.
diff --git a/docs/index.rst b/docs/index.rst
index 6b3266dc..56076d41 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -9,8 +9,10 @@
    :hidden:
    :maxdepth: 1
 
-   formats/inputs.rst
+   formats/genotypes.rst
    formats/haplotypes.rst
+   formats/phenotypes.rst
+   formats/sample_info.rst
 
 .. toctree::
    :caption: Commands
@@ -20,6 +22,7 @@
 
    commands/simgenotype.rst
    commands/simphenotype.rst
+   commands/karyogram.rst
    commands/transform.rst
 
 .. toctree::
diff --git a/haptools/__main__.py b/haptools/__main__.py
old mode 100644
new mode 100755
index 68b3f581..2160bb11
--- a/haptools/__main__.py
+++ b/haptools/__main__.py
@@ -88,39 +88,169 @@ def simgenotype(invcf, sample_info, model, mapdir, out, popsize, seed, chroms):
     output_vcf(breakpoints, model, invcf, sample_info, out)
 
 ############ Haptools simphenotype ###############
-DEFAULT_SIMU_REP = 1
-DEFAULT_SIMU_HSQ = 0.1
-DEFAULT_SIMU_K = 0.1
-##################################################
 @main.command()
-@click.option('--vcf', help='Phased VCF file', type=str, required=True)
-@click.option('--hap', help='Haplotype file with effect sizes', \
-        type=str, required=True)
-@click.option('--out', help='Prefix for output files', \
-        type=str, required=True)
-@click.option('--simu-qt', help='Simulate a quantitative trait', \
-        default=False, is_flag=True)
-@click.option('--simu-cc', help='Simulate a case/control trait', \
-        default=False, is_flag=True)
-@click.option('--simu-rep', help='Number of rounds of simulation to perform', \
-        type=int, default=DEFAULT_SIMU_REP)
-@click.option('--simu-hsq', help='Trait heritability', \
-        type=float, default=DEFAULT_SIMU_HSQ)
-@click.option('--simu-k', help='Specify the disease prevalence', \
-        type=float, default=DEFAULT_SIMU_K)
-def simphenotype(vcf, hap, simu_rep, simu_hsq, simu_k, simu_qt, simu_cc, out):
+@click.argument("genotypes", type=click.Path(exists=True, path_type=Path))
+@click.argument("haplotypes", type=click.Path(exists=True, path_type=Path))
+@click.option(
+    "-r",
+    "--replications",
+    type=click.IntRange(min=1),
+    default=1,
+    show_default=True,
+    help="Number of rounds of simulation to perform",
+)
+@click.option(
+    "-h",
+    "--heritability",
+    type=click.FloatRange(min=0, max=1),
+    default=None,
+    show_default=True,
+    help="Trait heritability",
+)
+@click.option(
+    "-p",
+    "--prevalence",
+    type=click.FloatRange(min=0, max=1, min_open=False, max_open=True),
+    show_default="quantitative trait",
+    help="Disease prevalence if simulating a case-control trait"
+)
+@click.option(
+    "--region",
+    type=str,
+    default=None,
+    show_default="all haplotypes",
+    help=(
+        "The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7'"
+        "For this to work, the VCF and .hap file must be indexed and the seqname "
+        "provided must correspond with one in the files"
+    )
+)
+@click.option(
+    "-s",
+    "--sample",
+    "samples",
+    type=str,
+    multiple=True,
+    show_default="all samples",
+    help=(
+        "A list of the samples to subset from the genotypes file (ex: '-s sample1 -s"
+        " sample2')"
+    ),
+)
+@click.option(
+    "-S",
+    "--samples-file",
+    type=click.File("r"),
+    show_default="all samples",
+    help=(
+        "A single column txt file containing a list of the samples (one per line) to"
+        " subset from the genotypes file"
+    ),
+)
+@click.option(
+    "-o",
+    "--output",
+    type=click.Path(path_type=Path),
+    default=Path("/dev/stdout"),
+    show_default="stdout",
+    help="A TSV file containing simulated phenotypes",
+)
+@click.option(
+    "-v",
+    "--verbosity",
+    type=click.Choice(["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]),
+    default="ERROR",
+    show_default="only errors",
+    help="The level of verbosity desired",
+)
+def simphenotype(
+    genotypes: Path,
+    haplotypes: Path,
+    replications: int = 1,
+    heritability: float = None,
+    prevalence: float = None,
+    region: str = None,
+    samples: tuple[str] = tuple(),
+    samples_file: Path = None,
+    output: Path = Path("-"),
+    verbosity: str = 'ERROR',
+):
     """
-    Haplotype-aware phenotype simulation
+    Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a
+    set of haplotypes.
+
+    GENOTYPES must be formatted as a VCF and HAPLOTYPES must be formatted according
+    to the .hap format spec
+
+    \f
+    Examples
+    --------
+    >>> haptools simphenotype tests/data/example.vcf.gz tests/data/example.hap.gz > simulated.pheno
+
+    Parameters
+    ----------
+    genotypes : Path
+        The path to the genotypes in VCF format
+    haplotypes : Path
+        The path to the haplotypes in a .hap file
+    replications : int, optional
+        The number of rounds of simulation to perform
+    heritability : int, optional
+        The heritability of the simulated trait; must be a float between 0 and 1
+
+        If not provided, it will be computed from the sum of the squared effect sizes.
+    prevalence : int, optional
+        The prevalence of the disease if the trait should be simulated as case/control;
+        must be a float between 0 and 1
+
+        If not provided, a quantitative trait will be simulated, instead
+    region : str, optional
+        The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7'
+
+        For this to work, the VCF and .hap file must be indexed and the seqname must
+        match!
+
+        Defaults to loading all haplotypes
+    sample : tuple[str], optional
+        A subset of the samples from which to extract genotypes
+
+        Defaults to loading genotypes from all samples
+    samples_file : Path, optional
+        A single column txt file containing a list of the samples (one per line) to
+        subset from the genotypes file
+    output : Path, optional
+        The location to which to write the simulated phenotypes
+    verbosity : str, optional
+        The level of verbosity desired in messages written to stderr
     """
-    from .sim_phenotypes import simulate_pt
-    # Basic checks on input
-    # TODO - check VCF zipped, check only one of simu-qt/simu-cc,
-    # check values of other inputs
-    # Only use simu-k for case/control
+    import logging
+
+    from .sim_phenotype import simulate_pt
+
+    log = logging.getLogger("run")
+    logging.basicConfig(
+        format="[%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)",
+        level=verbosity,
+    )
+    # handle samples
+    if samples and samples_file:
+        raise click.UsageError(
+            "You may only use one of --sample or --samples-file but not both."
+        )
+    if samples_file:
+        with samples_file as samps_file:
+            samples = samps_file.read().splitlines()
+    elif samples:
+        # needs to be converted from tuple to list
+        samples = list(samples)
+    else:
+        samples = None
 
     # Run simulation
-    simulate_pt(vcf, hap, simu_rep, \
-        simu_hsq, simu_k, simu_qt, simu_cc, out)
+    simulate_pt(
+        genotypes, haplotypes, replications, heritability, prevalence, region, samples,
+        output, log
+    )
 
 @main.command(short_help="Transform a genotypes matrix via a set of haplotypes")
 @click.argument("genotypes", type=click.Path(exists=True, path_type=Path))
@@ -129,10 +259,12 @@ def simphenotype(vcf, hap, simu_rep, simu_hsq, simu_k, simu_qt, simu_cc, out):
     "--region",
     type=str,
     default=None,
-    show_default="all genotypes",
-    help="""
-    The region from which to extract genotypes; ex: 'chr1:1234-34566' or 'chr7'\n
-    For this to work, the VCF must be indexed and the seqname must match!""",
+    show_default="all haplotypes",
+    help=(
+        "The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7'"
+        "For this to work, the VCF and .hap file must be indexed and the seqname "
+        "provided must correspond with one in the files"
+    )
 )
 @click.option(
     "-s",
@@ -199,9 +331,16 @@ def transform(
     haplotypes : Path
         The path to the haplotypes in a .hap file
     region : str, optional
-        See documentation for :py:meth:`~.data.Genotypes.read`
-    sample : Tuple[str], optional
-        See documentation for :py:meth:`~.data.Genotypes.read`
+        The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7'
+
+        For this to work, the VCF and .hap file must be indexed and the seqname must
+        match!
+
+        Defaults to loading all haplotypes
+    sample : tuple[str], optional
+        A subset of the samples from which to extract genotypes
+
+        Defaults to loading genotypes from all samples
     samples_file : Path, optional
         A single column txt file containing a list of the samples (one per line) to
         subset from the genotypes file
@@ -212,8 +351,7 @@ def transform(
     """
     import logging
 
-    from haptools import data
-    from .haplotype import HaptoolsHaplotype
+    from .transform import transform_haps
 
     log = logging.getLogger("run")
     logging.basicConfig(
@@ -234,23 +372,7 @@ def transform(
     else:
         samples = None
 
-    log.info("Loading haplotypes")
-    hp = data.Haplotypes(haplotypes)
-    hp.read(region=region)
-    log.info("Extracting variants from haplotypes")
-    variants = {var.id for hap in hp.data.values() for var in hap.variants}
-    log.info("Loading genotypes")
-    gt = data.GenotypesRefAlt(genotypes, log=log)
-    # gt._prephased = True
-    gt.read(region=region, samples=samples, variants=variants)
-    gt.check_missing(discard_also=True)
-    gt.check_biallelic(discard_also=True)
-    gt.check_phase()
-    log.info("Transforming genotypes via haplotypes")
-    hp_gt = data.GenotypesRefAlt(fname=output, log=log)
-    hp.transform(gt, hp_gt)
-    log.info("Writing haplotypes to VCF")
-    hp_gt.write()
+    transform_haps(genotypes, haplotypes, region, samples, output, log)
 
 
 if __name__ == "__main__":
diff --git a/haptools/data/covariates.py b/haptools/data/covariates.py
index 45ad1f30..04960e50 100644
--- a/haptools/data/covariates.py
+++ b/haptools/data/covariates.py
@@ -8,16 +8,17 @@
 import numpy as np
 
 from .data import Data
+from .phenotypes import Phenotypes
 
 
-class Covariates(Data):
+class Covariates(Phenotypes):
     """
     A class for processing covariates from a file
 
     Attributes
     ----------
     data : np.array
-        The covariates in an n (samples) x 1 (covariate value) array
+        The covariates in an n (samples) x m (covariates) array
     fname : Path
         The path to the read-only file containing the data
     samples : tuple[str]
@@ -33,124 +34,5 @@ class Covariates(Data):
     """
 
     def __init__(self, fname: Path, log: Logger = None):
-        super().__init__(fname, log)
-        self.samples = tuple()
-        self.names = tuple()
-
-    @classmethod
-    def load(cls: Covariates, fname: Path, samples: list[str] = None) -> Covariates:
-        """
-        Load covariates from a TSV file
-
-        Read the file contents and standardize the covariates
-
-        Parameters
-        ----------
-        fname
-            See documentation for :py:attr:`~.Data.fname`
-        samples : list[str], optional
-            See documentation for :py:meth:`~.Data.Covariates.read`
-
-        Returns
-        -------
-        covariates
-            A Covariates object with the data loaded into its properties
-        """
-        covariates = cls(fname)
-        covariates.read(samples)
-        return covariates
-
-    def read(self, samples: list[str] = None):
-        """
-        Read covariates from a TSV file into a numpy matrix stored in :py:attr:`~.Covariates.data`
-
-        Parameters
-        ----------
-        samples : list[str], optional
-            A subset of the samples from which to extract covariates
-
-            Defaults to loading covariates from all samples
-
-        Raises
-        ------
-        AssertionError
-            If the provided file doesn't follow the expected format
-        """
-        super().read()
-        # load all info into memory
-        # use hook_compressed to automatically handle gz files
-        with hook_compressed(self.fname, mode="rt") as covars:
-            covar_text = reader(covars, delimiter="\t")
-            header = next(covar_text)
-            # there should at least two columns
-            if len(header) < 2:
-                raise ValueError("The covariates TSV should have at least two columns.")
-            # the first column should be called "sample"
-            if header[0] != "sample":
-                self.log.warning(
-                    "The first column of the covariates TSV should contain sample IDs"
-                    " and should be named 'sample' in the header line"
-                )
-            # convert to list and subset samples if need be
-            if samples:
-                samples = set(samples)
-                covar_text = [covar for covar in covar_text if covar[0] in samples]
-            else:
-                covar_text = list(covar_text)
-        # the second column should be castable to a float
-        try:
-            float(covar_text[0][1])
-        except:
-            self.log.error(
-                "Every column in the covariates file (besides the sample column) must"
-                " be numeric."
-            )
-        # fill out the samples and data properties
-        data = list(zip(*covar_text))
-        self.samples = data[0]
-        self.names = tuple(header[1:])
-        # coerce strings to floats
-        self.data = np.transpose(np.array(data[1:], dtype="float64"))
-
-    def __iter__(self, samples: list[str] = None) -> Iterator[namedtuple]:
-        """
-        Read covariates from a TSV line by line without storing anything
-
-        Parameters
-        ----------
-        samples : list[str], optional
-            A subset of the samples from which to extract covariates
-
-            Defaults to loading covariates from all samples
-
-        Yields
-        ------
-        Iterator[namedtuple]
-            An iterator over each line in the file, where each line is encoded as a
-            namedtuple containing each of the class properties
-        """
-        with hook_compressed(self.fname, mode="rt") as covars:
-            covar_text = reader(covars, delimiter="\t")
-            header = next(covar_text)
-            # there should at least two columns
-            if len(header) < 2:
-                raise ValueError("The covariates TSV should have at least two columns.")
-            # the first column should be called "sample"
-            if header[0] != "sample":
-                self.log.warning(
-                    "The first column of the covariates TSV should contain sample IDs"
-                    " and should be named 'sample' in the header line"
-                )
-            header = tuple(header[1:])
-            Record = namedtuple("Record", "data samples names")
-            for covar in covar_text:
-                if samples is None or covar[0] in samples:
-                    try:
-                        yield Record(
-                            np.array(covar[1:], dtype="float64"), covar[0], header
-                        )
-                    except:
-                        self.log.error(
-                            "Every column in the covariates file (besides the sample"
-                            " column) must be numeric."
-                        )
+        super(Phenotypes, self).__init__(fname, log)
+        self._ext = "covar"
diff --git a/haptools/data/genotypes.py b/haptools/data/genotypes.py
index 24e36fd9..98d4cd83 100644
--- a/haptools/data/genotypes.py
+++ b/haptools/data/genotypes.py
@@ -1,9 +1,11 @@
 from __future__ import annotations
+import re
 from csv import reader
 from pathlib import Path
 from typing import Iterator
 from collections import namedtuple
 from logging import getLogger, Logger
+from fileinput import hook_compressed
 
 import numpy as np
 import numpy.typing as npt
@@ -36,6 +38,10 @@ class Genotypes(Data):
     _prephased : bool
         If True, assume that the genotypes are phased. Otherwise, extract their phase
         when reading from the VCF.
+    _samp_idx : dict[str, int]
+        Sample index; maps samples to indices in self.samples
+    _var_idx : dict[str, int]
+        Variant index; maps variant IDs to indices in self.variants
 
     Examples
     --------
@@ -59,6 +65,8 @@ def __init__(self, fname: Path, log: Logger = None):
             ],
         )
         self._prephased = False
+        self._samp_idx = None
+        self._var_idx = None
 
     @classmethod
     def load(
@@ -93,8 +101,7 @@ def load(
         genotypes.read(region, samples, variants)
         genotypes.check_missing()
         genotypes.check_biallelic()
-        if not self._prephased:
-            genotypes.check_phase()
+        genotypes.check_phase()
         # genotypes.to_MAC()
         return genotypes
 
@@ -280,6 +287,89 @@ def __iter__(
         # see https://stackoverflow.com/a/36726497
         return self._iterate(vcf, region, variants)
 
+    def index(self, samples: bool = True, variants: bool = True):
+        """
+        Call this function once to improve the amortized time-complexity of look-ups of
+        samples and variants by their ID. This is useful if you intend to later subset
+        by a set of samples or variant IDs.
+        The time complexity of this function should be roughly O(n+m) if both
+        parameters are True. Otherwise, it will be either O(n) or O(m).
+
+        Parameters
+        ----------
+        samples: bool, optional
+            Whether to index the samples for fast loop-up. Adds complexity O(n).
+        variants: bool, optional
+            Whether to index the variants for fast look-up. Adds complexity O(m).
+        """
+        if samples and self._samp_idx is None:
+            self._samp_idx = dict(zip(self.samples, range(len(self.samples))))
+        if variants and self._var_idx is None:
+            self._var_idx = dict(zip(self.variants["id"], range(len(self.variants))))
+
+    def subset(
+        self,
+        samples: tuple[str] = None,
+        variants: tuple[str] = None,
+        inplace: bool = False,
+    ):
+        """
+        Subset these genotypes to a smaller set of samples or a smaller set of variants
+
+        The order of the samples and variants in the subsetted instance will match
+        the order in the provided tuple parameters.
+
+        Parameters
+        ----------
+        samples: tuple[str]
+            A subset of samples to keep
+        variants: tuple[str]
+            A subset of variant IDs to keep
+        inplace: bool, optional
+            If False, return a new Genotypes object; otherwise, alter the current one
+
+        Returns
+        -------
+            A new Genotypes object if inplace is set to False, else returns None
+        """
+        # First, initialize variables
+        gts = self
+        if not inplace:
+            gts = self.__class__(self.fname, self.log)
+        gts.samples = self.samples
+        gts.variants = self.variants
+        gts.data = self.data
+        # Index the current set of samples and variants so we can have fast look-up
+        self.index(samples=(samples is not None), variants=(variants is not None))
+        # Subset the samples
+        if samples is not None:
+            gts.samples = tuple(samp for samp in samples if samp in self._samp_idx)
+            if len(gts.samples) < len(samples):
+                diff = len(samples) - len(gts.samples)
+                self.log.warning(
+                    f"Saw {diff} fewer samples than requested. Proceeding with "
+                    f"{len(gts.samples)} samples."
+                )
+            samp_idx = tuple(self._samp_idx[samp] for samp in gts.samples)
+            if inplace:
+                self._samp_idx = None
+            gts.data = gts.data[samp_idx, :]
+        # Subset the variants
+        if variants is not None:
+            var_idx = [self._var_idx[var] for var in variants if var in self._var_idx]
+            if len(var_idx) < len(variants):
+                diff = len(variants) - len(var_idx)
+                self.log.warning(
+                    f"Saw {diff} fewer variants than requested. Proceeding with "
+                    f"{len(var_idx)} variants."
+                )
+            gts.variants = self.variants[var_idx]
+            if inplace:
+                self._var_idx = None
+            gts.data = gts.data[:, var_idx]
+        if not inplace:
+            return gts
+
     def check_missing(self, discard_also=False):
         """
         Check that each sample is properly genotyped
@@ -306,6 +396,7 @@ def check_missing(self, discard_also=False):
                 )
                 self.data = np.delete(self.data, samp_idx, axis=0)
                 self.samples = tuple(np.delete(self.samples, samp_idx))
+                self._samp_idx = None
             else:
                 raise ValueError(
                     "Genotype with ID {} at POS {}:{} is missing for sample {}".format(
@@ -347,6 +438,7 @@ def check_biallelic(self, discard_also=False):
                 self.log.info(f"Ignoring {len(variant_idx)} multiallelic variants")
                 self.data = np.delete(self.data, variant_idx, axis=1)
                 self.variants = np.delete(self.variants, variant_idx)
+                self._var_idx = None
             else:
                 raise ValueError(
                     "Variant with ID {} at POS {}:{} is multiallelic for sample {}"
@@ -462,6 +554,8 @@ def __init__(self, fname: Path, log: Logger = None):
             ],
         )
         self._prephased = False
+        self._samp_idx = None
+        self._var_idx = None
 
     def _variant_arr(self, record: Variant):
         """
@@ -496,10 +590,15 @@ def write(self):
                 ("Description", "Genotype"),
             ],
         )
-        for sample in self.samples:
-            # TODO: figure out how to make this work for large datasets
-            # it gets stuck and takes a long time to exit this loop
-            vcf.header.add_sample(sample)
+        try:
+            vcf.header.add_samples(self.samples)
+        except AttributeError:
+            self.log.warning(
+                "Upgrade to pysam >=0.19.1 to reduce the time required to create "
+                "VCFs. See https://github.com/pysam-developers/pysam/issues/1104"
+            )
+            for sample in self.samples:
+                vcf.header.add_sample(sample)
         self.log.info("Writing VCF records")
         for var_idx, var in enumerate(self.variants):
             rec = {
@@ -547,8 +646,81 @@ class GenotypesPLINK(GenotypesRefAlt):
     >>> genotypes = GenotypesPLINK.load('tests/data/simple.pgen')
     """
 
+    def read_samples(self, samples: list[str] = None):
+        """
+        Read sample IDs from a PSAM file into a list stored in
+        :py:attr:`~.GenotypesPLINK.samples`
+
+        One of either variants or max_variants MUST be specified!
+
+        Parameters
+        ----------
+        samples : list[str], optional
+            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+
+        Returns
+        -------
+        npt.NDArray[np.uint32]
+            The indices of each of the samples within the PSAM file
+        """
+        if len(self.samples) != 0:
+            self.log.warning("Sample data has already been loaded. Overriding.")
+        if samples is not None and not isinstance(samples, set):
+            samples = set(samples)
+        with hook_compressed(self.fname.with_suffix(".psam"), mode="rt") as psam:
+            psamples = reader(psam, delimiter="\t")
+            # find the line that declares the header
+            for header in psamples:
+                if header[0].startswith("#FID") or header[0].startswith("#IID"):
+                    break
+            # we need the header to contain the IID column
+            if header[0][0] != "#":
+                raise ValueError("Your PSAM file is missing a header!")
+                col_idx = 1
+                self.log.info(
+                    "Your PVAR file lacks a proper header! Assuming first five columns"
+                    " are [CHROM, POS, ID, REF, ALT] in that order..."
+                )
+                # TODO: add header back in to psamples using itertools.chain?
+            else:
+                header[0] = header[0][1:]
+                try:
+                    col_idx = header.index("IID")
+                except ValueError:
+                    raise ValueError("Your PSAM file must have an IID column.")
+            self.samples = {
+                ct: samp[col_idx]
+                for ct, samp in enumerate(psamples)
+                if (samples is None) or (samp[col_idx] in samples)
+            }
+            indices = np.array(list(self.samples.keys()), dtype=np.uint32)
+            self.samples = list(self.samples.values())
+            return indices
+
+    def _check_region(
+        self, pos: tuple, chrom: str, start: int = 0, end: int = float("inf")
+    ):
+        """
+        Check that pos lies within the provided chrom, start, and end coordinates
+
+        This is a helper function for :py:meth:`~.GenotypesPLINK.read_variants`
+
+        Parameters
+        ----------
+        pos : tuple[str, int]
+            A tuple of two elements: contig (str) and chromosomal position (int)
+        chrom: str
+            The contig of the region to check
+        start: int, optional
+            The start position of the region
+        end: int, optional
+            The end position of the region
+        """
+        return (pos[0] == chrom) and (start <= pos[1]) and (end >= pos[1])
+
     def read_variants(
         self,
+        region: str = None,
         variants: set[str] = None,
         max_variants: int = None,
     ):
@@ -578,53 +750,68 @@ def read_variants(
             max_variants = len(variants)
         if max_variants is None:
             raise ValueError("Provide either the variants or max_variants parameter!")
-        # TODO: try using pysam.tabix_iterator with the parser param specified for VCF
-        if region:
-            pvar = TabixFile(str(self.fname.with_suffix(".pvar")))
-            pvariants = pvar.fetch(region)
-        else:
-            pvar = hook_compressed(self.fname.with_suffix(".pvar"), mode="rt")
+        # split the region string so each portion is an element
+        if region is not None:
+            region = re.split(":|-", region)
+            if len(region) > 1:
+                region[1:] = [int(pos) for pos in region[1:] if pos]
+        with hook_compressed(self.fname.with_suffix(".pvar"), mode="rt") as pvar:
             pvariants = reader(pvar, delimiter="\t")
-        # first, preallocate the array
-        self.variants = np.empty((max_variants,), dtype=self.variants.dtype)
-        header = next(pvariants)
-        # there should be at least five columns
-        if len(header) < 5:
-            raise ValueError("Your PVAR file should have at least five columns.")
-        if header[0][0] == "#":
-            header[0] = header[0][1:]
-        else:
-            raise ValueError("Your PVAR file is missing a header!")
-            header = ["CHROM", "POS", "ID", "REF", "ALT"]
-            self.log.info(
-                "Your PVAR file lacks a proper header! Assuming first five columns"
-                " are [CHROM, POS, ID, REF, ALT] in that order..."
-            )
-            # TODO: add header back in to pvariants using itertools.chain?
-        cid = {item: header.index(item.upper()) for item in self.variants.dtype.names}
-        if variants is not None:
+            # first, preallocate the array
+            self.variants = np.empty((max_variants,), dtype=self.variants.dtype)
+            # find the line that declares the header
+            for header in pvariants:
+                if not header[0].startswith("##"):
+                    break
+            # there should be at least five columns
+            if len(header) < 5:
+                raise ValueError("Your PVAR file should have at least five columns.")
+            if header[0][0] == "#":
+                header[0] = header[0][1:]
+            else:
+                raise ValueError("Your PVAR file is missing a header!")
+                header = ["CHROM", "POS", "ID", "REF", "ALT"]
+                self.log.info(
+                    "Your PVAR file lacks a proper header! Assuming first five columns"
+                    " are [CHROM, POS, ID, REF, ALT] in that order..."
+                )
+                # TODO: add header back in to pvariants using itertools.chain?
+            cid = {
+                item: header.index(item.upper())
+                for item in self.variants.dtype.names
+                if item != "aaf"
+            }
             indices = np.empty((max_variants,), dtype=np.uint32)
-        num_seen = 0
-        for ct, rec in enumerate(pvariants):
-            if variants is not None:
-                if rec[cid["id"]] not in variants:
+            num_seen = 0
+            for ct, rec in enumerate(pvariants):
+                if region and not self._check_region(
+                    (rec[cid["chrom"]], int(rec[cid["pos"]])), *region
+                ):
                     continue
+                if variants is not None:
+                    if rec[cid["id"]] not in variants:
+                        continue
                 indices[num_seen] = ct
-            self.variants[num_seen] = np.array(
-                (
-                    rec[cid["id"]],
-                    rec[cid["chrom"]],
-                    rec[cid["pos"]],
-                    0,
-                    rec[cid["ref"]],
-                    rec[cid["alt"]],
-                ),
-                dtype=self.variants.dtype,
+                self.variants[num_seen] = np.array(
+                    (
+                        rec[cid["id"]],
+                        rec[cid["chrom"]],
+                        rec[cid["pos"]],
+                        0,
+                        rec[cid["ref"]],
+                        rec[cid["alt"]],
+                    ),
+                    dtype=self.variants.dtype,
+                )
+                num_seen += 1
+        if max_variants > num_seen:
+            self.log.info(
+                f"Removing {max_variants-num_seen} unneeded variant records that "
+                "were preallocated b/c max_variants was specified."
             )
-            num_seen += 1
-        pvar.close()
-        if variants is not None:
-            return indices
+            indices = indices[:num_seen]
+            self.variants = self.variants[:num_seen]
+        return indices
 
     def read(
         self,
@@ -648,34 +835,57 @@ def read(
         max_variants : int, optional
             See documentation for :py:attr:`~.GenotypesRefAlt.read`
         """
-        super().read()
-        # load the pgen-reader file
-        # note: very little is loaded into memory at this point
+        super(Genotypes, self).read()
         # TODO: figure out how to install this package
         from pgenlib import PgenReader
 
-        pgen = PgenReader(bytes(self.fname, "utf8"))
+        sample_idxs = self.read_samples(samples)
+        with PgenReader(
+            bytes(str(self.fname), "utf8"), sample_subset=sample_idxs
+        ) as pgen:
 
-        if variants is not None:
-            max_variants = len(variants)
-        if max_variants is None:
-            # use the pgen file to figure out how many variants there are
-            max_variants = pgen.get_variant_ct()
-        indices = self.read_variants(region, variants, max_variants)
-
-        variant_ct_start = 0
-        variant_ct_end = max_variants
-        variant_ct = variant_ct_end - variant_ct_start
-
-        sample_ct = pgen.get_raw_sample_ct()
-        # the genotypes start out as a simple 2D array with twice the number of samples
-        # so each column is a different chromosomal strand
-        self.data = np.empty((variant_ct, sample_ct * 2), dtype=np.int32)
-        pgen.read_alleles(indices, self.data)
-        # extract the genotypes to a np matrix of size n x p x 2
-        # the last dimension has two items:
-        # 1) presence of REF in strand one
-        # 2) presence of REF in strand two
-        self.data = np.dstack((self.data[:, ::2], self.data[:, 1::2]))
-        # transpose the GT matrix so that samples are rows and variants are columns
-        self.data = self.data.transpose((1, 0, 2))
+            if variants is not None:
+                max_variants = len(variants)
+            if max_variants is None:
+                # use the pgen file to figure out how many variants there are
+                max_variants = pgen.get_variant_ct()
+            indices = self.read_variants(region, variants, max_variants)
+
+            # the genotypes start out as a simple 2D array with twice the number of samples
+            if not self._prephased:
+                # raise an error message b/c this is untested code that doesn't really
+                # seem to work properly
+                # for ex, the phasing info is not boolean for some reason?
+                raise ValueError("Not implemented yet!")
+                # ...so each column is a different chromosomal strand
+                self.data = np.empty(
+                    (len(indices), len(sample_idxs) * 2), dtype=np.int32
+                )
+                phasing = np.empty((len(indices), len(sample_idxs)), dtype=np.uint8)
+                # The haplotype-major mode of read_alleles_and_phasepresent_list has
+                # not been implemented yet, so we need to read the genotypes in sample-
+                # major mode and then transpose them
+                pgen.read_alleles_and_phasepresent_list(indices, self.data, phasing)
+                # missing alleles will have a value of -9
+                # let's make them be -1 to be consistent with cyvcf2
+                self.data[self.data == -9] = -1
+                self.data = np.dstack((self.data[:, ::2], self.data[:, 1::2])).astype(
+                    np.uint8
+                )
+                self.data = np.concatenate(
+                    (self.data, phasing[:, :, np.newaxis]), axis=2
+                )
+                # transpose the GT matrix so that samples are rows and variants are columns
+                self.data = self.data.transpose((1, 0, 2))
+            else:
+                # ...so each row is a different chromosomal strand
+                self.data = np.empty(
+                    (len(sample_idxs) * 2, len(indices)), dtype=np.int32
+                )
+                pgen.read_alleles_list(indices, self.data, hap_maj=True)
+                # missing alleles will have a value of -9
+                # let's make them be -1 to be consistent with cyvcf2
+                self.data[self.data == -9] = -1
+                self.data = np.dstack((self.data[::2, :], self.data[1::2, :])).astype(
+                    np.uint8
+                )
diff --git a/haptools/data/haplotypes.py b/haptools/data/haplotypes.py
index f15f5c9b..5adb031a 100644
--- a/haptools/data/haplotypes.py
+++ b/haptools/data/haplotypes.py
@@ -1,5 +1,4 @@
 from __future__ import annotations
-import re
 from pathlib import Path
 from logging import getLogger, Logger
 from fileinput import hook_compressed
@@ -88,7 +87,7 @@ def fmt_str(self) -> str:
         """
         Convert an Extra into a fmt string
 
-        Retruns
+        Returns
         -------
         str
             A python format string (ex: "{beta:.3f}")
@@ -235,8 +234,8 @@ class Haplotype:
         The chromosomal end position of the haplotype
     id: str
         The haplotype's unique ID
-    variants: list[Variant]
-        A list of the variants in this haplotype
+    variants: tuple[Variant]
+        The variants in this haplotype
     _extras: tuple[Extra]
         Extra fields for the haplotype
 
@@ -278,7 +277,7 @@ def _fmt(self):
     @property
     # TODO: use @cached_property in py3.8
     def varIDs(self):
-        return {var.id for var in self.variants}
+        return tuple(var.id for var in self.variants)
 
     @classmethod
     def from_hap_spec(
@@ -335,9 +334,7 @@ def extras_head(cls) -> tuple:
         """
         return tuple(extra.to_hap_spec("H") for extra in cls._extras)
 
-    def transform(
-        self, genotypes: GenotypesRefAlt, samples: list[str] = None
-    ) -> npt.NDArray[bool]:
+    def transform(self, genotypes: GenotypesRefAlt) -> npt.NDArray[bool]:
         """
         Transform a genotypes matrix via the current haplotype
 
@@ -351,8 +348,6 @@ def transform(
 
             If the genotypes have not been loaded into the Genotypes object yet, this
             method will call Genotypes.read(), while loading only the needed variants
-        samples : list[str], optional
-            See documentation for :py:attr:`~.Genotypes.read`
 
         Returns
         -------
@@ -361,34 +356,27 @@ def transform(
             denotes the presence of the haplotype in one chromosome of a sample
         """
         var_IDs = self.varIDs
-        # check: have the genotypes been loaded yet?
-        # if not, we can load just the variants we need
-        if genotypes.unset():
-            start = min(var.start for var in self.variants)
-            end = max(var.end for var in self.variants)
-            region = f"{self.chrom}:{start}-{end}"
-            genotypes.read(region=region, samples=samples, variants=var_IDs)
-            genotypes.check_biallelic(discard_also=True)
-            genotypes.check_phase()
-        # create a dict where the variants are keyed by ID
-        var_dict = {
-            var["id"]: var["ref"] for var in genotypes.variants if var["id"] in var_IDs
-        }
-        var_idxs = [
-            idx for idx, var in enumerate(genotypes.variants) if var["id"] in var_IDs
-        ]
-        missing_IDs = var_IDs - var_dict.keys()
-        if len(missing_IDs):
+        gts = genotypes.subset(variants=var_IDs)
+        # check: were any of the variants absent from the genotypes?
+        if len(gts.variants) < len(var_IDs):
+            missing_IDs = set(var_IDs) - set(gts.variants["id"])
             raise ValueError(
                 f"Variants {missing_IDs} are present in haplotype '{self.id}' but "
                 "absent in the provided genotypes"
             )
         # create a np array denoting the alleles that we want
-        alleles = [int(var.allele != var_dict[var.id]) for var in self.variants]
-        allele_arr = np.array([[[al] for al in alleles]])  # shape: (1, n, 1)
+        # note: the excessive use of square-brackets gives us shape (1, n, 1)
+        allele_arr = np.array(
+            [
+                [
+                    [int(var.allele != gts.variants[i]["ref"])]
+                    for i, var in enumerate(self.variants)
+                ]
+            ]
+        )
         # look for the presence of each allele in each chromosomal strand
         # and then just AND them together
-        return np.all(allele_arr == genotypes.data[:, var_idxs], axis=1)
+        return np.all(allele_arr == gts.data, axis=1)
 
 
 class Haplotypes(Data):
@@ -505,7 +493,7 @@ def check_header(self, lines: list[str], check_version=False):
                     expected_lines.remove(line)
                 except ValueError:
                     # extract the name of the extra field
-                    name = line.split("\t", maxsplit=1)[1]
+                    name = line.split("\t", maxsplit=2)[1]
                     raise ValueError(
                         f"The extra field '{name}' is declared in the header of the"
                         " .hap file but is not accepted by this tool."
@@ -730,12 +718,7 @@ def __repr__(self):
 
     def write(self):
         """
-        Write the contents of this Haplotypes object to the file given by fname
-
-        Parameters
-        ----------
-        file: TextIO
-            A file-like object to which this Haplotypes object should be written.
+        Write the contents of this Haplotypes object to the file at :py:attr:`~.Haplotypes.fname`
 
         Examples
         --------
@@ -752,9 +735,7 @@ def write(self):
     def transform(
         self,
         genotypes: GenotypesRefAlt,
-        hap_gts: GenotypesRefAlt,
-        samples: list[str] = None,
-        low_memory: bool = False,
+        hap_gts: GenotypesRefAlt = None,
     ) -> GenotypesRefAlt:
         """
         Transform a genotypes matrix via the current haplotype
@@ -766,42 +747,32 @@ def transform(
         ----------
         genotypes : GenotypesRefAlt
             The genotypes which to transform using the current haplotype
-
-            If the genotypes have not been loaded into the Genotypes object yet, this
-            method will call Genotypes.read(), while loading only the needed variants
         hap_gts: GenotypesRefAlt
             An empty GenotypesRefAlt object into which the haplotype genotypes should
             be stored
-        samples : list[str], optional
-            See documentation for :py:attr:`~.Genotypes.read`
-        low_memory : bool, optional
-            If True, each haplotype's genotypes will be loaded one at a time.
 
         Returns
         -------
         GenotypesRefAlt
             A Genotypes object composed of haplotypes instead of regular variants.
         """
+        # Initialize GenotypesRefAlt return value
+        if hap_gts is None:
+            hap_gts = GenotypesRefAlt(fname=None, log=self.log)
         hap_gts.samples = genotypes.samples
         hap_gts.variants = np.array(
             [(hap.id, hap.chrom, hap.start, 0, "A", "T") for hap in self.data.values()],
-            dtype=[
-                ("id", "U50"),
-                ("chrom", "U10"),
-                ("pos", np.uint32),
-                ("aaf", np.float64),
-                ("ref", "U100"),
-                ("alt", "U100"),
-            ],
+            dtype=hap_gts.variants.dtype,
         )
+        # Obtain and merge the haplotype genotypes
         self.log.info(
             f"Transforming a set of genotypes from {len(genotypes.variants)} total "
             f"variants with a list of {len(self.data)} haplotypes"
         )
         hap_gts.data = np.concatenate(
             tuple(
-                hap.transform(genotypes, samples)[:, np.newaxis]
-                for hap in self.data.values()
+                hap.transform(genotypes)[:, np.newaxis] for hap in self.data.values()
             ),
             axis=1,
-        ).astype(np.uint8)
+        ).astype(genotypes.data.dtype)
+        return hap_gts
diff --git a/haptools/data/phenotypes.py b/haptools/data/phenotypes.py
index a27c2535..d43222c4 100644
--- a/haptools/data/phenotypes.py
+++ b/haptools/data/phenotypes.py
@@ -1,11 +1,14 @@
 from __future__ import annotations
 from csv import reader
 from pathlib import Path
-from collections import namedtuple
+from io import TextIOBase
+from collections.abc import Iterable
 from logging import getLogger, Logger
 from fileinput import hook_compressed
+from collections import namedtuple, Counter
 
 import numpy as np
+import numpy.typing as npt
 
 from .data import Data
 
@@ -17,11 +20,13 @@ class Phenotypes(Data):
     Attributes
     ----------
     data : np.array
-        The phenotypes in an n (samples) x 1 (phenotype value) array
+        The phenotypes in an n (samples) x m (phenotypes) array
     fname : Path
         The path to the read-only file containing the data
     samples : tuple
         The names of each of the n samples
+    names : tuple[str]
+        The names of the phenotypes
     log: Logger
         A logging instance for recording debug statements.
 
@@ -33,6 +38,8 @@ class Phenotypes(Data):
     def __init__(self, fname: Path, log: Logger = None):
         super().__init__(fname, log)
         self.samples = tuple()
+        self.names = tuple()
+        self._ext = "pheno"
 
     @classmethod
     def load(cls: Phenotypes, fname: Path, samples: list[str] = None) -> Phenotypes:
@@ -75,36 +82,28 @@ def read(self, samples: list[str] = None):
             If the provided file doesn't follow the expected format
         """
         super().read()
-        # load all info into memory
-        # use hook_compressed to automatically handle gz files
-        with hook_compressed(self.fname, mode="rt") as phens:
-            phen_text = reader(phens, delimiter="\t")
-            # convert to list and subset samples if need be
-            if samples:
-                samples = set(samples)
-                phen_text = [phen for phen in phen_text if phen[0] in samples]
-            else:
-                phen_text = list(phen_text)
-        # there should only be two columns
-        if len(phen_text[0]) != 2:
-            self.log.warning("The phenotype TSV should only have two columns.")
-        # the second column should be castable to a float
-        try:
-            float(phen_text[0][1])
-        except:
-            self.log.error("The second column of the TSV file must numeric.")
+        # call self.__iter__() to obtain the data and samples
+        data, self.samples = zip(*self.__iter__(samples))
+        self.log.info(f"Loaded {len(self.samples)} samples from .{self._ext} file")
         # fill out the samples and data properties
-        self.samples, self.data = zip(*phen_text)
-        # coerce strings to floats
-        self.data = np.array(self.data, dtype="float64")
+        # collect data in a np array
+        self.data = np.array(data)
 
-    def __iter__(self, samples: list[str] = None) -> Iterator[namedtuple]:
+    def _iterate(
+        self, phens: TextIOBase, phen_text: Iterable, samples: set[str] = None
+    ):
         """
-        Read phenotypes from a TSV line by line without storing anything
+        A generator over the lines of a TSV
+
+        This is a helper function for :py:meth:`~.Phenotypes.__iter__`
 
         Parameters
         ----------
-        samples : list[str], optional
+        phens: TextIOBase
+            The file handler for the stream
+        phen_text: Iterable
+            The csv.reader object containing the lines of text from the file as lists
+        samples : set[str], optional
             A subset of the samples from which to extract phenotypes
 
             Defaults to loading phenotypes from all samples
@@ -115,17 +114,88 @@ def __iter__(self, samples: list[str] = None) -> Iterator[namedtuple]:
             An iterator over each line in the file, where each line is encoded as a
             namedtuple containing each of the class properties
         """
-        with hook_compressed(self.fname, mode="rt") as phens:
-            phen_text = reader(phens, delimiter="\t")
-            Record = namedtuple("Record", "data samples")
-            for phen in phen_text:
-                if samples is None or phen[0] in samples:
-                    try:
-                        yield Record(float(phen[1]), phen[0])
-                    except:
-                        self.log.error(
-                            "The second column of the TSV file must numeric."
-                        )
+        self.log.info(f"Loading {len(self.names)} columns from .{self._ext} file")
+        Record = namedtuple("Record", "data samples")
+        for phen in phen_text:
+            if samples is None or phen[0] in samples:
+                try:
+                    yield Record(np.array(phen[1:], dtype="float64"), phen[0])
+                except:
+                    self.log.error(
+                        f"Every column in the .{self._ext} file (besides the sample"
+                        " column) must be numeric."
+                    )
+        phens.close()
+
+    def __iter__(self, samples: list[str] = None) -> Iterable[namedtuple]:
+        """
+        Read phenotypes from a TSV line by line without storing anything
+
+        Parameters
+        ----------
+        samples : list[str], optional
+            A subset of the samples from which to extract phenotypes
+
+            Defaults to loading phenotypes from all samples
+
+        Returns
+        ------
+        Iterable[namedtuple]
+            See documentation for :py:meth:`~.Phenotypes._iterate`
+        """
+        phens = hook_compressed(self.fname, mode="rt")
+        phen_text = reader(phens, delimiter="\t")
+        # ignore all of the comment lines
+        while True:
+            header = next(phen_text)
+            if not header[0].startswith("#") or header[0].startswith("#IID"):
+                break
+
+        # there should be at least two columns
+        if len(header) < 2:
+            raise ValueError(f"The .{self._ext} file should have at least two columns.")
+        # the first column should be called "#IID"
+        if header[0] != "#IID":
+            self.log.warning(
+                f"The first column of the .{self._ext} file should contain sample IDs"
+                " and should be named '#IID' in the header line"
+            )
+        self.names = tuple(header[1:])
+        samples = set(samples) if samples else None
+        # call another function to force the lines above to be run immediately
+        # see https://stackoverflow.com/a/36726497
+        return self._iterate(phens, phen_text, samples)
+
+    def write(self):
+        """
+        Write the phenotypes in this class to a file at :py:attr:`~.Phenotypes.fname`
+
+        Examples
+        --------
+        To write to a file, you must first initialize a Phenotypes object and then
+        fill out the names, data, and samples properties:
+        >>> phenotypes = Phenotypes('tests/data/simple.pheno')
+        >>> phenotypes.names = ('height',)
+        >>> phenotypes.data = np.array([1, 1, 2], dtype='float64')
+        >>> phenotypes.samples = ('HG00096', 'HG00097', 'HG00099')
+        >>> phenotypes.write()
+        """
+        # make sure the names are unique
+        uniq_names = Counter()
+        names = [None] * len(self.names)
+        for idx, name in enumerate(self.names):
+            suffix = ""
+            if uniq_names[name]:
+                suffix = f"-{uniq_names[name]}"
+            names[idx] = name + suffix
+            uniq_names[name] += 1
+        # now we can finally write the file
+        with hook_compressed(self.fname, mode="wt") as haps:
+            haps.write("#IID\t" + "\t".join(names) + "\n")
+            formatter = {"float_kind": lambda x: "%.2f" % x}
+            for samp, phen in zip(self.samples, self.data):
+                line = np.array2string(phen, separator="\t", formatter=formatter)[1:-1]
+                haps.write(f"{samp}\t" + line + "\n")
 
     def standardize(self):
         """
@@ -133,4 +203,34 @@ def standardize(self):
 
         This function modifies :py:attr:`~.Genotypes.data` in-place
         """
-        self.data = (self.data - np.mean(self.data)) / np.std(self.data)
+        self.data = (self.data - np.mean(self.data, axis=0)) / np.std(self.data, axis=0)
+
+    def append(self, name: str, data: npt.NDArray):
+        """
+        Append a new set of phenotypes to the current set
+
+        Parameters
+        ----------
+        name: str
+            The name of the new phenotype
+        data: npt.NDArray
+            A 1D np array of the same length as :py:attr:`~.Phenotypes.samples`,
+            containing the phenotype values for each sample. Must have the same dtype
+            as :py:attr:`~.Phenotypes.data.`
+        """
+        if len(self.samples):
+            if len(self.samples) != len(data):
+                self.log.error(
+                    "The data provided to the add() method is not of the appropriate"
+                    "length"
+                )
+        else:
+            self.log.warning(
+                "Set the samples property of the Phenotypes instance before calling "
+                "the add() method"
+            )
+        if self.unset():
+            self.data = data[:, np.newaxis]
+        else:
+            self.data = np.concatenate((self.data, data[:, np.newaxis]), axis=1)
+        self.names = self.names + (name,)
diff --git a/haptools/haplotype.py b/haptools/haplotype.py
deleted file mode 100644
index b6c0a1d8..00000000
--- a/haptools/haplotype.py
+++ /dev/null
@@ -1,26 +0,0 @@
-from __future__ import annotations
-from dataclasses import dataclass, field
-
-import numpy as np
-
-from .data import Extra, Haplotype
-
-
-@dataclass
-class HaptoolsHaplotype(Haplotype):
-    """
-    A haplotype with sufficient fields for simphenotype
-
-    Properties and functions are shared with the base Haplotype object
-    """
-
-    ancestry: str
-    beta: float
-    _extras: tuple = field(
-        repr=False,
-        init=False,
-        default=(
-            Extra("ancestry", "s", "Local ancestry"),
-            Extra("beta", ".2f", "Effect size in linear model"),
-        ),
-    )
diff --git a/haptools/haplotypes.py b/haptools/haplotypes.py
deleted file mode 100644
index 3c9993e4..00000000
--- a/haptools/haplotypes.py
+++ /dev/null
@@ -1,199 +0,0 @@
-"""
-Classes for reading/writing and other functions
-on haplotypes
-
-TODO:
-* transform
-  * deal with LA and STRs
-
-* writephenotypes
-* documentation and type hinting
-* tests
-
-* Compare simulation output to GCTA
-* fail gracefully on assertions - we need ERROR utility
-
-"""
-import re
-import gzip
-from pathlib import Path
-
-import pysam
-import numpy as np
-
-
-class Haplotype:
-	def __init__(self, hap_id, hap_chr, hap_start, hap_end, \
-					hap_varids, hap_varalleles, \
-					hap_LA, hap_effect):
-		self.hap_id = hap_id
-		self.hap_chr = hap_chr
-		self.hap_start = hap_start
-		self.hap_end = hap_end
-		self.hap_varids = hap_varids
-		self.hap_varalleles = hap_varalleles
-		self.hap_LA = hap_LA
-		self.hap_effect = hap_effect
-
-		##### Do some basic checks #####
-		# hap_varids and hap_varalleles must be the same length
-		assert(len(self.hap_varids)==len(self.hap_varalleles))
-		# Do not repeat variant IDs
-		assert(len(set(self.hap_varids))==len(self.hap_varids))
-		# TODO - STR and LA checks
-
-	def GetVariantID(self, record):
-		return "%s_%s_%s_%s"%(record.CHROM, record.POS, record.REF, ":".join(record.ALT))
-
-	def CheckHapMatch(self, vcf_reader):
-		sample_list = vcf_reader.samples
-
-		# Keep track of whether each chrom
-		# of each sample matches or not
-		hap_matches = {}
-		for sample in sample_list:
-			hap_matches[sample] = [True, True]
-
-		found_ids = 0
-		records = vcf_reader("%s:%s-%s"%(self.hap_chr, self.hap_start, self.hap_end))
-		for record in records:
-			# Check if the record is part of the haplotype
-			if (record.ID in self.hap_varids):
-				varind = self.hap_varids.index(record.ID)
-			elif (GetVariantID(record) in self.varids):
-				varind = self.hap_varids.index(self.GetVariantID(record))
-			else: continue
-			var_allele = self.hap_varalleles[varind]
-			var_id = self.hap_varids[varind]
-			found_ids += 1
-
-			# Determine the index of the target allele
-			record_alleles = [record.REF]+record.ALT
-			assert(var_allele in record_alleles) # TODO fail more gracefully with a message
-			allele_ind = record_alleles.index(var_allele)
-
-			# Check if each sample matches
-			# If not, set hap_matches to False
-			for i in range(len(sample_list)):
-				sample = sample_list[i]
-				call = record.genotypes[i]
-				assert(call[2]) # make sure it is phased
-				for j in range(2):
-					if call[j]!=allele_ind: hap_matches[sample][j] = False
-
-		# Check we got through all the IDs
-		assert(found_ids == len(self.hap_varids)) # TODO fail more gracefully with a message
-		return hap_matches
-
-	def Transform(self, vcf_reader):
-		hap_gts = {}
-
-		##### Case 1 - STR length #####
-		pass # TODO
-
-		##### Case 2 - variant haplotype #####
-		hap_matches = self.CheckHapMatch(vcf_reader)
-
-		##### Case 3 - LA (may be in combination with case 2) #####
-		pass # TODO
-
-		# Get haplotype counts
-		for sample in vcf_reader.samples:
-			hap_gts[sample] = sum(hap_matches[sample])
-		return hap_gts
-
-	def GetFrequency(self, vcf_reader):
-		hap_gts = self.Transform(vcf_reader)
-		return np.sum(list(hap_gts.values()))/(2*len(hap_gts.keys()))
-
-	def __str__(self):
-		return self.hap_id
-
-class HapReader:
-	def __init__(self, hapfile: Path, region: str = None):
-		"""
-		Open a haplotype file with the given region
-
-		Parameters
-		----------
-		hapfile : Path
-		  The path to a .haps file containing haplotypes
-		region : str, optional
-		  A region from which to query specific haplotypes; ex: "chr1:1000-2000"
-		"""
-		if region is None:
-			self.haps = gzip.open(hapfile)
-		else:
-			# split the region string so each portion is an element
-			region = re.split(':|-', region)
-			if len(region) > 1:
-				region[1:] = [int(pos) for pos in region[1:] if pos]
-			# fetch region
-			self.haps = pysam.TabixFile(hapfile).fetch(*region)
-
-	def GetHaplotype(self, hapline):
-		hap_id = hapline[0]
-		hap_chr = hapline[1]
-		hap_start = int(hapline[2])
-		hap_end = int(hapline[3])
-		hap_varids = hapline[4].split(",")
-		hap_varalleles = hapline[5].split(",")
-		hap_LA = hapline[6].split(",")
-		hap_effect = float(hapline[7])
-		return Haplotype(hap_id, hap_chr, hap_start, hap_end, \
-							hap_varids, hap_varalleles, \
-							hap_LA, hap_effect)
-
-	def __iter__(self):
-		return self
-
-	def __next__(self):
-		# Get the next haplotype
-		hapline = next(self.haps)
-
-		# If reading from gzip file, convert to list
-		# so it is in the same format as from tabix
-		if type(hapline) == bytes:
-			hapline = hapline.decode('utf-8')
-			while hapline.startswith('#'):
-				hapline = next(self.haps).decode('utf-8')
-			hapline = hapline.strip().split()
-
-		# Convert the line to a haplotype object
-		return self.GetHaplotype(hapline)
-
-class PhenoSimulator:
-	def __init__(self, sample_list):
-		self.pts = {}
-		for sample in sample_list:
-			self.pts[sample] = 0
-
-	def AddEffect(self, sample_to_hap, effect):
-		mean = np.mean(list(sample_to_hap.values()))
-		sd = np.sqrt(np.var(list(sample_to_hap.values())))
-		for sample in self.pts.keys():
-			assert(sample in sample_to_hap.keys())
-			self.pts[sample] += effect*(sample_to_hap[sample]-mean)/sd
-
-	def WritePhenotypes(self, outprefix, simu_rep, simu_hsq, \
-						simu_k, simu_qt, simu_cc):
-		# First compute var(additive)
-		var_add = np.var(list(self.pts.values()))
-
-		# Prepare output file
-		outf_sim = open("%s.phen"%outprefix, "w")
-
-		# Simulate and write
-		for sample in self.pts.keys():
-			outinfo = [sample, sample]
-			genetic_component = self.pts[sample]
-			for i in range(simu_rep):
-				resid_component = np.random.normal(0, var_add*(1/simu_hsq-1))
-				if simu_qt:
-					outinfo.append(genetic_component+resid_component)
-				else:
-					raise NotImplementedError("Case control not implemented")
-			outf_sim.write("\t".join([str(item) for item in outinfo])+"\n")
-
-		# Done
-		outf_sim.close()
\ No newline at end of file
diff --git a/haptools/sim_phenotype.py b/haptools/sim_phenotype.py
new file mode 100644
index 00000000..df1c417c
--- /dev/null
+++ b/haptools/sim_phenotype.py
@@ -0,0 +1,274 @@
+from __future__ import annotations
+from pathlib import Path
+from itertools import combinations
+from logging import getLogger, Logger, DEBUG
+from dataclasses import dataclass, field
+
+import numpy as np
+import numpy.typing as npt
+
+from .data import Haplotype as HaplotypeBase
+from .data import GenotypesRefAlt, Phenotypes, Haplotypes, Extra
+
+
+@dataclass
+class Haplotype(HaplotypeBase):
+    """
+    A haplotype with sufficient fields for simphenotype
+
+    Properties and functions are shared with the base Haplotype object, "HaplotypeBase"
+    """
+
+    ancestry: str
+    beta: float
+    _extras: tuple = field(
+        repr=False,
+        init=False,
+        default=(
+            Extra("ancestry", "s", "Local ancestry"),
+            Extra("beta", ".2f", "Effect size in linear model"),
+        ),
+    )
+
+
+class PhenoSimulator:
+    """
+    Simulate phenotypes from genotypes
+
+    Attributes
+    ----------
+    gens: Genotypes
+        Genotypes to simulate
+    phens: Phenotypes
+        Simulated phenotypes; filled by :py:meth:`~.PhenoSimular.run`
+    rng: np.random.Generator, optional
+        A numpy random number generator
+    log: Logger
+        A logging instance for recording debug statements
+
+    Examples
+    --------
+    >>> gens = Genotypes.load("tests/data/example.vcf.gz")
+    >>> haps = Haplotypes.load("tests/data/basic.hap")
+    >>> haps_gts = haps.transform(gens)
+    >>> phenosim = PhenoSimulator(haps_gts)
+    >>> phenosim.run(haps.data.values())
+    >>> phenotypes = phenosim.phens
+    """
+
+    def __init__(
+        self,
+        genotypes: Genotypes,
+        output: Path = Path("/dev/stdout"),
+        seed: int = None,
+        log: Logger = None,
+    ):
+        """
+        Initialize a PhenoSimulator object
+
+        Parameters
+        ----------
+        genotypes: Genotypes
+            Genotypes for each haplotype
+        output: Path, optional
+            Path to a '.pheno' file to which the generated phenotypes could be written
+
+            Defaults to stdout if not provided
+        seed: int, optional
+            A seed to initialize the random number generator
+
+            This is useful if you want the generated phenotypes to be the same across
+            multiple PhenoSimulator instances. If not provided, it will be random.
+        log: Logger, optional
+            A logging instance for recording debug statements
+        """
+        self.gens = genotypes
+        self.phens = Phenotypes(fname=output)
+        self.phens.data = None
+        self.phens.samples = self.gens.samples
+        self.rng = np.random.default_rng(seed)
+        self.log = log or getLogger(self.__class__.__name__)
+
+    def run(
+        self,
+        effects: list[Haplotype],
+        heritability: float = None,
+        prevalence: float = None,
+    ) -> npt.NDArray:
+        """
+        Simulate phenotypes for an entry in the Genotypes object
+
+        The generated phenotypes will also be added to
+        :py:attr:`~.PhenoSimulator.output`
+
+        Parameters
+        ----------
+        effects: list[Haplotype]
+            A list of Haplotypes to use in an additive fashion within the simulations
+        heritability: float, optional
+            The simulated heritability of the trait
+
+            If not provided, this will default to the sum of the squared effect sizes
+        prevalence: float, optional
+            How common should the disease be within the population?
+
+            If this value is specified, case/control phenotypes will be generated
+            instead of quantitative traits.
+
+        Returns
+        -------
+        npt.NDArray
+            The simulated phenotypes, as a np array of shape num_samples x 1
+        """
+        # extract the relevant haplotype info from the Haplotype objects
+        ids = [hap.id for hap in effects]
+        betas = np.array([hap.beta for hap in effects])
+        self.log.debug(f"Extracting haplotype genotypes for haps: {ids}")
+        self.log.debug(f"Beta values are {betas}")
+        # extract the haplotype "genotypes" and compute the phenotypes
+        gts = self.gens.subset(variants=ids).data.sum(axis=2)
+        self.log.info(f"Computing genetic component w/ {gts.shape[0]} causal effects")
+        # standardize the genotypes
+        gts = (gts - gts.mean(axis=0)) / gts.std(axis=0)
+        # generate the genetic component
+        pt = (betas * gts).sum(axis=1)
+        # compute the heritability
+        if heritability is None:
+            self.log.debug("Computing heritability as the sum of the squared betas")
+            heritability = np.power(betas, 2).sum()
+            if heritability > 1:
+                heritability = 1
+            # compute the environmental effect
+            noise = 1 - heritability
+        else:
+            # compute the environmental effect
+            noise = np.var(pt) * (np.reciprocal(heritability) - 1)
+        self.log.info(f"Adding environmental component {noise} for h^2 {heritability}")
+        # finally, add everything together to get the simulated phenotypes
+        pt_noise = self.rng.normal(0, noise, size=pt.shape)
+        if self.log.getEffectiveLevel() == DEBUG:
+            # if we're in debug mode, compute the pearson correlation and report it
+            # but don't do this otherwise to keep things fast
+            corr = np.corrcoef(pt, pt + pt_noise)[1, 0]
+            self.log.debug(f"Estimated heritability is {corr}")
+        pt += pt_noise
+        # now, handle case/control
+        name_suffix = ""
+        if prevalence is not None:
+            self.log.info(f"Converting to case/control with prevalence {prevalence}")
+            # first, find the number of desired positives
+            k = int(prevalence * len(pt))
+            # choose the top k values and label them positive
+            bool_pt = np.zeros(len(pt), dtype=np.bool_)
+            if k == len(pt):
+                bool_pt = ~bool_pt
+            else:
+                max_indices = np.argpartition(-pt, k)[:k]
+                bool_pt[max_indices] = True
+            pt = bool_pt
+            name_suffix = "-cc"
+        # now, save the archived phenotypes for later
+        self.phens.append(name="-".join(ids) + name_suffix, data=pt.astype(np.float64))
+        return pt
+
+    def write(self):
+        """
+        Write the generated phenotypes to the file specified in
+        :py:meth:`~.PhenoSimular.__init__`
+        """
+        self.phens.write()
+
+
+def simulate_pt(
+    genotypes: Path,
+    haplotypes: Path,
+    num_replications: int = 1,
+    heritability: float = None,
+    prevalence: float = None,
+    region: str = None,
+    samples: list[str] = None,
+    output: Path = Path("-"),
+    log: Logger = None,
+):
+    """
+    Haplotype-aware phenotype simulation. Create a set of simulated phenotypes from a
+    set of haplotypes.
+
+    GENOTYPES must be formatted as a VCF and HAPLOTYPES must be formatted according
+    to the .hap format spec
+
+    \f
+    Examples
+    --------
+    >>> haptools simphenotype tests/data/example.vcf.gz tests/data/example.hap.gz > simu_phens.tsv
+
+    Parameters
+    ----------
+    genotypes : Path
+        The path to the genotypes in VCF format
+    haplotypes : Path
+        The path to the haplotypes in a .hap file
+    replications : int, optional
+        The number of rounds of simulation to perform
+    heritability : int, optional
+        The heritability of the simulated trait; must be a float between 0 and 1
+
+        If not provided, it will be computed from the sum of the squared effect sizes
+    prevalence : int, optional
+        The prevalence of the disease if the trait should be simulated as case/control;
+        must be a float between 0 and 1
+
+        If not provided, a quantitative trait will be simulated, instead
+    region : str, optional
+        The region from which to extract haplotypes; ex: 'chr1:1234-34566' or 'chr7'
+
+        For this to work, the VCF and .hap file must be indexed and the seqname must
+        match!
+
+        Defaults to loading all haplotypes
+    sample : tuple[str], optional
+        A subset of the samples from which to extract genotypes
+
+        Defaults to loading genotypes from all samples
+    samples_file : Path, optional
+        A single column txt file containing a list of the samples (one per line) to
+        subset from the genotypes file
+    output : Path, optional
+        The location to which to write the simulated phenotypes
+    log : Logger, optional
+        The logging module for this task
+    """
+    if log is None:
+        log = logging.getLogger("run")
+        logging.basicConfig(
+            format="[%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)",
+            level="ERROR",
+        )
+
+    log.info("Loading haplotypes")
+    hp = Haplotypes(haplotypes, haplotype=Haplotype, log=log)
+    hp.read(region=region)
+
+    log.info("Extracting variants from haplotypes")
+    variants = {var.id for hap in hp.data.values() for var in hap.variants}
+
+    log.info("Loading genotypes")
+    gt = GenotypesRefAlt(genotypes, log=log)
+    # gt._prephased = True
+    gt.read(region=region, samples=samples, variants=variants)
+    log.info("QC-ing genotypes")
+    gt.check_missing()
+    gt.check_biallelic()
+    gt.check_phase()
+
+    log.info("Transforming genotypes via haplotypes")
+    hp_gt = GenotypesRefAlt(fname=None, log=log)
+    hp.transform(gt, hp_gt)
+
+    # Initialize phenotype simulator (haptools simphenotype)
+    log.info("Simulating phenotypes")
+    pt_sim = PhenoSimulator(hp_gt, output=output, log=log)
+    for i in range(num_replications):
+        pt_sim.run(hp.data.values(), heritability, prevalence)
+    log.info("Writing phenotypes")
+    pt_sim.write()
diff --git a/haptools/sim_phenotypes.py b/haptools/sim_phenotypes.py
deleted file mode 100644
index 1701a1b0..00000000
--- a/haptools/sim_phenotypes.py
+++ /dev/null
@@ -1,40 +0,0 @@
-"""
-Example test command:
-
-haptools simphenotype --vcf tests/data/simple.vcf.gz --hap tests/data/simple.hap.gz \
-	--out test --simu-qt  --simu-hsq 0.8 --simu-rep 2
-"""
-
-from cyvcf2 import VCF
-
-from .haplotypes import *
-
-def simulate_pt(vcffile, hapfile, simu_rep, \
-        simu_hsq, simu_k, simu_qt, simu_cc, outprefix):
-
-	# Initialize readers
-	vcf_reader = VCF(vcffile, gts012=True)
-	hap_reader = HapReader(hapfile)
-
-	# Initialize phenotype simulator (haptools simphenotypes)
-	pt_sim = PhenoSimulator(vcf_reader.samples)
-
-	# Set up file to write summary info
-	outf_sum = open("%s.par"%outprefix, "w")
-	outf_sum.write("\t".join(["Haplotype", "Frequency", "Effect"])+"\n")
-
-	# Add effect for each haplotype
-	for hap in hap_reader:
-		sample_to_hap = hap.Transform(vcf_reader)
-		pt_sim.AddEffect(sample_to_hap, hap.hap_effect)
-
-		# Output summary info
-		hap_freq = hap.GetFrequency(vcf_reader)
-		outf_sum.write("\t".join([hap.hap_id, str(hap_freq), str(hap.hap_effect)])+"\n")
-
-	# Output simulated phenotypes to a file (haptools simphenotypes)
-	pt_sim.WritePhenotypes(outprefix, simu_rep, simu_hsq, \
-		simu_k, simu_qt, simu_cc)
-
-	# Done
-	outf_sum.close()
\ No newline at end of file
diff --git a/haptools/transform.py b/haptools/transform.py
new file mode 100644
index 00000000..2de57a2f
--- /dev/null
+++ b/haptools/transform.py
@@ -0,0 +1,64 @@
+from __future__ import annotations
+import logging
+from pathlib import Path
+
+from haptools import data
+
+
+def transform_haps(
+    genotypes: Path,
+    haplotypes: Path,
+    region: str = None,
+    samples: list[str] = None,
+    output: Path = Path("-"),
+    log: Logger = None,
+):
+    """
+    Creates a VCF composed of haplotypes
+
+    Parameters
+    ----------
+    genotypes : Path
+        The path to the genotypes in VCF format
+    haplotypes : Path
+        The path to the haplotypes in a .hap file
+    region : str, optional
+        See documentation for :py:meth:`~.data.Genotypes.read`
+        and :py:meth:`~.data.Haplotypes.read`
+    samples : list[str], optional
+        See documentation for :py:meth:`~.data.Genotypes.read`
+    output : Path, optional
+        The location to which to write output
+    log : Logger, optional
+        A logging module to which to write messages about progress and any errors
+    """
+    if log is None:
+        log = logging.getLogger("run")
+        logging.basicConfig(
+            format="[%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)",
+            level="ERROR",
+        )
+
+    log.info("Loading haplotypes")
+    hp = data.Haplotypes(haplotypes, log=log)
+    hp.read(region=region)
+
+    log.info("Extracting variants from haplotypes")
+    variants = {var.id for hap in hp.data.values() for var in hap.variants}
+
+    log.info("Loading genotypes")
+    gt = data.GenotypesRefAlt(genotypes, log=log)
+    # gt._prephased = True
+    gt.read(region=region, samples=samples, variants=variants)
+    gt.check_missing()
+    gt.check_biallelic()
+    gt.check_phase()
+
+    log.info("Transforming genotypes via haplotypes")
+    hp_gt = data.GenotypesRefAlt(fname=output, log=log)
+    hp.transform(gt, hp_gt)
+
+    log.info("Writing haplotypes to VCF")
+    hp_gt.write()
+
+    return hp_gt
diff --git a/tests/data/covars.tsv b/tests/data/simple.covar
similarity index 81%
rename from tests/data/covars.tsv
rename to tests/data/simple.covar
index 9d4ec1d1..c2b96e30 100644
--- a/tests/data/covars.tsv
+++ b/tests/data/simple.covar
@@ -1,4 +1,4 @@
-sample	sex	age
+#IID	sex	age
 HG00096	0	4
 HG00097	1	20
 HG00099	1	33
diff --git a/tests/data/simple.pheno b/tests/data/simple.pheno
new file mode 100644
index 00000000..b567f784
--- /dev/null
+++ b/tests/data/simple.pheno
@@ -0,0 +1,6 @@
+#IID	height	bmi
+HG00096	1	3
+HG00097	1	6
+HG00099	2	8
+HG00100	2	1
+HG00101	0	4
\ No newline at end of file
diff --git a/tests/data/simple.tsv b/tests/data/simple.tsv
deleted file mode 100644
index f36f9427..00000000
--- a/tests/data/simple.tsv
+++ /dev/null
@@ -1,5 +0,0 @@
-HG00096	1
-HG00097	1
-HG00099	2
-HG00100	2
-HG00101	0
diff --git a/tests/test_data.py b/tests/test_data.py
index 9f3efc55..6f15873f 100644
--- a/tests/test_data.py
+++ b/tests/test_data.py
@@ -5,7 +5,7 @@
 import numpy as np
 import numpy.lib.recfunctions as rfn
 
-from haptools.haplotype import HaptoolsHaplotype
+from haptools.sim_phenotype import Haplotype as HaptoolsHaplotype
 from haptools.data import (
     Genotypes,
     GenotypesRefAlt,
@@ -217,95 +217,193 @@ def test_load_genotypes_subset(self):
         np.testing.assert_allclose(gts.data, expected)
         assert gts.samples == tuple(samples)
 
+    def test_subset_genotypes(self):
+        gts = self._get_fake_genotypes()
+
+        # subset to just the samples we want
+        expected_data = gts.data[:3]
+        expected_variants = gts.variants
+        samples = ("HG00096", "HG00097", "HG00099")
+        gts_sub = gts.subset(samples=samples)
+        assert gts_sub.samples == samples
+        np.testing.assert_allclose(gts_sub.data, expected_data)
+        assert np.array_equal(gts_sub.variants, expected_variants)
+
+        # subset to just the variants we want
+        expected_data = gts.data[:, [1, 2]]
+        expected_variants = gts.variants[[1, 2]]
+        variants = ("1:10116:A:G", "1:10117:C:A")
+        gts_sub = gts.subset(variants=variants)
+        assert gts_sub.samples == gts.samples
+        np.testing.assert_allclose(gts_sub.data, expected_data)
+        assert np.array_equal(gts_sub.variants, expected_variants)
+
+        # subset both: samples and variants
+        expected_data = gts.data[[3, 4], :2]
+        expected_variants = gts.variants[:2]
+        samples = ("HG00100", "HG00101")
+        variants = ("1:10114:T:C", "1:10116:A:G")
+        gts_sub = gts.subset(samples=samples, variants=variants)
+        assert gts_sub.samples == samples
+        np.testing.assert_allclose(gts_sub.data, expected_data)
+        assert np.array_equal(gts_sub.variants, expected_variants)
+
+
+class TestPhenotypes:
+    def _get_expected_phenotypes(self):
+        # create a phen matrix with shape: samples x phenotypes
+        expected = np.array(
+            [
+                [1, 1, 2, 2, 0],
+                [3, 6, 8, 1, 4],
+            ],
+            dtype="float64",
+        ).T
+        return expected
 
-def test_load_phenotypes(caplog):
-    # create a phenotype vector with shape: num_samples x 1
-    expected = np.array([1, 1, 2, 2, 0])
-
-    # can we load the data from the phenotype file?
-    phens = Phenotypes(DATADIR.joinpath("simple.tsv"))
-    phens.read()
-    np.testing.assert_allclose(phens.data, expected)
-    assert phens.samples == ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
-
-    # try loading the data again - it should warn b/c we've already done it
-    phens.read()
-    assert len(caplog.records) == 1 and caplog.records[0].levelname == "WARNING"
-
-    expected = (expected - np.mean(expected)) / np.std(expected)
-    phens.standardize()
-    np.testing.assert_allclose(phens.data, expected)
+    def _get_fake_phenotypes(self):
+        gts = Phenotypes(fname=None)
+        gts.data = self._get_expected_phenotypes()
+        gts.samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
+        gts.names = ("height", "bmi")
+        return gts
 
+    def test_load_phenotypes(self, caplog):
+        expected_phen = self._get_fake_phenotypes()
+        expected = expected_phen.data
 
-def test_load_phenotypes_iterate():
-    # create a phenotype vector with shape: num_samples x 1
-    expected = np.array([1, 1, 2, 2, 0])
-    samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
+        # can we load the data from the phenotype file?
+        phens = Phenotypes(DATADIR.joinpath("simple.pheno"))
+        phens.read()
+        np.testing.assert_allclose(phens.data, expected)
+        assert phens.samples == expected_phen.samples
 
-    # can we load the data from the phenotype file?
-    phens = Phenotypes(DATADIR.joinpath("simple.tsv"))
-    for idx, line in enumerate(phens):
-        np.testing.assert_allclose(line.data, expected[idx])
-        assert line.samples == samples[idx]
+        # try loading the data again - it should warn b/c we've already done it
+        caplog.clear()
+        phens.read()
+        assert len(caplog.records) > 0 and caplog.records[0].levelname == "WARNING"
 
+        expected = (expected - np.mean(expected, axis=0)) / np.std(expected, axis=0)
+        phens.standardize()
+        np.testing.assert_allclose(phens.data, expected)
 
-def test_load_phenotypes_subset():
-    # create a phenotype vector with shape: num_samples x 1
-    expected = np.array([1, 1, 2, 2, 0])
+    def test_load_phenotypes_iterate(self):
+        expected_phen = self._get_fake_phenotypes()
+        expected = expected_phen.data
+        samples = expected_phen.samples
 
-    # subset for just the samples we want
-    expected = expected[[1, 3]]
+        # can we load the data from the phenotype file?
+        phens = Phenotypes(DATADIR.joinpath("simple.pheno"))
+        for idx, line in enumerate(phens):
+            np.testing.assert_allclose(line.data, expected[idx])
+            assert line.samples == samples[idx]
 
-    # can we load the data from the phenotype file?
-    phens = Phenotypes(DATADIR.joinpath("simple.tsv"))
-    samples = ["HG00097", "HG00100"]
-    phens.read(samples=samples)
-    np.testing.assert_allclose(phens.data, expected)
-    assert phens.samples == tuple(samples)
+    def test_load_phenotypes_subset(self):
+        expected_phen = self._get_fake_phenotypes()
+        expected = expected_phen.data
+        samples = ["HG00097", "HG00100"]
 
+        # subset for just the samples we want
+        expected = expected[[1, 3]]
 
-def test_load_covariates(caplog):
-    # create a covariate vector with shape: num_samples x num_covars
-    expected = np.array([(0, 4), (1, 20), (1, 33), (0, 15), (0, 78)])
+        # can we load the data from the phenotype file?
+        phens = Phenotypes(DATADIR.joinpath("simple.pheno"))
+        phens.read(samples=samples)
+        np.testing.assert_allclose(phens.data, expected)
+        assert phens.samples == tuple(samples)
+
+    def test_write_phenotypes(self):
+        expected_phen = self._get_fake_phenotypes()
+
+        # first, we write the data
+        expected_phen.fname = DATADIR.joinpath("test.pheno")
+        expected_phen.write()
+
+        # now, let's load the data and check that it's what we wrote
+        result = Phenotypes(expected_phen.fname)
+        result.read()
+        np.testing.assert_allclose(expected_phen.data, result.data)
+        assert expected_phen.names == result.names
+        assert expected_phen.samples == result.samples
+
+        # let's clean up after ourselves and delete the file
+        os.remove(str(expected_phen.fname))
+
+    def test_append_phenotype(self):
+        expected1 = self._get_fake_phenotypes()
+        expected2 = self._get_fake_phenotypes()
+
+        # add a phenotype called "test" to the set of phenotypes
+        new_phen = np.array([5, 2, 8, 0, 3], dtype=expected2.data.dtype)
+        expected2.append("test", new_phen)
+
+        # did it get added correctly?
+        assert expected1.data.shape[1] == expected2.data.shape[1] - 1
+        assert len(expected1.names) == len(expected2.names) - 1
+        assert expected2.names[2] == "test"
+        assert len(expected1.samples) == len(expected2.samples)
+        np.testing.assert_allclose(expected1.data, expected2.data[:, :2])
+        np.testing.assert_allclose(expected2.data[:, 2], new_phen)
+
+
+class TestCovariates:
+    def _get_expected_covariates(self):
+        # create a covar matrix with shape: samples x covariates
+        expected = np.array([(0, 4), (1, 20), (1, 33), (0, 15), (0, 78)])
+        return expected
 
-    # can we load the data from the covariates file?
-    covars = Covariates(DATADIR.joinpath("covars.tsv"))
-    covars.read()
-    np.testing.assert_allclose(covars.data, expected)
-    assert covars.samples == ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
-    assert covars.names == ("sex", "age")
+    def _get_fake_covariates(self):
+        gts = Phenotypes(fname=None)
+        gts.data = self._get_expected_covariates()
+        gts.samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
+        gts.names = ("sex", "age")
+        return gts
 
-    # try loading the data again - it should warn b/c we've already done it
-    covars.read()
-    assert len(caplog.records) == 1 and caplog.records[0].levelname == "WARNING"
+    def test_load_covariates(self, caplog):
+        # create a covariate vector with shape: num_samples x num_covars
+        expected_covar = self._get_fake_covariates()
+        expected = expected_covar.data
+        samples = expected_covar.samples
 
+        # can we load the data from the covariates file?
+        covars = Covariates(DATADIR.joinpath("simple.covar"))
+        covars.read()
+        np.testing.assert_allclose(covars.data, expected)
+        assert covars.samples == samples
+        assert covars.names == ("sex", "age")
 
-def test_load_covariates_iterate():
-    # create a covariate vector with shape: num_samples x num_covars
-    expected = np.array([(0, 4), (1, 20), (1, 33), (0, 15), (0, 78)])
-    samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
+        # try loading the data again - it should warn b/c we've already done it
+        caplog.clear()
+        covars.read()
+        assert len(caplog.records) > 0 and caplog.records[0].levelname == "WARNING"
 
-    # can we load the data from the covariates file?
-    covars = Covariates(DATADIR.joinpath("covars.tsv"))
-    for idx, line in enumerate(covars):
-        np.testing.assert_allclose(line.data, expected[idx])
-        assert line.samples == samples[idx]
-        assert line.names == ("sex", "age")
+    def test_load_covariates_iterate(self):
+        # create a covariate vector with shape: num_samples x num_covars
+        expected_covar = self._get_fake_covariates()
+        expected = expected_covar.data
+        samples = expected_covar.samples
 
+        # can we load the data from the covariates file?
+        covars = Covariates(DATADIR.joinpath("simple.covar"))
+        for idx, line in enumerate(covars):
+            np.testing.assert_allclose(line.data, expected[idx])
+            assert line.samples == samples[idx]
+        assert covars.names == ("sex", "age")
 
-def test_load_covariates_subset():
-    # create a covariate vector with shape: num_samples x num_covars
-    expected = np.array([(0, 4), (1, 20), (1, 33), (0, 15), (0, 78)])
+    def test_load_covariates_subset(self):
+        # create a covariate vector with shape: num_samples x num_covars
+        expected_covar = self._get_fake_covariates()
+        expected = expected_covar.data
+        samples = ["HG00097", "HG00100"]
 
-    # subset for just the samples we want
-    expected = expected[[1, 3]]
+        # subset for just the samples we want
+        expected = expected[[1, 3]]
 
-    # can we load the data from the covariate file?
-    covars = Covariates(DATADIR.joinpath("covars.tsv"))
-    samples = ["HG00097", "HG00100"]
-    covars.read(samples=samples)
-    np.testing.assert_allclose(covars.data, expected)
-    assert covars.samples == tuple(samples)
+        # can we load the data from the covariate file?
+        covars = Covariates(DATADIR.joinpath("simple.covar"))
+        covars.read(samples=samples)
+        np.testing.assert_allclose(covars.data, expected)
+        assert covars.samples == tuple(samples)
 
 
 class TestHaplotypes:
diff --git a/tests/test_simphenotype.py b/tests/test_simphenotype.py
new file mode 100644
index 00000000..00339656
--- /dev/null
+++ b/tests/test_simphenotype.py
@@ -0,0 +1,196 @@
+import os
+from pathlib import Path
+
+import pytest
+import numpy as np
+import numpy.lib.recfunctions as rfn
+
+from haptools.sim_phenotype import Haplotype, PhenoSimulator
+from haptools.data import (
+    Genotypes,
+    Phenotypes,
+    Haplotypes,
+)
+
+
+DATADIR = Path(__file__).parent.joinpath("data")
+
+
+class TestSimPhenotype:
+    def _get_fake_gens(self):
+        gts = Genotypes(fname=None)
+        gts.data = np.array(
+            [
+                [[1, 0], [0, 1]],
+                [[0, 1], [0, 1]],
+                [[0, 0], [1, 1]],
+                [[1, 1], [1, 1]],
+                [[0, 0], [0, 0]],
+            ],
+            dtype=np.uint8,
+        )
+        gts.variants = np.array(
+            [
+                ("1:10114:T:C", "1", 10114, 0.7),
+                ("1:10116:A:G", "1", 10116, 0.6),
+            ],
+            dtype=[
+                ("id", "U50"),
+                ("chrom", "U10"),
+                ("pos", np.uint32),
+                ("aaf", np.float64),
+            ],
+        )
+        gts.samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
+        return gts
+
+    def _get_fake_haps(self):
+        return [
+            Haplotype("1", 10114, 10115, "1:10114:T:C", "CEU", 0.25),
+            Haplotype("1", 10116, 10117, "1:10116:A:G", "YRI", 0.75),
+        ]
+
+    def _get_expected_phens(self):
+        pts = Phenotypes(fname=None)
+        pts.data = np.array(
+            [
+                [-0.13363062, False],
+                [-0.13363062, False],
+                [0.53452248, True],
+                [1.20267559, True],
+                [-1.46993683, False],
+            ],
+            dtype=np.float64,
+        )
+        pts.samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
+        pts.names = ("1:10114:T:C", "1:10116:A:G")
+        return pts
+
+    def test_one_hap_zero_noise(self):
+        gts = self._get_fake_gens()
+        hps = self._get_fake_haps()
+        expected = self._get_expected_phens()
+
+        pt_sim = PhenoSimulator(gts, seed=42)
+        data = pt_sim.run([hps[0]], heritability=1)
+        data = data[:, np.newaxis]
+        phens = pt_sim.phens
+
+        # check the data and the generated phenotype object
+        assert phens.data.shape == (5, 1)
+        np.testing.assert_allclose(phens.data, data)
+        assert phens.data[0, 0] == phens.data[1, 0]
+        assert phens.data[2, 0] == phens.data[4, 0]
+        assert phens.data[3, 0] > phens.data[0, 0]
+        assert phens.data[2, 0] < phens.data[0, 0]
+        assert phens.samples == expected.samples
+        assert phens.names[0] == expected.names[0]
+
+    def test_one_hap_zero_noise_neg_beta(self):
+        """
+        the same test as test_one_phen_zero_noise but with a negative beta this time
+        """
+        gts = self._get_fake_gens()
+        hps = self._get_fake_haps()
+        # make the beta value negative
+        hps[0].beta = -hps[0].beta
+        expected = self._get_expected_phens()
+
+        pt_sim = PhenoSimulator(gts, seed=42)
+        data = pt_sim.run([hps[0]], heritability=1)
+        data = data[:, np.newaxis]
+        phens = pt_sim.phens
+
+        # check the data and the generated phenotype object
+        assert phens.data.shape == (5, 1)
+        np.testing.assert_allclose(phens.data, data)
+        assert phens.data[0, 0] == phens.data[1, 0]
+        assert phens.data[2, 0] == phens.data[4, 0]
+        assert phens.data[3, 0] < phens.data[0, 0]
+        assert phens.data[2, 0] > phens.data[0, 0]
+        assert phens.samples == expected.samples
+        assert phens.names[0] == expected.names[0]
+
+    def test_two_haps_zero_noise(self):
+        gts = self._get_fake_gens()
+        hps = self._get_fake_haps()
+        expected = self._get_expected_phens()
+
+        pt_sim = PhenoSimulator(gts, seed=42)
+        pt_sim.run([hps[0]], heritability=1)
+        pt_sim.run([hps[1]], heritability=1)
+        phens = pt_sim.phens
+
+        # check the data and the generated phenotype object
+        assert phens.data.shape == (5, 2)
+        assert phens.data[0, 0] == phens.data[1, 0]
+        assert phens.data[2, 0] == phens.data[4, 0]
+        assert phens.data[3, 0] > phens.data[0, 0]
+        assert phens.data[2, 0] < phens.data[0, 0]
+
+        assert phens.data[0, 1] == phens.data[1, 1]
+        assert phens.data[2, 1] == phens.data[3, 1]
+        assert phens.data[3, 1] > phens.data[0, 1]
+        assert phens.data[4, 1] < phens.data[0, 1]
+
+        assert phens.samples == expected.samples
+        assert phens.names == expected.names
+
+    def test_combined_haps_zero_noise(self):
+        gts = self._get_fake_gens()
+        hps = self._get_fake_haps()
+        expected = self._get_expected_phens()
+
+        pt_sim = PhenoSimulator(gts, seed=42)
+        pt_sim.run(hps, heritability=1)
+        phens = pt_sim.phens
+
+        # check the data and the generated phenotype object
+        assert phens.data.shape == (5, 1)
+        np.testing.assert_allclose(phens.data[:, 0], expected.data[:, 0])
+
+        assert phens.samples == expected.samples
+        assert phens.names == ("-".join(expected.names),)
+
+    def test_noise(self):
+        gts = self._get_fake_gens()
+        hps = self._get_fake_haps()
+        expected = self._get_expected_phens()
+
+        pt_sim = PhenoSimulator(gts, seed=42)
+        pt_sim.run(hps, heritability=1)
+        pt_sim.run(hps, heritability=0.96)
+        pt_sim.run(hps, heritability=0.5)
+        pt_sim.run(hps, heritability=0.2)
+        phens = pt_sim.phens
+
+        # check the data and the generated phenotype object
+        assert phens.data.shape == (5, 4)
+        np.testing.assert_allclose(phens.data[:, 0], expected.data[:, 0])
+        diff1 = np.abs(phens.data[:, 1] - phens.data[:, 0]).sum()
+        assert diff1 > 0
+        diff2 = np.abs(phens.data[:, 2] - phens.data[:, 0]).sum()
+        assert diff2 > diff1
+        diff3 = np.abs(phens.data[:, 3] - phens.data[:, 0]).sum()
+        assert diff3 > diff2
+
+    def test_case_control(self):
+        gts = self._get_fake_gens()
+        hps = self._get_fake_haps()
+        expected = self._get_expected_phens()
+        all_false = np.zeros(expected.data.shape, dtype=np.float64)
+        some_true = expected.data[:, 1]
+        all_true = (~(all_false.astype(np.bool_))).astype(np.float64)
+
+        pt_sim = PhenoSimulator(gts, seed=42)
+        pt_sim.run(hps, heritability=1, prevalence=0.4)
+        pt_sim.run(hps, heritability=1, prevalence=1)
+        pt_sim.run(hps, heritability=1, prevalence=0)
+        pt_sim.run(hps, heritability=0.5, prevalence=0.4)
+        phens = pt_sim.phens
+        assert phens.data.shape == (5, 4)
+        np.testing.assert_allclose(phens.data[:, 0], some_true)
+        np.testing.assert_allclose(phens.data[:, 1], all_true[:, 1])
+        np.testing.assert_allclose(phens.data[:, 2], all_false[:, 1])
+        diff1 = (phens.data[:, 3] == phens.data[:, 0]).sum()
+        assert diff1 > 0