Skip to content

Commit

Permalink
feat: .snplist input to simphenotype (#217)
Browse files Browse the repository at this point in the history
  • Loading branch information
aryarm authored Jun 1, 2023
1 parent aaa8cee commit cb18970
Show file tree
Hide file tree
Showing 9 changed files with 218 additions and 67 deletions.
51 changes: 25 additions & 26 deletions docs/api/examples.rst
Original file line number Diff line number Diff line change
Expand Up @@ -56,21 +56,19 @@ As an example, let's say we would like to convert the following ``.blocks.det``
.. _api-examples-snps2hap:

Creating a ``.hap`` file of SNPs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :ref:`simphenotype <commands-simphenotype>` command requires a ``.hap`` file containing haplotypes, but what if you want to give it SNPs, instead?
Converting a ``.snplist`` file into a ``.hap`` file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How would you convert a :doc:`.snplist file </formats/snplist>` into a ``.hap`` file suitable for use by ``simphenotype``?

Well, you can encode each SNP as a haplotype containing only a single allele. For example, let's say you have two SNPs, rs429358 and rs7412, with ALT alleles C and T respectively. Then your ``.hap`` file might look something like this.
The basic idea is to encode each SNP as a haplotype containing only a single allele. For example, let's say your ``.snplist`` file has two SNPs like this.

.. code-block::
.. include:: ../../tests/data/apoe.snplist
:literal:

Then your ``.hap`` file might look something like this.

# orderH beta
# version 0.1.0
#H beta .2f Effect size in linear model
H 19 45411941 45411942 rs429358 0.73
H 19 45412079 45412080 rs7412 0.30
V rs429358 45411941 45411942 rs429358 C
V rs7412 45412079 45412080 rs7412 T
.. include:: ../../tests/data/apoe.hap
:literal:

You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API <api-haptools-sim_phenotype>` to create such a file.

Expand All @@ -79,31 +77,32 @@ You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API
from haptools import data
from haptools.sim_phenotype import Haplotype
# which variants do we want to write to the haplotype file?
variants = {"rs429358", "rs7412"}
variants = {}
# load variants from the snplist file
with open("tests/data/apoe.snplist") as snplist_file:
for line in snplist_file.readlines():
# parse variant ID and beta from file
ID, beta = line.split("\t")
variants[ID] = float(beta)
# load the genotypes file
# you can use either a VCF or PGEN file
gt = data.GenotypesVCF("tests/data/apoe.vcf.gz")
gt.read(variants=variants)
# the advantage of using a PGEN file is that you can use read_variants() to load
# the variants quickly w/o having to load the genotypes, too
gt = data.GenotypesPLINK("tests/data/apoe.pgen")
gt.read_variants(variants=variants)
gt.read(variants=variants.keys())
# initialize an empty haplotype file
hp = data.Haplotypes("output.hap", haplotype=Haplotype)
hp.data = {}
for variant in gt.variants:
ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]]
end = pos + len(alleles[1])
ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]]
# we arbitrarily choose to use the ALT allele but alleles[0] will give you REF
allele = alleles[1]
end = pos + len(allele)
# create a haplotype line in the .hap file
# you should fill out "beta" with your own value
hp.data[ID] = Haplotype(chrom=chrom, start=pos, end=end, id=ID, beta=0.5)
hp.data[ID] = Haplotype(chrom=chrom, start=pos, end=end, id=ID, beta=variants[ID])
# create variant lines for each haplotype
hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alleles[1]),)
# create a variant line for each haplotype
hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=allele),)
hp.write()
14 changes: 11 additions & 3 deletions docs/commands/simphenotype.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
simphenotype
============

Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal haplotypes or variants by specifying them in a :doc:`.hap file </formats/haplotypes>`. Phenotypes are simulated from genotypes output by the :doc:`transform command </commands/transform>`.
Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal variants or haplotypes by specifying them in a :doc:`.snplist file </formats/snplist>` or :doc:`.hap file </formats/haplotypes>`. Phenotypes are simulated from genotypes output by the :doc:`transform command </commands/transform>`.

The implementation is based on the `GCTA GWAS Simulation <https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation>`_ utility.

Expand Down Expand Up @@ -57,7 +57,7 @@ If a prevalence for the disease is specified, the final :math:`\vec{y}` value wi

Input
~~~~~
Genotypes must be specified in VCF and haplotypes must be specified in the :doc:`.hap file format </formats/haplotypes>`. If you'd like to encode simple SNPs as causal variants within a ``.hap`` file, use the haptools API like in :ref:`this example <api-examples-snps2hap>`.
Genotypes must be specified in VCF and haplotypes must be specified in the :doc:`.snplist </formats/snplist>` or :doc:`.hap file format </formats/haplotypes>`.

.. note::
Your ``.hap`` files must contain a "beta" extra field. See :ref:`this section <formats-haplotypes-extrafields-simphenotype>` of the ``.hap`` format spec for more details.
Expand All @@ -73,12 +73,20 @@ Phenotypes are output in the PLINK2-style ``.pheno`` file format. If ``--replica

Examples
~~~~~~~~
In its simplest usage, ``simphenotype`` can be used to simulate traits arising from SNPs in a :doc:`.snplist file </formats/snplist>`.

.. code-block:: bash
haptools simphenotype tests/data/apoe.vcf.gz tests/data/apoe.snplist
However, if you want to simulate haplotype-based effects, you will need to ``transform`` your SNPs into haplotypes first. You can pass the same ``.hap`` file to both commands.

.. code-block:: bash
haptools transform tests/data/simple.vcf tests/data/simple.hap | \
haptools simphenotype -o simulated.pheno /dev/stdin tests/data/simple.hap
By default, all of the haplotypes in the ``.hap`` file will be encoded as causal variables. Alternatively, you can select the causal variables manually via the ``--id`` or ``--ids-file`` parameters.
By default, all of the effects in the ``.hap`` file will be encoded as causal variables. Alternatively, you can select the causal variables manually via the ``--id`` or ``--ids-file`` parameters.

.. code-block:: bash
Expand Down
36 changes: 36 additions & 0 deletions docs/formats/snplist.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
.. _formats-snplist:


Causal SNPs
===========

You can specify causal SNPs in a tab-delimited ``.snplist`` file. We follow `GCTA's .snplist format <https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation>`_ for this type of file. It has just two columns:

.. list-table::
:widths: 15 15 25
:header-rows: 1

* - Name
- Type
- Description
* - ID
- string
- A unique identifier for this variant in the file (ex: 'rs1234')
* - BETA
- float
- The effect size assigned to this variant (ex: 0.08)

.. note::
You should not include a header in this file. The file format does not have one.

Examples
--------

Refer to `tests/data/apoe.snplist <https://github.com/cast-genomics/haptools/blob/main/tests/data/apoe.snplist>`_ for an example containing just two SNPs.

.. include:: ../../tests/data/apoe.snplist
:literal:

Converting to a ``.hap`` file
-----------------------------
The capabilities of the ``.snplist`` format are limited. For example, it does not allow users to specify a causal allele (REF vs ALT) for each SNP. You can use :ref:`the haptools API to upgrade to a .hap file <api-examples-snps2hap>` if needed.
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,7 @@ There is an option to *Cite this repository* on the right sidebar of `the reposi
formats/phenotypes.rst
formats/ld.rst
formats/linear.rst
formats/snplist.rst
formats/breakpoints.rst
formats/sample_info.rst
formats/models.rst
Expand Down
104 changes: 78 additions & 26 deletions haptools/sim_phenotype.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,41 @@
)


@dataclass
class Effect:
"""
A variable in the simphenotype linear model
Attributes
----------
id : str
The ID of the variable; corresponds to a variant in a Genotypes object
beta : float
The effect size of the variable
"""

id: str
beta: float

@classmethod
def from_hap_spec(cls: Effect, line: str) -> Effect:
"""
Convert a .snplist line into an Effect object
Parameters
----------
line: str
A line from a .snplist file
Returns
-------
Effect
The converted Effect instance
"""
ID, beta = line.split("\t")[:2]
return cls(id=ID, beta=float(beta))


@dataclass
class Haplotype(HaplotypeBase):
"""
Expand Down Expand Up @@ -112,7 +147,7 @@ def __init__(

def run(
self,
effects: list[Haplotype | Repeat],
effects: list[Effect | Haplotype | Repeat],
heritability: float = None,
prevalence: float = None,
normalize: bool = True,
Expand All @@ -125,7 +160,7 @@ def run(
Parameters
----------
effects: list[Haplotype]
effects: list[Effect|Haplotype|Repeat]
A list of Haplotypes to use in an additive fashion within the simulations
heritability: float, optional
The simulated heritability of the trait
Expand Down Expand Up @@ -334,26 +369,42 @@ def simulate_pt(
if log is None:
log = getLogger(name="simphenotype", level="ERROR")

log.info("Loading haplotypes")
hp = Haplotypes(haplotypes, haplotype=Haplotype, repeat=Repeat, log=log)
hp.read(region=region, haplotypes=haplotype_ids)

if haplotype_ids is None:
haplotype_ids = set(hp.data.keys())

# check if these are all repeat IDs, haplotype IDs, or a mix of them
if len(hp.type_ids["R"]) >= len(haplotype_ids) and repeats is None:
# if they're all repeat IDs or --repeats was specified
log.info("Loading TR genotypes")
gt = GenotypesTR(fname=genotypes, log=log)
load_as_haps = True
# either load SNPs from the snplist file or load haps/repeats from the hap file
if haplotypes.suffix == ".snplist":
log.info("Loading SNP effects")
with open(haplotypes) as snplist_file:
effects = map(Effect.from_hap_spec, snplist_file.readlines())
if haplotype_ids is None:
effects = list(effects)
haplotype_ids = set(effect.id for effect in effects)
else:
effects = list(filter(lambda e: e.id in haplotype_ids, effects))
else:
# the genotypes variable must contain haplotype genotypes
# but first, check if they're a mix but --repeats wasn't specified
if len(hp.type_ids["H"]) < len(haplotype_ids) and repeats is None:
raise ValueError(
"The --repeats option must be specified when simulating a mix of both "
"haplotypes and repeats as causal effects."
)
log.info("Loading haplotypes")
hp = Haplotypes(haplotypes, haplotype=Haplotype, repeat=Repeat, log=log)
hp.read(region=region, haplotypes=haplotype_ids)
effects = hp.data.values()

if haplotype_ids is None:
haplotype_ids = set(hp.data.keys())

# check if these are all repeat IDs, haplotype IDs, or a mix of them
if len(hp.type_ids["R"]) >= len(haplotype_ids) and repeats is None:
# if they're all repeat IDs or --repeats was specified
log.info("Loading TR genotypes")
gt = GenotypesTR(fname=genotypes, log=log)
load_as_haps = False
else:
# the genotypes variable must contain haplotype genotypes
# but first, check if they're a mix but --repeats wasn't specified
if len(hp.type_ids["H"]) < len(haplotype_ids) and repeats is None:
raise ValueError(
"The --repeats option must be specified when simulating a mix of"
" both haplotypes and repeats as causal effects."
)

if load_as_haps:
# load these as haplotype pseudo-genotypes
if genotypes.suffix == ".pgen":
log.info("Loading haplotype genotypes from PGEN file")
Expand All @@ -370,7 +421,7 @@ def simulate_pt(
if repeats:
log.info("Merging with TR genotypes")
tr_gt = GenotypesTR(fname=repeats, log=log)
tr_gt.read(region=region, samples=samples, variants=set(hp.type_ids["R"]))
tr_gt.read(region=region, samples=samples, variants=haplotype_ids)
tr_gt.check_missing()
gt = Genotypes.merge_variants((gt, tr_gt), fname=None)

Expand All @@ -379,15 +430,16 @@ def simulate_pt(
diff = list(haplotype_ids.difference(gt.variants["id"]))
first_few = 5 if len(diff) > 5 else len(diff)
log.warning(
f"{len(diff)} haplotypes could not be found in the genotypes file. Check "
"that the hap IDs in your .hap file correspond with those in the genotypes"
f" file. Here are the first few missing variants: {diff[:first_few]}"
f"{len(diff)} effects could not be found in the genotypes file. Check "
"that the IDs in your .snplist or .hap file correspond with those in the "
"genotypes file. Here are the first few missing variants: "
f"{diff[:first_few]}"
)

# Initialize phenotype simulator (haptools simphenotype)
log.info("Simulating phenotypes")
pt_sim = PhenoSimulator(gt, output=output, seed=seed, log=log)
for i in range(num_replications):
pt_sim.run(hp.data.values(), heritability, prevalence, normalize)
pt_sim.run(effects, heritability, prevalence, normalize)
log.info("Writing phenotypes")
pt_sim.write()
7 changes: 7 additions & 0 deletions tests/data/apoe.hap
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# orderH beta
# version 0.2.0
#H beta .2f Effect size in linear model
H 19 45411941 45411942 rs429358 0.73
H 19 45412079 45412080 rs7412 0.30
V rs429358 45411941 45411942 rs429358 C
V rs7412 45412079 45412080 rs7412 T
2 changes: 2 additions & 0 deletions tests/data/apoe.snplist
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
rs429358 0.73
rs7412 0.30
32 changes: 20 additions & 12 deletions tests/test_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -1824,32 +1824,40 @@ def test_breakpoints_to_pop_array_chrom_no_match(self):

class TestDocExamples:
def test_gts2hap(self):
# which variants do we want to write to the haplotype file?
variants = {"rs429358", "rs7412"}
# load variants from the snplist file
variants = {}
with open(DATADIR / "apoe.snplist") as snplist_file:
for line in snplist_file.readlines():
# parse variant ID and beta from file
ID, beta = line.split("\t")
variants[ID] = float(beta)

# load the genotypes file
# you can use either a VCF or PGEN file
gt = GenotypesVCF(DATADIR / "apoe.vcf.gz")
gt.read(variants=variants)
gt.read(variants=variants.keys())

# initialize an empty haplotype file
hp = Haplotypes("output.hap", haplotype=Haplotype)
hp = Haplotypes("output.hap", haplotype=HaptoolsHaplotype)
hp.data = {}

for variant in gt.variants:
ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]]
end = pos + len(alleles[1])
# we arbitrarily choose to use the ALT allele but alleles[0] will give you REF
allele = alleles[1]
end = pos + len(allele)

# create a haplotype line in the .hap file
# you should fill out "beta" with your own value
hp.data[ID] = HaptoolsHaplotype(
chrom=chrom, start=pos, end=end, id=ID, beta=0.5
chrom=chrom, start=pos, end=end, id=ID, beta=variants[ID]
)

# create variant lines for each haplotype
hp.data[ID].variants = (
Variant(start=pos, end=end, id=ID, allele=alleles[1]),
)
# create a variant line for each haplotype
hp.data[ID].variants = (Variant(start=pos, end=end, id=ID, allele=allele),)

hp.write()

# validate the output and clean up afterwards
with open("output.hap") as hp_file:
with open(DATADIR / "apoe.hap") as expected:
assert hp_file.read() == expected.read()
hp.fname.unlink()
Loading

0 comments on commit cb18970

Please sign in to comment.