Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Genotypes.check_maf() method #124

Merged
merged 12 commits into from
Oct 25, 2022
15 changes: 13 additions & 2 deletions docs/api/data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ Properties
**********
The ``data`` property of a :class:`Genotypes` object is a numpy array representing the genotype matrix. Rows of the array are samples and columns are variants. Each entry in the matrix is a tuple of values -- one for each chromosome. Each value is an integer denoting the index of the allele (0 for REF, 1 for the first ALT allele, 2 for the next ALT allele, etc).

There are two additional properties that contain variant and sample metadata. The ``variants`` property is a numpy structured array and the ``samples`` property is a simple tuple of sample IDs. The ``variants`` structured array has four named columns: "id", "chrom", "pos", and "aaf" (alternate allele frequency).
There are two additional properties that contain variant and sample metadata. The ``variants`` property is a numpy structured array and the ``samples`` property is a simple tuple of sample IDs. The ``variants`` structured array has three named columns: "id" (variant ID), "chrom" (chromosome name), and "pos" (chromosomal position).

Reading a file
**************
Expand Down Expand Up @@ -129,6 +129,17 @@ There are several quality-control checks performed by default (in the ``load()``
2. ``check_biallelic()`` - raises an error if any variants have more than one ALT allele
3. ``check_phase()`` - raises an error if any genotypes are unphased

Additionally, you can use the ``check_maf()`` method after checking for missing genotypes and confirming that all variants are biallelic.

.. code-block:: python

genotypes = data.Genotypes('tests/data/simple.vcf.gz')
genotypes.read()
genotypes.check_missing()
genotypes.check_biallelic()
genotypes.check_maf(threshold=0.05)
genotypes.check_phase()

Subsetting
**********
You can index into a loaded :class:`Genotypes` instance using the ``subset()`` function. This works similiar to numpy indexing with the added benefit that you can specify a subset of variants and/or samples by their IDs instead of just their indices.
Expand All @@ -143,7 +154,7 @@ By default, the ``subset()`` method returns a new :class:`Genotypes` instance. T

GenotypesRefAlt
+++++++++++++++
The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesRefAlt` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF to the columns of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "aaf", "ref", and "alt".
The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesRefAlt` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF to the columns of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "ref", and "alt".

All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesRefAlt` class implements an additional method ``write()`` for dumping the contents of the class to the provided file.

Expand Down
4 changes: 2 additions & 2 deletions docs/api/examples.rst
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ As an example, let's say we would like to convert the following ``.blocks.det``
# load the genotypes file
# you can use either a VCF or PGEN file
gt = data.GenotypesRefAlt.load("input.vcf.gz")
gt = data.GenotypesPGEN.load("input.pgen")
gt = data.GenotypesPLINK.load("input.pgen")

# load the haplotypes
hp = data.Haplotypes("output.hap")
Expand Down Expand Up @@ -88,7 +88,7 @@ You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API
gt.read(variants=variants)
# the advantage of using a PGEN file is that you can use read_variants() to load
# the variants quickly w/o having to load the genotypes, too
gt = data.GenotypesPGEN("tests/data/apoe.pgen")
gt = data.GenotypesPLINK("tests/data/apoe.pgen")
gt.read_variants(variants=variants)

# initialize an empty haplotype file
Expand Down
2 changes: 1 addition & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@
autosummary_generate = True
numpydoc_show_class_members = False
# allow for both rst and md syntax
source_suffix = ['.rst']
source_suffix = [".rst"]

# -- Options for HTML output -------------------------------------------------

Expand Down
7 changes: 3 additions & 4 deletions haptools/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -764,7 +764,7 @@ def ld(
"-o",
"--output",
type=click.Path(path_type=Path),
default= None,
default=None,
show_default="input file",
help="A .hap file containing sorted and indexed haplotypes and variants",
)
Expand All @@ -778,11 +778,10 @@ def ld(
)
def index(
haplotypes: Path,
sort: bool= False,
sort: bool = False,
output: Path = None,
verbosity: str = 'CRITICAL',
verbosity: str = "CRITICAL",
):

"""
Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file
"""
Expand Down
32 changes: 21 additions & 11 deletions haptools/admix_storage.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ def __repr__(self):

def __str__(self):
return f"(Chrom {self.chrom}, {self.cm_map_pos} cM, {self.bp_map_pos} bp)"

def get_chrom(self):
return self.chrom

Expand All @@ -30,25 +30,35 @@ class HaplotypeSegment:
def __init__(self, pop_num, chrom, end_coord, end_pos):
"""
Note the beginning of the segment is inferred based on previous
segments stored throughout the simulation process.
Arguments
pop_num - population label
chrom - chromosome the haplotype segment lies on.
end_coord - Ending coordinate in bp of the haplotype segment
end_pos - Ending coordinate in centimorgans of the hap segment
segments stored throughout the simulation process.

Parameters
----------
pop_num
population label
chrom
chromosome the haplotype segment lies on.
end_coord
Ending coordinate in bp of the haplotype segment
end_pos
Ending coordinate in centimorgans of the hap segment
"""
self.pop_num = pop_num
self.chrom = chrom
self.end_coord = end_coord
self.end_pos = end_pos

def __repr__(self):
return f"(Population {self.pop_num}, chrom {self.chrom}, " + \
f"end_coord {self.end_coord}, end_pos {self.end_pos})"
return (
f"(Population {self.pop_num}, chrom {self.chrom}, "
+ f"end_coord {self.end_coord}, end_pos {self.end_pos})"
)

def __str__(self):
return f"Population {self.pop_num}, chrom {self.chrom}, " + \
f"end_coord {self.end_coord}, end_pos {self.end_pos}"
return (
f"Population {self.pop_num}, chrom {self.chrom}, "
+ f"end_coord {self.end_coord}, end_pos {self.end_pos}"
)

def get_end_pos(self):
return self.end_pos
Expand Down
2 changes: 1 addition & 1 deletion haptools/data/breakpoints.py
Original file line number Diff line number Diff line change
Expand Up @@ -269,7 +269,7 @@ def population_array(
A subset of samples to include in the output, ordered by their given order

Returns
------
-------
npt.NDArray
An array of shape: samples x variants x 2

Expand Down
102 changes: 66 additions & 36 deletions haptools/data/genotypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,6 @@ class Genotypes(Data):
1. ID
2. CHROM
3. POS
4. AAF: allele freq of alternate allele (or MAF if to_MAC() is called)
log: Logger
A logging instance for recording debug statements
_prephased : bool
Expand Down Expand Up @@ -61,7 +60,6 @@ def __init__(self, fname: Path | str, log: Logger = None):
("id", "U50"),
("chrom", "U10"),
("pos", np.uint32),
("aaf", np.float64),
],
)
self._prephased = False
Expand Down Expand Up @@ -219,7 +217,7 @@ def _variant_arr(self, record: Variant):
A row from the :py:attr:`~.Genotypes.variants` array
"""
return np.array(
(record.ID, record.CHROM, record.POS, record.aaf),
(record.ID, record.CHROM, record.POS),
dtype=self.variants.dtype,
)

Expand Down Expand Up @@ -299,15 +297,15 @@ def index(self, samples: bool = True, variants: bool = True):
Call this function once to improve the amortized time-complexity of look-ups of
samples and variants by their ID. This is useful if you intend to later subset
by a set of samples or variant IDs.
The time complexity of this function should be roughly O(n+m) if both
parameters are True. Otherwise, it will be either O(n) or O(m).
The time complexity of this function should be roughly O(n+p) if both
parameters are True. Otherwise, it will be either O(n) or O(p).

Parameters
----------
samples: bool, optional
Whether to index the samples for fast loop-up. Adds complexity O(n).
variants: bool, optional
Whether to index the variants for fast look-up. Adds complexity O(m).
Whether to index the variants for fast look-up. Adds complexity O(p).

Raises
------
Expand Down Expand Up @@ -504,37 +502,73 @@ def check_phase(self):
# remove the last dimension that contains the phase info
self.data = self.data[:, :, :2]

def to_MAC(self):
def check_maf(
self,
threshold: float = None,
discard_also: bool = False,
warn_only: bool = False,
) -> npt.NDArray[np.float64]:
"""
Convert the ALT count GT matrix into a matrix of minor allele counts
Check the minor allele frequency of each variant

This function modifies :py:attr:`~.Genotypes.data` in-place
Raise a ValueError if any variant's MAF doesn't satisfy the threshold, if
one is provided

.. note::
You should call :py:meth:`~.Genotypes.check_missing` and
:py:meth:`~.Genotypes.check_biallelic` before executing this method, for
best results. Otherwise, the frequencies may be computed incorrectly.

It also changes the 'aaf' record in :py:attr:`~.Genotypes.variants` to 'maf'
Parameters
----------
threshold: float, optional
If a variant has a minor allele frequency (MAF) rarer than this threshold,
raise a ValueError
discard_also : bool, optional
If True, discard any variants that would otherwise cause a ValueError

This parameter will be ignored if a threshold is not specified
warn_only: bool, optional
Just raise a warning instead of a ValueError

Raises
------
AssertionError
If the matrix has already been converted
ValueError
If any variant does not meet the provided threshold minor allele frequency

Returns
-------
The minor allele frequency of each variant
"""
if self.variants.dtype.names[3] == "maf":
self.log.warning(
"The matrix already counts instances of the minor allele rather than "
"the ALT allele."
)
return
need_conversion = self.variants["aaf"] > 0.5
# flip the count on the variants that have an alternate allele frequency
# above 0.5
self.data[:, need_conversion, :2] = ~self.data[:, need_conversion, :2]
# also encode an MAF instead of an AAF in self.variants
self.variants["aaf"][need_conversion] = (
1 - self.variants["aaf"][need_conversion]
)
# replace 'aaf' with 'maf' in the matrix
self.variants.dtype.names = [
(x, "maf")[x == "aaf"] for x in self.variants.dtype.names
]
num_strands = 2 * self.data.shape[0]
# TODO: make this work for multi-allelic variants, too?
ref_af = self.data[:, :, :2].astype(np.bool_).sum(axis=(0, 2)) / num_strands
maf = np.array([ref_af, 1 - ref_af]).min(axis=0)
if threshold is None:
return maf
rare_variants = maf < threshold
if np.any(rare_variants):
idx = np.nonzero(rare_variants)[0]
if discard_also:
original_num_variants = len(self.variants)
self.data = np.delete(self.data, idx, axis=1)
self.variants = np.delete(self.variants, idx)
maf = np.delete(maf, idx)
self.log.info(
"Ignoring missing genotypes from "
f"{original_num_variants - len(self.variants)} samples"
)
self._var_idx = None
else:
vals = tuple(self.variants[idx[0]])[:3] + (maf[idx[0]], threshold)
msg = "Variant with ID {} at POS {}:{} has MAF {} < {}".format(*vals)
if warn_only:
self.log.warning(msg)
else:
# raise error if the minor allele frequency of a variant does not
# meet the threshold
raise ValueError(msg)
return maf


class GenotypesRefAlt(Genotypes):
Expand All @@ -556,9 +590,8 @@ class GenotypesRefAlt(Genotypes):
1. ID
2. CHROM
3. POS
4. AAF: allele freq of alternate allele (or MAF if to_MAC() is called)
5. REF
6. ALT
4. REF
5. ALT
log: Logger
See documentation for :py:attr:`~.Genotypes.log`
"""
Expand All @@ -571,7 +604,6 @@ def __init__(self, fname: Path | str, log: Logger = None):
("id", "U50"),
("chrom", "U10"),
("pos", np.uint32),
("aaf", np.float64),
("ref", "U100"),
("alt", "U100"),
],
Expand All @@ -586,7 +618,6 @@ def _variant_arr(self, record: Variant):
record.ID,
record.CHROM,
record.POS,
record.aaf,
record.REF,
record.ALT[0],
),
Expand Down Expand Up @@ -795,7 +826,6 @@ def _variant_arr(
record[cid["ID"]],
record[cid["CHROM"]],
record[cid["POS"]],
0.5,
record[cid["REF"]],
record[cid["ALT"]],
),
Expand Down
2 changes: 1 addition & 1 deletion haptools/data/haplotypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -1080,7 +1080,7 @@ def transform(
hap_gts = GenotypesRefAlt(fname=None, log=self.log)
hap_gts.samples = gts.samples
hap_gts.variants = np.array(
[(hap.id, hap.chrom, hap.start, 0, "A", "T") for hap in self.data.values()],
[(hap.id, hap.chrom, hap.start, "A", "T") for hap in self.data.values()],
dtype=hap_gts.variants.dtype,
)
# build a fast data structure for querying the alleles in each haplotype:
Expand Down
2 changes: 1 addition & 1 deletion haptools/index.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ def index_haps(
haplotypes: Path,
sort: bool = False,
output: Path = None,
log: Logger = None,
log: logging.Logger = None,
):
"""
Takes in an unsorted .hap file and outputs it as a .gz and a .tbi file
Expand Down
2 changes: 1 addition & 1 deletion haptools/ld.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@ def calc_ld(
discard_missing: bool = False,
from_gts: bool = False,
output: Path = Path("/dev/stdout"),
log: Logger = None,
log: logging.Logger = None,
):
"""
Creates a VCF composed of haplotypes
Expand Down
2 changes: 1 addition & 1 deletion haptools/sim_genotype.py
Original file line number Diff line number Diff line change
Expand Up @@ -776,7 +776,7 @@ def start_segment(start, chrom, segments):

return len(segments)

def validate_params(model, mapdir, chroms, popsize, invcf, sample_info, region, only_bp=False):
def validate_params(model, mapdir, chroms, popsize, invcf, sample_info, region=None, only_bp=False):
# validate model file
mfile = open(model, 'r')
num_samples, *pops = mfile.readline().strip().split()
Expand Down
4 changes: 2 additions & 2 deletions haptools/transform.py
Original file line number Diff line number Diff line change
Expand Up @@ -97,7 +97,7 @@ def transform(
hap_gts = data.GenotypesRefAlt(fname=None, log=self.log)
hap_gts.samples = gts.samples
hap_gts.variants = np.array(
[(hap.id, hap.chrom, hap.start, 0, "A", "T") for hap in self.data.values()],
[(hap.id, hap.chrom, hap.start, "A", "T") for hap in self.data.values()],
dtype=hap_gts.variants.dtype,
)
# build a fast data structure for querying the alleles in each haplotype:
Expand Down Expand Up @@ -510,7 +510,7 @@ def transform_haps(
discard_missing: bool = False,
ancestry: bool = False,
output: Path = Path("-"),
log: Logger = None,
log: logging.Logger = None,
):
"""
Creates a VCF composed of haplotypes
Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,7 @@ haptools = 'haptools.__main__:main'
[tool.black]
line-length = 88
preview = true
extend-exclude = "haptools/(sim_genotype|karyogram).py"

[tool.pytest.ini_options]
log_cli_level = "DEBUG"
Expand Down
Loading