Skip to content

Commit

Permalink
start on docs page for data.Genotypes and data.Haplotypes
Browse files Browse the repository at this point in the history
see #50
  • Loading branch information
aryarm committed Jun 24, 2022
1 parent 9b4393a commit 0735e74
Show file tree
Hide file tree
Showing 3 changed files with 294 additions and 0 deletions.
282 changes: 282 additions & 0 deletions docs/api/data.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,282 @@
.. _api-data:


data
====

Overview
~~~~~~~~

The ``data`` module is a submodule of ``haptools`` that handles IO for common file types including genotypes, haplotypes, phenotypes, and model covariates.

.. note::
This page documents common use-cases. To implement more advanced patterns, take a look at our detailed :ref:`API docs <api-haptools-data>`.

Motivation
----------
Using the ``data`` module makes your code agnostic to the type of file being used. For example, the :class:`Genotypes` class provides the same interface for reading VCFs as it does for PLINK2 PGEN files.

This module also helps reduce common boilerplate, since users can easily *extend* the classes in the ``data`` module for their own specific goals.

Data
~~~~
Classes in the ``data`` module inherit from an abstract class called Data, providing some level of standardization across classes within the module. All classes are initialized with the path to the file containing the data.

The abstract class requires that all classes contain methods for

1. reading the contents of a file into a ``data`` property of each class
2. iterating over lines of a file without loading all of the data into memory at once

.. code-block:: python
from haptools import data
data.Data
genotypes.py
~~~~~~~~~~~~
Overview
--------
This module supports reading and writing files that follow `the VCF <https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format>`_ and `PLINK2 PGEN <https://www.cog-genomics.org/plink/2.0/formats#pgen>`_ file format specifications. We may also offer support for `BGEN <https://www.well.ox.ac.uk/~gav/bgen_format>`_ and `Hail <https://hail.is/docs/0.2/methods/impex.html#native-file-formats>`_ files in the future.

Documentation
-------------
The :ref:`genotypes.py API docs <api-haptools-data-genotypes>` contain example usage of the :class:`Genotypes` class.
See the documentation for the :class:`GenotypesRefAlt` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well.

Classes
-------
Genotypes
+++++++++
Properties
**********
The ``data`` property of a :class:`Genotypes` object is a numpy array representing the genotype matrix. Rows of the array are samples and columns are variants. Each entry in the matrix is a tuple of values -- one for each chromosome. Each value is an integer denoting the index of the allele (0 for REF, 1 for the first ALT allele, 2 for the next ALT allele, etc).

There are two additional properties that contain variant and sample metadata. The ``variants`` property is a numpy structured array and the ``samples`` property is a simple tuple of sample IDs. The ``variants`` structured array has four named columns: "id", "chrom", "pos", and "aaf" (alternate allele frequency).

Reading a file
**************
Extracting genotypes from a VCF file is quite simple:

.. code-block:: python
genotypes = data.Genotypes.load(Path('tests/data/simple.vcf'))
genotypes.data # a numpy array of shape n x p x 2
genotypes.variants # a numpy structured array of shape p x 4
genotypes.samples # a tuple of strings of length n
The ``load()`` method initializes an instance of the :class:`Genotypes` class, calls the ``read()`` method, and then performs some standard quality-control checks. You can also call the ``read()`` method manually if you'd like to forego these checks.

.. code-block:: python
genotypes = data.Genotypes(Path('tests/data/simple.vcf'))
genotypes.read()
genotypes.data # a numpy array of shape n x p x 3
genotypes.variants # a numpy structured array of shape p x 4
genotypes.samples # a tuple of strings of length n
Both the ``load()`` and ``read()`` methods support ``region``, ``samples``, and ``variants`` parameters that allow you to request a specific region, list of samples, or set of variant IDs to read from the file.

.. code-block:: python
genotypes = data.Genotypes(Path('tests/data/simple.vcf.gz'))
genotypes.read(
region="1:10115-10117",
samples=["HG00097", "HG00100"],
variants={"1:10117:C:A"},
)
The ``region`` parameter only works if the file is indexed, since in that case, the ``read()`` method can take advantage of the indexing to parse the file a bit faster.

Iterating over a file
*********************
If you're worried that the contents of the VCF file might be large, you may opt to parse the file line-by-line instead of loading it all into memory at once.

In cases like these, you can use the ``__iter__()`` method in a for-loop:

.. code-block:: python
genotypes = data.Genotypes(Path('tests/data/simple.vcf'))
for line in genotypes:
print(line)
You'll have to call ``__iter()__`` manually if you want to specify any function parameters:

.. code-block:: python
genotypes = data.Genotypes(Path('tests/data/simple.vcf.gz'))
for line in genotypes.__iter__(region="1:10115-10117", samples=["HG00097", "HG00100"]):
print(line)
Quality control
***************
There are several quality-control checks performed by default (in the ``load()`` method). You can call these methods yourself, if you'd like:

1. ``check_missing()`` - raises an error if any samples are missing genotypes
2. ``check_biallelic()`` - raises an error if any variants have more than one ALT allele
3. ``check_phase()`` - raises an error if any genotypes are unphased

Subsetting
**********
You can index into a loaded :class:`Genotypes` instance using the ``subset()`` function. This works similiar to numpy indexing with the added benefit that you can specify a subset of variants and/or samples by their IDs instead of just their indices.

.. code-block:: python
genotypes = data.Genotypes.load(Path('tests/data/simple.vcf'))
gts_subset = genotypes.subset(samples=("HG00100", "HG00101"), variants=("1:10114:T:C", '1:10116:A:G'))
gts_subset # a new Genotypes instance containing only the specified samples and variants
By default, the ``subset()`` method returns a new :class:`Genotypes` instance. The samples and variants in the new instance will be in the order specified.

GenotypesRefAlt
+++++++++++++++
The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesRefAlt` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF to the columns of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "aaf", "ref", and "alt".

All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesRefAlt` class implements an additional method ``write()`` for dumping the contents of the class to the provided file.

.. code-block:: python
genotypes = data.GenotypesRefAlt.load(Path('tests/data/simple.vcf'))
# make the first sample homozygous for the alt allele of the fourth variant
genotypes.data[0, 3] = (1, 1)
genotypes.write()
GenotypesPLINK
++++++++++++++
The :class:`GenotypesPLINK` class offers experimental support for reading (and writing, in the future) PLINK2 PGEN, PVAR, and PSAM files. We are able to read genotypes from a PLINK2 PGEN files in a fraction of the time of VCFs.

The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesRefAlt` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.

.. code-block:: python
genotypes = data.GenotypesPLINK.load(Path('tests/data/simple.pgen'))
genotypes.data # a numpy array of shape n x p x 2
genotypes.variants # a numpy structured array of shape p x 4
genotypes.samples # a tuple of strings of length n
Use of this class is not officially supported yet because it relies upon the as-yet-unpublished ``pgenlib`` python library. See `issue #16 <https://github.com/gymrek-lab/haptools/pull/16>`_ for current progress on this challenge.

haplotypes.py
~~~~~~~~~~~~~
Overview
--------
This module supports reading and writing files that follow the **.hap** file format specification.

Lines from the file are parsed into instances of the :class:`Haplotype` and :class:`Variant` classes. These classes can be *extended* (sub-classed) to support "extra" fields appended to the ends of each line.

Documentation
-------------

1. The **.hap** :ref:`format specification <formats-haplotypes>`
2. The :ref:`haplotypes.py API docs <api-haptools-data-haplotypes>` contain example usage of the :class:`Haplotypes` class and examples of sub-classing the :class:`Haplotype` and :class:`Variant` classes

Classes
-------
Haplotypes
++++++++++
Reading a file
**************
Parsing a basic **.hap** file without any extra fields is as simple as it gets:

.. code-block:: python
haplotypes = data.Haplotypes.load(Path('tests/data/basic.hap'))
haplotypes.data # returns a dictionary of Haplotype objects
The ``load()`` method initializes an instance of the :class:`Haplotypes` class and calls the ``read()`` method, but if the **.hap** file contains extra fields, you'll need to call the ``read()`` method manually. You'll also need to create :class:`Haplotype` and :class:`Variant` subclasses that support the extra fields and then specify the names of the classes when you initialize the :class:`Haplotypes` object:

.. code-block:: python
haplotypes = data.Haplotypes(Path('tests/data/basic.hap'), Haplotype, Variant)
haplotypes.read()
haplotypes.data # returns a dictionary of Haplotype objects
Both the ``load()`` and ``read()`` methods support `region` and `haplotypes` parameters that allow you to request a specific region or set of haplotype IDs to read from the file.

.. code-block:: python
haplotypes = data.Haplotypes(Path('tests/data/basic.hap.gz'), Haplotype, Variant)
haplotypes.read(region='chr21:26928472-26941960', haplotypes=["chr21.q.3365*10"])
The file must be indexed if you wish to use these parameters, since in that case, the ``read()`` method can take advantage of the indexing to parse the file a bit faster. Otherwise, if the file isn't indexed, the ``read()`` method will assume the file could be unsorted and simply reads each line one-by-one. Although I haven't tested it yet, streams like stdin should be supported by this case.

Iterating over a file
*********************
If you're worried that the contents of the **.hap** file will be large, you may opt to parse the file line-by-line instead of loading it all into memory at once.

In cases like these, you can use the ``__iter__()`` method in a for-loop:

.. code-block:: python
haplotypes = data.Haplotypes.load(Path('tests/data/basic.hap'))
for line in haplotypes:
print(line)
You'll have to call ``__iter()__`` manually if you want to specify any function parameters:

.. code-block:: python
haplotypes = data.Haplotypes.load(Path('tests/data/basic.hap'))
for line in haplotypes.__iter__(region='21:26928472-26941960', haplotypes={"chr21.q.3365*1"}):
print(line)
Writing a file
**************
To write to a **.hap** file, you must first initialize a :class:`Haplotypes` object and then fill out the data property:

.. code-block:: python
haplotypes = data.Haplotypes(Path('tests/data/example-write.hap'))
haplotypes.data = {}
haplotypes.data['H1'] = Haplotype(chrom='chr1', start=0, end=10, id='H1')
haplotypes.data['H1'].variants = [Variant(start=0, end=1, id='rs123', allele='A')]
haplotypes.write()
Obtaining haplotype "genotypes"
*******************************
Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesRefAlt` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix.

.. code-block:: python
haplotypes = data.Haplotypes.load(Path('tests/data/example.hap.gz'))
genotypes = data.GenotypesRefAlt.load(Path('tests/data/example.vcf.gz'))
hap_gts = haplotypes.transform(genotypes)
hap_gts # a GenotypesRefAlt instance where haplotypes are variants
Haplotype
+++++++++
The :class:`Haplotype` class stores haplotype lines from the **.hap** file. Each property in the object is a field in the line. A separate ``variants`` property stores a tuple of :class:`Variant` objects belonging to this haplotype.

The :class:`Haplotypes` class will initialize :class:`Haplotype` objects in its ``read()`` and ``__iter__()`` methods. It uses a few methods within the :class:`Haplotype` class for this:

1. ``from_hap_spec()`` - this static method initializes a Haplotype object from a line in the **.hap** file.
2. ``to_hap_spec()`` - this method converts a Haplotype object into a line in the **.hap** file

To read "extra" fields from a **.hap** file, one need only *extend* (sub-class) the base :class:`Haplotype` class and add the extra properties.

Variant
+++++++
The :class:`Variant` class stores variant lines from the **.hap** file. Each property in the object is a field in the line.

The :class:`Haplotypes` class will initialize :class:`Variant` objects in its ``read()`` and ``__iter__()`` methods. It uses a few methods within the :class:`Variant` class for this:

1. ``from_hap_spec()`` - this static method initializes a :class:`Variant` object from a line in the **.hap** file.
2. ``to_hap_spec()`` - this method converts a :class:`Variant` object into a line in the **.hap** file

To read "extra" fields from a **.hap** file, one need only *extend* (sub-class) the base :class:`Variant` class and add the extra properties.

phenotypes.py
~~~~~~~~~~~~~
Overview
--------
This module supports reading and writing phenotype files.

Documentation
-------------

The :ref:`phenotypes.py API docs <api-haptools-data-phenotypes>` contain example usage of the Phenotypes class.

Classes
-------
:class:`Phenotypes`
+++++++++++++++++++
HI
11 changes: 11 additions & 0 deletions docs/api/haptools.rst
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
.. _api-haptools:


Documentation
=============

Expand All @@ -12,6 +15,8 @@ Command line interface
Module contents
---------------

.. _api-haptools-data:

haptools.data.data module
~~~~~~~~~~~~~~~~~~~~~~~~~

Expand All @@ -20,6 +25,8 @@ haptools.data.data module
:undoc-members:
:show-inheritance:

.. _api-haptools-data-genotypes:

haptools.data.genotypes module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand All @@ -28,6 +35,8 @@ haptools.data.genotypes module
:undoc-members:
:show-inheritance:

.. _api-haptools-data-phenotypes:

haptools.data.phenotypes module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand All @@ -44,6 +53,8 @@ haptools.data.covariates module
:undoc-members:
:show-inheritance:

.. _api-haptools-data-haplotypes:

haptools.data.haplotypes module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@
:hidden:
:maxdepth: 1

api/data
api/modules

.. toctree::
Expand Down

0 comments on commit 0735e74

Please sign in to comment.