Skip to content

Commit f7df71f

Browse files
authored
Merge pull request #123 from CAST-genomics/feat/bkpts-binarysearch
ref: use binary search in `Breakpoints.population_array()`
2 parents 2040da0 + e77a491 commit f7df71f

File tree

7 files changed

+424
-109
lines changed

7 files changed

+424
-109
lines changed

docs/api/data.rst

+124-1
Original file line numberDiff line numberDiff line change
@@ -398,7 +398,7 @@ The ``load()`` method initializes an instance of the :class:`Phenotypes` class a
398398
phenotypes.read()
399399
phenotypes.data # returns a np array of shape p x k
400400
401-
Both the ``load()`` and ``read()`` methods support the `samples` parameter that allows you to request a specific set of sample IDs to read from the file.
401+
Both the ``load()`` and ``read()`` methods support the ``samples`` parameter that allows you to request a specific set of sample IDs to read from the file.
402402

403403
.. code-block:: python
404404
@@ -454,3 +454,126 @@ Classes
454454
Covariates
455455
++++++++++
456456
The :class:`Covariates` class is simply a sub-class of the :class:`Phenotypes` class. It has all of the same methods and properties. There are no major differences between the two classes, except between the file extensions that they use.
457+
458+
breakpoints.py
459+
~~~~~~~~~~~~~~
460+
Overview
461+
--------
462+
This module supports reading and writing files that follow the **.bp** file format specification.
463+
464+
Lines from the file are parsed into an instance of the :class:`Breakpoints` class.
465+
466+
Documentation
467+
-------------
468+
469+
1. The **.bp** :ref:`format specification <formats-breakpoints>`
470+
2. The :ref:`breakpoints.py API docs <api-haptools-data-breakpoints>` contain example usage of the :class:`Breakpoints` class
471+
472+
Classes
473+
-------
474+
Breakpoints
475+
+++++++++++
476+
Properties
477+
**********
478+
Just like all other classes in the data module, the :class:`Breakpoints` class has a ``data`` property. It is a dictionary, keyed by sample ID, where each value is a two-element list of numpy arrays (one for each chromosome). Each column in the array corresponds with a column in the breakpoints file:
479+
480+
1. ``pop`` - A population label (str), like 'YRI'
481+
2. ``chrom`` - A chromosome name (str), like 'chr19' or simply '19'
482+
3. ``bp`` - The end position of the block in base pairs (int), like 1001038
483+
4. ``cm`` - The end position of the block in centiMorgans (float), like 43.078
484+
485+
The dtype of each numpy array is stored as a variable called ``HapBlock``. It is available globally in the ``breakpoints`` and ``data`` modules.
486+
487+
.. code-block:: python
488+
489+
from haptools import data
490+
data.HapBlock # the dtype of each numpy array in the data property
491+
492+
Reading a file
493+
**************
494+
Loading a **.bp** file is easy.
495+
496+
.. code-block:: python
497+
498+
breakpoints = data.Breakpoints.load('tests/data/simple.bp')
499+
breakpoints.data # returns a dictionary keyed by sample ID, where each value is a list of np arrays
500+
501+
The ``load()`` method initializes an instance of the :class:`Breakpoints` class and calls the ``read()`` method, but you can also call the ``read()`` method manually.
502+
503+
.. code-block:: python
504+
505+
breakpoints = data.Breakpoints('tests/data/simple.bp')
506+
breakpoints.read()
507+
breakpoints.data # returns a dictionary keyed by sample ID, where each value is a list of np arrays
508+
509+
Both the ``load()`` and ``read()`` methods support the ``samples`` parameter that allows you to request a specific set of sample IDs to read from the file.
510+
511+
.. code-block:: python
512+
513+
breakpoints = data.Breakpoints('tests/data/simple.bp')
514+
breakpoints.read(samples={"HG00097", "HG00099"})
515+
516+
Iterating over a file
517+
*********************
518+
If you're worried that the contents of the **.bp** file will be large, you may opt to parse the file sample-by-sample instead of loading it all into memory at once.
519+
520+
In cases like these, you can use the ``__iter__()`` method in a for-loop.
521+
522+
.. code-block:: python
523+
524+
breakpoints = data.Breakpoints('tests/data/simple.bp')
525+
for sample, blocks in breakpoints:
526+
print(sample, blocks)
527+
528+
You'll have to call ``__iter()__`` manually if you want to specify any function parameters.
529+
530+
.. code-block:: python
531+
532+
breakpoints = data.Breakpoints('tests/data/simple.bp')
533+
for sample, blocks in breakpoints.__iter__(samples={"HG00097", "HG00099"}):
534+
print(sample, blocks)
535+
536+
Obtaining ancestral labels for a list of positions
537+
**************************************************
538+
In the end, we're usually only interested in the ancestral labels of a set of variant positions, as a matrix of values. The ``population_array()`` method generates a numpy array denoting the ancestral label of each sample for each variant you specify.
539+
540+
.. code-block:: python
541+
542+
breakpoints = data.Breakpoints.load('tests/data/simple.bp')
543+
variants = np.array(
544+
[("1", 10119), ("1", 10121)],
545+
dtype = [("chrom", "U10"), ("pos", np.uint32)],
546+
)
547+
arr = breakpoints.population_array(variants=variants)
548+
arr # returns a np array of shape n x p x 2 (where p = 2 in this example)
549+
550+
You can also select a subset of samples. The samples returned in the matrix will follow the order specified.
551+
552+
.. code-block:: python
553+
554+
breakpoints = data.Breakpoints.load('tests/data/simple.bp')
555+
variants = np.array(
556+
[("1", 10119), ("1", 10121)],
557+
dtype = [("chrom", "U10"), ("pos", np.uint32)],
558+
)
559+
samples = (HG00096, HG00100)
560+
arr = breakpoints.population_array(variants=variants, samples=samples)
561+
arr # returns a np array of shape 2 x p x 2 (where p = 2 in this example)
562+
563+
Writing a file
564+
**************
565+
To write to a **.bp** file, you must first initialize a :class:`Breakpoints` object and then fill out the ``data`` property.
566+
567+
.. code-block:: python
568+
569+
breakpoints = data.Breakpoints('tests/data/example-write.bp')
570+
breakpoints.data = {
571+
'HG00096': [
572+
np.array([('YRI','chr1',10114,4.3),('CEU','chr1',10116,5.2)], dtype=data.HapBlock)
573+
np.array([('CEU','chr1',10114,4.3),('YRI','chr1',10116,5.2)], dtype=data.HapBlock)
574+
], 'HG00097': [
575+
np.array([('YRI','chr1',10114,4.3),('CEU','chr2',10116,5.2)], dtype=data.HapBlock)
576+
np.array([('CEU','chr1',10114,4.3),('YRI','chr2',10116,5.2)], dtype=data.HapBlock)
577+
]
578+
}
579+
breakpoints.write()

docs/api/haptools.rst

+10
Original file line numberDiff line numberDiff line change
@@ -65,6 +65,16 @@ haptools.data.haplotypes module
6565
:undoc-members:
6666
:show-inheritance:
6767

68+
.. _api-haptools-data-breakpoints:
69+
70+
haptools.data.breakpoints module
71+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
72+
73+
.. automodule:: haptools.data.breakpoints
74+
:members:
75+
:undoc-members:
76+
:show-inheritance:
77+
6878
haptools.sim_genotype module
6979
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7080

docs/formats/breakpoints.rst

+36-9
Original file line numberDiff line numberDiff line change
@@ -4,14 +4,41 @@
44
Breakpoints
55
===========
66

7-
* ``Sample Header`` - Name of sample following the structure ``Sample_{number}_{hap}`` eg. ``Sample_10_1`` for sample number 10 haplotype 1
8-
* ``pop`` - Population label corresponding to the index of the population in the dat file so in the example above CEU = 1, YRI = 2
9-
* ``chr`` - chromosome (1-22, X)
7+
Breakpoints files (``.bp`` files) store your samples' local ancestry labels. Each line in the file denotes the ancestral population (ex: YRI or CEU) of a portion of a chromosomal strand (or *haplotype block*) of an individual.
108

11-
.. code-block::
9+
The set of haplotype blocks for an individual are delimited by a sample header of the form ``{sample}_1`` (for the first chromosomal strand) or ``{sample}_2`` (for the second chromosomal strand). Blocks from ``{sample}_1`` must be directly followed by blocks from ``{sample}_2``.
1210

13-
Sample Header
14-
{pop}\t{chr}\t{pos bp}
15-
...
16-
Sample Header 2
17-
...
11+
Each set of haplotype blocks follows a tab-delimited format with the following fields. Lines within a sample's set of blocks must be sorted according to ``chrom``, ``bp``, and ``cm`` - in that order.
12+
13+
.. list-table::
14+
:widths: 15 15 25
15+
:header-rows: 1
16+
17+
* - Name
18+
- Type
19+
- Description
20+
* - pop
21+
- string
22+
- The population label of this haplotype block (ex: CEU or YRI)
23+
* - chrom
24+
- string
25+
- The name of the chromosome to which this haplotype block belongs (ex: chr1)
26+
* - bp
27+
- integer
28+
- The base-pair position of the end of the haplotype block (ex: 1001038)
29+
* - cm
30+
- float
31+
- The centimorgan position of the end of the haplotype block (ex: 43.078)
32+
33+
Examples
34+
--------
35+
36+
See `tests/data/outvcf_test.bp <https://github.com/cast-genomics/haptools/blob/main/tests/data/outvcf_test.bp>`_ for an example of a short breakpoint file:
37+
38+
.. include:: ../../tests/data/outvcf_test.bp
39+
:literal:
40+
41+
See `tests/data/simple.bp <https://github.com/cast-genomics/haptools/blob/main/tests/data/simple.bp>`_ for a longer example:
42+
43+
.. include:: ../../tests/data/simple.bp
44+
:literal:

0 commit comments

Comments
 (0)