Commit

0.4
mikolmogorov committed Nov 19, 2021
1 parent 5fbd27f commit f863195
Showing 2 changed files with 33 additions and 29 deletions.
60 changes: 32 additions & 28 deletions README.md
@@ -1,19 +1,21 @@
# Hapdup
# HapDup

Hapdup (haplotype duplicator) is a pipeline to convert a haploid Oxford Nanopore assembly into a [dual](http://lh3.github.io/2021/10/10/introducing-dual-assembly) diploid assembly.
HapDup (haplotype duplicator) is a pipeline to convert a haploid long read assembly into a
[dual](http://lh3.github.io/2021/10/10/introducing-dual-assembly) diploid assembly.
The reconstructed haplotypes preserve heterozygous structural variants (in addition to small variants) and
are locally phased.


## Version 0.3
## Version 0.4

Input requirements
------------------

Hapdup takes as input a long-read assembly, such as produced with [Flye](https://github.com/fenderglass/Flye) or
[Shasta](https://github.com/chanzuckerberg/shasta).
HapDup takes as input a haploid long-read assembly, such as produced with [Flye](https://github.com/fenderglass/Flye) or
[Shasta](https://github.com/chanzuckerberg/shasta). Currently, ONT reads (Guppy 5+ recommended) and PacBio HiFi
reads are supported.

Hapdup is currently designed for low-heterozygosity genomes (such as human). The expectation is that the assembly
HapDup is currently designed for low-heterozygosity genomes (such as human). The expectation is that the assembly
has most of the diploid genome collapsed into a single haplotype. For assemblies with partially resolved haplotypes, alternative
alleles could be removed prior to running the pipeline using [purge_dups](https://github.com/dfguan/purge_dups).
We expect to add better support for highly heterozygous genomes in the future.
@@ -29,20 +31,21 @@ samtools index -@ 4 assembly_lr_mapping.bam
Quick start using Docker
------------------------

Hapdup is available on the [Docker Hub](https://hub.docker.com/repository/docker/mkolmogo/hapdup).
HapDup is available on the [Docker Hub](https://hub.docker.com/repository/docker/mkolmogo/hapdup).

If Docker is not installed on your system, you need to set it up first following this [guide](https://docs.docker.com/engine/install/ubuntu/).

Next steps assume that your `assembly.fasta` and `lr_mapping.bam` are in the same directory,
which will also be used for hapdup output. If it is not the case, you might need to bind additional
which will also be used for HapDup output. If it is not the case, you might need to bind additional
directories using Docker's `-v` / `--volume` argument. The number of threads (`-t` argument)
should be adjusted according to the available resources.
should be adjusted according to the available resources. For PacBio HiFi input, use
`--rtype hifi` instead of `--rtype ont`.

```
cd directory_with_assembly_and_alignment
HD_DIR=`pwd`
docker run -v $HD_DIR:$HD_DIR -u `id -u`:`id -g` mkolmogo/hapdup:0.3 \
hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64
docker run -v $HD_DIR:$HD_DIR -u `id -u`:`id -g` mkolmogo/hapdup:0.4 \
hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64 --rtype ont
```
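
The choice between `--rtype ont` and `--rtype hifi` described above can be wrapped in a small helper when the command is launched from a script. The following is a minimal sketch, not part of HapDup: the function name `build_hapdup_command`, the tuple `SUPPORTED_RTYPES`, and the example directory are hypothetical, and only the flags shown in the command above (`--assembly`, `--bam`, `--out-dir`, `-t`, `--rtype`) are used.

```python
import os

# Read types named in this README; this sketch rejects anything else.
SUPPORTED_RTYPES = ("ont", "hifi")


def build_hapdup_command(work_dir, rtype="ont", threads=64):
    """Return the hapdup argument list for inputs stored in work_dir.

    Hypothetical helper: it assumes assembly.fasta and lr_mapping.bam
    live in work_dir, mirroring the quick-start layout.
    """
    if rtype not in SUPPORTED_RTYPES:
        raise ValueError(f"unsupported read type: {rtype!r}")
    return [
        "hapdup",
        "--assembly", os.path.join(work_dir, "assembly.fasta"),
        "--bam", os.path.join(work_dir, "lr_mapping.bam"),
        "--out-dir", os.path.join(work_dir, "hapdup"),
        "-t", str(threads),
        "--rtype", rtype,
    ]


if __name__ == "__main__":
    # Example directory is a placeholder, not a path from the project.
    print(" ".join(build_hapdup_command("/data/run1", rtype="hifi")))
```

The returned list can be appended to the `docker run ...` or `singularity exec ...` prefix, so the read type is validated once before the container starts.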

Quick start using Singularity
@@ -56,15 +59,16 @@ conda install singularity
```

Next steps assume that your `assembly.fasta` and `lr_mapping.bam` are in the same directory,
which will also be used for hapdup output. If it is not the case, you might need to bind additional
which will also be used for HapDup output. If it is not the case, you might need to bind additional
directories using the `--bind` argument. The number of threads (`-t` argument)
should be adjusted according to the available resources.
should be adjusted according to the available resources. For PacBio HiFi input, use
`--rtype hifi` instead of `--rtype ont`.

```
singularity pull docker://mkolmogo/hapdup:0.3
singularity pull docker://mkolmogo/hapdup:0.4
HD_DIR=`pwd`
singularity exec --bind $HD_DIR hapdup_0.3.sif \
hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64
singularity exec --bind $HD_DIR hapdup_0.4.sif \
hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64 --rtype ont
```

Output files
@@ -82,7 +86,7 @@ Fully-phased blocks are given in the `phased_blocks*` files.
Pipeline overview
-----------------

1. hapdup starts with filtering alignments that are likely originating from the unassembled parts of the genome.
1. HapDup starts with filtering alignments that are likely originating from the unassembled parts of the genome.
Such alignments may later create false haplotypes if not removed (e.g. reads from a segmental duplication with two copies
can create four haplotypes).

@@ -100,19 +104,19 @@ Currently, it allows to recover large heterozygous inversions.
Benchmarks
----------

We evaluated hapdup haplotypes in terms of reconstructed structural variants signatures (heterozygous & homozygous)
We evaluated HapDup haplotypes in terms of reconstructed structural variants signatures (heterozygous & homozygous)
using the HG002 sample, for which the [curated set of SVs](https://www.nature.com/articles/s41587-020-0538-8)
is available. We used the [recent ONT data](https://s3-us-west-2.amazonaws.com/miten-hg002/index.html?prefix=guppy_5.0.7/)
basecalled with Guppy 5.

Given hapdup haplotypes, we called SV using [dipdiff](https://github.com/fenderglass/dipdiff). We also compare SV
Given HapDup haplotypes, we called SV using [dipdiff](https://github.com/fenderglass/dipdiff). We also compare SV
set against hifiasm assemblies, even though they were produced from HiFi rather than ONT reads.
The evaluation used truvari with the `-r 2000` option. GT refers to genotype-considered benchmarks.


| Method | Precision | Recall | F1-score | GT Precision | GT Recall | GT F1-score |
|----------------|-----------|--------|----------|--------------|-----------|-------------|
| Shasta+Hapdup | 0.9500 | 0.9551 | 0.9525 | 0.934 | 0.9543 | 0.9405 |
| Shasta+HapDup | 0.9500 | 0.9551 | 0.9525 | 0.934 | 0.9543 | 0.9405 |
| Sniffles | 0.9294 | 0.9143 | 0.9219 | 0.8284 | 0.9051 | 0.8605 |
| CuteSV | 0.9324 | 0.9428 | 0.9376 | 0.9119 | 0.9416 | 0.9265 |
| hifiasm | 0.9512 | 0.9734 | 0.9622 | 0.9129 | 0.9723 | 0.9417 |
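
The F1-score columns combine precision and recall as their harmonic mean, F1 = 2PR / (P + R). The sketch below recomputes that value for three rows of the table above as a sanity check. The function name `f1_score` is illustrative, and small differences in the last digit are expected because the listed precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the non-GT columns of the table.
rows = {
    "Shasta+HapDup": (0.9500, 0.9551, 0.9525),
    "CuteSV": (0.9324, 0.9428, 0.9376),
    "hifiasm": (0.9512, 0.9734, 0.9622),
}

for method, (p, r, reported) in rows.items():
    print(f"{method}: recomputed {f1_score(p, r):.4f}, reported {reported:.4f}")
```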
@@ -124,7 +128,7 @@ Yak k-mer based evaluations:
| 1 | 35 | 0.0389 | 0.1862 |
| 2 | 35 | 0.0385 | 0.1845 |

Given a minimap2 alignment, hapdup runs in ~400 CPUh and uses ~80 GB of RAM.
Given a minimap2 alignment, HapDup runs in ~400 CPUh and uses ~80 GB of RAM.

Source installation
-------------------
@@ -136,9 +140,9 @@ If you prefer, you can install from source as follows:
conda create -n hapdup python=3.8
conda activate hapdup
#get Hapdup source
git clone https://github.com/fenderglass/Hapdup
cd Hapdup
#get HapDup source
git clone https://github.com/fenderglass/hapdup
cd hapdup
git submodule update --init --recursive
#build and install Flye
@@ -155,13 +159,13 @@ To run, ensure that the conda environment is activated and then execute:

```
conda activate hapdup
./hapdup.py --assembly assembly.fasta --bam lr_mapping.bam --out-dir hapdup -t 64
./hapdup.py --assembly assembly.fasta --bam lr_mapping.bam --out-dir hapdup -t 64 --rtype ont
```

Acknowledgements
----------------

The major parts of the hapdup pipeline are:
The major parts of the HapDup pipeline are:

* [PEPPER](https://github.com/kishwarshafin/pepper)
* [Margin](https://github.com/UCSC-nanopore-cgl/margin)
@@ -184,7 +188,7 @@ PEPPER/Margin/Shasta support:
Citation
--------

If you use hapdup in your research, the most relevant papers to cite are:
If you use HapDup in your research, the most relevant papers to cite are:

Kishwar Shafin, Trevor Pesout, Pi-Chuan Chang, Maria Nattestad, Alexey Kolesnikov, Sidharth Goel, Gunjan Baid et al.
"Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks." bioRxiv (2021).
@@ -198,7 +202,7 @@ Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin and Pavel Pevzner,
License
-------

hapdup is distributed under a BSD license. See the [LICENSE file](LICENSE) for details.
HapDup is distributed under a BSD license. See the [LICENSE file](LICENSE) for details.
Other software included in this distribution is released under either MIT or BSD licenses.


2 changes: 1 addition & 1 deletion hapdup/__version__.py
@@ -1 +1 @@
__version__ = 0.3
__version__ = 0.4
