GitHub - MehakBindra/CGProb: Counting coverage gaps

Introduction

CGProb estimates the probability of observing a coverage gap after a heterozygous locus on the second haplotype of a circular diploid genome. CGProb takes the genome length, coverage on each haplotype, the sequencing read length distribution as input. It estimates the probability of observing a coverage gap after a heterozygous locus on the second haplotype by counting the number of read sequencing outputs which have a coverage gap and dividing it by the total number of read sequencing outputs.

For ease of computation CGProb allows users to scale down the longer read lengths by a constant factor. This factor can be chosen by the user.

How to Use

# Install seqrequester
# Dependencies: GCC (tested with 11.3.0), git 2.25.1 or higher, 

cd /install/dir/for/seqrequester

git clone https://github.com/marbl/seqrequester.git
git checkout 31141c14d80fe0dde2766bcc3e622bf600e2bba4

cd src && make -j 12

# Activate conda environment containing gmp (https://anaconda.org/conda-forge/gmp)
conda activate myenv

# Install CGProb
# Dependencies:
# GCC 11.2.0 or higher, C++ version standard c++2b, 
# gmp library for c++ version 6.2.1
cd /install/dir/for/CGProb

git clone https://github.com/at-cg/CGProb.git
cd CGProb
python3 create_makefile.py /home/anaconda3/envs/myenv/include /home/anaconda3/envs/myenv/lib > makefile
make

# Create artifical genome for seqrequester. 
# Input desired length of chromosome after createGenome.py 
# Output in "simGenome.fasta"

python3 create_genome.py 1000000

# Simulate and sumarize reads from read length distribution for two haplotypes

/install/dir/for/seqrequester/build/bin/seqrequester simulate \
-circular \
-genome simGenome.fasta \
-genomesize 1000000 \
-coverage 10 \
-distribution CGProb/distributions/fixed.txt \
> readsHap1.fasta

/install/dir/for/seqrequester/build/bin/seqrequester summarize -simple readsHap1.fasta > originalReadLengthsHap1.txt

/install/dir/for/seqrequester/build/bin/seqrequester simulate \
-circular \
-genome simGenome.fasta \
-genomesize 1000000 \
-coverage 10 \
-distribution CGProb/distributions/fixed.txt \
> readsHap2.fasta

/install/dir/for/seqrequester/build/bin/seqrequester summarize -simple readsHap2.fasta > originalReadLengthsHap2.txt

# Choose scaling factor to scale down read lengths
# For computation new genome size and read lengths will be ceiling value of initial_size / scaling_factor

python3 scale_down.py originalReadLengthsHap1.txt shortLengthsHap1.txt 1000
python3 scale_down.py originalReadLengthsHap2.txt shortLengthsHap2.txt 1000

# Run computation
tail -2 readsHap1.fasta | head -1 | awk -F'[=,]' '{print $2}' > readCountHap1.txt
tail -2 readsHap2.fasta | head -1 | awk -F'[=,]' '{print $2}' > readCountHap2.txt

/install/dir/for/CGProb/compute \
-R $(cat readCountHap1.txt) -r $(cat readCountHap2.txt) -D shortLengthsHap1.txt -d shortLengthsHap2.txt > output.txt

tail output.txt | grep "probability = " | awk '{print $3}' > probability.txt

The output is stored in probability.txt. It is recommended that initial genome size is set to about 1,000,000 bp. We recommend scaling reads lengths down by a factor of 1000.

Usage Details

The following options can be used to customise the behaviour of CGProb

Genome Length [INT]
- Artificial Genome (created using create_genome.py)
Read Length Distribution of sequencing instrument
- Format: read_length number_of_reads
Coverage on first haplotype [INT]
Coverage on second haplotype [INT]
Reads on haplotypes
- Simulated using seqrequester
- readsHap1.fasta, originalReadLengthsHap1.txt, readCountHap1.txt
- readsHap2.fasta, originalReadLengthsHap2.txt, readCountHap1.txt
Scaling Factor [INT]
- Ideally, genome length is a multiple of this parameter, not necessary
- shortLengthsHap1.txt, generated using scale_down.py
- shortLengthsHap2.txt, generated using scale_down.py

Inputs to the executable compute are:

-g INT: genome size [1000]
-R INT: number of reads on first haplotype (required input)
-r INT: number of reads on second haplotype (required input)
-h INT: location of heterozygous locus [200]; 1 <= heterozygous locus <= genome size;
-D INT: read length distribution on first haplotype (required input)
-d INT: read legnth distribution on second haplotype (required input)
-p INT: number of bits of precision [128]
-t INT: number of threads [32]

Output File: Currently, the executable compute prints to stdout. This can be redirected as evinced in the How to Use section. The last line in output.txt contains the probability of observing a coverage gap on the second haplotype near a heterozygous locus due to contained read deletion.

Installation

Install seqrequester

Dependencies: GCC (tested with 11.3.0), git 2.25.1 or higher, gmp library for c++

git clone https://github.com/marbl/seqrequester.git
git checkout 31141c14d80fe0dde2766bcc3e622bf600e2bba4

Install gmp (https://anaconda.org/conda-forge/gmp)
Activate environment containing gmp

conda activate myenv

Install CGProb

Dependencies: gcc version 11.2.0 or higher; GMP version 6.2.1 or higher

git clone https://github.com/at-cg/CGProb.git
cd CGProb
python3 create_makefile.py /home/anaconda3/envs/myenv/include /home/anaconda3/envs/myenv/lib > makefile
make

Examples

Computing probability of a coverage gap near a heterozygous locus: Coverage 15x on each haplotype; PacBio HiFi read length distribution; Chromosome of size 100000; Reads scaled down by a factor of 100; Precision 256 bits; Number of threads 128

python3 create_genome.py 100000
seqrequester simulate -circular -genome simGenome.fasta -genomesize 100000 -coverage 15 -distribution CGProb/distributions/pacbio_hifi.txt \
> readsHap1.fasta
seqrequester summarize -simple readsHap1.fasta > originalReadLengthsHap1.txt

seqrequester simulate -circular -genome simGenome.fasta -genomesize 100000 -coverage 15 -distribution CGProb/distributions/pacbio_hifi.txt \
> readsHap2.fasta
seqrequester summarize -simple readsHap2.fasta > originalReadLengthsHap2.txt

# Scaling factor 100; compressed genome size 1000.
python3 scale_down.py originalReadLengthsHap1.txt shortLengthsHap1.txt 100
python3 scale_down.py originalReadLengthsHap2.txt shortLengthsHap2.txt 100
tail -2 readsHap1.fasta | head -1 | awk -F'[=,]' '{print $2}' > readCountHap1.txt
tail -2 readsHap2.fasta | head -1 | awk -F'[=,]' '{print $2}' > readCountHap2.txt

CGProb/compute -R $(cat readCountHap1.txt) -r $(cat readCountHap2.txt) -D shortLengthsHap1.txt -d shortLengthsHap2.txt -p 256 -t 128 > output.txt
tail output.txt | grep "probability = " | awk '{print $3}' > probability.txt

Computing probability of a coverage gap near a somatic mutation locus: Coverage 20x on haplotype 1 and 5x on haplotype 2; ONT Simplex read length distribution; Chromosome of size 1000000; Reads scaled down by a factor of 10000; Somatic mutation locus 50; Precision 256 bits; Number of threads 64.

python3 create_genome.py 1000000
seqrequester simulate -circular -genome simGenome.fasta -genomesize 1000000 -coverage 20 \
-distribution CGProb/distributions/nanopore_simplex.txt > readsHap1.fasta
seqrequester summarize -simple readsHap1.fasta > originalReadLengthsHap1.txt

seqrequester simulate -circular -genome simGenome.fasta -genomesize 100000 -coverage 5 \
-distribution CGProb/distributions/nanopore_simplex.txt > readsHap2.fasta
seqrequester summarize -simple readsHap2.fasta > originalReadLengthsHap2.txt

# Scaling factor 10000; compressed genome size 100.
python3 scale_down.py originalReadLengthsHap1.txt shortLengthsHap1.txt 100
python3 scale_down.py originalReadLengthsHap2.txt shortLengthsHap2.txt 100
tail -2 readsHap1.fasta | head -1 | awk -F'[=,]' '{print $2}' > readCountHap1.txt
tail -2 readsHap2.fasta | head -1 | awk -F'[=,]' '{print $2}' > readCountHap2.txt

CGProb/compute -g 100 -h 50 \
-R $(cat readCountHap1.txt) -r $(cat readCountHap2.txt) \
-D shortLengthsHap1.txt -d shortLengthsHap2.txt \
-p 256 -t 128 > output.txt
tail output.txt | grep "probability = " | awk '{print $3}' > probability.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Introduction

How to Use

Usage Details

Installation

Examples

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
distributions		distributions
README.md		README.md
compute.cpp		compute.cpp
create_genome.py		create_genome.py
create_makefile.py		create_makefile.py
makefile		makefile
scale_down.py		scale_down.py

MehakBindra/CGProb

Folders and files

Latest commit

History

Repository files navigation

Introduction

How to Use

Usage Details

Installation

Examples

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages