CSKS

Coalescent simulations of k-mer spectra to test Genome Profiling tools.

Usage

Set up conda environment: bash 00setUpEnv.sh
Run pipeline: bash -i 01pipeline.sh
Adjust paprameters in 01pipeline.sh and re-run if desired.

The motivation

Simulating genetic data is not trivial. In the case of one diploid (i.e. two genomes), one can just throw in heterozygous sites according to the desired heterozygosity level (θ). If more than two genomes are simulated, the task becomes more complicated. This is because related genomes have a common ancestor and thus share mutations. A summary of the number of times different mutations are shared between genomes in a sample is called the site-frequency spectrum.

Implementation

There are at the moment four steps that take place in a temporary directory. Only the final result, a k-mer spectrum, is stored in the working directory. The steps are:

The simulation of genetic variants and storage of simulated genomes
Generation of sequencing reads from the simulated genomes
Generation of a KMC3 database
Generation of a k-mer spectrum

Below, these steps are descibed in more detail.

Coalescent simulation

This is done by the script makeGenomes.py, which runs the coalescent simulator msprime to generate genetic variant data in a sample of n genomes (where n is the ploidy level specified). msprime generates information only for variant sites and so the invariant rest of the genome is made up separately.

The genomes are then written (concatenated) to a single-record FASTA file, genomes.fa, with a specific header format containing the genome size information: >genomes_p[ploidy]_s[haploid GS]_t[total GS].

This script can be modified to allow for allopolyploidy, genomic repeats, unequal sub-genomes, etc.

Generation of sequencing reads

This is done by makeReads.py, a very simple script that may well be replaced in the future by a more sophisticated sequnecing read simulator (there are several available for paired reads with error models).

For now, this generates error-free single-end reads in FASTA format which are written to reads.fa. To retrieve the genome size, the script looks at the header line of genomes.fa. It then generates reads of 150nt at 30x sequencing coverage. The read positions are sampled at random from a uniform distribution on the interval [1, genome size-150]. This should lead to narrow 'Poisson' peaks (bias parameter 0).

K-mer data base

KMC3 is run on reads.fa with the k-mer length set to 21.

Spectra

A k-mer spectrum is generated from the KMC DB and the temporary files are deleted.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
code		code
.gitignore		.gitignore
00setUpEnv.sh		00setUpEnv.sh
01pipeline.sh		01pipeline.sh
02peakWidthsPipeline.sh		02peakWidthsPipeline.sh
03delPipeline.sh		03delPipeline.sh
04delPipeCoal.sh		04delPipeCoal.sh
05coalDiplKLoop.sh		05coalDiplKLoop.sh
LICENSE		LICENSE
README.md		README.md
Tetmer_2.1.0.tar.gz		Tetmer_2.1.0.tar.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CSKS

Usage

The motivation

Implementation

Coalescent simulation

Generation of sequencing reads

K-mer data base

Spectra

About

Releases 1

Packages

Languages

License

hannesbecher/CSKS

Folders and files

Latest commit

History

Repository files navigation

CSKS

Usage

The motivation

Implementation

Coalescent simulation

Generation of sequencing reads

K-mer data base

Spectra

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages