Karp is a program for identifying the relative frequencies of reference sequences contributing to a pooled DNA sample. Karp was developed with 16S rRNA sequencing as its primary application, but will work in any context where reference sequences are available in fasta format and the pooled DNA is in fastq format with base quality scores. Karp employs a pseudoalignment step to quickly match reads with potential references. Reads are locally aligned to the references they pseudoalign with, and Karp calculates a likelihood of each read originating from each reference using base quality scores. Finally, Karp employs an EM algorithm that uses information from all the reads to accurately estimate the relative frequencies of each reference in the sample.
More details are provided in the manuscript: Using pseudoalignment and base quality to accurately quantify microbial community composition
After cloning the repository, or downloading and decompressing the tarball follow the following steps to install Karp. In the Karp directory
mkdir build
cd build
cmake ..
make
This should create a working karp executable in the folder /src If you get an error message during compilation see the bottom of this readme for trouble shooting tips.
Karp requires CMake v2.8+, ZLIB, HDF5, Eigen and TCLAP. The Eigen and TCLAP libraries are included in this repository and the downloadable .tar package.
Command | Description |
---|---|
-c / --command | Which of Karp's functions should run, options are 'index' 'quantify' 'tabulate' |
-h / --help | Print out Karp options and their descriptions |
The first stage of analysis with Karp is building the k-mer index for pseudoalignment. Indexing is performed if the option 'index' is specified as the -c value.
Command | Description |
---|---|
-r / --ref | Reference fasta file to build index from, must have matching .fai file created with samtools faidx. Enter multiple files with comma delimiter. |
-i / --index | Name of index file to output |
-k / --kmer | Length of k-mer to use when building index, must be odd and range between 3 and 31 [default = 31]. |
The main function of Karp is to quantify the taxonomy in a pooled DNA sample, that is done after the index has been constructed using the option 'quantify' as the -c value.
Command | Description |
---|---|
-r / --ref | Reference fasta files, must have matching .fai file created with samtools faidx. Multiple files entered with comma delimiter. |
-i / --index | Name of k-mer index file built with 'index' |
-f / --forward | Fastq files to be quantified, can be gzipped. Enter multiple files with comma separating them. If quantifying single-end reads, enter fastq files with this command. If quantifying paired-end reads enter forward files with this command. Note: Karp requires fastq files with base quality scores, it does not work with fasta files |
--paired | Enter this flag if you are quantifying paired-end reads |
-q / --reverse | Reverse oriented fastq files to be quantified. Should match order of files entered with --forward. Can be gzipped. |
-t / --tax | Taxonomy file(s) matching reference fasta file(s). See note below for more details about acceptable formats. |
-o / --out | Base name for output files. Karp will output a '.freqs' file containing results and '.log' file [default = 'karp']. |
--threads | Number of threads to use [default = 1]. |
--phred | Version of phred quality scores being used. Default is Phred+33, used by Illumina 1.8+. Other valid option is '64' which corresponds to Phred+64 quality coding [default = 33]. |
--min_freq | Lower frequency bound during EM update step. Taxa with frequencies below this threshold are removed from the analysis [default = 1/Number of reads]. |
--like_thresh | The maximum likelihood z-score threshold. Karp uses distribution of quality scores in input fastq file to calculate a "null" distribution for likelihood values when each read is matched to the correct reference and all mismatches are the result of sequencing error. Using this distribution Karp calculates a z-score for the maximum likelihood observed for each read, and if this value falls to far outside the theoretical distribution the read is filtered from analysis [default = -2.0]. |
--collapse | Run Karp in Collapse mode. In collapse mode classification is done at the level of taxonomic labels rather than individual reference sequences. |
Command | Description |
---|---|
--max_like_out | Output a file containing the maximum likelihood value for each read, use this to diagnose appropriate max likelihood filter threshold. |
--strict | Use the strict intersection of equivalence classes to declare psuedoalignment of read to reference (this is the definition of pseudoalignment used by Kallisto). Default Karp behavior is that if no strict intersections exists, the reference(s) with the greatest number of matching k-mers are declared as pseudoaligning to a query read. |
--em_tolerance | When the squared sum of allele frequency changes between EM algorithm iterations falls below this threshold the algorithm has converged [default = 1e-12]. |
--no_harp_filter | Turn off the likelihood z-score filter. |
--fail | Output a file listing the reads that fail to pseudoalign and those that are excluded by the maximum likelihood z-score filter. |
--max_em_it | Maximum number of iterations of EM algorithm before declaring failure to converge [default = 1000]. |
--readinfo | Adds additional output file, that contains the posterior probabilities assigned to each individual read after the EM algorithm converges. The format is one read per line, followed by pairs of (percentage,reference ID from fasta). |
After multiple samples have been classified using Karp, the program can be used to calculate some basic summary statistics using the 'tabulate' command.
Command | Description |
---|---|
--samples | File with list of Karp output files from each sample that has been classified. File should have two columns, the first with a sample ID and the second with a path to the corresponding Karp output file |
If using Karp with multiple threads, the internal local aligner runs more quickly if the reference database fasta file is broken into multiple smaller files. A single kmer index can still be used, as long as the references in the multiple fasta files are all present in the single index.
The default likelihood threshold that karp uses to filter reads for being unlikely to originate from the reference database can be too lenient or strict with some distributions of read quality scores. If it seems that karp is filtering too many (or too few) reads, it can be helpful to use the ** --max_likes_out** option, and plot the maximum likelihood for each read to get a sense of how to adjust the ** --like_thresh** option.
Data from a single 16S hypervariable region often have little power to distinguish between closely related reference sequences. When this is the case Karp will partition the probability weight across the closely related species, and they will appear at low frequency. In such situations the default minimum EM update frequency can be too strict, and setting it to a lower value based on the depth of the samples being classified can improve results. The default value is 1/Number of Reads, with a single hypervariable region 1/(10*Number of Reads) or 1/(20*Number of Reads) can sometimes give more accurate results.
In the repo there is a directory labeled example
, this contains some toy files for demonstrating how to use Karp.
File | Description |
---|---|
reference.fasta | Reference database in fasta format, contains 500 sequences |
reference.fasta.fai | Tabix index of reference fasta file |
reference.tax | Taxonomy file with description of sequences in reference.fasta |
simulated.data.fastq.gz | Fastq file containing 10,000 reads with mixture of organisms from reference.fasta |
simulated.data.actual_counts.txt | True origins of reads in simulated.data.fastq.gz |
The first step of running Karp is to create a k-mer index for pseudoalignment:
./karp -c index -r reference.fasta -i reference.index
This will create the index file reference.index
. Next we quantify the simulated reads:
./karp -c quantify -r reference.fasta -i reference.index -f simulated.data.fastq.gz -o simulated.results -t reference.tax
In the example folder we now have the files simulated.results.freqs
and simulated.results.log
. simulated.results.freqs
contains Karp's estimates of the number of reads each reference sequence in reference.fasta
contributed to the simulated sample. simulated.results.log
contains information about the Karp run we just performed.
Taxonomy files should have two columns and be tab delimited, with the first column giving the ID for the haplotype matching the reference fasta file, and the second giving the taxonomic label. Examples of acceptable formats include:
815395 Bacteria;Firmicutes;Bacilli;Bacillales;
815395 k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;
815395 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__; g__; s__
Error message: unrecognized command line option "-std=c++0x"
or
Error message: error: call of overloaded ‘to_string(size_t&)’ is ambiguous
Likely your default C++ compiler is out of date, check for newer compiler availability. Once you have found/installed an up to date compiler, delete and recreate the build directory and rerun the cmake step with command
cmake -DCMAKE_CXX_COMPILER=/path/to/compiler -DCMAKE_C_COMPILER=/path/to/compiler ..
Error message: Could NOT find HDF5 (missing: HDF5_LIBRARIES HDF5_INCLUDE_DIRS HDF5_HL_LIBRARIES)
Check if you have HDF5 libraries installed on your machine. If you do not they can be downloaded and installed https://support.hdfgroup.org/HDF5/
If you have HDF5 installed, and are still getting this message, make sure your $HDF5_DIR variable is pointing to the correct directory. If it is, try adding -DCMAKE_PREFIX_PATH=/path/to/HDF5 to the cmake command.
There is also a known issue with some versions of CMAKE and the HDF5 library. If you are running CMAKE versions 3.6.0 or 3.6.1 see if you have access to an earlier or later version of CMAKE and try using it instead.