Note: There is a known issue with CompareM that can result in no homologs being identified when it is run on some Linux systems. This is related to different implementations of 'sort'. Titus Brown has suggested a solution that addresses this for Mac OS X. The AAI calculator from the Kostas Lab is an alternative solution.
CompareM is a software toolkit that supports large-scale comparative genomic analyses. It provides statistics across sets of genomes (e.g., amino acid identity) and for individual genomes (e.g., codon usage). Parallelized implementations are provided for computationally intensive tasks to allow scalability to thousands of genomes. Common workflows are provided as single methods to support easy adoption by users, and a more granular interface is provided to allow experienced users to exploit specific functionality. CompareM is open source and released under the GNU General Public License (Version 3).
Comparative genomic statistics:
- average amino acid identity (AAI) between genomes
- taxonomic classification by calculating AAI between query genomes and a reference database
Genomic usage patterns:
- codon usage
- amino acid usage
- kmer usage for k <= 8 (e.g., tetranucleotide)
- stop codon usage
Other:
- di-nucleotide and codon usage patterns for identifying LGT
- data exploration using dissimilarity matrices, hierarchical clustering trees, and heat maps
- Ported to Python 3 starting with version 0.1.0
CompareM can be installed via Conda using:
> conda install -c bioconda comparem
CompareM can also be installed with pip:
> sudo pip install comparem
You must install Prodigal and DIAMOND independently.
CompareM makes use of the numpy, scipy, matplotlib, and biolib Python packages, and assumes the following third-party dependencies are on your system path:
- prodigal >= 2.6.2: Hyatt D, Locascio PF, Hauser LJ, Uberbacher EC. 2012. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28: 2223-2230.
- diamond >= 0.9.0: Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59–60 doi:10.1038/nmeth.3176.
Most systems already contain the “SciPy Stack” of numpy, scipy, and matplotlib. However, if you need to install these on your system, instructions can be found at:
The functionality provided by CompareM can be accessed through the help menu:
> comparem -h
Usage information about specific functions can also be accessed through the help menu, e.g.:
> comparem aa_usage -h
The most common task performed with CompareM is the calculation of pairwise amino acid identity (AAI) values between a set of genomes. This can be performed using the aai_wf command:
> comparem aai_wf <input_files> <output_dir>
The <input_files> argument indicates the set of genomes to compare and can be either i) a text file where each line indicates the location of a genome, or ii) a directory containing all genomes to be compared. The nucleotide sequences of the genomes must be in FASTA format. The <output_dir> argument indicates the desired directory for all output files. A typical use of this command would be:
> comparem --cpus 32 aai_wf my_genomes aai_output
where the directory my_genomes contains a set of genomes in FASTA format, the results are to be written to a directory called aai_output, and 32 processors should be used to calculate the results.
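For the text-file form of the input, the list of genome locations can be generated with a few lines of Python. This is a minimal sketch and not part of CompareM: the function name write_genome_list and the file name genome_list.txt are hypothetical, and it assumes the genomes use the fna extension.

```python
from pathlib import Path


def write_genome_list(genome_dir, list_file, file_ext="fna"):
    """Write one genome path per line and return the number of genomes listed.

    Hypothetical helper: collects files in genome_dir that end in the given
    extension and writes their absolute paths to list_file.
    """
    genome_paths = sorted(
        p.resolve() for p in Path(genome_dir).iterdir()
        if p.is_file() and p.suffix == "." + file_ext
    )
    with open(list_file, "w") as out:
        for path in genome_paths:
            out.write(f"{path}\n")
    return len(genome_paths)
```

The resulting file could then be passed as the <input_files> argument, for example: comparem aai_wf genome_list.txt aai_output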
A number of optional arguments can also be specified. These include the sequence similarity parameters used to define reciprocal best hits between genomes (i.e., homologs). By default, the e-value (--evalue), percent sequence identity (--per_identity), and percent alignment length (--per_aln_len) parameters are set to 1e-5, 30%, and 70%, respectively. When given a directory of genomes to process, CompareM only processes files with the fna extension. This can be changed with the --file_ext argument. In addition, if genomes are already represented by amino acid protein sequences (as opposed to genomic nucleotide sequences), this must be specified with the --proteins flag. Otherwise, genes will be identified de novo using the Prodigal gene caller. The time to compute all pairwise AAI values can be substantially reduced by using multiple processors, as specified with the --cpus argument. Other arguments are for specialized uses and are discussed in the User's Guide.
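To make the default thresholds concrete, the sketch below filters alignment hits with the stated defaults and keeps reciprocal best hits. It is illustrative only and is not CompareM's implementation: the Hit record and its field names, ranking hits by bitscore, treating the thresholds as inclusive, and measuring percent alignment length relative to the query length are all assumptions.

```python
from collections import namedtuple

# Hypothetical hit record; the field names are not taken from CompareM.
Hit = namedtuple(
    "Hit", "query subject evalue per_identity aln_len query_len bitscore"
)


def passes_filter(hit, evalue=1e-5, per_identity=30.0, per_aln_len=70.0):
    """Apply the default similarity thresholds to a single hit."""
    # Assumption: alignment length is expressed relative to the query length.
    aln_pct = 100.0 * hit.aln_len / hit.query_len
    return (hit.evalue <= evalue
            and hit.per_identity >= per_identity
            and aln_pct >= per_aln_len)


def best_hits(hits):
    """Map each query gene to its highest-scoring hit that passes the filter."""
    best = {}
    for hit in hits:
        if not passes_filter(hit):
            continue
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


def reciprocal_best_hits(hits_a_to_b, hits_b_to_a):
    """Return (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(hits_a_to_b)
    best_ba = best_hits(hits_b_to_a)
    return [
        (gene_a, hit.subject)
        for gene_a, hit in best_ab.items()
        if hit.subject in best_ba and best_ba[hit.subject].subject == gene_a
    ]
```

Under these assumptions, a hit with an e-value of 1e-3 is discarded even at 90% identity, because every threshold must be met before a hit is considered.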
Pairwise AAI statistics are provided in the output file ./<output_dir>/aai/aai_summary.tsv. This file consists of 8 columns indicating:
- Identifier of the first genome
- Number of genes in the first genome
- Identifier of the second genome
- Number of genes in the second genome
- Number of orthologous genes identified between the two genomes
- The mean amino acid identity (AAI) of orthologous genes
- The standard deviation of the AAI across orthologous genes
- The orthologous fraction (OF) between the two genomes, defined as the number of orthologous genes divided by the minimum number of genes in either genome
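The summary table can be loaded for downstream analysis using only the Python standard library. The sketch below assumes the file is tab-separated with the eight columns in the order listed above, and that any header or comment line begins with '#'. The record type, field names, and function names are hypothetical. The orthologous_fraction helper follows the definition above and returns a value between 0 and 1; the scale used in the file itself (fraction or percentage) is not assumed here.

```python
import csv
from collections import namedtuple

# Hypothetical record mirroring the eight columns described above.
AaiRow = namedtuple(
    "AaiRow",
    "genome_a genes_a genome_b genes_b num_orthologs mean_aai std_aai of",
)


def orthologous_fraction(num_orthologs, genes_a, genes_b):
    """Orthologous genes divided by the smaller of the two gene counts."""
    return num_orthologs / min(genes_a, genes_b)


def read_aai_summary(path):
    """Parse an AAI summary table into a list of AaiRow records."""
    rows = []
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            # Assumption: header or comment lines start with '#'.
            if not fields or fields[0].startswith("#"):
                continue
            rows.append(AaiRow(
                fields[0], int(fields[1]),
                fields[2], int(fields[3]),
                int(fields[4]), float(fields[5]),
                float(fields[6]), float(fields[7]),
            ))
    return rows
```

For example, two genomes with 1000 and 800 genes that share 600 orthologous genes have an orthologous fraction of 600 / 800 = 0.75.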
Other output files produced by this command are described below.
Detailed information regarding the use of CompareM can be found in the User's Guide (user_guide.pdf).
If you find this package useful, please cite this git repository (https://github.com/dparks1134/CompareM)
Copyright © 2014 Donovan Parks. See LICENSE for further details.