Disk-based, Multiplatform, Mobile-friendly Metagenomics Classifier in Java
The Oligomer-based Classifier of Taxonomic Operational and Pan-genome Units via Singletons (OCTOPUS) is multiplatform software for portable metagenomic analysis of Nanopore data. OCTOPUS is written in Java and reimplements several features of the popular Kraken2 and KrakenUniq software, adding original components designed to improve classification performance on incomplete or sampled databases (e.g., family subsets of bacteria). This makes it well suited to running on smartphones or tablets.
We provide two versions of OCTOPUS: one (OCTOPUS_Android.java) is fully tailored to mobile devices and serves as the codebase for the Android OS app, while the other (OCTOPUS.java) is more general-purpose and faster (but uses unsafe code).
OCTOPUS comes with three built-in indexed bacterial databases built from: (1) all reference genomes from the Bacterial and Viral Bioinformatics Resource Center (OCTOPUS index downloadable at: https://osf.io/jgw9z/); (2) the World Health Organization’s set of bacteria of concern for drug resistance (OCTOPUS index downloadable at: https://osf.io/jgw9z/); (3) the MEGARes database, a hand-curated collection of antimicrobial resistance genes (OCTOPUS index included in this GitHub repository).
Other databases can be created using the ancillary tools for k-mer extraction and indexing (GenomesToKmers.java, BuildOCTOPUSdb.java). A new database can be created from a multi-FASTA file or from a folder containing multiple FASTA files (one for each genome). A taxonomy tree is not needed, since OCTOPUS performs an internal clustering. The file named "info.txt" in the "_OCTOPUSdb" folder links OCTOPUS' taxon IDs and clusters to the original genome names in the input FASTA(s).
OCTOPUS runs from the command line as follows:
- GENERAL PURPOSE: java -cp ".;octopus_jars/*" OCTOPUS d:database_folder f:fastq_file (can be gzipped)
- ANDROID: java -cp ".;octopus_android_jars/*" OCTOPUS_Android d:database_folder f:fastq_file (can be gzipped)
- On Linux/UNIX, use -cp ".:octopus_jars/*" or ".:octopus_android_jars/*" instead
- Additional command line options include:
  - t:number_of_threads
  - o:output_file_name
  - s:probthreshold_or_minimum_hits (for classification; the default is probability>0.75, and any value >=1 is interpreted as a minimum frequency of hits)
  - l:log2m_value (for HyperLogLog)
  - h or help or -h or -help to print instructions
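The key:value option style above can be illustrated with a minimal sketch. This is not the parser used by OCTOPUS; the class and method names (OctopusArgsSketch, parse, describeThreshold) are hypothetical, and treating the minimum-hits bound as inclusive is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not OCTOPUS code: reads "key:value" tokens such as d:db_folder or s:0.75.
class OctopusArgsSketch {

    /** Splits each token at its first colon, so values may contain colons (e.g. Windows paths). */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int colon = arg.indexOf(':');
            if (colon > 0) {
                options.put(arg.substring(0, colon), arg.substring(colon + 1));
            }
        }
        return options;
    }

    /** Values >= 1 are read as a minimum hit count, smaller values as a probability threshold. */
    static String describeThreshold(double s) {
        if (s >= 1.0) {
            // Assumption: the bound is inclusive.
            return "minimum hits >= " + (long) s;
        }
        return "probability > " + s;
    }

    public static void main(String[] args) {
        Map<String, String> options =
                parse(new String[] {"d:db_folder", "f:reads.fastq.gz", "t:4", "s:0.75"});
        // Fall back to the documented default of 0.75 when s: is absent.
        double s = Double.parseDouble(options.getOrDefault("s", "0.75"));
        System.out.println(options);
        System.out.println(describeThreshold(s));
    }
}
```

Splitting at the first colon only keeps a database path such as d:C:\db intact.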
The output consists of two files:
- inputfilename_mappedGenomes.csv (taxonId, taxonName, genomeCoverage, readDepth)
- inputfilename_mappedReads.csv (readId, taxonID, minimizKmerHits|totMinimizKmers)
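For downstream processing, a row of the mapped-reads file could be parsed along the following lines. This sketch assumes comma-separated columns with the third column written as hits|total, as the field list above suggests; the exact spacing, the presence of a header row, and the class name MappedReadSketch are assumptions. The hit fraction it computes is only a convenience value and is not claimed to be the probability that OCTOPUS compares against the s: threshold.

```java
// Hypothetical sketch: one row of inputfilename_mappedReads, assumed to look like "readId,taxonID,hits|total".
class MappedReadSketch {
    final String readId;
    final String taxonId;
    final long hits;   // minimizer k-mer hits for the assigned taxon
    final long total;  // total minimizer k-mers in the read

    MappedReadSketch(String readId, String taxonId, long hits, long total) {
        this.readId = readId;
        this.taxonId = taxonId;
        this.hits = hits;
        this.total = total;
    }

    /** Parses one data row; throws IllegalArgumentException if the row does not match the assumed layout. */
    static MappedReadSketch parse(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length < 3) {
            throw new IllegalArgumentException("Expected 3 columns: " + line);
        }
        String[] counts = cols[2].trim().split("\\|");
        if (counts.length != 2) {
            throw new IllegalArgumentException("Expected hits|total: " + cols[2]);
        }
        return new MappedReadSketch(
                cols[0].trim(),
                cols[1].trim(),
                Long.parseLong(counts[0].trim()),
                Long.parseLong(counts[1].trim()));
    }

    /** Fraction of minimizer k-mers that hit; 0 when the read has no minimizers. */
    double hitFraction() {
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        MappedReadSketch r = parse("read-1, 42, 7|10");
        System.out.println(r.readId + " -> taxon " + r.taxonId + ", hit fraction " + r.hitFraction());
    }
}
```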
GenomesToKmers runs from the command line as follows:
- (Windows) java -Xmx[desired_ram] -cp ".;octopus_jars/*" GenomesToKmers genomes_fasta_file_or_folder (will use default parameters)
- (UNIX/Linux) java -Xmx[desired_ram] -cp ".:octopus_jars/*" GenomesToKmers genomes_fasta_file_or_folder (will use default parameters)
- The input file can also be gzipped (fasta.gz)
- Alternatively, run GenomesToKmers with options
  - f:genomes_fasta_file_or_folder ('f:' must be specified in this case)
  - with one or more options among
    - n:[y,n] (n:y masks non-ACGT characters with N; n:n, the default, does not)
    - k:kmer_length (default 29, min 15, max 35 for ACGT and max 29 for ACGTN bases)
    - s:percent_similarity (for clustering, default 90, <=0 for none)
    - m:subsample_x (selects only the first x species)
- -h or -help to print instructions
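The masking and k-mer length options can be pictured with a short sketch. It is not the GenomesToKmers implementation: the names (KmerSketch, maskNonAcgt, isValidK, kmers) are hypothetical, plain substring extraction stands in for whatever encoding the tool uses, and linking the ACGTN limit of 29 to n:y is an assumption based on the option descriptions above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of masking and k-mer extraction; not the GenomesToKmers code.
class KmerSketch {

    /** Upper-cases the sequence and replaces every character other than A, C, G, T with N. */
    static String maskNonAcgt(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (char c : seq.toUpperCase(Locale.ROOT).toCharArray()) {
            sb.append("ACGT".indexOf(c) >= 0 ? c : 'N');
        }
        return sb.toString();
    }

    /** Documented bounds: min 15, max 35 for ACGT, max 29 for ACGTN (assumed to apply when masking with N). */
    static boolean isValidK(int k, boolean maskWithN) {
        int max = maskWithN ? 29 : 35;
        return k >= 15 && k <= max;
    }

    /** Returns all overlapping substrings of length k, in order of position. */
    static List<String> kmers(String seq, int k) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    public static void main(String[] args) {
        String masked = maskNonAcgt("acgRtxAC");
        // k = 3 is below the tool's minimum and is used here only to keep the output short.
        System.out.println(masked + " " + kmers(masked, 3));
        System.out.println(isValidK(29, true) + " " + isValidK(35, true));
    }
}
```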
BuildOCTOPUSdb runs from the command line as follows:
- (Windows) java -Xmx[desired_ram] -cp ".;octopus_jars/*" BuildOCTOPUSdb input_file [-a, -g GB]
- (UNIX/Linux) java -Xmx[desired_ram] -cp ".:octopus_jars/*" BuildOCTOPUSdb input_file [-a, -g GB]
- The input_file is the one generated by GenomesToKmers
- The -a option will create a database for the Android version
- The -g option will shrink the database to the desired GB size
- The desired_ram should be ~5G for datasets of up to 50 million k-mers, ~10G for up to 100 million k-mers, ~20G for up to 200 million k-mers, etc. BuildOCTOPUSdb still works with any k-mer cardinality when less RAM is allocated, but the minimal perfect hashing will be less efficient.
- -h or -help to print instructions
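The RAM guideline above can be turned into a small helper for choosing the -Xmx value. The tiers for 50, 100 and 200 million k-mers come from the text; continuing the doubling pattern beyond 200 million is an assumption about what "etc." means, and the names (HeapSizeSketch, suggestedHeapGb) are hypothetical.

```java
// Hypothetical helper: maps a k-mer count to the heap size suggested by the guideline above.
class HeapSizeSketch {

    /**
     * Returns 5 for up to 50 million k-mers, 10 for up to 100 million, 20 for up to 200 million,
     * and keeps doubling both values for larger inputs (assumed extrapolation).
     */
    static long suggestedHeapGb(long kmerCount) {
        if (kmerCount < 0) {
            throw new IllegalArgumentException("kmerCount must be non-negative");
        }
        long capacity = 50_000_000L;
        long gb = 5;
        while (kmerCount > capacity) {
            // multiplyExact throws instead of silently overflowing on absurdly large inputs.
            capacity = Math.multiplyExact(capacity, 2L);
            gb = Math.multiplyExact(gb, 2L);
        }
        return gb;
    }

    public static void main(String[] args) {
        long kmers = 120_000_000L;
        System.out.println("-Xmx" + suggestedHeapGb(kmers) + "G");
    }
}
```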
Assembly accession IDs and associated labels of the BV-BRC bacterial DB are available in data/BV-BRC_taxon_id_Assembly_accession_Species.txt.
Assembly accession IDs of the WHO DB are available in data/WHO_Assembly_accession_Species.txt.