RGI can align short DNA sequences in FASTQ format using Bowtie2 , BWA , or KMA against CARD's protein homolog models. The default and recommended read aligner is KMA due to its documented better performance for redundant databases such as CARD. While CARD is not truly redundant, i.e. there are no identical reference sequences, CARD does reflect the AMR alelle network problem in that many sequences are very similar. For example, the nucleotide sequences of TEM-1 and TEM-2 are 99% similar with no alignment gaps. A sample generating short reads from a legitimate TEM-1 gene may result in reads aligned among TEM-1, TEM-2, or other TEM beta-lactamases depending upon the alignment algorithm chosen. The KMA publication and our own simulations find KMA best resolves this issue:
The above illustrates simulated 90x short read coverage from seven antibiotic resistance gene nucleotide reference sequences in CARD (catB, OXA-1, AAC(6')-Ib, NDM-1, BRP(MBL), QnrB1, CTX-M-15), subsequently aligned with RGI bwt against CARD using Bowtie2 or KMA algorithms. Reads are aligned to a single reference gene using KMA but for Bowtie2 the same reads are aligned across a selection of similar reference sequences, with associated lower MAPQ scores. Note that KMA has limits in its ability to resolve very similar sequences, e.g. all simulated catB3 reads were all aligned to catI and all simulated AAC(6')-Ib reads were aligned to AAC(6')-Ib-cr. These simulated data are available at: https://github.com/raphenya/read-mapping-analysis.
UPDATED RGI version 6.0.0 onward: In earlier versions of RGI, by default RGI bwt aligned reads to reference sequences from CARD's protein homolog models, protein variant models, rRNA mutation models, and protein over-expression models. However, as outlined above, the latter three model types require comparison to CARD's curated lists of mutations known to confer phenotypic antibiotic resistance to differentiate alleles conferring resistance from antibiotic susceptible alleles, e.g. a wild-type gyrase susceptible to fluoroquinolones. As such, earlier versions of RGI were over-reporting antibiotic resistance genes by not checking for these curated mutations. For example, while the KMA algorithm reports SNPs relative to reference, RGI was not screening these SNPs against CARD. Read alignments against the protein variant model, rRNA mutation model, and protein over-expression model reference sequences can now only be listed by use of the new --include_other_models parameter, but at this time these results still do not include comparison to CARD's curated lists of mutations. As such, these often spurious results are no longer included in default RGI bwt output. Support for mutation screening models will be added to future versions of RGI bwt.
For RGI bwt, FASTQ sequences can be aligned to the 'canonical' curated CARD reference sequences associated with the Antibiotic Resistance Ontology (i.e. sequences available in GenBank with clear experimental evidence of elevated MIC in a peer-reviewed journal available in PubMED) or additionally to the in silico predicted allelic variants available in CARD's Resistomes & Variants data set. The latter is highly recommended for non-clinical samples as the allelic diversity for AMR genes is greatly unrepresented in the published literature, with a strong bias towards clinical antibiotic resistance genes and pathogens, hampering high-stringency read mapping for samples with divergent alleles. Inclusion of CARD's Resistomes & Variants allows read mapping to predicted allelic variants and AMR gene homologs for a wide variety of pathogens, incorporation of CARD's Prevalence Data for easier interpretation of predicted AMR genes, and ultimately use of k-mer classifiers for prediction of pathogen-of-origin for FASTQ reads predicted to encode AMR genes (see below).
> What data is included in CARD? Can I add unpublished data?
CARD's Resistomes & Variants and Prevalence Data (nicknamed WildCARD) were generated using the RGI to analyze molecular sequence data available in NCBI Genomes for hundreds of pathogens of interest (see Sampling Table). For each of these pathogens, complete chromosome sequences, complete plasmid sequences, genomic island sequences, and whole genome shotgun (WGS) assemblies were analyzed individually by RGI. RGI results were then aggregated to calculate prevalence statistics for distribution of AMR genes among pathogens and plasmids, predicted resistomes, and to produce a catalog of predicted AMR alleles. These data were predicted under RGI's Perfect and Strict paradigms (see above), the former tracking perfect matches at the amino acid level to the curated reference sequences and mutations in the CARD, while the latter predicts previously unknown variants of known AMR genes, including secondary screen for key mutations. The reported results are entirely dependant upon the curated AMR detection models in CARD, the algorithms available in RGI, the pathogens sampled, and the sequence data available at NCBI at their time of generation. RGI bwt will indicate if the reference sequence for aligned reads is from the 'canonical' curated CARD reference sequences or from CARD's Resistomes & Variants, allowing users to know if the underlying reference is an in silico prediction or experimentally validated resistance gene.
Note: While CARD's Resistomes & Variants increases the allelic diversity of the reference data for non-clinical samples, it does so at the cost of inflating the allele network problem outlined above. Summarizing results at the level of AMR Gene Family may be more accurate than summarizing at the level of individual antibiotic resistance genes.
Note: As RGI bwt makes no assumptions about pre-processing of metagenomics data, we suggest prior quality/adaptor trimming of reads with skewer and deduplication of reads using dedupe.sh. If needed, down-sampling of FASTQ data can be performed using seqtk. Thanks to Allison Guitor of McMaster University for these suggestions.
rgi bwt -h
usage: rgi bwt [-h] -1 READ_ONE [-2 READ_TWO] [-a {kma,bowtie2,bwa}]
[-n THREADS] -o OUTPUT_FILE [--debug] [--clean] [--local]
[--include_wildcard] [--include_other_models] [--include_baits]
[--mapq MAPQ] [--mapped MAPPED] [--coverage COVERAGE]
Resistance Gene Identifier - 6.0.2 - BWT
Aligns metagenomic reads to CARD and wildCARD reference using kma, bowtie2 or bwa and provide reports.
optional arguments:
-h, --help show this help message and exit
-1 READ_ONE, --read_one READ_ONE
raw read one (qc and trimmed)
-2 READ_TWO, --read_two READ_TWO
raw read two (qc and trimmed)
-a {kma,bowtie2,bwa}, --aligner {kma,bowtie2,bwa}
select read aligner (default=kma)
-n THREADS, --threads THREADS
number of threads (CPUs) to use (default=16)
-o OUTPUT_FILE, --output_file OUTPUT_FILE
name of output filename(s)
--debug debug mode (default=False)
--clean removes temporary files (default=False)
--local use local database (default: uses database in executable directory)
--include_wildcard include wildcard (default=False)
--include_other_models
include protein variant, rRNA variant, knockout, and protein overexpression models (default=False)
--include_baits include baits (default=False)
--mapq MAPQ filter reads based on MAPQ score (default=False)
--mapped MAPPED filter reads based on mapped reads (default=False)
--coverage COVERAGE filter reads based on coverage of reference sequence
Note: The mapq, mapped, and coverage filters are planned features and do not yet work (but values are reported for manual filtering). Support for AMR bait capture methods (--include_baits) is forthcoming.
BWA usage within RGI bwt:
bwa mem -M -t {threads} {index_directory} {read_one} > {output_sam_file}
Bowtie2 usage within RGI bwt:
bowtie2 --very-sensitive-local --threads {threads} -x {index_directory} -U {unpaired_reads} -S {output_sam_file}
KMA usage within RGI bwt (default):
kma -mem_mode -ex_mode -1t1 -vcf -int {read_one} -t {threads} -t_db {index_directory} -o {output_sam_file}.temp -sam
If you have not already done so, you must load CARD reference data for these commands to work. First, remove any previous loads:
rgi clean --local
Download CARD data:
wget https://card.mcmaster.ca/latest/data tar -xvf data ./card.json
Load into local or working directory:
rgi load --card_json /path/to/card.json --local
Also pre-process these reference data for metagenomics reads (note that the filename card_database_v3.0.1.fasta depends on the version of CARD data downloaded, please adjust accordingly):
rgi card_annotation -i /path/to/card.json > card_annotation.log 2>&1 rgi load -i /path/to/card.json --card_annotation card_database_v3.0.1.fasta --local
As outlined above, metagenomics analyses may additionally include CARD's Resistomes & Variants protein homolog model reference data if desired. If you wish to include these reference data, additionally download the Resistomes & Variants (a.ka. WildCARD) data:
wget -O wildcard_data.tar.bz2 https://card.mcmaster.ca/latest/variants mkdir -p wildcard tar -xjf wildcard_data.tar.bz2 -C wildcard gunzip wildcard/*.gz
Pre-process the WildCARD reference data for metagenomics reads (note that the filenames wildcard_database_v3.0.2.fasta and card_database_v3.0.1.fasta plus the parameter version_number depend on the version of CARD data downloaded, please adjust accordingly):
rgi wildcard_annotation -i wildcard --card_json /path/to/card.json -v version_number > wildcard_annotation.log 2>&1 rgi load --wildcard_annotation wildcard_database_v3.0.2.fasta --card_json /path/to/card.json --wildcard_index /path/to/wildcard/index-for-model-sequences.txt --card_annotation card_database_v3.0.1.fasta --local
RGI will use FASTQ files as provided, be sure to include linker and quality trimming, plus sorting or any other needed pre-processing prior to using RGI (see suggestions above). Note: RGI bwt will assume unpaired reads unless the -2 flag is used. The examples below assume paired reads.
The default settings for RGI bwt will align reads using KMA against CARD's protein homolog models, i.e. reference sequences that do not require SNP mapping to predict resistance. The default uses only 'canonical' curated CARD reference sequences associated with the Antibiotic Resistance Ontology (i.e. sequences available in GenBank with clear experimental evidence of elevated MIC in a peer-reviewed journal available in PubMED):
rgi bwt --read_one /path/to/fastq/R1.fastq.gz --read_two /path/to/fastq/R2.fastq.gz --output_file output_prefix --local
The same analysis can be expanded to use multiple processors:
rgi bwt --read_one /path/to/fastq/R1.fastq.gz --read_two /path/to/fastq/R2.fastq.gz --output_file output_prefix --local -n 20
Although not recommended (see above), an alternate read aligner can be used:
rgi bwt --read_one /path/to/fastq/R1.fastq.gz --read_two /path/to/fastq/R2.fastq.gz --output_file output_prefix --local -n 20 -a bowtie2
RGI bwt can use an expanded reference set by aligning reads to both 'canonical' CARD and CARD's Resistomes & Variants WildCARD variants:
rgi bwt --read_one /path/to/fastq/R1.fastq.gz --read_two /path/to/fastq/R2.fastq.gz --output_file output_prefix --local -n 20 --include_wildcard
RGI bwt can also be used to align reads to CARD's protein homolog models plus protein variant models, rRNA mutation models, and protein over-expression models. As outlined above, the latter three model types require comparison to CARD's curated lists of mutations known to confer phenotypic antibiotic resistance to differentiate alleles conferring resistance from antibiotic susceptible alleles, but RGI bwt as of yet does not perform this comparison. Use these results with caution.
If you have not already done so, you must load CARD reference data for these commands to work. First, remove any previous loads:
rgi clean --local
Download CARD data:
wget https://card.mcmaster.ca/latest/data tar -xvf data ./card.json
Load into local or working directory:
rgi load --card_json /path/to/card.json --local
Also pre-process these reference data for metagenomics reads (note that the filename card_database_v3.0.1.fasta depends on the version of CARD data downloaded, please adjust accordingly). Note the use of the _all version of reference files when loading reference data for all model types:
rgi card_annotation -i /path/to/card.json > card_annotation.log 2>&1 rgi load -i /path/to/card.json --card_annotation_all_models card_database_v3.0.1_all.fasta --local
As outlined above, metagenomics analyses may additionally include CARD's Resistomes & Variants protein homolog model reference data if desired. If you wish to include these reference data, additionally download the Resistomes & Variants (a.ka. WildCARD) data:
wget -O wildcard_data.tar.bz2 https://card.mcmaster.ca/latest/variants mkdir -p wildcard tar -xjf wildcard_data.tar.bz2 -C wildcard gunzip wildcard/*.gz
Pre-process the WildCARD reference data for metagenomics reads (note that the filenames wildcard_database_v3.0.2.fasta and card_database_v3.0.1.fasta plus the paramater version_number depend on the version of CARD data downloaded, please adjust accordingly). Note the use of the _all version of reference files when loading reference data for all model types:
rgi wildcard_annotation -i wildcard --card_json /path/to/card.json -v version_number > wildcard_annotation.log 2>&1 rgi load --card_json /path/to/card.json --wildcard_annotation_all_models wildcard_database_v3.0.2_all.fasta --wildcard_index /path/to/wildcard/index-for-model-sequences.txt --card_annotation_all_models card_database_v3.0.1_all.fasta --local
RGI will use FASTQ files as provided, be sure to include linker and quality trimming, plus sorting or any other needed pre-processing prior to using RGI (see suggestions above). Note: RGI bwt will assume unpaired reads unless the -2 flag is used. The examples below assume paired reads.
The default settings for RGI bwt will align reads using KMA:
rgi bwt --read_one /path/to/fastq/R1.fastq.gz --read_two /path/to/fastq/R2.fastq.gz --output_file output_prefix --local --include_other_models
The same analysis can be expanded to use multiple processors:
rgi bwt --read_one /path/to/fastq/R1.fastq.gz --read_two /path/to/fastq/R2.fastq.gz --output_file output_prefix --local --include_other_models -n 20
Although not recommended (see above), an alternate read aligner can be used:
rgi bwt --read_one /path/to/fastq/R1.fastq.gz --read_two /path/to/fastq/R2.fastq.gz --output_file output_prefix --local --include_other_models -n 20 -a bowtie2
RGI bwt can use an expanded reference set by aligning reads to both 'canonical' CARD and CARD's Resistomes & Variants WildCARD variants:
rgi bwt --read_one /path/to/fastq/R1.fastq.gz --read_two /path/to/fastq/R2.fastq.gz --output_file output_prefix --local --include_other_models -n 20 --include_wildcard
RGI bwt aligns FASTQ reads to the AMR alleles used as reference sequences, with results provided for allele mapping and summarized at the AMR gene level (i.e. summing allele level results by gene). Five tab-delimited files are produced:
File | Contents |
---|---|
output_prefix.allele_mapping_data.txt | RGI bwt read mapping results at allele level |
output_prefix.gene_mapping_data.txt | RGI bwt read mapping results at gene level |
output_prefix.artifacts_mapping_stats.txt | Statistics for read mapping artifacts |
output_prefix.overall_mapping_stats.txt | Statistics for overall read mapping results |
output_prefix.reference_mapping_stats.txt | Statistics for reference matches |
Field | Contents |
---|---|
Reference Sequence | Reference allele to which reads have been mapped |
ARO Term | ARO Term |
ARO Accession | ARO Accession |
Reference Model Type | CARD detection model type |
Reference DB | Reference allele is from either CARD or WildCARD |
Reference Allele Source | See below |
Resistomes & Variants: Observed in Genome(s) | Has this allele sequence been observed in a CARD Prevalence genome sequence? |
Resistomes & Variants: Observed in Plasmid(s) | Has this allele sequence been observed in a CARD Prevalence plasmid sequence? |
Resistomes & Variants: Observed Pathogen(s) | CARD Prevalence pathogens bearing this allele sequence. If Reference DB is CARD, pathogen used as the reference in the CARD detection model will be shown. Use k-mers to verify pathogen-of-origin. |
Completely Mapped Reads | Number of reads mapped completely to allele |
Mapped Reads with Flanking Sequence | Number of reads mapped incompletely to allele |
All Mapped Reads | Sum of previous two columns |
Percent Coverage | Percent of reference allele covered by reads |
Length Coverage (bp) | Base pairs of reference allele covered by reads |
Average MAPQ (Completely Mapped Reads) | Average MAPQ value |
Mate Pair Linkage | For mate pair sequencing, if a sister read maps to a different AMR gene, this is listed |
Reference Length | Length (bp) of reference allele |
AMR Gene Family | ARO Categorization |
Drug Class | ARO Categorization |
Resistance Mechanism | ARO Categorization |
Depth | Depth of coverage (reported only when using KMA) |
SNPs | Single nucleotide polymorphisms observed from mapped reads (reported only when using KMA and with depth of at least 5). Not screened against curated SNPs in CARD. |
Consensus Sequence DNA | Nucleotide Consensus Sequence using mapped reads (reported only when using KMA and with depth of at least 5). |
Consensus Sequence Protein | Protein Consensus Sequence translated from DNA (reported only when using KMA and with depth of at least 5). |
Reference Allele Source:
Entries with CARD Curation are aligned to a reference allele from a published, characterized AMR gene, i.e. 'canonical CARD', and thus encode a 100% match to the reference protein sequence. Otherwise, entries will be reported as in silico allele predictions based on either Perfect or Strict RGI matches in CARD's Resistomes & Variants, with percent identity to the CARD reference protein reported. Matches with low values should be used with caution, as CARD's Resistomes & Variants has predicted a low identity AMR homolog.
Field | Contents |
---|---|
ARO Term | ARO Term |
ARO Accession | ARO Accession |
Reference Model Type | CARD detection model type |
Reference DB | Reference allele(s) are from CARD and/or WildCARD |
Alleles with Mapped Reads | # of alleles for this AMR gene with mapped reads |
Reference Allele(s) Identity to CARD Reference Protein | See below |
Resistomes & Variants: Observed in Genome(s) | Have these allele sequences been observed in a CARD Prevalence genome sequence? |
Resistomes & Variants: Observed in Plasmid(s) | Have these allele sequences been observed in a CARD Prevalence plasmid sequence? |
Resistomes & Variants: Observed Pathogen(s) | CARD Prevalence pathogens bearing this allele sequence. If Reference DB is CARD, pathogen used as the reference in the CARD detection model will be shown. Use k-mers to verify pathogen-of-origin. |
Completely Mapped Reads | Number of reads mapped completely to these alleles |
Mapped Reads with Flanking Sequence | Number of reads mapped incompletely to these alleles |
All Mapped Reads | Sum of previous two columns |
Average Percent Coverage | Average % of reference allele(s) covered by reads |
Average Length Coverage (bp) | Average bp of reference allele(s) covered by reads |
Average MAPQ (Completely Mapped Reads) | Statistics for reference matches |
Number of Mapped Baits | not yet supported |
Number of Mapped Baits with Reads | not yet supported |
Average Number of reads per Bait | not yet supported |
Number of reads per Bait Coefficient of Variation (%) | not yet supported |
Number of reads mapping to baits and mapping to complete gene | not yet supported |
Number of reads mapping to baits and mapping to complete gene (%) | not yet supported |
Mate Pair Linkage (# reads) | For mate pair sequencing, if a sister read maps to a different AMR gene, this is listed (# reads supporting linkage in parentheses) |
Reference Length | Length (bp) of reference sequences |
AMR Gene Family | ARO Categorization |
Drug Class | ARO Categorization |
Resistance Mechanism | ARO Categorization |
Reference Allele(s) Identity to CARD Reference Protein:
Gives range of Reference Allele Source values reported in the RGI bwt read mapping results at allele level, indicating the range of percent identity at the amino acid level of the encoded proteins to the corresponding CARD reference sequence. Matches with low values should be used with caution, as CARD's Resistomes & Variants has predicted a low identity AMR homolog.