Skip to content

Latest commit

 

History

History
308 lines (237 loc) · 12.7 KB

README.md

File metadata and controls

308 lines (237 loc) · 12.7 KB

MLSTar

This R package allows you to easily determine the Multi Locus Sequence Type (MLST) of your genomes. It also works as an interface between PubMLST through their RESTful API, so you don't have to bother downloading and collecting files: the application does it automatically. Today the package works in Unix-based systems.

Note: cgMLST and wgMLST are not yet fully supported since the RESTful API interphase for theses cases is a little different than the one for the standard 7 loci schemes. See issues#11 for more details.

Citation

If you use MLSTar in your publications, please cite:

Ferrés I, Iraola G. MLSTar: automatic multilocus sequence typing of bacterial genomes in R. PeerJ. 2018;6 https://doi.org/10.7717/peerj.5098.

Installation

The easiest way to install this package is using devtools:

devtools::install_github('iferres/MLSTar')

Requirements

MLSTar depends on BLAST+ software, and must be installed in your $PATH prior to run this package.

Note: There seems to be incompatibility with MacOS X Mavericks 10.9.5. It was tested and works on MacOS>10.11.2 El Capitan.

Standard workflow

The first step in your analysis should be to check the in pubmlst.org database if your organism of interest is available. So, first load the package and then run listPubmlst_orgs() function, printing only the first 50 elements:

library(MLSTar)
## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## Loading required package: ape

## 
## Attaching package: 'ape'

## The following objects are masked from 'package:igraph':
## 
##     edges, mst, ring
listPubmlst_orgs()[1:50]
##  [1] "achromobacter"           "abaumannii"             
##  [3] "aeromonas"               "aphagocytophilum"       
##  [5] "arcobacter"              "bcereus"                
##  [7] "blicheniformis"          "bsubtilis"              
##  [9] "bbacilliformis"          "bhenselae"              
## [11] "bordetella"              "borrelia"               
## [13] "brachyspira"             "brucella"               
## [15] "bcc"                     "bpseudomallei"          
## [17] "campylobacter_nonjejuni" "campylobacter"          
## [19] "calbicans"               "cglabrata"              
## [21] "ctropicalis"             "cmaltaromaticum"        
## [23] "chlamydiales"            "cfreundii"              
## [25] "csinensis"               "cbotulinum"             
## [27] "cdifficile"              "csepticum"              
## [29] "cdiphtheriae"            "cronobacter"            
## [31] "dnodosus"                "edwardsiella"           
## [33] "ecloacae"                "efaecalis"              
## [35] "efaecium"                "fpsychrophilum"         
## [37] "gallibacterium"          "hinfluenzae"            
## [39] "hparasuis"               "hcinaedi"               
## [41] "helicobacter"            "hsuis"                  
## [43] "kaerogenes"              "koxytoca"               
## [45] "kseptempunctata"         "lsalivarius"            
## [47] "leptospira"              "mcanis"                 
## [49] "mcaseolyticus"           "mplutonius"

Lets say we are interested in Leptospira genus, which is in the place 47 (place may change over time) in the list above. So:

lst <- listPubmlst_orgs() 
lst[47]
## [1] "leptospira"

Now, lets check for available MLST schemes for this organism:

listPubmlst_schemes(org = lst[47])
## $scheme_1
## [1] "glmU_1" "pntA_1" "sucA_1" "tpiA_1" "pfkB_1" "mreA_1" "caiB_1"
## attr(,"Desc")
## [1] "MLST (scheme 1)"
## 
## $scheme_2
## [1] "adk_2"    "glmU_2"   "icdA_2"   "lipL32_2" "lipL41_2" "mreA_2"  
## [7] "pntA_2"  
## attr(,"Desc")
## [1] "MLST (scheme 2)"
## 
## $scheme_3
## [1] "adk_3"    "icdA_3"   "lipL32_3" "lipL41_3" "rrs2_3"   "secY_3"  
## attr(,"Desc")
## [1] "MLST (scheme 3)"

As you can see, listPubmlst_schemes return a list with the loci names corresponding to each scheme. As an attribute of each list element there is information about each mlst scheme.

Now you can choose between two ways: the easy way and the hard way.

The hard way implies calling downloadPubmlst_seq(org = lst[43], scheme = 1) and then downloadPubmlst_profile(org = lst[43], scheme = 1) functions included in this package to download the scheme fasta files and the profile tab file for the organism and the scheme of interest, and then passing the files to the subsequent doMLST() function to schemeFastas and schemeProfile arguments.

The easy way is to left those arguments NULL (default), and let the doMLST() function do it for you.

Let see an example with toy data attached on this package:

#First we list the atteched tar.gz file
tgz <- system.file('extdata', 'example.tar.gz', package = 'MLSTar')
genomes <- untar(tarfile = tgz, list = T)
#Decompress them
untar(tarfile = tgz,exdir = tempdir())
genomes <- list.files(path = tempdir(), 
                      pattern = paste0(genomes, collapse = '|'), 
                      full.names = TRUE)
genomes
##  [1] "/tmp/RtmpksRhvS/1049762.3.fasta" "/tmp/RtmpksRhvS/1049765.3.fasta"
##  [3] "/tmp/RtmpksRhvS/1049766.3.fasta" "/tmp/RtmpksRhvS/1049773.3.fasta"
##  [5] "/tmp/RtmpksRhvS/1049780.3.fasta" "/tmp/RtmpksRhvS/1049781.3.fasta"
##  [7] "/tmp/RtmpksRhvS/1218567.3.fasta" "/tmp/RtmpksRhvS/174.14.fasta"   
##  [9] "/tmp/RtmpksRhvS/174.15.fasta"    "/tmp/RtmpksRhvS/174.17.fasta"

In this example we have 10 pathogenic Leptospira borgpetersenii genomes(** * ), in fasta format. ( * **: Because of portability, just the corresponding alleles of each genomes are written in the fasta files for the scheme 1, and not the whole genomes. The purpose is to show the functions and not to provide a real case example.)

Lets determine the MLST for the scheme 1.

x <- doMLST(
  infiles = genomes, # The fasta files
  org = lst[47], # The organism, in this case is "leptospira"
  scheme = 1, # Scheme id number
  write = "none", # Don't write fasta files for alleles found
  ddir = paste0(tempdir(),'pubmlst_lep_sch1', collapse = '/') # Write downloaded sequences and profiles here.
  )   
## Downloading leptospira scheme 1 MLST sequences at /tmp/RtmpksRhvSpubmlst_lep_sch1// .
## Downloading leptospira scheme 1 MLST profile at /tmp/RtmpksRhvSpubmlst_lep_sch1// .
## Making BLAST databases... DONE!
## Running BLASTN... DONE!
## Checking if new alleles are equal... DONE!
x
## MLST
##  10 isolates
##  organism: leptospira
##  scheme: 1

The result is an object of class "mlst" which is a list of 2 data.frames and a series of attributes. The first data.frame show the result for this set of genomes, and the second is the MLST profile used:

# Attributes
attributes(x)
## $names
## [1] "result"  "profile"
## 
## $infiles
##  [1] "1049762.3.fasta" "1049765.3.fasta" "1049766.3.fasta"
##  [4] "1049773.3.fasta" "1049780.3.fasta" "1049781.3.fasta"
##  [7] "1218567.3.fasta" "174.14.fasta"    "174.15.fasta"   
## [10] "174.17.fasta"   
## 
## $org
## [1] "leptospira"
## 
## $scheme
## [1] 1
## 
## $write
## [1] "none"
## 
## $pid
## [1] 90
## 
## $scov
## [1] 0.9
## 
## $class
## [1] "mlst"
# Result
x$result
##           glmU_1 pntA_1 sucA_1 tpiA_1 pfkB_1 mreA_1 caiB_1  ST
## 1049762.3     24     32     30     36     67     26     12 149
## 1049765.3     24     27     30     34     67     27     11 143
## 1049766.3     24     27     30     34     67     27     51  NA
## 1049773.3     24     28     35     34     37     27     28 154
## 1049780.3     24     28     36     34     37     27     28 155
## 1049781.3     24     32     30     36     67     26     12 149
## 1218567.3     26     30     28     35     39     29     29 152
## 174.14        24     28     26     34     37     27     28 147
## 174.15        24     28     30     34     37     27     28 148
## 174.17        24     27     30     34     67     u1     28  NA
# Profile (leptospira, scheme 1)
head(x$profile)
##   glmU_1 pntA_1 sucA_1 tpiA_1 pfkB_1 mreA_1 caiB_1 ST
## 1      1      1      1      1      1      1      1  1
## 2      1      1      1      1      7      3     32  2
## 3      1      1      1      1     14     15      7  3
## 4      1      1      1      2      7      2      9  4
## 5      1      1      1      2     16      2     18  5
## 6      1      1      1      4     29      4      1  6

As you can see, each row is a genome, and each column is a scheme locus. The number refers to the allele number id.

A "u" plus an integer in the result data.frame means that a new allele was found, e.g. the mreA_1 gene in the last genome: this allele has not been yet reported in the pubmlst database. New alleles are named according to whether are the same or not, for example, if the same mreA_1 allele had been found in other genome, we would see another "u1" in the correspondig cell of the data.frame. If a different, but not reported nethier, allele had been found, we will see a "u1" in one of the cell, and a "u2" in the other. If option write is set to "new", then a fasta file is written with this new allele. If this option is set to "all", all alleles are written (both reported and not reported at pubmlst.org).

A <NA> means that no allele was found, i.e. no blastn local alignment pass the inclusion threshold (by default, this threshold are a percentage identity grater or equal to 90, and a subject coverage greater or equal to 0.9). In this example this was no the case for any of the screened genomes.

The last column refers to the Sequence Type (ST). If possible, the function identifies the ST of each genome, otherwise a NA is returned.

Minimum Spanning Tree

The mlst class defined in this package include a plot method which uses APE for compute a minimum spanning tree (mst), and igraph to build and object of class igraph and to plot it. In this case red nodes corresponds to the pubmlst reported STs, blue nodes are the ones detected, and white ones are those for which no ST was possible to assign.

set.seed(4)
plot(x)

The function plot returns invisibly an object of class igraph which can be further analized using that package.

set.seed(4)
g <- plot(x, plot = FALSE)
g
## IGRAPH 266becf UN-- 269 268 -- 
## + attr: name (v/c), color (v/c)
## + edges from 266becf (vertex names):
##  [1] 1049762.3--1049765.3 1049762.3--1049781.3 1049762.3--1        
##  [4] 1049762.3--94        1049762.3--142       1049762.3--149      
##  [7] 1049762.3--158       1049762.3--163       1049762.3--167      
## [10] 1049762.3--168       1049762.3--172       1049762.3--185      
## [13] 1049762.3--191       1049762.3--207       1049762.3--244      
## [16] 1049762.3--245       1049765.3--1049766.3 1049765.3--143      
## [19] 1049765.3--144       1049765.3--209       1049773.3--174.15   
## [22] 1049773.3--154       1049780.3--174.15    1049780.3--155      
## + ... omitted several edges

Beware with plotting the whole profile: a extensive MLST profile with, for instance, a model organism could consume a lot of resources and take a long time. The distance and mst computation scale exponentially with the number of elements. In this cases you can choose to plot just your isolates:

plot(x, what = 'result')

Binary Tree

The mst is the default plot method, but a binary tree can also be plotted, and in this case an object of class phylo (see APE package) is returned invisibly:

layout(cbind(1,2))
plot(x, type = 'phylo')
plot(x, type = 'phylo', 
     plot.phylo.args = list(type = 'fan'), 
     tiplabels.args = list(offset = 0.1))
     

See ?plot.mlst for more information about the plotting methods. New methods will be added in the future. Any suggestions are welcome.