-
Notifications
You must be signed in to change notification settings - Fork 9
MetQy functions and usage examples – Query functions
The query family of functions allow the user to query the KEGG data structures in a systematic (and automated) way. These functions rely on built-in formatted KEGG data (downloaded in February 2018) that is not directly accessible by the user, but feature optional arguments that allow users to provide up-to-date data (by using the parsing functions on KEGG FTP data) or their own data structures. Within user-provided data, advanced users can also define and evaluate custom-made KEGG-style modules. Additional query functions can be readily developed by the users, allowing expansion of MetQy. MetQy features five query functions for key functional analyses which are described below.
- query_genomes_to_modules
- query_modules_to_genomes
- query_genes_to_modules
- query_genes_to_modules
- query_genes_to_genomes
The query functions use built-in KEGG data (downloaded 20/02/2018) which is hidden from the user, in compliance with the KEGG FTP licence. Users with FTP access can use the parsing functions to process the KEGG database files and to provide up-to-date information to the query functions.
query_genomes_to_modules evaluates module completeness against a gene set and calculates the module completeness fraction (mcf). The input consists of specifying KEGG module(s) to evaluate across given genome(s), which can be specified by either (1) KEGG genome identifier(s) (T number(s) or 3-4 letter code(s)), (2) organism name(s), or (3) set(s) of genes (either K or Enzyme Commission (EC) numbers). In the first two cases, the built-in KEGG data are used to retrieve the gene content for the genome(s). Note that specifying the organism name may result in multiple genomes being matched and analyzed automatically. The main output of this function constitutes a matrix of the genome T number(s) (rows) and the module ID(s) matching the query (columns) containing the mcf.
While the implementation of the query_genomes_to_modules function is similar to KEGG mapper (a web interface tool that performs a similar task (Kanehisa 2017)), there are several key features that are different. First, the KEGG Mapper's web interface does not allow for module-specific evaluation nor for automation of the analysis. Our implementation allows for specific KEGG modules to be evaluated, given their ID, name and/or class. It also provides the capacity to determine the mcf of a module, rather than only identifying modules that are complete or that have one block missing. Finally, as EC numbers are widely used in systems biology, we used the KEGG orthology to translate the K number-based module definitions to EC number-based module definitions. This allows for module evaluation based on both K and EC numbers.
library(MetQy)
## USE T numbers
> T_NUMEBERS <- paste("T0000",1:5,sep="")
> OUT <- query_genomes_to_modules(T_NUMEBERS,MODULE_ID = paste("M0000",1:5,sep=""),META_OUT = T, ADD_OUT = T)
> names(OUT)
[1] "MATRIX" "QUERIES" "METADATA" "ADD_INFO" "GENOME_INFO_DATA"
# Retrieve the mcf
> mcf <- OUT$MATRIX
> mcf
M00001 M00002 M00003 M00004 M00005
T00001 0.8888889 1 1.0000000 0.8571429 1
T00002 0.8888889 1 0.7142857 0.5714286 1
T00003 1.0000000 1 0.8571429 0.7142857 1
T00004 0.8888889 1 0.8571429 1.0000000 1
T00005 1.0000000 1 1.0000000 1.0000000 1
# Retrieve the module information (trimmed <...> for display purposes, only showing 6 of 8 columns)
> OUT$METADATA[1,1:6]
MODULE_ID MODULE_NAME NAME_SHORT CLASS_I CLASS_II CLASS_III
M00001 Glycolysis ... Glycolysis - EM pathway Pathway module Carbohydrate and ... Central ...
# Retrieve additional information on the mcf (FRACTION) and the number of blocks in the KEGG module definition
> OUT$ADD_INFO[1, 1:5]
GENOME_ID MODULE_ID NAME_SHORT FRACTION nBLOCKS
T00001 M00001 Glycolysis - EM pathway 0.8888889 9
# Retrieve genome information (relationship between T identifier, letter code identifier and name)
> OUT$GENOME_INFO_DATA
ID ORG_ID ORGANISM
T00001 hin Haemophilus influenzae Rd KW20 (serotype d)
T00002 mge Mycoplasma genitalium G37
T00003 mja Methanocaldococcus jannaschii DSM 2661
T00004 syn Synechocystis sp. PCC 6803
T00005 sce Saccharomyces cerevisiae S288c
## USE SPECIES NAMES
> names <- c("escherichia coli","heliobacter")
> OUT <- query_genomes_to_modules(names,MODULE_ID = paste("M0000",1:5,sep=""),META_OUT = T, ADD_OUT = T)
> dim(OUT$MATRIX)
[1] 66 5 # 66 organisms matched the partial names provided and we specified 5 modules.
## USE USER-SPECIFIED GENE SETS
> data(data_example_multi_ECs_KOs) # load example data set - note that this entry does not exist in KEGG
> data_example_multi_ECs_KOs[1,] # trimmed for display purposes
ID ORG_ID ORGANISM KOs ECs
T09999 aaa A K00013;K00014;K00018;... 1.1.1.1;1.1.1.100;1.1.1.130;...
> OUT <- query_genomes_to_modules(data_example_multi_ECs_KOs,GENOME_ID_COL = "ID",GENES_COL = "KOs",
MODULE__ID = paste("M0000",1:5,sep=""),
META__OUT = T,ADD__OUT = T)
## Using EC numbers (less accurate, because KEGG modules are defined in terms of KEGG orthologs)
> OUT <- query_genomes_to_modules(data_example_multi_ECs_KOs,GENOME_ID_COL = "ID",GENES_COL = "ECs",
MODULE__ID = paste("M0000",1:5,sep=""),
META__OUT = T,ADD__OUT = T)
query_module_to_genomes determines the KEGG genome(s) that have user-specified module(s) that are complete above a mcf threshold (defaults to 1, i.e. complete). The output is a matrix with the module ID(s) (columns) and KEGG genome T number(s) (rows) containing the corresponding mcf.
> library(MetQy)
## Single module using threshold of 1 (default)
> genomes <- query_modules_to_genomes("M00001")
> genomes[1:5]
T00003 T00005 T00007 T00010 T00014
1 1 1 1 1
## Multiple modules using set threshold (0.75 in this case)
> genomes <- query_modules_to_genomes(MODULE_ID = c("M00001","M00002"),threshold = 0.9)
> head(genomes)
query_gene_to_modules determines those KEGG modules that feature specific user-specified gene(s), defined by their KEGG ortholog or Enzyme Commission (EC) identifier. The output is a data frame containing the modules (and their description) that contain the specified gene.
> library(MetQy)
## Find the module ID(s) where the given KEGG ortholog or EC number are involved.
> modules <- query_genes_to_modules("K00844")
> modules
M00001 M00549
K00844 1 1
> modules <- query_genes_to_modules("1.1.1.1")
query_genes_to_genomes determines which KEGG genomes contain user-specified gene(s). The output is a binary matrix showing the presence/absence of each of the user-specified gene(s) (in rows) against the user-specified genome(s) (in columns).
> library(MetQy)
## Find the genomes that have been annotated with a KEGG ortholog
> genomes <- query_genes_to_genomes("K00844")
> genomes[,1:5]
T00005 T00016 T00019 T00030 T00041
K00844 1 1 1 1 1
## Find the genomes that have been annotated with an EC number
> genomes <- query_genes_to_genomes("1.1.1.1")
> genomes[,1:5]
T00001 T00004 T00005 T00006 T00007
1.1.1.1 1 1 1 1 1
## Find the genomes that have been annotated with a KEGG ortholog
> genomes <- query_genes_to_genomes(genes = paste("K0000",1:3,sep=""))
> genomes[,1:5]
T00003 T00004 T00005 T00008 T00009
K00001 0 0 0 0 0
K00002 0 1 1 0 0
K00003 1 1 1 1 1
# Boolean relationship, where 1 indicates that the genome (column) has been annotated with that KEGG ortholog (row) and 0 means that has not been.
Given a genome (or set of genes), query_missingGenes_from_module determines the missing gene(s) (K or EC numbers) that would be required to have a complete KEGG module within that genome (or gene set). The user input to the function constitutes of either a KEGG genome identifier (T number or 3-4 letter code) or a set of genes (either K numbers or EC numbers). If a KEGG genome identifier is provided, the built-in KEGG data are used to retrieve the gene content for that genome. The function output consists of a data frame with (1) the KEGG module structure highlighting the missing genes within each block with an asterisk, (2) the absence or presence of each block with a binary indicator (see above for block presence/absence criteria) and (3) the list of missing genes.
> library(MetQy)
> data("data_example_KOnumbers_vector") # Load data
> OUT <- query_missingGenes_from_module(data_example_KOnumbers_vector,"M00010")
> OUT
block_No PRESENT BLOCK_DEF MISSING_GENES
1 1 K01647
2 1 K01681|K01682
3 1 K00031|*K00030*
# KEGG ortholog K00030 is missing, but the block is still complete, so there are no genes missing (i.e. required for the module to be complete) in this example.
> OUT <- query_missingGenes_from_module(data_example_KOnumbers_vector,"M00046")
> OUT
> OUT
block_No PRESENT BLOCK_DEF MISSING_GENES
1 1 *K00207*|K17722&K17723
2 1 K01464
3 0 *K01431*|*K06016* K01431;K06016
# There is one block missing in this example, which could be completed by having EITHER KOs K01431 OR K06016, according to the block definition.
Kanehisa, M. et al., 2017. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), pp.D353–D361.