Skip to content

MetQy functions and usage examples – Query functions

Andrea Martinez Vernon edited this page Aug 5, 2018 · 4 revisions

The query family of functions allow the user to query the KEGG data structures in a systematic (and automated) way. These functions rely on built-in formatted KEGG data (downloaded in February 2018) that is not directly accessible by the user, but feature optional arguments that allow users to provide up-to-date data (by using the parsing functions on KEGG FTP data) or their own data structures. Within user-provided data, advanced users can also define and evaluate custom-made KEGG-style modules. Additional query functions can be readily developed by the users, allowing expansion of MetQy. MetQy features five query functions for key functional analyses which are described below.

Use of KEGG data

The query functions use built-in KEGG data (downloaded 20/02/2018) which is hidden from the user, in compliance with the KEGG FTP licence. Users with FTP access can use the parsing functions to process the KEGG database files and to provide up-to-date information to the query functions.


query_genomes_to_modules

query_genomes_to_modules evaluates module completeness against a gene set and calculates the module completeness fraction (mcf). The input consists of specifying KEGG module(s) to evaluate across given genome(s), which can be specified by either (1) KEGG genome identifier(s) (T number(s) or 3-4 letter code(s)), (2) organism name(s), or (3) set(s) of genes (either K or Enzyme Commission (EC) numbers). In the first two cases, the built-in KEGG data are used to retrieve the gene content for the genome(s). Note that specifying the organism name may result in multiple genomes being matched and analyzed automatically. The main output of this function constitutes a matrix of the genome T number(s) (rows) and the module ID(s) matching the query (columns) containing the mcf.

While the implementation of the query_genomes_to_modules function is similar to KEGG mapper (a web interface tool that performs a similar task (Kanehisa 2017)), there are several key features that are different. First, the KEGG Mapper's web interface does not allow for module-specific evaluation nor for automation of the analysis. Our implementation allows for specific KEGG modules to be evaluated, given their ID, name and/or class. It also provides the capacity to determine the mcf of a module, rather than only identifying modules that are complete or that have one block missing. Finally, as EC numbers are widely used in systems biology, we used the KEGG orthology to translate the K number-based module definitions to EC number-based module definitions. This allows for module evaluation based on both K and EC numbers.

Usage examples

library(MetQy)

## USE T numbers
> T_NUMEBERS <- paste("T0000",1:5,sep="")
> OUT        <- query_genomes_to_modules(T_NUMEBERS,MODULE_ID = paste("M0000",1:5,sep=""),META_OUT = T, ADD_OUT = T)
> names(OUT)
[1] "MATRIX"           "QUERIES"          "METADATA"         "ADD_INFO"        "GENOME_INFO_DATA"

# Retrieve the mcf
> mcf <- OUT$MATRIX 
> mcf
          M00001 M00002    M00003    M00004 M00005
T00001 0.8888889      1 1.0000000 0.8571429      1
T00002 0.8888889      1 0.7142857 0.5714286      1
T00003 1.0000000      1 0.8571429 0.7142857      1
T00004 0.8888889      1 0.8571429 1.0000000      1
T00005 1.0000000      1 1.0000000 1.0000000      1

# Retrieve the module information (trimmed <...> for display purposes, only showing 6 of 8 columns)
> OUT$METADATA[1,1:6]
 MODULE_ID      MODULE_NAME                  NAME_SHORT            CLASS_I                CLASS_II    CLASS_III
    M00001   Glycolysis ...     Glycolysis - EM pathway     Pathway module    Carbohydrate and ...  Central ...

# Retrieve additional information on the mcf (FRACTION) and the number of blocks in the KEGG module definition
> OUT$ADD_INFO[1, 1:5]
 GENOME_ID MODULE_ID              NAME_SHORT  FRACTION nBLOCKS
    T00001    M00001 Glycolysis - EM pathway 0.8888889       9

# Retrieve genome information (relationship between T identifier, letter code identifier and name)
> OUT$GENOME_INFO_DATA
     ID ORG_ID                                    ORGANISM
 T00001    hin Haemophilus influenzae Rd KW20 (serotype d)
 T00002    mge                   Mycoplasma genitalium G37
 T00003    mja      Methanocaldococcus jannaschii DSM 2661
 T00004    syn                  Synechocystis sp. PCC 6803
 T00005    sce              Saccharomyces cerevisiae S288c

## USE SPECIES NAMES
> names      <- c("escherichia coli","heliobacter")
> OUT        <- query_genomes_to_modules(names,MODULE_ID = paste("M0000",1:5,sep=""),META_OUT = T, ADD_OUT = T)
> dim(OUT$MATRIX)
[1] 66  5  # 66 organisms matched the partial names provided and we specified 5 modules.

## USE USER-SPECIFIED GENE SETS
> data(data_example_multi_ECs_KOs) # load example data set - note that this entry does not exist in KEGG
> data_example_multi_ECs_KOs[1,] # trimmed for display purposes
     ID  ORG_ID  ORGANISM                        KOs                                 ECs
 T09999     aaa         A   K00013;K00014;K00018;...     1.1.1.1;1.1.1.100;1.1.1.130;...

> OUT        <- query_genomes_to_modules(data_example_multi_ECs_KOs,GENOME_ID_COL = "ID",GENES_COL = "KOs",
	                                         MODULE__ID = paste("M0000",1:5,sep=""),
	                                         META__OUT = T,ADD__OUT = T)

## Using EC numbers (less accurate, because KEGG modules are defined in terms of KEGG orthologs)
> OUT        <- query_genomes_to_modules(data_example_multi_ECs_KOs,GENOME_ID_COL = "ID",GENES_COL = "ECs",
	                                         MODULE__ID = paste("M0000",1:5,sep=""),
	                                         META__OUT = T,ADD__OUT = T)

query_modules_to_genomes

query_module_to_genomes determines the KEGG genome(s) that have user-specified module(s) that are complete above a mcf threshold (defaults to 1, i.e. complete). The output is a matrix with the module ID(s) (columns) and KEGG genome T number(s) (rows) containing the corresponding mcf.

Usage examples

> library(MetQy)

## Single module using threshold of 1 (default)
> genomes <- query_modules_to_genomes("M00001")
> genomes[1:5]
T00003 T00005 T00007 T00010 T00014 
     1      1      1      1      1 

## Multiple modules using set threshold (0.75 in this case)
> genomes <- query_modules_to_genomes(MODULE_ID = c("M00001","M00002"),threshold = 0.9)
> head(genomes)

query_genes_to_modules

query_gene_to_modules determines those KEGG modules that feature specific user-specified gene(s), defined by their KEGG ortholog or Enzyme Commission (EC) identifier. The output is a data frame containing the modules (and their description) that contain the specified gene.

Usage examples

> library(MetQy)

## Find the module ID(s) where the given KEGG ortholog or EC number are involved.
> modules <- query_genes_to_modules("K00844")
> modules
       M00001 M00549
K00844      1      1

> modules <- query_genes_to_modules("1.1.1.1")

query_genes_to_genomes

query_genes_to_genomes determines which KEGG genomes contain user-specified gene(s). The output is a binary matrix showing the presence/absence of each of the user-specified gene(s) (in rows) against the user-specified genome(s) (in columns).

Usage examples

> library(MetQy)

## Find the genomes that have been annotated with a KEGG ortholog
> genomes <- query_genes_to_genomes("K00844")
> genomes[,1:5]
       T00005 T00016 T00019 T00030 T00041
K00844      1      1      1      1      1

## Find the genomes that have been annotated with an EC number 
> genomes <- query_genes_to_genomes("1.1.1.1")
> genomes[,1:5]
        T00001 T00004 T00005 T00006 T00007
1.1.1.1      1      1      1      1      1

## Find the genomes that have been annotated with a KEGG ortholog
> genomes <- query_genes_to_genomes(genes = paste("K0000",1:3,sep=""))
> genomes[,1:5]
       T00003 T00004 T00005 T00008 T00009
K00001      0      0      0      0      0
K00002      0      1      1      0      0
K00003      1      1      1      1      1
# Boolean relationship, where 1 indicates that the genome (column) has been annotated with that KEGG ortholog (row) and 0 means that has not been. 

query_missingGenes_from_module

Given a genome (or set of genes), query_missingGenes_from_module determines the missing gene(s) (K or EC numbers) that would be required to have a complete KEGG module within that genome (or gene set). The user input to the function constitutes of either a KEGG genome identifier (T number or 3-4 letter code) or a set of genes (either K numbers or EC numbers). If a KEGG genome identifier is provided, the built-in KEGG data are used to retrieve the gene content for that genome. The function output consists of a data frame with (1) the KEGG module structure highlighting the missing genes within each block with an asterisk, (2) the absence or presence of each block with a binary indicator (see above for block presence/absence criteria) and (3) the list of missing genes.

Usage examples

> library(MetQy)

> data("data_example_KOnumbers_vector") # Load data
> OUT <- query_missingGenes_from_module(data_example_KOnumbers_vector,"M00010")
> OUT
 block_No PRESENT       BLOCK_DEF MISSING_GENES
        1       1          K01647              
        2       1   K01681|K01682              
        3       1 K00031|*K00030*                     
# KEGG ortholog K00030 is missing, but the block is still complete, so there are no genes missing (i.e. required for the module to be complete) in this example. 

> OUT <- query_missingGenes_from_module(data_example_KOnumbers_vector,"M00046")
> OUT
> OUT
 block_No PRESENT              BLOCK_DEF MISSING_GENES
        1       1 *K00207*|K17722&K17723              
        2       1                 K01464              
        3       0      *K01431*|*K06016* K01431;K06016
# There is one block missing in this example, which could be completed by having EITHER KOs K01431 OR K06016, according to the block definition. 

REFERENCES

Kanehisa, M. et al., 2017. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), pp.D353–D361.