v2 Data Sources
Release: v2.0.0
Data Access: https://zenodo.org/doi/10.5281/zenodo.7030039
Dependencies:
- Data_Preparation.ipynb documents the creation of all generated data
- Ontology_Cleaning.ipynb documents all ontology cleaning and preprocessing
Rationale: The goal of this build was to create a knowledge graph that represented human disease mechanisms and included the central dogma. The data sources utilized in this release include many of the sources used in the initial release, as well as some new data made available by the Comparative Toxicogenomics Database and experimental data from the Human Protein Atlas.
- Cell Ontology
- Cell Line Ontology
- Chemical Entities of Biological Interest (ChEBI) Ontology
- Gene Ontology
- Human Phenotype Ontology
- Mondo Disease Ontology
- Pathway Ontology
- Protein Ontology
- Relations Ontology
- Sequence Ontology
- Uber-Anatomy Ontology
- Vaccine Ontology
Cell Ontology
Homepage: GitHub
Citation:
Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biology. 2005;6(2):R21
Usage: Utilized to connect transcripts and proteins to cells. Additionally, the edges between this ontology and its dependencies are utilized.
Cell Line Ontology
Homepage: http://www.clo-ontology.org/
Citation:
Sarntivijai S, Lin Y, Xiang Z, Meehan TF, Diehl AD, Vempati UD, Schürer SC, Pang C, Malone J, Parkinson H, Liu Y. CLO: the cell line ontology. Journal of Biomedical Semantics. 2014;5(1):37
Usage: Utilized to map cell lines to transcripts and proteins. Additionally, the edges between this ontology and its dependencies are utilized.
Chemical Entities of Biological Interest (ChEBI) Ontology
Homepage: https://www.ebi.ac.uk/chebi/
Citation:
Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research. 2015;44(D1):D1214-9
Usage: Utilized to connect chemicals to complexes, diseases, genes, GO biological processes, GO cellular components, GO molecular functions, pathways, phenotypes, reactions, and transcripts.
Gene Ontology
Homepage: http://geneontology.org/
Citations:
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25
The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research. 2018;47(D1):D330-8
Usage: Utilized to connect biological processes, cellular components, and molecular functions to chemicals, pathways, and proteins. Additionally, the edges between this ontology and its dependencies are utilized.
Other Gene Ontology Data Used: goa_human.gaf.gz
Human Phenotype Ontology
Homepage: https://hpo.jax.org/
Citation:
Köhler S, Carmody L, Vasilevsky N, Jacobsen JO, Danis D, Gourdine JP, Gargano M, Harris NL, Matentzoglu N, McMurry JA, Osumi-Sutherland D. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Research. 2018;47(D1):D1018-27
Usage: Utilized to connect phenotypes to chemicals, diseases, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized.
Files
- Other Human Phenotype Ontology Data Used: phenotype.hpoa
Mondo Disease Ontology
Homepage: https://mondo.monarchinitiative.org/
Citation:
Mungall CJ, McMurry JA, Köhler S, Balhoff JP, Borromeo C, Brush M, Carbon S, Conlin T, Dunn N, Engelstad M, Foster E. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Research. 2017;45(D1):D712-22
Usage: Utilized to connect diseases to chemicals, phenotypes, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized.
Pathway Ontology
Homepage: rgd.mcw.edu
Citation:
Petri V, Jayaraman P, Tutaj M, Hayman GT, Smith JR, De Pons J, Laulederkind SJ, Lowry TF, Nigam R, Wang SJ, Shimoyama M. The pathway ontology–updates and applications. Journal of Biomedical Semantics. 2014;5(1):7.
Usage: Utilized to connect pathways to GO biological processes, GO cellular components, GO molecular functions, and Reactome pathways. Several steps are taken to connect Pathway Ontology identifiers to Reactome pathways and GO biological processes. To connect Pathway Ontology identifiers to Reactome pathways, we use the ComPath Pathway Database Mappings developed by Daniel Domingo-Fernández (PMID:30564458).
Files
- Downloaded Mapping Data
- Generated Mapping Data: REACTOME_PW_GO_MAPPINGS.txt
Protein Ontology
Homepage: https://proconsortium.org/
Citation:
Natale DA, Arighi CN, Barker WC, Blake JA, Bult CJ, Caudy M, Drabkin HJ, D’Eustachio P, Evsikov AV, Huang H, Nchoutmboube J. The Protein Ontology: a structured representation of protein forms and complexes. Nucleic Acids Research. 2010;39(suppl_1):D539-45
Usage: Utilized to connect proteins to chemicals, genes, anatomy, catalysts, cell lines, cofactors, complexes, GO biological processes, GO cellular components, GO molecular functions, pathways, proteins, reactions, and transcripts. Additionally, the edges between this ontology and its dependencies are utilized.
Notes: A partial, human-only version of this ontology was used. Details on how this version of the ontology was generated can be found under the Protein Ontology section of the Data_Preparation.ipynb Jupyter Notebook.
Files
- Generated Human Version Protein Ontology (PRO): human_pro.owl
- Other PRO Data Used: promapping.txt
- Generated Mapping Data
  - Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
  - Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt
  - Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
  - UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt
  - STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt
Relations Ontology
Homepage: GitHub
Citation:
Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C. Relations in biomedical ontologies. Genome Biology. 2005;6(5):R46.
Usage: Utilized to connect all data sources in the knowledge graph. Additionally, the ontology is queried prior to building the knowledge graph to identify all relations, their inverse properties, and their labels.
Files
- Generated RO Data
  - INVERSE_RELATIONS.txt
  - RELATIONS_LABELS.txt
Sequence Ontology
Homepage: GitHub
Citation:
Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology. 2005;6(5):R44
Usage: Utilized to connect transcripts and other genomic material like genes and variants.
Files
- Generated Mapping Data
  - genomic_sequence_ontology_mappings.xlsx
  - SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt
Uber-Anatomy Ontology
Homepage: GitHub
Citation:
Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biology. 2012;13(1):R5
Usage: Utilized to connect tissues, fluids, and cells to proteins and transcripts. Additionally, the edges between this ontology and its dependencies are utilized.
Vaccine Ontology
Homepage: http://www.violinet.org/vaccineontology/
Citations:
He Y, Racz R, Sayers S, Lin Y, Todd T, Hur J, Li X, Patel M, Zhao B, Chung M, Ostrow J. Updates on the web-based VIOLIN vaccine database and analysis system. Nucleic Acids Research. 2013;42(D1):D1124-32
Xiang Z, Todd T, Ku KP, Kovacic BL, Larson CB, Chen F, Hodges AP, Tian Y, Olenzek EA, Zhao B, Colby LA. VIOLIN: vaccine investigation and online information network. Nucleic Acids Research. 2007;36(suppl_1):D923-8
Usage: Utilized the edges between this ontology and its dependencies.
- BioPortal
- ClinVar
- Comparative Toxicogenomics Database
- DisGeNET
- Ensembl
- GeneMANIA
- Genotype-Tissue Expression Project
- Human Genome Organisation Gene Nomenclature Committee
- Human Protein Atlas
- National Center for Biotechnology Information Gene
- Reactome Pathway Database
- Search Tool for Recurring Instances of Neighbouring Genes Database
- Universal Protein Resource Knowledgebase
BioPortal
Homepage: BioPortal
Citation:
BioPortal. Lexical OWL Ontology Matcher (LOOM)
Ghazvinian A, Noy NF, Musen MA. Creating mappings for ontologies in biomedicine: simple methods work. In AMIA Annual Symposium Proceedings 2009 (Vol. 2009, p. 198). American Medical Informatics Association
Usage: BioPortal was utilized to obtain mappings between MeSH identifiers and ChEBI identifiers for chemicals-diseases, chemicals-genes, chemicals-GO biological processes, chemicals-GO cellular components, chemicals-GO molecular functions, chemicals-phenotypes, chemicals-proteins, and chemicals-transcripts edges. Additional information on how this data was processed can be obtained from the NCBO_rest_api.py GitHub Gist script.
⭐ ALTERNATIVE METHOD ⭐ Since the above approach can take over two days to process, we have developed an alternative solution which downloads the mesh2021.nt data file directly from MeSH and the Flat_file_tab_delimited/names.tsv.gz file directly from ChEBI. Using these files, we have recapitulated the LOOM algorithm implemented by BioPortal when creating mappings between these resources. The procedure is relatively straightforward and utilizes the following information from each resource:
- For all MeSH SCR Chemicals, obtain the following information:
  - Identifiers: MeSH identifiers
  - Labels: string labels using the RDFS:label property
  - Synonyms: track down all synonyms using the vocab:concept and vocab:preferredConcept object properties
- For all ChEBI classes, obtain the following information:
  - Labels: string labels using the RDFS:label property
  - Synonyms: track down all synonyms using all synonym properties
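The label and synonym matching described above can be sketched as follows. This is a minimal illustration and not the project's implementation: it assumes that LOOM-style matching amounts to comparing normalized strings (lowercased, with non-alphanumeric characters removed) and skipping very short strings. The function names, the three-character cutoff, and the toy identifiers are all hypothetical.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and drop every non-alphanumeric character (assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_index(records):
    """Map each normalized label or synonym to the identifiers that carry it."""
    index = defaultdict(set)
    for identifier, names in records.items():
        for name in names:
            key = normalize(name)
            if len(key) >= 3:  # assumed cutoff: very short strings match too easily
                index[key].add(identifier)
    return index


def lexical_map(mesh, chebi):
    """Pair MeSH and ChEBI identifiers that share at least one normalized string."""
    mesh_index, chebi_index = build_index(mesh), build_index(chebi)
    pairs = set()
    for key in mesh_index.keys() & chebi_index.keys():
        for mesh_id in mesh_index[key]:
            for chebi_id in chebi_index[key]:
                pairs.add((mesh_id, chebi_id))
    return sorted(pairs)


# Hypothetical toy records: identifier -> labels and synonyms.
mesh = {"MESH_A": ["Acetylsalicylic Acid", "aspirin"], "MESH_B": ["caffeine"]}
chebi = {"CHEBI_A": ["acetylsalicylic acid"], "CHEBI_B": ["Caffeine", "1,3,7-trimethylxanthine"]}
mappings = lexical_map(mesh, chebi)
```

Indexing both resources by normalized string keeps the comparison close to linear in the number of labels and synonyms, rather than comparing every MeSH string against every ChEBI string.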
Files
- Generated Data: MESH_CHEBI_MAP.txt
ClinVar
Homepage: https://www.ncbi.nlm.nih.gov/clinvar/
Citation:
Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, Karapetyan K. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2017;46(D1):D1062-7
Usage: ClinVar was utilized to create variant-gene, variant-disease, and variant-phenotype edges. The original data was filtered such that only records meeting the following criteria were included:
- Assembly = "GRCh38"
- ClinSigSimple = 1
  - 1 = at least one current record submitted with an interpretation of Likely pathogenic or Pathogenic (independent of whether that record includes assertion criteria and evidence)
- ReviewStatus in ["criteria provided, multiple submitters, no conflicts", "reviewed by expert panel", "practice guideline"]
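The three ClinVar criteria above can be expressed as a single row-level check. The sketch below is illustrative only: it assumes a tab-delimited file whose header includes the Assembly, ClinSigSimple, and ReviewStatus columns named above, and the AlleleID column and sample rows are hypothetical.

```python
import csv
import io

KEEP_REVIEW_STATUS = {
    "criteria provided, multiple submitters, no conflicts",
    "reviewed by expert panel",
    "practice guideline",
}


def keep_record(row):
    """Return True when a row meets all three criteria listed above."""
    return (
        row["Assembly"] == "GRCh38"
        and row["ClinSigSimple"] == "1"  # values are read from text, so compare strings
        and row["ReviewStatus"] in KEEP_REVIEW_STATUS
    )


# Hypothetical tab-delimited sample; a real file would carry many more columns.
sample = (
    "AlleleID\tAssembly\tClinSigSimple\tReviewStatus\n"
    "1\tGRCh38\t1\treviewed by expert panel\n"
    "2\tGRCh37\t1\treviewed by expert panel\n"
    "3\tGRCh38\t0\tpractice guideline\n"
    "4\tGRCh38\t1\tno assertion criteria provided\n"
)
rows = csv.DictReader(io.StringIO(sample), delimiter="\t")
kept = [row["AlleleID"] for row in rows if keep_record(row)]
```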
Files
- Downloaded Data
- Generated Edge Data: CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt
Comparative Toxicogenomics Database
Homepage: http://ctdbase.org/
Citations:
Curated [chemical–gene interactions|chemical-go interactions|chemical–disease interactions|gene–pathway interactions] data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. World Wide Web (URL: http://ctdbase.org/)
Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ. The comparative toxicogenomics database: update 2019. Nucleic Acids Research. 2018;47(D1):D948-54
Usage: Comparative Toxicogenomics Database (CTD) was utilized to create chemical-disease, chemical-gene, chemical-GO biological process, chemical-GO cellular components, chemical-GO molecular functions, chemical-phenotype, chemical-protein, chemical-rna, and gene-pathway edges. The original data was filtered such that only records meeting the following criteria were included:
- chemical-disease: DirectEvidence != ""
- chemical-gene: Organism == "Homo sapiens", GeneForms == "gene", and "affects" not in InteractionActions
- chemical-GO biological process: PhenotypeName == "Biological Process" and Interaction <= "1.04e-47" (10th percentile)
- chemical-GO cellular components: PhenotypeName == "Cellular Component" and Interaction <= "1.04e-47" (10th percentile)
- chemical-GO molecular functions: PhenotypeName == "Molecular Function" and Interaction <= "1.04e-47" (10th percentile)
- chemical-phenotype: DirectEvidence != ""
- chemical-protein: Organism == "Homo sapiens", GeneForms == "protein", and "affects" not in InteractionActions
- chemical-rna: Organism == "Homo sapiens", GeneForms == "mRNA", and "affects" and "activity" not in InteractionActions
- gene-pathway edges: PathwayName == R-HSA-
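Two of the CTD criteria above are sketched below as row-level predicates. This is a hedged illustration rather than the project's code: it assumes that "not in InteractionActions" is a substring test on that field and that the Interaction value is compared numerically. The function names and sample rows are hypothetical.

```python
def keep_chemical_gene(row):
    """chemical-gene rule: human, gene form, and no "affects" interaction action."""
    return (
        row["Organism"] == "Homo sapiens"
        and row["GeneForms"] == "gene"
        and "affects" not in row["InteractionActions"]  # assumed substring test
    )


def keep_chemical_go(row, phenotype_name, threshold=1.04e-47):
    """chemical-GO rule: matching GO branch and a value at or below the threshold."""
    return (
        row["PhenotypeName"] == phenotype_name
        and float(row["Interaction"]) <= threshold
    )


# Hypothetical rows using the column names from the criteria above.
gene_rows = [
    {"Organism": "Homo sapiens", "GeneForms": "gene", "InteractionActions": "increases^expression"},
    {"Organism": "Homo sapiens", "GeneForms": "gene", "InteractionActions": "affects^binding"},
    {"Organism": "Mus musculus", "GeneForms": "gene", "InteractionActions": "increases^expression"},
    {"Organism": "Homo sapiens", "GeneForms": "protein", "InteractionActions": "decreases^activity"},
]
kept_gene_rows = [row for row in gene_rows if keep_chemical_gene(row)]

go_rows = [
    {"PhenotypeName": "Biological Process", "Interaction": "3.2e-60"},
    {"PhenotypeName": "Biological Process", "Interaction": "1.0e-10"},
    {"PhenotypeName": "Cellular Component", "Interaction": "3.2e-60"},
]
kept_go_rows = [row for row in go_rows if keep_chemical_go(row, "Biological Process")]
```

The chemical-protein and chemical-rna rules follow the same pattern with a different GeneForms value and excluded action terms.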
Files
- Downloaded Data
  - Chemical-Gene Relations: CTD_chem_gene_ixns.tsv.gz
  - Chemical-Disease/Phenotype Relations: CTD_chemicals_diseases.tsv.gz
  - Chemical-GO Relations: CTD_chem_go_enriched.tsv.gz
  - Gene-Pathway Relations: CTD_genes_pathways.tsv.gz
DisGeNET
Homepage: https://www.disgenet.org/
Citation:
Gene-disease association data retrieved from DisGeNET v6.0 (http://www.disgenet.org/), Integrative Biomedical Informatics Group GRIB/IMIM/UPF. [December, 2019].
Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research. 2019.
Usage: DisGeNET was utilized to create gene-disease and gene-phenotype edges. The original data was filtered such that only records meeting the following criteria were included: EI >= "1.0" (90th percentile). Additionally, data from this source was used to create mappings between different types of disease and phenotype identifiers, including:
- OMIM, ORPHA, UMLS, ICD ➞ DOID
- OMIM, ORPHA, UMLS, ICD ➞ HPO
Files
- Downloaded Data
  - Disease/Phenotype-Gene Relations: curated_gene_disease_associations.tsv.gz
  - Disease Identifier Mapping: disease_mappings.tsv.gz
- Generated Mapping Data
  - Disease Identifier Mapping: DISEASE_DOID_MAP.txt
  - Phenotype Identifier Mapping: PHENOTPYE_HPO_MAP.txt
Ensembl
Homepage: https://uswest.ensembl.org/index.html
Citation:
Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Girón CG, Gil L. Ensembl 2018. Nucleic Acids Research. 2017;46(D1):D754-61
Usage: Ensembl data was utilized to create mappings between Ensembl genes, transcripts, and proteins with NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers in the knowledge graph (for additional details on the processing of these data, see Data_Preparation.ipynb):
- Ensembl Transcript IDs ➞ PRO IDs
- Gene Ensembl IDs ➞ Entrez Gene IDs
- Gene Ensembl IDs ➞ PRO IDs
- Gene Symbols ➞ Transcript Ensembl IDs
- Entrez Gene IDs ➞ Transcript Ensembl IDs
- Entrez Gene IDs ➞ PRO IDs
- Protein Ensembl IDs ➞ UniProt Protein Accession
- STRING IDs ➞ PRO IDs
- UniProt Protein Accession ➞ Entrez Gene IDs
Files
- Downloaded Data
- Generated Mapping Data
  - Cleaned Ensembl Gene Set: ensembl_identifier_data_cleaned.txt
  - Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
  - Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt
  - Gene Symbol-Ensembl Transcript Identifier Mapping: GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt
  - Entrez Gene-Ensembl Transcript Identifier Mapping: ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt
  - Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
  - Ensembl Gene-Entrez Gene Identifier Mapping: ENSEMBL_GENE_ENTREZ_GENE_MAP.txt
GeneMANIA
Homepage: https://genemania.org/
Citation:
Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research. 2010;38(suppl_2):W214-20
Usage: GeneMANIA was utilized to create gene-gene edges.
Files
- Downloaded Data: COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt
Genotype-Tissue Expression Project
Homepage: https://gtexportal.org/home/
Citation:
Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, Foster B. The genotype-tissue expression (GTEx) project. Nature Genetics. 2013;45(6):580
Usage: The Genotype-Tissue Expression (GTEx) Project was utilized to create edges between protein-cell, protein-anatomy, rna-cell, and rna-anatomy entities. The original data were filtered such that only those edges where the median TPM was >= 1.0 and genes were of any type other than protein-coding were included. It should also be noted that we chose to use the RNASeQC file over the RSEM file, as advised by the GTEx website:
The RSEM estimates are based on combining isoform-level estimates, which adds uncertainty to the resulting gene-level values (the isoform-level estimates are highly inaccurate in some cases).
The file contains 54 unique tissue and/or cell types. GTEx provides mappings from tissue types to UBERON and EFO. These provided mappings were verified and extended, such that all samples which referenced a cell type were also mapped to the Cell and the Cell Line ontologies. This resulted in a total of 56 mappings (1.04 mappings/concept).
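The median TPM threshold and the mappings-per-concept figure above can be illustrated with a short sketch. It covers only the expression threshold (the gene-type criterion is left out), and the gene names, tissue names, and values are hypothetical.

```python
MIN_MEDIAN_TPM = 1.0


def expression_edges(median_tpm):
    """Return (gene, tissue) pairs whose median TPM meets the threshold."""
    edges = []
    for gene, by_tissue in median_tpm.items():
        for tissue, tpm in by_tissue.items():
            if tpm >= MIN_MEDIAN_TPM:
                edges.append((gene, tissue))
    return edges


# Hypothetical median TPM values keyed by gene and then by tissue.
median_tpm = {
    "GENE_A": {"Liver": 12.5, "Lung": 0.4},
    "GENE_B": {"Liver": 1.0, "Lung": 0.0},
}
edges = expression_edges(median_tpm)

# 56 mappings over 54 tissue and cell type concepts, as reported above.
mappings_per_concept = round(56 / 54, 2)
```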
Files
- Downloaded Data: GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct
- Mapping Results: zooma_tissue_cell_mapping_04JAN2020.xlsx
- Generated Data: The final mapping set was combined with terms from the Human Protein Atlas, see here for more information.
  - All HPA tissue and cell type strings: HPA_tissues.txt
  - Final Term Mapping: HPA_GTEx_TISSUE_CELL_MAP.txt
  - Final RNA, Gene, Protein-Tissues and Cell Types Relations: HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt
Human Genome Organisation Gene Nomenclature Committee
Homepage: https://www.genenames.org/
Citations:
HGNC Database, HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom www.genenames.org
Yates B, Braschi B, Gray K, Seal R, Tweedie S, Bruford E. Genenames.org: the HGNC and VGNC Resources in 2017. Nucleic Acids Research. 2017;45(D1):D619-625
Usage: The Human Genome Organisation (HUGO) data was utilized to obtain mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers (for additional details on the processing of these data, see Data_Preparation.ipynb):
- Ensembl Transcript IDs ➞ PRO IDs
- Gene Ensembl IDs ➞ Entrez Gene IDs
- Gene Ensembl IDs ➞ PRO IDs
- Gene Symbols ➞ Transcript Ensembl IDs
- Entrez Gene IDs ➞ Transcript Ensembl IDs
- Entrez Gene IDs ➞ PRO IDs
- Protein Ensembl IDs ➞ UniProt Protein Accession
- STRING IDs ➞ PRO IDs
- UniProt Protein Accession ➞ Entrez Gene IDs
Files
- Downloaded Data: hgnc_complete_set.txt
- Generated Data
  - Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
  - Gene Symbol-Ensembl Transcript Identifier Mapping: GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt
Human Protein Atlas
Homepage: https://www.proteinatlas.org/
Citation:
Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A, Olsson I. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419
Usage: The Human Protein Atlas (HPA) was utilized to create rna-cell, rna-anatomy, protein-cell, and protein-anatomy edges. Evidence between gene and RNA expression in specific tissue types was derived by HPA, such that the consensus normalized expression was >= 1.0. Zooma was utilized to automatically annotate the 153 unique tissues and cell types from the Human Protein Atlas for all human protein-coding genes in the Human Proteome to the Cell Ontology, Cell Line Ontology, and the Uber-Anatomy Ontology. To best represent each concept, the automatic mappings from Zooma were extended through manual mapping efforts to ensure each concept was matched to a Cell Ontology, Cell Line Ontology, and UBERON ontology term. This resulted in a total of 281 mappings (1.84 mappings/concept).
Files
- Downloaded Data: proteinatlas_search.tsv
- Mapping Results: zooma_tissue_cell_mapping_04JAN2020.xlsx
- Generated Data
  - Final Term Mapping: HPA_GTEx_TISSUE_CELL_MAP.txt
  - Final RNA, Gene, Protein-Tissues and Cell Types Relations: HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt
National Center for Biotechnology Information Gene
Homepage: https://www.ncbi.nlm.nih.gov/gene/
Citation:
Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research. 2005;33(suppl_1):D54-8.
Usage: The National Center for Biotechnology Information (NCBI) Gene data was utilized to obtain mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers (for additional details on the processing of these data, see Data_Preparation.ipynb):
- Ensembl Transcript IDs ➞ PRO IDs
- Gene Ensembl IDs ➞ Entrez Gene IDs
- Gene Ensembl IDs ➞ PRO IDs
- Gene Symbols ➞ Transcript Ensembl IDs
- Entrez Gene IDs ➞ Transcript Ensembl IDs
- Entrez Gene IDs ➞ PRO IDs
- Protein Ensembl IDs ➞ UniProt Protein Accession
- STRING IDs ➞ PRO IDs
- UniProt Protein Accession ➞ Entrez Gene IDs
Files
- Downloaded Data: Homo_sapiens.gene_info.gz
- Generated Data
  - Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
  - Entrez Gene-Ensembl Transcript Identifier Mapping: ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt
  - Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
  - Ensembl Gene-Entrez Gene Identifier Mapping: ENSEMBL_GENE_ENTREZ_GENE_MAP.txt
  - UniProt Accession-Entrez Gene Identifier Mapping: UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt
Reactome Pathway Database
Homepage: https://reactome.org/
Citation:
Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, Haw R, Jassal B, Korninger F, May B, Milacic M. The reactome pathway knowledgebase. Nucleic Acids Research. 2017;46(D1):D649-55
Usage: The Reactome Database was utilized to create chemical-pathway, GO Biological process-pathway, pathway-GO Cellular component, GO Molecular function-pathway, and protein-pathway edges. The original data was filtered such that only records meeting the following criteria were included:
- chemical-pathway: column[5] == "Homo sapiens"
- GO Biological process-pathway: column[5] startswith "REACTOME", column[8] == "P", and column[12] == "taxon:9606"
- pathway-GO Cellular component: column[5] startswith "REACTOME", column[8] == "C", and column[12] == "taxon:9606"
- GO Molecular function-pathway: column[5] startswith "REACTOME", column[8] == "F", and column[12] == "taxon:9606"
- protein-pathway: column[5] == "Homo sapiens"
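The column-based criteria above can be sketched as two small predicates over rows that have already been split on tabs. This is an illustration only: it assumes the column indices are zero-based, and the helper names, sample identifiers, and rows are hypothetical.

```python
def keep_species_row(columns, species="Homo sapiens"):
    """chemical-pathway and protein-pathway rule: column[5] names the species."""
    return len(columns) > 5 and columns[5] == species


def keep_go_row(columns, aspect):
    """Pathway-GO rule: Reactome reference, requested GO aspect, human taxon."""
    return (
        len(columns) > 12
        and columns[5].startswith("REACTOME")
        and columns[8] == aspect
        and columns[12] == "taxon:9606"
    )


def make_row(reference, aspect, taxon):
    """Build a hypothetical 13-column annotation row; unused columns stay empty."""
    columns = [""] * 13
    columns[5], columns[8], columns[12] = reference, aspect, taxon
    return columns


go_rows = [
    make_row("REACTOME:R-HSA-0000001", "P", "taxon:9606"),
    make_row("REACTOME:R-HSA-0000002", "C", "taxon:9606"),
    make_row("REACTOME:R-HSA-0000003", "P", "taxon:10090"),
    make_row("PMID:0000000", "P", "taxon:9606"),
]
biological_process_rows = [row for row in go_rows if keep_go_row(row, "P")]
cellular_component_rows = [row for row in go_rows if keep_go_row(row, "C")]

pathway_rows = [
    ["ID_A", "R-HSA-0000001", "", "", "", "Homo sapiens"],
    ["ID_B", "R-MMU-0000001", "", "", "", "Mus musculus"],
]
human_pathway_rows = [row for row in pathway_rows if keep_species_row(row)]
```

Passing the aspect code ("P", "C", or "F") as a parameter lets one predicate serve all three GO edge types.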
Files
- Downloaded Data
  - Chemical-Pathway Relations: ChEBI2Reactome_All_Levels.txt
  - Pathway-GO Relations: gene_association.reactome
  - Protein-Pathway Relations: UniProt2Reactome_All_Levels.txt
Search Tool for Recurring Instances of Neighbouring Genes Database
Homepage: string-db.org
Citation:
Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research. 2018;47(D1):D607-13
Usage: The Search Tool for Recurring Instances of Neighbouring Genes (STRING) Database was utilized to create protein-protein edges. The original data was filtered such that only records meeting the following criteria were included: combined_score >= "700" (>90th percentile).
Files
- Downloaded Data: 9606.protein.links.v11.0.txt.gz
- Generated Data
  - STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt
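The combined_score threshold used for the STRING protein-protein edges can be sketched as follows. This is a minimal example that assumes whitespace-delimited lines with a header row and three fields (two protein identifiers and a score). The function name and the sample identifiers are hypothetical.

```python
MIN_COMBINED_SCORE = 700


def high_confidence_links(lines):
    """Keep protein pairs whose combined_score is at least the threshold."""
    links = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] == "protein1":
            continue  # skip the header and malformed lines
        protein1, protein2, score = fields
        if int(score) >= MIN_COMBINED_SCORE:
            links.append((protein1, protein2, int(score)))
    return links


# Hypothetical sample lines in the assumed three-column layout.
sample = [
    "protein1 protein2 combined_score",
    "9606.ENSP_A 9606.ENSP_B 950",
    "9606.ENSP_A 9606.ENSP_C 699",
    "9606.ENSP_B 9606.ENSP_C 700",
]
links = high_confidence_links(sample)
```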
Universal Protein Resource Knowledgebase
Homepage: https://www.uniprot.org/
Citation:
UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research. 2018;47(D1):D506-15
Usage: The Universal Protein Resource (UniProt) Knowledgebase was utilized to obtain cofactor/catalyst-protein and protein-coding gene-protein edges as well as mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers (for additional details on the processing of these data, see Data_Preparation.ipynb):
- Ensembl Transcript IDs ➞ PRO IDs
- Gene Ensembl IDs ➞ Entrez Gene IDs
- Gene Ensembl IDs ➞ PRO IDs
- Gene Symbols ➞ Transcript Ensembl IDs
- Entrez Gene IDs ➞ Transcript Ensembl IDs
- Entrez Gene IDs ➞ PRO IDs
- Protein Ensembl IDs ➞ UniProt Protein Accession
- STRING IDs ➞ PRO IDs
- UniProt Protein Accession ➞ Entrez Gene IDs
Files
- Downloaded Data
  - Cofactor and Catalyst relations: Cofactor/Catalyst Query Results
  - UniProt Identifier Mapping: UniProt Identifier Query Results
- Generated Data
  - Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
  - Protein-Cofactor Relations: UNIPROT_PROTEIN_COFACTOR.txt
  - Protein-Catalyst Relations: UNIPROT_PROTEIN_CATALYST.txt
  - UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt
  - UniProt Accession-Entrez Gene Identifier Mapping: UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt
This project is licensed under the Apache License 2.0; see the LICENSE.md file for details. If you intend to use any of the information on this Wiki, please provide the appropriate attribution by citing this repository:
@misc{callahan_tj_2019_3401437,
author = {Callahan, TJ},
title = {PheKnowLator},
month = mar,
year = 2019,
doi = {10.5281/zenodo.3401437},
url = {https://doi.org/10.5281/zenodo.3401437}
}