Skip to content

Latest commit

 

History

History
2543 lines (2097 loc) · 84.3 KB

dictionary.md

File metadata and controls

2543 lines (2097 loc) · 84.3 KB

Data Distillery Data Dictionary

[TOC]

Introduction

The Data Distillery project aims to integrate summarized (“distilled”) Common Fund data within a knowledge graph. The purpose of the Data Distillery Knowledge Graph (DDKG) is to link multiple sources of expertly curated data, thus providing data integration across multiple Common Fund data coordinating centers (DCCs). The summarized data are provided by participating DCCs and funded as part of the Common Fund Data Ecosystem (CFDE) project. The DDKG schema is based on the Unified Biomedical Knowledge Graph (UBKG) which originates from the Unifield Medical Language System (UMLS). The UBKG supports the DDKG with over 180 different ontologies and standards supporting the Common Fund data that either are native to UMLS or were explicitly added to support biomolecular data (see Figure 1). The DDKG can be used to create simple to complex queries, and use the results for a range of different applications related to the use of Common Fund data. We include some use cases in another document <link to Use Case Document here>.

For the first phase of the project, the participating DCCs have submitted 29 different datasets for integration into the DDKG This document is focused on outlining these 29 datasets within the DDKG,, and describing the schema and information on each dataset.

Figure 1: Base datasets

Base Datasets

Information on the base set of ontologies included in the Data Distillery Knowledge Graph can be found in the documentation for the Unified Biomedical Knowledge Graph (UBKG), upon which the Data Distillery is built. See Figure 1 for a general schematic.

DCC Datasets

4D Nucleome (4DN) DCC

4D Nucleome datasets

Dataset SAB(s) 4DNQ 4DNL 4DNF 4DND
DCC Website data.4dnucleome.org
DCC 4DN-DCIC
Authority Andy Schroeder (PM) Harvard Medical School, Boston
Source Information Chromatin loops called from Hi-C experiments performed in select cell lines.
Purpose Representing topologically associated domains and loops by chromosomal location can allow exploration of gene expression, genomic variation and other biological information in the context of chromatin architecture.
Description Hi-C chromatin capture assays generate information on regions of the genome that can be located far apart along the linear sequence of DNA but are in close physical proximity in nuclear chromatin. Architectural features of the chromatin including topologically associated domains (TADs), loops and dots can be generated by algorithms from the results of Hi-C experiments.

Loop calls from several 4DNucleome Hi-C datasets generated from select cell lines and tissues are provided to the data distillery for ingestion.

A subset of loop calls were generated by two different 4DN research labs on datasets from the H1-ESC human ES cell line, H1 differentiated to endoderm and HFFc6, a human foreskin derived cell line. Loop calls from the Dekker lab were generated using the cooltools re-implementation of HICCUPS as described in Oksuz et al. 2021 https://pubmed.ncbi.nlm.nih.gov/34480151/. Loop calls from the Cremins lab were generated as described in Emerson et al. 2022 https://pubmed.ncbi.nlm.nih.gov/35676475/. Additional calls from the Cremins lab generated and part of the data from the Emerson paper from the HCT116 colorectal cancer cell line with or without depletion of WAPL or RAD21, genes that encode protein important for chromatin architecture are also provided.

In addition, the 4DN-DCIC generated loop calls on 4 additional datasets, in situ Hi-C performed on H1-ESC or GM12878 cell lines, 4DNESFSCP5L8 and 4DNES3JX38V5, respectively and DNase-Hi-C in fetal heart tissue (4DNESZFHB53P) or RUES2 stem cells differentiated to cardiomyocytes (4DNESGTHHJAC). These loop calls from the 25 kb resolution matrices of these datasets were further filtered for those loops that overlapped expressed genes identified from gene expression data from the same or comparable cells and tissues.

Summarization Methodology This document indicates the datasets and files used as input to the 4DN distilled data. For the genome-wide loop calls from the Dekker and Cremins groups the indicated files were the direct input into the summarization process. For the 4DN-DCIC loop calls the mcool files indicated were used to call loops and further summarized utilizing expression data to provide a file of loops that overlap expressed genes as described in this document.

Provided loop files were further prepared for ingestion by first creating dataset nodes (SAB: ‘4DND’) with the respective terms containing the dataset information (assay type, lab and cell type involved), file nodes (SAB: ‘4DNF’) with the respective terms containing the file information, loop nodes (SAB: ‘4DNL’) attached to HSCLO nodes at 1kpb resolution level corresponding to upstream start and end and downstream start and end nodes of the characteristic anchor of the loop and q-value nodes (SAB: ‘4DNQ’) corresponding to donut q-value of the loops. The mentioned nodes are then used to create concept nodes with connections depicted in the schematic below.

Summarization Methodology Code Repository URL https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/4DN
Total nodes 462,178
Total edges 2,768,612

4DN Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image1.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

\

4DN Node and Edge counts

SAB Count
4DND 12
4DNF 12
4DNL 215822
4DNQ 354
Subject SAB Predicate Object SAB Count
4DND has_assay_type EFO 12
4DND has_assay_type OBI 12
4DND dataset_involved_cell_type EFO 11
4DND dataset_involved_cell_type UBERON 1
4DND dataset_has_file 4DNF 12
4DNF file_has_loop 4DNL 215822
4DNL loop_has_qvalue_bin 4DNQ 215822
4DNL loop_us_start HSCLO 215822
4DNL loop_us_end HSCLO 215822
4DNL loop_ds_start HSCLO 215822
4DNL loop_ds_end HSCLO 215822

Extracellular RNA Communication Program (ERCC)

ERCC RBP dataset

Dataset SAB(s) ENSEMBL, UBERON, UNIPROTKB, ENCODE.RBS.150.NO.OVERLAP, ENCODE.RBS.HepG2, ENCODE.RBS.HepG2.K562, ENCODE.RBS.K562
DCC Website https://exrna.org
DCC Extracellular RNA Communication Consortium (ERCC)
Authority Aleksandar Milosavljevic
Source Information The genomic coordinates of eCLIP peaks of 150 RNA binding proteins (RBPs) were taken from eCLIP-seq analysis results published by the ENCODE project. Control extracellular RNA (exRNA) sequencing profiles available through the exRNA Atlas were used to draw several relationships.
Purpose To help identify minimally invasive biomarkers of disease.
Description Assertions describe relationships between RBPs, RBP binding sites, genes, and biofluids. RBP binding sites refers to both eCLIP peaks and a second type of genomic locus. The group is the result of trimming the eCLIP loci so that there are no overlaps between sets of loci from a given pair of RBPs.
Summarization Methodology Relationships between RBPs and biofluids, and between eCLIP loci and biofluids are the result of a correlation-based analysis. This analysis was performed using the coverage of trimmed eCLIP loci within control exRNA profiles made available through the exRNA Atlas. This analysis is described in detail by LaPlante et al., Cell Genomics, 2023.
Summarization Methodology Code Repository URL https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/ERCC/check_ERCC_submissions.ipynb
Total nodes 1,169,178
Total edges 2,431,786
Source Data DOI(s) (optional) https://doi.org/10.1038/nmeth.3810
Source Data URL(s) (optional) https://www.encodeproject.org/encore-matrix/?type=Experiment&status=released&internal_tags=ENCORE

ERCC RBP Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image2.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

ERCC RBP Node and Edge counts

>>>>> gd2md-html alert: inline image link here (to images/image3.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image4.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

ERCC Regulatory Element dataset

Dataset SAB(s) CLINGEN.ALLELE.REGISTRY, ENSEMBL, GTEXEQTL, UBERON, ENCODE.CCRE, ENCODE.CCRE.ACTIVITY, ENCODE.CCRE.CTCF, ENCODE.CCRE.H3K27AC, ENCODE.CCRE.H3K4ME3 (node SABs)

ERCCREG, ERCCRBP (edge SABs)

DCC Website https://exrna.org
DCC Extracellular RNA Communication Consortium (ERCC)
Authority Aleksandar Milosavljevic
Source Information The results of CHIP-seq experiments conducted by the ENCODE project were used to identify regulatory elements active within specific tissues and their transcriptional role. Similarly, we used data published by the GTEx project to identify eQTLs active within specific tissues.
Purpose To identify regulatory elements active within a specific tissue which are also supported by having an active eQTL within the range of its genomic coordinates.
Description The tissue specific regulation of a gene by an eQTL is modeled using variant, tissue, eQTL, and gene nodes. The same model structure is also used for regulatory elements. In this case a “regulatory element activity” (SAB=ENCODE.CCRE.ACTIVITY) node is used as the central node rather than the eQTL node. Regulatory element activity nodes are also decorated with relationships to other nodes to assist in determining the tissue specific transcriptional role of the regulatory element. eQTL and regulatory element models are connected by a relationship between variant and regulatory element nodes.
Summarization Methodology To summarize regulatory element data, ENCODE biosamples were grouped by their respective tissue or cell line ontology code. These groups were then further grouped by the number of samples within each biosample group. Next, within the DNase Z-score data matrix provided by ENCODE, for each larger group and each regulatory element, the number of z-scores that were above 1.64 were counted within samples of each biosample (or small) group. This process was used to build a reference distribution of counts specific to a biosample group with a specific number of members. Regulatory elements were then classified as active within a specific tissue or cell type if the count of z-scores greater than 1.64 within ENCODE biosamples belonging to that group was above the median value of the reference distribution.

Only regulatory elements classified as active in at least one tissue or cell type are included. The process described above was repeated for the H3K4me3, H3K27Ac, and CTCF z-score data matrices to decorate regulatory element activity nodes with relationships to other nodes to help the user identify the transcriptional role of each regulatory element.

Summarization Methodology Code Repository URL https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/ERCC/check_ERCC_submissions.ipynb
Total nodes 2,918,828
Total edges 14,897,093
Source Data DOI(s) (optional) https://doi.org/10.1038/s41586-020-2493-4
Source Data URL(s) (optional) https://screen.wenglab.org/

https://www.gtexportal.org/home/

ERCC Regulatory Element Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image5.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

ERCC Regulatory Element Node and Edge Counts

>>>>> gd2md-html alert: inline image link here (to images/image6.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image7.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

GlyGen

GlyGen Datasets

Dataset SAB(s) FALDO, GLYCOCOO,GLYCORDF, UNIPROTKB, PROTEOFORM, GLACANS
DCC Website https://www.glygen.org/
DCC GlyGen
Authority Raja Mazumder (PI) George Washington University; Mike Tiemeyer (PI) University of Georgia
Source Information Data for GlyGen is retrieved from multiple glycomics database (e.g. GlyTouCan, GlyConnect, MatrixDB), proteomics database (e.g. UniProtKB) and other domain database (e.g. Ensembl, RefSeq, BioMuta, OMA, MGI, Bgee). All data is transformed in standardized representation and integrate in GlyGen
Purpose Provide computational and informatics resources and tools for glycosciences research. Integrate data and knowledge from diverse disciplines relevant to glycobiology. Address needs inside and outside the glycoscience community.
Description GlyGen is a data integration and dissemination project for carbohydrate and glycoconjugate related data. GlyGen retrieves information from multiple international data sources and integrates and harmonizes this data. This web portal allows exploring this data and performing unique searches that cannot be executed in any of the integrated databases alone.
Summarization Methodology The data ingested for the KnowledgeGraph are from ontologies associated with glycan and proteoform domain. Select nodes and edges for glycans are retrieved from GlyCoCoo and GlyCoRDF. ontologies that describe the properties of glycans. \ The assertion data received from GlyGen in n-triples format (glycan.nt and proteoform.nt) were imported into the No4j environment using the n10s plug-in functions. Once the data was imported for each of the glycans and proteoform datasets, subgraphs were created.Finally, the resulting graph nodes and edges were exported as .csv files using APOC plug-in procedures.The resulting nodes and edges were reformatted by curating the relationship names and adding SABs for all entities (either by using existing SABs e.g. UNIPROTKB and GLYTOUCAN or creating custom SABs such as GLYGEN.LOCATION or GLYCOPROTEIN) and saved as the OWLNETS_node_metadata.tsv and OWLNETS_edgelist.tsv for ingestion. More information on FALDO can be found here: https://bioportal.bioontology.org/ontologies/FALDO. More information on GlycoRDF can be found here: https://github.com/glycoinfo/GlycoRDF. More information on GlycoCoO can be found here: https://github.com/glycoinfo/GlycoCoO. Data under SAB FALDO was ingested to assist in ..
Summarization Methodology Code Repository URL https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/GLYGEN/GlyGen_workfolw.md
Total nodes 154, 241770 (PROTEOFORM), 182269 (GLYCANS)
Total edges 137, 455469 (PROTEOFORM), 464659 (GLYCANS)
Source Data DOI(s) (optional) Raw data DOI
Source Data URL(s) (optional) Download from https://sparql.glygen.org/.

Data file: https://sparql.glygen.org/ln2triplestoredata/triples.tar.gz

GLYGEN Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image8.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge counts (FALDO)

>>>>> gd2md-html alert: inline image link here (to images/image9.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image10.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

GlyGen Node and Edge counts (GLYCOCOO)

>>>>> gd2md-html alert: inline image link here (to images/image11.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image12.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge counts (GLYCORDF)

>>>>> gd2md-html alert: inline image link here (to images/image13.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image14.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge Counts (PROTEOFORM) \

SAB Count
UNIPROTKB 16810
GLYCOPROTEIN 63441
GLYCOPROTEIN.EVIDENCE 2088
GLYCOSYLATION.SITE 52451
GLYGEN.LOCATION 52450
UNIPROTKB.ISOFORM 8406
GLYGEN.CITATION 2088
GP.ID2PRO 61353
GLYTOUCAN 1554
AMINO.ACID 7

\

Subject SAB Predicate Object SAB Count
UNIPROTKB has_isoform UNIPROTKB.ISOFORM 8404
GLYCOPROTEIN has_evidence GLYCOPROTEIN.EVIDENCE 120158
GLYCOPROTEIN sequence UNIPROTKB.ISOFORM 61353
GLYCOPROTEIN.EVIDENCE citation GLYGEN.CITATION 2088
GLYCOPROTEIN has_pro_entry GP.ID2PRO 61353
GLYCOPROTEIN glycosylated_at GLYCOSYLATION.SITE 52450
GLYCOSYLATION.SITE location GLYGEN.LOCATION 52450
GLYCOSYLATION.SITE has_saccharide GLYTOUCAN 44763
GLYGEN.LOCATION has_amino_acid AMINO.ACID 52450

Node and Edge Counts (GLYCANS) \

SAB Count
GLYGEN.GLYCOSYLATION 91
GLYCOSYLTRANSFERASE.REACTION 91
GLYTOUCAN 33755
GLYGEN.RESIDUE 80
GLYGEN.SRC 30986
GLYGEN.GLYCOSEQUENCE 117146
GLYCAN.MOTIF 120

\

Subject SAB Predicate Object SAB Count
GLYGEN.GLYCOSYLATION has_enzyme_protein UNIPROTKB 91
GLYCOSYLTRANSFERASE.REACTION has_enzyme_protein UNIPROTKB 91
GLYTOUCAN is_from_source GLYGEN.SRC 30986
GLYTOUCAN has_glycosequence GLYGEN.GLYCOSEQUENCE 117146
GLYGEN.RESIDUE attached_by GLYGEN.GLYCOSYLATION 349
GLYTOUCAN synthesized_by GLYCOSYLTRANSFERASE.REACTION 210563
GLYTOUCAN has_motif GLYCAN.MOTIF 19321
GLYTOUCAN has_canonical_residue GLYGEN.RESIDUE 86033
GLYGEN.RESIDUE has_parent GLYGEN.RESIDUE 79

Genotype Tissue Expression (GTEx)

GTEx datasets

Dataset SAB(s) GTEXEXP, GTEXEQTL, EXPBINS, PVALUEBINS
DCC Website https://www.gtexportal.org/home/
DCC GTEx
Authority Kristen Ardlie
Source Information Documentation on the sources of GTEx data can be found here: https://biospecimens.cancer.gov/resources/sops/docs/GTEx_SOPs/BBRB-PR-0004-W1%20GTEx%20Tissue%20Harvesting%20Work%20Instruction.pdf
Purpose To include bulk RNA-seq gene expression levels from adult tissues as well as correlations between genotype and tissue-specific gene expression levels as expression quantitative trait loci (eQTLs) that identify regions of the genome that influence whether and how much a gene is expressed.
Description The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. This database includes expression levels for genes by tissue in terms of transcripts per mission (TPM). The database also contains the p-values and relationships between loci and genes as expression quantitative trait loci (eQTLs).
Summarization Methodology Three types of GTEx data were summarized and ingested into the knowledge graph listed below by SAB:

(1) GTEXEXP - Transcript per million (TPM) values, which represent gene-tissue expression levels, were ingested as is except that edges to ‘bin nodes’ (EXPBINS) were created.

For example, a GTEXEXP node with a TPM of 10.5 will have an edge to the bin node that represents [10,11] TPM.

(2) GTEXEQTL - GTEx eQTLs were filtered to include only those that are present in every tissue. This reduced the total set of eQTLs to ~2 million. P-values for the eQTLs are also included in the graph, however, they are represented as bin nodes just like the TPM values for the GTEXEXP dataset.

(3) GTEXCOEXP - A GTEx co-expression dataset was made by first calculating the Pearson’s correlation coefficient of all genes in GTEx intersection with HGNC master list separately for each of 54 tissues listed in GTEx using the provided TPMs. Then pairs of genes with correlation coefficient > 0.99 were tagged in each tissue as strongly correlated and reported as assertions with relationship types and counts in the attached table (please see below)

Summarization Methodology Code Repository URL https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/GTEx
Total nodes 6,280,011
Total edges 31,904,034
Source Data DOI(s) (optional) Raw data DOI
Source Data URL(s) (optional) https://www.gtexportal.org/home/datasets

(GTEx_Analysis_v8_eQTL.tar

and

GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz)

GTEXEXP Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image15.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

GTEXQTL Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image16.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

GTEx Node and Edge counts (GTEXEXP/EXPBINS)

>>>>> gd2md-html alert: inline image link here (to images/image17.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image18.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge counts (GTEXEQTL/PVALUEBINS)

>>>>> gd2md-html alert: inline image link here (to images/image19.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image20.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge counts (GTEXCOEXP)

>>>>> gd2md-html alert: inline image link here (to images/image21.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

For GTEXCOEXP, all nodes are existing HGNC. Edges and counts listed below.

Predicate Count Predicate Count
coexpression_Adipose___Subcutaneous 15485 coexpression_Esophagus___Gastroesophageal_Junction 12874
coexpression_Adipose___Visceral_(Omentum) 2646 coexpression_Esophagus___Mucosa 20463
coexpression_Adrenal_Gland 37897 coexpression_Esophagus___Muscularis 79416
coexpression_Artery___Aorta 642521 coexpression_Fallopian_Tube 769999
coexpression_Artery___Coronary 612950 coexpression_Heart___Atrial_Appendage 168057
coexpression_Artery___Tibial 10237 coexpression_Heart___Left_Ventricle 600676
coexpression_Bladder 529181 coexpression_Kidney___Cortex 583782
coexpression_Brain___Amygdala 22221 coexpression_Kidney___Medulla 10461695
coexpression_Brain___Anterior_cingulate_cortex_(BA24) 102887 coexpression_Liver 20645
coexpression_Brain___Caudate_(basal_ganglia) 7309 coexpression_Lung 17156
coexpression_Brain___Cerebellar_Hemisphere 40983 coexpression_Minor_Salivary_Gland 47164
coexpression_Brain___Cerebellum 106195 coexpression_Muscle___Skeletal 4061
coexpression_Brain___Cortex 764276 coexpression_Nerve___Tibial 12460
coexpression_Brain___Frontal_Cortex_(BA9) 27760 coexpression_Ovary 20177
coexpression_Brain___Hippocampus 84051 coexpression_Pancreas 24183
coexpression_Brain___Hypothalamus 185487 coexpression_Pituitary 28152
coexpression_Brain___Nucleus_accumbens_(basal_ganglia) 1198329 coexpression_Prostate 4197
coexpression_Brain___Putamen_(basal_ganglia) 1146393 coexpression_Skin___Not_Sun_Exposed_(Suprapubic) 1516
coexpression_Brain___Spinal_cord_(cervical_c_1) 267533 coexpression_Skin___Sun_Exposed_(Lower_leg) 84793
coexpression_Brain___Substantia_nigra 143792 coexpression_Small_Intestine___Terminal_Ileum 73157
coexpression_Breast___Mammary_Tissue 3094 coexpression_Spleen 375064
coexpression_Cells___Cultured_fibroblasts 11652 coexpression_Stomach 12964
coexpression_Cells___EBV_transformed_lymphocytes 90051 coexpression_Testis 141440
coexpression_Cervix___Ectocervix 817624 coexpression_Thyroid 31593
coexpression_Cervix___Endocervix 805252 coexpression_Uterus 16498
coexpression_Colon___Sigmoid 11358 coexpression_Vagina 20584
coexpression_Colon___Transverse 22786 coexpression_Whole_Blood 13845

The Human BioMolecular Atlas Program (HuBMAP)

HuBMAP data sets

Dataset SAB(s) AZ, HUBMAP
DCC Website https://hubmapconsortium.org/
DCC HuBMAP
Authority Jonathan Silverstein, Phil Blood
Source Information
Purpose HuBMAP data provides tissue, cell-type and gene specific markers from single-cell data. The purpose of the Hubmap/AZ data is to provide cell-type-specific gene expression markers from single-cell experiments across each tissue.
Description HuBMAP is working to catalyze the development of a framework for \ mapping the human body at single cell resolution and developing the tools to create an open, global atlas of the human body at the cellular level. In this database, we include cell-type specific gene markers from the Azimuth project form a subset of tissues including heart, liver and kidney.
Summarization Methodology See https://azimuth.hubmapconsortium.org/references/
Summarization Methodology Code Repository URL https://azimuth.hubmapconsortium.org/references/
Total nodes 769
Total edges 910

HuBMAP Az Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image22.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

HuBMAP Az Node and Edge counts

>>>>> gd2md-html alert: inline image link here (to images/image23.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image24.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Illuminating the Druggable Genome (IDG)

IDG Datasets

Dataset SAB(s) IDGP (compound/protein), IDGD (compound/disease)

(both are edge SABs)

DCC Website https://pharos.nih.gov/ https://commonfund.nih.gov/idg
DCC Illuminating the Druggable Genome Data Coordinating Center - Engagement Plan with the CFDE
Authority Christophe Lambert (PI), University of New Mexico Health Sciences Center
Source Information Relationships between compounds, diseases, and proteins drawn from the IDG Target Central Resource Database (TCRD) hosted at https://pharos.nih.gov and at DrugCentral https://drugcentral.org/.
Purpose The Illuminating the Druggable Genome (IDG) project elucidates the relationships between diseases, targets, and compounds, providing insights into lesser-known proteins, empowering researchers to discover novel therapeutic targets and accelerate drug development for various diseases.
Description The IDG contributions to the Knowledge Graph include compounds, diseases and proteins and their relationships. A full description of IDG data sources is here: https://pharos.nih.gov/about

Target Central Resource Database (TCRD) is the central resource supporting the IDG-KMC. TCRD has information about human drug targets with a focus on GPCRs, kinases, and ion channels. TCRD categorizes all drug targets into four Target Development Levels (TDLs) by making use of activity thresholds. Protein drug targets from TCRD with known bioactive compounds were incorporated into the IDG KG. Also included in our KG are diseases from TCRD that have known “indication” relationships to approved drugs from DrugCentral. The IDG KG can be used to explore compounds and proteins related to a specific disease among other similar queries. The IDG KG can be combined with data from other DCCs such as LINCS, GTeX, and others to create interesting scientific use cases.

Summarization Methodology Compound nodes were sourced using TCRD, DrugCentral, and PubChem using PUBCHEM_CID ontology. Specifically, chemical compounds from DrugCentral were included if TCRD indicated known bioactivity against protein targets. Compound node properties include SMILES as the node_definition, drugbank ID as node_dbxrefs, name as node_label, and ‘IDG’ as the node_namespace. The PubChem API was used to assign the node_synonyms and node_dbxrefs node properties.

Protein nodes were obtained from TCRD and DrugCentral, and rely on UNIPROTKB ontology. Protein targets from DrugCentral and TCRD with known bioactive compounds were included. A protein’s symbol is denoted as node_label, protein names as node_definition, EnsEMBL IDs as node_dbxrefs, and ‘IDG’ as node_namespace.

The disease nodes use SNOMED_US ontology and are sourced from TCRD and DrugCentral. Diseases from DrugCentral and TCRD with known indication relationships to approved drugs were included. The disease OMOP concept names are included as the node property node_label. Additionally, OMOP IDs are included as node_dbxrefs and ‘IDG’ as the node_namespace.

The bioactivity relationship is defined between compounds and proteins. While ChEMBL and PubChem offer complex and differing ontologies for bioactivity relationships, for the sake of simplicity and efficiency in this early version we use the custom term "bioactivity". The type of bioactivity measurement (e.g. IC50, Kd, EC50) is included as evidence_class.

The indication relationship links compounds and diseases. For simplicity, we introduce the custom, simple term “indication”. These indication relationships are defined between diseases and approved drugs from DrugCentral.

Summarization Methodology Code Repository URL Code by IDG team:

https://github.com/unmtransinfo/cfde-distillery

Code from core DD team (further processing and formatting):

https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/IDG/check_IDG.ipynb

Total nodes Total: 331788; compounds: 327951; disease: 1472; protein: 2365
Total edges Total: 463972; compound/protein (Bioactivity): 454957; compound/disease (Indication): 9015
Source Data DOI(s) (optional) Raw data DOI
Source Data URL(s) (optional) https://app.globus.org/file-manager?origin_id=24c2ee95-146d-4513-a1b3-ac0bfdb7856f&origin_path=%2Fprojects%2Fdata-distillery%2FImport%2FIDG%2F

IDG Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image25.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

IDGP Node and Edge counts (compound/protein, SAB = IDGP)

>>>>> gd2md-html alert: inline image link here (to images/image26.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image27.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

IDGD Node and Edge counts (compound/disease, SAB = IDGD)

>>>>> gd2md-html alert: inline image link here (to images/image28.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image29.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

For IDGP and IDGD, the numbers shown in the tables are numbers of IDG edges connecting existing SAB nodes as indicated.

Gabriella Miller Kids First (GMKF)

KF Datasets

Dataset SAB(s) KFGENEBIN, KFPT, KFCOHORT
DCC Website https://kidsfirstdrc.org/
DCC Gabriella Miller Kids First (GMKF) Pediatric Research Program Data Resource Center (DRC)
Authority Deanne Taylor (PI), Children’s Hospital of Philadelphia
Source Information Genomic and phenotypic data, broadly summarized from trio cohorts with cardiac birth defects from the Pediatric Cardiac Genetics Consortium cohort in Kids First,cohort SD_PREASA7S
Purpose The main purpose of the KF DRC is to better understand the genetic causes and links between childhood cancer and structural birth defects.
Description The Kids First DRC is a collaborative pediatric research effort created to accelerate data-driven discoveries and the development of novel precision-based approaches for children diagnosed with cancer or a structural birth defect using large genomic datasets. The Kids First DRC is comprised of integrated core teams that support development of leading-edge big data infrastructure and provide the necessary resources and tools to empower researchers and clinicians.
Summarization Methodology Variant data from a Congenital Heart Defects (CHD) cohort was queried and filtered using the Kids First variant workbench platform. We filtered for variants that were scored as ‘high impact’ by the variant effect predictor tool (VEP). Variant counts per gene were then computed by counting how many times each gene appeared. The number of variations per gene is stored in the ‘value’ property of the SAB KFGENEBIN Code nodes.

Kids First cohorts are stored in the graph as their own nodes and have an SAB of KFCOHORT.

Patient IDs from the CHD cohort have also been ingested into the graph as their own nodes and have an SAB of KFPT.

Summarization Methodology Code Repository URL https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/KidsFirst
Total nodes (rows) 18,719
Total edges (rows) 76,690
Source Data DOI(s) (optional) Raw data DOI
Source Data URL(s) (optional) Raw data URL

KF Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image30.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

KF Node and Edge counts

>>>>> gd2md-html alert: inline image link here (to images/image31.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image32.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

The Library of Integrated Network-Based Cellular Signatures (LINCS)

LINCS datasets

Dataset SAB(s) LINCS (edge SAB only)
DCC Website https://lincsproject.org/
DCC Library of Integrated Network-Based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC)
Authority Avi Ma’ayan (PI), Icahn School of Medicine at Mount Sinai
Source Information Gene expression changes resulting from drug/small molecule perturbations across cell lines, and gene expression signature similarity between drug/small molecule based on LINCS L1000 signature similarity
Purpose Understand cellular responses to various drug and pre-clinical compound treatments through L1000 transcriptomics assays
Description The LINCS assertions include drug-gene associations and drug-drug similarity associations computed from the LINCS L1000 consensus signatures dataset. Each drug is linked to the top 25 most up-regulated and top 25 most down-regulated genes in the L1000 consensus signatures for the drug/small-molecule, as well as to the top 5 most similar other drugs in the dataset based on the correlation between the consensus signatures for each drug.
Summarization Methodology Level 3 L1000 profiles, drug metadata, and gene metadata were first downloaded from CLUE.io. The L1000 Level 5 signatures were then computed using the Characteristic Direction method [BMC Bioinformatics 15, 79 (2014)]. For each signature, replicate L1000 profiles for a given perturbagen and dosage were compared against all other L1000 profiles from the same cell line batch. Consensus signatures for each drug were then computed by taking the mean of all gene expression vectors corresponding to the given drug across cell lines, timepoints, and dosages.

Drugs were filtered to only those with known PubChem IDs in the original CLUE.io metadata, resulting in a final set of 4,523 drugs. The top 25 up- and down-regulated genes in each consensus signature with known Ensembl IDs from the metadata were determined by the greatest positive and negative Characteristic Direction coefficients, respectively. In total, 225,509 edges and 4,419 unique genes are represented in this collection of knowledge graph assertions.

Additionally, a drug-drug similarity matrix was generated by computing the cosine similarity between all possible pairs of the consensus drug signatures. For each drug, the top 5 other drugs with the greatest positive cosine similarity values were retained. Duplicate edges were removed, resulting in 20,785 total edges representing consensus signature-based drug-drug similarity between the 4,523 drugs with known PubChem IDs.

Summarization Methodology Code Repository URL The methods are described in the following publication:

Evangelista, J.E., Clarke, D.J.B., Xie, Z. et al. Toxicology knowledge graph for structural birth defects. Commun Med 3, 98 (2023).

The code to produce the assertions can be found at:

https://github.com/nih-cfde/ReproToxTables

Total nodes 8,942 (drugs: 4523; genes: 4419)
Total edges 246,294 (drug-gene: 225,509; drug-drug: 20,785)
Source Data DOI(s) (optional) N/A
Source Data URL(s) (optional) https://maayanlab.cloud/sigcom-lincs/#/Download

LINCS Schema Diagram \

>>>>> gd2md-html alert: inline image link here (to images/image33.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge counts

>>>>> gd2md-html alert: inline image link here (to images/image34.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image35.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

For the SAB LINCS, the numbers shown in the tables are numbers of edges connecting existing SAB nodes as indicated.

The Molecular Transducers of Physical Activity Consortium (MoTrPAC)

MoTrPAC datasets

Dataset SAB(s) MOTRPAC
DCC Website https://motrpac-data.org
DCC Molecular Transducers of Physical Activity Consortium (MoTrPAC) Bioinformatics Center (BIC)
Authority Euan Ashley MD PhD (PI), Matthew Wheeler MD PhD (PI)
Source Information Gene differential expression changes resulting from the RNA-seq data of young adult rats (6 month old) performing endurance training exercise at the 1 week, 2 week, 4 week and 8 week time points.
Purpose The Molecular Transducers of Physical Activity Consortium (MoTrPAC) aims to elucidate how exercise improves health and ameliorates diseases by building a map of the molecular responses to endurance exercise.
Description MoTrPAC is a multi-site collaboration across the US encompassing various scientific disciplines: preclinical animal study sites and human clinical exercise sites, which perform the exercise testing and biospecimen collection; a consortium coordinating center and biorepository, which manages sample collection, distribution of samples, and consortium logistics; chemical analysis sites, which are responsible for omics analysis from the samples collected; and a bioinformatics center to collaboratively analyze and map the data generated by the other sites along with data dissemination to make the data and other resources available to the public. The animal studies enable analysis of the effects of exercise on many different tissues that are not readily obtainable in humans, whereas the collection of accessible human tissues (muscle, blood, and adipose) will permit the analysis of the direct effect of exercise in humans. Additional information can be found at the main consortium page (https://motrpac.org) or at the data portal (https://motrpac-data.org).

The MoTrPAC study is divided into two main parts - animal (rats) and human, with multiple phases or interventions in each of them. Preclinical animal study sites conduct the endurance exercise and training intervention in young adult (6 month old) and middle-aged adult (18 month old) rats, while Clinical study sites conduct the human endurance and resistance training interventions in pediatric, adults and highly active adults.

Summarization Methodology
Summarization Methodology Code Repository URL https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/MoTrPAC/MOTRPAC.ipynb
Total nodes 16149
Total edges 25714
Source Data DOI(s) (optional) Raw data DOI
Source Data URL(s) (optional) Raw data URL

MOTRPAC Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image36.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge counts

>>>>> gd2md-html alert: inline image link here (to images/image37.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

>>>>> gd2md-html alert: inline image link here (to images/image38.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Metabolomics Workbench (MW)

Dataset SAB(s) MW (REFMET; metabolite nodes)
DCC Website https://www.metabolomicsworkbench.org/
DCC Metabolomics Workbench
Authority Professor Shankar Subramaniam (PI)
Source Information Gene-metabolite relationships: MW database tables based on KEGG and other resources

Disease-metabolite relationships: Publication based on HMDB data (https://pubmed.ncbi.nlm.nih.gov/32426349/)

Cell-metabolite relationships: MW database tables for data submitted to NMDR

Purpose Understand what metabolites may be regulated by various genes and their spatial (anatomical) and disease context.
Description The National Institutes of Health (NIH) Common Fund Metabolomics Program was developed with the goal of increasing national capacity in metabolomics by supporting the development of next generation technologies, providing training and mentoring opportunities, increasing the inventory and availability of high quality reference standards, and promoting data sharing and collaboration. In support of this effort, the Metabolomics Common Fund's National Metabolomics Data Repository(NMDR), housed at the San Diego Supercomputer Center (SDSC), University of California, San Diego, has developed the Metabolomics Workbench. The Metabolomics Workbench serves as a national and international repository for metabolomics data and metadata and provides analysis tools and access to metabolite standards, protocols, tutorials, training, and more.

NMDR houses data on metabolomics studies conducted by various centers and research laboratories across the nation and the world, spanning many species, sample sources, diseases, metabolomics experimental techniques and metabolite classes [https://www.metabolomicsworkbench.org/data/browse.php]. The data we have shared with the Data Distillery partnership is a key subset of all the data in NMDR, centered around metabolites. Specifically, we have shared disease-metabolite, gene-metabolite and cell/anatomy (sample source)-metabolite relationships, which when integrated with data from other DCC and external resources has the potential to address interesting biological questions.

Summarization Methodology Gene-Metabolite: Human genes catalyzing metabolic reactions and their associated metabolites were obtained from MW database tables. The HGNC ID was used as metabolic gene node_id, and its approved symbol and name are used as node_label and node_definition, respectively. UMLS, ENTREZ and ENSMBL IDs are used as node_dbxrefs. For the edges, the Subject (Gene: HGNC ID) was related to the Object (Metabolite: PUBCHEM_CID) by the Predicate (RO_0002566: Causally influences).

Disease-Metabolite: Disease-metabolite entities and relationships were deduced from the publication (PMID: 32426349) based on HMDB. The PUBCHEM_CID/HMDB ID was used as node_id for the metabolite. Similarly, disease entities were encoded with DOID or HPO IDs. UMLS, PUBCHEM_CID, DRUGBANK and REFMET were used as node_dbxrefs. For the edges, the Subject (Metabolite: PUBCHEM_CID/HMDB ID) was related to the Object (Disease: DOID/HPO) by the Predicate (RO_0003308: Correlated with condition).

Cell-Metabolite: Metabolite-anatomy context (cell/tissue association) was obtained from MW database. Cell/tissue entity node_id is encoded with UBERON, CL and CLO IDs and cross referenced with UMLS. For the edges, the Subject (Spatial context: UBERON/CL/CLO) was related to the Object (Metabolite: PUBCHEM_CID) by the Predicate (RO_0003000: Produces).

Summarization Methodology Code Repository URL https://github.com/mano-at-sdsc/MW_DataDistillery
Total nodes 51271
Total edges 10009
Source Data DOI(s) (optional) Raw data DOI
Source Data URL(s) (optional) Raw data URL

MW Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image39.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge Counts

SAB Count
UBERON 67
CL 12
CLO 3
HGNC 1061
PUBCHEM 8543
HMDB 18
DOID 276
HPO 32
Subject SAB Predicate Object SAB Count
UBERON produces PUBCHEM 34471
CL produces PUBCHEM 6777
CLO produces PUBCHEM 536
HGNC causally influences PUBCHEM 5527
PUBCHEM correlated with condition DOID 3856
HMDB correlated with condition DOID 27
PUBCHEM correlated with condition HPO 77

For the SAB MW, the numbers shown in the tables are numbers of edges connecting existing SAB nodes as indicated.

Stimulating Peripheral Activity to Relieve Conditions (SPARC)

Dataset SAB(s) SCKAN, NPO, UBERON, PATO, NIFSTD
DCC Website https://sparc.science/
DCC SPARC Data and Resouce Center (DRC) - Knowledge Management and Curation Core
Authority Tom Gillespie, Fahim Imam (SPARC K-Core, University of California San Diego)

Jyl Boline (PM - SPARC K-Core)

Source Information A key component of the SPARC Program is the SPARC Connectivity Knowledge Base of the Autonomic Nervous system, referred to as SCKAN. SCKAN is a semantic store housing a comprehensive knowledge base of autonomic nervous system (ANS) nerve to end organ connectivity. Connectivity information is derived from SPARC experts, SPARC data, and the literature and textbooks using a Natural Language Processing (NLP) pipeline.
Purpose Facilitate enhanced understanding of the peripheral nervous system to support the development of effective bioelectronic therapies by driving collaborative neurosciences and providing online resources for accessing and submitting curated data and models, as well as dynamic knowledge-management and visualization tools.
Description The SPARC Knowledge base of the Automatic Nervous System (SCKAN) is an integrated graph database composed of three parts: the SPARC dataset metadata graph, ApiNATOMY and Neuron Phenotype Ontology (NPO) models of connectivity, and the larger ontology used by SPARC which is a combination of the NIF-Ontology and community ontologies.
Summarization Methodology SCKAN provides a central location to populate, discover, and query ANS connectivity knowledge over multiple scales. It allows issuing queries such as, “what are the locations of neuron somas with processes that pass through spinal cord level C4?” and create a searchable visual atlas of ANS circuitry. Users of the SPARC maps can query SCKAN to find more information about routes, targets and evidence.

SCKAN contains statements about neuronal connectivity at the neuron population level, largely in the form of: “Neurons with somas in structure A project to structure B via nerve C.” SCKAN models connections at two levels of granularity: circuits and individual connections. A circuit represents a detailed model of connectivity that is associated with a particular organ like bladder or functional circuits like defensive breathing. Circuits contain detailed representations of neuron populations giving rise to ANS connections. They include mappings of the locations of cell bodies, dendrites, axon segments as well as synaptic endings involved in a particular circuit.

Circuits in SCKAN are modelled using ApiNATOMY, a knowledge model and a tool specifically created to represent multiscale connectivity. To provide a comprehensive knowledge about ANS connectivity, the circuit-based approach is supplemented with well-known connections of ANS derived from the literature and textbooks using a Natural Language Processing (NLP) pipeline.These types of individual connectivity statements do not have detailed topological information associated with them and are represented using NPO.

Summarization Methodology Code Repository URL https://zenodo.org/record/7476115
Total nodes 484,768
Total edges 1,337,124
Source Data DOI(s) (optional) 10.5281/zenodo.7476115
Source Data URL(s) (optional)

SPARC Schema Diagram

>>>>> gd2md-html alert: inline image link here (to images/image40.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Node and Edge counts

>>>>> gd2md-html alert: inline image link here (to images/image41.png). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Additional Datasets

CLINVAR

The ClinVar dataset (v2023-01-05) was utilized to define assertions between human genes and phenotypes. Only genes with pathogenic, likely pathogenic and pathogenic/likely pathogenic variants were considered, and we excluded associations with no assertion criteria met. To retrieve the target phenotype/disease we used MedGen IDs listed in the ClinVar dataset (also already present in the KG). Processed ClinVar dataset contains 214,040 relationships (including reverse relationships) with the following characteristics [Type: “gene_associated_with_disease_or_phenotype”, SAB: “CLINVAR”] and [type: inverse_gene_associated_with_disease_or_phenotype, S AB: “CLINVAR”] connecting HGNC to MONDO, HPO, EFO and MESH Concept nodes.

CMAP

The edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset were obtained from the Harmonizome database https://maayanlab.cloud. The dataset added 2,625,336 relationships (including reverse relationships) connecting the CHEBI and HGNC nodes with predicates “negatively_correlated_with_gene”, “inverse_negatively_correlated_with_gene”, “positively_correlated_with_gene”, “inverse_positively_correlated_with_gene” (SAB: “CMAP”).

HPOMP

This set of assertions maps human phenotype ontology (HPO) nodes to mammalian phenotype ontology (MP) nodes through the ‘is_approximately_equivalent_to’. It is essentially a set of assertions mapping human phenotype codes to mouse phenotype codes. The mappings were produced by using a software tool called PheKnowLator. There are 1,785 HPOMP mappings. These assertions can be queried by specifying the SAB property as HPOMP on the ‘is_approximately_equivalent_to’ relationship.

HGNCHPO

This set of assertions maps HGNC gene nodes to human phenotype ontology (HPO) nodes through the ‘associated_with’ relationship. There are 671,046 HGNCHPO mappings. These assertions can be queried by specifying the SAB property as HGNCHPO on the ‘associated_with’ relationship.

HCOPHGNC

This set of assertions maps mouse gene nodes (HCOP) to human gene nodes (HGNC). Mouse gene nodes are referred to as ‘HCOP’ in the Data Distillery Knowledge Graph because the HGNC Comparison of Orthology Predictions (HCOP) tool was used to generate these mappings. The ‘in_1_to_1_orthology_relationship_with’ is used to connect the HGNC and HCOP nodes. There are 67,027 HCOPHGNC mappings. These assertions can be queried by specifying the SAB property as HCOPHGNC on the ‘in_1_to_1_orthology_relationship_with’ relationship.

HCOPMP

This set of assertions maps mouse gene nodes (HCOP) to the mammalian phenotype ontology (MP) nodes through the ‘involved_in’ relationship. These mappings are the mouse version of the HGNCHPO mappings. Files from the International Mouse Phenotyping Consortium (IMPC) and Mouse Genome Informatics (MGI) were used to create this dataset. There are 234,043 HCOPMP mappings. These assertions can be queried by specifying the SAB property as HCOPMP on the ‘involved_in’ relationship.

Homo Sapiens Chromosomal Location Ontology (HSCLO)

Homo Sapiens Chromosomal Location Ontology (HSCLO) was primarily created to connect 4DN loop coordinates to the rest of the graph through the mapping between HSCLO and GENCODE. HSCLO was later utilized to connect GTEXEQTL locations in the graph as searchable nodes at 1 kbp resolution (same as 4DN). The dataset relationships as well as nodes use HSCLO as their SAB. HSCLO nodes are defined at 5 resolution levels; chromosomes, 1 Mbp, 100 kbp, 10 kbp and 1kbp with each level connects to lower level with above_(resolution level)band (e.g. “above_1Mbp_band”, “above 1_kbp_band”) and nodes at the same resolution level are connected through prcedes(resolution level)_band (e.g. “precedes_10kbp_band”). The dataset contains 3,431,155 nodes and 6,862,195 relationships (13,724,390 bidirectional).

MSIGDB

Five subsets of MSigDB v7.4 datasets were introduced as entity-gene relationships to the knowledge graph: C1 (positional gene sets), C2 (curated gene sets), C3 (regulatory target gene sets), C8 (cell type signature gene sets) and H (hallmark gene sets ). With this subset, MSIGDB Concept nodes were created for MSigDB systematic names (used as Codes excluding KEGG data). The relationships between these Concept nodes and HGNC nodes were defined using the mentioned 5 subsets where the subset information was included in the relationship SABs as “MSIGDB”.

RATHCOP

This set of assertions maps human ENSEMBL gene nodes to rat ENSEMBL gene nodes. These mappings were generated from the HCOP tool just like for the mouse to human assertions, except we used the ENSEMBL codes here instead of the HGNC codes. The ‘has_human_ortholog’ relationship is used to connect ENSEMBL Rat nodes to ENSEMBL Human nodes. There are 42,371 RATHCOP mappings and they can be queried by specifying the SAB property as RATHCOP on the ‘has_human_ortholog’ relationship.