Data Distillery Data Dictionary

[TOC]

Introduction

The Data Distillery project aims to integrate summarized (“distilled”) Common Fund data within a knowledge graph. The purpose of the Data Distillery Knowledge Graph (DDKG) is to link multiple sources of expertly curated data, thus providing data integration across multiple Common Fund data coordinating centers (DCCs). The summarized data are provided by participating DCCs and funded as part of the Common Fund Data Ecosystem (CFDE) project. The DDKG schema is based on the Unified Biomedical Knowledge Graph (UBKG), which originates from the Unified Medical Language System (UMLS). The UBKG supports the DDKG with over 180 ontologies and standards that either are native to UMLS or were explicitly added to support biomolecular data (see Figure 1). The DDKG supports queries ranging from simple to complex, and the results can be used in a range of applications that draw on Common Fund data. We include some use cases in another document <link to Use Case Document here>.

For the first phase of the project, the participating DCCs have submitted 29 different datasets for integration into the DDKG. This document outlines these 29 datasets within the DDKG and describes the schema and information for each dataset.

Figure 1: Base datasets

Base Datasets

Information on the base set of ontologies included in the Data Distillery Knowledge Graph can be found in the documentation for the Unified Biomedical Knowledge Graph (UBKG), upon which the Data Distillery is built. See Figure 1 for a general schematic.

DCC Datasets

4D Nucleome (4DN) DCC

4D Nucleome datasets

Dataset SAB(s)	4DNQ 4DNL 4DNF 4DND
DCC Website	data.4dnucleome.org
DCC	4DN-DCIC
Authority	Andy Schroeder (PM) Harvard Medical School, Boston
Source Information	Chromatin loops called from Hi-C experiments performed in select cell lines.
Purpose	Representing topologically associated domains and loops by chromosomal location can allow exploration of gene expression, genomic variation and other biological information in the context of chromatin architecture.
Description	Hi-C chromatin capture assays generate information on regions of the genome that can be located far apart along the linear sequence of DNA but are in close physical proximity in nuclear chromatin. Architectural features of the chromatin, including topologically associated domains (TADs), loops and dots, can be generated by algorithms from the results of Hi-C experiments. Loop calls from several 4D Nucleome Hi-C datasets generated from select cell lines and tissues are provided to the Data Distillery for ingestion. A subset of loop calls were generated by two different 4DN research labs on datasets from the H1-ESC human ES cell line, H1 differentiated to endoderm, and HFFc6, a human foreskin derived cell line. Loop calls from the Dekker lab were generated using the cooltools re-implementation of HICCUPS as described in Oksuz et al. 2021 https://pubmed.ncbi.nlm.nih.gov/34480151/. Loop calls from the Cremins lab were generated as described in Emerson et al. 2022 https://pubmed.ncbi.nlm.nih.gov/35676475/. Additional loop calls from the Cremins lab, also part of the data from the Emerson paper, are provided for the HCT116 colorectal cancer cell line with or without depletion of WAPL or RAD21, genes that encode proteins important for chromatin architecture. In addition, the 4DN-DCIC generated loop calls on 4 additional datasets: in situ Hi-C performed on H1-ESC or GM12878 cell lines (4DNESFSCP5L8 and 4DNES3JX38V5, respectively) and DNase-Hi-C in fetal heart tissue (4DNESZFHB53P) or RUES2 stem cells differentiated to cardiomyocytes (4DNESGTHHJAC). The loop calls from the 25 kb resolution matrices of these datasets were further filtered for loops that overlap expressed genes identified from gene expression data from the same or comparable cells and tissues.
Summarization Methodology	This document indicates the datasets and files used as input to the 4DN distilled data. For the genome-wide loop calls from the Dekker and Cremins groups, the indicated files were the direct input into the summarization process. For the 4DN-DCIC loop calls, the mcool files indicated were used to call loops, which were further summarized utilizing expression data to provide a file of loops that overlap expressed genes as described in this document. Provided loop files were further prepared for ingestion by creating four kinds of nodes: dataset nodes (SAB: ‘4DND’) whose terms contain the dataset information (assay type, lab and cell type involved); file nodes (SAB: ‘4DNF’) whose terms contain the file information; loop nodes (SAB: ‘4DNL’) attached to HSCLO nodes at the 1 kbp resolution level, corresponding to the upstream start and end and the downstream start and end of the characteristic anchors of the loop; and q-value nodes (SAB: ‘4DNQ’) corresponding to the donut q-value of the loops. These nodes are then used to create concept nodes with the connections depicted in the schematic below.
Summarization Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/4DN
Total nodes	462,178
Total edges	2,768,612

4DN Schema Diagram

[Image: images/image1.png]

4DN Node and Edge counts

SAB	Count
4DND	12
4DNF	12
4DNL	215822
4DNQ	354

Subject SAB	Predicate	Object SAB	Count
4DND	has_assay_type	EFO	12
4DND	has_assay_type	OBI	12
4DND	dataset_involved_cell_type	EFO	11
4DND	dataset_involved_cell_type	UBERON	1
4DND	dataset_has_file	4DNF	12
4DNF	file_has_loop	4DNL	215822
4DNL	loop_has_qvalue_bin	4DNQ	215822
4DNL	loop_us_start	HSCLO	215822
4DNL	loop_us_end	HSCLO	215822
4DNL	loop_ds_start	HSCLO	215822
4DNL	loop_ds_end	HSCLO	215822
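
The edge table above implies a chain from dataset to file to loop, with each loop carrying one q-value bin edge and four HSCLO anchor edges. A minimal Python sketch of that reading follows, assuming the table is transcribed by hand into a list of tuples. The names `EDGES_4DN`, `LOOP_NODE_COUNT`, and `predicates_from` are illustrative and are not part of the DDKG tooling.

```python
# Edge counts transcribed from the table above:
# (subject SAB, predicate, object SAB, count).
EDGES_4DN = [
    ("4DND", "has_assay_type", "EFO", 12),
    ("4DND", "has_assay_type", "OBI", 12),
    ("4DND", "dataset_involved_cell_type", "EFO", 11),
    ("4DND", "dataset_involved_cell_type", "UBERON", 1),
    ("4DND", "dataset_has_file", "4DNF", 12),
    ("4DNF", "file_has_loop", "4DNL", 215822),
    ("4DNL", "loop_has_qvalue_bin", "4DNQ", 215822),
    ("4DNL", "loop_us_start", "HSCLO", 215822),
    ("4DNL", "loop_us_end", "HSCLO", 215822),
    ("4DNL", "loop_ds_start", "HSCLO", 215822),
    ("4DNL", "loop_ds_end", "HSCLO", 215822),
]

# Number of 4DNL nodes, from the node count table.
LOOP_NODE_COUNT = 215822


def predicates_from(sab, edges=EDGES_4DN):
    """Return {predicate: total count} for edges whose subject has the given SAB."""
    totals = {}
    for subject, predicate, _object, count in edges:
        if subject == sab:
            totals[predicate] = totals.get(predicate, 0) + count
    return totals


loop_edges = predicates_from("4DNL")
# True when every loop-level predicate occurs exactly once per loop node.
one_edge_per_loop = all(count == LOOP_NODE_COUNT for count in loop_edges.values())
```

The same helper shows that the 12 dataset nodes each have one cell-type edge, split across EFO (11) and UBERON (1).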

Extracellular RNA Communication Program (ERCC)

ERCC RBP dataset

Dataset SAB(s)	ENSEMBL, UBERON, UNIPROTKB, ENCODE.RBS.150.NO.OVERLAP, ENCODE.RBS.HepG2, ENCODE.RBS.HepG2.K562, ENCODE.RBS.K562
DCC Website	https://exrna.org
DCC	Extracellular RNA Communication Consortium (ERCC)
Authority	Aleksandar Milosavljevic
Source Information	The genomic coordinates of eCLIP peaks of 150 RNA binding proteins (RBPs) were taken from eCLIP-seq analysis results published by the ENCODE project. Control extracellular RNA (exRNA) sequencing profiles available through the exRNA Atlas were used to draw several relationships.
Purpose	To help identify minimally invasive biomarkers of disease.
Description	Assertions describe relationships between RBPs, RBP binding sites, genes, and biofluids. RBP binding sites refer to both eCLIP peaks and a second type of genomic locus. This second type is the result of trimming the eCLIP loci so that there are no overlaps between the sets of loci from any given pair of RBPs.
Summarization Methodology	Relationships between RBPs and biofluids, and between eCLIP loci and biofluids are the result of a correlation-based analysis. This analysis was performed using the coverage of trimmed eCLIP loci within control exRNA profiles made available through the exRNA Atlas. This analysis is described in detail by LaPlante et al., Cell Genomics, 2023.
Summarization Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/ERCC/check_ERCC_submissions.ipynb
Total nodes	1,169,178
Total edges	2,431,786
Source Data DOI(s) (optional)	https://doi.org/10.1038/nmeth.3810
Source Data URL(s) (optional)	https://www.encodeproject.org/encore-matrix/?type=Experiment&status=released&internal_tags=ENCORE

ERCC RBP Schema Diagram

[Image: images/image2.png]

ERCC RBP Node and Edge counts

[Image: images/image3.png]

[Image: images/image4.png]

ERCC Regulatory Element dataset

Dataset SAB(s)	CLINGEN.ALLELE.REGISTRY, ENSEMBL, GTEXEQTL, UBERON, ENCODE.CCRE, ENCODE.CCRE.ACTIVITY, ENCODE.CCRE.CTCF, ENCODE.CCRE.H3K27AC, ENCODE.CCRE.H3K4ME3 (node SABs) ERCCREG, ERCCRBP (edge SABs)
DCC Website	https://exrna.org
DCC	Extracellular RNA Communication Consortium (ERCC)
Authority	Aleksandar Milosavljevic
Source Information	The results of ChIP-seq experiments conducted by the ENCODE project were used to identify regulatory elements active within specific tissues and their transcriptional role. Similarly, we used data published by the GTEx project to identify eQTLs active within specific tissues.
Purpose	To identify regulatory elements active within a specific tissue which are also supported by having an active eQTL within the range of its genomic coordinates.
Description	The tissue specific regulation of a gene by an eQTL is modeled using variant, tissue, eQTL, and gene nodes. The same model structure is also used for regulatory elements. In this case a “regulatory element activity” (SAB=ENCODE.CCRE.ACTIVITY) node is used as the central node rather than the eQTL node. Regulatory element activity nodes are also decorated with relationships to other nodes to assist in determining the tissue specific transcriptional role of the regulatory element. eQTL and regulatory element models are connected by a relationship between variant and regulatory element nodes.
Summarization Methodology	To summarize regulatory element data, ENCODE biosamples were grouped by their respective tissue or cell line ontology code. These groups were then further grouped by the number of samples within each biosample group. Next, within the DNase Z-score data matrix provided by ENCODE, for each larger group and each regulatory element, the number of z-scores that were above 1.64 were counted within samples of each biosample (or small) group. This process was used to build a reference distribution of counts specific to a biosample group with a specific number of members. Regulatory elements were then classified as active within a specific tissue or cell type if the count of z-scores greater than 1.64 within ENCODE biosamples belonging to that group was above the median value of the reference distribution. Only regulatory elements classified as active in at least one tissue or cell type are included. The process described above was repeated for the H3K4me3, H3K27Ac, and CTCF z-score data matrices to decorate regulatory element activity nodes with relationships to other nodes to help the user identify the transcriptional role of each regulatory element.
Summarization Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/ERCC/check_ERCC_submissions.ipynb
Total nodes	2,918,828
Total edges	14,897,093
Source Data DOI(s) (optional)	https://doi.org/10.1038/s41586-020-2493-4
Source Data URL(s) (optional)	https://screen.wenglab.org/ https://www.gtexportal.org/home/
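
The activity rule in the summarization methodology above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ERCC pipeline: the z-score matrix is taken to be a list of rows (one per regulatory element) with one column per ENCODE sample, biosample groups are given as lists of column indices, and the reference distribution pools the counts of all groups that have the same number of members. The function names and the toy data are hypothetical; only the 1.64 threshold and the median comparison come from the text.

```python
from statistics import median

Z_THRESHOLD = 1.64  # threshold stated in the methodology


def count_above(z_row, sample_idx):
    """Count the samples of one biosample group whose z-score exceeds the threshold."""
    return sum(1 for i in sample_idx if z_row[i] > Z_THRESHOLD)


def classify_active(z, groups):
    """Return {group name: [bool per regulatory element]}.

    z      -- list of rows, one row of z-scores per regulatory element
    groups -- {group name: list of sample (column) indices}
    """
    # Pool biosample groups by their number of members (assumed grouping).
    by_size = {}
    for name, idx in groups.items():
        by_size.setdefault(len(idx), []).append(name)

    active = {}
    for names in by_size.values():
        counts = {n: [count_above(row, groups[n]) for row in z] for n in names}
        # Reference distribution: all counts from groups of this size.
        reference = [c for n in names for c in counts[n]]
        cutoff = median(reference)
        for n in names:
            active[n] = [c > cutoff for c in counts[n]]
    return active


# Toy data: 3 regulatory elements, 4 samples, two groups of 2 samples each.
toy_z = [
    [2.0, 2.5, 0.1, 0.2],
    [0.0, 1.0, 1.7, 0.3],
    [0.5, 0.2, 0.0, 0.1],
]
toy_groups = {"liver": [0, 1], "lung": [2, 3]}
toy_active = classify_active(toy_z, toy_groups)
```

In the toy data the third element is never active, so under the rule stated above it would be excluded from the graph.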

ERCC Regulatory Element Schema Diagram

[Image: images/image5.png]

ERCC Regulatory Element Node and Edge Counts

[Image: images/image6.png]

[Image: images/image7.png]

GlyGen

GlyGen Datasets

Dataset SAB(s)	FALDO, GLYCOCOO, GLYCORDF, UNIPROTKB, PROTEOFORM, GLYCANS
DCC Website	https://www.glygen.org/
DCC	GlyGen
Authority	Raja Mazumder (PI) George Washington University; Mike Tiemeyer (PI) University of Georgia
Source Information	Data for GlyGen are retrieved from multiple glycomics databases (e.g. GlyTouCan, GlyConnect, MatrixDB), proteomics databases (e.g. UniProtKB) and other domain databases (e.g. Ensembl, RefSeq, BioMuta, OMA, MGI, Bgee). All data are transformed into a standardized representation and integrated in GlyGen.
Purpose	Provide computational and informatics resources and tools for glycosciences research. Integrate data and knowledge from diverse disciplines relevant to glycobiology. Address needs inside and outside the glycoscience community.
Description	GlyGen is a data integration and dissemination project for carbohydrate and glycoconjugate related data. GlyGen retrieves information from multiple international data sources and integrates and harmonizes this data. This web portal allows exploring this data and performing unique searches that cannot be executed in any of the integrated databases alone.
Summarization Methodology	The data ingested for the knowledge graph are from ontologies associated with the glycan and proteoform domains. Select nodes and edges for glycans are retrieved from GlycoCoO and GlycoRDF, ontologies that describe the properties of glycans. The assertion data received from GlyGen in N-Triples format (glycan.nt and proteoform.nt) were imported into the Neo4j environment using the n10s plug-in functions. Once the data were imported for each of the glycan and proteoform datasets, subgraphs were created. Finally, the resulting graph nodes and edges were exported as .csv files using APOC plug-in procedures. The resulting nodes and edges were reformatted by curating the relationship names and adding SABs for all entities (either by using existing SABs, e.g. UNIPROTKB and GLYTOUCAN, or by creating custom SABs such as GLYGEN.LOCATION or GLYCOPROTEIN) and saved as OWLNETS_node_metadata.tsv and OWLNETS_edgelist.tsv for ingestion. More information on FALDO can be found here: https://bioportal.bioontology.org/ontologies/FALDO. More information on GlycoRDF can be found here: https://github.com/glycoinfo/GlycoRDF. More information on GlycoCoO can be found here: https://github.com/glycoinfo/GlycoCoO. Data under SAB FALDO was ingested to assist in ..
Summarization Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/GLYGEN/GlyGen_workfolw.md
Total nodes	154, 241770 (PROTEOFORM), 182269 (GLYCANS)
Total edges	137, 455469 (PROTEOFORM), 464659 (GLYCANS)
Source Data DOI(s) (optional)	Raw data DOI
Source Data URL(s) (optional)	Download from https://sparql.glygen.org/. Data file: https://sparql.glygen.org/ln2triplestoredata/triples.tar.gz

GLYGEN Schema Diagram

[Image: images/image8.png]

Node and Edge counts (FALDO)

[Image: images/image9.png]

[Image: images/image10.png]

GlyGen Node and Edge counts (GLYCOCOO)

[Image: images/image11.png]

[Image: images/image12.png]

Node and Edge counts (GLYCORDF)

[Image: images/image13.png]

[Image: images/image14.png]

Node and Edge Counts (PROTEOFORM)

SAB	Count
UNIPROTKB	16810
GLYCOPROTEIN	63441
GLYCOPROTEIN.EVIDENCE	2088
GLYCOSYLATION.SITE	52451
GLYGEN.LOCATION	52450
UNIPROTKB.ISOFORM	8406
GLYGEN.CITATION	2088
GP.ID2PRO	61353
GLYTOUCAN	1554
AMINO.ACID	7


Subject SAB	Predicate	Object SAB	Count
UNIPROTKB	has_isoform	UNIPROTKB.ISOFORM	8404
GLYCOPROTEIN	has_evidence	GLYCOPROTEIN.EVIDENCE	120158
GLYCOPROTEIN	sequence	UNIPROTKB.ISOFORM	61353
GLYCOPROTEIN.EVIDENCE	citation	GLYGEN.CITATION	2088
GLYCOPROTEIN	has_pro_entry	GP.ID2PRO	61353
GLYCOPROTEIN	glycosylated_at	GLYCOSYLATION.SITE	52450
GLYCOSYLATION.SITE	location	GLYGEN.LOCATION	52450
GLYCOSYLATION.SITE	has_saccharide	GLYTOUCAN	44763
GLYGEN.LOCATION	has_amino_acid	AMINO.ACID	52450

Node and Edge Counts (GLYCANS)

SAB	Count
GLYGEN.GLYCOSYLATION	91
GLYCOSYLTRANSFERASE.REACTION	91
GLYTOUCAN	33755
GLYGEN.RESIDUE	80
GLYGEN.SRC	30986
GLYGEN.GLYCOSEQUENCE	117146
GLYCAN.MOTIF	120


Subject SAB	Predicate	Object SAB	Count
GLYGEN.GLYCOSYLATION	has_enzyme_protein	UNIPROTKB	91
GLYCOSYLTRANSFERASE.REACTION	has_enzyme_protein	UNIPROTKB	91
GLYTOUCAN	is_from_source	GLYGEN.SRC	30986
GLYTOUCAN	has_glycosequence	GLYGEN.GLYCOSEQUENCE	117146
GLYGEN.RESIDUE	attached_by	GLYGEN.GLYCOSYLATION	349
GLYTOUCAN	synthesized_by	GLYCOSYLTRANSFERASE.REACTION	210563
GLYTOUCAN	has_motif	GLYCAN.MOTIF	19321
GLYTOUCAN	has_canonical_residue	GLYGEN.RESIDUE	86033
GLYGEN.RESIDUE	has_parent	GLYGEN.RESIDUE	79

Genotype Tissue Expression (GTEx)

GTEx datasets

Dataset SAB(s)	GTEXEXP, GTEXEQTL, EXPBINS, PVALUEBINS
DCC Website	https://www.gtexportal.org/home/
DCC	GTEx
Authority	Kristen Ardlie
Source Information	Documentation on the sources of GTEx data can be found here: https://biospecimens.cancer.gov/resources/sops/docs/GTEx_SOPs/BBRB-PR-0004-W1%20GTEx%20Tissue%20Harvesting%20Work%20Instruction.pdf
Purpose	To include bulk RNA-seq gene expression levels from adult tissues as well as correlations between genotype and tissue-specific gene expression levels as expression quantitative trait loci (eQTLs) that identify regions of the genome that influence whether and how much a gene is expressed.
Description	The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. This database includes expression levels for genes by tissue in terms of transcripts per million (TPM). The database also contains the p-values and relationships between loci and genes as expression quantitative trait loci (eQTLs).
Summarization Methodology	Three types of GTEx data were summarized and ingested into the knowledge graph, listed below by SAB: (1) GTEXEXP - Transcripts per million (TPM) values, which represent gene-tissue expression levels, were ingested as is, except that edges to ‘bin nodes’ (EXPBINS) were created. For example, a GTEXEXP node with a TPM of 10.5 will have an edge to the bin node that represents [10,11] TPM. (2) GTEXEQTL - GTEx eQTLs were filtered to include only those that are present in every tissue. This reduced the total set of eQTLs to ~2 million. P-values for the eQTLs are also included in the graph; however, they are represented as bin nodes (PVALUEBINS), just like the TPM values for the GTEXEXP dataset. (3) GTEXCOEXP - A GTEx co-expression dataset was made by first calculating the Pearson’s correlation coefficient for all genes in the intersection of GTEx with the HGNC master list, separately for each of the 54 tissues listed in GTEx, using the provided TPMs. Pairs of genes with a correlation coefficient > 0.99 were then tagged as strongly correlated in each tissue and reported as assertions, with relationship types and counts given in the GTEXCOEXP table in this section.
Summarization Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/GTEx
Total nodes	6,280,011
Total edges	31,904,034
Source Data DOI(s) (optional)	Raw data DOI
Source Data URL(s) (optional)	https://www.gtexportal.org/home/datasets (GTEx_Analysis_v8_eQTL.tar and GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz)
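
The TPM binning described in the summarization methodology can be illustrated with a short Python sketch. It assumes unit-width bins, which matches the single example given (10.5 TPM maps to the [10,11] bin). The actual EXPBINS widths and the handling of values that fall exactly on a bin edge are not stated here, so `tpm_bin` and its `width` parameter are hypothetical.

```python
import math


def tpm_bin(tpm, width=1.0):
    """Return the (lower, upper) bounds of the bin containing a TPM value.

    Assumes fixed-width bins starting at 0; a value exactly on a boundary
    is placed in the bin that starts at that boundary.
    """
    if tpm < 0:
        raise ValueError("TPM values cannot be negative")
    lower = math.floor(tpm / width) * width
    return (lower, lower + width)


# The example from the text: 10.5 TPM falls in the bin spanning 10 to 11.
example_bin = tpm_bin(10.5)
```

A GTEXEXP node would then get a single edge to the bin node whose bounds are returned, rather than storing the raw value only.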

GTEXEXP Schema Diagram

[Image: images/image15.png]

GTEXEQTL Schema Diagram

[Image: images/image16.png]

GTEx Node and Edge counts (GTEXEXP/EXPBINS)

[Image: images/image17.png]

[Image: images/image18.png]

Node and Edge counts (GTEXEQTL/PVALUEBINS)

[Image: images/image19.png]

[Image: images/image20.png]

Node and Edge counts (GTEXCOEXP)

[Image: images/image21.png]

For GTEXCOEXP, all nodes are existing HGNC nodes. Edges and counts are listed below.

Predicate	Count	Predicate	Count
coexpression_Adipose___Subcutaneous	15485	coexpression_Esophagus___Gastroesophageal_Junction	12874
coexpression_Adipose___Visceral_(Omentum)	2646	coexpression_Esophagus___Mucosa	20463
coexpression_Adrenal_Gland	37897	coexpression_Esophagus___Muscularis	79416
coexpression_Artery___Aorta	642521	coexpression_Fallopian_Tube	769999
coexpression_Artery___Coronary	612950	coexpression_Heart___Atrial_Appendage	168057
coexpression_Artery___Tibial	10237	coexpression_Heart___Left_Ventricle	600676
coexpression_Bladder	529181	coexpression_Kidney___Cortex	583782
coexpression_Brain___Amygdala	22221	coexpression_Kidney___Medulla	10461695
coexpression_Brain___Anterior_cingulate_cortex_(BA24)	102887	coexpression_Liver	20645
coexpression_Brain___Caudate_(basal_ganglia)	7309	coexpression_Lung	17156
coexpression_Brain___Cerebellar_Hemisphere	40983	coexpression_Minor_Salivary_Gland	47164
coexpression_Brain___Cerebellum	106195	coexpression_Muscle___Skeletal	4061
coexpression_Brain___Cortex	764276	coexpression_Nerve___Tibial	12460
coexpression_Brain___Frontal_Cortex_(BA9)	27760	coexpression_Ovary	20177
coexpression_Brain___Hippocampus	84051	coexpression_Pancreas	24183
coexpression_Brain___Hypothalamus	185487	coexpression_Pituitary	28152
coexpression_Brain___Nucleus_accumbens_(basal_ganglia)	1198329	coexpression_Prostate	4197
coexpression_Brain___Putamen_(basal_ganglia)	1146393	coexpression_Skin___Not_Sun_Exposed_(Suprapubic)	1516
coexpression_Brain___Spinal_cord_(cervical_c_1)	267533	coexpression_Skin___Sun_Exposed_(Lower_leg)	84793
coexpression_Brain___Substantia_nigra	143792	coexpression_Small_Intestine___Terminal_Ileum	73157
coexpression_Breast___Mammary_Tissue	3094	coexpression_Spleen	375064
coexpression_Cells___Cultured_fibroblasts	11652	coexpression_Stomach	12964
coexpression_Cells___EBV_transformed_lymphocytes	90051	coexpression_Testis	141440
coexpression_Cervix___Ectocervix	817624	coexpression_Thyroid	31593
coexpression_Cervix___Endocervix	805252	coexpression_Uterus	16498
coexpression_Colon___Sigmoid	11358	coexpression_Vagina	20584
coexpression_Colon___Transverse	22786	coexpression_Whole_Blood	13845
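
The GTEXCOEXP procedure (Pearson correlation per tissue, keeping gene pairs above 0.99) can be sketched as follows. This is an illustration under assumptions rather than the pipeline code: each gene is taken to have a vector of TPM values within one tissue, the gene names and values are toy data, and the `predicate_for` naming rule is inferred from the predicate names in the table above (spaces and hyphens in the tissue name replaced by underscores), not confirmed by the source.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant expression: treat as uncorrelated
    return cov / (sd_x * sd_y)


def coexpressed_pairs(tpm_by_gene, threshold=0.99):
    """Return gene pairs whose correlation within one tissue exceeds the threshold."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(tpm_by_gene), 2)
        if pearson(tpm_by_gene[g1], tpm_by_gene[g2]) > threshold
    ]


def predicate_for(tissue):
    """Build a predicate name from a tissue name (inferred naming rule)."""
    return "coexpression_" + tissue.replace(" ", "_").replace("-", "_")


# Toy TPM vectors for three hypothetical genes within a single tissue.
toy_tpm = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.1],
    "GENE_C": [4.0, 1.0, 3.0, 2.0],
}
toy_pairs = coexpressed_pairs(toy_tpm)
toy_predicate = predicate_for("Adipose - Subcutaneous")
```

Each retained pair would become one edge between two existing HGNC nodes, with the tissue encoded in the predicate.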

The Human BioMolecular Atlas Program (HuBMAP)

HuBMAP data sets

Dataset SAB(s)	AZ, HUBMAP
DCC Website	https://hubmapconsortium.org/
DCC	HuBMAP
Authority	Jonathan Silverstein, Phil Blood
Source Information
Purpose	HuBMAP data provides tissue, cell-type and gene specific markers from single-cell data. The purpose of the HuBMAP/AZ data is to provide cell-type-specific gene expression markers from single-cell experiments across each tissue.
Description	HuBMAP is working to catalyze the development of a framework for mapping the human body at single cell resolution and to develop the tools to create an open, global atlas of the human body at the cellular level. In this database, we include cell-type specific gene markers from the Azimuth project from a subset of tissues including heart, liver and kidney.
Summarization Methodology	See https://azimuth.hubmapconsortium.org/references/
Summarization Methodology Code Repository URL	https://azimuth.hubmapconsortium.org/references/
Total nodes	769
Total edges	910

HuBMAP Az Schema Diagram

[Image: images/image22.png]

HuBMAP Az Node and Edge counts

[Image: images/image23.png]

[Image: images/image24.png]

Illuminating the Druggable Genome (IDG)

IDG Datasets

Dataset SAB(s)	IDGP (compound/protein), IDGD (compound/disease) (both are edge SABs)
DCC Website	https://pharos.nih.gov/ https://commonfund.nih.gov/idg
DCC	Illuminating the Druggable Genome Data Coordinating Center - Engagement Plan with the CFDE
Authority	Christophe Lambert (PI), University of New Mexico Health Sciences Center
Source Information	Relationships between compounds, diseases, and proteins drawn from the IDG Target Central Resource Database (TCRD) hosted at https://pharos.nih.gov and at DrugCentral https://drugcentral.org/.
Purpose	The Illuminating the Druggable Genome (IDG) project elucidates the relationships between diseases, targets, and compounds, providing insights into lesser-known proteins, empowering researchers to discover novel therapeutic targets and accelerate drug development for various diseases.
Description	The IDG contributions to the Knowledge Graph include compounds, diseases and proteins and their relationships. A full description of IDG data sources is here: https://pharos.nih.gov/about. The Target Central Resource Database (TCRD) is the central resource supporting the IDG-KMC. TCRD has information about human drug targets with a focus on GPCRs, kinases, and ion channels. TCRD categorizes all drug targets into four Target Development Levels (TDLs) by making use of activity thresholds. Protein drug targets from TCRD with known bioactive compounds were incorporated into the IDG KG. Also included in our KG are diseases from TCRD that have known “indication” relationships to approved drugs from DrugCentral. The IDG KG can be used to explore compounds and proteins related to a specific disease, among other similar queries. The IDG KG can be combined with data from other DCCs such as LINCS, GTEx, and others to create interesting scientific use cases.
Summarization Methodology	Compound nodes were sourced using TCRD, DrugCentral, and PubChem using PUBCHEM_CID ontology. Specifically, chemical compounds from DrugCentral were included if TCRD indicated known bioactivity against protein targets. Compound node properties include SMILES as the node_definition, drugbank ID as node_dbxrefs, name as node_label, and ‘IDG’ as the node_namespace. The PubChem API was used to assign the node_synonyms and node_dbxrefs node properties. Protein nodes were obtained from TCRD and DrugCentral, and rely on UNIPROTKB ontology. Protein targets from DrugCentral and TCRD with known bioactive compounds were included. A protein’s symbol is denoted as node_label, protein names as node_definition, EnsEMBL IDs as node_dbxrefs, and ‘IDG’ as node_namespace. The disease nodes use SNOMED_US ontology and are sourced from TCRD and DrugCentral. Diseases from DrugCentral and TCRD with known indication relationships to approved drugs were included. The disease OMOP concept names are included as the node property node_label. Additionally, OMOP IDs are included as node_dbxrefs and ‘IDG’ as the node_namespace. The bioactivity relationship is defined between compounds and proteins. While ChEMBL and PubChem offer complex and differing ontologies for bioactivity relationships, for the sake of simplicity and efficiency in this early version we use the custom term "bioactivity". The type of bioactivity measurement (e.g. IC50, Kd, EC50) is included as evidence_class. The indication relationship links compounds and diseases. For simplicity, we introduce the custom, simple term “indication”. These indication relationships are defined between diseases and approved drugs from DrugCentral.
Summarization Methodology Code Repository URL	Code by IDG team: https://github.com/unmtransinfo/cfde-distillery Code from core DD team (further processing and formatting): https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/IDG/check_IDG.ipynb
Total nodes	Total: 331788; compounds: 327951; disease: 1472; protein: 2365
Total edges	Total: 463972; compound/protein (Bioactivity): 454957; compound/disease (Indication): 9015
Source Data DOI(s) (optional)	Raw data DOI
Source Data URL(s) (optional)	https://app.globus.org/file-manager?origin_id=24c2ee95-146d-4513-a1b3-ac0bfdb7856f&origin_path=%2Fprojects%2Fdata-distillery%2FImport%2FIDG%2F
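
The compound node properties listed in the summarization methodology can be pictured as a simple record. The Python sketch below is illustrative only: the property names come from the text, but the `node_id` format, the use of lists for cross-references and synonyms, and the `compound_node` helper are assumptions. Aspirin is used as a familiar example and is not claimed to be a row in the IDG submission.

```python
def compound_node(pubchem_cid, name, smiles, drugbank_id, synonyms=()):
    """Assemble the node properties for one compound as described above."""
    return {
        # Assumed "SAB:code" identifier format for the PUBCHEM_CID ontology.
        "node_id": f"PUBCHEM_CID:{pubchem_cid}",
        "node_label": name,                # compound name
        "node_definition": smiles,         # SMILES string
        "node_dbxrefs": [drugbank_id],     # DrugBank ID as a cross-reference
        "node_synonyms": list(synonyms),   # assigned via the PubChem API in the real workflow
        "node_namespace": "IDG",
    }


# Illustrative record only.
aspirin = compound_node(
    pubchem_cid=2244,
    name="aspirin",
    smiles="CC(=O)OC1=CC=CC=C1C(=O)O",
    drugbank_id="DB00945",
    synonyms=["acetylsalicylic acid"],
)
```

Protein and disease nodes follow the same pattern with different sources for each property, as listed in the methodology.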

IDG Schema Diagram

[Image: images/image25.png]

IDGP Node and Edge counts (compound/protein, SAB = IDGP)

[Image: images/image26.png]

[Image: images/image27.png]

IDGD Node and Edge counts (compound/disease, SAB = IDGD)

[Image: images/image28.png]

[Image: images/image29.png]

For IDGP and IDGD, the numbers shown in the tables are numbers of IDG edges connecting existing SAB nodes as indicated.

Gabriella Miller Kids First (GMKF)

KF Datasets

Dataset SAB(s)	KFGENEBIN, KFPT, KFCOHORT
DCC Website	https://kidsfirstdrc.org/
DCC	Gabriella Miller Kids First (GMKF) Pediatric Research Program Data Resource Center (DRC)
Authority	Deanne Taylor (PI), Children’s Hospital of Philadelphia
Source Information	Genomic and phenotypic data, broadly summarized from trio cohorts with cardiac birth defects from the Pediatric Cardiac Genetics Consortium cohort in Kids First, cohort SD_PREASA7S.
Purpose	The main purpose of the KF DRC is to better understand the genetic causes and links between childhood cancer and structural birth defects.
Description	The Kids First DRC is a collaborative pediatric research effort created to accelerate data-driven discoveries and the development of novel precision-based approaches for children diagnosed with cancer or a structural birth defect using large genomic datasets. The Kids First DRC is comprised of integrated core teams that support development of leading-edge big data infrastructure and provide the necessary resources and tools to empower researchers and clinicians.
Summarization Methodology	Variant data from a Congenital Heart Defects (CHD) cohort was queried and filtered using the Kids First variant workbench platform. We filtered for variants that were scored as ‘high impact’ by the variant effect predictor tool (VEP). Variant counts per gene were then computed by counting how many times each gene appeared. The number of variations per gene is stored in the ‘value’ property of the SAB KFGENEBIN Code nodes. Kids First cohorts are stored in the graph as their own nodes and have an SAB of KFCOHORT. Patient IDs from the CHD cohort have also been ingested into the graph as their own nodes and have an SAB of KFPT.
Summarization Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/KidsFirst
Total nodes (rows)	18,719
Total edges (rows)	76,690
Source Data DOI(s) (optional)	Raw data DOI
Source Data URL(s) (optional)	Raw data URL
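
The per-gene counting step described under Summarization Methodology can be sketched as follows. This is a minimal illustration and not the DCC's actual workflow (see the code repository listed above): the record layout, the field names `gene` and `impact`, and the gene labels are hypothetical, and the example assumes the high-impact filter corresponds to VEP's `HIGH` impact category.

```python
from collections import Counter

# Hypothetical variant records; field names and values are illustrative only.
variants = [
    {"gene": "GENE_A", "impact": "HIGH"},
    {"gene": "GENE_A", "impact": "HIGH"},
    {"gene": "GENE_B", "impact": "MODERATE"},
    {"gene": "GENE_B", "impact": "HIGH"},
]

def count_high_impact_per_gene(records):
    """Count how many high-impact variants were observed for each gene."""
    return Counter(r["gene"] for r in records if r["impact"] == "HIGH")

# In the graph, each per-gene count would be stored in the 'value'
# property of a KFGENEBIN Code node.
gene_counts = count_high_impact_per_gene(variants)
```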

KF Schema Diagram

[Image: images/image30.png]

KF Node and Edge counts

[Image: images/image31.png]

[Image: images/image32.png]

The Library of Integrated Network-Based Cellular Signatures (LINCS)

LINCS datasets

Dataset SAB(s)	LINCS (edge SAB only)
DCC Website	https://lincsproject.org/
DCC	Library of Integrated Network-Based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC)
Authority	Avi Ma’ayan (PI), Icahn School of Medicine at Mount Sinai
Source Information	Gene expression changes resulting from drug/small molecule perturbations across cell lines, and similarity between drugs/small molecules based on their LINCS L1000 gene expression signatures
Purpose	Understand cellular responses to various drug and pre-clinical compound treatments through L1000 transcriptomics assays
Description	The LINCS assertions include drug-gene associations and drug-drug similarity associations computed from the LINCS L1000 consensus signatures dataset. Each drug is linked to the top 25 most up-regulated and top 25 most down-regulated genes in the L1000 consensus signatures for the drug/small-molecule, as well as to the top 5 most similar other drugs in the dataset based on the correlation between the consensus signatures for each drug.
Summarization Methodology	Level 3 L1000 profiles, drug metadata, and gene metadata were first downloaded from CLUE.io. The L1000 Level 5 signatures were then computed using the Characteristic Direction method [BMC Bioinformatics 15, 79 (2014)]. For each signature, replicate L1000 profiles for a given perturbagen and dosage were compared against all other L1000 profiles from the same cell line batch. Consensus signatures for each drug were then computed by taking the mean of all gene expression vectors corresponding to the given drug across cell lines, timepoints, and dosages. Drugs were filtered to only those with known PubChem IDs in the original CLUE.io metadata, resulting in a final set of 4,523 drugs. The top 25 up- and down-regulated genes in each consensus signature with known Ensembl IDs from the metadata were determined by the greatest positive and negative Characteristic Direction coefficients, respectively. In total, 225,509 edges and 4,419 unique genes are represented in this collection of knowledge graph assertions. Additionally, a drug-drug similarity matrix was generated by computing the cosine similarity between all possible pairs of the consensus drug signatures. For each drug, the top 5 other drugs with the greatest positive cosine similarity values were retained. Duplicate edges were removed, resulting in 20,785 total edges representing consensus signature-based drug-drug similarity between the 4,523 drugs with known PubChem IDs.
Summarization Methodology Code Repository URL	The methods are described in the following publication: Evangelista, J.E., Clarke, D.J.B., Xie, Z. et al. Toxicology knowledge graph for structural birth defects. Commun Med 3, 98 (2023). The code to produce the assertions can be found at: https://github.com/nih-cfde/ReproToxTables
Total nodes	8,942 (drugs: 4,523; genes: 4,419)
Total edges	246,294 (drug-gene: 225,509; drug-drug: 20,785)
Source Data DOI(s) (optional)	N/A
Source Data URL(s) (optional)	https://maayanlab.cloud/sigcom-lincs/#/Download
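
The consensus, top-gene, and drug-drug similarity steps described under Summarization Methodology can be sketched in plain Python as below. This is an illustrative sketch under simplifying assumptions, not the code from the repository above: the function names, drug and gene labels, and toy vectors are hypothetical, and the signatures are assumed to be already computed coefficient vectors. In the dataset itself, k is 25 for genes and 5 for similar drugs.

```python
import math

def consensus(vectors):
    """Element-wise mean of equal-length expression vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def top_genes(signature, genes, k):
    """Return the k most up-regulated and k most down-regulated genes."""
    ranked = sorted(zip(genes, signature), key=lambda gs: gs[1], reverse=True)
    up = [g for g, s in ranked[:k] if s > 0]
    down = [g for g, s in ranked[::-1][:k] if s < 0]
    return up, down

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(drug, signatures, k):
    """Top-k other drugs with the greatest positive cosine similarity."""
    scores = [(other, cosine(signatures[drug], sig))
              for other, sig in signatures.items() if other != drug]
    positive = [(d, s) for d, s in scores if s > 0]
    return sorted(positive, key=lambda ds: ds[1], reverse=True)[:k]

# Toy data: three hypothetical drugs measured over four hypothetical genes.
genes = ["g1", "g2", "g3", "g4"]
signatures = {
    "drugA": consensus([[1.0, -2.0, 0.5, 0.0], [3.0, -2.0, 0.5, 0.0]]),
    "drugB": [2.0, -1.0, 0.0, 0.0],
    "drugC": [-2.0, 2.0, 0.0, 0.0],
}
up, down = top_genes(signatures["drugA"], genes, k=2)
neighbors = most_similar("drugA", signatures, k=1)
```

Keeping only positive similarities mirrors the description above, where each drug retains the other drugs with the greatest positive cosine similarity.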

LINCS Schema Diagram

[Image: images/image33.png]

Node and Edge counts

[Image: images/image34.png]

[Image: images/image35.png]

For the SAB LINCS, the numbers shown in the tables are numbers of edges connecting existing SAB nodes as indicated.

The Molecular Transducers of Physical Activity Consortium (MoTrPAC)

MoTrPAC datasets

Dataset SAB(s)	MOTRPAC
DCC Website	https://motrpac-data.org
DCC	Molecular Transducers of Physical Activity Consortium (MoTrPAC) Bioinformatics Center (BIC)
Authority	Euan Ashley MD PhD (PI), Matthew Wheeler MD PhD (PI)
Source Information	Differential gene expression changes derived from RNA-seq data of young adult rats (6 months old) performing endurance training exercise, measured at the 1-week, 2-week, 4-week and 8-week time points.
Purpose	The Molecular Transducers of Physical Activity Consortium (MoTrPAC) aims to elucidate how exercise improves health and ameliorates diseases by building a map of the molecular responses to endurance exercise.
Description	MoTrPAC is a multi-site collaboration across the US encompassing various scientific disciplines: preclinical animal study sites and human clinical exercise sites, which perform the exercise testing and biospecimen collection; a consortium coordinating center and biorepository, which manages sample collection, distribution of samples, and consortium logistics; chemical analysis sites, which are responsible for omics analysis of the samples collected; and a bioinformatics center, which collaboratively analyzes and maps the data generated by the other sites and disseminates the data and other resources to the public. The animal studies enable analysis of the effects of exercise on many different tissues that are not readily obtainable in humans, whereas the collection of accessible human tissues (muscle, blood, and adipose) will permit the analysis of the direct effect of exercise in humans. Additional information can be found at the main consortium page (https://motrpac.org) or at the data portal (https://motrpac-data.org). The MoTrPAC study is divided into two main parts, animal (rats) and human, with multiple phases or interventions in each of them. Preclinical animal study sites conduct the endurance exercise and training intervention in young adult (6-month-old) and middle-aged adult (18-month-old) rats, while clinical study sites conduct the human endurance and resistance training interventions in pediatric participants, adults and highly active adults.
Summarization Methodology
Summarization Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/MoTrPAC/MOTRPAC.ipynb
Total nodes	16,149
Total edges	25,714
Source Data DOI(s) (optional)	Raw data DOI
Source Data URL(s) (optional)	Raw data URL

MOTRPAC Schema Diagram

[Image: images/image36.png]

Node and Edge counts

[Image: images/image37.png]

[Image: images/image38.png]

Metabolomics Workbench (MW)

Dataset SAB(s)	MW (REFMET; metabolite nodes)
DCC Website	https://www.metabolomicsworkbench.org/
DCC	Metabolomics Workbench
Authority	Professor Shankar Subramaniam (PI)
Source Information	Gene-metabolite relationships: MW database tables based on KEGG and other resources. Disease-metabolite relationships: publication based on HMDB data (https://pubmed.ncbi.nlm.nih.gov/32426349/). Cell-metabolite relationships: MW database tables for data submitted to NMDR.
Purpose	Understand what metabolites may be regulated by various genes and their spatial (anatomical) and disease context.
Description	The National Institutes of Health (NIH) Common Fund Metabolomics Program was developed with the goal of increasing national capacity in metabolomics by supporting the development of next generation technologies, providing training and mentoring opportunities, increasing the inventory and availability of high quality reference standards, and promoting data sharing and collaboration. In support of this effort, the Metabolomics Common Fund's National Metabolomics Data Repository (NMDR), housed at the San Diego Supercomputer Center (SDSC), University of California, San Diego, has developed the Metabolomics Workbench. The Metabolomics Workbench serves as a national and international repository for metabolomics data and metadata and provides analysis tools and access to metabolite standards, protocols, tutorials, training, and more. NMDR houses data on metabolomics studies conducted by various centers and research laboratories across the nation and the world, spanning many species, sample sources, diseases, metabolomics experimental techniques and metabolite classes [https://www.metabolomicsworkbench.org/data/browse.php]. The data we have shared with the Data Distillery partnership is a key subset of all the data in NMDR, centered around metabolites. Specifically, we have shared disease-metabolite, gene-metabolite and cell/anatomy (sample source)-metabolite relationships, which, when integrated with data from other DCCs and external resources, have the potential to address interesting biological questions.
Summarization Methodology	Gene-Metabolite: Human genes catalyzing metabolic reactions and their associated metabolites were obtained from MW database tables. The HGNC ID was used as the metabolic gene node_id, and its approved symbol and name were used as node_label and node_definition, respectively. UMLS, ENTREZ and ENSEMBL IDs were used as node_dbxrefs. For the edges, the Subject (Gene: HGNC ID) was related to the Object (Metabolite: PUBCHEM_CID) by the Predicate (RO_0002566: Causally influences). Disease-Metabolite: Disease-metabolite entities and relationships were deduced from the publication (PMID: 32426349) based on HMDB. The PUBCHEM_CID/HMDB ID was used as node_id for the metabolite. Similarly, disease entities were encoded with DOID or HPO IDs. UMLS, PUBCHEM_CID, DRUGBANK and REFMET were used as node_dbxrefs. For the edges, the Subject (Metabolite: PUBCHEM_CID/HMDB ID) was related to the Object (Disease: DOID/HPO) by the Predicate (RO_0003308: Correlated with condition). Cell-Metabolite: Metabolite-anatomy context (cell/tissue association) was obtained from the MW database. The cell/tissue entity node_id was encoded with UBERON, CL and CLO IDs and cross-referenced with UMLS. For the edges, the Subject (Spatial context: UBERON/CL/CLO) was related to the Object (Metabolite: PUBCHEM_CID) by the Predicate (RO_0003000: Produces).
Summarization Methodology Code Repository URL	https://github.com/mano-at-sdsc/MW_DataDistillery
Total nodes	10,009
Total edges	51,271
Source Data DOI(s) (optional)	Raw data DOI
Source Data URL(s) (optional)	Raw data URL
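
The three Subject-Predicate-Object patterns described under Summarization Methodology can be sketched as a small data structure. This is a minimal sketch for illustration: the `Edge` type, the `make_edge` helper, the relationship-kind keys, and the placeholder identifiers are hypothetical and do not reproduce the submitted edge files. Only the predicate identifiers come from the text above.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Predicate identifiers as listed in the Summarization Methodology.
PREDICATES = {
    "gene-metabolite": "RO_0002566",     # causally influences
    "metabolite-disease": "RO_0003308",  # correlated with condition
    "anatomy-metabolite": "RO_0003000",  # produces
}

def make_edge(kind, subject_id, object_id):
    """Build one Subject-Predicate-Object assertion for the given kind."""
    return Edge(subject_id, PREDICATES[kind], object_id)

# Placeholder identifiers (hypothetical, not real database entries).
edges = [
    make_edge("gene-metabolite", "HGNC:0000", "PUBCHEM_CID:0000"),
    make_edge("metabolite-disease", "PUBCHEM_CID:0000", "DOID:0000"),
    make_edge("anatomy-metabolite", "UBERON:0000", "PUBCHEM_CID:0000"),
]
```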

MW Schema Diagram

[Image: images/image39.png]

Node and Edge Counts

SAB	Count
UBERON	67
CL	12
CLO	3
HGNC	1061
PUBCHEM	8543
HMDB	18
DOID	276
HPO	32

Subject SAB	Predicate	Object SAB	Count
UBERON	produces	PUBCHEM	34471
CL	produces	PUBCHEM	6777
CLO	produces	PUBCHEM	536
HGNC	causally influences	PUBCHEM	5527
PUBCHEM	correlated with condition	DOID	3856
HMDB	correlated with condition	DOID	27
PUBCHEM	correlated with condition	HPO	77

For the SAB MW, the numbers shown in the tables are numbers of edges connecting existing SAB nodes as indicated.

Stimulating Peripheral Activity to Relieve Conditions (SPARC)

Dataset SAB(s)	SCKAN, NPO, UBERON, PATO, NIFSTD
DCC Website	https://sparc.science/
DCC	SPARC Data and Resource Center (DRC) - Knowledge Management and Curation Core
Authority	Tom Gillespie, Fahim Imam (SPARC K-Core, University of California San Diego); Jyl Boline (PM - SPARC K-Core)
Source Information	A key component of the SPARC Program is the SPARC Connectivity Knowledge Base of the Autonomic Nervous System, referred to as SCKAN. SCKAN is a semantic store housing a comprehensive knowledge base of autonomic nervous system (ANS) nerve to end organ connectivity. Connectivity information is derived from SPARC experts, SPARC data, and the literature and textbooks using a Natural Language Processing (NLP) pipeline.
Purpose	Facilitate enhanced understanding of the peripheral nervous system to support the development of effective bioelectronic therapies by driving collaborative neurosciences and providing online resources for accessing and submitting curated data and models, as well as dynamic knowledge-management and visualization tools.
Description	The SPARC Connectivity Knowledge Base of the Autonomic Nervous System (SCKAN) is an integrated graph database composed of three parts: the SPARC dataset metadata graph, ApiNATOMY and Neuron Phenotype Ontology (NPO) models of connectivity, and the larger ontology used by SPARC, which is a combination of the NIF-Ontology and community ontologies.
Summarization Methodology	SCKAN provides a central location to populate, discover, and query ANS connectivity knowledge over multiple scales. It supports queries such as “what are the locations of neuron somas with processes that pass through spinal cord level C4?” and the creation of a searchable visual atlas of ANS circuitry. Users of the SPARC maps can query SCKAN to find more information about routes, targets and evidence. SCKAN contains statements about neuronal connectivity at the neuron population level, largely in the form of: “Neurons with somas in structure A project to structure B via nerve C.” SCKAN models connections at two levels of granularity: circuits and individual connections. A circuit represents a detailed model of connectivity that is associated with a particular organ, such as the bladder, or with a functional circuit, such as defensive breathing. Circuits contain detailed representations of the neuron populations giving rise to ANS connections. They include mappings of the locations of cell bodies, dendrites and axon segments, as well as the synaptic endings involved in a particular circuit. Circuits in SCKAN are modelled using ApiNATOMY, a knowledge model and tool specifically created to represent multiscale connectivity. To provide comprehensive knowledge about ANS connectivity, the circuit-based approach is supplemented with well-known ANS connections derived from the literature and textbooks using a Natural Language Processing (NLP) pipeline. These individual connectivity statements do not have detailed topological information associated with them and are represented using NPO.
Summarization Methodology Code Repository URL	https://zenodo.org/record/7476115
Total nodes	484,768
Total edges	1,337,124
Source Data DOI(s) (optional)	10.5281/zenodo.7476115
Source Data URL(s) (optional)	https://doi.org/10.5281/zenodo.7476115 https://github.com/open-physiology/apinatomy-models https://github.com/SciCrunch/NIF-Ontology/ https://bioportal.bioontology.org/ontologies/NPOKB
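
The statement form quoted under Summarization Methodology (“Neurons with somas in structure A project to structure B via nerve C”) can be pictured as a simple record. The sketch below is purely illustrative: the class, field, and function names are hypothetical and are not the ApiNATOMY or NPO data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConnectivityStatement:
    soma_location: str   # structure containing the neuron cell bodies
    target: str          # structure the population projects to
    via: tuple = ()      # nerves or structures the processes pass through

def soma_locations_passing_through(statements, structure):
    """Soma locations of populations whose processes pass through `structure`."""
    return sorted({s.soma_location for s in statements if structure in s.via})

# Toy statements using the placeholder names from the text above.
statements = [
    ConnectivityStatement("structure A", "structure B", via=("nerve C",)),
    ConnectivityStatement("structure D", "structure B", via=("nerve E",)),
]
result = soma_locations_passing_through(statements, "nerve C")
```

The helper mirrors the shape of the example question about somas whose processes pass through a given spinal cord level.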

SPARC Schema Diagram

[Image: images/image40.png]

Node and Edge counts

[Image: images/image41.png]

Additional Datasets

CLINVAR

The ClinVar dataset (v2023-01-05) was used to define assertions between human genes and phenotypes. Only genes with pathogenic, likely pathogenic, or pathogenic/likely pathogenic variants were considered, and associations with no assertion criteria met were excluded. To retrieve the target phenotype or disease, we used the MedGen IDs listed in the ClinVar dataset (already present in the KG). The processed ClinVar dataset contains 214,040 relationships (including reverse relationships) of two kinds, [type: “gene_associated_with_disease_or_phenotype”, SAB: “CLINVAR”] and [type: “inverse_gene_associated_with_disease_or_phenotype”, SAB: “CLINVAR”], connecting HGNC to MONDO, HPO, EFO and MESH Concept nodes.
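
The filtering rule described above can be sketched as follows. This is a hedged illustration, not the pipeline actually used: the record layout, field names, and identifiers are hypothetical, and the review-status string used to drop associations without assertion criteria is an assumption about how that exclusion maps onto ClinVar's review status field.

```python
# Clinical significance values that are kept (from the description above).
KEPT_SIGNIFICANCE = {
    "pathogenic",
    "likely pathogenic",
    "pathogenic/likely pathogenic",
}
# Assumed review-status value for records to exclude.
EXCLUDED_REVIEW_STATUS = "no assertion criteria provided"

def clinvar_assertions(records):
    """Return forward and inverse gene-phenotype triples for retained records."""
    pairs = set()
    for r in records:
        keep = (r["significance"].lower() in KEPT_SIGNIFICANCE
                and r["review_status"] != EXCLUDED_REVIEW_STATUS)
        if keep:
            pairs.add((r["gene"], r["medgen_id"]))
    edges = []
    for gene, medgen in sorted(pairs):
        edges.append((gene, "gene_associated_with_disease_or_phenotype", medgen))
        edges.append((medgen, "inverse_gene_associated_with_disease_or_phenotype", gene))
    return edges

# Hypothetical records with placeholder identifiers.
records = [
    {"gene": "GENE_A", "medgen_id": "MEDGEN_1", "significance": "Pathogenic",
     "review_status": "criteria provided, single submitter"},
    {"gene": "GENE_B", "medgen_id": "MEDGEN_2", "significance": "Benign",
     "review_status": "criteria provided, single submitter"},
    {"gene": "GENE_C", "medgen_id": "MEDGEN_3", "significance": "Likely pathogenic",
     "review_status": "no assertion criteria provided"},
]
edges = clinvar_assertions(records)
```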

CMAP

The edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset were obtained from the Harmonizome database https://maayanlab.cloud. The dataset added 2,625,336 relationships (including reverse relationships) connecting the CHEBI and HGNC nodes with predicates “negatively_correlated_with_gene”, “inverse_negatively_correlated_with_gene”, “positively_correlated_with_gene”, “inverse_positively_correlated_with_gene” (SAB: “CMAP”).

HPOMP

This set of assertions maps human phenotype ontology (HPO) nodes to mammalian phenotype ontology (MP) nodes through the ‘is_approximately_equivalent_to’ relationship. It is essentially a set of assertions mapping human phenotype codes to mouse phenotype codes. The mappings were produced using a software tool called PheKnowLator. There are 1,785 HPOMP mappings. These assertions can be queried by specifying the SAB property as HPOMP on the ‘is_approximately_equivalent_to’ relationship.
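
Querying by SAB on a relationship, as described above, might look like the sketch below, which only builds a Cypher query string and its parameters. It assumes the UBKG-style layout in which assertions connect `Concept` nodes and carry an `SAB` property, and that Concept nodes expose a `CUI` property. The helper name is hypothetical, and the query should be checked against the deployed schema before use.

```python
def sab_relationship_query(rel_type, sab, limit=10):
    """Build a Cypher query (and parameters) for one relationship type and SAB."""
    # Relationship types cannot be passed as Cypher parameters, so the name
    # is validated before being interpolated into the query text.
    if not rel_type.replace("_", "").isalnum():
        raise ValueError(f"unexpected relationship type: {rel_type!r}")
    query = (
        f"MATCH (a:Concept)-[r:{rel_type} {{SAB: $sab}}]->(b:Concept) "
        f"RETURN a.CUI AS subject, b.CUI AS object LIMIT {int(limit)}"
    )
    return query, {"sab": sab}

query, params = sab_relationship_query("is_approximately_equivalent_to", "HPOMP")
```

The same helper would apply to the other assertion sets in this section by swapping the relationship type and SAB, for example ‘associated_with’ with HGNCHPO.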

HGNCHPO

This set of assertions maps HGNC gene nodes to human phenotype ontology (HPO) nodes through the ‘associated_with’ relationship. There are 671,046 HGNCHPO mappings. These assertions can be queried by specifying the SAB property as HGNCHPO on the ‘associated_with’ relationship.

HCOPHGNC

This set of assertions maps mouse gene nodes (HCOP) to human gene nodes (HGNC). Mouse gene nodes are referred to as ‘HCOP’ in the Data Distillery Knowledge Graph because the HGNC Comparison of Orthology Predictions (HCOP) tool was used to generate these mappings. The ‘in_1_to_1_orthology_relationship_with’ relationship is used to connect the HGNC and HCOP nodes. There are 67,027 HCOPHGNC mappings. These assertions can be queried by specifying the SAB property as HCOPHGNC on the ‘in_1_to_1_orthology_relationship_with’ relationship.

HCOPMP

This set of assertions maps mouse gene nodes (HCOP) to the mammalian phenotype ontology (MP) nodes through the ‘involved_in’ relationship. These mappings are the mouse version of the HGNCHPO mappings. Files from the International Mouse Phenotyping Consortium (IMPC) and Mouse Genome Informatics (MGI) were used to create this dataset. There are 234,043 HCOPMP mappings. These assertions can be queried by specifying the SAB property as HCOPMP on the ‘involved_in’ relationship.

Homo Sapiens Chromosomal Location Ontology (HSCLO)

Homo Sapiens Chromosomal Location Ontology (HSCLO) was primarily created to connect 4DN loop coordinates to the rest of the graph through the mapping between HSCLO and GENCODE. HSCLO was later also used to represent GTEXEQTL locations in the graph as searchable nodes at 1 kbp resolution (the same resolution as 4DN). Both the nodes and the relationships of the dataset use HSCLO as their SAB. HSCLO nodes are defined at 5 resolution levels: chromosome, 1 Mbp, 100 kbp, 10 kbp and 1 kbp. Each level connects to the next lower level through above_(resolution level)_band relationships (e.g. “above_1Mbp_band”, “above_1kbp_band”), and nodes at the same resolution level are connected through precedes_(resolution level)_band relationships (e.g. “precedes_10kbp_band”). The dataset contains 3,431,155 nodes and 6,862,195 relationships (13,724,390 when both directions are counted).
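
The nesting of resolution levels can be illustrated by computing which band contains a given position at each level. This is a minimal sketch assuming 1-based, inclusive coordinates and bands aligned to multiples of the band size. The actual HSCLO node identifiers and boundary conventions are not specified here, so the function name and the returned intervals are hypothetical.

```python
# Band sizes in base pairs for the four sub-chromosomal resolution levels.
RESOLUTIONS = {
    "1Mbp": 1_000_000,
    "100kbp": 100_000,
    "10kbp": 10_000,
    "1kbp": 1_000,
}

def bands_for_position(pos):
    """Return the (start, end) band containing `pos` at each resolution."""
    if pos < 1:
        raise ValueError("position must be a positive 1-based coordinate")
    bands = {}
    for name, size in RESOLUTIONS.items():
        start = (pos - 1) // size * size + 1
        bands[name] = (start, start + size - 1)
    return bands

bands = bands_for_position(1_234_567)
```

Each coarser band fully contains the finer one, which is the containment that the above_(resolution level)_band relationships express.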

MSIGDB

Five subsets of MSigDB v7.4 were introduced to the knowledge graph as entity-gene relationships: C1 (positional gene sets), C2 (curated gene sets), C3 (regulatory target gene sets), C8 (cell type signature gene sets) and H (hallmark gene sets). From these subsets, MSIGDB Concept nodes were created for the MSigDB systematic names (used as Codes, excluding KEGG data). The relationships between these Concept nodes and HGNC nodes were defined using the 5 subsets mentioned above, with the subset information included in the relationship SABs as “MSIGDB”.

RATHCOP

This set of assertions maps human ENSEMBL gene nodes to rat ENSEMBL gene nodes. These mappings were generated from the HCOP tool just like for the mouse to human assertions, except we used the ENSEMBL codes here instead of the HGNC codes. The ‘has_human_ortholog’ relationship is used to connect ENSEMBL Rat nodes to ENSEMBL Human nodes. There are 42,371 RATHCOP mappings and they can be queried by specifying the SAB property as RATHCOP on the ‘has_human_ortholog’ relationship.
