Data
The data was downloaded on 2021-01-12 from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/gene_condition_source_id
Columns in the data:
GeneID: The NCBI GeneID
GeneSymbol: The preferred symbol corresponding to the GeneID
ConceptID: The identifier assigned to a disorder associated with this
gene. If the value starts with a C and is followed by digits,
the ConceptID is a value from UMLS; if a value begins with
CN, it was created by NCBI-based processing
DiseaseName: Full name for the condition
SourceName: Sources that use this name
SourceID: The identifier used by this source
DiseaseMIM: MIM number for the condition
LastUpdated: Last time this record was modified by NCBI staff
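The ConceptID prefix rule described above can be written as a small classifier. This is a minimal sketch: the helper name `concept_id_source` is hypothetical, and the identifiers in the usage lines are made-up placeholders rather than values taken from the file.

```python
import re


def concept_id_source(concept_id: str) -> str:
    """Classify a ConceptID by prefix (hypothetical helper, not part of ClinVar)."""
    # "CN..." identifiers were created by NCBI-based processing; check this
    # first, because they also start with "C".
    if concept_id.startswith("CN"):
        return "NCBI"
    # "C" followed only by digits is a UMLS identifier.
    if re.fullmatch(r"C\d+", concept_id):
        return "UMLS"
    return "unknown"


# Placeholder identifiers, for illustration only.
kinds = [concept_id_source(c) for c in ("C1234567", "CN123456", "X1")]
```

Checking the `CN` prefix before the `C` + digits pattern keeps the two cases from overlapping.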
The raw dataset contained 4439 unique genes (GeneID) and 6239 unique diseases (ConceptID).
SourceName overview:
NCBI curation 5279
MONDO 878 # only a small proportion has MONDO ids
OMIM 714
OMIM phenotypic series 573
Human Phenotype Ontology 539
Orphanet 316
GeneReviews 36
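Counts of this kind can be reproduced with the Python standard library. The sketch below assumes a tab-separated table whose header uses the column names listed above (the header of the real file may differ and need adjusting) and that the overview counts rows per SourceName. The helper `summarise` is hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarise(handle):
    """Return (unique GeneID count, unique ConceptID count, rows per SourceName)."""
    genes, concepts, sources = set(), set(), Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        genes.add(row["GeneID"])
        concepts.add(row["ConceptID"])
        sources[row["SourceName"]] += 1
    return len(genes), len(concepts), sources


# Made-up rows for illustration only; not taken from the ClinVar file.
sample = io.StringIO(
    "GeneID\tGeneSymbol\tConceptID\tDiseaseName\tSourceName\n"
    "1\tGENE_A\tC0000001\tcondition one\tNCBI curation\n"
    "1\tGENE_A\tC0000002\tcondition two\tMONDO\n"
    "2\tGENE_B\tC0000002\tcondition two\tNCBI curation\n"
)
n_genes, n_concepts, source_counts = summarise(sample)
```

For the downloaded file, the same function can be given a handle from `open(path, newline="")` in place of the in-memory sample.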
ClinVar data was added as a new relationship between Gene and Disease nodes: GENE_TO_DISEASE. The relationship was created between the index properties of the nodes, i.e. ensembl_id for Gene and mondo_id for Disease. To make this relationship import easier, the following data filtering was done:
- Converted gene names (GeneID) to ensembl_id using biomart build 37 (this is the data/version already used in the graph).
- Genes in ClinVar data are labelled 'associated'/'related': both types were kept, and this information was added as a property of the GENE_TO_DISEASE relationship.
- ClinVar diseases were matched to existing Disease nodes in the graph in two ways:
  - Gene-disease pairs that have a MONDO id in SourceName were extracted and matched directly, because mondo_id is the index property of the Disease node. This covered only ~900 gene-disease pairs.
  - Everything else was matched via UMLS id (stored in the ConceptID column). In the graph, UMLS is not an index property, so we couldn't match on it directly. Therefore we first queried the graph to map UMLS to MONDO, and then used those MONDO ids to map all remaining gene-disease pairs.
- Then, duplicate pairs from the two approaches were removed, and the table of ensembl_id, mondo_id, gene_relationship_type (associated/related), and last_updated was imported into the graph.
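The filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the import: `gene_to_ensembl` stands in for the biomart mapping, `umls_to_mondo` stands in for the result of the graph query, MONDO-sourced rows are assumed to carry the MONDO id in SourceID, and the function name, row keys, and identifiers are hypothetical placeholders.

```python
def build_relationships(rows, gene_to_ensembl, umls_to_mondo):
    """Return de-duplicated (ensembl_id, mondo_id, gene_relationship_type, last_updated) tuples."""
    seen, out = set(), []
    for row in rows:
        ensembl_id = gene_to_ensembl.get(row["GeneID"])
        if ensembl_id is None:
            continue  # gene has no ensembl_id mapping, so it is dropped
        if row["SourceName"] == "MONDO":
            # Assumption: for MONDO-sourced rows, SourceID holds the MONDO id.
            mondo_ids = [row["SourceID"]]
        else:
            # Everything else goes through the UMLS -> MONDO lookup.
            mondo_ids = umls_to_mondo.get(row["ConceptID"], [])
        for mondo_id in mondo_ids:
            pair = (ensembl_id, mondo_id)
            if pair not in seen:  # keep each gene-disease pair once
                seen.add(pair)
                out.append((ensembl_id, mondo_id,
                            row["gene_relationship_type"], row["LastUpdated"]))
    return out


# Placeholder data for illustration only.
rows = [
    {"GeneID": "1", "ConceptID": "C0000001", "SourceName": "MONDO",
     "SourceID": "MONDO:EXAMPLE1", "gene_relationship_type": "associated",
     "LastUpdated": "2020-01-01"},
    {"GeneID": "1", "ConceptID": "C0000001", "SourceName": "OMIM",
     "SourceID": "000001", "gene_relationship_type": "associated",
     "LastUpdated": "2020-01-01"},
    {"GeneID": "2", "ConceptID": "C0000002", "SourceName": "NCBI curation",
     "SourceID": "", "gene_relationship_type": "related",
     "LastUpdated": "2020-01-02"},
    {"GeneID": "3", "ConceptID": "C0000002", "SourceName": "NCBI curation",
     "SourceID": "", "gene_relationship_type": "related",
     "LastUpdated": "2020-01-02"},
]
gene_to_ensembl = {"1": "ENSG_A", "2": "ENSG_B"}  # gene "3" has no mapping
umls_to_mondo = {"C0000001": ["MONDO:EXAMPLE1"], "C0000002": ["MONDO:EXAMPLE2"]}
relationships = build_relationships(rows, gene_to_ensembl, umls_to_mondo)
```

In this toy input, the second row reaches the same gene-disease pair as the first via UMLS and is dropped as a duplicate, and the fourth row is dropped because its gene has no ensembl_id.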
After ensembl_id filtering and querying the graph, the dataset contained 6369 relationships, covering 3612 unique genes and 4036 unique diseases.
- 4439 unique genes in raw data
  - 4267 have a biomart ensembl_id
  - 4255 of those ensembl_id are present in the graph
  - 4255/4439 ≈ 96% of genes from ClinVar are present in EpiGraphDB
  - 3612/4439 = 81% of genes from ClinVar were given a relationship to a disease, after all filtering
- 6239 unique UMLS ids in raw data
  - 4228 of those are present in the graph
- 711 unique MONDO ids in raw data
  - 709 of those are present in the graph
  - of those, 291 were not matched via UMLS
- (4228+291)/6239 = 72% of diseases from ClinVar are present in EpiGraphDB (mapped by UMLS + MONDO)
- 4036/6239 = 65% of diseases from ClinVar were given a relationship to a gene, after all filtering
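The coverage figures are simple ratios of the counts reported in this section. A short check of the arithmetic, using only those counts (the helper `pct` and the variable names are illustrative):

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` covered by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


raw_genes, raw_diseases = 4439, 6239

# Genes whose ensembl_id is present in the graph.
genes_in_graph = pct(4255, raw_genes)
# Genes that ended up with a GENE_TO_DISEASE relationship.
genes_linked = pct(3612, raw_genes)
# Diseases found via UMLS, plus those found only via MONDO.
diseases_in_graph = pct(4228 + 291, raw_diseases)
# Diseases that ended up with a relationship to a gene.
diseases_linked = pct(4036, raw_diseases)
```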