mvab edited this page Mar 15, 2021 · 4 revisions

Summary of ClinVar data

The data was downloaded on 2021-01-12 from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/gene_condition_source_id

Columns in the data:

GeneID:               The NCBI GeneID
GeneSymbol:           The preferred symbol corresponding to the GeneID
ConceptID:            The identifier assigned to a disorder associated with this
                        gene. If the value starts with a C and is followed by digits,
                        the ConceptID is a value from UMLS; if a value begins with
                        CN, it was created by NCBI-based processing
DiseaseName:          Full name for the condition
SourceName:           Sources that use this name
SourceID:             The identifier used by this source
DiseaseMIM:           MIM number for the condition
LastUpdated:          Last time this record was modified by NCBI staff
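
The ConceptID rule described above can be sketched as a small classifier. This is only an illustration of the stated prefix convention; the function name and the example identifiers are made up and do not come from the dataset.

```python
import re

# "C" followed only by digits is treated as a UMLS identifier,
# following the column description above.
UMLS_PATTERN = re.compile(r"^C\d+$")


def concept_id_source(concept_id: str) -> str:
    """Classify a ConceptID as 'UMLS', 'NCBI' (CN prefix), or 'unknown'."""
    if UMLS_PATTERN.match(concept_id):
        return "UMLS"
    if concept_id.startswith("CN"):
        return "NCBI"
    return "unknown"


# Made-up identifiers, for illustration only
print(concept_id_source("C0000001"))  # UMLS
print(concept_id_source("CN000001"))  # NCBI
```

Only identifiers classified as UMLS can be looked up against UMLS ids stored in the graph, which is relevant to the disease matching step further down.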

The raw dataset contained 4439 unique genes (GeneID) and 6239 unique diseases (ConceptID).

SourceName overview:

NCBI curation               5279
MONDO                        878  # only a small proportion has MONDO ids
OMIM                         714
OMIM phenotypic series       573
Human Phenotype Ontology     539
Orphanet                     316
GeneReviews                   36

ClinVar data was added to the graph as a new relationship between Gene and Disease nodes: GENE_TO_DISEASE. The relationship is keyed on the index properties of the nodes, i.e. ensembl_id for Gene and mondo_id for Disease. To make importing this relationship easier, the data was filtered as follows:

ClinVar processing steps

  1. Converted gene identifiers (GeneID) to ensembl_id using biomart build 37 (this is the data/version already used in the graph)

  2. Genes in ClinVar data are labelled 'associated' or 'related': both types were kept, and this information was added as a property of the GENE_TO_DISEASE relationship.

  3. ClinVar diseases were matched to existing Disease nodes in the graph in two ways:

  • Gene-disease pairs that have a MONDO id (SourceName is MONDO) were matched directly, because mondo_id is the index property of the Disease node. This covered only ~900 gene-disease pairs.
  • Everything else was matched via the UMLS id (stored in the ConceptID column). UMLS is not an index property in the graph, so it could not be matched on directly. Instead, the graph was first queried to map UMLS ids to MONDO ids, and those MONDO ids were then used to map the remaining gene-disease pairs.
  • Finally, duplicate pairs produced by the two approaches were removed, and the table of ensembl_id, mondo_id, gene_relationship_type (associated/related), and last_updated was imported into the graph.
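
The matching logic in step 3 can be sketched roughly as follows. This is a minimal illustration, not the code that was actually run: the row keys, the `umls_to_mondo` dictionary (standing in for the result of the graph query), and all identifiers are hypothetical. It also assumes that MONDO ids in the file and in the graph share one format, and that the first occurrence of a duplicated pair is the one kept.

```python
def match_pairs(rows, umls_to_mondo):
    """Return one record per unique (ensembl_id, mondo_id) pair.

    rows: list of dicts with hypothetical keys ensembl_id, concept_id,
          source_name, source_id, rel_type, last_updated.
    umls_to_mondo: dict mapping a UMLS id to a list of MONDO ids
          (a stand-in for querying the graph).
    """
    matched = {}
    for row in rows:
        if row["source_name"] == "MONDO":
            # Direct match: SourceID already holds the MONDO id.
            mondo_ids = [row["source_id"]]
        else:
            # Indirect match: go through the UMLS id in ConceptID.
            mondo_ids = umls_to_mondo.get(row["concept_id"], [])
        for mondo_id in mondo_ids:
            key = (row["ensembl_id"], mondo_id)
            # setdefault keeps the first record seen for a pair (an assumption).
            matched.setdefault(key, (row["rel_type"], row["last_updated"]))
    return [(g, d, rel, upd) for (g, d), (rel, upd) in matched.items()]


# Made-up rows, for illustration only
rows = [
    {"ensembl_id": "ENSG_A", "concept_id": "C0000001", "source_name": "MONDO",
     "source_id": "MONDO:1", "rel_type": "associated", "last_updated": "2021-01-01"},
    {"ensembl_id": "ENSG_A", "concept_id": "C0000001", "source_name": "OMIM",
     "source_id": "100000", "rel_type": "associated", "last_updated": "2021-01-01"},
    {"ensembl_id": "ENSG_B", "concept_id": "C0000002", "source_name": "NCBI curation",
     "source_id": "", "rel_type": "related", "last_updated": "2021-01-02"},
    {"ensembl_id": "ENSG_C", "concept_id": "CN000003", "source_name": "NCBI curation",
     "source_id": "", "rel_type": "related", "last_updated": "2021-01-03"},
]
umls_to_mondo = {"C0000001": ["MONDO:1"], "C0000002": ["MONDO:2"]}

result = match_pairs(rows, umls_to_mondo)
for record in result:
    print(record)
```

In this toy input the second row duplicates the first pair via the UMLS route and is dropped, and the last row has no MONDO mapping, so it produces no relationship.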

After ensembl_id filtering and querying the graph, the dataset contained 6369 relationships, covering 3612 unique genes and 4036 unique diseases.

Summary counts

  • 4439 unique genes in raw data

    • 4267 have biomart ensembl_id
    • 4255 ensembl_id are present in the graph
    • 4255/4439 = 96% of genes from ClinVar are present in EpiGraphDB
    • 3612/4439 = 81% of genes from ClinVar were given a relationship to a disease, after all filtering
  • 6239 unique UMLS ids in raw data:

    • 4228 of those are present in the graph
  • 711 unique MONDO ids in raw data

    • 709 of those are present in the graph

    • of those, 291 were not matched via UMLS

    • (4228+291)/6239 = 72% of diseases from ClinVar are present in EpiGraphDB (mapped by UMLS + MONDO)

    • 4036/6239 = 65% of diseases from ClinVar were given a relationship to a gene, after all filtering
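
The coverage ratios above can be rechecked with a few lines of arithmetic. The counts are taken from this page; the `pct` helper is only an illustrative name. Rounded to the nearest integer, 4255/4439 (95.85%) comes out at 96%.

```python
def pct(part: int, whole: int) -> int:
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


raw_genes = 4439
raw_diseases = 6239

print(pct(4255, raw_genes))           # genes present in the graph
print(pct(3612, raw_genes))           # genes with a disease relationship
print(pct(4228 + 291, raw_diseases))  # diseases present (UMLS + MONDO)
print(pct(4036, raw_diseases))        # diseases with a gene relationship
```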
