A curated list of applications, datasets and models for healthcare text analytics developed and shared by the Health Data Research (HDR) UK Text community. Further details of the HDR UK Text project can be found at hdruk-text.org.
If you'd like to contribute a resource, please message us at [email protected].
More health data focused applications, datasets and other resources are available by searching on the HDR UK Gateway.
- CALIBER drugdose: medication dosage instructions in electronic health records are often in the form of text rather than numbers. This program is designed to convert the text into numbers for the dose, frequency, units, duration etc.
- CogStack: a locally deployable, distributed, microservice architecture intended to make information retrieval/extraction easier from EHRs.
- CRIS / SLaM: library of applications available for use within South London and Maudsley (SLaM) on the Clinical Record Interaction Search (CRIS) platform. You must apply for access to CRIS before you can use these applications.
- EdIE-viz: provides an interface for stroke-related clinical concept recognition and negation detection in brain radiology reports.
- EdIE-R: a rule-based information extraction tool developed for brain imaging reports.
- EdIE-BiLSTM: a neural network system for named entity recognition and negation detection with a character-aware BiLSTM sentence encoder for brain imaging reports.
- EdIE-BERT: a neural network system for named entity recognition and negation detection with a pretrained BERT encoder (BlueBERT) for brain imaging reports.
- EndoMineR: a rule-based information extraction system for free-text and semi-structured endoscopy reports and their associated pathology specimens.
- Free text matching algorithm: this computer program is designed to extract diagnoses, dates, durations, laboratory results and selected examination findings (heart rate and blood pressure) from unstructured free text. The program was created based on text in general practice records from the Vision system, and information is encoded using Read Clinical Terms.
- HELIN: a web API demo for performing named entity recognition and linking (NER-L) on biomedical text.
- Komenti: uses background knowledge that researchers have already discovered about biology and medicine. It combines this knowledge in new ways, with the aim of learning more from sources that are usually difficult for computers to understand. For example, it can extract information about a patient and their illnesses from letters written by their doctor.
- Med7: a dedicated named entity recognition system that identifies 7 categories: dosage, drug, duration, form, frequency, route and strength.
- MedCAT: a medical concept annotation system that can be used to extract, structure and organize information from health records. It is based on unsupervised learning with the option of online/supervised learning via the MedCATtrainer interface.
- SemEHR: a text mining and semantic search system designed for Surfacing Semantic Data from Clinical Notes in Electronic Health Records for Tailored Care, Trial Recruitment and Clinical Research.
- SIPHS: a collection of software and datasets to support linguistic analysis of online health communities.
- BioReddit embeddings: a set of word embeddings (GloVe, ELMo, Flair) trained on ~800,000 Reddit posts from over 60 medical-themed communities.
- Biomedical ambiguities: abbreviations and gene names: corpora containing examples of two ambiguities from the biomedical domain (abbreviations and gene names).
- Cardiovascular research abstracts: corpus containing examples of potentially contradictory claims from Medline abstracts describing cardiovascular research. It is intended as a useful resource for researchers working on similar problems.
- COMETA: an entity linking dataset of layman medical terminology collected by analysing four years of content in 68 health-themed subreddits.
- PheneBank: 24 million MEDLINE abstracts and 3.8M open-access PMC full articles annotated with 9 classes of entity (Phenotype, Disease, Anatomy, Cell, Cell_line, GPR, Gene_variant, Molecule, and Pathway), mapped to five major ontologies: SNOMED, HPO, MeSH, PRO, and FMA.
- SapBERT: a pre-training scheme based on BERT. Despite the widespread success of self-supervised learning via masked language models, learning representations directly from text that accurately capture complex and fine-grained semantic relationships in the biomedical domain remains a challenge. SapBERT self-aligns the representation space of biomedical entities with a metric learning objective function leveraging UMLS, a collection of biomedical ontologies with >4M concepts.
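
Several entries above (for example CALIBER drugdose and Med7) deal with turning free-text dosage instructions into structured values. The toy sketch below illustrates that kind of task only; it is not the algorithm or API of any listed tool. The lookup tables, the `Dose` class and the `parse_dose` function are hypothetical names invented for this example, and the patterns cover just a handful of phrasings.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup tables for this sketch; real tools use far richer resources.
NUMBER_WORDS = {"half": 0.5, "one": 1.0, "two": 2.0, "three": 3.0, "four": 4.0}
FREQUENCY_PER_DAY = {
    "once daily": 1.0, "once a day": 1.0, "daily": 1.0,
    "twice daily": 2.0, "twice a day": 2.0, "bd": 2.0,
    "three times daily": 3.0, "three times a day": 3.0, "tds": 3.0,
    "four times a day": 4.0, "qds": 4.0,
}

# A number (digits or a number word) followed by a unit.
QUANTITY_RE = re.compile(
    r"\b(\d+(?:\.\d+)?|half|one|two|three|four)\s*"
    r"(tablets?|capsules?|puffs?|mg|ml)\b"
)


@dataclass
class Dose:
    quantity: Optional[float]  # amount per administration
    unit: Optional[str]        # singular unit, e.g. "tablet" or "mg"
    per_day: Optional[float]   # administrations per day


def parse_dose(text: str) -> Dose:
    """Extract quantity, unit and daily frequency from a dosage instruction."""
    text = text.lower()

    quantity = unit = None
    match = QUANTITY_RE.search(text)
    if match:
        raw_number, raw_unit = match.groups()
        quantity = NUMBER_WORDS.get(raw_number)
        if quantity is None:
            quantity = float(raw_number)
        # Normalise plural units ("tablets") to singular ("tablet").
        unit = raw_unit[:-1] if raw_unit.endswith("s") else raw_unit

    per_day = None
    # Try longer phrases first so "twice daily" wins over plain "daily".
    for phrase in sorted(FREQUENCY_PER_DAY, key=len, reverse=True):
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            per_day = FREQUENCY_PER_DAY[phrase]
            break

    return Dose(quantity=quantity, unit=unit, per_day=per_day)


if __name__ == "__main__":
    for instruction in ("Take two tablets twice daily",
                        "500mg three times a day",
                        "Apply as directed"):
        print(instruction, "->", parse_dose(instruction))
```

Instructions that match none of the patterns, such as "Apply as directed", leave every field as `None` rather than guessing. This mirrors why purpose-built tools are needed for real clinical text.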