Master Thesis
M.Sc. in Data Science and Artificial Intelligence
Saarland University
Author: Pablo Valdunciel Sánchez
- Overview
- Publication
- Motivation
- Methods and Results
- Repository Content
- Reproducibility
- Resources
- About
This repository contains the code and resources for my master's thesis titled "Active Learning for Relation Extraction Task from Medical Text." The thesis explores methodologies to optimize Relation Extraction (RE) from medical texts using active learning techniques.
Figure: Demonstration of entities and their annotated relations in the n2c2 corpus (Henry et al., 2020): each instance may feature multiple entities, and the annotations indicate the presence or absence of a relation between any two entities.
The research and findings from this master's thesis formed the basis for a workshop paper published in the Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024).
📄 Optimizing Relation Extraction in Medical Texts through Active Learning: A Comparative Analysis of Trade-offs
Authors: Siting Liang, Pablo Valdunciel Sánchez, Daniel Sonntag
Published: March 2024
Journal/Conference: Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)
DOI: 10.18653/v1/2024.uncertainlp-1.3
The implementation of Health Information Technology (HIT) has substantially increased in healthcare systems worldwide, with a primary emphasis on the digitisation of medical records into Electronic Health Records (EHRs). EHRs incorporate a vast amount of useful health-related information, including relationships between biomedical entities such as drug-drug interactions, adverse drug events, and treatment efficacy. Biomedical publications also provide essential information regarding newly discovered protein-protein interactions, drug-drug interactions, and other types of biomedical relationships. However, given the vast amount of information available within biomedical and clinical documents, it is impractical for healthcare professionals to process this information manually. Therefore, automatic techniques, such as Biomedical and Clinical Relation Extraction, are necessary. Machine learning techniques have been developed for use in relation extraction tasks in the biomedical and clinical domains. Nevertheless, the annotation process required for these medical corpora is time-consuming and expensive. Active learning (AL) is a cost-effective method that labels only the most informative instances for model learning. This research aims to investigate the annotation costs associated with AL when used for relation extraction from biomedical and clinical texts using various supervised learning methods.
This work explores the applicability of three distinct supervised learning methods from different ML families for relation extraction from biomedical (SemEval-2013 Task 9.2, also known as the DDI corpus) and clinical (n2c2 2018, Track 2) texts within an active learning framework. The three methods under consideration are Random Forest (a traditional ML method), a BiLSTM-based method (a deep learning method), and a Clinical BERT-based method (a language-model-based method), the last evaluated with two variations of its input. The AL framework employs random sampling as a baseline query strategy, along with Least Confidence (LC) and two batch-based strategies: BatchLC for Random Forest and BatchBALD for the other methods. The evaluation considers not only the performance achieved with significantly fewer annotated samples than a standard supervised learning approach, but also the overall cost of the annotation process, measured as the duration of the AL steps (i.e., the time required to query and retrain the model in each AL step) and the annotation rates of tokens and characters. An empirical study is conducted to identify the most suitable method for relation extraction in this context.
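For intuition, Least Confidence querying can be sketched as follows (a minimal illustration assuming any scikit-learn-style probabilistic classifier; the function and variable names are hypothetical, not the repository's API):
import numpy as np

def least_confidence_query(model, unlabeled_X, batch_size):
    # Predicted class probabilities for every unlabeled instance
    probs = model.predict_proba(unlabeled_X)
    # Confidence = probability of the most likely class;
    # the least confident instances are the most informative
    confidence = probs.max(axis=1)
    # Return the indices of the batch_size least confident instances
    return np.argsort(confidence)[:batch_size]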
The findings indicate that AL can achieve performance comparable to traditional supervised approaches in relation extraction from biomedical and clinical documents while using significantly fewer annotated data, which in turn reduces the cost of annotating a medical text corpus. The LC strategy outperformed the random baseline when applied with the Random Forest and Clinical BERT methods (one-sided Wilcoxon signed-rank test).
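The significance test above can be reproduced with SciPy (a minimal sketch; the score arrays below are hypothetical per-step F1 values, not the thesis results):
from scipy.stats import wilcoxon

lc_f1 = [0.61, 0.64, 0.66, 0.69]      # hypothetical F1 per AL step with LC
random_f1 = [0.58, 0.62, 0.63, 0.66]  # hypothetical F1 per AL step with random sampling
# One-sided test: does LC systematically outperform the random baseline?
stat, p_value = wilcoxon(lc_f1, random_f1, alternative="greater")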
- ./data/: Contains the two corpora used in the evaluation, the pre-trained word embeddings and the pre-trained Clinical BERT model.
- ./doc/: Holds documentation related to the project, such as the license and created figures.
- ./results/: Stores the results of experiments, including .csv files with metrics and generated plots.
- ./src/: The central location for all source files of the project.
- ./tests/: Contains PyTest tests for the feature generation and the data models.
- ./pyproject.toml: The project configuration file.
- ./requirements.txt: Lists the packages required to run the dataset preprocessing and the active learning experiments.
A more detailed README can be found in each of the subdirectories.
First, create a virtual environment and activate it.
python -m venv .venv
source .venv/bin/activate  # Linux/macOS; on Windows: .venv\Scripts\activate
Second, install the dependencies.
pip install -r requirements.txt
Go to data/ddi, data/n2c2, data/bioword2vec and data/ml_models and follow the instructions in the README files.
The following code will do the preprocessing of the desired corpus:
from preprocessing import *
corpus = "n2c2" # or "ddi"
# Split documents into sentences (only does something for n2c2)
split_sentences(corpus)
# Generate relation collections
collections = generate_relations(corpus, save_to_disk=True)
# Generate statistics
generate_statistics(corpus, collections) # will be printed to console
For more information on how the preprocessing is implemented, please refer to the README in the preprocessing module.
To run the experiments with the BiLSTM and BERT-based methods, it is necessary to generate HF Datasets with the precomputed input features and representations. These datasets are then stored in the ./data/ folder and loaded at runtime; computing the features at runtime would be too slow.
The following code generates the HF Datasets for the desired corpus:
# Local Dependencies
# ------------------
from vocab import Vocabulary
from models import RelationCollection
from re_datasets import BilstmDatasetFactory, BertDatasetFactory
corpus = "n2c2" # or "ddi"
# Load collections
collections = RelationCollection.load_collections(corpus)
# Generate vocabulary
train_collection = collections["train"]
vocab = Vocabulary.create_vocabulary(corpus, train_collection, save_to_disk=True)
# Generate HF Datasets for BiLSTM and store in disk
BilstmDatasetFactory.create_datasets(corpus, collections, vocab)
# Generate HF Datasets for BERT and store in disk
BertDatasetFactory.create_datasets(corpus, collections, vocab)
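Once generated, the stored datasets can be loaded back at runtime with the standard HF datasets API (a minimal sketch; the exact on-disk path is an assumption, not the repository's layout):
from datasets import load_from_disk

# Hypothetical path; the actual location inside ./data/ may differ
dataset = load_from_disk("./data/n2c2/bilstm_dataset")
print(dataset)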
To run the experiments, it is necessary to have the HF Datasets generated in the previous step.
The ./src/experiments folder contains the code for running the different experiments. It is enough to adjust certain variables (e.g., number of repetitions, with/without logging, training configuration) before running the corresponding function. For example:
# Local Dependencies
from experiments import *
# run experiments with BiLSTM on n2c2
bilstm_passive_learning_n2c2()
bilstm_active_learning_n2c2()
# or run experiments with Paired Clinical BERT on DDI
bert_passive_learning_ddi(pairs=True)
bert_active_learning_ddi(pairs=True)
For more information on how the experimental setting is implemented, please refer to the README in the experiments module.
To track the experiments, we have chosen neptune.ai. Neptune offers experiment tracking and a model registry for machine learning projects, storing all the data online and making it accessible through a simple web interface.
To use Neptune, you need to create an account, create a new project, and generate a new API token. Once you have created the project, assign the project name and the API token to the NEPTUNE_PROJECT and NEPTUNE_API_TOKEN variables in the config.py file, respectively.
# config.py
NEPTUNE_PROJECT = "your_username/your_project_name"
NEPTUNE_API_TOKEN = "your_api_token"
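For reference, a run is typically created and written to like this (a minimal sketch using the neptune client's public API (neptune >= 1.0), not the repository's own logging code):
import neptune
from config import NEPTUNE_PROJECT, NEPTUNE_API_TOKEN

# Create a run in your Neptune project
run = neptune.init_run(project=NEPTUNE_PROJECT, api_token=NEPTUNE_API_TOKEN)
# Log hyperparameters and a metric over the AL steps
run["parameters"] = {"method": "bilstm", "query_strategy": "lc"}
run["metrics/f1"].append(0.65)
run.stop()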
If you don't want to use Neptune, you can set the logging parameter to False when running an experiment.
bilstm_passive_learning_n2c2(logging=False)
- Machine Learning:
- Deep Learning:
- Experiments metadata store: neptune.ai
This project was developed as part of Pablo Valdunciel Sánchez's master's thesis in the Data Science and Artificial Intelligence programme at Saarland University (Germany). The work was carried out in collaboration with the German Research Center for Artificial Intelligence (DFKI), supervised by Prof. Dr. Antonio Krüger, CEO and scientific director of DFKI, and Prof. Dr. Daniel Sonntag, director of the Interactive Machine Learning (IML) department at DFKI, and advised by Siting Liang, researcher in the IML department.