This repository contains our team's solution for the BioASQ 2020 challenge, a large-scale biomedical semantic indexing and question answering competition.
The BioASQ challenge focuses on biomedical semantic indexing and question answering. It aims to push the frontiers of large-scale biomedical semantic indexing and question answering systems. The challenge consists of various tasks, and our team participated in Task 8b, which involves biomedical question answering.
The repository is organized into several main directories:
data/
: Contains the dataset files used for training and testing.task_factoid/
: Code and notebooks for the factoid question answering task.task_list/
: Code and notebooks for the list question answering task.task_yesno/
: Code and notebooks for the yes/no question answering task.utils/
: Utility functions and helper scripts used across different tasks.transformers_models/
: Pre-trained models used in the project.
We addressed three main types of questions in the BioASQ challenge:
- Factoid Questions: These are questions that require a specific fact or short phrase as an answer.
- List Questions: These questions expect a list of items as the answer.
- Yes/No Questions: These are binary questions that can be answered with either "yes" or "no".
Our approach to solving these tasks involved:
- Data Preprocessing: We used various NLP techniques to clean and prepare the biomedical text data.
- Embeddings: We utilized pre-trained biomedical embeddings, including BioBERT and ELMo, to represent the text data.
- Model Architecture: We implemented deep learning models using TensorFlow and Keras, with BERT-based architectures for factoid and list questions, and a custom neural network for yes/no questions.
- Training: We fine-tuned our models on the BioASQ dataset, using techniques like early stopping to prevent overfitting.
- Evaluation: We used various metrics to evaluate our models, including strict accuracy, lenient accuracy, and mean reciprocal rank for factoid and list questions, and standard classification metrics for yes/no questions.
task_factoid/train_factoid.py
: Main script for training the factoid question answering model.task_list/functions_list.py
: Contains functions for the list question answering task.task_yesno/notebooks/simple_embeddings.ipynb
: Notebook demonstrating the use of embeddings for yes/no questions.utils/data.py
: Utility functions for data loading and preprocessing.utils/evaluate.py
: Functions for evaluating model performance.
The project requires several Python libraries, which are listed in the requirements.txt
file. Key dependencies include:
- TensorFlow
- PyTorch
- Transformers
- Flair
- NLTK
- NumPy
- Pandas
To run the models:
- Install the required dependencies:
pip install -r requirements.txt
- Prepare the data by running the data preprocessing scripts in the
utils/
directory. - Train the models using the respective training scripts in each task directory.
- Evaluate the models using the provided evaluation functions.
Our models achieved competitive results on the BioASQ 2020 challenge. Detailed performance metrics and analysis can be found in the PDF summary in the main folder.
We would like to thank the BioASQ organizers for providing this challenging and important competition in the field of biomedical text mining and question answering.