A new medical visual question answering (VQA) dataset built on the MIMIC-CXR database
The MIMIC-CXR-VQA dataset is a complex (involving set and logical operations), diverse (with 48 templates), and large-scale (approximately 377K) resource, designed specifically for Visual Question Answering (VQA) tasks in the medical domain. Primarily focusing on chest radiographs, this dataset was mainly derived from the MIMIC-CXR-JPG and Chest ImaGenome datasets, both of which were sourced from Physionet.
The goal of the MIMIC-CXR-VQA dataset is to serve as a benchmark for evaluating the effectiveness of current medical VQA approaches. Beyond traditional medical VQA tasks, it also serves as an image-based Electronic Health Records (EHRs) Question Answering resource. We therefore use the question templates from the MIMIC-CXR-VQA dataset as seed question templates for the image modality when constructing a multi-modal EHR QA dataset, EHRXQA.
- [07/20/2024] We released the MIMIC-CXR-VQA dataset on Physionet.
- [12/12/2023] We presented our research work at NeurIPS 2023 Datasets and Benchmarks Track as a poster.
- [10/28/2023] We released our research paper on arXiv.
- Overview
- Updates
- Features
- Installation
- Setup
- Usage
- Versioning
- Contributing
- Contact
- Acknowledgements
- Citation
- License
- Provide a script to download source datasets (MIMIC-CXR-JPG, Chest ImaGenome, and MIMIC-IV) from Physionet.
- Provide a script to preprocess the source datasets.
- Provide a script to generate the MIMIC-CXR-VQA dataset (with answer information).
Ensure that you have Python 3.8.5 or higher installed on your machine. Set up the environment and install the required packages using the commands below:
# Set up the environment
conda create --name mimiccxrvqa python=3.8.5
# Activate the environment
conda activate mimiccxrvqa
# Install required packages
pip install pandas==1.1.3 tqdm==4.65.0 scikit-learn==0.23.2
Clone this repository and navigate into it:
git clone https://github.com/baeseongsu/mimic-cxr-vqa.git
cd mimic-cxr-vqa
We take data privacy very seriously. All of the data you access through this repository has been carefully prepared to prevent any privacy breaches or data leakage. You can use this data with confidence, knowing that all necessary precautions have been taken.
The MIMIC-CXR-VQA dataset is constructed from the MIMIC-CXR-JPG (v2.0.0), Chest ImaGenome (v1.0.0), and MIMIC-IV (v2.2). All these source datasets require a credentialed Physionet license. Due to these requirements and in adherence to the Data Use Agreement (DUA), only credentialed users can access the MIMIC-CXR-VQA dataset files (see Access Policy). To access the source datasets, you must fulfill all of the following requirements:
- Be a credentialed user
- If you do not have a PhysioNet account, register for one here.
- Follow these instructions for credentialing on PhysioNet.
- Complete the "CITI Data or Specimens Only Research" training course.
- Sign the data use agreement (DUA) for each project
To facilitate easy access to the MIMIC-CXR-VQA dataset for users who have pre-downloaded the MIMIC-CXR, MIMIC-IV, and Chest ImaGenome datasets, please ensure the predefined directory global variables (MIMIC_IV_BASE_DIR, MIMIC_CXR_BASE_DIR, CHEST_IMAGENOME_BASE_DIR) in the script align with your local dataset paths.
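Before running the build, it can help to confirm that the three base directories actually exist. The following is a minimal sketch and is not part of the repository: the helper `find_missing_dirs` and the example paths are hypothetical, and only the three variable names come from the script.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_dirs(base_dirs: Dict[str, str]) -> List[str]:
    """Return the variable names whose path is not an existing directory."""
    return [name for name, path in base_dirs.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    # Placeholder paths (assumptions): replace them with your local dataset locations.
    base_dirs = {
        "MIMIC_IV_BASE_DIR": "/path/to/mimic-iv",
        "MIMIC_CXR_BASE_DIR": "/path/to/mimic-cxr-jpg",
        "CHEST_IMAGENOME_BASE_DIR": "/path/to/chest-imagenome",
    }
    missing = find_missing_dirs(base_dirs)
    if missing:
        print("These variables do not point to an existing directory:", missing)
    else:
        print("All base directories were found.")
```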
To generate the MIMIC-CXR-VQA dataset from your pre-downloaded datasets, run the main script as follows:
bash build_dataset.sh
Alternatively, if you prefer to download the source datasets directly from Physionet and then generate the MIMIC-CXR-VQA dataset, use the script below, which requires your Physionet credentials:
bash download_and_build_dataset.sh
When running the script, you'll be prompted to enter your PhysioNet credentials:
- Username: Type your PhysioNet username and press Enter.
- Password: Type your PhysioNet password and press Enter (note that the password will not be visible).
The script undertakes several actions: (1) downloading the source datasets from Physionet, (2) preprocessing these datasets, and (3) generating the complete MIMIC-CXR-VQA dataset by creating ground-truth answer information.
For convenience, we provide a script that downloads only the CXR images relevant to the MIMIC-CXR-VQA dataset, rather than all the MIMIC-CXR-JPG images.
bash download_images.sh
During script execution, enter your PhysioNet credentials when prompted:
- Username: Enter your PhysioNet username and press Enter.
- Password: Enter your PhysioNet password and press Enter. The password characters won't appear on screen.
This script performs several actions: 1) it reads the image paths from the JSON files of the MIMIC-CXR-VQA dataset; 2) uses these paths to download the corresponding images from the MIMIC-CXR-JPG dataset hosted on Physionet; and 3) saves these images locally in the corresponding directories as per their paths.
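The first of these steps can be sketched in Python as follows. This is only an illustration, not the code used by download_images.sh: the helper name `collect_image_paths` is hypothetical, and the sketch assumes each entry in the JSON files carries an `image_path` field. The download itself is left to the script.

```python
import json
from typing import Iterable, List


def collect_image_paths(json_files: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated image paths found in the given files."""
    paths = set()
    for json_file in json_files:
        with open(json_file, "r", encoding="utf-8") as f:
            entries = json.load(f)  # each file holds a list of dictionaries
        for entry in entries:
            image_path = entry.get("image_path")
            if image_path:  # skip entries that have no path
                paths.add(image_path)
    return sorted(paths)
```

Because many questions share the same image, de-duplicating the paths first avoids requesting the same file more than once.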
The dataset is structured as follows:
mimiccxrvqa
└── dataset
    ├── ans2idx.json
    ├── _train_part1.json
    ├── _train_part2.json
    ├── _valid.json
    ├── _test.json
    ├── train.json (available post-script execution)
    ├── valid.json (available post-script execution)
    └── test.json (available post-script execution)
- mimiccxrvqa is the root directory. Within it, the dataset directory contains the JSON files that make up the MIMIC-CXR-VQA dataset.
- The ans2idx.json file is a dictionary mapping from answers to their corresponding indices.
- _train_part1.json, _train_part2.json, _valid.json, and _test.json are pre-release versions of the dataset files corresponding to the training, validation, and testing sets respectively. These versions are intentionally incomplete to safeguard privacy and prevent the leakage of sensitive information; they do not include certain crucial information, such as the answers.
- Once the main script is executed with valid Physionet credentials, the full versions of these files (train.json, valid.json, and test.json) will be generated. These files contain the complete information, including images, questions, and the corresponding answers for each entry in the respective sets.
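Since ans2idx.json maps answers to indices, decoding a model prediction usually needs the reverse mapping. Below is a minimal sketch of that inversion; the helper name `invert_ans2idx` and the toy mapping are hypothetical and do not reflect the actual contents of the file.

```python
from typing import Dict


def invert_ans2idx(ans2idx: Dict[str, int]) -> Dict[int, str]:
    """Build the index-to-answer mapping, rejecting duplicate indices."""
    idx2ans = {idx: ans for ans, idx in ans2idx.items()}
    if len(idx2ans) != len(ans2idx):
        raise ValueError("ans2idx maps several answers to the same index")
    return idx2ans


if __name__ == "__main__":
    # Toy mapping for illustration only; load the real one with json.load.
    toy_ans2idx = {"yes": 0, "no": 1}
    idx2ans = invert_ans2idx(toy_ans2idx)
    print(idx2ans[1])
```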
The QA samples in the MIMIC-CXR-VQA dataset are stored in individual .json files. Each file contains a list of Python dictionaries with the following keys:
- split: a string indicating its split.
- idx: a number indicating its instance index.
- image_id: a string indicating the associated image ID.
- question: a question string.
- content_type: a string indicating its content type, which can be one of the following:
  - anatomy
  - attribute
  - presence
  - abnormality
  - plane
  - gender
  - size
- semantic_type: a string indicating its semantic type, which can be one of the following:
  - verify
  - choose
  - query
- template: a template string.
- template_program: a string indicating its template program. Each template has a unique program to get its answer from the database.
- template_arguments: a dictionary specifying its template arguments, consisting of five sub-dictionaries that represent the sampled values for arguments in the template. When an argument needs to appear multiple times in a question template, an index is appended to the dictionary.
  - object
  - attribute
  - category
  - viewpos
  - gender
Note that these details can be open-sourced without safety concerns and without revealing the dataset's distribution information (including image, question, and answer distributions), thanks to our uniform sampling strategy.
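As a quick sanity check on the fields described above, the sketch below counts questions per (content_type, semantic_type) pair and rejects values outside the listed sets. The function name `count_question_types` is hypothetical; the allowed values are copied from the lists in this section.

```python
from collections import Counter
from typing import Dict, Iterable

CONTENT_TYPES = {"anatomy", "attribute", "presence", "abnormality", "plane", "gender", "size"}
SEMANTIC_TYPES = {"verify", "choose", "query"}


def count_question_types(entries: Iterable[Dict]) -> Counter:
    """Count entries per (content_type, semantic_type) pair."""
    counts = Counter()
    for entry in entries:
        content, semantic = entry["content_type"], entry["semantic_type"]
        if content not in CONTENT_TYPES:
            raise ValueError(f"unexpected content_type: {content!r}")
        if semantic not in SEMANTIC_TYPES:
            raise ValueError(f"unexpected semantic_type: {semantic!r}")
        counts[(content, semantic)] += 1
    return counts
```

The input is the list obtained by loading one of the split files with `json.load`.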
After validating the PhysioNet credentials, the create_answer.py script generates the following items:
- answer: an answer string.
- subject_id: a string indicating the corresponding subject ID (patient ID).
- study_id: a string indicating the corresponding study ID.
- image_path: a string indicating the corresponding image path.
For example, here is a sample instance:
{
    "split": "train",
    "idx": 13280,
    "image_id": "34c81443-5a19ccad-7b5e431c-4e1dbb28-42a325c0",
    "question": "Are there signs of both pleural effusion and lung cancer in the left lower lung zone?",
    "content_type": "attribute",
    "semantic_type": "verify",
    "template": "Are there signs of both ${attribute_1} and ${attribute_2} in the ${object}?",
    "template_program": "program_5",
    "template_arguments": {
        "object": {
            "0": "left lower lung zone"
        },
        "attribute": {
            "0": "pleural effusion",
            "1": "lung cancer"
        },
        "category": {},
        "viewpos": {},
        "gender": {}
    },
    "answer": "Will be generated by dataset_builder/generate_answer.py",
    "subject_id": "Will be generated by dataset_builder/generate_answer.py",
    "study_id": "Will be generated by dataset_builder/generate_answer.py",
    "image_path": "Will be generated by dataset_builder/generate_answer.py"
}
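To show how template and template_arguments relate to question, here is a minimal sketch that fills the placeholders of the instance above. It is not the repository's generation code. The function `fill_template` is hypothetical, and the mapping it uses (`${attribute_1}` reads key "0", `${attribute_2}` reads key "1", and an unnumbered placeholder such as `${object}` reads key "0") is an assumption inferred from this single example.

```python
import re
from typing import Dict

# Matches ${name} and ${name_N}, e.g. ${object} or ${attribute_2}.
PLACEHOLDER = re.compile(r"\$\{([a-z]+)(?:_(\d+))?\}")


def fill_template(template: str, arguments: Dict[str, Dict[str, str]]) -> str:
    """Replace each placeholder with the matching sampled argument value."""
    def substitute(match: "re.Match") -> str:
        name, position = match.group(1), match.group(2)
        # Assumed convention: 1-based placeholder suffix, 0-based dictionary key.
        key = str(int(position) - 1) if position else "0"
        return arguments[name][key]

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = "Are there signs of both ${attribute_1} and ${attribute_2} in the ${object}?"
    arguments = {
        "object": {"0": "left lower lung zone"},
        "attribute": {"0": "pleural effusion", "1": "lung cancer"},
        "category": {},
        "viewpos": {},
        "gender": {},
    }
    print(fill_template(template, arguments))
```

Run on the example instance, this reproduces the value of its question field.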
We employ semantic versioning for our dataset, with the current version being v1.0.0. Generally, we will maintain and provide updates only for the latest version of the dataset. However, in cases where significant updates occur or when older versions are required for validating previous research, we may exceptionally retain previous dataset versions for a period of up to one year. For a detailed list of changes made in each version, check out our CHANGELOG.
Contributions to enhance the usability and functionality of this dataset are always welcomed. If you're interested in contributing, feel free to fork this repository, make your changes, and then submit a pull request. For significant changes, please first open an issue to discuss the proposed alterations.
For any questions or concerns regarding this dataset, please feel free to reach out to us ([email protected] or [email protected]). We appreciate your interest and are eager to assist.
More details will be provided soon.
When you use the MIMIC-CXR-VQA dataset, we would appreciate it if you cite the following:
@article{bae2024ehrxqa,
title={EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images},
author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric and Kim, Tackeun and others},
journal={Advances in Neural Information Processing Systems},
volume={36},
year={2024}
}
The code in this repository is provided under the terms of the MIT License. The final output of the dataset created using this code, the MIMIC-CXR-VQA, is subject to the terms and conditions of the original datasets from Physionet: MIMIC-CXR-JPG License, Chest ImaGenome License, and MIMIC-IV License.