(*= equal contribution)
Metrics like FID are commonly used in medical image analysis to compare distributions of real and/or generated images, following accepted practices in mainstream computer vision, but they may not be the best choice for medical imaging! Here we provide easy-to-use code for computing our proposed distance metric RaD (Radiomic Distance) between sets of medical images, which we introduced in our paper [RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications](https://arxiv.org/abs/2412.01496), designed specifically for the needs of medical image analysis.
RaD utilizes standardized radiomic image features rather than pretrained deep image features (as in commonly-used metrics like FID, KID, CMMD, etc.) to compare sets of medical images. We show in our paper that this results in a number of desirable improvements over these prior metrics for medical image distribution comparison, such as:
- Better alignment with downstream task performance (e.g., segmentation).
- Improved stability and computational efficiency for small-to-moderately-sized datasets.
- Improved interpretability, due to RaD utilizing features that are clearly defined and commonly used in the medical imaging community (see "Interpreting Differences between Image Sets" below for more).
In our paper, we validate these claims through a wide range of experiments across diverse medical imaging datasets and applications, showing RaD to be a promising alternative to FID, KID, etc. for applications such as:
- Out-of-Domain Detection (see "Out-of-Domain (OOD) Detection" below for more info)
- Image-to-Image Translation Model Evaluation
- Unconditional Generative Model Evaluation
as well as many other potential applications where distributions of real and/or synthetic medical images need to be compared.
Big thanks to PyRadiomics for providing the radiomic feature computational backend for our code!
Please cite our paper if you use our code or reference our work:
```bibtex
@article{konz2024radmetricmedicalimage,
  title={RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications},
  author={Nicholas Konz and Yuwen Chen and Hanxue Gu and Haoyu Dong and Yaqian Chen and Maciej A. Mazurowski},
  year={2024},
  eprint={2412.01496},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.01496},
}
```
Please install the required packages by running:

```bash
pip3 install -r requirements.txt
```

Next, install PyRadiomics by running:

```bash
bash install.sh
```
To compute the RaD between two sets of images, simply run the following command in the main directory:
```bash
python3 compute_rad.py \
    --image_folder1 {IMAGE_FOLDER1} \
    --image_folder2 {IMAGE_FOLDER2}
```
where:

- `IMAGE_FOLDER1` is the path to the first set of images.
- `IMAGE_FOLDER2` is the path to the second set of images.
For example, if you want to use RaD to compare a set of generated images to a set of real images (e.g., to evaluate some generative model), these folders correspond to the paths to the generated and real images, respectively.
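For intuition, the sketch below shows one common way a distribution-level distance can be computed over two sets of feature vectors: a Fréchet distance between Gaussian fits of the features, as FID does with deep features. This is a minimal illustration only, not the repository's exact RaD computation (see `compute_rad.py` for the authoritative implementation); the `features1`/`features2` arrays are hypothetical stand-ins for extracted radiomic feature matrices.

```python
# Minimal sketch: a Frechet-style distance between two sets of feature
# vectors (as FID uses for deep features). Illustrative only; see
# compute_rad.py for the repository's actual RaD computation.
import numpy as np
from scipy import linalg

def frechet_distance(features1: np.ndarray, features2: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (n_samples, n_features) arrays."""
    mu1, mu2 = features1.mean(axis=0), features2.mean(axis=0)
    cov1 = np.cov(features1, rowvar=False)
    cov2 = np.cov(features2, rowvar=False)

    # sqrtm can return tiny imaginary components from numerical error; drop them.
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

# Hypothetical usage with random stand-ins for radiomic feature matrices:
rng = np.random.default_rng(0)
features1 = rng.normal(size=(200, 32))          # e.g., 200 images x 32 features
features2 = rng.normal(loc=0.5, size=(200, 32))
print(frechet_distance(features1, features2))
```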
Want to see the specific name of each computed radiomic feature used in RaD, for the sake of interpretability? The extracted features for each provided image folder are saved as a `.csv` file in that folder, where each named column corresponds to a radiomic feature. Additionally, you can add the argument `return_feature_names=True` to the function `convert_radiomic_dfs_to_vectors()` in `src/radiomics_utils.py` (e.g., as used in `compute_rad.py`) to return the names of each radiomic feature in addition to the actual feature vectors of the two datasets, which are used to compute RaD.
RaD can easily be used for detecting whether some newly acquired medical image is from the same domain/distribution as some reference set, e.g., comparing an image acquired at an outside hospital to the reference set used to train a downstream task model within your own hospital. This is helpful if you're wondering whether your trained model will work as well on the new data as it did on data from your own institution, or if performance may worsen due to the new data being out-of-domain (for example, because it was acquired with a different type of scanner).
You can do this with the following command, which runs Algorithm 1 from our paper:
```bash
python3 ood_detection.py \
    --image_folder {IMAGE_FOLDER} \
    --image_folder_reference {REFERENCE_IMAGE_FOLDER}
```
where:

- `IMAGE_FOLDER` is the path to the folder containing the test images you want to predict as being OOD or not.
- `REFERENCE_IMAGE_FOLDER` is the path to the folder containing the reference (in-domain) images you're comparing the test images to.
The OOD detection results will be saved in a `.csv` file within `outputs/ood_predictions`, with columns of:

- `filename` (the name of the image file),
- `ood_score` (the unnormalized RaD OOD score of the image),
- `ood_prediction` (the binary prediction of whether the image is OOD (`pred=1`) or not (`pred=0`)), and
- `p_value` (the p-value/probability of the image being in-domain).
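To give a rough sense of how per-image OOD scores and p-values like these can be derived from radiomic features, here is an illustrative sketch. It is *not* the paper's Algorithm 1 (see `ood_detection.py` for the actual procedure): it scores each test image by the Mahalanobis distance of its feature vector from the reference distribution, and converts reference-set scores into an empirical p-value.

```python
# Illustrative sketch of per-image OOD scoring over radiomic feature
# vectors; NOT the paper's Algorithm 1 (see ood_detection.py for that).
import numpy as np

def mahalanobis_scores(X: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row of X from the reference distribution."""
    mu = ref.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(ref, rowvar=False))
    diffs = X - mu
    return np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)

rng = np.random.default_rng(0)
ref_feats = rng.normal(size=(300, 32))          # hypothetical reference radiomic features
test_feats = rng.normal(loc=1.0, size=(5, 32))  # hypothetical test-image features

ref_scores = mahalanobis_scores(ref_feats, ref_feats)
test_scores = mahalanobis_scores(test_feats, ref_feats)

# Empirical p-value: fraction of reference images scoring at least as high.
p_values = [(ref_scores >= s).mean() for s in test_scores]
ood_preds = [int(p < 0.05) for p in p_values]  # 1 = predicted OOD
print(list(zip(test_scores.round(2), p_values, ood_preds)))
```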
Alternatively, in our paper we propose a dataset-level OOD detection score, which you can compute by adding the `--dataset_level` flag:
```bash
python3 ood_detection.py \
    --image_folder {IMAGE_FOLDER} \
    --image_folder_reference {REFERENCE_IMAGE_FOLDER} \
    --dataset_level
```
You can also perform additional radiomic feature interpretability analysis to understand which specific radiomic features are driving the differences between two sets of images (for example, are these differences mostly texture-based?), and, if the datasets are paired, which images are most and least changed between the two distributions according to these features. For example, this can be used to interpret how an image-to-image translation model transformed images from one domain to another (as shown in the paper).

These analyses can be created easily by adding the `--interpret` argument to `compute_rad.py`, as:
```bash
python3 compute_rad.py \
    --image_folder1 {IMAGE_FOLDER1} \
    --image_folder2 {IMAGE_FOLDER2} \
    --interpret
```
This will save visualizations to `outputs/interpretability_visualizations`, which include (1) a plot of the sorted distribution of the changes for each feature between the two datasets (Fig. 7a in our paper), and (2) a t-SNE plot of the radiomic feature vectors of the two datasets.
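For a rough idea of how per-feature changes like those in (1) could be quantified, here is a hypothetical sketch (the repository's own analysis may be computed differently): it ranks features by the standardized mean difference between the two datasets, reading from the saved feature `.csv` files, whose paths below are placeholders.

```python
# Sketch: rank radiomic features by how much they differ between two
# datasets, via a simple standardized mean difference. Illustrative only;
# the repository's interpretability plots may be computed differently.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder paths to the per-folder feature .csv files.
df1 = pd.read_csv("IMAGE_FOLDER1/radiomic_features.csv")
df2 = pd.read_csv("IMAGE_FOLDER2/radiomic_features.csv")

# Use only the numeric feature columns shared by both datasets.
features = df1.select_dtypes("number").columns.intersection(
    df2.select_dtypes("number").columns)

pooled_std = np.sqrt((df1[features].var() + df2[features].var()) / 2)
change = ((df2[features].mean() - df1[features].mean()) / pooled_std).abs()
change = change.sort_values(ascending=False)

print(change.head(10))  # the features driving the largest differences

change.head(30).plot(kind="bar", figsize=(12, 4), ylabel="|standardized change|")
plt.tight_layout()
plt.savefig("sorted_feature_changes.png")
```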
If the two datasets are paired (e.g., when comparing images before and after being modified by an image-to-image translation model, as in the paper), which is detected by image pairs having the same filename in both dataset folders, the following additional analyses will be performed and outputted to `outputs/interpretability_visualizations` (a minimal sketch of the paired per-image analysis follows the list below):
- A plot of the sorted distribution of the changes for each image between the two datasets with respect to all radiomic features (Fig. 7c in our paper).
- Plots showing the most- and least-changed images between the two distributions according to the radiomic features (Fig. 7d in our paper).
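For the paired case, a minimal sketch of measuring per-image change might look like the following. It assumes each saved feature `.csv` has a `filename` column for matching pairs, which is an assumption for illustration (the repository's exact file layout and computation may differ).

```python
# Sketch: per-image change between paired datasets, measured as the
# distance between each pair's radiomic feature vectors. Assumes a
# shared "filename" column; the repo's exact computation may differ.
import numpy as np
import pandas as pd

df1 = pd.read_csv("IMAGE_FOLDER1/radiomic_features.csv")  # placeholder paths
df2 = pd.read_csv("IMAGE_FOLDER2/radiomic_features.csv")
paired = df1.merge(df2, on="filename", suffixes=("_1", "_2"))

feature_names = [c[:-2] for c in paired.columns if c.endswith("_1")]
a = paired[[f + "_1" for f in feature_names]].to_numpy()
b = paired[[f + "_2" for f in feature_names]].to_numpy()

# Standardize features so no single feature dominates the distance.
scale = np.concatenate([a, b]).std(axis=0)
scale[scale == 0] = 1.0
per_image_change = np.linalg.norm((a - b) / scale, axis=1)

order = np.argsort(per_image_change)
print("least changed:", paired["filename"].iloc[order[:3]].tolist())
print("most changed:", paired["filename"].iloc[order[-3:]].tolist())
```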