Evaluating Histopathology Foundation Models for Few-shot Tissue Clustering: an Application to LC25000 Augmented Dataset Cleaning
Best paper award at 2nd Workshop on Data Engineering in Medical Imaging (DEMI), MICCAI 2024
Paper Link | Open Access Record* | Code | Cite
*Full open-access text will be available after the embargo period from October 2025.
Abstract: Recent digital histopathology datasets have significantly advanced the development of deep learning-based histopathology frameworks. However, data leakage in model training can lead to artificially high metrics that do not genuinely reflect the strength of the approach. The LC25000 dataset, consisting of tissue image tiles extracted from lung and colon samples, is a popular benchmark dataset. In the released version, tissue tiles were augmented randomly and mixed. As a result, many studies report near-perfect accuracy scores, often due to data leakage, where augmented images of the same tissue tile are split into both training and test sets. To improve the quality of performance reports, we develop a semi-automatic pipeline to clean LC25000. By clustering and separating all augmented images of the same tiles, using recently proposed histopathology foundation models and manual correction, we create a clean version of LC25000. We then evaluate the quality of features extracted by these foundation models, using the clustering task as a benchmark. Our contributions are:
- We publicly release our semi-automatic annotation pipeline along with the LC25000-clean dataset to facilitate appropriate utilization of this dataset, reducing the risk of overestimating models' performance;
- We profile various combinations of feature extraction and clustering methods for identifying duplicates of the same image generated by basic image transformations;
- We propose the clustering task as a minimal-setup benchmark to evaluate the quality of tissue image features learned by histopathology foundation models.
The LC25000 dataset is a widely used histology image dataset. It contains highly correlated images, which results in data leakage if models are both trained and evaluated on it. This repository contains (1) the cleaned dataset with highly correlated images grouped together, (2) the code for the semi-automatic cleaning pipeline, and (3) the evaluation code for using the cleaned dataset as a minimal-setup benchmark for new histopathology foundation models.
The LC25000 dataset is a large-scale dataset for histology image classification. It contains 25,000 images (patches extracted from whole-slide images) with 5,000 images per class. The dataset can be downloaded by following the instructions from the official GitHub repository: https://github.com/tampapath/lung_colon_image_set/
Dataset Paper: Borkowski AA, Bui MM, Thomas LB, Wilson CP, DeLand LA, Mastorides SM. Lung and Colon Cancer Histopathological Image Dataset (LC25000). arXiv:1912.12142v1 [eess.IV], 2019
The paper provides the following information about the dataset:
"HIPAA compliant and validated seven hundred fifty total images of lung tissue (250 benign lung tissue, 250 lung adenocarcinomas, and 250 lung squamous cell carcinomas) and 500 total images of colon tissue (250 benign colon tissue and 250 colon adenocarcinomas) were captured from pathology glass slides as we previously described.[8] ... Using Augmentor, we expanded our dataset to 25,000 images by the following augmentations: left and right rotations (up to 25 degrees, 1.0 probability) and by horizontal and vertical flips (0.5 probability)."
The LC25000 dataset is a popular resource with more than 140 works citing it (142 according to Semantic Scholar on June 28th, 2024).
Each of the 5 classes contains 5,000 images generated by applying rotations and flips to the 250 original images, resulting on average in 20 highly correlated samples per original tile. If the dataset is randomly split into train+validation and test sets using the 80/20 ratio, we can expect over 99% of the test images to have augmented versions of the same original tile in the training set, which constitutes data leakage.
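This expectation is easy to verify numerically. Below is a minimal sketch (assuming, as described in the dataset paper, 250 original tiles expanded to 5,000 images per class) that simulates random 80/20 splits and measures the fraction of test images whose original tile also appears in the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, group_size, test_frac = 250, 20, 0.2   # one LC25000 class

groups = np.repeat(np.arange(n_groups), group_size)   # tile id per image
leak_fractions = []
for _ in range(10):                                    # 10 random splits
    perm = rng.permutation(groups.size)
    n_test = int(test_frac * groups.size)
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    train_groups = set(groups[train_idx])
    # fraction of test images whose original tile also appears in training
    leak_fractions.append(np.mean([g in train_groups for g in groups[test_idx]]))

print(f"leaked test images: {np.mean(leak_fractions):.2%}")   # ~100%
```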
The LC25000 dataset can be downloaded by following the instructions from the official GitHub repository (https://github.com/tampapath/lung_colon_image_set/) or from HuggingFace (https://huggingface.co/datasets/1aurent/LC25000).
The data directory should have the following structure:
    LC25000-clean (this repository)
        README.md
        annotations/
        ...
        LC25000/
            lung_aca/
                lungaca1.jpg
                lungaca2.jpg
                ...
            lung_n/
                lungn1.jpg
                lungn2.jpg
                ...
            lung_scc/
                lungscc1.jpg
                lungscc2.jpg
                ...
            colon_aca/
                colonaca1.jpg
                colonaca2.jpg
                ...
            colon_n/
                colonn1.jpg
                colonn2.jpg
                ...
Run the commands detailed in the environment-creation.md file to create a conda environment with the necessary dependencies.
We used a pre-trained UNI model to extract features from each of the classes of the LC25000 dataset (5,000 images per class). The features are saved in a `features.npy` file. The mapping of the image index in the `.npy` file to the image path is saved in `ids_2_img_paths.json`.
UNI paper: Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F.K., et al. Towards a general-purpose foundation model for computational pathology. Nat Med (2024). https://doi.org/10.1038/s41591-024-02857-3
Notebook: `1-feature-extraction.ipynb` | Script: `extract_features.py`
python extract_features.py \
--cancer_type lung_aca \
--img_norm resize_only \
--extractor_name UNI \
--device cuda \
--batch_size 256
Notes:
- If the output file already exists, the user will be asked whether to overwrite it.
- The script also prints the progress of the feature extraction.
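For reference, the core of the extraction step can be sketched as below. The function name is illustrative, and the loading of the model and its preprocessing transform (handled in the repository by `get_model_with_transform.py`) is omitted; `dataset` is assumed to yield preprocessed image tensors:

```python
import json
import numpy as np
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def extract_and_save_features(model, dataset, img_paths, device="cuda", batch_size=256):
    """Embed every image and save the features and the index-to-path mapping."""
    model = model.to(device).eval()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    features = torch.cat([model(batch.to(device)).cpu() for batch in loader]).numpy()

    np.save("features.npy", features)                  # (n_images, feature_dim)
    with open("ids_2_img_paths.json", "w") as f:
        json.dump(dict(enumerate(img_paths)), f)       # row index -> image path
    return features
```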
We used scikit-learn's KMeans to cluster the features extracted with the UNI model. The number of clusters was set to 250, matching the number of original tiles per class.
After clustering, the image closest to each cluster centroid was picked as the representative image of that cluster. The other samples in the cluster were then compared manually to the representative image: samples similar to the representative image were kept in the cluster, and dissimilar ones were recorded as not belonging to it.
Notebook: 2-clustering-interactive.ipynb
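A minimal sketch of this step, assuming `features.npy` was produced as above; the `PCA(n_components=0.95)` reduction mirrors the `PCA-0.95` option used in the evaluation command below:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

features = np.load("features.npy")                          # (5000, d) for one class
reduced = PCA(n_components=0.95).fit_transform(features)    # keep 95% of the variance

kmeans = KMeans(n_clusters=250, random_state=0, n_init=10).fit(reduced)

# representative image per cluster: the sample closest to the centroid
representatives = {}
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(reduced[members] - kmeans.cluster_centers_[c], axis=1)
    representatives[c] = members[np.argmin(dists)]
```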
The features were evaluated using the manual annotations as the ground truth, with the following metrics:
- Retrieval metrics: evaluate whether the closest images in the feature space come from the same original image
  - precision@1
  - precision@5
- Binary connectivity metrics: two images are considered connected (label 1) if they are in the same ground-truth cluster, and disconnected (label 0) otherwise
  - Confusion Matrix
  - Accuracy
  - Precision
  - Recall
  - F1 Score
  - Specificity
  - Balanced Accuracy
- Clustering metrics: evaluate the quality of the predicted clustering against the manual clustering
  - Fowlkes-Mallows Index
  - Adjusted Rand Index (ARI)
  - Normalized Mutual Information (NMI)
  - Homogeneity
  - Completeness
  - V-Measure
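A minimal sketch of how such metrics can be computed with scikit-learn; `features`, `ground_truth`, and `predicted` (per-image NumPy arrays of features, manual cluster labels, and predicted cluster labels) are assumed to be defined:

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score, fowlkes_mallows_score,
                             homogeneity_completeness_v_measure,
                             normalized_mutual_info_score)
from sklearn.neighbors import NearestNeighbors

def precision_at_k(features, true_labels, k):
    """Mean fraction of the k nearest neighbours sharing the query's true cluster."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)  # +1 skips the query itself
    _, idx = nn.kneighbors(features)
    return np.mean(true_labels[idx[:, 1:]] == true_labels[:, None])

print("precision@1:", precision_at_k(features, ground_truth, 1))
print("precision@5:", precision_at_k(features, ground_truth, 5))
print("FMI:", fowlkes_mallows_score(ground_truth, predicted))
print("ARI:", adjusted_rand_score(ground_truth, predicted))
print("NMI:", normalized_mutual_info_score(ground_truth, predicted))
print("H/C/V:", homogeneity_completeness_v_measure(ground_truth, predicted))
```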
Notebook: `3-evaluation.ipynb` | Script: `evaluate_clustering.py`
python evaluate_clustering.py \
--cancer_type lung_aca \
--img_norm resize_only \
--extractor_name UNI \
--distance_metric euclidean \
--dimensionality_reduction PCA-0.95 \
--clustering kmeans
Except for `cancer_type`, all other arguments can also be set to `all` to evaluate all computed features with all implemented distance metrics, dimensionality reduction techniques, and clustering algorithms.
Other arguments:
- `--manual_annotations_dir`: Directory containing the manual annotations
- `--overwrite`: Overwrite the existing evaluation results
- `--verbose`: Print the evaluation metrics
To reproduce plots from the paper, run the notebook 4-analyze-clustering-results.ipynb.
The cleaned dataset was used to understand how much performance estimates are affected by the dataset contamination. Set-up:
- 2 dataset versions: original and cleaned, used for train/test splitting in a random or grouped manner, respectively (see the grouped-split sketch after the notebook link below)
- 3 train/test split ratios: 80/20, 20/80, and 5/95
- 10 random splits of each dataset to obtain the mean and standard deviation of classification accuracy
- 3 feature extractors: UNI, Phikon, and ResNet18
- 2 classifiers: KNN and Linear
Notebook: 5-one-shot-and-linear-probing.ipynb
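A minimal sketch of the grouped splitting, using scikit-learn's `GroupShuffleSplit` and assuming `features`, `labels`, and `groups` are NumPy arrays where `groups` holds the original-tile id of each image from the cleaned annotations:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# original dataset, random split: augmented copies of a tile leak across sets
train_idx, test_idx = train_test_split(np.arange(len(features)),
                                       test_size=0.2, random_state=0)

# cleaned dataset, grouped split: all copies of a tile stay on one side
gss = GroupShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for train_idx, test_idx in gss.split(features, labels, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```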
To evaluate a new model on the cleaned dataset, follow these steps:
- Download the data and set up the environment as described in the previous sections.
- Prepare the model in the same format as the other models in `source/feature_extraction/get_model_with_transform.py`. The model should inherit from `torch.nn.Module` and have a `forward` method that takes an image tensor and returns a feature tensor. If the model is set up in a different way, adjust it as shown in `source/feature_extraction/models/owkin_phikon.py` (see the wrapper sketch after this list).
- Extract features from the model using the `extract_features.py` script with your preferred normalization method.
- Evaluate the features using the `evaluate_clustering.py` script. You can add other dimensionality reduction techniques and clustering algorithms if needed by modifying `reduce_feature_dimensionality()` and `get_clustering_labels()` in `source/eval_utils.py`.
- Analyze the evaluation results using the `4-analyze-clustering-results.ipynb` notebook.
- Run classification experiments using the `5-one-shot-and-linear-probing.ipynb` notebook.
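A minimal sketch of the model-preparation wrapper mentioned above; the backbone and the choice of the CLS token as the feature are illustrative assumptions, in the spirit of the `owkin_phikon.py` adjustment:

```python
import torch

class FeatureExtractorWrapper(torch.nn.Module):
    """Wraps a backbone so that forward() maps an image batch to a feature batch."""

    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.backbone(x)
        # e.g. for a Hugging Face ViT output, take the CLS token as the feature
        if hasattr(out, "last_hidden_state"):
            return out.last_hidden_state[:, 0]
        return out
```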
George Batchkala is supported by Fergus Gleeson's A2 research funds, UKRI DART Lung Health Program (Innovate UK grant 40255), and the EPSRC Center for Doctoral Training in Health Data Science (EP/S02428X/1).
This project relied on the following repositories:
- feature extraction: UNI, Prov-GigaPath, Phikon, DINOv2, dsmil-wsi
- dimensionality reduction and clustering: scikit-learn
Pre-publication citation; it will be updated after the workshop date (10th of October 2024).
@inproceedings{batchkala2025EvaluatingHistopathologyFoundation,
title = {Evaluating {{Histopathology Foundation Models}} for~{{Few-Shot Tissue Clustering}}: {{An~Application}} to~{{LC25000 Augmented Dataset Cleaning}}},
shorttitle = {Evaluating {{Histopathology Foundation Models}} for~{{Few-Shot Tissue Clustering}}},
booktitle = {Data {{Engineering}} in {{Medical Imaging}}},
author = {Batchkala, George and Li, Bin and Rittscher, Jens},
editor = {Bhattarai, Binod and Ali, Sharib and Rau, Anita and Caramalau, Razvan and Nguyen, Anh and Gyawali, Prashnna and Namburete, Ana and Stoyanov, Danail},
year = {2025},
pages = {11--21},
publisher = {Springer Nature Switzerland},
address = {Cham},
doi = {10.1007/978-3-031-73748-0_2},
isbn = {978-3-031-73748-0},
langid = {english}
}