Provides the official implementation of the paper "On UMAP's true loss function".
Adds loss logging capabilities to UMAP and validates that UMAP's optimization procedure optimizes a different loss than purported. Further information, for instance on how this can create artifacts in UMAP visualizations, can be found in the paper.
Clone the repository
git clone https://github.com/hci-unihd/UMAPs-true-loss
Change into the directory, create a conda environment from environment.yml
and activate it
conda env create -f environment.yml
conda activate umaps_true_loss
Install the extension of the UMAP package
python setup.py install
Download the C. elegans, PBMC and lung cancer datasets
cd data
python get_c_elegans.py
python get_PBMC.py
python get_lung_cancer_data.py
Download the CIFAR-10 dataset and a pretrained Resnet50 to extract features (CUDA-ready GPU needed)
python get_cifar10_resnet50_features.py
Logging UMAP losses on large datasets also requires a CUDA-ready GPU.
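Before starting the GPU-dependent steps, it can help to check whether an NVIDIA GPU is visible at all. The following is a minimal sketch, not part of the repository: the helper name `nvidia_gpu_visible` is hypothetical, and the check is only a heuristic based on the `nvidia-smi` tool that ships with the NVIDIA driver. It does not confirm that the CUDA toolkit or the libraries used by this project can actually use the device.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Heuristic: True if `nvidia-smi` exists and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The driver utility is not on PATH, so we assume no usable GPU.
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU visible:", nvidia_gpu_visible())
```

If this reports no GPU, the CIFAR-10 feature extraction and loss logging on large datasets will likely not run as described.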
To reproduce the results of the paper, run the notebooks below from a Jupyter notebook server launched in notebooks.
UMAP_*.ipynb: produces the visualizations in the paper; should be run first.
*_histograms.ipynb: produces the histograms in the paper.
embedding_quality_measures.ipynb: computes the measures for the quality of embeddings.
run_times.ipynb: computes the run times of the key experiments.
stability: computes the loss values given in the paper over several runs with different random seeds.
The figures will be saved in data/figures and other output in data/DATASET.
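The run order described above (UMAP_*.ipynb first, everything else afterwards) can be sketched as a small helper. This is only an illustration: `order_notebooks` is a hypothetical function, not part of the repository, and the order among the remaining notebooks is an assumption (alphabetical), since only the UMAP_* notebooks are stated to come first.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def order_notebooks(names):
    """Return notebook names with UMAP_*.ipynb first, the rest after them.

    Within each group the names are sorted alphabetically (an assumption;
    the README only requires that the UMAP_* notebooks run first).
    """
    first = sorted(n for n in names if fnmatchcase(n, "UMAP_*.ipynb"))
    rest = sorted(n for n in names if not fnmatchcase(n, "UMAP_*.ipynb"))
    return first + rest


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    found = [p.name for p in Path("notebooks").glob("*.ipynb")]
    for name in order_notebooks(found):
        print(name)
```

The printed list is only a suggested order for opening the notebooks by hand; the sketch does not execute them.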
Our implementation extends version 0.5.0 of https://github.com/lmcinnes/umap. The added functionality provides five new arguments to the UMAP class:
graph: allows specifying high-dimensional similarities as part of the input instead of inferring them from the data.
push_tail: specifies whether the tail of a negative sample should be pushed away from its head.
log_losses: specifies if and how losses should be logged.
log_samples: specifies whether sampled edges and negative samples should be logged.
log_embeddings: specifies whether intermediate embeddings should be logged.
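As a rough illustration of how the new arguments might be passed, the sketch below filters a set of candidate keyword arguments against the constructor of whichever UMAP class is installed, so it also runs (with nothing passed) against the stock package. The helper `supported_kwargs` is hypothetical, and the values are assumptions: the three "whether" arguments are treated as booleans, while `graph` and `log_losses` are left out because their accepted values are not specified here.

```python
import inspect


def supported_kwargs(cls, candidates):
    """Keep only the keyword arguments that cls.__init__ explicitly names."""
    params = inspect.signature(cls.__init__).parameters
    return {k: v for k, v in candidates.items() if k in params}


# Assumed values: the "whether" flags are treated as booleans here.
# `graph` and `log_losses` are omitted; their accepted values are not
# described in this README section.
candidates = {"push_tail": True, "log_samples": False, "log_embeddings": False}

if __name__ == "__main__":
    try:
        import umap  # the extended package installed via setup.py
    except ImportError:
        print("umap is not installed")
    else:
        kwargs = supported_kwargs(umap.UMAP, candidates)
        print("arguments accepted by this UMAP install:", sorted(kwargs))
        reducer = umap.UMAP(**kwargs)
```

With the extension installed, all three names should appear in the printed list; with the unmodified UMAP package the list is expected to be empty.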
Our changes are confined to umap_.py and layout.py and two new files, my_utils.py and my_plots.py.