This repository provides the official PyTorch implementation of our CVPR 2024 paper:
**Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models**
Authors: Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, Lei Zhang
This repository contains the implementation of DMN for image classification with a pre-trained CLIP. We consider four task settings:
- Zero-shot classification in a test-time adaptation manner
- Few-shot classification
- Training-free few-shot classification
- Out-of-distribution generalization
Results on the ImageNet dataset under different task settings.
The overall framework of our DMN.
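For intuition only, here is a toy sketch of one plausible ingredient of a memory-based framework: a soft nearest-neighbor read-out that turns cached image features into per-class scores. This is an illustrative simplification, not the paper's exact formulation; the function name, the temperature `tau`, and the plain one-hot label values are assumptions of the sketch.

```python
# Illustrative sketch of a memory-based read-out (NOT the paper's exact
# formulation): blend cached features into per-class scores for a test image.
import torch
import torch.nn.functional as F

def memory_logits(feat, mem_feats, mem_labels, num_classes, tau=0.01):
    """feat: (D,) L2-normalized test feature.
    mem_feats: (N, D) L2-normalized cached features.
    mem_labels: (N,) integer (pseudo) labels for the cached features."""
    sims = mem_feats @ feat                   # cosine similarities, (N,)
    weights = F.softmax(sims / tau, dim=0)    # soft attention over the memory
    one_hot = F.one_hot(mem_labels, num_classes).float()
    return weights @ one_hot                  # per-class scores, (num_classes,)

# Toy usage with random data.
D, N, C = 512, 32, 10
mem = F.normalize(torch.randn(N, D), dim=-1)
labels = torch.randint(0, C, (N,))
query = F.normalize(torch.randn(D), dim=-1)
print(memory_logits(query, mem, labels, C))
```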
This implementation targets a single-GPU configuration. All experiments can be reproduced on a GPU with more than 10 GB of memory (e.g., a 1080 Ti).
The code is tested on PyTorch 1.13.1.
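A quick way to sanity-check your environment (PyTorch version and available GPU memory) before launching anything:

```python
import torch

print(torch.__version__)  # the code is tested on 1.13.1
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # >10 GB is enough for all experiments (e.g., a 1080 Ti).
    print(f'{props.name}: {props.total_memory / 2**30:.1f} GB')
```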
We suggest downloading all datasets to a root directory (`${data_root}`) and renaming the directory of each dataset as suggested in `${ID_to_DIRNAME}` in `./data/datautils.py`. This allows you to evaluate multiple datasets within the same run. If this is not feasible, you can evaluate different datasets separately and change `${data_root}` accordingly in the bash script.
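Before a long run, a small script like the one below can verify the layout. The two mapping entries shown are placeholders; the authoritative mapping is `ID_to_DIRNAME` in `./data/datautils.py`.

```python
import os

# Hypothetical excerpt -- the real mapping lives in ./data/datautils.py;
# the two entries below are placeholders for illustration.
ID_to_DIRNAME = {
    'I': 'ImageNet',
    'DTD': 'dtd',
}

def check_layout(data_root):
    """Print every dataset directory missing under data_root."""
    for set_id, dirname in ID_to_DIRNAME.items():
        path = os.path.join(data_root, dirname)
        if not os.path.isdir(path):
            print(f'[missing] {set_id}: expected {path}')

check_layout(os.path.expanduser('~/datasets'))
```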
For zero/few-shot classification, we consider 11 datasets: ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVC-Aircraft, SUN397, DTD, EuroSAT, and UCF101.
For out-of-distribution generalization, we consider 4 datasets: ImageNet-A, ImageNet-V2, ImageNet-R, and ImageNet-Sketch.
We provide a simple bash script under `./scripts/run.sh`. You can modify the paths and other arguments in the script, then reproduce all results with:

```bash
bash ./scripts/run.sh
```
For simplicity, we use `set_id` to denote different datasets. The complete list of `set_id` values can be found in `${ID_to_DIRNAME}` in `./data/datautils.py`.
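Assuming `ID_to_DIRNAME` is a module-level dict and the repo root is on `PYTHONPATH`, you can print the available `set_id` values directly:

```python
# Run from the repository root.
from data.datautils import ID_to_DIRNAME

print(sorted(ID_to_DIRNAME.keys()))
```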
Few-shot classification results on 11 datasets with a ViT-B/16 image encoder.

Out-of-distribution generalization results (top-1 accuracy, %) with a ResNet-50 image encoder:
Method | ImageNet (IN) | IN-A | IN-V2 | IN-R | IN-Sketch | Average | OOD Average |
---|---|---|---|---|---|---|---|
CLIP-RN50 | 58.16 | 21.83 | 51.41 | 56.15 | 33.37 | 44.18 | 40.69 |
Ensembled prompt | 59.81 | 23.24 | 52.91 | 60.72 | 35.48 | 46.43 | 43.09 |
CoOp | 63.33 | 23.06 | 55.40 | 56.60 | 34.67 | 46.61 | 42.43 |
CoCoOp | 62.81 | 23.32 | 55.72 | 57.74 | 34.48 | 46.81 | 42.82 |
TPT | 60.74 | 26.67 | 54.70 | 59.11 | 35.09 | 47.26 | 43.89 |
DMN-ZS | 63.87 | 28.57 | 56.12 | 61.44 | 39.84 | 49.97 | 46.49 |
If you find our code useful or our work relevant, please consider citing:
```bibtex
@inproceedings{zhang2024dual,
  title={Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models},
  author={Zhang, Yabin and Zhu, Wenjie and Tang, Hui and Ma, Zhiyuan and Zhou, Kaiyang and Zhang, Lei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```
We thank the authors of CoOp/CoCoOp and TPT for their open-source implementations and instructions on data preparation.