Official implementation of 'Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification'.
The paper has been accepted by ECCV 2022.
- Our latest work, CaFo, is based on Tip-Adapter and accepted by CVPR 2023 🔥. Please refer here for the code.
Tip-Adapter is a training-free adaption method that lets CLIP perform few-shot classification. It inherits the training-free advantage of zero-shot CLIP while performing comparably to approaches that require training. Tip-Adapter constructs the adapter as a key-value cache model built from the few-shot training set, and updates the prior knowledge encoded in CLIP through feature retrieval. In addition, fine-tuning the cache model for 10x fewer epochs than existing approaches boosts Tip-Adapter to state-of-the-art performance, making it both effective and efficient.
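The cache-based retrieval described above can be illustrated with a minimal sketch. It assumes the cache logits take the form exp(-beta * (1 - similarity)) applied to the one-hot labels and are added to the zero-shot CLIP logits with weight alpha. The function names, the default values of alpha and beta, and the logit scale of 100 are illustrative assumptions, and plain Python lists are used to keep the example dependency-free, whereas the repository itself works on PyTorch tensors.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length, as is usually done with CLIP features."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def tip_adapter_logits(feat, cache_keys, cache_values, clip_weights,
                       alpha=1.0, beta=5.5):
    """Combine zero-shot CLIP logits with logits retrieved from the cache.

    feat:         L2-normalized test feature, length D
    cache_keys:   N L2-normalized few-shot training features, each length D
    cache_values: N one-hot labels, each length C
    clip_weights: C L2-normalized text-classifier weights, each length D
    alpha, beta:  illustrative defaults (assumptions, not tuned values)
    """
    num_classes = len(clip_weights)
    # Zero-shot prior: scaled cosine similarity to each class text embedding.
    # The scale of 100 is an assumption standing in for CLIP's logit scale.
    clip_logits = [100.0 * dot(feat, w) for w in clip_weights]
    # Affinity between the test feature and every cached key; beta sharpens it.
    affinity = [math.exp(-beta * (1.0 - dot(feat, k))) for k in cache_keys]
    # Retrieval: affinity-weighted sum of the one-hot values, per class.
    cache_logits = [
        sum(a * v[c] for a, v in zip(affinity, cache_values))
        for c in range(num_classes)
    ]
    # alpha balances the cache evidence against the CLIP prior.
    return [p + alpha * q for p, q in zip(clip_logits, cache_logits)]

# Toy usage: the text classifier cannot separate the two classes,
# so the cached few-shot example decides the prediction.
feat = l2_normalize([1.0, 0.0])
keys = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
values = [[1.0, 0.0], [0.0, 1.0]]
tie = l2_normalize([1.0, 1.0])
logits = tip_adapter_logits(feat, keys, values, [tie, tie])
predicted_class = logits.index(max(logits))
```

Setting alpha to 0 in this sketch recovers the zero-shot logits, which is why no training is needed to obtain a working classifier.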
Create a conda environment and install dependencies:
git clone https://github.com/gaopengcuhk/Tip-Adapter.git
cd Tip-Adapter
conda create -n tip_adapter python=3.7
conda activate tip_adapter
pip install -r requirements.txt
# Install the versions of torch and torchvision that match your CUDA setup
conda install pytorch torchvision cudatoolkit
Follow DATASET.md to install ImageNet and the other 10 datasets, as set up in CoOp.
The running configurations can be modified in configs/dataset.yaml, including shot numbers, visual encoders, and hyperparameters.
For simplicity, we provide the hyperparameters that achieve the best overall performance across 1 to 16 shots for each dataset; these match the scores reported in the paper. Tuning them separately for each shot number can further improve the 1- to 16-shot performance. You can edit search_scale, search_step, init_beta, and init_alpha for fine-grained tuning.
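To make the role of these four settings concrete, here is a minimal grid-search sketch. It assumes that search_scale holds the upper bounds for beta and alpha, that search_step holds the number of candidates tried for each, and that candidates start at a lower bound of 0.1. The helper names and the evaluate callback are hypothetical and may differ from the repository's own search routine.

```python
def grid(upper, steps, lower=0.1):
    """Return `steps` evenly spaced candidates starting at `lower`, below `upper`."""
    return [lower + i * (upper - lower) / steps for i in range(steps)]

def search_hp(evaluate, search_scale, search_step, init_beta, init_alpha):
    """Grid-search beta and alpha; return (best_acc, best_beta, best_alpha).

    evaluate:     hypothetical callable (beta, alpha) -> validation accuracy
    search_scale: [beta_upper, alpha_upper] (assumed meaning)
    search_step:  [num_beta_candidates, num_alpha_candidates] (assumed meaning)
    """
    # Start from the initial values so the search never returns a worse result.
    best = (evaluate(init_beta, init_alpha), init_beta, init_alpha)
    for beta in grid(search_scale[0], search_step[0]):
        for alpha in grid(search_scale[1], search_step[1]):
            acc = evaluate(beta, alpha)
            if acc > best[0]:
                best = (acc, beta, alpha)
    return best

# Toy usage with a made-up accuracy surface that peaks at beta=3, alpha=1.
toy_accuracy = lambda beta, alpha: -((beta - 3.0) ** 2 + (alpha - 1.0) ** 2)
best_acc, best_beta, best_alpha = search_hp(
    toy_accuracy, search_scale=[7, 3], search_step=[200, 20],
    init_beta=1.0, init_alpha=1.0,
)
```

Under these assumptions, a larger search_step gives a finer grid at the cost of more evaluations, while search_scale widens or narrows the searched range.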
Note that load_cache and load_pre_feat default to False for the first run, which stores the cache model and the val/test features in configs/dataset/. For later runs, they can be set to True for faster hyperparameter tuning.
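The build-once, reload-later behaviour of these two flags can be pictured with the following sketch. It is an illustration of the pattern only: the load_or_build helper, the use of pickle, and the file path are assumptions, and the repository may store its tensors differently.

```python
import os
import pickle

def load_or_build(path, build_fn, load):
    """Return the object stored at `path` if `load` is True; else build and store it.

    path:     where the cached object lives (hypothetical location)
    build_fn: zero-argument callable that computes the object from scratch
    load:     plays the role of a flag such as load_cache or load_pre_feat
    """
    if load:
        # Later runs: skip the expensive computation and read the stored result.
        with open(path, "rb") as f:
            return pickle.load(f)
    # First run: compute the object, then store it for subsequent runs.
    obj = build_fn()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```

With the flag set to False the expensive step runs once and its result is written to disk; with the flag set to True the stored result is read back, which is what makes repeated hyperparameter tuning faster.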
The numerical results for Tip-Adapter shown in Figures 4 and 5 of the paper are provided in exp.log.
CLIP-Adapter's numerical results are also updated for comparison.
For the ImageNet dataset:
CUDA_VISIBLE_DEVICES=0 python main_imagenet.py --config configs/imagenet.yaml
For the other 10 datasets:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/dataset.yaml
Fine-tuning of Tip-Adapter-F runs automatically after the training-free Tip-Adapter.
Renrui Zhang, Peng Gao
This repo benefits from CLIP, CoOp, and CLIP-Adapter. Thanks for their wonderful work.
@article{zhang2021tip,
title={Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling},
author={Zhang, Renrui and Fang, Rongyao and Gao, Peng and Zhang, Wei and Li, Kunchang and Dai, Jifeng and Qiao, Yu and Li, Hongsheng},
journal={arXiv preprint arXiv:2111.03930},
year={2021}
}
If you have any questions about this project, please feel free to contact [email protected] and [email protected].