PyTorch implementation and pretrained models of TEAM. A new dataset containing over 100M Chinese image-text pairs will also be released.
We provide three pre-trained models:
- pretrained_4m.pth: TEAM with ViT-B/16 (initialized from DeiT-base) as the image encoder, pre-trained on 4 million image-text pairs.
- pretrained_14m_clip_large.pth: TEAM with ViT-L/14 (initialized from CLIP-L/14) as the image encoder, pre-trained on 14 million image-text pairs.

Both checkpoints can be found here.
In addition, we release a TEAM model trained on our collected Chinese image-text dataset; please refer to TEAM图文检索模型-中文-large for more details.
To evaluate pretrained_14m_clip_large.pth on the COCO Retrieval task, run:
python -m eval configs/pretrain_5m/team_clipl14.py
Note that the results of the second stage are the final results.
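Before running the evaluation, it can be useful to sanity-check a downloaded checkpoint. The sketch below is illustrative only and not part of the repo: whether the weights sit at the top level of the .pth file or are wrapped under a "model" key is an assumption and may differ from the actual checkpoint layout.

```python
import torch

# Illustrative sanity check of a downloaded checkpoint (not part of this repo).
# Whether the weights sit at the top level or under a "model" key is assumed.
ckpt = torch.load("pretrained_14m_clip_large.pth", map_location="cpu")
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(f"loaded {len(state_dict)} tensors")
print(list(state_dict.keys())[:10])  # peek at the first few parameter names
```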
To train TEAM with ViT-L/14 as the image encoder on 4 million image-text pairs, run:
python -m torch.distributed.launch --nproc_per_node=8 train.py configs/pretrain_5m/team_clipl14.py
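The launcher starts one process per GPU (eight here) and hands each process its local rank via the --local_rank argument or the LOCAL_RANK environment variable. As a rough sketch of how a typical DDP entry point consumes that information (the actual train.py in this repo may be organized differently):

```python
import argparse
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch of a generic entry point for torch.distributed.launch; the real
# train.py may differ. Shown only to illustrate how the launcher's
# per-process rank is consumed.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT, RANK, WORLD_SIZE come from the launcher
torch.cuda.set_device(args.local_rank)   # bind this process to its GPU

model = torch.nn.Linear(10, 10).cuda()            # placeholder for the actual TEAM model
model = DDP(model, device_ids=[args.local_rank])  # synchronizes gradients across the 8 processes
```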
COCO Retrieval results (R@K, %):

| Model | Setting | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pretrained_4m | Zero-shot | 74.9 | 91.8 | 95.3 | 54.7 | 79.5 | 86.6 |
| pretrained_4m | Finetune | 77.3 | 93.6 | 96.5 | 59.7 | 83.2 | 89.4 |
| pretrained_14m_clip_large | Zero-shot | 82.8 | 95.6 | 97.6 | 63.9 | 85.1 | 90.4 |
| pretrained_14m_clip_large | Finetune | 84.0 | 96.1 | 98.0 | 66.9 | 87.0 | 92.1 |
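For reference, text-retrieval R@K counts an image query as correct if any of its ground-truth captions appears among the top-K retrieved texts (COCO has about five captions per image). Below is a minimal sketch of that computation from an image-text similarity matrix, purely for illustration; the repo's evaluation script has its own implementation, and the names sims and img2txt are hypothetical.

```python
import torch

def recall_at_k(sims: torch.Tensor, img2txt: list, k: int) -> float:
    """Text-retrieval R@K from a [num_images, num_texts] similarity matrix.

    img2txt[i] is the list of ground-truth caption indices for image i.
    Illustration only; names and layout are assumptions, not the repo's API.
    """
    topk = sims.topk(k, dim=1).indices  # top-k text indices per image query
    hits = [
        bool(set(topk[i].tolist()) & set(img2txt[i]))
        for i in range(sims.size(0))
    ]
    return 100.0 * sum(hits) / len(hits)
```

Image-retrieval R@K is computed analogously on the transposed similarity matrix with a text-to-image ground-truth mapping.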
If you find this repository useful, please consider citing our paper:
@inproceedings{TEAM2022MM,
title = {Token Embeddings Alignment for Cross-Modal Retrieval},
author = {Xie, Chen-Wei and Wu, Jianmin and Zheng, Yun and Pan, Pan and Hua, Xian-Sheng},
booktitle = {ACMMM},
year = {2022}
}
Some code is borrowed from ALBEF and CLIP. Many thanks to their authors.