# Token Embeddings Alignment for Cross-Modal Retrieval

PyTorch implementation and pretrained models of TEAM (Token Embeddings Alignment). A new dataset containing over 100M Chinese image-text pairs will also be released.
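As a rough illustration of what token-level alignment means for retrieval scoring, the sketch below matches each text token to its most similar image patch token and averages the matches. This is only an assumption about the general idea, not the exact objective from the paper; all names and shapes are made up for the example.

```python
import torch
import torch.nn.functional as F

def token_alignment_score(image_tokens, text_tokens):
    """Illustrative token-level similarity (a sketch, not the paper's exact formulation).

    image_tokens: (num_image_tokens, dim) L2-normalized patch embeddings
    text_tokens:  (num_text_tokens, dim)  L2-normalized word embeddings
    For each text token, take its best-matching image token, then average.
    """
    sim = text_tokens @ image_tokens.T          # (num_text_tokens, num_image_tokens)
    return sim.max(dim=1).values.mean()         # scalar alignment score

# Toy usage with random, normalized embeddings (197 patches for a ViT-B/16 image, 12 text tokens).
img = F.normalize(torch.randn(197, 768), dim=-1)
txt = F.normalize(torch.randn(12, 768), dim=-1)
print(token_alignment_score(img, txt))
```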

## Model

## Pretrained Models

We provide the following pre-trained models:

- `pretrained_4m.pth`: TEAM with ViT-B/16 (initialized from DeiT-base) as the image encoder, pre-trained on 4 million image-text pairs.

- `pretrained_14m_clip_large.pth`: TEAM with ViT-L/14 (initialized from CLIP-L/14) as the image encoder, pre-trained on 14 million image-text pairs.

Both checkpoints can be found here.

In addition, we release a TEAM model trained on our collected Chinese image-text dataset; please refer to TEAM图文检索模型-中文-large (the TEAM image-text retrieval model, Chinese, large) for more details.
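For reference, a downloaded checkpoint can be inspected before plugging it into the training or evaluation configs. The snippet below is a hypothetical sketch: the key layout of the `.pth` files (a `model` wrapper versus a flat state dict) is an assumption, not a documented format.

```python
import torch

# Hypothetical inspection of a downloaded checkpoint; the key layout
# ("model" wrapper vs. a flat state dict) is an assumption, not a documented format.
ckpt = torch.load("pretrained_14m_clip_large.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"{len(state_dict)} entries")
for name, value in list(state_dict.items())[:5]:
    shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
    print(name, shape)
```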

## Evaluation

To evaluate `pretrained_14m_clip_large.pth` on the COCO retrieval task, run:

```
python -m eval configs/pretrain_5m/team_clipl14.py
```

Note that the results of the second stage are the final results.
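The two stages are not described in this README; based on common practice for this kind of model, the sketch below assumes a first stage that recalls candidates with global embeddings and a second stage that re-ranks them with token-level scores. Every function, name, and shape here is an illustrative assumption, not the repository's actual eval code.

```python
import torch

def two_stage_retrieval(img_global, txt_global, img_tokens, txt_tokens, k=128):
    """Hypothetical two-stage text-to-image retrieval.

    img_global: (N_img, dim), txt_global: (N_txt, dim) -- L2-normalized global embeddings
    img_tokens: list of (L_img, dim) tensors, txt_tokens: list of (L_txt, dim) tensors
    Stage 1 recalls the top-k images per text with global similarity;
    stage 2 re-ranks those candidates with a token-level alignment score.
    """
    k = min(k, img_global.size(0))
    coarse = txt_global @ img_global.T              # stage 1: fast global similarity
    candidates = coarse.topk(k, dim=1).indices      # (N_txt, k) candidate image indices
    ranked = []
    for t, cand in enumerate(candidates):
        scores = torch.stack([
            (txt_tokens[t] @ img_tokens[int(i)].T).max(dim=1).values.mean()  # stage 2 score
            for i in cand
        ])
        ranked.append(cand[scores.argsort(descending=True)])
    return ranked  # stage-2 ordering per text query is the final ranking
```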

## Training

To train TEAM with ViT-L/14 as the image encoder on 4 million image-text pairs:

```
python -m torch.distributed.launch --nproc_per_node=8 train.py configs/pretrain_5m/team_clipl14.py
```

## Experimental Results

### COCO Retrieval

| Model | Setting | Text Retrieval R@1 | R@5 | R@10 | Image Retrieval R@1 | R@5 | R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pretrained_4m | Zero-shot | 74.9 | 91.8 | 95.3 | 54.7 | 79.5 | 86.6 |
| pretrained_4m | Finetune | 77.3 | 93.6 | 96.5 | 59.7 | 83.2 | 89.4 |
| pretrained_14m_clip_large | Zero-shot | 82.8 | 95.6 | 97.6 | 63.9 | 85.1 | 90.4 |
| pretrained_14m_clip_large | Finetune | 84.0 | 96.1 | 98.0 | 66.9 | 87.0 | 92.1 |

## Citation

If you find this repository useful, please consider citing our paper:

```bibtex
@inproceedings{TEAM2022MM,
  title = {Token Embeddings Alignment for Cross-Modal Retrieval},
  author = {Xie, Chen-Wei and Wu, Jianmin and Zheng, Yun and Pan, Pan and Hua, Xian-Sheng},
  booktitle = {ACMMM},
  year = {2022}
}
```

Some code is borrowed from ALBEF and CLIP; many thanks to the authors of these projects.