VAST is implemented on top of PyTorch. We use Python 3.9 and CUDA 11.7; other versions may also be compatible. The remaining required packages are listed in preinstall.sh.
conda create -n vast python=3.9
conda activate vast
sh preinstall.sh
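After installation, you can optionally confirm from Python that PyTorch and CUDA are visible (a quick sanity check, not required by the repo):
# Optional environment sanity check: print the PyTorch build and CUDA visibility.
import torch

print('torch version :', torch.__version__)
print('cuda available:', torch.cuda.is_available())
print('cuda version  :', torch.version.cuda)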
Create a directory named pretrained_weights under the main working directory.
1. Download the EVA-CLIP weight:
wget -P pretrained_weights/clip/ https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA01_CLIP_g_14_psz14_s11B.pt
2. Download the BEATs weight (BEATs_iter3_plus_AS2M.pt) from https://github.com/microsoft/unilm/tree/master/beats and place it under pretrained_weights/beats/.
3. Download the BERT weight:
# Download bert-base-uncased from Hugging Face and save it locally.
from transformers import BertModel, BertTokenizer

bert = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert.save_pretrained('pretrained_weights/bert/bert-base-uncased')
bert_tokenizer.save_pretrained('pretrained_weights/bert/bert-base-uncased')
The resulting pretrained_weights directory should look as follows:
├── pretrained_weights
│ ├── beats
│ │ └── BEATs_iter3_plus_AS2M.pt
│ ├── bert
│ │ └── bert-base-uncased
│ ├── clip
│ │ └── EVA01_CLIP_g_14_psz14_s11B.pt
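Optionally, you can verify the layout with a short check; this is only a convenience sketch, and the paths simply mirror the tree above:
# Check that the expected pretrained weights are in place (paths mirror the tree above).
import os

required = [
    'pretrained_weights/clip/EVA01_CLIP_g_14_psz14_s11B.pt',
    'pretrained_weights/beats/BEATs_iter3_plus_AS2M.pt',
    'pretrained_weights/bert/bert-base-uncased',
]
for path in required:
    print('ok     ' if os.path.exists(path) else 'MISSING', path)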
Create a directory named output under the main working directory.
1. Download the VAST model (optional, for finetuning):
[Google Drive Link] [Baidu Cloud Link]
2. Download the vision captioner (optional, for captioning images/videos):
[Google Drive Link] [Baidu Cloud Link]
3. Download the audio captioner (optional, for captioning audio):
[Google Drive Link] [Baidu Cloud Link]
The resulting output directory should look as follows:
├── output
│ ├── vast
│ │ ├── pretrain_vast
│ │ ├── vision_captioner
│ │ └── audio_captioner
Download the VAST-27M annotations: [Google Drive Link] [Baidu Cloud Link]
The raw videos can be downloaded from YouTube.
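One way to fetch raw videos given their YouTube IDs is the third-party yt-dlp package (pip install yt-dlp); it is not part of this repo, and the output directory and format choice below are only illustrative assumptions:
# Illustrative sketch using the third-party yt-dlp package.
# The output directory and format selection are assumptions, not repo conventions.
from yt_dlp import YoutubeDL

def download_video(youtube_id, out_dir='raw_videos'):
    opts = {
        'outtmpl': f'{out_dir}/{youtube_id}.%(ext)s',  # save as <youtube_id>.<ext>
        'format': 'mp4/best',                          # prefer an mp4 stream when available
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([f'https://www.youtube.com/watch?v={youtube_id}'])

download_video('VIDEO_ID')  # replace with a real YouTube ID from the annotations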
Create a directory named datasets under the main working directory.
Download the annotations for the downstream datasets: [Google Drive Link] [Baidu Cloud Link]
The resulting datasets directory should look as follows:
├── datasets
│ ├── annotations
│ │ ├── msrvtt
│ │ ├── ...
│ │ └── msvd
│ ├── srcdata
│ │ ├── msrvtt
│ │ ├── ...
│ │ └── msvd
The srcdata (images/videos/audio) has to be collected by yourself.
- Finetune on retrieval tasks:
sh scripts/vast/finetune_ret.sh
- Finetune on captioning tasks:
sh scripts/vast/finetune_cap.sh
- Finetune on QA tasks:
sh scripts/vast/finetune_qa.sh
To pretrain VAST:
sh scripts/pretrain_vast.sh
For example, the command for finetuning the retrieval model on MSRVTT is as follows:
python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt
If you want to test the model, just add the following two lines to the command:
--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt
To run the captioners on your own data, you need to prepare 1) a folder containing all of your videos/images or audios, and 2) a meta.json file of the form [{'video_id':'09WssDay9FE_1'},{'video_id':'09WssDay9FE_2'},...], and then write the corresponding config file.
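A minimal sketch for generating meta.json from such a folder (the folder path and extension set are placeholders to adapt to your data):
# Build meta.json from a folder of media files; the path and extensions are placeholders.
import json
from pathlib import Path

media_dir = Path('/PATH/TO/YOUR/MEDIA')
exts = {'.mp4', '.avi', '.mkv', '.jpg', '.png', '.wav', '.mp3'}

meta = [{'video_id': p.stem} for p in sorted(media_dir.iterdir()) if p.suffix.lower() in exts]

with open('meta.json', 'w') as f:
    json.dump(meta, f)

print(f'wrote {len(meta)} entries to meta.json')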
sh scripts/vast/vision_captioner.sh
sh scripts/vast/audio_captioner.sh
Commonly adjusted options (set in the config files or passed on the command line, as in the example above) include:
--train_vision_sample_num
--test_vision_sample_num
--train_audio_sample_num
--test_audio_sample_num
--train_task
--test_task
--learning_rate
--train_batch_size
--test_batch_size
--train_epoch
--train_steps
--checkpointing
--frozen_vision
--valid_freq
--beam_size
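Most of these can be overridden directly on the command line, as in the retrieval example above. If you prefer editing a config file instead, the sketch below assumes the JSON keys mirror the option names listed here; check the files under ./config/vast/ for the actual schema:
# Copy an existing finetuning config and override a few options.
# Assumes the JSON keys mirror the option names above; verify against the real config file.
import json

with open('./config/vast/finetune_cfg/retrieval-msrvtt.json') as f:
    cfg = json.load(f)

cfg['learning_rate'] = 2e-5          # as in the retrieval example above
cfg['train_batch_size'] = 64         # illustrative value
cfg['train_vision_sample_num'] = 8   # illustrative value

with open('./config/vast/finetune_cfg/retrieval-msrvtt-custom.json', 'w') as f:
    json.dump(cfg, f, indent=2)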
If you find this code useful for your research, please consider citing:
@article{chen2024vast,
title={Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset},
author={Chen, Sihan and Li, Handong and Wang, Qunbo and Zhao, Zijia and Sun, Mingzhen and Zhu, Xinxin and Liu, Jing},
journal={Advances in Neural Information Processing Systems},
volume={36},
year={2024}
}
This project is released under the MIT license.
For the full list of third-party licenses used in this project, please see the THIRD_PARTY_LICENSES.md file.