Abstract
This repo holds the code for our paper CoVGT, accepted to IEEE T-PAMI'23. The work extends our preliminary publication at ECCV'22. We highlight the following differences compared to the conference version:
- Jointly supervised and self-supervised contrastive objectives to optimize VGT.
- Substitute BERT with a stronger language model (e.g., RoBERTa) for QA embedding.
- Extended results on Causal-VidQA and STAR-QA and more comprehensive ablation studies.
The code is based on VGT.
- Release features of other datasets. Please email the first author and state your reason, as the data is strictly for research purposes.
Assuming you have installed Anaconda3 and CUDA (version > 11.0) and have a GPU with at least 24G of memory, please do the following to set up the environment:
>conda create -n videoqa python==3.8.16
>conda activate videoqa
>git clone https://github.com/doc-doc/CoVGT.git
>pip install -r requirements.txt
>conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=11.1 -c pytorch -c nvidia
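After the installation, a quick sanity check can confirm that the environment matches the requirements stated above. The following is a minimal sketch, not part of this repo: the helper names `parse_version`, `meets_requirements`, and `report` are hypothetical, and the check rounds the reported GPU memory because nominal 24G cards usually report slightly less than 24 GiB.

```python
import importlib.util

MIN_GPU_MEM_GB = 24  # this README asks for GPU memory >= 24G


def parse_version(text):
    """Turn '11.1' or '1.8.1+cu111' into a tuple of ints, e.g. (11, 1)."""
    core = text.split("+")[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def meets_requirements(cuda_version, gpu_mem_gb):
    """True if CUDA is newer than 11.0 and the GPU has roughly >= 24 GB.

    Assumption: memory is rounded to the nearest GB, since a nominal 24G
    card typically reports a little under 24 GiB.
    """
    cuda_ok = parse_version(cuda_version) > (11, 0)
    return cuda_ok and round(gpu_mem_gb) >= MIN_GPU_MEM_GB


def report():
    """Print a short summary of the current environment."""
    if importlib.util.find_spec("torch") is None:
        print("PyTorch is not installed in this environment.")
        return
    import torch

    print("torch:", torch.__version__)
    if not torch.cuda.is_available():
        print("CUDA is not available.")
        return
    mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print("CUDA:", torch.version.cuda, "| GPU 0 memory (GB):", round(mem_gb, 1))
    print("meets README requirements:",
          meets_requirements(torch.version.cuda, mem_gb))


if __name__ == "__main__":
    report()
```

Run it inside the activated `videoqa` environment. If it reports that the requirements are not met, training may still start, but the batch sizes in the provided scripts may not fit in memory.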
Please create a data folder outside this repo, so that your workspace has two folders: 'workspace/data/' and 'workspace/CoVGT/'.
Below we use NExT-QA as an example to get you familiar with the code.
Please download the related video features and QA annotations from the links provided in the Results and Resources section. Note that, after you clone this repo, the QA annotations should be saved into workspace/CoVGT/datasets/nextqa/, video features into workspace/data/nextqa/, and checkpoint files into workspace/data/save_models/nextqa/. Change the default paths in global_parameters.py and args.py for your own datasets.
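The folder layout described above can be created and checked with a few lines of Python. This is a minimal sketch under the assumption that your workspace root is the parent of both 'data/' and 'CoVGT/'; the helpers `missing_dirs` and `create_data_dirs` are hypothetical and not part of this repo. It only creates the folders under 'data/', since the 'CoVGT/' tree comes from cloning.

```python
import tempfile
from pathlib import Path

# Folders named in this README, relative to the workspace root.
EXPECTED_DIRS = [
    "CoVGT/datasets/nextqa",    # QA annotations (inside the cloned repo)
    "data/nextqa",              # video features
    "data/save_models/nextqa",  # checkpoint files
]


def missing_dirs(workspace):
    """Return the expected folders that do not exist under `workspace`."""
    root = Path(workspace)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


def create_data_dirs(workspace):
    """Create only the folders under 'data/'; the repo itself is cloned."""
    root = Path(workspace)
    for rel in EXPECTED_DIRS:
        if rel.startswith("data/"):
            (root / rel).mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    # Demonstration in a throwaway directory instead of a real workspace.
    with tempfile.TemporaryDirectory() as workspace:
        create_data_dirs(workspace)
        print("still missing:", missing_dirs(workspace))
```

Pointing `missing_dirs` at your real workspace before running the test or training scripts gives a quick list of anything not yet downloaded or cloned.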
./shell/next_test.sh 0
python eval_next.py --folder CoVGT_FTCoWV --mode test
Table 1. VideoQA Accuracy (%) on Test Set.
Cross-Modal Pretrain | NExT-QA | Causal-VidQA | STAR | TGIF-QA (Action) | TGIF-QA (Trans) | TGIF-QA (FrameQA) | TGIF-QA-R* (Action) | TGIF-QA-R* (Trans) | MSRVTT-QA |
---|---|---|---|---|---|---|---|---|---|
- | 59.4 | 59.1 | 44.0 | 94.7 | 97.6 | 61.6 | 60.8 | 73.8 | 38.3 |
WebVid0.18M | 59.7 | 60.8 | 46.2 | 91.3 | 96.2 | 61.7 | 61.0 | 73.2 | 40.0 |
- | feats | feats | feats | feats | feats | feats | feats | feats | feats |
- | videos | videos | videos | videos | videos | videos | videos | videos | videos |
- | Q&A | Q&A | Q&A | Q&A | Q&A | Q&A | Q&A | Q&A | Q&A |
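To read Table 1 at a glance, the effect of cross-modal pretraining can be computed per benchmark as the difference between the two accuracy rows. The snippet below is a small illustration: the numbers are copied from Table 1, and the function name `pretraining_gain` is hypothetical.

```python
# Accuracy (%) copied from Table 1:
# (without cross-modal pretraining, with WebVid0.18M pretraining).
TABLE_1 = {
    "NExT-QA": (59.4, 59.7),
    "Causal-VidQA": (59.1, 60.8),
    "STAR": (44.0, 46.2),
    "TGIF-QA (Action)": (94.7, 91.3),
    "TGIF-QA (Trans)": (97.6, 96.2),
    "TGIF-QA (FrameQA)": (61.6, 61.7),
    "TGIF-QA-R* (Action)": (60.8, 61.0),
    "TGIF-QA-R* (Trans)": (73.8, 73.2),
    "MSRVTT-QA": (38.3, 40.0),
}


def pretraining_gain(table):
    """Accuracy change (percentage points) when pretraining is added."""
    return {
        name: round(pretrained - scratch, 1)
        for name, (scratch, pretrained) in table.items()
    }


gains = pretraining_gain(TABLE_1)
for name, gain in gains.items():
    print(f"{name:22s} {gain:+.1f}")
```

With these numbers, pretraining improves six of the nine columns and lowers accuracy on TGIF-QA (Action), TGIF-QA (Trans), and TGIF-QA-R* (Trans).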
We have provided all the scripts in the folder 'shells'; you can start training by specifying the GPU IDs after the script name. (If you have multiple GPUs, separate their IDs with commas: ./shell/nextqa_train.sh 0,1)
./shell/nextqa_train.sh 0
It will train the model and save it to the folder 'save_models/nextqa/CoVGT/'. You should get accuracies of around 60.1% and 59.4% on the val and test sets, respectively.
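If you prefer to launch training from Python (for example, from a job scheduler), the command above can be assembled programmatically. This is a minimal sketch, not part of this repo: `format_gpu_ids`, `build_train_command`, and `launch` are hypothetical helpers, and the script path is taken as written in this README.

```python
import subprocess

TRAIN_SCRIPT = "./shell/nextqa_train.sh"  # path as written in this README


def format_gpu_ids(gpu_ids):
    """Join GPU IDs with commas, e.g. [0, 1] -> '0,1'."""
    ids = list(gpu_ids)
    if not ids:
        raise ValueError("at least one GPU ID is required")
    for gpu_id in ids:
        if not isinstance(gpu_id, int) or gpu_id < 0:
            raise ValueError(f"invalid GPU ID: {gpu_id!r}")
    return ",".join(str(gpu_id) for gpu_id in ids)


def build_train_command(gpu_ids):
    """Return the argument list for the training script."""
    return [TRAIN_SCRIPT, format_gpu_ids(gpu_ids)]


def launch(gpu_ids):
    """Run the training script from the repo root; raises if it fails."""
    return subprocess.run(build_train_command(gpu_ids), check=True)


if __name__ == "__main__":
    # Only print the command here; call launch([0, 1]) to actually train.
    print(" ".join(build_train_command([0, 1])))
```

Validating the IDs before launching avoids starting a long job with a malformed argument such as an empty GPU list.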
@ARTICLE {xiao2023contrastive,
author = {Junbin Xiao and Pan Zhou and Angela Yao and Yicong Li and Richang Hong and Shuicheng Yan and Tat Seng Chua},
journal = {IEEE Transactions on Pattern Analysis & Machine Intelligence},
title = {Contrastive Video Question Answering via Video Graph Transformer},
year = {2023},
volume = {45},
number = {11},
issn = {1939-3539},
pages = {13265-13280},
doi = {10.1109/TPAMI.2023.3292266},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {nov}
}
@inproceedings{xiao2022video,
title={Video Graph Transformer for Video Question Answering},
author={Xiao, Junbin and Zhou, Pan and Chua, Tat-Seng and Yan, Shuicheng},
booktitle={European Conference on Computer Vision},
pages={39--58},
year={2022},
organization={Springer}
}
If you use any resources from this repo, please kindly cite our paper and acknowledge the source.
This repository is released under the Apache 2.0 license as found in the LICENSE file.