This is the official PyTorch implementation of our AAAI 2023 paper InstanceFormer: An Online Video Instance Segmentation Framework. This repository provides PyTorch code for training and testing the proposed InstanceFormer model. InstanceFormer is an efficient online video instance segmentation and tracking model that achieves state-of-the-art results on several benchmarks, including YTVIS-19/21/22 and OVIS.
First, install PyTorch 1.7 and torchvision 0.8, then compile the CUDA operators. Next, install Detectron2 following the official guide (INSTALL.md). If you encounter Detectron2-related issues, please use Detectron2 at commit id 9eb4831.
conda install pytorch==1.7.1 torchvision==0.8.2 -c pytorch
pip install -r requirements.txt
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
conda install -c fastai opencv-python-headless
# compile CUDA operators
cd ./models/ops
sh ./make.sh
python test.py  # unit test for the compiled deformable attention operators
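If the build succeeds, the compiled extension is importable from Python. As a quick post-build check, the following minimal sketch can be used; it assumes make.sh builds the extension under Deformable DETR's usual module name, MultiScaleDeformableAttention (test.py runs the full unit tests):

# Post-build sanity check (module name is an assumption inherited from Deformable DETR).
import torch
import MultiScaleDeformableAttention  # raises ImportError if the build failed
assert torch.cuda.is_available(), "the deformable attention ops require a CUDA device"
print("CUDA operators built and importable")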
Download and extract the OVIS, YTVIS-19, YTVIS-21, YTVIS-22, and COCO 2017 datasets. Run coco_keep_for_ovis.py and coco_keep_for_ytvis21.py. Download 'coco_keepfor_ytvis19.json' from this link and place it in the datasets folder. The expected directory structure is:
InstanceFormer
├── datasets
│   ├── coco_keepfor_ytvis19.json
...
ytvis
├── train
├── val
├── annotations
│   ├── instances_train_sub.json
│   ├── instances_val_sub.json
coco
├── train2017
├── val2017
├── annotations
│   ├── instances_train2017.json
│   ├── instances_val2017.json
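Before training, it can help to verify that the files above are in place. A small sanity-check sketch follows; the paths mirror the directory trees above and are assumptions, so adjust them to your local setup:

# Verify the dataset layout sketched above (paths are assumptions).
from pathlib import Path
expected = [
    "datasets/coco_keepfor_ytvis19.json",
    "ytvis/annotations/instances_train_sub.json",
    "ytvis/annotations/instances_val_sub.json",
    "coco/annotations/instances_train2017.json",
    "coco/annotations/instances_val2017.json",
]
missing = [p for p in expected if not Path(p).exists()]
print("all dataset files found" if not missing else f"missing: {missing}")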
First, pre-train InstanceFormer on the COCO dataset with a frame size of 1, or use the pre-trained model weights (r50_pretrain.pth) available at this link.
bash configs/r50_pretrain.sh
All InstanceFormer models are trained on four NVIDIA RTX A6000 GPUs with a total batch size of 16. To train InstanceFormer on YouTubeVIS and OVIS with 4 GPUs, run:
bash configs/train_r50_ytvis.sh
bash configs/train_r50_ovis.sh
To train InstanceFormer on multiple nodes (e.g., 16 GPUs across four 4-GPU nodes), run the following on each node, setting NODE_RANK to that node's rank (0 on the first node). For example, on the second node:
MASTER_ADDR=<IP address of node 1> NODE_RANK=1 GPUS_PER_NODE=4 ./tools/run_dist_launch.sh 16 ./configs/train_r50_ytvis.sh
Since YTVIS-21 and YTVIS-22 share the same training set, the model trained on YTVIS-21 is also used to evaluate YTVIS-22. The trained model weights are available at this link. To evaluate on YTVIS and OVIS, run:
bash configs/evaluate_r50_ytvis.sh
bash configs/evaluate_r50_ovis.sh
To simplify the laborious process of manually uploading result files to the CodaLab server, we provide automatic upload functionality in Server Process. It can be activated during inference by adding the --upload_file flag.
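For reference, the packaging such an upload involves is small: the evaluation servers expect a zip archive containing the predictions produced during inference. A minimal sketch follows; the file names are assumptions, and the actual logic lives in Server Process:

# Hypothetical packaging step for a CodaLab submission; file names are assumptions.
import zipfile
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("results.json")  # predictions written during inference
print("wrote submission.zip for upload to the CodaLab server")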
During inference, use the --analysis flag to save reference points, sampling locations, and instance embeddings. Run Analysis, TSNE, and Make Video to save the corresponding plots and videos.
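As an illustration, a 2-D t-SNE projection of the saved instance embeddings can be produced with scikit-learn. This is a sketch only: the file name and array shape are assumptions about the --analysis output, whose exact format is defined by the scripts above.

# Sketch: t-SNE of saved instance embeddings (file name and [num_instances, dim]
# shape are assumptions about what --analysis writes).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
emb = np.load("instance_embeddings.npy")
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], s=5)
plt.title("t-SNE of instance embeddings")
plt.savefig("tsne_embeddings.png", dpi=200)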
If you find InstanceFormer useful in your research, please cite it using the following BibTeX entry:
@article{koner2022instanceformer,
  title={InstanceFormer: An Online Video Instance Segmentation Framework},
  author={Koner, Rajat and Hannan, Tanveer and Shit, Suprosanna and Sharifzadeh, Sahand and Schubert, Matthias and Seidl, Thomas and Tresp, Volker},
  journal={arXiv preprint arXiv:2208.10547},
  year={2022}
}
We acknowledge the following repositories, from which we have adapted code snippets.