MAGVIT: Masked Generative Video Transformer

Official code and models for the CVPR 2023 paper:

MAGVIT: Masked Generative Video Transformer
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
CVPR 2023

Summary

We introduce MAGVIT, a single model that tackles a variety of video synthesis tasks, and demonstrate its quality, efficiency, and flexibility.

If you find this code useful in your research, please cite:

@inproceedings{yu2023magvit,
  title={{MAGVIT}: Masked generative video transformer},
  author={Yu, Lijun and Cheng, Yong and Sohn, Kihyuk and Lezama, Jos{\'e} and Zhang, Han and Chang, Huiwen and Hauptmann, Alexander G and Yang, Ming-Hsuan and Hao, Yuan and Essa, Irfan and Jiang, Lu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

Disclaimers

Please note that this is not an officially supported Google product.

Checkpoints are based on training with publicly available datasets. Some datasets contain limitations, including non-commercial use limitations. Please review terms and conditions made available by third parties before using models and datasets provided.

Installation

A conda environment file is provided for running on GPUs. JAX requires CUDA 11 and cuDNN 8.6. This VM Image has been tested.

conda env create -f environment.yaml
conda activate magvit

Alternatively, you can install the dependencies with pip:

pip install -r requirements.txt
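
After either install path, it can help to confirm that JAX actually picks up the GPU. The snippet below is a minimal sanity-check sketch and is not part of this repository; the helper name `describe_jax_backend` is hypothetical. It relies only on `jax.default_backend()` and `jax.devices()`.

```python
import importlib.util


def describe_jax_backend() -> str:
    """Return a one-line summary of the backend JAX would use.

    Hypothetical helper for illustration; not provided by this repository.
    """
    # Guard the import so the check also works before the environment is ready.
    if importlib.util.find_spec("jax") is None:
        return "jax is not installed in this environment"
    import jax

    backend = jax.default_backend()  # e.g. "gpu" or "cpu"
    devices = jax.devices()          # devices of the default backend
    return f"backend={backend}, devices={len(devices)}"


if __name__ == "__main__":
    print(describe_jax_backend())
```

If the reported backend is `cpu` on a GPU machine, the CUDA or cuDNN versions are a likely place to look first.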

Pretrained models

For the release of the pretrained model weights, please see this note.

MAGVIT 3D-VQ models

Model	Size	Input	Output	Codebook size	Dataset
3D-VQ	B	16 frames x 64x64	4x16x16	1024	BAIR Robot Pushing
3D-VQ	L	16 frames x 64x64	4x16x16	1024	BAIR Robot Pushing
3D-VQ	B	16 frames x 128x128	4x16x16	1024	UCF-101
3D-VQ	L	16 frames x 128x128	4x16x16	1024	UCF-101
3D-VQ	B	16 frames x 128x128	4x16x16	1024	Kinetics-600
3D-VQ	L	16 frames x 128x128	4x16x16	1024	Kinetics-600
3D-VQ	B	16 frames x 128x128	4x16x16	1024	Something-Something-v2
3D-VQ	L	16 frames x 128x128	4x16x16	1024	Something-Something-v2
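
To make the Input and Output columns above concrete, the following sketch works out the downsampling factors and token counts they imply. It is plain arithmetic on the table values, not code from this repository; the function name `token_stats` and the dictionary keys are chosen for illustration only.

```python
import math


def token_stats(frames, height, width, latent_shape, codebook_size):
    """Summarize how a video clip maps to a 3D-VQ token grid.

    Illustrative helper; the name and return format are assumptions.
    """
    lt, lh, lw = latent_shape
    num_tokens = lt * lh * lw
    bits_per_token = math.log2(codebook_size)
    return {
        "temporal_factor": frames // lt,
        "spatial_factor": (height // lh, width // lw),
        "num_tokens": num_tokens,
        # Pixel positions (not channel values) covered by one token.
        "pixels_per_token": frames * height * width // num_tokens,
        "bits_per_clip": int(num_tokens * bits_per_token),
    }


# BAIR Robot Pushing rows: 16 frames x 64x64 -> 4x16x16
bair = token_stats(16, 64, 64, (4, 16, 16), 1024)
# UCF-101, Kinetics-600, Something-Something-v2 rows: 16 frames x 128x128 -> 4x16x16
ucf = token_stats(16, 128, 128, (4, 16, 16), 1024)
print(bair)
print(ucf)
```

Both configurations yield 1024 tokens per clip with 10 bits per token. The 64x64 models downsample 4x spatially and the 128x128 models 8x, with 4x temporal downsampling in both cases.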

MAGVIT transformers

Each transformer model must be used with its corresponding 3D-VQ tokenizer of the same dataset and model size.

Model	Task	Size	Dataset	FVD
Transformer	Class-conditional	B	UCF-101	159
Transformer	Class-conditional	L	UCF-101	76
Transformer	Frame prediction	B	BAIR Robot Pushing	76 (48)
Transformer	Frame prediction	L	BAIR Robot Pushing	62 (31)
Transformer	Frame prediction (5)	B	Kinetics-600	24.5
Transformer	Frame prediction (5)	L	Kinetics-600	9.9
Transformer	Multi-task-8	B	BAIR Robot Pushing	32.8
Transformer	Multi-task-8	L	BAIR Robot Pushing	22.8
Transformer	Multi-task-10	B	Something-Something-v2	43.4
Transformer	Multi-task-10	L	Something-Something-v2	27.3
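
The pairing rule stated above (same dataset and same model size for the transformer and its 3D-VQ tokenizer) can be enforced with a small guard before loading weights. This is a sketch under assumptions: the `Checkpoint` record and `check_pairing` function are hypothetical and do not correspond to any API in this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record describing a released checkpoint."""
    kind: str     # "3D-VQ" or "Transformer"
    size: str     # "B" or "L"
    dataset: str  # e.g. "UCF-101"


def check_pairing(transformer: Checkpoint, tokenizer: Checkpoint) -> None:
    """Raise ValueError unless the two checkpoints belong together."""
    if transformer.kind != "Transformer" or tokenizer.kind != "3D-VQ":
        raise ValueError("expected a Transformer and a 3D-VQ checkpoint")
    if (transformer.size, transformer.dataset) != (tokenizer.size, tokenizer.dataset):
        raise ValueError(
            f"mismatch: transformer is {transformer.size}/{transformer.dataset}, "
            f"tokenizer is {tokenizer.size}/{tokenizer.dataset}"
        )


# A matching pair passes silently.
check_pairing(
    Checkpoint("Transformer", "L", "UCF-101"),
    Checkpoint("3D-VQ", "L", "UCF-101"),
)
```

Failing early with a descriptive error is generally easier to debug than a shape or quality problem that surfaces only after generation.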
