MAMA: A Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen, Yi Bin, Xiaobao Wu, Xinshuai Dong, Zhiyuan Hu, Khoi Le, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan ECCV 2024
arxiv | bibtex | 🤗 demo | website
MAMA (Meta-optimized Angular MArgin Contrastive Framework for Video-Language Representation Learning from Large Vision Language Model) is a novel approach to learning video-language representations from a Large Vision-Language Model (LVLM). We use LLaVA to augment the training video-text data, and combine an angular margin-based contrastive objective with meta-learning to make the most effective use of the LLaVA-augmented data.
Sample Generation:
Try out our LVLM-based pipeline to generate text descriptions for your own videos! You can also try out a web demo here:
The resulting video-language model sets a new state-of-the-art on a number of popular video tasks!
MAMA leverages Large Vision-Language Models (LVLMs) to automatically augment video-text training data, and uses the augmented data to fine-tune strong video-language models.
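To give a feel for the augmentation idea, here is a minimal sketch (not the repository's actual pipeline): uniformly sample frames from a video with OpenCV and ask an LVLM to describe them. The checkpoint llava-hf/llava-1.5-7b-hf, the prompt, and the sampling rate are illustrative assumptions, not the exact choices made in the paper.

import cv2
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # illustrative LLaVA checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"  # requires `accelerate`
)

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def describe_frame(frame):
    """Ask the LVLM for a short description of one frame."""
    prompt = "USER: <image>\nDescribe what is happening in this video frame. ASSISTANT:"
    inputs = processor(images=frame, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=64)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

# Captions for a sample video; in the real pipeline these become augmented text paired with the clip.
captions = [describe_frame(f) for f in sample_frames("example.mp4")]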
Let's begin by creating and activating a Conda virtual environment. Then install the requirements:
conda create --name mama_env python=3.9
conda activate mama_env
pip install -r requirements.txt
MAMA consists of a subtractive angular margin contrastive objective, powered by meta-learning to weigh the importance of the training video-text data.
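For intuition, below is a rough PyTorch sketch of a margin-penalized contrastive objective with per-sample weights. It is an illustration only: the margin is simply subtracted from the positive pairs' similarity, and `weights` stands in for the meta-learned importance scores; the paper's exact subtractive angular margin formulation and meta-optimization procedure (Section 3) differ in the details.

import torch
import torch.nn.functional as F

def weighted_margin_contrastive_loss(video_emb, text_emb, weights, margin=0.1, temperature=0.05):
    """video_emb, text_emb: (B, D) embeddings; weights: (B,) per-sample importance."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                                                   # (B, B) cosine similarities
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)  # subtractive margin on positives
    logits = sim / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    # Symmetric video-to-text and text-to-video losses, weighted per sample.
    loss_v2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2v = F.cross_entropy(logits.T, targets, reduction="none")
    per_sample = 0.5 * (loss_v2t + loss_t2v)
    return (weights * per_sample).mean()

# Example usage with random embeddings and uniform weights.
B, D = 8, 256
loss = weighted_margin_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.ones(B))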
We provide some samples generated by MAMA's LVLM-based video-text data generation pipeline:
Run the MAMA demo using Colab (no GPU needed): or on the web using 🤗 Spaces:
Since a free Colab account offers very limited RAM, if you'd like to run the demo with a larger model, please run ./demo_mama.py locally. For more technical details, please refer to Section 3 of our paper.
# CPU mode
python demo_mama.py [--video-path $TEST_VIDEO]
# GPU mode
python demo_mama.py --cuda [--video-path $TEST_VIDEO]
To facilitate future research, we release our augmented data based on the HowTo100M dataset at this link (released later).
@article{nguyen2024meta,
title={Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning},
author={Nguyen, Thong and Bin, Yi and Wu, Xiaobao and Dong, Xinshuai and Hu, Zhiyuan and Le, Khoi and Nguyen, Cong-Duy and Ng, See-Kiong and Tuan, Luu Anh},
journal={arXiv preprint arXiv:2407.03788},
year={2024}
}