Skip to content

official impelmentation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Notifications You must be signed in to change notification settings

KangarooGroup/Kangaroo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Release

  • [2024/08/29] 🔥 We release our paper on Arxiv. Check out the paper for more details.
  • [2024/08/5] 🔥 We submit evalution results of LongVideoBench(test split). Kangaroo achieves better results than existing open-source methods.
  • [2024/07/24] 🔥 We submit evalution results of VideoVista benchmark on the online leaderboard. Kangaroo achieves SOTA performance among open-source models.
  • [2024/07/23] 🔥 We submit evalution results of Video-MME benchmark on the online leaderboard. Our Kangaroo outperforms other 7B/8B models, and surpasses most models with over 10B parameters.
  • [2024/07/17] 🔥 We release blog and model. Please check out the blog for details.

Abstract

Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for visionlanguage pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. Particularly, on benchmarks specialized for long videos, Kangaroo excels some larger models with over 10B parameters and proprietary models.

Highlights

  • Large-scale Data Curation. We develop a data curation system to generate captions for open-source and internal videos and construct a video instruction tuning dataset covering a variety of tasks.
  • Long-context Video Input. We extend the maximum frames of input videos to 160, with corresponding sequence length up to 22k tokens.
  • Superior Performance. Our model achieves state-of-the-art performance on the a variety of comprehensive benchmarks and outperforms some larger open-source models with over 10B parameters and proprietary models on certain benchmarks.
  • Bilingual Conversation. Our model is equipped with the capability of Chinese, English and bilingual conversations, and support single/multi-round conversation paradigms.

Model

Quick Start

Installation

  1. Prepare environment
conda create -n kangaroo python=3.9 -y
conda activate kangaroo
pip install -r requirements
  1. Install flash-attn
pip install flash-attn --no-build-isolation
  1. Install nvidia apex according to apex

Multi-round Chat with 🤗 Transformers

See chat.ipynb

Streamlit Deploy

We provide code for users to build a web UI demo. Please use streamlit==1.36.0.

streamlit run streamlit_app.py --server.port PORT

Results

Evaluation Results

Results on VideoMME

Results on SeedBench-Video

Qualitative Examples





Citation

If you find it useful for your research , please cite related papers/blogs using this BibTeX:

@article{kangaroogroup,
	title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
	author={Liu, Jiajun and Wang, Yibing and Ma, Hanghang and Wu, Xiaoping and Ma, Xiaoqi and Wei, xiaoming and Jiao, Jianbin and Wu, Enhua and Hu, Jie},
	journal={arXiv preprint arXiv:2408.15542},
	year={2024}
}

About

official impelmentation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published