Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Release

[2024/08/29] 🔥 We release our paper on arXiv. Check out the paper for more details.
[2024/08/05] 🔥 We submit evaluation results on LongVideoBench (test split). Kangaroo achieves better results than existing open-source methods.
[2024/07/24] 🔥 We submit evaluation results on the VideoVista benchmark to the online leaderboard. Kangaroo achieves SOTA performance among open-source models.
[2024/07/23] 🔥 We submit evaluation results on the Video-MME benchmark to the online leaderboard. Kangaroo outperforms other 7B/8B models and surpasses most models with over 10B parameters.
[2024/07/17] 🔥 We release the blog and model. Please check out the blog for details.

Abstract

Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains challenging, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods are limited in how effectively they process long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. To address the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo outperforms some larger models with over 10B parameters and proprietary models.

Highlights

Large-scale Data Curation. We develop a data curation system to generate captions for open-source and internal videos and construct a video instruction tuning dataset covering a variety of tasks.
Long-context Video Input. We extend the maximum number of input video frames to 160, with a corresponding sequence length of up to 22k tokens.
Superior Performance. Our model achieves state-of-the-art performance on a variety of comprehensive benchmarks and, on certain benchmarks, outperforms some larger open-source models with over 10B parameters and proprietary models.
Bilingual Conversation. Our model supports Chinese, English, and bilingual conversations, in both single-round and multi-round conversation paradigms.
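
To give a rough sense of what a 160-frame cap means for a longer video, the sketch below picks evenly spaced frame indices. The function name `sample_frame_indices` and the uniform sampling strategy are assumptions made for illustration; they are not taken from the Kangaroo code, and only the 160-frame limit comes from the highlights above.

```python
def sample_frame_indices(total_frames: int, max_frames: int = 160) -> list[int]:
    """Pick at most `max_frames` evenly spaced frame indices (hypothetical helper)."""
    if total_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short video: keep every frame.
        return list(range(total_frames))
    # Long video: split it into `max_frames` equal segments and take
    # the frame nearest the middle of each segment.
    step = total_frames / max_frames
    return [int(i * step + step / 2) for i in range(max_frames)]


if __name__ == "__main__":
    indices = sample_frame_indices(1000)
    print(len(indices), indices[:5])
```

Under this scheme a video with 160 frames or fewer is passed through unchanged, while anything longer is reduced to exactly 160 frames spread across its full duration.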

Model

Quick Start

Installation

Prepare environment

conda create -n kangaroo python=3.9 -y
conda activate kangaroo
pip install -r requirements.txt

Install flash-attn

pip install flash-attn --no-build-isolation

Install NVIDIA Apex by following the instructions in the apex repository.
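
After installation, a quick check along the following lines can confirm that the packages are importable in the active environment. This is a minimal sketch: the helper name `check_packages` is hypothetical, and the module names listed (`transformers`, `streamlit`, `flash_attn`, `apex`) are the names these packages are normally imported under, which is assumed here rather than taken from the repository.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: bool} indicating whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names; adjust to match your environment.
    status = check_packages(["transformers", "streamlit", "flash_attn", "apex"])
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check only verifies that each module can be located; it does not import it or validate its version.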

Multi-round Chat with 🤗 Transformers

See chat.ipynb

Streamlit Deploy

We provide code for users to build a web UI demo. Please use streamlit==1.36.0.

streamlit run streamlit_app.py --server.port PORT

Results

Evaluation Results

Results on VideoMME

Results on SeedBench-Video

Qualitative Examples

Citation

If you find this work useful for your research, please cite the related paper/blog using this BibTeX:

@article{kangaroogroup,
	title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
	author={Liu, Jiajun and Wang, Yibing and Ma, Hanghang and Wu, Xiaoping and Ma, Xiaoqi and Wei, Xiaoming and Jiao, Jianbin and Wu, Enhua and Hu, Jie},
	journal={arXiv preprint arXiv:2408.15542},
	year={2024}
}
