Qiuheng Wang¹,²*, Yukai Shi¹,³*, Jiarong Ou¹, Rui Chen¹, Ke Lin¹, Jiahao Wang¹, Boyuan Jiang¹, Haotian Yang¹, Mingwu Zheng¹,
Xin Tao¹, Fei Yang¹†, Pengfei Wan¹, Di Zhang¹
¹Kuaishou Technology  ²Shenzhen University  ³Tsinghua University  (*Equal contribution, †Corresponding author)
As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content.
We propose a large-scale, high-quality dataset that significantly enhances the consistency between multiple conditions and video content. Koala-36M features more accurate temporal splitting, more detailed captions, and improved video filtering based on the proposed Video Training Suitability Score (VTSS).
Koala-36M is a video dataset that offers both a large number of videos (over 10M) and high-quality, fine-grained captions (over 200 words).
We propose improved splitting methods, a structured caption system, a training suitability assessment network, and fine-grained conditioning (highlighted in the red box of the pipeline figure), improving the consistency between conditions and video content.
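For intuition, a structured caption pairs several fine-grained fields with a final free-form description. Below is a minimal Python sketch; the field names and their fusion are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical structured caption for one clip; the field names are
# illustrative assumptions, not the dataset's actual schema.
structured_caption = {
    "subject": "a golden retriever",
    "action": "catching a frisbee mid-air",
    "environment": "a sunlit park with scattered trees",
    "camera": "slow-motion tracking shot at eye level",
    "style": "natural lighting, shallow depth of field",
}

# Fuse the structured fields into one detailed caption string, the kind
# of fine-grained condition a video generation model is trained on.
caption = (
    f"{structured_caption['subject']} {structured_caption['action']} in "
    f"{structured_caption['environment']}, filmed as a "
    f"{structured_caption['camera']} with {structured_caption['style']}."
)
print(caption)
```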
We develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus.
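As a rough illustration of what fusing sub-metrics into a single score looks like (the actual VTSS is predicted by a trained assessment network), the sketch below combines hypothetical normalized sub-metrics with assumed weights:

```python
# A minimal sketch of fusing sub-metrics into one suitability score.
# The sub-metric names and weights are assumptions for illustration;
# the released network learns this mapping rather than using fixed weights.
sub_metrics = {
    "visual_quality": 0.91,  # e.g., clarity / aesthetics
    "motion": 0.67,          # e.g., motion magnitude and smoothness
    "text_free": 0.75,       # e.g., absence of overlaid text
}
weights = {"visual_quality": 0.5, "motion": 0.3, "text_free": 0.2}

def fuse_score(metrics: dict, weights: dict) -> float:
    """Weighted average of normalized sub-metric scores in [0, 1]."""
    return sum(weights[k] * metrics[k] for k in weights)

print(f"VTSS-like score: {fuse_score(sub_metrics, weights):.3f}")
```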
We release a base version of the scoring model; you can download the checkpoint from here. To predict the VTSS of a video, run:
cd training_suitability_assessment
pip install -r requirements.txt
mkdir ckpt
huggingface-cli download --resume-download Koala-36M/Training_Suitability_Assessment --local-dir ckpt
python inference.py
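To then filter a corpus with the predicted scores, a post-processing sketch might look like the following; the CSV layout ("path,vtss") and the 0.5 threshold are assumptions, not the repository's actual output format:

```python
import csv

THRESHOLD = 0.5  # assumed cutoff; choose by inspecting the score distribution

def filter_by_vtss(csv_path: str, threshold: float = THRESHOLD) -> list:
    """Keep paths of clips whose predicted VTSS meets the threshold.

    Assumes a CSV with 'path' and 'vtss' columns, which is a hypothetical
    layout; adapt the parsing to the actual output of inference.py.
    """
    with open(csv_path, newline="") as f:
        return [row["path"] for row in csv.DictReader(f)
                if float(row["vtss"]) >= threshold]

if __name__ == "__main__":
    kept = filter_by_vtss("vtss_scores.csv")
    print(f"{len(kept)} clips pass the VTSS filter")
```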
See license. The video samples are collected from a publicly available dataset; users must comply with the relevant license when using these video samples.
If you find this project useful for your research, please cite our paper.
@misc{wang2024koala36mlargescalevideodataset,
  title={Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content},
  author={Qiuheng Wang and Yukai Shi and Jiarong Ou and Rui Chen and Ke Lin and Jiahao Wang and Boyuan Jiang and Haotian Yang and Mingwu Zheng and Xin Tao and Fei Yang and Pengfei Wan and Di Zhang},
  year={2024},
  eprint={2410.08260},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.08260},
}