Tiehan Fan1* , Kepan Nan1*, Rui Xie1, Penghao Zhou2, Zhenheng Yang2, Chaoyou Fu1, Xiang Li3, Jian Yang1, Ying Tai1✉
1 Nanjing University 2 ByteDance 3 Nankai University *Equal Contribution ✉Corresponding Author
Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap.
- Coming soon: 🎯 Website, end-to-end captioner, and T2V model weights ……
- 2024.12.13: 🚀 Our code, dataset and arXiv paper are released.
We provide our major contribution: the Python implementation of InstanceCap.
- Instance-aware: The dataset contains 22K videos, each paired with a caption annotated with instance-level descriptions.
- Fine-grained Structured Caption: Each caption is structured so that every instance is described by its own fine-grained entry.
We release the dataset in the following structured caption format (see the example sketch after this list):
- Video: This is the name or file path of the video being referenced.
- Global Description: A brief summary of the video content, providing context about what is happening in the video.
- Structured Description: Detailed breakdown of the video content, including information on the main instances (such as people and objects) and their actions.
  - Main Instance: Represents a specific person or object in the video.
    - No.0
      - Class: The type or category of the instance (e.g., person, car).
      - Appearance: A description of the physical appearance of the instance.
      - Actions and Motion: What the instance is doing, including its movements or posture.
      - Position: The position of the instance in the frame (e.g., bottom-left, bottom-right).
    - No.1
      - ...
  - Background Detail: A description of the environment in the video background, such as the setting, props, and any significant details about the location.
  - Camera Movement: Information about how the camera behaves during the video, including whether it is static or dynamic and the type of shot.
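For concreteness, here is a minimal sketch of what a single caption entry might look like as a Python dictionary. The key names and field values are illustrative assumptions; consult the released dataset for the exact schema.

```python
# Hypothetical example of one structured caption entry.
# Key names and values are illustrative; see the released dataset for the exact schema.
caption_entry = {
    "video": "videos/example_0001.mp4",
    "global_description": "A man walks his dog along a beach at sunset.",
    "structured_description": {
        "main_instances": [
            {
                "id": "No.0",
                "class": "person",
                "appearance": "middle-aged man in a grey hoodie and jeans",
                "actions_and_motion": "walks slowly from left to right, holding a leash",
                "position": "bottom-left of the frame",
            },
            {
                "id": "No.1",
                "class": "dog",
                "appearance": "small golden retriever",
                "actions_and_motion": "trots beside the man, sniffing the sand",
                "position": "bottom-right of the frame",
            },
        ],
        "background_detail": "wide sandy beach with gentle waves under an orange sky",
        "camera_movement": "static wide shot",
    },
}
```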
We share the tuning-free InstanceEnhancer implementation in this repository. It enhances a user's short input prompt into a structured prompt, aligning the text distribution between training and inference. The provided prompts are implemented with GPT-4o, but you can migrate them to other models and obtain similar results.
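As a rough illustration of how such prompt enhancement can be wired up, below is a minimal sketch using the OpenAI Python SDK with GPT-4o. The system prompt and the `enhance_prompt` helper are assumptions for illustration, not the exact prompts used by InstanceEnhancer.

```python
# Minimal sketch of GPT-4o-based prompt enhancement (illustrative; not the exact
# InstanceEnhancer prompts). Requires `pip install openai` and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

# Hypothetical system prompt asking the model to expand a short user prompt into
# the structured caption format described above.
SYSTEM_PROMPT = (
    "Rewrite the user's short text-to-video prompt into a structured caption with: "
    "a global description; per-instance entries (class, appearance, actions and motion, "
    "position); background detail; and camera movement. Do not invent instances that "
    "contradict the user's prompt."
)

def enhance_prompt(short_prompt: str, model: str = "gpt-4o") -> str:
    """Expand a short user prompt into a structured prompt for T2V inference."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(enhance_prompt("A man walks his dog on the beach at sunset."))
```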
We implement a CoT reasoning framework for generating structured QA responses to ensure objective and consistent evaluation, allowing us to derive instance-level evaluation scores that align closely with human perception and preferences. This approach provides a more nuanced and reliable assessment of instance-level generation quality. Following this guide, you can use Inseval to evaluate your own generation model.
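The aggregation step can be pictured roughly as follows. The question set, answer format, and averaging rule in this sketch are hypothetical placeholders, not the actual Inseval protocol; see the guide in this repository for the real procedure.

```python
# Rough sketch of turning structured QA responses into instance-level scores.
# Questions, answer parsing, and the averaging rule are illustrative assumptions.
from statistics import mean

# Hypothetical per-instance QA results: each answer is the MLLM's final yes/no
# verdict after its chain-of-thought reasoning.
qa_results = {
    "No.0 (person)": {
        "Does the appearance match the caption?": "yes",
        "Are the actions and motion consistent?": "yes",
        "Is the position in the frame correct?": "no",
    },
    "No.1 (dog)": {
        "Does the appearance match the caption?": "yes",
        "Are the actions and motion consistent?": "no",
        "Is the position in the frame correct?": "yes",
    },
}

def instance_scores(results: dict) -> dict:
    """Average binary QA verdicts into a score in [0, 1] for each instance."""
    return {
        instance: mean(1.0 if ans.strip().lower() == "yes" else 0.0
                       for ans in answers.values())
        for instance, answers in results.items()
    }

scores = instance_scores(qa_results)
print(scores)                  # per-instance scores in [0, 1]
print(mean(scores.values()))   # overall score for the generated video
```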
Our work benefits from HailuoAI, OpenSora, LLaVA-Video, CogVideoX, and OpenVid-1M (data); without their excellent work, we would have faced a lot of resistance in our implementation.
@misc{fan2024instancecapimprovingtexttovideogeneration,
title={InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption},
author={Tiehan Fan and Kepan Nan and Rui Xie and Penghao Zhou and Zhenheng Yang and Chaoyou Fu and Xiang Li and Jian Yang and Ying Tai},
year={2024},
eprint={2412.09283},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.09283},
}
@article{nan2024openvid,
title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation},
author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
journal={arXiv preprint arXiv:2407.02371},
year={2024}
}
Should you have any inquiries, please contact [email protected].