Tiehan Fan1* , Kepan Nan1*, Rui Xie1, Penghao Zhou2, Zhenheng Yang2, Chaoyou Fu1, Xiang Li3, Jian Yang1, Ying Tai1✉
1 Nanjing University 2 ByteDance 3 Nankai University *Equal Contribution ✉Corresponding Author
Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap.
- Coming soon: 🎯 Website, end-to-end captioner, and T2V model weights ……
- 2024.12.13: 🚀 Our code, dataset and arXiv paper are released.
We provide our major contribution: the Python implementation of InstanceCap.
- Instance-aware: The dataset contains 22K videos, each paired with a caption annotated with instance-level descriptions.
- Fine-grained Structured Caption: Each caption is structured so that every instance is described by its own fine-grained entry.
We release the dataset in the following structured caption format (see the example sketch after this list):
- Video: This is the name or file path of the video being referenced.
- Global Description: A brief summary of the video content, providing context about what is happening in the video.
- Structured Description: Detailed breakdown of the video content, including information on the main instances (such as people and objects) and their actions.
  - Main Instance: Represents a specific person or object in the video.
    - No.0
      - Class: The type or category of the instance (e.g., person, car).
      - Appearance: A description of the physical appearance of the instance.
      - Actions and Motion: What the instance is doing, including its movements or posture.
      - Position: The position of the instance in the frame (e.g., bottom-left, bottom-right).
    - No.1
      - ...
  - Background Detail: A description of the environment in the video background, such as the setting, props, and any significant details about the location.
  - Camera Movement: Information about how the camera behaves during the video, including whether it is static or dynamic and the type of shot.
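For concreteness, here is a minimal sketch of what a single caption entry might look like as a Python dictionary. The key names and field values are illustrative assumptions; consult the released dataset for the exact schema.

```python
# Hypothetical example of one structured caption entry.
# Key names and values are illustrative; see the released dataset for the exact schema.
caption_entry = {
    "video": "videos/example_0001.mp4",
    "global_description": "A man walks his dog along a beach at sunset.",
    "structured_description": {
        "main_instances": [
            {
                "id": "No.0",
                "class": "person",
                "appearance": "middle-aged man in a grey hoodie and jeans",
                "actions_and_motion": "walks slowly from left to right, holding a leash",
                "position": "bottom-left of the frame",
            },
            {
                "id": "No.1",
                "class": "dog",
                "appearance": "small golden retriever",
                "actions_and_motion": "trots beside the man, sniffing the sand",
                "position": "bottom-right of the frame",
            },
        ],
        "background_detail": "wide sandy beach with gentle waves under an orange sky",
        "camera_movement": "static wide shot",
    },
}
```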
We share the tuning-free InstanceEnhancer implementation in this repository. It enhances a user's short input prompt into a structured prompt, aligning the text distribution between training and inference. The provided prompts are implemented with GPT-4o, but you can migrate them to other models and obtain similar results.
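As a rough illustration of how such prompt enhancement can be wired up, below is a minimal sketch using the OpenAI Python SDK with GPT-4o. The system prompt and the `enhance_prompt` helper are assumptions for illustration, not the exact prompts used by InstanceEnhancer.

```python
# Minimal sketch of GPT-4o-based prompt enhancement (illustrative; not the exact
# InstanceEnhancer prompts). Requires `pip install openai` and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

# Hypothetical system prompt asking the model to expand a short user prompt into
# the structured caption format described above.
SYSTEM_PROMPT = (
    "Rewrite the user's short text-to-video prompt into a structured caption with: "
    "a global description; per-instance entries (class, appearance, actions and motion, "
    "position); background detail; and camera movement. Do not invent instances that "
    "contradict the user's prompt."
)

def enhance_prompt(short_prompt: str, model: str = "gpt-4o") -> str:
    """Expand a short user prompt into a structured prompt for T2V inference."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(enhance_prompt("A man walks his dog on the beach at sunset."))
```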
We implement a CoT reasoning framework for generating structured QA responses to ensure objective and consistent evaluation, allowing us to derive instance-level evaluation scores that align closely with human perception and preferences. This approach provides a more nuanced and reliable assessment of instance-level generation quality. Following this guide, you can use Inseval to evaluate your own generation model.
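The aggregation step can be pictured roughly as follows. The question set, answer format, and averaging rule in this sketch are hypothetical placeholders, not the actual Inseval protocol; see the guide in this repository for the real procedure.

```python
# Rough sketch of turning structured QA responses into instance-level scores.
# Questions, answer parsing, and the averaging rule are illustrative assumptions.
from statistics import mean

# Hypothetical per-instance QA results: each answer is the MLLM's final yes/no
# verdict after its chain-of-thought reasoning.
qa_results = {
    "No.0 (person)": {
        "Does the appearance match the caption?": "yes",
        "Are the actions and motion consistent?": "yes",
        "Is the position in the frame correct?": "no",
    },
    "No.1 (dog)": {
        "Does the appearance match the caption?": "yes",
        "Are the actions and motion consistent?": "no",
        "Is the position in the frame correct?": "yes",
    },
}

def instance_scores(results: dict) -> dict:
    """Average binary QA verdicts into a score in [0, 1] for each instance."""
    return {
        instance: mean(1.0 if ans.strip().lower() == "yes" else 0.0
                       for ans in answers.values())
        for instance, answers in results.items()
    }

scores = instance_scores(qa_results)
print(scores)                  # per-instance scores in [0, 1]
print(mean(scores.values()))   # overall score for the generated video
```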
Our work benefits from HailuoAI, OpenSora, LLaVA-Video, CogVideoX, and OpenVid-1M (data); without their excellent work, we would have faced a lot of resistance in our implementation.
@misc{fan2024instancecapimprovingtexttovideogeneration,
title={InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption},
author={Tiehan Fan and Kepan Nan and Rui Xie and Penghao Zhou and Zhenheng Yang and Chaoyou Fu and Xiang Li and Jian Yang and Ying Tai},
year={2024},
eprint={2412.09283},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.09283},
}
@article{nan2024openvid,
title={OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation},
author={Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
journal={arXiv preprint arXiv:2407.02371},
year={2024}
}
Should you have any inquiries, please contact [email protected].