
✨Video-T1: Test-Time Scaling for Video Generation✨

Fangfu Liu1*, Hanyang Wang1*, Yimo Cai1, Kaiyan Zhang1, Xiaohang Zhan, Yueqi Duan1
*Equal Contribution.
1Tsinghua University

Teaser Visualization

Video-T1: We present the generative effects and performance improvements of video generation under test-time scaling (TTS) settings. The videos generated with TTS are of higher quality and more consistent with the prompt than those generated without TTS.

📢 News

  • 2025.3.24 🤗🤗🤗 We release Video-T1: Test-Time Scaling for Video Generation.

🎉 Results

Results Visualization

Results of Test-Time Scaling for Video Generation. As the number of samples in the search space increases through test-time scaling (TTS), the models' performance improves consistently.

🌟 Pipeline

Pipeline Visualization

Pipeline of Test-Time Scaling for Video Generation. Top: Random Linear Search for TTS video generation randomly samples Gaussian noises, prompts the video generator to produce a sequence of video clips through step-by-step denoising in a linear manner, and selects the candidate with the highest score from the test verifiers. Bottom: Tree-of-Frames (ToF) Search for TTS video generation divides the video generation process into three stages: (a) the first stage performs image-level alignment that influences the later frames; (b) the second stage applies dynamic prompts in the test verifiers $\mathcal{V}$ to focus on motion stability and physical plausibility, providing feedback that guides the heuristic search; (c) the last stage assesses the overall quality of the video and selects the video with the highest alignment to the text prompt.
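To make the Random Linear Search baseline concrete, here is a minimal best-of-N sketch in Python; generate_video and verifier_score are hypothetical stand-ins for the video generator and test verifier, not functions from this repository.

import torch

def random_linear_search(prompt, generate_video, verifier_score, num_samples=8):
    """Best-of-N sketch of Random Linear Search: sample independent Gaussian
    noise seeds, fully denoise one video per seed, keep the top-scoring clip."""
    best_video, best_score = None, float("-inf")
    for seed in range(num_samples):
        torch.manual_seed(seed)                # independent noise sample per candidate
        video = generate_video(prompt)         # step-by-step linear denoising
        score = verifier_score(video, prompt)  # test-verifier feedback
        if score > best_score:
            best_video, best_score = video, score
    return best_video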

🔧 Installation

Dependencies:

git clone https://github.com/liuff19/Video-T1.git
cd Video-T1
conda create -n videot1 python=3.10
conda activate videot1
pip install -r requirements.txt
git clone https://github.com/LLaVA-VL/LLaVA-NeXT && cd LLaVA-NeXT && pip install --no-deps -e ".[train]"

Model Checkpoints:

You need to download the following models:

  • Pyramid-Flow model checkpoint (for video generation)
  • VisionReward-Video model checkpoint (for video reward guidance)
  • (Optional) Image-CoT-Generation model checkpoint (for ImageCoT)
  • (Optional) DeepSeek-R1-Distill-Llama-8B (Or other LLMs) model checkpoint (for hierarchical prompts)
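
For reference, these checkpoints can typically be fetched with huggingface_hub; the repo IDs below are assumptions on our part, so verify them on the Hugging Face Hub before downloading.

from huggingface_hub import snapshot_download

# Repo IDs are assumptions -- double-check them on the Hugging Face Hub first.
snapshot_download("rain1011/pyramid-flow-sd3", local_dir="ckpts/pyramid-flow")
snapshot_download("THUDM/VisionReward-Video", local_dir="ckpts/visionreward-video")
snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
                  local_dir="ckpts/deepseek-r1-distill-llama-8b")  # optional, for hierarchical prompts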

💻 Inference

1. Quick start

cd VideoT1
# Modify videot1.py to assign checkpoints correctly.
python videot1.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard

2. Inference Code

For inference, please refer to videot1.py for usage.

# Import Pipeline and Base Model
from pyramid_flow.pyramid_dit import PyramidDiTForVideoGeneration
from pipeline.videot1_pipeline import VideoT1Generator

# Initialize the Pyramid-Flow model (see videot1.py for the init_* helpers)
pyramid_model = init_pyramid_model(model_path, device, model_variant)

# Initialize the VisionReward model and its tokenizer
reward_model, tokenizer = init_vr_model(vr_path, device)

# Initialize VideoT1 Generator
generator = VideoT1Generator(
    pyramid_model,
    device,
    dtype=torch.bfloat16, 
    image_selector_path=imgcot_path,
    result_path=result_path,
    lm_path=lm_path,
)

# Generation arguments below follow Pyramid-Flow's interface
# Use the generator to produce the best video under the TTS strategy
best_video = generator.videot1_gen(
    prompt=prompt,
    num_inference_steps=[20, 20, 20],      # Inference steps for image branch at each level
    video_num_inference_steps=[20, 20, 20], # Inference steps for video branch at each level
    height=height,
    width=width,
    num_frames=temp,                        # temporal length, as in Pyramid-Flow
    guidance_scale=7.0,           
    video_guidance_scale=5.0,      
    save_memory=True,             
    inference_multigpu=True,      
    video_branching_factors=video_branch,
    image_branching_factors=img_branch,
    reward_stages=reward_stages,
    hierarchical_prompts=True,     
    result_path=result_path,
    intermediate_path=intermed_path,
    video_name=video_name,
    **reward_params                
)
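
Here, best_video is the candidate that the test verifiers score highest. image_branching_factors and video_branching_factors control the shape of the ToF search tree at each depth, and reward_stages selects the depths at which the reward model prunes branches (see Usage Tips below).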

3. Multi-GPU Inference

Save GPU memory by loading the different models on different GPUs to avoid out-of-memory (OOM) errors.

Example: load the reward model on GPU 0, Pyramid-Flow on GPU 1, the Image-CoT model on GPU 2, and the LLM on GPU 3:

# Load Models in different GPUs
python videot1_multigpu.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard --reward_device_id 0 --base_device_id 1 --imgcot_device_id 2 --lm_device_id 3

Please refer to videot1_multigpu.py for multi-GPU inference.
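
Conceptually, the script just pins each model to its own CUDA device. A minimal sketch of that placement, reusing the hypothetical init helpers from the single-GPU example above:

import torch

# Hypothetical sketch of the placement behind the device-id flags above;
# init_pyramid_model / init_vr_model mirror the single-GPU example.
reward_device = torch.device("cuda:0")   # --reward_device_id 0
base_device   = torch.device("cuda:1")   # --base_device_id 1
imgcot_device = torch.device("cuda:2")   # --imgcot_device_id 2
lm_device     = torch.device("cuda:3")   # --lm_device_id 3

reward_model, tokenizer = init_vr_model(vr_path, reward_device)
pyramid_model = init_pyramid_model(model_path, base_device, model_variant)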

4. Usage Tips

  1. reward_stages: Choose three indices for reward-model pruning. If the current tree depth is one of these three indices, all video clips at that depth are fed into the reward model for scoring.

  2. variant: We recommend 768p for better quality; choose from 384 and 768 (same as Pyramid-Flow).

  3. img_branch: A list of integers, each corresponding to the number of images sampled at the start of the ImageCoT process at that depth.

  4. video_branch: A list of integers, each corresponding to the number of generated next frames at that depth.
    Namely, if img_branch is the array $A[]$ and video_branch is the array $B[]$, then at depth $i$ we have $A[i] \times B[i]$ initial images for each branch, and the $B[i]$ next latent frames become the children of each branch (see the sketch below).
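
A quick sanity check of that arithmetic, with hypothetical branching factors for a depth-3 ToF tree:

img_branch   = [3, 2, 2]   # A[]: ImageCoT images per branch at each depth
video_branch = [2, 2, 1]   # B[]: next-frame candidates per branch at each depth

for i, (a, b) in enumerate(zip(img_branch, video_branch)):
    # At depth i: A[i] * B[i] initial images per branch, B[i] children per branch
    print(f"depth {i}: {a * b} initial images per branch, {b} children per branch")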

🚀 TODO

We will release the dataset for Test-Time Scaling on CogVideoX-5B.

Acknowledgement

We are grateful to the following great works, which we drew on when implementing Video-T1, and to Yixin for the great figure design:

Pyramid-Flow
NOVA
VisionReward
VideoLLaMA3
CogVideoX
OpenSora
Image-Generation-CoT

📚 Citation

@article{liu2025video,
  title={Video-T1: Test-Time Scaling for Video Generation},
  author={Liu, Fangfu and Wang, Hanyang and Cai, Yimo and Zhang, Kaiyan and Zhan, Xiaohang and Duan, Yueqi},
  journal={arXiv preprint arXiv:2503.18942},
  year={2025}
}
