Fangfu Liu¹*, Hanyang Wang¹*, Yimo Cai¹, Kaiyan Zhang¹, Xiaohang Zhan, Yueqi Duan¹

*Equal Contribution. ¹Tsinghua University
Video-T1: We present the generative effects and performance improvements of video generation under test-time scaling (TTS) settings. The videos generated with TTS are of higher quality and more consistent with the prompt than those generated without TTS.
2025.3.24
🤗🤗🤗 We release Video-T1: Test-time Scaling for Video Generation
Results of Test-Time Scaling for Video Generation. As the number of samples in the search space increases by scaling test-time computation (TTS), the models’ performance exhibits consistent improvement.
Pipeline of Test-Time Scaling for Video Generation. Top: Random Linear Search for TTS video generation randomly samples Gaussian noises, prompts the video generator to produce a sequence of video clips through step-by-step denoising in a linear manner, and selects the candidate with the highest score from the test verifiers. Bottom: Tree of Frames (ToF) Search for TTS video generation divides the video generation process into three stages: (a) the first stage performs image-level alignment that influences the later frames; (b) the second stage applies dynamic prompts in the test verifiers V, focusing on motion stability and physical plausibility, to provide feedback that guides the heuristic search process; (c) the last stage assesses the overall quality of the video and selects the video with the highest alignment to the text prompt.
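For intuition, the ToF search described above can be thought of as a level-by-level tree expansion with verifier-guided pruning. The snippet below is a minimal, simplified sketch of that idea, not the actual implementation: `generate_next_clip`, `verifier_score`, and the `keep` beam width are hypothetical placeholders for the video generator's denoising step, the test verifier, and the pruning policy.

```python
# Minimal sketch of Tree-of-Frames (ToF) search (illustrative only).
# `generate_next_clip` and `verifier_score` are hypothetical stand-ins for
# the video generator's denoising step and the test-time verifier.

def tof_search(prompt, branching_factors, generate_next_clip, verifier_score, keep=2):
    """Expand a tree of partial videos level by level, pruning with verifier feedback."""
    beams = [[]]  # each beam is the list of clips generated so far
    for depth, branches in enumerate(branching_factors):
        candidates = []
        for beam in beams:
            for _ in range(branches):
                clip = generate_next_clip(prompt, beam, depth)  # sample one continuation
                candidates.append(beam + [clip])
        # Score partial videos with the (possibly stage-specific) verifier prompt
        scored = sorted(candidates, key=lambda b: verifier_score(prompt, b, depth), reverse=True)
        beams = scored[:keep]  # keep only the most promising branches
    # Final stage: return the completed video with the highest overall score
    return max(beams, key=lambda b: verifier_score(prompt, b, len(branching_factors) - 1))
```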
git clone https://github.com/liuff19/Video-T1.git
cd VideoT1
conda create -n videot1 python==3.10
conda activate videot1
pip install -r requirements.txt
git clone https://github.com/LLaVA-VL/LLaVA-NeXT && cd LLaVA-NeXT && pip install --no-deps -e ".[train]"
You need to download the following models (a download sketch follows this list):
- Pyramid-Flow model checkpoint (for video generation)
- VisionReward-Video model checkpoint (for video reward guidance)
- (Optional) Image-CoT-Generation model checkpoint (for ImageCoT)
- (Optional) DeepSeek-R1-Distill-Llama-8B (Or other LLMs) model checkpoint (for hierarchical prompts)
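As one way to fetch the checkpoints above, you can use `huggingface_hub`. This is only a sketch: the repo IDs below are assumptions, so please verify them against the official model cards before downloading.

```python
# Sketch: fetch checkpoints with huggingface_hub.
# NOTE: the repo IDs below are assumptions -- verify against the official model cards.
from huggingface_hub import snapshot_download

model_path = snapshot_download("rain1011/pyramid-flow-sd3")                 # Pyramid-Flow (video generation)
vr_path = snapshot_download("THUDM/VisionReward-Video")                     # VisionReward-Video (reward guidance)
lm_path = snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")     # optional LLM for hierarchical prompts
```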
cd VideoT1
# Modify videot1.py to assign checkpoints correctly.
python videot1.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard
For inference, please refer to videot1.py for usage.
# Import Pipeline and Base Model
import torch  # needed for the dtype argument below
from pyramid_flow.pyramid_dit import PyramidDiTForVideoGeneration
from pipeline.videot1_pipeline import VideoT1Generator
# Initialize Pyramid-Flow Model (init_pyramid_model / init_vr_model are helper functions; see videot1.py)
pyramid_model = init_pyramid_model(model_path, device, model_variant)
# Initialize VisionReward Model
reward_model, tokenizer = init_vr_model(vr_path, device)
# Initialize VideoT1 Generator
generator = VideoT1Generator(
pyramid_model,
device,
dtype=torch.bfloat16,
image_selector_path=imgcot_path,
result_path=result_path,
lm_path=lm_path,
)
# Courtesy of Pyramid-Flow
# Use the generator to generate videos using TTS strategy
best_video = generator.videot1_gen(
prompt=prompt,
num_inference_steps=[20, 20, 20], # Inference steps for image branch at each level
video_num_inference_steps=[20, 20, 20], # Inference steps for video branch at each level
height=height,
width=width,
num_frames=temp,
guidance_scale=7.0,
video_guidance_scale=5.0,
save_memory=True,
inference_multigpu=True,
video_branching_factors=video_branch,
image_branching_factors=img_branch,
reward_stages=reward_stages,
hierarchical_prompts=True,
result_path=result_path,
intermediate_path=intermed_path,
video_name=video_name,
**reward_params
)
Save GPU memory by loading different models on different GPUs to avoid OOM problems.
Example: load the Reward Model on GPU 0, Pyramid-Flow on GPU 1, the Image-CoT model on GPU 2, and the LLM on GPU 3.
# Load Models in different GPUs
python videot1_multigpu.py --prompt "A cat wearing sunglasses and working as a lifeguard at a pool." --video_name cat_lifeguard --reward_device_id 0 --base_device_id 1 --imgcot_device_id 2 --lm_device_id 3
Please refer to videot1_multigpu.py for multi-GPU inference.
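Conceptually, the multi-GPU setup just pins each model to its own device before constructing the generator. The sketch below reuses the hypothetical paths and init helpers from the single-GPU example above and mirrors the CLI device flags; see videot1_multigpu.py for the actual wiring.

```python
# Sketch: pin each model to its own GPU to avoid OOM (mirrors the CLI flags above).
# Paths and init helpers are the hypothetical ones from the single-GPU example.
reward_device = "cuda:0"   # --reward_device_id 0
base_device = "cuda:1"     # --base_device_id 1
imgcot_device = "cuda:2"   # --imgcot_device_id 2
lm_device = "cuda:3"       # --lm_device_id 3

reward_model, tokenizer = init_vr_model(vr_path, reward_device)
pyramid_model = init_pyramid_model(model_path, base_device, model_variant)
# The Image-CoT selector and hierarchical-prompt LLM are placed on their devices
# when the generator is constructed (see videot1_multigpu.py for the actual logic).
```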
- reward_stages: Choose three indices for reward-model pruning. If the tree's current depth is one of these indices, all video clips at that depth are fed into the reward model for judging.
- variant: We recommend 768p for better quality; choose from 384, 768 (same as Pyramid-Flow).
- img_branch: A list of integers, each corresponding to the number of images at the beginning of the Image-CoT process at that depth.
- video_branch: A list of integers, each corresponding to the number of generated next frames at that depth.

Namely, if img_branch is array $A$ and video_branch is array $B$, then at depth $i$ we would have $A[i] \times B[i]$ initial images for each branch, and $B[i]$ next latent frames would be the children of each branch (see the toy example below).
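For example, with the hypothetical settings `img_branch = [2, 1, 1]` and `video_branch = [3, 2, 2]`, the per-depth counts implied by the formula above can be computed like this:

```python
# Toy illustration of how img_branch (A) and video_branch (B) interact per depth.
img_branch = [2, 1, 1]     # A: images sampled for the Image-CoT stage at each depth
video_branch = [3, 2, 2]   # B: next-frame continuations per branch at each depth

for depth, (a, b) in enumerate(zip(img_branch, video_branch)):
    print(f"depth {depth}: {a * b} initial images per branch, {b} child latent frames per branch")
```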
We will release the dataset for Test-Time Scaling on CogVideoX-5B.
We are thankful to the following great works, which we referred to when implementing Video-T1, and to Yixin for the great figure design:
Pyramid-Flow
NOVA
VisionReward
VideoLLaMA3
CogVideoX
OpenSora
Image-Generation-CoT
@article{liu2025video,
title={Video-T1: Test-Time Scaling for Video Generation},
author={Liu, Fangfu and Wang, Hanyang and Cai, Yimo and Zhang, Kaiyan and Zhan, Xiaohang and Duan, Yueqi},
journal={arXiv preprint arXiv:2503.18942},
year={2025}
}