# CogVideoX-Interpolation: Keyframe Interpolation with CogVideoX

CogVideoX-Interpolation is a modified pipeline built on the CogVideoX architecture, designed to provide more flexible keyframe-interpolation generation.

| Input frame 1 | Input frame 2 | Text | Generated video |
| --- | --- | --- | --- |
| *(image)* | *(image)* | A group of people dance in the street at night, forming a circle and moving in unison, exuding a sense of community and joy. A woman in an orange jacket is actively engaged in the dance, smiling broadly. The atmosphere is vibrant and festive, with other individuals participating in the dance, contributing to the sense of community and joy. | *(video)* |
| *(image)* | *(image)* | A man in a white suit stands on a stage, passionately preaching to an audience. The stage is decorated with vases with yellow flowers and a red carpet, creating a formal and engaging atmosphere. The audience is seated and attentive, listening to the speaker. | *(video)* |
| *(image)* | *(image)* | A man in a blue suit is laughing. | *(video)* |

## Quick Start

### 1. Set up the repository and environment

Our environment is identical to CogVideoX's; you can install the dependencies with:

```bash
pip install -r requirement.txt
```

### 2. Download checkpoint

Download the fine-tuned checkpoint and set the `model_path` variable to its location.
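If the checkpoint is hosted on the Hugging Face Hub, one way to fetch it is with `huggingface_hub` (a minimal sketch; the repo id below is a placeholder, not the actual checkpoint location):

```python
# Sketch: fetch the checkpoint from the Hugging Face Hub.
# "your-org/CogVideoX-Interpolation" is a placeholder repo id; replace it
# with the actual checkpoint repository.
from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="your-org/CogVideoX-Interpolation",
    local_dir="checkpoints/cogvideox-interpolation",
)
print(f"checkpoint downloaded to: {model_path}")
```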

### 3. Launch the inference script!

Example input keyframe pairs are provided in the `cases` folder. You can run the minimal code below, or refer to `infer.py`, which generates those cases.

```python
import torch
from diffusers.utils import export_to_video, load_image

from cogvideox_interpolation.pipeline import CogVideoXInterpolationPipeline

model_path = "/path/to/model/download"
pipe = CogVideoXInterpolationPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
)

# Reduce peak GPU memory usage at the cost of some speed.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A man in a blue suit is laughing."
first_image_path = "/path/to/first/keyframe.png"
second_image_path = "/path/to/last/keyframe.png"
video_save_path = "output.mp4"

first_image = load_image(first_image_path)
last_image = load_image(second_image_path)
video = pipe(
    prompt=prompt,
    first_image=first_image,
    last_image=last_image,
    num_videos_per_prompt=1,  # generate one video for this prompt
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, video_save_path, fps=8)
```
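To run all the example pairs in one go, a loop along these lines may help. This is a sketch only: it assumes each subfolder of `cases` holds `first.png`, `last.png`, and `prompt.txt`, which may not match the actual layout; `infer.py` is the authoritative version.

```python
# Sketch: batch inference over the example cases, reusing `pipe` from above.
# The per-case file names (first.png / last.png / prompt.txt) are assumptions;
# see infer.py for how the repo actually enumerates its cases.
from pathlib import Path

for case_dir in sorted(Path("cases").iterdir()):
    if not case_dir.is_dir():
        continue
    video = pipe(
        prompt=(case_dir / "prompt.txt").read_text().strip(),
        first_image=load_image(str(case_dir / "first.png")),
        last_image=load_image(str(case_dir / "last.png")),
        num_videos_per_prompt=1,
        num_inference_steps=50,
        num_frames=49,
        guidance_scale=6,
        generator=torch.Generator(device="cuda").manual_seed(42),
    ).frames[0]
    export_to_video(video, str(case_dir / "output.mp4"), fps=8)
```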

## Lightweight fine-tuning

Prepare the video-text pair data in the expected format; our experiments can then be reproduced by simply running the training script:

```bash
sh finetune.sh
```

Note that we fine-tune a selected subset of parameters rather than using LoRA.
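For intuition, selective fine-tuning amounts to freezing every weight and re-enabling gradients only for the chosen groups. The sketch below illustrates the pattern; the substring filters are hypothetical, and `finetune.sh` defines the parameter groups we actually train.

```python
# Sketch: partial-parameter fine-tuning. Freeze everything, then unfreeze
# selected groups. The patterns below are illustrative assumptions, not the
# repo's actual selection.
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    model_path, subfolder="transformer"  # model_path from the checkpoint step
)

TRAINABLE_PATTERNS = ("attn1", "norm")  # hypothetical parameter groups

for name, param in transformer.named_parameters():
    param.requires_grad = any(p in name for p in TRAINABLE_PATTERNS)

trainable = [p for p in transformer.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```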

We also provide the training data on Hugging Face; videos are first filtered by FPS and resolution, then auto-labeled with an advanced MLLM.
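The FPS/resolution filtering step can be reproduced along these lines (a sketch; the thresholds are assumptions, not the cutoffs used to build the dataset):

```python
# Sketch: keep only videos above minimum FPS and resolution before labeling.
# The thresholds are assumed values, not the dataset's actual cutoffs.
import cv2

def keep_video(path: str, min_fps: float = 8.0, min_w: int = 720, min_h: int = 480) -> bool:
    cap = cv2.VideoCapture(path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS)
        width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
        height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    finally:
        cap.release()
    return fps >= min_fps and width >= min_w and height >= min_h
```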

Fine-tuning takes about one week on 8 A100 GPUs.

Finally, we share some interesting observations about which parameters to assign to the trainable groups:


## Acknowledgments

The codebase is built on the awesome CogVideoX and diffusers repos.