CogVideoX-Interpolation is a modified pipeline based on the CogVideoX structure, designed to provide more flexibility in keyframe interpolation generation.
Our environment is exactly the same as CogVideoX's, and you can install the dependencies with:

```bash
pip install -r requirement.txt
```
Download the fine-tuned checkpoint and point the model path variable at it.
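For example, the checkpoint can be fetched with `huggingface_hub`; the repository id below is a placeholder, not the actual checkpoint name:

```python
# Minimal download sketch; replace the placeholder repo id with the released checkpoint.
from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="your-org/CogVideoX-Interpolation",       # hypothetical repo id
    local_dir="checkpoints/cogvideox-interpolation",  # where to store the weights
)
print(model_path)  # pass this path to the pipeline below
```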
The example input keyframe pairs are in the `cases` folder. You can run the minimal code below, or refer to `infer.py`, which generates the example cases.
```python
import torch
from diffusers.utils import export_to_video, load_image

from cogvideox_interpolation.pipeline import CogVideoXInterpolationPipeline

# Set these before running.
model_path = "/path/to/model/download"
prompt = "A text description of the transition between the two keyframes"
first_image_path = "cases/first_frame.png"    # placeholder
second_image_path = "cases/last_frame.png"    # placeholder
video_save_path = "output.mp4"

pipe = CogVideoXInterpolationPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
)
# Reduce GPU memory usage: offload modules to CPU sequentially and tile/slice the VAE.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

first_image = load_image(first_image_path)
last_image = load_image(second_image_path)

video = pipe(
    prompt=prompt,
    first_image=first_image,
    last_image=last_image,
    num_videos_per_prompt=50,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
)[0]  # indexing the pipeline output returns the list of generated videos
export_to_video(video[0], video_save_path, fps=8)
```
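To generate results for all the provided examples, you can loop over the `cases` folder with the pipeline created above. The per-case file names (`first.png`, `last.png`, `prompt.txt`) are assumptions for illustration; `infer.py` implements the actual handling of the cases.

```python
# Illustrative batch loop over the cases folder; the per-case file names are assumptions.
import os

from diffusers.utils import export_to_video, load_image

cases_root = "cases"
for case_name in sorted(os.listdir(cases_root)):
    case_dir = os.path.join(cases_root, case_name)
    if not os.path.isdir(case_dir):
        continue
    with open(os.path.join(case_dir, "prompt.txt")) as f:
        prompt = f.read().strip()
    first_image = load_image(os.path.join(case_dir, "first.png"))
    last_image = load_image(os.path.join(case_dir, "last.png"))
    video = pipe(  # `pipe` is the CogVideoXInterpolationPipeline built above
        prompt=prompt,
        first_image=first_image,
        last_image=last_image,
        num_inference_steps=50,
        num_frames=49,
        guidance_scale=6,
    )[0]
    export_to_video(video[0], os.path.join(case_dir, "output.mp4"), fps=8)
```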
You can prepare the video-text pair data in the expected format, and our experiments can be reproduced by simply running the training script:

```bash
sh finetune.sh
```
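The exact data format expected by the script is defined in the repo; as a purely hypothetical illustration, a `metadata.jsonl` mapping each video file to its caption could be read like this (the file name and keys are assumed, not the repo's actual schema):

```python
# Hypothetical video-text pair listing; "metadata.jsonl", "video", and "text" are assumed names.
import json

def load_video_text_pairs(metadata_path="metadata.jsonl"):
    pairs = []
    with open(metadata_path) as f:
        for line in f:
            record = json.loads(line)
            pairs.append((record["video"], record["text"]))  # (video path, caption)
    return pairs

print(f"loaded {len(load_video_text_pairs())} video-text pairs")
```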
Note that we fine-tune a selected subset of parameters rather than using LoRA.
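A minimal sketch of this kind of selective fine-tuning freezes the whole transformer and then unfreezes modules whose names match chosen patterns; the patterns and learning rate below are placeholders, not the groups used in our runs:

```python
# Selective fine-tuning sketch: freeze everything, then unfreeze chosen parameter groups.
import torch

trainable_patterns = ["patch_embed", "norm_final", "attn1"]  # placeholder name patterns

for name, param in pipe.transformer.named_parameters():
    param.requires_grad = any(pattern in name for pattern in trainable_patterns)

trainable = [p for p in pipe.transformer.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")

optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # illustrative learning rate
```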
We also provide the training data on Hugging Face; the videos are first filtered by FPS and resolution and then auto-labeled with an advanced MLLM.
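For reference, an FPS/resolution filtering pass can be sketched with OpenCV; the thresholds below are illustrative, not the criteria used to build the released dataset:

```python
# Illustrative FPS/resolution filter; thresholds are assumptions, not the dataset's actual criteria.
import cv2

def keep_video(path, min_fps=20, min_width=720, min_height=480):
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        return False
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    cap.release()
    return fps >= min_fps and width >= min_width and height >= min_height
```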
Fine-tuning takes about one week on 8 A100 GPUs.
Finally, we share some interesting observations about which parameters to select into the trainable groups:
The codebase is built on the awesome CogVideoX and diffusers repos.