BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models (CVPR 2024)

Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang

Project Website arXiv

Given an image diffusion model (IDM) for a specific image synthesis task and a text-to-video foundation diffusion model (VDM), our framework performs training-free video synthesis by bridging the IDM and VDM with Mixed Inversion.

Method

BIVDiff pipeline. Our framework consists of three components: Frame-wise Video Generation, Mixed Inversion, and Video Temporal Smoothing. We first use the image diffusion model for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally feed the inverted latents into the video diffusion model for video temporal smoothing.
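To make the Mixed Inversion step concrete, the sketch below blends the DDIM-inverted latents of the IDM-generated frames with fresh Gaussian noise under a mixing ratio. This is only an illustration under that assumption; the exact mixing rule is defined in the paper and implemented in inference.py.

# Rough sketch of the Mixed Inversion blending step (illustrative only; see the paper
# and inference.py for the exact formulation).
import torch

def mix_latents(inverted_latents: torch.Tensor, mixing_ratio: float) -> torch.Tensor:
    # Blend DDIM-inverted latents of the IDM-generated frames with fresh Gaussian noise.
    noise = torch.randn_like(inverted_latents)
    return mixing_ratio * inverted_latents + (1.0 - mixing_ratio) * noise

# Toy usage: one clip of 8 latent frames, each 4 x 64 x 64.
mixed = mix_latents(torch.randn(1, 4, 8, 64, 64), mixing_ratio=0.5)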

Setup

1. Requirements

conda create -n bivdiff python=3.10.9
conda activate bivdiff
bash install.txt

2. Download Weights

Our framework currently supports two text-to-video diffusion models (VidRD and ZeroScope) and five downstream image diffusion models (ControlNet, T2I-Adapter, InstructPix2Pix, Prompt2Prompt, Stable Diffusion Inpainting). These models may also require Stable Diffusion v1.5 and Stable Diffusion v2.1 (put CLIP under the Stable Diffusion 2.1 directory). Download all pre-trained weights into the checkpoints/ directory. The final file tree should look like this:

Note: remove the safetensors files.

checkpoints
├── stable-diffusion-v1-5
├── stable-diffusion-2-1-base
    ├── clip
    ├── ...
├── ControlNet
    ├── control_v11f1p_sd15_depth
    ├── control_v11p_sd15_canny
    ├── control_v11p_sd15_openpose
├── InstructPix2Pix
    ├── instruct-pix2pix
├── StableDiffusionInpainting
    ├── stable-diffusion-inpainting 
├── T2IAdapter
    ├── t2iadapter_depth_sd15v2
    ├── t2iadapter_canny_sd15v2
├── VidRD
    ├── ModelT2V.pth
├── ZeroScope
    ├── zeroscope_v2_576w
├── ... (more image or video diffusion models)
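If you prefer to fetch the Hugging Face checkpoints programmatically, a sketch along the lines below may help. The repo IDs are assumptions and not defined by this repository (some may have moved on the Hub), so verify them against the corresponding model cards; the VidRD weight ModelT2V.pth is distributed separately and is not covered here.

# Sketch: downloading some of the Hugging Face checkpoints above with huggingface_hub.
# The repo IDs are assumptions; verify them before use. VidRD weights are not on the Hub.
from huggingface_hub import snapshot_download

checkpoints = {
    "runwayml/stable-diffusion-v1-5": "checkpoints/stable-diffusion-v1-5",
    "stabilityai/stable-diffusion-2-1-base": "checkpoints/stable-diffusion-2-1-base",
    "lllyasviel/control_v11f1p_sd15_depth": "checkpoints/ControlNet/control_v11f1p_sd15_depth",
    "timbrooks/instruct-pix2pix": "checkpoints/InstructPix2Pix/instruct-pix2pix",
    "cerspense/zeroscope_v2_576w": "checkpoints/ZeroScope/zeroscope_v2_576w",
}
for repo_id, local_dir in checkpoints.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)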

You can add more image or video models as needed, following the instructions here.

Inference

# 1. Controllable Video Generation
# ControlNet + VidRD
python inference.py --config-name="controllable_video_generation_with_controlnet_and_vidrd"
# ControlNet + ZeroScope
python inference.py --config-name="controllable_video_generation_with_controlnet_and_zeroscope"
# T2I-Adapter + VidRD
python inference.py --config-name="controllable_video_generation_with_t2iadapter_and_vidrd"

# 2. Video Editing
# InstructPix2Pix + VidRD
python inference.py --config-name="video_editing_with_instruct_pix2pix_and_vidrd"
# InstructPix2Pix + ZeroScope
python inference.py --config-name="video_editing_with_instruct_pix2pix_and_zeroscope"
# Prompt2Prompt + VidRD
python inference.py --config-name="video_editing_with_instruct_prompt2prompt_and_vidrd"

# 3. Video Inpainting
# StableDiffusionInpainting + VidRD
python inference.py --config-name="video_inpainting_with_stable_diffusion_inpainting_and_vidrd"

# 4. Video Outpainting
# StableDiffusionInpainting + VidRD
python inference.py --config-name="video_outpainting_with_stable_diffusion_inpainting_and_vidrd"

Add New Diffusion Models

We decouple the implementations of the image and video diffusion models, so it is easy to add new image and video models with minor modifications. The inference pipeline in inference.py is as follows (a hypothetical skeleton is sketched after the list):

1. Load Models
2. Construct Pipelines
3. Read Video
4. Frame-wise Video Generation
5. Mixed Inversion
6. Video Temporal Smoothing
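
The skeleton below mirrors these six steps. It is a sketch, not the actual code in inference.py: the run_bivdiff name and every *_fn argument are hypothetical placeholders, while the config fields match the config template described later in this README.

# Hypothetical skeleton of the inference pipeline; the *_fn callables stand in for
# the corresponding code in inference.py.
def run_bivdiff(config, load_models_fn, build_pipelines_fn, read_video_fn, mixed_inversion_fn):
    # 1. Load Models: the task-specific IDM and the VDM components.
    idm, vdm = load_models_fn(config.Model.idm, config.Model.vdm)
    # 2. Construct Pipelines around the two models.
    idm_pipe, vdm_pipe = build_pipelines_fn(idm, vdm, config)
    # 3. Read Video: config.Model.video_length frames from config.Model.video_path.
    video = read_video_fn(config.Model.video_path, config.Model.video_length)
    # 4. Frame-wise Video Generation with the image diffusion model.
    frames = idm_pipe(video, config.Model.idm_prompt)
    # 5. Mixed Inversion of the generated frames.
    latents = mixed_inversion_fn(frames, config)
    # 6. Video Temporal Smoothing with the video diffusion model.
    return vdm_pipe(latents, config.Model.vdm_prompt)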

Image Diffusion Models

To add a new image diffusion model, all you need to do is implement infer.py so that it returns the video generated by the image diffusion model. First, create a Your_IDM directory in the models folder and add infer.py under it. Then implement the infer_Your_IDM function in infer.py. For example:

# BIVDiff/inference.py
from models.Your_IDM.infer import infer_Your_IDM
def infer(video, generator, config, latents=None):
    model_name = config.Model.idm
    prompt = config.Model.idm_prompt
    output_path = config.Model.output_path
    height = config.Model.height
    width = config.Model.width
    if model_name == "Your_IDM":
        # return the video generated by IDM
        return infer_Your_IDM(...)

# BIVDiff/models/Your_IDM/infer.py 
def infer_Your_IDM(...):
    idm_model = initialized_func(model_path)
    # return n-frame video
    frames = [idm_model(...) for i in range(video_length)]
    return frames

You can generate an n-frame video sequentially, as in the code above, which is easy to implement without modifying the image model code but is time-consuming. Alternatively, you can extend the image diffusion model to video by inflating its convolutions and merging the temporal dimension into the batch dimension (please refer to Tune-A-Video and ControlVideo) to generate all n frames simultaneously and much faster; a minimal sketch of this batched idea follows.
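The snippet below is only a sketch of the batch-merging idea, not this repository's implementation; image_model is a stand-in for your IDM's 2D denoising network.

# Sketch: merging the temporal dimension into the batch dimension so a 2D image model
# processes all frames of a clip in a single forward pass.
import torch
from einops import rearrange

def run_idm_on_video(image_model, video):
    # video: (batch, channels, frames, height, width)
    b = video.shape[0]
    frames_as_batch = rearrange(video, "b c f h w -> (b f) c h w")
    out = image_model(frames_as_batch)  # every frame is treated as an independent image
    return rearrange(out, "(b f) c h w -> b c f h w", b=b)

# Toy usage with a stand-in "image model" (a single 2D convolution on latents).
image_model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
video_latents = torch.randn(1, 4, 8, 64, 64)  # 8 latent frames
print(run_idm_on_video(image_model, video_latents).shape)  # torch.Size([1, 4, 8, 64, 64])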

If your image diffusion model is difficult to integrate into our framework, you can instead feed in a GIF file generated by your IDM:

# 4. frame-wise video generation
from PIL import Image, ImageSequence

# Read the frames of the IDM-generated GIF as RGB images.
frames_by_idm = []
gif = Image.open("./data/your_case.gif")
for i, frame in enumerate(ImageSequence.Iterator(gif)):
    frame.save("frame%d.png" % i)
    frames_by_idm.append(Image.open("frame%d.png" % i).convert("RGB"))

Note: to keep adding new image and video models simple for users, the open-source repo generates videos with image models sequentially. However, we had modified ControlNet and InstructPix2Pix for parallel generation in the old repo. We provide the videos generated in this parallel manner (in BIVDiff/data/results_parallel), and you can use the method above to reproduce the ControlNet and InstructPix2Pix results.

Video Diffusion Models

To add a new video diffusion model, all you need to do is provide the U-Net, text encoder, and text tokenizer of the video diffusion model in the 1. Load Models step of inference.py. For example:

# BIVDiff/inference.py
video_unet = None
video_text_encoder = None
video_text_tokenizer = None
if config.Model.vdm == "Your_Model_Name":
    vdm_model = initialized_func(model_path)
    video_unet = vdm_model.unet
    video_text_encoder = vdm_model.text_encoder
    video_text_tokenizer = vdm_model.text_tokenizer
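
If your VDM is hosted as a diffusers pipeline (ZeroScope, for example), these components can usually be taken from the loaded pipeline. The snippet below is a sketch under that assumption and uses standard diffusers attribute names; it is not the exact code of this repository.

# Sketch: exposing the U-Net, text encoder, and tokenizer from a diffusers
# text-to-video pipeline such as ZeroScope.
import torch
from diffusers import DiffusionPipeline

vdm_model = DiffusionPipeline.from_pretrained(
    "./checkpoints/ZeroScope/zeroscope_v2_576w", torch_dtype=torch.float16
)
video_unet = vdm_model.unet
video_text_encoder = vdm_model.text_encoder
video_text_tokenizer = vdm_model.tokenizer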

Config File

# add config.yaml in configs folder
hydra:
  output_subdir: null
  run:
    dir: .

VDM:
  vdm_name
  other settings of vdm_name :

IDM:
  idm_name
  mixing_ratio: # [0, 1]
  other settings of idm_name :

MixedInversion:
  idm: "SD-v15"
  pretrained_model_path: "./checkpoints/stable-diffusion-v1-5"

Model:
  idm: idm_name
  vdm: vdm_name
  idm_prompt:
  vdm_prompt:
  video_path:
  output_path: "./outputs/"
  
  # compatible with IDM and VDM
  video_length: 8
  width: 512
  height: 512
  frame_rate: 4
  seed: 42
  guidance_scale: 7.5
  num_inference_steps: 50
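
For reference, inference.py reads this config via Hydra (the inference commands above pass --config-name). The snippet below is only an assumed sketch of how such a config might be consumed, with field names taken from the template above.

# Sketch (assumption): consuming a config like the template above with Hydra.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="your_config")
def main(config: DictConfig):
    print(config.Model.idm, "+", config.Model.vdm)   # which IDM/VDM pair is bridged
    print(config.IDM.mixing_ratio)                   # Mixed Inversion ratio in [0, 1]
    print(config.Model.video_length, config.Model.height, config.Model.width)

if __name__ == "__main__":
    main()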

Results

Controllable Video Generation (ControlNet + VidRD)

Depth: "A person on a motorcycle does a burnout on a frozen lake"
Canny: "A silver jeep car is moving on the winding forest road"
Pose: "An astronaut moonwalks on the moon"

Controllable Video Generation (ControlNet + ZeroScope)

Depth: "A brown spotted cow is walking in heavy rain"
Canny: "A bear walking through a snow mountain"
Pose: "Iron Man moonwalks in the desert"

Controllable Video Generation (T2I-Adapter + VidRD)

 
 
"A red car moves in front of buildings" "Iron Man moonwalks on the beach"

Video Editing (InstructPix2Pix + VidRD)

Source Video: "A man is moonwalking"
Style Transfer: "Make it Minecraft"
Replace Object: "Replace the man with Spider Man"
Replace Background: "Change the background to stadium"

Video Editing (InstructPix2Pix + ZeroScope)

 
 
"Replace the man with a little boy" "Make it minecraft style"

Video Editing (Prompt2Prompt + VidRD)

Source Video: "A car is moving on the road"
Attention Replace: "A car → bicycle is moving on the road"
Attention Refine: "A car is moving on the road at sunset", "A car is moving on the road in the forest", "A white car is moving on the road"

Video Inpainting (Stable Diffusion Inpainting + VidRD)

Erase Object (source video, mask, output video): ""
Replace Object (source video, mask, output video): "A sports car is moving on the road"

Video Outpainting (Stable Diffusion Inpainting + VidRD)

Source Video Masked Video Output Video

Citation

If you make use of our work, please cite our paper.

@InProceedings{Shi_2024_CVPR,
    author    = {Shi, Fengyuan and Gu, Jiaxi and Xu, Hang and Xu, Songcen and Zhang, Wei and Wang, Limin},
    title     = {BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {7393-7402}
}

Acknowledgement

This repository borrows heavily from Diffusers, Tune-A-Video, and ControlVideo. Thanks for their contributions!
