This is the official PyTorch implementation of the ACM MM 2024 paper "LoMOE: Localized Multi-Object Editing via Multi-Diffusion". All published data is available on our project page.
This code was tested with `python=3.9`, `pytorch=2.0.1`, and `torchvision=0.15.2`. Please follow the official PyTorch installation instructions to install the PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended.
Create a conda environment with the following dependencies:
conda create -n lomoe python=3.9
conda activate lomoe
conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install accelerate==0.20.3 diffusers==0.12.1 einops==0.7.0 ipython transformers==4.26.1 salesforce-lavis==1.0.2
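After installing, a quick sanity check (a minimal sketch, not part of the repository) can confirm that the pinned versions and a CUDA-enabled build were picked up:

```python
# Quick environment sanity check (not part of the repository).
import torch
import torchvision
import diffusers
import transformers

print("torch:", torch.__version__)                 # expect 2.0.1
print("torchvision:", torchvision.__version__)     # expect 0.15.2
print("diffusers:", diffusers.__version__)         # expect 0.12.1
print("transformers:", transformers.__version__)   # expect 4.26.1
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```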
Start by downloading the SOE and MOE datasets from our project page to `./benchmark/data`.
To generate the prompt, the inverted latent, and the intermediate latents for an image, first run the inversion script located at `./lomoe/invert/inversion.py`. Then, to apply edits, use `./lomoe/edit/main.py`. A sample image and corresponding masks for single- and multi-object edit operations are provided in `./lomoe/sample/`.
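`salesforce-lavis` appears in the dependency list, so the source prompt is presumably an automatic BLIP caption of the input image. The snippet below is a minimal sketch of producing such a caption with LAVIS; it illustrates the idea, not the exact code inside `inversion.py`:

```python
# Minimal sketch: caption an input image with LAVIS's BLIP captioner.
# This illustrates how a source prompt can be generated; it is not the
# exact logic used inside invert/inversion.py.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("sample/single/init_image.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
caption = model.generate({"image": image})[0]
print(caption)  # e.g. a short description of the scene
```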
The `invert/inversion.py` script takes the following arguments:

- `--input_image`: Path to the image.
- `--results_folder`: Path to store the prompt, the inverted latent, and the intermediate latents.
CUDA_VISIBLE_DEVICES=0 python invert/inversion.py \
--input_image "sample/single/init_image.jpg" \
--results_folder "invert/output/single"
CUDA_VISIBLE_DEVICES=0 python invert/inversion.py \
--input_image "sample/multi/init_image.png" \
--results_folder "invert/output/multi"
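Judging from the paths used in the editing examples below, the inversion step writes the caption under `prompt/`, the final inverted latent under `inversion/`, and the per-step latents under `latentlist/`. A short sketch for inspecting these outputs, assuming that layout:

```python
# Inspect the artifacts produced by invert/inversion.py (output layout
# assumed from the example paths used elsewhere in this README).
import torch

with open("invert/output/single/prompt/init_image.txt") as f:
    print("source prompt:", f.read().strip())

latent = torch.load("invert/output/single/inversion/init_image.pt")
print("inverted latent:", tuple(latent.shape))  # typically 1 x 4 x 64 x 64 for a 512x512 image

latent_list = torch.load("invert/output/single/latentlist/init_image.pt")
print("stored intermediate latents:", len(latent_list))
```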
The `edit/main.py` script takes the following arguments:

- `--mask_paths`: Path(s) to the object mask(s) (plain image files; see the sketch after this list).
- `--num_fgmasks`: Number of foreground masks (defaults to 1).
- `--bg_prompt`: Path to the background prompt (we use the prompt generated by `inversion.py`).
- `--bg_negative`: Path to the background negative prompt (we use the prompt generated by `inversion.py`).
- `--fg_prompts`: Edit prompts corresponding to the masks.
- `--fg_negative`: The foreground negative prompt (we use "artifacts, blurry, smooth texture, bad quality, distortions, unrealistic, distorted image").
- `--W`: Output image width.
- `--H`: Output image height.
- `--seed`: The seed to initialize random number generators (defaults to 0).
- `--sd_version`: The Stable Diffusion version to be used (use the same as in `inversion.py`).
- `--steps`: The number of diffusion timesteps (use the same as in `inversion.py`).
- `--ca_coef`: Cross-attention preservation loss coefficient (defaults to 1.0).
- `--seg_coef`: Background loss coefficient (defaults to 1.75).
- `--bootstrapping`: Value of the bootstrapping parameter (defaults to 20; see the fusion sketch after the example commands).
- `--latent`: Path to the inverted latent produced by `inversion.py`.
- `--latent_list`: Path to the latent list produced by `inversion.py`.
- `--rec_path`: Path to save the reconstructed input image.
- `--edit_path`: Path to save the edited image.
- `--save_path`: Path to save the merged reconstructed and edited image.
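The masks passed via `--mask_paths` are ordinary image files (see `sample/single/mask_1.jpg` and `sample/multi/mask_*.png`). If you bring your own mask, one illustrative way to binarize it and match the editing resolution (the file names below are hypothetical):

```python
# Illustrative only: binarize a rough grayscale mask and resize it to the
# --H/--W used for editing. "my_rough_mask.png" / "my_mask.png" are
# hypothetical file names, not part of the repository.
from PIL import Image

m = Image.open("my_rough_mask.png").convert("L").resize((512, 512), Image.NEAREST)
m = m.point(lambda p: 255 if p > 127 else 0)  # hard threshold to a binary mask
m.save("my_mask.png")
```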
CUDA_VISIBLE_DEVICES=0 python edit/main.py \
--mask_paths "sample/single/mask_1.jpg" \
--bg_prompt "invert/output/single/prompt/init_image.txt" \
--bg_negative "invert/output/single/prompt/init_image.txt" \
--fg_negative "artifacts, blurry, smooth texture, bad quality, distortions, unrealistic, distorted image" \
--H 512 \
--W 512 \
--bootstrapping 20 \
--latent 'invert/output/single/inversion/init_image.pt' \
--latent_list 'invert/output/single/latentlist/init_image.pt' \
--rec_path 'results/single/1_reconstruction.png' \
--edit_path 'results/single/2_edit.png' \
--fg_prompts "a red dog collar" \
--seed 1234 \
--save_path 'results/single/3_merged.png'
CUDA_VISIBLE_DEVICES=0 python edit/main.py \
--mask_paths "sample/multi/mask_1.png" "sample/multi/mask_2.png" \
--bg_prompt "invert/output/multi/prompt/init_image.txt" \
--bg_negative "invert/output/multi/prompt/init_image.txt" \
--fg_negative "artifacts, blurry, smooth texture, bad quality, distortions, unrealistic, distorted image" "artifacts, blurry, smooth texture, bad quality, distortions, unrealistic, distorted image" \
--H 512 \
--W 512 \
--bootstrapping 20 \
--latent 'invert/output/multi/inversion/init_image.pt' \
--latent_list 'invert/output/multi/latentlist/init_image.pt' \
--rec_path 'results/multi/1_reconstruction.png' \
--edit_path 'results/multi/2_edit.png' \
--fg_prompts "a crochet bird" "an origami bird" \
--num_fgmasks 2 \
--seed 1234 \
--save_path 'results/multi/3_merged.png'
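For intuition, each `--fg_prompts` entry is denoised within its own mask and the per-region latents are fused with the background stream at every diffusion step, in the spirit of MultiDiffusion; during the first `--bootstrapping` steps the regions are typically synthesized against a constant background so that objects stay tightly inside their masks. The snippet below is a simplified, illustrative sketch of the mask-weighted fusion only; the repository's implementation additionally applies the cross-attention and background preservation losses controlled by `--ca_coef` and `--seg_coef`:

```python
# Simplified sketch of mask-weighted latent fusion across regions.
# This is illustrative and not the repository's implementation.
import torch

def fuse_latents(latents_per_region, masks, bg_latent, bg_mask):
    """Blend per-region latents into a single latent for the next denoising step.

    latents_per_region: list of tensors (1, 4, h, w), one per foreground prompt
    masks:              list of tensors (1, 1, h, w), binary masks in latent space
    bg_latent:          tensor (1, 4, h, w) denoised with the background prompt
    bg_mask:            tensor (1, 1, h, w), 1 where no foreground mask is active
    """
    num = bg_latent * bg_mask
    den = bg_mask.clone()
    for z, m in zip(latents_per_region, masks):
        num = num + z * m
        den = den + m
    # Weighted average of all streams; clamp avoids division by zero where
    # no mask covers a pixel.
    return num / den.clamp(min=1e-8)
```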
To compute the classical and neural metrics, use `compute_metrics.py` in `./benchmark/metrics/{SOE/MOE}`. This covers the SRC and TGT CLIP scores, BG LPIPS, BG PSNR, BG MSE, BG SSIM, and the Structural Distance. The `compute_aesthetic.py` script in `./benchmark/metrics/{SOE/MOE}` computes the aesthetic metrics, including HPS, IR, and the Aesthetic Score. This script also requires additional dependencies, namely HPSv2 and ImageReward.
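For reference, HPSv2 and ImageReward are typically invoked as below. This is a minimal usage sketch of those two packages (with example prompt and paths), not the exact contents of `compute_aesthetic.py`:

```python
# Minimal usage sketch of the HPSv2 and ImageReward packages; the actual
# compute_aesthetic.py iterates over a whole folder of edits.
import hpsv2
import ImageReward as RM

prompt = "a red dog collar"                # example edit prompt
image_path = "results/single/2_edit.png"   # example edited image

# Human Preference Score v2 for the edited image given the edit prompt
# (hpsv2.score accepts an image path or a list of paths).
print("HPS:", hpsv2.score(image_path, prompt))

# ImageReward score for the same image/prompt pair.
ir_model = RM.load("ImageReward-v1.0")
print("ImageReward:", ir_model.score(prompt, image_path))
```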
NOTE: The `compute_metrics.py` and `compute_aesthetic.py` scripts expect a folder containing edits for all images in the dataset. Please modify the code to run them on a smaller subset or on single images.
CUDA_VISIBLE_DEVICES=0 python compute_metrics.py --folder_name PATH_TO_SAVED_EDITS
CUDA_VISIBLE_DEVICES=0 python compute_aesthetic.py --folder_name PATH_TO_SAVED_EDITS
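As a rough illustration of what these metrics measure, the sketch below computes a CLIP similarity between an edited image and its target prompt, plus background-only PSNR/SSIM/MSE/LPIPS outside the edit mask. It is a simplified stand-in for `compute_metrics.py`; the paths, prompt, and thresholds are illustrative:

```python
# Illustrative sketch of a TGT CLIP score and background-only metrics.
# Simplified stand-in for compute_metrics.py; paths/prompt are examples.
# Requires scikit-image >= 0.19 (channel_axis) and the lpips package.
import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from transformers import CLIPModel, CLIPProcessor

src = np.array(Image.open("sample/single/init_image.jpg").convert("RGB").resize((512, 512)))
edit = np.array(Image.open("results/single/2_edit.png").convert("RGB").resize((512, 512)))
mask = np.array(Image.open("sample/single/mask_1.jpg").convert("L").resize((512, 512))) > 127

# CLIP similarity between the edited image and the target (edit) prompt.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = proc(text=["a red dog collar"], images=Image.fromarray(edit),
              return_tensors="pt", padding=True)
with torch.no_grad():
    out = clip(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print("TGT CLIP score:", (img_emb @ txt_emb.T).item())

# Background-only metrics: zero out the edited region in both images.
bg = ~mask[..., None]
src_bg, edit_bg = src * bg, edit * bg
print("BG PSNR:", peak_signal_noise_ratio(src_bg, edit_bg, data_range=255))
print("BG SSIM:", structural_similarity(src_bg, edit_bg, channel_axis=-1, data_range=255))
print("BG MSE:", np.mean((src_bg.astype(np.float32) - edit_bg.astype(np.float32)) ** 2))

# Background LPIPS with the AlexNet backbone (inputs scaled to [-1, 1]).
to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1).float().unsqueeze(0) / 127.5 - 1.0
loss_fn = lpips.LPIPS(net="alex")
with torch.no_grad():
    print("BG LPIPS:", loss_fn(to_t(src_bg), to_t(edit_bg)).item())
```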
If you use LoMOE or find this work useful in your research, please cite it using the following BibTeX entry:
@InProceedings{Chakrabarty_2024_ACMMM,
author = {Chakrabarty$^*$, Goirik and Chandrasekar$^*$, Aditya and Hebbalaguppe, Ramya and Prathosh, AP},
title = {LoMOE: Localized Multi-Object Editing via Multi-Diffusion},
booktitle = {ACM Multimedia 2024},
month = {October},
year = {2024}
}