Commit 7632a41 ("feedback"), 1 parent: 05a2a54
File tree: 3 files changed (+28 −32 lines)


docs/source/en/using-diffusers/ip_adapter.md

Lines changed: 9 additions & 14 deletions
````diff
@@ -12,16 +12,18 @@ specific language governing permissions and limitations under the License.
 
 # IP-Adapter
 
-[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like ControlNet. The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features.
+[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like [ControlNet](../using-diffusers/controlnet). The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features.
 
 > [!TIP]
-> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide.
+> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide, and make sure you check out the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section which requires manually loading the image encoder.
 
 This guide will walk you through using IP-Adapter for various tasks and use cases.
 
 ## General tasks
 
-Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. Feel free to use another pipeline such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff!
+Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. We also encourage you to try out other pipelines such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff!
+
+In all the following examples, you'll see the [`~IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
 
 <hfoptions id="tasks">
 <hfoption id="Text-to-image">
````
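The scale semantics in the paragraph this commit consolidates can be illustrated with a tiny standalone sketch (hypothetical code, not the diffusers implementation): the IP-Adapter scale weights the image-feature contribution that is added on top of the text conditioning.

```python
# Hypothetical illustration of how an IP-Adapter-style scale weights the
# image contribution relative to text conditioning -- not diffusers code.
def blend_conditioning(text_feat, image_feat, scale):
    """Return text + scale * image element-wise, mimicking how the
    decoupled cross-attention output is added with a scale factor."""
    return [t + scale * i for t, i in zip(text_feat, image_feat)]

text = [1, 2]
image = [10, 10]

print(blend_conditioning(text, image, 0))    # image ignored -> [1, 2]
print(blend_conditioning(text, image, 1))    # full image weight -> [11, 12]
print(blend_conditioning(text, image, 0.5))  # balanced blend -> [6.0, 7.0]
```

At `scale=0` the image prompt has no effect; raising the scale shifts the conditioning toward the image features, matching the behavior described for `set_ip_adapter_scale`.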
````diff
@@ -30,8 +32,6 @@ Crafting the precise text prompt to generate the image you want can be difficult
 
 Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
 
-The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
-
 ```py
 from diffusers import AutoPipelineForText2Image
 from diffusers.utils import load_image
````
````diff
@@ -75,8 +75,6 @@ IP-Adapter can also help with image-to-image by guiding the model to generate an
 
 Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
 
-The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
-
 ```py
 from diffusers import AutoPipelineForImage2Image
 from diffusers.utils import load_image
````
````diff
@@ -126,8 +124,6 @@ IP-Adapter is also useful for inpainting because the image prompt allows you to
 
 Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
 
-The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
-
 ```py
 from diffusers import AutoPipelineForInpainting
 from diffusers.utils import load_image
````
````diff
@@ -177,8 +173,6 @@ images[0]
 
 IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with its motion adapter, and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method.
 
-The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
-
 > [!WARNING]
 > If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline.
 
````
````diff
@@ -289,7 +283,8 @@ image
 
 More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter FaceID to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style. Let's try this out!
 
-IP-Adapter uses an image encoder to generate the image features. The image encoder is automatically loaded if the image encoder model weights exist in an `image_encoder` subfolder in your repository. You don't need to explicitly load the image encoder if this subfolder exists. Otherwise, you'll need to load the image encoder weights into [`~transformers.CLIPVisionModelWithProjection`] and pass that to your pipeline. In this example, you'll manually load the image encoder just to see how it's done.
+> [!TIP]
+> Read the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section to learn why you need to manually load the image encoder.
 
 ```py
 import torch
````
````diff
@@ -306,8 +301,8 @@ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
 
 Next, you'll load a base model, scheduler, and IP-Adapters. The IP-Adapters to use are passed as a list to the `weight_name` parameter:
 
-* ip-adapter-plus_sdxl_vit-h uses patch embeddings and a ViT-H image encoder
-* ip-adapter-plus-face_sdxl_vit-h has the same architecture but it is conditioned with images of cropped faces
+* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder
+* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces
 
 ```py
 pipeline = AutoPipelineForText2Image.from_pretrained(
````
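The "passed as a list" convention described in the hunk above can be sketched with a small hypothetical helper (not the diffusers implementation): a loader that accepts either a single checkpoint name or several.

```python
# Hypothetical sketch of accepting one weight name or a list of them,
# mirroring the convention described above -- not diffusers code.
def normalize_weight_names(weight_name):
    """Wrap a single checkpoint name into a list; leave lists untouched."""
    return weight_name if isinstance(weight_name, list) else [weight_name]

# A single adapter and a FaceID + Plus combination both normalize to lists.
print(normalize_weight_names("ip-adapter-plus_sdxl_vit-h.safetensors"))
print(normalize_weight_names([
    "ip-adapter-plus_sdxl_vit-h.safetensors",
    "ip-adapter-plus-face_sdxl_vit-h.safetensors",
]))
```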

docs/source/en/using-diffusers/loading_adapters.md

Lines changed: 19 additions & 16 deletions
````diff
@@ -334,22 +334,6 @@ Then load the IP-Adapter weights and add it to the pipeline with the [`~loaders.
 pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
 ```
 
-IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains a `image_encoder` subfolder, the image encoder is automatically loaded and registed to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline.
-
-```py
-from diffusers import AutoPipelineForText2Image
-from transformers import CLIPVisionModelWithProjection
-import torch
-
-image_encoder = CLIPVisionModelWithProjection.from_pretrained(
-    "h94/IP-Adapter",
-    subfolder="models/image_encoder",
-    torch_dtype=torch.float16,
-).to("cuda")
-
-pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, torch_dtype=torch.float16).to("cuda")
-```
-
 Once loaded, you can use the pipeline with an image and text prompt to guide the image generation process.
 
 ```py
````
````diff
@@ -368,3 +352,22 @@ images
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip-bear.png" />
 </div>
+
+### IP-Adapter Plus
+
+IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains an `image_encoder` subfolder, the image encoder is automatically loaded and registered to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline. This is the case for *IP-Adapter Plus* checkpoints which use the ViT-H image encoder.
+
+```py
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+    "h94/IP-Adapter",
+    subfolder="models/image_encoder",
+    torch_dtype=torch.float16)
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    image_encoder=image_encoder,
+    torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
+```
````
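The loading rule the new section documents can be sketched as a small decision helper (hypothetical code and an illustrative repository layout, not diffusers internals or the real h94/IP-Adapter folder structure): auto-load when an `image_encoder` folder sits next to the weights, otherwise the caller must pass an encoder explicitly.

```python
# Hypothetical sketch of the image-encoder loading rule described above.
# The folder names below are illustrative, not the real repo layout.
def needs_manual_image_encoder(repo_subfolders, weight_subfolder):
    """Return True when the checkpoint's subfolder has no sibling
    'image_encoder' folder, i.e. the caller must pass an encoder."""
    return f"{weight_subfolder}/image_encoder" not in repo_subfolders

repo = ["models/image_encoder", "models", "sdxl_models"]

print(needs_manual_image_encoder(repo, "models"))       # False: auto-loaded
print(needs_manual_image_encoder(repo, "sdxl_models"))  # True: load manually
```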

src/diffusers/loaders/ip_adapter.py

Lines changed: 0 additions & 2 deletions
```diff
@@ -184,8 +184,6 @@ def set_ip_adapter_scale(self, scale):
         """
         Sets the conditioning scale between text and image.
         """
-        if not isinstance(scale, list):
-            scale = [scale]
         unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet
         for attn_processor in unet.attn_processors.values():
             if isinstance(attn_processor, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0)):
```
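The behavioral change in this hunk can be summarized with a standalone before/after sketch (hypothetical code, not the diffusers source): before the commit a bare scalar scale was wrapped into a one-element list, after it the value is passed through to the attention processors unchanged.

```python
# Hypothetical before/after sketch of the removed normalization -- not the
# actual diffusers method.
def normalize_scale_old(scale):
    """Pre-commit behavior: wrap a bare scalar into a list."""
    if not isinstance(scale, list):
        scale = [scale]
    return scale

def normalize_scale_new(scale):
    """Post-commit behavior: pass the value through unchanged."""
    return scale

print(normalize_scale_old(0.5))   # [0.5]
print(normalize_scale_new(0.5))   # 0.5
```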
