Commit 7632a41 ("feedback"), 1 parent: 05a2a54
File tree: 3 files changed (+28 −32 lines)


docs/source/en/using-diffusers/ip_adapter.md

Lines changed: 9 additions & 14 deletions
````diff
@@ -12,16 +12,18 @@ specific language governing permissions and limitations under the License.
 
 # IP-Adapter
 
-[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like ControlNet. The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features.
+[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like [ControlNet](../using-diffusers/controlnet). The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features.
 
 > [!TIP]
-> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide.
+> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide, and make sure you check out the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section which requires manually loading the image encoder.
 
 This guide will walk you through using IP-Adapter for various tasks and use cases.
 
 ## General tasks
 
-Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. Feel free to use another pipeline such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff!
+Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. We also encourage you to try out other pipelines such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff!
+
+In all the following examples, you'll see the [`~IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
 
 <hfoptions id="tasks">
 <hfoption id="Text-to-image">
````
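The scale semantics in the paragraph this commit consolidates can be illustrated with a tiny standalone sketch (hypothetical code, not the diffusers implementation): the IP-Adapter scale weights the image-feature contribution that is added on top of the text conditioning.

```python
# Hypothetical illustration of how an IP-Adapter-style scale weights the
# image contribution relative to text conditioning -- not diffusers code.
def blend_conditioning(text_feat, image_feat, scale):
    """Return text + scale * image element-wise, mimicking how the
    decoupled cross-attention output is added with a scale factor."""
    return [t + scale * i for t, i in zip(text_feat, image_feat)]

text = [1, 2]
image = [10, 10]

print(blend_conditioning(text, image, 0))    # image ignored -> [1, 2]
print(blend_conditioning(text, image, 1))    # full image weight -> [11, 12]
print(blend_conditioning(text, image, 0.5))  # balanced blend -> [6.0, 7.0]
```

At `scale=0` the image prompt has no effect; raising the scale shifts the conditioning toward the image features, matching the behavior described for `set_ip_adapter_scale`.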
````diff
@@ -30,8 +32,6 @@ Crafting the precise text prompt to generate the image you want can be difficult
 
 Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
 
-The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
-
 ```py
 from diffusers import AutoPipelineForText2Image
 from diffusers.utils import load_image
````
````diff
@@ -75,8 +75,6 @@ IP-Adapter can also help with image-to-image by guiding the model to generate an
 
 Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
 
-The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
-
 ```py
 from diffusers import AutoPipelineForImage2Image
 from diffusers.utils import load_image
````
````diff
@@ -126,8 +124,6 @@ IP-Adapter is also useful for inpainting because the image prompt allows you to
 
 Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the weights for SDXL.
 
-The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
-
 ```py
 from diffusers import AutoPipelineForInpainting
 from diffusers.utils import load_image
````
````diff
@@ -177,8 +173,6 @@ images[0]
 
 IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with its motion adapter, and insert an IP-Adapter into the model with the [`~IPAdapterMixin.load_ip_adapter`] method.
 
-The [`~IPAdapterMixin.set_ip_adapter_scale`] method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
-
 > [!WARNING]
 > If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline.
 
````
````diff
@@ -289,7 +283,8 @@ image
 
 More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter FaceID to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style. Let's try this out!
 
-IP-Adapter uses an image encoder to generate the image features. The image encoder is automatically loaded if the image encoder model weights exist in an `image_encoder` subfolder in your repository. You don't need to explicitly load the image encoder if this subfolder exists. Otherwise, you'll need to load the image encoder weights into [`~transformers.CLIPVisionModelWithProjection`] and pass that to your pipeline. In this example, you'll manually load the image encoder just to see how it's done.
+> [!TIP]
+> Read the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section to learn why you need to manually load the image encoder.
 
 ```py
 import torch
````
````diff
@@ -306,8 +301,8 @@ image_encoder = CLIPVisionModelWithProjection.from_pretrained(
 
 Next, you'll load a base model, scheduler, and IP-Adapters. The IP-Adapters to use are passed as a list to the `weight_name` parameter:
 
-* ip-adapter-plus_sdxl_vit-h uses patch embeddings and a ViT-H image encoder
-* ip-adapter-plus-face_sdxl_vit-h has the same architecture but it is conditioned with images of cropped faces
+* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder
+* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces
 
 ```py
 pipeline = AutoPipelineForText2Image.from_pretrained(
````
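The "passed as a list" convention described in the hunk above can be sketched with a small hypothetical helper (not the diffusers implementation): a loader that accepts either a single checkpoint name or several.

```python
# Hypothetical sketch of accepting one weight name or a list of them,
# mirroring the convention described above -- not diffusers code.
def normalize_weight_names(weight_name):
    """Wrap a single checkpoint name into a list; leave lists untouched."""
    return weight_name if isinstance(weight_name, list) else [weight_name]

# A single adapter and a FaceID + Plus combination both normalize to lists.
print(normalize_weight_names("ip-adapter-plus_sdxl_vit-h.safetensors"))
print(normalize_weight_names([
    "ip-adapter-plus_sdxl_vit-h.safetensors",
    "ip-adapter-plus-face_sdxl_vit-h.safetensors",
]))
```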

docs/source/en/using-diffusers/loading_adapters.md

Lines changed: 19 additions & 16 deletions
````diff
@@ -334,22 +334,6 @@ Then load the IP-Adapter weights and add it to the pipeline with the [`~loaders.
 pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
 ```
 
-IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains a `image_encoder` subfolder, the image encoder is automatically loaded and registed to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline.
-
-```py
-from diffusers import AutoPipelineForText2Image
-from transformers import CLIPVisionModelWithProjection
-import torch
-
-image_encoder = CLIPVisionModelWithProjection.from_pretrained(
-    "h94/IP-Adapter",
-    subfolder="models/image_encoder",
-    torch_dtype=torch.float16,
-).to("cuda")
-
-pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", image_encoder=image_encoder, torch_dtype=torch.float16).to("cuda")
-```
-
 Once loaded, you can use the pipeline with an image and text prompt to guide the image generation process.
 
 ```py
````
````diff
@@ -368,3 +352,22 @@ images
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip-bear.png" />
 </div>
+
+### IP-Adapter Plus
+
+IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains an `image_encoder` subfolder, the image encoder is automatically loaded and registered to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline. This is the case for *IP-Adapter Plus* checkpoints which use the ViT-H image encoder.
+
+```py
+image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+    "h94/IP-Adapter",
+    subfolder="models/image_encoder",
+    torch_dtype=torch.float16)
+
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    image_encoder=image_encoder,
+    torch_dtype=torch.float16
+).to("cuda")
+
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
+```
````
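The loading rule the new section documents can be sketched as a small decision helper (hypothetical code and an illustrative repository layout, not diffusers internals or the real h94/IP-Adapter folder structure): auto-load when an `image_encoder` folder sits next to the weights, otherwise the caller must pass an encoder explicitly.

```python
# Hypothetical sketch of the image-encoder loading rule described above.
# The folder names below are illustrative, not the real repo layout.
def needs_manual_image_encoder(repo_subfolders, weight_subfolder):
    """Return True when the checkpoint's subfolder has no sibling
    'image_encoder' folder, i.e. the caller must pass an encoder."""
    return f"{weight_subfolder}/image_encoder" not in repo_subfolders

repo = ["models/image_encoder", "models", "sdxl_models"]

print(needs_manual_image_encoder(repo, "models"))       # False: auto-loaded
print(needs_manual_image_encoder(repo, "sdxl_models"))  # True: load manually
```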

src/diffusers/loaders/ip_adapter.py

Lines changed: 0 additions & 2 deletions
```diff
@@ -184,8 +184,6 @@ def set_ip_adapter_scale(self, scale):
         """
         Sets the conditioning scale between text and image.
         """
-        if not isinstance(scale, list):
-            scale = [scale]
         unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet
         for attn_processor in unet.attn_processors.values():
             if isinstance(attn_processor, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0)):
```
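The behavioral change in this hunk can be summarized with a standalone before/after sketch (hypothetical code, not the diffusers source): before the commit a bare scalar scale was wrapped into a one-element list, after it the value is passed through to the attention processors unchanged.

```python
# Hypothetical before/after sketch of the removed normalization -- not the
# actual diffusers method.
def normalize_scale_old(scale):
    """Pre-commit behavior: wrap a bare scalar into a list."""
    if not isinstance(scale, list):
        scale = [scale]
    return scale

def normalize_scale_new(scale):
    """Post-commit behavior: pass the value through unchanged."""
    return scale

print(normalize_scale_old(0.5))   # [0.5]
print(normalize_scale_new(0.5))   # 0.5
```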
