
Commit 40aa47b

kashif, dome272, pabloppp, sayakpaul, and patrickvonplaten authored
[Pipeline] Wuerstchen v3 aka Stable Cascade pipeline (#6487)
* initial diffNext v3
* move to v3 folder
* imports
* dry up the unets
* no switch_level
* fix init
* add switch_level tp config
* Fixed some things
* Added pooled text embeddings
* Initial work on adding image encoder
* changes from @dome272
* Stuff for the image encoder processing and variable naming in decoder
* fix arg name
* inference fixes
* inference fixes
* default TimestepBlock without conds
* c_skip=0 by default
* fix bfloat16 to cpu
* use config
* undo temp change
* fix gen_c_embeddings args
* change text encoding
* text encoding
* undo print
* undo .gitignore change
* Allow WuerstchenV3PriorPipeline to use the base DDPM & DDIM schedulers
* use WuerstchenV3Unet in both pipelines
* fix imports
* initial failing tests
* cleanup
* use scheduler.timesterps
* some fixes to the tests, still not fully working
* fix tests
* fix prior tests
* add dropout to the model_kwargs
* more tests passing
* update expected_slice
* initial rename
* rename tests
* rename class names
* make fix-copies
* initial docs
* autodocs
* typos
* fix arg docs
* add text_encoder info
* combined pipeline has optional image arg
* fix documentation
* Update src/diffusers/pipelines/stable_cascade/modeling_stable_cascade_common.py Co-authored-by: Patrick von Platen <[email protected]>
* Update src/diffusers/pipelines/stable_cascade/modeling_stable_cascade_common.py Co-authored-by: Patrick von Platen <[email protected]>
* Update src/diffusers/pipelines/stable_cascade/modeling_stable_cascade_common.py Co-authored-by: YiYi Xu <[email protected]>
* Update src/diffusers/pipelines/stable_cascade/modeling_stable_cascade_common.py Co-authored-by: Patrick von Platen <[email protected]>
* Update src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade.py Co-authored-by: YiYi Xu <[email protected]>
* Update src/diffusers/pipelines/stable_cascade/modeling_stable_cascade_common.py Co-authored-by: YiYi Xu <[email protected]>
* use self.config
* Update src/diffusers/pipelines/stable_cascade/modeling_stable_cascade_common.py Co-authored-by: YiYi Xu <[email protected]>
* c_in -> in_channels
* removed kwargs from unet's forward
* Update src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade.py Co-authored-by: YiYi Xu <[email protected]>
* Update src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade.py Co-authored-by: Patrick von Platen <[email protected]>
* remove older callback api
* removed kwargs and fixed decoder guidance > 1
* decoder takes emeds
* check and use image_embeds
* fixed all but one decoder test
* fix decoder tests
* update callback api
* fix some more combined tests
* push combined pipeline
* initial docs
* fix doc_string
* update combined api
* no test_callback_inputs test for combined pipeline
* add optional components
* fix ordering of components
* fix combined tests
* update convert script
* Update src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_prior.py Co-authored-by: YiYi Xu <[email protected]>
* Update src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_prior.py Co-authored-by: YiYi Xu <[email protected]>
* Update src/diffusers/pipelines/stable_cascade/pipeline_stable_cascade_prior.py Co-authored-by: YiYi Xu <[email protected]>
* fix imports
* move effnet out of deniosing loop
* prompt_embeds_pooled only when doing guidance
* Fix repeat shape
* move StableCascadeUnet to models/unets/
* more descriptive names
* converted when numpy()
* StableCascadePriorPipelineOutput docs
* rename StableCascadeUNet
* add slow tests
* fix slow tests
* update
* update
* updated model_path
* add args for weights
* set push_to_hub to false
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update

---------

Co-authored-by: Dominic Rampas <[email protected]>
Co-authored-by: Pablo Pernias <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: YiYi Xu <[email protected]>
Co-authored-by: 99991 <[email protected]>
Co-authored-by: Dhruv Nair <[email protected]>
1 parent 1bc0d37 commit 40aa47b

22 files changed: +3,214 -25 lines

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -318,6 +318,8 @@
       title: Semantic Guidance
     - local: api/pipelines/shap_e
       title: Shap-E
+    - local: api/pipelines/stable_cascade
+      title: Stable Cascade
     - sections:
       - local: api/pipelines/stable_diffusion/overview
         title: Overview
```
docs/source/en/api/pipelines/stable_cascade.md

Lines changed: 88 additions & 0 deletions (new file)
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Stable Cascade

This model is built on the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture, and its main
difference from other models like Stable Diffusion is that it works on a much smaller latent space. Why is this
important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
How small is the latent space? Stable Diffusion uses a compression factor of 8, so a 1024x1024 image is encoded
to 128x128. Stable Cascade achieves a compression factor of 42, meaning it is possible to encode a 1024x1024 image
to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this highly
compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable
Diffusion 1.5.

Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known
extensions such as finetuning, LoRA, ControlNet, IP-Adapter, LCM, etc. are possible with this method as well.

The original codebase can be found at [Stability-AI/StableCascade](https://github.com/Stability-AI/StableCascade).

## Model Overview

Stable Cascade consists of three models: Stage A, Stage B, and Stage C, representing a cascade for generating images,
hence the name "Stable Cascade".

Stages A and B are used to compress images, similar to the job of the VAE in Stable Diffusion.
However, this setup achieves a much higher compression of images. While the Stable Diffusion models use a
spatial compression factor of 8, encoding an image with a resolution of 1024 x 1024 to 128 x 128, Stable Cascade
achieves a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24 while still being able to accurately
decode the image, which brings the great benefit of cheaper training and inference. Stage C is then responsible for
generating the small 24 x 24 latents given a text prompt.

## Uses

### Direct Use

The model is intended for research purposes for now. Possible research areas and tasks include

- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.

Excluded uses are described below.

### Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate
such content is out of scope for its abilities.
The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).

## Limitations and Bias

### Limitations

- Faces and people in general may not be generated properly.
- The autoencoding part of the model is lossy.

## StableCascadeCombinedPipeline

[[autodoc]] StableCascadeCombinedPipeline
- all
- __call__

## StableCascadePriorPipeline

[[autodoc]] StableCascadePriorPipeline
- all
- __call__

## StableCascadePriorPipelineOutput

[[autodoc]] pipelines.stable_cascade.pipeline_stable_cascade_prior.StableCascadePriorPipelineOutput

## StableCascadeDecoderPipeline

[[autodoc]] StableCascadeDecoderPipeline
- all
- __call__
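The prior and decoder pipelines documented above are meant to be chained: Stage C turns the prompt into compact image embeddings, and Stages B + A decode those embeddings back into pixels. The sketch below illustrates that flow under a few assumptions: the checkpoint ids follow the default save locations of the conversion script in this commit (`diffusers/StableCascade-prior` and `diffusers/StableCascade-decoder`), and the `image_embeddings` attribute and call arguments are assumed from the pipelines added here, so treat it as illustrative rather than canonical usage.

```python
import torch

from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "an image of a shiba inu wearing a spacesuit"

# Stage C: encode the prompt into small 24x24 image embeddings
# (repo id assumed from the conversion script's default save location)
prior = StableCascadePriorPipeline.from_pretrained(
    "diffusers/StableCascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
prior_output = prior(prompt=prompt)

# Stages B + A: decode the embeddings into a full-resolution image
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "diffusers/StableCascade-decoder", torch_dtype=torch.bfloat16
).to("cuda")
image = decoder(
    image_embeddings=prior_output.image_embeddings,  # attribute name assumed from StableCascadePriorPipelineOutput
    prompt=prompt,
).images[0]
image.save("stable_cascade.png")
```

`StableCascadeCombinedPipeline` wraps both stages behind a single call for convenience and accepts an optional image for the image-conditioned path.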

scripts/convert_stable_cascade.py

Lines changed: 215 additions & 0 deletions
```python
# Run this script to convert the Stable Cascade model weights to a diffusers pipeline.
import argparse

import accelerate
import torch
from safetensors.torch import load_file
from transformers import (
    AutoTokenizer,
    CLIPConfig,
    CLIPImageProcessor,
    CLIPTextModelWithProjection,
    CLIPVisionModelWithProjection,
)

from diffusers import (
    DDPMWuerstchenScheduler,
    StableCascadeCombinedPipeline,
    StableCascadeDecoderPipeline,
    StableCascadePriorPipeline,
)
from diffusers.models import StableCascadeUNet
from diffusers.models.modeling_utils import load_model_dict_into_meta
from diffusers.pipelines.wuerstchen import PaellaVQModel


parser = argparse.ArgumentParser(description="Convert Stable Cascade model weights to a diffusers pipeline")
parser.add_argument("--model_path", type=str, default="../StableCascade", help="Location of Stable Cascade weights")
parser.add_argument("--stage_c_name", type=str, default="stage_c.safetensors", help="Name of stage c checkpoint file")
parser.add_argument("--stage_b_name", type=str, default="stage_b.safetensors", help="Name of stage b checkpoint file")
parser.add_argument("--use_safetensors", action="store_true", help="Use SafeTensors for conversion")
parser.add_argument("--save_org", type=str, default="diffusers", help="Hub organization to save the pipelines to")
parser.add_argument("--push_to_hub", action="store_true", help="Push to hub")

args = parser.parse_args()
model_path = args.model_path

device = "cpu"

# set paths to model weights
prior_checkpoint_path = f"{model_path}/{args.stage_c_name}"
decoder_checkpoint_path = f"{model_path}/{args.stage_b_name}"

# Clip Text encoder and tokenizer
config = CLIPConfig.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
config.text_config.projection_dim = config.projection_dim
text_encoder = CLIPTextModelWithProjection.from_pretrained(
    "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k", config=config.text_config
)
tokenizer = AutoTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")

# image processor
feature_extractor = CLIPImageProcessor()
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

# Prior
if args.use_safetensors:
    orig_state_dict = load_file(prior_checkpoint_path, device=device)
else:
    orig_state_dict = torch.load(prior_checkpoint_path, map_location=device)

state_dict = {}
for key in orig_state_dict.keys():
    if key.endswith("in_proj_weight"):
        weights = orig_state_dict[key].chunk(3, 0)
        state_dict[key.replace("attn.in_proj_weight", "to_q.weight")] = weights[0]
        state_dict[key.replace("attn.in_proj_weight", "to_k.weight")] = weights[1]
        state_dict[key.replace("attn.in_proj_weight", "to_v.weight")] = weights[2]
    elif key.endswith("in_proj_bias"):
        weights = orig_state_dict[key].chunk(3, 0)
        state_dict[key.replace("attn.in_proj_bias", "to_q.bias")] = weights[0]
        state_dict[key.replace("attn.in_proj_bias", "to_k.bias")] = weights[1]
        state_dict[key.replace("attn.in_proj_bias", "to_v.bias")] = weights[2]
    elif key.endswith("out_proj.weight"):
        weights = orig_state_dict[key]
        state_dict[key.replace("attn.out_proj.weight", "to_out.0.weight")] = weights
    elif key.endswith("out_proj.bias"):
        weights = orig_state_dict[key]
        state_dict[key.replace("attn.out_proj.bias", "to_out.0.bias")] = weights
    else:
        state_dict[key] = orig_state_dict[key]


with accelerate.init_empty_weights():
    prior_model = StableCascadeUNet(
        in_channels=16,
        out_channels=16,
        timestep_ratio_embedding_dim=64,
        patch_size=1,
        conditioning_dim=2048,
        block_out_channels=[2048, 2048],
        num_attention_heads=[32, 32],
        down_num_layers_per_block=[8, 24],
        up_num_layers_per_block=[24, 8],
        down_blocks_repeat_mappers=[1, 1],
        up_blocks_repeat_mappers=[1, 1],
        block_types_per_layer=[
            ["SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"],
            ["SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"],
        ],
        clip_text_in_channels=1280,
        clip_text_pooled_in_channels=1280,
        clip_image_in_channels=768,
        clip_seq=4,
        kernel_size=3,
        dropout=[0.1, 0.1],
        self_attn=True,
        timestep_conditioning_type=["sca", "crp"],
        switch_level=[False],
    )
load_model_dict_into_meta(prior_model, state_dict)

# scheduler for prior and decoder
scheduler = DDPMWuerstchenScheduler()

# Prior pipeline
prior_pipeline = StableCascadePriorPipeline(
    prior=prior_model,
    tokenizer=tokenizer,
    text_encoder=text_encoder,
    image_encoder=image_encoder,
    scheduler=scheduler,
    feature_extractor=feature_extractor,
)
prior_pipeline.save_pretrained(f"{args.save_org}/StableCascade-prior", push_to_hub=args.push_to_hub)

# Decoder
if args.use_safetensors:
    orig_state_dict = load_file(decoder_checkpoint_path, device=device)
else:
    orig_state_dict = torch.load(decoder_checkpoint_path, map_location=device)

state_dict = {}
for key in orig_state_dict.keys():
    if key.endswith("in_proj_weight"):
        weights = orig_state_dict[key].chunk(3, 0)
        state_dict[key.replace("attn.in_proj_weight", "to_q.weight")] = weights[0]
        state_dict[key.replace("attn.in_proj_weight", "to_k.weight")] = weights[1]
        state_dict[key.replace("attn.in_proj_weight", "to_v.weight")] = weights[2]
    elif key.endswith("in_proj_bias"):
        weights = orig_state_dict[key].chunk(3, 0)
        state_dict[key.replace("attn.in_proj_bias", "to_q.bias")] = weights[0]
        state_dict[key.replace("attn.in_proj_bias", "to_k.bias")] = weights[1]
        state_dict[key.replace("attn.in_proj_bias", "to_v.bias")] = weights[2]
    elif key.endswith("out_proj.weight"):
        weights = orig_state_dict[key]
        state_dict[key.replace("attn.out_proj.weight", "to_out.0.weight")] = weights
    elif key.endswith("out_proj.bias"):
        weights = orig_state_dict[key]
        state_dict[key.replace("attn.out_proj.bias", "to_out.0.bias")] = weights
    # rename clip_mapper to clip_txt_pooled_mapper
    elif key.endswith("clip_mapper.weight"):
        weights = orig_state_dict[key]
        state_dict[key.replace("clip_mapper.weight", "clip_txt_pooled_mapper.weight")] = weights
    elif key.endswith("clip_mapper.bias"):
        weights = orig_state_dict[key]
        state_dict[key.replace("clip_mapper.bias", "clip_txt_pooled_mapper.bias")] = weights
    else:
        state_dict[key] = orig_state_dict[key]

with accelerate.init_empty_weights():
    decoder = StableCascadeUNet(
        in_channels=4,
        out_channels=4,
        timestep_ratio_embedding_dim=64,
        patch_size=2,
        conditioning_dim=1280,
        block_out_channels=[320, 640, 1280, 1280],
        down_num_layers_per_block=[2, 6, 28, 6],
        up_num_layers_per_block=[6, 28, 6, 2],
        down_blocks_repeat_mappers=[1, 1, 1, 1],
        up_blocks_repeat_mappers=[3, 3, 2, 2],
        num_attention_heads=[0, 0, 20, 20],
        block_types_per_layer=[
            ["SDCascadeResBlock", "SDCascadeTimestepBlock"],
            ["SDCascadeResBlock", "SDCascadeTimestepBlock"],
            ["SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"],
            ["SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"],
        ],
        clip_text_pooled_in_channels=1280,
        clip_seq=4,
        effnet_in_channels=16,
        pixel_mapper_in_channels=3,
        kernel_size=3,
        dropout=[0, 0, 0.1, 0.1],
        self_attn=True,
        timestep_conditioning_type=["sca"],
    )
load_model_dict_into_meta(decoder, state_dict)

# VQGAN from Wuerstchen-V2
vqmodel = PaellaVQModel.from_pretrained("warp-ai/wuerstchen", subfolder="vqgan")

# Decoder pipeline
decoder_pipeline = StableCascadeDecoderPipeline(
    decoder=decoder, text_encoder=text_encoder, tokenizer=tokenizer, vqgan=vqmodel, scheduler=scheduler
)
decoder_pipeline.save_pretrained(f"{args.save_org}/StableCascade-decoder", push_to_hub=args.push_to_hub)

# Stable Cascade combined pipeline
stable_cascade_pipeline = StableCascadeCombinedPipeline(
    # Decoder
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    decoder=decoder,
    scheduler=scheduler,
    vqgan=vqmodel,
    # Prior
    prior_text_encoder=text_encoder,
    prior_tokenizer=tokenizer,
    prior_prior=prior_model,
    prior_scheduler=scheduler,
    prior_image_encoder=image_encoder,
    prior_feature_extractor=feature_extractor,
)
stable_cascade_pipeline.save_pretrained(f"{args.save_org}/StableCascade", push_to_hub=args.push_to_hub)
```
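The heart of both conversion loops above is the attention remapping: the original checkpoints store each attention layer's query, key, and value projections fused into a single `attn.in_proj_weight`/`attn.in_proj_bias` tensor, while the diffusers `StableCascadeUNet` expects separate `to_q`/`to_k`/`to_v` parameters. Below is a minimal, self-contained sketch of that split; the layer name and hidden size are made up purely for illustration.

```python
import torch

dim = 8  # hypothetical hidden size, just for illustration
orig = {"blocks.0.attn.in_proj_weight": torch.randn(3 * dim, dim)}  # fused QKV projection

converted = {}
for key, value in orig.items():
    if key.endswith("in_proj_weight"):
        # split the stacked (3*dim, dim) matrix into three (dim, dim) blocks: query, key, value
        q, k, v = value.chunk(3, 0)
        converted[key.replace("attn.in_proj_weight", "to_q.weight")] = q
        converted[key.replace("attn.in_proj_weight", "to_k.weight")] = k
        converted[key.replace("attn.in_proj_weight", "to_v.weight")] = v

print(sorted(converted))  # ['blocks.0.to_k.weight', 'blocks.0.to_q.weight', 'blocks.0.to_v.weight']
```

The output projection (`attn.out_proj.*`) and, for the decoder, the `clip_mapper.*` keys are only renamed in the same pass, not reshaped.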

src/diffusers/__init__.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -86,6 +86,7 @@
         "MotionAdapter",
         "MultiAdapter",
         "PriorTransformer",
+        "StableCascadeUNet",
         "T2IAdapter",
         "T5FilmDecoder",
         "Transformer2DModel",
@@ -259,6 +260,9 @@
         "SemanticStableDiffusionPipeline",
         "ShapEImg2ImgPipeline",
         "ShapEPipeline",
+        "StableCascadeCombinedPipeline",
+        "StableCascadeDecoderPipeline",
+        "StableCascadePriorPipeline",
         "StableDiffusionAdapterPipeline",
         "StableDiffusionAttendAndExcitePipeline",
         "StableDiffusionControlNetImg2ImgPipeline",
@@ -626,6 +630,9 @@
         SemanticStableDiffusionPipeline,
         ShapEImg2ImgPipeline,
         ShapEPipeline,
+        StableCascadeCombinedPipeline,
+        StableCascadeDecoderPipeline,
+        StableCascadePriorPipeline,
         StableDiffusionAdapterPipeline,
         StableDiffusionAttendAndExcitePipeline,
         StableDiffusionControlNetImg2ImgPipeline,
```

src/diffusers/models/__init__.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -47,6 +47,7 @@
     _import_structure["unets.unet_kandinsky3"] = ["Kandinsky3UNet"]
     _import_structure["unets.unet_motion_model"] = ["MotionAdapter", "UNetMotionModel"]
     _import_structure["unets.unet_spatio_temporal_condition"] = ["UNetSpatioTemporalConditionModel"]
+    _import_structure["unets.unet_stable_cascade"] = ["StableCascadeUNet"]
     _import_structure["unets.uvit_2d"] = ["UVit2DModel"]
     _import_structure["vq_model"] = ["VQModel"]

@@ -80,6 +81,7 @@
         I2VGenXLUNet,
         Kandinsky3UNet,
         MotionAdapter,
+        StableCascadeUNet,
         UNet1DModel,
         UNet2DConditionModel,
         UNet2DModel,
```

src/diffusers/models/unets/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -10,6 +10,7 @@
     from .unet_kandinsky3 import Kandinsky3UNet
     from .unet_motion_model import MotionAdapter, UNetMotionModel
     from .unet_spatio_temporal_condition import UNetSpatioTemporalConditionModel
+    from .unet_stable_cascade import StableCascadeUNet
     from .uvit_2d import UVit2DModel
```
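With these registrations in place, the new classes resolve from the package root as well as from the `models` subpackage. A quick import sanity check, assuming an environment with this commit installed:

```python
from diffusers import (
    StableCascadeCombinedPipeline,
    StableCascadeDecoderPipeline,
    StableCascadePriorPipeline,
    StableCascadeUNet,
)
from diffusers.models import StableCascadeUNet as StableCascadeUNetFromModels

# Both lazy import paths should resolve to the same class object.
assert StableCascadeUNet is StableCascadeUNetFromModels
print(StableCascadePriorPipeline.__name__, StableCascadeDecoderPipeline.__name__)
```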
