
v3-sd.cpp-model-loading


An analysis of the model-loading and inference architecture in stable-diffusion.cpp: how models are loaded and managed, and how the ggml computational graphs are constructed and executed.

Current Model Loading Architecture

Context Initialization

The stable diffusion context (sd_ctx_t) is created with new_sd_ctx(), which takes an sd_ctx_params_t holding paths to the various model files (a usage sketch follows the list):

  • Main model file (model_path)
  • Diffusion model (diffusion_model_path)
  • VAE (vae_path)
  • ControlNet (control_net_path)
  • Text encoders (CLIP-L, CLIP-G, T5XXL, Qwen2VL)
  • LoRA model directory (lora_model_dir)
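
For illustration, a minimal creation sketch. It assumes the struct-based API described above, plus an sd_ctx_params_init() default-initializer from the same header; the file paths are placeholders:

    sd_ctx_params_t params;
    sd_ctx_params_init(&params);                    // fill with library defaults
    params.model_path     = "sd_v1-5.safetensors";  // placeholder paths
    params.vae_path       = "vae.safetensors";
    params.lora_model_dir = "loras/";
    sd_ctx_t* sd_ctx = new_sd_ctx(&params);         // NULL on load failure

Everything referenced by these paths is loaded once, here; free_sd_ctx() is the only way to release it.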

Model Loading Process

During StableDiffusionGGML::init() (a sketch of the pattern follows the list):

  1. Single ModelLoader Instance: A single ModelLoader object loads tensors from all specified model files into a shared tensor_storage_map
  2. Model Object Creation: Individual model objects are created as shared pointers that reference tensors from this common storage:
    • diffusion_model (UNetModel, MMDiTModel, FluxModel, etc.)
    • first_stage_model (VAE implementations)
    • cond_stage_model (text encoders)
    • control_net (ControlNet model)
  3. Tensor References: All model objects store references to specific tensors by name from the tensor_storage_map
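
A hedged sketch of the pattern just described; ModelLoader::init_from_file() is the real loading entry point, while the paths and prefixes here are illustrative:

    // One loader accumulates tensors from every file into the shared map.
    ModelLoader model_loader;
    model_loader.init_from_file(model_path);           // main checkpoint
    model_loader.init_from_file(vae_path, "vae.");     // extra file merged under a prefix
    // Model objects built afterwards (diffusion_model, first_stage_model, ...)
    // resolve their weights by name from tensor_storage_map instead of owning
    // private copies, which is why they cannot be swapped independently.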

Model Types and Management

Stable Diffusion Models

  • Architecture Variants: UNet (SD 1.x, 2.x, XL), DiT (SD3, Flux, Wan), etc.
  • Loading: Fixed at context creation, no runtime switching
  • Storage: Tensors loaded into shared tensor_storage_map

VAE Models

  • Purpose: Image encoding/decoding to/from latent space
  • Loading: Fixed at context creation
  • Variants: AutoEncoderKL, WAN VAE, Tiny AutoEncoder (TAE)

ControlNet Models

  • Purpose: Conditional image generation
  • Loading: Fixed at context creation
  • Backend: Can use separate GPU backend for performance

LoRA Models

  • Purpose: Lightweight model adaptation
  • Loading: Dynamic; loaded from lora_model_dir either explicitly or via prompt tags (example below)
  • Application: Can be applied immediately on load or on demand at generation time
  • Management: Stored in vectors (cond_stage_lora_models, diffusion_lora_models, first_stage_lora_models)
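
A usage sketch of the prompt-based path, using the <lora:name:multiplier> prompt-tag convention stable-diffusion.cpp supports (the file name is a placeholder):

    // "my_style" resolves to a LoRA file named my_style inside
    // lora_model_dir; 0.8 is the strength multiplier.
    const char* prompt = "a portrait photo <lora:my_style:0.8>";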

Text Encoders (Conditioner)

  • Components: CLIP-L, CLIP-G, T5XXL, Qwen2VL
  • Loading: Fixed at context creation
  • Architecture: FrozenCLIPEmbedderWithCustomWords, SD3CLIPEmbedder, etc.

Embeddings

  • Purpose: Custom token embeddings
  • Loading: Loaded during conditioner initialization
  • Storage: Integrated into text encoder models

GGML Graph Construction and Execution

Graph Generation

The ggml computational graph is generated FRESH for each inference request.

Key observations:

  1. Per-Request Context: Each generate_image() call creates a new ggml_context:

    struct ggml_init_params params;
    params.mem_size   = static_cast<size_t>(1024 * 1024) * 1024;  // 1 GiB
    params.mem_buffer = NULL;   // let ggml allocate its own buffer
    params.no_alloc   = false;  // tensor data lives inside this context
    struct ggml_context* work_ctx = ggml_init(params);
  2. Dynamic Construction: The graph is built dynamically during inference by:

    • Calling model compute() methods
    • Constructing ggml operations for diffusion steps
    • Building conditioning pipelines
  3. No Graph Reuse: Each request gets its own context and graph; nothing is cached between requests (minimal example below)
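
The create-build-compute-free cycle in miniature, using plain ggml calls rather than sd.cpp's model code (ggml_graph_compute_with_ctx is the CPU compute helper; depending on the ggml version it is declared in ggml.h or ggml-cpu.h):

    struct ggml_init_params params;
    params.mem_size   = 16 * 1024 * 1024;  // small scratch arena for this example
    params.mem_buffer = NULL;
    params.no_alloc   = false;
    struct ggml_context* ctx = ggml_init(params);

    // Build a throwaway graph computing c = a + b.
    struct ggml_tensor* a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor* b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    for (int i = 0; i < 4; i++) {
        ((float*)a->data)[i] = 1.0f;  // with no_alloc == false, tensor data
        ((float*)b->data)[i] = 2.0f;  // lives inside the context's buffer
    }
    struct ggml_tensor* c = ggml_add(ctx, a, b);

    struct ggml_cgraph* graph = ggml_new_graph(ctx);
    ggml_build_forward_expand(graph, c);
    ggml_graph_compute_with_ctx(ctx, graph, /*n_threads=*/1);  // c now holds 3.0f

    ggml_free(ctx);  // context, tensors, and graph are all discarded together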

Inference Flow

  1. Conditioning: Text/image conditioning computed using conditioner models
  2. Diffusion Loop: For each denoising step (an Euler-step sketch follows this list):
    • Model forward passes construct ggml operations
    • Graph executed with ggml_graph_compute()
    • Results fed to next step
  3. VAE Decoding: Final latent decoded to image using VAE model
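
To make the loop concrete, a self-contained Euler update in the k-diffusion formulation (illustrative only; sd.cpp's samplers operate on ggml tensors, not std::vector):

    #include <cstddef>
    #include <vector>

    // One Euler step: x <- x + (sigma_next - sigma) * d, where
    // d = (x - denoised) / sigma is the derivative estimate at sigma.
    void euler_step(std::vector<float>& x, const std::vector<float>& denoised,
                    float sigma, float sigma_next) {
        for (std::size_t j = 0; j < x.size(); j++) {
            float d = (x[j] - denoised[j]) / sigma;
            x[j] += (sigma_next - sigma) * d;
        }
    }

The model's forward pass produces denoised from x at each sigma; the updated x is what "results fed to next step" refers to.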

Model Switching Limitations

Current Constraints

Changing any model requires reloading the entire context because:

  1. Fixed Model Objects: Model instances are created once during init and reference specific tensors
  2. Shared Tensor Storage: All tensors loaded into single tensor_storage_map
  3. No Selective Reloading: No API to replace individual model objects or update tensor references
  4. Architecture Assumptions: Graph building code assumes specific model architectures

Specific Issues by Model Type

Stable Diffusion Model Switching (SD 1.5 ↔ SD XL)

  • Problem: Different architectures (UNet vs DiT) with incompatible tensor names/shapes
  • Impact: Graph building would fail with wrong tensor references
  • Current Solution: Create a new context with new_sd_ctx() (sketched below)
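
In practice that means a full teardown and rebuild (free_sd_ctx() and new_sd_ctx() are the real API; the path is a placeholder):

    free_sd_ctx(sd_ctx);                     // frees every tensor in the context
    params.model_path = "sdxl.safetensors";  // placeholder path
    sd_ctx = new_sd_ctx(&params);            // reloads ALL models, not just the SD model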

VAE Model Switching

  • Problem: Different VAE implementations (KL, WAN, TAE) with different interfaces
  • Impact: Model object recreation required
  • Workaround: partial support for substituting a Tiny AutoEncoder (TAE) as a lightweight alternative

ControlNet Model Switching

  • Problem: ControlNet models are tightly coupled to base SD architecture
  • Impact: Requires compatible ControlNet for specific SD version

LoRA Model Switching

  • Advantage: Already supports dynamic loading and application
  • Implementation: load_lora_model_from_file() and prompt-based application
  • Limitation: LoRA weights are applied into the same tensor space as the base model

Text Encoder Switching

  • Problem: Different text encoder architectures (CLIP vs T5 vs Qwen)
  • Impact: Conditioner object recreation required

Performance Implications

Current Behavior

  • Full Reload: Changing any model requires loading all models again
  • Memory Management: Old tensors freed only when context destroyed
  • Load Times: Significant delay when switching models

LoRA Exception

  • Selective Loading: LoRA models can be loaded individually
  • Runtime Application: Can be applied without full context reload
  • Caching: LoRA state managed separately

Potential Solutions for Selective Model Loading

Architecture Changes Required

  1. Separate Model Loaders: One ModelLoader per model type instead of a single shared instance
  2. Model Factory Pattern: Ability to recreate model objects with new tensors
  3. Tensor Reference Updates: Methods to update model objects' tensor references
  4. API Extensions: New functions like sd_load_diffusion_model(), sd_load_vae(), etc.

Implementation Approach

// Hypothetical API extensions
SD_API bool sd_load_diffusion_model(sd_ctx_t* sd_ctx, const char* model_path);
SD_API bool sd_load_vae(sd_ctx_t* sd_ctx, const char* vae_path);
SD_API bool sd_load_controlnet(sd_ctx_t* sd_ctx, const char* controlnet_path);
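
What one of these could look like internally under the changes listed above; everything here except ModelLoader is hypothetical (create_vae() and the sd field are illustrative names):

    // Hypothetical sketch, not existing code.
    SD_API bool sd_load_vae(sd_ctx_t* sd_ctx, const char* vae_path) {
        ModelLoader vae_loader;
        if (!vae_loader.init_from_file(vae_path)) {
            return false;
        }
        // Drop the old VAE so its tensors are freed, then rebuild the model
        // object against the freshly loaded tensors.
        sd_ctx->sd->first_stage_model.reset();
        sd_ctx->sd->first_stage_model = create_vae(sd_ctx->sd, vae_loader);
        return sd_ctx->sd->first_stage_model != nullptr;
    }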

Challenges

  1. Memory Management: Coordinating tensor allocation/deallocation
  2. Backend Compatibility: Ensuring new models work with existing backends
  3. Graph Validation: Ensuring new model architectures are compatible
  4. State Consistency: Managing LoRA and other modifications across reloads

Recommendations

Short Term

  • Document current limitations clearly
  • Optimize LoRA loading/caching for better performance
  • Consider Tiny AutoEncoder as VAE alternative where possible

Long Term

  • Implement selective model loading API
  • Add model compatibility validation
  • Support model hot-swapping during inference

Conclusion

The current stable-diffusion.cpp architecture loads all models into a shared context at initialization time, requiring full reloads when any model changes. While LoRA models support dynamic loading, core models (SD, VAE, ControlNet, Text Encoders) are fixed for the context lifetime. GGML graphs are built fresh per inference request, providing flexibility but requiring careful management of model references.

Selective model loading would require significant architectural changes but could dramatically improve user experience by reducing model switching times.