# v3‐sd.cpp‐model‐loading
Analysis of the model loading and inference architecture in stable-diffusion.cpp, focusing on how models are loaded and managed, and on how the ggml computational graphs are constructed and executed.
The stable diffusion context (`sd_ctx_t`) is created using `new_sd_ctx()` with an `sd_ctx_params_t` containing paths to the various model files (a usage sketch follows the list):

- Main model file (`model_path`)
- Diffusion model (`diffusion_model_path`)
- VAE (`vae_path`)
- ControlNet (`control_net_path`)
- Text encoders (CLIP-L, CLIP-G, T5XXL, Qwen2VL)
- LoRA model directory (`lora_model_dir`)
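A minimal creation sketch, assuming the `sd_ctx_params_t` field names listed above and a `sd_ctx_params_init()` default-initializer; the exact struct layout and remaining fields vary between versions, and all paths here are hypothetical:

```cpp
#include "stable-diffusion.h"

// Sketch: create a context from model paths. All models are loaded once,
// inside new_sd_ctx(); there is no way to load them individually later.
sd_ctx_t* create_ctx() {
    sd_ctx_params_t params;
    sd_ctx_params_init(&params);  // assumed default-initializer from the public API
    params.model_path     = "models/sd_xl_base_1.0.safetensors";  // hypothetical path
    params.vae_path       = "models/sdxl_vae.safetensors";        // hypothetical path
    params.lora_model_dir = "models/loras";                       // hypothetical path
    return new_sd_ctx(&params);
}
```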
During `StableDiffusionGGML::init()`:
- **Single ModelLoader Instance**: A single `ModelLoader` object loads tensors from all specified model files into a shared `tensor_storage_map`.
- **Model Object Creation**: Individual model objects are created as shared pointers that reference tensors from this common storage:
  - `diffusion_model` (UNetModel, MMDiTModel, FluxModel, etc.)
  - `first_stage_model` (VAE implementations)
  - `cond_stage_model` (text encoders)
  - `control_net` (ControlNet model)
- **Tensor References**: All model objects store references to specific tensors by name from the `tensor_storage_map` (a simplified sketch of this pattern follows).
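This is not the literal implementation, but the ownership pattern can be sketched as follows; the type and member names here are illustrative stand-ins, not the actual stable-diffusion.cpp declarations:

```cpp
#include <map>
#include <memory>
#include <string>

// Illustrative sketch of the shared-storage pattern: one map owns the
// tensor metadata, and every model object resolves its weights by name.
struct TensorStorage { /* file offset, shape, type, ... */ };
using TensorStorageMap = std::map<std::string, TensorStorage>;

struct DiffusionModel {
    // Holds names like "model.diffusion_model.input_blocks.0.0.weight"
    // and resolves them against the shared map at graph-build time.
    const TensorStorageMap* storage;
};

struct SDContext {
    TensorStorageMap tensor_storage_map;              // single shared store
    std::shared_ptr<DiffusionModel> diffusion_model;  // references, not copies
    // first_stage_model, cond_stage_model, control_net follow the same pattern
};
```

The components managed this way are detailed below.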
**Diffusion Model**

- Architecture Variants: UNet (SD 1.x, 2.x, XL), DiT (SD3, Flux, Wan), etc.
- Loading: Fixed at context creation; no runtime switching
- Storage: Tensors are loaded into the shared `tensor_storage_map`
**VAE**

- Purpose: Image encoding/decoding to/from latent space
- Loading: Fixed at context creation
- Variants: AutoEncoderKL, WAN VAE, Tiny AutoEncoder (TAE)
**ControlNet**

- Purpose: Conditional image generation
- Loading: Fixed at context creation
- Backend: Can use a separate GPU backend for performance
**LoRA Models**

- Purpose: Lightweight model adaptation
- Loading: Dynamic loading from `lora_model_dir` or prompt-based
- Application: Can be applied immediately or at runtime
- Management: Stored in vectors (`cond_stage_lora_models`, `diffusion_lora_models`, `first_stage_lora_models`)
**Text Encoders**

- Components: CLIP-L, CLIP-G, T5XXL, Qwen2VL
- Loading: Fixed at context creation
- Architecture: FrozenCLIPEmbedderWithCustomWords, SD3CLIPEmbedder, etc.
**Embeddings**

- Purpose: Custom token embeddings
- Loading: Loaded during conditioner initialization
- Storage: Integrated into the text encoder models
The ggml computational graph is generated **fresh** for each inference request. Key observations:
- **Per-Request Context**: Each `generate_image()` call creates a new `ggml_context`:

  ```cpp
  struct ggml_init_params params;
  params.mem_size   = static_cast<size_t>(1024 * 1024) * 1024;  // 1 GiB work buffer
  params.mem_buffer = NULL;   // let ggml allocate the buffer itself
  params.no_alloc   = false;  // tensors get real backing memory
  struct ggml_context* work_ctx = ggml_init(params);
  ```

- **Dynamic Construction**: The graph is built dynamically during inference by:
  - Calling model `compute()` methods
  - Constructing ggml operations for the diffusion steps
  - Building conditioning pipelines
- **No Graph Reuse**: Each request gets its own context and graph; nothing is cached between requests.
The per-request inference flow (a minimal ggml sketch of this lifecycle follows the list):

- Conditioning: Text/image conditioning is computed using the conditioner models
- Diffusion Loop: For each denoising step:
  - Model forward passes construct the ggml operations
  - The graph is executed with `ggml_graph_compute()`
  - Results are fed to the next step
- VAE Decoding: The final latent is decoded to an image using the VAE model
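To make the lifecycle concrete, here is a minimal, self-contained ggml sketch of the same pattern: a fresh context per request, a dynamically built graph, one compute call, then teardown. The matmul stands in for a model forward pass; none of this is stable-diffusion.cpp code.

```cpp
#include "ggml.h"

int main() {
    // Fresh per-request context: a scratch arena that lives for one request.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16u * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context* work_ctx = ggml_init(params);

    // Build the graph dynamically; one matmul stands in for a forward pass.
    struct ggml_tensor* a = ggml_new_tensor_2d(work_ctx, GGML_TYPE_F32, 4, 4);
    struct ggml_tensor* b = ggml_new_tensor_2d(work_ctx, GGML_TYPE_F32, 4, 4);
    ggml_set_f32(a, 1.0f);
    ggml_set_f32(b, 2.0f);
    struct ggml_tensor* out = ggml_mul_mat(work_ctx, a, b);

    struct ggml_cgraph* gf = ggml_new_graph(work_ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(work_ctx, gf, /*n_threads=*/4);

    // Teardown: nothing is cached between requests.
    ggml_free(work_ctx);
    return 0;
}
```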
Changing any model requires reloading the entire context because:
- Fixed Model Objects: Model instances are created once during init and reference specific tensors
- Shared Tensor Storage: All tensors are loaded into a single `tensor_storage_map`
- No Selective Reloading: There is no API to replace individual model objects or update tensor references
- Architecture Assumptions: Graph-building code assumes specific model architectures
**Diffusion Model**

- Problem: Different architectures (UNet vs. DiT) have incompatible tensor names/shapes
- Impact: Graph building would fail with wrong tensor references
- Current Solution: Create a new context with `new_sd_ctx()` (see the sketch below)
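In practice, switching the diffusion model today is a full teardown and rebuild. A sketch, assuming the `sd_ctx_params_t`-based API from above (the path is hypothetical):

```cpp
#include "stable-diffusion.h"

// Sketch of today's model-switch path: no selective reload exists, so the
// whole context is destroyed and recreated with the new model path.
sd_ctx_t* switch_diffusion_model(sd_ctx_t* old_ctx, sd_ctx_params_t params) {
    free_sd_ctx(old_ctx);  // frees ALL models, not just the diffusion model
    params.model_path = "models/other_model.safetensors";  // hypothetical path
    return new_sd_ctx(&params);  // reloads VAE, text encoders, etc. as well
}
```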
**VAE**

- Problem: Different VAE implementations (KL, WAN, TAE) have different interfaces
- Impact: Model object recreation is required
- Workaround: Some support exists for the Tiny AutoEncoder as an alternative
**ControlNet**

- Problem: ControlNet models are tightly coupled to the base SD architecture
- Impact: A ControlNet compatible with the specific SD version is required
**LoRA**

- Advantage: Already supports dynamic loading and application
- Implementation: `load_lora_model_from_file()` and prompt-based application (usage note below)
- Limitation: LoRA models are loaded into the same tensor space
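As a usage note, the prompt-based path follows upstream stable-diffusion.cpp syntax: appending a tag such as `<lora:filename:0.8>` to the prompt loads `filename` from `lora_model_dir` and applies it with a multiplier of 0.8, without recreating the context.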
**Text Encoders**

- Problem: Different text encoder architectures (CLIP vs. T5 vs. Qwen)
- Impact: Conditioner object recreation is required
Performance implications of switching models:

- Full Reload: Changing any model requires loading all models again
- Memory Management: Old tensors are freed only when the context is destroyed
- Load Times: Significant delay when switching models
By contrast, LoRA handling is already lighter:

- Selective Loading: LoRA models can be loaded individually
- Runtime Application: Can be applied without a full context reload
- Caching: LoRA state is managed separately
Possible architectural improvements:

- Separate Model Loaders: One `ModelLoader` per model type instead of a shared one
- Model Factory Pattern: The ability to recreate model objects with new tensors
- Tensor Reference Updates: Methods to update model objects' tensor references
- API Extensions: New functions such as `sd_load_diffusion_model()`, `sd_load_vae()`, etc.:
```cpp
// Hypothetical API extensions
SD_API bool sd_load_diffusion_model(sd_ctx_t* sd_ctx, const char* model_path);
SD_API bool sd_load_vae(sd_ctx_t* sd_ctx, const char* vae_path);
SD_API bool sd_load_controlnet(sd_ctx_t* sd_ctx, const char* controlnet_path);
```

Implementation challenges (a factory-pattern sketch follows this list):

- Memory Management: Coordinating tensor allocation/deallocation
- Backend Compatibility: Ensuring new models work with existing backends
- Graph Validation: Ensuring new model architectures are compatible
- State Consistency: Managing LoRA and other modifications across reloads
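To illustrate the model factory idea, here is a rough sketch of recreating only the VAE object; the types and members are stand-ins, not actual stable-diffusion.cpp declarations:

```cpp
#include <memory>
#include <string>

// Illustrative factory sketch: swap only the VAE object, leaving the rest
// of the context alone. Types are hypothetical stand-ins.
struct VAEModel {
    virtual ~VAEModel() = default;
};
struct AutoEncoderKL : VAEModel {};
struct TinyAutoEncoder : VAEModel {};

std::unique_ptr<VAEModel> make_vae(const std::string& variant) {
    // Factory keyed on the detected variant (KL / TAE / ...).
    if (variant == "taesd") return std::make_unique<TinyAutoEncoder>();
    return std::make_unique<AutoEncoderKL>();
}

struct SDContextSketch {
    std::unique_ptr<VAEModel> first_stage_model;

    // Hypothetical selective reload. A real implementation would also load
    // the new VAE tensors into the shared storage, re-resolve the object's
    // tensor references, free the old tensors, and re-apply LoRA state.
    bool load_vae(const std::string& variant) {
        first_stage_model = make_vae(variant);  // old VAE object freed here
        return first_stage_model != nullptr;
    }
};
```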
Recommendations:

- Document the current limitations clearly
- Optimize LoRA loading/caching for better performance
- Consider the Tiny AutoEncoder as a VAE alternative where possible
- Implement a selective model loading API
- Add model compatibility validation
- Support model hot-swapping during inference
The current stable-diffusion.cpp architecture loads all models into a shared context at initialization time, requiring full reloads when any model changes. While LoRA models support dynamic loading, core models (SD, VAE, ControlNet, Text Encoders) are fixed for the context lifetime. GGML graphs are built fresh per inference request, providing flexibility but requiring careful management of model references.
Selective model loading would require significant architectural changes but could dramatically improve user experience by reducing model switching times.