Merged
11 changes: 11 additions & 0 deletions docs/diffusion/api/cli.md
@@ -9,6 +9,7 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f

## Supported Arguments


### Server Arguments

- `--model-path {MODEL_PATH}`: Path to the model or model ID
@@ -24,6 +25,16 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
- `--cache-dit-config {PATH}`: Path to a Cache-DiT YAML/JSON config (diffusers backend only)
- `--dit-precision {DTYPE}`: Precision for the DiT model (currently supports fp32, fp16, and bf16).

### Quantized Transformers

For quantized transformer checkpoints, prefer:

- `--model-path` for the base model (the pipeline)
- `--transformer-path` for a quantized transformers-style transformer component folder (one that contains its own `config.json`)
- `--transformer-weights-path` for a quantized safetensors file, directory, or repo

See [Quantization](../quantization.md) for the supported quantization families and examples.


### Sampling Parameters

1 change: 1 addition & 0 deletions docs/diffusion/index.md
@@ -54,6 +54,7 @@ sglang serve --model-path Qwen/Qwen-Image --port 30010
### Usage

- **[CLI Documentation](api/cli.md)** - Command-line interface for `sglang generate` and `sglang serve`
- **[Quantization](quantization.md)** - Quantized transformer checkpoint usage and supported quantization families
- **[OpenAI API](api/openai_api.md)** - OpenAI-compatible API for image/video generation and LoRA management
- **[Post-Processing](api/post_processing.md)** - Frame interpolation (RIFE) and upscaling (Real-ESRGAN)

175 changes: 175 additions & 0 deletions docs/diffusion/quantization.md
@@ -0,0 +1,175 @@
# Quantization

SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep
the base model and the quantized transformer override separate.

## Quick Reference

Use these paths:

- `--model-path`: the base or original model
- `--transformer-path`: a quantized transformers-style transformer component directory that already contains its own `config.json`
- `--transformer-weights-path`: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID

Recommended example:

```bash
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "a curious pikachu"
```

For quantized transformers-style transformer component folders:

```bash
sglang generate \
--model-path /path/to/base-model \
--transformer-path /path/to/quantized-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion"
```

NOTE: Some model-specific integrations also accept a quantized repo or local
directory directly as `--model-path`, but that is a compatibility path. If a
repo contains multiple candidate checkpoints, pass
`--transformer-weights-path` explicitly.

## Quant Families

Here, `quant_family` denotes a family of checkpoints that share CLI usage and
loader behavior; it is more than just a numeric precision or a kernel backend.

| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes |
|------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------|--------------------------------------------------------------|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| `fp8` | Quantized transformer component folder, or safetensors with `quantization_config` metadata | `--transformer-path` or `--transformer-weights-path` | ALL | None | Component-folder and single-file flows are both supported |
| `nvfp4-modelopt` | NVFP4 safetensors file, sharded directory, or repo providing transformer weights | `--transformer-weights-path` | FLUX.2 | `comfy-kitchen` optional on Blackwell | Blackwell can use a best-performance kit when available; otherwise SGLang falls back to the generic ModelOpt FP4 path |
| `nunchaku-svdq` | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` |

## NVFP4

### Usage Examples

Recommended usage keeps the base model and quantized transformer override
separate:

```bash
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
```

SGLang also supports passing the NVFP4 repo or local directory directly as
`--model-path`:

```bash
sglang generate \
--model-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
```

### Notes

- `--transformer-weights-path` is still the canonical CLI for NVFP4
transformer checkpoints.
- Direct `--model-path` loading is a compatibility path for FLUX.2 NVFP4-style
repos or local directories.
- If `--transformer-weights-path` is provided explicitly, it takes precedence
over the compatibility `--model-path` flow.
- For local directories, SGLang first looks for `*-mixed.safetensors`, then
falls back to loading from the directory.
- On Blackwell, `comfy-kitchen` can provide the best-performance path when
available; otherwise SGLang falls back to the generic ModelOpt FP4 path.
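
The local-directory lookup order described above can be sketched in shell. This is an illustrative sketch of the documented behavior, not SGLang's actual code; the directory path is a hypothetical example.

```bash
#!/usr/bin/env bash
# Sketch of the documented lookup order for a local NVFP4 directory:
# prefer an explicit *-mixed.safetensors file, else load from the directory.
dir="/path/to/flux2-nvfp4"          # assumed local NVFP4 checkpoint directory

shopt -s nullglob                   # empty glob expands to nothing, not itself
mixed=("$dir"/*-mixed.safetensors)

if (( ${#mixed[@]} > 0 )); then
  weights="${mixed[0]}"             # a *-mixed.safetensors file was found
else
  weights="$dir"                    # fall back to loading from the directory
fi

echo "resolved weights path: $weights"
```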

## Nunchaku (SVDQuant)

### Install

Install the runtime dependency first:

```bash
pip install nunchaku
```

For platform-specific installation methods and troubleshooting, see the
[Nunchaku installation guide](https://nunchaku.tech/docs/nunchaku/installation/installation.html).

### File Naming and Auto-Detection

For Nunchaku checkpoints, `--model-path` should still point to the original
base model, while `--transformer-weights-path` points to the quantized
transformer weights.

If the basename of `--transformer-weights-path` contains the pattern
`svdq-(int4|fp4)_r{rank}`, SGLang will automatically:
- enable SVDQuant
- infer `--quantization-precision`
- infer `--quantization-rank`

Examples:

| checkpoint name fragment | inferred precision | inferred rank | notes |
|--------------------------|--------------------|---------------|-------|
| `svdq-int4_r32` | `int4` | `32` | Standard INT4 checkpoint |
| `svdq-int4_r128` | `int4` | `128` | Higher-quality INT4 checkpoint |
| `svdq-fp4_r32` | `nvfp4` | `32` | `fp4` in the filename maps to CLI value `nvfp4` |
| `svdq-fp4_r128` | `nvfp4` | `128` | Higher-quality NVFP4 checkpoint |

Common filenames:

| filename | precision | rank | typical use |
|----------|-----------|------|-------------|
| `svdq-int4_r32-qwen-image.safetensors` | `int4` | `32` | Balanced default |
| `svdq-int4_r128-qwen-image.safetensors` | `int4` | `128` | Quality-focused |
| `svdq-fp4_r32-qwen-image.safetensors` | `nvfp4` | `32` | RTX 50-series / NVFP4 path |
| `svdq-fp4_r128-qwen-image.safetensors` | `nvfp4` | `128` | Quality-focused NVFP4 |
| `svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors` | `int4` | `32` | Lightning 4-step |
| `svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors` | `int4` | `128` | Lightning 8-step |

If your checkpoint name does not follow this convention, pass
`--enable-svdquant`, `--quantization-precision`, and `--quantization-rank`
explicitly.
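
The naming convention above can be sketched as a small shell function. This mirrors the documented `svdq-(int4|fp4)_r{rank}` pattern only; it is an illustrative sketch, not SGLang's internal detection code, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Infer SVDQuant CLI flags from a checkpoint basename, per the documented
# naming convention. Prints nothing if the name does not match.
infer_svdquant_flags() {
  local name; name="$(basename "$1")"
  if [[ "$name" =~ svdq-(int4|fp4)_r([0-9]+) ]]; then
    local prec="${BASH_REMATCH[1]}" rank="${BASH_REMATCH[2]}"
    if [[ "$prec" == "fp4" ]]; then
      prec="nvfp4"   # filename "fp4" maps to the CLI value "nvfp4"
    fi
    echo "--enable-svdquant --quantization-precision $prec --quantization-rank $rank"
  fi
}

infer_svdquant_flags /path/to/svdq-fp4_r128-qwen-image.safetensors
# prints: --enable-svdquant --quantization-precision nvfp4 --quantization-rank 128
```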

### Usage Examples

Recommended auto-detected flow:

```bash
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
--prompt "change the raccoon to a cute cat" \
--attention-backend torch_sdpa \
--save-output
```

Manual override when the filename does not encode the quant settings:

```bash
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
--enable-svdquant \
--quantization-precision int4 \
--quantization-rank 128 \
--prompt "a beautiful sunset" \
--attention-backend torch_sdpa \
--save-output
```

### Notes

- `--transformer-weights-path` is the canonical flag for Nunchaku checkpoints.
Older config names such as `quantized_model_path` are treated as
compatibility aliases.
- Auto-detection only happens when the checkpoint basename matches
`svdq-(int4|fp4)_r{rank}`.
- The CLI values are `int4` and `nvfp4`. In filenames, the NVFP4 variant is
written as `fp4`.
- Lightning checkpoints usually expect matching `--num-inference-steps`, such
as `4` or `8`.
- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x)
or SM12x GPUs. Hopper (SM90) is currently rejected.
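
The capability gate in the last note can be sketched as follows. This is an assumed illustration of the documented rule; SGLang's real validation inspects the CUDA device, not a string argument, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of the documented Nunchaku capability gate: Ampere (SM8x) and SM12x
# pass, everything else (including Hopper, SM90) is rejected.
nunchaku_supported() {
  case "$1" in              # compute capability, e.g. 80, 86, 90, 120
    8?|12?) return 0 ;;     # SM80-SM89 and SM120-SM129 are allowed
    *)      return 1 ;;     # SM90 (Hopper) and others are rejected
  esac
}

nunchaku_supported 86 && echo "SM86: supported"
nunchaku_supported 90 || echo "SM90: rejected"
```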
1 change: 1 addition & 0 deletions docs/index.rst
@@ -85,6 +85,7 @@ Its core features include:
diffusion/installation
diffusion/compatibility_matrix
diffusion/api/cli
diffusion/quantization
diffusion/api/openai_api
diffusion/performance/index
diffusion/performance/attention_backends