Merged
11 changes: 11 additions & 0 deletions docs/diffusion/api/cli.md
@@ -9,6 +9,7 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f

## Supported Arguments


### Server Arguments

- `--model-path {MODEL_PATH}`: Path to the model or model ID
@@ -24,6 +25,16 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
- `--cache-dit-config {PATH}`: Path to a Cache-DiT YAML/JSON config (diffusers backend only)
- `--dit-precision {DTYPE}`: Precision for the DiT model (currently supports fp32, fp16, and bf16).

### Quantized Transformers

For quantized transformer checkpoints, prefer:

- `--model-path` for the base model (the pipeline)
- `--transformer-path` for a quantized transformers-style transformer component folder (one that contains its own `config.json`)
- `--transformer-weights-path` for a quantized safetensors file, directory, or repo

See [Quantization](../quantization.md) for the supported quantization families and examples.


### Sampling Parameters

1 change: 1 addition & 0 deletions docs/diffusion/index.md
@@ -54,6 +54,7 @@ sglang serve --model-path Qwen/Qwen-Image --port 30010
### Usage

- **[CLI Documentation](api/cli.md)** - Command-line interface for `sglang generate` and `sglang serve`
- **[Quantization](quantization.md)** - Quantized transformer checkpoint usage and supported quantization families
- **[OpenAI API](api/openai_api.md)** - OpenAI-compatible API for image/video generation and LoRA management
- **[Post-Processing](api/post_processing.md)** - Frame interpolation (RIFE) and upscaling (Real-ESRGAN)

175 changes: 175 additions & 0 deletions docs/diffusion/quantization.md
@@ -0,0 +1,175 @@
# Quantization

SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep
the base model and the quantized transformer override separate.

## Quick Reference

Use these paths:

- `--model-path`: the base or original model
- `--transformer-path`: a quantized transformers-style transformer component directory that already contains its own `config.json`
- `--transformer-weights-path`: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID

Recommended example:

```bash
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "a curious pikachu"
```

For quantized transformers-style transformer component folders:

```bash
sglang generate \
--model-path /path/to/base-model \
--transformer-path /path/to/quantized-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion"
```

NOTE: Some model-specific integrations also accept a quantized repo or local
directory directly as `--model-path`, but that is a compatibility path. If a
repo contains multiple candidate checkpoints, pass
`--transformer-weights-path` explicitly.

## Quant Families

Here, `quant_family` denotes a family of checkpoints that share CLI usage and
loader behavior; it is more than just a numeric precision or a kernel backend.

| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes |
|------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------|--------------------------------------------------------------|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| `fp8` | Quantized transformer component folder, or safetensors with `quantization_config` metadata | `--transformer-path` or `--transformer-weights-path` | ALL | None | Component-folder and single-file flows are both supported |
| `nvfp4-modelopt` | NVFP4 safetensors file, sharded directory, or repo providing transformer weights | `--transformer-weights-path` | FLUX.2 | `comfy-kitchen` optional on Blackwell | Blackwell can use a best-performance kit when available; otherwise SGLang falls back to the generic ModelOpt FP4 path |
| `nunchaku-svdq` | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` |

## NVFP4

### Usage Examples

Recommended usage keeps the base model and quantized transformer override
separate:

```bash
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
```

SGLang also supports passing the NVFP4 repo or local directory directly as
`--model-path`:

```bash
sglang generate \
--model-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
```

### Notes

- `--transformer-weights-path` is still the canonical CLI for NVFP4
transformer checkpoints.
- Direct `--model-path` loading is a compatibility path for FLUX.2 NVFP4-style
repos or local directories.
- If `--transformer-weights-path` is provided explicitly, it takes precedence
over the compatibility `--model-path` flow.
- For local directories, SGLang first looks for `*-mixed.safetensors`, then
falls back to loading from the directory.
- On Blackwell, `comfy-kitchen` can provide the best-performance path when
available; otherwise SGLang falls back to the generic ModelOpt FP4 path.
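
The local-directory lookup order described above can be sketched in shell. This is an illustrative sketch of the documented behavior, not SGLang's actual code; the directory path is a hypothetical example.

```bash
#!/usr/bin/env bash
# Sketch of the documented lookup order for a local NVFP4 directory:
# prefer an explicit *-mixed.safetensors file, else load from the directory.
dir="/path/to/flux2-nvfp4"          # assumed local NVFP4 checkpoint directory

shopt -s nullglob                   # empty glob expands to nothing, not itself
mixed=("$dir"/*-mixed.safetensors)

if (( ${#mixed[@]} > 0 )); then
  weights="${mixed[0]}"             # a *-mixed.safetensors file was found
else
  weights="$dir"                    # fall back to loading from the directory
fi

echo "resolved weights path: $weights"
```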

## Nunchaku (SVDQuant)

### Install

Install the runtime dependency first:

```bash
pip install nunchaku
```

For platform-specific installation methods and troubleshooting, see the
[Nunchaku installation guide](https://nunchaku.tech/docs/nunchaku/installation/installation.html).

### File Naming and Auto-Detection

For Nunchaku checkpoints, `--model-path` should still point to the original
base model, while `--transformer-weights-path` points to the quantized
transformer weights.

If the basename of `--transformer-weights-path` contains the pattern
`svdq-(int4|fp4)_r{rank}`, SGLang will automatically:
- enable SVDQuant
- infer `--quantization-precision`
- infer `--quantization-rank`

Examples:

| checkpoint name fragment | inferred precision | inferred rank | notes |
|--------------------------|--------------------|---------------|-------|
| `svdq-int4_r32` | `int4` | `32` | Standard INT4 checkpoint |
| `svdq-int4_r128` | `int4` | `128` | Higher-quality INT4 checkpoint |
| `svdq-fp4_r32` | `nvfp4` | `32` | `fp4` in the filename maps to CLI value `nvfp4` |
| `svdq-fp4_r128` | `nvfp4` | `128` | Higher-quality NVFP4 checkpoint |

Common filenames:

| filename | precision | rank | typical use |
|----------|-----------|------|-------------|
| `svdq-int4_r32-qwen-image.safetensors` | `int4` | `32` | Balanced default |
| `svdq-int4_r128-qwen-image.safetensors` | `int4` | `128` | Quality-focused |
| `svdq-fp4_r32-qwen-image.safetensors` | `nvfp4` | `32` | RTX 50-series / NVFP4 path |
| `svdq-fp4_r128-qwen-image.safetensors` | `nvfp4` | `128` | Quality-focused NVFP4 |
| `svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors` | `int4` | `32` | Lightning 4-step |
| `svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors` | `int4` | `128` | Lightning 8-step |

If your checkpoint name does not follow this convention, pass
`--enable-svdquant`, `--quantization-precision`, and `--quantization-rank`
explicitly.
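
The naming convention above can be sketched as a small shell function. This mirrors the documented `svdq-(int4|fp4)_r{rank}` pattern only; it is an illustrative sketch, not SGLang's internal detection code, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Infer SVDQuant CLI flags from a checkpoint basename, per the documented
# naming convention. Prints nothing if the name does not match.
infer_svdquant_flags() {
  local name; name="$(basename "$1")"
  if [[ "$name" =~ svdq-(int4|fp4)_r([0-9]+) ]]; then
    local prec="${BASH_REMATCH[1]}" rank="${BASH_REMATCH[2]}"
    if [[ "$prec" == "fp4" ]]; then
      prec="nvfp4"   # filename "fp4" maps to the CLI value "nvfp4"
    fi
    echo "--enable-svdquant --quantization-precision $prec --quantization-rank $rank"
  fi
}

infer_svdquant_flags /path/to/svdq-fp4_r128-qwen-image.safetensors
# prints: --enable-svdquant --quantization-precision nvfp4 --quantization-rank 128
```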

### Usage Examples

Recommended auto-detected flow:

```bash
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
--prompt "change the raccoon to a cute cat" \
--attention-backend torch_sdpa \
--save-output
```

Manual override when the filename does not encode the quant settings:

```bash
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
--enable-svdquant \
--quantization-precision int4 \
--quantization-rank 128 \
--prompt "a beautiful sunset" \
--attention-backend torch_sdpa \
--save-output
```

### Notes

- `--transformer-weights-path` is the canonical flag for Nunchaku checkpoints.
Older config names such as `quantized_model_path` are treated as
compatibility aliases.
- Auto-detection only happens when the checkpoint basename matches
`svdq-(int4|fp4)_r{rank}`.
- The CLI values are `int4` and `nvfp4`. In filenames, the NVFP4 variant is
written as `fp4`.
- Lightning checkpoints usually expect matching `--num-inference-steps`, such
as `4` or `8`.
- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x)
or SM12x GPUs. Hopper (SM90) is currently rejected.
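
The capability gate in the last note can be sketched as follows. This is an assumed illustration of the documented rule; SGLang's real validation inspects the CUDA device, not a string argument, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of the documented Nunchaku capability gate: Ampere (SM8x) and SM12x
# pass, everything else (including Hopper, SM90) is rejected.
nunchaku_supported() {
  case "$1" in              # compute capability, e.g. 80, 86, 90, 120
    8?|12?) return 0 ;;     # SM80-SM89 and SM120-SM129 are allowed
    *)      return 1 ;;     # SM90 (Hopper) and others are rejected
  esac
}

nunchaku_supported 86 && echo "SM86: supported"
nunchaku_supported 90 || echo "SM90: rejected"
```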
1 change: 1 addition & 0 deletions docs/index.rst
@@ -85,6 +85,7 @@ Its core features include:
diffusion/installation
diffusion/compatibility_matrix
diffusion/api/cli
diffusion/quantization
diffusion/api/openai_api
diffusion/performance/index
diffusion/performance/attention_backends