31 changes: 26 additions & 5 deletions docs/diffusion/quantization.md
@@ -43,35 +43,36 @@ backend.
| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes |
|-------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------------------------|-----------------------------------------|---------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| `fp8` | Quantized transformer component folder, or safetensors with `quantization_config` metadata | `--transformer-path` or `--transformer-weights-path` | ALL | None | Component-folder and single-file flows are both supported |
-| `modelopt-fp8` | Converted ModelOpt FP8 transformer directory or repo with `config.json` | `--transformer-path` | FLUX.1, FLUX.2, Wan2.2, Qwen Image, Qwen Image Edit | None | Serialized config stays `quant_method=modelopt` with `quant_algo=FP8`; `dit_layerwise_offload` is supported and `dit_cpu_offload` stays disabled |
+| `modelopt-fp8` | Converted ModelOpt FP8 transformer directory or repo with `config.json` | `--transformer-path` | FLUX.1, FLUX.2, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit | None | Serialized config stays `quant_method=modelopt` with `quant_algo=FP8`; `dit_layerwise_offload` is supported and `dit_cpu_offload` stays disabled |
| `modelopt-nvfp4` | Mixed transformer directory/repo with `config.json`, or raw NVFP4 safetensors export/repo | `--transformer-path` for mixed overrides; `--transformer-weights-path` for raw exports | FLUX.1, FLUX.2, Wan2.2 | None | Mixed override repos keep the base model separate; raw exports such as `black-forest-labs/FLUX.2-dev-NVFP4` still use the weights-path flow |
| `nunchaku-svdq` | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...` | `--transformer-weights-path` | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku` | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4` |
| `msmodelslim` | Pre-quantized msmodelslim transformer weights | `--model-path` | Wan2.2 family | None | Currently only compatible with the Ascend NPU family and supports both `w8a8` and `w4a4` |

## Validated ModelOpt Checkpoints

-This section is the canonical support matrix for the diffusion ModelOpt
+This section is the canonical support matrix for the nine diffusion ModelOpt
checkpoints currently wired up in SGLang docs and validation coverage.

Published checkpoints keep the serialized quantization config as
`quant_method=modelopt`; the FP8 vs NVFP4 split below is a documentation label
derived from `quant_algo`.

-Seven of the eight repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the
+Eight of the nine repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the
official `black-forest-labs/FLUX.2-dev-NVFP4` repo.

| Quant Algo | Base Model | Preferred CLI | HF Repo | Current Scope | Notes |
| --- | --- | --- | --- | --- | --- |
| `FP8` | `black-forest-labs/FLUX.1-dev` | `--transformer-path` | `lmsys/flux1-dev-modelopt-fp8-sglang-transformer` | single-transformer override, deterministic latent/image comparison, H100 benchmark, torch-profiler trace | SGLang converter keeps a validated BF16 fallback set for modulation and FF projection layers; use `--model-id FLUX.1-dev` for local mirrors |
| `FP8` | `black-forest-labs/FLUX.2-dev` | `--transformer-path` | `lmsys/flux2-dev-modelopt-fp8-sglang-transformer` | single-transformer override load and generation path | published SGLang-ready transformer override |
| `FP8` | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `--transformer-path` | `lmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformer` | primary `transformer` quantized, `transformer_2` kept BF16 | primary-transformer-only path; keep `transformer_2` on the base checkpoint, and do not describe this as dual-transformer full-model FP8 unless that path is validated separately |
| `FP8` | `hunyuanvideo-community/HunyuanVideo` | `--transformer-path` | `lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer` | single-transformer override, BF16-vs-FP8 video comparison, H100 benchmark, torch-profiler trace | HunyuanVideo uses different ModelOpt/diffusers and SGLang runtime module names; the converter maps those names before writing FP8 scale tensors and BF16 fallback ignores |
| `FP8` | `Qwen/Qwen-Image` | `--transformer-path` | `lmsys/qwen-image-modelopt-fp8-sglang-transformer` | single-transformer override, BF16-vs-FP8 image comparison, H100 benchmark, torch-profiler trace | shares the Qwen Image FP8 fallback preset; keep `img_in`, `txt_in`, timestep embedder, `norm_out.linear`, `proj_out`, `img_mod`/`txt_mod`, and `img_mlp.net.2` in BF16 |
-| `FP8` | `Qwen/Qwen-Image-Edit-2511` | `--transformer-path` | `lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer` | TI2I edit smoke, BF16-vs-FP8 image comparison, H100 benchmark | shares `QwenImageTransformer2DModel` with Qwen Image and uses the same Qwen Image FP8 fallback preset |
+| `FP8` | `Qwen/Qwen-Image-Edit-2511` | `--transformer-path` | `lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer` | TI2I edit path, BF16-vs-FP8 image comparison, H100 benchmark | shares `QwenImageTransformer2DModel` with Qwen Image and uses the same Qwen Image FP8 fallback preset |
| `NVFP4` | `black-forest-labs/FLUX.1-dev` | `--transformer-path` | `lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer` | mixed BF16+NVFP4 transformer override, correctness validation, 4x RTX 5090 benchmark, torch-profiler trace | use `build_modelopt_nvfp4_transformer.py`; validated builder keeps selected FLUX.1 modules in BF16 and sets `swap_weight_nibbles=false` |
| `NVFP4` | `black-forest-labs/FLUX.2-dev` | `--transformer-weights-path` | `black-forest-labs/FLUX.2-dev-NVFP4` | packed-QKV load path | official raw export repo; validated packed export detection and runtime layout handling |
| `NVFP4` | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `--transformer-path` | `lmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformer` | primary `transformer` quantized with ModelOpt NVFP4, `transformer_2` kept BF16 | primary-transformer-only path; keep `transformer_2` on the base checkpoint, and current B200/Blackwell bring-up uses `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn` |

-These eight checkpoints are also the intended case set for the B200 diffusion
+These nine checkpoints are also the intended case set for the B200 diffusion
CI job (`multimodal-gen-test-1-b200`).
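
The "eight of the nine" count can be checked against the matrix itself. The sketch below is illustrative only: the repo IDs are copied from the "HF Repo" column, while `VALIDATED_REPOS` and `count_by_org` are hypothetical names that do not exist in SGLang.

```python
# Repo IDs copied from the "HF Repo" column of the validated matrix.
# VALIDATED_REPOS and count_by_org are illustrative names, not SGLang APIs.
VALIDATED_REPOS = [
    "lmsys/flux1-dev-modelopt-fp8-sglang-transformer",
    "lmsys/flux2-dev-modelopt-fp8-sglang-transformer",
    "lmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformer",
    "lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer",
    "lmsys/qwen-image-modelopt-fp8-sglang-transformer",
    "lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer",
    "lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer",
    "black-forest-labs/FLUX.2-dev-NVFP4",
    "lmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformer",
]


def count_by_org(repos):
    """Count repos per Hugging Face organization (the part before the slash)."""
    counts = {}
    for repo in repos:
        org = repo.split("/", 1)[0]
        counts[org] = counts.get(org, 0) + 1
    return counts


if __name__ == "__main__":
    print(len(VALIDATED_REPOS), count_by_org(VALIDATED_REPOS))
```

A check like this is cheap to rerun whenever a row is added, so the prose counts and the table do not drift apart.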

## ModelOpt FP8
@@ -98,6 +99,15 @@ sglang generate \
--save-output
```

```bash
sglang generate \
--model-path hunyuanvideo-community/HunyuanVideo \
--transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer \
--height 544 --width 960 --num-frames 17 \
--prompt "A cinematic shot of a red sports car driving through rain at night" \
--save-output
```

```bash
sglang generate \
--model-path Qwen/Qwen-Image \
@@ -131,6 +141,17 @@ sglang generate \
- On disk, the quantization config stays `quant_method=modelopt` with
`quant_algo=FP8`; the `modelopt-fp8` label in this document is a support
family name, not a serialized config key.
- `hunyuanvideo-community/HunyuanVideo` uses the `hunyuan-video` converter
preset. Use `--model-type hunyuan-video` to force it, or rely on
auto-detection from `_class_name=HunyuanVideoTransformer3DModel`.
- The validated HunyuanVideo FP8 fallback preset keeps `context_embedder`,
`x_embedder.proj`, timestep/guidance/text embedder linear layers,
`norm_out.linear`, `proj_out`, double-block modulation linear layers, and
single-block modulation linear layers in BF16.
- HunyuanVideo ModelOpt exports use diffusers module names that do not match
SGLang runtime module names for fused QKV and fused QKV+MLP layers. The
converter maps the names before selecting scale tensors and before writing
the runtime ignore list.
- `Qwen/Qwen-Image` and `Qwen/Qwen-Image-Edit-2511` share the `qwen-image`
converter preset. Use `--model-type qwen-image` to force it, or rely on
auto-detection from `_class_name=QwenImageTransformer2DModel`.
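
The preset selection described in these notes can be sketched roughly as follows. This is a minimal illustration, not the converter's actual code: `CLASS_NAME_TO_PRESET` and `resolve_model_type` are hypothetical names, and only the two class-name/preset pairs stated in the notes are included.

```python
# Hypothetical lookup table: only the pairs named in the notes above.
# The real converter may use a different structure and cover more presets.
CLASS_NAME_TO_PRESET = {
    "HunyuanVideoTransformer3DModel": "hunyuan-video",
    "QwenImageTransformer2DModel": "qwen-image",
}


def resolve_model_type(config, model_type="auto"):
    """Return the preset name, honoring an explicit --model-type value.

    `config` is the parsed transformer config.json as a dict.
    """
    if model_type != "auto":
        # An explicit value forces the preset, as the notes describe.
        return model_type
    class_name = config.get("_class_name")
    if class_name not in CLASS_NAME_TO_PRESET:
        raise ValueError(f"cannot auto-detect a preset for _class_name={class_name!r}")
    return CLASS_NAME_TO_PRESET[class_name]
```

Failing loudly on an unknown `_class_name` is a design choice of this sketch; it keeps an unvalidated architecture from silently picking up another family's fallback set.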
27 changes: 22 additions & 5 deletions docs_new/docs/sglang-diffusion/quantization.mdx
@@ -76,7 +76,7 @@ backend.
<td><code>modelopt-fp8</code></td>
<td>Converted ModelOpt FP8 transformer directory or repo with <code>config.json</code></td>
<td><code>--transformer-path</code></td>
-<td>FLUX.1, FLUX.2, Wan2.2</td>
+<td>FLUX.1, FLUX.2, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit</td>
<td>None</td>
<td>Serialized config stays <code>quant_method=modelopt</code> with <code>quant_algo=FP8</code>; <code>dit_layerwise_offload</code> is supported and <code>dit_cpu_offload</code> stays disabled</td>
</tr>
@@ -109,14 +109,14 @@ backend.

## Validated ModelOpt Checkpoints

-This section is the canonical support matrix for the eight diffusion ModelOpt
+This section is the canonical support matrix for the nine diffusion ModelOpt
checkpoints currently wired up in SGLang docs and B200 CI coverage.

Published checkpoints keep the serialized quantization config as
`quant_method=modelopt`; the FP8 vs NVFP4 split below is a documentation label
derived from `quant_algo`.

-Seven of the eight repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the
+Eight of the nine repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the
official `black-forest-labs/FLUX.2-dev-NVFP4` repo.

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
@@ -163,6 +163,14 @@ official `black-forest-labs/FLUX.2-dev-NVFP4` repo.
<td>primary <code>transformer</code> quantized, <code>transformer_2</code> kept BF16</td>
<td>primary-transformer-only path; keep <code>transformer_2</code> on the base checkpoint, and do not describe this as dual-transformer full-model FP8 unless that path is validated separately</td>
</tr>
<tr>
<td><code>FP8</code></td>
<td><code>hunyuanvideo-community/HunyuanVideo</code></td>
<td><code>--transformer-path</code></td>
<td><code>lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer</code></td>
<td>single-transformer override, BF16-vs-FP8 video comparison, H100 benchmark, torch-profiler trace</td>
<td>HunyuanVideo uses different ModelOpt/diffusers and SGLang runtime module names; the converter maps those names before writing FP8 scale tensors and BF16 fallback ignores</td>
</tr>
<tr>
<td><code>FP8</code></td>
<td><code>Qwen/Qwen-Image</code></td>
@@ -176,7 +184,7 @@ official `black-forest-labs/FLUX.2-dev-NVFP4` repo.
<td><code>Qwen/Qwen-Image-Edit-2511</code></td>
<td><code>--transformer-path</code></td>
<td><code>lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer</code></td>
-<td>TI2I edit smoke, BF16-vs-FP8 image comparison, H100 benchmark</td>
+<td>TI2I edit path, BF16-vs-FP8 image comparison, H100 benchmark</td>
<td>shares <code>QwenImageTransformer2DModel</code> with Qwen Image and uses the same Qwen Image FP8 fallback preset</td>
</tr>
<tr>
@@ -206,7 +214,7 @@ official `black-forest-labs/FLUX.2-dev-NVFP4` repo.
</tbody>
</table>

-These eight checkpoints are also the intended case set for the B200 diffusion CI
+These nine checkpoints are also the intended case set for the B200 diffusion CI
job (`multimodal-gen-test-1-b200`).

## ModelOpt FP8
@@ -233,6 +241,15 @@ sglang generate \
--save-output
```

```bash
sglang generate \
--model-path hunyuanvideo-community/HunyuanVideo \
--transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer \
--height 544 --width 960 --num-frames 17 \
--prompt "A cinematic shot of a red sports car driving through rain at night" \
--save-output
```

```bash
sglang generate \
--model-path Qwen/Qwen-Image \
@@ -63,34 +63,34 @@ This repo now contains:

Validated documentation and CI coverage currently center on these ModelOpt diffusion transformer override families:

-- FP8: FLUX.1-dev, FLUX.2-dev, Wan2.2, Qwen Image, Qwen Image Edit
+- FP8: FLUX.1-dev, FLUX.2-dev, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit
- NVFP4: FLUX.1-dev, FLUX.2-dev, Wan2.2

Treat a new family, a new precision, or a new checkpoint layout as unsupported until it has a documented matrix row and a matching validation story.
Before writing CLI examples, re-read the active branch's `docs/diffusion/quantization.md`: FLUX.2 NVFP4 is an official `black-forest-labs/*` repo rather than a `lmsys/*` converted repo, and its preferred flag depends on the current documented loader flow. Use `--transformer-path` for a component override directory with `config.json`; use `--transformer-weights-path` when the repo or path should be probed as raw weights.
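
For a local checkpoint path, the flag rule in the paragraph above can be approximated with a small helper. This is a heuristic sketch under stated assumptions, not SGLang's loader logic: `pick_transformer_flag` is a hypothetical name, it only inspects local paths, and the documented matrix remains the authority for Hub repos such as the FLUX.2 NVFP4 raw export.

```python
from pathlib import Path


def pick_transformer_flag(local_path):
    """Suggest a CLI flag for a local transformer checkpoint path.

    Heuristic only: a directory that carries config.json is treated as a
    component override; anything else is treated as raw weights.
    """
    path = Path(local_path)
    if path.is_dir() and (path / "config.json").is_file():
        return "--transformer-path"
    return "--transformer-weights-path"
```

The helper deliberately returns a suggestion rather than building a command line, so the caller still has to confirm the choice against the current documented loader flow.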

B200 CI coverage can include loose BF16-vs-quantized quality checks. Inspect the active branch's `run_suite.py` before assuming they are part of the suite; mainline and feature branches may differ. Those checks are intended to catch blank, corrupted, or obviously divergent images, not exact image parity.

-Mainline documentation now uses `lmsys/*` for the five converted ModelOpt
+Mainline documentation now uses `lmsys/*` for the eight converted ModelOpt
checkpoint repos; the FLUX.2 NVFP4 raw export remains
`black-forest-labs/FLUX.2-dev-NVFP4`. Do not use older `BBuf/*` examples unless
you are explicitly testing a historical branch.

-## Open PR Watchlist
+## Related PR Watchlist

-As of 2026-05-02, these related SGLang PRs were open. Treat them as future
-support or migration work until they merge and the docs/CI matrix is updated.
+As of 2026-05-04, these related SGLang PRs are relevant to ModelOpt diffusion
+support. Treat unmerged items as future support or migration work until the
+docs/CI matrix is updated.

-- #23155 adds Qwen Image ModelOpt FP8 support.
+- #23155 added Qwen Image ModelOpt FP8 support.
- #23199 adds HunyuanVideo ModelOpt FP8 support.
- #23373 adds a runtime quantization flag; keep PTQ/export workflows separate from runtime quant examples until the CLI behavior is merged.
- #24024 adds transformer FP8-cast compatibility mode.
- #24186 re-enables B200 multimodal CI with NVFP4 fixes for FLUX.2 and Wan2.2.

-Do not expand the validated matrix beyond FLUX.1, FLUX.2, and Wan2.2 solely
-because one of these PRs exists. Add a row only after the exact checkpoint,
-loader path, accuracy check, and benchmark scope are validated on the active
-branch.
+Do not expand the validated matrix beyond the documented rows solely because a
+related PR exists. Add a row only after the exact checkpoint, loader path,
+accuracy check, and benchmark scope are validated on the active branch.

## Documentation Maintenance

@@ -194,6 +194,28 @@ For `FLUX.1-dev`, the validated fallback set currently keeps these modules in BF

Use `--model-type flux1` to force that profile, or rely on `--model-type auto` when the export config identifies `FluxTransformer2DModel`.

HunyuanVideo uses `HunyuanVideoTransformer3DModel`, so the validated
HunyuanVideo FP8 fallback preset keeps these modules in BF16:

- `context_embedder.*`
- `x_embedder.proj`
- `time_text_embed.(timestep_embedder|guidance_embedder|text_embedder).linear_[12]`
- `norm_out.linear`
- `proj_out`
- `transformer_blocks.*.norm1.linear`
- `transformer_blocks.*.norm1_context.linear`
- `single_transformer_blocks.*.norm.linear`

Use `--model-type hunyuan-video` to force that profile, or rely on
`--model-type auto` when the export config identifies
`HunyuanVideoTransformer3DModel`.

HunyuanVideo ModelOpt exports use diffusers module names that differ from
SGLang runtime names for fused QKV and fused QKV+MLP layers. Keep the
diffusers-to-runtime mapping in `build_modelopt_fp8_transformer.py` in sync
with `runtime/models/dits/hunyuanvideo.py` before trusting converted scale
tensors.
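
One way to sanity-check a converted checkpoint against the HunyuanVideo fallback list above is to match module names against it. The sketch below is an assumption-laden illustration: the regexes are my translation of the listed patterns (with `*` read as a numeric block index or any suffix), and `HUNYUAN_BF16_FALLBACK` and `stays_bf16` are hypothetical names, not part of `build_modelopt_fp8_transformer.py`.

```python
import re

# Regex translations of the BF16 fallback list; illustrative only.
HUNYUAN_BF16_FALLBACK = [
    r"context_embedder\..*",
    r"x_embedder\.proj",
    r"time_text_embed\.(timestep_embedder|guidance_embedder|text_embedder)\.linear_[12]",
    r"norm_out\.linear",
    r"proj_out",
    r"transformer_blocks\.\d+\.norm1\.linear",
    r"transformer_blocks\.\d+\.norm1_context\.linear",
    r"single_transformer_blocks\.\d+\.norm\.linear",
]
_COMPILED = [re.compile(pattern) for pattern in HUNYUAN_BF16_FALLBACK]


def stays_bf16(module_name):
    """Return True if a diffusers-style module name is in the fallback set.

    fullmatch is used so that `transformer_blocks.*` patterns do not
    accidentally match `single_transformer_blocks.*` names.
    """
    return any(p.fullmatch(module_name) for p in _COMPILED)
```

The names here are diffusers-side module names without a `.weight` suffix; after the diffusers-to-runtime mapping, the same check would need the runtime names instead.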

Qwen Image and Qwen Image Edit share `QwenImageTransformer2DModel`, so one
ModelOpt FP8 fallback preset covers both. The validated Qwen Image fallback set
keeps these modules in BF16:
@@ -477,9 +477,9 @@ def main():
ServerArgs.add_cli_args(parser)
BenchArgs.add_cli_args(parser)

-args = parser.parse_args()
+args, unknown_args = parser.parse_known_args()

-server_args = ServerArgs.from_cli_args(args)
+server_args = ServerArgs.from_cli_args(args, unknown_args)
bench_args = BenchArgs.from_cli_args(args)

set_global_server_args(server_args)
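
The switch from `parse_args()` to `parse_known_args()` changes how unrecognized flags are handled: instead of exiting with an error, argparse returns them as a second value. The standalone sketch below shows only the standard-library behavior; the flag names and `split_known_args` are made up for illustration, and how `ServerArgs.from_cli_args` consumes the extra list is not shown in this diff.

```python
import argparse


def split_known_args(argv):
    """Parse the flags this parser defines and return the rest untouched."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag, standing in for the ServerArgs/BenchArgs options.
    parser.add_argument("--model-path")
    args, unknown_args = parser.parse_known_args(argv)
    return args, unknown_args


if __name__ == "__main__":
    args, unknown = split_known_args(["--model-path", "m", "--extra-flag", "1"])
    print(args.model_path, unknown)
```

With `parse_args()`, the same input would terminate the process with an "unrecognized arguments" error, which is why the extra list has to be forwarded explicitly.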
8 changes: 3 additions & 5 deletions python/sglang/multimodal_gen/runtime/layers/linear.py
@@ -232,6 +232,7 @@ def __init__(
skip_bias_add: bool = False,
params_dtype: torch.dtype | None = None,
quant_config: QuantizationConfig | None = None,
output_sizes: list[int] | None = None,
prefix: str = "",
):
super().__init__(
@@ -245,10 +246,11 @@ def __init__(

# All the linear layer supports quant method.
assert self.quant_method is not None
output_partition_sizes = output_sizes or [self.output_size]
self.quant_method.create_weights(
self,
self.input_size,
-[self.output_size],
+output_partition_sizes,
self.input_size,
self.output_size,
self.params_dtype,
@@ -497,7 +499,6 @@ def weight_loader(
loaded_weight: torch.Tensor,
loaded_shard_id: int | None = None,
) -> None:

param_data = param.data
output_dim = getattr(param, "output_dim", None)
# Special case for AQLM codebooks.
@@ -829,7 +830,6 @@ def weight_loader(
loaded_weight: torch.Tensor,
loaded_shard_id: str | None = None,
):

param_data = param.data
output_dim = getattr(param, "output_dim", None)
# Special case for AQLM codebooks.
@@ -866,7 +866,6 @@ def weight_loader(
]

for shard_id, shard_offset, shard_size in shard_offsets:

loaded_weight_shard = loaded_weight.narrow(
output_dim, shard_offset, shard_size
)
@@ -1037,7 +1036,6 @@ def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
param_data.copy_(loaded_weight)

def weight_loader_v2(self, param: BasevLLMParameter, loaded_weight: torch.Tensor):

# Special case for loading scales off disk, which often do not
# have a shape (such as in the case of AutoFP8).
if len(loaded_weight.shape) == 0: