sgl-project · BBuf · May 5, 2026 · Apr 19, 2026 · Apr 20, 2026 · Apr 25, 2026
@@ -43,35 +43,36 @@ backend.
 | quant_family      | checkpoint form                                                                            | canonical CLI                                                          | supported models                        | extra dependency                      | platform / notes                                                                                                                       |
 |-------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------------------------|-----------------------------------------|---------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
 | `fp8`             | Quantized transformer component folder, or safetensors with `quantization_config` metadata | `--transformer-path` or `--transformer-weights-path`                   | ALL                                     | None                                  | Component-folder and single-file flows are both supported                                                                              |
-| `modelopt-fp8`    | Converted ModelOpt FP8 transformer directory or repo with `config.json`                    | `--transformer-path`                                                    | FLUX.1, FLUX.2, Wan2.2, Qwen Image, Qwen Image Edit | None                                  | Serialized config stays `quant_method=modelopt` with `quant_algo=FP8`; `dit_layerwise_offload` is supported and `dit_cpu_offload` stays disabled |
+| `modelopt-fp8`    | Converted ModelOpt FP8 transformer directory or repo with `config.json`                    | `--transformer-path`                                                    | FLUX.1, FLUX.2, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit | None                                  | Serialized config stays `quant_method=modelopt` with `quant_algo=FP8`; `dit_layerwise_offload` is supported and `dit_cpu_offload` stays disabled |
 | `modelopt-nvfp4`  | Mixed transformer directory/repo with `config.json`, or raw NVFP4 safetensors export/repo | `--transformer-path` for mixed overrides; `--transformer-weights-path` for raw exports | FLUX.1, FLUX.2, Wan2.2                  | None                                  | Mixed override repos keep the base model separate; raw exports such as `black-forest-labs/FLUX.2-dev-NVFP4` still use the weights-path flow |
 | `nunchaku-svdq`   | Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4\|fp4}_r{rank}-...`   | `--transformer-weights-path`                                           | Model-specific support such as Qwen-Image, FLUX, and Z-Image | `nunchaku`                            | SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4`                                             |
 | `msmodelslim`     | Pre-quantized msmodelslim transformer weights                                              | `--model-path`                                                         | Wan2.2 family                           | None                                  | Currently only compatible with the Ascend NPU family and supports both `w8a8` and `w4a4`                                               |
 
 ## Validated ModelOpt Checkpoints
 
-This section is the canonical support matrix for the diffusion ModelOpt
+This section is the canonical support matrix for the nine diffusion ModelOpt
 checkpoints currently wired up in SGLang docs and validation coverage.
 
 Published checkpoints keep the serialized quantization config as
 `quant_method=modelopt`; the FP8 vs NVFP4 split below is a documentation label
 derived from `quant_algo`.
 
-Seven of the eight repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the
+Eight of the nine repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the
 official `black-forest-labs/FLUX.2-dev-NVFP4` repo.
 
 | Quant Algo | Base Model | Preferred CLI | HF Repo | Current Scope | Notes |
 | --- | --- | --- | --- | --- | --- |
 | `FP8` | `black-forest-labs/FLUX.1-dev` | `--transformer-path` | `lmsys/flux1-dev-modelopt-fp8-sglang-transformer` | single-transformer override, deterministic latent/image comparison, H100 benchmark, torch-profiler trace | SGLang converter keeps a validated BF16 fallback set for modulation and FF projection layers; use `--model-id FLUX.1-dev` for local mirrors |
 | `FP8` | `black-forest-labs/FLUX.2-dev` | `--transformer-path` | `lmsys/flux2-dev-modelopt-fp8-sglang-transformer` | single-transformer override load and generation path | published SGLang-ready transformer override |
 | `FP8` | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `--transformer-path` | `lmsys/wan22-t2v-a14b-modelopt-fp8-sglang-transformer` | primary `transformer` quantized, `transformer_2` kept BF16 | primary-transformer-only path; keep `transformer_2` on the base checkpoint, and do not describe this as dual-transformer full-model FP8 unless that path is validated separately |
+| `FP8` | `hunyuanvideo-community/HunyuanVideo` | `--transformer-path` | `lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer` | single-transformer override, BF16-vs-FP8 video comparison, H100 benchmark, torch-profiler trace | HunyuanVideo's ModelOpt/diffusers module names differ from its SGLang runtime module names; the converter maps them before writing FP8 scale tensors and the BF16 fallback ignore list |
 | `FP8` | `Qwen/Qwen-Image` | `--transformer-path` | `lmsys/qwen-image-modelopt-fp8-sglang-transformer` | single-transformer override, BF16-vs-FP8 image comparison, H100 benchmark, torch-profiler trace | shares the Qwen Image FP8 fallback preset; keep `img_in`, `txt_in`, timestep embedder, `norm_out.linear`, `proj_out`, `img_mod`/`txt_mod`, and `img_mlp.net.2` in BF16 |
-| `FP8` | `Qwen/Qwen-Image-Edit-2511` | `--transformer-path` | `lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer` | TI2I edit smoke, BF16-vs-FP8 image comparison, H100 benchmark | shares `QwenImageTransformer2DModel` with Qwen Image and uses the same Qwen Image FP8 fallback preset |
+| `FP8` | `Qwen/Qwen-Image-Edit-2511` | `--transformer-path` | `lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer` | TI2I edit path, BF16-vs-FP8 image comparison, H100 benchmark | shares `QwenImageTransformer2DModel` with Qwen Image and uses the same Qwen Image FP8 fallback preset |
 | `NVFP4` | `black-forest-labs/FLUX.1-dev` | `--transformer-path` | `lmsys/flux1-dev-modelopt-nvfp4-sglang-transformer` | mixed BF16+NVFP4 transformer override, correctness validation, 4x RTX 5090 benchmark, torch-profiler trace | use `build_modelopt_nvfp4_transformer.py`; validated builder keeps selected FLUX.1 modules in BF16 and sets `swap_weight_nibbles=false` |
 | `NVFP4` | `black-forest-labs/FLUX.2-dev` | `--transformer-weights-path` | `black-forest-labs/FLUX.2-dev-NVFP4` | packed-QKV load path | official raw export repo; validated packed export detection and runtime layout handling |
 | `NVFP4` | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | `--transformer-path` | `lmsys/wan22-t2v-a14b-modelopt-nvfp4-sglang-transformer` | primary `transformer` quantized with ModelOpt NVFP4, `transformer_2` kept BF16 | primary-transformer-only path; keep `transformer_2` on the base checkpoint, and current B200/Blackwell bring-up uses `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn` |
 
-These eight checkpoints are also the intended case set for the B200 diffusion
+These nine checkpoints are also the intended case set for the B200 diffusion
 CI job (`multimodal-gen-test-1-b200`).
 
 ## ModelOpt FP8
@@ -98,6 +99,15 @@ sglang generate \
   --save-output
 ```
 
+```bash
+sglang generate \
+  --model-path hunyuanvideo-community/HunyuanVideo \
+  --transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer \
+  --height 544 --width 960 --num-frames 17 \
+  --prompt "A cinematic shot of a red sports car driving through rain at night" \
+  --save-output
+```
+
 ```bash
 sglang generate \
   --model-path Qwen/Qwen-Image \
@@ -131,6 +141,17 @@ sglang generate \
 - On disk, the quantization config stays `quant_method=modelopt` with
   `quant_algo=FP8`; the `modelopt-fp8` label in this document is a support
   family name, not a serialized config key.
+- `hunyuanvideo-community/HunyuanVideo` uses the `hunyuan-video` converter
+  preset. Use `--model-type hunyuan-video` to force it, or rely on
+  auto-detection from `_class_name=HunyuanVideoTransformer3DModel`.
+- The validated HunyuanVideo FP8 fallback preset keeps `context_embedder`,
+  `x_embedder.proj`, timestep/guidance/text embedder linear layers,
+  `norm_out.linear`, `proj_out`, double-block modulation linear layers, and
+  single-block modulation linear layers in BF16.
+- HunyuanVideo ModelOpt exports use diffusers module names that do not match
+  SGLang runtime module names for fused QKV and fused QKV+MLP layers. The
+  converter maps the names before selecting scale tensors and before writing
+  the runtime ignore list.
 - `Qwen/Qwen-Image` and `Qwen/Qwen-Image-Edit-2511` share the `qwen-image`
   converter preset. Use `--model-type qwen-image` to force it, or rely on
   auto-detection from `_class_name=QwenImageTransformer2DModel`.

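The preset selection described in the notes above can be sketched as a lookup on `_class_name`. This is a minimal illustration, not the converter's actual code: the function `resolve_model_type` and the table `CLASS_NAME_TO_PRESET` are hypothetical names, and only the class names and preset names are taken from the notes.

```python
import json

# Hypothetical lookup table. The class names and preset names come from the
# notes above; the real converter may organize this differently.
CLASS_NAME_TO_PRESET = {
    "HunyuanVideoTransformer3DModel": "hunyuan-video",
    "QwenImageTransformer2DModel": "qwen-image",
}


def resolve_model_type(config: dict, model_type: str = "auto") -> str:
    """Return the preset name, honoring an explicit --model-type value."""
    if model_type != "auto":
        # An explicit value forces the preset, as the notes describe.
        return model_type
    class_name = config.get("_class_name")
    if class_name not in CLASS_NAME_TO_PRESET:
        raise ValueError(f"no converter preset known for _class_name={class_name!r}")
    return CLASS_NAME_TO_PRESET[class_name]


config = json.loads('{"_class_name": "HunyuanVideoTransformer3DModel"}')
print(resolve_model_type(config))                           # auto-detected
print(resolve_model_type(config, model_type="qwen-image"))  # forced
```

Raising on an unknown class name, rather than guessing a preset, matches the document's stance that unvalidated families should be treated as unsupported.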
@@ -76,7 +76,7 @@ backend.
       <td><code>modelopt-fp8</code></td>
       <td>Converted ModelOpt FP8 transformer directory or repo with <code>config.json</code></td>
       <td><code>--transformer-path</code></td>
-      <td>FLUX.1, FLUX.2, Wan2.2</td>
+      <td>FLUX.1, FLUX.2, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit</td>
       <td>None</td>
       <td>Serialized config stays <code>quant_method=modelopt</code> with <code>quant_algo=FP8</code>; <code>dit_layerwise_offload</code> is supported and <code>dit_cpu_offload</code> stays disabled</td>
     </tr>
@@ -109,14 +109,14 @@ backend.
 
 ## Validated ModelOpt Checkpoints
 
-This section is the canonical support matrix for the eight diffusion ModelOpt
+This section is the canonical support matrix for the nine diffusion ModelOpt
 checkpoints currently wired up in SGLang docs and B200 CI coverage.
 
 Published checkpoints keep the serialized quantization config as
 `quant_method=modelopt`; the FP8 vs NVFP4 split below is a documentation label
 derived from `quant_algo`.
 
-Seven of the eight repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the
+Eight of the nine repos live under `lmsys/*`. The FLUX.2 NVFP4 entry keeps the
 official `black-forest-labs/FLUX.2-dev-NVFP4` repo.
 
 <table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
@@ -163,6 +163,14 @@ official `black-forest-labs/FLUX.2-dev-NVFP4` repo.
       <td>primary <code>transformer</code> quantized, <code>transformer_2</code> kept BF16</td>
       <td>primary-transformer-only path; keep <code>transformer_2</code> on the base checkpoint, and do not describe this as dual-transformer full-model FP8 unless that path is validated separately</td>
     </tr>
+    <tr>
+      <td><code>FP8</code></td>
+      <td><code>hunyuanvideo-community/HunyuanVideo</code></td>
+      <td><code>--transformer-path</code></td>
+      <td><code>lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer</code></td>
+      <td>single-transformer override, BF16-vs-FP8 video comparison, H100 benchmark, torch-profiler trace</td>
+      <td>HunyuanVideo's ModelOpt/diffusers module names differ from its SGLang runtime module names; the converter maps them before writing FP8 scale tensors and the BF16 fallback ignore list</td>
+    </tr>
     <tr>
       <td><code>FP8</code></td>
       <td><code>Qwen/Qwen-Image</code></td>
@@ -176,7 +184,7 @@ official `black-forest-labs/FLUX.2-dev-NVFP4` repo.
       <td><code>Qwen/Qwen-Image-Edit-2511</code></td>
       <td><code>--transformer-path</code></td>
       <td><code>lmsys/qwen-image-edit-modelopt-fp8-sglang-transformer</code></td>
-      <td>TI2I edit smoke, BF16-vs-FP8 image comparison, H100 benchmark</td>
+      <td>TI2I edit path, BF16-vs-FP8 image comparison, H100 benchmark</td>
       <td>shares <code>QwenImageTransformer2DModel</code> with Qwen Image and uses the same Qwen Image FP8 fallback preset</td>
     </tr>
     <tr>
@@ -206,7 +214,7 @@ official `black-forest-labs/FLUX.2-dev-NVFP4` repo.
   </tbody>
 </table>
 
-These eight checkpoints are also the intended case set for the B200 diffusion CI
+These nine checkpoints are also the intended case set for the B200 diffusion CI
 job (`multimodal-gen-test-1-b200`).
 
 ## ModelOpt FP8
@@ -233,6 +241,15 @@ sglang generate \
   --save-output
 ```
 
+```bash
+sglang generate \
+  --model-path hunyuanvideo-community/HunyuanVideo \
+  --transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer \
+  --height 544 --width 960 --num-frames 17 \
+  --prompt "A cinematic shot of a red sports car driving through rain at night" \
+  --save-output
+```
+
 ```bash
 sglang generate \
   --model-path Qwen/Qwen-Image \

@@ -63,34 +63,34 @@ This repo now contains:
 
 Validated documentation and CI coverage currently center on these ModelOpt diffusion transformer override families:
 
-- FP8: FLUX.1-dev, FLUX.2-dev, Wan2.2, Qwen Image, Qwen Image Edit
+- FP8: FLUX.1-dev, FLUX.2-dev, Wan2.2, HunyuanVideo, Qwen Image, Qwen Image Edit
 - NVFP4: FLUX.1-dev, FLUX.2-dev, Wan2.2
 
 Treat a new family, a new precision, or a new checkpoint layout as unsupported until it has a documented matrix row and a matching validation story.
 Before writing CLI examples, re-read the active branch's `docs/diffusion/quantization.md`: FLUX.2 NVFP4 is an official `black-forest-labs/*` repo rather than a `lmsys/*` converted repo, and its preferred flag depends on the current documented loader flow. Use `--transformer-path` for a component override directory with `config.json`; use `--transformer-weights-path` when the repo or path should be probed as raw weights.
 
 B200 CI coverage can include loose BF16-vs-quantized quality checks. Inspect the active branch's `run_suite.py` before assuming they are part of the suite; mainline and feature branches may differ. Those checks are intended to catch blank, corrupted, or obviously divergent images, not exact image parity.
 
-Mainline documentation now uses `lmsys/*` for the five converted ModelOpt
+Mainline documentation now uses `lmsys/*` for the eight converted ModelOpt
 checkpoint repos; the FLUX.2 NVFP4 raw export remains
 `black-forest-labs/FLUX.2-dev-NVFP4`. Do not use older `BBuf/*` examples unless
 you are explicitly testing a historical branch.
 
-## Open PR Watchlist
+## Related PR Watchlist
 
-As of 2026-05-02, these related SGLang PRs were open. Treat them as future
-support or migration work until they merge and the docs/CI matrix is updated.
+As of 2026-05-04, these related SGLang PRs are relevant to ModelOpt diffusion
+support. Treat unmerged items as future support or migration work until the
+docs/CI matrix is updated.
 
-- #23155 adds Qwen Image ModelOpt FP8 support.
+- #23155 added Qwen Image ModelOpt FP8 support.
 - #23199 adds HunyuanVideo ModelOpt FP8 support.
 - #23373 adds a runtime quantization flag; keep PTQ/export workflows separate from runtime quant examples until the CLI behavior is merged.
 - #24024 adds transformer FP8-cast compatibility mode.
 - #24186 re-enables B200 multimodal CI with NVFP4 fixes for FLUX.2 and Wan2.2.
 
-Do not expand the validated matrix beyond FLUX.1, FLUX.2, and Wan2.2 solely
-because one of these PRs exists. Add a row only after the exact checkpoint,
-loader path, accuracy check, and benchmark scope are validated on the active
-branch.
+Do not expand the validated matrix beyond the documented rows solely because a
+related PR exists. Add a row only after the exact checkpoint, loader path,
+accuracy check, and benchmark scope are validated on the active branch.
 
 ## Documentation Maintenance
 
@@ -194,6 +194,28 @@ For `FLUX.1-dev`, the validated fallback set currently keeps these modules in BF
 
 Use `--model-type flux1` to force that profile, or rely on `--model-type auto` when the export config identifies `FluxTransformer2DModel`.
 
+HunyuanVideo uses `HunyuanVideoTransformer3DModel`. The validated HunyuanVideo
+FP8 fallback preset keeps these modules in BF16:
+
+- `context_embedder.*`
+- `x_embedder.proj`
+- `time_text_embed.(timestep_embedder|guidance_embedder|text_embedder).linear_[12]`
+- `norm_out.linear`
+- `proj_out`
+- `transformer_blocks.*.norm1.linear`
+- `transformer_blocks.*.norm1_context.linear`
+- `single_transformer_blocks.*.norm.linear`
+
+Use `--model-type hunyuan-video` to force that profile, or rely on
+`--model-type auto` when the export config identifies
+`HunyuanVideoTransformer3DModel`.
+
+HunyuanVideo ModelOpt exports use diffusers module names that differ from
+SGLang runtime names for fused QKV and fused QKV+MLP layers. Keep the
+diffusers-to-runtime mapping in `build_modelopt_fp8_transformer.py` in sync
+with `runtime/models/dits/hunyuanvideo.py` before trusting converted scale
+tensors.
+
 Qwen Image and Qwen Image Edit share `QwenImageTransformer2DModel`, so one
 ModelOpt FP8 fallback preset covers both. The validated Qwen Image fallback set
 keeps these modules in BF16:

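As a rough illustration of how the HunyuanVideo BF16 fallback list in this hunk could be applied, the sketch below rewrites each bullet as a regular expression and checks module names against it. The regex translation (for example, reading `*` in `transformer_blocks.*` as a block index) and the helper `keeps_bf16` are assumptions made for this example, not the converter's implementation, and the sample module names are illustrative only.

```python
import re

# Assumed regex form of the fallback bullets listed in this hunk.
HUNYUAN_VIDEO_BF16_PATTERNS = [
    r"context_embedder\..*",
    r"x_embedder\.proj",
    r"time_text_embed\.(timestep_embedder|guidance_embedder|text_embedder)\.linear_[12]",
    r"norm_out\.linear",
    r"proj_out",
    r"transformer_blocks\.\d+\.norm1\.linear",
    r"transformer_blocks\.\d+\.norm1_context\.linear",
    r"single_transformer_blocks\.\d+\.norm\.linear",
]

_COMPILED = [re.compile(p) for p in HUNYUAN_VIDEO_BF16_PATTERNS]


def keeps_bf16(module_name: str) -> bool:
    """Return True if the module name fully matches any fallback pattern."""
    return any(p.fullmatch(module_name) for p in _COMPILED)


# Illustrative module names, not read from a real checkpoint.
for name in [
    "transformer_blocks.3.norm1.linear",
    "single_transformer_blocks.0.norm.linear",
    "time_text_embed.text_embedder.linear_2",
    "transformer_blocks.3.attn.to_q",
]:
    print(name, keeps_bf16(name))
```

Using `fullmatch` instead of `search` keeps the `transformer_blocks` patterns from accidentally matching `single_transformer_blocks` names.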
@@ -477,9 +477,9 @@ def main():
     ServerArgs.add_cli_args(parser)
     BenchArgs.add_cli_args(parser)
 
-    args = parser.parse_args()
+    args, unknown_args = parser.parse_known_args()
 
-    server_args = ServerArgs.from_cli_args(args)
+    server_args = ServerArgs.from_cli_args(args, unknown_args)
     bench_args = BenchArgs.from_cli_args(args)
 
     set_global_server_args(server_args)

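For context on the switch from `parse_args()` to `parse_known_args()` in the hunk above: the standard-library behavior is that `parse_known_args()` returns a `(namespace, leftover)` pair instead of exiting on unrecognized flags. The sketch below shows only that `argparse` behavior; the flag names are made up for the example, and what `ServerArgs.from_cli_args` does with the leftover list is not shown in this diff.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model-path")

# "--not-registered" is a made-up flag that the parser does not define.
argv = ["--model-path", "some/model", "--not-registered", "1"]

# parse_args(argv) would exit with an "unrecognized arguments" error here.
# parse_known_args(argv) returns the leftovers instead.
args, unknown_args = parser.parse_known_args(argv)

print(args.model_path)  # some/model
print(unknown_args)     # ['--not-registered', '1']
```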
@@ -232,6 +232,7 @@ def __init__(
         skip_bias_add: bool = False,
         params_dtype: torch.dtype | None = None,
         quant_config: QuantizationConfig | None = None,
+        output_sizes: list[int] | None = None,
         prefix: str = "",
     ):
         super().__init__(
@@ -245,10 +246,11 @@ def __init__(
 
         # All the linear layer supports quant method.
         assert self.quant_method is not None
+        output_partition_sizes = output_sizes or [self.output_size]
         self.quant_method.create_weights(
             self,
             self.input_size,
-            [self.output_size],
+            output_partition_sizes,
             self.input_size,
             self.output_size,
             self.params_dtype,
@@ -497,7 +499,6 @@ def weight_loader(
         loaded_weight: torch.Tensor,
         loaded_shard_id: int | None = None,
     ) -> None:
-
         param_data = param.data
         output_dim = getattr(param, "output_dim", None)
         # Special case for AQLM codebooks.
@@ -829,7 +830,6 @@ def weight_loader(
         loaded_weight: torch.Tensor,
         loaded_shard_id: str | None = None,
     ):
-
         param_data = param.data
         output_dim = getattr(param, "output_dim", None)
         # Special case for AQLM codebooks.
@@ -866,7 +866,6 @@ def weight_loader(
             ]
 
             for shard_id, shard_offset, shard_size in shard_offsets:
-
                 loaded_weight_shard = loaded_weight.narrow(
                     output_dim, shard_offset, shard_size
                 )
@@ -1037,7 +1036,6 @@ def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
         param_data.copy_(loaded_weight)
 
     def weight_loader_v2(self, param: BasevLLMParameter, loaded_weight: torch.Tensor):
-
         # Special case for loading scales off disk, which often do not
         # have a shape (such as in the case of AutoFP8).
         if len(loaded_weight.shape) == 0: