Add HunyuanVideo ModelOpt FP8 diffusion support #23199
Merged
Force-pushed from 79095de to 8f424f4
/tag-and-rerun-ci
mickqian approved these changes on Apr 27, 2026
Updated this PR to use the new clean
# Conflicts:
#   docs/diffusion/quantization.md
#   docs_new/docs/sglang-diffusion/quantization.mdx
#   python/sglang/multimodal_gen/test/server/testcase_configs.py
# Conflicts:
#   docs/diffusion/quantization.md
#   docs_new/docs/sglang-diffusion/quantization.mdx
#   python/sglang/multimodal_gen/.claude/skills/sglang-diffusion-modelopt-quant/SKILL.md
#   python/sglang/multimodal_gen/test/server/gpu_cases.py
#   python/sglang/multimodal_gen/test/server/testcase_configs.py
#   python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request on May 12, 2026
Summary

Add HunyuanVideo ModelOpt FP8 diffusion support and publish the SGLang-native transformer override under the `lmsys` Hugging Face org.

Published FP8 weights

The repo is intentionally clean: `README.md`, `config.json`, and `.safetensors` shards only.

H100 Validation

Updated H100 validation used the `sglang-diffusion-benchmark-profile` HunyuanVideo command shape. This supersedes the earlier short-video validation.

Run setup:

- `CUDA_VISIBLE_DEVICES=0`
- `--backend=sglang`; logs show `Using pipeline from model_index.json: HunyuanVideoPipeline`, and no diffusers fallback markers were observed
- `hunyuanvideo-community/HunyuanVideo`
- A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window.
- `--text-encoder-cpu-offload --pin-cpu-memory --num-frames=65 --width=848 --height=480 --num-inference-steps=30 --save-output --warmup --enable-torch-compile --seed=42`
- `--fps=13`, so the output is exactly 65 frames / 13 fps = 5.000 s
- `--transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer`
- `--profile --num-profiled-timesteps=5 --no-save-output` and setting `SGLANG_DIFFUSION_TORCH_PROFILER_DIR`

Benchmark, warmup excluded:
Profiler kernel share, 5 profiled denoise timesteps. Profiler timings include profiling overhead and are not used as benchmark latency numbers.

- `cudaMemcpyAsync` 41.54%; FlashAttention 31.99%; BF16 GEMM `nvjet_tst_192x208_64x4_2x1_v_bz_coopB_bias_TNT` 9.77%; BF16 GEMM `nvjet_tst_192x208_64x4_1x2_h_bz_coopB_bias_TNT` 8.16%; BF16 GEMM `nvjet_tst_256x152_64x4_1x2_h_bz_coopA_bias_TNT` 2.11%
- `cudaMemcpyAsync` 40.62%; FlashAttention 36.80%; FP8 Cutlass GEMM 12.83%; `triton_poi_fused_cat_gelu_view_0` 1.93%; `_static_quant_fp8` 1.37%

Local Validation After B200 CI Update

- `git diff --check` -> passed
- `python3 -m py_compile python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py` -> passed
- `python3 -m black --check python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py` -> passed
- `python3 -m ruff check --select=F401,F821 python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py` -> passed

B200 CI

Added to `ONE_GPU_MODELOPT_CASES` for `multimodal-gen-test-1-b200`:

- `hunyuanvideo_modelopt_fp8_t2v`

One caveat from the FP8 log: the CLI keeps the same offload flags as the skill preset, but the ModelOpt FP8 runtime currently forces `dit_cpu_offload` off while preserving layerwise offload behavior for restored FP8 tensor strides.
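
The two profiler kernel-share rows listed under H100 Validation can be summed to see roughly how the GEMM portion moves between the runs. The sketch below is not part of the PR. The percentages are copied from those rows, while the row names `BF16_ROW` and `FP8_ROW` and the `share` helper are assumptions introduced for illustration (the labels are inferred from the GEMM kernel types each row reports). Each share is a fraction of that run's own profiled time, so the totals describe where time goes within a run and are not a latency comparison.

```python
from typing import Callable, Dict

# Kernel shares in percent, copied from the two profiler rows.
# Row names are assumed: the first row lists BF16 GEMMs, the second an FP8 GEMM.
BF16_ROW: Dict[str, float] = {
    "cudaMemcpyAsync": 41.54,
    "FlashAttention": 31.99,
    "nvjet_tst_192x208_64x4_2x1_v_bz_coopB_bias_TNT": 9.77,
    "nvjet_tst_192x208_64x4_1x2_h_bz_coopB_bias_TNT": 8.16,
    "nvjet_tst_256x152_64x4_1x2_h_bz_coopA_bias_TNT": 2.11,
}
FP8_ROW: Dict[str, float] = {
    "cudaMemcpyAsync": 40.62,
    "FlashAttention": 36.80,
    "FP8 Cutlass GEMM": 12.83,
    "triton_poi_fused_cat_gelu_view_0": 1.93,
    "_static_quant_fp8": 1.37,
}


def share(row: Dict[str, float], match: Callable[[str], bool]) -> float:
    """Sum the percentages of all kernels whose name satisfies `match`."""
    return round(sum(pct for name, pct in row.items() if match(name)), 2)


# GEMM share of the listed kernels in each row.
bf16_gemm = share(BF16_ROW, lambda name: name.startswith("nvjet_"))
fp8_gemm = share(FP8_ROW, lambda name: name == "FP8 Cutlass GEMM")

# FP8 GEMM plus the activation quantization kernel that feeds it.
fp8_gemm_plus_quant = share(
    FP8_ROW, lambda name: name in {"FP8 Cutlass GEMM", "_static_quant_fp8"}
)

# Coverage of the listed kernels; the remainder is kernels not shown in the rows.
bf16_listed = share(BF16_ROW, lambda name: True)
fp8_listed = share(FP8_ROW, lambda name: True)

print(f"BF16 GEMM share (listed kernels): {bf16_gemm:.2f}%")
print(f"FP8 GEMM share: {fp8_gemm:.2f}% (with quant: {fp8_gemm_plus_quant:.2f}%)")
print(f"Listed coverage: BF16 {bf16_listed:.2f}%, FP8 {fp8_listed:.2f}%")
```

Only the top listed kernels are included, so each row covers a little under 94% of profiled time; any GEMM kernels below that cutoff are not counted here.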