Add HunyuanVideo ModelOpt FP8 diffusion support #23199

Merged
BBuf merged 15 commits into sgl-project:main from BBuf:codex/hunyuanvideo-modelopt-fp8
May 5, 2026

BBuf commented Apr 20, 2026

Summary

Add HunyuanVideo ModelOpt FP8 diffusion support and publish the SGLang-native transformer override under the lmsys Hugging Face org.

  • add HunyuanVideo ModelOpt FP8 runtime support
  • document the HunyuanVideo ModelOpt FP8 checkpoint flow in the diffusion quantization docs
  • update the diffusion ModelOpt quant skill with the HunyuanVideo FP8 path
  • add the HunyuanVideo ModelOpt FP8 case to the B200 diffusion CI set

Published FP8 weights

The repo is intentionally clean: README.md, config.json, and .safetensors shards only.

H100 Validation

The updated H100 validation uses the sglang-diffusion-benchmark-profile HunyuanVideo command shape and supersedes the earlier short-video validation.

Run setup:

  • Host/GPU: H100 rank0, CUDA_VISIBLE_DEVICES=0
  • Backend: --backend=sglang; logs show "Using pipeline from model_index.json: HunyuanVideoPipeline", and no diffusers fallback markers were observed
  • Model: hunyuanvideo-community/HunyuanVideo
  • Prompt: A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window.
  • Skill preset args preserved: --text-encoder-cpu-offload --pin-cpu-memory --num-frames=65 --width=848 --height=480 --num-inference-steps=30 --save-output --warmup --enable-torch-compile --seed=42
  • 5s adjustment: added --fps=13, so the output is exactly 65 frames / 13 fps = 5.000s
  • FP8 delta: same command plus --transformer-path lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer
  • Profiler delta: same generation settings, replacing save/perf output with --profile --num-profiled-timesteps=5 --no-save-output and setting SGLANG_DIFFUSION_TORCH_PROFILER_DIR
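
The three command variants in the run setup differ only in a few flags, which can be sketched as plain argument lists. This is a minimal illustration, not the script used for the validation: the flags are copied from the bullets above, while the helper names `fp8_args` and `profiler_args` are hypothetical. The launcher command, the model flag, and the prompt flag are not shown in this PR, so they are left out rather than guessed.

```python
# Values copied from the run-setup bullets. The launcher command, model flag,
# and prompt flag are not shown in the PR, so they are deliberately omitted.
NUM_FRAMES = 65
FPS = 13
FP8_TRANSFORMER = "lmsys/hunyuanvideo-modelopt-fp8-sglang-transformer"

BF16_ARGS = [
    "--backend=sglang",
    "--text-encoder-cpu-offload",
    "--pin-cpu-memory",
    f"--num-frames={NUM_FRAMES}",
    "--width=848",
    "--height=480",
    "--num-inference-steps=30",
    "--save-output",
    "--warmup",
    "--enable-torch-compile",
    "--seed=42",
    f"--fps={FPS}",
]


def fp8_args(base):
    """Hypothetical helper: the FP8 run is the same command plus the override."""
    return [*base, "--transformer-path", FP8_TRANSFORMER]


def profiler_args(base):
    """Hypothetical helper: swap saved output for profiling flags.

    The PR also mentions replacing "perf output", but does not name that flag,
    so only --save-output is dropped here.
    """
    kept = [arg for arg in base if arg != "--save-output"]
    return [*kept, "--profile", "--num-profiled-timesteps=5", "--no-save-output"]


# 65 frames played back at 13 fps gives the 5-second clip mentioned above.
clip_seconds = NUM_FRAMES / FPS
print(f"clip length: {clip_seconds:.3f}s")
print(" ".join(fp8_args(BF16_ARGS)))
```

Keeping the BF16 list as the single source of truth makes it obvious that the only difference between the two benchmark runs is the `--transformer-path` override.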

Benchmark, warmup excluded:

| Metric | BF16 | FP8 | Delta | Speedup |
| --- | --- | --- | --- | --- |
| E2E latency | 59.546 s | 54.748 s | -4.798 s (-8.1%) | 1.09x |
| Denoising stage | 42.542 s | 37.980 s | -4.562 s (-10.7%) | 1.12x |
| Avg denoise step | 1.4180 s | 1.2659 s | -0.1521 s | 1.12x |
| Decoding stage | 16.692 s | 16.458 s | -0.233 s (-1.4%) | 1.01x |
| Text encoding | 0.308 s | 0.306 s | -0.002 s (-0.7%) | 1.01x |
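
The Delta and Speedup columns follow directly from the two latency columns. The sketch below recomputes them from the rounded values in the table, assuming delta = FP8 - BF16 and speedup = BF16 / FP8; the function name `compare` is illustrative. Because the inputs are already rounded to three decimals, the last digit of the small rows (decoding, text encoding) can differ slightly from the table.

```python
# (BF16 seconds, FP8 seconds), copied from the benchmark table above.
TIMINGS = {
    "E2E latency": (59.546, 54.748),
    "Denoising stage": (42.542, 37.980),
    "Avg denoise step": (1.4180, 1.2659),
    "Decoding stage": (16.692, 16.458),
    "Text encoding": (0.308, 0.306),
}


def compare(bf16_s, fp8_s):
    """Return absolute delta, relative delta, and speedup of FP8 over BF16."""
    delta_s = fp8_s - bf16_s
    return {
        "delta_s": delta_s,
        "delta_pct": 100.0 * delta_s / bf16_s,
        "speedup": bf16_s / fp8_s,
    }


for name, (bf16_s, fp8_s) in TIMINGS.items():
    row = compare(bf16_s, fp8_s)
    print(
        f"{name:<18} {row['delta_s']:+.4f} s "
        f"({row['delta_pct']:+.1f}%) {row['speedup']:.2f}x"
    )

# Fraction of the end-to-end saving that comes from the denoising stage.
e2e = compare(*TIMINGS["E2E latency"])
denoise = compare(*TIMINGS["Denoising stage"])
denoise_share = denoise["delta_s"] / e2e["delta_s"]
print(f"denoising share of E2E saving: {denoise_share:.1%}")
```

On these numbers, roughly 95% of the end-to-end saving comes from the denoising stage, which is consistent with FP8 affecting only the transformer while decoding and text encoding stay essentially flat.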

Profiler kernel share, 5 profiled denoise timesteps. Profiler timings include profiling overhead and are not used as benchmark latency numbers.

| Precision | Total CUDA op time | Top CUDA/kernel shares |
| --- | --- | --- |
| BF16 | 17.055 s | cudaMemcpyAsync 41.54%; FlashAttention 31.99%; BF16 GEMM nvjet_tst_192x208_64x4_2x1_v_bz_coopB_bias_TNT 9.77%; BF16 GEMM nvjet_tst_192x208_64x4_1x2_h_bz_coopB_bias_TNT 8.16%; BF16 GEMM nvjet_tst_256x152_64x4_1x2_h_bz_coopA_bias_TNT 2.11% |
| FP8 | 15.324 s | cudaMemcpyAsync 40.62%; FlashAttention 36.80%; FP8 Cutlass GEMM 12.83%; triton_poi_fused_cat_gelu_view_0 1.93%; _static_quant_fp8 1.37% |
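
Percentage shares are hard to compare across runs with different totals, so it can help to convert them to approximate absolute times. The sketch below does that with the figures from the profiler table. It assumes each share is a fraction of that run's total CUDA op time, and it groups the three listed BF16 `nvjet` kernels under one "GEMM" label; GEMM kernels outside the listed top entries are not captured. As stated above, these timings include profiling overhead and are not latency numbers.

```python
# Totals and top shares copied from the profiler table above.
PROFILES = {
    "BF16": {
        "total_s": 17.055,
        "shares_pct": {
            "cudaMemcpyAsync": 41.54,
            "FlashAttention": 31.99,
            # Sum of the three listed BF16 nvjet GEMM kernels.
            "GEMM": 9.77 + 8.16 + 2.11,
        },
    },
    "FP8": {
        "total_s": 15.324,
        "shares_pct": {
            "cudaMemcpyAsync": 40.62,
            "FlashAttention": 36.80,
            "GEMM": 12.83,
            "_static_quant_fp8": 1.37,
        },
    },
}


def absolute_seconds(profile):
    """Convert percentage shares into seconds of the run's total CUDA op time."""
    total_s = profile["total_s"]
    return {
        name: total_s * pct / 100.0
        for name, pct in profile["shares_pct"].items()
    }


bf16 = absolute_seconds(PROFILES["BF16"])
fp8 = absolute_seconds(PROFILES["FP8"])

for name in ("cudaMemcpyAsync", "FlashAttention", "GEMM"):
    print(f"{name:<16} BF16 {bf16[name]:.2f} s   FP8 {fp8[name]:.2f} s")
print(f"{'_static_quant_fp8':<16} FP8 {fp8['_static_quant_fp8']:.2f} s")
```

Under these assumptions, the listed GEMM time drops from about 3.42 s to about 1.97 s, plus about 0.21 s of FP8 quantization, while FlashAttention stays near 5.5 s in both runs. That would explain why the FlashAttention share rises in the FP8 row even though attention itself is unchanged.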

Local Validation After B200 CI Update

  • git diff --check -> passed
  • python3 -m py_compile python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py -> passed
  • python3 -m black --check python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py -> passed
  • python3 -m ruff check --select=F401,F821 python/sglang/multimodal_gen/test/server/testcase_configs.py python/sglang/multimodal_gen/test/server/gpu_cases.py -> passed

B200 CI

Added to ONE_GPU_MODELOPT_CASES for multimodal-gen-test-1-b200:

  • hunyuanvideo_modelopt_fp8_t2v

One caveat from the FP8 log: the CLI keeps the same offload flags as the skill preset, but the ModelOpt FP8 runtime currently forces dit_cpu_offload off while preserving layerwise offload behavior for restored FP8 tensor strides.

github-actions bot added the documentation, quant, and diffusion labels on Apr 20, 2026
BBuf force-pushed the codex/hunyuanvideo-modelopt-fp8 branch from 79095de to 8f424f4 on April 20, 2026 02:56
BBuf requested a review from wisclmy0611 as a code owner on April 25, 2026 09:24

BBuf commented Apr 25, 2026

/tag-and-rerun-ci


BBuf commented Apr 25, 2026

/tag-and-rerun-ci


import diffusers
import numpy as np
import torch
duplicated contents?

BBuf requested a review from JustinTong0323 as a code owner on April 28, 2026 08:15

BBuf commented Apr 28, 2026

Updated this PR to use the new clean lmsys ModelOpt diffusion repo.


BBuf commented May 2, 2026

/tag-and-rerun-ci

BBuf added 2 commits May 2, 2026 21:12
# Conflicts:
#	docs/diffusion/quantization.md
#	docs_new/docs/sglang-diffusion/quantization.mdx
#	python/sglang/multimodal_gen/test/server/testcase_configs.py

BBuf commented May 3, 2026

/tag-and-rerun-ci

# Conflicts:
#	docs/diffusion/quantization.md
#	docs_new/docs/sglang-diffusion/quantization.mdx
#	python/sglang/multimodal_gen/.claude/skills/sglang-diffusion-modelopt-quant/SKILL.md
#	python/sglang/multimodal_gen/test/server/gpu_cases.py
#	python/sglang/multimodal_gen/test/server/testcase_configs.py
#	python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py

BBuf commented May 3, 2026

/tag-and-rerun-ci


BBuf commented May 4, 2026

/tag-and-rerun-ci


BBuf commented May 5, 2026

BBuf merged commit 8c703f2 into sgl-project:main on May 5, 2026
71 of 78 checks passed
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026
BBuf deleted the codex/hunyuanvideo-modelopt-fp8 branch on June 2, 2026 12:11

Labels

  • diffusion: SGLang Diffusion
  • documentation: Improvements or additions to documentation
  • quant: LLM Quantization
  • run-ci

2 participants