[codex] Optimize Z-Image packed QKV #24117
Merged
Collaborator
Author
|
/tag-and-rerun-ci |
mickqian
approved these changes
Apr 30, 2026
ltcs11
added a commit
to ltcs11/sglang
that referenced
this pull request
May 7, 2026
LLThomas
pushed a commit
to LLThomas/sglang
that referenced
this pull request
May 8, 2026
LucQueen
pushed a commit
to LucQueen/sglang
that referenced
this pull request
May 12, 2026
Summary
This PR enables the existing packed QKV projection path for Z-Image when loading the standard BF16 checkpoint, not only the Nunchaku quantized path.
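As a rough illustration of what a packed QKV projection means for a checkpoint that stores separate Q, K, and V weights, the sketch below concatenates the three weight matrices along the output dimension and checks that one merged projection reproduces the three separate ones. It is not the code from this PR: `pack_qkv_weights`, `linear`, and `split3` are hypothetical helpers, plain nested lists stand in for tensors, and Q, K, and V are assumed to have the same output size.

```python
# Hypothetical illustration only; none of these helpers come from sglang.
# Nested lists stand in for tensors with shape [out_features, in_features].


def pack_qkv_weights(state_dict, prefix="attention"):
    """Return a copy of state_dict with to_q/to_k/to_v merged into to_qkv."""
    packed = dict(state_dict)
    q = packed.pop(f"{prefix}.to_q.weight")
    k = packed.pop(f"{prefix}.to_k.weight")
    v = packed.pop(f"{prefix}.to_v.weight")
    # Stacking the rows concatenates along the output dimension,
    # so the merged weight has 3x as many output rows.
    packed[f"{prefix}.to_qkv.weight"] = q + k + v
    return packed


def linear(weight, x):
    """Bias-free linear layer: one dot product per output row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def split3(values):
    """Split a merged output into equal Q, K, V chunks (assumes equal sizes)."""
    n = len(values) // 3
    return values[:n], values[n:2 * n], values[2 * n:]


checkpoint = {
    "attention.to_q.weight": [[1.0, 0.0], [0.0, 1.0]],
    "attention.to_k.weight": [[2.0, 0.0], [0.0, 2.0]],
    "attention.to_v.weight": [[0.0, 1.0], [1.0, 0.0]],
}
x = [3.0, 4.0]

packed = pack_qkv_weights(checkpoint)
# One merged projection instead of three separate ones.
q, k, v = split3(linear(packed["attention.to_qkv.weight"], x))

# The merged path matches the separate projections.
assert q == linear(checkpoint["attention.to_q.weight"], x)
assert k == linear(checkpoint["attention.to_k.weight"], x)
assert v == linear(checkpoint["attention.to_v.weight"], x)
```

The merged form replaces three small matrix multiplications with one larger one, which is the usual motivation for packing.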
- Packs the attention.to_q/to_k/to_v.weight tensors into attention.to_qkv.weight during load.
- Enables ZImageAttention.to_qkv so the non-quantized path also uses one merged QKV projection.
Benchmark
Environment:
- radixark03, sglang-diffusion-bbuf
- CUDA_VISIBLE_DEVICES=7
- origin/main at 2d2be5d7b247626c3259fe33c145d0281db6ac4f
- […] 75d923c25
- […] Falling back to diffusers backend, Using diffusers backend, or Loaded diffusers pipeline lines.
Benchmark-only note: current origin/main hits the installed H200 FA3 ABI mismatch (flash_attn_varlen_func() got an unexpected keyword argument 'out'). I applied the same local FA3 compatibility shim to both benchmark worktrees only; that shim is intentionally not included in this PR.
Skill preset command used for both runs, with only the worktree and --label differing:
Expanded preset command:
compare_perf.py reports:
- […] 1560.94 ms -> 1073.93 ms, -487.01 ms (-31.2%)
- DenoisingStage: 1281.80 ms -> 828.80 ms, -453.00 ms (-35.3%)
Raw metric files on the H200 run:
- /tmp/sglang_zimage_pr_base/outputs/diffusion_benchmarks/zimage_zimage_qkv_pr_base_fa3shim_03.json
- /tmp/sglang_zimage_pr_tuned/outputs/diffusion_benchmarks/zimage_zimage_qkv_pr_tuned_fa3shim_03.json
Image Comparison
Baseline, generated by the same preset command:
Tuned, generated by the same preset command:
Fallback links:
Validation
- git diff --check
- python3 -m py_compile python/sglang/multimodal_gen/configs/models/dits/zimage.py python/sglang/multimodal_gen/runtime/models/dits/zimage.py
- bench_diffusion_denoise.py --model zimage baseline + tuned on H200
- compare_perf.py on the two benchmark JSON files
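The deltas quoted in the Benchmark section can be re-derived from the before and after timings. The sketch below is not compare_perf.py and assumes nothing about its JSON format; it only redoes the arithmetic on the four reported numbers. The `compare` helper is hypothetical, and the label of the first reported row is not preserved in the text, so it gets a placeholder name here.

```python
# Hypothetical re-derivation of the reported deltas; not the real compare_perf.py.


def compare(before_ms, after_ms):
    """Return (absolute delta in ms, relative delta in percent)."""
    delta = after_ms - before_ms
    pct = 100.0 * delta / before_ms
    return round(delta, 2), round(pct, 1)


# (baseline ms, tuned ms) as quoted in the Benchmark section.
reported = {
    "first_reported_row": (1560.94, 1073.93),  # placeholder label
    "DenoisingStage": (1281.80, 828.80),
}

results = {name: compare(b, a) for name, (b, a) in reported.items()}

for name, (before, after) in reported.items():
    delta, pct = results[name]
    print(f"{name}: {before:.2f} ms -> {after:.2f} ms, {delta:+.2f} ms ({pct:+.1f}%)")
```

Both rows come out at the figures stated in the PR: -487.01 ms (-31.2%) and -453.00 ms (-35.3%).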